Abstract
Diverse image captioning has become a research hotspot in the field of image captioning. However, existing methods ignore the dependency between global and sequential latent vectors, which severely limits further gains in captioning performance. To address this problem, this paper proposes a diverse image captioning framework based on a hybrid variational Transformer. First, a hybrid conditional variational autoencoder over global and sequential latent variables is constructed to model the dependency between the global and sequential latent vectors. Second, the evidence lower bound of the hybrid model is derived by maximizing the conditional likelihood, and it serves as the objective function for diverse image captioning. Finally, the Transformer is seamlessly combined with the hybrid conditional variational autoencoder, and the two are jointly optimized to improve the generalization performance of diverse image captioning. Experimental results on the MSCOCO dataset show that, compared with state-of-the-art baseline methods, when 20 and 100 captions are randomly generated, the diversity metric m-BLEU (mutual overlap-BiLingual Evaluation Understudy) improves by 4.2% and 4.7%, respectively, while the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves by 4.4% and 15.2%, respectively.
Authors
刘兵
李穗
刘明明
刘浩
LIU Bing; LI Sui; LIU Ming-ming; LIU Hao (School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China; Ministry of Education Engineering Research Center of Mine Digitization, Xuzhou, Jiangsu 221116, China)
Source
《电子学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2024, Issue 4, pp. 1305-1314 (10 pages)
Acta Electronica Sinica
Funding
National Natural Science Foundation of China (No. 62276266, No. 61801198).
Keywords
image understanding
image captioning
variational autoencoding
latent embedding
multi-modal learning
generative model