Combine Visual Features and Scene Semantics for Image Captioning (cited 26 times)
Abstract: Most existing image captioning methods use only the visual information of the image to guide caption generation, lacking the guidance of effective scene semantic information. In addition, current visual attention mechanisms cannot effectively adjust the focus intensity on the image. To solve these problems, this paper first proposes an improved visual attention model, which introduces a focus intensity coefficient to adjust the attention intensity automatically. Specifically, the focus intensity coefficient of the attention mechanism is a learnable scaling factor, calculated from the image information and the context information of the model at each time step of the language-model decoding procedure. When the attention mechanism computes the attention weight distribution over the image, the "soft" or "hard" intensity of attention is adjusted automatically by adaptively scaling the input of the softmax function with the focus intensity coefficient, which achieves either concentration or dispersion of the visual attention. The proposed attention model therefore makes the extracted visual information more accurate.

Furthermore, we combine unsupervised and supervised learning methods to extract a series of topic words related to the image scene, which represent the scene semantic information of the image and are added to the language model to guide caption generation. We assume that each image contains several scene topic concepts, and that each topic concept can be represented by some topic words. Specifically, we use the latent Dirichlet allocation (LDA) model to cluster all the caption texts in the dataset, and the topic category of a caption text is used to represent the scene category of the corresponding image. We then train a multi-layer perceptron (MLP) to classify images into topic concepts. As a result, each topic category is represented by a series of topic words obtained from clustering, and the scene semantic information of each image can be represented by these topic words, which are highly relevant to the image scene. Adding these topic words gives the language model more prior knowledge. Since the topic information of the image scene is obtained by analyzing the captions, it carries global information about the captions to be generated, so our model can predict important words that suit the image scene.

Finally, we use the attention mechanism to determine the visual information of the image and the semantic information of the scene that the model attends to at each decoding time step, and use a gating mechanism to control the proportions in which these two kinds of information are input. The two are then combined to guide the model to generate more accurate and scene-specific captions. We evaluate our model on two standard datasets, MSCOCO and Flickr30k. The experimental results show that our approach generates more accurate captions than many state-of-the-art approaches and, compared with the baseline approach, achieves about a 3% improvement on the overall evaluation metrics.
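The focus intensity coefficient described in the abstract amounts to a learned, input-dependent temperature on the attention softmax. The PyTorch sketch below illustrates the idea over additive attention; the layer sizes, the softplus activation, and the use of a mean-pooled image summary when computing the coefficient are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusIntensityAttention(nn.Module):
    """Additive visual attention with a learnable focus intensity
    coefficient beta_t that scales the softmax input (a sketch, not
    the paper's exact design)."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        # beta_t is computed from the decoder state plus a global image
        # summary, following the abstract's "image information and
        # context information at each time step".
        self.intensity = nn.Linear(hidden_dim + feat_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                          # (batch, regions)
        g = torch.cat([hidden, feats.mean(dim=1)], dim=-1)
        beta = F.softplus(self.intensity(g))    # positive coefficient per sample
        # Large beta sharpens attention toward "hard" selection of one
        # region; small beta flattens it toward uniform "soft" attention.
        alpha = F.softmax(beta * e, dim=-1)
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)
        return context, alpha
```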
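The scene-semantics pipeline clusters caption texts with LDA and transfers each caption's dominant topic to its paired image, which then supervises an MLP image-to-topic classifier. A toy scikit-learn sketch of the caption-side clustering follows; the captions, topic count, and top-word cutoff are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy captions standing in for the dataset's caption corpus.
captions = [
    "a man riding a wave on a surfboard in the ocean",
    "a group of people skiing down a snowy mountain",
    "a plate of pasta with vegetables on a wooden table",
]

# Bag-of-words counts over the caption texts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(captions)

# Cluster captions into topic (scene) categories with LDA.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each topic is represented by its highest-weight words; these serve
# as the topic words describing a scene category.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")

# The dominant topic of each caption labels the corresponding image,
# yielding (image, topic) pairs on which an MLP classifier is trained.
labels = doc_topics.argmax(axis=1)
```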
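At each decoding step, the attended visual context and the attended topic-word context are merged under a gate before reaching the word predictor. The sketch below shows one plausible sigmoid gate, assuming both contexts share the same feature dimension; the paper's exact fusion may differ.

```python
import torch
import torch.nn as nn

class VisualSemanticGate(nn.Module):
    """Sketch of a gating step that mixes the attended visual context
    with the attended scene-semantic context (illustrative, not the
    paper's exact fusion)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual_ctx, semantic_ctx):
        # g in (0, 1) decides, per dimension, how much visual versus
        # semantic information feeds the language model at this step.
        g = torch.sigmoid(self.gate(torch.cat([visual_ctx, semantic_ctx], dim=-1)))
        return g * visual_ctx + (1.0 - g) * semantic_ctx
```

A decoder would feed this gated context, together with the previous word embedding, into its recurrent step to predict the next caption word.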
Authors: LI Zhi-Xin, WEI Hai-Yang, HUANG Fei-Cheng, ZHANG Can-Long, MA Hui-Fang, SHI Zhong-Zhi (Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin 541004; College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
Published in: Chinese Journal of Computers (《计算机学报》), 2020, No. 9, pp. 1624-1640 (17 pages). Indexed in EI, CSCD, and the Peking University Core list.
Funding: Supported by the National Natural Science Foundation of China (61966004, 61663004, 61866004, 61762078), the Natural Science Foundation of Guangxi (2019GXNSFDA245018, 2018GXNSFDA281009, 2017GXNSFAA198365), and the Guangxi Key Lab of Multi-source Information Mining & Security (16-A-03-02, MIMS18-08, MIMS19-02).
Keywords: image captioning; attention mechanism; scene semantics; encoder-decoder framework; reinforcement learning