Abstract
This study explores multimodal understanding and reasoning for one-stage visual grounding. Existing one-stage methods extract visual feature maps and textual features separately, and then perform multimodal reasoning to predict the bounding box of the referred object. These methods suffer from two weaknesses. First, the pre-trained visual feature extractor introduces text-unrelated visual signals into the visual features, which hinders multimodal interaction. Second, the reasoning process in these methods lacks visual guidance for language modeling. Because of these shortcomings, the reasoning ability of existing one-stage methods is limited. We propose a low-level interaction that extracts text-related visual feature maps, and a high-level interaction that incorporates visual features to guide language modeling and further performs multi-step reasoning on the visual features. Based on the proposed interactions, we present a novel network architecture called the dual-path multilevel interaction network (DPMIN). Experiments on five commonly used visual grounding datasets demonstrate the superior performance of the proposed method and its real-time applicability.
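To make the two interactions concrete, below is a minimal, hypothetical PyTorch sketch of what a low-level (text-gated visual filtering) and a high-level (visually guided language update with multi-step reasoning) interaction could look like. The abstract does not specify the actual DPMIN design, so all module names, dimensions, the channel-gating scheme, and the cross-attention/GRU update are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two interactions described in the abstract.
# All design choices below are assumptions; the paper's DPMIN details differ.
import torch
import torch.nn as nn


class LowLevelInteraction(nn.Module):
    """Gates the visual feature map with a text embedding so that
    text-unrelated channels are suppressed (an assumed gating scheme)."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.gate = nn.Linear(txt_dim, vis_dim)

    def forward(self, vis_map: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis_map: (B, C, H, W); txt: (B, D)
        g = torch.sigmoid(self.gate(txt))        # (B, C) channel-wise gate
        return vis_map * g[:, :, None, None]     # keep text-related responses


class HighLevelInteraction(nn.Module):
    """Uses visual features to guide the language state, iterating a few
    reasoning steps over the visual map (assumed cross-attention + GRU)."""

    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.lang_update = nn.GRUCell(dim, dim)

    def forward(self, vis_map: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        B, C, H, W = vis_map.shape
        vis_tokens = vis_map.flatten(2).transpose(1, 2)  # (B, H*W, C)
        state = txt
        for _ in range(self.steps):
            # Attend from the language state to visual tokens, then update
            # the language state with the attended visual context.
            ctx, _ = self.attn(state.unsqueeze(1), vis_tokens, vis_tokens)
            state = self.lang_update(ctx.squeeze(1), state)
        return state  # reasoning state that a box-prediction head would consume


if __name__ == "__main__":
    low = LowLevelInteraction(vis_dim=256, txt_dim=256)
    high = HighLevelInteraction(dim=256)
    vis = torch.randn(2, 256, 20, 20)   # dummy visual feature map
    txt = torch.randn(2, 256)           # dummy sentence embedding
    state = high(low(vis, txt), txt)
    print(state.shape)                  # torch.Size([2, 256])
```

The sketch only illustrates the division of labor the abstract describes: the low-level path filters visual features by the text before fusion, while the high-level path lets visual evidence steer the language representation across several reasoning steps.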
Authors
WANG Yue (王月)
YE Jiabo (叶加博)
LIN Xin (林欣)
School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
Source
《华东师范大学学报(自然科学版)》
CAS
CSCD
PKU Core (北大核心)
2024, No. 2, pp. 65-75 (11 pages)
Journal of East China Normal University(Natural Science)
Keywords
visual grounding
multimodal understanding
multimodal reasoning
referring expressions