Abstract
To address the problem that existing cross-modal entity resolution methods tend to ignore the high-level semantic correlations between cross-modal data, we propose a novel image-text cross-modal entity resolution method that integrates a global joint attention mechanism with a fine-grained joint attention mechanism. First, we map the cross-modal data into a common embedding space using a feature extraction network. Then, we combine the global joint attention mechanism and the fine-grained joint attention mechanism, enabling the model to learn both the global semantic features and the local fine-grained semantic features of the cross-modal data, thereby fully exploiting cross-modal semantic correlation and boosting the performance of cross-modal entity resolution. Experiments on the Flickr-30K and MS-COCO datasets show that our method improves the overall R@sum by 4.30% and 4.54%, respectively, over five state-of-the-art methods, demonstrating the superiority of the proposed approach.
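To make the described architecture concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of how an image-text pair could be scored by fusing a global similarity between pooled embeddings with a fine-grained region-to-word cross attention score; the feature dimensions, projection layers, and equal fusion weights are assumptions for illustration only.

```python
# Hypothetical sketch of global + fine-grained joint attention scoring.
# Dimensions (2048-d region features, 300-d word features, 512-d common
# space) and the equal fusion weights are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointAttentionSimilarity(nn.Module):
    """Scores an image-text pair by fusing global and fine-grained attention."""

    def __init__(self, dim=512):
        super().__init__()
        # Projections into a common embedding space.
        self.img_proj = nn.Linear(2048, dim)   # e.g. CNN region features
        self.txt_proj = nn.Linear(300, dim)    # e.g. word embeddings

    def forward(self, img_regions, txt_words):
        # img_regions: (R, 2048) region features; txt_words: (T, 300) word features
        v = F.normalize(self.img_proj(img_regions), dim=-1)   # (R, dim)
        w = F.normalize(self.txt_proj(txt_words), dim=-1)     # (T, dim)

        # Global joint attention: cosine similarity of mean-pooled embeddings.
        global_sim = F.cosine_similarity(v.mean(0, keepdim=True),
                                         w.mean(0, keepdim=True)).squeeze()

        # Fine-grained joint attention: each word attends over image regions,
        # then the word-level similarities are averaged.
        attn = torch.softmax(w @ v.t(), dim=-1)               # (T, R)
        attended = attn @ v                                   # (T, dim)
        local_sim = F.cosine_similarity(attended, w, dim=-1).mean()

        # Fuse the two levels of semantic correlation (equal weights assumed).
        return 0.5 * global_sim + 0.5 * local_sim


if __name__ == "__main__":
    model = JointAttentionSimilarity()
    score = model(torch.randn(36, 2048), torch.randn(12, 300))
    print(score.item())
```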
Authors
曾志贤
曹建军
翁年凤
袁震
余旭
ZENG Zhixian; CAO Jianjun; WENG Nianfeng; YUAN Zhen; YU Xu (The Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007, China)
Funding
the Special Research Fund of the China Postdoctoral Science Foundation (No. 2015M582832)
the Major National Science and Technology Program (No. 2015ZX01040201)
the National Natural Science Foundation of China (No. 61371196)