Abstract
To address the rapid growth of Web image data, which makes such data hard to manage, and the time and effort consumed by manual annotation, this paper proposes a method for batch annotation of Web images that uses only their surrounding text. First, the text associated with each image is preprocessed by word segmentation, stop-word removal, and word vectorization. Next, the texts are clustered with the affinity propagation algorithm, keywords are extracted from each document with TF-IDF, and a candidate word dictionary is built. The similarities between candidate words and keywords, between candidate words and documents, and between candidate words and clusters are then defined and computed. Finally, the candidate words with the largest similarities are selected as the annotations of each image cluster. Experimental results show that, on a self-built data set, the surrounding-text-based annotation algorithm reaches an annotation accuracy of 88% and a macro-F1 of 49%, which meets the expected goals and improves annotation efficiency.
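The TF-IDF keyword extraction and candidate dictionary step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give their TF-IDF variant, the number of keywords kept per document, or the similarity definitions, so the function names `tfidf_keywords` and `build_candidate_dictionary`, the `top_k` parameter, the plain `tf * log(N / df)` weighting, and the toy documents are all assumptions made for illustration. The input is assumed to be already segmented with stop words removed.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=2):
    """Return the top_k highest-scoring TF-IDF terms of each tokenized document.

    Assumed weighting (not taken from the paper): tf * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        if not doc:  # an empty document has no keywords
            keywords.append([])
            continue
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


def build_candidate_dictionary(keywords_per_doc):
    """Merge the per-document keywords into one sorted candidate word list."""
    return sorted({term for kws in keywords_per_doc for term in kws})


# Hypothetical surrounding texts, already segmented and stop-word filtered.
docs = [
    ["cat", "kitten", "pet", "photo"],
    ["cat", "pet", "cute", "photo"],
    ["bridge", "river", "city", "photo"],
]
keywords = tfidf_keywords(docs)
candidates = build_candidate_dictionary(keywords)
print(keywords)
print(candidates)
```

In this toy input, "photo" occurs in every document, so its inverse document frequency is zero and it never becomes a keyword. In the method described above, the resulting candidate words would then be scored against keywords, documents, and clusters to pick the final tags.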
Authors
郭蕾蕾
俞璐
段国仑
陶性留
Guo Leilei; Yu Lu; Duan Guolun; Tao Xingliu (Institute of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China; Institute of Command Control Engineering, Army Engineering University of PLA, Nanjing 210007, China)
Source
《信息技术与网络安全》
2018, Issue 9, pp. 70-75 (6 pages)
Information Technology and Network Security
Keywords
image annotation
text clustering
surrounding text
similarity measure
keyword extraction