Abstract
Short texts are very short, contain few keywords, and make insufficient use of label information, which leads to sparse features and ambiguous semantics during classification. To address these problems, a graph convolutional network model that incorporates label semantic embeddings is proposed. First, building on the conventional term frequency-inverse document frequency (TF-IDF) algorithm, a global term-weighting algorithm is proposed that takes into account the inter-class and intra-class distribution of each word across the corpus. Second, a label embedding method maps each training text and its corresponding label into the same feature space; filtering and aggregation in that space then extract the synonym embeddings that best highlight the text's category, and these serve as the embeddings of the document nodes in the text graph. Finally, the text graph is fed into a graph convolutional network, and the features it learns are fused with the contextual features extracted by a pre-trained model, improving both the classification quality for short texts and the generalization ability of the whole model. The proposed model is compared with 14 baseline algorithms on four short-text datasets: MR, web_snippets, R8 and R52. The experimental results show that the proposed model outperforms the compared models in classification accuracy, recall and F1 score.
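The final step of the abstract (graph convolution over the text graph, followed by fusion with pre-trained contextual features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the standard symmetric-normalized GCN propagation rule, a single layer, and fusion by simple concatenation, none of which the abstract specifies. The toy graph, the feature sizes, and the names `normalize_adjacency`, `gcn_layer` and `fuse` are all hypothetical, and the random `context_feats` matrix merely stands in for the output of a pre-trained model.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the normalization used by standard GCNs."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degrees (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, features, weights):
    """One graph-convolution layer: ReLU(A_norm @ H @ W)."""
    return np.maximum(a_norm @ features @ weights, 0.0)

def fuse(graph_feats, context_feats):
    """Assumed fusion: concatenate graph and contextual features per document."""
    return np.concatenate([graph_feats, context_feats], axis=1)

# Hypothetical toy text graph: nodes 0-1 are documents, nodes 2-3 are words.
rng = np.random.default_rng(0)
adj = np.array([[0, 0, 1, 1],
                [0, 0, 0, 1],
                [1, 0, 0, 1],
                [1, 1, 1, 0]], dtype=float)
node_feats = rng.normal(size=(4, 8))        # initial node embeddings
weights = rng.normal(size=(8, 5))           # layer weights (untrained)
context_feats = rng.normal(size=(2, 6))     # stand-in for pre-trained model output

a_norm = normalize_adjacency(adj)
hidden = gcn_layer(a_norm, node_feats, weights)
fused = fuse(hidden[:2], context_feats)     # keep only the document nodes
print(fused.shape)                          # (2, 11)
```

In a trained model the fused document vectors would feed a classifier; here the weights are random, so only the shapes and the data flow are meaningful.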
Authors
张灵
李荣臻
郑苏
Zhang Ling; Li Rong-zhen; Zheng Su (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China; College of Education, Ningxia University, Yinchuan 750001, China)
Source
《广东工业大学学报》
CAS
2024, No. 1, pp. 69-78 (10 pages)
Journal of Guangdong University of Technology
Funding
Science and Technology Project of the Department of Transport of Guangdong Province (科技-2016-02-030).
Keywords
short text
label semantics
feature space
graph convolutional network
pre-trained model