Abstract
To address the difficulty of semantically representing long news texts, an enhanced feature representation is constructed from Doc2Vec document embeddings and weighted word vectors. The DV-sim and DV-tfidf methods extract enhanced features from words of specific parts of speech in the head and tail of each document, and each is combined with the Doc2Vec document vector to form a new global representation. DV-sim takes a semantic view and derives word weights from the similarity between feature words and the Doc2Vec vector; DV-tfidf takes a word-frequency view and derives word weights from term frequency-inverse document frequency (TF-IDF). The HDBSCAN algorithm is then used for topic clustering on the THUCNews and Sogou datasets. Compared with using the Doc2Vec vector directly, DV-sim reduces the number of noise points on the two datasets by 60.82% and 60.63%, improves accuracy by 12.14% and 20.58%, and improves the F1-score by 15.61% and 11.58%, respectively; DV-tfidf reduces the number of noise points by 15.20% and 59.55%, improves accuracy by 10.85% and 17.93%, and improves the F1-score by 15.60% and 9.21%, respectively. The experiments show that both DV-sim and DV-tfidf improve topic clustering performance and that the semantics-based enhanced features outperform the frequency-based ones. DV-sim has also been applied effectively to topic clustering of news reports on outstanding women.
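The two weighting schemes described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the enhanced feature is combined with the Doc2Vec vector, how the weights are normalized, or how negative similarities are handled; concatenation, weight-sum normalization, and clipping at zero are assumptions made here for illustration. The function names `dv_sim_features` and `dv_tfidf_features` are hypothetical, and the inputs are assumed to be precomputed vectors (for example from a Doc2Vec model) for words already filtered by position and part of speech.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_mean(vectors, weights):
    """Weighted average of vectors; returns a zero vector if all weights are 0."""
    dim = len(vectors[0])
    total = sum(weights)
    if total == 0:
        return [0.0] * dim
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def dv_sim_features(doc_vec, word_vecs):
    """DV-sim-style sketch: weight each feature word by its similarity to the
    document vector, then append the weighted word average to the document
    vector. Clipping negative similarities to 0 is an assumption."""
    weights = [max(cosine(doc_vec, wv), 0.0) for wv in word_vecs]
    return list(doc_vec) + weighted_mean(word_vecs, weights)


def dv_tfidf_features(doc_vec, word_vecs, tfidf_weights):
    """DV-tfidf-style sketch: same combination, but the word weights are
    TF-IDF scores supplied by the caller."""
    return list(doc_vec) + weighted_mean(word_vecs, tfidf_weights)


# Toy 2-dimensional example with two feature words.
doc = [1.0, 0.0]
words = [[1.0, 0.0], [0.0, 1.0]]
sim_repr = dv_sim_features(doc, words)
tfidf_repr = dv_tfidf_features(doc, words, [1.0, 3.0])
```

In this sketch the combined vector has twice the dimension of the Doc2Vec vector, and it is this combined vector that would be passed to a clustering algorithm such as HDBSCAN.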
Author
CHEN Jie (陈洁), School of Data Science and Information Technology, China Women’s University, Beijing 100101, China
Source
Computer Science (《计算机科学》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2023, Issue S01, pp. 211-216 (6 pages)
Funding
Research Fund of China Women's University (ZKY200020228).