Abstract
Traditional topic model algorithms do not make full use of the semantic relationships between words or of their context, which leads to poor semantic consistency and poor interpretability of the extracted topics. To address this problem, a hot topic recognition method based on joint training of the LDA2Vec topic model, called NS-LDA2Vec, is proposed. The method extends the Skip-gram model: the initialized document vector and the pivot word vector are trained jointly to obtain a context vector, and this context vector is then used to predict the context words of the pivot word. Topic information is thereby embedded into both the word representations and the document representations. Training minimizes the sum of the negative sampling loss and the Dirichlet likelihood term incurred during prediction, which yields a more interpretable text representation. The results show that the F1 value obtained by the proposed method reaches up to 0.898. On the hot topic classification task, topic relevance improves by about 9% compared with the traditional LDA topic model, indicating that the method can effectively improve topic recognition.
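The objective described in the abstract can be illustrated with a minimal NumPy sketch for a single (pivot word, target word) pair. It follows the general LDA2Vec formulation and is not the authors' NS-LDA2Vec implementation. The function name `lda2vec_loss`, the softmax over per-document topic weights, and the parameters `alpha` (Dirichlet concentration) and `lam` (weight of the Dirichlet term) are assumptions introduced here for illustration only.

```python
import numpy as np


def softmax(x):
    """Map unnormalized topic weights to proportions that sum to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)


def lda2vec_loss(pivot_vec, doc_weights, topic_vecs, target_vec,
                 negative_vecs, alpha=0.7, lam=1.0):
    """Loss for one (pivot, target) pair; all names are hypothetical.

    pivot_vec:     (D,)   embedding of the pivot word
    doc_weights:   (K,)   unnormalized topic weights of the document
    topic_vecs:    (K, D) topic embeddings
    target_vec:    (D,)   embedding of the observed context word
    negative_vecs: (N, D) embeddings of the sampled negative words
    """
    # Document vector: mixture of topic vectors weighted by topic proportions.
    proportions = softmax(doc_weights)
    doc_vec = proportions @ topic_vecs

    # Context vector: pivot word vector plus document vector.
    context = pivot_vec + doc_vec

    # Skip-gram negative sampling loss (negated log-likelihood).
    neg_sampling = -(log_sigmoid(context @ target_vec)
                     + log_sigmoid(-(negative_vecs @ context)).sum())

    # Dirichlet term (negated log-likelihood of the topic proportions).
    dirichlet = -lam * (alpha - 1.0) * np.log(proportions).sum()

    return neg_sampling + dirichlet, proportions
```

In this sketch, a value of `alpha` below 1 makes the Dirichlet term favor sparse topic proportions, which is the usual motivation for treating each document as a mixture of only a few topics. A real training loop would compute gradients of this total with an automatic differentiation framework and update the word, topic, and document parameters jointly.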
Authors
XUE Tao (薛涛); GUO Ying (郭莹); HU Weihua (胡伟华)
School of Computer Science, Xi’an Polytechnic University, Xi’an 710048, China; School of Humanities and Social Science, Xi’an Polytechnic University, Xi’an 710048, China
Source
Journal of Xi’an Polytechnic University (《西安工程大学学报》)
CAS
2021, No. 4, pp. 95-101 (7 pages)
Funding
National Social Science Fund of China (18XYY010).