Abstract
In recent years, the latent Dirichlet allocation (LDA) topic model has offered a new approach to short-text similarity computation by representing texts through their latent semantic topics. Because short texts have sparse features, however, similarity computed from LDA alone tends to be inaccurate. To address this problem, this paper proposes an LDA-based short-text similarity algorithm that fuses multiple features. The method combines a topic similarity factor, ST (Similarity Topic), with a word co-occurrence factor, CW (Co-occurrence Words), in a joint similarity model that specifies how CW constrains or supplements ST within different ST intervals, yielding a more accurate similarity score. Text clustering experiments show that the improved algorithm achieves some gain in F-measure.
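The abstract does not give the formulas, so the following is only a minimal sketch of how a joint ST/CW score of this kind might be put together. Everything specific in it is an assumption made for illustration, not the authors' method: ST is taken as the cosine similarity of two LDA topic distributions, CW as the Jaccard overlap of the two word sets, and the interval thresholds `low` and `high`, the mixing `weight`, and the three combination rules are hypothetical.

```python
import math


def topic_similarity(p, q):
    """ST (assumed form): cosine similarity of two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def cooccurrence(tokens_a, tokens_b):
    """CW (assumed form): Jaccard overlap of the two texts' word sets."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def joint_similarity(st, cw, low=0.3, high=0.7, weight=0.5):
    """Combine ST and CW by ST interval (all thresholds are hypothetical)."""
    if st >= high:
        # Constraint: discount a high topic similarity when few words co-occur.
        return st * (1 - weight * (1 - cw))
    if st <= low:
        # Supplement: raise a low topic similarity when words do co-occur.
        return st + weight * cw * (1 - st)
    # Middle interval: plain weighted average of the two factors.
    return (1 - weight) * st + weight * cw


# Usage: topic vectors would normally come from a trained LDA model.
st = topic_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
cw = cooccurrence(["cheap", "flight", "deal"], ["flight", "deal", "today"])
score = joint_similarity(st, cw)
```

With inputs in [0, 1], each branch of the sketch returns a value in [0, 1], so the score can be passed to a clustering step in place of a plain LDA similarity.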
Authors
ZHANG Xiao-chuan (张小川)
YU Lin-feng (余林峰)
ZHANG Yi-hao (张宜浩)
College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 401320, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2018, No. 9, pp. 266-270 (5 pages)
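Funding
National Natural Science Foundation of China (60443004)
Chongqing Major Science and Technology Project (cstc2013jcsf-jcssX0020)
Chongqing Basic Science and Frontier Technology Research Program (cstc2015jcyjA40041)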