Abstract
[Objective] Traditional LDA models identify topics inaccurately when the number of texts per topic in a news dataset is imbalanced. This paper proposes a topic recognition method for such topic-imbalanced news datasets. [Methods] The method builds on the traditional LDA model and combines three feature detection methods (independence detection, variance detection, and information entropy detection) to identify the topic of each text. [Results] In experiments on a dataset of 10,000 news texts, the method improved recall by 0.2121, precision by 0.0407, and F1 by 0.1520 over the traditional LDA topic recognition method. [Limitations] News texts contain many new words, which lowers the accuracy of the word segmentation tool used in the experiments; because topic recognition depends on segmentation accuracy, its performance is affected. [Conclusions] The experiments show that the proposed method can, to some extent, mitigate the inaccurate topic recognition that LDA exhibits when the number of texts per topic in a news dataset is imbalanced.
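The abstract names information entropy detection as one of the three feature detection methods but does not define it. As a rough illustration only, the sketch below computes the Shannon entropy of a word's distribution over topics; reading the paper's method this way is an assumption, and the function name `topic_entropy` and the counts are hypothetical.

```python
import math


def topic_entropy(word_topic_counts):
    """Shannon entropy (in bits) of one word's distribution over topics.

    Assumption: word_topic_counts[i] is how often the word occurs in
    texts assigned to topic i. This is an illustrative reading, not the
    definition used in the paper.
    """
    total = sum(word_topic_counts)
    if total == 0:
        raise ValueError("word does not occur in any topic")
    entropy = 0.0
    for count in word_topic_counts:
        if count > 0:  # terms with zero probability contribute nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts of a word across four topics.
concentrated = topic_entropy([40, 0, 0, 0])  # word tied to a single topic
spread = topic_entropy([10, 10, 10, 10])     # word spread evenly over topics
```

Under this reading, a word with low entropy is concentrated in few topics and so is more discriminative, while a word with high entropy appears across many topics and says little about which topic a text belongs to.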
Authors
Wang Hongbin (王红斌)
Wang Jianxiong (王健雄)
Zhang Yafei (张亚飞)
Yang Heng (杨恒)
Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China; Yun Nan Wei Heng Ji Ye Co., Ltd., Kunming 650000, China
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
Indexed in: CSSCI, CSCD, Peking University Core Journals (北大核心)
2021, No. 3, pp. 109-120 (12 pages)
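Funding
This work is one of the research outputs of the National Natural Science Foundation of China (Grant Nos. 61966020, 61762056) and the Major Science and Technology Special Project of Yunnan Province (Grant No. 2018ZF019).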