
Topic Recognition of News Reports with Imbalanced Contents (original title: 主题不平衡新闻文本数据集的主题识别方法研究). Cited by: 5
Abstract [Objective] Traditional LDA models identify topics inaccurately when the number of texts differs widely across topics in a news text dataset. This paper proposes a topic recognition method for such topic-imbalanced news datasets. [Methods] The method builds on the traditional LDA model and combines three feature detection methods (independence detection, variance detection, and information entropy detection) to identify the topics of texts. [Results] In experiments on a dataset of 10,000 news texts, the method improved recall by 0.2121, precision by 0.0407, and F1 by 0.1520 over the traditional LDA topic recognition method. [Limitations] News texts contain many new words, which lowers the accuracy of the word segmentation tool used in the experiments; because topic recognition depends on segmentation accuracy, its performance is affected. [Conclusions] The experiments show that the proposed method can, to some extent, mitigate the inaccurate topic recognition that LDA exhibits when the number of texts is imbalanced across topics in a news dataset.
Authors Wang Hongbin (王红斌); Wang Jianxiong (王健雄); Zhang Yafei (张亚飞); Yang Heng (杨恒) (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China; Yun Nan Wei Heng Ji Ye Co., Ltd., Kunming 650000, China)
Source Data Analysis and Knowledge Discovery (《数据分析与知识发现》), indexed in CSSCI, CSCD, and the Peking University Core Journals list; 2021, No. 3, pp. 109-120 (12 pages)
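Funding This work is one of the research outcomes of the National Natural Science Foundation of China (Grant Nos. 61966020, 61762056) and the Yunnan Major Science and Technology Special Project (Grant No. 2018ZF019).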
Keywords Topic Imbalanced; News Text Data Set; Topic Recognition; Latent Dirichlet Allocation (LDA)
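The abstract names variance detection and information entropy detection as feature checks but does not define them. As a purely illustrative sketch, assuming each check scores a word by how its probability is spread across LDA topics, the two statistics could be computed as follows. The function names and the toy probabilities are hypothetical and do not come from the paper; the independence check is not covered here.

```python
import math


def topic_variance(probs):
    """Population variance of a word's probability across topics."""
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / len(probs)


def topic_entropy(probs):
    """Shannon entropy (in bits) of a word's normalized distribution over topics."""
    total = sum(probs)
    norm = [p / total for p in probs if p > 0]
    return -sum(p * math.log2(p) for p in norm)


# Hypothetical P(word | topic) values for two words over four topics
# (assumed numbers for illustration, not data from the paper).
word_topic_probs = {
    "election": [0.90, 0.05, 0.03, 0.02],  # concentrated in one topic
    "today": [0.25, 0.25, 0.25, 0.25],     # spread evenly over all topics
}

for word, probs in word_topic_probs.items():
    print(word, round(topic_variance(probs), 4), round(topic_entropy(probs), 4))
```

Under this reading, a word concentrated in one topic has high variance and low entropy and so helps separate topics, while an evenly spread word carries little topic signal. How the paper actually combines such scores with LDA is not stated in the abstract.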
