A Probabilistic Topic Model Based Noise Processing Method for Text Classification (cited by: 9)
Abstract The quality of the training corpus directly determines the results of text classification. In practical applications, building a training set inevitably introduces noisy samples, which degrade the real-world performance of text classification methods. To address this noise problem, the paper proposes a noise processing method based on a probabilistic topic model. The method first computes the class entropy of every sample in the training set and filters noisy samples according to that entropy. It then smooths the data using the generative process of the topic model to further weaken the influence of noisy samples. The approach reduces the effect of noisy samples on classification results while preserving the original size of the training set. Experiments on real-world data show that the method is robust to the distribution of noisy samples and maintains relatively good classification performance even when the noise ratio is high.
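Source Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2010, No. 7, pp. 89-92, 119 (5 pages)
Funding National Natural Science Foundation of China (60775037); National 863 Program of China (2009AA01Z123)
Keywords noisy data; text classification; probabilistic topic model; class entropy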
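
The class-entropy filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact definition of class entropy, so this example assumes each training sample already has an estimated class distribution (for instance, from a topic model or any probabilistic classifier) and uses the Shannon entropy of that distribution. The function names `class_entropy` and `split_by_entropy` and the threshold value are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def class_entropy(class_probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one sample's class distribution.

    Assumption: class_probs holds non-negative scores over the classes;
    they are normalized here so they sum to 1.
    """
    total = sum(class_probs)
    if total <= 0:
        raise ValueError("class_probs must have a positive sum")
    entropy = 0.0
    for p in class_probs:
        q = p / total
        if q > 0:  # treat 0 * log(0) as 0
            entropy -= q * math.log(q)
    return entropy


def split_by_entropy(
    samples: Sequence[Sequence[float]], threshold: float
) -> Tuple[List[int], List[int]]:
    """Return (kept, flagged) sample indices.

    A sample whose class entropy exceeds the (hypothetical) threshold is
    flagged as a likely noisy sample; the rest are kept.
    """
    kept: List[int] = []
    flagged: List[int] = []
    for index, probs in enumerate(samples):
        if class_entropy(probs) > threshold:
            flagged.append(index)
        else:
            kept.append(index)
    return kept, flagged


if __name__ == "__main__":
    # Toy class distributions for three samples over two classes.
    toy = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
    kept, flagged = split_by_entropy(toy, threshold=0.5)
    print("kept:", kept, "flagged:", flagged)
```

A sample whose probability mass is spread evenly across classes has high entropy and is treated here as a candidate noisy sample, while a sample concentrated on one class has entropy near zero. How the flagged samples are subsequently smoothed with the topic model, so that the training set keeps its original size, is not specified in the abstract and is left out of this sketch.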

