A Chinese Text Classifier Based on n-gram Language Model and Chain Augmented Naïve Bayesian Classifier (cited by: 16)
Abstract This paper proposes a Chinese text classification system that represents text with an n-gram language model and classifies it with a chain augmented naïve Bayesian classifier. It introduces how text is represented with an n-gram language model, explains the advantages of combining the chain augmented naïve Bayesian classifier with the n-gram language model, analyzes how to choose the parameters of the n-gram language model, and discusses several important issues of the classification system. The effect of the size and quality of the training set on classifier performance is also studied experimentally. The system is tested against the test standard, training set, and test set provided by the 863 Program text classification evaluation group, and the experimental results show that it classifies well.
Source Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2006, No. 3, pp. 29-35 (7 pages)
Funding National Natural Science Foundation of China (60475007)
Keywords computer application; Chinese information processing; Chinese text categorization; n-gram language model; chain augmented naïve Bayesian classifier
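
The approach summarized in the abstract can be illustrated roughly as follows: one character-level n-gram language model is trained per category, and a text is assigned to the category whose model, weighted by the category prior, gives it the highest probability. This record does not state the paper's model order, smoothing method, or data, so the sketch below assumes a bigram model (n = 2), add-one smoothing, and a four-sentence toy corpus. The class name CharBigramClassifier and its methods are hypothetical names chosen for illustration, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


class CharBigramClassifier:
    """Per-class character bigram models combined with class priors.

    Assumptions (not from the paper): n = 2, add-one smoothing.
    """

    def __init__(self):
        self.bigrams = defaultdict(Counter)   # class -> counts of (prev, cur)
        self.contexts = defaultdict(Counter)  # class -> counts of prev
        self.docs = Counter()                 # class -> number of training texts
        self.vocab = set()                    # all characters seen in training

    def train(self, text, label):
        chars = ["<s>"] + list(text)          # "<s>" marks the start of a text
        self.docs[label] += 1
        self.vocab.update(chars[1:])
        for prev, cur in zip(chars, chars[1:]):
            self.bigrams[label][(prev, cur)] += 1
            self.contexts[label][prev] += 1

    def log_score(self, text, label):
        # log P(class) + sum over i of log P(char_i | char_{i-1}, class)
        v = len(self.vocab) + 1               # one extra slot for unseen characters
        score = math.log(self.docs[label] / sum(self.docs.values()))
        chars = ["<s>"] + list(text)
        for prev, cur in zip(chars, chars[1:]):
            num = self.bigrams[label][(prev, cur)] + 1
            den = self.contexts[label][prev] + v
            score += math.log(num / den)
        return score

    def classify(self, text):
        return max(self.docs, key=lambda label: self.log_score(text, label))


# Toy corpus, invented for this example only.
clf = CharBigramClassifier()
clf.train("足球比赛今晚开始", "sports")
clf.train("篮球联赛决赛", "sports")
clf.train("股票市场今日上涨", "finance")
clf.train("银行利率下调", "finance")

print(clf.classify("足球联赛"))
```

Working on characters rather than words avoids a Chinese word segmentation step. Conditioning each character on its predecessor is what makes this a chain-style model rather than a plain naïve Bayes model, which treats features as independent given the class.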