Abstract
This paper proposes a Chinese text categorization system that represents text with an n-gram language model and classifies it with a chain augmented naive Bayesian classifier. It introduces how a text is represented through an n-gram language model, explains the advantages of combining the n-gram language model with the chain augmented naive Bayesian classifier, analyzes how to choose the parameters of the n-gram language model, and discusses several important issues of the categorization system. The effect of the size and quality of the training set on classifier performance is also studied experimentally. The system is tested with the test standard, training set, and test set provided by the 863 Program text categorization evaluation group, and the experimental results show that it classifies well.
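The combination described in the abstract can be illustrated with a minimal sketch: one character n-gram language model is trained per category, and a text is assigned to the category that maximizes the log prior plus the n-gram log-likelihood. Everything specific here is an assumption made for illustration and is not taken from the paper: the class name CharNgramClassifier, the bigram default (n=2), the add-alpha smoothing, and the toy training data. The paper's own parameter choices and smoothing method are not stated in this record.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """Hypothetical sketch: one character n-gram model per class plus class priors."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n            # n-gram order (assumed; bigram by default)
        self.alpha = alpha    # add-alpha smoothing constant (assumed)
        self.ngrams = defaultdict(Counter)    # class -> n-gram counts
        self.contexts = defaultdict(Counter)  # class -> (n-1)-gram context counts
        self.docs = Counter()                 # class -> number of training documents
        self.vocab = set()                    # characters seen in training

    def _pad(self, text):
        # Prefix with start symbols so the first characters have a context.
        return "\x02" * (self.n - 1) + text

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.docs[label] += 1
            self.vocab.update(text)
            padded = self._pad(text)
            for i in range(len(text)):
                gram = padded[i:i + self.n]   # n characters ending at text[i]
                self.ngrams[label][gram] += 1
                self.contexts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        # log P(class) + sum of log P(char | previous n-1 chars, class)
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        score = math.log(self.docs[label] / sum(self.docs.values()))
        padded = self._pad(text)
        for i in range(len(text)):
            gram = padded[i:i + self.n]
            num = self.ngrams[label][gram] + self.alpha
            den = self.contexts[label][gram[:-1]] + self.alpha * v
            score += math.log(num / den)
        return score

    def predict(self, text):
        # Pick the class whose language model explains the text best.
        return max(self.docs, key=lambda c: self.log_score(text, c))


# Toy usage with made-up data (not the 863 evaluation corpus).
train_texts = ["足球比赛进球", "篮球比赛得分", "股票市场上涨", "银行利率下调"]
train_labels = ["sports", "sports", "finance", "finance"]
clf = CharNgramClassifier(n=2).fit(train_texts, train_labels)
print(clf.predict("足球得分"))
```

Working on characters rather than words avoids a separate word segmentation step for Chinese, which is one commonly cited reason for pairing n-gram language models with this classifier. Whether the paper uses character or word n-grams is not stated in this record.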
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2006, No. 3, pp. 29-35 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (60475007)
Keywords
computer application
Chinese information processing
Chinese text categorization
n-gram language model
chain augmented naive Bayesian classifier