Abstract
Automatic text classification is the task of having a computer assign a document to the relevant categories of a predefined classification scheme on the basis of its content. To improve classification performance, this paper proposes a multi-level feature extraction method for Chinese text and a kernel-based distance-weighted KNN algorithm. The feature extraction method collects statistical features of a document at three levels (Chinese characters, a common-word list, and a domain-specific word list) and therefore reflects the statistical distribution of the documents more faithfully. The kernel-based distance-weighted KNN algorithm addresses multi-peak sample distributions, overlapping class boundaries, and the classifier's need for precise classification decisions. In practice, the Internet and text databases supply large numbers of coarsely pre-classified training texts, but sample quality is often poor; the paper handles this with sample importance analysis. An experimental system demonstrates the effectiveness of the new methods.
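The abstract names its two techniques but gives no formulas, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes substring counting for the two word lists, Euclidean distance between sparse count vectors, and a Gaussian kernel for the vote weights; the function names, the `bandwidth` parameter, and the toy data are all hypothetical. Sample importance analysis is not covered.

```python
import math
from collections import Counter, defaultdict


def multilevel_features(text, common_words, domain_words):
    """Count features at three levels: characters, common words, domain terms.

    Assumption: words are matched by plain substring counting; the paper's
    actual matching and weighting scheme is not given in the abstract.
    """
    feats = Counter()
    # Level 1: individual (non-whitespace) characters.
    for ch in text:
        if not ch.isspace():
            feats[("char", ch)] += 1
    # Levels 2 and 3: entries of the common-word and domain-word lists.
    for level, vocab in (("common", common_words), ("domain", domain_words)):
        for word in vocab:
            n = text.count(word)
            if n:
                feats[(level, word)] += n
    return feats


def euclidean(a, b):
    """Euclidean distance between two sparse feature dictionaries."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


def kernel_weighted_knn(query, samples, k=3, bandwidth=1.0):
    """Classify `query` from `samples`, a list of (features, label) pairs.

    Each of the k nearest neighbours votes with a Gaussian kernel weight
    exp(-d^2 / (2 * bandwidth^2)), so close neighbours count for more.
    The Gaussian kernel is an assumption made for this sketch.
    """
    scored = [(euclidean(query, feats), label) for feats, label in samples]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        votes[label] += math.exp(-(dist * dist) / (2 * bandwidth ** 2))
    return max(votes, key=votes.get)


if __name__ == "__main__":
    # Toy data: two "a" samples near the query, three "b" samples far away.
    samples = [
        ({"x": 0.0}, "a"), ({"x": 0.1}, "a"),
        ({"x": 5.0}, "b"), ({"x": 5.1}, "b"), ({"x": 5.2}, "b"),
    ]
    print(kernel_weighted_knn({"x": 0.2}, samples, k=5))
```

With `k=5` an unweighted majority vote on the toy data would pick "b" (three neighbours against two), whereas the kernel weights make the two nearby "a" samples dominate. This is the kind of behavior that distance weighting is meant to provide near overlapping class boundaries.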
Source
Journal of Chinese Information Processing (《中文信息学报》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2002, No. 6, pp. 18-24 (7 pages)
Funding
Major Special Project of the Chinese National Science Digital Library (CSDL2002-18)
Keywords
statistics
automatic text classification
multi-level feature extraction
kernel-based distance-weighted KNN algorithm
sample importance analysis
Chinese character recognition