Abstract
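Selecting feature words is one part of Chinese information preprocessing and has an important influence on document classification. After Chinese word segmentation, representing documents with a vector model built from feature words leads to sparse, high-dimensional feature vectors, which degrades the performance and accuracy of document classification. Building on an analysis and summary of several classical text feature selection methods, this paper implements a feature word selection method (DC) that is based primarily on document frequency and is corrected by the term frequency of feature words in the document collection and by their distribution. Using macro-F and micro-F values as evaluation metrics, comparative experiments show that the method selects features more effectively than the classical text feature selection methods.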
Abstract: Feature word selection from texts is a significant step in Chinese text information pre-processing. After the segmentation of Chinese texts, a vector model constructed from feature words to represent Chinese text documents cannot avoid low accuracy in document classification (or document retrieval) because of the sparseness and high dimensionality of the feature words. On the basis of an analysis of several classical text feature selection methods, a new text feature selection method (DC) is presented, which is based on a modified document frequency. Experiments show that DC performs better than other typical methods according to macro-F and micro-F values.
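The abstract names macro-F and micro-F values as the evaluation metrics but does not spell them out. The following is a minimal sketch of how the two averages are commonly computed for single-label classification, assuming F here means the balanced F1 measure; the function name `macro_micro_f1` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label, multi-class predictions.

    Assumption: "F value" means F1, the harmonic mean of precision and recall.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-class true positives, false positives and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes first, so every document weighs the same.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: four documents, two classes.
macro, micro = macro_micro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(macro, micro)
```

Macro-averaging is sensitive to performance on small classes, while micro-averaging is dominated by large ones, which is one reason to report both.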
Source
《中文信息学报》
CSCD
Peking University Core Journal
2015, Issue 4, pp. 120-125 (6 pages)
Journal of Chinese Information Processing
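Funding
National High-Tech Research and Development Program of China (2009AA062802)
National Natural Science Foundation of China (60473125)
China National Petroleum Corporation (CNPC) Petroleum Science and Technology Innovation Fund for Young and Middle-Aged Researchers (05E7013)
Sub-project of a National Major Special Project (G5800-08-ZS-WX)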
Keywords
Text document (文本文档)
Feature word (特征词)
Feature selection (特征选取)
Text classification (文本分类)