Abstract
To improve the efficiency and effectiveness of text categorization and to reduce its computational complexity, this paper analyzes the classical feature selection methods and proposes a text feature selection method based on term weights. The method uses not only the number of documents in the dataset but also the weight of each index term within the documents, and constructs new evaluation functions that improve on information gain, expected cross entropy, and weight of evidence for text. A K-nearest neighbor (KNN) classifier was trained and tested on the standard Reuters-21578 dataset. Experimental results show that the method selects effective features and improves text categorization performance.
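The idea of combining document counts with term weights can be illustrated with a minimal sketch. The abstract does not give the paper's evaluation functions, so the code below is only an assumption for illustration: it computes classical information gain when each document contributes 0 or 1 for a term, and a hypothetical weighted variant when each document instead contributes a normalized term weight in [0, 1]. The function name `information_gain` and the normalization of weights are not taken from the paper.

```python
import math


def information_gain(presence, labels):
    """Information gain of one term with respect to the class labels.

    presence[i] is the mass with which the term occurs in document i.
    Values of 0/1 give the classical document-count estimate.
    Values in [0, 1] give a weighted variant (an illustrative
    assumption, not the evaluation function from the paper).
    """
    n = len(labels)
    p_t = sum(presence) / n  # estimate of P(t)
    ig = 0.0
    for c in sorted(set(labels)):
        p_c = sum(1 for y in labels if y == c) / n  # P(c)
        # Joint mass of "term present" and class c.
        p_tc = sum(w for w, y in zip(presence, labels) if y == c) / n
        # Joint mass of "term absent" and class c.
        p_ntc = p_c - p_tc
        for joint, marginal in ((p_tc, p_t), (p_ntc, 1.0 - p_t)):
            if joint > 0:  # skip empty cells; they contribute nothing
                ig += joint * math.log2(joint / (marginal * p_c))
    return ig


labels = ["a", "a", "b", "b"]
binary_score = information_gain([1, 1, 0, 0], labels)            # counts only
weighted_score = information_gain([0.9, 0.8, 0.1, 0.0], labels)  # term weights
```

With binary presence, a term that occurs in exactly the documents of one class receives the maximum score for a two-class problem. The weighted call shows how graded term weights change the score without changing the formula's structure.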
Source
《计算机工程与设计》
CSCD
Peking University Core Journals
2010, Issue 5, pp. 1149-1151 (3 pages)
Computer Engineering and Design
Funding
National Natural Science Foundation of China (Grant No. 60673186)
Keywords
text categorization
feature selection
term weight
information gain
expected cross entropy
weight of evidence for text