Abstract
Feature selection is one of the key steps in text categorization, and the quality of the selected feature subset directly affects the categorization results. This paper analyzes the word frequency and document frequency methods, summarizes their deficiencies, and presents an improved document frequency method. It then introduces rough set (RS) theory and proposes an attribute reduction algorithm based on the discernibility object pair set. Finally, it proposes a new feature selection method that uses the improved document frequency to pre-select features and the proposed attribute reduction algorithm to eliminate redundancy. Simulation results show that the new feature selection method performs well.
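The two-stage idea in the abstract (a frequency-based pre-selection followed by redundancy elimination) can be illustrated with a minimal sketch. The abstract does not give the improved document frequency formula or the details of the attribute reduction algorithm, so everything below is an assumption for illustration only: the pre-selection uses plain document frequency with hypothetical thresholds `min_df` and `max_df_ratio`, and the reduction step is a generic greedy cover over pairs of differently labelled documents, with term presence or absence as the attribute value. The function names and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df


def select_by_df(docs, min_df=2, max_df_ratio=0.9):
    """Stage 1 (assumed plain DF, not the paper's improved variant):
    keep terms that are neither too rare nor too common."""
    df = document_frequency(docs)
    upper = max_df_ratio * len(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= upper)


def greedy_reduct(docs, labels, features):
    """Stage 2 (generic greedy sketch, not the paper's algorithm):
    pick features until every discernible pair of differently
    labelled documents is told apart by some chosen feature."""
    doc_sets = [set(d) for d in docs]
    # Pairs of documents that belong to different classes.
    pairs = {
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if labels[i] != labels[j]
    }
    # A feature discerns a pair when it occurs in exactly one of the two.
    discerns = {
        f: {(i, j) for (i, j) in pairs if (f in doc_sets[i]) != (f in doc_sets[j])}
        for f in features
    }
    uncovered = set().union(*discerns.values())
    chosen = []
    while uncovered:
        # Take the feature that discerns the most still-uncovered pairs.
        best = max(features, key=lambda f: len(discerns[f] & uncovered))
        chosen.append(best)
        uncovered -= discerns[best]
    return chosen


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["stock", "market", "price", "the"],
    ["stock", "price", "rise", "the"],
    ["match", "team", "goal", "the"],
    ["team", "goal", "win", "the"],
]
labels = ["finance", "finance", "sport", "sport"]

candidates = select_by_df(docs)                  # ['goal', 'price', 'stock', 'team']
reduct = greedy_reduct(docs, labels, candidates)  # ['goal']
```

On this toy corpus the pre-selection drops terms that occur in only one document as well as the term that occurs in all of them, and the greedy step keeps a single term because it already separates every finance document from every sport document. A greedy cover does not guarantee a minimal reduct; it is used here only because it is short and easy to follow.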
Source
《计算机工程与设计》
CSCD
PKU Core (Peking University Core Journals)
2010, No. 3, pp. 622-625 (4 pages)
Computer Engineering and Design
Funding
Sichuan Province Science and Technology Program Fund Project (2008GZ0003)
Keywords
feature selection
text categorization
document frequency
discernibility object pair set
attribute reduction