Abstract
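Feature selection is one of the key steps in text classification, and the quality of the selected feature subset directly affects the classification results. Term frequency and document frequency are first analyzed, and document frequency is optimized on that basis. Feature resolution is then proposed on top of the optimized document frequency and used for a preliminary selection of text features. Rough sets are subsequently introduced, and an attribute reduction algorithm based on the correlation matrix of equivalence classes is given to further eliminate redundant features. Simulation results show that the method has certain advantages in precision and recall as well as in time performance and average classification accuracy.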
Feature selection is one of the key steps in text categorization, and the selected feature subset directly influences the results of text categorization. First, word frequency and document frequency were analyzed, and an improved document frequency was proposed. Feature resolution was then presented based on the improved document frequency. Subsequently, rough sets were introduced into feature selection, and a new attribute reduction algorithm based on the correlation matrix of equivalence classes was provided. Finally, a new feature selection method was proposed that combines feature resolution with this attribute reduction algorithm. The method first uses feature resolution to select text features and filter out some terms, which reduces the sparsity of the text feature space, and then employs the attribute reduction algorithm to eliminate redundancy. The simulation results show that the proposed feature selection method has, to a certain extent, advantages in precision, recall, time performance, and average classification accuracy.
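The two-stage structure described in the abstract (a frequency-based filter followed by rough-set attribute reduction) can be illustrated with a minimal Python sketch. The abstract does not define feature resolution or the correlation-matrix algorithm, so both are replaced here by stand-ins that are assumptions, not the authors' method: stage 1 uses a plain document-frequency threshold, and stage 2 uses a generic greedy reduction that drops any attribute whose removal leaves the rough-set positive region unchanged. All function names and the parameters `min_df` and `max_ratio` are hypothetical.

```python
from collections import defaultdict


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = defaultdict(int)
    for tokens in docs:
        for term in set(tokens):
            df[term] += 1
    return dict(df)


def filter_by_df(docs, min_df=2, max_ratio=0.9):
    """Stage 1 stand-in (assumption): keep terms that are neither too rare
    nor present in almost every document."""
    df = document_frequency(docs)
    n = len(docs)
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_ratio)


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())


def positive_region_size(rows, labels, attrs):
    """Number of rows whose equivalence class carries exactly one label."""
    return sum(
        len(c)
        for c in partition(rows, attrs)
        if len({labels[i] for i in c}) == 1
    )


def reduce_attributes(rows, labels, attrs):
    """Stage 2 stand-in (assumption): greedily drop each attribute whose
    removal leaves the positive region size unchanged."""
    kept = list(attrs)
    target = positive_region_size(rows, labels, kept)
    for a in list(attrs):
        trial = [x for x in kept if x != a]
        if positive_region_size(rows, labels, trial) == target:
            kept = trial
    return kept


def select_features(docs, labels, min_df=2, max_ratio=0.9):
    """Run the frequency filter, binarize the documents, then reduce."""
    terms = filter_by_df(docs, min_df, max_ratio)
    rows = [{t: int(t in set(tokens)) for t in terms} for tokens in docs]
    return reduce_attributes(rows, labels, terms)
```

Here `docs` is a list of token lists and `labels` is the matching list of class labels. The greedy pass depends on attribute order and returns one reduct among possibly several, so it should be read only as an illustration of the filter-then-reduce idea.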
Source
《科学技术与工程》
Peking University Core Journal
2012, No. 34, pp. 9234-9237, 9242 (5 pages)
Science Technology and Engineering
Funding
Supported by a university-level research project of Aba Teachers College (ASB12-23)
Keywords
feature selection
text categorization
feature resolution
rough sets
correlation matrix