Abstract
Feature selection is a core problem in text categorization. This paper first proposes a document frequency method based on minimum word frequency. It then introduces rough sets (RS) and Tabu search, analyzes the problems that arise when Tabu search is applied to attribute reduction, and gives corresponding solutions; on this basis, an attribute reduction method based on an optimized Tabu search is designed in detail. Finally, the two methods are combined into a comprehensive feature selection method, which uses the document frequency method based on minimum word frequency to extract the initial features and then applies the proposed attribute reduction method to eliminate redundancy, yielding a more representative feature subset. Experimental results show that the comprehensive method outperforms the information gain (IG), chi-square (CHI), and mutual information (MI) methods.
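The abstract does not define the first stage precisely, so the following is only a minimal sketch of one plausible reading: a document contributes to a term's document frequency only if the term occurs in it at least a minimum number of times. The function name `select_by_document_frequency` and the thresholds `min_term_freq` and `min_doc_freq` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def select_by_document_frequency(docs, min_term_freq=2, min_doc_freq=2):
    """Return the terms kept by a document-frequency filter.

    Assumption (not from the paper): a document is counted for a term
    only when the term occurs at least `min_term_freq` times in it.
    `docs` is a list of token lists.
    """
    doc_freq = Counter()
    for tokens in docs:
        # Per-document term counts.
        term_counts = Counter(tokens)
        for term, count in term_counts.items():
            if count >= min_term_freq:
                doc_freq[term] += 1
    # Keep terms that appear often enough in enough documents.
    return sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)


docs = [
    ["tabu", "tabu", "search", "rough"],
    ["tabu", "tabu", "set", "set"],
    ["set", "set", "search"],
]
features = select_by_document_frequency(docs)
```

In this toy corpus, "search" appears in two documents but only once in each, so it is dropped when `min_term_freq=2`. The attribute reduction stage based on rough sets and Tabu search would then operate on the surviving terms; it is not sketched here because the abstract gives no details of it.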
Source
Journal of Huazhong University of Science and Technology (Natural Science Edition)
EI
CAS
CSCD
PKU Core
2010, No. 2, pp. 4-7, 40 (5 pages in total)
Funding
Sichuan Province Science and Technology Plan Project (2008GZ0003)
Sichuan Province Key Science and Technology Project (07GG006-019)
Keywords
feature selection
text categorization
document frequency
Tabu search
attribute reduction