Abstract
A text feature selection method for hierarchical classification is proposed. First, the concept of category hierarchical correlation degree is introduced and computed from the category tree and the probability distribution of the training data at different levels of the hierarchy. The importance of each category in the category tree is then derived from this correlation degree. Finally, the discriminative ability of each feature with respect to the categories is computed from these results, and the features with the greatest discriminative ability are selected to form the feature set used for classification. Experimental results show that the proposed method outperforms traditional feature selection methods both in the quality of the selected features and in the classification measures accuracy, F1, and micro-precision.
Source
Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》)
Indexed in: EI, CSCD, PKU Core (北大核心)
2011, Issue 1, pp. 103-110 (8 pages)
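Funding
National Natural Science Foundation of China (No. 60970047)
Natural Science Foundation of Shandong Province (No. Y2008G19)
Shandong Province Science and Technology Key Project (No. 2007GG10001002, 2008GG10001026)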
Keywords
Text Feature Selection
Category Hierarchical Correlation
Hierarchical Classification
Machine Learning