Abstract
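Feature extraction is an important research area in text classification. To address the "curse of dimensionality" that the high dimensionality and sparsity of the original feature space impose on classification algorithms, this paper examines feature extraction based on term clustering and designs a text classification scheme that uses term clustering to extract features. The scheme clusters terms with an improved tree-structured growing self-organizing map (TGSOM). Based on the characteristics of the clustered features, it proposes a new weight calculation method that accounts for how the member terms of a cluster differ in document frequency and in their ability to discriminate between document classes. Finally, the SPRINT decision tree algorithm performs the classification. Experiments show that the method improves classification precision by 4.32% over a conventional method.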
Feature extraction is essential for text classification. This paper discusses the basic ideas behind feature extraction based on word clustering and presents a text classification method that extracts features by word clustering. The method employs an improved tree-structured growing self-organizing map (TGSOM) to cluster words. A new formula for calculating weights is developed by taking into account the differences between clustered word features and plain word features. Finally, the SPRINT decision tree is applied to complete the text classification. Experiments show that the proposed method improves the precision of text classification by 4.32%.
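As a rough illustration of how clustered words can replace individual words as features, the following minimal Python sketch aggregates word counts into cluster-level weights. It makes several assumptions: the word-to-cluster mapping is given (the TGSOM step is not reproduced), and the weight is a plain sum of tf × idf over a cluster's member words, which is a stand-in rather than the weighting formula proposed in the paper, since the abstract does not state it. The function name `cluster_features` and the toy data are hypothetical.

```python
import math
from collections import Counter

def cluster_features(docs, word_to_cluster):
    """Map tokenized documents to cluster-level feature vectors.

    A cluster's weight in a document is the sum of tf * idf over the
    member words that occur in that document. This is an illustrative
    stand-in, NOT the weight formula from the paper.
    """
    n_docs = len(docs)
    # Document frequency of every word that belongs to some cluster.
    df = Counter(w for doc in docs for w in set(doc) if w in word_to_cluster)
    # Smoothed inverse document frequency (assumed form).
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in df}

    vectors = []
    for doc in docs:
        tf = Counter(w for w in doc if w in word_to_cluster)
        vec = Counter()
        for w, count in tf.items():
            # Member words contribute to their cluster, not to themselves.
            vec[word_to_cluster[w]] += count * idf[w]
        vectors.append(dict(vec))
    return vectors

# Hypothetical toy data: four words grouped into two clusters.
word_to_cluster = {"goal": "sport", "match": "sport",
                   "stock": "finance", "bond": "finance"}
docs = [["goal", "match", "goal"], ["stock", "bond"], ["match", "stock"]]
vectors = cluster_features(docs, word_to_cluster)
```

Here four word features collapse into two cluster features, which is the dimensionality reduction the abstract refers to. The resulting vectors could then be passed to any decision tree learner.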
Source
《哈尔滨工程大学学报》
EI
CAS
CSCD
PKU Core Journals (北大核心)
2008, No. 11, pp. 1205-1209 (5 pages)
Journal of Harbin Engineering University