Abstract
The high dimensionality of the feature space restricts the choice of classification algorithms, complicates classifier design, lowers accuracy, and weakens generalization, which leads to overfitting; feature selection is therefore needed to avoid the curse of dimensionality. This paper first briefly analyzes several classic feature selection methods and summarizes their shortcomings. It then presents an optimized document frequency method and uses it to filter out some terms, reducing the sparsity of the text matrix. Finally, it applies pattern aggregation (PA) theory to build the vector space model of the text collection, which strengthens the role of terms according to their contribution to classification and removes redundant patterns contained in the original term matrix, thereby effectively reducing the dimensionality of the vector space and improving the accuracy and speed of text categorization. Experimental results show that this combined feature selection method performs well.
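The document frequency filtering step named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the authors' optimized variant works, so the code below shows only plain document frequency thresholding; the function names, the `min_df` and `max_df_ratio` parameters, and the toy corpus are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # set() ensures a term is counted once per document
        df.update(set(tokens))
    return df


def filter_by_df(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies in [min_df, max_df_ratio * N].

    Both thresholds are hypothetical defaults chosen for this sketch.
    """
    n_docs = len(docs)
    df = document_frequency(docs)
    kept = {
        term
        for term, count in df.items()
        if min_df <= count <= max_df_ratio * n_docs
    }
    return sorted(kept)


# Toy corpus of tokenized documents (invented for illustration).
docs = [
    ["feature", "selection", "text"],
    ["text", "classification", "feature"],
    ["text", "sparse", "matrix"],
    ["feature", "text", "vector"],
]

vocab = filter_by_df(docs, min_df=2, max_df_ratio=0.9)
print(vocab)
```

Dropping rare terms shrinks the term-document matrix and removes its emptiest columns, which is how a filter of this kind reduces sparsity; the upper bound additionally discards terms that occur in nearly every document and so carry little class information.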
Source
《计算机应用研究》
CSCD
Peking University Core Journals (北大核心)
2010, No. 1, pp. 36-38 (3 pages)
Application Research of Computers
Funding
Sichuan Province Science and Technology Program (2008GZ0003)
Keywords
feature selection
text categorization
word frequency
document frequency
pattern aggregation (PA)