Abstract
When classes and features are unevenly distributed, the classification performance of the traditional information gain algorithm drops sharply. To address this shortcoming, a text feature selection method based on information gain, TDpIG, is proposed. First, features are selected class by class to reduce the influence of dataset imbalance on feature selection. Second, the information gain weight is computed from the feature occurrence probability to reduce the interference of low-frequency words. Finally, dispersion is used to analyze the information gain value of each feature in every class, filtering out relatively redundant features among the high-frequency words; the selected features are then further refined using information gain differences to obtain a uniform and accurate feature subset. Comparative experiments show that the selected features yield better classification performance.
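The abstract does not give the TDpIG formulas, so the following is only a minimal sketch of the baseline it builds on: standard information gain for a binary term-presence feature, plus a one-vs-rest per-class variant that loosely mirrors the "class by class" step. The function names, the representation of documents as token sets, and the one-vs-rest reading are all assumptions made for illustration, not the authors' method; the probability weighting, dispersion filter, and difference-based refinement are not reproduced here.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Standard information gain IG(t) = H(C) - H(C | t present/absent).

    docs   : list of token sets (assumed representation)
    labels : list of class labels, aligned with docs
    term   : the candidate feature (a token)
    """
    classes = sorted(set(labels))
    n = len(labels)
    all_counts = Counter(labels)
    with_term = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_term = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)
    n_with = sum(with_term.values())
    n_without = n - n_with

    h_class = entropy([all_counts[c] for c in classes])
    h_conditional = (
        (n_with / n) * entropy([with_term[c] for c in classes])
        + (n_without / n) * entropy([without_term[c] for c in classes])
    )
    return h_class - h_conditional


def per_class_information_gain(docs, labels, term):
    """Hypothetical per-class view: IG of the term for each class vs. the rest."""
    result = {}
    for cls in sorted(set(labels)):
        binary_labels = [cls if lab == cls else "__rest__" for lab in labels]
        result[cls] = information_gain(docs, binary_labels, term)
    return result


if __name__ == "__main__":
    docs = [
        {"goal", "match"},
        {"goal", "team"},
        {"stock", "market"},
        {"stock", "team"},
    ]
    labels = ["sport", "sport", "finance", "finance"]
    print(information_gain(docs, labels, "goal"))
    print(information_gain(docs, labels, "team"))
    print(per_class_information_gain(docs, labels, "goal"))
```

In this toy corpus, "goal" separates the two classes perfectly while "team" appears equally in both, so the first should score well above the second. A method such as the one described would then adjust or filter these raw scores to handle imbalance, low-frequency words, and redundancy.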
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2012, Issue 11, pp. 127-130 (4 pages)
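Funding
National Natural Science Foundation of China (60603047)
Scientific Research Foundation for Returned Overseas Scholars, Ministry of Education
Liaoning Province Science and Technology Plan Project (2008216014)
Liaoning Provincial Department of Education Scientific Research Fund for Higher Education Institutions (L2010229)
Dalian Outstanding Young Science and Technology Talent Fund (2008J23JH026)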
Keywords
Feature selection
Text classification
Information gain
Redundant feature
Imbalanced dataset