
Information-gain-based Text Feature Selection Method
Original title: 基于信息增益的文本特征选择方法
Cited by: 31
Abstract: When classes and features are unevenly distributed, the classification performance of the traditional information gain algorithm drops sharply. To address this shortcoming, a text feature selection method based on information gain (TDpIG) is proposed. First, feature selection is performed on the dataset class by class, which reduces the effect of dataset imbalance on feature selection. Second, the information gain weight is computed from the feature occurrence probability, which reduces the interference of low-frequency words with feature selection. Finally, dispersion is used to analyze each feature's information gain value within each class, filtering out relatively redundant features among the high-frequency words; the selected features are then further refined using the difference in information gain, yielding a uniform and accurate feature subset. Comparative experiments show that the selected features achieve better classification performance.
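For orientation, the traditional information gain score that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the standard textbook definition, IG(t) = H(C) - P(t)H(C|t) - P(¬t)H(C|¬t), computed from document-level term presence. It is not the TDpIG method: the abstract gives no formulas for the per-class selection, probability weighting, or dispersion steps, so none of them are reproduced here. The function names and the toy corpus are assumptions made purely for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Standard IG of a term: class entropy minus the entropy that remains
    after splitting the documents by presence or absence of the term."""
    with_term = Counter()
    without_term = Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:
            with_term[label] += 1
        else:
            without_term[label] += 1

    n = len(docs)
    h_class = entropy(list(Counter(labels).values()))
    h_conditional = 0.0
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:  # skip an empty side of the split
            h_conditional += (size / n) * entropy(list(part.values()))
    return h_class - h_conditional


# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "team"}]
labels = ["sports", "sports", "politics", "politics"]

vocabulary = set().union(*docs)
scores = {t: information_gain(docs, labels, t) for t in vocabulary}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus, "ball" and "vote" separate the two classes perfectly and receive the maximum score, while "team" occurs equally in both classes and scores zero. Because the score is computed over the whole collection at once, it is sensitive to class and term frequencies, which is the weakness the abstract says the class-wise selection and probability weighting are meant to address.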
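Source: Computer Science (《计算机科学》), CSCD, PKU Core Journal, 2012, No. 11, pp. 127-130 (4 pages)
Funding: National Natural Science Foundation of China (60603047); Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education; Liaoning Province Science and Technology Plan Project (2008216014); Scientific Research Fund for Higher Education Institutions, Liaoning Provincial Department of Education (L2010229); Dalian Outstanding Young Science and Technology Talent Fund (2008J23JH026)
Keywords: feature selection; text classification; information gain value; redundant feature; imbalanced dataset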

