Research on Feature Selection Algorithm Based on the Lowest Term Frequency of CHI (cited by: 6)
Abstract: CHI (the chi-square statistic) is an important feature selection method in text categorization. This paper analyzes the characteristics of CHI feature selection and, to address its shortcomings, proposes a new feature selection algorithm based on a minimum term frequency for CHI. The method sets a minimum term-frequency threshold to remove some low-frequency terms, which reduces the interference they cause in CHI feature selection. The paper also improves the traditional TF-IDF term weighting method by incorporating the improved CHI feature selection function into the weight calculation, which makes the text representation more reasonable. Experiments on balanced and unbalanced corpora show that the new method effectively improves text categorization performance.
Source: Journal of Southwest University (Natural Science Edition), 2015, No. 6, pp. 137-142 (6 pages). Indexed in: CAS, CSCD, Peking University Core Journals.
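Funding: National Natural Science Foundation of China (61462008); Science and Technology Project of the Chongqing Municipal Education Commission (KJ120622)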
Keywords: text categorization; vector space model; feature selection; Chi-square (χ2) statistic; low-frequency term; term weighting
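The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas, so the sketch assumes the standard chi-square statistic over document counts, assumes that "term frequency" means total occurrences in the corpus, takes the maximum CHI over classes as a term's score, and multiplies a smoothed TF-IDF by that score. The function names (`chi_square`, `select_features`, `weight_document`) and the parameters `min_tf` and `top_k` are hypothetical.

```python
import math
from collections import Counter


def chi_square(term, label, docs, labels):
    """Standard chi-square statistic between a term and a class (document counts)."""
    a = b = c = d = 0
    for tokens, y in zip(docs, labels):
        present = term in tokens
        if present and y == label:
            a += 1      # term present, document in class
        elif present:
            b += 1      # term present, document not in class
        elif y == label:
            c += 1      # term absent, document in class
        else:
            d += 1      # term absent, document not in class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, min_tf=2, top_k=100):
    """Drop terms below min_tf (assumed: corpus-wide count), then rank by CHI."""
    total_tf = Counter(tok for tokens in docs for tok in tokens)
    candidates = [t for t, f in total_tf.items() if f >= min_tf]
    classes = set(labels)
    # Assumption: a term's score is its maximum CHI over all classes.
    scores = {t: max(chi_square(t, c, docs, labels) for c in classes)
              for t in candidates}
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return {t: scores[t] for t in ranked[:top_k]}


def weight_document(tokens, docs, chi_scores):
    """Assumed weighting: TF * smoothed IDF * CHI score of the selected term."""
    n_docs = len(docs)
    tf = Counter(tokens)
    weights = {}
    for term, chi in chi_scores.items():
        if term not in tf:
            continue
        df = sum(1 for doc in docs if term in doc)
        idf = math.log(1 + n_docs / df)  # smoothed so the weight never hits zero
        weights[term] = tf[term] * idf * chi
    return weights


# Toy corpus (made up for illustration only).
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
    ["stock", "market", "rare"],
    ["market", "stock", "team"],
]
labels = ["sport", "sport", "finance", "finance"]

selected = select_features(docs, labels, min_tf=2, top_k=10)
print(selected)
print(weight_document(docs[1], docs, selected))
```

In this toy corpus the terms "match" and "rare" each occur once, so the threshold removes them before CHI is computed. Without the threshold, "match" would receive the same CHI score as the much more frequent "team", which is the kind of low-frequency interference the abstract describes.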