Improved Algorithm of Text Feature Weighting Based on Information Gain

Cited by: 9
Abstract: The idf function in the traditional tf.idf algorithm evaluates a feature's ability to discriminate between documents only at a macroscopic level. It cannot reflect how differences in the proportion of a feature's distribution across the individual documents and across the classes of the training set affect the computed feature weight, and this reduces the accuracy of text representation. To address this problem, the paper proposes an improved feature weighting method, tf.igt.igC. Starting from an analysis of feature distribution, the method introduces the concept of information gain from information theory so that both of these dimensions of feature distribution are considered together, overcoming the shortcomings of the traditional formula. Experimental results on two open-source corpora show that, compared with the two feature weighting methods tf.idf.ig and tf.idf.igc, tf.igt.igC is more effective at computing feature weights.
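Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list (北大核心), 2011, No. 1, pp. 16-18, 21 (4 pages)
Funding: China Postdoctoral Science Foundation (20090461425); Jiangsu Province Postdoctoral Research Program Foundation (0901014B)
Keywords: feature distribution; feature weighting; text classification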
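
The abstract contrasts the idf factor with information gain computed from how a feature is distributed over classes. The record does not give the paper's tf.igt.igC formula, so it is not reproduced here. What follows is only a minimal sketch of the two standard building blocks, classic idf and the textbook information gain of a term with respect to class labels. The function names (`entropy`, `idf`, `information_gain`) and the toy corpus `docs` are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def idf(term, docs):
    """Classic idf = log(N / df); returns 0.0 if the term never occurs."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def information_gain(term, docs):
    """Textbook IG(term) = H(C) - H(C | term present or absent)."""
    labels = [label for _, label in docs]
    with_term = [label for tokens, label in docs if term in tokens]
    without_term = [label for tokens, label in docs if term not in tokens]
    h_class = entropy(Counter(labels).values())
    n = len(docs)
    # Weighted entropy of the class labels inside each non-empty partition.
    h_cond = sum(
        (len(part) / n) * entropy(Counter(part).values())
        for part in (with_term, without_term)
        if part
    )
    return h_class - h_cond


# Hypothetical toy corpus: (tokens, class label) pairs.
docs = [
    (["goal", "match", "team"], "sport"),
    (["team", "coach", "news"], "sport"),
    (["market", "stock", "news"], "finance"),
    (["stock", "bank", "rate"], "finance"),
]

for term in ("stock", "news"):
    print(term, round(idf(term, docs), 3), round(information_gain(term, docs), 3))
```

In this toy corpus "stock" and "news" each occur in two of the four documents, so they receive the same idf, yet "stock" appears only in the finance class while "news" is split evenly across both classes. Information gain separates the two terms where idf cannot, which is the kind of class-distribution difference the abstract says idf fails to capture.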