Abstract
Traditional TFIDF treats the document collection as a whole and ignores how feature terms are distributed between and within classes. To address this shortcoming, an improved TFIDF method that incorporates information entropy is proposed. The method adjusts the TFIDF weight of each feature term using the entropy of its distribution between classes and within classes. This avoids assigning large weights to terms that contribute little to classification and yields more effective feature-term weights. Experimental results show that the method improves the precision and recall of text categorisation and is an effective approach to text feature weighting.
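The abstract does not state the weighting formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a term's TFIDF weight is scaled down when the term is spread evenly across classes (high between-class entropy) and scaled up when it is spread evenly over the documents of the classes it occurs in (high within-class entropy). The function names `normalized_entropy` and `entropy_tfidf`, and the combination rule `(1 - H_between) * (1 + H_within)`, are assumptions made for illustration.

```python
import math
from collections import Counter


def normalized_entropy(counts):
    """Shannon entropy of a count vector, scaled to [0, 1]."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))


def entropy_tfidf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   -- list of token lists
    labels -- class label of each document (same length as docs)
    """
    n_docs = len(docs)
    classes = sorted(set(labels))
    tfs = [Counter(tokens) for tokens in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())

    factor = {}
    for term in df:
        # Occurrences of the term in each class.
        per_class = [
            sum(tf[term] for tf, y in zip(tfs, labels) if y == c)
            for c in classes
        ]
        h_between = normalized_entropy(per_class)

        # Entropy over the documents of each class, averaged with
        # weights proportional to the term's occurrences in that class.
        total = sum(per_class)
        h_within = 0.0
        for c, c_total in zip(classes, per_class):
            if c_total:
                per_doc = [tf[term] for tf, y in zip(tfs, labels) if y == c]
                h_within += (c_total / total) * normalized_entropy(per_doc)

        # Assumed combination rule (not taken from the paper).
        factor[term] = max(0.0, 1.0 - h_between) * (1.0 + h_within)

    weighted = []
    for tf in tfs:
        length = sum(tf.values())
        weighted.append({
            term: (count / length) * math.log(n_docs / df[term]) * factor[term]
            for term, count in tf.items()
        })
    return weighted
```

Under this assumed rule, a term that occurs equally often in every class receives a weight of approximately zero even when plain TFIDF would give it a positive weight, which mirrors the behaviour the abstract describes for terms that do not help classification.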
Source
《计算机应用与软件》
CSCD
2011, No. 2, pp. 17-20 (4 pages)
Computer Applications and Software
Funding
National Natural Science Foundation of China (60841003)
National Torch Program of China (2004EB33006)
Keywords
Term frequency-inverse document frequency (TFIDF)
Text categorisation
Feature weighting
Vector space model