
An Improved Text Feature Weighting Algorithm Based on TFIDF
(original title: 基于TFIDF文本特征加权方法的改进研究)
Cited by: 37
Abstract: The traditional TFIDF method treats the document set as a whole and does not consider how a feature term is distributed between classes and within classes. To address this shortcoming, an improved TFIDF method that incorporates information entropy is proposed. The method adjusts the TFIDF weight of each feature term using the entropy of its distribution between classes and within classes. This avoids assigning large weights to terms that contribute nothing to classification, so the weights of text feature terms are computed more effectively. Experimental results show that the method improves the precision and recall of text categorisation and is a relatively effective text feature weighting method.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2011, No. 2, pp. 17-20 (4 pages)
Funding: National Natural Science Foundation of China (60841003); National Torch Program of China (2004EB33006)
Keywords: term frequency-inverse document frequency (TFIDF); text categorisation; feature weighting; vector space model
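
The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical weight of the form tf × idf × (1 − H_between / log2 |C|), where H_between is the entropy of a term's occurrence counts across classes. It illustrates only the between-class adjustment; the within-class entropy mentioned in the abstract is omitted. The function names and the toy corpus are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def entropy_weighted_tfidf(docs):
    """docs: list of (tokens, label). Returns one {term: weight} dict per doc.

    Assumed (hypothetical) weighting, not taken from the paper:
        weight = tf * ln(N / df) * (1 - H_between / log2(number of classes))
    """
    n_docs = len(docs)
    classes = sorted({label for _, label in docs})
    df = Counter()                       # term -> number of docs containing it
    class_counts = defaultdict(Counter)  # term -> {label: occurrences}
    for tokens, label in docs:
        df.update(set(tokens))
        for t in tokens:
            class_counts[t][label] += 1

    # Maximum possible entropy over the classes, used for normalisation.
    max_h = math.log2(len(classes)) if len(classes) > 1 else 1.0
    weights = []
    for tokens, _ in docs:
        tf = Counter(tokens)
        doc_w = {}
        for t, f in tf.items():
            idf = math.log(n_docs / df[t])
            h = entropy([class_counts[t][c] for c in classes])
            # Terms spread evenly over classes (high entropy) are damped.
            doc_w[t] = (f / len(tokens)) * idf * (1.0 - h / max_h)
        weights.append(doc_w)
    return weights


docs = [
    (["ball", "game", "new"], "sport"),
    (["ball", "team", "coach"], "sport"),
    (["vote", "law", "new"], "politics"),
    (["vote", "party", "bill"], "politics"),
]
weights = entropy_weighted_tfidf(docs)
```

In this toy corpus, "ball" and "new" both appear in two of four documents, so plain TFIDF would give them the same weight. Under the assumed adjustment, "new" is split evenly across the two classes and its weight drops to zero, while "ball", which occurs only in the sport class, keeps its full TFIDF weight.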

