Improvement and application to weighting terms based on text classification (文本分类中词语权重计算方法的改进与应用)

Cited by: 28
Abstract: The formal representation of text has long been a fundamental problem in information retrieval. The tf.idf (term frequency, inverse document frequency) weighting scheme of the Vector Space Model is a widely used text representation that gives good results in this field. Differences in the proportion with which terms are distributed across a text collection are one of the important factors determining how well a term expresses the content of a text. However, the IDF calculation does not take into account how a feature term is distributed among classes, nor that a feature term distributed relatively evenly within a class should receive a higher weight than one distributed unevenly. The improved TFIDF is used to select feature terms, and a KNN classification algorithm and a genetic algorithm are used to train classifiers to verify its effectiveness. The experiments show that the improved strategy is feasible.
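For reference, the baseline tf.idf weighting that the abstract criticizes can be sketched as below. This is a minimal illustration of the textbook formula w = tf × log(N / df), not the improved scheme proposed in the paper, whose formula the abstract does not give. The function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document using tf * log(N / df).

    docs: list of token lists. This is textbook tf.idf only; it ignores
    class labels, which is the limitation the abstract points out.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


# Hypothetical toy corpus, for illustration only.
docs = [
    ["text", "classification", "text"],
    ["text", "retrieval"],
    ["genetic", "algorithm"],
]
weights = tfidf_weights(docs)
```

Under this formula a term that occurs in every document gets weight zero no matter how it is spread across or within classes, which is the kind of information a class-aware variant would need to add.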
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2008, Issue 5, pp. 187-189 (3 pages)
Funding: Natural Science Foundation of Chongqing City of China (Grant No. CSTC2006BB2021)
Keywords: text representation; Vector Space Model; feature selection; TFIDF