Abstract
The traditional TF*IDF algorithm is an important method for computing the weights of keywords in a document. This paper analyzes a shortcoming of traditional TF*IDF when it is used to separate spam from legitimate email: it neglects words that occur repeatedly across the documents of one class. Such words are often the most representative features of that class and should receive relatively high weights, yet traditional TF*IDF does the opposite and assigns them low weights, which falls short of what the designer intends. The TF*IDF formula is therefore modified to raise the weights of these words. Experiments show that the improved algorithm outperforms the traditional one.
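The shortcoming described in the abstract can be illustrated with a minimal sketch. It assumes the common textbook form of the weight, tf(t, d) * log(N / df(t)); the abstract does not state the exact formula used in the paper, and the paper's improved formula is not reproduced here. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed textbook form: weight = (count / doc length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: three spam messages and one legitimate message.
docs = [
    "free prize free offer".split(),
    "free offer click now".split(),
    "free money click here".split(),
    "meeting agenda for monday".split(),
]
weights = tf_idf(docs)

# "free" occurs in every spam message, so its IDF factor is small,
# while "prize" occurs in only one message and gets a larger weight.
print(weights[0]["free"], weights[0]["prize"])
```

In this toy corpus the word that recurs throughout the spam class ends up with a lower weight than a word seen only once, which is the behavior the abstract identifies as the problem.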
Author
常凯
CHANG Kai (Hubei University of Technology, Wuhan 430068, China)
Source
《电脑知识与技术》
2010, No. 9, pp. 6928-6930 (3 pages)
Computer Knowledge and Technology