A Layered Text Filtering Method Based on a Bigram Model
Abstract: This paper proposes a Chinese text filtering method that applies a layered filtering strategy based on a bigram model. First, a set of illegal keywords is extracted from a collection of illegal texts by combining document frequency with the chi-square statistic; following a predefined strategy, the illegal texts and those legal texts that contain illegal keywords are then selected for further processing. Second, within the selected texts, all bigram word strings that contain an illegal keyword are collected as the candidate feature set, the features are scored with the chi-square statistic, and a predetermined number of them are kept as the final feature subset. Finally, a support vector machine (SVM) classifier filters out the illegal texts. Experiments show that the method achieves a precision of 95.65%, a recall of 84.87%, and an F1 value of 89.93%.
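The keyword-extraction layer and the bigram feature step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how document frequency and chi-square are combined or what the screening strategy is, so the sketch assumes a document-frequency cutoff followed by chi-square ranking, and assumes a text passes the first layer if it contains any keyword. The names `select_keywords`, `first_layer`, `keyword_bigrams`, `min_df`, and `top_k` are hypothetical, the texts are assumed to be already segmented into words, and the final SVM stage is omitted.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: illegal docs containing the term, b: legal docs containing it,
    c: illegal docs without it,          d: legal docs without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_keywords(illegal_docs, legal_docs, min_df=2, top_k=2):
    """Assumed combination: drop terms whose document frequency in the
    illegal set is below min_df, then keep the top_k by chi-square."""
    df_illegal = Counter(t for doc in illegal_docs for t in set(doc))
    df_legal = Counter(t for doc in legal_docs for t in set(doc))
    scores = {}
    for term, a in df_illegal.items():
        if a < min_df:
            continue
        b = df_legal.get(term, 0)
        scores[term] = chi_square(
            a, b, len(illegal_docs) - a, len(legal_docs) - b
        )
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:top_k])


def first_layer(docs, keywords):
    """Assumed screening rule: keep every text containing any keyword."""
    return [doc for doc in docs if keywords & set(doc)]


def keyword_bigrams(doc, keywords):
    """Adjacent word pairs in which at least one word is a keyword."""
    return [(x, y) for x, y in zip(doc, doc[1:])
            if x in keywords or y in keywords]


# Toy, pre-segmented data (illustrative only, not from the paper).
illegal_docs = [["buy", "pills", "now"],
                ["cheap", "pills", "here"],
                ["pills", "for", "sale", "now"]]
legal_docs = [["meeting", "at", "noon"],
              ["doctor", "prescribed", "pills"],
              ["see", "you", "now"],
              ["lunch", "at", "noon"]]

keywords = select_keywords(illegal_docs, legal_docs, min_df=2, top_k=1)
candidates = first_layer(illegal_docs + legal_docs, keywords)
features = Counter(bg for doc in candidates
                   for bg in keyword_bigrams(doc, keywords))
```

In a full pipeline the bigram counts in `features` would be scored with the same chi-square function, truncated to a fixed number of features, and passed to an SVM classifier trained on the screened texts.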
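Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2011, No. 7, pp. 16-18 (3 pages)
Funding: National Natural Science Foundation of China (90920004, 60970056, 60873150); Natural Science Foundation of Jiangsu Province (BK2008160); Major Basic Research Project of Natural Science in Jiangsu Higher Education Institutions (08KJA520002)
Keywords: text filtering; chi-square statistic; keyword extraction; bigram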