A Layered Text Filtering Method Based on a Bigram Model


Abstract: This paper proposes a Chinese text filtering method that applies a layered filtering strategy based on a bigram model. First, a set of illegal keywords is extracted from a collection of illegal texts by combining document frequency with the chi-square statistic, and a predefined strategy is used to pick out the illegal texts along with some legal texts that contain illegal keywords. Second, within the texts picked out in this way, the bigram word strings that contain illegal keywords are collected as the candidate feature set; the features are scored with the chi-square statistic, a predetermined number of them are kept as the final feature subset, and a support vector machine (SVM) classifier then filters the illegal texts. Experiments show that the method achieves a precision of 95.65%, a recall of 84.87%, and an F1 value of 89.93%.
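The two layers described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes documents are already word-segmented, uses English placeholder tokens as toy data, and every function name, threshold (`min_df`, `top_k`) and the simple "contains any keyword" screening rule is a hypothetical choice. The SVM step is left out; the selected bigram features would be handed to any SVM library.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square score of a term from a 2x2 document-count table.

    a: illegal docs containing the term   b: legal docs containing it
    c: illegal docs without the term      d: legal docs without it
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def extract_keywords(illegal_docs, legal_docs, min_df=2, top_k=2):
    """Layer 1: filter terms by document frequency, then rank by chi-square."""
    df_illegal = Counter(t for doc in illegal_docs for t in set(doc))
    df_legal = Counter(t for doc in legal_docs for t in set(doc))
    n_ill, n_leg = len(illegal_docs), len(legal_docs)
    scores = {}
    for term, a in df_illegal.items():
        b = df_legal[term]
        if a < min_df:
            continue  # too rare in the illegal collection
        if a * n_leg <= b * n_ill:
            continue  # assumption: keep only terms skewed toward illegal docs
        scores[term] = chi_square(a, b, n_ill - a, n_leg - b)
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])


def screen(docs, keywords):
    """Layer 1 screening (assumed rule): pass on docs containing any keyword."""
    return [doc for doc in docs if keywords & set(doc)]


def keyword_bigrams(doc, keywords):
    """Layer 2 features: adjacent word pairs that include a keyword."""
    return [(w1, w2) for w1, w2 in zip(doc, doc[1:])
            if w1 in keywords or w2 in keywords]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy, already-segmented documents (placeholders, not data from the paper).
illegal_docs = [["buy", "cheap", "pills", "now"],
                ["cheap", "pills", "online"],
                ["win", "cheap", "pills"]]
legal_docs = [["buy", "books", "online"],
              ["cheap", "flights", "now"],
              ["read", "books", "now"]]

keywords = extract_keywords(illegal_docs, legal_docs)
suspicious_legal = screen(legal_docs, keywords)
features = [keyword_bigrams(doc, keywords) for doc in illegal_docs + suspicious_legal]
```

On the toy data, only the legal document that contains a keyword reaches the second layer, where the bigram context ("cheap flights" versus "cheap pills") is what the classifier would rely on to separate it from the illegal texts. As a sanity check on the reported figures, `f1_score(95.65, 84.87)` gives about 89.94, which agrees with the stated 89.93% up to rounding of the precision and recall.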

Authors: 周聚 (Zhou Ju), 李培峰 (Li Peifeng), 朱巧明 (Zhu Qiaoming)

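Affiliations: School of Computer Science and Technology, Soochow University; Jiangsu Provincial Key Laboratory of Computer Information Processing Technology, Soochow University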

Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2011, No. 7, pp. 16-18 (3 pages)

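Funding: National Natural Science Foundation of China (90920004, 60970056, 60873150); Natural Science Foundation of Jiangsu Province (BK2008160); Major Basic Research Project in Natural Science of Jiangsu Higher Education Institutions (08KJA520002)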

Keywords: text filtering; chi-square statistic; keyword extraction; bigram

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]



