Abstract
This paper proposes a Chinese text filtering method that applies a layered filtering strategy based on a bigram model. First, illegal keywords are extracted from a collection of illegal texts by combining document frequency with the chi-square statistic, and a predefined strategy then selects the illegal texts together with some legal texts that contain illegal keywords. Second, within the selected texts, bigram word strings that contain an illegal keyword are collected as the candidate feature set; the features are scored with the chi-square statistic, and a predetermined number of them are kept as the final feature subset. Finally, a support vector machine (SVM) classifier filters the illegal texts. Experiments show that the method achieves a precision of 95.65%, a recall of 84.87%, and an F1 value of 89.93%.
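The building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 2x2 contingency-table layout, and the assumption that texts are already segmented into word tokens are all hypothetical choices made for illustration, and the keyword-selection strategy and SVM stage are left out.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of one feature against the illegal class.

    Assumed 2x2 layout (illustrative, not taken from the paper):
      a: illegal texts containing the feature
      b: legal texts containing the feature
      c: illegal texts without the feature
      d: legal texts without the feature
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no information about the class.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def keyword_bigrams(tokens, keywords):
    """Adjacent word pairs in which at least one word is an illegal keyword.

    `tokens` is assumed to be the output of a word segmenter.
    """
    return [
        (left, right)
        for left, right in zip(tokens, tokens[1:])
        if left in keywords or right in keywords
    ]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

As a consistency check, `f1_score(0.9565, 0.8487)` evaluates to roughly 0.899, which agrees with the F1 value reported in the abstract.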
Source
《计算机应用与软件》
CSCD
2011, No. 7, pp. 16-18 (3 pages)
Computer Applications and Software
Funding
National Natural Science Foundation of China (90920004, 60970056, 60873150)
Natural Science Foundation of Jiangsu Province (BK2008160)
Major Basic Research Project in Natural Science of Jiangsu Higher Education Institutions (08KJA520002)