Abstract
As the World Wide Web continues to grow exponentially, so does the amount of harmful information in web pages. Identifying illegitimate text that carries such information and blocking it effectively is an emerging area of information filtering research. Building on the traditional keyword-based approach, the crawled web pages are first preprocessed, and a weighted keyword dictionary is then constructed. By applying the concept of same-category words from a Chinese corpus and working from the angle of lexical relevance, a relevance filtering algorithm based on the mean weight of same-category words is proposed. Finally, the algorithm is evaluated from two perspectives: it is more efficient, and it copes well with the anti-keyword-filtering strategies used by objectionable websites such as pornographic sites.
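The idea of scoring a page by the mean weight of same-category keywords can be illustrated with a minimal Python sketch. It is not the paper's implementation: the dictionary contents, the weights, the threshold, the names `CATEGORY_WEIGHTS`, `category_scores`, and `should_block`, and the whitespace tokenization are all assumptions made for illustration. Chinese text would need a word segmenter in place of `split()`.

```python
from statistics import mean

# Hypothetical weighted dictionary: category -> {keyword: weight}.
# Categories, keywords, and weights are invented for this example.
CATEGORY_WEIGHTS = {
    "gambling": {"casino": 0.9, "betting": 0.7, "jackpot": 0.5},
    "violence": {"weapon": 0.8, "assault": 0.6},
}


def category_scores(text, dictionary=CATEGORY_WEIGHTS):
    """Return, per category, the mean weight of its keywords found in text."""
    # Assumption: whitespace tokenization; real Chinese text needs segmentation.
    tokens = set(text.lower().split())
    scores = {}
    for category, words in dictionary.items():
        hits = [weight for word, weight in words.items() if word in tokens]
        if hits:
            scores[category] = mean(hits)
    return scores


def should_block(text, threshold=0.6, dictionary=CATEGORY_WEIGHTS):
    """Flag the text if any category's mean weight reaches the threshold."""
    scores = category_scores(text, dictionary)
    return any(score >= threshold for score in scores.values())
```

Because each category is scored by the mean over its matched words, a single low-weight keyword does not trigger blocking on its own, while several related keywords from the same category do. How the paper actually combines the weights may differ from this sketch.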
Source
Journal of Central South University of Forestry & Technology (《中南林业科技大学学报》)
CAS
CSCD
PKU Core (北大核心)
2011, No. 12, pp. 197-201 (5 pages)
Keywords
webpage filtering
matrix dictionary
weight mean