Abstract
This paper studies duplicate web page detection. The traditional SCAM duplicate-detection algorithm decides whether pages are duplicates by comparing how often a few keywords occur in each page. When a website contains similar pages, their keywords are very close, which leads to misjudgments and low detection accuracy. This paper proposes a web page fingerprint algorithm for duplicate detection: information retrieval techniques are used to extract the fingerprint of the page under test, and that fingerprint is then compared against the fingerprints stored in the web page library to decide whether the page is a duplicate. This avoids the low accuracy that results from relying on only a few keywords. Experiments show that fingerprint-based detection accurately determines whether a page is a duplicate, improves the accuracy of web page information, and achieves satisfactory results.
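The abstract does not say how the page fingerprint is computed, so the following is only a minimal sketch of the extract-then-compare workflow it describes, not the authors' method. It assumes a fingerprint built from hashed three-word shingles and a Jaccard similarity threshold; the names `fingerprint`, `similarity`, `find_duplicates`, the shingle length `k`, and the threshold `0.8` are all hypothetical choices made for illustration.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def fingerprint(text, k=3):
    """Return a set of 64-bit hashes, one per k-word shingle (assumed scheme)."""
    words = tokenize(text)
    if not words:
        return frozenset()
    if len(words) < k:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return frozenset(
        int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    )


def similarity(fp_a, fp_b):
    """Jaccard similarity between two fingerprints."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def find_duplicates(page_text, library, threshold=0.8):
    """Return ids of library pages whose fingerprint is close to the new page's."""
    fp = fingerprint(page_text)
    return [pid for pid, lib_fp in library.items()
            if similarity(fp, lib_fp) >= threshold]


# Hypothetical page library: page id -> stored fingerprint.
library = {
    "a": fingerprint("the quick brown fox jumps over the lazy dog"),
    "b": fingerprint("a page about fingerprints and duplicate detection"),
}

same_page = find_duplicates("The quick brown fox jumps over the lazy dog!", library)
# Same keywords and counts as page "a", but in a different order.
same_keywords = find_duplicates("dog lazy the over jumps fox brown quick the", library)
```

In this sketch the second query contains exactly the same words as page "a", so a comparison based only on keyword counts would treat the two as identical, while the shingle-based fingerprint does not match because word order is part of what gets hashed.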
Source
Computer Simulation (《计算机仿真》)
CSCD
PKU Core Journal
2011, No. 9, pp. 154-157 (4 pages)
Keywords
Duplicate web page detection
Keyword
Web page fingerprint