
Research on a Web Page Duplicate-Detection Technique Based on Page Fingerprints (cited by: 2)

Research on NLP-based Page Fingerprint Seek Algorithm
Abstract: This paper studies the problem of detecting duplicate web pages. The traditional SCAM duplicate-detection algorithm decides whether pages are duplicates by comparing how often a few keywords occur in each page. When a website contains similar pages, their keywords are very close, which causes misjudgments and low detection accuracy. This paper proposes a page-fingerprint duplicate-detection algorithm: information retrieval techniques are used to extract the fingerprint of the page under test, and that fingerprint is then compared against the fingerprints stored in a page library to decide whether the page is a duplicate. This avoids the low accuracy that results from relying on only a few keywords. Experiments show that fingerprint-based detection accurately determines whether a page is a duplicate, improves the accuracy of web page information, and achieves satisfactory results.
Author: Wang Xijie (王希杰)
Affiliation: Anyang Normal University (安阳师范学院)
Source: Computer Simulation (《计算机仿真》), CSCD, Peking University Core Journal, 2011, No. 9, pp. 154-157 (4 pages)
Keywords: duplicated web page seek; keyword; page fingerprint
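
The two steps named in the abstract (extract a fingerprint from the page under test, then compare it against the fingerprints in a page library) can be illustrated with a minimal Python sketch. The abstract does not say how the paper builds its fingerprint, so everything here is an assumption made for illustration: the fingerprint is a set of hashed word shingles, similarity is the Jaccard overlap of two such sets, and the names `page_fingerprint`, `similarity`, `find_duplicates`, the shingle size `k`, and the `threshold` value are all hypothetical.

```python
import hashlib
import re


def page_fingerprint(text, k=3):
    """Hypothetical fingerprint: hashes of every run of k consecutive words.

    This shingle-based scheme is an assumption, not the paper's method.
    """
    words = re.findall(r"\w+", text.lower())
    # A page shorter than k words yields a single shingle holding all its words.
    count = max(len(words) - k + 1, 1)
    shingles = {" ".join(words[i:i + k]) for i in range(count)}
    return frozenset(
        int(hashlib.md5(s.encode("utf-8")).hexdigest()[:16], 16) for s in shingles
    )


def similarity(fp_a, fp_b):
    """Jaccard overlap of two fingerprints, between 0.0 and 1.0."""
    union = fp_a | fp_b
    if not union:
        return 1.0
    return len(fp_a & fp_b) / len(union)


def find_duplicates(text, library, threshold=0.8):
    """Return ids of library pages whose fingerprint is close to that of `text`.

    `library` maps a page id to a stored fingerprint; `threshold` is an
    illustrative cut-off, not a value taken from the paper.
    """
    fp = page_fingerprint(text)
    return [
        page_id
        for page_id, stored in library.items()
        if similarity(fp, stored) >= threshold
    ]


library = {
    "a": page_fingerprint("the quick brown fox jumps over the lazy dog"),
    "b": page_fingerprint("completely unrelated content about web crawlers and indexes"),
}
matches = find_duplicates("The quick brown fox jumps over the lazy dog", library)
```

Because the fingerprint covers every run of words on the page instead of a handful of keyword counts, two pages that share keywords but differ in their actual text get a low overlap score, which is the failure case of the keyword-based approach that the abstract describes.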
