Detection and elimination of similar Web pages based on text structure and string of feature code (cited by: 11)
Abstract: To reduce the interference that duplicate Web pages cause for users and to make deduplication more efficient, a new large-scale Web page deduplication algorithm is proposed. First, predefined Web page tag values are used to build a structure tree of the page's main text, so that fingerprint similarity can be computed layer by layer. Second, the first and last Chinese characters of sentences that contain high-frequency punctuation marks are extracted as feature codes. Finally, a Bloom filter is applied to the resulting feature fingerprints to decide whether pages are similar. Experiments show that the algorithm raises recall to above 90% and reduces time complexity to O(n).
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2013, Issue 2, pp. 554-557 (4 pages)
Keywords: detection and elimination of similar Web pages; Web label value; high-frequency punctuation; feature code; fingerprint similarity of Web pages
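
The feature-code and Bloom filter steps named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the punctuation set `PUNCT`, the `top_k` cutoff, the Bloom filter sizing, the SHA-256 based hashing, and the `overlap_ratio` measure are all assumptions chosen for illustration, and the structure-tree step (layered fingerprints built from tag values) is left out entirely.

```python
import hashlib
import re
from collections import Counter

# Assumed punctuation set; the abstract does not list the marks actually used.
PUNCT = "。，！？；"


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def feature_codes(text, top_k=2):
    """Head and tail Chinese characters of each segment that ends in one of
    the top_k most frequent punctuation marks in the text (hypothetical cutoff)."""
    counts = Counter(ch for ch in text if ch in PUNCT)
    frequent = {mark for mark, _ in counts.most_common(top_k)}
    # The capturing group keeps each delimiter, so parts alternates
    # segment, delimiter, segment, delimiter, ...
    parts = re.split("([" + PUNCT + "])", text)
    codes = []
    for segment, delim in zip(parts[0::2], parts[1::2]):
        if delim not in frequent:
            continue
        hanzi = [ch for ch in segment if is_cjk(ch)]
        if len(hanzi) >= 2:
            codes.append(hanzi[0] + hanzi[-1])
    return codes


class BloomFilter:
    """Small Bloom filter; size and hash count are illustrative defaults."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def overlap_ratio(codes, seen):
    """Fraction of a page's feature codes already present in the filter."""
    if not codes:
        return 0.0
    return sum(code in seen for code in codes) / len(codes)


page_a = "今天天气很好，我们去公园散步。公园里有很多人，大家都很开心。"
page_b = "今天天气很好，我们去公园散步。公园里有很多人，大家都很高兴。"

seen = BloomFilter()
for code in feature_codes(page_a):
    seen.add(code)
ratio = overlap_ratio(feature_codes(page_b), seen)
```

Each page is scanned once and each feature code costs a fixed number of hash lookups, which is consistent with the linear time behaviour the abstract reports. A page whose `ratio` exceeds some threshold would be treated as a near-duplicate, and that threshold is another parameter the abstract does not specify.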