
Web Page Deduplication Algorithm Based on Page Body Topic and Summary (cited by: 5)

The Detection on Duplicated Web Pages from Meta Search
Abstract: Meta-search engines often return duplicate web pages whose content is identical but whose titles (aliases) differ widely. To address this, a web page deduplication algorithm based on the topic of the page body and the result summary is proposed, and its effectiveness is verified experimentally. The algorithm first processes the page titles returned by each member search engine to extract the topic information of each page, then segments the summaries into words and computes the similarity between summaries. Combining the two reflects the content of the article summary better and allows duplicate pages to be detected and removed. The experiments indicate that the algorithm is effective, has clear advantages over traditional signature-based (feature-code) algorithms, and gives results closer to manual counts.
Source: Journal of Guangxi Academy of Sciences (《广西科学院学报》), 2009, No. 4, pp. 251-253 (3 pages)
Funding: Supported by the National Innovation Fund for Small and Medium-sized Enterprises (No. 08c26224501313)
Keywords: duplicate detection; web pages; Chinese word segmentation; similarity (repetition rate); meta search engine
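
The pipeline outlined in the abstract (extract a topic from the title, segment the summary, compute similarity, combine the two) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the title-processing rules, the segmenter, the similarity measure, or any thresholds. Here the site alias is assumed to follow a `_`, `|`, or ` - ` separator, character bigrams stand in for real Chinese word segmentation, Jaccard overlap stands in for the paper's similarity measure, and the function names and both thresholds are hypothetical.

```python
import re


def title_topic(title):
    """Return the part of the title before a site-alias separator.

    Assumption: the alias (site name) follows '_', '|', or ' - '.
    """
    return re.split(r"_|\||\s-\s", title.strip())[0].strip()


def bigrams(text):
    """Character bigrams, a crude stand-in for word segmentation."""
    chars = re.sub(r"\W+", "", text.lower())  # drop punctuation and spaces
    return {chars[i:i + 2] for i in range(len(chars) - 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_duplicate(p, q, topic_threshold=0.8, summary_threshold=0.6):
    """Treat two results as duplicates when both the title topics and the
    summaries are similar enough. Both thresholds are hypothetical."""
    topic_sim = jaccard(bigrams(title_topic(p["title"])),
                        bigrams(title_topic(q["title"])))
    summary_sim = jaccard(bigrams(p["summary"]), bigrams(q["summary"]))
    return topic_sim >= topic_threshold and summary_sim >= summary_threshold


def dedupe(results):
    """Keep the first result of each duplicate group, preserving order."""
    kept = []
    for r in results:
        if not any(is_duplicate(r, k) for k in kept):
            kept.append(r)
    return kept
```

Calling `dedupe` on a merged list of results, each a dict with `title` and `summary` keys, returns the list with later duplicates dropped. Requiring both similarities to pass is one simple way to combine the two signals; a weighted sum would be another.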