Abstract
To address the problem of duplicate web pages in meta-search, a web page deduplication algorithm for meta-search engines is proposed, and its effectiveness is verified experimentally. The algorithm first compares the URLs of the result pages returned by the member search engines. It then processes the title of each result page to extract the page's topic information. Finally, it segments each summary into words and computes the similarity between summaries. Combining these three signals detects duplicate pages well and allows them to be removed. The experiments indicate that the algorithm is effective, has a clear advantage over earlier algorithms, and gives results closer to manually compiled statistics.
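The three checks named in the abstract (URL comparison, title processing, summary similarity) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not say how the three signals are combined, which similarity measure is used, or which word segmenter is applied, so everything below is an assumption made for illustration: the function names, the `url`/`title`/`summary` record fields, the 0.6 threshold, the Jaccard measure, and the use of character bigrams as a stand-in for Chinese word segmentation.

```python
import re
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path + query; ignore scheme, 'www.' and a trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def normalize_title(title):
    """Keep the part before a ' - ' or ' | ' site suffix, then drop spaces and punctuation."""
    core = re.split(r"\s+[-|]\s+", title.strip())[0]
    return re.sub(r"[\W_]+", "", core.lower())


def summary_tokens(summary):
    """Assumed stand-in for word segmentation: the set of character bigrams."""
    text = re.sub(r"[\W_]+", "", summary.lower())
    if len(text) < 2:
        return {text} if text else set()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as dissimilar."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_duplicate(r1, r2, threshold=0.6):
    """Assumed combination rule: same URL, or same title plus similar summaries."""
    if normalize_url(r1["url"]) == normalize_url(r2["url"]):
        return True
    t1, t2 = normalize_title(r1["title"]), normalize_title(r2["title"])
    if t1 and t1 == t2:
        sim = jaccard(summary_tokens(r1["summary"]), summary_tokens(r2["summary"]))
        return sim >= threshold
    return False


def deduplicate(results, threshold=0.6):
    """Keep the first occurrence of each group of duplicate results."""
    kept = []
    for r in results:
        if not any(is_duplicate(r, k, threshold) for k in kept):
            kept.append(r)
    return kept
```

In this sketch, results merged from several member engines would be passed to `deduplicate` as a list of dictionaries in ranking order, so the highest-ranked copy of each page is the one retained. A real system would replace `summary_tokens` with a proper word segmenter and tune the threshold against manually labeled data.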
Source
Journal of Yanshan University (《燕山大学学报》)
CAS
2011, No. 2, pp. 121-123, 161 (4 pages)
Keywords
meta-search engine
web pages
duplicate detection
Chinese word segmentation