Abstract
With the growth of the Internet and the exponential increase in online information, a large number of duplicate web pages have appeared on the network. These duplicates lower the recall and precision of retrieval and reduce retrieval efficiency, so the accuracy of web page deduplication directly affects the quality of a search engine. Based on a description of structured text, this paper proposes an improved MD5-based web page deduplication algorithm and presents its content, characteristics, and design. Experiments indicate that the method is effective in improving both recall and precision.
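The abstract names MD5 as the basis of the deduplication method but does not give the details of the improvement. As a point of reference only, the following is a minimal sketch of plain MD5 fingerprint deduplication, not the paper's improved algorithm. The function names `fingerprint` and `deduplicate`, the whitespace and case normalization, and the URL-to-text dictionary are all illustrative assumptions.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Return the MD5 hex digest of whitespace-normalized, lowercased text.

    The normalization is an assumption made for illustration; the paper's
    structured-text preprocessing is not described in the abstract.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> dict[str, str]:
    """Keep the first page seen for each distinct fingerprint.

    `pages` maps a (hypothetical) URL to the extracted page text.
    """
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for url, text in pages.items():
        digest = fingerprint(text)
        if digest not in seen:
            seen.add(digest)
            unique[url] = text
    return unique


if __name__ == "__main__":
    sample = {
        "http://example.com/a": "Hello  World",
        "http://example.com/b": "hello world\n",
        "http://example.com/c": "A different page",
    }
    for url in deduplicate(sample):
        print(url)
```

A whole-page digest of this kind only catches pages whose normalized text is identical; a single changed character yields a different MD5 value, which is presumably the kind of limitation an improved, structure-aware method is meant to address.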
Source
《实验室研究与探索》
CAS
PKU Core Journal (北大核心)
2013, No. 12, pp. 105-108 (4 pages)
Research and Exploration in Laboratory
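Funding
Soft Science Research Project of the Shanxi Provincial Department of Science and Technology (2013041049-03)
Shanxi Provincial Education Science Planning Project (GH-11178)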
Keywords
structured web pages
MD5
web page deduplication
deduplication algorithm