Abstract
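This paper describes in detail the design and implementation of a P2P-based distributed crawler system for information retrieval. The system judges topical relevance from the context of anchor text and adopts a P2P distributed architecture, exploiting its ability to add new nodes dynamically in order to scale the system and raise its overall throughput, so that it can meet the growing large-volume retrieval needs of current and future users. Experimental results show that judging the topical relevance of link contexts against a user-specified topic can guide the crawler's crawl path and retrieve topic-relevant information effectively.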
Topical (focused) crawlers that adapt to a specific theme and to personalized search are needed to keep pace with the rapid growth of web information and to deliver more accurate and more comprehensive information and services in the shortest time. This paper proposes the design and implementation of a distributed web crawler. It is based on a P2P distributed architecture and exploits the ability of P2P systems to add new nodes dynamically in order to increase scale and improve overall capacity. The experiments showed that the system can efficiently provide users with relevant files or web pages according to the topics they define.
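The abstract states that link context is scored against a user-given topic to steer the crawl, but it gives no formula. As a minimal illustrative sketch only, the following assumes a plain term-frequency cosine similarity and a fixed threshold; the function names (`tokenize`, `relevance`, `rank_links`), the threshold value, and the example URLs are hypothetical and are not taken from the paper.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(topic, link_context):
    """Cosine similarity between term-frequency vectors (an assumed measure)."""
    a, b = Counter(tokenize(topic)), Counter(tokenize(link_context))
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def rank_links(topic, links, threshold=0.1):
    """Return URLs whose context scores at least `threshold`, best first.

    `links` is an iterable of (url, anchor_text_with_surrounding_text) pairs.
    The threshold is a hypothetical tuning parameter.
    """
    heap = []
    for url, context in links:
        score = relevance(topic, context)
        if score >= threshold:
            # Negate the score so the min-heap pops the most relevant link first.
            heapq.heappush(heap, (-score, url))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    candidates = [
        ("http://example.com/a", "distributed web crawler design"),
        ("http://example.com/b", "cooking recipes for pasta"),
        ("http://example.com/c", "web crawler"),
    ]
    for url in rank_links("web crawler", candidates):
        print(url)
```

In a P2P deployment each node could run such a ranking over the links it extracts and hand the surviving URLs to whichever peer is responsible for them; how the paper actually partitions URLs among peers is not stated in the abstract.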
Source
《情报学报》
CSSCI
Peking University Core Journal (北大核心)
2010, Issue 3, pp. 402-407 (6 pages)
Journal of the China Society for Scientific and Technical Information
Keywords
Web crawler
peer-to-peer
distributed computing
information retrieval
topical crawler