Design and Implementation of a P2P-Based Distributed Topic Crawler System (cited by: 6)

Design and Implementation of Distributed Topic Crawler Based on P2P for Image Retrieval
Abstract  To keep pace with the rapid growth of web information, topical (focused) crawlers that adapt to a specific theme and to personalized search are needed to deliver more accurate and more comprehensive information in less time. This paper describes in detail the design and implementation of a P2P-based distributed crawler system for information retrieval. The system judges topic relevance from anchor-text context and adopts a P2P-style distributed architecture, using the ability to add new nodes dynamically to scale the system and raise its overall throughput, so that it can meet users' growing demand for large-volume retrieval now and in the future. Experimental results show that judging the topic relevance of link context against a user-given topic can guide the crawler's path, so that the system effectively retrieves topic-relevant information and provides users with relevant files or web pages for the topic(s) they define.
Source  Journal of the China Society for Scientific and Technical Information (《情报学报》), CSSCI, PKU Core, 2010, No. 3, pp. 402-407 (6 pages)
Keywords  Web crawler; peer-to-peer network; distributed computing; information retrieval; topical crawler
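
The abstract says that the relevance of a link's anchor-text context to the user's topic guides the crawl path, but it does not give the scoring formula. The following is a minimal sketch of that idea, assuming a bag-of-words cosine similarity as the relevance measure and a priority queue as the crawl frontier. The names `relevance`, `Frontier`, and `threshold` are hypothetical and are not taken from the paper.

```python
import heapq
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(topic_terms, link_context):
    """Cosine similarity between topic terms and a link's anchor-text context.

    Assumption: bag-of-words cosine similarity stands in for the paper's
    relevance measure, which the abstract does not specify.
    """
    topic = Counter(tokenize(topic_terms))
    context = Counter(tokenize(link_context))
    dot = sum(topic[t] * context[t] for t in topic)
    norm = sqrt(sum(v * v for v in topic.values())) * sqrt(
        sum(v * v for v in context.values())
    )
    return dot / norm if norm else 0.0


class Frontier:
    """Priority queue that yields the most topic-relevant link first."""

    def __init__(self, topic_terms, threshold=0.1):
        self.topic_terms = topic_terms
        self.threshold = threshold  # hypothetical cut-off for off-topic links
        self._heap = []
        self._seen = set()

    def add(self, url, link_context):
        """Queue a link if it is new and its context is relevant enough."""
        score = relevance(self.topic_terms, link_context)
        if url in self._seen or score < self.threshold:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        """Return the (url, score) pair with the highest relevance."""
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

A crawler node would call `add` for every link it extracts, passing the anchor text together with its surrounding words, and then fetch pages in the order returned by `pop`. Links whose context scores below the threshold are never queued, which is one simple way to keep the crawl on topic.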