Research and Implementation of a Distributed Multi-topic Web Crawler System (cited by: 20)
Abstract: This paper proposes a distributed Web crawler architecture based on data extractors. The architecture adopts a multi-topic strategy based on classification labels, so that multiple topics can coexist adaptively within a single crawler system. It introduces a two-level weighted task partitioning algorithm that handles target-guided, load-balanced URL assignment across agents and improves the system's scalability. It also gives an improved Trie-based URL storage strategy that efficiently supports URL lookup, insertion, and duplicate detection.
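Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, PKU Core; 2009, No. 19, pp. 13-16, 19 (5 pages)
Funding: National "863" Program of China, project "Research on Converged Online Tourism Service Business" (2008AA01A307)
Keywords: Web crawler; multi-topic; distributed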
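
The abstract mentions Trie-based URL storage that supports lookup, insertion, and duplicate detection, but this record does not describe the paper's improved structure. The following is only a minimal sketch of a plain character-level trie, to illustrate how the three operations can share one traversal. The class name UrlTrie and its methods are hypothetical and are not taken from the paper.

```python
class UrlTrie:
    """Plain character-level trie for URLs (illustrative; not the paper's improved variant)."""

    _END = ""  # sentinel key; cannot collide with any single-character key

    def __init__(self):
        self._root = {}
        self._size = 0

    def insert(self, url):
        """Store url. Return True if it was new, False if it was a duplicate."""
        node = self._root
        for ch in url:
            # Walk down one character at a time, creating nodes as needed.
            node = node.setdefault(ch, {})
        if self._END in node:
            return False  # duplicate detected during the same walk
        node[self._END] = True
        self._size += 1
        return True

    def __contains__(self, url):
        """Return True only if the exact url was inserted (prefixes do not count)."""
        node = self._root
        for ch in url:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node

    def __len__(self):
        return self._size


if __name__ == "__main__":
    seen = UrlTrie()
    for u in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
        status = "new" if seen.insert(u) else "duplicate"
        print(status, u)
```

Because URLs from one site share long prefixes, a trie stores the common part once, and the cost of each operation depends on the URL length rather than on the number of stored URLs.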
