Abstract
This paper proposes a distributed Web crawler architecture based on data trappers. It adopts a multi-topic scheme based on classification labeling, so that a single crawler system can accommodate different topics adaptively. It introduces a two-tiered weighted task partition algorithm that realizes target-guided URL assignment according to the agents' load, improving the system's scalability. It also presents an improved Trie-tree-based URL storage strategy, which efficiently supports URL lookup, insertion, and duplicate detection.
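The Trie-based URL storage mentioned in the abstract can be sketched roughly as follows. This is a minimal character-level Trie for illustration only, not the authors' improved structure; all class and method names here are hypothetical:

```python
class TrieNode:
    """A node in a character-level Trie used to store URLs."""
    __slots__ = ("children", "is_end")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # True if a stored URL terminates at this node


class UrlTrie:
    """Minimal Trie supporting URL insertion, lookup, and duplicate detection."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, url):
        """Insert a URL; return False if it was already present (a duplicate)."""
        node = self.root
        for ch in url:
            node = node.children.setdefault(ch, TrieNode())
        if node.is_end:
            return False  # repetition detected
        node.is_end = True
        return True

    def contains(self, url):
        """Return True if the exact URL has been stored."""
        node = self.root
        for ch in url:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end
```

A Trie shares common prefixes such as `http://example.com/` across all stored URLs, which is the space and lookup advantage that typically motivates Trie-based URL stores in crawlers.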
Source
Computer Engineering (《计算机工程》), 2009, No. 19, pp. 13-16, 19 (5 pages)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
Funding
Supported by the National "863" Program of China project "Research on Converged Online Tourism Services" (2008AA01A307)
Keywords
Web crawler; multi-topic; distributed