Abstract
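The efficiency of URL storage and retrieval is key to building a large-scale distributed information-gathering system, because it determines how efficiently the system collects Web documents. This paper analyzes URL storage and retrieval performance quantitatively and derives the speed targets that URL storage and URL retrieval must each meet. On this basis, it proposes two prototypes for URL storage and retrieval: storage and retrieval on a centralized URL server, and distributed URL storage and retrieval. The two prototype systems are compared in terms of retrieval speed, cost-performance ratio, scalability, and reliability. In practical applications, depending on the optimization objective, one can choose the corresponding ...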
With the scale of the World Wide Web increasing exponentially, the efficiency of URL storage and indexing is key to improving the performance of a distributed crawler system. Based on a quantitative analysis of the performance metrics of URL indexing and storage, this paper presents two URL storage and indexing architectures for distributed crawler systems: centralized URL server storage and indexing, and distributed URL storage and indexing. The advantages and disadvantages of each are discussed. The distributed URL system was implemented in our distributed crawler system, and it works efficiently.
Source
《上海交通大学学报》
EI
CAS
CSCD
Peking University Core Journals
2003, No. 3, pp. 454-457 (4 pages)
Journal of Shanghai Jiaotong University
Funding
Key Basic Research Project of the Shanghai Municipal Science and Technology Commission (02DJ14045)