Design and Implementation of the Topic Information Crawler for Intelligence Acquisition (面向情报获取的主题采集工具设计与实现)

Cited by: 2

Abstract: Topic-focused collection from the Internet is an important means of intelligence acquisition. To cope with the explosive growth of Internet information resources, a topic information crawler is designed and implemented. The crawler comprises several stages, including collection preparation, URL analysis and extraction, template learning, and main-text extraction. The URL analysis and extraction stage uses a URL filtering method based on link types to select the URLs of content pages. The template learning and main-text extraction stages use a node comparison method based on the DOM tree to build templates and extract the main text. Experimental results show that the crawler achieves relatively high collection accuracy and can meet current needs for intelligence information collection.
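The abstract names DOM-tree node comparison but does not describe the procedure, so the following is only a minimal sketch of one common way such a comparison can work, not the authors' method. It assumes that two pages from the same site share a template, and that text nodes with the same tag path and the same text in both pages belong to that template. All names (`PathTextCollector`, `learn_template`, `extract_body`) and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    SKIP = {"script", "style"}                        # never treated as body text
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.SKIP & set(self.stack)):
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_template(page_a, page_b):
    """Assumption: nodes identical in both pages are template, not body text."""
    return set(text_nodes(page_a)) & set(text_nodes(page_b))


def extract_body(page, template):
    """Keep only the text nodes that the learned template does not cover."""
    return [text for path, text in text_nodes(page) if (path, text) not in template]


# Hypothetical sample pages sharing a navigation bar and a footer.
PAGE_A = ("<html><body><div>Home | News</div>"
          "<div><p>Steel output rose in May.</p></div>"
          "<div>Copyright</div></body></html>")
PAGE_B = ("<html><body><div>Home | News</div>"
          "<div><p>Ore prices fell.</p></div>"
          "<div>Copyright</div></body></html>")

template = learn_template(PAGE_A, PAGE_B)
print(extract_body(PAGE_A, template))  # ['Steel output rose in May.']
```

A real crawler would likely compare more than two pages and use node attributes or positions rather than exact text equality. This sketch only illustrates the general idea of separating template nodes from content nodes.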

Authors: Gu Jun (谷俊), Weng Jia (翁佳), Xu Xin (许鑫)

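Affiliations: Shanghai Baoshan Iron & Steel Co., Ltd.; Library, University of Shanghai for Science and Technology; Department of Information Science, Business School, East China Normal University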

Source: Library and Information Service (《图书情报工作》), CSSCI, Peking University Core Journal; 2014, No. 20, pp. 91-99 (9 pages)

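Funding: One of the research outputs of the Shanghai Science and Technology Development Fund soft science research project "Research on Domain-Ontology-Based Intelligence Processing and Analysis Methods in the Big Data Environment: The Case of the Steel Industry" (Project No. 14692107100)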

Keywords: Web crawler; topic information acquisition; link filtering; DOM tree

Classification: TP393.092 [Automation and Computer Technology: Computer Application Technology]
