The Web Forum Crawling Technology and System Implementation (面向Web论坛的网络信息获取技术及系统实现)

Cited by: 7
Abstract: Web crawling is an important means of gathering information from the network, and gathering information from Web forums poses a new challenge for crawler technology. Building on an analysis of forum-oriented information acquisition techniques, this paper designs and implements a focused (topic-driven) crawler system for Web forums. Based on how forums organize their information, it proposes an information search technique built on a traversal strategy. Based on the distribution of body text and the characteristics of forums themselves, it proposes a body-text extraction technique that combines the DOM with a block-partitioning algorithm. Experimental results show that the proposed traversal strategy is more efficient than traditional crawler traversal strategies and collects more pages with high topic relevance, and that, after noise cleaning, page body text is extracted effectively, which improves the precision of information collection.
Authors: Peng Dong (彭冬), Cai Wandong (蔡皖东)
Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journal list, 2011, No. 1, pp. 157-160 (4 pages)
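Funding: National 863 Program of China (2009AA01Z424); Key Supported Undergraduate Graduation Project of Northwestern Polytechnical University, class of 2009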
Keywords: web spider; web forum; body-text extraction (context extracting); subject relevance
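
The abstract names a body-text extraction step that combines the DOM with block partitioning, but this record does not give the algorithm itself. As a rough illustration only, the sketch below shows one common way such a step can work: split the page into blocks at block-level tags, then drop blocks that are short or mostly link text (navigation bars, pagers, signatures). The class `BlockSplitter`, the function `extract_main_text`, the tag set, and both thresholds are assumptions made for this example and are not taken from the paper.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a block; not taken from the paper.
BLOCK_TAGS = {"div", "p", "td", "li", "blockquote", "pre", "table", "tr", "ul", "ol"}
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # (text, non-space chars, non-space chars inside <a>)
        self._parts = []
        self._chars = 0
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block; whitespace-only blocks are discarded.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._chars, self._link_chars))
        self._parts, self._chars, self._link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        n = len("".join(data.split()))  # count non-whitespace characters only
        self._parts.append(data)
        self._chars += n
        if self._in_link:
            self._link_chars += n


def extract_main_text(html, min_chars=20, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text outside a block-level tag
    return [
        text
        for text, chars, link_chars in parser.blocks
        if chars >= min_chars and link_chars / chars <= max_link_density
    ]
```

Calling `extract_main_text(page_html)` returns the surviving blocks in document order. Link density is a deliberately simple noise signal; a forum-specific extractor would likely also use the repeated per-post DOM structure, which this sketch ignores.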

