
Universal Crawling Algorithm for Microblogging Data (微博数据通用抓取算法)

Cited by: 5
Abstract: Conventional web crawlers and algorithms that collect data through the microblog API struggle to meet the microblog data needs of public opinion monitoring systems. To address this, the paper proposes an algorithm that simulates a browser login to the microblog site and scrapes data from web pages, which makes it easy to obtain all the data on any microblog user's pages. A user network is built from the relationships between microblog users, and new users are discovered through this network. To obtain high-quality data, a complete mathematical model is established that computes each user's influence from factors such as post count, posting frequency, follower count, repost count, and comment count. A priority queue is built with influence as the main factor, so that users with greater influence are crawled more frequently, while a time interval is also computed so that data from inactive users is still collected. Experimental results show that the algorithm is highly general, requires no manual intervention, and obtains high-quality information quickly.
Source: Computer Engineering (《计算机工程》), indexed in CAS and CSCD, 2014, No. 5, pp. 12-16, 20 (6 pages in total)
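Funding: Hunan Provincial Natural Science Foundation project (12JJ3066); Hunan Provincial Fund for Industrialization Cultivation of University Scientific and Technological Achievements (11CY018); Hunan Provincial Key Discipline Fund project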
Keywords: microblogging data; simulated login; user network; user influence; Internet public opinion; priority queue