
Web Pages Content Extraction Based on Visual Hot Zone (cited by: 1)
Abstract: This paper studies web page extraction and proposes a new method for extracting the main content of a web page. The method uses the page's layout features together with a visual hot zone to locate the main content. First, a region of the page is selected as the visual hot zone, and candidate content blocks are obtained through the Document Object Model (DOM). A significance function over the candidate content blocks is then defined and used to determine the page's main content. Experimental results indicate that the proposed method performs well.
Author: Shao Jun (邵俊)
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, Peking University Core Journals, 2012, No. 6, pp. 199-201 (3 pages)
Keywords: layout features; visual hot zone; Document Object Model; candidate content blocks; significance function
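
The pipeline outlined in the abstract (choose a hot zone, collect candidate blocks from the DOM, score each block with a significance function) can be illustrated with a minimal sketch. The abstract gives neither the hot-zone coordinates nor the significance function, so everything below is an assumption made for illustration: the `Block` and `Rect` types, the rule "text length weighted by overlap with the hot zone", and the sample page geometry are hypothetical and are not the paper's definitions. Block geometry is assumed to come from a rendering engine.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A candidate content block: a DOM node's rendered box plus its text."""
    tag: str
    x: float
    y: float
    width: float
    height: float
    text: str


@dataclass
class Rect:
    """An axis-aligned rectangle, used here for the visual hot zone."""
    x: float
    y: float
    width: float
    height: float


def overlap_ratio(block: Block, zone: Rect) -> float:
    """Fraction of the block's area that lies inside the hot zone."""
    dx = min(block.x + block.width, zone.x + zone.width) - max(block.x, zone.x)
    dy = min(block.y + block.height, zone.y + zone.height) - max(block.y, zone.y)
    area = block.width * block.height
    if dx <= 0 or dy <= 0 or area <= 0:
        return 0.0
    return (dx * dy) / area


def significance(block: Block, zone: Rect) -> float:
    """Hypothetical score (not the paper's): text length times overlap."""
    return overlap_ratio(block, zone) * len(block.text.strip())


def extract_content(blocks, zone):
    """Return the text of the highest-scoring block, or '' if none scores."""
    best = max(blocks, key=lambda b: significance(b, zone), default=None)
    if best is None or significance(best, zone) == 0:
        return ""
    return best.text.strip()


# Assumed layout: a 1000 x 2000 px page whose hot zone is the central region.
HOT_ZONE = Rect(x=200, y=200, width=600, height=1200)
BLOCKS = [
    Block("div", 0, 0, 1000, 100, "Home | News | Sports | Contact"),
    Block("div", 0, 100, 180, 1500, "Related links: A, B, C, D, E, F, G"),
    Block("div", 200, 150, 600, 1000, "Main article text. " * 20),
    Block("div", 0, 1900, 1000, 100, "Copyright notice"),
]

if __name__ == "__main__":
    print(extract_content(BLOCKS, HOT_ZONE)[:40])
```

In this toy layout the navigation bar, sidebar, and footer lie entirely outside the assumed hot zone and score zero, so the central block is selected. A real significance function would likely combine more layout features than this single overlap term.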
