
Study on Web Data Extraction Based on XML (基于XML的Web数据抽取研究)

Cited by: 5
Abstract: This paper introduces three common methods for Web data extraction: directly parsing HTML documents, the XML-based method (also known as analyzing the hierarchical structure of HTML), and the conceptual-modeling-based method. It focuses on the XML-based method. The basic approach passes the original HTML document through a filter that checks and corrects its syntax, producing a well-formed, XML-based XHTML document, which can then be processed with XML tools. The work implements the preprocessing step that converts unstructured HTML documents into structured XML documents, which creates favorable conditions for applying traditional data extraction methods in Web mining.
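The pipeline described in the abstract (filter the HTML into well-formed XHTML, then hand it to XML tools) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `XhtmlFilter`, `html_to_xhtml`, `extract_rows`, and the sample page are hypothetical, and the filter only covers a few common repairs (unquoted attributes, unclosed tags, void elements, stray closing tags). It assumes the input has a single root element.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
# Elements whose end tag is often omitted; a new one closes an open sibling.
IMPLIED_CLOSE = {"li", "p", "td", "th", "tr", "option", "dt", "dd"}


class XhtmlFilter(HTMLParser):
    """Hypothetical filter: re-emits tag soup as well-formed XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the output document
        self.stack = []  # names of currently open elements

    def handle_starttag(self, tag, attrs):
        # Close an unclosed sibling such as <td>a<td>b.
        if tag in IMPLIED_CLOSE and self.stack and self.stack[-1] == tag:
            self.out.append(f"</{self.stack.pop()}>")
        # Quote every value, expand bare attributes, drop duplicates.
        unique = {}
        for name, value in attrs:
            unique.setdefault(name, name if value is None else value)
        attr_text = "".join(f" {n}={quoteattr(v)}" for n, v in unique.items())
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape &, <, >


def html_to_xhtml(html: str) -> str:
    """Run the filter and close any elements still open at the end."""
    flt = XhtmlFilter()
    flt.feed(html)
    flt.close()
    while flt.stack:
        flt.out.append(f"</{flt.stack.pop()}>")
    return "".join(flt.out)


def extract_rows(xhtml: str):
    """Use a standard XML tool (ElementTree) to pull out table cells."""
    root = ET.fromstring(xhtml)
    return [[td.text for td in tr.findall("td")] for tr in root.iter("tr")]


# Hypothetical malformed page: unquoted attribute, unclosed <td>, <p>, <br>.
SAMPLE = """<html><body>
<h1>Price list</h1>
<table border=1>
<tr><td>Widget<td>3.50</tr>
<tr><td>Gadget &amp; case<td>12.00</tr>
</table>
<p>Updated daily<br>
</body></html>"""

print(extract_rows(html_to_xhtml(SAMPLE)))
```

Run on the sample page, the script prints `[['Widget', '3.50'], ['Gadget & case', '12.00']]`. The original markup would be rejected by an XML parser, which is why the filtering step has to come first. A production filter would need far more repair rules than this sketch includes.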
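Authors: 吕锋 (Lü Feng), 余丽 (Yu Li)
Affiliation: Wuhan University of Technology (武汉理工大学)
Source: Computer Technology and Development (《计算机技术与发展》), 2007, No. 6, pp. 53-55 (3 pages)
Funding: Open Research Fund of a Key Laboratory of the Ministry of Education (TKLJ0203)
Keywords: XML; Web; data extraction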