Abstract
This paper introduces three common methods for Web data extraction: directly parsing the HTML document, the XML-based method (also called the method of analysing the hierarchical structure of HTML), and the conceptual-model-based approach. It focuses on the XML-based method. The basic procedure is to pass the original HTML document through a filter that checks and corrects its syntactic structure, producing a well-formed XHTML document, and then to process that document with XML tools. This implements a preprocessing step that converts unstructured HTML documents into structured XML documents, which creates favourable conditions for applying traditional data extraction methods in Web mining.
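The filter-then-query pipeline described in the abstract can be illustrated roughly as follows. This is only a minimal sketch using the Python standard library, not the filter the authors built: the names `XhtmlFilter`, `to_xhtml`, `VOID_TAGS`, `AUTO_CLOSE` and the sample markup are all hypothetical, and the repair rules cover only a few common cases (unclosed tags, unquoted attributes, upper-case names). It assumes the input has a single root element.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML (assumed subset).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}
# Elements whose open instance is implicitly closed by a new sibling
# of the same name (simplified rule, chosen for this sketch).
AUTO_CLOSE = {"li", "p", "option", "dt", "dd"}


class XhtmlFilter(HTMLParser):
    """Hypothetical filter that re-emits tag soup as well-formed XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # names of currently open elements

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        if tag in AUTO_CLOSE and self.stack and self.stack[-1] == tag:
            self.out.append(f"</{self.stack.pop()}>")
        unique = {}
        for name, value in attrs:
            # Keep the first duplicate; give bare attributes a value.
            unique.setdefault(name, name if value is None else value)
        attr_text = "".join(f" {n}={quoteattr(v)}" for n, v in unique.items())
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything left open inside the element being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        # Close whatever is still open at end of input.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")


def to_xhtml(html: str) -> str:
    """Run the filter over an HTML string and return the repaired markup."""
    parser = XhtmlFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ("<HTML><BODY><UL><LI class=item>Widget<LI class=item>Gadget &amp; Co"
          "</UL><BR><P>Total: 2</BODY>")
xhtml = to_xhtml(sample)

# Once the markup is well-formed, ordinary XML tools apply.
root = ET.fromstring(xhtml)
items = [li.text for li in root.findall(".//li[@class='item']")]
# items == ['Widget', 'Gadget & Co']
```

In practice a dedicated tidying tool would handle far more cases than this sketch; the point is only that extraction reduces to simple path queries once the document parses as XML.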
Source
《计算机技术与发展》
2007, No. 6, pp. 53-55 (3 pages)
Computer Technology and Development
Funding
Open Research Fund of the Key Laboratory of the Ministry of Education (TKLJ0203)