Abstract
A large amount of information on the Web is presented in HTML tables. Because HTML does not describe the content of the data, machines cannot readily understand or query such tables. In this paper, we normalize HTML tables by inserting redundant cells according to the tables' HTML attributes. For HTML tables whose headings are not marked, we recognize the headings using a quantitative measure of formatting information. On this basis, by capturing the semantic hierarchy of attribute-value pairs formed by the headings and their corresponding data cells in the normalized table, we present a new approach for automatically converting HTML table data into XML documents.
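The normalization and conversion steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that "inserting redundant cells" means copying each spanned cell into every grid slot covered by its `rowspan`/`colspan`, and it takes the number of heading rows as a parameter (`header_rows`) instead of recognizing headings from formatting information. The names `CellCollector`, `normalize`, `to_tag`, and `to_xml`, and the `<table>`/`<record>` element names, are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text and span attributes of each cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = {
                "text": "",
                "rowspan": int(a.get("rowspan") or 1),
                "colspan": int(a.get("colspan") or 1),
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def normalize(html):
    """Expand rowspan/colspan so every row has one value per column."""
    parser = CellCollector()
    parser.feed(html)
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for cell in row:
            while (r, c) in grid:  # slot already filled by an earlier rowspan
                c += 1
            for dr in range(cell["rowspan"]):
                for dc in range(cell["colspan"]):
                    grid[(r + dr, c + dc)] = cell["text"]  # redundant copies
            c += cell["colspan"]
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def to_tag(name):
    """Turn heading text into a usable XML element name (assumed convention)."""
    tag = re.sub(r"\W+", "_", name.strip())
    return tag if tag and not tag[0].isdigit() else "col_" + tag


def to_xml(grid, header_rows=1):
    """Nest each data value under the path of headings above its column."""
    root = ET.Element("table")
    headers = grid[:header_rows]
    for data_row in grid[header_rows:]:
        record = ET.SubElement(root, "record")
        for c, value in enumerate(data_row):
            path = []
            for h in headers:
                if not path or path[-1] != h[c]:  # drop repeated copies
                    path.append(h[c])
            node = record
            for name in path:
                child = node.find(to_tag(name))
                if child is None:
                    child = ET.SubElement(node, to_tag(name))
                node = child
            node.text = value
    return ET.tostring(root, encoding="unicode")
```

For a table whose first heading row holds "Name" (spanning two rows) and "Score" (spanning two columns), and whose second heading row holds "Math" and "Art", `normalize` yields the heading rows `["Name", "Score", "Score"]` and `["Name", "Math", "Art"]`. Calling `to_xml(grid, header_rows=2)` then nests `Math` and `Art` under a single `Score` element in each record.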
Source
Computer Engineering and Applications (《计算机工程与应用》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2007, No. 2, pp. 190-192 (3 pages)
Funding
Natural Science Foundation of Hubei Province (2005ABA238)
National Natural Science Foundation of China (60273072)