
Automatic Conversion of Web Table Data into XML (cited by: 5)

Automatic conversion of HTML tables into XML
Abstract: A large amount of information on the Web is presented in HTML tables. Because HTML does not describe the content of the data, such tables cannot be readily understood or queried by machines. This paper normalizes HTML tables by inserting redundant cells according to the tables' HTML attributes. For HTML tables without marked headings, it recognizes the headings using a quantitative measure of formatting information. On this basis, it presents a new approach that automatically converts HTML table data into XML documents by capturing the semantic hierarchy of attribute-value pairs formed by the headings and their corresponding data cells in the normalized table.
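The normalization and conversion steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "inserting redundant cells" means copying a cell spanning several rows or columns into every slot it covers, that headings are marked with `<th>` (the paper's formatting-based heading recognition for unmarked tables is not reproduced), and that the table is not nested. The names `TableParser`, `normalize`, and `to_xml`, and the output element names `table` and `row`, are hypothetical.

```python
from html.parser import HTMLParser
import re
import xml.etree.ElementTree as ET


class TableParser(HTMLParser):
    """Collect the cells of one non-nested table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = {
                "text": "",
                "rowspan": int(a.get("rowspan") or 1),
                "colspan": int(a.get("colspan") or 1),
                "header": tag == "th",
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def normalize(rows):
    """Expand rowspan/colspan by copying each cell into every slot it covers."""
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            while (r, c) in grid:  # skip slots filled by an earlier rowspan
                c += 1
            for dr in range(cell["rowspan"]):
                for dc in range(cell["colspan"]):
                    grid[(r + dr, c + dc)] = cell["text"]
            c += cell["colspan"]
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def to_xml(html):
    """Turn each data row into a <row> whose children follow the heading paths."""
    parser = TableParser()
    parser.feed(html)
    # Heading rows: leading rows made only of <th> cells (an assumption).
    n_head = 0
    for row in parser.rows:
        if row and all(cell["header"] for cell in row):
            n_head += 1
        else:
            break
    grid = normalize(parser.rows)
    root = ET.Element("table")
    for data_row in grid[n_head:]:
        rec = ET.SubElement(root, "row")
        for c, value in enumerate(data_row):
            # Heading labels above this column, top to bottom, without repeats.
            labels = []
            for h in range(n_head):
                if grid[h][c] and grid[h][c] not in labels:
                    labels.append(grid[h][c])
            node = rec
            for label in labels or [f"col{c}"]:
                tag = re.sub(r"\W+", "_", label)
                if not tag[0].isalpha():
                    tag = "f_" + tag  # keep the element name XML-friendly
                child = node.find(tag)
                node = child if child is not None else ET.SubElement(node, tag)
            node.text = value
    return ET.tostring(root, encoding="unicode")


html = (
    '<table><tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>'
    "<tr><th>Math</th><th>Art</th></tr>"
    "<tr><td>Ann</td><td>90</td><td>85</td></tr></table>"
)
print(to_xml(html))
# <table><row><Name>Ann</Name><Score><Math>90</Math><Art>85</Art></Score></row></table>
```

Once the spanning cells are copied, every data cell has a complete column of heading labels above it, so each attribute-value pair can be read directly from the grid. Nesting the labels is one simple way to express a heading hierarchy in XML; the paper's own semantic-hierarchy construction may differ.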
Authors: Zhang Rui (张瑞), Li Shijun (李石君)
Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 2, pp. 190-192 (3 pages)
Funding: Natural Science Foundation of Hubei Province (2005ABA238); National Natural Science Foundation of China (60273072).
Keywords: HTML table; information extraction; Web; XML


