Automatic conversion of HTML tables into XML (网上表格数据到XML的自动转换)

Cited by: 5


Abstract: A large amount of information on the Web is presented in HTML tables. Because HTML does not describe the content of the data, machines cannot readily understand or query this information. This paper normalizes HTML tables by inserting redundant cells according to the tables' HTML attributes. For HTML tables whose headings are not marked, the headings are recognized using a quantitative measure of formatting information. On this basis, the paper proposes a new method that captures the semantic hierarchy between table attributes and their corresponding values in the normalized table, and uses it to convert HTML table data into XML documents automatically.
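The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that the "HTML attributes" used for normalization are `rowspan` and `colspan`, that the first row is the heading row (the paper's formatting-based heading recognition is not reproduced), and that each body row maps to a flat set of attribute-value pairs rather than a full semantic hierarchy. The names `normalize`, `to_xml`, `root_tag`, and `row_tag` are hypothetical.

```python
from html.parser import HTMLParser
import re
import xml.etree.ElementTree as ET


class _CellCollector(HTMLParser):
    """Collect (text, rowspan, colspan) for every td/th cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell tuples per <tr>
        self._cell = None   # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            # Assumption: missing or empty span attributes mean a span of 1.
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rs, cs = self._cell
            self.rows[-1].append(("".join(parts).strip(), rs, cs))
            self._cell = None


def normalize(html):
    """Expand spanning cells into redundant copies so the table is a full grid."""
    parser = _CellCollector()
    parser.feed(html)
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rs, cs in row:
            while (r, c) in grid:          # slot already filled by a rowspan above
                c += 1
            for dr in range(rs):
                for dc in range(cs):
                    grid[(r + dr, c + dc)] = text   # the inserted redundant cells
            c += cs
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def to_xml(grid, root_tag="table", row_tag="row"):
    """Pair each heading (assumed to be row 0) with the values below it."""
    header, *body = grid
    root = ET.Element(root_tag)
    for values in body:
        row = ET.SubElement(root, row_tag)
        for name, value in zip(header, values):
            # Turn the heading text into a usable element name.
            tag = re.sub(r"\W+", "_", name).strip("_") or "field"
            if tag[0].isdigit():
                tag = "_" + tag
            ET.SubElement(row, tag).text = value
    return ET.tostring(root, encoding="unicode")


sample = """
<table>
<tr><th>Name</th><th>Term</th><th>Score</th></tr>
<tr><td rowspan="2">Ann</td><td>Fall</td><td>90</td></tr>
<tr><td>Spring</td><td>85</td></tr>
</table>
"""
sample_grid = normalize(sample)
sample_xml = to_xml(sample_grid)
```

In this sample, the cell that spans two rows is copied into both rows during normalization, so every body row carries a complete set of values before the heading-value pairs are written out as XML elements.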

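Authors: Zhang Rui (张瑞), Li Shijun (李石君)

Affiliations: School of Computer Science, Wuhan University; Staff University of Xinwen Mining Group

Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 2, pp. 190-192 (3 pages)

Funding: Natural Science Foundation of Hubei Province (2005ABA238); National Natural Science Foundation of China (60273072).

Keywords: HTML table; information extraction; Web; XML

Classification: TP311.135 [Automation and Computer Technology: Computer Software and Theory]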


