Abstract
Dynamic Web pages are pages generated automatically by programs, and most Web pages are estimated to exist in this form. This paper proposes an information extraction method for dynamic Web pages based on similar records induction (SRI). The method uses a string edit distance algorithm and a DOM tree alignment algorithm to induce a wrapper tree for the record items. In extraction experiments on various types of Web pages, the method achieves a recall of 98.11% and a precision of 96.90%.
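As an illustration of the string edit distance step named in the abstract, the following is a minimal sketch and not the paper's implementation. It assumes each candidate record is represented as a flat sequence of HTML tag names; the function names `edit_distance` and `similarity`, the normalization by the longer sequence, and the sample records are all hypothetical choices made for this example.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two tag-name sequences (unit costs)."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty sequence
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]; normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Hypothetical tag sequences for three candidate records.
record_a = ["tr", "td", "a", "td", "span"]
record_b = ["tr", "td", "a", "td", "b", "span"]
record_c = ["div", "ul", "li"]

print(similarity(record_a, record_b))  # one insertion apart, so high similarity
print(similarity(record_a, record_c))  # no shared tags, so similarity is 0.0
```

Under these assumptions, records whose similarity exceeds some chosen threshold could be grouped as similar records before their DOM subtrees are aligned into a wrapper; the threshold and the grouping strategy are not given in the abstract.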
Source
Journal of Chongqing Institute of Technology (Natural Science Edition) (《重庆工学院学报(自然科学版)》)
2009, No. 10, pp. 87-93 (7 pages)
Funding
National Natural Science Foundation of China (Grant Nos. 60873153, 60803061)
Keywords
dynamic Web page
information extraction
wrapper
DOM tree