Abstract
Data extraction often uses regular expressions (REs) to describe data sources. To present such a description visually, an RE must be converted into a parse tree. However, existing methods for constructing RE parse trees rely on rewriting the expression, which destroys the inner structure of the data objects, so they are not suitable for data extraction. This paper proposes an algorithm that constructs RE parse trees without rewriting. Experiments show that the algorithm outperforms existing RE parse tree construction algorithms in time and space performance as well as in practicality.
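As an illustration of what building an RE parse tree "without rewriting" can mean, the following is a minimal sketch, not the algorithm from the paper (which this abstract does not specify). It assumes a small RE syntax with single-character symbols, grouping, `|`, `*`, `+` and `?`, and the names `Node`, `parse` and `show` are hypothetical. Each operator is kept as its own node exactly as written, so `(ab)+` stays a `plus` node over the group rather than being rewritten to `(ab)(ab)*`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                 # "sym", "cat", "alt", "star", "plus" or "opt"
    value: str = ""           # the literal symbol, used only when kind == "sym"
    children: List["Node"] = field(default_factory=list)


def parse(regex: str) -> Node:
    """Recursive-descent parse that keeps every operator as written (assumed syntax)."""
    pos = 0

    def peek() -> str:
        return regex[pos] if pos < len(regex) else ""

    def alternation() -> Node:
        nonlocal pos
        branches = [concatenation()]
        while peek() == "|":
            pos += 1
            branches.append(concatenation())
        return branches[0] if len(branches) == 1 else Node("alt", children=branches)

    def concatenation() -> Node:
        parts = []
        while peek() and peek() not in "|)":
            parts.append(repetition())
        if not parts:
            raise ValueError(f"empty expression at position {pos}")
        return parts[0] if len(parts) == 1 else Node("cat", children=parts)

    def repetition() -> Node:
        nonlocal pos
        node = atom()
        postfix = {"*": "star", "+": "plus", "?": "opt"}
        while peek() in postfix:
            # Wrap the operand in a node for the operator itself; no rewriting.
            node = Node(postfix[peek()], children=[node])
            pos += 1
        return node

    def atom() -> Node:
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            node = alternation()
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
            return node
        if ch in "*+?":
            raise ValueError(f"operator {ch!r} has no operand at position {pos}")
        pos += 1
        return Node("sym", value=ch)

    tree = alternation()
    if pos != len(regex):
        raise ValueError(f"unexpected {regex[pos]!r} at position {pos}")
    return tree


def show(node: Node) -> str:
    """Render the tree in prefix form, e.g. cat(plus(cat(a, b)), opt(c))."""
    if node.kind == "sym":
        return node.value
    return f"{node.kind}({', '.join(show(c) for c in node.children)})"
```

For example, `show(parse("(ab)+c?"))` yields `cat(plus(cat(a, b)), opt(c))`: the repeated group `ab` remains a single subtree, which is the kind of structure a data extraction tool would want to map onto one repeated data object.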
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
PKU Core Journal (北大核心)
2007, No. 12, pp. 65-66, 84 (3 pages)
Funding
Zhejiang Provincial Department of Education project: Research on a Highly Automated Web Information Extraction Tool (20060144)
Keywords
Regular expression
Parse tree
Data extraction
Rewriting