Abstract
Web corpora are an important component of foreign-language corpora, and a corpus extraction system must work across websites in different languages and with different structures. This paper presents a corpus extraction software model designed for news websites of the content-management type. Based on the template features of the title pages and content pages of such sites, the model uses regular expressions and dynamic strings to define extraction paths and information-block extraction rules, and it filters the extraction paths for noise and duplicates so that every Web request retrieves valid data. NPCrawler, a corpus extraction tool based on this model, was implemented in C# with SQL Server 2005. Applied to websites in several languages and with different structures, it achieved hit and accuracy rates close to 100% for Web news corpus extraction, with relatively high extraction efficiency.
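The pipeline described in the abstract (regex-defined extraction paths, noise and duplicate filtering of those paths, then regex-defined information-block rules) can be sketched roughly as follows. This is a minimal illustration, not NPCrawler's actual implementation: the tool itself was written in C#, Python is used here only for brevity, and every pattern, function name, and URL layout below is a hypothetical assumption for a single imaginary site.

```python
import re
from urllib.parse import urljoin

# Hypothetical per-site rules; the patterns are illustrative assumptions only.
LINK_RULE = re.compile(r'<a\s+href="([^"]+)"')            # extraction path: links on a title page
CONTENT_URL_RULE = re.compile(r'/news/\d+\.html$')        # de-noising: keep only content-page URLs
TITLE_RULE = re.compile(r'<h1[^>]*>(.*?)</h1>', re.S)     # information block: headline
BODY_RULE = re.compile(r'<div class="article">(.*?)</div>', re.S)  # information block: body


def extract_content_urls(title_page_html, base_url, seen):
    """Return new content-page URLs found on a title page.

    Links that do not match the content-page pattern are dropped (noise),
    and URLs already present in `seen` are skipped (duplicates).
    """
    urls = []
    for href in LINK_RULE.findall(title_page_html):
        url = urljoin(base_url, href)
        if not CONTENT_URL_RULE.search(url):
            continue  # navigation, ads, and other non-article links
        if url in seen:
            continue  # already queued or fetched
        seen.add(url)
        urls.append(url)
    return urls


def extract_blocks(content_page_html):
    """Apply the information-block rules; return None if the template does not match."""
    title = TITLE_RULE.search(content_page_html)
    body = BODY_RULE.search(content_page_html)
    if not title or not body:
        return None
    return {"title": title.group(1).strip(), "body": body.group(1).strip()}
```

Filtering the paths before any content page is requested is what lets each Web access return usable data: only URLs that pass both filters would be fetched and handed to `extract_blocks`.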
Source
《洛阳理工学院学报(自然科学版)》
2013, No. 4, pp. 34-39 (6 pages)
Journal of Luoyang Institute of Science and Technology: Natural Science Edition
Funding
Major Project of Philosophy and Social Sciences Research, Ministry of Education of China (12JZD014)