Abstract
Bilingual corpora are a critical resource for developing statistical machine translation systems. This paper presents a method for automatically mining bilingual parallel Web pages from the Web. Unlike traditional methods that mine parallel pages from pre-specified Web sites, the method mines them from the entire Web and adapts well to new language pairs and content domains. A parallel Web page mining system based on the method is implemented. Experimental results show that the system can provide a large number of high-quality parallel Web pages for statistical machine translation systems.
Source
《计算机工程》
CAS
CSCD
Peking University Core Journal (北大核心)
2009, Issue 14, pp. 267-269 (3 pages)
Computer Engineering
Keywords
natural language processing
statistical machine translation
bilingual corpora
Web mining