Design and Implementation of a Bilingual Parallel Web Page Mining System (双语平行网页挖掘系统的设计与实现)
Abstract: Bilingual corpora are a critical resource for developing statistical machine translation systems. This paper presents a method for automatically mining bilingual parallel Web pages from the Web. Unlike traditional methods, which mine parallel pages from pre-specified Web sites, this method mines parallel Web pages from the entire Web and adapts well to new language pairs and content domains. A bilingual parallel Web page mining system is implemented on this basis. Experimental results show that the system can provide a large number of high-quality parallel Web pages for statistical machine translation systems.
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2009, Issue 14, pp. 267-269 (3 pages).
Keywords: natural language processing; statistical machine translation; bilingual corpora; Web mining
