Design and Implementation of Bilingual Parallel Web Page Mining System (cited by: 5)

Abstract: Bilingual corpora are a critical resource for developing statistical machine translation systems. This paper presents a method that automatically mines bilingual parallel Web pages from the Web. Unlike traditional approaches that mine parallel pages from pre-specified Web sites, the method mines parallel Web pages from the entire Web and adapts well to new language pairs and content domains. A bilingual parallel Web page mining system is implemented on this basis. Experimental results show that the system can provide a large number of high-quality parallel Web pages for statistical machine translation systems.

Authors: Chen Wei (陈伟), Huang Lei (黄蕾), Liu Feng (刘峰), Zhao Zhihong (赵志宏)

Affiliation: Software Institute, Nanjing University

Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2009, No. 14, pp. 267-269 (3 pages)

Keywords: natural language processing; statistical machine translation; bilingual corpora; Web mining

Classification: TP312 [Automation and Computer Technology: Computer Software and Theory]
