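Abstract
[Objective] This study examines how the quality of morphological segmentation relates to Uyghur machine translation performance by comparing different segmentation methods. [Methods] From the perspectives of supervised and unsupervised learning, different morphological segmentation methods are compared with byte pair encoding (BPE) on Uyghur machine translation tasks, and self-supervised learning is applied to Uyghur morphological segmentation. Machine translation experiments are then run on corpora processed with each segmentation method to observe how the choice of method affects the translation results. Finally, statistical tests are applied to the experimental results to compare the performance of the methods. [Results] Supervised morphological segmentation methods achieve better segmentation quality than unsupervised ones. On the Uyghur-Chinese and Uyghur-English machine translation tasks, the proposed method shows no significant difference from BPE on several evaluation metrics. [Conclusions] Segmentation quality and translation quality are not strictly positively correlated. The proposed method balances segmentation quality and translation quality well and shows certain advantages.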
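Byte pair encoding serves as the baseline segmentation throughout the study. As a rough illustration of what that baseline does, the following minimal sketch learns BPE merge rules from a word-frequency table. The function name `learn_bpe`, the `</w>` end-of-word marker, the tie-breaking rule, and the toy input are all assumptions made for this example; the abstract does not say which BPE implementation or settings the authors used.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {word: frequency} table.

    Illustrative sketch only; not the implementation used in the paper.
    """
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic (an assumption of this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy input (hypothetical strings, not Uyghur data).
merges, vocab = learn_bpe({"abab": 4, "ab": 1, "cd": 1}, num_merges=2)
print(merges)
print(vocab)
```

Unlike the morphological segmenters compared in the study, the merges here are driven purely by pair frequency, so the resulting subword units need not coincide with morpheme boundaries.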
[Objective] The advancement of neural machine translation (NMT) has dramatically changed the landscape of computational linguistics, bringing unprecedented improvements in translation quality for numerous languages. These advances have enabled more precise and fluent translations and significantly enhanced cross-linguistic communication. Nevertheless, translating low-resource languages, especially those with complex morphological structures such as Uyghur, remains considerably challenging. This article assesses the impact of morphological segmentation on the quality of Uyghur NMT, focusing on translation between Uyghur and two high-resource languages: Chinese and English. Ultimately, the study aims to identify effective ways to improve the accuracy and fluency of Uyghur translations. [Methods] On Uyghur-Chinese and Uyghur-English NMT tasks, seven morphological segmentation methods, including a proposed method that incorporates self-supervised learning and the widely used byte pair encoding (BPE) technique, are comprehensively evaluated. State-of-the-art NMT models such as Transformer and DeltaLM are employed to keep the findings relevant to current translation technologies. The evaluation relies on the BLEU, chrF2++, and TER metrics to provide a multifaceted view of translation quality. For each NMT task, five randomized experiments are conducted, and statistical tests are used to examine whether the morphological segmentation methods differ significantly from BPE in translation quality. Furthermore, the effects of segmentation granularity and of method-model compatibility on overall translation quality are explored. [Results] This study assessed various morphological segmentation methods for Uyghur and compared their performance across different translation models. Supervised methods were more accurate than unsupervised ones. The CNN-BiLSTM-CRF method stood out, recording F-scores of 96.90% and 97.65% on the validation and test datasets, respectively, with precision (P) of 96.87% and 97.40% and recall (R) of 97.02% and 98.00%. In Uyghur-Chinese translation, the BPE method with the Transformer model achieved average BLEU and chrF2++ scores of 33.25% and 31.81%, respectively, with a TER of 65.12%. With the DeltaLM model, it recorded higher average BLEU and chrF2++ scores of 38.83% and 36.87%, and a lower TER of 57.86%. In Uyghur-English translation, the Morfessor method performed best with the Transformer model, attaining the highest BLEU and chrF2++ scores, with averages of 28.35% and 47.49%, respectively, and a TER of 62.07%. The LMVR method performed best with the DeltaLM model, achieving the highest BLEU and chrF2++ scores and the lowest TER, with averages of 29.29%, 48.25%, and 59.91%, respectively. Additionally, the effectiveness of morphological segmentation varied significantly across model architectures and language pairs, indicating that no single method consistently outperforms the others in all scenarios. This variability underscores the intricate relationship between morphological segmentation accuracy and the resulting NMT quality. [Conclusions] The investigation underscores the pivotal role of morphological segmentation in enhancing NMT for low-resource languages such as Uyghur. The complex relationship between segmentation accuracy and translation quality suggests that the optimal segmentation strategy may differ by NMT model and language pair. Notably, applying self-supervised learning to morphological segmentation yielded promising results comparable to those of the BPE method, indicating potential for future advances in NMT. The findings point to the need for segmentation approaches tailored to specific NMT models, so that the models' capabilities in handling complex morphological structures can be fully exploited. Future research should explore more adaptable and sophisticated segmentation techniques to further improve NMT performance for Uyghur and other morphologically rich low-resource languages.
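The P, R, and F figures reported for the segmenters are precision, recall, and F-score. The sketch below shows one common way to compute them, by comparing predicted and gold morpheme boundaries. Boundary-level scoring, the helper names `boundaries` and `boundary_prf`, and the toy segmentations are assumptions for illustration; the abstract does not state at which level (boundary, morpheme, or word) the authors scored segmentation.

```python
def boundaries(segments):
    """Return the set of internal cut positions (character offsets) of one segmented word."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is not an internal cut
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_prf(gold, pred):
    """Micro-averaged boundary precision, recall and F1 over paired segmentations.

    `gold` and `pred` are parallel lists; each item is the list of segments of one word.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb)  # cuts that are both predicted and correct
        fp += len(pb - gb)  # predicted cuts absent from the gold standard
        fn += len(gb - pb)  # gold cuts that the prediction missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy segmentations (hypothetical strings, not Uyghur data).
gold = [["ab", "cd", "e"], ["fg", "h"]]
pred = [["ab", "cde"], ["f", "g", "h"]]
print(boundary_prf(gold, pred))
```

Under this scheme, over-segmentation lowers precision and under-segmentation lowers recall, which is one way to reason about the segmentation granularity effects mentioned in the abstract.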
Authors
阿布都克力木·阿布力孜
史亚庆
侯钰涛
张雨宁
阿力木江·亚森
哈里旦木·阿布都克里木
ABUDUKELIMU Abulizi; SHI Yaqing; HOU Yutao; ZHANG Yuning; ALIMUJIANG Yasen; HALIDANMU Abudukelimu (School of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, China)
Source
《厦门大学学报(自然科学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2024, No. 4, pp. 694-704 (11 pages)
Journal of Xiamen University:Natural Science
Funding
National Natural Science Foundation of China (62366050, 61966033).
Keywords
Uyghur language
machine translation
morphological segmentation
self-supervised learning