Abstract
Data-driven grammatical error correction (GEC) for English based on machine translation models is one of the main applications of neural language models. The quantity and quality of manually annotated corpora are important factors in the performance of such methods. By analyzing the distribution of error types in existing learner corpora, confusion sets are built for common error types such as verbs, nouns, some prepositions, spelling, and punctuation. The confusion sets are combined with hand-written rules to add noise to monolingual corpus data; the noised data and the learner corpora are then used, respectively, to pre-train and fine-tune a machine-translation-based automatic error generation model. The synthetic data produced by the error generation model is used together with the learner corpora to train the GEC model, whose performance improves significantly on the CoNLL-2014 and JFLEG datasets. In addition, the GEC model is used to correct the source sentences of the learner corpora, and the resulting intermediate data is fed back into the error generation model for alternating training, which further improves the performance of the correction system on the standard datasets.
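The noising step described in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and a tiny hand-made confusion table; the names `CONFUSION_SETS`, `add_noise`, and `make_pair`, the example entries, and the `error_rate` parameter are all hypothetical and are not taken from the paper, whose actual confusion sets and rules are not given in this record.

```python
import random

# Hypothetical confusion sets for illustration only; the paper's real sets
# (verbs, nouns, prepositions, spelling, punctuation) are not reproduced here.
CONFUSION_SETS = {
    "in": ["on", "at"],
    "on": ["in", "at"],
    "at": ["in", "on"],
    "go": ["goes", "went", "going"],
    "goes": ["go", "went", "going"],
    "book": ["books"],
    "books": ["book"],
}


def add_noise(tokens, confusion_sets, error_rate, rng):
    """Swap each token that has a confusion set with probability error_rate."""
    noisy = []
    for token in tokens:
        candidates = confusion_sets.get(token.lower())
        if candidates and rng.random() < error_rate:
            # Substitute a confusable alternative to simulate a learner error.
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(token)
    return noisy


def make_pair(sentence, error_rate=0.3, seed=None):
    """Return (noisy_source, clean_target) built from one clean sentence."""
    rng = random.Random(seed)
    tokens = sentence.split()
    noisy = add_noise(tokens, CONFUSION_SETS, error_rate, rng)
    return " ".join(noisy), sentence
```

Calling `make_pair("She goes to school in the morning")` returns a (noisy, clean) sentence pair. In a pipeline like the one the abstract outlines, such pairs would presumably serve as pre-training data for the error generation model, with the clean sentence as input and the noisy one as output.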
Authors
孙晓东
王丕坤
杨东强
SUN Xiao-dong; WANG Pi-kun; YANG Dong-qiang (School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China)
Source
Computer Technology and Development (《计算机技术与发展》)
2022, Issue 10, pp. 143-150 (8 pages)
Funding
Supported by the Humanities and Social Sciences Fund of the Ministry of Education of China (15YJA740054).
Keywords
data augmentation
back-translation
rule
grammatical error correction
alternating training