Abstract
[Objective] This study applies pre-trained language models to compute and compare the linguistic style of Chinese classical texts, characterizing their stylistic features across languages at a macro level in order to improve the quality of translating the classics into foreign languages. [Methods] We compared five pre-trained language models against the deep learning model Bi-LSTM-CRF on word segmentation and part-of-speech tagging over a cross-lingual ancient Chinese-English corpus built from The Analects of Confucius, the Tao Te Ching, the Book of Rites, the Shangshu (Book of Documents), and the Zhan Guo Ce (Strategies of the Warring States). Using the best-performing pre-trained model, we segmented and POS-tagged all the ancient Chinese texts in the corpus, and on this basis analyzed lexical-level language style across the ancient Chinese originals, their modern vernacular Chinese renderings, and their English translations, comparing part-of-speech distributions, word length, lexical diversity, and lexical density. [Results] The SikuBERT pre-trained language model achieved 91.29% precision, 91.76% recall, and an F1 score (harmonic mean) of 91.52% in recognizing words in the classics. Compared with the originals, the modern Chinese renderings use words with clearer referents, phrases with more narrowly specialized functions, and more varied word combinations, while the English translations show signs of translation simplification. [Limitations] Owing to sampling bias, only specific pre-Qin classical texts and their translations were selected; whether the conclusions generalize to texts in other domains requires further verification. [Conclusions] This study demonstrates the feasibility of using the pre-trained language model SikuBERT to mine the language style of classical texts, provides an in-depth analysis of their stylistic differences, and lays a foundation for improving the translation quality of ancient Chinese and promoting the cross-cultural dissemination of Chinese classics.
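To make the workflow in the abstract concrete, below is a minimal sketch (not the authors' code) of the two steps it describes: tagging classical Chinese with a SikuBERT-style checkpoint and computing simple lexical-level style indicators over the tagged output. The checkpoint name "SIKU-BERT/sikubert", the label scheme, and the content-word tag set are assumptions; in practice the base model would first be fine-tuned on a segmented, POS-tagged classical-Chinese corpus such as the one the paper constructs.

```python
# Sketch only: assumes a token-classification head fine-tuned for joint
# segmentation + POS tagging; checkpoint name and tag set are placeholders.
from collections import Counter

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

MODEL_NAME = "SIKU-BERT/sikubert"  # assumption; replace with your fine-tuned model path

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)

# Character-level token classification; "simple" aggregation merges B-/I- spans into words.
tagger = pipeline("token-classification", model=model, tokenizer=tokenizer,
                  aggregation_strategy="simple")

def tag_sentence(sentence: str):
    """Return (word, pos_tag) pairs predicted for one classical-Chinese sentence."""
    return [(span["word"].replace(" ", ""), span["entity_group"])
            for span in tagger(sentence)]

def lexical_style(tagged_words):
    """Toy lexical-level style indicators over (word, tag) pairs."""
    words = [w for w, _ in tagged_words]
    tags = [t for _, t in tagged_words]
    # Which tags count as content words is an assumption; adjust to the actual tag set.
    content_tags = {"n", "v", "a", "d"}
    return {
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),  # lexical diversity
        "lexical_density": sum(t in content_tags for t in tags) / max(len(tags), 1),
        "pos_distribution": Counter(tags),
    }

if __name__ == "__main__":
    tagged = tag_sentence("学而时习之不亦说乎")
    print(tagged)
    print(lexical_style(tagged))
```

Comparing these indicators across the original, the vernacular rendering, and the English translation of the same passage is one way to operationalize the word-length, diversity, and density comparisons the abstract reports.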
Authors
张逸勤
邓三鸿
胡昊天
王东波
Zhang Yiqin; Deng Sanhong; Hu Haotian; Wang Dongbo (School of Information Management, Nanjing University, Nanjing 210023, China; School of Information Management, Nanjing Agricultural University, Nanjing 210095, China)
Source
《数据分析与知识发现》
EI
CSSCI
CSCD
Peking University Core Journals (北大核心)
2023, No. 10, pp. 50-62 (13 pages)
Data Analysis and Knowledge Discovery
Funding
Supported by the Major Program of the National Social Science Fund of China (Grant No. 21&ZD331).
Keywords
Pre-Trained Language Models
Language Style
Digital Humanities
Canonical Texts