Abstract
The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging, and named entity recognition. A large share of ancient Chinese texts carries no punctuation or sentence breaks, so lexical analysis and related tasks must first build on sentence segmentation. However, step-by-step pipeline processing lets errors propagate from one stage to the next. To avoid this, the paper designs and implements a joint tagging approach to sentence segmentation and lexical analysis of ancient Chinese. Using a BiLSTM-CRF neural network model, it evaluates on four test sets from different eras how the model performs on sentence segmentation and lexical analysis at different tagging levels, and how well it generalizes to texts from different eras. The results show that joint tagging improves the F1-score of sentence segmentation, word segmentation, and part-of-speech tagging. Aggregated over the test sets, F1 reaches 78.95% for sentence segmentation (an average increase of 3.5%), 85.73% for word segmentation (an average increase of 0.18%), and 72.65% for part-of-speech tagging (an average increase of 0.35%).
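The joint tagging idea in the abstract can be illustrated with a minimal sketch that folds sentence boundaries, word boundaries, and part-of-speech into one label per character, so a single sequence labeler (such as a BiLSTM-CRF) can predict all three at once. The tag layout used here (`B/I/E/S` for word position, a POS code, and `L`/`O` for "last character of a sentence" or not), the function name `encode_joint_tags`, and the POS codes in the example are all assumptions made for illustration; the record does not state the paper's actual tag set.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (word, POS code); POS codes here are hypothetical


def encode_joint_tags(sentences: List[List[Word]]) -> List[Tuple[str, str]]:
    """Turn segmented, POS-tagged sentences into per-character joint tags.

    Each tag has the assumed form "<seg>-<pos>-<sent>", where
      seg  is B/I/E/S (begin / inside / end / single-character word),
      pos  is the POS code of the word the character belongs to,
      sent is L if the character closes a sentence, otherwise O.
    """
    tagged: List[Tuple[str, str]] = []
    for sent in sentences:
        for w_idx, (word, pos) in enumerate(sent):
            is_last_word = w_idx == len(sent) - 1
            for c_idx, ch in enumerate(word):
                # Word-boundary part of the tag
                if len(word) == 1:
                    seg = "S"
                elif c_idx == 0:
                    seg = "B"
                elif c_idx == len(word) - 1:
                    seg = "E"
                else:
                    seg = "I"
                # Sentence-boundary part of the tag
                is_last_char = c_idx == len(word) - 1
                sent_flag = "L" if (is_last_word and is_last_char) else "O"
                tagged.append((ch, f"{seg}-{pos}-{sent_flag}"))
    return tagged


if __name__ == "__main__":
    # One short sentence; segmentation and POS codes are illustrative only.
    example = [[("君子", "n"), ("不", "d"), ("器", "v")]]
    for ch, tag in encode_joint_tags(example):
        print(ch, tag)
```

With a layout like this, dropping the last field gives a segmentation-plus-POS tag set and dropping the middle field gives a segmentation-plus-sentence-break tag set, which is one way different "tagging levels" could be compared.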
Authors
CHENG Ning (程宁)
LI Bin (李斌)
GE Sijia (葛四嘉)
HAO Xingyue (郝星月)
FENG Minxuan (冯敏萱)
School of Chinese Language and Literature, Nanjing Normal University, Nanjing, Jiangsu 210097, China; Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2020, No. 4, pp. 1-9 (9 pages)
Funding
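National Natural Science Foundation of China (71673143)
Research Project of the State Language Commission (WT135-24, YB135-61)
Jiangsu Provincial Project for Outstanding Innovation Teams in Philosophy and Social Sciences at Higher Education Institutions (2017STD006)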
Keywords
sentence segmentation of ancient Chinese
word segmentation
part-of-speech tagging
BiLSTM-CRF
ancient Chinese information processing