Abstract
【Purpose/significance】This paper studies how to build a sentence segmentation model for ancient Chinese texts from a small number of annotated samples, reducing the annotation cost of model training and offering a new approach to integrating digital technology with the humanities.【Method/process】Starting from the uncertainty and diversity of ancient text samples, the paper proposes a weighted multi-strategy sample selection method and combines it with sentence segmentation models such as BERT-BiLSTM-CRF and BERT-CRF. Information entropy and similarity are used to analyze the uncertainty and diversity of the texts, and a weighted calculation estimates how valuable each sample is for model training. The samples selected as valuable are manually annotated and added to the training set, and the model is then retrained iteratively.【Result/conclusion】In experiments on the History of Song Dynasty (《宋史》), the proposed method reduced the number of training samples required by 50% and 55% for the BERT-BiLSTM-CRF and BERT-CRF models respectively, which supports its effectiveness.【Innovation/limitation】The weighted multi-strategy sample selection method offers a new approach to training sentence segmentation models for ancient texts. Future work will explore its applicability to other tasks in the collation of ancient books.
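The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the formulas, so the mean per-character entropy used for uncertainty, the "1 minus highest cosine similarity" used for diversity, the weight `w`, and all function names here are hypothetical. The per-character label distributions and sentence vectors are assumed to come from whatever segmentation model (for example BERT-CRF) is being trained.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty(sample_probs):
    """Mean per-character entropy, scaled to [0, 1] by log(number of labels)."""
    n_labels = len(sample_probs[0])
    mean_h = sum(token_entropy(p) for p in sample_probs) / len(sample_probs)
    return mean_h / math.log(n_labels)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diversity(vec, labeled_vecs):
    """1 minus the highest cosine similarity to any already-labeled sample."""
    if not labeled_vecs:
        return 1.0
    return 1.0 - max(cosine(vec, lv) for lv in labeled_vecs)


def select_samples(pool, labeled_vecs, k, w=0.5):
    """Return the ids of the k highest-scoring unlabeled samples.

    pool: list of (sample_id, per-character label distributions, sentence vector).
    w:    assumed weight on uncertainty; (1 - w) goes to diversity.
    """
    scored = [
        (w * uncertainty(probs) + (1 - w) * diversity(vec, labeled_vecs), sid)
        for sid, probs, vec in pool
    ]
    scored.sort(reverse=True)
    return [sid for _, sid in scored[:k]]
```

In an active learning loop of the kind the abstract describes, the returned samples would be annotated by hand, moved into the training set, and the model retrained before the next selection round.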
Authors
ZHANG Jing-su (张景素); WEI Ming-zhu (魏明珠)
School of Humanities, Huazhong University of Science and Technology, Wuhan 430074, China; School of Business and Management, Jilin University, Changchun 130000, China
Source
Information Science (《情报科学》)
CSSCI
Peking University Core Journals (北大核心)
2022, Issue 10, pp. 164-170 (7 pages)
Keywords
sentence segmentation of ancient Chinese
active learning
digital humanities
sample selection strategy
BERT