Sentence Segmentation Model of Ancient Books Text Based on Weighted Multi-strategy Sampling: Taking History of Song Dynasty (《宋史》) as an Example

Abstract: 【Purpose/significance】This paper studies how to build a sentence segmentation model for ancient Chinese texts from a small number of annotated samples, reducing the annotation cost incurred during model training and offering a new approach to integrating digital technology with the humanities. 【Method/process】Starting from the uncertainty and diversity of ancient text samples, the paper proposes a weighted multi-strategy sample selection method that is combined with sentence segmentation models such as BERT-BiLSTM-CRF and BERT-CRF. Using the concepts of information entropy and similarity, it analyzes the uncertainty and diversity of ancient texts in depth and applies a weighted calculation to assess how valuable each sample is for model training. The samples selected as valuable are annotated manually and added to the training set, and the model is then retrained iteratively. 【Result/conclusion】Taking the ancient book History of Song Dynasty as an example, the proposed method reduces the original training sample size by 50% and 55%, respectively, when training ancient text segmentation models such as BERT-BiLSTM-CRF and BERT-CRF, which further verifies its effectiveness. 【Innovation/limitation】The weighted multi-strategy sample selection method provides a new approach to training sentence segmentation models for ancient texts. Future work will explore its applicability to other tasks in the collation of ancient books.
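The abstract does not give the scoring formula, so the following is only a minimal sketch of how an entropy-based uncertainty score and a similarity-based diversity score might be combined by weighting. It assumes a linear combination with a hypothetical weight `alpha`, per-character label distributions taken from the current model, and one embedding vector per sample. All function names and parameters are illustrative and are not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one character's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def uncertainty(sample_probs):
    """Mean per-character entropy; sample_probs is a list of distributions."""
    return sum(token_entropy(p) for p in sample_probs) / len(sample_probs)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def diversity(vec, labeled_vecs):
    """One minus the highest similarity to any already-labeled sample."""
    if not labeled_vecs:
        return 1.0
    return 1.0 - max(cosine(vec, lv) for lv in labeled_vecs)

def select_samples(pool, labeled_vecs, k, alpha=0.5):
    """Return indices of the k pool samples with the highest weighted score.

    pool: list of (sample_probs, vec) pairs for unlabeled samples.
    alpha: assumed weight on uncertainty; (1 - alpha) goes to diversity.
    Assumes at least two labels, so that log(#labels) is positive.
    """
    max_entropy = math.log(len(pool[0][0][0]))  # scales entropy into [0, 1]
    scored = []
    for i, (probs, vec) in enumerate(pool):
        u = uncertainty(probs) / max_entropy
        d = diversity(vec, labeled_vecs)
        scored.append((alpha * u + (1.0 - alpha) * d, i))
    # Highest score first; ties are broken by the lower index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]
```

In an active learning loop of the kind the abstract describes, the selected indices would be sent for manual annotation, moved into the training set, and the model retrained before the next round of selection.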

Authors: ZHANG Jing-su (张景素); WEI Ming-zhu (魏明珠) (School of Humanities, Huazhong University of Science and Technology, Wuhan 430074, China; School of Business and Management, Jilin University, Changchun 130000, China)

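Affiliations: School of Humanities, Huazhong University of Science and Technology; School of Business and Management, Jilin University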

Source: Information Science (《情报科学》), CSSCI, Peking University Core Journal, 2022, No. 10, pp. 164-170 (7 pages)

Keywords: sentence segmentation of ancient Chinese; active learning; digital humanities; sample selection strategy; BERT