Abstract
In information extraction, sequence labeling models struggle to capture long-distance semantics within a sentence, which leaves input features underused and degrades the extraction of evidence entities from judgment documents. To address this problem, a judgment document evidence extraction method that fuses label information is proposed. First, the data are converted from sequence labeling format into triples in machine reading comprehension format that incorporate label information. Second, the text information and the label information are fused and fed into the BERT pre-trained model. Finally, a threshold is set and the predicted evidence entity indices are output through an MLP. Experimental results show that, on a dataset of 2293 judgment documents, the proposed method improves the F1 value by 1.93% compared with the traditional sequence labeling model.
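The first and last steps of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The BIO tag scheme, the label name `EVIDENCE`, the query wording in `LABEL_QUERIES`, the per-token probability input, and the default threshold of 0.5 are all assumptions made for illustration; the abstract does not specify any of them. The sketch converts one sequence-labeled sample into a (query, context, answer spans) triple and then selects token indices whose score passes a threshold, as the scores from an MLP head might be filtered.

```python
from typing import Dict, List, Tuple

# Hypothetical label description used as the query text (an assumption;
# the abstract does not give the actual wording of the label information).
LABEL_QUERIES: Dict[str, str] = {
    "EVIDENCE": "Find the evidence mentioned in the judgment document.",
}


def bio_to_spans(tags: List[str], label: str) -> List[Tuple[int, int]]:
    """Collect inclusive (start, end) token spans for one label from BIO tags."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == f"B-{label}":
            # A new entity begins; close any entity that is still open.
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag == f"I-{label}" and start is not None:
            # The current entity continues.
            continue
        else:
            # Any other tag ends the open entity, if there is one.
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


def to_mrc_triple(
    tokens: List[str], tags: List[str], label: str
) -> Tuple[str, List[str], List[Tuple[int, int]]]:
    """Build a (query, context, answer spans) triple for one label."""
    return LABEL_QUERIES[label], tokens, bio_to_spans(tags, label)


def indices_above_threshold(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Return the token indices whose predicted probability reaches the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]


# Example: a ten-character sentence containing two evidence mentions.
tokens = list("原告提交了借条和收据")
tags = ["O", "O", "O", "O", "O",
        "B-EVIDENCE", "I-EVIDENCE", "O", "B-EVIDENCE", "I-EVIDENCE"]
query, context, answers = to_mrc_triple(tokens, tags, "EVIDENCE")
```

In a BERT-based setup, the query and the context would typically be encoded together as a sentence pair, so that the label description and the document text attend to each other. That encoding step is omitted here because the abstract does not describe how the fusion is carried out.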
Authors
ZHOU Yulin (周裕林)
LU Anqi (鹿安琪)
ZHOU Wentong (周雯童)
LIU Linhong (刘林红)
(State Key Laboratory of Public Big Data, Guiyang 550025; College of Computer Science & Technology, Guizhou University, Guiyang 550025)
Source
Computer & Digital Engineering (《计算机与数字工程》)
2022, No. 9, pp. 2025-2029 (5 pages)
Funding
Supported by the Guizhou University Undergraduate Innovation and Entrepreneurship Training Program (No. 贵大(省)创字2021(055)).
Keywords
label information
judgment documents
machine reading comprehension
evidence extraction