Abstract
The potato breeding field has accumulated a large body of breeding literature that has not yet been structured. The documents are in PDF format, and manually collating the germplasm resource data they contain is time-consuming and labor-intensive. To extract germplasm resource data from breeding literature quickly and accurately, a method based on part-of-speech tagging rules and preset words is used. When the document text cannot be obtained directly, a run-length smoothing algorithm and optical character recognition (OCR) are used to obtain the text content. Extraction items are stored in a keyword library that users can build flexibly. Regular expressions retrieve the sentences that contain each keyword, natural language processing tools perform word segmentation and part-of-speech tagging on those sentences, and target words are then extracted according to rules. An information extraction method based on the distance between keywords and preset words is applied alongside this, so that breeding literature is converted from free text into structured data. Information extraction was performed on 1490 extraction items from 115 articles. The experiment shows a precision of 82.97%, a recall of 99.72%, and an F-measure of 90.58%, so the method extracts germplasm resource data from potato breeding literature with relatively high precision and recall and can provide a data basis for building a potato genetic breeding database.
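As a rough illustration of two steps named in the abstract, the following Python sketch retrieves keyword sentences with regular expressions and then picks the preset word closest to the keyword. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the character-based distance, the sample sentence, and the keyword and preset lists are all hypothetical, and the word segmentation and part-of-speech tagging step is omitted. The last function recomputes the F-measure as the harmonic mean of precision and recall, which is consistent with the reported figures.

```python
import re

def split_sentences(text):
    """Split on Chinese/Western sentence-ending punctuation (simplified)."""
    parts = re.split(r"(?<=[。！？!?])\s*|(?<=\.)\s+", text)
    return [p.strip() for p in parts if p and p.strip()]

def sentences_with_keyword(text, keyword):
    """Return the sentences that contain the keyword."""
    pattern = re.compile(re.escape(keyword))
    return [s for s in split_sentences(text) if pattern.search(s)]

def nearest_preset_word(sentence, keyword, preset_words):
    """Pick the preset word with the smallest character gap to the keyword.

    Assumption: distance is measured in characters between the keyword
    and a non-overlapping preset word; the paper may define it differently.
    """
    k_start = sentence.find(keyword)
    if k_start < 0:
        return None
    k_end = k_start + len(keyword)
    best = None  # (gap, word)
    for word in preset_words:
        for m in re.finditer(re.escape(word), sentence):
            if m.end() <= k_start:
                gap = k_start - m.end()      # preset word precedes keyword
            elif m.start() >= k_end:
                gap = m.start() - k_end      # preset word follows keyword
            else:
                continue                     # overlaps the keyword itself
            if best is None or gap < best[0]:
                best = (gap, word)
    return best[1] if best else None

def extract(text, keywords, preset_words):
    """Map each keyword to its nearest preset word in the first matching sentence."""
    result = {}
    for keyword in keywords:
        hits = sentences_with_keyword(text, keyword)
        if hits:
            result[keyword] = nearest_preset_word(hits[0], keyword, preset_words)
    return result

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Hypothetical sample text, keyword library, and preset words.
    text = ("Variety X is a hypothetical example. "
            "The skin color is yellow and the tuber shape is oval. "
            "Yield was stable.")
    keywords = ["skin color", "tuber shape"]
    presets = ["yellow", "red", "oval", "round"]
    print(extract(text, keywords, presets))
    print(round(f_measure(82.97, 99.72), 2))
```

A purely symmetric character distance can pick a value that belongs to a neighbouring attribute when two attribute phrases sit close together, which is presumably one reason the described method combines the distance criterion with part-of-speech rules.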
Authors
王腾阳
赵小丹
胡林
WANG Teng-yang; ZHAO Xiao-dan; HU Lin (Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China)
Source
《科学技术与工程》
Peking University Core Journal (北大核心)
2023, Issue 27, pp. 11562-11569 (8 pages)
Science Technology and Engineering
Funding
Inner Mongolia Autonomous Region Science and Technology Major Project (2021SZD0026).
Keywords
potato
part-of-speech tagging
information extraction
natural language processing