
Comparative Study of Different Natural Language Processing Methods in the Information Extraction from Soil Environment Investigation Reports
Abstract Soil environmental pollution investigation reports contain a wealth of information on the soil environmental background, pollution sources, migration pathways, and sensitive receptors. As typical unstructured data, however, these reports are difficult to use directly; the relevant information must first be extracted before it can support further analysis and the management of contaminated sites. Targeting the technical difficulties of extracting information from these reports, this study applied three natural language processing (NLP) approaches, namely a traditional rule-based matching method, the BERT pre-trained language model, and the autoregressive language model GPT (specifically ChatGPT), and evaluated their extraction performance. The GPT model achieved an accuracy, recall, and F1 score of 97.80%, 84.43%, and 90.62%, respectively. Compared with the traditional rule-based matching method, these scores were improved by 86.70%, 299.12%, and 200.70%, respectively; compared with the BERT model, they were improved by 18.10%, 154.21%, and 91.15%, respectively. Further analysis showed that although the GPT model had certain advantages in extracting text elements, the rule-based matching method was simple to use and extracted some elements efficiently. Meanwhile, the BERT model has considerable room for improvement through a larger training set and optimized annotation and model parameters. Therefore, a suitable NLP method can be chosen for each text element label in soil environmental pollution investigation reports to balance the efficiency and accuracy of text information extraction.
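The reported scores can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and rests on two assumptions that the abstract does not state: that the metric reported as "accuracy" is precision (so that F1 is the harmonic mean of it and recall), and that the "improved by" percentages are relative gains over each baseline. The function names `f1_score` and `implied_baseline` are hypothetical, and the implied baseline scores are back-calculations under those assumptions, not figures reported by the authors.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Works on fractions or on percentages, as long as both
    arguments use the same scale.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_baseline(score: float, relative_gain_pct: float) -> float:
    """Baseline score implied by a relative gain (ASSUMPTION: gains are relative)."""
    return score / (1 + relative_gain_pct / 100)


# Scores reported in the abstract for the GPT model, in percent.
# ASSUMPTION: the reported "accuracy" is treated as precision.
gpt = {"precision": 97.80, "recall": 84.43, "f1": 90.62}

# Reported improvements of GPT over each baseline, in percent.
gains = {
    "rule_based": {"precision": 86.70, "recall": 299.12, "f1": 200.70},
    "bert": {"precision": 18.10, "recall": 154.21, "f1": 91.15},
}

# Check 1: does the reported F1 follow from the reported precision and recall?
gpt_f1 = f1_score(gpt["precision"], gpt["recall"])
print(f"GPT F1 recomputed: {gpt_f1:.2f} (reported: {gpt['f1']:.2f})")

# Check 2: back-calculate each baseline and test its internal consistency.
implied = {
    method: {m: implied_baseline(gpt[m], g[m]) for m in ("precision", "recall", "f1")}
    for method, g in gains.items()
}
for method, scores in implied.items():
    recomputed = f1_score(scores["precision"], scores["recall"])
    print(
        f"{method}: implied P={scores['precision']:.2f}, "
        f"R={scores['recall']:.2f}, F1={scores['f1']:.2f}, "
        f"F1 from P and R={recomputed:.2f}"
    )
```

Under these assumptions the recomputed GPT F1 rounds to the reported 90.62%, and the back-calculated baselines are internally consistent, which supports reading the first metric as precision and the improvements as relative rather than percentage-point gains.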
Authors SUN Weiwei (孙维维); PAN Xianzhang (潘贤章); LIU Jie (刘杰); GUO Guanlin (郭观林); LI Yan (李衍); WANG Juan (王娟); XIANG Yu (项钰); WANG Rui (王睿) (State Key Laboratory of Soil and Sustainable Agriculture, Institute of Soil Science, Chinese Academy of Sciences, Nanjing 210008, China; University of Chinese Academy of Sciences, Beijing 100049, China; Technical Centre for Ecology and Environment of Soil, Agriculture and Rural Areas, Ministry of Ecology and Environment, Beijing 100012, China; Iflytek Intelligent System Co., Ltd., Hefei 230088, China)
Source Research of Environmental Sciences (《环境科学研究》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2024, Issue 3, pp. 607-615 (9 pages)
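Funding National Key Research and Development Program of China (No. 2020YFC1807401); Open Fund of the Key Laboratory of Soil Environment and Pollution Remediation, Chinese Academy of Sciences (No. SEPR2020-10).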
Keywords text feature extraction; BERT model; GPT model; contaminated sites; investigation reports on soil environmental pollution