Abstract
To address complex nesting between named entities, and the overlapping boundaries of adjacent named entities caused by annotation errors in the corpus, this paper proposes a method for Chinese overlapping Named Entity Recognition (NER). First, a hierarchical clustering algorithm based on random merging and splitting divides the labels of overlapping named entities into different clusters. This establishes a one-to-one relation between characters and entity labels and prevents the label clustering from getting trapped in local optima. Then, within each label cluster, a Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) model that integrates Chinese radicals is used to improve the stability of overlapping NER. Experimental results show that, through label clustering, the proposed method effectively avoids the interference of annotation errors with recognition, and improves the F1 score by 0.05 on average compared with existing recognition methods.
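The abstract does not spell out the clustering procedure, so the following is only a toy sketch of the general idea, not the authors' algorithm. It assumes that two labels conflict when their spans share a character, that a cluster is acceptable when it contains no conflicting pair, and that the search prefers fewer clusters. The sample spans, the label names, the function names, and the `steps` and `split_prob` parameters are all hypothetical.

```python
import random
from itertools import combinations

# Hypothetical annotations over one 9-character sentence:
# (start, end_exclusive, label). ORG/LOC are nested, PER/TITLE overlap.
SPANS = [(0, 4, "ORG"), (0, 2, "LOC"), (5, 7, "PER"), (6, 9, "TITLE")]


def conflicting_pairs(spans):
    """Label pairs whose spans share at least one character."""
    pairs = set()
    for (s1, e1, l1), (s2, e2, l2) in combinations(spans, 2):
        if l1 != l2 and s1 < e2 and s2 < e1:
            pairs.add(frozenset((l1, l2)))
    return pairs


def is_valid(cluster, conflicts):
    """A cluster is valid if no two of its labels conflict."""
    return not any(frozenset(p) in conflicts for p in combinations(cluster, 2))


def random_merge_split(labels, conflicts, steps=200, split_prob=0.2, seed=0):
    """Randomly merge valid cluster pairs; occasionally split one label out
    so the search can leave a poor grouping (assumed objective: few clusters)."""
    rng = random.Random(seed)
    clusters = [{label} for label in sorted(labels)]
    best = [set(c) for c in clusters]
    for _ in range(steps):
        if rng.random() < split_prob:
            multi = [c for c in clusters if len(c) > 1]
            if multi:
                chosen = rng.choice(multi)
                label = rng.choice(sorted(chosen))
                chosen.remove(label)
                clusters.append({label})
        elif len(clusters) > 1:
            a, b = rng.sample(range(len(clusters)), 2)
            merged = clusters[a] | clusters[b]
            if is_valid(merged, conflicts):
                clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
                clusters.append(merged)
        if len(clusters) < len(best):
            best = [set(c) for c in clusters]
    return best


def tag_per_cluster(n_chars, spans, cluster):
    """BIO tags using only the labels of one cluster: one tag per character."""
    tags = ["O"] * n_chars
    for start, end, label in spans:
        if label in cluster:
            tags[start] = "B-" + label
            for i in range(start + 1, end):
                tags[i] = "I-" + label
    return tags


labels = {label for _, _, label in SPANS}
conflicts = conflicting_pairs(SPANS)
clusters = random_merge_split(labels, conflicts)
for cluster in clusters:
    print(sorted(cluster), tag_per_cluster(9, SPANS, cluster))
```

Because no cluster holds two overlapping labels, each cluster yields an ordinary single-label tag sequence, which is the form of input a standard sequence tagger such as a BiLSTM-CRF expects.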
Authors
WEN Xiuxiu (温秀秀)
MA Chao (马超)
GAO Yuanyuan (高原原)
KANG Zilu (康子路)
Information Science Academy, China Electronics Technology Group Corporation, Beijing 100081, China
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core (北大核心)
2020, Issue 5, pp. 41-46 (6 pages)
Funding
National Key Research and Development Program of China, "Networked Operating System for Cloud Computing" (2016YFB1000500).
Keywords
Named Entity Recognition (NER)
entity overlapping
Chinese named entity
label clustering
hierarchical clustering