Abstract
Hierarchical segmentation of Chinese addresses is a foundation of Chinese address standardization and an important means of geocoding, and it is a focus of research in both Chinese word segmentation and geography. High-quality methods for extracting the hierarchy levels of Chinese addresses usually depend on large amounts of manually labeled data, but labeled datasets are time-consuming and expensive to obtain. To address this problem, this paper proposes a confidence-based active learning hybrid model that combines a bidirectional long short-term memory network with a conditional random field (Active-BiLSTM-CRF) to build an address vocabulary. The model uses the confidence of the CRF on each sample to efficiently screen out the key address samples that need labeling, uses the BiLSTM to capture the context of the address, and uses the CRF transition probability matrix to constrain the output tag sequence; labeling and training are then repeated in cycles. Finally, the precision and recall of the method under a limited labeling budget are verified on household registration address data from a certain district or county. Experimental results show that when labeled data account for 20% of the data, the Active-BiLSTM-CRF model reaches a precision of 97.71% and a recall of 97.34%.
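The confidence-based screening step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how confidence is computed, so the sketch assumes one common choice: the posterior probability of the best (Viterbi) tag path under a linear-chain CRF, with the lowest-confidence samples sent for labeling. The function names `sequence_confidence` and `select_for_labeling`, and the toy scores standing in for BiLSTM emissions, are hypothetical and not taken from the paper.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_confidence(emissions, transitions):
    """Probability of the best tag path under a linear-chain CRF.

    emissions:   list of T rows, each with K per-tag scores
                 (assumed here to come from a BiLSTM layer).
    transitions: K x K matrix, transitions[i][j] = score of tag i -> tag j.
    """
    n_tags = len(transitions)
    alpha = list(emissions[0])  # forward log-scores (sum over all paths)
    delta = list(emissions[0])  # Viterbi scores (best path only)
    for scores in emissions[1:]:
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + scores[j]
            for j in range(n_tags)
        ]
        delta = [
            max(delta[i] + transitions[i][j] for i in range(n_tags)) + scores[j]
            for j in range(n_tags)
        ]
    log_z = log_sum_exp(alpha)  # log partition function
    return math.exp(max(delta) - log_z)


def select_for_labeling(pool, transitions, k):
    """Return indices of the k least-confident samples and all confidences."""
    conf = [sequence_confidence(e, transitions) for e in pool]
    order = sorted(range(len(pool)), key=lambda i: conf[i])
    return order[:k], conf


# Toy pool: one sample the model is sure about, one it is not.
transitions = [[0.0, 0.0], [0.0, 0.0]]
peaked = [[10.0, 0.0], [10.0, 0.0], [10.0, 0.0]]
flat = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
picked, conf = select_for_labeling([peaked, flat], transitions, k=1)
```

In an active learning loop of this kind, the picked samples would be labeled by hand, added to the training set, and the model retrained before the next round of selection.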
Authors
侯位昭
张欣海
宋凯磊
韩志卓
张世立
HOU Wei-zhao; ZHANG Xin-hai; SONG Kai-lei; HAN Zhi-zhuo; ZHANG Shi-li (The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang 050081, China; Hebei Far East Communication System Engineering Co., Ltd., Shijiazhuang 050200, China; China Academy of Electronics and Information Technology, Beijing 100041, China; National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data, Beijing 100041, China)
Source
《中国电子科学研究院学报》
Peking University Core Journal (北大核心)
2021, No. 7, pp. 639-644, 660 (7 pages in total)
Journal of China Academy of Electronics and Information Technology
Funding
Supported by the National Key Research and Development Program of China (2017YFC0820505).
Keywords
active learning
confidence
address segmentation
bidirectional long short-term memory (BiLSTM)
conditional random field (CRF)
address segmentation labeling