Abstract
Mainstream Chinese word segmentation methods rely on supervised learning algorithms, which require large amounts of manually annotated corpora and extract local features that suffer from sparsity. To address these problems, a bidirectional long short-term memory conditional random field (BI_LSTM_CRF) model is proposed. The model learns text features automatically and models contextual dependencies in the text, while the CRF layer takes into account the tags of the characters before and after each character in a sentence and thereby performs inference over the sequence. The model achieves good segmentation results on the MSRA, PKU, and CTB 6.0 datasets, and is further evaluated on news data, microblog data, automobile forum data, and restaurant review data. The experimental results show that the BI_LSTM_CRF model performs well on the test sets and also generalizes well in cross-domain testing.
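The abstract treats segmentation as per-character sequence labeling with a CRF layer that scores neighboring tags. The sketch below illustrates that idea only; it is not the authors' implementation. The BMES tag set (Begin, Middle, End, Single), the hard transition penalty, and all function names are assumptions introduced for illustration, and the emission scores stand in for what a trained BiLSTM would produce.

```python
# Illustrative sketch only: BMES tagging plus Viterbi decoding over
# per-character scores. Tag set and names are assumptions, not from the paper.
TAGS = ["B", "M", "E", "S"]  # Begin, Middle, End, Single-character word


def words_to_tags(words):
    """Convert a segmented sentence into one BMES tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # flush an unfinished word at the end of the sentence
        words.append(current)
    return words


def bmes_transitions(penalty=-1e4):
    """Transition scores: 0 for legal tag pairs, a large penalty otherwise."""
    allowed = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
               ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
    return [[0.0 if (a, b) in allowed else penalty for b in TAGS] for a in TAGS]


def viterbi(emissions, transitions):
    """Return the highest-scoring tag index path.

    emissions[t][j]: score of tag j at character t (stand-in for BiLSTM output).
    transitions[i][j]: score of moving from tag i to tag j (the CRF part).
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backptrs.append(ptrs)
    path = [max(range(n_tags), key=lambda i: score[i])]
    for ptrs in reversed(backptrs):  # follow back-pointers to the first character
        path.append(ptrs[path[-1]])
    return path[::-1]
```

In this toy setup, picking the best tag for each character independently can yield an illegal pair such as B followed by S, whereas decoding with the transition scores avoids it. A trained CRF would learn its transition scores from data rather than use a fixed penalty, and this sketch places no constraint on the first or last tag.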
Authors
姚茂建
李晗静
吕会华
姚登峰
YAO Maojian; LI Hanjing; LÜ Huihua; YAO Dengfeng (Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China; Special Education College of Beijing Union University, Beijing 100075, China)
Source
《现代电子技术》
Peking University Core Journal
2019, No. 1, pp. 95-99 (5 pages)
Modern Electronics Technique
Funding
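Key Project of the State Language Commission (ZDI135-31)
Key Project of Beijing Education Science Planning (ADA14121)
Innovation Team Building and Promotion Plan for High-Level Teachers at Beijing Municipal Universities (IDHT20170511)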
Keywords
natural language processing
Chinese word segmentation
neural network
bidirectional long short-term memory conditional random field
character embedding
sequence labeling