Abstract
For sequence labeling tasks such as Chinese word segmentation and part-of-speech (POS) tagging, this paper proposes a joint method for Chinese word segmentation and POS tagging that combines the BERT language model, BiLSTM (bidirectional long short-term memory), and CRF (conditional random field) with either the Markov family model (MFM) or tree-like probability (TLP). POS tagging based on the hidden Markov model (HMM) ignores the emission probability from the word itself to its part of speech. In POS tagging based on MFM or TLP, by contrast, the part of speech of a word depends not only on the part of speech of the preceding word but also on the word itself. The joint method lets POS information inform segmentation, and coupling the two tasks tightly helps resolve ambiguity and improves the performance of both segmentation and POS tagging. Experimental results show that the joint method clearly improves segmentation performance over a standard BiLSTM-CRF segmentation model and substantially improves POS tagging accuracy over the conventional HMM-based tagging method.
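The dependence described above, where a tag is conditioned on both the previous tag and the current word, can be illustrated with a small Viterbi decoder. This is only a sketch under stated assumptions: the abstract does not give the model's factorization, so the code assumes the conditional-independence form P(t_i | t_{i-1}, w_i) ∝ P(t_i | t_{i-1}) · P(t_i | w_i) / P(t_i). The function name `mfm_viterbi`, the two-tag tag set, and every probability in the toy tables are hypothetical and are not taken from the paper.

```python
import math


def mfm_viterbi(words, tags, start, trans, word_tag, prior):
    """Return the highest-scoring tag sequence for `words`.

    Each step is scored with the ASSUMED factorization
    P(t_i | t_{i-1}) * P(t_i | w_i) / P(t_i), computed in log space.
    All probabilities must be strictly positive.
    """
    if not words:
        return []

    def local_score(prev_tag, tag, word):
        # Context term: start distribution for the first word, else transition.
        p_context = start[tag] if prev_tag is None else trans[prev_tag][tag]
        # Word term: P(tag | word); unseen words fall back to the tag prior,
        # which makes the word term cancel against the division by the prior.
        p_word = word_tag.get(word, prior)[tag]
        return math.log(p_context) + math.log(p_word) - math.log(prior[tag])

    best = {t: local_score(None, t, words[0]) for t in tags}
    backpointers = []
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + local_score(p, t, word))
            new_best[t] = best[prev] + local_score(prev, t, word)
            pointers[t] = prev
        best = new_best
        backpointers.append(pointers)

    # Follow the backpointers from the best final tag to recover the path.
    tag = max(tags, key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# Toy tables (hypothetical values): "n" = noun-like, "v" = verb.
TAGS = ["n", "v"]
START = {"n": 0.6, "v": 0.4}
TRANS = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.8, "v": 0.2}}
PRIOR = {"n": 0.5, "v": 0.5}
WORD_TAG = {
    "我们": {"n": 0.9, "v": 0.1},    # "we"
    "研究": {"n": 0.4, "v": 0.6},    # "study/research", ambiguous
    "问题": {"n": 0.95, "v": 0.05},  # "problem"
}

result = mfm_viterbi(["我们", "研究", "问题"], TAGS, START, TRANS, WORD_TAG, PRIOR)
print(result)
```

With these toy tables the ambiguous word 研究 is tagged as a verb, because both the transition from the preceding tag and the word's own tag distribution favor that reading. In the joint setting the abstract describes, the same kind of score would be applied to candidate segmentations rather than to a fixed word sequence.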
Author
YUAN Li-chi (袁里驰)
School of Software and Internet of Things Engineering, Jiangxi University of Finance and Economics, Nanchang 330013, China
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2023, No. 9, pp. 1906-1911 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant Nos. 61962025, 61562034).
Keywords
BERT (bidirectional encoder representations from transformers)
bidirectional long short-term memory model
Chinese word segmentation
part-of-speech tagging
Markov family model
tree-like probability