Abstract
Named entity recognition (NER) is a fundamental technology in natural language processing (NLP). In recent years, the rapid development of social network platforms such as microblogs has posed new challenges to traditional NER techniques because of the distinctive form of their text. This paper proposes an improved method based on the conditional random field (CRF) model for microblog text. Because microblog posts are short and semantically ambiguous, external data sources are introduced to extract topic features and word representation features for training the model. Because microblog data is large in scale and manual standardization is costly, an active learning algorithm based on least confidence is adopted to strengthen training at a lower cost in human labor. Experiments on a Sina Weibo data set show that the method improves the F-score by 4.54% compared with the traditional CRF method.
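The abstract names least-confidence active learning but gives no implementation details. The sketch below shows the general idea only: each unlabeled post is scored by one minus the probability of the model's most likely label sequence, and the least certain posts are sent for manual annotation. The function names, the post identifiers, and the assumption that the CRF exposes a best-path probability per post are hypothetical, not taken from the paper.

```python
from typing import Dict, List


def least_confidence(best_path_prob: float) -> float:
    """Uncertainty score 1 - P(y* | x), where y* is the most likely label sequence."""
    return 1.0 - best_path_prob


def select_for_annotation(best_path_probs: Dict[str, float], batch_size: int) -> List[str]:
    """Return the ids of the `batch_size` posts the model is least confident about.

    `best_path_probs` maps a (hypothetical) post id to the probability the
    current model assigns to its best label sequence for that post.
    """
    ranked = sorted(
        best_path_probs,
        key=lambda post_id: least_confidence(best_path_probs[post_id]),
        reverse=True,  # highest uncertainty first
    )
    return ranked[:batch_size]


# Illustrative values only; these are not results from the paper.
example_probs = {"post_a": 0.95, "post_b": 0.40, "post_c": 0.70}
selected = select_for_annotation(example_probs, batch_size=2)
```

In a full loop, the selected posts would be labeled by hand, added to the training set, and the model retrained before the next round of selection.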
Source
Application of Electronic Technique (《电子技术应用》)
2018, No. 1, pp. 118-120, 124 (4 pages)
Funding
National Natural Science Foundation of China (U1536207)
Keywords
named entity recognition
microblog
conditional random field
word representation
active learning