Abstract
Biomedical named entity recognition is a prerequisite for biomedical information extraction. Most current approaches use machine learning and rely on features hand-crafted from domain knowledge and experience, which requires repeated experiments for feature selection; these features also rarely capture deep semantic information. To investigate the effect of semantic information on named entity recognition, this paper obtains semantic information automatically by training on a large-scale unlabeled corpus, which can be downloaded from public databases such as PubMed, yielding three word representations: word embeddings, clusters based on word embeddings, and Brown clusters. These representations are used as features in CRF and SVM models for semi-supervised learning, and comparative experiments are conducted under the same conditions (the dimension of the word embeddings and the number of clusters). The results show that word representations effectively capture latent semantic information and thereby improve the performance of existing machine-learning-based entity recognition systems. Without dictionaries or any other external resources, precision, recall, and F-score on the public evaluation corpus BioCreative II GM reach 91.24%, 85.80%, and 88.44%, respectively.
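The way the three word representations could enter a feature-based tagger can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy embeddings, the centroids, the Brown-cluster bit strings, and the prefix lengths are all hypothetical, and the resulting per-token feature dictionaries are only assumed to be in a form that a CRF or SVM toolkit could consume.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_cluster(vector, centroids):
    """Return the index of the centroid closest to `vector`."""
    return min(range(len(centroids)), key=lambda i: dist(vector, centroids[i]))


def token_features(token, embeddings, centroids, brown_paths, prefix_lengths=(4, 6)):
    """Build a feature dict for one token from the three word representations.

    All inputs are hypothetical lookup tables assumed to have been learned
    beforehand from unlabeled text.
    """
    key = token.lower()
    feats = {"word.lower": key}

    vec = embeddings.get(key)
    if vec is not None:
        # 1) word embedding: one real-valued feature per dimension
        for i, value in enumerate(vec):
            feats[f"emb_{i}"] = value
        # 2) cluster based on word embeddings: id of the nearest centroid
        feats["emb_cluster"] = str(nearest_cluster(vec, centroids))

    path = brown_paths.get(key)
    if path is not None:
        # 3) Brown cluster: prefixes of the bit-string path (lengths assumed)
        for n in prefix_lengths:
            feats[f"brown_{n}"] = path[:n]

    return feats


# Toy lookup tables (made-up values, for illustration only).
embeddings = {"p53": [0.9, 0.1], "brca1": [0.8, 0.2], "binds": [0.1, 0.9]}
centroids = [[0.85, 0.15], [0.1, 0.9]]
brown_paths = {"p53": "110100", "brca1": "110101", "binds": "0111"}

features = [
    token_features(t, embeddings, centroids, brown_paths)
    for t in ["p53", "binds", "DNA"]
]
```

A token missing from the lookup tables (here "DNA") simply receives no word-representation features, so the tagger falls back on whatever other features it has.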
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal (北大核心)
2016, No. 2, pp. 302-307 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (61173101, 61173100)
Keywords
semi-supervised
word representation
cluster
entity recognition