Abstract
Short text classifiers built on the vector space model lose the structural information of the text, and labeling large numbers of samples is a bottleneck. To address both problems, this paper proposes a graph-based semi-supervised classification method that preserves the structural and semantic relations within short texts while making full use of unlabeled samples to improve classifier performance. Following the semi-supervised learning approach, a large set of unlabeled samples is combined with a small set of labeled samples for graph-based self-training. The training set is expanded iteratively and then used to build the final short text classifier. Comparative experiments show that the method achieves good classification results.
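The self-training loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not specify the graph construction, the similarity measure, or the confidence rule, so the co-occurrence graph (`text_graph`), the Jaccard edge overlap (`similarity`), the nearest-neighbor `predict`, the `threshold` value, and whitespace tokenization are all hypothetical choices made for the example.

```python
def text_graph(tokens, window=2):
    """Represent a short text as a set of undirected co-occurrence edges.

    Assumption: two tokens are linked if they appear within `window`
    positions of each other. The paper's actual graph model may differ.
    """
    edges = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                edges.add(frozenset((left, right)))
    return edges


def similarity(g1, g2):
    """Jaccard overlap of two edge sets (an assumed graph similarity)."""
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / len(g1 | g2)


def predict(graph, labeled):
    """Return (label, score) of the most similar labeled graph."""
    best_label, best_score = None, -1.0
    for other, label in labeled:
        score = similarity(graph, other)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


def self_train(labeled_texts, unlabeled_texts, threshold=0.3, max_rounds=10):
    """Iteratively move confidently predicted samples into the training set."""
    labeled = [(text_graph(t.split()), y) for t, y in labeled_texts]
    pool = [text_graph(t.split()) for t in unlabeled_texts]
    for _ in range(max_rounds):
        confident, remaining = [], []
        for graph in pool:
            label, score = predict(graph, labeled)
            if score >= threshold:
                confident.append((graph, label))
            else:
                remaining.append(graph)
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)  # expand the training set
        pool = remaining
    return labeled


seed = [("cheap flight ticket deal", "travel"), ("python code bug fix", "tech")]
unlabeled = [
    "cheap flight ticket sale",
    "python code bug report",
    "flight ticket sale today",
]
model = self_train(seed, unlabeled)
```

In this toy run, "flight ticket sale today" shares only one edge with the seed travel sample and falls below the threshold in the first round. It is absorbed in the second round, once "cheap flight ticket sale" has joined the training set, which is the iterative expansion the abstract refers to. Real Chinese short texts would need a word segmenter in place of `split()`.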
Source
《图书情报工作》
CSSCI
Peking University Core Journals (北大核心)
2013, No. 21, pp. 126-132 (7 pages)
Library and Information Service
Keywords
semi-supervised learning
short text
graph structure
self-training