Abstract
Active learning algorithms select the most informative unlabeled instances and submit them to human experts for labeling. Over repeated cycles the classifier's accuracy improves step by step, so that a classifier with strong generalization ability is obtained at minimal total labeling cost. The technique has attracted wide attention from researchers in China and abroad and is an important current research topic. This paper surveys active learning with particular emphasis on sampling strategies. It describes in detail how the learning engine and the sampling engine work together iteratively, summarizes the existing theoretical results on active learning, and reviews the current state of research and its development trends. First, active learning algorithms are divided into three main classes according to how the sampling strategy selects examples. Next, the algorithms based on the different sampling strategies are analyzed and compared in depth, and the application domains each is suited to, along with their advantages and shortcomings, are discussed. Finally, the remaining open problems and directions for further research are pointed out.
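The loop described in the abstract (a learning engine that retrains the classifier, and a sampling engine that picks the next instance for the expert) can be sketched roughly as follows. This is only an illustrative toy, not any algorithm from the paper: it assumes one-dimensional data, a simple threshold classifier, uncertainty sampling (query the point closest to the current decision boundary), and a hypothetical `oracle` function standing in for the human expert. All function names are invented for this sketch.

```python
import random


def fit_threshold(labeled):
    """Learning engine: place the boundary midway between the two classes."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    return (max(neg) + min(pos)) / 2.0


def select_query(pool, threshold):
    """Sampling engine: pick the unlabeled point closest to the boundary."""
    return min(pool, key=lambda x: abs(x - threshold))


def active_learn(pool, oracle, seed_labeled, budget):
    """Alternate between retraining and querying until the budget is spent."""
    labeled = list(seed_labeled)
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)
        x = select_query(pool, threshold)
        pool.remove(x)
        labeled.append((x, oracle(x)))  # one unit of labeling cost
    return fit_threshold(labeled), labeled


def oracle(x):
    """Hypothetical expert: the true class boundary is assumed to be 0.37."""
    return 1 if x >= 0.37 else 0


rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]  # unlabeled instances in [0, 1)
seed = [(0.0, 0), (1.0, 1)]                 # one labeled example per class
threshold, labeled = active_learn(pool, oracle, seed, budget=15)
```

In this toy setting each query lands between the closest known negative and positive examples, so the learned threshold approaches the true boundary after only a handful of labels out of 1000 candidates, which is the labeling-cost saving the abstract refers to.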
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University Core Journals)
2012, Issue 6, pp. 1162-1173 (12 pages)
Journal of Computer Research and Development
Funding
National Natural Science Foundation of China (61171185, 60932008, 60832010)
China Postdoctoral Science Foundation, Special Funded Project (201003446)
Keywords
machine learning
active learning
sampling strategy
labeling cost
instance selection