Abstract
Existing machine learning algorithms tend to produce biased classifiers on imbalanced online sequential data, which lowers classification accuracy for the minority class. To address this, an imbalanced online sequential extreme learning machine based on principal curves is proposed. The core idea is to balance the samples of each class according to the distribution of the online sequential data, which reduces the blindness of minority-sample synthesis. The method has an offline stage and an online stage. In the offline stage, principal curves are used to model the distribution of each class, the minority class is over-sampled with the synthetic minority over-sampling technique (SMOTE), and each sample is assigned a membership degree according to its projection distance to the corresponding principal curve. Majority samples and synthetic (virtual) minority samples are then pruned according to a membership interval, and the initial model is built. In the online stage, sequentially arriving minority samples are over-sampled, the sequential samples are balanced according to the membership interval, and the network weights are updated dynamically. Theoretical analysis proves that the information loss of the proposed algorithm has an upper bound. Simulation experiments on three UCI benchmark datasets and real-world Macau meteorological data for air pollutant forecasting show that, compared with existing typical algorithms, the proposed algorithm achieves higher prediction accuracy on minority samples and better numerical stability.
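As a rough illustration of two building blocks named in the abstract, the following Python sketch shows SMOTE-style interpolation for the minority class and the recursive least-squares weight update commonly used in online sequential extreme learning machines. It is not the authors' implementation: the principal-curve modelling and membership-interval pruning are omitted, over-sampling is done once up front for brevity instead of per arriving chunk, and the names smote_like, OSELM, and the reg ridge term are hypothetical choices made for this example.

```python
import numpy as np


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style).
    Assumes k < len(minority)."""
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(minority), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))             # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[pick] - minority[base])


class OSELM:
    """Minimal online sequential ELM with a sigmoid hidden layer.
    The ridge term `reg` is an assumption added for numerical stability."""

    def __init__(self, n_inputs, n_hidden, rng, reg=1e-2):
        self.W = rng.standard_normal((n_inputs, n_hidden))  # fixed random weights
        self.b = rng.standard_normal(n_hidden)
        self.reg = reg
        self.P = None      # inverse of the regularised hidden-layer Gram matrix
        self.beta = None   # output weights

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit_initial(self, X, T):
        """Offline stage: solve the initial least-squares problem."""
        H = self._hidden(X)
        self.P = np.linalg.inv(H.T @ H + self.reg * np.eye(H.shape[1]))
        self.beta = self.P @ H.T @ T

    def update(self, X, T):
        """Online stage: recursive least-squares update for a new chunk."""
        H = self._hidden(X)
        S = np.eye(len(X)) + H @ self.P @ H.T
        K = self.P @ H.T @ np.linalg.inv(S)
        self.P = self.P - K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    majority = rng.normal(0.0, 1.0, size=(200, 2))
    minority = rng.normal(2.5, 0.5, size=(20, 2))
    # Balance the classes with synthetic minority samples.
    synthetic = smote_like(minority, n_new=180, k=5, rng=rng)
    X = np.vstack([majority, minority, synthetic])
    T = np.concatenate([np.zeros(200), np.ones(200)])[:, None]
    order = rng.permutation(len(X))
    X, T = X[order], T[order]

    model = OSELM(n_inputs=2, n_hidden=20, rng=rng)
    model.fit_initial(X[:100], T[:100])          # offline stage
    for start in range(100, len(X), 50):         # chunks arriving online
        model.update(X[start:start + 50], T[start:start + 50])
    accuracy = np.mean((model.predict(X) > 0.5) == (T > 0.5))
    print(f"training accuracy: {accuracy:.3f}")
```

With the ridge term included, the sequential updates reproduce the batch regularised least-squares solution over all data seen so far, so the model never needs to store past chunks. In the method described above, a pruning step based on principal-curve membership would sit between over-sampling and each call to fit_initial or update.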
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2016, Issue 3, pp. 62-67 (6 pages)
Funding
National Natural Science Foundation of China (U1204609)
Basic and Frontier Technology Research Program of Henan Province (132300410430)
Keywords
Online sequential extreme learning machine
Imbalanced data
Principal curve
Synthetic minority over-sampling