Abstract
When the number of samples per class is highly imbalanced, the feature distribution is imbalanced as well. Features that appear often in the majority classes appear rarely or not at all in the minority classes, so the classifier is easily swamped by the majority classes and the recognition rate for the minority classes is very low. To address this, the paper proposes a method for detecting concept drift on imbalanced data streams and an incremental feature selection algorithm based on the coefficient of variation, CVIFS (Coefficient Variance-based Incremental Feature Selection), designed around the class distribution of the data. CVIFS selects the most discriminative features in order to raise the recognition rate of the minority classes, and interval estimation is used to detect concept drift. An experimental study on the Moving Hyperplane dataset shows that the proposed algorithm achieves a lower balanced error rate (BER) than Information Gain on imbalanced data streams with concept drift.
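The two quantities named in the abstract can be illustrated with a minimal sketch. The balanced error rate is the standard mean of per-class error rates. The feature score below is only an assumption about how a coefficient of variation might be applied to per-class feature occurrence rates; the abstract does not give the paper's exact CVIFS formula, and the function names are hypothetical.

```python
from math import sqrt


def balanced_error_rate(y_true, y_pred):
    """Mean of the per-class error rates, so every class counts equally."""
    per_class_error = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(1 for i in idx if y_pred[i] != c)
        per_class_error.append(wrong / len(idx))
    return sum(per_class_error) / len(per_class_error)


def cv_score(class_rates):
    """Coefficient of variation (std / mean) of one feature's occurrence
    rate in each class. Assumed scoring rule, not the paper's definition:
    a higher value means the feature is spread less evenly across classes."""
    mean = sum(class_rates) / len(class_rates)
    if mean == 0:
        return 0.0  # feature never occurs; no discriminative signal
    variance = sum((r - mean) ** 2 for r in class_rates) / len(class_rates)
    return sqrt(variance) / mean


# A classifier that always predicts the majority class on a 90/10 stream:
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
print(balanced_error_rate(y_true, y_pred))  # plain error rate is 0.10

# A feature seen equally in both classes vs. one concentrated in one class:
print(cv_score([0.5, 0.5]), cv_score([0.9, 0.1]))
```

In the 90/10 example the plain error rate is only 0.10, yet the minority class is never recognized, which is why a class-balanced measure such as BER is the more informative metric on skewed streams.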
Source
Journal of Suzhou University (《宿州学院学报》)
2014, No. 11, pp. 75-78 (4 pages)
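Funding
Anhui Provincial Natural Science Research Project for Universities, "Research on Key Issues of Interaction Trust Management for Information Services in Cloud Computing Environments" (KJ2013Z281)
Huaibei Normal University Youth Research Project, "Research on Incremental Feature Selection Algorithms Based on Class Distribution" (2014xq012)
Huaibei Normal University Youth Natural Science Research Project, "Research on Constructing Interaction Trust Models and Evaluating Trusted Entities for Cloud Services" (700693)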
Keywords
concept drift
imbalanced (skewed) distribution
coefficient of variation
information gain