
An Improved Bagging Algorithm for Imbalanced Datasets (cited by: 12)

Published English title: Improving Bagging algorithm for imbalance data
Abstract: When applied to imbalanced datasets, the traditional Bagging classification method can achieve high overall accuracy, but its accuracy on the minority class is low. To improve accuracy on minority-class samples, the paper proposes a two-step approach. First, the SMOTE algorithm is used to increase the number of minority-class samples; then, within Bagging, the weight of each instance is adjusted according to its class value. Confusion matrix and ROC curve results show that the approach preserves overall classification accuracy while improving accuracy on the minority class.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2010, No. 30, pp. 40-42 (3 pages)
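Funding: Shandong Province High-Tech Independent Innovation Project Special Program (No. 2007ZZ17); Natural Science Foundation of Shandong Province (No. Y2007G16); Shandong Province Science and Technology Research Program (No. 2008GG10001015); Shandong Province Electronics Development Fund (No. 2008B0026)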
Keywords: imbalanced classes; Synthetic Minority Over-sampling Technique (SMOTE); Bagging algorithm; weights; Receiver Operating Characteristic (ROC) curve
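
The two-step approach described in the abstract (SMOTE oversampling of the minority class, followed by Bagging with class-dependent instance weights) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the weighting rule or the base learner, so the inverse-class-frequency weights, the 1-nearest-neighbour base learner, and all function names here are hypothetical choices made for brevity.

```python
import random


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic minority points by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, sorted by squared distance to the base point.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def class_weights(labels):
    """Assumed rule: weight each instance by 1 / (size of its class)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return [1.0 / counts[y] for y in labels]


def weighted_bootstrap(samples, labels, weights, rng):
    """Draw one bootstrap bag, sampling instances in proportion to weight."""
    idx = rng.choices(range(len(samples)), weights=weights, k=len(samples))
    return [samples[i] for i in idx], [labels[i] for i in idx]


def nn_predict(train_x, train_y, x):
    """1-nearest-neighbour base learner (a stand-in, not from the paper)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def bagging_predict(bags, x):
    """Majority vote over the base learners trained on each bag."""
    votes = [nn_predict(bx, by, x) for bx, by in bags]
    return max(set(votes), key=votes.count)


# Toy data: 40 majority points near (0, 0), 5 minority points near (6, 6).
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(5)]

# Step 1: oversample the minority class with SMOTE.
minority_aug = minority + smote(minority, n_new=15, rng=rng)

# Step 2: Bagging with class-dependent instance weights.
X = majority + minority_aug
y = [0] * len(majority) + [1] * len(minority_aug)
w = class_weights(y)
bags = [weighted_bootstrap(X, y, w, rng) for _ in range(11)]
print(bagging_predict(bags, (6.0, 6.0)))
```

An odd number of bags is used so that a two-class vote cannot tie. In practice the base learner would usually be a decision tree or similar, and performance would be judged with a confusion matrix and ROC curve, as the abstract describes.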


