Abstract
The synthetic minority oversampling technique (SMOTE) has become one of the mainstream methods for handling imbalanced data because it deals effectively with minority-class samples, and many improved SMOTE algorithms have been proposed. Existing surveys, however, rarely cover the popular algorithm-level improvements. This paper therefore provides a more comprehensive analysis and summary of existing SMOTE-type algorithms. It first explains the basic principle of SMOTE in detail, then systematically reviews SMOTE-type algorithms at two levels, the data level and the algorithm level, and introduces new ideas that combine data-level and algorithm-level improvements. Data-level improvements balance the data distribution by deleting or adding data through various operations during preprocessing; algorithm-level improvements leave the data distribution unchanged and instead modify or create algorithms to strengthen the focus on minority-class samples. Comparing the two, data-level methods are subject to fewer restrictions in application, while algorithms with algorithm-level improvements are generally more robust. To provide more complete basic research material on SMOTE-type algorithms, the paper finally lists commonly used datasets and evaluation metrics and suggests possible directions for future research, so as to better address the imbalanced data problem.
Authors
WANG Xiaoxia (王晓霞)
LI Leixiao (李雷孝)
LIN Hao (林浩)
Affiliations: College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010080, China; Inner Mongolia Autonomous Region Software Service Engineering Technology Research Center Based on Big Data, Hohhot 010080, China; College of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
Source
Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》)
CSCD
Peking University Core Journals (北大核心)
2024, Issue 5, pp. 1135-1159 (25 pages)
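Funding
National Natural Science Foundation of China (62362055)
Key Research and Development and Achievement Transformation Program of Inner Mongolia Autonomous Region (2022YFSJ0013, 2023YFHH0052)
Support Program for Young Scientific and Technological Talents in Higher Education Institutions of Inner Mongolia Autonomous Region (NJYT22084)
Natural Science Foundation of Inner Mongolia (2023MS06008)
Special Fund Project for the Transformation of Scientific and Technological Achievements of Inner Mongolia Autonomous Region (2020CG0073, 2021CG0033).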