Abstract
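Imbalanced data are widespread in practical applications and pose a challenge for machine learning; how to handle them effectively has become a current research focus. In fault diagnosis data sets, fault samples are usually far fewer than non-fault samples, which gives rise to the problem of fault diagnosis under class imbalance. Previous studies have paid little attention to the effect of this imbalance on fault diagnosis. In addition, fault data sets contain some redundant or even irrelevant features, which reduce the generalization ability of the learner. To address these problems, an EasyEnsemble algorithm based on embedded feature selection is proposed for the class imbalance problem in fault diagnosis. Experimental results on UCI data sets and a diesel engine data set show that the new algorithm improves the classification performance and prediction ability of classifiers on imbalanced data sets.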
Many labeled data sets have an unbalanced representation of their classes. When the imbalance is large, classification accuracy on the smaller class tends to be lower. In particular, when a class is of great interest but occurs relatively rarely, as with cases of fraud or instances of disease, it is important to identify it accurately. Fault diagnosis of diesel engines is a difficult problem because of the complex structure of the engines and the presence of multiple excitation sources. The class imbalance problem also arises in fault diagnosis, where it seriously degrades the performance of classifiers that assume a balanced class distribution. Although the problem is critical, few previous works have paid attention to class imbalance in the fault diagnosis of diesel engines. In imbalanced problems, some features are redundant or even irrelevant, and these features hurt the generalization performance of learning machines. Here we propose PREE (Prediction Risk based feature selection for EasyEnsemble) to solve the class imbalance problem in the fault diagnosis of diesel engines. Experimental results on UCI data sets and a diesel engine data set show that PREE improves classification performance and prediction ability on imbalanced data sets.
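The undersampling-plus-ensemble idea that EasyEnsemble builds on can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the PREE method: a simple nearest-centroid classifier stands in for the boosted base learners that EasyEnsemble normally uses, the prediction-risk feature selection step is omitted, and all function names and the synthetic data are hypothetical.

```python
import random
from statistics import mean


def centroid(rows):
    """Per-feature mean of a list of feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_nearest_centroid(minority, majority_subset):
    """Stand-in base learner (assumption): predict 1 (fault) if the
    sample is closer to the minority centroid, else 0 (non-fault)."""
    c_min = centroid(minority)
    c_maj = centroid(majority_subset)

    def predict(x):
        return 1 if sq_dist(x, c_min) < sq_dist(x, c_maj) else 0

    return predict


def train_balanced_ensemble(majority, minority, n_subsets=5, seed=0):
    """Train one base learner per balanced subset: each subset pairs all
    minority samples with an equally sized random draw from the majority."""
    rng = random.Random(seed)
    k = len(minority)
    members = []
    for _ in range(n_subsets):
        subset = rng.sample(majority, k)  # undersample the majority class
        members.append(train_nearest_centroid(minority, subset))
    return members


def ensemble_predict(members, x):
    """Combine the base learners by simple majority vote."""
    votes = sum(member(x) for member in members)
    return 1 if 2 * votes > len(members) else 0


# Synthetic, hypothetical data: 40 non-fault samples near (0, 0)
# and only 5 fault samples near (3, 3).
data_rng = random.Random(42)
majority = [[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(40)]
minority = [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(5)]

members = train_balanced_ensemble(majority, minority, n_subsets=5, seed=0)
fault_label = ensemble_predict(members, [3.0, 3.0])
normal_label = ensemble_predict(members, [0.0, 0.0])
```

Because each base learner sees a balanced subset, no single learner is dominated by the non-fault class, while the ensemble as a whole still draws on many different majority samples. A feature selection step such as the one the abstract describes would sit alongside this training loop; how the paper embeds it is not specified here.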
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal
2009, No. 5, pp. 924-927 (4 pages)
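Funding
Supported by the National Natural Science Foundation of China (20503105, 60873129)
Supported by the Major Project of the Innovation Action Plan of the Science and Technology Commission of Shanghai Municipality (07DZ19726)
Supported by the Shanghai Rising-Star Program for Young Scientists (08QA1403200)
Supported by the Special Research Fund for the Selection and Training of Outstanding Young Teachers in Shanghai Universities (sdj-07003)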
Keywords
feature selection
imbalanced data sets
ensemble learning
fault diagnosis
diesel engine