
Research on Improved Random Forest Algorithm Based on Unbalanced Datasets (基于不平衡数据集的改进随机森林算法研究)

Cited by: 10
Abstract: The random forest algorithm achieves good classification performance across a wide range of application scenarios and datasets. When applied to imbalanced binary classification datasets, however, it is limited by the skewed class ratio of the sample data and by the majority-vote mechanism at the leaf nodes of its decision subtrees, so the minority class, which accounts for relatively few samples, is poorly represented in the final vote. To address this, this paper improves the node classification rule of the original random forest algorithm: during model training, the class proportion measured at a node and the node's depth are considered jointly, adding classification information that favors minority-class samples and thereby raising the classification accuracy of the minority class. Tests of the improved algorithm on different datasets show that it performs better than the traditional algorithm on imbalanced datasets, and that minority-class classification accuracy improves significantly when the sample size is large.
Authors: LIU Yao-jie (刘耀杰), LIU Du-yu (刘独玉), School of Electrical and Information Engineering, Southwest Minzu University, Chengdu 610041, China
Source: Computer Technology and Development (《计算机技术与发展》), 2019, No. 6, pp. 100-104 (5 pages)
Funding: Fundamental Research Funds for the Central Universities (2017ZYXS09)
Keywords: imbalanced data; random forest; decision tree; node splitting; classification accuracy
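
The abstract describes the improvement only at a high level; the exact weighting rule appears in the full paper, not here. Below is a minimal sketch of one way a depth- and class-proportion-aware leaf vote could look, built on top of scikit-learn. The weighting form, the `alpha` parameter, and the `weighted_forest_predict` helper are all illustrative assumptions, not the authors' formula.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def leaf_depths(tree):
    """Depth of every node in a fitted sklearn decision tree (root = 0)."""
    t = tree.tree_
    depth = np.zeros(t.node_count, dtype=int)
    stack = [(0, 0)]
    while stack:
        node, d = stack.pop()
        depth[node] = d
        if t.children_left[node] != -1:  # internal node, push both children
            stack.append((t.children_left[node], d + 1))
            stack.append((t.children_right[node], d + 1))
    return depth


def weighted_forest_predict(forest, X_train, y_train, X, alpha=0.5):
    """Aggregate per-tree leaf votes, boosting the minority class.

    Each leaf's class proportions are re-weighted by the inverse class
    prior raised to a power that grows with leaf depth. Both the weighting
    form and alpha are assumptions for illustration; alpha=0 reduces to
    ordinary proportion voting.
    """
    classes, counts = np.unique(y_train, return_counts=True)
    inv_prior = counts.sum() / counts  # rare class -> large weight
    scores = np.zeros((X.shape[0], classes.size))
    for tree in forest.estimators_:
        depth = leaf_depths(tree)
        max_d = max(depth.max(), 1)  # guard against single-leaf trees
        train_leaves = tree.apply(X_train)
        test_leaves = tree.apply(X)
        for leaf in np.unique(test_leaves):
            in_leaf = train_leaves == leaf
            if not in_leaf.any():
                continue
            # class proportions in this leaf, re-weighted by prior and depth
            prop = np.array([(y_train[in_leaf] == c).mean() for c in classes])
            w = prop * inv_prior ** (1.0 + alpha * depth[leaf] / max_d)
            scores[test_leaves == leaf] += w / w.sum()
    return classes[np.argmax(scores, axis=1)]


# Synthetic 95:5 imbalanced binary dataset for a quick end-to-end check.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = weighted_forest_predict(rf, X_tr, y_tr, X_te)
print(f"minority-class recall with weighted vote: {(pred[y_te == 1] == 1).mean():.3f}")
```

Raising the inverse class prior to a power that grows with leaf depth means deeper, more specific leaves give minority-class evidence a larger boost, which mirrors the abstract's idea of jointly weighing node class proportion and node depth.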

