
A type of balanced repeated half-sampling (RHS) cross-validation (cited by: 2)
Abstract: In statistical machine learning, cross-validation is a very commonly employed technique for estimating prediction error and evaluating the performance of a machine learning model. The quality of the expected prediction error estimator has a significant impact on model selection, on significance tests for differences between model prediction errors, and on variable selection. To estimate the prediction error of a model, cross-validation splits a dataset multiple times to construct several held-out experiments. A large number of experiments show that the variance of a cross-validation estimator is closely related to the splits employed: the number of samples shared by two training sets depends on the two corresponding splits of the dataset, and the more samples the training sets share, the larger the variance of the cross-validation estimator. This paper therefore proposes a balanced repeated half-sampling (RHS) cross-validation method with six repetitions. It guarantees that the sum of the numbers of common samples between the training sets is minimal and that the number of common samples is the same for every pair of splits. These properties give the balanced RHS estimator of generalization error a smaller variance and effectively improve its stability. We prove theoretically that the variance of the balanced RHS cross-validation estimator with six repetitions is less than that of the block 3×2 cross-validation estimator, and simulation experiments on regression and classification tasks support this result. The experimental results also show that the variance of the proposed estimator is less than that of the random RHS cross-validation estimator with six repetitions. To further validate these conclusions, the experiments were repeated on 10 real datasets drawn from UCI and the KEEL software tool.
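The two properties stated in the abstract (equal overlap between every pair of training sets, and a minimal total overlap) can be illustrated with a small sketch. The construction below is an assumption made for illustration and is not taken from the paper: it cuts the data into 10 equal groups and assigns each group to three of the six training sets according to a 2-(6,3,2) block design, so it requires the sample size n to be a multiple of 10. The names `DESIGN`, `balanced_rhs_splits`, `random_rhs_splits`, and `pairwise_overlaps` are hypothetical.

```python
import random
from itertools import combinations

# Blocks of a 2-(6,3,2) design (illustrative assumption, not from the paper).
# Each triple lists the 3 training sets (out of 6) that one group of samples
# joins. Every training set appears in 5 triples; every pair appears in 2.
DESIGN = [(0, 1, 2), (0, 1, 3), (0, 2, 4), (0, 3, 5), (0, 4, 5),
          (1, 2, 5), (1, 3, 4), (1, 4, 5), (2, 3, 4), (2, 3, 5)]


def balanced_rhs_splits(n, seed=0):
    """Return six (train, test) index lists with equal pairwise train overlap."""
    if n % 10 != 0:
        raise ValueError("this sketch assumes n is a multiple of 10")
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    g = n // 10
    groups = [idx[k * g:(k + 1) * g] for k in range(10)]
    train_sets = [set() for _ in range(6)]
    for group, block in zip(groups, DESIGN):
        for j in block:
            train_sets[j].update(group)
    everything = set(idx)
    # Each test set is the complement of its training set (half-sampling).
    return [(sorted(t), sorted(everything - t)) for t in train_sets]


def random_rhs_splits(n, reps=6, seed=0):
    """Return `reps` independent random half-splits, for comparison."""
    rng = random.Random(seed)
    everything = set(range(n))
    splits = []
    for _ in range(reps):
        train = set(rng.sample(range(n), n // 2))
        splits.append((sorted(train), sorted(everything - train)))
    return splits


def pairwise_overlaps(splits):
    """Number of shared training samples for every pair of splits."""
    return [len(set(a[0]) & set(b[0])) for a, b in combinations(splits, 2)]


balanced = pairwise_overlaps(balanced_rhs_splits(100))
rand = pairwise_overlaps(random_rhs_splits(100))
print("balanced:", balanced, "sum =", sum(balanced))
print("random:  ", rand, "sum =", sum(rand))
```

With n = 100, each training set in this sketch holds 50 samples and every one of the 15 pairs shares exactly 20, that is n/5. When every training set holds n/2 samples, the total pairwise overlap is the sum over samples of C(r_i, 2), where r_i is the number of training sets containing sample i; this sum is smallest when every r_i equals 3, which gives 3n and is what the design above achieves. Random half-splits generally produce unequal overlaps.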
Source: Journal of Nanjing University (Natural Science) (《南京大学学报(自然科学版)》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2015, No. 4, pp. 842-849 (8 pages)
Keywords: cross validation; generalization error; block 3×2 cross validation; RHS cross-validation

