A Balanced Repeated Half-Sampling (RHS) Cross-Validation

Cited by: 2


Abstract: In statistical machine learning, cross-validation is a commonly used technique for estimating prediction error and evaluating the performance of a machine learning model. The quality of the prediction error estimator has a significant impact on model selection, on significance tests for differences between model prediction errors, and on variable selection. To estimate the prediction error, cross-validation splits a dataset multiple times to construct several hold-out experiments. Many experiments show, however, that the variance of a cross-validation estimator is closely related to the splits used: the number of samples shared by two training sets depends on the two corresponding splits, and the more samples the training sets share, the larger the variance of the estimator. This paper therefore proposes a balanced repeated half-sampling (RHS) cross-validation method with six repetitions. It guarantees that the total number of common samples across the training sets is minimal and that the number of common samples is the same for every pair of splits. These properties reduce the variance of the balanced RHS estimator of generalization error and thereby improve its stability. We prove theoretically that the variance of the balanced RHS cross-validation estimator with six repetitions is smaller than that of the block 3×2 cross-validation estimator, and simulation experiments on regression and classification tasks support this result. The experimental results also show that the variance of the proposed estimator is smaller than that of the random RHS cross-validation estimator with six repetitions. Experiments on 10 real datasets from UCI and the KEEL software tool further confirm these conclusions.
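The overlap argument in the abstract can be illustrated with a small sketch. The abstract does not give the paper's construction, so the balanced design below is an assumption: it splits the data into 10 equal blocks and assigns each block to three of the six training sets using a 2-(6,3,2) block design, which is one way to obtain six half-samples whose pairwise overlaps are all equal. The function names, the `DESIGN` table, and the sample size `n = 40` are hypothetical and chosen only for illustration; the authors' actual method may differ.

```python
from itertools import combinations
import random


def pairwise_overlaps(training_sets):
    """Number of common samples for every pair of training sets."""
    return [len(a & b) for a, b in combinations(training_sets, 2)]


def block_3x2_training_sets(n):
    """Block 3x2 CV: 4 equal blocks, 3 pairings, each pairing used as a 2-fold split."""
    assert n % 4 == 0
    size = n // 4
    blocks = [set(range(i * size, (i + 1) * size)) for i in range(4)]
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    sets = []
    for left, right in pairings:
        sets.append(blocks[left[0]] | blocks[left[1]])
        sets.append(blocks[right[0]] | blocks[right[1]])
    return sets


# Assumption (not from the paper): a 2-(6,3,2) design on training sets 0..5.
# Each triple lists the three training sets that one data block belongs to;
# every pair of training sets appears together in exactly two triples.
DESIGN = [
    (0, 1, 2), (0, 1, 3), (0, 2, 4), (0, 3, 5), (0, 4, 5),
    (1, 2, 5), (1, 3, 4), (1, 4, 5), (2, 3, 4), (2, 3, 5),
]


def balanced_rhs_training_sets(n):
    """Six half-samples with equal pairwise overlap, built from DESIGN (illustrative)."""
    assert n % 10 == 0
    size = n // 10
    sets = [set() for _ in range(6)]
    for b, triple in enumerate(DESIGN):
        block = range(b * size, (b + 1) * size)
        for j in triple:
            sets[j].update(block)
    return sets


def random_rhs_training_sets(n, repetitions=6, seed=0):
    """Random RHS: each training set is an independent random half of the data."""
    rng = random.Random(seed)
    return [set(rng.sample(range(n), n // 2)) for _ in range(repetitions)]


if __name__ == "__main__":
    n = 40
    for name, sets in [
        ("block 3x2", block_3x2_training_sets(n)),
        ("balanced RHS (assumed design)", balanced_rhs_training_sets(n)),
        ("random RHS", random_rhs_training_sets(n)),
    ]:
        overlaps = pairwise_overlaps(sets)
        print(name, "min:", min(overlaps), "max:", max(overlaps), "sum:", sum(overlaps))
```

Under these assumptions, the block 3×2 training sets overlap in either 0 or n/4 samples, while the assumed balanced design overlaps in exactly n/5 samples for every pair. In this sketch both totals equal 3n, so the difference lies in how evenly the overlaps are spread; random half-sampling generally gives uneven overlaps.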

Authors: Yang Jing (杨静), Wang Ruibo (王瑞波), Li Jihong (李济洪)

Affiliations: School of Mathematical Sciences, Shanxi University; School of Software, Shanxi University

Source: Journal of Nanjing University (Natural Science) (《南京大学学报（自然科学版）》), indexed in CAS, CSCD, and the Peking University Core Journals list, 2015, No. 4, pp. 842-849 (8 pages)

Keywords: cross-validation; generalization error; block 3×2 cross-validation; RHS cross-validation

Classification number: O212 [Science: Probability Theory and Mathematical Statistics]


