Abstract
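In statistical machine learning, cross-validation splits a single dataset multiple times to construct repeated experiments and uses them to estimate the prediction error of a machine learning model. However, the stability of the cross-validation estimate is closely related to how the dataset is split: different splits lead to different numbers of samples shared between training sets, and when many samples are shared the cross-validation estimate has a large variance. To address this, a balanced RHS (Repeated Half-sampling) cross-validation is constructed that minimizes the total number of samples shared between training sets and keeps the number of shared samples balanced between any two splits, which reduces the variance of the generalization error estimate and thereby effectively improves its stability. It is proved theoretically that the variance of the balanced RHS cross-validation estimate with six repetitions is smaller than that of block 3×2 cross-validation, and simulation experiments further verify this conclusion. The experimental results also show that the variance of the six-repetition balanced RHS cross-validation estimate is smaller than that of the random RHS cross-validation estimate. Extensive experiments on real datasets further confirm these conclusions.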
In statistical machine learning, cross validation is a commonly employed technique for estimating prediction error and evaluating the performance of a machine learning model. The performance of the expected prediction error estimator has a significant impact on model selection, significance tests for differences in model prediction errors, and variable selection. To estimate the prediction error of a machine learning model, cross validation employs multiple splits of a dataset to construct several held-out experiments. However, a large number of experiments show that the variances of cross validation estimators are closely related to the splits employed. That is, the number of common samples between two training sets depends on the two corresponding splits of the dataset, and the more common samples there are, the larger the variance of the cross validation estimator. Therefore, this paper proposes a balanced repeated half-sampling (RHS) cross validation method with six repetitions. It guarantees that the sum of the numbers of common samples between the training sets is minimal and that these numbers of common samples are equal. These features give the balanced RHS estimator of generalization error a smaller variance and effectively improve its stability. Furthermore, we theoretically prove that the variance of our balanced RHS cross validation estimator is less than that of the block 3×2 cross validation estimator, a result well supported by our experiments on simulated datasets for regression and classification tasks. In addition, the experimental results show that the variance of the proposed method is less than that of the random six-repetition half-sampling cross validation estimator. To strengthen these results, we repeated the experiments on 10 real datasets from UCI and the KEEL software tool.
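The quantity the abstract turns on, the number of samples shared by two training sets, can be made concrete with a small sketch. The Python below is an illustration only and does not reproduce the paper's balanced RHS construction, which the abstract does not spell out. The function names are hypothetical, and the block 3×2 layout (four equal blocks paired in the three possible ways, each half serving once as a training set) is an assumption about how that scheme is usually built.

```python
import itertools
import random


def random_half_samples(n, repetitions, seed=0):
    """Draw `repetitions` random training sets, each holding n // 2 sample indices."""
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(n), n // 2)) for _ in range(repetitions)]


def block_3x2_training_sets(n):
    """Six training sets from the three ways of pairing four equal blocks.

    Assumes n is divisible by 4; this layout is an assumption, not taken
    from the abstract.
    """
    size = n // 4
    blocks = [range(k * size, (k + 1) * size) for k in range(4)]
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    training_sets = []
    for pairing in pairings:
        for half in pairing:
            # Each half of a pairing is used once as the training set.
            members = itertools.chain.from_iterable(blocks[i] for i in half)
            training_sets.append(frozenset(members))
    return training_sets


def pairwise_overlaps(training_sets):
    """Number of common samples for every unordered pair of training sets."""
    return [len(a & b) for a, b in itertools.combinations(training_sets, 2)]


if __name__ == "__main__":
    n = 40
    block = pairwise_overlaps(block_3x2_training_sets(n))
    rand = pairwise_overlaps(random_half_samples(n, 6, seed=1))
    print("block 3x2 overlaps:", sorted(block), "sum =", sum(block))
    print("random RHS overlaps:", sorted(rand), "sum =", sum(rand))
```

Under these assumptions, with n = 40 the block 3×2 training sets give three complementary pairs with no common samples and twelve pairs sharing exactly 10 samples (n/4), while the overlaps of random half-samples vary from pair to pair. That variation is the imbalance the abstract says a balanced design avoids.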
Source
《南京大学学报(自然科学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2015, Issue 4, pp. 842-849 (8 pages)
Journal of Nanjing University(Natural Science)