Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
Two of the most commonly used resampling methods are cross-validation and the bootstrap.
Resampling methods can be useful in model assessment, the process of evaluating a model’s performance, or in model selection, the process of selecting the proper level of flexibility.
Cross-validation
The Validation Set Approach
Randomly divide the data into a training set and a validation set. The model is fit on the training set, and its prediction performance on the validation set provides an estimate of overall performance.
In the case of a quantitative response, prediction performance is measured by the mean-squared error. The validation set approach estimates the "true" test MSE, $\mathrm{MSE}_{\text{test}}$, with the mean-squared error $\mathrm{MSE}_{\text{validation}}$ computed on the validation set.
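As a quick illustration, here is a minimal sketch of the validation set approach, assuming scikit-learn and a synthetic linear dataset (the data-generating process and the 50/50 split are arbitrary choices for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))                # hypothetical data
y = 2.0 * X[:, 0] + rng.normal(size=200)

# Randomly split into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)
mse_validation = mean_squared_error(y_val, model.predict(X_val))
print(f"validation estimate of test MSE: {mse_validation:.3f}")
```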
Advantages
conceptual simplicity
ease of implementation
low computational resources
Disadvantages
the validation estimate is highly variable: it depends strongly on the particular train/validation split
since the model is fit on only a subset of the data, the validation estimate tends to overestimate the test error rate of a model trained on the entire dataset
Leave-One-Out Cross Validation
Given paired observations $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, for each $1 \leq i \leq n$:
Divide the data $D$ into a training set $D_i = D \setminus \{(x_i, y_i)\}$ and a validation set $\{(x_i, y_i)\}$.
Train a model $M_i$ on $D_i$ and use it to predict $\hat{y}_i$.
The LOOCV estimate for $\mathrm{MSE}_{\text{test}}$ is

$$CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}_i$$

where $\mathrm{MSE}_i = (y_i - \hat{y}_i)^2$ is the error on the $i$-th singleton validation set.
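A minimal sketch of LOOCV as an explicit loop over singleton validation sets (the synthetic data and linear model are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(size=100)

n = len(y)
mse = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i                 # D_i = D \ {(x_i, y_i)}
    model = LinearRegression().fit(X[mask], y[mask])
    y_hat = model.predict(X[i:i + 1])[0]     # predict the held-out point
    mse[i] = (y[i] - y_hat) ** 2             # MSE_i on the singleton validation set

cv_n = mse.mean()                            # CV_(n), the LOOCV estimate
print(f"LOOCV estimate of test MSE: {cv_n:.3f}")
```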
k-Fold Cross Validation
Given paired observations $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, divide the data $D$ into $K$ folds (sets) $D_1, \dots, D_K$ of roughly equal size. Then for each $1 \leq k \leq K$:
Train a model $M_k$ on $\bigcup_{j \neq k} D_j$ and validate on $D_k$.
The $K$-fold CV estimate for $\mathrm{MSE}_{\text{test}}$ is

$$CV_{(K)} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}_k$$

where $\mathrm{MSE}_k$ is the mean-squared error on the validation fold $D_k$.
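A minimal sketch of the $K$-fold estimate with $K = 10$, assuming scikit-learn's KFold for the fold assignment (the data and model are again synthetic, for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
mse_k = []
for train_idx, val_idx in kf.split(X):       # train on the union of the other folds
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[val_idx] - model.predict(X[val_idx])
    mse_k.append(np.mean(resid ** 2))        # MSE_k on fold D_k

cv_k = np.mean(mse_k)                        # CV_(K)
print(f"10-fold CV estimate of test MSE: {cv_k:.3f}")
```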
Advantages
computationally faster than LOOCV if $K < n$
less variance than the validation set approach or LOOCV
Disadvantages
more biased than LOOCV if $K < n$
Bias-Variance Tradeoff for k-fold Cross Validation
As $K \to n$, bias ↓ but variance ↑.
Cross-Validation on Classification Problems
In the classification setting, we define the LOOCV estimate
$$CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Err}_i$$

where $\mathrm{Err}_i = I(y_i \neq \hat{y}_i)$. The $K$-fold CV and validation error rates are defined analogously.
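A minimal sketch of the LOOCV error rate for a classifier, assuming scikit-learn's LeaveOneOut splitter and a synthetic binary problem (each split scores a single observation, so one minus the per-split accuracy is exactly $\mathrm{Err}_i$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(size=100) > 0).astype(int)

# Each LOOCV split scores one held-out observation (accuracy 0 or 1),
# so 1 - accuracy is Err_i = I(y_i != y_hat_i)
acc = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
cv_n = np.mean(1 - acc)                      # LOOCV error rate CV_(n)
print(f"LOOCV error rate: {cv_n:.3f}")
```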
The Bootstrap
The bootstrap is a method for estimating the standard error of a statistic or of a statistical learning method. For an estimator $\hat{S}$ of a statistic $S$, it proceeds as follows:
Given a dataset $D$ with $|D| = n$, for $1 \leq i \leq B$:
Create a bootstrap dataset $D_i^*$ by sampling uniformly $n$ times with replacement from $D$.
Calculate the statistic on $D_i^*$ to get a bootstrap estimate $S_i^*$ of $S$.
Then the bootstrap estimate $\hat{\mathrm{se}}(\hat{S})$ for $\mathrm{se}(\hat{S})$ is the sample standard deviation of the bootstrap estimates $S_1^*, \dots, S_B^*$:
$$\hat{\mathrm{se}}(\hat{S}) = \sqrt{\frac{1}{B - 1} \sum_{i=1}^{B} \left( S_i^* - \bar{S}^* \right)^2}$$

where $\bar{S}^* = \frac{1}{B} \sum_{i=1}^{B} S_i^*$.
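A minimal sketch of the bootstrap standard error, here for the sample median of a synthetic dataset (the choice of statistic, $B = 1000$, and the data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(size=100)             # hypothetical sample D with |D| = n

B = 1000
boot_stats = np.empty(B)
for b in range(B):
    # D_b*: sample n points uniformly, with replacement, from D
    sample = rng.choice(data, size=len(data), replace=True)
    boot_stats[b] = np.median(sample)        # bootstrap estimate S_b* of S

# sample standard deviation of S_1*, ..., S_B* (the 1/(B-1) formula above)
se_hat = boot_stats.std(ddof=1)
print(f"bootstrap estimate of se(median): {se_hat:.4f}")
```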
Footnotes
$CV_{(n)}$ is sometimes called the LOOCV error rate: it can be seen as the average error rate over the singleton validation sets $\{(x_i, y_i)\}$. ↩
$\mathrm{MSE}_i$ is just the mean-squared error of the model $M_i$ on the validation set $\{(x_i, y_i)\}$. It is an approximately unbiased estimator of $\mathrm{MSE}_{\text{test}}$, but it has high variance. As the average of the $\mathrm{MSE}_i$, however, $CV_{(n)}$ has much lower variance. ↩