ISLR: notes and exercises from An Introduction to Statistical Learning

5. Resampling Methods


  • Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.

  • Two of the most commonly used resampling methods are cross-validation and the bootstrap.

  • Resampling methods can be useful in model assessment, the process of evaluating a model’s performance, or in model selection, the process of selecting the proper level of flexibility.

Cross-validation

The Validation Set Approach

  • Randomly divide the data into a training set and a validation set. The model is fit on the training set, and its prediction performance on the validation set provides an estimate of overall performance.

  • In the case of a quantitative response, prediction performance is measured by the mean squared error. The validation set approach estimates the true test $\text{MSE}$ with the mean squared error $\text{MSE}_{validation}$ computed on the validation set (see the sketch after the lists below).

Advantages
  • conceptual simplicity
  • ease of implementation
  • low computational cost
Disadvantages
  • the validation estimate is highly variable - it depends strongly on the particular train/validation split
  • since the model is fit on only a subset of the observations, the validation estimate tends to overestimate the test error rate of the model fit on the entire dataset
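
A minimal sketch of the validation set approach, assuming Python with numpy and simulated data (not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: y = 1 + 2x + noise (illustrative assumption, not from the book)
n = 200
x = rng.uniform(0, 5, size=n)
y = 1 + 2 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column

# Randomly split the observations into a training set and a validation set
idx = rng.permutation(n)
train, val = idx[: n // 2], idx[n // 2:]

# Fit simple linear regression by least squares on the training set only
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# The validation MSE is the estimate of the test MSE
mse_validation = np.mean((y[val] - X[val] @ beta) ** 2)
print(mse_validation)
```

Re-running this with a different random split typically gives a noticeably different MSE estimate, which illustrates the variability noted above.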

Leave-One-Out Cross Validation

Given paired observations $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, for each $1 \leqslant i \leqslant n$:

  • Divide the data $\mathcal{D}$ into a training set $\mathcal{D}_{i} = \mathcal{D} \setminus \{(x_i, y_i)\}$ and a validation set $\{(x_i, y_i)\}$.
  • Train a model $\mathcal{M}_i$ on $\mathcal{D}_{i}$ and use it to predict $\hat{y}_i$.
  • The LOOCV estimate for $\text{MSE}_{test}$ is

    $$CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \text{MSE}_i$$

    where $\text{MSE}_i = (y_i - \hat{y}_i)^2$.[^1]

Advantages
  • approximately unbiased
  • deterministic - doesn’t depend on a random train/test split.
  • computationally fast in least squares regression, where $CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$, $\hat{y}_i$ is the fitted value from the single fit on the full data, and $h_i$ is the leverage of point $i$ (see the sketch after this list)
Disadvantages
  • computationally expensive[^2] in general
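
A minimal sketch of LOOCV for least squares regression, assuming Python with numpy and simulated data (not from the book); it computes $CV_{(n)}$ both by the explicit leave-one-out loop and by the leverage shortcut above, and the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2 + 3x + noise (illustrative assumption, not from the book)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(scale=2.0, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column

# Leverage shortcut: one fit on the full data, h_i = diag(H) with H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
y_hat = H @ y
cv_shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)

# Explicit LOOCV loop: n separate least squares fits, one per held-out observation
sq_errors = []
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    sq_errors.append((y[i] - X[i] @ beta) ** 2)
cv_loop = np.mean(sq_errors)

print(cv_shortcut, cv_loop)  # identical up to floating point error
```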

k-fold Cross-Validation

Given paired observations $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, divide the data $\mathcal{D}$ into $K$ folds (sets) $\mathcal{D}_1, \dots, \mathcal{D}_K$ of roughly equal size.[^3] Then for each $1 \leqslant k \leqslant K$:

  • Train a model $\mathcal{M}_k$ on $\cup_{j\neq k} \mathcal{D}_{j}$ and validate on $\mathcal{D}_k$.
  • The $K$-fold CV estimate for $\text{MSE}_{test}$ is

    $$CV_{(K)} = \frac{1}{K}\sum_{k=1}^K \text{MSE}_k$$

    where $\text{MSE}_k$ is the mean squared error on the validation set $\mathcal{D}_k$ (see the sketch after the lists below).

Advantages
  • computationally faster than LOOCV if $K < n$
  • less variance than the validation set approach or LOOCV
Disadvantages
  • more biased than LOOCV if $K < n$
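
A minimal sketch of $K$-fold cross-validation, assuming Python with numpy and simulated data (not from the book):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (illustrative assumption, not from the book)
n, K = 200, 5
x = rng.uniform(0, 5, size=n)
y = 1 + 2 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column

# Randomly assign each observation to one of K folds of roughly equal size
folds = rng.permutation(np.arange(n) % K)

fold_mse = []
for k in range(K):
    train, val = folds != k, folds == k
    # Fit on the K-1 training folds, evaluate on the held-out fold
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    fold_mse.append(np.mean((y[val] - X[val] @ beta) ** 2))

cv_K = np.mean(fold_mse)  # CV_(K): the average of the K fold MSEs
print(cv_K)
```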

Bias-Variance Tradeoff for k-fold Cross Validation

As $K \rightarrow n$, bias $\downarrow$ but variance $\uparrow$: the training sets grow nearly as large as the full dataset, which reduces bias, but the $K$ fitted models become highly correlated with one another, so the average of their errors has higher variance.

Cross-Validation on Classification Problems

In the classification setting, we define the LOOCV estimate

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^n \text{Err}_i$$

where $\text{Err}_i = I(y_i \neq \hat{y}_i)$.[^5] The $K$-fold CV and validation error rates are defined analogously.
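
A minimal sketch of the LOOCV and $K$-fold error rates for a classifier; scikit-learn, logistic regression, and the simulated data are illustrative assumptions, not from the book:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Simulated two-class data (illustrative assumption)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

model = LogisticRegression(max_iter=1000)

# cross_val_score returns per-split accuracy; the CV error rate is 1 - mean accuracy,
# i.e. the average of Err_i = I(y_i != yhat_i) over the validation sets
loocv_error = 1 - cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
kfold_error = 1 - cross_val_score(model, X, y, cv=10).mean()

print(loocv_error, kfold_error)
```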

The Bootstrap

The bootstrap is a method for estimating the standard error of a statistic[^4] or of a quantity produced by a statistical learning method. In the case of an estimator $\hat{S}$ for a statistic $S$, it proceeds as follows:

Given a dataset $\mathcal{D}$ with $|\mathcal{D}| = n$, for $1 \leqslant i \leqslant B$:

  • Create a bootstrap dataset $\mathcal{D}^\ast_i$ by sampling uniformly, with replacement, $n$ times from $\mathcal{D}$.
  • Calculate the statistic $S$ on $\mathcal{D}^\ast_i$ to get a bootstrap estimate $S^\ast_i$ of $S$.

Then the bootstrap estimate of $\mathbf{se}(\hat{S})$ is the sample standard deviation of the bootstrap estimates $S^\ast_1, \dots, S^\ast_B$:

$$\hat{se}(\hat{S}) = \sqrt{\frac{1}{B-1} \sum_{i = 1}^{B} \left(S^\ast_i - \overline{S^\ast}\right)^2}$$

where $\overline{S^\ast} = \frac{1}{B}\sum_{i=1}^B S^\ast_i$.
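
A minimal sketch of the bootstrap standard error, assuming Python with numpy; the statistic $S$ is taken to be the sample median and the data are simulated (illustrative assumptions, not from the book):

```python
import numpy as np

rng = np.random.default_rng(3)

# Original sample of size n (illustrative assumption)
data = rng.standard_normal(100)
B = 1000

# For each of B bootstrap datasets, sample n observations with replacement
# and compute the statistic S (here, the sample median) on the resample
boot_stats = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(B)
])

# Bootstrap estimate of se(S): sample standard deviation of the B bootstrap estimates
se_hat = boot_stats.std(ddof=1)  # ddof=1 matches the 1/(B-1) in the formula above
print(se_hat)
```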


Footnotes

[^1]: $\text{MSE}_i$ is just the mean squared error of the model $\mathcal{M}_i$ on the singleton validation set $\{(x_i, y_i)\}$. It is an approximately unbiased estimator of $\text{MSE}_{test}$, but it has high variance. As the average of the $\text{MSE}_i$, however, $CV_{(n)}$ has much lower variance.

[^2]: Specifically $O(n \cdot \text{model fit time})$.

[^3]: LOOCV is then $K$-fold CV in the case $K = n$. Analogously, $CV_{(K)}$ is sometimes called the $K$-fold CV error rate, the average error over the folds.

[^4]: Recall that a statistic $S$ is just a function of a sample, $S = S(X_1, \dots, X_n)$.

[^5]: $CV_{(n)}$ is sometimes called the LOOCV error rate; it can be seen as the average error rate over the singleton validation sets $\{(x_i, y_i)\}$.