ISLR notes and exercises from An Introduction to Statistical Learning

6. Linear Model Selection and Regularization

Alternatives to the least squares fitting procedure can yield better

  • prediction accuracy
  • model interpretability

Subset Selection

Methods for selecting a subset of the predictors to improve test performance

Best Subset Selection

Algorithm: Best Subset Selection (BSS) for linear regression
  1. Let $\mathcal{M}_0$ denote the null model[^1]
  2. For $1 \leqslant k \leqslant p$:
    a. Fit all $\binom{p}{k}$ linear regression models with $k$ predictors
    b. Let $\mathcal{M}_k = \underset{\text{models}}{\text{argmin}}\ RSS$
  3. Choose the best model $\mathcal{M}_i$, $0 \leqslant i \leqslant p$, based on estimated test error[^2]

For logistic regression, in step 2(b), let $\mathcal{M}_k = \underset{\text{models}}{\text{argmin}}\ D(y, \hat{y})$, where $D(y, \hat{y})$ is the deviance[^3] of the model
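
A minimal sketch of BSS in Python (my own illustration, not the book's code), assuming a numeric predictor matrix `X` and response vector `y`, and using RSS to pick the best model of each size:

```python
from itertools import combinations

import numpy as np


def best_subset_selection(X, y):
    """Return the lowest-RSS model of each size k = 0, ..., p (steps 1 and 2 of BSS)."""
    n, p = X.shape
    ones = np.ones((n, 1))
    models = {0: ((), np.sum((y - y.mean()) ** 2))}   # M_0: the null model
    for k in range(1, p + 1):
        best = None
        for subset in combinations(range(p), k):      # all (p choose k) subsets of size k
            Z = np.hstack([ones, X[:, list(subset)]])  # intercept + chosen predictors
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = np.sum((y - Z @ beta) ** 2)
            if best is None or rss < best[1]:
                best = (subset, rss)
        models[k] = best                              # M_k: lowest-RSS model with k predictors
    # Step 3 (choosing among M_0, ..., M_p) uses CV, C_p, BIC, or adjusted R^2.
    return models
```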

Advantages

  • Slightly faster than brute force. Model evaluation is $O(p)$ as opposed to $O(2^p)$ for brute force.
  • Conceptually simple

Disadvantages

  • Still very slow. Fitting is $O(2^p)$, as for brute force
  • Overfitting and high variance of coefficient estimates when $p$ is large

Stepwise Selection

Forward Stepwise Selection

Algorithm: Forward Stepwise Selection (FSS) for linear regression[^4]
  1. Let $\mathcal{M}_0$ denote the null model
  2. For $0 \leqslant k \leqslant p - 1$:
    a. Fit all $p - k$ linear regression models that augment model $\mathcal{M}_k$ with one additional predictor
    b. Let $\mathcal{M}_{k+1} = \underset{\text{models}}{\text{argmin}}\ RSS$
  3. Choose the best model $\mathcal{M}_i$, $0 \leqslant i \leqslant p$, based on estimated test error
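
A greedy forward-stepwise sketch in Python (again an illustration, not the book's code), with the same assumed `X` and `y` as above:

```python
import numpy as np


def forward_stepwise(X, y):
    """Greedy forward stepwise selection: returns the chosen M_0, ..., M_p."""
    n, p = X.shape
    ones = np.ones((n, 1))

    def rss(cols):
        Z = np.hstack([ones, X[:, cols]]) if cols else ones
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return np.sum((y - Z @ beta) ** 2)

    selected, remaining = [], set(range(p))
    models = {0: ((), rss([]))}                        # M_0: the null model
    for k in range(p):
        # Fit the p - k models that augment M_k with one remaining predictor
        # and keep the one with the lowest RSS.
        j_best = min(remaining, key=lambda j: rss(selected + [j]))
        selected.append(j_best)
        remaining.discard(j_best)
        models[k + 1] = (tuple(selected), rss(selected))
    return models
```
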
Advantages
  • Faster than BSS. Fitting is $O(p^2)$ and evaluation is $O(p)$
  • Can be applied in the high-dimensional setting $n < p$
Disadvantages
  • Evaluation is more challenging since it compares models with different numbers of predictors.
  • Searches less of the model space than BSS, hence may select a suboptimal model

Backward Stepwise Selection

Algorithm: Backward Stepwise Selection (BKSS) for linear regression[^5]
  1. Let $\mathcal{M}_p$ denote the full model[^6]
  2. For $k = p, p-1, \dots, 1$:
    a. Fit all $k$ linear regression models of $k - 1$ predictors that contain all but one of the predictors in $\mathcal{M}_k$
    b. Let $\mathcal{M}_{k-1} = \underset{\text{models}}{\text{argmin}}\ RSS$
  3. Choose the best model $\mathcal{M}_i$, $0 \leqslant i \leqslant p$, based on estimated test error
Advantages
  • As fast as FSS
Disadvantages
  • Same disadvantages as FSS
  • Cannot be used when $n < p$

Hybrid Approaches

Other approaches exist which add variables sequentially (as with FSS) but may also remove variables (as with BKSS). These methods strike a balance between the optimality of BSS and the speed of FSS/BKSS

Choosing the Optimal Model

Two common approaches to estimating the test error:

  1. Estimate indirectly by adjusting the training error to account for overfitting bias
  2. Estimate directly using a validation approach

$C_p$, AIC, BIC and Adjusted $R^2$

  • Training MSE underestimates the test MSE and decreases as $p$ increases, so it cannot be used to select from models with different numbers of predictors. However, we may adjust the training error to account for the model size and use this to estimate the test MSE

  • For least squares models, the $C_p$ estimate[^7] of the test MSE for a model with $d$ predictors is

$$C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)$$

    where $\hat{\sigma}^2 = \hat{\mathbb{V}}(\epsilon)$.

  • For maximum likelihood models[^8], the Akaike Information Criterion (AIC) estimate of the test MSE is

$$AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2)$$

  • For least squares models, the Bayes Information Criterion (BIC) estimate[^9] of the test MSE is

$$BIC = \frac{1}{n}(RSS + \log(n)d\hat{\sigma}^2)$$

  • For least squares models, the adjusted $R^2$ statistic[^10] is

$$\text{Adjusted } R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS/(n - 1)}$$
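
A small sketch (my own, not from the text) that computes these adjusted-error criteria from a fitted model's RSS, assuming $\hat{\sigma}^2$ has been estimated from the full model's residuals:

```python
import numpy as np


def adjusted_criteria(rss, tss, n, d, sigma2_hat):
    """C_p, AIC, BIC, and adjusted R^2 for a least squares fit with d predictors."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "AdjR2": adj_r2}
```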

Validation and Cross-Validation

  • Instead of using the adjusted training error to estimate the test error indirectly, we can estimate it directly using validation or cross-validation
  • In the past this was computationally prohibitive but advances in computation have made this method very attractive.
  • In this approach, we can select a model using the one-standard-error rule, i.e. choosing the smallest model whose estimated test error is within one standard error of the lowest point on the estimated-test-error curve.
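
For instance, a one-standard-error rule over cross-validation estimates might look like the following sketch (the inputs are hypothetical arrays of per-size CV means and standard errors):

```python
import numpy as np


def one_standard_error_rule(sizes, cv_mean, cv_se):
    """Return the smallest model size whose CV error is within one SE of the minimum."""
    sizes, cv_mean, cv_se = map(np.asarray, (sizes, cv_mean, cv_se))
    best = np.argmin(cv_mean)
    threshold = cv_mean[best] + cv_se[best]          # one SE above the lowest CV error
    eligible = sizes[cv_mean <= threshold]
    return eligible.min()
```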

Shrinkage Methods

Methods for constraining or regularizing the coefficient estimates, i.e. shrinking them towards zero. This can significantly reduce their variance.

Ridge Regression

  • Ridge regression introduces an $L^2$ penalty[^11] on the training objective and estimates $\hat{\beta}^R = \underset{\beta}{\text{argmin}}\left(RSS + \lambda\|\tilde{\beta}\|_2^2\right)$, where $\lambda \geqslant 0$ is a tuning parameter[^12] and $\tilde{\beta} = (\beta_1, \dots, \beta_p)$[^13] (a closed-form sketch follows this list).

  • The term $\lambda\|\tilde{\beta}\|_2^2$ is called a shrinkage penalty

  • Selecting a good value for $\lambda$ is critical; see section 6.2.3

  • Standardizing the predictors $X_i \mapsto \frac{X_i - \mu_i}{s_i}$ is advised.
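
A minimal closed-form sketch (assuming standardized predictors and the penalized-RSS objective above; not the book's lab code):

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge estimate beta_hat = argmin RSS + lam * ||beta||_2^2 on standardized predictors."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize the predictors
    yc = y - y.mean()                                # centering leaves the intercept unpenalized
    p = Xs.shape[1]
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    return y.mean(), beta                            # intercept, shrunken coefficients
```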

Advantages
  • Takes advantage of the bias-variance tradeoff by decreasing flexibility[^14], thus decreasing variance.
  • Preferable to least squares in situations where the latter has high variance (relationship close to linear, $p \lesssim n$)
  • In contrast to least squares, works when $p > n$
Disadvantages
  • Lower variance means higher bias.
  • Will not eliminate any predictors, which can be an issue for interpretation when $p$ is large.

The Lasso

  • Lasso regression introduces an $L^1$ penalty[^15] on the training objective and estimates

$$\hat{\beta}^L = \underset{\beta}{\text{argmin}}\left(RSS + \lambda\|\tilde{\beta}\|_1\right)$$
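
A quick scikit-learn illustration of the sparsity this produces on toy data (note that scikit-learn's `Lasso` minimizes $\frac{1}{2n}RSS + \alpha\|\beta\|_1$, a rescaling of the objective above):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)   # only 2 of 10 predictors matter

lasso = Lasso(alpha=0.3).fit(StandardScaler().fit_transform(X), y)
print(lasso.coef_)   # most irrelevant coefficients are exactly zero: the lasso selects variables
```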

Advantages
  • Same advantages as ridge regression.
  • Improves over ridge regression by yielding sparse models (i.e. performs variable selection) when $\lambda$ is sufficiently large
Disadvantages
  • Lower variance means higher bias.
Another Formulation for Ridge Regression and the Lasso
  • Ridge Regression is equivalent to the quadratic optimization problem:

$$\begin{aligned} \min&\ RSS \\ \text{s.t.}&\ \|\tilde{\beta}\|_2^2 \leqslant s \end{aligned}$$

  • Lasso Regression is equivalent to the quadratic optimization problem:

$$\begin{aligned} \min&\ RSS \\ \text{s.t.}&\ \|\tilde{\beta}\|_1 \leqslant s \end{aligned}$$

Bayesian Interpretation for Ridge and Lasso Regression

Given Gaussian errors, and simple assumptions on the prior $p(\beta)$, ridge and lasso regression emerge as solutions

  • If $\beta_i \sim \text{Normal}(0, h(\lambda))$ iid for some function $h = h(\lambda)$, then the posterior mode for $\beta$ (i.e. $\underset{\beta}{\text{argmax}}\ p(\beta \mid X, Y)$) is the ridge regression solution

  • If $\beta_i \sim \text{Laplace}(0, h(\lambda))$ iid, then the posterior mode is the lasso regression solution.

Selecting the Tuning Parameter

Compute the cross-validation error $CV_{(n),i}$ for a “grid” (evenly-spaced discrete set) of values $\lambda_i$, and choose

$$\lambda = \underset{i}{\text{argmin}}\ CV_{(n),i}$$
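
A sketch of this grid search using scikit-learn's `cross_val_score` with ridge regression on toy data (any of the penalized fits above could be swapped in):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)

lambdas = np.logspace(-3, 3, 50)                     # grid of candidate lambda values
cv_error = [
    -cross_val_score(Ridge(alpha=lam), X, y,
                     scoring="neg_mean_squared_error", cv=10).mean()
    for lam in lambdas
]
best_lambda = lambdas[int(np.argmin(cv_error))]      # argmin of the CV error over the grid
```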

Dimension Reduction Methods

  • Dimension reduction methods transform the predictors $X_1, \dots, X_p$ into a smaller set of predictors $Z_1, \dots, Z_M$, $M < p$.

  • When $p \gg n$, taking $M \ll p$ can greatly reduce the variance of the coefficient estimates.

  • In this section we consider linear transformations

$$Z_m = \sum_{j = 1}^p \phi_{jm}X_j$$

    and a least squares regression model

$$Y = \mathbf{Z}\theta + \epsilon$$

    where $\mathbf{Z} = (1, Z_1, \dots, Z_M)$

Principal Components Regression

Principal Components Analysis is a popular unsupervised approach[^16] that can be used for dimension reduction

An Overview of Principal Components Analysis
  • The principal component directions of an $n \times p$ data matrix $\mathbf{X}$ (with centered columns) can be seen, among many equivalent perspectives, as the eigenvectors $v_1, \dots, v_p$ of the $p \times p$ sample covariance matrix $C = \frac{1}{n-1}\mathbf{X}^{\top}\mathbf{X}$ (equivalently, the right singular vectors of $\mathbf{X}$), ordered by decreasing corresponding eigenvalues.

  • Let $\sigma_1^2 \geqslant \dots \geqslant \sigma_p^2$ be the eigenvalues of $C$ and let $v_1, \dots, v_p$ be the corresponding eigenvectors. Then $\sigma_i^2$ is the variance of the data along the direction $v_i$, and $v_1$ is the direction of maximal variance.
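
A numerical check of this equivalence in numpy (toy data; eigenvectors and right singular vectors agree up to sign):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

Xc = X - X.mean(axis=0)                      # center the columns
C = Xc.T @ Xc / (Xc.shape[0] - 1)            # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]            # reorder by decreasing variance
variances, directions = eigvals[order], eigvecs[:, order]

# The same directions come from the SVD of the centered data matrix,
# with variances sigma_i^2 = (singular value)^2 / (n - 1).
_, svals, Vt = np.linalg.svd(Xc, full_matrices=False)
np.testing.assert_allclose(svals**2 / (Xc.shape[0] - 1), variances)
```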

The Principal Components Regression Approach
  • Principal Components Regression takes $Z_1, \dots, Z_M$ to be the first $M$ principal components of $\mathbf{X}$ and then fits a least squares model on these components.
  • The assumption is that, since the principal components correspond to the directions of greatest variation of the data, they show the most association with $Y$. Furthermore, they are ordered by decreasing magnitude of association.
  • Typically $M$ is chosen by cross-validation.
Advantages
  • If the assumption holds, then the least squares model on $Z_1, \dots, Z_M$ will perform better than the model on $X_1, \dots, X_p$, since it will contain most of the information related to the response[^17], and by choosing $M \ll p$ we can mitigate overfitting.

  • Decreased variance of coefficient estimates relative to OLS regression

Disadvantages
  • Is not a feature selection method, since each $Z_i$ is a linear function of the predictors
Recommendations
  • Data should usually be standardized prior to finding the principal components.
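
A PCR sketch as a scikit-learn pipeline (standardize, project onto the first $M$ components, then least squares), with $M$ chosen by cross-validation on toy data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=8) + rng.normal(size=100)

pcr = Pipeline([("scale", StandardScaler()),         # standardize the predictors
                ("pca", PCA()),                      # Z_1, ..., Z_M
                ("ols", LinearRegression())])        # least squares on the components
search = GridSearchCV(pcr, {"pca__n_components": range(1, 9)},
                      scoring="neg_mean_squared_error", cv=10)
search.fit(X, y)
print(search.best_params_["pca__n_components"])      # M chosen by cross-validation
```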

Partial Least Squares

A supervised dimension reduction method which proceeds roughly as follows

  • Standardize the variables
  • Compute $Z_1$ by setting each $\phi_{j1} = \hat{\beta}_j$, the coefficient from the simple linear regression of $Y$ onto $X_j$
  • For $1 < m \leqslant M$, $Z_m$ is determined by:
    • Adjusting the data: replace each $X_j$ by the residual $\epsilon_j$ from regressing $X_j$ onto $Z_{m-1}$
    • Computing $Z_m$ in the same fashion as $Z_1$ on the adjusted data

As with PCR, MM is chosen by cross-validation
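
A PLS sketch using scikit-learn's `PLSRegression`, again choosing $M$ by cross-validation on toy data:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=8) + rng.normal(size=100)

# PLSRegression(scale=True) standardizes the variables internally.
search = GridSearchCV(PLSRegression(scale=True),
                      {"n_components": range(1, 9)},
                      scoring="neg_mean_squared_error", cv=10)
search.fit(X, y)
print(search.best_params_["n_components"])           # M chosen by cross-validation
```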

Advantages
  • Decreased variance of coefficient estimates relative to OLS regression
  • Supervised dimension reduction may reduce bias
Disadvantages
  • May increase variance relative to PCR (which is unsupervised).
  • May be no better than PCR in practice

Considerations in High Dimensions

High-Dimensional Data

Low-dimensional means $p \ll n$; high-dimensional means $p \gtrsim n$

What Goes Wrong in High Dimensions?

  • If $p \gtrsim n$, then a linear model can fit the training data perfectly and hence will overfit (usually badly)

  • The $C_p$, AIC, BIC, and $R^2$ approaches don't work well in this setting

Regression in High Dimensions

  1. Regularization or shrinkage plays a key role in high-dimensional problems.
  2. Appropriate tuning parameter selection is crucial for good predictive performance.
  3. The test error tends to increase as the dimensionality of the problem increases if the additional features aren’t truly associated with the response (the curse of dimensionality)

Interpreting Results in High Dimensions

  • The multicollinearity problem is extreme in the high-dimensional setting

  • This makes interpretation difficult, since models obtained from highly multicollinear data fail to identify which features are “preferred”

  • Care must be taken in measuring performance[^18]


Footnotes

[^1]: This is the model that predicts $\hat{y} = \overline{y}$, i.e. $\hat{\beta}_i = 0$ for $i \geqslant 1$ and $\hat{\beta}_0 = \overline{y}$

[^2]: Estimates of test error can come from CV, $C_p$ (AIC), BIC, or adjusted $R^2$

[^3]: Here $D(y, \hat{y}) = -2\log p(y \mid \hat{\beta})$, where $\hat{\beta}$ is the MLE for $\beta$; compare with the definition of deviance on the Wikipedia entry

[^4]: As with BSS, we can use FSS for logistic regression by replacing $RSS$ with the deviance in step 2(b).

[^5]: As with BSS, we can use BKSS for logistic regression by replacing $RSS$ with the deviance in step 2(b).

[^6]: Here "full" means containing all $p$ predictors.

[^7]: Thus $C_p$ is $RSS$ plus a penalty that depends on the number of predictors and the estimate of the error variance. One can show that if $\hat{\sigma}^2$ is unbiased, then $C_p$ is an unbiased estimate of the test MSE.

[^8]: For Gaussian errors, the least squares estimate is the maximum likelihood estimate, so in that case $C_p$ and AIC are proportional.

[^9]: The BIC places a heavier penalty than $C_p$ when $n > 7$, since then $\log(n) > 2$ in the $\log(n)d\hat{\sigma}^2$ term. The book says this means BIC penalizes models with many variables more heavily than $C_p$; indeed, for fixed $n > 7$ the difference between the two penalties is $(\log(n) - 2)d\hat{\sigma}^2$, which grows with the number of predictors $d$.

[^10]: $C_p$, AIC, and BIC are all estimates of the test MSE, so smaller values are better. By contrast, larger values of adjusted $R^2$ are better; this is equivalent to minimizing $RSS/(n - d - 1)$, which can plausibly be thought of as a test MSE estimate.

[^11]: Here $L^2$ refers to the $L^p$ norm (denoted $\|\cdot\|_2$) with $p = 2$ (see also p. 216), which is just the standard Euclidean norm.

[^12]: The tuning parameter $\lambda$ is actually a Lagrange multiplier used to turn a constrained optimization problem into an unconstrained one; see the alternative formulation above.

[^13]: We use $\tilde{\beta}$ instead of $\beta$ because we don't want to shrink the intercept $\beta_0$. If the data have been centered about their mean, then $\hat{\beta}_0 = \overline{y}$.

[^14]: Flexibility decreases because the shrinkage penalty effectively restricts the coefficient estimates to a smaller region of the parameter space (cf. the constrained formulation above).

[^15]: The $L^1$ norm is $\|\tilde{\beta}\|_1 = \sum_{i=1}^p |\beta_i|$

[^16]: Unsupervised since it only takes the predictors $\mathbf{X}$, and not the response $\mathbf{Y}$, as input.

[^17]: Under certain assumptions, PCA is an optimal dimension reduction method from an information-theoretic perspective.

[^18]: For example, SSE, $p$-values, and $R^2$ statistics computed from the training data are useless in this setting. Thus it is important to e.g. evaluate performance on an independent test set or use resampling methods