6. Linear Model Selection and Regularization
Alternatives to the least squares fitting procedure can yield better:
- prediction accuracy
- model interpretability
Subset Selection
Methods for selecting a subset of the predictors to improve test performance
Best Subset Selection
Algorithm: Best Subset Selection (BSS) for linear regression
- Let $M_0$ denote the null model
- For $1 \leq k \leq p$:
  - Fit all $\binom{p}{k}$ linear regression models with $k$ predictors
  - Let $M_k = \underset{\text{models}}{\operatorname{argmin}}\ \mathrm{RSS}$
- Choose the best model $M_i$, $0 \leq i \leq p$, based on estimated test error
For logistic regression, in step 2.A., let $M_k = \underset{\text{models}}{\operatorname{argmin}}\ D(y, \hat{y})$, where $D(y, \hat{y})$ is the deviance of the model
Advantages
- Slightly faster than brute force. Model evaluation is $O(p)$ as opposed to $O(2^p)$ for brute force.
- Conceptually simple
Disadvantages
- Still very slow. Fitting is $O(2^p)$, as for brute force
- Overfitting and high variance of coefficient estimates when p is large
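A rough sketch of how BSS might be implemented for linear regression (not from the text; it assumes a predictor matrix `X`, a response `y`, and uses scikit-learn's `LinearRegression` for the fits):

```python
# Best subset selection sketch: for each size k, fit all C(p, k) models and
# keep the one with the lowest training RSS.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset_selection(X, y):
    n, p = X.shape
    best_per_size = {}  # k -> (column indices of M_k, training RSS)
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for subset in combinations(range(p), k):
            cols = list(subset)
            model = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - model.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best_per_size[k] = (best_cols, best_rss)
    # The final choice among M_0, ..., M_p must use an estimate of test error
    # (cross-validation, Cp, AIC, BIC, ...), not the training RSS.
    return best_per_size
```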
Stepwise Selection
Forward Stepwise Selection
Algorithm: Forward Stepwise Selection (FSS) for linear regression
- Let $M_0$ denote the null model
- For $0 \leq k \leq p-1$:
  - Fit all $p-k$ linear regression models that augment model $M_k$ with one additional predictor
  - Let $M_{k+1} = \underset{\text{models}}{\operatorname{argmin}}\ \mathrm{RSS}$
- Choose the best model $M_i$, $0 \leq i \leq p$, based on estimated test error
Advantages
- Faster than BSS. Fitting is $O(p^2)$ and evaluation is $O(p)$
- Can be applied in the high-dimensional setting n<p
Disadvantages
- Evaluation is more challenging since it compares models with different numbers of predictors.
- Searches less of the model space, hence may yield a suboptimal model
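A rough sketch of forward stepwise selection (again not from the text; `X` and `y` are assumed to exist):

```python
# Forward stepwise selection sketch: grow the model one predictor at a time,
# adding whichever remaining variable most reduces the training RSS.
import numpy as np
from sklearn.linear_model import LinearRegression


def forward_stepwise(X, y):
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            model = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - model.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(list(selected))  # M_1, ..., M_p
    return path  # choose among these with an estimate of test error
```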
Backward Stepwise Selection
Algorithm: Backward Stepwise Selection (BKSS) for linear regression
- Let $M_p$ denote the full model
- For $k = p, p-1, \dots, 1$:
  - Fit all $k$ linear regression models of $k-1$ predictors that contain all but one of the predictors in $M_k$
  - Let $M_{k-1} = \underset{\text{models}}{\operatorname{argmin}}\ \mathrm{RSS}$
- Choose the best model $M_i$, $0 \leq i \leq p$, based on estimated test error
Advantages
- Same speed advantage as FSS: fits $O(p^2)$ models rather than $O(2^p)$
Disadvantages
- Same disadvantages as FSS
- Cannot be used when n<p
Hybrid Approaches
Other approaches exist which add variables sequentially (as with FSS) but may also remove variables (as with BKSS). These methods strike a balance between the optimality of BSS and the speed of FSS/BKSS.
Choosing the Optimal Model
Two common approaches to estimating the test error:
- Estimate indirectly by adjusting the training error to account for overfitting bias
- Estimate directly using a validation approach
$C_p$, AIC, BIC, and Adjusted $R^2$
- Train MSE underestimates test MSE and decreases as $p$ increases, so it cannot be used to select from models with different numbers of predictors. However, we may adjust the training error to account for the model size, and use this to estimate the test MSE.
- For least squares models, the $C_p$ estimate of the test MSE for a model with $d$ predictors is
  $$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$$
  where $\hat{\sigma}^2 = \hat{V}(\epsilon)$ is an estimate of the error variance.
- For maximum likelihood models, the Akaike Information Criterion (AIC) estimate of the test MSE is
  $$\mathrm{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$$
- For least squares models, the Bayesian Information Criterion (BIC) estimate of the test MSE is
  $$\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\hat{\sigma}^2\right)$$
- For least squares models, the adjusted $R^2$ statistic is
  $$\mathrm{Adj}\,R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}$$
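A minimal sketch of computing these criteria, assuming `rss`, `tss`, `n`, `d`, and an error-variance estimate `sigma2_hat` (typically from the full model) are already available; the function name and inputs are placeholders:

```python
# Adjusted-training-error criteria for a least squares fit with d predictors,
# following the formulas above.
import numpy as np


def selection_criteria(rss, tss, n, d, sigma2_hat):
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "AdjR2": adj_r2}
```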
Validation and Cross-Validation
- Instead of using the adjusted training error to estimate the test error indirectly, we can estimate it directly using a validation set or cross-validation
- In the past this was computationally prohibitive but advances in computation have made this method very attractive.
- In this approach, we can select a model using the one-standard-error rule, i.e. selecting the most parsimonious model whose estimated test error is within one standard error of the lowest point on the test error vs. $p$ curve (see the sketch below)
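A sketch of the one-standard-error rule, assuming `cv_errors` is a list, ordered from smallest model to largest, whose entries hold the per-fold CV errors of each candidate model (a hypothetical input from your own cross-validation loop):

```python
# One-standard-error rule: pick the most parsimonious model whose mean CV
# error is within one standard error of the minimum mean CV error.
import numpy as np


def one_standard_error_rule(cv_errors):
    means = np.array([np.mean(e) for e in cv_errors])
    sems = np.array([np.std(e, ddof=1) / np.sqrt(len(e)) for e in cv_errors])
    best = np.argmin(means)
    threshold = means[best] + sems[best]
    return int(np.min(np.where(means <= threshold)[0]))  # index of chosen model
```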
Shrinkage Methods
Methods for constraining or regularizing the coefficient estimates, i.e. shrinking them towards zero. This can significantly reduce their variance.
Ridge Regression
- Ridge regression adds an $L_2$ penalty to the training RSS and estimates
  $$\hat{\beta}^R = \underset{\beta}{\operatorname{argmin}}\left(\mathrm{RSS} + \lambda\lVert\tilde{\beta}\rVert_2^2\right)$$
  where $\lambda \geq 0$ is a tuning parameter and $\tilde{\beta} = (\beta_1, \dots, \beta_p)$ (the intercept is not penalized).
- The term $\lambda\lVert\tilde{\beta}\rVert_2^2$ is called a shrinkage penalty.
- Selecting a good value for $\lambda$ is critical; see section 6.2.3.
- Standardizing the predictors, $X_i \mapsto \frac{X_i - \mu_i}{s_i}$, is advised.
Advantages
- Takes advantage of the bias-variance tradeoff by decreasing flexibility, thus decreasing variance
- Preferable to least squares in situations where the latter has high variance (e.g. a close-to-linear relationship with $p \lesssim n$)
- In contrast to least squares, works when p>n
Disadvantages
- Lower variance means higher bias.
- Will not eliminate any predictors which can be an issue for interpretation when p is large.
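A minimal ridge regression sketch with scikit-learn; the synthetic data is for illustration only, and sklearn's `alpha` plays the role of $\lambda$:

```python
# Ridge regression on standardized predictors; coefficients shrink toward
# zero as alpha (lambda) grows, but are not set exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
ridge.fit(X, y)
print(ridge.named_steps["ridge"].coef_)
```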
The Lasso
- Lasso regression adds an $L_1$ penalty to the training RSS and estimates
  $$\hat{\beta}^L = \underset{\beta}{\operatorname{argmin}}\left(\mathrm{RSS} + \lambda\lVert\tilde{\beta}\rVert_1\right)$$
Advantages
- Same advantages as ridge regression.
- Improves over ridge regression by yielding sparse models (i.e. performs variable selection) when λ is sufficiently large
Disadvantages
- Lower variance means higher bias.
- Ridge regression is equivalent to the constrained optimization problem
  $$\min_{\beta}\ \mathrm{RSS} \quad \text{s.t.} \quad \lVert\tilde{\beta}\rVert_2^2 \leq s$$
- Lasso regression is equivalent to the constrained optimization problem
  $$\min_{\beta}\ \mathrm{RSS} \quad \text{s.t.} \quad \lVert\tilde{\beta}\rVert_1 \leq s$$
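A matching lasso sketch (synthetic data again); with a large enough `alpha`, several coefficients are exactly zero, i.e. the corresponding predictors are dropped:

```python
# Lasso on standardized predictors: the L1 penalty performs variable selection.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.2))
lasso.fit(X, y)
coef = lasso.named_steps["lasso"].coef_
print("selected predictors:", np.flatnonzero(coef))  # indices of nonzero coefficients
```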
Bayesian Interpretation for Ridge and Lasso Regression
Given Gaussian errors and simple assumptions on the prior $p(\beta)$, ridge and lasso regression emerge as posterior mode estimates:
- If $\beta_i \sim \mathrm{Normal}(0, h(\lambda))$ i.i.d. for some function $h = h(\lambda)$, then the posterior mode for $\beta$ (i.e. $\underset{\beta}{\operatorname{argmax}}\ p(\beta \mid X, Y)$) is the ridge regression solution
- If $\beta_i \sim \mathrm{Laplace}(0, h(\lambda))$ i.i.d., then the posterior mode is the lasso regression solution
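A quick sketch of the Gaussian case, writing $\tau^2 = h(\lambda)$ for the prior variance and assuming a known error variance $\sigma^2$ and a flat prior on the intercept: with $Y \mid X, \beta \sim N(X\beta, \sigma^2 I)$ and $\beta_1, \dots, \beta_p \sim N(0, \tau^2)$ i.i.d.,
$$\log p(\beta \mid X, Y) = -\frac{1}{2\sigma^2}\,\mathrm{RSS}(\beta) - \frac{1}{2\tau^2}\lVert\tilde{\beta}\rVert_2^2 + \mathrm{const},$$
so maximizing the posterior amounts to minimizing $\mathrm{RSS} + \lambda\lVert\tilde{\beta}\rVert_2^2$ with $\lambda = \sigma^2/\tau^2$; the Laplace prior yields the $L_1$ penalty in the same way.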
Selecting the Tuning Parameter
Compute the cross-validation error $\mathrm{CV}_{(n),i}$ for a "grid" (evenly-spaced discrete set) of values $\lambda_i$, and choose
$$\lambda = \lambda_{\hat{i}}, \qquad \hat{i} = \underset{i}{\operatorname{argmin}}\ \mathrm{CV}_{(n),i}$$
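A sketch of this grid search using 5-fold cross-validation on ridge regression (the synthetic data and grid endpoints are placeholders; sklearn calls the tuning parameter `alpha`):

```python
# Choose lambda by minimizing the cross-validated MSE over a grid.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

lambdas = np.logspace(-3, 3, 25)  # grid of candidate lambda values
cv_mse = [
    -cross_val_score(Ridge(alpha=lam), X, y, cv=5,
                     scoring="neg_mean_squared_error").mean()
    for lam in lambdas
]
best_lambda = lambdas[int(np.argmin(cv_mse))]
print(best_lambda)
```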
Dimension Reduction Methods
- Dimension reduction methods transform the predictors $X_1, \dots, X_p$ into a smaller set of predictors $Z_1, \dots, Z_M$, with $M < p$.
- When $p \gg n$, choosing $M \ll p$ can greatly reduce the variance of the coefficient estimates.
- In this section we consider linear transformations
  $$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$$
  and a least squares regression model
  $$Y = Z\theta + \epsilon$$
  where $Z = (1, Z_1, \dots, Z_M)$.
Principal Components Regression
Principal Components Analysis is a popular unsupervised approach that can be used for dimension reduction
An Overview of Principal Components Analysis
- The principal component directions of an $n \times p$ data matrix $X$ (with centered columns) can be seen, among many equivalent perspectives, as the eigenvectors $v_1, \dots, v_p$ of the $p \times p$ sample covariance matrix $C$ (equivalently, the right singular vectors of $X$), ordered by decreasing magnitude of the corresponding eigenvalues.
- Let $\sigma_1^2 \geq \dots \geq \sigma_p^2$ be the eigenvalues of $C$ and $v_1, \dots, v_p$ the corresponding eigenvectors. Then $\sigma_i^2$ is the variance of the data along the direction $v_i$, and $v_1$ is the direction of maximal variance.
The Principal Components Regression Approach
- Principal Components Regression takes $Z_1, \dots, Z_M$ to be the first $M$ principal components of $X$ and then fits a least squares model on these components.
- The assumption is that, since the principal components correspond to the directions of greatest variation of the data, they show the most association with Y. Furthermore, they are ordered by decreasing magnitude of association.
- Typically M is chosen by cross-validation.
Advantages
- If the assumption holds, then the least squares model on $Z_1, \dots, Z_M$ will perform better than the one on $X_1, \dots, X_p$, since it retains most of the information related to the response, and by choosing $M \ll p$ we can mitigate overfitting.
- Decreased variance of coefficient estimates relative to OLS regression
Disadvantages
- Is not a feature selection method, since each Zi is a linear function of the predictors
Recommendations
- Data should usually be standardized prior to finding the principal components.
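A minimal PCR sketch as a scikit-learn pipeline (standardize, project onto the first $M$ principal components, then least squares); the value of `M` and the synthetic data are placeholders, and in practice $M$ would be chosen by cross-validation:

```python
# Principal Components Regression: StandardScaler -> PCA -> LinearRegression.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

M = 5  # number of principal components to keep
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print(pcr.predict(X[:5]))
```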
Partial Least Squares
A supervised dimension reduction method which proceeds roughly as follows:
- Standardize the variables
- Compute $Z_1$ by setting $\phi_{j1} = \hat{\beta}_j$, the coefficient from the simple linear regression of $Y$ onto $X_j$
- For $1 < m \leq M$, $Z_m$ is determined by:
  - Adjusting the data: replace each $X_j$ with the residual $\epsilon_j$ from the regression of $X_j$ onto $Z_{m-1}$
  - Computing $Z_m$ in the same fashion as $Z_1$ on the adjusted data
As with PCR, M is chosen by cross-validation
Advantages
- Decreased variance of coefficient estimates relative to OLS regression
- Supervised dimension reduction may reduce bias
Disadvantages
- May increase variance relative to PCR (which is unsupervised).
- May be no better than PCR in practice
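A minimal PLS sketch using scikit-learn's `PLSRegression` (synthetic data; `n_components` plays the role of $M$ and would normally be chosen by cross-validation):

```python
# Partial Least Squares regression with M = 5 components.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

pls = PLSRegression(n_components=5, scale=True)  # scale=True standardizes the variables
pls.fit(X, y)
print(pls.predict(X[:5]).ravel())
```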
Considerations in High Dimensions
High-Dimensional Data
Low-dimensional means $p \ll n$; high-dimensional means $p \gtrsim n$
What Goes Wrong in High Dimensions?
- If $p \gtrsim n$, then least squares yields a perfect (or near-perfect) fit to the training data, and hence overfits (usually badly); see the illustration below
- $C_p$, AIC, BIC, and $R^2$ approaches don't work well in this setting
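A small illustration of the perfect-fit problem, using pure-noise synthetic features:

```python
# With p > n, least squares drives the training error to essentially zero
# even though the features are unrelated to the response.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 20, 30                      # high-dimensional: p > n
X = rng.normal(size=(n, p))        # noise features, unrelated to y
y = rng.normal(size=n)

fit = LinearRegression().fit(X, y)
print(fit.score(X, y))             # training R^2 is ~1.0: a useless, overfit model
```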
Regression in High Dimensions
- Regularization or shrinkage plays a key role in high-dimensional problems.
- Appropriate tuning parameter selection is crucial for good predictive performance.
- The test error tends to increase as the dimensionality of the problem increases if the additional features aren’t truly associated with the response (the curse of dimensionality)
Interpreting Results in High Dimensions
- The multicollinearity problem is extreme in the high-dimensional setting: any predictor can be written as a linear combination of the others
- This makes interpretation difficult, since models fit to highly multicollinear data cannot identify which features are truly "preferred"; at best they identify one of many equally plausible sets of features
- Care must be taken in measuring performance: report test error or cross-validation error rather than training-set measures of fit