ISLR notes and exercises from An Introduction to Statistical Learning

6. Linear Model Selection and Regularization

Conceptual Exercises

Exercise 1: Test and train RSS for subset selection

a.

Best subset selection has the most flexibility (it searches the largest model space), so it should have the smallest training RSS for each $k$.

b.

The answer depends on the value of $k$. When $k < p - k$, FSS yields a less flexible model while BSS yields a more flexible model, so FSS should have a lower test RSS than BSS. When $k > p - k$, the converse should be true.

c.

i. True. FSS augments the current model by a single predictor at each iteration, so $\mathcal{M}_k \subseteq \mathcal{M}_{k+1}$.

ii. True. Backward stepwise obtains $\mathcal{M}_k$ by removing a single predictor from $\mathcal{M}_{k+1}$, so the $k$-variable model is always a subset of the $(k+1)$-variable model.

iii. False. There is no necessary connection between the models identified by FSS and BSS.

iv. False. Same reason.

v. False. Best subset searches all possible subsets of $k+1$ predictors from scratch, so $\mathcal{M}_{k+1}$ need not contain every predictor that was in $\mathcal{M}_k$.
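As a quick sanity check on the flexibility ordering in (a), here is a minimal sketch (synthetic data and scikit-learn assumed; not from the book) that runs best subset selection by brute force and forward stepwise selection greedily, and compares their training RSS at each $k$ (best subset is never worse on the training data):

```python
# A sketch (not from the book): brute-force best subset vs. greedy forward
# stepwise selection on a small synthetic dataset, comparing training RSS at each k.
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)   # only two predictors matter

def rss(features):
    """Training RSS of the least squares fit on the given predictor columns."""
    fit = LinearRegression().fit(X[:, features], y)
    return np.sum((y - fit.predict(X[:, features])) ** 2)

# Best subset: search every subset of size k.
best = {k: min(rss(list(c)) for c in itertools.combinations(range(p), k))
        for k in range(1, p + 1)}

# Forward stepwise: greedily add the single predictor that most reduces RSS.
selected, remaining, forward = [], list(range(p)), {}
for k in range(1, p + 1):
    j = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(j)
    remaining.remove(j)
    forward[k] = rss(selected)

for k in range(1, p + 1):
    print(k, round(best[k], 2), round(forward[k], 2))   # best subset RSS <= forward RSS
```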

Exercise 2: Comparing Lasso Regression, Ridge Regression, and Least Squares

a.

iii. is correct. The Lasso is less flexible, since it searches a restricted parameter space (i.e. not all of $\mathbb{R}^p$), so it will usually have increased bias and decreased variance; its prediction accuracy improves when the increase in bias is smaller than the decrease in variance.

b.

iii. is correct again, for the same reasons.

c.

ii. is correct. Non-linear methods are more flexible which usually means decreased bias and increased variance.
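To make the bias–variance trade-off in this exercise concrete, here is a minimal sketch (synthetic data with mostly irrelevant predictors; the penalty strengths are arbitrary) comparing the test MSE of least squares, ridge, and the lasso:

```python
# A sketch (synthetic data; alpha values arbitrary): with many irrelevant
# predictors, the added bias of ridge/lasso is outweighed by the reduction in
# variance, so their test MSE is typically below that of least squares.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n, p = 60, 40                                  # few observations, many predictors
beta = np.zeros(p)
beta[:3] = [3, -2, 1]                          # only three predictors matter
X = rng.normal(size=(2 * n, p))
y = X @ beta + rng.normal(size=2 * n)
X_train, X_test, y_train, y_test = X[:n], X[n:], y[:n], y[n:]

for name, model in [("least squares", LinearRegression()),
                    ("ridge", Ridge(alpha=10.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    model.fit(X_train, y_train)
    print(name, round(mean_squared_error(y_test, model.predict(X_test)), 3))
```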

Exercise 3: How change in $s$ affects Lasso performance

a. Train RSS

None of these answers seem correct.

For some small $s > 0$, i.e. some $\ell_1$-neighborhood $B_s(\mathbf{0}) \subseteq \mathbb{R}^p$, the least squares estimator $\hat{\beta}_{LS} \notin B_s(\mathbf{0})$, hence the Lasso estimator $\hat{\beta}_{Lasso} \neq \hat{\beta}_{LS}$, so $RSS_{train}(\hat{\beta}_{Lasso}) \geqslant RSS_{train}(\hat{\beta}_{LS})$. As $s \rightarrow \|\hat{\beta}_{LS}\|_1$ from below, $\hat{\beta}_{Lasso} \rightarrow \hat{\beta}_{LS}$, so $RSS_{train}(\hat{\beta}_{Lasso}) \rightarrow RSS_{train}(\hat{\beta}_{LS})$ from above. When $s \geqslant \|\hat{\beta}_{LS}\|_1$, $\hat{\beta}_{Lasso} = \hat{\beta}_{LS}$, so $RSS_{train}(\hat{\beta}_{Lasso}) = RSS_{train}(\hat{\beta}_{LS})$.

In other words, $RSS_{train}(\hat{\beta}_{Lasso})$ will initially decrease as $s$ increases, until $B_s(\mathbf{0})$ catches $\hat{\beta}_{LS}$, and thereafter it will remain constant at $RSS_{train}(\hat{\beta}_{Lasso}) = RSS_{train}(\hat{\beta}_{LS})$. The closest answer is iv., although "steadily decreasing" isn't quite the same thing.

A better answer would be iv., then v.
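A small sketch of this behaviour (synthetic data; scikit-learn's Lasso traced over an arbitrary grid of penalties, which is equivalent to varying the budget $s$): training RSS falls as $\|\hat{\beta}\|_1$ grows and then levels off at the least squares RSS.

```python
# A sketch: trace the lasso over a grid of penalties (large penalty ~ small s)
# and record training RSS against the l1 norm of the fitted coefficients.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)

budgets, train_rss = [], []
for alpha in np.logspace(1, -4, 40):
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    budgets.append(np.abs(fit.coef_).sum())                  # effective s
    train_rss.append(np.sum((y - fit.predict(X)) ** 2))

ols_rss = np.sum((y - LinearRegression().fit(X, y).predict(X)) ** 2)
# Training RSS decreases as the budget grows, then levels off near the OLS RSS.
print(round(train_rss[0], 2), round(train_rss[-1], 2), round(ols_rss, 2))
```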

b. Test RSS

ii. Test RSS will be minimized at some optimal value $s_0$ of $s$, and will be greater for $s < s_0$ (lower flexibility, where bias outweighs variance) and for $s > s_0$ (higher flexibility, where variance outweighs bias) 1.

c. Variance

iii. We expect variance to increase monotonically with model flexibility.

d. (Squared) Bias

iv. We expect bias to decrease monotonically with model flexibility.

e. Irreducible Error

v. The irreducible error is the variance of the noise, $\mathbb{V}(\epsilon)$, which is not a function of $s$.
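The trends in (c)–(e) can also be estimated by simulation. Here is a minimal sketch (made-up data-generating process; lasso penalties chosen arbitrarily) that estimates the variance and squared bias of the lasso prediction at a single fixed test point:

```python
# A simulation sketch (made-up data-generating process, arbitrary penalties):
# estimate the variance and squared bias of the lasso prediction at one test point.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
p, n, reps = 5, 50, 200
beta = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
x0 = rng.normal(size=p)
f0 = x0 @ beta                                   # true mean response at x0

for alpha in [1.0, 0.1, 0.01]:                   # decreasing alpha ~ increasing s
    preds = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta + rng.normal(size=n)
        fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
        preds.append(fit.predict(x0[None, :])[0])
    preds = np.array(preds)
    print(alpha, "variance:", round(preds.var(), 3),
          "squared bias:", round((preds.mean() - f0) ** 2, 3))
```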

Exercise 4: How change in $\lambda$ affects regression performance.

For this exercise, we can observe that $s \uparrow\ \Rightarrow\ \lambda \downarrow$ (that is, model flexibility increases as $s$ increases and decreases as $\lambda$ increases), and that our answers are unaffected by whether we use the $\ell_1$ norm (Lasso) or the $\ell_2$ norm (Ridge). So we can use the same reasoning as in Exercise 3, with the direction reversed: increasing $\lambda$ from 0 corresponds to decreasing $s$.

a. Train RSS

v. then iii.

b. Test RSS

ii.

c. Variance

iv.

d. (Squared) Bias

iii.

e. Irreducible Error

v.
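A companion sketch to the one in Exercise 3, this time varying $\lambda$ directly for ridge regression (synthetic data; the grid of $\lambda$ values is arbitrary): train RSS rises steadily with $\lambda$, while test RSS typically dips and then rises.

```python
# A sketch: ridge train and test RSS as lambda increases from (nearly) zero.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n, p = 50, 30
beta = rng.normal(size=p)
X = rng.normal(size=(2 * n, p))
y = X @ beta + rng.normal(scale=3.0, size=2 * n)
X_train, X_test, y_train, y_test = X[:n], X[n:], y[:n], y[n:]

for lam in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    fit = Ridge(alpha=lam).fit(X_train, y_train)
    train_rss = np.sum((y_train - fit.predict(X_train)) ** 2)
    test_rss = np.sum((y_test - fit.predict(X_test)) ** 2)
    print(lam, round(train_rss, 1), round(test_rss, 1))
# Train RSS increases steadily with lambda; test RSS typically dips, then rises.
```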

Exercise 5: Ridge and Lasso treat correlated variables differently

For this exercise we have data

$$\{(x_{11}, x_{12}, y_1), (x_{21}, x_{22}, y_2)\} = \{(x_{11}, x_{11}, y_1), (-x_{11}, -x_{11}, -y_1)\}$$

a. The ridge optimization problem

The ridge regression problem in general is

$$\underset{\beta}{\min}\ \ RSS(\beta) + \lambda\| \tilde{\beta} \|_2^2$$

where $\tilde{\beta} = (\beta_1, \dots, \beta_p)^\top$ excludes the intercept.

For this problem

$$
\begin{aligned}
RSS &= \big(y_1 - (\beta_0 + \beta_1 x_{11} + \beta_2 x_{12})\big)^2 + \big(y_2 - (\beta_0 + \beta_1 x_{21} + \beta_2 x_{22})\big)^2\\
&= \big(y_1 - \beta_0 - \beta_1 x_{11} - \beta_2 x_{12}\big)^2 + \big(y_2 - \beta_0 - \beta_1 x_{21} - \beta_2 x_{22}\big)^2\\
&= \big(y_1 - \beta_0 - \beta_1 x_{11} - \beta_2 x_{11}\big)^2 + \big(y_1 + \beta_0 - \beta_1 x_{11} - \beta_2 x_{11}\big)^2
\end{aligned}
$$

so we have the optimization problem

$$\underset{\beta}{\min}\ \ \big(y_1 - \beta_0 - \beta_1 x_{11} - \beta_2 x_{11}\big)^2 + \big(y_1 + \beta_0 - \beta_1 x_{11} - \beta_2 x_{11}\big)^2 + \lambda(\beta_1^2 + \beta_2^2)$$

b. The ridge coefficient estimates

Since we know $\hat{\beta}_0 = 0$, and dropping the resulting factor of 2 from the RSS term (which only rescales $\lambda$), we have the (slightly) simpler problem

$$\underset{(\beta_1, \beta_2)}{\min}\ \ \big(y_1 - \beta_1 x_{11} - \beta_2 x_{11}\big)^2 + \lambda(\beta_1^2 + \beta_2^2)$$

Write $f(\beta_1, \beta_2)$ for this objective function we are trying to minimize 2. Taking partial derivatives and setting them equal to zero:

$$
\begin{aligned}
\frac{\partial f}{\partial \beta_1} &= 2(\lambda + x_{11}^2)\beta_1 + 2x_{11}^2\beta_2 - 2x_{11}y_1 = 0\\
\frac{\partial f}{\partial \beta_2} &= 2(\lambda + x_{11}^2)\beta_2 + 2x_{11}^2\beta_1 - 2x_{11}y_1 = 0
\end{aligned}
$$

Subtract the first equation from the second and collect the $\beta_1, \beta_2$ terms. Since by assumption 2 $\lambda \neq 0$, we can divide through by $2\big((\lambda + x_{11}^2) - x_{11}^2\big) = 2\lambda$ to find $0 = \beta_1 - \beta_2$, hence $\hat{\beta}_1 = \hat{\beta}_2$ 3.
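A quick numerical check of this result (the values of $x_{11}$, $y_1$, and $\lambda$ below are made up; scikit-learn's Ridge minimizes the same penalized RSS, up to the parametrization of $\lambda$):

```python
# A quick numerical check (x11, y1, lambda values made up): ridge regression on
# this two-point dataset returns equal coefficient estimates.
import numpy as np
from sklearn.linear_model import Ridge

x11, y1, lam = 2.0, 3.0, 0.7
X = np.array([[x11, x11], [-x11, -x11]])
y = np.array([y1, -y1])

fit = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(fit.coef_)          # the two entries agree up to numerical precision
```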

c. The lasso optimization problem

The lasso regression problem in general is

$$\underset{\beta}{\min}\ \ RSS(\beta) + \lambda\| \tilde{\beta} \|_1$$

For this problem the RSS is the same as in part a:

$$RSS = \big(y_1 - \beta_0 - \beta_1 x_{11} - \beta_2 x_{11}\big)^2 + \big(y_1 + \beta_0 - \beta_1 x_{11} - \beta_2 x_{11}\big)^2$$

so we have the optimization problem

$$\underset{\beta}{\min}\ \ \big(y_1 - \beta_0 - \beta_1 x_{11} - \beta_2 x_{11}\big)^2 + \big(y_1 + \beta_0 - \beta_1 x_{11} - \beta_2 x_{11}\big)^2 + \lambda(|\beta_1| + |\beta_2|)$$

d. The lasso coefficient estimates

Again, since we know $\hat{\beta}_0 = 0$ (and again dropping the factor of 2), we have the (slightly) simpler problem

$$\underset{(\beta_1, \beta_2)}{\min}\ \ \big(y_1 - (\beta_1 + \beta_2) x_{11}\big)^2 + \lambda(|\beta_1| + |\beta_2|)$$

Again, write $f(\beta_1, \beta_2)$ for this objective function.

It is somewhat difficult to argue analytically that this function has no unique global minimizer, so we'll look at some graphs first.

Visualizing some examples

Here we'll graph the objective function $f(\beta_1, \beta_2)$ for the case $x_{11} = 1, y_1 = 1, \lambda = 1$ to get a sense of what's going on:

from mpl_toolkits import mplot3d

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Lasso objective f(beta_1, beta_2) for this exercise, with x11 = y1 = lambda = 1
(x, y, lambda_) = (1, 1, 1)
beta_1 = np.linspace(-6, 6, 30)
beta_2 = beta_1

X, Y = np.meshgrid(beta_1, beta_2)
Z = np.square(y - (X + Y)*x) + lambda_*(np.absolute(X) + np.absolute(Y))

# Surface plot of the objective over the (beta_1, beta_2) plane
fig = plt.figure(figsize=(15, 15))
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                cmap='gray', edgecolor='none')

[3D surface plot of the lasso objective for x11 = y1 = λ = 1]

ax.view_init(0, 60)
fig

[the same surface viewed edge-on]


It's difficult to see from this picture, but the surface is in fact flat along the segment $\beta_1 + \beta_2 = 1/2$, $\beta_1, \beta_2 \geqslant 0$: the minimum value is attained along an entire line segment, not at a single point.

Now let's look at $x_{11} = -2, y_1 = 3, \lambda = 10$:

# The same surface for the second example; the penalty term is added, matching
# the objective above
(x, y, lambda_) = (-2, 3, 10)
Z = np.square(y - (X + Y)*x) + lambda_*(np.absolute(X) + np.absolute(Y))

fig = plt.figure(figsize=(15, 15))
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                cmap='gray', edgecolor='none')

[3D surface plot of the lasso objective for x11 = -2, y1 = 3, λ = 10]

ax.view_init(0, 60)
fig

[the same surface viewed edge-on]

Again the surface has a flat valley: the minimum is attained along the segment $\beta_1 + \beta_2 = -1/4$, $\beta_1, \beta_2 \leqslant 0$, so there is no unique minimizer.

Arguing analytically

The graphs above suggest that the minimum is attained along an entire line segment rather than at a unique point, so let's see if we can argue that analytically.

Observe that, since $|x|$ is a piecewise function, so is $f(\beta_1, \beta_2)$. Thus we can minimize $f$ by minimizing each of

$$
\begin{aligned}
f_{(+,+)}(\beta_1, \beta_2) &= \big(y_1 - (\beta_1 + \beta_2) x_{11}\big)^2 + \lambda(\beta_1 + \beta_2)\\
f_{(+,-)}(\beta_1, \beta_2) &= \big(y_1 - (\beta_1 + \beta_2) x_{11}\big)^2 + \lambda(\beta_1 - \beta_2)\\
f_{(-,+)}(\beta_1, \beta_2) &= \big(y_1 - (\beta_1 + \beta_2) x_{11}\big)^2 + \lambda(-\beta_1 + \beta_2)\\
f_{(-,-)}(\beta_1, \beta_2) &= \big(y_1 - (\beta_1 + \beta_2) x_{11}\big)^2 + \lambda(-\beta_1 - \beta_2)
\end{aligned}
$$

and then taking the minimum over all of these 4.

This corresponds to finding the minimum of $f$ over each quadrant of the $(\beta_1, \beta_2)$-plane and comparing.

Minimizing $f_{(+,+)}$ and $f_{(-,-)}$

Taking partial derivatives 5 of $f_{(+,+)}$ and setting them equal to zero (the two partial derivatives coincide):

$$
\frac{\partial f_{(+,+)}}{\partial \beta_1} = \frac{\partial f_{(+,+)}}{\partial \beta_2} = 2x_{11}^2\beta_1 + 2x_{11}^2\beta_2 - 2x_{11}y_1 + \lambda = 0
$$

Both partial derivatives give the same equation, so for $\beta_1, \beta_2 \geqslant 0$ the minimum of $f_{(+,+)}$ is attained along the entire line segment

$$\beta_1 + \beta_2 = \frac{2x_{11}y_1 - \lambda}{2x_{11}^2}, \qquad \beta_1, \beta_2 \geqslant 0$$

(provided the right-hand side is nonnegative; otherwise the minimum over this quadrant is at the origin).

Similarly, the minimum of $f_{(-,-)}$ for $\beta_1, \beta_2 \leqslant 0$ is attained along

$$\beta_1 + \beta_2 = \frac{2x_{11}y_1 + \lambda}{2x_{11}^2}, \qquad \beta_1, \beta_2 \leqslant 0$$

(provided the right-hand side is nonpositive). In either case the minimizer is a whole line segment, not a single point.

Minimizing $f_{(+,-)}$ and $f_{(-,+)}$

Taking partial derivatives and setting them equal to zero:

$$
\begin{aligned}
\frac{\partial f_{(+,-)}}{\partial \beta_1} &= 2x_{11}^2\beta_1 + 2x_{11}^2\beta_2 - 2x_{11}y_1 + \lambda = 0\\
\frac{\partial f_{(+,-)}}{\partial \beta_2} &= 2x_{11}^2\beta_1 + 2x_{11}^2\beta_2 - 2x_{11}y_1 - \lambda = 0
\end{aligned}
$$

Subtracting the first equation from the second and doing some algebra, we find $\lambda = 0$, which is a case we're not considering, so $f_{(+,-)}$ (and likewise $f_{(-,+)}$) has no critical point in the interior of its quadrant.

In fact, the mixed-sign quadrants contain no new minimizers at all: for a fixed value of $\beta_1 + \beta_2$ the RSS term is fixed, while the penalty satisfies $|\beta_1| + |\beta_2| \geqslant |\beta_1 + \beta_2|$, with equality exactly when $\beta_1$ and $\beta_2$ have the same sign (or one of them is zero). So any minimizer of $f$ lies in one of the same-sign quadrants, where we have just seen that the minimum is attained along an entire line segment whenever that segment is non-degenerate (i.e. whenever $\lambda < 2|x_{11}y_1|$).

Conclusions

Exactly one of the following is true:

  • $\lambda = 0$, in which case a global minimum of $RSS$ exists, but we are doing trivial lasso regression (i.e. just ordinary least squares).

  • $\lambda \neq 0$, in which case the minimum of the lasso objective is attained along an entire line segment of $(\beta_1, \beta_2)$ values whenever $\lambda < 2|x_{11}y_1|$ (and only at the origin otherwise), so non-zero estimates are never unique.

Thus, if we are looking for non-trivial lasso coefficient estimates, we cannot find unique ones 6
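A quick numerical check of this conclusion, using the first plotted example ($x_{11} = 1$, $y_1 = 1$, $\lambda = 1$), where the analysis above puts the minimizers on the segment $\beta_1 + \beta_2 = 1/2$ with $\beta_1, \beta_2 \geqslant 0$:

```python
# A quick numerical check, using the first plotted example (x11 = y1 = lambda = 1):
# every point on the segment beta1 + beta2 = 1/2 with beta1, beta2 >= 0 gives the
# same objective value, so the minimizer is not unique.
import numpy as np

x11, y1, lam = 1.0, 1.0, 1.0
f = lambda b1, b2: (y1 - (b1 + b2) * x11) ** 2 + lam * (abs(b1) + abs(b2))

for b1 in np.linspace(0, 0.5, 6):
    b2 = 0.5 - b1
    print(round(b1, 1), round(b2, 1), round(f(b1, b2), 6))     # always 0.75

# A coarse grid search confirms that 0.75 is the global minimum value.
grid = np.linspace(-3, 3, 601)
B1, B2 = np.meshgrid(grid, grid)
print(round(np.min((y1 - (B1 + B2) * x11) ** 2 + lam * (np.abs(B1) + np.abs(B2))), 6))
```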

Exercise 6: Ridge and Lasso when $n = p = 1$, $\beta_0 = 1$, and $x = 1$.

a. Ridge regression

Consider (6.12) in the case $n = p = 1$. Then the minimization problem becomes

$$\underset{\beta_1}{\min}\ \ \big(y - \beta_1\big)^2 + \lambda\beta_1^2$$

Now we plot for the case $y = \lambda = 1$:

lambda_, y = 1, 1

# Ridge objective (y - beta_1)^2 + lambda * beta_1^2 over a grid of beta_1 values
X = np.linspace(-5, 5, 30)
Y = np.square(y - X) + lambda_*np.square(X)

plt.plot(X, Y)

[plot of the ridge objective as a function of β1]

It’s easy to believe that the minimum is at the value given by (6.14):

$$\hat{\beta}_1^R = \frac{y}{1 + \lambda} = \frac{1}{2}$$
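A minimal numerical check of this (a plain numpy grid search over the objective already plotted):

```python
# A numerical check of (6.14) using a fine grid (numpy only).
import numpy as np

lambda_, y = 1, 1
b = np.linspace(-5, 5, 100001)
ridge_obj = np.square(y - b) + lambda_ * np.square(b)
print(b[np.argmin(ridge_obj)], y / (1 + lambda_))    # both ~ 0.5
```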

b. Lasso regression

Consider (6.13) in the case $n = p = 1$. Then the minimization problem becomes

$$\underset{\beta_1}{\min}\ \ \big(y - \beta_1\big)^2 + \lambda|\beta_1|$$

Now we plot for the case $y = \lambda = 1$:

lambda_, y = 1, 1

# Lasso objective (y - beta_1)^2 + lambda * |beta_1| over the same grid
X = np.linspace(-5, 5, 30)
Y = np.square(y - X) + lambda_*np.absolute(X)

plt.plot(X, Y)

[plot of the lasso objective as a function of β1]

It’s easy to believe that the minimum is at the value given by (6.15):

$$\hat{\beta}_1^L = y - \frac{\lambda}{2} = \frac{1}{2}$$
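And the analogous check for the lasso case (again a plain numpy grid search):

```python
# The analogous check of (6.15): the minimizer is y - lambda/2 when y > lambda/2.
import numpy as np

lambda_, y = 1, 1
b = np.linspace(-5, 5, 100001)
lasso_obj = np.square(y - b) + lambda_ * np.absolute(b)
print(b[np.argmin(lasso_obj)], y - lambda_ / 2)      # both ~ 0.5
```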

Exercise 7: Deriving the Bayesian connection between Ridge and Lasso

a. Likelihood for the linear model with Gaussian noise

We assume

$$Y = \mathbf{X}\beta + \epsilon$$

where $\epsilon_i \sim N(0, \sigma^2)$ iid 7.

The likelihood is 8

$$
\begin{aligned}
f(Y|\mathbf{X}, \beta) &= \prod_{i=1}^n f(Y_i| X_i, \beta)\\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(Y_i - X_i\beta)^2}{2\sigma^2}\right)\\
&= (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - X_i\beta)^2\right)
\end{aligned}
$$
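A minimal numerical check of this formula (made-up design matrix and coefficients): the sum of the log normal densities matches the closed form $-\tfrac{n}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\sum_i (Y_i - X_i\beta)^2$.

```python
# A numerical check of the likelihood formula (made-up X and beta): the sum of
# log normal densities equals -(n/2) log(2 pi sigma^2) - RSS / (2 sigma^2).
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 20, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # first column of ones
beta = rng.normal(size=p + 1)
Y = X @ beta + rng.normal(scale=sigma, size=n)

resid = Y - X @ beta
log_lik_direct = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))
log_lik_closed = -(n / 2) * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)
print(np.isclose(log_lik_direct, log_lik_closed))            # True
```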

b.

c.

d.

e.

Footnotes

  1. For evidence, we can observe that $s \uparrow\ \Rightarrow\ \lambda \downarrow$ and look at Figure 6.8 as a typical example.

  2. We know that $\lambda = 0$ will give a minimum, but then we just have the least squares solution. So we assume $\lambda \neq 0$, determined by other means (e.g. cross-validation), and treat it as fixed, so that the objective is a function of $\beta$ alone.

  3. Note that, even though we only derived that the ridge estimates are equal, they are in fact also unique: with $\lambda \neq 0$ the ridge objective is strictly convex, and the two equations above pin down a single value of $\hat{\beta}_1 = \hat{\beta}_2$. The “many possible solutions to the optimization problem” mentioned in the book arise for the lasso in part d.

  4. That is, find the minima of $f(\beta_1, \beta_2)$ in each of the four quadrants of the $(\beta_1, \beta_2)$ plane and take the minimum over the quadrants.

  5. Strictly speaking, we are taking “one-sided derivatives”. 

  6. This is presumably what the book means by “many possible solutions to the optimization problem”. 

  7. As usual, $\beta = (\beta_0, \dots, \beta_p)^\top$, $\epsilon = (\epsilon_1, \dots, \epsilon_n)^\top$, and the first column of $\mathbf{X}$ is $(1, \dots, 1)^\top \in \mathbb{R}^n$.

  8. This follows since, conditional on $X_i = x_i$ and $\beta$, the random variable $Y_i$ is $x_i^\top\beta + \epsilon_i$, and from $\epsilon_i \sim N(0, \sigma^2)$ it follows that $Y_i | X_i, \beta \sim N(X_i\beta, \sigma^2)$.