Exercise 1: Test and train RSS for subset selection
a.
Best subset selection searches the largest model space (all $\binom{p}{k}$ models with $k$ predictors), so for each $k$ it should have the smallest training RSS.
b.
The answer depends on the value of $k$. When $k < p - k$, forward stepwise selection (FSS) yields a less flexible model while backward stepwise selection (BSS) yields a more flexible one, so FSS should have a lower test RSS than BSS. When $k > p - k$, the converse should be true.
c.
i. True. FSS augments $\mathcal{M}_k$ by a single predictor at each iteration, so the predictors in $\mathcal{M}_k$ are a subset of those in $\mathcal{M}_{k+1}$.
ii. True. BSS obtains $\mathcal{M}_k$ by removing a single predictor from $\mathcal{M}_{k+1}$, so the predictors in $\mathcal{M}_k$ are again a subset of those in $\mathcal{M}_{k+1}$.
iii. False. There is no necessary connection between the models identified by FSS and BSS.
iv. False. Same reason.
v. False. Best subset selection searches over all subsets of size $k+1$ from scratch, so $\mathcal{M}_{k+1}$ may drop a predictor that was in $\mathcal{M}_k$ (the sketch below checks this numerically).
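To make (i) and (v) concrete, here is a small brute-force check on simulated data; the helpers `forward_stepwise` and `best_subset` are our own minimal implementations, not ISLR code, and whether a best-subset nesting violation actually shows up depends on the data.

```python
# Exercise 1(c) sketch: forward stepwise models are nested by construction,
# while best subset models need not be. Synthetic data; helpers are our own.
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """Training RSS of the least squares fit on the predictor columns in `cols`."""
    Xs = X[:, list(cols)]
    beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
    return np.sum((y - Xs @ beta) ** 2)


def best_subset(X, y, k):
    """The size-k subset of predictors with the smallest training RSS (exhaustive search)."""
    return set(min(combinations(range(X.shape[1]), k), key=lambda c: rss(X, y, c)))


def forward_stepwise(X, y, k):
    """Greedily add, k times, the predictor that most reduces training RSS."""
    chosen = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        chosen.append(min(remaining, key=lambda j: rss(X, y, chosen + [j])))
    return set(chosen)


rng = np.random.default_rng(0)
n, p = 50, 6
X = rng.normal(size=(n, p)) + 0.8 * rng.normal(size=(n, 1))  # correlated predictors
y = X @ rng.normal(size=p) + rng.normal(size=n)

for k in range(1, p):
    print(
        f"k={k}",
        "FSS nested:", forward_stepwise(X, y, k) <= forward_stepwise(X, y, k + 1),
        "| best subset nested:", best_subset(X, y, k) <= best_subset(X, y, k + 1),
    )
```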
Exercise 2: Comparing Lasso Regression, Ridge Regression, and Least Squares
a.
iii. is correct. The Lasso is less flexible since it searches a restricted parameter space (i.e. not all of $\mathbb{R}^p$), so it will usually have increased bias and decreased variance; it improves prediction accuracy when the increase in bias is smaller than the decrease in variance.
b.
iii. is correct again, for the same reasons.
c.
ii. is correct. Non-linear methods are more flexible, which usually means decreased bias and increased variance.
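Returning to (a) and (b): as a rough sanity check of the bias–variance reasoning, here is a small simulation with an arbitrary setup of our own (a sparse signal buried in noise). On data like this the shrunken lasso and ridge fits typically beat least squares on held-out data, though the outcome depends on the chosen penalties and noise level.

```python
# Exercise 2 sketch: on noisy data with a sparse signal, the less flexible lasso and
# ridge fits often beat least squares in test MSE. Arbitrary synthetic setup.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 100, 40
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                                   # only 5 of the 40 predictors matter
y = X @ beta + rng.normal(scale=3.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "least squares": LinearRegression(),
    "ridge (alpha=5)": Ridge(alpha=5.0),
    "lasso (alpha=0.5)": Lasso(alpha=0.5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:18s} test MSE: {mean_squared_error(y_te, model.predict(X_te)):.2f}")
```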
Exercise 3: How change in s affects Lasso performance
a. Train RSS
None of these answers seem correct.
For sufficiently small $s > 0$, i.e. for an $\ell_1$-ball $B_s(0) \subseteq \mathbb{R}^p$ small enough that the least squares estimator $\hat{\beta}^{LS} \notin B_s(0)$, the lasso estimator $\hat{\beta}^{Lasso} \neq \hat{\beta}^{LS}$, so $RSS_{train}(\hat{\beta}^{Lasso}) \geq RSS_{train}(\hat{\beta}^{LS})$. As $s \to \|\hat{\beta}^{LS}\|_1$ from below, $\hat{\beta}^{Lasso} \to \hat{\beta}^{LS}$, so $RSS_{train}(\hat{\beta}^{Lasso}) \to RSS_{train}(\hat{\beta}^{LS})$ from above. When $s \geq \|\hat{\beta}^{LS}\|_1$, $\hat{\beta}^{Lasso} = \hat{\beta}^{LS}$, so $RSS_{train}(\hat{\beta}^{Lasso}) = RSS_{train}(\hat{\beta}^{LS})$.
In other words, $RSS_{train}(\hat{\beta}^{Lasso})$ will initially decrease as $s$ increases, until $B_s(0)$ catches $\hat{\beta}^{LS}$, and thereafter it will remain constant at $RSS_{train}(\hat{\beta}^{LS})$. The closest answer is iv., although "steadily decreasing" isn't quite the same thing.
A better answer would be iv. then v.; the sketch below traces this behaviour numerically.
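A minimal numerical sketch of this "decrease, then flat" behaviour, sweeping the penalty on simulated data of our own and reporting the implied budget $s = \|\hat{\beta}^{Lasso}\|_1$ alongside the training RSS:

```python
# Exercise 3(a) sketch: lasso training RSS as a function of s = ||beta_hat||_1.
# It falls as s grows and flattens out once s reaches ||beta_hat_LS||_1.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

ls = LinearRegression().fit(X, y)
ls_rss = np.sum((y - ls.predict(X)) ** 2)
ls_l1 = np.abs(ls.coef_).sum()
print(f"least squares:  s={ls_l1:.2f}  train RSS={ls_rss:.2f}")

# Large alpha corresponds to a small budget s; alpha near 0 recovers least squares.
for alpha in [1.0, 0.3, 0.1, 0.03, 0.01, 0.0001]:
    lasso = Lasso(alpha=alpha, max_iter=50_000).fit(X, y)
    s = np.abs(lasso.coef_).sum()
    rss = np.sum((y - lasso.predict(X)) ** 2)
    print(f"alpha={alpha:<8} s={s:.2f}  train RSS={rss:.2f}")
```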
b. Test RSS
ii. Test RSS will be minimized at some optimal value $s_0$ of $s$, and will be greater both for $s < s_0$ (lower flexibility; bias outweighs variance) and for $s > s_0$ (higher flexibility; variance outweighs bias).[1]
c. Variance
iii. We expect variance to increase monotonically with model flexibility.
d. (Squared) Bias
iv. We expect bias to decrease monotonically with model flexibility.
e. Irreducible Error
v. The irreducible error is the variance of the noise, $\mathrm{Var}(\epsilon)$, which is not a function of $s$. (The Monte Carlo sketch below checks the variance and bias claims from (c) and (d).)
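Parts (c) and (d) can be checked by a small Monte Carlo experiment of our own design: refit the lasso on many training sets drawn from the same model and estimate the variance and squared bias of its prediction at a fixed test point $x_0$. As the penalty shrinks (i.e. as $s$ grows), variance should rise and squared bias should fall.

```python
# Exercise 3(c)-(d) sketch: Monte Carlo estimates of the variance and squared bias of the
# lasso prediction at a fixed point x0, across penalties. Our own simulation design.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
p, n, reps = 10, 60, 300
beta_true = np.linspace(1.0, 0.1, p)
x0 = np.ones(p)                        # fixed test point
f_x0 = x0 @ beta_true                  # true mean response at x0

for alpha in [3.0, 1.0, 0.3, 0.1, 0.01]:   # decreasing alpha corresponds to increasing s
    preds = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta_true + rng.normal(scale=2.0, size=n)
        preds.append(Lasso(alpha=alpha, max_iter=50_000).fit(X, y).predict(x0.reshape(1, -1))[0])
    preds = np.array(preds)
    print(f"alpha={alpha:<5} variance={preds.var():.3f}  squared bias={(preds.mean() - f_x0) ** 2:.3f}")
```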
Exercise 4: How change in λ affects Regression performance.
For this exercise, we can observe that $s \uparrow\ \Rightarrow\ \lambda \downarrow$ (that is, model flexibility increases as $s$ increases and hence decreases as $\lambda$ increases), and that our answers are unaffected by whether we use the $\ell_1$ norm (lasso) or the $\ell_2$ norm (ridge), so we can reuse the reasoning from exercise 3 with the direction reversed.
a. Train RSS
v. then iii.
b. Test RSS
ii.
c. Variance
iv.
d. (Squared) Bias
iii.
e. Irreducible Error
v.
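The same kind of check works for this exercise, now with the penalty parameter on the horizontal axis; in the sketch below (our own arbitrary setup, ridge penalty), training RSS is essentially flat for tiny $\lambda$ and then rises, while test RSS is typically smallest at an intermediate $\lambda$.

```python
# Exercise 4 sketch: ridge training RSS rises with lambda; test RSS is usually U-shaped.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n, p = 120, 30
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=4.0, size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    model = Ridge(alpha=lam).fit(X_tr, y_tr)
    train_rss = np.sum((y_tr - model.predict(X_tr)) ** 2)
    test_rss = np.sum((y_te - model.predict(X_te)) ** 2)
    print(f"lambda={lam:<7} train RSS={train_rss:9.2f}  test RSS={test_rss:9.2f}")
```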
Exercise 5: Ridge and Lasso treat correlated variables differently
b. The ridge optimization problem
Subtract the first equation from the second and collect the $\beta_1, \beta_2$ terms. Since by assumption $\lambda \neq 0$,[2] we can divide through by $2\lambda$ to find $0 = \beta_1 - \beta_2$, hence
$$\hat{\beta}_1 = \hat{\beta}_2.$$[3]
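A quick numerical check of this conclusion, using arbitrary numbers that satisfy the exercise's constraints ($x_{11}=x_{12}$, $x_{21}=x_{22}=-x_{11}$, $y_1+y_2=0$) and scikit-learn's ridge implementation:

```python
# Exercise 5(b) check: under the exercise's symmetry constraints, the ridge coefficient
# estimates come out equal for any lambda > 0. Numbers are arbitrary but satisfy the constraints.
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[2.0, 2.0],      # x11 = x12
              [-2.0, -2.0]])   # x21 = x22 = -x11
y = np.array([3.0, -3.0])      # y1 + y2 = 0

for lam in [0.1, 1.0, 10.0]:
    model = Ridge(alpha=lam).fit(X, y)
    print(f"lambda={lam:<5} beta1={model.coef_[0]:.6f}  beta2={model.coef_[1]:.6f}  "
          f"intercept={model.intercept_:.6f}")
```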
c. The lasso optimization problem
The lasso regression problem in general is
$$\min_{\beta}\; RSS(\beta) + \lambda \|\tilde{\beta}\|_1,$$
where $\tilde{\beta} = (\beta_1, \dots, \beta_p)^\top$ is the coefficient vector without the intercept. For this problem, RSS is the same as for part b:
$$\begin{aligned}
RSS &= (y_1 - (\beta_0 + \beta_1 x_{11} + \beta_2 x_{12}))^2 + (y_2 - (\beta_0 + \beta_1 x_{21} + \beta_2 x_{22}))^2 \\
    &= (y_1 - \beta_0 - \beta_1 x_{11} - \beta_2 x_{12})^2 + (y_2 - \beta_0 - \beta_1 x_{21} - \beta_2 x_{22})^2 \\
    &= (y_1 - \beta_0 - \beta_1 x_{11} - \beta_2 x_{11})^2 + (y_1 + \beta_0 - \beta_1 x_{11} - \beta_2 x_{11})^2
\end{aligned}$$
(using $x_{12} = x_{11}$, $x_{21} = x_{22} = -x_{11}$, and $y_2 = -y_1$ in the last line).
Subtracting the first equation from the second and doing some algebra, we find $\lambda = 0$, which is a case we're not considering.
Now, focusing on the first equation, we conclude that $f(\beta_1, \beta_2)$ is strictly decreasing for $\beta_1 \geq 0, \beta_2 \leq 0$ as long as
$$\beta_2 < \frac{2(\lambda - y_1 x_{11})}{x_{11}^2}.$$
This inequality is always satisfied for some values of $\beta_2 \leq 0$, so we conclude that, provided $\lambda \neq 0$, $f(\beta_1, \beta_2)$ is always strictly decreasing along some direction.
Conclusions
Exactly one of the following is true:
$\lambda = 0$, in which case a global minimum of RSS exists but we are doing trivial lasso regression (i.e. just ordinary least squares).
$\lambda \neq 0$, in which case the lasso objective has no unique global minimizer.
Thus, if we are looking for non-trivial lasso coefficient estimates, we cannot find unique ones.[6]
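The non-uniqueness claim can be checked directly on the same toy numbers: for a sign-consistent pair, the objective depends on $\beta_1, \beta_2$ only through $\beta_1 + \beta_2$, so every split of the optimal total attains the same minimal value. A small sketch (the helper `lasso_objective` and the specific numbers are our own):

```python
# Exercise 5(c)-(d) sketch: with x11 = x12 and x21 = x22, the lasso objective depends on
# beta1, beta2 only through beta1 + beta2 (for a sign-consistent split), so many
# coefficient pairs achieve the same minimal value. `lasso_objective` is our own helper.
import numpy as np

x11, y1, lam = 2.0, 3.0, 1.0        # arbitrary numbers satisfying the exercise's constraints
X = np.array([[x11, x11], [-x11, -x11]])
y = np.array([y1, -y1])


def lasso_objective(b1, b2, b0=0.0):
    """RSS plus the l1 penalty; the intercept is taken to be 0 (part (a) of the exercise)."""
    return np.sum((y - (b0 + X @ np.array([b1, b2]))) ** 2) + lam * (abs(b1) + abs(b2))


# For a non-negative pair the objective depends only on the total b1 + b2,
# so every split of the optimal total achieves the same (minimal) value.
t_star = y1 / x11 - lam / (4 * x11 ** 2)     # minimizes 2*(y1 - t*x11)**2 + lam*t over t >= 0
for b1 in np.linspace(0.0, t_star, 5):
    b2 = t_star - b1
    print(f"beta1={b1:.4f}  beta2={b2:.4f}  objective={lasso_objective(b1, b2):.6f}")
```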
Exercise 6: Ridge and Lasso when n=p=1, β0=0, and x=1.
a. Ridge regression
Consider (6.12) in the case $n = p = 1$. Then the minimization problem becomes
$$\min_{\beta_1}\; (y_1 - \beta_1)^2 + \lambda \beta_1^2.$$
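A quick numerical confirmation that this one-dimensional ridge criterion is minimized at $\hat{\beta}_1 = y_1/(1+\lambda)$, which is (6.14); the particular $y_1$ and $\lambda$ below are arbitrary choices.

```python
# Exercise 6(a) check: (y1 - b)^2 + lam * b^2 is minimized at b = y1 / (1 + lam),
# which is ISLR (6.14). y1 and lam are arbitrary choices.
import numpy as np

y1, lam = 2.0, 3.0
grid = np.linspace(-2.0, 4.0, 600_001)          # fine grid of candidate beta_1 values
objective = (y1 - grid) ** 2 + lam * grid ** 2

numerical_min = grid[np.argmin(objective)]
closed_form = y1 / (1 + lam)
print(f"grid minimizer: {numerical_min:.4f}   closed form y1/(1+lambda): {closed_form:.4f}")
```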
[1] For evidence, we can observe that $s \uparrow\ \Rightarrow\ \lambda \downarrow$, and look at Figure 6.8 as a typical example.
[2] We know that $\lambda = 0$ gives a minimum, but then we just have the least squares solution. So, assuming $\lambda \neq 0$ (with $\lambda$ determined by other means, e.g. cross-validation), our objective function shouldn't depend on $\lambda$.
[3] Note that showing the ridge coefficient estimates are equal is not the same as showing they are unique; in fact, for $\lambda > 0$ the ridge objective is strictly convex, so they are unique. The "many possible solutions to the optimization problem" arise only in the lasso case below.
[4] That is, find the minima of $f(\beta_1, \beta_2)$ in each of the four quadrants of the $(\beta_1, \beta_2)$ plane and take the minimum over the quadrants.
[5] Strictly speaking, we are taking "one-sided derivatives".
[6] This is presumably what the book means by "many possible solutions to the optimization problem".
[7] As usual, $\beta = (\beta_0, \dots, \beta_p)^\top$, $\epsilon = (\epsilon_1, \dots, \epsilon_n)^\top$, and the first column of $X$ is $(1, \dots, 1)^\top \in \mathbb{R}^n$.
[8] This follows since the random variable $Y_i$, conditioned on $X_i = x_i$ and $\beta$, is $x_i^\top \beta + \epsilon_i$, and from $\epsilon_i \sim N(0, \sigma^2)$ it follows that $Y_i \mid X_i, \beta \sim N(X_i \beta, \sigma^2)$.