ISLR: notes and exercises from An Introduction to Statistical Learning

7. Moving Beyond Linearity

Conceptual Exercises

Exercise 1: The truncated power basis generates cubic regression splines

a.

If $x \leqslant \xi$ then $(x-\xi)_+^3 = 0$, so $f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x-\xi)_+^3$ reduces to a cubic polynomial. Taking $a_1 = \beta_0, b_1 = \beta_1, c_1 = \beta_2, d_1 = \beta_3$, we have $f(x) = f_1(x)$ for all $x \leqslant \xi$.

b.

If $x > \xi$ then $(x-\xi)_+^3 = (x-\xi)^3$. Expanding $f(x)$ gives

f(x) = (\beta_0 - \beta_4 \xi^3) + (\beta_1 + 3\beta_4\xi^2)x + (\beta_2 - 3\beta_4\xi)x^2 + (\beta_3 + \beta_4)x^3

Comparing with $f_2(x) = a_2 + b_2 x + c_2 x^2 + d_2 x^3$, we see

\begin{align}
a_2 &= \beta_0 - \beta_4 \xi^3 \\
b_2 &= \beta_1 + 3\beta_4\xi^2 \\
c_2 &= \beta_2 - 3\beta_4\xi \\
d_2 &= \beta_3 + \beta_4
\end{align}

c.

\begin{align}
f_2(\xi) &= (\beta_0 - \beta_4 \xi^3) + (\beta_1 + 3\beta_4\xi^2)\xi + (\beta_2 - 3\beta_4\xi)\xi^2 + (\beta_3 + \beta_4)\xi^3 \\
&= \beta_0 - \beta_4\xi^3 + \beta_1\xi + 3\beta_4\xi^3 + \beta_2\xi^2 - 3\beta_4\xi^3 + \beta_3\xi^3 + \beta_4\xi^3 \\
&= \beta_0 + \beta_1\xi + \beta_2\xi^2 + \beta_3\xi^3 \\
&= f_1(\xi)
\end{align}

d.

\begin{align}
f'_2(\xi) &= (\beta_1 + 3\beta_4\xi^2) + 2(\beta_2 - 3\beta_4\xi)\xi + 3(\beta_3 + \beta_4)\xi^2 \\
&= \beta_1 + 3\beta_4\xi^2 + 2\beta_2\xi - 6\beta_4\xi^2 + 3\beta_3\xi^2 + 3\beta_4\xi^2 \\
&= \beta_1 + 2\beta_2\xi + 3\beta_3\xi^2 \\
&= f'_1(\xi)
\end{align}

e.

\begin{align}
f''_2(\xi) &= 2(\beta_2 - 3\beta_4\xi) + 6(\beta_3 + \beta_4)\xi \\
&= 2\beta_2 - 6\beta_4\xi + 6\beta_3\xi + 6\beta_4\xi \\
&= 2\beta_2 + 6\beta_3\xi \\
&= f''_1(\xi)
\end{align}
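
As a sanity check on (a)–(e), here is a small sympy sketch (my own, not from the book) that expands $f_2$ to recover the coefficients in (b) and verifies that $f_1$ and $f_2$ agree in value, first, and second derivative at $\xi$:

```python
# Symbolic check of Exercise 1 with sympy. The symbol names are placeholders
# for the beta_i and the knot xi in the text.
import sympy as sp

x, xi = sp.symbols("x xi", real=True)
b0, b1, b2, b3, b4 = sp.symbols("beta0:5", real=True)

f1 = b0 + b1*x + b2*x**2 + b3*x**3      # f(x) for x <= xi
f2 = f1 + b4*(x - xi)**3                # f(x) for x > xi, where (x - xi)_+^3 = (x - xi)^3

# (b): coefficients of f_2, from the constant term up to x^3.
print(sp.Poly(sp.expand(f2), x).all_coeffs()[::-1])
# -> [beta0 - beta4*xi**3, beta1 + 3*beta4*xi**2, beta2 - 3*beta4*xi, beta3 + beta4]

# (c)-(e): f_1 and f_2 agree at xi up to the second derivative.
for k in range(3):
    assert sp.simplify(sp.diff(f2 - f1, x, k).subs(x, xi)) == 0
```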

Exercise 2: Alternative roughness penalties for smoothing splines

We’re just going to describe the solutions instead of drawing them.

a.

If $\lambda = \infty$ and $m = 0$, any nonzero value of the integral makes the objective infinite, so the penalty forces $\int g^2 = 0$, i.e. $g^{(0)} = g = 0$. So $\hat{g}$ is the zero function.

b.

In this case the penalty forces $g' = 0$, so $\hat{g}$ is a constant function. The constant minimizing the RSS is the sample mean, so $\hat{g} = \overline{y}$.

c.

With $m = 2$ this is the smoothing spline penalty discussed in the chapter; $\lambda = \infty$ forces $g'' = 0$, so $\hat{g}$ is the ordinary least squares line.

d.

Now the integral penalty forces $g^{(3)} = 0$, so $\hat{g}$ is necessarily a polynomial of degree $\leqslant 2$, namely the least-squares quadratic fit. It need not have degree exactly 2 (for special datasets the fitted leading coefficient can be zero), but since a quadratic has strictly more freedom than a line, its RSS is never larger and is generically smaller, so in general $\hat{g}$ is a genuine quadratic.
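
A small numerical sketch of this claim (the dataset below is invented purely for illustration): the least-squares quadratic never has larger training RSS than the least-squares line.

```python
# Illustration for (d): in the lambda -> infinity, m = 3 limit, g-hat is the
# least-squares polynomial of degree <= 2. Its RSS is never larger than the
# best linear fit's, so generically g-hat is a genuine quadratic.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)   # toy data

for deg in (1, 2):
    coef = np.polyfit(x, y, deg)
    rss = np.sum((y - np.polyval(coef, x)) ** 2)
    print(f"degree {deg}: RSS = {rss:.4f}")   # degree-2 RSS <= degree-1 RSS
```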

e.

In this case the $m = 3$ condition is irrelevant: with $\lambda = 0$ we are minimizing the RSS over all functions $g$, so $\hat{g}$ can be any function with zero RSS, i.e. any function with $\hat{g}(x_i) = y_i$ for all $i$. For example, an interpolating spline, or a step (piecewise constant) function passing through the $y_i$, has zero RSS. Such a function isn't unique.
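
For instance, here is a tiny sketch using scipy's interpolating cubic spline as one such zero-RSS function (the data points are made up):

```python
# Illustration for (e): with lambda = 0 only the RSS matters, and any
# interpolant of the training points achieves RSS = 0.
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, -1.0, 0.5, 2.0])

g_hat = CubicSpline(x, y)                  # passes through every (x_i, y_i)
print(np.sum((y - g_hat(x)) ** 2))         # 0.0 (up to floating point error)
```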

Exercise 3: Example basis function model

skip

Exercise 4: Example basis function model

skip

Exercise 5: Comparing smoothing splines with different roughness penalties

a.

As $\lambda \rightarrow \infty$, the integral must approach 0. Since $\hat{g}_1$ penalizes $\int [g^{(3)}]^2$ and $\hat{g}_2$ penalizes $\int [g^{(4)}]^2$, $\hat{g}_1$ approaches a polynomial of degree at most 2 and $\hat{g}_2$ a polynomial of degree at most 3. Since $\hat{g}_2$ is the more flexible fit, we expect $\text{RSS}_{train}(\hat{g}_1) \geqslant \text{RSS}_{train}(\hat{g}_2)$.

b.

For the same reason, the more flexible $\hat{g}_2$ is more prone to overfitting, so we expect $\text{RSS}_{test}(\hat{g}_1) \leqslant \text{RSS}_{test}(\hat{g}_2)$ (though this depends on how wiggly the true regression function is).
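
A quick numerical sketch of (a) and (b), treating the $\lambda \rightarrow \infty$ limits as least-squares polynomial fits of degree 2 and 3 (the "true" function and data below are invented for illustration):

```python
# g1-hat ~ quadratic fit (penalty on g'''), g2-hat ~ cubic fit (penalty on g'''').
# The cubic never has larger training RSS; its test RSS may or may not be larger,
# depending on the true regression function.
import numpy as np

rng = np.random.default_rng(1)
f = lambda t: np.sin(2 * np.pi * t)                       # arbitrary "true" function
x_tr, x_te = rng.uniform(0, 1, 40), rng.uniform(0, 1, 40)
y_tr = f(x_tr) + rng.normal(scale=0.3, size=40)
y_te = f(x_te) + rng.normal(scale=0.3, size=40)

for deg, label in [(2, "g1-hat (quadratic limit)"), (3, "g2-hat (cubic limit)")]:
    coef = np.polyfit(x_tr, y_tr, deg)
    rss_tr = np.sum((y_tr - np.polyval(coef, x_tr)) ** 2)
    rss_te = np.sum((y_te - np.polyval(coef, x_te)) ** 2)
    print(f"{label}: train RSS = {rss_tr:.3f}, test RSS = {rss_te:.3f}")
```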

c.

For $\lambda = 0$ the roughness penalty vanishes, and both $\hat{g}_1$ and $\hat{g}_2$ can be any function with zero training RSS (see Exercise 2 e. above). Such functions aren't uniquely defined, so in the absence of a rule for choosing $\hat{g}_1, \hat{g}_2$ in this case, we can't compare them: both have training RSS 0, but the test RSS depends on which interpolants are chosen.