islr notes and exercises from An Introduction to Statistical Learning

7. Moving Beyond Linearity



Polynomial Regression

  • Simple polynomial regression is a regression model which is polynomial[^1] in the feature variable $X$

$$Y = \beta_0 + \sum_{i = 1}^d \beta_i X^i$$

  • The model can be fit as a standard linear regression model with predictors $X_1, \dots, X_d = X, \dots, X^d$.
  • It is rare to take $d \geqslant 4$ because the resulting curves can be overly flexible and take on strange shapes.
Advantages
  • Interpretability
  • More flexibility than linear regression, can better model non-linear relationships
Disadvantages
  • Greater flexibility can lead to overfitting (can be mitigated by keeping $d$ low)
  • Imposes global structure on target function (as does linear regression)
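
As a concrete illustration, here is a minimal sketch (Python, simulated data) of fitting a degree-3 polynomial regression as an ordinary linear regression on the derived predictors $X, X^2, X^3$; the data-generating process and library choices are illustrative assumptions, not from the text.

```python
# Degree-3 polynomial regression fit as a linear model on X, X^2, X^3.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                       # simulated feature
y = 1 + 2 * X[:, 0] - 0.5 * X[:, 0] ** 3 + rng.normal(scale=1.0, size=200)

# Design matrix [X, X^2, X^3]; the intercept beta_0 is handled by the model.
X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)

fit = LinearRegression().fit(X_poly, y)
print(fit.intercept_, fit.coef_)                            # beta_0, then beta_1..beta_3
```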

Step Functions

  • Step functions model the target function as locally constant by converting the continuous variable $X$ into an ordered categorical variable, as follows:
    • Choose $K$ points $c_1, \dots, c_K \in [\min(X), \max(X)]$
    • Define $K + 1$ “dummy” variables $$\begin{aligned} C_0(X) &= I(X < c_1)\\ C_i(X) &= I(c_i \leqslant X < c_{i+1})\qquad 1 \leqslant i \leqslant K - 1\\ C_K(X) &= I(c_K \leqslant X) \end{aligned}$$


    • Fit a linear regression model to the predictors $C_1, \dots, C_K$[^2]
Advantages
  • Flexibility to model non-linear relationships
  • Can model local behavior better than global models (e.g. linear and polynomial regression)
Disadvantages
  • The locally constant assumption is strong; the chosen cutpoints may not match where the data actually changes behavior.
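
A minimal sketch of the step-function model above, assuming simulated data and using pandas to build the dummy variables $C_i(X)$ (the cutpoints are arbitrary illustrative choices):

```python
# Step-function regression: cut X at K = 3 cutpoints and regress on the dummies.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = np.where(x < 4, 1.0, np.where(x < 7, 3.0, 2.0)) + rng.normal(scale=0.5, size=300)

cuts = [2.5, 5.0, 7.5]                                   # c_1, ..., c_K
bins = pd.cut(x, bins=[x.min() - 1e-9, *cuts, x.max()])  # K + 1 intervals C_0..C_K
C = pd.get_dummies(bins, drop_first=True)                # drop C_0 (absorbed by intercept)

fit = LinearRegression().fit(C, y)
print(fit.intercept_, fit.coef_)   # mean of y in C_0, then offsets for C_1..C_K
```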

Basis Functions

In general, we can fit a regression model

$$Y = \beta_0 + \sum_{i=1}^K \beta_i b_i(X)$$

where the $b_i(X)$ are called basis functions[^3]

Advantages

Different choices of basis functions are useful for modeling different types of relationships (for example, Fourier basis functions can model periodic behavior).

Disadvantages
  • As usual, greater flexibility can lead to overfitting
  • Some choices of basis functions (i.e. basis functions which are not suited to the assumed true functional relationship) will likely have poor performance.
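
To make the Fourier example concrete, here is a minimal sketch (simulated periodic data; the frequencies used are illustrative assumptions) of regressing on the basis functions $\sin(kX), \cos(kX)$:

```python
# Regression on a Fourier basis, a natural choice for periodic relationships.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 4 * np.pi, size=300)
y = 2 * np.sin(x) + 0.5 * np.cos(2 * x) + rng.normal(scale=0.3, size=300)

K = 3  # number of frequencies
B = np.column_stack([f(k * x) for k in range(1, K + 1) for f in (np.sin, np.cos)])

fit = LinearRegression().fit(B, y)
print(fit.intercept_, fit.coef_)   # coefficients on sin(kx), cos(kx), k = 1..K
```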

Regression Splines

Regression splines are a flexible and commonly used class of basis functions which extend both polynomial and piecewise constant basis functions.

Piecewise Polynomials

Piecewise polynomials fit separate low-degree polynomials over different regions of $X$. The points where the coefficients change are called knots.

Advantages
  • Flexibility to model non-linear relationships (as with all non-linear methods discussed in this chapter)
  • Sensitivity to local behavior (less rigid than global model).
Disadvantages
  • Overly flexible - each piece has independent degrees of freedom
  • Can have unnatural breaks at knots without appropriate constraints
  • Possibility of overfitting (as with all non-linear methods discussed in this chapter)

Constraints and Splines

  • To remedy overflexibility of piecewise polynomials, we can impose constraints at the knots, e.g. continuity, differentiability of various orders (smoothness).
  • A spline is a piecewise degree-$d$ polynomial that has continuous derivatives up to order $d-1$ at each knot (hence everywhere).
Advantages
  • Same advantages as piecewise polynomials, while improving on the disadvantages
Disadvantages
  • Overfitting
  • Poor match to the true relationship

The Spline Basis Representation

  • Regression splines can be modeled using an appropriate basis, of which there are many choices.
  • For example, we can model a degree-$d$ spline with $K$ knots using the truncated power basis $$b_1(X), \dots, b_{K+d}(X) = X, \dots, X^d, h(X, \xi_1), \dots, h(X, \xi_K)$$ where $\xi_i$ is the $i$-th knot and $$h(X, \xi_i) = \begin{cases} (X-\xi_i)^d & X > \xi_i\\ 0 & X \leqslant \xi_i \end{cases}$$ is the truncated power function of degree $d$.
Advantages

Ibid.

Disadvantages

Beyond those mentioned above, splines can have high variance near $\min(X)$ and $\max(X)$. This can be overcome by using natural splines, which impose boundary constraints, i.e. constraints on the form of the model on $[\min(X), \xi_1]$ and $[\xi_K, \max(X)]$ (e.g. linearity).
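
A minimal numpy sketch of the truncated power basis above for a cubic spline ($d = 3$), with knots at quantiles of simulated data; note this is the unconstrained spline basis, not the natural-spline version:

```python
# Build the truncated power basis x, x^2, x^3, (x - xi_1)_+^3, ..., (x - xi_K)_+^3
# and fit a cubic regression spline by least squares.
import numpy as np
from sklearn.linear_model import LinearRegression

def truncated_power_basis(x, knots, d=3):
    poly = np.column_stack([x ** i for i in range(1, d + 1)])
    trunc = np.column_stack([np.where(x > xi, (x - xi) ** d, 0.0) for xi in knots])
    return np.hstack([poly, trunc])   # K + d basis functions

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + 0.1 * x + rng.normal(scale=0.3, size=300)

knots = np.quantile(x, [0.25, 0.5, 0.75])      # K = 3 knots at uniform quantiles
B = truncated_power_basis(x, knots, d=3)
fit = LinearRegression().fit(B, y)             # intercept plays the role of beta_0
```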

Choosing the Number and the Locations of the Knots

  • In practice, we place knots in a uniform fashion, e.g. by specifying the desired degrees of freedom and using software to place the knots at uniform quantiles of the data.
  • The desired degrees of freedom (hence number of knots) can be obtained using cross-validation.

Comparison to Polynomial Regression

Regression splines often give superior results to polynomial regression: the latter must use a high degree (imposing global structure) to achieve flexibility, while the former can increase the number of knots with the degree held fixed (sensitivity to local behavior), and can also vary the density of knots (placing more where the response varies rapidly and fewer where it is stable).

Smoothing Splines

An Overview of Smoothing Splines

  • A smoothing spline[^4] is a function

$$\hat{g}_\lambda = \underset{g}{\text{argmin}\,}\sum_{i=1}^n(y_i - g(x_i))^2 + \lambda \int g''(t)^2\,dt$$

where $\lambda \geqslant 0$ is a tuning parameter[^5]

  • $\lambda$ controls the bias-variance tradeoff. $\lambda = 0$ corresponds to the interpolating spline, which fits all the data points exactly and will thus be woefully overfit. In the limit $\lambda \rightarrow \infty$, $\hat{g}_\lambda$ approaches the least squares line.
  • It can be shown that the function $\hat{g}_\lambda$ is a piecewise cubic polynomial with knots at the unique $x_i$ and continuous first and second derivatives at the knots[^6]

Choosing the Smoothing Parameter λ\lambda

  • The parameter $\lambda$ controls the effective degrees of freedom $df_\lambda$. As $\lambda$ goes from $0$ to $\infty$, $df_\lambda$ goes from $n$ to $2$.
  • The effective degrees of freedom are defined to be $df_\lambda = \text{trace}(S_\lambda)$, where $S_\lambda$ is the matrix such that $\mathbf{\hat{g}}_\lambda = S_\lambda \mathbf{y}$ and $\mathbf{\hat{g}}_\lambda$ is the vector of fitted values.
  • $\lambda$ can be chosen by cross-validation. LOOCV is particularly efficient to compute[^7]

$$RSS_{cv}(\lambda) = \sum_{i=1}^n (y_i - \hat{g}_\lambda^{(-i)}(x_i))^2 = \sum_{i=1}^n\left(\frac{y_i - \hat{g}_\lambda(x_i)}{1-\{S_{\lambda}\}_{ii}}\right)^2$$

Advantages
  • Flexibility/nonlinearity
  • As a shrinkage method, effective degrees of freedom are reduced, helping to balance bias-variance tradeoff and avoid overfitting.
Disadvantages
  • As usual, flexibility can lead to overfitting
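
A minimal sketch using scipy. Note that `scipy.interpolate.UnivariateSpline` is parameterized by a smoothing factor `s` (an upper bound on the residual sum of squares) rather than by the roughness penalty $\lambda$ directly, so this only illustrates the same bias-variance tradeoff, not the exact estimator above:

```python
# Smoothing-spline-style fits: small s -> wiggly, near-interpolating fit;
# large s -> smoother fit (analogous to increasing lambda).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=200))      # UnivariateSpline needs increasing x
y = np.sin(x) + rng.normal(scale=0.3, size=200)

wiggly = UnivariateSpline(x, y, k=3, s=5.0)    # low smoothing
smooth = UnivariateSpline(x, y, k=3, s=30.0)   # heavier smoothing

x_grid = np.linspace(0, 10, 500)
y_wiggly, y_smooth = wiggly(x_grid), smooth(x_grid)
```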

Local Regression

  • Computes the fit at a target point by regressing on nearby training observations
  • Is memory-based - all the training data is necessary for computing a prediction
  • In multiple linear regression, varying coefficient models fit a global regression to some variables and a local regression to others
Algorithm: $k$-nearest neighbors regression

Fix the parameter[^8] $1 \leqslant k \leqslant n$. For each $X = x_0$:

  1. Get the neighborhood $N_0 = \{k \text{ closest } x_i\}$.
  2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point $x_i$ such that
    • each point $x_i \notin N_0$ has $K_{i0} = 0$,
    • the furthest point $x_i \in N_0$ has weight zero,
    • the closest point $x_i \in N_0$ has the highest weight.
  3. Fit a weighted least squares regression

$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0,\, \beta_1}{\text{argmin}\,} \sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1 x_i)^2$$

  4. Predict $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$.
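
A minimal numpy sketch of the algorithm above at a single target point $x_0$, using a tricube weight function (the specific kernel is an illustrative choice satisfying the conditions in step 2):

```python
import numpy as np

def local_linear_fit(x, y, x0, k):
    # 1. neighborhood: the k training points closest to x0
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]
    # 2. tricube weights: zero at the furthest neighbor, largest at the closest;
    #    points outside the neighborhood are simply excluded (weight zero)
    u = dist[idx] / dist[idx].max()
    w = (1 - u ** 3) ** 3
    # 3. weighted least squares for (beta_0, beta_1)
    X = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
    # 4. prediction at x0
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
y_hat = np.array([local_linear_fit(x, y, x0, k=40) for x0 in np.linspace(0, 10, 100)])
```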

Generalized Additive Models

A generalized additive model (GAM) is a model which is a sum of nonlinear functions of the individual predictors.

GAMs for Regression Problems

  • A GAM for regression[^9] is a model

$$Y = \beta_0 + \sum_{j=1}^p f_j(X_j) + \epsilon$$

where the functions fjf_j are smooth non-linear functions.

  • GAMs can be used to combine methods from this chapter – one can fit different nonlinear functions $f_j$ to the predictors $X_j$[^10]
  • Standard software can fit GAMs with smoothing splines via backfitting
Advantages
  • Nonlinearity hence flexibility
  • Automatically introduces nonlinearity, obviating the need to experiment manually with different nonlinear transformations
  • Interpretability/inference - the $f_j$ allow us to consider the effect of each feature $X_j$ independently of the others.
  • Smoothness of the individual $f_j$ can be summarized via degrees of freedom.
  • Represents a nice compromise between linear and fully non-parametric models (see §8).
Disadvantages
  • Usual disadvantages of nonlinearity
  • Doesn’t allow for interactions between features (this can be overcome by including nonlinear functions of interaction terms such as $f_{jk}(X_j, X_k)$)
  • The additive constraint is strong, restricts flexibility.
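
To make the backfitting idea concrete, here is a minimal sketch of fitting an additive model with univariate smoothing splines (via scipy) as the componentwise smoothers; the smoothing factor and simulated data are illustrative assumptions, and packages such as pygam provide full-featured GAM implementations:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit_gam(X, y, s=None, n_iter=20):
    """Fit y ~ beta_0 + sum_j f_j(X_j) by backfitting with spline smoothers."""
    n, p = X.shape
    beta0 = y.mean()
    f_hat = np.zeros((n, p))                  # current fitted values of each f_j
    for _ in range(n_iter):
        for j in range(p):
            # partial residuals: remove the current fit of all other components
            r = y - beta0 - f_hat.sum(axis=1) + f_hat[:, j]
            order = np.argsort(X[:, j])       # the smoother requires sorted x
            fj = UnivariateSpline(X[order, j], r[order], k=3, s=s)(X[:, j])
            f_hat[:, j] = fj - fj.mean()      # center each f_j for identifiability
    return beta0, f_hat

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * (X[:, 1] - 5) ** 2 + rng.normal(scale=0.3, size=300)
beta0, f_hat = backfit_gam(X, y, s=30.0)
y_fitted = beta0 + f_hat.sum(axis=1)
```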

GAMs for Classification Problems

GAMs can be used for classification. For example, a GAM for logistic regression is

$$\log\left(\frac{p_k(X)}{1 - p_k(X)}\right) = \beta_0 + \sum_{j=1}^p f_j(X_j)$$

where $p_k(X) = \text{Pr}(Y = k \mid X)$.
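
A minimal sketch of a classification GAM using the third-party `pygam` package (assumed installed); each `s(j)` term is a smooth spline function of feature $j$, combined additively on the logit scale:

```python
import numpy as np
from pygam import LogisticGAM, s   # assumes pygam is installed

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
logit = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 - 1
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # simulated binary response

gam = LogisticGAM(s(0) + s(1)).fit(X, y)
p_hat = gam.predict_proba(X)                    # estimated Pr(Y = 1 | X)
```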


Footnotes

[^1]: In the statistical literature, polynomial regression is sometimes referred to as linear regression. This is because the model is linear in the population parameters $\beta_i$.

[^2]: The variable $C_0(X)$ accounts for an intercept. Alternatively, fit a linear model to $C_0, \dots, C_K$ with no intercept.

[^3]: Such a model amounts to the assumption that the target function lives in a finite-dimensional subspace of the vector space of all functions $f: X \rightarrow Y$.

[^4]: The function $g$ is not guaranteed to be smooth in the sense of infinitely differentiable. The penalty on the second derivative (curvature) penalizes the “roughness” or “wiggliness” of $g$, hence “smoothes out” noise in the data. Other penalties have been used.

[^5]: A tuning parameter is also called a hyperparameter.

[^6]: Thus $\hat{g}$ is a natural cubic spline with knots at the $x_i$. However, it is not the spline one obtains in §7.4.3. It is a “shrunken” version, where $\lambda$ controls the shrinkage.

[^7]: Compare to the similar formula in §5.1.2.

[^8]: Our description of the algorithm deviates a bit from the book, but it’s equivalent.

[^9]: “Additive” because we are summing the $f_j$. “Generalized” because it generalizes the linear functions $\beta_j X_j$ in ordinary linear regression.