islr notes and exercises from An Introduction to Statistical Learning

7. Moving Beyond Linearity



Polynomial Regression

  • Simple polynomial regression is a regression model which is polynomial[^1] in the feature variable $X$

$$Y = \beta_0 + \sum_{i = 1}^d \beta_i X^i$$

  • The model can be fit as a standard linear regression model with predictors $X_1, \dots, X_d = X, \dots, X^d$.
  • It is rare to take $d \geqslant 4$ because the resulting curves can be overly flexible and take on strange shapes.
Advantages
  • Interpretability
  • More flexibility than linear regression, can better model non-linear relationships
Disadvantages
  • Greater flexibility can lead to overfitting (can be mitigated by keeping $d$ low)
  • Imposes global structure on target function (as does linear regression)
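
As a concrete illustration, here is a minimal sketch (Python, simulated data) of fitting a degree-3 polynomial regression as an ordinary linear regression on the derived predictors $X, X^2, X^3$; the data-generating process and library choices are illustrative assumptions, not from the text.

```python
# Degree-3 polynomial regression fit as a linear model on X, X^2, X^3.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                       # simulated feature
y = 1 + 2 * X[:, 0] - 0.5 * X[:, 0] ** 3 + rng.normal(scale=1.0, size=200)

# Design matrix [X, X^2, X^3]; the intercept beta_0 is handled by the model.
X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)

fit = LinearRegression().fit(X_poly, y)
print(fit.intercept_, fit.coef_)                            # beta_0, then beta_1..beta_3
```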

Step Functions

  • Step functions model the target function as locally constant by converting the continuous variable $X$ into an ordered categorical variable, as follows:
    • Choose $K$ points $c_1, \dots, c_K \in [\min(X), \max(X)]$
    • Define $K + 1$ “dummy” variables $$\begin{aligned} C_0(X) &= I(X < c_1)\\ C_i(X) &= I(c_i \leqslant X < c_{i+1})\qquad 1 \leqslant i \leqslant K - 1\\ C_K(X) &= I(c_K \leqslant X) \end{aligned}$$


    • Fit a linear regression model to the predictors $C_1, \dots, C_K$[^2]
Advantages
  • Flexibility to model non-linear relationships
  • Can model local behavior better than global models (e.g. linear and polynomial regression)
Disadvantages
  • The locally constant assumption is strong; the chosen cutpoints may not match where the data actually changes behavior.
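
A minimal sketch of the step-function model above, assuming simulated data and using pandas to build the dummy variables $C_i(X)$ (the cutpoints are arbitrary illustrative choices):

```python
# Step-function regression: cut X at K = 3 cutpoints and regress on the dummies.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = np.where(x < 4, 1.0, np.where(x < 7, 3.0, 2.0)) + rng.normal(scale=0.5, size=300)

cuts = [2.5, 5.0, 7.5]                                   # c_1, ..., c_K
bins = pd.cut(x, bins=[x.min() - 1e-9, *cuts, x.max()])  # K + 1 intervals C_0..C_K
C = pd.get_dummies(bins, drop_first=True)                # drop C_0 (absorbed by intercept)

fit = LinearRegression().fit(C, y)
print(fit.intercept_, fit.coef_)   # mean of y in C_0, then offsets for C_1..C_K
```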

Basis Functions

In general, we can fit a regression model

$$Y = \beta_0 + \sum_{i=1}^K \beta_i b_i(X)$$

where the $b_i(X)$ are called basis functions[^3]

Advantages

Different choices of basis functions are useful for modeling different types of relationships (for example, Fourier basis functions can model periodic behavior).

Disadvantages
  • As usual, greater flexibility can lead to overfitting
  • Some choices of basis functions (i.e. basis functions which are not suited to the assumed true functional relationship) will likely have poor performance.
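
To make the Fourier example concrete, here is a minimal sketch (simulated periodic data; the frequencies used are illustrative assumptions) of regressing on the basis functions $\sin(kX), \cos(kX)$:

```python
# Regression on a Fourier basis, a natural choice for periodic relationships.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 4 * np.pi, size=300)
y = 2 * np.sin(x) + 0.5 * np.cos(2 * x) + rng.normal(scale=0.3, size=300)

K = 3  # number of frequencies
B = np.column_stack([f(k * x) for k in range(1, K + 1) for f in (np.sin, np.cos)])

fit = LinearRegression().fit(B, y)
print(fit.intercept_, fit.coef_)   # coefficients on sin(kx), cos(kx), k = 1..K
```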

Regression Splines

Regression splines are a flexible and commonly used class of basis functions which extend both polynomial and piecewise constant basis functions.

Piecewise Polynomials

Piecewise polynomials fit separate low-degree polynomials over different regions of $X$. The points where the coefficients change are called knots.

Advantages
  • Flexibility to model non-linear relationships (as with all non-linear methods discussed in this chapter)
  • Sensitivity to local behavior (less rigid than global model).
Disadvantages
  • Overly flexible - each piece has independent degrees of freedom
  • Can have unnatural breaks at knots without appropriate constraints
  • Possibility of overfitting (as with all non-linear methods discussed in this chapter)

Constraints and Splines

  • To remedy overflexibility of piecewise polynomials, we can impose constraints at the knots, e.g. continuity, differentiability of various orders (smoothness).
  • A spline is a piecewise degree-$d$ polynomial that has continuous derivatives up to order $d-1$ at each knot (hence everywhere).
Advantages
  • Same advantages as piecewise polynomials, while improving on the disadvantages
Disadvantages
  • Overfitting
  • Poor match to the true relationship

The Spline Basis Representation

  • Regression splines can be modeled using an appropriate basis, of which there are many choices.
  • For example, we can model a degree-$d$ spline with $K$ knots using the truncated power basis $$b_1(X), \dots, b_{K+d}(X) = X, \dots, X^d, h(X, \xi_1), \dots, h(X, \xi_K)$$ where $\xi_i$ is the $i$-th knot and $$h(X, \xi_i) = \begin{cases} (X-\xi_i)^d & X > \xi_i\\ 0 & X \leqslant \xi_i \end{cases}$$ is the truncated power function of degree $d$.
Advantages

Ibid.

Disadvantages

Beyond those mentioned above, splines can have high variance near $\min(X)$ and $\max(X)$. This can be overcome by using natural splines, which impose boundary constraints, i.e. constraints on the form of the model on $[\min(X), \xi_1]$ and $[\xi_K, \max(X)]$ (e.g. linearity).
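
A minimal numpy sketch of the truncated power basis above for a cubic spline ($d = 3$), with knots at quantiles of simulated data; note this is the unconstrained spline basis, not the natural-spline version:

```python
# Build the truncated power basis x, x^2, x^3, (x - xi_1)_+^3, ..., (x - xi_K)_+^3
# and fit a cubic regression spline by least squares.
import numpy as np
from sklearn.linear_model import LinearRegression

def truncated_power_basis(x, knots, d=3):
    poly = np.column_stack([x ** i for i in range(1, d + 1)])
    trunc = np.column_stack([np.where(x > xi, (x - xi) ** d, 0.0) for xi in knots])
    return np.hstack([poly, trunc])   # K + d basis functions

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + 0.1 * x + rng.normal(scale=0.3, size=300)

knots = np.quantile(x, [0.25, 0.5, 0.75])      # K = 3 knots at uniform quantiles
B = truncated_power_basis(x, knots, d=3)
fit = LinearRegression().fit(B, y)             # intercept plays the role of beta_0
```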

Choosing the Number and the Locations of the Knots

  • In practice, we place knots in a uniform fashion, e.g. by specifying the desired degrees of freedom and using software to place the knots at uniform quantiles of the data.
  • The desired degrees of freedom (hence number of knots) can be obtained using cross-validation.

Comparison to Polynomial Regression

Regression splines often give superior results to polynomial regression: the latter must use a high degree (imposing global structure) to achieve flexibility, while the former can increase the number of knots with the degree held fixed (sensitivity to local behavior), and can also vary the density of knots (placing more where the response varies rapidly and fewer where it is stable).

Smoothing Splines

An Overview of Smoothing Splines

  • A smoothing spline[^4] is a function

$$\hat{g}_\lambda = \underset{g}{\text{argmin}\,}\sum_{i=1}^n(y_i - g(x_i))^2 + \lambda \int g''(t)^2\,dt$$

where $\lambda \geqslant 0$ is a tuning parameter[^5]

  • $\lambda$ controls the bias-variance tradeoff. $\lambda = 0$ corresponds to the interpolating spline, which fits all the data points exactly and will thus be woefully overfit. In the limit $\lambda \rightarrow \infty$, $\hat{g}_\lambda$ approaches the least squares line.
  • It can be shown that the function $\hat{g}_\lambda$ is a piecewise cubic polynomial with knots at the unique $x_i$ and continuous first and second derivatives at the knots[^6]

Choosing the Smoothing Parameter λ\lambda

  • The parameter $\lambda$ controls the effective degrees of freedom $df_\lambda$. As $\lambda$ goes from $0$ to $\infty$, $df_\lambda$ goes from $n$ to $2$.
  • The effective degrees of freedom are defined to be $df_\lambda = \text{trace}(S_\lambda)$, where $S_\lambda$ is the matrix such that $\mathbf{\hat{g}}_\lambda = S_\lambda \mathbf{y}$ and $\mathbf{\hat{g}}_\lambda$ is the vector of fitted values.
  • $\lambda$ can be chosen by cross-validation. LOOCV is particularly efficient to compute[^7]

$$RSS_{cv}(\lambda) = \sum_{i=1}^n (y_i - \hat{g}_\lambda^{(-i)}(x_i))^2 = \sum_{i=1}^n\left(\frac{y_i - \hat{g}_\lambda(x_i)}{1-\{S_{\lambda}\}_{ii}}\right)^2$$

Advantages
  • Flexibility/nonlinearity
  • As a shrinkage method, effective degrees of freedom are reduced, helping to balance bias-variance tradeoff and avoid overfitting.
Disadvantages
  • As usual, flexibility can lead to overfitting
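
A minimal sketch using scipy. Note that `scipy.interpolate.UnivariateSpline` is parameterized by a smoothing factor `s` (an upper bound on the residual sum of squares) rather than by the roughness penalty $\lambda$ directly, so this only illustrates the same bias-variance tradeoff, not the exact estimator above:

```python
# Smoothing-spline-style fits: small s -> wiggly, near-interpolating fit;
# large s -> smoother fit (analogous to increasing lambda).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=200))      # UnivariateSpline needs increasing x
y = np.sin(x) + rng.normal(scale=0.3, size=200)

wiggly = UnivariateSpline(x, y, k=3, s=5.0)    # low smoothing
smooth = UnivariateSpline(x, y, k=3, s=30.0)   # heavier smoothing

x_grid = np.linspace(0, 10, 500)
y_wiggly, y_smooth = wiggly(x_grid), smooth(x_grid)
```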

Local Regression

  • Computes the fit at a target point by regressing on nearby training observations
  • Is memory-based - all the training data is necessary for computing a prediction
  • In multiple linear regression, varying coefficient models fit a global regression to some variables and a local regression to others
Algorithm: $k$-nearest neighbors regression

Fix the parameter[^8] $1 \leqslant k \leqslant n$. For each $X = x_0$:

  1. Get the neighborhood $N_0 = \{k \text{ closest } x_i\}$.
  2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point $x_i$ such that
    • each point $x_i \notin N_0$ has $K_{i0} = 0$,
    • the furthest point $x_i \in N_0$ has weight zero,
    • the closest point $x_i \in N_0$ has the highest weight.
  3. Fit a weighted least squares regression

$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0,\, \beta_1}{\text{argmin}\,} \sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1 x_i)^2$$

  4. Predict $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$.
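
A minimal numpy sketch of the algorithm above at a single target point $x_0$, using a tricube weight function (the specific kernel is an illustrative choice satisfying the conditions in step 2):

```python
import numpy as np

def local_linear_fit(x, y, x0, k):
    # 1. neighborhood: the k training points closest to x0
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]
    # 2. tricube weights: zero at the furthest neighbor, largest at the closest;
    #    points outside the neighborhood are simply excluded (weight zero)
    u = dist[idx] / dist[idx].max()
    w = (1 - u ** 3) ** 3
    # 3. weighted least squares for (beta_0, beta_1)
    X = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
    # 4. prediction at x0
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
y_hat = np.array([local_linear_fit(x, y, x0, k=40) for x0 in np.linspace(0, 10, 100)])
```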

Generalized Additive Models

A generalized additive model (GAM) is a model which is a sum of nonlinear functions of the individual predictors.

GAMs for Regression Problems

  • A GAM for regression[^9] is a model

$$Y = \beta_0 + \sum_{j=1}^p f_j(X_j) + \epsilon$$

where the functions fjf_j are smooth non-linear functions.

  • GAMs can be used to combine methods from this chapter – one can fit different nonlinear functions $f_j$ to the predictors $X_j$[^10]
  • Standard software can fit GAMs with smoothing splines via backfitting
Advantages
  • Nonlinearity hence flexibility
  • Automatically introduces nonlinearity, obviating the need to experiment manually with different nonlinear transformations
  • Interpretability/inference - the $f_j$ allow us to consider the effect of each feature $X_j$ independently of the others.
  • Smoothness of the individual $f_j$ can be summarized via degrees of freedom.
  • Represents a nice compromise between linear and fully non-parametric models (see §8).
Disadvantages
  • Usual disadvantages of nonlinearity
  • Doesn’t allow for interactions between features (this can be overcome by including nonlinear functions of interaction terms such as $f_{jk}(X_j, X_k)$)
  • The additive constraint is strong, restricts flexibility.
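
To make the backfitting idea concrete, here is a minimal sketch of fitting an additive model with univariate smoothing splines (via scipy) as the componentwise smoothers; the smoothing factor and simulated data are illustrative assumptions, and packages such as pygam provide full-featured GAM implementations:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit_gam(X, y, s=None, n_iter=20):
    """Fit y ~ beta_0 + sum_j f_j(X_j) by backfitting with spline smoothers."""
    n, p = X.shape
    beta0 = y.mean()
    f_hat = np.zeros((n, p))                  # current fitted values of each f_j
    for _ in range(n_iter):
        for j in range(p):
            # partial residuals: remove the current fit of all other components
            r = y - beta0 - f_hat.sum(axis=1) + f_hat[:, j]
            order = np.argsort(X[:, j])       # the smoother requires sorted x
            fj = UnivariateSpline(X[order, j], r[order], k=3, s=s)(X[:, j])
            f_hat[:, j] = fj - fj.mean()      # center each f_j for identifiability
    return beta0, f_hat

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * (X[:, 1] - 5) ** 2 + rng.normal(scale=0.3, size=300)
beta0, f_hat = backfit_gam(X, y, s=30.0)
y_fitted = beta0 + f_hat.sum(axis=1)
```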

GAMs for Classification Problems

GAMs can be used for classification. For example, a GAM for logistic regression is

$$\log\left(\frac{p_k(X)}{1 - p_k(X)}\right) = \beta_0 + \sum_{j=1}^p f_j(X_j)$$

where $p_k(X) = \text{Pr}(Y = k \mid X)$.
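
A minimal sketch of a classification GAM using the third-party `pygam` package (assumed installed); each `s(j)` term is a smooth spline function of feature $j$, combined additively on the logit scale:

```python
import numpy as np
from pygam import LogisticGAM, s   # assumes pygam is installed

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
logit = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 - 1
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # simulated binary response

gam = LogisticGAM(s(0) + s(1)).fit(X, y)
p_hat = gam.predict_proba(X)                    # estimated Pr(Y = 1 | X)
```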


Footnotes

[^1]: In the statistical literature, polynomial regression is sometimes referred to as linear regression. This is because the model is linear in the population parameters $\beta_i$.

[^2]: The variable $C_0(X)$ accounts for an intercept. Alternatively, fit a linear model to $C_0, \dots, C_K$ with no intercept.

[^3]: Such a model amounts to the assumption that the target function lives in a finite-dimensional subspace of the vector space of all functions $f: X \rightarrow Y$.

[^4]: The function $g$ is not guaranteed to be smooth in the sense of infinitely differentiable. The penalty on the second derivative (curvature) penalizes the “roughness” or “wiggliness” of $g$, hence “smoothes out” noise in the data. Other penalties have been used.

[^5]: A tuning parameter is also called a hyperparameter.

[^6]: Thus $\hat{g}$ is a natural cubic spline with knots at the $x_i$. However, it is not the spline one obtains in §7.4.3. It is a “shrunken” version, where $\lambda$ controls the shrinkage.

[^7]: Compare to the similar formula in §5.1.2.

[^8]: Our description of the algorithm deviates a bit from the book, but it’s equivalent.

[^9]: “Additive” because we are summing the $f_j$. “Generalized” because it generalizes the linear functions $\beta_j X_j$ in ordinary linear regression.