7. Moving Beyond Linearity
Polynomial Regression
- Simple polynomial regression is a regression model which is polynomial in the feature variable X
$$Y = \beta_0 + \sum_{i=1}^d \beta_i X^i$$
- The model can be fit as a standard linear regression model with predictors $X_1, \dots, X_d = X, X^2, \dots, X^d$.
- It is rare to take $d \geqslant 4$ because the resulting curve can become overly flexible and take on strange shapes.
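As a quick illustration, here is a minimal sketch of such a fit (degree 3, least squares on the powers of X); the data x, y are synthetic and assumed purely for illustration.

```python
import numpy as np

# Synthetic 1-D feature and response, assumed purely for illustration.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)

d = 3  # polynomial degree; d >= 4 is rarely used
# Design matrix with columns 1, x, x^2, ..., x^d, then ordinary least squares.
X = np.vander(x, N=d + 1, increasing=True)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta  # fitted values of the polynomial regression
```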
Advantages
- Interpretability
- More flexibility than linear regression, can better model non-linear relationships
Disadvantages
- Greater flexibility can lead to overfitting (can be mitigated by keeping d low)
- Imposes global structure on target function (as does linear regression)
Step Functions
Step functions cut the range of X into bins using chosen cut points and fit a constant (the mean of Y) within each bin, i.e. a regression on indicator variables for the bins.
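A minimal sketch, assuming synthetic data: cut X at its quartiles and regress on the bin indicators (the fitted coefficients are just the within-bin means of y).

```python
import numpy as np

# Synthetic data, assumed only for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)

# Cut points at the quartiles of x define four piecewise-constant regions.
cuts = np.quantile(x, [0.25, 0.5, 0.75])
bin_idx = np.digitize(x, cuts)               # bin index 0..3 for each observation

# Indicator (one-hot) basis C_0(x), ..., C_3(x); least squares gives the bin means.
C = np.eye(len(cuts) + 1)[bin_idx]
beta, *_ = np.linalg.lstsq(C, y, rcond=None)
y_hat = C @ beta                             # step-function fit
```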
Advantages
- Flexibility to model non-linear relationships
- Can model local behavior better than global models (e.g. linear and polynomial regression)
Disadvantages
- Locally constant assumption is strong; true breakpoints in the data may not be captured unless the cut points are chosen well.
Basis Functions
In general, we can fit a regression model
$$Y = \beta_0 + \sum_{i=1}^K \beta_i b_i(X)$$
where the $b_i(X)$ are called basis functions
Advantages
Different choices of basis functions are useful for modeling different types of relationships (for example, Fourier basis functions can model periodic behavior).
Disadvantages
- As usual, greater flexibility can lead to overfitting
- Some choices of basis functions (i.e. basis functions which are not suited to the assumed true functional relationship) will likely have poor performance.
Regression Splines
Regression splines are a flexible and commonly used class of basis functions which extend both polynomial and piecewise constant basis functions.
Piecewise Polynomials
Piecewise polynomials fit separate low-degree polynomials over different regions of X. The points where the coefficients change are called knots.
Advantages
- Flexibility to model non-linear relationships (as with all non-linear methods discussed in this chapter)
- Sensitivity to local behavior (less rigid than global model).
Disadvantages
- Overly flexible - each piece has independent degrees of freedom
- Can have unnatural breaks at knots without appropriate constraints
- Possibility of overfitting (as with all non-linear methods discussed in this chapter)
Constraints and Splines
- To remedy overflexibility of piecewise polynomials, we can impose constraints at the knots, e.g. continuity, differentiability of various orders (smoothness).
- A spline is a piecewise degree d polynomial that has continuous derivatives up to order d−1 at each knot (hence everywhere).
Advantages
- Same advantages as piecewise polynomials, while improving on the disadvantages
Disadvantages
- Overfitting
- Poor match to the true relationship
The Spline Basis Representation
- Regression splines can be modeled using an appropriate basis, of which there are many choices.
- For example, we can model a degree-$d$ spline with $K$ knots using the truncated power basis
$$b_1(X), \dots, b_{K+d}(X) = X, X^2, \dots, X^d, h(X, \xi_1), \dots, h(X, \xi_K)$$
where $\xi_i$ is the $i$-th knot and
$$h(X, \xi_i) = (X - \xi_i)^d_+ = \begin{cases} (X - \xi_i)^d & X > \xi_i \\ 0 & X \leqslant \xi_i \end{cases}$$
is the truncated power function of degree $d$.
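The basis above is easy to build directly; below is a minimal numpy sketch for a cubic spline (d = 3) with K = 3 knots at the quartiles of x, with synthetic x, y assumed only for illustration.

```python
import numpy as np

# Synthetic data, assumed only for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)

d = 3                                        # cubic spline
knots = np.quantile(x, [0.25, 0.5, 0.75])    # K = 3 knots

# Truncated power basis: x, x^2, x^3, plus h(x, xi_k) = (x - xi_k)^d_+ per knot.
powers = np.column_stack([x ** j for j in range(1, d + 1)])
truncated = np.column_stack([np.where(x > k, (x - k) ** d, 0.0) for k in knots])
B = np.column_stack([np.ones_like(x), powers, truncated])   # intercept + K + d columns

beta, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta                             # regression spline fit
```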
Advantages
Ibid.
Disadvantages
Beyond those mentioned above, splines can have high variance near the boundary of the data, i.e. near min(X) and max(X). This can be mitigated by using natural splines, which impose boundary constraints (e.g. linearity) on the form of the model on $[\min(X), \xi_1]$ and $[\xi_K, \max(X)]$.
Choosing the Number and the Locations of the Knots
- In practice, we place knots in a uniform fashion, e.g. by specifying the desired degrees of freedom and using software to place the knots at uniform quantiles of the data (see the sketch after this list).
- The desired degrees of freedom (hence number of knots) can be obtained using cross-validation.
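A minimal sketch of the software route, assuming the patsy package is available: its bs() transform builds a B-spline basis and, when only df is specified, places the interior knots at uniform quantiles of the data.

```python
import numpy as np
from patsy import dmatrix   # assumes the patsy package is installed

# Synthetic data, assumed only for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)

# Requesting df=6 with a cubic B-spline basis lets patsy place the interior
# knots at uniform quantiles of x.
B = np.asarray(dmatrix("bs(x, df=6, degree=3)", {"x": x}))

beta, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta
```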
Comparison to Polynomial Regression
Regression splines often give superior results to polynomial regression: the latter must use a high degree (imposing global structure) to gain flexibility, while the former can increase the number of knots with the degree fixed (sensitivity to local behavior) and can vary the density of the knots (placing more where the response varies rapidly and fewer where it is more stable).
Smoothing Splines
An Overview of Smoothing Splines
- A smoothing spline is a function
$$\hat{g}_\lambda = \underset{g}{\operatorname{arg\,min}} \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt$$
where $\lambda \geqslant 0$ is a tuning parameter
- λ controls the bias-variance tradeoff. λ = 0 corresponds to the interpolating spline, which fits all the data points exactly and will thus be woefully overfit. In the limit λ → ∞, $\hat{g}_\lambda$ approaches the least squares line.
- It can be shown that the function $\hat{g}_\lambda$ is a piecewise cubic polynomial with knots at the unique $x_i$ and continuous first and second derivatives at each knot.
Choosing the Smoothing Parameter λ
- The parameter λ controls the effective degrees of freedom dfλ. As λ goes from 0 to ∞, dfλ goes from n to 2.
- The effective degrees of freedom is defined to be
$$df_\lambda = \operatorname{trace}(S_\lambda)$$
where $S_\lambda$ is the matrix such that $\hat{\mathbf{g}}_\lambda = S_\lambda \mathbf{y}$, with $\hat{\mathbf{g}}_\lambda$ the vector of fitted values.
- λ can be chosen by cross-validation. LOOCV is particularly efficient to compute
$$RSS_{cv}(\lambda) = \sum_{i=1}^n \left( y_i - \hat{g}_\lambda^{(-i)}(x_i) \right)^2 = \sum_{i=1}^n \left( \frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}} \right)^2$$
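A minimal sketch, assuming a recent SciPy (1.10+) that provides scipy.interpolate.make_smoothing_spline, which minimizes exactly the penalized criterion above; passing lam=None lets it choose λ by generalized cross-validation rather than the LOOCV formula shown here.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline   # SciPy >= 1.10 assumed

# Synthetic data, assumed only for illustration; x must be sorted (increasing).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)

# lam is the roughness penalty lambda; lam=None chooses it by generalized
# cross-validation instead of a fixed value.
g_fixed = make_smoothing_spline(x, y, lam=1.0)
g_auto = make_smoothing_spline(x, y)

y_hat = g_auto(x)   # fitted values at the training points
```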
Advantages
- Flexibility/nonlinearity
- As a shrinkage method, it reduces the effective degrees of freedom, helping to balance the bias-variance tradeoff and avoid overfitting.
Disadvantages
- As usual, flexibility can lead to overfitting
Local Regression
- Computes the fit at a target point by regressing on nearby training observations
- Is memory-based - all the training data is necessary for computing a prediction
- In multiple linear regression, varying coefficient models fit a global regression in some variables and a local one in others
Algorithm: Local regression via k nearest neighbors
Fix the parameter $1 \leqslant k \leqslant n$. For each target point $X = x_0$:
- Get the neighborhood $N_0 = \{ \text{the } k \text{ closest } x_i \}$.
- Assign a weight $K_{i0} = K(x_i, x_0)$ to each point $x_i$ such that
  - each point $x_i \notin N_0$ has $K_{i0} = 0$,
  - the furthest point $x_i \in N_0$ has weight zero,
  - the closest point $x_i \in N_0$ has the highest weight.
- Fit a weighted least squares regression
$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0, \beta_1}{\operatorname{arg\,min}} \sum_{i=1}^n K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2$$
- Predict $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$.
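The algorithm translates almost line for line into numpy. Below is a minimal sketch with a tricube weight kernel (one common choice for K) and synthetic data, both assumptions for illustration.

```python
import numpy as np

# Synthetic data, assumed only for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)

def local_fit(x0, x, y, k=30):
    """Local linear regression at the target point x0 using the k nearest neighbors."""
    dist = np.abs(x - x0)
    radius = np.sort(dist)[k - 1]                  # distance to the k-th nearest point
    # Tricube weights: zero outside the neighborhood (and at its edge),
    # largest for the points closest to x0.
    w = np.where(dist <= radius, (1 - (dist / radius) ** 3) ** 3, 0.0)
    # Weighted least squares for (beta_0, beta_1).
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0] + beta[1] * x0                  # prediction f_hat(x0)

y0_hat = local_fit(0.5, x, y)
```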
Generalized Additive Models
A generalized additive model (GAM) is a model which is a sum of nonlinear functions of the individual predictors.
GAMs for Regression Problems
- A GAM for regression is a model
$$Y = \beta_0 + \sum_{j=1}^p f_j(X_j) + \epsilon$$
where the functions fj are smooth non-linear functions.
- GAMs can be used to combine methods from this chapter – one can fit different nonlinear functions fj to the predictors Xj [^10]
- Standard software can fit GAMs with smoothing splines via backfitting
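A minimal backfitting sketch for a two-feature additive model; for simplicity each fj is a cubic polynomial smoother standing in for the smoothing splines that standard software would use, and the data are synthetic, all of which is assumed only for illustration.

```python
import numpy as np

# Synthetic data with two features, assumed only for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

def smooth(x, r):
    """Toy smoother: fit a cubic polynomial of the partial residual r on x."""
    return np.polyval(np.polyfit(x, r, deg=3), x)

# Backfitting: repeatedly re-fit each f_j to the partial residuals of the others.
n, p = X.shape
beta0 = y.mean()
f = np.zeros((n, p))                      # current estimates of f_j(x_ij)
for _ in range(20):                       # a fixed number of sweeps for the sketch
    for j in range(p):
        partial_residual = y - beta0 - f[:, np.arange(p) != j].sum(axis=1)
        f[:, j] = smooth(X[:, j], partial_residual)
        f[:, j] -= f[:, j].mean()         # center each f_j for identifiability

y_hat = beta0 + f.sum(axis=1)
```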
Advantages
- Nonlinearity hence flexibility
- Automatically introduces nonlinearity - obviates the need to experiment with different nonlinear transformations
- Interpretability/inference - the fj allow us to consider the effect of each feature Xj independently of the others.
- Smoothness of individual fj can be summarized via degrees of freedom.
- Represents a nice compromise between linear and fully non-parametric models (see §8).
Disadvantages
- Usual disadvantages of nonlinearity
- Doesn’t allow for interactions between features (this can be overcome by including nonlinear functions of the interaction terms f(Xj, Xk))
- The additive constraint is strong, restricts flexibility.
GAMs for Classification Problems
GAMs can be used for classification. For example, a GAM for logistic regression is
$$\log\left(\frac{p_k(X)}{1 - p_k(X)}\right) = \beta_0 + \sum_{j=1}^p f_j(X_j)$$
where $p_k(X) = \Pr(Y = k \mid X)$.
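A minimal sketch of this classification model, assuming the third-party pygam package is installed (its LogisticGAM with spline terms s(j) fits an additive logistic model); the data are synthetic.

```python
import numpy as np
from pygam import LogisticGAM, s   # assumes the third-party pygam package

# Synthetic two-feature classification data, assumed only for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
logit = np.sin(X[:, 0]) + 0.5 * X[:, 1]
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-logit))).astype(int)

# Additive logistic model: log-odds = beta_0 + f_1(X_1) + f_2(X_2),
# with each f_j a penalized spline term.
gam = LogisticGAM(s(0) + s(1)).fit(X, y)
prob = gam.predict_proba(X)        # fitted P(Y = 1 | X)
```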