islr notes and exercises from An Introduction to Statistical Learning

3. Linear Regression

Simple Linear Regression

  • For data $(X, Y)$, $X, Y \in \mathbb{R}$, simple linear regression models $Y$ as a linear function of $X$

$$Y = \beta_0 + \beta_1 X + \epsilon$$

and predicts

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$

where $\hat{\beta}_i$ is the estimate for $\beta_i$.

Estimating the Coefficients

Estimates of the coefficients $\beta_0, \beta_1$ arise from minimizing the residual sum of squares

$$RSS = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

Using calculus, one finds the estimates [7]

$$\begin{aligned} \hat{\beta}_1 &= \frac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{\sum_i (x_i - \overline{x})^2}\\ \hat{\beta}_0 &= \overline{y} - \hat{\beta}_1\overline{x} \end{aligned}$$

These are sometimes called the least squares estimates.
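
A minimal numpy sketch of these formulas on simulated data (the data and variable names here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)  # true beta_0 = 2, beta_1 = 3

# Least squares estimates from the closed-form expressions above
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

print(beta0_hat, beta1_hat)  # should be close to 2 and 3
```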

Assessing the Accuracy of the Coefficient Estimates

  • The population regression line [8] is the line given by $Y = \beta_0 + \beta_1 X$ and the least squares regression line is the line given by $Y = \hat{\beta}_0 + \hat{\beta}_1 X$

  • The least squares estimate is an unbiased estimator [9]

  • Assuming the errors $\epsilon_i$ are uncorrelated with common variance $\sigma^2 = \mathbb{V}(\epsilon)$, the standard errors of $\hat{\beta}_0, \hat{\beta}_1$ are

$$\mathbf{se}(\hat{\beta}_0) = \sigma\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum_i (x_i - \overline{x})^2}}$$

$$\mathbf{se}(\hat{\beta}_1) = \sigma\sqrt{\frac{1}{\sum_i (x_i - \overline{x})^2}}$$

  • The estimated standard errors $\hat{\mathbf{se}}(\hat{\beta}_0), \hat{\mathbf{se}}(\hat{\beta}_1)$ are found by estimating $\sigma$ with the residual standard error [10]

$$\hat{\sigma} = RSE := \sqrt{\frac{RSS}{n-2}}$$

  • Approximate $1 - \alpha$ confidence intervals [11] for the least squares estimators are

$$\hat{\beta}_i \pm t_{\alpha/2}\,\hat{\mathbf{se}}(\hat{\beta}_i)$$

  • The most common hypothesis tests for the least squares estimates are

$$H_0: \beta_i = 0 \qquad H_a: \beta_i \neq 0$$

The rejection region is

$$\{\, t \in \mathbb{R}\ |\ |t| > t_{\alpha/2,\, n-2} \,\}$$

where $t$ is the test statistic [12]

$$t = \frac{\hat{\beta}_i - \beta_i}{\hat{\mathbf{se}}(\hat{\beta}_i)},$$

which equals $\hat{\beta}_i / \hat{\mathbf{se}}(\hat{\beta}_i)$ under $H_0$.
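
Putting the last few formulas together, here is a sketch in numpy/scipy that computes the RSE, the estimated standard errors, a 95% confidence interval, and the t-test of $H_0: \beta_1 = 0$ (simulated data; names are mine, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
resid = y - (beta0 + beta1 * x)

# Residual standard error (estimate of sigma), n - 2 degrees of freedom
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Estimated standard errors of the coefficients
sxx = np.sum((x - x.mean()) ** 2)
se_beta1 = rse / np.sqrt(sxx)
se_beta0 = rse * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)

# 95% confidence interval and two-sided t-test of H0: beta_1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_beta1 = (beta1 - t_crit * se_beta1, beta1 + t_crit * se_beta1)
t_stat = beta1 / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(ci_beta1, t_stat, p_value)
```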

Assessing the Accuracy of the Model

Quality of fit (model accuracy) is commonly assessed using the $RSE$ and the $R^2$ statistic.

Residual Standard Errors

  • The $RSE$ is a measure of the overall difference between the observed responses $y_i$ and the predicted responses $\hat{y}_i$. Thus it provides a measure of lack-of-fit of the model – higher $RSE$ indicates worse fit.

  • $RSE$ is measured in the units of $Y$, so it provides an absolute measure of lack of fit, which is sometimes difficult to interpret

The R-squared Statistic

  • The $R^2$ statistic is

$$R^2 = \frac{TSS - RSS}{TSS}$$

where $TSS = \sum_i (y_i - \overline{y})^2$ is the total sum of squares.

  • $TSS$ measures the total variability in $Y$, while $RSS$ measures the variability left after modeling $Y$ by $f(X)$. Thus, $R^2$ measures the proportion of variability in $Y$ that can be explained by the model. $R^2$ is dimensionless, so it provides a good relative measure of lack-of-fit.

  • As $R^2 \rightarrow 1$, the model explains more of the variability in $Y$. As $R^2 \rightarrow 0$, the model explains less [13]. What constitutes a good $R^2$ value depends on context.

  • We can also think of $R^2$ as a measure of the linear relationship between $Y$ and $X$. Another such measure is the correlation $\text{corr}(X, Y)$, which is estimated by the sample correlation $r$. In the case of simple linear regression, $R^2 = r^2$.
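
A quick numerical check of the identity $R^2 = r^2$ in simple regression (illustrative sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.0 - 2.0 * x + rng.normal(size=200)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = (tss - rss) / tss

r = np.corrcoef(x, y)[0, 1]   # sample correlation
print(r_squared, r ** 2)      # equal (up to floating point) in simple regression
```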

Multiple Linear Regression

  • For data $(X, Y)$, $X = (X_1, \dots, X_p)^\top \in \mathbb{R}^p$, $Y \in \mathbb{R}$, multiple linear regression models $Y$ as a linear function [14] of $X$,

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon,$$

and predicts

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p,$$

where $\hat{\beta}_i$ is the estimate of $\beta_i$.

  • If we form the $n \times (p + 1)$ matrix $\mathbf{X}$ with rows $(1, X_{i1}, \dots, X_{ip})$, response vector $Y = (Y_1, \dots, Y_n)$, parameter vector $\beta = (\beta_0, \dots, \beta_p)$ and noise vector $\epsilon = (\epsilon_1, \dots, \epsilon_n)$, then the model can be written in matrix form

$$Y = \mathbf{X}\beta + \epsilon$$

Estimating the Regression Coefficients

  • $RSS$ is defined as in simple regression, and estimates $\hat{\beta}_i$ for the parameters $\beta_i$ are chosen to minimize $RSS$ [15].

  • If the data matrix $\mathbf{X}$ has full rank, then the estimate [16] $\hat{\beta}$ of the parameter vector is

$$\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top y$$
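
A sketch of the multiple regression fit in numpy, both via the closed form above and via `np.linalg.lstsq` (simulated data; in practice a QR/lstsq-based solver is preferred to forming $\mathbf{X}^\top\mathbf{X}$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])      # intercept first
y = beta_true[0] + X @ beta_true[1:] + rng.normal(size=n)

X_design = np.column_stack([np.ones(n), X])       # n x (p + 1) design matrix

# Closed form (fine for illustration)
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Equivalent, numerically preferable
beta_hat_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat, beta_hat_lstsq)
```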

Important Questions

Is There a Relationship Between the Response and Predictors?

  • One way to answer this question is a hypothesis test

$$\begin{aligned} H_0:&\ \beta_i = 0 &\text{for all}\ 1 \leqslant i \leqslant p\\ H_a:&\ \beta_i \neq 0 &\text{for some}\ 1 \leqslant i \leqslant p \end{aligned}$$

  • The test statistic is the $F$-statistic [17] (a numerical sketch follows this list)

$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$$

    where $TSS, RSS$ are defined as in simple linear regression.

  • Assuming the model is correct, $\mathbb{E}\left(\frac{RSS}{n-p-1}\right) = \sigma^2$

    where again, $\sigma^2 = \mathbb{V}(\epsilon)$. Further assuming $H_0$ is true, $\mathbb{E}\left(\frac{TSS - RSS}{p}\right) = \sigma^2$,

    hence $H_0 \Rightarrow F \approx 1$ and $H_a \Rightarrow F > 1$ [18].

  • Another way to answer this question is a hypothesis test on a subset of the predictors of size $q$,

$$\begin{aligned} H_0:&\ \beta_{i} = 0 &\text{for all}\ p - q + 1 \leqslant i \leqslant p\\ H_a:&\ \beta_i \neq 0 &\text{for some}\ p - q + 1 \leqslant i \leqslant p \end{aligned}$$

In this case the $F$-statistic is

$$F = \frac{(RSS_0 - RSS)/q}{RSS/(n - p - 1)}$$

where $RSS_0$ is the residual sum of squares for a second model omitting the last $q$ predictors.

  • These hypothesis tests help us conclude that at least one of the predictors is related to the response (the second test narrows it down a bit), but don’t indicate which ones.
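
A sketch of the overall $F$-test on simulated data (the use of `scipy.stats.f` for the p-value is my addition, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
rss = np.sum((y - Xd @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)   # P(F_{p, n-p-1} > F)
print(F, p_value)
```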

Deciding on Important Variables

  • The task of finding which predictors are related to the response is sometimes known as variable selection [19].

  • Various statistics can be used to judge the quality of models using different subsets of the predictors. Examples are Mallows' $C_p$ criterion, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and adjusted $R^2$.

  • Since the number of distinct linear regression models grows exponentially with $p$, exhaustive search is infeasible unless $p$ is small. Common approaches that consider a smaller set of possible models are

    • Forward Selection Start with the null model $M_0$ (an intercept but no predictors). Fit $p$ simple regressions and add to the null model the one with lowest $RSS$, resulting in a new model $M_1$. Iterate until a stopping rule is reached (a minimal sketch follows this list).
    • Backward Selection Start with a model $M_p$ consisting of all predictors. Remove the variable with the largest $p$-value, resulting in a new model $M_{p-1}$. Iterate until a stopping rule is reached.
    • Mixed Selection Proceed with forward selection, but remove any predictors whose $p$-value is too large.
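
A minimal sketch of forward selection using RSS as the criterion; the stopping rule here is simply a cap on the number of variables, which is a simplification of the stopping rules discussed above:

```python
import numpy as np

def rss_of(X_cols, y):
    """RSS of an OLS fit on an intercept plus the given columns."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + X_cols)
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

def forward_selection(X, y, max_vars):
    """Greedily add the predictor that lowers RSS the most at each step."""
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining and len(selected) < max_vars:
        best = min(remaining,
                   key=lambda j: rss_of([X[:, k] for k in selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)
print(forward_selection(X, y, max_vars=2))   # expect columns 0 and 3
```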

Model Fit

  • As in simple regression, $RSE$ and $R^2$ are two common measures of model fit.

  • In multiple regression, $R^2 = \text{Corr}(Y, \hat{Y})^2$, with the same interpretation as in simple regression. The fitted model $\hat{Y}$ maximizes this correlation among all linear models.

  • $R^2$ increases monotonically in the number of predictors, but a small increase indicates that the corresponding predictor has low relative value.

  • In multiple regression

$$RSE = \sqrt{\frac{RSS}{n - p - 1}}$$

  • Visualization can be helpful in assessing model fit, e.g. by suggesting the inclusion of interaction terms

Predictions

There are three types of uncertainty associated with predicting $Y$ by $\hat{Y}$:

  • Estimation Error. $\hat{Y} = \hat{f}(X)$ is only an estimate of $f(X)$. This error is reducible. We can compute confidence intervals to quantify it.

  • Model Bias. A linear form for $f(X)$ may be inappropriate. This error is also reducible.

  • Noise. The noise term $\epsilon$ is a random variable. This error is irreducible. We can compute prediction intervals to quantify it.

Other Considerations In the Regression Model

Qualitative Predictors

  • If the $i$-th predictor $X_i$ is a factor (qualitative) with $K$ levels (that is, $K$ possible values), then we model it by $K - 1$ indicator variables (sometimes called dummy variables).

  • Two common definitions of the dummy variables are

$$\tilde{X}_{ik} = \begin{cases} 1 & X_i = k\\ 0 & X_i \neq k \end{cases} \qquad\text{or}\qquad \tilde{X}_{ik} = \begin{cases} 1 & X_i = k\\ -1 & X_i \neq k \end{cases}$$

for $1 \leqslant k \leqslant K - 1$; the remaining level $K$ is the baseline, encoded by all $K - 1$ dummies taking the value $0$ (resp. $-1$).

  • The corresponding regression model is

$$Y = \beta_0 + \sum_{k=1}^{K-1} \beta_k\tilde{X}_{ik} + \epsilon$$

Since (with the 0/1 coding) at most one dummy can equal 1 for a given observation, this model can be seen as $K$ distinct models

$$Y = \begin{cases} \beta_0 + \beta_1 & X_i = 1 \\ \beta_0 + \beta_2 & X_i = 2 \\ \vdots & \vdots \\ \beta_0 + \beta_{K-1} & X_i = K - 1 \\ \beta_0 & X_i = K \end{cases}$$
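
A small sketch of the 0/1 dummy coding for a hypothetical three-level factor (the level names are made up for illustration):

```python
import numpy as np

# Hypothetical factor with K = 3 levels
levels = np.array(["low", "medium", "high"])
factor = np.array(["low", "high", "medium", "high", "low"])

# 0/1 dummy coding with the last level ("high") as the baseline: K - 1 columns
dummies = np.column_stack([(factor == k).astype(float) for k in levels[:-1]])
print(dummies)
# Row i has a 1 in column k iff observation i is at level k; the baseline
# level corresponds to all zeros and is absorbed into the intercept.
```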

Extensions of the Linear Model

The standard linear regression we have been discussing relies on the twin assumptions

  • Additivity: The effect of $X_i$ on $Y$ is independent of the effect of $X_j$ for $j \neq i$.
  • Linearity: $Y$ is linear in $X_i$ for all $i$.

We can extend the model by relaxing these assumptions

Removing the Additive Assumption

  • Dropping the assumption of additivity leads to the possible inclusion of interaction or synergy effects among predictors.

  • One way to model an interaction effect between predictors $X_i$ and $X_j$ is to include an interaction term $\beta_{ij}X_iX_j$. The non-interaction terms $\beta_i X_i$ model the main effects.

  • We can perform hypothesis tests as in the standard linear model to select important terms/variables. However, the hierarchical principle dictates that, if we include an interaction effect, we should include the corresponding main effects, even if the latter aren’t statistically significant.
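
A sketch of fitting a model with an interaction term by adding a product column to the design matrix (simulated data, names are mine):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x1 - x2 + 3 * x1 * x2 + rng.normal(size=n)

# Design matrix with both main effects and the interaction column x1 * x2
Xd = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta_hat)   # roughly [1, 2, -1, 3]
```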

Non-linear Relationships

  • Dropping the assumption of linearity leads to the possible inclusion of non-linear effects.

  • One common way to model non-linearity is polynomial regression [20], that is, to model $f(X)$ with a polynomial in the predictors. For example, in the case of a single predictor $X$,

$$Y = \beta_0 + \beta_1 X + \dots + \beta_d X^d + \epsilon$$

models $Y$ as a degree $d$ polynomial in $X$.

  • In general, one can model a non-linear effect of the predictors $X_i$ by including a non-linear function of the $X_i$ in the model.
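
A sketch of polynomial regression as an ordinary linear fit on the columns $1, X, \dots, X^d$ (simulated data):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, size=200)
y = 1 + x - 0.5 * x ** 2 + rng.normal(scale=0.3, size=200)

d = 2
# Columns 1, x, x^2, ..., x^d -- the model is still linear in the coefficients
Xd = np.vander(x, N=d + 1, increasing=True)
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(beta_hat)   # roughly [1, 1, -0.5]
```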

Potential Problems

Non-linearity of the Data

Residual plots are a useful way of visualizing non-linearity. The presence of a discernible pattern may indicate a problem with the linearity of the model.

Correlation of Error Terms

  • Standard linear regression assumes $\text{Corr}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$.

  • Correlated error terms frequently occur in the context of time series.

  • Positively correlated error terms may display tracking behavior (adjacent residuals may have similar values).

Non-constant Variance of Error Terms

  • Standard linear regression assumes the variance of the errors is constant across observations, i.e. $\mathbb{V}(\epsilon_i) = \sigma^2$ for all $1 \leqslant i \leqslant n$.

  • Heteroscedasticity, or variance which changes across observations, can be identified by a funnel shape in the residual plot.

  • One way to reduce heteroscedasticity is to transform $Y$ by a concave function such as $\log Y$ or $\sqrt{Y}$.

  • Another way is weighted least squares, which weights the terms in $RSS$ with weights $w_i$ inversely proportional to $\sigma_i^2 = \mathbb{V}(\epsilon_i)$.
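
A sketch of weighted least squares, solving $(\mathbf{X}^\top W \mathbf{X})\beta = \mathbf{X}^\top W y$ with weights $w_i \propto 1/\sigma_i^2$; here the $\sigma_i$ are known by construction, whereas in practice they must be estimated:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x = rng.uniform(1, 10, size=n)
sigma_i = 0.2 * x                      # error scale grows with x (heteroscedastic)
y = 2 + 3 * x + rng.normal(scale=sigma_i)

Xd = np.column_stack([np.ones(n), x])
w = 1.0 / sigma_i ** 2                 # weights inversely proportional to Var(eps_i)

# Weighted least squares: solve (X^T W X) beta = X^T W y
XtW = Xd.T * w                         # scales each observation's column of X^T by w_i
beta_wls = np.linalg.solve(XtW @ Xd, XtW @ y)
print(beta_wls)                        # roughly [2, 3]
```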

Outliers

  • An outlier is an observation for which the value of $y_i$ given $x_i$ is unusual, i.e. such that the squared error $(y_i - \hat{y}_i)^2$ is large.

  • Outliers can have disproportionate effects on statistics, e.g. $R^2$, which in turn affect the entire analysis (e.g. confidence intervals, hypothesis tests).

  • Residual plots can identify outliers. In practice, we plot studentized residuals

$$\frac{\hat{\epsilon}_i}{\hat{\mathbf{se}}(\hat{\epsilon}_i)}$$

  • If an outlier is due to a data collection error it can be removed, but great care should be taken when doing this.

High Leverage Points

  • A high leverage point is a point with an unusual value of $x_i$.

  • High leverage points tend to have a sizable impact on $\hat{f}$.

  • To quantify the leverage of $x_i$, we use the leverage statistic. In simple linear regression this is

$$h_i = \frac{1}{n} + \frac{(x_i - \overline{x})^2}{\sum_{j} (x_{j} - \overline{x})^2}$$
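
A sketch computing leverages from the diagonal of the hat matrix $H = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$, together with the studentized residuals from the previous subsection (simulated data with one high-leverage point):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
x = rng.normal(size=n)
x[0] = 6.0                              # one unusually large x value
Xd = np.column_stack([np.ones(n), x])
y = 1 + 2 * x + rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal gives the leverages h_i
H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)
h = np.diag(H)

beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta_hat
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))

# (Internally) studentized residuals
stud = resid / (rse * np.sqrt(1 - h))

print(h[0], h.mean())   # h[0] is large; the average leverage is (p + 1) / n
print(stud[:5])
```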

Collinearity

  • Collinearity is a linear relationship among two or more predictors.

  • Collinearity reduces the accuracy of the coefficient estimates [21].

  • Collinearity reduces the power [22] of the hypothesis test.

  • Collinearity between two variables can be detected by the sample correlation matrix $\hat{\Sigma}$. A high value of $|(\hat{\Sigma})_{ij}| = |\widehat{\text{corr}}(X_i, X_j)|$ indicates high correlation between $X_i, X_j$, hence high collinearity in the data [23].

  • Multicollinearity is a linear relationship among more than two predictors.

  • Multicollinearity can be detected using the variance inflation factor (VIF) [24] (a sketch follows this list):

$$VIF(\hat{\beta}_i) = \frac{1}{1 - R^2_{X_i|X_{-i}}}$$

where $R^2_{X_i|X_{-i}}$ is the $R^2$ from a regression of $X_i$ onto all the other predictors.

  • One solution to the presence of collinearity is to drop one of the problematic variables, which is usually not a great loss, since the collinearity means the dropped variable's information is largely redundant.

  • Another solution is to combine the problematic variables into a single predictor (e.g. an average)
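
A sketch of computing VIFs directly from the definition, by regressing each predictor on the others (simulated data with one nearly collinear pair):

```python
import numpy as np

def r_squared(X_others, target):
    """R^2 from regressing `target` on an intercept plus the other predictors."""
    n = len(target)
    Xd = np.column_stack([np.ones(n), X_others])
    beta, *_ = np.linalg.lstsq(Xd, target, rcond=None)
    rss = np.sum((target - Xd @ beta) ** 2)
    tss = np.sum((target - target.mean()) ** 2)
    return 1 - rss / tss

def vif(X):
    """Variance inflation factor for each column of X."""
    p = X.shape[1]
    return np.array([1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))
                     for j in range(p)])

rng = np.random.default_rng(10)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)      # nearly collinear with x1
X = np.column_stack([x1, x2, x3])
print(vif(X))                           # large values for columns 1 and 3
```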

The Marketing Plan

Skip

Comparison of Linear Regression and K-Nearest Neighbors

  • Linear regression is a parametric model for regression (with parameter $\beta = (\beta_0, \dots, \beta_p)$).

  • KNN regression is a popular non-parametric model, which estimates (see the sketch after this list)

$$\hat{f}(x_0) = \frac{1}{K}\sum_{x_i \in \mathcal{N}_0} y_i$$

where $\mathcal{N}_0$ is the set of the $K$ training observations nearest to $x_0$.

  • In general, a parametric model will outperform a non-parametric model if the parametric estimate $\hat{f}$ is close to the true $f$.

  • KNN regression suffers from the curse of dimensionality – as the dimension increases the data become sparse. Effectively this is a reduction in sample size, hence KNN performance commonly decreases as the dimension $p$ increases.

  • In general parametric methods outperform non-parametric methods when there is a small number of observations per predictor.

  • Even if performance of KNN and linear regression is comparable, the latter may be favored for interpretability.
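
A minimal sketch of KNN regression at a single query point (plain numpy; in practice a library implementation such as scikit-learn's `KNeighborsRegressor` would normally be used):

```python
import numpy as np

def knn_regress(X_train, y_train, x0, K):
    """Average the responses of the K training points nearest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:K]
    return y_train[nearest].mean()

rng = np.random.default_rng(11)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=200)

x0 = np.array([1.0])
print(knn_regress(X_train, y_train, x0, K=5), np.sin(1.0))
```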


Footnotes

  7. The value $(\hat{\beta}_0, \hat{\beta}_1)$ is the global minimum in $\mathbb{R}^2$ of the “loss function” given by $RSS$, which is a convex quadratic in the parameters.

  8. Here estimate means the same as “estimator”, found elsewhere in the statistics literature. The population regression line is given by the “true” (population) values $(\beta_0, \beta_1)$ of the parameter, while the least squares line is given by the estimator $(\hat{\beta}_0, \hat{\beta}_1)$.

  9. In other words, $\mathbb{E}\left((\hat{\beta}_0, \hat{\beta}_1)\right) = (\beta_0, \beta_1)$.

  10. The factor $\frac{1}{n-2}$ is a correction that makes $\frac{RSS}{n-2}$ an unbiased estimator of $\sigma^2$; the quantity $n - 2$ is known as the “degrees of freedom”. Note this is a special case of the $n - p - 1$ degrees of freedom for $p$ predictors, with $p = 1$.

  11. This appears to be based on the fact (proved in the literature) that the least squares estimators are approximately $t$-distributed, $\hat{\beta}_i \approx \text{Student}_{n-2}(\beta_i, \hat{\mathbf{se}}(\hat{\beta}_i))$; the approximation is exact when the errors are normally distributed.

  12. This is the Wald test for the statistic $T$, which (by the previous footnote) has $T \approx \text{Student}_{n - 2}(0, 1)$.

  13. This can happen if either the model is wrong (i.e. a linear form for $f(X)$ isn't a good choice) or because $\mathbb{V}(\epsilon)$ is large.

  14. This estimation method is known as Ordinary Least Squares (OLS). The estimate is the solution to the quadratic minimization problem

$$\hat{\beta} = \underset{\beta}{\text{argmin}\,} \lVert y - \mathbf{X}\beta \rVert^2$$

  15. The estimate is any solution to the quadratic minimization problem

$$\hat{\beta} = \underset{\beta}{\text{argmin}\,} \lVert y - \mathbf{X}\beta \rVert^2$$

which can be found by solving the normal equations

$$\mathbf{X}^\top\mathbf{X}\hat{\beta} = \mathbf{X}^\top y$$

  16. If $\mathbf{X}$ has full rank then $\mathbf{X}^\top\mathbf{X}$ is invertible and the normal equations have a unique solution.

  17. Assuming the errors are normally distributed, $\epsilon_i \sim N(0, \sigma^2)$ (equivalently $Y_i \sim N(\mu_i, \sigma^2)$ with $\mu_i = \beta_0 + \sum_j \beta_j X_{ij}$), the $F$-statistic has an $F$-distribution with $p$ and $n - p - 1$ degrees of freedom ($F$ has this distribution approximately even without the normality assumption, for large $n$).

The use of the $F$-statistic arises from ANOVA among the predictors, which is beyond our scope. There is some qualitative discussion of the motivation for the $F$-statistic on page 77 of the text. It is an appropriate statistic even when $p$ is large, since it adjusts for the number of predictors, whereas inspecting the individual $p$-values does not.

  18. How much greater than 1 $F$ should be before we reject $H_0$ depends on $n$ and $p$. If $n$ is large, $F$ need not be much greater than 1; if $n$ is small, a larger $F$ is needed to reject $H_0$.

  19. This is discussed extensively in chapter 6.

  20. This is discussed in chapter 7.

  21. This is due to issues identifying the global minimum of $RSS$. In the example in the text, in the presence of collinearity, the global minimum lies in a long “valley”. The coefficient estimates are very sensitive to the data – small changes in the data yield large changes in the estimates.

  22. The power of the test is the probability of correctly rejecting $H_0: \beta_i = 0$, i.e. correctly accepting $H_a: \beta_i \neq 0$. Since collinearity increases the uncertainty of the coefficient estimates, it increases $\hat{\mathbf{se}}(\hat{\beta}_i)$, hence reduces the $t$-statistic, making it less likely that $H_0$ is rejected.

  23. However, the converse is not true – the absence of such entries in the sample correlation matrix doesn't indicate the absence of collinearity. The matrix only detects pairwise correlation, and a predictor may be highly correlated with a combination of two or more other predictors (multicollinearity).

  24. This is defined as the ratio of the (sample) variance of $\hat{\beta}_i$ when fitting the full model to the variance of $\hat{\beta}_i$ when $X_i$ is fit on its own. It can be computed using the given formula.