and predicts
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,$$
where $\hat{\beta}_i$ is the estimate for $\beta_i$.
Estimates of the coefficients arise from minimizing the residual sum of squares
$$\mathrm{RSS} = \sum_{i=1}^n \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2.$$
Using calculus one finds the estimates7
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
These are sometimes called the least squares estimates.
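As a concrete illustration, here is a minimal numpy sketch of these closed-form estimates on simulated data (the function name and the example data are mine, not from the text):

```python
import numpy as np

def simple_least_squares(x, y):
    """Least squares estimates for y ~ beta0 + beta1 * x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

# Example: noisy data generated from y = 1 + 2x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)
b0, b1 = simple_least_squares(x, y)
print(b0, b1)  # should be close to 1 and 2
```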
The population regression line8 is the line given by $y = \beta_0 + \beta_1 x$ and the least squares regression line is the line given by $y = \hat{\beta}_0 + \hat{\beta}_1 x$.
The least squares estimate $\hat{\beta}_i$ is an unbiased estimator of $\beta_i$.9
Assuming errors are uncorrelated with common variance $\sigma^2$, the standard errors of $\hat{\beta}_0, \hat{\beta}_1$ are
$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$
To test $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$ at level $\alpha$, the rejection region is
$$|t| > t_{1 - \alpha/2,\, n-2},$$
where $t = \hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$ is the test statistic,12 which has a $t$-distribution with $n - 2$ degrees of freedom under $H_0$.
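A minimal sketch of this t-test using the formulas above, assuming uncorrelated errors with common variance (names are mine; scipy is used only for the t-distribution tail probability):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """t-test of H0: beta1 = 0 in simple linear regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    x_bar = x.mean()
    b1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
    b0 = y.mean() - b1 * x_bar
    resid = y - (b0 + b1 * x)
    sigma2_hat = np.sum(resid ** 2) / (n - 2)          # RSE^2
    se_b1 = np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))
    t_stat = b1 / se_b1
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value
    return t_stat, p_value
```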
Quality of fit (model accuracy) is commonly assessed using the residual standard error (RSE) and the $R^2$ statistic.
The RSE,
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}},$$
is a measure of the overall difference between the observed responses $y_i$ and the predicted responses $\hat{y}_i$. Thus it provides a measure of lack-of-fit of the model: higher RSE indicates worse fit.
RSE is measured in the units of $Y$, so it provides an absolute measure of lack of fit, which is sometimes difficult to interpret.
The $R^2$ statistic is
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},$$
where $\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2$ is the total sum of squares.
TSS measures the total variability in $Y$, while RSS measures the variability left after modeling $Y$ by $X$. Thus, $R^2$ measures the proportion of variability in $Y$ that can be explained by the model. $R^2$ is dimensionless, so it provides a good relative measure of lack-of-fit.
As $R^2 \to 1$, the model explains more of the variability in $Y$. As $R^2 \to 0$, the model explains less.13 What constitutes a good $R^2$ value depends on context.
We can also think of $R^2$ as a measure of the linear relationship between $X$ and $Y$. Another such measure is the correlation $\mathrm{Cor}(X, Y)$, which is estimated by the sample correlation $r$. In the case of simple linear regression, $R^2 = r^2$.
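The following sketch computes RSE and $R^2$ from their definitions and checks the identity $R^2 = r^2$ numerically (function name is mine):

```python
import numpy as np

def fit_metrics(x, y):
    """RSE and R^2 for simple linear regression; checks R^2 == Cor(x, y)^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    rss = np.sum((y - (b0 + b1 * x)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    rse = np.sqrt(rss / (n - 2))
    r2 = 1 - rss / tss
    r = np.corrcoef(x, y)[0, 1]       # sample correlation
    assert np.isclose(r2, r ** 2)     # R^2 = r^2 in simple regression
    return rse, r2
```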
For data $(X, Y)$ with $X = (X_1, \dots, X_p)$, multiple linear regression models $Y$ as a linear function14 of the $X_j$,
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon,$$
and predicts
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p,$$
where $\hat{\beta}_j$ is the estimate of $\beta_j$.
If we form the data matrix $X$ with rows $(1, x_{i1}, \dots, x_{ip})$, response vector $y = (y_1, \dots, y_n)^\top$, parameter vector $\beta = (\beta_0, \dots, \beta_p)^\top$ and noise vector $\epsilon = (\epsilon_1, \dots, \epsilon_n)^\top$, then the model can be written in matrix form
$$y = X\beta + \epsilon.$$
RSS is defined as $\mathrm{RSS} = \|y - X\hat{\beta}\|^2$ and estimates for the parameters are chosen to minimize RSS,15 as in the case of simple regression.
If the data matrix $X$ has full rank, then the least squares estimate16 for the parameter vector is
$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$
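A small numpy sketch of the matrix-form estimate; it solves the normal equations rather than forming the inverse explicitly, which is a numerical-stability choice of mine rather than anything in the text:

```python
import numpy as np

def multiple_least_squares(X, y):
    """beta_hat = (X^T X)^{-1} X^T y, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    # Solve the normal equations X^T X beta = X^T y instead of inverting X^T X
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return beta_hat
```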
To test $H_0: \beta_1 = \cdots = \beta_p = 0$ against the alternative that at least one $\beta_j$ is non-zero, the test statistic is the $F$-statistic17
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)},$$
where TSS and RSS are defined as in simple linear regression.
Assuming the model is correct,
$$E\!\left[\frac{\mathrm{RSS}}{n - p - 1}\right] = \sigma^2,$$
where again $\sigma^2 = \mathrm{Var}(\epsilon)$. Further assuming $H_0$ is true,
$$E\!\left[\frac{\mathrm{TSS} - \mathrm{RSS}}{p}\right] = \sigma^2,$$
hence $F \approx 1$ when $H_0$ holds and $F > 1$ when the alternative holds.18
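A sketch of this overall F-test from the formulas above (assumes a full-rank design matrix; the function name is mine):

```python
import numpy as np
from scipy import stats

def overall_f_test(X, y):
    """F-test of H0: beta_1 = ... = beta_p = 0 via the TSS/RSS formula."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
    rss = np.sum((y - Xd @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
    p_value = stats.f.sf(f_stat, p, n - p - 1)   # upper-tail probability
    return f_stat, p_value
```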
Another way to answer this question is a hypothesis test on a subset of the predictors of size $q$,
$$H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_p = 0.$$
Let $\mathrm{RSS}_0$ be the residual sum of squares for a second model omitting the last $q$ predictors. The $F$-statistic is
$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)},$$
where RSS is the residual sum of squares for the full model.
The task of finding which predictors are related to the response is sometimes known as variable selection.19
Various statistics can be used to judge the quality of models using different subsets of the predictors. Examples are Mallow's $C_p$ criterion, the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and adjusted $R^2$.
Since the number of distinct linear regression models ($2^p$) grows exponentially with $p$, exhaustive search is infeasible unless $p$ is small. Common approaches that consider a smaller set of possible models are forward selection, backward selection and mixed selection; a sketch of forward selection follows.
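For concreteness, a minimal sketch of greedy forward selection that adds, at each step, the predictor giving the lowest RSS (names are mine; other per-step criteria, such as p-values, could be used instead):

```python
import numpy as np

def forward_selection(X, y, max_vars=None):
    """Greedy forward selection: repeatedly add the predictor that most lowers RSS."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    remaining, chosen = list(range(p)), []
    max_vars = p if max_vars is None else max_vars

    def rss(cols):
        Xd = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        return np.sum((y - Xd @ beta) ** 2)

    while remaining and len(chosen) < max_vars:
        best = min(remaining, key=lambda c: rss(chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen  # predictor indices in the order they were added
```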
As in simple regression, RSE and $R^2$ are two common measures of model fit.
In multiple regression, $R^2 = \mathrm{Cor}(Y, \hat{Y})^2$, with the same interpretation as in simple regression. The fitted model maximizes $\mathrm{Cor}(Y, \hat{Y})$ among all linear models.
$R^2$ increases monotonically in the number of predictors, but small increases indicate the low relative value of the corresponding predictor.
In multiple regression,
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}}.$$
There are 3 types of uncertainty associated with predicting $Y$ by $\hat{Y}$:
Estimation Error. The least squares coefficients $\hat{\beta}$ are only an estimate of $\beta$, so the fitted plane is only an estimate of the population regression plane. This error is reducible. We can compute confidence intervals to quantify it.
Model Bias. A linear form for $f(X)$ may be inappropriate. This error is also reducible.
Noise. The noise term $\epsilon$ is a random variable. This error is irreducible. We can compute prediction intervals to quantify it; a sketch computing both interval types follows this list.
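A numpy sketch of the standard formulas for a confidence interval for the mean response and a prediction interval for a new observation at a point $x_0$ (names and the 95% default level are mine):

```python
import numpy as np
from scipy import stats

def mean_ci_and_pred_interval(X, y, x0, alpha=0.05):
    """Confidence interval for E[Y | x0] and prediction interval for Y at x0."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    x0d = np.concatenate([[1.0], np.asarray(x0, float)])
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta_hat = XtX_inv @ Xd.T @ y
    rss = np.sum((y - Xd @ beta_hat) ** 2)
    s2 = rss / (n - p - 1)                              # estimate of sigma^2
    y0_hat = x0d @ beta_hat
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - p - 1)
    se_mean = np.sqrt(s2 * (x0d @ XtX_inv @ x0d))       # uncertainty in the fit
    se_pred = np.sqrt(s2 * (1 + x0d @ XtX_inv @ x0d))   # fit + irreducible noise
    ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
    pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
    return ci, pi
```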
If the $j$-th predictor is a factor (qualitative) with $k$ levels (that is, $k$ possible values) then we model it by $k - 1$ indicator variables (sometimes called dummy variables).
Two common definitions of the dummy variables are the 0/1 indicator coding
$$x_{i\ell} = \begin{cases} 1 & \text{if the } i\text{-th observation is in level } \ell, \\ 0 & \text{otherwise,} \end{cases}$$
for $\ell = 1, \dots, k - 1$, and a $\pm 1$ coding in which the two values are $1$ and $-1$ rather than $1$ and $0$.
Since we can only have $x_{i\ell} = 1$ when $x_{i\ell'} = 0$ for all $\ell' \neq \ell$, this model can be seen as $k$ distinct models, one per level of the factor.
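A minimal sketch of the 0/1 dummy coding with the last level as baseline (function and example are mine):

```python
import numpy as np

def dummy_encode(levels, codes):
    """0/1 dummy encoding with k-1 columns; the last level is the baseline."""
    codes = np.asarray(codes)
    # One column per non-baseline level: 1 if the observation is in that level
    return np.column_stack([(codes == lvl).astype(float) for lvl in levels[:-1]])

# Example with a 3-level factor; "C" acts as the baseline
levels = ["A", "B", "C"]
codes = np.array(["A", "C", "B", "A", "C"])
print(dummy_encode(levels, codes))
```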
The standard linear regression model we have been discussing relies on the twin assumptions that the relationship between predictors and response is additive and linear.
We can extend the model by relaxing these assumptions.
Dropping the assumption of additivity leads to the possible inclusion of interaction or synergy effects among predictors.
One way to model an interaction effect between predictors $X_j$ and $X_k$ is to include an interaction term, $X_j X_k$, with its own coefficient. The non-interaction terms model the main effects.
We can perform hypothesis tests as in the standard linear model to select important terms/variables. However, the hierarchical principle dictates that, if we include an interaction effect, we should include the corresponding main effects, even if the latter aren’t statistically significant.
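For illustration, a sketch of a design matrix containing two main effects and their interaction term (layout and names are mine):

```python
import numpy as np

def design_with_interaction(x1, x2):
    """Design matrix with an intercept, main effects x1, x2, and interaction x1*x2."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    return np.column_stack([np.ones(len(x1)), x1, x2, x1 * x2])
```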
Dropping the assumption of linearity leads to the possible inclusion of non-linear effects.
One common way to model non-linearity is to use polynomial regression,20 that is, to model the response with a polynomial in the predictors. For example, in the case of a single predictor $X$,
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_d X^d + \epsilon$$
models $Y$ as a degree-$d$ polynomial in $X$.
In general, one can model a non-linear effect of the predictors by including a non-linear function of the $X_j$ in the model; the model remains linear in the coefficients, so least squares still applies.
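A sketch of polynomial regression for a single predictor; the design matrix holds powers of $x$, and the fit is still ordinary least squares (names are mine):

```python
import numpy as np

def polynomial_fit(x, y, degree):
    """Least squares fit of y on 1, x, x^2, ..., x^degree."""
    # Columns are x^0, x^1, ..., x^degree; the model stays linear in the coefficients
    Xd = np.vander(np.asarray(x, float), N=degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta  # beta[j] multiplies x^j
```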
Residual plots are a useful way of visualizing non-linearity. The presence of a discernible pattern may indicate a problem with the linearity of the model.
Standard linear regression assumes the errors are uncorrelated, i.e. $\mathrm{Cor}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$.
Correlated error terms frequently occur in the context of time series.
Positively correlated error terms may display tracking behavior (adjacent residuals may have similar values).
Standard linear regression assumes the variance of errors is constant across observations, i.e. $\mathrm{Var}(\epsilon_i) = \sigma^2$ for all $i$.
Heteroscedasticity, or variance which changes across observations, can be identified by a funnel shape in the residual plot.
One way to reduce heteroscedasticity is to transform the response $Y$ by a concave function such as $\log Y$ or $\sqrt{Y}$.
Another way to do this is weighted least squares, which weights the terms in RSS with weights $w_i$ inversely proportional to the error variances $\sigma_i^2 = \mathrm{Var}(\epsilon_i)$.
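A sketch of weighted least squares via row rescaling, assuming the weights $w_i$ are given (names are mine):

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimize sum_i w_i * (y_i - x_i beta)^2 by rescaling rows by sqrt(w_i)."""
    X, y, w = np.asarray(X, float), np.asarray(y, float), np.asarray(w, float)
    Xd = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    # Scaling each row by sqrt(w_i) turns weighted least squares into ordinary OLS
    beta, *_ = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)
    return beta
```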
An outlier is an observation for which the value of $y_i$ given $x_i$ is unusual, i.e. such that the squared error $(y_i - \hat{y}_i)^2$ is large.
Outliers can have disproportionate effects on statistics, e.g. the RSE and $R^2$, which in turn affect the entire analysis (e.g. confidence intervals, hypothesis tests).
Residual plots can identify outliers. In practice, we plot studentized residuals, i.e. each residual divided by its estimated standard error; observations whose studentized residuals are large in absolute value are possible outliers.
A high leverage point is a point with an unusual value of $x_i$.
High leverage points tend to have a sizable impact on the least squares fit $\hat{\beta}$.
To quantify the leverage of $x_i$, we use the leverage statistic $h_i$. In simple linear regression this is
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^n (x_{i'} - \bar{x})^2}.$$
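A sketch computing leverages and (internally) studentized residuals; in multiple regression the leverage $h_i$ is the corresponding diagonal entry of the hat matrix (names are mine):

```python
import numpy as np

def leverage_and_studentized_residuals(X, y):
    """Leverage h_i = [H]_ii with H = X (X^T X)^{-1} X^T, and internally
    studentized residuals e_i / (RSE * sqrt(1 - h_i))."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T      # hat matrix
    h = np.diag(H)
    resid = y - H @ y
    rse = np.sqrt(np.sum(resid ** 2) / (n - p - 1))
    studentized = resid / (rse * np.sqrt(1 - h))
    return h, studentized
```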
Collinearity is a linear relationship among two or more predictors.
Collinearity reduces the accuracy of coefficient estimates.21
Collinearity reduces the power22 of the hypothesis test $H_0: \beta_j = 0$, i.e. the probability of correctly detecting a non-zero coefficient.
Collinearity between two variables can be detected in the sample correlation matrix: a high absolute value of an off-diagonal entry indicates high correlation between the corresponding pair of predictors, hence high collinearity in the data.23
Multicollinearity is a linear relationship among more than two predictors.
Multicollinearity can be detected using the variance inflation factor (VIF),24
$$\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}},$$
where $R^2_{X_j \mid X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all other predictors.
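A sketch of the VIF computed directly from its definition, one auxiliary regression per predictor (names are mine):

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing X_j on the rest."""
    X = np.asarray(X, float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        rss = np.sum((target - others @ beta) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        r2_j = 1 - rss / tss
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)
```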
One solution to the presence of collinearity is to drop one of the problematic variables. This usually sacrifices little, since the high correlation means the dropped variable's information is largely redundant.
Another solution is to combine the problematic variables into a single predictor (e.g. an average).
Skip
Linear regression is a parametric model for regression (with parameter $\beta$).
KNN regression is a popular non-parametric model, which estimates $f(x_0)$ by averaging the responses of the $K$ training observations nearest to $x_0$:
$$\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i,$$
where $\mathcal{N}_0$ is the set of those $K$ nearest observations.
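A minimal sketch of KNN regression at a single query point using Euclidean distance (names and the distance choice are mine):

```python
import numpy as np

def knn_regress(X_train, y_train, x0, k):
    """Predict f(x0) as the average response of the k nearest training points."""
    X_train = np.asarray(X_train, float)
    dists = np.linalg.norm(X_train - np.asarray(x0, float), axis=1)
    nearest = np.argsort(dists)[:k]
    return np.mean(np.asarray(y_train, float)[nearest])
```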
In general, a parametric model will outperform a non-parametric model if the parametric estimate is close to the true $f$.
KNN regression suffers from the curse of dimensionality: as the dimension increases the data become sparse. Effectively this is a reduction in sample size, hence KNN performance commonly decreases as the dimension increases.
In general parametric methods outperform non-parametric methods when there is a small number of observations per predictor.
Even if performance of KNN and linear regression is comparable, the latter may be favored for interpretability.
which can be found by solving the normal equations $X^\top X \hat{\beta} = X^\top y$.
The use of the $F$-statistic arises from ANOVA among the predictors, which is beyond our scope. There is some qualitative discussion of the motivation for the $F$-statistic on page 77 of the text. It is an appropriate statistic when $p$ is relatively small compared to $n$. ↩