2. Statistical Learning
What is Statistical Learning?
Given paired data $(X, Y)$, assume a relationship between $X$ and $Y$ modeled by
$$Y = f(X) + \epsilon$$
where $f: \mathbb{R}^p \to \mathbb{R}$ is a function and $\epsilon$ is a random error term with $E(\epsilon) = 0$.
Statistical learning is a set of approaches for estimating $f$.
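As a concrete illustration, here is a minimal simulation of data from this model; the particular $f$ and noise level are assumptions for the sake of example:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """A toy choice of f (assumed, not from the text)."""
    return np.sin(2 * x)

n = 100
X = rng.uniform(0, 3, size=n)       # predictors
eps = rng.normal(0, 0.3, size=n)    # random error with E(eps) = 0
Y = f(X) + eps                      # responses from the model Y = f(X) + eps
```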
Why Estimate $f$?
Prediction
- We may want to predict the output $Y$ from an estimate $\hat{f}$ of $f$. The predicted value for a given $X$ is then $\hat{Y} = \hat{f}(X)$. In prediction, we often treat $f$ as a black box.
- The mean squared error $\mathrm{MSE}(\hat{Y}) = E\left[(Y - \hat{Y})^2\right]$ is a good measure of the accuracy of $\hat{Y}$ as a predictor for $Y$.
- One can write
$$\mathrm{MSE}(\hat{Y}) = \left(f(X) - \hat{f}(X)\right)^2 + V(\epsilon)$$
These two terms are known as the reducible error and the irreducible error, respectively; a quick numerical check of this decomposition follows below.
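A Monte Carlo check of the decomposition at a fixed $x$, reusing the toy $f$ above and assuming a deliberately imperfect estimate $\hat{f}$ (both are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.sin(2 * x)            # true f (assumed)
f_hat = lambda x: np.sin(2 * x) + 0.2  # imperfect estimate, off by 0.2 (assumed)

x, sigma = 1.0, 0.3
Y = f(x) + rng.normal(0, sigma, size=100_000)  # repeated draws of Y at X = x

mse = np.mean((Y - f_hat(x)) ** 2)
reducible = (f(x) - f_hat(x)) ** 2             # (f(x) - f_hat(x))^2 = 0.04
irreducible = sigma ** 2                       # V(eps) = 0.09
print(mse, reducible + irreducible)            # both close to 0.13
```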
Inference
- Instead of predicting $Y$ from $X$, we may be more interested in how $Y$ changes as a function of $X$. In inference, we usually do not treat $f$ as a black box.
Examples of important inference questions:
- Which predictors have the largest influence on the response?
- What is the relationship between the response and each predictor?
- Is $f$ linear or non-linear?
How to Estimate $f$?
Parametric methods
Steps for a parametric method:
- Assume a parametric model for $f$, that is, assume a specific functional form
$$f = f(X, \beta)$$
for some vector of parameters $\beta = (\beta_1, \dots, \beta_p)^T$. For example, the linear model
$$f(X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$
is a parametric model; fitting it is linear regression.
- Use the training data to fit or train the model, that is, to choose the $\beta_i$ such that
$$Y \approx f(X, \beta)$$
A least-squares sketch of this step follows below.
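For instance, the linear model above can be fit by ordinary least squares. A minimal numpy sketch, where the simulated data and true coefficients are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated training data from a linear model (coefficients assumed for illustration).
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])        # (beta_0, beta_1, ..., beta_p)
Y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.5, size=n)

# Fit: choose beta minimizing the sum of squared errors sum_i (y_i - f(x_i, beta))^2.
X1 = np.column_stack([np.ones(n), X])              # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
print(beta_hat)                                    # close to beta_true
```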
Non-parametric methods
These methods make no explicit assumptions about the functional form of $f$; instead, they seek an estimate of $f$ that gets as close to the training data as possible. One such method is sketched below.
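A minimal sketch of one non-parametric method, $K$-nearest-neighbors regression, which estimates $f(x_0)$ by averaging the responses near $x_0$ (the method choice and toy data are assumptions for illustration):

```python
import numpy as np

def knn_regress(x0, X, Y, k=5):
    """Estimate f(x0) by averaging the responses of the k training points nearest to x0."""
    idx = np.argsort(np.abs(X - x0))[:k]  # indices of the k nearest neighbors
    return Y[idx].mean()

# Toy 1-D data (assumed): Y = sin(2X) + noise.
rng = np.random.default_rng(3)
X = rng.uniform(0, 3, size=100)
Y = np.sin(2 * X) + rng.normal(0, 0.3, size=100)
print(knn_regress(1.0, X, Y))             # roughly sin(2) ~ 0.91
```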
Accuracy vs. Interpretability
- In inference, generally speaking, the more flexible the method, the less interpretable it is.
- In prediction, more flexible methods can achieve higher accuracy, but not always: highly flexible methods risk overfitting the training data (see the bias-variance tradeoff below).
Supervised vs. Unsupervised Learning
- In supervised learning, the training data consists of pairs $(X, Y)$, where $X$ is a vector of predictors and $Y$ a response. Prediction and inference are supervised learning problems; the response variable (or the relationship between the response and the predictors) supervises the analysis.
- In unsupervised learning, the training data lacks a response variable.
Regression vs. Classification
- Problems with a quantitative response ($Y \in S \subseteq \mathbb{R}$) tend to be called regression problems.
- Problems with a qualitative, or categorical, response ($Y \in \{y_1, \dots, y_K\}$) tend to be called classification problems.
Assessing Model Accuracy
There is no free lunch in statistics: no one method dominates all others over all possible data sets.
Measuring Quality of Fit
- To evaluate the performance of a method on a data set, we need to measure model accuracy, that is, how well predictions match the observed data.
- In regression, the most common measure is the mean squared error
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{f}(x_i)\right)^2$$
where $y_i$ and $\hat{f}(x_i)$ are the $i$-th true and predicted responses, respectively.
- We are usually not interested in minimizing the MSE with respect to the training data but rather with respect to test data.
- There is no guarantee that a low training MSE will translate to a low test MSE.
- Having a low training MSE but a high test MSE is called overfitting; the sketch below demonstrates it.
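A minimal demonstration, assuming simulated data and polynomial fits of increasing flexibility (both assumptions for illustration); typically the training MSE keeps falling as the degree grows, while the test MSE eventually rises:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)  # true f (assumed)

def sample(n):
    """Draw n observations from Y = f(X) + eps."""
    x = rng.uniform(0, 3, size=n)
    return x, f(x) + rng.normal(0, 0.3, size=n)

x_tr, y_tr = sample(50)      # training set
x_te, y_te = sample(1000)    # test set

for degree in (1, 3, 10, 20):
    coeffs = np.polyfit(x_tr, y_tr, degree)  # flexibility grows with degree
    mse_tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
    print(degree, round(mse_tr, 3), round(mse_te, 3))
```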
The Bias-Variance Tradeoff
- For a given $x_0$, the expected MSE can be written
$$E\left[(y_0 - \hat{f}(x_0))^2\right] = \left(E[\hat{f}(x_0)] - f(x_0)\right)^2 + E\left[\left(\hat{f}(x_0) - E[\hat{f}(x_0)]\right)^2\right] + E\left[(\epsilon - E[\epsilon])^2\right] = \mathrm{bias}^2\left(\hat{f}(x_0)\right) + V\left(\hat{f}(x_0)\right) + V(\epsilon)$$
- A good method minimizes variance and bias simultaneously.
- As a general rule, these quantities move in opposite directions: more flexible methods have lower bias but higher variance, while less flexible methods have the opposite. This is the bias-variance tradeoff; the simulation below illustrates it.
- In practice the MSE, variance, and bias cannot be calculated exactly, but one must keep the bias-variance tradeoff in mind.
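When $f$ is known, the bias and variance of $\hat{f}(x_0)$ can be estimated by refitting on many simulated training sets. A sketch, reusing the toy $f$ and polynomial fits assumed above:

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * x)   # true f (assumed)
x0, sigma = 1.0, 0.3

for degree in (1, 3, 10):
    preds = []
    for _ in range(500):                           # 500 independent training sets
        x = rng.uniform(0, 3, size=50)
        y = f(x) + rng.normal(0, sigma, size=50)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2            # squared bias of f_hat(x0)
    var = preds.var()                              # variance of f_hat(x0)
    print(degree, round(bias2, 5), round(var, 5))  # bias falls, variance grows
```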
The Classification Setting
- In the classification setting, the most common measure of model accuracy is the error rate
$$\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$$
where $I$ is the indicator function and $\hat{y}_i$ is the predicted class for the $i$-th observation.
- As in the regression setting, we are interested in minimizing the test error rate, not the training error rate.
The Bayes Classifier
- Given $K$ classes, the Bayes classifier predicts
$$\hat{y}_0 = \operatorname*{argmax}_{1 \leq j \leq K} P(Y = j \mid X = x_0)$$
- The set of points
$$\left\{x_0 \in \mathbb{R}^p \;\middle|\; P(Y = j \mid X = x_0) = P(Y = k \mid X = x_0) \text{ for some } j \neq k\right\}$$
is called the Bayes decision boundary.
- The test error rate of the Bayes classifier is the Bayes error rate, which is minimal among classifiers. It is given by
$$1 - E\left[\max_{j} P(Y = j \mid X)\right]$$
- The Bayes classifier is optimal, but in practice we do not know $P(Y \mid X)$; the simulation below shows it in a setting where we do.
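In a simulation where $P(Y \mid X)$ is known, both the classifier and its error rate can be computed directly. A sketch, assuming two Gaussian classes with equal priors (an assumption for illustration):

```python
import numpy as np
from scipy.stats import norm

# Assumed setting: two classes with equal priors,
# X | Y=1 ~ N(-1, 1) and X | Y=2 ~ N(+1, 1).
def posterior_class2(x):
    p1, p2 = norm.pdf(x, -1, 1), norm.pdf(x, 1, 1)
    return p2 / (p1 + p2)                  # P(Y = 2 | X = x) by Bayes' rule

def bayes_classifier(x):
    # Predict the class with the larger posterior; the boundary is at x = 0.
    return np.where(posterior_class2(x) > 0.5, 2, 1)

rng = np.random.default_rng(6)
y = rng.integers(1, 3, size=100_000)       # labels 1 or 2, equal priors
x = rng.normal(np.where(y == 1, -1.0, 1.0), 1.0)
print(np.mean(bayes_classifier(x) != y))   # ~0.159, the Bayes error rate here
```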
K-Nearest Neighbors
- The $K$-nearest neighbors (KNN) classifier works by estimating $P(Y \mid X)$ as follows (implemented in the sketch after this list).
- Given $K \geq 1$ and a point $x_0$, find the set of points
$$N_0 = \{\text{the } K \text{ training points nearest to } x_0\} \subseteq \mathbb{R}^p$$
- For each class $j$, set
$$\hat{P}(Y = j \mid X = x_0) = \frac{1}{K}\sum_{x_i \in N_0} I(y_i = j)$$
- Predict
$$\hat{y}_0 = \operatorname*{argmax}_{j} \hat{P}(Y = j \mid X = x_0)$$
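A minimal, self-contained implementation of this procedure (the toy data is an assumption for illustration):

```python
import numpy as np

def knn_classify(x0, X, y, k=3):
    """KNN: estimate P(Y=j | X=x0) from the k nearest training points, predict the argmax."""
    dists = np.linalg.norm(X - x0, axis=1)        # Euclidean distances to x0
    neighbors = y[np.argsort(dists)[:k]]          # labels of the k nearest points
    classes, counts = np.unique(neighbors, return_counts=True)
    return classes[np.argmax(counts)]             # class with the largest estimated probability

# Toy 2-D data (assumed): class 1 near (0, 0), class 2 near (2, 2).
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
print(knn_classify(np.array([2.0, 2.0]), X, y, k=5))  # -> most likely 2
```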