islr notes and exercises from An Introduction to Statistical Learning

2. Statistical Learning

What is Statistical Learning?

Given paired data $(X, Y)$, assume a relationship between $X$¹ and $Y$ modeled by

$$Y = f(X) + \epsilon$$

where $f:\mathbb{R}^p \rightarrow \mathbb{R}$ is a function and $\epsilon$ is a random error term with $\mathbb{E}(\epsilon) = 0$.

Statistical learning is a set of approaches for estimating $f$.²

Why Estimate f?

Prediction
  • We may want to predict the response $Y$ from an estimate $\hat{f}$ of $f$. The predicted value for a given $X$ is then $\hat{Y} = \hat{f}(X)$. In prediction, we often treat $\hat{f}$ as a black box.

  • The mean squared error³ $\mathbf{mse}(\hat{Y})=\mathbb{E}(Y-\hat{Y})^2$ is a good measure of the accuracy of $\hat{Y}$ as a predictor of $Y$.

  • One can write

$$\mathbf{mse}(\hat{Y}) = \left(f(X) - \hat{f}(X)\right)^2 + \mathbb{V}(\epsilon)$$

These two terms are known as the reducible error and the irreducible error, respectively⁴ (checked numerically in the sketch below).
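
A small simulation can make this decomposition concrete. The sketch below (Python with NumPy; the true $f$, the noise level, and the deliberately imperfect estimate $\hat{f}$ are all made up for illustration) estimates $\mathbf{mse}(\hat{Y})$ by Monte Carlo at a fixed input and compares it to the reducible plus irreducible parts.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):      return 2.0 + 3.0 * x   # hypothetical true regression function
def f_hat(x):  return 2.5 + 2.8 * x   # hypothetical (imperfect) estimate of f
sigma = 1.0                            # standard deviation of the error term epsilon

x0 = 1.5                               # fixed input at which we predict
eps = rng.normal(0.0, sigma, size=100_000)
y = f(x0) + eps                        # draws of Y at X = x0

mse = np.mean((y - f_hat(x0)) ** 2)    # Monte Carlo estimate of E[(Y - Y_hat)^2]
reducible = (f(x0) - f_hat(x0)) ** 2   # (f(x0) - f_hat(x0))^2
irreducible = sigma ** 2               # Var(epsilon)

print(f"Monte Carlo mse:         {mse:.4f}")
print(f"reducible + irreducible: {reducible + irreducible:.4f}")
```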

Inference
  • Instead of predicting $Y$ from $X$, we may be more interested in how $Y$ changes as a function of $X$. In inference, we usually do not treat $f$ as a black box.

Examples of important inference questions:

  • Which predictors have the largest influence on the response?
  • What is the relationship between the response and each predictor?
  • Is $f$ linear or non-linear?

How to Estimate f?

Parametric methods

Steps for a parametric method (sketched in code after the list):

  1. Assume a parametric model for $f$, that is, assume a specific functional form⁵

$$f = f(X, \boldsymbol{\beta})$$

for some vector of parameters $\boldsymbol{\beta} = (\beta_1,\dots,\beta_p)^T$.

  2. Use the training data to fit or train the model, that is, to choose the $\beta_i$ such that

$$Y \approx f(X, \boldsymbol{\beta})$$
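
A minimal sketch of these two steps in Python (NumPy only; the simulated data and the assumed linear form for $f$ are illustrative, not taken from the book):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated training data: Y = 1 + 2*X1 - 3*X2 + noise (hypothetical truth).
n, p = 200, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Step 1: assume f(X, beta) = beta_0 + beta_1*X1 + beta_2*X2 (a linear model).
# Step 2: fit by least squares, i.e. choose beta so that Y ~ f(X, beta).
X_design = np.column_stack([np.ones(n), X])              # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("estimated beta:", beta_hat)   # should be close to [1, 2, -3]
```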

Non-parametric methods

These methods make no assumptions about the functional form of $f$.
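
By contrast, a non-parametric approach such as $K$-nearest-neighbors regression estimates $f(x_0)$ directly from nearby training points without assuming a functional form. A rough sketch (the data and the choice $K = 5$ are made up):

```python
import numpy as np

def knn_regress(X_train, y_train, x0, k=5):
    """Estimate f(x0) as the mean response of the k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, size=(300, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=300)  # nonlinear truth

print(knn_regress(X_train, y_train, np.array([1.0])))  # roughly sin(1) ~ 0.84
```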

Accuracy vs. Interpretability

  • In inference, generally speaking, the more flexible the method, the less interpretable it is.

  • In prediction, more flexibility does not automatically mean more accuracy: highly flexible methods can overfit, so a less flexible method will sometimes predict more accurately.

Supervised vs. Unsupervised Learning

  • In supervised learning, the training data consist of pairs $(X, Y)$ where $X$ is a vector of predictors and $Y$ a response. Prediction and inference are supervised learning problems, and the response variable (or the relationship between the response and the predictors) supervises the analysis of the model.

  • In unsupervised learning, training data lacks a response variable.

Regression vs. Classification

  • Problems with a quantitative response ($Y \in S \subseteq \mathbb{R}$) tend to be called regression problems.

  • Problems with a qualitative, or categorical, response ($Y \in \{y_1, \dots, y_n\}$) tend to be called classification problems.

Assessing Model Accuracy

There is no free lunch in statistics: no one method dominates all others over every possible data set.

Measuring Quality of Fit

  • To evaluate the performance of a method on a data set, we need to measure model accuracy, that is, how well its predictions match the observed data.

  • In regression, the most common measure is the mean squared error

$$MSE = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{f}(x_i)\right)^2$$

where $y_i$ and $\hat{f}(x_i)$ are the $i$-th true and predicted responses, respectively.

  • We are usually not interested in minimizing the MSE on the training data, but rather on the test data.

  • There is no guarantee low training MSE will translate to low test MSE.

  • Having a low training MSE but a high test MSE is called overfitting (illustrated in the sketch below).
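
Overfitting is easy to reproduce with simulated data. In the sketch below (made-up data; polynomial fits of increasing degree stand in for increasingly flexible methods), training MSE keeps falling as the degree grows while test MSE eventually rises again:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n):
    x = rng.uniform(-2, 2, size=n)
    y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)   # hypothetical truth
    return x, y

x_train, y_train = simulate(50)
x_test, y_test = simulate(1000)

for degree in [1, 3, 10, 15]:
    coefs = np.polyfit(x_train, y_train, degree)         # fit a polynomial of this degree
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```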

The Bias-Variance Tradeoff

  • For a given $x_0$, the expected⁶ test MSE can be written

$$
\begin{aligned}
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] &= \left(\mathbb{E}\left[\hat{f}(x_0)\right] - f(x_0)\right)^2 + \mathbb{E}\left[\left(\hat{f}(x_0) - \mathbb{E}\left[\hat{f}(x_0)\right]\right)^2\right] + \mathbb{E}\left[\left(\epsilon - \mathbb{E}[\epsilon]\right)^2\right]\\
&= \mathbf{bias}^2\left(\hat{f}(x_0)\right) + \mathbb{V}\left(\hat{f}(x_0)\right) + \mathbb{V}(\epsilon)
\end{aligned}
$$

  • A good method minimizes variance and bias simultaneously.

  • As a general rule, these quantities move in opposite directions: more flexible methods have lower bias but higher variance, while less flexible methods have the opposite. This is the bias-variance tradeoff.

  • In practice the test MSE, variance, and bias cannot be calculated exactly, but one must keep the bias-variance tradeoff in mind (a simulation sketch follows below).
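
When the data are simulated and the true $f$ is known, the bias and variance of $\hat{f}(x_0)$ can be approximated by refitting the same method on many independent training sets. A sketch (all choices here are illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x): return np.sin(2 * x)        # known truth (only possible in a simulation)
sigma, x0, n = 0.3, 1.0, 50

preds = []
for _ in range(2000):                  # many independent training sets
    x = rng.uniform(-2, 2, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    coefs = np.polyfit(x, y, deg=3)    # a moderately flexible fit
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - f(x0)) ** 2  # squared bias of f_hat(x0)
variance = preds.var()                 # variance of f_hat(x0) across training sets
print(f"bias^2 {bias_sq:.4f}  variance {variance:.4f}  "
      f"expected test MSE ~ {bias_sq + variance + sigma**2:.4f}")
```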

The Classification Setting

  • In the classification setting, the most common measure of model accuracy is the error rate⁷

$$\frac{1}{n}\sum_{i=1}^n I(y_i \neq \hat{y}_i)$$

  • As with regression, we are interested in minimizing the test error rate, not the training error rate. A direct computation of the error rate is sketched below.
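
Computed directly, with made-up labels:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])   # hypothetical observed classes
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])   # hypothetical predicted classes

error_rate = np.mean(y_true != y_pred)         # proportion of misclassified observations
print(error_rate)                              # 0.25 for these labels
```
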
The Bayes Classifier

  • Given $K$ classes, the Bayes classifier predicts

$$\hat{y}_0 = \underset{1\leqslant j \leqslant K}{\text{argmax}}\, \mathbb{P}(Y=j\ |\ X = x_0)$$

  • The set of points $\{x_0\in\mathbb{R}^p\ |\ \mathbb{P}(Y=j\ |\ X = x_0) = \mathbb{P}(Y=k\ |\ X = x_0)\ \text{for all}\ 1\leqslant j,k \leqslant K\}$ is called the Bayes decision boundary.

  • The test error rate of the Bayes classifier is the Bayes error rate, which is minimal among classifiers. It is given by

$$1 - \mathbb{E}\left(\underset{j}{\max}\, \mathbb{P}(Y=j\ |\ X)\right)$$

  • The Bayes classifier is optimal, but in practice we do not know $\mathbb{P}(Y\ |\ X)$ (a sketch with a fully known distribution follows below).
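
In a simulation where the data-generating distribution is fully known, the Bayes classifier and the Bayes error rate can be computed. A sketch for two equally likely classes with hypothetical one-dimensional Gaussian class densities (uses SciPy for the normal pdf):

```python
import numpy as np
from scipy.stats import norm

# Two equally likely classes with known class-conditional densities (hypothetical).
mu0, mu1, sd = -1.0, 1.0, 1.0

def posterior_class1(x):
    """P(Y = 1 | X = x) from Bayes' rule with equal priors."""
    p0, p1 = norm.pdf(x, mu0, sd), norm.pdf(x, mu1, sd)
    return p1 / (p0 + p1)

def bayes_classify(x):
    """The Bayes classifier: predict the class with the larger posterior."""
    return (posterior_class1(x) > 0.5).astype(int)

rng = np.random.default_rng(5)
y = rng.integers(0, 2, size=200_000)              # true classes
x = rng.normal(np.where(y == 1, mu1, mu0), sd)    # features drawn from the class densities

p1 = posterior_class1(x)
print("Bayes error rate (1 - E[max_j P]):", 1 - np.mean(np.maximum(p1, 1 - p1)))
print("test error of the Bayes classifier:", np.mean(bayes_classify(x) != y))
# Both are close to norm.cdf(-1) ~ 0.159 for these parameters.
```
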
K-Nearest Neighbors

  • The K-nearest neighbors (KNN) classifier works by estimating $\mathbb{P}(Y\ |\ X)$ as follows (implemented in the sketch after the steps).

  1. Given $K\geqslant 1$ and $x_0$, find the set $\mathcal{N}_0 \subseteq \mathbb{R}^p$ of the $K$ training points nearest to $x_0$.
  2. For each class $j$, set $\mathbb{P}(Y=j\ |\ X = x_0) = \frac{1}{K}\sum_{x_i\in\mathcal{N}_0} I(y_i = j)$.
  3. Predict

$$\hat{y}_0 = \underset{j}{\text{argmax}}\, \mathbb{P}(Y=j\ |\ X = x_0)$$
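
A compact implementation of these three steps (NumPy only; the simulated two-class data and $K = 5$ are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    """Predict the class of x0 by majority vote among the k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)      # step 1: distances to x0
    neighbors = y_train[np.argsort(dists)[:k]]        # labels of the k nearest points
    classes, counts = np.unique(neighbors, return_counts=True)
    return classes[np.argmax(counts)]                 # steps 2-3: argmax of the vote shares

rng = np.random.default_rng(6)
X0 = rng.normal([-1, -1], 1.0, size=(100, 2))         # class 0 cloud
X1 = rng.normal([+1, +1], 1.0, size=(100, 2))         # class 1 cloud
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 100 + [1] * 100)

print(knn_predict(X_train, y_train, np.array([0.8, 1.2])))   # most likely predicts 1
```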


Footnotes

  1. Here $X=(X_1,\dots, X_p)^T$ is a vector.

  2. Reading the rest of the chapter, one realizes this is the situation for supervised learning, which is what the vast majority of this book is concerned with.

  3. This is the usual definition of the mean squared error of $\hat{Y}$ as an estimator of the (non-parametric) quantity $Y=f(X)$.

  4. We can in principle control the reducible error by improving the estimate $\hat{f}$, but we cannot control the irreducible error.

  5. For example, a simple but popular assumption is that $f$ is linear in both the parameters and the features, that is, $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$. This is linear regression.

  6. Here the random variable is $\hat{f}(x_0)$, so the average is taken over all training data sets.

  7. This is just the proportion of misclassified observations.