islr notes and exercises from An Introduction to Statistical Learning

5. Resampling Methods

Exercise 7: Estimate the LOOCV error

Prepare the data

import pandas as pd

weekly = pd.read_csv("../../datasets/weekly.csv", index_col=0)
weekly.head()
   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
1  1990  0.816  1.572 -3.936 -0.229 -3.484  0.154976 -0.270      Down
2  1990 -0.270  0.816  1.572 -3.936 -0.229  0.148574 -2.576      Down
3  1990 -2.576 -0.270  0.816  1.572 -3.936  0.159837  3.514        Up
4  1990  3.514 -2.576 -0.270  0.816  1.572  0.161630  0.712        Up
5  1990  0.712  3.514 -2.576 -0.270  0.816  0.153728  1.178        Up
weekly.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1089 entries, 1 to 1089
Data columns (total 9 columns):
Year         1089 non-null int64
Lag1         1089 non-null float64
Lag2         1089 non-null float64
Lag3         1089 non-null float64
Lag4         1089 non-null float64
Lag5         1089 non-null float64
Volume       1089 non-null float64
Today        1089 non-null float64
Direction    1089 non-null object
dtypes: float64(7), int64(1), object(1)
memory usage: 85.1+ KB
# encode Direction as 1 for "Up" and 0 for "Down"
weekly['Direction'] = (weekly['Direction'] == 'Up').astype(int)
weekly.head()
   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today  Direction
1  1990  0.816  1.572 -3.936 -0.229 -3.484  0.154976 -0.270          0
2  1990 -0.270  0.816  1.572 -3.936 -0.229  0.148574 -2.576          0
3  1990 -2.576 -0.270  0.816  1.572 -3.936  0.159837  3.514          1
4  1990  3.514 -2.576 -0.270  0.816  1.572  0.161630  0.712          1
5  1990  0.712  3.514 -2.576 -0.270  0.816  0.153728  1.178          1

a.


from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(solver='lbfgs').fit(weekly[['Lag1', 'Lag2']], weekly['Direction'])

b.


# drop the first observation (index label 1)
df = weekly.drop(labels=1, axis=0)
df.head()
   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today  Direction
2  1990 -0.270  0.816  1.572 -3.936 -0.229  0.148574 -2.576          0
3  1990 -2.576 -0.270  0.816  1.572 -3.936  0.159837  3.514          1
4  1990  3.514 -2.576 -0.270  0.816  1.572  0.161630  0.712          1
5  1990  0.712  3.514 -2.576 -0.270  0.816  0.153728  1.178          1
6  1990  1.178  0.712  3.514 -2.576 -0.270  0.154444 -1.372          0
loocv_model = LogisticRegression(solver='lbfgs').fit(df[['Lag1', 'Lag2']], df['Direction'])

c.


# select the held-out first observation by its index label (1)
first_obs = weekly.loc[1]
loocv_model.predict_proba(first_obs[['Lag1', 'Lag2']].values.reshape(1, -1))
array([[0.42966146, 0.57033854]])
Since $P(\text{Direction} = \text{"Up"} \mid \text{Lag1}, \text{Lag2}) \approx 0.57 > 0.5$, we’ll predict $\hat{y} = 1$ for this observation. The true value is 0 (Direction was "Down"), so this prediction is incorrect.

Note that predict() returns the class with the highest estimated probability, which in the binary case amounts to a 0.5 threshold, so we could have obtained the same prediction directly via

loocv_model.predict(first_obs[['Lag1', 'Lag2']].values.reshape(1, -1))
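
To see this agreement concretely, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made up for illustration and are not the Weekly data. It checks that, for a binary model, the labels from `predict()` match those obtained by thresholding the second column of `predict_proba()` at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, purely illustrative (not the Weekly dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated P(y = 1 | x),
# since the columns follow the order of model.classes_ (here [0, 1])
proba_up = model.predict_proba(X_demo)[:, 1]
thresholded = (proba_up > 0.5).astype(int)

# Compare the thresholded probabilities with the labels from predict()
agree = np.array_equal(model.predict(X_demo), thresholded)
print(agree)
```

A different cutoff would require thresholding `predict_proba()` by hand, since `predict()` does not take a threshold argument.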

d., e.

from sklearn.model_selection import LeaveOneOut
import numpy as np

# data
X, y = weekly[['Lag1', 'Lag2']].values, weekly['Direction'].values

# store one LOOCV prediction per observation
y_pred = np.empty(len(y))

# LOOCV splits: each observation is held out exactly once
loo = LeaveOneOut()

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train = y[train_index]
    # fit on the n - 1 training observations, then predict the held-out one
    y_pred[test_index] = LogisticRegression(solver='lbfgs').fit(X_train, y_train).predict(X_test)
abs(y_pred - y).mean()
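
For reference, the average computed above can be written as the LOOCV estimate of the misclassification rate. This is a sketch in the usual notation, where $\hat{y}_i$ denotes the prediction for observation $i$ from the model fit without it and $I(\cdot)$ is the indicator function:

```latex
\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Err}_i,
\qquad
\mathrm{Err}_i = I(y_i \neq \hat{y}_i)
```

Because both the predictions and the true labels are coded 0/1, the absolute difference $|\hat{y}_i - y_i|$ equals $I(y_i \neq \hat{y}_i)$, so the mean absolute difference is this average of indicators.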

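This is a point estimate of the test error rate. LOOCV involves no random splitting, so rerunning it on the same data returns the same value; quantifying the variability of the estimate would require a different resampling scheme, such as the bootstrap.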
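
The same kind of estimate can also be obtained without writing the loop by hand. The sketch below assumes scikit-learn's `cross_val_score` with a `LeaveOneOut` splitter and runs on synthetic stand-in data (`X_demo` and `y_demo` are illustrative names, not the Weekly data). It compares the result with a manual loop like the one in this exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for the predictors and response (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=100) > 0).astype(int)

# Each LOOCV fold scores a single held-out point, so every accuracy is 0.0 or 1.0
scores = cross_val_score(LogisticRegression(solver='lbfgs'), X_demo, y_demo,
                         cv=LeaveOneOut(), scoring='accuracy')

# The error rate is the complement of the mean accuracy
loocv_error = 1 - scores.mean()

# Manual loop for comparison, mirroring the approach used in this exercise
manual_pred = np.empty(len(y_demo))
for train_index, test_index in LeaveOneOut().split(X_demo):
    fold_model = LogisticRegression(solver='lbfgs').fit(X_demo[train_index], y_demo[train_index])
    manual_pred[test_index] = fold_model.predict(X_demo[test_index])
manual_error = np.mean(manual_pred != y_demo)

print(loocv_error, manual_error)
```

On the exercise data, replacing `X_demo` and `y_demo` with `X` and `y` should give the same figure as the explicit loop, since both fit one model per held-out observation.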