islr notes and exercises from An Introduction to Statistical Learning

5. Resampling Methods

Exercise 7: Estimate the LOOCV error

Prepare the data

import pandas as pd

weekly = pd.read_csv("../../datasets/weekly.csv", index_col=0)
weekly.head()
   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
1  1990  0.816  1.572 -3.936 -0.229 -3.484  0.154976 -0.270      Down
2  1990 -0.270  0.816  1.572 -3.936 -0.229  0.148574 -2.576      Down
3  1990 -2.576 -0.270  0.816  1.572 -3.936  0.159837  3.514        Up
4  1990  3.514 -2.576 -0.270  0.816  1.572  0.161630  0.712        Up
5  1990  0.712  3.514 -2.576 -0.270  0.816  0.153728  1.178        Up
weekly.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1089 entries, 1 to 1089
Data columns (total 9 columns):
Year         1089 non-null int64
Lag1         1089 non-null float64
Lag2         1089 non-null float64
Lag3         1089 non-null float64
Lag4         1089 non-null float64
Lag5         1089 non-null float64
Volume       1089 non-null float64
Today        1089 non-null float64
Direction    1089 non-null object
dtypes: float64(7), int64(1), object(1)
memory usage: 85.1+ KB
# encode Direction as 1 for "Up" and 0 for "Down"
weekly['Direction'] = (weekly['Direction'] == "Up").astype(int)
weekly.head()
   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today  Direction
1  1990  0.816  1.572 -3.936 -0.229 -3.484  0.154976 -0.270          0
2  1990 -0.270  0.816  1.572 -3.936 -0.229  0.148574 -2.576          0
3  1990 -2.576 -0.270  0.816  1.572 -3.936  0.159837  3.514          1
4  1990  3.514 -2.576 -0.270  0.816  1.572  0.161630  0.712          1
5  1990  0.712  3.514 -2.576 -0.270  0.816  0.153728  1.178          1

a.

from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(solver='lbfgs').fit(weekly[['Lag1', 'Lag2']], weekly['Direction'])

b.

df = weekly.drop(labels=1, axis=0)
df.head()
   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today  Direction
2  1990 -0.270  0.816  1.572 -3.936 -0.229  0.148574 -2.576          0
3  1990 -2.576 -0.270  0.816  1.572 -3.936  0.159837  3.514          1
4  1990  3.514 -2.576 -0.270  0.816  1.572  0.161630  0.712          1
5  1990  0.712  3.514 -2.576 -0.270  0.816  0.153728  1.178          1
6  1990  1.178  0.712  3.514 -2.576 -0.270  0.154444 -1.372          0
loocv_model = LogisticRegression(solver='lbfgs').fit(df[['Lag1', 'Lag2']], df['Direction'])

c.

# position 0 (index label 1) is the held-out first observation; iloc[1] would select the second row
first_obs = weekly.iloc[0]
loocv_model.predict_proba(first_obs[['Lag1', 'Lag2']].values.reshape(1, -1))
array([[0.42966146, 0.57033854]])
Since $P(\text{Direction} = \text{Up} \mid \text{Lag1}, \text{Lag2}) = 0.57 > 0.5$, we’ll predict $\hat{y} = 1$ for this observation. The true value is
first_obs['Direction']
0.0

so the prediction is incorrect: this observation is misclassified.

Note that, since predict() returns the class with the highest predicted probability (equivalent to a 0.5 threshold in the two-class case), we could have obtained the same prediction directly via

loocv_model.predict(first_obs[['Lag1', 'Lag2']].values.reshape(1, -1))
array([1])
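
The agreement between predict() and thresholding predict_proba() at 0.5 can be checked on a whole data set at once. The following is a minimal sketch on synthetic data; X_demo and y_demo are hypothetical stand-ins for the Lag1/Lag2 features and the Direction labels, not the Weekly data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: two features playing the role of Lag1 and Lag2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

# Column 1 of predict_proba holds P(y = 1 | x), because model.classes_ is [0, 1]
proba_up = model.predict_proba(X_demo)[:, 1]
by_threshold = (proba_up > 0.5).astype(int)

# predict() picks the most probable class, i.e. applies the same 0.5 cut-off
same = np.array_equal(by_threshold, model.predict(X_demo))
print(same)
```

If a different cut-off were wanted, the thresholding line is the place to change it; predict() itself offers no such option.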

d., e.

from sklearn.model_selection import LeaveOneOut
import numpy as np

# store loocv predictions
y_pred = np.array([])

# data
X, y = weekly[['Lag1', 'Lag2']].values, weekly['Direction'].values

# LOOCV splits
loo = LeaveOneOut()

for train_index, test_index in loo.split(X):
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    y_pred = np.append(y_pred, LogisticRegression(solver='lbfgs').fit(X_train, y_train).predict(X_test))
abs(y_pred - y).mean()
0.44995408631772266

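This is the LOOCV estimate of the test error rate: about 45% of the held-out observations are misclassified. Since LOOCV involves no randomness in how the data are split, repeating the procedure would return the same value; it is a single point estimate with no accompanying measure of variability.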
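
The explicit loop in parts d. and e. can also be written with scikit-learn's cross_val_score. The sketch below compares the two approaches on synthetic data; X_demo and y_demo are hypothetical stand-ins, and with the real data one would pass the X and y arrays defined above instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical stand-in data (not the Weekly data set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=60) > 0).astype(int)

# Manual loop: one prediction per held-out observation
manual_pred = np.empty(len(y_demo), dtype=int)
for train_index, test_index in LeaveOneOut().split(X_demo):
    fit = LogisticRegression(solver='lbfgs').fit(X_demo[train_index], y_demo[train_index])
    manual_pred[test_index] = fit.predict(X_demo[test_index])
manual_error = np.mean(manual_pred != y_demo)

# Same estimate via cross_val_score: each fold's accuracy is either 0 or 1
scores = cross_val_score(LogisticRegression(solver='lbfgs'), X_demo, y_demo,
                         cv=LeaveOneOut(), scoring='accuracy')
cv_error = 1 - scores.mean()

print(manual_error, cv_error)
```

Both routes refit the model once per observation, so the cost is the same; the cross_val_score version mainly saves bookkeeping.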