5. Resampling Methods

Exercise 7: Estimate the LOOCV error

Prepare the data
a.
b.
c.
d., e.

Prepare the data

import pandas as pd

weekly = pd.read_csv("../../datasets/weekly.csv", index_col=0)

weekly.head()

	Year	Lag1	Lag2	Lag3	Lag4	Lag5	Volume	Today	Direction
1	1990	0.816	1.572	-3.936	-0.229	-3.484	0.154976	-0.270	Down
2	1990	-0.270	0.816	1.572	-3.936	-0.229	0.148574	-2.576	Down
3	1990	-2.576	-0.270	0.816	1.572	-3.936	0.159837	3.514	Up
4	1990	3.514	-2.576	-0.270	0.816	1.572	0.161630	0.712	Up
5	1990	0.712	3.514	-2.576	-0.270	0.816	0.153728	1.178	Up

weekly.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1089 entries, 1 to 1089
Data columns (total 9 columns):
Year         1089 non-null int64
Lag1         1089 non-null float64
Lag2         1089 non-null float64
Lag3         1089 non-null float64
Lag4         1089 non-null float64
Lag5         1089 non-null float64
Volume       1089 non-null float64
Today        1089 non-null float64
Direction    1089 non-null object
dtypes: float64(7), int64(1), object(1)
memory usage: 85.1+ KB

# encode Direction as 1 for "Up" and 0 for "Down"
weekly['Direction'] = (weekly['Direction'] == "Up").astype(int)
weekly.head()

	Year	Lag1	Lag2	Lag3	Lag4	Lag5	Volume	Today	Direction
1	1990	0.816	1.572	-3.936	-0.229	-3.484	0.154976	-0.270	0
2	1990	-0.270	0.816	1.572	-3.936	-0.229	0.148574	-2.576	0
3	1990	-2.576	-0.270	0.816	1.572	-3.936	0.159837	3.514	1
4	1990	3.514	-2.576	-0.270	0.816	1.572	0.161630	0.712	1
5	1990	0.712	3.514	-2.576	-0.270	0.816	0.153728	1.178	1

a.

from sklearn.linear_model import LogisticRegression

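# fit Direction on Lag1 and Lag2 using all observations
# (scikit-learn applies an L2 penalty by default, C=1.0)
logit = LogisticRegression(solver='lbfgs').fit(weekly[['Lag1', 'Lag2']], weekly['Direction'])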

b.

df = weekly.drop(labels=1, axis=0)
df.head()

	Year	Lag1	Lag2	Lag3	Lag4	Lag5	Volume	Today	Direction
2	1990	-0.270	0.816	1.572	-3.936	-0.229	0.148574	-2.576	0
3	1990	-2.576	-0.270	0.816	1.572	-3.936	0.159837	3.514	1
4	1990	3.514	-2.576	-0.270	0.816	1.572	0.161630	0.712	1
5	1990	0.712	3.514	-2.576	-0.270	0.816	0.153728	1.178	1
6	1990	1.178	0.712	3.514	-2.576	-0.270	0.154444	-1.372	0

loocv_model = LogisticRegression(solver='lbfgs').fit(df[['Lag1', 'Lag2']], df['Direction'])
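Because weekly is indexed from 1 rather than 0, row labels and row positions differ by one, which matters when picking out the held-out observation. The following is a minimal sketch of the difference on a hypothetical three-row frame (toy and its values are illustrative stand-ins, not the weekly data):

```python
import pandas as pd

# Hypothetical toy frame with a 1-based index, mimicking weekly's row labels
toy = pd.DataFrame({"Lag1": [0.816, -0.270, -2.576]}, index=[1, 2, 3])

by_label = toy.loc[1]       # row whose index label is 1 (the first row)
by_position = toy.iloc[0]   # row at position 0 (also the first row)
second_row = toy.iloc[1]    # row at position 1, i.e. the row labelled 2

# drop() with labels= removes rows by index label, not by position
dropped = toy.drop(labels=1, axis=0)

print(by_label["Lag1"], by_position["Lag1"], second_row["Lag1"])
print(list(dropped.index))
```

So drop(labels=1) removes the first observation, and .loc[1] or .iloc[0] retrieves it again.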

c.

first_obs = weekly.loc[1]  # the row labelled 1, i.e. the observation left out in b.
loocv_model.predict_proba(first_obs[['Lag1', 'Lag2']].values.reshape(1, -1))

array([[0.42966146, 0.57033854]])

Since $P(\text{Direction} = \text{Up} \mid \text{Lag1}, \text{Lag2}) \approx 0.57 > 0.5$, we’ll predict $\hat{y} = 1$ ("Up") for this observation. The true value is

first_obs['Direction']

0.0

so the prediction is incorrect: the held-out observation is misclassified.

Note that predict() returns the class with the larger estimated probability, which in the binary case amounts to a 0.5 threshold, so we could have obtained the same prediction directly via

loocv_model.predict(first_obs[['Lag1', 'Lag2']].values.reshape(1, -1))

array([1])
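As a quick check of that equivalence, here is a minimal sketch that compares predict() against thresholding predict_proba() at 0.5. It uses synthetic stand-in data (X_demo and y_demo are hypothetical, not the weekly data set), so only the agreement between the two approaches is of interest:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical), not the weekly data set
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

# Column 1 of predict_proba holds P(y = 1), since model.classes_ is [0, 1]
p_up = model.predict_proba(X_demo)[:, 1]
manual = (p_up > 0.5).astype(int)

# True when thresholding at 0.5 reproduces predict() on every row
print(np.array_equal(manual, model.predict(X_demo)))
```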

d., e.
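
For reference, the quantity computed in these parts is the LOOCV estimate of the misclassification rate. Written out (as a reminder rather than a new result), with $n = 1089$ observations and $\hat{y}_i$ the prediction for observation $i$ from the model fitted without it:

```latex
\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Err}_i,
\qquad
\mathrm{Err}_i = I\left(y_i \neq \hat{y}_i\right)
```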

from sklearn.model_selection import LeaveOneOut
import numpy as np

# data
X, y = weekly[['Lag1', 'Lag2']].values, weekly['Direction'].values

# LOOCV splits: each observation is held out exactly once
loo = LeaveOneOut()

# store LOOCV predictions, one per held-out observation
y_pred = np.empty(len(y))

for train_index, test_index in loo.split(X):
    # fit on the other n - 1 observations, then predict the held-out one
    model = LogisticRegression(solver='lbfgs').fit(X[train_index], y[train_index])
    y_pred[test_index] = model.predict(X[test_index])

# fraction of held-out observations that were misclassified
(y_pred != y).mean()

0.44995408631772266

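This is the LOOCV estimate of the test error rate: about 45% of the held-out observations are misclassified. Because LOOCV involves no random splitting, rerunning the procedure on the same data gives the same estimate.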
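
The same estimate can also be obtained without an explicit loop. The sketch below uses scikit-learn's cross_val_score with a LeaveOneOut splitter. It runs on synthetic stand-in data (X_demo and y_demo are hypothetical), so to reproduce the figure above you would pass the X and y arrays built from weekly instead:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (hypothetical), not the weekly data set
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 2))
y_demo = (X_demo[:, 1] + rng.normal(size=60) > 0).astype(int)

# One accuracy score (0 or 1) per held-out observation
scores = cross_val_score(LogisticRegression(solver='lbfgs'), X_demo, y_demo,
                         cv=LeaveOneOut(), scoring='accuracy')

# LOOCV error rate = 1 - mean accuracy over the n single-observation folds
loocv_error = 1 - scores.mean()
print(loocv_error)
```

Each fold here contains a single test observation, so every score is either 0 or 1 and their mean is the LOOCV accuracy.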

islr notes and exercises from An Introduction to Statistical Learning
