5. Resampling Methods
Exercise 7: Estimate the LOOCV error
Prepare the data
import pandas as pd
weekly = pd.read_csv("../../datasets/weekly.csv", index_col=0)
weekly.head()
|   | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|------|------|------|------|------|------|--------|-------|-----------|
| 1 | 1990 | 0.816 | 1.572 | -3.936 | -0.229 | -3.484 | 0.154976 | -0.270 | Down |
| 2 | 1990 | -0.270 | 0.816 | 1.572 | -3.936 | -0.229 | 0.148574 | -2.576 | Down |
| 3 | 1990 | -2.576 | -0.270 | 0.816 | 1.572 | -3.936 | 0.159837 | 3.514 | Up |
| 4 | 1990 | 3.514 | -2.576 | -0.270 | 0.816 | 1.572 | 0.161630 | 0.712 | Up |
| 5 | 1990 | 0.712 | 3.514 | -2.576 | -0.270 | 0.816 | 0.153728 | 1.178 | Up |
weekly.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1089 entries, 1 to 1089
Data columns (total 9 columns):
Year 1089 non-null int64
Lag1 1089 non-null float64
Lag2 1089 non-null float64
Lag3 1089 non-null float64
Lag4 1089 non-null float64
Lag5 1089 non-null float64
Volume 1089 non-null float64
Today 1089 non-null float64
Direction 1089 non-null object
dtypes: float64(7), int64(1), object(1)
memory usage: 85.1+ KB
# encode Direction as 1 for "Up" and 0 for "Down"
weekly['Direction'] = (weekly['Direction'] == "Up").astype(int)
weekly.head()
|   | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|------|------|------|------|------|------|--------|-------|-----------|
| 1 | 1990 | 0.816 | 1.572 | -3.936 | -0.229 | -3.484 | 0.154976 | -0.270 | 0 |
| 2 | 1990 | -0.270 | 0.816 | 1.572 | -3.936 | -0.229 | 0.148574 | -2.576 | 0 |
| 3 | 1990 | -2.576 | -0.270 | 0.816 | 1.572 | -3.936 | 0.159837 | 3.514 | 1 |
| 4 | 1990 | 3.514 | -2.576 | -0.270 | 0.816 | 1.572 | 0.161630 | 0.712 | 1 |
| 5 | 1990 | 0.712 | 3.514 | -2.576 | -0.270 | 0.816 | 0.153728 | 1.178 | 1 |
a.
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(solver='lbfgs').fit(weekly[['Lag1', 'Lag2']], weekly['Direction'])
b.
df = weekly.drop(labels=1, axis=0)
df.head()
|   | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|------|------|------|------|------|------|--------|-------|-----------|
| 2 | 1990 | -0.270 | 0.816 | 1.572 | -3.936 | -0.229 | 0.148574 | -2.576 | 0 |
| 3 | 1990 | -2.576 | -0.270 | 0.816 | 1.572 | -3.936 | 0.159837 | 3.514 | 1 |
| 4 | 1990 | 3.514 | -2.576 | -0.270 | 0.816 | 1.572 | 0.161630 | 0.712 | 1 |
| 5 | 1990 | 0.712 | 3.514 | -2.576 | -0.270 | 0.816 | 0.153728 | 1.178 | 1 |
| 6 | 1990 | 1.178 | 0.712 | 3.514 | -2.576 | -0.270 | 0.154444 | -1.372 | 0 |
loocv_model = LogisticRegression(solver='lbfgs').fit(df[['Lag1', 'Lag2']], df['Direction'])
c.
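# select the first observation by its index label; .iloc[1] is zero-based and would pick the second row
first_obs = weekly.loc[1]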
loocv_model.predict_proba(first_obs[['Lag1', 'Lag2']].values.reshape(1, -1))
array([[0.42966146, 0.57033854]])
Since P(Direction="Up" | Lag1, Lag2) ≈ 0.57 > 0.5, we’ll predict ŷ = 1 (Up) for this observation. The true value is y = 0 (Down), so the prediction is incorrect.
Note that predict() on a LogisticRegression() model returns the class with the highest predicted probability, which for two classes amounts to a 0.5 threshold, so we could have got the same prediction directly via
loocv_model.predict(first_obs[['Lag1', 'Lag2']].values.reshape(1, -1))
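The equivalence between predict() and thresholding predict_proba() at 0.5 can be checked directly. The sketch below uses randomly generated data rather than the Weekly data, so the names `X_demo`, `y_demo`, and `demo_model` are placeholders introduced only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (an assumption for illustration, not the Weekly data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 1] - X_demo[:, 0] + rng.normal(size=200) > 0).astype(int)

demo_model = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

# Columns of predict_proba follow demo_model.classes_, here [0, 1]
proba_up = demo_model.predict_proba(X_demo)[:, 1]
by_threshold = (proba_up > 0.5).astype(int)
by_predict = demo_model.predict(X_demo)

# Compare the two routes to a class label
same = np.array_equal(by_threshold, by_predict)
print(same)
```

A different cut-off (say 0.6) would require thresholding `proba_up` by hand, since predict() does not take a threshold argument.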
d., e.
from sklearn.model_selection import LeaveOneOut
import numpy as np
# store loocv predictions
y_pred = np.array([])
# data
X, y = weekly[['Lag1', 'Lag2']].values, weekly['Direction'].values
# LOOCV splits
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    # fit on the n-1 training observations and predict the single held-out one
    y_pred = np.append(y_pred, LogisticRegression(solver='lbfgs').fit(X_train, y_train).predict(X_test))
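Part e asks for the average of the n leave-one-out errors, each of which is 1 for a misclassified observation and 0 otherwise. Here is a minimal sketch of that averaging step; the short arrays are made-up stand-ins for `y` and `y_pred`, so the printed value is not the Weekly result.

```python
import numpy as np

# Made-up stand-ins for the true labels and the LOOCV predictions
y_true_demo = np.array([1, 0, 1, 1, 0])
y_pred_demo = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# Each held-out observation contributes 1 if misclassified, else 0
errors = (y_pred_demo != y_true_demo).astype(int)

# The LOOCV error is the mean of these n indicators
loocv_error = errors.mean()
print(loocv_error)
```

With the arrays built in the loop, the same calculation would be `np.mean(y_pred != y)`.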
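The resulting misclassification rate is a point estimate of the LOOCV error. Repeating the procedure on the same data would not improve it, since LOOCV involves no random splitting and returns the same value every time; to gauge the estimate’s variability we’d need to resample the data, for example with the bootstrap.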
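The explicit loop can also be written with scikit-learn's cross_val_score, which under scoring='accuracy' returns 1 or 0 for each held-out observation. The sketch below runs both routes on synthetic data and compares them; `X_demo` and `y_demo` are random placeholders, not the Weekly data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=40) > 0).astype(int)

# One accuracy score (0 or 1) per held-out observation
scores = cross_val_score(LogisticRegression(solver='lbfgs'), X_demo, y_demo,
                         cv=LeaveOneOut(), scoring='accuracy')
cv_error = 1 - scores.mean()

# The same estimate via an explicit leave-one-out loop
manual_pred = np.empty(len(y_demo))
for train_idx, test_idx in LeaveOneOut().split(X_demo):
    fitted = LogisticRegression(solver='lbfgs').fit(X_demo[train_idx], y_demo[train_idx])
    manual_pred[test_idx] = fitted.predict(X_demo[test_idx])
manual_error = np.mean(manual_pred != y_demo)

print(np.isclose(cv_error, manual_error))
```

Both routes refit the model once per observation, so the cost is the same; cross_val_score just removes the bookkeeping.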