islr notes and exercises from An Introduction to Statistical Learning

5. Resampling Methods

Exercise 6: Bootstrap estimates of standard errors of logistic regression coefficient estimates

Given the differences between R and Python in this case, I’m not following the structure of this exercise

Get estimates of standard errors from statsmodels

import pandas as pd
import statsmodels.formula.api as smf


#import data
default = pd.read_csv("../../datasets/Default.csv", index_col=0)

# add constant
default['const'] = 1
columns = list(default.columns)
columns.remove('const')
default = default[['const'] + columns]

# convert to numeric
default['default'] = [int(value=='Yes') for value in default['default']]
default['student'] = [int(value=='Yes') for value in default['student']]

# fit model
logit = smf.logit(formula='default ~ income + balance', 
                  data=default).fit(disp=0)
logit.summary()
Logit Regression Results
Dep. Variable: default No. Observations: 10000
Model: Logit Df Residuals: 9997
Method: MLE Df Model: 2
Date: Mon, 03 Dec 2018 Pseudo R-squ.: 0.4594
Time: 09:29:05 Log-Likelihood: -789.48
converged: True LL-Null: -1460.3
LLR p-value: 4.541e-292
coef std err z P>|z| [0.025 0.975]
Intercept -11.5405 0.435 -26.544 0.000 -12.393 -10.688
income 2.081e-05 4.99e-06 4.174 0.000 1.1e-05 3.06e-05
balance 0.0056 0.000 24.835 0.000 0.005 0.006



Possibly complete quasi-separation: A fraction 0.14 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

The estimated standard errors of the coefficient estimates are

logit.bse
Intercept    0.434772
income       0.000005
balance      0.000227
dtype: float64

Get boostrap estimates of standard errors

from sklearn.utils import resample

boot_std_errs = {}
n_boot_samples = 1000

for i in range(n_boot_samples):
    default_boot_sample = resample(default)
    logit = smf.logit(formula='default ~ income + balance', 
                      data=default_boot_sample).fit(disp=0)
    boot_std_errs[i] = logit.bse
df = pd.DataFrame.from_dict(boot_std_errs, orient='index')
df.head()
Intercept income balance
0 0.455075 0.000005 0.000242
1 0.486832 0.000005 0.000253
2 0.454962 0.000005 0.000236
3 0.440095 0.000005 0.000230
4 0.420974 0.000005 0.000220
df.std()
Intercept    2.213157e-02
income       1.415595e-07
balance      1.109741e-05
dtype: float64

These estimates are considerably smaller, and likely more precise.

For more details, see the chapter on bootstrapping in Wasserman’s All of Statistics