Given the differences between R and Python in this case, I’m not following the structure of this exercise
statsmodels
import pandas as pd
import statsmodels.formula.api as smf
#import data
default = pd.read_csv("../../datasets/Default.csv", index_col=0)
# add constant
default['const'] = 1
columns = list(default.columns)
columns.remove('const')
default = default[['const'] + columns]
# convert to numeric
default['default'] = [int(value=='Yes') for value in default['default']]
default['student'] = [int(value=='Yes') for value in default['student']]
# fit model
logit = smf.logit(formula='default ~ income + balance',
data=default).fit(disp=0)
logit.summary()
Dep. Variable: | default | No. Observations: | 10000 |
---|---|---|---|
Model: | Logit | Df Residuals: | 9997 |
Method: | MLE | Df Model: | 2 |
Date: | Mon, 03 Dec 2018 | Pseudo R-squ.: | 0.4594 |
Time: | 09:29:05 | Log-Likelihood: | -789.48 |
converged: | True | LL-Null: | -1460.3 |
LLR p-value: | 4.541e-292 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -11.5405 | 0.435 | -26.544 | 0.000 | -12.393 | -10.688 |
income | 2.081e-05 | 4.99e-06 | 4.174 | 0.000 | 1.1e-05 | 3.06e-05 |
balance | 0.0056 | 0.000 | 24.835 | 0.000 | 0.005 | 0.006 |
Possibly complete quasi-separation: A fraction 0.14 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
The estimated standard errors of the coefficient estimates are
logit.bse
Intercept 0.434772
income 0.000005
balance 0.000227
dtype: float64
from sklearn.utils import resample
boot_std_errs = {}
n_boot_samples = 1000
for i in range(n_boot_samples):
default_boot_sample = resample(default)
logit = smf.logit(formula='default ~ income + balance',
data=default_boot_sample).fit(disp=0)
boot_std_errs[i] = logit.bse
df = pd.DataFrame.from_dict(boot_std_errs, orient='index')
df.head()
Intercept | income | balance | |
---|---|---|---|
0 | 0.455075 | 0.000005 | 0.000242 |
1 | 0.486832 | 0.000005 | 0.000253 |
2 | 0.454962 | 0.000005 | 0.000236 |
3 | 0.440095 | 0.000005 | 0.000230 |
4 | 0.420974 | 0.000005 | 0.000220 |
df.std()
Intercept 2.213157e-02
income 1.415595e-07
balance 1.109741e-05
dtype: float64
These estimates are considerably smaller, and likely more precise.
For more details, see the chapter on bootstrapping in Wasserman’s All of Statistics