Given the differences between R and Python in this case, I don't follow the exact structure of this exercise; instead I fit the models with statsmodels.

import pandas as pd
import statsmodels.formula.api as smf
# import the data
default = pd.read_csv("../../datasets/Default.csv", index_col=0)
# add a constant column and move it to the front
# (not strictly needed here: the formula interface below adds its own intercept)
default['const'] = 1
columns = list(default.columns)
columns.remove('const')
default = default[['const'] + columns]
# encode the Yes/No columns as 0/1
default['default'] = (default['default'] == 'Yes').astype(int)
default['student'] = (default['student'] == 'Yes').astype(int)
# fit model
logit = smf.logit(formula='default ~ income + balance',
                  data=default).fit(disp=0)
logit.summary()
| Dep. Variable: | default | No. Observations: | 10000 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 9997 |
| Method: | MLE | Df Model: | 2 |
| Date: | Mon, 03 Dec 2018 | Pseudo R-squ.: | 0.4594 |
| Time: | 09:29:05 | Log-Likelihood: | -789.48 |
| converged: | True | LL-Null: | -1460.3 |
| | | LLR p-value: | 4.541e-292 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -11.5405 | 0.435 | -26.544 | 0.000 | -12.393 | -10.688 |
| income | 2.081e-05 | 4.99e-06 | 4.174 | 0.000 | 1.1e-05 | 3.06e-05 |
| balance | 0.0056 | 0.000 | 24.835 | 0.000 | 0.005 | 0.006 |
Possibly complete quasi-separation: A fraction 0.14 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
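The warning refers to observations whose outcome the model predicts with probability essentially 0 or 1. A rough way to check this (the exact threshold statsmodels applies internally may differ):

```python
import numpy as np

# fraction of observations whose fitted probability is essentially equal to
# their observed 0/1 outcome, i.e. observations that are "perfectly predicted"
fitted = logit.predict()
np.mean(np.abs(fitted - default['default']) < 1e-10)
```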
The estimated standard errors of the coefficient estimates are:
logit.bse
Intercept 0.434772
income 0.000005
balance 0.000227
dtype: float64
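As far as I know, these are the usual asymptotic standard errors, i.e. the square roots of the diagonal of the estimated covariance matrix of the coefficients (based on the Hessian of the log-likelihood at the MLE), so the same numbers can be recovered with:

```python
import numpy as np

# should reproduce logit.bse
np.sqrt(np.diag(logit.cov_params()))
```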
from sklearn.utils import resample

# refit the model on each of 1,000 bootstrap samples (rows drawn with
# replacement) and record the standard errors reported by each fit
boot_std_errs = {}
n_boot_samples = 1000
for i in range(n_boot_samples):
    default_boot_sample = resample(default)
    logit = smf.logit(formula='default ~ income + balance',
                      data=default_boot_sample).fit(disp=0)
    boot_std_errs[i] = logit.bse
df = pd.DataFrame.from_dict(boot_std_errs, orient='index')
df.head()
| Intercept | income | balance | |
|---|---|---|---|
| 0 | 0.455075 | 0.000005 | 0.000242 |
| 1 | 0.486832 | 0.000005 | 0.000253 |
| 2 | 0.454962 | 0.000005 | 0.000236 |
| 3 | 0.440095 | 0.000005 | 0.000230 |
| 4 | 0.420974 | 0.000005 | 0.000220 |
df.std()
Intercept 2.213157e-02
income 1.415595e-07
balance 1.109741e-05
dtype: float64
These spreads are considerably smaller than the standard errors themselves: the standard errors estimated on each bootstrap sample vary very little, so the formula-based estimates above appear stable under resampling. A sketch of the more conventional coefficient-based bootstrap follows below.
For more details, see the chapter on bootstrapping in Wasserman's *All of Statistics*.
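For comparison, the bootstrap standard error of a coefficient is usually estimated from the spread of the coefficient estimates themselves across bootstrap samples, rather than from the spread of the per-sample standard errors. A minimal sketch of that variant, reusing the `default` DataFrame and the formula from above:

```python
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.utils import resample

boot_coefs = {}
for i in range(1000):
    sample = resample(default)  # draw len(default) rows with replacement
    fit = smf.logit(formula='default ~ income + balance',
                    data=sample).fit(disp=0)
    boot_coefs[i] = fit.params  # record the coefficient estimates

# bootstrap standard errors: the standard deviation of the coefficient
# estimates across the bootstrap replications
pd.DataFrame.from_dict(boot_coefs, orient='index').std()
```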