islr notes and exercises from An Introduction to Statistical Learning

3. Linear Regression

Exercise 15: Regression models for the Boston Dataset

Preparing the data

import pandas as pd

boston = pd.read_csv('../../datasets/Boston.csv', index_col=0)
boston.head()
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
boston.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 1 to 506
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
black      506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 59.3 KB

a. Regression models for each predictor

We want to predict the per capita crime rate crim.

import numpy as np
import statsmodels.formula.api as smf

# get predictor names
predictors = np.delete(boston.columns.values, np.where(boston.columns.values==['crim']))

# dictionary for models
models = {}

# fit single predictor models
for predictor in predictors:
    models[predictor] = smf.ols('crim ~ ' + predictor, data=boston).fit()
models_keys_iter = iter(models.values())

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.040
Model: OLS Adj. R-squared: 0.038
Method: Least Squares F-statistic: 21.10
Date: Sun, 11 Nov 2018 Prob (F-statistic): 5.51e-06
Time: 16:41:34 Log-Likelihood: -1796.0
No. Observations: 506 AIC: 3596.
Df Residuals: 504 BIC: 3604.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.4537 0.417 10.675 0.000 3.634 5.273
zn -0.0739 0.016 -4.594 0.000 -0.106 -0.042
Omnibus: 567.443 Durbin-Watson: 0.857
Prob(Omnibus): 0.000 Jarque-Bera (JB): 32753.004
Skew: 5.257 Prob(JB): 0.00
Kurtosis: 40.986 Cond. No. 28.8



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.165
Model: OLS Adj. R-squared: 0.164
Method: Least Squares F-statistic: 99.82
Date: Sun, 11 Nov 2018 Prob (F-statistic): 1.45e-21
Time: 16:41:47 Log-Likelihood: -1760.6
No. Observations: 506 AIC: 3525.
Df Residuals: 504 BIC: 3534.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -2.0637 0.667 -3.093 0.002 -3.375 -0.753
indus 0.5098 0.051 9.991 0.000 0.410 0.610
Omnibus: 585.118 Durbin-Watson: 0.986
Prob(Omnibus): 0.000 Jarque-Bera (JB): 41418.938
Skew: 5.449 Prob(JB): 0.00
Kurtosis: 45.962 Cond. No. 25.1



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.003
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 1.579
Date: Sun, 11 Nov 2018 Prob (F-statistic): 0.209
Time: 16:41:51 Log-Likelihood: -1805.6
No. Observations: 506 AIC: 3615.
Df Residuals: 504 BIC: 3624.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.7444 0.396 9.453 0.000 2.966 4.523
chas -1.8928 1.506 -1.257 0.209 -4.852 1.066
Omnibus: 561.663 Durbin-Watson: 0.817
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30645.429
Skew: 5.191 Prob(JB): 0.00
Kurtosis: 39.685 Cond. No. 3.96



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.177
Model: OLS Adj. R-squared: 0.176
Method: Least Squares F-statistic: 108.6
Date: Sun, 11 Nov 2018 Prob (F-statistic): 3.75e-23
Time: 16:41:58 Log-Likelihood: -1757.0
No. Observations: 506 AIC: 3518.
Df Residuals: 504 BIC: 3526.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -13.7199 1.699 -8.073 0.000 -17.059 -10.381
nox 31.2485 2.999 10.419 0.000 25.356 37.141
Omnibus: 591.712 Durbin-Watson: 0.992
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43138.106
Skew: 5.546 Prob(JB): 0.00
Kurtosis: 46.852 Cond. No. 11.3



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.048
Model: OLS Adj. R-squared: 0.046
Method: Least Squares F-statistic: 25.45
Date: Sun, 11 Nov 2018 Prob (F-statistic): 6.35e-07
Time: 16:42:03 Log-Likelihood: -1793.9
No. Observations: 506 AIC: 3592.
Df Residuals: 504 BIC: 3600.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 20.4818 3.364 6.088 0.000 13.872 27.092
rm -2.6841 0.532 -5.045 0.000 -3.729 -1.639
Omnibus: 575.717 Durbin-Watson: 0.879
Prob(Omnibus): 0.000 Jarque-Bera (JB): 36658.093
Skew: 5.345 Prob(JB): 0.00
Kurtosis: 43.305 Cond. No. 58.4



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.124
Model: OLS Adj. R-squared: 0.123
Method: Least Squares F-statistic: 71.62
Date: Sun, 11 Nov 2018 Prob (F-statistic): 2.85e-16
Time: 16:42:07 Log-Likelihood: -1772.7
No. Observations: 506 AIC: 3549.
Df Residuals: 504 BIC: 3558.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -3.7779 0.944 -4.002 0.000 -5.633 -1.923
age 0.1078 0.013 8.463 0.000 0.083 0.133
Omnibus: 574.509 Durbin-Watson: 0.956
Prob(Omnibus): 0.000 Jarque-Bera (JB): 36741.903
Skew: 5.322 Prob(JB): 0.00
Kurtosis: 43.366 Cond. No. 195.



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.144
Model: OLS Adj. R-squared: 0.142
Method: Least Squares F-statistic: 84.89
Date: Sun, 11 Nov 2018 Prob (F-statistic): 8.52e-19
Time: 16:42:10 Log-Likelihood: -1767.0
No. Observations: 506 AIC: 3538.
Df Residuals: 504 BIC: 3546.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 9.4993 0.730 13.006 0.000 8.064 10.934
dis -1.5509 0.168 -9.213 0.000 -1.882 -1.220
Omnibus: 576.519 Durbin-Watson: 0.952
Prob(Omnibus): 0.000 Jarque-Bera (JB): 37426.729
Skew: 5.348 Prob(JB): 0.00
Kurtosis: 43.753 Cond. No. 9.32



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.391
Model: OLS Adj. R-squared: 0.390
Method: Least Squares F-statistic: 323.9
Date: Sun, 11 Nov 2018 Prob (F-statistic): 2.69e-56
Time: 16:42:13 Log-Likelihood: -1680.8
No. Observations: 506 AIC: 3366.
Df Residuals: 504 BIC: 3374.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -2.2872 0.443 -5.157 0.000 -3.158 -1.416
rad 0.6179 0.034 17.998 0.000 0.550 0.685
Omnibus: 656.459 Durbin-Watson: 1.337
Prob(Omnibus): 0.000 Jarque-Bera (JB): 75417.007
Skew: 6.478 Prob(JB): 0.00
Kurtosis: 61.389 Cond. No. 19.2



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.340
Model: OLS Adj. R-squared: 0.338
Method: Least Squares F-statistic: 259.2
Date: Sun, 11 Nov 2018 Prob (F-statistic): 2.36e-47
Time: 16:42:15 Log-Likelihood: -1701.4
No. Observations: 506 AIC: 3407.
Df Residuals: 504 BIC: 3415.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -8.5284 0.816 -10.454 0.000 -10.131 -6.926
tax 0.0297 0.002 16.099 0.000 0.026 0.033
Omnibus: 635.377 Durbin-Watson: 1.252
Prob(Omnibus): 0.000 Jarque-Bera (JB): 63763.835
Skew: 6.156 Prob(JB): 0.00
Kurtosis: 56.599 Cond. No. 1.16e+03



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.084
Model: OLS Adj. R-squared: 0.082
Method: Least Squares F-statistic: 46.26
Date: Sun, 11 Nov 2018 Prob (F-statistic): 2.94e-11
Time: 16:42:17 Log-Likelihood: -1784.1
No. Observations: 506 AIC: 3572.
Df Residuals: 504 BIC: 3581.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -17.6469 3.147 -5.607 0.000 -23.830 -11.464
ptratio 1.1520 0.169 6.801 0.000 0.819 1.485
Omnibus: 568.053 Durbin-Watson: 0.905
Prob(Omnibus): 0.000 Jarque-Bera (JB): 34221.853
Skew: 5.245 Prob(JB): 0.00
Kurtosis: 41.899 Cond. No. 160.



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.148
Model: OLS Adj. R-squared: 0.147
Method: Least Squares F-statistic: 87.74
Date: Sun, 11 Nov 2018 Prob (F-statistic): 2.49e-19
Time: 16:42:20 Log-Likelihood: -1765.8
No. Observations: 506 AIC: 3536.
Df Residuals: 504 BIC: 3544.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 16.5535 1.426 11.609 0.000 13.752 19.355
black -0.0363 0.004 -9.367 0.000 -0.044 -0.029
Omnibus: 594.029 Durbin-Watson: 0.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 44041.935
Skew: 5.578 Prob(JB): 0.00
Kurtosis: 47.323 Cond. No. 1.49e+03



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.208
Model: OLS Adj. R-squared: 0.206
Method: Least Squares F-statistic: 132.0
Date: Sun, 11 Nov 2018 Prob (F-statistic): 2.65e-27
Time: 16:42:22 Log-Likelihood: -1747.5
No. Observations: 506 AIC: 3499.
Df Residuals: 504 BIC: 3507.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -3.3305 0.694 -4.801 0.000 -4.694 -1.968
lstat 0.5488 0.048 11.491 0.000 0.455 0.643
Omnibus: 601.306 Durbin-Watson: 1.182
Prob(Omnibus): 0.000 Jarque-Bera (JB): 49918.826
Skew: 5.645 Prob(JB): 0.00
Kurtosis: 50.331 Cond. No. 29.7



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

next(models_keys_iter).summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.151
Model: OLS Adj. R-squared: 0.149
Method: Least Squares F-statistic: 89.49
Date: Sun, 11 Nov 2018 Prob (F-statistic): 1.17e-19
Time: 16:42:24 Log-Likelihood: -1765.0
No. Observations: 506 AIC: 3534.
Df Residuals: 504 BIC: 3542.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.7965 0.934 12.628 0.000 9.961 13.632
medv -0.3632 0.038 -9.460 0.000 -0.439 -0.288
Omnibus: 558.880 Durbin-Watson: 0.996
Prob(Omnibus): 0.000 Jarque-Bera (JB): 32740.044
Skew: 5.108 Prob(JB): 0.00
Kurtosis: 41.059 Cond. No. 64.5



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now we check which predictors are statistically significant, using the common rule-of-thumb p<0.05p < 0.05

for model in models:
    if models[model].pvalues[1] < 0.05:
        print("{} is statistically significant with p-value {}". format(model, models[model].pvalues[1]))
    else:
        print("{} is NOT statistically significant with p-value {}". format(model, models[model].pvalues[1]))
zn is statistically significant with p-value 5.506472107679307e-06
indus is statistically significant with p-value 1.4503489330272395e-21
chas is NOT statistically significant with p-value 0.2094345015352004
nox is statistically significant with p-value 3.751739260356923e-23
rm is statistically significant with p-value 6.346702984687839e-07
age is statistically significant with p-value 2.8548693502441573e-16
dis is statistically significant with p-value 8.519948766926326e-19
rad is statistically significant with p-value 2.6938443981864414e-56
tax is statistically significant with p-value 2.357126835257048e-47
ptratio is statistically significant with p-value 2.942922447359816e-11
black is statistically significant with p-value 2.487273973773734e-19
lstat is statistically significant with p-value 2.6542772314731968e-27
medv is statistically significant with p-value 1.1739870821943694e-19
full is statistically significant with p-value 0.01702489149848948

Looks like all predictors were significant

b. Full regression model

# formula for full model
formula = ''
for predictor in predictors:
    formula += predictor + ' + '
formula = formula[:-3]
formula

# add full model
models['full'] = smf.ols('crim ~ ' + formula, data=boston).fit()
models['full'].summary()
OLS Regression Results
Dep. Variable: crim R-squared: 0.454
Model: OLS Adj. R-squared: 0.440
Method: Least Squares F-statistic: 31.47
Date: Sun, 11 Nov 2018 Prob (F-statistic): 1.57e-56
Time: 16:49:14 Log-Likelihood: -1653.3
No. Observations: 506 AIC: 3335.
Df Residuals: 492 BIC: 3394.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 17.0332 7.235 2.354 0.019 2.818 31.248
zn 0.0449 0.019 2.394 0.017 0.008 0.082
indus -0.0639 0.083 -0.766 0.444 -0.228 0.100
chas -0.7491 1.180 -0.635 0.526 -3.068 1.570
nox -10.3135 5.276 -1.955 0.051 -20.679 0.052
rm 0.4301 0.613 0.702 0.483 -0.774 1.634
age 0.0015 0.018 0.081 0.935 -0.034 0.037
dis -0.9872 0.282 -3.503 0.001 -1.541 -0.433
rad 0.5882 0.088 6.680 0.000 0.415 0.761
tax -0.0038 0.005 -0.733 0.464 -0.014 0.006
ptratio -0.2711 0.186 -1.454 0.147 -0.637 0.095
black -0.0075 0.004 -2.052 0.041 -0.015 -0.000
lstat 0.1262 0.076 1.667 0.096 -0.023 0.275
medv -0.1989 0.061 -3.287 0.001 -0.318 -0.080
Omnibus: 666.613 Durbin-Watson: 1.519
Prob(Omnibus): 0.000 Jarque-Bera (JB): 84887.625
Skew: 6.617 Prob(JB): 0.00
Kurtosis: 65.058 Cond. No. 1.58e+04



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.58e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Now we see which predictors were significant

TBC…