3. Linear Regression
Exercise 15: Regression models for the Boston Dataset
Preparing the data
import pandas as pd
boston = pd.read_csv('../../datasets/Boston.csv', index_col=0)
boston.head()
|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
| 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
boston.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 1 to 506
Data columns (total 14 columns):
crim 506 non-null float64
zn 506 non-null float64
indus 506 non-null float64
chas 506 non-null int64
nox 506 non-null float64
rm 506 non-null float64
age 506 non-null float64
dis 506 non-null float64
rad 506 non-null int64
tax 506 non-null int64
ptratio 506 non-null float64
black 506 non-null float64
lstat 506 non-null float64
medv 506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 59.3 KB
a. Regression models for each predictor
We want to predict the per capita crime rate crim from each of the other variables, one predictor at a time.
import statsmodels.formula.api as smf
# predictor names: every column except the response crim
predictors = boston.columns.drop('crim')
# fit one simple linear regression per predictor and store it by name
models = {}
for predictor in predictors:
    models[predictor] = smf.ols('crim ~ ' + predictor, data=boston).fit()
# iterate over the fitted models in predictor order (dicts preserve insertion order)
models_iter = iter(models.values())
next(models_iter).summary()  # zn
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.040 |
Model: | OLS | Adj. R-squared: | 0.038 |
Method: | Least Squares | F-statistic: | 21.10 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 5.51e-06 |
Time: | 16:41:34 | Log-Likelihood: | -1796.0 |
No. Observations: | 506 | AIC: | 3596. |
Df Residuals: | 504 | BIC: | 3604. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 4.4537 | 0.417 | 10.675 | 0.000 | 3.634 | 5.273 |
zn | -0.0739 | 0.016 | -4.594 | 0.000 | -0.106 | -0.042 |
Omnibus: | 567.443 | Durbin-Watson: | 0.857 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 32753.004 |
Skew: | 5.257 | Prob(JB): | 0.00 |
Kurtosis: | 40.986 | Cond. No. | 28.8 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # indus
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.165 |
Model: | OLS | Adj. R-squared: | 0.164 |
Method: | Least Squares | F-statistic: | 99.82 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 1.45e-21 |
Time: | 16:41:47 | Log-Likelihood: | -1760.6 |
No. Observations: | 506 | AIC: | 3525. |
Df Residuals: | 504 | BIC: | 3534. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -2.0637 | 0.667 | -3.093 | 0.002 | -3.375 | -0.753 |
indus | 0.5098 | 0.051 | 9.991 | 0.000 | 0.410 | 0.610 |
Omnibus: | 585.118 | Durbin-Watson: | 0.986 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 41418.938 |
Skew: | 5.449 | Prob(JB): | 0.00 |
Kurtosis: | 45.962 | Cond. No. | 25.1 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # chas
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.003 |
Model: | OLS | Adj. R-squared: | 0.001 |
Method: | Least Squares | F-statistic: | 1.579 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 0.209 |
Time: | 16:41:51 | Log-Likelihood: | -1805.6 |
No. Observations: | 506 | AIC: | 3615. |
Df Residuals: | 504 | BIC: | 3624. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 3.7444 | 0.396 | 9.453 | 0.000 | 2.966 | 4.523 |
chas | -1.8928 | 1.506 | -1.257 | 0.209 | -4.852 | 1.066 |
Omnibus: | 561.663 | Durbin-Watson: | 0.817 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 30645.429 |
Skew: | 5.191 | Prob(JB): | 0.00 |
Kurtosis: | 39.685 | Cond. No. | 3.96 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # nox
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.177 |
Model: | OLS | Adj. R-squared: | 0.176 |
Method: | Least Squares | F-statistic: | 108.6 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 3.75e-23 |
Time: | 16:41:58 | Log-Likelihood: | -1757.0 |
No. Observations: | 506 | AIC: | 3518. |
Df Residuals: | 504 | BIC: | 3526. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -13.7199 | 1.699 | -8.073 | 0.000 | -17.059 | -10.381 |
nox | 31.2485 | 2.999 | 10.419 | 0.000 | 25.356 | 37.141 |
Omnibus: | 591.712 | Durbin-Watson: | 0.992 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 43138.106 |
Skew: | 5.546 | Prob(JB): | 0.00 |
Kurtosis: | 46.852 | Cond. No. | 11.3 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # rm
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.048 |
Model: | OLS | Adj. R-squared: | 0.046 |
Method: | Least Squares | F-statistic: | 25.45 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 6.35e-07 |
Time: | 16:42:03 | Log-Likelihood: | -1793.9 |
No. Observations: | 506 | AIC: | 3592. |
Df Residuals: | 504 | BIC: | 3600. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 20.4818 | 3.364 | 6.088 | 0.000 | 13.872 | 27.092 |
rm | -2.6841 | 0.532 | -5.045 | 0.000 | -3.729 | -1.639 |
Omnibus: | 575.717 | Durbin-Watson: | 0.879 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 36658.093 |
Skew: | 5.345 | Prob(JB): | 0.00 |
Kurtosis: | 43.305 | Cond. No. | 58.4 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # age
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.124 |
Model: | OLS | Adj. R-squared: | 0.123 |
Method: | Least Squares | F-statistic: | 71.62 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.85e-16 |
Time: | 16:42:07 | Log-Likelihood: | -1772.7 |
No. Observations: | 506 | AIC: | 3549. |
Df Residuals: | 504 | BIC: | 3558. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -3.7779 | 0.944 | -4.002 | 0.000 | -5.633 | -1.923 |
age | 0.1078 | 0.013 | 8.463 | 0.000 | 0.083 | 0.133 |
Omnibus: | 574.509 | Durbin-Watson: | 0.956 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 36741.903 |
Skew: | 5.322 | Prob(JB): | 0.00 |
Kurtosis: | 43.366 | Cond. No. | 195. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # dis
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.144 |
Model: | OLS | Adj. R-squared: | 0.142 |
Method: | Least Squares | F-statistic: | 84.89 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 8.52e-19 |
Time: | 16:42:10 | Log-Likelihood: | -1767.0 |
No. Observations: | 506 | AIC: | 3538. |
Df Residuals: | 504 | BIC: | 3546. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 9.4993 | 0.730 | 13.006 | 0.000 | 8.064 | 10.934 |
dis | -1.5509 | 0.168 | -9.213 | 0.000 | -1.882 | -1.220 |
Omnibus: | 576.519 | Durbin-Watson: | 0.952 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 37426.729 |
Skew: | 5.348 | Prob(JB): | 0.00 |
Kurtosis: | 43.753 | Cond. No. | 9.32 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # rad
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.391 |
Model: | OLS | Adj. R-squared: | 0.390 |
Method: | Least Squares | F-statistic: | 323.9 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.69e-56 |
Time: | 16:42:13 | Log-Likelihood: | -1680.8 |
No. Observations: | 506 | AIC: | 3366. |
Df Residuals: | 504 | BIC: | 3374. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -2.2872 | 0.443 | -5.157 | 0.000 | -3.158 | -1.416 |
rad | 0.6179 | 0.034 | 17.998 | 0.000 | 0.550 | 0.685 |
Omnibus: | 656.459 | Durbin-Watson: | 1.337 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 75417.007 |
Skew: | 6.478 | Prob(JB): | 0.00 |
Kurtosis: | 61.389 | Cond. No. | 19.2 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # tax
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.340 |
Model: | OLS | Adj. R-squared: | 0.338 |
Method: | Least Squares | F-statistic: | 259.2 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.36e-47 |
Time: | 16:42:15 | Log-Likelihood: | -1701.4 |
No. Observations: | 506 | AIC: | 3407. |
Df Residuals: | 504 | BIC: | 3415. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -8.5284 | 0.816 | -10.454 | 0.000 | -10.131 | -6.926 |
tax | 0.0297 | 0.002 | 16.099 | 0.000 | 0.026 | 0.033 |
Omnibus: | 635.377 | Durbin-Watson: | 1.252 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 63763.835 |
Skew: | 6.156 | Prob(JB): | 0.00 |
Kurtosis: | 56.599 | Cond. No. | 1.16e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
next(models_iter).summary()  # ptratio
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.084 |
Model: | OLS | Adj. R-squared: | 0.082 |
Method: | Least Squares | F-statistic: | 46.26 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.94e-11 |
Time: | 16:42:17 | Log-Likelihood: | -1784.1 |
No. Observations: | 506 | AIC: | 3572. |
Df Residuals: | 504 | BIC: | 3581. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -17.6469 | 3.147 | -5.607 | 0.000 | -23.830 | -11.464 |
ptratio | 1.1520 | 0.169 | 6.801 | 0.000 | 0.819 | 1.485 |
Omnibus: | 568.053 | Durbin-Watson: | 0.905 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 34221.853 |
Skew: | 5.245 | Prob(JB): | 0.00 |
Kurtosis: | 41.899 | Cond. No. | 160. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # black
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.148 |
Model: | OLS | Adj. R-squared: | 0.147 |
Method: | Least Squares | F-statistic: | 87.74 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.49e-19 |
Time: | 16:42:20 | Log-Likelihood: | -1765.8 |
No. Observations: | 506 | AIC: | 3536. |
Df Residuals: | 504 | BIC: | 3544. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 16.5535 | 1.426 | 11.609 | 0.000 | 13.752 | 19.355 |
black | -0.0363 | 0.004 | -9.367 | 0.000 | -0.044 | -0.029 |
Omnibus: | 594.029 | Durbin-Watson: | 0.994 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 44041.935 |
Skew: | 5.578 | Prob(JB): | 0.00 |
Kurtosis: | 47.323 | Cond. No. | 1.49e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
next(models_iter).summary()  # lstat
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.208 |
Model: | OLS | Adj. R-squared: | 0.206 |
Method: | Least Squares | F-statistic: | 132.0 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.65e-27 |
Time: | 16:42:22 | Log-Likelihood: | -1747.5 |
No. Observations: | 506 | AIC: | 3499. |
Df Residuals: | 504 | BIC: | 3507. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -3.3305 | 0.694 | -4.801 | 0.000 | -4.694 | -1.968 |
lstat | 0.5488 | 0.048 | 11.491 | 0.000 | 0.455 | 0.643 |
Omnibus: | 601.306 | Durbin-Watson: | 1.182 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 49918.826 |
Skew: | 5.645 | Prob(JB): | 0.00 |
Kurtosis: | 50.331 | Cond. No. | 29.7 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # medv
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.151 |
Model: | OLS | Adj. R-squared: | 0.149 |
Method: | Least Squares | F-statistic: | 89.49 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 1.17e-19 |
Time: | 16:42:24 | Log-Likelihood: | -1765.0 |
No. Observations: | 506 | AIC: | 3534. |
Df Residuals: | 504 | BIC: | 3542. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 11.7965 | 0.934 | 12.628 | 0.000 | 9.961 | 13.632 |
medv | -0.3632 | 0.038 | -9.460 | 0.000 | -0.439 | -0.288 |
Omnibus: | 558.880 | Durbin-Watson: | 0.996 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 32740.044 |
Skew: | 5.108 | Prob(JB): | 0.00 |
Kurtosis: | 41.059 | Cond. No. | 64.5 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
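Rather than reading thirteen separate summaries, the slope estimate, its p-value and the R² of each single-predictor model can also be collected into one DataFrame for comparison. A minimal sketch, reusing the models dictionary and the pandas import from above (the variable name univariate_results is ours):
# gather slope, p-value and R-squared of every single-predictor model
univariate_results = pd.DataFrame(
    {name: {'coef': fit.params.iloc[1],
            'p_value': fit.pvalues.iloc[1],
            'r_squared': fit.rsquared}
     for name, fit in models.items()}
).T
# sort by explanatory power
univariate_results.sort_values('r_squared', ascending=False)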
Now we check which predictors are statistically significant, using the common threshold of p < 0.05.
for predictor, model in models.items():
    p_value = model.pvalues.iloc[1]  # p-value of the slope coefficient
    if p_value < 0.05:
        print("{} is statistically significant with p-value {}".format(predictor, p_value))
    else:
        print("{} is NOT statistically significant with p-value {}".format(predictor, p_value))
zn is statistically significant with p-value 5.506472107679307e-06
indus is statistically significant with p-value 1.4503489330272395e-21
chas is NOT statistically significant with p-value 0.2094345015352004
nox is statistically significant with p-value 3.751739260356923e-23
rm is statistically significant with p-value 6.346702984687839e-07
age is statistically significant with p-value 2.8548693502441573e-16
dis is statistically significant with p-value 8.519948766926326e-19
rad is statistically significant with p-value 2.6938443981864414e-56
tax is statistically significant with p-value 2.357126835257048e-47
ptratio is statistically significant with p-value 2.942922447359816e-11
black is statistically significant with p-value 2.487273973773734e-19
lstat is statistically significant with p-value 2.6542772314731968e-27
medv is statistically significant with p-value 1.1739870821943694e-19
All single-predictor models are statistically significant at the 5% level except the one using chas.
b. Full regression model
# build the formula for the full model: crim regressed on all other predictors
formula = 'crim ~ ' + ' + '.join(predictors)
# fit and store the full model
models['full'] = smf.ols(formula, data=boston).fit()
models['full'].summary()
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.454 |
Model: | OLS | Adj. R-squared: | 0.440 |
Method: | Least Squares | F-statistic: | 31.47 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 1.57e-56 |
Time: | 16:49:14 | Log-Likelihood: | -1653.3 |
No. Observations: | 506 | AIC: | 3335. |
Df Residuals: | 492 | BIC: | 3394. |
Df Model: | 13 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 17.0332 | 7.235 | 2.354 | 0.019 | 2.818 | 31.248 |
zn | 0.0449 | 0.019 | 2.394 | 0.017 | 0.008 | 0.082 |
indus | -0.0639 | 0.083 | -0.766 | 0.444 | -0.228 | 0.100 |
chas | -0.7491 | 1.180 | -0.635 | 0.526 | -3.068 | 1.570 |
nox | -10.3135 | 5.276 | -1.955 | 0.051 | -20.679 | 0.052 |
rm | 0.4301 | 0.613 | 0.702 | 0.483 | -0.774 | 1.634 |
age | 0.0015 | 0.018 | 0.081 | 0.935 | -0.034 | 0.037 |
dis | -0.9872 | 0.282 | -3.503 | 0.001 | -1.541 | -0.433 |
rad | 0.5882 | 0.088 | 6.680 | 0.000 | 0.415 | 0.761 |
tax | -0.0038 | 0.005 | -0.733 | 0.464 | -0.014 | 0.006 |
ptratio | -0.2711 | 0.186 | -1.454 | 0.147 | -0.637 | 0.095 |
black | -0.0075 | 0.004 | -2.052 | 0.041 | -0.015 | -0.000 |
lstat | 0.1262 | 0.076 | 1.667 | 0.096 | -0.023 | 0.275 |
medv | -0.1989 | 0.061 | -3.287 | 0.001 | -0.318 | -0.080 |
Omnibus: | 666.613 | Durbin-Watson: | 1.519 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 84887.625 |
Skew: | 6.617 | Prob(JB): | 0.00 |
Kurtosis: | 65.058 | Cond. No. | 1.58e+04 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.58e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Now we check which predictors are significant in the full model.
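One way to read this off programmatically is to filter the full model's p-values; a minimal sketch, assuming the models dictionary from above:
# predictors whose coefficients are significant at the 5% level in the full model
full_pvalues = models['full'].pvalues.drop('Intercept')
full_pvalues[full_pvalues < 0.05]
From the summary table above, only zn, dis, rad, black and medv fall below the 0.05 threshold in the full model, far fewer than in the single-predictor fits.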
TBC…