3. Linear Regression
Exercise 15: Regression models for the Boston Dataset
Preparing the data
import pandas as pd
boston = pd.read_csv('../../datasets/Boston.csv', index_col=0)
boston.head()
|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
| 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
boston.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 1 to 506
Data columns (total 14 columns):
crim 506 non-null float64
zn 506 non-null float64
indus 506 non-null float64
chas 506 non-null int64
nox 506 non-null float64
rm 506 non-null float64
age 506 non-null float64
dis 506 non-null float64
rad 506 non-null int64
tax 506 non-null int64
ptratio 506 non-null float64
black 506 non-null float64
lstat 506 non-null float64
medv 506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 59.3 KB
a. Regression models for each predictor
We want to predict the per capita crime rate crim from each of the other variables, one predictor at a time.
import statsmodels.formula.api as smf
# predictor names: every column except the response crim
predictors = boston.columns.drop('crim')
# fit one simple linear regression per predictor and store it by name
models = {}
for predictor in predictors:
    models[predictor] = smf.ols('crim ~ ' + predictor, data=boston).fit()
# iterate over the fitted models in predictor order (dicts preserve insertion order)
models_iter = iter(models.values())
next(models_iter).summary()  # zn
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.040 |
Model: | OLS | Adj. R-squared: | 0.038 |
Method: | Least Squares | F-statistic: | 21.10 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 5.51e-06 |
Time: | 16:41:34 | Log-Likelihood: | -1796.0 |
No. Observations: | 506 | AIC: | 3596. |
Df Residuals: | 504 | BIC: | 3604. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 4.4537 | 0.417 | 10.675 | 0.000 | 3.634 | 5.273 |
zn | -0.0739 | 0.016 | -4.594 | 0.000 | -0.106 | -0.042 |
Omnibus: | 567.443 | Durbin-Watson: | 0.857 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 32753.004 |
Skew: | 5.257 | Prob(JB): | 0.00 |
Kurtosis: | 40.986 | Cond. No. | 28.8 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # indus
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.165 |
Model: | OLS | Adj. R-squared: | 0.164 |
Method: | Least Squares | F-statistic: | 99.82 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 1.45e-21 |
Time: | 16:41:47 | Log-Likelihood: | -1760.6 |
No. Observations: | 506 | AIC: | 3525. |
Df Residuals: | 504 | BIC: | 3534. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -2.0637 | 0.667 | -3.093 | 0.002 | -3.375 | -0.753 |
indus | 0.5098 | 0.051 | 9.991 | 0.000 | 0.410 | 0.610 |
Omnibus: | 585.118 | Durbin-Watson: | 0.986 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 41418.938 |
Skew: | 5.449 | Prob(JB): | 0.00 |
Kurtosis: | 45.962 | Cond. No. | 25.1 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # chas
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.003 |
Model: | OLS | Adj. R-squared: | 0.001 |
Method: | Least Squares | F-statistic: | 1.579 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 0.209 |
Time: | 16:41:51 | Log-Likelihood: | -1805.6 |
No. Observations: | 506 | AIC: | 3615. |
Df Residuals: | 504 | BIC: | 3624. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 3.7444 | 0.396 | 9.453 | 0.000 | 2.966 | 4.523 |
chas | -1.8928 | 1.506 | -1.257 | 0.209 | -4.852 | 1.066 |
Omnibus: | 561.663 | Durbin-Watson: | 0.817 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 30645.429 |
Skew: | 5.191 | Prob(JB): | 0.00 |
Kurtosis: | 39.685 | Cond. No. | 3.96 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # nox
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.177 |
Model: | OLS | Adj. R-squared: | 0.176 |
Method: | Least Squares | F-statistic: | 108.6 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 3.75e-23 |
Time: | 16:41:58 | Log-Likelihood: | -1757.0 |
No. Observations: | 506 | AIC: | 3518. |
Df Residuals: | 504 | BIC: | 3526. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -13.7199 | 1.699 | -8.073 | 0.000 | -17.059 | -10.381 |
nox | 31.2485 | 2.999 | 10.419 | 0.000 | 25.356 | 37.141 |
Omnibus: | 591.712 | Durbin-Watson: | 0.992 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 43138.106 |
Skew: | 5.546 | Prob(JB): | 0.00 |
Kurtosis: | 46.852 | Cond. No. | 11.3 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # rm
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.048 |
Model: | OLS | Adj. R-squared: | 0.046 |
Method: | Least Squares | F-statistic: | 25.45 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 6.35e-07 |
Time: | 16:42:03 | Log-Likelihood: | -1793.9 |
No. Observations: | 506 | AIC: | 3592. |
Df Residuals: | 504 | BIC: | 3600. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 20.4818 | 3.364 | 6.088 | 0.000 | 13.872 | 27.092 |
rm | -2.6841 | 0.532 | -5.045 | 0.000 | -3.729 | -1.639 |
Omnibus: | 575.717 | Durbin-Watson: | 0.879 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 36658.093 |
Skew: | 5.345 | Prob(JB): | 0.00 |
Kurtosis: | 43.305 | Cond. No. | 58.4 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # age
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.124 |
Model: | OLS | Adj. R-squared: | 0.123 |
Method: | Least Squares | F-statistic: | 71.62 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.85e-16 |
Time: | 16:42:07 | Log-Likelihood: | -1772.7 |
No. Observations: | 506 | AIC: | 3549. |
Df Residuals: | 504 | BIC: | 3558. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -3.7779 | 0.944 | -4.002 | 0.000 | -5.633 | -1.923 |
age | 0.1078 | 0.013 | 8.463 | 0.000 | 0.083 | 0.133 |
Omnibus: | 574.509 | Durbin-Watson: | 0.956 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 36741.903 |
Skew: | 5.322 | Prob(JB): | 0.00 |
Kurtosis: | 43.366 | Cond. No. | 195. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # dis
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.144 |
Model: | OLS | Adj. R-squared: | 0.142 |
Method: | Least Squares | F-statistic: | 84.89 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 8.52e-19 |
Time: | 16:42:10 | Log-Likelihood: | -1767.0 |
No. Observations: | 506 | AIC: | 3538. |
Df Residuals: | 504 | BIC: | 3546. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 9.4993 | 0.730 | 13.006 | 0.000 | 8.064 | 10.934 |
dis | -1.5509 | 0.168 | -9.213 | 0.000 | -1.882 | -1.220 |
Omnibus: | 576.519 | Durbin-Watson: | 0.952 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 37426.729 |
Skew: | 5.348 | Prob(JB): | 0.00 |
Kurtosis: | 43.753 | Cond. No. | 9.32 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # rad
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.391 |
Model: | OLS | Adj. R-squared: | 0.390 |
Method: | Least Squares | F-statistic: | 323.9 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.69e-56 |
Time: | 16:42:13 | Log-Likelihood: | -1680.8 |
No. Observations: | 506 | AIC: | 3366. |
Df Residuals: | 504 | BIC: | 3374. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -2.2872 | 0.443 | -5.157 | 0.000 | -3.158 | -1.416 |
rad | 0.6179 | 0.034 | 17.998 | 0.000 | 0.550 | 0.685 |
Omnibus: | 656.459 | Durbin-Watson: | 1.337 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 75417.007 |
Skew: | 6.478 | Prob(JB): | 0.00 |
Kurtosis: | 61.389 | Cond. No. | 19.2 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # tax
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.340 |
Model: | OLS | Adj. R-squared: | 0.338 |
Method: | Least Squares | F-statistic: | 259.2 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.36e-47 |
Time: | 16:42:15 | Log-Likelihood: | -1701.4 |
No. Observations: | 506 | AIC: | 3407. |
Df Residuals: | 504 | BIC: | 3415. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -8.5284 | 0.816 | -10.454 | 0.000 | -10.131 | -6.926 |
tax | 0.0297 | 0.002 | 16.099 | 0.000 | 0.026 | 0.033 |
Omnibus: | 635.377 | Durbin-Watson: | 1.252 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 63763.835 |
Skew: | 6.156 | Prob(JB): | 0.00 |
Kurtosis: | 56.599 | Cond. No. | 1.16e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
next(models_iter).summary()  # ptratio
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.084 |
Model: | OLS | Adj. R-squared: | 0.082 |
Method: | Least Squares | F-statistic: | 46.26 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.94e-11 |
Time: | 16:42:17 | Log-Likelihood: | -1784.1 |
No. Observations: | 506 | AIC: | 3572. |
Df Residuals: | 504 | BIC: | 3581. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -17.6469 | 3.147 | -5.607 | 0.000 | -23.830 | -11.464 |
ptratio | 1.1520 | 0.169 | 6.801 | 0.000 | 0.819 | 1.485 |
Omnibus: | 568.053 | Durbin-Watson: | 0.905 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 34221.853 |
Skew: | 5.245 | Prob(JB): | 0.00 |
Kurtosis: | 41.899 | Cond. No. | 160. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # black
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.148 |
Model: | OLS | Adj. R-squared: | 0.147 |
Method: | Least Squares | F-statistic: | 87.74 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.49e-19 |
Time: | 16:42:20 | Log-Likelihood: | -1765.8 |
No. Observations: | 506 | AIC: | 3536. |
Df Residuals: | 504 | BIC: | 3544. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 16.5535 | 1.426 | 11.609 | 0.000 | 13.752 | 19.355 |
black | -0.0363 | 0.004 | -9.367 | 0.000 | -0.044 | -0.029 |
Omnibus: | 594.029 | Durbin-Watson: | 0.994 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 44041.935 |
Skew: | 5.578 | Prob(JB): | 0.00 |
Kurtosis: | 47.323 | Cond. No. | 1.49e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
next(models_iter).summary()  # lstat
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.208 |
Model: | OLS | Adj. R-squared: | 0.206 |
Method: | Least Squares | F-statistic: | 132.0 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 2.65e-27 |
Time: | 16:42:22 | Log-Likelihood: | -1747.5 |
No. Observations: | 506 | AIC: | 3499. |
Df Residuals: | 504 | BIC: | 3507. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | -3.3305 | 0.694 | -4.801 | 0.000 | -4.694 | -1.968 |
lstat | 0.5488 | 0.048 | 11.491 | 0.000 | 0.455 | 0.643 |
Omnibus: | 601.306 | Durbin-Watson: | 1.182 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 49918.826 |
Skew: | 5.645 | Prob(JB): | 0.00 |
Kurtosis: | 50.331 | Cond. No. | 29.7 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
next(models_iter).summary()  # medv
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.151 |
Model: | OLS | Adj. R-squared: | 0.149 |
Method: | Least Squares | F-statistic: | 89.49 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 1.17e-19 |
Time: | 16:42:24 | Log-Likelihood: | -1765.0 |
No. Observations: | 506 | AIC: | 3534. |
Df Residuals: | 504 | BIC: | 3542. |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 11.7965 | 0.934 | 12.628 | 0.000 | 9.961 | 13.632 |
medv | -0.3632 | 0.038 | -9.460 | 0.000 | -0.439 | -0.288 |
Omnibus: | 558.880 | Durbin-Watson: | 0.996 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 32740.044 |
Skew: | 5.108 | Prob(JB): | 0.00 |
Kurtosis: | 41.059 | Cond. No. | 64.5 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
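Rather than reading thirteen separate summaries, the slope estimate, its p-value and the R² of each single-predictor model can also be collected into one DataFrame for comparison. A minimal sketch, reusing the models dictionary and the pandas import from above (the variable name univariate_results is ours):
# gather slope, p-value and R-squared of every single-predictor model
univariate_results = pd.DataFrame(
    {name: {'coef': fit.params.iloc[1],
            'p_value': fit.pvalues.iloc[1],
            'r_squared': fit.rsquared}
     for name, fit in models.items()}
).T
# sort by explanatory power
univariate_results.sort_values('r_squared', ascending=False)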
Now we check which predictors are statistically significant, using the common threshold of p < 0.05.
for predictor, model in models.items():
    p_value = model.pvalues.iloc[1]  # p-value of the slope coefficient
    if p_value < 0.05:
        print("{} is statistically significant with p-value {}".format(predictor, p_value))
    else:
        print("{} is NOT statistically significant with p-value {}".format(predictor, p_value))
zn is statistically significant with p-value 5.506472107679307e-06
indus is statistically significant with p-value 1.4503489330272395e-21
chas is NOT statistically significant with p-value 0.2094345015352004
nox is statistically significant with p-value 3.751739260356923e-23
rm is statistically significant with p-value 6.346702984687839e-07
age is statistically significant with p-value 2.8548693502441573e-16
dis is statistically significant with p-value 8.519948766926326e-19
rad is statistically significant with p-value 2.6938443981864414e-56
tax is statistically significant with p-value 2.357126835257048e-47
ptratio is statistically significant with p-value 2.942922447359816e-11
black is statistically significant with p-value 2.487273973773734e-19
lstat is statistically significant with p-value 2.6542772314731968e-27
medv is statistically significant with p-value 1.1739870821943694e-19
All single-predictor models are statistically significant at the 5% level except the one using chas.
b. Full regression model
# build the formula for the full model: crim regressed on all other predictors
formula = 'crim ~ ' + ' + '.join(predictors)
# fit and store the full model
models['full'] = smf.ols(formula, data=boston).fit()
models['full'].summary()
OLS Regression Results
Dep. Variable: | crim | R-squared: | 0.454 |
Model: | OLS | Adj. R-squared: | 0.440 |
Method: | Least Squares | F-statistic: | 31.47 |
Date: | Sun, 11 Nov 2018 | Prob (F-statistic): | 1.57e-56 |
Time: | 16:49:14 | Log-Likelihood: | -1653.3 |
No. Observations: | 506 | AIC: | 3335. |
Df Residuals: | 492 | BIC: | 3394. |
Df Model: | 13 | | |
Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] |
Intercept | 17.0332 | 7.235 | 2.354 | 0.019 | 2.818 | 31.248 |
zn | 0.0449 | 0.019 | 2.394 | 0.017 | 0.008 | 0.082 |
indus | -0.0639 | 0.083 | -0.766 | 0.444 | -0.228 | 0.100 |
chas | -0.7491 | 1.180 | -0.635 | 0.526 | -3.068 | 1.570 |
nox | -10.3135 | 5.276 | -1.955 | 0.051 | -20.679 | 0.052 |
rm | 0.4301 | 0.613 | 0.702 | 0.483 | -0.774 | 1.634 |
age | 0.0015 | 0.018 | 0.081 | 0.935 | -0.034 | 0.037 |
dis | -0.9872 | 0.282 | -3.503 | 0.001 | -1.541 | -0.433 |
rad | 0.5882 | 0.088 | 6.680 | 0.000 | 0.415 | 0.761 |
tax | -0.0038 | 0.005 | -0.733 | 0.464 | -0.014 | 0.006 |
ptratio | -0.2711 | 0.186 | -1.454 | 0.147 | -0.637 | 0.095 |
black | -0.0075 | 0.004 | -2.052 | 0.041 | -0.015 | -0.000 |
lstat | 0.1262 | 0.076 | 1.667 | 0.096 | -0.023 | 0.275 |
medv | -0.1989 | 0.061 | -3.287 | 0.001 | -0.318 | -0.080 |
Omnibus: | 666.613 | Durbin-Watson: | 1.519 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 84887.625 |
Skew: | 6.617 | Prob(JB): | 0.00 |
Kurtosis: | 65.058 | Cond. No. | 1.58e+04 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.58e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Now we check which predictors are significant in the full model.
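One way to read this off programmatically is to filter the full model's p-values; a minimal sketch, assuming the models dictionary from above:
# predictors whose coefficients are significant at the 5% level in the full model
full_pvalues = models['full'].pvalues.drop('Intercept')
full_pvalues[full_pvalues < 0.05]
From the summary table above, only zn, dis, rad, black and medv fall below the 0.05 threshold in the full model, far fewer than in the single-predictor fits.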
TBC…