ISLR notes and exercises from An Introduction to Statistical Learning

3. Linear Regression

Exercise 11: The t-statistic for null hypothesis in simple linear regression with no intercept

Generate the data

First we generate paired data (x,y) according to

Y = 2X + \epsilon

import numpy as np

np.random.seed(1)
x = np.random.normal(size=100)
y = 2*x + np.random.normal(size=100)

a. Simple regression of y onto x with no intercept

import statsmodels.api as sm

model_1 = sm.OLS(y, x).fit()
model_1.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.798
Model:                            OLS   Adj. R-squared:                  0.796
Method:                 Least Squares   F-statistic:                     391.7
Date:                Sun, 04 Nov 2018   Prob (F-statistic):           3.46e-36
Time:                        13:56:12   Log-Likelihood:                -135.67
No. Observations:                 100   AIC:                             273.3
Df Residuals:                      99   BIC:                             275.9
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.1067      0.106     19.792      0.000       1.896       2.318
==============================================================================
Omnibus:                        0.880   Durbin-Watson:                   2.106
Prob(Omnibus):                  0.644   Jarque-Bera (JB):                0.554
Skew:                          -0.172   Prob(JB):                        0.758
Kurtosis:                       3.119   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The coefficient estimate is

model_1.params
array([2.10674169])

which is very close to the true value of 2.

The standard error is

model_1.bse
array([0.10644517])

The t-statistic is

model_1.tvalues
array([19.79180199])

which has p-value

model_1.pvalues
array([3.45737574e-36])

This is an extremely small p-value, so we have good grounds to reject the null hypothesis H_0: \beta = 0 in favour of the alternative hypothesis H_a: \beta \neq 0.
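
As a sanity check (a minimal sketch using the x and y generated above), the estimate can also be computed from the closed form \hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2 for regression without an intercept:

np.sum(x * y) / np.sum(x**2)  # should match model_1.params[0], about 2.107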

b. Simple regression of x onto y with no intercept

import statsmodels.api as sm

model_2 = sm.OLS(x, y).fit()
model_2.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.798
Model:                            OLS   Adj. R-squared:                  0.796
Method:                 Least Squares   F-statistic:                     391.7
Date:                Sun, 04 Nov 2018   Prob (F-statistic):           3.46e-36
Time:                        13:56:12   Log-Likelihood:                -49.891
No. Observations:                 100   AIC:                             101.8
Df Residuals:                      99   BIC:                             104.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.3789      0.019     19.792      0.000       0.341       0.417
==============================================================================
Omnibus:                        0.476   Durbin-Watson:                   2.166
Prob(Omnibus):                  0.788   Jarque-Bera (JB):                0.631
Skew:                           0.115   Prob(JB):                        0.729
Kurtosis:                       2.685   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The coefficient estimate is

model_2.params
array([0.37890442])

which is somewhat close to the true value of 0.5. It falls short of 0.5 because the noise \epsilon is part of y: the no-intercept regression of x onto y targets E[XY] / E[Y^2] = 2/5 = 0.4 rather than 1/2.

The standard error is

model_2.bse
array([0.01914451])

The t-statistic is

model_2.tvalues
array([19.79180199])

which has p-value

model_2.pvalues
array([3.45737574e-36])

This is identical to the p-value for model_1, so we again have grounds to reject the null hypothesis H_0: \beta = 0 in favour of the alternative hypothesis H_a: \beta \neq 0.

c. The relationship between the models

In both cases there is a linear relationship between the two variables. Since the first model has the form

Y = 2X + \epsilon

the second model has the form

X = \frac{1}{2}Y - \frac{1}{2}\epsilon

In both cases, the regression detected the linear relationship with high confidence, and found good estimates for the coefficient.

It may seem remarkable that the t-statistics are identical for both models: the first is

model_1.params[0]/model_1.bse[0]
19.79180198709121

while the second is

model_2.params[0]/model_2.bse[0]
19.79180198709121
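
One concrete way to see the relationship: the two fitted slopes are not reciprocals of one another. For no-intercept regressions their product equals (\sum_i x_i y_i)^2 / (\sum_i x_i^2 \sum_i y_i^2), which matches the R-squared of 0.798 reported in both summaries. A small check, reusing the fitted models from a. and b.:

model_1.params[0] * model_2.params[0]  # about 0.798, the shared R-squared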

d. A formula for the t-statistic

For the regression of y onto x with no intercept, the coefficient estimate and its standard error are

\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \qquad \mathrm{SE}(\hat{\beta}) = \sqrt{\frac{\sum_i (y_i - x_i \hat{\beta})^2}{(n - 1) \sum_i x_i^2}}

Substituting the first expression into t = \hat{\beta} / \mathrm{SE}(\hat{\beta}) and simplifying gives

t = \frac{\sqrt{n - 1} \, \sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2 - \left( \sum_i x_i y_i \right)^2}}

Now it makes sense!
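
As a quick numerical check (a minimal sketch; the helper name t_no_intercept is ours, not from the book), we can evaluate this expression on the x and y generated above and compare it with the value statsmodels reported:

import numpy as np

def t_no_intercept(x, y):
    # closed-form t-statistic for the no-intercept regression of y onto x
    n = len(x)
    sxy = np.sum(x * y)
    return np.sqrt(n - 1) * sxy / np.sqrt(np.sum(x**2) * np.sum(y**2) - sxy**2)

t_no_intercept(x, y)  # should reproduce model_1.tvalues[0], about 19.79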

e. Why the t-statistic is the same for both models

From d., we see that the t-statistic depends only on the sums \sum_i x_i y_i, \sum_i x_i^2, \sum_i y_i^2 and the sample size n, and it is symmetric under swapping x and y. Regressing y onto x and regressing x onto y therefore give exactly the same t-statistic.
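
Reusing the hypothetical t_no_intercept helper sketched in d., the symmetry can be checked directly:

t_no_intercept(y, x)  # identical to t_no_intercept(x, y), about 19.79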

f. Repeat for simple regression with an intercept

import statsmodels.formula.api as smf
import pandas as pd

df = pd.DataFrame({'x': x, 'y': y})

model_1 = smf.ols('y ~ x', data=df).fit()
model_2 = smf.ols('x ~ y', data=df).fit()
model_1.tvalues[1] == model_2.tvalues[1]  # compare the slope t-statistics (index 0 is the intercept)
True
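
With an intercept, the slope t-statistic can also be written in terms of the sample correlation r between x and y as t = r \sqrt{n - 2} / \sqrt{1 - r^2}, which is again symmetric in x and y; that is why the two slope t-statistics agree. A minimal sketch of this check (variable names are ours):

import numpy as np

r = np.corrcoef(x, y)[0, 1]              # sample correlation, symmetric in x and y
n = len(x)
r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # should match model_1.tvalues[1], the slope t-statistic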