ISLR notes and exercises from An Introduction to Statistical Learning

3. Linear Regression

Exercise 11: The t-statistic for null hypothesis in simple linear regression with no intercept

Generate the data

First we generate paired data (x,y) according to

Y = 2X + \epsilon

import numpy as np

np.random.seed(1)
x = np.random.normal(size=100)
y = 2*x + np.random.normal(size=100)

a. Simple regression of y onto x with no intercept

import statsmodels.api as sm

model_1 = sm.OLS(y, x).fit()
model_1.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.798
Model:                            OLS   Adj. R-squared:                  0.796
Method:                 Least Squares   F-statistic:                     391.7
Date:                Sun, 04 Nov 2018   Prob (F-statistic):           3.46e-36
Time:                        13:56:12   Log-Likelihood:                -135.67
No. Observations:                 100   AIC:                             273.3
Df Residuals:                      99   BIC:                             275.9
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.1067      0.106     19.792      0.000       1.896       2.318
==============================================================================
Omnibus:                        0.880   Durbin-Watson:                   2.106
Prob(Omnibus):                  0.644   Jarque-Bera (JB):                0.554
Skew:                          -0.172   Prob(JB):                        0.758
Kurtosis:                       3.119   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The coefficient estimate is

model_1.params
array([2.10674169])

which is very close to the true value of 2.

The standard error is

model_1.bse
array([0.10644517])

The t-statistic is

model_1.tvalues
array([19.79180199])

which has p-value

model_1.pvalues
array([3.45737574e-36])

This is an extremely small p-value, so we have good grounds to reject the null hypothesis H_0: \beta = 0 in favour of the alternative hypothesis H_a: \beta \neq 0.
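
As a sanity check (a minimal sketch using the x and y generated above), the estimate can also be computed from the closed form \hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2 for regression without an intercept:

np.sum(x * y) / np.sum(x**2)  # should match model_1.params[0], about 2.107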

b. Simple regression of x onto y with no intercept

import statsmodels.api as sm

model_2 = sm.OLS(x, y).fit()
model_2.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.798
Model:                            OLS   Adj. R-squared:                  0.796
Method:                 Least Squares   F-statistic:                     391.7
Date:                Sun, 04 Nov 2018   Prob (F-statistic):           3.46e-36
Time:                        13:56:12   Log-Likelihood:                -49.891
No. Observations:                 100   AIC:                             101.8
Df Residuals:                      99   BIC:                             104.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.3789      0.019     19.792      0.000       0.341       0.417
==============================================================================
Omnibus:                        0.476   Durbin-Watson:                   2.166
Prob(Omnibus):                  0.788   Jarque-Bera (JB):                0.631
Skew:                           0.115   Prob(JB):                        0.729
Kurtosis:                       2.685   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The coefficient estimate is

model_2.params
array([0.37890442])

which is somewhat close to the true value of 0.5. It falls short of 0.5 because the noise \epsilon is part of y: the no-intercept regression of x onto y targets E[XY] / E[Y^2] = 2/5 = 0.4 rather than 1/2.

The standard error is

model_2.bse
array([0.01914451])

The t-statistic is

model_2.tvalues
array([19.79180199])

which has p-value

model_2.pvalues
array([3.45737574e-36])

This is identical to the p-value for model_1, so we again have grounds to reject the null hypothesis H_0: \beta = 0 in favour of the alternative hypothesis H_a: \beta \neq 0.

c. The relationship between the models

In both cases there is a linear relationship between the two variables. Since the first model has the form

Y = 2X + \epsilon

the second model has the form

X = \frac{1}{2}Y - \frac{1}{2}\epsilon

In both cases, the regression detected the linear relationship with high confidence, and found good estimates for the coefficient.

It may seem remarkable that the t-statistics are identical for both models: the first is

model_1.params[0]/model_1.bse[0]
19.79180198709121

while the second is

model_2.params[0]/model_2.bse[0]
19.79180198709121
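
One concrete way to see the relationship: the two fitted slopes are not reciprocals of one another. For no-intercept regressions their product equals (\sum_i x_i y_i)^2 / (\sum_i x_i^2 \sum_i y_i^2), which matches the R-squared of 0.798 reported in both summaries. A small check, reusing the fitted models from a. and b.:

model_1.params[0] * model_2.params[0]  # about 0.798, the shared R-squared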

d. A formula for the t-statistic

For the regression of y onto x with no intercept, the coefficient estimate and its standard error are

\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \qquad \mathrm{SE}(\hat{\beta}) = \sqrt{\frac{\sum_i (y_i - x_i \hat{\beta})^2}{(n - 1) \sum_i x_i^2}}

Substituting the first expression into t = \hat{\beta} / \mathrm{SE}(\hat{\beta}) and simplifying gives

t = \frac{\sqrt{n - 1} \, \sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2 - \left( \sum_i x_i y_i \right)^2}}

Now it makes sense!
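
As a quick numerical check (a minimal sketch; the helper name t_no_intercept is ours, not from the book), we can evaluate this expression on the x and y generated above and compare it with the value statsmodels reported:

import numpy as np

def t_no_intercept(x, y):
    # closed-form t-statistic for the no-intercept regression of y onto x
    n = len(x)
    sxy = np.sum(x * y)
    return np.sqrt(n - 1) * sxy / np.sqrt(np.sum(x**2) * np.sum(y**2) - sxy**2)

t_no_intercept(x, y)  # should reproduce model_1.tvalues[0], about 19.79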

e. Why the t-statistic is the same for both models

From d., we see that the t-statistic depends only on the sums \sum_i x_i y_i, \sum_i x_i^2, \sum_i y_i^2 and the sample size n, and it is symmetric under swapping x and y. Regressing y onto x and regressing x onto y therefore give exactly the same t-statistic.
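
Reusing the hypothetical t_no_intercept helper sketched in d., the symmetry can be checked directly:

t_no_intercept(y, x)  # identical to t_no_intercept(x, y), about 19.79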

f. Repeat for simple regression with an intercept

import statsmodels.formula.api as smf
import pandas as pd

df = pd.DataFrame({'x': x, 'y': y})

model_1 = smf.ols('y ~ x', data=df).fit()
model_2 = smf.ols('x ~ y', data=df).fit()
model_1.tvalues[1] == model_2.tvalues[1]  # compare the slope t-statistics (index 0 is the intercept)
True
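
With an intercept, the slope t-statistic can also be written in terms of the sample correlation r between x and y as t = r \sqrt{n - 2} / \sqrt{1 - r^2}, which is again symmetric in x and y; that is why the two slope t-statistics agree. A minimal sketch of this check (variable names are ours):

import numpy as np

r = np.corrcoef(x, y)[0, 1]              # sample correlation, symmetric in x and y
n = len(x)
r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # should match model_1.tvalues[1], the slope t-statistic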