First, we generate paired data $(x, y)$ according to the model $y = 2x + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, 1)$:

```python
import numpy as np

np.random.seed(1)
x = np.random.normal(size=100)
y = 2 * x + np.random.normal(size=100)
```
Next, we regress $y$ onto $x$ with no intercept:

```python
import statsmodels.api as sm

model_1 = sm.OLS(y, x).fit()
model_1.summary()
```
Dep. Variable: | y | R-squared: | 0.798 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.796 |
Method: | Least Squares | F-statistic: | 391.7 |
Date: | Sun, 04 Nov 2018 | Prob (F-statistic): | 3.46e-36 |
Time: | 13:56:12 | Log-Likelihood: | -135.67 |
No. Observations: | 100 | AIC: | 273.3 |
Df Residuals: | 99 | BIC: | 275.9 |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
x1 | 2.1067 | 0.106 | 19.792 | 0.000 | 1.896 | 2.318 |
Omnibus: | 0.880 | Durbin-Watson: | 2.106 |
---|---|---|---|
Prob(Omnibus): | 0.644 | Jarque-Bera (JB): | 0.554 |
Skew: | -0.172 | Prob(JB): | 0.758 |
Kurtosis: | 3.119 | Cond. No. | 1.00 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficient estimate is

```python
model_1.params
# array([2.10674169])
```

which is very close to the true value of $2$.
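This matches the closed-form estimate for a regression through the origin, $\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2$. A quick check, reusing the `x` and `y` generated above:

```python
# Closed-form least-squares slope for a regression through the origin.
beta_hat = np.sum(x * y) / np.sum(x**2)
print(beta_hat)  # ~2.10674, matching model_1.params
```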
The standard error is

```python
model_1.bse
# array([0.10644517])
```
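This, too, has a closed form for the no-intercept fit, $\mathrm{SE}(\hat{\beta}) = \sqrt{\sum_i (y_i - x_i \hat{\beta})^2 / \big((n - 1) \sum_i x_i^2\big)}$, which we can verify directly (a sketch reusing the objects above):

```python
# Standard error of the no-intercept slope: sqrt(RSS / ((n - 1) * sum(x^2))).
beta_hat = np.sum(x * y) / np.sum(x**2)
rss = np.sum((y - beta_hat * x)**2)
print(np.sqrt(rss / ((len(x) - 1) * np.sum(x**2))))  # ~0.10645, matching model_1.bse
```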
The t-statistic is

```python
model_1.tvalues
# array([19.79180199])
```

which has p-value

```python
model_1.pvalues
# array([3.45737574e-36])
```
This is an extremely small p-value, so we have strong grounds to reject the null hypothesis $H_0\colon \beta = 0$ in favor of the alternative $H_a\colon \beta \neq 0$.
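As a sanity check, we can recompute the two-sided p-value from the t distribution with $n - 1 = 99$ degrees of freedom (a sketch using scipy, which isn't imported above):

```python
from scipy import stats

# Two-sided p-value for t = 19.7918 with 99 degrees of freedom.
print(2 * stats.t.sf(19.79180199, df=99))  # ~3.46e-36, matching model_1.pvalues
```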
Then we regress $x$ onto $y$, again with no intercept. (Note that statsmodels labels the response `y` and the regressor `x1` generically when given plain arrays, so the summary below still reads `y` and `x1` even though we are regressing `x` onto `y`.)

```python
model_2 = sm.OLS(x, y).fit()
model_2.summary()
```
Dep. Variable: | y | R-squared: | 0.798 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.796 |
Method: | Least Squares | F-statistic: | 391.7 |
Date: | Sun, 04 Nov 2018 | Prob (F-statistic): | 3.46e-36 |
Time: | 13:56:12 | Log-Likelihood: | -49.891 |
No. Observations: | 100 | AIC: | 101.8 |
Df Residuals: | 99 | BIC: | 104.4 |
Df Model: | 1 | | |
Covariance Type: | nonrobust | | |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
x1 | 0.3789 | 0.019 | 19.792 | 0.000 | 0.341 | 0.417 |
Omnibus: | 0.476 | Durbin-Watson: | 2.166 |
---|---|---|---|
Prob(Omnibus): | 0.788 | Jarque-Bera (JB): | 0.631 |
Skew: | 0.115 | Prob(JB): | 0.729 |
Kurtosis: | 2.685 | Cond. No. | 1.00 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficient estimate is

```python
model_2.params
# array([0.37890442])
```

which is somewhat close to the value of $1/2$ suggested by inverting $y = 2x$.
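In fact the estimate should not be expected to converge to $1/2$: for a no-intercept regression of $x$ onto $y$, the population slope is $\mathbb{E}[xy] / \mathbb{E}[y^2] = 2/5 = 0.4$, because the noise in $y$ attenuates the coefficient. A quick check, reusing `x` and `y`:

```python
# Closed-form no-intercept slope of x on y; its population analogue
# is E[xy] / E[y^2] = 2 / (4 + 1) = 0.4 under this generative model.
print(np.sum(x * y) / np.sum(y**2))  # ~0.3789, matching model_2.params
```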
The standard error is

```python
model_2.bse
# array([0.01914451])
```
The t-statistic is

```python
model_2.tvalues
# array([19.79180199])
```

which has p-value

```python
model_2.pvalues
# array([3.45737574e-36])
```
This is identical to the p-value for `model_1`, so we again reject the null hypothesis $H_0\colon \beta = 0$ in favor of the alternative $H_a\colon \beta \neq 0$.
In both cases there is a linear relationship between the two variables: the first model has the form $y = 2x + \varepsilon$, so the second has the form $x = \tfrac{1}{2}(y - \varepsilon)$. In both cases, the regression detected the linear relationship with high confidence and produced a good estimate of the coefficient.
It seems remarkable that the t-statistics are identical for the two models: the first is

```python
model_1.params[0] / model_1.bse[0]
# 19.79180198709121
```

while the second is

```python
model_2.params[0] / model_2.bse[0]
# 19.79180198709121
```
Now it makes sense! From part (d), we see that the t-statistic depends only on $x$, $y$, and the sample size, and it is symmetric under swapping $x$ and $y$.
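Concretely, part (d) gives the t-statistic for a regression through the origin as

$$
t = \frac{\sqrt{n - 1}\, \sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2 - \left(\sum_i x_i y_i\right)^2}},
$$

an expression unchanged by swapping $x$ and $y$. We can verify it numerically (a sketch reusing `x` and `y`):

```python
# t-statistic via the part-(d) formula; note it is symmetric in x and y.
n = len(x)
t_formula = np.sqrt(n - 1) * np.sum(x * y) / np.sqrt(
    np.sum(x**2) * np.sum(y**2) - np.sum(x * y)**2
)
print(t_formula)  # ~19.7918, matching both models
```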
Finally, we can confirm that the two t-statistics also agree when the regressions include an intercept (the formula API adds one by default):

```python
import statsmodels.formula.api as smf
import pandas as pd

df = pd.DataFrame({'x': x, 'y': y})
model_1 = smf.ols('y ~ x', data=df).fit()
model_2 = smf.ols('x ~ y', data=df).fit()
model_1.tvalues.iloc[1] == model_2.tvalues.iloc[1]  # slope t-statistics
# True
```