islr notes and exercises from An Introduction to Statistical Learning

3. Linear Regression

Conceptual Exercises

Exercise 1: Hypothesis testing the coefficients from the model in 3.2.1

Table 3.4 has the following information

import pandas as pd

data = {'Coefficient': [2.939, 0.046, 0.189, -0.001],
        'Std. Error': [0.3119, 0.0014, 0.0086, 0.0059],
        't-statistic': [9.42, 32.81, 21.89, -0.18],
        'p-value': [0.0001, 0.0001, 0.0001, 0.8599]}
index = ['Intercept', 'TV', 'radio', 'newspaper']
table = pd.DataFrame(data, index=index)
table.head()
           Coefficient  Std. Error  t-statistic  p-value
Intercept        2.939      0.3119         9.42   0.0001
TV               0.046      0.0014        32.81   0.0001
radio            0.189      0.0086        21.89   0.0001
newspaper       -0.001      0.0059        -0.18   0.8599

The p-values for the Intercept, TV, and radio coefficients are small enough to reject the null hypotheses for these coefficients, thus accepting the alternative hypotheses that

  • In the absence of any ad spending, about 2,939 units will be sold
  • For each $1,000 increase in TV ad spending another 46 units will sell
  • For each $1,000 increase in radio ad spending another 189 units will sell

The p-value for newspaper is very high, so in that case we retain the null hypothesis that there is no relationship between newspaper ad spending and sales.
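
As a sanity check, the t-statistics and p-values in Table 3.4 can be recovered (up to rounding) from the coefficients and standard errors. A minimal sketch, assuming the Advertising data's n = 200 observations and p = 3 predictors, i.e. 196 residual degrees of freedom:

from scipy import stats

# t-statistic = coefficient / standard error; two-sided p-value from the t distribution.
# Uses the `table` DataFrame built above; n = 200 is the assumed size of the Advertising data.
df_resid = 200 - 3 - 1
t_stats = table['Coefficient'] / table['Std. Error']
p_values = 2 * stats.t.sf(t_stats.abs(), df_resid)
print(t_stats.round(2))
print(p_values.round(4))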

Exercise 2: Differences between KNN classifier and regression methods

Both methods look at a neighborhood N_0 consisting of the K points nearest a point x_0.

The KNN classifier estimates the conditional probability of class j as the fraction of observations in the neighborhood of x_0 that are in class j. It then predicts y_0 to be the class that maximizes this probability (mimicking the Bayes classifier). Thus, the KNN classifier predicts a qualitative response (a discrete, finite-valued random variable).

The KNN regression method, however, predicts the response value y_0 as the average \overline{y} of the responses of all observations in the neighborhood N_0. Thus, KNN regression predicts a quantitative response (a continuous random variable).
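
To make the contrast concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier and KNeighborsRegressor on toy data (the data and the choice K = 3 are arbitrary and only for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# toy 1-d feature with a qualitative and a quantitative response
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])              # class labels
y_reg = np.array([1.1, 1.9, 3.2, 9.8, 11.1, 12.2])  # numeric responses
x0 = np.array([[2.5]])

# classifier: estimates class probabilities as the fraction of the K neighbors in each class
knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(knn_clf.predict(x0), knn_clf.predict_proba(x0))

# regressor: predicts the average response of the K neighbors
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)
print(knn_reg.predict(x0))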

Exercise 3: Regressing Salary on GPA, IQ, and Gender

We have predictors

\begin{aligned}
X_1 &= GPA\\
X_2 &= IQ\\
X_3 &= Gender\\
X_4 &= X_1 X_2\\
X_5 &= X_1 X_3
\end{aligned}

And response

Y = Salary

where Salary means salary after graduation in units of $1,000.

Ordinary least squares (OLS) gives the estimated coefficients

\begin{aligned}
\hat{\beta}_0 &= 50\\
\hat{\beta}_1 &= 20\\
\hat{\beta}_2 &= 0.07\\
\hat{\beta}_3 &= 35\\
\hat{\beta}_4 &= 0.01\\
\hat{\beta}_5 &= -10
\end{aligned}
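
Written out, the fitted model is Salary = 50 + 20 GPA + 0.07 IQ + 35 Gender + 0.01 GPA*IQ - 10 GPA*Gender (in thousands of dollars). A small helper (the name predict_salary is just for illustration; Gender is coded 1 for female and 0 for male, as in part a. below) makes the calculations below easy to reproduce:

# fitted model from the coefficient estimates above
def predict_salary(gpa, iq, gender):
    return 50 + 20*gpa + 0.07*iq + 35*gender + 0.01*gpa*iq - 10*gpa*gender

predict_salary(3.0, 110, 1), predict_salary(3.0, 110, 0)  # roughly (126.0, 121.0), as in part a.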

a.

For fixed values of IQ and GPA, X_1 and X_2 are fixed and X_3 is variable. In this case we can write the model

\hat{Y} = a + b X_3

where we have absorbed the fixed values into

\begin{aligned}
a &= \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_4 X_1 X_2\\
b &= \hat{\beta}_3 + \hat{\beta}_5 X_1
\end{aligned}

For simplicity, assume values of 3.0 for GPA and 110 for IQ. Then we find

a = 50 + 20*3.0 + 0.07*110 + 0.01*3.0*110
b = 35 + -10*3.0
a, b
(121.0, 5.0)

So for the i-th person, the model predicts

\hat{y}_i = 121 + 5x_{i3}

and since X_3 is an indicator variable, the model predicts a salary of

\hat{y}_i = \begin{cases}
126 & \text{if the } i\text{-th person is female}\\
121 & \text{if the } i\text{-th person is male}
\end{cases}

so for a fixed GPA and IQ, (answer ii) is correct.

NB: the predicted salaries here depend on our assumed values for GPA and IQ. In particular, the gap b = 35 - 10 \cdot GPA depends on the assumed GPA.

b.

a = 50 + 20*4.0 + 0.07*110 + 0.01*4.0*110
b = 35 + -10*4.0
print("The predicted salary for a female with an IQ of 110 and a GPA of 4.0 is ${} thousand".format(a + b))
The predicted salary for a female with an IQ of 110 and a GPA of 4.0 is $137.1 thousand

c.

The magnitude of a coefficient describes the magnitude of the effect, but not the evidence for it, which comes from the p-value of the hypothesis test for the coefficient. It’s possible to have a large coefficient but weak evidence (large p-value) and conversely a small coefficient but strong evidence (small p-value).

So as stated, the answer here is false (that is, it’s false as a conditional statement).
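
A tiny numerical illustration of this point (the coefficient/standard-error pairs and the 100 degrees of freedom are made up, not taken from the GPA/IQ model): a small coefficient with a tiny standard error can carry overwhelming evidence, while a much larger coefficient with a large standard error does not.

from scipy import stats

# hypothetical (coefficient, standard error) pairs; 100 residual degrees of freedom assumed
for coef, se in [(0.01, 0.001), (500.0, 400.0)]:
    t = coef / se
    p = 2 * stats.t.sf(abs(t), df=100)
    print("coef={}, t={:.2f}, p={:.4f}".format(coef, t, p))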

Exercise 4: Linear Vs. Non-linear Train and Test RSS

a.

If by “expect” we mean averaged over many datasets, then the answer is that the linear model should have a lower train RSS.

b.

Same answer as a.

c.

If the true relationship is non-linear but unknown, then we don’t have enough information to say. For example, the relationship could be “close to linear” (e.g. quadratic with an extremely small X^2 coefficient, or piecewise linear), in which case on average we would expect better performance from the linear model. Or it could be a polynomial of degree 3 or greater, in which case we’d expect the cubic model to perform better.

Testing my answers

In this section, I’m going to use simulation to test my answers.

# setup
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from tqdm import tqdm_notebook
# generate coefficients for random linear function
def random_true_linear():
    coeffs = 20e3 * np.random.random_sample((2,)) - 10e3
    def f(x):
        return coeffs[1] + coeffs[0] * x
    return f
# generate n data points according to linear relationship
def gen_data(n_sample, true_f):
    # initialize df with uniformly random input
    df = pd.DataFrame({'x': 20e3 * np.random.random_sample((n_sample,)) - 10e3})
    # add linear outputs and noise
    df['y'] = df['x'].map(true_f) + np.random.normal(size=n_sample)**3
    # return df
    return df
# get test and train RSS from linear and cubic models from random linear function
def test_run(n_sample):
    # random linear function
    true_linear = random_true_linear()

    # generate train and test data
    train, test = gen_data(n_sample, true_linear), gen_data(n_sample, true_linear)

    # fit models
    linear_model = smf.ols('y ~ x', data=train).fit()
    cubic_model = smf.ols('y ~ x + I(x**2) + I(x**3)', data=train).fit()

    # get train RSSes
    linear_train_RSS, cubic_train_RSS = (linear_model.resid**2).sum(), (cubic_model.resid**2).sum()

    # get test RSSes
    linear_test_RSS = ((linear_model.predict(exog=test['x']) - test['y'])**2).sum()
    cubic_test_RSS =  ((cubic_model.predict(exog=test['x']) - test['y'])**2).sum()

    # create df and add test and train RSS
    df = pd.DataFrame(columns=pd.MultiIndex.from_product([['linear', 'cubic'], ['train', 'test']]))
    df.loc[0] = np.array([linear_train_RSS, linear_test_RSS, cubic_train_RSS, cubic_test_RSS])  # order matches the (linear, cubic) x (train, test) columns

    return df
# sample size, number of tests
n_sample, n_tests = 100, 1000

# dataframe for results
results = pd.DataFrame(columns=pd.MultiIndex.from_product([['linear', 'cubic'], ['train', 'test']]))

# iterate
for i in tqdm_notebook(range(n_tests)):
    results = results.append(test_run(n_sample), ignore_index=True)
results.head()
        linear                      cubic
         train         test         train         test
0  1586.209472  2746.762638   2990.943079  1490.689694
1   691.867975   567.483756    586.850022   687.043502
2  3006.376688  1306.298281   1337.048066  2981.928496
3  1159.712782  2207.674753   2211.288146  1124.359569
4  1091.879593  1695.663868   1726.829296  1084.274033
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

plt.figure(figsize=(10,10))
plt.subplot(2, 1, 1)
sns.distplot(results.linear.train, label="Linear train RSS")
sns.distplot(results.linear.test, label="Linear test RSS")
plt.legend()

plt.subplot(2, 1, 2)
sns.distplot(results.cubic.train, label="Cubic train RSS")
sns.distplot(results.cubic.test, label="Cubic test RSS")
plt.legend()

[Figure: distributions of train and test RSS for the linear model (top) and the cubic model (bottom)]

Not what I expected!

  • For the linear model, the test RSS is greater than the train RSS
  • For the cubic model, the train RSS is greater than the test RSS
  • The linear train RSS is less than the cubic train RSS
  • The linear test RSS is greater than the cubic test RSS

In light of these results, I don’t know how to answer this question.

TODO

  • keep prediction results and use sklearn to evaluate performance
  • visualize (diagnostic plots) to make sure data is really linear

Exercise 5: The fitted values are linear combinations of the response values.

We have

\begin{aligned}
\hat{y}_i &= x_i\hat{\beta}\\
&= x_i\left(\frac{\sum_{i'} x_{i'}y_{i'}}{\sum_{j} x_{j}^2}\right)\\
&= \frac{1}{\sum_{j} x_{j}^2}\left(\sum_{i'} x_i x_{i'}y_{i'}\right)\\
&= \sum_{i'}\left(\frac{x_i x_{i'}}{\sum_{j} x_{j}^2}\right)y_{i'}\\
&= \sum_{i'} a_{i'}y_{i'}
\end{aligned}

where

a_{i'} = \frac{x_i x_{i'}}{\sum_{j} x_{j}^2}
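
A quick numerical check of this identity for regression through the origin (the no-intercept model \hat{y}_i = x_i\hat{\beta} used in this exercise); the simulated data is arbitrary:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

# least squares without an intercept: beta_hat = sum_i x_i y_i / sum_j x_j^2
beta_hat = (x * y).sum() / (x**2).sum()
y_hat = beta_hat * x

# weights a[i, i'] = x_i x_i' / sum_j x_j^2, so that y_hat = a @ y
a = np.outer(x, x) / (x**2).sum()
print(np.allclose(y_hat, a @ y))  # True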

Exercise 6: Least squares line in simple regression passes through the sample means

Equation (3.4) in the text shows that, in simple regression, the coefficients satisfy

\hat{\beta}_0 = \overline{y} - \hat{\beta}_1\overline{x}

The least squares line is y = \hat{\beta}_0 + \hat{\beta}_1 x, and by (3.4),

\overline{y} = \hat{\beta}_0 + \hat{\beta}_1 \overline{x}

which means (\overline{x}, \overline{y}) is on the line.
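
A quick check with statsmodels on simulated data (the simulation itself is arbitrary): evaluating the fitted line at \overline{x} recovers \overline{y}.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 1 + 2 * df['x'] + rng.normal(size=100)

fit = smf.ols('y ~ x', data=df).fit()
b0, b1 = fit.params['Intercept'], fit.params['x']

# the least squares line evaluated at x-bar equals y-bar
print(np.isclose(b0 + b1 * df['x'].mean(), df['y'].mean()))  # True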

Exercise 7: R^2 = r^2 in simple regression

We have

\begin{aligned}
R^2 &= \frac{TSS - RSS}{TSS}\\
&= \frac{\sum_i (y_i - \overline{y})^2 - \sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \overline{y})^2}\\
&= \frac{\sum_i (y_i - \overline{y})^2 - \sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{\sum_i (y_i - \overline{y})^2}
\end{aligned}

Since

\hat{\beta}_0 = \overline{y} - \hat{\beta}_1\overline{x}

the right-hand term in the numerator of the last expression can be rewritten as

\begin{aligned}
\sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 &= \sum_i \left(y_i - \overline{y} - \hat{\beta}_1(x_i - \overline{x})\right)^2\\
&= \sum_i \left[(y_i - \overline{y})^2 - 2\hat{\beta}_1(y_i - \overline{y})(x_i - \overline{x}) + \hat{\beta}_1^2(x_i - \overline{x})^2\right]
\end{aligned}

Substituting this back into the expression for R^2, we find

\begin{aligned}
R^2 &= \frac{\sum_i \left[2\hat{\beta}_1(y_i - \overline{y})(x_i - \overline{x}) - \hat{\beta}_1^2(x_i - \overline{x})^2\right]}{\sum_i (y_i - \overline{y})^2}\\
&= \frac{1}{\sum_i (y_i - \overline{y})^2}\left(2\hat{\beta}_1\left(\sum_i (y_i - \overline{y})(x_i - \overline{x})\right) - \hat{\beta}_1^2\left(\sum_i(x_i - \overline{x})^2\right)\right)
\end{aligned}

Since

\hat{\beta}_1 = \frac{\sum_j (x_j - \overline{x})(y_j - \overline{y})}{\sum_j(x_j - \overline{x})^2}

we can substitute this expression for \hat{\beta}_1 into the previous equation to get

\begin{aligned}
R^2 &= \frac{1}{\sum_i (y_i - \overline{y})^2}\left(2\left(\frac{\sum_j (x_j - \overline{x})(y_j - \overline{y})}{\sum_j(x_j - \overline{x})^2}\right)\left(\sum_i (y_i - \overline{y})(x_i - \overline{x})\right) - \left(\frac{\sum_j (x_j - \overline{x})(y_j - \overline{y})}{\sum_j(x_j - \overline{x})^2}\right)^2\left(\sum_i(x_i - \overline{x})^2\right)\right)\\
&= \frac{1}{\sum_i (y_i - \overline{y})^2 \sum_j(x_j - \overline{x})^2}\left(2\sum_i\sum_j (x_j - \overline{x})(y_j - \overline{y})(y_i - \overline{y})(x_i - \overline{x}) - \sum_i\sum_j (x_j - \overline{x})(y_j - \overline{y})(y_i - \overline{y})(x_i - \overline{x})\right)\\
&= \frac{\sum_i\sum_j(x_i - \overline{x})(y_i - \overline{y})(x_j - \overline{x})(y_j - \overline{y})}{\sum_i (x_i - \overline{x})^2\sum_i (y_i - \overline{y})^2}\\
&= \left(\frac{\sum_i(x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_i (x_i - \overline{x})^2}\sqrt{\sum_i (y_i - \overline{y})^2}}\right)^2\\
&= r^2
\end{aligned}
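
A numerical sanity check of this result on simulated data (the simulation is arbitrary): the R^2 reported by OLS equals the squared sample correlation between x and y.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.normal(size=200)})
df['y'] = 3 - 2 * df['x'] + rng.normal(size=200)

fit = smf.ols('y ~ x', data=df).fit()
r = df['x'].corr(df['y'])               # sample correlation r
print(np.isclose(fit.rsquared, r**2))   # True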