ISLR notes and exercises from An Introduction to Statistical Learning

4. Logistic Regression

Conceptual Exercises

Exercise 1: Logistic function and logit representations are equivalent

Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In other words, the logistic function representation and logit representation for the logistic regression model are equivalent.

The logistic function representation is

$$
\begin{aligned}
p(X) &= \frac{\exp(\boldsymbol{\beta}^\top \mathbf{X})}{1 + \exp(\boldsymbol{\beta}^\top \mathbf{X})}
\end{aligned}
$$

so

$$
\begin{aligned}
1 - p(X) &= \frac{1}{1 + \exp(\boldsymbol{\beta}^\top \mathbf{X})}
\end{aligned}
$$

Thus

$$
\begin{aligned}
\frac{p(X)}{1 - p(X)} &= \exp(\boldsymbol{\beta}^\top \mathbf{X})
\end{aligned}
$$

and hence

$$
\begin{aligned}
\log\left(\frac{p(X)}{1-p(X)}\right) &= \boldsymbol{\beta}^\top \mathbf{X}
\end{aligned}
$$
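As a quick sanity check, here is a minimal numerical sketch (the coefficient and predictor values are arbitrary, chosen only for illustration) confirming that the logit of the logistic function recovers the linear predictor:

```python
import numpy as np

# arbitrary coefficients and predictors; the first entry of x is the intercept term
beta = np.array([-2.0, 0.5, 1.0])
x = np.array([1.0, 3.0, -1.5])

eta = beta @ x                        # linear predictor beta^T x
p = np.exp(eta) / (1 + np.exp(eta))   # logistic function representation
logit = np.log(p / (1 - p))           # logit representation

print(np.isclose(logit, eta))  # True
```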

Exercise 2: The Bayes Classifier maximizes the discriminant function

It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other words, under the assumption that the observations in the $k$th class are drawn from a $N(\mu_k, \sigma^2)$ distribution, the Bayes classifier assigns an observation to the class for which the discriminant function is maximized.

Under the normality assumption, we have

$$
\begin{aligned}
p_k(x) = \frac{\frac{\pi_k}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\left(x - \mu_k\right)^2\right)}{\sum_{l=1}^K \frac{\pi_l}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\left(x - \mu_l\right)^2\right)}
\end{aligned}
$$

Since $\log$ is an increasing function, maximizing $p_k(x)$ over all $k$ is equivalent to maximizing $\log(p_k(x))$ over all $k$. We have

$$
\begin{aligned}
\log(p_k(x)) &= \log\left(\frac{\pi_k}{\sqrt{2\pi}\,\sigma}\right) - \frac{1}{2\sigma^2}\left(x - \mu_k\right)^2 - \log\left(\sum_{l=1}^K \frac{\pi_l}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\left(x - \mu_l\right)^2\right)\right)
\end{aligned}
$$

The last term is a constant common to all the $p_k(x)$, so it won't affect the maximization and we can drop it. The remaining expression becomes

$$
\begin{aligned}
\log\left(\frac{\pi_k}{\sqrt{2\pi}\,\sigma}\right) - \frac{1}{2\sigma^2}\left(x - \mu_k\right)^2 &= \log(\pi_k) - \log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\left(x - \mu_k\right)^2\\
&= \log(\pi_k) - \log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\left(x^2 - 2\mu_k x + \mu_k^2\right)
\end{aligned}
$$

Again throwing out the terms which are independent of $k$, we get the discriminant function

$$
\delta_k(x) = x\left(\frac{\mu_k}{\sigma^2}\right) - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)
$$
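As a sanity check, here is a small numerical sketch (the class priors, means, and shared variance below are made up) confirming that maximizing the discriminant $\delta_k(x)$ selects the same class as maximizing the posterior $p_k(x)$:

```python
import numpy as np

# made-up parameters for K = 3 classes sharing a single variance
priors = np.array([0.3, 0.5, 0.2])
mus = np.array([-1.0, 0.0, 2.0])
sigma = 1.5

def posterior_numerator(x):
    # p_k(x) up to the common denominator, which does not affect the argmax
    return priors * np.exp(-(x - mus) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def discriminant(x):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x * mus / sigma ** 2 - mus ** 2 / (2 * sigma ** 2) + np.log(priors)

for x in np.linspace(-4, 4, 9):
    assert posterior_numerator(x).argmax() == discriminant(x).argmax()
print("argmax agrees at every test point")
```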

Exercise 3: Relaxing equal variance assumption in LDA, the Bayes Classifier becomes quadratic

This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class-specific mean vector and a class-specific covariance matrix. We consider the simple case where $p = 1$; i.e. there is only one feature.

Suppose that we have $K$ classes, and that if an observation belongs to the $k$th class then $X$ comes from a one-dimensional normal distribution, $X \sim N(\mu_k, \sigma_k^2)$. Recall that the density function for the one-dimensional normal distribution is given in (4.11). Prove that in this case, the Bayes classifier is not linear. Argue that it is in fact quadratic.

In the case that the variances are equal across classes (see Exercise 2), the Bayes decision boundary is the set of points $x$ where at least two of the discriminants are equal

$$
\begin{aligned}
\left\{ x\ \ \bigg\lvert\ \ \delta_k(x) = \delta_j(x)\ \ \text{for some}\ \ k \neq j\right\}
\end{aligned}
$$

for $1 \leqslant k, j \leqslant K$. The discriminants are linear in $x$, hence the decision boundary is linear.

In the case that the variances are not equal across classes, maximizing $p_k(x)$ over the classes still leads to the same expansion as above, which now becomes

$$
\begin{aligned}
\log(\pi_k) - \log(\sqrt{2\pi}\,\sigma_k) - \frac{1}{2\sigma_k^2}\left(x^2 - 2\mu_k x + \mu_k^2\right)
\end{aligned}
$$

Since all terms now depend on $k$, we can't throw any of them out. If we write $\delta_k(x)$ for this expression, the Bayes decision boundary is still the set of points where two discriminants are equal, but now the equations are quadratic in $x$.
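A small symbolic sketch (assuming the sympy library; the particular means, variances, and priors are arbitrary) showing that the $x^2$ terms cancel between two classes when the variances are equal but survive when they differ, so the boundary equation becomes quadratic:

```python
from sympy import symbols, log, sqrt, pi, expand

x = symbols('x')

def delta(mu, sigma, prior):
    # log-posterior for one class, up to the term common to all classes
    return log(prior) - log(sqrt(2 * pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

# equal variances: the x**2 terms cancel and the boundary equation is linear in x
d_equal = expand(delta(0, 1, 0.5) - delta(2, 1, 0.5))
# unequal variances: an x**2 term survives and the boundary equation is quadratic in x
d_unequal = expand(delta(0, 1, 0.5) - delta(2, 3, 0.5))

print(d_equal.coeff(x, 2))    # 0
print(d_unequal.coeff(x, 2))  # -4/9, i.e. nonzero
```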

Exercise 4: The Curse of Dimensionality

When the number of features $p$ is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when $p$ is large. We will now investigate this curse.

a.

Suppose that we have a set of observations, each with measurements on $p = 1$ feature, $X$. We assume that $X$ is uniformly (evenly) distributed on $[0, 1]$. Associated with each observation is a response value. Suppose that we wish to predict a test observation's response using only observations that are within 10% of the range of $X$ closest to that test observation. For instance, in order to predict the response for a test observation with $X = 0.6$, we will use observations in the range $[0.55, 0.65]$. On average, what fraction of the available observations will we use to make the prediction?

Let $(X_1, Y_1), \dots, (X_n, Y_n)$ denote the sample (i.e. the pairs are iid) and assume $X_i \sim \text{Uniform}(0,1)$.

Without loss of generality, assume $X_1 = x_1$ is the random variate corresponding to the test observation, and let $[a, b] \subseteq [0, 1]$ be the closed interval within 10% of the range of $x_1$ (see footnote 1). Define a random variable

$$
N = \text{the number of}\ X_i \in [a,b]
$$

Then

$$
N = \sum_{i = 1}^n I_{i, [a,b]}
$$

where $I_{i, [a,b]}$ is the indicator for the event $X_i \in [a,b]$. Since the $X_i$ are assumed iid uniform, each indicator has mean $\mathbb{P}(X_i \in [a,b]) = b - a$, so

$$
\begin{aligned}
\mathbb{E}(N) &= \sum_{i=1}^n \mathbb{P}(X_i \in [a,b])\\
&= n\int_{[a,b]} dx\\
&= 0.1n
\end{aligned}
$$

That is, on average we use 10% of the available observations.
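A quick Monte Carlo sketch (the sample size is arbitrary; the test point is the book's example $X = 0.6$) agreeing with the 10% figure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.uniform(0, 1, n)      # training observations
x_test = 0.6                  # the book's example test point
in_window = (x >= x_test - 0.05) & (x <= x_test + 0.05)

print(in_window.mean())       # approximately 0.1
```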

b.

Now suppose that we have a set of observations, each with measurements on $p = 2$ features, $X_1$ and $X_2$. We assume that $(X_1, X_2)$ are uniformly distributed on $[0, 1] \times [0, 1]$. We wish to predict a test observation's response using only observations that are within 10% of the range of $X_1$ and within 10% of the range of $X_2$ closest to that test observation. For instance, in order to predict the response for a test observation with $X_1 = 0.6$ and $X_2 = 0.35$, we will use observations in the range $[0.55, 0.65]$ for $X_1$ and in the range $[0.3, 0.4]$ for $X_2$. On average, what fraction of the available observations will we use to make the prediction?

Same as in a. Now let $[a_1, b_1]$ and $[a_2, b_2]$ be the corresponding intervals for $X_1$ and $X_2$. Then

$$
\begin{aligned}
\mathbb{E}(N) &= n\,\mathbb{P}\big((X_1, X_2) \in [a_1,b_1]\times [a_2,b_2]\big)\\
&= n\int_{[a_1,b_1]\times [a_2,b_2]} dx\\
&= 0.01n
\end{aligned}
$$

c.

Now suppose that we have a set of observations on $p = 100$ features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation's response using observations within the 10% of each feature's range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

In general, for $p$ predictors, we have

$$
\mathbb{E}(N) = 10^{-p}n
$$

So if $p = 100$, $\mathbb{E}(N) = 10^{-100}n$, which may as well be zero!
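To make the collapse concrete, here are the expected numbers of "nearby" training points for an (arbitrary) sample of one million observations:

```python
# expected number of training points within the local window, for a few values of p
n = 1_000_000
for p in (1, 2, 10, 100):
    print(p, 0.1 ** p * n)
# prints roughly 1e5, 1e4, 1e-4, 1e-94
```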

d.

Using your answers to parts a.–c., argue that a drawback of KNN when $p$ is large is that there are very few training observations "near" any given test observation.

$\mathbb{E}(N) \rightarrow 0$ exponentially fast as $p \rightarrow \infty$. Thus as $p$ grows, the only observation near a training point $x_i$ is typically $x_i$ itself, and the nearest neighbors that KNN uses for a test observation are typically far away from it.

e.

Now suppose that we wish to make a prediction for a test observation by creating a $p$-dimensional hypercube centered around the test observation that contains, on average, 10% of the training observations. For $p = 1, 2$, and 100, what is the length of each side of the hypercube? Comment on your answer.

In this case, we want a product of intervals $[a_1, b_1] \times \cdots \times [a_p, b_p]$, all of equal length $b - a$, whose $p$-dimensional volume is $0.1$. This gives $(b - a)^p = 0.1$, so $b - a = (0.1)^{1/p}$.

For $p = 1$ we have $b - a = 0.1$, for $p = 2$ we have $b - a \approx 0.316$, and for $p = 100$ we have $b - a \approx 0.977$.
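The side lengths can be computed directly:

```python
# side length of a p-dimensional hypercube containing 10% of the unit cube's volume
for p in (1, 2, 100):
    print(p, 0.1 ** (1 / p))
# prints roughly 0.1, 0.316, 0.977
```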

So if we fix the volume of the hypercube, we're forced to take larger and larger intervals around each predictor to guarantee the same fixed fraction of nearby points $\mathbb{E}(N) = 0.1n$.

It wasn't clear to me exactly why these growing intervals are a problem, but in ESL (see footnote 2) the authors state that such neighborhoods are "no longer local". Presumably locality is one of the chief advantages of KNN.

Thus the curse of dimensionality seems to be a tradeoff between sparseness and locality. Either we retain locality (fixed interval length) at the expense of sparseness (fewer and fewer points nearby), or we lose locality (growing interval length) to avoid sparseness (a fixed fraction of points nearby).

Exercise 5: Differences between LDA and QDA

a.

If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

We expect QDA to perform better on the training set, since the additional degrees of freedom make it easier to fit the noise.

We expect LDA to perform better on the test set, since the true decision boundary is linear and QDA would likely overfit.

b.

If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

We expect QDA to perform better on the training set, since the additional degrees of freedom make it easier to fit the noise.

We expect QDA to also perform better on the test set, since the decision boundary is non-linear.

c.

In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

As the sample size increases, we expect the prediction accuracy of QDA relative to LDA to improve. With more degrees of freedom, for fixed sample size, QDA has higher variance than LDA, but this difference decreases as the sample size increases.

d.

True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

False: consider the case of a small sample size in a., where QDA's extra variance can lead to a worse test error even though the true boundary is linear.
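A rough simulation sketch (assuming scikit-learn is available; the class means, shared covariance, and sample sizes below are made up) of the setting in a. and d.: two Gaussian classes share a covariance matrix, so the Bayes boundary is linear, and with a small training set QDA will typically match or beat LDA on training error while doing a bit worse on test error.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)

def sample(n):
    # two classes with different means but a shared (identity) covariance,
    # so the Bayes decision boundary is linear
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))
    return X, y

X_train, y_train = sample(50)      # small training set
X_test, y_test = sample(10_000)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train error:", round(1 - model.score(X_train, y_train), 3),
          "test error:", round(1 - model.score(X_test, y_test), 3))
```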

Exercise 6: Classifying A grades

Suppose we collect data for a group of students in a statistics class with variables $X_1 =$ hours studied, $X_2 =$ undergrad GPA, and $Y =$ receive an A. We fit a logistic regression and produce estimated coefficients $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$.

a.

Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.

Consider $Y$ as a binary variable, where $Y = 1$ if the student gets an A. We have estimated probability

$$
\begin{aligned}
\hat{P}(X_1, X_2) &= \hat{P}(Y = 1 \mid X)\\
&= \frac{e^{-6 + 0.05X_1 + X_2}}{1 + e^{-6 + 0.05X_1 + X_2}}
\end{aligned}
$$

so $\hat{P}(40, 3.5)$ is

```python
from math import e

# note that the coefficient on hours studied is 0.05, not 0.5
(e**(-6 + 0.05*40 + 3.5)) / (1 + e**(-6 + 0.05*40 + 3.5))
```

which evaluates to about 0.3775, i.e. roughly a 38% chance of getting an A.

b.

How many hours would the student in part a. need to study to have a 50 % chance of getting an A in the class?

For this use the logit representation

$$
\log\left(\frac{\hat{P}(X)}{1 - \hat{P}(X)}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2
$$

Since the student in a. has a GPA of 3.5, and a 50% chance corresponds to log-odds of $\log(1) = 0$, the desired number of hours is

```python
from math import log

# solve 0 = -6 + 0.05*X_1 + 3.5 for X_1
(log(1) + 6 - 3.5)/0.05  # 50.0
```

so the student would need to study 50 hours.
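Plugging 50 hours back into the estimated probability from a. confirms the 50% chance:

```python
from math import e

# linear predictor at X_1 = 50 hours, X_2 = 3.5 GPA
x = -6 + 0.05*50 + 3.5
print(e**x / (1 + e**x))  # 0.5
```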

Exercise 7: Classifying whether a stock will issue a dividend

Suppose that we wish to predict whether a given stock will issue a dividend this year ("Yes" or "No") based on $X$, last year's percent profit. We examine a large number of companies and discover that the mean value of $X$ for companies that issued a dividend was $\overline{X} = 10$, while the mean for those that didn't was $\overline{X} = 0$. In addition, the variance of $X$ for these two sets of companies was $\hat{\sigma}^2 = 36$. Finally, 80% of companies issued dividends. Assuming that $X$ follows a normal distribution (see footnote 3), predict the probability that a company will issue a dividend this year given that its percentage profit was $X = 4$ last year.

We'll let $Y$ be the binary random variable with $Y = 1$ if the stock issues a dividend and $Y = 0$ if not. From Bayes' theorem we have

$$
P(Y = k\ |\ X = x) = \frac{P(X = x\ |\ Y = k)P(Y = k)}{P(X = x\ |\ Y = 0)P(Y = 0) + P(X = x\ |\ Y = 1)P(Y = 1)}
$$

for $k \in \{0,1\}$ (see footnote 4), so the answer we're looking for is

$$
P(Y = 1\ |\ X = 4) = \frac{P(X = 4\ |\ Y = 1)P(Y = 1)}{P(X = 4\ |\ Y = 0)P(Y = 0) + P(X = 4\ |\ Y = 1)P(Y = 1)}
$$

Since we don't know the true probabilities here, we have to estimate them (knowing that we have sampled a large number of companies helps ensure that our estimates will be accurate).

We have $X \mid Y = k \sim N(\mu_k, \sigma_k^2)$ by assumption, so that

$$
P(X = x\ |\ Y = k) = \frac{1}{\sqrt{2\pi}\,\sigma_k}e^{-(x - \mu_k)^2/2\sigma_k^2}
$$

We estimate the population mean $\mu_k$ and variance $\sigma_k^2$ for class $k$ with the sample mean and variance for that class.

Furthermore, $P(Y = 1) \approx 0.80$, so $P(Y = 0) \approx 0.20$, and we have

```python
from scipy.stats import norm

# class-conditional normal densities at x = 4; scale is the standard deviation, sqrt(36) = 6
a, b = norm.pdf(4, loc=0, scale=6), norm.pdf(4, loc=10, scale=6)

(b * 0.8)/(a*0.2 + b*0.8)  # 0.7518524532975261
```

So $P(Y = 1\ |\ X = 4) \approx 0.75$.

Exercise 8: Average test and training error

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. $K = 1$) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

There is not enough information to say; it depends on the individual training and test error rates of the KNN model.

Since the train and test sets are the same size, the average error rate is

$$
e_{avg} = \frac{1}{2}(e_{train} + e_{test})
$$

So $e_{test} = 2e_{avg} - e_{train}$. If $e_{avg} = 0.18$, then $e_{test} = 0.36 - e_{train}$. Thus if $e_{train} < 0.06$ for the KNN model, it will have a higher $e_{test}$ than the logistic regression model, and in that case we would prefer the logistic model, as it performs better on unseen data.
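A quick check of this threshold for a few hypothetical KNN training error rates:

```python
# recover the KNN test error from the reported 18% average, for a few possible training errors
e_avg = 0.18
for e_train in (0.00, 0.05, 0.10):
    e_test = 2 * e_avg - e_train
    print(e_train, e_test, "prefer logistic" if e_test > 0.30 else "prefer KNN")
```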

Exercise 9: Odds

a.

On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

Since the odds are

$$
\frac{p}{1-p} = 0.37,
$$

we have $p = 0.37/1.37 \approx 0.27$.

b.

Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?

Since $p = 0.16$, the odds are $0.16/(1 - 0.16) \approx 0.19$.
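Both conversions as a short sketch:

```python
def odds_to_prob(odds):
    # p / (1 - p) = odds  =>  p = odds / (1 + odds)
    return odds / (1 + odds)

def prob_to_odds(p):
    return p / (1 - p)

print(odds_to_prob(0.37))  # roughly 0.27
print(prob_to_odds(0.16))  # roughly 0.19
```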

Footnotes

  1. $[a,b]$ is an interval with $b - a = 0.1$, centered at $x_1$ when possible: if $x_1 \in [0.05, 0.95]$, then $[a,b] = [x_1 - 0.05, x_1 + 0.05]$; if $x_1 \in [0, 0.05)$, then $[a,b] = [0, 0.1]$; and if $x_1 \in (0.95, 1]$, then $[a,b] = [0.9, 1]$.

  2. See section 2.5, page 22 of ESL 

  3. I believe the authors mean "conditionally normally distributed", i.e. that $X \mid Y = y$ is normally distributed, as in section 4.4.1.

  4. To avoid confusion, I've let the class index $k$ be the same as the values of $Y$. So instead of having values $Y = 0, 1$ and class numbers $1, 2$, the class numbers are also $0, 1$.