ISLR notes and exercises from An Introduction to Statistical Learning

4. Logistic Regression

Conceptual Exercises

Exercise 1: Logistic function and logit representations are equivalent

Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In other words, the logistic function representation and logit representation for the logistic regression model are equivalent.

The logistic function representation is

$$
\begin{aligned}
p(X) &= \frac{\exp(\boldsymbol{\beta}^\top \mathbf{X})}{1 + \exp(\boldsymbol{\beta}^\top \mathbf{X})}
\end{aligned}
$$

so

$$
\begin{aligned}
1 - p(X) &= \frac{1}{1 + \exp(\boldsymbol{\beta}^\top \mathbf{X})}
\end{aligned}
$$

Thus

$$
\begin{aligned}
\frac{p(X)}{1 - p(X)} &= \exp(\boldsymbol{\beta}^\top \mathbf{X})
\end{aligned}
$$

and hence

$$
\begin{aligned}
\log\left(\frac{p(X)}{1-p(X)}\right) &= \boldsymbol{\beta}^\top \mathbf{X}
\end{aligned}
$$
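As a quick sanity check, here is a minimal numerical sketch (the coefficient and predictor values are arbitrary, chosen only for illustration) confirming that the logit of the logistic function recovers the linear predictor:

```python
import numpy as np

# arbitrary coefficients and predictors; the first entry of x is the intercept term
beta = np.array([-2.0, 0.5, 1.0])
x = np.array([1.0, 3.0, -1.5])

eta = beta @ x                        # linear predictor beta^T x
p = np.exp(eta) / (1 + np.exp(eta))   # logistic function representation
logit = np.log(p / (1 - p))           # logit representation

print(np.isclose(logit, eta))  # True
```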

Exercise 2: The Bayes Classifier maximizes the discriminant function

It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other words, under the assumption that the observations in the $k$th class are drawn from a $N(\mu_k, \sigma^2)$ distribution, the Bayes classifier assigns an observation to the class for which the discriminant function is maximized.

Under the normality assumption, we have

$$
\begin{aligned}
p_k(x) = \frac{\frac{\pi_k}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\left(x - \mu_k\right)^2\right)}{\sum_{l=1}^K \frac{\pi_l}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\left(x - \mu_l\right)^2\right)}
\end{aligned}
$$

Since $\log$ is an increasing function, maximizing $p_k(x)$ over all $k$ is equivalent to maximizing $\log(p_k(x))$ over all $k$. We have

$$
\begin{aligned}
\log(p_k(x)) &= \log\left(\frac{\pi_k}{\sqrt{2\pi}\,\sigma}\right) - \frac{1}{2\sigma^2}\left(x - \mu_k\right)^2 - \log\left(\sum_{l=1}^K \frac{\pi_l}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\left(x - \mu_l\right)^2\right)\right)
\end{aligned}
$$

The last term is a constant common to all the $p_k(x)$, so it won't affect the maximization and we can drop it. The remaining expression becomes

$$
\begin{aligned}
\log\left(\frac{\pi_k}{\sqrt{2\pi}\,\sigma}\right) - \frac{1}{2\sigma^2}\left(x - \mu_k\right)^2 &= \log(\pi_k) - \log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\left(x - \mu_k\right)^2\\
&= \log(\pi_k) - \log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\left(x^2 - 2\mu_k x + \mu_k^2\right)
\end{aligned}
$$

Again throwing out the terms which are independent of $k$, we get the discriminant function

$$
\delta_k(x) = x\left(\frac{\mu_k}{\sigma^2}\right) - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)
$$
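As a sanity check, here is a small numerical sketch (the class priors, means, and shared variance below are made up) confirming that maximizing the discriminant $\delta_k(x)$ selects the same class as maximizing the posterior $p_k(x)$:

```python
import numpy as np

# made-up parameters for K = 3 classes sharing a single variance
priors = np.array([0.3, 0.5, 0.2])
mus = np.array([-1.0, 0.0, 2.0])
sigma = 1.5

def posterior_numerator(x):
    # p_k(x) up to the common denominator, which does not affect the argmax
    return priors * np.exp(-(x - mus) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def discriminant(x):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x * mus / sigma ** 2 - mus ** 2 / (2 * sigma ** 2) + np.log(priors)

for x in np.linspace(-4, 4, 9):
    assert posterior_numerator(x).argmax() == discriminant(x).argmax()
print("argmax agrees at every test point")
```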

Exercise 3: Relaxing equal variance assumption in LDA, the Bayes Classifier becomes quadratic

This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class-specific mean vector and a class-specific covariance matrix. We consider the simple case where $p = 1$; i.e. there is only one feature.

Suppose that we have $K$ classes, and that if an observation belongs to the $k$th class then $X$ comes from a one-dimensional normal distribution, $X \sim N(\mu_k, \sigma_k^2)$. Recall that the density function for the one-dimensional normal distribution is given in (4.11). Prove that in this case, the Bayes classifier is not linear. Argue that it is in fact quadratic.

In the case that the variances are equal across classes (see Exercise 2), the Bayes decision boundary is the set of points $x$ where at least two of the discriminants are equal

$$
\begin{aligned}
\left\{ x\ \ \bigg\lvert\ \ \delta_k(x) = \delta_j(x)\ \ \text{for some}\ \ k \neq j\right\}
\end{aligned}
$$

for $1 \leqslant k, j \leqslant K$. The discriminants are linear in $x$, hence the decision boundary is linear.

In the case that the variances are not equal across classes, maximizing $p_k(x)$ over the classes still leads to the same expansion as above, which now becomes

$$
\begin{aligned}
\log(\pi_k) - \log(\sqrt{2\pi}\,\sigma_k) - \frac{1}{2\sigma_k^2}\left(x^2 - 2\mu_k x + \mu_k^2\right)
\end{aligned}
$$

Since all terms now depend on $k$, we can't throw any of them out. If we write $\delta_k(x)$ for this expression, the Bayes decision boundary is still the set of points where two discriminants are equal, but now the equations are quadratic in $x$.
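A small symbolic sketch (assuming the sympy library; the particular means, variances, and priors are arbitrary) showing that the $x^2$ terms cancel between two classes when the variances are equal but survive when they differ, so the boundary equation becomes quadratic:

```python
from sympy import symbols, log, sqrt, pi, expand

x = symbols('x')

def delta(mu, sigma, prior):
    # log-posterior for one class, up to the term common to all classes
    return log(prior) - log(sqrt(2 * pi) * sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

# equal variances: the x**2 terms cancel and the boundary equation is linear in x
d_equal = expand(delta(0, 1, 0.5) - delta(2, 1, 0.5))
# unequal variances: an x**2 term survives and the boundary equation is quadratic in x
d_unequal = expand(delta(0, 1, 0.5) - delta(2, 3, 0.5))

print(d_equal.coeff(x, 2))    # 0
print(d_unequal.coeff(x, 2))  # -4/9, i.e. nonzero
```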

Exercise 4: The Curse of Dimensionality

When the number of features $p$ is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when $p$ is large. We will now investigate this curse.

a.

Suppose that we have a set of observations, each with measurements on $p = 1$ feature, $X$. We assume that $X$ is uniformly (evenly) distributed on $[0, 1]$. Associated with each observation is a response value. Suppose that we wish to predict a test observation's response using only observations that are within 10% of the range of $X$ closest to that test observation. For instance, in order to predict the response for a test observation with $X = 0.6$, we will use observations in the range $[0.55, 0.65]$. On average, what fraction of the available observations will we use to make the prediction?

Let $(X_1, Y_1), \dots, (X_n, Y_n)$ denote the sample (i.e. the pairs are iid) and assume $X_i \sim \text{Uniform}(0,1)$.

Without loss of generality, assume $X_1 = x_1$ is the random variate corresponding to the test observation, and let $[a, b] \subseteq [0, 1]$ be the closed interval within 10% of the range of $x_1$ (see footnote 1). Define a random variable

$$
N = \text{the number of}\ X_i \in [a,b]
$$

Then

$$
N = \sum_{i = 1}^n I_{i, [a,b]}
$$

where $I_{i, [a,b]}$ is the indicator for the event $X_i \in [a,b]$. Since the $X_i$ are assumed iid uniform, each indicator has mean $\mathbb{P}(X_i \in [a,b]) = b - a$, so

$$
\begin{aligned}
\mathbb{E}(N) &= \sum_{i=1}^n \mathbb{P}(X_i \in [a,b])\\
&= n\int_{[a,b]} dx\\
&= 0.1n
\end{aligned}
$$

That is, on average we use 10% of the available observations.
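A quick Monte Carlo sketch (the sample size is arbitrary; the test point is the book's example $X = 0.6$) agreeing with the 10% figure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.uniform(0, 1, n)      # training observations
x_test = 0.6                  # the book's example test point
in_window = (x >= x_test - 0.05) & (x <= x_test + 0.05)

print(in_window.mean())       # approximately 0.1
```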

b.

Now suppose that we have a set of observations, each with measurements on $p = 2$ features, $X_1$ and $X_2$. We assume that $(X_1, X_2)$ are uniformly distributed on $[0, 1] \times [0, 1]$. We wish to predict a test observation's response using only observations that are within 10% of the range of $X_1$ and within 10% of the range of $X_2$ closest to that test observation. For instance, in order to predict the response for a test observation with $X_1 = 0.6$ and $X_2 = 0.35$, we will use observations in the range $[0.55, 0.65]$ for $X_1$ and in the range $[0.3, 0.4]$ for $X_2$. On average, what fraction of the available observations will we use to make the prediction?

Same as in a. Now let $[a_1, b_1]$ and $[a_2, b_2]$ be the corresponding intervals for $X_1$ and $X_2$. Then

$$
\begin{aligned}
\mathbb{E}(N) &= n\,\mathbb{P}\big((X_1, X_2) \in [a_1,b_1]\times [a_2,b_2]\big)\\
&= n\int_{[a_1,b_1]\times [a_2,b_2]} dx\\
&= 0.01n
\end{aligned}
$$

c.

Now suppose that we have a set of observations on $p = 100$ features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation's response using observations within the 10% of each feature's range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

In general, for $p$ predictors, we have

$$
\mathbb{E}(N) = 10^{-p}n
$$

So if $p = 100$, $\mathbb{E}(N) = 10^{-100}n$, which may as well be zero!
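To make the collapse concrete, here are the expected numbers of "nearby" training points for an (arbitrary) sample of one million observations:

```python
# expected number of training points within the local window, for a few values of p
n = 1_000_000
for p in (1, 2, 10, 100):
    print(p, 0.1 ** p * n)
# prints roughly 1e5, 1e4, 1e-4, 1e-94
```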

d.

Using your answers to parts a.–c., argue that a drawback of KNN when $p$ is large is that there are very few training observations "near" any given test observation.

$\mathbb{E}(N) \rightarrow 0$ exponentially fast as $p \rightarrow \infty$. Thus as $p$ grows, the only observation near a training point $x_i$ is typically $x_i$ itself, and the nearest neighbors that KNN uses for a test observation are typically far away from it.

e.

Now suppose that we wish to make a prediction for a test observation by creating a $p$-dimensional hypercube centered around the test observation that contains, on average, 10% of the training observations. For $p = 1, 2$, and 100, what is the length of each side of the hypercube? Comment on your answer.

In this case, we want a product of intervals $[a_1, b_1] \times \cdots \times [a_p, b_p]$, all of equal length $b - a$, whose $p$-dimensional volume is $0.1$. This gives $(b - a)^p = 0.1$, so $b - a = (0.1)^{1/p}$.

For $p = 1$ we have $b - a = 0.1$, for $p = 2$ we have $b - a \approx 0.316$, and for $p = 100$ we have $b - a \approx 0.977$.
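The side lengths can be computed directly:

```python
# side length of a p-dimensional hypercube containing 10% of the unit cube's volume
for p in (1, 2, 100):
    print(p, 0.1 ** (1 / p))
# prints roughly 0.1, 0.316, 0.977
```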

So if we fix the volume of the hypercube, we're forced to take larger and larger intervals around each predictor to guarantee the same fixed fraction of nearby points $\mathbb{E}(N) = 0.1n$.

It wasn't clear to me exactly why these growing intervals are a problem, but in ESL (see footnote 2) the authors state that such neighborhoods are "no longer local". Presumably locality is one of the chief advantages of KNN.

Thus the curse of dimensionality seems to be a tradeoff between sparseness and locality. Either we retain locality (fixed interval length) at the expense of sparseness (fewer and fewer points nearby), or we lose locality (growing interval length) to avoid sparseness (a fixed fraction of points nearby).

Exercise 5: Differences between LDA and QDA

a.

If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

We expect QDA to perform better on the training set, since the additional degrees of freedom make it easier to fit the noise.

We expect LDA to perform better on the test set, since the true decision boundary is linear and QDA would likely overfit.

b.

If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

We expect QDA to perform better on the training set, since the additional degrees of freedom make it easier to fit the noise.

We expect QDA to also perform better on the test set, since the decision boundary is non-linear.

c.

In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

As the sample size increases, we expect the prediction accuracy of QDA relative to LDA to improve. With more degrees of freedom, for fixed sample size, QDA has higher variance than LDA, but this difference decreases as the sample size increases.

d.

True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

False: consider the case of a small sample size in a., where QDA's extra variance can lead to a worse test error even though the true boundary is linear.
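A rough simulation sketch (assuming scikit-learn is available; the class means, shared covariance, and sample sizes below are made up) of the setting in a. and d.: two Gaussian classes share a covariance matrix, so the Bayes boundary is linear, and with a small training set QDA will typically match or beat LDA on training error while doing a bit worse on test error.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)

def sample(n):
    # two classes with different means but a shared (identity) covariance,
    # so the Bayes decision boundary is linear
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))
    return X, y

X_train, y_train = sample(50)      # small training set
X_test, y_test = sample(10_000)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train error:", round(1 - model.score(X_train, y_train), 3),
          "test error:", round(1 - model.score(X_test, y_test), 3))
```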

Exercise 6: Classifying A grades

Suppose we collect data for a group of students in a statistics class with variables $X_1 =$ hours studied, $X_2 =$ undergrad GPA, and $Y =$ receive an A. We fit a logistic regression and produce estimated coefficients $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$.

a.

Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.

Consider $Y$ as a binary variable, where $Y = 1$ if the student gets an A. We have estimated probability

$$
\begin{aligned}
\hat{P}(X_1, X_2) &= \hat{P}(Y = 1 \mid X)\\
&= \frac{e^{-6 + 0.05X_1 + X_2}}{1 + e^{-6 + 0.05X_1 + X_2}}
\end{aligned}
$$

so $\hat{P}(40, 3.5)$ is

```python
from math import e

# note that the coefficient on hours studied is 0.05, not 0.5
(e**(-6 + 0.05*40 + 3.5)) / (1 + e**(-6 + 0.05*40 + 3.5))
```

which evaluates to about 0.3775, i.e. roughly a 38% chance of getting an A.

b.

How many hours would the student in part a. need to study to have a 50 % chance of getting an A in the class?

For this use the logit representation

$$
\log\left(\frac{\hat{P}(X)}{1 - \hat{P}(X)}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2
$$

Since the student in a. has a GPA of 3.5, and a 50% chance corresponds to log-odds of $\log(1) = 0$, the desired number of hours is

```python
from math import log

# solve 0 = -6 + 0.05*X_1 + 3.5 for X_1
(log(1) + 6 - 3.5)/0.05  # 50.0
```

so the student would need to study 50 hours.
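Plugging 50 hours back into the estimated probability from a. confirms the 50% chance:

```python
from math import e

# linear predictor at X_1 = 50 hours, X_2 = 3.5 GPA
x = -6 + 0.05*50 + 3.5
print(e**x / (1 + e**x))  # 0.5
```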

Exercise 7: Classifying whether a stock will issue a dividend

Suppose that we wish to predict whether a given stock will issue a dividend this year ("Yes" or "No") based on $X$, last year's percent profit. We examine a large number of companies and discover that the mean value of $X$ for companies that issued a dividend was $\overline{X} = 10$, while the mean for those that didn't was $\overline{X} = 0$. In addition, the variance of $X$ for these two sets of companies was $\hat{\sigma}^2 = 36$. Finally, 80% of companies issued dividends. Assuming that $X$ follows a normal distribution (see footnote 3), predict the probability that a company will issue a dividend this year given that its percentage profit was $X = 4$ last year.

We'll let $Y$ be the binary random variable with $Y = 1$ if the stock issues a dividend and $Y = 0$ if not. From Bayes' theorem we have

$$
P(Y = k\ |\ X = x) = \frac{P(X = x\ |\ Y = k)P(Y = k)}{P(X = x\ |\ Y = 0)P(Y = 0) + P(X = x\ |\ Y = 1)P(Y = 1)}
$$

for $k \in \{0,1\}$ (see footnote 4), so the answer we're looking for is

$$
P(Y = 1\ |\ X = 4) = \frac{P(X = 4\ |\ Y = 1)P(Y = 1)}{P(X = 4\ |\ Y = 0)P(Y = 0) + P(X = 4\ |\ Y = 1)P(Y = 1)}
$$

Since we don't know the true probabilities here, we have to estimate them (knowing that we have sampled a large number of companies helps ensure that our estimates will be accurate).

We have $X \mid Y = k \sim N(\mu_k, \sigma_k^2)$ by assumption, so that

$$
P(X = x\ |\ Y = k) = \frac{1}{\sqrt{2\pi}\,\sigma_k}e^{-(x - \mu_k)^2/2\sigma_k^2}
$$

We estimate the population mean $\mu_k$ and variance $\sigma_k^2$ for class $k$ with the sample mean and variance for that class.

Furthermore, $P(Y = 1) \approx 0.80$, so $P(Y = 0) \approx 0.20$, and we have

```python
from scipy.stats import norm

# class-conditional normal densities at x = 4; scale is the standard deviation, sqrt(36) = 6
a, b = norm.pdf(4, loc=0, scale=6), norm.pdf(4, loc=10, scale=6)

(b * 0.8)/(a*0.2 + b*0.8)  # 0.7518524532975261
```

So $P(Y = 1\ |\ X = 4) \approx 0.75$.

Exercise 8: Average test and training error

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. $K = 1$) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

There is not enough information to say; it depends on the individual training and test error rates of the KNN model.

Since the train and test sets are the same size, the average error rate is

$$
e_{avg} = \frac{1}{2}(e_{train} + e_{test})
$$

So $e_{test} = 2e_{avg} - e_{train}$. If $e_{avg} = 0.18$, then $e_{test} = 0.36 - e_{train}$. Thus if $e_{train} < 0.06$ for the KNN model, it will have a higher $e_{test}$ than the logistic regression model, and in that case we would prefer the logistic model, as it performs better on unseen data.
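A quick check of this threshold for a few hypothetical KNN training error rates:

```python
# recover the KNN test error from the reported 18% average, for a few possible training errors
e_avg = 0.18
for e_train in (0.00, 0.05, 0.10):
    e_test = 2 * e_avg - e_train
    print(e_train, e_test, "prefer logistic" if e_test > 0.30 else "prefer KNN")
```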

Exercise 9: Odds

a.

On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

Since the odds are

$$
\frac{p}{1-p} = 0.37,
$$

we have $p = 0.37/1.37 \approx 0.27$.

b.

Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?

Since $p = 0.16$, the odds are $0.16/(1 - 0.16) \approx 0.19$.
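Both conversions as a short sketch:

```python
def odds_to_prob(odds):
    # p / (1 - p) = odds  =>  p = odds / (1 + odds)
    return odds / (1 + odds)

def prob_to_odds(p):
    return p / (1 - p)

print(odds_to_prob(0.37))  # roughly 0.27
print(prob_to_odds(0.16))  # roughly 0.19
```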

Footnotes

  1. $[a,b]$ is an interval with $b - a = 0.1$, centered at $x_1$ when possible: if $x_1 \in [0.05, 0.95]$, then $[a,b] = [x_1 - 0.05, x_1 + 0.05]$; if $x_1 \in [0, 0.05)$, then $[a,b] = [0, 0.1]$; and if $x_1 \in (0.95, 1]$, then $[a,b] = [0.9, 1]$.

  2. See section 2.5, page 22 of ESL 

  3. I believe the authors mean "conditionally normally distributed", i.e. that $X \mid Y = y$ is normally distributed, as in section 4.4.1.

  4. To avoid confusion, I've let the class index $k$ be the same as the values of $Y$. So instead of having values $Y = 0, 1$ and class numbers $1, 2$, the class numbers are also $0, 1$.