Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In other words, the logistic function representation and logit representation for the logistic regression model are equivalent.
The logistic function representation (4.2) is
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},$$
so
$$\big(1 + e^{\beta_0 + \beta_1 X}\big)\, p(X) = e^{\beta_0 + \beta_1 X}.$$
Thus
$$p(X) = \big(1 - p(X)\big)\, e^{\beta_0 + \beta_1 X},$$
and hence
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X},$$
which is the logit representation (4.3). Each step is reversible, so the two representations are equivalent.
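As a quick numerical sanity check (the coefficient values here are arbitrary), we can confirm that the two representations agree:

```python
import numpy as np

# Arbitrary illustrative coefficients
b0, b1 = -6.0, 0.5
x = np.linspace(-10, 10, 101)

p = np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))  # logistic form, (4.2)
odds = np.exp(b0 + b1 * x)                           # odds form, (4.3)

# p/(1 - p) should equal exp(b0 + b1*x) everywhere
assert np.allclose(p / (1 - p), odds)
```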
It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other words, under the assumption that the observations in the $k$th class are drawn from a $N(\mu_k, \sigma^2)$ distribution, the Bayes’ classifier assigns an observation to the class for which the discriminant function is maximized.
Under the normality assumption, we have
$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}. \tag{1}$$
Since $\log$ is an increasing function, maximising $p_k(x)$ over all $k$ is equivalent to maximising $\log p_k(x)$ over all $k$. We have
$$\log p_k(x) = \log \pi_k - \log\!\big(\sqrt{2\pi}\,\sigma\big) - \frac{(x - \mu_k)^2}{2\sigma^2} - \log \sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right). \tag{2}$$
The last term is a constant common to all the classes $k$, so it won’t affect the maximization and we can get rid of it. The right-hand side of (2) then becomes
$$\log \pi_k - \log\!\big(\sqrt{2\pi}\,\sigma\big) - \frac{x^2 - 2x\mu_k + \mu_k^2}{2\sigma^2}. \tag{3}$$
Again throwing out the terms which are independent of $k$, namely $-\log\!\big(\sqrt{2\pi}\,\sigma\big)$ and $-\frac{x^2}{2\sigma^2}$, we get the discriminant function
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k, \tag{4}$$
which is (4.13). Since each step preserved the argmax, the class maximizing (4.12) is the class maximizing (4.13).
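As a quick numerical check that nothing was lost along the way (the three-class parameters below are made up), the class maximizing the posterior $p_k(x)$ in (1) is always the class maximizing the discriminant $\delta_k(x)$ in (4):

```python
import numpy as np
from scipy.stats import norm

# Made-up three-class problem with a shared variance
mu = np.array([-1.0, 0.0, 3.0])
sigma = 1.5
pi = np.array([0.2, 0.5, 0.3])

def posterior(x):
    """p_k(x) from (1): priors times normal densities, normalized."""
    num = pi * norm.pdf(x, loc=mu, scale=sigma)
    return num / num.sum()

def delta(x):
    """delta_k(x) from (4)."""
    return x * mu / sigma**2 - mu**2 / (2 * sigma**2) + np.log(pi)

for x in np.linspace(-5.0, 7.0, 49):
    assert posterior(x).argmax() == delta(x).argmax()
```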
This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class-specific mean vector and a class-specific covariance matrix. We consider the simple case where $p = 1$; i.e. there is only one feature.
Suppose that we have $K$ classes, and that if an observation belongs to the $k$th class then $X$ comes from a one-dimensional normal distribution, $X \sim N(\mu_k, \sigma_k^2)$. Recall that the density function for the one-dimensional normal distribution is given in (4.11). Prove that in this case, the Bayes’ classifier is not linear. Argue that it is in fact quadratic.
In the case that the variances are equal across classes (see Exercise 2), the Bayes decision boundary is then the set of points where at least two of the discriminants are equal:
$$\delta_k(x) = \delta_j(x) \tag{5}$$
for some pair $k \neq j$. The discriminants (4) are linear in $x$, hence the decision boundary is linear.
In the case that the variances are not equal across classes, maximizing $\log p_k(x)$ over the classes still leads to (3) above, which becomes
$$\log \pi_k - \log\!\big(\sqrt{2\pi}\,\sigma_k\big) - \frac{x^2 - 2x\mu_k + \mu_k^2}{2\sigma_k^2}. \tag{6}$$
Since all terms now depend on $k$, we can’t throw out any of them. If we use the discriminant notation $\delta_k(x)$ for (6), the Bayes decision boundary is still (5), but now the equations are quadratic in $x$.
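To make the quadratic boundary concrete, here’s a small sketch (parameters made up): with unequal variances the difference of two discriminants is a genuine quadratic in $x$, so the boundary between two classes generically consists of two points rather than one.

```python
import numpy as np

# Made-up two-class example with unequal variances
mu, sig, pi = np.array([0.0, 2.0]), np.array([1.0, 2.0]), np.array([0.5, 0.5])

def delta(x, k):
    """Quadratic discriminant (6); terms depending on k are all retained."""
    return np.log(pi[k]) - np.log(sig[k]) - (x - mu[k])**2 / (2 * sig[k]**2)

x = np.linspace(-6, 8, 2001)
diff = delta(x, 0) - delta(x, 1)       # a quadratic function of x
crossings = x[:-1][np.diff(np.sign(diff)) != 0]
print(crossings)                       # two sign changes: two boundary points
```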
When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when $p$ is large. We will now investigate this curse.
Suppose that we have a set of observations, each with measurements on $p = 1$ feature, $X$. We assume that $X$ is uniformly (evenly) distributed on $[0, 1]$. Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within 10% of the range of $X$ closest to that test observation. For instance, in order to predict the response for a test observation with $X = 0.6$, we will use observations in the range $[0.55, 0.65]$. On average, what fraction of the available observations will we use to make the prediction?
Let $(X_i, Y_i)_{i=1}^n$ denote the sample (i.e. the pairs $(X_i, Y_i)$ are iid) and assume $X_i \sim \mathrm{Uniform}[0, 1]$.

Without loss of generality, assume $X_0$ is the random variate corresponding to the test observation, and let $I = I(X_0)$ be the closed interval within 10% of the range of $X$ closest to $X_0$.[1] Define a random variable
$$Z = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{X_i \in I},$$
the fraction of the available observations used to make the prediction. Then
$$\mathbb{E}[Z \mid X_0] = \mathbb{P}(X_1 \in I \mid X_0) = |I(X_0)|,$$
where $\mathbf{1}_{X_i \in I}$ is the indicator for the event $X_i \in I$. Since the $X_i$ are assumed iid uniform, averaging over the test observation gives
$$\mathbb{E}[Z] = \mathbb{E}\,|I(X_0)| = \int_0^1 |I(x)|\, dx = 0.0975,$$
so on average we use 9.75% of the available observations.
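A quick Monte Carlo check: since $X$ is uniform, the expected fraction of points landing in $I(X_0)$ equals the expected length of $I(X_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(size=1_000_000)   # simulated test observations

# I(x0) is truncated at the boundary of [0, 1], as in footnote [1]
lo = np.maximum(x0 - 0.05, 0.0)
hi = np.minimum(x0 + 0.05, 1.0)
print((hi - lo).mean())            # ≈ 0.0975
```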
Now suppose that we have a set of observations, each with measurements on $p = 2$ features, $X_1$ and $X_2$. We assume that $(X_1, X_2)$ are uniformly distributed on $[0, 1] \times [0, 1]$. We wish to predict a test observation’s response using only observations that are within 10% of the range of $X_1$ and within 10% of the range of $X_2$ closest to that test observation. For instance, in order to predict the response for a test observation with $X_1 = 0.6$ and $X_2 = 0.35$, we will use observations in the range $[0.55, 0.65]$ for $X_1$ and in the range $[0.3, 0.4]$ for $X_2$. On average, what fraction of the available observations will we use to make the prediction?
Same as a. Now let $I_1, I_2$ be the corresponding intervals for $X_1, X_2$. Then, since the two features are independent and uniform,
$$\mathbb{E}[Z] = \mathbb{E}\,|I_1| \cdot \mathbb{E}\,|I_2| = (0.0975)^2 \approx 0.0095,$$
i.e. just under 1% of the available observations.
Now suppose that we have a set of observations on $p = 100$ features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10% of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?
In general, for $p$ predictors, we have
$$\mathbb{E}[Z] = (0.0975)^p.$$
So if $p = 100$, $\mathbb{E}[Z] = (0.0975)^{100} \approx 8 \times 10^{-102}$, which may as well be zero!
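The collapse is easy to tabulate:

```python
# Expected fraction of usable observations for p features
for p in (1, 2, 100):
    print(p, 0.0975 ** p)
# p = 1:   0.0975
# p = 2:   ≈ 0.0095
# p = 100: ≈ 8e-102
```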
Using your answers to parts a.–c., argue that a drawback of KNN when $p$ is large is that there are very few training observations “near” any given test observation.
$\mathbb{E}[Z] \to 0$ exponentially as $p \to \infty$. Thus as $p$ grows, KNN will find that the only observation “nearby” each point is the point itself, with exponentially growing probability.
Now suppose that we wish to make a prediction for a test observation by creating a $p$-dimensional hypercube centered around the test observation that contains, on average, 10% of the training observations. For $p = 1$, $2$, and $100$, what is the length of each side of the hypercube? Comment on your answer.
In this case, we want a side length $\ell$ such that the $p$-dimensional volume is $0.1$. Since the sides are all of equal length $\ell$, we have $\ell^p = 0.1$, so $\ell = (0.1)^{1/p}$.

For $p = 1$, we have $\ell = 0.1$; for $p = 2$, we have $\ell = (0.1)^{1/2} \approx 0.32$; and for $p = 100$, we have $\ell = (0.1)^{1/100} \approx 0.98$.
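And the corresponding side lengths:

```python
# Side length of a hypercube containing 10% of the observations on average
for p in (1, 2, 100):
    print(p, round(0.1 ** (1 / p), 3))
# p = 1:   0.1
# p = 2:   0.316
# p = 100: 0.977
```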
So if we fix the volume of the hypercube, we’re forced to take larger and larger intervals around each predictor to guarantee the same fixed fraction (10%) of nearby points.
It wasn’t clear to me at first exactly why these growing intervals are a problem, but in ESL[2] the authors state that such neighborhoods are “no longer local”. Presumably “locality” is one of the chief advantages of KNN.
Thus the curse of dimensionality seems to be a tradeoff between sparseness and locality. Either we retain locality (fixed interval length) at the expense of sparseness (fewer and fewer points nearby), or we lose locality (growing interval length) to avoid sparseness (fixed fraction of points nearby).
If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
We expect QDA to perform better on the training set, since its additional degrees of freedom make it easier to fit the noise.
We expect LDA to perform better on the test set, since the true decision boundary is linear (QDA would likely overfit).
If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
We expect QDA to perform better on the training set, since its additional degrees of freedom make it easier to fit the noise.
We expect QDA to also perform better on the test set, since the decision boundary is non-linear.
In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
As the sample size $n$ increases, we expect the test prediction accuracy of QDA relative to LDA to improve. With more degrees of freedom, QDA has higher variance than LDA for a fixed sample size, but this difference decreases as the sample size increases.
True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.
False. Consider the small-sample case in a.: with few observations, QDA’s higher variance can outweigh its flexibility, so LDA will typically achieve the better test error rate when the true boundary is linear.
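These expectations are easy to check empirically. Here is a small simulation (the class means and sample sizes are made up) using scikit-learn’s LDA and QDA on data whose Bayes boundary is linear; with a small training set QDA’s extra variance tends to hurt its test accuracy, while with a large one the two perform comparably:

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

def sample(n):
    # Two Gaussian classes with a shared covariance: the Bayes boundary is linear
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)
    return X, y

X_test, y_test = sample(20_000)
for n in (20, 20_000):
    X_train, y_train = sample(n)
    for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
        acc = model.fit(X_train, y_train).score(X_test, y_test)
        print(f"n={n:>6}  {type(model).__name__}: {acc:.3f}")
```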
Suppose we collect data for a group of students in a statistics class with variables $X_1$ = hours studied, $X_2$ = undergrad GPA, and $Y$ = receive an A. We fit a logistic regression and produce estimated coefficients
$$\hat\beta_0 = -6, \quad \hat\beta_1 = 0.05, \quad \hat\beta_2 = 1.$$
Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.
Consider $Y$ as a binary variable, with $Y = 1$ if the student gets an A. We have estimated probability
$$\hat p(X_1, X_2) = \frac{e^{-6 + 0.05 X_1 + X_2}}{1 + e^{-6 + 0.05 X_1 + X_2}},$$
so $\hat p(40, 3.5)$ is
```python
from math import e

(e**(-6 + 0.05*40 + 3.5)) / (1 + e**(-6 + 0.05*40 + 3.5))
# 0.3775406687981454
```
How many hours would the student in part a. need to study to have a 50% chance of getting an A in the class?
For this, use the logit representation
$$\log\!\left(\frac{\hat p}{1 - \hat p}\right) = -6 + 0.05 X_1 + X_2.$$
Since the student in a. has a GPA of 3.5 and we want $\hat p = 0.5$, the desired number of hours is
$$X_1 = \frac{\log(1) + 6 - 3.5}{0.05} = 50.$$
```python
from math import log

(log(1) + 6 - 3.5) / 0.05
# 50.0
```
Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”) based on $X$, last year’s percent profit. We examine a large number of companies and discover that the mean value of $X$ for companies that issued a dividend was $\bar X = 10$, while the mean for those that didn’t was $\bar X = 0$. In addition, the variance of $X$ for these two sets of companies was $\hat\sigma^2 = 36$. Finally, 80% of companies issued dividends. Assuming that $X$ follows a normal distribution,[3] predict the probability that a company will issue a dividend this year given that its percentage profit was $X = 4$ last year.
We’ll let $Y$ be the binary random variable with $Y = 1$ if the stock issues a dividend, and $Y = 0$ if not. From Bayes’ theorem we have
$$\mathbb{P}(Y = y \mid X = x) = \frac{\pi_y f_y(x)}{\pi_0 f_0(x) + \pi_1 f_1(x)}$$
for $y \in \{0, 1\}$,[4] where $\pi_y = \mathbb{P}(Y = y)$ and $f_y$ is the conditional density of $X$ given $Y = y$, so the answer we’re looking for is $\mathbb{P}(Y = 1 \mid X = 4)$.
Since we don’t know the true probabilities here, we have to estimate them (knowing we have sampled a large number of companies helps assure that our estimates will be accurate).
We have $X \mid Y = y \sim N(\mu_y, \sigma^2)$ by assumption, so that
$$f_y(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu_y)^2}{2\sigma^2}\right).$$
We estimate the population mean and deviation for class $y$ with the sample mean and deviation for that class: $\hat\mu_1 = 10$, $\hat\mu_0 = 0$, and $\hat\sigma = 6$.
Furthermore, $\hat\pi_1 = \widehat{\mathbb{P}}(Y = 1) = 0.8$ and $\hat\pi_0 = 0.2$, so we have
$$\widehat{\mathbb{P}}(Y = 1 \mid X = 4) = \frac{0.8\,\hat f_1(4)}{0.2\,\hat f_0(4) + 0.8\,\hat f_1(4)}.$$
```python
from scipy.stats import norm

# Estimated densities of X | Y=0 ~ N(0, 36) and X | Y=1 ~ N(10, 36) at x = 4
a, b = norm.pdf(4, loc=0, scale=6), norm.pdf(4, loc=10, scale=6)
(b * 0.8) / (a * 0.2 + b * 0.8)
# 0.7518524532975261
```
So $\widehat{\mathbb{P}}(Y = 1 \mid X = 4) \approx 0.75$.
Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. $K = 1$) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?
It depends on the train/test error split for the KNN model, which we can pin down.

Since the train and test sets are the same size, the average error rate is
$$\text{err}_{\text{avg}} = \frac{\text{err}_{\text{train}} + \text{err}_{\text{test}}}{2},$$
so $\text{err}_{\text{test}} = 2\,\text{err}_{\text{avg}} - \text{err}_{\text{train}}$. If $\text{err}_{\text{avg}} = 0.18$, then $\text{err}_{\text{test}} = 0.36 - \text{err}_{\text{train}}$, so if $\text{err}_{\text{train}} < 0.06$ for the KNN model, it will have higher test error than the logistic regression model. And for $K = 1$ the training error rate is $0$, since each training observation is its own nearest neighbor, giving a KNN test error of $36\%$. We would therefore prefer the logistic model, as it performs better on unseen data.
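In code:

```python
err_avg = 0.18            # 1-NN error averaged over train and test
err_train_knn = 0.0       # K = 1: each training point is its own nearest neighbor
err_test_knn = 2 * err_avg - err_train_knn
print(err_test_knn)       # 0.36, worse than logistic regression's 0.30
```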
On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
Since the odds are
$$\frac{p(X)}{1 - p(X)} = 0.37,$$
we have
$$p(X) = \frac{0.37}{1.37} \approx 0.27,$$
so on average about 27% of such people will in fact default.
Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?
Since $p(X) = 0.16$, the odds are
$$\frac{p(X)}{1 - p(X)} = \frac{0.16}{0.84} \approx 0.19.$$
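Both conversions in code:

```python
odds = 0.37
print(odds / (1 + odds))   # ≈ 0.27: fraction of such people who default

p = 0.16
print(p / (1 - p))         # ≈ 0.19: her odds of defaulting
```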
[1] $I(x)$ is an interval with $|I(x)| \le 0.1$. If $x \in [0.05, 0.95]$, then $I(x) = [x - 0.05, x + 0.05]$. If $x < 0.05$, then $I(x) = [0, x + 0.05]$, and if $x > 0.95$, then $I(x) = [x - 0.05, 1]$. ↩
[3] I believe the authors mean “conditionally normally distributed”, i.e. that $X \mid Y$ is normally distributed, as in section 4.4.1. ↩
[4] To avoid confusion, I’ve let the class index be the same as the values of $Y$. So instead of having values $y \in \{0, 1\}$ and class numbers $1, 2$, the class numbers are also $0, 1$. ↩