ISLR: notes and exercises from An Introduction to Statistical Learning

4. Logistic Regression

An Overview of Classification

  • In classification we consider paired data $(\mathbf{X}, Y)$, where $Y$ is a qualitative variable, that is, a random variable taking values in a finite set.

  • The values $Y$ takes are called classes

Why Not Linear Regression?

A linear regression model implies an ordering on the values of the response, but in general there is no natural ordering on the values of a qualitative variable.

Logistic Regression

The Logistic Model

  • Consider a quantitative predictor $X$ and a binary response variable $Y \in \{0, 1\}$

  • We want to model the conditional probability of $Y = 1$ given $X$

$$P(X) := P\left(Y = 1 \mid X\right)$$

  • We model $P(X)$ with the logistic function

$$P(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

  • The logistic model can be considered a linear model for the log-odds or logit

$$\log\left(\frac{P(X)}{1 - P(X)}\right) = \beta_0 + \beta_1 X$$
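A minimal numpy sketch (with made-up coefficients, purely for illustration) showing that the logistic form and the logit form describe the same model:

```python
import numpy as np

def logistic_prob(x, beta0, beta1):
    """P(Y = 1 | X = x) under the logistic model."""
    return np.exp(beta0 + beta1 * x) / (1.0 + np.exp(beta0 + beta1 * x))

def logit(p):
    """Log-odds of a probability p."""
    return np.log(p / (1.0 - p))

# Made-up coefficients: applying the logit to P(X) recovers the linear form beta0 + beta1 * x
beta0, beta1, x = -2.0, 0.5, 3.0
p = logistic_prob(x, beta0, beta1)
print(p)                             # a probability in (0, 1)
print(logit(p), beta0 + beta1 * x)   # both equal -0.5
```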

Estimating the Regression Coefficients

  • The likelihood function for the logistic regression parameter $\boldsymbol{\beta} = (\beta_0, \beta_1)$ is

$$\begin{aligned} \ell(\boldsymbol{\beta}) &= \prod_{i = 1}^n p(x_i)^{y_i}\bigl(1 - p(x_i)\bigr)^{1 - y_i}\\ &= \prod_{i: y_i = 1}p(x_i) \prod_{i: y_i = 0} \bigl(1 - p(x_i)\bigr) \end{aligned}$$

  • The maximum likelihood estimate (MLE) for the regression parameter is

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}\in \mathbb{R}^2}{\text{argmax}\,} \ell(\boldsymbol{\beta})$$
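As a sketch of how the maximum likelihood estimate can be computed, the snippet below runs gradient ascent on the log-likelihood with simulated data (the data, learning rate, and iteration count are made up; statistical packages typically use Newton-type solvers instead):

```python
import numpy as np

def fit_logistic(x, y, lr=0.5, n_iter=10000):
    """Fit simple logistic regression by gradient ascent on the log-likelihood."""
    X = np.column_stack([np.ones_like(x), x])    # design matrix with an intercept column
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # current estimates of P(Y = 1 | x_i)
        beta += lr * X.T @ (y - p) / len(y)      # log-likelihood gradient is X^T (y - p)
    return beta

# Simulate data from a known model and check that the MLE roughly recovers it
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x))))
print(fit_logistic(x, y))   # should be close to (-1, 2)
```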

Making Predictions

  • The MLE $\hat{\boldsymbol{\beta}}$ yields an estimate $\hat{P}(X)$ of the conditional probability, which can be used to predict the class $Y$
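For instance, using scikit-learn on a made-up toy dataset (note that scikit-learn's LogisticRegression applies an L2 penalty by default, so this is only approximately the MLE), the fitted probabilities are thresholded at 0.5 to predict the class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy single-predictor data, made up for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
p_hat = clf.predict_proba(X)[:, 1]     # estimated P(Y = 1 | X)
y_hat = (p_hat > 0.5).astype(int)      # predict class 1 when the estimated probability exceeds 0.5
print(clf.intercept_, clf.coef_)       # fitted beta_0 and beta_1
print(np.mean(y_hat == y))             # training accuracy
```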

Multiple Logistic Regression

  • Multiple logistic regression considers the case of multiple predictors $\mathbf{X} = (X_1, \dots, X_p)^\top$.

  • If we write the predictors as $\mathbf{X} = (1, X_1, \dots, X_p)^\top$ and the parameter as $\boldsymbol{\beta} = (\beta_0, \dots, \beta_p)^\top$, then multiple logistic regression models

$$p(\mathbf{X}) = \frac{\exp(\boldsymbol{\beta}^\top \mathbf{X})}{1 + \exp(\boldsymbol{\beta}^\top \mathbf{X})}$$
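A short numpy sketch of this vector form, with made-up coefficients for $p = 3$ predictors:

```python
import numpy as np

def multiple_logistic_prob(x, beta):
    """p(X) = exp(beta^T X) / (1 + exp(beta^T X)) with X = (1, X_1, ..., X_p)."""
    x_aug = np.concatenate(([1.0], x))   # prepend the constant 1 for the intercept
    eta = beta @ x_aug                   # the linear predictor beta^T X
    return np.exp(eta) / (1.0 + np.exp(eta))

# Illustrative (made-up) coefficients (beta_0, beta_1, beta_2, beta_3)
beta = np.array([-0.5, 1.2, -0.7, 0.3])
print(multiple_logistic_prob(np.array([0.2, 1.0, -0.4]), beta))
```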

Logistic Regression for more than two response classes

  • This isn't used often in practice; a softmax (multinomial logistic regression) formulation, or a discriminant analysis method such as LDA below, is typically used instead

Linear Discriminant Analysis

This is a method for modeling the conditional probability of a qualitative response $Y$ given quantitative predictors $\mathbf{X}$ when $Y$ takes more than two values. It is useful because:

  • Parameter estimates for logistic regression are surprisingly unstable when the classes are well separated, but LDA doesn't have this problem

  • If $n$ is small and the $X_i$ are approximately normal within each class (i.e. the conditional $X_i \mid Y = k$ is approximately normal), LDA is more stable

  • LDA can accommodate more than two classes

Bayes Theorem for Classification

  • Consider a quantitative input $\mathbf{X}$ and qualitative response $Y \in \{1, \dots, K\}$.

  • Let $\pi_k := \mathbb{P}(Y = k)$ be the prior probability that $Y = k$, let $p_k(x) := \mathbb{P}(Y = k\ |\ X = x)$ be the posterior probability that $Y = k$, and let $f_k(x)$ be the density of $X$ within class $k$ (informally, $\mathbb{P}(X = x\ |\ Y = k)$). Then Bayes' theorem says:

$$p_k(x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}$$

  • We can form an estimate $\hat{p}_k(x)$ of $p_k(x)$ from estimates of $\pi_k$ and $f_k(x)$ for each $k$; for an input $x$ we then predict 1

$$\hat{y} = \underset{1 \leqslant k \leqslant K}{\text{argmax}\,} \hat{p}_k(x)$$
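The following sketch computes the posteriors $p_k(x)$ from assumed priors and class densities (here Gaussian densities with made-up parameters) and predicts the class with the largest posterior:

```python
import numpy as np
from scipy.stats import norm

def posterior_probs(x, priors, means, sds):
    """p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x), with Gaussian class densities f_k."""
    f = norm.pdf(x, loc=means, scale=sds)   # f_k(x) for each class k
    unnorm = priors * f                     # pi_k f_k(x)
    return unnorm / unnorm.sum()

# Three classes with made-up priors, means, and standard deviations
priors = np.array([0.3, 0.5, 0.2])
means = np.array([-1.0, 0.0, 2.0])
sds = np.array([1.0, 1.0, 1.0])

p = posterior_probs(0.5, priors, means, sds)
print(p, "-> predict class", np.argmax(p) + 1)
```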

Linear Discriminant Analysis for p=1

  • Assume that the conditional $X \mid Y = k \sim \text{Normal}(\mu_k, \sigma_k^2)$ and that the variances are equal across classes, $\sigma_1^2 = \cdots = \sigma_K^2 = \sigma^2$.

  • The Bayes classifier predicts $Y = k$ where $p_k(x)$ is largest, or equivalently

$$\hat{y} = \underset{1 \leqslant k \leqslant K}{\text{argmax}\ } \delta_k(x)$$

where

$$\delta_k(x) := x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

is the discriminant function 2

  • The LDA classifier 3 estimates the parameters

$$\begin{aligned} \hat{\mu}_k &= \frac{1}{n_k}\sum_{i: y_i = k} x_i\\ \hat{\sigma}^2 &= \frac{1}{n-K} \sum_{k = 1}^K \sum_{i: y_i = k} \left(x_i - \hat{\mu}_k\right)^2 \end{aligned}$$

where $n_k$ is the number of observations in class $k$ 4, estimates the priors by $\hat{\pi}_k = n_k / n$, and predicts

$$\hat{y} = \underset{1 \leqslant k \leqslant K}{\text{argmax}\ } \hat{\delta}_k(x)$$
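A from-scratch sketch of one-dimensional LDA on simulated two-class data (the data and class structure are made up for illustration):

```python
import numpy as np

def lda_1d_fit(x, y):
    """Estimate the LDA parameters: class priors, class means, and the pooled variance."""
    classes = np.unique(y)
    n, K = len(y), len(classes)
    pi_hat = np.array([np.mean(y == k) for k in classes])    # n_k / n
    mu_hat = np.array([x[y == k].mean() for k in classes])   # class means
    sigma2_hat = sum(((x[y == k] - x[y == k].mean()) ** 2).sum() for k in classes) / (n - K)
    return classes, pi_hat, mu_hat, sigma2_hat

def lda_1d_predict(x0, classes, pi_hat, mu_hat, sigma2_hat):
    """Assign x0 to the class with the largest estimated discriminant."""
    delta = x0 * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)
    return classes[np.argmax(delta)]

# Simulated two-class data with a shared variance
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
y = np.array([0] * 100 + [1] * 100)
params = lda_1d_fit(x, y)
print(lda_1d_predict(0.8, *params))   # most likely class 1
```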

Linear Discriminant Analysis for p > 1

  • Assume that the conditional $(\mathbf{X} \mid Y = k) \sim \text{Normal}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ and that the covariance matrices are equal across classes, $\boldsymbol{\Sigma}_1 = \cdots = \boldsymbol{\Sigma}_K = \boldsymbol{\Sigma}$.

  • The discriminant functions are

$$\delta_k(x) = x^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log(\pi_k)$$

  • LDA estimates $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$ 5 componentwise (only the variance terms of $\boldsymbol{\Sigma}$ are shown here),

$$\begin{aligned} (\hat{\mu}_k)_j &= \frac{1}{n_k}\sum_{i: y_i = k} x_{ij}\\ \hat{\sigma}_j^2 &= \frac{1}{n-K} \sum_{k = 1}^K \sum_{i: y_i = k} \left(x_{ij} - (\hat{\mu}_k)_j\right)^2 \end{aligned}$$

for $1 \leqslant j \leqslant p$, analogous to the $p = 1$ case, and predicts

$$\hat{y} = \underset{1 \leqslant k \leqslant K}{\text{argmax}\ } \hat{\delta}_k(x)$$

as above

  • Confusion matrices help analyze misclassifications for an LDA model 6

  • The Bayes decision boundary may not be desirable in every context, so sometimes a different decision boundary (threshold) is used.

  • An ROC curve is useful for visualising the true-positive versus false-positive rates over different decision thresholds in the binary response case.
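A sketch tying these last points together with scikit-learn on simulated data (all data and thresholds are made up): fit LDA, inspect confusion matrices at two thresholds, and compute the ROC curve:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

# Simulated two-class data with p = 2 predictors and a shared covariance matrix
rng = np.random.default_rng(3)
X = np.vstack([rng.multivariate_normal([-1, -1], np.eye(2), 200),
               rng.multivariate_normal([1, 1], np.eye(2), 200)])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
p_hat = lda.predict_proba(X)[:, 1]

# Confusion matrices: the default 0.5 threshold versus a lower, more permissive one
print(confusion_matrix(y, (p_hat > 0.5).astype(int)))
print(confusion_matrix(y, (p_hat > 0.2).astype(int)))   # fewer false negatives, more false positives

# ROC curve: true- vs. false-positive rates across all thresholds
fpr, tpr, thresholds = roc_curve(y, p_hat)
print("AUC:", roc_auc_score(y, p_hat))
```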

Quadratic Discriminant Analysis

  • Assume that the conditional $(\mathbf{X} \mid Y = k) \sim \text{Normal}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, but do not assume that the covariance matrices $\boldsymbol{\Sigma}_k$ are equal across classes

  • The discriminant functions are now quadratic in $x$

$$\delta_k(x) = -\frac{1}{2}x^\top\boldsymbol{\Sigma}_k^{-1}x + x^\top\boldsymbol{\Sigma}_k^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^\top \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k - \frac{1}{2}\log\left|\boldsymbol{\Sigma}_k\right| + \log(\pi_k)$$

  • QDA has more degrees of freedom than LDA 7 so generally has lower bias but higher variance.
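A brief scikit-learn sketch of QDA on simulated classes whose covariance structures differ (made-up data, where QDA's extra flexibility is expected to help):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Two classes with the same mean but very different covariance matrices
rng = np.random.default_rng(4)
X = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], 200),
               rng.multivariate_normal([0, 0], [[1.0, -0.8], [-0.8, 1.0]], 200)])
y = np.array([0] * 200 + [1] * 200)

qda = QuadraticDiscriminantAnalysis().fit(X, y)
print("QDA training accuracy:", qda.score(X, y))
```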

A Comparison of Classification Methods

  • So far our classification methods are KNN, logistic regression (LogReg), LDA and QDA

  • LDA and LogReg both produce linear decision boundaries. They often give similar performance, although LDA tends to outperform when the conditionals $X \mid Y = k$ are approximately normally distributed, and LogReg tends to outperform when they are not

  • As a non-parametric approach, KNN produces a non-linear decision boundary, so it tends to outperform LDA and LogReg when the true decision boundary is highly non-linear. However, it doesn't help with selecting important predictors

  • With a quadratic decision boundary, QDA is a compromise between the non-linear KNN and the linear LDA/LogReg
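A quick comparison sketch on one simulated scenario (Gaussian classes with a shared covariance, all made up); relative performance will of course differ under other data-generating processes:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

# Simulated two-class data favourable to the linear methods
rng = np.random.default_rng(5)
X = np.vstack([rng.multivariate_normal([-1, -1], np.eye(2), 300),
               rng.multivariate_normal([1, 1], np.eye(2), 300)])
y = np.array([0] * 300 + [1] * 300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "LogReg": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}
for name, model in models.items():
    print(name, model.fit(X_train, y_train).score(X_test, y_test))
```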


Footnotes

  1. Recall that the Bayes classifier predicts $\hat{y} = \underset{1 \leqslant k \leqslant K}{\text{argmax}\,} p_k(x)$, so we can think of LDA as an estimate of the Bayes classifier.

  2. The Bayes decision boundary is computed from the true parameter values $\boldsymbol{\mu} = (\mu_1, \dots, \mu_K)$, $\boldsymbol{\sigma} = (\sigma_1, \dots, \sigma_K)$ and consists of the points $x$ where the largest discriminants tie, i.e. $\delta_k(x) = \delta_j(x)$ for some $j \neq k$. The Bayes classifier assigns a class $y$ to an input $x$ based on where $x$ falls with respect to this boundary.

  3. For the case $K = 2$ with $\pi_1 = \pi_2$, this is equivalent to assigning $x$ to class 1 if $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$.

  4. The functions $\hat{\delta}_k(x)$ are called discriminant functions, and since they're linear in $x$, the method is called linear discriminant analysis.

  5. $\hat{\mu}_k$ is the average of the observed inputs in the $k$-th class, and $\hat{\sigma}^2$ is a weighted average of the sample variances over the $K$ classes.

  6. False positives and false negatives in the binary case. 

  7. $Kp(p+1)/2$ covariance parameters for QDA versus $p(p+1)/2$ for LDA.