4. Logistic Regression
An Overview of Classification
- In classification we consider paired data (X, Y), where Y is a qualitative variable, that is, a random variable taking values in a finite set.
- The values Y takes are called classes.
Why Not Linear Regression?
Because a linear regression model implies an ordering on the values of the response, and in general there is no natural ordering on the values of a qualitative variable.
Logistic Regression
The Logistic Model
- Consider a quantitative predictor X and a binary response variable Y ∈ {0, 1}.
- We want to model the conditional probability of Y = 1 given X:

$$p(X) := P(Y = 1 \mid X)$$
- We model p(X) with the logistic function

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
- The logistic model can be considered a linear model for the log-odds or logit:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$
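To see that the two displays above describe the same model, write $\eta := \beta_0 + \beta_1 X$; the short derivation below (standard algebra, added here for completeness) recovers the logit form from the logistic one:

$$1 - p(X) = 1 - \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{\eta}}, \qquad \frac{p(X)}{1 - p(X)} = e^{\eta}, \qquad \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$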
Estimating the Regression Coefficients
- The likelihood function for the logistic regression parameter $\beta = (\beta_0, \beta_1)$ is

$$\ell(\beta) = \prod_{i=1}^{n} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i} = \prod_{i : y_i = 1} p(x_i) \prod_{i : y_i = 0} (1 - p(x_i))$$
- The maximum likelihood estimate (MLE) for the regression parameter is

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^2}{\arg\max}\ \ell(\beta)$$
Making Predictions
- The MLE $\hat{\beta}$ yields an estimate $\hat{p}(X)$ of the conditional probability, which can be used to predict the class Y (for example, predict Y = 1 when $\hat{p}(X) > 0.5$).
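As a concrete illustration of the last two sections, here is a minimal sketch (not from the original notes) that computes $\hat{\beta}$ by gradient ascent on the log-likelihood for simulated data and then classifies by thresholding $\hat{p}(x)$ at 0.5; the data-generating values, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: one quantitative predictor x, binary response y.
n = 200
x = rng.normal(size=n)
true_beta = np.array([-0.5, 2.0])          # (beta_0, beta_1), chosen arbitrarily
p_true = 1 / (1 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = rng.binomial(1, p_true)

# Design matrix with an intercept column, so each row is (1, x_i).
X = np.column_stack([np.ones(n), x])

def log_likelihood(beta):
    eta = X @ beta
    # log l(beta) = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
    return np.sum(y * eta - np.logaddexp(0.0, eta))

# Gradient ascent on the log-likelihood; the gradient is X^T (y - p).
beta_hat = np.zeros(2)
for _ in range(5000):
    p = 1 / (1 + np.exp(-(X @ beta_hat)))
    beta_hat += 0.01 * X.T @ (y - p)

print("MLE:", beta_hat, "log-likelihood:", log_likelihood(beta_hat))

# Predictions: estimate p(x) and threshold at 0.5.
p_hat = 1 / (1 + np.exp(-(X @ beta_hat)))
y_hat = (p_hat > 0.5).astype(int)
print("training accuracy:", np.mean(y_hat == y))
```

In practice one would usually rely on a library routine (e.g. iteratively reweighted least squares as in R's glm, or scikit-learn's LogisticRegression), but the explicit loop makes the likelihood connection visible.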
Multiple Logistic Regression
- Multiple logistic regression considers the case of multiple predictors $X = (X_1, \ldots, X_p)^\top$.
- If we write the predictors as $X = (1, X_1, \ldots, X_p)^\top$ and the parameter as $\beta = (\beta_0, \ldots, \beta_p)^\top$, then multiple logistic regression models

$$p(X) = \frac{\exp(\beta^\top X)}{1 + \exp(\beta^\top X)}$$
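As a rough illustration (not from the notes), the sketch below fits a multiple logistic regression with scikit-learn on simulated data with p = 3 predictors; the coefficients, sample size, and regularization setting are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy data with p = 3 predictors; the true coefficients are arbitrary.
n, p = 500, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
prob = 1 / (1 + np.exp(-(0.3 + X @ beta)))
y = rng.binomial(1, prob)

# A large C effectively turns off scikit-learn's default L2 penalty,
# so the fit approximates the plain MLE; the intercept is fit separately.
model = LogisticRegression(C=1e6).fit(X, y)
print("intercept:", model.intercept_, "coefficients:", model.coef_)
print("estimated P(Y=1 | X=x) for the first observation:", model.predict_proba(X[:1])[0, 1])
```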
Logistic Regression for more than two response classes
- Multiclass logistic regression is not used often in practice; when more than two classes are modeled this way, a softmax (multinomial) formulation is typically used, as in the sketch below.
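For reference, a minimal sketch (added here, not in the notes) of the softmax map that turns K linear scores $\beta_k^\top x$ into class probabilities:

```python
import numpy as np

def softmax(scores):
    """Map a vector of K linear scores (beta_k^T x) to class probabilities."""
    # Subtract the max for numerical stability; this does not change the result.
    z = scores - np.max(scores)
    expz = np.exp(z)
    return expz / expz.sum()

# Example: three classes with arbitrary scores for one observation x.
print(softmax(np.array([2.0, 0.5, -1.0])))   # probabilities summing to 1
```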
Linear Discriminant Analysis
This is a method for modeling the conditional probability of a qualitative response Y given quantitative predictors X, and it works when Y takes more than two values. It is useful because:
- Parameter estimates for logistic regression are surprisingly unstable when the classes are well separated, but LDA does not have this problem
- If n is small and the $X_i$ are approximately normal in the classes (i.e. the conditional $X_i \mid Y = k$ is approximately normal), LDA is more stable
- LDA can accommodate more than two classes
Bayes' Theorem for Classification
- Consider a quantitative input X and a qualitative response Y ∈ {1, …, K}.
- Let $\pi_k := P(Y = k)$ be the prior probability that Y = k, let $p_k(x) := P(Y = k \mid X = x)$ be the posterior probability that Y = k, and let $f_k(x) := P(X = x \mid Y = k)$. Then Bayes' theorem says:

$$p_k(x) = \frac{\pi_k f_k(x)}{\sum_{l} \pi_l f_l(x)}$$
- We can form an estimate $\hat{p}_k(x)$ of $p_k(x)$ from estimates of $\pi_k$ and $f_k(x)$ for each k, and for a given x predict

$$\hat{y} = \underset{1 \leq k \leq K}{\arg\max}\ \hat{p}_k(x)$$
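A minimal sketch of the resulting classifier, assuming Gaussian class densities $f_k$ with hypothetical parameter values (the specific numbers are made up; the Gaussian choice anticipates the LDA sections below):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical estimates for K = 2 classes with one predictor.
priors = np.array([0.7, 0.3])                              # pi_k
means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.0])    # parameters of f_k

def posterior(x):
    """p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x), with Gaussian f_k."""
    unnorm = priors * norm.pdf(x, loc=means, scale=sds)
    return unnorm / unnorm.sum()

x = 1.2
p = posterior(x)
print("posteriors:", p, "predicted class:", np.argmax(p) + 1)
```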
Linear Discriminant Analysis for p=1
- Assume that the conditional $X \mid Y = k \sim \mathrm{Normal}(\mu_k, \sigma_k^2)$ and that the variances are equal across classes: $\sigma_1^2 = \cdots = \sigma_K^2 = \sigma^2$.
- The Bayes classifier predicts Y = k where $p_k(x)$ is largest, or equivalently

$$\hat{y} = \underset{1 \leq k \leq K}{\arg\max}\ \delta_k(x)$$

where

$$\delta_k(x) := x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

is the discriminant function
- The LDA classifier estimates the parameters

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2$$

where $n_k$ is the number of observations in class k (the priors are typically estimated by $\hat{\pi}_k = n_k / n$), and predicts

$$\hat{y} = \underset{1 \leq k \leq K}{\arg\max}\ \hat{\delta}_k(x)$$

where $\hat{\delta}_k$ is the discriminant function with the estimated parameters plugged in
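A from-scratch sketch of these estimates for one predictor and K = 2 classes; the simulated class means, shared variance, and sample sizes are arbitrary choices, not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated 1-D data from K = 2 classes with a shared variance.
n_per = 100
x = np.concatenate([rng.normal(-1.0, 1.0, n_per), rng.normal(1.5, 1.0, n_per)])
y = np.repeat([0, 1], n_per)
n, K = len(x), 2

# Plug-in estimates: class priors, class means, pooled variance.
pi_hat = np.array([np.mean(y == k) for k in range(K)])
mu_hat = np.array([x[y == k].mean() for k in range(K)])
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)

def delta(x0):
    """Estimated discriminant functions delta_k(x0) for all classes."""
    return x0 * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)

# Predict the class whose discriminant is largest.
x0 = 0.3
print("discriminants:", delta(x0), "predicted class:", np.argmax(delta(x0)))
```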
Linear Discriminant Analysis for p > 1
- Assume that the conditional $X \mid Y = k \sim \mathrm{Normal}(\mu_k, \Sigma_k)$ and that the covariance matrices are equal across classes: $\Sigma_1 = \cdots = \Sigma_K = \Sigma$.
- The discriminant functions are

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log(\pi_k)$$
- LDA estimates $\mu_k$ componentwise and the pooled covariance $\Sigma$ entrywise:

$$(\hat{\mu}_k)_j = \frac{1}{n_k} \sum_{i : y_i = k} x_{ij}, \qquad \hat{\Sigma}_{jl} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_{ij} - (\hat{\mu}_k)_j)(x_{il} - (\hat{\mu}_k)_l)$$

for $1 \leq j, l \leq p$, and predicts

$$\hat{y} = \underset{1 \leq k \leq K}{\arg\max}\ \hat{\delta}_k(x)$$

as above (a fitted example appears after this list)
- Confusion matrices help analyze misclassifications for an LDA model
- The Bayes decision boundary may not be appropriate in every context, so sometimes a different decision boundary (threshold) is used.
- An ROC curve is useful for visualizing the true versus false positive rates over different decision thresholds in the binary response case.
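The sketch below, added for illustration, fits a multivariate LDA model with scikit-learn, prints a confusion matrix at the default threshold, computes the ROC curve, and shows how moving the threshold (to an arbitrary 0.2) changes the confusion matrix; the simulated data and all settings are placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

rng = np.random.default_rng(3)

# Two Gaussian classes in p = 2 dimensions with a shared covariance (the LDA setting).
n_per = 200
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), n_per),
               rng.multivariate_normal([1.5, 1.0], np.eye(2), n_per)])
y = np.repeat([0, 1], n_per)

lda = LinearDiscriminantAnalysis().fit(X, y)

# Confusion matrix at the default 0.5 threshold on the posterior probability.
print(confusion_matrix(y, lda.predict(X)))

# ROC curve: true vs. false positive rate as the threshold varies.
scores = lda.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)
print("AUC:", roc_auc_score(y, scores))

# Changing the threshold trades false positives against false negatives.
print(confusion_matrix(y, (scores > 0.2).astype(int)))
```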
Quadratic Discriminant Analysis
- Assume that the conditional $X \mid Y = k \sim \mathrm{Normal}(\mu_k, \Sigma_k)$, but do not assume that the covariance matrices $\Sigma_k$ are equal across classes
- The discriminant functions are now quadratic in x:

$$\delta_k(x) = -\frac{1}{2} x^\top \Sigma_k^{-1} x + x^\top \Sigma_k^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma_k^{-1} \mu_k - \frac{1}{2} \log |\Sigma_k| + \log(\pi_k)$$
- QDA has more degrees of freedom than LDA, so it generally has lower bias but higher variance.
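A small illustrative comparison (not from the notes): when the class covariance matrices genuinely differ, QDA's quadratic boundary typically fits better than LDA's linear one. The simulated means and covariances below are arbitrary.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(4)

# Two classes whose covariance matrices differ, so the QDA assumptions hold.
n_per = 300
X = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], n_per),
               rng.multivariate_normal([1.0, 1.0], [[2.0, -0.8], [-0.8, 0.5]], n_per)])
y = np.repeat([0, 1], n_per)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# With unequal covariances, the quadratic boundary usually fits better in-sample.
print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
```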
A Comparison of Classification Methods
- So far our classification methods are KNN, logistic regression (LogReg), LDA, and QDA
- LDA and LogReg both produce linear decision boundaries. They often give similar performance results, although LDA tends to outperform when the conditionals $X \mid Y = k$ are approximately normal, and to underperform when they are not
- As a non-parametric approach, KNN produces a non-linear decision boundary, so it tends to outperform LDA and LogReg when the true decision boundary is highly non-linear. However, it does not help with selecting important predictors
- With a quadratic decision boundary, QDA is a compromise between the non-linear KNN and the linear LDA/LogReg
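As a purely illustrative sketch (none of this appears in the notes), the snippet below compares the four methods on simulated data whose true decision boundary is a circle, so the more flexible methods should do better; the sample size, neighborhood size k, and train/test split are arbitrary.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)

# Simulated data with a non-linear (circular) true decision boundary.
n = 1000
X = rng.normal(size=(n, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "KNN (k=10)": KNeighborsClassifier(n_neighbors=10),
    "LogReg": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}
for name, model in models.items():
    # Test accuracy; the linear methods should struggle on this boundary.
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
```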