9. Support Vector Machines
Maximal Margin Classifier
What Is a Hyperplane?
- A hyperplane in Rp is an affine subspace of dimension p−1. Every hyperplane is the set of solutions X to β⊤X=0 for some β, where (here and below) the intercept β0 is absorbed into β by appending a constant coordinate 1 to X.
- A hyperplane β⊤X=0 partitions Rp into two halfspaces:
H+={X∈Rp ∣ β⊤X>0}
H−={X∈Rp ∣ β⊤X<0}
corresponding to either side of the plane, or equivalently,
H+={X∈Rp ∣ sgn(β⊤X)=1}
H−={X∈Rp ∣ sgn(β⊤X)=−1}
Classification Using a Separating Hyperplane
- Given data (xi,yi), i=1,…n with response classes yi∈{±1}, a hyperplane β⊤X=0 is separating if
sgn(β⊤xi)=yi
for all i.
- Given a separating hyperplane, we may predict
y^i=sgn(β⊤xi)
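As a quick illustration, here is a minimal NumPy sketch (hypothetical data and a hypothetical β, not fitted to anything) that checks the separating condition sgn(β⊤xi)=yi and predicts by the sign of β⊤x, with the intercept handled by prepending a constant 1 to each observation.

```python
import numpy as np

# Hypothetical data: two features per observation, labels in {-1, +1}.
X = np.array([[2.0, 1.0], [3.0, 2.5], [-1.5, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# Hypothetical hyperplane beta^T x = 0; the intercept is the first entry of beta,
# matched by prepending a constant 1 to each observation.
beta = np.array([0.2, 1.0, 0.5])
X_aug = np.hstack([np.ones((len(X), 1)), X])

scores = X_aug @ beta        # beta^T x_i for each observation
y_hat = np.sign(scores)      # predicted class = side of the hyperplane

# The hyperplane is separating iff sgn(beta^T x_i) = y_i for every i.
print("separating:", bool(np.all(y_hat == y)))
print("predictions:", y_hat)
```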
The Maximal Margin Classifier
- Separating hyperplanes are not unique (if one exists, uncountably many exist). A natural choice is the maximal margin hyperplane (or optimal separating hyperplane).
- The margin M is the smallest perpendicular distance from a sample point to the hyperplane:
M = min_i ∣∣xi − Pxi∣∣
where P is the projection matrix onto the hyperplane.
- The points (xi, yi) “on the margin” (where ∣∣xi − Pxi∣∣ = M) are called support vectors.
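For a given hyperplane the margin and the points attaining it can be computed directly, since ∣∣xi − Pxi∣∣ = ∣β⊤xi∣ / ∣∣β∣∣. A small NumPy sketch with hypothetical points and an arbitrary β (not the maximal margin hyperplane):

```python
import numpy as np

# Hypothetical points in R^2 and a hyperplane beta^T X = 0 through the origin.
X = np.array([[2.0, 1.0], [3.0, 2.5], [-1.5, -2.0], [-2.0, -0.5]])
beta = np.array([1.0, 0.5])

# ||x_i - P x_i|| = |beta^T x_i| / ||beta||  (just |beta^T x_i| when ||beta|| = 1).
distances = np.abs(X @ beta) / np.linalg.norm(beta)

M = distances.min()                                   # the margin
on_margin = np.flatnonzero(np.isclose(distances, M))  # the points "on the margin"
print("margin M =", M, "indices on the margin:", on_margin)

# Sanity check against the projection-matrix form used above.
P = np.eye(2) - np.outer(beta, beta) / (beta @ beta)
assert np.allclose(np.linalg.norm(X - X @ P, axis=1), distances)
```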
Construction of the Maximal Margin Classifier
The maximal margin classifier is the solution to the optimization problem:
argmax_β M
subject to ∣∣β∣∣ = 1 and yi(β⊤xi) ⩾ M for all i = 1, …, n
(in vector form, y ⊙ (Xβ) ⩾ M with M = (M, …, M) ∈ Rn).
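In practice this problem is handed to a solver. A hedged scikit-learn sketch on hypothetical separable data: SVC uses an equivalent scaling of the same problem (yi(w⊤xi + b) ⩾ 1 at the margin instead of ∣∣β∣∣ = 1), so a very large C approximates the hard-margin solution and the fitted margin is 1/∣∣w∣∣ under that scaling.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin (maximal margin) problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("maximal margin M ~", 1.0 / np.linalg.norm(w))  # margin under SVC's scaling
print("support vectors:\n", clf.support_vectors_)
```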
The Non-separable Case
- The maximal margin classifier is a natural classifier, but a separating hyperplane is not guaranteed to exist.
- If a separating hyperplane doesn’t exist, we can choose an “almost” separating hyperplane by using a “soft” margin.
Support Vector Classifiers
Overview of the Support Vector Classifier
- Separating hyperplanes don’t always exist, and even if they do, they may be undesirable.
- The distance to the hyperplane can be thought of as a measure of confidence in the classification. When the margin is very small, the separating hyperplane is very sensitive to individual observations, and we have low confidence in the classification of nearby observations.
- In these situations, we may prefer a hyperplane that doesn’t perfectly separate in the interest of:
- Greater robustness to individual observations
- Better classification of most of the training observations
- This is achieved by the support vector classifier or soft margin classifier
Details of the Support Vector Classifier
- The support vector classifier is the solution to the optimization problem:
argmax_β M
subject to ∣∣β∣∣ = 1, yi(β⊤xi) ⩾ M(1 − ϵi), ϵi ⩾ 0, Σi ϵi ⩽ C
where C ⩾ 0 is a tuning parameter (a budget for margin violations), M is the margin, and the ϵi are slack variables [3].
- Observations on the margin or on the wrong side of the margin are called support vectors
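A scikit-learn sketch of the soft margin on overlapping synthetic data. Note that sklearn’s C is a per-observation penalty on the slack, so it behaves roughly like the inverse of the budget C above: small values give a wide, tolerant margin with many support vectors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping classes (synthetic), so no separating hyperplane exists.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1  # recode labels from {0, 1} to {-1, +1}

# Small sklearn-C -> wide, tolerant margin; large sklearn-C -> narrow margin.
for penalty in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    print(f"C={penalty:>6}: {len(clf.support_)} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```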
Support Vector Machines
Classification with Non-Linear Decision Boundaries
- The support vector classifier is a natural choice for two response classes when the class boundary is linear, but may perform poorly when the boundary is non-linear.
- Non-linear transformations of the features will lead to a non-linear class boundary, but enlarging the feature space too much can lead to intractable computations.
- The support vector machine enlarges the feature space in a way which is computationally efficient.
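A small numerical check of this idea (assuming p = 2 and the degree-2 polynomial kernel defined below): the kernel value (1 + x⊤z)² equals an ordinary inner product in an enlarged 6-dimensional feature space, but the kernel computes it without ever forming those features explicitly.

```python
import numpy as np

# phi maps R^2 into the enlarged 6-dimensional feature space for which
# <phi(x), phi(z)> = (1 + x^T z)^2  (degree-2 polynomial kernel, p = 2).
def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

kernel_value = (1 + x @ z) ** 2     # computed in the original R^2
explicit_value = phi(x) @ phi(z)    # computed in the enlarged space
assert np.isclose(kernel_value, explicit_value)
print(kernel_value, explicit_value)
```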
The Support Vector Machine
- It can be shown that:
- the linear support vector classifier is a model of the form f(x) = β0 + ∑i αi⟨x, xi⟩, with one coefficient αi per training observation
- the parameter estimates α^i, β^0 can be computed from the (n choose 2) = n(n−1)/2 inner products ⟨xi, xi′⟩ between pairs of training observations
- The support vector machine is a model of the form
f(x) = β0 + ∑i αi K(x, xi)
where K is a kernel function.
- Popular kernels are
- The polynomial kernel
K(xi, xi′) = (1 + xi⊤xi′)^d
- The radial kernel
K(xi, xi′) = exp(−γ ∣∣xi − xi′∣∣²)
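A scikit-learn sketch comparing these kernels on synthetic data with a non-linear (circular) class boundary. The polynomial kernel parameters are set so it matches the form (1 + xi⊤xi′)^d above, and gamma = 1 is an arbitrary choice.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score

# A non-linear (circular) class boundary, where a linear boundary must fail.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3, "gamma": 1.0, "coef0": 1.0}),  # (1 + <x, x'>)^3
                       ("rbf", {"gamma": 1.0})]:                             # exp(-gamma ||x - x'||^2)
    clf = SVC(kernel=kernel, C=1.0, **params)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:>6} kernel: 5-fold CV accuracy {score:.2f}")
```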
SVMs with More than Two Classes
One-Versus-One Classification
This approach works as follows:
- Fit (K choose 2) = K(K−1)/2 SVMs, one for each pair of classes k, k′, encoded as +1 and −1 respectively.
- For each observation x, classify with each of these fitted SVMs, and let Nk be the number of times x was assigned to class k.
- Predict
f^(x) = argmax_k Nk
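scikit-learn’s SVC uses exactly this one-versus-one scheme when there are more than two classes; a short sketch on synthetic three-class data (K(K−1)/2 = 3 pairwise classifiers, prediction by majority vote):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Synthetic data with K = 3 classes; SVC fits all K(K-1)/2 = 3 pairwise
# classifiers internally and predicts by majority vote over them.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print("pairwise decision values:", clf.decision_function(X[:1]).shape)  # (1, 3)
print("predicted class:", clf.predict(X[:1]))
```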
One-Versus-All Classification
This approach works as follows:
- Fit K SVMs, each comparing class k (encoded as +1) to the other K−1 classes (encoded as −1). Let βk = (β0k, β1k, …, βpk) be the resulting parameters.
- Predict
f^(x) = argmax_k βk⊤x
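A sketch of one-versus-all using scikit-learn’s OneVsRestClassifier wrapper on the same kind of synthetic three-class data; prediction takes the argmax over the K per-class decision values:

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import make_blobs

# One-versus-all: K binary SVMs, class k versus the rest; predict the class
# whose classifier returns the largest decision value.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
scores = ova.decision_function(X[:1])   # one column per class k
print("per-class scores:", scores)
print("argmax_k:", scores.argmax(axis=1), "predict():", ova.predict(X[:1]))
```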
Relationship to Logistic Regression
- The optimization problem leading to the support vector classifier can be rewritten as
argmin_β ( ∑i max{0, 1 − yi(β⊤xi)} + λ∣∣β∣∣² )
where λ ⩾ 0 is a tuning parameter; the first term is known as the hinge loss.
- The hinge loss is very similar to the logistic regression loss, so both methods tend to give similar results. However, SVMs tend to perform better when the classes are well separated, while logistic regression tends to perform better when they are not.
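A quick numerical comparison of the two penalties as functions of the margin value m = yi(β⊤xi); the shapes are close, which is the reason the two methods often behave similarly (a sketch of the loss functions only, not a fitted-model comparison):

```python
import numpy as np

# Hinge loss vs. logistic (log) loss as functions of m = y_i * f(x_i).
m = np.linspace(-3, 3, 7)
hinge = np.maximum(0.0, 1.0 - m)
logistic = np.log(1.0 + np.exp(-m))

for mi, h, l in zip(m, hinge, logistic):
    print(f"y*f(x) = {mi:5.1f}   hinge = {h:5.2f}   logistic = {l:5.2f}")
```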