islr notes and exercises from An Introduction to Statistical Learning

9. Support Vector Machines

Maximal Margin Classifier

What Is a Hyperplane?

  • A hyperplane in $\mathbb{R}^p$ is an affine subspace of dimension $p-1$. Every hyperplane is the set of solutions $X$ to $\beta^\top X = 0$ for some $\beta\in\mathbb{R}^p$.
  • A hyperplane $\beta^\top X = 0$ partitions $\mathbb{R}^p$ into two halfspaces:

$$H_+ = \{X\in\mathbb{R}^p\ |\ \beta^\top X > 0\} \qquad H_- = \{X\in\mathbb{R}^p\ |\ \beta^\top X < 0\}$$

corresponding to either side of the plane, or equivalently,

$$H_+ = \{X\in\mathbb{R}^p\ |\ \text{sgn}(\beta^\top X) = 1\} \qquad H_- = \{X\in\mathbb{R}^p\ |\ \text{sgn}(\beta^\top X) = -1\}$$

Classification Using a Separating Hyperplane

  • Given data $(x_i, y_i)$, $i = 1,\dots, n$, with response classes $y_i \in \{ \pm 1\}$, a hyperplane $\beta^\top X = 0$ is separating if

$$\text{sgn}(\beta^\top x_i) = y_i$$

for all $i$.

  • Given a separating hyperplane, we may predict

$$\hat{y}_i = \text{sgn}(\beta^\top x_i)$$
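As a quick illustration, here is a minimal NumPy sketch (the coefficient vector `beta` and the sample points are made up for this example) of classifying points by which side of a separating hyperplane they fall on:

```python
import numpy as np

# Hypothetical coefficient vector for a hyperplane beta^T x = 0 in R^2
beta = np.array([2.0, -1.0])

# A few made-up points (one per row) to classify
X = np.array([[1.0, 1.0],   # beta^T x =  1.0 -> predict +1
              [0.5, 2.0],   # beta^T x = -1.0 -> predict -1
              [3.0, 0.0]])  # beta^T x =  6.0 -> predict +1

# Predicted class is the sign of beta^T x_i
y_hat = np.sign(X @ beta)
print(y_hat)  # [ 1. -1.  1.]
```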

The Maximal Margin Classifier

  • Separating hyperplanes are not unique (if one exists then uncountably many exist). A natural choice is the maximal margin hyperplane (or optimal separating hyperplane)

  • The margin is the minimal perpendicular distance to the hyperplane over the sample points, $M = \underset{i}{\min}\{\ ||x_i - P x_i||\ \}$, where $P$ is the projection matrix onto the hyperplane.

  • The points $(x_i, y_i)$ “on the margin” (where $||x_i - P x_i|| = M$) are called support vectors; a short numerical sketch of these quantities follows this list.
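To make these definitions concrete, here is a minimal NumPy sketch (the hyperplane and points are made up for illustration) that computes the perpendicular distances $||x_i - P x_i||$ both via the projection matrix $P$ and via the equivalent formula $|\beta^\top x_i| / ||\beta||$, and takes their minimum as the margin $M$:

```python
import numpy as np

# Hypothetical hyperplane beta^T x = 0 in R^2 and a few sample points
beta = np.array([1.0, 1.0])
X = np.array([[1.0, 2.0],
              [2.0, -1.0],
              [-3.0, 1.0]])

# Projection matrix onto the hyperplane {x : beta^T x = 0}
P = np.eye(2) - np.outer(beta, beta) / (beta @ beta)

# Perpendicular distances ||x_i - P x_i|| and the margin M (their minimum)
dists = np.linalg.norm(X - X @ P, axis=1)
M = dists.min()

# Equivalent formula: |beta^T x_i| / ||beta||
dists_alt = np.abs(X @ beta) / np.linalg.norm(beta)
print(np.allclose(dists, dists_alt), M)  # True 0.707...
```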

Construction of the Maximal Margin Classifier

The maximal margin classifier is the solution to the optimization problem:

$$\begin{aligned} \underset{\boldsymbol{\beta}}{\text{argmax}}&\ M\\ \text{subject to}&\ ||\,\boldsymbol{\beta}\,|| = 1\\ & y_i(\boldsymbol{\beta}^\top x_i) \geqslant M \end{aligned}$$

where the constraint must hold for every observation $i = 1, \dots, n$ 1
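In practice the maximal margin classifier can be approximated with a standard soft-margin solver by making margin violations prohibitively expensive. A minimal sketch, assuming scikit-learn is available and using simulated (essentially separable) data; note that scikit-learn fits a hyperplane with an intercept, $\beta_0 + \beta^\top x = 0$:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Simulated two-class data with well-separated clusters
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)

# A very large C makes slack prohibitively expensive, so the soft-margin
# solution approximates the maximal margin (hard-margin) hyperplane
clf = SVC(kernel="linear", C=1e6).fit(X, y)

beta, beta0 = clf.coef_.ravel(), clf.intercept_[0]

# In scikit-learn's parameterization the margin boundaries sit at
# beta^T x + beta0 = +/- 1, so the geometric margin is 1 / ||beta||
M = 1.0 / np.linalg.norm(beta)
print("margin:", M)
print("support vectors:\n", clf.support_vectors_)
```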

The Non-separable Case

  • The maximal margin classifier is a natural classifier, but a separating hyperplane is not guaranteed to exist

  • If a separating hyperplane doesn’t exist, we can choose an “almost” separating hyperplane by using a “soft” margin.

Support Vector Classifiers

Overview of the Support Vector Classifier

  • Separating hyperplanes don’t always exist, and even if they do, they may be undesirable.

  • The distance to the hyperplane can be thought of as a measure of confidence in the classification. For very small margins, the separating hyperplane is very sensitive to individual observations – we have low confidence in the classification of nearby observations.

  • In these situations, we may prefer a hyperplane that doesn’t perfectly separate in the interest of:
    • Greater robustness to individual observations
    • Better classification of most of the training observations
  • This is achieved by the support vector classifier or soft margin classifier 2

Details of the Support Vector Classifier

  • The support vector classifier is the solution to the optimization problem:

$$\begin{aligned} \underset{\boldsymbol{\beta}}{\text{argmax}}&\ M\\ \text{subject to}&\ ||\,\boldsymbol{\beta}\,|| = 1\\ & y_i(\boldsymbol{\beta}^\top x_i) \geqslant M(1-\epsilon_i)\\ & \epsilon_i \geqslant 0\\ & \sum_i \epsilon_i \leqslant C \end{aligned}$$

where $C \geqslant 0$ is a tuning parameter (a budget for how much the margin may be violated), $M$ is the margin, and the $\epsilon_i$ are slack variables; a sketch illustrating the role of $C$ follows this list.

  • Observations on the margin or on the wrong side of the margin are called support vectors
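A minimal sketch, assuming scikit-learn and simulated overlapping classes. One caveat on notation: scikit-learn's `C` multiplies the total slack in its objective, so it is roughly *inversely* related to the budget $C$ above, i.e. a large scikit-learn `C` allows few margin violations while a small one allows many:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Simulated two-class data with enough overlap that no separating hyperplane exists
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=1)

# Small sklearn C (large ISLR budget): wide soft margin, typically many support vectors
# Large sklearn C (small ISLR budget): few violations tolerated, typically fewer support vectors
for c in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=c).fit(X, y)
    print(f"C={c:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```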

Support Vector Machines

Classification with Non-Linear Decision Boundaries

  • The support vector classifier is a natural choice for two response classes when the class boundary is linear, but may perform poorly when the boundary is non-linear.

  • Non-linear transformations of the features will lead to a non-linear class boundary, but enlarging the feature space too much can lead to intractable computations.

  • The support vector machine enlarges the feature space in a way which is computationally efficient.

The Support Vector Machine

  • It can be shown that:
    • the linear support vector classifier is a model of the form $f(x) = \beta_0 + \sum_{i = 1}^n \alpha_i \langle x, x_i\rangle$
    • the parameter estimates $\hat{\alpha}_i, \hat{\beta}_0$ can be computed from the $\binom{n}{2}$ inner products $\langle x_i, x_{i'} \rangle$ between pairs of training observations
  • The support vector machine is a model of the form $f(x) = \beta_0 + \sum_{i = 1}^n \alpha_i K(x, x_i)$ where $K$ is a kernel function 3

  • Popular kernels 4 are
    • The polynomial kernel $K(x_i, x_i') = (1 + x_i^\top x_i')^d$
    • The radial kernel $K(x_i, x_i') = \exp(-\gamma\,||x_i - x_i'||^2)$
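A minimal sketch, assuming scikit-learn, fitting SVMs with both kernels on simulated data whose true class boundary is a circle. In scikit-learn's parameterization the polynomial kernel is $(\gamma\, x_i^\top x_i' + c_0)^d$, so `gamma=1, coef0=1` matches the form above:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Simulated data with a circular (non-linear) class boundary
X, y = make_circles(n_samples=200, factor=0.5, noise=0.1, random_state=0)

# Polynomial kernel (1 + x^T x')^d with d = 3
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0).fit(X, y)

# Radial (RBF) kernel exp(-gamma ||x - x'||^2)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)

for name, clf in [("polynomial", poly_svm), ("radial", rbf_svm)]:
    print(f"{name} kernel: train accuracy {clf.score(X, y):.2f}")
```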

SVMs with More than Two Classes

One-Versus-One Classification

This approach works as follows:

  1. Fit $\binom{K}{2}$ SVMs, one for each pair of classes $k, k'$, encoded as $\pm 1$ respectively.
  2. For each observation $x$, classify using each of the classifiers from step 1, and let $N_k$ be the number of times $x$ was assigned to class $k$.
  3. Predict $\hat{f}(x) = \underset{k}{\text{argmax}}\, N_k$
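A minimal sketch, assuming scikit-learn and its bundled iris data ($K = 3$ classes). scikit-learn's `SVC` already performs one-versus-one voting internally for multiclass problems; the wrapper below just makes the construction explicit:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Three classes, so binom(3, 2) = 3 pairwise SVMs are fit
X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_), "pairwise classifiers")  # 3
print(ovo.predict(X[:5]))                            # majority-vote predictions
```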

One-Versus-All Classification

This approach works as follows:

  1. Fit $K$ SVMs, each comparing class $k$ (encoded as $+1$) to the remaining $K-1$ classes combined (encoded as $-1$). Let $\beta_k = (\beta_{0k}, \dots, \beta_{pk})$ be the resulting parameters.
  2. Predict $\hat{f}(x) = \underset{k}{\text{argmax}}\, \beta_k^\top x$
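The corresponding one-versus-all sketch, again assuming scikit-learn and the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three classes, so K = 3 one-versus-rest SVMs are fit
X, y = load_iris(return_X_y=True)

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# Each underlying SVM's decision function plays the role of beta_k^T x;
# the predicted class is the one with the largest value
print(len(ova.estimators_), "one-vs-rest classifiers")  # 3
print(ova.predict(X[:5]))
```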

Relationship to Logistic Regression

  • The optimization problem leading to the support vector classifier can be rewritten as $\underset{\beta}{\text{argmin}}\left(\sum_{i = 1}^n \max\{0, 1 - y_i(\beta^\top x_i)\} + \lambda\,||\beta||^2\right)$ where $\lambda \geqslant 0$ is a tuning parameter 5.
  • The hinge loss 6 is very similar to the logistic regression loss, so both methods tend to give similar results. However, SVMs tend to perform better when the classes are well separated, while logistic regression tends to perform better when they are not.
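To see the similarity, here is a tiny NumPy sketch that evaluates the hinge loss and the logistic loss as functions of the margin $m = y_i(\beta^\top x_i)$; both are near zero for large positive margins and grow roughly linearly for large negative ones:

```python
import numpy as np

# Margins m = y_i * (beta^T x_i); negative values correspond to misclassifications
m = np.linspace(-3, 3, 7)

hinge = np.maximum(0.0, 1.0 - m)      # hinge loss used by the support vector classifier
logistic = np.log(1.0 + np.exp(-m))   # logistic regression (negative log-likelihood) loss

for mi, h, l in zip(m, hinge, logistic):
    print(f"m = {mi:+.1f}   hinge = {h:.3f}   logistic = {l:.3f}")
```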

Footnotes

  1. The constraint $||\,\boldsymbol{\beta}\,|| = 1$ ensures that the perpendicular distance $||x_i - P x_i||$ is given by $y_i(\beta^\top x_i)$

  2. Sometimes the maximal margin and support vector classifiers are called “hard margin” and “soft margin” support vector classifiers, respectively. 

  3. In this context a kernel function is a positive-definite kernel. Among other things, it is a generalization of an inner product (every inner product $\langle x, y \rangle$ is a kernel function), and is one way of quantifying similarity between points. In statistical and machine learning, a kernel method is one which makes use of the “kernel trick”. The kernel function $K(x_i, x_i')$ encodes the similarity of the observations $x_i, x_i'$ in a transformed feature space, but it is more computationally efficient to compute the $\binom{n}{2}$ kernel values themselves than to transform the data. The result is a support vector classifier (hence a linear classification boundary) in the transformed feature space, which corresponds to a non-linear boundary in the original feature space. 

  4. The polynomial kernel is effectively an inner product on the space of degree-$d$ polynomials in the features $X_j$. The radial kernel is a similarity measure in an infinite-dimensional feature space. 

  5. This is another instance of the general “regularized loss” or “loss + penalty” form $\underset{\beta}{\text{argmin}}\, L(\mathbf{X}, \mathbf{y}, \beta) + \lambda P(\beta)$, where the loss function $L(\mathbf{X}, \mathbf{y}, \beta)$ quantifies how well the model with parameter $\beta$ fits the data $(\mathbf{X}, \mathbf{y})$, and $P(\beta)$ is a penalty function whose strength is controlled by $\lambda$.

  6. In this case $L(\mathbf{X}, \mathbf{y}, \beta) = \sum_{i = 1}^n \max\{0, 1 - y_i(\beta^\top x_i)\}$ is called the hinge loss.