islr notes and exercises from An Introduction to Statistical Learning

9. Support Vector Machines

Maximal Margin Classifier

What Is a Hyperplane?

  • A hyperplane in $\mathbb{R}^p$ is an affine subspace of dimension $p-1$. Every hyperplane is the set of solutions $X$ to $\beta^\top X = 0$ for some $\beta\in\mathbb{R}^p$.
  • A hyperplane $\beta^\top X = 0$ partitions $\mathbb{R}^p$ into two halfspaces:

$$H_+ = \{X\in\mathbb{R}^p\ |\ \beta^\top X > 0\} \qquad H_- = \{X\in\mathbb{R}^p\ |\ \beta^\top X < 0\}$$

corresponding to either side of the plane, or equivalently,

$$H_+ = \{X\in\mathbb{R}^p\ |\ \text{sgn}(\beta^\top X) = 1\} \qquad H_- = \{X\in\mathbb{R}^p\ |\ \text{sgn}(\beta^\top X) = -1\}$$

Classification Using a Separating Hyperplane

  • Given data $(x_i, y_i)$, $i = 1,\dots, n$, with response classes $y_i \in \{ \pm 1\}$, a hyperplane $\beta^\top X = 0$ is separating if

$$\text{sgn}(\beta^\top x_i) = y_i$$

for all $i$.

  • Given a separating hyperplane, we may predict

$$\hat{y}_i = \text{sgn}(\beta^\top x_i)$$
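As a quick illustration, here is a minimal NumPy sketch (the coefficient vector `beta` and the sample points are made up for this example) of classifying points by which side of a separating hyperplane they fall on:

```python
import numpy as np

# Hypothetical coefficient vector for a hyperplane beta^T x = 0 in R^2
beta = np.array([2.0, -1.0])

# A few made-up points (one per row) to classify
X = np.array([[1.0, 1.0],   # beta^T x =  1.0 -> predict +1
              [0.5, 2.0],   # beta^T x = -1.0 -> predict -1
              [3.0, 0.0]])  # beta^T x =  6.0 -> predict +1

# Predicted class is the sign of beta^T x_i
y_hat = np.sign(X @ beta)
print(y_hat)  # [ 1. -1.  1.]
```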

The Maximal Margin Classifier

  • Separating hyperplanes are not unique (if one exists then uncountably many exist). A natural choice is the maximal margin hyperplane (or optimal separating hyperplane)

  • The margin is the minimal perpendicular distance to the hyperplane over the sample points, $M = \underset{i}{\min}\{\ ||x_i - P x_i||\ \}$, where $P$ is the projection matrix onto the hyperplane.

  • The points $(x_i, y_i)$ “on the margin” (where $||x_i - P x_i|| = M$) are called support vectors; a short numerical sketch of these quantities follows this list.
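To make these definitions concrete, here is a minimal NumPy sketch (the hyperplane and points are made up for illustration) that computes the perpendicular distances $||x_i - P x_i||$ both via the projection matrix $P$ and via the equivalent formula $|\beta^\top x_i| / ||\beta||$, and takes their minimum as the margin $M$:

```python
import numpy as np

# Hypothetical hyperplane beta^T x = 0 in R^2 and a few sample points
beta = np.array([1.0, 1.0])
X = np.array([[1.0, 2.0],
              [2.0, -1.0],
              [-3.0, 1.0]])

# Projection matrix onto the hyperplane {x : beta^T x = 0}
P = np.eye(2) - np.outer(beta, beta) / (beta @ beta)

# Perpendicular distances ||x_i - P x_i|| and the margin M (their minimum)
dists = np.linalg.norm(X - X @ P, axis=1)
M = dists.min()

# Equivalent formula: |beta^T x_i| / ||beta||
dists_alt = np.abs(X @ beta) / np.linalg.norm(beta)
print(np.allclose(dists, dists_alt), M)  # True 0.707...
```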

Construction of the Maximal Margin Classifier

The maximal margin classifier is the solution to the optimization problem:

$$\begin{aligned} \underset{\boldsymbol{\beta}}{\text{argmax}}&\ M\\ \text{subject to}&\ ||\,\boldsymbol{\beta}\,|| = 1\\ & y_i(\boldsymbol{\beta}^\top x_i) \geqslant M \end{aligned}$$

where the constraint must hold for every observation $i = 1, \dots, n$ 1
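In practice the maximal margin classifier can be approximated with a standard soft-margin solver by making margin violations prohibitively expensive. A minimal sketch, assuming scikit-learn is available and using simulated (essentially separable) data; note that scikit-learn fits a hyperplane with an intercept, $\beta_0 + \beta^\top x = 0$:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Simulated two-class data with well-separated clusters
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)

# A very large C makes slack prohibitively expensive, so the soft-margin
# solution approximates the maximal margin (hard-margin) hyperplane
clf = SVC(kernel="linear", C=1e6).fit(X, y)

beta, beta0 = clf.coef_.ravel(), clf.intercept_[0]

# In scikit-learn's parameterization the margin boundaries sit at
# beta^T x + beta0 = +/- 1, so the geometric margin is 1 / ||beta||
M = 1.0 / np.linalg.norm(beta)
print("margin:", M)
print("support vectors:\n", clf.support_vectors_)
```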

The Non-separable Case

  • The maximal margin classifier is a natural classifier, but a separating hyperplane is not guaranteed to exist

  • If a separating hyperplane doesn’t exist, we can choose an “almost” separating hyperplane by using a “soft” margin.

Support Vector Classifiers

Overview of the Support Vector Classifier

  • Separating hyperplanes don’t always exist, and even if they do, they may be undesirable.

  • The distance to the hyperplane can be thought of as a measure of confidence in the classification. For very small margins, the separating hyperplane is very sensitive to individual observations – we have low confidence in the classification of nearby observations.

  • In these situations, we may prefer a hyperplane that doesn’t perfectly separate in the interest of:
    • Greater robustness to individual observations
    • Better classification of most of the training observations
  • This is achieved by the support vector classifier or soft margin classifier 2

Details of the Support Vector Classifier

  • The support vector classifier is the solution to the optimization problem:

$$\begin{aligned} \underset{\boldsymbol{\beta}}{\text{argmax}}&\ M\\ \text{subject to}&\ ||\,\boldsymbol{\beta}\,|| = 1\\ & y_i(\boldsymbol{\beta}^\top x_i) \geqslant M(1-\epsilon_i)\\ & \epsilon_i \geqslant 0\\ & \sum_i \epsilon_i \leqslant C \end{aligned}$$

where $C \geqslant 0$ is a tuning parameter (a budget for how much the margin may be violated), $M$ is the margin, and the $\epsilon_i$ are slack variables; a sketch illustrating the role of $C$ follows this list.

  • Observations on the margin or on the wrong side of the margin are called support vectors
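A minimal sketch, assuming scikit-learn and simulated overlapping classes. One caveat on notation: scikit-learn's `C` multiplies the total slack in its objective, so it is roughly *inversely* related to the budget $C$ above, i.e. a large scikit-learn `C` allows few margin violations while a small one allows many:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Simulated two-class data with enough overlap that no separating hyperplane exists
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=1)

# Small sklearn C (large ISLR budget): wide soft margin, typically many support vectors
# Large sklearn C (small ISLR budget): few violations tolerated, typically fewer support vectors
for c in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=c).fit(X, y)
    print(f"C={c:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```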

Support Vector Machines

Classification with Non-Linear Decision Boundaries

  • The support vector classifier is a natural choice for two response classes when the class boundary is linear, but may perform poorly when the boundary is non-linear.

  • Non-linear transformations of the features will lead to a non-linear class boundary, but enlarging the feature space too much can lead to intractable computations.

  • The support vector machine enlarges the feature space in a way which is computationally efficient.

The Support Vector Machine

  • It can be shown that:
    • the linear support vector classifier is a model of the form $f(x) = \beta_0 + \sum_{i = 1}^n \alpha_i \langle x, x_i\rangle$
    • the parameter estimates $\hat{\alpha}_i, \hat{\beta}_0$ can be computed from the $\binom{n}{2}$ inner products $\langle x_i, x_{i'} \rangle$ between pairs of training observations
  • The support vector machine is a model of the form $f(x) = \beta_0 + \sum_{i = 1}^n \alpha_i K(x, x_i)$ where $K$ is a kernel function 3

  • Popular kernels 4 are
    • The polynomial kernel $K(x_i, x_i') = (1 + x_i^\top x_i')^d$
    • The radial kernel $K(x_i, x_i') = \exp(-\gamma\,||x_i - x_i'||^2)$
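A minimal sketch, assuming scikit-learn, fitting SVMs with both kernels on simulated data whose true class boundary is a circle. In scikit-learn's parameterization the polynomial kernel is $(\gamma\, x_i^\top x_i' + c_0)^d$, so `gamma=1, coef0=1` matches the form above:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Simulated data with a circular (non-linear) class boundary
X, y = make_circles(n_samples=200, factor=0.5, noise=0.1, random_state=0)

# Polynomial kernel (1 + x^T x')^d with d = 3
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0).fit(X, y)

# Radial (RBF) kernel exp(-gamma ||x - x'||^2)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)

for name, clf in [("polynomial", poly_svm), ("radial", rbf_svm)]:
    print(f"{name} kernel: train accuracy {clf.score(X, y):.2f}")
```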

SVMs with More than Two Classes

One-Versus-One Classification

This approach works as follows:

  1. Fit $\binom{K}{2}$ SVMs, one for each pair of classes $k, k'$, encoded as $\pm 1$ respectively.
  2. For each observation $x$, classify using each of the classifiers from step 1, and let $N_k$ be the number of times $x$ was assigned to class $k$.
  3. Predict $\hat{f}(x) = \underset{k}{\text{argmax}}\, N_k$
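A minimal sketch, assuming scikit-learn and its bundled iris data ($K = 3$ classes). scikit-learn's `SVC` already performs one-versus-one voting internally for multiclass problems; the wrapper below just makes the construction explicit:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Three classes, so binom(3, 2) = 3 pairwise SVMs are fit
X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_), "pairwise classifiers")  # 3
print(ovo.predict(X[:5]))                            # majority-vote predictions
```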

One-Versus-All Classification

This approach works as follows:

  1. Fit $K$ SVMs, each comparing class $k$ (encoded as $+1$) to the remaining $K-1$ classes combined (encoded as $-1$). Let $\beta_k = (\beta_{0k}, \dots, \beta_{pk})$ be the resulting parameters.
  2. Predict $\hat{f}(x) = \underset{k}{\text{argmax}}\, \beta_k^\top x$
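The corresponding one-versus-all sketch, again assuming scikit-learn and the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three classes, so K = 3 one-versus-rest SVMs are fit
X, y = load_iris(return_X_y=True)

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# Each underlying SVM's decision function plays the role of beta_k^T x;
# the predicted class is the one with the largest value
print(len(ova.estimators_), "one-vs-rest classifiers")  # 3
print(ova.predict(X[:5]))
```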

Relationship to Logistic Regression

  • The optimization problem leading to the support vector classifier can be rewritten as $\underset{\beta}{\text{argmin}}\left(\sum_{i = 1}^n \max\{0, 1 - y_i(\beta^\top x_i)\} + \lambda\,||\beta||^2\right)$ where $\lambda \geqslant 0$ is a tuning parameter 5.
  • The hinge loss 6 is very similar to the logistic regression loss, so both methods tend to give similar results. However, SVMs tend to perform better when the classes are well separated, while logistic regression tends to perform better when they are not.
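To see the similarity, here is a tiny NumPy sketch that evaluates the hinge loss and the logistic loss as functions of the margin $m = y_i(\beta^\top x_i)$; both are near zero for large positive margins and grow roughly linearly for large negative ones:

```python
import numpy as np

# Margins m = y_i * (beta^T x_i); negative values correspond to misclassifications
m = np.linspace(-3, 3, 7)

hinge = np.maximum(0.0, 1.0 - m)      # hinge loss used by the support vector classifier
logistic = np.log(1.0 + np.exp(-m))   # logistic regression (negative log-likelihood) loss

for mi, h, l in zip(m, hinge, logistic):
    print(f"m = {mi:+.1f}   hinge = {h:.3f}   logistic = {l:.3f}")
```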

Footnotes

  1. The constraint $||\,\boldsymbol{\beta}\,|| = 1$ ensures that the perpendicular distance $||x_i - P x_i||$ is given by $y_i(\beta^\top x_i)$

  2. Sometimes the maximal margin and support vector classifiers are called “hard margin” and “soft margin” support vector classifiers, respectively. 

  3. In this context a kernel function is a positive-definite kernel. Among other things, it is a generalization of an inner product (every inner product $\langle x, y \rangle$ is a kernel function), and is one way of quantifying similarity between points. In statistical and machine learning, a kernel method is one which makes use of the “kernel trick”. The kernel function $K(x_i, x_i')$ encodes the similarity of the observations $x_i, x_i'$ in a transformed feature space, but it is more computationally efficient to compute the $\binom{n}{2}$ kernel values themselves than to transform the data. The result is a support vector classifier (hence a linear classification boundary) in the transformed feature space, which corresponds to a non-linear boundary in the original feature space. 

  4. The polynomial kernel is effectively an inner product on the space of degree-$d$ polynomials in the features $X_j$. The radial kernel is a similarity measure in an infinite-dimensional feature space. 

  5. This is another instance of the general “regularized loss” or “loss + penalty” form $\underset{\beta}{\text{argmin}}\, L(\mathbf{X}, \mathbf{y}, \beta) + \lambda P(\beta)$, where the loss function $L(\mathbf{X}, \mathbf{y}, \beta)$ quantifies how well the model with parameter $\beta$ fits the data $(\mathbf{X}, \mathbf{y})$, and $P(\beta)$ is a penalty function whose strength is controlled by $\lambda$.

  6. In this case $L(\mathbf{X}, \mathbf{y}, \beta) = \sum_{i = 1}^n \max\{0, 1 - y_i(\beta^\top x_i)\}$ is called the hinge loss.