
Ch 9: Support Vector Machines

This material is prepared following James et al. (2013) and the slides at https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/

9.0 Introduction

• The support vector machine is a generalization of a simple and intuitive classifier called the maximal margin classifier.

• We discuss the support vector classifier, an extension of the maximal margin classifier that can be applied in a broader range of cases.

• We further consider the support vector machine, a further extension of the support vector classifier that accommodates non-linear class boundaries.

9.1 Maximal Margin Classifier

9.1.1 What Is a Hyperplane?

• A hyperplane in p dimensions is defined by

    {X = (X1, . . . , Xp)T : f(X) = β0 + β1X1 + · · · + βpXp = 0}   (1)

• The mathematical definition of a hyperplane is quite simple. In two dimensions, a hyperplane is defined by the equation

    β0 + β1X1 + β2X2 = 0

  for parameters β0, β1, β2. Any X = (X1, X2)T for which the above equation holds is a point on the hyperplane. Note that this is simply the equation of a line, since indeed in two dimensions a hyperplane is a line.

• If f(X) = β0 + β1X1 + · · · + βpXp, then f(X) > 0 for points on one side of the hyperplane and f(X) < 0 for points on the other. Thus, the sign of f(X) tells us on which side of the hyperplane a point X lies.


The hyperplane 1 + 2X1 + 3X2 = 0 is shown. The blue region is the set of points for which 1 + 2X1 + 3X2 > 0, and the purple region is the set of points for which 1 + 2X1 + 3X2 < 0.
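As a quick numerical check of the sign rule (a minimal Python sketch added for illustration, not part of the original notes), one can evaluate f(X) = 1 + 2X1 + 3X2 at a few points and read off the region from its sign:

    import numpy as np

    # Coefficients of the hyperplane f(X) = 1 + 2*X1 + 3*X2 from the figure above
    beta0, beta = 1.0, np.array([2.0, 3.0])

    def f(X):
        """Evaluate f(X) = beta0 + beta1*X1 + beta2*X2 for each row of X."""
        return beta0 + X @ beta

    points = np.array([[1.0, 1.0],    # f = 6  > 0  -> blue region
                       [-1.0, -1.0],  # f = -4 < 0  -> purple region
                       [0.0, 0.0]])   # f = 1  > 0  -> blue region
    print(np.sign(f(points)))         # [ 1. -1.  1.]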

9.1.2 Classification Using a Separating Hyperplane

• Suppose that we have an n × p data matrix X that consists of n training observations in p-dimensional space, and that these observations fall into two classes; that is, y1, . . . , yn ∈ {−1, 1}.

• We also have a test observation, a p-vector of observed features x∗ = (x∗1, . . . , x∗p)T. Our goal is to develop a classifier based on the training data that will correctly classify the test observation using its feature measurements.

• We will now see a new approach that is based upon the concept of a separating hyperplane.

• Consider a hyperplane that separates the training observations perfectly according to their class labels. Then a separating hyperplane has the property that

    β0 + β1xi1 + · · · + βpxip > 0 if yi = 1

  and

    β0 + β1xi1 + · · · + βpxip < 0 if yi = −1.

  Equivalently, a separating hyperplane has the property that

    yi(β0 + β1xi1 + · · · + βpxip) > 0

  for all i = 1, . . . , n. If a separating hyperplane exists, we can use it to construct a very natural classifier: a test observation is assigned a class depending on which side of the hyperplane it is located.

[Figure: two-class training observations plotted against X1 and X2, together with separating hyperplanes.]

9.1.3 The Maximal Margin Classifier

• Among all separating hyperplanes, find the one that makes the biggest gap, or margin, between the two classes.

• That is, we can compute the (perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin.

• The maximal margin hyperplane is the separating hyperplane for which the margin is largest.


Three training observations are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin.

• These three observations are known as support vectors, since they are vectors in p-dimensional space (in the figure, p = 2) and they “support” the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well.

9.1.4 Construction of the Maximal Margin Classifier

• The maximal margin hyperplane is the solution to the optimization problem

    maximize M over β0, β1, . . . , βp

    subject to  ∑_{j=1}^p βj² = 1,

                yi(β0 + β1xi1 + · · · + βpxip) ≥ M,   i = 1, . . . , n.

• The two constraints ensure that each observation is on the correct side of the hyperplane and at least a distance M from the hyperplane. Hence, M represents the margin of our hyperplane, and the optimization problem chooses β0, β1, . . . , βp to maximize M.
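In practice this optimization is solved numerically. The following is a hedged sketch, not the chapter's code: scikit-learn's SVC with a linear kernel and a very large penalty parameter approximates the maximal margin (hard-margin) classifier on separable toy data, and exposes the support vectors and the fitted coefficients.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Two linearly separable clusters in 2-d (hypothetical toy data)
    X = np.vstack([rng.normal(loc=-2.0, scale=0.5, size=(20, 2)),
                   rng.normal(loc=+2.0, scale=0.5, size=(20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    # A very large penalty forbids (almost all) margin violations,
    # so the fit approximates the maximal margin hyperplane.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    beta = clf.coef_.ravel()       # (beta_1, ..., beta_p), up to rescaling
    beta0 = clf.intercept_[0]
    print("support vectors:\n", clf.support_vectors_)
    print("gap between the classes (2M):", 2.0 / np.linalg.norm(beta))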


[Figure: two panels of two-class training data plotted against X1 and X2.]

Need a better classifier?

9.2 Support Vector Classifier

• It could be worthwhile to misclassify a few training observations in order to do a better job in classifying the remaining observations.

• The support vector classifier, sometimes called a soft margin classifier, does exactly this.

• The solution is the following optimization:

    maximize M over β0, β1, . . . , βp, ε1, . . . , εn

    subject to  ∑_{j=1}^p βj² = 1,

                yi(β0 + β1xi1 + · · · + βpxip) ≥ M(1 − εi),

                εi ≥ 0,   ∑_{i=1}^n εi ≤ C,

  where C is a nonnegative tuning parameter.

  – ε1, . . . , εn are termed slack variables; they allow individual observations to be on the wrong side of the margin or of the hyperplane. If εi = 0, the observation is on the correct side of the margin; if 0 < εi ≤ 1, the observation is on the wrong side of the margin; and if εi > 1, it is on the wrong side of the hyperplane.


  – The tuning parameter C is selected by cross-validation. As one can see, C bounds the sum of the εi's, and so it determines the number and severity of the violations to the margin (and to the hyperplane) that we will tolerate.

  – If C = 0, then there is no budget for violations to the margin, and it must be the case that ε1 = · · · = εn = 0.

  – As C increases, we become more tolerant of violations to the margin, and so the margin will widen. Conversely, as C decreases, we become less tolerant of violations to the margin, and so the margin narrows.
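A hedged sketch of choosing the tuning parameter by cross-validation with scikit-learn on simulated data (not the chapter's example). Note that the C argument of scikit-learn's SVC is a penalty on margin violations, so it behaves roughly inversely to the budget C above: small SVC values of C tolerate many violations (wide margin), large values tolerate few.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(1)

    # Two overlapping classes, so some slack is unavoidable (toy data)
    X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                   rng.normal(+1.0, 1.0, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    # 10-fold cross-validation over a grid of penalty values
    grid = GridSearchCV(SVC(kernel="linear"),
                        param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
                        cv=10)
    grid.fit(X, y)
    print("best penalty:", grid.best_params_, "CV accuracy:", grid.best_score_)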

A support vector classifier was fit using four different values of the tuning parameter C. The largest value of C was used in the top left panel, and smaller values were used in the top right, bottom left, and bottom right panels. When C is large, there is a high tolerance for observations being on the wrong side of the margin, and so the margin will be large. As C decreases, the tolerance for observations being on the wrong side of the margin decreases, and the margin narrows.


Left: The observations fall into two classes, with a non-linear boundary between them. Right: The support vector classifier seeks a linear boundary, and consequently performs very poorly.

9.3 Support Vector Machines

9.3.1 Classification with Non-linear Decision Boundaries

• Enlarge the space of features by including transformations, e.g. X1², X1³, X1X2, X1X2², . . . . Hence we go from a p-dimensional space to an M > p dimensional space.

• Fit a support vector classifier in the enlarged space. This results in non-linear decision boundaries in the original space.

• Example: Suppose we use (X1, X2, X1², X2², X1X2, X1³, X2³, X1X2², X1²X2) instead of (X1, X2). Then the decision boundary would be of the form

    β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + β6X1³ + β7X2³ + β8X1X2² + β9X1²X2 = 0.

  This leads to nonlinear decision boundaries in the original space.
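A minimal sketch of this feature-enlargement idea (my own illustration on simulated data, not from the notes): expanding (X1, X2) into all polynomial terms up to degree 3 gives exactly the nine features listed above, and a linear support vector classifier fit in that enlarged space yields a non-linear boundary in the original space.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(2)

    # Toy data with a circular (non-linear) class boundary
    X = rng.normal(size=(200, 2))
    y = np.where(np.sum(X**2, axis=1) > 1.4, 1, -1)

    # Degree-3 expansion: X1, X2, X1^2, X1*X2, X2^2, X1^3, X1^2*X2, X1*X2^2, X2^3
    model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                          StandardScaler(),
                          LinearSVC(C=1.0, max_iter=20000))
    model.fit(X, y)
    print("training accuracy:", model.score(X, y))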



9.3.2 The Support Vector Machine

• Polynomials (especially high-dimensional ones) get wild rather fast.

• There is a more elegant and controlled way to introduce nonlinearities in support vector classifiers: through the use of kernels.

• The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels.

• Before we discuss these, we must understand the role of inner products in support vector classifiers.

• The inner product of two observations xi, xi′ is given by

    ⟨xi, xi′⟩ = ∑_{j=1}^p xij xi′j.

• The linear support vector classifier can be represented as

    f(x) = β0 + ∑_{i=1}^n αi ⟨x, xi⟩.

• To estimate the parameters α1, . . . , αn and β0, all we need are the (n choose 2) = n(n − 1)/2 inner products ⟨xi, xi′⟩ between all pairs of training observations.

• It turns out that most of the α̂i can be zero:

    f(x) = β0 + ∑_{i∈S} α̂i ⟨x, xi⟩,

  where S is the support set of indices i such that α̂i > 0.
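This representation can be inspected in a fitted classifier. The sketch below (an illustration using scikit-learn, not the notes' code) rebuilds the decision value of a new point from the stored support vectors; note that scikit-learn's dual_coef_ stores the products yi·α̂i for the support vectors, which is why its entries can be negative.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(-1.5, 1.0, size=(40, 2)),
                   rng.normal(+1.5, 1.0, size=(40, 2))])
    y = np.array([-1] * 40 + [1] * 40)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    x_new = np.array([0.3, -0.2])
    # f(x) = beta0 + sum over support vectors of (y_i * alpha_i) * <x, x_i>
    f_manual = clf.intercept_[0] + np.sum(clf.dual_coef_.ravel() *
                                          (clf.support_vectors_ @ x_new))
    # The two values agree (up to floating point)
    print(f_manual, clf.decision_function([x_new])[0])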


• We consider a generalization of the inner product of the form K(xi, xi′), where K is some function that we will refer to as a kernel. A kernel is a function that quantifies the similarity of two observations.

• The solution has the form

    f(x) = β0 + ∑_{i∈S} α̂i K(x, xi).

• An example of a possible non-linear kernel is the polynomial kernel

    K(xi, xi′) = (1 + ∑_{j=1}^p xij xi′j)^d,

  where d is a positive integer. Another popular choice is the radial kernel, which takes the form

    K(xi, xi′) = exp(−γ ∑_{j=1}^p (xij − xi′j)²)

  with a positive constant γ.
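A hedged sketch of fitting these two kernels with scikit-learn on simulated data. With gamma=1 and coef0=1, scikit-learn's polynomial kernel (gamma·⟨xi, xi′⟩ + coef0)^d matches the form above, and kernel="rbf" is the radial kernel with parameter γ.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # a non-linear class boundary

    # Polynomial kernel K(x, x') = (1 + <x, x'>)^d with d = 3
    svm_poly = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

    # Radial kernel K(x, x') = exp(-gamma * ||x - x'||^2)
    svm_rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

    print("poly training accuracy:", svm_poly.score(X, y))
    print("rbf  training accuracy:", svm_rbf.score(X, y))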

• Example: Heart Data

  – Use 13 predictors such as Age, Sex, and Chol in order to predict whether an individual has heart disease.

ROC curves (true positive rate against false positive rate) for the Heart data training set. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10⁻³, 10⁻², and 10⁻¹.


ROC curves for the test set of the Heart data. Left: The support vector classifier and LDA are compared. Right: The support vector classifier is compared to an SVM using a radial basis kernel with γ = 10⁻³, 10⁻², and 10⁻¹.
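ROC curves like these can be reproduced in outline from the classifiers' decision values. The sketch below uses simulated data as a stand-in for the Heart data (the real file and its columns are not reproduced here), fits the support vector classifier and two radial-kernel SVMs, and computes test-set ROC/AUC with scikit-learn's roc_curve.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_curve, auc

    rng = np.random.default_rng(5)

    # Simulated stand-in for the Heart data: 13 predictors, binary response
    X = rng.normal(size=(300, 13))
    y = np.where(X[:, 0] + 0.5 * X[:, 1]**2 + rng.normal(size=300) > 0, 1, -1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    models = {
        "support vector classifier": SVC(kernel="linear", C=1.0),
        "SVM, radial, gamma=1e-3":   SVC(kernel="rbf", gamma=1e-3, C=1.0),
        "SVM, radial, gamma=1e-1":   SVC(kernel="rbf", gamma=1e-1, C=1.0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores = model.decision_function(X_te)   # larger => more likely class +1
        fpr, tpr, _ = roc_curve(y_te, scores)    # test-set ROC curve
        print(f"{name}: test AUC = {auc(fpr, tpr):.3f}")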

9.4 SVMs with More than Two Classes

• The SVM as defined works for K = 2 classes. What do we do if we have K > 2 classes?

• One-Versus-One Classification (OVO): Fit all (K choose 2) = K(K − 1)/2 pairwise classifiers f̂kℓ(x). Classify x∗ to the class that wins the most pairwise competitions.

• One-Versus-All Classification (OVA): Fit K different 2-class SVM classifiers f̂k(x), k = 1, . . . , K, each class versus the rest. Classify x∗ to the class for which f̂k(x∗) is largest.

• Which to choose? If K is not too large, use OVO.
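A hedged sketch of the two strategies with scikit-learn on toy three-class data: OneVsOneClassifier fits all K(K − 1)/2 pairwise SVMs and predicts by voting, while OneVsRestClassifier fits K one-versus-all SVMs and predicts the class with the largest decision value. (scikit-learn's SVC already uses the one-versus-one scheme internally for multiclass problems.)

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    rng = np.random.default_rng(6)

    # Three classes in 2-d (toy data)
    X = np.vstack([rng.normal(c, 0.8, size=(30, 2)) for c in (-3.0, 0.0, 3.0)])
    y = np.repeat([0, 1, 2], 30)

    ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)   # K(K-1)/2 fits
    ova = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)  # K fits

    x_new = np.array([[0.5, -0.2]])
    print("OVO prediction:", ovo.predict(x_new))
    print("OVA prediction:", ova.predict(x_new))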

9.5 Relationship to Logistic Regression

• One can rewrite the optimization for fitting the support vector classifier f(X) = β0 + β1X1 + · · · + βpXp as

    minimize  ∑_{i=1}^n max(0, 1 − yi f(xi)) + λ ∑_{j=1}^p βj²   over β0, β1, . . . , βp.


• The above form is like “Loss + Penalty”:

    minimize  L(X, y, β) + λ P(β)   over β0, β1, . . . , βp.

• In our case, the loss function is

    L(X, y, β) = ∑_{i=1}^n max(0, 1 − yi(β0 + β1xi1 + · · · + βpxip)),

  which is called the hinge loss.

[Figure: the SVM (hinge) loss and the logistic regression loss, plotted as a function of yi(β0 + β1xi1 + · · · + βpxip).]
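To connect the hinge-loss formula with the figure, the short sketch below (added for illustration) evaluates both losses on a grid of values of yi(β0 + β1xi1 + · · · + βpxip); the hinge loss is exactly zero once this quantity is at least 1, while the logistic loss is small but never exactly zero.

    import numpy as np

    # Grid of values of y_i * f(x_i) = y_i * (beta0 + beta1*x_i1 + ... + betap*x_ip)
    yf = np.linspace(-6, 2, 9)

    hinge_loss = np.maximum(0.0, 1.0 - yf)      # SVM (hinge) loss
    logistic_loss = np.log1p(np.exp(-yf))       # logistic regression loss

    for v, h, l in zip(yf, hinge_loss, logistic_loss):
        print(f"y*f(x) = {v:5.1f}   hinge = {h:5.2f}   logistic = {l:5.2f}")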