
Page 1: Support Vector Machines & Kernels Lecture 5 (people.csail.mit.edu/dsontag/courses/ml13/slides/lecture5.pdf)

Support Vector Machines & Kernels Lecture 5

David Sontag

New York University

Slides adapted from Luke Zettlemoyer and Carlos Guestrin

Page 2:

Support Vector Machines

QP form:

More “natural” form:

Empirical loss + regularization term

Equivalent if C = 1/(λn)
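Written out in the standard soft-margin form (the usual way these two formulations are stated):

$$\text{QP form:}\quad \min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{j=1}^{n}\xi_j \quad\text{s.t.}\quad y_j(w \cdot x_j + b) \ge 1 - \xi_j,\ \ \xi_j \ge 0$$

$$\text{Natural form:}\quad \min_{w,b}\ \underbrace{\frac{1}{n}\sum_{j=1}^{n}\max\big(0,\ 1 - y_j(w \cdot x_j + b)\big)}_{\text{empirical (hinge) loss}} \;+\; \underbrace{\frac{\lambda}{2}\|w\|^2}_{\text{regularization term}}$$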

Page 3:

Subgradient method

Page 4:

Subgradient method

Step size:
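For the objective in its natural (hinge-loss) form, and writing the classifier without a bias term as Pegasos does, a valid subgradient and the resulting update are (a standard presentation; the step size η_t = 1/(λt) is the choice Pegasos uses on the following slides):

$$g_t = \lambda w_t - \frac{1}{n}\sum_{j:\, y_j (w_t \cdot x_j) < 1} y_j x_j, \qquad w_{t+1} = w_t - \eta_t\, g_t$$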

Page 5:

Stochastic subgradient

Subgradient

Page 6:

PEGASOS

Each iteration: a subgradient step computed on a random subset A_t ⊆ S of the training data, followed by a projection step.

Special cases: A_t = S recovers the subgradient method; |A_t| = 1 gives stochastic (sub)gradient descent.
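A minimal Python sketch of the |A_t| = 1 case, following the standard description of Pegasos; the function name is illustrative, and the projection onto the ball of radius 1/√λ is the optional step from the original algorithm, assumed here rather than read off the slide:

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Pegasos with |A_t| = 1: stochastic subgradient descent on
    lam/2 * ||w||^2 + (1/n) * sum_j max(0, 1 - y_j * w.x_j)."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for t in range(1, T + 1):
        i = rng.integers(n)               # pick one training example uniformly at random
        eta = 1.0 / (lam * t)             # step size eta_t = 1 / (lambda * t)
        if y[i] * X[i].dot(w) < 1:        # hinge loss active: subgradient is lam*w - y_i*x_i
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                             # hinge loss inactive: only the regularizer contributes
            w = (1 - eta * lam) * w
        norm = np.linalg.norm(w)          # optional projection onto {w : ||w|| <= 1/sqrt(lam)}
        if norm > 0:
            w *= min(1.0, 1.0 / (np.sqrt(lam) * norm))
    return w
```

Each iteration touches a single example, so the cost per step is proportional to the number of (non-zero) features, which is what makes the run-time on the next slide independent of the number of examples.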

Page 7:

Run-Time of Pegasos

•  Choosing |A_t| = 1

Run-time required for Pegasos to find an ε-accurate solution with probability ≥ 1 - δ

•  Run-time does not depend on #examples

•  Depends on “difficulty” of problem (λ and ε)

n = # of features
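Up to logarithmic factors and the dependence on δ, the bound referred to here is $\tilde{O}\!\left(\frac{n}{\lambda\,\epsilon}\right)$: the number of iterations needed depends only on λ and ε, and with |A_t| = 1 each iteration costs O(n) (or the number of non-zero features, for sparse data).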

Page 8:

Experiments

•  3 datasets (provided by Joachims)

–  Reuters CCAT (800K examples, 47k features)

–  Physics ArXiv (62k examples, 100k features)

–  Covertype (581k examples, 54 features)

Training Time (in seconds):

                Pegasos   SVM-Perf   SVM-Light
Reuters               2         77       20,075
Covertype             6         85       25,514
Astro-Physics         2          5           80

Page 9:

What’s Next!

•  Learn about one of the most interesting and exciting recent advances in machine learning – the “kernel trick”

– High dimensional feature spaces at no extra cost

•  But first, a detour – Constrained optimization!

Page 10:

Constrained optimization

Minimizing x²: with no constraint, x* = 0; with the constraint x ≥ -1, still x* = 0; with the constraint x ≥ 1, x* = 1.

How do we solve with constraints? Lagrange Multipliers!!!

Page 11:

Lagrange multipliers – Dual variables

Introduce Lagrangian (objective):

We will solve:

Add Lagrange multiplier

Add new constraint
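For the running example of minimizing x² subject to x ≥ b (the objective used in the case analysis below), the Lagrangian and the problem we will solve take the form:

$$L(x,\alpha) = x^2 - \alpha\,(x - b), \qquad \min_x\ \max_{\alpha \ge 0}\ L(x,\alpha)$$

The Lagrange multiplier is α, and the new constraint is α ≥ 0.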

Rewrite the constraint x ≥ b as x - b ≥ 0.

Why is this equivalent? min is fighting max!

•  x < b: then (x - b) < 0, so max_{α ≥ 0} -α(x - b) = ∞. min won't let this happen!

•  x > b: then (x - b) > 0, so max_{α ≥ 0} -α(x - b) = 0 with α* = 0. min is cool with 0, and L(x, α) = x² (the original objective).

•  x = b: α can be anything, and L(x, α) = x² (the original objective).

The min on the outside forces max to behave, so the constraints will be satisfied.

Page 12:

Dual SVM derivation (1) – the linearly separable case (hard margin SVM)

Original optimization problem:

Lagrangian:

Rewrite constraints

One Lagrange multiplier per example

Our goal now is to solve:
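In standard form, these are:

$$\text{Original problem:}\quad \min_{w,b}\ \frac{1}{2}\|w\|^2 \quad\text{s.t.}\quad y_j(w \cdot x_j + b) - 1 \ge 0 \ \ \forall j$$

$$\text{Lagrangian:}\quad L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_j \alpha_j \big[\, y_j(w \cdot x_j + b) - 1 \,\big], \qquad \alpha_j \ge 0$$

$$\text{Goal:}\quad \min_{w,b}\ \max_{\alpha \ge 0}\ L(w,b,\alpha)$$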

Page 13:

Dual SVM derivation (2) – the linearly separable case (hard margin SVM)

Swap min and max

Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!

(Primal)

(Dual)
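In symbols, the swap reads:

$$\underbrace{\min_{w,b}\ \max_{\alpha \ge 0}\ L(w,b,\alpha)}_{\text{(Primal)}} \;=\; \underbrace{\max_{\alpha \ge 0}\ \min_{w,b}\ L(w,b,\alpha)}_{\text{(Dual)}}$$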

Page 14:

Dual SVM derivation (3) – the linearly separable case (hard margin SVM)

Can solve for optimal w, b as function of α:

$$\frac{\partial L}{\partial w} = w - \sum_j \alpha_j y_j x_j = 0 \quad\Longrightarrow\quad w = \sum_j \alpha_j y_j x_j$$

$$\frac{\partial L}{\partial b} = -\sum_j \alpha_j y_j = 0 \quad\Longrightarrow\quad \sum_j \alpha_j y_j = 0$$

(Dual)

Substituting these values back in (and simplifying), we obtain:

(Dual)

The expression involves only sums over all training examples, dot products between examples, and scalars.
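In its standard form, the resulting dual is:

$$\max_{\alpha}\ \sum_j \alpha_j - \frac{1}{2}\sum_j \sum_k \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k) \qquad\text{s.t.}\quad \alpha_j \ge 0\ \ \forall j, \quad \sum_j \alpha_j y_j = 0$$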

Page 15:

Reminder: What if the data is not linearly separable?

Use features of features of features of features….

Feature space can get really large really quickly!

$$\Phi(x) = \begin{bmatrix} x^{(1)} \\ \vdots \\ x^{(n)} \\ x^{(1)} x^{(2)} \\ x^{(1)} x^{(3)} \\ \vdots \\ e^{x^{(1)}} \\ \vdots \end{bmatrix}$$

Page 16:

Higher order polynomials

[Plot: number of monomial terms versus number of input dimensions, shown for degrees d = 2, 3, 4.]

m = number of input features, d = degree of the polynomial.

The number of terms grows fast! For d = 6 and m = 100: about 1.6 billion terms.
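To make the growth precise (a standard count, not taken from the slide): the number of monomials of degree exactly d in m input features is

$$\binom{m + d - 1}{d}, \qquad\text{e.g.}\quad \binom{105}{6} = 1{,}609{,}344{,}100 \approx 1.6 \text{ billion for } d = 6,\ m = 100.$$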

Page 17:

Dual formulation only depends on dot-products of the features!

First, we introduce a feature mapping:

Next, replace the dot product with an equivalent kernel function:
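Concretely, with Φ the feature mapping from the previous slides, the dual only changes in the dot product:

$$x_j \cdot x_k \ \longrightarrow\ \Phi(x_j) \cdot \Phi(x_k) \ =\ K(x_j, x_k)$$

$$\max_{\alpha}\ \sum_j \alpha_j - \frac{1}{2}\sum_j \sum_k \alpha_j \alpha_k\, y_j y_k\, K(x_j, x_k) \qquad\text{s.t.}\quad \alpha_j \ge 0\ \ \forall j, \quad \sum_j \alpha_j y_j = 0$$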

Page 18:

Polynomial kernel

Polynomials of degree exactly d

d=1

$$\Phi(u) \cdot \Phi(v) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \cdot \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = u_1 v_1 + u_2 v_2 = u \cdot v$$

d=2

$$\Phi(u) \cdot \Phi(v) = \begin{bmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{bmatrix} \cdot \begin{bmatrix} v_1^2 \\ v_1 v_2 \\ v_2 v_1 \\ v_2^2 \end{bmatrix} = u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2 = (u_1 v_1 + u_2 v_2)^2 = (u \cdot v)^2$$

For any d (we will skip the proof):

$$\Phi(u) \cdot \Phi(v) = (u \cdot v)^d$$
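A quick numerical check of the d = 2 identity above (illustrative code, not from the slides):

```python
import numpy as np

def phi_d2(x):
    """Explicit degree-2 feature map for a 2-d input: all products x_i * x_j."""
    return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

explicit = phi_d2(u).dot(phi_d2(v))   # dot product in the expanded feature space
kernel = u.dot(v) ** 2                # polynomial kernel (u.v)^d with d = 2
print(explicit, kernel)               # both are 1.0, since (1*3 + 2*(-1))^2 = 1
```

The kernel evaluation never builds the expanded feature vector, which is the point of the trick.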

Page 19:

Common kernels

•  Polynomials of degree exactly d

•  Polynomials of degree up to d

•  Gaussian kernels

•  Sigmoid

•  And many others: very active area of research!
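The standard forms of the kernels listed above (the exact parameterizations on the slide may differ):

•  Polynomials of degree exactly d: K(u, v) = (u · v)^d

•  Polynomials of degree up to d: K(u, v) = (u · v + 1)^d

•  Gaussian (RBF): K(u, v) = exp(-‖u - v‖² / 2σ²)

•  Sigmoid: K(u, v) = tanh(η u · v + ν)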