CSE446: SVMs Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin, and Luke Zettelmoyer


Page 1:

CSE446: SVMs
Spring 2017

Ali Farhadi

Slides adapted from Carlos Guestrin, and Luke Zettelmoyer

Page 2:

Linear classifiers – Which line is better?

Page 3:

Pick the one with the largest margin!

w•x = ∑i wi xi

(Figure: a linearly separable dataset split by the line w•x + w0 = 0; points on the positive side satisfy w•x + w0 > 0 (or >> 0 far from the line), points on the negative side satisfy w•x + w0 < 0 (or << 0), and the margin γ is marked between the line and the closest points on each side.)

Margin for point j:  γj = yj (w•xj + w0)

Max margin:  maximize γ over γ, w, w0  subject to  yj (w•xj + w0) ≥ γ for all j
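The two definitions above can be made concrete with a tiny numeric sketch (illustrative, not from the slides): for a fixed candidate (w, w0), compute each point's margin yj (w•xj + w0) and take the minimum, which is the γ the max-margin problem tries to make as large as possible. The data, w, and w0 below are made up; NumPy is assumed.

```python
# Illustrative sketch: per-point margins y_j (w.x_j + w0) and the dataset
# margin gamma = min over j, for a fixed candidate (w, w0).
import numpy as np

w, w0 = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 2.0], [3.0, 0.0], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([+1, +1, -1, -1])

margins = y * (X @ w + w0)   # positive iff the point is classified correctly
gamma = margins.min()        # the quantity the max-margin problem enlarges
print(margins, gamma)
```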

Page 4:

How many possible solutions?

w•x + w0 = 0

Any other ways of writing the same dividing line?
• w.x + b = 0
• 2w.x + 2b = 0
• 1000w.x + 1000b = 0
• …
• Any constant scaling has the same intersection with the z = 0 plane, so it is the same dividing line!

Do we really want to maximize over γ, w, w0?
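A quick numeric check of the scaling argument (illustrative, not from the slides; NumPy assumed): multiplying w and b by any positive constant leaves every sign, and hence the dividing line, unchanged.

```python
# Illustrative sketch: scaling (w, b) by any positive constant c leaves the
# dividing line, and therefore every prediction, unchanged.
import numpy as np

w, b = np.array([1.0, -2.0]), 0.5
X = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 2.0], [-1.0, -1.0]])

for c in [1, 2, 1000]:
    print(c, np.sign(c * (X @ w) + c * b))   # same signs for every scaling c
```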

Page 5:

Review: Normal to a plane

w•x + w0 = 0

Key terms:
• the projection of xj onto w
• w/||w|| : the unit vector normal to the plane
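A small sketch of the two key terms (illustrative; the weights and point are made up, NumPy assumed): projecting xj onto w and using the unit normal w/||w|| gives the signed distance of xj to the plane w•x + w0 = 0.

```python
# Illustrative sketch of the two "key terms": the projection of x_j onto w and
# the unit normal w/||w||, which give the signed distance of x_j to the plane.
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0
xj = np.array([2.0, 1.0])

w_unit = w / np.linalg.norm(w)                   # unit vector normal to the plane
proj_len = (w @ xj) / np.linalg.norm(w)          # length of the projection of xj onto w
signed_dist = (w @ xj + w0) / np.linalg.norm(w)  # signed distance from xj to w.x + w0 = 0
print(w_unit, proj_len, signed_dist)
```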

Page 6:

(Figure: the canonical lines w•x + w0 = +1 and w•x + w0 = -1 on either side of the decision boundary w•x + w0 = 0, with a point x+ on the +1 line, a point x- on the -1 line, and the margin γ between each line and the boundary.)

Assume: x+ lies on the positive line (w•x+ + w0 = +1) and x- on the negative line (w•x- + w0 = -1).

Final result: we can maximize the constrained margin by minimizing ||w|| (equivalently, ||w||²)!
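The "final result" compresses a short derivation that appears only in the slide's figure. A standard reconstruction under the stated assumptions (x+ and x- are the closest points on the two canonical lines, separated by a distance 2γ along the unit normal w/||w||):

```latex
% Margin of the canonical hyperplanes (requires amsmath).
\begin{gather*}
  w \cdot x^{+} + w_0 = +1, \qquad w \cdot x^{-} + w_0 = -1
  \;\;\Longrightarrow\;\; w \cdot (x^{+} - x^{-}) = 2 \\
  x^{+} - x^{-} = 2\gamma \,\frac{w}{\lVert w \rVert}
  \;\;\Longrightarrow\;\;
  2\gamma \,\frac{w \cdot w}{\lVert w \rVert} = 2
  \;\;\Longrightarrow\;\;
  2\gamma = \frac{2}{\lVert w \rVert}
\end{gather*}
```

So making the margin 2γ as large as possible is the same as making ||w|| as small as possible, which is why the objective becomes minimizing ||w|| (or ||w||²).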

Page 7:

Max margin using canonical hyperplanes

The assumption of canonical hyperplanes (at +1 and -1) changes the objective and the constraints!

(Figure: the canonical hyperplanes w.x + b = +1 and w.x + b = -1 around the decision boundary w.x + b = 0, with x+ and x- on the canonical lines and margin γ.)

Resulting formulation:  minimize over w, b:  w.w   subject to  yj (w.xj + b) ≥ 1 for all j

Page 8:

Support vector machines (SVMs)

• Solve efficiently by quadratic programming (QP)
  – Well-studied solution algorithms
  – Not simple gradient ascent, but close

• Decision boundary defined by support vectors

(Figure: the decision boundary w.x + b = 0 with the canonical hyperplanes w.x + b = +1 and w.x + b = -1, and a margin of width 2γ between them.)

Support vectors:
• the data points on the canonical lines

Non-support vectors:
• everything else
• moving them will not change w
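As a concrete illustration of the QP solution and of which points end up as support vectors, here is a minimal sketch assuming scikit-learn is available; the toy data and the large C used to approximate a hard margin are made up, not from the slides.

```python
# Minimal sketch (not from the slides) of fitting a linear max-margin
# classifier and inspecting its support vectors, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two clusters in 2D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin SVM described on this slide.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w  =", clf.coef_[0])                        # learned weight vector
print("b  =", clf.intercept_[0])                   # learned bias
print("support vectors:\n", clf.support_vectors_)  # points on the canonical lines
```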

Page 9:

What if the data is not linearly separable?

Add More Features!!!

Can use Kernels… (more on this later)

What about overfitting?

Page 10:

What if the data is still not linearly separable?

• First idea: jointly minimize w.w and the number of training mistakes
  – Objective:  minimize over w, w0:  w.w + C #(mistakes)
  – How to trade off the two criteria?
  – Pick C on a development / cross-validation set

• Tradeoff between #(mistakes) and w.w
  – 0/1 loss
  – Not a QP anymore
  – Also doesn’t distinguish near misses from really bad mistakes
  – NP-hard to find the optimal solution!

Page 11:

Slack variables – Hinge loss

For each data point:
• If margin ≥ 1, don’t care
• If margin < 1, pay a linear penalty

minimize over w, w0, ξ:  w.w + C Σj ξj
subject to:  yj (w.xj + w0) ≥ 1 - ξj ,  ξj ≥ 0

Slack penalty C > 0:
• C = ∞ → have to separate the data!
• C = 0 → ignore the data entirely!
• Select C on a dev. set, etc.

(Figure: the lines w.x + w0 = +1, w.x + w0 = 0, and w.x + w0 = -1, with the slack ξj marked for each point that violates its margin.)

Page 12:

Slack variables – Hinge loss

(Figure: same picture as the previous slide, with slack ξj for the margin violations.)

Constrained form:  minimize over w, w0, ξ:  w.w + C Σj ξj
subject to:  yj (w.xj + w0) ≥ 1 - ξj ,  ξj ≥ 0

Hinge-loss form (equivalent, unconstrained):
minimize over w, w0:  w.w + C Σj [1 - yj (w.xj + w0)]+

where [x]+ = max(x, 0)

Regularization: the w.w term.

Solving SVMs:
• Differentiate and set equal to zero!
• No closed-form solution, but the quadratic program (top) is convex
• The hinge loss is not differentiable, so (sub)gradient descent is a little trickier…
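A minimal NumPy sketch (illustrative, not the course's implementation) of the subgradient approach hinted at in the last bullet, applied to the unconstrained objective w.w + C Σj [1 - yj (w.xj + w0)]+; the learning rate, epoch count, and toy data are arbitrary choices.

```python
# Minimal sketch (illustrative): subgradient descent on
#   w.w + C * sum_j [1 - y_j (w.x_j + w0)]_+
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        viol = margins < 1                       # points inside the margin
        # Subgradient of the objective w.r.t. w and w0
        grad_w = 2 * w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

# Toy usage
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])
w, w0 = train_linear_svm(X, y)
print("predictions:", np.sign(X @ w + w0))
```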

Page 13:

Logistic Regression as Minimizing Loss

Logistic regression assumes:
P(Y = 1 | x, w) = 1 / (1 + exp(-(w.x + w0)))

And tries to maximize the data likelihood, for Y = {-1, +1}:
Πj P(yj | xj, w) = Πj 1 / (1 + exp(-yj (w.xj + w0)))

Equivalent to minimizing the log loss:
Σj ln(1 + exp(-yj (w.xj + w0)))

Page 14:

SVMs vs Regularized Logistic Regression

SVM objective:
minimize over w, w0:  w.w + C Σj [1 - yj (w.xj + w0)]+

Logistic regression objective:
minimize over w, w0:  w.w + C Σj ln(1 + exp(-yj (w.xj + w0)))

Tradeoff: same l2 regularization term, but a different error term

[x]+ = max(x, 0)

Page 15:

Graphing loss vs. margin

We want to smoothly approximate the 0/1 loss!

(Figure: each loss plotted as a function of the margin yj (w.xj + w0).)
• 0-1 loss: 1 if the margin is ≤ 0, otherwise 0
• Hinge loss: [1 - yj (w.xj + w0)]+
• Logistic regression (log loss): ln(1 + exp(-yj (w.xj + w0)))
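The figure's three curves can be reproduced numerically; a small sketch (illustrative, NumPy assumed) that evaluates each loss as a function of the margin m = yj (w.xj + w0):

```python
# Minimal sketch (illustrative): the three losses from the figure, each as a
# function of the margin m = y * (w.x + w0).
import numpy as np

def zero_one_loss(m):
    return (m <= 0).astype(float)

def hinge_loss(m):
    return np.maximum(1.0 - m, 0.0)

def log_loss(m):
    return np.log1p(np.exp(-m))

margins = np.linspace(-2, 3, 6)
for name, fn in [("0-1", zero_one_loss), ("hinge", hinge_loss), ("log", log_loss)]:
    print(name, np.round(fn(margins), 3))
```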

Page 16:

What about multiple classes?

Page 17:

One against All

Learn 3 classifiers:
• + vs {0, -}, weights w+
• - vs {0, +}, weights w-
• 0 vs {+, -}, weights w0

Output for x: y = argmaxi wi•x  (sketched in code below)

(Figure: the three one-vs-all weight vectors w+, w-, w0 drawn over the data.)

Any problems? Could we learn this dataset?
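A minimal sketch of the one-against-all recipe above (illustrative; it assumes scikit-learn and made-up toy data): train one binary linear classifier per class and predict with the argmax of the per-class scores.

```python
# Minimal one-against-all sketch (illustrative), assuming scikit-learn.
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[2.0, 0.0], [3.0, 0.5], [-2.0, 0.0], [-3.0, -0.5],
              [0.0, 2.0], [0.5, 3.0]])
y = np.array(["+", "+", "-", "-", "0", "0"])
classes = ["+", "-", "0"]

scores = np.zeros((len(X), len(classes)))
for i, c in enumerate(classes):
    clf = LinearSVC(C=1.0)                     # weights w_c for "c vs. the rest"
    clf.fit(X, (y == c).astype(int))
    scores[:, i] = clf.decision_function(X)    # score w_c . x (+ bias) for each point

pred = [classes[i] for i in scores.argmax(axis=1)]   # argmax over the 3 classifiers
print(pred)
```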

Page 18:

Learn 1 classifier: Multiclass SVM

Simultaneously learn 3 sets of weights:
• How do we guarantee the correct labels?
• Need new constraints!

For each class y' ≠ yj:  wyj•xj ≥ wy'•xj + 1

(Figure: the jointly learned weight vectors w+, w-, w0.)

Page 19:

Learn 1 classifier: Multiclass SVM

Also, can introduce slack variables, as before:

Now, can we learn it?
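One common way to read the slack version of these constraints (this exact formulation is an assumption, not confirmed by the slide text) is a per-example multiclass hinge loss: the true class's score should beat every other class's score by at least 1, and any shortfall is paid as slack. A minimal NumPy sketch:

```python
# Minimal sketch (illustrative): multiclass hinge loss for one example, following
# the constraint "score of the true class should beat every other class by 1".
import numpy as np

def multiclass_hinge(W, x, y_true):
    """W: (num_classes, d) weight matrix; x: (d,) features; y_true: true class index."""
    scores = W @ x
    margins = scores - scores[y_true] + 1.0   # how much each constraint is violated
    margins[y_true] = 0.0                     # no constraint against the true class itself
    return np.maximum(margins, 0.0).sum()     # total slack needed for this example

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # rows: w+, w0, w-
x = np.array([2.0, 0.5])
print(multiclass_hinge(W, x, y_true=0))   # 0 when class 0 wins every margin by >= 1
```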


Page 20:

What you need to know
• Maximizing the margin
• Derivation of the SVM formulation
• Slack variables and hinge loss
• Tackling multiple classes
  – One against All
  – Multiclass SVMs