
Page 1:

Support vector machines (I): Overview and Linear SVM

LING 572

Fei Xia

1

Page 2:

Why another learning method?

• It is based on some “beautifully simple” ideas (Schölkopf, 1998)

– Maximum margin decision hyperplane

• Member of class of kernel models (vs. attribute models)

• Empirically successful:

– Performs well on many practical applications

– Robust to noisy data and complex distributions

– Natural extensions to semi-supervised learning

2

Page 3:

Kernel methods

• Family of “pattern analysis” algorithms

• Best known element is the Support Vector Machine (SVM)

• Maps instances into higher dimensional feature space efficiently

• Applicable to:

– Classification

– Regression

– Clustering

– ….

3

Page 4:

History of SVM

• Linear classifier: 1962

– Use a hyperplane to separate examples

– Choose the hyperplane that maximizes the minimal margin

• Non-linear SVMs:

– Kernel trick: 1992

4

Page 5:

History of SVM (cont)

• Soft margin: 1995

– To deal with non-separable data or noise

• Semi-supervised variants:

– Transductive SVM: 1998

– Laplacian SVMs: 2006

5

Page 6:

Main ideas

• Use a hyperplane to separate the examples.

• Among all the hyperplanes wx+b=0, choose the one with the maximum margin.

• Maximizing the margin is the same as minimizing ||w|| subject to some constraints.

6

Page 7:

Main ideas (cont)

• For data sets that are not linearly separable, map the data to a higher-dimensional space and separate them there with a hyperplane.

• The Kernel trick allows the mapping to be “done” efficiently.

• The soft margin deals with noise and/or inseparable data sets.

7

Page 8:

Papers

• (Manning et al., 2008)

– Chapter 15

• (Collins and Duffy, 2001): tree kernel

8

Page 9:

Outline

• Linear SVM
  – Maximizing the margin
  – Soft margin

• Nonlinear SVM
  – Kernel trick

• A case study

• Handling multi-class problems

9

Page 10:

Inner product vs. dot product

Page 11:

Dot product
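The slide's formula is not reproduced here; the standard definition it refers to, for vectors x, y in R^n, is:

$$\mathbf{x} \cdot \mathbf{y} \;=\; \sum_{i=1}^{n} x_i y_i \;=\; \|\mathbf{x}\|\,\|\mathbf{y}\|\cos\theta$$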

Page 12:

Inner product

• An inner product is a generalization of the dot product.

• It is a function that satisfies the following properties:
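The properties themselves are not reproduced here; the standard axioms for a real inner product ⟨·,·⟩ (a reconstruction, not the slide's own wording) are:

$$\langle x, y \rangle = \langle y, x \rangle \quad \text{(symmetry)}$$
$$\langle a x + b z,\; y \rangle = a\langle x, y\rangle + b\langle z, y\rangle \quad \text{(linearity in the first argument)}$$
$$\langle x, x \rangle \ge 0, \ \text{with equality iff } x = 0 \quad \text{(positive-definiteness)}$$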

Page 13:

Some examples

Page 14:

Linear SVM

14

Page 15:

The setting

• Input: x ∈ X
  – x is a vector of real-valued feature values

• Output: y ∈ Y, Y = {-1, +1}

• Training set: S = {(x1, y1), …, (xN, yN)}

• Goal: Find a function y = f(x) that fits the data:

  f: X → R

15

Page 16:

Notations

16

Page 17:

Linear classifier

• Consider the 2-D data below

• +: Class +1

• -: Class -1

• Can we draw a line that separates the two classes?

[Figure: 2-D scatter plot of + points and - points, separable by a line]

17

Page 18:

Linear classifier

• Consider the 2-D data below

• +: Class +1

• -: Class -1

• Can we draw a line that separates the two classes?

• Yes!

  – We have a linear classifier/separator; in more than 2 dimensions, a hyperplane

[Figure: 2-D scatter plot of + points and - points, separable by a line]

18

Page 19:

Linear classifier

• Consider the 2-D data below

• +: Class +1

• -: Class -1

• Can we draw a line that separates the two classes?

• Yes!

  – We have a linear classifier/separator; in more than 2 dimensions, a hyperplane

• Is this the only such separator?

[Figure: 2-D scatter plot of + points and - points, separable by a line]

19

Page 20:

Linear classifier

• Consider the 2-D data below

• +: Class +1

• -: Class -1

• Can we draw a line that separates the two classes?

• Yes!

  – We have a linear classifier/separator; in more than 2 dimensions, a hyperplane

• Is this the only such separator?

  – No

[Figure: 2-D scatter plot of + points and - points, separable by a line]

20

Page 21:

Linear classifier

• Consider the 2-D data below

• +: Class +1

• -: Class -1

• Can we draw a line that separates the two classes?

• Yes!

  – We have a linear classifier/separator; in more than 2 dimensions, a hyperplane

• Is this the only such separator?

  – No

• Which is the best?

[Figure: 2-D scatter plot of + points and - points, separable by a line]

21

Page 22:

Maximum Margin Classifier

• What’s the best classifier?
  – Maximum margin
    • Biggest distance between the decision boundary and the closest examples

• Why is this better?
  – Intuition:
    • Which instances are we most sure of? Those furthest from the boundary.
    • Which are we least sure of? Those closest to it.
    • Create the boundary with the most ‘room’ for error in the attributes

[Figure: 2-D scatter plot with the maximum-margin separator between the + and - points]

22

Page 23:

Complicating Classification

• Consider the new 2-D data below:

+: Class +1; -: Class -1

• Can we draw a line that separates the two classes?

[Figure: 2-D scatter plot where some + and - points are intermixed (not linearly separable)]

23

Page 24:

Complicating Classification

• Consider the new 2-D data below

+: Class +1; -: Class -1

• Can we draw a line that separates the two classes?
  – No.

• What do we do?
  – Give up and try another classifier? No.

[Figure: 2-D scatter plot where some + and - points are intermixed (not linearly separable)]

24

Page 25:

Noisy/Nonlinear Classification

• Consider the new 2-D data below: +: Class +1; -: Class -1

• Two basic approaches:
  – Use a linear classifier, but allow some (penalized) errors
    • soft margin, slack variables
  – Project the data into a higher-dimensional space
    • Do linear classification there
    • Kernel functions

[Figure: 2-D scatter plot where some + and - points are intermixed (not linearly separable)]

25

Page 26:

Multiclass Classification

• SVMs create linear decision boundaries

– At their core, SVMs are binary classifiers

• How can we do multiclass classification? (see the sketch after this list)

– One-vs-all

– All-pairs

– ECOC

– ...
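As a concrete illustration of the first strategy, here is a minimal one-vs-all sketch (my own example, not from the slides) using scikit-learn:

# A minimal one-vs-all sketch (illustrative only): one binary linear SVM is
# trained per class (that class vs. the rest); the highest-scoring class wins.
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 4 instances, 2 features, 3 classes.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 0.5]]
y = [0, 1, 2, 1]

clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, y)
print(clf.predict([[1.5, 1.5]]))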

26

Page 27:

SVM Implementations

• Many implementations of SVMs:

– SVM-Light: Thorsten Joachims
  • http://svmlight.joachims.org

– LibSVM: C.-C. Chang and C.-J. Lin
  • http://www.csie.ntu.edu.tw/~cjlin/libsvm/

– Weka’s SMO

– …

27

Page 28:

SVMs: More Formally

• A hyperplane

• w: normal vector (aka weight vector), which is perpendicular to the hyperplane

• b: intercept term

• ||w||:

  – Euclidean norm of w

• |b| / ||w|| = offset of the hyperplane from the origin

<w, x> + b = 0

[Figure: hyperplane <w,x> + b = 0, with the normal vector w and the intercept term b labeled]

28

Page 29:

Inner product example

• Inner product between two vectors
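The slide's worked example is not reproduced here; an illustrative computation with made-up vectors:

$$\langle (1, 2, 3),\, (4, 5, 6) \rangle = 1\cdot 4 + 2\cdot 5 + 3\cdot 6 = 32$$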

29

Page 30:

Inner product (cont)

30

cosine similarity = scaled inner product
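In standard form, the relationship stated above is:

$$\cos(\mathbf{x}, \mathbf{y}) \;=\; \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\|\mathbf{x}\|\,\|\mathbf{y}\|}$$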

Page 31:

Hyperplane Example

• <w,x>+b=0

• How many (w, b) pairs describe this hyperplane?

• Infinitely many!

  – They differ only by scaling

x1 + 2x2 − 2 = 0    w = (1, 2),  b = −2

10x1 + 20x2 − 20 = 0    w = (10, 20),  b = −20

31

Page 32:

Finding a hyperplane

• Given the training instances, we want to find a hyperplane that separates them.

• If there is more than one such hyperplane, SVM chooses the one with the maximum margin.

32

Page 33:

Maximizing the margin

[Figure: separable + and - points with the maximum-margin hyperplane <w,x> + b = 0]

Training: to find w and b.

33

Page 34:

Support vectors

[Figure: maximum-margin hyperplane with the two margin hyperplanes passing through the support vectors]

<w,x> + b = 0
<w,x> + b = -1
<w,x> + b = +1

34

Page 35:

Margins & Support Vectors

• Closest instances to hyperplane:

– “Support Vectors”

– Both pos/neg examples

• Add hyperplanes through the support vectors

• d = 1/||w||  (the distance from the decision boundary to each of these hyperplanes)

• How do we pick the support vectors? Via training.

• How many are there? It depends on the data set.

35

Page 36:

SVM Training

• Goal: Maximum margin, consistent with the training data
  – Margin = 1/||w||

• How can we maximize it?
  – Max d ⇔ Min ||w||

• So we will:
  – Minimize ||w||² subject to yi(<w,xi> + b) >= 1 for all i

• This is a Quadratic Programming (QP) problem
  – Can use standard QP solvers (see the sketch below)
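In practice the QP is usually handed to a library. A minimal sketch (my own example, not from the slides) using scikit-learn, whose SVC wraps the LIBSVM solver; a very large C approximates the hard-margin case:

# Fit a (near) hard-margin linear SVM and read off w, b, and the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1e6)   # large C ~ hard margin
clf.fit(X, y)

print("w =", clf.coef_[0])               # normal vector of the hyperplane
print("b =", clf.intercept_[0])          # intercept term
print("support vectors:", clf.support_vectors_)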

36

Page 37:

37

Let w=(w1, w2, w3, w4, w5)

x1:  y = +1   f1:2  f3:3.5  f4:-1
x2:  y = -1   f2:-1  f3:2
x3:  y = +1   f1:5  f4:2  f5:3.1

We are trying to choose w and b for the hyperplane <w,x> + b = 0 (b is dropped from the constraints below for simplicity):

(+1) * (2w1 + 3.5w3 - w4) >= 1
(-1) * (-w2 + 2w3) >= 1
(+1) * (5w1 + 2w4 + 3.1w5) >= 1

i.e.,

2w1 + 3.5w3 - w4 >= 1
-w2 + 2w3 <= -1
5w1 + 2w4 + 3.1w5 >= 1

With those constraints, we want to minimize w1² + w2² + w3² + w4² + w5².
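A minimal sketch (my own illustration, keeping the slide's simplification of dropping b) that solves this tiny QP numerically with SciPy:

# Minimize ||w||^2 subject to y_i * <w, x_i> >= 1 for the three instances above.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 0.0, 3.5, -1.0, 0.0],    # x1: f1:2 f3:3.5 f4:-1
              [0.0, -1.0, 2.0, 0.0, 0.0],    # x2: f2:-1 f3:2
              [5.0, 0.0, 0.0, 2.0, 3.1]])    # x3: f1:5 f4:2 f5:3.1
y = np.array([1.0, -1.0, 1.0])

objective = lambda w: np.dot(w, w)           # ||w||^2

# One inequality constraint per instance: y_i * <w, x_i> - 1 >= 0
constraints = [{'type': 'ineq', 'fun': (lambda w, i=i: y[i] * X[i].dot(w) - 1.0)}
               for i in range(len(y))]

result = minimize(objective, x0=np.zeros(5), method='SLSQP', constraints=constraints)
print("w =", result.x)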

Page 38:

Training (cont)

subject to the constraint yi(<w,xi> + b) >= 1 for all i

[Figure: maximum-margin separator with the margin constraints illustrated]

38

Page 39:

Lagrangian**
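The slide's derivation is not reproduced here; the standard primal Lagrangian for this problem (a reconstruction) is:

$$L(\mathbf{w}, b, \alpha) \;=\; \frac{1}{2}\|\mathbf{w}\|^2 \;-\; \sum_i \alpha_i \,\big[\, y_i(\langle \mathbf{w}, x_i \rangle + b) - 1 \,\big], \qquad \alpha_i \ge 0
$$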

39

Page 40:

The dual problem **

• Find 𝛼1, … , 𝛼𝑁 such that the following is maximized

• Subject to
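The objective and constraints are not reproduced here; the standard hard-margin dual (a reconstruction) is:

$$\max_{\alpha} \;\; \sum_i \alpha_i \;-\; \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j\, y_i y_j \,\langle x_i, x_j \rangle \qquad \text{s.t.} \quad \alpha_i \ge 0 \ \ \forall i, \qquad \sum_i \alpha_i y_i = 0
$$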

40

Page 41:

• The solution has the form:

  w = Σ_i αi yi xi

  b = yk − <w, xk>,  for any xk whose weight αk is non-zero

41

Page 42:

An example

42

𝑥1=(1, 0, 3), 𝑦1 = 1, α1=2

𝑥2=(-1, 2, 0), 𝑦2 = −1, α2=3

𝑥3=(0, -4, 1), 𝑦3 = 1, α3=0

Page 43:

43

w = (2·1·1 + 3·(−1)·(−1) + 0·1·0,  2·1·0 + 3·(−1)·2 + 0·1·(−4),  2·1·3 + 3·(−1)·0 + 0·1·1) = (5, −6, 6)

𝑥1=(1, 0, 3), 𝑦1 = 1, α1=2

𝑥2=(-1, 2, 0), 𝑦2 = −1, α2=3

𝑥3=(0, -4, 1), 𝑦3 = 1, α3=0
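A minimal sketch (my own check, not from the slides) that recovers w from these alphas via w = Σ_i αi yi xi:

import numpy as np

X = np.array([[1.0, 0.0, 3.0], [-1.0, 2.0, 0.0], [0.0, -4.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([2.0, 3.0, 0.0])

w = (alpha * y) @ X   # weighted sum of the training vectors
print(w)              # -> [ 5. -6.  6.]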

Page 44:

44

Page 45:

Finding the solution

• This is a Quadratic Programming (QP) problem.

• The function is convex and there are no local minima.

• Solvable in polynomial time.

45

Page 46:

Decoding with w and b

Hyperplane: w = (1, 2), b = −2
f(x) = x1 + 2x2 − 2

x = (3, 1):  f(x) = 3 + 2 − 2 = 3 > 0  → class +1
x = (0, 0):  f(x) = 0 + 0 − 2 = −2 < 0  → class −1

[Figure: the line x1 + 2x2 − 2 = 0, crossing the axes at (2, 0) and (0, 1)]

46
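A minimal decoding sketch (my own illustration) matching the example above:

import numpy as np

w, b = np.array([1.0, 2.0]), -2.0

def decode(x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(decode(np.array([3.0, 1.0])))  # +1
print(decode(np.array([0.0, 0.0])))  # -1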

Page 47:

Decoding with 𝛼𝑖

Decoding:

47

f(x) = ( Σ_i αi yi <xi, x> ) + b

Page 48:

kNN vs. SVM

• Majority voting:
  c* = arg max_c g(c)

• Weighted voting: the weighting is on each neighbor
  c* = arg max_c Σ_i wi δ(c, fi(x))

• Weighted voting allows us to use more training examples:
  e.g., wi = 1/dist(x, xi)
  With such weights we can use all the training examples.

48

[Side-by-side decision rules: the 2-class weighted-kNN vote vs. the SVM score f(x) = ( Σ_i αi yi <xi, x> ) + b]

Page 49:

Summary of linear SVM

• Main ideas:

– Choose a hyperplane to separate instances:

<w,x> + b = 0

– Among all the allowed hyperplanes, choose the one with the max margin

– Maximizing margin is the same as minimizing ||w||

– Choosing w is the same as choosing the αi

49

Page 50:

The problem
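The slide's formula is not reproduced here; the primal problem it recaps (equivalent to minimizing ||w||² as on the earlier slide) is:

$$\min_{\mathbf{w}, b} \;\; \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.} \quad y_i(\langle \mathbf{w}, x_i \rangle + b) \ge 1 \ \ \forall i
$$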

50

Page 51:

The dual problem **

51

Page 52:

Remaining issues

• Linear classifier: what if the data is not separable?

  – The data would be linearly separable without the noise
    → soft margin

  – The data is not linearly separable
    → map the data to a higher-dimensional space

52

Page 53:

Soft margin

53

Page 54:

The highlight

• Problem: Some data sets are not separable, or there are mislabeled examples.

• Idea: Split the data as cleanly as possible, while maximizing the distance to the nearest cleanly split examples.

• Mathematically: introduce slack variables εi.

54

Page 55:

55


Page 56:

Objective function

• For each training instance 𝑥𝑖 , introduce a slack variable 𝜀𝑖

• Minimizing

such that

C is a regularization term (for controlling overfitting), k = 1 or 2
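The formulas themselves are not reproduced here; the standard soft-margin objective the slide refers to is:

$$\min_{\mathbf{w}, b, \varepsilon} \;\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \varepsilon_i^k \qquad \text{s.t.} \quad y_i(\langle \mathbf{w}, x_i \rangle + b) \ge 1 - \varepsilon_i, \quad \varepsilon_i \ge 0 \ \ \forall i
$$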

56


Page 58:

The dual problem**

• Maximize

• Subject to
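The equations are not reproduced here; for the usual k = 1 (hinge-loss) case, the standard soft-margin dual (a reconstruction) is:

$$\max_{\alpha} \;\; \sum_i \alpha_i \;-\; \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j\, y_i y_j \,\langle x_i, x_j \rangle \qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0
$$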

58

Page 59:

• The solution has the form:

  w = Σ_i αi yi xi

  b = yk (1 − εk) − <w, xk>,  for k = arg max_k αk

59

• An xi with a non-zero αi is called a support vector.

• Every data point that is misclassified or lies within the margin has a non-zero αi.