SVM Lecture 2


Support Vector Machine (SVM)

Slides for guest lecture presented by Linda Sellie in Spring 2012 for CS6923, Machine Learning, NYU-Poly, with a few corrections...

http://www.svms.org/tutorials/Hearst-etal1998.pdf

http://www.cs.cornell.edu/courses/cs578/2003fa/slides_sigir03_tutorial-modified.v3.pdf

These slides were prepared by Linda Sellie and Lisa Hellerstein

Which Hyperplane?

[Figure: + and − training points in the plane, with a candidate separating line g(x).]

Which g(x)?

If w^T = (3, 4) and w_0 = −10:

g(x) = w^T x + w_0 = (3, 4)^T x − 10

[Figure: the line g(x) = 0 passes through (2,1) and (0, 2.5); (3,1) and (2,2) lie on the + side, (1,1) on the − side.]

If g(x) > 0 then f(x) = 1; if g(x) ≤ 0 then f(x) = −1.

g(0, 5/2) = (3, 4)·(0, 5/2) − 10 = 0

With w^T = (3, 4), w_0 = −10, and g(x) = w^T x + w_0 = (3, 4)^T x − 10:

g(2,1) = (3, 4)·(2,1) − 10 = 0
g(2,2) = (3, 4)·(2,2) − 10 = 4 > 0
g(3,1) = (3, 4)·(3,1) − 10 = 3 > 0
g(1,1) = (3, 4)·(1,1) − 10 = −3 ≤ 0

Since g(x) > 0 gives f(x) = 1 and g(x) ≤ 0 gives f(x) = −1:

f(2,2) = 1,  f(3,1) = 1,  f(1,1) = −1,  and f(2,1) = f(0, 5/2) = −1.
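These computations are easy to reproduce. Below is a minimal Python sketch (illustrative, not part of the slides) that evaluates the discriminant g(x) = w^T x + w_0 and the classifier f on the example points:

    import numpy as np

    # Hyperplane from the example: w = (3, 4), w0 = -10.
    w = np.array([3.0, 4.0])
    w0 = -10.0

    def g(x):
        # Linear discriminant g(x) = w^T x + w0.
        return w @ x + w0

    def f(x):
        # Classify: +1 if g(x) > 0, else -1 (points on the line get -1).
        return 1 if g(x) > 0 else -1

    for pt in [(2, 1), (0, 2.5), (3, 1), (2, 2), (1, 1)]:
        x = np.array(pt)
        print(pt, "g =", g(x), "f =", f(x))
    # (2,2) and (3,1) classify as +1; (2,1), (0,2.5), (1,1) as -1.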

Note (on Gaussian Naive Bayes as a linear discriminant): with shared variance for each feature (for each x_i, requiring the estimated variances of the distributions p[x_i | +] and p[x_i | −] to be the same).

(For the usual Gaussian Naive Bayes, where you don't require shared variance for each feature, the discriminant function is quadratic.)

That is, if you have Boolean features and you treat them as discrete/categorical features, running the standard NB algorithm for discrete/categorical features will produce a linear discriminant.

Which line (hyperplane) to choose?

[Figure: the same + and − points, shown with two different separating lines g(x); both classify the training data correctly.]

Maximal Margin Hyperplane

[Figure: the maximal margin hyperplane, with equal margins marked on both sides.]

How to compute the distance from a point to the hyperplane:

Write x = x_p + r · w/‖w‖, where x_p is the normal projection of x onto the hyperplane and r is the signed distance.

g(x) = w^T x + w_0 = w^T (x_p + r · w/‖w‖) + w_0 = (w^T x_p + w_0) + r · (w^T w)/‖w‖ = r ‖w‖,

observing that w^T x_p + w_0 = g(x_p) = 0 and w^T w = ‖w‖².

Thus r = g(x)/‖w‖.

[Figure: a point x, its projection x_p onto the hyperplane {x : g(x) = w^T x + w_0 = 0}, the normal vector w, and the distance r.]

Example: w^T = (3, 4), w_0 = −10, g(x) = w^T x + w_0, with ‖w‖ = ‖(3, 4)‖ = 5.

[Figure: the points (2,2), (3,1), (1,1), and (1, 0.5) around the line g(x) = 0.]

Distance formula: r = g(x)/‖w‖

r(2,2)   = ((3, 4)·(2,2) − 10)/5   = 4/5
r(1,0.5) = ((3, 4)·(1,0.5) − 10)/5 = −1
r(3,1)   = ((3, 4)·(3,1) − 10)/5   = 3/5
r(1,1)   = ((3, 4)·(1,1) − 10)/5   = −3/5

Now rescale the hyperplane: g'(x) = w'^T x + w'_0 = (1/3)g(x), with w' = (1/3)w = (1, 4/3)^T and w'_0 = −10/3, so ‖w'‖ = 5/3.

[Figure: the same points and the same line, now written as g'(x) = 0.]

Distance formula: r = g'(x)/‖w'‖

r(2,2)   = ((1, 4/3)·(2,2) − 10/3)/(5/3)   = 4/5
r(1,0.5) = ((1, 4/3)·(1,0.5) − 10/3)/(5/3) = −1
r(3,1)   = ((1, 4/3)·(3,1) − 10/3)/(5/3)   = 3/5
r(1,1)   = ((1, 4/3)·(1,1) − 10/3)/(5/3)   = −3/5

The distances are unchanged: rescaling (w, w_0) rescales g(x) and ‖w‖ by the same factor.
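A short sketch (again illustrative) confirming that the signed distance r = g(x)/‖w‖ is unchanged when (w, w_0) is rescaled:

    import numpy as np

    def signed_distance(w, w0, x):
        # r = g(x) / ||w||, with g(x) = w^T x + w0.
        return (w @ x + w0) / np.linalg.norm(w)

    w, w0 = np.array([3.0, 4.0]), -10.0   # original hyperplane
    wp, wp0 = w / 3.0, w0 / 3.0           # rescaled: w' = (1, 4/3), w'0 = -10/3

    for pt in [(2, 2), (1, 0.5), (3, 1), (1, 1)]:
        x = np.array(pt)
        print(pt, signed_distance(w, w0, x), signed_distance(wp, wp0, x))
    # Both columns agree: 0.8, -1.0, 0.6, -0.6 (i.e. 4/5, -1, 3/5, -3/5).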

We want to classify points in space. Which hyperplane does SVM choose?

[Figure: the + and − points shown with two different separating lines g(x); both separate the training data.]

Maximal Margin Hyperplane

The margin is the geometric distance from the hyperplane to the closest training example.

[Figure: the maximal margin hyperplane with its support vectors marked; the margin is the same width on both sides.]

For the closest + example x₁ and the closest − example x₂:

y₁ g(x₁)/‖w‖ = y₂ g(x₂)/‖w‖

[Figure: the line g(x) = (3, 4)·x − 10 = 0 through (2,1) and (0, 2.5), with (3,1) and (2,2) above it and (1,1) below.]

The hyperplane is defined by all the points which satisfy g(x) = (3, 4)·x − 10 = 0,
e.g. g(0, 2.5) = 0 and g(2,1) = (3, 4)·(2,1) − 10 = 0.

All the points above the line are positive: g(x) = (3, 4)·x − 10 > 0,
e.g. g(2,2) = (3, 4)·(2,2) − 10 = 4 and g(3,1) = (3, 4)·(3,1) − 10 = 3.

All the points below the line are negative: g(x) = (3, 4)·x − 10 < 0,
e.g. g(1,1) = (3, 4)·(1,1) − 10 = −3.

We use the hyperplane to classify a point x:
f(x) = 1 if w·x + w_0 > 0; f(x) = −1 if w·x + w_0 ≤ 0.

Notice that for any hyperplane we have an infinite number of formulas that describe it!
If (3, 4)·x − 10 = 0, so does (1/3)((3, 4)·x − 10) = 0, so does 23((3, 4)·x − 10) = 0, and so does 0.9876((3, 4)·x − 10) = 0.

Note (on the margin equality above): this holds if it is a maximum margin hyperplane, since in such a hyperplane the distance to the closest + example must be equal to the distance to the closest − example.

The canonical hyperplane for a set of training examples is one for which

y^(i) (w'^T x^(i) + w'_0) ≥ 1 for all i   (the functional margin is 1).

Example: S = {⟨(1,1),−1⟩, ⟨(2,2),1⟩, ⟨(1,1/2),−1⟩, ⟨(3,2),1⟩, ...}, with

g'(x) = (1, 4/3)·x − 10/3.

[Figure: the points of S around the line g'(x) = 0; (3,1) and (1,1) lie on the margin lines at functional margin 1.]

1 · ((1, 4/3)·(2,2) − 10/3) = 4/3 ≥ 1
−1 · ((1, 4/3)·(1,1/2) − 10/3) = 5/3 ≥ 1
g'(3,1) = (1, 4/3)·(3,1) − 10/3 = 1
g'(1,1) = (1, 4/3)·(1,1) − 10/3 = −1

For a canonical hyperplane w.r.t. a fixed set of training examples S, the margin ρ is computed by

ρ = g(x⁺)/‖w‖ = −g(x⁻)/‖w‖ = 1/‖w‖.

Remember: the distance from a point x to the hyperplane is r = g(x)/‖w‖, where g(x) = w^T x + w_0.

[Figure: the points of S around the canonical hyperplane g(x) = (1, 4/3)^T x − 10/3 = 0, with the margin marked.]

For g(x) = (1, 4/3)^T x − 10/3:

ρ = 1/‖(1, 4/3)‖ = 1/√(1 + 16/9) = 3/5.
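A small numeric check (illustrative; the extra point (3,1) with label +1 is taken from the figure, since the slide's S ends in "..."): the functional margins are all ≥ 1 with minimum exactly 1, and the geometric margin is 1/‖w'‖ = 3/5.

    import numpy as np

    # Canonical hyperplane from the slides: w' = (1, 4/3), w'0 = -10/3.
    w = np.array([1.0, 4.0 / 3.0])
    w0 = -10.0 / 3.0

    S = [((1, 1), -1), ((2, 2), 1), ((1, 0.5), -1), ((3, 2), 1), ((3, 1), 1)]

    # Canonical: y_i (w^T x_i + w0) >= 1 for all i, with equality at the
    # support vectors (here (1,1) and (3,1)).
    print([y * (w @ np.array(x) + w0) for x, y in S])
    print(1 / np.linalg.norm(w))   # geometric margin 1/||w'|| = 3/5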

The distance of a support vector x to the hyperplane is r = g(x)/‖w‖ = ±1/‖w‖, assuming a canonical hyperplane.

For the canonical hyperplane, the margin is 1/‖w‖.

To find the maximal canonical hyperplane, the goal is to minimize ‖w‖.

For a set of training examples, we can find the maximum margin hyperplane in polynomial time!

To do this, we reformulate the problem as an optimization problem.

There is an algorithm to solve an optimization problem if it has this form:

min: f(w)
Subject to: g_i(w) ≤ 0 for all i
            h_i(w) = 0 for all i

where f is convex, each g_i is convex, and each h_i is affine. We can use standard techniques to find the optimum.

Finding the largest geometric margin, by finding the g(x) that achieves it:

max: ρ
Subject to: y^(i)(w^T x^(i) + w_0)/‖w‖ ≥ ρ for all i

Equivalently, restricting to canonical hyperplanes:

min: ‖w‖
Subject to: y^(i)(w^T x^(i) + w_0) ≥ 1 for all i

And as a quadratic program:

min: (1/2)‖w‖²
Subject to: y^(i)(w^T x^(i) + w_0) ≥ 1 for all i
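This quadratic program can be handed directly to a convex solver. Here is a minimal sketch using cvxpy (one possible solver, not part of the lecture). Since only a few points of S are listed, the optimum found here need not match the (1, 4/3) example used elsewhere in the slides.

    import cvxpy as cp
    import numpy as np

    # A sample of the training set (the "..." points are omitted).
    X = np.array([[1, 1], [2, 2], [1, 0.5], [3, 2], [3, 1]], dtype=float)
    y = np.array([-1, 1, -1, 1, 1])

    w = cp.Variable(2)
    w0 = cp.Variable()

    # min (1/2)||w||^2  subject to  y_i (w^T x_i + w0) >= 1 for all i
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()

    print(w.value, w0.value)            # maximum margin hyperplane
    print(1 / np.linalg.norm(w.value))  # its geometric margin 1/||w||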

Solving this constrained quadratic optimization requires that the Karush-Kuhn-Tucker (KKT) conditions are met.

The KKT conditions imply that the coefficient v_i is non-zero only if x^(i) is a support vector!

The hypothesis for the set of training examples

S = {⟨(1,1),−1⟩, ⟨(2,2),1⟩, ⟨(1,1/2),−1⟩, ⟨(3,2),1⟩, ...}

is

g(x) = v₁(2, 7/4)·x + v₂(3,1)·x + v₃(1,1)·x + w₀.

[Figure: the points of S with the lines g(x) = w^T x + b = 0, g(x) = 1, and g(x) = −1; the support vectors (2, 7/4), (3,1), and (1,1) lie on the margin lines g(x) = ±1.]

Note that only the support vectors are in the hypothesis.
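In practice a library solver reports the support vectors and their coefficients directly. A sketch using scikit-learn (an off-the-shelf choice, not the lecture's; a very large C approximates the hard-margin problem):

    import numpy as np
    from sklearn.svm import SVC

    # Sample points in the style of S (the "..." points are omitted).
    X = np.array([[1, 1], [2, 2], [1, 0.5], [3, 2], [3, 1], [2, 1.75]])
    y = np.array([-1, 1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    print(clf.support_vectors_)       # only these points enter the hypothesis
    print(clf.dual_coef_)             # their nonzero coefficients y_i * v_i
    print(clf.coef_, clf.intercept_)  # w and w0 recovered from the expansion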

Non-Linearly Separable Data (III)

[Figure: − points inside the unit circle, e.g. (0,0), (0.5,0), (0.5,−0.5), (−0.25,0.25), (−0.75,−0.25), surrounded by + points outside it, e.g. (1,1), (−1,−1), (−1,1), (1,−0.75), (0,1.25), (−1.25,0). No line g(x) = w^T x + w_0 separates the positives from the negatives.]

Linearly separable?

Transform the feature space: φ : x → φ(x), with φ(x) = [x₁², x₂²].

g(x) = w^T φ(x) + w_0, with w = (1,1) and w_0 = −1.

[Figure: the original points, and their images under φ: −(0,0), −(0,0.25), −(0.25,0.25), −(0.56,0.0625), +(1,0.56), +(1,1), +(1.5625,0), +(0,1.5625). In the transformed space a line separates the classes.]

Linearly separable?

Checking a few points with φ(x) = [x₁², x₂²], w = (1,1), w_0 = −1, and g(x) = w^T φ(x) + w_0:

g(1,1) = (1,1)·(1,1) − 1 > 0
g(−1,−1) = (1,1)·(1,1) − 1 > 0
g(0.5,−0.5) = (1,1)·(0.25,0.25) − 1 ≤ 0

[Figure: the same original and transformed points as above; the transformed points are linearly separable.]

Linearly separable?
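A quick sketch (illustrative; the point labels follow the computations above, where points inside the unit circle are negative) verifying that φ makes the data linearly separable:

    import numpy as np

    def phi(x):
        # Feature transform from the slides: phi(x) = (x1^2, x2^2).
        return np.array([x[0] ** 2, x[1] ** 2])

    w, w0 = np.array([1.0, 1.0]), -1.0   # g(x) = w . phi(x) + w0

    pos = [(1, 1), (-1, -1), (-1, 1), (1, -0.75), (0, 1.25), (-1.25, 0)]
    neg = [(0, 0), (0.5, 0), (0.5, -0.5), (-0.25, 0.25), (-0.75, -0.25)]

    assert all(w @ phi(np.array(x)) + w0 > 0 for x in pos)
    assert all(w @ phi(np.array(x)) + w0 <= 0 for x in neg)
    print("w separates the classes in the transformed space")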

Linearly separable?

[Figure: + and − points on the number line between 0 and 3, with the labels alternating by interval; no single threshold separates them.]

Yes, by transforming the feature space!

φ(x) = [x, x²]

g(φ(x)) = (1, −3)·φ(x) + 2
f(x) = 1 if g(φ(x)) = (1, −3)·φ(x) + 2 > 0
f(x) = −1 if g(φ(x)) = (1, −3)·φ(x) + 2 < 0

Note: There is an error in this slide. The given g equals x − 3x² + 2 = −(3x + 2)(x − 1), which is positive iff −2/3 < x < 1 (check this by factoring). So this slide and the next one can be fixed by relabeling the points on the line accordingly. (Alternatively, change φ(x) to be [x², x] instead of [x, x²]; then the labeling on the line is correct, but the rest of the example needs to be changed.)

Linearly separable? Yes, by transforming the feature space!

φ(x) = [x, x²]

g(φ(x)) = (1, −3)·φ(x) + 2
f(x) = 1 if g(φ(x)) > 0; f(x) = −1 if g(φ(x)) < 0

g(φ(1/2)) = g(1/2, 1/4) = (1, −3)·(1/2, 1/4) + 2 = 7/4
g(φ(3/2)) = g(3/2, 9/4) = (1, −3)·(3/2, 9/4) + 2 = −13/4
g(φ(2)) = g(2, 4) = (1, −3)·(2, 4) + 2 = −8

[Figure: the transformed points lie on the parabola φ(x) = (x, x²), e.g. (3/4, 9/16), (5/4, 25/16), (3/2, 9/4), (7/4, 49/16); they are separated by the line (1, −3)·z + 2 = 0.]

These points become linearly separable by transforming the feature space using g(φ(x)) = (1, −3)·φ(x) + 2.
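Following the correction note above, a quick sign check (illustrative) of g(φ(x)) = x − 3x² + 2 on a few points of the line:

    # g(phi(x)) = (1, -3) . (x, x^2) + 2 = x - 3x^2 + 2 = -(3x + 2)(x - 1)
    def g(x):
        return x - 3 * x ** 2 + 2

    for x in [-1.0, -2 / 3, 0.0, 0.5, 1.0, 1.5, 2.0]:
        print(x, g(x), "+" if g(x) > 0 else "-")
    # g is positive exactly on (-2/3, 1), matching the correction note;
    # g(-2/3) = g(1) = 0 (up to floating-point rounding at the endpoints).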

Kernel Function

K(x, z) = φ(x)·φ(z)

Transform the feature space: map x to φ(x).

KERNEL TRICK: Never compute φ(x). Just compute K(x, z). Why is this enough? If we work with the dual representation of the hyperplane (and the dual quadratic program), the only use of the new features is in inner products!
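To make the note concrete: for the quadratic kernel K(x, z) = (x·z)² on 2-D inputs (a standard example, not from the slides), an explicit φ with φ(x)·φ(z) = K(x, z) exists, but the kernel never needs to build it:

    import numpy as np

    def K(x, z):
        # Quadratic kernel, computed entirely in the original 2-D space.
        return (x @ z) ** 2

    def phi(x):
        # An explicit feature map with phi(x) . phi(z) = (x . z)^2.
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    rng = np.random.default_rng(0)
    x, z = rng.standard_normal(2), rng.standard_normal(2)
    print(K(x, z), phi(x) @ phi(z))   # equal up to rounding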

Non-Separable Data (IV)

What if the data is not linearly separable because of only a few points?

[Figure: + and − points, e.g. around (3,1) and (1,1), that a line separates well except for a single point marked x on the wrong side.]

What if a small number of points prevents the margin from being large?

[Figure: separable + and − points where a single point marked x lies close to the opposite class, forcing any separating line to have a small margin.]

[Figure: two solutions for the same data, one labeled "λ large" and one labeled "λ small"; the chosen separator and margin differ with λ, which controls the tradeoff between a large margin and fitting all points.]

What if λ = ∞?
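The slides leave the penalty parameter abstract. As a sketch of the same tradeoff, scikit-learn's soft-margin SVC exposes it through its C parameter (the exact relationship to the slides' λ is not specified here, and the data below is made up):

    import numpy as np
    from sklearn.svm import SVC

    # Separable data plus one - outlier that squeezes the margin.
    X = np.array([[1, 1], [1.5, 1], [2, 2], [3, 2], [3, 1], [1.6, 1.4]])
    y = np.array([-1, -1, 1, 1, 1, -1])   # last point is the outlier

    for C in [0.1, 1.0, 1000.0]:
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, 1 / np.linalg.norm(clf.coef_[0]), clf.n_support_)
    # Small C tolerates violating the outlier and keeps a wider margin;
    # very large C (the hard-margin limit) forces the fit to respect it.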
