CSE 151 Machine Learning Instructor: Kamalika Chaudhuri


  • CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

  • Linear Classification

    Given labeled data (xi, yi), i = 1, .., n, where xi is a feature vector and yi is a label that is +1 or -1, find a hyperplane to separate + from -.

  • Linear Classification Process

    Training Data → Classification Algorithm → Prediction Rule: vector w

    [Figure: + and - training points, and the resulting separating hyperplane through the origin O with normal vector w]

    How do you classify a test example x?

    Output y = sign(⟨w, x⟩)
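    As a concrete reading of this rule, here is a minimal Python sketch (NumPy-based; the function name and the tie-breaking at 0 are my own choices, not from the slides):

    ```python
    import numpy as np

    def predict(w, x):
        """Linear prediction rule: y = sign(<w, x>)."""
        # Convention: treat <w, x> = 0 as +1 (the slides leave this case unspecified).
        return 1 if np.dot(w, x) >= 0 else -1
    ```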

  • Linear Classification

    [Figure: + and - points separated by a hyperplane through the origin O, with normal vector w]

    Given labeled data (xi, yi), i=1,..,n, where y is +1 or -1,

    Find a hyperplane through the origin to separate + from -

    w: normal vector to the hyperplane

    For a point x on one side of the hyperplane, ⟨w, x⟩ > 0. For a point x on the other side, ⟨w, x⟩ < 0.

  • The Perceptron Algorithm

    Goal: Given labeled data (xi, yi), i=1,..,n, where y is +1 or -1, find a vector w such that the corresponding hyperplane separates + from -.

    Perceptron Algorithm:

    1. Initially, w1 = y1 x1

    2. For t = 2, 3, ....

       If yt ⟨wt, xt⟩ < 0, then: wt+1 = wt + yt xt

       Else: wt+1 = wt
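    A minimal sketch of this update rule in Python (NumPy-based; the function name and data layout are illustrative, not from the slides):

    ```python
    import numpy as np

    def perceptron(X, y):
        """One pass of the perceptron over examples (X[t], y[t]), where X is an (n, d)
        array and y is an array of +1/-1 labels, following the slide's update rule."""
        w = y[0] * X[0]                        # 1. Initially, w1 = y1 x1
        for t in range(1, len(X)):             # 2. For t = 2, 3, ....
            if y[t] * np.dot(w, X[t]) < 0:     #    mistake on (xt, yt)
                w = w + y[t] * X[t]            #    wt+1 = wt + yt xt
            # else: wt+1 = wt (no change)
        return w
    ```

    If the data is linearly separable, repeating such passes until no mistakes are made yields a separating w.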

  • Linear Separability

    [Figure: two datasets of + and - points; the one on the left is linearly separable, the one on the right is not linearly separable]

  • Measure of Separability: Margin

    For a unit vector w, and training set S, margin of w is defined as:

    γ = min_{x ∈ S} |⟨w, x⟩| / ‖x‖

    min cosine of angle between w and x, for any x in S

    [Figure: the angle θ between w and the closest data point; a low-margin case and a high-margin case]
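    A small sketch of the margin definition above (assuming, as in the slide, that w is a unit vector and the rows of X are the training points):

    ```python
    import numpy as np

    def margin(w, X):
        """gamma = min over x in S of |<w, x>| / ||x||."""
        return min(abs(np.dot(w, x)) / np.linalg.norm(x) for x in X)
    ```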

  • Real Examples: Margin

    For a unit vector w, and training set S, margin of w is defined as:

    γ = min_{x ∈ S} |⟨w, x⟩| / ‖x‖

    min cosine of angle between w and x, for any x in data

    [Figure: two real datasets with their separators w; low margin on the left, high margin on the right]

  • When does Perceptron work?

    Theorem: If ‖xi‖ = 1 for all xi in the training data, and the training points are linearly separable with margin γ, then the number of mistakes made by the perceptron is at most 1/γ².

    Observations:

    1. The lower the margin, the more mistakes.

    2. May need more than one pass over the training data to get a classifier which makes no mistakes on the training set.

  • What if data is not linearly separable?

    Ideally, we want to find a linear separator that makes the minimum number of mistakes on the training data.

    But this is NP-hard.

    [Figure: a dataset of + and - points that is not linearly separable]

  • Voted Perceptron

    Perceptron:

    1. Initially: m = 1, w1 = y1 x1

    2. For t = 2, 3, ....

       If yt ⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1

    3. Output wm

    Voted Perceptron:

    1. Initially: m = 1, w1 = y1 x1, c1 = 1

    2. For t = 2, 3, ....

       If yt ⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1, cm = 1

       Else: cm = cm + 1

    3. Output (w1, c1), (w2, c2), ..., (wm, cm)

  • Voted Perceptron

    Voted Perceptron:

    1. Initially: m = 1, w1 = y1 x1, c1 = 1

    2. For t = 2, 3, ....

       If yt ⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1, cm = 1

       Else: cm = cm + 1

    3. Output (w1, c1), (w2, c2), ..., (wm, cm)

    How to classify example x?

    Output: sign( Σ_{i=1}^{m} ci sign(⟨wi, x⟩) )

    Problem: Have to store all the classifiers

  • Averaged Perceptron

    Averaged Perceptron:

    1. Initially: m = 1, w1 = y1 x1, c1 = 1

    2. For t = 2, 3, ....

       If yt ⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1, cm = 1

       Else: cm = cm + 1

    3. Output (w1, c1), (w2, c2), ..., (wm, cm)

    How to classify example x?

    Output: sign( ⟨ Σ_{i=1}^{m} ci wi, x ⟩ )

  • Voted vs. Averaged

    How to classify example x?

    Averaged: sign( ⟨ Σ_{i=1}^{m} ci wi, x ⟩ )

    Voted: sign( Σ_{i=1}^{m} ci sign(⟨wi, x⟩) )

    Example where voted and averaged are different:

    w1 = x1, w2 = x2, w3 = x3, c1 = c2 = c3 = 1

    Averaged predicts +1 when ⟨x1 + x2 + x3, x⟩ > 0; Voted predicts +1 when ⟨xi, x⟩ > 0 for at least two of x1, x2, x3 (see the sketch below).
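    A sketch of the two prediction rules, given the list of (wi, ci) pairs output by the algorithm above (names and data layout are illustrative):

    ```python
    import numpy as np

    def predict_voted(classifiers, x):
        """Voted: sign( sum_i ci * sign(<wi, x>) )."""
        return np.sign(sum(c * np.sign(np.dot(w, x)) for w, c in classifiers))

    def predict_averaged(classifiers, x):
        """Averaged: sign( < sum_i ci * wi, x > )."""
        w_avg = sum(c * w for w, c in classifiers)
        return np.sign(np.dot(w_avg, x))
    ```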

  • Perceptron: An Example

    Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)

    Step 1: w1 = (4, 0), c1 = 1

    Step 2: (x2, y2) = ( (1, 1), -1). y2 ⟨w1, x2⟩ < 0: Mistake! Update: w2 = (3, -1), c2 = 1

    Step 3: (x3, y3) = ( (0, 1), -1). y3 ⟨w2, x3⟩ > 0: Correct! Update: c2 = 2

    Step 4: (x4, y4) = ( (-2, -2), +1). y4 ⟨w2, x4⟩ < 0: Mistake! Update: w3 = (1, -3), c3 = 1

    Final Voted Perceptron: sign( sign(⟨w1, x⟩) + 2 sign(⟨w2, x⟩) + sign(⟨w3, x⟩) )

    Final Averaged Perceptron: sign(⟨w1 + 2 w2 + w3, x⟩)

    [Figure: the data points and the decision boundary after each update, showing w1, w2, w3 and the mistakes at x2 and x4]
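    The trace above can be checked mechanically. A sketch of one pass of the Voted Perceptron on this data (variable names are illustrative):

    ```python
    import numpy as np

    X = np.array([(4, 0), (1, 1), (0, 1), (-2, -2)], dtype=float)
    y = np.array([+1, -1, -1, +1])

    ws = [y[0] * X[0]]                           # Step 1: w1 = y1 x1 = (4, 0)
    cs = [1]                                     # c1 = 1
    for t in range(1, len(X)):
        if y[t] * np.dot(ws[-1], X[t]) <= 0:     # mistake
            ws.append(ws[-1] + y[t] * X[t])
            cs.append(1)
        else:                                    # correct
            cs[-1] += 1

    print([(tuple(w), c) for w, c in zip(ws, cs)])
    # [((4.0, 0.0), 1), ((3.0, -1.0), 2), ((1.0, -3.0), 1)]
    ```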

  • Summary

    • Linear classification
    • Perceptron
    • Voted and Averaged Perceptron
    • Some issues

  • Margins

    Suppose the data is linearly separable by a margin. The perceptron will stop once it finds a linear separator. But sometimes we would like to compute the max-margin separator. This is done by Support Vector Machines (SVMs).

    [Figure: a linearly separable dataset of + and - points, showing an arbitrary separator and the max-margin separator]

  • More General Linear Classification

    Given labeled data (xi, yi), i=1,..,n, where y is +1 or -1,

    Find a hyperplane to separate + from -

    We can transform general linear classification to linear classification by a hyperplane through the origin as follows:

    For each (xi, yi) (xi is d-dimensional), construct:

    (zi, yi), where zi is the d+1-dimensional vector [1, xi]

    Any hyperplane through the origin in the z-space corresponds to a hyperplane (not necessarily through the origin) in x-space, and vice versa
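    A minimal sketch of this construction (the helper name is my own):

    ```python
    import numpy as np

    def lift(X):
        """Map each d-dimensional row xi of X to the (d+1)-dimensional zi = [1, xi]."""
        return np.hstack([np.ones((X.shape[0], 1)), X])
    ```

    A hyperplane ⟨w, z⟩ = 0 through the origin in z-space, with w = [w0, w'], corresponds to the general hyperplane ⟨w', x⟩ + w0 = 0 in x-space.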

  • Categorical Features

    Features with no inherent ordering

    Example: Gender = "Male", "Female"; State = "Alaska", "Hawaii", "California"

    Convert a categorical feature with k categories to a k-dimensional vector: e.g., "Alaska" maps to [1, 0, 0], "Hawaii" to [0, 1, 0], "California" to [0, 0, 1]
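    A sketch of this conversion (illustrative helper, not from the slides):

    ```python
    def one_hot(value, categories):
        """Map a categorical value to a k-dimensional 0/1 vector, e.g.
        one_hot("Alaska", ["Alaska", "Hawaii", "California"]) == [1, 0, 0]."""
        return [1 if value == c else 0 for c in categories]
    ```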

  • Multiclass Linear Classification

    Main Idea: Reduce to multiple binary classification problems

    One-vs-all: For k classes, solve k binary classification problems: Class 1 vs. rest, Class 2 vs. rest, ..., Class k vs. rest

    All-vs-all: For k classes, solve k(k-1)/2 binary classification problems: Class i vs. j, for i=1,..,k and j = i+1,..,k

    To classify, combine by voting
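    A sketch of the one-vs-all reduction, assuming labels in {1, ..., k} stored in a NumPy array and any binary trainer such as the perceptron sketch above (here the class with the largest score ⟨wj, x⟩ is chosen, one common way to combine the k binary outputs):

    ```python
    import numpy as np

    def one_vs_all_train(X, y, k, train_binary):
        """Train k binary classifiers: class j vs. rest, for j = 1, ..., k."""
        return [train_binary(X, np.where(y == j, 1, -1)) for j in range(1, k + 1)]

    def one_vs_all_predict(ws, x):
        """Predict the class whose binary classifier scores x highest."""
        return 1 + int(np.argmax([np.dot(w, x) for w in ws]))
    ```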

  • The Confusion Matrix

    Indicates which classes are easy and hard to classify

    [Figure: a 6 × 6 grid of confusion-matrix entries, rows and columns labeled 1 through 6]

    High diagonal entry: easy to classify. High off-diagonal entry: high scope for confusion.

    Mij = #examples that have label j but are classified as i

    Nj = #examples with label j

    Cij = Mij/Nj
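    A sketch of these definitions (classes indexed 0, ..., k-1 for convenience):

    ```python
    import numpy as np

    def confusion_matrix(true_labels, predicted_labels, k):
        """C[i][j] = fraction of examples with label j that are classified as i."""
        M = np.zeros((k, k))
        for y_true, y_pred in zip(true_labels, predicted_labels):
            M[y_pred, y_true] += 1        # Mij: label j, classified as i
        N = M.sum(axis=0)                 # Nj = #examples with label j
        return M / np.maximum(N, 1)       # Cij = Mij / Nj (guard against empty classes)
    ```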

  • Kernels

  • Feature Maps

    Data which is not linearly separable may be linearly separable in another feature space.

    [Figure: the map φ sends points with coordinates (X1, X2) to points with coordinates (Z1, Z2, Z3)]

    In the figure, the data is linearly separable in z-space: (Z1, Z2, Z3) = (X1², 2 X1X2, X2²)

  • Feature Maps

    Let φ(x) be a feature map, for example φ(x) = (x1², x1x2, x2²). We want to run the perceptron in the feature space φ(x).

    [Figure: the map φ from (X1, X2) coordinates to (Z1, Z2, Z3) coordinates]

  • Feature Maps

    [Figure: a dataset plotted in the (X1, X2) plane]

    What feature map will make this dataset linearly separable?

  • Feature Maps

    Let φ(x) be a feature map, for example φ(x) = (x1², x1, x2). We want to run the perceptron in the feature space φ(x).

    Many linear classification algorithms, including the perceptron, SVMs, etc., can be written in terms of dot products ⟨x, z⟩.

    We can simply change these dot products to ⟨φ(x), φ(z)⟩.

    For a feature map φ we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩

    Thus we can write the perceptron algorithm in terms of K.
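    Since each wm is a sum of terms yt φ(xt), the inner product ⟨wm, φ(x)⟩ expands into a sum of kernel evaluations. A sketch of the perceptron written purely in terms of K (storing the mistake indices is one standard representation, not taken verbatim from the slides):

    ```python
    import numpy as np

    def kernel_perceptron(X, y, K):
        """One pass of the perceptron in feature space, using only kernel evaluations.
        w is kept implicitly as the sum of y[i] * phi(X[i]) over the mistake indices."""
        mistakes = [0]                                           # w1 = y1 phi(x1)
        for t in range(1, len(X)):
            score = sum(y[i] * K(X[i], X[t]) for i in mistakes)  # <w, phi(xt)>
            if y[t] * score < 0:
                mistakes.append(t)                               # w <- w + yt phi(xt)
        return mistakes

    quad_kernel = lambda x, z: np.dot(x, z) ** 2                 # K(x, z) = (<x, z>)^2
    ```

    To classify a new x, output sign( Σ_{i in mistakes} yi K(xi, x) ).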

  • Kernels

    For a feature map φ we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩

    Examples:

    1. K(x, z) = (⟨x, z⟩)²

  • Kernels

    For a feature map φ we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩

    Examples:

    1. K(x, z) = (⟨x, z⟩)²

       For d = 3, φ(x) = [x1², x1x2, x1x3, x2x1, x2², x2x3, x3x1, x3x2, x3²]ᵀ

       K(x, z) = ( Σ_{i=1}^{d} xi zi ) ( Σ_{i=1}^{d} xi zi ) = Σ_{i=1}^{d} Σ_{j=1}^{d} xi xj zi zj = Σ_{i,j=1}^{d} (xi xj)(zi zj)

       Time to compute K directly = O(d). Time to compute K through the feature map = O(d²).
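    A quick numerical check of this identity for d = 3 (illustrative values):

    ```python
    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])

    phi = lambda v: np.outer(v, v).ravel()   # [v1^2, v1v2, v1v3, ..., v3^2]

    direct  = np.dot(x, z) ** 2              # K(x, z) = (<x, z>)^2, O(d) time
    via_map = np.dot(phi(x), phi(z))         # <phi(x), phi(z)>, O(d^2) time
    print(direct, via_map)                   # both print 20.25
    ```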

  • Kernels

    For a feature map φ we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩

    Examples:

    1. K(x, z) = (⟨x, z⟩)²

    2. K(x, z) = (⟨x, z⟩ + c)²

    In Class Problem: What is the feature map for this kernel?