CSE 151 Machine Learning Instructor: Kamalika Chaudhuri


  • CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

  • Linear Classification

    Given labeled data (xi, yi), i = 1, .., n, where xi is a feature vector and yi is a label that is +1 or -1, find a hyperplane to separate + from -.

  • Linear Classification Process

    Training Data → Classification Algorithm → Prediction Rule: vector w

    [Figure: + and - training points, and the resulting separating hyperplane through the origin O with normal vector w]

    How do you classify a test example x?

    Output y = sign(⟨w, x⟩)
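    As a concrete reading of this rule, here is a minimal Python sketch (NumPy-based; the function name and the tie-breaking at 0 are my own choices, not from the slides):

    ```python
    import numpy as np

    def predict(w, x):
        """Linear prediction rule: y = sign(<w, x>)."""
        # Convention: treat <w, x> = 0 as +1 (the slides leave this case unspecified).
        return 1 if np.dot(w, x) >= 0 else -1
    ```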

  • Linear Classification

    [Figure: + and - points separated by a hyperplane through the origin O, with normal vector w]

    Given labeled data (xi, yi), i=1,..,n, where y is +1 or -1,

    Find a hyperplane through the origin to separate + from -

    w: normal vector to the hyperplane

    For a point x on one side of the hyperplane, ⟨w, x⟩ > 0. For a point x on the other side, ⟨w, x⟩ < 0.

  • The Perceptron Algorithm

    Goal: Given labeled data (xi, yi), i=1,..,n, where y is +1 or -1, find a vector w such that the corresponding hyperplane separates + from -.

    Perceptron Algorithm:

    1. Initially, w1 = y1 x1

    2. For t = 2, 3, ....

       If yt ⟨wt, xt⟩ < 0, then: wt+1 = wt + yt xt

       Else: wt+1 = wt
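    A minimal sketch of this update rule in Python (NumPy-based; the function name and data layout are illustrative, not from the slides):

    ```python
    import numpy as np

    def perceptron(X, y):
        """One pass of the perceptron over examples (X[t], y[t]), where X is an (n, d)
        array and y is an array of +1/-1 labels, following the slide's update rule."""
        w = y[0] * X[0]                        # 1. Initially, w1 = y1 x1
        for t in range(1, len(X)):             # 2. For t = 2, 3, ....
            if y[t] * np.dot(w, X[t]) < 0:     #    mistake on (xt, yt)
                w = w + y[t] * X[t]            #    wt+1 = wt + yt xt
            # else: wt+1 = wt (no change)
        return w
    ```

    If the data is linearly separable, repeating such passes until no mistakes are made yields a separating w.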

  • Linear Separability

    [Figure: two datasets of + and - points; the one on the left is linearly separable, the one on the right is not linearly separable]

  • Measure of Separability: Margin

    For a unit vector w, and training set S, margin of w is defined as:

    γ = min_{x ∈ S} |⟨w, x⟩| / ‖x‖

    min cosine of angle between w and x, for any x in S

    [Figure: the angle θ between w and the closest data point; a low-margin case and a high-margin case]
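    A small sketch of the margin definition above (assuming, as in the slide, that w is a unit vector and the rows of X are the training points):

    ```python
    import numpy as np

    def margin(w, X):
        """gamma = min over x in S of |<w, x>| / ||x||."""
        return min(abs(np.dot(w, x)) / np.linalg.norm(x) for x in X)
    ```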

  • Real Examples: Margin

    For a unit vector w, and training set S, margin of w is defined as:

    γ = min_{x ∈ S} |⟨w, x⟩| / ‖x‖

    min cosine of angle between w and x, for any x in data

    [Figure: two real datasets with their separators w; low margin on the left, high margin on the right]

  • When does Perceptron work?

    Theorem: If ‖xi‖ = 1 for all xi in the training data, and the training points are linearly separable with margin γ, then the number of mistakes made by the perceptron is at most 1/γ².

    Observations:

    1. The lower the margin, the more mistakes.

    2. May need more than one pass over the training data to get a classifier which makes no mistakes on the training set.

  • What if data is not linearly separable?

    Ideally, we want to find a linear separator that makes the minimum number of mistakes on the training data.

    But this is NP-hard.

    [Figure: a dataset of + and - points that is not linearly separable]

  • Voted Perceptron

    Perceptron:

    1. Initially: m = 1, w1 = y1 x1

    2. For t = 2, 3, ....

       If yt ⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1

    3. Output wm

    Voted Perceptron:

    1. Initially: m = 1, w1 = y1 x1, c1 = 1

    2. For t = 2, 3, ....

       If yt ⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1, cm = 1

       Else: cm = cm + 1

    3. Output (w1, c1), (w2, c2), ..., (wm, cm)

  • Voted Perceptron

    Voted Perceptron:

    1. Initially: m = 1, w1 = y1 x1, c1 = 1

    2. For t = 2, 3, ....

       If yt ⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1, cm = 1

       Else: cm = cm + 1

    3. Output (w1, c1), (w2, c2), ..., (wm, cm)

    How to classify example x?

    Output: sign( Σ_{i=1}^{m} ci sign(⟨wi, x⟩) )

    Problem: Have to store all the classifiers

  • Averaged Perceptron

    Averaged Perceptron:

    1. Initially: m = 1, w1 = y1 x1, c1 = 1

    2. For t = 2, 3, ....

       If yt ⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1, cm = 1

       Else: cm = cm + 1

    3. Output (w1, c1), (w2, c2), ..., (wm, cm)

    How to classify example x?

    Output: sign( ⟨ Σ_{i=1}^{m} ci wi, x ⟩ )

  • Voted vs. Averaged

    How to classify example x?

    Averaged: sign( ⟨ Σ_{i=1}^{m} ci wi, x ⟩ )

    Voted: sign( Σ_{i=1}^{m} ci sign(⟨wi, x⟩) )

    Example where voted and averaged are different:

    w1 = x1, w2 = x2, w3 = x3, c1 = c2 = c3 = 1

    Averaged predicts +1 when ⟨x1 + x2 + x3, x⟩ > 0; Voted predicts +1 when ⟨xi, x⟩ > 0 for at least two of x1, x2, x3 (see the sketch below).
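    A sketch of the two prediction rules, given the list of (wi, ci) pairs output by the algorithm above (names and data layout are illustrative):

    ```python
    import numpy as np

    def predict_voted(classifiers, x):
        """Voted: sign( sum_i ci * sign(<wi, x>) )."""
        return np.sign(sum(c * np.sign(np.dot(w, x)) for w, c in classifiers))

    def predict_averaged(classifiers, x):
        """Averaged: sign( < sum_i ci * wi, x > )."""
        w_avg = sum(c * w for w, c in classifiers)
        return np.sign(np.dot(w_avg, x))
    ```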

  • Perceptron: An Example

    Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)

    Step 1: w1 = (4, 0), c1 = 1

    Step 2: (x2, y2) = ( (1, 1), -1). y2 ⟨w1, x2⟩ < 0: Mistake! Update: w2 = (3, -1), c2 = 1

    Step 3: (x3, y3) = ( (0, 1), -1). y3 ⟨w2, x3⟩ > 0: Correct! Update: c2 = 2

    Step 4: (x4, y4) = ( (-2, -2), +1). y4 ⟨w2, x4⟩ < 0: Mistake! Update: w3 = (1, -3), c3 = 1

    Final Voted Perceptron: sign( sign(⟨w1, x⟩) + 2 sign(⟨w2, x⟩) + sign(⟨w3, x⟩) )

    Final Averaged Perceptron: sign(⟨w1 + 2 w2 + w3, x⟩)

    [Figure: the data points and the decision boundary after each update, showing w1, w2, w3 and the mistakes at x2 and x4]
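    The trace above can be checked mechanically. A sketch of one pass of the Voted Perceptron on this data (variable names are illustrative):

    ```python
    import numpy as np

    X = np.array([(4, 0), (1, 1), (0, 1), (-2, -2)], dtype=float)
    y = np.array([+1, -1, -1, +1])

    ws = [y[0] * X[0]]                           # Step 1: w1 = y1 x1 = (4, 0)
    cs = [1]                                     # c1 = 1
    for t in range(1, len(X)):
        if y[t] * np.dot(ws[-1], X[t]) <= 0:     # mistake
            ws.append(ws[-1] + y[t] * X[t])
            cs.append(1)
        else:                                    # correct
            cs[-1] += 1

    print([(tuple(w), c) for w, c in zip(ws, cs)])
    # [((4.0, 0.0), 1), ((3.0, -1.0), 2), ((1.0, -3.0), 1)]
    ```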

  • Summary

    • Linear classification
    • Perceptron
    • Voted and Averaged Perceptron
    • Some issues

  • Margins

    Suppose the data is linearly separable by a margin. The perceptron will stop once it finds a linear separator. But sometimes we would like to compute the max-margin separator. This is done by Support Vector Machines (SVMs).

    [Figure: a linearly separable dataset of + and - points, showing an arbitrary separator and the max-margin separator]

  • More General Linear Classification

    Given labeled data (xi, yi), i=1,..,n, where y is +1 or -1,

    Find a hyperplane to separate + from -

    We can transform general linear classification to linear classification by a hyperplane through the origin as follows:

    For each (xi, yi) (xi is d-dimensional), construct:

    (zi, yi), where zi is the d+1-dimensional vector [1, xi]

    Any hyperplane through the origin in the z-space corresponds to a hyperplane (not necessarily through the origin) in x-space, and vice versa
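    A minimal sketch of this construction (the helper name is my own):

    ```python
    import numpy as np

    def lift(X):
        """Map each d-dimensional row xi of X to the (d+1)-dimensional zi = [1, xi]."""
        return np.hstack([np.ones((X.shape[0], 1)), X])
    ```

    A hyperplane ⟨w, z⟩ = 0 through the origin in z-space, with w = [w0, w'], corresponds to the general hyperplane ⟨w', x⟩ + w0 = 0 in x-space.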

  • Categorical Features

    Features with no inherent ordering

    Example: Gender = "Male", "Female"; State = "Alaska", "Hawaii", "California"

    Convert a categorical feature with k categories to a k-dimensional vector: e.g., "Alaska" maps to [1, 0, 0], "Hawaii" to [0, 1, 0], "California" to [0, 0, 1]
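    A sketch of this conversion (illustrative helper, not from the slides):

    ```python
    def one_hot(value, categories):
        """Map a categorical value to a k-dimensional 0/1 vector, e.g.
        one_hot("Alaska", ["Alaska", "Hawaii", "California"]) == [1, 0, 0]."""
        return [1 if value == c else 0 for c in categories]
    ```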

  • Multiclass Linear Classification

    Main Idea: Reduce to multiple binary classification problems

    One-vs-all: For k classes, solve k binary classification problems: Class 1 vs. rest, Class 2 vs. rest, ..., Class k vs. rest

    All-vs-all: For k classes, solve k(k-1)/2 binary classification problems: Class i vs. j, for i=1,..,k and j = i+1,..,k

    To classify, combine by voting
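    A sketch of the one-vs-all reduction, assuming labels in {1, ..., k} stored in a NumPy array and any binary trainer such as the perceptron sketch above (here the class with the largest score ⟨wj, x⟩ is chosen, one common way to combine the k binary outputs):

    ```python
    import numpy as np

    def one_vs_all_train(X, y, k, train_binary):
        """Train k binary classifiers: class j vs. rest, for j = 1, ..., k."""
        return [train_binary(X, np.where(y == j, 1, -1)) for j in range(1, k + 1)]

    def one_vs_all_predict(ws, x):
        """Predict the class whose binary classifier scores x highest."""
        return 1 + int(np.argmax([np.dot(w, x) for w in ws]))
    ```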

  • The Confusion Matrix

    Indicates which classes are easy and hard to classify

    [Figure: a 6 × 6 grid of confusion-matrix entries, rows and columns labeled 1 through 6]

    High diagonal entry: easy to classify. High off-diagonal entry: high scope for confusion.

    Mij = #examples that have label j but are classified as i

    Nj = #examples with label j

    Cij = Mij/Nj
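    A sketch of these definitions (classes indexed 0, ..., k-1 for convenience):

    ```python
    import numpy as np

    def confusion_matrix(true_labels, predicted_labels, k):
        """C[i][j] = fraction of examples with label j that are classified as i."""
        M = np.zeros((k, k))
        for y_true, y_pred in zip(true_labels, predicted_labels):
            M[y_pred, y_true] += 1        # Mij: label j, classified as i
        N = M.sum(axis=0)                 # Nj = #examples with label j
        return M / np.maximum(N, 1)       # Cij = Mij / Nj (guard against empty classes)
    ```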

  • Kernels

  • Feature Maps

    Data which is not linearly separable may be linearly separable in another feature space.

    [Figure: the map φ sends points with coordinates (X1, X2) to points with coordinates (Z1, Z2, Z3)]

    In the figure, the data is linearly separable in z-space: (Z1, Z2, Z3) = (X1², 2 X1X2, X2²)

  • Feature Maps

    Let φ(x) be a feature map, for example φ(x) = (x1², x1x2, x2²). We want to run the perceptron in the feature space φ(x).

    [Figure: the map φ from (X1, X2) coordinates to (Z1, Z2, Z3) coordinates]

  • Feature Maps

    [Figure: a dataset plotted in the (X1, X2) plane]

    What feature map will make this dataset linearly separable?

  • Feature Maps

    Let φ(x) be a feature map, for example φ(x) = (x1², x1, x2). We want to run the perceptron in the feature space φ(x).

    Many linear classification algorithms, including the perceptron, SVMs, etc., can be written in terms of dot products ⟨x, z⟩.

    We can simply change these dot products to ⟨φ(x), φ(z)⟩.

    For a feature map φ we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩

    Thus we can write the perceptron algorithm in terms of K.
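    Since each wm is a sum of terms yt φ(xt), the inner product ⟨wm, φ(x)⟩ expands into a sum of kernel evaluations. A sketch of the perceptron written purely in terms of K (storing the mistake indices is one standard representation, not taken verbatim from the slides):

    ```python
    import numpy as np

    def kernel_perceptron(X, y, K):
        """One pass of the perceptron in feature space, using only kernel evaluations.
        w is kept implicitly as the sum of y[i] * phi(X[i]) over the mistake indices."""
        mistakes = [0]                                           # w1 = y1 phi(x1)
        for t in range(1, len(X)):
            score = sum(y[i] * K(X[i], X[t]) for i in mistakes)  # <w, phi(xt)>
            if y[t] * score < 0:
                mistakes.append(t)                               # w <- w + yt phi(xt)
        return mistakes

    quad_kernel = lambda x, z: np.dot(x, z) ** 2                 # K(x, z) = (<x, z>)^2
    ```

    To classify a new x, output sign( Σ_{i in mistakes} yi K(xi, x) ).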

  • Kernels

    For a feature map φ we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩

    Examples:

    1. K(x, z) = (⟨x, z⟩)²

  • Kernels

    For a feature map φ we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩

    Examples:

    1. K(x, z) = (⟨x, z⟩)²

       For d = 3, φ(x) = [x1², x1x2, x1x3, x2x1, x2², x2x3, x3x1, x3x2, x3²]ᵀ

       K(x, z) = ( Σ_{i=1}^{d} xi zi ) ( Σ_{i=1}^{d} xi zi ) = Σ_{i=1}^{d} Σ_{j=1}^{d} xi xj zi zj = Σ_{i,j=1}^{d} (xi xj)(zi zj)

       Time to compute K directly = O(d). Time to compute K through the feature map = O(d²).
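    A quick numerical check of this identity for d = 3 (illustrative values):

    ```python
    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])

    phi = lambda v: np.outer(v, v).ravel()   # [v1^2, v1v2, v1v3, ..., v3^2]

    direct  = np.dot(x, z) ** 2              # K(x, z) = (<x, z>)^2, O(d) time
    via_map = np.dot(phi(x), phi(z))         # <phi(x), phi(z)>, O(d^2) time
    print(direct, via_map)                   # both print 20.25
    ```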

  • Kernels

    For a feature map φ we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩

    Examples:

    1. K(x, z) = (⟨x, z⟩)²

    2. K(x, z) = (⟨x, z⟩ + c)²

    In Class Problem: What is the feature map for this kernel?