
  • 4. Linear Classification

    Kai Yu

  • 2

    Linear Classifiers

    •  The simplest classification model

    •  Helps to understand nonlinear models

    •  Arguably the most useful classification method!


  • Outline

    §  Perceptron Algorithm

    §  Support Vector Machines

    §  Logistic Regression

    §  Summary

    4

  • 5

    Basic Neuron

  • 6

    Perceptron Node – Threshold Logic Unit

    [Diagram: inputs x1 … xn with weights w1 … wn feeding a threshold unit with threshold b and output z]

    z = 1 if ∑_{i=1}^n wi xi ≥ b
    z = 0 if ∑_{i=1}^n wi xi < b
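    A tiny Python sketch of this threshold logic unit (variable names are illustrative):

```python
# Threshold logic unit: z = 1 if sum_i w_i * x_i >= b, else 0.
import numpy as np

def tlu(x, w, b):
    """Return the unit's output z for inputs x, weights w and threshold b."""
    return 1 if np.dot(w, x) >= b else 0

# Weights and threshold used on the following slides: w = (.4, -.2), b = .1
print(tlu([0.8, 0.3], [0.4, -0.2], 0.1))   # net = 0.26 >= 0.1 -> prints 1
```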

  • 7

    Perceptron Learning Algorithm

    [Diagram: two inputs x1, x2 with initial weights w1 = .4, w2 = −.2, threshold b = .1, output z]

    Training set:
    x1   x2   t
    .8   .3   1
    .4   .1   0

    z = 1 if ∑_{i=1}^n wi xi ≥ b
    z = 0 if ∑_{i=1}^n wi xi < b

  • 8

    First Training Instance

    Inputs: x1 = .8, x2 = .3 (target t = 1); weights w1 = .4, w2 = −.2, threshold b = .1

    net = .8*.4 + .3*(−.2) = .26

    net ≥ b, so z = 1 — the output matches the target t = 1, so the weights are unchanged.

  • 9

    Second Training Instance

    Inputs: x1 = .4, x2 = .1 (target t = 0); weights w1 = .4, w2 = −.2, threshold b = .1

    net = .4*.4 + .1*(−.2) = .14

    net ≥ b, so z = 1, but the target is t = 0 — an error, so the weights are updated:

    Δwi = (t − z) * c * xi   (checked numerically below)
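    A quick numeric check of this step in Python (the learning rate c is not given on the slide; c = 0.1 is an arbitrary illustrative choice):

```python
# Second training instance: net = .14 >= b = .1 gives z = 1, but t = 0, so update.
c = 0.1                                   # learning rate (assumed; the slide leaves it unspecified)
x, w, b, t = [0.4, 0.1], [0.4, -0.2], 0.1, 0
net = sum(wi * xi for wi, xi in zip(w, x))
z = 1 if net >= b else 0                  # net = 0.14 -> z = 1
dw = [(t - z) * c * xi for xi in x]       # delta w_i = (t - z) * c * x_i = [-0.04, -0.01]
w = [wi + dwi for wi, dwi in zip(w, dw)]  # updated weights ~ [0.36, -0.21]
print(net, z, dw, w)
```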

  • 10

    The Perceptron Learning Rule

    Δwij = c(tj – zj) xi

    §  Least perturbation principle

    –  Only change weights if there is an error (learning from mistakes)

    –  The change amounts to xi scaled by a small c

    §  Iteratively apply an example from the training set

    §  Each iteration through the training set is an epoch

    §  Continue training until the total training error is less than epsilon (see the training-loop sketch below)

    §  Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists (i.e., if the data are linearly separable)
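    A minimal sketch of the procedure described above, under the slides' conventions (targets in {0, 1}, fixed threshold b, mistake-driven updates); names and default values are illustrative:

```python
# Perceptron training: delta w_i = c * (t - z) * x_i, applied only on mistakes,
# repeated over epochs until the number of training errors drops to epsilon or below.
import numpy as np

def train_perceptron(X, T, c=0.1, b=0.0, epsilon=0, max_epochs=100):
    """X: (n, d) inputs, T: (n,) targets in {0, 1}. Returns the weight vector w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):               # one pass over the data = one epoch
        errors = 0
        for x, t in zip(X, T):
            z = 1 if np.dot(w, x) >= b else 0
            if z != t:                         # least perturbation: change only on error
                w += c * (t - z) * x
                errors += 1
        if errors <= epsilon:
            break
    return w
```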

  • Outline

    §  Perceptron Algorithm §  Support Vector Machines §  Logistic Regression §  Summary

    11

  • Support Vector Machines: Overview

    • A powerful method for 2-class classification

    •  Original idea: Vapnik, 1965 for linear classifiers

    •  SVM: Cortes and Vapnik, 1995

    •  Became very popular in the late 90s

    • Better generalization (less overfitting)

    • Key ideas

    – Use a kernel function to map low-dimensional training samples to a higher-dimensional space (to address linear separability)

    – Use quadratic programming (QP) to find the best classifier boundary hyperplane (for a global optimum)

  • Linear Classifiers

    [Figure: 2-D data; + and − denote the two classes; classifier x → f(x, w, b) → y_est]

    f(x, w, b) = sign(w·x − b)

    How would you classify this data?


  • Linear Classifiers

    [Figure: several candidate separating lines through the same 2-D data; + and − denote the two classes]

    f(x, w, b) = sign(w·x − b)

    Any of these would be fine..

    ..but which is best?

  • Classifier Margin

    [Figure: a separating line and its margin; + and − denote the two classes]

    f(x, w, b) = sign(w·x − b)

    Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

  • Maximum Margin

    [Figure: the maximum-margin separating line; + and − denote the two classes]

    f(x, w, b) = sign(w·x − b)

    The maximum margin linear classifier is the linear classifier with maximum margin.

    This is the simplest kind of SVM (Called an LSVM)

    Linear SVM

  • Maximum Margin

    [Figure: the maximum-margin separating line with its support vectors highlighted]

    f(x, w, b) = sign(w·x − b)

    The maximum margin linear classifier is the linear classifier with maximum margin.

    This is the simplest kind of SVM (Called an LSVM)

    Support Vectors are those datapoints that the margin pushes up against

    Linear SVM

  • Computing the margin width

    §  Plus-plane  = { x : w·x + b = +1 }
    §  Minus-plane = { x : w·x + b = −1 }
    §  Margin width M = 2 / √(w·w)

    How do we compute M in terms of w and b?
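    The question above has a short standard answer; here is the derivation written out (in LaTeX) from the plus/minus-plane definitions, with x⁻ on the minus-plane and x⁺ its closest point on the plus-plane:

```latex
% Both planes have normal w, so x^+ = x^- + \lambda w for some scalar \lambda.
\[
\begin{aligned}
  w \cdot x^{+} + b = +1, &\qquad w \cdot x^{-} + b = -1,\\
  w \cdot (x^{-} + \lambda w) + b = +1
  &\;\Longrightarrow\; \lambda\,(w \cdot w) = 2
  \;\Longrightarrow\; \lambda = \frac{2}{w \cdot w},\\
  M = \lVert x^{+} - x^{-} \rVert
  &= \lambda\,\lVert w \rVert
   = \frac{2}{w \cdot w}\,\sqrt{w \cdot w}
   = \frac{2}{\sqrt{w \cdot w}}.
\end{aligned}
\]
```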

  • Why Maximum Margin?

    [Figure: the maximum-margin classifier, as on the previous slide; + and − denote the two classes]

    f(x, w, b) = sign(w·x − b)

    The maximum margin linear classifier is the linear classifier with the maximum margin.

    This is the simplest kind of SVM (Called an LSVM)

    Support Vectors are those datapoints that the margin pushes up against

    1.  Intuitively this feels safest.

    2.  If we’ve made a small error in the location of the boundary this gives us least chance of causing a misclassification.

    3.  It also helps generalization

    4.  There’s some theory that this is a good thing.

    5.  Empirically it works very very well.

  • Another way to understand max margin

    §  For f(x) = w·x + b, one way to characterize the smoothness of f(x) is its gradient:  ∂f(x)/∂x = w

    §  Therefore, the margin (= 2/‖w‖) measures the smoothness of f(x): a larger margin means a smaller ‖w‖, hence a smoother f.

    §  As a rule of thumb, machine learning prefers smooth functions: similar x's should have similar y's.

  • Learning the Maximum Margin Classifier

    Given a guess of w and b we can

    §  Compute whether all data points lie in the correct half-planes

    §  Compute the width of the margin:  M = 2 / √(w·w)

    So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How?

  • Learning via Quadratic Programming

    §  QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.

    §  Minimize both w⋅w (to maximize M) and misclassification error

  • Quadratic Programming

    Find  arg max_u   c + dᵀu + (uᵀ R u)/2        (quadratic criterion)

    Subject to n linear inequality constraints:

    a11 u1 + a12 u2 + … + a1m um ≤ b1
    a21 u1 + a22 u2 + … + a2m um ≤ b2
      ⋮
    an1 u1 + an2 u2 + … + anm um ≤ bn

    And subject to e additional linear equality constraints:

    a(n+1)1 u1 + a(n+1)2 u2 + … + a(n+1)m um = b(n+1)
    a(n+2)1 u1 + a(n+2)2 u2 + … + a(n+2)m um = b(n+2)
      ⋮
    a(n+e)1 u1 + a(n+e)2 u2 + … + a(n+e)m um = b(n+e)

  • Learning without Noise

    What should our quadratic optimization criterion be?

    Minimize  ½ w·w

    What should the constraints be (one per training point)?

    w·xk + b ≥ +1  if yk = +1
    w·xk + b ≤ −1  if yk = −1

  • Solving the Optimization Problem

    §  Need to optimize a quadratic function subject to linear constraints.

    §  Quadratic optimization problems are a well-known class of

    mathematical programming problems for which several (non-trivial) algorithms exist.

    §  The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every inequality constraint in the primal (original) problem:

    Find w and b such that Φ(w) =wTw is minimized and for all (xi, yi), i=1..n : yi (wTxi + b) ≥ 1

    Find α1…αn such that Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and (1) Σαiyi = 0 (2) αi ≥ 0 for all αi
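    For concreteness, here is a sketch of handing this dual to a generic QP solver. It assumes the cvxopt package and linearly separable data (hard margin); the function and variable names are illustrative, not part of the slides:

```python
# Solving the hard-margin SVM dual with a generic QP solver (sketch; assumes cvxopt).
import numpy as np
from cvxopt import matrix, solvers

def fit_svm_dual(X, y):
    """X: (n, d) array; y: (n,) array with entries +/-1. Returns (w, b)."""
    n = X.shape[0]
    y = y.astype(float)
    K = X @ X.T                               # Gram matrix of dot products x_i . x_j
    P = matrix(np.outer(y, y) * K)            # cvxopt minimizes (1/2) a'Pa + q'a ...
    q = matrix(-np.ones(n))                   # ... so maximizing Q(alpha) means q = -1
    G = matrix(-np.eye(n))                    # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1))              # equality constraint sum_i alpha_i y_i = 0
    b_eq = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)['x'])
    w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                         # support vectors have alpha_i > 0
    b = np.mean(y[sv] - X[sv] @ w)            # from y_i (w.x_i + b) = 1 on the SVs
    return w, b
```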

  • Soft Margin Classification

    § What if the training set is not linearly separable?

    §  Slack variables εi can be added to allow misclassification of difficult or noisy examples, resulting in a so-called soft margin.

    [Figure: a separating hyperplane with two violating points and their slacks εi]

  • Learning Maximum Margin with Noise

    Given a guess of w, b we can

    §  Compute the sum of distances of points to their correct zones

    §  Compute the margin width:  M = 2 / √(w·w)

    Assume R data points, each (xk, yk) where yk = ±1.

    What should our quadratic optimization criterion be?

    Minimize  ½ w·w + C ∑_{k=1}^R εk

    How many constraints will we have?  R

    What should they be?

    w·xk + b ≥ +1 − εk  if yk = +1
    w·xk + b ≤ −1 + εk  if yk = −1
    εk ≥ 0  for all k

    εk = distance of error point k to its correct place

  • Soft Margin Classification Mathematically

    §  The old formulation:

    §  Modified formulation incorporates slack variables:

    §  Parameter C can be viewed as a way to control overfitting: it “trades off” the relative importance of maximizing the margin and fitting the training data.

    Find w and b such that Φ(w) = wᵀw is minimized and for all (xi, yi), i = 1..n:  yi (wᵀxi + b) ≥ 1

    Find w and b such that Φ(w) = wᵀw + C ∑ ξi is minimized and for all (xi, yi), i = 1..n:  yi (wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0
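    In practice this soft-margin problem is usually solved with an off-the-shelf library. A minimal sketch using scikit-learn's LinearSVC on toy data (assuming scikit-learn is installed; the C value and the data are arbitrary):

```python
# Soft-margin linear SVM via scikit-learn (sketch; assumes scikit-learn is installed).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2.0,      # toy +1 class
               rng.randn(50, 2) - 2.0])     # toy -1 class
y = np.array([1] * 50 + [-1] * 50)

clf = LinearSVC(C=1.0)                      # larger C fits the data harder; smaller C widens the margin
clf.fit(X, y)
print(clf.coef_, clf.intercept_)            # learned w and b
print(clf.score(X, y))                      # training accuracy
```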

  • Hinge Loss

    The previous constrained formulation can be interpreted as minimizing the hinge loss function

    L(w, b) := ∑_{i=1}^n max(1 − yi (wᵀxi + b), 0)

    which serves as an approximation to the number of errors made on the training set.

    §  The soft margin SVM is equivalent to applying a hinge loss

    §  Equivalent unconstrained optimization formulation:

    min_{w,b} L(w, b) + λ‖w‖²,   with λ = 0.5/C

    [Figure: hinge loss and 0-1 loss plotted against the margin y·(w·x + b)]
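    As a concrete illustration of the unconstrained hinge-loss form, here is a short numpy sketch that minimizes L(w, b) + λ‖w‖² by sub-gradient descent (step size, iteration count and names are illustrative choices, not from the slides):

```python
# Sub-gradient descent on the regularized hinge loss: min_{w,b} L(w,b) + lam * ||w||^2.
import numpy as np

def train_hinge(X, y, lam=0.01, lr=0.1, epochs=200):
    """X: (n, d) array, y: (n,) array with entries +/-1. Returns (w, b)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                  # points inside the margin or misclassified
        # sub-gradient of sum_i max(1 - y_i (w.x_i + b), 0) + lam * ||w||^2
        grad_w = -(y[active] @ X[active]) + 2 * lam * w
        grad_b = -np.sum(y[active])
        w -= lr / n * grad_w
        b -= lr / n * grad_b
    return w, b
```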

  • Outline

    §  Perceptron Algorithm §  Support Vector Machines §  Logistic Regression §  Summary

    33

  • Logistic Regression

    §  Binary response: Y = {+1, −1}

    Yi | Xi ∼ Bernoulli(pi),   pi = 1 / (1 + exp(−Wᵀ Xi))

    where pi is the probability of Yi = 1

    §  Likelihood

    ∏_{i=1}^n P(Yi | Xi) = ∏_{i=1}^n 1 / (1 + exp(−Yi Xiᵀ W))

  • Logistic regression

    §  Maximum likelihood estimator (MLE) becomes logistic regression

    min_W ∑_{i=1}^n −ln p(Yi | Xi) = ∑_{i=1}^n ln(1 + exp(−Yi Xiᵀ W))

    §  Convex optimization problem in terms of W

    §  MAP is regularized logistic regression

    min_W ∑_{i=1}^n ln(1 + exp(−Yi Xiᵀ W)) + λ‖W‖²
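    A minimal numpy sketch of minimizing this regularized objective by plain gradient descent (step size and names are illustrative choices):

```python
# Gradient descent for regularized logistic regression:
# min_W sum_i ln(1 + exp(-y_i x_i^T W)) + lam * ||W||^2, with y_i in {+1, -1}.
import numpy as np

def train_logreg(X, y, lam=0.01, lr=0.1, iters=500):
    """X: (n, d) array, y: (n,) array with entries +/-1. Returns W."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(iters):
        z = y * (X @ W)
        sigma = 1.0 / (1.0 + np.exp(-z))      # P(Y_i | X_i) under the current W
        # gradient: sum_i -(1 - sigma_i) * y_i * x_i + 2 * lam * W
        grad = -((1.0 - sigma) * y) @ X + 2 * lam * W
        W -= lr / n * grad
    return W
```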

  • Outline

    §  Perceptron Algorithm §  Support Vector Machines §  Logistic Regression §  Summary

    36

  • General formulation of linear classifiers

    min_{w,b} L(w, b) + λ‖w‖²

    §  The objective: empirical loss + regularization

    §  The regularization term is usually the L2 norm, but the L1 norm is also often used for sparse models

    §  The empirical loss can be the hinge loss, logistic loss, smooth hinge loss, … or your own invention (a few common choices are written out below)

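    For reference, the losses mentioned above written as functions of the margin m = y·f(x) (an illustrative sketch; the smoothed hinge shown is one common variant):

```python
# Common per-example losses as a function of the margin m = y * f(x).
import numpy as np

def zero_one(m):     return (m <= 0).astype(float)            # classification error
def hinge(m):        return np.maximum(0.0, 1.0 - m)          # SVM
def logistic(m):     return np.log1p(np.exp(-m))              # logistic regression
def squared(m):      return (1.0 - m) ** 2                    # least squares
def exponential(m):  return np.exp(-m)                        # exponential loss
def smooth_hinge(m):                                          # one smoothed-hinge variant
    return np.where(m >= 1, 0.0,
           np.where(m <= 0, 0.5 - m, 0.5 * (1.0 - m) ** 2))

m = np.linspace(-3, 2, 11)                                    # margins to evaluate
print(hinge(m), logistic(m))
```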

  • Different loss functions

    [Figure: loss value vs. the margin f(X)·Y for classification error (0-1), SVM (hinge), logistic regression, least squares, exponential, and modified Huber losses]

  • Comments on linear classifiers

    §  Choosing logistic regression or SVM?

    –  Not that big a difference!

    –  Logistic regression outputs probabilities.

    §  Smooth loss functions, e.g. logistic

    –  Easier to optimize (L-BFGS, …)

    –  Hinge → differentiable (smoothed) hinge, then you can easily have your own implementation of SVMs

    §  Try different loss functions and regularization terms

    –  Depends on the data, e.g., many outliers? Irrelevant features? Structure in the output?

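    To illustrate the point about smooth losses being easy to optimize, here is a hedged sketch of fitting the regularized logistic loss with SciPy's L-BFGS routine (assumes scipy is installed; names are illustrative):

```python
# Fitting the regularized logistic loss with L-BFGS via scipy (smooth => easy to optimize).
import numpy as np
from scipy.optimize import minimize

def fit_logistic_lbfgs(X, y, lam=0.01):
    """X: (n, d) array, y: (n,) array with entries +/-1. Returns the weight vector w."""
    def objective(w):
        z = y * (X @ w)
        loss = np.sum(np.log1p(np.exp(-z))) + lam * (w @ w)
        sigma = 1.0 / (1.0 + np.exp(-z))
        grad = -((1.0 - sigma) * y) @ X + 2 * lam * w   # gradient of the objective
        return loss, grad
    res = minimize(objective, np.zeros(X.shape[1]), jac=True, method='L-BFGS-B')
    return res.x
```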

  • Linear classifiers in practice and research

    §  Linear classifiers are simple and scalable

    –  Training complexity is linear or sub-linear w.r.t. data size

    –  Classification is simple to implement

    –  State-of-the-art for texts, images, …

    §  Their success depends on quality of features

    –  A useful practice: use a lot of features, learn sparse w

    §  Hot topic: large-scale linear classification

    –  Many data points, many features, many classes

    –  Stochastic optimization

    –  Parallel implementation
