4. Linear Classification — bigeye.au.tsinghua.edu.cn/dragonstar2012/docs/dragonstar... (support vector...)
TRANSCRIPT
-
4. Linear Classification
Kai Yu
-
2
Linear Classifiers
• The simplest classification model
• Helps to understand nonlinear models
• Arguably the most useful classification method!
-
Outline
§ Perceptron Algorithm
§ Support Vector Machines
§ Logistic Regression
§ Summary
4
-
5
Basic Neuron
-
6
Perceptron Node – Threshold Logic Unit
[Diagram: inputs x1, x2, …, xn with weights w1, w2, …, wn feeding a threshold unit with threshold b and output z]
z = 1 if ∑_{i=1}^{n} wi·xi ≥ b
z = 0 if ∑_{i=1}^{n} wi·xi < b
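As an illustrative sketch (not from the slides), the threshold logic unit above can be written in a few lines of Python; the example weights and threshold match the ones used on the following slides:

```python
# Threshold logic unit (TLU): output 1 when the weighted sum reaches threshold b.
def tlu(x, w, b):
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net >= b else 0

# Example with w = (.4, -.2), b = .1: net = .8*.4 + .3*(-.2) = .26 >= .1
print(tlu([0.8, 0.3], [0.4, -0.2], 0.1))  # → 1
```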
-
7
Perceptron Learning Algorithm
[Diagram: two-input perceptron with weights w1 = .4, w2 = −.2 and threshold b = .1]
Training set:
x1  x2  t
.8  .3  1
.4  .1  0
z = 1 if ∑_{i=1}^{n} wi·xi ≥ b
z = 0 if ∑_{i=1}^{n} wi·xi < b
-
8
First Training Instance
[Diagram: inputs x1 = .8, x2 = .3 fed through weights .4 and −.2]
net = .8·.4 + .3·(−.2) = .26 ≥ b = .1, so z = 1 (target t = 1: no error, no weight change)
Training set:
x1  x2  t
.8  .3  1
.4  .1  0
-
9
Second Training Instance
[Diagram: inputs x1 = .4, x2 = .1 fed through weights .4 and −.2]
net = .4·.4 + .1·(−.2) = .14 ≥ b = .1, so z = 1, but target t = 0: an error, so apply
Δwi = (t − z) · c · xi
Training set:
x1  x2  t
.8  .3  1
.4  .1  0
-
10
The Perceptron Learning Rule
Δwij = c(tj – zj) xi
§ Least perturbation principle
– Only change weights if there is an error (learning from mistakes)
– The change amounts to xi scaled by a small c
§ Iteratively apply the rule to each example in the training set
§ Each pass through the training set is an epoch
§ Continue training until the total training error is less than epsilon
§ Perceptron Convergence Theorem: Guaranteed to
find a solution in finite time if a solution exists
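A minimal pure-Python sketch of the perceptron learning rule above; the toy data, learning rate c, and epoch limit are illustrative assumptions, with the threshold b folded in as an extra weight on a constant −1 input:

```python
# Perceptron learning: w_i += c * (t - z) * x_i, updating only on errors.
# The threshold b is folded in as a weight on a constant -1 input,
# so z = 1 iff w.x >= 0 on the augmented input.
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train_perceptron(data, c=0.1, epochs=100):
    # data: list of (inputs, target); inputs already augmented with -1
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        errors = 0
        for x, t in data:
            z = predict(w, x)
            if z != t:  # least perturbation: change weights only on a mistake
                w = [wi + c * (t - z) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:  # a full error-free epoch: converged
            break
    return w

# Toy linearly separable set: t = 1 iff x1 + x2 >= 1 (bias input -1 appended)
data = [([0.9, 0.9, -1.0], 1), ([0.1, 0.1, -1.0], 0),
        ([0.8, 0.3, -1.0], 1), ([0.2, 0.1, -1.0], 0)]
w = train_perceptron(data)
```

Since the toy set is linearly separable, the convergence theorem guarantees the loop exits with zero training errors.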
-
Outline
§ Perceptron Algorithm
§ Support Vector Machines
§ Logistic Regression
§ Summary
11
-
Support Vector Machines: Overview
• A powerful method for 2-class classification
• Original idea: Vapnik, 1965, for linear classifiers
• SVM: Cortes and Vapnik, 1995
• Became very popular since the late 90s
• Better generalization (less overfitting)
• Key ideas
– Use a kernel function to transform low-dimensional training samples to a higher dimension (to address linear separability)
– Use quadratic programming (QP) to find the best classifier boundary hyperplane (for global optima)
-
Linear Classifiers
[Plot: 2-D data points, one marker denotes +1, the other denotes −1, with a candidate separating line]
f(x, w, b) = sign(w·x − b)
How would you classify this data?
-
Linear Classifiers
[Plot: the same data with several different separating lines drawn]
f(x, w, b) = sign(w·x − b)
Any of these would be fine..
..but which is best?
-
Classifier Margin
[Plot: a separating line with the margin band drawn around it]
f(x, w, b) = sign(w·x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
-
Maximum Margin
[Plot: the maximum margin separator, with support vectors lying on the margin boundaries]
f(x, w, b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with maximum margin.
This is the simplest kind of SVM (Called an LSVM)
Support Vectors are those datapoints that the margin pushes up against
Linear SVM
-
Computing the margin width
§ Plus-plane = { x : w·x + b = +1 }
§ Minus-plane = { x : w·x + b = −1 }
§ M = margin width. How do we compute M in terms of w and b?
M = 2 / √(w·w)
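As a quick numeric check (the weight values are illustrative), the margin width M = 2/√(w·w) = 2/‖w‖:

```python
import math

def margin_width(w):
    # M = 2 / sqrt(w . w) = 2 / ||w||
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(margin_width([3.0, 4.0]))  # ||w|| = 5, so M = 0.4
```

Note that a larger ‖w‖ means a narrower margin, which is why minimizing w·w maximizes M.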
-
Why Maximum Margin?
[Plot: the same maximum margin separator and support vectors]
f(x, w, b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (Called an LSVM)
Support Vectors are those datapoints that the margin pushes up against
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary this gives us least chance of causing a misclassification.
3. It also helps generalization
4. There’s some theory that this is a good thing.
5. Empirically it works very very well.
-
Another way to understand max margin
§ For f(x) = w·x + b, one way to characterize the smoothness of f(x) is the gradient: ∂f(x)/∂x = w
§ Therefore, the margin (M = 2/‖w‖) measures the smoothness of f(x): a larger margin means a smaller ‖w‖ and a smoother f
§ As a rule of thumb, machine learning prefers smooth functions: similar x's should have similar y's
8/7/12 23
-
Learning the Maximum Margin Classifier
Given a guess of w and b we can
§ Compute whether all data points in the correct half-planes
§ Compute the width of the margin
So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the data points. How?
M = margin width = distance between x− (on the minus-plane) and x+ (on the plus-plane) = 2 / √(w·w)
-
Learning via Quadratic Programming
§ QP is a well-studied class of optimization problems for maximizing (or minimizing) a quadratic function of real-valued variables subject to linear constraints.
§ Minimize both w⋅w (to maximize M) and misclassification error
-
Quadratic Programming
Find  arg max_u  c + d^T u + (1/2) u^T R u      (quadratic criterion)

Subject to n linear inequality constraints:
  a11 u1 + a12 u2 + … + a1m um ≤ b1
  a21 u1 + a22 u2 + … + a2m um ≤ b2
  …
  an1 u1 + an2 u2 + … + anm um ≤ bn

And subject to e additional linear equality constraints:
  a(n+1)1 u1 + a(n+1)2 u2 + … + a(n+1)m um = b(n+1)
  a(n+2)1 u1 + a(n+2)2 u2 + … + a(n+2)m um = b(n+2)
  …
  a(n+e)1 u1 + a(n+e)2 u2 + … + a(n+e)m um = b(n+e)
-
Learning without Noise
27
What should our quadratic optimization criterion be?
Minimize  (1/2) w·w
What should the constraints be?
  w·xk + b ≥ +1 if yk = +1
  w·xk + b ≤ −1 if yk = −1
-
Solving the Optimization Problem
§ Need to optimize a quadratic function subject to linear constraints.
§ Quadratic optimization problems are a well-known class of
mathematical programming problems for which several (non-trivial) algorithms exist.
§ The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every inequality constraint in the primal (original) problem:
Find w and b such that Φ(w) =wTw is minimized and for all (xi, yi), i=1..n : yi (wTxi + b) ≥ 1
Find α1…αn such that Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and (1) Σαiyi = 0 (2) αi ≥ 0 for all αi
-
Soft Margin Classification
§ What if the training set is not linearly separable?
§ Slack variables εi can be added to allow misclassification of difficult or noisy examples, resulting in a so-called soft margin.
[Plot: separator with two misclassified points and their slack distances εi]
-
Learning Maximum Margin with Noise
Given a guess of w, b we can
§ Compute the sum of distances of points to their correct zones
§ Compute the margin width: M = 2 / √(w·w)
Assume R data points, each (xk, yk) where yk = ±1
What should our quadratic optimization criterion be?
Minimize  (1/2) w·w + C ∑_{k=1}^{R} εk
How many constraints will we have? R
What should they be?
  w·xk + b ≥ +1 − εk if yk = +1
  w·xk + b ≤ −1 + εk if yk = −1
  εk ≥ 0 for all k
εk = distance of error points to their correct place
-
Soft Margin Classification Mathematically
§ The old formulation:
§ Modified formulation incorporates slack variables:
§ Parameter C can be viewed as a way to control overfitting: it “trades off” the relative importance of maximizing the margin and fitting the training data.
Find w and b such that Φ(w) =wTw is minimized and for all (xi ,yi), i=1..n : yi (wTxi + b) ≥ 1
Find w and b such that Φ(w) = wTw + CΣξi is minimized and for all (xi, yi), i=1..n : yi (wTxi + b) ≥ 1 – ξi, ξi ≥ 0
-
CS 194-10, F'11, Lect. 5: Binary Classification, Regularization and Robustness
Hinge loss function
The previous LP can be interpreted as minimizing the hinge loss function
L(w, b) := ∑_{i=1}^{n} max(1 − yi(wᵀxi + b), 0)
This serves as an approximation to the number of errors made on the training set.
Hinge loss
§ The soft margin SVM is equivalent to applying a hinge loss
§ Equivalent unconstrained optimization formulation
min{w,b} L(w,b)+λ||w||2 λ=0.5/C
32
[Plot: hinge loss max(0, 1 − y·(w·x + b)) versus the 0-1 loss, as a function of y·(w·x + b)]
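As an illustrative sketch (not from the slides), the unconstrained form min L(w,b) + λ‖w‖² can be minimized by subgradient descent in pure Python; the toy data, step size, and iteration count are assumptions:

```python
# Linear SVM via subgradient descent on the hinge loss:
#   minimize sum_i max(0, 1 - y_i*(w.x_i + b)) + lam * ||w||^2
def hinge_objective(data, w, b, lam):
    loss = sum(max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
               for x, y in data)
    return loss + lam * sum(wi * wi for wi in w)

def train_svm(data, lam=0.1, step=0.05, iters=2000):
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(iters):
        # gradient of the regularizer, plus subgradients of active hinge terms
        gw, gb = [2.0 * lam * wi for wi in w], 0.0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1.0:
                gw = [gwi - y * xi for gwi, xi in zip(gw, x)]  # -y * x
                gb -= y
        w = [wi - step * gwi for wi, gwi in zip(w, gw)]
        b -= step * gb
    return w, b

# Toy separable data with labels y in {+1, -1}
data = [([2.0, 2.0], 1), ([1.5, 2.5], 1), ([-2.0, -2.0], -1), ([-1.0, -2.5], -1)]
w, b = train_svm(data)
```

Points with margin y·(w·x + b) ≥ 1 contribute no subgradient, which mirrors the fact that only support vectors shape the solution.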
-
Outline
§ Perceptron Algorithm
§ Support Vector Machines
§ Logistic Regression
§ Summary
33
-
Logistic Regression
§ Binary response: Y = {+1, −1}
§ Model: Yi | Xi ~ Bernoulli(pi), where pi is the probability of Yi = 1, with
  pi = 1 / (1 + exp(−WᵀXi))
§ Likelihood:
  ∏_{i=1}^{n} P(Yi | Xi) = ∏_{i=1}^{n} 1 / (1 + exp(−Yi XiᵀW))
34
-
Logistic regression
§ Maximum likelihood estimation (MLE) becomes logistic regression:
  min_W  ∑_{i=1}^{n} −ln p(Yi|Xi) = ∑_{i=1}^{n} ln(1 + exp(−Yi XiᵀW))
§ Convex optimization problem in terms of W
§ MAP estimation gives regularized logistic regression:
  min_W  ∑_{i=1}^{n} ln(1 + exp(−Yi XiᵀW)) + λ‖W‖²
35
-
Outline
§ Perceptron Algorithm
§ Support Vector Machines
§ Logistic Regression
§ Summary
36
-
General formulation of linear classifiers
min{w,b} L(w,b)+λ||w||2
§ The objective: empirical loss + regularization
§ The regularization term is usually L2 norm, but also often L1 norm for sparse models
§ The empirical loss can be hinge loss, logistic loss, smooth hinge loss, … or your own invention
37
-
Different loss functions
38
[Plot: different loss functions versus f(X)·Y — classification (0-1) error, SVM hinge, logistic regression, least squares, exponential, and modified Huber losses]
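As an illustrative sketch (not from the slides), the losses in the plot above written as functions of the margin m = f(X)·Y (the modified Huber loss is omitted for brevity):

```python
import math

# Common classification losses as a function of the margin m = f(X)*Y
def zero_one(m):    return 0.0 if m > 0 else 1.0        # classification error
def hinge(m):       return max(0.0, 1.0 - m)            # SVM
def logistic(m):    return math.log(1.0 + math.exp(-m)) # logistic regression
def squared(m):     return (1.0 - m) ** 2               # least squares
def exponential(m): return math.exp(-m)                 # e.g., boosting

# All the surrogates upper-bound (or closely track) the 0-1 loss near m = 0,
# and all agree that large positive margins are good:
print([round(f(0.0), 3) for f in (zero_one, hinge, logistic, squared, exponential)])
# → [1.0, 1.0, 0.693, 1.0, 1.0]
```

Each surrogate is convex in m, which is what makes the resulting training problems tractable, unlike the non-convex 0-1 loss.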
-
Comments on linear classifiers
§ Choosing logistic regression or SVM?
– Not that big a difference!
– Logistic regression outputs probabilities.
§ Smooth loss functions, e.g. logistic, are
– Easier to optimize (L-BFGS, …)
– Hinge → differentiable hinge, and then you can easily write your own implementation of SVMs
§ Try different loss functions and regularization terms
– Depends on the data, e.g., many outliers? Irrelevant features? Structure in the output?
39
-
Linear classifiers in practice and research
§ Linear classifiers are simple and scalable
– Training complexity is linear or sub-linear w.r.t. data size
– Classification is simple to implement
– State-of-the-art for texts, images, …
§ Their success depends on quality of features
– A useful practice: use a lot of features, learn sparse w
§ Hot topic: large-scale linear classification
– Many data, many features, many classes
– Stochastic optimization
– Parallel implementation
40