4. Linear Classification — bigeye.au.tsinghua.edu.cn/dragonstar2012/docs/dragonstar... (support vector...)
TRANSCRIPT
-
4. Linear Classification
Kai Yu
-
2
Linear Classifiers
• The simplest classification model
• Helps to understand nonlinear models
• Arguably the most useful classification method!
-
Outline
§ Perceptron Algorithm
§ Support Vector Machines
§ Logistic Regression
§ Summary
4
-
5
Basic Neuron
-
6
Perceptron Node – Threshold Logic Unit
[Diagram: inputs x1, x2, …, xn with weights w1, w2, …, wn feeding a threshold unit with threshold b and output z]
z = 1 if ∑_{i=1}^{n} wi·xi ≥ b
z = 0 if ∑_{i=1}^{n} wi·xi < b
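As an illustrative sketch (not from the slides), the threshold logic unit above can be written in a few lines of Python; the example weights and threshold match the ones used on the following slides:

```python
# Threshold logic unit (TLU): output 1 when the weighted sum reaches threshold b.
def tlu(x, w, b):
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net >= b else 0

# Example with w = (.4, -.2), b = .1: net = .8*.4 + .3*(-.2) = .26 >= .1
print(tlu([0.8, 0.3], [0.4, -0.2], 0.1))  # → 1
```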
-
7
Perceptron Learning Algorithm
[Diagram: two-input perceptron with weights w1 = .4, w2 = −.2 and threshold b = .1]
Training set:
x1  x2  t
.8  .3  1
.4  .1  0
z = 1 if ∑_{i=1}^{n} wi·xi ≥ b
z = 0 if ∑_{i=1}^{n} wi·xi < b
-
8
First Training Instance
[Diagram: inputs x1 = .8, x2 = .3 fed through weights .4 and −.2]
net = .8·.4 + .3·(−.2) = .26 ≥ b = .1, so z = 1 (target t = 1: no error, no weight change)
Training set:
x1  x2  t
.8  .3  1
.4  .1  0
-
9
Second Training Instance
[Diagram: inputs x1 = .4, x2 = .1 fed through weights .4 and −.2]
net = .4·.4 + .1·(−.2) = .14 ≥ b = .1, so z = 1, but target t = 0: an error, so apply
Δwi = (t − z) · c · xi
Training set:
x1  x2  t
.8  .3  1
.4  .1  0
-
10
The Perceptron Learning Rule
Δwij = c(tj – zj) xi
§ Least perturbation principle
– Only change weights if there is an error (learning from mistakes)
– The change amounts to xi scaled by a small c
§ Iteratively apply the rule to each example in the training set
§ Each pass through the training set is an epoch
§ Continue training until the total training error is less than epsilon
§ Perceptron Convergence Theorem: Guaranteed to
find a solution in finite time if a solution exists
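A minimal pure-Python sketch of the perceptron learning rule above; the toy data, learning rate c, and epoch limit are illustrative assumptions, with the threshold b folded in as an extra weight on a constant −1 input:

```python
# Perceptron learning: w_i += c * (t - z) * x_i, updating only on errors.
# The threshold b is folded in as a weight on a constant -1 input,
# so z = 1 iff w.x >= 0 on the augmented input.
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train_perceptron(data, c=0.1, epochs=100):
    # data: list of (inputs, target); inputs already augmented with -1
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        errors = 0
        for x, t in data:
            z = predict(w, x)
            if z != t:  # least perturbation: change weights only on a mistake
                w = [wi + c * (t - z) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:  # a full error-free epoch: converged
            break
    return w

# Toy linearly separable set: t = 1 iff x1 + x2 >= 1 (bias input -1 appended)
data = [([0.9, 0.9, -1.0], 1), ([0.1, 0.1, -1.0], 0),
        ([0.8, 0.3, -1.0], 1), ([0.2, 0.1, -1.0], 0)]
w = train_perceptron(data)
```

Since the toy set is linearly separable, the convergence theorem guarantees the loop exits with zero training errors.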
-
Outline
§ Perceptron Algorithm
§ Support Vector Machines
§ Logistic Regression
§ Summary
11
-
Support Vector Machines: Overview
• A powerful method for 2-class classification
• Original idea: Vapnik, 1965, for linear classifiers
• SVM: Cortes and Vapnik, 1995
• Became very popular since the late 90s
• Better generalization (less overfitting)
• Key ideas
– Use a kernel function to transform low-dimensional training samples to a higher dimension (to address linear separability)
– Use quadratic programming (QP) to find the best classifier boundary hyperplane (for global optima)
-
Linear Classifiers
[Plot: 2-D data points, one marker denotes +1, the other denotes −1, with a candidate separating line]
f(x, w, b) = sign(w·x − b)
How would you classify this data?
-
Linear Classifiers
[Plot: the same data with several different separating lines drawn]
f(x, w, b) = sign(w·x − b)
Any of these would be fine..
..but which is best?
-
Classifier Margin
[Plot: a separating line with the margin band drawn around it]
f(x, w, b) = sign(w·x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
-
Maximum Margin
[Plot: the maximum margin separator, with support vectors lying on the margin boundaries]
f(x, w, b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with maximum margin.
This is the simplest kind of SVM (Called an LSVM)
Support Vectors are those datapoints that the margin pushes up against
Linear SVM
-
Computing the margin width
§ Plus-plane = { x : w·x + b = +1 }
§ Minus-plane = { x : w·x + b = −1 }
§ M = margin width. How do we compute M in terms of w and b?
M = 2 / √(w·w)
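As a quick numeric check (the weight values are illustrative), the margin width M = 2/√(w·w) = 2/‖w‖:

```python
import math

def margin_width(w):
    # M = 2 / sqrt(w . w) = 2 / ||w||
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(margin_width([3.0, 4.0]))  # ||w|| = 5, so M = 0.4
```

Note that a larger ‖w‖ means a narrower margin, which is why minimizing w·w maximizes M.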
-
Why Maximum Margin?
[Plot: the same maximum margin separator and support vectors]
f(x, w, b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (Called an LSVM)
Support Vectors are those datapoints that the margin pushes up against
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary this gives us least chance of causing a misclassification.
3. It also helps generalization
4. There’s some theory that this is a good thing.
5. Empirically it works very very well.
-
Another way to understand max margin
§ For f(x) = w·x + b, one way to characterize the smoothness of f(x) is the gradient: ∂f(x)/∂x = w
§ Therefore, the margin (M = 2/‖w‖) measures the smoothness of f(x): a larger margin means a smaller ‖w‖ and a smoother f
§ As a rule of thumb, machine learning prefers smooth functions: similar x's should have similar y's
8/7/12 23
-
Learning the Maximum Margin Classifier
Given a guess of w and b we can
§ Compute whether all data points in the correct half-planes
§ Compute the width of the margin
So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the data points. How?
M = margin width = distance between x− (on the minus-plane) and x+ (on the plus-plane) = 2 / √(w·w)
-
Learning via Quadratic Programming
§ QP is a well-studied class of optimization problems for maximizing (or minimizing) a quadratic function of real-valued variables subject to linear constraints.
§ Minimize both w⋅w (to maximize M) and misclassification error
-
Quadratic Programming
Find  arg max_u  c + d^T u + (1/2) u^T R u      (quadratic criterion)

Subject to n linear inequality constraints:
  a11 u1 + a12 u2 + … + a1m um ≤ b1
  a21 u1 + a22 u2 + … + a2m um ≤ b2
  …
  an1 u1 + an2 u2 + … + anm um ≤ bn

And subject to e additional linear equality constraints:
  a(n+1)1 u1 + a(n+1)2 u2 + … + a(n+1)m um = b(n+1)
  a(n+2)1 u1 + a(n+2)2 u2 + … + a(n+2)m um = b(n+2)
  …
  a(n+e)1 u1 + a(n+e)2 u2 + … + a(n+e)m um = b(n+e)
-
Learning without Noise
27
What should our quadratic optimization criterion be?
Minimize  (1/2) w·w
What should the constraints be?
  w·xk + b ≥ +1 if yk = +1
  w·xk + b ≤ −1 if yk = −1
-
Solving the Optimization Problem
§ Need to optimize a quadratic function subject to linear constraints.
§ Quadratic optimization problems are a well-known class of
mathematical programming problems for which several (non-trivial) algorithms exist.
§ The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every inequality constraint in the primal (original) problem:
Find w and b such that Φ(w) =wTw is minimized and for all (xi, yi), i=1..n : yi (wTxi + b) ≥ 1
Find α1…αn such that Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and (1) Σαiyi = 0 (2) αi ≥ 0 for all αi
-
Soft Margin Classification
§ What if the training set is not linearly separable?
§ Slack variables εi can be added to allow misclassification of difficult or noisy examples, resulting in a so-called soft margin.
[Plot: separator with two misclassified points and their slack distances εi]
-
Learning Maximum Margin with Noise
Given a guess of w, b we can
§ Compute the sum of distances of points to their correct zones
§ Compute the margin width: M = 2 / √(w·w)
Assume R data points, each (xk, yk) where yk = ±1
What should our quadratic optimization criterion be?
Minimize  (1/2) w·w + C ∑_{k=1}^{R} εk
How many constraints will we have? R
What should they be?
  w·xk + b ≥ +1 − εk if yk = +1
  w·xk + b ≤ −1 + εk if yk = −1
  εk ≥ 0 for all k
εk = distance of error points to their correct place
-
Soft Margin Classification Mathematically
§ The old formulation:
§ Modified formulation incorporates slack variables:
§ Parameter C can be viewed as a way to control overfitting: it “trades off” the relative importance of maximizing the margin and fitting the training data.
Find w and b such that Φ(w) =wTw is minimized and for all (xi ,yi), i=1..n : yi (wTxi + b) ≥ 1
Find w and b such that Φ(w) = wTw + CΣξi is minimized and for all (xi, yi), i=1..n : yi (wTxi + b) ≥ 1 – ξi, ξi ≥ 0
-
CS 194-10, F'11, Lect. 5: Binary Classification, Regularization and Robustness
Hinge loss function
The previous LP can be interpreted as minimizing the hinge loss function
L(w, b) := ∑_{i=1}^{n} max(1 − yi(wᵀxi + b), 0)
This serves as an approximation to the number of errors made on the training set.
Hinge loss
§ The soft margin SVM is equivalent to applying a hinge loss
§ Equivalent unconstrained optimization formulation
min{w,b} L(w,b)+λ||w||2 λ=0.5/C
32
[Plot: hinge loss max(0, 1 − y·(w·x + b)) versus the 0-1 loss, as a function of y·(w·x + b)]
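As an illustrative sketch (not from the slides), the unconstrained form min L(w,b) + λ‖w‖² can be minimized by subgradient descent in pure Python; the toy data, step size, and iteration count are assumptions:

```python
# Linear SVM via subgradient descent on the hinge loss:
#   minimize sum_i max(0, 1 - y_i*(w.x_i + b)) + lam * ||w||^2
def hinge_objective(data, w, b, lam):
    loss = sum(max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
               for x, y in data)
    return loss + lam * sum(wi * wi for wi in w)

def train_svm(data, lam=0.1, step=0.05, iters=2000):
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(iters):
        # gradient of the regularizer, plus subgradients of active hinge terms
        gw, gb = [2.0 * lam * wi for wi in w], 0.0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1.0:
                gw = [gwi - y * xi for gwi, xi in zip(gw, x)]  # -y * x
                gb -= y
        w = [wi - step * gwi for wi, gwi in zip(w, gw)]
        b -= step * gb
    return w, b

# Toy separable data with labels y in {+1, -1}
data = [([2.0, 2.0], 1), ([1.5, 2.5], 1), ([-2.0, -2.0], -1), ([-1.0, -2.5], -1)]
w, b = train_svm(data)
```

Points with margin y·(w·x + b) ≥ 1 contribute no subgradient, which mirrors the fact that only support vectors shape the solution.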
-
Outline
§ Perceptron Algorithm
§ Support Vector Machines
§ Logistic Regression
§ Summary
33
-
Logistic Regression
§ Binary response: Y = {+1, −1}
§ Model: Yi | Xi ~ Bernoulli(pi), where pi is the probability of Yi = 1, with
  pi = 1 / (1 + exp(−WᵀXi))
§ Likelihood:
  ∏_{i=1}^{n} P(Yi | Xi) = ∏_{i=1}^{n} 1 / (1 + exp(−Yi XiᵀW))
34
-
Logistic regression
§ Maximum likelihood estimation (MLE) becomes logistic regression:
  min_W  ∑_{i=1}^{n} −ln p(Yi|Xi) = ∑_{i=1}^{n} ln(1 + exp(−Yi XiᵀW))
§ Convex optimization problem in terms of W
§ MAP estimation gives regularized logistic regression:
  min_W  ∑_{i=1}^{n} ln(1 + exp(−Yi XiᵀW)) + λ‖W‖²
35
-
Outline
§ Perceptron Algorithm
§ Support Vector Machines
§ Logistic Regression
§ Summary
36
-
General formulation of linear classifiers
min{w,b} L(w,b)+λ||w||2
§ The objective: empirical loss + regularization
§ The regularization term is usually L2 norm, but also often L1 norm for sparse models
§ The empirical loss can be hinge loss, logistic loss, smooth hinge loss, … or your own invention
37
-
Different loss functions
38
[Plot: different loss functions versus f(X)·Y — classification (0-1) error, SVM hinge, logistic regression, least squares, exponential, and modified Huber losses]
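As an illustrative sketch (not from the slides), the losses in the plot above written as functions of the margin m = f(X)·Y (the modified Huber loss is omitted for brevity):

```python
import math

# Common classification losses as a function of the margin m = f(X)*Y
def zero_one(m):    return 0.0 if m > 0 else 1.0        # classification error
def hinge(m):       return max(0.0, 1.0 - m)            # SVM
def logistic(m):    return math.log(1.0 + math.exp(-m)) # logistic regression
def squared(m):     return (1.0 - m) ** 2               # least squares
def exponential(m): return math.exp(-m)                 # e.g., boosting

# All the surrogates upper-bound (or closely track) the 0-1 loss near m = 0,
# and all agree that large positive margins are good:
print([round(f(0.0), 3) for f in (zero_one, hinge, logistic, squared, exponential)])
# → [1.0, 1.0, 0.693, 1.0, 1.0]
```

Each surrogate is convex in m, which is what makes the resulting training problems tractable, unlike the non-convex 0-1 loss.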
-
Comments on linear classifiers
§ Choosing logistic regression or SVM?
– Not that big a difference!
– Logistic regression outputs probabilities.
§ Smooth loss functions, e.g. logistic, are
– Easier to optimize (L-BFGS, …)
– Hinge → differentiable hinge, and then you can easily write your own implementation of SVMs
§ Try different loss functions and regularization terms
– Depends on the data, e.g., many outliers? Irrelevant features? Structure in the output?
39
-
Linear classifiers in practice and research
§ Linear classifiers are simple and scalable
– Training complexity is linear or sub-linear w.r.t. data size
– Classification is simple to implement
– State-of-the-art for texts, images, …
§ Their success depends on quality of features
– A useful practice: use a lot of features, learn sparse w
§ Hot topic: large-scale linear classification
– Many data, many features, many classes
– Stochastic optimization
– Parallel implementation
40