TRANSCRIPT
Machine Learning
10-701, Fall 2016
Support Vector Machines
Eric Xing
Lecture 6, September 26, 2016
Reading: Chap. 6 & 7, C.B. book, and listed papers.
© Eric Xing @ CMU, 2006-2016
What is a good Decision Boundary?
Consider a binary classification task with y = ±1 labels (not 0/1 as before).
When the training examples are linearly separable, we can set the parameters of a linear classifier so that all the training examples are classified correctly.
Many decision boundaries achieve this: generative classifiers, logistic regression, …
Are all decision boundaries equally good?
[Figure: two linearly separable point clouds, Class 1 and Class 2, with several candidate separating lines.]
What is a good Decision Boundary?
Not All Decision Boundaries Are Equal!
Why might we get such boundaries? Irregular distributions, imbalanced training set sizes, outliers.
Classification and Margin
Parameterizing the decision boundary:
Let w denote a vector orthogonal to the decision boundary, and b denote a scalar "offset" term; then we can write the decision boundary as:
$$w^T x + b = 0$$
[Figure: Class 1 and Class 2 points on either side of the boundary, with distances d^- and d^+ to the closest points.]
Classification and Margin
Parameterizing the decision boundary:
Let w denote a vector orthogonal to the decision boundary, and b denote a scalar "offset" term; then we can write the decision boundary as:
$$w^T x + b = 0$$
Margin:
$$\frac{w^T x_i + b}{\|w\|} \ge +\frac{c}{\|w\|} \ \text{ for all } x_i \text{ in class 2}, \qquad \frac{w^T x_i + b}{\|w\|} \le -\frac{c}{\|w\|} \ \text{ for all } x_i \text{ in class 1}$$
Or, more compactly:
$$\frac{y_i(w^T x_i + b)}{\|w\|} \ge \frac{c}{\|w\|}$$
The margin between any two points: $m = d^- + d^+$.
[Figure: Class 1 and Class 2 separated by the boundary, with margin $m = d^- + d^+$ between the closest points.]
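This parameterization is easy to check numerically. A minimal sketch (the weight vector, offset, and points below are invented for illustration, not from the slides): the signed distance of $x_i$ to the hyperplane is $(w^T x_i + b)/\|w\|$, and multiplying by $y_i$ makes it positive exactly when $x_i$ is on its class's side.

```python
import numpy as np

# Hypothetical boundary: w is orthogonal to it, b is the offset.
w = np.array([1.0, 1.0])
b = -2.0

# Made-up points with +/-1 labels (positive side: w^T x + b > 0).
X = np.array([[3.0, 3.0], [2.5, 2.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1, +1, -1, -1])

# Signed distance of each point to the hyperplane w^T x + b = 0.
dist = (X @ w + b) / np.linalg.norm(w)

# y_i * distance is positive iff point i is classified correctly.
margins = y * dist
print(margins.round(4))        # [2.8284 1.7678 1.4142 0.7071]
print(margins.min().round(4))  # 0.7071 -- the margin of this classifier
```

The smallest value of $y_i$ times the distance over the training set is the quantity the following slides maximize.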
Maximum Margin Classification
The minimum permissible margin is:
$$m = d^- + d^+ = \frac{w^T (x_i^* - x_j^*)}{\|w\|} = \frac{2c}{\|w\|}$$
where $x_i^*$ and $x_j^*$ are the closest points on either side.
Here is our Maximum Margin Classification problem:
$$\max_{w,b}\ \frac{2c}{\|w\|} \quad \text{s.t.} \quad \frac{y_i(w^T x_i + b)}{\|w\|} \ge \frac{c}{\|w\|},\ \forall i$$
Maximum Margin Classification, cont'd.
The optimization problem:
$$\max_{w,b}\ \frac{2c}{\|w\|} \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge c,\ \forall i$$
But note that the magnitude of c merely scales w and b, and does not change the classification boundary at all! (Why?)
So we instead work on this cleaner problem:
$$\max_{w,b}\ \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1,\ \forall i$$
The solution to this leads to the famous Support Vector Machines --- believed by many to be the best "off-the-shelf" supervised learning algorithm.
Support vector machine
A convex quadratic programming problem with linear constraints:
$$\max_{w,b}\ \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1,\ \forall i$$
The attained margin is now given by $2/\|w\|$ (so $d^+ = d^- = 1/\|w\|$).
Only a few of the classification constraints are relevant: the support vectors.
Constrained optimization: we can directly solve this using commercial quadratic programming (QP) code, but we want to make a more careful investigation of Lagrange duality and the solution of the above in its dual form, for deeper insight (support vectors, kernels, …) and more efficient algorithms.
[Figure: separating hyperplane with distances d^+ and d^- to the closest points.]
Digression to Lagrangian Duality: The Primal Problem
Primal:
$$\min_w\ f(w) \quad \text{s.t.} \quad g_i(w) \le 0,\ i=1,\dots,k; \qquad h_i(w) = 0,\ i=1,\dots,l$$
The generalized Lagrangian:
$$\mathcal{L}(w,\alpha,\beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$$
the α's (≥ 0) and β's are called the Lagrange multipliers.
Lemma:
$$\max_{\alpha,\beta:\ \alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{o/w} \end{cases}$$
A re-written Primal:
$$\min_w\ \max_{\alpha,\beta:\ \alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta)$$
Lagrangian Duality, cont.
Recall the Primal Problem:
$$p^* = \min_w\ \max_{\alpha,\beta:\ \alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta)$$
The Dual Problem:
$$d^* = \max_{\alpha,\beta:\ \alpha_i \ge 0}\ \min_w\ \mathcal{L}(w,\alpha,\beta)$$
Theorem (weak duality):
$$d^* = \max_{\alpha,\beta:\ \alpha_i \ge 0}\ \min_w\ \mathcal{L}(w,\alpha,\beta) \ \le\ \min_w\ \max_{\alpha,\beta:\ \alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) = p^*$$
Theorem (strong duality): iff there exists a saddle point of $\mathcal{L}(w,\alpha,\beta)$, we have
$$d^* = p^*$$
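A toy instance (an invented example, not from the slides) where strong duality can be checked by hand: minimize $f(w) = w^2$ subject to $g(w) = 1 - w \le 0$. The Lagrangian is $\mathcal{L}(w,\alpha) = w^2 + \alpha(1-w)$; minimizing over $w$ gives $w = \alpha/2$, so the dual function is $\alpha - \alpha^2/4$, maximized at $\alpha = 2$ with $d^* = 1$, matching $p^* = 1$ at $w^* = 1$.

```python
import numpy as np

# Toy primal: min w^2  s.t.  1 - w <= 0  (optimum: w* = 1, p* = 1).
# Dual function: d(alpha) = min_w [w^2 + alpha*(1 - w)] = alpha - alpha^2/4,
# since the inner minimum is attained at w = alpha/2.
dual = lambda a: a - a ** 2 / 4.0

alphas = np.linspace(0.0, 5.0, 5001)  # grid over alpha >= 0
d_star = dual(alphas).max()
p_star = 1.0

print(d_star)  # ~1.0: d* = p*, strong duality (saddle point exists)
```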
A sketch of strong and weak duality
Now, ignoring h(x) for simplicity, let's look at what's happening graphically in the duality theorems:
$$d^* = \max_{\alpha \ge 0}\ \min_w\ \big[ f(w) + \alpha^T g(w) \big] \ \le\ \min_w\ \max_{\alpha \ge 0}\ \big[ f(w) + \alpha^T g(w) \big] = p^*$$
[Figure: the feasible set plotted in the (g(w), f(w)) plane.]
The KKT conditions
If there exists some saddle point of L, then the saddle point satisfies the following "Karush-Kuhn-Tucker" (KKT) conditions:
$$\frac{\partial}{\partial w_i} \mathcal{L}(w,\alpha,\beta) = 0,\quad i=1,\dots,k$$
$$\frac{\partial}{\partial \beta_i} \mathcal{L}(w,\alpha,\beta) = 0,\quad i=1,\dots,l$$
$$\alpha_i g_i(w) = 0,\quad i=1,\dots,m \qquad \text{(complementary slackness)}$$
$$g_i(w) \le 0,\quad i=1,\dots,m \qquad \text{(primal feasibility)}$$
$$\alpha_i \ge 0,\quad i=1,\dots,m \qquad \text{(dual feasibility)}$$
Theorem: if w*, α* and β* satisfy the KKT conditions, then they are also a solution to the primal and the dual problems.
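To make the conditions concrete, here is a check on a hand-made example (not from the slides): min $w^2$ s.t. $g(w) = 1 - w \le 0$, whose saddle point is $w^* = 1$, $\alpha^* = 2$. All the conditions hold there.

```python
# Toy problem: min f(w) = w^2  s.t.  g(w) = 1 - w <= 0.
# Lagrangian: L(w, alpha) = w^2 + alpha * (1 - w).
# Claimed saddle point: w* = 1, alpha* = 2.
w_star, a_star = 1.0, 2.0

dL_dw = 2 * w_star - a_star   # stationarity: dL/dw = 0
g_val = 1.0 - w_star          # primal feasibility: g(w*) <= 0
slack = a_star * g_val        # complementary slackness: alpha* g(w*) = 0

print(dL_dw, g_val, slack, a_star >= 0)
# 0.0 0.0 0.0 True -> all KKT conditions are satisfied
```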
Solving the optimal margin classifier
Recall our opt problem:
$$\max_{w,b}\ \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1,\ \forall i$$
This is equivalent to
$$\min_{w,b}\ \frac{1}{2} w^T w \quad \text{s.t.} \quad 1 - y_i(w^T x_i + b) \le 0,\ \forall i \qquad (*)$$
Write the Lagrangian:
$$\mathcal{L}(w,b,\alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{m} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$$
Recall that (*) can be reformulated as
$$\min_{w,b}\ \max_{\alpha_i \ge 0}\ \mathcal{L}(w,b,\alpha)$$
Now we solve its dual problem:
$$\max_{\alpha_i \ge 0}\ \min_{w,b}\ \mathcal{L}(w,b,\alpha)$$
The Dual Problem
$$\max_{\alpha_i \ge 0}\ \min_{w,b}\ \mathcal{L}(w,b,\alpha)$$
We minimize L with respect to w and b first:
$$\nabla_w \mathcal{L}(w,b,\alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{m} \alpha_i y_i x_i \qquad (***)$$
$$\nabla_b \mathcal{L}(w,b,\alpha) = -\sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (**)$$
Recall the Lagrangian:
$$\mathcal{L}(w,b,\alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{m} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$$
Plug (***) back into L, and using (**), we have:
$$\mathcal{L}(w,b,\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
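The cancellation in this last step can be spelled out (a restatement of the slide's substitution, using the same symbols); write $S \equiv \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$ and substitute $w = \sum_i \alpha_i y_i x_i$:

```latex
\begin{aligned}
\tfrac{1}{2} w^T w &= \tfrac{1}{2} S, \\
\sum_{i=1}^{m} \alpha_i y_i\, w^T x_i &= S, \qquad
b \sum_{i=1}^{m} \alpha_i y_i = 0 \ \text{ by } (**), \\
\mathcal{L} &= \tfrac{1}{2} S - S - 0 + \sum_{i=1}^{m} \alpha_i
            \;=\; \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2} S .
\end{aligned}
```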
The Dual problem, cont.
Now we have the following dual opt problem:
$$\max_\alpha\ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{s.t.} \quad \alpha_i \ge 0,\ i=1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0.$$
This is, (again,) a quadratic programming problem. A global maximum over the α_i can always be found.
But what's the big deal?? Note two things:
1. w can be recovered by $w = \sum_{i=1}^{m} \alpha_i y_i x_i$ --- see next.
2. The "kernel" $x_i^T x_j$ --- more later.
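For two points the dual can be maximized by hand, which makes a good sanity check. A sketch with made-up points ($x_1 = (2,2)$, $y_1 = +1$; $x_2 = (0,0)$, $y_2 = -1$): the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = \alpha$, so $J(\alpha) = 2\alpha - \tfrac12 \alpha^2 \|x_1 - x_2\|^2$, maximized at $\alpha^* = 2/\|x_1 - x_2\|^2$, and the recovered $w$ gives margin $2/\|w\| = \|x_1 - x_2\|$.

```python
import numpy as np

# Two made-up training points, one per class.
x1, y1 = np.array([2.0, 2.0]), +1
x2, y2 = np.array([0.0, 0.0]), -1

# With alpha_1 = alpha_2 = alpha, J(alpha) = 2a - 0.5*a^2*||x1-x2||^2,
# maximized in closed form at:
alpha = 2.0 / np.linalg.norm(x1 - x2) ** 2

# Recover w = sum_i alpha_i y_i x_i, then b from y1*(w.x1 + b) = 1.
w = alpha * (y1 * x1 + y2 * x2)
b = 1.0 / y1 - w @ x1

margin = 2.0 / np.linalg.norm(w)
print(alpha, w, b, margin)
# alpha = 0.25, w = [0.5 0.5], b = -1.0, margin = 2*sqrt(2) = ||x1 - x2||
```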
Support vectors
Note the KKT condition --- only a few α_i's can be nonzero!!
$$\alpha_i g_i(w) = 0,\quad i=1,\dots,m$$
Call the training data points whose α_i's are nonzero the support vectors (SV).
[Figure: Class 1 and Class 2 with the separating boundary; α_1 = 0.8, α_6 = 1.4, α_8 = 0.6 for the points on the margin, and α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0 for the rest.]
Support vector machines
Once we have the Lagrange multipliers {α_i}, we can reconstruct the parameter vector w as a weighted combination of the training examples:
$$w = \sum_{i \in SV} \alpha_i y_i x_i$$
For testing with a new data point z, compute
$$w^T z + b = \sum_{i \in SV} \alpha_i y_i (x_i^T z) + b$$
and classify z as class 1 if the sum is positive, and class 2 otherwise.
Note: w need not be formed explicitly.
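As a sketch of this test-time computation (the support vectors, multipliers, and query points below are invented for illustration): the decision value needs only inner products between z and the support vectors, so w is never formed explicitly.

```python
import numpy as np

# Hypothetical support vectors with their labels and multipliers.
sv_x = np.array([[2.0, 2.0], [0.0, 0.0]])  # support vectors x_i
sv_y = np.array([+1.0, -1.0])              # labels y_i
sv_a = np.array([0.25, 0.25])              # multipliers alpha_i
b = -1.0

def decision(z):
    # w^T z + b = sum_{i in SV} alpha_i y_i (x_i^T z) + b -- no explicit w.
    return np.sum(sv_a * sv_y * (sv_x @ z)) + b

print(decision(np.array([3.0, 3.0])))  # 2.0  (positive side)
print(decision(np.array([0.0, 1.0])))  # -0.5 (negative side)
```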
Interpretation of support vector machines
The optimal w is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression, as in the construction of the kNN classifier.
To compute the weights {α_i}, and to use support vector machines, we need to specify only the inner products (or kernel) between the examples: $x_i^T x_j$
We make decisions by comparing each new example z with only the support vectors:
$$y^* = \mathrm{sign}\left( \sum_{i \in SV} \alpha_i^* y_i (x_i^T z) + b \right)$$
Non-linearly Separable Problems
We allow "error" ξ_i in classification; it is based on the output of the discriminant function w^T x + b.
ξ_i approximates the number of misclassified samples.
[Figure: Class 1 and Class 2 with a few points inside the margin or on the wrong side, each with slack ξ_i.]
Soft Margin Hyperplane
Now we have a slightly different opt problem:
$$\min_{w,b}\ \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i,\ \forall i; \qquad \xi_i \ge 0,\ \forall i$$
ξ_i are "slack variables" in optimization.
Note that ξ_i = 0 if there is no error for x_i.
Σ_i ξ_i is an upper bound on the number of errors.
C: tradeoff parameter between error and margin.
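At the optimum, $\xi_i = \max(0, 1 - y_i(w^T x_i + b))$, so this problem is equivalent to the unconstrained hinge-loss form $\min_{w,b}\ \tfrac12\|w\|^2 + C\sum_i \max(0, 1 - y_i(w^T x_i + b))$. A minimal subgradient-descent sketch of that form on an invented toy dataset (not the QP route these slides take):

```python
import numpy as np

# Tiny made-up 2-class dataset (linearly separable).
X = np.array([[2.0, 2.0], [1.5, 2.5], [3.0, 1.0],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0

for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1  # points with nonzero hinge loss
    # Subgradient of 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean(np.sign(X @ w + b) == y)
print(acc)  # training accuracy; reaches 1.0 on this separable toy set
```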
The Optimization Problem
The dual of this new constrained optimization problem is:
$$\max_\alpha\ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C,\ i=1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0.$$
This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on α_i.
Once again, a QP solver can be used to find the α_i.
The SMO algorithm
Consider solving the unconstrained opt problem:
We've already seen three opt algorithms! ? ? ?
Coordinate ascent:
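Coordinate ascent maximizes along one coordinate at a time, holding the others fixed. A minimal sketch on an invented concave objective (the function and its closed-form per-coordinate maximizers below are made up for illustration):

```python
# Invented concave objective: f(a1, a2) = -(a1-1)^2 - (a2-2)^2 - 0.5*a1*a2.
# Each coordinate update maximizes f exactly in that coordinate:
#   df/da1 = 0  =>  a1 = 1 - 0.25 * a2
#   df/da2 = 0  =>  a2 = 2 - 0.25 * a1
a1 = a2 = 0.0
for _ in range(50):
    a1 = 1.0 - 0.25 * a2  # maximize over a1, holding a2 fixed
    a2 = 2.0 - 0.25 * a1  # maximize over a2, holding a1 fixed

print(round(a1, 4), round(a2, 4))  # 0.5333 1.8667, the global maximizer
```

The alternating updates converge here because each one is an exact maximization along its coordinate; SMO applies the same idea to pairs of dual variables.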
Coordinate ascent
Sequential minimal optimization
Constrained optimization:
$$\max_\alpha\ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C,\ i=1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0.$$
Question: can we do coordinate ascent along one direction at a time (i.e., hold all α_k (k ≠ i) fixed, and update α_i)?
The SMO algorithm
Repeat till convergence
1. Select some pair α_i and α_j to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
2. Re-optimize J(α) with respect to α_i and α_j, while holding all the other α_k's (k ≠ i, j) fixed.
Will this procedure converge?
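A compact sketch of the procedure (a simplified variant of Platt's SMO: a deterministic pair sweep instead of the selection heuristic; the dataset and C are invented). Each step solves the QP restricted to (α_i, α_j) in closed form, clips to the box [0, C], and preserves Σ α_k y_k = 0:

```python
import numpy as np

# Tiny invented separable dataset.
X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
m, C = len(y), 10.0

K = X @ X.T        # linear kernel matrix
alpha = np.zeros(m)
b = 0.0

def f(i):
    # Current decision value for training point i.
    return np.sum(alpha * y * K[:, i]) + b

for _ in range(100):          # fixed number of sweeps for simplicity
    for i in range(m):
        j = (i + 1) % m       # naive pair selection, not Platt's heuristic
        Ei, Ej = f(i) - y[i], f(j) - y[j]
        # Box bounds keeping alpha_i*y_i + alpha_j*y_j constant.
        if y[i] != y[j]:
            L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
        else:
            L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
        eta = 2 * K[i, j] - K[i, i] - K[j, j]
        if L == H or eta >= 0:
            continue
        ai_old, aj_old = alpha[i], alpha[j]
        alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
        alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
        # Update the threshold b (Platt's rule).
        b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] \
                    - y[j] * (alpha[j] - aj_old) * K[i, j]
        b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] \
                    - y[j] * (alpha[j] - aj_old) * K[j, j]
        b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)

pred = np.sign((alpha * y) @ K + b)
acc = np.mean(pred == y)
print(acc)  # training accuracy on the toy set
```

Each pair update can only increase J(α), and J is bounded above, which is the intuition behind the convergence question on this slide.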
Convergence of SMO
Let's hold α_3, …, α_m fixed and reopt J w.r.t. α_1 and α_2:
$$\max_\alpha\ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C,\ i=1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0.$$
KKT:
Convergence of SMO
The constraints:
The objective:
Constrained opt:
Cross-validation error of SVM
The leave-one-out cross-validation error does not depend on the dimensionality of the feature space but only on the # of support vectors!
$$\text{Leave-one-out CV error} \le \frac{\#\,\text{support vectors}}{\#\,\text{training examples}}$$
Summary Max-margin decision boundary
Constrained convex optimization
Duality
The KKT conditions and the support vectors
Non-separable case and slack variables
The SMO algorithm