Pattern Recognition Advanced
The Support Vector Machine
Introduction to Kernel Methods
PRA - 2013 1
Supervised Learning
Many observations of the same data type, with labels
• We consider a database {x1, · · · , xN}.
• Each datapoint xj is represented as a vector of d features, xj = (x1,j, x2,j, . . . , xd,j)T.
• To each observation is associated a label yj:
◦ If yj ∈ R, we have a regression problem.
◦ If yj ∈ S where S is a finite set, multiclass classification.
◦ If S only has two elements, binary classification.
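As a concrete sketch in Python (the arrays and values below are purely illustrative, not part of the lecture), a binary-classification training set of N datapoints with d features each can be stored as:

```python
import numpy as np

# Hypothetical toy training set: N = 4 datapoints, d = 2 features each.
# Row j of X is the feature vector x_j; y[j] is the corresponding label.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [-1.0, -2.0],
              [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])  # binary labels y_j in {0, 1}

N, d = X.shape
```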
Supervised Learning: Binary Classification
Examples of Binary Classes
• Using elementary measurements, guess if someone has or not a disease that is
◦ difficult to detect at an early stage,
◦ difficult to measure directly (fetus).
• Classify chemical compounds into toxic / nontoxic
• Classify a scanned piece of luggage as suspicious/not suspicious
• Classify a body tumor as benign/malignant to detect cancer
• Detect whether an image’s primary content is people or any other object
• etc.
Data
• Data: instances x1, x2, x3, · · · , xN .
• To infer a “yes/no” rule, we need the corresponding answer for each vector.
• We thus consider a set of pairs of (vector, bit):

“training set” = {(xj, yj)}j=1..N, where xj = (x1,j, x2,j, . . . , xd,j)T ∈ Rd and yj ∈ {0, 1}.
• For illustration purposes only we will consider vectors in the plane, d = 2.
• The ideas for d ≫ 3 are conceptually the same.
Binary Classification Separation Surfaces for Vectors
What is a classification rule?
Classification rule = a partition of Rd into two sets
This partition is encoded as the level set of a function on Rd
Namely, {x ∈ Rd | f(x) > 0} and {x ∈ Rd | f(x) ≤ 0}
Classification Separation Surfaces for Vectors
What kind of function? Any smooth function works; for instance, one whose zero level set is a curved line.
Even simpler: affine functions, which define halfspaces.
Linear Classifiers
• Lines (hyperplanes when d > 2) provide the simplest type of classifiers.
• A hyperplane Hc,b is a set in Rd defined by
◦ a normal vector c ∈ Rd,
◦ a constant b ∈ R,
as
Hc,b = {x ∈ Rd | cTx = b}.
• Letting b vary we can “slide” the hyperplane across Rd
• Exactly like lines in the plane, hyperplanes divide Rd into two halfspaces,
{x ∈ Rd | cTx < b} ∪ {x ∈ Rd | cTx ≥ b} = Rd
• Linear classifiers answer “yes” or “no” given x and pre-defined c and b.
How can we choose a “good” (c⋆, b⋆) given the training set?
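Before answering, the decision rule itself can be sketched in a few lines of Python (the particular (c, b) below is an arbitrary illustrative choice, not an answer to the question):

```python
import numpy as np

def linear_classify(x, c, b):
    """Answer "yes" (True) if x lies on the positive side of H_{c,b},
    i.e. if c^T x > b, and "no" (False) otherwise."""
    return float(np.dot(c, x)) > b

# Hypothetical choice of normal vector c and offset b.
c = np.array([1.0, 1.0])
b = 0.0
print(linear_classify(np.array([2.0, 3.0]), c, b))    # True: positive side
print(linear_classify(np.array([-1.0, -2.0]), c, b))  # False: negative side
```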
• This specific question,
“training set” {(xi ∈ Rd, yi ∈ {0, 1})}i=1..N =⇒ “best” (c⋆, b⋆)
has different answers. A (non-exhaustive!) selection of techniques:
• Linear Discriminant Analysis (or Fisher’s Linear Discriminant);
• Logistic regression maximum likelihood estimation;
• Perceptron, a one-layer neural network;
• Support Vector Machine, the result of a convex program.
Which one should I use?
There are too many criteria to give a single answer:
• Computational speed.
• Interpretability of the results.
• Batch or online training.
• Theoretical guarantees and learning rates.
• Applicability of the assumptions underpinning the technique to your dataset.
• Source code availability.
• Parallelizability.
• Size (in bits) of the solution.
• Reproducibility of results and numerical stability.
• etc.

Choosing the adequate tool for a specific dataset is where your added value as a researcher in pattern recognition lies.
Classification Separation Surfaces for Vectors
Given two sets of points...
It is sometimes possible to separate them perfectly
Each choice might look equally good on the training set, but it will have an obvious impact on new points.
Linear classifier, some degrees of freedom
Especially close to the border of the classifier.
For each different technique, different results, different performance.
Support Vector Machines
The linearly-separable case
A criterion to select a linear classifier: the margin ?
Largest Margin Linear Classifier ?
Support Vectors with Large Margin
In Mathematical Equations
• The training set is a finite set of n data/class pairs:

T = {(x1, y1), . . . , (xn, yn)},

where xi ∈ Rd and yi ∈ {−1, 1}.

• We assume (for the moment) that the data are linearly separable, i.e., that there exists (w, b) ∈ Rd × R such that:

wTxi + b > 0 if yi = 1,
wTxi + b < 0 if yi = −1.
How to find the largest separating hyperplane?
For the linear classifier f(x) = wTx + b, consider the interstice defined by the hyperplanes
• f(x) = wTx + b = +1
• f(x) = wTx + b = −1
The margin is 2/||w||
• Indeed, the points x1 and x2 satisfy:

wTx1 + b = 0,
wTx2 + b = 1.

• By subtracting we get wT(x2 − x1) = 1; since x2 − x1 is parallel to w, this gives ‖x2 − x1‖ = 1/‖w‖, and therefore:

γ = 2‖x2 − x1‖ = 2/‖w‖,

where γ is the margin.

Large margin γ ⇔ Small ‖w‖ ⇔ Small ‖w‖2
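The identity γ = 2/‖w‖ can be checked numerically; the particular w and b below are arbitrary illustrative values, not from the slides:

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical normal vector, ||w|| = 5
b = 2.0

# x1 lies on {w^T x + b = 0}; stepping along w by w/||w||^2 reaches
# {w^T x + b = 1}, so ||x2 - x1|| = 1/||w||.
x1 = -b * w / np.dot(w, w)   # satisfies w^T x1 + b = 0
x2 = x1 + w / np.dot(w, w)   # satisfies w^T x2 + b = 1

margin = 2 * np.linalg.norm(x2 - x1)
print(margin, 2 / np.linalg.norm(w))  # both equal 0.4
```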
All training points should be on the correct side
• For positive examples (yi = 1) this means:
wTxi + b ≥ 1
• For negative examples (yi = −1) this means:
wTxi + b ≤ −1
• In both cases:

∀i = 1, . . . , n, yi(wTxi + b) ≥ 1
Finding the optimal hyperplane
• Finding the optimal hyperplane is equivalent to finding (w, b) which minimize:

‖w‖2

under the constraints:

∀i = 1, . . . , n, yi(wTxi + b) − 1 ≥ 0.

This is a classical quadratic program on Rd+1: linear constraints, quadratic objective.
Lagrangian
• In order to minimize:

(1/2)‖w‖2

under the constraints:

∀i = 1, . . . , n, yi(wTxi + b) − 1 ≥ 0,

• introduce one dual variable αi for each constraint,
• one constraint for each training point.
• the Lagrangian is, for α ⪰ 0 (that is, for each αi ≥ 0),

L(w, b, α) = (1/2)‖w‖2 − ∑i=1..n αi(yi(wTxi + b) − 1).
The Lagrange dual function
g(α) = inf over w ∈ Rd, b ∈ R of { (1/2)‖w‖2 − ∑i=1..n αi(yi(wTxi + b) − 1) }

the saddle point conditions give us that, at the minimum in w and b,

w = ∑i=1..n αiyixi  (differentiating w.r.t. w) (∗)
0 = ∑i=1..n αiyi   (differentiating w.r.t. b) (∗∗)

substituting (∗) in g, and using (∗∗) as a constraint, we get the dual function g(α).

• To solve the dual problem, maximize g w.r.t. α.
Dual optimum
The dual problem is thus

maximize g(α) = ∑i=1..n αi − (1/2) ∑i,j=1..n αiαjyiyj xiTxj
such that α ⪰ 0, ∑i=1..n αiyi = 0.

This is a quadratic program in Rn, with nonnegativity constraints and one linear equality constraint.

α⋆ can be computed using optimization software (e.g. a built-in Matlab function).
All available SVM toolboxes can be interpreted as customized solvers that handle this particular problem efficiently.
Recovering the optimal hyperplane
• With α⋆, we recover (w⋆, b⋆) corresponding to the optimal hyperplane.

• w⋆ is given by w⋆ = ∑i=1..n yi α⋆i xi,

• b⋆ is given by the conditions on the support vectors (αi > 0, yi(wTxi + b) = 1):

b⋆ = −(1/2) ( min over {i : yi = 1, αi > 0} of w⋆Txi + max over {i : yi = −1, αi > 0} of w⋆Txi )

• the decision function is therefore:

f⋆(x) = w⋆Tx + b⋆ = ( ∑i=1..n yi α⋆i xi )T x + b⋆.

• Here the dual solution gives us directly the primal solution.
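The whole pipeline above can be sketched numerically. This is an illustrative sketch, not part of the slides: the 4-point dataset is hypothetical, and the dual is solved with scipy's general-purpose `minimize` (SLSQP) rather than a dedicated SVM solver:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable data, labels y_i in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Dual objective g(a) = sum(a) - 1/2 a^T (yy^T . K) a; we minimize -g.
K = X @ X.T
Q = (y[:, None] * y[None, :]) * K
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               x0=np.zeros(n),
               bounds=[(0.0, None)] * n,                       # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

# Primal recovery: w = sum_i alpha_i y_i x_i; b from the support vectors,
# for which y_i (w^T x_i + b) = 1.
w = (alpha * y) @ X
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)
print(w, b)  # ≈ [0.25 0.25] and ≈ 0; support vectors are (2,2) and (-2,-2)
```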
From optimization theory...
Studying the relationship between primal/dual problems is an important topic in optimization theory.

As a consequence, we can say many things about the optimal α⋆ and the constraints of the original problem.

• Strong duality: primal and dual problems have the same optimum.

• Karush-Kuhn-Tucker conditions give us that, for every i ≤ n,

α⋆i (yi(w⋆Txi + b⋆) − 1) = 0.
• Hence, either

α⋆i = 0 OR yi(w⋆Txi + b⋆) = 1.

• α⋆i ≠ 0 only for points that lie on the tube (support vectors!).
Visualizing Support Vectors
w⋆ = ∑i=1..n α⋆i yi xi

α⋆i ≠ 0 only for points that lie on the tube (support vectors!).
Another interpretation: Convex Hulls
Go back to two sets of points that are linearly separable
Linearly separable = convex hulls do not intersect
Find two closest points, one in each convex hull
The SVM hyperplane = the perpendicular bisector of that segment
support vectors = extreme points of the faces on which the two points lie
The non-linearly separable case
(when convex hulls intersect)
What happens when the data is not linearly separable?
Soft-margin SVM ?
• Find a trade-off between large margin and few errors.
• Mathematically:

minf { 1/margin(f) + C × errors(f) }

• C is a parameter.
Soft-margin SVM formulation ?
• The margin of a labeled point (x, y) is

margin(x, y) = y(wTx + b)

• The error is
◦ 0 if margin(x, y) > 1,
◦ 1 − margin(x, y) otherwise.

• The soft margin SVM solves:

minw,b { ‖w‖2 + C ∑i=1..n max{0, 1 − yi(wTxi + b)} }

• c(u, y) = max{0, 1 − yu} is known as the hinge loss.

• c(wTxi + b, yi) associates a mistake cost to the decision (w, b) on example xi.
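The hinge loss is a one-liner; the values below are illustrative decision values u = wTx + b for a positive example:

```python
def hinge(u, y):
    """Hinge loss c(u, y) = max(0, 1 - y*u) for a decision value
    u = w^T x + b and a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * u)

print(hinge(2.0, 1))   # 0.0 : correct side, margin > 1, no cost
print(hinge(0.5, 1))   # 0.5 : correct side but inside the margin
print(hinge(-1.0, 1))  # 2.0 : wrong side, cost grows linearly
```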
Dual formulation of soft-margin SVM
• The soft margin SVM program

minw,b { ‖w‖2 + C ∑i=1..n max{0, 1 − yi(wTxi + b)} }

can be rewritten as

minimize ‖w‖2 + C ∑i=1..n ξi
such that yi(wTxi + b) ≥ 1 − ξi, ξi ≥ 0.

• In that case the dual function is

g(α) = ∑i=1..n αi − (1/2) ∑i,j=1..n αiαjyiyj xiTxj,

which is finite under the constraints:

0 ≤ αi ≤ C, for i = 1, . . . , n,
∑i=1..n αiyi = 0.
Interpretation: bounded and unbounded support vectors
Points with α = 0, unbounded support vectors with 0 < α < C (on the margin), and bounded support vectors with α = C.
What about the convex hull analogy?
• Remember the separable case
• Here we consider the case where the two sets are not linearly separable, i.e. their convex hulls intersect.
Definition 1. Given a set of n points A, and 0 ≤ C ≤ 1, the set of finite combinations

∑i=1..n λi xi,  0 ≤ λi ≤ C,  ∑i=1..n λi = 1,

is the (C-)reduced convex hull of A.

• Using C = 1/2, we obtain the reduced convex hulls of A and B.

• Soft-margin SVM with parameter C = bisector of the segment joining the closest two points of the C-reduced convex hulls.
Kernels
Kernel trick for SVM’s
• use a mapping φ from X to a feature space,
• which corresponds to the kernel k:
∀x, x′ ∈ X , k(x, x′) = 〈φ(x), φ(x′) 〉
• Example: if φ(x) = φ((x1, x2)T) = (x1², x2²)T, then

k(x, x′) = 〈φ(x), φ(x′)〉 = (x1)²(x′1)² + (x2)²(x′2)².
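The identity k(x, x′) = 〈φ(x), φ(x′)〉 for this example map can be checked numerically (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # The feature map of the example: phi((x1, x2)) = (x1^2, x2^2).
    return np.array([x[0] ** 2, x[1] ** 2])

def k(x, xp):
    # The corresponding kernel, evaluated without mapping explicitly.
    return (x[0] ** 2) * (xp[0] ** 2) + (x[1] ** 2) * (xp[1] ** 2)

x, xp = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(xp)), k(x, xp))  # both 10.0
```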
Training a SVM in the feature space
Replace each xTx′ in the SVM algorithm by 〈φ(x), φ(x′) 〉 = k(x, x′)
• Reminder: the dual problem is to maximize

g(α) = ∑i=1..n αi − (1/2) ∑i,j=1..n αiαjyiyj k(xi, xj),

under the constraints:

0 ≤ αi ≤ C, for i = 1, . . . , n,
∑i=1..n αiyi = 0.

• The decision function becomes:

f(x) = 〈w, φ(x)〉 + b⋆ = ∑i=1..n yiαi k(xi, x) + b⋆.   (1)
The Kernel Trick ?
The explicit computation of φ(x) is not necessary. The kernel k(x, x′) is enough.

• the SVM optimization for α works implicitly in the feature space.

• the SVM is a kernel algorithm: we only need to input K and y:

maximize g(α) = αT1 − (1/2) αT(K ⊙ yyT)α
such that 0 ≤ αi ≤ C, for i = 1, . . . , n, and ∑i=1..n αiyi = 0.

• K positive definite ⇔ the problem has a unique optimum

• the decision function is f(·) = ∑i=1..n yiαi k(xi, ·) + b.
Kernel example: polynomial kernel
• For x = (x1, x2)T ∈ R2, let φ(x) = (x1², √2 x1x2, x2²)T ∈ R3:

K(x, x′) = x1²x′1² + 2 x1x2x′1x′2 + x2²x′2²
= (x1x′1 + x2x′2)²
= (xTx′)².
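The identity (xTx′)² = 〈φ(x), φ(x′)〉 can again be verified numerically (arbitrary test points):

```python
import numpy as np

def phi(x):
    # phi((x1, x2)) = (x1^2, sqrt(2) x1 x2, x2^2), mapping R^2 -> R^3.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = np.dot(phi(x), phi(xp))   # explicit feature-space inner product
rhs = np.dot(x, xp) ** 2        # kernel evaluation (x^T x')^2
print(lhs, rhs)  # both 1.0
```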
Kernels are Trojan Horses onto Linear Models
• With kernels, complex structures can enter the realm of linear models
What is a kernel
In the context of these lectures...
• A kernel k is a function

k : X × X → R, (x, y) ↦ k(x, y)
• which compares two objects of a space X , e.g....
◦ strings, texts and sequences,
◦ images, audio and video feeds,
◦ graphs, interaction networks and 3D structures
• whatever actually... time-series of graphs of images? graphs of texts?...
Fundamental properties of a kernel
symmetric:

k(x, y) = k(y, x).

positive-(semi)definite:
for any finite family of points x1, · · · , xn of X, the matrix

K =
[ k(x1, x1)  k(x1, x2)  · · ·  k(x1, xn) ]
[ k(x2, x1)  k(x2, x2)  · · ·  k(x2, xn) ]
[     ⋮          ⋮        ⋱        ⋮     ]
[ k(xn, x1)  k(xn, x2)  · · ·  k(xn, xn) ]

is positive semidefinite (K ⪰ 0, i.e. it has a nonnegative spectrum).

K is often called the Gram matrix of {x1, · · · , xn} using k
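For a valid kernel the nonnegative spectrum can be observed directly. A sketch with the linear kernel k(x, y) = xTy (the sample is random and purely illustrative):

```python
import numpy as np

# Gram matrix of a random sample using the linear kernel k(x, y) = x^T y.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # six points in R^3
K = X @ X.T                   # K[i, j] = k(x_i, x_j)

print(np.allclose(K, K.T))            # symmetric: True
eigs = np.linalg.eigvalsh(K)
print(eigs.min() >= -1e-10)           # nonnegative spectrum: True
```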
What can we do with a kernel?
The setting
• Very loose setting: a set of objects x1, · · · , xn of X
• Sometimes additional information on these objects:
◦ labels yi ∈ {−1, 1} or {1, · · · , #(classes)},
◦ scalar values yi ∈ R,
◦ an associated object yi ∈ Y.

• A kernel k : X × X → R.
The Gram matrix perspective
• Imagine a little task: you have read 100 novels so far.
• You would like to know whether you will enjoy reading a new novel.
• A few options:
◦ read the book...
◦ have friends read it for you, read reviews.
◦ try to guess, based on the novels you read, if you will like it.
Two distinct approaches
• Define what features can characterize a book.
◦ Map each book in the library onto a vector

x = (x1, x2, . . . , xd)T

typically the xi's can describe...
⊲ # pages, language, year 1st published, country,
⊲ coordinates of the main action, keyword counts,
⊲ author's prizes, popularity, booksellers ranking.
• Challenge: find a decision function using 100 ratings and features.
• Define what makes two novels similar,
◦ Define a kernel k which quantifies novel similarities.
◦ Map the library onto a Gram matrix

K =
[ k(b1, b1)    k(b1, b2)    · · ·  k(b1, b100)   ]
[ k(b2, b1)    k(b2, b2)    · · ·  k(b2, b100)   ]
[     ⋮            ⋮          ⋱        ⋮         ]
[ k(b100, b1)  k(b100, b2)  · · ·  k(b100, b100) ]
• Challenge: find a decision function that takes this 100×100 matrix as an input.
Given a new novel,
• with the features approach, the prediction can be rephrased as: what are the features of this new book? what features have I found in the past that were good indicators of my taste?

• with the kernel approach, the prediction is rephrased as: which novels is this book similar or dissimilar to? what pool of books did I find the most influential in defining my tastes accurately?

Kernel methods only use kernel similarities; they do not consider features directly.

Features can help define similarities, but they are never considered elsewhere.
In kernel methods, there is a clear separation between the kernel k, which maps the dataset {x1, · · · , x5} onto a 5×5 kernel matrix K, and the convex optimization (thanks to the positive semidefiniteness of K, more later) which outputs the α's.
Kernel Trick
Given a dataset {x1, · · · , xN} and a new instance xnew
Many data analysis methods only depend on xiTxj and xiTxnew:
• Ridge regression
• Principal Component Analysis
• Linear Discriminant Analysis
• Canonical Correlation Analysis
• etc.
Replace these evaluations by k(xi, xj) and k(xi, xnew).

• This will even work if the xi's are not in a dot-product space! (strings, graphs, images, etc.)
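As an illustration of the first bullet, ridge regression can be kernelized: the standard kernel ridge solution is α = (K + λI)⁻¹y, with prediction f(xnew) = ∑i αi k(xi, xnew). The dataset, kernel, and λ below are hypothetical choices for a sketch:

```python
import numpy as np

# Kernel ridge regression sketch: only k(x_i, x_j) and k(x_i, x_new)
# are ever evaluated; the features of phi are never built explicitly.
def k(x, xp):
    return (np.dot(x, xp) + 1.0) ** 2   # illustrative polynomial kernel

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 2.0, 5.0])           # y = x^2 + 1, representable here

K = np.array([[k(a, b) for b in X] for a in X])
lam = 1e-8                              # tiny ridge parameter
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

x_new = np.array([1.5])
f_new = sum(a * k(xi, x_new) for a, xi in zip(alpha, X))
print(f_new)  # ≈ 1.5^2 + 1 = 3.25
```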