Tuesday, September 9, 2014, 9:00 am - 10:30 am
SVM and kernel machines: linear and non-linear classification
Prof. Stéphane Canu
Kernel methods are a class of learning machines that have become an increasingly popular tool for learning tasks such as pattern recognition, classification or novelty detection. This popularity is mainly due to the success of support vector machines (SVM), probably the most popular kernel method, and to the fact that kernel machines can be used in many applications, as they provide a bridge from linearity to non-linearity. This allows the generalization of many well-known methods such as PCA or LDA, to name a few. Other key points related to kernel machines are convex optimization, duality and the related sparsity. The objective of this course is to provide an overview of all these issues related to kernel machines. To do so, we will introduce kernel machines and the associated mathematical foundations through practical implementation. All lectures will be devoted to the writing of Matlab functions that, put together, will provide a toolbox for learning with kernels.
About Stéphane Canu
Stéphane Canu is a Professor at the LITIS research laboratory and at the information technology department of the National Institute of Applied Sciences in Rouen (INSA). He was the executive director of the LITIS, an information technology research laboratory in Normandy (150 researchers), from 2005 to 2012. He received a Ph.D. degree in System Command from Compiègne University of Technology in 1986. He joined the faculty of the Computer Science department at Compiègne University of Technology in 1987. He received the French habilitation degree from Paris 6 University. In 1997, he joined the Rouen Applied Sciences National Institute (INSA) as a full professor, where he created the information engineering department. He was the dean of this department until 2002, when he was named director of the computing service and facilities unit. In 2004 he joined, for one sabbatical year, the machine learning group at ANU/NICTA (Canberra) with Alex Smola
Ocean's Big Data Mining, 2014 (Data mining in large sets of complex oceanic data: new challenges and solutions)
8-9 Sep 2014 Brest (France)
SUMMER SCHOOL #OBIDAM14 / 8-9 Sep 2014 Brest (France) oceandatamining.sciencesconf.org
and Bob Williamson. In the last five years, he has published approximately thirty papers in refereed conference proceedings or journals in the areas of theory, algorithms and applications using kernel machine learning algorithms and other flexible regression methods. His research interests include kernel and frame machines, regularization, machine learning applied to signal processing, pattern classification, matrix factorization for recommender systems, and learning for context-aware applications.
SVM and kernel machines: linear and non-linear classification
Stéphane Canu
Ocean’s Big Data Mining, 2014
September 9, 2014
Road map
1 Supervised classification and prediction
2 Linear SVM
    Separating hyperplanes
    Linear SVM: the problem
    Optimization in 5 slides
    Dual formulation of the linear SVM
    The non separable case
3 Kernels
4 Kernelized support vector machine
Supervised classification as learning from examples

The task: use longitude and latitude to predict whether an object is a boat or a house.
Using (red and green) labelled examples, learn a (yellow) decision rule.
Then use the decision border to predict the label of unseen objects.

credit: A Gentle Introduction to Support Vector Machines in Biomedicine, A. Statnikov, D. Hardin, I. Guyon and C. F. Aliferis, www.nyuinformatics.org/downloads/supplements/SVM_Tutorial_2010/Final_WB.pdf
Supervised classification: the 2 steps

Given training data $\{(x_i, y_i)\}_{i=1,\dots,n}$, the learning algorithm $A$ produces the decision frontier $f$, and the label of a new $x$ is predicted as $y^p = f(x)$.

1. the border = Learn($\{(x_i, y_i)\}_{i=1,\dots,n}$ training data)   % A is SVM_learn
2. $y^p$ = Predict(unseen $x$, the border)   % f is SVM_val
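Below is a minimal Matlab sketch of this two-step interface. The centroid rule is only a stand-in for the SVM trained later in the course, and the names Learn and Predict are illustrative (the toolbox functions are SVM_learn and SVM_val):

```matlab
X = [randn(20,2)+1; randn(20,2)-1];   % n = 40 training points in R^2
y = [ones(20,1); -ones(20,1)];        % labels in {-1,+1}

% step 1 (Learn): fit the border; here w = difference of the class means
Learn  = @(X,y) struct('w', (mean(X(y==1,:),1) - mean(X(y==-1,:),1))', 'b', 0);
border = Learn(X, y);

% step 2 (Predict): apply the decision function f(x) = sign(w'x + b)
Predict = @(x,m) sign(x*m.w + m.b);
yp = Predict([0.5 0.5], border)
```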
Unavailable speakers (more qualified in Environmental Data Learning ;)

Mikhail Kanevski (UNIL, geostat) and S. Thiria & F. Badran (UPMC, LOCEAN)

less "ocean", but... more maths, more optimization, more Matlab...
Road map

1 Supervised classification and prediction
2 Linear SVM
    Separating hyperplanes
    Linear SVM: the problem
    Optimization in 5 slides
    Dual formulation of the linear SVM
    The non separable case
3 Kernels
4 Kernelized support vector machine

[figure: the maximal margin separating hyperplane, with level sets -1, 0, +1 and the margin]

"The algorithms for constructing the separating hyperplane considered above will be utilized for developing a battery of programs for pattern recognition." in Learning with Kernels, 2002, from V. Vapnik, 1982
Separating hyperplanes

Find a line to separate (classify) blue from red:
$$D(x) = \mathrm{sign}(v^\top x + a)$$
the decision border: $v^\top x + a = 0$

There are many solutions... the problem is ill-posed.
How to choose a solution?
Maximize our confidence = maximize the margin

the decision border: $\Delta(v,a) = \{x \in \mathbb{R}^d \mid v^\top x + a = 0\}$

maximize the margin:
$$\max_{v,a}\; \underbrace{\min_{i \in [1,n]} \mathrm{dist}\big(x_i, \Delta(v,a)\big)}_{\text{margin: } m}$$

Maximize the confidence:
$$\begin{cases} \displaystyle\max_{v,a}\; m \\ \text{with } \displaystyle\min_{i=1,n} \frac{|v^\top x_i + a|}{\|v\|} \ge m \end{cases}$$

the problem is still ill-posed: if $(v,a)$ is a solution, then $(kv, ka)$ is also a solution for all $k > 0$...
From the geometrical to the numerical margin

[figure: value of the margin in the one-dimensional case; the levels +1 and -1 of $w^\top x$ lie at distance $1/\|w\|$ on either side of the border $\{x \mid w^\top x = 0\}$]

Maximize the (geometrical) margin:
$$\begin{cases} \displaystyle\max_{v,a}\; m \\ \text{with } \displaystyle\min_{i=1,n} \frac{|v^\top x_i + a|}{\|v\|} \ge m \end{cases}$$

if the min is greater, everybody is greater ($y_i \in \{-1, 1\}$):
$$\begin{cases} \displaystyle\max_{v,a}\; m \\ \text{with } \dfrac{y_i (v^\top x_i + a)}{\|v\|} \ge m, \; i = 1, n \end{cases}$$

change of variable: $w = \dfrac{v}{m\|v\|}$ and $b = \dfrac{a}{m\|v\|}$, so that $\|w\| = \dfrac{1}{m}$:
$$\begin{cases} \displaystyle\max_{w,b}\; m \\ \text{with } y_i (w^\top x_i + b) \ge 1,\; i = 1, n \text{ and } m = \dfrac{1}{\|w\|} \end{cases}
\qquad\Longleftrightarrow\qquad
\begin{cases} \displaystyle\min_{w,b}\; \|w\|^2 \\ \text{with } y_i (w^\top x_i + b) \ge 1,\; i = 1, n \end{cases}$$
Road map

1 Supervised classification and prediction
2 Linear SVM
    Separating hyperplanes
    Linear SVM: the problem
    Optimization in 5 slides
    Dual formulation of the linear SVM
    The non separable case
3 Kernels
4 Kernelized support vector machine
Linear SVM: the problem

The maximal margin (= minimal norm) canonical hyperplane.

Linear SVMs are the solution of the following problem (called primal).
Let $\{(x_i, y_i);\ i = 1{:}n\}$ be a set of labelled data with $x_i \in \mathbb{R}^d$, $y_i \in \{1, -1\}$. A support vector machine (SVM) is a linear classifier associated with the following decision function: $D(x) = \mathrm{sign}(w^\top x + b)$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are given through the solution of the following problem:
$$\begin{cases} \displaystyle\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}}\; \frac{1}{2}\|w\|^2 \\ \text{with } y_i (w^\top x_i + b) \ge 1,\; i = 1, n \end{cases}$$

This is a quadratic program (QP):
$$\begin{cases} \displaystyle\min_{z}\; \frac{1}{2} z^\top A z - d^\top z \\ \text{with } B z \le e \end{cases}$$
Support vector machines as a QP

The standard QP formulation:
$$\begin{cases} \displaystyle\min_{w,b}\; \frac{1}{2}\|w\|^2 \\ \text{with } y_i (w^\top x_i + b) \ge 1,\; i = 1, n \end{cases}
\qquad\Longleftrightarrow\qquad
\begin{cases} \displaystyle\min_{z \in \mathbb{R}^{d+1}}\; \frac{1}{2} z^\top A z - d^\top z \\ \text{with } B z \le e \end{cases}$$
with $z = (w, b)^\top$, $d = (0, \dots, 0)^\top$, $A = \begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix}$, $B = -[\mathrm{diag}(y)X,\ y]$ and $e = -(1, \dots, 1)^\top$.

Solve it using a standard QP solver such as (for instance):

```matlab
% QUADPROG Quadratic programming.
% X = QUADPROG(H,f,A,b) attempts to solve the quadratic programming problem:
%
%     min 0.5*x'*H*x + f'*x   subject to:  A*x <= b
%      x
%
% so that the solution is in the range LB <= X <= UB
```

For more solvers (just to name a few) have a look at:
plato.asu.edu/sub/nlores.html#QP-problem
www.numerical.rl.ac.uk/people/nimg/qp/qp.html
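As an illustration, a minimal Matlab sketch of this primal QP on toy data, following the $z = (w, b)$ construction above (hard margin assumed, data generated to be separable):

```matlab
X = [randn(20,2)+2; randn(20,2)-2];   % two linearly separable classes
y = [ones(20,1); -ones(20,1)];
[n, d] = size(X);

H     = blkdiag(eye(d), 0);           % A above: penalize w, not b
f     = zeros(d+1, 1);                % d = 0 vector
Aineq = -[diag(y)*X, y];              % B: -y_i*(w'x_i + b) <= -1
bineq = -ones(n, 1);                  % e

z = quadprog(H, f, Aineq, bineq);
w = z(1:d);  b = z(d+1);
m = 1/norm(w)                         % geometrical margin m = 1/||w||
```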
Road map

1 Supervised classification and prediction
2 Linear SVM
    Separating hyperplanes
    Linear SVM: the problem
    Optimization in 5 slides
    Dual formulation of the linear SVM
    The non separable case
3 Kernels
4 Kernelized support vector machine
First order optimality condition (1)

$$\text{problem } P = \begin{cases} \displaystyle\min_{x \in \mathbb{R}^n} J(x) \\ \text{with } h_j(x) = 0,\; j = 1, \dots, p \\ \text{and } g_i(x) \le 0,\; i = 1, \dots, q \end{cases}$$

Definition: Karush, Kuhn and Tucker (KKT) conditions

stationarity: $\nabla J(x^\star) + \sum_{j=1}^p \lambda_j \nabla h_j(x^\star) + \sum_{i=1}^q \mu_i \nabla g_i(x^\star) = 0$
primal admissibility: $h_j(x^\star) = 0,\; j = 1, \dots, p$ and $g_i(x^\star) \le 0,\; i = 1, \dots, q$
dual admissibility: $\mu_i \ge 0,\; i = 1, \dots, q$
complementarity: $\mu_i\, g_i(x^\star) = 0,\; i = 1, \dots, q$

$\lambda_j$ and $\mu_i$ are called the Lagrange multipliers of problem $P$.
First order optimality condition (2)

Theorem (12.1, Nocedal & Wright p. 321)
If a vector $x^\star$ is a stationary point of problem $P$, then there exist Lagrange multipliers such that $\big(x^\star, \{\lambda_j\}_{j=1:p}, \{\mu_i\}_{i=1:q}\big)$ fulfill the KKT conditions (under some conditions, e.g. the linear independence constraint qualification).

If the problem is convex, then a stationary point is the solution of the problem.

A quadratic program (QP) is convex when...
$$(QP) \quad \begin{cases} \displaystyle\min_{z}\; \frac{1}{2} z^\top A z - d^\top z \\ \text{with } B z \le e \end{cases}$$
... when matrix $A$ is positive definite.
KKT condition - Lagrangian (3)

$$\text{problem } P = \begin{cases} \displaystyle\min_{x \in \mathbb{R}^n} J(x) \\ \text{with } h_j(x) = 0,\; j = 1, \dots, p \\ \text{and } g_i(x) \le 0,\; i = 1, \dots, q \end{cases}$$

Definition: Lagrangian
The Lagrangian of problem $P$ is the following function:
$$L(x, \lambda, \mu) = J(x) + \sum_{j=1}^p \lambda_j h_j(x) + \sum_{i=1}^q \mu_i g_i(x)$$

The importance of being a Lagrangian:
- the stationarity condition can be written $\nabla L(x^\star, \lambda, \mu) = 0$
- the Lagrangian saddle point: $\max_{\lambda,\mu} \min_x L(x, \lambda, \mu)$

Primal variables: $x$; dual variables: $\lambda, \mu$ (the Lagrange multipliers).
Duality – definitions (1)

Primal and (Lagrange) dual problems:
$$P = \begin{cases} \displaystyle\min_{x \in \mathbb{R}^n} J(x) \\ \text{with } h_j(x) = 0,\; j = 1, p \\ \text{and } g_i(x) \le 0,\; i = 1, q \end{cases}
\qquad
D = \begin{cases} \displaystyle\max_{\lambda \in \mathbb{R}^p,\, \mu \in \mathbb{R}^q} Q(\lambda, \mu) \\ \text{with } \mu_j \ge 0,\; j = 1, q \end{cases}$$

Dual objective function:
$$Q(\lambda, \mu) = \inf_x L(x, \lambda, \mu) = \inf_x\; J(x) + \sum_{j=1}^p \lambda_j h_j(x) + \sum_{i=1}^q \mu_i g_i(x)$$

Wolfe dual problem:
$$W = \begin{cases} \displaystyle\max_{x,\, \lambda \in \mathbb{R}^p,\, \mu \in \mathbb{R}^q} L(x, \lambda, \mu) \\ \text{with } \mu_j \ge 0,\; j = 1, q \\ \text{and } \nabla J(x) + \displaystyle\sum_{j=1}^p \lambda_j \nabla h_j(x) + \sum_{i=1}^q \mu_i \nabla g_i(x) = 0 \end{cases}$$
Duality – theorems (2)

Theorem (12.12, 12.13 and 12.14, Nocedal & Wright p. 346)
If $f$, $g$ and $h$ are convex and continuously differentiable (under some conditions, e.g. the linear independence constraint qualification), then the solution of the dual problem is the same as the solution of the primal:
$$(\lambda^\star, \mu^\star) = \text{solution of problem } D, \qquad x^\star = \arg\min_x L(x, \lambda^\star, \mu^\star)$$
$$Q(\lambda^\star, \mu^\star) = \min_x L(x, \lambda^\star, \mu^\star) = L(x^\star, \lambda^\star, \mu^\star) = J(x^\star) + \lambda^{\star\top} H(x^\star) + \mu^{\star\top} G(x^\star) = J(x^\star)$$
and for any feasible point $x$:
$$Q(\lambda, \mu) \le J(x) \quad\Longrightarrow\quad 0 \le J(x) - Q(\lambda, \mu)$$

The duality gap is the difference between the primal and dual cost functions.
Road map

1 Supervised classification and prediction
2 Linear SVM
    Separating hyperplanes
    Linear SVM: the problem
    Optimization in 5 slides
    Dual formulation of the linear SVM
    The non separable case
3 Kernels
4 Kernelized support vector machine

Figure from L. Bottou & C.J. Lin, Support vector machine solvers, in Large Scale Kernel Machines, 2007.
Linear SVM dual formulation - the Lagrangian

$$\begin{cases} \displaystyle\min_{w,b}\; \frac{1}{2}\|w\|^2 \\ \text{with } y_i (w^\top x_i + b) \ge 1,\; i = 1, n \end{cases}$$

Looking for the Lagrangian saddle point $\max_\alpha \min_{w,b} L(w, b, \alpha)$ with the so-called Lagrange multipliers $\alpha_i \ge 0$:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \big( y_i (w^\top x_i + b) - 1 \big)$$

$\alpha_i$ represents the influence of the constraint, and thus the influence of the training example $(x_i, y_i)$.
Stationarity conditions

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \big( y_i (w^\top x_i + b) - 1 \big)$$

Computing the gradients:
$$\begin{cases} \nabla_w L(w, b, \alpha) = w - \displaystyle\sum_{i=1}^n \alpha_i y_i x_i \\[1ex] \dfrac{\partial L(w, b, \alpha)}{\partial b} = -\displaystyle\sum_{i=1}^n \alpha_i y_i \end{cases}$$

we have the following optimality conditions:
$$\begin{cases} \nabla_w L(w, b, \alpha) = 0 \;\Longrightarrow\; w = \displaystyle\sum_{i=1}^n \alpha_i y_i x_i \\[1ex] \dfrac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Longrightarrow\; \displaystyle\sum_{i=1}^n \alpha_i y_i = 0 \end{cases}$$
KKT conditions for SVM

stationarity: $w - \sum_{i=1}^n \alpha_i y_i x_i = 0$ and $\sum_{i=1}^n \alpha_i y_i = 0$
primal admissibility: $y_i (w^\top x_i + b) \ge 1,\; i = 1, \dots, n$
dual admissibility: $\alpha_i \ge 0,\; i = 1, \dots, n$
complementarity: $\alpha_i \big( y_i (w^\top x_i + b) - 1 \big) = 0,\; i = 1, \dots, n$

The complementarity condition splits the data into two sets:
- $\mathcal{A}$, the set of active constraints: the useful points,
$$\mathcal{A} = \{\, i \in [1, n] \mid y_i (w^{\star\top} x_i + b^\star) = 1 \,\}$$
- its complement $\bar{\mathcal{A}}$: the useless points, with $\alpha_i = 0$ if $i \notin \mathcal{A}$.
The KKT conditions for SVM

The same KKT, but using matrix notation and the active set $\mathcal{A}$:
stationarity: $w - X^\top D_y \alpha = 0$ and $\alpha^\top y = 0$
primal admissibility: $D_y (X w + b\,\mathbf{1}) \ge \mathbf{1}$
dual admissibility: $\alpha \ge 0$
complementarity: $D_y (X_{\mathcal{A}} w + b\,\mathbf{1}_{\mathcal{A}}) = \mathbf{1}_{\mathcal{A}}$ and $\alpha_{\bar{\mathcal{A}}} = 0$

Knowing $\mathcal{A}$, the solution verifies the following linear system:
$$\begin{cases} w - X_{\mathcal{A}}^\top D_y \alpha_{\mathcal{A}} = 0 \\ -D_y X_{\mathcal{A}} w - b\, y_{\mathcal{A}} = -e_{\mathcal{A}} \\ -y_{\mathcal{A}}^\top \alpha_{\mathcal{A}} = 0 \end{cases}$$
with $D_y = \mathrm{diag}(y_{\mathcal{A}})$, $\alpha_{\mathcal{A}} = \alpha(\mathcal{A})$, $y_{\mathcal{A}} = y(\mathcal{A})$ and $X_{\mathcal{A}} = X(\mathcal{A}, :)$.
The KKT conditions as a linear system

$$\begin{cases} w - X_{\mathcal{A}}^\top D_y \alpha_{\mathcal{A}} = 0 \\ -D_y X_{\mathcal{A}} w - b\, y_{\mathcal{A}} = -e_{\mathcal{A}} \\ -y_{\mathcal{A}}^\top \alpha_{\mathcal{A}} = 0 \end{cases}
\qquad\Longleftrightarrow\qquad
\begin{bmatrix} I & -X_{\mathcal{A}}^\top D_y & 0 \\ -D_y X_{\mathcal{A}} & 0 & -y_{\mathcal{A}} \\ 0 & -y_{\mathcal{A}}^\top & 0 \end{bmatrix}
\begin{bmatrix} w \\ \alpha_{\mathcal{A}} \\ b \end{bmatrix}
= \begin{bmatrix} 0 \\ -e_{\mathcal{A}} \\ 0 \end{bmatrix}$$
with $D_y = \mathrm{diag}(y_{\mathcal{A}})$, $\alpha_{\mathcal{A}} = \alpha(\mathcal{A})$, $y_{\mathcal{A}} = y(\mathcal{A})$ and $X_{\mathcal{A}} = X(\mathcal{A}, :)$.

we can work on it to separate $w$ from $(\alpha_{\mathcal{A}}, b)$
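A small Matlab sketch of this linear system on toy data, assuming the active set $\mathcal{A}$ is known in advance (finding it is precisely the job of an active-set solver; the data and the chosen set here are a hypothetical illustration):

```matlab
X = [1 1; 2 2; -1 -1; -2 -2];  y = [1; 1; -1; -1];
A = [1; 3];                    % assume points 1 and 3 are the support vectors
XA = X(A,:);  yA = y(A);  Dy = diag(yA);
d = size(X,2);  nA = numel(A);  eA = ones(nA,1);

% block matrix of the KKT system above, unknowns (w, alpha_A, b)
K = [ eye(d)      -XA'*Dy     zeros(d,1) ;
     -Dy*XA       zeros(nA)   -yA        ;
      zeros(1,d)  -yA'        0          ];
rhs = [zeros(d,1); -eA; 0];
sol = K \ rhs;
w = sol(1:d), alphaA = sol(d+1:d+nA), b = sol(end)
```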
The SVM dual formulation

The SVM Wolfe dual:
$$\begin{cases} \displaystyle\max_{w,b,\alpha}\; \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \big( y_i (w^\top x_i + b) - 1 \big) \\ \text{with } \alpha_i \ge 0,\; i = 1, \dots, n \\ \text{and } w - \displaystyle\sum_{i=1}^n \alpha_i y_i x_i = 0 \;\text{ and }\; \sum_{i=1}^n \alpha_i y_i = 0 \end{cases}$$

using the fact $w = \sum_{i=1}^n \alpha_i y_i x_i$, the SVM Wolfe dual without $w$ and $b$:
$$\begin{cases} \displaystyle\max_{\alpha}\; -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_j \alpha_i y_i y_j x_j^\top x_i + \sum_{i=1}^n \alpha_i \\ \text{with } \alpha_i \ge 0,\; i = 1, \dots, n \\ \text{and } \displaystyle\sum_{i=1}^n \alpha_i y_i = 0 \end{cases}$$
Linear SVM dual formulation

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \big( y_i (w^\top x_i + b) - 1 \big)$$

Optimality: $w = \displaystyle\sum_{i=1}^n \alpha_i y_i x_i$ and $\displaystyle\sum_{i=1}^n \alpha_i y_i = 0$. Substituting:
$$L(\alpha) = \frac{1}{2} \underbrace{\sum_{i=1}^n \sum_{j=1}^n \alpha_j \alpha_i y_i y_j x_j^\top x_i}_{w^\top w} \;-\; \sum_{i=1}^n \alpha_i y_i \underbrace{\sum_{j=1}^n \alpha_j y_j x_j^\top}_{w^\top} x_i \;-\; b \underbrace{\sum_{i=1}^n \alpha_i y_i}_{=0} \;+\; \sum_{i=1}^n \alpha_i$$
$$= -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_j \alpha_i y_i y_j x_j^\top x_i + \sum_{i=1}^n \alpha_i$$

The dual of the linear SVM is also a quadratic program:
$$\text{problem } D \quad \begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^n}\; \frac{1}{2} \alpha^\top G \alpha - e^\top \alpha \\ \text{with } y^\top \alpha = 0 \\ \text{and } 0 \le \alpha_i,\; i = 1, n \end{cases}$$
with $G$ a symmetric $n \times n$ matrix such that $G_{ij} = y_i y_j x_j^\top x_i$.
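A minimal Matlab sketch of this dual on toy data: solve the QP with quadprog, then recover $w$ from the stationarity condition and $b$ from an active constraint (the 1e-6 support-vector threshold is an arbitrary choice):

```matlab
X = [randn(20,2)+2; randn(20,2)-2];
y = [ones(20,1); -ones(20,1)];
n = size(X,1);

G = (y*y') .* (X*X');                  % G_ij = y_i y_j x_i' x_j
e = ones(n,1);
alpha = quadprog(G, -e, [], [], y', 0, zeros(n,1), []);

w  = X' * (alpha .* y);                % w = sum_i alpha_i y_i x_i
sv = find(alpha > 1e-6);               % support vectors: alpha_i > 0
b  = mean(y(sv) - X(sv,:)*w);          % from y_i (w'x_i + b) = 1 on A
```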
SVM primal vs. dual

Primal:
$$\begin{cases} \displaystyle\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}}\; \frac{1}{2}\|w\|^2 \\ \text{with } y_i (w^\top x_i + b) \ge 1,\; i = 1, n \end{cases}$$
- d + 1 unknowns
- n constraints
- classical QP
- perfect when d << n

Dual:
$$\begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^n}\; \frac{1}{2} \alpha^\top G \alpha - e^\top \alpha \\ \text{with } y^\top \alpha = 0 \text{ and } 0 \le \alpha_i,\; i = 1, n \end{cases}$$
- n unknowns
- G, the Gram matrix (pairwise influence matrix)
- n box constraints
- easy to solve
- to be used when d > n

$$f(x) = \sum_{j=1}^d w_j x_j + b = \sum_{i=1}^n \alpha_i y_i (x^\top x_i) + b$$
Road map

1 Supervised classification and prediction
2 Linear SVM
    Separating hyperplanes
    Linear SVM: the problem
    Optimization in 5 slides
    Dual formulation of the linear SVM
    The non separable case
3 Kernels
4 Kernelized support vector machine

[figure: the non-separable case, with a slack variable $\xi_j$ measuring how far point $j$ lies on the wrong side of the margin]
The non separable case: a bi-criteria optimization problem

Modeling potential errors: introducing the slack variables $\xi_i$. For $(x_i, y_i)$:
$$\begin{cases} \text{no error: } y_i (w^\top x_i + b) \ge 1 \;\Longrightarrow\; \xi_i = 0 \\ \text{error: } \xi_i = 1 - y_i (w^\top x_i + b) > 0 \end{cases}$$

$$\begin{cases} \displaystyle\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 \\ \displaystyle\min_{w,b,\xi}\; \frac{C}{p} \sum_{i=1}^n \xi_i^p \\ \text{with } y_i (w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,\; i = 1, n \end{cases}$$

Our hope: almost all $\xi_i = 0$.
The non separable case

Modeling potential errors: introducing the slack variables $\xi_i$. For $(x_i, y_i)$:
$$\begin{cases} \text{no error: } y_i (w^\top x_i + b) \ge 1 \;\Longrightarrow\; \xi_i = 0 \\ \text{error: } \xi_i = 1 - y_i (w^\top x_i + b) > 0 \end{cases}$$

Minimizing also the slack (the error), for a given $C > 0$:
$$\begin{cases} \displaystyle\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + \frac{C}{p} \sum_{i=1}^n \xi_i^p \\ \text{with } y_i (w^\top x_i + b) \ge 1 - \xi_i,\; i = 1, n \\ \phantom{\text{with }} \xi_i \ge 0,\; i = 1, n \end{cases}$$

Looking for the saddle point of the Lagrangian, with the Lagrange multipliers $\alpha_i \ge 0$ and $\beta_i \ge 0$:
$$L(w, b, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{C}{p} \sum_{i=1}^n \xi_i^p - \sum_{i=1}^n \alpha_i \big( y_i (w^\top x_i + b) - 1 + \xi_i \big) - \sum_{i=1}^n \beta_i \xi_i$$
The KKT

$$L(w, b, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{C}{p} \sum_{i=1}^n \xi_i^p - \sum_{i=1}^n \alpha_i \big( y_i (w^\top x_i + b) - 1 + \xi_i \big) - \sum_{i=1}^n \beta_i \xi_i$$

stationarity: $w - \sum_{i=1}^n \alpha_i y_i x_i = 0$, $\;\sum_{i=1}^n \alpha_i y_i = 0\;$ and $\;C - \alpha_i - \beta_i = 0$ (for $p = 1$), $i = 1, \dots, n$
primal admissibility: $y_i (w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0,\; i = 1, \dots, n$
dual admissibility: $\alpha_i \ge 0$ and $\beta_i \ge 0,\; i = 1, \dots, n$
complementarity: $\alpha_i \big( y_i (w^\top x_i + b) - 1 + \xi_i \big) = 0$ and $\beta_i \xi_i = 0,\; i = 1, \dots, n$

Let's eliminate $\beta$!
KKT

stationarity: $w - \sum_{i=1}^n \alpha_i y_i x_i = 0$ and $\sum_{i=1}^n \alpha_i y_i = 0$
primal admissibility: $y_i (w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0,\; i = 1, \dots, n$
dual admissibility: $\alpha_i \ge 0$ and $C - \alpha_i \ge 0,\; i = 1, \dots, n$
complementarity: $\alpha_i \big( y_i (w^\top x_i + b) - 1 + \xi_i \big) = 0$ and $(C - \alpha_i)\, \xi_i = 0,\; i = 1, \dots, n$

The three sets:

| | $I_0$ | $I_\alpha$ | $I_C$ |
|---|---|---|---|
| $\alpha_i$ | $0$ | $0 < \alpha < C$ | $C$ |
| $\beta_i$ | $C$ | $C - \alpha$ | $0$ |
| $\xi_i$ | $0$ | $0$ | $1 - y_i(w^\top x_i + b)$ |
| constraint | $y_i(w^\top x_i + b) > 1$ | $y_i(w^\top x_i + b) = 1$ | $y_i(w^\top x_i + b) < 1$ |
| status | useless | useful (support vec) | suspicious |
The importance of being support

[figure: two scatter plots of the same data set, distinguishing useless, support and suspicious points]

| data point | $\alpha$ | constraint value | set |
|---|---|---|---|
| $x_i$ useless | $\alpha_i = 0$ | $y_i(w^\top x_i + b) > 1$ | $I_0$ |
| $x_i$ support | $0 < \alpha_i < C$ | $y_i(w^\top x_i + b) = 1$ | $I_\alpha$ |
| $x_i$ suspicious | $\alpha_i = C$ | $y_i(w^\top x_i + b) < 1$ | $I_C$ |

Table: when a data point is « support », it lies exactly on the margin.

Here lies the efficiency of the algorithm (and its complexity)!
sparsity: $\alpha_i = 0$
Optimality conditions (p = 1)

$$L(w, b, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \big( y_i (w^\top x_i + b) - 1 + \xi_i \big) - \sum_{i=1}^n \beta_i \xi_i$$

Computing the gradients:
$$\begin{cases} \nabla_w L(w, b, \alpha) = w - \displaystyle\sum_{i=1}^n \alpha_i y_i x_i \\[1ex] \dfrac{\partial L(w, b, \alpha)}{\partial b} = -\displaystyle\sum_{i=1}^n \alpha_i y_i \\[1ex] \nabla_{\xi_i} L(w, b, \alpha) = C - \alpha_i - \beta_i \end{cases}$$

no change for $w$ and $b$; and since $\beta_i \ge 0$ and $C - \alpha_i - \beta_i = 0$, we get $\alpha_i \le C$.

The dual formulation:
$$\begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^n}\; \frac{1}{2} \alpha^\top G \alpha - e^\top \alpha \\ \text{with } y^\top \alpha = 0 \\ \text{and } 0 \le \alpha_i \le C,\; i = 1, n \end{cases}$$
SVM primal vs. dual

Primal:
$$\begin{cases} \displaystyle\min_{w,b,\xi \in \mathbb{R}^n}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \\ \text{with } y_i (w^\top x_i + b) \ge 1 - \xi_i \\ \phantom{\text{with }} \xi_i \ge 0,\; i = 1, n \end{cases}$$
- d + n + 1 unknowns
- 2n constraints
- classical QP
- to be used when n is too large to build G

Dual:
$$\begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^n}\; \frac{1}{2} \alpha^\top G \alpha - e^\top \alpha \\ \text{with } y^\top \alpha = 0 \text{ and } 0 \le \alpha_i \le C,\; i = 1, n \end{cases}$$
- n unknowns
- G, the Gram matrix (pairwise influence matrix)
- 2n box constraints
- easy to solve
- to be used when n is not too large
Eliminating the slack but not the possible mistakes

$$\begin{cases} \displaystyle\min_{w,b,\xi \in \mathbb{R}^n}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \\ \text{with } y_i (w^\top x_i + b) \ge 1 - \xi_i \\ \phantom{\text{with }} \xi_i \ge 0,\; i = 1, n \end{cases}$$

Introducing the hinge loss:
$$\xi_i = \max\big( 1 - y_i (w^\top x_i + b),\, 0 \big)$$
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \max\big( 0,\, 1 - y_i (w^\top x_i + b) \big)$$

Back to $d + 1$ variables, but this is no longer an explicit QP.
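A crude Matlab sketch of this unconstrained view: evaluate the hinge objective and take subgradient steps (a stand-in for a dedicated solver; the step size and the iteration count are arbitrary choices):

```matlab
X = [randn(20,2)+2; randn(20,2)-2];  y = [ones(20,1); -ones(20,1)];
C = 1;  w = zeros(2,1);  b = 0;  lr = 0.01;

for t = 1:200
    m = 1 - y .* (X*w + b);            % margins: positive means hinge active
    J = 0.5*(w'*w) + C*sum(max(m,0));  % primal objective
    act = m > 0;                       % points with non-zero hinge loss
    gw = w - C * X(act,:)' * y(act);   % subgradient w.r.t. w
    gb = -C * sum(y(act));             % subgradient w.r.t. b
    w = w - lr*gw;  b = b - lr*gb;
end
```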
The hinge and other losses

Square hinge (Huber/hinge) and Lasso SVM:
$$\min_{w,b}\; \|w\|_1 + C \sum_{i=1}^n \max\big( 1 - y_i (w^\top x_i + b),\, 0 \big)^p$$

Penalized logistic regression (Maxent):
$$\min_{w,b}\; \|w\|_2^2 + C \sum_{i=1}^n \log\big( 1 + e^{-2 y_i (w^\top x_i + b)} \big)$$

The exponential loss (commonly used in boosting):
$$\min_{w,b}\; \|w\|_2^2 + C \sum_{i=1}^n e^{-y_i (w^\top x_i + b)}$$

The sigmoid loss:
$$\min_{w,b}\; \|w\|_2^2 - C \sum_{i=1}^n \tanh\big( y_i (w^\top x_i + b) \big)$$

[figure: the classification losses as a function of the margin $y f(x)$: 0/1 loss, hinge, hinge², logistic, exponential, sigmoid]
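A short Matlab sketch reproducing this figure; the logistic curve is plotted with the scaling used above, and $1 - \tanh(yf)$ is taken as the sigmoid loss (a common convention, assumed here):

```matlab
yf = linspace(-1.5, 1.5, 300);
l01  = double(yf <= 0);              % 0/1 loss
hin  = max(0, 1 - yf);               % hinge
hin2 = max(0, 1 - yf).^2;            % squared hinge
logi = log(1 + exp(-2*yf));          % logistic (Maxent)
expo = exp(-yf);                     % exponential (boosting)
sigm = 1 - tanh(yf);                 % sigmoid
plot(yf, [l01; hin; hin2; logi; expo; sigm]); grid on
xlabel('y f(x)'); ylabel('classification loss')
legend('0/1','hinge','hinge^2','logistic','exponential','sigmoid')
```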
Roadmap

1 Supervised classification and prediction
2 Linear SVM
    Separating hyperplanes
    Linear SVM: the problem
    Optimization in 5 slides
    Dual formulation of the linear SVM
    The non separable case
3 Kernels
4 Kernelized support vector machine
Introducing non linearities through the feature map

SVM Val: $f(x) = \displaystyle\sum_{j=1}^d x_j w_j + b = \sum_{i=1}^n \alpha_i (x_i^\top x) + b$

$$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} \in \mathbb{R}^2, \qquad
\phi(t) = \begin{pmatrix} t_1 \\ t_1^2 \\ t_2 \\ t_2^2 \\ t_1 t_2 \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}$$

linear in $x \in \mathbb{R}^5$, quadratic in $t \in \mathbb{R}^2$.

The feature map:
$$\phi : \mathbb{R}^2 \longrightarrow \mathbb{R}^5, \qquad t \longmapsto \phi(t) = x, \qquad x_i^\top x = \phi(t_i)^\top \phi(t)$$

[figure: mapping the data into a feature space where they become linearly separable]
A. Lorena & A. de Carvalho, Uma Introdução às Support Vector Machines, 2007
Non linear case: dictionary vs. kernel

In the non linear case, use a dictionary of functions $\phi_j(x),\; j = 1, \dots, p$, with possibly $p = \infty$; for instance polynomials, wavelets...
$$f(x) = \sum_{j=1}^p w_j \phi_j(x) \quad\text{with}\quad w_j = \sum_{i=1}^n \alpha_i y_i \phi_j(x_i)$$

so that
$$f(x) = \sum_{i=1}^n \alpha_i y_i \underbrace{\sum_{j=1}^p \phi_j(x_i)\, \phi_j(x)}_{k(x_i,\, x)}$$

$p \gg n$: so what, since $k(x_i, x) = \sum_{j=1}^p \phi_j(x_i)\, \phi_j(x)$.
Closed form kernel: the quadratic kernel

The quadratic dictionary in $\mathbb{R}^d$:
$$\phi : \mathbb{R}^d \to \mathbb{R}^{p = 1 + d + \frac{d(d+1)}{2}}, \qquad
s \mapsto \phi = \big( 1, s_1, s_2, \dots, s_d, s_1^2, s_2^2, \dots, s_d^2, \dots, s_i s_j, \dots \big)$$
in this case
$$\phi(s)^\top \phi(t) = 1 + s_1 t_1 + s_2 t_2 + \dots + s_d t_d + s_1^2 t_1^2 + \dots + s_d^2 t_d^2 + \dots + s_i s_j t_i t_j + \dots$$

The quadratic kernel: for $s, t \in \mathbb{R}^d$,
$$k(s, t) = \big( s^\top t + 1 \big)^2 = 1 + 2\, s^\top t + \big( s^\top t \big)^2$$
computes the dot product of the reweighted dictionary:
$$\phi : \mathbb{R}^d \to \mathbb{R}^{p = 1 + d + \frac{d(d+1)}{2}}, \qquad
s \mapsto \phi = \big( 1, \sqrt{2}\, s_1, \sqrt{2}\, s_2, \dots, \sqrt{2}\, s_d, s_1^2, s_2^2, \dots, s_d^2, \dots, \sqrt{2}\, s_i s_j, \dots \big)$$

$p = 1 + d + \frac{d(d+1)}{2}$ multiplications vs. $d + 1$: use the kernel to save computation.
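A quick Matlab check of this identity for $d = 3$, comparing the reweighted dictionary with the closed form (the dictionary is written out explicitly for this small $d$):

```matlab
d = 3;  s = randn(d,1);  t = randn(d,1);

% reweighted dictionary phi(s) = (1, sqrt(2)s_i, s_i^2, sqrt(2)s_i s_j)
phi = @(u) [1; sqrt(2)*u; u.^2; ...
            sqrt(2)*u(1)*u(2); sqrt(2)*u(1)*u(3); sqrt(2)*u(2)*u(3)];

k_dict   = phi(s)' * phi(t);      % p = 1 + d + d(d+1)/2 = 10 features
k_closed = (s'*t + 1)^2;          % closed form: d + 1 multiplications
abs(k_dict - k_closed)            % ~ 1e-15
```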
Kernel: features through pairwise comparisons

$x \to \phi(x)$; e.g. a text $\to$ e.g. its bag of words (BOW).

Instead of the $n \times p$ feature matrix $\Phi$, work with the $n \times n$ matrix $K$ of pairwise comparisons ($O(n^2)$):
$$k(x_i, x_j) = \sum_{\ell=1}^p \phi_\ell(x_i)\, \phi_\ell(x_j)$$
Kernel machine: kernel as a dictionary

$$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$
- $\alpha_i$, the influence of example $i$, depends on $y_i$
- $k(x, x_i)$, the kernel, does NOT depend on $y_i$

Definition (Kernel)
Let $\Omega$ be a non-empty set (the input space). A kernel is a function $k$ from $\Omega \times \Omega$ onto $\mathbb{R}$:
$$k : \Omega \times \Omega \longmapsto \mathbb{R}, \qquad (s, t) \longmapsto k(s, t)$$

Semi-parametric version: given the family $q_j(x),\; j = 1, p$:
$$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i) + \sum_{j=1}^p \beta_j q_j(x)$$
In the beginning was the kernel...

Definition (Kernel)
A function of two variables $k$ from $\Omega \times \Omega$ to $\mathbb{R}$.

Definition (Positive kernel)
A kernel $k(s, t)$ on $\Omega$ is said to be positive if it is symmetric, $k(s, t) = k(t, s)$, and if for any finite positive integer $n$:
$$\forall \{\alpha_i\}_{i=1,n} \in \mathbb{R},\; \forall \{x_i\}_{i=1,n} \in \Omega, \quad \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) \ge 0$$
It is strictly positive if, for $\alpha \ne 0$:
$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) > 0$$
Examples of positive kernels

The linear kernel: $s, t \in \mathbb{R}^d$, $k(s, t) = s^\top t$
- symmetric: $s^\top t = t^\top s$
- positive:
$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j x_i^\top x_j = \Big( \sum_{i=1}^n \alpha_i x_i \Big)^{\!\top} \Big( \sum_{j=1}^n \alpha_j x_j \Big) = \Big\| \sum_{i=1}^n \alpha_i x_i \Big\|^2 \ge 0$$

The product kernel: $k(s, t) = g(s)\, g(t)$ for some $g : \mathbb{R}^d \to \mathbb{R}$
- symmetric by construction
- positive:
$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j g(x_i) g(x_j) = \Big( \sum_{i=1}^n \alpha_i g(x_i) \Big) \Big( \sum_{j=1}^n \alpha_j g(x_j) \Big) = \Big( \sum_{i=1}^n \alpha_i g(x_i) \Big)^2 \ge 0$$

$k$ is positive $\iff$ its square root exists $\iff$ $k(s, t) = \langle \phi_s, \phi_t \rangle$

J.P. Vert, 2006
Positive definite kernel (PDK) algebra (closure)

If $k_1(s, t)$ and $k_2(s, t)$ are two positive kernels:
- PDK form a convex cone: $\forall a_1 \in \mathbb{R}^+$, $a_1 k_1(s, t) + k_2(s, t)$ is a PDK
- the product kernel $k_1(s, t)\, k_2(s, t)$ is a PDK

Proofs:
- by linearity:
$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \big( a_1 k_1(i, j) + k_2(i, j) \big) = a_1 \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k_1(i, j) + \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k_2(i, j) \ge 0$$
- assuming there exist $\psi_\ell$ such that $k_1(s, t) = \sum_\ell \psi_\ell(s)\, \psi_\ell(t)$:
$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j\, k_1(x_i, x_j)\, k_2(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \Big( \sum_\ell \psi_\ell(x_i)\, \psi_\ell(x_j) \Big) k_2(x_i, x_j)$$
$$= \sum_\ell \sum_{i=1}^n \sum_{j=1}^n \big( \alpha_i \psi_\ell(x_i) \big) \big( \alpha_j \psi_\ell(x_j) \big)\, k_2(x_i, x_j) \ge 0$$

N. Cristianini and J. Shawe-Taylor, Kernel Methods for Pattern Analysis, 2004
Kernel engineering: building PDK

- for any polynomial $\varphi$ with positive coefficients, from $\mathbb{R}$ to $\mathbb{R}$: $\varphi\big( k(s, t) \big)$ is a PDK
- if $\psi$ is a function from $\mathbb{R}^d$ to $\mathbb{R}^d$: $k\big( \psi(s), \psi(t) \big)$ is a PDK
- if $\varphi$, from $\mathbb{R}^d$ to $\mathbb{R}^+$, has its minimum in 0: $k(s, t) = \varphi(s + t) - \varphi(s - t)$ is a PDK
- the convolution of two positive kernels is a positive kernel: $K_1 \star K_2$

Example: the Gaussian kernel is a PDK
$$\exp(-\|s - t\|^2) = \exp(-\|s\|^2 - \|t\|^2 + 2 s^\top t) = \exp(-\|s\|^2)\, \exp(-\|t\|^2)\, \exp(2 s^\top t)$$
- $s^\top t$ is a PDK, and $\exp$ is the limit of a series expansion with positive coefficients, so $\exp(2 s^\top t)$ is a PDK
- $\exp(-\|s\|^2)\, \exp(-\|t\|^2)$ is a PDK as a product kernel
- the product of two PDK is a PDK

O. Catoni, master lecture, 2005
Some examples of PD kernels...

| type | name | $k(s,t)$ |
|---|---|---|
| radial | gaussian | $\exp\!\big({-\frac{r^2}{b}}\big)$, $r = \|s - t\|$ |
| radial | laplacian | $\exp(-r/b)$ |
| radial | rational | $1 - \frac{r^2}{r^2 + b}$ |
| radial | loc. gauss. | $\max\!\big(0, 1 - \frac{r}{3b}\big)^d \exp\!\big({-\frac{r^2}{b}}\big)$ |
| non stat. | $\chi^2$ | $\exp(-r/b)$, $r = \sum_k \frac{(s_k - t_k)^2}{s_k + t_k}$ |
| projective | polynomial | $(s^\top t)^p$ |
| projective | affine | $(s^\top t + b)^p$ |
| projective | cosine | $s^\top t / (\|s\| \|t\|)$ |
| projective | correlation | $\exp\!\big( \frac{s^\top t}{\|s\| \|t\|} - b \big)$ |

Most of these kernels depend on a quantity $b$ called the bandwidth.
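For instance, the Gaussian Gram matrix of the table can be computed in Matlab in a few lines (implicit expansion, Matlab R2016b+, is assumed):

```matlab
% Gram matrix for the Gaussian kernel k(s,t) = exp(-||s - t||^2 / b)
X = randn(5,2);  Z = randn(3,2);  b = 1;
sqd = sum(X.^2,2) + sum(Z.^2,2)' - 2*(X*Z');   % ||x_i - z_j||^2, 5 x 3
K = exp(-sqd / b)                              % K(i,j) = k(x_i, z_j)
```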
Roadmap

1 Supervised classification and prediction
2 Linear SVM
    Separating hyperplanes
    Linear SVM: the problem
    Optimization in 5 slides
    Dual formulation of the linear SVM
    The non separable case
3 Kernels
4 Kernelized support vector machine

[figure: decision function of a kernelized SVM on a 2D data set, with level sets -1, 0 and +1]
Using relevant features...

a data point becomes a function: $x \longrightarrow k(x, \bullet)$
Representer theorem for SVM

$$\begin{cases} \displaystyle\min_{f,b}\; \frac{1}{2}\|f\|_{\mathcal{H}}^2 \\ \text{with } y_i \big( f(x_i) + b \big) \ge 1 \end{cases}$$

Lagrangian, with $\alpha \ge 0$:
$$L(f, b, \alpha) = \frac{1}{2}\|f\|_{\mathcal{H}}^2 - \sum_{i=1}^n \alpha_i \big( y_i (f(x_i) + b) - 1 \big)$$

optimality condition: $\nabla_f L(f, b, \alpha) = 0 \;\Longrightarrow\; f(x) = \displaystyle\sum_{i=1}^n \alpha_i y_i k(x_i, x)$

Eliminate $f$ from $L$:
$$\begin{cases} \|f\|_{\mathcal{H}}^2 = \displaystyle\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j) \\[1ex] \displaystyle\sum_{i=1}^n \alpha_i y_i f(x_i) = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j) \end{cases}$$

$$Q(b, \alpha) = -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^n \alpha_i \big( y_i b - 1 \big)$$
Dual formulation for SVM

the intermediate function:
$$Q(b, \alpha) = -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j) - b \Big( \sum_{i=1}^n \alpha_i y_i \Big) + \sum_{i=1}^n \alpha_i$$
$$\max_\alpha \min_b Q(b, \alpha)$$

$b$ can be seen as the Lagrange multiplier of the following (balanced) constraint, $\sum_{i=1}^n \alpha_i y_i = 0$, which is also the KKT optimality condition on $b$.

Dual formulation:
$$\begin{cases} \displaystyle\max_{\alpha \in \mathbb{R}^n}\; -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_{i=1}^n \alpha_i \\ \text{such that } \displaystyle\sum_{i=1}^n \alpha_i y_i = 0 \\ \text{and } 0 \le \alpha_i,\; i = 1, n \end{cases}$$
SVM dual formulation

Dual formulation:
$$\begin{cases} \displaystyle\max_{\alpha \in \mathbb{R}^n}\; -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_{i=1}^n \alpha_i \\ \text{with } \displaystyle\sum_{i=1}^n \alpha_i y_i = 0 \;\text{ and }\; 0 \le \alpha_i,\; i = 1, n \end{cases}$$

The dual formulation gives a quadratic program (QP):
$$\begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^n}\; \frac{1}{2} \alpha^\top G \alpha - \mathbf{1}^\top \alpha \\ \text{with } \alpha^\top y = 0 \text{ and } 0 \le \alpha \end{cases} \qquad \text{with } G_{ij} = y_i y_j k(x_i, x_j)$$

With the linear kernel, $f(x) = \sum_{i=1}^n \alpha_i y_i (x^\top x_i) = \sum_{j=1}^d \beta_j x_j$: when $d$ is small with respect to $n$, the primal may be interesting.
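A minimal Matlab sketch of the kernelized machine on toy data, with the Gaussian kernel. The box constraint C anticipates the C-SVM of the next slide; pass [] as the upper bound for the hard-margin case above:

```matlab
X = [randn(30,2)+1.5; randn(30,2)-1.5];  y = [ones(30,1); -ones(30,1)];
n = size(X,1);  b_kern = 1;  C = 10;

sqd = sum(X.^2,2) + sum(X.^2,2)' - 2*(X*X');
K = exp(-sqd / b_kern);                 % Gaussian Gram matrix
G = (y*y') .* K;                        % G_ij = y_i y_j k(x_i, x_j)

alpha = quadprog(G, -ones(n,1), [], [], y', 0, zeros(n,1), C*ones(n,1));

sv = find(alpha > 1e-6 & alpha < C - 1e-6);   % unbounded support vectors
b  = mean(y(sv) - K(sv,:)*(alpha.*y));        % from y_i (f(x_i) + b) = 1
f  = @(x) exp(-(sum(x.^2,2) + sum(X.^2,2)' - 2*(x*X'))/b_kern)*(alpha.*y) + b;
ypred = sign(f([0 0]))
```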
The general case: C-SVM

Primal formulation:
$$(P) \quad \begin{cases} \displaystyle\min_{f \in \mathcal{H},\, b,\, \xi \in \mathbb{R}^n}\; \frac{1}{2}\|f\|^2 + \frac{C}{p} \sum_{i=1}^n \xi_i^p \\ \text{such that } y_i \big( f(x_i) + b \big) \ge 1 - \xi_i, \;\; \xi_i \ge 0,\; i = 1, n \end{cases}$$
$C$ is the regularization path parameter (to be tuned).

$p = 1$, $L_1$ SVM:
$$\begin{cases} \displaystyle\max_{\alpha \in \mathbb{R}^n}\; -\frac{1}{2} \alpha^\top G \alpha + \alpha^\top \mathbf{1} \\ \text{such that } \alpha^\top y = 0 \text{ and } 0 \le \alpha_i \le C,\; i = 1, n \end{cases}$$

$p = 2$, $L_2$ SVM:
$$\begin{cases} \displaystyle\max_{\alpha \in \mathbb{R}^n}\; -\frac{1}{2} \alpha^\top \big( G + \tfrac{1}{C} I \big) \alpha + \alpha^\top \mathbf{1} \\ \text{such that } \alpha^\top y = 0 \text{ and } 0 \le \alpha_i,\; i = 1, n \end{cases}$$

the regularization path: the set of solutions $\alpha(C)$ when $C$ varies
Data groups: illustration

$$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i), \qquad D(x) = \mathrm{sign}\big( f(x) + b \big)$$

- useless data (well classified): $\alpha = 0$
- important data (support): $0 < \alpha < C$
- suspicious data: $\alpha = C$

the regularization path: the set of solutions $\alpha(C)$ when $C$ varies
The importance of being support

$$f(x) = \sum_{i=1}^n \alpha_i y_i k(x_i, x)$$

| data point | $\alpha$ | constraint value | set |
|---|---|---|---|
| $x_i$ useless | $\alpha_i = 0$ | $y_i(f(x_i) + b) > 1$ | $I_0$ |
| $x_i$ support | $0 < \alpha_i < C$ | $y_i(f(x_i) + b) = 1$ | $I_\alpha$ |
| $x_i$ suspicious | $\alpha_i = C$ | $y_i(f(x_i) + b) < 1$ | $I_C$ |

Table: when a data point is « support », it lies exactly on the margin.

Here lies the efficiency of the algorithm (and its complexity)!
sparsity: $\alpha_i = 0$
Checker board

2 classes, 500 examples, separable
A separable case

[figure: the learned decision boundary with n = 500 data points, and with n = 5000 data points]
Tuning C and σ (the kernel width): grid search
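A self-contained Matlab sketch of such a grid search, reusing the quadprog dual above with a simple train/validation split (the ranges for C and the bandwidth b are illustrative; cross-validation would follow the same pattern):

```matlab
X = [randn(40,2)+1; randn(40,2)-1];  y = [ones(40,1); -ones(40,1)];
tr = 1:2:80;  va = 2:2:80;                       % train/validation split
gram = @(A,B,b) exp(-(sum(A.^2,2)+sum(B.^2,2)'-2*(A*B'))/b);
opts = optimoptions('quadprog','Display','off');
best = struct('err', inf);
for C = 10.^(-1:2)
  for b = 2.^(-1:2)
    K = gram(X(tr,:), X(tr,:), b);  ytr = y(tr);  n = numel(tr);
    a = quadprog((ytr*ytr').*K, -ones(n,1), [],[], ytr',0, ...
                 zeros(n,1), C*ones(n,1), [], opts);
    sv = a > 1e-6 & a < C-1e-6;
    bias = mean(ytr(sv) - K(sv,:)*(a.*ytr));
    yva = sign(gram(X(va,:), X(tr,:), b)*(a.*ytr) + bias);
    err = mean(yva ~= y(va));
    if err < best.err, best = struct('err',err,'C',C,'b',b); end
  end
end
best
```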
Empirical complexity

[figure: grid of log-log plots comparing CVM, LibSVM and SimpleSVM as a function of training size: training time (cpu seconds, log scale), number of support vectors (log scale) and error rate in % over 2000 unseen points; results for C = 1, C = 1000 and C = 1000000, each with γ = 1 (left) and γ = 0.3 (right)]

G. Loosli et al., JMLR, 2007
Conclusion

Learning as an optimization problem
- use CVX to prototype
- MonQP
- specific parallel and distributed solvers

Universal through kernelization (the dual trick)

Scalability
- sparsity provides scalability
- kernel implies "locality"
- big data limitations: back to the primal (and linear)