Least Squares Support Vector Machines

Johan Suykens
K.U. Leuven ESAT-SCD-SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]
http://www.esat.kuleuven.ac.be/sista/members/suykens.html

NATO-ASI Learning Theory and Practice
Leuven July 2002
http://www.esat.kuleuven.ac.be/sista/natoasi/ltp2002.html
-
Main reference:
J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, in press (ISBN 981-238-151-1)

Related software LS-SVMlab (Matlab/C toolbox):
http://www.esat.kuleuven.ac.be/sista/lssvmlab/
(software + publications)

... with thanks to Kristiaan Pelckmans, Bart Hamers, Lukas, Luc Hoegaerts, Chuan Lu, Lieveke Ameye, Sabine Van Huffel, Gert Lanckriet, Tijl De Bie and many others
-
Interdisciplinary challenges
(Diagram: LS-SVM at the crossroads of linear algebra, mathematics, statistics, systems and control, signal processing, optimization, machine learning, pattern recognition, data mining and neural networks.)

One of the original dreams in the neural networks area is to make a universal class of models (such as MLPs) generally applicable to a wide range of applications.

Presently, standard SVMs are mainly available only for classification, function estimation and density estimation. However, several extensions are possible in terms of least squares and equality constraints (and exploiting primal-dual interpretations).
-
Overview of this talk
LS-SVM for classification and link with kernel Fisher discriminant analysis
Links to regularization networks and Gaussian processes
Bayesian inference for LS-SVMs
Sparseness and robustness
Large scale methods: Fixed Size LS-SVM
Extensions to committee networks
New formulations of kernel PCA, kernel CCA, kernel PLS
Extensions of LS-SVM to recurrent networks and optimal control
-
Vapnik's SVM classifier

Given a training set $\{x_k, y_k\}_{k=1}^N$ with input patterns $x_k \in \mathbb{R}^n$ and class labels $y_k \in \{-1, +1\}$.

Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

with $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_h}$ a mapping to a high dimensional feature space (which can be infinite dimensional).

For separable data, assume
$$w^T \varphi(x_k) + b \ge +1 \;\; \text{if } y_k = +1, \qquad w^T \varphi(x_k) + b \le -1 \;\; \text{if } y_k = -1$$
which is equivalent to
$$y_k [w^T \varphi(x_k) + b] \ge 1, \quad k = 1, \ldots, N.$$

Optimization problem (non-separable case):
$$\min_{w,\xi} \; \mathcal{J}(w, \xi) = \frac{1}{2} w^T w + c \sum_{k=1}^N \xi_k$$
subject to
$$y_k [w^T \varphi(x_k) + b] \ge 1 - \xi_k, \quad \xi_k \ge 0, \quad k = 1, \ldots, N.$$
-
Construct the Lagrangian:
$$\mathcal{L}(w, b, \xi; \alpha, \nu) = \mathcal{J}(w, \xi) - \sum_{k=1}^N \alpha_k \{ y_k [w^T \varphi(x_k) + b] - 1 + \xi_k \} - \sum_{k=1}^N \nu_k \xi_k$$
with Lagrange multipliers $\alpha_k \ge 0$, $\nu_k \ge 0$ ($k = 1, \ldots, N$). The solution is given by the saddle point of the Lagrangian:
$$\max_{\alpha, \nu} \; \min_{w, b, \xi} \; \mathcal{L}(w, b, \xi; \alpha, \nu)$$

One obtains
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^N \alpha_k y_k \varphi(x_k)$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^N \alpha_k y_k = 0$$
$$\frac{\partial \mathcal{L}}{\partial \xi_k} = 0 \;\rightarrow\; 0 \le \alpha_k \le c, \quad k = 1, \ldots, N$$

Quadratic programming problem (dual problem):
$$\max_{\alpha} \; Q(\alpha) = -\frac{1}{2} \sum_{k,l=1}^N y_k y_l K(x_k, x_l)\, \alpha_k \alpha_l + \sum_{k=1}^N \alpha_k$$
such that
$$\sum_{k=1}^N \alpha_k y_k = 0, \qquad 0 \le \alpha_k \le c, \quad k = 1, \ldots, N.$$
-
Note: $w$ and $\varphi(x_k)$ are never calculated explicitly.

Mercer condition:
$$K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$$

Obtained classifier:
$$y(x) = \mathrm{sign}\Big[ \sum_{k=1}^N \alpha_k y_k K(x, x_k) + b \Big]$$
with $\alpha_k$ positive real constants and $b$ a real constant, which follow as the solution of the QP problem.

Non-zero $\alpha_k$ are called support values and the corresponding data points are called support vectors.

The bias term $b$ follows from the KKT conditions.

Some possible kernels $K(\cdot, \cdot)$:
$K(x, x_k) = x_k^T x$ (linear SVM)
$K(x, x_k) = (x_k^T x + 1)^d$ (polynomial SVM of degree $d$)
$K(x, x_k) = \exp\{-\|x - x_k\|_2^2 / \sigma^2\}$ (RBF SVM)
$K(x, x_k) = \tanh(\kappa\, x_k^T x + \theta)$ (MLP SVM)

In the case of the RBF and MLP kernel, the number of hidden units corresponds to the number of support vectors.
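As a concrete illustration of these kernels, here is a minimal numpy sketch that evaluates them on a pair of data matrices (function and parameter names are my own, not taken from LS-SVMlab):

```python
import numpy as np

def kernel_matrix(X1, X2, kind="rbf", sigma=1.0, degree=3, kappa=1.0, theta=0.0):
    """Evaluate K(x, z) for all pairs of rows of X1 (N1 x n) and X2 (N2 x n)."""
    if kind == "linear":
        return X1 @ X2.T
    if kind == "poly":
        return (X1 @ X2.T + 1.0) ** degree
    if kind == "rbf":
        sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
        return np.exp(-sq / sigma**2)
    if kind == "mlp":
        return np.tanh(kappa * (X1 @ X2.T) + theta)
    raise ValueError(kind)

# example: kernel matrix of 5 random points in R^2
X = np.random.randn(5, 2)
print(kernel_matrix(X, X, kind="rbf", sigma=1.0))
```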
-
Feature space and kernel trick
(Figure: data that are not separable in the input space become separable in the feature space after the mapping $\varphi(x)$; inner products in feature space are computed through the kernel, $\varphi(x)^T \varphi(z) = K(x, z)$.)
-
Primal-dual interpretations of SVMs
Primal problem P (parametric): estimate $w \in \mathbb{R}^{n_h}$
$$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$$

Dual problem D (non-parametric): estimate $\alpha \in \mathbb{R}^N$
$$y(x) = \mathrm{sign}\Big[ \sum_{k=1}^{\#sv} \alpha_k y_k K(x, x_k) + b \Big]$$

Kernel trick: $K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$

(Figure: the primal model is a network with hidden features $\varphi_1(x), \ldots, \varphi_{n_h}(x)$ and weights $w_1, \ldots, w_{n_h}$; the dual model is a network with hidden units $K(x, x_1), \ldots, K(x, x_{\#sv})$ and weights $\alpha_1, \ldots, \alpha_{\#sv}$.)
-
Least Squares SVM classifiers
LS-SVM classifiers (Suykens, 1999): close to Vapnik's SVM formulation but solves a linear system instead of a QP problem.

Optimization problem:
$$\min_{w,b,e} \; \mathcal{J}(w, b, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^N e_k^2$$
subject to the equality constraints
$$y_k [w^T \varphi(x_k) + b] = 1 - e_k, \quad k = 1, \ldots, N.$$

Lagrangian:
$$\mathcal{L}(w, b, e; \alpha) = \mathcal{J}(w, b, e) - \sum_{k=1}^N \alpha_k \{ y_k [w^T \varphi(x_k) + b] - 1 + e_k \}$$
where $\alpha_k$ are Lagrange multipliers.

Conditions for optimality:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^N \alpha_k y_k \varphi(x_k)$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^N \alpha_k y_k = 0$$
$$\frac{\partial \mathcal{L}}{\partial e_k} = 0 \;\rightarrow\; \alpha_k = \gamma e_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_k} = 0 \;\rightarrow\; y_k [w^T \varphi(x_k) + b] - 1 + e_k = 0, \quad k = 1, \ldots, N$$
-
Set of linear equations (instead of QP):
$$\begin{bmatrix} I & 0 & 0 & -Z^T \\ 0 & 0 & 0 & -Y^T \\ 0 & 0 & \gamma I & -I \\ Z & Y & I & 0 \end{bmatrix} \begin{bmatrix} w \\ b \\ e \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vec{1} \end{bmatrix}$$
with
$$Z = [\varphi(x_1)^T y_1; \ldots; \varphi(x_N)^T y_N], \quad Y = [y_1; \ldots; y_N], \quad \vec{1} = [1; \ldots; 1], \quad e = [e_1; \ldots; e_N], \quad \alpha = [\alpha_1; \ldots; \alpha_N].$$

After elimination of $w, e$ one obtains
$$\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}$$
where $\Omega = Z Z^T$ and Mercer's condition is applied:
$$\Omega_{kl} = y_k y_l\, \varphi(x_k)^T \varphi(x_l) = y_k y_l K(x_k, x_l).$$

Related work: Saunders (1998), Smola & Scholkopf (1998), Cristianini & Shawe-Taylor (2000)
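A minimal numpy sketch of this classifier with an RBF kernel: build $\Omega$, solve the $(N+1) \times (N+1)$ linear system for $(b, \alpha)$ and evaluate the sign of the dual model (all names are illustrative; this is not LS-SVMlab code):

```python
import numpy as np

def lssvm_classifier_train(X, y, gamma=10.0, sigma=1.0):
    """Solve [0  Y^T; Y  Omega + I/gamma] [b; alpha] = [0; 1]."""
    N = X.shape[0]
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-sq / sigma**2)
    Omega = np.outer(y, y) * K                      # Omega_kl = y_k y_l K(x_k, x_l)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(N))))
    return sol[0], sol[1:]                          # b, alpha

def lssvm_classifier_predict(Xtr, y, b, alpha, Xte, sigma=1.0):
    sq = np.sum(Xte**2, 1)[:, None] + np.sum(Xtr**2, 1)[None, :] - 2 * Xte @ Xtr.T
    K = np.exp(-sq / sigma**2)
    return np.sign(K @ (alpha * y) + b)             # sign(sum_k alpha_k y_k K(x, x_k) + b)

# toy example: two Gaussian clouds
X = np.vstack([np.random.randn(20, 2) - 2, np.random.randn(20, 2) + 2])
y = np.concatenate([-np.ones(20), np.ones(20)])
b, alpha = lssvm_classifier_train(X, y)
print(np.mean(lssvm_classifier_predict(X, y, b, alpha, X) == y))
```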
-
Large scale LS-SVMs
For the binary class case: the problem involves a matrix of size $(N+1) \times (N+1)$.
Large data sets: iterative methods are needed (CG, SOR, ...).

Problem: solving $Ax = B$ with $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^n$ by CG requires that $A$ is symmetric positive definite.

Represent the original problem of the form
$$\begin{bmatrix} 0 & Y^T \\ Y & H \end{bmatrix} \begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \end{bmatrix}$$
with $H = \Omega + \gamma^{-1} I$, $\xi_1 = b$, $\xi_2 = \alpha$, $d_1 = 0$, $d_2 = \vec{1}$, as
$$\begin{bmatrix} s & 0 \\ 0 & H \end{bmatrix} \begin{bmatrix} \xi_1 \\ \xi_2 + H^{-1} Y \xi_1 \end{bmatrix} = \begin{bmatrix} -d_1 + Y^T H^{-1} d_2 \\ d_2 \end{bmatrix}$$
with $s = Y^T H^{-1} Y > 0$ (since $H = H^T > 0$).

Iterative methods can be applied to the latter problem.
Convergence of CG depends on the condition number (hence it depends on $\gamma$, $\sigma$).
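A sketch of one way to exploit this: since $H = \Omega + \gamma^{-1} I$ is symmetric positive definite, two conjugate gradient solves with $H$ suffice, after which $b$ and $\alpha$ follow by elimination (the two-solve arrangement below is my own reading of the slide's reformulation):

```python
import numpy as np
from scipy.sparse.linalg import cg

def lssvm_solve_cg(K, y, gamma=10.0):
    """Solve the LS-SVM classifier system via two SPD solves with conjugate gradients."""
    N = K.shape[0]
    H = np.outer(y, y) * K + np.eye(N) / gamma
    eta, _ = cg(H, y)                # H eta = Y
    nu, _ = cg(H, np.ones(N))        # H nu  = 1
    s = y @ eta                      # s = Y^T H^{-1} Y > 0
    b = (y @ nu) / s                 # from Y^T alpha = 0
    alpha = nu - eta * b             # alpha = H^{-1}(1 - Y b)
    return b, alpha

# toy usage with an RBF kernel matrix and labels from two clouds
X = np.vstack([np.random.randn(20, 2) - 2, np.random.randn(20, 2) + 2])
y = np.concatenate([-np.ones(20), np.ones(20)])
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
b, alpha = lssvm_solve_cg(np.exp(-sq), y)
```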
-
Fisher Discriminant Analysis
Projection of the data: $z = f(x) = w^T x + b$

Optimize the Rayleigh quotient:
$$\max_{w,b} \; \mathcal{J}_{FD}(w, b) = \frac{w^T \Sigma_B w}{w^T \Sigma_W w}$$
with $\Sigma_B$ the between-class and $\Sigma_W$ the within-class scatter matrix.

(Figure: two classes projected onto the directions $w_1^T x + b_1$ and $w_2^T x + b_2$.)
-
A two spiral classification problem
(Figure: two intertwined spirals in the region $[-40, 40] \times [-40, 40]$.)
Difficult problem for MLPs, not for SVM or LS-SVM with RBF
kernel (but easy in the sense that the two classes are separable)
-
Benchmarking LS-SVM classifiers
acr bld gcr hea ion pid snr ttt wbc adu
N_CV   460 230 666 180 234 512 138 638 455 33000
N_test 230 115 334 90 117 256 70 320 228 12222
N      690 345 1000 270 351 768 208 958 683 45222
n_num  6 6 7 7 33 8 60 0 9 6
n_cat  8 0 13 6 0 0 0 9 0 8
n      14 6 20 13 33 8 60 9 9 14

bal cmc ims iri led thy usp veh wav win
N_CV   416 982 1540 100 2000 4800 6000 564 2400 118
N_test 209 491 770 50 1000 2400 3298 282 1200 60
N      625 1473 2310 150 3000 7200 9298 846 3600 178
n_num  4 2 18 4 0 6 256 18 19 13
n_cat  0 7 0 0 7 15 0 0 0 0
n      4 9 18 4 7 21 256 18 19 13
M      3 3 7 3 10 3 10 4 3 3
n_y,MOC  2 2 3 2 4 2 4 2 2 2
n_y,1vs1 3 3 21 3 45 3 45 6 2 3
(Van Gestel et al., Machine Learning, in press)
-
Successful LS-SVM applications
20 UCI benchmark data sets (binary and multiclass problems)
Ovarian cancer classification
Classification of brain tumours from magnetic resonance spectroscopy signals
Prediction of mental development of preterm newborns
Prediction of air pollution and chaotic time series
Marketing and financial engineering studies
Modelling the Belgian electricity consumption
Softsensor modelling in the chemical process industry
-
acr bld gcr hea ion pid snr ttt wbc adu
N_test 230 115 334 90 117 256 70 320 228 12222
n 14 6 20 13 33 8 60 9 9 14
RBF LS-SVM 87.0(2.1) 70.2(4.1) 76.3(1.4) 84.7(4.8) 96.0(2.1) 76.8(1.7) 73.1(4.2) 99.0(0.3) 96.4(1.0) 84.7(0.3)
RBF LS-SVMF 86.4(1.9) 65.1(2.9) 70.8(2.4) 83.2(5.0) 93.4(2.7) 72.9(2.0) 73.6(4.6) 97.9(0.7) 96.8(0.7) 77.6(1.3)
Lin LS-SVM 86.8(2.2) 65.6(3.2) 75.4(2.3) 84.9(4.5) 87.9(2.0) 76.8(1.8) 72.6(3.7) 66.8(3.9) 95.8(1.0) 81.8(0.3)
Lin LS-SVMF 86.5(2.1) 61.8(3.3) 68.6(2.3) 82.8(4.4) 85.0(3.5) 73.1(1.7) 73.3(3.4) 57.6(1.9) 96.9(0.7) 71.3(0.3)
Pol LS-SVM 86.5(2.2) 70.4(3.7) 76.3(1.4) 83.7(3.9) 91.0(2.5) 77.0(1.8) 76.9(4.7) 99.5(0.5) 96.4(0.9) 84.6(0.3)
Pol LS-SVMF 86.6(2.2) 65.3(2.9) 70.3(2.3) 82.4(4.6) 91.7(2.6) 73.0(1.8) 77.3(2.6) 98.1(0.8) 96.9(0.7) 77.9(0.2)
RBF SVM 86.3(1.8) 70.4(3.2) 75.9(1.4) 84.7(4.8) 95.4(1.7) 77.3(2.2) 75.0(6.6) 98.6(0.5) 96.4(1.0) 84.4(0.3)
Lin SVM 86.7(2.4) 67.7(2.6) 75.4(1.7) 83.2(4.2) 87.1(3.4) 77.0(2.4) 74.1(4.2) 66.2(3.6) 96.3(1.0) 83.9(0.2)
LDA 85.9(2.2) 65.4(3.2) 75.9(2.0) 83.9(4.3) 87.1(2.3) 76.7(2.0) 67.9(4.9) 68.0(3.0) 95.6(1.1) 82.2(0.3)
QDA 80.1(1.9) 62.2(3.6) 72.5(1.4) 78.4(4.0) 90.6(2.2) 74.2(3.3) 53.6(7.4) 75.1(4.0) 94.5(0.6) 80.7(0.3)
Logit 86.8(2.4) 66.3(3.1) 76.3(2.1) 82.9(4.0) 86.2(3.5) 77.2(1.8) 68.4(5.2) 68.3(2.9) 96.1(1.0) 83.7(0.2)
C4.5 85.5(2.1) 63.1(3.8) 71.4(2.0) 78.0(4.2) 90.6(2.2) 73.5(3.0) 72.1(2.5) 84.2(1.6) 94.7(1.0) 85.6(0.3)
oneR 85.4(2.1) 56.3(4.4) 66.0(3.0) 71.7(3.6) 83.6(4.8) 71.3(2.7) 62.6(5.5) 70.7(1.5) 91.8(1.4) 80.4(0.3)
IB1 81.1(1.9) 61.3(6.2) 69.3(2.6) 74.3(4.2) 87.2(2.8) 69.6(2.4) 77.7(4.4) 82.3(3.3) 95.3(1.1) 78.9(0.2)
IB10 86.4(1.3) 60.5(4.4) 72.6(1.7) 80.0(4.3) 85.9(2.5) 73.6(2.4) 69.4(4.3) 94.8(2.0) 96.4(1.2) 82.7(0.3)
NBk 81.4(1.9) 63.7(4.5) 74.7(2.1) 83.9(4.5) 92.1(2.5) 75.5(1.7) 71.6(3.5) 71.7(3.1) 97.1(0.9) 84.8(0.2)
NBn 76.9(1.7) 56.0(6.9) 74.6(2.8) 83.8(4.5) 82.8(3.8) 75.1(2.1) 66.6(3.2) 71.7(3.1) 95.5(0.5) 82.7(0.2)
Maj. Rule 56.2(2.0) 56.5(3.1) 69.7(2.3) 56.3(3.8) 64.4(2.9) 66.8(2.1) 54.4(4.7) 66.2(3.6) 66.2(2.4) 75.3(0.3)
-
LS-SVMs for function estimation
LS-SVM model as a feature space representation:
$$y(x) = w^T \varphi(x) + b$$
with $x \in \mathbb{R}^n$, $y \in \mathbb{R}$.

The nonlinear mapping $\varphi(\cdot)$ is similar to the classifier case.

Given a training set $\{x_k, y_k\}_{k=1}^N$.

Optimization problem:
$$\min_{w,e} \; \mathcal{J}(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^N e_k^2$$
subject to the equality constraints
$$y_k = w^T \varphi(x_k) + b + e_k, \quad k = 1, \ldots, N.$$

This is a form of ridge regression (see Saunders, 1998).

Lagrangian:
$$\mathcal{L}(w, b, e; \alpha) = \mathcal{J}(w, e) - \sum_{k=1}^N \alpha_k \{ w^T \varphi(x_k) + b + e_k - y_k \}$$
with $\alpha_k$ Lagrange multipliers.
-
Conditions for optimality:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^N \alpha_k \varphi(x_k)$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^N \alpha_k = 0$$
$$\frac{\partial \mathcal{L}}{\partial e_k} = 0 \;\rightarrow\; \alpha_k = \gamma e_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_k} = 0 \;\rightarrow\; w^T \varphi(x_k) + b + e_k - y_k = 0, \quad k = 1, \ldots, N$$

Solution:
$$\begin{bmatrix} 0 & \vec{1}^T \\ \vec{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$$
with $y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$, and by applying Mercer's condition
$$\Omega_{kl} = \varphi(x_k)^T \varphi(x_l) = K(x_k, x_l), \quad k, l = 1, \ldots, N.$$

Resulting LS-SVM model for function estimation:
$$y(x) = \sum_{k=1}^N \alpha_k K(x_k, x) + b$$

The solution is mathematically equivalent to regularization networks and Gaussian processes (usually without bias term) (see e.g. Poggio & Girosi (1990)).
-
SVM, RN, GP, LS-SVM, ...
Closely related kernel-based frameworks: Support Vector Machines, LS-SVMs, regularization networks, Gaussian processes, kriging, kernel ridge regression, RKHS.

Some early history:
1910-1920: Moore - RKHS
1940: Aronszajn - systematic development of RKHS
1951: Krige
1970: Parzen - RKHS for statistical problems
1971: Kimeldorf & Wahba

In LS-SVMs, primal-dual interpretations are emphasized (and often exploited).
-
Bayesian inference of LS-SVM models

At each level the posterior is maximized; the evidence of one level plays the role of the likelihood at the next level:

Level 1 ($w, b$): $\; p(w, b \mid D, \mu, \zeta, H) = \dfrac{p(D \mid w, b, \mu, \zeta, H)\; p(w, b \mid \mu, \zeta, H)}{p(D \mid \mu, \zeta, H)}$

Level 2 ($\mu, \zeta$): $\; p(\mu, \zeta \mid D, H) = \dfrac{p(D \mid \mu, \zeta, H)\; p(\mu, \zeta \mid H)}{p(D \mid H)}$

Level 3 ($H$): $\; p(H \mid D) = \dfrac{p(D \mid H)\; p(H)}{p(D)}$
-
Bayesian LS-SVM framework
(Van Gestel, Suykens et al., Neural Computation, 2002)
Level 1 inference: inference of $w, b$
- Probabilistic interpretation of outputs
- Moderated outputs
- Additional correction for prior probabilities and bias term

Level 2 inference: inference of the hyperparameters
- Effective number of parameters ($< N$)
- Eigenvalues of the centered kernel matrix are important

Level 3: model comparison - Occam's razor
- Moderated output with uncertainty on hyperparameters
- Selection of the width of the kernel
- Input selection (ARD at Level 3 instead of Level 2)

Related work: MacKay's Bayesian interpolation for MLPs, GP

Difference with GP:
- bias term contained in the formulation
- classifier case treated as a regression problem
- $\sigma$ of the RBF kernel determined at Level 3
-
Benchmarking results
    n  N    Ntest Ntot  LS-SVM(BayM) LS-SVM(Bay) LS-SVM(CV10) SVM(CV10) GP(Bay)   GPb(Bay)  GP(CV10)
bld 6  230  115   345   69.4(2.9)    69.4(3.1)   69.4(3.4)    69.2(3.5) 69.2(2.7) 68.9(3.3) 69.7(4.0)
cra 6  133  67    200   96.7(1.5)    96.7(1.5)   96.9(1.6)    95.1(3.2) 96.4(2.5) 94.8(3.2) 96.9(2.4)
gcr 20 666  334   1000  73.1(3.8)    73.5(3.9)   75.6(1.8)    74.9(1.7) 76.2(1.4) 75.9(1.7) 75.4(2.0)
hea 13 180  90    270   83.6(5.1)    83.2(5.2)   84.3(5.3)    83.4(4.4) 83.1(5.5) 83.7(4.9) 84.1(5.2)
ion 33 234  117   351   95.6(0.9)    96.2(1.0)   95.6(2.0)    95.4(1.7) 91.0(2.3) 94.4(1.9) 92.4(2.4)
pid 8  512  256   768   77.3(3.1)    77.5(2.8)   77.3(3.0)    76.9(2.9) 77.6(2.9) 77.5(2.7) 77.2(3.0)
rsy 2  250  1000  1250  90.2(0.7)    90.2(0.6)   89.6(1.1)    89.7(0.8) 90.2(0.7) 90.1(0.8) 89.9(0.8)
snr 60 138  70    208   76.7(5.6)    78.0(5.2)   77.9(4.2)    76.3(5.3) 78.6(4.9) 75.7(6.1) 76.6(7.2)
tit 3  1467 734   2201  78.8(1.1)    78.7(1.1)   78.7(1.1)    78.7(1.1) 78.5(1.0) 77.2(1.9) 78.7(1.2)
wbc 9  455  228   683   95.9(0.6)    95.7(0.5)   96.2(0.7)    96.2(0.8) 95.8(0.7) 93.7(2.0) 96.5(0.7)
AP                      83.7         83.9        84.1         83.6      83.7      83.2      83.7
AR                      2.3          2.5         2.5          3.8       3.2       4.2       2.6
PST                     1.000        0.754       1.000        0.344     0.754     0.344     0.508
-
Bayesian inference of LS-SVMs: example Ripley data set

(Figures: posterior class probability of the Bayesian LS-SVM on the Ripley data set, shown as contour lines 0.1 to 0.9 in the $(x_1, x_2)$ plane and as a surface plot.)
-
Bayesian inference of LS-SVMs: sinc example

(Figures: LS-SVM estimate of the sinc function, and the model evidence $\log(P(D|H))$ as a function of the kernel width.)
-
Sparseness
(Figure: sorted $|\alpha_k|$ versus $k/N$: for a QP-SVM many $\alpha_k$ are exactly zero, whereas for the LS-SVM all $\alpha_k$ are nonzero.)

Lack of sparseness in the LS-SVM case, but ... sparseness can be imposed by applying pruning techniques existing in the neural networks area (e.g. optimal brain damage, optimal brain surgeon, etc.)

(Figure: after pruning the smallest $|\alpha_k|$ one obtains a sparse LS-SVM.)
-
Robustness
(Diagram: a convex cost function leads via convex optimization to the SVM solution; starting from the LS-SVM solution, robust statistics leads to a weighted version with a modified cost function, the weighted LS-SVM.)

Weighted LS-SVM:
$$\min_{w,b,e} \; \mathcal{J}(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^N v_k e_k^2$$
such that $y_k = w^T \varphi(x_k) + b + e_k$, $k = 1, \ldots, N$,
with the weights $v_k$ determined by the distribution of $\{e_k\}_{k=1}^N$ obtained from the unweighted LS-SVM.
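A sketch of this two-pass procedure for the regression case; the weight function below is a simple Huber-style choice with a MAD scale estimate, used here only for illustration (the actual weighted LS-SVM prescribes its own weighting based on the error distribution):

```python
import numpy as np

def lssvm_regression(K, y, gamma, v=None):
    """Solve [0 1^T; 1 Omega + D] [b; alpha] = [0; y] with D = diag(1/(gamma v_k))."""
    N = K.shape[0]
    v = np.ones(N) if v is None else v
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.diag(1.0 / (gamma * v))
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]

def weighted_lssvm(K, y, gamma=10.0, c=1.345):
    b, alpha = lssvm_regression(K, y, gamma)          # unweighted first pass
    e = alpha / gamma                                  # e_k = alpha_k / gamma
    s = 1.4826 * np.median(np.abs(e - np.median(e)))   # robust scale (MAD)
    v = np.where(np.abs(e / s) <= c, 1.0, c / np.abs(e / s))  # Huber-style weights
    return lssvm_regression(K, y, gamma, v)            # weighted second pass

# toy usage on a noisy line with a few outliers
X = np.linspace(0, 1, 50)[:, None]
y = 2 * X[:, 0] + 0.05 * np.random.randn(50)
y[::10] += 3.0
K = np.exp(-(X - X.T) ** 2)
b, alpha = weighted_lssvm(K, y)
```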
-
L2 and L1 estimators: score function
(Figures: loss function $L(e)$ and score function $\psi(e)$ for the $L_2$ and the $L_1$ estimator.)

$L_1$ estimator: the influence of outliers is reduced (in robust statistics the score function means the derivative of the loss function).
-
Loss function and score function
Examples of score functions in robust statistics: Huber, Andrews, Hampel, Tukey

(Figures: the four corresponding score functions $\psi(e)$.)

In practice: iterative weighting according to the score function. Note that all of these score functions (except Huber's) lead to non-convex loss functions. A further application of robust statistics is robust cross-validation (De Brabanter et al., 2002), with good performance in terms of the efficiency-robustness trade-off.
-
Large scale methods
Nystrom method (GP)
Fixed size LS-SVM
Committee networks and extensions
-
Nystrom method
Reference in GP: Williams & Seeger (2001)

"Big" kernel matrix: $\Omega_{(N,N)} \in \mathbb{R}^{N \times N}$; "small" kernel matrix: $\Omega_{(M,M)} \in \mathbb{R}^{M \times M}$ (based on a random subsample, in practice often $M \ll N$).

Eigenvalue decomposition of $\Omega_{(M,M)}$:
$$\Omega_{(M,M)} U = U \Lambda$$

The relation to the eigenvalues and eigenfunctions of the integral equation
$$\int K(x, x')\, \phi_i(x)\, p(x)\, dx = \lambda_i \phi_i(x')$$
is given by
$$\hat{\lambda}_i = \frac{1}{M} \lambda_i, \qquad \hat{\phi}_i(x_k) = \sqrt{M}\, u_{ki}, \qquad \hat{\phi}_i(x') = \frac{\sqrt{M}}{\lambda_i} \sum_{k=1}^M u_{ki} K(x_k, x')$$
where $\hat{\lambda}_i$ and $\hat{\phi}_i$ are estimates of $\lambda_i$ and $\phi_i$, respectively, and $u_{ki}$ denotes the $ki$-th entry of the matrix $U$.
-
For the big matrix:
$$\Omega_{(N,N)} \tilde{U} = \tilde{U} \tilde{\Lambda}$$
Furthermore, one has
$$\tilde{\lambda}_i = \frac{N}{M} \lambda_i, \qquad \tilde{u}_i = \sqrt{\frac{M}{N}}\, \frac{1}{\lambda_i}\, \Omega_{(N,M)} u_i$$
One can then show that
$$\Omega_{(N,N)} \simeq \Omega_{(N,M)} \Omega_{(M,M)}^{-1} \Omega_{(M,N)}$$
where $\Omega_{(N,M)}$ is the $N \times M$ block matrix taken from $\Omega_{(N,N)}$.

The solution to the big linear system
$$(\Omega_{(N,N)} + I/\gamma)\, \alpha = y$$
can be written as
$$\alpha = \gamma \left[ y - \tilde{U} \left( \tfrac{1}{\gamma} \tilde{\Lambda}^{-1} + \tilde{U}^T \tilde{U} \right)^{-1} \tilde{U}^T y \right]$$
by applying the Sherman-Morrison-Woodbury formula.

Some numerical difficulties, as pointed out by Fine & Scheinberg (2001).
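A numpy sketch of this shortcut for a system without bias term, $(\Omega_{(N,N)} + I/\gamma)\alpha = y$: eigendecompose the small matrix, form the approximate eigenpairs of the big matrix and apply the Woodbury identity (the sampling and the eigenvalue clean-up are my own choices):

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / sigma**2)

def nystrom_solve(X, y, M=50, gamma=10.0, sigma=1.0, rng=np.random):
    N = X.shape[0]
    idx = rng.choice(N, M, replace=False)              # random subsample of M points
    K_MM = rbf(X[idx], X[idx], sigma)
    K_NM = rbf(X, X[idx], sigma)
    lam, U = np.linalg.eigh(K_MM)                      # eigendecomposition of the small matrix
    keep = lam > 1e-10                                 # drop numerically zero eigenvalues
    lam, U = lam[keep], U[:, keep]
    lam_big = (N / M) * lam                            # approximate eigenvalues of the big matrix
    U_big = np.sqrt(M / N) * (K_NM @ U) / lam          # approximate eigenvectors of the big matrix
    # Woodbury: alpha = gamma * (y - U_big ((1/gamma) Lam^{-1} + U_big^T U_big)^{-1} U_big^T y)
    inner = np.diag(1.0 / (gamma * lam_big)) + U_big.T @ U_big
    alpha = gamma * (y - U_big @ np.linalg.solve(inner, U_big.T @ y))
    return alpha, idx
```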
-
Computation in primal or dual space ?
Primal space: preferable for large data sets.
Dual space: preferable for large dimensional inputs.
-
Fixed Size LS-SVM
Related work on basis construction in feature space: Cawley (2002), Csato & Opper (2002), Smola & Scholkopf (2002), Baudat & Anouar (2001).

Model in primal space:
$$\min_{w \in \mathbb{R}^{n_h},\, b \in \mathbb{R}} \; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^N \big( y_k - (w^T \varphi(x_k) + b) \big)^2.$$

Observation: for a linear model it is computationally better to solve the primal problem (one knows that $\varphi(x_k) = x_k$).

Can we do this for the nonlinear case too?

Employ the Nystrom method, assuming a fixed size $M$:
$$\hat{\varphi}_i(x') = \frac{\sqrt{M}}{\lambda_i} \sum_{k=1}^M u_{ki} K(x_k, x').$$

The model then becomes
$$y(x) = w^T \hat{\varphi}(x) + b = \sum_{i=1}^M w_i \frac{\sqrt{M}}{\lambda_i} \sum_{k=1}^M u_{ki} K(x_k, x) + b.$$
-
The support values corresponding to the $M$ support vectors equal
$$\alpha_k = \sum_{i=1}^M w_i \frac{\sqrt{M}}{\lambda_i} u_{ki}$$
when one represents the model as
$$y(x) = \sum_{k=1}^M \alpha_k K(x_k, x).$$

How to select a working set of $M$ support vectors?
Is taking a random subsample the only option?
-
Fixed Size LS-SVM: Selection of SV
Girolami (2002): link between the Nystrom method, kernel PCA, density estimation and entropy criteria.

The quadratic Renyi entropy
$$H_R = -\log \int p(x)^2\, dx$$
has been related to kernel PCA and density estimation with
$$\int \hat{p}(x)^2\, dx = \frac{1}{N^2}\, 1_v^T \Omega\, 1_v$$
where $1_v = [1; 1; \ldots; 1]$ and a normalized kernel is assumed with respect to density estimation.

Fixed Size LS-SVM: take a working set of $M$ support vectors and select vectors according to the entropy criterion (instead of a random subsample as in the Nystrom method).
-
Fixed Size LS-SVM
(Diagram: the Nystrom method, kernel PCA, density estimation, entropy criteria and eigenfunctions link the dual space to the primal space; support vectors are actively selected in the dual and the regression is estimated in the primal.)
-
Fixed Size LS-SVM
(Diagram: a point is randomly selected from the pool of training data; it either enters the fixed size LS-SVM working set (point IN) or a present or new point is returned to the pool (point OUT).)
-
Fixed size LS-SVM algorithm:

1. Given normalized and standardized training data $\{x_k, y_k\}_{k=1}^N$ with inputs $x_k \in \mathbb{R}^n$, outputs $y_k \in \mathbb{R}$ and $N$ training data points.
2. Choose a working set of size $M$ and impose in this way a number of $M$ support vectors (typically $M \ll N$).
3. Randomly select a support vector $x^*$ from the working set of $M$ support vectors.
4. Randomly select a point $x_t$ from the $N$ training data and replace $x^*$ by $x_t$ in the working set. If the entropy increases by taking the point $x_t$ instead of $x^*$, then $x_t$ is accepted for the working set of $M$ support vectors; otherwise $x_t$ is rejected (and returned to the training data pool) and the support vector $x^*$ stays in the working set.
5. Calculate the entropy value for the present working set.
6. Stop if the change in entropy value is small or the number of iterations is exceeded, otherwise go to (3).
7. Estimate $w, b$ in the primal space after estimating the eigenfunctions from the Nystrom approximation.
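A sketch of the selection loop (steps 3 to 6), with the quadratic Renyi entropy of the working set estimated as $-\log(1_v^T \Omega 1_v / M^2)$ on an RBF kernel; the stopping rule is simplified to a fixed number of iterations:

```python
import numpy as np

def renyi_entropy(Xw, sigma=1.0):
    """Quadratic Renyi entropy estimate of a working set: -log( 1^T Omega 1 / M^2 )."""
    sq = np.sum(Xw**2, 1)[:, None] + np.sum(Xw**2, 1)[None, :] - 2 * Xw @ Xw.T
    Omega = np.exp(-sq / sigma**2)
    M = Xw.shape[0]
    return -np.log(np.sum(Omega) / M**2)

def select_working_set(X, M=50, n_iter=2000, sigma=1.0, rng=np.random.default_rng(0)):
    N = X.shape[0]
    work = list(rng.choice(N, M, replace=False))     # step 2: initial working set
    H = renyi_entropy(X[work], sigma)
    for _ in range(n_iter):                          # steps 3-6
        i = rng.integers(M)                          # random support vector in the set
        t = rng.integers(N)                          # random candidate from the pool
        if t in work:
            continue
        old = work[i]
        work[i] = t
        H_new = renyi_entropy(X[work], sigma)
        if H_new > H:                                # accept the swap if the entropy increases
            H = H_new
        else:
            work[i] = old                            # otherwise keep the old support vector
    return work
```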
-
Fixed Size LS-SVM: sinc example
(Figures: sinc example, four snapshots on the interval $[-10, 10]$ during the support vector selection.)
demo in LS-SVMlab
-
(Figures: entropy and test set error as a function of the iteration step, on a logarithmic scale.)
-
Fixed Size LS-SVM: sinc example
(Figures: fixed size LS-SVM estimate of the sinc function on $[-10, 10]$.)
Sinc function with 20000 data points
-
Fixed Size LS-SVM: spiral example
(Figures: spiral example, four snapshots of the selected support vectors in the region $[-2.5, 2.5] \times [-2.5, 2.5]$.)
-
Fixed Size LS-SVM: spiral example
(Figure: fixed size LS-SVM result on the spiral example in the $(x_1, x_2)$ plane.)
-
Committee network of LS-SVMs
(Diagram: the data set is partitioned into subsets $D_1, D_2, \ldots, D_m$; an LS-SVM is trained on each subset and, for a given input $x$, the submodel outputs are combined into the committee output $f_c(x)$.)
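A sketch of such a committee for regression: split the data into $m$ blocks, train one LS-SVM per block and average the submodel outputs (equal weights are used here purely for illustration):

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / sigma**2)

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return X, sol[0], sol[1:]                       # support data, b, alpha

def lssvm_predict(model, Xte, sigma=1.0):
    Xtr, b, alpha = model
    return rbf(Xte, Xtr, sigma) @ alpha + b

def committee_predict(models, Xte, sigma=1.0):
    # equal-weight average of the submodel outputs
    return np.mean([lssvm_predict(m, Xte, sigma) for m in models], axis=0)

# toy usage on the sinc function with 4 submodels
X = np.sort(np.random.uniform(-10, 10, 2000))[:, None]
y = np.sinc(X[:, 0] / np.pi) + 0.1 * np.random.randn(2000)   # sin(x)/x plus noise
models = [lssvm_fit(Xs, ys) for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]
yhat = committee_predict(models, X)
```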
-
Committee network of LS-SVMs
sinc function with 4000 training data
(Figures: committee network estimate of the sinc function.)
-
Results of the individual LS-SVM models
(Figures: estimates of the individual LS-SVM submodels on the sinc function.)
-
Nonlinear combination of LS-SVMs
(Diagram: for a given input $x$, the outputs of the LS-SVMs trained on $D_1, D_2, \ldots, D_m$ are combined through a nonlinear function $g$.)

This results in a multilayer network (layers of (LS-)SVMs or e.g. an MLP + LS-SVM combination).
-
Committee network of LS-SVMs
Linear versus nonlinear combinations of trained LS-SVM submodels under heavy Gaussian noise

(Figures: noisy sinc data and the resulting estimates of the linear and the nonlinear combination of the submodels.)
-
Classical PCA formulation
Given data $\{x_k\}_{k=1}^N$ with $x_k \in \mathbb{R}^n$.

Find projected variables $w^T x_k$ with maximal variance:
$$\max_w \; \mathrm{Var}(w^T x) = \mathrm{Cov}(w^T x, w^T x) \simeq \frac{1}{N} \sum_{k=1}^N (w^T x_k)^2 = w^T C w$$
where $C = (1/N) \sum_{k=1}^N x_k x_k^T$.

Consider the constraint $w^T w = 1$.

Constrained optimization:
$$\mathcal{L}(w; \lambda) = \frac{1}{2} w^T C w - \lambda (w^T w - 1)$$
with Lagrange multiplier $\lambda$.

Eigenvalue problem
$$C w = \lambda w$$
with $C = C^T \ge 0$, obtained from $\partial \mathcal{L}/\partial w = 0$, $\partial \mathcal{L}/\partial \lambda = 0$.
-
PCA analysis as a one-class modelling problem

One-class with target value zero:
$$\max_w \sum_{k=1}^N (0 - w^T x_k)^2$$

Score variables: $z = w^T x$

(Illustration: the data points in the input space are mapped by $w^T x$ to a target space centered at 0.)
-
LS-SVM interpretation to FDA

(Illustration: the two classes are mapped by $w^T x + b$ onto the targets $-1$ and $+1$; minimize the within-class scatter around these targets.)

LS-SVM interpretation to PCA

(Illustration: the data are mapped by $w^T(x - \hat{x})$ to a target space centered at 0; find the direction with maximal variance.)
-
SVM formulation to linear PCA
Primal problem:
$$P: \quad \max_{w,e} \; \mathcal{J}_P(w, e) = \gamma \frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w \quad \text{such that } e_k = w^T x_k, \; k = 1, \ldots, N$$

Lagrangian:
$$\mathcal{L}(w, e; \alpha) = \gamma \frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w - \sum_{k=1}^N \alpha_k \left( e_k - w^T x_k \right)$$

Conditions for optimality:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^N \alpha_k x_k$$
$$\frac{\partial \mathcal{L}}{\partial e_k} = 0 \;\rightarrow\; \alpha_k = \gamma e_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_k} = 0 \;\rightarrow\; e_k - w^T x_k = 0, \quad k = 1, \ldots, N$$
-
Elimination of the variables $e, w$ gives
$$\frac{1}{\gamma} \alpha_k - \sum_{l=1}^N \alpha_l\, x_l^T x_k = 0, \quad k = 1, \ldots, N$$
and after defining $\lambda = 1/\gamma$ one obtains the eigenvalue problem
$$D: \quad \text{solve in } \alpha: \quad \begin{bmatrix} x_1^T x_1 & \ldots & x_1^T x_N \\ \vdots & & \vdots \\ x_N^T x_1 & \ldots & x_N^T x_N \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix} = \lambda \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix}$$
as the dual problem (quantization in terms of $\lambda = 1/\gamma$).

Score variables become
$$z(x) = w^T x = \sum_{l=1}^N \alpha_l\, x_l^T x$$

Optimal solution corresponding to the largest eigenvalue:
$$\sum_{k=1}^N (w^T x_k)^2 = \sum_{k=1}^N e_k^2 = \sum_{k=1}^N \frac{1}{\gamma^2} \alpha_k^2 = \lambda_{\max}^2$$
where $\sum_{k=1}^N \alpha_k^2 = 1$ for the normalized eigenvector.

Many data points: better to solve the primal problem.
Many inputs: better to solve the dual problem.
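A small numpy check of this primal/dual correspondence: the leading eigenvector of the $n \times n$ covariance-type matrix (primal) and of the $N \times N$ Gram matrix (dual) give the same score variables up to scaling and sign:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.3])   # N = 200 data points in R^3
X = X - X.mean(0)

# primal: eigenvectors of C = (1/N) sum_k x_k x_k^T   (n x n problem)
C = X.T @ X / len(X)
lam_p, W = np.linalg.eigh(C)
w = W[:, -1]                                  # direction of maximal variance

# dual: eigenvectors of the Gram matrix [x_k^T x_l]   (N x N problem)
G = X @ X.T
lam_d, A = np.linalg.eigh(G)
alpha = A[:, -1]

print(lam_d[-1], len(X) * lam_p[-1])          # same nonzero spectrum up to the 1/N factor
z_primal = X @ w                              # z(x) = w^T x
z_dual = X @ (X.T @ alpha)                    # z(x) = sum_l alpha_l x_l^T x
print(np.allclose(np.abs(z_primal / np.linalg.norm(z_primal)),
                  np.abs(z_dual / np.linalg.norm(z_dual))))
```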
-
SVM formulation to PCA analysis witha bias term
Usually: apply PCA analysis to centered data and consider
$$\max_w \sum_{k=1}^N [w^T (x_k - \hat{x})]^2$$
where $\hat{x} = \frac{1}{N} \sum_{k=1}^N x_k$.

Bias term formulation: score variables
$$z(x) = w^T x + b$$
and objective
$$\max_{w,b} \sum_{k=1}^N [0 - (w^T x_k + b)]^2$$
-
Primal optimization problem:
$$P: \quad \max_{w,e} \; \mathcal{J}_P(w, e) = \gamma \frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w \quad \text{such that } e_k = w^T x_k + b, \; k = 1, \ldots, N$$

Lagrangian:
$$\mathcal{L}(w, e; \alpha) = \gamma \frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w - \sum_{k=1}^N \alpha_k \left( e_k - w^T x_k - b \right)$$

Conditions for optimality:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^N \alpha_k x_k$$
$$\frac{\partial \mathcal{L}}{\partial e_k} = 0 \;\rightarrow\; \alpha_k = \gamma e_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \;\rightarrow\; \sum_{k=1}^N \alpha_k = 0$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_k} = 0 \;\rightarrow\; e_k - w^T x_k - b = 0, \quad k = 1, \ldots, N$$

Applying $\sum_{k=1}^N \alpha_k = 0$ yields
$$b = -\frac{1}{N} \sum_{k=1}^N \sum_{l=1}^N \alpha_l\, x_l^T x_k.$$
-
By defining $\lambda = 1/\gamma$ one obtains the dual problem
$$D: \quad \text{solve in } \alpha: \quad \begin{bmatrix} (x_1 - \hat{x})^T (x_1 - \hat{x}) & \ldots & (x_1 - \hat{x})^T (x_N - \hat{x}) \\ \vdots & & \vdots \\ (x_N - \hat{x})^T (x_1 - \hat{x}) & \ldots & (x_N - \hat{x})^T (x_N - \hat{x}) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix} = \lambda \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix}$$
which is an eigenvalue decomposition of the centered Gram matrix
$$\Omega_c \alpha = \lambda \alpha$$
with $\Omega_c = M_c \Omega M_c$ where $M_c = I - 1_v 1_v^T / N$, $1_v = [1; 1; \ldots; 1]$ and $\Omega_{kl} = x_k^T x_l$ for $k, l = 1, \ldots, N$.

Score variables:
$$z(x) = w^T x + b = \sum_{l=1}^N \alpha_l\, x_l^T x + b$$
where $\alpha$ is the eigenvector corresponding to the largest eigenvalue and
$$\sum_{k=1}^N (w^T x_k + b)^2 = \sum_{k=1}^N e_k^2 = \sum_{k=1}^N \frac{1}{\gamma^2} \alpha_k^2 = \lambda_{\max}^2$$
-
Reconstruction problem for linear PCA
Reconstruction error:
$$\min \sum_{k=1}^N \| x_k - \tilde{x}_k \|_2^2$$
where $\tilde{x}_k$ are variables reconstructed from the score variables, with
$$\tilde{x} = V z + \beta$$
Hence
$$\min_{V, \beta} \sum_{k=1}^N \| x_k - (V z_k + \beta) \|_2^2$$

Information bottleneck:
$$x \;\rightarrow\; z \;\rightarrow\; \tilde{x}, \qquad z = w^T x + b, \qquad \tilde{x} = V z + \beta$$
-
(Figures: data in the $(x_1, x_2)$ input space and the corresponding score variables in the $(z_1, z_2)$ space.)
-
LS-SVM approach to kernel PCA
Create a nonlinear version of the method by mapping the input space to a high dimensional feature space and applying the kernel trick (kernel PCA - Scholkopf et al.; SVM approach - Suykens et al., 2002).

Primal optimization problem:
$$P: \quad \max_{w,e} \; \mathcal{J}_P(w, e) = \gamma \frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w \quad \text{such that } e_k = w^T (\varphi(x_k) - \hat{\mu}_\varphi), \; k = 1, \ldots, N$$
where $\hat{\mu}_\varphi$ denotes the mean of the mapped data in feature space.

Lagrangian:
$$\mathcal{L}(w, e; \alpha) = \gamma \frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w - \sum_{k=1}^N \alpha_k \left( e_k - w^T (\varphi(x_k) - \hat{\mu}_\varphi) \right)$$

Conditions for optimality:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^N \alpha_k (\varphi(x_k) - \hat{\mu}_\varphi)$$
$$\frac{\partial \mathcal{L}}{\partial e_k} = 0 \;\rightarrow\; \alpha_k = \gamma e_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_k} = 0 \;\rightarrow\; e_k - w^T (\varphi(x_k) - \hat{\mu}_\varphi) = 0, \quad k = 1, \ldots, N.$$
-
By elimination of the variables $e, w$ and defining $\lambda = 1/\gamma$ one obtains
$$D: \quad \text{solve in } \alpha: \quad \Omega_c \alpha = \lambda \alpha$$
with the centered kernel matrix
$$\Omega_{c,kl} = (\varphi(x_k) - \hat{\mu}_\varphi)^T (\varphi(x_l) - \hat{\mu}_\varphi), \quad k, l = 1, \ldots, N.$$

Score variables:
$$z(x) = w^T (\varphi(x) - \hat{\mu}_\varphi) = \sum_{l=1}^N \alpha_l (\varphi(x_l) - \hat{\mu}_\varphi)^T (\varphi(x) - \hat{\mu}_\varphi)$$
$$= \sum_{l=1}^N \alpha_l \Big[ K(x_l, x) - \frac{1}{N} \sum_{r=1}^N K(x_r, x) - \frac{1}{N} \sum_{r=1}^N K(x_r, x_l) + \frac{1}{N^2} \sum_{r=1}^N \sum_{s=1}^N K(x_r, x_s) \Big].$$
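A numpy sketch of this dual problem: center the kernel matrix with $M_c = I - 1_v 1_v^T/N$, take the dominant eigenvectors, and evaluate the score variables of new points with the centering terms of the last expression (function names are mine):

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / sigma**2)

def kernel_pca_fit(X, n_comp=2, sigma=1.0):
    N = X.shape[0]
    K = rbf(X, X, sigma)
    Mc = np.eye(N) - np.ones((N, N)) / N
    Kc = Mc @ K @ Mc                              # centered kernel matrix Omega_c
    lam, A = np.linalg.eigh(Kc)
    return X, K, A[:, -n_comp:], lam[-n_comp:]    # alpha vectors of the largest eigenvalues

def kernel_pca_scores(model, Xnew, sigma=1.0):
    Xtr, K, Alpha, _ = model
    k = rbf(Xnew, Xtr, sigma)                     # K(x_l, x) for each new point x
    # centering terms of the score expression
    k_c = k - k.mean(1, keepdims=True) - K.mean(0)[None, :] + K.mean()
    return k_c @ Alpha

# toy usage
X = np.random.randn(100, 2)
model = kernel_pca_fit(X)
Z = kernel_pca_scores(model, X)
```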
-
LS-SVM interpretation to Kernel PCA
(Illustration: the input space is mapped by $\varphi(\cdot)$ to the feature space; the score $w^T(\varphi(x) - \hat{\mu}_\varphi)$ maps to a target space centered at 0; find the direction with maximal variance.)
-
(Figures: example data set in the $(x_1, x_2)$ plane and the corresponding eigenvalue spectrum $\lambda_i$ of the centered kernel matrix.)
-
Canonical Correlation Analysis
Hotelling (1936): linear CCA analysis.

Applications e.g. in system identification, subspace algorithms, signal processing and information theory. New applications to bioinformatics (correlations between genes and between experiments in both directions for a microarray data matrix).

Finding maximal correlation between the projected variables
$$z_x = w^T x \quad \text{and} \quad z_y = v^T y$$
where $x \in \mathbb{R}^{n_x}$, $y \in \mathbb{R}^{n_y}$ denote given random vectors with zero mean.

Objective: achieve a maximal correlation coefficient
$$\max_{w,v} \; \rho = \frac{\mathcal{E}[z_x z_y]}{\sqrt{\mathcal{E}[z_x z_x]}\, \sqrt{\mathcal{E}[z_y z_y]}} = \frac{w^T C_{xy} v}{\sqrt{w^T C_{xx} w}\, \sqrt{v^T C_{yy} v}}$$
with $C_{xx} = \mathcal{E}[x x^T]$, $C_{yy} = \mathcal{E}[y y^T]$, $C_{xy} = \mathcal{E}[x y^T]$.
-
This is formulated as the constrained optimization problem
$$\max_{w,v} \; w^T C_{xy} v \quad \text{such that} \quad w^T C_{xx} w = 1, \; v^T C_{yy} v = 1$$
which leads to a generalized eigenvalue problem.

The solution follows from the Lagrangian
$$\mathcal{L}(w, v; \eta, \nu) = w^T C_{xy} v - \eta \frac{1}{2} (w^T C_{xx} w - 1) - \nu \frac{1}{2} (v^T C_{yy} v - 1)$$
with Lagrange multipliers $\eta, \nu$, which gives
$$C_{xy} v = \eta\, C_{xx} w$$
$$C_{yx} w = \nu\, C_{yy} v.$$
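A small scipy sketch of this generalized eigenvalue problem for linear CCA, with sample covariances standing in for the expectations (no regularization is added here):

```python
import numpy as np
from scipy.linalg import eig

def linear_cca(X, Y):
    """Solve [0 Cxy; Cyx 0][w; v] = rho [Cxx 0; 0 Cyy][w; v] for the leading pair."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    N, nx = X.shape
    ny = Y.shape[1]
    Cxx, Cyy, Cxy = X.T @ X / N, Y.T @ Y / N, X.T @ Y / N
    A = np.block([[np.zeros((nx, nx)), Cxy], [Cxy.T, np.zeros((ny, ny))]])
    B = np.block([[Cxx, np.zeros((nx, ny))], [np.zeros((ny, nx)), Cyy]])
    rho, WV = eig(A, B)                      # generalized eigenvalue problem
    i = np.argmax(rho.real)                  # largest canonical correlation
    return rho[i].real, WV[:nx, i].real, WV[nx:, i].real

# toy usage: two views sharing a common latent signal
t = np.random.randn(500, 1)
X = np.hstack([t + 0.1 * np.random.randn(500, 1), np.random.randn(500, 2)])
Y = np.hstack([-t + 0.1 * np.random.randn(500, 1), np.random.randn(500, 1)])
print(linear_cca(X, Y)[0])
```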
-
SVM formulation to CCA
In the primal weight space:
$$P: \quad \max_{w,v,e,r} \; \gamma \sum_{k=1}^N e_k r_k - \nu_1 \frac{1}{2} \sum_{k=1}^N e_k^2 - \nu_2 \frac{1}{2} \sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$$
such that $e_k = w^T x_k$ and $r_k = v^T y_k$, $k = 1, \ldots, N$.

Lagrangian:
$$\mathcal{L}(w, v, e, r; \alpha, \beta) = \gamma \sum_{k=1}^N e_k r_k - \nu_1 \frac{1}{2} \sum_{k=1}^N e_k^2 - \nu_2 \frac{1}{2} \sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v - \sum_{k=1}^N \alpha_k [e_k - w^T x_k] - \sum_{k=1}^N \beta_k [r_k - v^T y_k]$$
where $\alpha_k, \beta_k$ are Lagrange multipliers.

Related objective in the CCA literature (Gittins, 1985):
$$\min_{w,v} \sum_k \| w^T x_k - v^T y_k \|_2^2$$
-
Conditions for optimality:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^N \alpha_k x_k$$
$$\frac{\partial \mathcal{L}}{\partial v} = 0 \;\rightarrow\; v = \sum_{k=1}^N \beta_k y_k$$
$$\frac{\partial \mathcal{L}}{\partial e_k} = 0 \;\rightarrow\; \gamma\, v^T y_k = \nu_1 w^T x_k + \alpha_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial r_k} = 0 \;\rightarrow\; \gamma\, w^T x_k = \nu_2 v^T y_k + \beta_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_k} = 0 \;\rightarrow\; e_k = w^T x_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial \beta_k} = 0 \;\rightarrow\; r_k = v^T y_k, \quad k = 1, \ldots, N$$

resulting in the following dual problem after defining $\lambda = 1/\gamma$:
$$D: \quad \text{solve in } \alpha, \beta: \quad \begin{bmatrix} 0 & \Omega_y \\ \Omega_x & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} \nu_1 \Omega_x + I & 0 \\ 0 & \nu_2 \Omega_y + I \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}$$
with $\Omega_{x,kl} = x_k^T x_l$ and $\Omega_{y,kl} = y_k^T y_l$.
-
Resulting score variables:
$$z_{x_k} = e_k = \sum_{l=1}^N \alpha_l\, x_l^T x_k, \qquad z_{y_k} = r_k = \sum_{l=1}^N \beta_l\, y_l^T y_k.$$

The eigenvalues will be both positive and negative for the CCA problem. Also note that one has $\lambda \in [-1, 1]$.
-
Extension to kernel CCA
Score variables:
$$z_x = w^T (\varphi_1(x) - \hat{\mu}_1), \qquad z_y = v^T (\varphi_2(y) - \hat{\mu}_2)$$
where $\varphi_1(\cdot): \mathbb{R}^{n_x} \to \mathbb{R}^{n_{h_x}}$ and $\varphi_2(\cdot): \mathbb{R}^{n_y} \to \mathbb{R}^{n_{h_y}}$ are mappings (which can be chosen to be different) to high dimensional feature spaces and $\hat{\mu}_1 = (1/N) \sum_{k=1}^N \varphi_1(x_k)$, $\hat{\mu}_2 = (1/N) \sum_{k=1}^N \varphi_2(y_k)$.

Primal problem:
$$P: \quad \max_{w,v,e,r} \; \gamma \sum_{k=1}^N e_k r_k - \nu_1 \frac{1}{2} \sum_{k=1}^N e_k^2 - \nu_2 \frac{1}{2} \sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$$
such that $e_k = w^T(\varphi_1(x_k) - \hat{\mu}_1)$ and $r_k = v^T(\varphi_2(y_k) - \hat{\mu}_2)$, $k = 1, \ldots, N$,

with Lagrangian
$$\mathcal{L}(w, v, e, r; \alpha, \beta) = \gamma \sum_{k=1}^N e_k r_k - \nu_1 \frac{1}{2} \sum_{k=1}^N e_k^2 - \nu_2 \frac{1}{2} \sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v - \sum_{k=1}^N \alpha_k [e_k - w^T(\varphi_1(x_k) - \hat{\mu}_1)] - \sum_{k=1}^N \beta_k [r_k - v^T(\varphi_2(y_k) - \hat{\mu}_2)]$$
where $\alpha_k, \beta_k$ are Lagrange multipliers. Note that $w$ and $v$ might be infinite dimensional now.
-
Conditions for optimality:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{k=1}^N \alpha_k (\varphi_1(x_k) - \hat{\mu}_1)$$
$$\frac{\partial \mathcal{L}}{\partial v} = 0 \;\rightarrow\; v = \sum_{k=1}^N \beta_k (\varphi_2(y_k) - \hat{\mu}_2)$$
$$\frac{\partial \mathcal{L}}{\partial e_k} = 0 \;\rightarrow\; \gamma\, v^T(\varphi_2(y_k) - \hat{\mu}_2) = \nu_1 w^T(\varphi_1(x_k) - \hat{\mu}_1) + \alpha_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial r_k} = 0 \;\rightarrow\; \gamma\, w^T(\varphi_1(x_k) - \hat{\mu}_1) = \nu_2 v^T(\varphi_2(y_k) - \hat{\mu}_2) + \beta_k, \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_k} = 0 \;\rightarrow\; e_k = w^T(\varphi_1(x_k) - \hat{\mu}_1), \quad k = 1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial \beta_k} = 0 \;\rightarrow\; r_k = v^T(\varphi_2(y_k) - \hat{\mu}_2), \quad k = 1, \ldots, N$$

resulting in the following dual problem after defining $\lambda = 1/\gamma$:
$$D: \quad \text{solve in } \alpha, \beta: \quad \begin{bmatrix} 0 & \Omega_{c,2} \\ \Omega_{c,1} & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} \nu_1 \Omega_{c,1} + I & 0 \\ 0 & \nu_2 \Omega_{c,2} + I \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}$$
where
$$\Omega_{c,1,kl} = (\varphi_1(x_k) - \hat{\mu}_1)^T (\varphi_1(x_l) - \hat{\mu}_1), \qquad \Omega_{c,2,kl} = (\varphi_2(y_k) - \hat{\mu}_2)^T (\varphi_2(y_l) - \hat{\mu}_2)$$
-
are the elements of the centered Gram matrices for $k, l = 1, \ldots, N$. In practice these matrices can be computed as
$$\Omega_{c,1} = M_c \Omega_1 M_c, \qquad \Omega_{c,2} = M_c \Omega_2 M_c$$
with centering matrix $M_c = I - (1/N) 1_v 1_v^T$.

The resulting score variables can be computed by applying the kernel trick with kernels
$$K_1(x_k, x_l) = \varphi_1(x_k)^T \varphi_1(x_l), \qquad K_2(y_k, y_l) = \varphi_2(y_k)^T \varphi_2(y_l)$$

Related work: Bach & Jordan (2002), Rosipal & Trejo (2001)
-
Recurrent Least Squares SVMs
The standard SVM is only suited for static estimation problems. Dynamic case: recurrent LS-SVM (Suykens, 1999).

Feedforward/NARX/series-parallel:
$$\hat{y}_k = f(y_{k-1}, y_{k-2}, \ldots, y_{k-p}, u_{k-1}, u_{k-2}, \ldots, u_{k-p})$$
with input $u_k \in \mathbb{R}$ and output $y_k \in \mathbb{R}$. Training by the classical SVM method.

Recurrent/NOE/parallel:
$$\hat{y}_k = f(\hat{y}_{k-1}, \hat{y}_{k-2}, \ldots, \hat{y}_{k-p}, u_{k-1}, u_{k-2}, \ldots, u_{k-p})$$
Classically trained by Narendra's Dynamic Backpropagation or Werbos' backpropagation through time.

Parametrization for the autonomous case:
$$\hat{y}_k = w^T \varphi(\hat{y}_{k-1}; \hat{y}_{k-2}; \ldots; \hat{y}_{k-p}) + b$$
which is equivalent to
$$y_k - e_k = w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) + b$$
where $e_k = y_k - \hat{y}_k$, $\xi_{k-1|k-p} = [e_{k-1}; e_{k-2}; \ldots; e_{k-p}]$ and $x_{k-1|k-p} = [y_{k-1}; y_{k-2}; \ldots; y_{k-p}]$.
-
Optimization problem:
$$\min_{w,e} \; \mathcal{J}(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=p+1}^{N+p} e_k^2$$
subject to
$$y_k - e_k = w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) + b, \quad k = p+1, \ldots, N+p.$$

Lagrangian:
$$\mathcal{L}(w, b, e; \alpha) = \mathcal{J}(w, b, e) + \sum_{k=p+1}^{N+p} \alpha_{k-p} \big[ y_k - e_k - w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) - b \big]$$

Conditions for optimality:
$$\frac{\partial \mathcal{L}}{\partial w} = w - \sum_{k=p+1}^{N+p} \alpha_{k-p}\, \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) = 0$$
$$\frac{\partial \mathcal{L}}{\partial b} = \sum_{k=p+1}^{N+p} \alpha_{k-p} = 0$$
$$\frac{\partial \mathcal{L}}{\partial e_k} = \gamma e_k - \alpha_{k-p} - \sum_{i=1}^{p} \alpha_{k-p+i} \frac{\partial}{\partial e_{k-i}} \big[ w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) \big] = 0, \quad k = p+1, \ldots, N$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_{k-p}} = y_k - e_k - w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) - b = 0, \quad k = p+1, \ldots, N+p$$
-
Applying the Mercer condition
$$K(z_{k-1|k-p}, z_{l-1|l-p}) = \varphi(z_{k-1|k-p})^T \varphi(z_{l-1|l-p})$$
gives
$$\sum_{k=p+1}^{N+p} \alpha_{k-p} = 0$$
$$\gamma e_k - \alpha_{k-p} - \sum_{i=1}^{p} \alpha_{k-p+i} \frac{\partial}{\partial e_{k-i}} \Big[ \sum_{l=p+1}^{N+p} \alpha_{l-p} K(z_{k-1|k-p}, z_{l-1|l-p}) \Big] = 0, \quad k = p+1, \ldots, N$$
$$y_k - e_k - \sum_{l=p+1}^{N+p} \alpha_{l-p} K(z_{k-1|k-p}, z_{l-1|l-p}) - b = 0, \quad k = p+1, \ldots, N+p$$

Solving this set of nonlinear equations in the unknowns $e, \alpha, b$ is computationally very expensive.
-
Recurrent LS-SVMs: example
Chua's circuit:
$$\dot{x} = a\, [y - h(x)]$$
$$\dot{y} = x - y + z$$
$$\dot{z} = -b\, y$$
with piecewise linear characteristic
$$h(x) = m_1 x + \frac{1}{2} (m_0 - m_1) \big( |x + 1| - |x - 1| \big).$$
Double scroll attractor for $a = 9$, $b = 14.286$, $m_0 = -1/7$, $m_1 = 2/7$.

(Figure: double scroll attractor in the $(x_1, x_2, x_3)$ state space.)

Recurrent LS-SVM:
N = 300 training data
Model: p = 12
RBF kernel: $\sigma = 1$
SQP with early stopping
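For reference, the double scroll trajectory can be generated by simulating Chua's circuit directly; a scipy sketch with the parameter values from this slide (integration settings and the initial condition are my own choices):

```python
import numpy as np
from scipy.integrate import solve_ivp

a, b, m0, m1 = 9.0, 14.286, -1/7, 2/7

def h(x):
    # piecewise linear characteristic of Chua's diode
    return m1 * x + 0.5 * (m0 - m1) * (abs(x + 1) - abs(x - 1))

def chua(t, s):
    x, y, z = s
    return [a * (y - h(x)), x - y + z, -b * y]

t_eval = np.arange(0.0, 100.0, 0.05)
sol = solve_ivp(chua, (0.0, 100.0), [0.1, 0.0, 0.0], t_eval=t_eval, rtol=1e-8)
x_series = sol.y[0]   # e.g. one state variable as the time series for trajectory learning
```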
-
Figure: trajectory learning of the double scroll (full line) by a recurrent least squares SVM with RBF kernel. The simulation result after training (dashed line) on N = 300 data points is shown, with data points k = 1 to 12 as initial condition. For the model structure p = 12 and $\sigma = 1$ are taken. Early stopping is done in order to avoid overfitting.
-
LS-SVM for control
The N-stage optimal control problem
Optimal control problem:
$$\min \; \mathcal{J}_N(x_k, u_k) = \rho(x_{N+1}) + \sum_{k=1}^N h(x_k, u_k)$$
subject to the system dynamics
$$x_{k+1} = f(x_k, u_k), \quad k = 1, \ldots, N \quad (x_1 \text{ given})$$
where $x_k \in \mathbb{R}^n$ is the state vector, $u_k \in \mathbb{R}$ the input, and $\rho(\cdot)$, $h(\cdot, \cdot)$ are positive definite functions.

Lagrangian:
$$\mathcal{L}_N(x_k, u_k; \lambda_k) = \mathcal{J}_N(x_k, u_k) + \sum_{k=1}^N \lambda_k^T [x_{k+1} - f(x_k, u_k)]$$
with Lagrange multipliers $\lambda_k \in \mathbb{R}^n$.

Conditions for optimality:
$$\frac{\partial \mathcal{L}_N}{\partial x_k} = \frac{\partial h}{\partial x_k} + \lambda_{k-1} - \Big( \frac{\partial f}{\partial x_k} \Big)^T \lambda_k = 0, \quad k = 2, \ldots, N \quad \text{(adjoint equation)}$$
$$\frac{\partial \mathcal{L}_N}{\partial x_{N+1}} = \frac{\partial \rho}{\partial x_{N+1}} + \lambda_N = 0 \quad \text{(adjoint final condition)}$$
$$\frac{\partial \mathcal{L}_N}{\partial u_k} = \frac{\partial h}{\partial u_k} - \lambda_k^T \frac{\partial f}{\partial u_k} = 0, \quad k = 1, \ldots, N \quad \text{(variational condition)}$$
$$\frac{\partial \mathcal{L}_N}{\partial \lambda_k} = x_{k+1} - f(x_k, u_k) = 0, \quad k = 1, \ldots, N \quad \text{(system dynamics)}$$
-
Optimal control with LS-SVMs
Consider an LS-SVM mapping from the state space to the action space. Combine the optimization problem formulations of the optimal control problem and of the LS-SVM into one formulation.

Optimal control problem with LS-SVM:
$$\min \; \mathcal{J}(x_k, u_k, w, e_k) = \mathcal{J}_N(x_k, u_k) + \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^N e_k^2$$
subject to the system dynamics
$$x_{k+1} = f(x_k, u_k), \quad k = 1, \ldots, N \quad (x_1 \text{ given})$$
and the control law
$$u_k = w^T \varphi(x_k) + e_k, \quad k = 1, \ldots, N.$$

Actual control signal applied to the plant: $w^T \varphi(x_k)$.

Lagrangian:
$$\mathcal{L}(x_k, u_k, w, e_k; \lambda_k, \alpha_k) = \mathcal{J}_N(x_k, u_k) + \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^N e_k^2 + \sum_{k=1}^N \lambda_k^T [x_{k+1} - f(x_k, u_k)] + \sum_{k=1}^N \alpha_k [u_k - w^T \varphi(x_k) - e_k]$$
-
Conditions for optimality:
$$\frac{\partial \mathcal{L}}{\partial x_k} = \frac{\partial h}{\partial x_k} + \lambda_{k-1} - \Big( \frac{\partial f}{\partial x_k} \Big)^T \lambda_k - \alpha_k \frac{\partial}{\partial x_k} [w^T \varphi(x_k)] = 0, \quad k = 2, \ldots, N \quad \text{(adjoint equation)}$$
$$\frac{\partial \mathcal{L}}{\partial x_{N+1}} = \frac{\partial \rho}{\partial x_{N+1}} + \lambda_N = 0 \quad \text{(adjoint final condition)}$$
$$\frac{\partial \mathcal{L}}{\partial u_k} = \frac{\partial h}{\partial u_k} - \lambda_k^T \frac{\partial f}{\partial u_k} + \alpha_k = 0, \quad k = 1, \ldots, N \quad \text{(variational condition)}$$
$$\frac{\partial \mathcal{L}}{\partial w} = w - \sum_{k=1}^N \alpha_k \varphi(x_k) = 0 \quad \text{(support vectors)}$$
$$\frac{\partial \mathcal{L}}{\partial e_k} = \gamma e_k - \alpha_k = 0, \quad k = 1, \ldots, N \quad \text{(support values)}$$
$$\frac{\partial \mathcal{L}}{\partial \lambda_k} = x_{k+1} - f(x_k, u_k) = 0, \quad k = 1, \ldots, N \quad \text{(system dynamics)}$$
$$\frac{\partial \mathcal{L}}{\partial \alpha_k} = u_k - w^T \varphi(x_k) - e_k = 0, \quad k = 1, \ldots, N \quad \text{(SVM control)}$$

One obtains a set of nonlinear equations of the form
$$F_1(x_k, x_{N+1}, u_k, w, e_k, \lambda_k, \alpha_k) = 0$$
for $k = 1, \ldots, N$ with $x_1$ given.

Apply the Mercer condition trick:
$$K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$$
For RBF kernels one takes
$$K(x_k, x_l) = \exp(-\eta \| x_k - x_l \|_2^2)$$
-
By replacing $w = \sum_l \alpha_l \varphi(x_l)$ in the equations, one obtains
$$\frac{\partial h}{\partial x_k} + \lambda_{k-1} - \Big( \frac{\partial f}{\partial x_k} \Big)^T \lambda_k - \alpha_k \sum_{l=1}^N \alpha_l \frac{\partial K(x_k, x_l)}{\partial x_k} = 0, \quad k = 2, \ldots, N$$
$$\frac{\partial \rho}{\partial x_{N+1}} + \lambda_N = 0$$
$$\frac{\partial h}{\partial u_k} - \lambda_k^T \frac{\partial f}{\partial u_k} + \alpha_k = 0, \quad k = 1, \ldots, N$$
$$x_{k+1} - f(x_k, u_k) = 0, \quad k = 1, \ldots, N$$
$$u_k - \sum_{l=1}^N \alpha_l K(x_l, x_k) - \alpha_k / \gamma = 0, \quad k = 1, \ldots, N$$
which is of the form
$$F_2(x_k, x_{N+1}, u_k, \lambda_k, \alpha_k) = 0$$
for $k = 1, \ldots, N$ with $x_1$ given.

For RBF kernels one has
$$\frac{\partial K(x_k, x_l)}{\partial x_k} = -2 \eta\, (x_k - x_l) \exp(-\eta \| x_k - x_l \|_2^2)$$

The actual control signal applied to the plant becomes
$$u_k = \sum_{l=1}^N \alpha_l K(x_l, x_k)$$
where $\{x_l\}_{l=1}^N$, $\{\alpha_l\}_{l=1}^N$ are the solution to $F_2 = 0$.