
  • Slide 1/84

    Least Squares Support Vector Machines
    Johan Suykens

    K.U. Leuven ESAT-SCD-SISTA, Kasteelpark Arenberg 10
    B-3001 Leuven (Heverlee), Belgium
    Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
    Email: [email protected]
    http://www.esat.kuleuven.ac.be/sista/members/suykens.html

    NATO-ASI Learning Theory and Practice, Leuven, July 2002
    http://www.esat.kuleuven.ac.be/sista/natoasi/ltp2002.html

  • Slide 2/84

    Main reference:

    J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle,
    Least Squares Support Vector Machines, World Scientific, in press (ISBN 981-238-151-1)

    Related software LS-SVMlab (Matlab/C toolbox):
    http://www.esat.kuleuven.ac.be/sista/lssvmlab/
    (software + publications)

    ... with thanks to Kristiaan Pelckmans, Bart Hamers, Lukas, Luc Hoegaerts,
    Chuan Lu, Lieveke Ameye, Sabine Van Huffel, Gert Lanckriet, Tijl De Bie
    and many others

  • Slide 3/84

    Interdisciplinary challenges

    LS-SVM connects to: linear algebra, mathematics, statistics, systems and
    control, signal processing, optimization, machine learning, pattern
    recognition, data mining, neural networks.

    One of the original dreams in the neural networks area is to make
    a universal class of models (such as MLPs) generally applicable to
    a wide range of applications.

    Presently, standard SVMs are mainly available only for classification,
    function estimation and density estimation. However, several extensions
    are possible in terms of least squares and equality constraints
    (and exploiting primal-dual interpretations).

  • Slide 4/84

    Overview of this talk

    LS-SVM for classification and link with kernel Fisher discriminant analysis
    Links to regularization networks and Gaussian processes
    Bayesian inference for LS-SVMs
    Sparseness and robustness
    Large scale methods: Fixed Size LS-SVM
    Extensions to committee networks
    New formulations to kernel PCA, kernel CCA, kernel PLS
    Extensions of LS-SVM to recurrent networks and optimal control

  • Slide 5/84

    Vapnik's SVM classifier

    Given a training set $\{x_k, y_k\}_{k=1}^N$
    Input patterns $x_k \in \mathbb{R}^n$
    Class labels $y_k \in \mathbb{R}$ where $y_k \in \{-1, +1\}$

    Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
    with $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_h}$ the mapping to a high dimensional feature space (can be infinite dimensional).

    For separable data, assume
    $w^T \varphi(x_k) + b \ge +1$, if $y_k = +1$
    $w^T \varphi(x_k) + b \le -1$, if $y_k = -1$
    which is equivalent to
    $y_k[w^T \varphi(x_k) + b] \ge 1, \quad k = 1, \ldots, N$

    Optimization problem (non-separable case):
    $\min_{w,\xi} \; \mathcal{J}(w,\xi) = \frac{1}{2} w^T w + c \sum_{k=1}^N \xi_k$
    subject to
    $y_k[w^T \varphi(x_k) + b] \ge 1 - \xi_k, \quad k = 1,\ldots,N$
    $\xi_k \ge 0, \quad k = 1,\ldots,N.$

  • Slide 6/84

    Construct the Lagrangian:
    $\mathcal{L}(w,b,\xi;\alpha,\nu) = \mathcal{J}(w,\xi) - \sum_{k=1}^N \alpha_k \{ y_k[w^T\varphi(x_k)+b] - 1 + \xi_k \} - \sum_{k=1}^N \nu_k \xi_k$
    with Lagrange multipliers $\alpha_k \ge 0$, $\nu_k \ge 0$ ($k = 1,\ldots,N$).

    The solution is given by the saddle point of the Lagrangian:
    $\max_{\alpha,\nu} \; \min_{w,b,\xi} \; \mathcal{L}(w,b,\xi;\alpha,\nu)$

    One obtains
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k y_k \varphi(x_k)$
    $\partial\mathcal{L}/\partial b = 0 \;\Rightarrow\; \sum_{k=1}^N \alpha_k y_k = 0$
    $\partial\mathcal{L}/\partial \xi_k = 0 \;\Rightarrow\; 0 \le \alpha_k \le c, \quad k = 1,\ldots,N$

    Quadratic programming problem (dual problem):
    $\max_{\alpha} \; Q(\alpha) = -\frac{1}{2}\sum_{k,l=1}^N y_k y_l K(x_k, x_l)\,\alpha_k \alpha_l + \sum_{k=1}^N \alpha_k$
    such that
    $\sum_{k=1}^N \alpha_k y_k = 0, \qquad 0 \le \alpha_k \le c, \quad k = 1,\ldots,N.$

  • Slide 7/84

    Note: $w$ and $\varphi(x_k)$ are not calculated.

    Mercer condition:
    $K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$

    Obtained classifier:
    $y(x) = \mathrm{sign}\left[\sum_{k=1}^N \alpha_k y_k K(x, x_k) + b\right]$
    with $\alpha_k$ positive real constants and $b$ a real constant, which follow as the solution of the QP problem.

    Non-zero $\alpha_k$ are called support values and the corresponding data points are called support vectors.

    The bias term $b$ follows from the KKT conditions.

    Some possible kernels $K(\cdot,\cdot)$:
    $K(x, x_k) = x_k^T x$  (linear SVM)
    $K(x, x_k) = (x_k^T x + 1)^d$  (polynomial SVM of degree $d$)
    $K(x, x_k) = \exp\{-\|x - x_k\|_2^2 / \sigma^2\}$  (RBF SVM)
    $K(x, x_k) = \tanh(\kappa\, x_k^T x + \theta)$  (MLP SVM)

    In the case of the RBF and MLP kernel, the number of hidden units corresponds to the number of support vectors.
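    As a small illustration (not part of the original slides), here is a minimal Python/NumPy sketch of these kernels and of the dual-form decision rule $y(x) = \mathrm{sign}[\sum_k \alpha_k y_k K(x,x_k) + b]$; the function names and default parameters are illustrative only.

    import numpy as np

    def rbf_kernel(x, xk, sigma=1.0):
        # K(x, x_k) = exp(-||x - x_k||^2 / sigma^2)
        return np.exp(-np.sum((x - xk) ** 2) / sigma ** 2)

    def poly_kernel(x, xk, d=3):
        # K(x, x_k) = (x_k^T x + 1)^d
        return (xk @ x + 1.0) ** d

    def svm_decision(x, support_vectors, support_labels, alphas, b, kernel=rbf_kernel):
        # y(x) = sign( sum_k alpha_k y_k K(x, x_k) + b )
        s = sum(a * y * kernel(x, xk)
                for a, y, xk in zip(alphas, support_labels, support_vectors))
        return np.sign(s + b)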

  • Slide 8/84

    Feature space and the kernel trick

    [Figure: data that are not linearly separable in the input space become
    separable in the feature space through the mapping $\varphi(x)$; inner products
    in the feature space are computed via the kernel: $\varphi(x)^T \varphi(z) = K(x, z)$.]

  • Slide 9/84

    Primal-dual interpretations of SVMs

    Primal problem P: parametric, estimate $w \in \mathbb{R}^{n_h}$;
    $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

    Dual problem D: non-parametric, estimate $\alpha \in \mathbb{R}^N$;
    $y(x) = \mathrm{sign}\left[\sum_{k=1}^{\#sv} \alpha_k y_k K(x, x_k) + b\right]$

    Kernel trick: $K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$

    [Figure: the primal model as a network with hidden features $\varphi_1(x), \ldots, \varphi_{n_h}(x)$
    and weights $w_1, \ldots, w_{n_h}$; the dual model as a network with kernels
    $K(x, x_1), \ldots, K(x, x_{\#sv})$ and weights $\alpha_1, \ldots, \alpha_{\#sv}$.]

  • Slide 10/84

    Least Squares SVM classifiers

    LS-SVM classifiers (Suykens, 1999): close to Vapnik's SVM formulation,
    but one solves a linear system instead of a QP problem.

    Optimization problem:
    $\min_{w,b,e} \; \mathcal{J}(w,b,e) = \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2$
    subject to the equality constraints
    $y_k[w^T \varphi(x_k) + b] = 1 - e_k, \quad k = 1,\ldots,N.$

    Lagrangian
    $\mathcal{L}(w,b,e;\alpha) = \mathcal{J}(w,b,e) - \sum_{k=1}^N \alpha_k \{ y_k[w^T \varphi(x_k) + b] - 1 + e_k \}$
    where $\alpha_k$ are Lagrange multipliers.

    Conditions for optimality:
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k y_k \varphi(x_k)$
    $\partial\mathcal{L}/\partial b = 0 \;\Rightarrow\; \sum_{k=1}^N \alpha_k y_k = 0$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; y_k[w^T \varphi(x_k) + b] - 1 + e_k = 0, \quad k = 1,\ldots,N$

  • Slide 11/84

    Set of linear equations (instead of QP):
    $\begin{bmatrix} I & 0 & 0 & -Z^T \\ 0 & 0 & 0 & -Y^T \\ 0 & 0 & \gamma I & -I \\ Z & Y & I & 0 \end{bmatrix}
     \begin{bmatrix} w \\ b \\ e \\ \alpha \end{bmatrix} =
     \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vec{1} \end{bmatrix}$
    with
    $Z = [\varphi(x_1)^T y_1; \ldots; \varphi(x_N)^T y_N]$
    $Y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$
    $e = [e_1; \ldots; e_N]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$.

    After elimination of $w, e$ one obtains
    $\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I \end{bmatrix}
     \begin{bmatrix} b \\ \alpha \end{bmatrix} =
     \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}$
    where
    $\Omega = Z Z^T$
    and Mercer's condition is applied:
    $\Omega_{kl} = y_k y_l\, \varphi(x_k)^T \varphi(x_l) = y_k y_l\, K(x_k, x_l).$

    Related work: Saunders (1998), Smola & Schölkopf (1998), Cristianini & Shawe-Taylor (2000)
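    A minimal NumPy sketch of training an LS-SVM classifier by solving this $(N+1) \times (N+1)$ system with an RBF kernel; names and default hyperparameter values are illustrative, not taken from LS-SVMlab.

    import numpy as np

    def train_lssvm_classifier(X, y, gamma=1.0, sigma=1.0):
        """Solve the dual system [0 y^T; y Omega + I/gamma][b; alpha] = [0; 1]
        with Omega_kl = y_k y_l K(x_k, x_l) and an RBF kernel."""
        N = X.shape[0]
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        Omega = np.outer(y, y) * np.exp(-sq / sigma ** 2)
        A = np.zeros((N + 1, N + 1))
        A[0, 1:] = y
        A[1:, 0] = y
        A[1:, 1:] = Omega + np.eye(N) / gamma
        rhs = np.concatenate(([0.0], np.ones(N)))
        sol = np.linalg.solve(A, rhs)
        b, alpha = sol[0], sol[1:]
        # classification of a new point: sign(sum_k alpha_k y_k K(x, x_k) + b)
        return alpha, b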

  • Slide 12/84

    Large scale LS-SVMs

    For the binary class case the problem involves a matrix of size $(N+1) \times (N+1)$.
    For large data sets iterative methods are needed (CG, SOR, ...).

    Problem: solving $Ax = B$, $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^n$,
    by CG requires that $A$ is symmetric positive definite.

    Represent the original problem of the form
    $\begin{bmatrix} 0 & Y^T \\ Y & H \end{bmatrix} \begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \end{bmatrix}$
    with $H = \Omega + \gamma^{-1} I$, $\xi_1 = b$, $\xi_2 = \alpha$, $d_1 = 0$, $d_2 = \vec{1}$, as
    $\begin{bmatrix} s & 0 \\ 0 & H \end{bmatrix} \begin{bmatrix} \xi_1 \\ \xi_2 + H^{-1} Y \xi_1 \end{bmatrix} = \begin{bmatrix} -d_1 + Y^T H^{-1} d_2 \\ d_2 \end{bmatrix}$
    with $s = Y^T H^{-1} Y \ge 0$ (since $H = H^T$ is positive definite).

    Iterative methods can be applied to the latter problem.

    Convergence of CG depends on the condition number
    (hence it depends on $\sigma$, $\gamma$).
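    A minimal sketch of this reduction in practice (assuming SciPy): two CG solves on the positive definite block $H = \Omega + \gamma^{-1} I$, with $b$ recovered from the constraint $Y^T \alpha = 0$. The helper name and tolerances are illustrative.

    import numpy as np
    from scipy.sparse.linalg import cg

    def solve_lssvm_by_cg(Omega, y, gamma=1.0):
        """Solve [0 y^T; y H][b; alpha] = [0; 1] with H = Omega + I/gamma
        using conjugate gradients on the symmetric positive definite block H."""
        N = Omega.shape[0]
        H = Omega + np.eye(N) / gamma
        eta, _ = cg(H, y)              # eta = H^{-1} y
        nu, _ = cg(H, np.ones(N))      # nu  = H^{-1} 1
        b = (y @ nu) / (y @ eta)       # from y^T alpha = 0
        alpha = nu - eta * b
        return alpha, b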

  • Slide 13/84

    Fisher Discriminant Analysis

    Projection of the data: $z = f(x) = w^T x + b$

    Optimize the Rayleigh quotient:
    $\max_{w,b} \; \mathcal{J}_{FD}(w,b) = \frac{w^T \Sigma_B\, w}{w^T \Sigma_W\, w}$

    [Figure: two classes projected onto the discriminant directions
    $w_1^T x + b_1$ and $w_2^T x + b_2$.]

  • Slide 14/84

  • Slide 15/84

    A two-spiral classification problem

    [Figure: two interlocking spirals in the plane, axis ranges -40 to 40 in both directions.]

    A difficult problem for MLPs, but not for SVM or LS-SVM with an RBF kernel
    (yet easy in the sense that the two classes are separable).

  • Slide 16/84

    Benchmarking LS-SVM classifiers

              acr   bld   gcr   hea   ion   pid   snr   ttt   wbc    adu
    N_CV      460   230   666   180   234   512   138   638   455  33000
    N_test    230   115   334    90   117   256    70   320   228  12222
    N         690   345  1000   270   351   768   208   958   683  45222
    n_num       6     6     7     7    33     8    60     0     9      6
    n_cat       8     0    13     6     0     0     0     9     0      8
    n          14     6    20    13    33     8    60     9     9     14

              bal   cmc   ims   iri   led   thy   usp   veh   wav   win
    N_CV      416   982  1540   100  2000  4800  6000   564  2400   118
    N_test    209   491   770    50  1000  2400  3298   282  1200    60
    N         625  1473  2310   150  3000  7200  9298   846  3600   178
    n_num       4     2    18     4     0     6   256    18    19    13
    n_cat       0     7     0     0     7    15     0     0     0     0
    n           4     9    18     4     7    21   256    18    19    13
    M           3     3     7     3    10     3    10     4     3     3
    n_y,MOC     2     2     3     2     4     2     4     2     2     2
    n_y,1vs1    3     3    21     3    45     3    45     6     3     3

    (Van Gestel et al., Machine Learning, in press)

  • Slide 17/84

    Successful LS-SVM applications

    20 UCI benchmark data sets (binary and multiclass problems)
    Ovarian cancer classification
    Classification of brain tumours from magnetic resonance spectroscopy signals
    Prediction of mental development of preterm newborns
    Prediction of air pollution and chaotic time series
    Marketing and financial engineering studies
    Modelling the Belgian electricity consumption
    Soft-sensor modelling in the chemical process industry

  • Slide 18/84

    Test set performance on the binary UCI benchmarks:

                   acr       bld       gcr       hea       ion       pid       snr       ttt       wbc       adu
    N_test         230       115       334        90       117       256        70       320       228     12222
    n               14         6        20        13        33         8        60         9         9        14

    RBF LS-SVM   87.0(2.1) 70.2(4.1) 76.3(1.4) 84.7(4.8) 96.0(2.1) 76.8(1.7) 73.1(4.2) 99.0(0.3) 96.4(1.0) 84.7(0.3)
    RBF LS-SVMF  86.4(1.9) 65.1(2.9) 70.8(2.4) 83.2(5.0) 93.4(2.7) 72.9(2.0) 73.6(4.6) 97.9(0.7) 96.8(0.7) 77.6(1.3)
    Lin LS-SVM   86.8(2.2) 65.6(3.2) 75.4(2.3) 84.9(4.5) 87.9(2.0) 76.8(1.8) 72.6(3.7) 66.8(3.9) 95.8(1.0) 81.8(0.3)
    Lin LS-SVMF  86.5(2.1) 61.8(3.3) 68.6(2.3) 82.8(4.4) 85.0(3.5) 73.1(1.7) 73.3(3.4) 57.6(1.9) 96.9(0.7) 71.3(0.3)
    Pol LS-SVM   86.5(2.2) 70.4(3.7) 76.3(1.4) 83.7(3.9) 91.0(2.5) 77.0(1.8) 76.9(4.7) 99.5(0.5) 96.4(0.9) 84.6(0.3)
    Pol LS-SVMF  86.6(2.2) 65.3(2.9) 70.3(2.3) 82.4(4.6) 91.7(2.6) 73.0(1.8) 77.3(2.6) 98.1(0.8) 96.9(0.7) 77.9(0.2)
    RBF SVM      86.3(1.8) 70.4(3.2) 75.9(1.4) 84.7(4.8) 95.4(1.7) 77.3(2.2) 75.0(6.6) 98.6(0.5) 96.4(1.0) 84.4(0.3)
    Lin SVM      86.7(2.4) 67.7(2.6) 75.4(1.7) 83.2(4.2) 87.1(3.4) 77.0(2.4) 74.1(4.2) 66.2(3.6) 96.3(1.0) 83.9(0.2)
    LDA          85.9(2.2) 65.4(3.2) 75.9(2.0) 83.9(4.3) 87.1(2.3) 76.7(2.0) 67.9(4.9) 68.0(3.0) 95.6(1.1) 82.2(0.3)
    QDA          80.1(1.9) 62.2(3.6) 72.5(1.4) 78.4(4.0) 90.6(2.2) 74.2(3.3) 53.6(7.4) 75.1(4.0) 94.5(0.6) 80.7(0.3)
    Logit        86.8(2.4) 66.3(3.1) 76.3(2.1) 82.9(4.0) 86.2(3.5) 77.2(1.8) 68.4(5.2) 68.3(2.9) 96.1(1.0) 83.7(0.2)
    C4.5         85.5(2.1) 63.1(3.8) 71.4(2.0) 78.0(4.2) 90.6(2.2) 73.5(3.0) 72.1(2.5) 84.2(1.6) 94.7(1.0) 85.6(0.3)
    oneR         85.4(2.1) 56.3(4.4) 66.0(3.0) 71.7(3.6) 83.6(4.8) 71.3(2.7) 62.6(5.5) 70.7(1.5) 91.8(1.4) 80.4(0.3)
    IB1          81.1(1.9) 61.3(6.2) 69.3(2.6) 74.3(4.2) 87.2(2.8) 69.6(2.4) 77.7(4.4) 82.3(3.3) 95.3(1.1) 78.9(0.2)
    IB10         86.4(1.3) 60.5(4.4) 72.6(1.7) 80.0(4.3) 85.9(2.5) 73.6(2.4) 69.4(4.3) 94.8(2.0) 96.4(1.2) 82.7(0.3)
    NBk          81.4(1.9) 63.7(4.5) 74.7(2.1) 83.9(4.5) 92.1(2.5) 75.5(1.7) 71.6(3.5) 71.7(3.1) 97.1(0.9) 84.8(0.2)
    NBn          76.9(1.7) 56.0(6.9) 74.6(2.8) 83.8(4.5) 82.8(3.8) 75.1(2.1) 66.6(3.2) 71.7(3.1) 95.5(0.5) 82.7(0.2)
    Maj. Rule    56.2(2.0) 56.5(3.1) 69.7(2.3) 56.3(3.8) 64.4(2.9) 66.8(2.1) 54.4(4.7) 66.2(3.6) 66.2(2.4) 75.3(0.3)

  • Slide 19/84

  • Slide 20/84

    LS-SVMs for function estimation

    The LS-SVM model as feature space representation:
    $y(x) = w^T \varphi(x) + b$
    with $x \in \mathbb{R}^n$, $y \in \mathbb{R}$.

    The nonlinear mapping $\varphi(\cdot)$ is similar to the classifier case.

    Given a training set $\{x_k, y_k\}_{k=1}^N$.

    Optimization problem:
    $\min_{w,e} \; \mathcal{J}(w,e) = \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2$
    subject to the equality constraints
    $y_k = w^T \varphi(x_k) + b + e_k, \quad k = 1,\ldots,N$

    This is a form of ridge regression (see Saunders, 1998).

    Lagrangian:
    $\mathcal{L}(w,b,e;\alpha) = \mathcal{J}(w,e) - \sum_{k=1}^N \alpha_k \{ w^T \varphi(x_k) + b + e_k - y_k \}$
    with $\alpha_k$ the Lagrange multipliers.

  • Slide 21/84

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k \varphi(x_k)$
    $\partial\mathcal{L}/\partial b = 0 \;\Rightarrow\; \sum_{k=1}^N \alpha_k = 0$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; w^T \varphi(x_k) + b + e_k - y_k = 0, \quad k = 1,\ldots,N$

    Solution
    $\begin{bmatrix} 0 & \vec{1}^T \\ \vec{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$
    with
    $y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$
    and, by applying Mercer's condition,
    $\Omega_{kl} = \varphi(x_k)^T \varphi(x_l) = K(x_k, x_l), \quad k, l = 1,\ldots,N.$

    Resulting LS-SVM model for function estimation:
    $y(x) = \sum_{k=1}^N \alpha_k K(x_k, x) + b$

    The solution is mathematically equivalent to regularization networks and
    Gaussian processes (usually without bias term); see e.g. Poggio & Girosi (1990).
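    A minimal NumPy sketch of LS-SVM function estimation: build $\Omega$ with an RBF kernel, solve the $(N+1) \times (N+1)$ system, and evaluate $y(x) = \sum_k \alpha_k K(x_k, x) + b$. Names and parameter values are illustrative.

    import numpy as np

    def rbf_kernel_matrix(A, B, sigma=1.0):
        # Omega_kl = exp(-||a_k - b_l||^2 / sigma^2)
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    def train_lssvm_regression(X, y, gamma=10.0, sigma=1.0):
        """Solve [0 1^T; 1 Omega + I/gamma][b; alpha] = [0; y]."""
        N = X.shape[0]
        Omega = rbf_kernel_matrix(X, X, sigma)
        A = np.zeros((N + 1, N + 1))
        A[0, 1:] = 1.0
        A[1:, 0] = 1.0
        A[1:, 1:] = Omega + np.eye(N) / gamma
        sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
        return sol[1:], sol[0]      # alpha, b

    def predict_lssvm(Xtest, Xtrain, alpha, b, sigma=1.0):
        # y(x) = sum_k alpha_k K(x_k, x) + b
        return rbf_kernel_matrix(Xtest, Xtrain, sigma) @ alpha + b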

  • Slide 22/84

    SVM, RN, GP, LS-SVM, ...

    Closely related model classes: Support Vector Machines, LS-SVMs,
    regularization networks, Gaussian processes, Kriging, kernel ridge
    regression, RKHS methods.

    Some early history:
    1910-1920: Moore - RKHS
    1940: Aronszajn - systematic development of RKHS
    1951: Krige
    1970: Parzen - RKHS for statistical problems
    1971: Kimeldorf & Wahba

    In LS-SVMs, primal-dual interpretations are emphasized (and often exploited).

  • Slide 23/84

    Bayesian inference of LS-SVM models

    Level 1 ($w, b$):
        maximize the posterior $p(w,b \mid D, \mu, \zeta, H)$
        $\propto$ likelihood $p(D \mid w,b,\mu,\zeta,H)$ $\times$ prior $p(w,b \mid \mu,\zeta,H)$;
        evidence $p(D \mid \mu,\zeta,H)$

    Level 2 ($\mu, \zeta$):
        maximize the posterior $p(\mu,\zeta \mid D, H)$
        $\propto$ likelihood $p(D \mid \mu,\zeta,H)$ $\times$ prior $p(\mu,\zeta \mid H)$;
        evidence $p(D \mid H)$

    Level 3 ($\sigma$):
        maximize the posterior $p(H \mid D)$
        $\propto$ likelihood $p(D \mid H)$ $\times$ prior $p(H)$;
        evidence $p(D)$

  • Slide 24/84

    Bayesian LS-SVM framework
    (Van Gestel, Suykens et al., Neural Computation, 2002)

    Level 1 inference:
    Inference of w, b
    Probabilistic interpretation of the outputs
    Moderated outputs
    Additional correction for prior probabilities and bias term

    Level 2 inference:
    Inference of the hyperparameters
    Effective number of parameters (smaller than N)
    Eigenvalues of the centered kernel matrix are important

    Level 3:
    Model comparison - Occam's razor
    Moderated output with uncertainty on the hyperparameters
    Selection of the width of the kernel
    Input selection (ARD at level 3 instead of level 2)

    Related work: MacKay's Bayesian interpolation for MLPs, GP

    Differences with GP:
    bias term contained in the formulation
    classifier case treated as a regression problem
    $\sigma$ of the RBF kernel determined at Level 3

  • Slide 25/84

    Benchmarking results

           n     N  N_test  N_tot  LS-SVM     LS-SVM     LS-SVM     SVM        GP         GPb        GP
                                   (BayM)     (Bay)      (CV10)     (CV10)     (Bay)      (Bay)      (CV10)
    bld    6   230    115    345   69.4(2.9)  69.4(3.1)  69.4(3.4)  69.2(3.5)  69.2(2.7)  68.9(3.3)  69.7(4.0)
    cra    6   133     67    200   96.7(1.5)  96.7(1.5)  96.9(1.6)  95.1(3.2)  96.4(2.5)  94.8(3.2)  96.9(2.4)
    gcr   20   666    334   1000   73.1(3.8)  73.5(3.9)  75.6(1.8)  74.9(1.7)  76.2(1.4)  75.9(1.7)  75.4(2.0)
    hea   13   180     90    270   83.6(5.1)  83.2(5.2)  84.3(5.3)  83.4(4.4)  83.1(5.5)  83.7(4.9)  84.1(5.2)
    ion   33   234    117    351   95.6(0.9)  96.2(1.0)  95.6(2.0)  95.4(1.7)  91.0(2.3)  94.4(1.9)  92.4(2.4)
    pid    8   512    256    768   77.3(3.1)  77.5(2.8)  77.3(3.0)  76.9(2.9)  77.6(2.9)  77.5(2.7)  77.2(3.0)
    rsy    2   250   1000   1250   90.2(0.7)  90.2(0.6)  89.6(1.1)  89.7(0.8)  90.2(0.7)  90.1(0.8)  89.9(0.8)
    snr   60   138     70    208   76.7(5.6)  78.0(5.2)  77.9(4.2)  76.3(5.3)  78.6(4.9)  75.7(6.1)  76.6(7.2)
    tit    3  1467    734   2201   78.8(1.1)  78.7(1.1)  78.7(1.1)  78.7(1.1)  78.5(1.0)  77.2(1.9)  78.7(1.2)
    wbc    9   455    228    683   95.9(0.6)  95.7(0.5)  96.2(0.7)  96.2(0.8)  95.8(0.7)  93.7(2.0)  96.5(0.7)
    AP                             83.7       83.9       84.1       83.6       83.7       83.2       83.7
    AR                              2.3        2.5        2.5        3.8        3.2        4.2        2.6
    PST                            1.000      0.754      1.000      0.344      0.754      0.344      0.508

  • Slide 26/84

    Bayesian inference of LS-SVMs: example on the Ripley data set

    [Figure: left, the data in the (x1, x2) plane with posterior class probability
    contours from 0.1 to 0.9; right, the posterior class probability surface over (x1, x2).]

  • Slide 27/84

    Bayesian inference of LS-SVMs: sinc example

    [Figure: left, the noisy sinc data and the estimated model (y versus x);
    right, the model evidence log p(D|H) as a function of the kernel parameter.]

  • Slide 28/84

    Sparseness

    [Figure: sorted |alpha_k| spectra; the QP-SVM spectrum is sparse (many alpha_k
    exactly zero), while the LS-SVM spectrum is not.]

    Lack of sparseness in the LS-SVM case, but ... sparseness can be imposed by
    applying pruning techniques existing in the neural networks area (e.g.
    optimal brain damage, optimal brain surgeon, etc.), gradually omitting the
    points with the smallest |alpha_k| and re-estimating the model.

    [Figure: sorted |alpha_k| for the LS-SVM versus the resulting sparse LS-SVM.]

  • Slide 29/84

    Robustness

    SVM: convex cost function --(convex optimization)--> SVM solution
    Weighted LS-SVM: modified cost function --(robust statistics)--> LS-SVM solution

    Weighted LS-SVM:
    $\min_{w,b,e} \; \mathcal{J}(w,e) = \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N v_k e_k^2$
    such that $y_k = w^T \varphi(x_k) + b + e_k, \quad k = 1,\ldots,N$
    with the weights $v_k$ determined from the distribution of $\{e_k\}_{k=1}^N$
    obtained from the unweighted LS-SVM.
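    A minimal NumPy sketch of such a weighted pass: one unweighted solve, residual-based weights from a robust scale estimate (here the MAD; the cut-off constants are illustrative), then one re-weighted solve where the diagonal becomes $1/(\gamma v_k)$ since $\alpha_k = \gamma v_k e_k$.

    import numpy as np

    def weighted_lssvm(Omega, y, gamma=10.0, c1=2.5, c2=3.0):
        """Unweighted LS-SVM pass, then a re-weighted pass with weights v_k
        chosen from the residual distribution (a Hampel-type rule)."""
        N = len(y)

        def solve(v):
            A = np.zeros((N + 1, N + 1))
            A[0, 1:] = 1.0
            A[1:, 0] = 1.0
            A[1:, 1:] = Omega + np.diag(1.0 / (gamma * v))
            sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
            return sol[1:], sol[0]

        alpha, b = solve(np.ones(N))                       # unweighted LS-SVM
        e = alpha / gamma                                   # residuals (alpha_k = gamma e_k)
        s = np.median(np.abs(e - np.median(e))) / 0.6745    # robust scale (MAD)
        r = np.abs(e / s)
        v = np.where(r <= c1, 1.0,
                     np.where(r <= c2, (c2 - r) / (c2 - c1), 1e-4))
        return solve(v)                                     # weighted LS-SVM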

  • Slide 30/84

    L2 and L1 estimators: score function

    [Figure: the L2 loss with its linear score function, and the L1 loss with its
    bounded (sign) score function.]

    With the L1 estimator the influence of outliers is reduced (the score
    function in robust statistics means the derivative of the loss function).

  • Slide 31/84

    Loss function and score function

    Examples of score functions in robust statistics:
    Huber, Andrews, Hampel, Tukey

    [Figure: the corresponding loss and score function shapes.]

    In practice: iterative weighting according to the score function.
    Note that all these score functions (except Huber's) lead to non-convex
    loss functions. A further application of robust statistics is robust
    cross-validation (De Brabanter et al., 2002), with good performance in
    terms of the efficiency-robustness trade-off.

  • Slide 32/84

    Large scale methods

    Nyström method (GP)
    Fixed size LS-SVM
    Committee networks and extensions

  • Slide 33/84

    Nyström method

    Reference in GP: Williams & Seeger (2001)

    Big kernel matrix $\Omega_{(N,N)} \in \mathbb{R}^{N \times N}$;
    small kernel matrix $\Omega_{(M,M)} \in \mathbb{R}^{M \times M}$
    (based on a random subsample; in practice often $M \ll N$)

    Eigenvalue decomposition of the small matrix:
    $\Omega_{(M,M)}\, U = U \Lambda$

    The relation to the eigenvalues and eigenfunctions of the integral equation
    $\int K(x, x')\, \phi_i(x)\, p(x)\, dx = \lambda_i \phi_i(x')$
    is given by
    $\hat{\lambda}_i = \frac{1}{M} \lambda_i$
    $\hat{\phi}_i(x_k) = \sqrt{M}\, u_{ki}$
    $\hat{\phi}_i(x') = \frac{\sqrt{M}}{\lambda_i} \sum_{k=1}^M u_{ki} K(x_k, x')$
    where $\hat{\lambda}_i$ and $\hat{\phi}_i$ are estimates of $\lambda_i$ and $\phi_i$,
    respectively, and $u_{ki}$ denotes the $ki$-th entry of the matrix $U$.

  • Slide 34/84

    For the big matrix:
    $\Omega_{(N,N)}\, \tilde{U} = \tilde{U} \tilde{\Lambda}$

    Furthermore, one has
    $\hat{\tilde{\lambda}}_i = \frac{N}{M}\, \lambda_i$
    $\hat{\tilde{u}}_i = \sqrt{\frac{N}{M}}\, \frac{1}{\lambda_i}\, \Omega_{(N,M)}\, u_i$

    One can then show that
    $\Omega_{(N,N)} \approx \Omega_{(N,M)}\, \Omega_{(M,M)}^{-1}\, \Omega_{(M,N)}$
    where $\Omega_{(N,M)}$ is the $N \times M$ block matrix taken from $\Omega_{(N,N)}$.

    The solution to the big linear system
    $(\Omega_{(N,N)} + I/\gamma)\, \alpha = y$
    can then be written as
    $\alpha = \gamma \left[\, y - \tilde{U} \left( \tfrac{1}{\gamma} \tilde{\Lambda}^{-1} + \tilde{U}^T \tilde{U} \right)^{-1} \tilde{U}^T y \,\right]$
    by applying the Sherman-Morrison-Woodbury formula.

    Some numerical difficulties, as pointed out by Fine & Scheinberg (2001).
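    A minimal NumPy sketch of the Nyström approximation $\Omega_{(N,N)} \approx \Omega_{(N,M)} \Omega_{(M,M)}^{-1} \Omega_{(M,N)}$ from a random subsample; for clarity it starts from the full kernel matrix, which one would of course avoid in a real large-scale setting.

    import numpy as np

    def nystrom_approximation(K_full, M, rng=np.random.default_rng(0)):
        """Approximate an N x N kernel matrix from a random M-point subsample."""
        N = K_full.shape[0]
        idx = rng.choice(N, size=M, replace=False)
        K_MM = K_full[np.ix_(idx, idx)]
        K_NM = K_full[:, idx]
        lam, U = np.linalg.eigh(K_MM)          # eigendecomposition of the small matrix
        pos = lam > 1e-12                       # pseudo-inverse over positive eigenvalues
        K_MM_inv = (U[:, pos] / lam[pos]) @ U[:, pos].T
        return K_NM @ K_MM_inv @ K_NM.T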

  • Slide 35/84

    Computation in primal or dual space?

    Primal space: preferable for large data sets.
    Dual space: preferable for large dimensional inputs.
    Question: how can one work in the primal space in the nonlinear case?

  • Slide 36/84

    Fixed Size LS-SVM

    Related work on basis construction in feature space:
    Cawley (2002), Csato & Opper (2002), Smola & Schölkopf (2002),
    Baudat & Anouar (2001).

    Model in the primal space:
    $\min_{w \in \mathbb{R}^{n_h},\, b \in \mathbb{R}} \; \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N \left( y_k - (w^T \varphi(x_k) + b) \right)^2.$

    Observation: for a linear model it is computationally better to solve the
    primal problem (one knows that $\varphi(x_k) = x_k$).
    Can we do this for the nonlinear case too?

    Employ the Nyström method:
    $\hat{\varphi}_i(x') = \sqrt{\hat{\lambda}_i}\, \hat{\phi}_i(x') = \frac{1}{\sqrt{\lambda_i}} \sum_{k=1}^M u_{ki} K(x_k, x'),$
    assuming a fixed size $M$.

    The model then becomes
    $y(x) = w^T \hat{\varphi}(x) + b = \sum_{i=1}^M w_i \frac{1}{\sqrt{\lambda_i}} \sum_{k=1}^M u_{ki} K(x_k, x) + b.$

  • Slide 37/84

    The support values corresponding to the $M$ support vectors equal
    $\alpha_k = \sum_{i=1}^M w_i \frac{u_{ki}}{\sqrt{\lambda_i}}$
    when one represents the model as
    $y(x) = \sum_{k=1}^M \alpha_k K(x_k, x) + b.$

    How to select a working set of $M$ support vectors?
    Is taking a random subsample the only option?

  • Slide 38/84

    Fixed Size LS-SVM: selection of the support vectors

    Girolami (2002): link between the Nyström method, kernel PCA,
    density estimation and entropy criteria.

    The quadratic Renyi entropy
    $H_R = -\log \int p(x)^2\, dx$
    has been related to kernel PCA and density estimation, with
    $\int \hat{p}(x)^2\, dx = \frac{1}{N^2}\, \mathbf{1}_v^T\, \Omega\, \mathbf{1}_v$
    where $\mathbf{1}_v = [1; 1; \ldots; 1]$ and a normalized kernel is assumed with
    respect to the density estimation.

    Fixed Size LS-SVM: take a working set of M support vectors and select the
    vectors according to the entropy criterion (instead of a random subsample
    as in the Nyström method).
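    A minimal sketch of this entropy criterion for a candidate working set, assuming a Gaussian Parzen density estimate (the normalization below follows the Gaussian convolution identity; other kernel conventions are possible).

    import numpy as np

    def quadratic_renyi_entropy(X, sigma=1.0):
        """H_R = -log integral p(x)^2 dx, estimated as (1/N^2) 1^T Omega 1
        for a Gaussian Parzen density estimate with bandwidth sigma."""
        N, d = X.shape
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        # convolution of two Gaussians of variance sigma^2: variance 2*sigma^2
        Omega = np.exp(-sq / (4 * sigma ** 2)) / ((4 * np.pi * sigma ** 2) ** (d / 2))
        return -np.log(np.sum(Omega) / N ** 2)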

  • Slide 39/84

    Fixed Size LS-SVM

    [Diagram: the dual-space ingredients (Nyström method, kernel PCA, density
    estimation, entropy criteria) yield eigenfunctions and an active selection
    of support vectors; the regression itself is then carried out in the primal space.]

  • Slide 40/84

    Fixed Size LS-SVM

    [Diagram: points are randomly selected from the pool of training data; a
    candidate point enters the working set ("new point IN") or is sent back
    ("present or new point OUT") depending on the entropy criterion.]

  • Slide 41/84

    Fixed size LS-SVM algorithm:

    1. Given normalized and standardized training data $\{x_k, y_k\}_{k=1}^N$ with
       inputs $x_k \in \mathbb{R}^n$, outputs $y_k \in \mathbb{R}$ and $N$ training data.
    2. Choose a working set of size M, imposing in this way M support
       vectors (typically $M \ll N$).
    3. Randomly select a support vector $x^*$ from the working set of M
       support vectors.
    4. Randomly select a point $x_t$ from the N training data and replace $x^*$
       by $x_t$ in the working set. If the entropy increases by taking the point
       $x_t$ instead of $x^*$, then $x_t$ is accepted for the working set of M
       support vectors; otherwise $x_t$ is rejected (and returned to the training
       data pool) and the support vector $x^*$ stays in the working set.
    5. Calculate the entropy value for the present working set.
    6. Stop if the change in entropy value is small or the number of iterations
       is exceeded, otherwise go to (3).
    7. Estimate w, b in the primal space after estimating the eigenfunctions
       from the Nyström approximation.

    (A small numerical sketch of the selection loop in steps 3-6 is given below.)
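    A minimal sketch of that selection loop, assuming an entropy callable such as the one sketched earlier; the iteration count and acceptance rule are illustrative.

    import numpy as np

    def select_support_vectors(X, M, entropy, n_iter=1000, rng=np.random.default_rng(0)):
        """Entropy-based active selection of a working set of M support vectors."""
        N = X.shape[0]
        work = list(rng.choice(N, size=M, replace=False))
        best = entropy(X[work])
        for _ in range(n_iter):
            i = rng.integers(M)          # position in the working set to swap out
            t = rng.integers(N)          # candidate training point
            if t in work:
                continue
            trial = work.copy()
            trial[i] = t
            h = entropy(X[trial])
            if h > best:                 # accept only if the entropy increases
                work, best = trial, h
        return work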

  • Slide 42/84

    Fixed Size LS-SVM: sinc example

    [Figure: four snapshots of the support vector selection and the resulting
    estimate on a noisy sinc data set.]

    demo in LS-SVMlab

  • Slide 43/84

    [Figure: evolution of the entropy (left) and of the test set error (right)
    as a function of the iteration step, on logarithmic scales.]

  • Slide 44/84

    Fixed Size LS-SVM: sinc example

    [Figure: the noisy training data (left) and the estimated model (right).]

    Sinc function with 20000 data points

  • Slide 45/84

    Fixed Size LS-SVM: spiral example

    [Figure: four snapshots of the support vector selection on the two-spiral data.]

  • Slide 46/84

    Fixed Size LS-SVM: spiral example

    [Figure: the resulting classification boundary in the (x1, x2) plane.]

  • Slide 47/84

    Committee network of LS-SVMs

    [Diagram: the data set is split into subsets D_1, D_2, ..., D_m; an LS-SVM is
    trained on each subset D_i and their outputs are combined into f_c(x).]

  • Slide 48/84

    Committee network of LS-SVMs

    Sinc function with 4000 training data

    [Figure: the noisy training data (left) and the committee estimate (right).]

  • Slide 49/84

    Results of the individual LS-SVM models

    [Figure: the estimates obtained by the four individual LS-SVM submodels.]

  • Slide 50/84

    Nonlinear combination of LS-SVMs

    [Diagram: the outputs of the LS-SVM submodels trained on D_1, ..., D_m are
    combined through a nonlinear function g.]

    This results in a multilayer network
    (layers of (LS-)SVMs, or e.g. an MLP + LS-SVM combination).

  • Slide 51/84

  • Slide 52/84

    Committee network of LS-SVMs

    Linear versus nonlinear combinations of trained LS-SVM submodels under
    heavy Gaussian noise

    [Figure: the heavily noisy training data and the estimates obtained with the
    linear and the nonlinear committee combinations.]

  • Slide 53/84

    Classical PCA formulation

    Given data $\{x_k\}_{k=1}^N$ with $x_k \in \mathbb{R}^n$.

    Find projected variables $w^T x_k$ with maximal variance:
    $\max_w \; \mathrm{Var}(w^T x) = \mathrm{Cov}(w^T x, w^T x) \simeq \frac{1}{N} \sum_{k=1}^N (w^T x_k)^2 = w^T C\, w$
    where $C = (1/N) \sum_{k=1}^N x_k x_k^T$.

    Consider the constraint $w^T w = 1$.

    Constrained optimization:
    $\mathcal{L}(w; \lambda) = \frac{1}{2} w^T C w - \lambda (w^T w - 1)$
    with Lagrange multiplier $\lambda$.

    Eigenvalue problem
    $C w = \lambda w$
    with $C = C^T \ge 0$, obtained from $\partial\mathcal{L}/\partial w = 0$, $\partial\mathcal{L}/\partial \lambda = 0$.

  • Slide 54/84

    PCA analysis as a one-class modelling problem

    One-class with target value zero:
    $\max_w \sum_{k=1}^N (0 - w^T x_k)^2$

    Score variables: $z = w^T x$

    [Illustration: the data in the input space are projected by $w^T x$ onto the
    target space, which is centered at 0.]

  • Slide 55/84

    LS-SVM interpretation of FDA

    [Illustration: the two classes are mapped by $w^T x + b$ onto the targets -1
    and +1 in the target space; minimize the within-class scatter around the targets.]

    LS-SVM interpretation of PCA

    [Illustration: the data are mapped by $w^T(x - \bar{x})$ around the target 0;
    find the direction with maximal variance.]

  • Slide 56/84

    SVM formulation of linear PCA

    Primal problem:
    $P: \; \max_{w,e} \; \mathcal{J}_P(w,e) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w$
    such that $e_k = w^T x_k, \quad k = 1,\ldots,N$

    Lagrangian
    $\mathcal{L}(w,e;\alpha) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w - \sum_{k=1}^N \alpha_k \left( e_k - w^T x_k \right)$

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k x_k$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k - w^T x_k = 0, \quad k = 1,\ldots,N$

  • Slide 57/84

    Elimination of the variables $e, w$ gives
    $\frac{1}{\gamma}\,\alpha_k - \sum_{l=1}^N \alpha_l\, x_l^T x_k = 0, \quad k = 1,\ldots,N$
    and after defining $\lambda = 1/\gamma$ one obtains the eigenvalue problem
    $D: \text{ solve in } \alpha:$
    $\begin{bmatrix} x_1^T x_1 & \cdots & x_1^T x_N \\ \vdots & & \vdots \\ x_N^T x_1 & \cdots & x_N^T x_N \end{bmatrix}
     \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix} = \lambda \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix}$
    as the dual problem (quantization in terms of $\lambda = 1/\gamma$).

    The score variables become
    $z(x) = w^T x = \sum_{l=1}^N \alpha_l\, x_l^T x$

    Optimal solution corresponding to the largest eigenvalue:
    $\sum_{k=1}^N (w^T x_k)^2 = \sum_{k=1}^N e_k^2 = \sum_{k=1}^N \frac{1}{\gamma^2}\alpha_k^2 = \lambda_{\max}^2$
    where $\sum_{k=1}^N \alpha_k^2 = 1$ for the normalized eigenvector.

    Many data points: better to solve the primal problem.
    Many inputs: better to solve the dual problem.
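    A minimal NumPy sketch of this dual (Gram-matrix) view of linear PCA for zero-mean data; it returns the leading direction, the scores and the largest eigenvalue. Names are illustrative.

    import numpy as np

    def pca_scores_via_gram(X):
        """Solve G alpha = lambda alpha with G_kl = x_k^T x_l (zero-mean X, shape N x n)
        and form the score z(x) = sum_l alpha_l x_l^T x."""
        G = X @ X.T                      # Gram matrix
        lam, A = np.linalg.eigh(G)       # eigenvalues in ascending order
        alpha = A[:, -1]                 # eigenvector of the largest eigenvalue
        w = X.T @ alpha                  # w = sum_l alpha_l x_l
        scores = X @ w                   # z(x_k) = w^T x_k
        return w, scores, lam[-1]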

  • Slide 58/84

    SVM formulation of PCA analysis with a bias term

    Usually one applies PCA analysis to centered data and considers
    $\max_w \sum_{k=1}^N [w^T (x_k - \bar{x})]^2$
    where $\bar{x} = \frac{1}{N} \sum_{k=1}^N x_k$.

    Bias term formulation:
    score variables
    $z(x) = w^T x + b$
    and objective
    $\max_{w,b} \sum_{k=1}^N [0 - (w^T x_k + b)]^2$

  • Slide 59/84

    Primal optimization problem
    $P: \; \max_{w,e} \; \mathcal{J}_P(w,e) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w$
    such that $e_k = w^T x_k + b, \quad k = 1,\ldots,N$

    Lagrangian
    $\mathcal{L}(w,e;\alpha) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w - \sum_{k=1}^N \alpha_k \left( e_k - w^T x_k - b \right)$

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k x_k$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial b = 0 \;\Rightarrow\; \sum_{k=1}^N \alpha_k = 0$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k - w^T x_k - b = 0, \quad k = 1,\ldots,N$

    Applying $\sum_{k=1}^N \alpha_k = 0$ yields
    $b = -\frac{1}{N} \sum_{k=1}^N \sum_{l=1}^N \alpha_l\, x_l^T x_k.$

  • Slide 60/84

    By defining $\lambda = 1/\gamma$ one obtains the dual problem
    $D: \text{ solve in } \alpha:$
    $\begin{bmatrix} (x_1 - \bar{x})^T (x_1 - \bar{x}) & \cdots & (x_1 - \bar{x})^T (x_N - \bar{x}) \\ \vdots & & \vdots \\ (x_N - \bar{x})^T (x_1 - \bar{x}) & \cdots & (x_N - \bar{x})^T (x_N - \bar{x}) \end{bmatrix} \alpha = \lambda\, \alpha$
    which is an eigenvalue decomposition of the centered Gram matrix
    $\Omega_c\, \alpha = \lambda\, \alpha$
    with $\Omega_c = M_c \Omega M_c$ where $M_c = I - \mathbf{1}_v \mathbf{1}_v^T / N$, $\mathbf{1}_v = [1; 1; \ldots; 1]$
    and $\Omega_{kl} = x_k^T x_l$ for $k, l = 1,\ldots,N$.

    Score variables:
    $z(x) = w^T x + b = \sum_{l=1}^N \alpha_l\, x_l^T x + b$
    where $\alpha$ is the eigenvector corresponding to the largest eigenvalue and
    $\sum_{k=1}^N (w^T x_k + b)^2 = \sum_{k=1}^N e_k^2 = \sum_{k=1}^N \frac{1}{\gamma^2}\alpha_k^2 = \lambda_{\max}^2.$

  • Slide 61/84

    Reconstruction problem for linear PCA

    Reconstruction error:
    $\min \sum_{k=1}^N \|x_k - \tilde{x}_k\|_2^2$
    where $\tilde{x}_k$ are the variables reconstructed from the score variables, with
    $\tilde{x} = V z + \beta.$
    Hence
    $\min_{V, \beta} \sum_{k=1}^N \|x_k - (V z_k + \beta)\|_2^2$

    Information bottleneck:
    $x \;\to\; z \;\to\; \tilde{x}$
    $z = w^T x + b, \qquad \tilde{x} = V z + \beta$

  • Slide 62/84

    [Figure: an example data set in the (x1, x2) space and the corresponding
    score variables in the (z1, z2) space.]

  • Slide 63/84

    LS-SVM approach to kernel PCA

    Create a nonlinear version of the method by mapping the input space to a
    high dimensional feature space and applying the kernel trick
    (kernel PCA - Schölkopf et al.; SVM approach - Suykens et al., 2002).

    Primal optimization problem:
    $P: \; \max_{w,e} \; \mathcal{J}_P(w,e) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w$
    such that $e_k = w^T (\varphi(x_k) - \hat{\mu}_\varphi), \quad k = 1,\ldots,N.$

    Lagrangian
    $\mathcal{L}(w,e;\alpha) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w - \sum_{k=1}^N \alpha_k \left( e_k - w^T(\varphi(x_k) - \hat{\mu}_\varphi) \right)$

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k (\varphi(x_k) - \hat{\mu}_\varphi)$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k - w^T(\varphi(x_k) - \hat{\mu}_\varphi) = 0, \quad k = 1,\ldots,N.$

  • Slide 64/84

    By elimination of the variables $e, w$ and defining $\lambda = 1/\gamma$ one obtains
    $D: \text{ solve in } \alpha: \quad \Omega_c\, \alpha = \lambda\, \alpha$
    with
    $\Omega_{c,kl} = (\varphi(x_k) - \hat{\mu}_\varphi)^T (\varphi(x_l) - \hat{\mu}_\varphi), \quad k, l = 1,\ldots,N$
    the elements of the centered kernel matrix.

    Score variables:
    $z(x) = w^T (\varphi(x) - \hat{\mu}_\varphi) = \sum_{l=1}^N \alpha_l\, (\varphi(x_l) - \hat{\mu}_\varphi)^T (\varphi(x) - \hat{\mu}_\varphi)$
    $= \sum_{l=1}^N \alpha_l \Big( K(x_l, x) - \frac{1}{N}\sum_{r=1}^N K(x_r, x) - \frac{1}{N}\sum_{r=1}^N K(x_r, x_l) + \frac{1}{N^2}\sum_{r=1}^N \sum_{s=1}^N K(x_r, x_s) \Big).$
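    A minimal NumPy sketch of kernel PCA through the centered kernel matrix $\Omega_c = M_c \Omega M_c$; the kernel matrix K is assumed to be given, and the names are illustrative.

    import numpy as np

    def kernel_pca_scores(K, n_components=2):
        """Eigen-decompose the centered kernel matrix and return training scores."""
        N = K.shape[0]
        Mc = np.eye(N) - np.ones((N, N)) / N
        Kc = Mc @ K @ Mc                          # centered kernel matrix
        lam, A = np.linalg.eigh(Kc)               # ascending eigenvalues
        idx = np.argsort(lam)[::-1][:n_components]
        alphas, lams = A[:, idx], lam[idx]
        scores = Kc @ alphas                      # z_i(x_k) = sum_l alpha_{l,i} (Omega_c)_{kl}
        return scores, alphas, lams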

  • Slide 65/84

    LS-SVM interpretation of kernel PCA

    [Illustration: the input data are mapped by $\varphi(\cdot)$ into the feature
    space and then projected by $w^T(\varphi(x) - \hat{\mu}_\varphi)$ onto the target
    space centered at 0; find the direction with maximal variance.]

  • Slide 66/84

    [Figure: an example data set in the (x1, x2) plane (left) and the eigenvalue
    spectrum $\lambda_i$ of the centered kernel matrix (right).]

  • Slide 67/84

    Canonical Correlation Analysis

    Hotelling (1936): linear CCA analysis.

    Applications e.g. in system identification, subspace algorithms,
    signal processing and information theory.

    New applications in bioinformatics (correlations between genes and between
    experiments, in both directions of a microarray data matrix).

    Find the maximal correlation between the projected variables
    $z_x = w^T x$ and $z_y = v^T y$
    where $x \in \mathbb{R}^{n_x}$, $y \in \mathbb{R}^{n_y}$ denote given random vectors with zero mean.

    Objective: achieve a maximal correlation coefficient
    $\max_{w,v} \; \rho = \frac{\mathcal{E}[z_x z_y]}{\sqrt{\mathcal{E}[z_x z_x]}\,\sqrt{\mathcal{E}[z_y z_y]}} = \frac{w^T C_{xy} v}{\sqrt{w^T C_{xx} w}\,\sqrt{v^T C_{yy} v}}$
    with $C_{xx} = \mathcal{E}[x x^T]$, $C_{yy} = \mathcal{E}[y y^T]$, $C_{xy} = \mathcal{E}[x y^T]$.

  • Slide 68/84

    This is formulated as the constrained optimization problem
    $\max_{w,v} \; w^T C_{xy} v$
    such that $w^T C_{xx} w = 1$, $v^T C_{yy} v = 1$,
    which leads to a generalized eigenvalue problem.

    The solution follows from the Lagrangian
    $\mathcal{L}(w,v;\eta,\nu) = w^T C_{xy} v - \frac{\eta}{2}\,(w^T C_{xx} w - 1) - \frac{\nu}{2}\,(v^T C_{yy} v - 1)$
    with Lagrange multipliers $\eta, \nu$, which gives
    $C_{xy}\, v = \eta\, C_{xx}\, w$
    $C_{yx}\, w = \nu\, C_{yy}\, v.$
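    A minimal SciPy sketch of linear CCA solved as a generalized symmetric eigenvalue problem; the small ridge added to $C_{xx}$ and $C_{yy}$ is an implementation safeguard, not part of the slides.

    import numpy as np
    from scipy.linalg import eigh

    def linear_cca(X, Y):
        """Solve [0 Cxy; Cyx 0][w; v] = rho [Cxx 0; 0 Cyy][w; v]
        for zero-mean data X (N x nx) and Y (N x ny)."""
        N, nx = X.shape
        ny = Y.shape[1]
        Cxx = X.T @ X / N + 1e-8 * np.eye(nx)
        Cyy = Y.T @ Y / N + 1e-8 * np.eye(ny)
        Cxy = X.T @ Y / N
        A = np.block([[np.zeros((nx, nx)), Cxy],
                      [Cxy.T, np.zeros((ny, ny))]])
        B = np.block([[Cxx, np.zeros((nx, ny))],
                      [np.zeros((ny, nx)), Cyy]])
        rho, V = eigh(A, B)                 # generalized symmetric-definite problem
        w, v = V[:nx, -1], V[nx:, -1]       # directions of the largest correlation
        return w, v, rho[-1]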

  • Slide 69/84

    SVM formulation of CCA

    In the primal weight space:
    $P: \; \max_{w,v,e,r} \; \gamma \sum_{k=1}^N e_k r_k - \nu_1\,\frac{1}{2}\sum_{k=1}^N e_k^2 - \nu_2\,\frac{1}{2}\sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$
    such that $e_k = w^T x_k, \quad k = 1,\ldots,N$
              $r_k = v^T y_k, \quad k = 1,\ldots,N$

    Lagrangian:
    $\mathcal{L}(w,v,e,r;\alpha,\beta) = \gamma \sum_{k=1}^N e_k r_k - \nu_1\,\frac{1}{2}\sum_{k=1}^N e_k^2 - \nu_2\,\frac{1}{2}\sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$
    $\qquad - \sum_{k=1}^N \alpha_k\,[e_k - w^T x_k] - \sum_{k=1}^N \beta_k\,[r_k - v^T y_k]$
    where $\alpha_k$, $\beta_k$ are Lagrange multipliers.

    Related objective in the CCA literature (Gittins, 1985):
    $\min_{w,v} \sum_k \| w^T x_k - v^T y_k \|_2^2$

  • Slide 70/84

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k x_k$
    $\partial\mathcal{L}/\partial v = 0 \;\Rightarrow\; v = \sum_{k=1}^N \beta_k y_k$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \gamma\, v^T y_k = \nu_1\, w^T x_k + \alpha_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial r_k = 0 \;\Rightarrow\; \gamma\, w^T x_k = \nu_2\, v^T y_k + \beta_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k = w^T x_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \beta_k = 0 \;\Rightarrow\; r_k = v^T y_k, \quad k = 1,\ldots,N$

    resulting in the following dual problem after defining $\lambda = 1/\gamma$:
    $D: \text{ solve in } \alpha, \beta:$
    $\begin{bmatrix} 0 & \Omega_y \\ \Omega_x & 0 \end{bmatrix}\begin{bmatrix}\alpha\\\beta\end{bmatrix}
     = \lambda \begin{bmatrix} \nu_1 \Omega_x + I & 0 \\ 0 & \nu_2 \Omega_y + I \end{bmatrix}\begin{bmatrix}\alpha\\\beta\end{bmatrix}$
    where $(\Omega_x)_{kl} = x_k^T x_l$ and $(\Omega_y)_{kl} = y_k^T y_l$ are the Gram matrices of the x- and y-data.

  • Slide 71/84

    Resulting score variables
    $z_{x_k} = e_k = \sum_{l=1}^N \alpha_l\, x_l^T x_k$
    $z_{y_k} = r_k = \sum_{l=1}^N \beta_l\, y_l^T y_k.$

    The eigenvalues $\lambda$ will be both positive and negative for the CCA problem.
    Also note that one has $\lambda \in [-1, 1]$.

  • Slide 72/84

    Extension to kernel CCA

    Score variables
    $z_x = w^T (\varphi_1(x) - \hat{\mu}_1)$
    $z_y = v^T (\varphi_2(y) - \hat{\mu}_2)$
    where $\varphi_1(\cdot): \mathbb{R}^{n_x} \to \mathbb{R}^{n_{h_x}}$ and $\varphi_2(\cdot): \mathbb{R}^{n_y} \to \mathbb{R}^{n_{h_y}}$ are mappings
    (which can be chosen to be different) to high dimensional feature spaces and
    $\hat{\mu}_1 = (1/N)\sum_{k=1}^N \varphi_1(x_k)$, $\hat{\mu}_2 = (1/N)\sum_{k=1}^N \varphi_2(y_k)$.

    Primal problem
    $P: \; \max_{w,v,e,r} \; \gamma \sum_{k=1}^N e_k r_k - \nu_1\,\frac{1}{2}\sum_{k=1}^N e_k^2 - \nu_2\,\frac{1}{2}\sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$
    such that $e_k = w^T(\varphi_1(x_k) - \hat{\mu}_1), \quad k = 1,\ldots,N$
              $r_k = v^T(\varphi_2(y_k) - \hat{\mu}_2), \quad k = 1,\ldots,N$

    with Lagrangian
    $\mathcal{L}(w,v,e,r;\alpha,\beta) = \gamma \sum_{k=1}^N e_k r_k - \nu_1\,\frac{1}{2}\sum_{k=1}^N e_k^2 - \nu_2\,\frac{1}{2}\sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$
    $\qquad - \sum_{k=1}^N \alpha_k\,[e_k - w^T(\varphi_1(x_k) - \hat{\mu}_1)] - \sum_{k=1}^N \beta_k\,[r_k - v^T(\varphi_2(y_k) - \hat{\mu}_2)]$
    where $\alpha_k, \beta_k$ are Lagrange multipliers. Note that $w$ and $v$ might be
    infinite dimensional now.

  • Slide 73/84

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k (\varphi_1(x_k) - \hat{\mu}_1)$
    $\partial\mathcal{L}/\partial v = 0 \;\Rightarrow\; v = \sum_{k=1}^N \beta_k (\varphi_2(y_k) - \hat{\mu}_2)$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \gamma\, v^T(\varphi_2(y_k) - \hat{\mu}_2) = \nu_1\, w^T(\varphi_1(x_k) - \hat{\mu}_1) + \alpha_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial r_k = 0 \;\Rightarrow\; \gamma\, w^T(\varphi_1(x_k) - \hat{\mu}_1) = \nu_2\, v^T(\varphi_2(y_k) - \hat{\mu}_2) + \beta_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k = w^T(\varphi_1(x_k) - \hat{\mu}_1), \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \beta_k = 0 \;\Rightarrow\; r_k = v^T(\varphi_2(y_k) - \hat{\mu}_2), \quad k = 1,\ldots,N$

    resulting in the following dual problem after defining $\lambda = 1/\gamma$:
    $D: \text{ solve in } \alpha, \beta:$
    $\begin{bmatrix} 0 & \Omega_{c,2} \\ \Omega_{c,1} & 0 \end{bmatrix}\begin{bmatrix}\alpha\\\beta\end{bmatrix}
     = \lambda \begin{bmatrix} \nu_1 \Omega_{c,1} + I & 0 \\ 0 & \nu_2 \Omega_{c,2} + I \end{bmatrix}\begin{bmatrix}\alpha\\\beta\end{bmatrix}$
    with
    $\Omega_{c,1,kl} = (\varphi_1(x_k) - \hat{\mu}_1)^T(\varphi_1(x_l) - \hat{\mu}_1)$
    $\Omega_{c,2,kl} = (\varphi_2(y_k) - \hat{\mu}_2)^T(\varphi_2(y_l) - \hat{\mu}_2)$

  • Slide 74/84

    These are the elements of the centered Gram matrices for $k, l = 1,\ldots,N$.
    In practice these matrices can be computed by
    $\Omega_{c,1} = M_c\, \Omega_1\, M_c, \qquad \Omega_{c,2} = M_c\, \Omega_2\, M_c$
    with centering matrix $M_c = I - (1/N)\,\mathbf{1}_v \mathbf{1}_v^T$.

    The resulting score variables can be computed by applying the kernel trick with kernels
    $K_1(x_k, x_l) = \varphi_1(x_k)^T \varphi_1(x_l)$
    $K_2(y_k, y_l) = \varphi_2(y_k)^T \varphi_2(y_l)$

    Related work: Bach & Jordan (2002), Rosipal & Trejo (2001)

  • Slide 75/84

    Recurrent Least Squares SVMs

    The standard SVM is only for static estimation problems.
    Dynamic case: recurrent LS-SVM (Suykens, 1999).

    Feedforward/NARX/series-parallel:
    $y_k = f(y_{k-1}, y_{k-2}, \ldots, y_{k-p}, u_{k-1}, u_{k-2}, \ldots, u_{k-p})$
    with input $u_k \in \mathbb{R}$ and output $y_k \in \mathbb{R}$.
    Training by the classical SVM method.

    Recurrent/NOE/parallel:
    $\hat{y}_k = f(\hat{y}_{k-1}, \hat{y}_{k-2}, \ldots, \hat{y}_{k-p}, u_{k-1}, u_{k-2}, \ldots, u_{k-p})$
    Classically trained by Narendra's dynamic backpropagation or Werbos'
    backpropagation through time.

    Parametrization for the autonomous case:
    $\hat{y}_k = w^T \varphi\big( [\hat{y}_{k-1}; \hat{y}_{k-2}; \ldots; \hat{y}_{k-p}] \big) + b$
    which is equivalent to
    $y_k - e_k = w^T \varphi\big( x_{k-1|k-p} - \xi_{k-1|k-p} \big) + b$
    where $e_k = y_k - \hat{y}_k$, $\xi_{k-1|k-p} = [e_{k-1}; e_{k-2}; \ldots; e_{k-p}]$,
    $x_{k-1|k-p} = [y_{k-1}; y_{k-2}; \ldots; y_{k-p}]$.

  • Slide 76/84

    Optimization problem:
    $\min_{w,e} \; \mathcal{J}(w,e) = \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=p+1}^{N+p} e_k^2$
    subject to
    $y_k - e_k = w^T \varphi\big( x_{k-1|k-p} - \xi_{k-1|k-p} \big) + b, \quad k = p+1,\ldots,N+p.$

    Lagrangian
    $\mathcal{L}(w,b,e;\alpha) = \mathcal{J}(w,b,e) + \sum_{k=p+1}^{N+p} \alpha_{k-p}\,\big[ y_k - e_k - w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) - b \big]$

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = w - \sum_{k=p+1}^{N+p} \alpha_{k-p}\, \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) = 0$
    $\partial\mathcal{L}/\partial b = \sum_{k=p+1}^{N+p} \alpha_{k-p} = 0$
    $\partial\mathcal{L}/\partial e_k = \gamma e_k - \alpha_{k-p} - \sum_{i=1}^{p} \alpha_{k-p+i}\, \frac{\partial}{\partial e_{k-i}}\big[ w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) \big] = 0, \quad k = p+1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_{k-p} = y_k - e_k - w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) - b = 0, \quad k = p+1,\ldots,N+p$

  • Slide 77/84

    The Mercer condition
    $K(z_{k-1|k-p}, z_{l-1|l-p}) = \varphi(z_{k-1|k-p})^T \varphi(z_{l-1|l-p})$
    gives
    $\sum_{k=p+1}^{N+p} \alpha_{k-p} = 0$
    $\gamma e_k - \alpha_{k-p} - \sum_{i=1}^{p} \alpha_{k-p+i}\, \frac{\partial}{\partial e_{k-i}}\Big[ \sum_{l=p+1}^{N+p} \alpha_{l-p} K(z_{k-1|k-p}, z_{l-1|l-p}) \Big] = 0, \quad k = p+1,\ldots,N$
    $y_k - e_k - \sum_{l=p+1}^{N+p} \alpha_{l-p} K(z_{k-1|k-p}, z_{l-1|l-p}) - b = 0, \quad k = p+1,\ldots,N+p$

    Solving this set of nonlinear equations in the unknowns $e, \alpha, b$ is
    computationally very expensive.

  • Slide 78/84

    Recurrent LS-SVMs: example

    Chua's circuit:
    $\dot{x} = a\,[y - h(x)]$
    $\dot{y} = x - y + z$
    $\dot{z} = -b\,y$
    with piecewise linear characteristic
    $h(x) = m_1 x + \frac{1}{2}(m_0 - m_1)\,(|x + 1| - |x - 1|).$

    Double scroll attractor for
    $a = 9$, $b = 14.286$, $m_0 = -1/7$, $m_1 = 2/7$.

    [Figure: the double-scroll attractor in the (x1, x2, x3) state space.]

    Recurrent LS-SVM:
    N = 300 training data
    Model: p = 12
    RBF kernel: $\sigma = 1$
    SQP with early stopping
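    A minimal SciPy sketch that simulates Chua's circuit to generate double-scroll trajectories such as those used here as training data; the initial condition and integration settings are illustrative.

    import numpy as np
    from scipy.integrate import solve_ivp

    def chua_rhs(t, s, a=9.0, b=14.286, m0=-1/7, m1=2/7):
        """Chua's circuit with piecewise-linear characteristic h(x)."""
        x, y, z = s
        h = m1 * x + 0.5 * (m0 - m1) * (abs(x + 1) - abs(x - 1))
        return [a * (y - h), x - y + z, -b * y]

    # generate a double-scroll trajectory (e.g. as training data for a recurrent model)
    sol = solve_ivp(chua_rhs, (0.0, 100.0), [0.1, 0.0, 0.0], max_step=0.02)
    x1, x2, x3 = sol.y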

  • Slide 79/84

    [Figure: trajectory learning of the double scroll (full line) by a recurrent
    least squares SVM with RBF kernel. The simulation result after training
    (dashed line) on N = 300 data points is shown, with data points k = 1 to 12
    as initial condition. For the model structure p = 12 and $\sigma = 1$ are taken.
    Early stopping is done in order to avoid overfitting.]

  • Slide 80/84

    LS-SVM for control

    The N-stage optimal control problem

    Optimal control problem:
    $\min \; \mathcal{J}_N(x_k, u_k) = \rho(x_{N+1}) + \sum_{k=1}^N h(x_k, u_k)$
    subject to the system dynamics
    $x_{k+1} = f(x_k, u_k), \quad k = 1,\ldots,N \quad (x_1 \text{ given})$
    where $x_k \in \mathbb{R}^n$ is the state vector, $u_k \in \mathbb{R}$ the input, and
    $\rho(\cdot)$, $h(\cdot,\cdot)$ are positive definite functions.

    Lagrangian
    $\mathcal{L}_N(x_k, u_k; \lambda_k) = \mathcal{J}_N(x_k, u_k) + \sum_{k=1}^N \lambda_k^T\, [x_{k+1} - f(x_k, u_k)]$
    with Lagrange multipliers $\lambda_k \in \mathbb{R}^n$.

    Conditions for optimality:
    $\frac{\partial\mathcal{L}_N}{\partial x_k} = \frac{\partial h}{\partial x_k} + \lambda_{k-1} - \left(\frac{\partial f}{\partial x_k}\right)^T \lambda_k = 0, \quad k = 2,\ldots,N$  (adjoint equation)
    $\frac{\partial\mathcal{L}_N}{\partial x_{N+1}} = \frac{\partial \rho}{\partial x_{N+1}} + \lambda_N = 0$  (adjoint final condition)
    $\frac{\partial\mathcal{L}_N}{\partial u_k} = \frac{\partial h}{\partial u_k} - \lambda_k^T \frac{\partial f}{\partial u_k} = 0, \quad k = 1,\ldots,N$  (variational condition)
    $\frac{\partial\mathcal{L}_N}{\partial \lambda_k} = x_{k+1} - f(x_k, u_k) = 0, \quad k = 1,\ldots,N$  (system dynamics)

  • Slide 81/84

    Optimal control with LS-SVMs

    Consider an LS-SVM from the state space to the action space.
    Combine the optimization problem formulations of the optimal control
    problem and of the LS-SVM into one formulation.

    Optimal control problem with LS-SVM:
    $\min \; \mathcal{J}(x_k, u_k, w, e_k) = \mathcal{J}_N(x_k, u_k) + \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2$
    subject to the system dynamics
    $x_{k+1} = f(x_k, u_k), \quad k = 1,\ldots,N \quad (x_1 \text{ given})$
    and the control law
    $u_k = w^T \varphi(x_k) + e_k, \quad k = 1,\ldots,N$

    The actual control signal applied to the plant is $w^T \varphi(x_k)$.

    Lagrangian
    $\mathcal{L}(x_k, u_k, w, e_k; \lambda_k, \alpha_k) = \mathcal{J}_N(x_k, u_k) + \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2$
    $\qquad + \sum_{k=1}^N \lambda_k^T\, [x_{k+1} - f(x_k, u_k)] + \sum_{k=1}^N \alpha_k\, [u_k - w^T \varphi(x_k) - e_k]$

  • Slide 82/84

    Conditions for optimality
    $\frac{\partial\mathcal{L}}{\partial x_k} = \frac{\partial h}{\partial x_k} + \lambda_{k-1} - \left(\frac{\partial f}{\partial x_k}\right)^T \lambda_k - \alpha_k \frac{\partial}{\partial x_k}[w^T \varphi(x_k)] = 0, \quad k = 2,\ldots,N$  (adjoint equation)
    $\frac{\partial\mathcal{L}}{\partial x_{N+1}} = \frac{\partial \rho}{\partial x_{N+1}} + \lambda_N = 0$  (adjoint final condition)
    $\frac{\partial\mathcal{L}}{\partial u_k} = \frac{\partial h}{\partial u_k} - \lambda_k^T \frac{\partial f}{\partial u_k} + \alpha_k = 0, \quad k = 1,\ldots,N$  (variational condition)
    $\frac{\partial\mathcal{L}}{\partial w} = w - \sum_{k=1}^N \alpha_k \varphi(x_k) = 0$  (support vectors)
    $\frac{\partial\mathcal{L}}{\partial e_k} = \gamma e_k - \alpha_k = 0, \quad k = 1,\ldots,N$  (support values)
    $\frac{\partial\mathcal{L}}{\partial \lambda_k} = x_{k+1} - f(x_k, u_k) = 0, \quad k = 1,\ldots,N$  (system dynamics)
    $\frac{\partial\mathcal{L}}{\partial \alpha_k} = u_k - w^T \varphi(x_k) - e_k = 0, \quad k = 1,\ldots,N$  (SVM control)

    One obtains a set of nonlinear equations of the form
    $F_1(x_k, x_{N+1}, u_k, w, e_k, \lambda_k, \alpha_k) = 0$
    for $k = 1,\ldots,N$ with $x_1$ given.

    Apply the Mercer condition trick:
    $K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$
    For RBF kernels one takes
    $K(x_k, x_l) = \exp(-\eta\, \|x_k - x_l\|_2^2)$

  • Slide 83/84

    By replacing $w = \sum_l \alpha_l \varphi(x_l)$ in the equations, one obtains
    $\frac{\partial h}{\partial x_k} + \lambda_{k-1} - \left(\frac{\partial f}{\partial x_k}\right)^T \lambda_k - \alpha_k \sum_{l=1}^N \alpha_l \frac{\partial K(x_k, x_l)}{\partial x_k} = 0, \quad k = 2,\ldots,N$
    $\frac{\partial \rho}{\partial x_{N+1}} + \lambda_N = 0$
    $\frac{\partial h}{\partial u_k} - \lambda_k^T \frac{\partial f}{\partial u_k} + \alpha_k = 0, \quad k = 1,\ldots,N$
    $x_{k+1} - f(x_k, u_k) = 0, \quad k = 1,\ldots,N$
    $u_k - \sum_{l=1}^N \alpha_l K(x_l, x_k) - \alpha_k/\gamma = 0, \quad k = 1,\ldots,N$
    which is of the form
    $F_2(x_k, x_{N+1}, u_k, \lambda_k, \alpha_k) = 0$
    for $k = 1,\ldots,N$ with $x_1$ given.

    For RBF kernels one has
    $\frac{\partial K(x_k, x_l)}{\partial x_k} = -2\eta\,(x_k - x_l)\,\exp(-\eta\, \|x_k - x_l\|_2^2)$

    The actual control signal applied to the plant becomes
    $u_k = \sum_{l=1}^N \alpha_l K(x_l, x_k)$
    where $\{x_l\}_{l=1}^N$, $\{\alpha_l\}_{l=1}^N$ are the solution to $F_2 = 0$.

  • Slide 84/84