
  • Slide 1/84

    Least Squares Support Vector Machines
    Johan Suykens

    K.U. Leuven ESAT-SCD-SISTA, Kasteelpark Arenberg 10
    B-3001 Leuven (Heverlee), Belgium
    Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
    Email: [email protected]
    http://www.esat.kuleuven.ac.be/sista/members/suykens.html

    NATO-ASI Learning Theory and Practice, Leuven, July 2002
    http://www.esat.kuleuven.ac.be/sista/natoasi/ltp2002.html

  • Slide 2/84

    Main reference:

    J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle,
    Least Squares Support Vector Machines, World Scientific, in press (ISBN 981-238-151-1)

    Related software LS-SVMlab (Matlab/C toolbox):
    http://www.esat.kuleuven.ac.be/sista/lssvmlab/
    (software + publications)

    ... with thanks to Kristiaan Pelckmans, Bart Hamers, Lukas, Luc Hoegaerts,
    Chuan Lu, Lieveke Ameye, Sabine Van Huffel, Gert Lanckriet, Tijl De Bie
    and many others

  • Slide 3/84

    Interdisciplinary challenges

    LS-SVM connects to: linear algebra, mathematics, statistics, systems and
    control, signal processing, optimization, machine learning, pattern
    recognition, data mining, neural networks.

    One of the original dreams in the neural networks area is to make
    a universal class of models (such as MLPs) generally applicable to
    a wide range of applications.

    Presently, standard SVMs are mainly available only for classification,
    function estimation and density estimation. However, several extensions
    are possible in terms of least squares and equality constraints
    (and exploiting primal-dual interpretations).

  • Slide 4/84

    Overview of this talk

    LS-SVM for classification and link with kernel Fisher discriminant analysis
    Links to regularization networks and Gaussian processes
    Bayesian inference for LS-SVMs
    Sparseness and robustness
    Large scale methods: Fixed Size LS-SVM
    Extensions to committee networks
    New formulations to kernel PCA, kernel CCA, kernel PLS
    Extensions of LS-SVM to recurrent networks and optimal control

  • Slide 5/84

    Vapnik's SVM classifier

    Given a training set $\{x_k, y_k\}_{k=1}^N$
    Input patterns $x_k \in \mathbb{R}^n$
    Class labels $y_k \in \mathbb{R}$ where $y_k \in \{-1, +1\}$

    Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
    with $\varphi(\cdot): \mathbb{R}^n \to \mathbb{R}^{n_h}$ the mapping to a high dimensional feature space (can be infinite dimensional).

    For separable data, assume
    $w^T \varphi(x_k) + b \ge +1$, if $y_k = +1$
    $w^T \varphi(x_k) + b \le -1$, if $y_k = -1$
    which is equivalent to
    $y_k[w^T \varphi(x_k) + b] \ge 1, \quad k = 1, \ldots, N$

    Optimization problem (non-separable case):
    $\min_{w,\xi} \; \mathcal{J}(w,\xi) = \frac{1}{2} w^T w + c \sum_{k=1}^N \xi_k$
    subject to
    $y_k[w^T \varphi(x_k) + b] \ge 1 - \xi_k, \quad k = 1,\ldots,N$
    $\xi_k \ge 0, \quad k = 1,\ldots,N.$

  • Slide 6/84

    Construct the Lagrangian:
    $\mathcal{L}(w,b,\xi;\alpha,\nu) = \mathcal{J}(w,\xi) - \sum_{k=1}^N \alpha_k \{ y_k[w^T\varphi(x_k)+b] - 1 + \xi_k \} - \sum_{k=1}^N \nu_k \xi_k$
    with Lagrange multipliers $\alpha_k \ge 0$, $\nu_k \ge 0$ ($k = 1,\ldots,N$).

    The solution is given by the saddle point of the Lagrangian:
    $\max_{\alpha,\nu} \; \min_{w,b,\xi} \; \mathcal{L}(w,b,\xi;\alpha,\nu)$

    One obtains
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k y_k \varphi(x_k)$
    $\partial\mathcal{L}/\partial b = 0 \;\Rightarrow\; \sum_{k=1}^N \alpha_k y_k = 0$
    $\partial\mathcal{L}/\partial \xi_k = 0 \;\Rightarrow\; 0 \le \alpha_k \le c, \quad k = 1,\ldots,N$

    Quadratic programming problem (dual problem):
    $\max_{\alpha} \; Q(\alpha) = -\frac{1}{2}\sum_{k,l=1}^N y_k y_l K(x_k, x_l)\,\alpha_k \alpha_l + \sum_{k=1}^N \alpha_k$
    such that
    $\sum_{k=1}^N \alpha_k y_k = 0, \qquad 0 \le \alpha_k \le c, \quad k = 1,\ldots,N.$

  • Slide 7/84

    Note: $w$ and $\varphi(x_k)$ are not calculated.

    Mercer condition:
    $K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$

    Obtained classifier:
    $y(x) = \mathrm{sign}\left[\sum_{k=1}^N \alpha_k y_k K(x, x_k) + b\right]$
    with $\alpha_k$ positive real constants and $b$ a real constant, which follow as the solution of the QP problem.

    Non-zero $\alpha_k$ are called support values and the corresponding data points are called support vectors.

    The bias term $b$ follows from the KKT conditions.

    Some possible kernels $K(\cdot,\cdot)$:
    $K(x, x_k) = x_k^T x$  (linear SVM)
    $K(x, x_k) = (x_k^T x + 1)^d$  (polynomial SVM of degree $d$)
    $K(x, x_k) = \exp\{-\|x - x_k\|_2^2 / \sigma^2\}$  (RBF SVM)
    $K(x, x_k) = \tanh(\kappa\, x_k^T x + \theta)$  (MLP SVM)

    In the case of the RBF and MLP kernel, the number of hidden units corresponds to the number of support vectors.
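    As a small illustration (not part of the original slides), here is a minimal Python/NumPy sketch of these kernels and of the dual-form decision rule $y(x) = \mathrm{sign}[\sum_k \alpha_k y_k K(x,x_k) + b]$; the function names and default parameters are illustrative only.

    import numpy as np

    def rbf_kernel(x, xk, sigma=1.0):
        # K(x, x_k) = exp(-||x - x_k||^2 / sigma^2)
        return np.exp(-np.sum((x - xk) ** 2) / sigma ** 2)

    def poly_kernel(x, xk, d=3):
        # K(x, x_k) = (x_k^T x + 1)^d
        return (xk @ x + 1.0) ** d

    def svm_decision(x, support_vectors, support_labels, alphas, b, kernel=rbf_kernel):
        # y(x) = sign( sum_k alpha_k y_k K(x, x_k) + b )
        s = sum(a * y * kernel(x, xk)
                for a, y, xk in zip(alphas, support_labels, support_vectors))
        return np.sign(s + b)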

  • Slide 8/84

    Feature space and the kernel trick

    [Figure: data that are not linearly separable in the input space become
    separable in the feature space through the mapping $\varphi(x)$; inner products
    in the feature space are computed via the kernel: $\varphi(x)^T \varphi(z) = K(x, z)$.]

  • Slide 9/84

    Primal-dual interpretations of SVMs

    Primal problem P: parametric, estimate $w \in \mathbb{R}^{n_h}$;
    $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

    Dual problem D: non-parametric, estimate $\alpha \in \mathbb{R}^N$;
    $y(x) = \mathrm{sign}\left[\sum_{k=1}^{\#sv} \alpha_k y_k K(x, x_k) + b\right]$

    Kernel trick: $K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$

    [Figure: the primal model as a network with hidden features $\varphi_1(x), \ldots, \varphi_{n_h}(x)$
    and weights $w_1, \ldots, w_{n_h}$; the dual model as a network with kernels
    $K(x, x_1), \ldots, K(x, x_{\#sv})$ and weights $\alpha_1, \ldots, \alpha_{\#sv}$.]

  • Slide 10/84

    Least Squares SVM classifiers

    LS-SVM classifiers (Suykens, 1999): close to Vapnik's SVM formulation,
    but one solves a linear system instead of a QP problem.

    Optimization problem:
    $\min_{w,b,e} \; \mathcal{J}(w,b,e) = \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2$
    subject to the equality constraints
    $y_k[w^T \varphi(x_k) + b] = 1 - e_k, \quad k = 1,\ldots,N.$

    Lagrangian
    $\mathcal{L}(w,b,e;\alpha) = \mathcal{J}(w,b,e) - \sum_{k=1}^N \alpha_k \{ y_k[w^T \varphi(x_k) + b] - 1 + e_k \}$
    where $\alpha_k$ are Lagrange multipliers.

    Conditions for optimality:
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k y_k \varphi(x_k)$
    $\partial\mathcal{L}/\partial b = 0 \;\Rightarrow\; \sum_{k=1}^N \alpha_k y_k = 0$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; y_k[w^T \varphi(x_k) + b] - 1 + e_k = 0, \quad k = 1,\ldots,N$

  • Slide 11/84

    Set of linear equations (instead of QP):
    $\begin{bmatrix} I & 0 & 0 & -Z^T \\ 0 & 0 & 0 & -Y^T \\ 0 & 0 & \gamma I & -I \\ Z & Y & I & 0 \end{bmatrix}
     \begin{bmatrix} w \\ b \\ e \\ \alpha \end{bmatrix} =
     \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vec{1} \end{bmatrix}$
    with
    $Z = [\varphi(x_1)^T y_1; \ldots; \varphi(x_N)^T y_N]$
    $Y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$
    $e = [e_1; \ldots; e_N]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$.

    After elimination of $w, e$ one obtains
    $\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I \end{bmatrix}
     \begin{bmatrix} b \\ \alpha \end{bmatrix} =
     \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}$
    where
    $\Omega = Z Z^T$
    and Mercer's condition is applied:
    $\Omega_{kl} = y_k y_l\, \varphi(x_k)^T \varphi(x_l) = y_k y_l\, K(x_k, x_l).$

    Related work: Saunders (1998), Smola & Schölkopf (1998), Cristianini & Shawe-Taylor (2000)
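    A minimal NumPy sketch of training an LS-SVM classifier by solving this $(N+1) \times (N+1)$ system with an RBF kernel; names and default hyperparameter values are illustrative, not taken from LS-SVMlab.

    import numpy as np

    def train_lssvm_classifier(X, y, gamma=1.0, sigma=1.0):
        """Solve the dual system [0 y^T; y Omega + I/gamma][b; alpha] = [0; 1]
        with Omega_kl = y_k y_l K(x_k, x_l) and an RBF kernel."""
        N = X.shape[0]
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        Omega = np.outer(y, y) * np.exp(-sq / sigma ** 2)
        A = np.zeros((N + 1, N + 1))
        A[0, 1:] = y
        A[1:, 0] = y
        A[1:, 1:] = Omega + np.eye(N) / gamma
        rhs = np.concatenate(([0.0], np.ones(N)))
        sol = np.linalg.solve(A, rhs)
        b, alpha = sol[0], sol[1:]
        # classification of a new point: sign(sum_k alpha_k y_k K(x, x_k) + b)
        return alpha, b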

  • Slide 12/84

    Large scale LS-SVMs

    For the binary class case the problem involves a matrix of size $(N+1) \times (N+1)$.
    For large data sets iterative methods are needed (CG, SOR, ...).

    Problem: solving $Ax = B$, $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^n$,
    by CG requires that $A$ is symmetric positive definite.

    Represent the original problem of the form
    $\begin{bmatrix} 0 & Y^T \\ Y & H \end{bmatrix} \begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \end{bmatrix}$
    with $H = \Omega + \gamma^{-1} I$, $\xi_1 = b$, $\xi_2 = \alpha$, $d_1 = 0$, $d_2 = \vec{1}$, as
    $\begin{bmatrix} s & 0 \\ 0 & H \end{bmatrix} \begin{bmatrix} \xi_1 \\ \xi_2 + H^{-1} Y \xi_1 \end{bmatrix} = \begin{bmatrix} -d_1 + Y^T H^{-1} d_2 \\ d_2 \end{bmatrix}$
    with $s = Y^T H^{-1} Y \ge 0$ (since $H = H^T$ is positive definite).

    Iterative methods can be applied to the latter problem.

    Convergence of CG depends on the condition number
    (hence it depends on $\sigma$, $\gamma$).
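    A minimal sketch of this reduction in practice (assuming SciPy): two CG solves on the positive definite block $H = \Omega + \gamma^{-1} I$, with $b$ recovered from the constraint $Y^T \alpha = 0$. The helper name and tolerances are illustrative.

    import numpy as np
    from scipy.sparse.linalg import cg

    def solve_lssvm_by_cg(Omega, y, gamma=1.0):
        """Solve [0 y^T; y H][b; alpha] = [0; 1] with H = Omega + I/gamma
        using conjugate gradients on the symmetric positive definite block H."""
        N = Omega.shape[0]
        H = Omega + np.eye(N) / gamma
        eta, _ = cg(H, y)              # eta = H^{-1} y
        nu, _ = cg(H, np.ones(N))      # nu  = H^{-1} 1
        b = (y @ nu) / (y @ eta)       # from y^T alpha = 0
        alpha = nu - eta * b
        return alpha, b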

  • Slide 13/84

    Fisher Discriminant Analysis

    Projection of the data: $z = f(x) = w^T x + b$

    Optimize the Rayleigh quotient:
    $\max_{w,b} \; \mathcal{J}_{FD}(w,b) = \frac{w^T \Sigma_B\, w}{w^T \Sigma_W\, w}$

    [Figure: two classes projected onto the discriminant directions
    $w_1^T x + b_1$ and $w_2^T x + b_2$.]

  • Slide 14/84

  • Slide 15/84

    A two-spiral classification problem

    [Figure: two interlocking spirals in the plane, axis ranges -40 to 40 in both directions.]

    A difficult problem for MLPs, but not for SVM or LS-SVM with an RBF kernel
    (yet easy in the sense that the two classes are separable).

  • Slide 16/84

    Benchmarking LS-SVM classifiers

              acr   bld   gcr   hea   ion   pid   snr   ttt   wbc    adu
    N_CV      460   230   666   180   234   512   138   638   455  33000
    N_test    230   115   334    90   117   256    70   320   228  12222
    N         690   345  1000   270   351   768   208   958   683  45222
    n_num       6     6     7     7    33     8    60     0     9      6
    n_cat       8     0    13     6     0     0     0     9     0      8
    n          14     6    20    13    33     8    60     9     9     14

              bal   cmc   ims   iri   led   thy   usp   veh   wav   win
    N_CV      416   982  1540   100  2000  4800  6000   564  2400   118
    N_test    209   491   770    50  1000  2400  3298   282  1200    60
    N         625  1473  2310   150  3000  7200  9298   846  3600   178
    n_num       4     2    18     4     0     6   256    18    19    13
    n_cat       0     7     0     0     7    15     0     0     0     0
    n           4     9    18     4     7    21   256    18    19    13
    M           3     3     7     3    10     3    10     4     3     3
    n_y,MOC     2     2     3     2     4     2     4     2     2     2
    n_y,1vs1    3     3    21     3    45     3    45     6     3     3

    (Van Gestel et al., Machine Learning, in press)

  • Slide 17/84

    Successful LS-SVM applications

    20 UCI benchmark data sets (binary and multiclass problems)
    Ovarian cancer classification
    Classification of brain tumours from magnetic resonance spectroscopy signals
    Prediction of mental development of preterm newborns
    Prediction of air pollution and chaotic time series
    Marketing and financial engineering studies
    Modelling the Belgian electricity consumption
    Soft-sensor modelling in the chemical process industry

  • Slide 18/84

    Test set performance on the binary UCI benchmarks:

                   acr       bld       gcr       hea       ion       pid       snr       ttt       wbc       adu
    N_test         230       115       334        90       117       256        70       320       228     12222
    n               14         6        20        13        33         8        60         9         9        14

    RBF LS-SVM   87.0(2.1) 70.2(4.1) 76.3(1.4) 84.7(4.8) 96.0(2.1) 76.8(1.7) 73.1(4.2) 99.0(0.3) 96.4(1.0) 84.7(0.3)
    RBF LS-SVMF  86.4(1.9) 65.1(2.9) 70.8(2.4) 83.2(5.0) 93.4(2.7) 72.9(2.0) 73.6(4.6) 97.9(0.7) 96.8(0.7) 77.6(1.3)
    Lin LS-SVM   86.8(2.2) 65.6(3.2) 75.4(2.3) 84.9(4.5) 87.9(2.0) 76.8(1.8) 72.6(3.7) 66.8(3.9) 95.8(1.0) 81.8(0.3)
    Lin LS-SVMF  86.5(2.1) 61.8(3.3) 68.6(2.3) 82.8(4.4) 85.0(3.5) 73.1(1.7) 73.3(3.4) 57.6(1.9) 96.9(0.7) 71.3(0.3)
    Pol LS-SVM   86.5(2.2) 70.4(3.7) 76.3(1.4) 83.7(3.9) 91.0(2.5) 77.0(1.8) 76.9(4.7) 99.5(0.5) 96.4(0.9) 84.6(0.3)
    Pol LS-SVMF  86.6(2.2) 65.3(2.9) 70.3(2.3) 82.4(4.6) 91.7(2.6) 73.0(1.8) 77.3(2.6) 98.1(0.8) 96.9(0.7) 77.9(0.2)
    RBF SVM      86.3(1.8) 70.4(3.2) 75.9(1.4) 84.7(4.8) 95.4(1.7) 77.3(2.2) 75.0(6.6) 98.6(0.5) 96.4(1.0) 84.4(0.3)
    Lin SVM      86.7(2.4) 67.7(2.6) 75.4(1.7) 83.2(4.2) 87.1(3.4) 77.0(2.4) 74.1(4.2) 66.2(3.6) 96.3(1.0) 83.9(0.2)
    LDA          85.9(2.2) 65.4(3.2) 75.9(2.0) 83.9(4.3) 87.1(2.3) 76.7(2.0) 67.9(4.9) 68.0(3.0) 95.6(1.1) 82.2(0.3)
    QDA          80.1(1.9) 62.2(3.6) 72.5(1.4) 78.4(4.0) 90.6(2.2) 74.2(3.3) 53.6(7.4) 75.1(4.0) 94.5(0.6) 80.7(0.3)
    Logit        86.8(2.4) 66.3(3.1) 76.3(2.1) 82.9(4.0) 86.2(3.5) 77.2(1.8) 68.4(5.2) 68.3(2.9) 96.1(1.0) 83.7(0.2)
    C4.5         85.5(2.1) 63.1(3.8) 71.4(2.0) 78.0(4.2) 90.6(2.2) 73.5(3.0) 72.1(2.5) 84.2(1.6) 94.7(1.0) 85.6(0.3)
    oneR         85.4(2.1) 56.3(4.4) 66.0(3.0) 71.7(3.6) 83.6(4.8) 71.3(2.7) 62.6(5.5) 70.7(1.5) 91.8(1.4) 80.4(0.3)
    IB1          81.1(1.9) 61.3(6.2) 69.3(2.6) 74.3(4.2) 87.2(2.8) 69.6(2.4) 77.7(4.4) 82.3(3.3) 95.3(1.1) 78.9(0.2)
    IB10         86.4(1.3) 60.5(4.4) 72.6(1.7) 80.0(4.3) 85.9(2.5) 73.6(2.4) 69.4(4.3) 94.8(2.0) 96.4(1.2) 82.7(0.3)
    NBk          81.4(1.9) 63.7(4.5) 74.7(2.1) 83.9(4.5) 92.1(2.5) 75.5(1.7) 71.6(3.5) 71.7(3.1) 97.1(0.9) 84.8(0.2)
    NBn          76.9(1.7) 56.0(6.9) 74.6(2.8) 83.8(4.5) 82.8(3.8) 75.1(2.1) 66.6(3.2) 71.7(3.1) 95.5(0.5) 82.7(0.2)
    Maj. Rule    56.2(2.0) 56.5(3.1) 69.7(2.3) 56.3(3.8) 64.4(2.9) 66.8(2.1) 54.4(4.7) 66.2(3.6) 66.2(2.4) 75.3(0.3)

  • Slide 19/84

  • Slide 20/84

    LS-SVMs for function estimation

    The LS-SVM model as feature space representation:
    $y(x) = w^T \varphi(x) + b$
    with $x \in \mathbb{R}^n$, $y \in \mathbb{R}$.

    The nonlinear mapping $\varphi(\cdot)$ is similar to the classifier case.

    Given a training set $\{x_k, y_k\}_{k=1}^N$.

    Optimization problem:
    $\min_{w,e} \; \mathcal{J}(w,e) = \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2$
    subject to the equality constraints
    $y_k = w^T \varphi(x_k) + b + e_k, \quad k = 1,\ldots,N$

    This is a form of ridge regression (see Saunders, 1998).

    Lagrangian:
    $\mathcal{L}(w,b,e;\alpha) = \mathcal{J}(w,e) - \sum_{k=1}^N \alpha_k \{ w^T \varphi(x_k) + b + e_k - y_k \}$
    with $\alpha_k$ the Lagrange multipliers.

  • Slide 21/84

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k \varphi(x_k)$
    $\partial\mathcal{L}/\partial b = 0 \;\Rightarrow\; \sum_{k=1}^N \alpha_k = 0$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; w^T \varphi(x_k) + b + e_k - y_k = 0, \quad k = 1,\ldots,N$

    Solution
    $\begin{bmatrix} 0 & \vec{1}^T \\ \vec{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$
    with
    $y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$
    and, by applying Mercer's condition,
    $\Omega_{kl} = \varphi(x_k)^T \varphi(x_l) = K(x_k, x_l), \quad k, l = 1,\ldots,N.$

    Resulting LS-SVM model for function estimation:
    $y(x) = \sum_{k=1}^N \alpha_k K(x_k, x) + b$

    The solution is mathematically equivalent to regularization networks and
    Gaussian processes (usually without bias term); see e.g. Poggio & Girosi (1990).
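    A minimal NumPy sketch of LS-SVM function estimation: build $\Omega$ with an RBF kernel, solve the $(N+1) \times (N+1)$ system, and evaluate $y(x) = \sum_k \alpha_k K(x_k, x) + b$. Names and parameter values are illustrative.

    import numpy as np

    def rbf_kernel_matrix(A, B, sigma=1.0):
        # Omega_kl = exp(-||a_k - b_l||^2 / sigma^2)
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / sigma ** 2)

    def train_lssvm_regression(X, y, gamma=10.0, sigma=1.0):
        """Solve [0 1^T; 1 Omega + I/gamma][b; alpha] = [0; y]."""
        N = X.shape[0]
        Omega = rbf_kernel_matrix(X, X, sigma)
        A = np.zeros((N + 1, N + 1))
        A[0, 1:] = 1.0
        A[1:, 0] = 1.0
        A[1:, 1:] = Omega + np.eye(N) / gamma
        sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
        return sol[1:], sol[0]      # alpha, b

    def predict_lssvm(Xtest, Xtrain, alpha, b, sigma=1.0):
        # y(x) = sum_k alpha_k K(x_k, x) + b
        return rbf_kernel_matrix(Xtest, Xtrain, sigma) @ alpha + b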

  • Slide 22/84

    SVM, RN, GP, LS-SVM, ...

    Closely related model classes: Support Vector Machines, LS-SVMs,
    regularization networks, Gaussian processes, Kriging, kernel ridge
    regression, RKHS methods.

    Some early history:
    1910-1920: Moore - RKHS
    1940: Aronszajn - systematic development of RKHS
    1951: Krige
    1970: Parzen - RKHS for statistical problems
    1971: Kimeldorf & Wahba

    In LS-SVMs, primal-dual interpretations are emphasized (and often exploited).

  • Slide 23/84

    Bayesian inference of LS-SVM models

    Level 1 ($w, b$):
        maximize the posterior $p(w,b \mid D, \mu, \zeta, H)$
        $\propto$ likelihood $p(D \mid w,b,\mu,\zeta,H)$ $\times$ prior $p(w,b \mid \mu,\zeta,H)$;
        evidence $p(D \mid \mu,\zeta,H)$

    Level 2 ($\mu, \zeta$):
        maximize the posterior $p(\mu,\zeta \mid D, H)$
        $\propto$ likelihood $p(D \mid \mu,\zeta,H)$ $\times$ prior $p(\mu,\zeta \mid H)$;
        evidence $p(D \mid H)$

    Level 3 ($\sigma$):
        maximize the posterior $p(H \mid D)$
        $\propto$ likelihood $p(D \mid H)$ $\times$ prior $p(H)$;
        evidence $p(D)$

  • Slide 24/84

    Bayesian LS-SVM framework
    (Van Gestel, Suykens et al., Neural Computation, 2002)

    Level 1 inference:
    Inference of w, b
    Probabilistic interpretation of the outputs
    Moderated outputs
    Additional correction for prior probabilities and bias term

    Level 2 inference:
    Inference of the hyperparameters
    Effective number of parameters (smaller than N)
    Eigenvalues of the centered kernel matrix are important

    Level 3:
    Model comparison - Occam's razor
    Moderated output with uncertainty on the hyperparameters
    Selection of the width of the kernel
    Input selection (ARD at level 3 instead of level 2)

    Related work: MacKay's Bayesian interpolation for MLPs, GP

    Differences with GP:
    bias term contained in the formulation
    classifier case treated as a regression problem
    $\sigma$ of the RBF kernel determined at Level 3

  • Slide 25/84

    Benchmarking results

           n     N  N_test  N_tot  LS-SVM     LS-SVM     LS-SVM     SVM        GP         GPb        GP
                                   (BayM)     (Bay)      (CV10)     (CV10)     (Bay)      (Bay)      (CV10)
    bld    6   230    115    345   69.4(2.9)  69.4(3.1)  69.4(3.4)  69.2(3.5)  69.2(2.7)  68.9(3.3)  69.7(4.0)
    cra    6   133     67    200   96.7(1.5)  96.7(1.5)  96.9(1.6)  95.1(3.2)  96.4(2.5)  94.8(3.2)  96.9(2.4)
    gcr   20   666    334   1000   73.1(3.8)  73.5(3.9)  75.6(1.8)  74.9(1.7)  76.2(1.4)  75.9(1.7)  75.4(2.0)
    hea   13   180     90    270   83.6(5.1)  83.2(5.2)  84.3(5.3)  83.4(4.4)  83.1(5.5)  83.7(4.9)  84.1(5.2)
    ion   33   234    117    351   95.6(0.9)  96.2(1.0)  95.6(2.0)  95.4(1.7)  91.0(2.3)  94.4(1.9)  92.4(2.4)
    pid    8   512    256    768   77.3(3.1)  77.5(2.8)  77.3(3.0)  76.9(2.9)  77.6(2.9)  77.5(2.7)  77.2(3.0)
    rsy    2   250   1000   1250   90.2(0.7)  90.2(0.6)  89.6(1.1)  89.7(0.8)  90.2(0.7)  90.1(0.8)  89.9(0.8)
    snr   60   138     70    208   76.7(5.6)  78.0(5.2)  77.9(4.2)  76.3(5.3)  78.6(4.9)  75.7(6.1)  76.6(7.2)
    tit    3  1467    734   2201   78.8(1.1)  78.7(1.1)  78.7(1.1)  78.7(1.1)  78.5(1.0)  77.2(1.9)  78.7(1.2)
    wbc    9   455    228    683   95.9(0.6)  95.7(0.5)  96.2(0.7)  96.2(0.8)  95.8(0.7)  93.7(2.0)  96.5(0.7)
    AP                             83.7       83.9       84.1       83.6       83.7       83.2       83.7
    AR                              2.3        2.5        2.5        3.8        3.2        4.2        2.6
    PST                            1.000      0.754      1.000      0.344      0.754      0.344      0.508

  • Slide 26/84

    Bayesian inference of LS-SVMs: example on the Ripley data set

    [Figure: left, the data in the (x1, x2) plane with posterior class probability
    contours from 0.1 to 0.9; right, the posterior class probability surface over (x1, x2).]

  • Slide 27/84

    Bayesian inference of LS-SVMs: sinc example

    [Figure: left, the noisy sinc data and the estimated model (y versus x);
    right, the model evidence log p(D|H) as a function of the kernel parameter.]

  • Slide 28/84

    Sparseness

    [Figure: sorted |alpha_k| spectra; the QP-SVM spectrum is sparse (many alpha_k
    exactly zero), while the LS-SVM spectrum is not.]

    Lack of sparseness in the LS-SVM case, but ... sparseness can be imposed by
    applying pruning techniques existing in the neural networks area (e.g.
    optimal brain damage, optimal brain surgeon, etc.), gradually omitting the
    points with the smallest |alpha_k| and re-estimating the model.

    [Figure: sorted |alpha_k| for the LS-SVM versus the resulting sparse LS-SVM.]

  • Slide 29/84

    Robustness

    SVM: convex cost function --(convex optimization)--> SVM solution
    Weighted LS-SVM: modified cost function --(robust statistics)--> LS-SVM solution

    Weighted LS-SVM:
    $\min_{w,b,e} \; \mathcal{J}(w,e) = \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N v_k e_k^2$
    such that $y_k = w^T \varphi(x_k) + b + e_k, \quad k = 1,\ldots,N$
    with the weights $v_k$ determined from the distribution of $\{e_k\}_{k=1}^N$
    obtained from the unweighted LS-SVM.
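    A minimal NumPy sketch of such a weighted pass: one unweighted solve, residual-based weights from a robust scale estimate (here the MAD; the cut-off constants are illustrative), then one re-weighted solve where the diagonal becomes $1/(\gamma v_k)$ since $\alpha_k = \gamma v_k e_k$.

    import numpy as np

    def weighted_lssvm(Omega, y, gamma=10.0, c1=2.5, c2=3.0):
        """Unweighted LS-SVM pass, then a re-weighted pass with weights v_k
        chosen from the residual distribution (a Hampel-type rule)."""
        N = len(y)

        def solve(v):
            A = np.zeros((N + 1, N + 1))
            A[0, 1:] = 1.0
            A[1:, 0] = 1.0
            A[1:, 1:] = Omega + np.diag(1.0 / (gamma * v))
            sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
            return sol[1:], sol[0]

        alpha, b = solve(np.ones(N))                       # unweighted LS-SVM
        e = alpha / gamma                                   # residuals (alpha_k = gamma e_k)
        s = np.median(np.abs(e - np.median(e))) / 0.6745    # robust scale (MAD)
        r = np.abs(e / s)
        v = np.where(r <= c1, 1.0,
                     np.where(r <= c2, (c2 - r) / (c2 - c1), 1e-4))
        return solve(v)                                     # weighted LS-SVM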

  • Slide 30/84

    L2 and L1 estimators: score function

    [Figure: the L2 loss with its linear score function, and the L1 loss with its
    bounded (sign) score function.]

    With the L1 estimator the influence of outliers is reduced (the score
    function in robust statistics means the derivative of the loss function).

  • Slide 31/84

    Loss function and score function

    Examples of score functions in robust statistics:
    Huber, Andrews, Hampel, Tukey

    [Figure: the corresponding loss and score function shapes.]

    In practice: iterative weighting according to the score function.
    Note that all these score functions (except Huber's) lead to non-convex
    loss functions. A further application of robust statistics is robust
    cross-validation (De Brabanter et al., 2002), with good performance in
    terms of the efficiency-robustness trade-off.

  • Slide 32/84

    Large scale methods

    Nyström method (GP)
    Fixed size LS-SVM
    Committee networks and extensions

  • Slide 33/84

    Nyström method

    Reference in GP: Williams & Seeger (2001)

    Big kernel matrix $\Omega_{(N,N)} \in \mathbb{R}^{N \times N}$;
    small kernel matrix $\Omega_{(M,M)} \in \mathbb{R}^{M \times M}$
    (based on a random subsample; in practice often $M \ll N$)

    Eigenvalue decomposition of the small matrix:
    $\Omega_{(M,M)}\, U = U \Lambda$

    The relation to the eigenvalues and eigenfunctions of the integral equation
    $\int K(x, x')\, \phi_i(x)\, p(x)\, dx = \lambda_i \phi_i(x')$
    is given by
    $\hat{\lambda}_i = \frac{1}{M} \lambda_i$
    $\hat{\phi}_i(x_k) = \sqrt{M}\, u_{ki}$
    $\hat{\phi}_i(x') = \frac{\sqrt{M}}{\lambda_i} \sum_{k=1}^M u_{ki} K(x_k, x')$
    where $\hat{\lambda}_i$ and $\hat{\phi}_i$ are estimates of $\lambda_i$ and $\phi_i$,
    respectively, and $u_{ki}$ denotes the $ki$-th entry of the matrix $U$.

  • Slide 34/84

    For the big matrix:
    $\Omega_{(N,N)}\, \tilde{U} = \tilde{U} \tilde{\Lambda}$

    Furthermore, one has
    $\hat{\tilde{\lambda}}_i = \frac{N}{M}\, \lambda_i$
    $\hat{\tilde{u}}_i = \sqrt{\frac{N}{M}}\, \frac{1}{\lambda_i}\, \Omega_{(N,M)}\, u_i$

    One can then show that
    $\Omega_{(N,N)} \approx \Omega_{(N,M)}\, \Omega_{(M,M)}^{-1}\, \Omega_{(M,N)}$
    where $\Omega_{(N,M)}$ is the $N \times M$ block matrix taken from $\Omega_{(N,N)}$.

    The solution to the big linear system
    $(\Omega_{(N,N)} + I/\gamma)\, \alpha = y$
    can then be written as
    $\alpha = \gamma \left[\, y - \tilde{U} \left( \tfrac{1}{\gamma} \tilde{\Lambda}^{-1} + \tilde{U}^T \tilde{U} \right)^{-1} \tilde{U}^T y \,\right]$
    by applying the Sherman-Morrison-Woodbury formula.

    Some numerical difficulties, as pointed out by Fine & Scheinberg (2001).
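    A minimal NumPy sketch of the Nyström approximation $\Omega_{(N,N)} \approx \Omega_{(N,M)} \Omega_{(M,M)}^{-1} \Omega_{(M,N)}$ from a random subsample; for clarity it starts from the full kernel matrix, which one would of course avoid in a real large-scale setting.

    import numpy as np

    def nystrom_approximation(K_full, M, rng=np.random.default_rng(0)):
        """Approximate an N x N kernel matrix from a random M-point subsample."""
        N = K_full.shape[0]
        idx = rng.choice(N, size=M, replace=False)
        K_MM = K_full[np.ix_(idx, idx)]
        K_NM = K_full[:, idx]
        lam, U = np.linalg.eigh(K_MM)          # eigendecomposition of the small matrix
        pos = lam > 1e-12                       # pseudo-inverse over positive eigenvalues
        K_MM_inv = (U[:, pos] / lam[pos]) @ U[:, pos].T
        return K_NM @ K_MM_inv @ K_NM.T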

  • Slide 35/84

    Computation in primal or dual space?

    Primal space: preferable for large data sets.
    Dual space: preferable for large dimensional inputs.
    Question: how can one work in the primal space in the nonlinear case?

  • Slide 36/84

    Fixed Size LS-SVM

    Related work on basis construction in feature space:
    Cawley (2002), Csato & Opper (2002), Smola & Schölkopf (2002),
    Baudat & Anouar (2001).

    Model in the primal space:
    $\min_{w \in \mathbb{R}^{n_h},\, b \in \mathbb{R}} \; \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N \left( y_k - (w^T \varphi(x_k) + b) \right)^2.$

    Observation: for a linear model it is computationally better to solve the
    primal problem (one knows that $\varphi(x_k) = x_k$).
    Can we do this for the nonlinear case too?

    Employ the Nyström method:
    $\hat{\varphi}_i(x') = \sqrt{\hat{\lambda}_i}\, \hat{\phi}_i(x') = \frac{1}{\sqrt{\lambda_i}} \sum_{k=1}^M u_{ki} K(x_k, x'),$
    assuming a fixed size $M$.

    The model then becomes
    $y(x) = w^T \hat{\varphi}(x) + b = \sum_{i=1}^M w_i \frac{1}{\sqrt{\lambda_i}} \sum_{k=1}^M u_{ki} K(x_k, x) + b.$

  • Slide 37/84

    The support values corresponding to the $M$ support vectors equal
    $\alpha_k = \sum_{i=1}^M w_i \frac{u_{ki}}{\sqrt{\lambda_i}}$
    when one represents the model as
    $y(x) = \sum_{k=1}^M \alpha_k K(x_k, x) + b.$

    How to select a working set of $M$ support vectors?
    Is taking a random subsample the only option?

  • Slide 38/84

    Fixed Size LS-SVM: selection of the support vectors

    Girolami (2002): link between the Nyström method, kernel PCA,
    density estimation and entropy criteria.

    The quadratic Renyi entropy
    $H_R = -\log \int p(x)^2\, dx$
    has been related to kernel PCA and density estimation, with
    $\int \hat{p}(x)^2\, dx = \frac{1}{N^2}\, \mathbf{1}_v^T\, \Omega\, \mathbf{1}_v$
    where $\mathbf{1}_v = [1; 1; \ldots; 1]$ and a normalized kernel is assumed with
    respect to the density estimation.

    Fixed Size LS-SVM: take a working set of M support vectors and select the
    vectors according to the entropy criterion (instead of a random subsample
    as in the Nyström method).
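    A minimal sketch of this entropy criterion for a candidate working set, assuming a Gaussian Parzen density estimate (the normalization below follows the Gaussian convolution identity; other kernel conventions are possible).

    import numpy as np

    def quadratic_renyi_entropy(X, sigma=1.0):
        """H_R = -log integral p(x)^2 dx, estimated as (1/N^2) 1^T Omega 1
        for a Gaussian Parzen density estimate with bandwidth sigma."""
        N, d = X.shape
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
        # convolution of two Gaussians of variance sigma^2: variance 2*sigma^2
        Omega = np.exp(-sq / (4 * sigma ** 2)) / ((4 * np.pi * sigma ** 2) ** (d / 2))
        return -np.log(np.sum(Omega) / N ** 2)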

  • Slide 39/84

    Fixed Size LS-SVM

    [Diagram: the dual-space ingredients (Nyström method, kernel PCA, density
    estimation, entropy criteria) yield eigenfunctions and an active selection
    of support vectors; the regression itself is then carried out in the primal space.]

  • Slide 40/84

    Fixed Size LS-SVM

    [Diagram: points are randomly selected from the pool of training data; a
    candidate point enters the working set ("new point IN") or is sent back
    ("present or new point OUT") depending on the entropy criterion.]

  • Slide 41/84

    Fixed size LS-SVM algorithm:

    1. Given normalized and standardized training data $\{x_k, y_k\}_{k=1}^N$ with
       inputs $x_k \in \mathbb{R}^n$, outputs $y_k \in \mathbb{R}$ and $N$ training data.
    2. Choose a working set of size M, imposing in this way M support
       vectors (typically $M \ll N$).
    3. Randomly select a support vector $x^*$ from the working set of M
       support vectors.
    4. Randomly select a point $x_t$ from the N training data and replace $x^*$
       by $x_t$ in the working set. If the entropy increases by taking the point
       $x_t$ instead of $x^*$, then $x_t$ is accepted for the working set of M
       support vectors; otherwise $x_t$ is rejected (and returned to the training
       data pool) and the support vector $x^*$ stays in the working set.
    5. Calculate the entropy value for the present working set.
    6. Stop if the change in entropy value is small or the number of iterations
       is exceeded, otherwise go to (3).
    7. Estimate w, b in the primal space after estimating the eigenfunctions
       from the Nyström approximation.

    (A small numerical sketch of the selection loop in steps 3-6 is given below.)
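    A minimal sketch of that selection loop, assuming an entropy callable such as the one sketched earlier; the iteration count and acceptance rule are illustrative.

    import numpy as np

    def select_support_vectors(X, M, entropy, n_iter=1000, rng=np.random.default_rng(0)):
        """Entropy-based active selection of a working set of M support vectors."""
        N = X.shape[0]
        work = list(rng.choice(N, size=M, replace=False))
        best = entropy(X[work])
        for _ in range(n_iter):
            i = rng.integers(M)          # position in the working set to swap out
            t = rng.integers(N)          # candidate training point
            if t in work:
                continue
            trial = work.copy()
            trial[i] = t
            h = entropy(X[trial])
            if h > best:                 # accept only if the entropy increases
                work, best = trial, h
        return work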

  • Slide 42/84

    Fixed Size LS-SVM: sinc example

    [Figure: four snapshots of the support vector selection and the resulting
    estimate on a noisy sinc data set.]

    demo in LS-SVMlab

  • Slide 43/84

    [Figure: evolution of the entropy (left) and of the test set error (right)
    as a function of the iteration step, on logarithmic scales.]

  • Slide 44/84

    Fixed Size LS-SVM: sinc example

    [Figure: the noisy training data (left) and the estimated model (right).]

    Sinc function with 20000 data points

  • Slide 45/84

    Fixed Size LS-SVM: spiral example

    [Figure: four snapshots of the support vector selection on the two-spiral data.]

  • Slide 46/84

    Fixed Size LS-SVM: spiral example

    [Figure: the resulting classification boundary in the (x1, x2) plane.]

  • Slide 47/84

    Committee network of LS-SVMs

    [Diagram: the data set is split into subsets D_1, D_2, ..., D_m; an LS-SVM is
    trained on each subset D_i and their outputs are combined into f_c(x).]

  • Slide 48/84

    Committee network of LS-SVMs

    Sinc function with 4000 training data

    [Figure: the noisy training data (left) and the committee estimate (right).]

  • Slide 49/84

    Results of the individual LS-SVM models

    [Figure: the estimates obtained by the four individual LS-SVM submodels.]

  • Slide 50/84

    Nonlinear combination of LS-SVMs

    [Diagram: the outputs of the LS-SVM submodels trained on D_1, ..., D_m are
    combined through a nonlinear function g.]

    This results in a multilayer network
    (layers of (LS-)SVMs, or e.g. an MLP + LS-SVM combination).

  • Slide 51/84

  • Slide 52/84

    Committee network of LS-SVMs

    Linear versus nonlinear combinations of trained LS-SVM submodels under
    heavy Gaussian noise

    [Figure: the heavily noisy training data and the estimates obtained with the
    linear and the nonlinear committee combinations.]

  • Slide 53/84

    Classical PCA formulation

    Given data $\{x_k\}_{k=1}^N$ with $x_k \in \mathbb{R}^n$.

    Find projected variables $w^T x_k$ with maximal variance:
    $\max_w \; \mathrm{Var}(w^T x) = \mathrm{Cov}(w^T x, w^T x) \simeq \frac{1}{N} \sum_{k=1}^N (w^T x_k)^2 = w^T C\, w$
    where $C = (1/N) \sum_{k=1}^N x_k x_k^T$.

    Consider the constraint $w^T w = 1$.

    Constrained optimization:
    $\mathcal{L}(w; \lambda) = \frac{1}{2} w^T C w - \lambda (w^T w - 1)$
    with Lagrange multiplier $\lambda$.

    Eigenvalue problem
    $C w = \lambda w$
    with $C = C^T \ge 0$, obtained from $\partial\mathcal{L}/\partial w = 0$, $\partial\mathcal{L}/\partial \lambda = 0$.

  • Slide 54/84

    PCA analysis as a one-class modelling problem

    One-class with target value zero:
    $\max_w \sum_{k=1}^N (0 - w^T x_k)^2$

    Score variables: $z = w^T x$

    [Illustration: the data in the input space are projected by $w^T x$ onto the
    target space, which is centered at 0.]

  • Slide 55/84

    LS-SVM interpretation of FDA

    [Illustration: the two classes are mapped by $w^T x + b$ onto the targets -1
    and +1 in the target space; minimize the within-class scatter around the targets.]

    LS-SVM interpretation of PCA

    [Illustration: the data are mapped by $w^T(x - \bar{x})$ around the target 0;
    find the direction with maximal variance.]

  • Slide 56/84

    SVM formulation of linear PCA

    Primal problem:
    $P: \; \max_{w,e} \; \mathcal{J}_P(w,e) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w$
    such that $e_k = w^T x_k, \quad k = 1,\ldots,N$

    Lagrangian
    $\mathcal{L}(w,e;\alpha) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w - \sum_{k=1}^N \alpha_k \left( e_k - w^T x_k \right)$

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k x_k$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k - w^T x_k = 0, \quad k = 1,\ldots,N$

  • Slide 57/84

    Elimination of the variables $e, w$ gives
    $\frac{1}{\gamma}\,\alpha_k - \sum_{l=1}^N \alpha_l\, x_l^T x_k = 0, \quad k = 1,\ldots,N$
    and after defining $\lambda = 1/\gamma$ one obtains the eigenvalue problem
    $D: \text{ solve in } \alpha:$
    $\begin{bmatrix} x_1^T x_1 & \cdots & x_1^T x_N \\ \vdots & & \vdots \\ x_N^T x_1 & \cdots & x_N^T x_N \end{bmatrix}
     \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix} = \lambda \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix}$
    as the dual problem (quantization in terms of $\lambda = 1/\gamma$).

    The score variables become
    $z(x) = w^T x = \sum_{l=1}^N \alpha_l\, x_l^T x$

    Optimal solution corresponding to the largest eigenvalue:
    $\sum_{k=1}^N (w^T x_k)^2 = \sum_{k=1}^N e_k^2 = \sum_{k=1}^N \frac{1}{\gamma^2}\alpha_k^2 = \lambda_{\max}^2$
    where $\sum_{k=1}^N \alpha_k^2 = 1$ for the normalized eigenvector.

    Many data points: better to solve the primal problem.
    Many inputs: better to solve the dual problem.
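    A minimal NumPy sketch of this dual (Gram-matrix) view of linear PCA for zero-mean data; it returns the leading direction, the scores and the largest eigenvalue. Names are illustrative.

    import numpy as np

    def pca_scores_via_gram(X):
        """Solve G alpha = lambda alpha with G_kl = x_k^T x_l (zero-mean X, shape N x n)
        and form the score z(x) = sum_l alpha_l x_l^T x."""
        G = X @ X.T                      # Gram matrix
        lam, A = np.linalg.eigh(G)       # eigenvalues in ascending order
        alpha = A[:, -1]                 # eigenvector of the largest eigenvalue
        w = X.T @ alpha                  # w = sum_l alpha_l x_l
        scores = X @ w                   # z(x_k) = w^T x_k
        return w, scores, lam[-1]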

  • Slide 58/84

    SVM formulation of PCA analysis with a bias term

    Usually one applies PCA analysis to centered data and considers
    $\max_w \sum_{k=1}^N [w^T (x_k - \bar{x})]^2$
    where $\bar{x} = \frac{1}{N} \sum_{k=1}^N x_k$.

    Bias term formulation:
    score variables
    $z(x) = w^T x + b$
    and objective
    $\max_{w,b} \sum_{k=1}^N [0 - (w^T x_k + b)]^2$

  • Slide 59/84

    Primal optimization problem
    $P: \; \max_{w,e} \; \mathcal{J}_P(w,e) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w$
    such that $e_k = w^T x_k + b, \quad k = 1,\ldots,N$

    Lagrangian
    $\mathcal{L}(w,e;\alpha) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w - \sum_{k=1}^N \alpha_k \left( e_k - w^T x_k - b \right)$

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k x_k$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial b = 0 \;\Rightarrow\; \sum_{k=1}^N \alpha_k = 0$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k - w^T x_k - b = 0, \quad k = 1,\ldots,N$

    Applying $\sum_{k=1}^N \alpha_k = 0$ yields
    $b = -\frac{1}{N} \sum_{k=1}^N \sum_{l=1}^N \alpha_l\, x_l^T x_k.$

  • Slide 60/84

    By defining $\lambda = 1/\gamma$ one obtains the dual problem
    $D: \text{ solve in } \alpha:$
    $\begin{bmatrix} (x_1 - \bar{x})^T (x_1 - \bar{x}) & \cdots & (x_1 - \bar{x})^T (x_N - \bar{x}) \\ \vdots & & \vdots \\ (x_N - \bar{x})^T (x_1 - \bar{x}) & \cdots & (x_N - \bar{x})^T (x_N - \bar{x}) \end{bmatrix} \alpha = \lambda\, \alpha$
    which is an eigenvalue decomposition of the centered Gram matrix
    $\Omega_c\, \alpha = \lambda\, \alpha$
    with $\Omega_c = M_c \Omega M_c$ where $M_c = I - \mathbf{1}_v \mathbf{1}_v^T / N$, $\mathbf{1}_v = [1; 1; \ldots; 1]$
    and $\Omega_{kl} = x_k^T x_l$ for $k, l = 1,\ldots,N$.

    Score variables:
    $z(x) = w^T x + b = \sum_{l=1}^N \alpha_l\, x_l^T x + b$
    where $\alpha$ is the eigenvector corresponding to the largest eigenvalue and
    $\sum_{k=1}^N (w^T x_k + b)^2 = \sum_{k=1}^N e_k^2 = \sum_{k=1}^N \frac{1}{\gamma^2}\alpha_k^2 = \lambda_{\max}^2.$

  • Slide 61/84

    Reconstruction problem for linear PCA

    Reconstruction error:
    $\min \sum_{k=1}^N \|x_k - \tilde{x}_k\|_2^2$
    where $\tilde{x}_k$ are the variables reconstructed from the score variables, with
    $\tilde{x} = V z + \beta.$
    Hence
    $\min_{V, \beta} \sum_{k=1}^N \|x_k - (V z_k + \beta)\|_2^2$

    Information bottleneck:
    $x \;\to\; z \;\to\; \tilde{x}$
    $z = w^T x + b, \qquad \tilde{x} = V z + \beta$

  • Slide 62/84

    [Figure: an example data set in the (x1, x2) space and the corresponding
    score variables in the (z1, z2) space.]

  • Slide 63/84

    LS-SVM approach to kernel PCA

    Create a nonlinear version of the method by mapping the input space to a
    high dimensional feature space and applying the kernel trick
    (kernel PCA - Schölkopf et al.; SVM approach - Suykens et al., 2002).

    Primal optimization problem:
    $P: \; \max_{w,e} \; \mathcal{J}_P(w,e) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w$
    such that $e_k = w^T (\varphi(x_k) - \hat{\mu}_\varphi), \quad k = 1,\ldots,N.$

    Lagrangian
    $\mathcal{L}(w,e;\alpha) = \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2 - \frac{1}{2} w^T w - \sum_{k=1}^N \alpha_k \left( e_k - w^T(\varphi(x_k) - \hat{\mu}_\varphi) \right)$

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k (\varphi(x_k) - \hat{\mu}_\varphi)$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k - w^T(\varphi(x_k) - \hat{\mu}_\varphi) = 0, \quad k = 1,\ldots,N.$

  • Slide 64/84

    By elimination of the variables $e, w$ and defining $\lambda = 1/\gamma$ one obtains
    $D: \text{ solve in } \alpha: \quad \Omega_c\, \alpha = \lambda\, \alpha$
    with
    $\Omega_{c,kl} = (\varphi(x_k) - \hat{\mu}_\varphi)^T (\varphi(x_l) - \hat{\mu}_\varphi), \quad k, l = 1,\ldots,N$
    the elements of the centered kernel matrix.

    Score variables:
    $z(x) = w^T (\varphi(x) - \hat{\mu}_\varphi) = \sum_{l=1}^N \alpha_l\, (\varphi(x_l) - \hat{\mu}_\varphi)^T (\varphi(x) - \hat{\mu}_\varphi)$
    $= \sum_{l=1}^N \alpha_l \Big( K(x_l, x) - \frac{1}{N}\sum_{r=1}^N K(x_r, x) - \frac{1}{N}\sum_{r=1}^N K(x_r, x_l) + \frac{1}{N^2}\sum_{r=1}^N \sum_{s=1}^N K(x_r, x_s) \Big).$
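    A minimal NumPy sketch of kernel PCA through the centered kernel matrix $\Omega_c = M_c \Omega M_c$; the kernel matrix K is assumed to be given, and the names are illustrative.

    import numpy as np

    def kernel_pca_scores(K, n_components=2):
        """Eigen-decompose the centered kernel matrix and return training scores."""
        N = K.shape[0]
        Mc = np.eye(N) - np.ones((N, N)) / N
        Kc = Mc @ K @ Mc                          # centered kernel matrix
        lam, A = np.linalg.eigh(Kc)               # ascending eigenvalues
        idx = np.argsort(lam)[::-1][:n_components]
        alphas, lams = A[:, idx], lam[idx]
        scores = Kc @ alphas                      # z_i(x_k) = sum_l alpha_{l,i} (Omega_c)_{kl}
        return scores, alphas, lams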

  • Slide 65/84

    LS-SVM interpretation of kernel PCA

    [Illustration: the input data are mapped by $\varphi(\cdot)$ into the feature
    space and then projected by $w^T(\varphi(x) - \hat{\mu}_\varphi)$ onto the target
    space centered at 0; find the direction with maximal variance.]

  • Slide 66/84

    [Figure: an example data set in the (x1, x2) plane (left) and the eigenvalue
    spectrum $\lambda_i$ of the centered kernel matrix (right).]

  • Slide 67/84

    Canonical Correlation Analysis

    Hotelling (1936): linear CCA analysis.

    Applications e.g. in system identification, subspace algorithms,
    signal processing and information theory.

    New applications in bioinformatics (correlations between genes and between
    experiments, in both directions of a microarray data matrix).

    Find the maximal correlation between the projected variables
    $z_x = w^T x$ and $z_y = v^T y$
    where $x \in \mathbb{R}^{n_x}$, $y \in \mathbb{R}^{n_y}$ denote given random vectors with zero mean.

    Objective: achieve a maximal correlation coefficient
    $\max_{w,v} \; \rho = \frac{\mathcal{E}[z_x z_y]}{\sqrt{\mathcal{E}[z_x z_x]}\,\sqrt{\mathcal{E}[z_y z_y]}} = \frac{w^T C_{xy} v}{\sqrt{w^T C_{xx} w}\,\sqrt{v^T C_{yy} v}}$
    with $C_{xx} = \mathcal{E}[x x^T]$, $C_{yy} = \mathcal{E}[y y^T]$, $C_{xy} = \mathcal{E}[x y^T]$.

  • Slide 68/84

    This is formulated as the constrained optimization problem
    $\max_{w,v} \; w^T C_{xy} v$
    such that $w^T C_{xx} w = 1$, $v^T C_{yy} v = 1$,
    which leads to a generalized eigenvalue problem.

    The solution follows from the Lagrangian
    $\mathcal{L}(w,v;\eta,\nu) = w^T C_{xy} v - \frac{\eta}{2}\,(w^T C_{xx} w - 1) - \frac{\nu}{2}\,(v^T C_{yy} v - 1)$
    with Lagrange multipliers $\eta, \nu$, which gives
    $C_{xy}\, v = \eta\, C_{xx}\, w$
    $C_{yx}\, w = \nu\, C_{yy}\, v.$
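    A minimal SciPy sketch of linear CCA solved as a generalized symmetric eigenvalue problem; the small ridge added to $C_{xx}$ and $C_{yy}$ is an implementation safeguard, not part of the slides.

    import numpy as np
    from scipy.linalg import eigh

    def linear_cca(X, Y):
        """Solve [0 Cxy; Cyx 0][w; v] = rho [Cxx 0; 0 Cyy][w; v]
        for zero-mean data X (N x nx) and Y (N x ny)."""
        N, nx = X.shape
        ny = Y.shape[1]
        Cxx = X.T @ X / N + 1e-8 * np.eye(nx)
        Cyy = Y.T @ Y / N + 1e-8 * np.eye(ny)
        Cxy = X.T @ Y / N
        A = np.block([[np.zeros((nx, nx)), Cxy],
                      [Cxy.T, np.zeros((ny, ny))]])
        B = np.block([[Cxx, np.zeros((nx, ny))],
                      [np.zeros((ny, nx)), Cyy]])
        rho, V = eigh(A, B)                 # generalized symmetric-definite problem
        w, v = V[:nx, -1], V[nx:, -1]       # directions of the largest correlation
        return w, v, rho[-1]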

  • Slide 69/84

    SVM formulation of CCA

    In the primal weight space:
    $P: \; \max_{w,v,e,r} \; \gamma \sum_{k=1}^N e_k r_k - \nu_1\,\frac{1}{2}\sum_{k=1}^N e_k^2 - \nu_2\,\frac{1}{2}\sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$
    such that $e_k = w^T x_k, \quad k = 1,\ldots,N$
              $r_k = v^T y_k, \quad k = 1,\ldots,N$

    Lagrangian:
    $\mathcal{L}(w,v,e,r;\alpha,\beta) = \gamma \sum_{k=1}^N e_k r_k - \nu_1\,\frac{1}{2}\sum_{k=1}^N e_k^2 - \nu_2\,\frac{1}{2}\sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$
    $\qquad - \sum_{k=1}^N \alpha_k\,[e_k - w^T x_k] - \sum_{k=1}^N \beta_k\,[r_k - v^T y_k]$
    where $\alpha_k$, $\beta_k$ are Lagrange multipliers.

    Related objective in the CCA literature (Gittins, 1985):
    $\min_{w,v} \sum_k \| w^T x_k - v^T y_k \|_2^2$

  • Slide 70/84

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k x_k$
    $\partial\mathcal{L}/\partial v = 0 \;\Rightarrow\; v = \sum_{k=1}^N \beta_k y_k$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \gamma\, v^T y_k = \nu_1\, w^T x_k + \alpha_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial r_k = 0 \;\Rightarrow\; \gamma\, w^T x_k = \nu_2\, v^T y_k + \beta_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k = w^T x_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \beta_k = 0 \;\Rightarrow\; r_k = v^T y_k, \quad k = 1,\ldots,N$

    resulting in the following dual problem after defining $\lambda = 1/\gamma$:
    $D: \text{ solve in } \alpha, \beta:$
    $\begin{bmatrix} 0 & \Omega_y \\ \Omega_x & 0 \end{bmatrix}\begin{bmatrix}\alpha\\\beta\end{bmatrix}
     = \lambda \begin{bmatrix} \nu_1 \Omega_x + I & 0 \\ 0 & \nu_2 \Omega_y + I \end{bmatrix}\begin{bmatrix}\alpha\\\beta\end{bmatrix}$
    where $(\Omega_x)_{kl} = x_k^T x_l$ and $(\Omega_y)_{kl} = y_k^T y_l$ are the Gram matrices of the x- and y-data.

  • Slide 71/84

    Resulting score variables
    $z_{x_k} = e_k = \sum_{l=1}^N \alpha_l\, x_l^T x_k$
    $z_{y_k} = r_k = \sum_{l=1}^N \beta_l\, y_l^T y_k.$

    The eigenvalues $\lambda$ will be both positive and negative for the CCA problem.
    Also note that one has $\lambda \in [-1, 1]$.

  • Slide 72/84

    Extension to kernel CCA

    Score variables
    $z_x = w^T (\varphi_1(x) - \hat{\mu}_1)$
    $z_y = v^T (\varphi_2(y) - \hat{\mu}_2)$
    where $\varphi_1(\cdot): \mathbb{R}^{n_x} \to \mathbb{R}^{n_{h_x}}$ and $\varphi_2(\cdot): \mathbb{R}^{n_y} \to \mathbb{R}^{n_{h_y}}$ are mappings
    (which can be chosen to be different) to high dimensional feature spaces and
    $\hat{\mu}_1 = (1/N)\sum_{k=1}^N \varphi_1(x_k)$, $\hat{\mu}_2 = (1/N)\sum_{k=1}^N \varphi_2(y_k)$.

    Primal problem
    $P: \; \max_{w,v,e,r} \; \gamma \sum_{k=1}^N e_k r_k - \nu_1\,\frac{1}{2}\sum_{k=1}^N e_k^2 - \nu_2\,\frac{1}{2}\sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$
    such that $e_k = w^T(\varphi_1(x_k) - \hat{\mu}_1), \quad k = 1,\ldots,N$
              $r_k = v^T(\varphi_2(y_k) - \hat{\mu}_2), \quad k = 1,\ldots,N$

    with Lagrangian
    $\mathcal{L}(w,v,e,r;\alpha,\beta) = \gamma \sum_{k=1}^N e_k r_k - \nu_1\,\frac{1}{2}\sum_{k=1}^N e_k^2 - \nu_2\,\frac{1}{2}\sum_{k=1}^N r_k^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$
    $\qquad - \sum_{k=1}^N \alpha_k\,[e_k - w^T(\varphi_1(x_k) - \hat{\mu}_1)] - \sum_{k=1}^N \beta_k\,[r_k - v^T(\varphi_2(y_k) - \hat{\mu}_2)]$
    where $\alpha_k, \beta_k$ are Lagrange multipliers. Note that $w$ and $v$ might be
    infinite dimensional now.

  • Slide 73/84

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = 0 \;\Rightarrow\; w = \sum_{k=1}^N \alpha_k (\varphi_1(x_k) - \hat{\mu}_1)$
    $\partial\mathcal{L}/\partial v = 0 \;\Rightarrow\; v = \sum_{k=1}^N \beta_k (\varphi_2(y_k) - \hat{\mu}_2)$
    $\partial\mathcal{L}/\partial e_k = 0 \;\Rightarrow\; \gamma\, v^T(\varphi_2(y_k) - \hat{\mu}_2) = \nu_1\, w^T(\varphi_1(x_k) - \hat{\mu}_1) + \alpha_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial r_k = 0 \;\Rightarrow\; \gamma\, w^T(\varphi_1(x_k) - \hat{\mu}_1) = \nu_2\, v^T(\varphi_2(y_k) - \hat{\mu}_2) + \beta_k, \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_k = 0 \;\Rightarrow\; e_k = w^T(\varphi_1(x_k) - \hat{\mu}_1), \quad k = 1,\ldots,N$
    $\partial\mathcal{L}/\partial \beta_k = 0 \;\Rightarrow\; r_k = v^T(\varphi_2(y_k) - \hat{\mu}_2), \quad k = 1,\ldots,N$

    resulting in the following dual problem after defining $\lambda = 1/\gamma$:
    $D: \text{ solve in } \alpha, \beta:$
    $\begin{bmatrix} 0 & \Omega_{c,2} \\ \Omega_{c,1} & 0 \end{bmatrix}\begin{bmatrix}\alpha\\\beta\end{bmatrix}
     = \lambda \begin{bmatrix} \nu_1 \Omega_{c,1} + I & 0 \\ 0 & \nu_2 \Omega_{c,2} + I \end{bmatrix}\begin{bmatrix}\alpha\\\beta\end{bmatrix}$
    with
    $\Omega_{c,1,kl} = (\varphi_1(x_k) - \hat{\mu}_1)^T(\varphi_1(x_l) - \hat{\mu}_1)$
    $\Omega_{c,2,kl} = (\varphi_2(y_k) - \hat{\mu}_2)^T(\varphi_2(y_l) - \hat{\mu}_2)$

  • Slide 74/84

    These are the elements of the centered Gram matrices for $k, l = 1,\ldots,N$.
    In practice these matrices can be computed by
    $\Omega_{c,1} = M_c\, \Omega_1\, M_c, \qquad \Omega_{c,2} = M_c\, \Omega_2\, M_c$
    with centering matrix $M_c = I - (1/N)\,\mathbf{1}_v \mathbf{1}_v^T$.

    The resulting score variables can be computed by applying the kernel trick with kernels
    $K_1(x_k, x_l) = \varphi_1(x_k)^T \varphi_1(x_l)$
    $K_2(y_k, y_l) = \varphi_2(y_k)^T \varphi_2(y_l)$

    Related work: Bach & Jordan (2002), Rosipal & Trejo (2001)

  • Slide 75/84

    Recurrent Least Squares SVMs

    The standard SVM is only for static estimation problems.
    Dynamic case: recurrent LS-SVM (Suykens, 1999).

    Feedforward/NARX/series-parallel:
    $y_k = f(y_{k-1}, y_{k-2}, \ldots, y_{k-p}, u_{k-1}, u_{k-2}, \ldots, u_{k-p})$
    with input $u_k \in \mathbb{R}$ and output $y_k \in \mathbb{R}$.
    Training by the classical SVM method.

    Recurrent/NOE/parallel:
    $\hat{y}_k = f(\hat{y}_{k-1}, \hat{y}_{k-2}, \ldots, \hat{y}_{k-p}, u_{k-1}, u_{k-2}, \ldots, u_{k-p})$
    Classically trained by Narendra's dynamic backpropagation or Werbos'
    backpropagation through time.

    Parametrization for the autonomous case:
    $\hat{y}_k = w^T \varphi\big( [\hat{y}_{k-1}; \hat{y}_{k-2}; \ldots; \hat{y}_{k-p}] \big) + b$
    which is equivalent to
    $y_k - e_k = w^T \varphi\big( x_{k-1|k-p} - \xi_{k-1|k-p} \big) + b$
    where $e_k = y_k - \hat{y}_k$, $\xi_{k-1|k-p} = [e_{k-1}; e_{k-2}; \ldots; e_{k-p}]$,
    $x_{k-1|k-p} = [y_{k-1}; y_{k-2}; \ldots; y_{k-p}]$.

  • Slide 76/84

    Optimization problem:
    $\min_{w,e} \; \mathcal{J}(w,e) = \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=p+1}^{N+p} e_k^2$
    subject to
    $y_k - e_k = w^T \varphi\big( x_{k-1|k-p} - \xi_{k-1|k-p} \big) + b, \quad k = p+1,\ldots,N+p.$

    Lagrangian
    $\mathcal{L}(w,b,e;\alpha) = \mathcal{J}(w,b,e) + \sum_{k=p+1}^{N+p} \alpha_{k-p}\,\big[ y_k - e_k - w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) - b \big]$

    Conditions for optimality
    $\partial\mathcal{L}/\partial w = w - \sum_{k=p+1}^{N+p} \alpha_{k-p}\, \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) = 0$
    $\partial\mathcal{L}/\partial b = \sum_{k=p+1}^{N+p} \alpha_{k-p} = 0$
    $\partial\mathcal{L}/\partial e_k = \gamma e_k - \alpha_{k-p} - \sum_{i=1}^{p} \alpha_{k-p+i}\, \frac{\partial}{\partial e_{k-i}}\big[ w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) \big] = 0, \quad k = p+1,\ldots,N$
    $\partial\mathcal{L}/\partial \alpha_{k-p} = y_k - e_k - w^T \varphi(x_{k-1|k-p} - \xi_{k-1|k-p}) - b = 0, \quad k = p+1,\ldots,N+p$

  • Slide 77/84

    The Mercer condition
    $K(z_{k-1|k-p}, z_{l-1|l-p}) = \varphi(z_{k-1|k-p})^T \varphi(z_{l-1|l-p})$
    gives
    $\sum_{k=p+1}^{N+p} \alpha_{k-p} = 0$
    $\gamma e_k - \alpha_{k-p} - \sum_{i=1}^{p} \alpha_{k-p+i}\, \frac{\partial}{\partial e_{k-i}}\Big[ \sum_{l=p+1}^{N+p} \alpha_{l-p} K(z_{k-1|k-p}, z_{l-1|l-p}) \Big] = 0, \quad k = p+1,\ldots,N$
    $y_k - e_k - \sum_{l=p+1}^{N+p} \alpha_{l-p} K(z_{k-1|k-p}, z_{l-1|l-p}) - b = 0, \quad k = p+1,\ldots,N+p$

    Solving this set of nonlinear equations in the unknowns $e, \alpha, b$ is
    computationally very expensive.

  • Slide 78/84

    Recurrent LS-SVMs: example

    Chua's circuit:
    $\dot{x} = a\,[y - h(x)]$
    $\dot{y} = x - y + z$
    $\dot{z} = -b\,y$
    with piecewise linear characteristic
    $h(x) = m_1 x + \frac{1}{2}(m_0 - m_1)\,(|x + 1| - |x - 1|).$

    Double scroll attractor for
    $a = 9$, $b = 14.286$, $m_0 = -1/7$, $m_1 = 2/7$.

    [Figure: the double-scroll attractor in the (x1, x2, x3) state space.]

    Recurrent LS-SVM:
    N = 300 training data
    Model: p = 12
    RBF kernel: $\sigma = 1$
    SQP with early stopping
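    A minimal SciPy sketch that simulates Chua's circuit to generate double-scroll trajectories such as those used here as training data; the initial condition and integration settings are illustrative.

    import numpy as np
    from scipy.integrate import solve_ivp

    def chua_rhs(t, s, a=9.0, b=14.286, m0=-1/7, m1=2/7):
        """Chua's circuit with piecewise-linear characteristic h(x)."""
        x, y, z = s
        h = m1 * x + 0.5 * (m0 - m1) * (abs(x + 1) - abs(x - 1))
        return [a * (y - h), x - y + z, -b * y]

    # generate a double-scroll trajectory (e.g. as training data for a recurrent model)
    sol = solve_ivp(chua_rhs, (0.0, 100.0), [0.1, 0.0, 0.0], max_step=0.02)
    x1, x2, x3 = sol.y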

  • Slide 79/84

    [Figure: trajectory learning of the double scroll (full line) by a recurrent
    least squares SVM with RBF kernel. The simulation result after training
    (dashed line) on N = 300 data points is shown, with data points k = 1 to 12
    as initial condition. For the model structure p = 12 and $\sigma = 1$ are taken.
    Early stopping is done in order to avoid overfitting.]

  • Slide 80/84

    LS-SVM for control

    The N-stage optimal control problem

    Optimal control problem:
    $\min \; \mathcal{J}_N(x_k, u_k) = \rho(x_{N+1}) + \sum_{k=1}^N h(x_k, u_k)$
    subject to the system dynamics
    $x_{k+1} = f(x_k, u_k), \quad k = 1,\ldots,N \quad (x_1 \text{ given})$
    where $x_k \in \mathbb{R}^n$ is the state vector, $u_k \in \mathbb{R}$ the input, and
    $\rho(\cdot)$, $h(\cdot,\cdot)$ are positive definite functions.

    Lagrangian
    $\mathcal{L}_N(x_k, u_k; \lambda_k) = \mathcal{J}_N(x_k, u_k) + \sum_{k=1}^N \lambda_k^T\, [x_{k+1} - f(x_k, u_k)]$
    with Lagrange multipliers $\lambda_k \in \mathbb{R}^n$.

    Conditions for optimality:
    $\frac{\partial\mathcal{L}_N}{\partial x_k} = \frac{\partial h}{\partial x_k} + \lambda_{k-1} - \left(\frac{\partial f}{\partial x_k}\right)^T \lambda_k = 0, \quad k = 2,\ldots,N$  (adjoint equation)
    $\frac{\partial\mathcal{L}_N}{\partial x_{N+1}} = \frac{\partial \rho}{\partial x_{N+1}} + \lambda_N = 0$  (adjoint final condition)
    $\frac{\partial\mathcal{L}_N}{\partial u_k} = \frac{\partial h}{\partial u_k} - \lambda_k^T \frac{\partial f}{\partial u_k} = 0, \quad k = 1,\ldots,N$  (variational condition)
    $\frac{\partial\mathcal{L}_N}{\partial \lambda_k} = x_{k+1} - f(x_k, u_k) = 0, \quad k = 1,\ldots,N$  (system dynamics)

  • Slide 81/84

    Optimal control with LS-SVMs

    Consider an LS-SVM from the state space to the action space.
    Combine the optimization problem formulations of the optimal control
    problem and of the LS-SVM into one formulation.

    Optimal control problem with LS-SVM:
    $\min \; \mathcal{J}(x_k, u_k, w, e_k) = \mathcal{J}_N(x_k, u_k) + \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2$
    subject to the system dynamics
    $x_{k+1} = f(x_k, u_k), \quad k = 1,\ldots,N \quad (x_1 \text{ given})$
    and the control law
    $u_k = w^T \varphi(x_k) + e_k, \quad k = 1,\ldots,N$

    The actual control signal applied to the plant is $w^T \varphi(x_k)$.

    Lagrangian
    $\mathcal{L}(x_k, u_k, w, e_k; \lambda_k, \alpha_k) = \mathcal{J}_N(x_k, u_k) + \frac{1}{2} w^T w + \gamma\,\frac{1}{2} \sum_{k=1}^N e_k^2$
    $\qquad + \sum_{k=1}^N \lambda_k^T\, [x_{k+1} - f(x_k, u_k)] + \sum_{k=1}^N \alpha_k\, [u_k - w^T \varphi(x_k) - e_k]$

  • Slide 82/84

    Conditions for optimality
    $\frac{\partial\mathcal{L}}{\partial x_k} = \frac{\partial h}{\partial x_k} + \lambda_{k-1} - \left(\frac{\partial f}{\partial x_k}\right)^T \lambda_k - \alpha_k \frac{\partial}{\partial x_k}[w^T \varphi(x_k)] = 0, \quad k = 2,\ldots,N$  (adjoint equation)
    $\frac{\partial\mathcal{L}}{\partial x_{N+1}} = \frac{\partial \rho}{\partial x_{N+1}} + \lambda_N = 0$  (adjoint final condition)
    $\frac{\partial\mathcal{L}}{\partial u_k} = \frac{\partial h}{\partial u_k} - \lambda_k^T \frac{\partial f}{\partial u_k} + \alpha_k = 0, \quad k = 1,\ldots,N$  (variational condition)
    $\frac{\partial\mathcal{L}}{\partial w} = w - \sum_{k=1}^N \alpha_k \varphi(x_k) = 0$  (support vectors)
    $\frac{\partial\mathcal{L}}{\partial e_k} = \gamma e_k - \alpha_k = 0, \quad k = 1,\ldots,N$  (support values)
    $\frac{\partial\mathcal{L}}{\partial \lambda_k} = x_{k+1} - f(x_k, u_k) = 0, \quad k = 1,\ldots,N$  (system dynamics)
    $\frac{\partial\mathcal{L}}{\partial \alpha_k} = u_k - w^T \varphi(x_k) - e_k = 0, \quad k = 1,\ldots,N$  (SVM control)

    One obtains a set of nonlinear equations of the form
    $F_1(x_k, x_{N+1}, u_k, w, e_k, \lambda_k, \alpha_k) = 0$
    for $k = 1,\ldots,N$ with $x_1$ given.

    Apply the Mercer condition trick:
    $K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l)$
    For RBF kernels one takes
    $K(x_k, x_l) = \exp(-\eta\, \|x_k - x_l\|_2^2)$

  • Slide 83/84

    By replacing $w = \sum_l \alpha_l \varphi(x_l)$ in the equations, one obtains
    $\frac{\partial h}{\partial x_k} + \lambda_{k-1} - \left(\frac{\partial f}{\partial x_k}\right)^T \lambda_k - \alpha_k \sum_{l=1}^N \alpha_l \frac{\partial K(x_k, x_l)}{\partial x_k} = 0, \quad k = 2,\ldots,N$
    $\frac{\partial \rho}{\partial x_{N+1}} + \lambda_N = 0$
    $\frac{\partial h}{\partial u_k} - \lambda_k^T \frac{\partial f}{\partial u_k} + \alpha_k = 0, \quad k = 1,\ldots,N$
    $x_{k+1} - f(x_k, u_k) = 0, \quad k = 1,\ldots,N$
    $u_k - \sum_{l=1}^N \alpha_l K(x_l, x_k) - \alpha_k/\gamma = 0, \quad k = 1,\ldots,N$
    which is of the form
    $F_2(x_k, x_{N+1}, u_k, \lambda_k, \alpha_k) = 0$
    for $k = 1,\ldots,N$ with $x_1$ given.

    For RBF kernels one has
    $\frac{\partial K(x_k, x_l)}{\partial x_k} = -2\eta\,(x_k - x_l)\,\exp(-\eta\, \|x_k - x_l\|_2^2)$

    The actual control signal applied to the plant becomes
    $u_k = \sum_{l=1}^N \alpha_l K(x_l, x_k)$
    where $\{x_l\}_{l=1}^N$, $\{\alpha_l\}_{l=1}^N$ are the solution to $F_2 = 0$.

  • Slide 84/84