An RKHS Approach to Systematic Kernel Selection in Nonlinear System Identification
TRANSCRIPT
An RKHS Approach to Systematic Kernel Selection
in Nonlinear System Identification
Y. Bhujwalla, V. Laurain, M. Gilson
55th IEEE Conference on Decision and Control
[email protected]
Yusuf Bhujwalla (Université de Lorraine) CDC 2016 1 / 14
Introduction: Problem Description

Measured data: $\mathcal{D}_N = \{(u_1, y_1), (u_2, y_2), \ldots, (u_N, y_N)\}$

Describing an unknown system $\mathcal{S}_o$:
$$y_{o,k} = f_o(x_k), \quad f_o : \mathcal{X} \to \mathbb{R}$$
$$y_k = y_{o,k} + e_{o,k}, \quad e_{o,k} \sim \mathcal{N}(0, \sigma_e^2)$$

with regressor
$$x_k = [\, y_{k-1} \cdots y_{k-n_a} \;\; u_{1,k} \cdots u_{1,k-n_b} \;\; u_{2,k} \cdots u_{n_u,k-n_b} \,]^\top \in \mathcal{X} = \mathbb{R}^{n_a + n_u(n_b+1)}$$
Introduction: Modelling Objective

Aim: to choose the simplest model from a candidate set of models that accurately describes the system:

$$\mathcal{M}_{opt} : \text{Accuracy (Data) vs Simplicity (Model)}$$
$$\mathcal{V}_f : V(f) = \sum_{k=1}^{N} (y_k - f(x_k))^2 + g(f)$$

Q1: How to choose the simplest accurate model?
- Often $g(f) = \lambda \|f\|_\mathcal{H}^2$, ensuring uniqueness of the solution
- $\lambda$ controls the bias-variance trade-off

Q2: How to determine a suitable set of candidate models?
Outline
1. Kernel Methods in Nonlinear Identification
2. Model Selection Using Derivatives
3. Smoothness-Enforcing Regularisation
4. Application : Estimation of Locally Nonsmooth Functions
1. Kernel Methods in Nonlinear Identification
FIGURE: Kernel estimate $\hat{f}$ built from kernel sections $k_x$ over the input range

→ Model:
$$\mathcal{F}_f : f(x) = \sum_{i=1}^{N} \alpha_i k_{x_i}(x)$$

→ Nonparametric ($n_\theta \sim N$)
→ Flexible: $\mathcal{M}$ defined through choice of $\mathcal{K}$
→ Height: $\alpha$ (model parameters)
→ Width: $\sigma$ (kernel hyperparameter)
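The kernel expansion above, combined with the functional-norm penalty $\mathcal{V}_f$ from the previous slide, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the Gaussian kernel choice, data and hyperparameter values are assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel section k_b(a) = exp(-(a - b)^2 / (2 sigma^2))."""
    return np.exp(-(a - b) ** 2 / (2.0 * sigma ** 2))

def fit_kernel_model(x, y, sigma, lam):
    """Minimise ||y - K alpha||^2 + lam * alpha' K alpha (V_f with
    g(f) = lam ||f||_H^2); the minimiser is alpha = (K + lam I)^{-1} y."""
    K = gaussian_kernel(x[:, None], x[None, :], sigma)  # N x N Gram matrix
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def predict(x_new, x, alpha, sigma):
    """f(x) = sum_i alpha_i k_{x_i}(x), i.e. the representer F_f."""
    return gaussian_kernel(x_new[:, None], x[None, :], sigma) @ alpha

# Illustrative data (assumed): a smooth target sampled on [0, 1]
x = np.linspace(0.0, 1.0, 12)
y = np.sin(2.0 * np.pi * x)
alpha = fit_kernel_model(x, y, sigma=0.15, lam=1e-3)
f_hat = predict(x, x, alpha, sigma=0.15)
```

Small $\lambda$ tracks the data closely; large $\lambda$ shrinks $f$ toward zero, which is exactly the bias-variance trade-off that $\lambda$ controls.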
1. Kernel Methods in Nonlinear Identification: Identification in the RKHS

Reproducing Kernel Hilbert Spaces: the kernel function defines the model class:
$$\mathcal{K} \leftrightarrow \mathcal{H}$$

Hence, functions can be represented in terms of kernels:
$$f(x) = \langle f, k_x \rangle_\mathcal{H} \quad (1)$$
1. Kernel Methods in Nonlinear Identification: The Kernel Selection Problem

Choosing an overly flexible model class (a small kernel):

FIGURE: Flexible Model Class
FIGURE: High Variance ($f_o$, $y$, $\hat{f}$, $k_x$)
1. Kernel Methods in Nonlinear Identification: The Kernel Selection Problem

Choosing an overly constrained model class (a large kernel):

FIGURE: Constrained Model Class
FIGURE: Model Biased ($f_o$, $y$, $\hat{f}$, $k_x$)
1. Kernel Methods in Nonlinear Identification: The Kernel Selection Problem

Why not just choose the 'optimal' model class?

FIGURE: Optimal Model Class
FIGURE: Optimal Model ($f_o$, $y$, $\hat{f}$, $k_x$)
1. Kernel Methods in Nonlinear Identification: The Kernel Selection Problem

Why not just choose the 'optimal' model class?
• This is, in general, what we try to do.
• However, $\mathcal{H}_{opt}$ is unknown.
• Optimisation over one hyperparameter: not that difficult.
• Optimisation over multiple model structures, kernel functions and hyperparameters: more difficult.
2. Model Selection Using Derivatives

But note that many properties of $\mathcal{K}$ are encoded into its derivatives, e.g.

- Smoothness: $f(x) = ax^2 + bx + c \implies \frac{d^3 f(x)}{dx^3} = 0 \;\; \forall x$
- Nonsmoothness: $f(x) = g_1(x)\,[x < x^*] + g_2(x)\,[x > x^*] \implies \exists \, \frac{df(x)}{dx} \;\; \forall x \neq x^*$
- Linearity: $f(x_1, x_2) = x_1 h_1(x_2) + h_2(x_2) \implies \frac{\partial^2 f(x_1, x_2)}{\partial x_1^2} = 0 \;\; \forall x_1$
- Separability: $f(x_1, x_2) = g(x_1) + h(x_2) \implies \frac{\partial^2 f(x_1, x_2)}{\partial x_1 \partial x_2} = 0 \;\; \forall x_1, x_2$
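The derivative signatures listed above can be checked numerically with central finite differences; a small sketch follows, where the test functions and step sizes are illustrative assumptions.

```python
import numpy as np

def third_derivative(f, x, h=1e-2):
    """Central finite-difference estimate of d^3 f / dx^3."""
    return (f(x + 2*h) - 2*f(x + h) + 2*f(x - h) - f(x - 2*h)) / (2 * h**3)

def cross_partial(f, x1, x2, h=1e-4):
    """Central finite-difference estimate of d^2 f / (dx1 dx2)."""
    return (f(x1 + h, x2 + h) - f(x1 + h, x2 - h)
            - f(x1 - h, x2 + h) + f(x1 - h, x2 - h)) / (4 * h * h)

quadratic = lambda x: 3*x**2 + 2*x + 1           # quadratic: third derivative is 0
separable = lambda x1, x2: np.sin(x1) + x2**2    # separable: cross-partial is 0
coupled   = lambda x1, x2: np.sin(x1 * x2)       # non-separable: cross-partial nonzero
```

The quadratic returns a (numerically) zero third derivative everywhere, and only the separable function has a vanishing cross-partial, matching the table above.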
2. Model Selection Using Derivatives

Incorporating this information into the problem formulation allows the model selection to be transferred from an optimisation over $\mathcal{K}$...

...to an explicit regularisation problem over derivatives, using an a priori flexible model class definition.
3. Smoothness-Enforcing Regularisation: Problem Formulation

Here we consider $\mathcal{X} = \mathbb{R}$, where the kernel optimisation is reduced to a smoothness selection problem.

What would we like to do? Replace the existing functional-norm regularisation...
$$\mathcal{V}_f : V(f) = \sum_{k=1}^{N} (y_k - f(x_k))^2 + \lambda \|f\|_\mathcal{H}^2$$

...with a smoothness penalty in the cost function...
$$\mathcal{V}_D : V(f) = \sum_{k=1}^{N} (y_k - f(x_k))^2 + \lambda \|Df\|_\mathcal{H}^2$$

How?
- $\|Df\|_\mathcal{H}^2$: known (D. X. Zhou, 2008)
- $f(x)$ for $\mathcal{V}_D$: unknown
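For a Gaussian kernel the penalty $\|Df\|_\mathcal{H}^2$ can be written as $\alpha^\top G \alpha$, where $G$ collects the cross-derivatives of the kernel (following Zhou, 2008). A minimal sketch, with the data, kernel widths and $\lambda$ values as illustrative assumptions:

```python
import numpy as np

def fit_vd(x, y, sigma, lam):
    """Sketch of V_D: minimise ||y - K alpha||^2 + lam * alpha' G alpha,
    where G_ij = d^2 k(x_i, x_j) / (dx_i dx_j) for the Gaussian kernel,
    so that ||Df||_H^2 = alpha' G alpha. Normal equations (K symmetric):
    (K K + lam G) alpha = K y."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2 * sigma**2))
    G = (1.0 / sigma**2 - d**2 / sigma**4) * K      # cross-derivative Gram matrix
    A = K @ K + lam * G + 1e-9 * np.eye(len(x))     # small jitter for conditioning
    return np.linalg.solve(A, K @ y), K

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(15)

a_small, K = fit_vd(x, y, sigma=0.1, lam=1e-6)   # weak smoothness penalty
a_large, _ = fit_vd(x, y, sigma=0.1, lam=10.0)   # strong smoothness penalty
```

Unlike $\|f\|_\mathcal{H}^2$, increasing $\lambda$ here does not shrink $f$ toward zero but toward functions with small derivative, i.e. toward smoothness.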
3. Smoothness-Enforcing Regularisation: An Extended Representer of $f(x)$

A finite representer for $\mathcal{V}_D$ does not exist. But, by adding kernels along $\mathcal{X}$, an approximate formulation can be defined:

FIGURE: $N = 2$ (observations, observation kernels, $\|f\|^2$)
FIGURE: $(N, P) = (2, 8)$ (observations, observation kernels, added kernels, $\|Df\|^2$)

$$\mathcal{F}_D : f(x) = \sum_{i=1}^{N} \alpha_i k_{x_i}(x) + \sum_{j=1}^{P} \alpha_j^* k_{x_j^*}(x)$$
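The extended representer $\mathcal{F}_D$ simply augments the $N$ observation-centred kernels with $P$ additional centres placed along $\mathcal{X}$. A sketch of the resulting design matrix, where the centre placement and kernel width are illustrative assumptions:

```python
import numpy as np

def extended_design(x_obs, x_add, x_eval, sigma):
    """Columns are Gaussian kernel sections at the N observation centres
    followed by the P added centres x*_j, evaluated at x_eval, so that
    f(x) = Phi @ [alpha; alpha*] realises the extended representer F_D."""
    centres = np.concatenate([x_obs, x_add])
    d = x_eval[:, None] - centres[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

x_obs = np.array([0.3, 0.7])           # N = 2 observations
x_add = np.linspace(0.0, 1.0, 8)       # P = 8 added centres, as in (N, P) = (2, 8)
x_eval = np.linspace(0.0, 1.0, 50)
Phi = extended_design(x_obs, x_add, x_eval, sigma=0.1)
```

Uniform placement of the added centres is one simple choice; the kernel-density analysis on the next slide makes the choice of width systematic for a given $P$.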
3. Smoothness-Enforcing Regularisation: Choosing the Kernel Width

Examination of the kernel density allows us to make an a priori choice of kernel width:

FIGURE: $\rho_k = 0.4$ ($\hat{f}$, $k_x$)
FIGURE: $\rho_k = 0.5$ ($\hat{f}$, $k_x$)
FIGURE: $\rho_k = 0.6$ ($\hat{f}$, $k_x$)

Hence, for a given $P$, we can define the maximally flexible model class for a given problem.
4. Application: Estimation of Locally Nonsmooth Functions

In $\mathcal{V}_D$, smoothness $\sim$ regularisation.

Hence, by introducing weights into the loss function, the importance of the regularisation can be varied across $\mathcal{X}$:
$$\mathcal{V}_w : V(f) = \sum_{k=1}^{N} (w_k y_k - w_k f(x_k))^2 + \lambda \|Df\|_\mathcal{H}^2$$

How to determine the weights? Relative to a particular modelling objective, e.g.
• $w_k \sim \|D\hat{f}^{(0)}(x_k)\|_2^2$ for piecewise constant structures, or
• $w_k \sim \|D^2\hat{f}^{(0)}(x_k)\|_2^2$ for piecewise linear structures.
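The weighted criterion $\mathcal{V}_w$ can be sketched by reusing the Gaussian-kernel derivative penalty; larger $w_k$ forces a tighter fit at $x_k$, locally relaxing the smoothing. The step data, weight values and hyperparameters below are illustrative assumptions, with fixed weights standing in for the derivative-based choice above.

```python
import numpy as np

def fit_vw(x, y, w, sigma, lam):
    """Sketch of V_w: minimise ||W (y - K alpha)||^2 + lam * alpha' G alpha
    with W = diag(w) and ||Df||_H^2 = alpha' G alpha (Gaussian kernel).
    Normal equations: (K W^2 K + lam G) alpha = K W^2 y."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2 * sigma**2))
    G = (1.0 / sigma**2 - d**2 / sigma**4) * K
    W2 = np.diag(w**2)
    A = K @ W2 @ K + lam * G + 1e-9 * np.eye(len(x))  # jitter for conditioning
    return np.linalg.solve(A, K @ W2 @ y), K

x = np.linspace(-1.0, 1.0, 20)
y = np.where(x < 0.0, 0.0, 2.0)        # piecewise constant (locally nonsmooth) target

w_uniform = np.ones(20)
w_jump = np.ones(20)
w_jump[9:11] = 5.0                     # upweight the two samples around the jump

a_u, K = fit_vw(x, y, w_uniform, sigma=0.3, lam=1.0)
a_w, _ = fit_vw(x, y, w_jump, sigma=0.3, lam=1.0)
```

With the upweighted samples, the fit follows the discontinuity more closely while the smoothness penalty still acts elsewhere.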
4. Application: Estimation of Locally Nonsmooth Functions

FIGURE: Noise-Free System ($y_o$)
FIGURE: Noisy System ($y$)
4. Application: Estimation of Locally Nonsmooth Functions

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MED}}$, BIAS + SDEV)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MED}}$, BIAS + SDEV)
4. Application: Estimation of Locally Nonsmooth Functions

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MED}}$, BIAS + SDEV)
FIGURE: $\mathcal{V}_w : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MED}}$, BIAS + SDEV)
Conclusions

Objectives:
• To simplify model selection in nonlinear identification.
• By shifting the problem to a regularisation over functional derivatives.
→ Allowing the definition of an a priori flexible model class.

This presentation:
• First step ⇒ consider a simple example.
→ Model selection ⇔ smoothness detection.
→ Kernel selection ⇔ hyperparameter optimisation.

Current/Future Research:
• Application to dynamical, control-oriented problems (e.g. linear parameter-varying identification).
• Investigation of more complex model selection problems (e.g. detection of linearities, separability...).
A. Bibliography

• Sobolev Spaces (Wahba, 1990; Pillonetto et al., 2014)
$$\|f\|_{\mathcal{H}_k}^2 = \sum_{i=0}^{m} \int_\mathcal{X} \left( \frac{d^i f(x)}{dx^i} \right)^2 dx$$

• Identification using derivative observations (Zhou, 2008; Rosasco et al., 2010)
$$V_{obs}(f) = \|y - f(x)\|_2^2 + \gamma_1 \left\| \frac{dy}{dx} - \frac{df(x)}{dx} \right\|_2^2 + \cdots + \gamma_m \left\| \frac{d^m y}{dx^m} - \frac{d^m f(x)}{dx^m} \right\|_2^2 + \lambda \|f\|_\mathcal{H}$$

• Regularisation using derivatives (Rosasco et al., 2010; Lauer, Le and Bloch, 2012; Duijkers et al., 2014)
$$V_D(f) = \|y - f(x)\|_2^2 + \lambda \|D^m f\|_p$$
B. Choosing the Kernel Width: The Smoothness-Tolerance Parameter

$$\rho_k = \frac{\sigma}{\Delta x^*}, \quad \Delta x^* = \frac{x^*_{max} - x^*_{min}}{P}, \quad \hat{\epsilon}_f = 100 \times \left\{ 1 - \frac{\|\hat{f}\|_\infty}{C} \right\} \%$$

FIGURE: Selecting an appropriate kernel using $\epsilon$ (smoothness tolerance $\epsilon\,\%$ vs kernel density $\rho$; $\epsilon(\rho)$, $\hat{\epsilon}$)
C. Effect of the Regularisation

⇒ Negligible regularisation (very small $\lambda_f$, $\lambda_D$).

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
C. Effect of the Regularisation

⇒ Light regularisation (small $\lambda_f$, $\lambda_D$).

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
C. Effect of the Regularisation

⇒ Moderate regularisation.

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
C. Effect of the Regularisation

⇒ Heavy regularisation (large $\lambda_f$, $\lambda_D$).

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
C. Effect of the Regularisation

⇒ Excessive regularisation (very large $\lambda_f$, $\lambda_D$).

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
D. Further Examples: Detecting Piecewise Structures

$\mathcal{S}_o$: Noise-free and observed data

FIGURE: $y(x_1, x_2)$
D. Further Examples : Detecting Piecewise Structures
Results M1 : (Vf , Ff )
FIGURE: MEDIAN FIGURE: BIAS FIGURE: SDEV
D. Further Examples : Detecting Piecewise Structures
Results M2 : (VD, FD)
FIGURE: MEDIAN FIGURE: BIAS FIGURE: SDEV
D. Further Examples : Detecting Piecewise Structures
Results M3 : (Vw, FD)
FIGURE: MEDIAN FIGURE: BIAS FIGURE: SDEV
E. Further Examples: Enforcing Separability

$$f(x_1, x_2) \xrightarrow{\;\lambda\;} f_1(x_1) + f_2(x_2)$$

FIGURE: $\mathcal{V}_{DX} : R(\partial_{x_1} \partial_{x_2} f)$