An RKHS Approach to Systematic Kernel Selection in Nonlinear System Identification
TRANSCRIPT
An RKHS Approach to Systematic Kernel Selection
in Nonlinear System Identification
Y. Bhujwalla, V. Laurain, M. Gilson
55th IEEE Conference on Decision and Control
[email protected]
Yusuf Bhujwalla (Université de Lorraine) CDC 2016 1 / 14
Introduction: Problem Description

Measured data: $\mathcal{D}_N = \{(u_1, y_1), (u_2, y_2), \ldots, (u_N, y_N)\}$

Describing an unknown system $\mathcal{S}_o$:
$$y_{o,k} = f_o(x_k), \quad f_o : \mathcal{X} \to \mathbb{R}$$
$$y_k = y_{o,k} + e_{o,k}, \quad e_{o,k} \sim \mathcal{N}(0, \sigma_e^2)$$

with regressor
$$x_k = [\, y_{k-1} \cdots y_{k-n_a} \;\; u_{1,k} \cdots u_{1,k-n_b} \;\; u_{2,k} \cdots u_{n_u,k-n_b} \,]^\top \in \mathcal{X} = \mathbb{R}^{n_a + n_u(n_b+1)}$$
Introduction: Modelling Objective

Aim: to choose the simplest model from a candidate set of models that accurately describes the system:

$$\mathcal{M}_{opt} : \text{Accuracy (Data) vs Simplicity (Model)}$$
$$\mathcal{V}_f : V(f) = \sum_{k=1}^{N} (y_k - f(x_k))^2 + g(f)$$

Q1: How to choose the simplest accurate model?
- Often $g(f) = \lambda \|f\|_\mathcal{H}^2$, ensuring uniqueness of the solution
- $\lambda$ controls the bias-variance trade-off

Q2: How to determine a suitable set of candidate models?
Outline
1. Kernel Methods in Nonlinear Identification
2. Model Selection Using Derivatives
3. Smoothness-Enforcing Regularisation
4. Application : Estimation of Locally Nonsmooth Functions
1. Kernel Methods in Nonlinear Identification
FIGURE: Kernel estimate $\hat{f}$ built from kernel sections $k_x$ over the input range

→ Model:
$$\mathcal{F}_f : f(x) = \sum_{i=1}^{N} \alpha_i k_{x_i}(x)$$

→ Nonparametric ($n_\theta \sim N$)
→ Flexible: $\mathcal{M}$ defined through choice of $\mathcal{K}$
→ Height: $\alpha$ (model parameters)
→ Width: $\sigma$ (kernel hyperparameter)
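The kernel expansion above, combined with the functional-norm penalty $\mathcal{V}_f$ from the previous slide, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the Gaussian kernel choice, data and hyperparameter values are assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel section k_b(a) = exp(-(a - b)^2 / (2 sigma^2))."""
    return np.exp(-(a - b) ** 2 / (2.0 * sigma ** 2))

def fit_kernel_model(x, y, sigma, lam):
    """Minimise ||y - K alpha||^2 + lam * alpha' K alpha (V_f with
    g(f) = lam ||f||_H^2); the minimiser is alpha = (K + lam I)^{-1} y."""
    K = gaussian_kernel(x[:, None], x[None, :], sigma)  # N x N Gram matrix
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def predict(x_new, x, alpha, sigma):
    """f(x) = sum_i alpha_i k_{x_i}(x), i.e. the representer F_f."""
    return gaussian_kernel(x_new[:, None], x[None, :], sigma) @ alpha

# Illustrative data (assumed): a smooth target sampled on [0, 1]
x = np.linspace(0.0, 1.0, 12)
y = np.sin(2.0 * np.pi * x)
alpha = fit_kernel_model(x, y, sigma=0.15, lam=1e-3)
f_hat = predict(x, x, alpha, sigma=0.15)
```

Small $\lambda$ tracks the data closely; large $\lambda$ shrinks $f$ toward zero, which is exactly the bias-variance trade-off that $\lambda$ controls.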
1. Kernel Methods in Nonlinear Identification: Identification in the RKHS

Reproducing Kernel Hilbert Spaces: the kernel function defines the model class:
$$\mathcal{K} \leftrightarrow \mathcal{H}$$

Hence, functions can be represented in terms of kernels:
$$f(x) = \langle f, k_x \rangle_\mathcal{H} \quad (1)$$
1. Kernel Methods in Nonlinear Identification: The Kernel Selection Problem

Choosing an overly flexible model class (a small kernel):

FIGURE: Flexible Model Class
FIGURE: High Variance ($f_o$, $y$, $\hat{f}$, $k_x$)
1. Kernel Methods in Nonlinear Identification: The Kernel Selection Problem

Choosing an overly constrained model class (a large kernel):

FIGURE: Constrained Model Class
FIGURE: Model Biased ($f_o$, $y$, $\hat{f}$, $k_x$)
1. Kernel Methods in Nonlinear Identification: The Kernel Selection Problem

Why not just choose the 'optimal' model class?

FIGURE: Optimal Model Class
FIGURE: Optimal Model ($f_o$, $y$, $\hat{f}$, $k_x$)
1. Kernel Methods in Nonlinear Identification: The Kernel Selection Problem

Why not just choose the 'optimal' model class?
• This is, in general, what we try to do.
• However, $\mathcal{H}_{opt}$ is unknown.
• Optimisation over one hyperparameter: not that difficult.
• Optimisation over multiple model structures, kernel functions and hyperparameters: more difficult.
2. Model Selection Using Derivatives

But note that many properties of $\mathcal{K}$ are encoded into its derivatives, e.g.

- Smoothness: $f(x) = ax^2 + bx + c \implies \frac{d^3 f(x)}{dx^3} = 0 \;\; \forall x$
- Nonsmoothness: $f(x) = g_1(x)\,[x < x^*] + g_2(x)\,[x > x^*] \implies \exists \, \frac{df(x)}{dx} \;\; \forall x \neq x^*$
- Linearity: $f(x_1, x_2) = x_1 h_1(x_2) + h_2(x_2) \implies \frac{\partial^2 f(x_1, x_2)}{\partial x_1^2} = 0 \;\; \forall x_1$
- Separability: $f(x_1, x_2) = g(x_1) + h(x_2) \implies \frac{\partial^2 f(x_1, x_2)}{\partial x_1 \partial x_2} = 0 \;\; \forall x_1, x_2$
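The derivative signatures listed above can be checked numerically with central finite differences; a small sketch follows, where the test functions and step sizes are illustrative assumptions.

```python
import numpy as np

def third_derivative(f, x, h=1e-2):
    """Central finite-difference estimate of d^3 f / dx^3."""
    return (f(x + 2*h) - 2*f(x + h) + 2*f(x - h) - f(x - 2*h)) / (2 * h**3)

def cross_partial(f, x1, x2, h=1e-4):
    """Central finite-difference estimate of d^2 f / (dx1 dx2)."""
    return (f(x1 + h, x2 + h) - f(x1 + h, x2 - h)
            - f(x1 - h, x2 + h) + f(x1 - h, x2 - h)) / (4 * h * h)

quadratic = lambda x: 3*x**2 + 2*x + 1           # quadratic: third derivative is 0
separable = lambda x1, x2: np.sin(x1) + x2**2    # separable: cross-partial is 0
coupled   = lambda x1, x2: np.sin(x1 * x2)       # non-separable: cross-partial nonzero
```

The quadratic returns a (numerically) zero third derivative everywhere, and only the separable function has a vanishing cross-partial, matching the table above.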
2. Model Selection Using Derivatives

Incorporating this information into the problem formulation allows the model selection to be transferred from an optimisation over $\mathcal{K}$...

...to an explicit regularisation problem over derivatives, using an a priori flexible model class definition.
3. Smoothness-Enforcing Regularisation: Problem Formulation

Here we consider $\mathcal{X} = \mathbb{R}$, where the kernel optimisation is reduced to a smoothness selection problem.

What would we like to do? Replace the existing functional-norm regularisation...
$$\mathcal{V}_f : V(f) = \sum_{k=1}^{N} (y_k - f(x_k))^2 + \lambda \|f\|_\mathcal{H}^2$$

...with a smoothness penalty in the cost function...
$$\mathcal{V}_D : V(f) = \sum_{k=1}^{N} (y_k - f(x_k))^2 + \lambda \|Df\|_\mathcal{H}^2$$

How?
- $\|Df\|_\mathcal{H}^2$: known (D. X. Zhou, 2008)
- $f(x)$ for $\mathcal{V}_D$: unknown
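For a Gaussian kernel the penalty $\|Df\|_\mathcal{H}^2$ can be written as $\alpha^\top G \alpha$, where $G$ collects the cross-derivatives of the kernel (following Zhou, 2008). A minimal sketch, with the data, kernel widths and $\lambda$ values as illustrative assumptions:

```python
import numpy as np

def fit_vd(x, y, sigma, lam):
    """Sketch of V_D: minimise ||y - K alpha||^2 + lam * alpha' G alpha,
    where G_ij = d^2 k(x_i, x_j) / (dx_i dx_j) for the Gaussian kernel,
    so that ||Df||_H^2 = alpha' G alpha. Normal equations (K symmetric):
    (K K + lam G) alpha = K y."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2 * sigma**2))
    G = (1.0 / sigma**2 - d**2 / sigma**4) * K      # cross-derivative Gram matrix
    A = K @ K + lam * G + 1e-9 * np.eye(len(x))     # small jitter for conditioning
    return np.linalg.solve(A, K @ y), K

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(15)

a_small, K = fit_vd(x, y, sigma=0.1, lam=1e-6)   # weak smoothness penalty
a_large, _ = fit_vd(x, y, sigma=0.1, lam=10.0)   # strong smoothness penalty
```

Unlike $\|f\|_\mathcal{H}^2$, increasing $\lambda$ here does not shrink $f$ toward zero but toward functions with small derivative, i.e. toward smoothness.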
3. Smoothness-Enforcing Regularisation: An Extended Representer of $f(x)$

A finite representer for $\mathcal{V}_D$ does not exist. But, by adding kernels along $\mathcal{X}$, an approximate formulation can be defined:

FIGURE: $N = 2$ (observations, observation kernels, $\|f\|^2$)
FIGURE: $(N, P) = (2, 8)$ (observations, observation kernels, added kernels, $\|Df\|^2$)

$$\mathcal{F}_D : f(x) = \sum_{i=1}^{N} \alpha_i k_{x_i}(x) + \sum_{j=1}^{P} \alpha_j^* k_{x_j^*}(x)$$
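The extended representer $\mathcal{F}_D$ simply augments the $N$ observation-centred kernels with $P$ additional centres placed along $\mathcal{X}$. A sketch of the resulting design matrix, where the centre placement and kernel width are illustrative assumptions:

```python
import numpy as np

def extended_design(x_obs, x_add, x_eval, sigma):
    """Columns are Gaussian kernel sections at the N observation centres
    followed by the P added centres x*_j, evaluated at x_eval, so that
    f(x) = Phi @ [alpha; alpha*] realises the extended representer F_D."""
    centres = np.concatenate([x_obs, x_add])
    d = x_eval[:, None] - centres[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

x_obs = np.array([0.3, 0.7])           # N = 2 observations
x_add = np.linspace(0.0, 1.0, 8)       # P = 8 added centres, as in (N, P) = (2, 8)
x_eval = np.linspace(0.0, 1.0, 50)
Phi = extended_design(x_obs, x_add, x_eval, sigma=0.1)
```

Uniform placement of the added centres is one simple choice; the kernel-density analysis on the next slide makes the choice of width systematic for a given $P$.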
3. Smoothness-Enforcing Regularisation: Choosing the Kernel Width

Examination of the kernel density allows us to make an a priori choice of kernel width:

FIGURE: $\rho_k = 0.4$ ($\hat{f}$, $k_x$)
FIGURE: $\rho_k = 0.5$ ($\hat{f}$, $k_x$)
FIGURE: $\rho_k = 0.6$ ($\hat{f}$, $k_x$)

Hence, for a given $P$, we can define the maximally flexible model class for a given problem.
4. Application: Estimation of Locally Nonsmooth Functions

In $\mathcal{V}_D$, smoothness $\sim$ regularisation.

Hence, by introducing weights into the loss function, the importance of the regularisation can be varied across $\mathcal{X}$:
$$\mathcal{V}_w : V(f) = \sum_{k=1}^{N} (w_k y_k - w_k f(x_k))^2 + \lambda \|Df\|_\mathcal{H}^2$$

How to determine the weights? Relative to a particular modelling objective, e.g.
• $w_k \sim \|D\hat{f}^{(0)}(x_k)\|_2^2$ for piecewise constant structures, or
• $w_k \sim \|D^2\hat{f}^{(0)}(x_k)\|_2^2$ for piecewise linear structures.
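The weighted criterion $\mathcal{V}_w$ can be sketched by reusing the Gaussian-kernel derivative penalty; larger $w_k$ forces a tighter fit at $x_k$, locally relaxing the smoothing. The step data, weight values and hyperparameters below are illustrative assumptions, with fixed weights standing in for the derivative-based choice above.

```python
import numpy as np

def fit_vw(x, y, w, sigma, lam):
    """Sketch of V_w: minimise ||W (y - K alpha)||^2 + lam * alpha' G alpha
    with W = diag(w) and ||Df||_H^2 = alpha' G alpha (Gaussian kernel).
    Normal equations: (K W^2 K + lam G) alpha = K W^2 y."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2 * sigma**2))
    G = (1.0 / sigma**2 - d**2 / sigma**4) * K
    W2 = np.diag(w**2)
    A = K @ W2 @ K + lam * G + 1e-9 * np.eye(len(x))  # jitter for conditioning
    return np.linalg.solve(A, K @ W2 @ y), K

x = np.linspace(-1.0, 1.0, 20)
y = np.where(x < 0.0, 0.0, 2.0)        # piecewise constant (locally nonsmooth) target

w_uniform = np.ones(20)
w_jump = np.ones(20)
w_jump[9:11] = 5.0                     # upweight the two samples around the jump

a_u, K = fit_vw(x, y, w_uniform, sigma=0.3, lam=1.0)
a_w, _ = fit_vw(x, y, w_jump, sigma=0.3, lam=1.0)
```

With the upweighted samples, the fit follows the discontinuity more closely while the smoothness penalty still acts elsewhere.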
4. Application: Estimation of Locally Nonsmooth Functions

FIGURE: Noise-Free System ($y_o$)
FIGURE: Noisy System ($y$)
4. Application: Estimation of Locally Nonsmooth Functions

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MED}}$, BIAS + SDEV)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MED}}$, BIAS + SDEV)
4. Application: Estimation of Locally Nonsmooth Functions

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MED}}$, BIAS + SDEV)
FIGURE: $\mathcal{V}_w : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MED}}$, BIAS + SDEV)
Conclusions

Objectives:
• To simplify model selection in nonlinear identification.
• By shifting the problem to a regularisation over functional derivatives.
→ Allowing the definition of an a priori flexible model class.

This presentation:
• First step ⇒ consider a simple example.
→ Model selection ⇔ smoothness detection.
→ Kernel selection ⇔ hyperparameter optimisation.

Current/Future Research:
• Application to dynamical, control-oriented problems (e.g. linear parameter-varying identification).
• Investigation of more complex model selection problems (e.g. detection of linearities, separability...).
A. Bibliography

• Sobolev Spaces (Wahba, 1990; Pillonetto et al., 2014)
$$\|f\|_{\mathcal{H}_k}^2 = \sum_{i=0}^{m} \int_\mathcal{X} \left( \frac{d^i f(x)}{dx^i} \right)^2 dx$$

• Identification using derivative observations (Zhou, 2008; Rosasco et al., 2010)
$$V_{obs}(f) = \|y - f(x)\|_2^2 + \gamma_1 \left\| \frac{dy}{dx} - \frac{df(x)}{dx} \right\|_2^2 + \cdots + \gamma_m \left\| \frac{d^m y}{dx^m} - \frac{d^m f(x)}{dx^m} \right\|_2^2 + \lambda \|f\|_\mathcal{H}$$

• Regularisation using derivatives (Rosasco et al., 2010; Lauer, Le and Bloch, 2012; Duijkers et al., 2014)
$$V_D(f) = \|y - f(x)\|_2^2 + \lambda \|D^m f\|_p$$
B. Choosing the Kernel Width: The Smoothness-Tolerance Parameter

$$\rho_k = \frac{\sigma}{\Delta x^*}, \quad \Delta x^* = \frac{x^*_{max} - x^*_{min}}{P}, \quad \hat{\epsilon}_f = 100 \times \left\{ 1 - \frac{\|\hat{f}\|_\infty}{C} \right\} \%$$

FIGURE: Selecting an appropriate kernel using $\epsilon$ (smoothness tolerance $\epsilon\,\%$ vs kernel density $\rho$; $\epsilon(\rho)$, $\hat{\epsilon}$)
C. Effect of the Regularisation

⇒ Negligible regularisation (very small $\lambda_f$, $\lambda_D$).

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
C. Effect of the Regularisation

⇒ Light regularisation (small $\lambda_f$, $\lambda_D$).

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
C. Effect of the Regularisation

⇒ Moderate regularisation.

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
C. Effect of the Regularisation

⇒ Heavy regularisation (large $\lambda_f$, $\lambda_D$).

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
C. Effect of the Regularisation

⇒ Excessive regularisation (very large $\lambda_f$, $\lambda_D$).

FIGURE: $\mathcal{V}_f : R(f)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
FIGURE: $\mathcal{V}_D : R(Df)$ ($y_o$, $\hat{f}_{\mathrm{MEAN}}$, $\hat{f}_{\mathrm{SD}}$)
D. Further Examples: Detecting Piecewise Structures

$\mathcal{S}_o$: Noise-free and observed data

FIGURE: $y(x_1, x_2)$
D. Further Examples : Detecting Piecewise Structures
Results M1 : (Vf , Ff )
FIGURE: MEDIAN FIGURE: BIAS FIGURE: SDEV
D. Further Examples : Detecting Piecewise Structures
Results M2 : (VD, FD)
FIGURE: MEDIAN FIGURE: BIAS FIGURE: SDEV
D. Further Examples : Detecting Piecewise Structures
Results M3 : (Vw, FD)
FIGURE: MEDIAN FIGURE: BIAS FIGURE: SDEV
E. Further Examples: Enforcing Separability

$$f(x_1, x_2) \xrightarrow{\;\lambda\;} f_1(x_1) + f_2(x_2)$$

FIGURE: $\mathcal{V}_{DX} : R(\partial_{x_1} \partial_{x_2} f)$