
15th February 2011

Data-driven Kriging models based on FANOVA decomposition

O. Roustant, Ecole des Mines de St-Etienne,

www.emse.fr/~roustant

joint work with

T. Muehlenstädt (1), L. Carraro (2) and S. Kuhnt (1)

(1) University of Dortmund, (2) Telecom St-Etienne


Cliques of FANOVA graph and block additive decomposition

Cliques: {1,2,3}, {4,5,6}, {3,4}

$f(x) = \cos(x_1+x_2+x_3) + \sin(x_4+x_5+x_6) + \tan(x_3+x_4)$

$f(x) = f_{1,2,3}(x_1,x_2,x_3) + f_{4,5,6}(x_4,x_5,x_6) + f_{3,4}(x_3,x_4)$

$Z(x) = Z_{1,2,3}(x_1,x_2,x_3) + Z_{4,5,6}(x_4,x_5,x_6) + Z_{3,4}(x_3,x_4)$

$k(h) = k_{1,2,3}(h_1,h_2,h_3) + k_{4,5,6}(h_4,h_5,h_6) + k_{3,4}(h_3,h_4)$

This talk presents, with many pictures, the main ideas of the corresponding paper; we refer to the paper for details.

Introduction

Computer experiments

• A keyword associated with the analysis of time-consuming computer codes

[Diagram: the inputs x1, x2, …, xd enter the computer code f, which returns the output y]

Metamodeling and Kriging

• Metamodeling: construct a cheap-to-evaluate model of the simulator (which itself models reality)

• Kriging: basically an interpolation method based on Gaussian processes

Kriging model (definition)

$Y(x) = b_0 + b_1 g_1(x) + \dots + b_k g_k(x) + Z(x)$

linear trend (deterministic) + centered stationary Gaussian process (stochastic)

[Figure: some conditional simulations]

Kriging model (prediction)

[Figure: some conditional simulations; conditional mean and 95% confidence intervals]
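The conditional mean and the 95% confidence interval can be illustrated with a minimal sketch. It assumes a known constant mean (simple Kriging, rather than the linear trend of the definition slide), a 1-D Gaussian kernel with hypothetical parameters `theta` and `sigma2`, and a toy function `sin(3x)`; none of these choices come from the talk.

```python
import math

def gauss_kernel(x, y, theta=0.3, sigma2=1.0):
    """Stationary Gaussian kernel k(h) = sigma2 * exp(-h^2 / (2 theta^2)), h = x - y."""
    h = x - y
    return sigma2 * math.exp(-0.5 * (h / theta) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def krige(x_new, xs, ys, mean=0.0):
    """Conditional mean and variance of Y(x_new) given Y(xs) = ys (known mean)."""
    K = [[gauss_kernel(a, b) for b in xs] for a in xs]
    kx = [gauss_kernel(x_new, a) for a in xs]
    alpha = solve(K, [y - mean for y in ys])  # K^{-1} (y - mean)
    w = solve(K, kx)                          # K^{-1} k(x_new)
    m = mean + sum(k * a for k, a in zip(kx, alpha))
    v = gauss_kernel(x_new, x_new) - sum(k * wi for k, wi in zip(kx, w))
    return m, max(v, 0.0)                     # clip tiny negative round-off

xs = [0.0, 0.4, 1.0]                          # toy design points (assumption)
ys = [math.sin(3 * x) for x in xs]            # toy observations (assumption)
m, v = krige(0.4, xs, ys)                     # at a design point
m2, v2 = krige(0.7, xs, ys)                   # between design points
lo, hi = m2 - 1.96 * math.sqrt(v2), m2 + 1.96 * math.sqrt(v2)
```

At a design point the conditional mean reproduces the observation and the variance collapses to zero, which is the interpolation property mentioned earlier; between design points the interval `[lo, hi]` has positive width.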

Kriging model (kernel)

• A Kriging model is a kernel-based method

$K(x,x') = \mathrm{cov}(Z(x), Z(x'))$

The kernel is FLEXIBLE (see the following slides)

• When Z is stationary, $K(x,x')$ depends only on $h = x - x'$

we denote $k(h) = K(x,x')$

Kriging model (kernel)

• “Making new kernels from old” (Rasmussen and Williams, 2006)

$K_1 + K_2$

$cK$, with $c > 0$

$K_1 K_2$

…

Kriging model (common choice)

• Tensor-product structure: $k(h) = k_1(h_1) k_2(h_2) \dots k_d(h_d)$

with $h_i = x_i - x_i'$, and $k_i$ Gaussian, Matérn 5/2, …
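The kernel operations and the tensor-product structure can be sketched in a few lines. This is an illustration only: the helper names (`gauss`, `matern52`, `k_sum`, `k_scale`, `k_prod`, `tensor_product`) and the range values are hypothetical, and the kernels are written directly as functions of h.

```python
import math

def gauss(theta):
    """1-D Gaussian kernel as a function of h = x - x'."""
    return lambda h: math.exp(-0.5 * (h / theta) ** 2)

def matern52(theta):
    """1-D Matern 5/2 kernel as a function of h = x - x'."""
    def k(h):
        r = math.sqrt(5.0) * abs(h) / theta
        return (1.0 + r + r * r / 3.0) * math.exp(-r)
    return k

# "Making new kernels from old": sum, positive scaling, product.
def k_sum(k1, k2):
    return lambda h: k1(h) + k2(h)

def k_scale(c, k):
    assert c > 0
    return lambda h: c * k(h)

def k_prod(k1, k2):
    return lambda h: k1(h) * k2(h)

def tensor_product(kernels):
    """k(h) = k_1(h_1) * ... * k_d(h_d) for a vector h of length d."""
    def k(h):
        out = 1.0
        for ki, hi in zip(kernels, h):
            out *= ki(hi)
        return out
    return k

# Hypothetical 2-D example: Gaussian in the first input, Matern 5/2 in the second.
k = tensor_product([gauss(0.5), matern52(1.0)])
value = k([0.3, -0.2])
```

Each 1-D factor carries its own range parameter, which is why a d-dimensional tensor-product kernel needs d ranges plus one variance.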

The main idea on an example

• Ishigami function, defined on $D = [-\pi,\pi]^3$, with $A = 7$, $B = 0.1$:

$f(x) = \sin(x_1) + A\sin^2(x_2) + B x_3^4 \sin(x_1)$

• This is a block additive decomposition

$f(x) = f_2(x_2) + f_{1,3}(x_1,x_3)$

$Z(x) = Z_2(x_2) + Z_{1,3}(x_1,x_3)$

$k(h) = k_2(h_2) + k_{1,3}(h_1,h_3)$
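As a quick check of the block additive form, the sketch below codes the Ishigami function and its two blocks and compares them on random points. The function names `ishigami`, `f2` and `f13`, the sample size and the seed are choices made for this illustration.

```python
import math
import random

A, B = 7.0, 0.1

def ishigami(x1, x2, x3):
    return math.sin(x1) + A * math.sin(x2) ** 2 + B * x3 ** 4 * math.sin(x1)

def f2(x2):
    """Block depending on x2 only."""
    return A * math.sin(x2) ** 2

def f13(x1, x3):
    """Block depending on (x1, x3) only."""
    return (1.0 + B * x3 ** 4) * math.sin(x1)

random.seed(0)
pts = [[random.uniform(-math.pi, math.pi) for _ in range(3)] for _ in range(1000)]
# Largest discrepancy between f and f2 + f13 over the sample (round-off only).
max_gap = max(abs(ishigami(*p) - (f2(p[1]) + f13(p[0], p[2]))) for p in pts)
```

The gap is at round-off level, so a kernel of the form k2 + k1,3 matches the structure of the function exactly, whereas a full 3-D tensor-product kernel ignores it.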

The main idea on an example

• Comparison of the two Kriging models
– Training set: 100 points from a maximin Latin hypercube
– Test set: 1000 additional points from a uniform distribution
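A training and a test set of this kind can be generated as follows. This is a crude sketch: the maximin criterion is approximated by keeping the best of a few random Latin hypercubes, which is an assumption of this example and not necessarily the optimizer used for the talk; the function names, the number of candidates and the seeds are hypothetical.

```python
import math
import random

def latin_hypercube(n, d, rng):
    """n points in [0,1]^d with exactly one point per 1/n-stratum in each coordinate."""
    cols = []
    for _ in range(d):
        perm = list(range(n))
        rng.shuffle(perm)
        cols.append([(p + rng.random()) / n for p in perm])
    return [[cols[j][i] for j in range(d)] for i in range(n)]

def min_dist(design):
    """Smallest pairwise Euclidean distance of a design."""
    return min(math.dist(p, q) for i, p in enumerate(design) for q in design[i + 1:])

def maximin_lhd(n, d, n_candidates=50, seed=0):
    """Best of n_candidates random Latin hypercubes for the maximin criterion."""
    rng = random.Random(seed)
    return max((latin_hypercube(n, d, rng) for _ in range(n_candidates)), key=min_dist)

def scale(design, lo, hi):
    """Map a design from [0,1]^d to [lo,hi]^d."""
    return [[lo + (hi - lo) * u for u in p] for p in design]

unit_design = maximin_lhd(100, 3)
train = scale(unit_design, -math.pi, math.pi)   # 100 training points on [-pi, pi]^3
rng = random.Random(1)
test = [[rng.uniform(-math.pi, math.pi) for _ in range(3)] for _ in range(1000)]
```

A dedicated optimizer (for instance simulated annealing on the maximin criterion) would usually give a better space-filling design than this random search.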

The schema to be generalized

[Diagram: from the function f to the kernel $k = k_2 + k_{1,3}$]

Outline

Introduction

[How to choose a Kriging model for the Ishigami function]

1. From FANOVA graphs to block additive kernels

[Generalizes the introductory example]

2. Estimation methodologies

[With a new sensitivity index]

3. Applications

4. Some comments

From FANOVA graphs to block additive kernels

FANOVA decomposition (Efron and Stein, 1981)

• Assume that $X_1, \dots, X_d$ are independent random variables. Let $f$ be a function defined on $D_1 \times \dots \times D_d$ and $d\nu = d\nu_1 \dots d\nu_d$ an integration measure. Then:

$f(X) = \mu_0 + \sum_i \mu_i(X_i) + \sum_{i<j} \mu_{i,j}(X_i,X_j) + \sum_{i<j<k} \mu_{i,j,k}(X_i,X_j,X_k) + \dots$

where all terms are centered and orthogonal to the others. They are given by:

$\mu_0 := E(f(X))$, $\mu_i(X_i) := E(f(X)|X_i) - \mu_0$

$\mu_{i,j}(X_i,X_j) := E(f(X)|X_i,X_j) - \mu_i(X_i) - \mu_j(X_j) - \mu_0$

and so on…

FANOVA decomposition

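Example. Ishigami function, with the uniform measure on $D = [-\pi,\pi]^3$. With $a = 1/2$ and $b = \pi^4/5$ (the means of $\sin^2(x_2)$ and $x_3^4$ under the uniform probability measure), we have:

$f(x) = \sin(x_1) + A\sin^2(x_2) + B x_3^4 \sin(x_1)$

$= aA + \sin(x_1)(1+bB) + A(\sin^2(x_2)-a) + B[x_3^4-b]\sin(x_1)$

where the four terms are, in order:

$\mu_0 = aA$

$\mu_1(x_1) = \sin(x_1)(1+bB)$ (main effect)

$\mu_2(x_2) = A(\sin^2(x_2)-a)$ (main effect)

$\mu_{1,3}(x_1,x_3) = B[x_3^4-b]\sin(x_1)$ (2nd order interaction)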
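The decomposition of the Ishigami function can be checked numerically. The sketch below takes a and b as the means of sin²(x2) and x3⁴ under the uniform probability measure on [-π, π], that is a = 1/2 and b = π⁴/5 (the normalized counterparts of the integrals π and 2π⁵/5), and approximates expectations with a midpoint rule; the helper `mean_uniform`, the evaluation point and the grid size are assumptions of this example.

```python
import math

A, B = 7.0, 0.1
a = 0.5                    # E[sin^2(X)] for X ~ U[-pi, pi]
b = math.pi ** 4 / 5.0     # E[X^4] for X ~ U[-pi, pi]

def mean_uniform(g, n=20000):
    """Midpoint-rule approximation of E[g(X)] for X ~ U[-pi, pi]."""
    step = 2.0 * math.pi / n
    return sum(g(-math.pi + (i + 0.5) * step) for i in range(n)) / n

def f(x1, x2, x3):
    return math.sin(x1) + A * math.sin(x2) ** 2 + B * x3 ** 4 * math.sin(x1)

mu0 = a * A
def mu1(x1): return math.sin(x1) * (1.0 + b * B)
def mu2(x2): return A * (math.sin(x2) ** 2 - a)
def mu13(x1, x3): return B * (x3 ** 4 - b) * math.sin(x1)

# The four terms add up to f (checked at one arbitrary point).
x = (0.3, -1.2, 2.5)
recon = mu0 + mu1(x[0]) + mu2(x[1]) + mu13(x[0], x[2])

# Each non-constant term is centered; the interaction term is centered
# with respect to each of its two variables separately.
m1 = mean_uniform(mu1)
m2 = mean_uniform(mu2)
m13_over_x3 = mean_uniform(lambda x3: mu13(0.3, x3))
m13_over_x1 = mean_uniform(lambda x1: mu13(x1, 2.5))
```

The algebraic identity holds for any a and b, but the terms are centered, and therefore are the FANOVA terms, only for the two mean values above.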


Example (continued)

• Some terms can vanish due to averaging, such as $\mu_3$, or $\mu_1$ if $B = -1/b$; but this depends on the integration measure, and only happens when higher-order terms exist

• On the other hand, and for the same reason, we always have:

$\mu_{1,2} = \mu_{2,3} = 0$ and, under mild conditions: $\mu_{1,3} \neq 0$


• The name “FANOVA” comes from the relation on variances implied by orthogonality:

$\mathrm{var}(f(X)) = \sum_i \mathrm{var}(\mu_i(X_i)) + \sum_{i<j} \mathrm{var}(\mu_{i,j}(X_i,X_j)) + \dots$

which measures the importance of each term.

• $\mathrm{var}(\mu_J(X_J))/\mathrm{var}(f(X))$ is often called a Sobol index
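For the Ishigami function with the uniform probability measure on [-π, π]³, the variance decomposition has a closed form, so the Sobol indices can be computed directly. The sketch below uses the standard moments of sin and of x⁴ under that measure; the variable names are chosen for this illustration.

```python
import math

A, B = 7.0, 0.1
p4 = math.pi ** 4
p8 = math.pi ** 8

# Variances of the non-zero FANOVA terms (X_i ~ U[-pi, pi], independent):
v1 = 0.5 * (1.0 + B * p4 / 5.0) ** 2   # var(mu_1),     using var(sin X) = 1/2
v2 = A ** 2 / 8.0                      # var(mu_2),     using var(sin^2 X) = 1/8
v13 = 8.0 * B ** 2 * p8 / 225.0        # var(mu_{1,3}), using var(X^4) = 16 pi^8 / 225

total = v1 + v2 + v13                  # var(f(X)), by orthogonality
sobol = {"1": v1 / total, "2": v2 / total, "1,3": v13 / total}

# Independent closed form of the total variance, as a cross-check.
total_direct = 0.5 + A ** 2 / 8.0 + B * p4 / 5.0 + B ** 2 * p8 / 18.0
```

The three indices sum to one; all other terms (for example μ3, μ1,2 and μ2,3) have zero variance for this function.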

FANOVA graph

Vertices: variables

Edges: if there is at least one interaction (at any order)

Width: proportional to variances

Here (example above): $\mu_{1,2} = \mu_{2,3} = 0$

The graph does not depend on the integration measure (under mild conditions)

FANOVA graph and cliques

• A complete subgraph: all edges exist

• A clique: maximal complete subgraph

Cliques: {1,3} and {2}

Cliques of FANOVA graph and block additive decomposition

Cliques: {1,2,3}, {4,5,6}, {3,4}

$f(x) = f_{1,2,3}(x_1,x_2,x_3) + f_{4,5,6}(x_4,x_5,x_6) + f_{3,4}(x_3,x_4)$
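Given the edges of a FANOVA graph, the cliques can be enumerated automatically. The sketch below uses the basic Bron–Kerbosch recursion (without pivoting), which is one standard choice and is not taken from the talk; the edge list is the one implied by the cliques {1,2,3}, {4,5,6}, {3,4}.

```python
def maximal_cliques(vertices, edges):
    """All maximal complete subgraphs, by the basic Bron-Kerbosch recursion."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(vertices), set())
    return sorted(cliques)

# 6-D example: the edges of the two triangles plus the bridge 3-4.
edges_6d = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
cliques_6d = maximal_cliques(range(1, 7), edges_6d)

# Ishigami example: a single edge 1-3, vertex 2 is isolated.
cliques_ishigami = maximal_cliques([1, 2, 3], [(1, 3)])
```

An isolated vertex is returned as a clique of size one, which corresponds to a purely additive term such as f2(x2) in the Ishigami example.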

Why cliques?

{1,2,3,4,5,6}

$f(x) = f_{1,\dots,6}(x_1,\dots,x_6)$ !!!

Incomplete subgraphs lead to rough model forms

Why cliques?

{1,2}, {2,3}, {1,3}, {3,4}, {4,5,6}

Non-maximality leads to wrong model forms

$f(x) = f_{1,2}(x_1,x_2) + f_{2,3}(x_2,x_3) + f_{1,3}(x_1,x_3) + f_{4,5,6}(x_4,x_5,x_6) + f_{3,4}(x_3,x_4)$


Cliques of FANOVA graph and Kriging models

Cliques: {1,2,3}, {4,5,6}, {3,4}

$f(x) = f_{1,2,3}(x_1,x_2,x_3) + f_{4,5,6}(x_4,x_5,x_6) + f_{3,4}(x_3,x_4)$

$Z(x) = Z_{1,2,3}(x_1,x_2,x_3) + Z_{4,5,6}(x_4,x_5,x_6) + Z_{3,4}(x_3,x_4)$

$k(h) = k_{1,2,3}(h_1,h_2,h_3) + k_{4,5,6}(h_4,h_5,h_6) + k_{3,4}(h_3,h_4)$
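A block additive kernel can be assembled directly from the list of cliques. The sketch below assumes a tensor-product Gaussian kernel inside each clique, with one range per variable and one variance per clique; the function name `block_additive_kernel` and all parameter values are hypothetical.

```python
import math

def block_additive_kernel(cliques, thetas, sigma2s):
    """k(h) = sum over cliques C of sigma2_C * prod_{i in C} exp(-h_i^2 / (2 theta_{C,i}^2)).

    cliques: tuples of 1-based variable indices
    thetas:  one tuple of ranges per clique (same length as the clique)
    sigma2s: one variance per clique
    """
    def k(h):
        total = 0.0
        for clique, theta, s2 in zip(cliques, thetas, sigma2s):
            term = s2
            for i, th in zip(clique, theta):
                term *= math.exp(-0.5 * (h[i - 1] / th) ** 2)
            total += term
        return total
    return k

cliques = [(1, 2, 3), (4, 5, 6), (3, 4)]
thetas = [(0.5, 0.5, 0.5), (0.5, 0.5, 0.5), (1.0, 1.0)]   # hypothetical ranges
sigma2s = [1.0, 1.0, 0.5]                                  # hypothetical variances
k = block_additive_kernel(cliques, thetas, sigma2s)

k_at_zero = k([0.0] * 6)                        # sum of the clique variances
k_far_123 = k([10.0, 10.0, 10.0, 0.0, 0.0, 0.0])  # only the {4,5,6} block survives
```

Moving far away in x1, x2, x3 only switches off the blocks that contain those variables, so the process stays correlated through the remaining block; a single 6-D tensor-product kernel would drop to zero instead.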

Estimation methodologies

Graph estimation

• Challenge: estimate all interactions (at any order) involving two variables

• Two issues:

1. The computer code is time-consuming

2. Huge number of combinations for the usual Sobol indices

Graph estimation

• Solutions

1. Replace the computer code by a metamodel, for instance a Kriging model with a standard kernel

2. Fix $x_3, \dots, x_d$, and consider the 2nd order interaction of the 2-dimensional function:

$(x_1,x_2) \mapsto f(x) = f_1(x_{-2}) + f_2(x_{-1}) + f_{12}(x_1,x_2; x_{-\{1,2\}})$

Denote by $D_{12}(x_3,\dots,x_d)$ the unnormalized Sobol index of this interaction, and define: $D_{12} = E(D_{12}(X_3,\dots,X_d))$

Then $D_{12} > 0$ iff (1,2) is an edge of the FANOVA graph
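One possible way to estimate such an averaged interaction index by Monte Carlo is sketched below, on the Ishigami function. It relies on the fact that, for independent inputs, the mean square of the mixed second difference in (xi, xj) equals four times the variance of the 2nd order interaction of the 2-dimensional function, averaged over the other inputs. This estimator, the function name `total_interaction`, the sample size and the seed are assumptions of this example and not necessarily the procedure used by the authors; indices are 0-based in the code.

```python
import math
import random

A, B = 7.0, 0.1

def ishigami(x):
    return math.sin(x[0]) + A * math.sin(x[1]) ** 2 + B * x[2] ** 4 * math.sin(x[0])

def total_interaction(f, i, j, d, lo, hi, n=20000, seed=0):
    """Monte Carlo estimate of D_ij for independent U[lo, hi] inputs.

    Uses D_ij = E[delta^2] / 4, where delta is the mixed second difference
    of f obtained by resampling x_i and x_j independently.
    """
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        x = [rng.uniform(lo, hi) for _ in range(d)]
        xi_new, xj_new = rng.uniform(lo, hi), rng.uniform(lo, hi)
        x_i = x[:]
        x_i[i] = xi_new                 # x_i resampled
        x_j = x[:]
        x_j[j] = xj_new                 # x_j resampled
        x_ij = x_i[:]
        x_ij[j] = xj_new                # both resampled
        delta = f(x) - f(x_i) - f(x_j) + f(x_ij)
        acc += delta * delta
    return acc / (4.0 * n)

d12 = total_interaction(ishigami, 0, 1, 3, -math.pi, math.pi)  # no edge (1,2)
d13 = total_interaction(ishigami, 0, 2, 3, -math.pi, math.pi)  # edge (1,3)
d23 = total_interaction(ishigami, 1, 2, 3, -math.pi, math.pi)  # no edge (2,3)
```

Here `d12` and `d23` vanish up to round-off while `d13` is clearly positive, so thresholding these values recovers the graph with the single edge (1,3). In the setting of the talk, `f` would be the cheap metamodel rather than the expensive code.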

Graph estimation

• Comments:
– The new sensitivity index is computed by averaging 2nd order Sobol indices, and is thus numerically tractable
– In practice “$D_{12} > 0$” is replaced by “$D_{12} > \delta$”: different thresholds give different FANOVA graphs

Kriging model estimation

• Assume that there are $L$ cliques of sizes $d_1, \dots, d_L$. The total number of parameters to be estimated is:

$n_{trend} + (d_1 + 1) + \dots + (d_L + 1)$

(trend, “ranges” and variance parameters)

• Maximum likelihood estimation (MLE) is used; 3 numerical procedures were tested

Kriging model estimation

• Isotropic kernels are useful for high dimensions

• Example: suppose that $C_1 = \{1,2,3\}$, $C_2 = \{4,5,6\}$, $C_3 = \{3,4\}$, and $x_7, \dots, x_{16}$ have a smaller influence

1st solution: $C_4 = \{x_7\}, \dots, C_{13} = \{x_{16}\}$

$N = n_{trend} + 4 + 4 + 3 + 10 \times 2 = n_{trend} + 31$

2nd solution: $C_4 = \{x_7, \dots, x_{16}\}$, with an isotropic kernel

$N = n_{trend} + 4 + 4 + 3 + 2 = n_{trend} + 13$
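The two parameter counts can be reproduced with a small helper. The function name `n_cov_params` and its `isotropic` argument are hypothetical; the count excludes the trend parameters, and an isotropic clique is counted as one range plus one variance.

```python
def n_cov_params(cliques, isotropic=()):
    """Number of covariance parameters of a block additive kernel.

    Each clique contributes one range per variable plus one variance;
    cliques whose position is listed in `isotropic` use a single shared range.
    """
    total = 0
    for idx, clique in enumerate(cliques):
        n_ranges = 1 if idx in isotropic else len(clique)
        total += n_ranges + 1
    return total

main = [(1, 2, 3), (4, 5, 6), (3, 4)]

# 1st solution: one singleton clique per low-influence variable x7, ..., x16.
sol1 = main + [(i,) for i in range(7, 17)]
n1 = n_cov_params(sol1)

# 2nd solution: a single isotropic clique {x7, ..., x16} (position 3 in the list).
sol2 = main + [tuple(range(7, 17))]
n2 = n_cov_params(sol2, isotropic={3})
```

The values `n1` and `n2` are the numbers to be added to the number of trend parameters in the two formulas above.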

Applications

A 6D analytical case

• $f(x) = \cos(-0.8 - 1.1x_1 + 1.1x_2 + x_3) + \sin(-0.5 + 0.9x_4 + x_5 - 1.1x_6) + (0.5 + 0.35x_3 - 0.6x_4)^2$

• Domain: $[-1,1]^6$

• Integration measure: uniform

• Training set: 100 points from a maximin LHD

• Test set: 1000 points drawn from a uniform distribution
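The test function is easy to code, and a mixed second difference gives a cheap sanity check of which pairs of variables interact. The helper `second_difference`, the base point `x0` and the step are assumptions of this sketch; indices are 0-based in the code.

```python
import math

def f6(x):
    """6-D analytical test function on [-1, 1]^6."""
    x1, x2, x3, x4, x5, x6 = x
    return (math.cos(-0.8 - 1.1 * x1 + 1.1 * x2 + x3)
            + math.sin(-0.5 + 0.9 * x4 + x5 - 1.1 * x6)
            + (0.5 + 0.35 * x3 - 0.6 * x4) ** 2)

def second_difference(f, x, i, j, step=0.5):
    """Mixed second difference in (x_i, x_j); zero when the two never interact."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return shifted(step, step) - shifted(step, 0.0) - shifted(0.0, step) + shifted(0.0, 0.0)

x0 = [0.1, -0.2, 0.3, 0.0, 0.2, -0.4]      # arbitrary base point (assumption)
d_14 = second_difference(f6, x0, 0, 3)     # x1 and x4: different blocks, no edge
d_34 = second_difference(f6, x0, 2, 3)     # x3 and x4: linked by the squared term
d_12 = second_difference(f6, x0, 0, 1)     # x1 and x2: inside the cosine block
```

A non-zero value at one point proves that an edge exists, while a zero value at a single point is only an indication; this is why averaged indices are used to estimate the graph.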


[Figures: estimated graph; usual sensitivity analysis (from R package sensitivity)]


• Consider the same function, but assume that it is embedded in a 16D space (with 10 additional inactive variables)

• Including all the inactive variables in one clique improves the prediction

A 6D case study

• Piston slap data set (Fang et al., 2006)
– Unwanted engine noise, simulated using a finite element method

• Training set: 100 points

• Test set: 12 points

• Leave-one-out is also considered

A 6D case study

Leave-one-out RMSE: 0.0864 (standard Kriging), 0.0371 (modified Kriging)

Some comments


• Main strengths
– Adapting the kernel to the data in a flexible manner
– A substantial improvement may be expected in prediction, depending on the function complexity

• Some drawbacks
– Dependence on the initial metamodel
– Sometimes a large number of parameters to be estimated, which may decrease the prediction power

THANKS A LOT FOR ATTENDING!
