
A comparison of learning methods to predict N2O fluxes and N leaching

Nathalie Villa-Vialaneix

[email protected]
http://www.nathalievilla.org

Workshop on N2O meta-modelling, March 9th, 2015

INRA, Toulouse


Outline

1 DNDC-Europe model description

2 Methodology: underfitting/overfitting, consistency, problem at stake

3 Presentation of the different methods

4 Results


General overview

Modern issues in agriculture: fight against the food crisis while preserving the environment.

The EC needs simulation tools to:

link direct aids to the respect of standards ensuring proper management;

quantify the environmental impact of European policies (“Cross Compliance”).


Cross Compliance Assessment Tool

DNDC is a biogeochemical model.


Zoom on DNDC-EUROPE


Moving from DNDC-Europe to metamodeling

Needs for metamodeling:

easier integration into CCAT;

faster execution and responsiveness for scenario analysis.


Data [Villa-Vialaneix et al., 2012]

Data extracted from the biogeochemical simulator DNDC-EUROPE: ∼19,000 HSMU (Homogeneous Soil Mapping Units, roughly 1 km² each, though the area varies considerably) used for corn cultivation:

corn corresponds to about 4.6% of the UAA (Utilized Agricultural Area);

HSMU for which at least 10% of the agricultural land was used for corn were selected.

11 input (explanatory) variables (selected by experts and previous simulations):

N_FR (N input through fertilization; kg/ha/y);
N_MR (N input through manure spreading; kg/ha/y);
Nfix (N input from biological fixation; kg/ha/y);
Nres (N input from root residue; kg/ha/y);
BD (bulk density; g/cm³);
SOC (soil organic carbon in topsoil; mass fraction);
PH (soil pH);
Clay (ratio of soil clay content);
Rain (annual precipitation; mm/y);
Tmean (annual mean temperature; °C);
Nr (concentration of N in rain; ppm).

2 outputs to be estimated (independently) from the inputs:

N2O fluxes (a greenhouse gas);

N leaching (one major cause of water pollution).


Regression

Consider the problem where:

$Y \in \mathbb{R}$ has to be estimated from $X \in \mathbb{R}^d$;

we are given a learning set, i.e., $n$ i.i.d. observations of $(X, Y)$: $(x_1, y_1), \dots, (x_n, y_n)$.

Example: predict N2O fluxes from pH, climate, concentration of N in rain, fertilization... for a large number of HSMU.

Basics

From $(x_i, y_i)_i$, define a machine $\Phi_n$ such that:

$$\hat{y}_{\text{new}} = \Phi_n(x_{\text{new}}).$$

if $Y$ is numeric, $\Phi_n$ is called a regression function;

if $Y$ is a factor, $\Phi_n$ is called a classifier.

$\Phi_n$ is said to be trained or learned from the observations $(x_i, y_i)_i$.

Desirable properties:

accuracy on the observations: predictions made on known data are close to the observed values;

generalization ability: predictions made on new data are also accurate.

Conflicting objectives!! [Vapnik, 1995]


Underfitting/Overfitting

[Figure sequence: the function x → y to be estimated; observations we might have; observations we do have; a first estimation that underfits; a second, accurate estimation; a third estimation that overfits; summary.]

Errors

training error (measures the accuracy on the observations):

if $y$ is a factor: the misclassification rate

$$\frac{\#\{i : \hat{y}_i \neq y_i,\ i = 1, \dots, n\}}{n}$$

if $y$ is numeric: the mean squared error (MSE)

$$\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$$

or the root mean squared error (RMSE), or the pseudo-R²: $1 - \mathrm{MSE}/\mathrm{Var}((y_i)_i)$.

test error: a way to prevent overfitting (and estimate the generalization error) is simple validation:

1 split the data into training/test sets (usually 80%/20%);
2 train $\Phi_n$ from the training dataset;
3 calculate the test error from the remaining data (see the sketch below).

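A minimal sketch of these error measures in Python, on hypothetical toy values (NumPy assumed; not the original experimental code):

```python
import numpy as np

# Hypothetical observed values and predictions for a numeric output.
y = np.array([2.0, 3.5, 1.0, 4.2, 5.1])
y_hat = np.array([2.2, 3.0, 1.3, 4.0, 5.5])

mse = np.mean((y_hat - y) ** 2)      # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error
pseudo_r2 = 1 - mse / np.var(y)      # pseudo-R2
print(mse, rmse, pseudo_r2)

# For a factor output, the misclassification rate:
c = np.array(["a", "b", "a", "b"])
c_hat = np.array(["a", "a", "a", "b"])
print(np.mean(c_hat != c))           # fraction of wrong predictions
```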

Example

[Figure sequence: observations; training/test datasets; training/test errors; summary.]

Consistency in the parametric/nonparametric case

Example in the parametric framework (linear methods): an assumption is made on the form of the relation between $X$ and $Y$:

$$Y = \beta^T X + \varepsilon.$$

$\beta$ is estimated from the observations $(x_1, y_1), \dots, (x_n, y_n)$ by a given method which calculates an estimate $\beta_n$.

The estimation is said to be consistent if $\beta_n \xrightarrow{n \to +\infty} \beta$ under (possibly) technical assumptions on $X$, $\varepsilon$, $Y$.

Example in the nonparametric framework: the form of the relation between $X$ and $Y$ is unknown:

$$Y = \Phi(X) + \varepsilon.$$

$\Phi$ is estimated from the observations $(x_1, y_1), \dots, (x_n, y_n)$ by a given method which calculates an estimate $\Phi_n$.

The estimation is said to be consistent if $\Phi_n \xrightarrow{n \to +\infty} \Phi$ under (possibly) technical assumptions on $X$, $\varepsilon$, $Y$.

Consistency from the statistical learning perspective [Vapnik, 1995]

Question: are we really interested in estimating $\Phi$ or... rather in having the smallest prediction error?

Statistical learning perspective: a method that builds a machine $\Phi_n$ from the observations is said to be (universally) consistent if, given a risk function $R : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ (which calculates an error),

$$E\left(R(\Phi_n(X), Y)\right) \xrightarrow{n \to +\infty} \inf_{\Phi : \mathcal{X} \to \mathbb{R}} E\left(R(\Phi(X), Y)\right),$$

for any distribution of $(X, Y) \in \mathcal{X} \times \mathbb{R}$.

Definitions: $L^* = \inf_{\Phi : \mathcal{X} \to \mathbb{R}} E(R(\Phi(X), Y))$ and $L_\Phi = E(R(\Phi(X), Y))$.


Purpose of the work

We focus on methods that are universally consistent. These methods lead to the definition of machines $\Phi_n$ such that:

$$E\,R(\Phi_n(X), Y) \xrightarrow{n \to +\infty} L^* = \inf_{\Phi : \mathbb{R}^d \to \mathbb{R}} L_\Phi$$

for any random pair $(X, Y)$.

1 multi-layer perceptrons (neural networks): [Bishop, 1995]

2 Support Vector Machines (SVM): [Boser et al., 1992]

3 random forests: [Breiman, 2001] (universal consistency is not proven in this case)


Methodology

Purpose: comparison of several metamodeling approaches (accuracy, computational time...).

For every data set, every output and every method (a sketch of the protocol follows this list):

1 the data set was split into a training set and a test set (on an 80%/20% basis);

2 the regression function was learned from the training set (with a full validation process for the hyperparameter tuning);

3 the performances were calculated on the test set: predictions were made from the inputs and compared to the true outputs.

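A minimal sketch of this protocol in Python (synthetic data standing in for the HSMU inputs; scikit-learn assumed, with the validation-based tuning of step 2 omitted for brevity):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the HSMU inputs (11 variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 11))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=2000)

# 1. 80%/20% training/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. learn the regression function on the training set.
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# 3. evaluate on the held-out test set only.
mse = np.mean((model.predict(X_te) - y_te) ** 2)
print("test pseudo-R2:", round(1 - mse / np.var(y_te), 3))
```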

Methods

2 linear models:

one with the 11 explanatory variables;

one with the 11 explanatory variables plus several nonlinear transformations of these variables (square, log...): stepwise AIC was used to train the model;

MLP

SVM

RF

3 approaches based on splines: ACOSSO (ANOVA splines), SDR (an improvement of the previous one) and DACE (a kriging-based approach).


Multilayer perceptrons (MLP)

A “one-hidden-layer perceptron” takes the form:

$$\Phi_w : x \in \mathbb{R}^d \mapsto \sum_{i=1}^{Q} w_i^{(2)} \, G\!\left(x^T w_i^{(1)} + w_i^{(0)}\right) + w_0^{(2)}$$

where:

the $w$ are the weights of the MLP that have to be learned from the learning set;

$G$ is a given activation function: typically, $G(z) = \frac{1 - e^{-z}}{1 + e^{-z}}$;

$Q$ is the number of neurons on the hidden layer. It controls the flexibility of the MLP. $Q$ is a hyper-parameter that is usually tuned during the learning process.

Symbolic representation of MLP

[Figure: the inputs $x_1, \dots, x_d$ feed $Q$ hidden neurons through the weights $w^{(1)}$; the neuron outputs are combined with the weights $w^{(2)}$, plus a bias term, to produce the output $\phi_w(x)$.]
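A small sketch of this forward pass in Python (NumPy; the random weights are for illustration only, not trained values):

```python
import numpy as np

def mlp_predict(x, W1, w0, w2, w2_0):
    """One-hidden-layer perceptron Phi_w(x) with Q hidden neurons.

    W1: (d, Q) input-to-hidden weights; w0: (Q,) hidden biases;
    w2: (Q,) hidden-to-output weights; w2_0: scalar output bias.
    """
    G = lambda z: (1 - np.exp(-z)) / (1 + np.exp(-z))  # G(z) = tanh(z/2)
    return G(x @ W1 + w0) @ w2 + w2_0

# Usage with random weights (d = 11 inputs, Q = 5 hidden neurons).
rng = np.random.default_rng(0)
d, Q = 11, 5
x = rng.normal(size=d)
print(mlp_predict(x, rng.normal(size=(d, Q)), rng.normal(size=Q),
                  rng.normal(size=Q), rng.normal()))
```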

Learning MLP

Learning the weights: $w$ is learned by a mean squared error minimization scheme penalized by a weight decay to avoid overfitting (and ensure a better generalization ability):

$$w^* = \arg\min_w \sum_{i=1}^{N} L(y_i, \Phi_w(x_i)) + C \|w\|^2.$$

Problem: the MSE is not quadratic in $w$, so some solutions can be local minima.

Tuning the hyper-parameters $C$ and $Q$: simple validation was used to tune $C$ and $Q$ (a sketch follows).
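A hedged sketch of this training-and-tuning loop with scikit-learn's MLPRegressor on synthetic data; its `alpha` parameter plays the role of the weight-decay constant $C$ here (an illustration, not the original code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 11))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=800)

# Simple validation: hold out part of the data to pick Q and the decay.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best = None
for Q in (5, 10, 20):                   # number of hidden neurons
    for alpha in (1e-4, 1e-2, 1.0):     # weight-decay penalty
        mlp = MLPRegressor(hidden_layer_sizes=(Q,), activation="tanh",
                           alpha=alpha, solver="lbfgs", max_iter=2000,
                           random_state=0).fit(X_tr, y_tr)
        err = np.mean((mlp.predict(X_val) - y_val) ** 2)
        if best is None or err < best[0]:
            best = (err, Q, alpha)
print("best validation MSE %.3f with Q=%d, alpha=%g" % best)
```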

SVM

SVM is also an algorithm based on penalized error loss minimization:

1 Basic linear SVM for regression: $\Phi_{(w,b)}$ is of the form $x \mapsto w^T x + b$ with $(w, b)$ solution of

$$\arg\min_{(w,b)} \sum_{i=1}^{N} L_\varepsilon(y_i, \Phi_{(w,b)}(x_i)) + \lambda \|w\|^2$$

where

$\lambda$ is a regularization (hyper-)parameter (to be tuned);

$L_\varepsilon(y, \hat{y}) = \max\{|y - \hat{y}| - \varepsilon, 0\}$ is an $\varepsilon$-insensitive loss function.

2 Nonlinear SVM for regression is the same, except that a nonlinear (fixed) transformation of the inputs is applied first: $\varphi(x) \in \mathcal{H}$ is used instead of $x$.

Kernel trick: in fact, $\varphi$ is never made explicit but is used through a kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ such that $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$.

Common kernel: the Gaussian kernel

$$K_\gamma(u, v) = e^{-\gamma \|u - v\|^2}$$

is known to have good theoretical properties both for accuracy and generalization.


Learning SVM

Learning $(w, b)$: $w = \sum_{i=1}^{N} \alpha_i K(x_i, \cdot)$ and $b$ are calculated by an exact optimization scheme (quadratic programming). The only step that can be time consuming is the calculation of the kernel matrix $K(x_i, x_j)$ for $i, j = 1, \dots, n$.

The resulting $\Phi_n$ is known to be of the form

$$\Phi_n(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x) + b$$

where only a few $\alpha_i$ are nonzero. The corresponding $x_i$ are called support vectors.

Tuning of the hyper-parameters $C = 1/\lambda$, $\varepsilon$ and $\gamma$: simple validation was used. To save time, $\varepsilon$ was not tuned in our experiments but set to the default value (1), which ensured at most $0.5n$ support vectors (a usage sketch follows).

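A minimal sketch with scikit-learn's $\varepsilon$-SVR on synthetic data (assumed for illustration; its `C` corresponds to the $1/\lambda$ above):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

# Gaussian (RBF) kernel SVR with hyper-parameters C, gamma, epsilon.
svr = SVR(kernel="rbf", C=1.0, gamma=0.5, epsilon=0.1).fit(X, y)

# Phi_n(x) = sum_i alpha_i K(x_i, x) + b, with few nonzero alpha_i:
print("support vectors:", len(svr.support_), "out of", len(X))
print(svr.predict(X[:3]))
```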

From regression tree to random forest

Example of a regression tree:

[Figure: a regression tree whose internal nodes split on SOCt, PH, FR and clay, with predicted values at the leaves.]

Each split is made such that the two induced subsets have the greatest possible homogeneity. The prediction of a terminal node is the mean of the $Y$ values of the observations belonging to this node.
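A small sketch of such a tree with scikit-learn (synthetic data; the feature names merely echo the figure and are not the real HSMU values):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = np.where(X[:, 0] < 0.5, 5.0, 30.0) + rng.normal(size=200)

# Each split maximizes the homogeneity of the two children;
# each leaf predicts the mean y of its observations.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["SOCt", "PH", "clay"]))
```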

Random forest

Basic principle: combination of a large number of under-efficient regression trees (the prediction is the mean prediction over all trees).

For each tree, two simplifications of the original method are performed:

1 a given number of observations are randomly chosen among the training set: this subset of the training data is called the in-bag sample, whereas the other observations are called out-of-bag and are used to control the error of the tree;

2 for each node of the tree, a given number of variables are randomly chosen among all possible explanatory variables.

The best split is then calculated on the basis of these variables and of the chosen observations. The chosen observations are the same for a given tree, whereas the variables taken into account change for each split. A usage sketch follows.

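A minimal sketch with scikit-learn's RandomForestRegressor on synthetic data (an illustration under assumed data, not the original experiment):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 11))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=1000)

# Each tree is grown on a bootstrap (in-bag) sample with a random
# subset of variables per split; the out-of-bag observations give a
# built-in error estimate (rf.oob_score_ is the OOB R^2).
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                           oob_score=True, random_state=0).fit(X, y)
print("OOB R2:", round(rf.oob_score_, 3))
```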

Additional tools

OOB (out-of-bag) error: the error based on the OOB predictions. Stabilization of the OOB error is a good indication that there are enough trees in the forest.

Importance of a variable (to help interpretation): for a given variable $X^j$ ($j \in \{1, \dots, p\}$), the importance of $X^j$ is the mean decrease in accuracy obtained when the values of $X^j$ are randomized:

$$I(X^j) = E\left(R(\Phi_n(X^{(j)}), Y)\right) - E\left(R(\Phi_n(X), Y)\right)$$

in which $X^{(j)} = (X^1, \dots, \tilde{X}^{(j)}, \dots, X^p)$, $\tilde{X}^{(j)}$ being the variable $X^j$ with permuted values.

Importance is estimated with the OOB observations (see the next slide for details).

Learning a random forest

Random forests are not very sensitive to the hyper-parameters (number of observations for each tree, number of variables for each split): the default values have been used.

The number of trees should be large enough for the mean squared error based on out-of-sample observations to stabilize:

[Figure: error vs. number of trees (0 to 500) — the out-of-bag (training) and test errors both decrease and stabilize.]

Importance estimation in random forests

OOB estimation for variable $X^j$:

1: for $b = 1 \to B$ (loop on trees) do
2:   for the OOB observations $x_i \notin T^b$, permute the values $(x_i^j)_i$: this gives $x_i^{(j,b)} = (x_i^1, \dots, \tilde{x}_i^{(j,b)}, \dots, x_i^p)$, where $\tilde{x}_i^{(j,b)}$ are the permuted values
3:   predict $\Phi^b\!\left(x_i^{(j,b)}\right)$
4: end for
5: return the OOB estimation of the importance of $X^j$:

$$\frac{1}{B} \sum_{b=1}^{B} \left[ \frac{1}{|\{i : x_i \notin T^b\}|} \sum_{x_i \notin T^b} \left(\Phi^b(x_i^{(j,b)}) - y_i\right)^2 - \frac{1}{|\{i : x_i \notin T^b\}|} \sum_{x_i \notin T^b} \left(\Phi^b(x_i) - y_i\right)^2 \right]$$

(A code sketch of this procedure follows.)
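A self-contained sketch in Python (synthetic data; the forest is built by hand from scikit-learn trees so that each tree's OOB indices are known — an illustration, not the original implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, p, B = 500, 11, 100
X = rng.normal(size=(n, p))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=n)

trees, oob_sets = [], []
for b in range(B):
    in_bag = rng.integers(0, n, size=n)          # bootstrap (in-bag) sample T^b
    oob = np.setdiff1d(np.arange(n), in_bag)     # out-of-bag observations
    trees.append(DecisionTreeRegressor(max_features="sqrt", random_state=b)
                 .fit(X[in_bag], y[in_bag]))
    oob_sets.append(oob)

def importance(j):
    """Mean increase in OOB MSE when the values of X^j are permuted."""
    diffs = []
    for tree, oob in zip(trees, oob_sets):
        X_perm = X[oob].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # permute variable j
        mse_perm = np.mean((tree.predict(X_perm) - y[oob]) ** 2)
        mse = np.mean((tree.predict(X[oob]) - y[oob]) ** 2)
        diffs.append(mse_perm - mse)
    return np.mean(diffs)

# Variables 0 and 1 drive y, so their importance should dominate.
print([round(importance(j), 3) for j in range(p)])
```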


Influence of the training sample size

[Figure: test pseudo-R² vs. log training-set size for N2O prediction (left) and N leaching prediction (right), comparing LM1, LM2, Dace, SDR, ACOSSO, MLP, SVM and RF.]


Computational time

Use          LM1    LM2     Dace    SDR      ACOSSO
Train        <1 s   50 min  80 min  4 hours  65 min
Prediction   <1 s   <1 s    90 s    14 min   4 min

Use          MLP        SVM      RF
Train        2.5 hours  5 hours  15 min
Prediction   1 s        20 s     5 s

Time for DNDC itself: about 200 hours with a desktop computer and about 2 days using a cluster!

Further comparisons

Evaluation of the different steps (time/difficulty):

         Training   Validation   Test
LM1      ++                      +
LM2      +                       +
ACOSSO   =          +            -
SDR      =          +            -
DACE     =          -            -
MLP      -          -            +
SVM      =          -            -
RF       +          +            +

Understanding which inputs are important

Example (N2O, RF):

[Figure: permutation importance (mean decrease in MSE) of the 11 inputs, ranked; SOC and pH stand out.]

The variables SOC and PH are the most important for accurate predictions.

Example (N leaching, SVM):

[Figure: permutation importance (decrease in MSE) of the 11 inputs, ranked; N_FR, Nres, pH and N_MR stand out.]

The variables N_MR, N_FR, Nres and pH are the most important for accurate predictions.


Thank you for your attention...

... questions?


References

Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA.

Boser, B., Guyon, I., and Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on COLT, pages 144–152. D. Haussler, editor, ACM Press.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.

Villa-Vialaneix, N., Follador, M., Ratto, M., and Leip, A. (2012). A comparison of eight metamodeling techniques for the simulation of N2O fluxes and N leaching from corn crops. Environmental Modelling and Software, 34:51–66.
