Gaussian processes: surrogate models for continuous black-box optimization

Lukáš Bajer
MFF UK, 04/2018




Contents

1 Optimization
    Continuous optimization
    Metaheuristics, black-box functions

2 Gaussian processes
    Gaussian process prediction
    Gaussian process covariance functions

3 Doubly trained Surrogate CMA-ES
    CMA-ES
    Doubly trained Surrogate CMA-ES
    Experimental results


Optimization

optimization (minimization) is finding x⋆ ∈ Rⁿ such that

    f(x⋆) = min_{x ∈ Rⁿ} f(x)

“near-optimal” solution is usually sufficient

[Figure: two example objective functions, f(x) plotted over x]


Continuous white-box optimization

also known as numerical optimization methods

requirements:
    gradients ∇f(x)
    ... can be approximated by finite difference approximations
    and sometimes also Hessians ∇²f(x)

1 gradient descent (1st order)
2 Newton's method (2nd order)
3 quasi-Newton methods (2nd order, approximated)
4 trust-region methods, conjugate gradients


1st order: gradient descent

iterative steps in the direction of the negative gradient

    x^(k+1) = x^(k) − σ∇f(x^(k))

σ – step size; it usually changes every iteration and is adapted, for example, using a line search along the gradient direction
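As a concrete illustration of this update rule, a minimal Python sketch with a fixed step size σ (the test function, names and constants are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad, x0, sigma=0.05, n_steps=1000, tol=1e-8):
    """Minimal gradient descent with a fixed step size sigma."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # gradient nearly zero: stop
            break
        x = x - sigma * g             # step against the gradient
    return x

# minimize f(x) = x1^2 + 10*x2^2 (gradient supplied analytically)
x_min = gradient_descent(lambda x: np.array([2 * x[0], 20 * x[1]]),
                         [3.0, -2.0], sigma=0.05)
```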

[Figure: gradient-descent iterates x⁰, x¹, …, x⁴ approaching the optimum; source: (CC) Wikipedia]

pros:
    theoretically guaranteed to converge
    suitable even for large problems (deep NNs, ...)

limitations:
    can be very slow, especially without momentum
    often stops much earlier due to round-off errors

[Figure: gradient descent on an ill-conditioned function; source: (GNU) Wikipedia, author: P.A. Simionescu]


2nd order: Newton’s method

take into account the second-order term of a Taylor expansion of f(x) around x^(k):

    f(x^(k) + h) ≈ q^(k)(h) = f(x^(k)) + h^T ∇f^(k) + (1/2) h^T [∇²f^(k)] h

the next iterate is then

    x^(k+1) = x^(k) + h^(k)

where h^(k) minimizes q^(k)(h):

    x^(k+1) = x^(k) − γ [∇²f^(k)]^(−1) ∇f^(k)
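A sketch of the damped Newton update above; in code one solves the linear system [∇²f] h = ∇f rather than inverting the Hessian explicitly (names and defaults are illustrative):

```python
import numpy as np

def newton_minimize(grad, hess, x0, gamma=1.0, n_steps=50, tol=1e-10):
    """Damped Newton iteration: x <- x - gamma * [hess]^{-1} grad."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        h = np.linalg.solve(hess(x), g)  # solve [hess] h = g, no explicit inverse
        x = x - gamma * h
    return x
```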


pros:
    very fast convergence on quadratic-like functions

limitations:
    needs the Hessian matrix to be computed and inverted,
    thus rarely usable in practice


Quasi-Newton methods

the Hessian matrix ∇²f^(k) is not computed, only iteratively approximated by B^(k), B^(k+1), ...

the inverses of these Hessian approximations are often obtained without explicit matrix inversion

BFGS
    the most successful method of the last three decades
    independently discovered by 4 (!) people in 1970:
    C. G. Broyden, R. Fletcher, D. Goldfarb and D. Shanno
    the Hessian approximation is updated via rank-two updates
    works even without derivatives (with finite differences)
    shown to behave well on a variety of (even multimodal) functions

L-BFGS – a popular memory-limited version (Nocedal, 1980)
    available in every optimization package (Matlab, Python, ...)
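A quick usage sketch with the L-BFGS implementation shipped in SciPy; omitting the gradient argument makes SciPy fall back to finite-difference approximations, matching the derivative-free mode mentioned above:

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    """Rosenbrock function, a classic quasi-Newton test problem."""
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

res = minimize(rosen, x0=np.zeros(5), method="L-BFGS-B")
print(res.x, res.fun)   # should end up near the optimum at (1, ..., 1)
```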


Other numerical optimization techniques

quadratic approximations:
    by far the most popular optimization technique

trust-region methods
    quadratic approximations around the current point x^(k)
    minimize the model within the region of trust
    NEWUOA, BOBYQA (M. J. D. Powell, 2004, 2009)
    construct the quadratic model using far fewer points than (n + 1)(n + 2)/2 by additionally minimizing a norm,
    which saves time and enhances performance

conjugate gradients
    do not approximate Hessians
    conjugate vectors – a momentum guiding the search
    a cheaper alternative to quasi-Newton methods


Optimization of black-box functions

black-box functions

[Diagram: x → f → f(x); the function is accessible only through evaluation]

only evaluation of the function value is available, no derivatives or gradients → no gradient methods applicable

we consider a continuous domain: x ∈ Rⁿ


Optimization of empirical black-box functions

empirical function:
    the function value is assessed via an experiment (measurement, intensive calculation, evaluating a prototype)
    evaluating such functions is expensive (in time and/or money)
    search cost ∼ the number of function evaluations


Metaheuristics

Metaheuristic
    optimization techniques finding a sufficiently good solution
    treat the objective function as a black box
    sample a set of candidate solutions (the search space is often too large to be sampled completely)
    often nature-inspired

    particle swarm optimization
    simulated annealing
    ...
    evolutionary computation (EA, GA, ES, ...)


EAs for empirical black-box optimization

what can help decrease the number of function evaluations:

    utilize already measured values (at least prevent measuring the same thing twice)
    learn the shape of the function landscape
    or learn the (global) gradient, or the step direction & size

[Diagram: the evolutionary-algorithm loop over a population f(X) – evaluation, selection, crossover, mutation – converging to X*; source: (GNU) Wikipedia, author: Johann "nojhan" Dréo]


Model-based methods accelerating the convergence

several methods are used in order to decrease the number of objective function evaluations needed by EAs

1 Bayesian optimization (EGO)

2 Surrogate modelling


Bayesian optimization

Bayesian optimizer

Input: objective function f, the size of the initial sample d
x1, ..., xd ← generate an initial sample
A ← {(xi, yi)}                                /* initialize the archive */
for generation g = 1, 2, ... until stopping conditions are met do
    M ← build the probabilistic model based on A
    x1, ... ← choose next point(s) x ∈ X according to C_M(x)
    y1, ... ← f(x1), ...                      /* evaluate the new point(s) */
    A ← A ∪ {(x1, y1), ...}                   /* update the archive */

suitable for very low budgets of f-evaluations (∼ 10 · D)
Gaussian processes are most often used in the criterion C_M
existing algorithms: EGO (D. R. Jones, 1998), SPOT (T. Bartz-Beielstein, 2005), SMAC (F. Hutter, 2011), etc.
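A minimal runnable rendering of this loop, assuming scikit-learn's GaussianProcessRegressor as the probabilistic model and a simple probability-of-improvement criterion maximized over a random candidate pool (the criterion choice and all names here are illustrative, not the EGO/SPOT/SMAC internals):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt(f, lo, hi, d=5, n_gen=20, rng=np.random.default_rng(0)):
    """Sketch of a 1-D Bayesian optimizer with a PoI criterion."""
    X = rng.uniform(lo, hi, size=(d, 1))            # initial sample
    y = np.array([f(x[0]) for x in X])              # archive of f-values
    for _ in range(n_gen):
        model = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        cand = rng.uniform(lo, hi, size=(256, 1))   # candidate pool
        mu, s = model.predict(cand, return_std=True)
        poi = norm.cdf((y.min() - mu) / np.maximum(s, 1e-12))  # P(f < y_min)
        x_next = cand[np.argmax(poi)]               # choose the next point
        X = np.vstack([X, x_next])                  # update the archive
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y)], y.min()

x_best, y_best = bayes_opt(lambda x: (x - 1.0)**2 + np.sin(5 * x), -5.0, 5.0)
```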


Surrogate modelling

Surrogate modelling
    a technique which builds an approximating model of the fitness-function landscape
    the model provides a cheap and fast, but also inaccurate, replacement of the fitness function for part of the population
    an inaccurate approximating model can deceive the optimizer


Gaussian Process

GP is a stochastic approximation method based on Gaussian distributions

GP can express the uncertainty of the prediction in a new point x: it gives a probability distribution of the output value


Gaussian Process

Gaussian Process
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A Gaussian process is completely specified by its
    mean function m(x) = E[f_GP(x)]
    covariance function cov(x_i, x_j) = cov(f_GP(x_i), f_GP(x_j))

and we write the Gaussian process as

    f(x) ∼ GP(m(x), cov(x, x′)).

(Rasmussen, Williams, 2006)


Gaussian Process

given a set of N training points X_N = (x1, ..., xN)^T, xi ∈ R^d, and measured values y_N = (y1, ..., yN)^T of a function f being approximated,

    yi = f(xi),   i = 1, ..., N,

a GP considers the vector of these function values to be a sample from an N-variate Gaussian distribution

    y_N ∼ N(0, C_N)
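The view y_N ∼ N(0, C_N) makes sampling from the GP prior straightforward: build C_N with a covariance function and draw from the multivariate normal. A sketch for 1-D inputs, using the squared-exponential covariance defined later with ℓ = 1 and σ_f² = 1 to match the prior-draw figure below (the jitter term is a standard numerical safeguard, not part of the model):

```python
import numpy as np

def k_se(a, b, ell=1.0, sf2=1.0):
    """Squared-exponential covariance: sf2 * exp(-(x - x')^2 / (2 ell^2))."""
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

x = np.linspace(-5, 5, 101)
C = k_se(x, x) + 1e-10 * np.eye(len(x))   # tiny jitter keeps C positive definite
rng = np.random.default_rng(1)
draws = rng.multivariate_normal(np.zeros(len(x)), C, size=3)  # three prior draws
```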


Gaussian Process prior distribution

[Figure: draws from Gaussian process priors for three different covariance functions, K_SE, K_Matérn(ν=3/2) and K_Matérn(ν=5/2) (in that order), all with the parameters ℓ = 1 and σ_f² = 1, without noise]


Gaussian Process prediction (posterior)

Making predictions
Let C_{N+1} be the covariance matrix extended by the entries belonging to an unseen point (x, y*). Because y_N is known, and because the inverse C_{N+1}^{-1} can be expressed using the inverse of the training covariance C_N^{-1}, the density in the new point marginalizes to a 1-D Gaussian density

    p(y* | X_{N+1}, y_N) ∝ exp( − (y* − ŷ_{N+1})² / (2 s²_{ŷ_{N+1}}) )

where the mean ŷ_{N+1} and the variance s²_{ŷ_{N+1}} are easily expressible from C_N^{-1} and y_N.
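A sketch of those posterior formulas in code (the standard zero-mean GP regression equations, cf. Rasmussen & Williams, 2006), reusing the k_se kernel from the earlier sketch; note that only the training covariance C_N is factorized, never the extended matrix:

```python
import numpy as np

def gp_posterior(x_train, y_train, x_test, k, noise=1e-10):
    """Posterior mean and std: mean = K*^T C^-1 y, var = k(x,x) - K*^T C^-1 K*."""
    C = k(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = k(x_train, x_test)                    # train-test cross-covariances
    L = np.linalg.cholesky(C)                      # factorize C_N once
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var = np.diag(k(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))     # clip tiny negative variances
```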


Gaussian Process prediction (posterior)

[Figure: Gaussian process predictions for N = 2, 3, 4 training points. (+) – training set, thick line – mean prediction, thin lines – three draws from the GP posterior (without noise). Predictions ŷ* and ±2s* are generated for 101 points; this is computationally stable, as the matrix inversion is needed only for the training covariance C_N.]


Gaussian Process covariance

The covariance matrix C_N is determined by the covariance function cov(xi, xj), which is defined on pairs of points from the input space,

    (C)_ij = cov(xi, xj),   xi, xj ∈ R^d,

and expresses the degree of correlation between the values at two points; it is typically a decreasing function of the distance between them.

[Figure: cov(xi, xj) decreasing from 1 as the distance d(xi, xj) grows]


Gaussian Process covariance

The most frequent covariance function is the squared exponential

    (K)_ij = cov_SE(xi, xj) = θ exp( −(1 / 2ℓ²) (xi − xj)^T (xi − xj) )

with the parameters (usually fitted by MLE)
    θ – signal variance (scales the correlation)
    ℓ – characteristic length scale


Gaussian Process covariance

Another usual option in data-mining applications is the Matérn covariance, which for r = ‖xi − xj‖ reads

    (K)_ij = cov_Matérn,ν=5/2(r) = θ (1 + √5 r / ℓ + 5r² / (3ℓ²)) exp(−√5 r / ℓ)

with the parameters (the same as for the squared exponential)
    θ – signal variance
    ℓ – characteristic length scale
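The formula translates directly into code; a 1-D sketch that can serve as a drop-in replacement for k_se in the earlier GP sketches (using √5·r/ℓ + 5r²/(3ℓ²) = s + s²/3 for s = √5·r/ℓ):

```python
import numpy as np

def k_matern52(a, b, ell=1.0, theta=1.0):
    """Matérn nu=5/2 covariance for 1-D inputs, with r = |x - x'|."""
    s = np.sqrt(5.0) * np.abs(a[:, None] - b[None, :]) / ell
    return theta * (1.0 + s + s**2 / 3.0) * np.exp(-s)
```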


Gaussian Process covariance

[Figure; source: (Rasmussen and Williams, 2006)]


Stochastic search of Evolutionary algorithms

Stochastic black-box search
    initialize distribution parameters θ
    set population size λ ∈ N
    while not terminate
        1 sample distribution P(x|θ) → x1, ..., xλ ∈ Rⁿ
        2 evaluate x1, ..., xλ on f
        3 update parameters θ

(A. Auger, Tutorial CMA-ES, GECCO 2013)

this is the schema of most evolution strategies (and of EDA algorithms), as well as of CMA-ES (Covariance Matrix Adaptation Evolution Strategy) – the current state of the art in continuous optimization


The CMA-ES
Input: m ∈ Rⁿ, σ ∈ R₊, λ ∈ N
Initialize: C = I (and several other parameters)
Set the weights w1, ..., wλ appropriately

while not terminate
    1 xi = m + σ yi,  yi ∼ N(0, C), for i = 1, ..., λ        (sampling)
    2 m ← Σ_{i=1}^{µ} wi x_{i:λ} = m + σ y_w, where y_w = Σ_{i=1}^{µ} wi y_{i:λ}        (update mean)
    3 update C
    4 update step size σ
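To make the loop concrete, here is a deliberately stripped-down sketch: it keeps only the sampling step, the weighted mean update and a rank-µ covariance update, with a constant σ; the full CMA-ES additionally maintains evolution paths and adapts the step size (see Hansen's reference implementations, e.g. pycma):

```python
import numpy as np

def simple_cma_es(f, m, sigma, lam=12, n_gen=200, rng=np.random.default_rng(0)):
    """Toy (mu/mu_w, lambda)-ES with a rank-mu covariance update only."""
    m = np.asarray(m, dtype=float)
    n, mu = len(m), lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                   # positive weights, sum to 1
    c_mu = min(1.0, (1.0 / np.sum(w**2)) / n**2)   # crude learning rate for C
    C = np.eye(n)
    for _ in range(n_gen):
        A = np.linalg.cholesky(C)
        Y = rng.standard_normal((lam, n)) @ A.T    # y_i ~ N(0, C)
        X = m + sigma * Y                          # x_i = m + sigma * y_i
        best = np.argsort([f(x) for x in X])[:mu]  # mu best by fitness
        m = m + sigma * (w @ Y[best])              # weighted mean update
        C = (1 - c_mu) * C + c_mu * sum(           # rank-mu covariance update
            wi * np.outer(yi, yi) for wi, yi in zip(w, Y[best]))
    return m
```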

[Figure: two CMA-ES generations – λ = 12 points xi sampled from N(m₁, σ₁²C₁), the µ = 3 best steps σy_{1:λ}, σy_{2:λ}, σy_{3:λ} define the new mean m₂ and the updated σ₂, C₂, and analogously m₃, σ₃, C₃]


Covariance matrix adaptation

eigenvectors of the covariance matrix C are the principal components – the principal axes of the mutation ellipsoid
CMA-ES learns and updates a new Mahalanobis metric
it successively approximates the inverse Hessian on quadratic functions
    – transforming an ellipsoid function into a sphere function
    – this holds approximately for other functions too, up to some degree

[Figure: principal axes b₁, b₂ of the mutation ellipsoid; source: (S. Finck, N. Hansen, R. Ros, and A. Auger, 2009)]


Is the CMA-ES the best for everything?

CMA-ES is a state-of-the-art optimization algorithm, especially for rugged and ill-conditioned objective functions
however, it is not the fastest if we can afford only very few objective function evaluations

what we have already seen: use a surrogate model!
however, original-evaluated solutions are available only along the search path
solution: construct local surrogate models


Doubly trained Surrogate CMA-ES

[Diagram: one generation of DTS-CMA-ES around the CMA-ES core – (1) first model training, (2) sampling the population from N(m, σ²C), (3) criterion-based ranking according to the first model, (4) fitness evaluation of a few chosen points, (5) second model training, (6) mean prediction of the second model for the rest of the population]


Doubly trained Surrogate CMA-ES

1 sample a new population of size λ (standard CMA-ES offspring),
2 train the first surrogate model on the original-evaluated points from the archive A,
3 select ⌈αλ⌉ point(s) w.r.t. a criterion C based on the first model's prediction,
4 evaluate these point(s) with the original fitness,
5 re-train the surrogate model, now also using these new point(s), and
6 predict the fitness of the remaining, not originally evaluated points with this second model (see the sketch below).
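A schematic rendering of these six steps (the model and criterion interfaces here are hypothetical placeholders, not the actual Matlab implementation; step 1's sampling is assumed to have produced the population X already):

```python
import numpy as np

def dts_generation(f, X, archive, make_model, crit, alpha=0.05):
    """One DTS-CMA-ES generation for an already sampled population X.

    archive    : list of (x, y) original-evaluated pairs
    make_model : returns a surrogate with .fit(X, y) and .predict(X) -> (mean, std)
    crit       : maps (mean, std, y_min) to scores; higher = evaluate originally
    """
    Xa = np.array([x for x, _ in archive]); ya = np.array([y for _, y in archive])
    model1 = make_model().fit(Xa, ya)                    # (2) first training
    mu, s = model1.predict(X)
    k = int(np.ceil(alpha * len(X)))
    chosen = np.argsort(-crit(mu, s, ya.min()))[:k]      # (3) criterion selection
    y_true = np.array([f(X[i]) for i in chosen])         # (4) original fitness
    archive += [(X[i], yi) for i, yi in zip(chosen, y_true)]
    model2 = make_model().fit(np.vstack([Xa, X[chosen]]),
                              np.append(ya, y_true))     # (5) re-training
    y_pred, _ = model2.predict(X)                        # (6) predict the rest
    y_pred[chosen] = y_true                              # keep the true values
    return y_pred, archive
```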


Criteria for the selection of original-evaluated points

GP predictive mean (the criteria are maximized, hence the minus sign)

    C_M(x) = −ŷ(x)

GP predictive standard deviation

    C_STD(x) = s(x)


Criteria for the selection of original-evaluated points

Expected improvement (EI); y_min – the minimum fitness so far:

    C_EI(x) = E( (y_min − f(x)) · I(f(x) < y_min) | y1, ..., yN ),   where

    I(f(x) < y_min) = 1 for f(x) < y_min, and 0 for f(x) ≥ y_min

Probability of improvement (PoI); the probability of finding a lower fitness than some threshold T:

    C_PoI(x, T) = P(f(x) ≤ T | y1, ..., yN) = Φ( (T − ŷ(x)) / s(x) )

where Φ is the CDF of N(0, 1) and T = y_min or a slightly higher value
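Both criteria have closed forms under the Gaussian predictive distribution with mean ŷ(x) and standard deviation s(x); a sketch (the EI closed form is the standard one, cf. Jones et al., 1998):

```python
import numpy as np
from scipy.stats import norm

def crit_ei(mu, s, y_min):
    """Expected improvement for minimization under a N(mu, s^2) prediction."""
    s = np.maximum(s, 1e-12)                 # guard against zero predictive std
    z = (y_min - mu) / s
    return (y_min - mu) * norm.cdf(z) + s * norm.pdf(z)

def crit_poi(mu, s, T):
    """Probability of improvement P(f(x) <= T)."""
    return norm.cdf((T - mu) / np.maximum(s, 1e-12))
```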


Criteria for the selection of original-evaluated points

selected unimodal COCO functions f1,2, f8–14

[Figure: convergence plots in 5-D and 20-D comparing the criteria GP predictive mean (M), GP predictive standard deviation (STD), Expected improvement (EI), Probability of improvement (PoI) and Expected RDE (ERDE); x-axis: number of evaluations / D]

multimodal COCO functions f3,4, f15–24

[Figure: the same comparison on the multimodal functions, 5-D and 20-D]

The log10 of the median distances of the best f-values to the benchmarks' optima, scaled linearly to [−8, 0] for each COCO function.


GP model training

trainModel(A, N_max, TSS, r_max^A, K, σ^(g), C^(g), m^(g))

    (X_N, y_N) ← select at most N_max points from the archive A using TSS and r_max^A
    X_N ← transform the selected points into the (σ^(g))² C^(g) basis with the origin at m^(g)
    y_N ← standardize the f-values in y_N to zero mean and unit variance
    (m_µ, σ_f², ℓ, σ_n) ← fit the hyperparameters of µ(x) and K using ML estimation


Training set selection

1 TSS1 taking up to N_max most recently evaluated points

2 TSS2 selecting the union of the k nearest neighbors of every point for which the fitness should be predicted, where k is maximal such that the total number of selected points does not exceed N_max (a sketch follows the list),

3 TSS3 clustering the points in the input space into N_max clusters and taking the points nearest to the clusters' centroids

4 TSS4 selecting the N_max points which are closest to any point in the current population.
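A sketch of TSS2 (illustrative, not the thesis' implementation): grow k until the union of the k-nearest-neighbor sets would exceed N_max, and return the archive indices of the largest feasible union:

```python
import numpy as np

def tss2(X_archive, X_predict, n_max):
    """Union of k nearest archive points per prediction point, k maximal."""
    d = np.linalg.norm(X_predict[:, None, :] - X_archive[None, :, :], axis=2)
    order = np.argsort(d, axis=1)          # archive indices sorted by distance
    chosen = np.array([], dtype=int)
    for k in range(1, X_archive.shape[0] + 1):
        union = np.unique(order[:, :k])    # union of all k-NN sets
        if union.size > n_max:
            break                          # k - 1 was the maximal feasible k
        chosen = union
    return chosen
```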


GP model parameters

parameter                          | considered values
training set selection method TSS  | TSS1, TSS2, TSS3, TSS4
maximum distance r_max^A           | 2·√Q_χ²(0.99, D), 4·√Q_χ²(0.99, D)
N_max                              | 10·D, 15·D, 20·D
covariance function K              | K_SE, K_Matérn(ν=3/2), K_Matérn(ν=5/2)

Parameters of the GP surrogate models. The maximum distance r_max^A is measured in the Mahalanobis distance given by the covariance matrix σ²C. Q_χ²(0.99, D) is the 0.99-quantile of the χ²_D distribution, and therefore √Q_χ²(0.99, D) is the 0.99-quantile of the norm of a D-dimensional normally distributed random vector.


Gaussian process parameter settings – heatmap

[Figure: heatmaps of the ranking prediction error in 2-D, 5-D, 10-D and 20-D; x-axis: COCO/BBOB functions 1–24, y-axis: the parameter settings]

Ranking prediction error. Solid horizontal lines separate the TSS methods TSS1–TSS4 (in that order); dashed lines separate sectors with the smaller and larger maximum distance r_max^A. The three triples of settings within each sector represent rising values of N_max, with the three covariance functions K_SE, K_Matérn(ν=3/2), K_Matérn(ν=5/2) within each triple.


Testing framework
Black-Box Optimization Benchmarking (BBOB) / COmparing Continuous Optimisers (COCO)

24 artificial functions
different degrees of separability, conditioning and modality, with or without a global structure
testing sets defined for dimensions 2, 3, 5, 10, 20 (and 40)

[Figure: surface plots of selected BBOB functions f1, f3, f4, f8, f12, f13, f14, f16, f17, f20, f22, f23, f24]


Aggregated experimental results on BBOB

[Figure: aggregated convergence plots in 2-D, 5-D, 10-D and 20-D. Compared algorithms: S-CMA-ES, 0.05/2pop DTS-CMA-ES, adaptive DTS-CMA-ES, MA-ES, GPOP, SAPEO, CMA-ES, CMA-ES 2pop, BIPOP-s*ACMES-k, lmm-CMA, BOBYQA, SMAC, fmincon; x-axis: number of evaluations / D, y-axis: log distances to the optima scaled to [−8, 0]]


Experimental results on BBOB (5-D)

[Figure: convergence plots for f1 Sphere, f2 Ellipsoidal, f3 Rastrigin and f4 Bueche-Rastrigin in 5-D; the same algorithms as in the aggregated plot]


Experimental results on BBOB (5-D)

[Figure: convergence plots for f5 Linear Slope, f6 Attractive Sector, f8 Rosenbrock (original) and f9 Rosenbrock (rotated) in 5-D]


Experimental results on BBOB (5-D)

[Figure: convergence plots for f13 Sharp Ridge, f15 Rastrigin (multi-modal), f17 Schaffers F7 and f18 Schaffers F7 (ill-conditioned) in 5-D]


Experimental results on BBOB (5-D)

[Figure: convergence plots for f19 Composite Griewank-Rosenbrock F8F2, f22 Gallagher's Gaussian 21-hi Peaks, f23 Katsuura and f24 Lunacek bi-Rastrigin in 5-D]


Experimental results on BBOB (20-D)

[Figure: convergence plots for f1 Sphere, f2 Ellipsoidal, f3 Rastrigin and f4 Bueche-Rastrigin in 20-D; the same algorithms as in the aggregated plot]


Experimental results on BBOB (20-D)

[Figure: convergence plots for f5 Linear Slope, f6 Attractive Sector, f8 Rosenbrock (original) and f9 Rosenbrock (rotated) in 20-D]


Experimental results on BBOB (20-D)

[Figure: convergence plots for f13 Sharp Ridge, f15 Rastrigin (multi-modal), f17 Schaffers F7 and f18 Schaffers F7 (ill-conditioned) in 20-D]


Experimental results on BBOB (20-D)

[Figure: convergence plots for f19 Composite Griewank-Rosenbrock F8F2, f22 Gallagher's Gaussian 21-hi Peaks, f23 Katsuura and f24 Lunacek bi-Rastrigin in 20-D]