
A comparative introduction to two optimization classics: the Nelder-Mead and the CMA-ES algorithms

Rodolphe Le Riche 1,2
1 Ecole des Mines de Saint Etienne, France
2 CNRS LIMOS, France

March 2018
MEXICO network research school on sensitivity analysis, metamodeling and optimization of complex models
La Rochelle, France


Foreword

In this 1h30 course + 1h30 hands-on class, we address unconstrained optimization with continuous variables:

find one or many x* ≡ arg min_{x ∈ S ⊂ R^d} f(x)    (1)

where S is typically the compact [x_LB, x_UB] ⊂ R^d.

We cover the principles (the optimization "engines") of two algorithms, Nelder-Mead and CMA-ES. These algorithms have the same goal, but one is deterministic and one is stochastic.

Both algorithms do not use metamodels: because they do not model f, they are insensitive to monotonic transformations of f() (e.g., they perform the same on f(), f^3(), exp(f())); they do not build any approximation of f() valid in the entire S (like EGO does).


Optimizers are iterative algorithms

We are going to approximate the solution(s) to Problem (1) using iterative optimization algorithms (optimizers) that exchange information with the objective function code (which wraps the often costly simulation).

[Diagram: the optimizer (state S, solving min_x f(x)) sends a candidate x to the simulation code, which returns f(x).]

or, in a state-space representation,

x_{t+1} = φ(S_{t+1})    (2)
S_{t+1} = ψ(S_t, (x_t, f(x_t)))    (3)

The optimizer is defined by S (state vector), φ (next iterate equation) and ψ (state equation).
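As a toy illustration of this state-space view (our own example, not from the slides), pure random search around the best point found so far fits the form (2)-(3), with the state S reduced to that best point and its objective value:

```r
# Random search written with the (phi, psi) decomposition of equations (2)-(3):
# phi proposes the next iterate from the state, psi folds the new evaluation into the state.
phi <- function(S) S$x_best + 0.5 * rnorm(length(S$x_best))       # next iterate, eq. (2)
psi <- function(S, x, fx) {                                       # state update, eq. (3)
  if (fx < S$f_best) list(x_best = x, f_best = fx) else S
}

f <- function(x) sum(x^2)                              # objective to minimize
S <- list(x_best = rep(4, 3), f_best = f(rep(4, 3)))   # initial state
for (t in 1:200) {
  x <- phi(S)
  S <- psi(S, x, f(x))
}
S$f_best
```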


Content of the talk

A pattern search method: the Nelder-Mead algorithm

A stochastic optimizer: CMA-ES


The Nelder-Mead algorithm

1965. A great compromise between simplicity and efficiency: [14] is cited about 27,000 times on Google Scholar.

It belongs to the class of pattern search methods [2], i.e., methods that propose new points to evaluate based on the deformation of a geometrical pattern.

Sometimes also called "simplex" (not to be mistaken with the simplex of linear programming) because the pattern is made of a simplex: d + 1 points in a d-dimensional space.

[Figure: a 2-D simplex with vertices x0, x1, x2.]


Simplex properties (1)

[Figure: a 2-D simplex with vertices x0, x1, x2.]

Notation: f(x0) ≤ f(x1) ≤ ... ≤ f(xd).

(x1 − x0, x2 − x0) is a basis of R^d.

(x1 − x0, x2 − x0, −(x1 − x0), −(x2 − x0)) is a positive spanning set (−∇f(x0) can be approximated as a positive linear combination of the vectors, a condition for convergence with shrink steps, see later).

d + 1 is the minimum number of points to construct a linear interpolation.


Simplex properties (2)

d + 1 is the minimum number of points to construct a linear interpolation:

f(x^i) = α_0 + α_1 x^i_1 + ... + α_d x^i_d , i = 0, ..., d

$$
\begin{pmatrix}
1 & x^0_1 & \dots & x^0_d \\
1 & x^1_1 & \dots & x^1_d \\
 & & \dots & \\
1 & x^d_1 & \dots & x^d_d
\end{pmatrix}
\begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_d \end{pmatrix}
=
\begin{pmatrix} f(x^0) \\ f(x^1) \\ \vdots \\ f(x^d) \end{pmatrix}
$$

$$
\begin{pmatrix}
1 & x^0_1 & \dots & x^0_d \\
0 & x^1_1 - x^0_1 & \dots & x^1_d - x^0_d \\
 & & \dots & \\
0 & x^d_1 - x^0_1 & \dots & x^d_d - x^0_d
\end{pmatrix}
\begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_d \end{pmatrix}
=
\begin{pmatrix} f(x^0) \\ f(x^1) - f(x^0) \\ \vdots \\ f(x^d) - f(x^0) \end{pmatrix}
$$

which has a unique solution iff the lower right sub-matrix L is invertible, or better, well-conditioned.


Simplex properties (3)

The diameter of a simplex {x^0, ..., x^d} is its largest edge,

diam({x^0, ..., x^d}) = max_{0 ≤ i < j ≤ d} ‖x^i − x^j‖

A measure of how well the edges of a simplex span R^d is its normalized volume (noting s ≡ diam({x^0, ..., x^d})),

vol({x^0, ..., x^d}) = vol(x^0/s, ..., x^d/s) = det(L) / (d! s^d),

which is a scale-invariant measure of the volume of the simplex.
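As a small illustration (our sketch, not from the slides), these two quantities can be computed in R for simplex vertices stored as the rows of a matrix:

```r
# Diameter and normalized volume of a simplex whose d+1 vertices
# are the rows of the (d+1) x d matrix X.
simplex_diam <- function(X) {
  max(dist(X))                                    # largest pairwise edge length
}

simplex_vol <- function(X) {
  d <- ncol(X)
  s <- simplex_diam(X)
  L <- sweep(X[-1, , drop = FALSE], 2, X[1, ])    # edge vectors x^i - x^0
  abs(det(L)) / (factorial(d) * s^d)              # det(L) / (d! s^d)
}

# Example: right simplex in 2-D
X <- rbind(c(0, 0), c(1, 0), c(0, 1))
simplex_diam(X)   # sqrt(2)
simplex_vol(X)    # 0.25
```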


A Nelder-Mead iteration

New points are generated through simple geometrical transformations. Remember f(x0) ≤ f(x1) ≤ ... ≤ f(xd).

[Figure: a 2-D simplex x0, x1, x2 with the centroid xc of the best points and the candidate points xr (reflection), xe (expansion), xco / xci (outside / inside contraction), xs2, xs3 (shrink).]


A Nelder-Mead iteration

Reflection: take a step from the worst point to its symmetric w.r.t. the centroid of the d best points ≈ a descent step, whose size and direction come from the current simplex,

xr = xc + δr (xc − xd) where δr = 1 and xc = (1/d) Σ_{i=0}^{d−1} xi

Calculate f(xr) (cost: 1 call to f).
If f(x0) ≤ f(xr) < f(xd−1), xd ← xr, end iteration.


A Nelder-Mead iteration

Expansion: if f(xr) < f(x0), progress was made, increase the step size,

xe = xc + δe (xc − xd) where δe = 2

If f(xe) ≤ f(xr), xd ← xe, end iteration. Otherwise, xd ← xr, end iteration.


A Nelder-Mead iteration

Contraction: if f(xr) ≥ f(xd−1), little progress, reduce the step size: outside contraction if f(xr) < f(xd), inside contraction if f(xr) ≥ f(xd),

xc{i,o} = xc + δc{i,o} (xc − xd) where δci = −1/2 and δco = 1/2

Outside: if f(xco) ≤ f(xr), xd ← xco; else xd ← xr; end iteration.
Inside: if f(xci) < f(xd), xd ← xci and end iteration; else shrink.


A Nelder-Mead iteration

Shrink: if f(xci) ≥ f(xd), no progress, reduce the simplex size (direction more influenced by x0 and smaller step size),

xs{1,...,d} = x0 + γs (x{1,...,d} − x0) where γs = 1/2

Evaluate f at xs1, ..., xsd (cost: d calls), end iteration.
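To make the four moves concrete, here is a minimal R sketch of one Nelder-Mead iteration following the rules above (the simplex vertices are the rows of X, already ordered by increasing f; the function name nm_iteration and the returned list are illustrative choices, not from the slides):

```r
# One Nelder-Mead iteration. X: (d+1) x d matrix of vertices, fX: their f values,
# both sorted by increasing f. Returns the updated simplex and the number of calls to f.
nm_iteration <- function(X, fX, f, delta_r = 1, delta_e = 2,
                         delta_co = 0.5, delta_ci = -0.5, gamma_s = 0.5) {
  d <- ncol(X); worst <- d + 1
  xc <- colMeans(X[-worst, , drop = FALSE])            # centroid of the d best points
  xr <- xc + delta_r * (xc - X[worst, ]); fr <- f(xr); calls <- 1

  if (fX[1] <= fr && fr < fX[d]) {                     # reflection accepted
    X[worst, ] <- xr; fX[worst] <- fr
  } else if (fr < fX[1]) {                             # expansion
    xe <- xc + delta_e * (xc - X[worst, ]); fe <- f(xe); calls <- calls + 1
    if (fe <= fr) { X[worst, ] <- xe; fX[worst] <- fe } else { X[worst, ] <- xr; fX[worst] <- fr }
  } else if (fr < fX[worst]) {                         # outside contraction
    xco <- xc + delta_co * (xc - X[worst, ]); fco <- f(xco); calls <- calls + 1
    if (fco <= fr) { X[worst, ] <- xco; fX[worst] <- fco } else { X[worst, ] <- xr; fX[worst] <- fr }
  } else {                                             # inside contraction, possibly shrink
    xci <- xc + delta_ci * (xc - X[worst, ]); fci <- f(xci); calls <- calls + 1
    if (fci < fX[worst]) { X[worst, ] <- xci; fX[worst] <- fci }
    else {                                             # shrink towards the best point
      for (i in 2:worst) {
        X[i, ] <- X[1, ] + gamma_s * (X[i, ] - X[1, ])
        fX[i] <- f(X[i, ]); calls <- calls + 1
      }
    }
  }
  ord <- order(fX)                                     # reorder for the next iteration
  list(X = X[ord, , drop = FALSE], fX = fX[ord], calls = calls)
}
```

Wrapped in a loop with the stopping tests of the flow chart on the next slide, this reproduces the basic method.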


Flow chart of the Nelder-Mead algorithm

1. Initialize the simplex, min_diam, max_calls (to f).
2. Do a Nelder-Mead iteration (and reorder the points by increasing f).
3. If diam(simplex) < min_diam or nb calls to f > max_calls, stop; otherwise go back to 2.


Properties of the Nelder-Mead algorithm

An iteration may cost 1 call to f if it ends at reflection, 2 calls (reflection + expansion; reflection + contraction), or d + 2 calls (reflection + contraction + shrink).
No shrink step is ever taken on a convex function (proof in [2] p. 148).
It always converges to a stationary point (where ∇f(x) = 0) when d = 1 and f is continuously differentiable (proof in [8]).
However it may converge to nonstationary points:

Nelder-Mead operates a series of contractions and converges to a nonstationary point on McKinnon functions, which are convex and continuously differentiable (plot from [12]).


Practicalities

Restarts: after each iteration, add a restart test; if vol < ε (say 10^−4), restart the simplex at x0.

Bound constraints handling, x_LB ≤ x ≤ x_UB: the solution used in the nlopt software ([6], an R version exists) is to project points onto the constraints whenever a Nelder-Mead transformation creates out-of-bounds points. This may induce a collapse of the simplex onto a subspace and prevent convergence.

It often works well when d ≤ 10, even on discontinuous, moderately multi-modal, moderately noisy functions. Nelder-Mead seems to contain a minimal set of operations to rapidly adapt to the local curvatures of f.


Converging extensions to the Nelder-Mead algorithm

Do a 180° rotation of the simplex when the reflection does not work ([2, 16]): this ensures that a descent direction can be found if the simplex is small enough (either dᵀ∇f(x0) < 0 or −dᵀ∇f(x0) < 0, unless at a stationary point).

A version with space-filling restarts and bound constraints, [11].

... (393 articles with "Nelder Mead" in the title on Google Scholar).


The CMA-ES algorithm – Introduction (1)

A stochastic way of searching for good points. Replace the previous pair (simplex / simplex transformations) by (multivariate normal distribution / sampling and estimation).

N(m, C) represents our belief about the location of the optimum x*.

[Figure: samples from N(m, C) around the mean m in the (x_1, x_2) plane.]


Multivariate normal

A d-dimensional Gaussian is fully described by its mean and covariance ⇒ the central question is "how to adapt the mean and covariance of N to optimize efficiently?".

Sample generation:

L ← cholesky(C)
U ∼ N(0, I)   {rnorm in R}
x ← m + L × U

because if Cov(U) = I and X = LU, then Cov(X) = L Cov(U) Lᵀ = L Lᵀ = C.

[Figure: two clouds of samples plotted in the (samples[,1], samples[,2]) plane, drawn from two Gaussians with different means m and covariance eigen-decompositions eig(C).]
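A direct R transcription of this sampling recipe (a sketch; the function name rmvnorm_chol is our choice):

```r
# Draw n samples from N(m, C) using the Cholesky factor of C.
rmvnorm_chol <- function(n, m, C) {
  d <- length(m)
  L <- t(chol(C))                       # chol() returns the upper factor, so C = L %*% t(L)
  U <- matrix(rnorm(n * d), nrow = d)   # n columns of independent N(0,1) coordinates
  t(m + L %*% U)                        # each column becomes m + L u; returned as n x d rows
}

# Example: 1000 samples from a correlated 2-D Gaussian
C <- matrix(c(2, 1, 1, 2), 2, 2)
samples <- rmvnorm_chol(1000, m = c(-5, 10), C = C)
plot(samples[, 1], samples[, 2], asp = 1)
```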


The CMA-ES algorithm – Introduction (2)

Originated in the evolutionary computation community with the work of Niko Hansen et al. in 1996 [4, 5], now turning more mathematical (cf. work by Anne Auger [1, 15], Dimo Brockhoff, ...).

A practical tutorial is [3].

The presentation is restricted to the main concepts. To start with, the ES-(1+1) with constant step size.


The constant step size ES-(1+1)

Algorithm 1: constant circular step ES-(1+1)

Require: m ∈ S, σ², t_max
1: calculate f(m), t ← 1
2: while t < t_max do
3:   sample: x ∼ N(m, σ²I)
4:   calculate f(x), t ← t + 1
5:   if f(x) < f(m) then
6:     m ← x
7:   end if
8: end while

The simplest stochastic optimizer: the covariance matrix is C = σ2I.
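A minimal R version of Algorithm 1 (the function name and the returned list are illustrative choices):

```r
# Constant circular step ES-(1+1): at each iteration, one candidate is sampled
# around the current mean and kept only if it improves f.
es_1p1_const <- function(f, m, sigma, tmax) {
  fm <- f(m); t <- 1
  d <- length(m)
  while (t < tmax) {
    x <- m + sigma * rnorm(d)           # x ~ N(m, sigma^2 I)
    fx <- f(x); t <- t + 1
    if (fx < fm) { m <- x; fm <- fx }   # greedy (1+1) selection
  }
  list(m = m, f = fm, calls = t)
}

# Example on the sphere function in dimension 10
sphere <- function(x) sum(x^2)
res <- es_1p1_const(sphere, m = rep(4, 10), sigma = 0.5, tmax = 400)
```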


The choice of σ² is critical

Minimize the sphere function f(x) = ‖x‖², x ∈ R^10, starting from m = (4, ..., 4)ᵀ with the constant circular step size ES-(1+1), 10 repetitions. The plot shows the 25%, 50% and 75% quantiles.
A constant step size may go from too small to too large during convergence.
The optimal step size for the sphere function is (proof in a later slide)

σ* ≈ 1.22 ‖x‖ / d.

[Figure: log10(f) versus number of calls for random search and for the ES-(1+1) with σ = 0.1, 1, 10 and with the optimal step size; quartile bands over 10 runs.]


Multivariate normal facts

In the simple ES-(1+1), the perturbation seen by the current mean m is ‖x − m‖² = σ² Σ_{i=1}^d u_i² where the u_i ∼ N(0, 1) are iid,

‖x − m‖² ∼ σ² χ²_d

As d ↗, χ²_d → N(d, 2d) and √(χ²_d) → N(√d, 1/2) (see graph, d = 2, 10, 100)
⇒ as d ↗, the points are concentrated in a sphere of radius σ√d centered at m.

[Figure: histograms of the norms ‖u‖, u ∼ N(0, I_d), for d = 2, 10, 100.]
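This concentration is easy to check numerically (a small R experiment, not in the original slides):

```r
# Norms of standard Gaussian vectors concentrate around sqrt(d) as d grows.
set.seed(1)
for (d in c(2, 10, 100)) {
  u_norms <- sqrt(colSums(matrix(rnorm(1000 * d), nrow = d)^2))
  cat(sprintf("d = %3d : mean ||u|| = %5.2f (sqrt(d) = %5.2f), sd = %.2f\n",
              d, mean(u_norms), sqrt(d), sd(u_norms)))
}
# The standard deviation stays close to sqrt(1/2) ~ 0.71 while the mean grows like sqrt(d).
```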


Optimal step size of ES-(1+1) with the sphere function (1)

The improvement in terms of f is

I_2 = ‖m‖² − ‖m + σU‖² = mᵀm − (m + σU)ᵀ(m + σU) = −σ² UᵀU − 2σ mᵀU

UᵀU → N(d, 2d) ≈ d as d ↗

I_2 ≈ −σ²d − 2σ mᵀU ,  I_2 ∼ N(−σ²d, 4σ²‖m‖²)

The expected improvement of the ES-(1+1) is

E[max(0, I_2)] = ∫_0^∞ I_2 p(I_2) dI_2 = ∫_{σd/(2‖m‖)}^{+∞} (2σ‖m‖ i − σ²d) φ(i) di

with the normalization i = (I_2 + σ²d)/(2σ‖m‖) and φ() the pdf of i ∼ N(0, 1).


Optimal step size of ES-(1+1) with the sphere function (2)

E[max(0, I_2)] = −σ²d [1 − Φ(σd/(2‖m‖))] + 2σ‖m‖ φ(σd/(2‖m‖))

where Φ() denotes the cdf of N(0, 1). Multiplying by d/‖m‖² and introducing the normalized step σ̄ = σd/‖m‖, the normalized improvement is

(d/‖m‖²) E[max(0, I_2)] = −σ̄² [1 − Φ(σ̄/2)] + 2σ̄ φ(σ̄/2)

which is maximized at σ̄* ≈ 1.22 ⇒ σ* ≈ 1.22 ‖m‖ / d
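The constant 1.22 can be recovered numerically (a quick R check of the formula above):

```r
# Normalized expected improvement of the ES-(1+1) on the sphere,
# as a function of the normalized step size sbar = sigma * d / ||m||.
norm_improvement <- function(sbar) {
  -sbar^2 * (1 - pnorm(sbar / 2)) + 2 * sbar * dnorm(sbar / 2)
}
optimize(norm_improvement, interval = c(0, 5), maximum = TRUE)
# $maximum is approximately 1.22
```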


Optimal step size of ES-(1+1) with a parabolic function

Let us optimize f(x) = (1/2) xᵀHx with an ES-(1+1) whose steps follow x − m ∼ N(0, σ²C).
C can be decomposed into the product of its eigenvectors B and eigenvalues D², C = BD²Bᵀ.
We choose C = H⁻¹ and perform the change of variables y = D⁻¹Bᵀx. Then f(x) = (1/2) xᵀBD⁻¹D⁻¹Bᵀx = (1/2) yᵀy = f(y) ⇒ in the y-space, f is a sphere.

The covariance of Y is C_y = σ²D⁻¹BᵀBD²BᵀBD⁻¹ = σ²I
⇒ sphere function and spherical perturbations, so we can use the previous result:

σ* = 1.22 ‖y‖/d = (1.22/d) √(xᵀBD⁻²Bᵀx) = (1.22/d) √(xᵀHx)

C proportional to the inverse of the Hessian allows one to use the sphere optimality results and is the best choice when there is only local information (∇f(m), ∇²f(m)).

This is what CMA-ES learns.


ES-(µ, λ) flow chart

CMA-ES is an evolution strategy ES-(µ, λ): λ points are sampled, the µ best are used for updating m and C.

Algorithm 2: Generic ES-(µ, λ)

Require: m ∈ S, C, t_max
1: calculate f(m), t ← 1, g ← 1
2: while t < t_max do
3:   sample: x_1, ..., x_λ ∼ N(m^(g), C^(g))
4:   calculate f(x_1), ..., f(x_λ), t ← t + λ
5:   rank: f(x_{1:λ}) ≤ ... ≤ f(x_{λ:λ})
6:   estimate m^(g+1) and C^(g+1) with the µ best, x_{1:λ}, x_{2:λ}, ..., x_{µ:λ}
7:   g ← g + 1
8: end while

Default semi-empirical values for CMA-ES (N. Hansen): λ = 4 + ⌊3 ln(d)⌋, µ = ⌊λ/2⌋.


A simplified CMA-ES

We now present a simplification of the CMA-ES algorithm that is easier to understand while keeping some of the key ideas. It will not compare in performance with the true CMA-ES. Disregarded ingredients are

weighting of the sampled points

the separate adaptation of the step size σ and of the covariance "shape" C (overall covariance σ²C)

simultaneous rank 1 and rank µ updates of the covariance matrix

For competitive implementations, refer to the official tutorial [3].


Points versus steps (1)

Points are generated by sampling, at iteration g, from N(m^(g), C^(g)),

x_i = m^(g) + N_i(0, C^(g)) , i = 1, ..., λ

The associated step is

y_i = x_i − m^(g) = N_i(0, C^(g))

CMA-ES learns steps, as opposed to points. Points are static, steps are dynamic. Efficient optimizers (e.g., Nelder-Mead, BFGS, ...) make steps.


Points versus steps (2)

[Figure: around m^(g) and m^(g+1), the normal law fitted to the good steps, centered at m^(g+1), versus the normal law fitted to the good points, centered at m^(g+1).]

Note how the law of good steps stretches in the good direction, as opposed to the law of good points (plot from [3]).


Update of the mean

Remember the order statistics notation for the new sampled points

f(x_{1:λ}) ≤ ... ≤ f(x_{λ:λ})

The new mean is the average of the µ best points,

m^(g+1) = (1/µ) Σ_{i=1}^{µ} x_{i:λ}


Covariance matrix estimation

Empirical estimate of the good steps,

C^(g+1) = (1/µ) Σ_{i=1}^{µ} (x_{i:λ} − m^(g))(x_{i:λ} − m^(g))ᵀ    (4)
        = (1/µ) Σ_{i=1}^{µ} y_{i:λ} y_{i:λ}ᵀ

Empirical estimate of the good points (don't use),

C_EDA^(g+1) = (1/(µ−1)) Σ_{i=1}^{µ} (x_{i:λ} − m^(g+1))(x_{i:λ} − m^(g+1))ᵀ

where EDA means Estimation of Distribution Algorithm (aka cross-entropy method, EMNA – Estimation of Multivariate Normal Algorithm).

Both estimators are unbiased. They require µ ≥ d linearly independent y's to have full rank and λ ≥ 20d to be reliable for well-conditioned (cond(C) < 10) problems [3].
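In R, the rank-µ estimate (4) of the good steps could be computed as follows (a sketch; X holds the λ sampled points as rows and m_g is the current mean, both illustrative names):

```r
# Rank-mu covariance estimate built from the mu best steps, equation (4).
cov_good_steps <- function(X, fX, m_g, mu) {
  best <- order(fX)[1:mu]                       # indices of the mu best points
  Y <- sweep(X[best, , drop = FALSE], 2, m_g)   # steps y_{i:lambda} = x_{i:lambda} - m^(g)
  crossprod(Y) / mu                             # (1/mu) sum_i y_i y_i^T
}
```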


Rank 1 and time averaging for covariance matrix estimation (1)

If f is ill-conditioned (large ratio of largest to smallest curvatures), the estimator (4) requires a large µ (hence λ), which makes the optimizer costly.
Solution: account for information coming from past iterations by time averaging, which allows few samples and rank 1 updates.

Use the average good step,

ȳ = (1/µ) Σ_{i=1}^{µ} (x_{i:λ} − m^(g)) = m^(g+1) − m^(g)    (5)


Rank 1 and time averaging for covariance matrix estimation (2)

and time averaging of the covariance estimates,

C^(g+1) = (1 − c_1) C^(g) + c_1 ȳ ȳᵀ    (6)

where 0 ≤ c_1 ≤ 1.
If c_1 = 1, there is no memory and C^(g+1) has rank 1 (degenerate Gaussian along ȳ).
Default¹: c_1 = 2/((d + 1.3)² + µ).
1/c_1 is the backward time horizon that contributes about 63% of the information [3].

¹ Default for the complete CMA-ES; not sure it applies to this simplified version. Idem with c_c later.


Steps cumulation

The estimation (6) has a flaw: it is not sensitive to the sign of the step, ȳȳᵀ = (−ȳ)(−ȳ)ᵀ. It will not detect oscillations and cannot converge. Worse, it will increase the step size around x*.

[Figure: a sequence of steps oscillating around x*.]

⇒ use cumulated steps (aka the evolution path) instead of ȳ in (6),

p^(g+1) = (1 − c_c) p^(g) + √(c_c (2 − c_c) µ) ȳ    (7)

Default: c_c = (4 + µ/d) / (d + 4 + 2µ/d).

The weighting in (7) is such that if y_{i:λ} ∼ N(0, C) and p^(g) ∼ N(0, C), then p^(g+1) ∼ N(0, C).

Proof: ȳ = (1/µ) Σ_{i=1}^{µ} y_{i:λ} ∼ N(0, (µ/µ²) C) = N(0, C/µ),
p^(g+1) ∼ N(0, (1 − c_c)² C) + N(0, c_c (2 − c_c) µ C/µ) ∼ N(0, C). □
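Putting the pieces together, here is an R sketch of the simplified CMA-ES described above (mean update, average good step (5), evolution path (7), rank-1 time-averaged covariance update (6)). It is meant to illustrate the flow, not to compete with the reference implementation [3]; the function name, the stopping rule and the reuse of the default constants are our assumptions:

```r
# Simplified CMA-ES: sample lambda points, keep the mu best, update the mean,
# cumulate steps in an evolution path p, and update C by a rank-1 time average.
simplified_cmaes <- function(f, m, C, tmax,
                             lambda = 4 + floor(3 * log(length(m))),
                             mu = floor(lambda / 2)) {
  d  <- length(m)
  c1 <- 2 / ((d + 1.3)^2 + mu)                 # learning rate of the rank-1 update
  cc <- (4 + mu / d) / (d + 4 + 2 * mu / d)    # cumulation constant
  p  <- rep(0, d)                              # evolution path
  neval <- 0                                   # number of calls to f (t in Algorithm 2)
  while (neval < tmax) {
    L  <- t(chol(C))                           # sample lambda points from N(m, C)
    X  <- t(m + L %*% matrix(rnorm(d * lambda), nrow = d))
    fX <- apply(X, 1, f); neval <- neval + lambda
    best  <- order(fX)[1:mu]
    m_new <- colMeans(X[best, , drop = FALSE])            # mean of the mu best points
    ybar  <- m_new - m                                     # average good step, eq. (5)
    p <- (1 - cc) * p + sqrt(cc * (2 - cc) * mu) * ybar    # evolution path, eq. (7)
    C <- (1 - c1) * C + c1 * tcrossprod(p)                 # rank-1 update, eq. (6) with p
    m <- m_new
  }
  list(m = m, f = f(m), C = C)
}

# Example usage on an ill-conditioned quadratic in dimension 5
H <- diag(10^(0:4))
quad <- function(x) 0.5 * sum(x * (H %*% x))
res <- simplified_cmaes(quad, m = rep(3, 5), C = diag(5), tmax = 2000)
```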


Practicalities

Accounting for the variable bounds:

repeat
  sample: x ∼ N(m, C)
until x_LB ≤ x ≤ x_UB

Another stopping criterion: a maximum number of f calculations without progress (to avoid stagnation).
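The rejection loop could look like this in R (a sketch reusing the rmvnorm_chol sampler defined earlier; the cap on the number of tries and the clipping fallback are our additions):

```r
# Resample until the candidate falls inside the box [x_lb, x_ub].
sample_in_bounds <- function(m, C, x_lb, x_ub, max_tries = 1000) {
  for (i in 1:max_tries) {
    x <- drop(rmvnorm_chol(1, m, C))    # one candidate from N(m, C)
    if (all(x >= x_lb) && all(x <= x_ub)) return(x)
  }
  pmin(pmax(m, x_lb), x_ub)             # fallback: clip the mean to the box
}
```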


Concluding remarks I

Both algorithms are likely to miss the global optimum: in practice, they are local (although CMA-ES is asymptotically convergent to the global optimum and Nelder-Mead may go over regions of attraction of local optima).

Restart Nelder-Mead and CMA-ES at convergence (possibly force a few early convergences for restarts). Competitive versions of these algorithms use restarts (e.g., [9, 11]).


Concluding remarks II

A tempting idea when f() is costly:

1. Make a design of experiments (X, F) and build a metamodel f̂.
2. Optimize the (almost free) f̂ with your favorite optimizer, x^{t+1} = arg min_{x∈S} f̂(x).
3. Calculate f(x^{t+1}), stop or go back to 1 with (X, F) ∪ (x^{t+1}, f(x^{t+1})).

but convergence may be harmed if f̂(x^{t+1}) ≈ f(x^{t+1}) while x^{t+1} is far from x* ⇒ use a well-designed optimizer for this, e.g., EGO [7], saACM-ES [10] or an EGO-CMA mix [13].

Acknowledgments: I would like to thank Victor Picheny, Stephanie Mahevas and Robert Faivre for their invitation to this MEXICO network school.


References I

[1] Anne Auger and Nikolaus Hansen. Linear convergence of comparison-based step-size adaptive randomized search via stability of Markov chains. SIAM Journal on Optimization, 26(3):1589–1624, 2016.

[2] Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2009.

[3] Nikolaus Hansen. The CMA Evolution Strategy: A Tutorial. ArXiv e-prints, arXiv:1604.00772, 2016.

[4] Nikolaus Hansen and Andreas Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. Pages 312–317, Morgan Kaufmann, 1996.

[5] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, June 2001.

[6] Steven G. Johnson. The NLopt nonlinear-optimization package, 2011.


References II

[7] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

[8] Jeffrey C. Lagarias, James A. Reeds, Margaret H. Wright, and Paul E. Wright. Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM Journal on Optimization, 9(1):112–147, 1998.

[9] Ilya Loshchilov. CMA-ES with restarts for solving CEC 2013 benchmark problems. In 2013 IEEE Congress on Evolutionary Computation, pages 369–376, June 2013.

[10] Ilya Loshchilov, Marc Schoenauer, and Michele Sebag. Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES). In Christian Blum and Enrique Alba, editors, Genetic and Evolutionary Computation Conference (GECCO 2013), pages 439–446, Amsterdam, Netherlands, July 2013.

[11] Marco A. Luersen and Rodolphe Le Riche. Globalized Nelder-Mead method for engineering optimization. Computers & Structures, 82(23):2251–2260, 2004.


References III

[12] Ken I. M. McKinnon. Convergence of the Nelder-Mead simplex method to a nonstationary point. SIAM Journal on Optimization, 9(1):148–158, 1998.

[13] Hossein Mohammadi, Rodolphe Le Riche, and Eric Touboul. Making EGO and CMA-ES complementary for global optimization. In Learning and Intelligent Optimization (LION 9, Lille, France, January 12-15, 2015, Revised Selected Papers), volume 8994, pages 287–292, May 2015.

[14] John A. Nelder and Roger Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.

[15] Yann Ollivier, Ludovic Arnold, Anne Auger, and Nikolaus Hansen. Information-geometric optimization algorithms: A unifying picture via invariance principles. Journal of Machine Learning Research, 2017.

[16] P. Tseng. Fortified-descent simplicial search method: A general approach. SIAM Journal on Optimization, 10:269–288, 1999.
