8. Estimation of Distribution Algorithms. Continuous Domain.
Petr Pošík
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Soft Computing, © 2007
Contents
Last week. . .
Features of continuous spaces
Real-valued EDAs
Optimization using Gaussian distribution
Last week. . .
Intro to EDAs
Black-box optimization
GA vs. EDA
✔ GA approach: select — crossover — mutate
✔ EDA approach: select — model — sample
EDA with binary representation
✔ the best possible (general, flexible) model: joint probability
✘ determine the probability of each possible combination of bits
✘ 2^D − 1 parameters, exponential complexity
✔ less precise (less flexible), but simpler probabilistic models (see the sketch below)
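To make the select — model — sample loop concrete, here is a minimal Python sketch with the simplest binary model, independent per-bit probabilities (UMDA-style); the function and parameter values are illustrative, not taken from the lecture:

    import numpy as np

    def binary_umda(f, dim, pop_size=100, tau=0.5, generations=50, seed=0):
        # Select - model - sample with independent per-bit probabilities p(x_d = 1)
        # instead of the full (exponentially large) joint distribution.
        rng = np.random.default_rng(seed)
        p = np.full(dim, 0.5)                                    # initial marginal probabilities
        for _ in range(generations):
            pop = (rng.random((pop_size, dim)) < p).astype(int)  # sample a new population
            best = np.argsort([-f(x) for x in pop])[:int(tau * pop_size)]  # select (maximization)
            p = pop[best].mean(axis=0)                           # model: refit the marginals
            p = np.clip(p, 1.0 / dim, 1.0 - 1.0 / dim)           # keep bits from fixating too early
        return (p > 0.5).astype(int)

    # OneMax example: the marginals typically converge towards the all-ones string.
    print(binary_umda(lambda x: x.sum(), dim=20))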
Content of the lectures
Binary EDAs
✔ Without interactions
✘ 1-dimensional marginal probabilities p(X = x)
✘ PBIL, UMDA, cGA
✔ Pairwise interactions
✘ conditional probabilities p(X = x|Y = y)
✘ sequences (MIMIC), trees (COMIT), forests (BMDA)
✔ Multivariate interactions
✘ conditional probabilities p(X = x|Y = y, Z = z, . . .)
✘ Bayesian networks (BOA, EBNA, LFDA)
Continuous EDAs
✔ Histograms
✔ Gaussian distribution
✔ Evolutionary strategies, CMA-ES
Features of continuous spaces
The difference of binary and real space
Binary space
✔ Each possible solution is placed in one of the corners of a D-dimensional hypercube
✔ No values lying between them
✔ Finite number of elements
[Figure: the 4-bit strings shown as the 16 corners of a 4-dimensional hypercube]
Real space
✔ The space in each dimension need not be bounded
✔ Even when bounded by a hypercube, there are infinitely many points between the bounds (theoretically; in practice we are limited by the numerical precision of the given machine)
✔ Infinitely many (even uncountably many) candidate solutions
Local neighborhood
How do you define a local neighborhood?
✔ . . . as a set of points whose distance to a reference point is not larger than a threshold?
✘ The volume of such a neighborhood relative to the volume of the whole space drops exponentially
✘ With increasing dimensionality the neighborhood becomes increasingly more local
✔ . . . as a set of points closest to the reference point whose union covers a part of the search space of a certain (constant) size?
✘ The size of such a neighborhood rises with the dimensionality of the search space
✘ With increasing dimensionality of the search space the neighborhood is increasingly less local
Curse of dimensionality!
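A quick numeric illustration of the first definition (my addition, not from the slides): the volume of a ball of fixed radius, relative to the hypercube that encloses it, shrinks essentially exponentially with the dimension D:

    import math

    def ball_to_cube_volume_ratio(D, r=1.0):
        # Volume of the D-ball of radius r divided by the volume of the enclosing
        # hypercube of side 2r; the ratio is independent of r.
        log_ball = (D / 2) * math.log(math.pi) + D * math.log(r) - math.lgamma(D / 2 + 1)
        log_cube = D * math.log(2 * r)
        return math.exp(log_ball - log_cube)

    for D in (1, 2, 3, 5, 10, 20):
        print(D, ball_to_cube_volume_ratio(D))
    # Drops from 1.0 (D = 1) through ~0.0025 (D = 10) to ~2.5e-8 (D = 20):
    # a fixed-radius neighborhood covers a vanishing fraction of the space.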
Real-valued EDAs
By analogy to discrete EDAs. . .
Without interactions
✔ UMDA: the same principle, only the marginal probability model is of a different type
✔ Univariate histograms?
✔ Univariate Gaussian distribution?
✔ Univariate mixture of Gaussians?
Pairwise and higher-order interactions:
✔ Many different types of interactions!
✔ A model which would describe all possible kinds of interaction is virtually impossible to find!
No Interactions Among Variables
UMDA: EDA with the marginal product model:
p(x) = ∏_{d=1}^{D} p(x_d)    (1)
The following univariate models were compared:
✔ Equi-width histogram
✔ Equi-height histogram
✔ Max-diff histogram
✔ Univariate mixture of Gaussians
Equi-width histogram features:
✔ the most straightforward analogy with discrete histograms
✔ if any bin is empty, there is no way to create a new individual in that bin
[Figure: equi-width histogram fitted to a sample]
Equi-height histogram features:
✔ instead of fixing the bin width, fix the number of points in each bin
✔ no empty bins, always possible to generate any point in the hyperrectangle
[Figure: equi-height histogram fitted to a sample]
Max-diff histogram features:
✔ place the bin boundaries in the largest gaps between the points
✔ no empty bins, always possible to generate any point in the hyperrectangle
[Figure: max-diff histogram fitted to a sample]
Univariate mixture of Gaussians features:
✔ built by the EM algorithm (a probabilistic version of k-means clustering)
✔ more suitable for unbounded spaces
[Figure: univariate mixture of Gaussians fitted to a sample]
The winner of the comparison: the equi-height histogram
✔ precise
✔ non-parametric
[Figure: equi-height histogram fitted to a sample]
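A possible sketch (illustrative, not the lecture's implementation) of fitting and sampling the winning equi-height histogram, one marginal per dimension as in Eq. (1):

    import numpy as np

    def equi_height_edges(points, n_bins):
        # Bin edges chosen so that each bin contains (roughly) the same number of points.
        return np.quantile(points, np.linspace(0.0, 1.0, n_bins + 1))

    def sample_equi_height(edges, n_samples, rng):
        # Pick a bin uniformly at random (every bin carries the same probability mass),
        # then sample uniformly inside it.
        bins = rng.integers(0, len(edges) - 1, size=n_samples)
        return rng.uniform(edges[bins], edges[bins + 1])

    # UMDA-style use: fit every coordinate independently, then sample the product model.
    rng = np.random.default_rng(0)
    selected = rng.normal(loc=5.0, scale=1.0, size=(100, 3))   # selected individuals (N x D)
    edges = [equi_height_edges(selected[:, d], n_bins=10) for d in range(selected.shape[1])]
    offspring = np.column_stack([sample_equi_height(e, 200, rng) for e in edges])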
Results: Two Peaks Function
✔ optimum in (1, 1, . . . , 1)
✔ 2^D local optima
[Figure: the Two Peaks function]
Evolution of bin boundaries (component centers for MOG):
[Figures: evolution of bin boundaries for the HEH (equi-height) and HMD (max-diff) models, and of component centers for the MOG model, over 100 generations]
Histogram UMDA: Summary
Suitable when
✔ the search space is bounded by a hyperrectangle
✔ there are no strong interactions among variables
Possible extension:
✔ Rotation of the coordinate system → UMDA is then able to work with linear interactions
Optimization using Gaussian distribution
Case study: Quadratic function
Consider a simple EDA with the following settings:
✔ Truncation selection: use the τ · N best individuals to build the model
✔ Gaussian distribution: fit the Gaussian using the maximum likelihood (ML) estimate
Two situations:
Population centered around the optimum (population in the valley):
[Figure: selected individuals and the ML-fitted Gaussian for a population in the valley, τ = 0.8]
Population far away from the optimum (population on the slope):
[Figure: selected individuals and the ML-fitted Gaussian for a population on the slope, τ = 0.2]
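A minimal sketch of the EDA just described, run on a 1-D quadratic function; the population size, τ and starting points are illustrative choices, not the lecture's settings:

    import numpy as np

    def gaussian_eda(f, mu0, sigma0=1.0, pop_size=100, tau=0.3, generations=50, seed=0):
        # Sample from N(mu, sigma^2), truncation-select the tau*N best (minimization),
        # and refit mu and sigma by maximum likelihood (mean and std of the selected set).
        rng = np.random.default_rng(seed)
        mu, sigma = mu0, sigma0
        for _ in range(generations):
            pop = rng.normal(mu, sigma, size=pop_size)
            selected = pop[np.argsort(f(pop))[:int(tau * pop_size)]]
            mu, sigma = selected.mean(), selected.std()
        return mu, sigma

    # Population in the valley: the ML fit works and mu ends up close to the optimum at 0.
    print(gaussian_eda(lambda x: x**2, mu0=0.0))
    # Population on the slope: sigma shrinks every generation and mu stalls
    # roughly 2.4 units from the start, far from the optimum (premature convergence).
    print(gaussian_eda(lambda x: x**2, mu0=-100.0))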
What happens on the slope?
The change of the population statistics in 1 generation:
Expected value:
µ_{t+1} = E(X | X > x_min) = µ_t + σ_t · d(τ),   where   d(τ) = φ(Φ⁻¹(τ)) / τ
Variance:
(σ_{t+1})² = Var(X | X > x_min) = (σ_t)² · c(τ),   where   c(τ) = 1 + Φ⁻¹(1 − τ) · φ(Φ⁻¹(τ)) / τ − d(τ)²
[Figures: d(τ) and c(τ) as functions of τ, on the slope and in the valley]
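The factors d(τ) and c(τ) are easy to evaluate and to verify by simulation; a small sketch (assuming NumPy and SciPy are available):

    import numpy as np
    from scipy.stats import norm

    def d(tau):
        # Expected standardized shift of the mean under truncation selection
        # of the best tau fraction (population on a slope).
        return norm.pdf(norm.ppf(tau)) / tau

    def c(tau):
        # Factor by which the variance shrinks in one generation on the slope.
        return 1.0 + norm.ppf(1.0 - tau) * norm.pdf(norm.ppf(tau)) / tau - d(tau) ** 2

    # Monte Carlo check: keep the largest tau*N samples of a standard normal.
    tau, rng = 0.3, np.random.default_rng(0)
    x = rng.standard_normal(1_000_000)
    sel = np.sort(x)[-int(tau * len(x)):]
    print(d(tau), sel.mean())   # both approximately 1.16
    print(c(tau), sel.var())    # both approximately 0.26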
What happens on the slope (cont.)
The population statistics in generation t:
µ_t = µ_0 + σ_0 · d(τ) · Σ_{i=1}^{t} (√c(τ))^{i−1}    (a geometric series)
σ_t = σ_0 · (√c(τ))^t
Convergence of the population statistics:
lim_{t→∞} µ_t = µ_0 + σ_0 · d(τ) / (1 − √c(τ))
lim_{t→∞} σ_t = 0
The distance the population can “travel” with this algorithm is bounded!
Premature convergence!
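Iterating the two recursions numerically (an illustrative check, assuming SciPy) reproduces the bound: the mean stops at the geometric-series limit while σ_t goes to zero:

    import numpy as np
    from scipy.stats import norm

    def mean_travel(mu0, sigma0, tau, generations):
        # Iterate mu_{t+1} = mu_t + sigma_t * d(tau), sigma_{t+1} = sigma_t * sqrt(c(tau))
        # and compare with the closed-form limit of the geometric series.
        d = norm.pdf(norm.ppf(tau)) / tau
        c = 1.0 + norm.ppf(1.0 - tau) * norm.pdf(norm.ppf(tau)) / tau - d ** 2
        mu, sigma = mu0, sigma0
        for _ in range(generations):
            mu, sigma = mu + sigma * d, sigma * np.sqrt(c)
        return mu, mu0 + sigma0 * d / (1.0 - np.sqrt(c))

    print(mean_travel(mu0=0.0, sigma0=1.0, tau=0.3, generations=1000))
    # Both numbers are about 2.39: after 1000 generations the mean has effectively stopped.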
Preventing premature convergence
Artificially enlarge the ML estimate of variance:
✔ keep the variance above a specified threshold (see the sketch after this list)
✔ use self-adaptation (let the variance be part of the chromosome)
✔ adaptive variance scaling when the population is on the slope, ML estimate of the variance when the population is in the valley
✔ . . .
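A minimal sketch of the first remedy, a fixed lower bound on the ML variance estimate; the threshold value and names below are illustrative:

    import numpy as np

    SIGMA_MIN = 0.1   # illustrative lower bound on the standard deviation

    def fit_gaussian_with_floor(selected):
        # ML estimates of mu and sigma, but sigma is kept above a threshold so the
        # model cannot collapse while the population is still travelling along a slope.
        mu = selected.mean()
        sigma = max(selected.std(), SIGMA_MIN)
        return mu, sigma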
Conclusions:
✔ Maximum likelihood estimates are suitable in situations where the model fits the fitness function well (at least in a local neighborhood)
✘ the Gaussian distribution is suitable in the neighborhood of the optimum
✘ the Gaussian distribution is not suitable on the slope of the fitness function!
Evolutionary Strategies
✔ (µ, λ)-ES or (µ + λ)-ES (µ parents, λ offspring)
✘ (µ, λ)-ES: offspring completely replace parents
✘ (µ + λ)-ES: parents are joined with offspring, both fight for survival
✔ offspring individuals are created by mutation as
x′ = x + N_D(0, σ),
where x is the parent, x′ is the offspring, and N_D(0, σ) is a D-dimensional isotropic Gaussian distribution
Increasing the flexibility of ES: σ is not constant during the ES run
✔ decrease σ deterministically
✔ feedback control of σ (the 1/5 success rule)
✔ self-adaptation of σ
✘ σ is part of the chromosome:
σ′ = σ · exp(N_1(0, ∆σ))
x′ = x + N_D(0, σ′)
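A minimal (µ, λ)-ES sketch with a self-adapted global step size, as described above; the learning rate and parameter values are illustrative, not the lecture's:

    import numpy as np

    def comma_es(f, dim, mu=5, lam=35, generations=200, seed=0):
        # Each individual carries its own sigma; sigma is mutated log-normally first
        # and then used to mutate the coordinates. Comma selection: offspring only.
        rng = np.random.default_rng(seed)
        tau = 1.0 / np.sqrt(2.0 * dim)                 # common learning-rate heuristic
        xs, sigmas = rng.normal(size=(mu, dim)), np.ones(mu)
        for _ in range(generations):
            parents = rng.integers(0, mu, size=lam)
            new_sigmas = sigmas[parents] * np.exp(tau * rng.standard_normal(lam))
            offspring = xs[parents] + new_sigmas[:, None] * rng.standard_normal((lam, dim))
            best = np.argsort([f(x) for x in offspring])[:mu]
            xs, sigmas = offspring[best], new_sigmas[best]
        return xs[np.argmin([f(x) for x in xs])]

    print(comma_es(lambda x: np.sum(x**2), dim=5))   # moves close to the optimum at the origin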
ES: Increasing model flexibility
Use a different σ in each dimension
✔ Gaussian with a diagonal covariance matrix
σ = (σ_1, σ_2, . . . , σ_D)
x′ = x + N_D(0, I · σ)
✔ Gaussian with a full covariance matrix
x′ = x + N_D(0, C)
✔ Self-adaptation of σ and C
✔ Changes in the covariance structure are still very random!
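A short illustration (illustrative values) of the two mutation variants, a per-coordinate σ versus a full covariance matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.zeros(3)                                    # parent

    # Diagonal covariance: an independent step size per coordinate.
    sigma = np.array([0.1, 1.0, 10.0])
    x_diag = x + sigma * rng.standard_normal(3)

    # Full covariance: correlated mutations, here elongated along the x0 = x1 direction.
    C = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 0.1]])
    x_full = x + rng.multivariate_normal(np.zeros(3), C)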
CMA-ES
✔ Derandomized evolutionary strategy
✔ (1, λ)-ES with covariance matrix adaptation:
1. Generate λ offspring:
x_t = µ_t + σ_t · N_D(0, C_t),   where µ_t ∈ R^D, σ_t ∈ R^+, C_t ∈ R^{D×D}
2. Based on the offspring, adapt the model parameters:
Adapt the Gaussian center µ and the covariance matrix C using maximum likelihood estimation:
µ_{t+1} = argmax_µ P(x_sel | µ) = x̄_sel   (the mean of the selected offspring)
C_{t+1} = argmax_C P((x_sel − µ_t) / σ_t | C) = Cov((x_sel − µ_t) / σ_t)
Adapt the global step size σ so that two consecutive steps, µ_t → µ_{t+1} and µ_{t+1} → µ_{t+2}, are conjugate, i.e. conceptually
(µ_{t+2} − µ_{t+1})ᵀ · C⁻¹ · (µ_{t+1} − µ_t) / (σ_{t+1})² = 0
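A didactic sketch of step 2 only, the ML re-estimation of µ and C from the selected offspring. This is not the actual CMA-ES update (it has no evolution paths, no rank-one update, no weighted recombination and no step-size control), just the idea sketched above:

    import numpy as np

    def simplified_cma_step(f, mu_t, sigma_t, C_t, lam=20, sel=5, rng=None):
        # Sample lambda offspring, select the best, and re-estimate the center and the
        # covariance of the selected (standardized) steps by maximum likelihood.
        rng = rng or np.random.default_rng()
        y = rng.multivariate_normal(np.zeros(len(mu_t)), C_t, size=lam)
        offspring = mu_t + sigma_t * y
        best = np.argsort([f(x) for x in offspring])[:sel]
        mu_next = offspring[best].mean(axis=0)                       # ML estimate of the center
        steps = (offspring[best] - mu_t) / sigma_t
        C_next = np.cov(steps, rowvar=False, bias=True)              # ML estimate of C
        return mu_next, sigma_t, C_next

    # Without the real step-size control this simplified update still shrinks its model
    # on a slope, just like the plain Gaussian EDA above; CMA-ES fixes that part.
    mu, sigma, C = np.array([3.0, 2.0]), 1.0, np.eye(2)
    for _ in range(10):
        mu, sigma, C = simplified_cma_step(lambda z: z @ z, mu, sigma, C)
    print(mu)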
CMA-ES: Summary
✔ CMA-ES has its roots in ES, but it can be seen as an instance of EDA (it learns a probabilistic model)
✔ behaves like a local optimizer, but often does not get stuck in a local optimum
✔ state-of-the-art in real-valued black-box optimization; its advantages are significant already in spaces of 5–10 dimensions
✔ it has been used to solve many real-world problems (tuning of electronic filters, aerodynamic and hydrodynamic design, etc.)
Real-valued EDAs: Summary
✔ Much less developed than EDAs for binary representation
✔ The difficulties are caused mainly by
✘ much more severe effects of the curse of dimensionality
✘ many different types of interactions among variables
✔ Despite that, EDAs (and EAs in general) are able to achieve better results than conventional optimization techniques (line search, Nelder–Mead search, . . . )