asynchronous parallel stochastic global optimization using …dme65/data/informs_2017.pdf ·...

Global optimizationSurrogate optimization

AsynchronyNumerical experiments

Summary

Asynchronous Parallel Stochastic GlobalOptimization using Radial Basis Functions

David Eriksson

Center for Applied MathematicsCornell University

dme65@cornell.edu

October 24, 2017

Joint work with David Bindel and Christine Shoemaker

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 1/18

Summary

Global optimization problem (GOP)

Find x∗ ∈ Ω such that f(x∗) ≤ f(x), ∀x ∈ Ω

f : Ω→ R continuous, computationally expensive, and black-box

Ω ⊂ Rd is a hypercube

Evaluating the model may take several hours or days

Common examples are PDE models describing physical processes

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 2/18

Summary

Difficulty with popular approaches for global optimization

(Multi-start) Gradient based optimizers:

Examples: Gradient descent, quasi-Newton methodsProblem: Hard to obtain (accurate) derivatives, multi-modalityTricky to choose step size for finite differencesFinite differences are expensive in higher dimensions

(Multi-start) Derivative-free methods:

Examples: Nelder-Mead, pattern searchProblem: Slow convergence, multi-modality, ignores smoothness

Heuristic methods:

Examples: Genetic algorithm, simulated annealingProblem: Require a large number of evaluations

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 3/18

Summary

Surrogate optimization

Use a surrogate f (- - -) to approximate f (——–)

The surrogate enables cheap function value predictions

Main idea: Solve auxiliary problem, evaluate, fit surrogate, repeat

Figure: ( ) Evaluated points, () next evaluation.

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 4/18

Summary

Exploration vs exploitation

A successful method needs to balance exploration and exploitation

Exploration: Evaluate in unexplored regions

Exploitation: Improve good solutions

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 5/18

Summary

Radial basis function interpolation

sf,X(x) =

n∑j=1

λjϕ(‖x− xj‖) + p(x)

p(x) =∑mj=1 cjπj(x) a polynomial of degree k

Interpolation constraints:

s(xj) = f(xj), j = 1, . . . , n

Discrete orthogonality:n∑j=1

λjq(xj) = 0, ∀poly q of deg ≤ k

Need to solve [Φ PPT 0

] [λc

]where Φij = ϕ(‖xi − xj‖), Pij = πj(xi)

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 6/18

Summary

Radial basis function interpolation

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 7/18

Summary

Stochastic Radial Basis Function (SRBF) method

Uses a radial basis function to approximate objective

Generate a set of candidate points Λ

Each candidate point is a N (0, σ2) perturbation of best solution

Sampling radius σ is adjusted based on progress

Auxiliary problem:

minx∈Λ

λ s(x)−miny∈Λ

maxy∈Λ

s(y)−miny∈Λ

s(y)+ (1− λ)

maxy∈Λ

dX(y)− dX(x)

maxy∈Λ

dX(y)−miny∈Λ

where dX(y) = min

x∈X‖x− y‖, λ ∈ [0, 1].

(λ = 0) favors large minimum distance to evaluated points

(λ = 1) favors small function value prediction

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 8/18

Summary

Parallelism

Running function evaluations in parallel:

1 Batch synchronous parallel

2 Asynchronous parallel

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 9/18

Summary

Parallelism

Synchronous parallel assumes:1 Computational resources are homogeneous2 Evaluation time independent of input

Examples of heterogeneous resources:

Mixture of CPU/GPUClouds (e.g., ”stragglers” in MapReduce)

Examples of input dependent evaluation time:

Adaptive meshesIterative solver (Krylov, bisection, etc.)Early termination

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 10/18

Summary

POAP and pySOT

POAP (Plumbing for Optimization with Asynchronous Parallelism)

Available at: https://github.com/dbindel/POAP

Framework for building asynchronous optimization strategies

pySOT (Python Surrogate Optimization Toolbox)

Available at: https://github.com/dme65/pySOT

Surrogate optimization strategies implemented in POAP

A great test-suite for doing head-to-head comparisons

Has been cited in work on:

Groundwater flow calibration for the Umatilla Chemical DepotCalibration of a geothermal reservoir modelHyper-parameter optimization of deep neural networks

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 11/18

Summary

Questions to answer

1 How do we choose between asynchrony and synchrony?

2 What is the tradeoff between information and idle time?

3 What is the effect of parallelism?

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 12/18

Summary

Experimental setup for test problems

Use SRBF with 1, 4, 8, 16, and 32 workers

10-dimensional F15-F24 from the BBOB test suite

Draw eval time from Pareto distribution: fX(x) = αx1+α1[1,∞)(x)

Vary α ∈ 102, 12, 2.84 to achieve different tail behaviors

Corresponds to standard deviations 0.01, 0.1, and 1

1 1.5 2 2.5 3 3.5 40

Pareto PDF

= 2.84

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 13/18

Summary

Progress comparison for F18

0 10 20 30 40 50

1.4e+00

8.1e+00

4.7e+01

2.7e+02

1.5e+03

Absolu

F18, Pareto-102

Serial Sync4 Async4 Sync8 Async8 Sync16 Async16 Sync32 Async32

0 10 20 30 40 50

1.4e+00

8.1e+00

4.7e+01

2.7e+02

1.5e+03

Absolu

F18, Pareto-12

0 10 20 30 40 50

1.4e+00

8.1e+00

4.7e+01

2.7e+02

1.5e+03

Absolu

F18, Pareto-2.84

600 800 1000 1200 1400 1600

Evaluations

Absolu

F18, Pareto-102

600 800 1000 1200 1400 1600

Evaluations

Absolu

F18, Pareto-12

600 800 1000 1200 1400 1600

Evaluations

Absolu

F18, Pareto-2.84

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 14/18

Summary

Relative speedup for F18

Relative speedup: S(p) = Execution time for serial algorithmExecution time for parallel algorithm with p processors

Computed over intersection of ranges from all runs

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 15/18

Summary

Progress comparison for unimodal function

Consider the sphere function: f(x) =

30∑j=1

0 10 20 30 40

1.3e-01

8.8e-01

5.9e+00

4.0e+01

2.7e+02

Absolu

Sphere, Pareto-102

Serial Sync4 Async4 Sync8 Async8 Sync16 Async16 Sync32 Async32

0 10 20 30 40

1.3e-01

8.8e-01

5.9e+00

4.0e+01

2.7e+02

Absolu

Sphere, Pareto-12

0 10 20 30 40

1.3e-01

8.8e-01

5.9e+00

4.0e+01

2.7e+02

Absolu

Sphere, Pareto-2.84

0 250 500 750 960

Evaluations

2.0e-03

3.8e-02

7.5e-01

1.5e+01

2.8e+02

Absolu

Sphere, Pareto-102

0 250 500 750 960

Evaluations

2.0e-03

3.8e-02

7.4e-01

1.4e+01

2.8e+02

Absolu

Sphere, Pareto-12

0 250 500 750 960

Evaluations

2.0e-03

3.9e-02

7.5e-01

1.5e+01

2.9e+02

Absolu

Sphere, Pareto-2.84

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 16/18

Summary

Answers to questions

1 How do we choose between asynchrony and synchrony?

Asynchrony is the best choice on multimodal problemsBest on all problems in large variance caseIn small variance case asynchrony better vs time on

7/10 problems with 4 processors6/10 problems with 8 processors5/10 problems with 16 processors5/10 problems with 32 processors

2 What is the tradeoff between information and idle time?

Idle time more important than information for multimodal problemsSerial not necessarily best vs #evals in multimodal caseSerial best vs #evals for unimodal problems

3 What is the effect of parallelism?

Helps with explorationImproves results vs time

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 17/18

Summary

Thank you!

≡ 1 − 2 −−−− 3 −− 4 −−−− 5 − 18/18

asynchronous parallel stochastic global optimization using …dme65/data/informs_2017.pdf ·...

Documents

7. liveness and asynchrony

novel, multidisciplinary global optimization …

rapid visual processing using spike asynchrony

asynchrony in c#5.0

asynchrony among local communities stabilises ecosystem

asynchrony during mechanical ventilation

harmonic, tonal, and formal asynchrony in robert schumann

first-order axioms for asynchrony

machine learning for global optimization learning for global...

launchcodergirl - agile qa at asynchrony

global optimization toolbox

global alignment algorithm optimization

asynchrony in (asp).net aliaksandr

local vs global optimization optimization problems

phenological asynchrony between herbivorous insects and

synchrony and asynchrony in conformance testing

evolution of asynchrony in (asp).net

rhythmic masking release: effects of asynchrony,...

hormonal asynchrony and embryonic development… ·...

asynchrony during mechanical...