
Page 1: Asynchronous Parallel Stochastic Global Optimization using Radial Basis Functions (…dme65/data/informs_2017.pdf · 2018-12-05)

Outline:
1 Global optimization
2 Surrogate optimization
3 Asynchrony
4 Numerical experiments
5 Summary

Asynchronous Parallel Stochastic Global Optimization using Radial Basis Functions

David Eriksson

Center for Applied Mathematics, Cornell University

[email protected]

October 24, 2017

Joint work with David Bindel and Christine Shoemaker


Page 2:

Global optimization problem (GOP)

Find x∗ ∈ Ω such that f(x∗) ≤ f(x), ∀x ∈ Ω

f : Ω → R continuous, computationally expensive, and black-box

Ω ⊂ Rd is a hypercube

Evaluating the model may take several hours or days

Common examples are PDE models describing physical processes
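As a concrete (toy) illustration of this setting: the function, bounds, and names below are hypothetical stand-ins, not the PDE models from the talk. A real f might wrap a solver that runs for hours; the point is only the calling convention the optimizer sees.

```python
import numpy as np

# Hypothetical stand-in for an expensive black-box model f : Omega -> R.
def expensive_black_box(x):
    return float(np.sum((x - 0.3) ** 2))

d = 10                                  # problem dimension
lower, upper = np.zeros(d), np.ones(d)  # Omega = [0, 1]^d, a hypercube

# The optimizer may only query f pointwise: no gradients, no structure.
x = np.random.uniform(lower, upper)
y = expensive_black_box(x)
```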


Page 3:

Difficulty with popular approaches for global optimization

(Multi-start) Gradient-based optimizers:

Examples: gradient descent, quasi-Newton methods
Problem: hard to obtain (accurate) derivatives; multi-modality
Tricky to choose the step size for finite differences
Finite differences are expensive in higher dimensions

(Multi-start) Derivative-free methods:

Examples: Nelder-Mead, pattern search
Problem: slow convergence, multi-modality, ignores smoothness

Heuristic methods:

Examples: genetic algorithms, simulated annealing
Problem: require a large number of evaluations


Page 4:

Surrogate optimization

Use a surrogate s (dashed curve) to approximate f (solid curve)

The surrogate enables cheap function value predictions

Main idea: Solve auxiliary problem, evaluate, fit surrogate, repeat

Figure: evaluated points and the next point to evaluate.


Page 5:

Exploration vs exploitation

A successful method needs to balance exploration and exploitation

Exploration: Evaluate in unexplored regions

Exploitation: Improve good solutions


Page 6:

Radial basis function interpolation

s_{f,X}(x) = ∑_{j=1}^{n} λ_j φ(‖x − x_j‖) + p(x)

p(x) = ∑_{j=1}^{m} c_j π_j(x), a polynomial of degree k

Interpolation constraints: s(x_j) = f(x_j), j = 1, …, n

Discrete orthogonality: ∑_{j=1}^{n} λ_j q(x_j) = 0 for every polynomial q of degree ≤ k

Need to solve

    [ Φ   P ] [ λ ]   [ f_X ]
    [ Pᵀ  0 ] [ c ] = [  0  ]

where Φ_ij = φ(‖x_i − x_j‖) and P_ij = π_j(x_i)
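The interpolation system above can be sketched in a few lines of NumPy. This assumes a cubic kernel φ(r) = r³ with a linear polynomial tail (k = 1), one standard choice rather than the talk's specific configuration, and the function name is hypothetical.

```python
import numpy as np

def rbf_fit(X, fX, phi=lambda r: r ** 3):
    """Fit s(x) = sum_j lambda_j * phi(||x - x_j||) + c_0 + c^T x.

    Cubic kernel, linear tail (degree k = 1).  X is (n, d), fX is (n,).
    """
    n, d = X.shape
    R = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise dists
    Phi = phi(R)                               # Phi_ij = phi(||x_i - x_j||)
    P = np.hstack([np.ones((n, 1)), X])        # P_ij = pi_j(x_i) with pi = [1, x]
    m = d + 1
    A = np.block([[Phi, P], [P.T, np.zeros((m, m))]])
    rhs = np.concatenate([fX, np.zeros(m)])    # [f_X; 0]
    sol = np.linalg.solve(A, rhs)              # [lambda; c]
    lam, c = sol[:n], sol[n:]

    def s(x):
        r = np.linalg.norm(X - x, axis=1)
        return float(phi(r) @ lam + c[0] + c[1:] @ x)

    return s
```

Because of the polynomial tail and the discrete-orthogonality block, this surrogate reproduces any linear function exactly (up to round-off), not just at the interpolation nodes.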


Page 7:

Radial basis function interpolation


Page 8:

Stochastic Radial Basis Function (SRBF) method

Uses a radial basis function to approximate objective

Generate a set of candidate points Λ

Each candidate point is an N(0, σ²) perturbation of the best solution

Sampling radius σ is adjusted based on progress

Auxiliary problem:

min_{x∈Λ}  λ · (s(x) − min_{y∈Λ} s(y)) / (max_{y∈Λ} s(y) − min_{y∈Λ} s(y))
         + (1 − λ) · (max_{y∈Λ} d_X(y) − d_X(x)) / (max_{y∈Λ} d_X(y) − min_{y∈Λ} d_X(y))

where d_X(y) = min_{x∈X} ‖x − y‖ and λ ∈ [0, 1].

(λ = 0) favors a large minimum distance to evaluated points

(λ = 1) favors a small predicted function value
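The candidate-scoring step above can be sketched as follows. This is a minimal NumPy version under the stated definitions; the function name `srbf_select` is hypothetical, not pySOT's API.

```python
import numpy as np

def srbf_select(candidates, surrogate, X, lam):
    """Pick the candidate minimizing the weighted SRBF merit.

    candidates: (N, d) array (the set Lambda); surrogate: callable s(x);
    X: (n, d) previously evaluated points; lam in [0, 1].
    """
    s = np.array([surrogate(x) for x in candidates])
    # d_X(y): distance from each candidate to its nearest evaluated point
    dX = np.min(np.linalg.norm(candidates[:, None, :] - X[None, :, :], axis=2),
                axis=1)

    def unit(v):
        # scale to [0, 1]; guard against a constant vector
        rng = v.max() - v.min()
        return (v - v.min()) / rng if rng > 0 else np.zeros_like(v)

    # lam = 1: pure exploitation (small s); lam = 0: pure exploration (large d_X)
    merit = lam * unit(s) + (1.0 - lam) * unit(dX.max() - dX)
    return candidates[np.argmin(merit)]
```

Note that unit(dX.max() − dX) equals the second normalized term in the merit function, since the minimum of (max d_X − d_X) over the candidates is zero.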


Page 9:

Parallelism

Running function evaluations in parallel:

1 Batch synchronous parallel

2 Asynchronous parallel
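One way to see the difference between the two schemes is a small `concurrent.futures` sketch: an asynchronous loop resubmits the moment any worker finishes, whereas a batch-synchronous loop would wait for the whole batch. The `propose` and `evaluate` functions are hypothetical placeholders, and this is not POAP's implementation.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def evaluate(x):
    # stand-in for an expensive simulation
    return sum(xi ** 2 for xi in x)

def propose(k, d=3):
    # stand-in for the surrogate's candidate-selection step
    return [[random.uniform(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def async_optimize(budget=20, workers=4):
    """Asynchronous loop: resubmit as soon as ANY future completes.
    A batch-synchronous loop would instead wait for ALL futures, leaving
    fast workers idle until the slowest one in the batch returns."""
    best = float("inf")
    done_count = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(evaluate, x) for x in propose(workers)}
        while done_count < budget:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                best = min(best, fut.result())
                done_count += 1
                if done_count + len(pending) < budget:
                    # refill immediately: no worker ever idles
                    pending.add(pool.submit(evaluate, propose(1)[0]))
    return best
```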


Page 10:

Parallelism

Synchronous parallel assumes:
1 Computational resources are homogeneous
2 Evaluation time is independent of the input

Examples of heterogeneous resources:

Mixture of CPUs/GPUs
Clouds (e.g., "stragglers" in MapReduce)

Examples of input dependent evaluation time:

Adaptive meshes
Iterative solvers (Krylov, bisection, etc.)
Early termination


Page 11:

POAP and pySOT

POAP (Plumbing for Optimization with Asynchronous Parallelism)

Available at: https://github.com/dbindel/POAP

Framework for building asynchronous optimization strategies

pySOT (Python Surrogate Optimization Toolbox)

Available at: https://github.com/dme65/pySOT

Surrogate optimization strategies implemented in POAP

A great test-suite for doing head-to-head comparisons

Has been cited in work on:

Groundwater flow calibration for the Umatilla Chemical Depot
Calibration of a geothermal reservoir model
Hyper-parameter optimization of deep neural networks


Page 12:

Questions to answer

1 How do we choose between asynchrony and synchrony?

2 What is the tradeoff between information and idle time?

3 What is the effect of parallelism?


Page 13:

Experimental setup for test problems

Use SRBF with 1, 4, 8, 16, and 32 workers

10-dimensional F15-F24 from the BBOB test suite

Draw evaluation times from a Pareto distribution: f_X(x) = α x^{−(α+1)} · 1_{[1,∞)}(x)

Vary α ∈ {102, 12, 2.84} to achieve different tail behaviors

These correspond to standard deviations of 0.01, 0.1, and 1

Figure: Pareto PDFs f_X(x) on [1, 4] for α = 102, 12, and 2.84.
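The quoted standard deviations can be sanity-checked by sampling. This assumes the classical Pareto with x_m = 1, for which sd(X) = sqrt(α / ((α − 1)² (α − 2))) when α > 2; NumPy's `pareto` draws the shifted (Lomax) variant, so the samples are offset by 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Classical Pareto on [1, inf): X = 1 + Lomax(alpha).
for alpha in (102, 12, 2.84):
    sd = np.sqrt(alpha / ((alpha - 1) ** 2 * (alpha - 2)))
    times = 1.0 + rng.pareto(alpha, size=100_000)   # simulated eval times
    print(f"alpha={alpha}: analytic sd={sd:.4f}, sample sd={times.std():.4f}")
```

The analytic values come out to roughly 0.01, 0.1, and 1.0, matching the slide; the α = 2.84 sample estimate converges slowly because of the heavy tail.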


Page 14:

Progress comparison for F18

Figure: Absolute error for F18 vs. wall-clock time (top row) and vs. number of evaluations (bottom row), for Pareto-102, Pareto-12, and Pareto-2.84 evaluation times. Methods: Serial, Sync4, Async4, Sync8, Async8, Sync16, Async16, Sync32, Async32.


Page 15:

Relative speedup for F18

Relative speedup: S(p) = (execution time of the serial algorithm) / (execution time of the parallel algorithm with p processors)

Computed over the intersection of the ranges from all runs


Page 16:

Progress comparison for unimodal function

Consider the sphere function: f(x) = ∑_{j=1}^{30} x_j²

Figure: Absolute error for the sphere function vs. wall-clock time (top row) and vs. number of evaluations (bottom row), for Pareto-102, Pareto-12, and Pareto-2.84 evaluation times. Methods: Serial, Sync4, Async4, Sync8, Async8, Sync16, Async16, Sync32, Async32.


Page 17:

Answers to questions

1 How do we choose between asynchrony and synchrony?

Asynchrony is the best choice on multimodal problems
Best on all problems in the large-variance case
In the small-variance case, asynchrony is better vs. time on:

7/10 problems with 4 processors
6/10 problems with 8 processors
5/10 problems with 16 processors
5/10 problems with 32 processors

2 What is the tradeoff between information and idle time?

Idle time is more important than information for multimodal problems
Serial is not necessarily best vs. #evals in the multimodal case
Serial is best vs. #evals for unimodal problems

3 What is the effect of parallelism?

Helps with exploration
Improves results vs. time


Page 18:

Thank you!
