
Lecture 8: Optimization

Fatih Guvenen

January 10, 2016


Overview of Optimization

- Most commonly needed for: solving a dynamic programming problem; root-finding recast as a minimization problem (discussed earlier), e.g., solving for general equilibrium.
- Two main trade-offs:
  - Fast local methods versus slow but more global methods.
  - Whether or not to calculate derivatives (including Jacobians and Hessians in the multidimensional case!).
- Some of the ideas for local minimization are very similar to root-finding. In fact, Brent's and Newton's methods have analogs for minimization that work with exactly the same logic. Newton-based methods scale very well to the multidimensional case.


LOCAL OPTIMIZATION


One-Dimensional Problems

- Note: you only need two points to bracket a zero, but you need three points a < b < c to bracket a minimum: f(a), f(c) > f(b).
- So first obtain those three points. Many economic problems naturally suggest the two end points (e.g., c_min = ε, c_max = y − a_min).
- Sometimes I use NR's mnbrak.f90 routine. Nothing fancy.
- In one dimension, Brent's method works well and is pretty fast.
- I often prefer it to Newton's method because I know that I am always bracketing a minimum.
- NR has a version of Brent that uses derivative information very carefully, which is my preferred routine (dbrent.f90).
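To make the bracketing idea concrete, here is a minimal sketch using Python's scipy rather than the NR Fortran routines named above; the two-period log-utility problem and its bounds are hypothetical stand-ins.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical two-period problem: choose consumption c in (0, y) to
# maximize log(c) + log(y - c); we minimize the negative of it.
y, eps = 1.0, 1e-6

def f(c):
    return -(np.log(c) + np.log(y - c))

# The economics pins down the end points (c_min = eps, c_max = y - eps);
# any interior point completes the bracketing triple, since f blows up
# at both ends: f(a) > f(b) < f(c).
res = minimize_scalar(f, bracket=(eps, 0.3 * y, y - eps), method="brent")
print(res.x)  # ~0.5 = y/2, the analytical optimum
```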


Multi-Dimensional Optimization

- Once you are "close enough" to the minimum, quasi-Newton methods are hard to beat.
- Quasi-Newton methods reduce the N-dimensional minimization to a series of one-dimensional problems (line searches).
- Basically, starting from a point P, take a direction vector n and find the λ that minimizes f(P + λn).
- Once you are at this line minimum, call it P + λn; the key step is deciding what direction to move in next.
- Two main variants: Conjugate Gradient and Variable Metric methods. The differences between them are relatively minor.
- I use the BFGS variant of the Davidon-Fletcher-Powell algorithm.
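A minimal illustration in Python: scipy's BFGS (a stand-in here for the NR Fortran routines) on the standard Rosenbrock test function; no gradient is supplied, so it is approximated by finite differences.

```python
import numpy as np
from scipy.optimize import minimize, rosen

# BFGS builds up an approximate (inverse) Hessian from successive
# gradient differences and uses it to choose each line-search
# direction n; every iteration then minimizes f(P + lambda * n)
# approximately along that direction.
x0 = np.array([-1.2, 1.0, 0.8, 1.1])
res = minimize(rosen, x0, method="BFGS")
print(res.x)    # all components ~1, the Rosenbrock minimum
print(res.nit)  # number of line-search iterations performed
```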

A Derivative-Free Least Squares (DFLS) Minimization Algorithm:

- Consider the special case of an objective function of the form

      min_x Φ(x) = (1/2) Σ_{i=1}^{m} f_i(x)^2,

  where f_i : R^n → R, i = 1, 2, …, m.

- Zhang, Conn, and Scheinberg (SIAM Journal on Optimization, 2010) propose an extension of Powell's BOBYQA algorithm that does not require derivative information.
- The key insight is to build quadratic models of each f_i individually, rather than of Φ directly.
- The function-evaluation cost is of the same order, but the method is more accurate, and hence faster.
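The DFLS method itself is not in scipy, but the structure it exploits can be illustrated with scipy's least_squares on a hypothetical curve-fitting problem: the solver receives the residuals f_i individually rather than the scalar Φ. (scipy's routine uses derivative information; a derivative-free solver in the same spirit is available in, e.g., the DFO-LS package.)

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical moment-matching exercise: pick (a, b) so that
# a * exp(-b * x) matches the "data" below at 20 points.
data_x = np.linspace(0.0, 1.0, 20)
data_y = 2.0 * np.exp(-1.5 * data_x)

def residuals(theta):
    a, b = theta
    return a * np.exp(-b * data_x) - data_y  # one f_i per moment

# Pass the residual vector, not Phi = 0.5 * sum(f_i^2), so the solver
# can exploit (and model) each f_i separately.
res = least_squares(residuals, x0=[1.0, 1.0])
print(res.x)  # ~[2.0, 1.5]
```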


Judging The Performance of Solvers

- First, define a convergence criterion to judge when a given solver has finished its job.
- Let x_0 denote the starting point and τ (ideally small) the tolerance. The value f_L is the best value that can be attained. In practice, f_L is the best value attained among the set of solvers under consideration using at most μ_f function evaluations (i.e., your "budget").
- Define the stopping rule as:

      f(x_0) − f(x) ≥ (1 − τ)(f(x_0) − f_L).    (1)

- We will consider values like τ = 10^{−k} for k ∈ {1, 3, 5}.
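As a sketch of how this test is applied in code (the numbers below are made up for illustration):

```python
def converged(f_x0, f_x, f_best, tau):
    """Test (1): accept x if it achieves at least a fraction (1 - tau)
    of the best possible reduction f(x0) - f_L within the budget."""
    return f_x0 - f_x >= (1.0 - tau) * (f_x0 - f_best)

# Illustrative numbers: start at f = 10, the best solver reaches
# f_L = 2, and a candidate solver stops at f = 2.5.
for k in (1, 3, 5):
    tau = 10.0 ** (-k)
    print(k, converged(10.0, 2.5, 2.0, tau))  # True only for k = 1
```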


Moré and Wild (2009)

- Performance profiles are defined in terms of a performance measure t_{p,s} > 0 obtained for each problem p ∈ P and solver s ∈ S.
- Mathematically, the performance ratio is

      r_{p,s} = t_{p,s} / min{t_{p,s} : s ∈ S}.

- t_{p,s} could be based on the amount of computing time or the number of function evaluations required to satisfy the convergence test.
- Note that the best solver for a particular problem attains the lower bound r_{p,s} = 1.
- The convention r_{p,s} = ∞ is used when solver s fails to satisfy the convergence test on problem p.


Performance Profile

- The performance profile of a solver s ∈ S is defined as the fraction of problems where the performance ratio is at most α, that is,

      ρ_s(α) = (1/|P|) size{p ∈ P : r_{p,s} ≤ α},

  where |P| denotes the cardinality of P.
- Thus, a performance profile is the probability distribution for the ratio r_{p,s}.
- Performance profiles seek to capture how well the solver performs relative to the other solvers in S on the set of problems in P.
- ρ_s(1) is the fraction of problems for which s is the best solver.
- In general, ρ_s(α) is the fraction of problems with a performance ratio r_{p,s} bounded by α, and thus solvers with high values of ρ_s(α) are preferable.
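A small sketch of both the ratio and the profile, with a made-up evaluation-count matrix (rows are problems, columns are solvers; inf marks failure to converge within the budget):

```python
import numpy as np

# t[p, s]: function evaluations solver s needed on problem p;
# np.inf where the convergence test (1) was never satisfied.
t = np.array([[ 80., 120., np.inf],
              [ 40.,  60.,  50.],
              [200., 100., 400.]])

r = t / t.min(axis=1, keepdims=True)  # performance ratios r_{p,s}

def perf_profile(r, alpha):
    """rho_s(alpha): fraction of problems with ratio at most alpha."""
    return np.mean(r <= alpha, axis=0)

print(perf_profile(r, 1.0))  # fraction of problems each solver wins
print(perf_profile(r, 2.0))  # ...solved within 2x the best solver's cost
```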


Performance Profile

Figure: sample performance profile ρ_s(α) (logarithmic scale) for derivative-free solvers, reproduced from Moré and Wild (2009, Fig. 2.1), with τ = 10^{−3} and a budget of μ_f = 1300 function evaluations. In that example solver S3 performs best; and since ρ_1(2) ≈ 55% while ρ_4(2) ≈ 35%, solver S4 requires more than twice as many function evaluations as S1 on roughly 20% of the problems.


Data Profiles

- Oftentimes we are interested in the percentage of problems that can be solved (for a given tolerance τ) within μ_f function evaluations.
- We can obtain this information by letting t_{p,s} be the number of function evaluations required to satisfy (1) for a given tolerance τ.
- Moré and Wild (2009) define a data profile as

      d_s(α) = (1/|P|) size{p ∈ P : t_{p,s} / (n_p + 1) ≤ α},

  where n_p is the number of variables in problem p.
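Continuing the same made-up example, a sketch of the data profile; dividing by n_p + 1 expresses the budget in units of "simplex gradients" so that problems of different dimension are comparable:

```python
import numpy as np

def data_profile(t, n_p, alpha):
    """d_s(alpha): fraction of problems solved within alpha * (n_p + 1)
    function evaluations, i.e. within alpha simplex gradients."""
    t = np.asarray(t, dtype=float)
    scale = (np.asarray(n_p, dtype=float) + 1.0)[:, None]
    return np.mean(t / scale <= alpha, axis=0)

# Same hypothetical matrix as before; problems have 2, 4, 10 variables.
t = [[80., 120., float("inf")],
     [40.,  60., 50.],
     [200., 100., 400.]]
print(data_profile(t, [2, 4, 10], alpha=15.0))  # [0.33, 0.67, 0.33]
```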


Measuring Actual Performance


GLOBAL OPTIMIZATION


What Your Objective Function Looks Like

- Caution: it is very easy to find a minimum; it is very hard to ensure it is the minimum.


What Your Objective Function Looks Like

Figure: surface plot of f(x, y) = 0.05x + 0.2y + (sin(4x)/x) · sin(|3y|) / √(5 + |2y|), for x ∈ [−2, 4] and y ∈ [−2, 2].


What Your Objective Function Looks Like


Nelder-Mead Downhill Simplex Algorithm

- A powerful method that relies only on function evaluations (no derivatives).
- Works even when the objective function is discontinuous or kinky!
- It is slow, but has better global convergence properties than derivative-based algorithms (such as the Broyden-Fletcher-Goldfarb-Shanno method).
- It must be part of your everyday toolbox.
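A minimal example in Python (scipy's implementation) on a hypothetical kinked objective that would trip up gradient-based methods:

```python
import numpy as np
from scipy.optimize import minimize

# Kinked objective: non-differentiable along x = 1 and y = -0.5,
# exactly where the minimum sits. Nelder-Mead only compares function
# values, so the kink causes it no trouble.
def kinky(x):
    return abs(x[0] - 1.0) + abs(x[1] + 0.5) + 0.1 * (x[0]**2 + x[1]**2)

res = minimize(kinky, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8})
print(res.x)  # ~(1.0, -0.5)
```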


Nelder-Mead Simplex

Figure: evolution of the N-simplex during the amoeba iterations. Panels: initial simplex; reflection; reflection and expansion; contraction; contraction in all directions.


A Practical Guide

How to proceed in practice?

1. If you can establish some geometric properties of your objective function, that is where you should start.
2. For example, in a standard portfolio choice problem with CRRA utility and linear budget constraints, you can show that the RHS of the Bellman equation has a single peak (no local maxima).
3. Even when this is theoretically true, there is no guarantee that your numerical objective will have a single peak, because of the approximations involved. (We will see an example in a few weeks.)
4. The least you should do is plot slices and/or two-dimensional surfaces of your objective function (see the sketch after this list).
5. These will give you valuable insights into the nature of the problem.
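As an illustration, a minimal slice-plotting sketch in Python; the three-parameter objective is a hypothetical placeholder for your own:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical objective with three parameters; swap in your own.
def objective(theta):
    a, b, c = theta
    return (a - 1.0)**2 + np.sin(5.0 * b) * abs(b) + (c + 0.5)**2

theta_star = np.array([0.9, 0.1, -0.4])  # current best point

# One-dimensional slices through theta_star along each coordinate.
grid = np.linspace(-2.0, 2.0, 200)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for i, ax in enumerate(axes):
    theta = np.tile(theta_star, (grid.size, 1))
    theta[:, i] = grid
    ax.plot(grid, [objective(th) for th in theta])
    ax.set_xlabel(f"parameter {i}")
    ax.set_ylabel("objective" if i == 0 else "")
plt.tight_layout()
plt.show()
```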


A Practical Guide

- Having said that, when you solve a DP problem without fixed costs, option values, max operators, or other sources of non-concavity, the local methods described above will usually work fine.
- When your minimizer converges, restart the program from the point it converged to. (You will be surprised how often the minimizer drifts away from the supposed minimum!)
- Another idea is to do random restarts, a bunch of times!
- But this is not very efficient, because the random restart points could end up being very close to each other (a general problem with random sampling: small-sample issues).
- Is there a better way? Yes (with some qualifications).


A Global Minimization Algorithm

Here is the algorithm that I use; it has worked well for me. Experiment with variations that may work better for your problem!

1. Set j = 0 and start the iteration.
2. Start with initial guess x_j and apply Nelder-Mead until it "converges" to a new point; call it z_j.
3. Draw a quasi-random initial guess y_j (using a Halton or Sobol' sequence; more on this in a minute).
4. Then take your new starting point to be x̃_j = θ_j z*_j + (1 − θ_j) y_j, where θ_j ∈ [0, 1] and z*_j is the best point obtained up until iteration j.
5. Update j = j + 1 and set x_j = x̃_{j−1}. Go to step 2.
6. Iterate until convergence.

- Take θ_j to be close to zero initially and increase it as you go. (The sketch below uses a simple linear schedule.)
- You could sprinkle some BFGS after step 2 and let it simmer for a while!
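Here is a sketch of this algorithm in Python, under stated assumptions: scipy's Nelder-Mead as the local solver, a scrambled Sobol' sequence for the y_j draws, and a linear schedule taking θ_j from 0 to 1; the test function is the one from the surface plot a few slides back.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import qmc

def global_search(f, bounds, n_iter=32, seed=0):
    """Sketch of the restart scheme above: Nelder-Mead from a moving
    blend of the best point so far and a fresh quasi-random draw."""
    lo, hi = np.asarray(bounds, dtype=float).T
    sobol = qmc.Sobol(d=lo.size, scramble=True, seed=seed)
    draw = lambda: lo + (hi - lo) * sobol.random(1)[0]
    x, best_x, best_f = draw(), None, np.inf        # step 1: x_0
    for j in range(n_iter):
        res = minimize(f, x, method="Nelder-Mead")  # step 2: z_j
        if res.fun < best_f:
            best_x, best_f = res.x, res.fun         # update z*_j
        y = draw()                                  # step 3: Sobol' y_j
        theta = j / (n_iter - 1)                    # near 0 early, -> 1
        x = theta * best_x + (1 - theta) * y        # step 4: blend
    return best_x, best_f

# The function from the surface plot a few slides back:
def f(v):
    x, y = v
    x = x if abs(x) > 1e-12 else 1e-12  # guard the sin(4x)/x term at 0
    return 0.05*x + 0.2*y + (np.sin(4*x)/x) * np.sin(abs(3*y)) / np.sqrt(5 + abs(2*y))

print(global_search(f, bounds=[(-2, 4), (-2, 2)]))
```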


Quasi-Random Numbers

- One could imagine that a better approach in the previous algorithm would be to take the starting guesses on a Cartesian grid.
- But how do you decide how coarse or fine this grid should be? If x is 6-dimensional and you take 3 points in each direction, you need to start from 3^6 = 729 different points. And who says 3 points per dimension is good enough?
- Random numbers have the advantage that you do not have to decide beforehand how many restarts to do; instead, you can look at the improvement in the objective value.
- But a disadvantage of random numbers is that... well, they are random! So they can accumulate in some areas and leave other areas empty.
- This is where quasi-random numbers come into play. They are not random, but they spread out nearly maximally in a given space no matter how many of them are generated.
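A quick way to see this in Python (scipy's quasi-Monte Carlo module; the nearest-neighbor gap is just one crude measure of how evenly points spread):

```python
import numpy as np
from scipy.stats import qmc

n = 256  # a power of 2, as Sobol' sequences prefer
uniform_pts = np.random.default_rng(0).random((n, 2))
sobol_pts = qmc.Sobol(d=2, scramble=True, seed=0).random(n)

def nearest_gap(pts):
    """Smallest distance between any two points: clumping shows up
    as a tiny value."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min()

print(nearest_gap(uniform_pts))  # typically much smaller (clumps)...
print(nearest_gap(sobol_pts))    # ...than the quasi-random spread
```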


Uniform Random vs. Sobol’ Numbers

Figure: points in [0, 1]² generated by a Sobol' sequence (left panel) versus i.i.d. uniform random draws (right panel).


A Little Parallel Programming

...without knowing any parallel programming.

- Ingredients you need: Dropbox, and friends who will let you use their computers while they are asleep.
- Here is a modified version of my global algorithm that you can use with N computers.


1. Create an empty text file myobjvals.txt and put it in automatic sync across all machines using Dropbox.
2. Generate a large number of quasi-random points (say 1000).
3. Take the first N of these points and start your program on the N machines, each with one of the quasi-random points as its initial guess.
4. After Nelder-Mead converges on a given machine, write the minimum value found and the corresponding point to myobjvals.txt (see the sketch below).
5. Before starting the next iteration, open and read all the objective values found so far (because of syncing, this gives you the minimum across all machines!).
6. Take your initial guess to be a linear combination of this best point and a new quasi-random point.
7. The rest of the algorithm is as before.
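A sketch of the shared-file plumbing in Python; the Dropbox path and the one-row-per-result CSV format are assumptions, and simultaneous writes from several machines can still produce Dropbox "conflicted copies", so treat this as best-effort coordination.

```python
import csv, os

SHARED = os.path.expanduser("~/Dropbox/myobjvals.txt")  # synced by Dropbox

def report(point, value):
    """Append this machine's latest local minimum: one CSV row,
    objective value first, then the point's coordinates."""
    with open(SHARED, "a", newline="") as fh:
        csv.writer(fh).writerow([value, *point])

def best_so_far():
    """Read every result any machine has synced; return the best."""
    with open(SHARED) as fh:
        rows = [[float(v) for v in row] for row in csv.reader(fh) if row]
    value, *point = min(rows, key=lambda r: r[0])
    return point, value
```

Each machine calls report() when its Nelder-Mead run converges (step 4) and best_so_far() before blending its next starting point (steps 5 and 6).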
