
Lecture Notes on Engineering Optimization

Fraser J. Forbes and Ilyasse Aksikas

Department of Chemical and Materials Engineering

University of Alberta

[email protected]


Contents

– Chapter 1 : Introduction - Basic Concepts

– Chapter 2 : Linear Programming

– Chapter 3 : Unconstrained Univariate Nonlinear Optimization

– Chapter 4 : Unconstrained Multivariate Nonlinear Optimization

– Chapter 5 : Constrained Nonlinear Optimization


Chapter I

Introduction - Basic Concepts

– Introductory Example

– General Formulation and Classification

– Development Stages

– Degrees of Freedom


1 Introductory Example

Problem

Schedule your activities for next week-end, given you only have $100 to spend.

Question

How can we go about solving this problem in a coherent fashion ?


A systematic approach might be :

Step 1 : Choose a scheduling objective.

Step 2 : List all possible activities and pertinent information regarding each activity,

activity    fun units/hr    cost    time limits
            Fi              Ci      Li


Step 3 : Specify other limitations.

tavailable =

$available =

Step 4 : Decide what the decision variables are.

Step 5 : Write down any relationships you know of among the variables.


Step 6 : Write down all restrictions on the variables.

Step 7 : Write down the mathematical expression for the objective.


Step 8 : Write the problem out in a standard form.

There is a more compact way to write this problem down.

→ matrix-vector form.


2 General Formulation

Optimize     P(x)          Objective function
                           (performance function, profit (cost) function)

Subject to   f(x) = 0      Equality constraints
                           (process model)

             g(x) ≤ 0      Inequality constraints
                           (operating constraints)

             xl ≤ x ≤ xu   Variable bounds
                           (variable limits)


3 Development Stages

Stage 1 : Problem Definition

– Objectives

→ goal of optimization

→ performance measurement

→ example : max profit, min energy cost, etc

→ max P(x) = min (−P(x)).


– Process models

→ assumptions & constants

→ equations to represent relationships

⋆ material / energy balances
⋆ thermodynamics
⋆ kinetics, etc

→ constraints and bounds to represent limitations

⋆ operating restrictions
⋆ bounds / limits

A point which satisfies the process model equations is a feasible point. Otherwise, it is an infeasible point.


Stage 2 : Problem Formulation

– Standard form

– Degrees of freedom analysis

→ over-, under- or exactly specified ?

→ decision variables (independent vs. dependent)

– Scaling of variables

→ units of measure

→ scaling factors


Stage 3 : Problem Solution

– Technique selection

→ matched to problem type

→ exploit problem structure

→ knowledge of algorithms strengths & weaknesses

– Starting points

→ usually several and compare

– Algorithm tuning

→ termination criteria

→ convergence tolerances

→ step length


Stage 4 : Results Analysis

– Solution uniqueness

– Perturbation of optimization problem

→ effects of assumptions

→ variation (uncertainty) in problem parameters

→ variation in prices/costs


Solution of an optimization problem requires all of the steps

– a full understanding is developed by following the complete cycle

– good decisions require a full understanding of the problem

– shortcuts lead to bad decisions.


In this course, our problems will have :

- objective functions which are continuously differentiable,
- constraint equations which are continuously differentiable.

Optimization problems with scalar objective functions and vector constraints can be classified :


4 Degrees of Freedom

To determine the degrees of freedom (the number of variables whose values may be independently specified) in our model we could simply count the number of independent variables (the number of variables which remain on the right-hand side) in our modified equations.

This suggests a possible definition :

degrees of freedom = # variables − # equations

Definition : The degrees of freedom for a given problem are the number of independent problem variables which must be specified to uniquely determine a solution.


Consider the following three equations relating three variables x1, x2 and x3 :

x1 − x2 = 0
x1 − 2x3 = 0
x2 − 2x3 = 0

This seems to indicate that there are no degrees of freedom.

Notice that if we subtract the last from the second equation :

(x1 − 2x3) − (x2 − 2x3) = 0  =⇒  x1 − x2 = 0

the result is the first equation.

It seems that we have three different equations, which contain no more information than two of the equations. In fact any of the equations is a linear combination of the other two equations.

We require a clearer, more precise definition for degrees of freedom.
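The redundancy in this example shows up numerically as a rank deficiency of the coefficient matrix. A small sketch (assuming numpy; any of the tools mentioned later in these notes — Maple, Matlab, Mathcad — would serve equally well):

```python
import numpy as np

# Coefficient matrix of the system
#   x1 - x2        = 0
#   x1      - 2*x3 = 0
#        x2 - 2*x3 = 0
A = np.array([
    [1.0, -1.0,  0.0],
    [1.0,  0.0, -2.0],
    [0.0,  1.0, -2.0],
])

rank = np.linalg.matrix_rank(A)
print(rank)               # 2: only two equations are independent
print(A.shape[1] - rank)  # 1 degree of freedom, not 0
```

This confirms that counting equations overstates the information content: 3 variables minus 2 independent equations leaves 1 degree of freedom.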


A More Formal Approach :

Suppose we have a set of m equations :

h(v) = 0

in the set of variables v (n + m elements). We would like to determine whether the set of equations can be used to solve for some of the variables in terms of the others.

In this case we have a system of m equations in n + m unknown variables. The Implicit Function Theorem states that if the m equations are linearly independent, then we can divide our set of variables v into m dependent variables u and n independent variables x.

The Implicit Function Theorem goes on to give conditions under which the dependent variables u may be expressed in terms of the independent variables x, or :

u = g(x)


Usually we don’t need to find the set of equations u = g(x), we only need to know if it is possible. Again the Implicit Function Theorem can help us out :

if rank[∇vh] = m, all of the model equations are linearly independent and it is possible (at least in theory) to use the set of equations h(v) = 0 to determine values for all of the m dependent variables u given values for the n independent variables x.

Alternatively we could say that the number of degrees of freedom in this case is the number of independent variables. (Recall that there are n variables in x.)

We know that :

rank[∇vh] ≤ m.

What does it mean if :

rank[∇vh] < m?


Let’s investigate a simple set of equations :

h(v) =
[ v1 − v1v2 − α e^(v3) ]
[ v1 − v2 − v3         ]
= 0

with α = 1/e. (This could be the material balances for a reactor.) A three-dimensional plot of the equations looks like :


The solution of the equations lies at the intersection of the two surfaces, each described by one of the equations.

How many degrees of freedom does this problem have ?

Take a closer look :


Examine the neighbourhood of the point :

v1 = 1, v2 = 0, v3 = 1

What is happening here ?

The Jacobian of the equation set h(v) = 0 is :

∇vh =
[ 1 − v2    −v1    −α e^(v3) ]
[ 1         −1     −1        ]

With α = 1/e, at the point v = [1, 0, 1]^T :

∇vh =
[ 1   −1   −1 ]
[ 1   −1   −1 ]

What is the rank of this matrix ?

rank[∇vh] = 1 < m = 2.

This tells us that, at the point v = [1, 0, 1]^T, our system contains only one linearly independent equation.
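The local rank drop at v = [1, 0, 1]^T can be checked numerically (a sketch, assuming numpy; an explicit SVD tolerance is passed so that floating-point round-off in α e^(v3) does not hide the deficiency):

```python
import numpy as np

alpha = 1.0 / np.e

def jacobian(v1, v2, v3):
    """Jacobian of h(v) = [v1 - v1*v2 - alpha*exp(v3), v1 - v2 - v3]."""
    return np.array([
        [1.0 - v2, -v1,  -alpha * np.exp(v3)],
        [1.0,      -1.0, -1.0],
    ])

# At the special point the two rows coincide -> rank 1
J = jacobian(1.0, 0.0, 1.0)
print(np.linalg.matrix_rank(J, tol=1e-10))            # 1

# At a generic point the two equations are independent -> rank 2
print(np.linalg.matrix_rank(jacobian(2.0, 0.5, 0.0), tol=1e-10))  # 2
```

For nonlinear systems like this one, the rank (and hence the degrees of freedom) is a local property of the point at which the Jacobian is evaluated.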


Then :

degrees of freedom = 3 variables - 1 equation = 2.

→ at the given point two variables must be specified to uniquely determine the solution to the problem.

This example was chosen because it was very easy to see the occurrence of linear dependence within the equation set. Of course the situation can be much more complex for real problems with many more equations and variables. So, clearly we cannot determine degrees of freedom by counting the number of equations in our problem.

As a result, we must modify our definition :

DOF = # variables - # linearly independent equations


In summary, determination of the degrees of freedom for a specific steady-state set of equations requires :

1. determination of rank[∇vh].

2. if rank[∇vh] = number of equations (m), then all of the equations are linearly independent and :

d.o.f. = # variables − # equations.

3. if rank[∇vh] < number of equations (m), then the equations are not all linearly independent and :

d.o.f. = # variables − rank[∇vh].

Remember : For sets of linear equations, the analysis has to be performed only once. Generally, for nonlinear equation sets, the analysis is only valid at the variable values used in the analysis.


Degrees of freedom analysis tells us the maximum number of variables which can be independently specified to uniquely determine a feasible solution to a given problem.

We need to consider degrees of freedom when solving many different types of problems. These include :

– plant design,
– plant flow sheeting,
– model fitting,

and, of course, optimization problems.


Consider the example of an isothermal reactor system :

The material balances are :

mass :          FA − FR = 0

component A :   FA − k XA V ρ − XA FR = 0

component B :   k XA V ρ − XB FR = 0


What other information do we have ?

By definition, we also know :

1−XA −XB = 0

If we know the feed-rate (FA = 5 kg/min), reactor volume (V = 2 litres), density of the reaction mixture (ρ = 1 kg/l), and reaction rate constant (k = 7 min−1), then the material balances can be re-written :

5− FR = 0

5− 14XA −XAFR = 0

14XA −XBFR = 0

1−XA −XB = 0

Is this set of equations linear or nonlinear ?


In our problem, there are :

3 variables (XA, XB, FR)

4 equations (1 mass balance, 2 component balances, 1 other equation).

What does this mean ?

Consider what happens if we add :

component A + component B - mass

=⇒ FR −XAFR −XBFR = 0

or :

FR(1−XA −XB) = 0

and since FR ≠ 0, then

1−XA −XB = 0


which is our fourth equation. Thus our four equations are not linearly independent.

Caution : Care should be taken when developing a model to avoid this situation of specifying too many (linearly dependent) equations, since it often leads to difficulty in finding a feasible solution.

We need to eliminate one of the equations from our model. I would suggest eliminating :

1−XA−XB = 0

This equation could then provide an easy check on any solution we find using the other equations.

Thus our model is reduced to :

5− FR = 0

5− 14XA −XAFR = 0

14XA −XBFR = 0


How many degrees of freedom are there in this model ?

For our reactor model :

∇vh =
[ −1     0          0   ]
[ −XA   −14 − FR    0   ]
[ −XB    14        −FR  ]

What is the rank of this matrix ? (Consider what happens when FR = 0 or FR = −14.)

We could determine the rank by :

1. using software (Maple, Matlab, Mathcad, ...)

2. examining the eigenvalues (which are displayed on the diagonal of a triangular matrix) :

rank[∇vh] = 3 if FR ≠ 0 and FR ≠ −14

rank[∇vh] = 2 if FR = 0 or FR = −14

3. using the determinant

|∇vh| = −FR(14 + FR)


Then for our reactor problem and realistic flow-rates, we have :

3 variables (FR, XA, XB)

3 linearly independent equations (1 mass and 2 component balances).

Therefore :

degrees of freedom = 3 - 3 = 0.

This says there are no degrees of freedom for optimization and the reactor problem is fully specified.

In fact the unique solution is :

FR = 5 kg/min.

XA = 5/19 wt. fraction

XB = 14/19 wt. fraction
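The claimed solution can be verified by substituting it back into the three balances, and the zero-DOF claim by checking the Jacobian rank at that point (a sketch, assuming numpy):

```python
import numpy as np

FR, XA, XB = 5.0, 5.0 / 19.0, 14.0 / 19.0

# Residuals of the three model equations at the claimed solution
residuals = np.array([
    5.0 - FR,                    # mass balance
    5.0 - 14.0 * XA - XA * FR,   # component A balance
    14.0 * XA - XB * FR,         # component B balance
])
print(np.abs(residuals).max())   # ~0: the point satisfies all balances

# Jacobian with respect to (FR, XA, XB), as derived above
J = np.array([
    [-1.0,  0.0,        0.0],
    [-XA,  -14.0 - FR,  0.0],
    [-XB,   14.0,      -FR],
])
print(np.linalg.matrix_rank(J))  # 3 -> d.o.f. = 3 - 3 = 0
```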


Until now, we have specified the feed-rate (FA) in our reactor problem. Suppose we decide to include it as an optimization variable :

1. how many degrees of freedom are now available for optimization ?

d.o.f = 4 variables - 3 equations = 1

2. Can we re-introduce the fourth equation :

1−XA −XB = 0

to reduce the degrees of freedom ?

No, using FA as an optimization variable does not change the linear dependence among the four equations.


Chapter II

Linear Programming

1. Introduction

2. Simplex Method

3. Duality Theory

4. Optimality Conditions

5. Applications (QP & SLP)

6. Sensitivity Analysis

7. Interior Point Methods


1 Introduction

Linear programming became important during World War II :

=⇒ used to solve logistics problems for the military.

Linear Programming (LP) was the first widely used form of optimization in the process industries and is still the dominant form today.

Linear Programming has been used to :

→ schedule production,
→ choose feedstocks,
→ determine new product feasibility,
→ handle constraints for process controllers, . . .

There is still a considerable amount of research taking place in this area today.


All Linear Programs can be written in the form :

min_{xi}   c1x1 + c2x2 + c3x3 + . . . + cnxn

subject to :

a11x1 + a12x2 + a13x3 + . . . + a1nxn ≤ b1
a21x1 + a22x2 + a23x3 + . . . + a2nxn ≤ b2
...
am1x1 + am2x2 + am3x3 + . . . + amnxn ≤ bm

0 ≤ xi ≤ xl,i
bi ≥ 0

=⇒ note that all of the functions in this optimization problem are linear in the variables xi.


In mathematical short-hand this problem can be re-written :

min_x   c^T x

subject to :

Ax ≤ b
0 ≤ x ≤ xl
b ≥ 0

Note that :

1. the objective function is linear
2. the constraints are linear
3. the variables are defined as non-negative
4. all elements of b are non-negative.


Example

A market gardener has a small plot of land she wishes to use to plant cabbages and tomatoes. From past experience she knows that her garden will yield about 1.5 tons of cabbage per acre or 2 tons of tomatoes per acre. Her experience also tells her that cabbages require 20 lbs of fertilizer per acre, whereas tomatoes require 60 lbs of fertilizer per acre. She expects that her efforts will produce $600/ton of cabbages and $750/ton of tomatoes.

Unfortunately, the truck she uses to take her produce to market is getting old, so she doesn’t think that it will transport any more than 3 tons of produce to market at harvest time. She has also found that she only has 60 lbs of fertilizer.

Question : What combination of cabbages and tomatoes shouldshe plant to maximize her profits ?


The optimization problem is :

In the general matrix LP form :
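The worked formulation is not reproduced here; one consistent way to fill it in, with c and t the acres planted in cabbages and tomatoes (the per-acre profits follow from $600/ton × 1.5 ton/acre = $900/acre and $750/ton × 2 ton/acre = $1500/acre, matching the coefficients used in the Simplex section below):

```latex
% Decision variables: c = acres of cabbages, t = acres of tomatoes
\max_{c,\,t}\ 900\,c + 1500\,t
\quad\text{s.t.}\quad
\begin{cases}
1.5\,c + 2\,t \le 3 & \text{(trucking limit, tons)}\\
20\,c + 60\,t \le 60 & \text{(fertilizer limit, lbs)}\\
c \ge 0,\ t \ge 0
\end{cases}

% Equivalent general matrix LP form:
\min_{x}\ -\begin{bmatrix} 900 & 1500 \end{bmatrix} x
\quad\text{s.t.}\quad
\begin{bmatrix} 1.5 & 2 \\ 20 & 60 \end{bmatrix} x \le
\begin{bmatrix} 3 \\ 60 \end{bmatrix},
\qquad x = \begin{bmatrix} c \\ t \end{bmatrix} \ge 0
```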


We can solve this problem graphically :

We know that for well-posed LP problems, the solution lies at a vertex of the feasible region. So all we have to do is try all of the constraint intersection points.


Check each vertex :

c      t      P(x)
0      0      0
2      0      1800
0      1      1500
6/5    3/5    1980

So our market gardener should plant 1.2 acres in cabbages and 0.6 acres in tomatoes. She will then make $1980 in profit.

This graphical procedure is adequate when the optimization problem is simple. For optimization problems which include more variables and constraints this approach becomes impractical.
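For a problem this small, though, the vertex check just described is easy to automate; a sketch in pure Python (exact arithmetic via fractions, so the vertex 6/5, 3/5 comes out exactly):

```python
from fractions import Fraction as F
from itertools import combinations

# Each constraint as an equality a*c + b*t = rhs for the vertex search:
lines = [
    (F(3, 2), F(2),  F(3)),    # trucking limit (tons):      1.5c + 2t <= 3
    (F(20),   F(60), F(60)),   # fertilizer limit (lbs):     20c + 60t <= 60
    (F(1),    F(0),  F(0)),    # c = 0 axis
    (F(0),    F(1),  F(0)),    # t = 0 axis
]

def intersect(l1, l2):
    """Solve the 2x2 system by Cramer's rule; None if the lines are parallel."""
    (a1, b1, r1), (a2, b2, r2) = l1, l2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(c, t):
    return (c >= 0 and t >= 0 and
            F(3, 2) * c + F(2) * t <= 3 and
            F(20) * c + F(60) * t <= 60)

def profit(p):
    c, t = p
    return 900 * c + 1500 * t   # $/acre coefficients from the problem data

vertices = [p for l1, l2 in combinations(lines, 2)
            if (p := intersect(l1, l2)) is not None and feasible(*p)]
best = max(vertices, key=profit)
print(best, profit(best))   # (Fraction(6, 5), Fraction(3, 5)) 1980
```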


Degeneracy in Linear Programming

Our example was well-posed, which means there was a unique optimum. It is possible to formulate LP problems which are ill-posed or degenerate. There are three main types of degeneracy which must be avoided in Linear Programming :

1. unbounded solutions,

−→ not enough independent constraints.

2. no feasible region,

−→ too many or conflicting constraints.

3. non-unique solution,

−→ too many solutions,
−→ dependence between profit function and some constraints.


1. Unbounded Solution :

As we move up either the x1-axis or the constraint, the objective function improves without limit.

In industrial-scale problems with many variables and constraints, this situation can arise when :

−→ there are too few constraints.
−→ there is linear dependence among the constraints.


2. No Feasible Region :

Any value of x violates some set of constraints.

This situation occurs most often in industrial-scale LP problems when too many constraints are specified and some of them conflict.


3. Non-Unique Solution :

Profit contour is parallel to constraint.


2 Simplex Method

Our market gardener example had the form :

min_x   −[900 1500] x

subject to :

[ 1.5    2 ]       [ 3  ]
[ 20    60 ] x  ≤  [ 60 ]

where : x = [acres cabbages, acres tomatoes]^T

We need a more systematic approach to solving these problems, particularly when there are many variables and constraints.

−→ SIMPLEX method (Dantzig).
−→ always move to a vertex which improves the value of the objective function.


SIMPLEX Algorithm

1. Convert problem into standard form with :
→ positive right-hand sides,
→ lower bounds of zero.

2. Introduce slack/surplus variables :
→ change all inequality constraints to equality constraints.

3. Define an initial feasible basis :
→ choose a starting set of variable values which satisfy all of the constraints.

4. Determine a new basis which improves the objective function :
→ select a new vertex with a better value of P(x).

5. Transform the equations :
→ perform row reduction on the equation set.

6. Repeat steps 4 & 5 until no more improvement in the objective function is possible.


The problem was written :

max_{c,t}   900c + 1500t

subject to :

1.5c + 2t ≤ 3
20c + 60t ≤ 60

1. In standard form, the problem is :

min_x   −[900 1500] x

subject to :

[ 1.5    2 ]       [ 3  ]
[ 20    60 ] x  ≤  [ 60 ]

where : x = [c t]^T.


2. Introduce slack variables (or convert all inequality constraints to equality constraints) :

min_{c,t}   −900c − 1500t

subject to :

1.5c + 2t + s1 = 3
20c + 60t + s2 = 60
c, t, s1, s2 ≥ 0

or in matrix form :

min_x   −[900 1500 0 0] x

subject to :

[ 1.5    2    1    0 ]       [ 3  ]
[ 20    60    0    1 ] x  =  [ 60 ]

where : x = [c t s1 s2]^T ≥ 0.


3. Define an initial basis and set up tableau :

→ choose the slacks as initial basic variables,

→ this is equivalent to starting at the origin.

the initial tableau is :
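From the equality-constrained form above, the initial tableau (basic variables s1 and s2, objective row last) can be written as:

```latex
\begin{array}{c|cccc|c}
\text{basis} & c & t & s_1 & s_2 & b \\ \hline
s_1 & 1.5 & 2    & 1 & 0 & 3  \\
s_2 & 20  & 60   & 0 & 1 & 60 \\ \hline
-P  & -900 & -1500 & 0 & 0 & 0
\end{array}
```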


4. Determine the new basis :

→ examine the objective function coefficients and choose a variable with a negative weight. (You want to decrease the objective function because you are minimizing. Usually we will choose the most negative weight, but this is not necessary.)

→ this variable will be brought into the basis.

→ divide each element of b by the corresponding constraint coefficient of the new basic variable.

→ the variable which will be removed from the basis is in the pivot row (given by the smallest positive ratio of bi/aij).


5. Transform the constraint equations :

→ perform row reduction on the constraint equations to make the pivot element 1 and all other elements of the pivot column 0.

i) new row #2 = row #2 / 60
ii) new row #1 = row #1 − 2 × new row #2
iii) new row #3 = row #3 + 1500 × new row #2
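The row operations just listed, together with the second pivot that follows, can be scripted to confirm the final tableau (a sketch, assuming numpy):

```python
import numpy as np

# Tableau columns: [c, t, s1, s2 | b]; last row is the objective (-P).
T = np.array([
    [1.5,       2.0,  1.0, 0.0,  3.0],   # trucking constraint + slack s1
    [20.0,     60.0,  0.0, 1.0, 60.0],   # fertilizer constraint + slack s2
    [-900.0, -1500.0, 0.0, 0.0,  0.0],   # objective row
])

def pivot(T, row, col):
    """Row-reduce so the pivot element becomes 1 and the rest of its column 0."""
    T = T.copy()
    T[row] /= T[row, col]
    for r in range(T.shape[0]):
        if r != row:
            T[r] -= T[r, col] * T[row]
    return T

# Iteration 1: bring t (col 1) into the basis; ratios 3/2 vs 60/60 -> pivot row 1.
T = pivot(T, row=1, col=1)        # these are steps i)-iii) above
# Iteration 2: bring c (col 0) in; ratios 1/(5/6) vs 1/(1/3) -> pivot row 0.
T = pivot(T, row=0, col=0)

print(round(T[-1, -1], 6))                        # 1980.0 -> maximum profit
print(round(T[0, -1], 6), round(T[1, -1], 6))     # 1.2 0.6 -> acres of c and t
```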


6. Repeat steps 4 & 5 until no improvement is possible :


the new tableau is :

no further improvements are possible, since there are no more negative coefficients in the bottom row of the tableau.


The final tableau is :

– optimal basis contains both cabbages and tomatoes (c, t).
– optimal acres of cabbages is 6/5 and of tomatoes is 3/5.
– the maximum profit is $1980.


Recall that we could graph the market garden problem :

We can track how the SIMPLEX algorithm moved through the feasible region. The SIMPLEX algorithm started at the origin. It then moved along the tomato axis (this was equivalent to introducing tomatoes into the basis) to the fertilizer constraint. The algorithm then moved up the fertilizer constraint until it found the intersection with the trucking constraint, which is the optimal solution.


Some other considerations

1. Negative lower bounds on variables :

– SIMPLEX method requires all variables to have non-negative lower bounds (i.e. x ≥ 0).

– problems can have variables for which negative values are physically meaningful (e.g. intermediate cross-flows between parallel process units).

– introduce artificial variables.

If we have bounds on the variable x1 such that : a ≤ x1 ≤ b, then we introduce two new artificial variables (e.g. x5, x6) where :

x1 = x6 − x5

and we can require x5 ≥ 0 and x6 ≥ 0 while satisfying the bounds on x1.


2. Mixed constraints :

– until this point we have always been able to formulate our constraint sets using only ≤ constraints (with positive right-hand sides).

– how can we handle constraint sets that also include ≥ constraints (with positive right-hand sides) ?

– introduce surplus variables.

Suppose we have the constraint :

[ai1 . . . ain] x ≥ bi

Then we can introduce a surplus variable si to form an equality constraint by setting :

ai1x1 + ...+ ainxn − si = bi


We have a problem if we start with an initial feasible basis such that x = 0 (i.e. starting at the origin). Then :

si = −bi

which violates the non-negativity constraint on all variables (including slack / surplus variables).

We need a way of finding a feasible starting point for these situations.

3. Origin not in the feasible region

The SIMPLEX algorithm requires a feasible starting point. We have been starting at the origin, but in this case the origin is not in the feasible region. There are several different approaches :

−→ Big M method (Murty, 1983).


Big M Method

Consider the minimization problem with mixed constraints :

min_x x1 + 2x2

subject to :

3x1 + 4x2 ≥ 5
x1 + 2x3 = 3
x1 + x2 ≤ 4

1. As well as the usual slack / surplus variables, introduce a separate artificial variable ai into each equality and ≥ constraint :

3x1 + 4x2 − s1 + a1 = 5
x1 + 2x3 + a2 = 3
x1 + x2 + s2 = 4


2. Augment the objective function by penalizing non-zero values of the artificial variables :

P(x) = c^T x + Σ_{i=1..m} M ai

Choose a large value for M (say, 10 times the magnitude of the largest coefficient in the objective function). For the example minimization problem :

P (x) = x1 + 2x2 + 20a1 + 20a2
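A sketch of the augmented problem solved directly with SciPy's `linprog` (an assumed tool; the notes continue by hand in the tableau). With the variable ordering z = [x1, x2, x3, s1, s2, a1, a2], the artificials carry the penalty M = 20 and are driven to zero at the optimum:

```python
from scipy.optimize import linprog

# Big M example from the notes:  min x1 + 2*x2
# s.t. 3x1 + 4x2 >= 5,  x1 + 2x3 = 3,  x1 + x2 <= 4,  all variables >= 0
M = 20
c = [1, 2, 0, 0, 0, M, M]                # P(x) = x1 + 2*x2 + 20*a1 + 20*a2
A_eq = [[3, 4, 0, -1, 0, 1, 0],          # 3x1 + 4x2 - s1 + a1 = 5
        [1, 0, 2, 0, 0, 0, 1],           # x1 + 2x3 + a2 = 3
        [1, 1, 0, 0, 1, 0, 0]]           # x1 + x2 + s2 = 4
b_eq = [5, 3, 4]
res = linprog(c, A_eq=A_eq, b_eq=b_eq)   # default bounds give z >= 0
```

Because M is much larger than the true costs, the solver leaves both artificials at zero, so the reported objective is the true optimum of the original problem.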


3. Set up the tableau as usual. The initial basis will include all of the artificial variables and the required slack variables :


4. Phase 1 : modify the objective row in the tableau to reduce the coefficients of the artificial variables to zero :


5. Phase 2 : solve the problem as before, starting from the final tableau from Phase 1.


The final tableau is :


Simplex Algorithm (Matrix Form)

To develop the matrix form of the Linear Programming problem :

min_x c^T x

subject to :

Ax ≤ b
x ≥ 0

we introduce the slack variables and partition x, A and c as follows :

x = [xB ; xN]        A = [B | N]


c = [cB ; cN]

Then, the Linear Programming problem becomes :

min_x cB^T xB + cN^T xN

subject to :

B xB + N xN = b
xB, xN ≥ 0

Feasible values of the basic variables (xB) can be defined in terms of the values for the non-basic variables (xN) :

xB = B−1[b−NxN ]


The value of the objective function is given by :

P(x) = cB^T B^-1 [b − N xN] + cN^T xN

or

P(x) = cB^T B^-1 b + [cN^T − cB^T B^-1 N] xN

Then, the tableau we used to solve these problems can be represented as :

         xB^T     xN^T                        b
  xB      I       B^-1 N                      B^-1 b
          0      −[cN^T − cB^T B^-1 N]        cB^T B^-1 b


The Simplex Algorithm is :

1. form the B and N matrices. Calculate B−1.

2. calculate the shadow prices of the non-basic variables (xN) :

−[cN^T − cB^T B^-1 N]

3. calculate B−1N and B−1b.

4. find the pivot element by performing the ratio test using the column corresponding to the most negative shadow price.

5. the pivot column corresponds to the new basic variable and the pivot row corresponds to the new non-basic variable. Modify the B and N matrices accordingly. Calculate B^-1.

6. repeat steps 2 through 5 until there are no negative shadow prices.
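The six steps can be collected into a minimal dense implementation (a teaching sketch, not a robust solver: it assumes the input is min c^T x s.t. Ax ≤ b, x ≥ 0 with b ≥ 0, so the all-slack basis is a feasible start, and it recomputes B^-1 from scratch on every pass):

```python
import numpy as np

def revised_simplex(c, A, b, max_iter=100):
    """Minimize c @ x subject to A @ x <= b, x >= 0 (assumes b >= 0)."""
    m, n = A.shape
    A_aug = np.hstack([A, np.eye(m)])      # append slack variables
    c_aug = np.concatenate([c, np.zeros(m)])
    basis = list(range(n, n + m))          # start from the all-slack basis
    for _ in range(max_iter):
        B_inv = np.linalg.inv(A_aug[:, basis])       # step 1
        x_B = B_inv @ b
        lam = c_aug[basis] @ B_inv
        reduced = c_aug - lam @ A_aug                # step 2: shadow prices
        j = int(np.argmin(reduced))
        if reduced[j] >= -1e-9:                      # step 6: none negative -> optimal
            x = np.zeros(n + m)
            x[basis] = x_B
            return x[:n], float(c @ x[:n])
        d = B_inv @ A_aug[:, j]                      # step 3
        ratios = np.where(d > 1e-9, x_B / d, np.inf) # step 4: ratio test
        basis[int(np.argmin(ratios))] = j            # step 5: pivot
    raise RuntimeError("iteration limit reached")
```

For the toy problem min −x1 − 2x2 s.t. x1 + x2 ≤ 4, x2 ≤ 2 this walks the boundary to the vertex (2, 2) with objective −6.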


This method is computationally inefficient because you must calculate a complete inverse of the matrix B at each iteration of the SIMPLEX algorithm. There are several variations on the Revised SIMPLEX algorithm which attempt to minimize the computations and memory requirements of the method. Examples of these can be found in :

– Chvatal
– Edgar & Himmelblau (references in 7.7)
– Fletcher
– Gill, Murray and Wright


3 Duality & Linear Programming

Many of the traditional advanced topics in LP analysis and research are based on the so-called method of Lagrange. This approach to constrained optimization :

→ originated from the study of orbital motion.
→ has proven extremely useful in the development of :
  – optimality conditions for constrained problems,
  – duality theory,
  – optimization algorithms,
  – sensitivity analysis.


Consider the general Linear programming problem :

min_x c^T x

subject to :

Ax ≥ b
x ≥ 0

This problem is often called the Primal problem to indicate that it is defined in terms of the Primal variables (x).

You form the Lagrangian of this optimization problem as follows :

L(x, λ) = cTx− λT [Ax− b]

Notice that the Lagrangian is a scalar function of two sets of variables : the Primal variables x, and the Dual variables (or Lagrange multipliers) λ. Up until now we have been calling the Lagrange multipliers λ the shadow prices.


We can develop the Dual problem by first re-writing the Lagrangian as :

LT (x, λ) = xT c− [xTAT − bT ]λ

which can be re-arranged to yield :

LT (λ, x) = bTλ− xT [ATλ− c]

This Lagrangian looks very similar to the previous one, except that the Lagrange multipliers λ and problem variables x have switched places. In fact, this re-arranged form is the Lagrangian for the maximization problem :

max_λ b^T λ

subject to :

A^T λ ≤ c
λ ≥ 0

This formulation is often called the Dual problem to indicate that it is defined in terms of the Dual variables λ.


It is worth noting that in our market gardener example, if we calculate the optimum value of the objective function using the Primal variables x∗ :

P(x∗)|primal = c^T x∗ = 900 (6/5) + 1500 (3/5) = 1980

and using the Dual variables λ∗ :

P (λ∗)|dual = bTλ∗ = 3 (480) + 60 (9) = 1980

we get the same result. There is a simple explanation for this that we will examine later.


Besides being of theoretical interest, the Dual problem formulation can have practical advantages in terms of ease of solution. Consider that the Primal problem was formulated as :

min_x c^T x

subject to :

Ax ≥ b
x ≥ 0

As we saw earlier, problems with inequality constraints of this form often present some difficulty in determining an initial feasible starting point. They usually require a Phase 1 starting procedure such as the Big M method.


The Dual problem had the form :

max_λ b^T λ

subject to :

A^T λ ≤ c
λ ≥ 0

Problems such as these are usually easy to start since λ = 0 is a feasible point, so no Phase 1 starting procedure is required. As a result, you may want to consider solving the Dual problem when the origin is not a feasible starting point for the Primal problem.
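A quick numerical illustration with hypothetical numbers (SciPy's `linprog` assumed; it minimizes, so the dual maximization is run on the negated objective):

```python
from scipy.optimize import linprog

# Primal: min 2*x1 + 3*x2   s.t.  x1 + x2 >= 4,  x >= 0
# Dual:   max 4*lam         s.t.  lam <= 2,  lam <= 3,  lam >= 0
primal = linprog(c=[2, 3], A_ub=[[-1, -1]], b_ub=[-4])   # >= negated into <=
dual = linprog(c=[-4], A_ub=[[1], [1]], b_ub=[2, 3])     # max b'lam = -min(-b'lam)

primal_value = primal.fun
dual_value = -dual.fun      # undo the negation
```

Both values come out equal (8 here), and the dual starts trivially from the feasible point λ = 0.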


Rules for forming the dual

Primal                    Dual
Equality constraint       Free variable
Inequality constraint     Nonnegative variable
Free variable             Equality constraint
Nonnegative variable      Inequality constraint


4 Optimality Conditions

Consider the linear programming problem :

min_x c^T x

subject to :

[ M ]     [ bM ]   ← active
[ N ] x ≥ [ bN ]   ← inactive

– if equality constraints are present, they can be used to eliminate some of the elements of x, thereby reducing the dimensionality of the optimization problem.

– rows of the coefficient matrix (M) are linearly independent.


In our market gardener example, the problem looked like :

Form the Lagrangian :

L(x, λ) = c^T x − λM^T [Mx − bM] − λN^T [Nx − bN]

At the optimum, we know that the Lagrange multipliers (shadow prices) for the inactive inequality constraints are zero (i.e. λN = 0). Also, since at the optimum the active inequality constraints are exactly satisfied :

Mx∗ − bM = 0


Then, notice that at the optimum :

L(x∗, λ∗) = P (x∗) = cTx∗

At the optimum, we have seen that the shadow prices for the active inequality constraints must all be non-negative (i.e. λM ≥ 0). These multiplier values told us how much the optimum value of the objective function would increase if the associated constraint was moved into the feasible region, for a minimization problem.

Finally, the optimum (x∗, λ∗) must be a stationary point of the Lagrangian :

∇x L(x∗, λ∗) = ∇x P(x∗) − (λ∗M)^T ∇x gM(x∗) = c^T − (λ∗M)^T M = 0

∇λL(x∗, λ∗) = (gM (x∗))T = (Mx∗ − bM )T = 0


Thus, necessary and sufficient conditions for an optimum of a Linear Programming problem are :

– the rows of the active set matrix (M) must be linearly independent,

– the active set constraints are exactly satisfied at the optimum point x∗ :

Mx∗ − bM = 0

– the Lagrange multipliers for the inequality constraints are :

λ∗N = 0 and λ∗M ≥ 0

this is sometimes expressed as :

(λ∗M )T [Mx∗ − bM ] + (λ∗N )T [Nx∗ − bN ] = 0

– the optimum point (x∗, λ∗) is a stationary point of the Lagrangian :

∇xL(x∗, λ∗) = 0


5 Applications

Quadratic Programming Using LP

Consider an optimization problem of the form :

min_x (1/2) x^T H x + c^T x

subject to :

Ax ≥ b
x ≥ 0

where H is a symmetric, positive definite matrix.

This is a Quadratic Programming (QP) problem. Problems such as these are encountered in areas such as parameter estimation (regression), process control and so forth.


Recall that the Lagrangian for this problem would be formed as :

L(x, λ, m) = (1/2) x^T H x + c^T x − λ^T (Ax − b) − m^T x

where λ and m are the Lagrange multipliers for the constraints.

The optimality conditions for this problem are :

∇x L(x, λ, m) = x^T H + c^T − λ^T A − m^T = 0

Ax − b ≥ 0
x, λ, m ≥ 0

λ^T (Ax − b) + m^T x = 0


It can be shown that the solution to the QP problem is also a solution of the following LP problem :

min (1/2)[c^T x + b^T λ]

subject to :

Hx + c − A^T λ − m = 0

λ^T (Ax − b) + m^T x = 0

Ax − b ≥ 0
x, λ, m ≥ 0

This LP problem may prove easier to solve than the original QP problem.
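The optimality conditions above are easy to verify numerically. A minimal sketch on an assumed tiny QP with H = I, c = (−1, 1) and only the x ≥ 0 constraints (so the λ rows drop out), whose solution x∗ = (1, 0), m∗ = (0, 1) can be found by inspection:

```python
import numpy as np

# QP: min (1/2) x'Hx + c'x  subject to  x >= 0   (hypothetical data)
H = np.eye(2)
c = np.array([-1.0, 1.0])
x_star = np.array([1.0, 0.0])     # candidate solution
m_star = np.array([0.0, 1.0])     # multipliers for x >= 0

stationarity = H @ x_star + c - m_star   # gradient of the Lagrangian
complementarity = m_star @ x_star        # m'x, should vanish
```

All conditions hold: the stationarity residual is the zero vector, complementarity vanishes, and the point and multipliers are non-negative.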


Successive Linear Programming

Consider the general nonlinear optimization problem :

min_x P(x)

subject to :

h(x) = 0
g(x) ≥ 0

If we linearize the objective function and all of the constraints about some point xk, the original nonlinear problem may be approximated (locally) by :

min_x P(xk) + ∇x P|xk (x − xk)

subject to :

h(xk) + ∇x h|xk (x − xk) = 0

g(xk) + ∇x g|xk (x − xk) ≥ 0


Since xk is known, the problem is linear in the variables x and can be solved using LP techniques.

The solution to the approximate problem is x∗ and will often not be a feasible point of the original nonlinear problem (i.e. the point x∗ may not satisfy the original nonlinear constraints). There are a variety of techniques to correct the solution x∗ to ensure feasibility.

A possibility would be to search in the direction of x∗, starting from xk, to find a feasible point :

xk+1 = xk + α(x∗ − xk)

We will discuss such ideas later with nonlinear programming techniques.


The SLP algorithm is :

1. linearize the nonlinear problem about the current point xk,

2. solve the approximate LP,

3. correct the solution (x∗) to ensure feasibility of the new point xk+1 (i.e. find a point xk+1 in the direction of x∗ which satisfies the original nonlinear constraints),

4. repeat steps 1 through 3 until convergence.

The SLP technique has several disadvantages, of which slow convergence and infeasible intermediate solutions are the most important. However, the ease of implementation of the method can often compensate for the method's shortcomings.
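Steps 1 and 2 of one SLP pass can be sketched as follows, on an assumed toy problem (minimize the squared distance to the point (2, 1), subject to staying in the unit disk, g(x) = 1 − x1² − x2² ≥ 0). A simple trust region |d_i| ≤ δ is added so the linearization is only trusted locally; SciPy's `linprog` solves the LP subproblem:

```python
import numpy as np
from scipy.optimize import linprog

# Linearize about xk and solve for the step d = x - xk.
xk = np.array([0.5, 0.5])
delta = 0.5
gradP = 2 * (xk - np.array([2.0, 1.0]))   # gradient of P at xk
g0 = 1.0 - xk @ xk                        # g(xk)
gradg = -2 * xk                           # gradient of g at xk

# LP subproblem: min gradP . d  s.t.  g0 + gradg . d >= 0,  -delta <= d <= delta
res = linprog(c=gradP, A_ub=[-gradg], b_ub=[g0],
              bounds=[(-delta, delta)] * 2)
d = res.x
x_next = xk + d           # step 3 would now correct x_next for feasibility
```

Since d = 0 is feasible, the LP optimum can never increase the linearized objective, and the returned step always respects the linearized constraint.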


6 Sensitivity Analysis

Usually there is some uncertainty associated with the values used in an optimization problem. After we have solved an optimization problem, we are often interested in knowing how the optimal solution will change as specific problem parameters change. This is called by a variety of names, including :

– sensitivity analysis,
– post-optimality analysis,
– parametric programming.

In our Linear Programming problems we have made use of pricing information, but we know such prices can fluctuate. Similarly, the demand / availability information we have used is also subject to some fluctuation. Consider that in our market gardener problem, we might be interested in determining how the optimal solution changes when :

– the price of vegetables changes,
– the amount of available fertilizer changes,
– we buy a new truck.


In the market gardener example, we had :

– Changes in the pricing information (c^T) affect the slope of the objective function contours.

– Changes to the right-hand sides of the inequality constraints (b) translate the constraints without affecting their slope.

– Changes to the coefficient matrix of the constraints (A) affect the slopes of the corresponding constraints.


For the general Linear Programming problem :

min_x c^T x

subject to :

Ax ≥ b

we had the Lagrangian :

L(x, λ) = P (x)− λT g(x) = cTx− λT [Ax− b]

or in terms of the active and inactive inequality constraints :

L(x, λ) = cTx− λTM [Mx− bM ]− λTN [Nx− bN ]

Recall that at the optimum :

∇xL(x∗, λ∗) = cT − (λ∗M )TM = 0

∇λL(x∗, λ∗) = [Mx∗ − bM ] = 0

(λ∗N )T = 0


Notice that the first two sets of equations are linear in the variables of interest (x∗, λ∗M) and they can be solved to yield :

λ∗M = (M^T)^-1 c   and   x∗ = M^-1 bM

Then, as we discussed previously, it follows that :

P(x∗) = c^T x∗ = c^T M^-1 bM = (λ∗M)^T bM = bM^T λ∗M

We now have expressions for the complete solution of our Linear Programming problem in terms of the problem parameters.

In this course we will only consider two types of variation in our nominal optimization problem :

1. changes in the right-hand side of constraints (b). Such changes occur with variation in the supply / demand of materials, product quality requirements and so forth.

2. pricing (c) changes. The economics used in optimization arerarely known exactly.


For the more difficult problem of determining the effects of uncertainty in the coefficient matrix (A) on the optimization results, see :

Gal, T., Postoptimal Analysis, Parametric Programming andRelated Topics, McGraw-Hill, 1979.

Forbes, J.F., and T.E. Marlin, Model Accuracy for EconomicOptimizing Controllers : The Bias Update Case, Ind. Eng. Chem.Res., 33, pp. 1919-29, 1994.


For small (differential) changes in bM, which do not change the active constraint set :

∇b P(x∗) = ∇b [(λ∗)^T b] = (λ∗)^T

∇b x∗ = ∇b [M^-1 bM] = [M^-1 ; 0]

∇b λ∗ = ∇b [(M^T)^-1 c ; 0] = 0

Note that the Lagrange multipliers are not a function of bM as long as the active constraint set does not change.


For small (differential) changes in c, which do not change the active constraint set :

∇c P(x∗) = ∇c [c^T x∗] = (x∗)^T

∇c x∗ = ∇c [M^-1 bM] = 0

∇c λ∗ = ∇c [(M^T)^-1 c ; 0] = [(M^T)^-1 ; 0]


Note that x∗ is not an explicit function of the vector c and will not change so long as the active set remains fixed. If a change in c causes an element of λ to become negative, then x∗ will jump to a new vertex.

Also, it is worth noting that the optimal values P(x∗), x∗ and λ∗ are all affected by changes in the elements of the constraint coefficient matrix (M). This is beyond the scope of the course; the interested student can find some detail in the references previously provided.
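A numerical check of the sensitivity formulas on an assumed 2 × 2 active set (hypothetical data, with the constraints written in ≥ form to match the development above):

```python
import numpy as np

# min c'x  s.t.  Mx >= bM, both rows active at the optimum:
#   min -x1 - 2*x2   s.t.  -x1 - x2 >= -4,  -x2 >= -2
c = np.array([-1.0, -2.0])
M = np.array([[-1.0, -1.0],
              [0.0, -1.0]])
bM = np.array([-4.0, -2.0])

x_star = np.linalg.solve(M, bM)        # x* = M^-1 bM
lam_star = np.linalg.solve(M.T, c)     # lam* = (M^T)^-1 c
P_star = c @ x_star                    # equals bM' lam* at the optimum

db = np.array([1e-6, -2e-6])           # perturb the right-hand side: dP = lam*'db
P_pert_b = c @ np.linalg.solve(M, bM + db)
dc = np.array([1e-6, 1e-6])            # perturb the prices: dP = x*'dc
P_pert_c = (c + dc) @ x_star
```

Here x∗ = (2, 2), λ∗ = (1, 1), and both perturbations change P(x∗) by exactly λ∗ᵀδb and x∗ᵀδc respectively, as the gradients above predict.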

There is a simple geometrical interpretation for most of these results. Consider the 2-variable Linear Programming problem :

min_{x1,x2} c1 x1 + c2 x2

subject to :

[ g1(x) ]   [ a11 x1 + a12 x2 ]
[ g2(x) ] = [ a21 x1 + a22 x2 ] = Ax ≥ b
[  ...  ]   [       ...       ]


Note that the set of inequality constraints can be expressed :

     [ M ]     [ ∇x g1 ]
Ax = [ − ] x = [ ∇x g2 ] x ≥ b
     [ N ]     [  ...  ]

This two-variable optimization problem has an optimum at the intersection of the g1 and g2 constraints, and can be depicted as :


Summary

– The solution of a well-posed LP is uniquely determined by the intersection of the active constraints.

– most LPs are solved by the SIMPLEX method.

– commercial computer codes use the Revised SIMPLEX algorithm (most efficient).

– Optimization studies include :

* solution to the nominal optimization problem

* a sensitivity study to determine how uncertainty can affect the solution.


7 Interior Point Methods

Recall that the SIMPLEX method solved the LP problem by walking around the boundary of the feasible region. The algorithm moved from vertex to vertex on the boundary.

Interior point methods move through the feasible region toward the optimum. This is accomplished by :

1. Converting the LP problem into an unconstrained problem using Barrier Functions for the constraints at each new point (iterate),

2. Solving the optimality conditions for the resulting unconstrained problem,

3. Repeating until converged.

This is a very active research area and some exciting developments have occurred in the last two decades.
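A one-dimensional sketch of the barrier idea (assumed toy problem: min x subject to x ≥ 1). Each barrier subproblem min x − µ ln(x − 1) has its minimum at x = 1 + µ, so driving µ toward zero walks the iterates through the interior toward the constrained optimum x∗ = 1:

```python
x = 3.0                                # strictly feasible start
for mu in [1.0, 0.1, 0.01, 0.001]:     # decreasing barrier parameter
    for _ in range(100):               # damped Newton on f(x) = x - mu*log(x-1)
        t = x - 1.0
        grad = 1.0 - mu / t
        hess = mu / t ** 2
        step = grad / hess
        while x - step <= 1.0:         # backtrack to stay strictly feasible
            step *= 0.5
        x -= step
```

At exit x ≈ 1 + µ for the last µ, i.e. just inside the constraint; the sequence of subproblem minimizers traces the so-called central path.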


Chapter III

Unconstrained Univariate Optimization

– Introduction

– Interval Elimination Methods

– Polynomial Approximation Methods

– Newton’s Method

– Quasi-Newton Methods


1 Introduction

Univariate optimization means optimization of a scalar function of a single variable :

y = P (x)

These optimization methods are important for a variety of reasons :

1. there are many instances in engineering when we want to find the optimum of functions such as these (e.g. optimum reactor temperature, etc.),

2. almost all multivariable optimization methods in commercial use today contain a line search step in their algorithm,

3. they are easy to illustrate, and many of the fundamental ideas are directly carried over to multivariable optimization.

Of course these discussions will be limited to nonlinear functions, but before discussing optimization methods we need some mathematical background.


Continuity :

In this course we will limit our discussions to continuous functions.

Functions can be continuous, but their derivatives may not be.


A function P(x) is continuous at a point x0 iff :

P(x0) exists   and   lim_{x→x0⁻} P(x) = lim_{x→x0⁺} P(x) = P(x0)

Some discontinuous functions include :

→ price of processing equipment,

→ manpower costs,

→ measurements, . . .

Many of the functions we deal with in engineering are either continuous or can be approximated as continuous functions. However, there is a growing interest in solving optimization problems with discontinuities.


Convexity of Functions

In most optimization problems we require functions which are either convex or concave over some region (preferably globally) :

This can be expressed mathematically as :

i) for a convex function and ∀α ∈ [0, 1],

P (αx+ (1− α)y) ≤ αP (x) + (1− α)P (y)

ii) for a concave function and ∀α ∈ [0, 1],

P (αx+ (1− α)y) ≥ αP (x) + (1− α)P (y)
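The defining inequality can be spot-checked numerically by sampling random pairs (x, y) and weights α (a heuristic sketch: passing proves nothing, but any violation disproves convexity on the interval):

```python
import numpy as np

def looks_convex(P, lo, hi, n=200, tol=1e-9):
    """Test P(a*x + (1-a)*y) <= a*P(x) + (1-a)*P(y) on random samples."""
    rng = np.random.default_rng(0)
    x = rng.uniform(lo, hi, n)
    y = rng.uniform(lo, hi, n)
    a = rng.uniform(0.0, 1.0, n)
    lhs = P(a * x + (1 - a) * y)
    rhs = a * P(x) + (1 - a) * P(y)
    return bool(np.all(lhs <= rhs + tol))
```

For example, `looks_convex(lambda t: t**2, -5, 5)` holds, while `looks_convex(np.sin, -3, 3)` fails because sin is concave on part of that interval.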


Convexity is a general idea used in optimization theory which, when applied to a function, describes the property of having an optimum. Convexity combines both stationarity and curvature into a single concept. Often both convex and concave functions are described as exhibiting the property of convexity.

These conditions are very inconvenient for testing a specific function, so we tend to test the derivatives of the function at selected points. First derivatives give us a measure of the rate of change of the function. Second derivatives give us a measure of curvature, or the rate of change of the first derivatives.


Notice for a convex function :

The first derivatives increase (become more positive) as x increases. Thus, in the region we are discussing, the second derivative is positive. Since the first derivative starts out negative and becomes positive in the region of interest, there is some point where the function no longer decreases in value as x increases, and this point is the function minimum.


This is another way of stating that the point x∗ is a (local) minimum of the function P(x) iff :

P (x∗) ≤ P (x), ∀x ∈ [a, b]

If this can be stated as a strict inequality, x∗ is said to be a unique (local) minimum.

A similar line of argument can be followed for concave functions, with the exceptions that : the first derivatives decrease (become more negative) with increasing x, the second derivative is negative and, as a result, the optimum is a maximum.

The problem with these ideas is that we can only calculate the derivatives at a point. So looking for an optimum using these fundamental ideas would be infeasible, since a very large number of points (theoretically all of the points) in the region of interest must be checked.


Necessary and Sufficient Conditions for an Optimum

Recall that for a twice continuously differentiable function P(x), the point x∗ is an optimum iff :

dP/dx |x∗ = 0   (stationarity)

and

d²P/dx² |x∗ > 0   a minimum

d²P/dx² |x∗ < 0   a maximum

What happens when the second derivative is zero ?


Consider the function :

y = x⁴

which we know to have a minimum at x = 0. At the point x = 0 :

dy/dx |₀ = 4x³ = 0,

d²y/dx² |₀ = 12x² = 0,

d³y/dx³ |₀ = 24x = 0,

d⁴y/dx⁴ |₀ = 24 > 0


We need to add something to our optimality conditions. If the second derivative at the stationary point x∗ is zero, we must check the higher-order derivatives. If the first non-zero derivative is odd, that is :

dⁿP/dxⁿ |x∗ ≠ 0 where n ∈ {3, 5, 7, . . .}

then x∗ is an inflection point. Otherwise, if the first higher-order non-zero derivative is even :

dⁿP/dxⁿ |x∗ > 0   a minimum

dⁿP/dxⁿ |x∗ < 0   a maximum
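This higher-order test is easy to automate symbolically; a sketch with SymPy (an assumed tool; `max_order` is an arbitrary cutoff):

```python
import sympy as sp

x = sp.Symbol('x')

def classify_stationary(P, x0, max_order=8):
    """Classify a stationary point x0 of P(x) via the first non-zero
    higher-order derivative at x0."""
    for n in range(2, max_order + 1):
        dn = sp.diff(P, x, n).subs(x, x0)
        if dn != 0:
            if n % 2 == 1:                 # first non-zero derivative is odd
                return "inflection point"
            return "minimum" if dn > 0 else "maximum"
    return "undetermined"
```

For example, x⁴ is classified as a minimum at 0 (the first non-zero derivative is the fourth, which is positive), while x³ gives an inflection point.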


Some Examples

1. min_x ax² + bx + c

stationarity :

dP/dx = 2ax + b = 0 → x∗ = −b / 2a

curvature :

d²P/dx² = 2a →  a > 0 a minimum
                a < 0 a maximum


2. min_x  x³ − 2x² − 5x + 6

stationarity :

dP/dx = 3x² − 4x − 5 = 0  →  x∗ = (2 ± √19)/3

curvature :

d²P/dx² |_{x∗} = 6x∗ − 4 = { +2√19 > 0  a minimum
                            −2√19 < 0  a maximum

3. min_x  (x − 1)^n


4. A more realistic example. Consider the situation where we have two identical settling ponds in a wastewater treatment plant :

These are normally operated at a constant throughput flow F, with an incoming concentration of contaminants ci. The system is designed so that, at nominal operating conditions, the outgoing concentration co is well below the required levels. However, we know that twice a day (7 am and 7 pm) the incoming concentration of contaminants to this section of the plant increases dramatically above these nominal levels. What we would like to determine is whether the peak outgoing levels will remain within acceptable limits.


What assumptions can we make ?

1) well mixed,
2) constant density,
3) constant flowrate.

contaminant balances :

d(ρV co)/dt = F c1 − F co   (second pond)

d(ρV c1)/dt = F ci − F c1   (first pond)

Let the residence time be τ = ρV/F. Then the balances become :

τ dco/dt = c1 − co   (second pond)

τ dc1/dt = ci − c1   (first pond)


Since we wish to know how changes in ci will affect the outlet concentration co, define perturbation variables around the nominal operating point and take Laplace transforms :

τs Co(s) = C1(s) − Co(s)   (second pond)

τs C1(s) = Ci(s) − C1(s)   (first pond)

Combining and simplifying into a single transfer function :

Co(s) = [1/(τs + 1)²] Ci(s)

If we represent the concentration disturbances as ideal impulses :

Ci(s) = k

then the transfer function of the outlet concentration becomes :

Co(s) = k/(τs + 1)²


In the time-domain this is :

co(t) = k t e^{−t/τ}

We now want to determine :

max_t  co(t) = max_t  k t e^{−t/τ}


stationarity :

dco/dt = k e^{−t/τ} (1 − t/τ) = 0  =⇒  1 − t/τ = 0  =⇒  t = τ

curvature :

d²co/dt² = (k/τ)(t/τ − 2) e^{−t/τ} |_{t=τ} = −k/(τe) < 0  =⇒  a maximum (k > 0)

Finally, the maximum outlet concentration is :

co(τ) = k τ e^{−1} ≈ 0.368 k ρV/F
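This analytical result can be checked by a brute-force scan of co(t); a quick sketch with k = τ = 1 chosen for illustration :

```python
import math

def co(t, k=1.0, tau=1.0):
    """Outlet concentration response to an impulse disturbance."""
    return k * t * math.exp(-t / tau)

# scan a fine grid over 0 < t <= 10 and pick the largest response
ts = [i * 1e-4 for i in range(1, 100001)]
t_best = max(ts, key=co)

print(t_best)        # the analytic optimum is t = tau = 1
print(co(t_best))    # the analytic maximum is k*tau/e, about 0.368
```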


This example serves to highlight some difficulties with the analytical approach. These include :

→ determining when the first derivative goes to zero. This required the solution of a nonlinear equation, which is often as difficult as the solution of the original optimization problem,

→ computation of the appropriate higher-order derivatives,

→ evaluation of the higher-order derivatives at the appropriate points.

We will need more efficient methods for optimization of complex nonlinear functions. Existing numerical methods can be classified into two broad types :

1. those that rely solely on the evaluation of the objective function,

2. those that use derivatives (or approximations of the derivatives) of the objective function.


2 Interval Elimination Methods

These methods work by systematically reducing the size of the interval within which the optimum is believed to exist. Since we are only ever working with an interval, we can never find the optimum exactly. There are many interval elimination algorithms; however, they all consist of the same steps :

1. determine an interval in which the optimum is believed to exist,

2. reduce the size of the interval,

3. check for convergence,

4. repeat steps 2 & 3 until converged.

We will examine only one interval elimination method, called the "Golden Section Search".


Scanning & bracketing

As an example, consider that our objective is to minimize a function P(x). In order to start the optimization procedure, we need to determine upper and lower bounds for an interval in which we believe an optimum will exist. These could be established by :

– physical limits,

– scanning,

· choosing a starting point,

· checking the function initially decreases,

· finding another point of higher value.


Golden Section Search

Consider that we want to break up the interval, in which we believe the optimum exists, into two regions :

where :

l1/l2 = l2/L = r

Since L = l1 + l2, l2 = rL and l1 = r²L, then :

L = rL + r²L

or :

r² + r − 1 = 0


There are two solutions to this quadratic equation. We are only interested in the positive root :

r = (−1 + √5)/2 ≈ 0.61803398 . . .   (the Golden ratio)

which has the interesting property :

r = 1/(r + 1)

We will use this Golden ratio to break up our interval as follows :


By breaking up the interval in this way, we will be able to limit ourselves to one simple calculation of a new point xi and one objective function evaluation per iteration.

The optimization algorithm is :

1. using the interval limits a_k and b_k, determine x1^k and x2^k,

2. evaluate P(x1^k) and P(x2^k),

3. eliminate the part of the interval in which the optimum is not located,

4. repeat steps 1 through 3 until the desired accuracy is obtained.

How can we decide which part of the interval to eliminate ?


There are only three possible situations. For the minimization case, consider :


Graphically the algorithm is :

Three things worth noting :

1. only one new point needs to be evaluated during each iteration,


2. the length of the remaining interval after k iterations (or the accuracy to which you know the optimum value of x) is :

L_k = r^k L_0 ≈ (0.618034)^k L_0

3. the new point for the current iteration is :

x1^k = a_k + b_k − x2^k

or :

x2^k = a_k + b_k − x1^k


Example

Find the solution to within 5% for

minxx2 − 2x+ 1

Guess an initial interval for the optimum as 0 ≤ x∗ ≤ 2.
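The algorithm can be coded in a few lines. A minimal sketch for this example (the function, interval, and 5% tolerance follow the problem statement) :

```python
import math

def golden_section_min(f, a, b, tol):
    """Golden section search for the minimum of a unimodal f on [a, b]."""
    r = (math.sqrt(5) - 1) / 2                 # golden ratio, about 0.618034
    x1, x2 = b - r * (b - a), a + r * (b - a)  # interior points
    f1, f2 = f(x1), f(x2)
    while (b - a) > tol:
        if f1 < f2:                  # optimum lies in [a, x2] : drop [x2, b]
            b, x2, f2 = x2, x1, f1
            x1 = b - r * (b - a)
            f1 = f(x1)               # only one new evaluation per iteration
        else:                        # optimum lies in [x1, b] : drop [a, x1]
            a, x1, f1 = x1, x2, f2
            x2 = a + r * (b - a)
            f2 = f(x2)
    return (a + b) / 2

# min x^2 - 2x + 1 on [0, 2]; the true minimum is at x* = 1
x_star = golden_section_min(lambda x: x**2 - 2*x + 1, 0.0, 2.0, tol=0.1)
print(round(x_star, 3))
```

Note how the golden-ratio spacing lets each iteration reuse one of the previous interior points, so only one new evaluation of P is needed per iteration.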


Note that the number of calculations we have performed is :

· (k + 2) = 8 evaluations of P(xi),
· (k + 2) = 8 determinations of a new point xi.

There are a variety of other interval elimination methods, including Interval Bisection, Fibonacci Searches, and so forth. They are all simple to implement and not computationally intensive. They tend to be fairly robust but somewhat slow to converge.


3 Polynomial Approximation Methods

There are a number of polynomial approximation methods, which differ only in the choice of polynomial used to locally approximate the function to be optimized. Polynomials are attractive since they are easily fit to data, and the optima of low-order polynomials are simple to calculate. We will only discuss Successive Quadratic Approximation, as it illustrates the basic ideas and will carry over into multivariable optimization techniques.

Successive Quadratic Approximation

Suppose we have evaluated some test points of a function as follows :


We could approximate the objective function by fitting a quadratic P(x) ≈ ax² + bx + c through the test points.

The main advantage of using a quadratic approximation is that we know its optimum is at :

x∗ = −b/(2a)


The first step is to calculate the approximation constants (a, b, c) in the quadratic from the available points (x1, x2, x3). Using these three evaluation points we can write :

P(x1) = a x1² + b x1 + c
P(x2) = a x2² + b x2 + c
P(x3) = a x3² + b x3 + c

We have three equations and three unknowns, so in general the constants (a, b, c) can be uniquely determined. In matrix form, the set of equations is written :

[ x1²  x1  1 ] [ a ]   [ P(x1) ]
[ x2²  x2  1 ] [ b ] = [ P(x2) ]
[ x3²  x3  1 ] [ c ]   [ P(x3) ]


which has the solution :

[ a ]   [ x1²  x1  1 ]⁻¹ [ P(x1) ]
[ b ] = [ x2²  x2  1 ]   [ P(x2) ]
[ c ]   [ x3²  x3  1 ]   [ P(x3) ]

Some algebra yields the solution for the minimum as :

x∗ = (1/2) · [ (x2² − x3²) P(x1) + (x3² − x1²) P(x2) + (x1² − x2²) P(x3) ]
             / [ (x2 − x3) P(x1) + (x3 − x1) P(x2) + (x1 − x2) P(x3) ]


Graphically, the situation is :

To perform the next iteration, we need to choose which three points to use next. If you are doing hand calculations, choose the point xi with the smallest value of the objective function P(x) and the points immediately on either side. Note that x∗ will not necessarily have the smallest objective function value.


Finally, the accuracy to which the optimum is known at any iteration is given by the two points which bracket the point with the smallest objective function value.

The procedure is :

1. choose three points that bracket the optimum,

2. determine a quadratic approximation for the objective function based on these three points,

3. calculate the optimum of the quadratic approximation (thepredicted optimum x∗ ),

4. repeat steps 1 to 3 until the desired accuracy is reached.


Example : revisit the wastewater treatment problem with τ = 1.

max_t  t e^{−t} = min_t  −t e^{−t}

it   t1    t2      t3      P(t1)    P(t2)    P(t3)    t∗       P(t∗)
0    0.5   1.5     2.5     -0.3033  -0.3347  -0.2052  1.1953   -0.3617
1    0.5   1.1953  1.5     -0.3033  -0.3617  -0.3347  1.0910   -0.3664
2    0.5   1.0910  1.1953  -0.3033  -0.3664  -0.3617  1.0396   -0.3676
3    0.5   1.0396  1.0910  -0.3033  -0.3676  -0.3664  1.0185   -0.3678
4    0.5   1.0185  1.0396  -0.3033  -0.3678  -0.3676  1.0083   -0.3679
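The table can be reproduced directly from the formula for x∗; a sketch of a single iteration :

```python
import math

def quad_min(x1, x2, x3, P):
    """Optimum of the quadratic fitted through (x1, P(x1)), (x2, P(x2)), (x3, P(x3))."""
    P1, P2, P3 = P(x1), P(x2), P(x3)
    num = (x2**2 - x3**2) * P1 + (x3**2 - x1**2) * P2 + (x1**2 - x2**2) * P3
    den = (x2 - x3) * P1 + (x3 - x1) * P2 + (x1 - x2) * P3
    return 0.5 * num / den

P = lambda t: -t * math.exp(-t)        # minimize -t e^{-t} (tau = 1)

t_star = quad_min(0.5, 1.5, 2.5, P)    # first row of the table
print(round(t_star, 4))                # prints: 1.1953
```

Subsequent iterations simply call `quad_min` again with the three retained points from the previous row.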


4 Newton’s Method

This is a derivative-based method that uses first and second derivatives to approximate the objective function. The optimum of the objective function is then estimated using this approximation.

Consider the 2nd-order Taylor series approximation of our objective function :

P(x) ≈ P(x_k) + dP/dx|_{x_k} (x − x_k) + (1/2) d²P/dx²|_{x_k} (x − x_k)²

Then we can approximate the first derivative of the objective function as :

dP/dx ≈ dP/dx|_{x_k} + d²P/dx²|_{x_k} (x − x_k)


Since we require the first derivative of the objective function to be zero at the optimum :

dP/dx ≈ dP/dx|_{x_k} + d²P/dx²|_{x_k} (x − x_k) = 0

A little algebra allows us to estimate the optimum value of x :

x∗ = x_{k+1} = x_k − [ dP/dx|_{x_k} / d²P/dx²|_{x_k} ]

This is called a Newton step, and we can iterate on this equation to find the optimum value of x.


Graphically, this is :

Newton's method attempts to find the stationary point of a function by successive approximation using the first and second derivatives of the original objective function.


Example : revisit the wastewater treatment problem with τ = 1.

max_t  t e^{−t} = min_t  −t e^{−t}

it   t_k     P'(t_k)   P''(t_k)   P(t_k)
0    0.000   1.000     -2.000      0.000
1    0.500   0.3033    -0.9098    -0.3033
2    0.8333  0.0724    -0.5070    -0.3622
3    0.9762  0.0090    -0.3857    -0.3678
4    0.9994  0.0002    -0.3683    -0.3679
5    1.000
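The iteration in the table can be sketched as follows (the derivative columns are those of the max form t e^{−t}, as in the table) :

```python
import math

def newton_opt(dP, d2P, x0, tol=1e-6, max_iter=50):
    """Newton's method for a stationary point: x_{k+1} = x_k - P'(x_k)/P''(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = dP(x) / d2P(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# maximize t e^{-t}: P'(t) = (1 - t) e^{-t}, P''(t) = (t - 2) e^{-t}
dP  = lambda t: (1 - t) * math.exp(-t)
d2P = lambda t: (t - 2) * math.exp(-t)

t_star = newton_opt(dP, d2P, 0.0)
print(round(t_star, 4))    # prints: 1.0
```

Starting from t = 0, the first two Newton steps land on 0.5 and 0.8333, reproducing the table rows above.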


In summary, Newton's Method :

1. shows relatively fast convergence (it finds the optimum of a quadratic in a single step),

2. requires first and second derivatives of the function,

3. has a computational load of one first and one second derivative evaluation, plus a Newton step calculation, per iteration,

4. may oscillate or fail to converge when there are many local optima or stationary points.


5 Quasi-Newton Methods

A major drawback of Newton's method is that it requires us to have analytically determined both the first and second derivatives of our objective function. Often this is considered onerous, particularly in the case of the second derivative. The large family of optimization algorithms that use finite difference approximations of the derivatives are called "Quasi-Newton" methods.

There are a variety of ways first derivatives can be approximated using finite differences :

P'(x) ≈ [P(x + ∆x) − P(x − ∆x)] / (2∆x)

P'(x) ≈ [P(x + ∆x) − P(x)] / ∆x

P'(x) ≈ [P(x) − P(x − ∆x)] / ∆x


Approximations for the second derivative can be determined in terms of either the first derivatives or the original objective function :

P''(x) ≈ [P'(x + ∆x) − P'(x − ∆x)] / (2∆x)

P''(x) ≈ [P'(x + ∆x) − P'(x)] / ∆x

P''(x) ≈ [P'(x) − P'(x − ∆x)] / ∆x

P''(x) ≈ [P(x + ∆x) − 2P(x) + P(x − ∆x)] / ∆x²

P''(x) ≈ [P(x + 2∆x) − 2P(x + ∆x) + P(x)] / ∆x²

P''(x) ≈ [P(x) − 2P(x − ∆x) + P(x − 2∆x)] / ∆x²
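As a quick sanity check of the central-difference formulas on the running example (τ = 1; the step size ∆x is an arbitrary choice) :

```python
import math

P = lambda t: -t * math.exp(-t)      # objective from the running example
dx = 1e-4

def d1_central(P, x, dx):
    """Central-difference approximation of P'(x)."""
    return (P(x + dx) - P(x - dx)) / (2 * dx)

def d2_central(P, x, dx):
    """Central-difference approximation of P''(x)."""
    return (P(x + dx) - 2 * P(x) + P(x - dx)) / dx**2

x = 0.5
print(d1_central(P, x, dx))    # exact value is (x - 1) e^{-x}, about -0.3033
print(d2_central(P, x, dx))    # exact value is (2 - x) e^{-x}, about  0.9098
```

Note the trade-off in choosing ∆x: too large and the truncation error dominates, too small and floating-point cancellation in the numerator dominates.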


The major differences within the Quasi-Newton family of algorithms arise from : the number of function and/or derivative evaluations required, the speed of convergence, and the stability of the algorithm.


Regula Falsi

This method approximates the second derivative as :

d²P/dx² ≈ [P'(xq) − P'(xp)] / (xq − xp)

where xp and xq are chosen so that P'(xq) and P'(xp) have different signs (i.e. the two points bracket the stationary point).

Substituting this approximation into the formula for a Newton step yields :

x∗ = x_{k+1} = xq − dP/dx|_{xq} · (xq − xp) / ( dP/dx|_{xq} − dP/dx|_{xp} )


The Regula Falsi method iterates on this formula, taking care on each iteration to retain two points :

x∗_k and either xp or xq

depending upon :

sign[P'(x∗_k)] = sign[P'(xp)]  =⇒  keep x∗_k and xq

or :

sign[P'(x∗_k)] = sign[P'(xq)]  =⇒  keep x∗_k and xp


Graphically :


Example : revisit the wastewater treatment problem with τ = 1.

max_t  t e^{−t} = min_t  −t e^{−t}

it   xp   xq       x∗_k     P'(xp)   P'(xq)   P'(x∗_k)
0    0    2.000    1.7616   -1       0.1353   0.1308
1    0    1.7616   1.5578   -1       0.1308   0.1175
2    0    1.5578   1.3940   -1       0.1175   0.0978
3    0    1.3940   1.2699   -1       0.0978   0.0758
4    0    1.2699   1.1804   -1       0.0758   0.0554
5    0    1.1804   1.1184   -1       0.0554   0.0387
6    0    1.1184   1.0768   -1       0.0387   0.0262

Note : the computational load is one derivative evaluation, a sign check, and a step calculation per iteration.
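A sketch of the Regula Falsi iteration for this example (minimizing −t e^{−t}, so P'(t) = (t − 1) e^{−t}; the iteration count is chosen to match the table) :

```python
import math

def regula_falsi(dP, xp, xq, n_iter):
    """Regula Falsi search for a stationary point; dP must change sign on [xp, xq]."""
    for _ in range(n_iter):
        x = xq - dP(xq) * (xq - xp) / (dP(xq) - dP(xp))
        # retain the two points whose derivatives still differ in sign
        if math.copysign(1, dP(x)) == math.copysign(1, dP(xp)):
            xp = x
        else:
            xq = x
    return x

# minimize -t e^{-t}: P'(t) = (t - 1) e^{-t} is -1 at t = 0 and +0.1353 at t = 2
dP = lambda t: (t - 1) * math.exp(-t)

t_hat = regula_falsi(dP, 0.0, 2.0, n_iter=7)
print(round(t_hat, 3))    # about 1.077, cf. the last row of the table
```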


In summary, Quasi-Newton methods :

1. show slower convergence rates than Newton's Method,

2. use finite difference approximations for derivatives of the objective function,

3. may oscillate or fail to converge when there are many local optima or stationary points.


6 Summary

There are only two basic ideas in unconstrained univariate optimization :

1. direct search (scanning, bracketing, interval elimination, approximation),

2. derivative-based methods (optimality conditions, Newton and Quasi-Newton methods).

The direct search methods work directly on the objective function to successively determine better values of x. The derivative-based methods try to find the stationary point of the objective function, usually by some iterative procedure.


Choosing which method to use is always a trade-off between :

1. ease of implementation,

2. convergence speed,

3. robustness,

4. computational load (number of function evaluations, derivative evaluations / approximations, etc.).

There are no clear answers as to which univariate method is better. The best choice is problem dependent :

· for nearly quadratic functions, the Newton / Quasi-Newton methods are generally a good choice,

· for very flat functions, interval elimination methods can be very effective.


When the univariate search is buried inside a multivariate code, the best choice of univariate search will depend upon the properties and requirements of the multivariate search method.

Finally, remember that to solve an unconstrained univariate optimization problem you must :

1. have a function (and derivatives) you can evaluate,

2. choose an optimization method appropriate for the problem,

3. choose a starting point (or interval),

4. select a termination criterion (or accuracy).


Chapter IV

Unconstrained Multivariate Optimization

– Introduction

– Sequential Univariate Searches

– Gradient-Based Methods

– Newton’s Method

– Quasi-Newton Methods


1 Introduction

Multivariate optimization means optimization of a scalar function of several variables :

y = P(x)

and has the general form :

min_x  P(x)

where P(x) is a nonlinear, scalar-valued function of the vector variable x.


Background

Before we discuss optimization methods, we need to talk about how to characterize nonlinear, multivariable functions such as P(x).

Consider the 2nd-order Taylor series expansion about the point x0 :

P(x) ≈ P(x0) + ∇xP|_{x0} (x − x0) + (1/2) (x − x0)^T ∇²xP|_{x0} (x − x0)

If we let :

a = P(x0) − ∇xP|_{x0} x0 + (1/2) x0^T ∇²xP|_{x0} x0

b^T = ∇xP|_{x0} − x0^T ∇²xP|_{x0}

H = ∇²xP|_{x0}


Then we can re-write the Taylor series expansion as a quadratic approximation for P(x) :

P(x) = a + b^T x + (1/2) x^T H x

and the derivatives are :

∇xP(x) = b^T + x^T H

∇²xP(x) = H

We can describe some of the local geometric properties of P(x) using its gradient and Hessian. In fact, there are only a few possibilities for the local geometry, which can easily be differentiated by the eigenvalues of the Hessian H.

Recall that the eigenvalues of a square matrix H are computed by finding all of the roots λ of its characteristic equation :

|λI − H| = 0


The possible geometries are :

1. if λi < 0 (i = 1, 2, . . . , n), the Hessian is said to be negative definite. This object has a unique maximum and is what we commonly refer to as a hill (in three dimensions).


2. if λi > 0 (i = 1, 2, . . . , n), the Hessian is said to be positive definite. This object has a unique minimum and is what we commonly refer to as a valley (in three dimensions).


3. if λi < 0 (i = 1, 2, . . . ,m) and λi > 0 (i = m + 1, . . . , n), the Hessian is said to be indefinite. This object has neither a unique minimum nor a unique maximum, and is what we commonly refer to as a saddle (in three dimensions).


4. if λi < 0 (i = 1, 2, . . . ,m) and λi = 0 (i = m + 1, . . . , n), the Hessian is said to be negative semi-definite. This object does not have a unique maximum, and is what we commonly refer to as a ridge (in three dimensions).


5. if λi > 0 (i = 1, 2, . . . ,m) and λi = 0 (i = m + 1, . . . , n), the Hessian is said to be positive semi-definite. This object does not have a unique minimum, and is what we commonly refer to as a trough (in three dimensions).


A well-posed problem has a unique optimum, so we will limit our discussion to problems with either positive definite Hessians (for minimization) or negative definite Hessians (for maximization).

Further, we would prefer to choose units for our decision variables x so that the eigenvalues of the Hessian all have approximately the same magnitude. This scales the problem so that the profit contours are roughly concentric circles, which conditions our optimization calculations.


Necessary and Sufficient Conditions

For a twice continuously differentiable scalar function P(x), a point x∗ is an optimum if :

∇xP|_{x∗} = 0

and :

∇²xP|_{x∗} is positive definite (a minimum)

∇²xP|_{x∗} is negative definite (a maximum)

We can use these conditions directly, but this usually involves solving a set of simultaneous nonlinear equations, which is usually just as tough as the original optimization problem.


Consider :

P(x) = x^T A x · e^{x^T A x}

Then :

∇xP = 2(x^T A) e^{x^T A x} + x^T A x · 2(x^T A) e^{x^T A x}

    = 2(x^T A) e^{x^T A x} (1 + x^T A x)

and stationarity of the gradient requires that :

∇xP = 2(x^T A) e^{x^T A x} (1 + x^T A x) = 0

This is a set of very nonlinear equations in the variables x.


Example

Consider the scalar function :

P(x) = 3 + x1 + 2x2 + 4x1x2 + x1² + x2²

or :

P(x) = 3 + [1  2] x + x^T [1  2 ; 2  1] x

where x = [x1  x2]^T (matrix rows separated by ";").

Stationarity of the gradient requires :

∇xP = [1  2] + 2x^T [1  2 ; 2  1] = 0

x = −(1/2) [1  2 ; 2  1]⁻¹ [1 ; 2] = [−0.5 ; 0]


Check the Hessian to classify the type of stationary point :

∇²P = 2 [1  2 ; 2  1] = [2  4 ; 4  2]

The eigenvalues of the Hessian are found from :

|λI − [2  4 ; 4  2]| = |λ−2  −4 ; −4  λ−2| = (λ − 2)² − 16 = 0

λ1 = 6 and λ2 = −2

Thus the Hessian is indefinite and the stationary point is a saddle point. This is a trivial example, but it highlights the general procedure for direct use of the optimality conditions.
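This example is easy to verify numerically; a sketch using NumPy (assumed available) :

```python
import numpy as np

# quadratic part of P(x) = 3 + b^T x + x^T A x, from the example
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])
b = np.array([1.0, 2.0])

# stationarity: b^T + 2 x^T A = 0  =>  x = -(1/2) A^{-1} b
x_star = -0.5 * np.linalg.solve(A, b)
print(x_star)                      # the stationary point, x* = [-0.5, 0]

# the Hessian and its eigenvalues classify the stationary point
H = 2 * A
eigvals = np.linalg.eigvalsh(H)    # returned in ascending order
print(eigvals)                     # one negative, one positive => saddle point
```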


Like the univariate search methods we studied earlier, multivariate optimization methods can be separated into two groups :

i) those which depend solely on function evaluations,

ii) those which use derivative information (either analytical derivatives or numerical approximations).

Regardless of which method you choose to solve a multivariate optimization problem, the general procedure will be :

1) select a starting point(s),
2) choose a search direction,
3) minimize in the chosen search direction,
4) repeat steps 2 & 3 until converged.


Also, successful solution of an optimization problem will require specification of a convergence criterion. Some possibilities include :

‖x_{k+1} − x_k‖ ≤ γ

‖P(x_{k+1}) − P(x_k)‖ ≤ δ

‖∇xP|_{x_k}‖ ≤ ε


2 Sequential Univariate Searches

Perhaps the simplest multivariable search technique to implement would be a system of sequential univariate searches along some fixed set of directions. Consider the two-dimensional case, where the chosen search directions are parallel to the coordinate axes :


In this algorithm, you :

1. select a starting point x0,

2. select a coordinate direction (e.g. s = [0 1]T , or [1 0]T ),

3. perform a univariate search,

min_α  P(xk + αs)

4. repeat steps 2 and 3 alternating between search directions,until converged.

The problem with this method is that a very large number of iterations may be required to attain a reasonable level of accuracy. If we knew something about the "orientation" of the objective function, the rate of convergence could be greatly enhanced.
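The algorithm above can be sketched in a few lines. A minimal Python sketch, assuming SciPy's scalar minimizer for each line search (the helper name `coordinate_search` is an illustrative choice, not from the notes) :

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_search(P, x0, tol=1e-8, max_cycles=100):
    """Sequential univariate searches along the coordinate axes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_cycles):
        x_old = x.copy()
        for i in range(x.size):               # one line search per axis
            s = np.zeros(x.size)
            s[i] = 1.0
            x = x + minimize_scalar(lambda a: P(x + a * s)).x * s
        if np.linalg.norm(x - x_old) <= tol:  # ||x_{k+1} - x_k|| <= gamma
            return x
    return x

# Well-scaled quadratic with minimum at (1, 1)
P = lambda x: x[0]**2 + x[1]**2 - 2*x[0] - 2*x[1] + 2
x_star = coordinate_search(P, [3.0, 3.0])
```

For this separable, well-scaled quadratic a single cycle of axis searches already lands on the minimizer ; on poorly scaled problems the same loop can take very many cycles, which is the point made above.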


Consider the previous two-dimensional problem, but using the independent search directions ([1 1]ᵀ, [1 −1]ᵀ).

Of course, finding the optimum in n steps only works for quadratic objective functions where the Hessian is known and each line search is exact. There are a large number of these optimization techniques which vary only in the way that the search directions are chosen.


Nelder-Mead Simplex

This approach has nothing to do with the SIMPLEX method of linear programming. The method derives its name from an n-dimensional polytope (and as a result is often referred to as the "polytope" method).

A polytope is an n-dimensional object that has n+1 vertices :

n    polytope        number of vertices
1    line segment    2
2    triangle        3
3    tetrahedron     4

It is an easily implemented direct search method that only relies on objective function evaluations. As a result, it can be robust to some types of discontinuities and so forth. However, it is often slow to converge and is not useful for larger problems (> 10 variables).


To illustrate the basic idea of the method, consider the two-dimensional minimization problem :


Step 1 :

- P1 = max(P1, P2, P3) : define P4 as the reflection of P1 through the centroid of the line joining P2 and P3.

- If P4 ≤ max(P1, P2, P3), form a new polytope from the points P2, P3, P4.

Step 2 :

- repeat the procedure in Step 1 to form new polytopes.

We can repeat the procedure until we reach the polytope defined by P4, P5, P6. Further iterations will cause us to flip between two polytopes. We can eliminate these problems by introducing two further operations into the method :

- contraction when the reflection step does not offer any improvement,

- expansion to accelerate convergence when the reflection step is an improvement.


For minimization, the algorithm is :

(a) Order according to the values at the vertices

P (x1) ≤ P (x2) ≤ . . . ≤ P (xn+1)

(b) Calculate xo, the centre of gravity of all points except xn+1 :

xo = (1/n) ∑_{i=1}^{n} xi

(c) Reflection : Compute reflected point

xr = xo + α(xo − xn+1)

If P (x1) ≤ P (xr) < P (xn) :

The reflected point is better than the second worst, but not better than the best ; then obtain a new simplex by replacing the worst point xn+1 with the reflected point xr, and go to step (a).


If P (xr) < P (x1)

The reflected point is the best point so far, then go to step (d).

(d) Expansion : compute the expanded point

xe = xo + γ(xo − xn+1)

If P (xe) < P (xr)

The expanded point is better than the reflected point,

then obtain a new simplex by replacing the worst point

xn+1 with the expanded point xe, and go to step (a).

Else

Obtain a new simplex by replacing the worst point

xn+1 with the reflected point xr, and go to step (a).


Else

The reflected point is worse than the second worst ; then continue at step (e).

(e) Contraction : Here, it is certain that P(xr) ≥ P(xn). Compute the contracted point

xc = xn+1 + ρ(xo − xn+1)

If P (xc) ≤ P (xn+1)

The contracted point is better than the worst point,

then obtain a new simplex by replacing the worst point

xn+1 with the contracted point xc, and go to step (a).

Else

go to step (f).


(f) Reduction : For all but the best point, replace the point with

xi = x1 + σ(xi − x1), ∀i ∈ {2, . . . , n+ 1}

go to step (a).

(g) End of the algorithm.

Note : α, γ, ρ and σ are respectively the reflection, the expansion, the contraction and the shrink coefficients. Standard values are α = 1, γ = 2, ρ = 1/2 and σ = 1/2.


For the reflection, since xn+1 is the vertex with the highest associated value among the vertices, we can expect to find a lower value at the reflection of xn+1 in the opposite face formed by all vertices xi except xn+1.

For the expansion, if the reflected point xr is the new minimum among the vertices, we can expect to find interesting values along the direction from xo to xr.

Concerning the contraction : if P(xr) ≥ P(xn), we can expect that a better value will be inside the simplex formed by all the vertices xi.

The initial simplex is important : a too-small initial simplex can lead to a local search, and consequently the Nelder-Mead method can more easily get stuck. This simplex should therefore depend on the nature of the problem.

The Nelder-Mead Simplex method can be slow to converge, but it is useful for functions whose derivatives cannot be calculated or approximated (e.g. some non-smooth functions).


Example

min_x  x1² + x2² − 2x1 − 2x2 + 2

k    x1            x2        x3            P1      P2    P3       xr            Pr      notes
0    [3 3]ᵀ        [3 2]ᵀ    [2 2]ᵀ        8       5     2        [2 1]ᵀ        1       expansion
1    [2 1]ᵀ        [3 2]ᵀ    [2 2]ᵀ        1       5     2        [1 1]ᵀ        0       expansion
2    [2 1]ᵀ        [1 1]ᵀ    [2 2]ᵀ        1       0     2        [1 0]ᵀ        1       contraction
3    [2 1]ᵀ        [1 1]ᵀ    [7/4 3/2]ᵀ    1       0     13/16    [3/4 3/2]ᵀ    5/16
4    [3/4 3/2]ᵀ    [1 1]ᵀ    [7/4 3/2]ᵀ    5/16    0     13/16
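The example can also be run through an off-the-shelf implementation. A sketch using SciPy's Nelder-Mead solver, seeded with the initial simplex (3,3), (3,2), (2,2) used above (the exact iterate sequence may differ in tie-breaking and shrink details) :

```python
import numpy as np
from scipy.optimize import minimize

P = lambda x: x[0]**2 + x[1]**2 - 2*x[0] - 2*x[1] + 2

# Seed the solver with the initial simplex from the example
res = minimize(P, x0=[3.0, 3.0], method="Nelder-Mead",
               options={"initial_simplex": [[3.0, 3.0], [3.0, 2.0], [2.0, 2.0]],
                        "xatol": 1e-8, "fatol": 1e-8})
```

The solver converges to the minimizer (1, 1), consistent with the hand iterations above.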


3 Gradient-Based Methods

This family of optimization methods uses first-order derivatives to determine a "descent" direction. (Recall that the gradient gives the direction of quickest increase for the objective function. Thus, the negative of the gradient gives the direction of quickest decrease for the objective function.)


Steepest Descent

It would seem that the fastest way to find an optimum would be to always move in the direction in which the objective function decreases the fastest. Although this idea is intuitively appealing, the steepest-descent direction rarely points directly at the optimum ; however, with a good line search every iteration will move closer to the optimum.

The Steepest Descent algorithm is :

1. choose a starting point x0,

2. calculate the search direction :

sk = −∇xP|xk

3. determine the next iterate for x :

xk+1 = xk + λksk


by solving the univariate optimization problem :

min_{λk}  P(xk + λksk)

4. repeat steps 2 & 3 until termination. Some termination criteria to consider :

λk ≤ ε

‖xk+1 − xk‖ ≤ ε

‖∇xP|xk‖ ≤ ε

‖P(xk+1) − P(xk)‖ ≤ ε

where ε is some small positive scalar.
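A minimal implementation of these four steps, with the exact line search delegated to SciPy's scalar minimizer (function and variable names are illustrative choices) :

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(P, grad, x0, eps=1e-8, max_iter=5000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:       # termination: ||grad P|| <= eps
            break
        s = -g                             # steepest descent direction
        lam = minimize_scalar(lambda a: P(x + a * s)).x
        x = x + lam * s
    return x

# The poorly scaled quadratic used in Example 2
H = np.array([[101.0, -99.0], [-99.0, 101.0]])
P = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
x_star = steepest_descent(P, grad, [2.0, 1.0])
```

On this ill-conditioned problem the loop needs many iterations to drive the gradient norm down, which is exactly the behaviour tabulated in Example 2.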


Example 1

min_x  x1² + x2² − 2x1 − 2x2 + 2

∇xPᵀ = [2x1 − 2  2x2 − 2]ᵀ

start at x0 = [3 3]ᵀ

k    xk        ∇xPᵀ|xk    λk     P(xk)
0    [3 3]ᵀ    [4 4]ᵀ     1/2    8
1    [1 1]ᵀ    [0 0]ᵀ            0


start at x0 = [1 2]ᵀ

k    xk        ∇xPᵀ|xk    λk     P(xk)
0    [1 2]ᵀ    [0 2]ᵀ     1/2    1
1    [1 1]ᵀ    [0 0]ᵀ            0

This example shows that the method of Steepest Descent is very efficient for well-scaled quadratic optimization problems (the profit contours are concentric circles). This method is not nearly as efficient for non-quadratic objective functions with very elliptical profit contours (poorly scaled).


Example 2

Consider the minimization problem :

min_x  (1/2) xᵀ [ 101  −99 ; −99  101 ] x

Then

∇xP = xᵀ [ 101  −99 ; −99  101 ] = [101x1 − 99x2   −99x1 + 101x2]

To develop an objective function for our line search, substitute

xk+1 = xk + λksk

into the original objective function :


P(xk + λksk) = (1/2) (xk + λksk)ᵀ H (xk + λksk)

where, to simplify our work, we let

H = [ 101  −99 ; −99  101 ]

Since sk = −∇xPᵀ|xk = −Hxk for steepest descent, this yields :

P(xk + λksk) = (1/2) [ xkᵀHxk − 2λk xkᵀH²xk + λk² xkᵀH³xk ]


This expression is quadratic in λk, thus making it easy to perform an exact line search :

dP/dλk = −xkᵀH²xk + λk xkᵀH³xk = 0

Solving for the step length yields :

λk = (xkᵀH²xk) / (xkᵀH³xk)

with

H² = [ 20,002  −19,998 ; −19,998  20,002 ],    H³ = [ 4,000,004  −3,999,996 ; −3,999,996  4,000,004 ]
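A quick numeric check of this step-length formula at the starting point x0 = [2 1]ᵀ used below (plain NumPy ; nothing assumed beyond the formula itself) :

```python
import numpy as np

H = np.array([[101.0, -99.0], [-99.0, 101.0]])
H2, H3 = H @ H, H @ H @ H                 # H^2 and H^3 from the formula

x0 = np.array([2.0, 1.0])
lam0 = (x0 @ H2 @ x0) / (x0 @ H3 @ x0)    # exact step length at x0
```

This reproduces λ0 ≈ 0.0050, the first step length in the table below, and the H² entries match those shown above.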


Starting at x0 = [2 1]ᵀ :

k    xk                           P(xk)          ∇xPᵀ|xk               λk
0    [2 1]ᵀ                       54.50          [103 −97]ᵀ            0.0050
1    [1.4845 1.4854]ᵀ             4.4104         [2.8809 3.0591]ᵀ      0.4591
2    [0.1618 0.0809]ᵀ             0.3569         [8.3353 −7.8497]ᵀ     0.0050
3    [0.1201 0.1202]ᵀ             0.0289         [0.2331 0.2476]ᵀ      0.4591
4    [0.0131 0.0065]ᵀ             0.0023         [0.6745 −0.6352]ᵀ     0.0050
5    [0.0097 0.0097]ᵀ             1.8915·10⁻⁴    [0.0189 0.0200]ᵀ      0.4591
6    [0.0011 0.0005]ᵀ             1.5307·10⁻⁵    [0.0547 −0.0514]ᵀ     0.0050
7    [7.868·10⁻⁴ 7.872·10⁻⁴]ᵀ     1.2387·10⁻⁶    [0.0015 0.0016]ᵀ      0.4591


Conjugate Gradients

Steepest Descent was based on the idea that the minimum will be found provided that we always move in the downhill direction. The second example showed that the gradient does not always point in the direction of the optimum, due to the geometry of the problem. Fletcher and Reeves (1964) developed a method which attempts to account for local curvature in determining the next search direction.

The algorithm is :

1. choose a starting point x0,

2. set the initial search direction

s0 = −∇xP|x0


3. determine the next iterate for x :

xk+1 = xk + λksk

by solving the univariate optimization problem :

min_{λk}  P(xk + λksk)

4. calculate the next search direction as :

sk+1 = −∇xP|xk+1 + ( ‖∇xP|xk+1‖² / ‖∇xP|xk‖² ) sk

5. repeat steps 3 and 4 until termination.


The manner in which the successive search directions are calculated is important. For quadratic functions, these successive search directions are conjugate with respect to the Hessian. This means that for the quadratic function :

P(x) = a + bᵀx + (1/2) xᵀHx

successive search directions satisfy :

skᵀ H sk+1 = 0

As a result, these successive search directions have incorporated within them information about the geometry of the optimization problem.
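The Fletcher-Reeves update rule can be sketched as follows, with SciPy's scalar minimizer again standing in for the exact line search (names are illustrative) :

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradients(P, grad, x0, eps=1e-10, max_iter=100):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    s = -g                                 # initial direction: steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        lam = minimize_scalar(lambda a: P(x + a * s)).x
        x = x + lam * s
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves ratio
        s = -g_new + beta * s              # conjugate search direction
        g = g_new
    return x

H = np.array([[101.0, -99.0], [-99.0, 101.0]])
x_star = conjugate_gradients(lambda x: 0.5 * x @ H @ x,
                             lambda x: H @ x, [2.0, 1.0])
```

On the two-dimensional quadratic from the example below, this reaches the optimum after essentially two line searches, in contrast to the many steepest descent iterations.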


Example

Consider again the optimization problem :

min_x  (1/2) xᵀ [ 101  −99 ; −99  101 ] x

We saw that Steepest Descent was slow to converge on this problem because of its poor scaling. As in the previous example, we will use an exact line search. It can be shown that the optimal step length is :

λk = − (xkᵀ H sk) / (skᵀ H sk),    H = [ 101  −99 ; −99  101 ]


start at x0 = [2 1]ᵀ

k    xk                   P(xk)     ∇xPᵀ|xk             sk                     λk
0    [2 1]ᵀ               54.50     [103 −97]ᵀ          [−103 97]ᵀ             0.0050
1    [1.4845 1.4854]ᵀ     4.4104    [2.8809 3.0591]ᵀ    [−2.9717 −2.9735]ᵀ     0.4995
2    [0 0]ᵀ               0


Notice the much quicker convergence of the algorithm. For an n-dimensional quadratic objective with exact line searches, the Conjugate Gradients method converges in at most n steps.

The quicker convergence comes at an increased computational requirement, including :

– a more complex search direction calculation,

– increased storage for maintaining previous search directions and gradients.

These are usually small in relation to the performance improvements.


4 Newton’s Method

The Conjugate Gradients method showed that there was a considerable performance gain to be realized by incorporating some problem geometry into the optimization algorithm. However, the method did this in an approximate sense. Newton's method does this by using second-derivative information.

In the multivariate case, Newton's method determines a search direction by using the Hessian to modify the gradient. The method is developed directly from the Taylor Series approximation of the objective function. Recall that an objective function can be locally approximated :

P(x) ≈ P(xk) + ∇xP|xk (x − xk) + (1/2) (x − xk)ᵀ ∇²xP|xk (x − xk)


Then, stationarity can be approximated as :

∇xP|x ≈ ∇xP|xk + (x − xk)ᵀ ∇²xP|xk = 0

Which can be used to determine an expression for calculating the next point in the minimization procedure :

xk+1 = xk − (∇²xP|xk)⁻¹ (∇xP|xk)ᵀ

Notice that in this case we get both the direction and the step length for the minimization. Further, if the objective function is quadratic, the optimum is found in a single step.


At a given point xk, the algorithm for Newton’s method is :

1. calculate the gradient at the current point ∇xP|xk .

2. calculate the Hessian at the current point ∇2xP|xk .

3. calculate the Newton step :

∆xk = −(∇²xP|xk)⁻¹ (∇xP|xk)ᵀ

4. calculate the next point :

xk+1 = xk + ∆xk

5. repeat steps 1 through 4 until termination.
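One iteration of the method can be written as a linear solve rather than an explicit matrix inverse (a standard numerical practice ; the quadratic below is the running example, so a single step reaches the optimum) :

```python
import numpy as np

def newton_step(grad, hess, x):
    """x_{k+1} = x_k - (Hessian)^{-1} gradient, computed via a linear solve."""
    return x - np.linalg.solve(hess(x), grad(x))

H = np.array([[101.0, -99.0], [-99.0, 101.0]])
x1 = newton_step(lambda x: H @ x, lambda x: H, np.array([2.0, 1.0]))
```

Because the objective is quadratic, `x1` is the exact minimizer [0 0]ᵀ after one step, as the example on the next slide shows.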


Example

Consider again the optimization problem :

min_x  (1/2) xᵀ [ 101  −99 ; −99  101 ] x

The gradient and Hessian for this example are :

∇xP = xᵀ [ 101  −99 ; −99  101 ],    ∇²xP = [ 101  −99 ; −99  101 ]

start at x0 = [2 1]ᵀ

k    xk        P(xk)    ∇xPᵀ|xk        ∇²xP|xk
0    [2 1]ᵀ    54.50    [103 −97]ᵀ     [101 −99 ; −99 101]
1    [0 0]ᵀ    0


As expected Newton’s method was very effective, eventhough theproblem was not well scaled. Poor scaling was compensated for byusing the inverse of the Hessian.

Steepest Descent was very good for the first two iterations, whenwe were quite far from the optimum. But it slowed rapidly as :

x→ x∗ ∇xP|xk → 0

The Conjugate Gradients approach avoided this difficulty byincluding some measure of problem geometry in the algorithm.However, the method started out using the Steepest Descentdirection and built in curvature information as the algorithmprogressed to the optimum.

For higher-dimensional and non-quadratic objective functions, thenumber of iterations required for convergence can increasesubstantially.


In general, Newton’s Method will yield the best convergenceproperties, at the cost of increased computational load to calculateand store the Hessian. The Conjugate Gradients approach can beeffective in situations where computational requirements are anissue. However, there some other methods where the Hessian isapproximated using gradient information and the approximateHessian is used to calculate a Newton step. These are called theQuasi-Newton methods.

As we saw in the univariate case, care must be taken when usingNewton’s method for complex functions. Since Newton’s method issearching for a stationary point, it can oscillate between suchpoints in the above situation. For such complex functions, Newton’smethod is often implemented with an alternative to step 4.

4’. calculate the next point :

xk+1 = xk + λk∆xk


where the step length is determined as :

i) 0 < λk < 1 (so we don’t take a full step).

ii) perform a line search on λk.

Generally, Newton and Newton-like methods are preferred to other methods ; the main difficulty is determining the Hessian.


5 Approximate Newton Methods

As we saw in the univariate case, the derivatives can be approximated by finite differences. In multiple dimensions this can require substantially more calculations. As an example, consider the forward difference approximation of the gradient :

∇xP|xk = [∂P/∂xi]|xk ≈ [P(xk + δi) − P(xk)] / ‖δi‖

where δi is a perturbation in the direction of xi. This approach requires 1 additional objective function evaluation per dimension for forward or backward differencing, and 2 additional objective function evaluations per dimension for central differencing.


To approximate the Hessian using only finite differences of the objective function, one could use :

∇²xP|xk = [∂²P/∂xi∂xj]|xk ≈ [P(xk + δi + δj) − P(xk + δi) − P(xk + δj) + P(xk)] / (‖δi‖‖δj‖)

Finite difference approximation of the Hessian requires at least an additional 2 objective function evaluations per dimension. Some differencing schemes will give better performance than others ; however, the increased computational load required for difference approximation of the second derivatives precludes the use of this approach for larger problems.

Alternatively, the Hessian can be approximated in terms of gradient information. In the forward difference case, column i of the Hessian can be approximated as :

hi ≈ [∇xP(xk + δi) − ∇xP(xk)] / ‖δi‖


Then, we can form a finite difference approximation to the Hessian as :

H = [h1 h2 . . . hn]

The problem is that often this approximate Hessian is non-symmetric. A symmetric approximation can be formed as :

H ← (1/2) [H + Hᵀ]

Whichever approach is used to approximate the derivatives, the Newton step is calculated from them. In general, Newton's methods based on finite differences can produce results similar to analytical results, provided care is taken in choosing the perturbations δi. (Recall that as the algorithm proceeds, xk → x∗ and ∇xP|xk → 0 ; then small approximation errors will affect convergence and accuracy.) Some alternatives which can require fewer computations, yet build curvature information as the algorithm proceeds, are the Quasi-Newton family.
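The two approximations above can be sketched as follows (forward differences throughout ; the perturbation size h is an illustrative choice and, per the caution above, matters in practice) :

```python
import numpy as np

def fd_gradient(P, x, h=1e-6):
    """Forward-difference gradient: 1 extra P evaluation per dimension."""
    g = np.empty(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = h
        g[i] = (P(x + e) - P(x)) / h
    return g

def fd_hessian(grad, x, h=1e-6):
    """Hessian columns h_i from forward differences of the gradient,
    then symmetrized as (H + H^T)/2."""
    n = x.size
    Ha = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        Ha[:, i] = (grad(x + e) - grad(x)) / h
    return 0.5 * (Ha + Ha.T)

H = np.array([[101.0, -99.0], [-99.0, 101.0]])
x = np.array([2.0, 1.0])
```

On the running quadratic both approximations recover the analytical gradient Hx and Hessian H to within the truncation error of the differences.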


5.1 Quasi-Newton Methods

This family of methods are the multivariate analogs of the univariate Secant methods. They build second derivative (Hessian) information as the algorithm proceeds to the solution, using available gradient information.

Consider the Taylor series expansion for the gradient of the objective function along the step direction sk :

∇xP|xk+1 = ∇xP|xk + ∇²xP|xk (∆xk) + . . .

Truncating the Taylor series at the second-order terms and re-arranging slightly yields the so-called Quasi-Newton Condition :

Hk+1(∆xk) = ∇xP|xk+1 − ∇xP|xk = ∆gk


Thus, any algorithm which satisfies this condition can build up curvature information as it proceeds from xk to xk+1 in the direction ∆xk ; however, the curvature information is gained in only one direction. Then, as these algorithms proceed from iteration to iteration, the approximate Hessian can differ by only a rank-one matrix :

Hk+1 = Hk + vuᵀ

Combining the Quasi-Newton Condition and the update formula yields :

Hk+1(∆xk) = (Hk + vuᵀ)(∆xk) = ∆gk

which yields a solution for the vector v of :

v = (1/(uᵀ∆xk)) [∆gk − Hk∆xk]


Providing ∆xk and u are not orthogonal, the Hessian update formula is given by :

Hk+1 = Hk + (1/(uᵀ∆xk)) [∆gk − Hk∆xk] uᵀ

Unfortunately, this update formula is not unique with respect to the Quasi-Newton Condition and the step vector sk.

We could add an additional term wzᵀ to the update formula, where the vector w is any vector from the null space of ∆xk. (Recall that if w ∈ null(∆xk), then ∆xkᵀw = 0.) Thus there is a family of such Quasi-Newton methods which differ only in this additional term wzᵀ.


Perhaps the most successful member of the family is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update :

Hk+1 = Hk + (∆gk ∆gkᵀ)/(∆gkᵀ ∆xk) − (Hk ∆xk ∆xkᵀ Hk)/(∆xkᵀ Hk ∆xk)

The BFGS update has several advantages, including that it is symmetric, and it can be further simplified if the step is calculated according to :

Hk ∆xk = −λk ∇xP|xk

In this case the BFGS update formula simplifies to :

Hk+1 = Hk + (∆gk ∆gkᵀ)/(∆gkᵀ ∆xk) + (λk ∇xP|xk ∇xPᵀ|xk)/(∇xPᵀ|xk ∆xk)
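The full BFGS formula is easy to check numerically : for the update to be consistent with the Quasi-Newton Condition, the new approximation must map ∆xk onto ∆gk exactly, and it must stay symmetric. A sketch (the helper name and the test step are illustrative) :

```python
import numpy as np

def bfgs_update(Hk, dx, dg):
    """BFGS update of the approximate Hessian from one step dx, dg."""
    t1 = np.outer(dg, dg) / (dg @ dx)
    Hdx = Hk @ dx
    t2 = np.outer(Hdx, Hdx) / (dx @ Hdx)
    return Hk + t1 - t2

# One step of data from the running quadratic: dg = H dx
H = np.array([[101.0, -99.0], [-99.0, 101.0]])
dx = np.array([0.52, -0.49])          # an arbitrary step
H1 = bfgs_update(np.eye(2), dx, H @ dx)
```

Starting from H0 = I, a single update already satisfies H1 ∆xk = ∆gk, which is exactly the Quasi-Newton Condition derived above.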


General Quasi-Newton Algorithm

1. Initialization :

- choose a starting point x0

- choose an initial Hessian approximation (usually H0 = I).

2. At iteration k :

- calculate the gradient ∇xP|xk

- calculate the search direction :

sk = −(Hk)⁻¹ (∇xP|xk)

- calculate the next iterate :

xk+1 = xk + λksk


using a line search on λk.

- compute ∇xP|xk+1, ∆gk, ∆xk.

- update the Hessian (Hk+1) using the method of your choice (e.g. BFGS).

3. Repeat step 2 until termination.


Example

Consider again the optimization problem :

min_x  (1/2) xᵀ [ 101  −99 ; −99  101 ] x

The gradient and Hessian for this example are :

∇xP = xᵀ [ 101  −99 ; −99  101 ],    ∇²xP = [ 101  −99 ; −99  101 ]


start at x0 = [2 1]ᵀ

k    xk                       P(xk)       ∇xPᵀ|xk                  Hk                               sk                        λk
0    [2 1]ᵀ                   54.5        [103 −97]ᵀ               [1 0 ; 0 1]                      [−103 97]ᵀ                0.005
1    [1.48 1.48]ᵀ             4.41        [2.88 3.05]ᵀ             [100.52 −99.5 ; −99.5 100.46]    [−2.97 −2.97]ᵀ            0.50
2    [7.8·10⁻⁵ 7.8·10⁻⁵]ᵀ     1.2·10⁻⁸    [1.5·10⁻⁴ 1.6·10⁻⁴]ᵀ     [101 −99 ; −99 101]              [−7.8·10⁻⁵ −7.8·10⁻⁵]ᵀ    1
3    [0 0]ᵀ                   0

It is worth noting that the optimum is found in 3 steps (although we are very close within 2). Also, notice that we have a very accurate approximation to the Hessian (to 7 significant figures) after 2 updates.


The BFGS algorithm was :

- not quite as efficient as Newton's method,

- approximately as efficient as the Conjugate Gradients approach,

- much more efficient than Steepest Descent.

This is what should be expected for quadratic functions of low dimension. As the dimension of the problem increases, the Newton and Quasi-Newton methods will usually outperform the other methods.

cautionary note :

The BFGS algorithm guarantees, in theory, that the approximate Hessian remains positive definite (and invertible) providing that the initial matrix is chosen as positive definite. However, round-off errors and sometimes the search history can cause the approximate Hessian to become badly conditioned. When this occurs, the approximate Hessian should be reset to some value (often the identity matrix I).
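For comparison with the hand iterations above, a sketch using a production implementation of the same idea, SciPy's BFGS driver, on the running quadratic :

```python
import numpy as np
from scipy.optimize import minimize

H = np.array([[101.0, -99.0], [-99.0, 101.0]])
res = minimize(lambda x: 0.5 * x @ H @ x, x0=[2.0, 1.0],
               jac=lambda x: H @ x, method="BFGS",
               options={"gtol": 1e-8})
```

The solver converges to the origin in a handful of iterations, matching the behaviour shown in the table.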


5.2 Convergence

During our discussions we have mentioned the convergence properties of various algorithms. There are a number of ways in which the convergence properties of an algorithm can be characterized. The most conventional approach is to determine the asymptotic behaviour of the sequence of distances between the iterates (xk) and the optimum (x∗) as the iterations of a particular algorithm proceed. Typically, the distance between an iterate and the optimum is defined as :

‖xk − x∗‖

We are most interested in the behaviour of an algorithm within some neighbourhood of the optimum ; hence, the asymptotic analysis.


The asymptotic rate of convergence is defined as :

lim_{k→∞}  ‖xk+1 − x∗‖ / ‖xk − x∗‖^p = β

with the asymptotic order p and the asymptotic error β.

In general, most algorithms show convergence properties which fall within three categories :

- linear (0 < β < 1, p = 1),

- super-linear (β = 0, p = 1),

- quadratic (β > 0, p = 2).
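These definitions can be checked empirically. For instance, Newton's method on the scalar problem f(x) = x² − 2 should show p ≈ 2 ; fitting p from three successive errors (a small illustrative experiment, not from the notes) :

```python
import numpy as np

x, errs = 3.0, []
for _ in range(5):
    x = x - (x * x - 2.0) / (2.0 * x)   # Newton iteration for sqrt(2)
    errs.append(abs(x - np.sqrt(2.0)))

# Estimate the order p from ||e_{k+1}|| ~ beta ||e_k||^p
p_est = np.log(errs[3] / errs[2]) / np.log(errs[2] / errs[1])
```

The estimate comes out very close to 2, consistent with the quadratic convergence claimed for Newton's method below.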


For the methods we have studied, it can be shown that :

(a) the Steepest Descent method exhibits linear convergence in most cases. For quadratic objective functions, the asymptotic error constant can be shown to approach 0 as κ(H) → 1 and 1 as κ(H) → ∞.

(b) the Conjugate Gradients method will generally showsuper-linear (and often nearly quadratic) convergence properties.

(c) Newton’s method converges quadratically.

(d) Quasi-Newton methods can closely approach quadratic rates ofconvergence.


As in the univariate case, there are two basic approaches :

- direct search (Nelder-Mead Simplex, . . . ),

- derivative-based (Steepest Descent, Newton’s, . . . ).

Choosing which method to use for a particular problem is a trade-off among :

- ease of implementation,

- convergence speed and accuracy,

- computational limitations.


Unfortunately, there is no best method for all situations. For nearly quadratic functions, Newton's and Quasi-Newton methods will usually be very efficient. For nearly flat or non-smooth objective functions, the direct search methods are often a good choice. For very large problems (more than 1000 decision variables), algorithms which use only gradient information (Steepest Descent, Conjugate Gradients, Quasi-Newton, etc.) can be the only practical methods.

Remember that the solution to a multivariable optimization problem requires :

- a function (and derivatives) that can be evaluated,

- an appropriate optimization method.


Chapter V

Constrained Multivariate Optimization

– Introduction

– Optimality Conditions

– Examples

– Algorithms for Constrained NLP

– Sequential Quadratic Programming


1 Introduction

Until now we have only been discussing NLPs that do not have constraints. Most problems that we wish to solve will have some form of constraints :

i) equality constraints (material and energy balances),

ii) inequality constraints (limits on operating conditions).

These types of problems require techniques that rely heavily on the unconstrained methods.


General Form

Optimize  P(x)          Objective function (performance function ; profit/cost function)

Subject to

f(x) = 0                Equality constraints (process model)

g(x) ≤ 0                Inequality constraints (operating constraints)

xl ≤ x ≤ xu             Variable bounds (variable limits)


2 Optimality Conditions for Local Optimum

Karush-Kuhn-Tucker (KKT) Conditions :

1. Stationarity

∇L(x∗, u, v) = ∇P (x∗) +∇g(x∗)u+∇f(x∗)v = 0

2. Feasibility

g(x∗) ≤ 0, f(x∗) = 0

3. Complementarity

uT g(x∗) = 0


4. Non-negativity

u ≥ 0

5. Curvature

wT ∇2L w ≥ 0, ∀ feasible w


Example

min (x1 − 5)² + (x2 − 4)²

subject to

2x1 + 3x2 ≤ 12
6x1 + 3x2 ≤ 18

Check to see if the constraint intersection is the optimum, using the KKT conditions :

L = (x1 − 5)² + (x2 − 4)² + λ1(2x1 + 3x2 − 12) + λ2(6x1 + 3x2 − 18)


The gradients are given by :

∇xL = [2(x1 − 5) + 2λ1 + 6λ2,  2(x2 − 4) + 3λ1 + 3λ2]

∇λL = [2x1 + 3x2 − 12,  6x1 + 3x2 − 18]

To find the stationary points, set

∇xL = 0 and ∇λL = 0

x = (3/2, 3),  λ = (−3/4, 17/12),  P = 13.25

Since one of the multipliers is negative, this point is not the optimum.
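This calculation can be reproduced numerically. A sketch (illustrative, not from the notes) that solves the stationarity and feasibility equations as one linear system:

```python
import numpy as np

# Stationarity and feasibility with BOTH constraints assumed active:
#   2(x1 - 5) + 2*l1 + 6*l2 = 0
#   2(x2 - 4) + 3*l1 + 3*l2 = 0
#   2*x1 + 3*x2 = 12
#   6*x1 + 3*x2 = 18
# Unknowns ordered (x1, x2, l1, l2).
A = np.array([[2.0, 0.0, 2.0, 6.0],
              [0.0, 2.0, 3.0, 3.0],
              [2.0, 3.0, 0.0, 0.0],
              [6.0, 3.0, 0.0, 0.0]])
b = np.array([10.0, 8.0, 12.0, 18.0])
x1, x2, l1, l2 = np.linalg.solve(A, b)
P = (x1 - 5)**2 + (x2 - 4)**2
print(x1, x2, l1, l2, P)   # 1.5  3.0  -0.75  1.4167  13.25
# l1 < 0 violates the non-negativity condition u >= 0, so the intersection
# point is not the optimum.
```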


Let us assume that only the second constraint is active at the optimum.

min (x1 − 5)² + (x2 − 4)²

subject to

2x1 + 3x2 ≤ 12
6x1 + 3x2 ≤ 18 (active)

Then we can modify the Lagrangian and check the KKT conditions :

L = (x1 − 5)² + (x2 − 4)² + λ1(6x1 + 3x2 − 18)

∇xL = [2(x1 − 5) + 6λ1,  2(x2 − 4) + 3λ1] = 0

∇λL = [6x1 + 3x2 − 18] = 0

x = (9/5, 12/5) = (1.8, 2.4),  λ = 16/15 = 1.0666,  P = 12.8
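The same check with only the second constraint active (an illustrative sketch; the 3×3 system restates the gradient conditions above):

```python
import numpy as np

# Stationarity and feasibility with only 6*x1 + 3*x2 <= 18 active:
#   2(x1 - 5) + 6*lam = 0,   2(x2 - 4) + 3*lam = 0,   6*x1 + 3*x2 = 18
A = np.array([[2.0, 0.0, 6.0],
              [0.0, 2.0, 3.0],
              [6.0, 3.0, 0.0]])
b = np.array([10.0, 8.0, 18.0])
x1, x2, lam = np.linalg.solve(A, b)
P = (x1 - 5)**2 + (x2 - 4)**2
print(x1, x2, lam, P)   # 1.8  2.4  1.0667  12.8
# lam > 0, and the inactive constraint is satisfied: 2*x1 + 3*x2 = 10.8 <= 12.
```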


Let us again assume that only the second constraint is active at the optimum, and that its right-hand side is perturbed slightly by e = 0.001.

min (x1 − 5)² + (x2 − 4)²

subject to

2x1 + 3x2 ≤ 12
6x1 + 3x2 ≤ 18 + e (active)

L = (x1 − 5)² + (x2 − 4)² + λ1(6x1 + 3x2 − 18.001)

∇xL = [2(x1 − 5) + 6λ1,  2(x2 − 4) + 3λ1] = 0

∇λL = [6x1 + 3x2 − 18.001] = 0

x = (1.8001, 2.4001),  λ = 1.0666,  P = 12.7989


Since the multiplier is positive, this is the optimum.

Notice that x and λ changed only very slightly from the values of the original solution.


dP/de = (Poriginal − Pnew)/(0 − e) = (12.8 − 12.7989)/(−0.001) ≈ −1.0666 = −λ

That is, the profit function decreases by an amount λ for every unit increase in the r.h.s. of the active constraint.

Sensitivity analysis tells us how the solution to an optimization problem changes as certain problem parameters change.

We don’t necessarily have to re-solve the problem.
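The sensitivity result dP/de = −λ can be verified numerically (an illustrative sketch, with e = 0.001 as in the notes):

```python
import numpy as np

# Solve the active-constraint system 2(x1-5)+6*lam = 0, 2(x2-4)+3*lam = 0,
# 6*x1 + 3*x2 = 18 + e, for a given right-hand-side perturbation e.
def solve_case(e):
    A = np.array([[2.0, 0.0, 6.0],
                  [0.0, 2.0, 3.0],
                  [6.0, 3.0, 0.0]])
    b = np.array([10.0, 8.0, 18.0 + e])
    x1, x2, lam = np.linalg.solve(A, b)
    return lam, (x1 - 5)**2 + (x2 - 4)**2

lam0, P0 = solve_case(0.0)
_, Pe = solve_case(0.001)
dP_de = (Pe - P0) / 0.001
print(dP_de, -lam0)   # both approximately -1.0666
```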


3 Algorithms for Constrained NLP

1. Successive (or Sequential) Linear Programming (SLP)

– Linearize objective function and constraints,

– Solve a sequence of linearized problems,

– Limited use.

2. Penalty/ Barrier Function Methods

– Create an unconstrained problem by augmenting the objective function with the constraints,

– Solve the resulting unconstrained problem,

– Barrier function methods back in vogue.


3. Gradient-Based Methods

– Steepest, feasible descent direction,

– Solve the resulting univariate problem,

– Not currently popular.

4. Successive (or Sequential) Quadratic Programming (SQP)

– Approximate problem locally as a QP,

– Solve a sequence of QPs,

– Currently very popular for process optimization.
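The penalty idea in item 2 can be illustrated on a toy problem (a sketch, not from the notes): min x² subject to x ≥ 1, i.e. g(x) = 1 − x ≤ 0. The quadratic penalty Φ(x) = x² + μ·max(0, 1 − x)² has an unconstrained minimizer that approaches the constrained optimum x∗ = 1 from the infeasible side as μ → ∞:

```python
# Quadratic-penalty illustration:  min x^2  s.t.  x >= 1  (g(x) = 1 - x <= 0).
# Phi(x) = x^2 + mu * max(0, 1 - x)^2.  Setting Phi'(x) = 0 in the infeasible
# region (x < 1) gives 2x - 2*mu*(1 - x) = 0, i.e. x_mu = mu / (1 + mu).
for mu in [1.0, 10.0, 100.0, 1000.0]:
    x_mu = mu / (1.0 + mu)
    print(mu, x_mu)   # 0.5, 0.9091, 0.9901, 0.9990 -> approaches x* = 1
```

Note that the penalized minimizer is always slightly infeasible, which is why barrier methods (which approach from the feasible side) are an important alternative.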


4 Sequential Quadratic Programming

The idea is to approximate the NLP with a QP at the current value of x by the problem :

min∆x  ∇P(xk)T∆x + (1/2)∆xTBk∆x

subject to

f(xk) + ∇f(xk)T∆x = 0
g(xk) + ∇g(xk)T∆x ≤ 0

where Bk is the current approximation of the Hessian of the Lagrangian and ∆x is the search direction.

Once the optimal search direction is found, a univariate search is performed to determine how far we should proceed in this direction.

The entire procedure is then repeated until the algorithm has converged.
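A minimal sketch of this loop for the equality-constrained case only (illustrative code; `sqp` is a hypothetical helper, not the notes' code; Bk is taken as the exact Hessian and a full step replaces the univariate search):

```python
import numpy as np

# Each iteration solves the QP's KKT system for the search direction dx
# and multiplier estimate v:
#   [ Bk  A^T ] [dx]   [ -grad P(xk) ]
#   [ A    0  ] [v ] = [ -f(xk)      ]
def sqp(grad_P, hess_L, f, jac_f, x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    v = None
    for _ in range(max_iter):
        g, B = grad_P(x), hess_L(x)
        c, A = np.atleast_1d(f(x)), np.atleast_2d(jac_f(x))
        m = A.shape[0]
        K = np.block([[B, A.T], [A, np.zeros((m, m))]])
        sol = np.linalg.solve(K, np.concatenate([-g, -c]))
        dx, v = sol[:x.size], sol[x.size:]
        # Full step; a practical code would perform a univariate line search.
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x, v

# Chapter example, treating the active inequality as an equality:
#   min (x1-5)^2 + (x2-4)^2   s.t.   6*x1 + 3*x2 - 18 = 0
x, v = sqp(grad_P=lambda x: 2*(x - np.array([5.0, 4.0])),
           hess_L=lambda x: 2*np.eye(2),   # exact Hessian of the Lagrangian
           f=lambda x: 6*x[0] + 3*x[1] - 18.0,
           jac_f=lambda x: np.array([6.0, 3.0]),
           x0=[0.0, 0.0])
print(x, v)   # x ~ [1.8, 2.4], multiplier ~ 16/15
```

Because this example is a quadratic objective with a linear constraint, the QP subproblem coincides with the original problem and the iteration lands on the solution in one step.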


5 Optimization Problems Not Discussed

This workshop only scratched the surface of optimization problems and techniques to solve them. Some of the important topics that were not covered include :

1. Integer, Mixed-Integer, and Logic/Constraint Optimization.

2. Direct Search Optimization Techniques (e.g., Simulated Annealing, Genetic Algorithms).

3. Semidefinite Optimization.

4. Stochastic, Robust, Uncertain Optimization.

5. Multi-objective, Multi-criteria, and Vector Optimization.

6. Semi-Infinite Optimization.

7. Dynamic Optimization.