Introduction to
unconstrained optimization
- gradient-based methods
(cont.)
Jussi Hakanen
Post-doctoral researcher [email protected]
(Figure: Cone of learning by Edgar Dale)
Model algorithm for unconstrained
minimization
Let $x_h$ be the current estimate for $x^*$.
1) [Test for convergence.] If the conditions are satisfied, stop. The solution is $x_h$.
2) [Compute a search direction.] Compute a non-zero vector $d_h \in \mathbb{R}^n$, the search direction.
3) [Compute a step length.] Compute $\alpha_h > 0$, the step length, for which $f(x_h + \alpha_h d_h) < f(x_h)$.
4) [Update the estimate for the minimum.] Set $x_{h+1} = x_h + \alpha_h d_h$, $h = h + 1$, and go to step 1.
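In code, the model algorithm is just a loop around these four steps. Below is a minimal sketch in Python; the convergence test, the backtracking step-length rule, and the example function are illustrative choices added here, not prescribed by the slides.

```python
import numpy as np

def descent(f, grad, x, direction, tol=1e-6, max_iter=1000):
    """Generic descent loop following the model algorithm.

    `direction(x)` returns the search direction d_h; the step length is
    found by simple backtracking until f decreases (illustrative only; a
    practical code would enforce a sufficient-decrease condition).
    """
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # 1) test for convergence
            break
        d = direction(x)                   # 2) search direction
        alpha = 1.0                        # 3) step length by backtracking
        while f(x + alpha * d) >= f(x):
            alpha *= 0.5
            if alpha < 1e-16:
                return x                   # no decrease found, give up
        x = x + alpha * d                  # 4) update the estimate
    return x

# Example: steepest descent (d_h = -grad f) on f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(descent(f, grad, np.array([3.0, 1.0]), lambda x: -grad(x)))
```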
From Gill et al., Practical Optimization, 1981, Academic Press
Reminder: Important steps
Computing a step length $\alpha_h$
– The requirement $f(x_h + \alpha_h d_h) < f(x_h)$ is not sufficient
– E.g., consider the univariate function $f(x) = x^2$ and the sequence $x_h = (-1)^h\left(\frac{1}{2} + 2^{-h}\right)$: the function values decrease strictly, yet the iterates oscillate around $\pm\frac{1}{2}$ and never approach the minimizer $x^* = 0$ (demonstrated numerically below)
– Thus, the step length must produce a "sufficient decrease"
Computing a search direction $d_h \in \mathbb{R}^n$
– The function $f(x)$ decreases the most locally in the direction $-\nabla f(x)$
– If $d_h$ is almost orthogonal to $\nabla f(x_h)$, then $f$ is almost constant along $d_h$ → this must be avoided
When $f$ is bounded from below, a strictly decreasing sequence $\{x_h\}$ converges
– It can converge to a non-optimal point
– If the above situations are avoided, the level set $L(f(x_0)) = \{x \mid f(x) \le f(x_0)\}$ is closed and bounded, and $f$ is twice continuously differentiable, then $\lim_{h\to\infty} \nabla f(x_h) = 0$
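The counterexample is easy to check numerically; this tiny script (an illustration added here, not from the slides) shows $f(x_h)$ decreasing strictly while the iterates alternate in sign and stay near $\pm\frac{1}{2}$.

```python
f = lambda x: x**2
xs = [(-1)**h * (0.5 + 2.0**(-h)) for h in range(1, 9)]
for h, x in enumerate(xs, start=1):
    print(f"h={h}: x_h={x:+.4f}, f(x_h)={f(x):.4f}")
# f(x_h) = (1/2 + 2^-h)^2 decreases strictly to 1/4,
# but |x_h| -> 1/2, so the sequence never approaches the minimizer 0.
```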
Examples of gradient-based methods
Steepest descent (today: brief reminder)
Newton’s method (today: brief reminder)
Quasi-Newton method (today: introduced)
Conjugate gradient method (today: introduced)
Summary: steepest descent &
Newton’s method
Steepest descent ($x_{h+1} = x_h - \alpha_h \nabla f(x_h)$)
– Descent direction: yes, $\nabla f(x_h)^T d_h = -\nabla f(x_h)^T \nabla f(x_h) = -\|\nabla f(x_h)\|^2 < 0$
– Global convergence: yes
– Local convergence: zig-zags near the optimum
– Computational bottleneck: $\alpha_h$
– Memory consumption: $O(n)$
Newton’s method ($x_{h+1} = x_h - H(x_h)^{-1} \nabla f(x_h)$)
– Descent direction: only if $H(x_h)^{-1}$ is positive definite; $\nabla f(x_h)^T d_h = -\nabla f(x_h)^T H(x_h)^{-1} \nabla f(x_h)$
– Global convergence: no
– Local convergence: yes, quadratic near the optimum
– Computational bottleneck: $H(x_h)^{-1}$
– Memory consumption: $O(n^2)$
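The contrast can be seen on a simple quadratic. The sketch below (an illustration added here, with an exact line search, which is available in closed form for a quadratic) counts iterations for both update rules on an ill-conditioned problem.

```python
import numpy as np

# Quadratic test problem f(x) = 0.5 x^T A x with an ill-conditioned A:
# steepest descent zig-zags, Newton's method jumps to x* in one step.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
hess = lambda x: A

x = np.array([10.0, 1.0])
for k in range(1, 5001):
    g = grad(x)
    if np.linalg.norm(g) < 1e-6:
        break
    alpha = (g @ g) / (g @ (A @ g))   # exact line search for a quadratic
    x = x - alpha * g
print("steepest descent iterations:", k)

x = np.array([10.0, 1.0])
x = x - np.linalg.solve(hess(x), grad(x))  # Newton step: solve H d = -grad f
print("Newton iterations: 1, x =", x)      # exact minimizer for a quadratic
```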
Notes on Newton’s method
Solve $f(x) = 0$ with Newton’s (Newton-Raphson) method:
– $x_{h+1} = x_h - \frac{f(x_h)}{f'(x_h)}$
– http://en.wikipedia.org/wiki/Newton%27s_method
Solve $f'(x) = 0$ with Newton’s method ($F(x) := f'(x)$):
– $x_{h+1} = x_h - \frac{F(x_h)}{F'(x_h)} = x_h - \frac{f'(x_h)}{f''(x_h)}$
→ Newton’s method in nonlinear optimization is equal to solving $\nabla f(x) = 0$!
Usually, the next iterate is solved from
$H(x_h)(x_{h+1} - x_h) = -\nabla f(x_h)$
– No need to compute $H(x_h)^{-1}$ explicitly (see the sketch below)
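A minimal sketch of the iteration in this solve form, using the test function $f(x) = (x_1-2)^4 + (x_1-2x_2)^2$ that appears in the example slides later; the gradient and Hessian are worked out by hand.

```python
import numpy as np

# Newton's method for min f(x) = (x1-2)^4 + (x1-2*x2)^2, written in the
# form H(x_h)(x_{h+1} - x_h) = -grad f(x_h); the Hessian is never inverted.
def grad(x):
    return np.array([4*(x[0]-2)**3 + 2*(x[0]-2*x[1]),
                     -4*(x[0]-2*x[1])])

def hess(x):
    return np.array([[12*(x[0]-2)**2 + 2, -4.0],
                     [-4.0,                8.0]])

x = np.array([0.0, 3.0])
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    x = x + np.linalg.solve(hess(x), -g)   # solve H d = -g, then x <- x + d
print(x)   # approaches the minimizer (2, 1)
```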
Idea of quasi-Newton methods
A drawback of Newton’s method is that $H(x_h)$ may not be positive definite.
Quasi-Newton methods try to retain the good local convergence of Newton’s method, i.e., a local quadratic approximation of $f$, without the second derivatives.
– Cf. solving $f'(x) = 0$; left: Newton’s method, right: the secant method.
(Figure from Miettinen: Nonlinear optimization, 2007 (in Finnish))
Quasi-Newton method
Search direction $d_h = -D_h \nabla f(x_h)$, where the symmetric positive definite matrix $D_h$ is updated iteratively
– $D_h \to H^{-1}$ when $x_h \to x^*$
– $d_h$ is a descent direction since $D_h$ is positive definite
The step length $\alpha_h$ is optimized by line search, or e.g. $\alpha_h = 1$ can be used
– $x_{h+1} = x_h - \alpha_h D_h \nabla f(x_h)$
Quasi-Newton algorithm
1) Choose the final tolerance $\epsilon > 0$, a starting point $x_1$, and a symmetric positive definite $D_1$ (e.g. $D_1 = I$, the identity matrix). Set $y_1 = x_1$ and $h = j = 1$.
2) If $\|\nabla f(y_j)\| < \epsilon$, stop.
3) Set $d_j = -D_j \nabla f(y_j)$. Let $\alpha_j$ be the solution of the problem $\min_{\alpha \ge 0} f(y_j + \alpha d_j)$. Set $y_{j+1} = y_j + \alpha_j d_j$. If $j = n$, set $y_1 = x_{h+1} = y_{n+1}$, $h = h + 1$, $j = 1$, and go to 2).
4) Compute $D_{j+1}$. Set $j = j + 1$ and go to 2).
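A sketch of steps 1)-4) in code. The exact line search of step 3 is replaced by crude backtracking, the update rule for $D_{j+1}$ is passed in as a function (the DFP and BFGS choices follow below), and whether $D$ restarts at $D_1$ after each outer cycle is left open by the listing; this sketch restarts, matching the index reset.

```python
import numpy as np

def quasi_newton(f, grad, x1, update, eps=1e-6, max_outer=100):
    """Skeleton of the quasi-Newton algorithm above.

    `update(D, p, q)` returns D_{j+1} from D_j, p_j = y_{j+1} - y_j and
    q_j = grad f(y_{j+1}) - grad f(y_j); DFP or BFGS rules plug in here.
    """
    n = x1.size
    y = x1.copy()
    for _ in range(max_outer):
        D = np.eye(n)                      # restart with D_1 = I
        for j in range(n):
            g = grad(y)
            if np.linalg.norm(g) < eps:    # step 2
                return y
            d = -D @ g                     # step 3: search direction
            alpha = 1.0                    # backtracking stand-in for the
            while f(y + alpha * d) >= f(y) and alpha > 1e-16:
                alpha *= 0.5               # exact line search
            y_new = y + alpha * d
            if j < n - 1:                  # step 4: update D
                D = update(D, y_new - y, grad(y_new) - g)
            y = y_new
    return y
```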
Updating 𝐷ℎ
The exact Hessian is approximated using only objective function and gradient values.
Taylor series: $\nabla f(x_h + d_h) \approx \nabla f(x_h) + H(x_h) d_h$
Curvature: $d_h^T H(x_h) d_h \approx \left(\nabla f(x_h + d_h) - \nabla f(x_h)\right)^T d_h$
Curvature information can be approximated without second derivatives!
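The curvature identity is easy to sanity-check numerically; the function below is an arbitrary smooth test function chosen only for this illustration.

```python
import numpy as np

# Check d^T H(x) d  ≈  (grad f(x+d) - grad f(x))^T d  for a small step d,
# using f(x) = x1^4 + x1*x2 + (1+x2)^2 as an arbitrary smooth test function.
grad = lambda x: np.array([4*x[0]**3 + x[1], x[0] + 2*(1 + x[1])])
hess = lambda x: np.array([[12*x[0]**2, 1.0], [1.0, 2.0]])

x = np.array([1.0, 0.5])
d = 1e-3 * np.array([0.3, -0.7])
lhs = d @ hess(x) @ d                       # true curvature along d
rhs = (grad(x + d) - grad(x)) @ d           # gradient-difference estimate
print(lhs, rhs)   # agree to roughly O(||d||^3)
```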
DFP & BFGS updates
Let $p_h = x_{h+1} - x_h$ and $q_h = \nabla f(x_{h+1}) - \nabla f(x_h)$.
$H(x_h)^{-1}$ is approximated (DFP update):
– $D_{h+1} = D_h + \dfrac{p_h p_h^T}{p_h^T q_h} - \dfrac{D_h q_h q_h^T D_h}{q_h^T D_h q_h}$
$H(x_h)$ is approximated (BFGS update; the formula below gives the resulting update for $D_h$):
– $D_{h+1} = D_h + \left(1 + \dfrac{q_h^T D_h q_h}{p_h^T q_h}\right) \dfrac{p_h p_h^T}{p_h^T q_h} - \dfrac{D_h q_h p_h^T + p_h q_h^T D_h}{p_h^T q_h}$
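Both updates translate directly to code. A minimal sketch matching the formulas above (with no safeguard against a near-zero $p^T q$, which a practical implementation would need); either function can be passed as the `update` argument of the `quasi_newton` skeleton sketched earlier.

```python
import numpy as np

def dfp_update(D, p, q):
    # D_{h+1} = D_h + p p^T / (p^T q) - D q q^T D / (q^T D q)
    Dq = D @ q
    return D + np.outer(p, p) / (p @ q) - np.outer(Dq, Dq) / (q @ Dq)

def bfgs_update(D, p, q):
    # D_{h+1} = D_h + (1 + q^T D q / p^T q) * p p^T / (p^T q)
    #               - (D q p^T + p q^T D) / (p^T q)
    Dq = D @ q
    pq = p @ q
    return (D + (1 + (q @ Dq) / pq) * np.outer(p, p) / pq
              - (np.outer(Dq, p) + np.outer(p, Dq)) / pq)
```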
Notes
Usually $D_1 = I$ is used if no other information is available
It can be shown that if $D_1$ is positive definite and $\alpha_h$ is optimized exactly, then the matrices $D_2, D_3, \ldots$ are positive definite in both the DFP and BFGS updates
Updating $D_h$ is easier in the DFP update, but BFGS is less sensitive to rounding errors
A drawback of quasi-Newton methods is that they also use a lot of memory: $O(n^2)$
Newton’s method: example
(Figure from Gill et al., Practical Optimization, 1981)
Quasi-Newton: example
(Figure from Gill et al., Practical Optimization, 1981)
Conjugate gradient methods
Developed originally for solving systems of linear equations
– Minimizing a quadratic function without constraints is equivalent to solving $\nabla f(x) = 0$ (a system of linear equations when the resulting matrix is positive definite)
– Extended to systems of nonlinear equations and nonlinear unconstrained optimization
They are not as efficient as quasi-Newton methods, but they require less memory ($O(n)$)
– Only gradients need to be stored
– Good candidates for solving problems with a large number of variables
Conjugate gradient methods (cont.)
The idea is to improve the convergence properties of steepest descent
– A search direction is a combination of the current steepest descent direction and the previous search direction
Search direction
– $d_1 = -\nabla f(x_1)$
– $d_{h+1} = -\nabla f(x_{h+1}) + \beta_h d_h = \frac{1}{\mu}\big(\mu(-\nabla f(x_{h+1})) + (1-\mu)\,d_h\big)$, i.e. a scaled combination of the two directions with $\beta_h = (1-\mu)/\mu$
– $\beta_h$ is a parameter chosen so that consecutive search directions are conjugate (verified numerically below): Let $A$ be a symmetric $n \times n$ matrix. Directions $d_1, \ldots, d_p$ are ($A$-)conjugate if they are linearly independent and $d_i^T A d_j = 0$ for all $i, j = 1, \ldots, p$, $i \ne j$.
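For a quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$, the conjugacy of consecutive directions can be checked directly. The sketch below (an illustration added here) runs two exact-line-search steps with the Fletcher-Reeves $\beta$ on a random symmetric positive definite $A$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)         # random symmetric pos. def. matrix
b = rng.standard_normal(4)
grad = lambda x: A @ x - b          # gradient of 0.5 x^T A x - b^T x

x = np.zeros(4)
g = grad(x)
d = -g
alpha = -(g @ d) / (d @ (A @ d))    # exact line search along d
x = x + alpha * d
g_new = grad(x)
beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves coefficient
d_new = -g_new + beta * d
print(d @ (A @ d_new))              # ≈ 0: consecutive directions A-conjugate
```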
Conjugate gradient algorithm
1) Choose the final tolerance $\epsilon > 0$ and a starting point $x_1$. Set $y_1 = x_1$, $d_1 = -\nabla f(y_1)$, and $h = j = 1$.
2) If $\|\nabla f(y_j)\| < \epsilon$, stop.
3) Let $\alpha_j$ be the solution of the problem $\min_{\alpha \ge 0} f(y_j + \alpha d_j)$. Set $y_{j+1} = y_j + \alpha_j d_j$. If $j = n$, go to 5).
4) Set $d_{j+1} = -\nabla f(y_{j+1}) + \beta_j d_j$, where e.g. $\beta_j = \frac{\|\nabla f(y_{j+1})\|^2}{\|\nabla f(y_j)\|^2}$ (the Fletcher and Reeves method). Set $j = j + 1$ and go to 2).
5) Set $y_1 = x_{h+1} = y_{n+1}$ and $d_1 = -\nabla f(y_1)$. Set $j = 1$, $h = h + 1$, and go to 2).
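A minimal sketch of the algorithm for a general smooth $f$. The exact line search of step 3 is again replaced by simple backtracking, so this is illustrative rather than a faithful implementation; the test function is the one from the example slides below.

```python
import numpy as np

def conjugate_gradient(f, grad, x1, eps=1e-4, max_outer=500):
    """Fletcher-Reeves conjugate gradient, following steps 1)-5) above."""
    n = x1.size
    y = x1.copy()
    for _ in range(max_outer):
        d = -grad(y)                           # step 5 restart: d_1 = -grad
        for j in range(n):
            g = grad(y)
            if np.linalg.norm(g) < eps:        # step 2
                return y
            alpha = 1.0                        # step 3 (backtracking)
            while f(y + alpha * d) >= f(y) and alpha > 1e-16:
                alpha *= 0.5
            y = y + alpha * d
            if j < n - 1:                      # step 4: Fletcher-Reeves
                g_new = grad(y)
                beta = (g_new @ g_new) / (g @ g)
                d = -g_new + beta * d
    return y

# Example on the slides' test function f(x) = (x1-2)^4 + (x1-2*x2)^2:
f = lambda x: (x[0] - 2)**4 + (x[0] - 2*x[1])**2
grad = lambda x: np.array([4*(x[0]-2)**3 + 2*(x[0]-2*x[1]),
                           -4*(x[0]-2*x[1])])
print(conjugate_gradient(f, grad, np.array([0.0, 3.0])))  # towards (2, 1)
```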
Steepest descent: example
(Figure from Miettinen: Nonlinear optimization, 2007 (in Finnish))
$f(x) = (x_1 - 2)^4 + (x_1 - 2x_2)^2$
Conjugate gradient: example
(Figure from Miettinen: Nonlinear optimization, 2007 (in Finnish))
$f(x) = (x_1 - 2)^4 + (x_1 - 2x_2)^2$
Summary
Quasi-Newton methods
– Improve Newton’s method by guaranteeing a descent direction ($D_h$ pos. def.)
– Not quite as good convergence as Newton’s method
– No need to evaluate second derivatives
– Memory consumption $O(n^2)$
– Best suited for small- and medium-scale problems
Conjugate gradient methods
– Improve the convergence of steepest descent
– Memory consumption $O(n)$
– Best suited for large-scale problems