Introduction to
unconstrained optimization
- gradient-based methods
(cont.)
Jussi Hakanen
Post-doctoral researcher [email protected]
(Figure: Cone of learning by Edgar Dale)
Model algorithm for unconstrained
minimization
Let $x_h$ be the current estimate for $x^*$.
1) [Test for convergence.] If the conditions are satisfied, stop. The solution is $x_h$.
2) [Compute a search direction.] Compute a non-zero vector $d_h \in \mathbb{R}^n$, the search direction.
3) [Compute a step length.] Compute $\alpha_h > 0$, the step length, for which $f(x_h + \alpha_h d_h) < f(x_h)$.
4) [Update the estimate for the minimum.] Set $x_{h+1} = x_h + \alpha_h d_h$, $h = h + 1$, and go to step 1.
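In code, the model algorithm is just a loop around these four steps. Below is a minimal sketch in Python; the convergence test, the backtracking step-length rule, and the example function are illustrative choices added here, not prescribed by the slides.

```python
import numpy as np

def descent(f, grad, x, direction, tol=1e-6, max_iter=1000):
    """Generic descent loop following the model algorithm.

    `direction(x)` returns the search direction d_h; the step length is
    found by simple backtracking until f decreases (illustrative only; a
    practical code would enforce a sufficient-decrease condition).
    """
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # 1) test for convergence
            break
        d = direction(x)                   # 2) search direction
        alpha = 1.0                        # 3) step length by backtracking
        while f(x + alpha * d) >= f(x):
            alpha *= 0.5
            if alpha < 1e-16:
                return x                   # no decrease found, give up
        x = x + alpha * d                  # 4) update the estimate
    return x

# Example: steepest descent (d_h = -grad f) on f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(descent(f, grad, np.array([3.0, 1.0]), lambda x: -grad(x)))
```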
From Gill et al., Practical Optimization, 1981, Academic Press
Reminder: Important steps
Computing a step length $\alpha_h$
– The requirement $f(x_h + \alpha_h d_h) < f(x_h)$ is not sufficient
– E.g., consider the univariate function $f(x) = x^2$ and the sequence $x_h = (-1)^h\left(\frac{1}{2} + 2^{-h}\right)$: the function values decrease strictly, yet the iterates oscillate around $\pm\frac{1}{2}$ and never approach the minimizer $x^* = 0$ (demonstrated numerically below)
– Thus, the step length must produce a "sufficient decrease"
Computing a search direction $d_h \in \mathbb{R}^n$
– The function $f(x)$ decreases the most locally in the direction $-\nabla f(x)$
– If $d_h$ is almost orthogonal to $\nabla f(x_h)$, then $f$ is almost constant along $d_h$ → this must be avoided
When $f$ is bounded from below, a strictly decreasing sequence $\{x_h\}$ converges
– It can converge to a non-optimal point
– If the above situations are avoided, the level set $L(f(x_0)) = \{x \mid f(x) \le f(x_0)\}$ is closed and bounded, and $f$ is twice continuously differentiable, then $\lim_{h\to\infty} \nabla f(x_h) = 0$
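The counterexample is easy to check numerically; this tiny script (an illustration added here, not from the slides) shows $f(x_h)$ decreasing strictly while the iterates alternate in sign and stay near $\pm\frac{1}{2}$.

```python
f = lambda x: x**2
xs = [(-1)**h * (0.5 + 2.0**(-h)) for h in range(1, 9)]
for h, x in enumerate(xs, start=1):
    print(f"h={h}: x_h={x:+.4f}, f(x_h)={f(x):.4f}")
# f(x_h) = (1/2 + 2^-h)^2 decreases strictly to 1/4,
# but |x_h| -> 1/2, so the sequence never approaches the minimizer 0.
```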
Examples of gradient-based methods
Steepest descent (today: brief reminder)
Newton’s method (today: brief reminder)
Quasi-Newton method (today: introduced)
Conjugate gradient method (today: introduced)
Summary: steepest descent &
Newton’s method
Steepest descent ($x_{h+1} = x_h - \alpha_h \nabla f(x_h)$)
– Descent direction: yes, $\nabla f(x_h)^T d_h = -\nabla f(x_h)^T \nabla f(x_h) = -\|\nabla f(x_h)\|^2 < 0$
– Global convergence: yes
– Local convergence: zig-zags near the optimum
– Computational bottleneck: $\alpha_h$
– Memory consumption: $O(n)$
Newton’s method ($x_{h+1} = x_h - H(x_h)^{-1} \nabla f(x_h)$)
– Descent direction: only if $H(x_h)^{-1}$ is positive definite; $\nabla f(x_h)^T d_h = -\nabla f(x_h)^T H(x_h)^{-1} \nabla f(x_h)$
– Global convergence: no
– Local convergence: yes, quadratic near the optimum
– Computational bottleneck: $H(x_h)^{-1}$
– Memory consumption: $O(n^2)$
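The contrast can be seen on a simple quadratic. The sketch below (an illustration added here, with an exact line search, which is available in closed form for a quadratic) counts iterations for both update rules on an ill-conditioned problem.

```python
import numpy as np

# Quadratic test problem f(x) = 0.5 x^T A x with an ill-conditioned A:
# steepest descent zig-zags, Newton's method jumps to x* in one step.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
hess = lambda x: A

x = np.array([10.0, 1.0])
for k in range(1, 5001):
    g = grad(x)
    if np.linalg.norm(g) < 1e-6:
        break
    alpha = (g @ g) / (g @ (A @ g))   # exact line search for a quadratic
    x = x - alpha * g
print("steepest descent iterations:", k)

x = np.array([10.0, 1.0])
x = x - np.linalg.solve(hess(x), grad(x))  # Newton step: solve H d = -grad f
print("Newton iterations: 1, x =", x)      # exact minimizer for a quadratic
```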
Notes on Newton’s method
Solve $f(x) = 0$ with Newton’s (Newton-Raphson) method:
– $x_{h+1} = x_h - \frac{f(x_h)}{f'(x_h)}$
– http://en.wikipedia.org/wiki/Newton%27s_method
Solve $f'(x) = 0$ with Newton’s method ($F(x) := f'(x)$):
– $x_{h+1} = x_h - \frac{F(x_h)}{F'(x_h)} = x_h - \frac{f'(x_h)}{f''(x_h)}$
→ Newton’s method in nonlinear optimization is equal to solving $\nabla f(x) = 0$!
Usually, the next iterate is solved from
$H(x_h)(x_{h+1} - x_h) = -\nabla f(x_h)$
– No need to compute $H(x_h)^{-1}$ explicitly (see the sketch below)
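A minimal sketch of the iteration in this solve form, using the test function $f(x) = (x_1-2)^4 + (x_1-2x_2)^2$ that appears in the example slides later; the gradient and Hessian are worked out by hand.

```python
import numpy as np

# Newton's method for min f(x) = (x1-2)^4 + (x1-2*x2)^2, written in the
# form H(x_h)(x_{h+1} - x_h) = -grad f(x_h); the Hessian is never inverted.
def grad(x):
    return np.array([4*(x[0]-2)**3 + 2*(x[0]-2*x[1]),
                     -4*(x[0]-2*x[1])])

def hess(x):
    return np.array([[12*(x[0]-2)**2 + 2, -4.0],
                     [-4.0,                8.0]])

x = np.array([0.0, 3.0])
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    x = x + np.linalg.solve(hess(x), -g)   # solve H d = -g, then x <- x + d
print(x)   # approaches the minimizer (2, 1)
```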
Idea of quasi-Newton methods
A drawback of Newton’s method is that $H(x_h)$ may not be positive definite.
Quasi-Newton methods try to retain the good local convergence of Newton’s method, i.e., a local quadratic approximation of $f$, without the second derivatives.
– Cf. solving $f'(x) = 0$; left: Newton’s method, right: the secant method.
(Figure from Miettinen: Nonlinear optimization, 2007 (in Finnish))
Quasi-Newton method
Search direction $d_h = -D_h \nabla f(x_h)$, where the symmetric positive definite matrix $D_h$ is updated iteratively
– $D_h \to H^{-1}$ when $x_h \to x^*$
– $d_h$ is a descent direction since $D_h$ is positive definite
The step length $\alpha_h$ is optimized by line search, or e.g. $\alpha_h = 1$ can be used
– $x_{h+1} = x_h - \alpha_h D_h \nabla f(x_h)$
Quasi-Newton algorithm
1) Choose the final tolerance $\epsilon > 0$, a starting point $x_1$, and a symmetric positive definite $D_1$ (e.g. $D_1 = I$, the identity matrix). Set $y_1 = x_1$ and $h = j = 1$.
2) If $\|\nabla f(y_j)\| < \epsilon$, stop.
3) Set $d_j = -D_j \nabla f(y_j)$. Let $\alpha_j$ be the solution of the problem $\min_{\alpha \ge 0} f(y_j + \alpha d_j)$. Set $y_{j+1} = y_j + \alpha_j d_j$. If $j = n$, set $y_1 = x_{h+1} = y_{n+1}$, $h = h + 1$, $j = 1$, and go to 2).
4) Compute $D_{j+1}$. Set $j = j + 1$ and go to 2).
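A sketch of steps 1)-4) in code. The exact line search of step 3 is replaced by crude backtracking, the update rule for $D_{j+1}$ is passed in as a function (the DFP and BFGS choices follow below), and whether $D$ restarts at $D_1$ after each outer cycle is left open by the listing; this sketch restarts, matching the index reset.

```python
import numpy as np

def quasi_newton(f, grad, x1, update, eps=1e-6, max_outer=100):
    """Skeleton of the quasi-Newton algorithm above.

    `update(D, p, q)` returns D_{j+1} from D_j, p_j = y_{j+1} - y_j and
    q_j = grad f(y_{j+1}) - grad f(y_j); DFP or BFGS rules plug in here.
    """
    n = x1.size
    y = x1.copy()
    for _ in range(max_outer):
        D = np.eye(n)                      # restart with D_1 = I
        for j in range(n):
            g = grad(y)
            if np.linalg.norm(g) < eps:    # step 2
                return y
            d = -D @ g                     # step 3: search direction
            alpha = 1.0                    # backtracking stand-in for the
            while f(y + alpha * d) >= f(y) and alpha > 1e-16:
                alpha *= 0.5               # exact line search
            y_new = y + alpha * d
            if j < n - 1:                  # step 4: update D
                D = update(D, y_new - y, grad(y_new) - g)
            y = y_new
    return y
```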
Updating 𝐷ℎ
The exact Hessian is approximated using only objective function and gradient values.
Taylor series: $\nabla f(x_h + d_h) \approx \nabla f(x_h) + H(x_h) d_h$
Curvature: $d_h^T H(x_h) d_h \approx \left(\nabla f(x_h + d_h) - \nabla f(x_h)\right)^T d_h$
Curvature information can be approximated without second derivatives!
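The curvature identity is easy to sanity-check numerically; the function below is an arbitrary smooth test function chosen only for this illustration.

```python
import numpy as np

# Check d^T H(x) d  ≈  (grad f(x+d) - grad f(x))^T d  for a small step d,
# using f(x) = x1^4 + x1*x2 + (1+x2)^2 as an arbitrary smooth test function.
grad = lambda x: np.array([4*x[0]**3 + x[1], x[0] + 2*(1 + x[1])])
hess = lambda x: np.array([[12*x[0]**2, 1.0], [1.0, 2.0]])

x = np.array([1.0, 0.5])
d = 1e-3 * np.array([0.3, -0.7])
lhs = d @ hess(x) @ d                       # true curvature along d
rhs = (grad(x + d) - grad(x)) @ d           # gradient-difference estimate
print(lhs, rhs)   # agree to roughly O(||d||^3)
```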
DFP & BFGS updates
Let $p_h = x_{h+1} - x_h$ and $q_h = \nabla f(x_{h+1}) - \nabla f(x_h)$.
$H(x_h)^{-1}$ is approximated (DFP update):
– $D_{h+1} = D_h + \dfrac{p_h p_h^T}{p_h^T q_h} - \dfrac{D_h q_h q_h^T D_h}{q_h^T D_h q_h}$
$H(x_h)$ is approximated (BFGS update; the formula below gives the resulting update for $D_h$):
– $D_{h+1} = D_h + \left(1 + \dfrac{q_h^T D_h q_h}{p_h^T q_h}\right) \dfrac{p_h p_h^T}{p_h^T q_h} - \dfrac{D_h q_h p_h^T + p_h q_h^T D_h}{p_h^T q_h}$
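Both updates translate directly to code. A minimal sketch matching the formulas above (with no safeguard against a near-zero $p^T q$, which a practical implementation would need); either function can be passed as the `update` argument of the `quasi_newton` skeleton sketched earlier.

```python
import numpy as np

def dfp_update(D, p, q):
    # D_{h+1} = D_h + p p^T / (p^T q) - D q q^T D / (q^T D q)
    Dq = D @ q
    return D + np.outer(p, p) / (p @ q) - np.outer(Dq, Dq) / (q @ Dq)

def bfgs_update(D, p, q):
    # D_{h+1} = D_h + (1 + q^T D q / p^T q) * p p^T / (p^T q)
    #               - (D q p^T + p q^T D) / (p^T q)
    Dq = D @ q
    pq = p @ q
    return (D + (1 + (q @ Dq) / pq) * np.outer(p, p) / pq
              - (np.outer(Dq, p) + np.outer(p, Dq)) / pq)
```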
Notes
Usually $D_1 = I$ is used if no other information is available
It can be shown that if $D_1$ is positive definite and $\alpha_h$ is optimized exactly, then the matrices $D_2, D_3, \ldots$ are positive definite in both the DFP and BFGS updates
Updating $D_h$ is easier in the DFP update, but BFGS is less sensitive to rounding errors
A drawback of quasi-Newton methods is that they also use a lot of memory: $O(n^2)$
Newton’s method: example
(Figure from Gill et al., Practical Optimization, 1981)
Quasi-Newton: example
(Figure from Gill et al., Practical Optimization, 1981)
Conjugate gradient methods
Developed originally for solving systems of linear equations
– Minimizing a quadratic function without constraints is equivalent to solving $\nabla f(x) = 0$ (a system of linear equations when the resulting matrix is positive definite)
– Extended to systems of nonlinear equations and nonlinear unconstrained optimization
They are not as efficient as quasi-Newton methods, but they require less memory ($O(n)$)
– Only gradients need to be stored
– Good candidates for solving problems with a large number of variables
Conjugate gradient methods (cont.)
The idea is to improve the convergence properties of steepest descent
– A search direction is a combination of the current steepest descent direction and the previous search direction
Search direction
– $d_1 = -\nabla f(x_1)$
– $d_{h+1} = -\nabla f(x_{h+1}) + \beta_h d_h = \frac{1}{\mu}\big(\mu(-\nabla f(x_{h+1})) + (1-\mu)\,d_h\big)$, i.e. a scaled combination of the two directions with $\beta_h = (1-\mu)/\mu$
– $\beta_h$ is a parameter chosen so that consecutive search directions are conjugate (verified numerically below): Let $A$ be a symmetric $n \times n$ matrix. Directions $d_1, \ldots, d_p$ are ($A$-)conjugate if they are linearly independent and $d_i^T A d_j = 0$ for all $i, j = 1, \ldots, p$, $i \ne j$.
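For a quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$, the conjugacy of consecutive directions can be checked directly. The sketch below (an illustration added here) runs two exact-line-search steps with the Fletcher-Reeves $\beta$ on a random symmetric positive definite $A$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)         # random symmetric pos. def. matrix
b = rng.standard_normal(4)
grad = lambda x: A @ x - b          # gradient of 0.5 x^T A x - b^T x

x = np.zeros(4)
g = grad(x)
d = -g
alpha = -(g @ d) / (d @ (A @ d))    # exact line search along d
x = x + alpha * d
g_new = grad(x)
beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves coefficient
d_new = -g_new + beta * d
print(d @ (A @ d_new))              # ≈ 0: consecutive directions A-conjugate
```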
Conjugate gradient algorithm
1) Choose the final tolerance $\epsilon > 0$ and a starting point $x_1$. Set $y_1 = x_1$, $d_1 = -\nabla f(y_1)$, and $h = j = 1$.
2) If $\|\nabla f(y_j)\| < \epsilon$, stop.
3) Let $\alpha_j$ be the solution of the problem $\min_{\alpha \ge 0} f(y_j + \alpha d_j)$. Set $y_{j+1} = y_j + \alpha_j d_j$. If $j = n$, go to 5).
4) Set $d_{j+1} = -\nabla f(y_{j+1}) + \beta_j d_j$, where e.g. $\beta_j = \frac{\|\nabla f(y_{j+1})\|^2}{\|\nabla f(y_j)\|^2}$ (the Fletcher and Reeves method). Set $j = j + 1$ and go to 2).
5) Set $y_1 = x_{h+1} = y_{n+1}$ and $d_1 = -\nabla f(y_1)$. Set $j = 1$, $h = h + 1$, and go to 2).
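A minimal sketch of the algorithm for a general smooth $f$. The exact line search of step 3 is again replaced by simple backtracking, so this is illustrative rather than a faithful implementation; the test function is the one from the example slides below.

```python
import numpy as np

def conjugate_gradient(f, grad, x1, eps=1e-4, max_outer=500):
    """Fletcher-Reeves conjugate gradient, following steps 1)-5) above."""
    n = x1.size
    y = x1.copy()
    for _ in range(max_outer):
        d = -grad(y)                           # step 5 restart: d_1 = -grad
        for j in range(n):
            g = grad(y)
            if np.linalg.norm(g) < eps:        # step 2
                return y
            alpha = 1.0                        # step 3 (backtracking)
            while f(y + alpha * d) >= f(y) and alpha > 1e-16:
                alpha *= 0.5
            y = y + alpha * d
            if j < n - 1:                      # step 4: Fletcher-Reeves
                g_new = grad(y)
                beta = (g_new @ g_new) / (g @ g)
                d = -g_new + beta * d
    return y

# Example on the slides' test function f(x) = (x1-2)^4 + (x1-2*x2)^2:
f = lambda x: (x[0] - 2)**4 + (x[0] - 2*x[1])**2
grad = lambda x: np.array([4*(x[0]-2)**3 + 2*(x[0]-2*x[1]),
                           -4*(x[0]-2*x[1])])
print(conjugate_gradient(f, grad, np.array([0.0, 3.0])))  # towards (2, 1)
```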
Steepest descent: example
(Figure from Miettinen: Nonlinear optimization, 2007 (in Finnish))
$f(x) = (x_1 - 2)^4 + (x_1 - 2x_2)^2$
Conjugate gradient: example
(Figure from Miettinen: Nonlinear optimization, 2007 (in Finnish))
$f(x) = (x_1 - 2)^4 + (x_1 - 2x_2)^2$
Summary
Quasi-Newton methods
– Improve Newton’s method by guaranteeing a descent direction ($D_h$ pos. def.)
– Not quite as good convergence as Newton’s method
– No need to evaluate second derivatives
– Memory consumption $O(n^2)$
– Best suited for small- and medium-scale problems
Conjugate gradient methods
– Improve the convergence of steepest descent
– Memory consumption $O(n)$
– Best suited for large-scale problems