

  • C25 Optimization
    8 Lectures, Hilary Term 2013, 2 Tutorial Sheets
    Prof. A. Zisserman

    Overview
    • Lecture 1 (AZ): Review of B1 unconstrained multivariate optimization, Levenberg-Marquardt algorithm.
    • Lecture 2 (AZ): Discrete optimization, dynamic programming.
    • Lectures 3-6 (BK): Classical optimization. Constrained optimization – equality constraints, Lagrange multipliers, inequality constraints. Nonlinear programming – search methods, approximation methods, axial iteration, pattern search, descent methods, quasi-Newton methods. Least squares and singular values.
    • Lecture 7 (AZ): Discrete optimization on graphs
    • Lecture 8 (AZ): Graph-cut, max-flow, and applications

    Background reading and web resources
    • Numerical Recipes in C (or C++): The Art of Scientific Computing
      William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling, CUP 1992/2002
      Good chapter on optimization; available online at http://www.nrbook.com/a/bookcpdf.php

  • Background reading and web resources

    • Convex Optimization
      Stephen Boyd and Lieven Vandenberghe, CUP 2004
      Available online at http://www.stanford.edu/~boyd/cvxbook/

    • Further reading, web resources, and the lecture notes are on http://www.robots.ox.ac.uk/~az/lectures/opt

    The Optimization Landscape

    • continuous optimization – variables can take any values (subject to constraints), e.g. x > 0

    • discrete optimization – variables are restricted to a set of values (often finite), e.g. binary, {0, 1, …, 255}

    We will see that, for cost functions with a particular structure, there are very efficient discrete optimization algorithms

  • Example: image processing – noise removal

    Measured noisy data

        z_i = x_i + w_i,   where x_i ∈ {0, 1} and w_i ∼ N(0, σ²)

    [Figure: a noisy 1D binary signal; true values x_i on the levels 0 and 1, measurements z_i scattered about them.]

    Estimate the true (noise-free) signal by solving an optimization problem:

        min_x f(x) = Σ_i (z_i − x_i)² + d φ(x_i, x_{i−1})

    where

        φ(x_i, x_j) = 0 if x_i = x_j,  1 if x_i ≠ x_j

    • n is the number of pixels in the image, e.g. n = 20k
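    For the 1D chain above, the discrete problem can be minimized exactly by dynamic programming (the topic of Lecture 2): at each sample only the best cost for each label of the previous sample matters. A minimal Python sketch (my own illustration, assuming the model exactly as written: binary labels, squared data term, pairwise weight d):

    ```python
    def denoise_1d(z, d):
        """Exactly minimize f(x) = sum_i (z_i - x_i)^2 + d * sum_i [x_i != x_{i-1}]
        over binary signals x, by dynamic programming along the chain."""
        labels = (0, 1)
        # cost[s]: best cost of a prefix whose last sample is labelled s
        cost = {s: (z[0] - s) ** 2 for s in labels}
        back = []  # back-pointers for recovering the minimizing signal
        for zi in z[1:]:
            new_cost, ptr = {}, {}
            for s in labels:
                best_t = min(labels, key=lambda t: cost[t] + d * (t != s))
                new_cost[s] = (zi - s) ** 2 + cost[best_t] + d * (best_t != s)
                ptr[s] = best_t
            cost = new_cost
            back.append(ptr)
        # backtrack from the best final label
        s = min(labels, key=lambda t: cost[t])
        x = [s]
        for ptr in reversed(back):
            s = ptr[s]
            x.append(s)
        return x[::-1], min(cost.values())
    ```

    The run time is linear in n, in contrast to the O(2ⁿ) of exhaustive search; the 2D version has no such chain ordering, which is why graph-cut methods are needed there.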

    Now in 2D for an image, where N(i) is the neighbourhood of node i (the adjacent pixels j):

        f(x) = Σ_{i=1}^{n} (z_i − x_i)² + d Σ_{j∈N(i)} φ(x_i, x_j)

    [Figure: “original” and “original plus noise” binary images, approximately 60 × 220 pixels.]

    Task of minimization:

    • continuous optimization over x ∈ IRⁿ: many local minima

    • discrete optimization, restricting each x_i to 0 or 1: a search space of size O(2ⁿ)

    [Figure: results of the max-flow algorithm on the noisy image (n ∼ 20K): “original”, “original plus noise”, and the minimizing x for d = 10 and for d = 60.]

  • Lecture outline

    B1 revision – unconstrained continuous optimization:
    • Convexity
    • Iterative optimization algorithms
    • Gradient descent
    • Newton’s method
    • Gauss-Newton method

    New topics:
    • Axial iteration
    • Levenberg-Marquardt algorithm
    • Application

    Introduction: problem specification

    Suppose we have a cost function (or objective function)

        f(x) : IRⁿ → IR

    Our aim is to find the value of the parameters x that minimizes this function,

        x* = argmin_x f(x)

    subject to the following constraints:

    • equality:    c_i(x) = 0,   i = 1, …, m_e

    • inequality:  c_i(x) ≥ 0,   i = m_e + 1, …, m

    We will start by focussing on unconstrained problems.

  • Unconstrained optimization

        min_x f(x)

    [Figure: a function f(x) of one variable with a local minimum and a global minimum.]

    • down-hill search (gradient descent) algorithms can find local minima

    • which of the minima is found depends on the starting point

    • such minima often occur in real applications

    Reminder: convexity

  • Class of functions …

    [Figure: 1D examples – “single extremum – convex”; “single extremum – non-convex”; “multiple extrema – non-convex”; “noisy”; “horrible” (all but the first are not convex).]

    • Convexity provides a test for a single extremum

    • A non-negative sum of convex functions is convex

  • Optimization algorithm – key ideas

    • Find δx such that f(x + δx) < f(x)

    • This leads to an iterative update x_{n+1} = x_n + δx

    • Reduce the problem to a series of 1D line searches: δx = αp

    Choosing the direction 1: axial iteration

    Alternate minimization over x and y.

    [Figure: contour plot of a 2D quadratic; the axial-iteration path zig-zags along the coordinate directions.]
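    For a quadratic cost the 1D minimizations have a closed form, so axial iteration is only a few lines. A sketch (my own illustration, not from the notes) for f(x) = ½xᵀAx − bᵀx, where exact minimization along coordinate i sets ∂f/∂x_i = 0:

    ```python
    import numpy as np

    def axial_iteration(A, b, x, sweeps=100):
        """Minimize f(x) = 0.5 x^T A x - b^T x (A symmetric positive definite)
        by exact line minimization along each coordinate axis in turn."""
        for _ in range(sweeps):
            for i in range(len(b)):
                # df/dx_i = (A x)_i - b_i = 0, solved for x_i with the rest fixed;
                # A[i] @ x still contains the old x_i, so add it back first
                x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
        return x
    ```

    Each pair of inner steps is one “corner” of the staircase path; for this quadratic case the scheme coincides with the Gauss-Seidel iteration for Ax = b.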

  • Choosing the direction 2: steepest descent

    Move in the direction of the negative gradient −∇f(x_n).

    [Figure: contour plots of a 2D quadratic with the steepest-descent path.]

    • The gradient is everywhere perpendicular to the contour lines.

    • After each line minimization the new gradient is always orthogonal to the previous step direction (true of any line minimization).

    • Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.

  • A harder case: Rosenbrock’s function

        f(x, y) = 100 (y − x²)² + (1 − x)²

    Minimum is at [1, 1].

    [Figure: contours of the Rosenbrock function with the steepest-descent path, and a zoomed view near (−0.9, 0.75).]

    Steepest descent on the Rosenbrock function:

    • The zig-zag behaviour is clear in the zoomed view (100 iterations)

    • The algorithm crawls down the valley
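    The crawling is easy to reproduce. A sketch of steepest descent on the Rosenbrock function, with a crude backtracking step in place of the exact line minimization used for the figures (so iteration counts will differ from the slides):

    ```python
    import numpy as np

    def rosenbrock(p):
        x, y = p
        return 100 * (y - x**2) ** 2 + (1 - x) ** 2

    def rosenbrock_grad(p):
        x, y = p
        return np.array([-400 * x * (y - x**2) - 2 * (1 - x),
                         200 * (y - x**2)])

    def steepest_descent(p, iters=5000):
        for _ in range(iters):
            d = -rosenbrock_grad(p)              # steepest-descent direction
            a = 1.0
            while rosenbrock(p + a * d) >= rosenbrock(p):
                a *= 0.5                         # backtrack until f decreases
                if a < 1e-12:                    # numerically stationary
                    return p
            p = p + a * d
        return p
    ```

    Starting from (−1.2, 1), thousands of iterations still leave the iterate crawling along the curved valley towards the minimum at (1, 1).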

  • Choosing the direction 3: conjugate gradients

    Again, uses first derivatives only, but avoids “undoing” previous work.

    The method of conjugate gradients chooses successive descent directions p_n such that it is guaranteed to reach the minimum in a finite number of steps.

    • Each p_n is chosen to be conjugate to all previous search directions with respect to the Hessian H:

          p_nᵀ H p_j = 0,   0 ≤ j < n

    • The resulting search directions are mutually linearly independent.

    • Remarkably, p_n can be chosen using only knowledge of p_{n−1}, ∇f(x_{n−1}) and ∇f(x_n) (see Numerical Recipes):

          p_n = −∇f_n + ( (∇f_nᵀ ∇f_n) / (∇f_{n−1}ᵀ ∇f_{n−1}) ) p_{n−1}

    • An N-dimensional quadratic form can be minimized in at most N conjugate descent steps.

    • [Figure: 3 different starting points; the minimum is reached in exactly 2 steps.]
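    For the quadratic case f(x) = ½xᵀAx − bᵀx (so H = A and −∇f = b − Ax), the recurrence above becomes the classical linear conjugate-gradient method, and the finite-termination property can be checked directly. A sketch, using the same ∇fᵀ∇f ratio for the direction update:

    ```python
    import numpy as np

    def conjugate_gradient(A, b, x):
        """Minimize 0.5 x^T A x - b^T x for symmetric positive definite A.
        In exact arithmetic the minimum is reached in at most n steps."""
        r = b - A @ x                        # residual = negative gradient
        p = r.copy()                         # first direction: steepest descent
        for _ in range(len(b)):
            Ap = A @ p
            alpha = (r @ r) / (p @ Ap)       # exact line minimization along p
            x = x + alpha * p
            r_new = r - alpha * Ap
            if np.linalg.norm(r_new) < 1e-12:
                break
            beta = (r_new @ r_new) / (r @ r) # the Fletcher-Reeves ratio
            p = r_new + beta * p             # conjugate to previous directions
            r = r_new
        return x
    ```

    For a 2 × 2 system the loop terminates after at most 2 steps, matching the figure.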

  • Choosing the direction 4: Newton’s method

    Start from the Taylor expansion in 2D. A function may be approximated locally by its Taylor series expansion about a point x₀:

        f(x + δx) ≈ f(x) + (∂f/∂x, ∂f/∂y) (δx, δy)ᵀ
                    + ½ (δx, δy) [ ∂²f/∂x²    ∂²f/∂x∂y ]
                                 [ ∂²f/∂x∂y   ∂²f/∂y²  ] (δx, δy)ᵀ

    The expansion to second order is a quadratic function:

        f(x + δx) = a + gᵀδx + ½ δxᵀ H δx

    Now minimize this expansion over δx. For a minimum we require that ∇f(x + δx) = 0, and so

        ∇f(x + δx) = g + H δx = 0,   with solution δx = −H⁻¹g   (Matlab: δx = −H\g)

    This gives the iterative update

        x_{n+1} = x_n − H_n⁻¹ g_n

    [Figure: contour plot; the Newton step jumps to the minimum of the local quadratic approximation.]

    • If f(x) is quadratic, then the solution is found in one step.

    • The method has quadratic convergence (as in the 1D case).

    • The step δx = −H_n⁻¹ g_n is guaranteed to be a downhill direction provided that H is positive definite.

    • Rather than jump straight to the predicted solution at x_n − H_n⁻¹ g_n, it is better to perform a line search:

          x_{n+1} = x_n − α_n H_n⁻¹ g_n

    • If H = I, this reduces to steepest descent.

    Newton’s method – example

    [Figure: Newton’s method with line search on the Rosenbrock function; gradient < 1e-3 after 15 iterations. Ellipses show successive quadratic approximations.]

    • The algorithm converges in only 15 iterations – far superior to steepest descent.

    • However, the method requires computing the Hessian matrix at each iteration – this is not always feasible.
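    For the Rosenbrock function the Hessian is available in closed form, so Newton’s method with a line search is short to write out. A sketch (with simple backtracking instead of an exact line search, and a gradient fallback for the case where H is not positive definite):

    ```python
    import numpy as np

    def f(p):
        x, y = p
        return 100 * (y - x**2) ** 2 + (1 - x) ** 2

    def grad(p):
        x, y = p
        return np.array([-400 * x * (y - x**2) - 2 * (1 - x),
                         200 * (y - x**2)])

    def hess(p):
        x, y = p
        return np.array([[1200 * x**2 - 400 * y + 2, -400 * x],
                         [-400 * x, 200.0]])

    def newton(p, iters=50):
        for _ in range(iters):
            g = grad(p)
            if np.linalg.norm(g) < 1e-10:
                break
            dx = np.linalg.solve(hess(p), -g)  # dx = -H^{-1} g
            if g @ dx > 0:                     # not downhill: H not pos. definite
                dx = -g
            a = 1.0
            while f(p + a * dx) > f(p) and a > 1e-10:
                a *= 0.5                       # backtracking line search
            p = p + a * dx
        return p
    ```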

  • Performance issues for optimization algorithms

    1. Number of iterations required

    2. Cost per iteration

    3. Memory footprint

    4. Region of convergence

    Special structure for cost function – non-linear least squares

    • It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals:

          f(x) = Σ_{i=1}^{M} r_i(x)²

    • If each residual r_i(x) depends non-linearly on the parameters x, then the minimization of f(x) is a non-linear least squares problem.

    • We also assume that the residuals r_i are: (i) small at the optimum, and (ii) zero-mean.

  • Non-linear least squares

        f(x) = Σ_{i=1}^{M} r_i²

    Gradient:

        ∇f(x) = 2 Σ_i r_i(x) ∇r_i(x)

    Hessian:

        H = ∇∇ᵀf(x) = 2 Σ_i ∇( r_i(x) ∇ᵀr_i(x) )
          = 2 Σ_i ( ∇r_i(x) ∇ᵀr_i(x) + r_i(x) ∇∇ᵀr_i(x) )

    which is approximated as

        H_GN = 2 Σ_i ∇r_i(x) ∇ᵀr_i(x)

    This is the Gauss-Newton approximation.

    [Figure: Gauss-Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 14 iterations.]

    • Minimization with the Gauss-Newton approximation and line search takes only 14 iterations:

          x_{n+1} = x_n − α_n H_n⁻¹ g_n   with   H_n = H_GN(x_n)
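    The Rosenbrock function is itself a sum of two squared residuals, r₁ = 10(y − x²) and r₂ = 1 − x, so Gauss-Newton applies directly. A sketch taking full steps with no line search (unlike the figures, so the iteration count differs); with residual vector r and Jacobian J, the step solves H_GN δx = −g, i.e. JᵀJ δx = −Jᵀr, the factors of 2 cancelling:

    ```python
    import numpy as np

    def residuals(p):
        x, y = p
        return np.array([10 * (y - x**2), 1 - x])

    def jacobian(p):                 # J[i, k] = d r_i / d p_k
        x, y = p
        return np.array([[-20 * x, 10.0],
                         [-1.0, 0.0]])

    def gauss_newton(p, iters=20):
        for _ in range(iters):
            r, J = residuals(p), jacobian(p)
            dx = np.linalg.solve(J.T @ J, -J.T @ r)  # Gauss-Newton step
            p = p + dx
            if np.linalg.norm(dx) < 1e-12:
                break
        return p
    ```

    From (−1.2, 1) this converges in a handful of steps, but in general full Gauss-Newton steps can overshoot and increase f, which motivates Levenberg-Marquardt.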

  • Comparison

    [Figure: Newton with line search (gradient < 1e-3 after 15 iterations) vs Gauss-Newton with line search (gradient < 1e-3 after 14 iterations) on the Rosenbrock function.]

    Newton:
    • requires computing the Hessian
    • exact solution if quadratic

    Gauss-Newton:
    • approximates the Hessian by a product of gradients of the residuals
    • requires only first derivatives

    Summary of minimization methods

    Update x_{n+1} = x_n + δx, where δx solves:

    1. Newton:            H δx = −g

    2. Gauss-Newton:      H_GN δx = −g

    3. Gradient descent:  λ δx = −g

  • Levenberg-Marquardt algorithm

    [Figure: 1D sketch comparing a gradient-descent step and a Newton step in a region where the quadratic model is poor.]

    • Away from the minimum, in regions of negative curvature, the Gauss-Newton approximation is not very good.

    • In such regions, a simple steepest-descent step is probably the best plan.

    • The Levenberg-Marquardt method is a mechanism for varying between steepest-descent and Gauss-Newton steps depending on how good the H_GN approximation is locally.

    • The method uses the modified Hessian

          H(x, λ) = H_GN + λI

    • When λ is small, H approximates the Gauss-Newton Hessian.

    • When λ is large, the λI term dominates and δx ≈ −g/λ, so (short) steepest-descent steps are taken.

  • LM Algorithm

        H(x, λ) = H_GN(x) + λI

    1. Set λ = 0.001 (say)

    2. Solve δx = −H(x, λ)⁻¹ g

    3. If f(x_n + δx) > f(x_n), increase λ (×10 say) and go to 2.

    4. Otherwise, decrease λ (×0.1 say), let x_{n+1} = x_n + δx, and go to 2.

    Note: this algorithm does not require explicit line searches.
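    The four steps above translate almost line-for-line into code. A sketch for the least-squares case, where g = 2Jᵀr and H_GN = 2JᵀJ (the stopping rule on ‖g‖ and the Rosenbrock residuals are my own additions for the demonstration):

    ```python
    import numpy as np

    def levenberg_marquardt(residuals, jacobian, p, iters=200):
        lam = 0.001                                   # step 1
        r = residuals(p)
        fp = r @ r                                    # current cost f(x_n)
        for _ in range(iters):
            J = jacobian(p)
            g = 2 * J.T @ r                           # gradient
            if np.linalg.norm(g) < 1e-10:
                break
            H = 2 * J.T @ J + lam * np.eye(len(p))    # H_GN + lam * I
            dx = np.linalg.solve(H, -g)               # step 2
            r_new = residuals(p + dx)
            f_new = r_new @ r_new
            if f_new > fp:
                lam *= 10                             # step 3: reject the step
            else:
                lam *= 0.1                            # step 4: accept the step
                p, r, fp = p + dx, r_new, f_new
        return p

    # Rosenbrock as a sum of squares: r1 = 10(y - x^2), r2 = 1 - x
    def rosen_residuals(p):
        x, y = p
        return np.array([10 * (y - x**2), 1 - x])

    def rosen_jacobian(p):
        x, y = p
        return np.array([[-20 * x, 10.0],
                         [-1.0, 0.0]])
    ```

    A rejected step leaves p unchanged and simply retries with a larger λ; this retry loop is what plays the role of the line search.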

    Example

    [Figure: Levenberg-Marquardt method on the Rosenbrock function; gradient < 1e-3 after 31 iterations.]

    • Minimization using Levenberg-Marquardt (no line search) takes 31 iterations.

    Matlab: lsqnonlin

  • Comparison

    [Figure: Levenberg-Marquardt (gradient < 1e-3 after 31 iterations) vs Gauss-Newton with line search (gradient < 1e-3 after 14 iterations) on the Rosenbrock function.]

    Levenberg-Marquardt vs Gauss-Newton:
    • more iterations than Gauss-Newton, but
    • no line search required,
    • and more frequently converges

    Case study – Bundle Adjustment (non-examinable)

    Notation:

    • A 3D point X_j is imaged in the i-th view as

          x_ij = Pⁱ X_j

      where Pⁱ is a 3 × 4 matrix, X_j a 4-vector, and x_ij a 3-vector.

    Problem statement:

    • Given: n matching image points x_ij over m views

    • Find: the cameras Pⁱ and the 3D points X_j such that x_ij = Pⁱ X_j

  • Number of parameters

    • for each camera there are 6 parameters

    • for each 3D point there are 3 parameters

    A total of 6m + 3n parameters must be estimated, e.g. 50 frames, 1000 points: 3300 unknowns.

    For efficient optimization:

    • exploit the Gauss-Newton approximation (squared-residual cost function)

    • and additional sparsity, which means the Hessian has blocks of zeros

    Example: [Figure: image sequence; recovered cameras and points.]

  • Application: Augmented reality

    [Figure: original sequence and its augmentation.]