

  • C25 Optimization
    8 Lectures, Hilary Term 2013, 2 Tutorial Sheets
    Prof. A. Zisserman

    Overview
    • Lecture 1 (AZ): Review of B1 unconstrained multivariate optimization, Levenberg-Marquardt algorithm.
    • Lecture 2 (AZ): Discrete optimization, dynamic programming.
    • Lectures 3-6 (BK): Classical optimization. Constrained optimization – equality constraints, Lagrange multipliers, inequality constraints. Nonlinear programming – search methods, approximation methods, axial iteration, pattern search, descent methods, quasi-Newton methods. Least squares and singular values.
    • Lecture 7 (AZ): Discrete optimization on graphs
    • Lecture 8 (AZ): Graph-cut, max-flow, and applications

    Background reading and web resources
    • Numerical Recipes in C (or C++): The Art of Scientific Computing
      William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling, CUP 1992/2002
      Good chapter on optimization; available online at http://www.nrbook.com/a/bookcpdf.php

  • Background reading and web resources

    • Convex Optimization
      Stephen Boyd and Lieven Vandenberghe, CUP 2004
      Available online at http://www.stanford.edu/~boyd/cvxbook/

    • Further reading, web resources, and the lecture notes are on http://www.robots.ox.ac.uk/~az/lectures/opt

    The Optimization Landscape

    • continuous optimization – variables can take any values (subject to constraints), e.g. x > 0

    • discrete optimization – variables are restricted to a set of values (often finite), e.g. binary, {0, 1, …, 255}

    We will see that, for cost functions with a particular structure, there are very efficient discrete optimization algorithms

  • Example: image processing – noise removal

    Measured noisy data

        z_i = x_i + w_i,   where x_i ∈ {0, 1} and w_i ∼ N(0, σ²)

    [Figure: a noisy 1D binary signal; true values x_i on the levels 0 and 1, measurements z_i scattered about them.]

    Estimate the true (noise-free) signal by solving an optimization problem:

        min_x f(x) = Σ_i (z_i − x_i)² + d φ(x_i, x_{i−1})

    where

        φ(x_i, x_j) = 0 if x_i = x_j,  1 if x_i ≠ x_j

    • n is the number of pixels in the image, e.g. n = 20k
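    For the 1D chain above, the discrete problem can be minimized exactly by dynamic programming (the topic of Lecture 2): at each sample only the best cost for each label of the previous sample matters. A minimal Python sketch (my own illustration, assuming the model exactly as written: binary labels, squared data term, pairwise weight d):

    ```python
    def denoise_1d(z, d):
        """Exactly minimize f(x) = sum_i (z_i - x_i)^2 + d * sum_i [x_i != x_{i-1}]
        over binary signals x, by dynamic programming along the chain."""
        labels = (0, 1)
        # cost[s]: best cost of a prefix whose last sample is labelled s
        cost = {s: (z[0] - s) ** 2 for s in labels}
        back = []  # back-pointers for recovering the minimizing signal
        for zi in z[1:]:
            new_cost, ptr = {}, {}
            for s in labels:
                best_t = min(labels, key=lambda t: cost[t] + d * (t != s))
                new_cost[s] = (zi - s) ** 2 + cost[best_t] + d * (best_t != s)
                ptr[s] = best_t
            cost = new_cost
            back.append(ptr)
        # backtrack from the best final label
        s = min(labels, key=lambda t: cost[t])
        x = [s]
        for ptr in reversed(back):
            s = ptr[s]
            x.append(s)
        return x[::-1], min(cost.values())
    ```

    The run time is linear in n, in contrast to the O(2ⁿ) of exhaustive search; the 2D version has no such chain ordering, which is why graph-cut methods are needed there.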

    Now in 2D for an image, where N(i) is the neighbourhood of node i (the adjacent pixels j):

        f(x) = Σ_{i=1}^{n} (z_i − x_i)² + d Σ_{j∈N(i)} φ(x_i, x_j)

    [Figure: “original” and “original plus noise” binary images, approximately 60 × 220 pixels.]

    Task of minimization:

    • continuous optimization over x ∈ IRⁿ: many local minima

    • discrete optimization, restricting each x_i to 0 or 1: a search space of size O(2ⁿ)

    [Figure: results of the max-flow algorithm on the noisy image (n ∼ 20K): “original”, “original plus noise”, and the minimizing x for d = 10 and for d = 60.]

  • Lecture outline

    B1 revision – unconstrained continuous optimization:
    • Convexity
    • Iterative optimization algorithms
    • Gradient descent
    • Newton’s method
    • Gauss-Newton method

    New topics:
    • Axial iteration
    • Levenberg-Marquardt algorithm
    • Application

    Introduction: problem specification

    Suppose we have a cost function (or objective function)

        f(x) : IRⁿ → IR

    Our aim is to find the value of the parameters x that minimizes this function,

        x* = argmin_x f(x)

    subject to the following constraints:

    • equality:    c_i(x) = 0,   i = 1, …, m_e

    • inequality:  c_i(x) ≥ 0,   i = m_e + 1, …, m

    We will start by focussing on unconstrained problems.

  • Unconstrained optimization

        min_x f(x)

    [Figure: a function f(x) of one variable with a local minimum and a global minimum.]

    • down-hill search (gradient descent) algorithms can find local minima

    • which of the minima is found depends on the starting point

    • such minima often occur in real applications

    Reminder: convexity

  • Class of functions …

    [Figure: 1D examples – “single extremum – convex”; “single extremum – non-convex”; “multiple extrema – non-convex”; “noisy”; “horrible” (all but the first are not convex).]

    • Convexity provides a test for a single extremum

    • A non-negative sum of convex functions is convex

  • Optimization algorithm – key ideas

    • Find δx such that f(x + δx) < f(x)

    • This leads to an iterative update x_{n+1} = x_n + δx

    • Reduce the problem to a series of 1D line searches: δx = αp

    Choosing the direction 1: axial iteration

    Alternate minimization over x and y.

    [Figure: contour plot of a 2D quadratic; the axial-iteration path zig-zags along the coordinate directions.]
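    For a quadratic cost the 1D minimizations have a closed form, so axial iteration is only a few lines. A sketch (my own illustration, not from the notes) for f(x) = ½xᵀAx − bᵀx, where exact minimization along coordinate i sets ∂f/∂x_i = 0:

    ```python
    import numpy as np

    def axial_iteration(A, b, x, sweeps=100):
        """Minimize f(x) = 0.5 x^T A x - b^T x (A symmetric positive definite)
        by exact line minimization along each coordinate axis in turn."""
        for _ in range(sweeps):
            for i in range(len(b)):
                # df/dx_i = (A x)_i - b_i = 0, solved for x_i with the rest fixed;
                # A[i] @ x still contains the old x_i, so add it back first
                x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
        return x
    ```

    Each pair of inner steps is one “corner” of the staircase path; for this quadratic case the scheme coincides with the Gauss-Seidel iteration for Ax = b.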

  • Choosing the direction 2: steepest descent

    Move in the direction of the negative gradient −∇f(x_n).

    [Figure: contour plots of a 2D quadratic with the steepest-descent path.]

    • The gradient is everywhere perpendicular to the contour lines.

    • After each line minimization the new gradient is always orthogonal to the previous step direction (true of any line minimization).

    • Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.

  • A harder case: Rosenbrock’s function

        f(x, y) = 100 (y − x²)² + (1 − x)²

    Minimum is at [1, 1].

    [Figure: contours of the Rosenbrock function with the steepest-descent path, and a zoomed view near (−0.9, 0.75).]

    Steepest descent on the Rosenbrock function:

    • The zig-zag behaviour is clear in the zoomed view (100 iterations)

    • The algorithm crawls down the valley
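    The crawling is easy to reproduce. A sketch of steepest descent on the Rosenbrock function, with a crude backtracking step in place of the exact line minimization used for the figures (so iteration counts will differ from the slides):

    ```python
    import numpy as np

    def rosenbrock(p):
        x, y = p
        return 100 * (y - x**2) ** 2 + (1 - x) ** 2

    def rosenbrock_grad(p):
        x, y = p
        return np.array([-400 * x * (y - x**2) - 2 * (1 - x),
                         200 * (y - x**2)])

    def steepest_descent(p, iters=5000):
        for _ in range(iters):
            d = -rosenbrock_grad(p)              # steepest-descent direction
            a = 1.0
            while rosenbrock(p + a * d) >= rosenbrock(p):
                a *= 0.5                         # backtrack until f decreases
                if a < 1e-12:                    # numerically stationary
                    return p
            p = p + a * d
        return p
    ```

    Starting from (−1.2, 1), thousands of iterations still leave the iterate crawling along the curved valley towards the minimum at (1, 1).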

  • Choosing the direction 3: conjugate gradients

    Again, uses first derivatives only, but avoids “undoing” previous work.

    The method of conjugate gradients chooses successive descent directions p_n such that it is guaranteed to reach the minimum in a finite number of steps.

    • Each p_n is chosen to be conjugate to all previous search directions with respect to the Hessian H:

          p_nᵀ H p_j = 0,   0 ≤ j < n

    • The resulting search directions are mutually linearly independent.

    • Remarkably, p_n can be chosen using only knowledge of p_{n−1}, ∇f(x_{n−1}) and ∇f(x_n) (see Numerical Recipes):

          p_n = −∇f_n + ( (∇f_nᵀ ∇f_n) / (∇f_{n−1}ᵀ ∇f_{n−1}) ) p_{n−1}

    • An N-dimensional quadratic form can be minimized in at most N conjugate descent steps.

    • [Figure: 3 different starting points; the minimum is reached in exactly 2 steps.]
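    For the quadratic case f(x) = ½xᵀAx − bᵀx (so H = A and −∇f = b − Ax), the recurrence above becomes the classical linear conjugate-gradient method, and the finite-termination property can be checked directly. A sketch, using the same ∇fᵀ∇f ratio for the direction update:

    ```python
    import numpy as np

    def conjugate_gradient(A, b, x):
        """Minimize 0.5 x^T A x - b^T x for symmetric positive definite A.
        In exact arithmetic the minimum is reached in at most n steps."""
        r = b - A @ x                        # residual = negative gradient
        p = r.copy()                         # first direction: steepest descent
        for _ in range(len(b)):
            Ap = A @ p
            alpha = (r @ r) / (p @ Ap)       # exact line minimization along p
            x = x + alpha * p
            r_new = r - alpha * Ap
            if np.linalg.norm(r_new) < 1e-12:
                break
            beta = (r_new @ r_new) / (r @ r) # the Fletcher-Reeves ratio
            p = r_new + beta * p             # conjugate to previous directions
            r = r_new
        return x
    ```

    For a 2 × 2 system the loop terminates after at most 2 steps, matching the figure.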

  • Choosing the direction 4: Newton’s method

    Start from the Taylor expansion in 2D. A function may be approximated locally by its Taylor series expansion about a point x₀:

        f(x + δx) ≈ f(x) + (∂f/∂x, ∂f/∂y) (δx, δy)ᵀ
                    + ½ (δx, δy) [ ∂²f/∂x²    ∂²f/∂x∂y ]
                                 [ ∂²f/∂x∂y   ∂²f/∂y²  ] (δx, δy)ᵀ

    The expansion to second order is a quadratic function:

        f(x + δx) = a + gᵀδx + ½ δxᵀ H δx

    Now minimize this expansion over δx. For a minimum we require that ∇f(x + δx) = 0, and so

        ∇f(x + δx) = g + H δx = 0,   with solution δx = −H⁻¹g   (Matlab: δx = −H\g)

    This gives the iterative update

        x_{n+1} = x_n − H_n⁻¹ g_n

    [Figure: contour plot; the Newton step jumps to the minimum of the local quadratic approximation.]

    • If f(x) is quadratic, then the solution is found in one step.

    • The method has quadratic convergence (as in the 1D case).

    • The step δx = −H_n⁻¹ g_n is guaranteed to be a downhill direction provided that H is positive definite.

    • Rather than jump straight to the predicted solution at x_n − H_n⁻¹ g_n, it is better to perform a line search:

          x_{n+1} = x_n − α_n H_n⁻¹ g_n

    • If H = I, this reduces to steepest descent.

    Newton’s method – example

    [Figure: Newton’s method with line search on the Rosenbrock function; gradient < 1e-3 after 15 iterations. Ellipses show successive quadratic approximations.]

    • The algorithm converges in only 15 iterations – far superior to steepest descent.

    • However, the method requires computing the Hessian matrix at each iteration – this is not always feasible.
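    For the Rosenbrock function the Hessian is available in closed form, so Newton’s method with a line search is short to write out. A sketch (with simple backtracking instead of an exact line search, and a gradient fallback for the case where H is not positive definite):

    ```python
    import numpy as np

    def f(p):
        x, y = p
        return 100 * (y - x**2) ** 2 + (1 - x) ** 2

    def grad(p):
        x, y = p
        return np.array([-400 * x * (y - x**2) - 2 * (1 - x),
                         200 * (y - x**2)])

    def hess(p):
        x, y = p
        return np.array([[1200 * x**2 - 400 * y + 2, -400 * x],
                         [-400 * x, 200.0]])

    def newton(p, iters=50):
        for _ in range(iters):
            g = grad(p)
            if np.linalg.norm(g) < 1e-10:
                break
            dx = np.linalg.solve(hess(p), -g)  # dx = -H^{-1} g
            if g @ dx > 0:                     # not downhill: H not pos. definite
                dx = -g
            a = 1.0
            while f(p + a * dx) > f(p) and a > 1e-10:
                a *= 0.5                       # backtracking line search
            p = p + a * dx
        return p
    ```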

  • Performance issues for optimization algorithms

    1. Number of iterations required

    2. Cost per iteration

    3. Memory footprint

    4. Region of convergence

    Special structure for cost function – non-linear least squares

    • It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals:

          f(x) = Σ_{i=1}^{M} r_i(x)²

    • If each residual r_i(x) depends non-linearly on the parameters x, then the minimization of f(x) is a non-linear least squares problem.

    • We also assume that the residuals r_i are: (i) small at the optimum, and (ii) zero-mean.

  • Non-linear least squares

        f(x) = Σ_{i=1}^{M} r_i²

    Gradient:

        ∇f(x) = 2 Σ_i r_i(x) ∇r_i(x)

    Hessian:

        H = ∇∇ᵀf(x) = 2 Σ_i ∇( r_i(x) ∇ᵀr_i(x) )
          = 2 Σ_i ( ∇r_i(x) ∇ᵀr_i(x) + r_i(x) ∇∇ᵀr_i(x) )

    which is approximated as

        H_GN = 2 Σ_i ∇r_i(x) ∇ᵀr_i(x)

    This is the Gauss-Newton approximation.

    [Figure: Gauss-Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 14 iterations.]

    • Minimization with the Gauss-Newton approximation and line search takes only 14 iterations:

          x_{n+1} = x_n − α_n H_n⁻¹ g_n   with   H_n = H_GN(x_n)
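    The Rosenbrock function is itself a sum of two squared residuals, r₁ = 10(y − x²) and r₂ = 1 − x, so Gauss-Newton applies directly. A sketch taking full steps with no line search (unlike the figures, so the iteration count differs); with residual vector r and Jacobian J, the step solves H_GN δx = −g, i.e. JᵀJ δx = −Jᵀr, the factors of 2 cancelling:

    ```python
    import numpy as np

    def residuals(p):
        x, y = p
        return np.array([10 * (y - x**2), 1 - x])

    def jacobian(p):                 # J[i, k] = d r_i / d p_k
        x, y = p
        return np.array([[-20 * x, 10.0],
                         [-1.0, 0.0]])

    def gauss_newton(p, iters=20):
        for _ in range(iters):
            r, J = residuals(p), jacobian(p)
            dx = np.linalg.solve(J.T @ J, -J.T @ r)  # Gauss-Newton step
            p = p + dx
            if np.linalg.norm(dx) < 1e-12:
                break
        return p
    ```

    From (−1.2, 1) this converges in a handful of steps, but in general full Gauss-Newton steps can overshoot and increase f, which motivates Levenberg-Marquardt.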

  • Comparison

    [Figure: Newton with line search (gradient < 1e-3 after 15 iterations) vs Gauss-Newton with line search (gradient < 1e-3 after 14 iterations) on the Rosenbrock function.]

    Newton:
    • requires computing the Hessian
    • exact solution if quadratic

    Gauss-Newton:
    • approximates the Hessian by a product of gradients of the residuals
    • requires only first derivatives

    Summary of minimization methods

    Update x_{n+1} = x_n + δx, where δx solves:

    1. Newton:            H δx = −g

    2. Gauss-Newton:      H_GN δx = −g

    3. Gradient descent:  λ δx = −g

  • Levenberg-Marquardt algorithm

    [Figure: 1D sketch comparing a gradient-descent step and a Newton step in a region where the quadratic model is poor.]

    • Away from the minimum, in regions of negative curvature, the Gauss-Newton approximation is not very good.

    • In such regions, a simple steepest-descent step is probably the best plan.

    • The Levenberg-Marquardt method is a mechanism for varying between steepest-descent and Gauss-Newton steps depending on how good the H_GN approximation is locally.

    • The method uses the modified Hessian

          H(x, λ) = H_GN + λI

    • When λ is small, H approximates the Gauss-Newton Hessian.

    • When λ is large, the λI term dominates and δx ≈ −g/λ, so (short) steepest-descent steps are taken.

  • LM Algorithm

        H(x, λ) = H_GN(x) + λI

    1. Set λ = 0.001 (say)

    2. Solve δx = −H(x, λ)⁻¹ g

    3. If f(x_n + δx) > f(x_n), increase λ (×10 say) and go to 2.

    4. Otherwise, decrease λ (×0.1 say), let x_{n+1} = x_n + δx, and go to 2.

    Note: this algorithm does not require explicit line searches.
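    The four steps above translate almost line-for-line into code. A sketch for the least-squares case, where g = 2Jᵀr and H_GN = 2JᵀJ (the stopping rule on ‖g‖ and the Rosenbrock residuals are my own additions for the demonstration):

    ```python
    import numpy as np

    def levenberg_marquardt(residuals, jacobian, p, iters=200):
        lam = 0.001                                   # step 1
        r = residuals(p)
        fp = r @ r                                    # current cost f(x_n)
        for _ in range(iters):
            J = jacobian(p)
            g = 2 * J.T @ r                           # gradient
            if np.linalg.norm(g) < 1e-10:
                break
            H = 2 * J.T @ J + lam * np.eye(len(p))    # H_GN + lam * I
            dx = np.linalg.solve(H, -g)               # step 2
            r_new = residuals(p + dx)
            f_new = r_new @ r_new
            if f_new > fp:
                lam *= 10                             # step 3: reject the step
            else:
                lam *= 0.1                            # step 4: accept the step
                p, r, fp = p + dx, r_new, f_new
        return p

    # Rosenbrock as a sum of squares: r1 = 10(y - x^2), r2 = 1 - x
    def rosen_residuals(p):
        x, y = p
        return np.array([10 * (y - x**2), 1 - x])

    def rosen_jacobian(p):
        x, y = p
        return np.array([[-20 * x, 10.0],
                         [-1.0, 0.0]])
    ```

    A rejected step leaves p unchanged and simply retries with a larger λ; this retry loop is what plays the role of the line search.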

    Example

    [Figure: Levenberg-Marquardt method on the Rosenbrock function; gradient < 1e-3 after 31 iterations.]

    • Minimization using Levenberg-Marquardt (no line search) takes 31 iterations.

    Matlab: lsqnonlin

  • Comparison

    [Figure: Levenberg-Marquardt (gradient < 1e-3 after 31 iterations) vs Gauss-Newton with line search (gradient < 1e-3 after 14 iterations) on the Rosenbrock function.]

    Levenberg-Marquardt vs Gauss-Newton:
    • more iterations than Gauss-Newton, but
    • no line search required,
    • and more frequently converges

    Case study – Bundle Adjustment (non-examinable)

    Notation:

    • A 3D point X_j is imaged in the i-th view as

          x_ij = Pⁱ X_j

      where Pⁱ is a 3 × 4 matrix, X_j a 4-vector, and x_ij a 3-vector.

    Problem statement:

    • Given: n matching image points x_ij over m views

    • Find: the cameras Pⁱ and the 3D points X_j such that x_ij = Pⁱ X_j

  • Number of parameters

    • for each camera there are 6 parameters

    • for each 3D point there are 3 parameters

    A total of 6m + 3n parameters must be estimated, e.g. 50 frames, 1000 points: 3300 unknowns.

    For efficient optimization:

    • exploit the Gauss-Newton approximation (squared-residual cost function)

    • and additional sparsity, which means the Hessian has blocks of zeros

    Example: [Figure: image sequence; recovered cameras and points.]

  • Application: Augmented reality

    [Figure: original sequence and its augmentation.]