
Page 1: Lecture 5

COMPUTER VISION: LEAST SQUARES MINIMIZATION

IIT Kharagpur

Computer Science and Engineering, Indian Institute of Technology Kharagpur.


Page 2: Lecture 5

Solution of Linear Equations

Consider a system of equations of the form Ax = b, where A is an m × n matrix.

If m < n there are more unknowns than equations. In this case there will not be a unique solution, but rather a vector space of solutions.
If m = n there will be a unique solution as long as A is invertible.
If m > n there are more equations than unknowns. In general the system will not have a solution.


Page 3: Lecture 5

Least-squares solution: the full-rank case

Consider the case m > n and assume that A has rank n. We seek a vector x that is closest to providing a solution to the system Ax = b.

We seek x such that ||Ax − b|| is minimized; such an x is known as the least-squares solution to the over-determined system.
Let A = U D V^T be the SVD of A. We then seek x that minimizes ||Ax − b|| = ||U D V^T x − b||.
Because of the norm-preserving property of orthogonal transforms,

||U D V^T x − b|| = ||D V^T x − U^T b||

Writing y = V^T x and b′ = U^T b, the problem becomes one of minimizing ||D y − b′||, where D is a diagonal matrix.


Page 4: Lecture 5

Written out, the system D y = b′ is

    | d_1             |            | b′_1   |
    |     d_2         |  | y_1 |   | b′_2   |
    |         ...     |  | y_2 |   |  ...   |
    |             d_n |  | ... | = | b′_n   |
    |                 |  | y_n |   | b′_n+1 |
    |        0        |            |  ...   |
    |                 |            | b′_m   |

The nearest D y can approach to b′ is the vector (b′_1, b′_2, ..., b′_n, 0, ..., 0)^T.
This is achieved by setting y_i = b′_i / d_i for i = 1, ..., n.
The assumption rank A = n ensures that d_i ≠ 0.
Finally, x is retrieved from x = V y.


Page 5: Lecture 5

Algorithm Least Squares

Objective:

Find the least-squares solution to the m × n set of equations Ax = b, where m > n and rank A = n.

Algorithm:

(i) Compute the SVD A = U D V^T.
(ii) Set b′ = U^T b.
(iii) Find the vector y defined by y_i = b′_i / d_i, where d_i is the i-th diagonal entry of D.
(iv) The solution is x = V y.
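A minimal sketch of this algorithm in NumPy; the function name lstsq_svd and the example system are illustrative, not from the slides:

```python
import numpy as np

def lstsq_svd(A, b):
    """Least-squares solution of Ax = b for full-rank A with m > n, via the SVD."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(d) V^T
    b_prime = U.T @ b        # step (ii): b' = U^T b
    y = b_prime / d          # step (iii): y_i = b'_i / d_i (rank n ensures d_i != 0)
    return Vt.T @ y          # step (iv): x = V y

# Example: over-determined 4 x 2 system
A = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
b = np.array([0.1, 0.9, 2.1, 2.9])
x = lstsq_svd(A, b)          # agrees with np.linalg.lstsq(A, b, rcond=None)[0]
```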


Page 6: Lecture 5

Pseudo Inverse

Given a square diagonal matrix D, we define its pseudo-inverse to be the diagonal matrix D^+ such that

D^+_ii = 0          if D_ii = 0
D^+_ii = D_ii^{-1}  otherwise

For an m × n matrix A with m ≥ n, let the SVD be A = U D V^T. The pseudo-inverse of A is

A^+ = V D^+ U^T

The least-squares solution to an m × n system of equations Ax = b of rank n is given by x = A^+ b. In the case of a rank-deficient system, x = A^+ b is the solution that minimizes ||x||.
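A sketch of the pseudo-inverse construction in NumPy, assuming a small tolerance tol for treating singular values as zero (the helper name and data are illustrative; np.linalg.pinv performs the same computation):

```python
import numpy as np

def pinv_svd(A, tol=1e-12):
    """Pseudo-inverse A+ = V D+ U^T, leaving (near-)zero singular values at zero."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    d_plus = np.array([1.0 / s if s > tol else 0.0 for s in d])  # diagonal of D+
    return Vt.T @ np.diag(d_plus) @ U.T

A = np.array([[1., 0.], [1., 1.], [1., 2.]])
b = np.array([1., 2., 2.])
x = pinv_svd(A) @ b          # least-squares solution; matches np.linalg.pinv(A) @ b
```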


Page 7: Lecture 5

Linear least-squares using normal equations

Consider a system of equations of the form Ax = b, where A is an m × n matrix with m > n.

In general, no solution x will exist for this set of equations. Consequently, the task is to find the vector x that minimizes the norm ||Ax − b||.
As the vector x varies over all values, the product Ax varies over the complete column space of A, i.e. the subspace of R^m spanned by the columns of A.
The task is to find the closest vector to b that lies in the column space of A.


Page 8: Lecture 5

Linear least-squares using normal equations

Let x be the solution to this problem, so that Ax is the closest point to b. In this case, the difference Ax − b must be orthogonal to the column space of A.
This means that Ax − b is perpendicular to each of the columns of A, hence

A^T (Ax − b) = 0, i.e. (A^T A) x = A^T b

The solution is given as:

x = (A^T A)^{-1} A^T b

x = A^+ b, where A^+ = (A^T A)^{-1} A^T

The pseudo-inverse of A computed via the SVD is

A^+ = V D^+ U^T
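For illustration, a quick check (with made-up data) that the normal-equations solution agrees with the pseudo-inverse solution; the SVD route is generally better conditioned than forming A^T A explicitly:

```python
import numpy as np

A = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
b = np.array([0.0, 1.1, 1.9, 3.2])

# Normal equations: (A^T A) x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Pseudo-inverse route: x = A+ b, with A+ = V D+ U^T computed internally
x_pinv = np.linalg.pinv(A) @ b

assert np.allclose(x_normal, x_pinv)
```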


Page 9: Lecture 5

Least-squares solution of homogeneous equations

Consider solving a set of equations of the form Ax = 0.

x has a homogeneous representation, so if x is a solution then kx is also a solution for any scalar k.
A reasonable constraint is therefore to seek a solution with ||x|| = 1.
In general, such a set of equations will not have an exact solution.
The problem is to find the x that minimizes ||Ax|| subject to ||x|| = 1.


Page 10: Lecture 5

Least-squares solution of homogeneous equations

Let A = U D V^T.
We need to minimize ||U D V^T x||. Note that ||U D V^T x|| = ||D V^T x||, so we need to minimize ||D V^T x||.
Note also that ||x|| = ||V^T x||, so the condition becomes ||V^T x|| = 1.
Let y = V^T x; we then minimize ||D y|| subject to ||y|| = 1.
Since D is a diagonal matrix with its diagonal entries in descending order, the solution is

y = (0, 0, ..., 0, 1)^T

Since y = V^T x, the solution x = V y is simply the last column of V.
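A sketch of this in NumPy; the random test matrix is illustrative:

```python
import numpy as np

def homogeneous_lstsq(A):
    """Minimize ||Ax|| subject to ||x|| = 1: the last right singular vector of A."""
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                # last row of V^T is the last column of V

A = np.random.randn(10, 4)       # over-determined homogeneous system Ax = 0
x = homogeneous_lstsq(A)
# ||x|| == 1 and ||Ax|| is minimal over the unit sphere
```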


Page 11: Lecture 5

Iterative estimation techniques

X = f(P)

X is a measurement vector in R^N.
P is a parameter vector in R^M.

We seek the vector P̂ satisfying

X = f(P̂) − ε

for which ||ε|| is minimized.

The linear least-squares problem is exactly of this type, with the function f defined as the linear function f(P) = AP.


Page 12: Lecture 5

Iterative estimation methods

If the function f is not a linear function, we use iterative estimation techniques.


Page 13: Lecture 5

Iterative estimation methods

We start with an initial estimated value P0 and proceed to refine the estimate under the assumption that the function f is locally linear.

Let ε0 = f(P0) − X

We assume that the function is approximated at P0 by

f(P0 + ∆) = f(P0) + J∆

J is the linear mapping represented by the Jacobian matrix

J = ∂f/∂P


Page 14: Lecture 5

Iterative estimation methods

We seek a point f(P1), with P1 = P0 + ∆, which minimizes

f(P1) − X = f(P0) + J∆ − X = ε0 + J∆

Thus it is required to minimize ||ε0 + J∆|| over ∆, which is a linear minimization problem. The vector ∆ is obtained by solving the normal equations

J^T J ∆ = −J^T ε0, i.e. ∆ = −J^+ ε0


Page 15: Lecture 5

Iterative estimation methods

The solution vector P̂ is obtained by starting with an estimate P0 and computing successive approximations according to the formula

Pi+1 = Pi + ∆i

where ∆i is the solution to the linear least-squares problem

J∆ = −εi

Matrix J is the Jacobian ∂f/∂P evaluated at Pi, and εi = f(Pi) − X.

Ideally the iteration converges to a least-squares solution P̂; however, convergence may be to a local minimum only, or there may be no convergence at all.
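A minimal sketch of this iteration in NumPy; the finite-difference Jacobian helper, the function names, and the stopping rule are illustrative assumptions rather than part of the slides:

```python
import numpy as np

def numerical_jacobian(f, P, h=1e-6):
    """Finite-difference approximation of J = df/dP at P (illustrative helper)."""
    f0 = f(P)
    J = np.zeros((f0.size, P.size))
    for j in range(P.size):
        dP = np.zeros_like(P)
        dP[j] = h
        J[:, j] = (f(P + dP) - f0) / h
    return J

def iterative_lstsq(f, X, P0, n_iters=20):
    """P_{i+1} = P_i + Delta_i, with J Delta_i = -eps_i solved in the least-squares sense."""
    P = P0.astype(float)
    for _ in range(n_iters):
        eps = f(P) - X                                     # eps_i = f(P_i) - X
        J = numerical_jacobian(f, P)                       # Jacobian at P_i
        delta = np.linalg.lstsq(J, -eps, rcond=None)[0]    # Delta_i = -J+ eps_i
        P = P + delta
        if np.linalg.norm(delta) < 1e-10:                  # stop when the step is negligible
            break
    return P
```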


Page 16: Lecture 5

Newton’s method

We consider finding minima of functions of many variables. Consider an arbitrary scalar-valued function g(P), where P is a vector. The optimization problem is simply to minimize g(P) over all values of P.
Expand g(P) about P0 in a Taylor series to get

g(P0 + ∆) = g + g_P ∆ + ∆^T g_PP ∆ / 2 + ...

where g_P denotes the derivative of g(P) with respect to P, and g_PP denotes the derivative of g_P with respect to P.


Page 17: Lecture 5

Newton’s method

Expand g(P) about P0 in a Taylor series to get

g(P0 + ∆) = g + g_P ∆ + ∆^T g_PP ∆ / 2 + ...

Differentiating with respect to ∆ and setting the derivative to zero gives

g_P + g_PP ∆ = 0, i.e. g_PP ∆ = −g_P

Hessian matrix: g_PP is the matrix of second derivatives, the Hessian of g; its (i, j)-th entry is ∂²g/∂p_i∂p_j, where p_i and p_j are the i-th and j-th parameters. The vector g_P is the gradient of g.

Newton iteration consists of starting with an initial value of the parameters, P0, and iteratively computing parameter increments ∆ until convergence occurs.
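A sketch of Newton iteration for a scalar-valued g(P), assuming the caller supplies the gradient and Hessian; the function names and the quadratic example are illustrative:

```python
import numpy as np

def newton_minimize(grad, hess, P0, n_iters=50, tol=1e-10):
    """Iterate g_PP @ delta = -g_P until the step becomes negligible."""
    P = P0.astype(float)
    for _ in range(n_iters):
        delta = np.linalg.solve(hess(P), -grad(P))   # solve g_PP delta = -g_P
        P = P + delta
        if np.linalg.norm(delta) < tol:
            break
    return P

# Example: minimize g(P) = (P[0] - 1)^2 + 10 * (P[1] + 2)^2
grad = lambda P: np.array([2 * (P[0] - 1), 20 * (P[1] + 2)])
hess = lambda P: np.diag([2.0, 20.0])
P_hat = newton_minimize(grad, hess, np.zeros(2))     # converges to [1, -2]
```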


Page 18: Lecture 5

Gauss-Newton Method

Consider the special case in which g(P) is half the squared norm of an error function:

g(P) = (1/2) ||ε(P)||² = ε(P)^T ε(P) / 2, where ε(P) = f(P) − X

ε(P) is the error function; it is a vector-valued function of the parameter P.

The gradient is

g_P = ∂g(P)/∂P = ε_P^T ε, where ε_P = ∂ε(P)/∂P = ∂f(P)/∂P ≡ f_P

We know that f_P = J, therefore ε_P = J, and hence g_P = J^T ε.


Page 19: Lecture 5

Gauss-Newton Method

Consider the second derivative g_PP.

g_P = ε_P^T ε, therefore g_PP = ε_P^T ε_P + ε_PP^T ε

Since ε_P = f_P, and assuming that f(P) is locally linear, ε_PP vanishes, so

g_PP = ε_P^T ε_P = J^T J

This gives an approximation of the second derivative g_PP. Using Newton’s equation

g_PP ∆ = −g_P we get J^T J ∆ = −J^T ε

This is the Gauss-Newton method, in which the Hessian of g(P) is approximated by g_PP ≈ J^T J.


Page 20: Lecture 5

Gradient Descent

The gradient of g(P) is g_P = ε_P^T ε. The negative gradient vector −g_P = −ε_P^T ε defines the direction of most rapid decrease of the cost function.
Gradient descent is a strategy for minimizing g in which we move iteratively in the negative gradient direction, taking small steps in the direction of descent:

∆ = −g_P / λ, where λ controls the length of the step

Recall that in Newton’s method the step is given by

∆ = −g_PP^{-1} g_P

so gradient descent amounts to approximating the Hessian by the scalar matrix λI.


Page 21: Lecture 5

Gradient Descent

Gradient descent by itself is not a very good minimization strategy; it is typically characterized by slow convergence due to zig-zagging.
However, gradient descent can be quite useful in conjunction with Gauss-Newton iteration as a way of getting out of tight corners.
The Levenberg-Marquardt method is essentially a Gauss-Newton method that transitions smoothly to gradient descent when the Gauss-Newton updates fail.


Page 22: Lecture 5

Summary

g(P) is an arbitrary scalar-valued function, g(P) = ε(P)^T ε(P) / 2.

Newton’s Method: g_PP ∆ = −g_P, where g_PP = ε_P^T ε_P + ε_PP^T ε and g_P = ε_P^T ε. The cost function is approximated as quadratic near the minimum.

Gauss-Newton: ε_P^T ε_P ∆ = −ε_P^T ε. The Hessian is approximated as ε_P^T ε_P.

Gradient Descent: λ∆ = −ε_P^T ε = −g_P. The Hessian is replaced by λI.


Page 23: Lecture 5

Levenberg-Marquardt iteration (LM)

This is a slight variation of the Gauss-Newton iteration method. We replace the normal equations by the augmented normal equations:

J^T J ∆ = −J^T ε  −→  (J^T J + λI) ∆ = −J^T ε

The value of λ varies from iteration to iteration. A typical initial value of λ is 10⁻³ times the average of the diagonal elements of J^T J.


Page 24: Lecture 5

Levenberg-Marquardt iteration (LM)

If the value of ∆ obtained by solving the augmented normal equations leads to a reduction of the error, then the increment is accepted and λ is divided by a factor (typically 10) before the next iteration.
If the value of ∆ leads to an increased error, then λ is multiplied by the same factor and the augmented normal equations are solved again. This process continues until a value of ∆ is found that gives rise to a decreased error.
The process of repeatedly solving the augmented normal equations for different values of λ until an acceptable ∆ is found constitutes one iteration of the LM algorithm.
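A sketch of one possible λ schedule implementing this rule in NumPy; the residual/Jacobian interface, the factor of 10, and the iteration guards are illustrative assumptions:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, P0, n_iters=100):
    """LM: solve (J^T J + lam*I) delta = -J^T eps, adapting lam between iterations."""
    P = P0.astype(float)
    J = jacobian(P)
    lam = 1e-3 * np.mean(np.diag(J.T @ J))      # typical initial value of lambda
    for _ in range(n_iters):
        eps = residual(P)
        J = jacobian(P)
        while True:
            A = J.T @ J + lam * np.eye(P.size)  # augmented normal equations
            delta = np.linalg.solve(A, -J.T @ eps)
            if np.linalg.norm(residual(P + delta)) < np.linalg.norm(eps):
                P = P + delta                   # accepted: keep the step, relax lambda
                lam /= 10.0
                break
            lam *= 10.0                         # rejected: move toward gradient descent
            if lam > 1e12:                      # guard against an endless inner loop
                return P
        if np.linalg.norm(delta) < 1e-12:
            break
    return P
```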


Page 25: Lecture 5

Robust cost functions

[Figure: squared-error cost function (convex), with its PDF and attenuation function]


Page 26: Lecture 5

Robust cost functions

[Figure: Blake-Zisserman cost function (non-convex), with its PDF and attenuation function]
[Figure: corrupted Gaussian cost function (non-convex), with its PDF and attenuation function]


Page 27: Lecture 5

Robust cost functions

[Figure: Cauchy cost function (non-convex), with its PDF and attenuation function]
[Figure: L1 cost function (convex), with its PDF and attenuation function]


Page 28: Lecture 5

Robust cost functions

[Figure: Huber cost function (convex), with its PDF and attenuation function]
[Figure: pseudo-Huber cost function (convex), with its PDF and attenuation function]


Page 29: Lecture 5

Squared-error cost function

C(δ) = δ²,  PDF = exp(−C(δ))

Its main drawback is that it is not robust to outliers in the measurements. Because of the rapid growth of the quadratic curve, distant outliers exert an excessive influence and can draw the cost minimum well away from the desired value.
The squared-error cost function is generally very susceptible to outliers, and may be regarded as unusable as long as outliers are present. If outliers have been thoroughly eradicated, for instance using RANSAC, then it may be used.


Page 30: Lecture 5

Non-convex cost functions

The Blake-Zisserman, corrupted Gaussian and Cauchy cost functions seek to mitigate the deleterious effect of outliers by giving them diminished weight.
As seen in the plots of the first two of these, once the error exceeds a certain threshold it is classified as an outlier, and the cost remains substantially constant.
The Cauchy cost function also de-emphasizes the cost of outliers, but does so more gradually.


Page 31: Lecture 5

Asymptotically linear cost functions

The L1 cost function measures the absolute value of the error. Its main effect is to give outliers less weight compared with the squared error.
This cost function acts to find the median of a set of data: for a set of real-valued data {a_i} and the cost function C(x) = Σ_i |x − a_i|, the minimum of C is at the median of the set {a_i}.
For higher-dimensional data a_i ∈ R^n, the minimum of the cost function C(x) = Σ_i ||x − a_i|| has similar stability properties with regard to outliers.
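A quick numerical check of the median property, with made-up data containing one gross outlier:

```python
import numpy as np

a = np.array([1.0, 2.0, 2.5, 3.0, 100.0])    # one gross outlier at 100
xs = np.linspace(0.0, 10.0, 10001)
cost = np.array([np.sum(np.abs(x - a)) for x in xs])   # C(x) = sum_i |x - a_i|
x_min = xs[np.argmin(cost)]
# x_min is (approximately) np.median(a) == 2.5, unaffected by the outlier
```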


Page 32: Lecture 5

Huber cost function

The Huber cost function takes the form of a quadratic for small values of the error δ and becomes linear for values of δ beyond a given threshold.
It retains the outlier stability of the L1 cost function, while for inliers it reflects the property that the squared-error cost function gives the maximum-likelihood estimate.
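A sketch of the Huber cost; the slides do not give the formula, so this uses the standard definition with an illustrative threshold b (quadratic for |δ| < b, linear with matched value and slope beyond):

```python
import numpy as np

def huber_cost(delta, b=1.0):
    """Quadratic for |delta| < b, linear beyond: delta^2 or 2*b*|delta| - b^2."""
    delta = np.asarray(delta, dtype=float)
    quad = delta**2
    lin = 2.0 * b * np.abs(delta) - b**2
    return np.where(np.abs(delta) < b, quad, lin)

errors = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
costs = huber_cost(errors)   # large errors grow linearly, small ones quadratically
```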


Page 33: Lecture 5

Non-convex cost functions

The non-convex cost functions, though generally having a stable minimum that is not much affected by outliers, have the significant disadvantage of possessing local minima, which can make convergence to the global minimum chancy.
The estimate is not strongly attracted to the minimum from outside its immediate neighbourhood.
Thus, they are not useful unless (or until) the estimate is close to the final correct value.


Page 34: Lecture 5

Maximum Likelihood method

Maximum likelihood is the procedure of finding the value of one or more parameters for a given statistic which makes the known likelihood distribution a maximum. The maximum-likelihood estimate for a parameter µ is denoted µ̂.

For n independent Gaussian samples,

f(x_1, x_2, ..., x_n | µ, σ) = Π_{i=1}^{n} (1 / (σ√(2π))) exp(−(x_i − µ)² / (2σ²))
                             = (2π)^{−n/2} σ^{−n} exp[−Σ (x_i − µ)² / (2σ²)]

Taking the logarithm,

log f = −(1/2) n log(2π) − n log σ − Σ (x_i − µ)² / (2σ²)


Page 35: Lecture 5

To maximize the log likelihood,

∂(log f)/∂µ = Σ (x_i − µ) / σ² = 0, giving µ̂ = Σ x_i / n

Similarly,

∂(log f)/∂σ = −n/σ + Σ (x_i − µ)² / σ³ = 0, giving σ̂ = √( Σ (x_i − µ̂)² / n )

Minimizing the least-squares cost function gives a result which is equivalent to the maximum-likelihood estimate under a Gaussian distribution.

In general, the maximum-likelihood estimate of the parameter vector θ is given as

θ̂_ML = arg max_θ p(x | θ)
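A quick numerical illustration of these estimators on synthetic data (the sample size and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)            # synthetic Gaussian samples

mu_hat = np.sum(x) / x.size                              # mu_hat = sum(x_i) / n
sigma_hat = np.sqrt(np.sum((x - mu_hat)**2) / x.size)    # sigma_hat with the 1/n factor

# These match np.mean(x) and np.std(x) (np.std uses ddof=0, i.e. 1/n, by default)
```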
