
Fields Institute Communications, Volume 3, 1994

Optimization Techniques on Riemannian Manifolds

Steven T. Smith

Harvard University, Division of Applied Sciences

Cambridge, Massachusetts 02138

Abstract. The techniques and analysis presented in this paper provide new methods to solve optimization problems posed on Riemannian manifolds. A new point of view is offered for the solution of constrained optimization problems. Some classical optimization techniques on Euclidean space are generalized to Riemannian manifolds. Several algorithms are presented and their convergence properties are analyzed employing the Riemannian structure of the manifold. Specifically, two apparently new algorithms, which can be thought of as Newton’s method and the conjugate gradient method on Riemannian manifolds, are presented and shown to possess, respectively, quadratic and superlinear convergence. Examples of each method on certain Riemannian manifolds are given with the results of numerical experiments. Rayleigh’s quotient defined on the sphere is one example. It is shown that Newton’s method applied to this function converges cubically, and that the Rayleigh quotient iteration is an efficient approximation of Newton’s method. The Riemannian version of the conjugate gradient method applied to this function gives a new algorithm for finding the eigenvectors corresponding to the extreme eigenvalues of a symmetric matrix. Another example arises from extremizing the function $\operatorname{tr}\Theta^TQ\Theta N$ on the special orthogonal group. In a similar example, it is shown that Newton’s method applied to the sum of the squares of the off-diagonal entries of a symmetric matrix converges cubically.

Keywords. Optimization, constrained optimization, Riemannian manifolds, Lie groups, homogeneous spaces, steepest descent, Newton’s method, conjugate gradient method, eigenvalue problem, Rayleigh’s quotient, Rayleigh quotient iteration, Jacobi methods, numerical methods.

1 Introduction

The preponderance of optimization techniques address problems posed on Euclidean spaces. Indeed, several fundamental algorithms have arisen from the desire to compute the minimum of quadratic forms on Euclidean space. However, many optimization problems are posed on non-Euclidean spaces. For example, finding the largest eigenvalue of a symmetric matrix may be posed as the maximization of Rayleigh’s quotient defined on the sphere. Optimization problems subject to nonlinear differentiable equality constraints on Euclidean space also lie within this category. Many optimization problems share with these examples the structure of a differentiable manifold endowed with a Riemannian metric. This is the subject of this paper: the extremization of functions defined on Riemannian manifolds.

© 1993 by the American Mathematical Society.

arXiv:1407.5965v1 [math.OC] 22 Jul 2014

The minimization of functions on a Riemannian manifold is, at least locally, equivalent to the smoothly constrained optimization problem on a Euclidean space, because every $C^\infty$ Riemannian manifold can be isometrically imbedded in some Euclidean space [46, Vol. V]. However, the dimension of the Euclidean space may be larger than the dimension of the manifold; practical and aesthetic considerations suggest that one try to exploit the intrinsic structure of the manifold. Elements of this spirit may be found throughout the field of numerical methods, such as the emphasis on unitary (norm preserving) transformations in numerical linear algebra [22], or the use of feasible direction methods [18, 21, 38].

An intrinsic approach leads one from the extrinsic idea of vector addition to the exponential map and parallel translation, from minimization along lines to minimization along geodesics, and from partial differentiation to covariant differentiation. The computation of geodesics, parallel translation, and covariant derivatives can be quite expensive. For an $n$-dimensional manifold, the computation of geodesics and parallel translation requires the solution of a system of $2n$ nonlinear and $n$ linear ordinary differential equations. Nevertheless, many optimization problems are posed on manifolds that have an underlying algebraic structure that may be exploited to greatly reduce the complexity of these computations. For example, on a real compact semisimple Lie group endowed with its natural Riemannian metric, geodesics and parallel translation may be computed via matrix exponentiation [24]. Several algorithms are available to perform this computation [22, 32]. This algebraic structure may be found in the problems posed by Brockett [8, 9, 10], Bloch et al. [3, 4], Smith [45], Faybusovich [17], Lagarias [30], Chu et al. [13, 14], Perkins et al. [35], and Helmke [25]. This approach is also applicable if the manifold can be identified with a symmetric space or, excepting parallel translation, a reductive homogeneous space [29, 33]. Perhaps the simplest nontrivial example is the sphere, where geodesics and parallel translation can be computed at low cost with trigonometric functions and vector addition. Furthermore, Brown and Bartholomew-Biggs [11] show that in some cases function minimization by following the solution of a system of ordinary differential equations can be implemented such that it is competitive with conventional techniques.

The outline of the paper is as follows. In Section 2, the optimization problem is posed and conventions to be held throughout the paper are established. The method of steepest descent on a Riemannian manifold is described in Section 3. To fix ideas, a proof of linear convergence is given. The examples of Rayleigh’s quotient on the sphere and the function $\operatorname{tr}\Theta^TQ\Theta N$ on the special orthogonal group are presented. In Section 4, Newton’s method on a Riemannian manifold is derived. As in Euclidean space, this algorithm may be used to compute the extrema of differentiable functions. It is proved that this method converges quadratically. The example of Rayleigh’s quotient is continued, and it is shown that Newton’s method applied to this function converges cubically, and is approximated by the Rayleigh quotient iteration. The example considering $\operatorname{tr}\Theta^TQ\Theta N$ is continued. In a related example, it is shown that Newton’s method applied to the sum of the squares of the off-diagonal elements of a symmetric matrix converges cubically. This provides an example of a cubically convergent Jacobi-like method. The conjugate gradient method is presented in Section 5 with a proof of superlinear convergence. This technique is shown to provide an effective algorithm for computing the extreme eigenvalues of a symmetric matrix. The conjugate gradient method is applied to the function $\operatorname{tr}\Theta^TQ\Theta N$.

2 Preliminaries

This paper is concerned with the following problem.

Problem 2.1. Let $M$ be a complete Riemannian manifold, and $f$ a $C^\infty$ function on $M$. Compute

\[ \min_{p\in M} f(p). \]

There are many well-known algorithms for solving this problem in the case where $M$ is a Euclidean space. This paper generalizes several of these algorithms to the case of complete Riemannian manifolds by replacing the Euclidean notions of straight lines and ordinary differentiation with geodesics and covariant differentiation. These concepts are reviewed in the following paragraphs. We follow Helgason’s [24] and Spivak’s [46] treatments of covariant differentiation, the exponential map, and parallel translation. Details may be found in these references.

Let $M$ be a complete $n$-dimensional Riemannian manifold with Riemannian structure $g$ and corresponding Levi-Civita connection $\nabla$. Denote the tangent plane at $p$ in $M$ by $T_p$ or $T_pM$. For every $p$ in $M$, the Riemannian structure $g$ provides an inner product on $T_p$ given by the nondegenerate symmetric bilinear form $g_p\colon T_p\times T_p\to\mathbb{R}$. The notation $\langle X,Y\rangle = g_p(X,Y)$ and $\|X\| = g_p(X,X)^{1/2}$, where $X$, $Y\in T_p$, is often used. The distance between two points $p$ and $q$ in $M$ is denoted by $d(p,q)$. The gradient of a real-valued $C^\infty$ function $f$ on $M$ at $p$, denoted by $(\operatorname{grad} f)_p$, is the unique vector in $T_p$ such that $df_p(X) = \langle(\operatorname{grad} f)_p, X\rangle$ for all $X$ in $T_p$.

Denote the set of $C^\infty$ functions on $M$ by $C^\infty(M)$ and the set of $C^\infty$ vector fields on $M$ by $\mathfrak{X}(M)$. An affine connection on $M$ is a function $\nabla$ which assigns to each vector field $X\in\mathfrak{X}(M)$ an $\mathbb{R}$-linear map $\nabla_X\colon\mathfrak{X}(M)\to\mathfrak{X}(M)$ which satisfies

(i) $\nabla_{fX+gY} = f\nabla_X + g\nabla_Y$,
(ii) $\nabla_X(fY) = f\nabla_XY + (Xf)Y$,

for all $f$, $g\in C^\infty(M)$, $X$, $Y\in\mathfrak{X}(M)$. The map $\nabla_X$ may be applied to tensors of arbitrary type. Let $\nabla$ be an affine connection on $M$ and $X\in\mathfrak{X}(M)$. Then there exists a unique $\mathbb{R}$-linear map $A\mapsto\nabla_XA$ of $C^\infty$ tensor fields into $C^\infty$ tensor fields which satisfies

(i) $\nabla_Xf = Xf$,
(ii) $\nabla_XY$ is given by $\nabla$,
(iii) $\nabla_X$ is a derivation: $\nabla_X(A\otimes B) = \nabla_XA\otimes B + A\otimes\nabla_XB$,
(iv) $\nabla_X$ preserves the type of tensors,
(v) $\nabla_X$ commutes with contractions,

where $f\in C^\infty(M)$, $Y\in\mathfrak{X}(M)$, and $A$, $B$ are $C^\infty$ tensor fields. If $A$ is of type $(k,l)$, then $\nabla_XA$, called the covariant derivative of $A$ along $X$, is of type $(k,l)$, and $\nabla A\colon X\mapsto\nabla_XA$, called the covariant differential of $A$, is of type $(k,l+1)$.

Let $M$ be a differentiable manifold with affine connection $\nabla$. Let $\gamma\colon I\to M$ be a smooth curve with tangent vectors $X(t) = \dot\gamma(t)$, where $I\subset\mathbb{R}$ is an open interval. The curve $\gamma$ is called a geodesic if $\nabla_XX = 0$ for all $t\in I$. Let $Y(t)\in T_{\gamma(t)}$ ($t\in I$) be a smooth family of tangent vectors defined along $\gamma$. The family $Y(t)$ is said to be parallel along $\gamma$ if $\nabla_XY = 0$ for all $t\in I$.

For every $p$ in $M$ and $X\neq 0$ in $T_p$, there exists a unique geodesic $t\mapsto\gamma_X(t)$ such that $\gamma_X(0) = p$ and $\dot\gamma_X(0) = X$. We define the exponential map $\exp_p\colon T_p\to M$ by $\exp_p(X) = \gamma_X(1)$ for all $X\in T_p$ such that $1$ is in the domain of $\gamma_X$. Oftentimes the map $\exp_p$ will be denoted by “exp” when the choice of tangent plane is clear, and $\gamma_X(t)$ will be denoted by $\exp tX$. A neighborhood $N_p$ of $p$ in $M$ is a normal neighborhood if $N_p = \exp N_0$, where $N_0$ is a star-shaped neighborhood of the origin in $T_p$ and exp maps $N_0$ diffeomorphically onto $N_p$. Normal neighborhoods always exist.

Given a curve $\gamma\colon I\to M$ such that $\gamma(0) = p$, for each $Y\in T_p$ there exists a unique family $Y(t)\in T_{\gamma(t)}$ ($t\in I$) of tangent vectors parallel along $\gamma$ such that $Y(0) = Y$. If $\gamma$ joins the points $p$ and $\gamma(\alpha) = q$, the parallelism along $\gamma$ induces an isomorphism $\tau_{pq}\colon T_p\to T_q$ defined by $\tau_{pq}Y = Y(\alpha)$.

Let $M$ be a manifold with an affine connection $\nabla$, and $N_p$ a normal neighborhood of $p\in M$. Define the vector field $\tilde X$ on $N_p$ adapted to the tangent vector $X$ in $T_p$ by putting $\tilde X_q = \tau_{pq}X$, the parallel translation of $X$ along the unique geodesic segment joining $p$ and $q$.

Given a Riemannian structure $g$ on $M$, there exists a unique affine connection $\nabla$ on $M$, called the Levi-Civita connection, which for all $X$, $Y\in\mathfrak{X}(M)$ satisfies

(i) $\nabla_XY - \nabla_YX = [X,Y]$ ($\nabla$ is symmetric or torsion-free),
(ii) $\nabla g = 0$ (parallel translation is an isometry).

Length minimizing curves on $M$ are geodesics of the Levi-Civita connection. We shall use this connection throughout the paper.

Unless otherwise specified, all manifolds, vector fields, and functions are assumed to be smooth. When considering a function $f$ to be minimized, the assumption that $f$ is differentiable of class $C^\infty$ can be relaxed throughout the paper, but $f$ must be continuously differentiable at least beyond the derivatives that appear. As the results of this paper are local ones, the assumption that $M$ be complete may also be relaxed in certain instances.

We will use the following definitions to compare the convergence rates of various algorithms.


Definition 2.2. Let $\{p_i\}$ be a Cauchy sequence in $M$ that converges to $\hat p$.

(i) The sequence $\{p_i\}$ is said to converge (at least) linearly if there exist an integer $N$ and a constant $\theta\in[0,1)$ such that $d(p_{i+1},\hat p)\le\theta\,d(p_i,\hat p)$ for all $i\ge N$.
(ii) The sequence $\{p_i\}$ is said to converge (at least) quadratically if there exist an integer $N$ and a constant $\theta\ge0$ such that $d(p_{i+1},\hat p)\le\theta\,d^2(p_i,\hat p)$ for all $i\ge N$.
(iii) The sequence $\{p_i\}$ is said to converge (at least) cubically if there exist an integer $N$ and a constant $\theta\ge0$ such that $d(p_{i+1},\hat p)\le\theta\,d^3(p_i,\hat p)$ for all $i\ge N$.
(iv) The sequence $\{p_i\}$ is said to converge superlinearly if it converges faster than any sequence that converges linearly.
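As a rough numerical companion to Definition 2.2, the following sketch estimates the order of convergence from a recorded sequence of distances to the limit. The helper name `estimate_order` and the log-ratio estimator are illustrative assumptions and do not appear in the paper; the estimate is a heuristic diagnostic, not a proof of a rate.

```python
import math

def estimate_order(errors):
    """Estimate the convergence order q from errors e_i = d(p_i, limit).

    Uses q ~ log(e_{i+1}/e_i) / log(e_i/e_{i-1}) on the last three
    nonzero errors (an assumed, heuristic estimator).
    """
    e = [x for x in errors if x > 0.0]
    if len(e) < 3:
        raise ValueError("need at least three nonzero errors")
    e0, e1, e2 = e[-3], e[-2], e[-1]
    return math.log(e2 / e1) / math.log(e1 / e0)

# Synthetic sequences obeying e_{i+1} = theta * e_i**q exactly.
linear = [0.5 ** i for i in range(1, 8)]            # q = 1, theta = 1/2
quadratic = [0.1 ** (2 ** i) for i in range(0, 4)]  # q = 2, theta = 1

order_linear = estimate_order(linear)
order_quadratic = estimate_order(quadratic)
```

An estimate near 1 is consistent with linear convergence, near 2 with quadratic, and near 3 with cubic convergence in the sense of the definition.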

3 Steepest descent on Riemannian manifolds

The method of steepest descent on a Riemannian manifold is conceptually identical to the method of steepest descent on Euclidean space. Each iteration involves a gradient computation and minimization along the geodesic determined by the gradient. Fletcher [18], Botsaris [5, 6, 7], and Luenberger [31] describe this algorithm in Euclidean space. Gill and Murray [21] and Sargent [38] apply this technique in the presence of constraints. In this section we restate the method of steepest descent described in the literature and provide an alternative formalism that will be useful in the development of Newton’s method and the conjugate gradient method on Riemannian manifolds.

Algorithm 3.1 (The method of steepest descent). Let $M$ be a complete Riemannian manifold with Riemannian structure $g$ and Levi-Civita connection $\nabla$, and let $f\in C^\infty(M)$.

Step 0. Select $p_0\in M$, compute $G_0 = -(\operatorname{grad} f)_{p_0}$, and set $i = 0$.

Step 1. Compute $\lambda_i$ such that

\[ f(\exp_{p_i}\lambda_iG_i) \le f(\exp_{p_i}\lambda G_i) \]

for all $\lambda\ge0$.

Step 2. Set

\[ p_{i+1} = \exp_{p_i}\lambda_iG_i, \qquad G_{i+1} = -(\operatorname{grad} f)_{p_{i+1}}, \]

increment $i$, and go to Step 1.

It is easy to verify that $\langle G_{i+1}, \tau G_i\rangle = 0$, for $i\ge0$, where $\tau$ is the parallelism with respect to the geodesic from $p_i$ to $p_{i+1}$. By assumption, the function $\lambda\mapsto f(\exp\lambda G_i)$ is minimized at $\lambda_i$. Therefore, we have $0 = (d/dt)|_{t=0}\,f(\exp(\lambda_i+t)G_i) = df_{p_{i+1}}(\tau G_i) = \langle(\operatorname{grad} f)_{p_{i+1}}, \tau G_i\rangle$. Thus the method of steepest descent on a Riemannian manifold has the same deficiency as its counterpart on a Euclidean space, i.e., it makes a ninety degree turn at every step.

The convergence of Algorithm 3.1 is linear. To prove this fact, we will make use of a standard theorem of the calculus, expressed in differential geometric language. The covariant derivative $\nabla_Xf$ of $f$ along $X$ is defined to be $Xf$. For $k = 1, 2, \dots$, define $\nabla^k_Xf = \nabla_X\cdots\nabla_Xf$ ($k$ times), and let $\nabla^0_Xf = f$.

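Remark 3.2 (Taylor’s formula). Let $M$ be a manifold with an affine connection $\nabla$, $N_p$ a normal neighborhood of $p\in M$, the vector field $\tilde X$ on $N_p$ adapted to $X$ in $T_p$, and $f$ a $C^\infty$ function on $M$. Then there exists an $\epsilon>0$ such that for every $\lambda\in[0,\epsilon)$

\[ f(\exp_p\lambda X) = f(p) + \lambda(\nabla_{\tilde X}f)(p) + \cdots + \frac{\lambda^{n-1}}{(n-1)!}(\nabla^{n-1}_{\tilde X}f)(p) + \frac{\lambda^n}{(n-1)!}\int_0^1(1-t)^{n-1}(\nabla^n_{\tilde X}f)(\exp_pt\lambda X)\,dt. \tag{1} \]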

Proof. Let $N_0$ be a star-shaped neighborhood of $0\in T_p$ such that $N_p = \exp N_0$. There exists $\epsilon>0$ such that $\lambda X\in N_0$ for all $\lambda\in[0,\epsilon)$. The map $\lambda\mapsto f(\exp\lambda X)$ is a real $C^\infty$ function on $[0,\epsilon)$ with derivative $(\nabla_{\tilde X}f)(\exp\lambda X)$. The statement follows by repeated integration by parts.

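The following special cases of Remark 3.2 will be particularly useful. When $n = 2$, Eq. (1) yields

\[ f(\exp_p\lambda X) = f(p) + \lambda(\nabla_{\tilde X}f)(p) + \lambda^2\int_0^1(1-t)(\nabla^2_{\tilde X}f)(\exp_pt\lambda X)\,dt. \tag{2} \]

Furthermore, when $n = 1$, Eq. (1) applied to the function $\tilde Xf = \nabla_{\tilde X}f$ yields

\[ (\tilde Xf)(\exp_p\lambda X) = (\tilde Xf)(p) + \lambda\int_0^1(\nabla^2_{\tilde X}f)(\exp_pt\lambda X)\,dt. \tag{3} \]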

The convergence proofs require a characterization of the second order terms of $f$ near a critical point. Consider the second covariant differential $\nabla\nabla f = \nabla^2f$ of a smooth function $f\colon M\to\mathbb{R}$. If $(U, x^1, \dots, x^n)$ is a coordinate chart on $M$, then at $p\in U$ this $(0,2)$ tensor takes the form

\[ (\nabla^2f)_p = \sum_{i,j}\Bigl(\Bigl(\frac{\partial^2f}{\partial x^i\partial x^j}\Bigr)_p - \sum_k\Gamma^k_{ji}\Bigl(\frac{\partial f}{\partial x^k}\Bigr)_p\Bigr)\,dx^i\otimes dx^j \tag{4} \]

where $\Gamma^k_{ij}$ are the Christoffel symbols at $p$. If $\hat p$ in $U$ is a critical point of $f$, then $(\partial f/\partial x^k)_{\hat p} = 0$, $k = 1, \dots, n$. Therefore $(\nabla^2f)_{\hat p} = (d^2f)_{\hat p}$, where $(d^2f)_{\hat p}$ is the Hessian of $f$ at the critical point $\hat p$. Furthermore, for $p\in M$, $X$, $Y\in T_p$, and $\tilde X$ and $\tilde Y$ vector fields adapted to $X$ and $Y$, respectively, on a normal neighborhood $N_p$ of $p$, we have $(\nabla^2f)(\tilde X,\tilde Y) = \nabla_{\tilde Y}\nabla_{\tilde X}f$ on $N_p$. Therefore the coefficient of the second term of the Taylor expansion of $f(\exp tX)$ is $(\nabla^2_{\tilde X}f)_p = (\nabla^2f)_p(X,X)$. Note that the bilinear form $(\nabla^2f)_p$ on $T_p\times T_p$ is symmetric if and only if $\nabla$ is symmetric, which is true of the Levi-Civita connection by definition.


Theorem 3.3. Let $M$ be a complete Riemannian manifold with Riemannian structure $g$ and Levi-Civita connection $\nabla$. Let $f\in C^\infty(M)$ have a nondegenerate critical point at $\hat p$ such that the Hessian $(d^2f)_{\hat p}$ is positive definite. Let $p_i$ be a sequence of points in $M$ converging to $\hat p$ and $H_i\in T_{p_i}$ a sequence of tangent vectors such that

(i) $p_{i+1} = \exp_{p_i}\lambda_iH_i$ for $i = 0, 1, \dots$,
(ii) $\langle-(\operatorname{grad} f)_{p_i}, H_i\rangle \ge c\,\|(\operatorname{grad} f)_{p_i}\|\,\|H_i\|$ for $c\in(0,1]$,

where $\lambda_i$ is chosen such that $f(\exp\lambda_iH_i)\le f(\exp\lambda H_i)$ for all $\lambda\ge0$. Then there exists a constant $E$ and a $\theta\in[0,1)$ such that for all $i = 0, 1, \dots$,

\[ d(p_i,\hat p) \le E\theta^i. \]

Proof. The proof is a generalization of the one given in Polak [36, p. 242ff] for the method of steepest descent on Euclidean space.

The existence of a convergent sequence is guaranteed by the smoothness of $f$. If $p_j = \hat p$ for some integer $j$, the assertion becomes trivial; assume otherwise. By the smoothness of $f$, there exists an open neighborhood $U$ of $\hat p$ such that $(\nabla^2f)_p$ is positive definite for all $p\in U$. Therefore, there exist constants $k>0$ and $K\ge k>0$ such that for all $X\in T_p$ and all $p\in U$,

\[ k\|X\|^2 \le (\nabla^2f)_p(X,X) \le K\|X\|^2. \tag{5} \]

Define $X_i\in T_{\hat p}$ by the relations $\exp X_i = p_i$, $i = 0, 1, \dots$ By assumption, $df_{\hat p} = 0$ and from Eq. (2), we have

\[ f(p_i) - f(\hat p) = \int_0^1(1-t)(\nabla^2_{\tilde X_i}f)(\exp_{\hat p}tX_i)\,dt. \tag{6} \]

Combining this equality with the inequalities of (5) yields

\[ \tfrac12k\,d^2(p_i,\hat p) \le f(p_i) - f(\hat p) \le \tfrac12K\,d^2(p_i,\hat p). \tag{7} \]

Similarly, we have by Eq. (3)

\[ (\tilde X_if)(p_i) = \int_0^1(\nabla^2_{\tilde X_i}f)(\exp_{\hat p}tX_i)\,dt. \]

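Next, use (6) with Schwarz’s inequality and the first inequality of (7) to obtain

\[
\begin{aligned}
k\,d^2(p_i,\hat p) = k\|X_i\|^2 &\le \int_0^1(\nabla^2_{\tilde X_i}f)(\exp_{\hat p}tX_i)\,dt = (\tilde X_if)(p_i) \\
&= df_{p_i}\bigl((\tilde X_i)_{p_i}\bigr) = df_{p_i}(\tau X_i) = \langle(\operatorname{grad} f)_{p_i}, \tau X_i\rangle \\
&\le \|(\operatorname{grad} f)_{p_i}\|\,\|\tau X_i\| = \|(\operatorname{grad} f)_{p_i}\|\,d(p_i,\hat p).
\end{aligned}
\]

Therefore,

\[ \|(\operatorname{grad} f)_{p_i}\| \ge k\,d(p_i,\hat p). \tag{8} \]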


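Define the function $\Delta\colon T_p\times\mathbb{R}\to\mathbb{R}$ by the equation $\Delta(X,\lambda) = f(\exp_p\lambda X) - f(p)$. By Eq. (2), the second order Taylor formula, we have

\[ \Delta(H_i,\lambda) = \lambda(\tilde H_if)(p_i) + \lambda^2\int_0^1(1-t)(\nabla^2_{\tilde H_i}f)(\exp_{p_i}t\lambda H_i)\,dt. \]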

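Using assumption (ii) of the theorem along with (5) we establish for $\lambda\ge0$

\[ \Delta(H_i,\lambda) \le -\lambda c\,\|(\operatorname{grad} f)_{p_i}\|\,\|H_i\| + \tfrac12\lambda^2K\|H_i\|^2. \tag{9} \]

We may now compute an upper bound for the rate of linear convergence $\theta$. By the choice of $\lambda_i$ in the theorem, $\Delta(H_i,\lambda_i)\le\Delta(H_i,\lambda)$ for every $\lambda\ge0$, so the sharpest bound is obtained by choosing $\lambda$ to minimize the right hand side of (9). This corresponds to choosing $\lambda = c\,\|(\operatorname{grad} f)_{p_i}\|/(K\|H_i\|)$. A computation reveals that

\[ \Delta(H_i,\lambda_i) \le -\frac{c^2}{2K}\|(\operatorname{grad} f)_{p_i}\|^2. \]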

Applying (7) and (8) to this inequality and rearranging terms yields

\[ f(p_{i+1}) - f(\hat p) \le \theta\bigl(f(p_i) - f(\hat p)\bigr), \tag{10} \]

where $\theta = 1 - (ck/K)^2$. By assumption, $c\in(0,1]$ and $0<k\le K$, therefore $\theta\in[0,1)$. (Note that Schwarz’s inequality bounds $c$ above by unity.) From (10) it is seen that $f(p_i) - f(\hat p) \le E\theta^i$ where $E = f(p_0) - f(\hat p)$. From (7) we conclude that for $i = 0, 1, \dots$,

\[ d(p_i,\hat p) \le \sqrt{\frac{2E}{k}}\,\bigl(\sqrt\theta\bigr)^i. \tag{11} \]

Corollary 3.4. If Algorithm 3.1 converges to a local minimum, it converges linearly.

The choice $H_i = -(\operatorname{grad} f)_{p_i}$ yields $c = 1$ in the second assumption of Theorem 3.3, which establishes the corollary.

Example 3.5 (Rayleigh’s quotient on the sphere). Let $S^{n-1}$ be the imbedded sphere in $\mathbb{R}^n$, i.e., $S^{n-1} = \{x\in\mathbb{R}^n : x^Tx = 1\}$, where $x^Ty$ denotes the standard inner product on $\mathbb{R}^n$, which induces a metric on $S^{n-1}$. Geodesics on the sphere are great circles and parallel translation along geodesics is equivalent to rotating the tangent plane along the great circle. Let $x\in S^{n-1}$ and $h\in T_x$ have unit length, and $v\in T_x$ be any tangent vector. Then

\[
\begin{aligned}
\exp_xth &= x\cos t + h\sin t, \\
\tau h &= h\cos t - x\sin t, \\
\tau v &= v - (h^Tv)\bigl(x\sin t + h(1-\cos t)\bigr),
\end{aligned}
\]

where $\tau$ is the parallelism along the geodesic $t\mapsto\exp th$. Let $Q$ be an $n$-by-$n$ positive definite symmetric matrix with distinct eigenvalues and define $\rho\colon S^{n-1}\to\mathbb{R}$ by $\rho(x) = x^TQx$. A computation shows that

\[ \tfrac12(\operatorname{grad}\rho)_x = Qx - \rho(x)x. \tag{12} \]


The function $\rho$ has a unique minimum and maximum point at the eigenvectors corresponding to the smallest and largest eigenvalues of $Q$, respectively. Because $S^{n-1}$ is geodesically complete, the method of steepest descent in the opposite direction of the gradient converges to the eigenvector corresponding to the smallest eigenvalue of $Q$; likewise for the eigenvector corresponding to the largest eigenvalue. Chu [13] considers the continuous limit of this problem. A computation shows that $\rho(x)$ is maximized along the geodesic $\exp_xth$ ($\|h\| = 1$) when $a\cos2t - b\sin2t = 0$, where $a = 2x^TQh$ and $b = \rho(x) - \rho(h)$. Thus $\cos t$ and $\sin t$ may be computed with simple algebraic functions of $a$ and $b$ (which appear below in Algorithm 5.5). The results of a numerical experiment demonstrating the convergence of the method of steepest descent applied to maximizing Rayleigh’s quotient on $S^{20}$ are shown in Figure 1 on page 133.
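The sphere formulas of Example 3.5 translate directly into code. The sketch below is illustrative only: the function names, the test matrix, and the use of `arctan2` to pick the maximizing root of $a\cos2t - b\sin2t = 0$ are assumptions of this example, not taken from the paper. It also records $\langle G_{i+1}, \tau G_i\rangle$ at each step, which should vanish up to rounding when the geodesic step is exact.

```python
import numpy as np

def sphere_exp(x, h, t):
    """exp_x(t h) for unit x and unit tangent h (x^T h = 0)."""
    return x * np.cos(t) + h * np.sin(t)

def sphere_transport(x, h, t, v):
    """Parallel translation of the tangent vector v along t -> exp_x(t h)."""
    return v - (h @ v) * (x * np.sin(t) + h * (1.0 - np.cos(t)))

def rayleigh_ascent(Q, x, iters=100, gtol=1e-12):
    """Steepest ascent for rho(x) = x^T Q x with the closed-form geodesic step."""
    inner = []                                # <G_{i+1}, tau G_i> at each step
    for _ in range(iters):
        rho = x @ Q @ x
        g = 2.0 * (Q @ x - rho * x)           # Eq. (12)
        ng = np.linalg.norm(g)
        if ng < gtol:
            break
        h = g / ng                            # unit ascent direction
        a = 2.0 * (x @ Q @ h)
        b = rho - h @ Q @ h
        t = 0.5 * np.arctan2(a, b)            # maximizing root of a cos 2t - b sin 2t = 0
        x_new = sphere_exp(x, h, t)
        g_new = 2.0 * (Q @ x_new - (x_new @ Q @ x_new) * x_new)
        inner.append(g_new @ sphere_transport(x, h, t, g))
        x = x_new / np.linalg.norm(x_new)     # guard against rounding drift
    return x, inner

Q = np.diag([1.0, 2.0, 3.0, 10.0])            # hypothetical test matrix
x_max, inner = rayleigh_ascent(Q, np.ones(4) / 2.0)
```

Because $a > 0$ whenever $h$ is the normalized gradient, `arctan2(a, b)` lies in $(0,\pi)$ and the step $t$ lies in $(0,\pi/2)$, which selects the maximum rather than the minimum along the great circle.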

Example 3.6 (Brockett [9, 10]). Consider the function $f(\Theta) = \operatorname{tr}\Theta^TQ\Theta N$ on the special orthogonal group $SO(n)$, where $Q$ is a real symmetric matrix with distinct eigenvalues and $N$ is a real diagonal matrix with distinct diagonal elements. It will be convenient to identify tangent vectors in $T_\Theta$ with tangent vectors in $T_I\cong\mathfrak{so}(n)$, the tangent plane at the identity, via left translation. The gradient of $f$ (with respect to the negative Killing form of $\mathfrak{so}(n)$, scaled by $1/(n-2)$) at $\Theta\in SO(n)$ is $\Theta[H,N]$, where $H = \operatorname{Ad}_{\Theta^T}(Q) = \Theta^TQ\Theta$. The group $SO(n)$ acts on the set of symmetric matrices by conjugation; the orbit of $Q$ under the action of $SO(n)$ is an isospectral submanifold of the symmetric matrices. We seek a $\Theta$ such that $f(\Theta)$ is maximized. This point corresponds to a diagonal matrix whose diagonal entries are ordered similarly to those of $N$. A related example is found in Smith [45], who considers the homogeneous space of matrices with fixed singular values, and in Chu [14].

The Levi-Civita connection on $SO(n)$ is bi-invariant and invariant with respect to inversion; therefore, geodesics and parallel translation may be computed via matrix exponentiation of elements in $\mathfrak{so}(n)$ and left (or right) translation [24, Ch. II, Ex. 6]. The geodesic emanating from the identity in $SO(n)$ in direction $X\in\mathfrak{so}(n)$ is given by the formula $\exp_ItX = e^{tX}$, where the right hand side denotes regular matrix exponentiation. The expense of geodesic minimization may be avoided if instead one uses Brockett’s estimate [10] for the step size. Given $\Omega\in\mathfrak{so}(n)$, we wish to find $t>0$ such that $\phi(t) = \operatorname{tr}\operatorname{Ad}_{e^{-t\Omega}}(H)N$ is minimized. Differentiating $\phi$ twice shows that $\phi'(t) = -\operatorname{tr}\operatorname{Ad}_{e^{-t\Omega}}(\operatorname{ad}_\Omega H)N$ and $\phi''(t) = -\operatorname{tr}\operatorname{Ad}_{e^{-t\Omega}}(\operatorname{ad}_\Omega H)\operatorname{ad}_\Omega N$, where $\operatorname{ad}_\Omega A = [\Omega,A]$. Hence, $\phi'(0) = 2\operatorname{tr}H\Omega N$ and, by Schwarz’s inequality and the fact that Ad is an isometry, $|\phi''(t)| \le \|\operatorname{ad}_\Omega H\|\,\|\operatorname{ad}_\Omega N\|$. We conclude that if $\phi'(0)>0$, then $\phi'$ is nonnegative on the interval

\[ 0 \le t \le \frac{2\operatorname{tr}H\Omega N}{\|\operatorname{ad}_\Omega H\|\,\|\operatorname{ad}_\Omega N\|}, \tag{13} \]

which provides an estimate for the step size of Step 1 in Algorithm 3.1. The results of a numerical experiment demonstrating the convergence of the method of steepest descent (ascent) in $SO(20)$ using this estimate are shown in Figure 2.
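A gradient ascent on $SO(n)$ using the right end of the interval (13) as the step size might look as follows. This is a sketch under assumptions: the norm is taken to be the Frobenius norm, the direction is $\Omega = [H,N]$, the update is $\Theta\mapsto\Theta e^{t\Omega}$, and the helper names `expm_skew` and `brockett_ascent` as well as the random test matrix are hypothetical. The matrix exponential of a skew-symmetric matrix is computed from the Hermitian eigendecomposition of $i\Omega$ to keep the example dependent on NumPy only.

```python
import numpy as np

def expm_skew(A):
    """exp(A) for a real skew-symmetric A, via the Hermitian matrix iA."""
    w, V = np.linalg.eigh(1j * A)                      # iA = V diag(w) V^H with real w
    return ((V * np.exp(-1j * w)) @ V.conj().T).real   # exp(A) = V diag(e^{-iw}) V^H

def brockett_ascent(Q, N, iters=500):
    """Ascent for f(Theta) = tr(Theta^T Q Theta N) with the step estimate (13)."""
    Theta = np.eye(Q.shape[0])
    values = []
    for _ in range(iters):
        H = Theta.T @ Q @ Theta
        values.append(np.trace(H @ N))
        Omega = H @ N - N @ H                          # [H, N]
        Omega = 0.5 * (Omega - Omega.T)                # enforce skew-symmetry numerically
        adH = Omega @ H - H @ Omega                    # ad_Omega H
        adN = Omega @ N - N @ Omega                    # ad_Omega N
        denom = np.linalg.norm(adH) * np.linalg.norm(adN)   # Frobenius norms (assumed)
        if denom < 1e-14:
            break                                      # numerically at a critical point
        t = 2.0 * np.trace(H @ Omega @ N) / denom      # right end of interval (13)
        Theta = Theta @ expm_skew(t * Omega)
    return Theta, values

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Q = B + B.T                                            # hypothetical symmetric test matrix
N = np.diag([1.0, 2.0, 3.0, 4.0])
Theta, values = brockett_ascent(Q, N)
```

Since $\phi'$ is nonnegative on the whole interval, the recorded values of $f$ should be nondecreasing; the step is conservative, so convergence can be slow compared with an exact geodesic search.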


4 Newton’s method on Riemannian manifolds

As in the optimization of functions on Euclidean space, quadratic convergence can be obtained if the second order terms of the Taylor expansion are used appropriately. In this section we present Newton’s algorithm on Riemannian manifolds, prove that its convergence is quadratic, and provide examples. Whereas the convergence proof for the method of steepest descent relies upon the Taylor expansion of the function $f$, the convergence proof for Newton’s method will rely upon the Taylor expansion of the one-form $df$. Note that Newton’s method has a counterpart in the theory of constrained optimization, as described by, e.g., Fletcher [18], Bertsekas [1, 2], or Dunn [15, 16]. The Newton method presented in this section has only local convergence properties. There is a theory of global Newton methods on Euclidean space and computational complexity; see the work of Hirsch and Smale [27], Smale [43, 44], and Shub and Smale [40, 41].

Let $M$ be an $n$-dimensional Riemannian manifold with Riemannian structure $g$ and Levi-Civita connection $\nabla$, let $\mu$ be a $C^\infty$ one-form on $M$, and let $p$ in $M$ be such that the bilinear form $(\nabla\mu)_p\colon T_p\times T_p\to\mathbb{R}$ is nondegenerate. Then, by abuse of notation, we have the pair of isomorphisms

\[ T_p \underset{(\nabla\mu)_p^{-1}}{\overset{(\nabla\mu)_p}{\rightleftarrows}} T_p^* \]

with the forward map defined by $X\mapsto(\nabla_X\mu)_p = (\nabla\mu)_p(\cdot,X)$, which is nonsingular. The notation $(\nabla\mu)_p$ will henceforth be used for both the bilinear form defined by the covariant differential of $\mu$ evaluated at $p$ and the homomorphism from $T_p$ to $T_p^*$ induced by this bilinear form. In case of an isomorphism, the inverse can be used to compute a point in $M$ where $\mu$ vanishes, if such a point exists. The case $\mu = df$ will be of particular interest, in which case $\nabla\mu = \nabla^2f$. Before expounding on these ideas, we make the following remarks.

Remark 4.1 (The mean value theorem). Let $M$ be a manifold with affine connection $\nabla$, $N_p$ a normal neighborhood of $p\in M$, the vector field $\tilde X$ on $N_p$ adapted to $X\in T_p$, $\mu$ a one-form on $N_p$, and $\tau_\lambda$ the parallelism with respect to $\exp tX$ for $t\in[0,\lambda]$. Denote the point $\exp\lambda X$ by $p_\lambda$. Then there exists an $\epsilon>0$ such that for every $\lambda\in[0,\epsilon)$, there is an $\alpha\in[0,\lambda]$ such that

\[ \tau_\lambda^{-1}\mu_{p_\lambda} - \mu_p = \lambda(\nabla_{\tilde X}\mu)_{p_\alpha}\,\tau_\alpha. \]

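Proof. As in the proof of Remark 3.2, there exists an $\epsilon>0$ such that $\lambda X\in N_0$ for all $\lambda\in[0,\epsilon)$. The map $\lambda\mapsto(\tau_\lambda^{-1}\mu_{p_\lambda})(A)$, for any $A$ in $T_p$, is a $C^\infty$ function on $[0,\epsilon)$ with derivative

\[ \frac{d}{dt}(\tau_t^{-1}\mu_{p_t})(A) = \frac{d}{dt}\mu_{p_t}(\tau_tA) = \nabla_{\tilde X}\bigl(\mu_{p_t}(\tau_tA)\bigr) = (\nabla_{\tilde X}\mu)_{p_t}(\tau_tA) + \mu_{p_t}\bigl(\nabla_{\tilde X}(\tau_tA)\bigr) = (\nabla_{\tilde X}\mu)_{p_t}(\tau_tA), \]

where the last equality holds because $\tau_tA$ is parallel along the geodesic. The remark follows from the mean value theorem of real analysis.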

This remark can be generalized in the following way.

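Remark 4.2 (Taylor’s theorem). Let $M$ be a manifold with affine connection $\nabla$, $N_p$ a normal neighborhood of $p\in M$, the vector field $\tilde X$ on $N_p$ adapted to $X\in T_p$, $\mu$ a one-form on $N_p$, and $\tau_\lambda$ the parallelism with respect to $\exp tX$ for $t\in[0,\lambda]$. Denote the point $\exp\lambda X$ by $p_\lambda$. Then there exists an $\epsilon>0$ such that for every $\lambda\in[0,\epsilon)$, there is an $\alpha\in[0,\lambda]$ such that

\[ \tau_\lambda^{-1}\mu_{p_\lambda} = \mu_p + \lambda(\nabla_{\tilde X}\mu)_p + \cdots + \frac{\lambda^{n-1}}{(n-1)!}(\nabla^{n-1}_{\tilde X}\mu)_p + \frac{\lambda^n}{n!}(\nabla^n_{\tilde X}\mu)_{p_\alpha}\,\tau_\alpha. \tag{14} \]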

The remark follows by applying Remark 4.1 and Taylor’s theorem of real analysis to the function $\lambda\mapsto(\tau_\lambda^{-1}\mu_{p_\lambda})(A)$ for any $A$ in $T_p$.

Remarks 4.1 and 4.2 can be generalized to $C^\infty$ tensor fields, but we will only require Remark 4.2 for the case $n = 2$ to make the following observation.

Let $\mu$ be a one-form on $M$ such that for some $\hat p$ in $M$, $\mu_{\hat p} = 0$. Given any $p$ in a normal neighborhood of $\hat p$, we wish to find $X$ in $T_p$ such that $\exp_pX = \hat p$. Consider the Taylor expansion of $\mu$ about $p$, and let $\tau$ be the parallel translation along the unique geodesic joining $p$ to $\hat p$. We have by our assumption that $\mu$ vanishes at $\hat p$, and from Eq. (14) for $n = 2$,

\[ 0 = \tau^{-1}\mu_{\hat p} = \tau^{-1}\mu_{\exp_pX} = \mu_p + (\nabla\mu)_p(\cdot,X) + \text{h.o.t.} \]

If the bilinear form $(\nabla\mu)_p$ is nondegenerate, the tangent vector $X$ may be approximated by discarding the higher order terms and solving the resulting linear equation

\[ \mu_p + (\nabla\mu)_p(\cdot,X) = 0 \]

for $X$, which yields

\[ X = -(\nabla\mu)_p^{-1}\mu_p. \]

This approximation is the basis of the following algorithm.

Algorithm 4.3 (Newton’s method). Let $M$ be a complete Riemannian manifold with Riemannian structure $g$ and Levi-Civita connection $\nabla$, and let $\mu$ be a $C^\infty$ one-form on $M$.

Step 0. Select $p_0\in M$ such that $(\nabla\mu)_{p_0}$ is nondegenerate, and set $i = 0$.

Step 1. Compute

\[ H_i = -(\nabla\mu)_{p_i}^{-1}\mu_{p_i}, \qquad p_{i+1} = \exp_{p_i}H_i, \]

(assume that $(\nabla\mu)_{p_i}$ is nondegenerate), increment $i$, and repeat.

It can be shown that if $p_0$ is chosen suitably close (within the so-called domain of attraction) to a point $\hat p$ in $M$ such that $\mu_{\hat p} = 0$ and $(\nabla\mu)_{\hat p}$ is nondegenerate, then Algorithm 4.3 converges quadratically to $\hat p$. The following theorem holds for general one-forms; we will consider the case where $\mu$ is exact.

Theorem 4.4. Let $f\in C^\infty(M)$ have a nondegenerate critical point at $\hat p$. Then there exists a neighborhood $U$ of $\hat p$ such that for any $p_0\in U$, the iterates of Algorithm 4.3 for $\mu = df$ are well defined and converge quadratically to $\hat p$.

The proof of this theorem is a generalization of the corresponding proof for Euclidean spaces, with an extra term containing the Riemannian curvature tensor (which of course vanishes in the latter case).

Proof. If $p_j = \hat p$ for some integer $j$, the assertion becomes trivial; assume otherwise. Define $X_i\in T_{p_i}$ by the relations $\hat p = \exp X_i$, $i = 0, 1, \dots$, so that $d(p_i,\hat p) = \|X_i\|$ (n.b. this convention is opposite that used in the proof of Theorem 3.3). Consider the geodesic triangle with vertices $p_i$, $p_{i+1}$, and $\hat p$, and sides $\exp tX_i$ from $p_i$ to $\hat p$, $\exp tH_i$ from $p_i$ to $p_{i+1}$, and $\exp tX_{i+1}$ from $p_{i+1}$ to $\hat p$, for $t\in[0,1]$. Let $\tau$ be the parallelism with respect to the side $\exp tH_i$ between $p_i$ and $p_{i+1}$. There exists a unique tangent vector $\Xi_i$ in $T_{p_i}$ defined by the equation

\[ X_i = H_i + \tau^{-1}X_{i+1} + \Xi_i \tag{15} \]

($\Xi_i$ may be interpreted as the amount by which vector addition fails). If we use the definition $H_i = -(\nabla^2f)_{p_i}^{-1}df_{p_i}$ of Algorithm 4.3 and apply the isomorphism $(\nabla^2f)_{p_i}\colon T_{p_i}\to T_{p_i}^*$ to both sides of Eq. (15), we obtain the equation

\[ (\nabla^2f)_{p_i}(\tau^{-1}X_{i+1}) = df_{p_i} + (\nabla^2f)_{p_i}X_i - (\nabla^2f)_{p_i}\Xi_i. \tag{16} \]

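By Taylor’s theorem, there exists an $\alpha\in[0,1]$ such that

\[ \tau_1^{-1}df_{\hat p} = df_{p_i} + (\nabla_{\tilde X_i}df)_{p_i} + \tfrac12(\nabla^2_{\tilde X_i}df)_{p_\alpha}\,\tau_\alpha \tag{17} \]

where $\tau_t$ is the parallel translation from $p_i$ to $p_t = \exp tX_i$. The trivial identities $(\nabla_{\tilde X_i}df)_{p_i} = (\nabla^2f)_{p_i}X_i$ and $(\nabla^2_{\tilde X_i}df)_{p_\alpha} = (\nabla^3f)_{p_\alpha}(\tau_\alpha\cdot, \tau_\alpha X_i, \tau_\alpha X_i)$ will be used to replace the last two terms on the right hand side of Eq. (17). Combining the assumption that $df_{\hat p} = 0$ with Eqs. (16) and (17), we obtain

\[ (\nabla^2f)_{p_i}(\tau^{-1}X_{i+1}) = -\tfrac12(\nabla^2_{\tilde X_i}df)_{p_\alpha}\,\tau_\alpha - (\nabla^2f)_{p_i}\Xi_i. \tag{18} \]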

By the smoothness of $f$ and $g$, there exists an $\epsilon>0$ and constants $\delta'$, $\delta''$, $\delta'''$, all greater than zero, such that whenever $p$ is in the convex normal ball $B_\epsilon(\hat p)$,

(i) $\|(\nabla^2f)_p(\cdot,X)\| \ge \delta'\|X\|$ for all $X\in T_p$,
(ii) $\|(\nabla^2f)_p(\cdot,X)\| \le \delta''\|X\|$ for all $X\in T_p$,
(iii) $\|(\nabla^3f)_p(\cdot,X,X)\| \le \delta'''\|X\|^2$ for all $X\in T_p$,

where the induced norm on $T_p^*$ is used in all three cases. Taking the norm of both sides of Eq. (18), applying the triangle inequality to the right hand side, and using the fact that parallel translation is an isometry, we obtain the inequality

\[ \delta'd(p_{i+1},\hat p) \le \delta'''d^2(p_i,\hat p) + \delta''\|\Xi_i\|. \tag{19} \]

The length of $\Xi_i$ can be bounded by a cubic expression in $d(p_i,\hat p)$ by considering the distance between the points $\exp(H_i + \tau^{-1}X_{i+1})$ and $\exp X_{i+1} = \hat p$. Given $p\in M$, $\epsilon>0$ small enough, let $a$, $v\in T_p$ be such that $\|a\| + \|v\| \le \epsilon$, and let $\tau$ be the parallel translation with respect to the geodesic from $p$ to $q = \exp_pa$. Karcher [28, App. C2.2] shows that

\[ d\bigl(\exp_p(a+v), \exp_q(\tau v)\bigr) \le \|a\|\cdot\text{const.}\,(\max|K|)\cdot\epsilon^2, \tag{20} \]

where $K$ is the sectional curvature of $M$ along any section in the tangent plane at any point near $p$.

There exists a constant $c>0$ such that $\|\Xi_i\| \le c\,d\bigl(\hat p, \exp(H_i + \tau^{-1}X_{i+1})\bigr)$. By (20), we have $\|\Xi_i\| \le \text{const.}\,\|H_i\|\epsilon^2$. Taking the norm of both sides of the Taylor formula $df_{p_i} = -\int_0^1(\nabla_{\tilde X_i}df)(\exp tX_i)\,dt$ and applying a standard integral inequality and inequality (ii) from above yields $\|df_{p_i}\| \le \delta''\|X_i\|$ so that $\|H_i\| \le \text{const.}\,\|X_i\|$. Furthermore, we have the triangle inequality $\|X_{i+1}\| \le \|X_i\| + \|H_i\|$, therefore $\epsilon$ may be chosen such that $\|H_i\| + \|X_{i+1}\| \le \epsilon \le \text{const.}\,\|X_i\|$. By (20) there exists $\delta^{\mathrm{iv}}>0$ such that $\|\Xi_i\| \le \delta^{\mathrm{iv}}d^3(p_i,\hat p)$.

Corollary 4.5. If $(\nabla^2f)_{\hat p}$ is positive (negative) definite and Algorithm 4.3 converges to $\hat p$, then Algorithm 4.3 converges quadratically to a local minimum (maximum) of $f$.

Example 4.6 (Rayleigh’s quotient on the sphere). Let $S^{n-1}$ and $\rho(x) = x^TQx$ be as in Example 3.5. It will be convenient to work with the coordinates $x^1, \dots, x^n$ of the ambient space $\mathbb{R}^n$, treat the tangent plane $T_xS^{n-1}$ as a vector subspace of $\mathbb{R}^n$, and make the identification $T_xS^{n-1}\cong T_x^*S^{n-1}$ via the metric. In this coordinate system, geodesics on the sphere obey the second order differential equation $\ddot x^k + x^k = 0$, $k = 1, \dots, n$. Thus the Christoffel symbols are given by $\Gamma^k_{ij} = \delta_{ij}x^k$, where $\delta_{ij}$ is the Kronecker delta. The $ij$th component of the second covariant differential of $\rho$ at $x$ in $S^{n-1}$ is given by (cf. Eq. (4))

\[ \bigl((\nabla^2\rho)_x\bigr)_{ij} = 2Q_{ij} - \sum_{k,l}\delta_{ij}x^k\cdot2Q_{kl}x^l = 2\bigl(Q_{ij} - \rho(x)\delta_{ij}\bigr), \]

or, written as matrices,

\[ \tfrac12(\nabla^2\rho)_x = Q - \rho(x)I. \tag{21} \]
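Eq. (21) can be sanity-checked numerically: the second derivative of $\rho$ along a geodesic through $x$ with initial velocity $X$ should equal $(\nabla^2\rho)_x(X,X) = 2\bigl(X^TQX - \rho(x)X^TX\bigr)$. The sketch below compares this with a central finite difference; the function names, the random test data, and the step `eps` are assumptions of this illustration.

```python
import numpy as np

def rho_along_geodesic(Q, x, X, t):
    """rho(exp_x(tX)) on the unit sphere for a nonzero tangent vector X at x."""
    s = np.linalg.norm(X)
    y = x * np.cos(s * t) + (X / s) * np.sin(s * t)
    return y @ Q @ y

def hessian_form(Q, x, X):
    """(nabla^2 rho)_x(X, X) as given by Eq. (21)."""
    return 2.0 * (X @ Q @ X - (x @ Q @ x) * (X @ X))

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Q = B + B.T                                   # hypothetical symmetric test matrix
x = rng.standard_normal(5)
x /= np.linalg.norm(x)                        # point on the sphere
X = rng.standard_normal(5)
X -= (x @ X) * x                              # project onto the tangent plane at x

eps = 1e-4
fd = (rho_along_geodesic(Q, x, X, eps)
      - 2.0 * rho_along_geodesic(Q, x, X, 0.0)
      + rho_along_geodesic(Q, x, X, -eps)) / eps**2
exact = hessian_form(Q, x, X)
```

The two numbers agree only up to the finite-difference truncation and rounding error, so a loose tolerance is appropriate when comparing them.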

Let $u$ be a tangent vector in $T_xS^{n-1}$. A linear operator $A\colon \mathbb{R}^n \to \mathbb{R}^n$ defines a linear operator on the tangent plane $T_xS^{n-1}$ for each $x$ in $S^{n-1}$ such that

\[ A{\cdot}u = Au - (x^TAu)x = (I - xx^T)Au. \]

If $A$ is invertible as an endomorphism of the ambient space $\mathbb{R}^n$, the solution to the linear equation $A{\cdot}u = v$ for $u$, $v$ in $T_xS^{n-1}$ is

\[ u = A^{-1}\Bigl(v - \frac{x^TA^{-1}v}{x^TA^{-1}x}\,x\Bigr). \tag{22} \]
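Eq. (22) can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `solve_tangent` and the random test data are illustrative choices, not taken from the paper. It verifies that the computed $u$ is tangent at $x$ and satisfies $(I - xx^T)Au = v$.

```python
import numpy as np

def solve_tangent(A, x, v):
    """Return u with x^T u = 0 and (I - x x^T) A u = v, following Eq. (22).

    Assumes A is invertible and x^T A^{-1} x != 0.
    """
    Ainv_v = np.linalg.solve(A, v)
    Ainv_x = np.linalg.solve(A, x)
    return Ainv_v - (x @ Ainv_v) / (x @ Ainv_x) * Ainv_x

# Hypothetical test data (not from the paper).
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)   # shifted to keep A well conditioned
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                            # point on the unit sphere
P = np.eye(n) - np.outer(x, x)                    # orthogonal projector onto T_x S^{n-1}
v = P @ rng.standard_normal(n)                    # tangent right-hand side

u = solve_tangent(A, x, v)
residual = np.linalg.norm(P @ (A @ u) - v)        # how well (I - x x^T) A u = v holds
tangency = abs(x @ u)                             # how well x^T u = 0 holds
```

Two linear solves with the same matrix $A$ are used here for clarity; in practice a single factorization of $A$ could serve both.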

For Newton’s method, the direction $H_i$ in $T_{x_i}S^{n-1}$ is the solution of the equation

\[ (\nabla^2\rho)_{x_i} \cdot H_i = -(\operatorname{grad}\rho)_{x_i}. \]


Combining Eqs. (12), (21), and (22), we obtain

\[ H_i = -x_i + \alpha_i\bigl(Q - \rho(x_i)I\bigr)^{-1}x_i, \]

where $\alpha_i = 1/x_i^T(Q - \rho(x_i)I)^{-1}x_i$. This gives rise to the following algorithm for computing eigenvectors of the symmetric matrix $Q$.

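Algorithm 4.7 (Newton-Rayleigh quotient method). Let $Q$ be a real symmetric $n$-by-$n$ matrix.

Step 0. Select $x_0$ in $\mathbb{R}^n$ such that $x_0^Tx_0 = 1$, and set $i = 0$.

Step 1. Compute

\[ y_i = \bigl(Q - \rho(x_i)I\bigr)^{-1}x_i \]

and set $\alpha_i = 1/x_i^Ty_i$.

Step 2. Compute

\[ H_i = -x_i + \alpha_iy_i, \qquad \theta_i = \|H_i\|, \qquad x_{i+1} = x_i\cos\theta_i + H_i\sin\theta_i/\theta_i, \]

increment $i$, and go to Step 1.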


The quadratic convergence guaranteed by Theorem 4.4 is in fact too conservative for Algorithm 4.7. As evidenced by Figure 1, Algorithm 4.7 converges cubically.
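The steps of Algorithm 4.7 can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function name `newton_rayleigh`, the stopping test on the gradient norm, the iteration cap, and the starting point are additions made here and are not part of the algorithm as stated. The test matrix $Q = \mathrm{diag}(21, \ldots, 1)$ is the one used for Figure 1.

```python
import numpy as np

def newton_rayleigh(Q, x0, max_iter=20, tol=1e-12):
    """Sketch of Algorithm 4.7 on the unit sphere for symmetric Q."""
    x = x0 / np.linalg.norm(x0)                    # Step 0: unit starting vector
    I = np.eye(len(x))
    for _ in range(max_iter):
        rho = x @ Q @ x                            # Rayleigh quotient rho(x_i)
        if np.linalg.norm(Q @ x - rho * x) < tol:  # stopping test (an addition)
            break
        try:
            y = np.linalg.solve(Q - rho * I, x)    # Step 1: y_i
        except np.linalg.LinAlgError:
            break                                  # shift equals an eigenvalue exactly
        alpha = 1.0 / (x @ y)
        H = -x + alpha * y                         # Step 2: Newton direction in T_x
        theta = np.linalg.norm(H)
        if theta == 0.0:
            break
        x = x * np.cos(theta) + H * np.sin(theta) / theta   # geodesic step
    return x, x @ Q @ x

Q = np.diag(np.arange(21.0, 0.0, -1.0))            # diag(21, ..., 1), as in Figure 1
x0 = np.full(21, 0.02)
x0[0] += 1.0                                       # hypothetical start near the first eigenvector
x, rho = newton_rayleigh(Q, x0)
```

Because the shifted matrix $Q - \rho(x_i)I$ becomes nearly singular close to an eigenvector, the sketch stops once the gradient norm is below `tol` rather than iterating a fixed number of times.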

Proposition 4.8. If $\lambda$ is a distinct eigenvalue of the symmetric matrix $Q$, and Algorithm 4.7 converges to the corresponding eigenvector $x$, then it converges cubically.

Proof 1. In the coordinates $x^1, \ldots, x^n$ of the ambient space $\mathbb{R}^n$, the $ijk$th component of the third covariant differential of $\rho$ at $x$ is $-2\lambda x^k\delta_{ij}$. Let $X \in T_xS^{n-1}$. Then $(\nabla^3\rho)_x(\cdot, X, X) = 0$ and the second order terms on the right hand side of Eq. (18) vanish at the critical point. The proposition follows from the smoothness of $\rho$.

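Proof 2. The proof follows Parlett’s [34, p. 72ff] proof of cubic convergence for the Rayleigh quotient iteration. Assume that for all $i$, $x_i \ne x$, and denote $\rho(x_i)$ by $\rho_i$. For all $i$, there is an angle $\psi_i$ and a unit length vector $u_i$ defined by the equation $x_i = x\cos\psi_i + u_i\sin\psi_i$, such that $x^Tu_i = 0$. By Algorithm 4.7,

\[ \begin{aligned}
x_{i+1} &= x\cos\psi_{i+1} + u_{i+1}\sin\psi_{i+1} = x_i\cos\theta_i + H_i\sin\theta_i/\theta_i \\
&= x\Bigl(\frac{\alpha_i\sin\theta_i}{(\lambda - \rho_i)\theta_i} + \beta_i\Bigr)\cos\psi_i + \Bigl(\frac{\alpha_i\sin\theta_i}{\theta_i}(Q - \rho_iI)^{-1}u_i + \beta_iu_i\Bigr)\sin\psi_i,
\end{aligned} \]

where $\beta_i = \cos\theta_i - \sin\theta_i/\theta_i$. Therefore,

\[ |\tan\psi_{i+1}| = \frac{\bigl\|\frac{\alpha_i\sin\theta_i}{\theta_i}(Q - \rho_iI)^{-1}u_i + \beta_iu_i\bigr\|}{\bigl|\frac{\alpha_i\sin\theta_i}{(\lambda - \rho_i)\theta_i} + \beta_i\bigr|} \cdot |\tan\psi_i|. \tag{23} \]

The following equalities and low order approximations in terms of the small quantities $\lambda - \rho_i$, $\theta_i$, and $\psi_i$ are straightforward to establish: $\lambda - \rho_i = (\lambda - \rho(u_i))\sin^2\psi_i$, $\theta_i^2 = \cos^2\psi_i\sin^2\psi_i + \text{h.o.t.}$, $\alpha_i = (\lambda - \rho_i) + \text{h.o.t.}$, and $\beta_i = -\theta_i^2/3 + \text{h.o.t.}$ Thus, the denominator of the large fraction in Eq. (23) is of order unity and the numerator is of order $\sin^2\psi_i$. Therefore, we have

\[ |\psi_{i+1}| = \mathrm{const.}\,|\psi_i|^3 + \text{h.o.t.} \]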

Remark 4.9. If Algorithm 4.7 is simplified by replacing Step 2 with

Step 2′. Compute

\[ x_{i+1} = y_i/\|y_i\|, \]

increment $i$, and go to Step 1,

then we obtain the Rayleigh quotient iteration. These two algorithms differ by the method in which they use the vector $y_i = (Q - \rho(x_i)I)^{-1}x_i$ to compute the next iterate on the sphere. Algorithm 4.7 computes the point $H_i$ in $T_{x_i}S^{n-1}$ where $y_i$ intersects this tangent plane, then computes $x_{i+1}$ via the exponential map of this vector (which “rolls” the tangent vector $H_i$ onto the sphere). The Rayleigh quotient iteration computes the intersection of $y_i$ with the sphere itself and takes this intersection to be $x_{i+1}$. The latter approach approximates Algorithm 4.7 up to quadratic terms when $x_i$ is close to an eigenvector. Algorithm 4.7 is more expensive to compute than the Rayleigh quotient iteration, though of the same order; thus, the RQI is seen to be an efficient approximation of Newton’s method.

If the exponential map is replaced by the chart $v \in T_x \mapsto (x + v)/\|x + v\| \in S^{n-1}$, Shub [39] shows that a corresponding version of Newton’s method is equivalent to the RQI.
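The relationship between the two updates can be made concrete. Since $\alpha_iy_i = x_i + H_i$ with $H_i \perp x_i$, both next iterates lie in the plane spanned by $x_i$ and $H_i$: the Newton iterate at angle $\theta_i$ from $x_i$, and the RQI iterate at angle $\arctan\theta_i$. Their distance is therefore at most $\theta_i - \arctan\theta_i \le \theta_i^3/3$. The sketch below, assuming NumPy and a hypothetical helper `newton_and_rqi_step` with an illustrative starting point, computes both iterates from the same $y_i$.

```python
import numpy as np

def newton_and_rqi_step(Q, x):
    """One step of Algorithm 4.7 and one RQI step from the same unit vector x.

    Assumes x is not already an eigenvector (so theta > 0).
    """
    rho = x @ Q @ x
    y = np.linalg.solve(Q - rho * np.eye(len(x)), x)    # shared vector y_i
    alpha = 1.0 / (x @ y)
    H = -x + alpha * y                                  # tangent vector at x
    theta = np.linalg.norm(H)
    x_newton = x * np.cos(theta) + H * np.sin(theta) / theta
    # RQI normalizes y; the sign factor picks the representative nearest to x.
    x_rqi = np.sign(alpha) * y / np.linalg.norm(y)
    return x_newton, x_rqi, theta

Q = np.diag(np.arange(21.0, 0.0, -1.0))                 # diag(21, ..., 1)
x = np.full(21, 0.02)
x[0] += 1.0                                             # hypothetical point near an eigenvector
x /= np.linalg.norm(x)
x_newton, x_rqi, theta = newton_and_rqi_step(Q, x)
gap = np.linalg.norm(x_newton - x_rqi)                  # distance between the two iterates
```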

Example 4.10 (The function $\operatorname{tr}\Theta^TQ\Theta N$). Let $\Theta$, $Q$, $H = \mathrm{Ad}_{\Theta^T}(Q)$, and $\Omega$ be as in Example 3.6. The second covariant differential of $f(\Theta) = \operatorname{tr}\Theta^TQ\Theta N$ may be computed either by polarization of the second order term of $\operatorname{tr}\mathrm{Ad}_{e^{-t\Omega}}(H)N$, or by covariant differentiation of the differential $df_\Theta = -\operatorname{tr}[H, N]\Theta^T(\cdot)$:

\[ (\nabla^2f)_\Theta(\Theta X, \Theta Y) = -\tfrac12\operatorname{tr}\bigl([H, \mathrm{ad}_XN] - [\mathrm{ad}_XH, N]\bigr)Y, \]

where $X, Y \in \mathfrak{so}(n)$. To compute the direction $\Theta X \in T_\Theta$, $X \in \mathfrak{so}(n)$, for Newton’s method, we must solve the equation $(\nabla^2f)_\Theta(\Theta\,\cdot, \Theta X) = df_\Theta$, which yields the linear equation

\[ L_\Theta(X) \overset{\mathrm{def}}{=} [H, \mathrm{ad}_XN] - [\mathrm{ad}_XH, N] = 2[H, N]. \]

The linear operator $L_\Theta\colon \mathfrak{so}(n) \to \mathfrak{so}(n)$ is self-adjoint for all $\Theta$ and, in a neighborhood of the maximum, negative definite. Therefore, standard iterative techniques in the vector space $\mathfrak{so}(n)$, such as the classical conjugate gradient method, may be used to solve this equation near the maximum. The results of a numerical experiment demonstrating the convergence of Newton’s method in $SO(20)$ are shown in Figure 2. As can be seen, Newton’s method converged within round-off error in two iterations.
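The two structural claims about $L_\Theta$, that it maps skew-symmetric matrices to skew-symmetric matrices and that it is self-adjoint, can be checked numerically. The sketch below assumes NumPy, takes $\mathrm{ad}_XY = [X, Y] = XY - YX$, and measures self-adjointness with the trace pairing $\operatorname{tr}(L_\Theta(X)Y)$; the helper names and the random test matrices are illustrative, not from the paper.

```python
import numpy as np

def bracket(A, B):
    """Matrix commutator [A, B] = AB - BA."""
    return A @ B - B @ A

def L_op(H, N, X):
    """L(X) = [H, ad_X N] - [ad_X H, N], assuming ad_X Y = [X, Y]."""
    return bracket(H, bracket(X, N)) - bracket(bracket(X, H), N)

rng = np.random.default_rng(1)
n = 6

def random_skew():
    A = rng.standard_normal((n, n))
    return (A - A.T) / 2                     # element of so(n)

S = rng.standard_normal((n, n))
H = (S + S.T) / 2                            # hypothetical symmetric H = Theta^T Q Theta
N = np.diag(np.arange(float(n), 0.0, -1.0))  # diag(n, ..., 1)
X, Y = random_skew(), random_skew()

LX, LY = L_op(H, N, X), L_op(H, N, Y)
skew_defect = np.linalg.norm(LX + LX.T)                    # zero iff L(X) is skew-symmetric
adjoint_defect = abs(np.trace(LX @ Y) - np.trace(X @ LY))  # zero iff L is self-adjoint on (X, Y)
```

Since the operator is linear and self-adjoint, a matrix-free solver that only needs the action $X \mapsto L_\Theta(X)$ is a natural fit for the Newton equation, which is what the text suggests.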


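Remark 4.11. If Newton’s method applied to the function $f(\Theta) = \operatorname{tr}\Theta^TQ\Theta N$ converges to the point $\Theta$ such that $\mathrm{Ad}_{\Theta^T}(Q) = H_\infty = \alpha N$, $\alpha \in \mathbb{R}$, then it converges cubically.

Proof. By covariant differentiation of $\nabla^2f$, the third covariant differential of $f$ at $\Theta$ evaluated at the tangent vectors $\Theta X$, $\Theta Y$, $\Theta Z \in T_\Theta$, $X$, $Y$, $Z \in \mathfrak{so}(n)$, is

\[ \begin{aligned}
(\nabla^3f)_\Theta(\Theta X, \Theta Y, \Theta Z) = -\tfrac14\operatorname{tr}\bigl(&[\mathrm{ad}_Y\mathrm{ad}_ZH, N] - [\mathrm{ad}_Z\mathrm{ad}_YN, H] \\
&+ [H, \mathrm{ad}_{\mathrm{ad}_YZ}N] - [\mathrm{ad}_YH, \mathrm{ad}_ZN] + [\mathrm{ad}_YN, \mathrm{ad}_ZH]\bigr)X.
\end{aligned} \]

If $H = \alpha N$, $\alpha \in \mathbb{R}$, then $(\nabla^3f)_\Theta(\cdot, \Theta X, \Theta X) = 0$. Therefore, the second order terms on the right hand side of Eq. (18) vanish at the critical point. The remark follows from the smoothness of $f$.

This remark illuminates how rapid convergence of Newton’s method applied to the function $f$ can be achieved in some instances. If $E_{ij} \in \mathfrak{so}(n)$ is a matrix with entry $+1$ at element $(i, j)$, $-1$ at element $(j, i)$, and zero elsewhere, $X = \sum_{i<j} x^{ij}E_{ij}$, $H = \mathrm{diag}(h_1, \ldots, h_n)$, and $N = \mathrm{diag}(\nu_1, \ldots, \nu_n)$, then

\[ (\nabla^3f)_\Theta(\Theta E_{ij}, \Theta X, \Theta X) = -2\sum_{k \ne i,j} x^{ik}x^{jk}\bigl((h_i\nu_j - h_j\nu_i) + (h_j\nu_k - h_k\nu_j) + (h_k\nu_i - h_i\nu_k)\bigr). \]

If the $h_i$ are close to $\alpha\nu_i$, $\alpha \in \mathbb{R}$, for all $i$, then $(\nabla^3f)_\Theta(\cdot, \Theta X, \Theta X)$ may be small, yielding a fast rate of quadratic convergence.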

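Example 4.12 (Jacobi’s method). Let $\pi$ be the projection of a square matrix onto its diagonal, and let $Q$ be as above. Consider the maximization of the function $f(\Theta) = \operatorname{tr}H\pi(H)$, $H = \mathrm{Ad}_{\Theta^T}(Q)$, on the special orthogonal group. This is equivalent to minimizing the sum of the squares of the off-diagonal elements of $H$ (Golub and Van Loan [22] derive the classical Jacobi method). The gradient of this function at $\Theta$ is $2\Theta[H, \pi(H)]$ [14]. By repeated covariant differentiation of $f$, we find

\[ \begin{aligned}
(\nabla f)_I(X) &= -2\operatorname{tr}[H, \pi(H)]X \\
(\nabla^2f)_I(X, Y) &= -\operatorname{tr}\bigl([H, \mathrm{ad}_X\pi(H)] - [\mathrm{ad}_XH, \pi(H)] - 2[H, \pi(\mathrm{ad}_XH)]\bigr)Y \\
(\nabla^3f)_I(X, Y, Z) &= -\tfrac12\operatorname{tr}\bigl([\mathrm{ad}_Y\mathrm{ad}_ZH, \pi(H)] - [\mathrm{ad}_Z\mathrm{ad}_Y\pi(H), H] \\
&\qquad + [H, \mathrm{ad}_{\mathrm{ad}_YZ}\pi(H)] - [\mathrm{ad}_YH, \mathrm{ad}_Z\pi(H)] + [\mathrm{ad}_Y\pi(H), \mathrm{ad}_ZH] \\
&\qquad + 2[H, \pi(\mathrm{ad}_Y\mathrm{ad}_ZH)] + 2[H, \pi(\mathrm{ad}_Z\mathrm{ad}_YH)] \\
&\qquad + 2[\mathrm{ad}_YH, \pi(\mathrm{ad}_ZH)] - 2[H, \mathrm{ad}_Y\pi(\mathrm{ad}_ZH)] \\
&\qquad + 2[\mathrm{ad}_ZH, \pi(\mathrm{ad}_YH)] - 2[H, \mathrm{ad}_Z\pi(\mathrm{ad}_YH)]\bigr)X
\end{aligned} \]

where $I$ is the identity matrix and $X$, $Y$, $Z \in \mathfrak{so}(n)$. It is easily shown that if $[H, \pi(H)] = 0$, i.e., if $H$ is diagonal, then $(\nabla^3f)_\Theta(\cdot, \Theta X, \Theta X) = 0$ (n.b. $\pi(\mathrm{ad}_XH) = 0$). Therefore, by the same argument as the proof of Remark 4.11, Newton’s method applied to the function $\operatorname{tr}H\pi(H)$ converges cubically.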
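The equivalence between maximizing $\operatorname{tr}H\pi(H)$ and minimizing the off-diagonal sum of squares, stated at the start of Example 4.12, follows from a short identity. It is spelled out here as a worked step, writing $h_{ij}$ for the entries of the symmetric matrix $H$ (notation introduced only for this illustration):

```latex
\[
  \operatorname{tr} H^2
  = \sum_{i} h_{ii}^2 + \sum_{i \ne j} h_{ij}^2
  = \operatorname{tr} H\pi(H) + \sum_{i \ne j} h_{ij}^2,
  \qquad
  \operatorname{tr} H^2 = \operatorname{tr}\,(\Theta^T Q \Theta)^2 = \operatorname{tr} Q^2 .
\]
```

Because $\operatorname{tr}Q^2$ does not depend on $\Theta$, any increase in $\operatorname{tr}H\pi(H)$ is matched by an equal decrease in $\sum_{i \ne j} h_{ij}^2$.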


5 Conjugate gradient on Riemannian manifolds

The method of steepest descent provides an optimization technique which is relatively inexpensive per iteration, but converges relatively slowly. Each step requires the computation of a geodesic and a gradient direction. Newton’s method provides a technique which is more costly both in terms of computational complexity and memory requirements, but converges relatively rapidly. Each step requires the computation of a geodesic, a gradient, a second covariant differential, and its inverse. In this section we describe the conjugate gradient method, which has the dual advantages of algorithmic simplicity and superlinear convergence.

Hestenes and Stiefel [26] first used conjugate gradient methods to compute the solutions of linear equations, or, equivalently, to compute the minimum of a quadratic form on $\mathbb{R}^n$. This approach can be modified to yield effective algorithms to compute the minima of nonquadratic functions on $\mathbb{R}^n$. In particular, Fletcher and Reeves [19] and Polak and Ribière [36] provide algorithms based upon the assumption that the second order Taylor expansion of the function to be minimized sufficiently approximates this function near the minimum. In addition, Davidon, Fletcher, and Powell developed the variable metric methods [18, 36], but these will not be discussed here. One noteworthy feature of conjugate gradient algorithms on $\mathbb{R}^n$ is that when the function in question is quadratic, they compute its minimum in no more than $n$ steps.

The conjugate gradient method on Euclidean space is uncomplicated. Given a function $f\colon \mathbb{R}^n \to \mathbb{R}$ with continuous second derivatives and a local minimum at $x$, and an initial point $x_0 \in \mathbb{R}^n$, the algorithm is initialized by computing the (negative) gradient direction $G_0 = H_0 = -(\operatorname{grad}f)_{x_0}$. The recursive part of the algorithm involves (i) a line minimization of $f$ along the affine space $x_i + tH_i$, $t \in \mathbb{R}$, where the minimum occurs at, say, $t = \lambda_i$, (ii) computation of the step $x_{i+1} = x_i + \lambda_iH_i$, (iii) computation of the (negative) gradient $G_{i+1} = -(\operatorname{grad}f)_{x_{i+1}}$, and (iv) computation of the next direction for line minimization,

\[ H_{i+1} = G_{i+1} + \gamma_iH_i, \tag{24} \]

where $\gamma_i$ is chosen such that $H_i$ and $H_{i+1}$ are conjugate with respect to the Hessian matrix of $f$ at $x$. When $f$ is a quadratic form represented by the symmetric positive definite matrix $Q$, the conjugacy condition becomes $H_i^TQH_{i+1} = 0$; therefore, $\gamma_i = -H_i^TQG_{i+1}/H_i^TQH_i$. It can be shown in this case that the sequence of vectors $G_i$ are all mutually orthogonal and the sequence of vectors $H_i$ are all mutually conjugate with respect to $Q$. Using these facts, the computation of $\gamma_i$ may be simplified with the observation that $\gamma_i = \|G_{i+1}\|^2/\|G_i\|^2$ (Fletcher-Reeves) or $\gamma_i = (G_{i+1} - G_i)^TG_{i+1}/\|G_i\|^2$ (Polak-Ribière). When $f$ is not quadratic, it is assumed that its second order Taylor expansion sufficiently approximates $f$ in a neighborhood of the minimum, and the $\gamma_i$ are chosen so that $H_i$ and $H_{i+1}$ are conjugate with respect to the matrix $(\partial^2f/\partial x^i\partial x^j)(x_{i+1})$ of second partial derivatives of $f$ at $x_{i+1}$. It may be desirable to “reset” the algorithm by setting $H_{i+1} = G_{i+1}$ every $r$th step (frequently, $r = n$) because the conjugate gradient method does not, in general, converge in $n$ steps if the function $f$ is nonquadratic. However, if $f$ is closely approximated by a quadratic function, the reset strategy may be expected to converge rapidly, whereas the unmodified algorithm may not.
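For the quadratic case, the recursion just described fits in a few lines. The sketch below assumes NumPy and the quadratic $f(x) = \tfrac12x^TQx - b^Tx$ with symmetric positive definite $Q$; the linear term $b$, the function name, and the random test problem are choices made here for illustration. It uses the paper’s sign convention, in which $G_i$ is the negative gradient, and runs exactly $n$ steps.

```python
import numpy as np

def conjugate_gradient(Q, b, x0, rule="FR"):
    """Minimize f(x) = 0.5 x^T Q x - b^T x (Q symmetric positive definite)."""
    x = np.array(x0, dtype=float)
    G = b - Q @ x                       # negative gradient G_0
    H = G.copy()                        # first search direction H_0 = G_0
    for _ in range(len(b)):             # at most n steps for a quadratic
        if np.linalg.norm(G) < 1e-14:
            break
        lam = (G @ H) / (H @ Q @ H)     # exact line minimization along x + t H
        x = x + lam * H
        G_new = b - Q @ x
        if rule == "FR":                # Fletcher-Reeves
            gamma = (G_new @ G_new) / (G @ G)
        else:                           # Polak-Ribiere
            gamma = ((G_new - G) @ G_new) / (G @ G)
        H = G_new + gamma * H           # Eq. (24)
        G = G_new
    return x

# Hypothetical test problem (not from the paper).
rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n))
Q = A.T @ A + n * np.eye(n)             # symmetric positive definite
b = rng.standard_normal(n)
x_fr = conjugate_gradient(Q, b, np.zeros(n), rule="FR")
x_pr = conjugate_gradient(Q, b, np.zeros(n), rule="PR")
```

On a quadratic the two choices of $\gamma_i$ agree in exact arithmetic, so both calls should return the solution of $Qx = b$ up to rounding.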

Many of these ideas have straightforward generalizations in the geometry of Riemannian manifolds; several of them have already appeared. We need only make the following definition.

Definition 5.1. Given a tensor field $\omega$ of type $(0, 2)$ on $M$ such that for $p$ in $M$, $\omega_p\colon T_p \times T_p \to \mathbb{R}$ is a symmetric bilinear form, the tangent vectors $X$ and $Y$ in $T_p$ are said to be $\omega_p$-conjugate or conjugate with respect to $\omega_p$ if $\omega_p(X, Y) = 0$.

An outline of the conjugate gradient method on Riemannian manifolds may now be given. Let $M$ be an $n$-dimensional Riemannian manifold with Riemannian structure $g$ and Levi-Civita connection $\nabla$, and let $f \in C^\infty(M)$ have a local minimum at $p$. As in the conjugate gradient method on Euclidean space, choose an initial point $p_0$ in $M$ and compute the (negative) gradient directions $G_0 = H_0 = -(\operatorname{grad}f)_{p_0}$ in $T_{p_0}$. The recursive part of the algorithm involves minimizing $f$ along the geodesic $t \mapsto \exp_{p_i}tH_i$, $t \in \mathbb{R}$, making a step along the geodesic to the minimum point $p_{i+1} = \exp_{p_i}\lambda_iH_i$, computing $G_{i+1} = -(\operatorname{grad}f)_{p_{i+1}}$, and computing the next direction in $T_{p_{i+1}}$ for geodesic minimization. This direction is given by the formula

\[ H_{i+1} = G_{i+1} + \gamma_i\tau H_i, \tag{25} \]

where $\tau$ is the parallel translation with respect to the geodesic step from $p_i$ to $p_{i+1}$, and $\gamma_i$ is chosen such that $\tau H_i$ and $H_{i+1}$ are $(\nabla^2f)_{p_{i+1}}$-conjugate, i.e.,

\[ \gamma_i = -\frac{(\nabla^2f)_{p_{i+1}}(\tau H_i, G_{i+1})}{(\nabla^2f)_{p_{i+1}}(\tau H_i, \tau H_i)}. \tag{26} \]

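Eq. (26) is, in general, expensive to use because the second covariant differential of $f$ appears. However, we can use the Taylor expansion of $df$ about $p_{i+1}$ to compute an efficient approximation of $\gamma_i$. By the fact that $p_i = \exp_{p_{i+1}}(-\lambda_i\tau H_i)$ and by Eq. (14), we have

\[ \tau\,df_{p_i} = \tau\,df_{\exp_{p_{i+1}}(-\lambda_i\tau H_i)} = df_{p_{i+1}} - \lambda_i(\nabla^2f)_{p_{i+1}}(\cdot, \tau H_i) + \text{h.o.t.} \]

Therefore, the numerator of the right hand side of Eq. (26) multiplied by the step size $\lambda_i$ can be approximated by the equation

\[ \begin{aligned}
\lambda_i(\nabla^2f)_{p_{i+1}}(\tau H_i, G_{i+1}) &= df_{p_{i+1}}(G_{i+1}) - (\tau\,df_{p_i})(G_{i+1}) \\
&= -\langle G_{i+1} - \tau G_i, G_{i+1}\rangle
\end{aligned} \]

because, by definition, $G_i = -(\operatorname{grad}f)_{p_i}$, $i = 0, 1, \ldots$, and for any $X$ in $T_{p_{i+1}}$, $(\tau\,df_{p_i})(X) = df_{p_i}(\tau^{-1}X) = \langle(\operatorname{grad}f)_{p_i}, \tau^{-1}X\rangle = \langle\tau(\operatorname{grad}f)_{p_i}, X\rangle$. Similarly, the denominator of the right hand side of Eq. (26) multiplied by $\lambda_i$ can be approximated by the equation

\[ \begin{aligned}
\lambda_i(\nabla^2f)_{p_{i+1}}(\tau H_i, \tau H_i) &= df_{p_{i+1}}(\tau H_i) - (\tau\,df_{p_i})(\tau H_i) \\
&= \langle G_i, H_i\rangle
\end{aligned} \]

because $\langle G_{i+1}, \tau H_i\rangle = 0$ by the assumption that $f$ is minimized along the geodesic $t \mapsto \exp tH_i$ at $t = \lambda_i$. Combining these two approximations with Eq. (26), we obtain a formula for $\gamma_i$ that is relatively inexpensive to compute:

\[ \gamma_i = \frac{\langle G_{i+1} - \tau G_i, G_{i+1}\rangle}{\langle G_i, H_i\rangle}. \tag{27} \]

Of course, as the connection $\nabla$ is compatible with the metric $g$, the denominator of Eq. (27) may be replaced, if desired, by $\langle\tau G_i, \tau H_i\rangle$.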

The conjugate gradient method may now be presented in full.

Algorithm 5.2 (Conjugate gradient method). Let $M$ be a complete Riemannian manifold with Riemannian structure $g$ and Levi-Civita connection $\nabla$, and let $f$ be a $C^\infty$ function on $M$.

Step 0. Select $p_0 \in M$, compute $G_0 = H_0 = -(\operatorname{grad}f)_{p_0}$, and set $i = 0$.

Step 1. Compute $\lambda_i$ such that

\[ f(\exp_{p_i}\lambda_iH_i) \le f(\exp_{p_i}\lambda H_i) \]

for all $\lambda \ge 0$.

Step 2. Set $p_{i+1} = \exp_{p_i}\lambda_iH_i$.

Step 3. Set

\[ G_{i+1} = -(\operatorname{grad}f)_{p_{i+1}}, \qquad H_{i+1} = G_{i+1} + \gamma_i\tau H_i, \qquad \gamma_i = \frac{\langle G_{i+1} - \tau G_i, G_{i+1}\rangle}{\langle G_i, H_i\rangle}, \]

where $\tau$ is the parallel translation with respect to the geodesic from $p_i$ to $p_{i+1}$. If $i \equiv n - 1 \pmod n$, set $H_{i+1} = G_{i+1}$. Increment $i$, and go to Step 1.

Theorem 5.3. Let $f \in C^\infty(M)$ have a nondegenerate critical point at $p$ such that the Hessian $(d^2f)_p$ is positive definite. Let $p_i$ be a sequence of points in $M$ generated by Algorithm 5.2 converging to $p$. Then there exists a constant $\theta > 0$ and an integer $N$ such that for all $i \ge N$,

\[ d(p_{i+n}, p) \le \theta\,d^2(p_i, p). \]

Note that linear convergence is already guaranteed by Theorem 3.3.

Proof. If $p_j = p$ for some integer $j$, the assertion becomes trivial; assume otherwise. Recall that if $X_1, \ldots, X_n$ is some basis for $T_p$, then the map $\exp_p(a^1X_1 + \cdots + a^nX_n) \overset{\nu}{\mapsto} (a^1, \ldots, a^n)$ defines a set of normal coordinates at $p$. Let $N_p$ be a normal neighborhood of $p$ on which the normal coordinates $\nu = (x^1, \ldots, x^n)$ are defined. Consider the map $\nu_*f \overset{\mathrm{def}}{=} f \circ \nu^{-1}\colon \mathbb{R}^n \to \mathbb{R}$. By the smoothness of $f$ and $\exp$, $\nu_*f$ has a critical point at $0 \in \mathbb{R}^n$ such that the Hessian matrix of $\nu_*f$ at $0$ is positive definite. Indeed, by the fact that $(d\exp)_0 = \mathrm{id}$, the $ij$th component of the Hessian matrix of $\nu_*f$ at $0$ is given by $(d^2f)_p(X_i, X_j)$.

Therefore, there exists a neighborhood $U$ of $0 \in \mathbb{R}^n$, a constant $\theta' > 0$, and an integer $N$, such that for any initial point $x_0 \in U$, the conjugate gradient method on Euclidean space (with resets) applied to the function $\nu_*f$ yields a sequence of points $x_i$ converging to $0$ such that for all $i \ge N$,

\[ \|x_{i+n}\| \le \theta'\|x_i\|^2. \]

See Polak [36, p. 260ff] for a proof of this fact. Let $x_0 = \nu(p_0)$ in $U$ be an initial point. Because $\exp$ is not an isometry, Algorithm 5.2 yields a different sequence of points in $\mathbb{R}^n$ than the classical conjugate gradient method on $\mathbb{R}^n$ (upon equating points in a neighborhood of $p \in M$ with points in a neighborhood of $0 \in \mathbb{R}^n$ via the normal coordinates).

Nevertheless, the amount by which $\exp$ fails to preserve inner products can be quantified via the Gauss Lemma and Jacobi’s equation; see, e.g., Cheeger and Ebin [12], or the appendices of Karcher [28]. Let $t$ be small, and let $X \in T_p$ and $Y \in T_{tX}(T_p) \cong T_p$ be orthonormal tangent vectors. The amount by which the exponential map changes the length of tangent vectors is approximated by the Taylor expansion

\[ \|d\exp(tY)\|^2 = t^2 - \tfrac13Kt^4 + \text{h.o.t.} \]

where $K$ is the sectional curvature of $M$ along the section in $T_p$ spanned by $X$ and $Y$. Therefore, near $p$ Algorithm 5.2 differs from the conjugate gradient method on $\mathbb{R}^n$ applied to the function $\nu_*f$ only by third order and higher terms. Thus both algorithms have the same rate of convergence. The theorem follows.

Example 5.4 (Rayleigh’s quotient on the sphere). Applied to Rayleigh’s quotient on the sphere, the conjugate gradient method provides an efficient technique to compute the eigenvectors corresponding to the largest or smallest eigenvalue of a real symmetric matrix. Let $S^{n-1}$ and $\rho(x) = x^TQx$ be as in Examples 3.5 and 4.6. From Algorithm 5.2, we have the following algorithm.

Algorithm 5.5 (CG method for the extreme eigenvalue/eigenvector). Let $Q$ be a real symmetric $n$-by-$n$ matrix.

Step 0. Select $x_0$ in $\mathbb{R}^n$ such that $x_0^Tx_0 = 1$, compute $G_0 = H_0 = (Q - \rho(x_0)I)x_0$, and set $i = 0$.

Step 1. Compute $c$, $s$, and $v = 1 - c = s^2/(1 + c)$, such that $\rho(x_ic + h_is)$ is maximized, where $c^2 + s^2 = 1$ and $h_i = H_i/\|H_i\|$. This can be accomplished by geodesic minimization, or by the formulae

\[ c = \bigl(\tfrac12(1 + b/r)\bigr)^{1/2}, \quad s = a/(2rc) \quad \text{if } b \ge 0, \qquad \text{or} \qquad s = \bigl(\tfrac12(1 - b/r)\bigr)^{1/2}, \quad c = a/(2rs) \quad \text{if } b \le 0, \]

where $a = 2x_i^TQh_i$, $b = x_i^TQx_i - h_i^TQh_i$, and $r = \sqrt{a^2 + b^2}$.

Step 2. Set

\[ x_{i+1} = x_ic + h_is, \qquad \tau H_i = H_ic - x_i\|H_i\|s, \qquad \tau G_i = G_i - (h_i^TG_i)(x_is + h_iv). \]

Step 3. Set

\[ G_{i+1} = \bigl(Q - \rho(x_{i+1})I\bigr)x_{i+1}, \qquad H_{i+1} = G_{i+1} + \gamma_i\tau H_i, \qquad \gamma_i = \frac{(G_{i+1} - \tau G_i)^TG_{i+1}}{G_i^TH_i}. \]

If $i \equiv n - 1 \pmod n$, set $H_{i+1} = G_{i+1}$. Increment $i$, and go to Step 1.

[Figure 1 (plot): error $\|x_i - \xi_1\|$ on a logarithmic vertical axis ($10^{-7}$ to $10^0$) versus step $i$ (0 to 60), with one curve each for the method of steepest descent, the conjugate gradient method, and Newton’s method.]

Figure 1: Maximization of Rayleigh’s quotient $x^TQx$ on $S^{20} \subset \mathbb{R}^{21}$, where $Q = \mathrm{diag}(21, \ldots, 1)$. The $i$th iterate is $x_i$, and $\xi_1$ is the eigenvector corresponding to the largest eigenvalue of $Q$. Algorithm 4.7 was used for Newton’s method and Algorithm 5.5 was used for the conjugate gradient method.

The convergence rate of this algorithm to the eigenvector corresponding to the largest eigenvalue of $Q$ is given by Theorem 5.3. This algorithm costs one matrix-vector multiplication (relatively inexpensive when $Q$ is sparse), one geodesic minimization or computation of $\rho(h_i)$, and $10n$ flops per iteration. The results of a numerical experiment demonstrating the convergence of Algorithm 5.5 on $S^{20}$ are shown in Figure 1.
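Algorithm 5.5 translates almost line by line into code. The sketch below assumes NumPy; the function name `cg_rayleigh`, the stopping tolerance, the iteration cap, the renormalization of $x_{i+1}$, and the starting vector are additions for illustration and are not part of the algorithm as stated. For clarity it recomputes several matrix-vector products per iteration instead of the single product that the cost estimate above refers to.

```python
import numpy as np

def cg_rayleigh(Q, x0, max_iter=200, tol=1e-10):
    """Sketch of Algorithm 5.5: maximize x^T Q x on the unit sphere."""
    n = len(x0)
    x = x0 / np.linalg.norm(x0)
    G = Q @ x - (x @ Q @ x) * x                  # Step 0: G_0 = (Q - rho I) x_0
    H = G.copy()
    for i in range(max_iter):
        norm_H = np.linalg.norm(H)
        if np.linalg.norm(G) < tol or norm_H == 0.0:
            break                                # stopping test (an addition)
        h = H / norm_H
        Qh = Q @ h
        a = 2.0 * (x @ Qh)                       # Step 1: closed-form maximization
        b = x @ Q @ x - h @ Qh
        r = np.hypot(a, b)
        if b >= 0:
            c = np.sqrt(0.5 * (1.0 + b / r))
            s = a / (2.0 * r * c)
        else:
            s = np.sqrt(0.5 * (1.0 - b / r))
            c = a / (2.0 * r * s)
        v = 1.0 - c
        x_new = x * c + h * s                    # Step 2: geodesic step
        tau_H = H * c - x * norm_H * s           # parallel translation of H_i
        tau_G = G - (h @ G) * (x * s + h * v)    # parallel translation of G_i
        x_new /= np.linalg.norm(x_new)           # guards against rounding drift (an addition)
        G_new = Q @ x_new - (x_new @ Q @ x_new) * x_new   # Step 3
        gamma = ((G_new - tau_G) @ G_new) / (G @ H)
        H = G_new if i % n == n - 1 else G_new + gamma * tau_H   # periodic reset
        x, G = x_new, G_new
    return x, x @ Q @ x

Q = np.diag(np.arange(21.0, 0.0, -1.0))          # diag(21, ..., 1), as in Figure 1
x0 = np.ones(21)                                 # hypothetical starting vector
x, rho = cg_rayleigh(Q, x0)
```

Each step maximizes $\rho$ exactly over the great circle through $x_i$ in the direction $h_i$, so the Rayleigh quotient never decreases from one iterate to the next, whatever the value of $\gamma_i$.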

Fuhrmann and Liu [20] provide a conjugate gradient algorithm for Rayleigh’s quotient on the sphere that uses an azimuthal projection onto tangent planes.

Example 5.6 (The function $\operatorname{tr}\Theta^TQ\Theta N$). Let $\Theta$, $Q$, and $H$ be as in Examples 3.6 and 4.10. As before, the natural Riemannian structure of $SO(n)$ is used. Let $X$, $Y \in \mathfrak{so}(n)$. The parallel translation of $Y$ along the geodesic $e^{tX}$ is given by the formula $\tau Y = L_{e^{tX}*}e^{-(t/2)X}Ye^{(t/2)X}$, where $L_g$ denotes left translation by $g$. Brockett’s estimate (n.b. Eq. (13)) for the step size may be used in Step 1 of Algorithm 5.2. The results of a numerical experiment demonstrating the convergence of the conjugate gradient method in $SO(20)$ are shown in Figure 2.

[Figure 2 (plot): $\|H_i - D_i\|$ on a logarithmic vertical axis ($10^{-5}$ to $10^1$) versus step $i$ (0 to 140), with one curve each for the method of steepest descent, the conjugate gradient method, and Newton’s method.]

Figure 2: Maximization of $\operatorname{tr}\Theta^TQ\Theta N$ on $SO(20)$ (dimension $SO(20) = 190$), where $N = \mathrm{diag}(20, \ldots, 1)$. The $i$th iterate is $H_i = \Theta_i^TQ\Theta_i$, $D_i$ is the diagonal matrix of eigenvalues of $H_i$, $H_0$ is near $N$, and $\|\cdot\|$ is the norm induced by the standard inner product on $\mathfrak{gl}(n)$. Geodesics and parallel translation were computed using the algorithm of Ward and Gray [47, 48]; the step sizes for the method of steepest descent and the conjugate gradient method were computed using Brockett’s estimate [10].

Acknowledgments. The author enthusiastically thanks Tony Bloch and the Fields Institute for the invitation to speak at the Fields Institute and for their generous support during his visit. The author also thanks Roger Brockett for his suggestion to investigate conjugate gradient methods on manifolds and for his criticism of this work, and the referee for his helpful suggestions. This work was supported in part by the National Science Foundation under the Engineering Research Center Program NSF D CRD-8803012, the Army Research Office under Grant DAA103-92-G-0164 supporting the Brown, Harvard, and MIT Center for Intelligent Control, and by DARPA under Air Force contract F49620-92-J-0466.

References

[1] Bertsekas, D. P. Projected Newton methods for optimization problems with simple constraints, SIAM J. Cont. Opt. 20 : 221–246, 1982.

[2] Bertsekas, D. P. Constrained Optimization and Lagrange Multiplier Methods. New York: Academic Press, 1982.

[3] Bloch, A. M., Brockett, R. W., and Ratiu, T. S. A new formulation of the generalized Toda lattice equations and their fixed point analysis via the momentum map, Bull. Amer. Math. Soc. 23 (2) : 477–485, 1990.

[4] Bloch, A. M., Brockett, R. W., and Ratiu, T. S. Completely integrable gradient flows, Commun. Math. Phys. 147 : 57–74, 1992.

[5] Botsaris, C. A. Differential gradient methods, J. Math. Anal. Appl. 63 : 177–198, 1978.

[6] Botsaris, C. A. A class of differential descent methods for constrained optimization, J. Math. Anal. Appl. 79 : 96–112, 1981.

[7] Botsaris, C. A. Constrained optimization along geodesics, J. Math. Anal. Appl. 79 : 295–306, 1981.

[8] Brockett, R. W. Least squares matching problems, Lin. Alg. Appl. 122/123/124 : 761–777, 1989.

[9] Brockett, R. W. Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems, Lin. Alg. Appl. 146 : 79–91, 1991.

[10] Brockett, R. W. Differential geometry and the design of gradient algorithms, Proc. Symp. Pure Math. R. Green and S. T. Yau, eds. Providence, RI: Amer. Math. Soc., to appear.

[11] Brown, A. A. and Bartholomew-Biggs, M. C. Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations, J. Optim. Theory Appl. 62 (2) : 211–224, 1989.

[12] Cheeger, J. and Ebin, D. G. Comparison Theorems in Riemannian Geometry. Amsterdam: North-Holland Publishing Company, 1975.

[13] Chu, M. T. Curves on $S^{n-1}$ that lead to eigenvalues or their means of a matrix, SIAM J. Alg. Disc. Meth. 7 (3) : 425–432, 1986.

[14] Chu, M. T. and Driessel, K. The projected gradient method for least squares matrix approximations with spectral constraints, SIAM J. Numer. Anal. 27 (4) : 1050–1060, 1990.

[15] Dunn, J. C. Newton’s method and the Goldstein step length rule for constrained minimization problems, SIAM J. Cont. Opt. 18 : 659–674, 1980.

[16] Dunn, J. C. Global and asymptotic convergence rate estimates for a class of projected gradient processes, SIAM J. Cont. Opt. 19 : 368–400, 1981.

[17] Faybusovich, L. Hamiltonian structure of dynamical systems which solve linear programming problems, Phys. D 53 : 217–232, 1991.

[18] Fletcher, R. Practical Methods of Optimization, 2d ed. New York: Wiley & Sons, 1987.

[19] Fletcher, R. and Reeves, C. M. Function minimization by conjugate gradients, Comput. J. 7 (2) : 149–154, 1964.

[20] Fuhrmann, D. R. and Liu, B. An iterative algorithm for locating the minimal eigenvector of a symmetric matrix, Proc. IEEE ICASSP 84 pp. 45.8.1–4, 1984.

[21] Gill, P. E. and Murray, W. Newton-type methods for linearly constrained optimization, in Numerical Methods for Constrained Optimization. P. E. Gill and W. Murray, eds. London: Academic Press, Inc., 1974.

[22] Golub, G. H. and Van Loan, C. Matrix Computations. Baltimore, MD: Johns Hopkins University Press, 1983.

[23] Golubitsky, M. and Guillemin, V. Stable Mappings and Their Singularities. New York: Springer-Verlag, 1973.

[24] Helgason, S. Differential Geometry, Lie Groups, and Symmetric Spaces. New York: Academic Press, 1978.

[25] Helmke, U. Isospectral flows on symmetric matrices and the Riccati equation, Systems & Control Lett. 16 : 159–165, 1991.

[26] Hestenes, M. R. and Stiefel, E. Methods of conjugate gradients for solving linear systems, J. Res. Nat. Bur. Stand. 49 : 409–436, 1952.

[27] Hirsch, M. W. and Smale, S. On algorithms for solving f(x) = 0, Comm. Pure Appl. Math. 32 : 281–312, 1979.

[28] Karcher, H. Riemannian center of mass and mollifier smoothing, Comm. Pure Appl. Math. 30 : 509–541, 1977.

[29] Kobayashi, S. and Nomizu, K. Foundations of Differential Geometry, Vol. 2. New York: Wiley Interscience Publishers, 1969.

[30] Lagarias, J. C. Monotonicity properties of the Toda flow, the QR-flow, and subspace iteration, SIAM J. Numer. Anal. Appl. 12 (3) : 449–462, 1991.

[31] Luenberger, D. G. Introduction to Linear and Nonlinear Programming. Reading, MA: Addison-Wesley, 1973.


[32] Moler, C. and Van Loan, C. Nineteen dubious ways to compute the exponential of a matrix, SIAM Rev. 20 (4) : 801–836, 1978.

[33] Nomizu, K. Invariant affine connections on homogeneous spaces, Amer. J. Math. 76 : 33–65, 1954.

[34] Parlett, B. The Symmetric Eigenvalue Problem. Englewood Cliffs, NJ: Prentice-Hall, 1980.

[35] Perkins, J. E., Helmke, U., and Moore, J. B. Balanced realizations via gradient flow techniques, Systems & Control Lett. 14 : 369–380, 1990.

[36] Polak, E. Computational Methods in Optimization. New York: Academic Press, 1971.

[37] Rudin, W. Principles of Mathematical Analysis, 3d ed. New York: McGraw-Hill, 1976.

[38] Sargent, R. W. H. Reduced gradient and projection methods for nonlinear programming, in Numerical Methods for Constrained Optimization. P. E. Gill and W. Murray, eds. London: Academic Press, Inc., 1974.

[39] Shub, M. Some remarks on dynamical systems and numerical analysis, in Dynamical Systems and Partial Differential Equations: Proc. VII ELAM. L. Lara-Carrero and J. Lewowicz, eds. Caracas: Equinoccio, U. Simón Bolívar, pp. 69–92, 1986.

[40] Shub, M. and Smale, S. Computational complexity: On the geometry of polynomials and a theory of cost, Part I, Ann. scient. Éc. Norm. Sup. 4 (18) : 107–142, 1985.

[41] Shub, M. and Smale, S. Computational complexity: On the geometry of polynomials and a theory of cost, Part II, SIAM J. Comput. 15 (1) : 145–161, 1986.

[42] Shub, M. and Smale, S. On the existence of generally convergent algorithms, J. Complex. 2 : 2–11, 1986.

[43] Smale, S. The fundamental theorem of algebra and computational complexity, Bull. Amer. Math. Soc. 4 (1) : 1–36, 1981.

[44] Smale, S. On the efficiency of algorithms in analysis, Bull. Amer. Math. Soc. 13 (2) : 87–121, 1985.

[45] Smith, S. T. Dynamical systems that perform the singular value decomposition, Systems & Control Lett. 16 : 319–327, 1991.

[46] Spivak, M. A Comprehensive Introduction to Differential Geometry, 2d ed. Vols. 1, 2, 5, Houston, TX: Publish or Perish, Inc., 1979.

[47] Ward, R. C. and Gray, L. J. Eigensystem computation for skew-symmetric matrices and a class of symmetric matrices, ACM Trans. Math. Softw. 4 (3) : 278–285, 1978.

[48] Ward, R. C. and Gray, L. J. Algorithm 530: An algorithm for computing the eigensystem of skew-symmetric matrices and a class of symmetric matrices, ACM Trans. Math. Softw. 4 (3) : 286–289, 1978. See also Collected Algorithms from ACM, Vol. 3. New York: Assoc. Comput. Mach., 1978.
