
Paper III

On the Number of Minima to Weighted Orthogonal Procrustes Problems∗

Thomas Viklands†

Department of Computing Science, Umeå University

SE-901 87 Umeå, Sweden.

[email protected]

Abstract

A weighted orthogonal Procrustes problem (WOPP) $\min \|AQX - B\|_F^2$, subject to $Q^TQ = I_n$, where $Q \in \mathbb{R}^{m \times n}$ with $n \le m$, can have several local minima. Hence some global optimization technique is often needed in order to find the global minimum. This contribution investigates the maximal number of minima of a WOPP, which is useful knowledge when developing a global optimization algorithm. A natural first approach is to study the case when $B = 0$. It turns out that if $A$ and $X$ have strictly decreasing singular values, there exist exactly $2^n$ minima. By a continuity argument it is shown that the number of minima is conserved for small perturbations $B = 0 + \delta B$. The special case $n = 1$ is studied for $B \neq 0$, and it turns out that at most two minimizers can occur. Extensive empirical studies indicate that no more than $2^n$ minima exist for a WOPP.

Keywords: Weighted, orthogonal, Procrustes, global minimum, Stiefel manifold, minima.

∗ From UMINF-06.08, 2006. Submitted to BIT.

† Financial support has partly been provided by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.


Contents

1 Introduction
2 2-norm formulation
3 The tangent space of $V_{m,n}$
4 Lagrangian formulation
5 Why study the case when $B = 0$
   5.1 The ellipsoid cases
   5.2 Motivation of $B = 0$ in general cases
6 The $B = 0$ case
   6.1 First order conditions and the critical points
   6.2 Second order conditions and the minimum solutions
   6.3 Some cases with equal singular values
7 Discussion of the general case $B \neq 0$
   7.1 Some examples
   7.2 A simple algorithm
8 Conclusions
A The canonical form of a WOPP
B The solution to an OPP
C Parametrization of $V_{m,n}$ by using the Cayley transform
   C.1 The tangent space of $V_{m,n}$
D Number of minima to the ellipsoid cases
References


1 Introduction.

A weighted orthogonal Procrustes problem (WOPP) is an optimization problem that arises in applications related to, e.g., multivariate analysis and multidimensional scaling [5, 12, 13], and photogrammetry [1]. Typically it is about computing an optimal rotation when it is desired to match one set of data to another. Formally, a WOPP corresponds to computing a matrix $Q \in \mathbb{R}^{m \times n}$, where $n \le m$, with orthonormal columns that solves the minimization problem

$$\min \tfrac{1}{2}\|AQX - B\|_F^2, \quad \text{subject to } Q^TQ = I_n. \tag{1}$$

Here $A \in \mathbb{R}^{m \times m}$, $X \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{m \times n}$ are known matrices and $\|\cdot\|_F$ denotes the Frobenius norm. We can assume that $A$ and $X$ are square diagonal matrices $A = \mathrm{diag}(\alpha_1, ..., \alpha_m)$ and $X = \mathrm{diag}(\chi_1, ..., \chi_n)$, where $\alpha_i \ge \alpha_{i+1} > 0$ and $\chi_i \ge \chi_{i+1} > 0$ respectively, see Appendix A. We call this the canonical form of a WOPP. From now on we assume that $A$ and $X$ are diagonal matrices of this form.

Equation (1) is an optimization problem defined on the Stiefel manifold [19],

$$V_{m,n} = \{Q \in \mathbb{R}^{m \times n} : Q^TQ = I_n\}.$$

As in [8], we call (1) balanced if $m = n$ and unbalanced if $n < m$. With $A = I_m$, (1) specializes to the orthogonal Procrustes problem (OPP)

$$\min \tfrac{1}{2}\|QX - B\|_F^2, \quad \text{subject to } Q^TQ = I_n. \tag{2}$$

If $B$ has full rank, this problem has a unique minimum that can be derived from the singular value decomposition of $XB^T$, see Appendix B.

Consider a weighting of the residual $QX - B$ for an OPP as $A(QX - B)$; then (2) becomes

$$\min \tfrac{1}{2}\|A(QX - B)\|_F^2, \quad \text{subject to } Q^TQ = I_n. \tag{3}$$

By taking $B := AB$ we get the optimization problem of the form given in (1).

Generally, a solution to (1) cannot be computed as easily as in the OPP case. An iterative method is needed. Earlier work on iterative algorithms and methods for solving problems similar to (1) is reported in [3, 4, 6, 8, 10, 14–18, 22]. Moreover, (1) can have several minima, as also observed by others [3, 4, 8, 10, 14, 15].

This paper investigates the maximal number of minima of a WOPP. To do this, the special case when $B = 0$ is studied in detail by using a Lagrange formulation of (1), and at the end some special and low-dimensional cases when $B \neq 0$ are considered. We start with some introductory definitions, formulations and motivations.


2 2-norm formulation

In later sections, we mainly consider the function $AQX \in \mathbb{R}^{m \times n}$ embedded in $\mathbb{R}^{mn}$. Usually this is done by using the vec-operator, which stacks the columns of a matrix into a column vector. For example, with $Q = [q_1, ..., q_n] \in \mathbb{R}^{m \times n}$,

$$\mathrm{vec}(Q) = \begin{bmatrix} q_1 \\ \vdots \\ q_n \end{bmatrix}, \quad \mathrm{vec}(Q) \in \mathbb{R}^{mn}.$$

An equivalent formulation of (1), now in the 2-norm, is

$$\min \tfrac{1}{2}\|F\,\mathrm{vec}(Q) - \mathrm{vec}(B)\|_2^2, \quad \text{subject to } Q^TQ = I. \tag{4}$$

The diagonal matrix $F \in \mathbb{R}^{mn \times mn}$ is the Kronecker product of $X^T$ and $A$, i.e., $F = X^T \otimes A = \mathrm{diag}(\chi_1 A, ..., \chi_n A)$. More information regarding this problem formulation, with algorithms, can be found in [22] and [21].

The surface of $F\,\mathrm{vec}(Q)$ is addressed later on when considering some special cases of a WOPP.
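The equivalence between the Frobenius-norm form (1) and the 2-norm form (4) can be checked numerically. The following is a minimal sketch, assuming NumPy and arbitrarily chosen test data; the helper `vec` and all numeric values are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 2

# Canonical-form data: positive, decreasing diagonals (values are arbitrary).
A = np.diag([4.0, 3.0, 2.0, 1.0])
X = np.diag([2.0, 1.0])
B = rng.standard_normal((m, n))

# A point on the Stiefel manifold: the Q factor of a random m-by-n matrix.
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order="F")

# F = X^T kron A; diagonal here because A and X are diagonal.
F = np.kron(X.T, A)

frobenius_form = np.linalg.norm(A @ Q @ X - B, "fro") ** 2
two_norm_form = np.linalg.norm(F @ vec(Q) - vec(B)) ** 2
print(frobenius_form, two_norm_form)  # the two values agree up to rounding
```

The check relies on the standard identity $\mathrm{vec}(AQX) = (X^T \otimes A)\,\mathrm{vec}(Q)$ with column-wise stacking.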

Definition 2.1 Let

$$\mathcal{F} = \{y = F\,\mathrm{vec}(Q) \mid Q \in V_{m,n}\}$$

denote the surface of $F\,\mathrm{vec}(Q) \in \mathbb{R}^{mn}$.

3 The tangent space of Vm,n

Given a parametrization of the Stiefel manifold, the tangent space of $V_{m,n}$ at a point $Q$ is the set of all tangent directions. It is used in the following sections when classifying critical points of a WOPP.

Definition 3.1 The tangent space of the Stiefel manifold $V_{m,n}$ at a given point $Q$ can be expressed as

$$\mathcal{T} = \{T = QS + (I - QQ^T)C\} \tag{5}$$

where $S = -S^T \in \mathbb{R}^{n \times n}$ is skew-symmetric and $C \in \mathbb{R}^{m \times n}$ is arbitrary.

To derive the expression for the tangent space, the Cayley transform of a skew-symmetric matrix can be used, see Appendix C.1. For more information regarding the tangent space of $V_{m,n}$, see also [6].
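As a sanity check of Definition 3.1, a direction $T$ of the form (5) should satisfy the linearized constraint $Q^TT + T^TQ = 0$, obtained by differentiating $Q^TQ = I_n$. The sketch below assumes NumPy and random test data; none of the variable names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3

# A point on V_{m,n}.
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))

W = rng.standard_normal((n, n))
S = W - W.T                              # skew-symmetric n-by-n matrix
C = rng.standard_normal((m, n))          # arbitrary m-by-n matrix

# Tangent direction according to (5).
T = Q @ S + (np.eye(m) - Q @ Q.T) @ C

# Differentiating Q^T Q = I along T gives Q^T T + T^T Q = 0.
constraint_derivative = Q.T @ T + T.T @ Q

# A small step along T violates Q^T Q = I only to second order in eps.
eps = 1e-4
Q_step = Q + eps * T
violation = np.linalg.norm(Q_step.T @ Q_step - np.eye(n))
print(np.abs(constraint_derivative).max(), violation)
```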


4 Lagrangian formulation

For later analysis, we use the Lagrangian formulation of (4)

$$L(Q, \Lambda) = \tfrac{1}{2}\|F\,\mathrm{vec}(Q) - \mathrm{vec}(B)\|_2^2 + \tfrac{1}{2}\sum_{i=1}^{n} \lambda_{i,i}(q_i^Tq_i - 1) + \sum_{i<j}^{n} \lambda_{i,j}\, q_i^Tq_j.$$

Here $\Lambda$ denotes the set of all Lagrange parameters $\lambda$ corresponding to the constraints $Q^TQ = I$. $\Lambda$ can be considered as an $n$ by $n$ symmetric matrix with elements $\Lambda_{i,j} = \lambda_{i,j}$.

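The gradient of the Lagrangian with respect to $Q$ is denoted

$$\nabla_Q L = \begin{bmatrix} \nabla_{q_1} L \\ \vdots \\ \nabla_{q_n} L \end{bmatrix},$$

and can be written as $n$ sets of $m$ equations,

$$\nabla_{q_i} L = \chi_i^2 D q_i + \lambda_{i,i} q_i + \sum_{j<i} \lambda_{j,i} q_j + \sum_{i<j}^{n} \lambda_{i,j} q_j - \chi_i A^T b_i, \tag{6}$$

for $i = 1, 2, ..., n$, where $b_i$ denotes the $i$'th column of $B$ and $D$ denotes the diagonal matrix $D = A^TA$. The last term follows from differentiating $\tfrac{1}{2}\|\chi_i A q_i - b_i\|_2^2$ with respect to $q_i$. To simplify indexing, we denote the $i$'th diagonal element of $D$ by $D_i$.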

We denote the second order derivative, the Hessian matrix, as

$$H = \nabla_Q^2 L, \quad H \in \mathbb{R}^{mn \times mn}. \tag{7}$$

The well-known Kuhn-Tucker necessary conditions for a minimizer $Q$ say that there exist unique Lagrange multipliers $\Lambda$ such that

$$\nabla_Q L(Q, \Lambda) = 0, \qquad t^THt \ge 0, \quad t = \mathrm{vec}(T) \ \ \forall\, T \in \mathcal{T}. \tag{8}$$

The somewhat similar sufficient conditions for a minimizer $Q$ are that there exist unique Lagrange multipliers $\Lambda$ such that

$$\nabla_Q L(Q, \Lambda) = 0, \qquad t^THt > 0, \quad t = \mathrm{vec}(T) \ \ \forall\, T \in \mathcal{T}. \tag{9}$$

5 Why study the case when B = 0

In this section, we motivate why studying a WOPP with $B = 0$ is a reasonable first step when trying to derive the maximal number of minima of (4). The reasoning is based on studying lower dimensional cases and then generalizing these "$B = 0$ properties" to higher dimensional cases.


5.1 The ellipsoid cases

Consider the special case when $Q \in \mathbb{R}^{m \times 1}$, studied by Forsythe and Golub [9], Gander [10] and Eldén [7], commonly written as

$$\min \|Aq - b\|_2^2, \quad \text{subject to } q^Tq = 1. \tag{10}$$

In a geometric sense, (10) corresponds to determining the minimum distance between a hyper-ellipsoid in $\mathbb{R}^m$, determined by $A$, and the given point $b$.

Let $m = 2$. By using the parameterization $q = [\cos\phi, \sin\phi]^T$, an equivalent formulation of (10) is then

$$\min \left\| \begin{bmatrix} \alpha_1 \cos\phi \\ \alpha_2 \sin\phi \end{bmatrix} - \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \right\|_2^2.$$

Assume that $\alpha_1 > \alpha_2$; then the optimization problem corresponds to finding the point on the ellipse

$$x = \alpha_1 \cos\phi, \qquad y = \alpha_2 \sin\phi$$

that lies closest to the point $b = [b_1, b_2]^T$. There can be at most two minima of this problem. It turns out that if $b$ is inside the evolute¹ of the ellipse,

$$x_e = \frac{\alpha_1^2 - \alpha_2^2}{\alpha_1} \cos^3\phi, \qquad y_e = \frac{\alpha_2^2 - \alpha_1^2}{\alpha_2} \sin^3\phi,$$

then (10) has two minima, otherwise it has just one.

Figure 1: An ellipse with $\alpha_1 = 2$ and $\alpha_2 = 1$ and its evolute.

¹ The evolute is the locus of centers of curvature of a curve.

The global minimum is always in the same quadrant as $b$, while the local minimum is in the quadrant vertically opposite to $b$ (because $\alpha_1 > \alpha_2$). In particular, if $b = 0$, then the two minima are $q = [0, \pm 1]^T$, regardless of the values of $\alpha_1$ and $\alpha_2$ (as long as $\alpha_1 > \alpha_2$).
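The role of the evolute can be illustrated numerically for the ellipse of Figure 1. The sketch below, assuming NumPy, counts strict local minima of the squared distance on a fine grid in $\phi$ and tests whether $b$ lies inside the evolute via the implicit form $(\alpha_1 x)^{2/3} + (\alpha_2 y)^{2/3} < (\alpha_1^2 - \alpha_2^2)^{2/3}$, which follows from the parametrization of the evolute. The function names, the grid size and the test points are illustrative choices, not part of the paper.

```python
import numpy as np

def count_local_minima(alpha1, alpha2, b, num=20000):
    """Count strict local minima of the squared distance from b to the ellipse.

    The angle phi is sampled on a periodic grid; a grid point counts as a
    local minimum if its value is below both neighbours.
    """
    phi = 2.0 * np.pi * np.arange(num) / num
    f = (alpha1 * np.cos(phi) - b[0]) ** 2 + (alpha2 * np.sin(phi) - b[1]) ** 2
    is_min = (f < np.roll(f, 1)) & (f < np.roll(f, -1))
    return int(is_min.sum()), phi[is_min]

def inside_evolute(alpha1, alpha2, b):
    """Implicit test for b lying strictly inside the evolute of the ellipse."""
    c = alpha1 ** 2 - alpha2 ** 2
    lhs = np.abs(alpha1 * b[0]) ** (2 / 3) + np.abs(alpha2 * b[1]) ** (2 / 3)
    return lhs < c ** (2 / 3)

for b in [(0.2, 0.3), (3.0, 2.0), (0.0, 0.0)]:
    count, angles = count_local_minima(2.0, 1.0, b)
    print(b, inside_evolute(2.0, 1.0, b), count, angles)
```

For $b = 0$ the detected minima sit at $\phi = \pi/2$ and $\phi = 3\pi/2$, i.e., at $q = [0, \pm 1]^T$.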

If $\alpha_1 = \alpha_2$ and $b \neq 0$, then the solution $q$ of (10) is unique, $q = b/\|b\|_2$. However, if $\alpha_1 = \alpha_2$ and $b = 0$, then there is a continuum of solutions (connected minima): any unit vector $q \in \mathbb{R}^2$ is a minimizer (and maximizer).

Connected minima can occur in any dimension, but we focus on the cases yielding distinct minima. This always occurs if the singular values of $A$ are strictly decreasing, $\alpha_i > \alpha_{i+1} > 0$ for all $i = 1, \ldots, m - 1$. In Section 6.3, the case with equal singular values is studied.

Deriving a similar "evolute surface" for general ellipsoid cases, when $q \in \mathbb{R}^{m \times 1}$, $m = 3, 4, \ldots$, is more complicated and not of primary interest here. What is interesting is that the origin, the point $b = 0$, is inside these surfaces. That is, when $b = 0$ the optimization problem (10) always has two minimizers $q = [0, \ldots, 0, \pm 1]^T$. The maximal number of minima of an ellipsoid case is two, as stated by Theorem D.1 in Appendix D.

Related to the ellipsoid cases is the oblique Procrustes problem [5, 13], commonly formulated as

$$\min \tfrac{1}{2}\|AQ - B\|_F^2, \quad \text{subject to } \mathrm{diag}(Q^TQ) = [1, \ldots, 1]. \tag{11}$$

Since there are no orthogonality constraints $q_i^Tq_j = 0$, this problem is separable. By denoting $B = [b_1, \ldots, b_n]$ we can write

$$\|AQ - B\|_F^2 = \|Aq_1 - b_1\|_2^2 + \ldots + \|Aq_n - b_n\|_2^2.$$

The problem (11) can then be written as $n$ optimization problems of the form (10),

$$\begin{aligned} &\min \tfrac{1}{2}\|Aq_1 - b_1\|_2^2, \quad \text{subject to } q_1^Tq_1 = 1, \\ &\qquad\vdots \\ &\min \tfrac{1}{2}\|Aq_n - b_n\|_2^2, \quad \text{subject to } q_n^Tq_n = 1. \end{aligned} \tag{12}$$

Each of the $n$ optimization problems in (12) can have two minimizers. Hence, the maximal number of minima for problem (11) is $2^n$.

5.2 Motivation of B = 0 in general cases

The difficulty in analyzing how many minima a WOPP might have is that one should know a $B$ that results in the maximal number of minimizers. At this point, doing analysis with an arbitrary $B$ seems to border on the task of solving the optimization problem analytically, as one can do with an OPP. However, for the ellipsoid cases when $Q \in \mathbb{R}^{m \times 1}$, having $B = 0$ always results in the maximal number of minima. Does the same hold for cases when $Q \in \mathbb{R}^{m \times n}$ with $n > 1$? Procrustes type problems have elliptic properties, so studying the case when $B = 0$ (or when $B$ is in the vicinity of the origin) should yield valuable information.


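The elliptic properties in this case are that we can consider $\mathcal{F}$ as the surface traced out by the ellipses given by plane rotations around each unit axis in $\mathbb{R}^m$. As an example, for the ellipsoid case when $Q \in \mathbb{R}^{3 \times 1}$, we can regard $\mathcal{F}$ as a surface of ellipses, where the plane spanned by any ellipse is parallel to either the xy-plane, xz-plane or yz-plane. Take a parametrization such as, e.g.,

$$Q(\phi_1, \phi_2) = \begin{bmatrix} \cos\phi_1 & -\sin\phi_1 & 0 \\ \sin\phi_1 & \cos\phi_1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\phi_2 & 0 & -\sin\phi_2 \\ 0 & 1 & 0 \\ \sin\phi_2 & 0 & \cos\phi_2 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}.$$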

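We can extend the concept of the evolute in $\mathbb{R}^2$ to an ellipse in $\mathbb{R}^{mn}$. Any of these ellipses given by plane rotations, for any $Q \in \mathbb{R}^{m \times n}$, can be written as

$$E \begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix} + d,$$

where $E \in \mathbb{R}^{mn \times 2}$ and $d$ is a translation of the ellipse along a direction orthogonal to the plane spanned by the ellipse, i.e., $d \perp \mathrm{Range}(E)$. By using the SVD $U\Sigma V^T = E$, the minimization

$$\min_\phi \left\| E \begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix} + d - b \right\|_2^2$$

is equivalent to

$$\min_\phi \left\| \Sigma \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} + \tilde{d} - \tilde{b} \right\|_2^2 = \min_\phi \left\| \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} - \begin{bmatrix} \tilde{b}_1 \\ \tilde{b}_2 \end{bmatrix} \right\|_2^2 + \left\| \begin{bmatrix} \tilde{d}_3 - \tilde{b}_3 \\ \vdots \\ \tilde{d}_{mn} - \tilde{b}_{mn} \end{bmatrix} \right\|_2^2, \tag{13}$$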

where $[\cos\theta, \sin\theta]^T = V^T[\cos\phi, \sin\phi]^T$, $\tilde{d} = U^Td = [0, 0, \tilde{d}_3, ..., \tilde{d}_{mn}]^T$ and $\tilde{b} = U^Tb$. What determines whether (13) has two minima is $[\tilde{b}_1, \tilde{b}_2]^T$; the remaining $mn - 2$ elements of $\tilde{b}$ do not affect this at all. Hence, by using the evolute for the $\mathbb{R}^2$ case, we can define a similar function

$$h(\theta, h_3, ..., h_{mn}) = \begin{bmatrix} \dfrac{\alpha_1^2 - \alpha_2^2}{\alpha_1} \cos^3\theta \\[2mm] \dfrac{\alpha_2^2 - \alpha_1^2}{\alpha_2} \sin^3\theta \\ h_3 \\ \vdots \\ h_{mn} \end{bmatrix}.$$

The surface of $h$ is the boundary of the set of all $\tilde{b}$ such that (13) has the maximal number of minimizers. A reasonable assumption would be that if a given point $b = \mathrm{vec}(B)$ (after the SVD rotation $\tilde{b} = U^Tb$) is in each of these sets given by every ellipse, the maximal number of minima would occur. One point that fulfils this is the origin $B = 0$, since then $\tilde{b} = U^Tb = 0$ for all ellipses.


6 The B = 0 case

In this section, we study the special case when $B = 0$. The practical relevance of $B = 0$ is perhaps insignificant. However, when studying how many minima a WOPP can have, it is an intuitive first approach. We derive all critical points of the optimization problem and classify which are minimum, maximum and inflection points. Under some conditions on $A$ and $X$, we show that there are $2^n$ minima. Earlier studies of the number of minima of optimization problems on Stiefel manifolds, similar to a WOPP with $B = 0$, have been done in [2]. For the problem type considered in [2], the number of minima is also $2^n$.

6.1 First order conditions and the critical points

We now consider a WOPP with $B = 0$,

$$\begin{gathered} \min \tfrac{1}{2}\|AQX\|_F^2, \quad \text{subject to } Q^TQ = I_n \\ \Downarrow \\ \min \tfrac{1}{2}\|F\,\mathrm{vec}(Q)\|_2^2, \quad \text{subject to } Q^TQ = I_n. \end{gathered} \tag{14}$$

Additionally, we also assume that the diagonal elements (singular values) of $A$ and $X$ are strictly decreasing, i.e.,

$$\alpha_i > \alpha_{i+1} > 0 \ \Rightarrow\ D_i > D_{i+1} \tag{15}$$

and

$$\chi_i > \chi_{i+1} > 0. \tag{16}$$

The reason for this assumption is that the WOPP with $B = 0$ will then not have connected minima. Later on, in Section 6.3, we study some cases with equal singular values ($\alpha_i = \alpha_{i+1}$ or $\chi_i = \chi_{i+1}$ for at least one $i$).

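Theorem 6.1 Any critical point of (14) only has 0, 1 and/or −1 as elements.

Proof. The Lagrangian corresponding to the problem is

$$L(Q, \Lambda) = \tfrac{1}{2}\|F\,\mathrm{vec}(Q)\|_2^2 + \tfrac{1}{2}\sum_{i=1}^{n} \lambda_{i,i}(q_i^Tq_i - 1) + \sum_{i<j}^{n} \lambda_{i,j}\, q_i^Tq_j. \tag{17}$$

According to (6), a stationary point results in $n$ sets of $m$ equations

$$\nabla_{q_i} L = \chi_i^2 D q_i + \lambda_{i,i} q_i + \sum_{j<i} \lambda_{j,i} q_j + \sum_{i<j}^{n} \lambda_{i,j} q_j = 0, \quad i = 1, 2, ..., n. \tag{18}$$

Now we show that at a critical point of (14), $\lambda_{i,j} = 0$ for all $i \neq j$. With $i < j \le n$, take the $i$'th and $j$'th set of equations

$$\chi_i^2 D q_i + \lambda_{i,i} q_i + \lambda_{1,i} q_1 + ... + \lambda_{i,j} q_j + ... + \lambda_{i,n} q_n = 0 \tag{19}$$

$$\chi_j^2 D q_j + \lambda_{j,j} q_j + \lambda_{1,j} q_1 + ... + \lambda_{i,j} q_i + ... + \lambda_{j,n} q_n = 0. \tag{20}$$

Due to orthogonality, multiplying (19) by $q_j^T$ and (20) by $q_i^T$ gives

$$\chi_i^2 q_j^T D q_i + \lambda_{i,j} = \chi_i^2 \gamma + \lambda_{i,j} = 0 \tag{21}$$

$$\chi_j^2 q_i^T D q_j + \lambda_{i,j} = \chi_j^2 \gamma + \lambda_{i,j} = 0, \tag{22}$$

where $\gamma = q_j^T D q_i = q_i^T D q_j$. Subtracting (22) from (21) gives $(\chi_i^2 - \chi_j^2)\gamma = 0$, so the condition (16) implies that $\gamma = 0$, yielding $\lambda_{i,j} = 0$. The set of equations (18) then has the form of an eigenvalue problem

$$\begin{aligned} \chi_1^2 D q_1 &= -\lambda_{1,1} q_1 \\ \chi_2^2 D q_2 &= -\lambda_{2,2} q_2 \\ &\ \,\vdots \\ \chi_n^2 D q_n &= -\lambda_{n,n} q_n. \end{aligned} \tag{23}$$

Hence each $\lambda_{i,i}$ must be equal to $-\chi_i^2 D_j$ for some $j \in \{1, ..., m\}$, since $D$ is a diagonal matrix. Because the $D_j$ are distinct by (15), it follows that $q_i = \pm e_j$, where $e_j$ denotes the $j$'th column of $I \in \mathbb{R}^{m \times m}$. $Q$ has orthonormal columns, so clearly if $q_i = \pm e_j$ then any other column of $Q$ fulfills $q_k = \pm e_l$ where $j \neq l$. That is, if $q_1 = \pm e_i$ then $q_2 = \pm e_j$ and $q_3 = \pm e_k$ and so on, with $i$, $j$, $k$ distinct. □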

Now we know all critical points of (14). What remains is to classify which of them are minima.

6.2 Second order conditions and the minimum solutions

Theorem 6.2 A problem of the form (14) has $2^n$ minima, and each minimum is of the form

$$\begin{bmatrix} Z \\ K \end{bmatrix}$$

where $Z$ is an $m - n$ by $n$ zero matrix and $K$ is an $n$ by $n$ anti-diagonal matrix with arbitrary $\pm 1$ as elements. Additionally, the minimum value $\|AQ_iX\|_F^2$ is the same for all minima $Q_i$, $i = 1, ..., 2^n$.

To prove Theorem 6.2, we make use of the following lemmas. Each of the lemmas excludes forms of $Q$ that result in non-minimum critical points. When proving the lemmas, we make use of the necessary conditions (8). An important observation is that at a critical point, the Hessian is a diagonal matrix

$$H = \mathrm{diag}(\chi_1^2 D + \lambda_{1,1} I, \ ..., \ \chi_n^2 D + \lambda_{n,n} I),$$

since $\lambda_{i,j} = 0$ whenever $i \neq j$.

Lemma 6.1 A minimum $Q$ cannot be of the form

$$Q = \begin{bmatrix} U \\ z^T \\ V \end{bmatrix},$$

where $z^T$ is a row of zeros and $U \in \mathbb{R}^{p \times n}$ has at least one row that contains a 1 or −1.

Proof. Assume that there is at least one element of $U$ equal to $\pm 1$, i.e., $U_{i,j} = \pm 1$. When choosing a tangent direction from Definition 3.1, take $S = 0$ and we get

$$T = (I - QQ^T)C = \begin{bmatrix} I - UU^T & 0 & -UV^T \\ 0 & 1 & 0 \\ -VU^T & 0 & I - VV^T \end{bmatrix} C.$$

Let $k = p + 1$ and choose all elements of $C$, apart from $C_{k,j} = c$, as zero. Then $T = C$, so $t = \mathrm{vec}(T)$ has the element $t_{m(j-1)+k} = c$ and zeros elsewhere. Denote the $j$'th column of $T$ by $T_j$; then the condition (8) becomes

$$t^THt = T_j^T(\chi_j^2 D + \lambda_{j,j} I)T_j = c^2(\chi_j^2 D_k + \lambda_{j,j}).$$

Looking back at (23), we see that if $U_{i,j} = \pm 1$, then $q_j = \pm e_i$ so $\lambda_{j,j} = -\chi_j^2 D_i$. But since $i \le p$ and $k > p$ imply $k > i$, we have $D_k - D_i < 0$ by (15), so $t^THt = c^2\chi_j^2(D_k - D_i) < 0$. We have shown that whenever there is an element equal to $\pm 1$ in a row $i$ of $Q$, and there is a row $k$ with $k > i$ containing just zeros, it is possible to find a tangent direction $t$ such that $t^THt < 0$. Hence $Q$ cannot be a minimizer. □

If $Q$ is to be a minimizer, all $m - n$ rows containing zeros must be at the top of the matrix. What is left to prove is that the remaining $n$ rows, containing the $\pm 1$ elements, must be ordered to form an $n$ by $n$ anti-diagonal matrix.

Lemma 6.2 If the element $Q_{m,1} = 0$ then $Q$ is not a minimizer.

Proof. Assume that

$$Q = \begin{bmatrix} Z \\ P \end{bmatrix}$$

where $P \in \mathbb{R}^{n \times n}$ is orthogonal and $Z$ is a zero matrix. Also assume that $Q_{m,1} \neq \pm 1$. This results in $Q_{i,1} = \pm 1$ for one $i \in \{m - n + 1, ..., m - 1\}$ and $Q_{m,j} = \pm 1$ for one $j \in \{2, ..., n\}$. This means that there is a $\pm 1$ element on row $i$ in the first column, with $(m - n) < i < m$. Additionally, there is a $\pm 1$ element in column $j$ of the last row $m$, where $j > 1$.

As tangent direction take $C = 0$ and choose the skew-symmetric matrix $S$ as $S_{j,1} = s$, $S_{1,j} = -s$ and zeros elsewhere. Then $T = QS$ has zero elements everywhere apart from the two elements $T_{m,1} = \pm s$ and $T_{i,j} = \pm(-s)$. Let $T_1$ and $T_j$, respectively, be the column vectors of $T$ that contain these nonzero elements. The condition (8) is then

$$\begin{aligned} t^THt &= T_1^T(\chi_1^2 D + \lambda_{1,1} I)T_1 + T_j^T(\chi_j^2 D + \lambda_{j,j} I)T_j \\ &= s^2(\chi_1^2 D_m + \lambda_{1,1}) + s^2(\chi_j^2 D_i + \lambda_{j,j}). \end{aligned} \tag{24}$$

Now $q_1 = \pm e_i$ and $q_j = \pm e_m$, so $\lambda_{1,1} = -\chi_1^2 D_i$ and $\lambda_{j,j} = -\chi_j^2 D_m$. Substituting this into (24) yields

$$t^THt = s^2(\chi_1^2 D_m - \chi_1^2 D_i + \chi_j^2 D_i - \chi_j^2 D_m) = s^2(\chi_1^2 - \chi_j^2)(D_m - D_i) < 0,$$

since $(\chi_1^2 - \chi_j^2) > 0$ and $(D_m - D_i) < 0$ by (16) and (15), respectively.

We have shown that if $Q_{m,1} \neq \pm 1$ then we can always find a tangent direction $t = \mathrm{vec}(T)$ such that $t^THt < 0$. Hence, if $Q$ is to be a minimizer, $Q_{m,1}$ must be equal to $\pm 1$. □

Lemma 6.3 Assume that

$$Q = \begin{bmatrix} 0 & 0 \\ 0 & P \\ K & 0 \end{bmatrix}$$

where $K \in \mathbb{R}^{r \times r}$ is anti-diagonal with elements $\pm 1$ and $P \in \mathbb{R}^{(n-r) \times (n-r)}$. If $P_{n-r,1} \neq \pm 1$ then $Q$ is not a minimizer.

Proof. Let $q_{r+1} = \pm e_i$ with $(m - n) \le i < (m - r)$ and $q_j = \pm e_{m-r}$ with $r + 1 < j \le n$. Choose the tangent direction with $C = 0$, but now choose $S_{j,r+1} = s$ (and $S_{r+1,j} = -s$ due to skew symmetry) and zeros elsewhere. We now get $T_{m-r,r+1} = \pm s$ and $T_{i,j} = \pm(-s)$, and all other elements are equal to zero. Denote by $T_{r+1}$ and $T_j$ the columns containing these two elements; we then get

$$\begin{aligned} t^THt &= T_{r+1}^T(\chi_{r+1}^2 D + \lambda_{r+1,r+1} I)T_{r+1} + T_j^T(\chi_j^2 D + \lambda_{j,j} I)T_j \\ &= s^2(\chi_{r+1}^2 D_{m-r} + \lambda_{r+1,r+1}) + s^2(\chi_j^2 D_i + \lambda_{j,j}). \end{aligned} \tag{25}$$

The Lagrange parameters are $\lambda_{r+1,r+1} = -\chi_{r+1}^2 D_i$ and $\lambda_{j,j} = -\chi_j^2 D_{m-r}$. Substituting this into (25) we get

$$t^THt = s^2(\chi_{r+1}^2 D_{m-r} - \chi_{r+1}^2 D_i + \chi_j^2 D_i - \chi_j^2 D_{m-r}) = s^2(\chi_{r+1}^2 - \chi_j^2)(D_{m-r} - D_i) < 0,$$

by (16) and (15), since $r + 1 < j$ and $i < (m - r)$ respectively. □

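Proof of Theorem 6.2. By Lemmas 6.1, 6.2 and 6.3, take

$$Q = \begin{bmatrix} 0 & 0 \\ 0 & \hat{P} \\ \hat{K} & 0 \end{bmatrix}$$

where

$$\hat{K} := \begin{bmatrix} 0 & \pm 1 \\ K & 0 \end{bmatrix}, \quad \hat{K} \in \mathbb{R}^{(r+1) \times (r+1)}, \quad \hat{P} \in \mathbb{R}^{(n-r-1) \times (n-r-1)},$$

and induction follows trivially, i.e., a minimizer must be of the form stated in Theorem 6.2.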

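The only thing left to prove is that a matrix of the form

$$Q = \begin{bmatrix} Z \\ K \end{bmatrix}$$

is a minimizer. The tangent direction at $Q$ is

$$T = QS + (I - QQ^T)C = QS + \begin{bmatrix} I_{m-n} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} C_1 \\ C_2 \end{bmatrix} = \begin{bmatrix} C_1 \\ KS \end{bmatrix}.$$

Observe that $C \neq 0$ only contributes positive terms to $t^THt$. Hence we can choose $C = 0$ for simplicity to get

$$T = \begin{bmatrix} 0 \\ \hat{S} \end{bmatrix}.$$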

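The matrix $\hat{S} = KS$ has the "permuted and possibly negated" appearance of

$$\hat{S} = \begin{bmatrix} \pm s_{1,n} & \pm s_{2,n} & \cdots & \pm s_{n-1,n} & 0 \\ \pm s_{1,n-1} & \cdots & \pm s_{n-2,n-1} & 0 & \pm s_{n,n-1} \\ \vdots & \cdots & 0 & \cdots & \vdots \\ \pm s_{1,2} & 0 & \pm s_{3,2} & \cdots & \pm s_{n,2} \\ 0 & \pm s_{2,1} & \cdots & \pm s_{n-1,1} & \pm s_{n,1} \end{bmatrix}.$$

However, as we shall see, only the absolute values of these elements matter. The necessary condition is

$$t^THt = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} (\chi_i^2 D_{m-j+1} + \lambda_{i,i})\, s_{i,j}^2 = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} (\chi_i^2 D_{m-j+1} - \chi_i^2 D_{m-i+1})\, s_{i,j}^2. \tag{26}$$

Since $s_{i,j}^2 = s_{j,i}^2$, we can collect these terms and write (26) as

$$\begin{aligned} t^THt &= \sum_{i=1}^{n} \sum_{j>i}^{n} s_{i,j}^2 (\chi_i^2 D_{m-j+1} - \chi_i^2 D_{m-i+1} + \chi_j^2 D_{m-i+1} - \chi_j^2 D_{m-j+1}) \\ &= \sum_{i=1}^{n} \sum_{j>i}^{n} s_{i,j}^2 (\chi_i^2 - \chi_j^2)(D_{m-j+1} - D_{m-i+1}) \ge 0, \end{aligned} \tag{27}$$

since $j > i$ implies $(\chi_i^2 - \chi_j^2) > 0$ by (16) and, because $m - j + 1 < m - i + 1$, also $(D_{m-j+1} - D_{m-i+1}) > 0$ by (15).

Equality $t^THt = 0$ only occurs if $t = 0$, so the last condition (27) is a sufficient condition for a minimizer, i.e., $t^THt > 0$ for all nonzero $t = \mathrm{vec}(T)$, $T \in \mathcal{T}$.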

Finally, it is easily seen that each minimum $Q_i$ results in the same objective function value,

$$\|AQ_iX\|_F^2 = \|F\,\mathrm{vec}(Q_i)\|_2^2 = \sum_{i=1}^{n} \left(\chi_i^2 D_{m-i+1} (\pm 1)^2\right) = \sum_{i=1}^{n} \chi_i^2 \alpha_{m-i+1}^2. \qquad \square$$

In a similar way, all maxima of (14) can be proven to be of the form

$$Q = \begin{bmatrix} \mathrm{diag}(\pm 1, \ldots, \pm 1) \\ Z \end{bmatrix} \in \mathbb{R}^{m \times n}.$$

The remaining critical points, which are neither minima nor maxima, are then saddle points.
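Theorems 6.1 and 6.2 can be checked by brute force on a small instance. The sketch below, assuming NumPy and the arbitrary test values $m = 4$, $n = 2$, $\alpha = (4, 3, 2, 1)$, $\chi = (2, 1)$, enumerates every critical point (columns $\pm e_j$ with distinct $j$), builds the diagonal Hessian at that point, restricts it to a basis of the tangent space (5), and keeps the points where the restricted Hessian is positive definite. The function name and the tolerance are illustrative; this is a numerical illustration, not the proof technique of the paper.

```python
import itertools
import numpy as np

m, n = 4, 2
alpha = np.array([4.0, 3.0, 2.0, 1.0])   # strictly decreasing, as in (15)
chi = np.array([2.0, 1.0])               # strictly decreasing, as in (16)
D = alpha ** 2

def is_strict_minimum(rows, signs):
    """Second order test at the critical point with q_i = signs[i] * e_rows[i]."""
    Q = np.zeros((m, n))
    Q[np.array(rows), np.arange(n)] = signs
    lam = -chi ** 2 * D[list(rows)]                  # lambda_{i,i} from (23)
    # Diagonal of H: block i is chi_i^2 D + lambda_{i,i} I (column-major vec).
    h = np.concatenate([chi[i] ** 2 * D + lam[i] for i in range(n)])
    basis = []
    for a, b in itertools.combinations(range(n), 2):  # directions Q S
        S = np.zeros((n, n))
        S[a, b], S[b, a] = 1.0, -1.0
        basis.append(Q @ S)
    for r in (r for r in range(m) if r not in rows):  # directions (I - QQ^T) C
        for c in range(n):
            T = np.zeros((m, n))
            T[r, c] = 1.0
            basis.append(T)
    Bmat = np.column_stack([T.reshape(-1, order="F") for T in basis])
    reduced = Bmat.T @ (h[:, None] * Bmat)            # Hessian on tangent space
    return np.linalg.eigvalsh(reduced).min() > 1e-12

minima = []
for rows in itertools.permutations(range(m), n):
    for signs in itertools.product([1.0, -1.0], repeat=n):
        if is_strict_minimum(rows, signs):
            Q = np.zeros((m, n))
            Q[np.array(rows), np.arange(n)] = signs
            minima.append(Q)

print(len(minima))   # number of strict minima found
print(minima[0])     # zero rows on top, anti-diagonal +-1 block below
```

For this instance the enumeration finds $2^n = 4$ minima, each of the form $[Z; K]$ with objective value $\chi_1^2\alpha_4^2 + \chi_2^2\alpha_3^2 = 8$.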

6.3 Some cases with equal singular values

If two or more singular values are equal, e.g., $\alpha_i = \alpha_{i+1}$ and/or $\chi_i = \chi_{i+1}$, the optimization problem with $B = 0$ can have connected minima. This is easily understood when looking at the 2 by 1 case with $A = I_2$ and $X = 1$. Then the optimization problem $\min \|AQX\|_F^2$, subject to $Q^TQ = 1$, consists of finding the shortest distance from the unit circle to the origin. Obviously, this problem has infinitely many solutions since the distance from the unit circle to the origin (the radius) is constant.

The orthogonal Procrustes problem is a case with equal singular values. Any $Q \in V_{m,n}$ yields the same objective function value if $B = 0$, because of the circular properties of $\mathcal{F}$.

If $Q \in \mathbb{R}^{m \times 1}$ and $\alpha_i = \alpha_{i+1} = ... = \alpha_m$ we get a subspace minimizing the problem according to

$$Q = \begin{bmatrix} z \\ \hat{q} \end{bmatrix},$$

where $z \in \mathbb{R}^{i-1}$ is a zero vector and $\hat{q} \in \mathbb{R}^{m-i+1}$ fulfills $\hat{q}^T\hat{q} = 1$. The same happens for unbalanced problems of general dimensions. For example, take $X = I_n$; then if $Q$ is a minimizer of $\min \tfrac{1}{2}\|AQ\|_F^2$, so is any $\hat{Q} = QV$ where $V \in \mathbb{R}^{n \times n}$ is any orthogonal matrix, since

$$\|A\hat{Q}\|_F^2 = \mathrm{trace}(\hat{Q}^TA^TA\hat{Q}) = \mathrm{trace}(\hat{Q}\hat{Q}^TA^TA) = \mathrm{trace}(QVV^TQ^TA^TA) = \|AQ\|_F^2. \tag{28}$$
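The invariance in (28) is easy to confirm numerically. A minimal sketch, assuming NumPy, an arbitrary diagonal $A$ with decreasing entries, and a random orthogonal $V$ (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 3

A = np.diag(np.arange(m, 0, -1.0))                  # diag(5, 4, 3, 2, 1)
Q = np.vstack([np.zeros((m - n, n)), np.eye(n)])    # a minimizer when X = I_n
V, _ = np.linalg.qr(rng.standard_normal((n, n)))    # random orthogonal matrix

value_Q = np.linalg.norm(A @ Q, "fro") ** 2
value_QV = np.linalg.norm(A @ Q @ V, "fro") ** 2
print(value_Q, value_QV)   # equal up to rounding: QV is also a minimizer
```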

7 Discussion of the general case B ≠ 0

For the OPP it is known that if $B$ is rank deficient, the solution is not unique. In this section, we consider some special cases with $B \neq 0$ that result in several minima for different setups of (1). As mentioned earlier, analysis with an arbitrary $B$ is beyond the scope of this paper.


For a problem of the form (14), define the function $g(s, b) = \|F\,\mathrm{vec}(Q) - b\|_2^2$, where $s \in \mathbb{R}^p$ is a parameterization of $Q$ and $b \in \mathbb{R}^{mn}$. The gradient of $g(s, b)$ with respect to $s$ is $\nabla_s g(s, b) \in \mathbb{R}^p$, and at an extreme point $\nabla_s g(s, b) = 0$ is fulfilled. By the implicit function theorem, if $\det(\nabla_s^2 g(s, b)) \neq 0$ there exist a neighborhood $W$ of $b$ and a unique continuous function $h : W \to \mathbb{R}^p$ such that $\nabla_s g(h(u), u) = 0$ for all $u \in W$. Let $W_i$, $i = 1, ..., 2^n$, be the neighborhood for each minimum obtained when $b = 0$; then the optimization problem has at least $2^n$ minima for all $b \in \bigcap_{i=1}^{2^n} W_i$.

7.1 Some examples

In the ellipsoid cases there is a number $\gamma$ such that if $\|B\|_F > \gamma$ then the problem has a unique minimizer. The same does not hold for general cases when $n > 1$. Similar to the case when an OPP lacks a unique minimizer, it is always possible to choose a rank deficient $B$ at infinity such that (1) has more than one minimum. Let $Q \in \mathbb{R}^{3 \times 2}$ and take

$$B = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ \beta & 0 \end{bmatrix},$$

where $\beta > 0$ is arbitrarily large; then the optimization problem has the two minimizers

$$\begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 0 \\ 0 & -1 \\ 1 & 0 \end{bmatrix}.$$

This is easily generalized to problems of general dimensions, and we can draw the conclusion that there exists no bounded, finite "evolute surface" as in the ellipsoid cases.

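For an unbalanced problem with $X = I_n$, (28) indicates circular properties of $\mathcal{F}$ close to the origin. One might think that for a small perturbation $B = \delta B$, there should be fewer than $2^n$ minima. This is not necessarily the case. With $Q \in \mathbb{R}^{3 \times 2}$ (and $X = I_2$) take

$$B = \begin{bmatrix} \beta & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}.$$

For a sufficiently small $\beta > 0$ there are still $2^n = 4$ minimizers of the form

$$Q = \begin{bmatrix} \cos\phi & 0 \\ \pm\sin\phi & 0 \\ 0 & \pm 1 \end{bmatrix},$$

where $\phi$ is the solution to

$$\min_\phi \left\| \begin{bmatrix} \alpha_1 & 0 \\ 0 & \alpha_2 \end{bmatrix} \begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix} - \begin{bmatrix} \beta \\ 0 \end{bmatrix} \right\|_2^2.$$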


Let us assume that the maximal number of unconnected minima of (1) is $2^n$. For a given $B = \hat{B}$, one could then assume that as $B \to 0$ the minimizers would follow continuously. An idea would be to try the opposite, i.e., an algorithm that starts out at $B = 0$ and approaches the given value $B = \hat{B}$.

7.2 A simple algorithm

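Consider the following optimization problem

$$\min_Q \|AQX - \beta B\|_F^2, \quad \text{subject to } Q^TQ = I_n,$$

where $\beta$ is a parameter ranging from 0 to 1. At $\beta = 0$ there are several minimizers, but it is possible to derive which of these minima gives the least objective function value $\|AQX - B\|_F^2$. Let $Q_0$ be this minimum. If $\alpha_i > \alpha_{i+1}$ for $i = 1, ..., m - 1$ and $X = I_n$, $Q_0$ is of the form

$$Q_0 = \begin{bmatrix} Z \\ \hat{Q} \end{bmatrix},$$

where $\hat{Q} \in \mathbb{R}^{n \times n}$ is orthogonal and $Z \in \mathbb{R}^{(m-n) \times n}$ is a zero matrix. The objective function is then

$$\|AQ_0 - B\|_F^2 = \mathrm{trace}(Q_0^TA^TAQ_0 - 2Q_0^TA^TB + B^TB).$$

Because of the special structure of $Q_0$, the first term does not depend on $\hat{Q}$ and can be dropped, so $\min \|AQ_0 - B\|_F^2$ is equivalent to $\max \mathrm{trace}(Q_0^TA^TB)$. Perform an SVD of $A^TB$ as

$$U\Sigma V^T = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix} \Sigma V^T = A^TB,$$

where $U_1 \in \mathbb{R}^{(m-n) \times m}$ and $U_2 \in \mathbb{R}^{n \times m}$; then

$$\mathrm{trace}(Q_0^TA^TB) = \mathrm{trace}\left(V^T[Z^T, \hat{Q}^T] \begin{bmatrix} U_1 \\ U_2 \end{bmatrix} \Sigma\right) = \mathrm{trace}(V^T\hat{Q}^TU_2\Sigma).$$

Since $V^T\hat{Q}^T$ is an $n$ by $n$ orthogonal matrix we can derive the optimal solution, analogous to the procedure for an orthogonal Procrustes problem, by using the SVD of $U_2\Sigma$. Let $\hat{U}\hat{\Sigma}\hat{V}^T = U_2\Sigma$; then $\mathrm{trace}(V^T\hat{Q}^T\hat{U}\hat{\Sigma}\hat{V}^T)$ is maximized if $\hat{V}^TV^T\hat{Q}^T\hat{U} = I_n$, i.e., if $\hat{Q} = \hat{U}\hat{V}^TV^T$. Here $X = I_n$ was used, that is $\chi_i = \chi_{i+1} = 1$, but the same procedure can be applied for cases with $\chi_i \ge \chi_{i+1}$.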
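The two-step SVD construction of $Q_0$ described above can be sketched as follows for the case $X = I_n$. The code assumes NumPy; the function name `best_start` and the test data are hypothetical and only serve as an illustration of the derivation.

```python
import numpy as np

def best_start(A, B):
    """Q0 = [Z; Q_hat] minimizing ||A Q0 - B||_F over orthogonal Q_hat (X = I_n)."""
    m, n = B.shape
    U, s, Vt = np.linalg.svd(A.T @ B)          # full SVD: U is m x m, Vt is n x n
    Sigma = np.zeros((m, n))
    Sigma[:n, :n] = np.diag(s)
    U2 = U[m - n:, :]                          # last n rows of U
    Uh, _, Vht = np.linalg.svd(U2 @ Sigma)     # SVD of the n x n matrix U2 Sigma
    Q_hat = Uh @ Vht @ Vt                      # Q_hat = U_hat V_hat^T V^T
    return np.vstack([np.zeros((m - n, n)), Q_hat])

rng = np.random.default_rng(3)
m, n = 5, 2
A = np.diag(np.arange(m, 0, -1.0))             # strictly decreasing diagonal
B = rng.standard_normal((m, n))

Q0 = best_start(A, B)
print(np.linalg.norm(A @ Q0 - B, "fro"))
```

A simple plausibility check is to compare the objective at `Q0` with the objective at $[Z; R]$ for many random orthogonal $R$; the value at `Q0` should never be larger.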

Consider now the following algorithm:

1. Compute $Q_0$.

2. $k = 0$.

3. for $\beta > 0$ to $\beta = 1$,

   3.1 $k = k + 1$,

   3.2 Let $Q_k$ be the solution to

   $$\min \|AQX - \beta B\|_F^2, \tag{29}$$

   when using $Q_{k-1}$ as the initial value for the iterative method used to solve (29).

4. end for.
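The steps above can be sketched in code. The paper does not specify the iterative method used to solve (29) at this point, so the sketch below substitutes a plain projected gradient descent with a QR-based retraction and backtracking line search; this solver, the function names, the step count and the tolerances are all assumptions made for illustration, not the method of [20] or [22].

```python
import numpy as np

def objective(Q, A, X, B):
    return 0.5 * np.linalg.norm(A @ Q @ X - B, "fro") ** 2

def retract(Y):
    """Map Y back to the Stiefel manifold via QR; the sign fix keeps it continuous."""
    Qf, R = np.linalg.qr(Y)
    d = np.sign(np.diag(R))
    d[d == 0] = 1.0
    return Qf * d

def solve(Q, A, X, B, iters=500, tol=1e-10):
    """Assumed local solver: projected gradient descent with backtracking."""
    for _ in range(iters):
        G = A.T @ (A @ Q @ X - B) @ X.T            # Euclidean gradient
        QtG = Q.T @ G
        xi = G - Q @ (QtG + QtG.T) / 2             # projection onto tangent space
        g2 = np.sum(xi * xi)
        if g2 < tol:
            break
        f0, step = objective(Q, A, X, B), 1.0
        while step > 1e-12:
            Qn = retract(Q - step * xi)
            if objective(Qn, A, X, B) <= f0 - 1e-4 * step * g2:
                Q = Qn
                break
            step *= 0.5
        else:
            break                                  # no acceptable step found
    return Q

def continuation(Q0, A, X, B, steps=20):
    """Follow a minimizer from beta = 0 to beta = 1, warm-starting each solve."""
    Q = Q0
    for beta in np.linspace(0.0, 1.0, steps + 1)[1:]:
        Q = solve(Q, A, X, beta * B)
    return Q

A = np.diag([3.0, 2.0, 1.0])
X = np.diag([2.0, 1.0])
B = np.array([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]])
Q0 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # one of the B = 0 minima
print(continuation(Q0, A, X, B))
```

Such a scheme only tracks one local minimizer; it gives no guarantee of ending at the global minimum.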

Does $Q_k$ converge to the global minimum as $\beta \to 1$? Empirical studies have shown that this is not always the case. For an optimal $Q_0$ (computed as above), $Q_k$ can at some point $\beta = \hat{\beta}$ become a local minimum, even though the trajectory of the $Q_k$ followed is (as far as can be judged) continuous. Studies have also shown that starting with a non-optimal $Q_0$ can yield a continuous trajectory converging towards the global minimum as $\beta \to 1$. Non-optimal here means that $Q_0$ is a minimum of $\min \|AQX\|_F^2$, but not optimal in the sense of $\min \|AQ_0X - B\|_F^2$ as described above. That is, a local minimum can become a global minimizer at some point on the trajectory. It is not clear why this can happen.

8 Conclusions

Studying the different cases when $B \approx 0$ gives insight into the number of minima a WOPP may have and how they are located in relation to each other. Extensive empirical studies have shown that no more than $2^n$ minima exist for a WOPP, and it seems reasonable to conjecture that this is true in general.

Not mentioned here are some continuation (homotopy) methods that were considered. As with the algorithm described in Section 7.2, they too failed in some cases. However, in connection with this work, a successful algorithm has been developed to compute all minimizers [20]. But there is still much to be understood about how the geometric properties of these problems can be used to achieve global minimization for a general $B$.

Appendix

A The canonical form of a WOPP

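Proposition A.1 The matrices $A \in \mathbb{R}^{m_A \times m}$ and $X \in \mathbb{R}^{n \times n_X}$ with $\mathrm{Rank}(A) = m$ and $\mathrm{Rank}(X) = n$ belonging to a WOPP

$$\min \tfrac{1}{2}\|AQX - B\|_F^2, \quad \text{subject to } Q^TQ = I_n,$$

can always be considered as $m$ by $m$ and $n$ by $n$ diagonal matrices, respectively.

Proof. Let $A = U_A\Sigma_AV_A^T$ and $X = U_X\Sigma_XV_X^T$ be the singular value decompositions of $A$ and $X$. Then

$$\|U_A\Sigma_AV_A^TQU_X\Sigma_XV_X^T - B\|_F^2 = \|U_A\Sigma_AZ\Sigma_XV_X^T - B\|_F^2,$$

where $Z = V_A^TQU_X \in \mathbb{R}^{m \times n}$ has orthonormal columns. Since $U_A^TU_A = I_{m_A}$ and $V_X^TV_X = I_{n_X}$ it follows that

$$\begin{aligned} \|U_A\Sigma_AZ\Sigma_XV_X^T - B\|_F^2 &= \mathrm{tr}\left((U_A\Sigma_AZ\Sigma_XV_X^T - B)^T(U_A\Sigma_AZ\Sigma_XV_X^T - B)\right) \\ &= \mathrm{tr}(V_X\Sigma_XZ^T\Sigma_A^2Z\Sigma_XV_X^T - 2V_X\Sigma_XZ^T\Sigma_AU_A^TB + B^TB) \\ &= \mathrm{tr}(\Sigma_XZ^T\Sigma_A^2Z\Sigma_X) - \mathrm{tr}(2\Sigma_XZ^T\Sigma_AU_A^TBV_X) + \mathrm{tr}(B^TB) \\ &= \mathrm{tr}\left((\Sigma_AZ\Sigma_X - U_A^TBV_X)^T(\Sigma_AZ\Sigma_X - U_A^TBV_X)\right) = \|\Sigma_AZ\Sigma_X - U_A^TBV_X\|_F^2. \end{aligned}$$

Hence, without loss of generality we can assume that $A = \mathrm{diag}(\alpha_1, ..., \alpha_m)$ and $X = \mathrm{diag}(\chi_1, ..., \chi_n)$ with $\alpha_i \ge \alpha_{i+1} \ge 0$ and $\chi_i \ge \chi_{i+1} \ge 0$. □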

B The solution to an OPP

Theorem B.1 Let $X \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{m \times n}$ be known matrices with $\mathrm{Rank}(X) = n$ and $\mathrm{Rank}(B) = n$. Then the solution $Q$ of the orthogonal Procrustes problem

$$\min \tfrac{1}{2}\|QX - B\|_F^2, \quad \text{subject to } Q^TQ = I_n, \tag{30}$$

is $Q = VI_{m,n}U^T$, where $U$ and $V$ are the orthogonal matrices given by the singular value decomposition $U\Sigma V^T = XB^T$.

Proof. Since

$$\begin{aligned} \|QX - B\|_F^2 &= \mathrm{trace}((QX - B)^T(QX - B)) \\ &= \mathrm{trace}((QX)^T(QX)) + \mathrm{trace}(B^TB) - \mathrm{trace}((QX)^TB) - \mathrm{trace}(B^T(QX)) \\ &= \|X\|_F^2 + \|B\|_F^2 - 2\,\mathrm{trace}(B^TQX), \end{aligned}$$

equation (30) is equivalent to

$$\max \mathrm{trace}(B^TQX), \quad \text{subject to } Q^TQ = I_n. \tag{31}$$

Note that $\mathrm{trace}(B^TQX) = \mathrm{trace}(XB^TQ)$ and let $U\Sigma V^T = XB^T$ be a singular value decomposition. Use the matrix $Z = V^TQU$, $Z \in \mathbb{R}^{m \times n}$, and we get

$$\mathrm{trace}(XB^TQ) = \mathrm{trace}(\Sigma V^TQU) = \mathrm{trace}(\Sigma Z) = \sum_{i=1}^{n} \sigma_i z_{i,i}.$$

Since $Z$ has orthonormal columns, the upper bound of (31) is attained for $z_{i,i} = 1$, i.e., $Z = I_{m,n}$. The solution of (30) is then given by $V^TQU = I_{m,n} \Rightarrow Q = VI_{m,n}U^T$. □
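Theorem B.1 translates directly into a few lines of code. The following is a minimal sketch, assuming NumPy; the function name `opp_solution` and the random test data are illustrative only.

```python
import numpy as np

def opp_solution(X, B):
    """Solve min ||Q X - B||_F subject to Q^T Q = I_n via the SVD of X B^T."""
    m, n = B.shape
    U, s, Vt = np.linalg.svd(X @ B.T)   # X B^T is n x m; U is n x n, Vt is m x m
    I_mn = np.eye(m, n)                 # I_{m,n} = [I_n; 0]
    return Vt.T @ I_mn @ U.T            # Q = V I_{m,n} U^T

rng = np.random.default_rng(4)
m, n = 5, 3
X = rng.standard_normal((n, n))
B = rng.standard_normal((m, n))

Q = opp_solution(X, B)
best = np.linalg.norm(Q @ X - B, "fro")
print(best)
```

Comparing `best` against the objective at random points of $V_{m,n}$ is a quick way to see that no sampled point does better.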

If we consider the balanced case of a WOPP with $X = I_n$,

$$\min \tfrac{1}{2}\|AQ - B\|_F^2, \quad \text{subject to } Q^TQ = I_n, \tag{32}$$

then (32) is an OPP since $Q$ is orthogonal [11].


C Parametrization of Vm,n by using the Cayley transform

The Cayley transform is often used to represent orthogonal matrices with positive determinant as

$$Q(S) = (I + S)(I - S)^{-1}, \tag{33}$$

where $S \in \mathbb{R}^{m \times m}$ is skew-symmetric ($S = -S^T$). Since $S$ has purely imaginary eigenvalues, $(I - S)$ always has full rank. This parametrization fails in some cases, namely when $(Q + I)$ is singular. As an example, there exists no $S \in \mathbb{R}^{2 \times 2}$ such that $Q(S) = \mathrm{diag}(-1, -1)$. Instead of using (33) as a parametrization of orthogonal matrices, a local parametrization can be used. Given a point $\hat{Q} \in V_{m,m}$, we can express any $Q \in V_{m,m}$ in the vicinity of $\hat{Q}$ by using

$$Q(S) = \hat{Q}(I + S)(I - S)^{-1}. \tag{34}$$

To get a local parametrization of $V_{m,n}$ when $n \le m$, (34) is modified as follows. Given a point $\hat{Q} \in V_{m,n}$, a parametrization of any $Q \in V_{m,n}$ in the vicinity of $\hat{Q}$ can be written as

$$Q(S) = [\hat{Q}, \hat{Q}_\perp](I + S)(I - S)^{-1}I_{m,n}. \tag{35}$$

Here $\hat{Q}_\perp$ is any extension such that $[\hat{Q}, \hat{Q}_\perp] \in \mathbb{R}^{m \times m}$ is orthogonal and

$$I_{m,n} = \begin{bmatrix} I_n \\ 0 \end{bmatrix} \in \mathbb{R}^{m \times n}.$$

$S$ is skew-symmetric according to

$$S = \begin{bmatrix} S_{11} & -S_{21}^T \\ S_{21} & 0 \end{bmatrix}, \tag{36}$$

where $S_{11} \in \mathbb{R}^{n \times n}$ is skew-symmetric, $S_{21} \in \mathbb{R}^{(m-n) \times n}$ is arbitrary and the remaining lower right part is a zero matrix. Observe that if $m = n$, then (35) is the same as (34).

C.1 The Tangent space of Vm,n

Definition C.1 The tangent space of the Stiefel manifold Vm,n at a given point

Q can be expressed as

T = {T = QS + (I − QQT )C}, (37)

where S ∈ Rn×n is skew-symmetric and C ∈ R

m×n arbitrary.

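By using the power expansion

$$(I - S)^{-1} = I + S + S^2 + S^3 + \ldots,$$

which converges for sufficiently small $S$, (35) can be expressed, with $\hat{Q}$ denoting the given point and $\hat{Q}_\perp$ its orthogonal extension, as

$$Q(S) = [\hat{Q}, \hat{Q}_\perp](I + S)(I + S + S^2 + \ldots) I_{m,n}.$$

The first order approximation in $S$ is then

$$Q(S) \approx [\hat{Q}, \hat{Q}_\perp](I + 2S) I_{m,n} = \hat{Q} + [\hat{Q}, \hat{Q}_\perp]\, 2S I_{m,n}.$$

The second term, which depends on $S$, gives a representation of the tangent space as

$$[\hat{Q}, \hat{Q}_\perp]\, 2S I_{m,n} = 2\hat{Q}S_{11} + 2\hat{Q}_\perp S_{21}. \tag{38}$$

Identifying $2S_{11}$ with the skew-symmetric matrix $S$ of (37), and since $\operatorname{Range}(\hat{Q}_\perp) = \operatorname{Range}(I - \hat{Q}\hat{Q}^T)$, (38) is the same as (37) at the point $Q = \hat{Q}$. Note that (38) is independent of the lower right block of $S$, due to the multiplication with $I_{m,n}$, so this block can be taken as zero and $S$ has the form given in (36).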
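The first order expansion can be checked with a finite difference. The sketch below assumes NumPy; the step `eps`, the sizes and the helper `Q_of` are illustrative choices made here. It compares $(Q(\varepsilon S) - \hat{Q})/\varepsilon$ with the right-hand side of (38) and also tests the tangency condition $\hat{Q}^T T + T^T \hat{Q} = 0$, which is equivalent to $T$ having the form (37).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2                                    # hypothetical sizes
Q_full, _ = np.linalg.qr(rng.standard_normal((m, m)))
Q_hat, Q_perp = Q_full[:, :n], Q_full[:, n:]

W = rng.standard_normal((n, n))
S11 = W - W.T                                  # skew-symmetric n x n block
S21 = rng.standard_normal((m - n, n))          # arbitrary (m-n) x n block
S = np.block([[S11, -S21.T], [S21, np.zeros((m - n, m - n))]])

def Q_of(S):
    """Evaluate (35) with [Q_hat, Q_perp] = Q_full."""
    I = np.eye(m)
    return Q_full @ np.linalg.solve(I - S, I + S)[:, :n]

# Tangent vector predicted by (38).
T = 2 * Q_hat @ S11 + 2 * Q_perp @ S21

# One-sided finite difference of Q(eps * S) at eps = 0.
eps = 1e-6
finite_difference = (Q_of(eps * S) - Q_hat) / eps
fd_error = np.linalg.norm(finite_difference - T)

# Tangency: Q_hat^T T must be skew-symmetric.
tangency_residual = np.linalg.norm(Q_hat.T @ T + T.T @ Q_hat)

# Form (37): T = Q_hat (2 S11) + (I - Q_hat Q_hat^T) C with C = 2 Q_perp S21.
C = 2 * Q_perp @ S21
form_37_residual = np.linalg.norm(
    T - (Q_hat @ (2 * S11) + (np.eye(m) - Q_hat @ Q_hat.T) @ C)
)
print(fd_error, tangency_residual, form_37_residual)
```

The finite-difference error is of order `eps` because the neglected term in the expansion is quadratic in $S$.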

D Number of minima to the ellipsoid cases

Theorem D.1 If $A \in \mathbb{R}^{m \times m}$ has distinct singular values $\alpha_i > \alpha_{i+1} > 0$, $i = 1, \ldots, m-1$, then (10) has a maximum of two minimizers.

In the proof that follows, it is shown that if a point $q \in \mathbb{R}^m$ fulfills $\operatorname{sign}(q_i) \ne \operatorname{sign}(b_i)$ for any $i = 1, \ldots, m-1$, then $q$ is not a minimizer. Only two possible minimizers then remain, i.e., either $\operatorname{sign}(q_m) = \operatorname{sign}(b_m)$ or $\operatorname{sign}(q_m) = -\operatorname{sign}(b_m)$. As a reminder, we also assume that $A$ is diagonal (in canonical form). First we make an assumption that is justified during the proof.

Assumption D.1 With the conditions stated in Theorem D.1, let $\hat{q}$ be a minimizer to (10). Then there does not exist any other minimum $\tilde{q}$ such that $\operatorname{sign}(\tilde{q}_i) = \operatorname{sign}(\hat{q}_i)$ for all $i = 1, \ldots, m$. That is, any other minimum $\tilde{q}$ of (10) must have at least one element $\tilde{q}_i$ with a different sign than $\hat{q}_i$.

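Proof. First assume that $b_i \ne 0$ for all $i = 1, \ldots, m$. The case when any $b_i = 0$ is considered at the end of the proof.

Now, let $q$ be a point with $\operatorname{sign}(q_k) \ne \operatorname{sign}(b_k)$ where $1 \le k \le m-1$. Let $U(\phi) \in \mathbb{R}^{m \times m}$ be a plane rotation in the plane spanned by $[e_k, e_m]$ according to

$$U(\phi) = \begin{bmatrix} I_{k-1} & 0 & 0 & 0 \\ 0 & \cos\phi & 0 & -\sin\phi \\ 0 & 0 & I_{m-k-1} & 0 \\ 0 & \sin\phi & 0 & \cos\phi \end{bmatrix}. \tag{39}$$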

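Observe that $U(0) = I_m$ and that

$$\min_\phi \|AU(\phi)q - b\|_2^2$$

is equivalent to

$$\min_\phi \left\| \begin{bmatrix} \alpha_k & 0 \\ 0 & \alpha_m \end{bmatrix} \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} q_k \\ q_m \end{bmatrix} - \begin{bmatrix} b_k \\ b_m \end{bmatrix} \right\|_2^2 = \min_\phi \|\tilde{A}\tilde{U}(\phi)z - \tilde{b}\|_2^2, \tag{40}$$

where $z = [q_k, q_m]^T$ and $\tilde{A}$, $\tilde{U}(\phi)$, $\tilde{b}$ denote the $2 \times 2$ and $2 \times 1$ blocks written out on the left. If $q$ is a minimizer, then any arbitrarily small $\delta\phi \ne 0$ would yield $\|\tilde{A}\tilde{U}(\delta\phi)z - \tilde{b}\|_2 > \|\tilde{A}z - \tilde{b}\|_2$. Equation (40) is just the ellipse problem described earlier, i.e., take

$$\hat{A} = \|z\|_2 \tilde{A}, \quad \hat{z}(\phi) = \frac{\tilde{U}(\phi)z}{\|z\|_2},$$

then $\|\hat{z}(\phi)\|_2 = 1$ and (40) is the same as

$$\min_\phi \|\hat{A}\hat{z}(\phi) - \tilde{b}\|_2^2. \tag{41}$$

Assume for simplicity that $\tilde{b}$ is in the first quadrant in $\mathbb{R}^2$, as in Figure 2. Since $\operatorname{sign}(q_k) \ne \operatorname{sign}(b_k)$, $\hat{z}(0)$ is either in the second or third quadrant, depending on the sign of $q_m$. However, whatever the sign of $q_m$, since $\alpha_k > \alpha_m$ there exists an arbitrarily small $|\delta\phi| > 0$ such that

$$\|\hat{A}\hat{z}(\delta\phi) - \tilde{b}\|_2 < \|\hat{A}\hat{z}(0) - \tilde{b}\|_2 \Rightarrow \|AU(\delta\phi)q - b\|_2 < \|Aq - b\|_2, \tag{42}$$

hence $q$ cannot be a minimizer.

Figure 2: The ellipse determined by $\hat{A}$ with semi-major axis $\alpha_k$ and semi-minor axis $\alpha_m$. At $\phi = 0$ the residual $r = \hat{A}\hat{z}(0) - \tilde{b}$ is shown. The dotted circle with radius $\|r\|_2$ is centered at $\tilde{b}$ and the direction $\delta\phi$ implies that $\hat{z}(0)$ is not a minimizer.

Two scenarios in which (42) does not hold are:

1). If $\hat{z}(0)$ is a global minimum to (41) in the first quadrant $\Rightarrow \operatorname{sign}(q_k) = \operatorname{sign}(b_k)$ and $\operatorname{sign}(q_m) = \operatorname{sign}(b_m)$.

2). If $\hat{z}(0)$ is a local minimum to (41) in the fourth quadrant $\Rightarrow \operatorname{sign}(q_k) = \operatorname{sign}(b_k)$ and $\operatorname{sign}(q_m) = -\operatorname{sign}(b_m)$.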


We have shown that any minimizer $q$ of (10) must fulfil $\operatorname{sign}(q_i) = \operatorname{sign}(b_i)$ for all $i = 1, \ldots, m-1$. By connecting two points with an ellipse, it should now be clear that Assumption D.1 is valid. Assume the opposite, that besides the minimizer $\hat{q}$ there is a second minimizer $\tilde{q}$ with $\operatorname{sign}(\tilde{q}_i) = \operatorname{sign}(\hat{q}_i)$ for all $i = 1, \ldots, m$. Then connecting $A\hat{q}$ and $A\tilde{q}$ (and back to $\hat{q}$ again) with an ellipse would yield a condition of the form (42). Then only $\hat{q}$ can be a minimizer. From this we can conclude that if $b_i \ne 0$ for all $i = 1, \ldots, m$, the global minimizer of (10) fulfills 1) whereas the second, local, minimizer (if any) fulfills 2). Hence a maximum of two minimizers can occur.

Now assume $b$ has $p$ zero elements $b_k = 0$ where $k \in \{1, \ldots, m-1\}$. By using plane rotations as in (39), with the two-dimensional right-hand side $\tilde{b} = [0, b_m]^T$, it is shown that a minimizer $q$ must have $q_k = 0$. If $q_k \ne 0$ then there exists a $\delta\phi$ such that (42) holds, no matter what $b_m$ is. Hence we can remove all $p$ zero elements in $b$ and the corresponding $p$ equations in $Aq$, yielding an optimization problem with $q$ and $b$ in $\mathbb{R}^{m-p}$. If now $b_{m-p} \ne 0$, then we have just the case described first with $b_i \ne 0$ for all $i = 1, \ldots, m-p$. That is, (10) can have at most two minimizers.

Lastly assume that $b_m = 0$. By using plane rotations $U(\phi)$ as earlier, but in the plane spanned by $[e_k, e_{m-1}]$ and with $\tilde{b} = [b_k, b_{m-1}]^T$, the conclusion is that the elements of a minimizer $q$ must fulfill $\operatorname{sign}(q_i) = \operatorname{sign}(b_i)$ for all $i = 1, \ldots, m-2$. The element $q_{m-1}$, however, could be assumed to fulfill $\operatorname{sign}(q_{m-1}) = \pm 1$. But now take a plane rotation in the plane spanned by $[e_{m-1}, e_m]$ with $\tilde{b} = [b_{m-1}, 0]^T$, and we see that $\operatorname{sign}(q_{m-1}) = \operatorname{sign}(b_{m-1})$ must hold. Then $q_m$ is given by

$$q_m = \pm\sqrt{1 - q_1^2 - q_2^2 - \ldots - q_{m-1}^2}, \tag{43}$$

due to the constraint $q^T q = 1$. If the expression in (43) evaluates to 0 there is only one minimizer to (10), otherwise there are two. $\Box$
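Theorem D.1 can be probed numerically in the smallest case $m = 2$. The sketch below is an added illustration using only the Python standard library. It assumes that (10) is the problem of minimizing $\|Aq - b\|_2^2$ subject to $q^T q = 1$ with diagonal $A = \operatorname{diag}(\alpha_1, \alpha_2)$; the values of `alpha`, the test vectors `b` and the grid resolution are arbitrary choices made here. With $q = [\cos\phi, \sin\phi]^T$ the objective is sampled on a periodic grid and strict local minima are counted.

```python
import math

def local_minima(alpha, b, samples=20000):
    """Return the grid angles phi that are strict local minima of
    f(phi) = (alpha[0] cos(phi) - b[0])^2 + (alpha[1] sin(phi) - b[1])^2."""
    def f(phi):
        return ((alpha[0] * math.cos(phi) - b[0]) ** 2
                + (alpha[1] * math.sin(phi) - b[1]) ** 2)

    step = 2.0 * math.pi / samples
    vals = [f(i * step) for i in range(samples)]
    minima = []
    for i in range(samples):
        # vals[i - 1] wraps around for i = 0, so the grid is treated as periodic.
        if vals[i] < vals[i - 1] and vals[i] < vals[(i + 1) % samples]:
            minima.append(i * step)
    return minima

alpha = (2.0, 1.0)                  # distinct singular values, alpha_1 > alpha_2 > 0

near = local_minima(alpha, (0.1, 0.05))   # b close to the origin
far = local_minima(alpha, (5.0, 5.0))     # b far outside the ellipse

print("b = (0.1, 0.05):", len(near), "local minima")
print("b = (5.0, 5.0): ", len(far), "local minima")

# Sign condition from the proof: every minimizer has sign(q_1) = sign(b_1).
signs_ok = all(math.cos(phi) > 0.0 for phi in near + far)
print("sign(q_1) = sign(b_1) at all minima:", signs_ok)
```

A grid search of this kind can miss nearly degenerate minima, so it illustrates the statement rather than verifying it.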

References

[1] M. D. Akca. Generalized Procrustes Analysis and its Applications in Photogrammetry. ETH, Swiss Federal Institute of Technology Zurich, Institute of Geodesy and Photogrammetry, 2003. Prepared for: Praktikum in Photogrammetrie, Fernerkundung und GIS.

[2] J. Balog, T. Csendes, and T. Rapcsák. Some global optimization problems on Stiefel manifolds. J. Global Optimization, 30(1):91–101, 2004.

[3] M. T. Chu and N. T. Trendafilov. On a Differential Equation Approach to the Weighted Orthogonal Procrustes Problem. Statistics and Computing, 8(2):125–133, 1998.

[4] M. T. Chu and N. T. Trendafilov. The Orthogonally Constrained Regression Revisited. J. Comput. Graph. Stat., 10:746–771, 2001.

[5] T. F. Cox and M. A. A. Cox. Multidimensional scaling. Chapman & Hall, 1994.

[6] A. Edelman, T. A. Arias, and S. T. Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

[7] L. Eldén. Solving Quadratically Constrained Least Squares Problems Using a Differential-Geometric Approach. BIT Numerical Mathematics, 42(2), 2002.

[8] L. Eldén and H. Park. A Procrustes problem on the Stiefel manifold. Numer. Math., 82(4):599–619, 1999.

[9] G. E. Forsythe and G. H. Golub. On the Stationary Values of a Second-degree Polynomial on the Unit Sphere. J. Soc. Indust. Appl. Math., 13(4), 1965.

[10] W. Gander. Least Squares with a Quadratic Constraint. Numer. Math., 36:291–307, 1981.

[11] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1989.

[12] J. C. Gower. Multivariate Analysis: Ordination, Multidimensional Scaling and Allied Topics. Handbook of Applicable Mathematics, VI: Statistics (B), 1984.

[13] J. C. Gower and G. B. Dijksterhuis. Procrustes problems. Oxford University Press, 2004.

[14] M. A. Koschat and D. F. Swayne. A Weighted Procrustes Criterion. Psychometrika, 56(2):229–239, 1991.

[15] A. Mooijaart and J. J. F. Commandeur. A General Solution of the Weighted Orthonormal Procrustes Problem. Psychometrika, 55(4):657–663, 1990.

[16] T. Rapcsák. On Minimization on Stiefel manifolds. European J. Oper. Res., 143(2):365–376, 2002.

[17] I. Söderkvist. Some Numerical Methods for Kinematical Analysis. ISSN-0348-0542, UMINF-186.90, Department of Computing Science, Umeå University, 1990.

[18] I. Söderkvist and Per-Åke Wedin. On Condition Numbers and Algorithms for Determining a Rigid Body Movement. BIT, 34:424–436, 1994.

[19] E. Stiefel. Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii Math. Helvetici, 8:305–353, 1935-1936.


[20] T. Viklands. On Global Minimization of Weighted Orthogonal Procrustes Problems. Technical Report UMINF-06.09, Department of Computing Science, Umeå University, Umeå, Sweden, 2006.

[21] T. Viklands and P. Å. Wedin. Algorithms for Linear Least Squares Problems on the Stiefel manifold. Technical Report UMINF-06.07, Department of Computing Science, Umeå University, Umeå, Sweden, 2006.

[22] P. Å. Wedin and T. Viklands. Algorithms for 3-dimensional Weighted Orthogonal Procrustes Problems. Technical Report UMINF-06.06, Department of Computing Science, Umeå University, Umeå, Sweden, 2006.
