

An Optimal Parallel Jacobi-Like Solution Method for the Singular Value Decomposition

G. R. Gao and S. J. Thomas, January 1988

others have proposed various modifications for hypercube implementations, which require the embedding of rings via binary reflected Gray codes.

In this paper, we present a new parallel Jacobi-like solution method for the SVD which is optimal in achieving both the maximum concurrency in computation and the minimum overhead in communication. Unlike previously published parallel SVD algorithms based on a nearest-neighbour ring topology for communication, the new algorithm proposed in this paper introduces a recursive divide-exchange communication pattern. As a result of the recursive nature of the algorithm, proofs are given to show that it achieves the lower bounds in both computation and communication costs. Convergence aspects of the new algorithm are briefly discussed. The paper illustrates that the new algorithm can be mapped efficiently and naturally onto hypercube architectures. We have implemented the new algorithm on the Intel hypercube through simulation, and the preliminary results will be discussed. A comparison with related work is briefly outlined. We believe that the new algorithm can be efficiently mapped onto multiprocessors with interconnection patterns that have been proposed to support large-scale parallelism, such as the many PM2I-based or cube-based networks [20].

2 Jacobi-like Algorithms

Abstract

A new parallel Jacobi-like solution method for the singular value decomposition (SVD) is presented which is optimal in achieving both the maximum concurrency in computation and the minimum overhead in communication. Unlike previously published parallel SVD algorithms based on a nearest-neighbour ring topology for communication, the new algorithm introduces a recursive divide-exchange communication pattern. As a result of the recursive nature of the algorithm, proofs are given to show that it achieves the lower bounds in both computation and communication costs. In general, the recursive pairwise exchange communication operations of the new algorithm can be efficiently supported by multiprocessors with interconnect patterns used in many networks that have been proposed to support large-scale parallelism. As an example, this paper illustrates that the new algorithm can be mapped efficiently and naturally onto hypercube architectures. Preliminary results with an implementation of the new algorithm are reported. Convergence aspects of the new algorithm are briefly discussed. A comparison with related work is outlined.

1 Introduction

Rapid technological advances in multiprocessor architectures have aroused much interest in parallel computation. Parallel methods to compute the singular value decomposition (SVD) have received attention due to its many important applications in science and engineering. A recent paper by Heath et al. [8] includes a history of various Jacobi-like SVD algorithms.

An early investigation into parallel computation for the symmetric eigenvalue problem, on the SIMD Illiac IV, is described by Sameh in [18]. Sameh outlines the criteria for maximal parallelism in a Jacobi-like algorithm. More recently, a number of authors including Berry et al. [1] advocate the one-sided SVD of Hestenes [9], [8], [15] for parallel computation of the SVD. Luk and his co-workers have examined various systolic array configurations to compute the SVD [12], [3], [4]. Brent and Luk [4] have invented a linear array of n/2 processors which implements a one-sided Hestenes algorithm that, in real arithmetic, is an exact analogue of their Jacobi method applied to the eigenvalue problem. The array requires O(mnS) time, where S is the number of sweeps (typically S ≤ 10). Brent and Luk demonstrate that their algorithm is computationally optimal in the sense that it requires the minimum number of computational steps per sweep, i.e. n − 1, to ensure the execution of every possible pairwise column rotation. Maximum concurrency is maintained throughout the computation. Their systolic array is comparable to the architecture of a nearest-neighbour linear array of processors, where communication is based on a ring topology.

Brent and Luk's algorithm is not optimal in terms of communication overhead. Unnecessary costs are incurred by mapping the systolic array architecture onto a ring-connected linear array due to the double sends and receives required between pairs of neighbouring processors. Eberlein [5], Bischof [2] and

School of Computer Science, McGill University, Montreal, Quebec, Canada, H3A 2K6. This work was partially supported by the Natural Sciences and Engineering Research Council of Canada under Grant A9236.

2.1 The Singular Value Decomposition

The singular value decomposition (SVD) of a general non-square matrix may be given as follows.

Theorem 2.1 For a real matrix A (m x n) of rank r, there exist orthogonal matrices U (m x m) and V (n x n), such that

    U^T A V = Σ = diag(σ_1, σ_2, …) ≥ 0,

where the elements of Σ (m x n) may be ordered so that

    σ_1 ≥ σ_2 ≥ … ≥ σ_r > σ_{r+1} = … = σ_q = 0,   q = min{m, n}.

If m = n, Σ is a square diagonal n x n matrix [11].

In order to compute the SVD in an iterative fashion, a series of plane rotations may be applied to the matrix A (m x n) described in Theorem 2.1 above. This approach is similar in nature to Jacobi's original method for computing the eigenvalues of a symmetric matrix, where orthogonal matrices J(i, j, θ) are applied so as to annihilate a symmetrically placed pair of the n(n − 1) off-diagonal elements. These rotation matrices differ from the identity matrix of order n by the principal submatrix formed at the intersection of the row and column pairs corresponding to i and j. The 2 x 2 submatrix has the form

    [  c  s ]
    [ -s  c ]

The cosine and sine of the rotation angle θ are the constants c = cos θ and s = sin θ. Initially A_1 = A, and at the k-th iteration,


    A_{k+1} = J(i_k, j_k, θ_k)^T A_k J(i_k, j_k, θ_k).

Rotations are applied simultaneously, in a symmetric fashion, from the left and right. Cyclic Jacobi methods refer to a sequence of rotations which update row and column pairs in some predetermined order. For a square matrix, a cyclic sweep refers to the updating of n(n − 1)/2 elements. A number of sweeps are required in order to effectively reduce the off-diagonal mass of the matrix to a sufficiently small value, which eventually can be ignored. A diagonal containing the eigenvalues then remains. Annihilation of a pair of off-diagonal elements of a symmetric matrix takes the form

    [  c  s ] [ a_ii^(k)  a_ij^(k) ] [ c  -s ]   [ a_ii^(k+1)      0       ]
    [ -s  c ] [ a_ij^(k)  a_jj^(k) ] [ s   c ] = [     0       a_jj^(k+1)  ]
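The annihilation above can be checked numerically. The following is an illustrative sketch (not the paper's implementation): the sample entries a_ii = 4, a_ij = 1, a_jj = 2 are arbitrary, and the angle is taken from the classical condition tan 2θ = 2 a_ij / (a_ii − a_jj).

```python
import math

def rotation_angle(a_ii, a_ij, a_jj):
    """Angle theta with tan(2*theta) = 2*a_ij / (a_ii - a_jj), which
    annihilates the off-diagonal pair of a symmetric 2x2 block."""
    return 0.5 * math.atan2(2.0 * a_ij, a_ii - a_jj)

# Sample symmetric 2x2 block: a_ii = 4, a_ij = a_ji = 1, a_jj = 2.
a_ii, a_ij, a_jj = 4.0, 1.0, 2.0
theta = rotation_angle(a_ii, a_ij, a_jj)
c, s = math.cos(theta), math.sin(theta)
# Off-diagonal entry of J^T A J with J = [[c, -s], [s, c]]
new_ij = -s * (c * a_ii + s * a_ij) + c * (c * a_ij + s * a_jj)
print(abs(new_ij) < 1e-12)  # True: the pair is annihilated
```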


By normalizing the columns, we see that the SVD of Theorem 2.1 is implicit in (2.3):

    Q = UΣ,   A = UΣV^T.

A one-sided algorithm is somewhat different from its earlier counterparts, as rotations are applied from the right and therefore only columns are affected. Off-diagonal elements are no longer annihilated; instead, rotations are designed in order to produce two orthogonal columns. As with similar Jacobi-like algorithms, the orthogonal matrix V may be accumulated from plane rotations J(i, j, θ) which differ from the unit matrix I_n in a 2 x 2 principal submatrix containing the cosines and sines of the rotation. Setting A_1 = A, the k-th iteration updates A_k:

    A_{k+1} = A_k J(i, j, θ_k).

If the matrix sequence A_k converges, the result is Q in (2.3). A column update via a 2 x 2 submatrix takes the form

    [ a_i^(k)  a_j^(k) ] J(θ) = [ a_i^(k+1)  a_j^(k+1) ],   J(θ) = [ c  -s ]
                                                                   [ s   c ]

The orthogonality condition determines the rotation angle θ:

    2 (a_i^(k))^T a_j^(k) / ((a_i^(k))^T a_i^(k) − (a_j^(k))^T a_j^(k)) = tan 2θ.   (2.4)

By avoiding a potential loss of significant digits, the magnitude of the angle may be restricted to |θ| ≤ π/4, which provides formulae for the rotation (see Nash [15] and Rutishauser [17]). As noted by Brent and Luk [4], if a cyclic-by-row rotation ordering is chosen to update the n(n − 1)/2 column pairings determined by the off-diagonal elements above the main diagonal, convergence would follow. Hestenes' computation is mathematically equivalent to a Jacobi algorithm applied to A^T A; therefore we expect that the convergence analyses of Forsythe and Henrici [6] or Wilkinson [22] are applicable under these circumstances. Rather than testing for convergence, the threshold Jacobi method originally introduced for the symmetric eigenproblem is often employed [23, pp. 277-278], [17].
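A single one-sided update with the angle from (2.4) can be sketched as follows; this is an illustrative serial fragment (the sample column values are arbitrary, not from the paper):

```python
import math

def hestenes_rotate(ai, aj):
    """Rotate columns ai, aj to orthogonality, with the angle from (2.4):
    tan(2*theta) = 2*(ai . aj) / (ai . ai - aj . aj)."""
    dot = sum(x * y for x, y in zip(ai, aj))
    d = sum(x * x for x in ai) - sum(x * x for x in aj)
    theta = 0.5 * math.atan2(2.0 * dot, d)
    c, s = math.cos(theta), math.sin(theta)
    # column update [ai aj] * [[c, -s], [s, c]]
    new_i = [c * x + s * y for x, y in zip(ai, aj)]
    new_j = [-s * x + c * y for x, y in zip(ai, aj)]
    return new_i, new_j

ai, aj = [3.0, 1.0], [1.0, 2.0]
bi, bj = hestenes_rotate(ai, aj)
print(abs(sum(x * y for x, y in zip(bi, bj))) < 1e-12)  # True: orthogonal
```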

3 Parallel Computation

3.1 Maximizing Concurrency

In this paper the computation cost is measured by the number of parallel computation steps. The methods discussed process (i, j) pairings consisting of partitions containing at least one column or row. When n is even, if we assume one parallel computation step has unit cost, then of the algorithms presented the minimum cost achieved is n − 1 per sweep. The systolic array and associated algorithm proposed by Brent and Luk were proven to achieve this lower bound in [4]. We have illustrated their basic scheme in figure 1 for the case n = 8, where a linear array of four processors {P1, P2, P3, P4} is used.

3.2 Minimizing Communication Costs

Another important performance criterion for a parallel algorithm is the total communication cost. For our purposes the communication cost can be measured by the total number of interprocessor transactions (messages). A transaction consists of a column transmission between a pair of processors. The total communication cost of one sweep will be denoted C.

From the last section, we know that the minimum number of computation steps in a sweep is K = n − 1. The minimum number of interprocessor transactions is achieved when each processor retains one column from a pairing, and transmits the other to a destination processor. As a result, if there are p processors, p transmissions are performed between two consecutive steps. Hence the minimum total communication cost C_min is

    Figure 1: Brent and Luk's Systolic Array

defined by the following:

    C_min = (K − 1) p.   (3.1)

In the parallel one-sided SVD algorithm each processor is assigned one of n/2 column pairs at each step, assuming n is even. The total number of processors required is p = n/2 in (3.1), and the communication costs are O(n^2):

    C_min = (K − 1) p = n(n − 2) / 2.

As a contrast, a global broadcasting strategy may request each processor to send both columns to all other p − 1 processors between each step. The total cost for this case will be O(n^3). Brent and Luk's algorithm has the following communication cost:

    C_BL = (K − 1) x 2p
         = (n − 2) x 2(n/2) = n(n − 2).

Therefore their algorithm is close to, but not quite, optimal. In fact the inefficiency lies in the double sends and receives between processors in the systolic array, which are dictated by the tournament ordering.

Several ways of modifying Brent and Luk's algorithm to avoid the double sends and receives have been proposed [13], [14], [5], [2]. These algorithms all represent a communication regimen based on a ring topology. A ring topology resembles the architecture of a linear array of processors. Embedding a ring within another topology, for example the binary n-cube, requires a special mapping scheme.

4 An Optimal Parallel SVD Algorithm

In this section we present a new parallel Jacobi-like algorithm which is optimal in terms of achieving both maximum concurrency and minimum communication overhead. The algorithm relies on a recursive divide-exchange of n = 2^d columns.

Unlike several orderings cited earlier, the new algorithm maps naturally onto parallel architectures which support recursive pairwise exchanges. A mapping onto a hypercube is presented as an example in section 5. Pairwise exchanges of columns here are specified by a perfect shuffle of processor addresses [21].



4.1 The Parallel Algorithm

Let us first illustrate the basic principle of the new algorithm through an example where n = 8 and p = 4. The computation steps K_l and communication steps X_l, consisting of pairwise exchanges, are shown in figure 2.


    c(n) = { n/2 + c(n/2),   n > 2,
           { 0,              n = 2.    (4.3)

Solving (4.3), we obtain c(n) = n − 2. Multiplying by the number of processors p = n/2 gives the communication cost C_DE for our recursive divide-exchange algorithm. We have achieved the optimum since C_DE = C_min:

    C_DE = (n − 2) p = n(n − 2) / 2.

Referring to our example in figure 2, four column transactions have occurred at each communication step, with a total cost of 6 x 4 = 24 transactions, which is optimal.

5 Mapping onto the Hypercube

In order to map the recursive divide-exchange algorithm of section 4 onto a hypercube architecture, we must first specify the operations performed by each processor in the cube. Given the two major components of our algorithm, namely a compute-exchange and a divide, deriving an algorithm for individual processors is straightforward. Due to the tail-recursion in the parallel SVD algorithm, it may be transformed into an iterative form.

Algorithm Divide-Exchange

    for k = 1 to d do
        for l = 1 to 2^(d−k) − 1 do
            Compute (i, j)
            q = h(l)
            Exchange 2^q
        end
        Compute (i, j)
        Divide 2^(d−k−1)
    end

The step "Compute (i, j)" refers to a column update in the parallel version of Hestenes' one-sided computation. Using the terminology introduced in section 4, each processor cycles through a Jacobi sweep consisting of d stages. A divide step exchanging at a distance of 2^(−1) would not be carried out. The function h(l) computes the height of an exchange node X_l, where l is the label number derived by an inorder traversal of a complete binary tree.

Function h(l)

    begin
        q = ⌊log2 l⌋
        t = l − 2^q
        if t = 0 then
            return q
        else
            return h(t)
    end
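As a sanity check, the sweep structure above can be simulated directly. The following illustrative Python sketch (not the paper's C implementation) counts the Compute and Exchange/Divide steps of one sweep and reproduces the bounds n − 1 and c(n) = n − 2:

```python
def h(l):
    """Height of exchange node X_l (inorder label in a complete binary
    tree): strip the leading 1-bit of l until a power of two remains."""
    q = l.bit_length() - 1          # floor(log2(l))
    t = l - (1 << q)
    return q if t == 0 else h(t)

def sweep_counts(d):
    """Count the Compute and Exchange/Divide steps in one sweep of the
    divide-exchange ordering for n = 2**d columns."""
    computes = comms = 0
    for k in range(1, d + 1):
        for l in range(1, 2 ** (d - k)):
            computes += 1           # Compute (i, j)
            comms += 1              # Exchange at distance 2**h(l)
        computes += 1               # Compute (i, j)
        if k < d:
            comms += 1              # Divide at distance 2**(d - k - 1)
    return computes, comms

print([h(l) for l in range(1, 8)])       # [0, 1, 0, 2, 0, 1, 0]
print(sweep_counts(3), sweep_counts(4))  # (7, 6) (15, 14)
```

For n = 8 this gives 7 = n − 1 computation steps and 6 = n − 2 communication steps, and for n = 16 the 15 computation steps of figure 5.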

The relative ease of mapping a recursive divide-exchange onto the hypercube is due to the recursive nature of the hypercube itself. The fact that a hypercube is recursively constructed out of lower-dimensional subcubes may be exploited. A divide step in our algorithm corresponds to a subdivision of the problem, allowing computations to proceed on the subcubes. Exchanges will always consist of communication between pairs of nearest neighbours on the hypercube. A cube of dimension d − 1 is required for a problem with n = 2^d. The computation and communication steps are determined by the exchange sequence shown in figure 3.

5.2 Processor Pairings

Nearest-neighbour processor pairings on the hypercube may be determined by a perfect shuffle of node addresses. Stone's original paper [21] details the generation of such pairings via a left cyclic shift of the bits in an address. A perfect shuffle of an N-element vector is a permutation P of the indices or addresses a of the elements such that

    P(a) = { 2a,            0 ≤ a ≤ N/2 − 1,
           { 2a + 1 − N,    N/2 ≤ a ≤ N − 1.    (5.1)

Consider the binary representation of an integer address for which N = 2^d. Individual bits at position i are denoted a_i:

    a = a_{d−1} 2^{d−1} + a_{d−2} 2^{d−2} + … + a_1 2 + a_0.    (5.2)

A perfect shuffle (5.1) of an address a creating a new address a' corresponds to a left cyclic shift of all bits a_i to a_{i+1}, with the leftmost bit a_{d−1} wrapped around to a_0 [21]:

    a' = a_{d−2} 2^{d−1} + a_{d−3} 2^{d−2} + … + a_0 2 + a_{d−1}.
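The left cyclic shift is a one-line bit manipulation. This illustrative sketch implements it and agrees with the piecewise definition (5.1); for d = 3 it generates the address mapping of figure 4:

```python
def shuffle(a, d):
    """Perfect shuffle of a d-bit address: left cyclic shift of its
    bits, equivalent to P(a) in (5.1)."""
    return ((a << 1) | (a >> (d - 1))) & ((1 << d) - 1)

# d = 3: node a -> node a', matching the first two columns of figure 4.
print([shuffle(a, 3) for a in range(8)])   # [0, 2, 4, 6, 1, 3, 5, 7]
```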

Our earlier requirement for a pairwise exchange of columns at a distance 2^h is easily satisfied, due to the geometry of a hypercube. The implication is that for addresses of the form (5.2), a difference in a single bit a_i indicates a distance of 2^i. We also note that the addresses of neighbouring processors in the hypercube differ in only one bit position. Exchanges, therefore, will always be between directly connected neighbours.

Processor nodes in a hypercube are labelled from 0 to 2^d − 1; for example, in a 3-dimensional cube there are 8 processors with addresses 0 to 7. We can use the perfect shuffle to generate the processor pairings required for exchanges at a distance which is a power of 2. This may be illustrated by an example with d = 3. Initially, processor pairings for exchanges are at a distance of 1. After a perfect shuffle from addresses a to a', exchanges may take place at a distance of 2; from a' to a'', at a distance of 4; and so on. Processor pairings before and after a perfect shuffle are given in figure 4.

    Figure 4: 3-Dimensional Processor Pairings

    node a      node a'     node a''
    0  000      0  000      0  000
    1  001      2  010      4  100
    2  010      4  100      1  001
    3  011      6  110      5  101
    4  100      1  001      2  010
    5  101      3  011      6  110
    6  110      5  101      3  011
    7  111      7  111      7  111

The exchange and divide steps required to complete one sweep of a Jacobi-like algorithm, when n = 2^4 = 16, are illustrated in figure 5.


There are 15 computation steps. The column pairs (i, j) at each step are written in the processor nodes. The communication links used between the computation steps are marked.

    Figure 5: Divide-Exchange on a 3-Cube

5.3 Computational Results

An implementation of Hestenes' one-sided SVD via the non-recursive version of our algorithm was written in C for subsequent testing and analysis on the Intel iPSC hypercube. A simulator for the hypercube was provided by Intel Scientific Corp. to McGill University for a SUN 3/280 running the BSD 4.3 operating system. This SUN has an IEEE 754 standard co-processor with a floating point precision of ε = 2.22 x 10^(−16) in double precision arithmetic.

A threshold Jacobi method, as described in section 2, was employed to ensure proper termination of the algorithm. Following the methods introduced by Berry et al. in [1] for computation on an array processor, each node processor in the hypercube maintains a counter stop. The counter is incremented by a processor when one of its assigned column pairs (i, j) is deemed to be orthogonal according to a threshold parameter τ. For the purposes of our tests we chose τ = ε ||A||_F.

The parallel computation terminates at the end of a sweep if each of the n/2 processors reports a stop count of n − 1. For a series of random 8 x 8 matrices generated using the interactive matrix software package Matlab, we typically observe convergence in the hypercube computation after 6 sweeps.
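The serial skeleton underlying this computation can be sketched as follows. This is an illustrative single-processor version with a cyclic-by-row ordering and a simplified absolute threshold (not the hypercube implementation, and not the exact τ = ε ||A||_F test); the test matrix is arbitrary:

```python
import math

def hestenes_svd(A, tol=1e-12, max_sweeps=30):
    """Serial sketch of Hestenes' one-sided method with a threshold
    test: rotate column pairs until every pair is orthogonal."""
    m, n = len(A), len(A[0])
    cols = [[A[r][c] for r in range(m)] for c in range(n)]
    for _ in range(max_sweeps):
        stop = 0
        for i in range(n - 1):
            for j in range(i + 1, n):
                dot = sum(x * y for x, y in zip(cols[i], cols[j]))
                if abs(dot) <= tol:          # pair deemed orthogonal
                    stop += 1
                    continue
                d = sum(x * x for x in cols[i]) - sum(x * x for x in cols[j])
                th = 0.5 * math.atan2(2.0 * dot, d)   # angle from (2.4)
                c, s = math.cos(th), math.sin(th)
                for r in range(m):
                    x, y = cols[i][r], cols[j][r]
                    cols[i][r], cols[j][r] = c * x + s * y, -s * x + c * y
        if stop == n * (n - 1) // 2:         # all pairs passed the threshold
            break
    # the singular values are the norms of the converged columns
    return sorted((math.sqrt(sum(x * x for x in col)) for col in cols),
                  reverse=True)

print(hestenes_svd([[1.0, 1.0], [0.0, 1.0]]))
# approximately [1.618, 0.618]: (1 + sqrt(5))/2 and (sqrt(5) - 1)/2
```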

Finally, we have observed a communication pattern for random 16 x 16 matrices matching exactly that shown in figure 6.

6 Conclusions and Future Research

We have described a new optimal parallel Jacobi-like algorithm for the singular value decomposition (SVD). We have demonstrated that the new algorithm can be mapped naturally onto hypercube architectures, effectively utilizing the nearest-neighbour communication capacity throughout the computation. In general, the recursive pairwise exchange communication operations of the new algorithm can be efficiently supported by multiprocessors with interconnect patterns used in many networks that have been proposed to support large-scale parallelism [20]. For example, we believe that the new algorithm can be mapped effectively onto SIMD or MIMD parallel computers with interconnection networks such as PM2I-based networks and cube-based networks. These interconnection networks have the partitionability property: the ability to divide the network into independent subnetworks of different sizes [20], which matches the recursive divide-exchange structure of the new parallel algorithm proposed in this paper.

We suggest the following future research directions: study extensions of the new algorithm to various forms of the SVD, to the unsymmetric eigenvalue problem, and to the generalized eigenvalue problem Ax = λBx. Furthermore, we would like to gather empirical information concerning convergence properties from numerical simulations.

7 Acknowledgements

Intel Scientific Corp. provided us with a hypercube simulator which allowed us to test the new algorithm. Martin Santavy gave many valuable comments concerning the details of software development for the hypercube, particularly in the area of synchronization problems. The figures were prepared with the help of Peggy Gao. We would especially like to thank Prof. Chris Paige for many helpful comments and corrections related to historical background and convergence results.

    References

[1] M. Berry and A. H. Sameh, "Multiprocessor Jacobi algorithms for dense symmetric eigenvalue problems and singular value decompositions", Proceedings of the International Conference on Parallel Processing, 1986.

[2] C. Bischof, "The two-sided Jacobi method on a hypercube", SIAM Proceedings of the Second Conference on Hypercube Multiprocessors, 1987.

[3] R. P. Brent, F. T. Luk and C. F. Van Loan, "Computation of the singular value decomposition using mesh-connected processors", J. VLSI Computer Systems, 1 (1985), pp. 242-270.

[4] R. P. Brent and F. T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays", SIAM J. Sci. Stat. Comput., 6 (1985), pp. 69-84.

[5] P. J. Eberlein, "On using the Jacobi method on the hypercube", SIAM Proceedings of the Second Conference on Hypercube Multiprocessors, 1987.

[6] G. E. Forsythe and P. Henrici, "The cyclic Jacobi method for computing the principal values of a complex matrix", Trans. Amer. Math. Soc., 94 (1960), pp. 1-23.

[7] G. H. Golub and W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix", J. SIAM Ser. B: Numer. Anal., 2 (1965), pp. 205-224.

[8] M. T. Heath, A. J. Laub, C. C. Paige and R. C. Ward, "Computing the singular value decomposition of a product of two matrices", SIAM J. Sci. Stat. Comput., 7 (1986), pp. 1147-1159.

[9] M. R. Hestenes, "Inversion of matrices by biorthogonalization and related results", J. Soc. Indust. Appl. Math., 6 (1958), pp. 51-90.

[10] E. G. Kogbetliantz, "Solution of linear equations by diagonalization of coefficients matrix", Quart. Appl. Math., 13 (1955), pp. 123-132.

[11] C. Lawson and R. Hanson, Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, N.J., 1974.

[12] F. T. Luk, "A triangular processor array for computing singular values", Linear Algebra Appl., 77 (1986), pp. 259-273.

[13] F. T. Luk and H. Park, "On parallel Jacobi orderings", Cornell University, School of Elec. Eng. Report, EE-CEG-86-5, 1986.

[14] J. J. Modi and J. D. Pryce, "Efficient implementation of Jacobi's diagonalization method on the DAP", Numer. Math., 46 (1985), pp. 443-454.

[15] J. C. Nash, "A one-sided transformation method for the singular value decomposition and algebraic eigenproblem", Comput. J., 18 (1975), pp. 74-76.

[16] C. C. Paige and P. Van Dooren, "On the quadratic convergence of Kogbetliantz's algorithm for computing the singular value decomposition", Linear Algebra Appl., 77 (1986), pp. 301-313.

[17] H. Rutishauser, "The Jacobi method for real symmetric matrices", Numer. Math., 16 (1966), pp. 205-223.

[18] A. H. Sameh, "On Jacobi and Jacobi-like algorithms for a parallel computer", Math. Comp., 25 (1971), pp. 579-590.

[19] A. H. Sameh, "Solving the linear least squares problem on a linear array of processors", Algorithmically Specialized Parallel Computers, Academic Press, 1985, pp. 191-200.

[20] H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, D.C. Heath and Co., Mass., 1985.

[21] H. S. Stone, "Parallel processing with the perfect shuffle", IEEE Trans. Comput., C-20 (1971), pp. 153-161.

[22] J. H. Wilkinson, "A note on the quadratic convergence of the cyclic Jacobi process", Numer. Math., 4 (1962), pp. 296-300.

[23] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.