8/12/2019 An Optimal Parallel Jacobi-Like Solution Method for the Singular Value Decomposition
An Optimal Parallel Jacobi-Like Solution Method
for the Singular Value Decomposition

G. R. Gao and S. J. Thomas, January 1988
Abstract

A new parallel Jacobi-like solution method for the singular value decomposition (SVD) is presented which is optimal in achieving both the maximum concurrency in computation and the minimum overhead in communication. Unlike previously published parallel SVD algorithms based on a nearest-neighbour ring topology for communication, the new algorithm introduces a recursive divide-exchange communication pattern. As a result of the recursive nature of the algorithm, proofs are given to show that it achieves the lower bounds in both computation and communication costs. In general, the recursive pairwise exchange communication operations of the new algorithm can be efficiently supported by multiprocessors with interconnect patterns used in many networks that have been proposed to support large-scale parallelism. As an example, this paper illustrates that the new algorithm can be mapped efficiently and naturally onto hypercube architectures. Preliminary results with an implementation of the new algorithm are reported. Convergence aspects of the new algorithm are briefly discussed. A comparison with related work is outlined.

1 Introduction

Rapid technological advances in multiprocessor architectures have aroused much interest in parallel computation. Parallel methods to compute the singular value decomposition (SVD) have received attention due to its many important applications in science and engineering. A recent paper by Heath et al [8] includes a history of various Jacobi-like SVD algorithms.

An early investigation into parallel computation for the symmetric eigenvalue problem, on the SIMD Illiac IV, is described by Sameh in [18]. Sameh outlines the criteria for maximal parallelism in a Jacobi-like algorithm. More recently, a number of authors including Berry et al [1] advocate the one-sided SVD of Hestenes [9], [8], [15] for parallel computation of the SVD. Luk and his co-workers have examined various systolic array configurations to compute the SVD [12], [3], [4]. Brent and Luk [4] have invented a linear array of n/2 processors which implements a one-sided Hestenes algorithm that, in real arithmetic, is an exact analogue of their Jacobi method applied to the eigenvalue problem. The array requires O(mnS) time, where S is the number of sweeps (typically ≤ 10). Brent and Luk demonstrate that their algorithm is computationally optimal in the sense that it requires the minimum number of computational steps per sweep, i.e. n − 1, to ensure the execution of every possible pairwise column rotation. Maximum concurrency is maintained throughout the computation. Their systolic array is comparable to the architecture of a nearest-neighbour linear array of processors, where communication is based on a ring topology.

Brent and Luk's algorithm is not optimal in terms of communication overhead. Unnecessary costs are incurred by mapping the systolic array architecture onto a ring-connected linear array due to the double sends and receives required between pairs of neighbouring processors. Eberlein [5], Bischof [2] and others have proposed various modifications for hypercube implementations, which require the embedding of rings via binary reflected Gray codes.

In this paper, we present a new parallel Jacobi-like solution method for the SVD which is optimal in achieving both the maximum concurrency in computation and the minimum overhead in communication. Unlike previously published parallel SVD algorithms based on a nearest-neighbour ring topology for communication, the new algorithm proposed in this paper introduces a recursive divide-exchange communication pattern. As a result of the recursive nature of the algorithm, proofs are given to show that it achieves the lower bounds in both computation and communication costs. Convergence aspects of the new algorithm are briefly discussed. The paper illustrates that the new algorithm can be mapped efficiently and naturally onto hypercube architectures. We have implemented the new algorithm on the Intel hypercube through simulation and the preliminary results will be discussed. A comparison with related work is briefly outlined. We believe that the new algorithm can be efficiently mapped onto multiprocessors with interconnection patterns that have been proposed to support large-scale parallelism such as the many PM2I-based or cube-based networks [20].

School of Computer Science, McGill University, Montreal, Quebec, Canada, H3A 2K6. This work was partially supported by the Natural Sciences and Engineering Research Council of Canada under Grant A9236.

2 Jacobi-like Algorithms
2.1 The Singular Value Decomposition
The singular value decomposition (SVD) of a general non-square matrix may be given as follows,
Theorem 2.1 For a real matrix A (m × n) of rank r, there exist orthogonal matrices U (m × m) and V (n × n) such that

U^T A V = Σ = diag(σ_1, σ_2, ...) ≥ 0,

where the elements of Σ (m × n) may be ordered so that

σ_1 ≥ σ_2 ≥ ... ≥ σ_r > σ_{r+1} = ... = σ_q = 0,  q = min{m, n}.

If m = n, Σ is a square diagonal n × n matrix [11].
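As an illustration of theorem 2.1 (not part of the original paper), the following Python sketch checks the stated properties numerically, using NumPy's built-in SVD routine rather than a Jacobi method:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))

# NumPy returns U (m x m), the singular values, and V^T (n x n).
U, s, Vt = np.linalg.svd(A)

# Rebuild the m x n matrix Sigma with the singular values on its diagonal.
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)

# U^T A V = Sigma, with sigma_1 >= sigma_2 >= ... >= 0 (theorem 2.1).
assert np.allclose(U.T @ A @ Vt.T, Sigma)
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)
```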
In order to compute the SVD in an iterative fashion, a series of plane rotations may be applied to the matrix A (m × n) described in theorem 2.1 above. This approach is similar in nature to Jacobi's original method for computing the eigenvalues of a symmetric matrix, where orthogonal matrices J(i, j, θ) are applied so as to annihilate a symmetrically placed pair of the n(n − 1) off-diagonal elements. These rotation matrices differ from the identity matrix of order n by the principal submatrix formed at the intersection of the row and column pairs corresponding to i and j. The 2 × 2 submatrix has the form

[ c  -s ]
[ s   c ]

The cosine and sine of the rotation angle θ are the constants c = cos θ and s = sin θ. Initially A_1 = A and at the k-th iteration,
A_{k+1} = J(i_k, j_k, θ_k)^T A_k J(i_k, j_k, θ_k).

Rotations are applied simultaneously, in a symmetric fashion, from the left and right. Cyclic Jacobi methods refer to a sequence of rotations which update row and column pairs in some predetermined order. For a square matrix, a cyclic sweep refers to the updating of n(n − 1)/2 elements. A number of sweeps are required in order to effectively reduce the off-diagonal mass of the matrix to a sufficiently small value, which can eventually be ignored. A diagonal containing the eigenvalues then remains.
Annihilation of two off-diagonal elements of a symmetric matrix takes the form

[ c  s ] [ a_ii^(k)  a_ij^(k) ] [ c  -s ]   [ a_ii^(k+1)      0      ]
[ -s c ] [ a_ij^(k)  a_jj^(k) ] [ s   c ] = [     0       a_jj^(k+1) ]
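The annihilation step can be sketched numerically. The following Python fragment (our illustration, not the paper's code) builds the rotation with the standard angle choice tan 2θ = 2 a_ij / (a_ii − a_jj) and checks that the targeted pair is zeroed while the eigenvalues are preserved:

```python
import numpy as np

def jacobi_rotation(A, i, j):
    """One two-sided Jacobi step: J^T A J zeroes A[i, j] (and A[j, i])."""
    theta = 0.5 * np.arctan2(2.0 * A[i, j], A[i, i] - A[j, j])
    c, s = np.cos(theta), np.sin(theta)
    J = np.eye(A.shape[0])
    # Principal 2x2 submatrix at rows/columns i and j.
    J[i, i], J[i, j] = c, -s
    J[j, i], J[j, j] = s, c
    return J.T @ A @ J

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B + B.T                          # symmetric test matrix
A1 = jacobi_rotation(A, 0, 2)

assert abs(A1[0, 2]) < 1e-12 and abs(A1[2, 0]) < 1e-12
# A similarity transform preserves the eigenvalues.
assert np.allclose(np.sort(np.linalg.eigvalsh(A1)),
                   np.sort(np.linalg.eigvalsh(A)))
```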
By normalizing the columns, we see that the SVD of theorem 2.1 is implicit in (2.3),

Q = UΣ,  A = UΣV^T.
A one-sided algorithm is somewhat different from its earlier counterparts, as rotations are applied from the right and therefore only columns are affected. Off-diagonal elements are no longer annihilated; instead, rotations are designed in order to produce two orthogonal columns. As with similar Jacobi-like algorithms, the orthogonal matrix V may be accumulated from plane rotations J(i, j, θ) which differ from the unit matrix I_n in a 2 × 2 principal submatrix containing the cosines and sines of the rotation. Setting A_1 = A, the k-th iteration updates A_k:

A_{k+1} = A_k J(i, j, θ_k).
If the matrix sequence A_k converges, the result is Q in (2.3). A column update via a 2 × 2 submatrix takes the form

[ a_i^(k), a_j^(k) ] [ c  -s ; s  c ] = [ a_i^(k+1), a_j^(k+1) ].

The orthogonality condition determines the rotation angle θ:

2 (a_i^(k))^T a_j^(k) / ( (a_i^(k))^T a_i^(k) − (a_j^(k))^T a_j^(k) ) = tan 2θ.  (2.4)
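Condition (2.4) can be exercised directly. The sketch below (illustrative Python, not the paper's implementation) rotates one column pair and checks that the updated columns are orthogonal:

```python
import numpy as np

def hestenes_rotate(A, i, j):
    """One one-sided rotation A J(i, j, theta), with theta chosen so that
    columns i and j of the result are orthogonal (condition (2.4))."""
    ai, aj = A[:, i], A[:, j]
    gamma = ai @ aj
    alpha, beta = ai @ ai, aj @ aj
    # tan 2theta = 2 a_i^T a_j / (a_i^T a_i - a_j^T a_j)
    theta = 0.5 * np.arctan2(2.0 * gamma, alpha - beta)
    c, s = np.cos(theta), np.sin(theta)
    A = A.copy()
    A[:, i], A[:, j] = c * ai + s * aj, -s * ai + c * aj
    return A

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
A1 = hestenes_rotate(A, 0, 1)

assert abs(A1[:, 0] @ A1[:, 1]) < 1e-12   # rotated columns are orthogonal
assert np.allclose(A1 @ A1.T, A @ A.T)    # right factor J is orthogonal
```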
To avoid a potential loss of significant digits, the magnitude of the angle may be restricted to |θ| ≤ π/4; formulae for the rotation are given by Nash [15] and Rutishauser [17]. As noted by Brent and Luk [4], if a cyclic-by-row rotation ordering is chosen to update the n(n − 1)/2 column pairings determined by the off-diagonal elements above the main diagonal, convergence would follow. Hestenes' computation is mathematically equivalent to a Jacobi algorithm applied to A^T A; therefore we expect that the convergence analyses of Forsythe and Henrici [6] or Wilkinson [22] are applicable under these circumstances. Rather than testing for convergence, the threshold Jacobi method originally introduced for the symmetric eigenproblem is often employed [23, pp. 277-278], [17].
3 Parallel Computation

3.1 Maximizing Concurrency
In this paper the computation cost is measured by the number of parallel computation steps. The methods discussed process (i, j) pairings consisting of partitions containing at least one column or row. When n is even, if we assume one parallel computation step has unit cost, then of the algorithms presented the minimum cost achieved is n − 1 per sweep. The systolic array and associated algorithm proposed by Brent and Luk were proven to achieve this lower bound in [4]. We have illustrated their basic scheme in figure 1 for the case n = 8, where a linear array of four processors {P1, P2, P3, P4} is used.
3.2 Minimizing Communication Costs

Another important performance criterion for a parallel algorithm is the total communication cost. For our purposes the communication cost can be measured by the total number of interprocessor transactions (messages). A transaction consists of a column transmission between a pair of processors. The total communication cost of one sweep will be denoted C.
From the last section, we know that the minimum number of computation steps in a sweep is K = n − 1. The minimum number of interprocessor transactions is achieved when each processor retains one column from a pairing, and transmits the other to a destination processor. As a result, if there are p processors, p transmissions are performed between two consecutive steps. Hence the minimum total communication cost C_min is
Figure 1: Brent and Luk's Systolic Array (computation steps laid out across processors P1 to P4)
defined by the following.
C_min = (K − 1)p.  (3.1)
In the parallel one-sided SVD algorithm each processor is assigned one of n/2 column pairs at each step, assuming n is even. The total number of processors required is p = n/2 in (3.1), and the communication cost is O(n²):

C_min = (K − 1)p = n(n − 2)/2.
By contrast, a global broadcasting strategy may require each processor to send both columns to all other p − 1 processors between each step. The total cost for this case will be O(n³).
Brent and Luk's algorithm has the following communication cost:

C_BL = (K − 1) × 2p = (n − 2) × 2(n/2) = n(n − 2).

Therefore their algorithm is close to, but not quite, optimal. In fact the inefficiency lies in the double sends and receives between processors in the systolic array, which are dictated by the tournament ordering.
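These per-sweep counts are simple enough to tabulate. A small Python sketch (assuming n even, p = n/2 and K = n − 1 as above; function names are ours) confirms that Brent and Luk's cost is exactly twice the optimum:

```python
def c_min(n):
    """Optimal cost (3.1): one column sent per processor between steps."""
    p, K = n // 2, n - 1
    return (K - 1) * p            # = n(n - 2)/2

def c_bl(n):
    """Brent and Luk: double sends/receives between neighbour pairs."""
    p, K = n // 2, n - 1
    return (K - 1) * 2 * p        # = n(n - 2)

for n in (4, 8, 16, 64):
    assert c_min(n) == n * (n - 2) // 2
    assert c_bl(n) == 2 * c_min(n)    # exactly twice optimal
```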
Several ways of modifying Brent and Luk's algorithm to avoid the double sends and receives have been proposed [13], [14], [5], [2]. These algorithms all represent a communication regimen based on a ring topology. A ring topology resembles the architecture of a linear array of processors. Embedding a ring within another topology, for example the binary n-cube, requires a special mapping scheme.
4 An Optimal Parallel SVD Algorithm

In this section we present a new parallel Jacobi-like algorithm which is optimal in terms of achieving both maximum concurrency and minimum communication overhead. The algorithm relies on a recursive divide-exchange of n = 2^d columns.

Unlike several orderings cited earlier, the new algorithm maps naturally onto parallel architectures which support recursive pairwise exchanges. A mapping onto a hypercube is presented as an example in section 5. Pairwise exchanges of columns here are specified by a perfect shuffle of processor addresses [21].
4.1 The Parallel Algorithm

Let us first illustrate the basic principle of the new algorithm through an example where n = 8 and p = 4. The computation steps K_l and communication steps X_l, consisting of pairwise exchanges, are shown in figure 2.
c(n) = { n/2 + c(n/2),   n > 2,
       { 0,              n = 2.    (4.3)

Solving (4.3), we obtain c(n) = n − 2. Multiplying by the number of processors p = n/2 gives the communication cost C_DE for our recursive divide-exchange algorithm. We have achieved the optimum, since C_DE = C_min:

C_DE = (n − 2)p = n(n − 2)/2.
Referring to our example in figure 2, four column transactions occur at each communication step, for a total cost of 6 × 4 = 24 transactions, which is optimal.
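Recurrence (4.3) and its closed form can be checked mechanically. In the sketch below (illustrative Python; the base case c(2) = 0 is inferred from the closed form c(n) = n − 2, as the transcript does not preserve its value) we also confirm that C_DE equals C_min:

```python
def c(n):
    """Communication steps per sweep, recurrence (4.3)."""
    return 0 if n == 2 else n // 2 + c(n // 2)

for d in range(2, 11):
    n = 2 ** d
    assert c(n) == n - 2                          # closed form of (4.3)
    assert c(n) * (n // 2) == n * (n - 2) // 2    # C_DE = C_min

assert c(8) == 6    # figure 2 example: 6 steps x 4 processors = 24 transactions
```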
5 Mapping onto the Hypercube

In order to map the recursive divide-exchange algorithm of section 4 onto a hypercube architecture, we must first specify the operations performed by each processor in the cube. Given the two major components of our algorithm, namely a compute-exchange and a divide, deriving an algorithm for individual processors is straightforward. Due to the tail recursion in the parallel SVD algorithm, it may be transformed into an iterative form.
Algorithm Divide-Exchange
for k = 1 to d do
    for l = 1 to 2^(d−k) − 1 do
        Compute (i, j)
        q = h(l)
        Exchange 2^q
    end
    Compute (i, j)
    Divide 2^(d−k−1)
end
The step "Compute (i, j)" refers to a column update in the parallel version of Hestenes' one-sided computation. Using the terminology introduced in section 4, each processor cycles through a Jacobi sweep consisting of d stages. In the last stage, the divide step, exchanging at a distance of 2^(−1), would not be carried out. The function h(l) computes the height of an exchange node X_l, where l is the label number derived by an inorder traversal of a complete binary tree.
Function h(l)
begin
    q = ⌊log2 l⌋
    t = l − 2^q
    if t = 0 then
        return q
    else
        return h(t)
    end
end
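The pseudocode above is concrete enough to simulate. The following Python sketch (our reconstruction; names are ours) implements h and counts the Compute and Exchange/Divide steps of one sweep, reproducing the bounds K = n − 1 and c(n) = n − 2 from section 4; the divide at distance 2^(−1) in the last stage is skipped, as noted in the text:

```python
def h(l):
    """Height of exchange node X_l in an inorder-labelled complete binary tree."""
    q = l.bit_length() - 1        # floor(log2 l)
    t = l - (1 << q)
    return q if t == 0 else h(t)

def sweep_counts(d):
    """Count compute and communication steps of Algorithm Divide-Exchange."""
    computes = comms = 0
    for k in range(1, d + 1):
        for l in range(1, 2 ** (d - k)):
            computes += 1         # Compute (i, j)
            comms += 1            # Exchange 2^h(l)
        computes += 1             # Compute (i, j)
        if k < d:
            comms += 1            # Divide 2^(d-k-1); skipped when k = d
    return computes, comms

# Inorder heights for a 7-node complete binary tree.
assert [h(l) for l in range(1, 8)] == [0, 1, 0, 2, 0, 1, 0]

for d in range(2, 8):
    n = 2 ** d
    assert sweep_counts(d) == (n - 1, n - 2)   # lower bounds K and c(n)
```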
The relative ease of mapping a recursive divide-exchange onto the hypercube is due to the recursive nature of the hypercube itself. The fact that a hypercube is recursively constructed out of lower-dimensional subcubes may be exploited. A divide step in our algorithm corresponds to a subdivision of the problem, allowing computations to proceed on the subcubes. Exchanges will always consist of communication between pairs of nearest neighbours on the hypercube. A cube of dimension d − 1 is required for a problem with n = 2^d. The computation and communication steps are determined by the exchange sequence shown in figure 3.
5.2 Processor Pairings

Nearest-neighbour processor pairings on the hypercube may be determined by a perfect shuffle of node addresses. Stone's original paper [21] details the generation of such pairings via a left cyclic shift of the bits in an address. A perfect shuffle of an N-element vector is a permutation P of the indices or addresses a of the elements such that

P(a) = { 2a,           0 ≤ a ≤ N/2 − 1,
       { 2a + 1 − N,   N/2 ≤ a ≤ N − 1.    (5.1)
Consider the binary representation of an integer address for which N = 2^d. Individual bits at position i are denoted a_i:

a = a_{d−1} 2^{d−1} + a_{d−2} 2^{d−2} + ... + a_1 2 + a_0.    (5.2)

A perfect shuffle (5.1) of an address a, creating a new address a′, corresponds to a left cyclic shift of all bits a_i to a_{i+1}, with the leftmost bit a_{d−1} wrapped around to a_0 [21]:

a′ = a_{d−2} 2^{d−1} + a_{d−3} 2^{d−2} + ... + a_0 2 + a_{d−1}.
Our earlier requirement for a pairwise exchange of columns at a distance 2^h is easily satisfied, due to the geometry of a hypercube. The implication is that for addresses of the form (5.2), a difference in a single bit a_i indicates a distance of 2^i. We also note that the addresses of neighbouring processors in the hypercube differ in only one bit position. Exchanges, therefore, will always be between directly connected neighbours.

Processor nodes in a hypercube are labelled from 0 to 2^d − 1; for example, in a 3-dimensional cube there are 8 processors with addresses 0 to 7. We can use the perfect shuffle to generate the processor pairings required for exchanges at a distance which is a power of 2. This may be illustrated by an example with d = 3. Initially, processor pairings for exchanges are at a distance of
Figure 4: 3-Dimensional Processor Pairings
1. After a perfect shuffle from addresses a to a′, exchanges may take place at a distance of 2; from a′ to a″, at a distance of 4; and so on. Processor pairings before and after a perfect shuffle are given in figure 4.
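The shuffle of (5.1) and the pairings of figure 4 can be reproduced in a few lines of Python (illustrative; function names are ours). The bit-shift form and Stone's closed form (5.1) agree, and two successive shuffles generate the a′ and a″ columns of figure 4:

```python
def shuffle(a, d):
    """Perfect shuffle of a d-bit address: left cyclic shift of its bits."""
    N = 1 << d
    return ((a << 1) | (a >> (d - 1))) & (N - 1)

def shuffle_formula(a, d):
    """Stone's closed form (5.1)."""
    N = 1 << d
    return 2 * a if a <= N // 2 - 1 else 2 * a + 1 - N

d = 3
for a in range(1 << d):
    assert shuffle(a, d) == shuffle_formula(a, d)

# Columns a', a'' of figure 4, indexed by the original address a.
a1 = [shuffle(a, d) for a in range(8)]
a2 = [shuffle(x, d) for x in a1]
assert a1 == [0, 2, 4, 6, 1, 3, 5, 7]
assert a2 == [0, 4, 1, 5, 2, 6, 3, 7]
```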
The exchange and divide steps required to complete one sweep of a Jacobi-like algorithm, when n = 2^4 = 16, are illustrated in figure 5.
node a        node a′       node a″
0  000        0  000        0  000
1  001        2  010        4  100
2  010        4  100        1  001
3  011        6  110        5  101
4  100        1  001        2  010
5  101        3  011        6  110
6  110        5  101        3  011
7  111        7  111        7  111
There are 15 computation steps. The column pairs (i, j) at each step are written in the processor nodes. The communication links used between the computation steps are marked.
Figure 5: Divide-Exchange on a 3-Cube
5.3 Computational Results
An implementation of Hestenes' one-sided SVD via the non-recursive version of our algorithm was written in C for subsequent testing and analysis on the Intel iPSC hypercube. A simulator for the hypercube was provided by Intel Scientific Corp. to McGill University for a SUN 3/280 running the BSD 4.3 operating system. This SUN has an IEEE 754 standard co-processor with a floating-point precision of ε = 2.22 × 10^−16 in double precision arithmetic.
A threshold Jacobi method, as described in section 2, was employed to ensure proper termination of the algorithm. Following the methods introduced by Berry et al in [1] for computation on an array processor, each node processor in the hypercube maintains a counter stop. The counter is incremented by a processor when one of its assigned column pairs (i, j) is deemed to be orthogonal according to a threshold parameter τ. For the purposes of our tests we chose τ = ε‖A‖_F.

The parallel computation terminates at the end of a sweep if each of the n/2 processors reports a stop count of n − 1. For a series of random 8 × 8 matrices generated using the interactive matrix software package Matlab, we typically observe convergence in the hypercube computation after 6 sweeps.
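The termination scheme can be sketched serially (our Python reconstruction, not the iPSC code; a serial cyclic sweep stands in for the parallel ordering, and the sweep count is capped rather than guaranteed). A pair is counted as stopped when |a_i^T a_j| ≤ τ with τ = ε‖A‖_F, and is otherwise rotated with a |θ| ≤ π/4 formula in the style of Rutishauser [17]:

```python
import numpy as np

def one_sided_svd(A, max_sweeps=20):
    """Serial Hestenes one-sided Jacobi with a threshold stopping test."""
    A = A.astype(float).copy()
    n = A.shape[1]
    tau = np.finfo(float).eps * np.linalg.norm(A, 'fro')  # threshold parameter
    for sweep in range(1, max_sweeps + 1):
        stop = 0
        for i in range(n - 1):
            for j in range(i + 1, n):
                gamma = A[:, i] @ A[:, j]
                if abs(gamma) <= tau:
                    stop += 1                 # pair deemed orthogonal
                    continue
                alpha = A[:, i] @ A[:, i]
                beta = A[:, j] @ A[:, j]
                # Rotation with |theta| <= pi/4 (t = tan theta, |t| <= 1).
                zeta = (alpha - beta) / (2.0 * gamma)
                t = np.sign(zeta) / (abs(zeta) + np.hypot(1.0, zeta)) if zeta != 0 else 1.0
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                A[:, i], A[:, j] = c * A[:, i] + s * A[:, j], -s * A[:, i] + c * A[:, j]
        if stop == n * (n - 1) // 2:          # every pair passed the threshold
            break
    # Singular values are the column norms of the converged matrix.
    return np.sort(np.linalg.norm(A, axis=0))[::-1], sweep

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8))
sigma, sweeps = one_sided_svd(A)
assert np.allclose(sigma, np.linalg.svd(A, compute_uv=False))
```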
Finally, we have observed a communication pattern for random 16 × 16 matrices matching exactly with that shown in figure 6.
6 Conclusions and Future Research

We have described a new optimal parallel Jacobi-like algorithm for the singular value decomposition (SVD). We have demonstrated that the new algorithm can be mapped naturally onto hypercube architectures, effectively utilizing the nearest-neighbour communication capacity throughout the computation. In general, the recursive pairwise exchange communication operations of the new algorithm can be efficiently supported by multiprocessors with interconnect patterns used in many networks that have been proposed to support large-scale parallelism [20]. For example, we believe that the new algorithm can be mapped effectively onto SIMD or MIMD parallel computers with interconnection networks such as PM2I-based networks and cube-based networks. These interconnection networks have the partitionability property: the ability to divide the network into independent subnetworks of different sizes [20], which matches the recursive divide-exchange structure of the new parallel algorithm proposed in this paper.

We suggest the following future research directions: study extensions of the new algorithm to various forms of the SVD, to the unsymmetric eigenvalue problem, and to the generalized eigenvalue problem Ax = λBx. Furthermore, we would like to gather empirical information concerning convergence properties from numerical simulations.
7 Acknowledgements
Intel Scientific Corp. provided us with a hypercube simulator which allowed us to test the new algorithm. Martin Santavy gave many valuable comments concerning the details of software development for the hypercube, particularly in the area of synchronization problems. The figures were prepared with the help of Peggy Gao. We would especially like to thank Prof. Chris Paige for many helpful comments and corrections related to historical background and convergence results.
References
[1] M. Berry and A. H. Sameh, "Multiprocessor Jacobi algorithms for dense symmetric eigenvalue problems and singular value decompositions", Proceedings of the International Conference on Parallel Processing, 1986.

[2] C. Bischof, "The two-sided Jacobi method on a hypercube", SIAM Proceedings of the Second Conference on Hypercube Multiprocessors, 1987.

[3] R. P. Brent, F. T. Luk and C. F. Van Loan, "Computation of the singular value decomposition using mesh-connected processors", J. VLSI Computer Systems, 1 (1985), pp. 242-270.

[4] R. P. Brent and F. T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays", SIAM J. Sci. Stat. Comput., 6 (1985), pp. 69-84.

[5] P. J. Eberlein, "On using the Jacobi method on the hypercube", SIAM Proceedings of the Second Conference on Hypercube Multiprocessors, 1987.

[6] G. E. Forsythe and P. Henrici, "The cyclic Jacobi method for computing the principal values of a complex matrix", Trans. Amer. Math. Soc., 94 (1960), pp. 1-23.

[7] G. H. Golub and W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix", J. SIAM Ser. B: Numer. Anal., 2 (1965), pp. 205-224.
[8] M. T. Heath, A. J. Laub, C. C. Paige and R. C. Ward, "Computing the singular value decomposition of a product of two matrices", SIAM J. Sci. Stat. Comput., 7 (1986), pp. 1147-1159.

[9] M. R. Hestenes, "Inversion of matrices by biorthogonalization and related results", J. Soc. Indust. Appl. Math., 6 (1958), pp. 51-90.

[10] E. G. Kogbetliantz, "Solution of linear equations by diagonalization of coefficients matrix", Quart. Appl. Math., 13 (1955), pp. 123-132.

[11] C. Lawson and R. Hanson, Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, N.J., 1974.

[12] F. T. Luk, "A triangular processor array for computing singular values", Linear Algebra Appl., 77 (1986), pp. 259-273.

[13] F. T. Luk and H. Park, "On parallel Jacobi orderings", Cornell University, School of Elec. Eng. Report EE-CEG-86-5, 1986.

[14] J. J. Modi and J. D. Pryce, "Efficient implementation of Jacobi's diagonalization method on the DAP", Numer. Math., 46 (1985), pp. 443-454.

[15] J. C. Nash, "A one-sided transformation method for the singular value decomposition and algebraic eigenproblem", Comput. J., 18 (1975), pp. 74-76.

[16] C. C. Paige and P. Van Dooren, "On the quadratic convergence of Kogbetliantz's algorithm for computing the singular value decomposition", Linear Algebra Appl., 77 (1986), pp. 301-313.

[17] H. Rutishauser, "The Jacobi method for real symmetric matrices", Numer. Math., 16 (1966), pp. 205-223.

[18] A. H. Sameh, "On Jacobi and Jacobi-like algorithms for a parallel computer", Math. Comp., 25 (1971), pp. 579-590.

[19] A. H. Sameh, "Solving the linear least squares problem on a linear array of processors", Algorithmically Specialized Parallel Computers, Academic Press, 1985, pp. 191-200.

[20] H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, D.C. Heath and Co., Mass., 1985.

[21] H. S. Stone, "Parallel processing with the perfect shuffle", IEEE Trans. Comput., C-20 (1971), pp. 153-161.

[22] J. H. Wilkinson, "A note on the quadratic convergence of the cyclic Jacobi process", Numer. Math., 4 (1962), pp. 296-300.

[23] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.