

An Optimal Parallel Jacobi-Like Solution Method for the Singular Value Decomposition

G. R. Gao and S. J. Thomas, January 1988

others have proposed various modifications for hypercube implementations, which require the embedding of rings via binary reflected Gray codes.

In this paper, we present a new parallel Jacobi-like solution method for the SVD which is optimal in achieving both the maximum concurrency in computation and the minimum overhead in communication. Unlike previously published parallel SVD algorithms based on a nearest-neighbour ring topology for communication, the new algorithm proposed in this paper introduces a recursive divide-exchange communication pattern. As a result of the recursive nature of the algorithm, proofs are given to show that it achieves the lower bounds in both computation and communication costs. Convergence aspects of the new algorithm are briefly discussed. The paper illustrates that the new algorithm can be mapped efficiently and naturally onto hypercube architectures. We have implemented the new algorithm on the Intel hypercube through simulation, and the preliminary results will be discussed. A comparison with related work is briefly outlined. We believe that the new algorithm can be efficiently mapped onto multiprocessors with interconnection patterns that have been proposed to support large-scale parallelism, such as the many PM2I-based or cube-based networks [20].

2 Jacobi-like Algorithms

Abstract

A new parallel Jacobi-like solution method for the singular value decomposition (SVD) is presented which is optimal in achieving both the maximum concurrency in computation and the minimum overhead in communication. Unlike previously published parallel SVD algorithms based on a nearest-neighbour ring topology for communication, the new algorithm introduces a recursive divide-exchange communication pattern. As a result of the recursive nature of the algorithm, proofs are given to show that it achieves the lower bounds in both computation and communication costs. In general, the recursive pairwise exchange communication operations of the new algorithm can be efficiently supported by multiprocessors with interconnect patterns used in many networks that have been proposed to support large-scale parallelism. As an example, this paper illustrates that the new algorithm can be mapped efficiently and naturally onto hypercube architectures. Preliminary results with an implementation of the new algorithm are reported. Convergence aspects of the new algorithm are briefly discussed. A comparison with related work is outlined.

1 Introduction

Rapid technological advances in multiprocessor architectures have aroused much interest in parallel computation. Parallel methods to compute the singular value decomposition (SVD) have received attention due to its many important applications in science and engineering. A recent paper by Heath et al. [8] includes a history of various Jacobi-like SVD algorithms.

An early investigation into parallel computation for the symmetric eigenvalue problem, on the SIMD Illiac IV, is described by Sameh in [18]. Sameh outlines the criteria for maximal parallelism in a Jacobi-like algorithm. More recently, a number of authors including Berry et al. [1] advocate the one-sided SVD of Hestenes [9], [8], [15] for parallel computation of the SVD. Luk and his co-workers have examined various systolic array configurations to compute the SVD [12], [3], [4]. Brent and Luk [4] have invented a linear array of n/2 processors which implements a one-sided Hestenes algorithm that, in real arithmetic, is an exact analogue of their Jacobi method applied to the eigenvalue problem. The array requires O(mnS) time, where S is the number of sweeps (typically S ≤ 10). Brent and Luk demonstrate that their algorithm is computationally optimal in the sense that it requires the minimum number of computational steps per sweep, i.e. n − 1, to ensure the execution of every possible pairwise column rotation. Maximum concurrency is maintained throughout the computation. Their systolic array is comparable to the architecture of a nearest-neighbour linear array of processors, where communication is based on a ring topology.

Brent and Luk's algorithm is not optimal in terms of communication overhead. Unnecessary costs are incurred by mapping the systolic array architecture onto a ring-connected linear array due to the double sends and receives required between pairs of neighbouring processors. Eberlein [5], Bischof [2] and

School of Computer Science, McGill University, Montreal, Quebec, Canada, H3A 2K6. This work was partially supported by the Natural Sciences and Engineering Research Council of Canada under Grant A9236.

2.1 The Singular Value Decomposition

The singular value decomposition (SVD) of a general non-square matrix may be given as follows.

Theorem 2.1 For a real matrix A (m x n) of rank r, there exist orthogonal matrices U (m x m) and V (n x n), such that

    U^T A V = Σ = diag(σ_1, σ_2, …) ≥ 0,

where the elements of Σ (m x n) may be ordered so that

    σ_1 ≥ σ_2 ≥ … ≥ σ_r > σ_{r+1} = … = σ_q = 0,   q = min{m, n}.

If m = n, Σ is a square diagonal n x n matrix [11].

In order to compute the SVD in an iterative fashion, a series of plane rotations may be applied to the matrix A (m x n) described in Theorem 2.1 above. This approach is similar in nature to Jacobi's original method for computing the eigenvalues of a symmetric matrix, where orthogonal matrices J(i, j, θ) are applied so as to annihilate a symmetrically placed pair of the n(n − 1) off-diagonal elements. These rotation matrices differ from the identity matrix of order n by the principal submatrix formed at the intersection of the row and column pairs corresponding to i and j. The 2 x 2 submatrix has the form

    [  c  s ]
    [ -s  c ]

The cosine and sine of the rotation angle θ are the constants c = cos θ and s = sin θ. Initially A_1 = A, and at the k-th iteration,


    A_{k+1} = J(i_k, j_k, θ_k)^T A_k J(i_k, j_k, θ_k).

Rotations are applied simultaneously, in a symmetric fashion, from the left and right. Cyclic Jacobi methods refer to a sequence of rotations which update row and column pairs in some predetermined order. For a square matrix, a cyclic sweep refers to the updating of n(n − 1)/2 elements. A number of sweeps are required in order to effectively reduce the off-diagonal mass of the matrix to a sufficiently small value, which eventually can be ignored. A diagonal containing the eigenvalues then remains. Annihilation of a pair of off-diagonal elements of a symmetric matrix takes the form

    [  c  s ] [ a_ii^(k)  a_ij^(k) ] [ c  -s ]   [ a_ii^(k+1)      0       ]
    [ -s  c ] [ a_ij^(k)  a_jj^(k) ] [ s   c ] = [     0       a_jj^(k+1)  ]
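The annihilation above can be checked numerically. The following is an illustrative sketch (not the paper's implementation): the sample entries a_ii = 4, a_ij = 1, a_jj = 2 are arbitrary, and the angle is taken from the classical condition tan 2θ = 2 a_ij / (a_ii − a_jj).

```python
import math

def rotation_angle(a_ii, a_ij, a_jj):
    """Angle theta with tan(2*theta) = 2*a_ij / (a_ii - a_jj), which
    annihilates the off-diagonal pair of a symmetric 2x2 block."""
    return 0.5 * math.atan2(2.0 * a_ij, a_ii - a_jj)

# Sample symmetric 2x2 block: a_ii = 4, a_ij = a_ji = 1, a_jj = 2.
a_ii, a_ij, a_jj = 4.0, 1.0, 2.0
theta = rotation_angle(a_ii, a_ij, a_jj)
c, s = math.cos(theta), math.sin(theta)
# Off-diagonal entry of J^T A J with J = [[c, -s], [s, c]]
new_ij = -s * (c * a_ii + s * a_ij) + c * (c * a_ij + s * a_jj)
print(abs(new_ij) < 1e-12)  # True: the pair is annihilated
```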


By normalizing the columns, we see that the SVD of Theorem 2.1 is implicit in (2.3):

    Q = UΣ,   A = UΣV^T.

A one-sided algorithm is somewhat different from its earlier counterparts, as rotations are applied from the right and therefore only columns are affected. Off-diagonal elements are no longer annihilated; instead, rotations are designed in order to produce two orthogonal columns. As with similar Jacobi-like algorithms, the orthogonal matrix V may be accumulated from plane rotations J(i, j, θ) which differ from the unit matrix I_n in a 2 x 2 principal submatrix containing the cosines and sines of the rotation. Setting A_1 = A, the k-th iteration updates A_k:

    A_{k+1} = A_k J(i, j, θ_k).

If the matrix sequence A_k converges, the result is Q in (2.3). A column update via a 2 x 2 submatrix takes the form

    [ a_i^(k)  a_j^(k) ] J(θ) = [ a_i^(k+1)  a_j^(k+1) ],   J(θ) = [ c  -s ]
                                                                   [ s   c ]

The orthogonality condition determines the rotation angle θ:

    2 (a_i^(k))^T a_j^(k) / ((a_i^(k))^T a_i^(k) − (a_j^(k))^T a_j^(k)) = tan 2θ.   (2.4)

By avoiding a potential loss of significant digits, the magnitude of the angle may be restricted to |θ| ≤ π/4, which provides formulae for the rotation (see Nash [15] and Rutishauser [17]). As noted by Brent and Luk [4], if a cyclic-by-row rotation ordering is chosen to update the n(n − 1)/2 column pairings determined by the off-diagonal elements above the main diagonal, convergence would follow. Hestenes' computation is mathematically equivalent to a Jacobi algorithm applied to A^T A; therefore we expect that the convergence analyses of Forsythe and Henrici [6] or Wilkinson [22] are applicable under these circumstances. Rather than testing for convergence, the threshold Jacobi method originally introduced for the symmetric eigenproblem is often employed [23, pp. 277-278], [17].
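A single one-sided update with the angle from (2.4) can be sketched as follows; this is an illustrative serial fragment (the sample column values are arbitrary, not from the paper):

```python
import math

def hestenes_rotate(ai, aj):
    """Rotate columns ai, aj to orthogonality, with the angle from (2.4):
    tan(2*theta) = 2*(ai . aj) / (ai . ai - aj . aj)."""
    dot = sum(x * y for x, y in zip(ai, aj))
    d = sum(x * x for x in ai) - sum(x * x for x in aj)
    theta = 0.5 * math.atan2(2.0 * dot, d)
    c, s = math.cos(theta), math.sin(theta)
    # column update [ai aj] * [[c, -s], [s, c]]
    new_i = [c * x + s * y for x, y in zip(ai, aj)]
    new_j = [-s * x + c * y for x, y in zip(ai, aj)]
    return new_i, new_j

ai, aj = [3.0, 1.0], [1.0, 2.0]
bi, bj = hestenes_rotate(ai, aj)
print(abs(sum(x * y for x, y in zip(bi, bj))) < 1e-12)  # True: orthogonal
```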

3 Parallel Computation

3.1 Maximizing Concurrency

In this paper the computation cost is measured by the number of parallel computation steps. The methods discussed process (i, j) pairings consisting of partitions containing at least one column or row. When n is even, if we assume one parallel computation step has unit cost, then of the algorithms presented the minimum cost achieved is n − 1 per sweep. The systolic array and associated algorithm proposed by Brent and Luk were proven to achieve this lower bound in [4]. We have illustrated their basic scheme in figure 1 for the case n = 8, where a linear array of four processors {P1, P2, P3, P4} is used.

3.2 Minimizing Communication Costs

Another important performance criterion for a parallel algorithm is the total communication cost. For our purposes the communication cost can be measured by the total number of interprocessor transactions (messages). A transaction consists of a column transmission between a pair of processors. The total communication cost of one sweep will be denoted C.

From the last section, we know that the minimum number of computation steps in a sweep is K = n − 1. The minimum number of interprocessor transactions is achieved when each processor retains one column from a pairing, and transmits the other to a destination processor. As a result, if there are p processors, p transmissions are performed between two consecutive steps. Hence the minimum total communication cost C_min is

    Figure 1: Brent and Luk's Systolic Array

defined by the following:

    C_min = (K − 1) p.   (3.1)

In the parallel one-sided SVD algorithm each processor is assigned one of n/2 column pairs at each step, assuming n is even. The total number of processors required is p = n/2 in (3.1), and the communication costs are O(n^2):

    C_min = (K − 1) p = n(n − 2) / 2.

As a contrast, a global broadcasting strategy may request each processor to send both columns to all other p − 1 processors between each step. The total cost for this case will be O(n^3). Brent and Luk's algorithm has the following communication cost:

    C_BL = (K − 1) x 2p
         = (n − 2) x 2(n/2) = n(n − 2).

Therefore their algorithm is close to, but not quite, optimal. In fact the inefficiency lies in the double sends and receives between processors in the systolic array, which are dictated by the tournament ordering.

Several ways of modifying Brent and Luk's algorithm to avoid the double sends and receives have been proposed [13], [14], [5], [2]. These algorithms all represent a communication regimen based on a ring topology. A ring topology resembles the architecture of a linear array of processors. Embedding a ring within another topology, for example the binary n-cube, requires a special mapping scheme.

4 An Optimal Parallel SVD Algorithm

In this section we present a new parallel Jacobi-like algorithm which is optimal in terms of achieving both maximum concurrency and minimum communication overhead. The algorithm relies on a recursive divide-exchange of n = 2^d columns.

Unlike several orderings cited earlier, the new algorithm maps naturally onto parallel architectures which support recursive pairwise exchanges. A mapping onto a hypercube is presented as an example in section 5. Pairwise exchanges of columns here are specified by a perfect shuffle of processor addresses [21].



4.1 The Parallel Algorithm

Let us first illustrate the basic principle of the new algorithm through an example where n = 8 and p = 4. The computation steps K_l and communication steps X_l, consisting of pairwise exchanges, are shown in figure 2.


    c(n) = { n/2 + c(n/2),   n > 2,
           { 0,              n = 2.    (4.3)

Solving (4.3), we obtain c(n) = n − 2. Multiplying by the number of processors p = n/2 gives the communication cost C_DE for our recursive divide-exchange algorithm. We have achieved the optimum since C_DE = C_min:

    C_DE = (n − 2) p = n(n − 2) / 2.

Referring to our example in figure 2, four column transactions have occurred at each communication step, with a total cost of 6 x 4 = 24 transactions, which is optimal.

5 Mapping onto the Hypercube

In order to map the recursive divide-exchange algorithm of section 4 onto a hypercube architecture, we must first specify the operations performed by each processor in the cube. Given the two major components of our algorithm, namely a compute-exchange and a divide, deriving an algorithm for individual processors is straightforward. Due to the tail-recursion in the parallel SVD algorithm, it may be transformed into an iterative form.

Algorithm Divide-Exchange

    for k = 1 to d do
        for l = 1 to 2^(d−k) − 1 do
            Compute (i, j)
            q = h(l)
            Exchange 2^q
        end
        Compute (i, j)
        Divide 2^(d−k−1)
    end

The step "Compute (i, j)" refers to a column update in the parallel version of Hestenes' one-sided computation. Using the terminology introduced in section 4, each processor cycles through a Jacobi sweep consisting of d stages. A divide step exchanging at a distance of 2^(−1) would not be carried out. The function h(l) computes the height of an exchange node X_l, where l is the label number derived by an inorder traversal of a complete binary tree.

Function h(l)

    begin
        q = ⌊log2 l⌋
        t = l − 2^q
        if t = 0 then
            return q
        else
            return h(t)
    end
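As a sanity check, the sweep structure above can be simulated directly. The following illustrative Python sketch (not the paper's C implementation) counts the Compute and Exchange/Divide steps of one sweep and reproduces the bounds n − 1 and c(n) = n − 2:

```python
def h(l):
    """Height of exchange node X_l (inorder label in a complete binary
    tree): strip the leading 1-bit of l until a power of two remains."""
    q = l.bit_length() - 1          # floor(log2(l))
    t = l - (1 << q)
    return q if t == 0 else h(t)

def sweep_counts(d):
    """Count the Compute and Exchange/Divide steps in one sweep of the
    divide-exchange ordering for n = 2**d columns."""
    computes = comms = 0
    for k in range(1, d + 1):
        for l in range(1, 2 ** (d - k)):
            computes += 1           # Compute (i, j)
            comms += 1              # Exchange at distance 2**h(l)
        computes += 1               # Compute (i, j)
        if k < d:
            comms += 1              # Divide at distance 2**(d - k - 1)
    return computes, comms

print([h(l) for l in range(1, 8)])       # [0, 1, 0, 2, 0, 1, 0]
print(sweep_counts(3), sweep_counts(4))  # (7, 6) (15, 14)
```

For n = 8 this gives 7 = n − 1 computation steps and 6 = n − 2 communication steps, and for n = 16 the 15 computation steps of figure 5.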

The relative ease of mapping a recursive divide-exchange onto the hypercube is due to the recursive nature of the hypercube itself. The fact that a hypercube is recursively constructed out of lower-dimensional subcubes may be exploited. A divide step in our algorithm corresponds to a subdivision of the problem, allowing computations to proceed on the subcubes. Exchanges will always consist of communication between pairs of nearest neighbours on the hypercube. A cube of dimension d − 1 is required for a problem with n = 2^d. The computation and communication steps are determined by the exchange sequence shown in figure 3.

5.2 Processor Pairings

Nearest-neighbour processor pairings on the hypercube may be determined by a perfect shuffle of node addresses. Stone's original paper [21] details the generation of such pairings via a left cyclic shift of the bits in an address. A perfect shuffle of an N-element vector is a permutation P of the indices or addresses a of the elements such that

    P(a) = { 2a,            0 ≤ a ≤ N/2 − 1,
           { 2a + 1 − N,    N/2 ≤ a ≤ N − 1.    (5.1)

Consider the binary representation of an integer address for which N = 2^d. Individual bits at position i are denoted a_i:

    a = a_{d−1} 2^{d−1} + a_{d−2} 2^{d−2} + … + a_1 2 + a_0.    (5.2)

A perfect shuffle (5.1) of an address a creating a new address a' corresponds to a left cyclic shift of all bits a_i to a_{i+1}, with the leftmost bit a_{d−1} wrapped around to a_0 [21]:

    a' = a_{d−2} 2^{d−1} + a_{d−3} 2^{d−2} + … + a_0 2 + a_{d−1}.
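The left cyclic shift is a one-line bit manipulation. This illustrative sketch implements it and agrees with the piecewise definition (5.1); for d = 3 it generates the address mapping of figure 4:

```python
def shuffle(a, d):
    """Perfect shuffle of a d-bit address: left cyclic shift of its
    bits, equivalent to P(a) in (5.1)."""
    return ((a << 1) | (a >> (d - 1))) & ((1 << d) - 1)

# d = 3: node a -> node a', matching the first two columns of figure 4.
print([shuffle(a, 3) for a in range(8)])   # [0, 2, 4, 6, 1, 3, 5, 7]
```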

Our earlier requirement for a pairwise exchange of columns at a distance 2^h is easily satisfied, due to the geometry of a hypercube. The implication is that for addresses of the form (5.2), a difference in a single bit a_i indicates a distance of 2^i. We also note that the addresses of neighbouring processors in the hypercube differ in only one bit position. Exchanges, therefore, will always be between directly connected neighbours.

Processor nodes in a hypercube are labelled from 0 to 2^d − 1; for example, in a 3-dimensional cube there are 8 processors with addresses 0 to 7. We can use the perfect shuffle to generate the processor pairings required for exchanges at a distance which is a power of 2. This may be illustrated by an example with d = 3. Initially, processor pairings for exchanges are at a distance of 1. After a perfect shuffle from addresses a to a', exchanges may take place at a distance of 2; from a' to a'', at a distance of 4; and so on. Processor pairings before and after a perfect shuffle are given in figure 4.

    Figure 4: 3-Dimensional Processor Pairings

    node a      node a'     node a''
    0  000      0  000      0  000
    1  001      2  010      4  100
    2  010      4  100      1  001
    3  011      6  110      5  101
    4  100      1  001      2  010
    5  101      3  011      6  110
    6  110      5  101      3  011
    7  111      7  111      7  111

The exchange and divide steps required to complete one sweep of a Jacobi-like algorithm, when n = 2^4 = 16, are illustrated in figure 5.


There are 15 computation steps. The column pairs (i, j) at each step are written in the processor nodes. The communication links used between the computation steps are marked.

    Figure 5: Divide-Exchange on a 3-Cube

5.3 Computational Results

An implementation of Hestenes' one-sided SVD via the non-recursive version of our algorithm was written in C for subsequent testing and analysis on the Intel iPSC hypercube. A simulator for the hypercube was provided by Intel Scientific Corp. to McGill University for a SUN 3/280 running the BSD 4.3 operating system. This SUN has an IEEE 754 standard co-processor with a floating point precision of ε = 2.22 x 10^(−16) in double precision arithmetic.

A threshold Jacobi method, as described in section 2, was employed to ensure proper termination of the algorithm. Following the methods introduced by Berry et al. in [1] for computation on an array processor, each node processor in the hypercube maintains a counter stop. The counter is incremented by a processor when one of its assigned column pairs (i, j) is deemed to be orthogonal according to a threshold parameter τ. For the purposes of our tests we chose τ = ε ||A||_F.

The parallel computation terminates at the end of a sweep if each of the n/2 processors reports a stop count of n − 1. For a series of random 8 x 8 matrices generated using the interactive matrix software package Matlab, we typically observe convergence in the hypercube computation after 6 sweeps.
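The serial skeleton underlying this computation can be sketched as follows. This is an illustrative single-processor version with a cyclic-by-row ordering and a simplified absolute threshold (not the hypercube implementation, and not the exact τ = ε ||A||_F test); the test matrix is arbitrary:

```python
import math

def hestenes_svd(A, tol=1e-12, max_sweeps=30):
    """Serial sketch of Hestenes' one-sided method with a threshold
    test: rotate column pairs until every pair is orthogonal."""
    m, n = len(A), len(A[0])
    cols = [[A[r][c] for r in range(m)] for c in range(n)]
    for _ in range(max_sweeps):
        stop = 0
        for i in range(n - 1):
            for j in range(i + 1, n):
                dot = sum(x * y for x, y in zip(cols[i], cols[j]))
                if abs(dot) <= tol:          # pair deemed orthogonal
                    stop += 1
                    continue
                d = sum(x * x for x in cols[i]) - sum(x * x for x in cols[j])
                th = 0.5 * math.atan2(2.0 * dot, d)   # angle from (2.4)
                c, s = math.cos(th), math.sin(th)
                for r in range(m):
                    x, y = cols[i][r], cols[j][r]
                    cols[i][r], cols[j][r] = c * x + s * y, -s * x + c * y
        if stop == n * (n - 1) // 2:         # all pairs passed the threshold
            break
    # the singular values are the norms of the converged columns
    return sorted((math.sqrt(sum(x * x for x in col)) for col in cols),
                  reverse=True)

print(hestenes_svd([[1.0, 1.0], [0.0, 1.0]]))
# approximately [1.618, 0.618]: (1 + sqrt(5))/2 and (sqrt(5) - 1)/2
```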

Finally, we have observed a communication pattern for random 16 x 16 matrices matching exactly that shown in figure 6.

6 Conclusions and Future Research

We have described a new optimal parallel Jacobi-like algorithm for the singular value decomposition (SVD). We have demonstrated that the new algorithm can be mapped naturally onto hypercube architectures, effectively utilizing the nearest-neighbour communication capacity throughout the computation. In general, the recursive pairwise exchange communication operations of the new algorithm can be efficiently supported by multiprocessors with interconnect patterns used in many networks that have been proposed to support large-scale parallelism [20]. For example, we believe that the new algorithm can be mapped effectively onto SIMD or MIMD parallel computers with interconnection networks such as PM2I-based networks and cube-based networks. These interconnection networks have the partitionability property: the ability to divide the network into independent subnetworks of different sizes [20], which matches the recursive divide-exchange structure of the new parallel algorithm proposed in this paper.

We suggest the following future research directions: study extensions of the new algorithm to various forms of the SVD, to the unsymmetric eigenvalue problem, and to the generalized eigenvalue problem Ax = λBx. Furthermore, we would like to gather empirical information concerning convergence properties from numerical simulations.

7 Acknowledgements

Intel Scientific Corp. provided us with a hypercube simulator which allowed us to test the new algorithm. Martin Santavy gave many valuable comments concerning the details of software development for the hypercube, particularly in the area of synchronization problems. The figures were prepared with the help of Peggy Gao. We would especially like to thank Prof. Chris Paige for many helpful comments and corrections related to historical background and convergence results.

    References

[1] M. Berry and A. H. Sameh, "Multiprocessor Jacobi algorithms for dense symmetric eigenvalue problems and singular value decompositions", Proceedings of the International Conference on Parallel Processing, 1986.

[2] C. Bischof, "The two-sided Jacobi method on a hypercube", SIAM Proceedings of the Second Conference on Hypercube Multiprocessors, 1987.

[3] R. P. Brent, F. T. Luk and C. F. Van Loan, "Computation of the singular value decomposition using mesh-connected processors", J. VLSI Computer Systems, 1 (1985), pp. 242-270.

[4] R. P. Brent and F. T. Luk, "The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays", SIAM J. Sci. Stat. Comput., 6 (1985), pp. 69-84.

[5] P. J. Eberlein, "On using the Jacobi method on the hypercube", SIAM Proceedings of the Second Conference on Hypercube Multiprocessors, 1987.

[6] G. E. Forsythe and P. Henrici, "The cyclic Jacobi method for computing the principal values of a complex matrix", Trans. Amer. Math. Soc., 94 (1960), pp. 1-23.

[7] G. H. Golub and W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix", J. SIAM Ser. B: Numer. Anal., 2 (1965), pp. 205-224.

[8] M. T. Heath, A. J. Laub, C. C. Paige and R. C. Ward, "Computing the singular value decomposition of a product of two matrices", SIAM J. Sci. Stat. Comput., 7 (1986), pp. 1147-1159.

[9] M. R. Hestenes, "Inversion of matrices by biorthogonalization and related results", J. Soc. Indust. Appl. Math., 6 (1958), pp. 51-90.

[10] E. G. Kogbetliantz, "Solution of linear equations by diagonalization of coefficients matrix", Quart. Appl. Math., 13 (1955), pp. 123-132.

[11] C. Lawson and R. Hanson, Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, N.J., 1974.

[12] F. T. Luk, "A triangular processor array for computing singular values", Linear Algebra Appl., 77 (1986), pp. 259-273.

[13] F. T. Luk and H. Park, "On parallel Jacobi orderings", Cornell University, School of Elec. Eng. Report, EE-CEG-86-5, 1986.

[14] J. J. Modi and J. D. Pryce, "Efficient implementation of Jacobi's diagonalization method on the DAP", Numer. Math., 46 (1985), pp. 443-454.

[15] J. C. Nash, "A one-sided transformation method for the singular value decomposition and algebraic eigenproblem", Comput. J., 18 (1975), pp. 74-76.

[16] C. C. Paige and P. Van Dooren, "On the quadratic convergence of Kogbetliantz's algorithm for computing the singular value decomposition", Linear Algebra Appl., 77 (1986), pp. 301-313.

[17] H. Rutishauser, "The Jacobi method for real symmetric matrices", Numer. Math., 16 (1966), pp. 205-223.

[18] A. H. Sameh, "On Jacobi and Jacobi-like algorithms for a parallel computer", Math. Comp., 25 (1971), pp. 579-590.

[19] A. H. Sameh, "Solving the linear least squares problem on a linear array of processors", Algorithmically Specialized Parallel Computers, Academic Press, 1985, pp. 191-200.

[20] H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, D.C. Heath and Co., Mass., 1985.

[21] H. S. Stone, "Parallel processing with the perfect shuffle", IEEE Trans. Comput., C-20 (1971), pp. 153-161.

[22] J. H. Wilkinson, "A note on the quadratic convergence of the cyclic Jacobi process", Numer. Math., 4 (1962), pp. 296-300.

[23] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.