

The Journal of Supercomputing, 14, 257–280 (1999)
© 1999 Kluwer Academic Publishers. Manufactured in The Netherlands.

MCGS: A Modified Conjugate Gradient Squared Algorithm for Nonsymmetric Linear Systems

MUTHUCUMARU MAHESWARAN [email protected]

Department of Computer Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada

KEVIN J. WEBB AND HOWARD JAY SIEGEL {webb,hj}@ecn.purdue.edu

Parallel Processing Laboratory, School of Electrical and Computer Engineering, 1285 Electrical Engineering Building, Purdue University, West Lafayette, IN 47907-1285, USA

Editor: Hamid R. Arabnia

Received November 16, 1998; final version accepted November 20, 1998.

Abstract. The conjugate gradient squared (CGS) algorithm is a Krylov subspace algorithm that can be used to obtain fast solutions for linear systems (Ax = b) with complex nonsymmetric, very large, and very sparse coefficient matrices (A). By considering electromagnetic scattering problems as examples, a study of the performance and scalability of this algorithm on two MIMD machines is presented. A modified CGS (MCGS) algorithm, where the synchronization overhead is effectively reduced by a factor of two, is proposed in this paper. This is achieved by changing the computation sequence in the CGS algorithm. Both experimental and theoretical analyses are performed to investigate the impact of this modification on the overall execution time. From the theoretical and experimental analysis it is found that CGS is faster than MCGS for smaller numbers of processors and MCGS outperforms CGS as the number of processors increases. Based on this observation, a set of algorithms approach is proposed, where either CGS or MCGS is selected depending on the values of the dimension of the A matrix (N) and the number of processors (P). The set approach provides an algorithm that is more scalable than either the CGS or MCGS algorithms. The experiments performed on a 128-processor mesh Intel Paragon and on a 16-processor IBM SP2 with a multistage network indicate that MCGS is approximately 20% faster than CGS.

Keywords: algorithm scalability, conjugate gradient squared, modified conjugate gradient squared, Intel Paragon, IBM SP-2, MIMD, synchronization

1. Introduction

This is an application-driven study of solutions to linear systems of equations (Ax = b) on MIMD parallel machines. The application being considered is the finite element method (FEM) modeling of open-region electromagnetic problems in the frequency domain [7, 8]. The matrices obtained in this problem are very large, very sparse, nonsymmetric, and have complex-valued elements.

For the 2-D physical examples considered, first-order (linear) node-based functions over triangular elements are used. The resulting A matrix is unstructured, with the non-zero entries dictated by the global node numbering. The correspondence between the node numbering and the sparsity pattern of A is illustrated in Figure 1 for a simple 2-D mesh.


Figure 1. A simple 2-D mesh and the corresponding portion of the A matrix.

Figure 1(a) shows the connectivity among the nodes of the mesh and Figure 1(b) shows the sparsity pattern of the resulting A matrix. Note that while the sparsity pattern of A is symmetric, the actual matrix element values are not. In contrast, quadrilateral elements, which are commonly used in finite difference representations, result in a structured, multi-diagonal A matrix. Triangular elements provide greater flexibility in the representation of the geometry, but create the need for solution procedures that do not rely on a multi-diagonal A matrix. A similar A matrix will result for any wave equation problem that is described by a differential equation. The challenge is to be able to solve very large order problems effectively.

The conjugate gradient squared (CGS) algorithm [14] is used for the solution of the linear system. This algorithm can provide fast solutions, even though the convergence pattern is often non-uniform. This study focuses on the performance and scalability of this algorithm on MIMD machines. Synchronization and communication are two factors that introduce significant overhead when this algorithm is implemented on a parallel machine. The communication overhead is dependent on the structure of A, i.e., the sparsity pattern of A. Therefore, matrix reordering techniques can be used to reduce this overhead. The synchronization overhead depends on the number of vector-vector inner products performed per iteration of the algorithm. For MIMD machines, the synchronization cost rises significantly with increasing machine size. Hence, for scalable MIMD implementations, the amount of synchronization has to be minimized.

This paper proposes a modified CGS (MCGS) algorithm where the synchronization overhead is effectively reduced by a factor of two. This is achieved by changing the computation sequence in the CGS algorithm. Approximate theoretical complexity analyses and experimental studies have been done to investigate the impact of the modification on the overall execution time of the CGS algorithm. The approximate complexity analysis using a mesh-connected model indicates that for larger machine sizes the performance of the MCGS algorithm may be up to 34% better than that of the CGS algorithm, depending on the machine architecture. For smaller machine sizes, CGS performs better than MCGS. The experimental


studies on a 128-processor Intel Paragon reveal that MCGS is at least 20% better than CGS for larger numbers of processors. The experiments are also performed on a 16-processor IBM SP2.

Because neither algorithm is better than the other for all values of input data sizes and system parameters, a set of algorithms approach is presented (e.g., [13, 18]). This provides a scalable solution scheme for Ax = b. Conditions for choosing a particular algorithm depending on input data and system parameters are also provided. Conditions such as the one developed here to choose between CGS or MCGS depending on the system parameters are also useful in the area of heterogeneous computing (HC) mapping systems [9]. In HC mapping, the input data (e.g., matrix size) remains fixed, but the system parameters are varied, i.e., the mapping system estimates the performance on different machines and executes the application on the machine that is expected to yield the best performance. To obtain the best mapping, it is necessary for the HC mapping systems to have information such as that provided by the conditions to select either MCGS or CGS depending on system parameters.

The solution of Ax = b is a time-consuming step in the FEM modeling of many problems from diverse areas such as fluid dynamics, structures, and atmospherics. Therefore, a lot of work has been done in designing iterative algorithms and their parallel implementations for solving Ax = b. Some of the widely used iterative algorithms are the Krylov subspace algorithms. In general, these algorithms provide fast and robust solutions for Ax = b.

Dazevedo et al. [3] developed two reformulations of the conjugate gradient (CG) algorithm that reduce the synchronization overhead associated with the parallel implementations of the generic CG algorithm. Meurant [10] and Saad [11] also discuss the reduction of the synchronization overhead in the parallel implementations of the CG algorithm. The CG algorithm is applicable to symmetric positive definite (SPD) A matrices. The work presented in the following sections extends these ideas to reduce the synchronization overhead in the CGS algorithm, which is applicable to nonsymmetric A matrices.

In Section 2, the open-region electromagnetic problem is briefly examined. General strategies used for solving linear systems with complex elements are discussed in Section 3. The machine model and the notation used to parameterize the algorithms are presented in Section 4. Section 5 examines the issues involved in parallelizing some of the basic linear algebra subroutines needed for implementing a Krylov algorithm. A modified CGS algorithm is provided in Section 6 and this algorithm is compared with the CGS algorithm in Section 7. In Section 8, the experimental results are presented.

2. Problem definition

This section is a brief overview of the physical problem. See [7, 8] for details.

Progress in micro-machining of optical diffractive elements has sparked an interest in the study of wavelength and subwavelength structures in the last few years. A rigorous modeling of these devices demands the solution of the underlying


electromagnetic equations. The FEM is used here to discretize the wave equation. The resulting set of linear equations then needs to be solved.

Consider a scalar wave equation for a 2-D problem with a time dependency e^{jωt}

for the fields. Let u be the total, unknown, complex-valued scalar field, g be the magnetic or electric source inside the computational domain, and k_0 be the wavenumber in free space. Also, let p be the material permittivity, q be the material permeability, ε_0 be the permittivity of free space, and μ_0 be the permeability of free space. The scalar wave equation in the frequency domain is

∇ · (p ∇u) + k_0^2 q u = g,    (1)

with k_0^2 = ω^2 μ_0 ε_0, and u = H_z, p = ε_r^{-1}, q = μ_r for transverse electric (TE) polarization, or u = E_z, p = μ_r^{-1}, q = ε_r for transverse magnetic (TM) polarization.

A general open-region electromagnetic scattering problem with an artificial boundary ∂Ω for TM polarization is shown in Figure 2, where the superscript inc denotes the incident field components, the superscript tot denotes the total field components, the superscript s denotes the scattering field components, and pec denotes a perfect electric conductor. Let B_i represent the basis functions and D_1, D_2, and D_3 be coefficients that originate from radiation boundary conditions or absorbing boundary conditions at the outer boundary [8]. Furthermore, let u^inc denote the incident field. Then, the discretization of the corresponding variational formulation results in a system of linear equations:

Ax = b    (2)

Figure 2. General scattering problem in unbounded region: artificial boundary ∂Ω and different field components for TM polarization.


with A_ij and b_i represented by

A_ij = ∬_Ω { p ∇B_i · ∇B_j − k_0^2 q B_i B_j + g B_i } ds
       − ∮_∂Ω { D_1 B_i B_j + B_i ( D_2 ∂B_j/∂m − D_3 ∂^2 B_j/∂m^2 ) } dm    (3)

b_i = ∮_∂Ω B_i ( ∂u^inc/∂n − D_1 u^inc − D_2 ∂u^inc/∂m − D_3 ∂^2 u^inc/∂m^2 ) dm    (4)

The A matrix is of size N [7], corresponding to the number of free variables (number of unknowns) within the domain. Because of the local linear expansion and testing functions, the A matrix typically has a fill factor of less than 1%.

3. Linear equation solution

Various approaches are used in the literature for solving linear systems with complex coefficient matrices [5]. One approach is to premultiply Ax = b by the hermitian matrix A^H (A^H = (A^*)^T) [15], which is also known as the normal equations method. This results in a linear system Ãx = b̃, where Ã = A^H A and b̃ = A^H b. Ã has real-valued elements and b̃ can have complex-valued elements. Because Ã is real, the system can be split into two systems of equations:

Ã Re(x) = Re(b̃),    Ã Im(x) = Im(b̃)    (5)

For some A matrices this method can be very efficient, e.g., when A is close to unitary (A^H A ≈ I). However, the normal equations method often leads to systems with poor convergence properties. This is due to the condition number of Ã being approximately the square of the condition number of A. For many complex coefficient systems that occur in practical applications, the normal equations method has been observed to be inefficient [5], and so it is not considered in this study because of the structure of the A matrix involved.

Another approach is to split the complex system into real and imaginary components:

A^* [ Re(x) ; Im(x) ] = [ Re(b) ; Im(b) ],  where  A^* = [ Re(A)  −Im(A) ; Im(A)  Re(A) ]    (6)

This process doubles the size of the coefficient matrix in each dimension, i.e., if A is N × N then the resulting coefficient matrix, A^*, is 2N × 2N. This method is not very efficient, because the spectral properties of the resulting coefficient matrix are less suitable for Krylov iterative methods than those of the original matrix [4].


Hence, in this study, real Krylov iterative algorithms are adapted to solve complex linear systems. Although the overall algorithm remains the same, there are changes to the basic operations such as the inner products.
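As an illustration of one such change, the sketch below (an illustrative assumption, not code from the paper) shows a conjugated inner product of the form r̃_0^H r over complex-valued vectors in C; a purely real implementation would use a plain product of doubles instead.

#include <complex.h>

/* Conjugated inner product z = conj(a)^T b over n complex entries; this is
   the operation written as r~_0^H r in the CGS and MCGS listings. */
static double complex cdot(const double complex *a, const double complex *b, int n)
{
    double complex z = 0.0;
    for (int i = 0; i < n; i++)
        z += conj(a[i]) * b[i];
    return z;
}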

4. The machine model and terminology

The analysis of algorithms presented in this paper is based on the distributed memory MIMD machine model. The machine has P processors and each processor is paired with a memory module to form a processing element (PE). The PEs are connected by an interconnection network. In the experiments reported here, all PEs execute the same code, i.e., the single program multiple data (SPMD) model is used. The PE's memory is used to store both instructions and data. If a PE needs data stored in a remote PE then it is retrieved by message passing.

The algorithm complexity analysis performed in the next section is for a mesh-based distributed memory MIMD machine model. To simplify the analysis and parameter measurement, the communication operations on the Intel Paragon are modeled without considering the wormhole routing [2] that is used in the Paragon. The computation is modeled by counting the floating point operations (FLOPs) at the source code level (the C language is used). The following notation is used in this paper: (1) P: number of PEs, (2) N: dimension of the A matrix, (3) k: number of non-zero elements per row of the A matrix (for the 2-D problems considered in this paper the maximum value of k is 8), (4) t_fp: time for a floating-point operation (at the source code level), (5) t_s: setup time for message passing, and (6) t_w: time to transfer a single word between two PEs.

5. Krylov algorithms on parallel machines

Krylov subspace algorithms are fast and robust for the solution of unstructured and nonsymmetric matrix problems. One of these Krylov techniques, the conjugate gradient squared (CGS) algorithm [14], which can be directly applied to complex matrices, is used in the experiments reported here. The CGS algorithm for the solution of the linear system Ax = b is shown in Figure 3. An initial guess for the solution vector x_0 and an arbitrarily chosen vector r̃_0 such that r̃_0^H r_0 ≠ 0 are input to the algorithm, in addition to the vector b and the sparse matrix A. The convergence criterion is based on the error measure, e_n.

All Krylov algorithms require a basic linear algebra subroutine kernel that implements vector-vector operations, vector inner products, and matrix-vector multiplication. The vector inner product and matrix-vector multiplication operations need inter-PE communications on a distributed memory parallel machine. The A matrix is represented using a parallel version of the modified sparse row format [12]. In this format, each row of the matrix is represented as a variable length array. Each element of this array is a record containing an element of A and the corresponding column index.
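A minimal C sketch of such a row record layout (type and field names are illustrative assumptions, not the paper's code) is:

#include <complex.h>

/* One stored entry of a row: a complex element of A and its column index. */
typedef struct {
    double complex val;
    int col;
} entry_t;

/* Row-wise sparse storage: PE-local row i owns nnz[i] entries in rows[i]. */
typedef struct {
    int nrows;       /* number of locally stored rows (N/P per PE)       */
    int *nnz;        /* non-zeros in each local row (at most k = 8 here) */
    entry_t **rows;  /* variable-length array of entries per row        */
} sparse_rows_t;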


(1)  r_0 = b − Ax_0; q_0 = p_{−1} = 0;
(2)  ρ_{−1} = 1; n = 0; ρ_0 = r̃_0^H r_0;
(3)  while (not converged) do
(4)      β = ρ_n / ρ_{n−1};
(5)      u_n = r_n + βq_n;
(6)      p_n = u_n + β(q_n + βp_{n−1});
(7)      v_n = Ap_n;
(8)      σ = r̃_0^H v_n; α = ρ_n / σ;
(9)      q_{n+1} = u_n − αv_n;
(10)     f_{n+1} = u_n + q_{n+1};
(11)     r_{n+1} = r_n − αAf_{n+1};
(12)     x_{n+1} = x_n + αf_{n+1};
(13)     n = n + 1;
(14)     ρ_n = r̃_0^H r_n; e_n = r_n^H r_n;
(15) endwhile

Figure 3. The conjugate gradient squared (CGS) algorithm.

Figure 4. Distribution of the A matrix and vectors under the row striped format.

For simplicity, assume that the matrix dimension N is a multiple of P. Let the A matrix and the vectors be distributed over the PEs in row striped format, where each PE gets N/P contiguous rows, as shown in Figure 4; in particular, PE i gets rows (N/P)i to (N/P)(i+1) − 1. With the data distribution given in Figure 4, any element-wise vector-vector operation can be performed concurrently by all PEs without inter-PE communication. Let the time for an element-wise vector-vector operation be t_vop. For the row striped format,

t_vop = (N/P) t_fp.    (7)
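For instance, an element-wise update such as x = x + αf (Line (12) of Figure 3) touches only the N/P locally owned entries; a sketch in C (illustrative, not the paper's code) is:

#include <complex.h>

/* Element-wise update x_local += alpha * f_local over the N/P rows owned by
   this PE; every PE runs this independently, with no inter-PE communication. */
static void local_axpy(double complex *x_local, double complex alpha,
                       const double complex *f_local, int n_local /* = N/P */)
{
    for (int i = 0; i < n_local; i++)
        x_local[i] += alpha * f_local[i];
}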


For the inner products, the evaluation is done in two phases. In the first phase, each PE computes a local inner product on its N/P elements. Because each PE has N/P elements and for each element multiplication and addition operations are performed, for p local inner products the computation time is (2pN/P) t_fp. In the next phase, the local inner products are combined to form a global sum. The time taken for this phase is dependent on the interconnection network used by the parallel system. In the Intel Paragon, the PEs are organized in a √P × √P mesh. Figure 5 shows one way to perform the combining sequence for a sum-to-one-PE operation in a 4 × 4 mesh (without the wormhole routing). The labels on the arrows are step numbers (e.g., 0 indicates the transfers performed in the initial step). Once the sum is accumulated in a single PE, it is distributed to all the PEs. This distribution is performed by reversing the sequence of operations used to obtain the sum at one PE. Hence four more steps are required to distribute the sum back to all PEs.

In general, for a √P × √P mesh, the global summing and distribution can be performed in 2√P steps, where each step takes (t_s + p t_w). Therefore, on the Paragon the combining phase to form the global sum takes 2(t_s + p t_w)√P. Let the time for p vector inner products be t_inner(p), which becomes

t_inner(p) = (2pN/P) t_fp + 2(t_s + p t_w)√P.    (8)
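The two phases map naturally onto a local loop followed by one collective combining call; the sketch below is an illustrative assumption (using MPI_Allreduce in place of the mesh summing sequence of Figure 5 or the NX/MPI recursive-doubling sums mentioned in Section 8), showing how p inner products share a single global synchronization:

#include <complex.h>
#include <mpi.h>

/* Evaluate p inner products with one global combining step: each PE forms its
   p local partial sums over its N/P entries, then a single all-reduce merges
   them across all PEs (one synchronization, regardless of p). */
static void merged_inner_products(const double complex *a[], const double complex *b[],
                                  int p, int n_local, double complex *result)
{
    double complex local[8];                 /* assumes p <= 8 (p <= 5 is used here) */
    for (int j = 0; j < p; j++) {
        local[j] = 0.0;
        for (int i = 0; i < n_local; i++)
            local[j] += conj(a[j][i]) * b[j][i];
    }
    MPI_Allreduce(local, result, p, MPI_C_DOUBLE_COMPLEX, MPI_SUM, MPI_COMM_WORLD);
}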

The time taken for a matrix-vector multiplication operation depends on the sparsity pattern of the A matrix. Consider the situation when A has an unstructured sparsity pattern and is distributed among the PEs using the row striped format of Figure 4. The shaded square in Figure 4 represents those elements of A in rows (N/P)i to (N/P)(i+1) − 1 and columns (N/P)i to (N/P)(i+1) − 1. To perform the matrix-vector multiplication operation Ax, PE i needs the values of row r of x if column r of any of rows (N/P)i to (N/P)(i+1) − 1 of A contains a non-zero element.

Figure 5. An example global summing sequence for a 4 × 4 mesh.


Thus, given that PE i contains rows (N/P)i to (N/P)(i+1) − 1 of x, PE i needs to perform an inter-PE communication for each column j of A that has at least one non-zero element and (N/P)(i+1) ≤ j or j < (N/P)i (i.e., for all non-zero elements of A that are in PE i and lie outside the shaded area in Figure 4). Because A is unstructured, PE i could potentially need elements of x from all other PEs to perform the matrix-vector multiplication.

For the complexity analysis, the worst-case situation of PE i needing x vector elements from all other PEs is considered. Therefore, an all-to-all broadcast of the x vector elements is required before each matrix-vector multiplication operation. In a 2-D mesh network of PEs, the all-to-all broadcast can be performed in two phases. In the first phase, each PE broadcasts its N/P elements of the x vector along the columns of PEs of the 2-D mesh. This operation requires √P − 1 steps and each step takes (t_s + (N/P) t_w) time. At the end of this phase, each PE will hold (N/P)√P = N/√P elements of the x vector. The next phase is similar, but the broadcast is performed along the rows of PEs of the 2-D mesh and each PE sends the N/√P elements it currently holds. The time taken for the second phase is (t_s + (N/√P) t_w)(√P − 1). Therefore, the total communication time is t_mcomm = (t_s + (N/P) t_w)(√P − 1) + (t_s + (N/√P) t_w)(√P − 1). The computation time per row of the A matrix is k multiplications and k − 1 additions; thus, for the whole matrix the computation time is t_mcomp = (2k − 1)(N/P) t_fp. From the above results,

the time for matrix-vector multiplication for an unstructured A matrix is t_umult (unstructured multiplication):

t_umult = t_mcomp + t_mcomm
        = (2k − 1)(N/P) t_fp + 2 t_s (√P − 1) + (N(P − √P)/P) t_w.    (9)

For large P,

t_umult = (2k − 1)(N/P) t_fp + 2 t_s √P + N t_w.    (10)

The analysis of the algorithms provided in this paper assumes a large value for P, the number of PEs.

The unstructured A matrix can be structured to obtain a reordered A matrix with a banded sparsity pattern. Let the reordered A matrix have an odd bandwidth of b (i.e., a_ij = 0 for |i − j| > ⌊b/2⌋); then PE i needs to communicate with PEs that are numbered in the range ⌊(i − ⌊b/2⌋)/(N/P)⌋ . . . ⌊(i + ⌊b/2⌋)/(N/P)⌋ to obtain the necessary x vector elements to perform the matrix-vector multiplication. Instead of performing a broadcast operation to retrieve the x vector elements before every matrix-vector multiplication, for the banded A matrices, send and receive type communications are used. Hence, the banded structure of the reordered A matrix reduces the communication time spent in retrieving the necessary x vector elements, but the time spent on computation remains the same as in the unstructured case.


For very large and very sparse reordered matrices such as those encountered in the application that is considered here, b is such that (bP)/N ≤ c for a small constant c. For the reordered matrices and the numbers of PEs considered here, c is equal to two (i.e., PE i needs to perform inter-PE transfers with PEs i − 1 and i + 1). If b does not satisfy this condition, then the reordered A is considered to have an unstructured sparsity pattern and broadcast operations are used instead of the send and receive calls to retrieve the x vector elements.
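Under this c = 2 assumption, each PE only needs to swap its ⌊b/2⌋ boundary x entries with its two neighbors before the local matrix-vector product. A sketch of such an exchange (an illustrative assumption, written with MPI in place of the Paragon's NX calls) is:

#include <complex.h>
#include <mpi.h>

/* Halo exchange for a banded, row striped matrix: swap h = floor(b/2) boundary
   x entries with PEs rank-1 and rank+1 before the local mat-vec. */
static void exchange_band_halo(double complex *x_local, int n_local, int h,
                               double complex *left_halo, double complex *right_halo,
                               int rank, int nprocs)
{
    MPI_Status st;
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send first h local entries left, receive the right neighbor's first h entries */
    MPI_Sendrecv(x_local, h, MPI_C_DOUBLE_COMPLEX, left, 0,
                 right_halo, h, MPI_C_DOUBLE_COMPLEX, right, 0,
                 MPI_COMM_WORLD, &st);
    /* send last h local entries right, receive the left neighbor's last h entries */
    MPI_Sendrecv(x_local + n_local - h, h, MPI_C_DOUBLE_COMPLEX, right, 1,
                 left_halo, h, MPI_C_DOUBLE_COMPLEX, left, 1,
                 MPI_COMM_WORLD, &st);
}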

The total multiplication time for this case is called t_bmult (banded multiplication), which is given by

t_bmult = (2k − 1)(N/P) t_fp + 2 t_s + b t_w.    (11)

The parameters derived above are used to obtain expressions for the parallel run time per iteration.

The amount of work done per iteration of the algorithm is approximated by the body of the "while" loop (in Figure 3, for example), i.e., the initialization overhead prior to loop entry is considered negligible. In the CGS algorithm of Figure 3, it can be observed that for each iteration of the "while" loop three inner products and two matrix-vector multiplications are required. Two of the inner products can be performed with a single global summing operation, i.e., the local inner products corresponding to the two inner products in Line (14) of Figure 3 can be reduced in a single global operation (with two operands). Vector-vector additions, vector-vector subtractions, and vector-scalar multiplications are also needed to implement the algorithm. However, these operations are completely parallelizable, i.e., no inter-PE communication is involved in a distributed memory machine.

The following expressions for the parallel run time of the CGS algorithm are derived assuming a data distribution as described in the beginning of this section. Table 1 shows the line by line time complexity of the CGS algorithm. Note that only the expressions within the while loop are considered here.

Table 1. Line by line time complexity of the CGS algorithm.

Line number    Time complexity
(4)            t_fp
(5)            2(N/P) t_fp
(6)            4(N/P) t_fp
(7)            t_mult
(8)            t_fp + t_inner(1)
(9)            2(N/P) t_fp
(10)           (N/P) t_fp
(11)           t_mult + 2(N/P) t_fp
(12)           2(N/P) t_fp
(13)           t_fp
(14)           t_inner(2)


Let the time taken for the ith iteration of the CGS algorithm be t^i_CGS. An expression for the value of t^i_CGS can be obtained by counting the different types of operations performed per iteration of the CGS algorithm. Therefore,

t^i_CGS = (3 + 13N/P) t_fp + t_inner(1) + t_inner(2) + 2 t_mult,    (12)

where the (3 + 13N/P) t_fp term comes from Lines (4), (5), (6), (8), (9), (10), (11), (12), and (13), the t_inner(1) term comes from Line (8), the t_inner(2) term comes from Line (14), and the 2 t_mult term comes from Lines (7) and (11) of Figure 3. The number of PEs that results in the minimum parallel execution time, P_CGS, can be derived by solving for P in ∂(t^i_CGS)/∂P = 0.

If A has an unstructured sparsity pattern, then Equation (12) can be simplified

by substituting the value of t_inner from Equation (8) and the value of t_umult for t_mult from Equation (10). For an unstructured A matrix, let the time per iteration of the CGS algorithm be t^i_CGS_u:

t^i_CGS_u = (8 t_s + 6 t_w)√P + (17 + 4k)(N/P) t_fp + 3 t_fp + 2 N t_w    (13)

The number of PEs that results in the minimum parallel execution time, P_CGS_u, can be derived as follows by solving for P in ∂(t^i_CGS_u)/∂P = 0:

P_CGS_u = ((17 + 4k) N t_fp / (4 t_s + 3 t_w))^(2/3)    (14)
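For completeness, the optimization step behind Equation (14) can be written out; only the P-dependent terms of Equation (13) matter, and Equations (16), (23), and (25) follow from the same calculation with their corresponding coefficients:

\frac{\partial t^{i}_{CGS\_u}}{\partial P}
  = \frac{8 t_s + 6 t_w}{2\sqrt{P}} - \frac{(17+4k) N t_{fp}}{P^{2}} = 0
  \;\Longrightarrow\;
  P^{3/2} = \frac{2 (17+4k) N t_{fp}}{8 t_s + 6 t_w}
  \;\Longrightarrow\;
  P_{CGS\_u} = \left( \frac{(17+4k) N t_{fp}}{4 t_s + 3 t_w} \right)^{2/3}.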

Similarly, for a reordered A (obtained by reordering the unstructured A) with a bandwidth of b, let the time per iteration of the CGS algorithm be t^i_CGS_b and the number of PEs for minimum parallel execution time be P_CGS_b. A general expression for the time taken per iteration of the CGS algorithm is given by Equation (12), i.e., the expression is applicable for reordered or unstructured A matrices. The only parameter in Equation (12) that depends on the structure of the A matrix is the t_mult term. By substituting the value of t_bmult given by Equation (11) into Equation (12), an expression for t^i_CGS_b can be obtained. Therefore, from Equations (8), (11), and (12):

t^i_CGS_b = (4 t_s + 6 t_w)√P + (17 + 4k)(N/P) t_fp + 3 t_fp + 4 t_s + 2 b t_w    (15)

The number of PEs that results in the minimum parallel execution time, P_CGS_b, can be derived as follows by solving for P in ∂(t^i_CGS_b)/∂P = 0:

P_CGS_b = ((17 + 4k) N t_fp / (2 t_s + 3 t_w))^(2/3)    (16)
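As a compact summary of Equations (13) and (15), helpers of the following form (an illustrative sketch, not the paper's code) can evaluate the per-iteration model for given machine parameters:

#include <math.h>

/* Modeled time per CGS iteration, Equations (13) and (15).
   N: matrix dimension, P: number of PEs, k: non-zeros per row,
   b: bandwidth (banded case); t_fp, t_s, t_w: machine parameters in seconds. */
double t_cgs_unstructured(double N, double P, double k,
                          double t_fp, double t_s, double t_w)
{
    return (8.0 * t_s + 6.0 * t_w) * sqrt(P)
         + (17.0 + 4.0 * k) * (N / P) * t_fp
         + 3.0 * t_fp + 2.0 * N * t_w;
}

double t_cgs_banded(double N, double P, double k, double b,
                    double t_fp, double t_s, double t_w)
{
    return (4.0 * t_s + 6.0 * t_w) * sqrt(P)
         + (17.0 + 4.0 * k) * (N / P) * t_fp
         + 3.0 * t_fp + 4.0 * t_s + 2.0 * b * t_w;
}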


6. The modified CGS (MCGS) algorithm

In Figure 3, the element-wise vector-vector, vector-scalar, scalar-scalar, and matrix-vector operations that follow the inner products in Lines (8) and (14) depend on the values generated by the inner product operations. Therefore, the operations in Lines (8) and (14) form two distinct synchronization operations per iteration of the "while" loop. Depending on the values of N and P, the synchronization overhead can have a significant impact on the actual execution time. The synchronization overhead can be reduced by merging the inner products together so that a single global summing is sufficient for an iteration of the "while" loop.

The basic idea in formulating the MCGS algorithm is to merge the inner products present in the CGS algorithm so that they can be evaluated using a single global summing operation per iteration of the "while" loop. One way of merging the inner products is to reformulate the algorithm so that the inner product in Line (8) of Figure 3 can be combined with that in Line (14) of Figure 3. Let s_i = Ar_i, w_i = Aq_i, and y_i = Ap_i. From Figure 3, consider Lines (5), (6), (7), and (8):

σ = r̃_0^H v_n
  = r̃_0^H A p_n
  = r̃_0^H A (r_n + 2β q_n + β^2 p_{n−1})
  = r̃_0^H s_n + 2β r̃_0^H w_n + β^2 r̃_0^H y_{n−1}    (17)

Consider Line (6) of Figure 3:

p_n = u_n + β(q_n + β p_{n−1})
A p_n = A r_n + 2β A q_n + β^2 A p_{n−1}
y_n = s_n + 2β w_n + β^2 y_{n−1}    (18)

From Line (9) of Figure 3,

q_{n+1} = u_n − α v_n
        = u_n − α A p_n
        = u_n − α y_n    (19)

From Lines (5), (10), and (11) of Figure 3,

r_{n+1} = r_n − α A (u_n + q_{n+1})
        = r_n − α (A r_n + β A q_n + A q_{n+1})
        = r_n − α (s_n + β w_n + w_{n+1})    (20)


Using the derivations given in Equations (17), (18), (19), and (20), the CGS algorithm of Figure 3 can be reformulated as the MCGS algorithm of Figure 6.

In the MCGS algorithm of Figure 6, the inner products are all grouped together so that a single global summing operation (with five operands) is sufficient to evaluate them. The number of matrix-vector multiplications in the MCGS algorithm remains the same as the number of matrix-vector multiplications in the CGS algorithm. Due to the modifications, additional scalar-scalar, scalar-vector, and element-wise vector-vector operations are performed in the MCGS algorithm compared to the operations performed by the CGS algorithm. However, these additional operations scale linearly with the inverse of the number of PEs.

The following expressions for the parallel run time of the MCGS algorithm are derived assuming a data distribution as described in the beginning of Section 5. Table 2 shows the line by line time complexity of the MCGS algorithm. Note that only the expressions within the while loop are considered here.

The parallel run time for the ith iteration of the MCGS algorithm is

t^i_MCGS = (10 + 16N/P) t_fp + t_inner(5) + 2 t_mult,    (21)

where the (10 + 16N/P) t_fp term comes from Lines (6), (7), (8), (9), (10), (12), (14), and (15), the t_inner(5) term comes from Line (16), and the 2 t_mult term comes from Lines (11) and (13) of Figure 6. The number of PEs that results in the minimum parallel execution time, P_MCGS, can be derived by solving for P in ∂(t^i_MCGS)/∂P = 0.

If A has an unstructured sparsity pattern, then Equation (21) can be simplified by substituting the value of t_inner from Equation (8) and the value of t_umult for t_mult

(1)  γ_{−1} = 1; p_0 = 0; q_0 = 0; w_0 = 0;
(2)  w_{−1} = 0; y_{−1} = 0; r_0 = b − Ax_0; s_0 = Ar_0;
(3)  γ_0 = r̃_0^H r_0; η_0 = r̃_0^H s_0;
(4)  μ_0 = r̃_0^H w_0; λ_0 = r̃_0^H y_{−1}; ρ_{−1} = 1; n = 0;
(5)  while (not converged) do
(6)      β = γ_n / γ_{n−1}; σ = η_n + β(2μ_n + λ_n β);
(7)      α = γ_n / σ;
(8)      u_n = r_n + βq_n;
(9)      y_n = s_n + 2βw_n + β^2 y_{n−1};
(10)     q_{n+1} = u_n − αy_n;
(11)     w_{n+1} = Aq_{n+1};
(12)     r_{n+1} = r_n − α(s_n + βw_n + w_{n+1});
(13)     s_{n+1} = Ar_{n+1};
(14)     x_{n+1} = x_n + α(u_n + q_{n+1});
(15)     n = n + 1;
(16)     γ_n = r̃_0^H r_n; η_n = r̃_0^H s_n; μ_n = r̃_0^H w_n; λ_n = r̃_0^H y_{n−1}; e_n = r_n^H r_n;
(17) endwhile

Figure 6. The modified conjugate gradient squared (MCGS) algorithm.


Table 2. Line by line time complexity of the MCGS algorithm.

Line number    Time complexity
(6)            6 t_fp
(7)            t_fp
(8)            2(N/P) t_fp
(9)            2 t_fp + 4(N/P) t_fp
(10)           2(N/P) t_fp
(11)           t_mult
(12)           5(N/P) t_fp
(13)           t_mult
(14)           3(N/P) t_fp
(15)           t_fp
(16)           t_inner(5)

from Equation (10), giving

t^i_MCGS_u = (6 t_s + 10 t_w)√P + (24 + 4k)(N/P) t_fp + 10 t_fp + 2 N t_w    (22)

P_MCGS_u = ((24 + 4k) N t_fp / (3 t_s + 5 t_w))^(2/3)    (23)

Similarly, for a banded A (obtained by reordering the unstructured A) with a bandwidth of b, from Equations (8) and (11),

t^i_MCGS_b = (2 t_s + 10 t_w)√P + (24 + 4k)(N/P) t_fp + 10 t_fp + 4 t_s + 2 b t_w    (24)

P_MCGS_b = ((24 + 4k) N t_fp / (t_s + 5 t_w))^(2/3)    (25)

7. Comparison of the algorithms

From the derivation of the MCGS algorithm in Section 6, it can be observed that MCGS and CGS are basically the same algorithm, but with a different computing sequence. Because both algorithms perform exactly the same calculations, they are numerically equivalent, unless round-off errors create a difference. The experimental results presented in Section 8 further support this claim.

Reordering the matrix A for bandwidth reduction reduces the time taken for the matrix-vector multiplication operation, but the time for the other operations remains almost the same. Therefore, reordering the A matrix reduces the total execution time. Because the total execution time for the banded A matrix is lower than the execution time for the unstructured A matrix case, the inner product operation contributes a bigger percentage towards the overall execution time.


Hence, the reduction in the global synchronization overhead is likely to have a more significant impact for banded A matrices.

Because MCGS and CGS take the same number of iterations to converge, the parallel run time of the algorithms can be compared using the time taken per iteration of the algorithm, i.e., for MCGS to perform better than CGS, t^i_MCGS should be less than t^i_CGS. Consider the case where A is banded with a bandwidth b and t_s ≫ t_w, so the t_w term can be ignored. Using the expressions derived in Equations (15) and (24) for t^i_CGS_b and t^i_MCGS_b, respectively, for MCGS to perform better than CGS

P > (7 N t_fp / (2 t_s))^(2/3)  for t_s ≫ t_w.    (26)

Let t^min_MCGS_b and t^min_CGS_b be the minimum parallel execution times for the MCGS and CGS algorithms, respectively. The value of P_CGS_b can be substituted from Equation (16) into Equation (15) to obtain the following expression for t^min_CGS_b:

t^min_CGS_b = (4 t_s + 6 t_w)√(P_CGS_b) + (17 + 4k) N t_fp / P_CGS_b + 3 t_fp + 4 t_s + 2 b t_w
            = (6 t_s + 9 t_w)√(P_CGS_b) + 3 t_fp + 4 t_s + 2 b t_w    (27)

Similarly, Equations (24) and (25) can be used to derive the expression for t^min_MCGS_b:

t^min_MCGS_b = (2 t_s + 10 t_w)√(P_MCGS_b) + (24 + 4k) N t_fp / P_MCGS_b + 10 t_fp + 4 t_s + 2 b t_w
             = (3 t_s + 15 t_w)√(P_MCGS_b) + 10 t_fp + 4 t_s + 2 b t_w    (28)

For distributed memory MIMD machines, where t_s ≫ t_w, from Equations (27) and (28), t^min_CGS_b and t^min_MCGS_b can be approximated as follows:

t^min_CGS_b = 6 t_s √(P_CGS_b)    (29)

t^min_MCGS_b = 3 t_s √(P_MCGS_b)    (30)


t^min_MCGS_b / t^min_CGS_b = (3/6) √(P_MCGS_b / P_CGS_b)
    = (1/2) [ (24 + 4k) N t_fp (2 t_s + 3 t_w) / ((t_s + 5 t_w)(17 + 4k) N t_fp) ]^(1/3)
    ≈ ( (24 + 4k) / (4(17 + 4k)) )^(1/3) ≈ 0.66  for k = 8    (31)

Hence, the best MCGS timing is approximately up to 34% better than the best CGS timing. The variation of the ratio t^min_MCGS_b / t^min_CGS_b with the value of k is shown in Figure 7. Also, for the assumption stated, Equation (26) provides the machine size (in number of PEs) above which MCGS performs better than CGS. This can be used to automatically select an algorithm (either CGS or MCGS) depending on N, P, t_fp, and t_s [13, 18].

This approach will allow heterogeneous computing management systems [9] to adaptively select the best algorithm to use from a set of algorithms. In this case, the problem size N is fixed, but the values of P, t_s, and t_fp can vary depending on the machine that is selected for executing the solver. For the heterogeneous computing management systems to make the best decision, it is necessary to provide information such as that in Equation (26) to the management systems.
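A selection rule of this kind reduces to a few lines of code; the sketch below (an assumption about how a mapping system might encode Equation (26), not code from the paper) returns which algorithm the model predicts to be faster:

#include <math.h>

/* Equation (26): for a banded A and t_s >> t_w, MCGS is predicted to be faster
   than CGS once the machine size P exceeds P* = (7 N t_fp / (2 t_s))^(2/3). */
typedef enum { USE_CGS, USE_MCGS } solver_t;

solver_t select_solver(double N, double P, double t_fp, double t_s)
{
    double p_star = pow(7.0 * N * t_fp / (2.0 * t_s), 2.0 / 3.0);
    return (P > p_star) ? USE_MCGS : USE_CGS;
}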

8. Numerical experiments and discussion

Experiments were conducted on up to 128 nodes of the Purdue mesh-connected Intel Paragon XP/S [2] and up to 16 thin nodes of the Purdue multistage network connected IBM SP2 [1, 2, 16]. The algorithms were implemented on the Paragon

Figure 7. Variation of the t^min_MCGS_b / t^min_CGS_b ratio with the value of k.


using C and NX message passing library routines. The algorithms were implemented on the IBM SP2 using C and message passing interface (MPI) library routines. The global summing operations are performed using the recursive doubling routines provided by the NX and MPI libraries.

The timings varied slightly (less than 5%) between trials. This is due to cache misses, background network traffic, and possibly other factors. The minimum of the timing values was taken as the representative value, because the minimum occurs when the impact due to external factors is minimal.

Two electromagnetic scattering matrices were used in the experiments reported in this section. The bigger matrix, referred to as A_1, is 36818-by-36818 and the smaller one, referred to as A_2, is 2075-by-2075. A_1 is for a 5λ radius circular computation domain, a rectangular 5 × 0.1λ perfect conductor scatterer, and TM polarization with an incidence angle of 45° with respect to the plate normal. A_2 is for an incident angle of 270°. Each matrix was reordered to form a banded matrix by a reverse Cuthill-McKee algorithm [6], obtaining two more matrices A_1b and A_2b, where A_1b is from A_1 and A_2b is from A_2. The sparsity pattern of A_1 is shown in Figure 8 and A_1b is shown in Figure 9. The thick straight line shown in Figure 9 denotes a band that contains all the non-zero elements of the reordered A matrix. The timings for the reordering phase are not reported here, because these timings do not affect the CGS versus MCGS comparison.

The execution time of the CGS algorithm is dependent on the time taken for the basic operations: matrix-vector multiplication (referred to as matvec), inner products (referred to as innerprod), element-wise vector-vector operations, scalar-vector

Figure 8. Sparsity pattern of test matrix A_1, N = 36818.


Figure 9. Sparsity pattern of test matrix A_1b obtained by reordering A_1 using the reverse Cuthill-McKee algorithm.

operations, and scalar-scalar operations. The last three do not involve inter-PE communication (i.e., each PE executes them independently) and will be collectively referred to as indops. Thus, there are three categories to be considered: matvec, innerprod, and indops. To get a better understanding of the variation in the performance with the number of PEs, the timings were measured separately for innerprod, matvec, and indops. Figure 10 shows the three timing curves corresponding to the innerprod time, matvec time, and indops time for the Paragon.

The innerprod time includes the time needed to compute the local product terms and the time needed to sum the local products. The computation time for the local products decreases with increasing P, whereas the global summing time increases with increasing P. Likewise, the matvec time includes the time to do the multiplications and additions and the inter-PE communication time needed to retrieve remote vector values. From Figure 10 it can be noted that the timing for matvec decreases with increasing P. This is due to the decrease in the computation time associated with the matvec operation. Similarly, the time for the indops decreases with increasing number of PEs.

For the larger matrix (A_1) and 4 ≤ P ≤ 16, the indops time is the dominant component of the total execution time. Therefore, any optimizations done for the innerprod and matvec times will not have a significant impact on the overall execution time. However, as the number of PEs is increased the indops time decreases, and at 128 PEs the innerprod and matvec times are nearly the same as


Figure 10. Independent components of the total execution time for CGS on the Intel Paragon for A_1.

the indops time. Reordering the matrix A_1 has a reducing effect on the matvec time, as shown in Figure 11 for the Paragon. The indops and innerprod times are independent of the reordering. The overall execution time of the CGS algorithm is reduced, due to the reordering of A for bandwidth reduction. For the bandwidth-reduced A, the innerprod time forms a higher percentage of the overall execution time compared to the situation where the matrix A is unstructured. Therefore, the optimizations towards reducing the innerprod time are likely to have a

Figure 11. Independent components of the total execution time for CGS on the Intel Paragon for A_1b.


relatively high impact on the overall execution time when the matrix A is reordered for bandwidth reduction.

From the discussion of MCGS in Section 6, it is evident that the modifications performed in merging one inner product with the other in the CGS algorithm to formulate the MCGS algorithm have introduced additional element-wise vector operations, vector-scalar operations, and scalar-scalar operations (i.e., indops forms a higher percentage of the total number of operations performed by the MCGS algorithm compared to the indops performed by the CGS algorithm). For smaller P and larger N values, the indops time dominates the other two components, innerprod and matvec. Therefore, for smaller P and larger N values, MCGS takes more time than the CGS algorithm, as shown in Figure 12 with A_1b for the Paragon. As P is increased MCGS starts to perform better. The same comparison is shown for A_2b in Figure 13. The experimentally derived difference of 20% confirms a significant improvement using MCGS for this set of parameters.

Equation (26) is used to predict the value of P*, where P* is the number of PEs beyond which the MCGS algorithm should outperform the CGS algorithm. The values of N, P, t_s, and t_fp are needed to predict the value of P*. To determine the value of t_fp, small code segments with floating-point operations were executed on the Paragon and the executions were timed. From the experiments, the value of t_fp was found to be equal to 1.078 × 10^{-6} seconds. The value of t_fp determined by these experiments does not include any loop or other integer operation overhead. However, the time delays incurred by cache misses are included in the measured value of t_fp.

The value of t_s is measured by an echo server experiment, as shown in Figure 14. In the echo server experiment, two PEs which are adjacent to each other are selected. One PE runs the echo server, which is a program that retransmits

Figure 12. Execution time of CGS versus MCGS for A_1b on the Paragon.


Figure 13. Execution time of CGS versus MCGS for A_2b on the Paragon.

Figure 14. The PE configuration for the echo server experiment.

whatever it receives back to the sending PE. The other PE in the experiment executes the testing program. The testing program turns on a timer and sends a zero-length message to the PE that runs the echo server. Once the testing PE receives the reply back from the echo server, the timer is stopped and the elapsed time is measured. The elapsed time is equal to two times the message setup overhead, i.e., 2 t_s. The value of t_s was determined to be 112 μs from the echo server experiment for the Intel Paragon. For the Paragon, using the values of t_s and t_fp determined by the above experiments, the value of P* is estimated by Equation (26) to be 115 for A_1b and 17 for A_2b.
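These predictions follow directly from Equation (26); a small check using only the measured values quoted above reproduces them:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double t_fp = 1.078e-6;   /* measured floating-point operation time, s */
    const double t_s  = 112e-6;     /* measured message setup time, s            */
    const double N1   = 36818.0;    /* dimension of A_1b */
    const double N2   = 2075.0;     /* dimension of A_2b */

    /* P* = (7 N t_fp / (2 t_s))^(2/3), Equation (26) */
    printf("P* for A_1b = %.0f\n", pow(7.0 * N1 * t_fp / (2.0 * t_s), 2.0 / 3.0));
    printf("P* for A_2b = %.0f\n", pow(7.0 * N2 * t_fp / (2.0 * t_s), 2.0 / 3.0));
    return 0;   /* prints approximately 115 and 17 */
}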

In Figure 15, the MCGS algorithm is compared with the CGS algorithm on the IBM SP2 for A_2b. The PEs in the SP2 are faster and have larger cache memory than the PEs in the Paragon, and the SP2 has a relatively smaller t_fp/t_s ratio than the Paragon. Therefore, on the SP2, the MCGS algorithm is the better algorithm even for smaller numbers of PEs, as compared to the Intel Paragon. A complexity analysis of the algorithms was not performed for the SP2 machine, because only empirically derived communication models are available for the SP2 [19, 20]. From Figure 16, it can be observed that for matrix A_1b, CGS outperforms MCGS; however, as the number of PEs is increased the performance gap between the


Figure 15. Execution time of CGS versus MCGS for A_2b on the SP2.

Figure 16. Execution time of CGS versus MCGS for A_1b on the SP2.

MCGS and CGS algorithms decreases. At a sufficiently large P, the MCGS algorithm should perform better than the CGS algorithm for larger matrices (e.g., A_1b), as it did for smaller matrices (e.g., A_2b).

9. Conclusions

A reformulation of the CGS algorithm, called MCGS, is presented in this paper. The system of linear equations obtained from the finite element modeling of open-region


electromagnetic problems was solved using the CGS and MCGS algorithms. This work examined ways of reducing the communication and synchronization overhead associated with implementing the CGS algorithm on MIMD parallel machines.

An experimental and theoretical complexity analysis was performed to evaluate the performance benefits. The approximate complexity analysis of the CGS and MCGS algorithms on a mesh-based multiprocessor model estimates that the performance of the MCGS algorithm may be up to 34% better than the performance of the CGS algorithm, depending on the machine architecture. The experimental results obtained from a 128-processor Intel Paragon show that the performance of the MCGS algorithm is at least 20% better than the performance of the CGS algorithm for a 2075 × 2075 sparse matrix. The experimental results on the IBM SP2 indicate that MCGS is the better algorithm for the 2075 × 2075 sparse matrix and the CGS algorithm is the better algorithm for the 36818 × 36818 sparse matrix. MCGS can be expected to be the better algorithm for the larger matrix as the number of PEs increases. Any optimization or preconditioning that is used on CGS to reduce the overall computation time can be used on MCGS as well. The techniques used here to reduce the synchronization overhead of CGS can be extended to other nonsymmetric Krylov methods, such as the BiCGSTAB algorithm [17].

Because MCGS is not the best method for all situations, the use of a set of algorithms approach is proposed. In the set of algorithms approach, either CGS or MCGS is selected depending on the values of N and P. The set approach provides an algorithm that is more scalable than either the CGS or MCGS algorithms alone. Conditions such as the one developed here to choose between CGS or MCGS depending on the system parameters are also useful in the area of HC mapping systems [9]. In HC mapping, the input data (e.g., matrix size) remains fixed, but the system parameters are varied, i.e., the mapping system estimates the performance on different machines and executes the application on the machine that is expected to yield the best performance. To obtain the best mapping, it is necessary for the HC mapping systems to have information such as that provided by the conditions to select either MCGS or CGS depending on system parameters.

Acknowledgments

A preliminary version of this paper was presented at the 1998 International Conference on Parallel Processing. The authors thank the referees for their suggestions. This work was supported in part by the DARPA/ITO Quorum Program under NPS subcontract numbers N62271-97-M-0900 and N62271-98-M-0217.

References

1. T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir. SP2 system architecture. IBM Systems Journal, 34:152–184, 1995.


2. G. S. Almasi and A. Gottlieb. Highly Parallel Computing, 2nd ed. Benjamin Cummings, Redwood City, CA, 1994.

3. E. Dazevedo, V. Eijkhout, and C. Romaine. Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessors. LAPACK Working Note 56, 1992.

4. R. W. Freund. Conjugate gradient-type methods for linear systems with complex symmetric coefficient matrices. SIAM Journal on Scientific and Statistical Computing, 13:425–448, 1992.

5. R. W. Freund, G. H. Golub, and N. H. Nachtigal. Iterative solution of linear systems. Acta Numerica, pp. 57–100, 1992.

6. A. George and J. W. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.

7. B. Lichtenberg. Finite element modeling of wavelength-scale diffractive elements. PhD thesis, Purdue University, West Lafayette, IN, 1994.

8. B. Lichtenberg, K. J. Webb, D. B. Meade, and A. F. Peterson. Comparison of two-dimensional conformal local radiation boundary conditions. Electromagnetics, 16:359–384, 1996.

9. M. Maheswaran, T. D. Braun, and H. J. Siegel. High-performance mixed-machine heterogeneous computing. In 6th Euromicro Workshop on Parallel and Distributed Processing, pp. 3–6, 1998.

10. G. Meurant. Multitasking the conjugate gradient on the Cray X-MP/48. Parallel Computing, 5:267–280, 1987.

11. Y. Saad. Krylov subspace methods on supercomputers. SIAM Journal on Scientific and Statistical Computing, 10:1200–1232, 1989.

12. Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. LAPACK Working Note 50, 1994.

13. H. J. Siegel, L. Wang, J. E. So, and M. Maheswaran. Data parallel algorithms. In A. Y. Zomaya, ed., Parallel and Distributed Computing Handbook, pp. 466–499. McGraw-Hill, New York, NY, 1996.

14. P. Sonneveld. CGS: A fast Lanczos-type solver for nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 10:36–52, 1989.

15. G. Strang. Linear Algebra and Its Applications, 3rd ed. Harcourt Brace Jovanovich, San Diego, CA, 1988.

16. C. B. Stunkel, D. G. Shea, B. Abali, M. G. Atkins, C. A. Bender, D. G. Grice, P. H. Hochschild, D. J. Joseph, B. J. Nathanson, R. A. Swetz, R. F. Stucke, M. Tsao, and P. R. Varker. The SP2 high-performance switch. IBM Systems Journal, 34:185–204, 1995.

17. H. A. Van Der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM Journal of Scientific and Statistical Computing, 12:631–644, 1992.

18. M.-C. Wang, W. G. Nation, J. B. Armstrong, H. J. Siegel, S. D. Kim, M. A. Nichols, and M. Gherrity. Multiple quadratic forms: A case study in the design of data-parallel algorithms. Journal of Parallel and Distributed Computing, 21:124–139, 1994.

19. Z. Xu and K. Hwang. Modeling communication overhead: MPI and MPL performance on the IBM SP2. IEEE Parallel and Distributed Technology, 4:9–23, 1996.

20. Z. Xu and K. Hwang. Early prediction of MPP performance: The SP2, T3D, and Paragon experiences. Parallel Computing, 22:917–924, 1996.