
The Sparse Cyclic Distribution against its Dense Counterparts *

Gerardo Bandera, Manuel Ujaldon, Maria A. Trenas, Emilio L. Zapata
Computer Architecture Department, University of Malaga, Spain.

e-mail: {bandera, ujaldon, maria, ezapata}@ac.uma.es

Abstract

Several methods have been proposed in the literature for the distribution of data on distributed memory machines, oriented either to dense or to sparse structures. Many real applications, however, deal with both kinds of data jointly. This paper presents techniques for integrating dense and sparse array accesses in a way that optimizes locality and further allows an efficient loop partitioning within a data-parallel compiler. Our approach is evaluated through an experimental survey with several compilers and parallel platforms. The results prove the benefits of the BRS sparse distribution when combined with CYCLIC in mixed algorithms, and the poor efficiency achieved by well-known distribution schemes when sparse elements arise in the source code.

1. Introduction

Since distributed memory multiprocessors became popular in the computer architecture arena and data-parallel compilers were born to aid their programming tasks, scatter decompositions have proved an outstanding way of distributing data and computations in a round-robin fashion. Fairly well studied examples are the CYCLIC distribution for dense arrays and the BRS/BCS (Block Row/Column Scatter) scheme for sparse matrices [8, 11]. BRS/BCS considers the particular way in which nonzeroes are represented and preserves the same storage format for the local data once distributed. Figure 1 shows this distribution as well as its relation with the CRS sparse format.
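As a quick point of reference for readers unfamiliar with CRS, the following sketch builds the three vectors of the format (named Data, Column and Row, as in the declarations of Figure 2) for a small illustrative matrix; the matrix and its values are invented for the example and are not those of Figure 1.

C     Minimal CRS sketch for an illustrative 4x4 matrix:
C        ( 1.1  0.0  0.0  2.2 )
C        ( 0.0  3.3  0.0  0.0 )
C        ( 0.0  0.0  0.0  0.0 )
C        ( 4.4  0.0  5.5  0.0 )
C     Row(I) points to the first nonzero of row I inside Data/Column,
C     and Row(N+1) closes the last row, so row I spans
C     Data(Row(I):Row(I+1)-1).
      PROGRAM CRSDEMO
      INTEGER N, NNZ
      PARAMETER (N = 4, NNZ = 5)
      REAL    Data(NNZ)
      INTEGER Column(NNZ), Row(N+1)
      INTEGER I, J
      DATA Data   / 1.1, 2.2, 3.3, 4.4, 5.5 /
      DATA Column / 1,   4,   2,   1,   3   /
      DATA Row    / 1,   3,   4,   4,   6   /
C     Traverse the matrix row by row, touching only the nonzeroes.
      DO I = 1, N
         DO J = Row(I), Row(I+1)-1
            PRINT *, 'A(', I, ',', Column(J), ') =', Data(J)
         END DO
      END DO
      END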

This paper studies the behaviour of such distributions in a more general context: mixed algorithms where large dense and sparse structures are accessed through different loops. For our study we have selected the Conjugate Gradient (CG) algorithm [1], which constitutes the oldest, best known and most effective of the nonstationary iterative methods for the solution of symmetric positive definite systems. Figure 2 shows an outline of this algorithm, where the input sparse matrix stores in CRS format the set of nonzero coefficients for each of the equations constituting the linear system. Its computational cost lies not so much in its length as in the size of the data structures it deals with. Since the sparse matrix plays a primary role, we have selected for our experiments several real matrices from the Harwell-Boeing Collection [6], whose characteristics are presented in Table 1.

*This work was supported by CICYT-Spain, project TIC96-1125-C03-01 and TRACS (Human Capital and Mobility Program of the European Union), grant ERB-CHGE-CT92-0005.

We have parallelized CG using a Cray T3D and an Intel Paragon. For the first machine, the automatic parallelization was done through the CRAFT compiler and the manual one using the SHMEM routines [2]. For the Intel Paragon, we use the Vienna-Fortran data-parallel compiler and the NX libraries for the manual case.

The rest of the paper is organized as follows. Section 2 shows the parallelization of BRS loops within the Vienna Fortran Compilation System (VFCS). Section 3 provides alternative specifications for parallelizing the CG algorithm in the CRAFT compiler using the basic distribution mechanisms available in this tool. Section 4 makes a comparison between BRS and these alternative methods. Sections 5 and 6 conclude the paper with some related work and conclusions.

Figure 1: (left) A BRS matrix partition using a 2x2 processor stencil, with the elements of P(0,0) highlighted; (right) the local CRS representation on processor P(0,0).


      PROCESSORS Q(0:P1-1, 0:P2-1)
      PARAMETER (N=1000, NUM_ITERS=10000)
      REAL P(N), Q(N), R(N), X(N) DIST(CYCLIC)
      REAL scalar1, scalar2, scalar3
C --- Sparse Matrix A uses Compressed Row Storage Format
      REAL A(N,N), SPARSE(CRS(Data,Column,Row)), DYNAMIC
      DISTRIBUTE A :: (CYCLIC, CYCLIC)   ! After reading A from file

C --- Fill initial guess for X and residuals R = B - A.X

      DO K = 1, NUM_ITERS
         IF (K .GT. 1) THEN
            scalar2 = scalar1
         ENDIF
         scalar1 = 0.0
         DO I = 1, N
            scalar1 = scalar1 + R(I)*R(I)
         END DO
         IF (K .EQ. 1) THEN
            DO I = 1, N
               P(I) = R(I)
            END DO
         ELSE
            DO I = 1, N
               P(I) = R(I) + (scalar1/scalar2)*P(I)
            END DO
         ENDIF
         DO I = 1, N                      ! SPARSE Matrix-Vector Product
            Q(I) = 0.0
            DO J = Row(I), Row(I+1)-1
               Q(I) = Q(I) + Data(J)*P(Column(J))
            END DO
         END DO
         scalar3 = 0.0
         DO I = 1, N                      ! Vector-Vector Product
            scalar3 = scalar3 + P(I)*Q(I)
         END DO
         scalar3 = scalar1/scalar3
         DO I = 1, N
            X(I) = X(I) + scalar3 * P(I)  ! Update solution
            R(I) = R(I) - scalar3 * Q(I)  ! Update residuals
         END DO
      END DO

Figure 2: Vienna-Fortran code for the sparse Conjugate Gradient algorithm (CG).

Table 1: Characteristics of benchmark matrices.

      Matrix      Dims              Nonzeroes   Features
      BCSSTK29    13992 x 13992        316740   Large
      BCSSTK30    28924 x 28924       1036208   Very large
      PSMIGR1      3140 x  3140        543162   Small, dense

2. BRS Parallelization

We now use VFCS to compare automatic and manual BRS parallelizations on the Intel Paragon. Figure 2 shows the CG algorithm written in Vienna-Fortran with minor language extensions for the BRS specification: the SPARSE keyword determines either CRS or CCS format as well as the names of the three vectors that represent the matrix, and the DISTRIBUTE statement selects the distribution, in this case (CYCLIC, CYCLIC). This tells the compiler to use the BRS scheme for the sparse matrix. Note that the (CYCLIC, CYCLIC) annotation does not imply the use of any regular distribution for the sparse vectors, but only describes the way in which the global data space, conceptually considered as a dense one (that is, zeroes included), is mapped onto the processors of the parallel machine.
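As an illustration of the mapping just described, the following is a minimal sketch of the ownership rule, not the actual VFCS run-time code: a nonzero whose conceptual dense coordinates are (I, J) lands on the mesh processor obtained cyclically from I and J, and its local coordinates follow the usual cyclic numbering.

C     Hedged sketch: ownership and local coordinates of the nonzero
C     A(I,J) under a (CYCLIC,CYCLIC) mapping of the conceptual dense
C     index space onto a P1 x P2 processor mesh. The local numbering
C     shown here is the conventional one for cyclic distributions and
C     may differ in detail from the compiler's internal bookkeeping.
      SUBROUTINE BRSMAP(I, J, P1, P2, PROW, PCOL, LI, LJ)
      INTEGER I, J, P1, P2, PROW, PCOL, LI, LJ
      PROW = MOD(I-1, P1)
      PCOL = MOD(J-1, P2)
      LI   = (I-1)/P1 + 1
      LJ   = (J-1)/P2 + 1
      END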

The BRS alignment with CYCLIC dense vectors is one of the major advantages of this sparse distribution, because in general it facilitates a high degree of locality in mixed operations.

The efficiency of a sparse code distributed by BRS and parallelized by the VFCS compiler depends largely on two factors: The sparsity rate of the input matrix and the cost of the preprocessing phase to figure out the access pattern.

Figure 3 illustrates the impact of those issues. A programmer can develop a code in which the translation can be effectively done on the fly, making use of the additional information and freedom he has over a compiler. However, this on-the-fly translation has to be computed in each iteration and, as we will see immediately, for a large enough number of iterations the skills of the programmer fade away and are defeated by the automatic tool.
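The sketch below gives a rough idea of the kind of preprocessing being amortized: a translation vector is filled once so that later CG iterations can address P(Column(J)) without recomputing owners and offsets. The subroutine name, the marker value and the single-dimension treatment of the dense vector are illustrative assumptions, not the VFCS-generated code.

C     Hedged sketch of an index-translation preprocessing step.
C     For every nonzero, decide whether P(Column(J)) is local to this
C     processor column (MYCOL) of the P2-wide mesh dimension; if so,
C     store its local offset, otherwise mark it for the communication
C     phase. TRANS is then reused unchanged by every CG iteration.
      SUBROUTINE BUILDTR(NNZ, Column, P2, MYCOL, TRANS)
      INTEGER NNZ, P2, MYCOL
      INTEGER Column(NNZ), TRANS(NNZ)
      INTEGER J, GJ
      DO J = 1, NNZ
         GJ = Column(J)
         IF (MOD(GJ-1, P2) .EQ. MYCOL) THEN
            TRANS(J) = (GJ-1)/P2 + 1
         ELSE
C           Non-local reference: resolved by the run-time (see the
C           per-processor lists discussed later in this section).
            TRANS(J) = -1
         END IF
      END DO
      END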

Charts 3.a and 3.b compare the effect of the matrix density: the sparser the matrix, the better for the compiler. They also reflect the scalability of the manual gain for a small number of iterations.

Figure 3.c depicts the number of iterations needed for the CG algorithm to converge, showing on the Y-axis the distance from the current X to the final solution at each new iteration. The results shown correspond to only one matrix, PSMIGR1, though a similar behaviour has been found in all the others (in the absence of a preconditioner, which would make the algorithm converge faster [1]).

It can be seen that the convergence process starts to exhibit a stationary behaviour after no less than one hundred iterations. By that time, the compiler preprocessing cost has already been widely amortized regardless of the input matrix, and the overall time starts to be even better than that of the manual version. The main reason for this surprisingly high compiler efficiency lies in the cache reuse of the vector storing the index translation, which fits entirely in cache at the first iteration and remains there unless the matrix is dense enough; this is precisely what makes the compiler version slow down with the denser matrix, PSMIGR1.

Figure 3: Compiler overhead for CG involving vectors distributed by BRS: (a) BCSSTK29 matrix, (b) PSMIGR1 matrix, (c) number of iterations to converge. Compiler: VFCS. Platform: Intel Paragon. (The charts plot results for 4, 8, 16 and 32 processors against the number of iterations in CG.)

In addition, the run-time sorts the non-local indices into separate lists for every other processor, so that the communications are efficiently organized across the underlying network. From one hundred iterations on, Figure 3 proves that this code transformation can eventually outperform schemes based on buffers and manual offset translation developed by a programmer.
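The following fragment sketches that per-processor grouping of non-local indices; it is an illustrative assumption of how such lists could be built, not the actual run-time library. Duplicates are kept for brevity, whereas a real implementation would typically also remove them before communicating.

C     Hedged sketch: bucket the non-local column indices by their
C     owner so that a single aggregated message per processor can be
C     exchanged. MAXL bounds the length of each list.
      SUBROUTINE BUCKETS(NNZ, Column, P2, MYCOL, MAXL, LIST, CNT)
      INTEGER NNZ, P2, MYCOL, MAXL
      INTEGER Column(NNZ), LIST(MAXL, 0:P2-1), CNT(0:P2-1)
      INTEGER J, OWNER
      DO OWNER = 0, P2-1
         CNT(OWNER) = 0
      END DO
      DO J = 1, NNZ
         OWNER = MOD(Column(J)-1, P2)
         IF (OWNER .NE. MYCOL) THEN
            CNT(OWNER) = CNT(OWNER) + 1
            LIST(CNT(OWNER), OWNER) = Column(J)
         END IF
      END DO
      END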

The conclusion is that the cost to be paid for an automatic parallelization when using the BRS scheme is worthwhile as long as the algorithm can amortize the preprocessing costs through a minimum number of iterations.

The remaining cost of CG lies in the multiple loops dealing with dense arrays distributed by CYCLIC. However, the computational weight of this part never exceeds 10% of the total execution time (see Figure 5 for a breakdown). Even though the compiler efficiency is expected to improve for such cases, its influence is minimal and does not lead to a significant variation of the results already presented.

3. Regular Parallelization

The former results were obtained with the VFCS, an experimental data-parallel compiler. In this section we use CRAFT, a commercial data-parallel compiler for the Cray T3D in which the BRS scheme is missing. The parallelization of the CG algorithm is therefore based on the existing distributions, BLOCK and CYCLIC. Some BRS results will also be shown on the Cray T3D in the next section to establish a comparison with the efficiency that CRAFT can reach without BRS.

3.1. Specifications

When using CRAFT, the user can specify BLOCK, CYCLIC or BLOCK-CYCLIC(k) distributions. These annotations do not produce a logical distribution of the vectors across the processors; instead, CRAFT relies on a shared-memory paradigm where every global variable is shared, and the actual distribution only takes place at a lower level.
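For reference, the ownership arithmetic behind those three regular distributions can be summarized as below for a one-dimensional array of N elements spread over NP processors; this is only a sketch of the usual definitions, and CRAFT's own bookkeeping is internal to the compiler.

C     BLOCK: contiguous chunks of CEILING(N/NP) elements per processor.
      INTEGER FUNCTION OWNBLK(I, N, NP)
      INTEGER I, N, NP
      OWNBLK = (I-1) / ((N + NP - 1) / NP)
      END

C     CYCLIC: element I goes to processor MOD(I-1, NP).
      INTEGER FUNCTION OWNCYC(I, NP)
      INTEGER I, NP
      OWNCYC = MOD(I-1, NP)
      END

C     BLOCK-CYCLIC(K): blocks of K elements dealt out round-robin.
      INTEGER FUNCTION OWNBCK(I, K, NP)
      INTEGER I, K, NP
      OWNBCK = MOD((I-1)/K, NP)
      END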

In order to minimize the memory requirements of the CG algorithm when executed on the Cray, certain distributions must be applied to its data structures. For the dense vectors, the choices are clearly either BLOCK or CYCLIC. For the sparse matrix, a particular mechanism such as MRD or BRS is desired, as they provide an efficient mapping of the sparse matrix in terms of memory use, workload balance, degree of locality and good alignment when combined with dense structures distributed by BLOCK or CYCLIC.

Since CRAFT does not provide any of those particular mechanisms, the CG specification has to be completed by distributing the sparse vectors by either BLOCK or CYCLIC. This results in poor workload balance and locality when accessing the sparse data.

3.2. Experimental Results

Figure 4.a compares the performance obtained on the Cray T3D when parallelizing the CG algorithm using the BLOCK and CYCLIC specifications within the CRAFT compiler.

Note that the CYCLIC dense distribution is better than BLOCK by a ratio that goes from 1 to 1.5 for up to 32 processors, with better scalability in the CYCLIC case. The relative difference is expected to grow when increasing the number of processors or the sparsity rate of the input matrix. CYCLIC also presents a slightly better alignment between dense and sparse structures, and the resulting locality improvement speeds up the execution time in algorithms where the percentage of sparse computation predominates. The time spent in the dense loops of the algorithm has been similar in all cases.

In order to avoid an excessive communication penalty when accessing the Row vector in the CG algorithm, we also tried to replicate this vector. The results obtained with this new version were better, though by a small and insignificant percentage.

Figure 4: Improvements comparison for CG algorithms on the Cray T3D. Left: CRAFT codes, BLOCK versus CYCLIC. Right: BRS-Fortran vs CYCLIC-CRAFT.

4. Overall Comparison

The last step in our evaluation compares the efficiency provided by the BRS parallelization against the best of the dense approaches applied to the CG algorithm. The BRS code was parallelized by hand using message-passing Fortran with the SHMEM routines [2]. The structure of the code matches the version expected by a data-parallel compiler. Additional optimizations such as the explicit use of local indices and redistributions are currently underway (see Section 5 for more details). We believe that such optimizations have a large potential for improvement in the case of the CG algorithm, which can leave the compiler version very close to what a smart programmer achieved in Section 2. In addition, when vector computations predominate in the algorithm, the BLAS [3] routines can provide support for further optimizations.

Figure 4.b depicts an improvement factor of around 3 for the BRS Fortran CG version mentioned above when compared to CYCLIC-CRAFT, the best of the regular versions in the previous section. This is a direct consequence of the large difference in locality between accessing the sparse data distributed by BRS and by CYCLIC, together with the poor alignment of the latter. The improvement factor also suffers a natural negative influence from the sparsity rate of the matrix and the number of processors in the parallel machine: since both factors decrease the workload on each processor, a machine such as the Cray T3D does not get enough execution time to show its strength.

Figure 5: Breakup of the execution time on the Cray T3D for the CYCLIC-CRAFT, BRS-Fortran and BRS-C parallel versions of CG. Left: BCSSTK30 matrix. Right: PSMIGR1 matrix.

Execution time is also detailed in Figure 5 for both versions of the CG algorithm under two of the three matrices. We have removed the times for BCSSTK29 because its behaviour is very similar to that of BCSSTK30. On the other hand, the dense and sparse computation times have been split, and the execution time is also shown for the manual BRS version when replacing Fortran with C (together with the proper version of the SHMEM routines [2]). The conclusions drawn from these times are the following:

- Sparse operations predominate in the CG algorithm, accounting for more than 90% of the execution time. A performance analysis tool attached to the compiler would flag this fact and, accordingly, would focus its optimizations on the sparse matrix-vector product operation.

- Excellent scalability is exhibited by all codes. Automatic code generation is therefore ready to be used on massively parallel machines.

- The C version of the BRS parallelization is the best of all, achieving an additional improvement factor of almost 2 when compared to its Fortran counterpart, which grows to more than 5 when compared to the CYCLIC version parallelized by CRAFT.

Consequently, our experimental results show the better suitability of the BRS strategy for the representation and distribution of sparse matrices on massively parallel distributed-memory machines. With respect to the source language, the scientific community widely accepts that Fortran behaves better for data-parallel computation, mainly due to its rather benign semantics, which spare the compiler from struggling with pointers during the parallelization process. However, when pointers are handled by a smart programmer, they prove to be a very efficient mechanism for solving complex data accesses, leading to a target code which is about five times faster, depending on the workload of every processor.

5. Related Work

Code parallelization has been strongly pursued by researchers for executing grand-challenge applications which demand high-performance computing. The general approaches are basically three: manual, automatic and data-parallel.

In the field of manual parallelization, programmers have aided themselves with standard message-passing interfaces (PARMACS, PVM, MPI) to attain code portability, and with run-time libraries (BLAS [3], CHAOS [9]) to facilitate the parallelization process.

Automatic parallelization is not intended to be a framework for the parallelization of sparse algorithms such as those addressed in our present work.

In the data-parallel paradigm, many language and compiler features have been proposed to extend the HPF standard towards a successful parallelization of non-regular applications [9, 5]. For CYCLIC distributions, Benkner [4] and Nedeljkovic et al. [7] have proposed different translation schemes for dense arrays. For sparse distributions, MRD was developed and implemented by Ujaldon [11], whereas BRS is still on the way.

6. Conclusions

In this paper, we have shown methods for integrating and dealing jointly with dense and sparse matrix computations within data-parallel compilers in an efficient way. An experimental analysis has been performed over the sparse Conjugate Gradient algorithm. Additional experiments demonstrating the efficiency of the BRS distribution have been developed by Trenas [10], who implemented a version of the Lanczos algorithm with complete reorthogonalization. Similar results for this algorithm show that our methods can be applied in general to many other algorithms in the field of linear algebra.

Experimental results show the convenience of using our approach: the locality and workload balance properties are maintained in the target code through an effective loop parallelization and global-to-local translation scheme, which overall reveals that BRS can be parallelized within a data-parallel compiler with an efficiency close to what can be expected from a manual programmer.

Furthermore, we have found that the current elements of the data-parallel languages lack support for advanced and complex applications at a reasonable efficiency rate: when the parallelization of the same application is driven by existing data-parallel language features, the compiler misses very valuable information about the way data are accessed and incurs a huge overhead, producing code that runs between two and seven times slower than the manual version.

The overall consequence that can be extracted from our analysis is that CYCLIC introduces a huge burden when parallelizing sparse or mixed algorithms in data- parallel compilers where BRS is missing. The inclusion of this distribution in current data-parallel compilers is critical when efficiency becomes a primary goal.

7. Acknowledgements

We want to thank the Institute for Software Technology and Parallel Systems (Austria) for providing us access to the source code of the VFCS, as well as the Edinburgh Parallel Computing Centre (Scotland) for the use of the Cray T3D and the CRAFT compiler.

References

[1] R. Barrett et al., Templates for the Solution of Linear Systems, SIAM, 1994.

[2] R. Barriuso, A. Knies, SHMEM User's Guide for C and Fortran, Cray Research Inc., August 1994.

[3] Basic Linear Algebra Subprograms: A Quick Reference Guide, Univ. of Tennessee, Oak Ridge National Laboratory, Numerical Algorithms Group Ltd.

[4] S. Benkner, Handling Block-Cyclic Distributed Arrays in Vienna Fortran 90, Proceedings of the 3rd Int'l Conf. on Parallel Architectures and Compilation Techniques (PACT'95), Limassol (Cyprus), June 1995.

[5] B. Chapman, P. Mehrotra, H. Zima, Extending HPF for Advanced Data-Parallel Applications, Technical Report TR 94-7, Institute for Software Technology and Parallel Systems, University of Vienna, Austria, May 1994.

[6] I.S. Duff, R.G. Grimes, J.G. Lewis, Users' Guide for the Harwell-Boeing Sparse Matrix Collection, Research and Technology Division, Boeing Computer Services, Seattle, WA, USA, 1992.

[7] N. Nedeljkovic, K. Kennedy, A. Sethi, Efficient Address Generation for Block-Cyclic Distributions, Proceedings of the 9th ACM Int'l Conf. on Supercomputing, Barcelona (Spain), pp. 180-184, July 1995.

[8] L.F. Romero, E.L. Zapata, Data Distributions for Sparse Matrix Vector Multiplication, J. Parallel Computing, vol. 21, pp. 583-605, 1995.

[9] J. Saltz, R. Das, B. Moon, S.D. Sharma, Y. Hwang, R. Ponnusamy, M. Uysal, A Manual for the CHAOS Runtime Library, Computer Science Department, Univ. of Maryland, December 1993.

[10] M.A. Trenas, Parallel Algorithms for Eigenvalues Computation with Sparse Matrices, Master's Thesis, University of Malaga, November 1995 (in Spanish).

[11] M. Ujaldon, Data-Parallel Compilation Techniques for Sparse Matrix Applications, PhD Thesis, TR UMA-DAC-96/02, Dept. of Computer Architecture, Univ. of Malaga, 1996 (in Spanish).
