
Ring pipelined algorithm for the algebraic path problem on the CELL Broadband Engine∗

Claude Tadonki
Laboratoire de l'Accelerateur Lineaire/IN2P3/CNRS
University of Orsay, Faculty of Sciences, Bat. 200, 91898 Orsay Cedex (France)
[email protected]

Abstract

The algebraic path problem (APP) unifies a number of related combinatorial or numerical problems into one that can be solved by a generic algorithmic schema. In this paper, we propose a linear SPMD model based on the Warshall-Floyd procedure coupled with a systematic toroidal shift. Our scheduling initially requires n processors for an n × n matrix to achieve the O(n³) task in n² + O(n) steps. However, with fewer processors, p < n, we exploit the modularity revealed by our linear array to achieve the task in n³/p + O(n) steps after n/p rounds, using a locally parallel and globally sequential (LPGS) partitioning. In any case, each processor just needs a local memory large enough to house one (block) column of the matrix. These two characteristics clearly justify an implementation on the CELL Broadband Engine, because of the efficient and asynchronous SPE-to-SPE communication (for the pipeline) and the floating point performance of each SPE. We report our experiments on a QS22 blade with different matrix sizes, exhibiting the efficiency and scalability of our implementation. We show that, with a highly optimized Warshall-Floyd kernel (in assembly), we could get close to 80 GFLOPS in single precision with 8 SPEs, which represents 80% of the peak performance for the APP on the CELL.

1 Introduction

The algebraic path problem (APP) unifies a number of related problems (transitive closure, shortest paths, Gauss-Jordan elimination, to name a few) into a generic formulation. The problem itself has been extensively studied from the mathematical, algorithmic, and programming points of view

∗Work done under the PetaQCD project, supported by the French research agency ANR.

in various technical contexts. Among existing algorithms, the dynamic programming procedure proposed by Floyd for the shortest paths [3] and by Warshall for the transitive closure [11] remains in the spotlight. Due to the wide range of (potential) applications of this problem, which is also used as an ingredient to solve other combinatorial problems, providing an efficient algorithm or program to solve the APP is crucial.

In this paper, we consider an implementation on the CELL [7]. The CELL processor has proved to be quite efficient compared to traditional processors when it comes to regular computation. For more general use, an important programming effort is required in order to achieve the expected performance. Apart from providing highly optimized SPE kernels, it is also important to derive a global scheduling in which all participating SPEs efficiently cooperate to achieve the global task (managed from the PPE). Most of the existing codes for the CELL are based on a master-slave model, where the SPEs get the data from the PPE, perform the computation, and send the result back to the main memory through direct memory accesses (DMAs). Such models suffer from a lack of scalability, especially on memory intensive applications. Our solution for the APP is based on a linear SPMD algorithm, with quite interesting properties like local control, global modularity, fault-tolerance, and work-optimal performance.

Some attempts to implement the transitive closure on the CELL can be found in the literature. Among them, we point out the works described in [5] (up to 50 GFLOPS) and in [8] (up to 78 GFLOPS in perspective). The two solutions are both based on a block partitioning of the basic Warshall-Floyd procedure together with ad-hoc memory optimization and efficient global synchronization. With such master-slave models, where all the SPEs compute the same step of the Warshall-Floyd procedure at a time, special care is required for data alignment, in addition to redundant data management (the pivot elements). Moreover,


since the memory is a critical resource for the SPE, we think it is important to come up with a solution that is less constrained regarding both the memory space requirement and the data transfers. We consider this work as one answer to those key points, in addition to its competitive absolute performance (a potential of 80 GFLOPS).

The rest of the paper is organized as follows. The next section provides an overview of the CELL Broadband Engine and its main characteristics. This is followed by a fundamental description of the APP, together with the Warshall-Floyd algorithm and the variant of our concern. In section 4, we present our scheduling and discuss various key points. This is followed in section 5 by our implementation strategy on the CELL and related technical issues. We show and comment on our experimental results in section 6. Section 7 concludes the paper.

2 Overview of the CELL Broadband Engine

The CELL [1, 7] is a multi-core chip that includes nine processing elements. One core, the POWER Processing Element (PPE), is a 64-bit Power Architecture processor. The remaining eight cores, the Synergistic Processing Elements (SPEs), are Single Instruction Multiple Data (SIMD) engines (3.2 GHz) with 128-bit vector registers and 256 KB of local memory, referred to as the local store (LS). Figure 1 provides a synthetic view of the CELL architecture.

Figure 1. DMA scheme from the SPU

Programming the CELL is mainly a mixture of single instruction multiple data parallelism, instruction level parallelism, and thread-level parallelism. The chip was primarily intended for digital image/video processing, but was immediately considered for general purpose scientific programming (see [13, 12] for an exhaustive report on the potential of the CELL BE for several key scientific computing kernels). Nevertheless, exploiting the capabilities of the CELL in a standard programming context is really challenging. The programmer has to deal with constraints and performance issues like data alignment, the local store size, the double precision penalty, and the different levels of parallelism. Efficient implementations on the CELL commonly require a conjunction of a good computation/DMA overlap and a heavy use of the SPU intrinsics.

3 The algebraic path problem

3.1 Formulation

The algebraic path problem (APP) may be stated as follows. We are given a weighted graph G = ⟨V, E, w⟩ with vertices V = {1, 2, · · · , n}, edges E ⊆ V × V, and a weight function w : E → S, where S is a closed semiring ⟨S, ⊕, ⊗, ∗, 0, 1⟩ (closed in the sense that ∗ is a unary "closure" operator, defined as the infinite sum x∗ = x ⊕ (x ⊗ x) ⊕ (x ⊗ x ⊗ x) ⊕ · · · ).
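As a concrete illustration (ours, not from the paper), the two classical APP instances correspond to the following semiring operators in C; the helper names are hypothetical:

/* Shortest paths: S = (R U {+inf}, min, +), with 0 = +inf and 1 = 0. */
static inline float sp_oplus(float a, float b)  { return a < b ? a : b; }  /* oplus  = min */
static inline float sp_otimes(float a, float b) { return a + b; }          /* otimes = +   */

/* Transitive closure: S = ({0,1}, OR, AND), with 0 = 0 and 1 = 1. */
static inline int tc_oplus(int a, int b)  { return a | b; }   /* oplus  = OR  */
static inline int tc_otimes(int a, int b) { return a & b; }   /* otimes = AND */

For both instances (assuming nonnegative weights in the first case), the closure x∗ equals the semiring unit, so the generic schema specializes to the familiar algorithms.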

3.2 Warshall-Floyd algorithm

If M is the incidence or weight matrix of a finite graph G of order n, then M⁽ᵏ⁾ denotes the matrix of d⁽ᵏ⁾(i, j) (the distance from i to j considering intermediate nodes from 1 to k), and M∗ the closure matrix (the one we want to compute). By extending the operator ⊕ to matrices, we obtain

M∗ = M⁽⁰⁾ ⊕ M⁽¹⁾ ⊕ M⁽²⁾ ⊕ · · · ⊕ M⁽ⁿ⁾,    (1)

where M⁽⁰⁾ = M.

The Warshall-Floyd dynamic programming procedure to solve the APP is inspired from equation (1). Thus, m⁽ᵏ⁾(i, j) can be computed from m⁽ᵏ⁻¹⁾(i, j) by considering node k as follows:

m⁽ᵏ⁾ᵢⱼ = m⁽ᵏ⁻¹⁾ᵢⱼ ⊕ (m⁽ᵏ⁻¹⁾ᵢₖ ⊗ m⁽ᵏ⁻¹⁾ₖⱼ).    (2)

An important property of this algorithm, which turns out to be a memory advantage, is that all the M⁽ᵏ⁾ can be housed inside the same matrix M. So, we perform n in-place (matrix) updates within the input matrix M and end up with the closure matrix M∗. At step k, row k (resp. column k) is called the pivot row (resp. pivot column); they are used to upgrade M⁽ᵏ⁻¹⁾ to M⁽ᵏ⁾.
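For the shortest-path instance (min as ⊕, + as ⊗), this in-place procedure is the classical Floyd-Warshall loop nest; a minimal scalar sketch (ours):

/* In-place Warshall-Floyd on an n x n distance matrix m (row-major),
 * shortest-path semiring: the n updates M(0) -> M(1) -> ... -> M(n)
 * all reuse the same storage. */
void warshall_floyd(float *m, int n)
{
    for (int k = 0; k < n; k++)            /* step k: pivot row/column k */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float via_k = m[i*n + k] + m[k*n + j];   /* otimes */
                if (via_k < m[i*n + j])                  /* oplus  */
                    m[i*n + j] = via_k;
            }
}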

Apart from the O(n³) floating point operations, it is important to notice that the move of the pivot row and the pivot column, although quite regular compared to Gaussian pivoting, needs special attention. There are mainly two impacts. The first one is on the memory access pattern, because the pivots are shifted from one step to the next. The second one is on the pipeline scheduling: the pivot elements have to be ready before starting the corresponding Warshall-Floyd step. In order to get rid of the difference between Warshall-Floyd steps, we now consider a toroidal shift proposed by Kung, Lo, and Lewis [6].


3.3 Kung-Lo-Lewis mapping

The idea is to maintain the pivots at the same location, preferably at the origin. To do so, Kung, Lo, and Lewis have suggested a toroidal shift of the matrix after each application of the standard Warshall-Floyd step. Technically, it is equivalent to saying that after each step the nodes are renumbered so that node i becomes node i − 1 (or (i − 1) mod n + 1 to be precise). Thereby, the matrices M⁽ᵏ⁾ are completely identical, with the pivot row (resp. pivot column) remaining the first row (resp. first column). There are two ways to handle such a reindexation. The first one is to explicitly shift the matrix after the standard Warshall-Floyd step. The second one is to perform the toroidal shift on the fly, that is, after each update of the matrix entries. Figure 2 depicts the two possibilities.

Figure 2. Toroidal shift

From a formal point of view, we now apply the following rule:

m⁽ᵏ⁾ᵢ₋₁,ⱼ₋₁ = m⁽ᵏ⁻¹⁾ᵢⱼ ⊕ (m⁽ᵏ⁻¹⁾ᵢₖ ⊗ m⁽ᵏ⁻¹⁾ₖⱼ),    (3)

where operations on subscripts are performed modulo n. When implementing the algorithm in this way, one needs to keep the pivot row and/or the pivot column (it depends on the scheduling), as they could be overwritten due to the shift. In a parallel context, where the data move between the computing units, the aforementioned shifts can be done by just adapting the transfer patterns accordingly (i.e. data transfers + shifts are performed at the same time, at the price of the transfers only). We take all this into account to derive our linear pipeline scheduling.
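A sketch (ours) of one Warshall-Floyd step with the toroidal shift performed on the fly. For simplicity it assumes a separate output buffer, so the shifted writes cannot clobber the pivots; in the shifted numbering, the pivot row and pivot column are always index 0:

/* One step of rule (3) for the shortest-path instance: the (i,j)
 * result is stored at ((i-1) mod n, (j-1) mod n), so that the next
 * pivot row/column land at row 0 / column 0.  'in' holds the current
 * matrix (pivots at index 0), 'out' receives the shifted update. */
void wf_step_shifted(const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float v   = in[i*n + j];
            float via = in[i*n + 0] + in[0*n + j];   /* pivot column/row */
            if (via < v) v = via;
            out[((i + n - 1) % n)*n + ((j + n - 1) % n)] = v;  /* shift */
        }
}

In the parallel version this separate buffer disappears: the shift is folded into the communication pattern, as explained above.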

4 Description of our algorithm

4.1 Scheduling

Given a graph of order n, our scheduling can be intuitively described as follows. The computation of M⁽ᵏ⁾, assigned to a single processor, is performed row by row, from the first row (the pivot) to the last one. Each row is computed from the first entry (the pivot) to the last one.

If (i, j, k) refers to the (i, j) entry of M⁽ᵏ⁾, then our scheduling can be expressed by the timing function t and the task allocation function a given by

t(i, j, k) = L(k, n) + (i × n + j) + 1    (4)
a(i, j, k) = k    (5)

where L(k, n) is the computation latency due to the graph dependencies together with our row-by-row assignment. At this point, we need n processors cooperating on a linear basis. Each processor operates as follows:

• compute the first row (the pivot) and keep it in the local memory

• compute and send each of the remaining rows

• send the pivot row

Computing the pivot row requires n steps, which count as the computation latency since no value is sent out during that time. In addition, because of the rotation, a given processor computes rows in the order 0, 1, 2, · · · , n − 1 and outputs the results in the order 1, 2, · · · , n − 1, 0. Thus, the total latency between two consecutive processors is (n + 1), and we thus obtain

L(k, n) = (k − 1)(n + 1), k ≥ 1.    (6)

So, processor k starts at step L(k, n) and ends its task n² steps later (i.e. at step n² + L(k, n)). It is important to keep these two values in mind, as they will be used locally by each processor to asynchronously distinguish between computing phases. Our solution originally needs n processors, which is a strong requirement in practice. Fortunately, the conceptual modularity of our scheduling helps to overcome the problem, as we now describe.
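A small sketch (ours, hypothetical names) of how these two bounds would drive the purely local phase control of each processor:

/* Phase bounds of (logical) processor k in the fine-grained schedule:
 * before 'start' the processor is idle (latency phase), between the
 * two bounds it computes, after 'end' it has finished. */
typedef struct { long start, end; } phase_t;

static phase_t phase_bounds(long k, long n)
{
    phase_t p;
    p.start = k * (n + 1);           /* first step with input available */
    p.end   = n * n + k * (n + 1);   /* last step of processor k        */
    return p;
}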

4.2 Modularity

Recall that processor k computes M⁽ᵏ⁾ and communicates with processors k − 1 and k + 1. If we have p processors, p < n, then we adapt our schedule by just requesting processor k to compute M⁽ᵏ⁺ᵅᵖ⁾, for all integers α such that k + αp ≤ n. This is naturally achieved by performing several rounds (n/p to be precise) over our linear array of p processors. This corresponds to the so-called locally parallel and globally sequential (LPGS) partitioning. The fact that our steps are completely identical makes it really natural to implement. Moreover, there is no additional memory need. Indeed, the capability of performing all updates within the same matrix is still valid: processor 0 continuously reads from the matrix in main memory and processor p − 1 continuously writes to it (there will be no read/write conflict since they always act on different memory locations).

The remaining part of M⁽ᵅᵖ⁾ and the already computed part of M⁽⁽ᵅ⁺¹⁾ᵖ⁾ will reside in the same matrix space in main memory. Moreover, the idempotency of the APP (i.e. M⁽ⁿ⁺ᵏ⁾ = M⁽ⁿ⁾ = M∗, ∀k ≥ 0) provides another simplification. Indeed, if p does not divide n, then a strict application of our partitioning will end with M⁽ᵐ⁾, where m = ⌈n/p⌉ × p is greater than n. We will still get the correct result, but with an additional p − (n mod p) steps. If we do not want this unnecessary computation, we can just dynamically set processor n mod p to be the last processor at the beginning of the final round.

Because of the communication latency, it is always faster to perform block transfers instead of atomic ones. Starting from a fine-grained scheduling, this is achieved by tiling. Since we plan to implement our algorithm at the row level (i.e. we compute/send rows instead of single entries), applying a standard tiling just means that we will work with a block-row and block-column partitioning.

4.3 Tiling

On parallel and/or accelerated computing systems, because the communication latency is likely to dominate, the cost of communicating a single data element is only marginally different from that of a "packet" of data. Therefore, it is necessary to combine the results of a number of elementary computations and send these results together. This is typically achieved by a technique called supernode partitioning [4] (also called iteration space tiling, loop blocking, or clustering), where the computation domain is partitioned into identical "blocks" and the blocks are treated as atomic computations. Such a clustering, which can be applied at various algorithmic levels, has also proved to be a good way to systematically move from fine-grained to coarse-grained parallelism [10, 2].

Considering the original APP formulation and the Warshall-Floyd algorithm as previously described, a tiled version can be derived by extending the atomic operations ⊕ and ⊗ to operate on blocks. Now, each step is either the resolution of the APP on a b × b subgraph, or a "multiply-accumulate" operation A ⊕ (B ⊗ C), where the operands are b × b matrices and the operations are the matrix extensions of the semiring operators. The only structural change imposed by a block version is that the pivot row (resp. pivot column) needs to be explicitly updated too (it does not remain constant as in the atomic formulation). An important question raised by the tiled implementation of our linear algorithm is the optimal tile size. Indeed, tiling yields a benefit on the communication latency, but at the price of the global computation latency (i.e. p × n, which now becomes p × (bn)).
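For the shortest-path instance, the block "multiply-accumulate" A ⊕ (B ⊗ C) is a min-plus matrix product on tiles; a scalar reference sketch (ours; the real SPE kernel is SIMD-vectorized with SPU intrinsics):

/* Tile multiply-accumulate A = A oplus (B otimes C) on b x b tiles,
 * min-plus semiring: A[i][j] = min(A[i][j], min_k(B[i][k] + C[k][j])). */
void tile_macc(float *A, const float *B, const float *C, int b)
{
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++) {
            float acc = A[i*b + j];
            for (int k = 0; k < b; k++) {
                float t = B[i*b + k] + C[k*b + j];
                if (t < acc) acc = t;            /* oplus = min */
            }
            A[i*b + j] = acc;
        }
}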

We now explain our implementation on the CELL.

5 Implementation on the CELL BE

5.1 Overview

Among the reasons why the CELL is well suited for implementing our algorithm, we may point out the following:

• the local store and its moderate size. Our algorithm just requires a processor to be able to store two (block) rows of the matrix (the pivot and the working rows). Thus, the small size of the local store does not appear as a constraint. Moreover, its fast and uniform access has a good impact on the per-node performance and the global regularity.

• each processor asynchronously communicates with its forward (send/put) and backward (receive/get) neighbor. This is well supported by the SPE-to-SPE communication routines. Indeed, an SPE can asynchronously write to the local store of another SPE, thus to the LS of its forward neighbor.

• our algorithm operates with fewer processors thanks to its modularity and scalability. The CELL has up to 8 SPEs, so this does not act as an implementation or performance constraint. Moreover, operating with a smaller number of processors reduces the global computation latency, and thus improves the efficiency of the program (relative to the number of processors).

• DMA is fast and can be performed asynchronously with the computations. Moreover, each SPE can access the main memory; thus the first processor gets the input data from the main memory while the last one (reset dynamically if necessary) writes output results back to the main memory.

5.2 SPE computation

The main floating point computation kernels needed at the SPE level are the APP on a b × b subgraph (tile) and the "multiply-accumulate" operation A ⊕ (B ⊗ C) on tiles. We wrote the corresponding SIMD code using SPU intrinsics. The code could be further optimized; this is beyond the scope of this work, although it will be done at the earliest opportunity. There are mainly three phases. The first one is the latency phase, where the SPE has not yet received any data. The second phase starts with the reception of the pivot row, which is then updated and stored into a buffer in the local store. During the third phase (the last one), the SPE is again inactive, just waiting for the other SPEs to finish their remaining computations. Obviously, the overhead of the first and the last phases increases with either a longer array or a bigger tile size; this is where the pipeline compromise lies. Each SPE asynchronously detects the phases using a timer variable (a local counter incremented after each step). The program uses the formula for the latency (i.e. 2k for processor k) and that for the last step (i.e. n + 2k + 1 for processor k); recall that we compute/send a (block) row in one step. Thus, each SPE performs an internal control during all its computations. That is why there is no need for any kind of synchronization with or via the PPU.
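A schematic version (ours) of this internal control, built on the two formulas above; compute_and_forward_row is a hypothetical placeholder for the real tile kernel and DMA calls:

/* Placeholder for the real work of one step (tile updates + DMA). */
static void compute_and_forward_row(int local_step) { (void)local_step; }

/* Internal control of one SPE (logical processor k); n is the number
 * of (block) rows, one (block) row being computed/sent per step.  The
 * local counter plus the two formulas are enough to tell the phases
 * apart, with no PPU involvement. */
void spe_control(int k, int n)
{
    int first = 2 * k;           /* latency: step at which data arrives */
    int last  = n + 2 * k + 1;   /* last step involving this SPE        */

    for (int timer = 0; timer <= last; timer++) {
        if (timer < first)
            continue;                              /* latency phase */
        compute_and_forward_row(timer - first);    /* active phase  */
    }
    /* beyond 'last': drain phase, the SPE idles until the array ends */
}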

5.3 Data communication

At a given step, the SPE updates a row, puts it (DMA) into the corresponding buffer of its forward neighbor SPE, and then waits for the completion of the transfer. Since its backward SPE neighbor has done the same, the SPE then receives its data for the next step without any additional delay. Thus, all our communications (DMA) are local and fully parallel. For SPE 0, we apply an explicit double buffering, as it receives its data from the main memory (DMA get). Although this sounds like a perfect DMA/computation overlap, we still think there is room for improvement, mainly on the SPE-to-SPE communication. Indeed, the wait for the completion of the DMA put could be avoided by another level of double buffering. One way to achieve this is to start with tiles of size 2b and then switch to tiles of size b once each SPE has received its pivot row. This will be studied later.
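A minimal sketch of the forward put with the standard MFC intrinsics from spu_mfcio.h, assuming the PPE has distributed the effective address of each neighbor's local store buffer at startup (variable names are ours):

#include <spu_mfcio.h>
#include <stdint.h>

/* Forward a freshly updated row into the input buffer of the next SPE.
 * neighbor_buf_ea is the effective address of that buffer inside the
 * forward neighbor's local store (obtained from the PPE at startup). */
void put_row_forward(volatile void *row_ls, uint64_t neighbor_buf_ea,
                     uint32_t row_bytes, uint32_t tag)
{
    mfc_put(row_ls, neighbor_buf_ea, row_bytes, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();    /* wait for completion of the put */
}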

5.4 Synchronization issue

One way to synchronize the SPEs is to use the mailbox mechanism with the PPU. Such a central synchronization is sufficient, but could be costly, as it would be heavily used for our step-by-step synchronizations. In addition, it would turn into a scaling issue. Instead, we chose to implement a local synchronization, thus strictly following the philosophy of our algorithm. To do so, each SPE has two flag variables, flag_backward and flag_forward. At the end of a given step,

• SPE k writes counter+1 into the flag_backward of SPE k+1 to indicate that it has finished and has put the data in that SPE's local store for further processing.

• SPE k writes counter+1 into the flag_forward of SPE k−1 to indicate that it has finished and is ready to receive new data in its working buffer.

• SPE k polls its variables flag_backward and flag_forward to check whether they have been updated (i.e. whether their values are still equal to counter).

In order to avoid a deadlock, when a given SPE has performed its last computation, it performs an additional forward signaling for the next SPE so that the latter can perform its last step independently.
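A sketch (ours) of this flag protocol. The fenced put (mfc_putf) orders the flag write after the data put issued on the same tag; the effective addresses of the neighbors' flags are assumed to be distributed by the PPE:

#include <spu_mfcio.h>
#include <stdint.h>

volatile uint32_t flag_backward, flag_forward;   /* written by neighbors */

void end_of_step(uint32_t counter, uint64_t ea_flag_bwd_of_next,
                 uint64_t ea_flag_fwd_of_prev, uint32_t tag)
{
    static volatile uint32_t token;   /* DMA source must live in the LS */
    token = counter + 1;

    /* signal the forward and backward neighbors, after the data put */
    mfc_putf((volatile void *)&token, ea_flag_bwd_of_next, sizeof token,
             tag, 0, 0);
    mfc_putf((volatile void *)&token, ea_flag_fwd_of_prev, sizeof token,
             tag, 0, 0);

    /* poll until both neighbors have signaled this step */
    while (flag_backward == counter || flag_forward == counter)
        ;
}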

We now turn to the experimental results. The goal is to validate our algorithm and the above discussions about latency, tiling, and scalability.

6 Performance results

All our experiments were performed on a QS22 CELL blade with single precision data. First, we need to see how tiling affects the performance of our program. In table 1, we use our algorithm to solve the APP on a single SPE with matrices of size 128 × 128, 256 × 256, and 512 × 512 respectively (times are in seconds).

Tile    128×128    256×256    512×512
  1     0.0216     0.1321     0.8921
  4     0.0110     0.0823     0.6371
  8     0.0104     0.0782     0.6082
 12     0.0092     0.0754     0.5839
 16     0.0105     0.0781     0.6017
 20     0.0095     0.0696     0.5757
 24     0.0098     0.0704     0.5872
 28     0.0088     0.0782     0.5901
 32     0.0115     0.0815     0.6119

Table 1. Relative impact of tiling

Recall that tile size b means we operate on the whole matrix by b × n block rows. What we see is that, apart from the fine-grained computation, the difference between tile sizes is marginal. This is certainly due to the fact that the matrix operations dominate, as we will see in the next results. However, as mentioned previously, our kernel for the b × b APP is not yet fully optimized; otherwise we would certainly have observed a more significant difference. Nevertheless, we observe, for instance, a factor 2 improvement between tiles of size 20 and the fine-grained version. Now, we reconsider the same matrices and perform a scalability test from 1 SPE to 8 SPEs. In table 2, we display the global execution time (measured from the PPU) with various tile sizes in {1, 4, 8, 12, 16} (we stop at 16 because 12 seems to be the optimum); σ refers to the speedup compared to 1 SPE.

Apart from the fine-grained version, we observe an almost perfect scaling of our program. In order to illustrate the efficiency of our method (scheduling + DMA + synchronization), we show in table 3 the timings obtained with a version of our program where the block APP kernel is not executed. We clearly see that the overhead due to our pipeline mechanism is definitely negligible and that the global performance relies solely on the block APP kernel. By replacing our APP kernel with a fast implementation similar to that of the matrix product in [9], our implementation achieves 80 GFLOPS (note that the peak performance on the APP is 102 GFLOPS, as the "multiply-add" cannot be used).
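The 102 GFLOPS figure is consistent with a simple back-of-the-envelope computation (our reconstruction): without the fused multiply-add, each SPE issues at most one 4-wide single-precision SIMD arithmetic operation per cycle at 3.2 GHz, hence

8 SPEs × 3.2 GHz × 4 lanes = 102.4 GFLOPS,

half of the 204.8 GFLOPS multiply-add peak; the reported 80 GFLOPS is thus roughly 80% of this APP-specific peak.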


Tile    1 SPE t(s)    2 SPEs t(s)   σ       8 SPEs t(s)   σ
  1        1.3           1.25       1.06       0.31       4.29
  4        0.8           0.41       2.00       0.11       7.78
  8        0.7           0.39       1.99       0.11       7.21
 12        0.7           0.36       2.08       0.10       7.93
 16        0.7           0.40       1.97       0.14       5.65

(a) Performance with a 256×256 matrix

Tile    1 SPE t(s)    2 SPEs t(s)   σ       8 SPEs t(s)   σ
  1        0.892         0.434      2.05       0.213      4.19
  4        0.637         0.318      2.00       0.080      7.96
  8        0.608         0.304      2.00       0.078      7.79
 12        0.584         0.293      1.99       0.074      7.88
 16        0.602         0.302      1.99       0.083      7.23

(b) Performance with a 512×512 matrix

Tile    1 SPE t(s)    2 SPEs t(s)   σ       8 SPEs t(s)   σ
  1        6.67          3.28       2.03       1.60       4.16
  4        5.01          2.50       2.00       0.62       7.99
  8        4.79          2.39       2.00       0.60       7.95
 12        4.70          2.32       2.02       0.58       7.98
 16        4.72          2.36       2.00       0.60       7.79

(c) Performance with a 1024×1024 matrix

Table 2. Timings on a CELL QS22

Tile    1 SPE t(s)    2 SPEs t(s)   σ       8 SPEs t(s)   σ
  1        1.87          1.20       1.56       0.53       3.49
  4        0.39          0.47       0.83       0.11       3.70
  8        0.19          0.12       1.64       0.06       3.78
 12        0.12          0.08       1.65       0.03       4.02
 16        0.09          0.06       1.63       0.03       3.78

Table 3. DMA timings (1024×1024 matrix)

7 Conclusion

Definitely, SPE-to-SPE communication can be used to derive efficient linear SPMD programs on the CELL. This is the case with our solution for the algebraic path problem. Our algorithm has many interesting properties that match the characteristics and capabilities of the CELL. Moreover, we think that our linear array scheduling can be implemented in the same way on a cluster of CELLs (same array, with border communications done with MPI). This has to be investigated, together with an SPU optimization of the APP kernel. We are confident that, together with our scheduling and implementation strategy, this will allow us to offer a solution very close to the peak performance even on huge instances. We also believe that our methodology could be successfully used at the algorithmic level to derive efficient CELL programs for many other applications.

References

[1] Cell SDK 3.0. www.ibm.com/developerworks/power/cell.

[2] D. Cachera, S. Rajopadhye, T. Risset, and C. Tadonki, Parallelization of the Algebraic Path Problem on Linear SIMD/SPMD Arrays, IRISA report no. 1409, July 2001.

[3] R. W. Floyd, Algorithm 97 (shortest path), Comm. ACM, 5(6):345, 1962.

[4] F. Irigoin and R. Triolet, Supernode partitioning, in 15th ACM Symposium on Principles of Programming Languages, pp. 319-328, ACM, January 1988.

[5] K. Matsumoto and S. G. Sedukhin, A Solution of the All-Pairs Shortest Paths Problem on the Cell Broadband Engine Processor, IEICE Trans. Inf. & Syst., Vol. E92.D, No. 6, pp. 1225-1231, 2009.

[6] S.-Y. Kung, S.-C. Lo, and P. S. Lewis, Optimal systolic design for the transitive closure and the shortest path problems, IEEE Transactions on Computers, vol. 36, no. 5, pp. 603-614, 1987.

[7] H. Peter Hofstee, Power Efficient Processor Design and the Cell Processor, http://www.hpcaconf.org/hpca11/slides/Cell Public Hofstee.pdf.

[8] S. Vinjamuri and V. K. Prasanna, Transitive closure on the Cell Broadband Engine: a study on self-scheduling in a multicore processor, IEEE International Parallel & Distributed Processing Symposium (IPDPS), pp. 1-11, Rome, Italy, May 2009.

[9] http://www.tu-dresden.de/zih/cell/matmul

[10] G. Venkataraman, S. Sahni, and S. Mukhopadhyaya, A blocked all-pairs shortest-paths algorithm, J. Experimental Algorithmics, vol. 8, no. 2.2, 2003.

[11] S. Warshall, A Theorem on Boolean Matrices, JACM, 9(1):11-12, 1962.

[12] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick, The potential of the Cell processor for scientific computing, CF '06: Proc. 3rd Conference on Computing Frontiers, pp. 9-20, ACM, New York, NY, USA, 2006.

[13] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick, Scientific Computing Kernels on the Cell Processor, International Journal of Parallel Programming, 2007.
