
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. X, NO. Y, XXXXX 200X 1

Efficient Breadth-First Search on the Cell/B.E. Processor

Daniele Paolo Scarpazza, Oreste Villa, Fabrizio Petrini, Member, IEEE

Abstract— Multi-core processors are a paradigm shift in computer architecture that promises a dramatic increase in performance. But they also bring an unprecedented level of complexity in algorithmic design and software development.

In this paper we describe the challenges involved in designing a Breadth-First Search (BFS) algorithm for the Cell/B.E. processor. The proposed methodology combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processor-specific optimizations. Using a fine-grained global coordination strategy derived from the Bulk-Synchronous Parallel (BSP) model, we have determined an accurate performance model that has guided the implementation and the optimization of our algorithm.

Our experiments show an almost-linear scaling with the number of synergistic processing elements used on the Cell/B.E. platform, which compares favorably against other systems. On graphs which offer sufficient parallelism, the Cell/B.E. is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, and custom-designed architectures, such as the MTA-2 and BlueGene/L.

Index Terms— Multi-core processors, Parallel Computing, Cell Broadband Engine, Parallelization Techniques, Graph Exploration Algorithms, Breadth-First Search, BFS.

I. INTRODUCTION

OVER the last decade, high-performance computing has ridden the wave of commodity technology, building clusters that were able to leverage the tremendous growth in processor performance fueled by the commercial world. As this pace slows down, processor designers are facing complex problems in increasing gate density, reducing power consumption and designing efficient memory hierarchies. The scientific and industrial communities are looking for alternative solutions that can keep up with the insatiable demand for computing cycles and yet have a sustainable market.

Traditionally, performance gains in commodity processors have come through higher clock frequencies, an exponential increase in the number of devices integrated on a chip, and other architectural improvements. Power consumption is increasingly becoming the driving constraint in processor design: processors are more and more power-limited rather than area-limited. Current general-purpose processors are optimized for single-threaded workloads, and spend a significant amount of resources (and power) to extract parallelism. Common techniques are out-of-order execution, register renaming, dynamic schedulers, branch prediction, reorder buffers, etc. Experiencing diminishing returns, processor designers are turning their attention to thread-level, VLIW and SIMD parallelism. Explicitly parallel techniques, where a user or a compiler expresses

[Fig. 1. The trend toward multi-threaded, many-core processors: over a 2003–2013 timeline, labeled "Medieval Times", "Renaissance Period" and "Industrial Age", the number of hardware threads grows from 1 to 10 to 100 as designs move from SMT, through a small number of traditional cores, to arrays of throughput cores.]

the available parallelism as a set of cooperating threads, offer a more efficient means of converting power into performance than techniques which speculatively discover the implicit –and often limited– instruction-level parallelism hidden in a single thread.

Another important trend in computer architecture is the implementation of highly integrated chips. Several design avenues have been explored both in academia, such as the Raw multiprocessor [47] and TRIPS [43], and in the industrial world, with notable examples being the AMD Opteron, IBM Power5, Sun Niagara, Intel Montecito and others [32], [27], [28], [34].

The Cell Broadband Engine (Cell/B.E.) processor, jointly developed by IBM, Sony and Toshiba, is a heterogeneous chip with nine cores (one control processor coupled with eight independent synergistic processing elements) capable of massive floating-point processing, optimized for compute-intensive workloads and broadband, rich media applications. The processing power of the Cell/B.E., which at a clock frequency of 3.2 GHz peaks at 204.8 single-precision Gflops, has not passed unnoticed.

Intel has also recently announced its Tera-Scale research initiative, connecting eighty simple cores on a single test chip. These chips will be able to deliver, most likely by the end of the decade, teraflop-level performance. Figure 1 provides some intuition on what the near future might look like (courtesy of Doug Carmean, "The Evolution of Computing Architectures", Intel slideshow).


A. The Programming Challenge

To fully exploit the potential of multi-core processors, we need a significant paradigm shift in the software development process. Unfortunately, for some classes of applications, this implies a radical redesign of algorithms.

Together with the initial excitement of early evaluations [49], several concerns have emerged. More specifically, there is an interest in understanding the complexity of developing new applications and parallelizing compilers [18], whether there is a clear migration path for existing legacy software [25], [40], [8], and what fraction of the peak performance can actually be achieved by real applications.

Several recent works have provided insight into these problems [44], [22], [24], [41], [7], [18], [6], [30], [4], [29], [53]. In fact, to develop efficient multi-core algorithms one must understand in depth multiple algorithmic and architectural aspects. The list includes (1) identifying the available dimensions of parallelism, (2) mapping parallel threads of activity to a potentially large pool of processing and functional units, (3) using simple processing cores with limited functionalities, (4) coping with the limited on-chip memory per core and (5) the limited off-chip memory bandwidth (when compared to the rich resources available on-chip), and many others.

The programming landscape of these advanced multi-core processors is rather sophisticated. Many similarities appear with cluster computing: in both fields we need to extract explicit parallelism, deal with communication, and take care of how threads of computation are mapped onto the physical machine [16], [15], [26], [3], [14]. But there are also differences, mostly in the data orchestration between processing cores and main memory, which demand a fresh look at the problem and the design of new algorithms.

B. Graph Exploration Algorithms

Many areas of science (genomics, astrophysics, artificial intelligence, data mining, national security and information analytics, to name a few) demand techniques to explore large-scale data sets which are represented by graphs. Among graph search algorithms, Breadth-First Search (BFS) is probably the most common, and a building block for a wide range of graph applications. For example, in the analysis of semantic graphs the relationship between two vertices is expressed by the properties of the shortest path between them, given by a BFS search [9], [17], [35], [36], [37]. BFS is also the basic building block for best-first search, uniform-cost search, greedy search and A*, which are commonly used in motion planning for robotics [52], [46].

A good amount of literature deals with the design of BFS solutions, based either on commodity processors [51], [2], [45] or on dedicated hardware [13]. Some recent publications describe successful parallelization strategies for list ranking [1] and phylogenetic trees on the Cell/B.E. [5]. But, to the best of our knowledge, no studies have investigated how effectively the Cell/B.E. can be employed to perform a BFS on large graphs, and how it compares with other commodity processors and supercomputers. The Cell/B.E., with its impressive amount of parallelism (multiple cores, SIMD), promises interesting speedups in the BFS exploration of the many graph topologies which can benefit from it.

Searching large graphs poses difficult challenges, because the vast search space is combined with the lack of spatial and temporal locality in the data access pattern. In fact, few parallel algorithms outperform their best sequential implementations on clusters, due to long memory latencies and high synchronization costs. These issues call for even more attention on multi-cores like the Cell/B.E., because of the memory hierarchy, which must be managed explicitly.

C. Contribution

This paper provides four primary contributions.
1) A detailed description of a BFS graph exploration algorithm for multi-core processors. We put emphasis on its peculiar characteristics, such as the data flow and data layout, the explicit management of a hierarchy of working sets, and the data orchestration between them.
2) An experimental evaluation of the algorithm on the Cell/B.E. processor that discusses how its different components are integrated, and an accurate comparison with other architectures. The goal is to provide insight into the performance impact of the possible architectural and software design choices.
3) Perhaps the most interesting contribution is the parallelization methodology that we have adopted to design our algorithm and guide the software development process. Our work is inspired by the Bulk-Synchronous Parallel (BSP) methodology [48], [21].
4) An arsenal of low-level optimizations to exploit the full potential of the Cell/B.E. processor.

Our methodology is based on the following cornerstones:
• a high-level algorithmic design which focuses on essential machine-independent aspects, which guarantee the portability of performance to other multi-core processors,
• a machine-dependent refinement of the initial algorithm which embeds the specific optimizations of the chosen target multi-core architecture,
• a BSP-style global coordination, used by both the machine-dependent and machine-independent parts, that allows implementing and validating the individual steps of the algorithm in a modular way,
• an accurate analytical performance model, facilitated by the BSP programming style, that helped us determine upper and lower bounds on the execution time of each step of the algorithm,
• and the enforcement of a deterministic behavior within the steps of the algorithm that, for a given input set and number of processing elements, can be re-played as a sequential program.

The proposed algorithm blends the theoretical analysis of graph algorithms by Dehne et al. [12] on constrained parallel models (Coarse-Grained Multicomputer (CGM) and BSP) with the parallel implementation of the Boost Graph Library [23], to take advantage of the specific hardware capabilities of multi-core processors.


As already pointed out by academic [31] and industrial projects [11], non-determinism should be judiciously and carefully introduced where needed, or eliminated altogether when possible, to attack the programming wall of multi-core processors. We think that the optimal level of performance and ease of programming can be achieved at the same time, by attacking the real problem first: i.e., the unmanageable complexity caused by many concurrent activities.

The remainder of this paper is organized as follows. Section II discusses the high-level, machine-independent part of our parallel BFS algorithm for multi-core processors. In Section III we describe the machine-dependent parallelization on the Synergistic Processing Elements (SPEs) of the Cell/B.E. Section IV gives a detailed performance analysis of the algorithm, pinpoints its bottlenecks and proposes several types of optimizations. Section V describes the experiments we have performed to measure the performance of our algorithm on a Cell/B.E. board, and Section VI compares these results against the ones provided by other, state-of-the-art processors and supercomputers. Finally, Section VII concludes the paper. The appendix is devoted to describing the low-level optimizations of the performance-critical sections of our algorithm.

II. THE BREADTH-FIRST SEARCH

We present the methodology we used to parallelize the Breadth-First Search (BFS) algorithm. We first introduce the notation employed throughout the rest of this paper and a baseline, sequential version of BFS. Then, we describe a simplified parallel BFS algorithm as a collection of cooperating shared-memory threads. Finally, we refine it into an algorithm that explicitly manages a hierarchy of working sets.

A. Notation

A graph G = (V, E) is composed of a set of vertices V and a set of edges E. We define the size of a graph as the number of vertices |V|. Given a vertex v ∈ V, we indicate with Ev its adjacency list, i.e. the set of neighboring vertices of v (or neighbors, for short), such that Ev = {w ∈ V : (v, w) ∈ E}. We indicate with dv its degree, i.e. the number of elements |Ev|. We will denote as d the average degree of the vertices in a graph, d = ∑v∈V |Ev| / |V|.

Given a graph G(V, E) and a root vertex r ∈ V, the BFS algorithm explores the edges of G to discover all the vertices reachable from r, and it produces a breadth-first tree, rooted at r, containing all the vertices reachable from r. Vertices are visited in levels: when a vertex is visited at level l, it is also said to be at distance l from the root.

B. Sequential BFS algorithm

Algorithm 1 presents a sequential BFS algorithm. At any time, Q is the set of vertices that must be visited in the current level. Q is initialized with the root r (see line 4). At level 1, Q will contain the neighbors of r. At level 2, Q will contain these neighbors' neighbors (except those visited in levels 0 and 1), and so on.

During the exploration of each level, the algorithm scans the content of Q, and for each vertex v ∈ Q it adds the

Algorithm 1 Sequential BFS exploration of a graph.
Input:     G(V, E), graph;
           r, root vertex;
Variables: level, exploration level;
           Q, vertices to be explored in the current level;
           Qnext, vertices to be explored in the next level;
           marked, array of booleans: markedi ∀i ∈ [1...|V|];

 1  ∀i ∈ [1...|V|] : markedi = false
 2  markedr = true
 3  level ← 0
 4  Q ← {r}
 5  repeat
 6      Qnext ← {}
 7      for all v ∈ Q do
 8          for all n ∈ Ev do
 9              if markedn = false then
10                  markedn ← true
11                  Qnext ← Qnext ∪ {n}
12              end if
13          end for
14      end for
15      Q ← Qnext
16      level ← level + 1
17  until Q = {}

corresponding neighbors to Qnext. Qnext is the set of vertices to visit in the next level. At the end of the exploration of a level, the content of Qnext is assigned to Q, and Qnext is emptied. The algorithm terminates when there are no more neighbors to visit, i.e. Q is empty (line 17).

The algorithm does not visit a vertex more than once. To do so, it maintains an array of boolean variables markedv ∀v ∈ V, where each variable markedv tells whether vertex v has already been visited. Neighboring vertices are added to Qnext only when they have not been marked yet.
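For concreteness, the following C sketch implements Algorithm 1 for a graph stored in a CSR-like adjacency layout; the array names and the CSR representation are our own illustrative choices, not the data structures used in the paper.

#include <stdbool.h>
#include <stdlib.h>

/* Sequential BFS (Algorithm 1) over a graph in CSR-like form:
 * the neighbors Ev of vertex v are adj[xadj[v] .. xadj[v+1]-1]. */
void bfs_sequential(int nv, const int *xadj, const int *adj, int root)
{
    bool *marked = calloc(nv, sizeof(bool));
    int  *Q      = malloc(nv * sizeof(int));   /* current level            */
    int  *Qnext  = malloc(nv * sizeof(int));   /* next level               */
    int   qlen = 0, level = 0;                 /* level = frontier distance */

    marked[root] = true;                       /* lines 1-4 of Algorithm 1 */
    Q[qlen++] = root;

    while (qlen > 0) {                         /* one iteration per level  */
        int qnextlen = 0;
        for (int i = 0; i < qlen; i++) {       /* scan Q                   */
            int v = Q[i];
            for (int e = xadj[v]; e < xadj[v + 1]; e++) {
                int n = adj[e];
                if (!marked[n]) {              /* visit each vertex once   */
                    marked[n] = true;
                    Qnext[qnextlen++] = n;
                }
            }
        }
        int *tmp = Q; Q = Qnext; Qnext = tmp;  /* Q <- Qnext               */
        qlen = qnextlen;
        level++;
    }
    free(marked); free(Q); free(Qnext);
}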

C. A Parallel BFS

A straightforward way to parallelize the algorithm just presented is by exploring the vertices in Q concurrently with all the available processing elements (PEs). The for all statement of Algorithm 1 (line 7) can be executed in parallel by different threads, provided that access to the array marked is protected with some synchronization mechanism, such as multiple locks [33]. This is the conventional solution in a cache-coherent shared-memory machine with uniform memory access time and a limited number of hardware threads. Unfortunately, it does not scale well with larger numbers of PEs [38].

We adopt a different approach, illustrated in Algorithm 2. We partition V into disjoint sets Vi, one per PE. We say that PE i owns the vertices in its partition Vi. Each PE i is only allowed to explore and mark the vertices it owns, and it must forward any other vertices to their respective owners. As indicated in line 9, all steps are globally synchronized across the PEs. Horizontal lines denote synchronization points, which impose a sequential order between the phases.

The steps of Algorithm 2 are executed in parallel by all the available PEs. PE i accesses its private Qi and Qnexti, and its partition of the marked array.1 Additionally, each

1 Also level is a private variable and should be denoted with a subscript indicating the PE. We avoid this without ambiguity because the value of level is kept the same across all PEs.


Algorithm 2 Bulk-synchronous parallel version of a breadth-first graph exploration.
Input:     G(V, E), graph;
           r, root vertex;
           P, available processing elements (PE);
           V1, V2, ..., VP : (⋃i:1...P Vi) = V,
           ∀(i, j) ∈ [1...P]² : Vi ∩ Vj = {} if i ≠ j;
Variables: Qi, vertices to be explored in the current level;
           Qnexti, vertices to be explored in the next level;
           level, exploration level;
           markedv, ∀v ∈ Vi;
           Qouti,p, ∀p ∈ [1...P], outgoing queues;
           Qini,p, ∀p ∈ [1...P], incoming queues;

Processing element i:
 1  level ← 0
 2  Qi ← {}
 3  ∀v ∈ Vi : markedv ← false
 4  if r ∈ Vi then
 5      Qi ← {r}
 6      markedr ← true
 7  end if
 8
 9  repeat in lockstep across the processing elements:
10      Qnexti ← {}
11      ∀p ∈ [1...P] : Qouti,p ← {}
12
13      // Gather and Dispatch
14      for all (p, v) ∈ [1...P] × Qi do
15          Qouti,p ← Qouti,p ∪ {(v, Ev ∩ Vp)}
16      end for
17
18      // All-to-All
19      ∀p ∈ [1...P] : Qinp,i ← Qouti,p
20
21      // Bitmap
22      for all n ∈ Ev where (v, Ev) ∈ (⋃p:1...P Qini,p) do
23          if markedn = false then
24              markedn ← true
25              Qnexti ← Qnexti ∪ {n}
26          end if
27      end for
28
29      Qi ← Qnexti
30      level ← level + 1
31
32  until ∀p ∈ [1...P] : Qp = {}

Note: horizontal lines indicate barrier-synchronization points.

PE i has a set of private outgoing and incoming queues, called Qouti,1, Qouti,2, ..., Qouti,P and Qini,1, Qini,2, ..., Qini,P, respectively. Through these queues, PEs can forward the vertices to their respective owners.

At initialization time, the root vertex r is assigned to its owner's Qi. During the exploration, each PE i examines the vertices v in Qi and dispatches the vertices in Ev which belong to PE p to Qouti,p. Then, when all the PEs have completed this phase, an all-to-all personalized exchange takes place, and the contents of each Qouti,p are transferred to Qinp,i. This exchange delivers the vertices to their respective owners.

Next, each PE examines the queues of incoming vertices, marks them, and adds those that had not been visited to its private Qnexti, as done in the previous algorithm. By construction, Qnexti, which will become Qi during the next level, consistently contains only vertices owned by PE i.
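As an illustration of the owner-based dispatch of Algorithm 2, the C sketch below shows the Gather-and-Dispatch phase for one PE; the modulo-based partition function, the fixed-capacity queues and the fact that plain neighbor identifiers are forwarded (rather than the (v, Ev ∩ Vp) pairs of the algorithm) are simplifying assumptions of ours.

#define P    8                       /* number of processing elements                 */
#define QCAP 4096                    /* per-destination queue capacity (illustrative) */

/* Outgoing queue of neighbor identifiers destined to one PE. */
typedef struct { int buf[QCAP]; int len; } out_queue_t;

/* Ownership rule assumed for illustration: vertex v belongs to V_(v mod P). */
static inline int owner(int v) { return v % P; }

/* Gather-and-Dispatch executed by one PE: for every vertex v in its current
 * queue Qi, route each neighbor n to Qout[owner(n)], so that the subsequent
 * all-to-all exchange delivers n to the PE that owns (and may mark) it.     */
void gather_and_dispatch(const int *Qi, int Qi_len,
                         const int *xadj, const int *adj,
                         out_queue_t Qout[P])
{
    for (int p = 0; p < P; p++)
        Qout[p].len = 0;                          /* reset outgoing queues   */

    for (int k = 0; k < Qi_len; k++) {
        int v = Qi[k];
        for (int e = xadj[v]; e < xadj[v + 1]; e++) {
            int n = adj[e];
            out_queue_t *q = &Qout[owner(n)];
            if (q->len < QCAP)                    /* bound check for safety  */
                q->buf[q->len++] = n;
        }
    }
}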

D. A Parallel BFS for Multi-core Processors

The parallel algorithm just presented (Algorithm 2) does not consider size limitations of Qi, Qnext, Qin, and Qout. If private data structures are entirely allocated in the local storage of each PE, Algorithm 2 can overflow the memory available in each core. Local memories in a multi-core processor are small, and the issue will not disappear in the future: whereas chip integration promises tens of cores on a die, on-chip memory will not increase as fast [50]. For this reason, application developers should design their algorithms by explicitly taking into account application working sets and the data orchestration among them.

Algorithm 3 is a refined version of the parallel algorithm that explicitly distinguishes between variables allocated in main memory and variables allocated in the local memory of each single PE. Local-memory variables can be subject to size constraints, but the algorithm can access their contents at any granularity (element or block). On the other hand, variables allocated in main memory do not have any size constraint, but they can be accessed only via explicit operations, preferably at a coarser granularity.

In Algorithm 3, the graph G and the queues Qi and Qnexti are still in main memory, while marked, Qin and Qout are now allocated in the local memory. The algorithm does not access Qi directly; rather, it fetches blocks of Qi into a smaller, size-constrained queue named bQi (the b prefix intuitively identifies local buffers), via an explicit fetch operation (see line 10). Symmetrically, it does not add elements directly to Qnexti, but to a small buffer bQnexti, which is then committed to Qnexti via an explicit operation (see line 39). Adjacency lists Ev are also explicitly loaded into the local data structure bG during the gather step (see line 15).

The algorithm can operate on graphs of arbitrary size and vertex degree, provided that all the local variables fit in local memory, that bG is at least as large as the longest adjacency list Ev, and that each Qini,p is at least as large as Qouti,p. Overflows of bG can be managed at graph creation by splitting a single adjacency list into multiple lists with the same vertex, and by incorporating minor algorithmic changes to load-balance the exploration of these heavy vertices across multiple PEs. Each partition of the marked variables must fit in the local memory of each PE. This raises an additional constraint on the maximum size of graphs explorable with the above algorithm on a given architecture, which we will discuss later. Except for the newly-introduced Fetch, Gather and Commit steps, the new algorithm is only slightly more sophisticated than the previous one, but it incorporates what we believe are the essential features to achieve optimal efficiency on existing and future generations of multi-core processors.

III. IMPLEMENTATION OF THE ALGORITHM

In this section we describe how we parallelized Algorithm 3 on the Synergistic Processing Elements (SPEs) of the Cell/B.E. We can consider this part as the machine-dependent one of our parallelization process, which in the future could find its best place in a run-time system or a compiler. At this stage we analyze lower-level details, such as remote DMAs, double buffering, data alignment, etc.


Algorithm 3 Bulk-synchronous parallel exploration of a graph, with limited-storage constraints.
Input:     G(V, E), graph (allocated in main memory);
           r, root vertex;
           P, available processing elements (PE);
           V1, V2, ..., VP : (⋃i:1...P Vi) = V,
           ∀(i, j) ∈ [1...P]² : Vi ∩ Vj = {} if i ≠ j;
Variables allocated in main memory:
           Qi, vertices to be explored in the current level;
           Qnexti, vertices to be explored in the next level;
Variables allocated in the memory of the ith PE:
           level, exploration level;
           markedv ∀v ∈ Vi;
           Qouti,p ∀p ∈ [1...P], outgoing queues;
           Qini,p ∀p ∈ [1...P], incoming queues;
           bQi, a size-constrained subset of Q;
           bQnexti, a size-constrained subset of Qnext;
           bGi, a size-constrained subset of E;

Processing element i:
 1  level ← 0;
 2  Qi ← {};
 3  ∀v ∈ Vi : markedv ← false
 4  if r ∈ Vi then
 5      Qi ← {r}
 6      markedr ← true
 7  end if
 8
 9  repeat in lockstep across the processing elements:
10      // 1. Fetch
11      load bQi ⊂ Qi
12      Qi ← Qi − bQi
13      while bQi ≠ {} do
14
15          // 2. Gather
16          determine a subset {v1, v2, ..., vn} ⊂ bQi such that:
17              |Ev1| + |Ev2| + ... + |Evn| < max allowed |bGi|
18          bQi ← bQi − {v1, v2, ..., vn}
19          load bGi ← {Ev1, Ev2, ..., Evn}
20
21          // 3. Dispatch
22          ∀p ∈ [1...P] : Qouti,p ← {}
23          for all (p, v) ∈ [1...P] × {v1, v2, ..., vn} do
24              Qouti,p ← Qouti,p ∪ {(v, Ev ∩ Vp)}
25          end for
26
27          // 4. All-to-All
28          ∀p ∈ [1...P] : Qinp,i ← Qouti,p
29
30          // 5. Bitmap
31          bQnexti ← {}
32          for all n ∈ Ev where (v, Ev) ∈ (⋃p:1...P Qini,p) do
33              if markedn = false then
34                  markedn ← true
35                  bQnexti ← bQnexti ∪ {n}
36              end if
37          end for
38
39          // 6. Commit
40          Qnexti ← Qnexti ∪ bQnexti
41      end while
42      Qi ← Qnexti
43      level ← level + 1
44
45  until ∀p ∈ [1...P] : Qp = {}

Figure 3 presents a schematic overview of the steps of the implementation and of the data structures they operate on. From a software engineering point of view, each of these steps can be designed, tested and optimized in isolation. A detailed description of each step follows.

Fetch. Step 1 fetches a portion of Q into bQ. The fetch is implemented by a DMA transfer, in a double-buffering fashion. This means that there are two data structures associated with bQ; Step 1 waits for the previous transfer (if any) to complete, swaps the two buffers to make the newly-arrived data available to the subsequent steps, and starts a new transfer for the next block of Q, using the other buffer as the destination. Because of the much higher latency associated with the remaining steps of the algorithm, Step 1 never has to actually wait for bQ to arrive, except for the very first fetch at the beginning of each level of exploration. In our implementation bQ is a relatively small buffer, only 512 bytes.
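A minimal sketch of this double-buffered fetch on an SPE is shown below, assuming the standard spu_mfcio.h interface (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the buffer size, function names and the caller-supplied effective addresses are illustrative, not the paper's actual code.

#include <spu_mfcio.h>
#include <stdint.h>

#define BQ_BYTES 512                      /* size of one bQ buffer (illustrative) */

/* Two local-store buffers for bQ, 128-byte aligned for DMA efficiency.      */
static int bq[2][BQ_BYTES / sizeof(int)] __attribute__((aligned(128)));
static const unsigned bq_tag[2] = { 0, 1 };   /* one DMA tag group per buffer */

/* Start fetching BQ_BYTES of Q (main memory, effective address ea) into
 * buffer `which` without blocking (mfc_get: main memory -> local store).    */
void bq_fetch_start(int which, uint64_t ea)
{
    mfc_get(bq[which], ea, BQ_BYTES, bq_tag[which], 0, 0);
}

/* Block until buffer `which` has arrived, then return a pointer to it.      */
int *bq_fetch_wait(int which)
{
    mfc_write_tag_mask(1 << bq_tag[which]);
    mfc_read_tag_status_all();
    return bq[which];
}

/* Typical use: prime buffer 0, then alternate buffers so that the transfer
 * of block k+1 overlaps the processing of block k:
 *
 *   bq_fetch_start(0, q_ea);
 *   for (int k = 0, cur = 0; k < nblocks; k++, cur ^= 1) {
 *       if (k + 1 < nblocks)
 *           bq_fetch_start(cur ^ 1, q_ea + (uint64_t)(k + 1) * BQ_BYTES);
 *       int *block = bq_fetch_wait(cur);
 *       process_block(block);        // hypothetical consumer of the block
 *   }
 */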

Gather. Step 2 explores the vertices in bQ and loads their respective adjacency lists into bG, until bG is full, using a DMA list. The Cell/B.E. architecture provides DMA lists as a low-overhead means to orchestrate a sequence of transfers (up to 2,048), which are carried out without further intervention of the processor, obtaining almost optimal overlap with the computation. It is necessary to know the size of a data structure before loading it with a DMA list, and there is no obvious way to know the length of an adjacency list before loading it. For this reason, rather than representing vertices with their vertex identifiers, we represent them with vertex codes. A vertex code is a 32-bit word (or a larger binary representation, if the cardinality of the graph requires it) in which a certain number of bits are reserved for the vertex identifier, and the others encode the length of its adjacency list, possibly expressed in a quantized form if not enough bits are available. In detail, a vertex code has two fields, the vertex identifier (which is v) and the vertex length, which is an encoded representation of |Ev|. With the help of the length field, Step 2 can prepare a DMA list to transfer as many adjacency lists as possible into bG, minimizing the amount of space wasted and optimizing the accesses to main memory. Step 2 operates in a double-buffering style. Hence, its actual code consists of waiting for the in-flight bG transfer to complete, swapping buffers, preparing another DMA list for the next bG transfer and initiating it. The same considerations about wait times stated for Step 1 apply here.
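For illustration, a vertex code could be packed as in the sketch below; the 24/8 bit split and the quadword quantization of the length field are our assumptions, since the paper only specifies that some bits hold the vertex identifier and the remaining bits encode |Ev|, possibly in quantized form.

#include <stdint.h>

/* Hypothetical 32-bit vertex code: 24 bits of vertex identifier and 8 bits
 * encoding the adjacency-list length in 16-byte quadwords (a quantized form
 * of |Ev|), so that a DMA-list entry can be prepared without first touching
 * main memory to discover the list length.                                  */
#define VID_BITS 24
#define VID_MASK ((1u << VID_BITS) - 1)

static inline uint32_t make_vertex_code(uint32_t id, uint32_t len_quadwords)
{
    return (len_quadwords << VID_BITS) | (id & VID_MASK);
}

static inline uint32_t code_vertex_id(uint32_t code)  { return code & VID_MASK; }
static inline uint32_t code_list_bytes(uint32_t code) { return (code >> VID_BITS) * 16; }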

Dispatch. Step 3 splits the adjacency lists previously gathered into the respective Qout queues. To expedite this task, we encode the adjacency lists as shown in Figure 2. In detail, at graph generation time, adjacency lists are encoded in a per-SPE split form. Additionally, each adjacency list comprises a header which specifies the offset and length of each per-SPE portion of that list. Each sub-list is padded to a multiple of 4 words so that it can be dispatched one quadword at a time (a quadword is the size of the registers and of loads from the local store). To increase the efficiency of Step 3, multiple iterations can be unrolled: in this case, the step may load and process more quadwords at a time, thus requiring adjacency lists to be padded to larger quantities.
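One plausible C rendering of the adjacency-list header of Figure 2(a) is sketched below; the 32-bit field widths, the struct name and the exact padding are our reading of the figure, not code from the paper.

#include <stdint.h>

/* Header preceding the per-SPE sub-lists of one adjacency list, following
 * our reading of Fig. 2(a): the vertex code Cv, padding up to a quadword
 * boundary, then one (length, offset) pair per SPE.  With 8 SPEs the header
 * would occupy 5 quadwords (80 bytes).                                      */
typedef struct {
    uint32_t cv;                        /* vertex code of vertex v                   */
    uint32_t pad[3];                    /* padding words (P in the figure)           */
    struct {
        uint32_t len;                   /* Li: length of the sub-list owned by SPE i */
        uint32_t off;                   /* Oi: offset of that sub-list               */
    } part[8];
} __attribute__((aligned(16))) adj_header_t;

/* The per-SPE sub-lists that follow contain neighbor identifiers (N) and are
 * padded to a quadword multiple so they can be dispatched one quadword at a time. */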

All-to-all. Step 4 is the all-to-all personalized exchange in which each SPE delivers the Qout queues to their destinations. It is not necessary to transfer Qouti,i; therefore, Qini,i is simply an alias of Qouti,i.

Step 4 requires an appropriate synchronization mechanism to detect the presence of valid data in a Qin. An efficient way to implement this mechanism is through communication guards. A communication guard is a flag that the receiver


[Fig. 2. Data layouts. (a) Format of the adjacency list Ev of vertex v, as stored in the graph: a header containing the vertex code Cv and, for each SPE i, the length Li and offset Oi of the portion of the list that SPE owns, followed by the per-SPE adjacency sub-lists of neighboring vertices N, with padding words P. (b) Format of a Qouti,p (or Qini,p) queue: a header followed by an adjacency sub-list. Fields marked "4 words" have fixed size; fields marked "4* words" are padded to a quadword multiple that depends on the unroll factor.]

resets before the transfer is started, and that the sender sets when the transfer is complete. At any time during the transfer, the receiver can determine the status of the transfer by reading this flag. For the mechanism to work properly, a guard must be reserved for each outgoing queue, and appropriate hardware support must be employed to guarantee that the guards are not transferred before the payload has completely reached its destination.

This is done by preparing a DMA list where the transfers of queues and guards are interleaved, and employing the mfc_putlb intrinsic (put DMA list with barrier), which guarantees that each transfer in the list is completed before the next one is issued.2

To allow maximum efficiency, transfers are organized according to a predefined, circular schedule using a predefined DMA list. In fact, both the sequence of transfers and their coordinates are known in advance at program initialization, and the DMA list can be prepared at that time. Therefore, the actual implementation of Step 4 is a simple invocation of the mfc_putlb intrinsic, which takes only a few clock cycles at the source.
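On the receiving side, a communication guard can be as simple as the sketch below; the flag layout and names are illustrative, and the ordering guarantee (guard written only after the queue payload) comes from the sender's barriered DMA list described above.

#include <stdint.h>

#define P 8                                   /* number of SPEs                   */

/* One guard flag per incoming queue, in the local store of the receiver.
 * Each flag is written remotely by the corresponding sender SPE.            */
static volatile uint32_t qin_guard[P] __attribute__((aligned(128)));

/* Before the all-to-all, the receiver clears the guards of the queues it
 * expects to receive in this iteration.                                     */
void guards_reset(void)
{
    for (int p = 0; p < P; p++)
        qin_guard[p] = 0;
}

/* Spin until the guard of Qin_{i,p} is set; because the sender interleaves
 * queue and guard in a DMA list issued with mfc_putlb, a set guard implies
 * that the whole queue payload has already landed in the local store.       */
void guard_wait(int p)
{
    while (qin_guard[p] == 0)
        ;                                     /* busy-wait on the flag        */
}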

Bitmap. In Step 5, each SPE i scans the vertices contained in the incoming Qin queues, and adds them to its private bQnexti if they have not been marked before. Despite the simple description, Step 5 is the most computationally expensive part of the algorithm, and it was the hardest to optimize. For the sake of space efficiency, we have implemented the marked data structure as a bitmap, stored in the local memory of each SPE. This bitmap is a boolean data structure where a single bit represents the status (marked or not marked) of one of the

2 Simpler synchronization mechanisms, e.g. tagging the data payload with a special marker, are insufficient. They lead to potential race conditions and consequent data corruption. In fact, the on-chip network of the Cell/B.E. may re-arrange segments of a DMA transfer in arbitrary order, causing the marker to reach the receiving end before the whole transfer has completed.

vertices owned by the current SPE. Given the limited size of the local store in the Cell/B.E. architecture (256 kbytes), it is hard to allocate more than 160 kbytes for the bitmap. This limits the maximum graph size to 10 million vertices (i.e. the cumulative number of bits stored in the bitmaps of 8 SPEs) on one Cell/B.E. processor. The algorithm can easily be generalized to larger graphs by gang-scheduling –scheduling in a coordinated way the graph exploration on subsets of a larger bitmap, which is loaded under program control in different phases– but the discussion of this enhancement is beyond the scope of this paper.
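A minimal sketch of such a bitmap is shown below (the 32-bit word size and the local-index convention are our own choices). As a capacity check, 160 kbytes give 160 × 1024 × 8 ≈ 1.31 million bits per SPE, i.e. roughly 10.5 million vertices across 8 SPEs, consistent with the 10-million-vertex limit quoted above.

#include <stdint.h>
#include <string.h>

#define BITMAP_BYTES (160 * 1024)          /* local-store budget for marked   */

/* One bit per vertex owned by this SPE, indexed by the vertex's local index. */
static uint32_t marked_bits[BITMAP_BYTES / sizeof(uint32_t)];

static inline void bitmap_clear_all(void)
{
    memset(marked_bits, 0, sizeof(marked_bits));
}

/* Test-and-set: returns nonzero if local vertex v was already marked,
 * and marks it in any case.                                                  */
static inline int bitmap_test_and_set(uint32_t v)
{
    uint32_t word = v >> 5;                /* v / 32                          */
    uint32_t bit  = 1u << (v & 31);        /* v % 32                          */
    uint32_t old  = marked_bits[word] & bit;
    marked_bits[word] |= bit;
    return old != 0;
}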

Commit. Step 6 commits the content of bQnexti accumulated during the last execution of Step 5 and writes it to Qnexti. The bQnexti buffers are managed with a double-buffering technique, like bQi and bGi, and the same considerations made about them apply. The only additional complexity is due to the fact that DMA transfers must be aligned on a 128-byte boundary, and that the programmer must explicitly guarantee this alignment. Unlike bQi and bGi, bQnexti is not naturally aligned. In fact, blocks from Q can be loaded with arbitrary alignment, so it is not enough to choose the size of bQ as a multiple of 128 bytes to guarantee the alignment of subsequent load operations. Similarly, bG loads adjacency lists, which can easily be forced to begin at aligned locations at graph generation time. On the other hand, bQnext may contain an unpredictable number of vertices, which may be misaligned. We used some simple heuristics to pad blocks of irregular size, and to overwrite the padding with new data during the subsequent commit steps, avoiding unnecessary local memory loads.

At the end of Step 6, a barrier synchronizes all the processing elements. The algorithm terminates when all SPEs have no vertices left in their Qi queues. The actual implementation of this check is obtained via an allreduce [42] primitive, which executes a distributed sum of the lengths of all the Qi queues


[Fig. 3. The steps of the implementation and the data structures they operate on. In main memory: the graph G, the queue Q of vertices to visit in this level, and the queue Qnext of vertices to visit at the next level. In the SPE local store: bQ (buffer of vertices to visit), bG (adjacency lists), the marked bitmap, bQnext (buffer of vertices to visit next), and the Qouti,1...Qouti,N and Qini,1...Qini,N queues. Step 1 (Fetch) loads bQ from Q, Step 2 (Gather) loads bG, Step 3 (Dispatch) fills the Qout queues, Step 4 (All-to-all) delivers them to the Qin queues, Step 5 (Bitmap) updates marked and fills bQnext, and Step 6 (Commit) writes bQnext back to Qnext.]

∀i. If this sum is zero, the algorithm has completed.

IV. PERFORMANCE ANALYSIS AND OPTIMIZATION

We determine lower and upper bounds on the performance of each step, on the basis of benchmarks performed on the Cell/B.E. This analysis exposes a primary bottleneck, the bitmap implementation, whose optimization is discussed later in this section.

Since any BFS implementation must visit all the edges of the given graph that are reachable from the root vertex, a natural way to express performance is through the number of edges visited per unit of time. We call this quantity throughput, indicate it with the symbol Th, and measure it in edges per second (E/s). With ME/s and GE/s we indicate a million and a billion edges per second, respectively.

As a final step of our algorithmic design, we release some of the strict synchronization bounds imposed by the BSP design, to fully overlap computation with on-chip and off-chip communication and achieve almost optimal performance.

[Fig. 4. Per-SPE throughput bounds (in GE/s) of each step (1. Fetch, 2. Gather, 3. Dispatch, 4. All-to-All, 5. Bitmap, 6. Commit), for average degrees d=16 and d=512.]

We consider this a key point of our methodology: we allow the concurrent execution of these activities –arguably the most difficult part to debug and analyze– only after having a reasonably accurate understanding of all the components of our algorithm.

We now derive bounds on the maximum throughput achievable by each step of the algorithm: the algorithm can only be as fast as the slowest stage of the performance pipeline. Also, the bounds we present in this section are derived solely from computer-architecture constraints: in fact, the nature of the input graph and its topological properties may set additional, lower performance bounds, as we show in Section V.

The result of this analysis is the performance diagram shown in Figure 4. All the values have been either measured or analytically derived for a Cell/B.E. processor with a clock running at 3.2 GHz. Whenever the throughput depends on the available bandwidth of a data-transfer operation, we have assumed the worst traffic conditions, i.e., all 8 SPEs are used and they are all contending for the communication resources at the same time. When the bandwidth varies significantly with the transferred block size, we have reported meaningful minimum and maximum estimates.

Fetch. Step 1 is a transfer of a single, contiguous block


[Fig. 5. Aggregate memory bandwidth (Gbytes/s) as a function of the number of Synergistic Processing Elements, for transfer block sizes of 64, 128, 256, and 512 bytes and larger.]

from main memory to the local store. The available aggregate bandwidth is a function of the block size, as shown in Figure 5. The size of bQ in our implementation is always larger than 1024 bytes, which guarantees high bandwidth, and in steady-state conditions bQ is always full. Therefore, the available aggregate bandwidth is 22.06 Gbyte/s, i.e. 2.76 Gbyte/s per SPE.

Step 1 is different from the others because it transfers vertex identifiers rather than adjacency lists. It can transfer 689.38 M vertices/s (i.e. the above bandwidth divided by 4, which is the number of bytes per vertex identifier). For each vertex identifier transferred in Step 1, the remaining steps will transfer an entire adjacency list, which will be d times larger, on average. In the worst case, when d=1, this yields a throughput Th = 689.38 ME/s.

Gather. Step 2 is another get from main memory, implemented with a DMA list. DMA lists can specify up to 2,048 independent transfers that are handled by the Cell/B.E. architecture without any additional operation or overhead. When the average degree is large enough to transfer blocks of more than 512 bytes (i.e. d > 128), the available bandwidth is 2.76 Gbyte/s (see Figure 5), with a sustained throughput of 689.38 ME/s. In the worst case, with blocks of 64 bytes, the bandwidth is 1.36 Gbyte/s, which yields Th = 341.76 ME/s.

Dispatch. Step 3 is a computational step. The optimized design of the data structures reduces this step to a minor control portion plus a local data transfer. Studied in isolation, this phase is able to process 42.34 ME/s with a small degree (d=16) and 984.23 ME/s with a larger degree (d=512).

All-to-all. Step 4 transfers each Qoutp,i inside SPE p into queue Qini,p inside SPE i. Unnecessary transfers are avoided, i.e. when p = i. We prepare a DMA list where a communication guard appears after each queue, and we transfer it with the mfc_putlb intrinsic (put DMA list with barrier). This makes sure that each guard transfer is initiated only after the corresponding queue transfer has completed. To minimize the computational cost associated with this step, we set up this

[Fig. 6. Latency (µs) of the all-to-all Qout/Qin exchange as a function of the number of Synergistic Processing Elements.]

DMA list at initialization time, which is possible because the addresses of the Qin and Qout queues in memory do not vary during execution. Figure 6 shows how the latency required to complete this step varies with the number of SPEs. Since the cumulative queue size is independent of the number of SPEs, using fewer SPEs results in more efficient, larger transfers. With 8 SPEs, the latency is 8.1 µs. Assuming a cumulative queue size of 36 kbyte, and that queues are, on average, filled to between 75% and 99.2% of their available capacity depending on the chosen degree (100% is not reachable because of the headers), the corresponding throughput is between 853.3 ME/s and 1.13 GE/s per SPE.
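As a sanity check of these figures (assuming 36 kbyte = 36,864 bytes and 4 bytes per transferred vertex): at 75% utilization each SPE sends about 27,648 bytes, i.e. roughly 6,912 edges, in 8.1 µs, which is about 853 ME/s per SPE; at 99.2% utilization it sends about 9,140 edges in the same time, i.e. roughly 1.13 GE/s per SPE.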

Bitmap. Step 5 is a computational step and the primary bottleneck of our implementation. Since the bitmap is the performance bottleneck of the entire algorithm, its optimization is crucial. Starting from a baseline implementation we explored 8 gradual refinements. Each refinement aims at improving the throughput by either removing overhead or exploiting potential sources of instruction-level or data-level parallelism provided by the Cell/B.E. architecture. These optimizations have lowered the edge-processing time from 96 to 26 clock cycles. Our best implementation ensures a throughput between 35.17 ME/s (with d=16) and 113.73 ME/s (with d=512) per SPE. We have obtained this level of performance using a combination of function inlining, selective SIMDization, loop unrolling, branch elimination through speculation and the use of restricted pointers, as discussed in more detail below. The impact of the optimizations is outlined in Table I.
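As an illustration of "branch elimination through speculation" applied to the bitmap update (a scalar sketch in the spirit of that optimization, not the paper's SIMDized code), the neighbor is stored into bQnext unconditionally and the queue length is advanced only if its bit was previously clear, so the inner loop contains no data-dependent branch.

#include <stdint.h>

/* Process one incoming neighbor n (local index): speculatively append it to
 * bQnext, then advance the queue length only if n was not yet marked.
 * `marked_bits` is the per-SPE bitmap, `bqnext`/`len` the local commit buffer. */
static inline int bitmap_update_branchless(uint32_t n, uint32_t *marked_bits,
                                           uint32_t *bqnext, int len)
{
    uint32_t word = n >> 5;
    uint32_t bit  = 1u << (n & 31);
    uint32_t was_unmarked = (marked_bits[word] & bit) == 0;  /* 0 or 1        */

    marked_bits[word] |= bit;        /* marking an already-marked vertex is harmless */
    bqnext[len]        = n;          /* speculative store, kept only if new          */
    return len + (int)was_unmarked;  /* conditional advance without a branch         */
}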

Commit. Step 6 is a main-memory communication. It results in a single transfer of a large block (> 512 bytes) to main memory. This always allows for the maximum bandwidth, which is 2.76 Gbyte/s, leading to a throughput Th = 689.38 ME/s per SPE.

To summarize the results of this analysis, the maximum performance achievable by the algorithm is set by Step 5, between 35.17 and 113.73 ME/s per SPE. A more realistic upper bound on the throughput can be obtained by


[Fig. 7. Pipelined schedule of the main loop (Steps 2–6 plus a barrier) across the computation, EIB and memory resources. The scheduling period is 47.07 µs; the Bitmap step (32.82 µs) and the Dispatch step (10.65 µs) dominate the computation, while the Gather (0.97 µs and 6.74 µs), All-to-all (0.02 µs and 8.34 µs) and Commit (0.06 µs and 0.13 µs) transfers overlap with them.]

considering not only Step 5, but Steps 3 and 5 jointly, because these two computational phases cannot be overlapped. Their joint throughput Th3,5 is given by

    Th3,5 = (Th3 · Th5) / (Th3 + Th5),

which yields new performance bounds between 19.21 ME/s (for d=16) and 101.95 ME/s (for d=512) per SPE.
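For instance, plugging in the per-SPE figures quoted above, d=16 gives Th3,5 = 42.34 · 35.17 / (42.34 + 35.17) ≈ 19.21 ME/s, and d=512 gives Th3,5 = 984.23 · 113.73 / (984.23 + 113.73) ≈ 101.95 ME/s.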

A. Overlapping Computation and Data Transfers

In the final stage of our implementation we release the constraints of the BSP scheduling to fully overlap computation with on-chip and off-chip communication. This leads to a pipelined schedule for the main loop of our algorithm (Steps 2–6 plus a barrier), which is represented in Figure 7. Note that an iteration of the main loop spans two scheduling periods, but at the end of each period an iteration is completed, as is typical of pipelined designs. Steps depicted in color belong to the current iteration, while the remaining ones do not (the white ones belong to the previous iteration, and the grey ones to the next). The time values reported in the figure are worst-case

[Fig. 8. Fraction of the execution time (%) spent in computation (left: 2. Gather, 3. Dispatch, 5. Bitmap, Barrier) and in data transfers (right: 2. Gather, 4. All-to-All, 6. Commit), as the average degree of the vertices varies from 16 to 512. Steps are color-coded as in Figure 7.]

and refer to an exploration with d=128; they are insensitive to the number of vertices in the graph.

Note that this version of the algorithm is still completely deterministic: a given input graph, root and number of processing elements always result in the same visit. This proved to be a major advantage during functional and performance debugging.

When the average degree of the input graph varies, the duration of each step varies accordingly, and the relative impact of each step also changes. For example, larger degrees make the bitmap step more and more predominant. Figure 8 shows the fraction of time spent in the various steps as d changes (steps are color-coded as in Figure 7; results are insensitive to the number of vertices). Data-transfer latencies are always smaller than the computational latencies, irrespective of the degree, thus ensuring the desired overlap of computation and data transfer.

V. EXPERIMENTAL RESULTS

We have implemented the algorithm in the C language for the Cell/B.E., and compiled it with GNU GCC 4.0.2. We have run the experiments on an IBM blade with two Cell/B.E. DD3 processors running at 3.2 GHz, 1 Gbyte of RAM and Linux kernel version 2.6.16.

In the experiments, we have measured the throughput and scalability of our BFS algorithm. The input sets we have considered are two classes of graphs:
• synthetic graphs whose vertex degrees are random variables extracted from a uniform distribution over the range [0...2d], for a set of chosen values of d;
• graphs from real-world problems in a variety of domains (structural engineering, computational fluid dynamics,


Version                       1        2        3        4         5         6          7          8          9
Function inlining             no       yes      yes      yes       yes       yes        yes        yes        yes
SIMD-ization type             no       no       memory   complete  complete  selective  selective  selective  selective
Branch elimination            no       no       no       yes       yes       yes        yes        yes        yes
restrict pointers             no       no       no       no        yes       yes        yes        yes        yes
Unroll factor                 –        –        –        1         1         1          2          4          8

Total cycles in the step      392530   377524   333022   262093    257053    121223     110135     106759     107596
Cycles per edge               95.83    92.17    81.30    63.99     62.76     29.60      26.89      26.06      26.27
Throughput @3.2GHz (ME/s)     33.39    34.72    39.36    50.01     50.99     108.12     119.01     122.77     121.82

Average CPI                   2.18     2.53     2.44     1.40      1.38      1.03       0.99       0.99       0.96
Dual issue %                  3.00     3.10     4.30     13.90     15.70     23.30      30.20      31.00      31.30
Dependency stall %            48.50    54.00    51.90    46.60     47.10     33.50      35.00      32.90      30.60
Registers used                19       16       19       37        37        36         51         82         spills

Speedup                       1.00     1.04     1.18     1.50      1.53      3.24       3.56       3.68       3.65

TABLE I. Optimizations of Step 5 (Bitmap): 'memory', 'complete' and 'selective' denote the kind of SIMD-ization applied. Data refer to d=128.

Number                                      Average degree (d)
of SPEs    16       32       50       64       100      128      200      256      400      512
1          46.28    63.85    76.80    82.97    93.39    97.55    103.47   109.00   112.07   113.14
2          71.52    103.37   124.98   136.43   158.29   167.43   184.41   209.33   209.76   221.02
3          92.39    138.38   168.43   183.72   212.70   228.89   249.39   272.71   296.65   315.88
4          104.81   160.50   200.79   223.06   257.96   280.37   312.26   389.47   374.24   424.19
5          119.74   186.43   231.59   264.56   306.98   330.08   390.28   400.38   507.14   479.36
6          135.01   205.03   260.96   293.89   351.76   381.38   426.59   498.57   505.86   551.22
7          140.10   219.01   283.39   321.20   392.66   424.60   540.13   546.73   617.36   647.66
8          140.99   228.73   305.69   340.97   420.19   461.78   537.89   688.64   695.77   786.85

TABLE II. Throughput (ME/s) for average degrees d between 16 and 512 and a number P of SPEs between 1 and 8.

electromagnetics, thermodynamics, materials, acoustics, computer graphics, robotics, optimization, circuit simulation, networks, economic modeling, theoretical and quantum chemistry, chemical process simulation, etc.).

The graphs in the first class offer a useful benchmarking ground, where the impact of variable degree can be studied in isolation from other effects, such as load imbalance. In fact, the uniform distribution of degrees guarantees good load balancing among the processing elements. The graphs in the second class (100 randomly-chosen graphs from the University of Florida Sparse Matrix Collection [10]), instead, exhibit diverse topological and statistical properties, and their exploration is subject to the joint impact of two factors: the amount of available parallelism (which can be small, due to a small average degree) and the load imbalance (which can be significant, due to power-law degree distributions). We have not considered worst-case graph instances, e.g. chains of degree-1 vertices, because such graphs do not offer any parallelism or locality exploitable by our implementation.

Tests on the first class of graphs are reported in Table II. For each considered d between 16 and 512, we generated and explored the largest synthetic graphs (in terms of number of vertices) that would fit in the available main memory on the blade (1 GB). Results are in good agreement with the

upper bounds estimated in Section IV, especially when d is large, so that the impact of the Bitmap step is predominant, as shown in Figure 8. For example, when d=512 and P=8, the aggregate throughput is 786.85 ME/s, i.e. 98.36 ME/s per SPE, which is very close to the estimate of 101.95 ME/s obtained in Section IV. The 3.5% difference accounts for the computational part of the other steps, the overhead of flow control and the load imbalance. When d=16, the aggregate throughput is 140.99 ME/s, i.e. 17.62 ME/s per SPE, which is close to the estimate of 19.21 ME/s (the difference is 8.6%). Within the assumptions made above, the performance of our implementation is independent of the number of vertices in the graph.

Figure 9 shows how the throughput scales when P varies from 1 to 8. Our implementation shows good scaling behavior, which is virtually linear at large values of d, with a limited saturation effect at small ones. When d is as low as 10, the throughput is only 101.6 ME/s. This reduced performance is due to insufficient parallelism, which causes padding to be introduced in the data structures that are subject to SIMD operations. Also, the smaller the adjacency lists are, the less efficient their transfer via DMA becomes. In fact, with d=10 adjacency lists occupy blocks of around 64 bytes, which lowers the aggregate memory bandwidth from 22 to 10


[Fig. 9. How the throughput (ME/s) scales with the number of SPEs, for average degrees d from 16 to 512.]

Gbytes/s (see Figure 5). Nevertheless, complete overlapping of data transfers and computation is still ensured. Moreover, adjacency sub-lists also need to be padded to a quadword or a multiple of it, depending on the unroll factor (as shown in Figure 2). Therefore, when more SPEs are used, the length of the adjacency sub-lists decreases and they need a larger amount of padding, leading to lower performance, as Figure 9 shows.

On the second class of graphs, these performance considerations hold too. Additionally, the non-uniformly distributed degree of the vertices can cause load imbalance and consequent performance degradation.

We assume as a reference the performance reported for the first class of graphs. Figure 10 reports the fraction of reference performance (%) that our algorithm achieves when exploring 100 graphs in the second class, with a single SPE (points in red) and with 8 SPEs (points in green). Although graphs of similar degree can exhibit significantly different performance, a clear trend correlates large degrees with high efficiency: the coefficient of correlation between log(d) and Th is 0.727.

Finally, in Figure 11 we compare our BFS implementation with others, running on different architectures. Values for BlueGene/L and the MTA-2 come respectively from Yoo et al. [51] and from a personal communication with Feo [19]. We have measured values for the Intel Pentium 4 and AMD Opteron with an in-house, single-processor BFS implementation, and values for the Intel Woodcrest with a similar in-house, scalable, pthread-based implementation. For the sake of consistency, all values represent the peak performance provided by each implementation, i.e., in the case of the Cell/B.E., graphs are from the first class presented above.

Conventional processors, which can exploit little cache locality on these graphs, are on average 9 times slower than the Cell/B.E. The best performance in this class is obtained by the 2 cores of the Intel Woodcrest, which are between 5 and 12 times slower.

[Fig. 10. Fraction of peak performance (%) as a function of the average degree of the vertices, with 1 SPE and with 8 SPEs.]

The comparison between the Cell/B.E. and the MTA-2 and BlueGene/L is not an apples-to-apples one, because of the limited amount of memory available on a Cell/B.E. blade: only 1 Gbyte versus several Gbytes. This is mostly a technological limitation that will be addressed by future generations of blades.

With small degrees, BlueGene/L combines the lack of cache locality with the communication overhead of small packets, and a single Cell/B.E. is two orders of magnitude faster, reaching the same scaled performance as 325 BlueGene/L processors with degree d=50.

Our BFS implementation compares well with one provided by John Feo for the Cray MTA-2 [19]. With d=10, a Cell/B.E. is approximately equivalent to 7 MTA-2 processors. A larger d enhances the effectiveness of the SIMD bitmap operations in the Cell/B.E. With d=200, the Cell/B.E. is 22× faster than the Pentium and the Woodcrest (12× faster than two Woodcrest cores), 26× faster than the AMD Opteron, and at the same level of performance as 128 BlueGene/L processors and an MTA-2 system with 23 processors.

VI. Discussion

Our analysis, summarized in Table IV, suggests that high performance comes at the price of increased hardware complexity or human effort (roughly expressed in lines of code) required to develop the software application.

Given a fixed amount of real estate on a chip, one can adopt the traditional approach of the Intel Woodcrest, with fewer coarse-grained computational cores, or the more aggressive design of the Cell/B.E., with a higher number of fine-grained cores. With the current state of the art, the former solution allows two cores, and the latter 8 cores. As technology progresses, it is easy to expect the same architectural dilemma: fewer cores with HW speculation, deep pipelines and standard cache-coherence protocols, or a larger number of simpler cores that require SW speculation and explicit data orchestration in the local storage.


Architecture                              d=10              d=50              d=100             d=200
                                          ME/s   Speed-up   ME/s   Speed-up   ME/s   Speed-up   ME/s   Speed-up
IBM Cell/B.E. @3.2 GHz (reference)
  1 PPE + 8 SPEs                          101.6    –        305.7    –        420.2    –        537.9    –
Intel Woodcrest @3.0 GHz
  1 core                                   10.5   9.68×      20.1  15.22×      22.2  18.94×      23.5  22.90×
  2 cores                                  19.8   5.13×      37.7   8.11×      41.8  10.06×      44.3  12.13×
Intel Pentium 4 HT @3.4 GHz
  1 core                                   11.1   9.13×      20.2  15.10×      22.1  19.01×      24.4  22.01×
AMD Opteron 250 @2.4 GHz
  1 core                                   11.0   9.26×      17.9  17.05×      20.4  20.55×      20.6  26.16×
Cray MTA-2 @220 MHz [2]
  1 CPU                                    17.7   5.75×      17.7  17.31×      17.7  23.80×      17.7  30.47×
  40 CPUs                                 512.0   0.20×     512.0   0.60×     512.0   0.82×     512.0   1.05×
Cray MTA-2 @220 MHz [19]
  1 CPU                                    16.7   6.09×      22.0  13.93×      22.9  18.36×      23.4  22.97×
  40 CPUs                                 544.7   0.19×     814.8   0.38×     879.2   0.48×     916.6   0.59×
IBM BlueGene/L
  128 CPUs                                 45.7   2.22×     162.0   1.89×     328.2   1.28×     474.1   1.13×
  256 CPUs                                 79.8   1.27×     232.7   1.31×     492.3   0.85×     731.4   0.74×

TABLE III. BFS throughput (ME/s) of the considered architectures, and speed-up of the IBM Cell/B.E. relative to each configuration, for graphs of average degree 10, 50, 100 and 200.

Architecture          Parallel    Programming     Type of           Parallel Data    Explicit            Lines of   Programming
                      Algorithm   Language        Parallelism       Decomposition    Synchronization     Code       Effort
AMD Opteron 250
  1 CPU               no          C               N/A               N/A              N/A                 100        Low
Intel Pentium 4 HT
  1 CPU               no          C               N/A               N/A              N/A                 100        Low
Intel Woodcrest
  1 core              no          C               N/A               N/A              N/A                 100        Low
  2 cores             yes         C + pthread     Thread            No               Yes                 200        Medium Low
Cray MTA-2
  128 HW threads      yes         C + pragmas     Thread            No               Yes [2] / No [19]   100        Medium Low
BlueGene/L
  2 cores             yes         C + MPI         Message Passing   Yes              Yes                 ?          Medium
Cell/B.E.
  1 + 8 cores         yes         C + intrinsics  Mixed             Yes              Yes                 600        Medium High

TABLE IV. Algorithmic and programming complexity of the BFS implementations on the considered architectures.

Average degree                            10       16       32       50       100      200
Vertices                                  105200   65536    32768    21000    10480    5240

AMD Opteron 250 (1 MB cache)              27.87    36.44    47.63    53.57    64.73    71.57
Intel Pentium 4 HT (2 MB cache)           24.58    45.85    76.25    96.88    125.58   152.77
Intel Woodcrest, 1 core (4 MB cache)      42.59    65.41    106.15   132.21   167.65   196.00
Intel Woodcrest, 2 cores (4 MB cache)     70.50    101.76   193.79   210.35   257.39   337.20

TABLE V. BFS throughput (ME/s) of conventional processors on small graphs that fit in the L2 cache.


[Fig. 11. Throughput (ME/s) as a function of the average degree of the vertices: the Cell/B.E. compared with the Woodcrest (1 and 2 threads), Pentium 4, Opteron, BlueGene/L (128 and 256 CPUs) and MTA-2 (1 and 40 CPUs).]

The programming advantage of the Woodcrest is clearly outlined in Table IV. A simple program of roughly 100 lines of code can efficiently implement a graph exploration algorithm on a single core. In order to use both cores, the user needs to extract explicit thread parallelism with a shared-memory programming model, in our case using the pthread library. The user must also explicitly deal with the potential race conditions that can occur when multiple threads update the data structures of the same vertex at the same time, a consequence of the inherent non-determinism of the graph exploration. Even if the parallelization is very efficient, with an almost optimal speed-up on two threads, the aggregate processing rate is lower than that of the Cell/B.E., in some cases by more than an order of magnitude.
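To illustrate the race the text refers to, the sketch below marks a neighbor as visited with an atomic fetch-and-or. This is a minimal illustration using GCC built-ins; the bitmap layout and the claim_vertex name are ours, not the in-house implementation.

#include <stdint.h>
#include <stdbool.h>

/* Shared visited bitmap: one bit per vertex. */
extern uint32_t visited[];

/* Atomically set the visited bit of vertex v; return true if this thread
   is the one that discovered it (the bit was previously clear). */
static bool claim_vertex(uint32_t v)
{
    uint32_t word = v >> 5;                /* 32 bits per bitmap word */
    uint32_t mask = 1u << (v & 31);
    uint32_t old  = __sync_fetch_and_or(&visited[word], mask);
    return (old & mask) == 0;
}

Each thread can then append the vertices it claims to a private next-frontier buffer, so that only the bitmap itself needs atomic updates.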

Table V provides some insight on this performance gap. With small graphs that fit in the L2 cache, the Woodcrest (and also the other conventional processors) can achieve a much higher throughput, reaching 196 ME/s with one core and 337 ME/s with two cores.

The HW speculation of these processors, throttled by the limited number of outstanding memory requests that can be issued at any given time, is not as powerful as software speculation carefully organized in DMA transfers that can be issued at a user-specified granularity. In fact, the Cell/B.E. could process graphs at a wire speed of a few GE/s (in Figure 6 we have seen that it is possible to achieve an aggregate communication bandwidth in excess of 20 Gbytes/s when using DMAs as small as 128 bytes), if it were not limited by the processing rate of the bitmap manipulation, which is the bottleneck of our implementation.
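The kind of DMA pipelining mentioned here can be sketched, on the SPE side, as a classic double-buffered loop. This is a minimal sketch under simplifying assumptions: the chunk size, the process_chunk routine and the requirement that total be a multiple of CHUNK are illustrative, and the actual kernels keep many more transfers in flight.

#include <spu_mfcio.h>

#define CHUNK 4096  /* bytes per DMA transfer; illustrative size */

/* Local-store buffers; DMA targets should be 128-byte aligned. */
static volatile unsigned char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(volatile unsigned char *data, unsigned n);

/* Stream 'total' bytes from effective address 'ea' (assumed to be a
   multiple of CHUNK), overlapping the transfer of one buffer with the
   processing of the other. */
void stream_and_process(unsigned long long ea, unsigned long total)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* prime the pipeline */
    for (unsigned long off = CHUNK; off < total; off += CHUNK) {
        int nxt = cur ^ 1;
        mfc_get(buf[nxt], ea + off, CHUNK, nxt, 0, 0);  /* fetch next chunk   */
        mfc_write_tag_mask(1 << cur);                   /* wait for current   */
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);                 /* overlap with DMA   */
        cur = nxt;
    }
    mfc_write_tag_mask(1 << cur);                       /* drain last chunk   */
    mfc_read_tag_status_all();
    process_chunk(buf[cur], CHUNK);
}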

On the other hand, the Woodcrest and the other conventional processors can execute a bitmap update in 4.6 ns, 2 times faster than the Cell/B.E., with a significantly simpler implementation. This suggests a potential direction for performance improvements in future generations of the Cell/B.E. and other multi-core processors.

Another interesting comparison is between the Cell/B.E. and the MTA-2 multi-threaded machine. The MTA-2 is a cache-less architecture that hides the memory latency with a large number of outstanding memory requests, 1,024 per processor, that can be issued by 128 HW threads. The absence of a layered memory hierarchy makes the performance of this machine relatively insensitive to the average degree of the graph, as shown in Table III. The graph exploration algorithm presented by Bader [2] is remarkably simple, and only incrementally more complex than the sequential pthread version. Using #pragmas, the user needs to identify the loops that contain enough parallelism, estimate the number of threads that can be executed in parallel, and consider possible race conditions, which are avoided using presence bits in each memory location [2] or an int_fetch_add [19]. Using these spare bits, each memory location can be given producer-consumer semantics, and HW threads can synchronize at the fine granularity of a single memory word.
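The fetch-and-add idiom mentioned above can be sketched as follows; GCC's __sync_fetch_and_add is used here as a portable stand-in for the MTA-2 intrinsic, and the queue names are illustrative rather than taken from either implementation.

/* Append a newly discovered vertex to the shared next-level queue.
   Each thread reserves a distinct slot with an atomic fetch-and-add,
   so no two threads ever write to the same position. */
extern unsigned next_queue[];
extern unsigned next_tail;

static void enqueue_next(unsigned v)
{
    unsigned slot = __sync_fetch_and_add(&next_tail, 1u);
    next_queue[slot] = v;
}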

The clock frequency of the MTA-2 is only 220 MHz, with an expected 500 MHz in the upcoming Cray Threadstorm processor that is the building block of the Cray XMT [20]. Surprisingly, the Cell/B.E. compares well with a 20-CPU MTA-2 system with 2,560 HW threads, providing similar performance on graphs of relatively large average degree. It is worth noting that many of these architectural features have been adopted by General Purpose Graphics Processing Units (GPGPUs), such as the NVidia GeForce 8800 [39], which can take advantage of a larger market and therefore employ more aggressive and expensive integration technologies.

The work on BlueGene/L by Yoo et al. [51] takes the difficult step of parallelizing the graph exploration on a distributed-memory machine, using an explicit message-passing paradigm. The level of complexity is substantially higher than that of the other implementations, and it also requires a sophisticated pre-processing of the graph. Quite remarkably, a single Cell/B.E. processor is two orders of magnitude faster than the BlueGene/L across various graph configurations, in particular those with small average degree, where there is very little data locality. In a specific case, with d=50, the scaled performance of a single Cell/B.E. is equivalent to 336 BlueGene/L processors.

The programming effort that is needed to achieve optimal performance on the Cell/B.E. is still high. This is in part due to the lack of a clear and simple abstract machine model with which to develop new algorithms, and in part to the lack of run-time systems and compilers that can help program development and optimization. In our case, for example, we had to explicitly implement barrier synchronization, reductions and other collective communication primitives that are typically included in run-time libraries such as MPI, and to optimize the computational kernels of our application almost at the assembly level.


In Table IV we can see that the actual algorithm required 600 lines of code, plus 600 more lines to implement run-time functionalities such as barrier, allreduce, initialization and termination.
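As an indication of the kind of run-time primitive that had to be written by hand, a generic sense-reversing barrier is sketched below. This is a shared-memory illustration using GCC atomic built-ins, not the actual Cell/B.E. implementation.

/* Generic sense-reversing barrier for 'nthreads' threads.
   'local_sense' is a per-thread variable initialized to 0. */
typedef struct {
    volatile int count;     /* initialized to nthreads */
    volatile int sense;     /* initialized to 0        */
    int          nthreads;
} barrier_t;

void barrier_wait(barrier_t *b, int *local_sense)
{
    *local_sense = !*local_sense;                   /* flip the private sense */
    if (__sync_sub_and_fetch(&b->count, 1) == 0) {  /* last thread to arrive  */
        b->count = b->nthreads;                     /* reset for the next use */
        b->sense = *local_sense;                    /* release the others     */
    } else {
        while (b->sense != *local_sense)            /* spin until released    */
            ;
    }
}

An allreduce can be layered on the same structure by having each thread deposit its partial result before the barrier and read the combined value after it.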

VII. Conclusions

Together with an unprecedented level of performance, multi-core processors are also bringing an unprecedented level of complexity in terms of software development. We see a clear shift of paradigm from classical parallel computing, where parallelism is typically expressed in a single dimension (i.e., local vs. remote communication, or scalar vs. vector code), to the complex, multi-dimensional parallelization space of multi-core processors, where several levels of control and data parallelism must be exploited in order to obtain the expected performance.

With this paper we have shown that, for the specific case of breadth-first graph exploration, it is possible to tame the algorithmic and software development process and, at the same time, achieve an impressive level of performance.

The explicit management of the memory hierarchy, with emphasis on the local memories of the multiple cores, is a fundamental aspect that needs to be captured by the high-level algorithmic design, to guarantee portability of performance across existing and future multi-core architectures. Programmability greatly benefits from the separation of computation and communication, and from a Bulk-Synchronous Parallel (BSP) parallelization, which also leads to accurate performance models and guides the implementation and optimization effort.
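Schematically, this BSP organization turns each BFS level into a superstep of local computation, communication and a global synchronization. A minimal sketch of the control loop follows; the function names are illustrative, not the paper's API.

/* One BFS level per BSP superstep. */
extern int  frontier_empty(void);         /* true when no PE has work left     */
extern void expand_local_frontier(void);  /* computation: visit neighbors and
                                             update the local bitmap           */
extern void exchange_new_vertices(void);  /* communication: send discovered
                                             vertices to their owner PEs       */
extern void barrier_and_allreduce(void);  /* global sync + termination test    */
extern void swap_frontiers(void);         /* next level becomes current        */

void bfs_bsp(void)
{
    while (!frontier_empty()) {
        expand_local_frontier();
        exchange_new_vertices();
        barrier_and_allreduce();
        swap_frontiers();
    }
}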

Our experiments show that the Cell/B.E. can obtain high performance in this class of algorithms: a speedup of one order of magnitude when compared to other commodity and special-purpose processors, reaching two orders of magnitude with BlueGene/L.

A major strength of the Cell/B.E. is the possibility of overcoming the memory wall: the user can explicitly orchestrate the memory traffic by pipelining multiple DMA requests to main memory. This is a unique feature that is not available on other commodity multiprocessors, which cannot efficiently handle working sets that overflow the cache. The major limitation is the extraction of SIMD parallelism, a non-trivial effort in the presence of multiple concurrent activities.

VIII. Acknowledgments

We thank Deborah Gracio, Troy Thompson and Dave Thurman for their support. We thank Mike Kistler of the Austin IBM Research Laboratory for his insightful technical advice, and John Feo for the description of his BFS parallelization on the Cray MTA-2.

The research described in this paper was conducted under the Laboratory Directed Research and Development Program for the Data Intensive Computing Initiative at Pacific Northwest National Laboratory, a multi-program national laboratory operated by Battelle for the U.S. Department of Energy under Contract DE-AC05-76RL01830.

References

[1] D. Bader, V. Agarwal, and K. Madduri, "On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study on List Ranking," in International Parallel and Distributed Processing Symposium (IPDPS'07), Long Beach, CA, March 2007.
[2] D. A. Bader and K. Madduri, "Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2," in Proc. Intl. Conf. on Parallel Processing (ICPP'06), Columbus, OH, Aug. 2006.
[3] G. Bell, J. Gray, and A. Szalay, "Petascale computational systems," IEEE Computer, vol. 39, no. 1, pp. 110–112, Jan. 2006.
[4] P. Bellens, J. M. Perez, R. M. Badia, and J. Labarta, "CellSs: a Programming Model for the Cell BE Architecture," in Proc. Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SuperComputing'06), Tampa, FL, Nov. 2006.
[5] F. Blagojevic, A. Stamatakis, C. Antonopoulos, and D. Nikolopoulos, "RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband Engine," in International Parallel and Distributed Processing Symposium (IPDPS'07), Long Beach, CA, March 2007.
[6] B. Bouzas, J. Greene, R. Cooper, M. Pepe, and M. J. Prelle, "MultiCore Framework: An API for Programming Heterogeneous Multicore Processors," in STMCS: First Workshop on Software Tools for Multi-Core Systems (STMCS), Manhattan, New York, NY, March 2006.
[7] J. Carter, L. Oliker, and J. Shalf, "Performance Evaluation of Scientific Applications on Modern Parallel Vector Systems," in Intl. Meeting on High Performance Computing for Computational Science (VECPAR), Rio de Janeiro, Brazil, July 2006.
[8] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald, Parallel Programming in OpenMP. Morgan Kaufmann, 2001.
[9] A. Clauset, M. E. J. Newman, and C. Moore, "Finding Community Structure in Very Large Networks," Physical Review E, vol. 6, no. 70, December 2004.
[10] T. Davis, "Sparse Matrix Collection," NA Digest, vol. 94, no. 42, October 1994, available at http://www.cise.ufl.edu/research/sparse/matrices/.
[11] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in 6th Symposium on Operating System Design and Implementation (OSDI), December 2004, pp. 137–150.
[12] F. Dehne, A. Ferreira, E. Caceres, W. Song, and A. Roncato, "Efficient Parallel Graph Algorithms for Coarse Grained Multicomputers and BSP," Algorithmica, vol. 33, pp. 183–200, 2002.
[13] M. deLorimier, N. Kapre, N. Mehta, D. Rizzo, I. Eslick, R. Rubin, T. E. Uribe, T. F. J. Knight, and A. DeHon, "GraphStep: A System Architecture for Sparse-Graph Algorithms," in Proc. Symposium on Field-Programmable Custom Computing Machines (FCCM'06). Los Alamitos, CA, USA: IEEE Computer Society, 2006.
[14] R. Drost, C. Forrest, B. Guenin, R. Ho, A. Krishnamoorty, D. Cohen, J. Cunningham, B. Tourancheau, A. Zingher, A. Chow, G. Lauterbach, and I. Sutherland, "Challenges in Building a Flat-Bandwidth Memory Hierarchy for a Large-scale Computer with Proximity Communication," in Hot Interconnects 13, Palo Alto, CA, August 2005.
[15] J. Duato, "A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 12, pp. 1320–1331, December 1993.
[16] ——, "A Necessary and Sufficient Condition for Deadlock-Free Adaptive Routing in Wormhole Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 10, pp. 1055–1067, October 1995.
[17] J. Duch and A. Arenas, "Community Detection in Complex Networks Using Extremal Optimization," Physical Review E, vol. 72, January 2005.
[18] K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. R. Horn, L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan, "Sequoia: Programming the Memory Hierarchy," in Proc. Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SuperComputing'06), Tampa, FL, November 2006.
[19] J. Feo, "Optimized BFS Algorithm on the MTA-2 Architecture," November 2006, Personal Communication.
[20] J. Feo, D. Harpera, S. Kahan, and P. Konecny, "ELDORADO," in Proc. Intl. Conf. on Computing Frontiers, Ischia, Italy, May 2005.
[21] J. Fernandez, E. Frachtenberg, and F. Petrini, "BCS MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers," in Proc. Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SuperComputing'03), Phoenix, AZ, Nov. 2003.
[22] U. Geuder, M. Hardtner, B. Worner, and R. Zink, "Scalable Execution Control of Grid-based Scientific Applications on Parallel Systems," in Scalable High-Performance Computing Conference, Knoxville, TN, May 1994, pp. 788–795.


[23] D. Gregor and A. Lumsdaine, "Lifting Sequential Graph Algorithms for Distributed-Memory Parallel Computation," in OOPSLA'05, San Diego, CA, October 2005.
[24] M. Guo, "Automatic Parallelization and Optimization for Irregular Scientific Applications," in Intl. Parallel & Distributed Processing Symposium (IPDPS'04), Santa Fe, NM, April 2004.
[25] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, "Introduction to the Cell Multiprocessor," IBM Journal of Research and Development, pp. 589–604, July/September 2005.
[26] M. Kistler, M. Perrone, and F. Petrini, "Cell Processor Interconnection Network: Built for Speed," IEEE Micro, vol. 25, no. 3, May/June 2006.
[27] P. Kongetira, K. Aingaran, and K. Olokotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2, pp. 21–29, Mar./Apr. 2005.
[28] R. Kota and R. Oehler, "Horus: Large-Scale Symmetric Multiprocessing for Opteron Systems," IEEE Micro, vol. 25, no. 2, pp. 30–40, Mar./Apr. 2005.
[29] D. Kunzman, G. Zheng, E. Bohm, and L. V. Kale, "Charm++, Offload API, and the Cell Processor," in Proc. of the Workshop on Programming Models for Ubiquitous Parallelism, Seattle, WA, Sep. 2006.
[30] J. Kurzak and J. Dongarra, "Implementation of the Mixed-Precision in Solving Systems of Linear Equations on the Cell Processor," University of Tennessee, Tech. Rep., 2006.
[31] E. A. Lee, "The Problem with Threads," IEEE Computer, vol. 39, no. 5, pp. 33–42, May 2006.
[32] C. McNairy and R. Bhatia, "Montecito: A Dual-Core, Dual-Thread Itanium Processor," IEEE Micro, vol. 25, no. 2, pp. 10–20, Mar./Apr. 2005.
[33] J. Mellor-Crummey and M. Scott, "Algorithms for Scalable Synchronization on Shared-memory Multiprocessors," ACM Transactions on Computer Systems (TOCS), vol. 9, no. 1, pp. 21–64, February 1991.
[34] J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, pp. 41–51, Mar./Apr. 2005.
[35] M. E. J. Newman, "Detecting Community Structure in Networks," European Physical Journal B, vol. 38, pp. 321–330, May 2004.
[36] ——, "Fast Algorithm for Detecting Community Structure in Networks," Physical Review E, vol. 69, no. 6, p. 066133, June 2004.
[37] M. E. J. Newman and M. Girvan, "Finding and Evaluating Community Structure in Networks," Physical Review E, vol. 69, no. 2, p. 026113, February 2004.
[38] D. S. Nikolopoulos and T. S. Papatheodorou, "The Architectural and Operating System Implications on the Performance of Synchronization on ccNUMA Multiprocessors," International Journal of Parallel Programming, vol. 29, no. 3, pp. 249–282, October 2001.
[39] NVIDIA, "GeForce 8800 GPU Architecture Overview – Technical Brief," available from http://www.nvidia.com/object/IO 37100.html.
[40] M. Ohara, H. Inoue, Y. Sohda, H. Komatsu, and T. Nakatani, "MPI Microtask for Programming the Cell Broadband Engine Processor," IBM Systems Journal, vol. 45, no. 1, pp. 85–102, January 2006.
[41] L. Oliker, R. Biswas, J. Borrill, A. Canning, J. Carter, M. J. Djomehri, H. Shan, and D. Skinner, "A Performance Evaluation of the Cray X1 for Scientific Applications," in Proc. Intl. Meeting on High Performance Computing for Computational Science (VECPAR), Valencia, Spain, June 2004, pp. 51–65.
[42] F. Petrini, J. Fernandez, A. Moody, E. Frachtenberg, and D. K. Panda, "NIC-based Reduction Algorithms for Large-scale Clusters," International Journal of High Performance Computing and Networking (IJHPCN), vol. 4, no. 3/4, pp. 122–136, Feb. 2006.
[43] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, "Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture," in ISCA '03: Proceedings of the 30th Annual International Symposium on Computer Architecture, New York, NY, 2003, pp. 422–433.
[44] M. C. Smith, J. S. Vetter, and X. Liang, "Accelerating Scientific Applications with the SRC-6 Reconfigurable Computer: Methodologies and Analysis," in International Parallel and Distributed Processing Symposium (IPDPS'05), vol. 4, Denver, CO, Apr. 2005.
[45] V. Subramaniam and P.-H. Cheng, "A Fast Graph Search Multiprocessor Algorithm," in Proc. of the Aerospace and Electronics Conf. (NAECON'97), Dayton, OH, July 1997.
[46] A. Sud, E. Andersen, S. Curtis, M. C. Lin, and D. Manocha, "Real-time Path Planning for Virtual Agents in Dynamic Environments," in IEEE Virtual Reality, Charlotte, NC, March 2007.
[47] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs," IEEE Micro, vol. 22, no. 2, pp. 25–35, 2002.
[48] L. G. Valiant, "A Bridging Model for Parallel Computation," Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
[49] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick, "The Potential of the Cell Processor for Scientific Computing," in Proc. ACM Intl. Conf. on Computing Frontiers, Ischia, Italy, May 2006.
[50] W. A. Wulf and S. A. McKee, "Hitting the Memory Wall: Implications of the Obvious," ACM Computer Architecture News, vol. 23, no. 1, March 1995.
[51] A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and U. Catalyurek, "A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L," in Proc. Intl. Conf. for High Performance Computing, Networking, Storage and Analysis (SuperComputing'05), Seattle, WA, November 2005.
[52] L. Zhang, Y. J. Kim, and D. Manocha, "A Simple Path Non-Existence Algorithm using C-Obstacle Query," in Proc. Intl. Workshop on the Algorithmic Foundations of Robotics (WAFR'06), New York City, July 2006.
[53] Y. Zhao and K. Kennedy, "Dependence-based Code Generation for a Cell Processor," in Proc. Intl. Workshop on Languages and Compilers for Parallel Computing (LCPC 2006). New Orleans, Louisiana: Springer-Verlag, Lecture Notes in Computer Science, Nov. 2006.

Daniele Paolo Scarpazza is a post-doctoral research fellow at the Cell Solutions Department of the IBM Thomas J. Watson Research Laboratory in Yorktown Heights, NY. He received an M.S. in Electrical Engineering and Computer Science from the University of Illinois at Chicago in 2001, and a Laurea cum laude in Computer Engineering from Politecnico di Milano, Italy, in 2002. He received his Ph.D. in Information Engineering from Politecnico di Milano in 2006. Prior to this position, he was a post-doctoral research fellow at the Pacific Northwest National Laboratory, Richland, WA.

Oreste Villa received the Laurea degree in Electronic Engineering from the University of Cagliari, Italy, in 2003, and the M.S. degree in embedded systems design from the ALaRI Institute, Lugano, Switzerland, in 2004. He is currently pursuing his Ph.D. degree at Politecnico di Milano, Italy, majoring in design methodologies of multiprocessor architectures for embedded systems, focusing also on power estimation, on-chip communication, and clustered systems. He has a joint appointment as a research intern at the Pacific Northwest National Laboratory.

Fabrizio Petrini is a researcher in the Cell Solutions Department of the IBM Thomas J. Watson Research Laboratory in Yorktown Heights, NY. His research interests include various aspects of multi-core processors and supercomputers, including high-performance interconnection networks and network interfaces, fault tolerance, job scheduling algorithms, parallel architectures, operating systems, and parallel programming languages. He has received numerous awards from the U.S. Department of Energy (DOE) for contributions to supercomputing projects, and from other organizations for scientific publications. He is a member of the IEEE.