7/27/2019 Blue Gene q Network
Looking Under the Hood of the
IBM Blue Gene/Q Network
Dong Chen, Noel Eisley, Philip Heidelberger, Sameer Kumar, Amith Mamidala, Fabrizio Petrini,
Robert Senger, Yutaka Sugawara, Robert Walkup
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
{chendong, naeisley, philiph, sameerk, amithr, fpetrin,
rmsenger, ysugawa, walkup}@us.ibm.com
Burkhard Steinmacher-Burow
IBM Deutschland Research & Development GmbH
71032 Böblingen, Germany
Anamitra Choudhury, Yogish Sabharwal, Swati Singhal
IBM India Research Lab
New Delhi, India
{anamchou, ysabharwal, swatisin}@in.ibm.com
Jeffrey J. Parker
IBM Systems & Technology Group
Systems Hardware Development
Rochester, MN 55901
Abstract- This paper explores the performance and optimization
of the IBM Blue Gene/Q (BG/Q) five dimensional torus network
on up to 16K nodes. The BG/Q hardware supports multiple
dynamic routing algorithms and different traffic patterns may
require different algorithms to achieve best performance.
Between 85% to 95% of peak network performance is achieved
for all-to-all traffic, while over 85% of peak is obtained for
challenging bisection pairings. A new software-controlled
algorithm is developed for bisection traffic that selects which
hardware algorithm to employ and achieves better performance
than any individual hardware algorithm. The benefit of dynamic
routing is shown for a highly non-uniform transpose traffic pattern. To evaluate memory and network performance, the
HPCC Random Access benchmark was tuned for BG/Q and
achieved 858 Giga Updates per Second (GUPS) on 16K nodes. To
further accelerate message processing, the message libraries on
BG/Q enable the offloading of messaging overhead onto
dedicated communication threads. Several applications, including
Algebraic Multigrid (AMG), exhibit from 3 to 20% gain using
communication threads.
Keywords- interconnection network; network performance;
network routing; GUPS; Blue Gene;
I. INTRODUCTION
Blue Gene/Q (BG/Q) is the third generation of highly scalable, power-efficient supercomputers in the IBM Blue
Gene line, following Blue Gene/L [1] and Blue Gene/P [2]. A
96 rack, 20 petaflops, Blue Gene/Q system called Sequoia has
been installed at the Lawrence Livermore National
Laboratory, while a 48 rack configuration named Mira has
been installed at the Argonne National Laboratory.
BG/Q leverages a highly integrated System-on-a-Chip
(SoC) design with custom on-die torus network and dense
system-level packaging to provide a low-latency, low-power,
high-bandwidth and cost efficient solution for massive scale-
out installations. Design for scalability is especially important
for large petaflop class machines where performance, density,
and power are key inter-related system parameters. As shown
in Figure 1, a BG/Q compute node consists of the SoC single-
chip module with associated memory. 32 compute nodes are
electrically interconnected to form a 2x2x2x2x2 grid on a
node card. 16 node cards comprise a 512-node midplane and
two midplanes stack vertically to form a 1024-node rack, with
electrical links within midplanes and optical links between
midplanes. Racks may also contain special I/O drawers with
Gen-2 PCIe connectivity. The final BG/Q system scales to 96 racks and beyond. The racks are water cooled to permit maximum compute density.

[Figure 1. BG/Q dense packaging hierarchy for massive scale-out: (1) BG/Q Chip: 17 PowerPC cores; (2) Single Chip Module; (3) Compute Card (Node): chip module, 16 GB DDR3 memory; (4) Node Board: 32 compute nodes, optical modules, link chips; 5D torus; (5a) Midplane: 16 node cards; (5b) I/O drawer: 8 I/O cards, 8 PCIe Gen2 x8 slots; (6) Rack: 2 midplanes, 1, 2 or 4 I/O drawers; (7) System: up to 96 racks or more, 20 petaflops+. © 2012 Springer Verlag. Reprinted, with permission, from [14].]

SC12, November 10-16, 2012, Salt Lake City, Utah, USA. 978-1-4673-0806-9/12/$31.00 © 2012 IEEE.
An overview of BG/Q is given in [3]. The BG/Q SoC has
16 cores for user code, and a 17th core is reserved for use by
the system software. Each core has four hardware threads. The
64-bit, in-order, PowerPC cores run at 1.6 GHz. A core can
execute two instructions per cycle: a floating point instruction
on one thread and an integer, branch, load or store on another
thread. Each core has a four wide SIMD floating point engine
capable of executing 8 floating point operations per cycle; the
peak performance of a node is 204.8 GFlops. A crossbar switch
connects the cores to a 32 MB shared L2 cache, organized as
16 slices with 2 MB per slice. Detailed descriptions of the
BG/Q five dimensional (5D) torus interconnection network
and its associated DMA engine, called the Message Unit,
which are integrated onto the same chip as the cores, are given
in [4][5]. The Message Unit attaches to the cores and the
memory system over the crossbar switch. Other notable uses
of a torus interconnect in supercomputers include 3D Cray
machines [6][7] and the 6D Fujitsu K computer [8]. Other scalable networks used in supercomputers today are Clos [18] and dragonfly [16] indirect networks, and all-connected direct networks [17].
BG/Q was designed for scalability and power efficiency.
Sequoia placed first on the June 2012 TOP500 list
(http://www.top500.org) at 16.3 Petaflops, an efficiency of
81.1% of peak, and various configurations of BG/Q have
ranked first on the four most recent Green500 lists
(http://www.green500.org) for power efficiency (November
2010 to June 2012). Additionally, BG/Q ranked first on the
November 2011 and June 2012 Graph 500 lists
(http://www.graph500.org), a network and data intensive
benchmark.
On such a large machine, parallel applications face several challenges to scale, and communication performance can be a
major limiting factor. This paper covers a diversity of
techniques showing how communication performance can be
optimized using both hardware and software techniques
developed through a coordinated co-design effort.
We first provide a detailed look at the performance of the
BG/Q interconnection network on a number of important
communication patterns. In particular, BG/Q provides
multiple, flexible, and programmable hardware dynamic
routing algorithms which support a diverse application set. We
explore the routing algorithms' effectiveness for all-to-all, challenging bisection pairings, and random communication patterns. We also investigate how several software techniques can optimize and improve communications-intensive
benchmarks and applications. We describe optimizations,
including multithreading and message aggregation, for the
HPCC Random Access benchmark
(http://www.hpcchallenge.org). While not an official HPCC
submission, this paper reports how a 16 rack (16384 node)
BG/Q achieves 858 Giga Updates per Second (GUPS), or 54
GUPS per rack. We also present results showing how the
Algebraic Multigrid (AMG) application [9] and an iterative
Poisson's equation solver can be accelerated using
communication threads in which otherwise idle threads are
used to offload and manage communications activity.
Our paper makes the following contributions:
- We demonstrate excellent performance achieved by the 5D BG/Q torus network for several all-to-all and bisection communication patterns.
- We develop a hybrid routing algorithm and show its effectiveness under non-uniform traffic loads.
- We show how the BG/Q system performance can be significantly improved by offloading communication activity to separate threads.
- We describe how the BG/Q messaging layer incorporates configurable features of the network, providing very good performance to the average user while still permitting the experienced user to select routing algorithms and messaging settings to further optimize application performance.
- We demonstrate excellent GUPS performance with a software-optimized version of the Random Access benchmark.
Taken as a whole, this paper shows the benefits of providing
multiple hardware routing algorithms to more efficiently
support different communication patterns. Furthermore, tight
coordination between hardware and software can significantly
accelerate communications. Offloading to software can in
some cases reduce hardware complexity as will be illustrated
in the paper.
II. SUMMARY OF BG/Q NETWORK ARCHITECTURE
To properly understand the results in this paper, we
summarize the most relevant features of the BG/Q
interconnection network architecture. For user applications,
BG/Q presents a 5D torus with each link running at 2 GB/s (2
GB/s send + 2 GB/s receive). A subset of compute nodes,
called bridge nodes, use an 11th link that attaches to BG/Q IO
nodes. Including packet and protocol overhead, up to 90% of
the raw data rate (1.8 GB/s) is available for user data. The
network supports point-to-point messages, collectives and
barriers/global interrupts over the same physical torus (BG/L
and BG/P had separate networks for collectives and barriers).
The machine can be partitioned into non-overlapping
rectangular sub-machines. These sub-machines do not interfere with each other, except possibly on the IO nodes and their corresponding storage system. For point-to-point messages, BG/Q supports both deterministic and dynamic routing with
deadlocks being prevented via Bubble routing [10] in which
packets can switch from a dynamic virtual channel to the
bubble (deterministic) escape virtual channel when network
tokens are exhausted. The deterministic routing is
(programmably) dimension ordered; we have found that
ordering the dimensions from longest first to shortest last is
typically best for performance. With this, queues for packets
waiting to enter the bottleneck (longest dimension) links are
actually stored in the memory system rather than in the much
more limited network FIFOs.
Dynamic routing is also programmable, enabling different routing algorithms to be used, on a per-message basis, at the
same time, i.e., a given message always uses the same
algorithm but different messages can use different algorithms.
This is called "zone routing" and implements in hardware
ideas first explored in software on BG/L [11]. When a packet
enters the network, it is assigned a vector of hint bits, one bit
per direction indicating whether the packet should move in the
plus or minus direction for each dimension, until it reaches its
destination. The hint bits may be assigned by hardware for
minimal path routing or can be programmed by software. On
BG/L, at each hop in the network, a packet may dynamically
move in any direction for which a hint bit is specified. On
BG/Q, a packet header also contains two bits which specify one of four zone IDs, and the allowable movement of dynamic packets is constrained by programmable mask registers for each of the zone IDs. For example, the masks for one zone ID can be set so that packets must complete all hops in the longest dimension(s) first before moving to smaller dimensions, while for a different zone ID the masks could permit movement along any valid direction, as on BG/L. Each such mask is referred to as a "zone", and we refer to a specific mask as zone x of zone ID y. To describe a zone ID, we use the following
notation and example: {A}{BCD}{E}. This means that a
packet first must travel to its final destination along the A
dimension; then it may travel along the B, C, and D
dimensions, taking hops in any order until all three of these
dimensions are complete; and finally the packet routes along
the E dimension until it reaches its final destination. Table I
shows the zone routing masks which we use in this paper for
selected system sizes. Experiments in [11] and near cycle-accurate simulations of the BG/Q network indicate that longest
dimension(s) first to shortest dimension(s) last typically
performs well. Conversely, we found that typically a shortest-
to-longest approach did not perform well, so we do not include
results here. Studies in this paper show that other, more
flexible, forms of zone routing can be beneficial.
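The zone-mask ordering just described can be sketched as follows. This is a minimal model of the constraint semantics (the helper names are ours, not the BG/Q hardware or SPI interface): a zone ID is an ordered list of dimension sets, and a dynamic packet may only take hops in the earliest zone that still has hops remaining.

```python
# Hypothetical sketch (not the BG/Q hardware interface): model a zone ID's
# mask ordering, e.g. {A}{BCD}{E}, and report which dimensions a dynamic
# packet may currently take given its remaining hops per dimension.

def allowed_dims(zones, remaining):
    """zones: ordered list of sets of dimension names, e.g. [{'A'}, {'B','C','D'}, {'E'}].
    remaining: dict of hops still needed per dimension.
    A packet may only move within the earliest zone that still has hops left."""
    for zone in zones:
        active = {d for d in zone if remaining.get(d, 0) > 0}
        if active:
            return active
    return set()

# Example: longest-to-shortest routing {A}{BCD}{E} on a 16-rack 16x8x8x8x2 system.
zones = [{'A'}, {'B', 'C', 'D'}, {'E'}]
remaining = {'A': 3, 'B': 1, 'C': 0, 'D': 2, 'E': 1}
print(allowed_dims(zones, remaining))   # A hops remain, so only A is allowed
remaining['A'] = 0
print(allowed_dims(zones, remaining))   # now B and D may be taken in any order
```

With an unrestricted zone ID such as {ABCDE}, the same helper returns every dimension with hops remaining, matching the BG/L-style behavior described above.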
Note that in Table I, zone ID 3 is the same as the
deterministic ordered zone ID 2 except that hops in dimension
E are also permitted to occur first. In other words, packets are
first injected and may switch between either the longest
dimension in the system or dimension E. This can improve performance: since the length of E is always 2, no packet can travel more than one hop in E. Even if the E network FIFOs
are full of dynamic packets, they cannot block packets from
longer dimensions turning onto E since those packets can use
the bubble escape virtual channel. In this case the small
additional contention from packets turning from E to the longest dimension may be outweighed by the additional
buffering effect of allowing packets to inject into either
dimension E or the longest dimension.
To further improve performance, we explore the use of software "pacing", in which the fullness of packet queues
within the network logic is controlled by limiting the injection
rate of packets into the network, similar to TCP/IP window
flow control. In our form of pacing, there is a window size of
W bytes and each node is permitted to inject requests for at
most 2W bytes at any one time. After W bytes are received, a
remote get (rDMA read) request is issued for another W bytes
(or the remaining message size).
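The windowed scheme above can be sketched as follows. This is a simplified model with hypothetical names, not the SPI or Message Unit interface: at most 2W bytes of remote-get requests are outstanding, and each time a W-byte window completes, the next W bytes (or the remainder) are requested.

```python
# Illustrative model of software pacing: window of W bytes, at most 2W bytes
# of requests outstanding; after W bytes arrive, a remote get is issued for
# the next W bytes (or the remaining message size).

def pace_message(total_bytes, w):
    """Return the sequence of remote-get request sizes issued for one message."""
    requests = []
    outstanding = 0
    sent = 0
    # Prime the pipeline: up to 2W bytes may be requested at once.
    while sent < total_bytes and outstanding + min(w, total_bytes - sent) <= 2 * w:
        chunk = min(w, total_bytes - sent)
        requests.append(chunk)
        sent += chunk
        outstanding += chunk
    # Each time a W-byte window completes, request the next W (or the rest).
    while sent < total_bytes:
        outstanding -= w          # model completion of the oldest window
        chunk = min(w, total_bytes - sent)
        requests.append(chunk)
        sent += chunk
        outstanding += chunk
    return requests

# A 1 MB message with an 8 KB pacing window is issued as 128 requests of 8 KB.
reqs = pace_message(1 << 20, 8 << 10)
print(len(reqs), reqs[0], sum(reqs))   # 128 8192 1048576
```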
The tests described in Sections III and IV are written using
low level System Programming Interface (SPI) calls that
access the network hardware resources directly [5], so as to
eliminate most software overhead from the measurements. The
GUPS results of Section V are obtained using the BG/Q
production messaging library PAMI (Parallel Active Message
Interface) [12]. PAMI uses SPI calls to access the hardware
and supports both communication threads and a form of
pacing. The BG/Q MPI implementation runs on top of PAMI.
III. ALL-TO-ALL BANDWIDTH
The peak all-to-all bandwidth (BW) of a torus is limited by the length of its longest dimension, since a given link in this dimension is utilized by more source-destination pairs. If the length of the longest dimension is L, then the peak user-data per-node all-to-all BW is (8/L) × 1.8 GB/s [11].
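The peak figure can be checked directly with a small helper (our own, purely illustrative; the 1.8 GB/s is the per-link user-data rate stated in Section II):

```python
# Quick check of the peak all-to-all figure from [11]: per-node user-data
# bandwidth on a torus is (8 / L) * 1.8 GB/s, where L is the longest dimension.

def peak_all_to_all_per_node(dims, link_user_bw=1.8):
    L = max(dims)
    return 8.0 / L * link_user_bw

# 16 racks (16x8x8x8x2): L = 16, so 0.9 GB/s per node.
print(round(peak_all_to_all_per_node((16, 8, 8, 8, 2)), 2))   # 0.9
# 4 racks (8x4x8x8x2): L = 8, so 1.8 GB/s per node.
print(round(peak_all_to_all_per_node((8, 4, 8, 8, 2)), 2))    # 1.8
```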
We ran our SPI-based large-message all-to-all
performance test on systems up to 16 racks (16384 nodes). In
this test, each node sends 32 KB of data to each of the other
N-1 nodes. The data is broken up into a number of smaller
messages of constant size which are sprayed randomly over
the destinations. Breaking up the 32KB into smaller
submessages had only a small effect since each node is already
spraying packets from different messages throughout the
network. To explore the effect of zone routing, we ran the test using dynamic routing zone IDs 0 through 3 as well as using
deterministic routing on 4-rack and 16-rack systems, and the
results are shown in Figure 2. The best results are achieved
with zone ID 0, which is expected. Recall that in zone ID 0,
packets are first routed along the longest dimension (here, A),
which is the most heavily loaded in this case; so no packets
turn onto A from other dimensions, mitigating the effect of
contention. At the same time, once packets turn off of A, they
turn onto less heavily loaded dimensions, so the effect of
TABLE I: Dynamic zone routing masks for selected system sizes used in this paper.

Zone ID   Description                                16 racks (16x8x8x8x2)   4 racks (8x4x8x8x2)   1 rack (4x4x4x8x2)
0         Longest-to-shortest                        {A}{BCD}{E}             {ACD}{B}{E}           {D}{ABC}{E}
1         Unrestricted                               {ABCDE}                 {ABCDE}               {ABCDE}
2         Deterministic ordering                     {A}{B}{C}{D}{E}         {A}{C}{D}{B}{E}       {D}{A}{B}{C}{E}
3         Add E to the first zone of det. ordering   {AE}{B}{C}{D}           {AE}{C}{D}{B}         {DE}{A}{B}{C}
multiple dimensions turning onto B, for example, is less
severe than it otherwise would be. For 16K nodes, there is a single longest dimension of length 16, which is twice as long as the next-longest dimensions. Since zone ID 2 and deterministic ordering also route the longest dimension first, their performance is similar to that of zone ID 0. On the more symmetric 4K nodes, with three longest dimensions of length 8, dynamic routing is able to more effectively distribute traffic throughout the network than deterministic routing.

[Figure 2: All-to-all performance as a percentage of peak, for dynamic and deterministic routing on 4- and 16-rack systems. Submessage size 4 KB.]
We ran the all-to-all performance test on a wide range of system sizes, from 512 nodes up to 16384 nodes. All-to-all results for systems up to 2048 nodes were reported in [5] and are included along with the larger systems in Table II. Table II shows that as system size grows, the network is capable of sustaining excellent all-to-all bandwidth, from 85% to 95% of peak, using a longest-to-shortest dimension dynamic zone-routing approach. The PAMI implementation uses an algorithm that sprays traffic using zone ID 1 for systems of 512 nodes and smaller, and it uses zone ID 0 for larger systems.
IV. BISECTION BANDWIDTH
For a torus of N nodes with longest dimension of length L, the bisection bandwidth is (N/L) × 4 × B, where B is the bandwidth of a single unidirectional link.
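The formula can be checked numerically with a small helper (our own, not part of the SPI test; B = 2 GB/s is the raw unidirectional link rate from Section II):

```python
# Sketch of the bisection-bandwidth formula above: (N / L) * 4 * B for an
# N-node torus with longest dimension L and unidirectional link bandwidth B.

def torus_bisection_bw(dims, b=2.0):
    """dims: torus dimension lengths; b: raw unidirectional link BW in GB/s."""
    n = 1
    for d in dims:
        n *= d
    return n / max(dims) * 4 * b

# 4 racks, 8x4x8x8x2 = 4096 nodes, longest dimension 8:
print(torus_bisection_bw((8, 4, 8, 8, 2)))   # 4096/8 * 4 * 2.0 = 4096.0 GB/s
```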
A. Diagonal and Furthest-Node Pairings
One type of communication pattern which is useful for
evaluating the effectiveness of an interconnection network at
sustaining its bisection bandwidth is the bisection pairing. In
a bisection pairing each node in the network communicates
with exactly one other node, no two nodes communicate with the same node, and each source-destination pair crosses the
bisection of the network exactly once. In this paper we
evaluate two such challenging pairings, referred to as the
diagonal and furthest-node pairings, as described below.
- Diagonal pairing: each node communicates with the node which is a reflection across the midpoint of each dimension. In each dimension the node with index i communicates with the node with index L-i-1, where L is the length of the dimension. On a mesh, these pairings are such that if you draw a line between each pair, they all pass through the center of the mesh.
- Furthest-node pairing: each node communicates with the node which is the maximum number of hops away.

We ran an SPI-level bisection performance test on 1-rack (1024 node) and 4-rack systems, using dynamic routing zone IDs 0-3 as well as deterministic routing, and the results are presented in Table III for diagonal pairing and Table IV for furthest-node pairing. We also vary the pacing of the message between nodes by changing the window submessage size. This has the beneficial effect of preventing the network from over-saturating and causing performance to deteriorate. Based on Tables III and IV, we observe that using a pacing window size of 8 KB gives the best performance across all zone IDs, so throughout the rest of the paper we limit our results to this pacing window size. The bisection performance as a percentage of peak is significantly better on one rack than on four, especially for the more challenging diagonal pairing. This is due to the fact that there is a single long dimension in the one-rack system, so that, as discussed in Section III, packets are prevented from turning onto that long dimension. For more symmetrical system sizes with more than one long dimension, it is not possible to completely eliminate packets turning onto at least one of the long dimensions.

On the 4-rack system, the best routing for the diagonal pairing is zone ID 3, since it maintains high performance across a wide range of window sizes. For the furthest-node pairing, the best performance is achieved with zone ID 0, since this pairing naturally has a much more evenly distributed traffic pattern, equally utilizing all of the links, similar to the all-to-all case, so that the standard longest-to-shortest dynamic routing performs quite well. Conversely, the diagonal pairing does not evenly utilize the links, so that dynamic routing inadvertently concentrates the traffic on a relatively small number of links, including bisection links. By definition, in order to obtain a high percentage of the peak bisection bandwidth, all of the bisection links must be utilized. Deterministic (and deterministic-ordered dynamic) routing forces some of the traffic around the hot spots and mitigates the congestion significantly.
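The two pairings can be generated per dimension as a quick sketch. The helper names are ours; the furthest-node rule shown, (i + L/2) mod L in each torus dimension, is one natural reading of "maximum number of hops away":

```python
# Sketch of the two bisection pairings described above, applied per dimension:
# diagonal reflects each coordinate across the midpoint (i -> L-i-1), and
# furthest-node moves the maximum torus distance (i -> (i + L/2) mod L).

def diagonal_partner(coords, dims):
    return tuple(L - i - 1 for i, L in zip(coords, dims))

def furthest_partner(coords, dims):
    return tuple((i + L // 2) % L for i, L in zip(coords, dims))

dims = (8, 4, 8, 8, 2)                           # 4-rack system
print(diagonal_partner((1, 0, 3, 7, 0), dims))   # (6, 3, 4, 0, 1)
print(furthest_partner((1, 0, 3, 7, 0), dims))   # (5, 2, 7, 3, 1)
```

Both maps are involutions, so applying them twice returns the original node, which is what makes each a valid pairing.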
A key observation is that there are some source-destination
pairs in the diagonal pairing which have only one minimal
path between them (i.e., a single hop in each dimension), and there are other pairs which have many possible paths between
them. Of those paths, some overlap with the close pairs, and
others avoid using the same links. We next explore whether it
can be beneficial to use different zone IDs for different
partners in order to diffuse the hot spots in the network.
B. Flexibility Metric
In order to differentiate between the pairs with varying numbers of minimal paths between them, we introduce the
TABLE II: All-to-all performance, as a percentage of peak, for zone ID 0 dynamic routing and 4 KB submessage size, as a function of system size.

# Nodes           512   1024   2048   4096   16384
Performance (%)    95     92     94     85      91
flexibility metric:

    F = Σ_{i=0}^{D-1} h_i / (L_i / 2),
where h_i is the number of hops in dimension i for the given source-destination pair; L_i/2 is half the length of dimension i (i.e., the maximum number of hops in a torus using minimal-path routing); and D is the number of dimensions in the network. In our implementation dimension E is length 2 for all system sizes and thus can be ignored. Since h_max = L_i/2, F_max = D = 4 in this case. Furthermore, all traffic for the furthest-node pairing has F = F_max, since each message in that pairing travels
the maximum distance in the torus. In general, there are a
relatively small number of possible values of F for a given size
system and communication pattern.
On a system size of 4 racks, the size of the network is
8x4x8x8x2. For the diagonal pairing on a torus, each packet
takes an odd number of hops in each dimension. So on a
dimension of length 4, all packets travel exactly 1 hop; on a
dimension of length 8, either 1 or 3 hops. This means that the value of F for a dimension of length 4 is 0.5, and the two
possible values of F for a dimension of length 8 are 0.25 and
0.75. So for this configuration, there are four possible sums of
F for the diagonal pairing: 1.25, 1.75, 2.25, and 2.75. Our
scheme uses two thresholds, Th and Tl, to choose between zone IDs. For source-destination pairs with F < Tl or F >= Th, zone ID 0 is used; for pairs with Tl <= F < Th, the deterministic-ordering zone ID is used.
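The metric and threshold scheme can be sketched as follows. The helper names are ours, and mapping intermediate-flexibility pairs to the deterministic-ordering zone ID (ID 2) follows the description in the text:

```python
# Sketch of the flexibility metric and the two-threshold zone selection;
# dimension E (always length 2) is ignored, as in the text.

def flexibility(hops, dims):
    """F = sum over dimensions of h_i / (L_i / 2), skipping E (the last dim)."""
    return sum(h / (L / 2) for h, L in zip(hops[:-1], dims[:-1]))

def pick_zone(f, t_lo, t_hi):
    """Longest-to-shortest (zone ID 0) for extreme F; deterministic
    ordering (zone ID 2) for intermediate F."""
    return 0 if (f < t_lo or f >= t_hi) else 2

# 4 racks, 8x4x8x8x2; a diagonal pair taking (1, 1, 3, 1) hops in A, B, C, D:
dims = (8, 4, 8, 8, 2)
f = flexibility((1, 1, 3, 1, 1), dims)
print(f)                        # 0.25 + 0.5 + 0.75 + 0.25 = 1.75
print(pick_zone(f, 1.25, 2.5))  # intermediate flexibility -> zone ID 2
```

Note that 1.75 is one of the four possible diagonal-pairing sums listed above (1.25, 1.75, 2.25, 2.75), so the thresholds partition a small, known set of values.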
As on 4 racks, the 16-rack performance is much more sensitive to the value of Tl than Th. Trends seen in the smaller systems also apply to the larger 16-rack system. Routing messages with very low or very high flexibility with a longest-to-shortest zoned approach, while routing messages with intermediate flexibility with a deterministic-ordering approach, can provide better performance than either approach alone.

A form of pacing with the flexibility metric has been implemented in PAMI, and is thus used by MPI. Pacing is controlled by a thread on the seventeenth core, and the flexibility metric thresholds are chosen differently depending on the system size. Default settings can be overridden using environment variables, permitting users to tune and optimize their codes.
C. Random Pairing
An important benchmark to evaluate the performance of an
this benchmark, each node is randomly paired with another
node in the system. Each node in the network communicates
with exactly one other node; no two nodes communicate with
the same node. As with all-to-all, the expected per-node peak bandwidth is (8/L) × 1.8 GB/s.
(s,k)-random pairing benchmark: Since the pairs are
determined randomly and the aforementioned calculation only
yields the peak bandwidth in expectation, it only serves as an
upper bound. There can be local hot-spots due to the
randomness of the pair selections, and this smooths out as the number of pairs increases and eventually approaches a true all-to-all communication pattern. Thus, in order to get a better
idea of the performance, we extend this benchmark as follows.
We define an (s,k)-random pairing wherein each node utilizes
s cores and each core communicates with k random partners
on different nodes. Thus every node communicates with sk
other nodes. Note that the (1,1)-random-pairing benchmark is
equivalent to the random-pairing benchmark. The expected peak data-per-node BW is the same as before, i.e., (8/L) × 1.8 GB/s.
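A pairing of this shape can be generated as a small sketch. The paper does not specify the generator, so this is purely illustrative; for simplicity we only enforce that each node's s·k partners are distinct remote nodes:

```python
# Illustrative generator for an (s,k)-random pairing as defined above: each
# node uses s cores and each core picks k random partners on other nodes,
# so every node communicates with s*k distinct peers. (Our simplification;
# the paper does not specify how the random pairs are drawn.)

import random

def sk_random_pairing(num_nodes, s, k, seed=0):
    rng = random.Random(seed)
    pairing = {}
    for node in range(num_nodes):
        others = [n for n in range(num_nodes) if n != node]
        pairing[node] = rng.sample(others, s * k)   # s*k distinct remote nodes
    return pairing

p = sk_random_pairing(num_nodes=64, s=4, k=2)
print(len(p[0]), len(set(p[0])), 0 in p[0])   # 8 8 False
```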
We ran our SPI-based random-pairing tests on systems of 1
rack and 4 racks. In this test, we exchanged 1 MB of data
between each pair in the (s,k)-random pairing. To explore the
effect of zone routing, we ran the test using dynamic routing zone IDs 0 through 3 as well as using deterministic routing.
These numbers are presented in Table VIII for s=16 and k=16.
All the tests were performed using pacing with a window size
of 8KB. We observe that the best results were obtained with
zone ID 1 routing. We believe that local hotspots are more
easily avoided using the unrestricted dynamic routing of zone
ID 1 compared to the longest-to-shortest routing of zone ID 0.
We also ran the tests with s=16 and k=1,2,4,8,16 in order to
study the effect of increasing communication partners on the
performance. The results are shown in Table IX; these were
obtained with zone ID 1 routing and with pacing. As expected,
performance steadily improves as the number of
communication partners increases. With (s,k) = (16,16),
performance goes as high as 77% on 4096 nodes. We also see
that performance on 4096 nodes is significantly better than on
1024 nodes. On the larger system, this is probably due to the
more symmetric topology, more opportunity for dynamic
routing to avoid hotspots, and a smaller likelihood of selecting
adversarial pairings such as multiple collinear pairs.
D. Reverse
The reverse benchmark evaluates the performance of the
interconnection network at sustaining bisection bandwidth on
an irregular communication pattern. In this benchmark, a node with MPI rank X communicates with the node having rank Y, where the coordinates of Y in each dimension are obtained by reversing the bit pattern of the corresponding coordinate of X; i.e., for any dimension A and bit i (i = 0, 1, ..., log2(LA) - 1), the i-th bit of Y along dimension A is the same as the (log2(LA) - i - 1)-th bit of X along dimension A, where LA is the length of dimension A. The peak
performance for this benchmark is calculated by examining
the central cut along the longest dimension. For 4 racks
TABLE VI: Percentage of peak bisection, for single-zone-ID routing, for diagonal and furthest-node pairing on 16384 nodes. Pacing window 8 KB.

Zone ID          0    1    2    3   Det.
Diagonal        71   62   77   91    85
Furthest-node   95   83   93   76    92

TABLE VII: Percentage of peak bisection, for selected combinations of flexibility metric thresholds, for diagonal pairing on 16384 nodes. Pacing window 8 KB.

(Tl, Th)       1.0,1.5   1.0,1.75   1.0,2.0   1.0,2.25   1.0,2.5   1.0,2.75   1.0,3.0
Performance       87        85         91        92         93        92         92

(Tl, Th)       1.25,1.5  1.25,1.75  1.25,2.0  1.25,2.25  1.25,2.5  1.25,2.75  1.25,3.0
Performance       85        85         94        92         94        93         93

(Tl, Th)       1.5,2.25  1.5,2.5   1.5,2.75   1.5,3.0
Performance       72        72        72         72

TABLE VIII: Random-pairing performance as a percentage of peak for 1- and 4-rack systems with different routing schemes, with s=16 and k=16. Pacing window 8 KB.

Number of Nodes   Zone ID 0   Zone ID 1   Zone ID 2   Zone ID 3   Det.
1024                 56          67          54          57        51
4096                 70          77          45          47        38

TABLE IX: Random-pairing performance as a percentage of peak bandwidth for 1- and 4-rack systems using zone ID 1 routing with s=16. Pacing window 8 KB.

Number of Nodes   k=1   k=2   k=4   k=8   k=16
1024               50    57    58    64    67
4096               65    66    72    75    77
(8x4x8x8x2), the longest dimension is of size 8, which is represented by 3 bits. The node pairs that communicate with each other are the pair [1 (001), 4 (100)] and the pair [3 (011), 6 (110)]. Note that both of these communicating pairs use the link between nodes 3 and 4 (they do not use the diametrically opposite link of the torus). Thus when we look at the cut across the longest dimension, the total amount of data passing through the cut is twice the data generated on each node. Therefore the peak data-per-node BW is 1.8/2 GB/s. Similarly, for 1 rack (4x4x4x8x2), the longest dimension is 8, and hence the peak data-per-node BW is again 1.8/2 GB/s.
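The bit-reversal partner can be computed with a short sketch (our own helper names; dimension lengths are assumed to be powers of two, as in the configurations above):

```python
# Sketch of the reverse pairing above: the partner's coordinate in each
# dimension is the bit-reversal of the node's coordinate in that dimension.

def bit_reverse(x, bits):
    y = 0
    for _ in range(bits):
        y = (y << 1) | (x & 1)
        x >>= 1
    return y

def reverse_partner(coords, dims):
    return tuple(bit_reverse(c, L.bit_length() - 1) for c, L in zip(coords, dims))

# In a dimension of length 8 (3 bits): 1 (001) <-> 4 (100), 3 (011) <-> 6 (110).
print(bit_reverse(1, 3), bit_reverse(3, 3))                  # 4 6
print(reverse_partner((1, 3, 0, 7, 1), (8, 4, 8, 8, 2)))     # (4, 3, 0, 7, 1)
```

Bit-reversal is its own inverse, so, as with the other pairings, applying the map twice returns the original node.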
We ran our SPI-based reverse-pairing tests on systems of 1 rack and 4 racks. In this test, we exchanged 1 MB of data between the communicating pairs. To explore the effect of zone routing, we ran the test using dynamic routing zone IDs 0 through 3 as well as using deterministic routing. The results are shown in Table X. We observe that on 4096 nodes, the performance with dynamic routing zone IDs 2 or 3 is approximately 75% of the peak. The performance of the flexibility metric approach is between that of zone IDs 0 and 3, as expected. On 1024 nodes, the performance is very consistent across the different zone routings and reaches 95% of the peak.
E. Transpose
In the transpose benchmark, the nodes on the network form a
virtual 2D square matrix where each node (x,y) is paired with the node (y,x). Diagonal nodes (x,x) do not participate in this pairing communication operation. On the 5D BG/Q torus network, the 2D mesh is overlaid on the dimensions of the 5D torus. Depending on how the processes are mapped to the dimensions of the 5D torus, it may be possible to fold the dimensions of the 5D torus to form a 2D mesh. For example, on 1024 nodes, when dimensions A,B,C,D,E have sizes 4x4x4x8x2 respectively, a 32x32 virtual mesh can be formed as {CD}x{ABE} when the CDABE mapping is used. Other mappings such as ABCDE may result in a dimension (C) being shared by both the X and Y dimensions of the mesh.
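The {CD}x{ABE} fold for 1024 nodes can be sketched as follows. The exact coordinate ordering inside each folded axis is our assumption for illustration; the point is that both axes multiply out to 32:

```python
# Sketch of folding the 5D torus into a virtual 2D mesh for the transpose
# pairing: for 1024 nodes (sizes 4,4,4,8,2), X = {C,D} (4*8 = 32 columns)
# and Y = {A,B,E} (4*4*2 = 32 rows); the transpose partner of (x,y) is (y,x).

def fold_cdabe(a, b, c, d, e):
    """Map 5D coords (sizes 4,4,4,8,2) to a 32x32 virtual mesh."""
    x = c * 8 + d             # {CD}: 4*8 = 32 columns
    y = (a * 4 + b) * 2 + e   # {ABE}: 4*4*2 = 32 rows
    return x, y

def transpose_partner(x, y):
    return (y, x) if x != y else None   # diagonal nodes do not participate

x, y = fold_cdabe(1, 2, 3, 5, 0)
print((x, y), transpose_partner(x, y))   # (29, 12) (12, 29)
```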
As shown in [15], the transpose pairing is a challenging communication operation that can cause hotspots along the diagonal nodes in the 2D mesh. On a 5D torus with static deterministic routing, packets will converge towards the hotspot diagonal nodes, resulting in lower overall throughput. We developed a simple program to compute the load on the links for the transpose communication pattern with deterministic routing, and the results are presented in Table XI.

Observe that with deterministic routing, links around the hotspots have several messages passing through them and the achievable percentage of peak is quite small. With adaptive routing, where the torus routers send packets along the least loaded links, significant improvement in performance can be expected. Table XII shows the percentage of bisection throughput achieved with dynamic routing on zone ID 1 and deterministic routing for the transpose operation, using a pairing test written in SPI. The percentage is adjusted to account for the fact that only N - sqrt(N) nodes participate in the transpose operation. Observe that adaptive routing with zone ID 1 achieves higher throughput than deterministic routing, as it can smooth the network load around hotspots. We also observed better performance when the 5D torus can be folded to form the 2D mesh; note that in Table XII, mapping CDABE performs better than ABCDE. Zone ID 1 performs best as it has the most flexibility in moving packets around hotspots. The other zone IDs 0, 2, and 3 achieve throughputs between deterministic routing and zone ID 1, as does the flexibility metric approach.
V. GUPS
A. Introduction
Random access performance of the memory subsystem is critical to many applications. The HPCC suite includes the Random Access benchmark, which measures the capability of a system to generate and apply updates to random locations in memory. On the earlier machines, Blue Gene/L and Blue Gene/P [13], 3D bucketing algorithms were designed to amortize the transfer costs by aggregating multiple updates into a single bucket. Such techniques lower the software costs of injection and reception of the updates and also help in better utilization of the network. The performance of the benchmark is measured in GUPS and is bounded by the bisection performance of the network, although other factors such as software overhead can be the bottleneck. Further, the total look-ahead depth for aggregation is restricted to 1024 updates per process, or 8192 bytes with eight bytes per update, limiting the size of the buckets used.
B. GUPS design on BG/Q
The benchmark is run with sixteen processes per node, one process per core, with each process utilizing four threads. Of the four threads, two are completely dedicated to software routing and the other two are used for generating the
TABLE X: REVERSE PERFORMANCE AS A PERCENTAGE OF PEAK FOR 1- AND 4-RACK SYSTEMS WITH DIFFERENT ROUTING SCHEMES. PACING WINDOW 8 KB.

  Number of Nodes   Zone ID 0   Zone ID 1   Zone ID 2   Zone ID 3   Det.
  1024              94          93          94          94          94
  4096              65          65          75          75          53
TABLE XI: TRANSPOSE PAIRING LOAD ON TORUS NETWORK LINKS WITH STATIC ROUTING.

  Nodes   Routing Dimension Order   Rank-to-Coord Mapping   Max link load   Predicted % of bisection throughput
  1024    DBCAE                     ABCDE                   4               50%
  1024    DBCAE                     CDABE                   4               50%
  4096    ADCBE                     ABCDE                   16              12.5%
TABLE XII: TRANSPOSE PERFORMANCE WITH DETERMINISTIC AND ADAPTIVE ROUTING.

  Nodes   Rank-to-Coord Mapping   % of Bisection Throughput, Adaptive Routing zone ID 1   % of Bisection Throughput, Deterministic Routing
  1024    ABCDE                   83%                                                     31%
  1024    CDABE                   89%                                                     41%
  4096    ABCDE                   74%                                                     13%
updates and applying the updates. The salient features of the
new design are the following.
1) Software routing for the five dimensional torus: Because BG/Q has multiple threads per core, there exist new bucketing opportunities. In addition, the 5D torus permits larger buckets compared to a 3D torus with the same number of nodes. For example, in a 64K-node 64x32x32 3D system the process handling the longest dimension has 64 buckets, whereas a 16x16x16x8x2 5D system has at most 16 buckets per dimension.
In the design proposed in this paper, a process is required
to route traffic from only one incoming dimension to only one
outgoing dimension. This greatly reduces the number of
buckets thus allowing for more aggregation. For example, on
the largest machine, the number of send buckets utilized
would only be around 16. The basic idea is to aggregate all the
updates from the processes on a node and then route them
along the dimensions of the torus. Once the updates reach the
final destination node, they are scattered to their respective
processes. Also, the packets are always routed from the shorter to the longer dimensions to increase message aggregation and to avoid any cyclic dependencies. In a 16-rack system, the E dimension is the shortest and the A dimension is the longest.
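The shortest-to-longest ordering can be sketched as follows, using the full-machine dimension sizes 16x16x16x8x2 from above. The tie-break among the three size-16 dimensions is chosen here simply to reproduce the E, D, C, B, A hop order; the helper is illustrative:

```python
# Order the torus dimensions from shortest to longest, as the software
# router does to maximize aggregation and avoid cyclic dependencies.
# Sizes for the full 96-rack machine: A=16, B=16, C=16, D=8, E=2.
dims = {"A": 16, "B": 16, "C": 16, "D": 8, "E": 2}

# Sort by size ascending; break ties in reverse-alphabetical order so
# the sequence matches the paper's E -> D -> C -> B -> A ordering.
order = sorted(dims, key=lambda d: (dims[d], -ord(d)))
print(order)  # ['E', 'D', 'C', 'B', 'A']
```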
2) Translating communication parallelism into GUPS performance: The MU provides a high level of parallelism
within a node with multiple injection and reception FIFOs
operating concurrently on different messages. For example,
within a single process, multiple threads can send and receive
messages on separate hardware FIFOs eliminating the need for
shared locks. PAMI on BG/Q exposes this concurrency in the
form of higher level abstractions such as contexts. Further, these threads can be pinned to a specific context and are addressed using end-points. A complete discussion of these concepts is given in [12]. Our design of GUPS uses these
PAMI concepts as building blocks and the entire algorithm is
implemented in the pre-registered message handlers.
Figure 3. Routing along E planes. (The figure shows the sixteen local ranks on each node, in both the E=0 and E=1 planes, divided into routing sets {0,1,2}, {3,4,5}, {6,7,8}, {9,10,11}, and {12,13,14,15}, with routing functions 1: E to D, 2: D to C, 3: C to B, 4: B to A, and 5: A to T.)
Our new design harnesses communication parallelism by
allowing threads in more than one process to route in the same
dimension. Processes belonging to one routing set drain
packets from the reception FIFOs of a lower dimension and
route to the routing set of processes of a higher dimension.
Further, each process spawns two independent routing threads
working in parallel, for a total of 32 routing threads per node.
Figure 4. Dimension-ordered routing in the routing sets (routing sets 1-5 shown along the D and C dimensions).
3) Detailed illustration of the parallel software routing: The initial routing step is explained as follows. As shown in
Figure 3, the sixteen processes on a node are divided into five
routing sets. All these processes, after generating the updates,
route to routing set 1, comprised of processes with local ranks
{0, 1, 2}. The other routing sets numbered from two to five are
also shown in Figure 3. As explained below these are used for
routing along the remaining dimensions of the network, D to
A. The T dimension is the local dimension, and processes in
routing set 5 with local ranks {12, 13, 14, 15} are used in the last step of the software routing and forward the updates to all
sixteen processes within the node. Note that only the first
thread of these processes is used to generate the updates. Apart
from generating the updates, the thread also maintains two
buckets, corresponding to the E = 0 and E = 1 plane. All the
updates are aggregated into these buckets before sending to
the processes of routing set 1. As shown in Figure 3, processes
in the E = 0 plane communicate with routing set 1 of E = 1
plane via the network. For communicating to the processes in
the same plane, the updates utilize shared memory. Note that
in the initial phase of the algorithm, thread 0 of each process
communicates to threads 1 and 2 of the processes belonging to
routing set 1 in order to aggregate all the updates on a node.
By careful mapping, we allow for uniform distribution of
updates to each of the routing threads belonging to the three
processes of a routing set.
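The division of a node's sixteen local ranks into routing sets, as shown in Figure 3, can be written down directly (a sketch; the lookup helper is ours):

```python
# Divide the 16 processes on a node into the five routing sets of
# Figure 3: sets 1-4 have three processes each, and set 5 has four
# (the T-dimension set that scatters final updates locally).
routing_sets = {
    1: [0, 1, 2],        # routes E -> D
    2: [3, 4, 5],        # routes D -> C
    3: [6, 7, 8],        # routes C -> B
    4: [9, 10, 11],      # routes B -> A
    5: [12, 13, 14, 15], # routes A -> T (local scatter)
}

def routing_set_of(local_rank):
    # Illustrative lookup: which routing set a local rank belongs to.
    for s, ranks in routing_sets.items():
        if local_rank in ranks:
            return s

# All 16 local ranks are covered; with two routing threads per process
# this yields the 32 routing threads per node mentioned above.
print(routing_set_of(7))  # 3
print(sum(len(r) for r in routing_sets.values()))  # 16
```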
The remaining routing steps traverse the dimensions of the network in the order DCBA. A further optimization to generalize the algorithm for any arbitrary system configuration would be to go from the shortest to the longest dimension to get the most aggregation of the updates. However, it is to be noted that
on a complete 96-rack machine, the ordering required is the same as in this paper. Figure 4 shows two hops of this routing, first along the D dimension and then along the C dimension. As indicated in Figure 4, the packets injected by routing set 1 are received by the processes belonging to routing set 2. For the C dimension, updates travel from routing set 2 to 3.
C. Performance evaluation
The performance of the Random Access benchmark is tightly coupled to the bucket size used for message aggregation. In the following we describe the calculation used to obtain the bucket sizes. We first enumerate the types of buckets used per process in our design:
1) Issue send buckets: Used by the issue thread 0, which generates the random numbers and sends updates along the E dimension. There are two issue send buckets, one for each E plane.
2) Routing send buckets: Used by the routing threads 1 and 2 to send along a given dimension. The number of routing send buckets is the same as the dimension size.
3) Routing receive buckets: Used by routing threads 1 and 2 to receive updates. There is one routing receive bucket to process data received in the active message handler.
4) Final update receive buckets: There is one final update receive bucket that is used by the update thread 3 to receive the final updates.
An issue send bucket size of 512 B was experimentally
determined to maximize performance. Similarly, the final
update receive bucket size was experimentally selected at 256
B. The benchmark allows 8 KB of total bucket memory space,
thus the remaining space for the routing send and receive
buckets is (8192 - 512 - 256) = 7424 B, or 3712 B for each of
the two routing threads. A routing send bucket is required per
node along a dimension, as well as a single receive bucket. Thus each routing send and receive bucket is 3712/(dimension_size + 1) bytes, as there are dimension_size sending buckets and one receive bucket.
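The bucket-size arithmetic above works out as follows (a direct transcription of the numbers in the text):

```python
# Bucket-size arithmetic: 8 KB of bucket memory per process, minus the
# fixed issue-send and final-receive buckets, split between the two
# routing threads, then shared by dimension_size + 1 buckets.
TOTAL = 8192        # bytes of bucket memory allowed per process
ISSUE_SEND = 512    # experimentally chosen issue send bucket size
FINAL_RECV = 256    # experimentally chosen final update receive bucket

remaining = TOTAL - ISSUE_SEND - FINAL_RECV   # 7424 B
per_thread = remaining // 2                   # 3712 B per routing thread

def routing_bucket_size(dimension_size):
    # dimension_size send buckets plus one receive bucket share the space
    return per_thread // (dimension_size + 1)

print(remaining, per_thread, routing_bucket_size(16))  # 7424 3712 218
```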
Since GUPS follows an all-to-all kind of pattern and there are 8 bytes per update, the network bound on updates per second per node is ((8/L)*B)/8 = B/L, where B is the peak link bandwidth obtained after adjusting for the per-packet overhead used in the software routing. For example, for a packet size of S bytes, B = S/(S+52) * 2.0 GB/s, where 52 is the total number of bytes used in the header, trailer and acknowledgement of the packet. S is determined from the bucket sizes used. From 1 to 8 racks, the network bound is over 200 million updates per node per second, and it is 100 million updates per node per second from 16 racks up to the full system size. From experimental evaluation, we observed that the performance achieved on a single node is 106 million updates per second. Since each update requires a read and a write of 128 B, this corresponds to an off-chip memory bandwidth of 27.6 GB/s. We use 106 million updates per second as the memory system hardware limit.
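The per-packet efficiency factor B can be evaluated directly from the formula above (the 512 B packet size used here is an illustrative choice, not a value from the text):

```python
# Effective per-link bandwidth after packet overhead:
# B = S/(S+52) * 2.0 GB/s, where 52 B covers the header, trailer, and
# acknowledgement of the packet (from the text).
def effective_bandwidth_gbs(packet_bytes):
    return packet_bytes / (packet_bytes + 52) * 2.0

# e.g. a 512 B packet retains about 91% of the 2.0 GB/s raw link rate
b = effective_bandwidth_gbs(512)
print(round(b, 3))  # 1.816
```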
Table XIII reports the total GUPS, the update rate per node, and the hardware bound per node, which is the minimum of the network and memory bounds, for system sizes of 1 to 16 racks. For 16 racks, we achieved 858.1 GUPS, which is 52.4% of the hardware bound. In our experience, the bottleneck to higher GUPS performance is the processing cost of software routing over the five network dimensions and one local dimension.
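As a consistency check, the per-node rates in Table XIII follow from total GUPS divided by the node count:

```python
# Per-node update rate from Table XIII: GUPS * 1e9 / nodes, shown in
# millions of updates per node per second.
table = {1024: 47.3, 2048: 95.2, 4096: 184.7, 8192: 485.9, 16384: 858.1}
for nodes, gups in sorted(table.items()):
    per_node = gups * 1e3 / nodes  # GUPS -> million updates/s per node
    print(nodes, round(per_node, 1))
# At 16384 nodes: 52.4 million updates/node/s, i.e. 52.4% of the
# 100 M updates/node/s network bound at that system size.
```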
VI. OPTIMIZATION OF APPLICATIONS USING COMMUNICATION THREADS
Each core of the BG/Q node has four hardware threads that
share the core's resources. In applications that have high
communication overheads, one of these threads can be
dedicated to accelerate communication. We observed that in
some hybrid applications that use OpenMP within the node
and MPI to communicate across nodes, best performance is
achieved with two or three hardware threads per core for
computation. This is because OpenMP overheads may cancel the benefit of the additional threads for computation. MPI
libraries on Blue Gene/Q can enable one or two
communication threads per core to optimize the above
mentioned scenarios. In addition to improving the overall
messaging performance, communication threads also enable
independent progress in the messaging stack that can be highly
advantageous for asynchronous communication.
Past work [12] shows that we achieve a message rate of
107 million messages per second via the PAMI API and 20.9
million messages per second via MPI using 32 processes per
BG/Q node. MPI has higher overheads than PAMI, as
messages have to be matched on the receiver, based on the tag and the source rank. MPI libraries enable communication
threads to accelerate the message processing, so that
applications can take advantage of the high message rate
available in the BG/Q torus network. With 8 processes per
node, MPI libraries achieve a message rate of ~10 million
messages per second with communication threads, and ~8
million messages per second without communication threads.
The difference with or without communication threads is more
pronounced when there are fewer processes per node.
To demonstrate the benefits of communication threads we
present case studies with two linear algebra applications,
Algebraic Multi-Grid (AMG) and an iterative Poisson solver.
Both are weak-scaling applications, where the problem size is increased in proportion to the increase in
the number of cores. They also send and receive several
messages of different sizes in each iteration.
The AMG method is used to iteratively solve partial
differential equations using a hierarchy of grids with different
resolutions. The communication pattern is dense near neighbor
where processes send and receive hundreds of messages in
each iteration of the solver, as processes having coarse grid
points must communicate with a number of processes that
have fine grid points. The size of the messages can vary from
a few bytes to several hundred KB. To achieve high throughput, AMG requires the messaging libraries to sustain high message rates. On BG/Q we achieve these high messaging
rates by enabling communication threads. We ran the AMG
benchmark from the Sequoia Benchmark suite [9] with
refinement levels of 8x8x8 using solver 3, which uses a
preconditioned generalized minimum-residual iterative
method. We measured performance with and without
communication threads. Table XIV presents the application
throughput computed as a Figure Of Merit (FOM =
system_size * iterations / iteration_time). These measurements
used four MPI processes per node, and three threads per core
for a total of 12 OpenMP threads per process, leaving one
hardware thread per core available for communication threads.
Note that even without communication threads, the best performance achieved in AMG is with three threads per core
(i.e. exchanging the communication thread for an additional
OpenMP thread does not improve performance), possibly
because OpenMP overheads cancel the gains from an
additional SMT thread. For example, on 512 nodes the FOM
achieved with both 3 and 4 threads per core is 1.38e9. The
performance improvement in the overall solver time due to
communication threads is between 3.3 and 6.2%.
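The percentage gains in Table XIV follow from comparing the two FOM columns (FOM = system_size * iterations / iteration_time); a quick check of the arithmetic:

```python
# Percent gain from communication threads, per Table XIV:
# gain = (FOM_with - FOM_without) / FOM_without * 100.
rows = [(512, 1.38e9, 1.45e9), (1024, 2.42e9, 2.57e9), (2048, 4.27e9, 4.41e9)]
for nodes, without_ct, with_ct in rows:
    gain = (with_ct - without_ct) / without_ct * 100.0
    print(nodes, round(gain, 1))
# 512 -> 5.1, 1024 -> 6.2, 2048 -> 3.3
# (the paper's Table XIV rounds the 512-node gain to 5.0)
```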
A more dramatic improvement was observed with a simple
iterative solver for Poisson's equation. This solver was used to
represent applications where the main communication pattern
is boundary exchange on a regular grid. The computational
performance of this solver is limited by memory bandwidth. As a result, the overall performance of the solver is optimized
by using two or three threads per core for computation,
leaving one or two threads per core available for
communication. Table XV shows the benefit of using
communication threads with this simple iterative solver. Note
that it is not meaningful to compare the step times between
differently sized systems, since the number of iterations
changes.
VII. RELATED MESSAGING STACK (PAMI) FEATURES
The SPI-level research on message pacing, the flexibility
metric, and commthreads described in this paper have been
incorporated into the messaging stack (MPI / PAMI) as described below. Similar performance results are observed.
Message pacing is controlled by an "agent" thread that runs
on the 17th core of each node. PAMI posts a given message to
the agent for pacing when the size of the block exceeds 1 rack,
and the message is larger than a W bytes (default 64KB), and
the destination node is more than H hops away (default 4) or
its ABCD coordinates differ from the source node's in more
than D dimensions (default 1). The agent multiplexes the
messages posted from the processes on the node and paces
them, controlling the amount of data in the network on behalf
of the node. The agent divides each message into windows of
size W and allows up to M simultaneous windows in the
network. The defaults for W and M vary based upon the block
size and are currently empirically determined but can be
overridden at job launch time by the user. The agent round
robins through its list of messages, injecting one window from
each message, pausing as needed to wait for a previous
window to finish in the network, maintaining up to M active
windows in the network. Software-controlled pacing leverages the many threads on the BG/Q system and provides more flexibility versus the added complexity of a hardware-based pacing implementation.
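The agent's eligibility test can be sketched as a predicate over the thresholds named above (function and parameter names are ours; the defaults W = 64 KB, H = 4, and D = 1 come from the text):

```python
# Sketch of the pacing agent's eligibility test: a message is posted to
# the agent when the block exceeds one rack AND the message exceeds W
# bytes AND (it travels more than H hops OR its ABCD coordinates differ
# from the source's in more than D dimensions).
W_BYTES = 64 * 1024   # default message-size threshold
H_HOPS = 4            # default hop-count threshold
D_DIMS = 1            # default differing-ABCD-dimensions threshold

def needs_pacing(block_racks, msg_bytes, hops, differing_abcd_dims):
    return (block_racks > 1
            and msg_bytes > W_BYTES
            and (hops > H_HOPS or differing_abcd_dims > D_DIMS))

# A 1 MB message crossing 6 hops in a 4-rack block is paced;
# the same message in a 1-rack block is not.
print(needs_pacing(4, 1 << 20, 6, 1))  # True
print(needs_pacing(1, 1 << 20, 6, 1))  # False
```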
In PAMI, the flexibility metric is used to determine the
network routing for point-to-point messages that exceed F
bytes (default 64KB). Routing can be deterministic, or
dynamic with zone ID 0, 1, 2, or 3. One of two routing
methods is used, depending on whether the metric between the
source and destination nodes is within the range or outside of
the range. The metric determines the routing for both paced
and non-paced messages. Messages F bytes or less are
deterministically routed by default. The default metric range, routing values, and threshold F vary based upon the block size
and can be overridden by environment variables. The default
configuration gives good performance for all of the bisection
pairings, with the exception of the extreme transpose pairing.
PAMI allows the user to override the default zone ID used at
job launch time with environment variables; or the user can
specify what zone ID to use on a message-by-message basis
through the use of SPI calls. Modifying the zone IDs
themselves (e.g. the dimension ordering) is possible but
requires system calls. Ongoing research is in progress to refine
this approach for when there are a large number of outstanding
communication partners.
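The size-threshold part of this selection can be sketched as follows (the routing labels and the metric-range values here are illustrative placeholders; only the F = 64 KB default comes from the text):

```python
# Sketch of PAMI's routing choice for point-to-point messages: messages
# of F bytes or less are deterministically routed; larger messages pick
# one of two configured routings depending on whether the flexibility
# metric falls inside the configured range. Names are illustrative.
F_BYTES = 64 * 1024   # default size threshold from the text

def select_routing(msg_bytes, metric, metric_range,
                   in_range_routing="dynamic zone 1",
                   out_of_range_routing="deterministic"):
    if msg_bytes <= F_BYTES:
        return "deterministic"          # small messages: deterministic
    lo, hi = metric_range
    return in_range_routing if lo <= metric <= hi else out_of_range_routing

print(select_routing(4096, metric=0.5, metric_range=(0.2, 0.8)))
# deterministic (below the F threshold)
print(select_routing(1 << 20, metric=0.5, metric_range=(0.2, 0.8)))
# dynamic zone 1
```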
PAMI has contexts and commthreads that enable parallel communication progress. Contexts are a method of dividing
messaging hardware resources so that parallel operations can
occur. Commthreads run on hardware threads not being used
by the application. The number of contexts and commthreads
is determined by PAMI when MPI is initialized. Each context
initially has its own commthread that makes progress on that
context. If the application creates threads on the same
hardware thread that is running a commthread, the
TABLE XIII: RANDOM ACCESS PERFORMANCE

  Nodes   Total GUPS   Million Updates per Node per Second   Hardware Bound (Million Updates per Node per Second)
  1024    47.3         46.2                                  106
  2048    95.2         46.5                                  106
  4096    184.7        45.1                                  106
  8192    485.9        59.3                                  106
  16384   858.1        52.4                                  100
commthread gives its context to another commthread and
yields to the application thread. Messages are associated with
a context based on the destination rank and MPI
communicator. This evenly distributes messages among the
contexts while maintaining MPI ordering semantics. MPI
posts a message to the contexts. The commthread picks it up,
makes progress on it, and completes it. The main application
thread finishes computing and finds the message completed.
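The association of a message with a context, as described, amounts to a deterministic mapping from (communicator, destination rank) to a context index; the scheme below is an assumed illustration, not PAMI's actual function:

```python
# Sketch: map (communicator id, destination rank) to one of N contexts.
# A deterministic function of these two values spreads messages across
# contexts while keeping every (comm, rank) pair on one context, which
# preserves MPI's per-pair message-ordering semantics.
NUM_CONTEXTS = 4  # illustrative; PAMI chooses the count at MPI init

def context_for(comm_id, dest_rank, num_contexts=NUM_CONTEXTS):
    return (comm_id + dest_rank) % num_contexts

# The same (comm, rank) pair always lands on the same context:
print(context_for(0, 5) == context_for(0, 5))  # True
# Different destination ranks spread across all contexts:
print(sorted({context_for(0, r) for r in range(8)}))  # [0, 1, 2, 3]
```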
There are two versions of the MPI library on BG/Q. One is
enabled for threaded operations and one is optimized for non-
threaded operations. To have commthreads, the application
must be linked with the thread-enabled MPI library and must
initialize MPI with MPI_THREAD_MULTIPLE.
VIII. CONCLUSIONS
The Blue Gene/Q integrated network offers programmable
zone routing control for dynamic (adaptive) routing. Using
low level SPI programming and default zone settings, all-to-all performance ranges from 85% to 95% of the theoretical peak
on 512 to 16K nodes. With 16K nodes, a software optimized
version of the Random Access benchmark from the HPCC
suite achieves a preliminary result of 858.1 GUPS. With this
result, we learned that careful hardware/software co-design
can lead to a thin but efficient software layer for message
aggregation, thus changing a short-message random
communication pattern into longer messages that perform well
on the torus topology.
We studied the performance of difficult bisection pairings.
Diagonal and furthest-node pairings each achieve good
performance, albeit with different zone ID settings. Thus, a
software-controlled flexibility metric routing mechanism is developed where different hardware routing algorithms are
selected depending on the distance messages travel. Both the
diagonal and furthest-node pairings achieve over 90% of peak
on 16K nodes. This shows the importance of having multiple
hardware routing options since different applications perform
optimally under different routing algorithms. The flexibility
metric enables the system to provide very good performance
with default settings but still allows individual applications the
opportunity to optimize further.
When running communication intensive applications using
MPI, it is often beneficial to have dedicated MPI
communication threads. The performance improvement ranges
from 3.3% to 6.2% for AMG, and 11.8% to 19.7% for an
iterative solver for Poisson's equation, on 512 to 2K nodes.
An initial implementation of the low-level PAMI API used
by MPI automatically manages pacing, the flexibility metric,
and communication threads. These settings can also be
overridden by the end user. The Blue Gene/Q architecture
provides the capability to fine-tune many hardware features.
How higher level libraries such as MPI can best exploit these
hardware features is an ongoing investigation.
ACKNOWLEDGMENTS
The Blue Gene project is a team effort. We would like to thank the entire IBM Blue Gene team for their contributions and support that made this work possible.
The Blue Gene/Q project has been supported and partially funded by Argonne National Laboratory and Lawrence Livermore National Laboratory on behalf of the U.S. Department of Energy, under Lawrence Livermore National Laboratory subcontract no. B554331. We acknowledge the collaboration and support of Columbia University and the University of Edinburgh.
REFERENCES
[1] A. Gara, M. A. Blumrich, D. Chen, G. L.-T. Chiu, P. W. Coteus, M. E. Giampapa, R. A. Haring, P. Heidelberger, D. Hoenicke, G. V. Kopcsay, T. A. Liebsch, M. Ohmacht, B. D. Steinmacher-Burow, T. Takken, and P. Vranas, "Overview of the Blue Gene/L system architecture," IBM Journal of Research and Development, vol. 49, no. 2/3, pp. 195-212, March/May 2005.
[2] IBM Blue Gene Team, "Overview of the IBM Blue Gene/P project," IBM Journal of Research and Development, vol. 52, no. 1/2, pp. 199-220, January/March 2008.
[3] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, P. A. Boyle, N. H. Christ, C. Kim, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, and G. L.-T. Chiu, "The IBM Blue Gene/Q Compute Chip," IEEE Micro, vol. 32, no. 2, pp. 48-60, Mar/Apr 2012.
[4] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue Gene/Q Interconnection Network and Message Unit," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis (SC 11), ACM Press, 2011, article 26.
[5] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue Gene/Q Interconnection Fabric," IEEE Micro, vol. 32, no. 1, pp. 32-43, Jan/Feb 2012.
[6] S. Scott and G. Thorson, "The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus," Proceedings of HOT Interconnects IV, August 1996, pp. 147-156.
[7] R. Alverson, D. Roweth, and L. Kaplan, "The Gemini System Interconnect," 18th IEEE Symposium on High Performance Interconnects, August 2010.
[8] Y. Ajima, Y. Takagi, T. Inoue, S. Hiramoto, and T. Shimizu, "The Tofu Interconnect," IEEE Micro, vol. 32, no. 1, pp. 21-31, Jan/Feb 2012.
[9] Sequoia Algebraic Multi Grid (AMG) benchmark, https://asc.llnl.gov/sequoia/benchmarks/#amg
[10] V. Puente, R. Beivide, J. A. Gregorio, J. M. Prellezo, J. Duato, and C. Izu, "Adaptive Bubble Router: A Design to Improve Performance in Torus Networks," Proceedings of the IEEE International Conference on Parallel Processing, September 1999, pp. 58-67.
TABLE XV: STEP TIME (SECONDS) FOR THE POISSON SOLVER KERNEL

  Nodes   Processes/Node   OMP Threads/Process   Step Time (s) w/o comm threads   Step Time (s) with comm threads   % Gain
  512     8                6                     3.682                            3.076                             19.7
  1024    8                6                     2.525                            2.258                             11.8
  2048    8                6                     5.784                            5.073                             14.0
TABLE XIV: FIGURE OF MERIT (FOM) FOR THE AMG APPLICATION

  Nodes   Processes/Node   OMP Threads/Process   FOM without Comm. Threads   FOM with Comm. Threads   % Gain
  512     4                12                    1.38e+9                     1.45e+9                  5.0
  1024    4                12                    2.42e+9                     2.57e+9                  6.2
  2048    4                12                    4.27e+9                     4.41e+9                  3.3
[11] S. Kumar, Y. Sabharwal, R. Garg, and P. Heidelberger, "Optimization of All-to-all communication on the Blue Gene/L supercomputer," Proceedings of the International Conference on Parallel Processing (ICPP), Portland, Oregon, 2008.
[12] S. Kumar, A. R. Mamidala, D. A. Faraj, B. Smith, M. Blocksome, B. Cernohous, D. Miller, J. Parker, J. Ratterman, P. Heidelberger, D. Chen, and B. Steinmacher-Burow, "PAMI: A Parallel Active Message Interface for the Blue Gene/Q Supercomputer," to appear in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS 12), Shanghai, China, May 2012.
[13] V. Aggarwal, Y. Sabharwal, R. Garg, and P. Heidelberger, "HPCC RandomAccess benchmark for next generation supercomputers," IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), pp. 1-11, 2009.
[14] The Blue Gene Team, "Blue Gene/Q: by co-design," to appear in International Supercomputing Conference, June 2012.
[15] F. Petrini and M. Vanneschi, "Minimal vs. non Minimal Adaptive Routing on k-ary n-cubes," International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'96), Volume I, pp. 505-516, Sunnyvale, CA, August 1996.
[16] J. Kim, W. J. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," SIGARCH Comput. Archit. News, vol. 36, pp. 77-88, June 2008.
[17] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony, "The PERCS High-Performance Interconnect," 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), pp. 75-82, August 2010.
[18] S. Scott, D. Abts, J. Kim, and W. J. Dally, "The BlackWidow High-Radix Clos Network," Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA '06), IEEE Computer Society, Washington, DC, USA, pp. 16-28, 2006.