7/27/2019 Blue Gene q Network
Looking Under the Hood of the
IBM Blue Gene/Q Network
Dong Chen, Noel Eisley, Philip Heidelberger, Sameer Kumar, Amith Mamidala, Fabrizio Petrini,
Robert Senger, Yutaka Sugawara, Robert Walkup
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
{chendong, naeisley, philiph, sameerk, amithr, fpetrin,
rmsenger, ysugawa, walkup}@us.ibm.com
Burkhard Steinmacher-Burow
IBM Deutschland Research & Development GmbH
71032 Böblingen, Germany
Anamitra Choudhury, Yogish Sabharwal, Swati Singhal
IBM India Research Lab
New Delhi, India
{anamchou, ysabharwal, swatisin}@in.ibm.com
Jeffrey J. Parker
IBM Systems & Technology Group
Systems Hardware Development
Rochester, MN 55901
Abstract- This paper explores the performance and optimization
of the IBM Blue Gene/Q (BG/Q) five dimensional torus network
on up to 16K nodes. The BG/Q hardware supports multiple
dynamic routing algorithms and different traffic patterns may
require different algorithms to achieve best performance.
Between 85% to 95% of peak network performance is achieved
for all-to-all traffic, while over 85% of peak is obtained for
challenging bisection pairings. A new software-controlled
algorithm is developed for bisection traffic that selects which
hardware algorithm to employ and achieves better performance
than any individual hardware algorithm. The benefit of dynamic
routing is shown for a highly non-uniform transpose traffic pattern. To evaluate memory and network performance, the
HPCC Random Access benchmark was tuned for BG/Q and
achieved 858 Giga Updates per Second (GUPS) on 16K nodes. To
further accelerate message processing, the message libraries on
BG/Q enable the offloading of messaging overhead onto
dedicated communication threads. Several applications, including
Algebraic Multigrid (AMG), exhibit from 3 to 20% gain using
communication threads.
Keywords- interconnection network; network performance;
network routing; GUPS; Blue Gene;
I. INTRODUCTION
Blue Gene/Q (BG/Q) is the third generation of highly scalable, power-efficient supercomputers in the IBM Blue
Gene line, following Blue Gene/L [1] and Blue Gene/P [2]. A
96 rack, 20 petaflops, Blue Gene/Q system called Sequoia has
been installed at the Lawrence Livermore National
Laboratory, while a 48 rack configuration named Mira has
been installed at the Argonne National Laboratory.
BG/Q leverages a highly integrated System-on-a-Chip
(SoC) design with custom on-die torus network and dense
system-level packaging to provide a low-latency, low-power,
high-bandwidth and cost efficient solution for massive scale-
out installations. Design for scalability is especially important
for large petaflop class machines where performance, density,
and power are key inter-related system parameters. As shown
in Figure 1, a BG/Q compute node consists of the SoC single-
chip module with associated memory. 32 compute nodes are
electrically interconnected to form a 2x2x2x2x2 grid on a
node card. 16 node cards comprise a 512-node midplane and
two midplanes stack vertically to form a 1024-node rack, with
electrical links within midplanes and optical links between
midplanes. Racks may also contain special I/O drawers with
Gen-2 PCIe connectivity. The final BG/Q system scales to 96 racks and beyond. The racks are water cooled to permit maximum compute density.

[Figure 1. BG/Q dense packaging hierarchy for massive scale-out: (1) BG/Q Chip: 17 PowerPC cores; (2) Single Chip Module; (3) Compute Card (Node): chip module, 16 GB DDR3 memory; (4) Node Board: 32 compute nodes, optical modules, link chips; 5D torus; (5a) Midplane: 16 node cards; (5b) I/O drawer: 8 I/O cards, 8 PCIe Gen2 x8 slots; (6) Rack: 2 midplanes, 1, 2 or 4 I/O drawers; (7) System: up to 96 racks or more, 20 petaflops+. © 2012 Springer Verlag. Reprinted, with permission, from [14].]

SC12, November 10-16, 2012, Salt Lake City, Utah, USA. 978-1-4673-0806-9/12/$31.00 © 2012 IEEE.
An overview of BG/Q is given in [3]. The BG/Q SoC has
16 cores for user code, and a 17th core is reserved for use by
the system software. Each core has four hardware threads. The
64-bit, in-order, PowerPC cores run at 1.6 GHz. A core can
execute two instructions per cycle: a floating point instruction
on one thread and an integer, branch, load or store on another
thread. Each core has a four wide SIMD floating point engine
capable of executing 8 floating point operations per cycle; the
peak performance of a node is 204.8 GFlops. A crossbar switch
connects the cores to a 32 MB shared L2 cache, organized as
16 slices with 2 MB per slice. Detailed descriptions of the
BG/Q five dimensional (5D) torus interconnection network
and its associated DMA engine, called the Message Unit,
which are integrated onto the same chip as the cores, are given
in [4][5]. The Message Unit attaches to the cores and the
memory system over the crossbar switch. Other notable uses
of a torus interconnect in supercomputers include 3D Cray
machines [6][7] and the 6D Fujitsu K computer [8]. Other scalable networks used in supercomputers today are Clos [18] and dragonfly [16] indirect networks, and all-connected direct networks [17].
BG/Q was designed for scalability and power efficiency.
Sequoia placed first on the June 2012 TOP500 list
(http://www.top500.org) at 16.3 Petaflops, an efficiency of
81.1% of peak, and various configurations of BG/Q have
ranked first on the four most recent Green500 lists
(http://www.green500.org) for power efficiency (November
2010 to June 2012). Additionally, BG/Q ranked first on the
November 2011 and June 2012 Graph 500 lists
(http://www.graph500.org), a network and data intensive
benchmark.
On such a large machine, parallel applications face several challenges to scale, and communication performance can be a
major limiting factor. This paper covers a diversity of
techniques showing how communication performance can be
optimized using both hardware and software techniques
developed through a coordinated co-design effort.
We first provide a detailed look at the performance of the
BG/Q interconnection network on a number of important
communication patterns. In particular, BG/Q provides
multiple, flexible, and programmable hardware dynamic
routing algorithms which support a diverse application set. We
explore the routing algorithms' effectiveness for all-to-all, challenging bisection pairings, and random communication patterns. We also investigate how several software techniques can optimize and improve communications-intensive
benchmarks and applications. We describe optimizations,
including multithreading and message aggregation, for the
HPCC Random Access benchmark
(http://www.hpcchallenge.org). While not an official HPCC
submission, this paper reports how a 16 rack (16384 node)
BG/Q achieves 858 Giga Updates per Second (GUPS), or 54
GUPS per rack. We also present results showing how the
Algebraic Multigrid (AMG) application [9] and an iterative
Poisson's equation solver can be accelerated using
communication threads in which otherwise idle threads are
used to offload and manage communications activity.
Our paper makes the following contributions:
- We demonstrate excellent performance achieved by the 5D BG/Q torus network for several all-to-all and bisection communication patterns.
- We develop a hybrid routing algorithm and show its effectiveness under non-uniform traffic loads.
- We show how the BG/Q system performance can be significantly improved by offloading communication activity to separate threads.
- We describe how the BG/Q messaging layer incorporates configurable features of the network, providing very good performance to the average user while still permitting the experienced user to select routing algorithms and messaging settings to further optimize application performance.
- We demonstrate excellent GUPS performance with a software-optimized version of the Random Access benchmark.
Taken as a whole, this paper shows the benefits of providing
multiple hardware routing algorithms to more efficiently
support different communication patterns. Furthermore, tight
coordination between hardware and software can significantly
accelerate communications. Offloading to software can in
some cases reduce hardware complexity as will be illustrated
in the paper.
II. SUMMARY OF BG/Q NETWORK ARCHITECTURE
To properly understand the results in this paper, we
summarize the most relevant features of the BG/Q
interconnection network architecture. For user applications,
BG/Q presents a 5D torus with each link running at 2 GB/s (2
GB/s send + 2 GB/s receive). A subset of compute nodes,
called bridge nodes, use an 11th link that attaches to BG/Q IO
nodes. Including packet and protocol overhead, up to 90% of
the raw data rate (1.8 GB/s) is available for user data. The
network supports point-to-point messages, collectives and
barriers/global interrupts over the same physical torus (BG/L
and BG/P had separate networks for collectives and barriers).
The machine can be partitioned into non-overlapping
rectangular sub-machines. These sub-machines do not interfere with each other, except possibly on the IO nodes and their corresponding storage system. For point-to-point messages, BG/Q supports both deterministic and dynamic routing with
deadlocks being prevented via Bubble routing [10] in which
packets can switch from a dynamic virtual channel to the
bubble (deterministic) escape virtual channel when network
tokens are exhausted. The deterministic routing is
(programmably) dimension ordered; we have found that
ordering the dimensions from longest first to shortest last is
typically best for performance. With this, queues for packets
waiting to enter the bottleneck (longest dimension) links are
actually stored in the memory system rather than in the much
more limited network FIFOs.
Dynamic routing is also programmable, enabling different routing algorithms to be used, on a per-message basis, at the
same time, i.e., a given message always uses the same
algorithm but different messages can use different algorithms.
This is called "zone routing" and implements in hardware
ideas first explored in software on BG/L [11]. When a packet
enters the network, it is assigned a vector of hint bits, one bit
per direction indicating whether the packet should move in the
plus or minus direction for each dimension, until it reaches its
destination. The hint bits may be assigned by hardware for
minimal path routing or can be programmed by software. On
BG/L, at each hop in the network, a packet may dynamically
move in any direction for which a hint bit is specified. On
BG/Q, a packet header also contains two bits which specify one of four zone IDs, and the allowable movement of dynamic packets is constrained by programmable mask registers for each of the zone IDs. For example, the masks for one zone ID can be set so that packets must complete all hops in the longest dimension(s) first before moving to smaller dimensions, while for a different zone ID the masks could permit movement along any valid direction, as on BG/L. Each such mask is referred to as a "zone", and we refer to a specific mask as zone x of zone ID y. To describe a zone ID, we use the following
notation and example: {A}{BCD}{E}. This means that a
packet first must travel to its final destination along the A
dimension; then it may travel along the B, C, and D
dimensions, taking hops in any order until all three of these
dimensions are complete; and finally the packet routes along
the E dimension until it reaches its final destination. Table I
shows the zone routing masks which we use in this paper for
selected system sizes. Experiments in [11] and near cycle-accurate simulations of the BG/Q network indicate that longest
dimension(s) first to shortest dimension(s) last typically
performs well. Conversely, we found that typically a shortest-
to-longest approach did not perform well, so we do not include
results here. Studies in this paper show that other, more
flexible, forms of zone routing can be beneficial.
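The zone-mask ordering just described can be sketched as follows. This is a minimal model of the constraint semantics (the helper names are ours, not the BG/Q hardware or SPI interface): a zone ID is an ordered list of dimension sets, and a dynamic packet may only take hops in the earliest zone that still has hops remaining.

```python
# Hypothetical sketch (not the BG/Q hardware interface): model a zone ID's
# mask ordering, e.g. {A}{BCD}{E}, and report which dimensions a dynamic
# packet may currently take given its remaining hops per dimension.

def allowed_dims(zones, remaining):
    """zones: ordered list of sets of dimension names, e.g. [{'A'}, {'B','C','D'}, {'E'}].
    remaining: dict of hops still needed per dimension.
    A packet may only move within the earliest zone that still has hops left."""
    for zone in zones:
        active = {d for d in zone if remaining.get(d, 0) > 0}
        if active:
            return active
    return set()

# Example: longest-to-shortest routing {A}{BCD}{E} on a 16-rack 16x8x8x8x2 system.
zones = [{'A'}, {'B', 'C', 'D'}, {'E'}]
remaining = {'A': 3, 'B': 1, 'C': 0, 'D': 2, 'E': 1}
print(allowed_dims(zones, remaining))   # A hops remain, so only A is allowed
remaining['A'] = 0
print(allowed_dims(zones, remaining))   # now B and D may be taken in any order
```

With an unrestricted zone ID such as {ABCDE}, the same helper returns every dimension with hops remaining, matching the BG/L-style behavior described above.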
Note that in Table I, zone ID 3 is the same as the
deterministic ordered zone ID 2 except that hops in dimension
E are also permitted to occur first. In other words, packets are
first injected and may switch between either the longest
dimension in the system or dimension E. This can improve performance: since the length of E is always 2, no packet can travel more than one hop in E. Even if the E network FIFOs
are full of dynamic packets, they cannot block packets from
longer dimensions turning onto E since those packets can use
the bubble escape virtual channel. In this case the small
additional contention from packets turning from E to the longest dimension may be outweighed by the additional
buffering effect of allowing packets to inject into either
dimension E or the longest dimension.
To further improve performance, we explore the use of software "pacing", in which the fullness of packet queues
within the network logic is controlled by limiting the injection
rate of packets into the network, similar to TCP/IP window
flow control. In our form of pacing, there is a window size of
W bytes and each node is permitted to inject requests for at
most 2W bytes at any one time. After W bytes are received, a
remote get (rDMA read) request is issued for another W bytes
(or the remaining message size).
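The windowed scheme above can be sketched as follows. This is a simplified model with hypothetical names, not the SPI or Message Unit interface: at most 2W bytes of remote-get requests are outstanding, and each time a W-byte window completes, the next W bytes (or the remainder) are requested.

```python
# Illustrative model of software pacing: window of W bytes, at most 2W bytes
# of requests outstanding; after W bytes arrive, a remote get is issued for
# the next W bytes (or the remaining message size).

def pace_message(total_bytes, w):
    """Return the sequence of remote-get request sizes issued for one message."""
    requests = []
    outstanding = 0
    sent = 0
    # Prime the pipeline: up to 2W bytes may be requested at once.
    while sent < total_bytes and outstanding + min(w, total_bytes - sent) <= 2 * w:
        chunk = min(w, total_bytes - sent)
        requests.append(chunk)
        sent += chunk
        outstanding += chunk
    # Each time a W-byte window completes, request the next W (or the rest).
    while sent < total_bytes:
        outstanding -= w          # model completion of the oldest window
        chunk = min(w, total_bytes - sent)
        requests.append(chunk)
        sent += chunk
        outstanding += chunk
    return requests

# A 1 MB message with an 8 KB pacing window is issued as 128 requests of 8 KB.
reqs = pace_message(1 << 20, 8 << 10)
print(len(reqs), reqs[0], sum(reqs))   # 128 8192 1048576
```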
The tests described in Sections III and IV are written using
low level System Programming Interface (SPI) calls that
access the network hardware resources directly [5], so as to
eliminate most software overhead from the measurements. The
GUPS results of Section V are obtained using the BG/Q
production messaging library PAMI (Parallel Active Message
Interface) [12]. PAMI uses SPI calls to access the hardware
and supports both communication threads and a form of
pacing. The BG/Q MPI implementation runs on top of PAMI.
III. ALL-TO-ALL BANDWIDTH
The peak all-to-all bandwidth (BW) of a torus is limited by the length of its longest dimension, since a given link in this dimension is utilized by more source-destination pairs. If the length of the longest dimension is L, then the peak user-data per-node all-to-all BW is (8/L) × 1.8 GB/s [11].
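The peak figure can be checked directly with a small helper (our own, purely illustrative; the 1.8 GB/s is the per-link user-data rate stated in Section II):

```python
# Quick check of the peak all-to-all figure from [11]: per-node user-data
# bandwidth on a torus is (8 / L) * 1.8 GB/s, where L is the longest dimension.

def peak_all_to_all_per_node(dims, link_user_bw=1.8):
    L = max(dims)
    return 8.0 / L * link_user_bw

# 16 racks (16x8x8x8x2): L = 16, so 0.9 GB/s per node.
print(round(peak_all_to_all_per_node((16, 8, 8, 8, 2)), 2))   # 0.9
# 4 racks (8x4x8x8x2): L = 8, so 1.8 GB/s per node.
print(round(peak_all_to_all_per_node((8, 4, 8, 8, 2)), 2))    # 1.8
```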
We ran our SPI-based large-message all-to-all
performance test on systems up to 16 racks (16384 nodes). In
this test, each node sends 32 KB of data to each of the other
N-1 nodes. The data is broken up into a number of smaller
messages of constant size which are sprayed randomly over
the destinations. Breaking up the 32KB into smaller
submessages had only a small effect since each node is already
spraying packets from different messages throughout the
network. To explore the effect of zone routing, we ran the test using dynamic routing zone IDs 0 through 3 as well as using
deterministic routing on 4-rack and 16-rack systems, and the
results are shown in Figure 2. The best results are achieved
with zone ID 0, which is expected. Recall that in zone ID 0,
packets are first routed along the longest dimension (here, A),
which is the most heavily loaded in this case; so no packets
turn onto A from other dimensions, mitigating the effect of
contention. At the same time, once packets turn off of A, they
turn onto less heavily loaded dimensions, so the effect of
TABLE I: Dynamic zone routing masks for selected system sizes used in this paper.

Zone ID   Description                                16 racks (16x8x8x8x2)   4 racks (8x4x8x8x2)   1 rack (4x4x4x8x2)
0         Longest-to-shortest                        {A}{BCD}{E}             {ACD}{B}{E}           {D}{ABC}{E}
1         Unrestricted                               {ABCDE}                 {ABCDE}               {ABCDE}
2         Deterministic ordering                     {A}{B}{C}{D}{E}         {A}{C}{D}{B}{E}       {D}{A}{B}{C}{E}
3         Add E to the first zone of det. ordering   {AE}{B}{C}{D}           {AE}{C}{D}{B}         {DE}{A}{B}{C}
multiple dimensions turning onto B, for example, is less
severe than it otherwise would be. For 16K nodes, there is a single longest dimension of length 16, which is twice as long as the next-longest dimensions. Since zone ID 2 and deterministic ordering also route the longest dimension first, their performance is similar to that of zone ID 0. On the more symmetric 4K nodes, with three longest dimensions of length 8, dynamic routing is able to more effectively distribute traffic throughout the network than deterministic routing.

[Figure 2: All-to-all performance as a percentage of peak, for dynamic and deterministic routing on 4- and 16-rack systems. Submessage size 4 KB.]
We ran the all-to-all performance test on a wide range of system sizes, from 512 nodes up to 16384 nodes. All-to-all results for systems up to 2048 nodes were reported in [5] and are included along with the larger systems in Table II. Table II shows that as system size grows, the network is capable of sustaining excellent all-to-all bandwidth, from 85% to 95% of peak, using a longest-to-shortest dimension dynamic zone-routing approach. The PAMI implementation uses an algorithm that sprays traffic using zone ID 1 for systems of 512 nodes and smaller, and it uses zone ID 0 for larger systems.
IV. BISECTION BANDWIDTH
For a torus of N nodes with longest dimension of length L, the bisection bandwidth is (N/L) × 4 × B, where B is the bandwidth of a single unidirectional link.
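The formula can be checked numerically with a small helper (our own, not part of the SPI test; B = 2 GB/s is the raw unidirectional link rate from Section II):

```python
# Sketch of the bisection-bandwidth formula above: (N / L) * 4 * B for an
# N-node torus with longest dimension L and unidirectional link bandwidth B.

def torus_bisection_bw(dims, b=2.0):
    """dims: torus dimension lengths; b: raw unidirectional link BW in GB/s."""
    n = 1
    for d in dims:
        n *= d
    return n / max(dims) * 4 * b

# 4 racks, 8x4x8x8x2 = 4096 nodes, longest dimension 8:
print(torus_bisection_bw((8, 4, 8, 8, 2)))   # 4096/8 * 4 * 2.0 = 4096.0 GB/s
```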
A. Diagonal and Furthest-Node Pairings
One type of communication pattern which is useful for
evaluating the effectiveness of an interconnection network at
sustaining its bisection bandwidth is the bisection pairing. In
a bisection pairing each node in the network communicates
with exactly one other node, no two nodes communicate with the same node, and each source-destination pair crosses the
bisection of the network exactly once. In this paper we
evaluate two such challenging pairings, referred to as the
diagonal and furthest-node pairings, as described below.
- Diagonal pairing: each node communicates with the node which is a reflection across the midpoint of each dimension. In each dimension the node with index i communicates with the node with index L-i-1, where L is the length of the dimension. On a mesh, these pairings are such that if you draw a line between each pair, they all pass through the center of the mesh.
- Furthest-node pairing: each node communicates with the node which is the maximum number of hops away.

We ran an SPI-level bisection performance test on 1-rack (1024 node) and 4-rack systems, using dynamic routing zone IDs 0-3 as well as deterministic routing, and the results are presented in Table III for diagonal pairing and Table IV for furthest-node pairing. We also vary the pacing of the message between nodes by changing the window submessage size. This has the beneficial effect of preventing the network from over-saturating and causing performance to deteriorate. Based on Tables III and IV, we observe that using a pacing window size of 8 KB gives the best performance across all zone IDs, so throughout the rest of the paper we limit our results to this pacing window size. The bisection performance as a percentage of peak is significantly better on one rack than on four, especially for the more challenging diagonal pairing. This is due to the fact that there is a single long dimension in the one-rack system, so that, as discussed in Section III, packets are prevented from turning onto that long dimension. For more symmetrical system sizes with more than one long dimension, it is not possible to completely eliminate packets turning onto at least one of the long dimensions.

On the 4-rack system, the best routing for the diagonal pairing is zone ID 3, since it maintains high performance across a wide range of window sizes. For the furthest-node pairing, the best performance is achieved with zone ID 0, since this pairing naturally has a much more evenly distributed traffic pattern, equally utilizing all of the links, similar to the all-to-all case, so that the standard longest-to-shortest dynamic routing performs quite well. Conversely, the diagonal pairing does not evenly utilize the links, so that dynamic routing inadvertently concentrates the traffic on a relatively small number of links, including bisection links. By definition, in order to obtain a high percentage of the peak bisection bandwidth, all of the bisection links must be utilized. Deterministic (and deterministic-ordered dynamic) routing forces some of the traffic around the hot spots and mitigates the congestion significantly.
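The two pairings can be generated per dimension as a quick sketch. The helper names are ours; the furthest-node rule shown, (i + L/2) mod L in each torus dimension, is one natural reading of "maximum number of hops away":

```python
# Sketch of the two bisection pairings described above, applied per dimension:
# diagonal reflects each coordinate across the midpoint (i -> L-i-1), and
# furthest-node moves the maximum torus distance (i -> (i + L/2) mod L).

def diagonal_partner(coords, dims):
    return tuple(L - i - 1 for i, L in zip(coords, dims))

def furthest_partner(coords, dims):
    return tuple((i + L // 2) % L for i, L in zip(coords, dims))

dims = (8, 4, 8, 8, 2)                           # 4-rack system
print(diagonal_partner((1, 0, 3, 7, 0), dims))   # (6, 3, 4, 0, 1)
print(furthest_partner((1, 0, 3, 7, 0), dims))   # (5, 2, 7, 3, 1)
```

Both maps are involutions, so applying them twice returns the original node, which is what makes each a valid pairing.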
A key observation is that there are some source-destination
pairs in the diagonal pairing which have only one minimal
path between them (i.e., a single hop in each dimension), and there are other pairs which have many possible paths between
them. Of those paths, some overlap with the close pairs, and
others avoid using the same links. We next explore whether it
can be beneficial to use different zone IDs for different
partners in order to diffuse the hot spots in the network.
B. Flexibility Metric
In order to differentiate between the pairs with varying numbers of minimal paths between them, we introduce the
TABLE II: All-to-all performance, as a percentage of peak, for zone ID 0 dynamic routing and 4 KB submessage size, as a function of system size.

# Nodes           512   1024   2048   4096   16384
Performance (%)    95     92     94     85      91
flexibility metric:

    F = Σ_{i=0}^{D-1} h_i / (L_i / 2),
where h_i is the number of hops in dimension i for the given source-destination pair; L_i/2 is half the length of dimension i (i.e., the maximum number of hops in a torus using minimal-path routing); and D is the number of dimensions in the network. In our implementation dimension E is length 2 for all system sizes and thus can be ignored. Since h_max = L_i/2, F_max = D = 4 in this case. Furthermore, all traffic for the furthest-node pairing has F = F_max, since each message in that pairing travels
the maximum distance in the torus. In general, there are a
relatively small number of possible values of F for a given size
system and communication pattern.
On a system size of 4 racks, the size of the network is
8x4x8x8x2. For the diagonal pairing on a torus, each packet
takes an odd number of hops in each dimension. So on a
dimension of length 4, all packets travel exactly 1 hop; on a
dimension of length 8, either 1 or 3 hops. This means that the value of F for a dimension of length 4 is 0.5, and the two
possible values of F for a dimension of length 8 are 0.25 and
0.75. So for this configuration, there are four possible sums of
F for the diagonal pairing: 1.25, 1.75, 2.25, and 2.75. Our
scheme uses two thresholds, Th and Tl, to choose between zone IDs. For source-destination pairs with F < Tl or F >= Th, zone ID 0 is used; for pairs with Tl <= F < Th, the deterministic-ordering zone ID is used.
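The metric and threshold scheme can be sketched as follows. The helper names are ours, and mapping intermediate-flexibility pairs to the deterministic-ordering zone ID (ID 2) follows the description in the text:

```python
# Sketch of the flexibility metric and the two-threshold zone selection;
# dimension E (always length 2) is ignored, as in the text.

def flexibility(hops, dims):
    """F = sum over dimensions of h_i / (L_i / 2), skipping E (the last dim)."""
    return sum(h / (L / 2) for h, L in zip(hops[:-1], dims[:-1]))

def pick_zone(f, t_lo, t_hi):
    """Longest-to-shortest (zone ID 0) for extreme F; deterministic
    ordering (zone ID 2) for intermediate F."""
    return 0 if (f < t_lo or f >= t_hi) else 2

# 4 racks, 8x4x8x8x2; a diagonal pair taking (1, 1, 3, 1) hops in A, B, C, D:
dims = (8, 4, 8, 8, 2)
f = flexibility((1, 1, 3, 1, 1), dims)
print(f)                        # 0.25 + 0.5 + 0.75 + 0.25 = 1.75
print(pick_zone(f, 1.25, 2.5))  # intermediate flexibility -> zone ID 2
```

Note that 1.75 is one of the four possible diagonal-pairing sums listed above (1.25, 1.75, 2.25, 2.75), so the thresholds partition a small, known set of values.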
As on 4 racks, the 16-rack performance is much more sensitive to the value of Tl than Th. Trends seen in the smaller systems also apply to the larger 16-rack system. Routing messages with very low or very high flexibility with a longest-to-shortest zoned approach, while routing messages with intermediate flexibility with a deterministic-ordering approach, can provide better performance than either approach alone.

A form of pacing with the flexibility metric has been implemented in PAMI, and is thus used by MPI. Pacing is controlled by a thread on the seventeenth core, and the flexibility metric thresholds are chosen differently depending on the system size. Default settings can be overridden using environment variables, permitting users to tune and optimize their codes.
C. Random Pairing
An important benchmark to evaluate the performance of an
this benchmark, each node is randomly paired with another
node in the system. Each node in the network communicates
with exactly one other node; no two nodes communicate with
the same node. As with all-to-all, the expected per-node peak bandwidth is (8/L) × 1.8 GB/s.
(s,k)-random pairing benchmark: Since the pairs are
determined randomly and the aforementioned calculation only
yields the peak bandwidth in expectation, it only serves as an
upper bound. There can be local hot-spots due to the
randomness of the pair selections, and this smooths out as the number of pairs increases and eventually approaches a true all-to-all communication pattern. Thus, in order to get a better
idea of the performance, we extend this benchmark as follows.
We define an (s,k)-random pairing wherein each node utilizes
s cores and each core communicates with k random partners
on different nodes. Thus every node communicates with sk
other nodes. Note that the (1,1)-random-pairing benchmark is
equivalent to the random-pairing benchmark. The expected peak data-per-node BW is the same as before, i.e., (8/L) × 1.8 GB/s.
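A pairing of this shape can be generated as a small sketch. The paper does not specify the generator, so this is purely illustrative; for simplicity we only enforce that each node's s·k partners are distinct remote nodes:

```python
# Illustrative generator for an (s,k)-random pairing as defined above: each
# node uses s cores and each core picks k random partners on other nodes,
# so every node communicates with s*k distinct peers. (Our simplification;
# the paper does not specify how the random pairs are drawn.)

import random

def sk_random_pairing(num_nodes, s, k, seed=0):
    rng = random.Random(seed)
    pairing = {}
    for node in range(num_nodes):
        others = [n for n in range(num_nodes) if n != node]
        pairing[node] = rng.sample(others, s * k)   # s*k distinct remote nodes
    return pairing

p = sk_random_pairing(num_nodes=64, s=4, k=2)
print(len(p[0]), len(set(p[0])), 0 in p[0])   # 8 8 False
```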
We ran our SPI-based random-pairing tests on systems of 1
rack and 4 racks. In this test, we exchanged 1 MB of data
between each pair in the (s,k)-random pairing. To explore the
effect of zone routing, we ran the test using dynamic routing zone IDs 0 through 3 as well as using deterministic routing.
These numbers are presented in Table VIII for s=16 and k=16.
All the tests were performed using pacing with a window size
of 8KB. We observe that the best results were obtained with
zone ID 1 routing. We believe that local hotspots are more
easily avoided using the unrestricted dynamic routing of zone
ID 1 compared to the longest-to-shortest routing of zone ID 0.
We also ran the tests with s=16 and k=1,2,4,8,16 in order to
study the effect of increasing communication partners on the
performance. The results are shown in Table IX; these were
obtained with zone ID 1 routing and with pacing. As expected,
performance steadily improves as the number of
communication partners increases. With (s,k) = (16,16),
performance goes as high as 77% on 4096 nodes. We also see
that performance on 4096 nodes is significantly better than on
1024 nodes. On the larger system, this is probably due to the
more symmetric topology, more opportunity for dynamic
routing to avoid hotspots, and a smaller likelihood of selecting
adversarial pairings such as multiple collinear pairs.
D. Reverse
The reverse benchmark evaluates the performance of the
interconnection network at sustaining bisection bandwidth on
an irregular communication pattern. In this benchmark, a node with MPI rank X communicates with the node having rank Y, where the coordinates of Y in each dimension are obtained by reversing the bit pattern of the corresponding coordinate of X; i.e., for any dimension A and bit i (i = 0, 1, ..., log2(LA) - 1), the i-th bit of Y along dimension A is the same as the (log2(LA) - i - 1)-th bit of X along dimension A, where LA is the length of dimension A. The peak
performance for this benchmark is calculated by examining
the central cut along the longest dimension. For 4 racks
TABLE VI: Percentage of peak bisection, for single-zone-ID routing, for diagonal and furthest-node pairing on 16384 nodes. Pacing window 8 KB.

Zone ID          0    1    2    3   Det.
Diagonal        71   62   77   91    85
Furthest-node   95   83   93   76    92

TABLE VII: Percentage of peak bisection, for selected combinations of flexibility metric thresholds, for diagonal pairing on 16384 nodes. Pacing window 8 KB.

(Tl, Th)       1.0,1.5   1.0,1.75   1.0,2.0   1.0,2.25   1.0,2.5   1.0,2.75   1.0,3.0
Performance       87        85         91        92         93        92         92

(Tl, Th)       1.25,1.5  1.25,1.75  1.25,2.0  1.25,2.25  1.25,2.5  1.25,2.75  1.25,3.0
Performance       85        85         94        92         94        93         93

(Tl, Th)       1.5,2.25  1.5,2.5   1.5,2.75   1.5,3.0
Performance       72        72        72         72

TABLE VIII: Random-pairing performance as a percentage of peak for 1- and 4-rack systems with different routing schemes, with s=16 and k=16. Pacing window 8 KB.

Number of Nodes   Zone ID 0   Zone ID 1   Zone ID 2   Zone ID 3   Det.
1024                 56          67          54          57        51
4096                 70          77          45          47        38

TABLE IX: Random-pairing performance as a percentage of peak bandwidth for 1- and 4-rack systems using zone ID 1 routing with s=16. Pacing window 8 KB.

Number of Nodes   k=1   k=2   k=4   k=8   k=16
1024               50    57    58    64    67
4096               65    66    72    75    77
(8x4x8x8x2), the longest dimension is of size 8, which is represented by 3 bits. The node pairs that communicate with each other are the pair [1 (001), 4 (100)] and the pair [3 (011), 6 (110)]. Note that both of these communicating pairs use the link between nodes 3 and 4 (they do not use the diametrically opposite link of the torus). Thus when we look at the cut across the longest dimension, the total amount of data passing through the cut is twice the data generated on each node. Therefore the peak data-per-node BW is 1.8/2 GB/s. Similarly, for 1 rack (4x4x4x8x2), the longest dimension is 8, and hence the peak data-per-node BW is again 1.8/2 GB/s.
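The bit-reversal partner can be computed with a short sketch (our own helper names; dimension lengths are assumed to be powers of two, as in the configurations above):

```python
# Sketch of the reverse pairing above: the partner's coordinate in each
# dimension is the bit-reversal of the node's coordinate in that dimension.

def bit_reverse(x, bits):
    y = 0
    for _ in range(bits):
        y = (y << 1) | (x & 1)
        x >>= 1
    return y

def reverse_partner(coords, dims):
    return tuple(bit_reverse(c, L.bit_length() - 1) for c, L in zip(coords, dims))

# In a dimension of length 8 (3 bits): 1 (001) <-> 4 (100), 3 (011) <-> 6 (110).
print(bit_reverse(1, 3), bit_reverse(3, 3))                  # 4 6
print(reverse_partner((1, 3, 0, 7, 1), (8, 4, 8, 8, 2)))     # (4, 3, 0, 7, 1)
```

Bit-reversal is its own inverse, so, as with the other pairings, applying the map twice returns the original node.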
We ran our SPI-based reverse-pairing tests on systems of 1 rack and 4 racks. In this test, we exchanged 1 MB of data between the communicating pairs. To explore the effect of zone routing, we ran the test using dynamic routing zone IDs 0 through 3 as well as using deterministic routing. The results are shown in Table X. We observe that on 4096 nodes, the performance with dynamic routing zone IDs 2 or 3 is approximately 75% of the peak. The performance of the flexibility metric approach is between that of zone IDs 0 and 3, as expected. On 1024 nodes, the performance is very consistent across the different zone routings and reaches 95% of the peak.
E. Transpose
In the transpose benchmark, the nodes on the network form a
virtual 2D square matrix where each node (x,y) is paired with the node (y,x). Diagonal nodes (x,x) do not participate in this pairing communication operation. On the 5D BG/Q torus network, the 2D mesh is overlaid on the dimensions of the 5D torus. Depending on how the processes are mapped to the dimensions of the 5D torus, it may be possible to fold the dimensions of the 5D torus to form a 2D mesh. For example, on 1024 nodes, when dimensions A,B,C,D,E have sizes 4x4x4x8x2 respectively, a 32x32 virtual mesh can be formed as {CD}x{ABE} when the CDABE mapping is used. Other mappings such as ABCDE may result in a dimension (C) being shared by both the X and Y dimensions of the mesh.
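The {CD}x{ABE} fold for 1024 nodes can be sketched as follows. The exact coordinate ordering inside each folded axis is our assumption for illustration; the point is that both axes multiply out to 32:

```python
# Sketch of folding the 5D torus into a virtual 2D mesh for the transpose
# pairing: for 1024 nodes (sizes 4,4,4,8,2), X = {C,D} (4*8 = 32 columns)
# and Y = {A,B,E} (4*4*2 = 32 rows); the transpose partner of (x,y) is (y,x).

def fold_cdabe(a, b, c, d, e):
    """Map 5D coords (sizes 4,4,4,8,2) to a 32x32 virtual mesh."""
    x = c * 8 + d             # {CD}: 4*8 = 32 columns
    y = (a * 4 + b) * 2 + e   # {ABE}: 4*4*2 = 32 rows
    return x, y

def transpose_partner(x, y):
    return (y, x) if x != y else None   # diagonal nodes do not participate

x, y = fold_cdabe(1, 2, 3, 5, 0)
print((x, y), transpose_partner(x, y))   # (29, 12) (12, 29)
```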
As shown in [15], the transpose pairing is a challenging communication operation that can cause hotspots along the diagonal nodes in the 2D mesh. On a 5D torus with static deterministic routing, packets will converge towards the hotspot diagonal nodes, resulting in lower overall throughput. We developed a simple program to compute the load on the links for the transpose communication pattern with deterministic routing, and the results are presented in Table XI.

Observe that with deterministic routing, links around the hotspots have several messages passing through them and the achievable percentage of peak is quite small. With adaptive routing, where the torus routers send packets along the least loaded links, significant improvement in performance can be expected. Table XII shows the percentage of bisection throughput achieved with dynamic routing on zone ID 1 and deterministic routing for the transpose operation, using a pairing test written in SPI. The percentage is adjusted to account for the fact that only N - sqrt(N) nodes participate in the transpose operation. Observe that adaptive routing with zone ID 1 achieves higher throughput than deterministic routing, as it can smooth the network load around hotspots. We also observed better performance when the 5D torus can be folded to form the 2D mesh; note that in Table XII, mapping CDABE performs better than ABCDE. Zone ID 1 performs best as it has the most flexibility in moving packets around hotspots. The other zone IDs 0, 2, and 3 achieve throughputs between deterministic routing and zone ID 1, as does the flexibility metric approach.
V. GUPS
A. Introduction
Random access performance of the memory subsystem is critical to many applications. The HPCC suite includes the Random Access benchmark, which measures the capability of a system to generate and apply updates to random locations in memory. On the earlier machines, Blue Gene/L and Blue Gene/P [13], 3D bucketing algorithms were designed to amortize the transfer costs by aggregating multiple updates into a single bucket. Such techniques lower the software costs of injection and reception of the updates and also help in better utilization of the network. The performance of the benchmark is measured in GUPS and is bounded by the bisection performance of the network, although other factors such as software overhead can be the bottleneck. Further, the total look-ahead depth for aggregation is restricted to 1024 updates per process, or 8192 bytes with eight bytes per update, limiting the size of the buckets used.
B. GUPS design on BG/Q
The benchmark is run with sixteen processes per node, one process per core, with each process utilizing four threads. Of the four threads, two are completely dedicated to software routing and the other two are used for generating the
TABLE X: REVERSE PERFORMANCE AS A PERCENTAGE OF PEAK FOR 1- AND 4-RACK SYSTEMS WITH DIFFERENT ROUTING SCHEMES. PACING WINDOW 8 KB.

  Number of Nodes   Zone ID 0   Zone ID 1   Zone ID 2   Zone ID 3   Det.
  1024              94          93          94          94          94
  4096              65          65          75          75          53
TABLE XI: TRANSPOSE PAIRING LOAD ON TORUS NETWORK LINKS WITH STATIC ROUTING.

  Nodes   Routing Dimension Order   Rank-to-Coord Mapping   Max link load   Predicted % of bisection throughput
  1024    DBCAE                     ABCDE                   4               50%
  1024    DBCAE                     CDABE                   4               50%
  4096    ADCBE                     ABCDE                   16              12.5%
TABLE XII: TRANSPOSE PERFORMANCE WITH DETERMINISTIC AND ADAPTIVE ROUTING.

  Nodes   Rank-to-Coord Mapping   % of Bisection Throughput, Adaptive Routing zone ID 1   % of Bisection Throughput, Deterministic Routing
  1024    ABCDE                   83%                                                     31%
  1024    CDABE                   89%                                                     41%
  4096    ABCDE                   74%                                                     13%
updates and applying the updates. The salient features of the
new design are the following.
1) Software routing for the five dimensional torus: Because BG/Q has multiple threads per core, there exist new bucketing opportunities. In addition, the 5D torus permits larger buckets compared to a 3D torus with the same number of nodes. For example, in a 64K-node 64x32x32 3D system the process handling the longest dimension has 64 buckets, whereas a 16x16x16x8x2 5D system has at most 16 buckets per dimension.
In the design proposed in this paper, a process is required
to route traffic from only one incoming dimension to only one
outgoing dimension. This greatly reduces the number of
buckets thus allowing for more aggregation. For example, on
the largest machine, the number of send buckets utilized
would only be around 16. The basic idea is to aggregate all the
updates from the processes on a node and then route them
along the dimensions of the torus. Once the updates reach the
final destination node, they are scattered to their respective
processes. Also, the packets are always routed from the shorter to the longer dimensions to increase message aggregation and to avoid any cyclic dependencies. In a 16-rack system, the E dimension is the shortest and the A dimension is the longest.
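The shortest-to-longest ordering can be sketched as follows, using the full-machine dimension sizes 16x16x16x8x2 from above. The tie-break among the three size-16 dimensions is chosen here simply to reproduce the E, D, C, B, A hop order; the helper is illustrative:

```python
# Order the torus dimensions from shortest to longest, as the software
# router does to maximize aggregation and avoid cyclic dependencies.
# Sizes for the full 96-rack machine: A=16, B=16, C=16, D=8, E=2.
dims = {"A": 16, "B": 16, "C": 16, "D": 8, "E": 2}

# Sort by size ascending; break ties in reverse-alphabetical order so
# the sequence matches the paper's E -> D -> C -> B -> A ordering.
order = sorted(dims, key=lambda d: (dims[d], -ord(d)))
print(order)  # ['E', 'D', 'C', 'B', 'A']
```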
2) Translating communication parallelism into GUPS performance: The MU provides a high level of parallelism
within a node with multiple injection and reception FIFOs
operating concurrently on different messages. For example,
within a single process, multiple threads can send and receive
messages on separate hardware FIFOs eliminating the need for
shared locks. PAMI on BG/Q exposes this concurrency in the
form of higher level abstractions such as contexts. Further, these threads can be pinned to a specific context and are addressed using end-points. A complete discussion of these concepts is given in [12]. Our design of GUPS uses these
PAMI concepts as building blocks and the entire algorithm is
implemented in the pre-registered message handlers.
Figure 3. Routing along E planes. (The figure shows the sixteen local ranks on each node, in both the E=0 and E=1 planes, divided into routing sets {0,1,2}, {3,4,5}, {6,7,8}, {9,10,11}, and {12,13,14,15}, with routing functions 1: E to D, 2: D to C, 3: C to B, 4: B to A, and 5: A to T.)
Our new design harnesses communication parallelism by
allowing threads in more than one process to route in the same
dimension. Processes belonging to one routing set drain
packets from the reception FIFOs of a lower dimension and
route to the routing set of processes of a higher dimension.
Further, each process spawns two independent routing threads
working in parallel, for a total of 32 routing threads per node.
Figure 4. Dimension-ordered routing in the routing sets (routing sets 1-5 shown along the D and C dimensions).
3) Detailed illustration of the parallel software routing: The initial routing step is explained as follows. As shown in
Figure 3, the sixteen processes on a node are divided into five
routing sets. All these processes, after generating the updates,
route to routing set 1, comprised of processes with local ranks
{0, 1, 2}. The other routing sets numbered from two to five are
also shown in Figure 3. As explained below these are used for
routing along the remaining dimensions of the network, D to
A. The T dimension is the local dimension, and processes in
routing set 5 with local ranks {12, 13, 14, 15} are used in the last step of the software routing and forward the updates to all
sixteen processes within the node. Note that only the first
thread of these processes is used to generate the updates. Apart
from generating the updates, the thread also maintains two
buckets, corresponding to the E = 0 and E = 1 plane. All the
updates are aggregated into these buckets before sending to
the processes of routing set 1. As shown in Figure 3, processes
in the E = 0 plane communicate with routing set 1 of E = 1
plane via the network. For communicating to the processes in
the same plane, the updates utilize shared memory. Note that
in the initial phase of the algorithm, thread 0 of each process
communicates to threads 1 and 2 of the processes belonging to
routing set 1 in order to aggregate all the updates on a node.
By careful mapping, we allow for uniform distribution of
updates to each of the routing threads belonging to the three
processes of a routing set.
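The division of a node's sixteen local ranks into routing sets, as shown in Figure 3, can be written down directly (a sketch; the lookup helper is ours):

```python
# Divide the 16 processes on a node into the five routing sets of
# Figure 3: sets 1-4 have three processes each, and set 5 has four
# (the T-dimension set that scatters final updates locally).
routing_sets = {
    1: [0, 1, 2],        # routes E -> D
    2: [3, 4, 5],        # routes D -> C
    3: [6, 7, 8],        # routes C -> B
    4: [9, 10, 11],      # routes B -> A
    5: [12, 13, 14, 15], # routes A -> T (local scatter)
}

def routing_set_of(local_rank):
    # Illustrative lookup: which routing set a local rank belongs to.
    for s, ranks in routing_sets.items():
        if local_rank in ranks:
            return s

# All 16 local ranks are covered; with two routing threads per process
# this yields the 32 routing threads per node mentioned above.
print(routing_set_of(7))  # 3
print(sum(len(r) for r in routing_sets.values()))  # 16
```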
The remaining routing steps traverse the dimensions of the network in the order DCBA. A further optimization to generalize the algorithm for any arbitrary system configuration would be to go from the shortest to the longest dimension to get the most aggregation of the updates. However, it is to be noted that
on a complete 96-rack machine, the ordering required is the same as in this paper. Figure 4 shows two hops of this routing, first along the D dimension and then along the C dimension. As indicated in Figure 4, the packets injected by routing set 1 are received by the processes belonging to routing set 2. For the C dimension, updates travel from routing set 2 to 3.
C. Performance evaluation
The performance of the Random Access benchmark is tightly coupled to the bucket size used for message aggregation. In the following we describe the calculation used to obtain the bucket sizes. We first enumerate the types of buckets used per process in our design:
1) Issue send buckets: Used by the issue thread 0, which generates the random numbers and sends updates along the E dimension. There are two issue send buckets, one for each E plane.
2) Routing send buckets: Used by the routing threads 1 and 2 to send along a given dimension. The number of routing send buckets is the same as the dimension size.
3) Routing receive buckets: Used by routing threads 1 and 2 to receive updates. There is one routing receive bucket to process data received in the active message handler.
4) Final update receive buckets: There is one final update receive bucket that is used by the update thread 3 to receive the final updates.
An issue send bucket size of 512 B was experimentally
determined to maximize performance. Similarly, the final
update receive bucket size was experimentally selected at 256
B. The benchmark allows 8 KB of total bucket memory space,
thus the remaining space for the routing send and receive
buckets is (8192 - 512 - 256) = 7424 B, or 3712 B for each of
the two routing threads. A routing send bucket is required per
node along a dimension, as well as a single receive bucket. Thus each routing send and receive bucket is 3712/(dimension_size + 1) bytes, as there are dimension_size sending buckets and one receive bucket.
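The bucket-size arithmetic above works out as follows (a direct transcription of the numbers in the text):

```python
# Bucket-size arithmetic: 8 KB of bucket memory per process, minus the
# fixed issue-send and final-receive buckets, split between the two
# routing threads, then shared by dimension_size + 1 buckets.
TOTAL = 8192        # bytes of bucket memory allowed per process
ISSUE_SEND = 512    # experimentally chosen issue send bucket size
FINAL_RECV = 256    # experimentally chosen final update receive bucket

remaining = TOTAL - ISSUE_SEND - FINAL_RECV   # 7424 B
per_thread = remaining // 2                   # 3712 B per routing thread

def routing_bucket_size(dimension_size):
    # dimension_size send buckets plus one receive bucket share the space
    return per_thread // (dimension_size + 1)

print(remaining, per_thread, routing_bucket_size(16))  # 7424 3712 218
```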
Since GUPS follows an all-to-all kind of pattern and there are 8 bytes per update, the network bound on updates per second per node is ((8/L)*B)/8 = B/L, where B is the peak link bandwidth obtained after adjusting for the per-packet overhead used in the software routing. For example, for a packet size of S bytes, B = S/(S+52) * 2.0 GB/s, where 52 is the total number of bytes used in the header, trailer and acknowledgement of the packet. S is determined from the bucket sizes used. From 1 to 8 racks, the network bound is over 200 million updates per node per second, and it is 100 million updates per node per second from 16 racks up to the full system size. From experimental evaluation, we observed that the performance achieved on a single node is 106 million updates per second. Since each update requires a read and a write of 128 B, this corresponds to an off-chip memory bandwidth of 27.6 GB/s. We use 106 million updates per second as the memory system hardware limit.
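The per-packet efficiency factor B can be evaluated directly from the formula above (the 512 B packet size used here is an illustrative choice, not a value from the text):

```python
# Effective per-link bandwidth after packet overhead:
# B = S/(S+52) * 2.0 GB/s, where 52 B covers the header, trailer, and
# acknowledgement of the packet (from the text).
def effective_bandwidth_gbs(packet_bytes):
    return packet_bytes / (packet_bytes + 52) * 2.0

# e.g. a 512 B packet retains about 91% of the 2.0 GB/s raw link rate
b = effective_bandwidth_gbs(512)
print(round(b, 3))  # 1.816
```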
Table XIII reports the total GUPS, the update rate per node, and the hardware bound per node, which is the minimum of the network and memory bounds, for system sizes of 1 to 16 racks. For 16 racks, we achieved 858.1 GUPS, which is 52.4% of the hardware bound. In our experience, the bottleneck to higher GUPS performance is the processing cost of software routing over the five network dimensions and one local dimension.
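As a consistency check, the per-node rates in Table XIII follow from total GUPS divided by the node count:

```python
# Per-node update rate from Table XIII: GUPS * 1e9 / nodes, shown in
# millions of updates per node per second.
table = {1024: 47.3, 2048: 95.2, 4096: 184.7, 8192: 485.9, 16384: 858.1}
for nodes, gups in sorted(table.items()):
    per_node = gups * 1e3 / nodes  # GUPS -> million updates/s per node
    print(nodes, round(per_node, 1))
# At 16384 nodes: 52.4 million updates/node/s, i.e. 52.4% of the
# 100 M updates/node/s network bound at that system size.
```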
VI. OPTIMIZATION OF APPLICATIONS USING COMMUNICATION THREADS
Each core of the BG/Q node has four hardware threads that
share the core's resources. In applications that have high
communication overheads, one of these threads can be
dedicated to accelerate communication. We observed that in
some hybrid applications that use OpenMP within the node
and MPI to communicate across nodes, best performance is
achieved with two or three hardware threads per core for
computation. This is because OpenMP overheads may cancel the benefit of the additional threads for computation. MPI
libraries on Blue Gene/Q can enable one or two
communication threads per core to optimize the above
mentioned scenarios. In addition to improving the overall
messaging performance, communication threads also enable
independent progress in the messaging stack that can be highly
advantageous for asynchronous communication.
Past work [12] shows that we achieve a message rate of
107 million messages per second via the PAMI API and 20.9
million messages per second via MPI using 32 processes per
BG/Q node. MPI has higher overheads than PAMI, as
messages have to be matched on the receiver, based on the tag and the source rank. MPI libraries enable communication
threads to accelerate the message processing, so that
applications can take advantage of the high message rate
available in the BG/Q torus network. With 8 processes per
node, MPI libraries achieve a message rate of ~10 million
messages per second with communication threads, and ~8
million messages per second without communication threads.
The difference with or without communication threads is more
pronounced when there are fewer processes per node.
To demonstrate the benefits of communication threads we
present case studies with two linear algebra applications,
Algebraic Multi-Grid (AMG) and an iterative Poisson solver.
Both are weak-scaling applications, where the problem size is increased in proportion to the increase in
the number of cores. They also send and receive several
messages of different sizes in each iteration.
The AMG method is used to iteratively solve partial
differential equations using a hierarchy of grids with different
resolutions. The communication pattern is dense near neighbor
where processes send and receive hundreds of messages in
each iteration of the solver, as processes having coarse grid
points must communicate with a number of processes that
have fine grid points. The size of the messages can vary from
a few bytes to several hundred KB. To achieve high throughput, AMG requires the messaging libraries to sustain high message rates. On BG/Q we achieve these high messaging
rates by enabling communication threads. We ran the AMG
benchmark from the Sequoia Benchmark suite [9] with
refinement levels of 8x8x8 using solver 3, which uses a
preconditioned generalized minimum-residual iterative
method. We measured performance with and without
communication threads. Table XIV presents the application
throughput computed as a Figure Of Merit (FOM =
system_size * iterations / iteration_time). These measurements
used four MPI processes per node, and three threads per core
for a total of 12 OpenMP threads per process, leaving one
hardware thread per core available for communication threads.
Note that even without communication threads, the best performance achieved in AMG is with three threads per core
(i.e. exchanging the communication thread for an additional
OpenMP thread does not improve performance), possibly
because OpenMP overheads cancel the gains from an
additional SMT thread. For example, on 512 nodes the FOM
achieved with both 3 and 4 threads per core is 1.38e9. The
performance improvement in the overall solver time due to
communication threads is between 3.3 and 6.2%.
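The percentage gains in Table XIV follow from comparing the two FOM columns (FOM = system_size * iterations / iteration_time); a quick check of the arithmetic:

```python
# Percent gain from communication threads, per Table XIV:
# gain = (FOM_with - FOM_without) / FOM_without * 100.
rows = [(512, 1.38e9, 1.45e9), (1024, 2.42e9, 2.57e9), (2048, 4.27e9, 4.41e9)]
for nodes, without_ct, with_ct in rows:
    gain = (with_ct - without_ct) / without_ct * 100.0
    print(nodes, round(gain, 1))
# 512 -> 5.1, 1024 -> 6.2, 2048 -> 3.3
# (the paper's Table XIV rounds the 512-node gain to 5.0)
```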
A more dramatic improvement was observed with a simple
iterative solver for Poisson's equation. This solver was used to
represent applications where the main communication pattern
is boundary exchange on a regular grid. The computational
performance of this solver is limited by memory bandwidth. As a result, the overall performance of the solver is optimized
by using two or three threads per core for computation,
leaving one or two threads per core available for
communication. Table XV shows the benefit of using
communication threads with this simple iterative solver. Note
that it is not meaningful to compare the step times between
differently sized systems, since the number of iterations
changes.
VII. RELATED MESSAGING STACK (PAMI) FEATURES
The SPI-level research on message pacing, the flexibility
metric, and commthreads described in this paper have been
incorporated into the messaging stack (MPI / PAMI) as described below. Similar performance results are observed.
Message pacing is controlled by an "agent" thread that runs
on the 17th core of each node. PAMI posts a given message to
the agent for pacing when the size of the block exceeds 1 rack,
and the message is larger than a W bytes (default 64KB), and
the destination node is more than H hops away (default 4) or
its ABCD coordinates differ from the source node's in more
than D dimensions (default 1). The agent multiplexes the
messages posted from the processes on the node and paces
them, controlling the amount of data in the network on behalf
of the node. The agent divides each message into windows of
size W and allows up to M simultaneous windows in the
network. The defaults for W and M vary based upon the block
size and are currently empirically determined but can be
overridden at job launch time by the user. The agent round
robins through its list of messages, injecting one window from
each message, pausing as needed to wait for a previous
window to finish in the network, maintaining up to M active
windows in the network. Software-controlled pacing leverages the many threads on the BG/Q system and provides more flexibility versus the added complexity of a hardware-based pacing implementation.
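The agent's eligibility test can be sketched as a predicate over the thresholds named above (function and parameter names are ours; the defaults W = 64 KB, H = 4, and D = 1 come from the text):

```python
# Sketch of the pacing agent's eligibility test: a message is posted to
# the agent when the block exceeds one rack AND the message exceeds W
# bytes AND (it travels more than H hops OR its ABCD coordinates differ
# from the source's in more than D dimensions).
W_BYTES = 64 * 1024   # default message-size threshold
H_HOPS = 4            # default hop-count threshold
D_DIMS = 1            # default differing-ABCD-dimensions threshold

def needs_pacing(block_racks, msg_bytes, hops, differing_abcd_dims):
    return (block_racks > 1
            and msg_bytes > W_BYTES
            and (hops > H_HOPS or differing_abcd_dims > D_DIMS))

# A 1 MB message crossing 6 hops in a 4-rack block is paced;
# the same message in a 1-rack block is not.
print(needs_pacing(4, 1 << 20, 6, 1))  # True
print(needs_pacing(1, 1 << 20, 6, 1))  # False
```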
In PAMI, the flexibility metric is used to determine the
network routing for point-to-point messages that exceed F
bytes (default 64KB). Routing can be deterministic, or
dynamic with zone ID 0, 1, 2, or 3. One of two routing
methods is used, depending on whether the metric between the
source and destination nodes is within the range or outside of
the range. The metric determines the routing for both paced
and non-paced messages. Messages F bytes or less are
deterministically routed by default. The default metric range, routing values, and threshold F vary based upon the block size
and can be overridden by environment variables. The default
configuration gives good performance for all of the bisection
pairings, with the exception of the extreme transpose pairing.
PAMI allows the user to override the default zone ID used at
job launch time with environment variables; or the user can
specify what zone ID to use on a message-by-message basis
through the use of SPI calls. Modifying the zone IDs
themselves (e.g. the dimension ordering) is possible but
requires system calls. Ongoing research is in progress to refine
this approach for when there are a large number of outstanding
communication partners.
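The size-threshold part of this selection can be sketched as follows (the routing labels and the metric-range values here are illustrative placeholders; only the F = 64 KB default comes from the text):

```python
# Sketch of PAMI's routing choice for point-to-point messages: messages
# of F bytes or less are deterministically routed; larger messages pick
# one of two configured routings depending on whether the flexibility
# metric falls inside the configured range. Names are illustrative.
F_BYTES = 64 * 1024   # default size threshold from the text

def select_routing(msg_bytes, metric, metric_range,
                   in_range_routing="dynamic zone 1",
                   out_of_range_routing="deterministic"):
    if msg_bytes <= F_BYTES:
        return "deterministic"          # small messages: deterministic
    lo, hi = metric_range
    return in_range_routing if lo <= metric <= hi else out_of_range_routing

print(select_routing(4096, metric=0.5, metric_range=(0.2, 0.8)))
# deterministic (below the F threshold)
print(select_routing(1 << 20, metric=0.5, metric_range=(0.2, 0.8)))
# dynamic zone 1
```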
PAMI has contexts and commthreads that enable parallel communication progress. Contexts are a method of dividing
messaging hardware resources so that parallel operations can
occur. Commthreads run on hardware threads not being used
by the application. The number of contexts and commthreads
is determined by PAMI when MPI is initialized. Each context
initially has its own commthread that makes progress on that
context. If the application creates threads on the same
hardware thread that is running a commthread, the
TABLE XIII: RANDOM ACCESS PERFORMANCE

  Nodes   Total GUPS   Million Updates per Node per Second   Hardware Bound (Million Updates per Node per Second)
  1024    47.3         46.2                                  106
  2048    95.2         46.5                                  106
  4096    184.7        45.1                                  106
  8192    485.9        59.3                                  106
  16384   858.1        52.4                                  100
commthread gives its context to another commthread and
yields to the application thread. Messages are associated with
a context based on the destination rank and MPI
communicator. This evenly distributes messages among the
contexts while maintaining MPI ordering semantics. MPI
posts a message to the contexts. The commthread picks it up,
makes progress on it, and completes it. The main application
thread finishes computing and finds the message completed.
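The association of a message with a context, as described, amounts to a deterministic mapping from (communicator, destination rank) to a context index; the scheme below is an assumed illustration, not PAMI's actual function:

```python
# Sketch: map (communicator id, destination rank) to one of N contexts.
# A deterministic function of these two values spreads messages across
# contexts while keeping every (comm, rank) pair on one context, which
# preserves MPI's per-pair message-ordering semantics.
NUM_CONTEXTS = 4  # illustrative; PAMI chooses the count at MPI init

def context_for(comm_id, dest_rank, num_contexts=NUM_CONTEXTS):
    return (comm_id + dest_rank) % num_contexts

# The same (comm, rank) pair always lands on the same context:
print(context_for(0, 5) == context_for(0, 5))  # True
# Different destination ranks spread across all contexts:
print(sorted({context_for(0, r) for r in range(8)}))  # [0, 1, 2, 3]
```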
There are two versions of the MPI library on BG/Q. One is
enabled for threaded operations and one is optimized for non-
threaded operations. To have commthreads, the application
must be linked with the thread-enabled MPI library and must
initialize MPI with MPI_THREAD_MULTIPLE.
VIII. CONCLUSIONS
The Blue Gene/Q integrated network offers programmable
zone routing control for dynamic (adaptive) routing. Using
low level SPI programming and default zone settings, all-to-all performance ranges from 85% to 95% of the theoretical peak
on 512 to 16K nodes. With 16K nodes, a software optimized
version of the Random Access benchmark from the HPCC
suite achieves a preliminary result of 858.1 GUPS. With this
result, we learned that careful hardware/software co-design
can lead to a thin but efficient software layer for message
aggregation, thus changing a short-message random
communication pattern into longer messages that perform well
on the torus topology.
We studied the performance of difficult bisection pairings.
Diagonal and furthest-node pairings each achieve good
performance, albeit with different zone ID settings. Thus, a
software-controlled flexibility metric routing mechanism is developed where different hardware routing algorithms are
selected depending on the distance messages travel. Both the
diagonal and furthest-node pairings achieve over 90% of peak
on 16K nodes. This shows the importance of having multiple
hardware routing options since different applications perform
optimally under different routing algorithms. The flexibility
metric enables the system to provide very good performance
with default settings but still allows individual applications the
opportunity to optimize further.
When running communication intensive applications using
MPI, it is often beneficial to have dedicated MPI
communication threads. The performance improvement ranges
from 3.3% to 6.2% for AMG, and 11.8% to 19.7% for an
iterative solver for Poisson's equation, on 512 to 2K nodes.
An initial implementation of the low-level PAMI API used
by MPI automatically manages pacing, the flexibility metric,
and communication threads. These settings can also be
overridden by the end user. The Blue Gene/Q architecture
provides the capability to fine-tune many hardware features.
How higher level libraries such as MPI can best exploit these
hardware features is an ongoing investigation.
ACKNOWLEDGMENTS
The Blue Gene project is a team effort. We would like to thank the entire IBM Blue Gene team for their contributions and support that made this work possible.
The Blue Gene/Q project has been supported and partially funded by Argonne National Laboratory and Lawrence Livermore National Laboratory on behalf of the U.S. Department of Energy, under Lawrence Livermore National Laboratory subcontract no. B554331. We acknowledge the collaboration and support of Columbia University and the University of Edinburgh.
REFERENCES
[1] A. Gara, M. A. Blumrich, D. Chen, G. L.-T. Chiu, P. W. Coteus, M. E. Giampapa, R. A. Haring, P. Heidelberger, D. Hoenicke, G. V. Kopcsay, T. A. Liebsch, M. Ohmacht, B. D. Steinmacher-Burow, T. Takken, and P. Vranas, "Overview of the Blue Gene/L system architecture," IBM Journal of Research and Development, vol. 49, no. 2/3, pp. 195-212, March/May 2005.
[2] IBM Blue Gene Team, "Overview of the IBM Blue Gene/P project," IBM Journal of Research and Development, vol. 52, no. 1/2, pp. 199-220, January/March 2008.
[3] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, P. A. Boyle, N. H. Christ, C. Kim, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, and G. L.-T. Chiu, "The IBM Blue Gene/Q Compute Chip," IEEE Micro, vol. 32, no. 2, pp. 48-60, Mar/Apr 2012.
[4] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue Gene/Q Interconnection Network and Message Unit," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis (SC 11), ACM Press, 2011, article 26.
[5] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue Gene/Q Interconnection Fabric," IEEE Micro, vol. 32, no. 1, pp. 32-43, Jan/Feb 2012.
[6] S. Scott and G. Thorson, "The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus," Proceedings of HOT Interconnects IV, August 1996, pp. 147-156.
[7] R. Alverson, D. Roweth, and L. Kaplan, "The Gemini System Interconnect," 18th IEEE Symposium on High Performance Interconnects, August 2010.
[8] Y. Ajima, Y. Takagi, T. Inoue, S. Hiramoto, and T. Shimizu, "The Tofu Interconnect," IEEE Micro, vol. 32, no. 1, pp. 21-31, Jan/Feb 2012.
[9] Sequoia Algebraic Multi Grid (AMG) benchmark, https://asc.llnl.gov/sequoia/benchmarks/#amg
[10] V. Puente, R. Beivide, J. A. Gregorio, J. M. Prellezo, J. Duato, and C. Izu, "Adaptive Bubble Router: A Design to Improve Performance in Torus Networks," Proceedings of the IEEE International Conference on Parallel Processing, September 1999, pp. 58-67.
TABLE XV: STEP TIME (SECONDS) FOR THE POISSON SOLVER KERNEL

  Nodes   Processes/Node   OMP Threads/Process   Step Time (s) w/o comm threads   Step Time (s) with comm threads   % Gain
  512     8                6                     3.682                            3.076                             19.7
  1024    8                6                     2.525                            2.258                             11.8
  2048    8                6                     5.784                            5.073                             14.0
TABLE XIV: FIGURE OF MERIT (FOM) FOR THE AMG APPLICATION

  Nodes   Processes/Node   OMP Threads/Process   FOM without Comm. Threads   FOM with Comm. Threads   % Gain
  512     4                12                    1.38e+9                     1.45e+9                  5.0
  1024    4                12                    2.42e+9                     2.57e+9                  6.2
  2048    4                12                    4.27e+9                     4.41e+9                  3.3
[11] S. Kumar, Y. Sabharwal, R. Garg, and P. Heidelberger, "Optimization of All-to-all communication on the Blue Gene/L supercomputer," Proceedings of the International Conference on Parallel Processing (ICPP), Portland, Oregon, 2008.
[12] S. Kumar, A. R. Mamidala, D. A. Faraj, B. Smith, M. Blocksome, B. Cernohous, D. Miller, J. Parker, J. Ratterman, P. Heidelberger, D. Chen, and B. Steinmacher-Burow, "PAMI: A Parallel Active Message Interface for the Blue Gene/Q Supercomputer," to appear in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS 12), Shanghai, China, May 2012.
[13] V. Aggarwal, Y. Sabharwal, R. Garg, and P. Heidelberger, "HPCC RandomAccess benchmark for next generation supercomputers," IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), pp. 1-11, 2009.
[14] The Blue Gene Team, "Blue Gene/Q: by co-design," to appear in International Supercomputing Conference, June 2012.
[15] F. Petrini and M. Vanneschi, "Minimal vs. non Minimal Adaptive Routing on k-ary n-cubes," International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'96), Volume I, pp. 505-516, Sunnyvale, CA, August 1996.
[16] J. Kim, W. J. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," SIGARCH Comput. Archit. News, vol. 36, pp. 77-88, June 2008.
[17] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony, "The PERCS High-Performance Interconnect," 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI), pp. 75-82, August 2010.
[18] S. Scott, D. Abts, J. Kim, and W. J. Dally, "The BlackWidow High-Radix Clos Network," Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA '06), IEEE Computer Society, Washington, DC, USA, pp. 16-28, 2006.