Scaling Collective Multicast Fat-tree Networks
Sameer Kumar
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
ICPADS '04

Collective Communication

A communication operation in which all processors, or a large subset, participate; for example, broadcast. A performance impediment: all-to-all communication.
• All-to-all personalized communication (AAPC)
• All-to-all multicast (AAM)

Communication Model

The overhead of a point-to-point message is

T_p2p = α + mβ

• α is the total software overhead of sending the message
• β is the per-byte network overhead
• m is the size of the message

Direct all-to-all overhead: T_AAM = (P − 1)(α + mβ). The α term dominates when m is small; the β term dominates when m is large (a small sketch of this model follows).
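To make the model concrete, here is a minimal sketch in Python (my illustration; the parameter values are placeholders in the spirit of numbers quoted later in the talk, not measurements):

```python
def t_p2p(m, alpha, beta):
    """Point-to-point cost: software overhead plus per-byte cost."""
    return alpha + m * beta

def t_aam_direct(P, m, alpha, beta):
    """Direct all-to-all multicast: each node sends P - 1 messages."""
    return (P - 1) * t_p2p(m, alpha, beta)

# Illustrative values only: alpha = 9 us, beta for a 320 MB/s link.
print(t_aam_direct(P=128, m=64 * 1024, alpha=9e-6, beta=1 / 320e6))
```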
Optimization Strategies

• Short messages: parameter α dominates. Use message combining: reduce the total number of messages with a multistage algorithm that sends messages along a virtual topology.
• Large messages: parameter β dominates, and network contention matters. Use network-topology-specific optimizations that minimize contention.

Direct Strategies

Direct strategies optimize all-to-all multicast for large messages:
• Minimize network contention
• Topology-specific optimizations that take advantage of contention-free schedules

Fat-tree Networks

• Popular network topology for clusters
• Bisection bandwidth O(P)
• Network scales to several thousands of nodes
• Topology: k-ary n-tree

k-ary n-trees

[Figure: (a) a 4-ary 1-tree, (b) a 4-ary 2-tree, (c) a 4-ary 3-tree]
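For scale, a quick sketch assuming the standard k-ary n-tree definition (k^n processing nodes connected through n levels of k^(n-1) switches); the helper name is mine:

```python
def kary_ntree_size(k, n):
    """Processing nodes and switches in a k-ary n-tree (standard definition)."""
    nodes = k ** n                   # leaf processing nodes
    switches = n * k ** (n - 1)      # n switch levels of k^(n-1) switches each
    return nodes, switches

for n in (1, 2, 3):
    print(f"4-ary {n}-tree:", kary_ntree_size(4, n))
# 4-ary 3-tree: (64, 48)
```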
Contention Free Permutations

Fat-trees have a nice property: some processor permutations are contention free.
• Prefix permutation k: processor i sends data to i XOR k
• Cyclic shift by k: processor i sends a message to (i + k) % P; contention free if k = a·4^j (a = 1, 2, 3; j ≥ 0) (see the sketch below)
• Contention-free permutations were presented by Heller et al. for the CM-5
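A sketch of the two permutation families as destination maps, with a test for the cyclic-shift condition as reconstructed above (helper names are mine, not from the paper):

```python
def prefix_permutation(P, k):
    """Prefix permutation k: processor i sends to i XOR k."""
    return [i ^ k for i in range(P)]

def cyclic_shift(P, k):
    """Cyclic shift by k: processor i sends to (i + k) % P."""
    return [(i + k) % P for i in range(P)]

def shift_is_contention_free(k):
    """True if k = a * 4^j with a in {1, 2, 3} (my reading of the slide)."""
    while k > 0 and k % 4 == 0:
        k //= 4
    return k in (1, 2, 3)

print(prefix_permutation(8, 1))   # [1, 0, 3, 2, 5, 4, 7, 6]
print(cyclic_shift(8, 2))         # [2, 3, 4, 5, 6, 7, 0, 1]
print([k for k in range(1, 20) if shift_is_contention_free(k)])
# [1, 2, 3, 4, 8, 12, 16]
```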
Prefix Permutation 1

Prefix permutation by 1: processor p sends to p XOR 1.

[Figure: nodes 0–7 paired under p XOR 1]

Prefix Permutation 2

Prefix permutation by 2: processor p sends to p XOR 2.

[Figure: nodes 0–7 paired under p XOR 2]

Prefix Permutation 3

Prefix permutation by 3: processor p sends to p XOR 3.

[Figure: nodes 0–7 paired under p XOR 3]

Prefix Permutation 4 …

Prefix permutation by 4: processor p sends to p XOR 4.

[Figure: nodes 0–7 paired under p XOR 4]

Cyclic Shift by k

[Figure: nodes 0–7 under a cyclic shift by 2; processor p sends to (p + 2) % P]

Quadrics: HPC Interconnect

A popular interconnect; several machines in the Top500 use Quadrics, including Pittsburgh's Lemieux (6 TF) and ASCI Q (20 TF).

Features:
• Low latency (5 μs for MPI)
• High bandwidth (320 MB/s/node)
• Fat-tree topology
• Scales to 2K nodes

Effect of Contention on Throughput

[Plot: node bandwidth of the k-th permutation (MB/s) vs. k, for cyclic shift, prefix send, and cyclic shift from main memory]

• Drop in bandwidth at k = 4, 16, 64
• Sending data from main memory is much slower

Performance Bottlenecks

• 320-byte packet size; the packet protocol restricts bandwidth to faraway nodes
• PCI/DMA bandwidth is restrictive: achievable bandwidth is only 128 MB/s

Quadrics Packet Protocol

[Diagram: the sender sends the header and payload of the first packet; the receiver acks the header; the next packet is sent after the first has been acked. Nearby nodes: full utilization.]

Far Away Messages

[Diagram: the same exchange with a faraway node; the ack must cross more of the network before the next packet can be sent. Faraway nodes: low utilization.]

AAM on Fat-tree Networks

Overcome the bottlenecks:
• Messages sent from NIC memory have 2.5 times better performance
• Avoid sending messages to faraway nodes
• Use contention-free permutations (a permutation: every processor sends a message to a different destination)

AAM Strategy: Ring

Performs all-to-all multicast by sending messages along a ring formed by the processors; equivalent to P − 1 cyclic-shift-by-1 operations. Contention free; has appeared in the literature before. Drawback: processors send a different message in each step (see the schedule sketch below).

[Figure: ring of processors 0, 1, 2, …, i, i+1, …, P−1]
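A minimal sketch of the ring schedule as I read it (not the paper's code): in step s, every processor forwards to its successor the message that originated s hops behind it, so each step is a cyclic shift by 1.

```python
def ring_aam_schedule(P):
    """Yield (step, sender, dest, origin) tuples for the ring strategy.

    In step s, processor i forwards to i + 1 the message that
    originated at processor (i - s) % P.
    """
    for s in range(P - 1):
        for i in range(P):
            yield s, i, (i + 1) % P, (i - s) % P

for step, src, dst, origin in ring_aam_schedule(4):
    print(f"step {step}: {src} -> {dst} (message of {origin})")
```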
Prefix Send Strategy

• P − 1 prefix permutations: in stage j, processor i sends a message to processor i XOR (j+1) (sketched below)
• Contention free; can send messages from Elan memory
• Bad performance on large fat-trees: sends P/2 messages to faraway nodes at distance P/2 or more, and wire/switch delays restrict performance
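The stage structure as a sketch (my illustration):

```python
def prefix_send_schedule(P):
    """Stage j (0-based): processor i sends to i XOR (j + 1)."""
    return [[i ^ (j + 1) for i in range(P)] for j in range(P - 1)]

# On 8 processors, stage j = 3 pairs i with i XOR 4: every node crosses
# the top of the tree, which is the faraway traffic noted above.
for j, dests in enumerate(prefix_send_schedule(8)):
    print("stage", j, dests)
```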
K-Prefix Strategy

Hybrid of the ring strategy and prefix send: prefix send is used within partitions of size k, and the ring is used between the partitions (see the sketch below). Our contribution!

[Figure: ring across fat-tree partitions of size k; prefix send within each partition]
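A sketch of the destination sets under k-Prefix, assuming k divides P (the helper is hypothetical and only shows where messages go, not how the phases interleave):

```python
def k_prefix_destinations(P, k, i):
    """Destinations for processor i under k-Prefix (sketch).

    Prefix-send stages stay inside i's partition of size k; a ring
    step passes data to the next partition via processor i + k.
    """
    base = (i // k) * k
    prefix = [base + ((i - base) ^ j) for j in range(1, k)]  # in-partition
    ring_next = (i + k) % P                                  # next partition
    return prefix, ring_next

print(k_prefix_destinations(P=16, k=4, i=5))
# ([4, 7, 6], 9)
```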
Performance

[Plot: collective multicast performance on 128 nodes; completion time (ms) vs. message size (bytes) for MPI, prefix-send, k-prefix, and ring]

Node bandwidth (MB/s) each way:

Nodes   MPI   Prefix   K-Prefix
64      123   260      265
128     99    224      259
144     94    -        261
256     95    215      256

Our strategies send messages from Elan memory.

Cost Equation

• α: host and network software overhead
• α_b: cost of a barrier (barriers are needed to synchronize the nodes)
• β_em: per-byte network transmission cost
• δ: copying overhead to NIC memory
• P: number of processors
• k: size of the partition in k-Prefix

T_k-prefix = (P − 1)(α + m·β_em) + (P/k)·α_b + δ·m
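The same equation in code, as a sketch (the formula itself is reconstructed from a garbled slide, so treat this as my reading):

```python
def t_k_prefix(P, k, m, alpha, alpha_b, beta_em, delta):
    """Completion time of k-Prefix for m-byte messages (sketch).

    (P - 1) message sends at (alpha + m * beta_em) each, P/k barriers,
    plus one copy of the message into NIC memory at delta per byte.
    """
    return (P - 1) * (alpha + m * beta_em) + (P / k) * alpha_b + delta * m
```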
K-Prefixlb Strategy

The k-Prefixlb strategy synchronizes nodes after a few steps.

[Plot: AAM performance on 128 nodes; completion time (ms) vs. message size (bytes) for MPI, k-prefix, k-prefixlb, and k-prefixlb-cpu]

CPU Overhead

• Strategies should also be evaluated on compute overhead
• Asynchronous, non-blocking primitives are needed
• A data-driven system like Charm++ will automatically support this

Predicted vs Actual Performance

[Plot: k-Prefix performance on 128 nodes; completion time (ms) vs. message size (bytes), measured k-prefix vs. predicted k-Prefix]

The predicted plot assumes α = 9 μs, α_b = 15 μs, β = δ = 294 MB/s.
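For illustration, plugging these constants into the reconstructed cost equation above (k = 8 is my assumption; the slide does not state the partition size used):

```python
# Assumed constants from the slide; 294 MB/s expressed as a per-byte cost.
alpha, alpha_b = 9e-6, 15e-6      # seconds
beta_em = delta = 1 / 294e6       # seconds per byte
P, k = 128, 8

for m in (10_000, 100_000):       # message size in bytes
    t = (P - 1) * (alpha + m * beta_em) + (P / k) * alpha_b + delta * m
    print(f"m = {m} B: predicted {t * 1e3:.1f} ms")  # ~5.7 ms and ~44.9 ms
```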
Missing Nodes

• Missing nodes arise when nodes in the fat tree are down
• Prefix-Send and k-Prefix do badly in this scenario

Node bandwidth with 1 missing node:

Nodes   MPI   Prefix-Send   K-Prefix
128     72    158           169
240     69    -             173

K-Shift Strategy

• Processor i sends data to the consecutive nodes [i−k/2+1, …, i−1, i+1, …, i+k/2] and to i+k (see the sketch below)
• Contention free, with good performance on non-contiguous nodes, when k = 8
• Our contribution

[Figure: processor i sends to its nearest neighbors and to i+k on the node line 0 … P−1]

Node bandwidth (MB/s) with one missing node:

Nodes   K-Shift   K-Prefix
128     196       169
240     197       173

K-Shift gains because most of the destinations for each node do not change in the presence of missing nodes.
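The destination set as a sketch (helper name mine):

```python
def k_shift_destinations(P, k, i):
    """Processor i sends to [i-k/2+1 .. i-1, i+1 .. i+k/2] and i+k (mod P)."""
    left = [(i + d) % P for d in range(-k // 2 + 1, 0)]
    right = [(i + d) % P for d in range(1, k // 2 + 1)]
    return left + right + [(i + k) % P]

print(k_shift_destinations(P=32, k=8, i=5))
# [2, 3, 4, 6, 7, 8, 9, 13]
```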
Conclusion

• We optimize AAM for Quadrics QsNet
• Copying a message to the NIC and sending it from there gives more bandwidth
• K-Prefix avoids sending messages to faraway nodes
• Missing nodes are handled through the k-Shift strategy
• Cluster interconnects other than Quadrics also have such problems
• Impressive performance results
• CPU overhead should be a metric for evaluating AAM strategies