TRANSCRIPT
GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks
*Sudheer Chunduri, *Taylor Groves, *Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, Nicholas J. Wright
SC 19 - Denver, CO (*primary authors contributed equally)
The HPC and Data Center community needs a standard set of benchmarks for characterizing network performance under load.
1. Motivate and introduce GPCNeT, a network congestion benchmark
2. Describe the design of GPCNeT
3. Compare GPCNeT to congestion seen in production
4. Architectural/site evaluations:
– 4 different DoE labs
– 3 different network architectures, including the Slingshot network with advanced congestion control
Summary of Contributions
2
Sample of work at SC13-19 focused on network congestion:
• There Goes the Neighborhood: Performance Degradation Due to Nearby Jobs. SC13
• Network Endpoint Congestion Control for Fine-Grained Communication. SC15
• Evaluating HPC Networks via Simulation of Parallel Workloads. SC16
• Watch Out for the Bully! Job Interference Study on Dragonfly Network. SC16
• Run-to-run Variability on Xeon Phi Based Cray XC Systems. SC17
• Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing. SC18
• Understanding Congestion in High Performance Interconnection Networks Using Sampling. SC19
• Mitigating Network Noise on Dragonfly Networks through Application-Aware Routing. SC19
• …
Despite its importance, there is no standard benchmark to measure network performance under congestion.
Network Congestion is Trending
3
“Tests like ping pong latency are like trying to understand your commute into NYC by driving the route alone at 4am.” – Steve Scott
Best Case Performance is Rare
4
Ping pong on a quiet system vs. doing an FFT with congestion
Applications are bound by the outliers (tail latency)
HPC Workloads Limited by Congestion
5
Figure (from "Understanding Performance Variability on the Aries Dragonfly Network", 2017): P99 MPI_Allreduce latency (us) vs. process count (32 to 512); the measurements show a ~600X increase.
Designing GPCNeT
1. Strike a balance: flexible + representative of common HPC communication patterns
2. Report performance-limiting metrics
3. Measure network performance under the effects of congestion
GPCNeT Design Criteria
7
Topology compatible: can run on any network topology
We need to be able to run on any number of nodes (not just powers of 2)
Designed for Flexible Deployment
8
Probe
• Representative communication pattern on a quiet system
• Baseline for performance – e.g. 30 minutes from work to home
Baseline with Isolated Probes
9
Probes -- Communication Pattern
10
Natural Ring and Random Ring patterns
Probes -- Measurements
11
Probes perform and report both congested and isolated measurements of:
1. Latency
2. Bandwidth
3. MPI_Allreduce latency
By default a probe occupies 20% of the job nodes; the remaining 80% are divided across four congestors (see the probe sketch below).
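As a rough illustration of what a latency probe looks like, here is a minimal MPI sketch of a ring ping-pong probe. This is not GPCNeT's actual source; the message size, iteration count, and reporting below are illustrative assumptions (GPCNeT also measures bandwidth and MPI_Allreduce latency and reports distribution statistics).

/* Minimal sketch of a ring latency probe (illustrative, not GPCNeT source). */
#include <mpi.h>
#include <stdio.h>

#define NITERS    1000
#define MSG_BYTES 8        /* small message: latency-bound */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Natural ring: neighbors are rank+1 and rank-1 (mod size).
       A random ring would first shuffle the rank-to-partner mapping. */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    char sbuf[MSG_BYTES] = {0}, rbuf[MSG_BYTES];
    double sum = 0.0, worst = 0.0;

    for (int i = 0; i < NITERS; i++) {
        double t0 = MPI_Wtime();
        MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, right, 0,
                     rbuf, MSG_BYTES, MPI_CHAR, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double us = (MPI_Wtime() - t0) * 1e6;
        sum += us;
        if (us > worst) worst = us;
    }

    /* Tail latency matters: report the worst observation as well as the mean.
       GPCNeT reports percentiles (e.g. P99) across all processes. */
    double gworst;
    MPI_Reduce(&worst, &gworst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("mean %.2f us, worst %.2f us\n", sum / NITERS, gworst);

    MPI_Finalize();
    return 0;
}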
Congestors
• Stress the network to evaluate performance under load – e.g. 30→50 min. work to home
Evaluate Probes under Stress
12
End-point congestion:
● Insensitive to routing
○ Point-to-point Incast
○ RMA Incast
○ RMA Broadcast
Intermediate congestion:
● Sensitive to bisection bandwidth and routing
● Pairwise all-to-all
The Two Classes of Congestors
13
Congestors (End-point)
14
RMA Broadcast (get-based) and RMA Incast (put-based); the point-to-point incast version is not shown. (A hedged sketch of a put-based incast follows below.)
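To make the end-point congestor idea concrete, here is a hedged sketch of a put-based RMA incast in MPI: every rank targets the same window owner, creating an N-to-1 hotspot at that node's NIC. The window layout, message size, and loop count are assumptions for illustration, not GPCNeT's implementation.

/* Sketch of a put-based RMA incast congestor (illustrative assumptions). */
#include <mpi.h>

#define MSG_BYTES 4096
#define NITERS    1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int target = 0;                       /* all traffic converges on rank 0 */
    char *base = NULL;
    MPI_Win win;
    MPI_Aint win_bytes = (rank == target) ? (MPI_Aint)MSG_BYTES * size : 0;
    MPI_Win_allocate(win_bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    char src[MSG_BYTES] = {0};
    MPI_Win_lock_all(0, win);
    for (int i = 0; i < NITERS; i++) {
        if (rank != target) {
            /* Every non-target rank puts into its own slot on the target,
               creating an N-to-1 endpoint hotspot at the target's NIC. */
            MPI_Put(src, MSG_BYTES, MPI_CHAR, target,
                    (MPI_Aint)rank * MSG_BYTES, MSG_BYTES, MPI_CHAR, win);
            MPI_Win_flush(target, win);
        }
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}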
Congestor (Intermediate)
15
Pairwise All-to-All:
● At each iteration, rank i exchanges with a shifted partner (i+1 in the first iteration, and so on through iteration n-1)
● n-1 iterations of 4 KB exchanges (see the sketch below)
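Here is a hedged sketch of the pairwise all-to-all pattern in MPI: over n-1 iterations, each rank sends a 4 KB message to a partner shifted by the iteration index and receives from the opposite shift, so every pair exchanges exactly once. The shift-based pairing shown here is one common realization and is an assumption; GPCNeT's actual pairing may differ.

/* Sketch of a shifted pairwise all-to-all congestor (illustrative). */
#include <mpi.h>

#define MSG_BYTES 4096

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    char sbuf[MSG_BYTES] = {0}, rbuf[MSG_BYTES];

    /* n-1 iterations; at step k each rank sends to rank+k and receives
       from rank-k, so traffic sweeps across the bisection. */
    for (int k = 1; k < n; k++) {
        int dest = (rank + k) % n;
        int src  = (rank - k + n) % n;
        MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, dest, 0,
                     rbuf, MSG_BYTES, MPI_CHAR, src,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}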
1. Divide all nodes into 5 roughly equal-size groups
– 20% run canaries one at a time (isolated)
– 4 groups of 20% each run a specific congestor
2. Measure isolated performance of the canary
3. Start all 4 congestors
4. Measure loaded/congested performance of the canary
5. Repeat steps 2-4 for each canary
(See the communicator-split sketch below.)
Execution Sequence
16
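One plausible way to realize the 20/80 split above is with MPI_Comm_split; the sketch below is an assumption-laden outline (one rank per node, round-robin group assignment), not GPCNeT's actual job partitioning.

/* Sketch of the canary/congestor partitioning (illustrative assumptions:
   one rank per node, round-robin group assignment). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Group 0 (20% of ranks) runs the canary; groups 1-4 run congestors. */
    int group = rank % 5;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    /* Step 2: canary measures isolated performance while congestors idle. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (group == 0) { /* run isolated probe on group_comm */ }

    /* Steps 3-4: congestors start; canary re-measures under load. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (group == 0) { /* run congested probe on group_comm */ }
    else            { /* run congestor pattern 'group' on group_comm */ }

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}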
GPCNeT Informs MPI Performance
17
Distribution of isolated probe performance across all process pairs (histogram: count vs. latency)
● 696-node Cray XC — Cray MPICH MPI
● This is the baseline we compare against
GPCNeT Informs MPI Performance
18
696-node Cray XC — Cray MPICH MPI
Histogram: count vs. latency
Not all Congestors are Created Equal
696-node Cray XC — Cray MPICH MPI
19
Designing for Robustness vs. Best Case
128-node Cray CS500 with EDR, MVAPICH MPI
20
GPCNeT performance encompasses the whole communication/network stack (MPI, Topology, Fabric)
• Congestion and congestion control are of increasing importance in next-gen networks
• Introduced GPCNeT for evaluating congestion in HPC networks
• Observed congestion of up to 4 orders of magnitude
• GPCNeT enables tuning of communication libraries and establishing requirements for system performance
Conclusions and Future Work
21
Tuning GPCNeT
Scaling up Process Density
23
Increasing the process count per node creates additional sub-communicators and avoids traffic within a node
• 696 nodes
• 1, 8 & 32 PPN
• Increasing PPN → increases dimensions, hotspots, and NIC utilization
Tuning Congestion by Process Density
24
Figure annotations: 1 PPN ~100 us, 8 PPN ~1000 us, 32 PPN ~10 ms
Increase node count → increase degree of incast
Default (recommended): fully populate the system
• 10% populated (64 nodes) results in limited congestion
Tuning Congestion by Node Count
25
Histograms: count for the 64-node run vs. count for the 696-node run
GPCNeT on Production Systems
• 5,575-node NERSC Edison (Aries)
• 20% of nodes run as probes; the remaining 80% run in three modes:
– Quiet: idle (this is our baseline)
– Wild: production traffic (two representative runs)
– Congested: four congestors (20:20:20:20)
We show how congestion manifests in:
1. Hardware counters of network routers – per-port router stall rate sampled at 1 s
2. Increased latency of GPCNeT Allreduce probes
GPCNeT vs. Congestion in the Wild
27
How does GPCNeT Compare to Production?
28
Widespread intermediate congestion (GPCNeT 3X > Wild)
P99 MPI_Allreduce probes slowed:
• 2200X vs. the quiet system
• 40X greater than Wild Q3
GPCNeT default is aggressive and stresses the system
7 Systems (4 DoE Production, 3 Cray Testbeds)
• Theta, Edison, Sierra, Summit
• System sizes from 128 to 5.5k nodes
• Aries, EDR IB, and Slingshot networks
• Fully populated with GPCNeT defaults
• Report mean and P99 normalized to the baseline (see the metric sketch below)
GPCNeT Architectural Comparisons
29
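For clarity on the reported metric, here is a small hedged sketch of how per-sample slowdowns could be summarized as a mean and a 99th percentile; the function name and qsort-based percentile are illustrative assumptions, not GPCNeT's code.

/* Slowdown = congested result / isolated baseline, summarized by mean and P99
   (illustrative sketch, not the benchmark's reporting code). */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* congested[i] / isolated_mean gives a per-sample slowdown factor */
void report_slowdown(const double *congested, int n, double isolated_mean)
{
    double *s = malloc(n * sizeof *s), mean = 0.0;
    for (int i = 0; i < n; i++) {
        s[i] = congested[i] / isolated_mean;
        mean += s[i] / n;
    }
    qsort(s, n, sizeof *s, cmp);
    double p99 = s[(int)(0.99 * (n - 1))];   /* 99th-percentile slowdown */
    printf("mean slowdown %.1fX, P99 slowdown %.1fX\n", mean, p99);
    free(s);
}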
Impact of Congestion on Modern Systems
Figure: slowdown (multiplier) compared to the mean baseline (log scale), reporting mean and P99 for each system. Systems span node counts of 128 to 5,586, global-to-injection bandwidth ratios of 50% and 100%, and Aries, EDR IB, and Slingshot (SS) architectures. Annotated slowdowns range from ~5X (mean) to ~9000X (P99).
Key factors:
● Node count
● Bisection-to-injection bandwidth
● Architecture
Smaller Systems → Less Congestion
Crystal and Osprey: reduced congestion compared to larger systems of the same architecture
Figure: node counts of 128, 485, 696, 4392, 5586, 4320, and 4608 across Aries, EDR IB, and SS networks
Latency is more Sensitive than Bandwidth
Figure: factor of slowdown
Larger message sizes have a larger baseline time to complete the transfer
Larger messages can be distributed across multiple paths
Capturing Trade-offs in Bisection BW
Figure: global-to-injection bandwidth ratios (100% or 50%) per system
Summit has 2X the links going across its bisection
443X latency slowdown (Sierra) vs. 135X (Summit)
Production settings
Next Generation Congestion Control
GPCNeT shows the value of congestion control
Slingshot is designed to handle worst-case traffic patterns
Next Generation Congestion Control
Figure: Aries, EDR IB, and SS systems at 100% and 50% global-to-injection bandwidth
Routing can't always eliminate congestion
Slingshot congestion control:
1. Identifies the source(s) of congestion
2. Throttles the offending traffic
• Observed congestion of up to 4 orders of magnitude
• Congestion control is vital in next-gen networks
• Introduced GPCNeT for evaluating congestion in HPC networks
– Useful for (1) tuning communication libraries, (2) tuning routing and congestion control algorithms, and (3) system procurement
Conclusions
37
Questions?
https://github.com/netbench/GPCNET
Thanks to ORNL (Scott Atchley) and LLNL (Ramesh Pankajakshan) for running GPCNeT on their systems and providing the results for this work.
Backup
39
Backup
40
How do congestors impact a real workload?
• Ran LULESH on the three smaller test systems
– LULESH is not as communication-bound as the canaries
FAQS
41
Not all Congestors are Created Equal
42
696-node Cray XC, Cray MPICH MPI vs. 128-node Cray CS500 with EDR, MVAPICH MPI
P2P Incast: no impact on the XC; significant impact on EDR
Not all Congestors are Created Equal
43
696-node Cray XC, Cray MPICH MPI vs. 128-node Cray CS500 with EDR, MVAPICH MPI
RMA Bcast: significant impact on the XC; no impact on EDR
P2P Incast: no impact on the XC; significant impact on EDR
OMPI and MVAPICH P2P latency
• 26 node random ring
• similar trends
Differences Across MPI Implementations
44
128-node CS500, EDR IB
OMPI and MVAPICH Allreduce latency
• 26 node Allreduce
• larger differences for more complex collectives
Differences Across MPI Implementations
45
128-node CS500, EDR IB
GPCNeT vs. Endpoint Cong. in the Wild
46
• Distribution of port stall rate for entire network
• Normalized to the mean on a quiet system
• Wild-1 and Wild-2 are Q1 and Q3 production traffic, respectively
Figure legend: Min, P50, Max, 1st quartile, 3rd quartile, GPCNeT canary only
GPCNeT vs. Endpoint Cong. in the Wild
47
Similar peak stalls (wild vs. cong.)
• more incast of smaller degree in production
GPCNeT vs. Intermediate Cong. in the Wild
48
GPCNeT more aggressive than production traffic at NERSC
• Background traffic varies widely across facilities
GPCNeT vs. Throughput in the Wild
49
• Need > 1 PPN for full throughput
• 1 PPN throughput is about 1/5th that of Wild-1 and Wild-2
How is random placement fair for a system like BGQ?
Fragmentation is much more common on modern systems
What are other approaches to solving congestion?
• Underprovision work
• Overprovision the network
• Congestion control
FAQS
50
What kind of run-to-run variability did we see?
• Random canary pairs shift at each iteration
• At high PPN you could have a higher density of congestor roots within one part of the physical topology
• Information in verbose mode tracks rank mappings
FAQS
51
Q: Is Congestion Control active on the Infiniband tests for Summit and Sierra?
A: No
Q: Why not?
A: Congestion control does not run in production on Summit or Sierra.
FAQS
52