TRANSCRIPT
GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks
*Sudheer Chunduri, *Taylor Groves, *Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, Nicholas J. Wright
SC 19 - Denver, CO (*primary authors contributed equally)
The HPC and Data Center community needs a standard set of benchmarks for characterizing network performance under load.
1. Motivate and introduce GPCNeT, a network congestion benchmark
2. Describe the design of GPCNeT
3. Compare GPCNeT to congestion seen in production
4. Architectural/site evaluations:
– 4 different DoE labs
– 3 different network architectures, including the Slingshot network with advanced congestion control
Summary of Contributions
2
Sample of work at SC13-19 focused on network congestion:
• There Goes the Neighborhood: Performance Degradation Due to Nearby Jobs. SC13
• Network Endpoint Congestion Control for Fine-Grained Communication. SC15
• Evaluating HPC Networks via Simulation of Parallel Workloads. SC16
• Watch Out for the Bully! Job Interference Study on Dragonfly Network. SC16
• Run-to-run Variability on Xeon Phi Based Cray XC Systems. SC17
• Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing. SC18
• Understanding Congestion in High Performance Interconnection Networks Using Sampling. SC19
• Mitigating Network Noise on Dragonfly Networks through Application-Aware Routing. SC19
• …
Despite its importance, there is no standard benchmark to measure network performance under congestion.
Network Congestion is Trending
3
“Tests like ping pong latency are like trying to understand your commute into NYC by driving the route alone at 4am.” – Steve Scott
Best Case Performance is Rare
4
Ping pong on a quiet system vs. doing an FFT with congestion
Applications are bound by the outliers (tail latency)
HPC Workloads Limited by Congestion
5
Figure (from "Understanding Performance Variability on the Aries Dragonfly Network", 2017): P99 MPI_Allreduce latency (us) vs. process count (32 to 512); the measurements show a ~600X increase.
Designing GPCNeT
1. Strike a balance: flexible + representative of common HPC communication patterns
2. Report performance-limiting metrics
3. Measure network performance under the effects of congestion
GPCNeT Design Criteria
7
Topology compatible: can run on any network topology
We need to be able to run on any number of nodes (not just powers of 2)
Designed for Flexible Deployment
8
Probe
• Representative communication pattern on a quiet system
• Baseline for performance – e.g. 30 minutes from work to home
Baseline with Isolated Probes
9
Probes -- Communication Pattern
10
Natural Ring and Random Ring patterns
Probes -- Measurements
11
Probes perform and report both congested and isolated measurements of:
1. Latency
2. Bandwidth
3. MPI_Allreduce latency
By default a probe occupies 20% of the job nodes; the remaining 80% are divided across four congestors (see the probe sketch below).
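As a rough illustration of what a latency probe looks like, here is a minimal MPI sketch of a ring ping-pong probe. This is not GPCNeT's actual source; the message size, iteration count, and reporting below are illustrative assumptions (GPCNeT also measures bandwidth and MPI_Allreduce latency and reports distribution statistics).

/* Minimal sketch of a ring latency probe (illustrative, not GPCNeT source). */
#include <mpi.h>
#include <stdio.h>

#define NITERS    1000
#define MSG_BYTES 8        /* small message: latency-bound */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Natural ring: neighbors are rank+1 and rank-1 (mod size).
       A random ring would first shuffle the rank-to-partner mapping. */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    char sbuf[MSG_BYTES] = {0}, rbuf[MSG_BYTES];
    double sum = 0.0, worst = 0.0;

    for (int i = 0; i < NITERS; i++) {
        double t0 = MPI_Wtime();
        MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, right, 0,
                     rbuf, MSG_BYTES, MPI_CHAR, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double us = (MPI_Wtime() - t0) * 1e6;
        sum += us;
        if (us > worst) worst = us;
    }

    /* Tail latency matters: report the worst observation as well as the mean.
       GPCNeT reports percentiles (e.g. P99) across all processes. */
    double gworst;
    MPI_Reduce(&worst, &gworst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("mean %.2f us, worst %.2f us\n", sum / NITERS, gworst);

    MPI_Finalize();
    return 0;
}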
Congestors
• Stress the network to evaluate performance under load – e.g. 30→50 min. work to home
Evaluate Probes under Stress
12
End-point congestion:
● Insensitive to routing
○ Point-to-point Incast
○ RMA Incast
○ RMA Broadcast
Intermediate congestion:
● Sensitive to bisection bandwidth and routing
● Pairwise all-to-all
The Two Classes of Congestors
13
Congestors (End-point)
14
RMA Broadcast (get-based) and RMA Incast (put-based); the point-to-point incast version is not shown. (A hedged sketch of a put-based incast follows below.)
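To make the end-point congestor idea concrete, here is a hedged sketch of a put-based RMA incast in MPI: every rank targets the same window owner, creating an N-to-1 hotspot at that node's NIC. The window layout, message size, and loop count are assumptions for illustration, not GPCNeT's implementation.

/* Sketch of a put-based RMA incast congestor (illustrative assumptions). */
#include <mpi.h>

#define MSG_BYTES 4096
#define NITERS    1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int target = 0;                       /* all traffic converges on rank 0 */
    char *base = NULL;
    MPI_Win win;
    MPI_Aint win_bytes = (rank == target) ? (MPI_Aint)MSG_BYTES * size : 0;
    MPI_Win_allocate(win_bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    char src[MSG_BYTES] = {0};
    MPI_Win_lock_all(0, win);
    for (int i = 0; i < NITERS; i++) {
        if (rank != target) {
            /* Every non-target rank puts into its own slot on the target,
               creating an N-to-1 endpoint hotspot at the target's NIC. */
            MPI_Put(src, MSG_BYTES, MPI_CHAR, target,
                    (MPI_Aint)rank * MSG_BYTES, MSG_BYTES, MPI_CHAR, win);
            MPI_Win_flush(target, win);
        }
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}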
Congestor (Intermediate)
15
Pairwise All-to-All:
● At each iteration, rank i exchanges with a shifted partner (i+1 in the first iteration, and so on through iteration n-1)
● n-1 iterations of 4 KB exchanges (see the sketch below)
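Here is a hedged sketch of the pairwise all-to-all pattern in MPI: over n-1 iterations, each rank sends a 4 KB message to a partner shifted by the iteration index and receives from the opposite shift, so every pair exchanges exactly once. The shift-based pairing shown here is one common realization and is an assumption; GPCNeT's actual pairing may differ.

/* Sketch of a shifted pairwise all-to-all congestor (illustrative). */
#include <mpi.h>

#define MSG_BYTES 4096

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    char sbuf[MSG_BYTES] = {0}, rbuf[MSG_BYTES];

    /* n-1 iterations; at step k each rank sends to rank+k and receives
       from rank-k, so traffic sweeps across the bisection. */
    for (int k = 1; k < n; k++) {
        int dest = (rank + k) % n;
        int src  = (rank - k + n) % n;
        MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, dest, 0,
                     rbuf, MSG_BYTES, MPI_CHAR, src,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}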
1. Divide all nodes into 5 roughly equal-size groups
– 20% run canaries one at a time (isolated)
– 4 groups of 20% each run a specific congestor
2. Measure isolated performance of the canary
3. Start all 4 congestors
4. Measure loaded/congested performance of the canary
5. Repeat steps 2-4 for each canary
(See the communicator-split sketch below.)
Execution Sequence
16
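One plausible way to realize the 20/80 split above is with MPI_Comm_split; the sketch below is an assumption-laden outline (one rank per node, round-robin group assignment), not GPCNeT's actual job partitioning.

/* Sketch of the canary/congestor partitioning (illustrative assumptions:
   one rank per node, round-robin group assignment). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Group 0 (20% of ranks) runs the canary; groups 1-4 run congestors. */
    int group = rank % 5;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    /* Step 2: canary measures isolated performance while congestors idle. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (group == 0) { /* run isolated probe on group_comm */ }

    /* Steps 3-4: congestors start; canary re-measures under load. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (group == 0) { /* run congested probe on group_comm */ }
    else            { /* run congestor pattern 'group' on group_comm */ }

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}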
GPCNeT Informs MPI Performance
17
Distribution of isolated probe performance across all process pairs (histogram: count vs. latency)
● 696-node Cray XC — Cray MPICH MPI
● This is the baseline we compare against
GPCNeT Informs MPI Performance
18
696-node Cray XC — Cray MPICH MPI
Histogram: count vs. latency
Not all Congestors are Created Equal
696-node Cray XC — Cray MPICH MPI
19
Designing for Robustness vs. Best Case
128-node Cray CS500 with EDR, MVAPICH MPI
20
GPCNeT performance encompasses the whole communication/network stack (MPI, Topology, Fabric)
• Congestion and congestion control are of increasing importance in next-gen networks
• Introduced GPCNeT for evaluating congestion in HPC networks
• Observed congestion of up to 4 orders of magnitude
• GPCNeT enables tuning of communication libraries and establishing requirements for system performance
Conclusions and Future Work
21
Tuning GPCNeT
Scaling up Process Density
23
Increasing the process count per node creates additional sub-communicators and avoids traffic within a node
• 696 nodes
• 1, 8 & 32 PPN
• Increasing PPN → increases dimensions, hotspots, and NIC utilization
Tuning Congestion by Process Density
24
Figure annotations: 1 PPN ~100 us, 8 PPN ~1000 us, 32 PPN ~10 ms
Increase node count → increase degree of incast
Default (recommended): fully populate the system
• 10% populated (64 nodes) results in limited congestion
Tuning Congestion by Node Count
25
Histograms: count for the 64-node run vs. count for the 696-node run
GPCNeT on Production Systems
• 5,575-node NERSC Edison (Aries)
• 20% of nodes run as probes; the remaining 80% run in three modes:
– Quiet: idle (this is our baseline)
– Wild: production traffic (two representative runs)
– Congested: four congestors (20:20:20:20)
We show how congestion manifests in:
1. Hardware counters of network routers – per-port router stall rate sampled at 1 s
2. Increased latency of GPCNeT Allreduce probes
GPCNeT vs. Congestion in the Wild
27
How does GPCNeT Compare to Production?
28
Widespread intermediate congestion (GPCNeT 3X > Wild)
P99 MPI_Allreduce probes slowed:
• 2200X vs. the quiet system
• 40X greater than Wild Q3
GPCNeT default is aggressive and stresses the system
7 Systems (4 DoE Production, 3 Cray Testbeds)
• Theta, Edison, Sierra, Summit
• System sizes from 128 to 5.5k nodes
• Aries, EDR IB, and Slingshot networks
• Fully populated with GPCNeT defaults
• Report mean and P99 normalized to the baseline (see the metric sketch below)
GPCNeT Architectural Comparisons
29
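For clarity on the reported metric, here is a small hedged sketch of how per-sample slowdowns could be summarized as a mean and a 99th percentile; the function name and qsort-based percentile are illustrative assumptions, not GPCNeT's code.

/* Slowdown = congested result / isolated baseline, summarized by mean and P99
   (illustrative sketch, not the benchmark's reporting code). */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* congested[i] / isolated_mean gives a per-sample slowdown factor */
void report_slowdown(const double *congested, int n, double isolated_mean)
{
    double *s = malloc(n * sizeof *s), mean = 0.0;
    for (int i = 0; i < n; i++) {
        s[i] = congested[i] / isolated_mean;
        mean += s[i] / n;
    }
    qsort(s, n, sizeof *s, cmp);
    double p99 = s[(int)(0.99 * (n - 1))];   /* 99th-percentile slowdown */
    printf("mean slowdown %.1fX, P99 slowdown %.1fX\n", mean, p99);
    free(s);
}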
Impact of Congestion on Modern Systems
Figure: slowdown (multiplier) compared to the mean baseline (log scale), reporting mean and P99 for each system. Systems span node counts of 128 to 5,586, global-to-injection bandwidth ratios of 50% and 100%, and Aries, EDR IB, and Slingshot (SS) architectures. Annotated slowdowns range from ~5X (mean) to ~9000X (P99).
Key factors:
● Node count
● Bisection-to-injection bandwidth
● Architecture
Smaller Systems → Less Congestion
Crystal and Osprey: reduced congestion compared to larger systems of the same architecture
Figure: node counts of 128, 485, 696, 4392, 5586, 4320, and 4608 across Aries, EDR IB, and SS networks
Latency is more Sensitive than Bandwidth
Figure: factor of slowdown
Larger message sizes have a larger baseline time to complete the transfer
Larger messages can be distributed across multiple paths
Capturing Trade-offs in Bisection BW
Figure: global-to-injection bandwidth ratios (100% or 50%) per system
Summit has 2X the links going across its bisection
443X latency slowdown (Sierra) vs. 135X (Summit)
Production settings
Next Generation Congestion Control
GPCNeT shows the value of congestion control
Slingshot is designed to handle worst-case traffic patterns
Next Generation Congestion Control
Figure: Aries, EDR IB, and SS systems at 100% and 50% global-to-injection bandwidth
Routing can't always eliminate congestion
Slingshot congestion control:
1. Identifies the source(s) of congestion
2. Throttles the offending traffic
• Observed congestion of up to 4 orders of magnitude
• Congestion control is vital in next-gen networks
• Introduced GPCNeT for evaluating congestion in HPC networks
– Useful for (1) tuning communication libraries, (2) tuning routing and congestion control algorithms, and (3) system procurement
Conclusions
37
Questions?
https://github.com/netbench/GPCNET
Thanks to ORNL (Scott Atchley) and LLNL (Ramesh Pankajakshan) for running GPCNeT on their systems and providing the results for this work.
Backup
39
Backup
40
How do congestors impact a real workload?
• Ran LULESH on the three smaller test systems
– LULESH is not as communication-bound as the canaries
FAQS
41
Not all Congestors are Created Equal
42
696-node Cray XC, Cray MPICH MPI vs. 128-node Cray CS500 with EDR, MVAPICH MPI
P2P Incast: no impact on the XC; significant impact on EDR
Not all Congestors are Created Equal
43
696-node Cray XC, Cray MPICH MPI vs. 128-node Cray CS500 with EDR, MVAPICH MPI
RMA Bcast: significant impact on the XC; no impact on EDR
P2P Incast: no impact on the XC; significant impact on EDR
OMPI and MVAPICH P2P latency
• 26 node random ring
• similar trends
Differences Across MPI Implementations
44
128-node CS500, EDR IB
OMPI and MVAPICH Allreduce latency
• 26 node Allreduce
• larger differences for more complex collectives
Differences Across MPI Implementations
45
128-node CS500, EDR IB
GPCNeT vs. Endpoint Cong. in the Wild
46
• Distribution of port stall rate for entire network
• Normalized to the mean on a quiet system
• Wild-1 and Wild-2 are Q1 and Q3 production traffic, respectively
Figure legend: Min, P50, Max, 1st quartile, 3rd quartile, GPCNeT canary only
GPCNeT vs. Endpoint Cong. in the Wild
47
Similar peak stalls (wild vs. cong.)
• more incast of smaller degree in production
GPCNeT vs. Intermediate Cong. in the Wild
48
GPCNeT more aggressive than production traffic at NERSC
• Background traffic varies widely across facilities
GPCNeT vs. Throughput in the Wild
49
• Need > 1 PPN for full throughput
• 1 PPN throughput is about 1/5th that of Wild-1 and Wild-2
How is random placement fair for a system like BGQ?
Fragmentation is much more common on modern systems
What are other approaches to solving congestion?
• Underprovision work
• Overprovision the network
• Congestion control
FAQS
50
What kind of run-to-run variability did we see?
• Random canary pairs shift at each iteration
• At high PPN you could have a higher density of congestor roots within one part of the physical topology
• Information in verbose mode tracks rank mappings
FAQS
51
Q: Is Congestion Control active on the Infiniband tests for Summit and Sierra?
A: No
Q: Why not?
A: Congestion control does not run in production on Summit or Sierra.
FAQS
52