GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks *Sudheer Chunduri, *Taylor Groves, *Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, Nicholas J. Wright SC 19 - Denver, CO (*primary authors contributed equally)



Summary of Contributions

The HPC and data center community needs a standard set of benchmarks for characterizing network performance under load.

1. Motivate and introduce GPCNeT, a network congestion benchmark
2. Describe the design of GPCNeT
3. Compare GPCNeT to congestion seen in production
4. Architectural/site evaluations:
   – 4 different DoE labs
   – 3 different network architectures
   – Including the Slingshot network with advanced congestion control

Network Congestion is Trending

Sample of work at SC13 through SC19 focused on network congestion:
• There Goes the Neighborhood: Performance Degradation Due to Nearby Jobs. SC13
• Network Endpoint Congestion Control for Fine-Grained Communication. SC15
• Evaluating HPC Networks via Simulation of Parallel Workloads. SC16
• Watch Out for the Bully! Job Interference Study on Dragonfly Network. SC16
• Run-to-run Variability on Xeon Phi Based Cray XC Systems. SC17
• Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing. SC18
• Understanding Congestion in High Performance Interconnection Networks Using Sampling. SC19
• Mitigating Network Noise on Dragonfly Networks through Application-Aware Routing. SC19
• …

Despite its importance, there is no standard benchmark to measure network performance under congestion.

Best Case Performance is Rare

“Tests like ping pong latency are like trying to understand your commute into NYC by driving the route alone at 4am.” – Steve Scott

[Images: ping pong on a quiet system; doing an FFT with congestion]

HPC Workloads Limited by Congestion

Applications are bound by the outliers (tail latency)

[Figure: latency distribution; labels: latency, # measurements, P99. Source: Understanding Performance Variability on the Aries Dragonfly Network, 2017]

[Figure: Allreduce Latency (us) vs. Process Count (512, 256, 128, 64, 32); annotation: ~600X increase]


Designing GPCNeT

GPCNeT Design Criteria

1. Strike a balance: flexible and representative of common HPC communication patterns
2. Report performance-limiting metrics
3. Measure network performance under the effects of congestion

Designed for Flexible Deployment

Topology compatible, can run on: [topology illustrations not preserved]

We need to be able to run on any number of nodes (not just powers of 2)

Baseline with Isolated Probes

Probe
• Representative communication pattern on a quiet system
• Baseline for performance
  – e.g. 30 minutes from work to home

Probes -- Communication Pattern

[Figure panels: Natural Ring; Random Ring]
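
As a rough illustration of the two ring patterns named on this slide, the sketch below computes each rank's left and right neighbor. It assumes that a natural ring keeps ranks in their numeric order and that a random ring applies the same structure to a shuffled ordering; the function names and the `seed` parameter are hypothetical and are not taken from the GPCNeT source.

```python
import random


def natural_ring(n):
    """Map each rank to its (left, right) neighbors in numeric rank order."""
    return {i: ((i - 1) % n, (i + 1) % n) for i in range(n)}


def random_ring(n, seed=0):
    """Same ring structure, built over a shuffled ordering of the ranks."""
    order = list(range(n))
    random.Random(seed).shuffle(order)  # seed is an illustrative parameter
    neighbors = {}
    for pos, rank in enumerate(order):
        neighbors[rank] = (order[(pos - 1) % n], order[(pos + 1) % n])
    return neighbors


if __name__ == "__main__":
    print(natural_ring(5))
    print(random_ring(5, seed=42))
```

In the natural ring, neighbors tend to be close in rank order, while the shuffled ring pairs ranks without regard to that order.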

Probes -- Measurements

Probes measure and report, both congested and isolated:
1. Latency
2. Bandwidth
3. MPI_Allreduce latency

By default, a probe occupies 20% of the job's nodes

The remaining 80% are divided across four congestors
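
The default split can be sketched as simple arithmetic over node indices. This is a minimal sketch, assuming contiguous index ranges for readability; the benchmark's actual node placement policy is not reproduced here, and `partition_nodes` and its parameters are hypothetical names.

```python
def partition_nodes(num_nodes, num_congestors=4, probe_fraction=0.2):
    """Split node indices into one probe group and several congestor groups."""
    if num_nodes < num_congestors + 1:
        raise ValueError("need at least one node per group")
    # Roughly 20% of the nodes go to the probe, leaving at least one per congestor.
    num_probe = max(1, round(num_nodes * probe_fraction))
    num_probe = min(num_probe, num_nodes - num_congestors)
    nodes = list(range(num_nodes))
    probe, rest = nodes[:num_probe], nodes[num_probe:]
    # Spread the remaining nodes as evenly as possible across the congestors.
    base, extra = divmod(len(rest), num_congestors)
    groups, start = [], 0
    for g in range(num_congestors):
        size = base + (1 if g < extra else 0)
        groups.append(rest[start:start + size])
        start += size
    return probe, groups


if __name__ == "__main__":
    probe, groups = partition_nodes(696)
    print(len(probe), [len(g) for g in groups])
```

Because node counts need not be multiples of five, the groups are only roughly equal in size.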

Evaluate Probes under Stress

Congestors
• Stress the network to evaluate performance under load
  – e.g. 30→50 min. from work to home

The Two Classes of Congestors

End-point congestion:
● Insensitive to routing
  ○ Point-to-point Incast
  ○ RMA Incast
  ○ RMA Broadcast

Intermediate congestion:
● Sensitive to bisection bandwidth and routing
● Pairwise all-to-all

[Figure labels: INTERMEDIATE, END-POINT, END-POINT]

Congestors (End-point)

[Figure panels: RMA Broadcast (get based); RMA Incast (put based)]

(point-to-point version not shown)
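
To make the two end-point patterns concrete, the sketch below lists the data flows each one produces as (source, destination) pairs. It assumes a single root rank per congestor group; the helper names are hypothetical, and the real congestors issue MPI one-sided or point-to-point operations rather than building lists.

```python
def incast_flows(ranks, root):
    """Put-based incast: every non-root rank writes to the root (many-to-one)."""
    return [(src, root) for src in ranks if src != root]


def broadcast_flows(ranks, root):
    """Get-based broadcast: every non-root rank reads from the root (one-to-many)."""
    return [(root, dst) for dst in ranks if dst != root]


def endpoint_load(flows):
    """Count how many flows terminate at each destination rank."""
    load = {}
    for _, dst in flows:
        load[dst] = load.get(dst, 0) + 1
    return load


if __name__ == "__main__":
    ranks = list(range(8))
    print(endpoint_load(incast_flows(ranks, root=0)))
    print(endpoint_load(broadcast_flows(ranks, root=0)))
```

In the incast case all flows converge on the root's endpoint, while in the broadcast case the same endpoint is the single source for every flow.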

Congestor (Intermediate)

Pairwise All-to-All:
● at each iteration, rank i exchanges with i+1
● n-1 iterations of 4 KB exchanges

[Figure label: iteration n-1]
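
A pairwise exchange over n ranks can be sketched as a schedule of partners per iteration. The slide states the i+1 partner and n-1 iterations; the sketch assumes the conventional pairwise-exchange scheme in which the offset grows by one each iteration, so that after n-1 iterations every rank has exchanged with every other rank. That generalization and the names below are assumptions, not the benchmark's code.

```python
MESSAGE_BYTES = 4 * 1024  # 4 KB per exchange, as stated on the slide


def pairwise_schedule(n):
    """Yield (iteration, rank, send_to, recv_from) for a pairwise all-to-all."""
    for k in range(1, n):          # n-1 iterations
        for i in range(n):
            yield k, i, (i + k) % n, (i - k) % n


def bytes_sent_per_rank(n):
    """Total payload each rank sends over the whole schedule."""
    return (n - 1) * MESSAGE_BYTES


if __name__ == "__main__":
    for step in pairwise_schedule(4):
        print(step)
    print(bytes_sent_per_rank(4))
```

Every rank sends and receives in every iteration, which is why this pattern loads links across the whole network rather than a single endpoint.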

Execution Sequence

1. Divide all nodes into 5 roughly equal-size groups
   – 20% run canaries (probes) one at a time (isolated)
   – 4 groups of 20% each run a specific congestor
2. Measure isolated performance of canary
3. Start all 4 congestors
4. Measure loaded/congested performance of canary
5. Repeat steps 2-4 for each canary
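
Steps 2 through 5 can be sketched as a small driver loop. This is only an outline under stated assumptions: `measure`, `start`, and `stop` are hypothetical callables supplied by the caller, and stopping the congestors between canaries is inferred from the need to take a fresh isolated measurement each time.

```python
def run_sequence(canaries, congestors, measure, start, stop):
    """Measure each canary isolated, then again with all congestors active."""
    results = {}
    for name in canaries:
        isolated = measure(name)       # step 2: quiet baseline
        for c in congestors:           # step 3: start all congestors
            start(c)
        try:
            congested = measure(name)  # step 4: loaded measurement
        finally:
            for c in congestors:       # return to a quiet state
                stop(c)
        results[name] = {
            "isolated": isolated,
            "congested": congested,
            # Ratio for time-like metrics; a bandwidth metric would invert it.
            "slowdown": congested / isolated,
        }
    return results


if __name__ == "__main__":
    active = set()
    fake_measure = lambda name: 5.0 if active else 1.0  # stand-in timings
    print(run_sequence(["latency", "allreduce"], ["c1", "c2", "c3", "c4"],
                       fake_measure, active.add, active.discard))
```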

GPCNeT Informs MPI Performance

Distribution of isolated probe performance across all process pairs

[Figure: histogram; axes: Latency, Count]

● 696 Node Cray XC, Cray MPICH MPI
● This is the baseline to which we compare

GPCNeT Informs MPI Performance

696 Node Cray XC, Cray MPICH MPI

[Figure: histogram; axes: Latency, Count]

Not all Congestors are Created Equal

696 Node Cray XC, Cray MPICH MPI

[Figure: histogram; y-axis: Count]

Designing for Robustness vs Best-case

128 Node Cray CS500 with EDR, MVAPICH MPI

GPCNeT performance encompasses the whole communication/network stack (MPI, topology, fabric)

Conclusions and Future Work

• Congestion and congestion control are of increasing importance in next-gen networks
• Introduce GPCNeT for evaluating congestion in HPC networks
• Observed slowdowns from congestion of up to 4 orders of magnitude
• GPCNeT enables tuning of communication libraries and establishing requirements for system performance


Tuning GPCNeT

Scaling up Process Density

Increasing the process count per node creates additional sub-communicators and avoids traffic within a node

[Figure: two nodes]
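
One plausible reading of this slide is that ranks are grouped by their on-node index, so each sub-communicator holds exactly one rank per node and none of its traffic stays inside a node. The sketch below illustrates that grouping under the assumption of block rank placement (ranks 0 to ppn-1 on node 0, and so on); the names are hypothetical, and in MPI terms this would correspond to an MPI_Comm_split keyed on the local index.

```python
def node_of(rank, ppn):
    """Node index of a global rank under block placement (an assumption)."""
    return rank // ppn


def sub_communicators(num_nodes, ppn):
    """Group global ranks by their on-node index: one rank per node per group."""
    groups = {local: [] for local in range(ppn)}
    for rank in range(num_nodes * ppn):
        groups[rank % ppn].append(rank)
    return groups


if __name__ == "__main__":
    for local, members in sub_communicators(num_nodes=4, ppn=2).items():
        print(local, members, [node_of(r, 2) for r in members])
```

With this grouping, raising PPN adds more groups rather than larger ones, and every pair of ranks within a group sits on different nodes.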

Tuning Congestion by Process Density

• 696 nodes
• 1, 8 & 32 PPN
• Increase PPN →
  • increase dimensions
  • increase hotspots
  • increase NIC utilization

[Figure labels: 1 PPN, 100 us; 8 PPN, 1000 us; 32 PPN, 10 ms]

Tuning Congestion by Node Count

Increase node count →
• Increase degree of incast

Default (recommended): fully populate the system
• 10% populated (64 nodes) results in limited congestion

[Figure panels: Count (64 nodes); Count (696 nodes)]


GPCNeT on Production Systems

GPCNeT vs. Congestion in the Wild

• 5575 nodes, NERSC Edison (Aries)
• 20% of nodes run as probes; the remaining 80% run in three modes
  – Quiet: idle (this is our baseline)
  – Wild: production traffic (two representative runs)
  – Congested: four congestors (20:20:20:20)

We show how congestion manifests:
1. Hardware counters of network routers
   – per-port router stall rate @ 1s
2. Increase in latency of GPCNeT Allreduce probes

How does GPCNeT Compare to Production?

Widespread intermediate congestion (GPCNeT 3X > Wild)

P99 MPI_Allreduce probes slowed:
• 2200X vs Quiet System
• 40X greater than Wild Q3

GPCNeT default is aggressive and stresses the system

GPCNeT Architectural Comparisons

7 Systems (4 DoE Production, 3 Cray Testbeds)

• Theta, Edison, Sierra, Summit
• System size from 128 to 5.5k nodes
• Aries, EDR IB and Slingshot networks
• Fully populated with GPCNeT defaults
• Report mean and P99 normalized to baseline
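
The normalized metrics in the last bullet can be sketched as follows. This assumes that both the mean and the P99 of the congested samples are divided by the mean of the isolated samples, and it uses a nearest-rank percentile; both choices are assumptions for illustration, as are the function names.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slowdown_summary(isolated, congested):
    """Mean and P99 of congested samples, each divided by the isolated mean."""
    baseline = sum(isolated) / len(isolated)
    return {
        "mean": (sum(congested) / len(congested)) / baseline,
        "p99": percentile(congested, 99) / baseline,
    }


if __name__ == "__main__":
    quiet = [2.0] * 100                          # placeholder samples, not measurements
    loaded = [float(v) for v in range(1, 101)]   # placeholder samples, not measurements
    print(slowdown_summary(quiet, loaded))
```

Reporting P99 alongside the mean matters because tail latency, not the average, bounds tightly coupled applications.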

Impact of Congestion on Modern Systems

● Node Count
● Bisection to Injection Bandwidth
● Architecture

[Figure: slowdown (multiplier) compared to mean baseline, log scale, with Mean and P99 per system. Architecture labels: Aries, EDR IB, SS. Node counts shown: 128, 485, 696, 4320, 4392, 4608, 5586. Global/Injection BW: 100% or 50% per system. Annotations: 9000X, 5X]

Smaller Systems → Less Congestion

Crystal and Osprey: reduced congestion compared to larger systems of the same architecture

[Figure: per-system slowdown chart. Architecture labels: Aries, EDR IB, SS. Node counts shown: 128, 485, 696, 4320, 4392, 4608, 5586]


Latency is more Sensitive than Bandwidth

[Figure: Factor of Slowdown per system; architecture labels: Aries, EDR IB, SS]

Larger message sizes have a larger baseline time to complete a transfer

Larger messages can be distributed across multiple paths
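
The first point can be illustrated with simple arithmetic. The sketch assumes a deliberately crude model in which congestion adds the same fixed delay to any transfer; the baseline times and the delay are made-up illustrative values, not measurements from the talk.

```python
def slowdown(baseline_us, added_delay_us):
    """Factor of slowdown when congestion adds a fixed delay to a transfer."""
    return (baseline_us + added_delay_us) / baseline_us


if __name__ == "__main__":
    delay_us = 100.0         # hypothetical congestion delay
    small_msg_us = 2.0       # hypothetical baseline for a small (latency) message
    large_msg_us = 200.0     # hypothetical baseline for a large (bandwidth) message
    print(slowdown(small_msg_us, delay_us))
    print(slowdown(large_msg_us, delay_us))
```

Under this model the same added delay is a large multiple of a small-message baseline but a modest fraction of a large-message baseline, which is consistent with latency probes showing larger slowdown factors.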

Capturing Trade-offs in Bisection BW

Summit has 2X the links going across its bisection

Latency: 443X slowdown (Sierra) vs 135X slowdown (Summit)

Production settings

[Figure labels: Global/Injection BW, 100% or 50% per system]

Next Generation Congestion Control

GPCNeT shows the value of congestion control

Slingshot is designed to handle the worst-case traffic patterns

[Figure: per-system slowdown chart. Architecture labels: Aries, EDR IB, SS. Node counts shown: 128, 485, 696, 4320, 4392, 4608, 5586. Global/Injection BW: 100% or 50% per system]

Next Generation Congestion Control

Routing can’t always eliminate congestion

Slingshot congestion control:
1. Identifies source(s) of congestion
2. Throttles the offending traffic

[Figure labels: Aries, EDR IB, SS; Global/Injection BW, 100% or 50% per system]

Conclusions

• Observed slowdowns from congestion of up to 4 orders of magnitude
• Congestion control is vital in next-gen networks
• Introduced GPCNeT for evaluating congestion in HPC networks
  – useful for (1) tuning of communication libraries, (2) tuning of routing and congestion control algorithms, (3) system procurement


Questions?

https://github.com/netbench/GPCNET

Thanks to ORNL (Scott Atchley) and LLNL (Ramesh Pankajakshan) for running GPCNeT on their systems and providing the results for this work.


Backup


FAQs

How do congestors impact a real workload?

• Ran LULESH on the three smaller test systems
  – LULESH is not as communication bound as the canaries


Not all Congestors are Created Equal

Systems compared: 696 Node Cray XC (Cray MPICH MPI) and 128 Node Cray CS500 with EDR (MVAPICH MPI)

RMA Bcast:
• Significant impact on XC
• No impact on EDR

P2P Incast:
• No impact on XC
• Significant impact on EDR

Differences Across MPI Implementations

OMPI and MVAPICH P2P latency
• 26 node random ring
• similar trends

[Figure label: CS500, EDR IB, 128 nodes]

Differences Across MPI Implementations

OMPI and MVAPICH Allreduce latency
• 26 node Allreduce
• larger differences for more complex collectives

[Figure label: CS500, EDR IB, 128 nodes]

GPCNeT vs. Endpoint Cong. in the Wild

• Distribution of port stall rate for entire network
• normalized to mean on quiet system
• Wild-1, Wild-2 are Q1, Q3 production, respectively

[Figure labels: Max, Min, P50, 1st Quartile, 3rd Quartile; GPCNeT Canary Only]

GPCNeT vs. Endpoint Cong. in the Wild

Similar peak stalls (wild vs. cong.)
• more incast of smaller degree in production

GPCNeT vs. Intermediate Cong. in the Wild

GPCNeT is more aggressive than production traffic at NERSC
• Background traffic varies widely across facilities

GPCNeT vs. Throughput in the Wild

• Need > 1 PPN for full throughput
• 1 PPN throughput is 1/5th that of Wild-1 and Wild-2

FAQs

How is random placement fair for a system like BGQ?
• Fragmentation is much more common on modern systems

What are other approaches to solving congestion?
• underprovision work
• overprovision network
• congestion control

FAQs

What kind of run-to-run variability did we see?
• Random canary pairs shift between iterations
• You could have a higher density of congestor roots within a part of a physical topology for high PPN
• Information in verbose mode to track rank mappings

FAQs

Q: Is congestion control active on the InfiniBand tests for Summit and Sierra?
A: No

Q: Why not?
A: Congestion control does not run in production on Summit or Sierra.