
Re-designing Communication and Work Distribution in Scientific Applications for Extreme-scale Heterogeneous Systems

Project Team:

Karen Tomko (PI), Ohio Supercomputer Center

Dhabaleswar K. Panda (Co-PI), Ohio State University

Khaled Hamidouche, Ohio State University

Hari Subramoni, Ohio State University

Jithin Jose, Ohio State University

Raghunathan Raja Chandrasekar, Ohio State University

Rong Shi, Ohio State University

Akshay Venkatesh, Ohio State University

Jie Zhang, Ohio State University

Blue Waters Symposium - 2014

Drivers of Modern HPC System Architectures

•  Multi-core processors are ubiquitous

•  Modern interconnects have high-performance features such as RDMA and support for collectives

•  Accelerators/coprocessors are becoming common in high-end systems

•  Pushing the envelope for Exascale computing

[Figure: multi-core processors; high-performance interconnects; accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip). Example systems: Tianhe-2 (1), Titan (2), Stampede (6), Blue Waters.]

Challenges for Communication Runtimes

•  Complex architecture
    –  Within a node: accelerators connected via PCIe, NUMA shared memory
    –  Interconnect feature and topology considerations

•  Scaling
    –  Current algorithms developed and tested with 100s to 1000s of processes
    –  Few systems on which to run with 10,000s to 100,000s of processes

Parallel Programming Models Overview

[Figure: three abstract machine models, each drawn with processes P1, P2, P3.
Shared Memory Model (one shared memory): SHMEM, OpenMP.
Distributed Memory Model (one memory per process): MPI (Message Passing Interface).
Partitioned Global Address Space (PGAS) (per-process memories presented as a logical shared memory): Global Arrays, UPC, OpenSHMEM, …]

•  Programming models provide abstract machine models

•  Models can be mapped on different types of systems
    –  e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

•  Many-core models
    –  OpenMP, OpenACC, CUDA

Key Questions

•  How do MPI collectives perform at extreme scales?

•  How well do the CraySHMEM and UPC PGAS collective communications scale?

•  Can both the CPU and GPU resources be leveraged effectively in a hybrid node system?

MPI on Blue Waters

•  Domain applications such as weather forecasting, earthquake simulations and many more have a real requirement for large throughput capability

•  MPI is the dominant programming model for distributed-memory systems

•  MPI jobs on the order of 1K processes are becoming common

•  MPI jobs on the order of 1M processes are the maximum

•  Blue Waters is one of the first systems that can be used to test the performance of MPI jobs at a really large scale


Blue Waters MPI Collective Performance

•  Point-to-point operations and collective operations determine the performance of MPI programs

•  Performance of point-to-point operations involves
    –  Efficient utilization of the underlying interconnection hardware
    –  Design of high-performance protocols

•  Performance of collectives additionally involves
    –  Design of efficient algorithms

•  We evaluate the performance of common collectives such as:
    –  MPI_Bcast
    –  MPI_Reduce
    –  MPI_Allgather
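To make "design of efficient algorithms" concrete, here is a minimal Python sketch of a binomial-tree broadcast schedule, a textbook algorithm that reaches all processes in ceil(log2 P) rounds. It is an illustration only: the slides do not say which algorithms the MPI library on Blue Waters selects, and the function name binomial_bcast_schedule is made up for this example.

```python
def binomial_bcast_schedule(num_procs, root=0):
    """Return the rounds of a binomial-tree broadcast.

    Each round is a list of (sender, receiver) rank pairs. In every round,
    each rank that already holds the data forwards it to one new rank, so
    the number of ranks holding the data doubles.
    """
    rounds = []
    have = 1  # number of (root-relative) ranks that already hold the data
    while have < num_procs:
        pairs = []
        for rel in range(have):
            dst = rel + have
            if dst < num_procs:
                # Convert root-relative ranks back to actual ranks.
                pairs.append(((rel + root) % num_procs, (dst + root) % num_procs))
        rounds.append(pairs)
        have *= 2
    return rounds


for i, pairs in enumerate(binomial_bcast_schedule(8)):
    print(f"round {i}: {pairs}")
# round 0: [(0, 1)]
# round 1: [(0, 2), (1, 3)]
# round 2: [(0, 4), (1, 5), (2, 6), (3, 7)]
```

With this schedule, 128K (131,072) processes need 17 rounds, which is why short-message broadcast latency is expected to grow slowly with process count when the network is not congested.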



Performance of MPI_Bcast (64 – 512 Processes)

[Figure: MPI_Bcast latency (us) vs. message size (bytes) for 64, 128, 256 and 512 processes. Three panels: 1 to 1K bytes, 1K to 32K bytes, and 64K to 1M bytes.]

•  Latency is flat in the 1 byte – 32 byte range and then starts climbing – regardless of process count

•  Latency of broadcast more than doubles in the short message range going from 128 processes to 256 processes which is undesirable


Performance of MPI_Bcast (1K – 8K Processes)

[Figure: MPI_Bcast latency (us) vs. message size (bytes) for 1K, 2K, 4K and 8K processes. Three panels: 1 to 512 bytes, 1K to 16K bytes, and 64K to 1M bytes.]

•  For process counts over 1K, there is a spike in latency around the 256-byte message size, where the available bandwidth starts getting stressed


Performance of MPI_Bcast (16K – 128K Processes)

[Figure: MPI_Bcast latency (us) vs. message size (bytes) for 16K, 32K, 64K and 128K processes. Three panels: 1 to 512 bytes, 1K to 32K bytes, and 64K to 1M bytes.]

•  Unlike the 64 to 8K process runs, latency shows variability here, possibly an effect of network traffic

•  The spike around the 8K message size is indicative of an algorithm selection problem


Performance of MPI_Reduce (64 – 512 Processes)

[Figure: MPI_Reduce latency (us) vs. message size (bytes) for 64, 128, 256 and 512 processes. Two panels: 1 to 512 bytes, and 1K to 256K bytes.]

•  Reduce is hardware accelerated, and its latency is similar regardless of process count

•  There does seem to be a limitation of the hardware acceleration around the 128K-byte message size


Performance of MPI_Reduce (1K – 8K Processes)

[Figure: MPI_Reduce latency (us) vs. message size (bytes) for 1K, 2K, 4K and 8K processes. Two panels: 1 to 512 bytes, and 1K to 256K bytes.]

•  Trends similar to smaller process count


Performance of MPI_Reduce (16K – 128K Processes)

•  Notable increase in latency for 128K processes in the short message range

[Figure: MPI_Reduce latency (us) vs. message size (bytes) for 16K, 32K, 64K and 128K processes. Two panels: 1 to 512 bytes, and 1K to 256K bytes.]


Scalability of MPI_Bcast and MPI_Reduce

•  Scalability normalized to 64 process job case

•  MPI_Reduce is highly scalable

•  MPI_Bcast is not as scalable
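The normalization behind the scalability plots can be sketched in a few lines of Python: each latency is divided by the latency of the 64-process job at the same message size, so a value of 1.0 means "as fast as the 64-process case". The helper name and the sample numbers below are made up for illustration; they are not measurements from Blue Waters.

```python
def normalize_to_baseline(latency_by_procs, baseline_procs=64):
    """Divide every latency by the latency measured at baseline_procs."""
    base = latency_by_procs[baseline_procs]
    return {procs: lat / base for procs, lat in sorted(latency_by_procs.items())}


# Made-up illustrative latencies in microseconds (NOT measured values).
example = {64: 2.0, 4096: 5.0, 131072: 30.0}
print(normalize_to_baseline(example))
# {64: 1.0, 4096: 2.5, 131072: 15.0}
```

A collective that scales well keeps the normalized value close to 1.0 as the process count grows.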

[Figure: two panels, "MPI_Bcast Scalability" and "MPI_Reduce Scalability". Normalized latency vs. message size (bytes), 1 to 512 bytes, for 64, 4K and 128K processes.]


Performance of MPI_Allgather (128K Processes)

[Figure: 128K-process Allgather latency (us x 1000) vs. message size, 1 to 4K bytes.]

•  Allgather is equivalent to all processes performing broadcasts

•  It tests the bandwidth of the interconnect

•  Traditionally, O(log N) algorithms are applicable to short-message allgathers

•  The graph above raises an alarm about latency growth for large-scale dense collectives
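A small Python simulation of recursive doubling, one well-known O(log N) allgather algorithm, shows why the collective is still bandwidth-bound: the number of rounds is log2 P, but the amount of data exchanged doubles every round, so each process ends up receiving P - 1 blocks in total. This is a sketch for intuition; it is not claimed to be the algorithm used on Blue Waters, and it assumes a power-of-two process count.

```python
def recursive_doubling_allgather(num_procs):
    """Simulate a recursive-doubling allgather.

    Returns the number of blocks rank 0 receives in each round.
    num_procs must be a power of two.
    """
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("num_procs must be a power of two")
    held = [{rank} for rank in range(num_procs)]  # blocks held by each rank
    received_per_round = []
    distance = 1
    while distance < num_procs:
        # Rank 0 receives everything its partner currently holds.
        received_per_round.append(len(held[distance]))
        # Every rank exchanges its full set of blocks with rank XOR distance.
        held = [held[rank] | held[rank ^ distance] for rank in range(num_procs)]
        distance *= 2
    if any(len(blocks) != num_procs for blocks in held):
        raise RuntimeError("allgather did not complete")
    return received_per_round


print(recursive_doubling_allgather(8))        # [1, 2, 4]
print(sum(recursive_doubling_allgather(8)))   # 7
```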

Observations on MPI Collective Performance

•  Performance of latency-sensitive operations such as Reduce is competitive in the operational range with increasing scale

•  Congestion effects and cross-job traffic are likely to play a role in the performance of collectives as job sizes get larger (as seen in the 128K jobs)

•  Performance of dense collectives like Allgather suffers from bandwidth limitations =>
    –  Applications should perform such collectives in smaller communicators or use the non-blocking variants of the collectives
    –  Better algorithms need to be devised to overcome bandwidth limitations
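The advice to use smaller communicators follows from simple arithmetic: in an allgather, every process must receive one block from each of the other P - 1 processes, whatever the algorithm. The Python sketch below compares the full 128K-process job with a sub-communicator; the sub-communicator size of 1,024 and the 4K-byte block are illustrative choices, not values recommended in the slides.

```python
def allgather_bytes_received(num_procs, msg_bytes):
    """Bytes each process must receive in an allgather of msg_bytes blocks."""
    return (num_procs - 1) * msg_bytes


full = allgather_bytes_received(128 * 1024, 4096)  # whole 128K-process job
sub = allgather_bytes_received(1024, 4096)         # hypothetical 1K-process communicator
print(full, sub)
# 536866816 4190208
```

The per-process volume drops by roughly the ratio of the communicator sizes, about 128x in this example.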






Key Questions

PGAS (UPC/SHMEM) on Blue Waters

•  Partitioned Global Address Space (PGAS) programming models are getting more traction
    –  Shared-memory abstraction over distributed nodes
    –  Global view of data and one-sided communication calls
    –  Provides improved productivity
    –  Can express irregular communication patterns easily

•  Unified Parallel C (UPC): a language-based PGAS model

•  SHMEM: a library-based model

•  Blue Waters provides a good platform to evaluate the performance of UPC/SHMEM jobs at scale


Blue Waters UPC Performance Evaluations

•  Point-to-point operations and collective operations determine the performance of UPC programs

•  Used Cray UPC and the OSU UPC Microbenchmarks for the evaluations

•  Point-to-point operations evaluated
    –  upc_memput
    –  upc_memget

•  Collective operations evaluated
    –  upc_barrier
    –  upc_broadcast
    –  upc_reduce



UPC Put/Get Performance

[Figure: two panels, "UPC Memput Latency" and "UPC Memget Latency". Time (us) vs. message size, 1 to 32K bytes, for intra-node and inter-node transfers.]

•  Latency is flat in the 1 byte – 512 byte range and then starts climbing
    –  Latency for UPC Put (intra/inter) for a 4-byte message: 0.13/2.34 us
    –  Latency for UPC Get (intra/inter) for a 4-byte message: 0.07/1.17 us

•  The higher cost of the Put operation might be because of the extra synchronization operation (upc_fence) needed to ensure completion


UPC Barrier Performance

•  Barrier operation latency at 32,768 processes: 186 us

•  Scalability graph shows the latency normalized to that at 1,024 processes

•  Linear scalability observed for smaller system sizes

[Figure: two panels, "Latency" (time in us) and "Scalability" (latency normalized to the latency at 1K processes), vs. system size: 1024, 2048, 8192 and 32768 processes.]


UPC Broadcast Performance

•  Broadcast latency for a 4-byte message at 32,768 processes: 13 us

•  Variation in latencies observed after 8,192 processes, and the variation increases with scale

•  Broadcast latency does not scale linearly with increase in system size

[Figure: two panels, "Latency" (time in us) and "Scalability" (latency normalized to the latency at 2K processes), vs. message size, 1 to 256K bytes, for 2K, 8K, 16K and 32K processes.]


UPC Reduce Performance

•  Reduce latency for a 4-byte message at 32,768 processes: 5.4 us

•  Linear scalability observed for the small message range

•  Variation in operation latency observed as the system size increases

[Figure: two panels, latency (time in us) and scalability (latency normalized to the latency at 2K processes), vs. message size, 1 to 256K bytes, for 8K, 16K and 32K processes.]

Blue Waters CraySHMEM Performance Evaluations

•  Point-to-point operations and collective operations determine the performance of SHMEM programs

•  Used the CraySHMEM library and the OSU OpenSHMEM Microbenchmarks for the evaluations

•  Point-to-point operations evaluated
    –  shmem_put
    –  shmem_get

•  Collective operations evaluated
    –  shmem_barrier
    –  shmem_broadcast
    –  shmem_reduce
    –  shmem_collect



CraySHMEM Put/Get Performance

•  Latency is flat in the 1 byte – 512 byte range and then starts climbing after 1K bytes
    –  Latency for a 4-byte Put operation (intra/inter): 0.12/1.04 us
    –  Latency for a 4-byte Get operation (intra/inter): 0.05/1.41 us

•  Significantly higher latency observed for the Get operation as message size increases
    –  Get latency for a 512K message: 763 us

[Figure: two panels, "SHMEM Put Latency" and "SHMEM Get Latency". Time (us) vs. message size, 1 to 32K bytes, for intra-node and inter-node transfers.]


CraySHMEM Barrier Performance

•  Barrier Latency at 16,384 processes – 138.64 us

•  Latencies are similar to those of the UPC barrier

•  Shows good scalability trends with increase in system size

[Figure: two panels, "Latency" (time in us) and "Scalability" (latency normalized to the latency at 2K processes), vs. system size: 2048, 4096 and 16384 processes.]


CraySHMEM Broadcast Performance

•  Latency is flat in the 1 byte – 512 byte range and then starts climbing, regardless of process count

•  Broadcast latency for a 4-byte message at 16,384 processes: 72.3 us

•  Variation in latencies observed with increase in system size

[Figure: two panels. Latency: time (us) vs. message size, 4 to 256K bytes, for 4K, 8K and 16K processes. Scalability: latency normalized to the latency at 2K processes vs. message size, 4 to 256K bytes, for 2K, 4K, 8K and 16K processes.]


CraySHMEM Reduce Performance

•  Latency for a 4-byte message at 16K processes: 210 us

•  Scalability analysis shows good scalability trends, even at higher system sizes

•  Latencies are smaller than for the UPC reduce operation, attributed to extra synchronization operations in UPC collective operations

[Figure: two panels, latency (time in us) and scalability (latency normalized to the latency at 2K processes), vs. message size, 4 to 256K bytes, for 4K, 8K and 16K processes.]


CraySHMEM Collect Performance

•  Latency for a 4-byte collect (all-gather) operation at 16K processes: 319.3 ms

•  Scalability analysis shows the collect operation scales well

[Figure: two panels, "Latency" (time in ms) and "Scalability" (latency normalized to the latency at 2K processes), vs. message size, 4 to 32K bytes, for 2K, 4K, 8K and 16K processes.]





Key Questions

Current Execution of HPL on Heterogeneous GPU Clusters

•  HPL (High Performance Linpack): the benchmark used for ranking supercomputers in the Top500 list

•  Current HPL support for GPU clusters
    –  Heterogeneity inside a node (CPU+GPU)
    –  Homogeneity across nodes

•  Current HPL execution on heterogeneous GPU clusters
    –  Only CPU nodes (using all the CPU cores)
    –  Only GPU nodes (using CPU+GPU on only GPU nodes)
    –  As the CPU/GPU node ratio is high, the "Only CPU" runs are the ones reported

•  Hybrid HPL support for heterogeneous systems
    –  Heterogeneity inside a node (CPU+GPU)
    –  Heterogeneity across nodes (nodes w/o GPUs)


R. Shi, S. Potluri, K. Hamidouche, X. Lu, K. Tomko and D. K. Panda, A Scalable and Portable Approach to Accelerate Hybrid HPL on Heterogeneous CPU-GPU Clusters, IEEE Cluster (Cluster '13), Best Student Paper Award

Two Level Workload Partitioning: Inter-node

•  Inter-node Static Partitioning
    –  Original design: uniform distribution, bottleneck on CPU nodes
    –  New design: identical block size, schedules more MPI processes on GPU nodes

MPI_GPU = ACTUAL_PEAK_GPU / ACTUAL_PEAK_CPU + β, with NUM_CPU_CORES mod MPI_GPU = 0 (evenly split the cores)

[Figure: original partitioning (2x2 process grid) vs. MPI process-scaling based partitioning (2x3 process grid) over GPU nodes G1, G2 and CPU nodes C1, C2.]
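One way to read the MPI_GPU formula is sketched below in Python: take the measured GPU-node to CPU-node performance ratio and move it to the nearest process count that divides the number of CPU cores evenly, with β being the resulting adjustment. This interpretation of β, the function name, and the 16 cores per node are assumptions made for illustration; the peak values of 670 and 202 Gflops are the single-node numbers reported in the conclusion of this talk.

```python
def mpi_procs_per_gpu_node(peak_gpu, peak_cpu, num_cpu_cores):
    """Pick MPI_GPU close to peak_gpu / peak_cpu such that
    num_cpu_cores % MPI_GPU == 0 (cores split evenly among processes).

    Returns (mpi_gpu, beta, cores_per_process), where beta is the
    adjustment added to the raw performance ratio (assumed meaning).
    """
    ratio = peak_gpu / peak_cpu
    divisors = [d for d in range(1, num_cpu_cores + 1) if num_cpu_cores % d == 0]
    mpi_gpu = min(divisors, key=lambda d: abs(d - ratio))
    beta = mpi_gpu - ratio
    return mpi_gpu, beta, num_cpu_cores // mpi_gpu


# 670 and 202 Gflops come from the slides; 16 cores per node is an assumption.
mpi_gpu, beta, cores_per_proc = mpi_procs_per_gpu_node(670.0, 202.0, 16)
print(mpi_gpu, round(beta, 2), cores_per_proc)
# 4 0.68 4
```

Under these assumptions the sketch lands on 4 MPI processes per GPU node, which matches the largest per-GPU-node setting used in the experiments later in the talk.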


Two Level Workload Partitioning: Intra-node

•  Intra-node Dynamic Partitioning

•  MPI-to-Device Mapping
    –  Original design: 1:1
    –  New design: M:N (M > N), N = number of GPUs per node, M = number of MPI processes

•  Initial Split Ratio Tuning: alpha = GPU_LEN / (GPU_LEN + CPU_LEN)
    –  Fewer CPU cores per MPI process
    –  Overhead caused by scheduling multiple MPI processes on GPU nodes

[Figure: matrix blocks A, B1, B2, C1, C2, with the work split into a GPU_LEN part and a CPU_LEN part.]
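The split ratio alpha can be illustrated with a short Python sketch: a block of columns is divided into a GPU part and a CPU part according to alpha, and alpha is then re-estimated from the measured throughput of each side so that both finish at about the same time. The update rule and the function names are assumptions chosen for illustration, not the tuning procedure of the Hybrid HPL implementation; the timings in the usage example are made up.

```python
def split_columns(total_cols, alpha):
    """Split total_cols into (GPU_LEN, CPU_LEN) using
    alpha = GPU_LEN / (GPU_LEN + CPU_LEN)."""
    gpu_len = int(round(alpha * total_cols))
    gpu_len = max(0, min(total_cols, gpu_len))
    return gpu_len, total_cols - gpu_len


def update_alpha(gpu_len, cpu_len, gpu_time, cpu_time):
    """Re-estimate alpha from measured throughput (columns per second).

    Assumed rule: give each device a share proportional to its rate.
    """
    if gpu_time <= 0 or cpu_time <= 0:
        raise ValueError("timings must be positive")
    gpu_rate = gpu_len / gpu_time
    cpu_rate = cpu_len / cpu_time
    return gpu_rate / (gpu_rate + cpu_rate)


# Start with an even split, then rebalance using made-up timings.
gpu_len, cpu_len = split_columns(1000, 0.5)           # (500, 500)
alpha = update_alpha(gpu_len, cpu_len, gpu_time=1.0, cpu_time=3.0)
print(alpha, split_columns(1000, alpha))
```

In this made-up case the GPU processed its half three times faster than the CPU, so the next split gives it about three quarters of the columns.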



Performance Tuning of Single CPU Node and GPU Node

Netlib-CPU: standard HPL version from Netlib (UTK)

Hybrid-CPU: hybrid HPL version with OpenMP support

NVIDIA-GPU: NVIDIA's HPL version

* OpenBLAS math library is used

[Figure: "Peak Performance Scaling on Single CPU/GPU Node". Performance (Gflops) vs. problem size N, 10,000 to 60,000, for Netlib-CPU, Hybrid-CPU and NVIDIA-GPU.]


Peak Performance Scaling of Pure CPU/GPU Nodes

Measure the peak performance of either pure CPU nodes or pure GPU nodes (1, 2, 4, 8, 16)

[Figure: "Performance Scaling of Pure CPU/GPU Nodes". Performance (Gflops) vs. number of CPU/GPU nodes (1, 2, 4, 8, 16) for Netlib-CPU, Hybrid-CPU and NVIDIA-GPU.]


Strong and Weak Scalability of Hybrid CPU+GPU Nodes

Using Hybrid-HPL to measure the scalability with 4 GPU nodes + (4, 8, 12, 16) CPU nodes

Launch 1 MPI process per CPU node; 1, 2 or 4 MPI processes per GPU node

Strong scalability: fixed problem size N for each combination of CPUs+GPUs (e.g. N=100,000 for 4 GPUs + 4 CPUs)

Weak scalability: fixed memory usage (~40%) on GPU nodes for all cases

[Figure: two panels, "Strong Scalability" and "Weak Scalability". Peak performance (Gflops) vs. number of CPU nodes (4, 8, 12, 16) for 1, 2 and 4 MPI processes per GPU node.]


Peak Performance of Hybrid CPU Nodes + GPU Nodes

Measure the peak performance of 64 CPU nodes and 16 GPU nodes

Launch 1 MPI process per CPU node, and 4 MPI processes per GPU node

Node Configuration      Peak Performance (Gflops)
16 GPUs                 6,480
64 CPUs                 13,210
16 GPUs + 64 CPUs       14,520

Peak Performance Efficiency (Hybrid-HPL) = Peak Perf. of hybrid nodes / (Peak Perf. of CPUs + Peak Perf. of GPUs), e.g. 14,520 / (6,480 + 13,210) = 73.7%
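The efficiency figure can be checked with a few lines of Python using the three measurements from the table; only the function name is introduced here.

```python
def hybrid_efficiency(hybrid_gflops, cpu_gflops, gpu_gflops):
    """Hybrid peak performance relative to the sum of the
    separately measured CPU-only and GPU-only runs."""
    return hybrid_gflops / (cpu_gflops + gpu_gflops)


eff = hybrid_efficiency(14520, 13210, 6480)
print(f"{eff:.1%}")  # prints 73.7%
```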

Conclusion

•  The Blue Waters system provides unique opportunities
    –  Communications at large scale
    –  Hybrid system with XE6 and XK7 nodes

•  MPI collectives study on up to 128K processes
    –  Latency-sensitive collectives such as reduce perform well
    –  Bandwidth limitations impact dense collectives such as Allgather

•  UPC and SHMEM communications study on up to 32K and 16K cores respectively
    –  UPC and SHMEM point-to-point performance is good
    –  Some collectives (UPC Scatter, SHMEM Broadcast) scale well; for others (SHMEM collect) we observed high latencies

Conclusion (continued)

•  Hybrid HPL
    –  Peak single CPU node performance: 202 Gflops
    –  Peak GPU node performance: 670 Gflops
    –  Performance efficiency of hybrid HPL, relative to the sum of the pure CPU and pure GPU runs, is above 70% with 16 GPU nodes and 64 CPU nodes

•  Contact us:

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Karen Tomko, Ohio Supercomputer Center
E-mail: [email protected]
http://www.osc.edu/~ktomko
