Page 1

Programming Models for Exascale Systems

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Keynote Talk at HPCAC-Stanford (Feb 2016)

Page 2

High-End Computing (HEC): ExaFlop & ExaByte

• ExaFlop & HPC: 100-200 PFlops in 2016-2018; 1 EFlops in 2020-2024?

• ExaByte & BigData: 10K-20K EBytes in 2016-2018; 40K EBytes in 2020?

Page 3

[Chart: Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org) - number of clusters and percentage of clusters over time; commodity clusters now account for about 85% of the Top 500.]

Page 4

Drivers of Modern HPC Cluster Architectures

Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A

• Multi-core/many-core technologies

• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)

• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD

• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

• Accelerators / Coprocessors: high compute density, high performance/watt; >1 TFlop DP on a chip

• High Performance Interconnects (InfiniBand): <1 µs latency, 100 Gbps bandwidth

• Multi-core Processors; SSD, NVMe-SSD, NVRAM

Page 5

Large-scale InfiniBand Installations

• 235 IB Clusters (47%) in the Nov 2015 Top500 list (http://www.top500.org)

• Installations in the Top 50 (21 systems):

– 462,462 cores (Stampede) at TACC (10th)

– 185,344 cores (Pleiades) at NASA/Ames (13th)

– 72,800 cores Cray CS-Storm in US (15th)

– 72,800 cores Cray CS-Storm in US (16th)

– 265,440 cores SGI ICE at Tulip Trading Australia (17th)

– 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)

– 72,000 cores (HPC2) in Italy (19th)

– 152,692 cores (Thunder) at AFRL/USA (21st)

– 147,456 cores (SuperMUC) in Germany (22nd)

– 86,016 cores (SuperMUC Phase 2) in Germany (24th)

– 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)

– 194,616 cores (Cascade) at PNNL (27th)

– 76,032 cores (Makman-2) at Saudi Aramco (32nd)

– 110,400 cores (Pangea) in France (33rd)

– 37,120 cores (Lomonosov-2) at Russia/MSU (35th)

– 57,600 cores (SwiftLucy) in US (37th)

– 55,728 cores (Prometheus) at Poland/Cyfronet (38th)

– 50,544 cores (Occigen) at France/GENCI-CINES (43rd)

– 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)

– and many more!

Page 6

Towards Exascale System (Today and Target)

Systems: Tianhe-2 today (2016) vs. the exascale target (2020-2024); the difference between today and exascale is given in parentheses.

• System peak: 55 PFlop/s → 1 EFlop/s (~20x)

• Power: 18 MW (3 Gflops/W) → ~20 MW (50 Gflops/W) (O(1), ~15x)

• System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) → 32-64 PB (~50x)

• Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) → 1.2 or 15 TF (O(1))

• Node concurrency: 24-core CPU + 171-core CoP → O(1k) or O(10k) (~5x - ~50x)

• Total node interconnect BW: 6.36 GB/s → 200-400 GB/s (~40x - ~60x)

• System size (nodes): 16,000 → O(100,000) or O(1M) (~6x - ~60x)

• Total concurrency: 3.12M (12.48M threads, 4/core) → O(billion) for latency hiding (~100x)

• MTTI: few/day → many/day (O(?))

Courtesy: Prof. Jack Dongarra

Page 7

Two Major Categories of Applications

• Scientific Computing

– Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model

– Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.

– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)

• Big Data/Enterprise/Commercial Computing

– Focuses on large data and data analysis

– Hadoop (HDFS, HBase, MapReduce)

– Spark is emerging for in-memory computing

– Memcached is also used for Web 2.0

Page 8

Parallel Programming Models Overview

• Shared Memory Model: processes share one memory (e.g., SHMEM, DSM)

• Distributed Memory Model: each process has its own memory and communicates by messages (e.g., MPI - Message Passing Interface)

• Partitioned Global Address Space (PGAS): distributed memories presented as a logical shared memory (e.g., Global Arrays, UPC, Chapel, X10, CAF, ...)

• Programming models provide abstract machine models

• Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

• PGAS models and hybrid MPI+PGAS models are gradually gaining importance (see the sketch below)
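To make the distributed-memory model concrete, here is a minimal two-rank MPI sketch (illustrative, not from the slides): data moves only through explicit messages.

/* Minimal sketch of the distributed-memory (MPI) model: two ranks
 * exchange a value explicitly.  Illustrative only.
 * Build with: mpicc mpi_hello.c -o mpi_hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit message to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* matching receive */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}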

Page 9

Partitioned Global Address Space (PGAS) Models

• Key features

- Simple shared memory abstractions

- Lightweight one-sided communication

- Easier to express irregular communication

• Different approaches to PGAS

- Languages

• Unified Parallel C (UPC)

• Co-Array Fortran (CAF)

• X10

• Chapel

- Libraries

• OpenSHMEM

• Global Arrays
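As an illustration of the library approach to PGAS, the following minimal OpenSHMEM sketch (not from the slides; API names follow OpenSHMEM 1.2+, while 1.0-era code used start_pes()/_my_pe()) performs a one-sided put into a symmetric variable on another PE.

/* Minimal OpenSHMEM sketch: a one-sided put into a symmetric variable.
 * Illustrative only. */
#include <shmem.h>
#include <stdio.h>

long target = 0;                         /* symmetric: exists on every PE */

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    if (me == 0)
        shmem_long_p(&target, 123, 1);   /* one-sided write into PE 1's copy */

    shmem_barrier_all();                 /* make the put visible everywhere */

    if (me == 1)
        printf("PE 1 sees target = %ld\n", target);

    shmem_finalize();
    return 0;
}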

Page 10

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics

• Benefits:

– Best of the Distributed Memory Computing Model

– Best of the Shared Memory Computing Model

• Exascale Roadmap*:

– "Hybrid Programming is a practical way to program exascale systems"

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: an HPC application composed of Kernels 1..N, originally all MPI; selected kernels (e.g., Kernel 2, Kernel N) are re-written in PGAS.]

Page 11

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

Application Kernels/Applications

Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.

Middleware: Communication Library or Runtime for Programming Models - point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance

Hardware: networking technologies (InfiniBand, 40/100GigE, Aries, and OmniPath), multi-/many-core architectures, accelerators (NVIDIA and MIC)

Co-design opportunities and challenges across these layers: performance, scalability, fault-resilience

Page 12

Broad Challenges in Designing Communication Libraries for (MPI+X) at Exascale

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)

– Scalable job start-up

• Scalable collective communication

– Offload

– Non-blocking

– Topology-aware

• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)

– Multiple end-points per node

• Support for efficient multi-threading

• Integrated support for GPGPUs and accelerators

• Fault-tolerance/resiliency

• QoS support for communication and I/O

• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, CAF, ...)

• Virtualization

• Energy-Awareness

Page 13

Additional Challenges for Designing Exascale Software Libraries

• Extreme low memory footprint

– Memory per core continues to decrease

• D-L-A Framework

– Discover

• Overall network topology (fat-tree, 3D, ...), network topology of the processes of a given job

• Node architecture, health of network and nodes

– Learn

• Impact on performance and scalability

• Potential for failure

– Adapt

• Internal protocols and algorithms

• Process mapping

• Fault-tolerance solutions

– Low-overhead techniques while delivering performance, scalability and fault-tolerance

Page 14

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Used by more than 2,525 organizations in 77 countries

– More than 351,000 (> 0.35 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘15 ranking)

• 10th ranked 519,640-core cluster (Stampede) at TACC

• 13th ranked 185,344-core cluster (Pleiades) at NASA

• 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->

– Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

Page 15

MVAPICH2 Architecture

High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++*); Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for modern networking technology (InfiniBand, iWARP, RoCE, OmniPath) and modern multi-/many-core architectures (Intel Xeon, OpenPower*, Xeon Phi (MIC, KNL*), NVIDIA GPGPU)

• Transport protocols: RC, XRC, UD, DC

• Modern network features: UMR, ODP*, SR-IOV, multi-rail

• Transport mechanisms: shared memory, CMA, IVSHMEM

• Modern node features: MCDRAM*, NVLink*, CAPI*

* - Upcoming

Page 16

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors

– Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)

– Support for advanced IB mechanisms (UMR and ODP)

– Extremely minimal memory footprint

– Scalable job start-up

• Collective communication

• Integrated support for GPGPUs

• Integrated support for MICs

• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Page 17

One-way Latency: MPI over IB with MVAPICH2

[Charts: small-message and large-message one-way MPI latency for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR and ConnectX-4-EDR. Small-message latencies range from about 0.95 to 1.26 µs across the four adapters.]

Test beds: TrueScale-QDR, ConnectX-3-FDR and ConnectIB-Dual FDR on 2.8 GHz deca-core (IvyBridge) Intel nodes, PCIe Gen3, with IB switch; ConnectX-4-EDR on 2.8 GHz deca-core (Haswell) Intel nodes, PCIe Gen3, back-to-back.

Page 18

Bandwidth: MPI over IB with MVAPICH2

[Charts: unidirectional and bidirectional MPI bandwidth for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR and ConnectX-4-EDR. Peak unidirectional bandwidths of about 3,387, 6,356, 12,104 and 12,465 MBytes/sec and peak bidirectional bandwidths of about 6,308, 12,161, 21,425 and 24,353 MBytes/sec are observed across the four adapters.]

Test beds: TrueScale-QDR, ConnectX-3-FDR and ConnectIB-Dual FDR on 2.8 GHz deca-core (IvyBridge) Intel nodes, PCIe Gen3, with IB switch; ConnectX-4-EDR on 2.8 GHz deca-core (Haswell) Intel nodes, PCIe Gen3, back-to-back.

Page 19

MVAPICH2 Two-Sided Intra-Node Performance (shared memory and kernel-based zero-copy support (LiMIC and CMA))

[Charts: intra-node latency and bandwidth with the latest MVAPICH2 2.2b on Intel Ivy Bridge. Small-message latency is 0.18 µs intra-socket and 0.45 µs inter-socket; peak bandwidths reach about 14,250 MB/s and 13,749 MB/s in the two cases with the CMA, shared-memory and LiMIC channels.]

Page 20

User-mode Memory Registration (UMR)

• Introduced by Mellanox to support direct local and remote non-contiguous memory access

– Avoids packing at the sender and unpacking at the receiver

• Available with MVAPICH2-X 2.2b

[Charts: small & medium message latency (4K-1M bytes) and large message latency (2M-16M bytes) for UMR-based vs. default MPI datatype processing.]

Connect-IB (54 Gbps): 2.8 GHz Dual Ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015
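For context, here is a minimal sketch (not from the talk) of the kind of non-contiguous transfer that UMR-style datatype designs target: a strided matrix column described by an MPI derived datatype and sent without manual packing.

/* Sketch: sending a strided (non-contiguous) column of a row-major matrix
 * with an MPI derived datatype instead of manual pack/unpack.  UMR-style
 * designs let the library/HCA gather such layouts directly.  Illustrative only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int rows = 1024, cols = 1024;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *matrix = malloc(sizeof(double) * rows * cols);

    /* One element per row, separated by a stride of 'cols' doubles:
     * this describes column 0 of the matrix. */
    MPI_Datatype column;
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(matrix, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(matrix, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    free(matrix);
    MPI_Finalize();
    return 0;
}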

Page 21

• Introduced by Mellanox to support direct remote memory access without pinning

• Memory regions paged in/out dynamically by the HCA/OS

• Size of registered buffers can be larger than physical memory

• Will be available in upcoming MVAPICH2-X 2.2 RC1

On-Demand Paging (ODP)

Connect-IB (54 Gbps): 2.6 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

[Charts: Graph500 with pin-down vs. ODP registration at 16, 32 and 64 processes - pin-down buffer sizes (MB) and BFS kernel execution time (s).]

Page 22

Minimizing Memory Footprint by Direct Connect (DC) Transport

[Figure: four nodes (processes P0-P7) connected through an IB network using the DC transport.]

• Constant connection cost (One QP for any peer)

• Full Feature Set (RDMA, Atomics etc)

• Separate objects for send (DC Initiator) and receive (DC Target)

– DC Target identified by “DCT Number”

– Messages routed with (DCT Number, LID)

– Requires same “DC Key” to enable communication

• Available since MVAPICH2-X 2.2a

[Charts: NAMD Apoa1 (large data set) normalized execution time at 160-620 processes, and connection memory (KB) for Alltoall at 80-640 processes, comparing the RC, DC-Pool, UD and XRC transports.]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)

Page 23

Towards High Performance and Scalable Startup at Exascale

• Near-constant MPI and OpenSHMEM initialization time at any process count

• 10x and 30x improvement in the startup time of MPI and OpenSHMEM respectively at 16,384 processes

• Memory consumption for remote endpoint information reduced by O(processes per node)

• 1 GB memory saved per node with 1M processes and 16 processes per node

[Figure: job startup performance and memory required to store endpoint information. P: PGAS (state of the art), M: MPI (state of the art), O: PGAS/MPI (optimized); techniques: (a) on-demand connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, (e) shmem-based PMI.]

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)

PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)

Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)

SHMEMPMI - Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication


Page 24

Process Management Interface over Shared Memory (SHMEMPMI)

• SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments

• Only a single copy per node, an O(processes per node) reduction in memory usage

• Estimated savings of 1 GB per node with 1 million processes and 16 processes per node

• Up to 1,000 times faster PMI Gets compared to the default design; will be available in MVAPICH2 2.2RC1

TACC Stampede - Connect-IB (54 Gbps): 2.6 GHz Quad Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR

SHMEMPMI - Shared Memory Based PMI for Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16), accepted for publication

[Charts: time taken by one PMI_Get (ms) vs. processes per node for the default design and SHMEMPMI, and memory usage per node (MB) for remote EP information vs. processes per job (Fence/Allgather, default vs. shared-memory design); annotated improvements: 16x actual, 1000x estimated.]
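The following is a generic sketch of the mechanism SHMEMPMI relies on, not the MVAPICH2 implementation: a process manager publishes endpoint data once per node in a POSIX shared-memory segment and every local MPI process maps it read-only instead of keeping its own copy. The segment name and record contents here are hypothetical.

/* Generic sketch: one shared copy of endpoint data per node via POSIX shm.
 * In reality the writer (process manager) and readers (MPI processes) are
 * separate processes; both sides are shown here for brevity.
 * Link with -lrt on older glibc. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_NAME "/ep_info_demo"   /* hypothetical segment name */
#define SEG_SIZE 4096

int main(void)
{
    /* Writer side (process manager): create and fill the segment once. */
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SEG_SIZE);
    char *seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(seg, "lid=12;qpn=0x1a2b");          /* placeholder endpoint record */

    /* Reader side (each MPI process on the node): map the same segment
     * read-only; no per-process copy of the endpoint table is made. */
    int rfd = shm_open(SEG_NAME, O_RDONLY, 0600);
    const char *view = mmap(NULL, SEG_SIZE, PROT_READ, MAP_SHARED, rfd, 0);
    printf("endpoint record: %s\n", view);

    munmap((void *)view, SEG_SIZE);
    munmap(seg, SEG_SIZE);
    close(rfd);
    close(fd);
    shm_unlink(SEG_NAME);
    return 0;
}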

Page 25

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors

• Collective communication

– Offload and Non-blocking

– Topology-aware

• Integrated support for GPGPUs

• Integrated support for MICs

• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Page 26

Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)

• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)

• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)

• Modified Pre-Conjugate Gradient solver with Offload-Allreduce does up to 21.8% better than the default version

[Charts: HPL normalized performance (HPL-Offload, HPL-1ring, HPL-Host) vs. problem size N as % of total memory; P3DFFT application run-time (s) vs. data size; PCG run-time (s) for PCG-Default vs. Modified-PCG-Offload at 64-512 processes.]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011

K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011

K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12

Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
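For reference, a minimal sketch (not from the slides) of the MPI-3 non-blocking collective pattern these co-designs exploit: post an MPI_Iallreduce, overlap it with independent computation, then complete it. The offload itself happens inside the MPI library and HCA.

/* Sketch: overlap an allreduce with computation using MPI-3 non-blocking
 * collectives.  Illustrative only. */
#include <mpi.h>

static void do_independent_work(double *x, int n)
{
    for (int i = 0; i < n; i++)      /* work that does not need the reduced value */
        x[i] *= 2.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local_dot = 1.0, global_dot = 0.0;
    double other_work[1024] = {0};
    MPI_Request req;

    /* Post the collective; with hardware offload the network can progress it. */
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    do_independent_work(other_work, 1024);   /* overlapped computation */

    MPI_Wait(&req, MPI_STATUS_IGNORE);       /* global_dot is now valid */

    MPI_Finalize();
    return 0;
}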

Page 27

Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network topology detection service for IB?

• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?

• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; default vs. topology-aware placement for a 2048-core run.]

• Reduced network topology discovery time from O(N_hosts^2) to O(N_hosts)

• 15% improvement in MILC execution time @ 2048 cores

• 15% improvement in Hypre execution time @ 1024 cores

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist
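As a related, application-side illustration (not the OSU topology service itself), the standard MPI-3 distributed graph interface lets a program hand its communication pattern to a topology-aware library, which may then reorder ranks.

/* Sketch: expose a 1-D ring communication pattern to MPI and allow rank
 * reordering.  Uses the standard MPI-3 API; illustrative only. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank talks to its left and right neighbors. */
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int neighbors[2] = { left, right };

    MPI_Comm ring_comm;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, neighbors, MPI_UNWEIGHTED,   /* sources */
                                   2, neighbors, MPI_UNWEIGHTED,   /* destinations */
                                   MPI_INFO_NULL,
                                   1 /* allow rank reordering */,
                                   &ring_comm);

    /* Use ring_comm for neighborhood collectives or point-to-point;
     * a topology-aware MPI may have placed neighbors close in the fabric. */
    MPI_Comm_free(&ring_comm);
    MPI_Finalize();
    return 0;
}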

Page 28

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors

• Collective communication

• Integrated support for GPGPUs

– CUDA-Aware MPI

– GPUDirect RDMA (GDR) support

– CUDA-aware non-blocking collectives

– Support for managed memory

– Efficient datatype processing

• Integrated support for MICs

• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Page 29

[Diagram: GPU and CPU on a node connected over PCIe to the NIC and network switch; data is staged through host memory.]

At Sender:

cudaMemcpy(s_hostbuf, s_devbuf, . . .);

MPI_Send(s_hostbuf, size, . . .);

At Receiver:

MPI_Recv(r_hostbuf, size, . . .);

cudaMemcpy(r_devbuf, r_hostbuf, . . .);

• Data movement in applications with standard MPI and CUDA interfaces

High Productivity and Low Performance

MPI + CUDA - Naive

Page 30

[Diagram: GPU and CPU on a node connected over PCIe to the NIC and network switch; transfers are pipelined through host memory.]

At Sender:

for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, ...);

for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(...);
        if (j > 0) MPI_Test(...);
    }
    MPI_Isend(s_hostbuf + j * blksz, blksz, ...);
}
MPI_Waitall(...);

<<Similar at receiver>>

• Pipelining at user level with non-blocking MPI and CUDA interfaces

Low Productivity and High Performance

MPI + CUDA - Advanced

Page 31

At Sender: MPI_Send(s_devbuf, size, ...);

At Receiver: MPI_Recv(r_devbuf, size, ...);

(Data movement is handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement

• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)

• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity

GPU-Aware MPI Library: MVAPICH2-GPU
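A minimal end-to-end sketch (not from the slides) of what CUDA-aware MPI enables: device buffers are passed straight to MPI calls, with the staging and pipelining shown above hidden inside the library.

/* Sketch of CUDA-aware MPI usage: pass cudaMalloc'd buffers directly to
 * MPI_Send/MPI_Recv and let a GPU-aware library (e.g., MVAPICH2-GPU) handle
 * the GPU-host-network data movement.  Illustrative only. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 1 << 20;           /* 1M floats */
    float *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&devbuf, n * sizeof(float));
    cudaMemset(devbuf, 0, n * sizeof(float));

    if (rank == 0)
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);       /* device pointer */
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                  /* device pointer */

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}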

Page 32

GPU-Direct RDMA (GDR) with CUDA

• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox

• OSU has a design of MVAPICH2 using GPUDirect RDMA

– Hybrid design using GPU-Direct RDMA

• GPUDirect RDMA and host-based pipelining

• Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge

– Support for communication using multi-rail

– Support for Mellanox Connect-IB and ConnectX VPI adapters

– Support for RoCE with Mellanox ConnectX VPI adapters

[Diagram: data paths among the IB adapter, system memory, GPU memory, GPU, CPU and chipset. Measured PCIe P2P rates: SNB E5-2670 - P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680V2 - P2P write 6.4 GB/s, P2P read 3.5 GB/s.]

Page 33

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

• Support for MPI communication from NVIDIA GPU device memory

• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)

• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)

• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node

• Optimized and tuned collectives for GPU device buffers

• MPI datatype support for point-to-point and collective communication from GPU device buffers

Page 34

Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Charts: GPU-GPU inter-node latency, bandwidth and bi-bandwidth for MV2-GDR 2.2b vs. MV2-GDR 2.0b vs. MV2 without GDR. GDR brings small-message GPU-GPU inter-node latency down to 2.18 µs, with annotated improvements of roughly 10x/2x (latency) and 11x/2x (bandwidth) relative to MV2 without GDR and MV2-GDR 2.0b respectively.]

Test bed: MVAPICH2-GDR 2.2b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPU-Direct-RDMA.

Page 35

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)

• HoomdBlue Version 1.0.5

• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Charts: average time steps per second (TPS) for MV2 vs. MV2+GDR at 4-32 processes, for 64K-particle and 256K-particle runs; MV2+GDR achieves about 2X higher TPS in both cases.]

Page 36

CUDA-Aware Non-Blocking Collectives

[Charts: communication/computation overlap (%) for medium/large messages (4K-1M bytes) on 64 GPU nodes, for Ialltoall and Igather with 1 process/node and with 2 processes/node (1 process/GPU).]

Platform: Wilkes (Intel Ivy Bridge, NVIDIA Tesla K20c + Mellanox Connect-IB)

Available since MVAPICH2-GDR 2.2a

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015

Page 37

Communication Runtime with GPU Managed Memory

● With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call

● Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()

● Extended MVAPICH2 to perform communications directly from managed buffers (available in MVAPICH2-GDR 2.2b)

● OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communications using managed buffers; available in OMB 5.2

D. S. Banerjee, K Hamidouche, and D. K Panda, Designing High Performance Communication Runtime for GPUManaged Memory: Early Experiences, GPGPU-9 Workshop, to be held in conjunction with PPoPP ‘16

[Chart: 2D stencil halo-exchange time (ms) for halo width 1 vs. total dimension size (bytes), comparing device buffers and managed buffers.]
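A minimal sketch of the usage this enables, assuming a CUDA-aware MPI with managed-memory support such as the MVAPICH2-GDR 2.2b capability described above (illustrative only, not the library's internal design):

/* Sketch: MPI communication directly from CUDA managed (unified) memory. */
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void fill(float *buf, int n, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = v;
}

int main(int argc, char **argv)
{
    int rank;
    const int n = 1 << 16;
    float *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMallocManaged(&buf, n * sizeof(float));   /* visible to CPU and GPU */

    if (rank == 0) {
        fill<<<(n + 255) / 256, 256>>>(buf, n, 3.14f);
        cudaDeviceSynchronize();
        MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* managed pointer */
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}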

Page 38

MPI Datatype Processing (Communication Optimization)

Common scenario (wastes computing resources on the CPU and GPU):

MPI_Isend(A, ..., Datatype, ...);
MPI_Isend(B, ..., Datatype, ...);
MPI_Isend(C, ..., Datatype, ...);
MPI_Isend(D, ..., Datatype, ...);
...
MPI_Waitall(...);

*Buf1, Buf2, ... contain non-contiguous MPI datatypes

[Diagram: existing vs. proposed design. In the existing design, for each Isend in turn the CPU initiates the packing kernel, waits for the kernel (WFK) and then starts the send. In the proposed design the kernels for multiple Isends are initiated on streams up front and each send is started as its kernel completes, overlapping CPU progress with GPU kernels so that communication finishes earlier.]

Page 39

Application-Level Evaluation (HaloExchange - Cosmo)

[Charts: normalized execution time of the Cosmo halo-exchange kernel with Default, Callback-based and Event-based designs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs).]

• 2X improvement on 32 GPU nodes

• 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

Page 40

• Scalability for million to billion processors

• Collective communication

• Integrated Support for GPGPUs

• Integrated Support for MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

Page 41

MPI Applications on MIC Clusters

• Flexibility in launching MPI jobs on clusters with Xeon Phi

[Figure: execution modes spanning multi-core-centric (Xeon) to many-core-centric (Xeon Phi): Host-only, Offload (/reverse offload) with offloaded computation, Symmetric (MPI programs on both host and coprocessor), and Coprocessor-only.]

Page 42

MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

• Offload Mode

• Intranode Communication

• Coprocessor-only and Symmetric Mode

• Internode Communication

• Coprocessor-only and Symmetric Mode

• Multi-MIC Node Configurations

• Running on three major systems

• Stampede, Blueridge (Virginia Tech) and Beacon (UTK)

Page 43

MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Charts: intra-socket and inter-socket MIC-remote-MIC point-to-point performance. Large-message latency (8K-2M bytes) and bandwidth (1 byte-1M bytes); peak bandwidths of about 5236 and 5594 MB/sec in the two configurations.]

Page 44

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014

[Charts: MV2-MIC vs. MV2-MIC-Opt latency for 32-node Allgather (16H + 16M, small messages), 32-node Allgather (8H + 8M, large messages) and 32-node Alltoall (8H + 8M, large messages), plus P3DFFT execution time (communication and computation) on 32 nodes (8H + 8M, size 2K*2K*1K); annotated improvements of 76%, 58% and 55%.]

Page 45

• Scalability for million to billion processors

• Collective communication

• Integrated Support for GPGPUs

• Integrated Support for MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

Page 46

MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC and CAF calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE and iWARP.]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, available with MVAPICH2-X 1.9 (2012) onwards!

• UPC++ support will be available in the upcoming MVAPICH2-X 2.2RC1

• Feature highlights

– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC + CAF

– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)

– Scalable inter-node and intra-node communication, point-to-point and collectives
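The skeleton below sketches the kind of hybrid MPI + OpenSHMEM program MVAPICH2-X targets: both models are used in one executable and, with a unified runtime, share a single communication library. It is illustrative only (API names per OpenSHMEM 1.2+); initialization ordering conventions vary by implementation.

/* Skeleton of a hybrid MPI + OpenSHMEM program. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

long counter = 0;                     /* symmetric variable for one-sided ops */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                     /* a unified runtime lets this coexist with MPI */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Irregular, one-sided phase: every PE atomically increments PE 0's counter. */
    shmem_long_inc(&counter, 0);
    shmem_barrier_all();

    /* Regular, collective phase expressed in MPI. */
    long local = 1, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("counter = %ld, total = %ld\n", counter, total);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}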

Page 47

Application-Level Performance with Graph500 and Sort

Graph500 Execution Time

[Chart: Graph500 execution time at 4K, 8K and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM); annotated speedups of 7.6X and 13X over MPI-Simple.]

• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design

– 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple

– 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple

Sort Execution Time

[Chart: Sort execution time for MPI vs. Hybrid from 500 GB at 512 processes to 4 TB at 4K processes; 51% improvement at the largest scale.]

• Performance of the Hybrid (MPI+OpenSHMEM) Sort application

– 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min); Hybrid 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013

Page 48

MiniMD – Total Execution Time

• Hybrid design performs better than the MPI implementation

• 1,024 processes: 17% improvement over the MPI version

• Strong scaling; input size: 128 * 128 * 128

[Charts: time (ms) vs. number of cores (256-1,024) for Hybrid-Barrier, MPI-Original and Hybrid-Advanced, for the performance and strong-scaling runs.]

M. Li, J. Lin, X. Lu, K. Hamidouche, K. Tomko and D. K. Panda, Scalable MiniMD Design with Hybrid MPI and OpenSHMEM, OpenSHMEM User Group Meeting (OUG '14), held in conjunction with the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS '14)

Page 49

Hybrid MPI+UPC NAS-FT

• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall

• Truly hybrid program

• For FT (Class C, 128 processes):

– 34% improvement over UPC-GASNet

– 30% improvement over UPC-OSU

[Chart: time (s) for UPC-GASNet, UPC-OSU and Hybrid-OSU on NAS FT, problem size - system size combinations B-64, C-64, B-128 and C-128.]

Hybrid MPI + UPC support available since MVAPICH2-X 1.9 (2012)

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010

Page 50

• Scalability for million to billion processors

• Collective communication

• Integrated Support for GPGPUs

• Integrated Support for MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

Page 51

Can HPC and Virtualization be Combined?

• Virtualization has many benefits

– Fault-tolerance

– Job migration

– Compaction

• It has not been very popular in HPC due to the overhead associated with virtualization

• New SR-IOV (Single Root - IO Virtualization) support available with Mellanox InfiniBand adapters changes the field

• Enhanced MVAPICH2 support for SR-IOV

• MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar '14

J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14

J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15

Page 52

Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

• Redesign MVAPICH2 to make it virtual-machine aware

– SR-IOV shows near-native performance for inter-node point-to-point communication

– IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs

– Locality Detector: maintains the locality information of co-resident virtual machines

– Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

[Figure: host environment with two guest VMs; each guest's MPI process uses a virtual function (VF) driver over the SR-IOV channel of the InfiniBand adapter and an IVSHMEM (/dev/shm) channel for co-resident communication, while the hypervisor holds the physical function (PF) driver.]

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par 2014

J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC 2014

Page 53

MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

• OpenStack is one of the most popular open-source solutions for building clouds and managing virtual machines

• Deployment with OpenStack

– Supporting SR-IOV configuration

– Supporting IVSHMEM configuration

– Virtual-machine-aware design of MVAPICH2 with SR-IOV

• An efficient approach to build HPC clouds with MVAPICH2-Virt and OpenStack

J. Zhang, X. Lu, M. Arnold, D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds, CCGrid 2015

Page 54

Application-Level Performance on Chameleon

[Charts: Graph500 execution time (ms) for problem sizes (scale, edgefactor) from (22,20) to (26,16), and SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2 and lu, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native.]

• 32 VMs, 6 cores/VM

• Compared to native, 2-5% overhead for Graph500 with 128 procs

• Compared to native, 1-9.5% overhead for SPEC MPI2007 with 128 procs

Page 55

NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

• Large-scale instrument

– Targeting Big Data, Big Compute, Big Instrument research

– ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with 100G network

• Reconfigurable instrument

– Bare metal reconfiguration, operated as single instrument, graduated approach for ease-of-use

• Connected instrument

– Workload and Trace Archive

– Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others

– Partnerships with users

• Complementary instrument

– Complementing GENI, Grid’5000, and other testbeds

• Sustainable instrument

– Industry connections http://www.chameleoncloud.org/

Page 56

• Scalability for million to billion processors

• Collective communication

• Integrated Support for GPGPUs

• Integrated Support for MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

Page 57

• MVAPICH2-EA 2.1 (Energy-Aware)

• A white-box approach

• New Energy-Efficient communication protocols for pt-pt and collective operations

• Intelligently apply the appropriate Energy saving techniques

• Application oblivious energy saving

• OEMT

• A library utility to measure energy consumption for MPI applications

• Works with all MPI runtimes

• PRELOAD option for precompiled applications

• Does not require ROOT permission:

• A safe kernel module to read only a subset of MSRs

Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)
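The sketch below is a generic illustration of the kind of counter such tools sample, using the Linux powercap (Intel RAPL) sysfs interface. It is not the OEMT implementation, which uses a kernel module to read a subset of MSRs without root permission; the sysfs path is platform-dependent.

/* Generic sketch: measure CPU package energy around a code region via RAPL. */
#include <stdio.h>

static long long read_energy_uj(void)
{
    /* Package 0 energy counter in microjoules; path may differ per system. */
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    long long uj = -1;
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void)
{
    long long before = read_energy_uj();

    /* ... application / MPI phase to be measured would run here ... */

    long long after = read_energy_uj();
    if (before >= 0 && after >= before)
        printf("energy consumed: %.3f J\n", (after - before) / 1e6);
    else
        printf("RAPL counter not available or wrapped\n");
    return 0;
}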

Page 58

MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge

• Uses the best energy lever automatically and transparently

• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation

• Pessimistic MPI applies the energy reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime. A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoise, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

Page 59

• Scalability for million to billion processors

• Collective communication

• Integrated Support for GPGPUs

• Integrated Support for MICs

• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)

• Virtualization

• Energy-Awareness

• InfiniBand Network Analysis and Monitoring (INAM)

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

Page 60

Overview of OSU INAM

• OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network

• Major features of the OSU INAM tool include:

– Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity

– Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (pt-to-pt, collectives and RMA)

– Remotely monitor CPU utilization of MPI processes at user-specified granularity

– Visualize the data transfer happening in a "live" fashion:

• Entire network - Live Network Level View

• Particular job - Live Job Level View

• One or multiple nodes - Live Node Level View

– Capability to visualize data transfer that happened in the network over a time duration in the past:

• Entire network - Historical Network Level View

• Particular job - Historical Job Level View

• One or multiple nodes - Historical Node Level View

Page 61

OSU INAM – Network Level View

• Show network topology of large clusters

• Visualize traffic pattern on different links

• Quickly identify congested links/links in error state

• See the history unfold – play back historical state of the network

[Screenshots: full network (152 nodes) and zoomed-in view of the network.]

Page 62

OSU INAM – Job and Node Level Views

[Screenshots: visualizing a job (5 nodes); finding routes between nodes.]

• Job level view

– Show different network metrics (load, error, etc.) for any live job

– Play back historical data for completed jobs to identify bottlenecks

• Node level view provides details per process or per node

– CPU utilization for each rank/node

– Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)

– Network metrics (e.g. XmitDiscard, RcvError) per rank/node

Page 63

MVAPICH2 – Plans for Exascale

• Performance and Memory scalability toward 1M cores

• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)

– Support for task-based parallelism (UPC++)

• Enhanced Optimization for GPU Support and Accelerators

• Taking advantage of advanced features

– User Mode Memory Registration (UMR)

– On-Demand Paging (ODP)

• Enhanced Inter-node and Intra-node communication schemes for upcoming OmniPath and Knights Landing architectures

• Extended RMA support (as in MPI 3.0)

• Extended topology-aware collectives

• Energy-aware point-to-point (one-sided and two-sided) and collectives

• Extended Support for MPI Tools Interface (as in MPI 3.0)

• Extended Checkpoint-Restart and migration support with SCR

Page 64

• Exascale systems will be constrained by

– Power

– Memory per core

– Data movement cost

– Faults

• Programming Models and Runtimes for HPC need to be designed for

– Scalability

– Performance

– Fault-resilience

– Energy-awareness

– Programmability

– Productivity

• Highlighted some of the issues and challenges

• Need continuous innovation on all these fronts

Looking into the Future ….

Page 65

Funding Acknowledgments

Funding Support by

Equipment Support by

Page 66

Personnel AcknowledgmentsCurrent Students

– A. Augustine (M.S.)

– A. Awan (Ph.D.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– N. Islam (Ph.D.)

– M. Li (Ph.D.)

Past Students

– P. Balaji (Ph.D.)

– S. Bhagvat (M.S.)

– A. Bhat (M.S.)

– D. Buntinas (Ph.D.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist

– S. Sur

Current Post-Doc

– J. Lin

– D. Banerjee

Current Programmer

– J. Perkins

Past Post-Docs

– H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekar (Ph.D.)

– K. Kulkarni (M.S.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Research Scientists

– H. Subramoni

– X. Lu

Current Senior Research Associate

– K. Hamidouche

Past Programmers

– D. Bureddy

Current Research Specialist

– M. Arnold

Page 67

International Workshop on Communication Architectures at Extreme Scale (Exacomm)

ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15) at Frankfurt, Germany, on Thursday, July 16th, 2015

One Keynote Talk: John M. Shalf, CTO, LBL/NERSC

Four Invited Talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)

Panel: Ron Brightwell (Sandia)

Two Research Papers

ExaComm 2016 will be held in conjunction with ISC ’16

http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html

Technical Paper Submission Deadline: Friday, April 15, 2016

Page 68

[email protected]

Thank You!

The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/