Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime:
An MVAPICH2-X Approach
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at the OSC Theater (SC '15)
Parallel Programming Models Overview
[Diagram: three abstract machine models, each with processes P1, P2, P3 -
 Shared Memory Model: all processes share one memory (e.g., DSM);
 Distributed Memory Model: each process has its own memory (e.g., MPI, the Message Passing Interface);
 Partitioned Global Address Space (PGAS): per-process memories presented as a logical shared memory
 (e.g., Global Arrays, UPC, Chapel, X10, CAF, ...)]
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• Additionally, OpenMP can be used to parallelize computation within a node
• Each model has strengths and drawbacks and suits different problems or applications
2 SC15
Partitioned Global Address Space (PGAS) Models
• Key features
- Simple shared-memory abstractions
- Lightweight one-sided communication
- Easier to express irregular communication
• Different approaches to PGAS
- Languages
• Unified Parallel C (UPC)
• Co-array Fortran (CAF)
• X10
• Chapel
- Libraries
• OpenSHMEM
• Global Arrays
3 SC15
OpenSHMEM: PGAS library
• SHMEM implementations – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP
SHMEM, GSHMEM
• Subtle differences in API, across versions – example:

                  SGI SHMEM       Quadrics SHMEM    Cray SHMEM
  Initialization  start_pes(0)    shmem_init        start_pes
  Process ID      _my_pe          my_pe             shmem_my_pe
• These differences made application codes non-portable
• OpenSHMEM is an effort to address this:
  "A new, open specification to consolidate the various extant SHMEM versions
  into a widely accepted standard." – OpenSHMEM Specification v1.0
• Led by the University of Houston and Oak Ridge National Laboratory; SGI SHMEM is the baseline
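To make the style concrete, here is a minimal OpenSHMEM sketch using the OpenSHMEM 1.0 C API (start_pes, _my_pe, shmem_long_put); the buffer names and values are illustrative:

  #include <stdio.h>
  #include <shmem.h>

  static long src = 42, dst = 0;        /* static data is symmetric (remotely accessible) */

  int main(void) {
      start_pes(0);                     /* OpenSHMEM 1.0 initialization */
      int me = _my_pe();

      if (me == 0 && _num_pes() > 1)
          shmem_long_put(&dst, &src, 1, 1);   /* one-sided put from PE 0 into dst on PE 1 */
      shmem_barrier_all();              /* complete and order the put */

      if (me == 1)
          printf("PE %d received %ld\n", me, dst);
      return 0;
  }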
4 SC15
Unified Parallel C: PGAS language
• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:
  – Introduction to UPC and Language Specification, 1999
  – UPC Language Specifications, v1.0, Feb 2001
  – UPC Language Specifications, v1.1.1, Sep 2004
  – UPC Language Specifications, v1.2, June 2005
  – UPC Language Specifications, v1.3, Nov 2013
• UPC Consortium
  – Academic Institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland, …
  – Government Institutions: ARSC, IDA, LBNL, SNL, US DOE, …
  – Commercial Institutions: HP, Cray, Intrepid Technology, IBM, …
• Supported by several UPC compilers
  – Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
  – Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for: high performance, coding efficiency, irregular applications, …
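A minimal UPC sketch of a shared array and owner-computes work distribution (the array size and contents are illustrative):

  #include <upc.h>
  #include <stdio.h>

  shared int a[10*THREADS];      /* shared array, elements distributed cyclically across threads */

  int main(void) {
      upc_forall (int i = 0; i < 10*THREADS; i++; &a[i])  /* each thread updates the elements it owns */
          a[i] = MYTHREAD;
      upc_barrier;
      if (MYTHREAD == 0)
          printf("ran with %d UPC threads\n", THREADS);
      return 0;
  }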
5 SC15
Co-array Fortran (CAF): Language-level PGAS support in Fortran
• An extension to Fortran to support globally shared arrays in parallel Fortran applications
• CAF = CAF compiler + CAF runtime (libcaf)
• Coarray syntax and basic synchronization support in Fortran 2008
• Collective communication and atomic operations in upcoming Fortran 2015
E.g.:
  real :: a(n)[*]            ! coarray: one copy of a(n) on every image
  x(:)[q] = x(:) + x(:)[p]   ! one-sided read from image p, write to image q
  name[i] = name             ! push a local value to image i
  sync all                   ! barrier across all images
  ...
  ATOMIC_ADD, ATOMIC_FETCH_ADD, ...   ! atomics expected in Fortran 2015
6 SC15
MPI+PGAS for Exascale Architectures and Applications
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared-memory programming models
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
7 SC15
MVAPICH2 Software
• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over
  Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,475 organizations in 76 countries
  – More than 308,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '15 ranking)
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering TOP500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
    Stampede at TACC (8th in Jun '15, 519,640 cores, 5.168 PFlops)
8 SC15
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based
on communication characteristics
• Benefits:
– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
• Exascale Roadmap*:
– “Hybrid Programming is a practical way to
program exascale systems”
* J. Dongarra, P. Beckman, et al., The International Exascale Software Roadmap, International Journal of High Performance Computing Applications, Vol. 25, No. 1, 2011, ISSN 1094-3420
[Diagram: an HPC application composed of Kernel 1 … Kernel N, all written in MPI; selected sub-kernels (e.g., Kernel 2 and Kernel N) are re-written in PGAS]
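A minimal sketch of the hybrid approach, assuming a unified runtime such as MVAPICH2-X in which MPI and OpenSHMEM can be initialized and used in the same program (the two "kernels" shown are placeholders):

  #include <mpi.h>
  #include <shmem.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      start_pes(0);                      /* both models served by one unified runtime */

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Kernel 1: regular communication, expressed with MPI */
      long local = rank, sum = 0;
      MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

      /* Kernel 2: irregular, one-sided updates expressed with OpenSHMEM */
      static long counter = 0;           /* symmetric variable on every PE */
      shmem_long_add(&counter, 1, 0);    /* every PE atomically adds 1 on PE 0 */
      shmem_barrier_all();

      MPI_Finalize();
      return 0;
  }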
9 SC15
MVAPICH2-X for Hybrid MPI + PGAS Applications
[Diagram: MPI, OpenSHMEM, UPC, CAF, or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC, and CAF calls to a unified MVAPICH2-X runtime over InfiniBand, RoCE, and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with
MVAPICH2-X 1.9 (2012) onwards!
– http://mvapich.cse.ohio-state.edu
• Feature Highlights
– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM,
MPI(+OpenMP) + UPC
– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard
compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
– Scalable Inter-node and intra-node communication – point-to-point and collectives
10 SC15
OpenSHMEM Collective Communication: Performance
[Charts: OpenSHMEM collective latency with MVAPICH2-X (MV2X-SHMEM) vs. Scalable-SHMEM and OMPI-SHMEM -
 Reduce (1,024 processes), Broadcast (1,024 processes), and Collect (1,024 processes) as a function of message size,
 and Barrier as a function of the number of processes (128-2048);
 MV2X-SHMEM shows improvements of up to 180X, 18X, 12X, and 3X across the four operations]
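For reference, a minimal sketch of one of the operations measured above (an OpenSHMEM broadcast), using the OpenSHMEM 1.0 collective API; sizes and values are illustrative:

  #include <stdio.h>
  #include <shmem.h>

  long pSync[_SHMEM_BCAST_SYNC_SIZE];   /* symmetric work array required by collectives */
  int src[4], dst[4];                   /* symmetric source/target buffers */

  int main(void) {
      start_pes(0);
      for (int i = 0; i < _SHMEM_BCAST_SYNC_SIZE; i++)
          pSync[i] = _SHMEM_SYNC_VALUE;
      if (_my_pe() == 0)
          for (int i = 0; i < 4; i++) src[i] = i;
      shmem_barrier_all();              /* ensure pSync and src are initialized everywhere */

      /* broadcast 4 ints from PE 0 to all PEs (active set = all PEs) */
      shmem_broadcast32(dst, src, 4, 0, 0, 0, _num_pes(), pSync);
      shmem_barrier_all();

      if (_my_pe() != 0)                /* the target buffer is not updated on the root PE */
          printf("PE %d got %d %d %d %d\n", _my_pe(), dst[0], dst[1], dst[2], dst[3]);
      return 0;
  }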
11 SC15
Hybrid MPI+UPC NAS-FT
• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall (see the sketch below)
• Truly hybrid program
• For FT (Class C, 128 processes):
  – 34% improvement over UPC (GASNet)
  – 30% improvement over UPC (MV2-X)
[Chart: NAS FT execution time (s) for problem size - system size combinations B-64, C-64, B-128, C-128,
 comparing UPC (GASNet), UPC (MV2-X), and MPI+UPC (MV2-X); the hybrid version is up to 34% faster]
J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010
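A minimal sketch of the hybrid MPI+UPC pattern, assuming a unified runtime (such as MVAPICH2-X) in which each UPC thread is also an MPI rank; buffer sizes are illustrative and this is not the actual NAS-FT code:

  #include <upc.h>
  #include <mpi.h>
  #include <stdlib.h>

  #define CHUNK 1024

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);           /* assumes number of MPI ranks == THREADS */

      double *sendbuf = malloc(CHUNK * THREADS * sizeof(double));
      double *recvbuf = malloc(CHUNK * THREADS * sizeof(double));
      for (int i = 0; i < CHUNK * THREADS; i++)
          sendbuf[i] = MYTHREAD;

      /* the UPC all-to-all exchange is replaced by a single MPI_Alltoall */
      MPI_Alltoall(sendbuf, CHUNK, MPI_DOUBLE,
                   recvbuf, CHUNK, MPI_DOUBLE, MPI_COMM_WORLD);
      upc_barrier;

      free(sendbuf); free(recvbuf);
      MPI_Finalize();
      return 0;
  }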
12 SC15
CAF One-sided Communication: Performance
[Charts: CAF one-sided Put and Get bandwidth (MB/s) vs. message size, Put and Get latency (ms),
 and NAS-CAF execution time (sec) for bt.D.256, cg.D.256, ep.D.256, ft.D.256, mg.D.256, sp.D.256,
 comparing GASNet-IBV, GASNet-MPI, and MV2X]
• Micro-benchmark improvement (MV2X vs. GASNet-IBV, UH CAF test suite)
  – Put bandwidth: 3.5X improvement at 4KB; Put latency: 29% reduction at 4 bytes
• Application performance improvement (NAS-CAF one-sided implementation)
  – Execution time reduced by 12% (SP.D.256) and 18% (BT.D.256)
J. Lin, K. Hamidouche, X. Lu, M. Li and D. K. Panda, High-performance Co-array Fortran Support with MVAPICH2-X: Initial Experience and Evaluation, HIPS '15
13 SC15
Graph500 - BFS Traversal Time
• The hybrid design performs better than the MPI implementations
• At 16,384 processes:
  – 1.5X improvement over MPI-CSR
  – 13X improvement over MPI-Simple (same communication characteristics)
• Strong scaling, Graph500 problem scale = 29
[Charts: BFS traversal time (s) at 4K, 8K, and 16K processes (up to 13X and 7.6X improvements for the hybrid design),
 and traversed edges per second at 1,024-8,192 processes,
 comparing MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid; panels labeled Performance and Strong Scaling]
14 SC15
Hybrid MPI+OpenSHMEM Sort Application
• Performance of the hybrid (MPI+OpenSHMEM) sort application
• Execution time (seconds):
  – 1TB input size at 8,192 cores: MPI – 164, Hybrid-SR (Simple Read) – 92.5, Hybrid-ER (Eager Read) – 90.36
  – 45% improvement over the MPI-based design
• Weak scalability (configuration: input size of 1TB per 512 cores):
  – At 4,096 cores: MPI – 0.25 TB/min, Hybrid-SR – 0.34 TB/min, Hybrid-ER – 0.38 TB/min
  – 38% improvement over the MPI-based design
[Charts: execution time (sec) at 1024-1TB, 2048-1TB, 4096-1TB, 8192-1TB (45% improvement)
 and weak-scaling sort rate (TB/min) at 512-1TB, 1024-2TB, 2048-4TB, 4096-8TB (38% improvement),
 comparing MPI, Hybrid-SR, and Hybrid-ER]
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
15 SC15
Accelerating MaTEx k-NN with Hybrid MPI and OpenSHMEM
• MaTEx: MPI-based machine learning algorithm library
• k-NN: a popular supervised algorithm for classification
• Hybrid designs:
  – Overlapped data flow; one-sided data transfer; circular-buffer structure
• Benchmark: KDD Cup 2010 (8,407,752 records, 2 classes, k = 5)
  – For the truncated KDD workload (30MB) on 256 cores: 27.6% reduction in execution time
  – For the full KDD workload (2.5GB) on 512 cores: 9.0% reduction in execution time
[Charts: execution time for KDD (2.5GB) on 512 cores (9.0% improvement) and truncated KDD (30MB) on 256 cores (27.6% improvement)]
J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, D. Panda, Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM 2015
16 SC15
CUDA-Aware Concept
• Before CUDA 4: additional copies
  – Low performance and low productivity
• After CUDA 4: host-based pipeline
  – Unified Virtual Addressing
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks
[Diagram: GPU, GPU memory, CPU, chipset, and InfiniBand adapter data paths]
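As a minimal sketch of the CUDA-aware idea from the application's point of view, a device pointer is handed directly to the MPI call and the library performs the staging, pipelining, or GPUDirect-RDMA transfer (assumes a CUDA-aware MPI library such as MVAPICH2-GDR; sizes are illustrative):

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      float *dev_buf;
      cudaMalloc((void **)&dev_buf, 1024 * sizeof(float));   /* buffer lives in GPU memory */

      if (rank == 0)        /* device pointer passed directly; no explicit cudaMemcpy needed */
          MPI_Send(dev_buf, 1024, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(dev_buf, 1024, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      cudaFree(dev_buf);
      MPI_Finalize();
      return 0;
  }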
Limitations of OpenSHMEM for GPU Computing
• The OpenSHMEM memory model does not support disjoint memory address spaces - the case with GPU clusters

Existing OpenSHMEM model with CUDA (GPU-to-GPU data movement):

  PE 0:
    host_buf = shmalloc (…)
    cudaMemcpy (host_buf, dev_buf, …)            /* stage device data into host memory */
    shmem_putmem (host_buf, host_buf, size, pe)  /* one-sided put of the host copy */
    shmem_barrier (…)

  PE 1:
    host_buf = shmalloc (…)
    shmem_barrier ( … )
    cudaMemcpy (dev_buf, host_buf, size, …)      /* copy received data back to the device */

• The copies severely limit performance
• The synchronization negates the benefits of one-sided communication
• Similar issues with UPC
18 SC15
Global Address Space with Host and Device Memory
[Diagram: the global shared address space spans both host memory and device memory on every node;
 each node contributes private and shared regions on the host and on the device]
• Extended APIs:
  – heap_on_device / heap_on_host: a way to indicate the location of the heap
  – host_buf = shmalloc (sizeof(int), 0);
  – dev_buf = shmalloc (sizeof(int), 1);
• CUDA-Aware OpenSHMEM (same design for UPC):
  PE 0:
    dev_buf = shmalloc (size, 1);
    shmem_putmem (dev_buf, dev_buf, size, pe)
  PE 1:
    dev_buf = shmalloc (size, 1);
S. Potluri, D. Bureddy, H. Wang, H. Subramoni and D. K. Panda, Extending OpenSHMEM for GPU Computing, IPDPS '13
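A minimal end-to-end sketch of the extended model above; note that the two-argument shmalloc and the device-heap indicator are the extensions proposed here (IPDPS '13), not part of the standard OpenSHMEM API, and the CUDA call is only used to initialize the device buffer:

  #include <shmem.h>
  #include <cuda_runtime.h>

  int main(void) {
      start_pes(0);
      int me = _my_pe();

      /* allocate a symmetric buffer on the device heap (extended API: second argument 1 = device) */
      int *dev_buf = (int *) shmalloc(sizeof(int), 1);

      if (me == 0) {
          int val = 7;
          cudaMemcpy(dev_buf, &val, sizeof(int), cudaMemcpyHostToDevice);
          /* one-sided put directly between device buffers; the runtime
           * transparently uses pipelining, CUDA IPC, or GPUDirect RDMA */
          shmem_putmem(dev_buf, dev_buf, sizeof(int), 1);
      }
      shmem_barrier_all();
      return 0;
  }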
SC15 19
CUDA-Aware OpenSHMEM and UPC Runtimes
• After device memory becomes part of the global shared space:
  – Accessible through standard UPC/OpenSHMEM communication APIs
  – Data movement transparently handled by the runtime
  – Preserves one-sided semantics at the application level
• Efficient designs to handle communication:
  – Inter-node transfers use host-staged transfers with pipelining
  – Intra-node transfers use CUDA IPC
  – Possibility to take advantage of GPUDirect RDMA (GDR)
• Goal: enabling high-performance one-sided communication semantics with GPU devices
20 SC15
Exploiting GDR: OpenSHMEM Inter-node Evaluation
• GDR for small/medium message sizes
• Host staging for large messages to avoid PCIe bottlenecks
• The hybrid design brings the best of both:
  – 3.13 us Put latency for 4B (7X improvement) and 4.7 us latency for 4KB
  – 9X improvement for Get latency of 4B
[Charts: shmem_put D-D latency (us) for small messages (1B-4KB) and large messages (8KB-2MB),
 and shmem_get D-D latency (us) for small messages (1B-4KB),
 comparing Host-Pipeline and GDR designs; up to 7X (put) and 9X (get) improvements]
21 SC15
Application Evaluation: GPULBM and 2DStencil
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• New designs achieve 20% and 19% improvements on 32 and 64 GPU nodes
• Redesign of the application: CUDA-aware MPI (Send/Recv) => hybrid CUDA-aware MPI+OpenSHMEM
  – cudaMalloc => shmalloc(size, 1); MPI_Send/Recv => shmem_put + fence
  – 53% and 45% improvements; the degradation is due to the small input size
[Charts: GPULBM (64x64x64) weak-scaling evolution time (msec) and 2DStencil (2Kx2K) execution time (sec)
 on 8, 16, 32, and 64 GPU nodes, comparing Host-Pipeline and Enhanced-GDR (19% and 45% improvements highlighted)]
K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C. Ching and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters, IEEE Cluster 2015
22 SC15
Concluding Remarks
• Presented an overview of hybrid MPI+PGAS models
• Outlined research challenges in designing efficient runtimes for these models on clusters with InfiniBand
• Demonstrated the benefits of hybrid MPI+PGAS models for a set of applications on host-based systems
• Presented the model extensions and runtime support for an accelerator-aware hybrid MPI+PGAS model
• The hybrid MPI+PGAS model is an emerging paradigm that can lead to high-performance and scalable implementations of applications on exascale computing systems with accelerators
23 SC15
Funding Acknowledgments
Funding Support by
Equipment Support by
24 SC15
Personnel Acknowledgments
Current Students – A. Augustine (M.S.)
– A. Awan (Ph.D.)
– A. Bhat (M.S.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– N. Islam (Ph.D.)
Past Students – P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist – S. Sur
Current Post-Docs – J. Lin
– D. Shankar
Current Programmer – J. Perkins
Past Post-Docs – H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekar (Ph.D.)
– K. Kulkarni (M.S.)
– M. Li (Ph.D.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Research Scientists – H. Subramoni
– X. Lu
Past Programmers – D. Bureddy
Current Research Specialist – M. Arnold
Current Senior Research Associate – K. Hamidouche
25 SC15
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/
26 SC15