Preparing OpenSHMEM for Exascale
TRANSCRIPT
Oak Ridge National Laboratory, Computing and Computational Sciences
HPC Advisory Council Stanford Conference, California
Feb 2, 2015
Presented by: Pavel Shamis (Pasha)
Outline
• CORAL overview
– Summit
• What is OpenSHMEM?
• Preparing OpenSHMEM for Exascale
– Recent advances
CORAL
• CORAL – Collaboration of ORNL, ANL, LLNL
• Objective – Procure 3 leadership computers to be sited at Argonne, Oak Ridge and Lawrence Livermore in 2017
– Two of the contracts have been awarded with the Argonne contract in process
• Leadership Computers
– RFP requests >100 PF, 2 GB/core main memory, local NVRAM, and science performance 5x-10x Titan or Sequoia
The Road to Exascale
• 2010 – Jaguar: 2.3 PF, multi-core CPU, 7 MW
• 2012 – Titan: 27 PF, hybrid GPU/CPU, 9 MW
• 2017 – Summit (CORAL system): 5-10x Titan, hybrid GPU/CPU, 10 MW
• 2022 – OLCF5: 5-10x Summit, ~20 MW
Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Jaguar scaled to 300,000 cores. Titan and beyond deliver hierarchical parallelism with very powerful nodes: MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors.
System Summary
• Mellanox® interconnect: dual-rail EDR InfiniBand®
• IBM POWER: NVLink™
• NVIDIA Volta: HBM, NVLink
• Compute node: POWER® architecture processor, NVIDIA® Volta™, NVMe-compatible 800 GB PCIe SSD, > 512 GB HBM + DDR4, coherent shared memory
• Compute rack: standard 19”, warm-water cooling
• Compute system – Summit: 5x-10x Titan, 10 MW
Summit VS Titan
How does Summit compare to Titan?

Feature                        Summit                       Titan
Application performance        5-10x Titan                  Baseline
Number of nodes                ~3,400                       18,688
Node performance               > 40 TF                      1.4 TF
Memory per node                > 512 GB (HBM + DDR4)        38 GB (GDDR5 + DDR3)
NVRAM per node                 800 GB                       0
Node interconnect              NVLink (5-12x PCIe 3)        PCIe 2
System interconnect            Dual-rail EDR-IB (23 GB/s)   Gemini (6.4 GB/s)
(node injection bandwidth)
Interconnect topology          Non-blocking fat tree        3D torus
Processors                     IBM POWER9 +                 AMD Opteron™ +
                               NVIDIA Volta™                NVIDIA Kepler™
File system                    120 PB, 1 TB/s, GPFS™        32 PB, 1 TB/s, Lustre®
Peak power consumption         10 MW                        9 MW
Present and Future Leadership Computers at OLCF, Buddy Bland
https://www.olcf.ornl.gov/wp-content/uploads/2014/12/OLCF-User-Group-Summit-12-3-2014.pdf
Challenges for Programming Models
• Very powerful compute nodes
– Hybrid architecture
– Multiple CPU/GPU
– Different types of memory
• Must be fun to program ;-)
– MPI + X
What is OpenSHMEM?
• Communication library and interface specification that implements a Partitioned Global Address Space (PGAS) programming model
• Processing Element (PE): an OpenSHMEM process
• Symmetric objects have the same address (or offset) on all PEs
[Figure: PEs 0 through N-1 each hold global/static variables and a symmetric heap (remotely accessible symmetric data objects) plus local variables (private data objects); X = shmalloc(sizeof(long)) allocates variable X at the same offset on every PE]
OpenSHMEM Operations
• Remote memory Put and Get
– void shmem_getmem(void *target, const void *source, size_t len, int pe);
– void shmem_putmem(void *target, const void *source, size_t len, int pe);
• Remote memory Atomic operations
– void shmem_int_add(int *target, int value, int pe);
• Collective
– broadcast, reductions, etc.
• Synchronization operations
– Point-to-point
– Global
• Ordering operations
• Distributed lock operations
OpenSHMEM Code Example
• You just learned to program OpenSHMEM!
– 1. Library initialization
– 2. AMO/PUT/GET
– 3. Synchronization
– 4. Done
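The code shown on this slide did not survive the transcript; a minimal sketch of the four steps, using OpenSHMEM 1.2-style names and a hypothetical ring exchange, might look like this:

```c
/* Minimal OpenSHMEM sketch (hypothetical reconstruction -- the slide's
 * original code is not in the transcript). Build and run with an
 * OpenSHMEM implementation, e.g.:
 *   oshcc ring.c -o ring && oshrun -np 4 ./ring
 */
#include <stdio.h>
#include <shmem.h>

long dest = 0;   /* global variable => symmetric, same address on every PE */

int main(void)
{
    shmem_init();                       /* 1: library initialization */
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long src = me;
    /* 2: PUT my id into the right neighbor's symmetric 'dest' */
    shmem_long_put(&dest, &src, 1, (me + 1) % npes);

    shmem_barrier_all();                /* 3: synchronization */

    printf("PE %d received %ld from its left neighbor\n", me, dest);
    shmem_finalize();                   /* 4: done */
    return 0;
}
```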
OpenSHMEM
• OpenSHMEM is a one-sided communications library
– C and Fortran API
– Uses symmetric data objects to efficiently communicate across processes
• Advantages:
– Good for irregular applications, latency-driven communication
• Random memory access patterns
– Maps really well to hardware/interconnects
OpenSHMEM      InfiniBand (Mellanox)   Gemini/Aries (Cray)
RMA PUT/GET    ✓                       ✓
Atomics        ✓                       ✓
Collectives    ✓                       ✓
OpenSHMEM Key Principles
• Keep it simple
– The specification is only ~ 80 pages
• Keep it fast
– As close as possible to hardware
Evolution of OpenSHMEM
• 1990s – SHMEM library introduced by Cray Research Inc. (T3D systems); adapted by SGI for products based on the NUMAlink architecture and included in the Message Passing Toolkit (MPT); vendor-specific SHMEM libraries emerge (Quadrics, HP, IBM, Mellanox, Intel, gpSHMEM, SiCortex, etc.)
• 2011 – OpenSHMEM is born: ORNL and UH come together to address the differences between various SHMEM implementations; OSSS signs the SHMEM trademark licensing agreement
• 2012 – OpenSHMEM 1.0 is finalized
• 2013 – OpenSHMEM 1.0 reference implementation, V&V, tools
• mid 2014 – OpenSHMEM 1.1 released
• 2015 – OpenSHMEM 1.2; from 2015 onwards, next OpenSHMEM specifications: faster, more predictable, more agile
OpenSHMEM is a living specification!
OpenSHMEM - Roadmap
• OpenSHMEM v1.1 (June 2014)
– Errata, bug fixes
– Ratified (100+ tickets resolved)
• OpenSHMEM v1.2 (Early 2015)
– API naming convention
– finalize(), global_exit()
– Consistent data type support
– Version information
– Clarifications: zero-length, wait
– shmem_ptr()
• OpenSHMEM v1.5 (Late 2015)
– Non-blocking communication semantics (RMA, AMO)
– teams, groups
– Thread safety
• OpenSHMEM v1.6
– Non-blocking collectives
• OpenSHMEM v1.7
– Thread safety update
• OpenSHMEM Next Generation (2.0)
– Let’s go wild !!! (Exascale!)
– Active set + Memory context
– Fault Tolerance
– Exit codes
– Locality
– I/O
White paper:
OpenSHMEM Tools API
OpenSHMEM Implementations
• Proprietary
– SGI SHMEM
– Cray SHMEM
– IBM SHMEM
– HP SHMEM
– Mellanox Scalable SHMEM
• Legacy
– Quadrics SHMEM
• Open Source
– OpenSHMEM Reference Implementation (UH)
– Portals SHMEM
– OSHMPI / OpenSHMEM over MPI (under development)
– OpenSHMEM with Open MPI
– OpenSHMEM with MVAPICH2-X (OSU)
– TSHMEM (UFL)
– GatorSHMEM (UFL)
OpenSHMEM Eco-system
[Diagram: OpenSHMEM eco-system – reference implementation, OpenSHMEM Analyzer, Vampir, and other tools]
OpenSHMEM Eco-system
• OpenSHMEM Specification
– http://www.openshmem.org/site/Downloads/Source
• Vampir
– https://www.vampir.eu
• TAU
– http://www.cs.uoregon.edu/research/tau/home.php
• DDT
– www.allinea.com/products/ddt
• OpenSHMEM Analyzer
– https://www.openshmem.org/OSA
• UCCS
– http://uccs.github.io/uccs/
Upcoming Challenges for OpenSHMEM
• Based on what we know about the upcoming architecture…
• Hybrid architecture
– Communication across different components of the system
– Locality of resources
• Multiple CPU/GPU
– Thread safety (without performance sacrifices)
– Thread locality
– Scalability
• Different types of memory
– Address spaces
Hybrid Architecture Challenges and Ideas
• OpenSHMEM for accelerators
• “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri
– http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
– Preliminary study, prototype concept
NVSHMEM
• The problem
– Communication across GPUs requires synchronization with the host
• Software overheads, hardware overhead of launching kernels, etc.
• Research idea/concept proposed by NVIDIA
– GPU-initiated communication
– NVSHMEM communication primitives: nvshmem_put(), nvshmem_get() to/from remote GPU memory
– Emulated using CUDA IPC (CUDA 4.2)
This slide is based on “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
CHANGE IN THE MODEL
Traditional:
Loop {
  Interior Compute (kernel launch)
  Pack Boundaries (kernel launch)
  Stream Synchronize
  Exchange (MPI/OpenSHMEM)
  Unpack Boundaries (kernel launch)
  Boundary Compute (kernel launch)
  Stream/Device Synchronize
}
– Kernel launch overheads
– CPU-based blocking synchronization
Envisioned:
Compute, Exchange and Synchronize (single kernel launch)
– Support SHMEM communication and synchronization primitives from inside the GPU kernel
NVSHMEM
SIMPLIFIED 2D STENCIL EXAMPLE
u[i][j] = u[i][j] + (v[i+1][j] + v[i-1][j]
        + v[i][j+1] + v[i][j-1]) / x
Evaluation results from: “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf
PRELIMINARY RESULTS
Time per step (usec); Ghost width – 1; Boundary – 16; Threads per block – 512; Blocks – 15; 4 processes, 1 process/GPU:

Domain Size/GPU   Traditional   Persistent Kernel
64                195.33        13.88
128               193.7         21.32
256               193.18        39.77
512               220.28        132.61
1024              375.8         389.65
2048              1319.74       1312.59
4096              5299.23       4776.31
8192              21480.32      18394.88

Time per step (usec); Domain size – 2048; Ghost width – 1; Boundary – 2 (extrapolation by reducing problem size per GPU, assuming constant exchange and synchronization time). Benchmark numbers, beware!!

GPU Count   Traditional   Persistent Kernel
4           375           389
16          226           132
64          196           39
256         194           21
1K          192           13
4K          202           13
16K         193           12
64K         194           13
[Charts: time per step (usec) vs. stencil size, and vs. number of GPUs, comparing traditional vs. persistent kernel]
4 K40m GPUs connected on a Xeon E5-2690 socket using PLX switches
Many-Core System Challenges
• It is challenging to provide high-performance THREAD_MULTIPLE support
– Locks / Atomic operations in communication path
• Even though MPI IMB benchmarks benefit from full process memory separation, multi-threaded UCCS achieves comparable performance
Aurelien Bouteiller, Thomas Herault and George Bosilca, “A Multithreaded Communication Substrate for OpenSHMEM”, OUG2014
Many-Core System Challenges – “Old” Ideas
• SHMEM_PTR (or SHMEM_LOCAL_PTR on Cray)
[Figure: PE 0 and PE 1 each have a symmetric heap containing variable X plus local variables; via memory mapping, Y = shmem_ptr(&X, PE1) gives PE 0 a local pointer Y to PE 1’s X]
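A sketch of the idea (illustrative, not from the slides): map a neighbor's symmetric variable and read it with an ordinary load. shmem_ptr returns NULL when the remote memory cannot be mapped, e.g. when the PEs are on different nodes.

```c
/* Sketch of shmem_ptr: obtain a local pointer to a neighboring PE's
 * symmetric variable and access it with plain loads/stores.
 * Requires an OpenSHMEM implementation that maps remote memory
 * (e.g. via XPMEM for on-node PEs); otherwise shmem_ptr returns NULL. */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long *x = (long *)shmalloc(sizeof(long));  /* symmetric allocation */
    *x = me;
    shmem_barrier_all();

    /* Try to map the right neighbor's copy of x directly */
    long *y = (long *)shmem_ptr(x, (me + 1) % npes);
    if (y != NULL)
        printf("PE %d loads neighbor's X = %ld directly\n", me, *y);
    else
        printf("PE %d: neighbor's memory is not load/store reachable\n", me);

    shmem_barrier_all();
    shfree(x);
    shmem_finalize();
    return 0;
}
```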
Many-Core System Challenges – “Old” Ideas
• Provides direct access to a “remote” PE’s data with memory load and store operations
• Supported on systems where SHMEM_PUT/GET are implemented with memory load and store operations
– Usually implemented using XPMEM (https://code.google.com/p/xpmem/)
• Gabriele Jost, Ulf R. Hanebutte, James Dinan, “OpenSHMEM with Threads: A Bad Idea?”
• http://www.csm.ornl.gov/OpenSHMEM2014/documents/talk6_jost_OUG14.pdf
Many-Core System Challenges – New Ideas
• OpenSHMEM Context by Intel
– James Dinan and Mario Flajslik, “Contexts: A Mechanism for High Throughput Communication in OpenSHMEM”, PGAS 2015
– Explicit API for allocation and management of communication contexts
[Diagram: an OpenSHMEM application with threads 0-2, each issuing Put/Put/Get through its own context (0-2) into the OpenSHMEM library]
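As an illustration only, a per-thread context in the spirit of this proposal might be used as below. The shmem_ctx_* names follow the API that was later standardized in OpenSHMEM 1.4; relative to this talk they are an assumption, not the proposal's exact interface.

```c
/* Illustrative sketch of explicit communication contexts (names follow
 * the shmem_ctx_* API later standardized in OpenSHMEM 1.4; at the time
 * of this talk contexts were still a proposal). Each thread drives its
 * own context, so its puts and quiet do not contend with -- or wait
 * for -- operations issued by other threads. */
#include <shmem.h>

void thread_work(long *dst, const long *src, size_t n, int target_pe)
{
    shmem_ctx_t ctx;

    if (shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx) != 0)
        ctx = SHMEM_CTX_DEFAULT;      /* fall back to the shared context */

    shmem_ctx_long_put(ctx, dst, src, n, target_pe);
    shmem_ctx_quiet(ctx);             /* completes only this thread's ops */

    if (ctx != SHMEM_CTX_DEFAULT)
        shmem_ctx_destroy(ctx);
}
```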
Many-Core System Challenges – New Ideas
• Cray’s proposal of “Hot” Threads
– Monika ten Bruggencate Cray Inc. “Cray SHMEM Update”, First OpenSHMEM Workshop: Experiences, Implementations and Tools
• http://www.csm.ornl.gov/workshops/openshmem2013/documents/presentations_and_tutorials/tenBruggencate_Cray_SHMEM_Update.pdf
– Idea: each thread is registered within the OpenSHMEM library; the library allocates and automatically manages communication resources (contexts) for the application
– Compatible with current API
Address Space and Locality Challenges
• The symmetric heap is not flexible enough
– All PEs have to allocate the same amount of memory
• No concept of locality
• How do we manage different types of memory?
• What is the right abstraction?
Memory Spaces
• Aaron Welch, Swaroop Pophale, Pavel Shamis, Oscar Hernandez, Stephen Poole, Barbara Chapman, “Extending the OpenSHMEM Memory Model to Support User-Defined Spaces”, PGAS 2014
• Concept of teams
– The original OpenSHMEM active-set (group of PEs) concept is outdated, BUT very lightweight (local operation)
• Memory Space
– Memory space association with a team
– Similar concepts can be found in MPI, Chapel, etc.
Teams
• Explicit method of grouping PEs
• Fully local objects and operations - Fast
• New (sub)teams created from parent teams
• Re-indexing of PE ids with respect to the team
• Strided teams and axial splits
– No need to maintain “translation” array
– All translations can be done with simple arithmetic
• Ability to specify team index for remote operations
Spaces
• Spaces and teams creation is decoupled
• Faster memory allocation compared to “shmalloc”
• Future directions
– Different types of memory
– Locality
– Separate address spaces
– Asymmetric RMA access
Fault Tolerance ?
• How do we run in the presence of faults?
• What is the responsibility of the programming model and communication libraries?
• Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, Swaroop Pophale, Aaron Welch, Stephen Poole, Barbara Chapman, “Fault Tolerance for OpenSHMEM”, PGAS/OUG14
– http://nic.uoregon.edu/pgas14/oug_submissions/oug2014_submission_12.pdf
Fault Tolerance
• Basic idea
– In memory checkpoint of symmetric memory regions
– Symmetric recovery or only “memory recovery”
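As an illustration of the in-memory checkpoint idea (a sketch under assumptions, not the paper's implementation), each PE could keep a copy of its symmetric region on a buddy PE:

```c
/* Sketch of an in-memory "buddy" checkpoint of a symmetric region
 * (illustrative only -- not the implementation from the paper).
 * Each PE puts a copy of its region into a symmetric backup buffer
 * on a partner PE; after a failure the region can be fetched back. */
#include <shmem.h>

void checkpoint_region(void *region, void *backup, size_t len)
{
    int buddy = (shmem_my_pe() + 1) % shmem_n_pes();
    shmem_putmem(backup, region, len, buddy);  /* store copy remotely */
    shmem_barrier_all();                       /* checkpoint globally complete */
}

void restore_region(void *region, void *backup, size_t len)
{
    int buddy = (shmem_my_pe() + 1) % shmem_n_pes();
    shmem_getmem(region, backup, len, buddy);  /* fetch my copy back */
    shmem_barrier_all();
}
```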
Fault Tolerance
• Work in progress…
• OpenSHMEM is just one piece of the puzzle
– Run-time, I/O, drivers, etc.
– The system has to provide fault tolerance infrastructure
• Error notification, coordination, etc.
• Leveraging existing work/research in the HPC community
– MPI, Hadoop, etc.
Summary
• This is just a “snapshot” of some of the ideas
– Other active research & development topics: non-blocking operations, counting operations, signaled operations, asymmetric memory access, etc.
• These challenges are relevant for many other HPC programming models
• The key to success
– Co-design of hardware and software
– Generic solutions that target broader community
• The challenges are common across different fields: storage, analytics, big-data, etc.
How to get involved?
• Join the mailing list
– www.openshmem.org/Community/MailingList
• Join OpenSHMEM redmine
– www.openshmem.org/redmine
• GitHUB
– https://github.com/orgs/openshmem-org
• OpenSHMEM RF, test suites, benchmarks, etc.
• Participate in our upcoming events
– Workshop, user group meetings, and conference calls
Upcoming Events…
www.csm.ornl.gov/OpenSHMEM2015/
Co-located with PGAS 2015, the 9th International Conference on Partitioned Global Address Space Programming Models, Washington, DC
Acknowledgements
This work was supported by the United States Department of Defense & used resources of the Extreme Scale Systems Center at Oak Ridge
National Laboratory.
Empowering the Mission
NVSHMEM Code Example
USING NVSHMEM – device code (sketch, as presented on the slide):

__global__ void one_kernel (u, v, sync, …)
{
    i = threadIdx.x;
    for (…) {
        if (i+1 > nx) {
            v[i+1] = nvshmem_float_g (v[1], rightpe);
        }
        if (i-1 < 1) {
            v[i-1] = nvshmem_float_g (v[nx], leftpe);
        }
        …
        u[i] = (u[i] + (v[i+1] + v[i-1] …

        /* peers array has left and right PE ids */
        if (i < 2) {
            nvshmem_int_p (sync[i], 1, peers[i]);
            nvshmem_quiet();
            nvshmem_wait_until (sync[i], EQ, 1);
        }
        /* intra-process sync */
        …
        /* compute v from u and sync */
    }
}

Evaluation results from: “TOC-Centric Communication: a case study with NVSHMEM”, OUG/PGAS 2014, Shreeram Potluri, http://www.csm.ornl.gov/OpenSHMEM2014/documents/NVIDIA_Invite_OUG14.pdf