ghex: generic halo exchange for exascale · req = comm.recv(msg, peer_rank, tag); // … compute....

04/02/2020

Marcin Krotkiewski (UiO / SIGMA2), Mauro Bianco, Fabian Bösch, Marco Bettiol (CSCS, ETH)

GHEX: Generic Halo Exchange for Exascale

Funding

• PRACE, EU’s Horizon 2020 Research and Innovation Programme
• CSCS / ETH Zurich
• SIGMA2

Scientific collaborators

• RoCS (Rosseland Centre for Solar Physics)
  – BIFROST, stellar atmosphere simulation code (Mats Carlsson, Mikolaj Szydlarski, FORTRAN)
  – DISPATCH, task-based numerical simulation framework (Åke Nordlund, FORTRAN)
• ECMWF (European Centre for Medium-Range Weather Forecasts)
  – Atlas, numerical weather prediction and climate modeling
• MeteoSwiss
  – COSMO, large-scale climate and atmospheric simulations

Introduction

• Parallel PDE solvers and halo exchange
• GHEX: goals
• Performance optimizations
• Examples: C++, FORTRAN
• Benchmarks

Structured grids


Unstructured grids

Parallel stencil computations and halo exchange

[Figure: stencil on a structured grid (point 2,3 and its neighbors 1,3; 3,3; 2,2; 2,4) and a 3×3 decomposition of the grid among processes]

• 8 neighbors in 2D
• 26 neighbors in 3D
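The neighbor counts follow from enumerating every offset in {-1, 0, 1}^D except the zero offset, which gives 3^D - 1 neighbors. A minimal sketch of that enumeration; the helper name `neighbor_offsets` is illustrative and is not part of GHEX:

```cpp
#include <array>
#include <cassert>
#include <vector>

// Enumerate the offsets of all neighbors of a subdomain in a D-dimensional
// Cartesian decomposition: every vector in {-1,0,1}^D except the zero vector.
// Illustrative helper, not a GHEX function.
template <int D>
std::vector<std::array<int, D>> neighbor_offsets() {
    static_assert(D >= 1, "dimension must be positive");
    int total = 1;
    for (int d = 0; d < D; ++d) total *= 3;   // 3^D candidate offsets

    std::vector<std::array<int, D>> result;
    for (int code = 0; code < total; ++code) {
        std::array<int, D> off{};
        int c = code;
        bool is_self = true;
        for (int d = 0; d < D; ++d) {
            off[d] = c % 3 - 1;               // base-3 digit mapped to -1, 0, 1
            c /= 3;
            if (off[d] != 0) is_self = false;
        }
        if (!is_self) result.push_back(off);  // skip the subdomain itself
    }
    return result;
}
```

Calling `neighbor_offsets<2>()` and `neighbor_offsets<3>()` yields the 8 and 26 offsets quoted on the slide.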

GHEX: goals

• Wide range of grids
  – Structured, staggered, on a sphere, unstructured
• Performance portability
  – CPU / GPU, InfiniBand / Cray
• Highly scalable
  – No global synchronization
  – Hybrid parallelism (threads + ranks), task-based parallelism
• Modern design
  – C++ templates, overhead-free abstraction layers
• Wide applicability
  – Abstractions to allow user-defined domains/grids
  – FORTRAN and Python bindings
  – Expose both high-level (halo exchange) and low-level (transport layer) API

Pseudocode: Common implementation

allocate(halo1, halo2, halo3, … , halo26)

do while(.not. end_of_simulation)

  ! compute
  call update_timestep()

  ! copy halos for all 26 neighbors
  call copy_halo(halo1, field, nb1)
  …

  ! send/recv to/from all 26 neighbors
  call mpi_sendrecv(halo1, …, nb1, …)
  …

  ! Repeat for all physical fields
  ! wait for comm to finish
end do
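To make the `copy_halo` step concrete, here is a minimal sketch of packing one face of a 3D field into a contiguous send buffer. The `Field3D` type, its x-fastest layout, and `pack_high_x` are assumptions made for illustration; the slide does not specify how the application stores its fields.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3D field stored contiguously with x as the fastest index.
struct Field3D {
    std::size_t nx, ny, nz;
    std::vector<double> data;

    Field3D(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : nx(nx_), ny(ny_), nz(nz_), data(nx_ * ny_ * nz_, 0.0) {}

    double& at(std::size_t i, std::size_t j, std::size_t k) {
        return data[i + nx * (j + ny * k)];
    }
    double at(std::size_t i, std::size_t j, std::size_t k) const {
        return data[i + nx * (j + ny * k)];
    }
};

// Gather the `width` outermost layers at the high-x end of the field into a
// contiguous buffer, ready to be sent to the neighbor in the +x direction.
std::vector<double> pack_high_x(const Field3D& f, std::size_t width) {
    assert(width <= f.nx);
    std::vector<double> buffer;
    buffer.reserve(width * f.ny * f.nz);
    for (std::size_t k = 0; k < f.nz; ++k)
        for (std::size_t j = 0; j < f.ny; ++j)
            for (std::size_t i = f.nx - width; i < f.nx; ++i)
                buffer.push_back(f.at(i, j, k));
    return buffer;
}
```

A hand-written code needs one such routine (and a matching unpack) per neighbor direction and per field, which is the bookkeeping the common implementation repeats 26 times in 3D.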

Pseudocode: GHEX

! define halo, grid, and local domain
halo = [1 2 1 2 1 2]
eh = exchange_descr(local_domain, halo, field1, field2, …)

do while(.not. end_of_simulation)

  ! queue asynchronous communication
  ch = ghex_exchange(eh)

  ! compute
  call update_timestep()

  ! wait for comm to finish
  ch.wait()

end do

Performance optimizations: domain specific

• Data / neighbor locality
  – Renumber ranks (and neighbors) to maximize compute node locality and NUMA / memory domain locality
• Zero-copy node-local halo exchange
  – Halo data copied directly from source data cube to destination data cube on another rank
  – XPMEM, shared memory (mmap), threads

Map local domains to ranks


Grid part processed by MPI rank 1

Map ranks to compute nodes

1 2 3 ... 16

Ranks running on one compute node (shared memory communication)

Off-node network communication

Node-aware mapping of ranks to nodes

Ranks running on one compute node (shared memory communication)

Network communication
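The effect of a node-aware mapping can be estimated by counting how many neighbor links cross a node boundary. The sketch below compares a row-major placement with a blocked (tiled) placement on a 2D rank grid. All names, the 8-neighbor non-periodic stencil, and the tile shape are assumptions for illustration; this is not how GHEX itself renumbers ranks.

```cpp
#include <cassert>

struct RankGrid { int px, py; };  // hypothetical 2D grid of ranks

// Row-major placement: consecutive ranks fill a node one after another.
int node_row_major(int x, int y, RankGrid g, int ranks_per_node) {
    return (x + g.px * y) / ranks_per_node;
}

// Blocked placement: each node owns a bx-by-by tile of the rank grid.
// Assumes px is divisible by bx and py by by.
int node_blocked(int x, int y, RankGrid g, int bx, int by) {
    assert(g.px % bx == 0 && g.py % by == 0);
    const int tiles_x = g.px / bx;
    return (x / bx) + tiles_x * (y / by);
}

// Count directed neighbor links (8-neighbor stencil, no periodicity) whose
// two endpoints live on different nodes, i.e. links that use the network.
template <class NodeOf>
int off_node_links(RankGrid g, NodeOf node_of) {
    int count = 0;
    for (int y = 0; y < g.py; ++y)
        for (int x = 0; x < g.px; ++x)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    const int nx = x + dx, ny = y + dy;
                    if (nx < 0 || nx >= g.px || ny < 0 || ny >= g.py) continue;
                    if (node_of(x, y) != node_of(nx, ny)) ++count;
                }
    return count;
}
```

For an 8×8 rank grid with 16 ranks per node, the row-major placement gives 132 off-node links, while 4×4 tiles give 84, because a square tile has a smaller perimeter than a 2×8 strip.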

Performance optimizations: general

• Pack multiple messages into a single buffer
  – Multiple field data, coalesce messages sent to the same peer
• Low overhead communication backends
  – UCX (IB), libfabric (Cray), MPI if nothing else is available
• Multithreading / hybrid parallelism
  – OpenMP, pthreads, std::thread
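Packing the halos of several fields into one buffer per peer can be sketched as follows. The `PackedMessage` layout (raw bytes plus per-chunk offsets and sizes) is an assumption chosen for illustration and does not describe GHEX's internal buffer format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One contiguous message for a single peer, holding the halo data of several
// fields back to back. Hypothetical layout, for illustration only.
struct PackedMessage {
    std::vector<unsigned char> bytes;   // payload sent in a single message
    std::vector<std::size_t> offsets;   // byte offset of each field's chunk
    std::vector<std::size_t> sizes;     // byte size of each field's chunk
};

// Append the halo of one field (of any trivially copyable element type).
template <class T>
void append_chunk(PackedMessage& m, const std::vector<T>& halo) {
    const std::size_t n = halo.size() * sizeof(T);
    m.offsets.push_back(m.bytes.size());
    m.sizes.push_back(n);
    m.bytes.resize(m.bytes.size() + n);
    if (n > 0) std::memcpy(m.bytes.data() + m.offsets.back(), halo.data(), n);
}

// Recover the halo of field `index` on the receiving side.
template <class T>
std::vector<T> extract_chunk(const PackedMessage& m, std::size_t index) {
    assert(index < m.offsets.size());
    assert(m.sizes[index] % sizeof(T) == 0);
    std::vector<T> out(m.sizes[index] / sizeof(T));
    if (!out.empty())
        std::memcpy(out.data(), m.bytes.data() + m.offsets[index], m.sizes[index]);
    return out;
}
```

With this layout a `double` pressure halo and a `float` density halo bound for the same peer travel as one message instead of two.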

Using GHEX

GHEX: low-level API

• Unifies multiple communication backends
  – MPI, UCX, libfabric
• Asynchronous scheduling of communication
  – Overlaps communication with computation
• Request-based, MPI-like API
  – Threads post send/recv requests and test for their completion

req = comm.recv(msg, peer_rank, tag);

// … compute.
// Communication can progress in the background,
// depending on the hardware.

req.wait(); // or req.test();

GHEX: low-level API

• Callback-based asynchronous API
  – Threads post send/recv requests
  – GHEX notifies them about completion in a callback
• Suitable for task-based parallelism
  – No need to manage request and message queues
  – Highly asynchronous: any thread can progress and complete any other thread’s request

auto recv_callback = [&](auto mesg, auto peer_rank, auto tag){
    // unpack the message, mark task as ready to compute
};

// schedule comm
comm.recv(msg, peer_rank, tag, recv_callback);

// do other work
// Occasionally progress the communication.
comm.progress();
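The callback pattern can be illustrated with a toy, single-process stand-in for a communicator. `ToyComm`, `deliver`, and the `Message` alias are invented for this sketch and are not GHEX classes; the sketch only shows how posted receives are matched and completed inside `progress()`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Message = std::vector<int>;
using Callback = std::function<void(const Message&, int /*peer*/, int /*tag*/)>;

// Toy in-process communicator (NOT the GHEX API): receives are posted with a
// callback, and callbacks fire only when someone calls progress().
class ToyComm {
    struct Pending { int peer; int tag; Callback cb; };
    struct Arrived { Message msg; int peer; int tag; };
    std::vector<Pending> pending_;
    std::vector<Arrived> arrived_;

public:
    // Post a receive; returns immediately.
    void recv(int peer, int tag, Callback cb) {
        pending_.push_back(Pending{peer, tag, std::move(cb)});
    }

    // Stand-in for the network: a message from `peer` with `tag` arrives.
    void deliver(Message msg, int peer, int tag) {
        arrived_.push_back(Arrived{std::move(msg), peer, tag});
    }

    // Match arrived messages against posted receives and run the callbacks.
    // Returns the number of receives completed by this call.
    int progress() {
        int completed = 0;
        std::size_t i = 0;
        while (i < arrived_.size()) {
            std::size_t j = 0;
            while (j < pending_.size() &&
                   !(pending_[j].peer == arrived_[i].peer &&
                     pending_[j].tag == arrived_[i].tag)) ++j;
            if (j == pending_.size()) { ++i; continue; }  // no receive posted yet
            Arrived a = std::move(arrived_[i]);
            Callback cb = std::move(pending_[j].cb);
            arrived_.erase(arrived_.begin() + static_cast<std::ptrdiff_t>(i));
            pending_.erase(pending_.begin() + static_cast<std::ptrdiff_t>(j));
            cb(a.msg, a.peer, a.tag);                     // notify the poster
            ++completed;
        }
        return completed;
    }
};
```

The caller keeps no request objects: it posts the receive, continues with other work, and the task is marked ready from inside the callback whenever some call to `progress()` completes it.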

GHEX: high-level API

// 2 domains per MPI rank
auto local_domains = std::list{my_domain_t(rank*2, ...),
                               my_domain_t(rank*2+1, ...) };

// one domain lives on GPU
auto pressure_field_dom_a = data_desc<GPU, double>(...);
auto density_field_dom_a  = data_desc<GPU, float>(...);

// the other domain lives on CPU
auto pressure_field_dom_b = data_desc<CPU, double>(...);
auto density_field_dom_b  = data_desc<CPU, float>(...);

// halo generation function objects
auto halo_gen1 = my_halo_gen_t(1,1,1,1,0,0, ...);
auto halo_gen2 = my_halo_gen_t(1,2,1,2,1,1, ...);

GHEX: high-level API

// determine and hold connectivity information based on halos
auto pattern1 = ghex::make_pattern<ghex::structured_grid>(context, halo_gen1, local_domains);
auto pattern2 = ghex::make_pattern<ghex::structured_grid>(context, halo_gen2, local_domains);

// generic communication object
auto co = ghex::make_communication_object<pattern_type>(mpi_comm);

// simulation code
for (auto time_step = 0; time_step<N; ++time_step) {
    auto h = co.exchange(
        pattern1(pressure_field_dom_a),
        pattern1(pressure_field_dom_b),
        pattern2(density_field_dom_a),
        pattern2(density_field_dom_b));
    // do stuff...
    h.wait();
}

GHEX: FORTRAN

! define the local domain – 1 per rank
domain_desc(1)%id = rank
domain_desc(1)%device_id = DeviceCPU
! ...

! initialize field data structures
call ghex_field_init(temp_fd, temp_cube, halo, periodic=[1,1,0])
call ghex_domain_add_field(domain_desc(1), temp_fd)
call ghex_field_init(pressure_fd, pressure_cube, halo, periodic=[1,1,0])
call ghex_domain_add_field(domain_desc(1), pressure_fd)

! collect halo info from all fields in an exchange descriptor
ed = ghex_exchange_desc_new(domain_desc)

! create GHEX communicator
co = ghex_struct_co_new()

! simulation loop
do while (i < niters)
  handle = ghex_exchange(co, ed)
  ! ... local computations ...
  call ghex_wait(handle)
  i = i + 1
end do

Shared memory halo exchange benchmarks

Domain   Halo   Pack-copy-unpack   Zero-copy (write directly to dest)
---------------------------------------------------------------------
64^3     1      4.5 ms             2.8 ms
64^3     5      14.0 ms            6.6 ms
128^3    1      14.3 ms            9.5 ms
128^3    5      55.0 ms            25.0 ms

• Speedup due to lower memory footprint
  – MPI: gather+write, memcpy, read+scatter
  – Zero-copy: gather+scatter
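For a sense of how much data the two halo widths involve, the number of halo cells around a cubic local domain can be computed directly. The sketch assumes an n^3 cube with a halo of width h on all six sides (including edges and corners); the slide does not state the exact benchmark configuration, so this is only an estimate.

```cpp
#include <cassert>
#include <cstdint>

// Halo cells surrounding an n^3 cube with halo width h on every side:
// the (n + 2h)^3 padded cube minus the n^3 interior.
// Assumed geometry, for illustration only.
std::int64_t halo_cells(std::int64_t n, std::int64_t h) {
    assert(n > 0 && h >= 0);
    const std::int64_t outer = n + 2 * h;
    return outer * outer * outer - n * n * n;
}
```

Under this assumption a 64^3 domain has 25,352 halo cells for h = 1 and 143,080 for h = 5, so every extra memory pass over the halo (pack, copy, unpack) becomes more expensive as the halo widens.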

Transport layer benchmarks

• Multi-threaded tagged send/recv
  – Callback-based and request-based implementations
• Each thread has nmsg messages in progress
  – Suitable for task-based parallelism with fully asynchronous communication

while(!end)
    // for each message in-flight
    for(mid=0; mid<nmsg; mid++)
        // post recv
        if(completed(rreq[mid]))
            rreq[mid] = irecv(rbuf[mid], peer_rank, tag)
        // post send
        if(completed(sreq[mid]))
            sreq[mid] = isend(sbuf[mid], peer_rank, tag)


• Small message size: latency dominated
• Large message size: fabric bandwidth limit
• Coalescing of messages to the same destination: 10x10 kB vs 1x100 kB
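The benefit of coalescing can be estimated with the standard latency-bandwidth (postal) model, in which each message costs a fixed latency plus its size divided by the bandwidth. The function below is a sketch of that model only; the latency and bandwidth figures in the usage note are made-up illustrative values, not measurements from these benchmarks.

```cpp
#include <cassert>
#include <cmath>

// Postal model: every message pays a fixed latency plus bytes / bandwidth.
// Returns the total time in seconds for `messages` messages of equal size.
double transfer_time(int messages, double bytes_each,
                     double latency_s, double bandwidth_bytes_per_s) {
    assert(messages >= 0 && bytes_each >= 0.0 && bandwidth_bytes_per_s > 0.0);
    return messages * (latency_s + bytes_each / bandwidth_bytes_per_s);
}
```

With an assumed latency of 1 µs and bandwidth of 10 GB/s, ten 10 kB messages cost 20 µs in this model while one 100 kB message costs 11 µs: the payload time is identical, and coalescing saves nine latencies.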


• Many concurrent messages: multiple spatial neighbors
• In practice: 10 to 100 messages in-flight
• GHEX is optimized to handle many messages in-flight




• Many ranks / threads share the same fabric interface on a compute node

• Efficient multi-threaded communication is important for scalability

Summary

• Multiple modern transport backends (UCX, libfabric, MPI)
• Request- and callback-based backends
• CPU / GPU support
• Multi-threaded scalability
• Problem-specific optimizations
• Support for numerous grid types (structured and unstructured)
• C++, bindings to FORTRAN, Python
