GHEX: Generic Halo Exchange for Exascale

04/02/2020

Marcin Krotkiewski (UiO / SIGMA2)
Mauro Bianco, Fabian Bösch, Marco Bettiol (CSCS, ETH)


Funding

• PRACE, EU’s Horizon 2020 Research and Innovation Programme
• CSCS / ETH Zurich
• SIGMA2

Scientific collaborators

• RoCS (Rosseland Centre for Solar Physics)
  – BIFROST, stellar atmosphere simulation code (Mats Carlsson, Mikolaj Szydlarski, FORTRAN)
  – DISPATCH, task-based numerical simulation framework (Åke Nordlund, FORTRAN)
• ECMWF (European Centre for Medium-Range Weather Forecasts)
  – Atlas, numerical weather prediction and climate modeling
• MeteoSwiss
  – COSMO, large-scale climate and atmospheric simulations

Introduction

• Parallel PDE solvers and halo exchange
• GHEX: goals
• Performance optimizations
• Examples: C++, FORTRAN
• Benchmarks

Structured grids

[Figure: structured grid]

Unstructured grids

[Figure: unstructured grid]

Parallel stencil computations and halo exchange

[Figure: five-point stencil at grid point (2,3) with neighbors (1,3), (3,3), (2,2) and (2,4); a stencil formula in indices i, j with coefficients a and b; a 3×3 decomposition of the domain into subdomains P1 to P9]

• 8 neighbors in 2D
• 26 neighbors in 3D
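The neighbor counts follow from counting every offset in {-1, 0, 1}^d except the all-zero one, which gives 3^d - 1 once faces, edges and corners are all included. A minimal C++ sketch of that count; the function names are illustrative and not part of GHEX:

```cpp
#include <array>
#include <vector>

// Number of neighbors of a subdomain in a d-dimensional Cartesian
// decomposition, counting faces, edges and corners: 3^d - 1.
int neighbor_count(int dim) {
    int total = 1;
    for (int d = 0; d < dim; ++d) total *= 3;
    return total - 1;
}

// Enumerate the 3D neighbor offsets: every (dx, dy, dz) in {-1, 0, 1}^3
// except (0, 0, 0), which is the subdomain itself.
std::vector<std::array<int, 3>> neighbor_offsets_3d() {
    std::vector<std::array<int, 3>> offsets;
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                if (dx != 0 || dy != 0 || dz != 0)
                    offsets.push_back(std::array<int, 3>{dx, dy, dz});
    return offsets;
}
```

Each offset corresponds to one halo region that has to be exchanged with one neighboring subdomain.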

GHEX: goals

• Wide range of grids
  – Structured, staggered, on a sphere, unstructured
• Performance portability
  – CPU / GPU, InfiniBand / Cray
• Highly scalable
  – No global synchronization
  – Hybrid parallelism (threads + ranks), task-based parallelism
• Modern design
  – C++ templates, overhead-free abstraction layers
• Wide applicability
  – Abstractions to allow user-defined domains/grids
  – FORTRAN and Python bindings
  – Expose both high-level (halo exchange) and low-level (transport layer) API

Pseudocode: Common implementation

allocate(halo1, halo2, halo3, … , halo26)

do while (.not. end_of_simulation)

    ! compute
    call update_timestep()

    ! copy halos for all 26 neighbors
    call copy_halo(halo1, field, nb1)
    …

    ! send/recv to/from all 26 neighbors
    call mpi_sendrecv(halo1, …, nb1, …)
    …

    ! Repeat for all physical fields
    ! wait for comm to finish
end do
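The halo-copy step in this pseudocode gathers a strided boundary region of the field into a contiguous send buffer. The sketch below shows one such copy in C++ for the low-x face only. `Field3D` and `pack_low_x` are hypothetical names chosen for illustration, and the x-fastest memory layout is an assumption, not something the slides specify.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 3D field of size nx * ny * nz with x as the fastest index.
struct Field3D {
    std::size_t nx, ny, nz;
    std::vector<double> data;

    Field3D(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : nx(nx_), ny(ny_), nz(nz_), data(nx_ * ny_ * nz_, 0.0) {}

    double& at(std::size_t i, std::size_t j, std::size_t k) {
        return data[(k * ny + j) * nx + i];
    }
    double at(std::size_t i, std::size_t j, std::size_t k) const {
        return data[(k * ny + j) * nx + i];
    }
};

// Gather the `width` x-planes next to the low-x boundary into a contiguous
// buffer, ordered with k outermost, then j, then i.
std::vector<double> pack_low_x(const Field3D& f, std::size_t width) {
    std::vector<double> buf;
    buf.reserve(width * f.ny * f.nz);
    for (std::size_t k = 0; k < f.nz; ++k)
        for (std::size_t j = 0; j < f.ny; ++j)
            for (std::size_t i = 0; i < width; ++i)
                buf.push_back(f.at(i, j, k));
    return buf;
}
```

A hand-written code needs one such routine (and a matching unpack) per neighbor direction and per field, which is the bookkeeping the common implementation repeats 26 times in 3D.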

Pseudocode: GHEX

! define halo, grid, and local domain
halo = [1 2 1 2 1 2]
eh = exchange_descr(local_domain, halo, field1, field2, …)

do while (.not. end_of_simulation)

    ! queue asynchronous communication
    ch = ghex_exchange(eh)

    ! compute
    call update_timestep()

    ! wait for comm to finish
    ch.wait()

end do
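To make the halo descriptor concrete, the sketch below computes the allocated extents and the number of halo cells for a local domain. It assumes the six halo widths are ordered as low/high pairs per dimension (x_lo, x_hi, y_lo, y_hi, z_lo, z_hi); the slides do not state that ordering. The names `Halo`, `padded_extents` and `halo_cells` are illustrative only.

```cpp
#include <array>
#include <cstddef>

// Halo widths, assumed ordered as {x_lo, x_hi, y_lo, y_hi, z_lo, z_hi}.
using Halo = std::array<std::size_t, 6>;
using Extents = std::array<std::size_t, 3>;

// Extents of the allocated array: interior plus the halo on both sides.
Extents padded_extents(const Extents& interior, const Halo& halo) {
    Extents out = {interior[0] + halo[0] + halo[1],
                   interior[1] + halo[2] + halo[3],
                   interior[2] + halo[4] + halo[5]};
    return out;
}

// Number of halo cells: allocated cells minus interior cells.
std::size_t halo_cells(const Extents& interior, const Halo& halo) {
    const Extents p = padded_extents(interior, halo);
    return p[0] * p[1] * p[2] - interior[0] * interior[1] * interior[2];
}
```

For a 64^3 interior with halo = [1 2 1 2 1 2], this gives an allocated array of 67^3 cells, of which 38619 are halo cells.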

Performance optimizations: domain specific

• Data / neighbor locality
  – Renumber ranks (and neighbors) to maximize compute node locality and NUMA / memory domain locality
• Zero-copy node-local halo exchange
  – Halo data copied directly from source data cube to destination data cube on another rank
  – XPMEM, shared memory (mmap), threads

Map local domains to ranks

[Figure label: grid part processed by MPI rank 1]

Map ranks to compute nodes

[Figure labels: ranks 1, 2, 3, ..., 16; ranks running on one compute node (shared memory communication); off-node network communication]

Node-aware mapping of ranks to nodes

[Figure labels: ranks running on one compute node (shared memory communication); network communication]
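The effect of node-aware placement can be illustrated by counting how many nearest-neighbor links cross a node boundary under two placements of a 2D rank grid. This is a simplified sketch, not the GHEX renumbering algorithm: it assumes a non-periodic grid, counts only the four face neighbors, and all function names are hypothetical.

```cpp
#include <functional>

// Count nearest-neighbor links (left/right and up/down, non-periodic) in a
// px * py grid of ranks whose two endpoints sit on different compute nodes.
int off_node_links(int px, int py, const std::function<int(int, int)>& node_of) {
    int count = 0;
    for (int y = 0; y < py; ++y) {
        for (int x = 0; x < px; ++x) {
            if (x + 1 < px && node_of(x, y) != node_of(x + 1, y)) ++count;
            if (y + 1 < py && node_of(x, y) != node_of(x, y + 1)) ++count;
        }
    }
    return count;
}

// Default placement: ranks numbered row by row, consecutive ranks fill a node.
int linear_node(int x, int y, int px, int ranks_per_node) {
    return (y * px + x) / ranks_per_node;
}

// Node-aware placement: each node owns a compact bx * by block of ranks.
int block_node(int x, int y, int px, int bx, int by) {
    return (y / by) * (px / bx) + (x / bx);
}
```

For an 8×8 rank grid with 16 ranks per node, the row-by-row placement leaves 24 links off-node, while 4×4 blocks leave 16; the remaining links can use shared memory.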

Performance optimizations: general

• Data / neighbor locality
  – Renumber ranks (and neighbors) to maximize compute node locality and NUMA / memory domain locality
• Zero-copy node-local halo exchange
  – Halo data copied directly from source data cube to destination data cube on another rank
  – XPMEM, shared memory
• Pack multiple messages into a single buffer
  – Multiple field data, coalesce messages sent to the same peer
• Low overhead communication backends
  – UCX (IB), libfabric (Cray), MPI if nothing else is available
• Multithreading / hybrid parallelism
  – OpenMP, pthreads, std::thread
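Packing several fields into one buffer for the same peer can be sketched as a concatenation with recorded offsets. This is an illustration under simplifying assumptions (all fields are `double`, halo data is already gathered per field), and `PackedMessage`, `coalesce` and `unpack_field` are hypothetical names, not GHEX types.

```cpp
#include <cstddef>
#include <vector>

// One contiguous buffer for a peer, plus where each field's halo data starts.
struct PackedMessage {
    std::vector<double> buffer;
    std::vector<std::size_t> offsets;  // offsets[i] = start of field i; last entry = total size
};

// Sender side: concatenate the per-field halo data into a single message.
PackedMessage coalesce(const std::vector<std::vector<double>>& halos) {
    PackedMessage msg;
    std::size_t total = 0;
    for (const auto& h : halos) total += h.size();
    msg.buffer.reserve(total);
    msg.offsets.reserve(halos.size() + 1);
    for (const auto& h : halos) {
        msg.offsets.push_back(msg.buffer.size());
        msg.buffer.insert(msg.buffer.end(), h.begin(), h.end());
    }
    msg.offsets.push_back(msg.buffer.size());
    return msg;
}

// Receiver side: extract the halo data of field i from the packed message.
std::vector<double> unpack_field(const PackedMessage& msg, std::size_t i) {
    const auto first = msg.buffer.begin() + static_cast<std::ptrdiff_t>(msg.offsets[i]);
    const auto last = msg.buffer.begin() + static_cast<std::ptrdiff_t>(msg.offsets[i + 1]);
    return std::vector<double>(first, last);
}
```

With this layout, one send per peer replaces one send per field per peer, at the cost of agreeing on the offsets on both sides.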

Using GHEX

GHEX: low-level API

• Unifies multiple communication backends
  – MPI, UCX, libfabric
• Asynchronous scheduling of communication
  – Overlap of communication and computation
• Request-based, MPI-like API
  – Threads post send/recv requests and test for their completion

req = comm.recv(msg, peer_rank, tag);

// … compute.
// Communication can progress in the background,
// depending on the hardware.

req.wait(); // or req.test();
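The difference between `wait()` and `test()` is that the first blocks while the second lets the caller interleave computation with polling. The sketch below shows that pattern with a mock request. `MockRequest` and `overlap_compute` are hypothetical stand-ins written for this illustration; they do not reproduce the GHEX request type, and the mock simply completes after a fixed number of polls.

```cpp
// Hypothetical stand-in for a communication request (not the GHEX type):
// it reports completion after `polls_needed` calls to test().
class MockRequest {
public:
    explicit MockRequest(int polls_needed) : remaining_(polls_needed) {}

    // Non-blocking: advance the mock transfer by one step, report completion.
    bool test() {
        if (remaining_ > 0) --remaining_;
        return remaining_ == 0;
    }

    // Blocking: poll until the request is complete.
    void wait() {
        while (!test()) {}
    }

private:
    int remaining_;
};

// Do chunks of work while the request is pending, polling between chunks.
// Returns the number of chunks completed before the request finished
// (or max_chunks, after which the function blocks in wait()).
int overlap_compute(MockRequest& req, int max_chunks) {
    int done = 0;
    while (done < max_chunks && !req.test()) {
        ++done;  // placeholder for one chunk of real computation
    }
    req.wait();
    return done;
}
```

In a real application the chunk of work would be part of the time step that does not depend on the halo data being received.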

GHEX: low-level API

• Callback-based asynchronous API
  – Threads post send/recv requests
  – GHEX notifies them about completion in a callback
• Suitable for task-based parallelism
  – No need to manage request and message queues
  – Highly asynchronous: any thread can progress and complete any other thread’s request

auto recv_callback = [&](mesg, peer_rank, tag) {
    // unpack the message, mark task as ready to compute
};

// schedule comm
comm.recv(msg, peer_rank, tag, recv_callback);

// do other work
// Occasionally progress the communication.
comm.progress();
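The callback flow can be tried out without any network using a single-process mock. The sketch below is an assumption-laden illustration, not the GHEX communicator: `MockComm`, its `deliver` method and the simplified `recv` signature (no receive buffer argument) are invented here, and callbacks must not call back into the mock.

```cpp
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using Message = std::vector<char>;
using Callback = std::function<void(const Message&, int, int)>;  // (msg, peer_rank, tag)

// Hypothetical single-process stand-in for a callback-based communicator.
class MockComm {
public:
    // Post a receive: remember the callback to run when a match arrives.
    void recv(int peer_rank, int tag, Callback cb) {
        pending_.push_back(Posted{peer_rank, tag, std::move(cb)});
    }

    // Simulate the network delivering a message from peer_rank with tag.
    void deliver(Message msg, int peer_rank, int tag) {
        arrived_.push_back(Arrived{std::move(msg), peer_rank, tag});
    }

    // Match arrived messages against posted receives and invoke callbacks.
    // Returns the number of receives completed by this call.
    int progress() {
        int completed = 0;
        for (auto a = arrived_.begin(); a != arrived_.end();) {
            bool matched = false;
            for (auto p = pending_.begin(); p != pending_.end(); ++p) {
                if (p->peer_rank == a->peer_rank && p->tag == a->tag) {
                    Callback cb = std::move(p->callback);
                    pending_.erase(p);
                    cb(a->msg, a->peer_rank, a->tag);
                    matched = true;
                    ++completed;
                    break;
                }
            }
            if (matched) {
                a = arrived_.erase(a);
            } else {
                ++a;
            }
        }
        return completed;
    }

private:
    struct Posted { int peer_rank; int tag; Callback callback; };
    struct Arrived { Message msg; int peer_rank; int tag; };
    std::deque<Posted> pending_;
    std::deque<Arrived> arrived_;
};
```

The caller never stores or tests a request object: it posts the receive, keeps working, and calls `progress()` from time to time, which is what makes this style convenient for task-based runtimes.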

GHEX: high-level API

// 2 domains per MPI rank
auto local_domains = std::list{my_domain_t(rank*2, ...),
                               my_domain_t(rank*2+1, ...)};

// one domain lives on GPU
auto pressure_field_dom_a = data_desc<GPU, double>(...);
auto density_field_dom_a  = data_desc<GPU, float>(...);

// the other domain lives on CPU
auto pressure_field_dom_b = data_desc<CPU, double>(...);
auto density_field_dom_b  = data_desc<CPU, float>(...);

// halo generation function objects
auto halo_gen1 = my_halo_gen_t(1,1,1,1,0,0, ...);
auto halo_gen2 = my_halo_gen_t(1,2,1,2,1,1, ...);

GHEX: high-level API (continued)

// determine and hold connectivity information based on halos
auto pattern1 = ghex::make_pattern<ghex::structured_grid>(context, halo_gen1, local_domains);
auto pattern2 = ghex::make_pattern<ghex::structured_grid>(context, halo_gen2, local_domains);

// generic communication object
auto co = ghex::make_communication_object<pattern_type>(mpi_comm);

// simulation code
for (auto time_step = 0; time_step < N; ++time_step)
{
    // pair each field with the pattern that matches its halo
    auto h = co.exchange(
        pattern1(pressure_field_dom_a),
        pattern1(pressure_field_dom_b),
        pattern2(density_field_dom_a),
        pattern2(density_field_dom_b));
    // do stuff...
    h.wait();
}

GHEX: FORTRAN

! define the local domain, 1 per rank
domain_desc(1)%id = rank
domain_desc(1)%device_id = DeviceCPU
! ...

! initialize field data structures
call ghex_field_init(temp_fd, temp_cube, halo, periodic=[1,1,0])
call ghex_domain_add_field(domain_desc(1), temp_fd)
call ghex_field_init(pressure_fd, pressure_cube, halo, periodic=[1,1,0])
call ghex_domain_add_field(domain_desc(1), pressure_fd)

! collect halo info from all fields in an exchange descriptor
ed = ghex_exchange_desc_new(domain_desc)

! create GHEX communicator
co = ghex_struct_co_new()

! simulation loop
do while (i < niters)
    handle = ghex_exchange(co, ed)
    ! ... local computations ...
    call ghex_wait(handle)
end do

Shared memory halo exchange benchmarks

Grid    Halo   Pack-copy-unpack   Zero-copy (write directly to dest)
--------------------------------------------------------------------
64^3    1       4.5 ms             2.8 ms
64^3    5      14.0 ms             6.6 ms
128^3   1      14.3 ms             9.5 ms
128^3   5      55.0 ms            25.0 ms

• Speedup due to lower memory footprint
  – MPI: gather+write, memcpy, read+scatter
  – Zero-copy: gather+scatter
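The speedups implied by the timings above are plain ratios of the pack-copy-unpack time to the zero-copy time. As a worked check, rounded to one decimal:

```latex
\begin{align*}
64^3,\ \text{halo}=1 &: \quad 4.5 / 2.8 \approx 1.6 \\
64^3,\ \text{halo}=5 &: \quad 14.0 / 6.6 \approx 2.1 \\
128^3,\ \text{halo}=1 &: \quad 14.3 / 9.5 \approx 1.5 \\
128^3,\ \text{halo}=5 &: \quad 55.0 / 25.0 = 2.2
\end{align*}
```

In these four measurements the gain is larger for the wider halo (about 2.1 to 2.2) than for halo width 1 (about 1.5 to 1.6).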

Transport layer benchmarks

• Multi-threaded tagged send/recv
  – Callback-based and request-based implementations
• Each thread has nmsg messages in progress
  – Suitable for task-based parallelism with fully asynchronous communication

while (!end)
    // for each message in-flight
    for (mid = 0; mid < nmsg; mid++)
        // post recv
        if (completed(rreq[mid]))
            rreq[mid] = irecv(rbuf[mid], peer_rank, tag)
        // post send
        if (completed(sreq[mid]))
            sreq[mid] = isend(sbuf[mid], peer_rank, tag)

Transport layer benchmarks

• Small message size: latency dominated
• Large message size: fabric bandwidth limit
• Coalescing of messages to the same destination: 10x10 kB vs 1x100 kB
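Why coalescing helps can be seen with the standard latency-bandwidth cost model. The model is an assumption introduced here and is not stated on the slides: sending n bytes costs T(n), where α is the per-message latency and β is the bandwidth.

```latex
\begin{align*}
T(n) &= \alpha + \frac{n}{\beta} \\
10 \cdot T(10\,\text{kB}) &= 10\,\alpha + \frac{100\,\text{kB}}{\beta} \\
T(100\,\text{kB}) &= \alpha + \frac{100\,\text{kB}}{\beta} \\
10 \cdot T(10\,\text{kB}) - T(100\,\text{kB}) &= 9\,\alpha
\end{align*}
```

Under this model the bandwidth term is identical in both cases, so coalescing saves nine message latencies. The saving matters most in the latency-dominated regime of small messages.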

Transport layer benchmarks

• Many concurrent messages: multiple spatial neighbors
• In practice: 10 to 100 messages in flight
• GHEX is optimized to handle many messages in flight


Transport layer benchmarks

• Many ranks / threads share the same fabric interface on a compute node
• Efficient multi-threaded communication is important for scalability

Summary

• Multiple modern transport backends (UCX, libfabric, MPI)
• Request- and callback-based backends
• CPU / GPU support
• Multi-threaded scalability
• Problem-specific optimizations
• Support for numerous grid types (structured and unstructured)
• C++, bindings to FORTRAN, Python