04/02/2020
Marcin Krotkiewski (UiO / SIGMA2)
Mauro Bianco, Fabian Bösch, Marco Bettiol (CSCS, ETH)
GHEX: Generic Halo Exchange for Exascale
Funding
• PRACE, EU’s Horizon 2020 Research and Innovation Programme
• CSCS / ETH Zurich
• SIGMA2
Scientific collaborators
• RoCS (Rosseland Centre for Solar Physics)
– BIFROST, stellar atmosphere simulation code (Mats Carlsson, Mikolaj Szydlarski, FORTRAN)
– DISPATCH, task-based numerical simulation framework (Åke Nordlund, FORTRAN)
• ECMWF (European Centre for Medium-Range Weather Forecasts)
– Atlas, numerical weather prediction and climate modeling
• MeteoSwiss
– COSMO, large-scale climate and atmospheric simulations
Introduction
• Parallel PDE solvers and halo exchange
• GHEX: goals
• Performance optimizations
• Examples: C++, FORTRAN
• Benchmarks
Structured grids
Unstructured grids
Parallel stencil computations and halo exchange
[Figure: stencil update formula over indices i, j with coefficients a and b (formula not recoverable from the transcript); grid point (2,3) with neighbors (1,3), (3,3), (2,2), (2,4); domain decomposed into a 3×3 arrangement of subdomains (labels as extracted: P1 P2 P3 / P4 P1 P6 / P7 P8 P9)]
8 neighbors in 2D
26 neighbors in 3D
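The neighbor counts above follow from counting every offset in {-1, 0, 1}^D except the zero offset, which gives 3^D - 1 neighbors for a Cartesian decomposition. A minimal sketch that enumerates these offsets; the name `neighbor_offsets` is an assumption made for illustration and is not part of GHEX:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Enumerate the offsets of all neighboring subdomains in a D-dimensional
// Cartesian decomposition: every offset in {-1,0,1}^D except the origin.
template <std::size_t D>
std::vector<std::array<int, D>> neighbor_offsets() {
    std::size_t total = 1;
    for (std::size_t d = 0; d < D; ++d) total *= 3;   // 3^D candidate offsets

    std::vector<std::array<int, D>> offsets;
    for (std::size_t code = 0; code < total; ++code) {
        std::array<int, D> off{};
        std::size_t rest = code;
        bool is_origin = true;
        for (std::size_t d = 0; d < D; ++d) {
            off[d] = static_cast<int>(rest % 3) - 1;  // base-3 digit 0,1,2 -> -1,0,+1
            rest /= 3;
            if (off[d] != 0) is_origin = false;
        }
        if (!is_origin) offsets.push_back(off);       // skip the subdomain itself
    }
    assert(offsets.size() == total - 1);
    return offsets;
}
```

Calling `neighbor_offsets<2>()` yields 8 offsets and `neighbor_offsets<3>()` yields 26, matching the counts on the slide.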
GHEX: goals
• Wide range of grids
– Structured, staggered, on a sphere, unstructured
• Performance portability
– CPU / GPU, InfiniBand / Cray
• Highly scalable
– No global synchronization
– Hybrid parallelism (threads + ranks), task-based parallelism
• Modern design
– C++ templates, overhead-free abstraction layers
• Wide applicability
– Abstractions to allow user-defined domains/grids
– FORTRAN and Python bindings
– Expose both high-level (halo exchange) and low-level (transport layer) API
Pseudocode: Common implementation
allocate(halo1, halo2, halo3, …, halo26)
do while (.not. end_of_simulation)
  ! compute
  call update_timestep()
  ! copy halos for all 26 neighbors
  call copy_halo(halo1, field, nb1)
  …
  ! send/recv to/from all 26 neighbors
  call mpi_sendrecv(halo1, …, nb1, …)
  …
  ! Repeat for all physical fields
  ! wait for comm to finish
end do
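The "copy halos" step in the pseudocode above gathers strided boundary data into a contiguous buffer, and the receiver scatters it into its halo layers. A minimal C++ sketch for one face; `Field3D`, `pack_high_x` and `unpack_low_x` are hypothetical names, the x-fastest layout is an assumption, and the MPI transfer between the two calls is left out:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical field layout: nx*ny*nz cells, including `halo` layers on each
// side in x, stored contiguously with x as the fastest index.
struct Field3D {
    std::size_t nx, ny, nz, halo;
    std::vector<double> data;

    Field3D(std::size_t nx_, std::size_t ny_, std::size_t nz_, std::size_t halo_)
        : nx(nx_), ny(ny_), nz(nz_), halo(halo_), data(nx_ * ny_ * nz_, 0.0) {
        assert(nx > 2 * halo);
    }
    double& at(std::size_t i, std::size_t j, std::size_t k) {
        return data[i + nx * (j + ny * k)];
    }
};

// Gather the interior layers next to the high-x boundary into a send buffer.
std::vector<double> pack_high_x(Field3D& f) {
    std::vector<double> buf;
    buf.reserve(f.halo * f.ny * f.nz);
    for (std::size_t k = 0; k < f.nz; ++k)
        for (std::size_t j = 0; j < f.ny; ++j)
            for (std::size_t i = f.nx - 2 * f.halo; i < f.nx - f.halo; ++i)
                buf.push_back(f.at(i, j, k));
    return buf;
}

// Scatter a received buffer into the low-x halo layers of the neighbor.
void unpack_low_x(Field3D& f, const std::vector<double>& buf) {
    assert(buf.size() == f.halo * f.ny * f.nz);
    std::size_t n = 0;
    for (std::size_t k = 0; k < f.nz; ++k)
        for (std::size_t j = 0; j < f.ny; ++j)
            for (std::size_t i = 0; i < f.halo; ++i)
                f.at(i, j, k) = buf[n++];
}
```

In the common implementation this pair of copies is repeated for each of the 26 neighbors and for every physical field, which is the bookkeeping a halo-exchange library is meant to take over.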
Pseudocode: GHEX
! define halo, grid, and local domain
halo = [1 2 1 2 1 2]
eh = exchange_descr(local_domain, halo, field1, field2, …)

do while (.not. end_of_simulation)
  ! queue asynchronous communication
  ch = ghex_exchange(eh)
  ! compute
  call update_timestep()
  ! wait for comm to finish
  ch.wait()
end do
Performance optimizations: domain specific
• Data / neighbor locality
– Renumber ranks (and neighbors) to maximize compute node locality and NUMA / memory domain locality
• Zero-copy node-local halo exchange
– Halo data copied directly from source data cube to destination data cube on another rank
– XPMEM, shared memory (mmap), threads
Map local domains to ranks
Grid part processed by MPI rank 1
Map ranks to compute nodes
1 2 3 ... 16
Ranks running on one compute node (shared memory communication)
Off-node network communication
Node-aware mapping of ranks to nodes
Ranks running on one compute node (shared memory communication)
Network communication
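To make the effect of node-aware mapping concrete, the sketch below counts how many neighbor pairs end up on different compute nodes under two mappings of a 2D decomposition. The function names, the 4×4 tile shape, and the non-periodic grid are assumptions for illustration; this is one possible renumbering, not necessarily the one GHEX applies.

```cpp
#include <cassert>

// Row-major mapping: consecutive ranks fill a node, so with px == ranks_per_node
// every node holds one strip (row) of subdomains.
int node_row_major(int x, int y, int px, int ranks_per_node) {
    return (x + px * y) / ranks_per_node;
}

// Node-aware mapping: each node holds a compact tile-by-tile block of subdomains.
int node_tiled(int x, int y, int px, int tile) {
    assert(px % tile == 0);
    return (x / tile) + (px / tile) * (y / tile);
}

// Count neighbor pairs (8-neighborhood, each pair counted once) whose two
// subdomains live on different nodes, for a non-periodic px-by-py grid.
template <class NodeOf>
int off_node_pairs(int px, int py, NodeOf node_of) {
    const int dx[4] = {1, 0, 1, -1};   // "forward" half of the 8 neighbors
    const int dy[4] = {0, 1, 1, 1};
    int count = 0;
    for (int y = 0; y < py; ++y)
        for (int x = 0; x < px; ++x)
            for (int n = 0; n < 4; ++n) {
                const int xn = x + dx[n], yn = y + dy[n];
                if (xn < 0 || xn >= px || yn >= py) continue;  // outside the grid
                if (node_of(x, y) != node_of(xn, yn)) ++count;
            }
    return count;
}
```

For a 16×16 grid of subdomains with 16 ranks per node, the row-major mapping leaves 690 of the 930 neighbor pairs off-node, while 4×4 tiles leave 258, so far more halo traffic can use shared memory.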
Performance optimizations: general
• Data / neighbor locality
– Renumber ranks (and neighbors) to maximize compute node locality and NUMA / memory domain locality
• Zero-copy node-local halo exchange
– Halo data copied directly from source data cube to destination data cube on another rank
– XPMEM, shared memory
• Pack multiple messages into a single buffer
– Multiple field data, coalesce messages sent to the same peer
• Low overhead communication backends
– UCX (IB), libfabric (Cray), MPI if nothing else is available
• Multithreading / hybrid parallelism
– OpenMP, pthreads, std::thread
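Packing multiple messages into a single buffer can be pictured as grouping the packed halos of all fields by destination rank and concatenating them. A minimal sketch under that reading; `HaloChunk`, `PeerBuffer` and `coalesce` are hypothetical names and do not describe GHEX internals:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical description of one packed halo: destination rank plus payload.
struct HaloChunk {
    int peer;
    std::vector<double> payload;
};

// One outgoing buffer per peer. The offsets record where each chunk starts so
// that a receiver using the same layout can unpack field by field.
struct PeerBuffer {
    std::vector<double> data;
    std::vector<std::size_t> offsets;
};

// Concatenate all chunks that go to the same peer, preserving their order.
std::map<int, PeerBuffer> coalesce(const std::vector<HaloChunk>& chunks) {
    std::map<int, PeerBuffer> out;
    for (const HaloChunk& c : chunks) {
        assert(c.peer >= 0);                       // ranks are non-negative
        PeerBuffer& b = out[c.peer];
        b.offsets.push_back(b.data.size());
        b.data.insert(b.data.end(), c.payload.begin(), c.payload.end());
    }
    return out;
}
```

With this grouping, the number of messages sent per exchange equals the number of distinct peers, independent of how many fields are exchanged.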
Using GHEX
GHEX: low-level API
• Unifies multiple communication backends
– MPI, UCX, libfabric
• Asynchronous scheduling of communication
– Overlapping of communication and computation
• Request-based, MPI-like API
– Threads post send/recv requests and test for their completion

req = comm.recv(msg, peer_rank, tag);
// … compute.
// Communication can progress in the background,
// depending on the hardware.
req.wait(); // or req.test();
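The request-based pattern can be tried out without any network using a single-process stand-in. In the sketch below, `MockComm` and its `inject` method are hypothetical and are not the GHEX API; `inject` plays the role of the network delivering a message, and only the `recv` / `test` / `wait` shape mirrors the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Single-process stand-in for a communicator (NOT the GHEX API).
class MockComm {
public:
    class Request {
    public:
        // Non-blocking check: progress the communicator, then report status.
        bool test() { comm_->progress(); return *done_; }
        // Spins until completion. In this mock the message must already have
        // been injected, otherwise wait() would never return.
        void wait() { while (!test()) {} }
    private:
        friend class MockComm;
        Request(MockComm* comm, std::shared_ptr<bool> done)
            : comm_(comm), done_(std::move(done)) {}
        MockComm* comm_;
        std::shared_ptr<bool> done_;
    };

    // Post a receive; `msg` must stay alive until the request completes.
    Request recv(std::vector<int>& msg, int peer_rank, int tag) {
        auto done = std::make_shared<bool>(false);
        pending_.push_back({&msg, peer_rank, tag, done});
        return Request(this, done);
    }

    // Simulate the arrival of a message from `peer_rank`.
    void inject(int peer_rank, int tag, std::vector<int> payload) {
        arrived_.push_back({peer_rank, tag, std::move(payload)});
    }

    // Match arrived messages against posted receives by (peer, tag).
    void progress() {
        std::size_t a = 0;
        while (a < arrived_.size()) {
            std::size_t p = 0;
            while (p < pending_.size() &&
                   !(pending_[p].peer == arrived_[a].peer && pending_[p].tag == arrived_[a].tag))
                ++p;
            if (p == pending_.size()) { ++a; continue; }   // no matching receive yet
            *pending_[p].msg = std::move(arrived_[a].payload);
            *pending_[p].done = true;
            pending_.erase(pending_.begin() + static_cast<std::ptrdiff_t>(p));
            arrived_.erase(arrived_.begin() + static_cast<std::ptrdiff_t>(a));
        }
    }

private:
    struct Pending { std::vector<int>* msg; int peer; int tag; std::shared_ptr<bool> done; };
    struct Arrived { int peer; int tag; std::vector<int> payload; };
    std::vector<Pending> pending_;
    std::vector<Arrived> arrived_;
};
```

A caller posts `recv`, does other work, and polls `test()` or blocks in `wait()`. In a real backend, completion depends on the transport and the hardware rather than on an explicit `inject`.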
GHEX: low-level API
• Callback-based asynchronous API
– Threads post send/recv requests
– GHEX notifies them about completion in a callback
• Suitable for task-based parallelism
– No need to manage request and message queues
– Highly asynchronous: any thread can progress and complete any other thread’s request

auto recv_callback = [&](auto msg, auto peer_rank, auto tag) {
    // unpack the message, mark task as ready to compute
};

// schedule comm
comm.recv(msg, peer_rank, tag, recv_callback);

// do other work
// Occasionally progress the communication.
comm.progress();
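The callback-based flow above can be sketched with a single-process stand-in in which `progress()` is the only place where completions are delivered. `MockCallbackComm` and `inject` are hypothetical names, not the GHEX API; `inject` simulates a message arriving from the network.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Single-process stand-in for a callback-based communicator (NOT the GHEX API).
class MockCallbackComm {
public:
    using Callback = std::function<void(std::vector<int>& msg, int peer_rank, int tag)>;

    // Post a receive; `cb` runs inside progress() once the message has arrived.
    void recv(std::vector<int>& msg, int peer_rank, int tag, Callback cb) {
        assert(cb && "a completion callback is required");
        pending_.push_back({&msg, peer_rank, tag, std::move(cb)});
    }

    // Simulate the arrival of a message from `peer_rank`.
    void inject(int peer_rank, int tag, std::vector<int> payload) {
        arrived_.push_back({peer_rank, tag, std::move(payload)});
    }

    // Complete every receive whose message has arrived and invoke its
    // callback; returns the number of callbacks invoked.
    int progress() {
        int completed = 0;
        std::size_t a = 0;
        while (a < arrived_.size()) {
            std::size_t p = 0;
            while (p < pending_.size() &&
                   !(pending_[p].peer == arrived_[a].peer && pending_[p].tag == arrived_[a].tag))
                ++p;
            if (p == pending_.size()) { ++a; continue; }   // no receive posted yet
            Pending r = std::move(pending_[p]);
            pending_.erase(pending_.begin() + static_cast<std::ptrdiff_t>(p));
            *r.msg = std::move(arrived_[a].payload);
            arrived_.erase(arrived_.begin() + static_cast<std::ptrdiff_t>(a));
            r.cb(*r.msg, r.peer, r.tag);   // the callback may post new receives
            ++completed;
        }
        return completed;
    }

private:
    struct Pending { std::vector<int>* msg; int peer; int tag; Callback cb; };
    struct Arrived { int peer; int tag; std::vector<int> payload; };
    std::vector<Pending> pending_;
    std::vector<Arrived> arrived_;
};
```

The caller keeps no request objects: a task posts its receives, returns to the scheduler, and is marked ready from inside the callback whenever some call to `progress()` completes the message.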
GHEX: high-level API
// 2 domains per MPI rank
auto local_domains = std::list{my_domain_t(rank*2, ...),
                               my_domain_t(rank*2+1, ...)};

// one domain lives on GPU
auto pressure_field_dom_a = data_desc<GPU, double>(...);
auto density_field_dom_a = data_desc<GPU, float>(...);

// the other domain lives on CPU
auto pressure_field_dom_b = data_desc<CPU, double>(...);
auto density_field_dom_b = data_desc<CPU, float>(...);

// halo generation function objects
auto halo_gen1 = my_halo_gen_t(1,1,1,1,0,0, ...);
auto halo_gen2 = my_halo_gen_t(1,2,1,2,1,1, ...);
GHEX: high-level API
// determine and hold connectivity information based on halos
auto pattern1 = ghex::make_pattern<ghex::structured_grid>(context, halo_gen1, local_domains);
auto pattern2 = ghex::make_pattern<ghex::structured_grid>(context, halo_gen2, local_domains);

// generic communication object
auto co = ghex::make_communication_object<pattern_type>(mpi_comm);

// simulation code
for (auto time_step = 0; time_step < N; ++time_step) {
    auto h = co.exchange(
        pattern1(pressure_field_dom_a),
        pattern1(pressure_field_dom_b),
        pattern2(density_field_dom_a),
        pattern2(density_field_dom_b));
    // do stuff...
    h.wait();
}
GHEX: FORTRAN
! define the local domain – 1 per rank
domain_desc(1)%id = rank
domain_desc(1)%device_id = DeviceCPU
! ...

! initialize field data structures
call ghex_field_init(temp_fd, temp_cube, halo, periodic=[1,1,0])
call ghex_domain_add_field(domain_desc(1), temp_fd)
call ghex_field_init(pressure_fd, pressure_cube, halo, periodic=[1,1,0])
call ghex_domain_add_field(domain_desc(1), pressure_fd)

! collect halo info from all fields in an exchange descriptor
ed = ghex_exchange_desc_new(domain_desc)

! create GHEX communicator
co = ghex_struct_co_new()

! simulation loop
do while (i < niters)
  handle = ghex_exchange(co, ed)
  ! ... local computations ...
  call ghex_wait(handle)
end do
Shared memory halo exchange benchmarks
              Pack-copy-unpack    Zero-copy (write directly to dest)
---------------------------------------------
64^3
  Halo=1      4.5 ms              2.8 ms
  Halo=5      14.0 ms             6.6 ms
128^3
  Halo=1      14.3 ms             9.5 ms
  Halo=5      55.0 ms             25.0 ms
• Speedup due to lower memory footprint
– MPI: gather+write, memcpy, read+scatter
– Zero-copy: gather+scatter
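The speedups implied by the table can be read off by dividing the two timings in each row. A small sketch that does this arithmetic; the struct and function names are illustrative, and the numbers are copied from the table above:

```cpp
#include <cassert>
#include <vector>

struct BenchRow {
    int grid;             // local domain is grid^3
    int halo;             // halo width
    double pack_ms;       // pack-copy-unpack time in ms
    double zero_copy_ms;  // zero-copy time in ms
};

// Timings copied from the shared memory benchmark table.
inline std::vector<BenchRow> benchmark_rows() {
    return {{64, 1, 4.5, 2.8}, {64, 5, 14.0, 6.6},
            {128, 1, 14.3, 9.5}, {128, 5, 55.0, 25.0}};
}

// Speedup of zero-copy over pack-copy-unpack for one table row.
inline double speedup(const BenchRow& r) {
    assert(r.zero_copy_ms > 0.0);
    return r.pack_ms / r.zero_copy_ms;
}
```

The ratios come out at roughly 1.6 and 2.1 for 64^3 (halo 1 and 5) and roughly 1.5 and 2.2 for 128^3, so the gain in this table is larger for the wider halo.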
Transport layer benchmarks
• Multi-threaded tagged send/recv
– Callback-based and request-based implementations
• Each thread has nmsg messages in progress
– Suitable for task-based parallelism with fully asynchronous communication

while (!end) {
    // for each message in flight
    for (mid = 0; mid < nmsg; mid++) {
        // post recv
        if (completed(rreq[mid]))
            rreq[mid] = irecv(rbuf[mid], peer_rank, tag);
        // post send
        if (completed(sreq[mid]))
            sreq[mid] = isend(sbuf[mid], peer_rank, tag);
    }
}
Transport layer benchmarks
• Small message size: latency dominated
• Large message size: fabric bandwidth limit
• Coalescing of messages to the same destination: 10x10 kB vs 1x100 kB
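Why coalescing helps in the latency-dominated regime can be illustrated with the usual latency-bandwidth cost model, in which every message pays a fixed latency plus its size divided by the bandwidth. The model and the function name are assumptions for illustration; it ignores any overlap between concurrent messages, and real numbers depend on the fabric.

```cpp
#include <cassert>

// Idealized cost model: n messages sent one after another, each paying a
// fixed latency plus size / bandwidth. Times are in microseconds.
inline double transfer_time_us(int n_messages, double bytes_each,
                               double latency_us, double bandwidth_bytes_per_us) {
    assert(n_messages >= 0 && bandwidth_bytes_per_us > 0.0);
    return n_messages * (latency_us + bytes_each / bandwidth_bytes_per_us);
}
```

With purely hypothetical parameters of 1 µs latency and 10 kB/µs bandwidth, `transfer_time_us(10, 10e3, 1.0, 1e4)` is larger than `transfer_time_us(1, 100e3, 1.0, 1e4)`: both cases move the same number of bytes, but the ten separate messages pay the latency ten times.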
Transport layer benchmarks
• Many concurrent messages: multiple spatial neighbors
• In practice: 10 to 100 messages in flight
• GHEX is optimized to handle many messages in flight
Transport layer benchmarks
• Many ranks / threads share the same fabric interface on a compute node
• Efficient multi-threaded communication is important for scalability
Summary
• Multiple modern transport backends (UCX, libfabric, MPI)
• Request- and callback-based backends
• CPU / GPU support
• Multi-threaded scalability
• Problem-specific optimizations
• Support for numerous grid types (structured and unstructured)
• C++, bindings to FORTRAN, Python