TRANSCRIPT
PyFR: Technical Challenges of Bringing Next Generation Fluid Dynamics to GPUs
F.D. Witherden
Department of Aeronautics, Imperial College London
Introduction
• Computational fluid dynamics (CFD) is the bedrock of several high-tech industries.
• There is a desire amongst practitioners to perform unsteady, scale-resolving simulations in the vicinity of complex geometries.
Current Generation CFD
• Optimised for steady state problems.
• Numerics established in the 1980s:
• assume FLOPS are expensive…
• and memory bandwidth is plentiful.
Next Generation CFD
• To make unsteady simulations practical on an industrial scale we need:
• new numerics;
• new hardware;
• new implementations;
• new rules…?
PyFR
• Our solution: PyFR.
• A high-order compressible Navier-Stokes solver for 3D unstructured grids.
• Designed from the ground up to run on NVIDIA GPUs.
• Written entirely in Python!
The FR in PyFR
• Uses flux reconstruction (FR) approach;
• can recover well-known schemes including nodal Discontinuous Galerkin (DG).
• Majority of operations element-local.
• Can obtain over 50% of peak FLOPS.
The Py in PyFR
• Leverages PyCUDA and mpi4py.
• Makes extensive use of run-time code generation.
• All compute performed on device.
• Overhead from the interpreter < 1%.
• Just 5,000 lines of code.
PyFR In Practice
• Flow over a cylinder.
[Figure: isosurfaces of density at Ma = 0.2; Re = 3900]
Scalability of PyFR
• Performance has been evaluated on the Emerald cluster;
• one of the largest GPU clusters in the UK;
• 372 NVIDIA M2090s.
• Nodes connected via QDR InfiniBand.
Scaling: Weak
• Problem size kept in proportion.
[Plot: normalised runtime against number of NVIDIA M2090s (1 to 104)]
Scaling: Strong
• Problem size kept constant.
[Plot: speedup against number of NVIDIA M2090s (1 to 32)]
Techniques Employed
1. Abstract the numerics in terms of well-understood performance primitives.
2. Use Python to reduce boilerplate code and facilitate run-time kernel generation.
3. Improve communication efficiency through GPUDirect and CUDA-aware MPI.
Using Common Abstractions
• Writing portable and maintainable FLOP-intensive code is non-trivial.
• But CUDA comes “batteries included” with a range of high-performance primitives (see the sketch after this list):
• cuBLAS; cuSPARSE; cuFFT;
• …
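For instance, a dense matrix product can be handed off to cuBLAS from Python in a few lines. The sketch below uses the scikit-cuda wrappers purely as an illustration of the batteries-included point; this is an assumption, not necessarily how PyFR itself binds to cuBLAS.

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg

# Initialise the cuBLAS handle used by scikit-cuda
linalg.init()

# Two double-precision matrices resident on the GPU
a = gpuarray.to_gpu(np.random.rand(256, 256))
b = gpuarray.to_gpu(np.random.rand(256, 256))

# Dispatched to a cuBLAS GEMM under the hood
c = linalg.dot(a, b)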
Using Common Abstractions
• Example from PyFR: inside each element we
• have data at the solution points;
• want to interpolate it to the flux points;
• the interpolated values are a linear combination of the solution point values.
Using Common Abstractions
• Can cast this interpolation as a matrix-vector product (GEMV) with an interpolation matrix Mij.
• Each element has its own Mij.
Using Common Abstractions
• This was performant a decade ago.
• But today GEMV is bandwidth bound (a quick arithmetic-intensity estimate follows below).
• Storing an Mij per element is also expensive.
• Solution: adapt the numerics!
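To make the bandwidth-bound claim concrete, here is a rough arithmetic-intensity estimate; the M2090 figures of roughly 665 GFLOP/s double precision and 177 GB/s are published specifications, not taken from the slides.

\[
I_{\text{GEMV}} \approx \frac{2n^2\ \text{flops}}{8n^2\ \text{bytes}} = 0.25\ \text{flop/byte},
\qquad
I_{\text{M2090}} \approx \frac{665\ \text{GFLOP/s}}{177\ \text{GB/s}} \approx 3.8\ \text{flop/byte}
\]

A GEMV therefore sustains only a few percent of peak, whereas a GEMM over many elements reuses the operator and attains a much higher intensity.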
Using Common Abstractions
• Consider transforming each element to a reference element.
• The Mij then become identical for every element:
• a single GEMM (as sketched below).
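A minimal NumPy sketch of the idea; the sizes and variable names are illustrative, not PyFR's. With one operator per element the work is many small matrix-vector products, while a shared reference-element operator lets every element be handled by one large GEMM.

import numpy as np

p, n = 24, 10000                        # points per element, number of elements
u = np.random.rand(n, p)                # solution-point data for every element
M_ref = np.random.rand(p, p)            # shared reference-element operator

# Physical-space view: one operator per element -> n small GEMVs
M_per_elem = np.broadcast_to(M_ref, (n, p, p))
f_gemv = np.einsum('eij,ej->ei', M_per_elem, u)

# Reference-element view: the same work as a single (p x p)(p x n) GEMM
f_gemm = np.dot(M_ref, u.T).T

assert np.allclose(f_gemv, f_gemm)

On the GPU the GEMM form maps directly onto cuBLAS and amortises the operator across all elements.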
import Python!
• Why Python?
• Interpreted, supports duck typing, garbage collection, exceptions…just like every other scripting language.
• But it makes it extremely easy to call C, Fortran, and CUDA code.
import Python!
• Python and CUDA are a great fit:
• overheads from Python masked by the asynchronous nature of CUDA.
• With PyCUDA, run-time code generation can be delivered to the wider community (a minimal sketch follows):
• think C++ templates on steroids.
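As a standalone illustration of run-time code generation with PyCUDA (not one of PyFR's actual kernels; the kernel, constant, and launch configuration here are made up):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# Bake a problem-specific constant into the kernel source at run time
ndims = 3
src = """
__global__ void scale(double *x, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= %(ndims)d;
}
""" % {'ndims': ndims}

# nvcc compiles the generated source when the module is constructed
scale = SourceModule(src).get_function('scale')

x = gpuarray.to_gpu(np.ones(1024))
scale(x.gpudata, np.int32(x.size), block=(128, 1, 1), grid=(8, 1))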
import Python!
• In fluid dynamics we need the flux (see below):
• System closed through an equation of state.
• Implement this generally in PyFR using Mako.
\[
u =
\begin{Bmatrix}
\rho \\ \rho v_x \\ \rho v_y \\ \rho v_z \\ E
\end{Bmatrix},
\qquad
f(u) =
\begin{Bmatrix}
\rho v_x & \rho v_y & \rho v_z \\
\rho v_x^2 + p & \rho v_y v_x & \rho v_z v_x \\
\rho v_x v_y & \rho v_y^2 + p & \rho v_z v_y \\
\rho v_x v_z & \rho v_y v_z & \rho v_z^2 + p \\
v_x(E + p) & v_y(E + p) & v_z(E + p)
\end{Bmatrix}
\qquad \text{(3D)}
\]
import Python!
fpdtype_t invrho = 1.0/s[0], E = s[${ndims + 1}];

// Compute the velocities
fpdtype_t rhov[${ndims}];
% for i in range(ndims):
    rhov[${i}] = s[${i + 1}];
    v[${i}] = invrho*rhov[${i}];
% endfor

// Compute the pressure
p = ${c['gamma'] - 1}*(E - 0.5*invrho*${pyfr.dot('rhov[{i}]', i=ndims)});

// Density and energy fluxes
% for i in range(ndims):
    f[${i}][0] = rhov[${i}];
    f[${i}][${ndims + 1}] = (E + p)*v[${i}];
% endfor

// Momentum fluxes
% for i, j in pyfr.ndrange(ndims, ndims):
    f[${i}][${j + 1}] = rhov[${i}]*v[${j}]${' + p' if i == j else ''};
% endfor
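To show how such a template becomes CUDA C at run time, here is a trimmed-down render of just the density-flux loop using plain Mako; the constants dictionary and helper functions of the real kernels are omitted.

from mako.template import Template

tpl = Template("""
// Density fluxes
% for i in range(ndims):
f[${i}][0] = rhov[${i}];
% endfor
""")

# Emits one unrolled assignment per dimension with the index baked in
print(tpl.render(ndims=3))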
Improving Communication
• In multi-GPU simulations we need to perform halo exchanges between GPUs.
Improving Communication
• Traditionally this requires an explicit device↔host transfer on each end.
• The copy must be marshalled by the application and integrated with MPI_Isend/Irecv.
• MPI cannot start until cudaMemcpy[Async] has finished (a sketch of this staged path follows).
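A minimal sketch of this traditional staged path using PyCUDA and mpi4py; the helper name and byte-buffer handling are illustrative, not PyFR's code.

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from mpi4py import MPI

def send_halo_staged(dev_buf, nbytes, dest, tag):
    comm = MPI.COMM_WORLD

    # Stage the halo through a host buffer; MPI never sees the device pointer
    host_buf = np.empty(nbytes, dtype=np.uint8)
    cuda.memcpy_dtoh(host_buf, dev_buf)

    # The send can only begin once the copy above has completed
    return comm.Isend(host_buf, dest, tag)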
Improving Communication
• With a CUDA-aware MPI implementation:
• MPI_Isend(device_ptr, …).
• Let the MPI implementation handle the copy:
• intra-node communication can exploit GPUDirect: device↔device.
Improving Communication
• Not just simpler; better performance too.
• Transfers can be pipelined to permit better overlap.
• If you are very lucky, you can even exploit GPUDirect RDMA.
Improving Communication
• What about Python?
• With the git master branch of PyCUDA:
from mpi4py import MPI

def send_cuptr(cubuf, nbytes, dest, tag):
    comm = MPI.COMM_WORLD
    pybuf = cubuf.as_buffer(nbytes)

    return comm.Isend(pybuf, dest, tag)
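A possible usage sketch, assuming a CUDA-aware MPI build and a neighbouring rank 1 to receive; the array, ranks, and tag are made up.

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

x = gpuarray.to_gpu(np.arange(1024, dtype=np.float64))

# x.gpudata is a DeviceAllocation; as_buffer() exposes it to mpi4py directly
req = send_cuptr(x.gpudata, x.nbytes, dest=1, tag=0)
req.Wait()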
Summary
• Funded and supported by [sponsor logos].
• Any questions?
• E-mail: [email protected]
• Website: http://pyfr.org