TRANSCRIPT
PyFR: Technical Challenges of Bringing Next Generation Fluid Dynamics to GPUs
F.D. Witherden
Department of Aeronautics, Imperial College London
Introduction
• Computational fluid dynamics (CFD) is the bedrock of several high-tech industries.
• There is a desire amongst practitioners to perform unsteady, scale-resolving simulations in the vicinity of complex geometries.
Current Generation CFD
• Optimised for steady state problems.
• Numerics established in the 1980s:
• assume FLOPS are expensive…
• and memory bandwidth is plentiful.
Next Generation CFD
• To make unsteady simulations practical on an industrial scale we need:
• new numerics;
• new hardware;
• new implementations;
• new rules…?
PyFR
• Our solution: PyFR.
• A high-order compressible Navier-Stokes solver for 3D unstructured grids.
• Designed from the ground up to run on NVIDIA GPUs.
• Written entirely in Python!
The FR in PyFR
• Uses flux reconstruction (FR) approach;
• can recover well-known schemes including nodal Discontinuous Galerkin (DG).
• Majority of operations element-local.
• Can obtain over 50% of peak FLOPS.
The Py in PyFR
• Leverages PyCUDA and mpi4py.
• Makes extensive use of run-time code generation.
• All compute performed on device.
• Overhead from the interpreter < 1%.
• Just 5,000 lines of code.
PyFR In Practice
• Flow over a cylinder.
[Figure: isosurfaces of density at Ma = 0.2; Re = 3900]
Scalability of PyFR
• Performance has been evaluated on the Emerald cluster;
• one of the largest GPU clusters in the UK;
• 372 NVIDIA M2090s.
• Nodes connected via QDR InfiniBand.
Scaling: Weak
• Problem size kept in proportion.
[Plot: normalised runtime against number of NVIDIA M2090s (1 to 104)]
Scaling: Strong
• Problem size kept constant.
[Plot: speedup against number of NVIDIA M2090s (1 to 32)]
Techniques Employed
1. Abstract the numerics in terms of well-understood performance primitives.
2. Use Python to reduce boilerplate code and facilitate run-time kernel generation.
3. Improve communication efficiency through GPUDirect and CUDA-aware MPI.
Using Common Abstractions
• Writing portable and maintainable FLOP-intensive code is non-trivial.
• But CUDA comes “batteries included” with a range of high-performance primitives (see the sketch after this list):
• cuBLAS; cuSPARSE; cuFFT;
• …
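For instance, a dense matrix product can be handed off to cuBLAS from Python in a few lines. The sketch below uses the scikit-cuda wrappers purely as an illustration of the batteries-included point; this is an assumption, not necessarily how PyFR itself binds to cuBLAS.

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg

# Initialise the cuBLAS handle used by scikit-cuda
linalg.init()

# Two double-precision matrices resident on the GPU
a = gpuarray.to_gpu(np.random.rand(256, 256))
b = gpuarray.to_gpu(np.random.rand(256, 256))

# Dispatched to a cuBLAS GEMM under the hood
c = linalg.dot(a, b)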
Using Common Abstractions
• Example from PyFR: inside each element we
• have data at the solution points;
• want to interpolate it to the flux points;
• the interpolated values are a linear combination of the solution point values.
Using Common Abstractions
• Can cast this interpolation as a matrix-vector product (GEMV) with an interpolation matrix Mij.
• Each element has its own Mij.
Using Common Abstractions
• This was performant a decade ago.
• But today GEMV is bandwidth bound (a quick arithmetic-intensity estimate follows below).
• Storing an Mij per element is also expensive.
• Solution: adapt the numerics!
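To make the bandwidth-bound claim concrete, here is a rough arithmetic-intensity estimate; the M2090 figures of roughly 665 GFLOP/s double precision and 177 GB/s are published specifications, not taken from the slides.

\[
I_{\text{GEMV}} \approx \frac{2n^2\ \text{flops}}{8n^2\ \text{bytes}} = 0.25\ \text{flop/byte},
\qquad
I_{\text{M2090}} \approx \frac{665\ \text{GFLOP/s}}{177\ \text{GB/s}} \approx 3.8\ \text{flop/byte}
\]

A GEMV therefore sustains only a few percent of peak, whereas a GEMM over many elements reuses the operator and attains a much higher intensity.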
Using Common Abstractions
• Consider transforming each element to a reference element.
• The Mij then become identical for every element:
• a single GEMM (as sketched below).
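A minimal NumPy sketch of the idea; the sizes and variable names are illustrative, not PyFR's. With one operator per element the work is many small matrix-vector products, while a shared reference-element operator lets every element be handled by one large GEMM.

import numpy as np

p, n = 24, 10000                        # points per element, number of elements
u = np.random.rand(n, p)                # solution-point data for every element
M_ref = np.random.rand(p, p)            # shared reference-element operator

# Physical-space view: one operator per element -> n small GEMVs
M_per_elem = np.broadcast_to(M_ref, (n, p, p))
f_gemv = np.einsum('eij,ej->ei', M_per_elem, u)

# Reference-element view: the same work as a single (p x p)(p x n) GEMM
f_gemm = np.dot(M_ref, u.T).T

assert np.allclose(f_gemv, f_gemm)

On the GPU the GEMM form maps directly onto cuBLAS and amortises the operator across all elements.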
import Python!
• Why Python?
• Interpreted, supports duck typing, garbage collection, exceptions…just like every other scripting language.
• But it makes it extremely easy to call C, Fortran, and CUDA code.
import Python!
• Python and CUDA are a great fit:
• overheads from Python masked by the asynchronous nature of CUDA.
• With PyCUDA, run-time code generation can be delivered to the wider community (a minimal sketch follows):
• think C++ templates on steroids.
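As a standalone illustration of run-time code generation with PyCUDA (not one of PyFR's actual kernels; the kernel, constant, and launch configuration here are made up):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# Bake a problem-specific constant into the kernel source at run time
ndims = 3
src = """
__global__ void scale(double *x, int n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= %(ndims)d;
}
""" % {'ndims': ndims}

# nvcc compiles the generated source when the module is constructed
scale = SourceModule(src).get_function('scale')

x = gpuarray.to_gpu(np.ones(1024))
scale(x.gpudata, np.int32(x.size), block=(128, 1, 1), grid=(8, 1))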
import Python!
• In fluid dynamics we need the flux (see below):
• System closed through an equation of state.
• Implement this generally in PyFR using Mako.
\[
u =
\begin{Bmatrix}
\rho \\ \rho v_x \\ \rho v_y \\ \rho v_z \\ E
\end{Bmatrix},
\qquad
f(u) =
\begin{Bmatrix}
\rho v_x & \rho v_y & \rho v_z \\
\rho v_x^2 + p & \rho v_y v_x & \rho v_z v_x \\
\rho v_x v_y & \rho v_y^2 + p & \rho v_z v_y \\
\rho v_x v_z & \rho v_y v_z & \rho v_z^2 + p \\
v_x(E + p) & v_y(E + p) & v_z(E + p)
\end{Bmatrix}
\qquad \text{(3D)}
\]
import Python!
fpdtype_t invrho = 1.0/s[0], E = s[${ndims + 1}];

// Compute the velocities
fpdtype_t rhov[${ndims}];
% for i in range(ndims):
    rhov[${i}] = s[${i + 1}];
    v[${i}] = invrho*rhov[${i}];
% endfor

// Compute the pressure
p = ${c['gamma'] - 1}*(E - 0.5*invrho*${pyfr.dot('rhov[{i}]', i=ndims)});

// Density and energy fluxes
% for i in range(ndims):
    f[${i}][0] = rhov[${i}];
    f[${i}][${ndims + 1}] = (E + p)*v[${i}];
% endfor

// Momentum fluxes
% for i, j in pyfr.ndrange(ndims, ndims):
    f[${i}][${j + 1}] = rhov[${i}]*v[${j}]${' + p' if i == j else ''};
% endfor
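To show how such a template becomes CUDA C at run time, here is a trimmed-down render of just the density-flux loop using plain Mako; the constants dictionary and helper functions of the real kernels are omitted.

from mako.template import Template

tpl = Template("""
// Density fluxes
% for i in range(ndims):
f[${i}][0] = rhov[${i}];
% endfor
""")

# Emits one unrolled assignment per dimension with the index baked in
print(tpl.render(ndims=3))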
Improving Communication
• In multi-GPU simulations we need to perform halo exchanges between GPUs.
Improving Communication
• Traditionally this requires an explicit device↔host transfer on each end.
• The copy must be marshalled by the application and integrated with MPI_Isend/Irecv.
• MPI cannot start until cudaMemcpy[Async] has finished (a sketch of this staged path follows).
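A minimal sketch of this traditional staged path using PyCUDA and mpi4py; the helper name and byte-buffer handling are illustrative, not PyFR's code.

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from mpi4py import MPI

def send_halo_staged(dev_buf, nbytes, dest, tag):
    comm = MPI.COMM_WORLD

    # Stage the halo through a host buffer; MPI never sees the device pointer
    host_buf = np.empty(nbytes, dtype=np.uint8)
    cuda.memcpy_dtoh(host_buf, dev_buf)

    # The send can only begin once the copy above has completed
    return comm.Isend(host_buf, dest, tag)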
Improving Communication
• With a CUDA-aware MPI implementation:
• MPI_Isend(device_ptr, …).
• Let the MPI implementation handle the copy:
• intra-node communication can exploit GPUDirect: device↔device.
Improving Communication
• Not just simpler; better performance too.
• Transfers can be pipelined to permit better overlap.
• If you are very lucky, you can even exploit GPUDirect RDMA.
Improving Communication
• What about Python?
• With the git master branch of PyCUDA:
from mpi4py import MPI

def send_cuptr(cubuf, nbytes, dest, tag):
    comm = MPI.COMM_WORLD
    pybuf = cubuf.as_buffer(nbytes)

    return comm.Isend(pybuf, dest, tag)
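A possible usage sketch, assuming a CUDA-aware MPI build and a neighbouring rank 1 to receive; the array, ranks, and tag are made up.

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

x = gpuarray.to_gpu(np.arange(1024, dtype=np.float64))

# x.gpudata is a DeviceAllocation; as_buffer() exposes it to mpi4py directly
req = send_cuptr(x.gpudata, x.nbytes, dest=1, tag=0)
req.Wait()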
Summary
• Funded and supported by [sponsor logos].
• Any questions?
• E-mail: [email protected]
• Website: http://pyfr.org