PyFR: Technical Challenges of Bringing Next Generation Fluid Dynamics to GPUs

F.D. Witherden
Department of Aeronautics, Imperial College London


Post on 01-Oct-2020



Introduction

• Computational fluid dynamics (CFD) is the bedrock of several high-tech industries.

• Practitioners want to perform unsteady, scale-resolving simulations in the vicinity of complex geometries.


Current Generation CFD

• Optimised for steady state problems.

• Numerics established in the 1980s:

• assume FLOPS are expensive…

• and memory bandwidth is plentiful.


Next Generation CFD

• To make unsteady simulations practical on an industrial scale we need:

• new numerics;

• new hardware;

• new implementations;

• new rules…?


PyFR

• Our solution: PyFR.

• A high-order compressible Navier-Stokes solver for 3D unstructured grids.

• Designed from the ground up to run on NVIDIA GPUs.

• Written entirely in Python!


The FR in PyFR

• Uses the flux reconstruction (FR) approach;

• can recover well-known schemes including nodal Discontinuous Galerkin (DG).

• Majority of operations element-local.

• Can obtain over 50% of peak FLOPS.


The Py in PyFR

• Leverages PyCUDA and mpi4py.

• Makes extensive use of run-time code generation.

• All compute performed on device.

• Overhead from the interpreter < 1%.

• Just 5,000 lines of code.


PyFR In Practice

• Flow over a cylinder:

• Isosurfaces of density at Ma = 0.2; Re = 3900.

Scalability of PyFR

• Performance has been evaluated on the Emerald cluster;

• one of the largest GPU clusters in the UK;

• 372 NVIDIA M2090s.

• Nodes connected via QDR InfiniBand.


Scaling: Weak

• Problem size kept in proportion to the number of GPUs.

[Figure: normalised runtime (axis 0.00 to 1.25) against number of NVIDIA M2090s (1, 2, 4, 8, 16, 32, 64, 104).]

Scaling: Strong

• Problem size kept constant.

[Figure: speedup (axis 0 to 32) against number of NVIDIA M2090s (1, 2, 4, 8, 16, 32).]
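The quantities on the two scaling plots are standard ratios of wall-clock times. The sketch below shows how they are usually defined; the function names are illustrative and nothing here reproduces PyFR's measured data.

```python
def strong_scaling_speedup(t_one_gpu, t_n_gpus):
    """Speedup for a fixed problem size; the ideal value equals the GPU count."""
    return t_one_gpu / t_n_gpus


def weak_scaling_normalised_runtime(t_one_gpu, t_n_gpus):
    """Runtime relative to one GPU when the work per GPU is fixed; ideal is 1."""
    return t_n_gpus / t_one_gpu


def parallel_efficiency(speedup, n_gpus):
    """Fraction of the ideal strong-scaling speedup that was achieved."""
    return speedup / n_gpus
```

With these definitions a flat line at 1.00 is perfect weak scaling, and a straight diagonal is perfect strong scaling.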

Techniques Employed

1. Abstract the numerics in terms of well-understood performance primitives.

2. Use Python to reduce boilerplate code and facilitate run-time kernel generation.

3. Improve communication efficiency using GPUDirect and CUDA-aware MPI.


Using Common Abstractions

• Writing portable and maintainable FLOP-intensive code is non-trivial.

• But, CUDA comes “batteries included” with a range of high-performance primitives:

• cuBLAS; cuSPARSE; cuFFT;

• …


Using Common Abstractions

• Example from PyFR: inside each element

• have data at one set of points;

• want to interpolate to a second set of points;

• the interpolated values are a linear combination of the original data.


Using Common Abstractions

• Can cast this as GEMV: (interpolated values) = Mij × (element data).

• Each element has its own Mij.
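A minimal NumPy sketch of the per-element GEMV formulation, to make the cost visible. The array names and sizes are hypothetical and the data is random; this is not PyFR code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_elems, n_src, n_dst = 4, 6, 8  # hypothetical sizes

# One interpolation matrix and one data vector per element
M = rng.standard_normal((n_elems, n_dst, n_src))
u = rng.standard_normal((n_elems, n_src))

# One GEMV per element: out[e] = M[e] @ u[e]
out = np.stack([M[e] @ u[e] for e in range(n_elems)])
```

Every entry of every `M[e]` is read from memory once and used for a single multiply-add, and storage for `M` grows with the number of elements.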

Using Common Abstractions

• Performant a decade ago.

• But, today GEMV is bandwidth bound.

• Storing an Mij per-element is also expensive.

• Solution: adapt the numerics!
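The bandwidth-bound claim can be checked with a rough arithmetic-intensity estimate. The sketch below counts FLOPs per byte moved for GEMV and for GEMM, assuming double precision and ideal data reuse; the function names are illustrative, and any peak figures passed to `machine_balance` are the caller's assumptions.

```python
def gemv_intensity(m, n, bytes_per_value=8):
    """FLOPs per byte for y = M x with an m-by-n matrix M."""
    flops = 2 * m * n                            # one multiply and one add per entry
    traffic = bytes_per_value * (m * n + n + m)  # read M and x, write y
    return flops / traffic


def gemm_intensity(m, k, n, bytes_per_value=8):
    """FLOPs per byte for C = A B (A is m-by-k, B is k-by-n), with ideal reuse."""
    flops = 2 * m * k * n
    traffic = bytes_per_value * (m * k + k * n + m * n)  # read A and B, write C
    return flops / traffic


def machine_balance(peak_flops, peak_bandwidth):
    """FLOPs per byte a kernel must sustain to become compute bound."""
    return peak_flops / peak_bandwidth
```

For a 64 by 64 operator, `gemv_intensity(64, 64)` is about 0.24 FLOPs per byte, while `gemm_intensity(64, 64, 100000)` is about 8. Taking roughly 665 GFLOP/s and 177 GB/s as approximate double-precision peaks for an M2090 (an assumption, not a figure from the slides), the machine balance is near 3.8, which sits between the two.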


Using Common Abstractions

• Consider transforming each element to a reference element.

• Mij becomes identical for each element:

• a single GEMM.
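Once every element shares the same operator, the per-element data can be stored as the columns of one matrix and interpolated in a single call. A minimal NumPy sketch with hypothetical sizes and random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_elems, n_src, n_dst = 1000, 6, 8  # hypothetical sizes

M = rng.standard_normal((n_dst, n_src))    # one operator shared by all elements
U = rng.standard_normal((n_src, n_elems))  # column e holds the data of element e

# A single GEMM interpolates every element at once
V = M @ U
```

The small matrix `M` is reused across all columns of `U`, so the FLOP count per byte loaded rises with the number of elements, and the call maps directly onto a vendor GEMM such as the one in cuBLAS.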

import Python!

• Why Python?

• Interpreted, supports duck typing, garbage collection, exceptions…just like every other scripting language.

• But makes it extremely easy to call C, FORTRAN, and CUDA code.
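As a small illustration of calling C from Python, the sketch below uses the standard-library `ctypes` module to call `strlen` from the C library. It assumes a POSIX system (Linux or macOS), where `CDLL(None)` exposes the symbols already loaded into the process.

```python
import ctypes

# Symbols already linked into the running process, including libc
# (assumption: POSIX system; this does not work on Windows)
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"PyFR")
```

No wrapper code or compilation step is needed; the same mechanism extends to shared libraries written in C or Fortran.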


import Python!

• Python and CUDA are a great fit:

• overheads from Python masked by the asynchronous nature of CUDA.

• With PyCUDA it can help deliver run-time code generation to the wider community:

• think C++ templates on steroids.


import Python!

• In fluid dynamics we need the flux (3D):

\[
u = \begin{Bmatrix} \rho \\ \rho v_x \\ \rho v_y \\ \rho v_z \\ E \end{Bmatrix},
\qquad
f(u) = \begin{Bmatrix}
\rho v_x & \rho v_y & \rho v_z \\
\rho v_x^2 + p & \rho v_y v_x & \rho v_z v_x \\
\rho v_x v_y & \rho v_y^2 + p & \rho v_z v_y \\
\rho v_x v_z & \rho v_y v_z & \rho v_z^2 + p \\
v_x (E + p) & v_y (E + p) & v_z (E + p)
\end{Bmatrix}
\]

• System closed through an equation of state.

• Implement this generally in PyFR using Mako.
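For reference, the flux can be written as a plain-Python function that works in any number of dimensions. This is a sketch, not PyFR code: the function name is hypothetical, the default `gamma=1.4` is an assumed value, and the ideal-gas closure p = (γ − 1)(E − ½ρ|v|²) is taken from the pressure expression in the PyFR template.

```python
def euler_flux(u, gamma=1.4):
    """Inviscid flux for a conserved state u = [rho, rho*v_1, ..., rho*v_d, E].

    Returns f where f[i][k] is the flux of variable k in direction i.
    """
    ndims = len(u) - 2
    rho, rhov, E = u[0], u[1:ndims + 1], u[ndims + 1]
    v = [rv / rho for rv in rhov]

    # Ideal-gas equation of state closes the system
    p = (gamma - 1) * (E - 0.5 * sum(rv * vi for rv, vi in zip(rhov, v)))

    f = [[0.0] * (ndims + 2) for _ in range(ndims)]
    for i in range(ndims):
        f[i][0] = rhov[i]                 # density flux
        f[i][ndims + 1] = (E + p) * v[i]  # energy flux
        for j in range(ndims):            # momentum fluxes
            f[i][j + 1] = rhov[i] * v[j] + (p if i == j else 0.0)
    return f
```

For a fluid at rest only the pressure terms on the diagonal of the momentum block survive.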

import Python!

    fpdtype_t invrho = 1.0/s[0], E = s[${ndims + 1}];

    // Compute the velocities
    fpdtype_t rhov[${ndims}];
    % for i in range(ndims):
        rhov[${i}] = s[${i + 1}];
        v[${i}] = invrho*rhov[${i}];
    % endfor

    // Compute the pressure
    p = ${c['gamma'] - 1}*(E - 0.5*invrho*${pyfr.dot('rhov[{i}]', i=ndims)});

    // Density and energy fluxes
    % for i in range(ndims):
        f[${i}][0] = rhov[${i}];
        f[${i}][${ndims + 1}] = (E + p)*v[${i}];
    % endfor

    // Momentum fluxes
    % for i, j in pyfr.ndrange(ndims, ndims):
        f[${i}][${j + 1}] = rhov[${i}]*v[${j}]${' + p' if i == j else ''};
    % endfor
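To see what the template's loops produce, the sketch below mimics the momentum-flux section with plain Python string formatting instead of Mako. It is a stand-in for illustration only: the function name is hypothetical, and it assumes `pyfr.ndrange(ndims, ndims)` iterates over the Cartesian product of the two ranges, which is what its use in the template suggests.

```python
from itertools import product


def momentum_flux_source(ndims):
    """Emit unrolled C statements for the momentum fluxes in ndims dimensions."""
    lines = []
    for i, j in product(range(ndims), repeat=2):
        pressure = ' + p' if i == j else ''
        lines.append(f'f[{i}][{j + 1}] = rhov[{i}]*v[{j}]{pressure};')
    return '\n'.join(lines)


print(momentum_flux_source(2))
# f[0][1] = rhov[0]*v[0] + p;
# f[0][2] = rhov[0]*v[1];
# f[1][1] = rhov[1]*v[0];
# f[1][2] = rhov[1]*v[1] + p;
```

The generated kernel contains no loops or branches over dimensions; the choice between 2D and 3D is made once, when the source is rendered at run time.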

Improving Communication

• In multi-GPU simulations we need to perform halo exchanges between GPUs.

Improving Communication

• Traditionally requires an explicit device↔host transfer on each end.

• Copy must be marshalled by the application and integrated with MPI_Isend/Irecv.

• MPI cannot start until cudaMemcpy[Async] has finished.

Improving Communication

• With a CUDA-aware MPI implementation:

• MPI_Isend(device_ptr, …).

• Let the MPI implementation handle the copy:

• intra-node communication can exploit GPUDirect: device↔device.


Improving Communication

• Not just simpler; better performance too.

• Transfers can be pipelined to permit better overlap.

• If you are very lucky, you can even exploit GPUDirect over RDMA.


Improving Communication

• What about Python?

• With the git master branch of PyCUDA:

    from mpi4py import MPI

    def send_cuptr(cubuf, nbytes, dest, tag):
        # Non-blocking send straight from a device allocation;
        # requires a CUDA-aware MPI implementation
        comm = MPI.COMM_WORLD
        # Expose the device memory through the Python buffer interface
        pybuf = cubuf.as_buffer(nbytes)
        return comm.Isend(pybuf, dest, tag)

Summary

• Funded and supported by: [sponsor logos].

• Any questions?

• E-mail: [email protected]

• Website: http://pyfr.org