13th–14th Sept 2016, NCAR MultiCore 6 Workshop. PSyclone: a code generation and optimisation system for finite element and finite difference codes. Rupert Ford, Andrew Porter and Karthee Sivalingam, Mike Ashworth (speaker). STFC Daresbury Laboratory, United Kingdom. Funded by the Hartree Centre.

Page 1

13th–14th Sept 2016, NCAR MultiCore 6 Workshop

PSyclone: a code generation and optimisation system for finite element and finite difference codes

Rupert Ford, Andrew Porter and Karthee Sivalingam, Mike Ashworth (speaker)

STFC Daresbury Laboratory, United Kingdom

Funded by the Hartree Centre

Page 2

Overview

Motivation
− Maintainable, efficient, scalable software on current and future HPC architectures

Aims
− Portable performance (today and future)
− Single source science code
− High(er) level problem specification

Page 3

Overview

Challenge
− Parallelism increasing
− Memory latency and hierarchy increasing
− Different and changing
  architectural solutions
  software standards
  compilers
− It is hard to restructure codes

Page 4

Overview

Some Solutions
− HPC experts optimise for particular architectures
  Not single source science
  Not portable performance
− Trade maintenance, portability and performance, e.g. only support MPI
  Not portable performance
− Use domain specific knowledge ...

Page 5

Domain Specific Knowledge

Finite element, volume, difference
− Operations over a mesh
− Typically same operation at each element/volume/point
− Data parallel (typically independent operations)
− Low level functional parallelism
− Nearest neighbour communications for stencils
− Global sum(s) for solver convergence and/or conservation (e.g. temperature)
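
As a rough illustration of the pattern listed above, the following sketch applies the same nearest-neighbour stencil at every interior point of a small 1D mesh and then forms a global sum. The function names, the three-point averaging stencil and the mesh values are assumptions made for illustration; they are not taken from any of the models discussed.

```python
def apply_stencil(field):
    """Return a new field where each interior point is the mean of itself
    and its two nearest neighbours; boundary points are copied unchanged."""
    result = list(field)
    # Each iteration reads only the old field, so the iterations are
    # independent and could run in any order (data parallel).
    for i in range(1, len(field) - 1):
        result[i] = (field[i - 1] + field[i] + field[i + 1]) / 3.0
    return result


def global_sum(field):
    """Reduction over the whole mesh, e.g. for a convergence or conservation check."""
    return sum(field)


# Made-up mesh values, for illustration only.
mesh = [0.0, 3.0, 6.0, 3.0, 0.0]
smoothed = apply_stencil(mesh)
print(smoothed, global_sum(smoothed))
```

On a distributed mesh, the stencil would need neighbouring values from adjacent partitions and the sum would become a reduction across all processes, which is where the nearest-neighbour communications and global sums come from.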

Page 6

Projects

GungHo
− Separate Science from HPC optimisation (PSyKAl)
  Performance portability
  Single-source science code
− 'Dynamo' prototype implementation
− PSyclone generation of PSy layer

GOcean
− Apply GungHo approach to Ocean models

Intel Parallel Computing Centre
− PSyclone optimisation for Xeon and Xeon Phi

LFRic
− See Chris Maynard's talk

Page 7

PSyKAl

[Diagram: the Algorithm, PSy and Kernels layers with Infrastructure; labels: Science, Performance]

Algorithm layer refers to the whole model domain

Kernels for individual columns

Parallel System layer handles multiple levels of parallelism
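
The separation described above can be pictured with a toy example. This is a minimal sketch, assuming a field stored as a list of columns; the function names and the scaling operation are hypothetical and are not generated or required by PSyclone.

```python
def scale_column_kernel(column, factor):
    """Kernel layer: science code that works on a single column of levels."""
    return [factor * value for value in column]


def invoke_scale(field, factor):
    """PSy layer: decides how the kernel is applied across all columns.
    A plain serial loop is used here; this is the only layer that would
    change to introduce, for example, threading or message passing."""
    return [scale_column_kernel(column, factor) for column in field]


def algorithm(field):
    """Algorithm layer: expressed in terms of whole fields, with no loops
    over columns and no reference to parallelism."""
    return invoke_scale(field, 2.0)


# A field with three columns of two levels each (made-up values).
field = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(algorithm(field))
```

Because the algorithm and kernel functions never mention how columns are traversed, the middle function can be rewritten for a different parallel strategy without touching the science code.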

Page 8

Dynamo algorithm example ...

From lhs_alg_mod.x90
Multiple kernels within an invoke()
Note: 'built-ins' now supported in PSyclone

Page 9

Dynamo kernel example ...

From matrix_vector_kernel_mod.F90

Page 10

PSyclone

[Diagram: PSyclone architecture. Components: Parser, PSy Generator, Algorithm Generator, Transforms (Transformation). Inputs: Alg Code, Kernel Codes. Outputs: PSy Code, Alg Code. Data passed between components: ast, info, psy, schedule]

Page 11

Dynamo algorithm post PSyclone

Page 12

PSyclone Transformations example ...

> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -nodm -s ./global.py

Vanilla, no dm (distributed memory)

Page 13

PSyclone Transformations example ...

> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -s ./global.py

Vanilla, dm

Page 14

PSyclone Transformations example ...

> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -s ./global.py

Colours, dm

Page 15

PSyclone Transformations example ...

> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -s ./global.py

OpenMP + dm

Page 16

PSyclone Transformations example ...
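
The -s option in the commands on the preceding slides names a Python script that modifies the schedule before the PSy layer is written out. The sketch below imitates that idea with a toy schedule of loops and a toy transformation. The classes Loop and OMPParallelTrans, the function trans and the kernel names are hypothetical stand-ins written for this illustration; they are not PSyclone's actual API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Loop:
    """Toy schedule node: one loop over columns calling a named kernel."""
    kernel: str
    directives: List[str] = field(default_factory=list)


class OMPParallelTrans:
    """Hypothetical transformation that marks a loop as OpenMP parallel."""

    def apply(self, loop: Loop) -> Loop:
        # Return a new node rather than mutating, so the original schedule is kept.
        return Loop(loop.kernel, loop.directives + ["omp parallel do"])


def trans(schedule: List[Loop]) -> List[Loop]:
    """Script entry point (name assumed): transform every loop in the schedule."""
    omp = OMPParallelTrans()
    return [omp.apply(loop) for loop in schedule]


# Illustrative kernel names only.
schedule = [Loop("matrix_vector"), Loop("other_kernel")]
for loop in trans(schedule):
    print(loop.kernel, loop.directives)
```

The point of the design is that the optimisation choice lives in the script, so the same algorithm and kernel sources can be regenerated with different scripts for different targets.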

Page 17

Going MPI Parallel

LFRic infrastructure and PSyclone support

PSyclone command line change (remove the -nodm flag to generate distributed-memory code)

> python generator.py -oalg alg.f90 -opsy psy.f90 file.f90 -nodm

No Scientists were harmed

Page 18

Dynamo Initial MPI results

L is the patch of columns on each core
Blue is weak scaling (small problem)
Red is weak scaling (larger problem)
From blue to red is strong scaling
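
For readers less familiar with the two scaling modes, the sketch below gives the usual efficiency definitions: weak scaling keeps the work per core fixed while cores are added, and strong scaling keeps the total problem size fixed. The function names and all timings are hypothetical and are not taken from the Dynamo results.

```python
def strong_scaling_efficiency(t_base, p_base, t_p, p):
    """Fixed total problem size: ideal time falls in proportion to core count."""
    return (t_base * p_base) / (t_p * p)


def weak_scaling_efficiency(t_base, t_p):
    """Fixed work per core: ideal time stays constant as cores are added."""
    return t_base / t_p


# Hypothetical timings in seconds, for illustration only.
print(strong_scaling_efficiency(t_base=100.0, p_base=1, t_p=30.0, p=4))
print(weak_scaling_efficiency(t_base=10.0, t_p=12.5))
```

An efficiency of 1.0 is ideal in both cases; values below 1.0 indicate overheads such as communication or load imbalance.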

Page 19

Going OpenMP Parallel

Dynamo infrastructure and PSyclone support

PSyclone script

> python generator.py -oalg alg.f90 -opsy psy.f90 file.f90 -s script.py

No Scientists were harmed

Page 20

Dynamo Initial OpenMP results

[Plot: Performance (/sec), axis 0 to 30, against Number of Physical Cores, axis 0 to 64. Series: Xeon IvyBridge; Xeon Phi KNC; KNC: HT=2; KNC: HT=4]

Dual-socket 12-core IvyBridge
Hyperthreading gives a small performance boost on the Xeon Phi
No attempt has been made to improve vectorization

Page 21

What about performance?

Page 22

Shallow water model, GOcean FD API

Sequential performance

Comparing with a hand-optimised code

PSyKAl-restructured code can perform as well as (sometimes better than) hand-optimised code

Page 23

NEMOlite2D model, GOcean FD API

OpenMP and OpenACC performance

Page 24

How well optimised is the code?

Plan to analyse kernels to determine a realistic achievable performance upper bound

Analyse kernels via a generated DAG

Matrix vector loop1 example
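
One way to picture DAG-based kernel analysis is sketched below: the arithmetic for a single matrix-vector row is written as a dependency graph, from which an operation count and the length of the longest dependency chain follow. The graph, the node names and the two metrics are assumptions chosen for illustration; the slides do not say which quantities the planned analysis derives.

```python
from functools import lru_cache

# Dependency graph for y = a0*x0 + a1*x1 + a2*x2 (node -> nodes it depends on).
# Leaves (loads of a_i and x_i) have no dependencies.
dag = {
    "a0": [], "x0": [], "a1": [], "x1": [], "a2": [], "x2": [],
    "m0": ["a0", "x0"], "m1": ["a1", "x1"], "m2": ["a2", "x2"],
    "s0": ["m0", "m1"],
    "y": ["s0", "m2"],
}


def count_operations(graph):
    """Number of arithmetic nodes, i.e. nodes that have inputs."""
    return sum(1 for deps in graph.values() if deps)


def critical_path(graph, node):
    """Longest chain of dependent arithmetic operations ending at `node`."""
    @lru_cache(maxsize=None)
    def depth(name):
        deps = graph[name]
        if not deps:
            return 0  # a load contributes no arithmetic
        return 1 + max(depth(dep) for dep in deps)
    return depth(node)


print(count_operations(dag), critical_path(dag, "y"))
```

The operation count bounds the work to be done, while the critical path bounds how much of it can overlap; both would have to be combined with machine characteristics to give a performance bound.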

Page 25

Summary

PSyclone has domain specific APIs to support FE in LFRic and FD in GOcean

The full LFRic code, incorporating the GungHo dynamical core, is running using PSyclone

From single source science code we are generating serial, MPI, OpenMP and MPI/OpenMP code

MPI scaling to O(10k) tasks – see Chris Maynard’s talk for O(100k)

OpenMP scaling to 8 threads on CPU, more on KNC

Optimisations in progress

OpenPOWER port in progress

Page 26

Future Work

Open source later this year

Currently working toward
− MPI optimisations
− Multigrid support
− Kernel optimisations

Future
− Physics integration
− OpenACC
− 3D finite difference / finite volume API
− Explore / search the optimisation space