
Page 1: Towards Exascale Direct Numerical Simulations of Turbulent Combustion on Heterogeneous Machines

Jacqueline Chen, Director of ExaCT

Sandia National Laboratories, Livermore, CA

[email protected]

www.exactcodesign.org

NVIDIA Booth@SC’14, November 20, 2014, New Orleans

Page 2:

Why Exascale Combustion

• Predict behavior of new fuels in different combustion scenarios at realistic pressure and turbulence conditions

– Develop new combustor concepts

– Design new fuels

• Co-design center is focusing on high-fidelity direct numerical simulation methodologies

– Need to perform simulations with sufficient chemical fidelity to differentiate effects of fuels where there is strong coupling with turbulence

– Need to address uncertainties in thermo-chemical properties

– Not addressing complexity of geometry in engineering design codes

Page 3:

Fundamental Turbulence-Chemistry Interactions Motivated by Advanced Engines and Gas Turbines

• Higher fuel efficiency and lower emissions driving combustion towards more dilute, fuel lean, partially-premixed conditions

• New mixed-mode combustion regimes

• Strong sensitivities to fuel chemistry

• Preferential diffusion effects – synthesis gases enriched with hydrogen for carbon capture and storage in gas turbines for power generation

Page 4:

Motivation: Understanding Stabilization of Lifted Flames in Heated Coflow

Chemiluminescence from diesel lift-off stabilization for #2 diesel; ambient 21% O2, 850 K, 35 bar (courtesy of Lyle Pickett, SNL)

What is the role of ignition in lifted flame stabilization?

Page 5:

DNS of Lifted Ethylene-air Jet Flame in a Heated Coflow

• 3D slot burner configuration:

– Lx × Ly × Lz = 30 × 40 × 6 mm³ with 1.28 billion grid points

– High fuel-jet velocity (204 m/s); coflow velocity (20 m/s)

– Nozzle size for the fuel jet, H = 2.0 mm

– Re_jet = 10,000

– Cold fuel jet (18% C2H4 + 82% N2) at 550 K; stoichiometric mixture fraction ξst ≈ 0.27

– Detailed C2H4/air chemistry: 22 species, 18 global reactions, 201 steps

– Hot coflow air at 1,550 K

Ethylene-air lifted jet flame at Re=10000

Page 6:

Dynamics of lifted flame stabilization – Log(scalar dissipation) and Temperature

Page 7:

Why does this need exascale?

• Turbulent combustion consists of phenomena occurring over a wide range of scales that are closely coupled

– More grid points needed to resolve larger dynamic range of scales

– More time steps needed for better statistics and less dependence on initial condition

• Complex fuels require a larger number of equations per grid point

• In situ uncertainty quantification with adjoint sensitivity – reverse causality – uncertainties in chemical inputs

• In situ analytics/visualization

• Coupled execution (hybrid Eulerian-Lagrangian particle solver, or lockstep DNS/LES)

Page 8:

Why do we need to do co-design?

• Future algorithms, programming environments, runtimes, and hardware need to:

– Express data locality (sometimes at the expense of FLOPS) and independence

– Allow expression of massive parallelism

– Minimize data movement and reduce synchronization

– Detect and address faults

Old constraints → new constraints:

– Peak clock frequency as the primary limiter for performance improvement → Power as the primary design constraint for future HPC system design

– Cost: FLOPs are the biggest cost for the system; optimize for compute → Cost: data movement dominates; optimize to minimize data movement

– Concurrency: modest growth of parallelism by adding nodes → Concurrency: exponential growth of parallelism within chips

– Locality: MPI+X model (uniform costs within node and between nodes) → Locality: must reason about data locality and possibly topology

– Memory scaling: maintain byte-per-flop capacity and bandwidth → Memory scaling: compute growing 2x faster than capacity or bandwidth, with no global hardware cache coherence

– Uniformity: assume uniform system performance → Heterogeneity: architectural and performance non-uniformity increase

Page 9:

ExaCT Vision and Goal

• The goal of combustion exascale co-design is to consider all aspects of the combustion simulation process, from formulation and basic algorithms, to programming environments, to the hardware characteristics needed to enable combustion simulations on exascale architectures

– Interact with vendors to help define hardware requirements, with computer scientists on requirements for the programming environment and software stack, and with the applied mathematics community on locality-aware algorithms for PDEs, UQ, and analytics

• Combustion is a surrogate for a much broader range of multiphysics computational science areas

Page 10:

Petascale codes provide starting point for co-design process

• S3D

– Compressible formulation

– Eighth-order finite difference discretization

– Fourth-order Runge-Kutta temporal integrator

– Detailed kinetics and transport

– Hybrid parallel model with MPI + OpenMP

– MPI + OpenACC (directives for GPUs)

– Legion (deferred execution hides latencies)

• LMC

– Low Mach number model that exploits the separation of scales between the acoustic wave speed and fluid motion

– Second-order projection formulation

– Detailed kinetics and transport

– Block-structured adaptive mesh refinement

– Hybrid parallel model with MPI + OpenMP

Expectation is that exascale will require a new code base

Figures: LMC simulation of NOx emissions from a low swirl injector; S3D simulation of HO2 ignition marker in a lifted flame (laboratory-scale flames)

Page 11:

S3D MPI Parallelism

• 3D domain decomposition.

– Each MPI process is in charge of a piece of the 3D domain.

• All MPI processes have the same number of grid points and the same computational load

• Inter-processor communication is only between nearest neighbors in the 3D topology (see the sketch below)

– Large message sizes; non-blocking sends and receives

• All-to-all communications are only required for monitoring and synchronization ahead of I/O

• Good parallel scaling on Titan
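The communication pattern above can be sketched in a few lines. The snippet below is an illustrative C++/MPI halo exchange over a 3-D Cartesian decomposition; S3D itself is Fortran, and buffer packing and compute/communication overlap are only hinted at here.

```cpp
#include <mpi.h>
#include <vector>

// Minimal sketch of the nearest-neighbor pattern: a 3-D Cartesian
// decomposition with non-blocking halo exchange of one field's ghost
// faces (illustrative only, not the S3D source).
int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int nprocs, dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Dims_create(nprocs, 3, dims);           // factor ranks into a 3-D grid
  MPI_Comm cart;
  MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

  const int n = 32, halo = 4;                 // local points per side, stencil halo width
  const int facepts = halo * n * n;           // points in one ghost face
  std::vector<std::vector<double>> sendbufs, recvbufs;
  std::vector<MPI_Request> reqs;

  // Exchange ghost faces with the six face neighbors (x, y, z; minus and plus).
  for (int dim = 0; dim < 3; ++dim) {
    int lo, hi;
    MPI_Cart_shift(cart, dim, 1, &lo, &hi);
    for (int nbr : {lo, hi}) {
      if (nbr == MPI_PROC_NULL) continue;     // non-periodic boundary: no neighbor
      sendbufs.emplace_back(facepts);         // a real code packs stencil data here
      recvbufs.emplace_back(facepts);
      MPI_Request r;
      MPI_Irecv(recvbufs.back().data(), facepts, MPI_DOUBLE, nbr, 0, cart, &r);
      reqs.push_back(r);
      MPI_Isend(sendbufs.back().data(), facepts, MPI_DOUBLE, nbr, 0, cart, &r);
      reqs.push_back(r);
    }
  }
  // ... overlap interior computation here, then wait for the halos ...
  MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
  MPI_Finalize();
}
```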


Page 12:

What happens in the main solver?

• Computes rate of change of N conserved quantities at every grid point

– d/dt (Qk) = (Advection) + (Diffusion) + (Source)

– Sum of all the terms that contribute to the time derivative is called the RHS

• d/dt (Qk) is integrated explicitly in time through Runge-Kutta

• RHS contains multiple terms that are functions of Qk and of variables derived from Qk (see the schematic below)

• Advection and diffusion require finite differencing and MPI

• Source terms are point-wise functions

• Thermodynamic, chemical and molecular transport properties are point-wise functions of Qk
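A schematic of that structure is sketched below; the routine names, placeholder physics, and Runge-Kutta coefficients are illustrative, not S3D's actual numerics.

```cpp
#include <cstdio>
#include <vector>

// Schematic of the main solver: the RHS sums advection, diffusion and chemical
// source contributions for the conserved variables Q_k at every grid point and
// is integrated explicitly with a low-storage Runge-Kutta scheme.
using Field = std::vector<double>;

Field advection(const Field &q) { return Field(q.size(), 0.0); } // finite differences + MPI in reality
Field diffusion(const Field &q) { return Field(q.size(), 0.0); } // finite differences + MPI in reality
Field source(const Field &q)    { return Field(q.size(), 0.1); } // point-wise chemistry in reality

Field rhs(const Field &q) {
  Field a = advection(q), d = diffusion(q), s = source(q), r(q.size());
  for (std::size_t i = 0; i < q.size(); ++i) r[i] = a[i] + d[i] + s[i];
  return r;                                   // dQ_k/dt at every point
}

// One step of a 2-register low-storage explicit RK scheme (one RHS call per stage).
void step(Field &q, double dt, const std::vector<double> &A, const std::vector<double> &B) {
  Field k(q.size(), 0.0);
  for (std::size_t s = 0; s < A.size(); ++s) {
    Field r = rhs(q);
    for (std::size_t i = 0; i < q.size(); ++i) {
      k[i] = A[s] * k[i] + dt * r[i];
      q[i] += B[s] * k[i];
    }
  }
}

int main() {
  Field q(8, 1.0);                            // toy "grid": 8 values
  step(q, 1e-3, {0.0, -0.5}, {1.0, 0.5});     // illustrative 2-stage coefficients only
  std::printf("q[0] = %f\n", q[0]);
}
```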

Page 13:

Source term is the most compute intensive kernel

• Called as ckwyp or getrates

• Chemical reaction rate computed using Arrhenius model

– A + B ⇌ C + D

– Forward reaction rate = C·[A]·[B]·T^a·exp(−Ta/T) (modified Arrhenius form; see the sketch below)

– Equilibrium constant gives reverse reaction rates

– More terms for third body efficiency, collision efficiency, pressure corrections …

• The source term for a species is the sum of the rates of all reactions in which it participates

• The kernel uses exp/log heavily
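To make the arithmetic concrete, here is a minimal, hedged sketch of one reversible reaction's contribution; the coefficients are made up, and the real getrates/ckwyp kernels evaluate hundreds of such reactions per point with third-body, fall-off, and pressure corrections.

```cpp
#include <cmath>
#include <cstdio>

// Minimal sketch of an Arrhenius rate evaluation for one reversible reaction
// A + B <=> C + D (illustrative coefficients, not a real mechanism).
struct Reaction {
  double A, beta, Ta;   // pre-exponential, temperature exponent, activation temperature
  double Kc;            // equilibrium constant (from thermodynamics in a real code)
};

// Net rate of progress q = kf*[A][B] - kr*[C][D], with kr = kf/Kc.
double rate_of_progress(const Reaction &r, double T,
                        double cA, double cB, double cC, double cD) {
  const double kf = r.A * std::pow(T, r.beta) * std::exp(-r.Ta / T); // heavy exp/pow use
  const double kr = kf / r.Kc;
  return kf * cA * cB - kr * cC * cD;
}

int main() {
  Reaction r{1.0e9, 0.0, 15000.0, 50.0};       // illustrative values only
  double wdot[4] = {0, 0, 0, 0};               // production rates of A, B, C, D
  const double q = rate_of_progress(r, 1500.0, 0.2, 0.2, 0.05, 0.05);
  // Each species' source term accumulates (stoichiometric coefficient) * q
  // over every reaction in which it participates.
  wdot[0] -= q; wdot[1] -= q; wdot[2] += q; wdot[3] += q;
  std::printf("net rate of progress = %e\n", q);
}
```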

Page 14:

Hybridizing S3D: OLCF CAAR Early Science Project

• Collaboration between Cray (John Levesque), NVIDIA (Greg Ruetsch, Cliff Woolley), NREL (Ray Grout) and ORNL (Ramanan Sankaran)

• Significant restructuring of S3D to expose node-level parallelism

– Movement of outer loops to the highest level in RHS

– Combined several point-wise physics computations within the same OpenMP structure

– Reordered computation and communication to achieve maximum overlap

– Restructured computation to minimize memory operations and vectorized all loops that reside on the accelerator

• Minimal data communication between the host and the accelerator, controlled with asynchronous updates

• Resulting code is hybrid MPI+OpenMP and MPI+OpenACC (-DGPU only changes directives)

• 6-fold performance improvement on Titan over Jaguar

Levesque et al. SC’12
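A toy C-style illustration of the single-source idea follows; the actual S3D is Fortran and its loops are far more involved, but it shows how the same point-wise loop can carry either OpenACC or OpenMP directives, selected by a -DGPU compile flag.

```cpp
#include <vector>

// Toy single-source hybrid loop: compiled with -DGPU the OpenACC directive is
// used, otherwise the OpenMP one; the body is a stand-in for the fused
// point-wise property/chemistry evaluations described above.
void pointwise_physics(const double *T, double *wdot, int n) {
#ifdef GPU
  #pragma acc parallel loop copyin(T[0:n]) copyout(wdot[0:n])
#else
  #pragma omp parallel for
#endif
  for (int i = 0; i < n; ++i) {
    wdot[i] = 2.0 * T[i];   // placeholder for the real point-wise physics
  }
}

int main() {
  std::vector<double> T(1000, 300.0), wdot(1000);
  pointwise_physics(T.data(), wdot.data(), static_cast<int>(T.size()));
}
```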

Page 15:

Day 1 Science with S3D OpenACC on Titan

• Extinction/re-ignition in a turbulent dimethyl ether (DME) jet flame

– Largest-Reynolds-number reacting DNS with complex chemistry (32 species for DME, an oxygenated biofuel)

– Enabled comparison with a companion experiment

– Validated an experimental diagnostic for a surrogate of peak heat release based on product imaging of CH2O and OH

– Exploring the causality between turbulence and chemistry in the re-ignition process

The logarithm of the scalar dissipation rate (that is, the local mixing rate), where white denotes high mixing rates and red lower mixing rates

Bhagatwala et al., Proc. Combust. Inst. (2014)

Page 16:

In Situ Uncertainty Quantification Guided by Analytics

• Topological segmentation and tracking

– Distance field (level set)

• Statistics

– Filtering and averaging (spatial and temporal)

– Statistical moments (conditional)

– Statistical dimensionality reduction (joint PDFs)

– Spectra (scalar, velocity, coherency)

– Chemical explosive mode

• Uncertainty in reaction rates characterizing ignition/extinction events that control fuel efficiency and emissions with respect to uncertainties in input chemical and transport parameters

• Solve adjoint equations backward in time: need the primal state at all times

• Exploit space-time locality guided by analytics to bound regions of interest
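A hedged sketch of why the backward sweep needs the primal state, using generic notation not taken from the slides:

```latex
% Primal system and objective (generic notation):
\[ \frac{dQ}{dt} = R(Q;\theta), \qquad J(\theta) = \int_0^T g(Q)\,dt . \]
% Adjoint equation, integrated backward from t = T to t = 0:
\[ -\frac{d\lambda}{dt} = \left(\frac{\partial R}{\partial Q}\right)^{\!T}\lambda
      + \left(\frac{\partial g}{\partial Q}\right)^{\!T}, \qquad \lambda(T) = 0 . \]
% Sensitivity of J to the chemical/transport inputs \theta:
\[ \frac{dJ}{d\theta} = \int_0^T \lambda^{T}\,\frac{\partial R}{\partial \theta}\,dt . \]
% Both Jacobians are evaluated at the primal state Q(t), so Q must be available
% at every time during the backward sweep -- hence the analytics-guided
% space-time localization of the adjoint computation.
```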

Page 17:

Petascale Workflow Model Won’t Scale

• I/O bandwidth constraints make it infeasible to save all raw simulation data to persistent storage – the workflow must integrate simulation, UQ, and analysis

• Challenge: co-design a workflow that supports smart placement of the solver + UQ and analytics, reducing checkpoint size with in-situ and in-transit analytics

Diagram: combustion simulation on O(1B) cores writing O(1 PB) per dump every 30 min (1 min), O(1–10 EB) per run, via synchronous I/O to analysis and visualization

Performing the simulation is not enough – need to analyze the results

Page 18:

Proxy Architectures and Apps are a key Capability for HPC Co-design

• Mini-applications and proxy apps

– Express the parts of the application that are essential to the science

– Express pain points for performance

– Hide complexities of REAL apps that are inconsequential to the science

• Proxy hardware models are a reflection of the underlying machine architecture

– Express what is important for performance

– Hide complexity that is not consequential to performance

• Use proxy applications on proxy hardware architectures to understand which choices offer the highest value (using modeling and simulation)

– Not absolute performance predictions

– Scientific experiments that gauge relative improvements over a control

• Still need full applications

Diagram: applications and architectures paired with proxy apps and proxy architectures, connected through models/measurements and architecture simulators

Page 19:

Overall Workflow Captured Through Proxy and Full Apps

Exascale machine: compressible and low-Mach reacting-flow solvers write reduced checkpoints and reduced data to disk, and feed descriptive statistics, visualization, topological analysis, and UQ

• Meta-skeletal workflow proxy app

• New UQ + topology proxy for the adjoint-solution data flow

Codes and proxy apps shown (www.exactcodesign.org): CNS, SSA (streaming statistics), PVR (parallel volume rendering), RTC (reduced topology), LMC, multi-grid, chemistry, UQ, SMC, S3D, Castro

Page 20:

ExaSAT: Analysis of Proxy Applications

• Extracts floating-point instruction mix and estimates CPU time

• Analyzes state-variable accesses and estimates register-spill impact on L1 cache traffic

Chemistry kernel – instruction mix: 3,620 adds, 4,303 muls, 540 divs, 710 transcendentals; breakdown of CPU time: adds 3%, muls 4%, divs 18%, transcendentals 75%

Dynamics kernel – instruction mix: 14,509 adds, 16,447 muls, 387 divs, 1 sqrt; breakdown of CPU time: adds 34%, muls 34%, divs 32%

Chart: chemistry floating-point state variables ranked by number of accesses, for mechanisms with 9, 21, 53, 71, and 107 species

Page 21:

Impact of Software Optimization on SMC Dynamics Computation

• Data reuse analysis shows impact of software optimizations on memory traffic for various cache sizes

• Large potential benefit from increased memory bandwidth and fast exponentials

• Trends can be investigated in further detail using hardware simulators

Chart (bandwidth tapering): bytes per flop vs. cache size (kB) for the baseline, + best blocking, and + best fusion variants

Chart (estimated performance benefits): estimated teraflops vs. number of species (9, 21, 53, 71, 107) for the baseline and cumulative optimizations: + cache blocking, + loop fusion, + fast memory (4 TB/s), + fast div, + fast exp, + fast NIC (400 GB/s)

Page 22:

Programming Environment Critical to Performance

Effective use of exascale hardware will require programming environment that effectively maps algorithms to hardware

• Driven by programmability of combustion applications and characterization of algorithms on different designs of architectures

– Simplify programming to express locality and independence

– Simplify programming of block-structured PDEs, analytics, and UQ for performance, scalability, and portability on heterogeneous architectures with high variability, while still maintaining readability

Page 23:

Legion Approach

• Capture the structure of program data

• Decouple specification from mapping

• Automate data movement, parallelism discovery, synchronization, hiding long latency

Goal: a performance-oriented programming model with a raised abstraction for productivity and portability.

(Bauer, Treichler, Slaughter, Aiken – Structure Slicing: Extending Logical Regions with Fields, Thursday Nov 20, 1:30–2:00, Rm 388-390)

http://legion.stanford.edu

Page 24:


Describing Data

• Logical regions (array of objects):

– Have no implied layout

– Have no implied location

• Described by:

– Index space (set of keys)

– Field space (set of fields)

• Operations include:

– Partitioning into subregions

– Slicing by fields

Figure: a logical region depicted as an index space (points p0–p11) crossed with a field space (fields f0–f3)
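As a concrete but hedged sketch, the snippet below uses the current Legion C++ runtime API to build an index space, a field space, a logical region over them, and an equal partition. The field names and sizes are illustrative, and the exact calls may differ from the 2014-era interface shown in the talk.

```cpp
#include "legion.h"
#include <vector>
using namespace Legion;

enum TaskIDs { TOP_LEVEL_TASK_ID };
enum FieldIDs { FID_TEMP, FID_PRESSURE };   // illustrative field names

void top_level_task(const Task *, const std::vector<PhysicalRegion> &,
                    Context ctx, Runtime *rt) {
  // Index space: the set of keys (here 1-D points 0..1023); no layout implied.
  IndexSpace is = rt->create_index_space(ctx, Rect<1>(0, 1023));
  // Field space: the set of fields stored for each point.
  FieldSpace fs = rt->create_field_space(ctx);
  {
    FieldAllocator fa = rt->create_field_allocator(ctx, fs);
    fa.allocate_field(sizeof(double), FID_TEMP);
    fa.allocate_field(sizeof(double), FID_PRESSURE);
  }
  // Logical region: index space x field space, with no implied location.
  LogicalRegion lr = rt->create_logical_region(ctx, is, fs);
  // Partition the index space into 8 equal subregions; slicing by fields is
  // expressed per task launch through field selections.
  IndexSpace colors = rt->create_index_space(ctx, Rect<1>(0, 7));
  IndexPartition ip = rt->create_equal_partition(ctx, is, colors);
  LogicalPartition lp = rt->get_logical_partition(ctx, lr, ip);
  (void)lp;
}

int main(int argc, char **argv) {
  Runtime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
  TaskVariantRegistrar reg(TOP_LEVEL_TASK_ID, "top_level");
  reg.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
  Runtime::preregister_task_variant<top_level_task>(reg, "top_level");
  return Runtime::start(argc, argv);
}
```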

Page 25:


Legion Tasks

• Hierarchical Tree of Tasks

• Legion tasks specify:

– Region usage

– Field usage

– Access modes

• Legion runtime:

– Infers data dependences

– Inserts copies

Figure: Task 1 uses Region R, Field A, read-write; Task 2 uses Region R, Field B, read-only; the tasks are mapped to Node 0 and Node 1 and the runtime inserts the required copy
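A companion sketch of how such declarations look in the Legion C++ API: a hypothetical helper launches two tasks on the same region with different fields and privileges, leaving the runtime to infer the dependence (or its absence) and insert any copies. Task-variant registration is omitted and the IDs are made up.

```cpp
#include "legion.h"
using namespace Legion;

enum { UPDATE_TASK_ID = 1, ANALYZE_TASK_ID = 2 };   // hypothetical task IDs
enum { FID_A = 100, FID_B = 101 };                  // hypothetical field IDs

// Declare what each task touches; the Legion runtime infers dependences and
// inserts copies. Task 1 gets read-write privileges on field A of region lr,
// task 2 read-only privileges on field B, so the two may run concurrently
// (possibly on different nodes, with the runtime moving data as needed).
void launch_rhs_tasks(Context ctx, Runtime *rt, LogicalRegion lr) {
  TaskLauncher t1(UPDATE_TASK_ID, TaskArgument(nullptr, 0));
  t1.add_region_requirement(RegionRequirement(lr, READ_WRITE, EXCLUSIVE, lr));
  t1.add_field(0, FID_A);
  rt->execute_task(ctx, t1);

  TaskLauncher t2(ANALYZE_TASK_ID, TaskArgument(nullptr, 0));
  t2.add_region_requirement(RegionRequirement(lr, READ_ONLY, EXCLUSIVE, lr));
  t2.add_field(0, FID_B);
  rt->execute_task(ctx, t2);   // returned futures are ignored in this sketch
}
```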

Page 26:

Legion Mapping

• Mapping happens at runtime:

– Assign task to processors

– Assign regions to memories

• With mapping API, programmer selects:

– Where tasks run

– Where data is placed

Figure: a task tree (t1–t5) and a region tree (rc; rw with subregions rw1, rw2; rn with subregions rn1, rn2) mapped onto a node with x86 cores and their caches, NUMA memories, a CUDA GPU with framebuffer (FB), and DRAM

Page 27:

Application: S3D

• Current state-of-the-art combustion DNS simulation

– Written in 200K lines of Fortran

– Direct numerical solver using explicit methods

– Many phases, lots of parallelism

Images: ξ, OH, HO2, and CH2O fields

Page 28:

S3D Task Parallelism

• One call to Right-Hand-Side-Function (RHSF) as seen by the Legion runtime

– Called 6 times per time step by the Runge-Kutta solver

– Width of the task graph == available task parallelism

– H2 mechanism (only 9 species)

– Heptane (52 species) is significantly wider

• Manual task scheduling would be difficult!

Page 29:

S3D Execution Profile

• Two custom mappers

– Schedule tasks onto CPUs and GPUs

– Prioritize tasks on critical path of RHSf

Figure: Legion execution of RHSF on one node – timeline rows for the GPUs, the CPU cores, and the Legion runtime core running dependence-analysis tasks

Page 30:

Leaf Tasks

• Legion treats tasks as black boxes

– Doesn’t care how tasks are written

• Still need fast leaf tasks for computationally expensive chemistry, diffusion, viscosity

– For CPUs & GPUs

– For multiple mechanisms

• Singe* is a DSL compiler for chemistry kernels

Diagram: the Singe compiler takes CHEMKIN, TRANSPORT, and THERMO inputs and generates SSE, AVX, and CUDA kernels

*Bauer et al. PPoPP’14

Page 31:

Combustion Challenges

• GPU programming models emphasize data parallelism

– Not always the best choice for performance

• Large working sets (per point)

– nHeptane chemistry needs 566 double precision reaction rates (per point)

– The GPU register file can only store 128 per thread (see the estimate below)

• Multi-phase computations

– Fissioned kernels are limited by memory bandwidth and are slow
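A rough working-set estimate implied by those numbers (assuming 8-byte doubles and the 566 and 128 figures quoted on this slide):

```latex
% Working set per grid point vs. per-thread register capacity:
\[ 566 \times 8\ \text{B} \approx 4.5\ \text{kB per point}, \qquad
   128 \times 8\ \text{B} = 1\ \text{kB per thread}, \qquad
   \frac{566}{128} \approx 4.4 . \]
% A purely data-parallel kernel therefore cannot keep the intermediate reaction
% rates in registers and must spill them to slower memory.
```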

Phases: compute reaction rates → QSSA → stiffness → accumulate outputs

Page 32:

Warp Specialization

• Leverage knowledge of underlying hardware

– GPUs execute warps: streams of 32-wide vector instructions

– All threads in warp execute the same program (data parallel unit)

• Each warp can run different computation

– Generate code that specializes each warp (leveraging task parallelism)

– Different warps do different computations on the same data

– Allows much better use of memory while keeping processors busy

– Fit large working sets on chip

Diagram: data parallel – warps 0–2 all run the same program P over the input data; warp specialization – the program is partitioned into P0, P1, P2 and each warp runs a different partition over the same input data

Page 33:

Singe Warp-Specialized Chemistry Kernel

Diagram: four warps (0–3) each evaluate a subset of the chemical reactions (Chem Reacs 1–6), exchange reaction rates through shared memory, handle interleaved QSSA species (QSS 1–15) and stiffness corrections, and perform the output math

Page 34:

Performance Results

• Chemistry Kernel

– All Singe kernels significantly faster than current production versions

– Warp specialized SINGE code is up to 3.75 times faster than previously optimized data-parallel CUDA kernels

• Multi-node heterogeneous testbeds for Legion S3D:

– Keeneland: 128 nodes, each with 16 Sandy Bridge cores and 3 Fermi GPUs

– Titan: 18K nodes, each with 16 Interlagos cores and 1 K20 GPU

Page 35:

S3D Performance Comparison

• Compare against MPI+OpenACC code on Titan

– Tuned by Cray and NVIDIA engineers with ORNL/NREL domain experts

• OpenACC runs 48³ and 64³ grid points per node for DME and heptane

– Fixed mapping (most compute on the GPU)

• Legion runs 48³, 64³, and 96³ grid points per node for any mechanism (DME, heptane, PRF)

– Try both All-GPU and mixed CPU+GPU mappings

• Use re-ranking script for runs >= 1024 nodes

Page 36:

Legion S3D Performance on Titan (weak scaling)

Chart: Legion S3D weak scaling on Titan, n-heptane mechanism (52 species)

DME: 1.71x–2.33x faster between 1024 and 8192 nodes; n-heptane (64³ per node): 1.88x–2.85x faster between 1024 and 8192 nodes

Page 37:

Exascale Use Cases: Science at Relevant Conditions

• Homogeneous charge compression ignition (HCCI engine combustion) – a ‘chemical’ engine with high diesel-like efficiency without NOx and soot; tailor the charge stratification to control ignition and burn rate

• Turbulent jet flames (swirl, transverse, cavity) – low-swirl injector gas turbines with staged lean premixed combustion, flame stabilization, emissions

• Lifted diesel jet flames – lifted autoignitive diesel jet flame stabilization with multi-stage ignition fuels

• Need to include UQ with respect to chemistry and transport properties

• Extrapolation of current capability shows that these cases will require exascale-level resources

Page 38:

Thanks!

• Questions?

[email protected]

• www.exactcodesign.org