
Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement

SIAM PP 2014, Portland

February 19, 2014

Florian Schornbaum, Ehsan Fattahi, David Staubach, Christian Godenschwager, Martin Bauer, Ulrich Rüde

Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany


Outline

Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014

• Introduction
  • The waLBerla Simulation Framework
  • The Lattice Boltzmann Method
• Uniform Grids
  • Domain Decomposition & Parallelization
  • Performance / Benchmarks
• Statically Refined Grids
  • Lattice Boltzmann & Grid Refinement
  • Domain Decomposition & Load Balancing
  • Performance / Benchmarks
• Outlook on Dynamic Grid Refinement / Conclusion


Introduction

• waLBerla (widely applicable Lattice Boltzmann framework from Erlangen):
  • main focus on CFD (computational fluid dynamics) simulations based on the lattice Boltzmann method (LBM)
  • at its very core designed as an HPC software framework:
    • scales from laptops to current petascale supercomputers
    • largest simulation: 1,835,008 processes (IBM Blue Gene/Q @ Jülich)
    • hybrid parallelization: MPI + OpenMP
    • vectorization of compute kernels
  • written in C++(11)
  • support for different platforms (Linux, Windows) and compilers (GCC, Intel XE, Visual Studio, llvm/clang, IBM XL)
  • coupling with in-house rigid body physics engine pe
  • open source → http://www.walberla.net


• The lattice Boltzmann method:
  • regular grid with multiple particle distribution functions (= scalar values) per cell [D2Q9, D3Q19, D3Q27, …]
  • explicit method → time stepping
  • two steps: stream (neighbors) & collide (cell-local)
• For the collision, different operators exist: SRT, TRT, MRT.
• Macroscopic quantities (velocity, density, …) can be calculated from the particle distribution functions.
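The two steps named above can be sketched in a few lines. This is a minimal illustration only, assuming a D2Q9 lattice, the SRT collision operator, and a fully periodic domain; it is not waLBerla code, and all names (`C`, `W`, `macroscopic`, `equilibrium`, `collide`, `stream`) are hypothetical.

```python
# D2Q9 lattice: 9 discrete velocities and their standard weights.
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def macroscopic(cell):
    """Density and velocity as moments of the 9 distribution functions of one cell."""
    rho = sum(cell)
    ux = sum(f * c[0] for f, c in zip(cell, C)) / rho
    uy = sum(f * c[1] for f, c in zip(cell, C)) / rho
    return rho, ux, uy


def equilibrium(rho, ux, uy):
    """Second-order equilibrium distribution for the given density and velocity."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def collide(grid, omega):
    """Cell-local SRT step: relax every cell towards its own equilibrium."""
    new_grid = []
    for row in grid:
        new_row = []
        for cell in row:
            feq = equilibrium(*macroscopic(cell))
            new_row.append([f + omega * (fe - f) for f, fe in zip(cell, feq)])
        new_grid.append(new_row)
    return new_grid


def stream(grid):
    """Neighbor step: move each value to the adjacent cell in its direction (periodic)."""
    ny, nx = len(grid), len(grid[0])
    out = [[[0.0] * 9 for _ in range(nx)] for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            for q, (cx, cy) in enumerate(C):
                out[(y + cy) % ny][(x + cx) % nx][q] = grid[y][x][q]
    return out
```

One time step is then `grid = stream(collide(grid, omega))`, where `omega` is the relaxation rate. Only `stream` touches neighboring cells, which is why it is the step that needs data from adjacent blocks in a parallel run.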

[Figure: illustration of the stream step and the collide step]


Uniform Grids

• Domain Decomposition:
  • regular decomposition into blocks containing uniform grids [special case of our much more general forest of octrees data structure → non-uniform/refined grids]
• Parallelization:
  • data exchange on borders between blocks via ghost layers

[Figure: ghost layer exchange between a sender process and a receiver process]
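The ghost layer idea can be illustrated with two neighboring 1D blocks. This is a minimal sketch under simplifying assumptions (one ghost cell per side, both blocks in the same address space); `Block1D` and `exchange` are hypothetical names, and in a distributed run each copy would be a message from the sender process to the receiver process.

```python
class Block1D:
    """A 1D block: interior cells plus one ghost cell on each side."""

    def __init__(self, interior):
        self.cells = [None] + list(interior) + [None]

    def interior(self):
        return self.cells[1:-1]


def exchange(left, right):
    """Copy the border cells of each block into the neighbor's ghost layer."""
    left.cells[-1] = right.cells[1]   # right ghost of the left block
    right.cells[0] = left.cells[-2]   # left ghost of the right block
```

After the exchange, a stencil applied to any interior cell of a block can read its neighbors without knowing that some of them live on another block.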

• geometry given by surface mesh
• domain decomposition into blocks
• empty blocks are discarded
• load balancing

Load balancing can be based on either space-filling curves (Z-order/Morton order, Hilbert curve) using the underlying forest of octrees or graph partitioning (METIS, …), whichever best fits the needs of the simulation.

[Figure: flow simulation only inside the geometry (example: complex geometry of an artery)]
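Curve-based load balancing can be sketched as follows. The example assumes Z-order (Morton order) on integer block coordinates and equal weight per block; it illustrates the general technique, not the framework's implementation, and `morton3d` and `balance` are hypothetical names. Empty blocks are assumed to have been discarded before the call.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of x, y, z into a single Z-order (Morton) index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def balance(blocks, num_processes):
    """Sort non-empty blocks along the curve and cut it into equal pieces.

    blocks: iterable of (x, y, z) block coordinates.
    Returns a dict mapping each block to the rank of its process.
    """
    ordered = sorted(blocks, key=lambda b: morton3d(*b))
    n = len(ordered)
    return {b: i * num_processes // n for i, b in enumerate(ordered)}
```

Because blocks that are close on the curve tend to be close in space, each process receives a reasonably compact set of blocks.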


• allocation of block data (→ grids)

The domain decomposition and load balancing can be performed during the actual simulation … OR …


… in a separate step: separation of domain partitioning from simulation via a file on DISK (file size: kilobytes to few megabytes)


All of this (the entire pipeline) works just the same when grid refinement is used.

Uniform Grids - Performance

• Benchmark Environments:
  • JUQUEEN (TOP500: 8)
    • Blue Gene/Q, 459K cores, 1 GB/core
    • compiler: IBM XL / IBM MPI
  • SuperMUC (TOP500: 10)
    • Intel Xeon, 147K cores, 2 GB/core
    • compiler: Intel XE / IBM MPI
• Benchmarks (LBM D3Q19):
  • lid-driven cavity: weak scaling (= const. number of cells per core)
  • coronary artery tree: strong scaling (= const. total number of cells)

⇒ C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, and U. Rüde, A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries, SC13, Denver

• SuperMUC – single node (3.34 million cells per core)

⇒ LBM: low FLOPs to bytes ratio → memory intensive

[Figure: MLUP/s (axis 0 to 180) over 1, 2, 4, 8, 12, 16 cores for the LBM compute kernel types SRT and TRT; annotation on the kernel types: more complex, but also much more relevant for physical simulations!]

MLUP/s: million/mega lattice cell updates per second

• SuperMUC – single node (3.34 million cells per core)

⇒ limited by memory bandwidth

[Figure: MLUP/s (axis 0 to 180) over 1, 2, 4, 8, 12, 16 cores; legend: SRT, TRT, SRT, SRT; annotations: naïve, straightforward implementation; already quite optimized!; vectorized compute kernel; bandwidth limit]

• JUQUEEN – single node (1.73 million cells per core)

⇒ limited by memory bandwidth

[Figure: MLUP/s (axis 0 to 80) over 1, 2, 4, 8, 16 cores; legend: SRT, TRT, SRT, SRT; annotations: naïve, straightforward implementation; already quite optimized!; vectorized compute kernel; bandwidth limit]

• SuperMUC – TRT kernel (3.34 million cells per core)

[Figure: MLUP/s per core (axis 0 to 10) over 32, 8192, 147,456 cores for the configurations 16P 1T, 4P 4T, 2P 8T (P = #processes per node, T = #threads per process)]

0.99 × 10^12 cells updated per second! (19 values per cell)

• JUQUEEN – TRT kernel (1.73 million cells per core)

[Figure: MLUP/s per core (axis 0 to 5) over 32, 512, 8192, 458,752 cores for the configurations 64P 1T, 16P 4T, 8P 8T (P = #processes per node, T = #threads per process)]

2.1 × 10^12 cells updated per second! (19 values per cell)

⇒ 40 × 10^12 values updated per second!

⇒ 0.41 PFlop/s (0.41 × 10^15 Flop/s)
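The headline numbers can be cross-checked with simple arithmetic. The inputs below are the figures quoted on the slides; the derived per-core rates and the Flop count per cell update are only what those figures imply, not separately reported measurements.

```python
# Figures quoted on the slides.
juqueen_cells_per_second = 2.1e12
juqueen_cores = 458_752
juqueen_flops = 0.41e15
supermuc_cells_per_second = 0.99e12
supermuc_cores = 147_456
values_per_cell = 19  # D3Q19

# Derived quantities.
values_per_second = juqueen_cells_per_second * values_per_cell      # about 40e12
juqueen_mlups_per_core = juqueen_cells_per_second / juqueen_cores / 1e6      # about 4.6
supermuc_mlups_per_core = supermuc_cells_per_second / supermuc_cores / 1e6   # about 6.7
flops_per_cell_update = juqueen_flops / juqueen_cells_per_second    # about 195
```

Both per-core rates fall inside the axis ranges of the corresponding scaling plots (0 to 5 for JUQUEEN, 0 to 10 for SuperMUC).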

• SuperMUC – TRT kernel (2.1 million fluid cells)

[Figure: time steps / sec (axis 0 to 7,000) over 32, 128, 512, 2048, 8192, 32,768 cores]

“extreme” strong scaling: more processes = shorter time to solution!

• JUQUEEN – TRT kernel (16.9 million fluid cells)

[Figure: time steps / sec (axis 0 to 1000) over 512, 2048, 8192, 32,768, 262,144 cores]

“extreme” strong scaling: more processes = shorter time to solution!
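Why this strong scaling is called “extreme” becomes clear when the total number of fluid cells is divided by the largest core count of each plot. The snippet only does that division with the numbers from the slides; the resulting averages are derived here and ignore how cells are actually grouped into blocks.

```python
# Total fluid cells and largest core count, as quoted on the slides.
supermuc_cells, supermuc_cores = 2.1e6, 32_768
juqueen_cells, juqueen_cores = 16.9e6, 262_144

supermuc_cells_per_core = supermuc_cells / supermuc_cores   # about 64
juqueen_cells_per_core = juqueen_cells / juqueen_cores      # about 64
```

At the right end of both plots, each core is left with only a few dozen fluid cells on average.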


Statically Refined Grids

• Lattice Boltzmann & Grid Refinement:
  • Almost all grid refinement schemes for the lattice Boltzmann method rely on a 2:1 balance between neighboring cells.
  • waLBerla now uses a massively parallel implementation of the refinement scheme presented in [1] (also relies on 2:1 balance)
  • consequences of the 2:1 balance:
    • twice as many time steps on the fine grid as on the next coarser grid
    • In 3D, for each finer grid level, the memory requirement increases by a factor of 8 and the generated workload by a factor of 16.

[1] Zhao Yu and Liang-Shih Fan, An interaction potential based lattice Boltzmann method with adaptive mesh refinement (AMR) for two-phase flow simulation, J. Comput. Phys. 228, 17 (September 2009), 6456-6478
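The factors 8 and 16 follow directly from the 2:1 balance: halving the cell size in each of three dimensions gives 2³ = 8 times as many cells for the same region of space, and each of those cells is updated twice as often. A small sketch (the function name `refinement_cost` is hypothetical):

```python
def refinement_cost(level, dim=3):
    """Relative memory and workload of a region refined `level` times.

    Returns (memory_factor, workload_factor) relative to level 0.
    """
    cells = (2 ** dim) ** level      # 2x resolution per dimension and level
    time_steps = 2 ** level          # twice as many time steps per level
    return cells, cells * time_steps
```

For example, `refinement_cost(1)` gives `(8, 16)`, and three levels of refinement give `(512, 4096)`.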

• Domain Decomposition:
  • “blocks containing regular grids distributed among all processes”
    ⇒ distributed forest of octrees (→ 2:1 balance)
  • Each process only knows about its own blocks and their neighbors, but has no information about the rest of the domain.
    ⇒ perfectly distributed data structure that scales to huge numbers of processes without any runtime overhead
• Property of this Distributed Data Structure:
  • can be viewed as a graph: each block is connected to all of its neighboring blocks → enables all kinds of graph algorithms
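The graph view can be made concrete with a small sketch. The `Block` record and the `hops` function are hypothetical and only illustrate the idea of blocks as nodes and neighborhood relations as edges. For simplicity, the whole graph is held in one dictionary here, whereas in the distributed structure described above each process would hold only its own blocks and their neighbor information.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Block:
    block_id: int
    level: int                      # grid level of the block
    owner: int                      # rank of the process that owns the block
    neighbors: list = field(default_factory=list)  # ids of adjacent blocks


def hops(blocks, start, goal):
    """Breadth-first search: number of block-to-block hops from start to goal."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, dist = queue.popleft()
        if current == goal:
            return dist
        for n in blocks[current].neighbors:
            if n not in seen:
                seen.add(n)
                queue.append((n, dist + 1))
    return None  # goal is not reachable from start
```

Any other graph algorithm, such as a graph partitioner, can operate on the same neighbor lists.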

• Comparison with Uniform LBM:
  • The setup phase remains unchanged: The decoupling (via a file) of the domain decomposition & initial load balancing from the actual simulation is still possible.
  • All the refinement “magic” happens during communication (fine and coarse blocks share ghost layer regions → refinement requires multiple ghost layers per block†).
  • Time stepping becomes more complicated (simple, uniform LBM: boundary handling → stream & collide → communication).
  • All the compute kernels remain unchanged!

† Even though for the computation/interpolation multiple ghost layers are required, the communication is heavily optimized and the average number of bytes communicated per block only slightly increases compared to a uniform simulation.
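Why time stepping becomes more complicated can be seen from a recursive sketch in which every level performs two steps for each step of the next coarser level. This illustrates only the nesting that follows from the 2:1 balance. It is an assumption for illustration, not the actual scheme of the framework, and it leaves out boundary handling, communication, and interpolation between levels. The name `advance` is hypothetical.

```python
def advance(level, finest_level, log):
    """Record one step on `level`, then two steps on the next finer level."""
    log.append(level)  # stands in for stream & collide on all blocks of this level
    if level < finest_level:
        advance(level + 1, finest_level, log)
        advance(level + 1, finest_level, log)
    return log
```

With three levels, `advance(0, 2, [])` returns `[0, 1, 2, 2, 1, 2, 2]`: one coarse step, two steps on level 1, and four steps on level 2.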

• Load Balancing (adapted to our LBM refinement implementation):
  • one space filling curve (Hilbert curve) per grid level

[Figure: forest of octrees in which every block is labeled with its grid level (0 to 3)]

Example: 24 blocks on four different levels (forest with 3 trees)


[Figure: the blocks of the example sorted by grid level, one row per level (level 0: 1 block, level 1: 5 blocks, level 2: 10 blocks, level 3: 8 blocks); numbers represent corresponding grid levels]


• “even” distribution to all available processes

Example (cont.): distribution to six available processes - level by level (number of blocks per process)

process:  0  1  2  3  4  5
level 3:  2  2  1  1  1  1
level 2:  1  1  2  2  2  2
level 1:  1  1  1  1  1  0
level 0:  0  0  0  0  0  1
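A level-by-level distribution of this kind can be sketched as follows. The rule used here is an assumption chosen for illustration, not the framework's documented algorithm: on every level the blocks are cut into contiguous pieces along that level's curve, and any remainder goes to the processes that hold the fewest blocks so far. The function name `distribute_level_by_level` is hypothetical.

```python
def distribute_level_by_level(blocks_per_level, num_processes):
    """Distribute the blocks of each level separately, finest level first.

    blocks_per_level: dict mapping level -> list of block ids in curve order.
    Returns a dict mapping level -> list of per-process block lists.
    """
    load = [0] * num_processes  # blocks assigned to each process so far
    result = {}
    for level in sorted(blocks_per_level, reverse=True):
        blocks = blocks_per_level[level]
        base, extra = divmod(len(blocks), num_processes)
        # Remainder blocks go to the least loaded processes (ties: lower rank).
        by_load = sorted(range(num_processes), key=lambda p: (load[p], p))
        gets_extra = set(by_load[:extra])
        chunks, start = [], 0
        for p in range(num_processes):
            size = base + (1 if p in gets_extra else 0)
            chunks.append(blocks[start:start + size])
            start += size
            load[p] += size
        result[level] = chunks
    return result
```

Applied to the example (8, 10, 5, and 1 blocks on levels 3 to 0, six processes), this rule yields the per-level block counts of the example distribution, with four blocks on every process. Balancing each level separately matters because the levels are updated at different rates during time stepping.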

• Benchmark Environments:
  • JUQUEEN (TOP500: 8)
    • Blue Gene/Q, 459K cores, 1 GB/core
    • compiler: IBM XL / IBM MPI
  • SuperMUC (TOP500: 10)
    • Intel Xeon, 147K cores, 2 GB/core
    • compiler: Intel XE / IBM MPI
• Benchmark (LBM D3Q19 TRT):
  • lid-driven cavity
  • weak scaling
  • 6.6875 blocks per process
  • 4 grid levels
  • finest level … covers 1.4% of space … generates 80% of workload
  • coarsest level … covers 78% of space … generates 1.1% of workload
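These percentages are consistent with the factor of 16 in workload per refinement level. The check below assumes that the four levels are numbered 0 (coarsest) to 3 (finest) and that workload is proportional to the covered volume times 16 per level; both are assumptions for illustration, and the slide values are rounded.

```python
def relative_workload(space_fraction, level):
    """Workload of a level relative to unrefined space: volume share times 16^level."""
    return space_fraction * 16 ** level


finest = relative_workload(0.014, 3)     # 1.4% of space on level 3
coarsest = relative_workload(0.78, 0)    # 78% of space on level 0

model_ratio = finest / coarsest          # ratio predicted by the 16x rule
slide_ratio = 80 / 1.1                   # ratio of the quoted workload shares
```

Both ratios come out at a little over 70, so the tiny finest region dominating the workload is exactly what the 2:1 balance predicts.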

• SuperMUC – TRT kernel (3.34 million cells per core)

[Figure: MLUP/s per core (axis 0 to 5) over 32, 8192, 147,456 cores for the configurations 16P 1T, 4P 4T, 2P 8T (P = #processes per node, T = #threads per process)]

• JUQUEEN – TRT kernel (1.81 million cells per core)

[Figure: MLUP/s per core (axis 0 to 2.5) over 32, 512, 8192, 262,144 cores for the configurations 64P 1T, 16P 4T, 8P 8T (P = #processes per node, T = #threads per process)]

• LBM with refinement achieves only half as many cell updates per second as regular LBM without refinement.
• Refinement Overhead:
  • much more complex communication patterns which additionally involve interpolation
  • more complex time stepping scheme
  • multiple blocks per process ⇔ smaller blocks

BUT: LBM with grid refinement requires much less memory and far fewer cells (100 times fewer ...)!


• Example Application: A Study of the Vocal Fold

⇒ http://youtu.be/kUPf__THVZs


DNS (direct numerical simulation)

Reynolds number: 1000 / D3Q19 TRT

4300 processes @ SuperMUC

101,466,432 fluid cells

25,800 blocks with 16 x 16 x 16 cells

number of different grid levels: 5

95.5 time steps / sec (finest grid)

total number of time steps (finest grid): 864,000

without refinement: 55.2 times more memory and 98.6 times the workload
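A few derived quantities help put these numbers in perspective. The inputs are the figures quoted for the vocal fold study. The runtime estimate additionally assumes that the quoted rate of 95.5 time steps per second held for the whole run, which the slides do not state.

```python
# Figures quoted for the vocal fold study.
blocks = 25_800
cells_per_block = 16 ** 3
processes = 4300
fluid_cells = 101_466_432
time_steps = 864_000
time_steps_per_second = 95.5

total_cells = blocks * cells_per_block                    # 105,676,800
fluid_fraction = fluid_cells / total_cells                # about 0.96
blocks_per_process = blocks / processes                   # 6.0
runtime_hours = time_steps / time_steps_per_second / 3600 # about 2.5
```

So about 96% of all allocated cells are fluid cells, each process handles six blocks on average, and the run would take roughly two and a half hours under the constant-rate assumption.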


Outlook on Dynamic Grid Refinement / Conclusion

• Dynamic Grid Refinement:
  • space filling curve? (see p4est and work done by C. Burstedde et al.)
  • multiple space filling curves – one for each grid level?
  • repartitioning based on graph algorithms? → diffusive algorithms! (see work of H. Meyerhenke et al.)
• Conclusion:
  • The waLBerla framework now has a highly efficient, massively parallel implementation of LBM for …
    • … uniform grids as well as …
    • … statically refined grids.

⇒ next milestone: support for dynamic grid refinement / runtime adaptivity


THANK YOU FOR YOUR ATTENTION!

QUESTIONS ? (visit http://www.walberla.net)