stefan donath optimizing performance of the lattice boltzmann method for complex structures...

Stefan Donath

Optimizing Performance of the Lattice Boltzmann Method

for Complex Structures

Friedrich-Alexander University Erlangen/Nuremberg

Department of Computer Science 10 (System Simulation)

Regional Computing Center of Erlangen (RRZE)

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg2

ConclusionResultsOptimizationImplementationApplicationIntroduction

Outline

IntroductionLattice Boltzmann MethodImplementation Aspects

Application

Implementation

Optimization

Results

Conclusion


Nuremberg3


Lattice Boltzmann Method

Boltzmann Equation

Discretization of particle velocity space

(finite set of discrete velocities)

][1 )0(fffft

timerelaxation ...

functionon distributi mequilibriu ...

velocityparticle ... )0(

f

][1 )(eq

t ffff

t),,(),(

t),,(),()0()(

xftxf

xftxfeq


Nuremberg4



Different discretization schemes

Numerical accuracy and stability

Computational speed and simplicity

D3Q15 D3Q19 D3Q27


Nuremberg5



Discretization in space x and time t:

),(~

),(

)],(),([),(),(~

)(

txftttexf

txftxftxftxf

ii

ieq

iii

collision step:streaming step:


Nuremberg6


Lattice Boltzmann Method (Implementation Aspects)

Discretization in space x and time t:

),(~

),(

)],(),([),(),(~

)(

txftttexf

txftxftxftxf

ii

ieq

iii

collision step:streaming step:

source destination

Stream-Collide (Pull-Method)Get the distributions from the neighboring cells in the source array and store the relaxated values to one cell in the destination array

Collide-Stream (Push-Method)Take the distributions from one cell in the source array and store the relaxated values to the neighboring cells in the destination array


Nuremberg7


Lattice Boltzmann Method (Implementation Aspects)

Walls and Obstacles: Bounce Back rule


Nuremberg8


Implementation Aspects

Data Dependencies

Two Grids

Compressed Grid


Nuremberg9


double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)do z=1,zMax do y=1,yMax do x=1,xMax

if( fluidcell(x,y,z) ) then

LOAD f(x,y,z, 0:18,t)

Relaxation (complex computations) SAVE f(x ,y ,z , 0,t+1) SAVE f(x+1,y+1,z , 1,t+1) SAVE f(x ,y+1,z , 2,t+1) SAVE f(x-1,y+1,z , 3,t+1) … SAVE f(x ,y-1,z-1,18,t+1)

endif enddo enddoenddo

Implementation Aspects

Collide

Stream


Nuremberg10


Application


Nuremberg11


Application: Porous Media Combustion

New technology for heating installations:

Porous Media Combustion

Fuel-air-mixture does no longer react in a free flame

Combustion process takes place inside the pores of a porous medium that is placed in the reaction area

P CPorous Media Combustion

60mm

20mm


Nuremberg12


Application: Porous Media CombustionP CPorous Media Combustion

Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser


Nuremberg13


Application: Porous Media Combustion

Various applications:

modern steam engines, vehicle heaters

P CPorous Media Combustion

Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser


Nuremberg14


Application: Introducing Complex Geometries used

Porous Medium from PMC: Silicon-Carbide (SiC)

Many small obstacles

Obstacle/fluid-ratio: ~2%

High number of fluid-solid faces


Nuremberg15


Application: Introducing Complex Geometries used

Second Test-Geometry:

“MC“

Huge obstacles, only fewfluid tubes

Obstacle/fluid-ratio: ~50%

Low number of fluid-solid faces


Nuremberg16


Implementation


Nuremberg17


Implementation

Collision and streaming step in same loop

Push-Method

Data representation in 1D-Array

Stores only Fluid Cells ( saves memory)

Indirect addressing of target cells (by extra connectivity array)

Boundary conditions (Bounce Back) handled implicitly


Nuremberg18


Implementation

Indirect addressing and implicit Bounce Back

obstacle wall

connectivity array

i i+1 i i+1i-1 i-1


Nuremberg19


Implementation

Indirect addressing and implicit Bounce Back

obstacle wall

connectivity array

i i+1 i i+1i-1 i-1


Nuremberg20


Implementation

Preprocessor

Sets up connectivity array

Specifies all domain parameters:Geometry and obstacles

Traversing scheme

Solver

Reads in preprocessed information

Performs lattice Boltzmann method (in single loop)


Nuremberg21


Implementation

Rating: 1D-Array compared to standard implementation using multidimensional array and three loops

AdvantagesSaves memoryThe more obstacles in the domain the higher is the compression

Implicit Bounce BackNo extra routine or if-statement needed

DrawbacksIndirect addressingPrevents compiler from vectorization and other optimization techniques Consequently, worse performance


Nuremberg22


Optimization - Outline

Memory Traversing Schemes

Space-Filling Curves

Blocking

Memory Layouts

Further Optimization Techniques


Nuremberg23


Memory Traversing Schemes: Space-Filling Curves

What is a space-filling curve?

Loosely spoken:A one dimensional curve that fills a higher dimensional space

Which curves were used?

Hilbert

Peano

How are they constructed?

Again, loosely:By a mapping from a one-dimensional interval to the higher

dimensional space

Then, by recursion new mapping of each part of the interval


Nuremberg24



And how works construction really?

0

1


Nuremberg25



How does that look in 3D?

“Peano fsw“

“Hilbert enb“

“Hilbert fws“

“Hilbert bne“

“Hilbert nwf“

OK, but how to construct them?


Nuremberg26



How to construct them?

Table based approachHilbert: 48 Productions with 8 entries and 7 connectors

Peano: 8 Productions with 27 entries and 26 connectors

Current Level Next Level

bne enb b ben n ben f fse e fse b bws s bws f wnf

fws swf f fsw w fsw b bes s bes f fne e fne b nwb


Nuremberg27



Summary for Space-Filling Curves

Recursive production by segmentationLimitation in system sizes:

Hilbert: 23n

Peano: 33n

Increase spatial locality

Enable mesh-refinement


Nuremberg28


Memory Traversing Schemes: Blocking

Implicit blocking technique

Arrangement of data in a blocking manner

Increases spatial locality


Nuremberg29



Notes on Memory Traversing Schemes

Pure preprocessing technique:Only arrangement of data in memory changed

No change for solver

No overhead for solver (e.g. loop overhead for blocking)


Nuremberg30


Memory Layouts: Collision Optimized Layout

Standard Array-Layout: F(i,x,y,z,t) “Array-of-Structures“

Collision optimized

Optimal read access:2 cache lines per LUP

Bad write access:19 stores in 19 cache linesBut: Depending on systemsize some of them arealready/still in cache


Nuremberg31



18 write accesseson one cell from3 different z-layers

8 write accesseson one cell from3 different rows

Byte

NNMem jik

8192

)2()2(3

Byte

NMem ij

8192

)2(3


Nuremberg32



Performance of Collision-Optimized Layout (P4, 512kB L2)

012345678910

2000 x

22 x

22

1700 x

22 x

22

1500 x

26 x

26

1200 x

29 x

29

1000 x

32 x

32

700 x 38

x 38

500 x 45

x 45

200 x 70

x 70

100 x 10

0 x 1

00

64 x 6

4 x 6

4

32 x 3

2 x 3

2

16 x 1

6 x 1

6

8 x 8 x 8

MLUPS

Collision Optimized Collision Optimized, 1D Collision Optimized, 3D


Nuremberg33


Memory Layouts: Propagation Optimized Layout

Optimized Array-Layout: F(x,y,z,i,t) “Structure-of-Arrays“

stride-1-access on x (inner loop)

19 cache lines per 16 LUPsin read and write process

1 cache miss each 16th memory access


Nuremberg34


Memory Layouts: Propagation Optimized Layout

Performance on Pentium 4, 512kB L2

012345678910

2000 x

22 x

22

1700 x

22 x

22

1500 x

26 x

26

1200 x

29 x

29

1000 x

32 x

32

700 x 38

x 38

500 x 45

x 45

200 x 70

x 70

100 x 10

0 x 1

00

64 x 6

4 x 6

4

32 x 3

2 x 3

2

16 x 1

6 x 1

6

8 x 8 x 8

MLUPS

Propagation Optimized Collision Optimized


Nuremberg35


Further Optimization Techniques

Additional BottlenecksLarge loop body (causes register spills on IA32)Concurrent writing to 19 different cache lines interferes with number of write combine buffers on IA32 (6 for Intel Xeon/Nocona)Indirect addressing prevents IA32 hardware prefetcher from preloading values for target cells (due to bounce back at obstacles)

Solutions (implemented in Solver):

Split up loop in 5 loops of length Nx

Manual Block Preload Technique(Drawback: Both techniques need a loop blocking scheme)These solutions only needed for IA32


Nuremberg36


Results - Outline

Architecture Descriptions

Comparison 1D-Solver to Standard SolverMemory LayoutsFor Standard SolverFor 1D-Solver

Influence of GeometryMCSiC

Memory Traversing SchemesSpace-Filling CurvesBlocking


Nuremberg37


Architecture Description: Nocona/Irwindale

Test System: Test machine at RRZE

CPUs: Nocona (Irwindale), 3.6GHz, 2MB L2-Cache

Memory: DDR400, 6.4GB/s

Architectural specialties:

EM64T extension

Hyperthreading

One memory bus for both processors


Nuremberg38


Architecture Description: Itanium2 / Altix

Test System: RRZE SGI Altix

CPUs: Itanium 2, 1.3 GHz, 3MB L3-CacheMemory: 112GB “distributed shared memory“

Architectural specialties:Itanium 2:

“EPIC“ (Explicitly Parallel Instruction Computing)No out-of-orderParallelization of commands in the grip of compiler ( bundles)L1-Cache only for Integer

Altix:ccNUMA with NUMALink 3Memory connected hierarchically by SHUBs


Nuremberg39


Architecture Description: AMD Opteron

Test System: LSS HPC-Cluster

CPUs: AMD Opteron, 2.2GHz, 1MB L2-CacheIA32-compatible

Memory: DDR333, 5.2GB/s

Architectural specialties:

Compute nodes with four CPUs

4GB RAM per CPU, each CPU can access 16GB per ccNUMA

Interconnect:CPUs on one node: HyperTransport (6.4GB/s)

Nodes: Infiniband (10GB/s)


Nuremberg40


Comparison 1D-Array Solver to Standard Solver

1D-Solver vs. LBMKernel on Itanium 2 (not Altix)

0

2

4

6

8

10

12

16 32 64 100 130

ML

UP

S

1D-Solver Coll Opt 1D-Solver Prop Opt LBMKernel Coll Opt LBMKernel Prop Opt


Nuremberg41


Memory Layouts for Standard Solver

Memory Layouts with LBMKernel on Itanium 2 (not Altix)

0

2

4

6

8

10

12

16 32 64 100 130

ML

UP

S

LBMKernel Coll Opt LBMKernel Prop Opt LBMKernel Coll Opt Blk


Nuremberg42


Memory Layouts on Itanium 2

Memory Layouts with 1D-Solver on Itanium 2

0

2

4

6

8

10

12

16 16 27 32 64 81 100 120 128 130

ML

UP

S

Coll Opt Std Prop Opt Std Coll Opt Blk Prop Opt Blk


Nuremberg43


Memory Layouts on AMD Opteron

Memory Layouts with 1D-Solver on AMD Opteron

0

2

4

6

8

10

12

16 27 32 64 81 100 120 128 130 140

ML

UP

S



Nuremberg44


Memory Layouts on Nocona/Irwindale

Memory Layouts with 1D-Solver on Nocona/Irwindale

0

2

4

6

8

10

12

16 27 32 64 81 100 120 128 130 140

ML

UP

S



Nuremberg45


Influence of Geometry: SiC-foam (~2% obstacles)

1D-Solver on Nocona/Irwindale with SiC-Geometry

0

2

4

6

8

10

12

16 27 32 64 81 100 120 128 130 140

ML

UP

S



Nuremberg46


Influence of Geometry: MC (~50% obstacles)

1D-Solver on Nocona/Irwindale with MC-Geometry

0

2

4

6

8

10

12

16 27 32 64 81 100 120 128 130 140

ML

UP

S



Nuremberg47



Blocking vs. Space Filling Curves on Nocona with MC

0

2

4

6

8

10

12

14

16 27 32 64 81 100 120 128 130 140

ML

UP

S

Coll Opt Hilbert Prop Opt Hilbert Coll Opt Peano Prop Opt Peano Coll Opt Blk Prop Opt Std Prop Opt Blk


Nuremberg48



Blocking vs. Space Filling Curves on Itanium 2 with MC

0

2

4

6

8

10

12

14

16 27 32 64 81 100 120 128 130 140

ML

UP

S

Coll Opt Hilbert Prop Opt Hilbert Coll Opt Peano Prop Opt Peano Coll Opt Blk Prop Opt Std Prop Opt Blk


Nuremberg49


Conclusion

1D-Array Data representation makes performance independent of obstacle to fluid ratio

Memory traversing by Space-Filling Curves results in similar performance as spatial blocking Implementation of SFCs is not worth the effort (if they are used as

memory traversing alternative only)

Together with indirect addressing Collision Optimized Layout with blocking is best technique if cache is larger than 1 MB

Indeed, there are cases where Propagation Optimized Layout is not best


Nuremberg50


Outlook

Future work could concern:Space-Filling Curves:

Kind of staggered SFCs, for every direction own curve Avoid waste of underused cache lines where lattice sites are neighboring

cells which are visited much later

Galerkin-discretization or point wise evaluation of LBM to enable stack-implementation in conjunction with SFCsBUT: For real-world problems construction on non-cubic grids is necessary at first

Search for vectorization enhancing techniques to over-come problems with indirect addressing on Itanium 2Search for reasons why Collision Optimized Layout is better than Propagation Optimized Layout


Nuremberg51


Acknowledgement / References

Acknowledgement:Bavarian Graduate School for Computational EngineeringThomas Zeiser (RRZE)Gerhard Wellein (RRZE)

References:S. Donath, T. Zeiser, G. Hager, J. Habich, G. Wellein: Optimizing Performance of the Lattice Boltzmann Method for Complex Structures on Cache-based ArchitecturesG. Wellein, P. Lammers, G. Hager, S. Donath, T. Zeiser: Towards Optimal Performance for Lattice Boltzmann Applications on Terascale ComputersG. Wellein, T. Zeiser, S. Donath, G. Hager: On the single processor performance of simple lattice Boltzmann kernels

stefan donath optimizing performance of the lattice boltzmann method for complex structures...

Documents