stefan donath optimizing performance of the lattice boltzmann method for complex structures...

51
Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department of Computer Science 10 (System Simulation) Regional Computing Center of Erlangen (RRZE)

Upload: jared-park

Post on 04-Jan-2016

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan Donath

Optimizing Performance of the Lattice Boltzmann Method

for Complex Structures

Friedrich-Alexander University Erlangen/Nuremberg

Department of Computer Science 10 (System Simulation)

Regional Computing Center of Erlangen (RRZE)

Page 2: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg2

ConclusionResultsOptimizationImplementationApplicationIntroduction

Outline

IntroductionLattice Boltzmann MethodImplementation Aspects

Application

Implementation

Optimization

Results

Conclusion

Page 3: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg3

ConclusionResultsOptimizationImplementationApplicationIntroduction

Lattice Boltzmann Method

Boltzmann Equation

Discretization of particle velocity space

(finite set of discrete velocities)

][1 )0(fffft

timerelaxation ...

functionon distributi mequilibriu ...

velocityparticle ... )0(

f

][1 )(eq

t ffff

t),,(),(

t),,(),()0()(

xftxf

xftxfeq

Page 4: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg4

ConclusionResultsOptimizationImplementationApplicationIntroduction

Lattice Boltzmann Method

Different discretization schemes

Numerical accuracy and stability

Computational speed and simplicity

D3Q15 D3Q19 D3Q27

Page 5: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg5

ConclusionResultsOptimizationImplementationApplicationIntroduction

Lattice Boltzmann Method

Discretization in space x and time t:

),(~

),(

)],(),([),(),(~

)(

txftttexf

txftxftxftxf

ii

ieq

iii

collision step:streaming step:

Page 6: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg6

ConclusionResultsOptimizationImplementationApplicationIntroduction

Lattice Boltzmann Method (Implementation Aspects)

Discretization in space x and time t:

),(~

),(

)],(),([),(),(~

)(

txftttexf

txftxftxftxf

ii

ieq

iii

collision step:streaming step:

source destination

Stream-Collide (Pull-Method)Get the distributions from the neighboring cells in the source array and store the relaxated values to one cell in the destination array

Collide-Stream (Push-Method)Take the distributions from one cell in the source array and store the relaxated values to the neighboring cells in the destination array

Page 7: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg7

ConclusionResultsOptimizationImplementationApplicationIntroduction

Lattice Boltzmann Method (Implementation Aspects)

Walls and Obstacles: Bounce Back rule

Page 8: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg8

ConclusionResultsOptimizationImplementationApplicationIntroduction

Implementation Aspects

Data Dependencies

Two Grids

Compressed Grid

Page 9: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg9

ConclusionResultsOptimizationImplementationApplicationIntroduction

double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)do z=1,zMax do y=1,yMax do x=1,xMax

if( fluidcell(x,y,z) ) then

LOAD f(x,y,z, 0:18,t)

Relaxation (complex computations) SAVE f(x ,y ,z , 0,t+1) SAVE f(x+1,y+1,z , 1,t+1) SAVE f(x ,y+1,z , 2,t+1) SAVE f(x-1,y+1,z , 3,t+1) … SAVE f(x ,y-1,z-1,18,t+1)

endif enddo enddoenddo

Implementation Aspects

Collide

Stream

Page 10: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg10

ConclusionResultsOptimizationImplementationApplicationIntroduction

Application

Page 11: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg11

ConclusionResultsOptimizationImplementationApplicationIntroduction

Application: Porous Media Combustion

New technology for heating installations:

Porous Media Combustion

Fuel-air-mixture does no longer react in a free flame

Combustion process takes place inside the pores of a porous medium that is placed in the reaction area

P CPorous Media Combustion

60mm

20mm

Page 12: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg12

ConclusionResultsOptimizationImplementationApplicationIntroduction

Application: Porous Media CombustionP CPorous Media Combustion

Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser

Page 13: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg13

ConclusionResultsOptimizationImplementationApplicationIntroduction

Application: Porous Media Combustion

Various applications:

modern steam engines, vehicle heaters

P CPorous Media Combustion

Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser

Page 14: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg14

ConclusionResultsOptimizationImplementationApplicationIntroduction

Application: Introducing Complex Geometries used

Porous Medium from PMC: Silicon-Carbide (SiC)

Many small obstacles

Obstacle/fluid-ratio: ~2%

High number of fluid-solid faces

Page 15: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg15

ConclusionResultsOptimizationImplementationApplicationIntroduction

Application: Introducing Complex Geometries used

Second Test-Geometry:

“MC“

Huge obstacles, only fewfluid tubes

Obstacle/fluid-ratio: ~50%

Low number of fluid-solid faces

Page 16: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg16

ConclusionResultsOptimizationImplementationApplicationIntroduction

Implementation

Page 17: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg17

ConclusionResultsOptimizationImplementationApplicationIntroduction

Implementation

Collision and streaming step in same loop

Push-Method

Data representation in 1D-Array

Stores only Fluid Cells ( saves memory)

Indirect addressing of target cells (by extra connectivity array)

Boundary conditions (Bounce Back) handled implicitly

Page 18: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg18

ConclusionResultsOptimizationImplementationApplicationIntroduction

Implementation

Indirect addressing and implicit Bounce Back

obstacle wall

connectivity array

i i+1 i i+1i-1 i-1

Page 19: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg19

ConclusionResultsOptimizationImplementationApplicationIntroduction

Implementation

Indirect addressing and implicit Bounce Back

obstacle wall

connectivity array

i i+1 i i+1i-1 i-1

Page 20: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg20

ConclusionResultsOptimizationImplementationApplicationIntroduction

Implementation

Preprocessor

Sets up connectivity array

Specifies all domain parameters:Geometry and obstacles

Traversing scheme

Solver

Reads in preprocessed information

Performs lattice Boltzmann method (in single loop)

Page 21: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg21

ConclusionResultsOptimizationImplementationApplicationIntroduction

Implementation

Rating: 1D-Array compared to standard implementation using multidimensional array and three loops

AdvantagesSaves memoryThe more obstacles in the domain the higher is the compression

Implicit Bounce BackNo extra routine or if-statement needed

DrawbacksIndirect addressingPrevents compiler from vectorization and other optimization techniques Consequently, worse performance

Page 22: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg22

ConclusionResultsOptimizationImplementationApplicationIntroduction

Optimization - Outline

Memory Traversing Schemes

Space-Filling Curves

Blocking

Memory Layouts

Further Optimization Techniques

Page 23: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg23

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Traversing Schemes: Space-Filling Curves

What is a space-filling curve?

Loosely spoken:A one dimensional curve that fills a higher dimensional space

Which curves were used?

Hilbert

Peano

How are they constructed?

Again, loosely:By a mapping from a one-dimensional interval to the higher

dimensional space

Then, by recursion new mapping of each part of the interval

Page 24: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg24

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Traversing Schemes: Space-Filling Curves

And how works construction really?

0

1

Page 25: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg25

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Traversing Schemes: Space-Filling Curves

How does that look in 3D?

“Peano fsw“

“Hilbert enb“

“Hilbert fws“

“Hilbert bne“

“Hilbert nwf“

OK, but how to construct them?

Page 26: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg26

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Traversing Schemes: Space-Filling Curves

How to construct them?

Table based approachHilbert: 48 Productions with 8 entries and 7 connectors

Peano: 8 Productions with 27 entries and 26 connectors

Current Level Next Level

bne enb b ben n ben f fse e fse b bws s bws f wnf

fws swf f fsw w fsw b bes s bes f fne e fne b nwb

Page 27: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg27

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Traversing Schemes: Space-Filling Curves

Summary for Space-Filling Curves

Recursive production by segmentationLimitation in system sizes:

Hilbert: 23n

Peano: 33n

Increase spatial locality

Enable mesh-refinement

Page 28: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg28

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Traversing Schemes: Blocking

Implicit blocking technique

Arrangement of data in a blocking manner

Increases spatial locality

Page 29: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg29

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Traversing Schemes

Notes on Memory Traversing Schemes

Pure preprocessing technique:Only arrangement of data in memory changed

No change for solver

No overhead for solver (e.g. loop overhead for blocking)

Page 30: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg30

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Layouts: Collision Optimized Layout

Standard Array-Layout: F(i,x,y,z,t) “Array-of-Structures“

Collision optimized

Optimal read access:2 cache lines per LUP

Bad write access:19 stores in 19 cache linesBut: Depending on systemsize some of them arealready/still in cache

Page 31: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg31

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Layouts: Collision Optimized Layout

18 write accesseson one cell from3 different z-layers

8 write accesseson one cell from3 different rows

Byte

NNMem jik

8192

)2()2(3

Byte

NMem ij

8192

)2(3

Page 32: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg32

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Layouts: Collision Optimized Layout

Performance of Collision-Optimized Layout (P4, 512kB L2)

012345678910

2000 x

22 x

22

1700 x

22 x

22

1500 x

26 x

26

1200 x

29 x

29

1000 x

32 x

32

700 x 38

x 38

500 x 45

x 45

200 x 70

x 70

100 x 10

0 x 1

00

64 x 6

4 x 6

4

32 x 3

2 x 3

2

16 x 1

6 x 1

6

8 x 8 x 8

MLUPS

Collision Optimized Collision Optimized, 1D Collision Optimized, 3D

Page 33: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg33

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Layouts: Propagation Optimized Layout

Optimized Array-Layout: F(x,y,z,i,t) “Structure-of-Arrays“

stride-1-access on x (inner loop)

19 cache lines per 16 LUPsin read and write process

1 cache miss each 16th memory access

Page 34: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg34

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Layouts: Propagation Optimized Layout

Performance on Pentium 4, 512kB L2

012345678910

2000 x

22 x

22

1700 x

22 x

22

1500 x

26 x

26

1200 x

29 x

29

1000 x

32 x

32

700 x 38

x 38

500 x 45

x 45

200 x 70

x 70

100 x 10

0 x 1

00

64 x 6

4 x 6

4

32 x 3

2 x 3

2

16 x 1

6 x 1

6

8 x 8 x 8

MLUPS

Propagation Optimized Collision Optimized

Page 35: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg35

ConclusionResultsOptimizationImplementationApplicationIntroduction

Further Optimization Techniques

Additional BottlenecksLarge loop body (causes register spills on IA32)Concurrent writing to 19 different cache lines interferes with number of write combine buffers on IA32 (6 for Intel Xeon/Nocona)Indirect addressing prevents IA32 hardware prefetcher from preloading values for target cells (due to bounce back at obstacles)

Solutions (implemented in Solver):

Split up loop in 5 loops of length Nx

Manual Block Preload Technique(Drawback: Both techniques need a loop blocking scheme)These solutions only needed for IA32

Page 36: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg36

ConclusionResultsOptimizationImplementationApplicationIntroduction

Results - Outline

Architecture Descriptions

Comparison 1D-Solver to Standard SolverMemory LayoutsFor Standard SolverFor 1D-Solver

Influence of GeometryMCSiC

Memory Traversing SchemesSpace-Filling CurvesBlocking

Page 37: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg37

ConclusionResultsOptimizationImplementationApplicationIntroduction

Architecture Description: Nocona/Irwindale

Test System: Test machine at RRZE

CPUs: Nocona (Irwindale), 3.6GHz, 2MB L2-Cache

Memory: DDR400, 6.4GB/s

Architectural specialties:

EM64T extension

Hyperthreading

One memory bus for both processors

Page 38: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg38

ConclusionResultsOptimizationImplementationApplicationIntroduction

Architecture Description: Itanium2 / Altix

Test System: RRZE SGI Altix

CPUs: Itanium 2, 1.3 GHz, 3MB L3-CacheMemory: 112GB “distributed shared memory“

Architectural specialties:Itanium 2:

“EPIC“ (Explicitly Parallel Instruction Computing)No out-of-orderParallelization of commands in the grip of compiler ( bundles)L1-Cache only for Integer

Altix:ccNUMA with NUMALink 3Memory connected hierarchically by SHUBs

Page 39: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg39

ConclusionResultsOptimizationImplementationApplicationIntroduction

Architecture Description: AMD Opteron

Test System: LSS HPC-Cluster

CPUs: AMD Opteron, 2.2GHz, 1MB L2-CacheIA32-compatible

Memory: DDR333, 5.2GB/s

Architectural specialties:

Compute nodes with four CPUs

4GB RAM per CPU, each CPU can access 16GB per ccNUMA

Interconnect:CPUs on one node: HyperTransport (6.4GB/s)

Nodes: Infiniband (10GB/s)

Page 40: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg40

ConclusionResultsOptimizationImplementationApplicationIntroduction

Comparison 1D-Array Solver to Standard Solver

1D-Solver vs. LBMKernel on Itanium 2 (not Altix)

0

2

4

6

8

10

12

16 32 64 100 130

ML

UP

S

1D-Solver Coll Opt 1D-Solver Prop Opt LBMKernel Coll Opt LBMKernel Prop Opt

Page 41: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg41

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Layouts for Standard Solver

Memory Layouts with LBMKernel on Itanium 2 (not Altix)

0

2

4

6

8

10

12

16 32 64 100 130

ML

UP

S

LBMKernel Coll Opt LBMKernel Prop Opt LBMKernel Coll Opt Blk

Page 42: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg42

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Layouts on Itanium 2

Memory Layouts with 1D-Solver on Itanium 2

0

2

4

6

8

10

12

16 16 27 32 64 81 100 120 128 130

ML

UP

S

Coll Opt Std Prop Opt Std Coll Opt Blk Prop Opt Blk

Page 43: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg43

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Layouts on AMD Opteron

Memory Layouts with 1D-Solver on AMD Opteron

0

2

4

6

8

10

12

16 27 32 64 81 100 120 128 130 140

ML

UP

S

Coll Opt Std Prop Opt Std Coll Opt Blk Prop Opt Blk

Page 44: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg44

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Layouts on Nocona/Irwindale

Memory Layouts with 1D-Solver on Nocona/Irwindale

0

2

4

6

8

10

12

16 27 32 64 81 100 120 128 130 140

ML

UP

S

Coll Opt Std Prop Opt Std Coll Opt Blk Prop Opt Blk

Page 45: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg45

ConclusionResultsOptimizationImplementationApplicationIntroduction

Influence of Geometry: SiC-foam (~2% obstacles)

1D-Solver on Nocona/Irwindale with SiC-Geometry

0

2

4

6

8

10

12

16 27 32 64 81 100 120 128 130 140

ML

UP

S

Coll Opt Std Prop Opt Std Coll Opt Blk Prop Opt Blk

Page 46: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg46

ConclusionResultsOptimizationImplementationApplicationIntroduction

Influence of Geometry: MC (~50% obstacles)

1D-Solver on Nocona/Irwindale with MC-Geometry

0

2

4

6

8

10

12

16 27 32 64 81 100 120 128 130 140

ML

UP

S

Coll Opt Std Prop Opt Std Coll Opt Blk Prop Opt Blk

Page 47: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg47

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Traversing Schemes

Blocking vs. Space Filling Curves on Nocona with MC

0

2

4

6

8

10

12

14

16 27 32 64 81 100 120 128 130 140

ML

UP

S

Coll Opt Hilbert Prop Opt Hilbert Coll Opt Peano Prop Opt Peano Coll Opt Blk Prop Opt Std Prop Opt Blk

Page 48: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg48

ConclusionResultsOptimizationImplementationApplicationIntroduction

Memory Traversing Schemes

Blocking vs. Space Filling Curves on Itanium 2 with MC

0

2

4

6

8

10

12

14

16 27 32 64 81 100 120 128 130 140

ML

UP

S

Coll Opt Hilbert Prop Opt Hilbert Coll Opt Peano Prop Opt Peano Coll Opt Blk Prop Opt Std Prop Opt Blk

Page 49: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg49

ConclusionResultsOptimizationImplementationApplicationIntroduction

Conclusion

1D-Array Data representation makes performance independent of obstacle to fluid ratio

Memory traversing by Space-Filling Curves results in similar performance as spatial blocking Implementation of SFCs is not worth the effort (if they are used as

memory traversing alternative only)

Together with indirect addressing Collision Optimized Layout with blocking is best technique if cache is larger than 1 MB

Indeed, there are cases where Propagation Optimized Layout is not best

Page 50: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg50

ConclusionResultsOptimizationImplementationApplicationIntroduction

Outlook

Future work could concern:Space-Filling Curves:

Kind of staggered SFCs, for every direction own curve Avoid waste of underused cache lines where lattice sites are neighboring

cells which are visited much later

Galerkin-discretization or point wise evaluation of LBM to enable stack-implementation in conjunction with SFCsBUT: For real-world problems construction on non-cubic grids is necessary at first

Search for vectorization enhancing techniques to over-come problems with indirect addressing on Itanium 2Search for reasons why Collision Optimized Layout is better than Propagation Optimized Layout

Page 51: Stefan Donath Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department

Stefan DonathFriedrich-Alexander University Erlangen-

Nuremberg51

ConclusionResultsOptimizationImplementationApplicationIntroduction

Acknowledgement / References

Acknowledgement:Bavarian Graduate School for Computational EngineeringThomas Zeiser (RRZE)Gerhard Wellein (RRZE)

References:S. Donath, T. Zeiser, G. Hager, J. Habich, G. Wellein: Optimizing Performance of the Lattice Boltzmann Method for Complex Structures on Cache-based ArchitecturesG. Wellein, P. Lammers, G. Hager, S. Donath, T. Zeiser: Towards Optimal Performance for Lattice Boltzmann Applications on Terascale ComputersG. Wellein, T. Zeiser, S. Donath, G. Hager: On the single processor performance of simple lattice Boltzmann kernels