Stefan Donath
Optimizing Performance of the Lattice Boltzmann Method
for Complex Structures
Friedrich-Alexander University Erlangen/Nuremberg
Department of Computer Science 10 (System Simulation)
Regional Computing Center of Erlangen (RRZE)
Stefan Donath, Friedrich-Alexander University Erlangen-Nuremberg
Outline
Introduction
Lattice Boltzmann Method
Implementation Aspects
Application
Implementation
Optimization
Results
Conclusion
Lattice Boltzmann Method
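Boltzmann Equation
$$\partial_t f + \xi \cdot \nabla f = -\frac{1}{\lambda}\,[f - f^{(0)}]$$
$\xi$ ... particle velocity
$f^{(0)}$ ... equilibrium distribution function
$\lambda$ ... relaxation time
Discretization of particle velocity space
(finite set of discrete velocities $e_i$)
$$\partial_t f_i + e_i \cdot \nabla f_i = -\frac{1}{\lambda}\,[f_i - f_i^{(eq)}]$$
$f_i(x,t) = f(x, e_i, t)$
$f_i^{(eq)}(x,t) = f^{(0)}(x, e_i, t)$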
Lattice Boltzmann Method
Different discretization schemes
Numerical accuracy and stability
Computational speed and simplicity
D3Q15 D3Q19 D3Q27
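The three model names encode the dimension (D3) and the number of discrete velocities (Q15, Q19, Q27). As a minimal sketch, assuming the usual convention that each velocity component is -1, 0 or 1 and that the models differ only in which speed classes they keep, the sets can be generated as follows. The function name `velocity_set` is illustrative and not from the talk.

```python
from itertools import product

def velocity_set(name):
    """Return the discrete velocities of a D3Qn lattice as (ex, ey, ez) tuples."""
    # Speed classes |e|^2 kept per model:
    # 0 = rest, 1 = along an axis, 2 = face diagonal, 3 = cube diagonal.
    kept = {"D3Q15": {0, 1, 3}, "D3Q19": {0, 1, 2}, "D3Q27": {0, 1, 2, 3}}[name]
    return [e for e in product((-1, 0, 1), repeat=3)
            if sum(c * c for c in e) in kept]

for name in ("D3Q15", "D3Q19", "D3Q27"):
    print(name, len(velocity_set(name)))
```

More velocities mean more values to load and store per cell update, which is the trade-off between accuracy/stability and speed named above.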
Lattice Boltzmann Method
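Discretization in space x and time t:
collision step: $\tilde{f}_i(x,t) = f_i(x,t) - \omega\,[f_i(x,t) - f_i^{(eq)}(x,t)]$
streaming step: $f_i(x + e_i\,\delta t,\ t + \delta t) = \tilde{f}_i(x,t)$
$\omega$ ... relaxation parameter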
Lattice Boltzmann Method (Implementation Aspects)
[Figure: source and destination arrays]
Stream-Collide (Pull-Method): Get the distributions from the neighboring cells in the source array and store the relaxed values to one cell in the destination array
Collide-Stream (Push-Method): Take the distributions from one cell in the source array and store the relaxed values to the neighboring cells in the destination array
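The difference between the two update orders can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the solver from the talk: it uses a one-dimensional D1Q3 lattice with periodic boundaries, a standard BGK relaxation, and an arbitrary relaxation parameter `OMEGA`. All names are hypothetical.

```python
# D1Q3 toy lattice: rest, right-moving and left-moving populations.
E = (0, 1, -1)
W = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)
OMEGA = 1.2  # relaxation parameter, arbitrary illustrative value

def equilibrium(cell):
    """Second-order equilibrium for density rho and velocity u of one cell."""
    rho = sum(cell)
    u = (cell[1] - cell[2]) / rho
    return [w * rho * (1.0 + 3.0 * e * u + 4.5 * (e * u) ** 2 - 1.5 * u * u)
            for e, w in zip(E, W)]

def collide(cell):
    """Relax the populations of one cell towards equilibrium."""
    feq = equilibrium(cell)
    return [f - OMEGA * (f - fe) for f, fe in zip(cell, feq)]

def push_step(src):
    """Collide-stream: relax one source cell, scatter to neighboring cells."""
    n = len(src)
    dst = [[0.0] * 3 for _ in range(n)]
    for x in range(n):
        post = collide(src[x])
        for i, e in enumerate(E):
            dst[(x + e) % n][i] = post[i]
    return dst

def pull_step(src):
    """Stream-collide: gather from neighboring source cells, relax, store locally."""
    n = len(src)
    dst = []
    for x in range(n):
        gathered = [src[(x - e) % n][i] for i, e in enumerate(E)]
        dst.append(collide(gathered))
    return dst
```

Both variants apply the same two operators in a different order, so `pull_step` applied to already collided data gives the same populations as colliding the result of `push_step`. The push variant reads one cell and writes to many, the pull variant reads from many and writes to one, which is what matters for the memory access patterns discussed later.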
Lattice Boltzmann Method (Implementation Aspects)
Walls and Obstacles: Bounce Back rule
Implementation Aspects
Data Dependencies
Two Grids
Compressed Grid
Implementation Aspects
double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)
do z=1,zMax
  do y=1,yMax
    do x=1,xMax
      if( fluidcell(x,y,z) ) then
        ! Collide
        LOAD f(x,y,z, 0:18,t)
        Relaxation (complex computations)
        ! Stream
        SAVE f(x  ,y  ,z  , 0,t+1)
        SAVE f(x+1,y+1,z  , 1,t+1)
        SAVE f(x  ,y+1,z  , 2,t+1)
        SAVE f(x-1,y+1,z  , 3,t+1)
        …
        SAVE f(x  ,y-1,z-1,18,t+1)
      endif
    enddo
  enddo
enddo
Application
Application: Porous Media Combustion
New technology for heating installations:
Porous Media Combustion (PMC)
The fuel-air mixture no longer reacts in a free flame
The combustion process takes place inside the pores of a porous medium placed in the reaction area
[Figure: porous media combustion, with dimension labels 60 mm and 20 mm]
Application: Porous Media Combustion
Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser
Application: Porous Media Combustion
Various applications:
modern steam engines, vehicle heaters
Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser
Application: Complex Geometries Used
Porous Medium from PMC: Silicon-Carbide (SiC)
Many small obstacles
Obstacle/fluid-ratio: ~2%
High number of fluid-solid faces
Application: Complex Geometries Used
Second test geometry: "MC"
Huge obstacles, only a few fluid tubes
Obstacle/fluid-ratio: ~50%
Low number of fluid-solid faces
Implementation
Implementation
Collision and streaming step in same loop
Push-Method
Data representation in 1D-Array
Stores only fluid cells (saves memory)
Indirect addressing of target cells (by extra connectivity array)
Boundary conditions (Bounce Back) handled implicitly
Implementation
Indirect addressing and implicit Bounce Back
[Figure: lattice cells i-1, i, i+1 next to an obstacle wall, and the corresponding entries of the connectivity array]
Implementation
Preprocessor
Sets up connectivity array
Specifies all domain parameters:
Geometry and obstacles
Traversing scheme
Solver
Reads in preprocessed information
Performs lattice Boltzmann method (in single loop)
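The split between preprocessor and solver, with bounce back folded into the connectivity array, can be sketched as follows. This is a simplified illustration under stated assumptions: a 2D D2Q9 lattice instead of the 19-velocity 3D model, a periodic domain, and hypothetical function names `build_connectivity` and `stream`. It is not the implementation described in the talk.

```python
# D2Q9 directions; OPP[i] is the index of the direction opposite to i.
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
OPP = [E.index((-ex, -ey)) for ex, ey in E]

def build_connectivity(solid):
    """Preprocessor: number the fluid cells and record, for every fluid cell
    and direction, the (target cell, target direction) slot a push writes to."""
    ny, nx = len(solid), len(solid[0])
    index = {}
    for y in range(ny):
        for x in range(nx):
            if not solid[y][x]:
                index[(x, y)] = len(index)   # only fluid cells get a number
    conn = [[None] * len(E) for _ in index]
    for (x, y), n in index.items():
        for i, (ex, ey) in enumerate(E):
            target = ((x + ex) % nx, (y + ey) % ny)  # periodic domain (assumption)
            if target in index:
                conn[n][i] = (index[target], i)      # ordinary streaming
            else:
                conn[n][i] = (n, OPP[i])             # implicit bounce back
    return index, conn

def stream(f, conn):
    """Solver: one push over the 1D fluid list, without any obstacle test."""
    out = [[0.0] * len(E) for _ in f]
    for n, cell in enumerate(f):
        for i, value in enumerate(cell):
            tn, ti = conn[n][i]
            out[tn][ti] = value
    return out
```

The solver loop contains no if-statement for obstacles: a population that would enter a solid cell is redirected by the preprocessor to the same cell with the opposite direction. The price is the extra lookup `conn[n][i]` for every store.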
Implementation
Rating: 1D-Array compared to standard implementation using multidimensional array and three loops
Advantages
Saves memory: the more obstacles in the domain, the higher the compression
Implicit Bounce Back: no extra routine or if-statement needed
Drawbacks
Indirect addressing: prevents the compiler from vectorization and other optimization techniques
Consequently, worse performance
Optimization - Outline
Memory Traversing Schemes
Space-Filling Curves
Blocking
Memory Layouts
Further Optimization Techniques
Memory Traversing Schemes: Space-Filling Curves
What is a space-filling curve?
Loosely speaking: a one-dimensional curve that fills a higher-dimensional space
Which curves were used?
Hilbert
Peano
How are they constructed?
Again, loosely: by a mapping from a one-dimensional interval to the higher-dimensional space
Then, by recursion, a new mapping of each part of the interval
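The recursive mapping can be illustrated in two dimensions. The sketch below uses the well-known bit-manipulation formulation of the 2D Hilbert curve, not the 3D table-based construction used in the talk. The function name `hilbert_d2xy` and the parameter `order` (number of recursion levels) are illustrative.

```python
def hilbert_d2xy(order, d):
    """Map position d along a 2D Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the part built so far
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx                      # move into the selected quadrant
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Traversal order for an 8 x 8 grid: consecutive cells are always neighbors.
path = [hilbert_d2xy(3, d) for d in range(64)]
```

Storing cells in the order of `path` keeps cells that are close in space close in memory, which is the property the traversing schemes aim for.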
Memory Traversing Schemes: Space-Filling Curves
And how does the construction really work?
[Figure: construction steps on the interval from 0 to 1]
Memory Traversing Schemes: Space-Filling Curves
How does that look in 3D?
"Peano fsw"
"Hilbert enb"
"Hilbert fws"
"Hilbert bne"
"Hilbert nwf"
OK, but how to construct them?
Memory Traversing Schemes: Space-Filling Curves
How to construct them?
Table-based approach
Hilbert: 48 productions with 8 entries and 7 connectors
Peano: 8 productions with 27 entries and 26 connectors
Current Level | Next Level (entries, with connectors in parentheses)
bne | enb (b) ben (n) ben (f) fse (e) fse (b) bws (s) bws (f) wnf
fws | swf (f) fsw (w) fsw (b) bes (s) bes (f) fne (e) fne (b) nwb
Memory Traversing Schemes: Space-Filling Curves
Summary for Space-Filling Curves
Recursive production by segmentation
Limitation in system sizes:
Hilbert: $2^{3n}$
Peano: $3^{3n}$
Increase spatial locality
Enable mesh-refinement
Memory Traversing Schemes: Blocking
Implicit blocking technique
Arrangement of data in a blocking manner
Increases spatial locality
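A blocked arrangement can be produced entirely in preprocessing by enumerating the cells block by block. The following sketch assumes cubic blocks with edge length `b` and a simple x-fastest order inside each block. The function name `blocked_order` is hypothetical.

```python
def blocked_order(nx, ny, nz, b):
    """Return all (x, y, z) cells, visiting the grid block by block."""
    order = []
    for z0 in range(0, nz, b):
        for y0 in range(0, ny, b):
            for x0 in range(0, nx, b):
                # Cells of one block are numbered consecutively; blocks at the
                # domain border are clipped to the grid size.
                for z in range(z0, min(z0 + b, nz)):
                    for y in range(y0, min(y0 + b, ny)):
                        for x in range(x0, min(x0 + b, nx)):
                            order.append((x, y, z))
    return order

# Position of each cell in the 1D array for a 4 x 4 x 4 grid with 2 x 2 x 2 blocks.
position = {cell: n for n, cell in enumerate(blocked_order(4, 4, 4, 2))}
```

Because the 1D solver only follows the stored order, the nested block loops appear here and nowhere in the solver itself.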
Memory Traversing Schemes
Notes on Memory Traversing Schemes
Pure preprocessing technique: only the arrangement of data in memory changes
No change for solver
No overhead for solver (e.g. loop overhead for blocking)
Memory Layouts: Collision Optimized Layout
Standard Array-Layout: F(i,x,y,z,t) "Array-of-Structures"
Collision optimized
Optimal read access: 2 cache lines per LUP
Bad write access: 19 stores in 19 cache lines
But: depending on system size, some of them are already/still in cache
Memory Layouts: Collision Optimized Layout
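18 write accesses on one cell from 3 different z-layers
8 write accesses on one cell from 3 different rows
[Formulas: memory estimates Mem_k (three z-layers, factor 3 (N_i+2)(N_j+2)) and Mem_j (three rows, factor 3 (N_i+2)), in bytes, each shown with the value 8192]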
Memory Layouts: Collision Optimized Layout
Performance of Collision-Optimized Layout (P4, 512kB L2)
[Figure: MLUPS (axis 0 to 10) for domain sizes 2000 x 22 x 22, 1700 x 22 x 22, 1500 x 26 x 26, 1200 x 29 x 29, 1000 x 32 x 32, 700 x 38 x 38, 500 x 45 x 45, 200 x 70 x 70, 100 x 100 x 100, 64 x 64 x 64, 32 x 32 x 32, 16 x 16 x 16, 8 x 8 x 8. Series: Collision Optimized; Collision Optimized, 1D; Collision Optimized, 3D]
Memory Layouts: Propagation Optimized Layout
Optimized Array-Layout: F(x,y,z,i,t) "Structure-of-Arrays"
Stride-1 access on x (inner loop)
19 cache lines per 16 LUPs in read and write process
1 cache miss every 16th memory access
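The cache line counts quoted for the two layouts can be checked with a small address calculation. The sketch below assumes 8-byte values, 19 distributions per cell, and a 128-byte cache line; the line size is inferred from the figure of 16 values per line and is not stated in the talk. Cells are identified by a single linear index, and all names are illustrative.

```python
Q = 19      # distributions per cell
WORD = 8    # bytes per double-precision value
LINE = 128  # cache line size in bytes (assumption)

def offset_collision(cell, i, n_cells):
    """Byte offset for F(i, cell): the 19 values of one cell are contiguous."""
    return (cell * Q + i) * WORD

def offset_propagation(cell, i, n_cells):
    """Byte offset for F(cell, i): one contiguous array per direction."""
    return (i * n_cells + cell) * WORD

def lines_touched(offset_fn, cells, n_cells):
    """Number of distinct cache lines holding all 19 values of the given cells."""
    return len({offset_fn(c, i, n_cells) // LINE
                for c in cells for i in range(Q)})

n = 1024
print(lines_touched(offset_collision, [0], n))          # one cell, collision optimized
print(lines_touched(offset_propagation, [0], n))        # one cell, propagation optimized
print(lines_touched(offset_propagation, range(16), n))  # 16 consecutive cells
```

The model only covers the values belonging to the listed cells. The stores of a push step go to neighboring cells, which is where the collision optimized layout scatters its writes over many lines.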
Memory Layouts: Propagation Optimized Layout
Performance on Pentium 4, 512kB L2
[Figure: MLUPS (axis 0 to 10) for domain sizes 2000 x 22 x 22, 1700 x 22 x 22, 1500 x 26 x 26, 1200 x 29 x 29, 1000 x 32 x 32, 700 x 38 x 38, 500 x 45 x 45, 200 x 70 x 70, 100 x 100 x 100, 64 x 64 x 64, 32 x 32 x 32, 16 x 16 x 16, 8 x 8 x 8. Series: Propagation Optimized; Collision Optimized]
Further Optimization Techniques
Additional Bottlenecks
Large loop body (causes register spills on IA32)
Concurrent writing to 19 different cache lines interferes with the number of write combine buffers on IA32 (6 for Intel Xeon/Nocona)
Indirect addressing prevents the IA32 hardware prefetcher from preloading values for target cells (due to bounce back at obstacles)
Solutions (implemented in Solver):
Split up loop in 5 loops of length Nx
Manual Block Preload Technique
(Drawback: Both techniques need a loop blocking scheme)
These solutions are only needed for IA32
Results - Outline
Architecture Descriptions
Comparison 1D-Solver to Standard Solver
Memory Layouts
For Standard Solver
For 1D-Solver
Influence of Geometry
MC
SiC
Memory Traversing Schemes
Space-Filling Curves
Blocking
Architecture Description: Nocona/Irwindale
Test System: Test machine at RRZE
CPUs: Nocona (Irwindale), 3.6GHz, 2MB L2-Cache
Memory: DDR400, 6.4GB/s
Architectural specialties:
EM64T extension
Hyperthreading
One memory bus for both processors
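The results that follow are given in MLUPS (million lattice site updates per second). As a rough sketch, the memory bandwidth listed above can be turned into an upper bound on MLUPS. The estimate assumes that each of the 19 values of a cell causes three 8-byte transfers per update (load, store, and a write-allocate transfer for the store). That traffic model is an assumption and is not stated on the slides. The function names are hypothetical.

```python
def mlups(fluid_cells, time_steps, seconds):
    """Measured performance in million lattice site updates per second."""
    return fluid_cells * time_steps / seconds / 1.0e6

def bandwidth_bound_mlups(bandwidth_bytes_per_s, q=19, word=8,
                          transfers_per_value=3):
    """Upper bound on MLUPS if memory bandwidth is the only limit."""
    bytes_per_update = q * word * transfers_per_value  # 456 bytes by default
    return bandwidth_bytes_per_s / bytes_per_update / 1.0e6

# 6.4 GB/s memory bus of the test system described above.
print(round(bandwidth_bound_mlups(6.4e9), 1))
```

Under these assumptions the bound is about 14 MLUPS, which gives a scale for reading the measured curves.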
Architecture Description: Itanium2 / Altix
Test System: RRZE SGI Altix
CPUs: Itanium 2, 1.3 GHz, 3MB L3-Cache
Memory: 112GB "distributed shared memory"
Architectural specialties:
Itanium 2:
"EPIC" (Explicitly Parallel Instruction Computing)
No out-of-order execution
Parallelization of instructions is in the hands of the compiler (bundles)
L1-Cache only for integer data
Altix:
ccNUMA with NUMALink 3
Memory connected hierarchically by SHUBs
Architecture Description: AMD Opteron
Test System: LSS HPC-Cluster
CPUs: AMD Opteron, 2.2GHz, 1MB L2-Cache, IA32-compatible
Memory: DDR333, 5.2GB/s
Architectural specialties:
Compute nodes with four CPUs
4GB RAM per CPU, each CPU can access 16GB via ccNUMA
Interconnect:
CPUs on one node: HyperTransport (6.4GB/s)
Nodes: Infiniband (10GB/s)
Comparison 1D-Array Solver to Standard Solver
1D-Solver vs. LBMKernel on Itanium 2 (not Altix)
[Figure: MLUPS (axis 0 to 12) at x-axis values 16, 32, 64, 100, 130. Series: 1D-Solver Coll Opt, 1D-Solver Prop Opt, LBMKernel Coll Opt, LBMKernel Prop Opt]
Memory Layouts for Standard Solver
Memory Layouts with LBMKernel on Itanium 2 (not Altix)
[Figure: MLUPS (axis 0 to 12) at x-axis values 16, 32, 64, 100, 130. Series: LBMKernel Coll Opt, LBMKernel Prop Opt, LBMKernel Coll Opt Blk]
Memory Layouts on Itanium 2
Memory Layouts with 1D-Solver on Itanium 2
[Figure: MLUPS (axis 0 to 12) at x-axis values 16, 16, 27, 32, 64, 81, 100, 120, 128, 130. Series: Coll Opt Std, Prop Opt Std, Coll Opt Blk, Prop Opt Blk]
Memory Layouts on AMD Opteron
Memory Layouts with 1D-Solver on AMD Opteron
[Figure: MLUPS (axis 0 to 12) at x-axis values 16, 27, 32, 64, 81, 100, 120, 128, 130, 140. Series: Coll Opt Std, Prop Opt Std, Coll Opt Blk, Prop Opt Blk]
Memory Layouts on Nocona/Irwindale
Memory Layouts with 1D-Solver on Nocona/Irwindale
[Figure: MLUPS (axis 0 to 12) at x-axis values 16, 27, 32, 64, 81, 100, 120, 128, 130, 140. Series: Coll Opt Std, Prop Opt Std, Coll Opt Blk, Prop Opt Blk]
Influence of Geometry: SiC-foam (~2% obstacles)
1D-Solver on Nocona/Irwindale with SiC-Geometry
[Figure: MLUPS (axis 0 to 12) at x-axis values 16, 27, 32, 64, 81, 100, 120, 128, 130, 140. Series: Coll Opt Std, Prop Opt Std, Coll Opt Blk, Prop Opt Blk]
Influence of Geometry: MC (~50% obstacles)
1D-Solver on Nocona/Irwindale with MC-Geometry
[Figure: MLUPS (axis 0 to 12) at x-axis values 16, 27, 32, 64, 81, 100, 120, 128, 130, 140. Series: Coll Opt Std, Prop Opt Std, Coll Opt Blk, Prop Opt Blk]
Memory Traversing Schemes
Blocking vs. Space Filling Curves on Nocona with MC
[Figure: MLUPS (axis 0 to 14) at x-axis values 16, 27, 32, 64, 81, 100, 120, 128, 130, 140. Series: Coll Opt Hilbert, Prop Opt Hilbert, Coll Opt Peano, Prop Opt Peano, Coll Opt Blk, Prop Opt Std, Prop Opt Blk]
Memory Traversing Schemes
Blocking vs. Space Filling Curves on Itanium 2 with MC
[Figure: MLUPS (axis 0 to 14) at x-axis values 16, 27, 32, 64, 81, 100, 120, 128, 130, 140. Series: Coll Opt Hilbert, Prop Opt Hilbert, Coll Opt Peano, Prop Opt Peano, Coll Opt Blk, Prop Opt Std, Prop Opt Blk]
Conclusion
1D-array data representation makes performance independent of the obstacle-to-fluid ratio
Memory traversing by Space-Filling Curves results in performance similar to spatial blocking
Implementation of SFCs is not worth the effort (if they are used as a memory traversing alternative only)
Together with indirect addressing, Collision Optimized Layout with blocking is the best technique if the cache is larger than 1 MB
Indeed, there are cases where Propagation Optimized Layout is not best
Outlook
Future work could concern:
Space-Filling Curves:
A kind of staggered SFCs, with a separate curve for every direction
Avoid wasting underused cache lines where neighboring lattice sites are cells that are visited much later
Galerkin discretization or pointwise evaluation of LBM to enable a stack implementation in conjunction with SFCs
BUT: For real-world problems, construction on non-cubic grids is necessary first
Search for vectorization-enhancing techniques to overcome problems with indirect addressing on Itanium 2
Search for reasons why Collision Optimized Layout is better than Propagation Optimized Layout
Acknowledgement
Bavarian Graduate School for Computational Engineering
Thomas Zeiser (RRZE)
Gerhard Wellein (RRZE)