![Page 1: 3D ADI Method for Fluid Simulation on Multiple GPUson-demand.gputechconf.com/gtc/2012/presentations/S0247-3D-ADI... · ADI method – iterations Use global iterations for the whole](https://reader031.vdocument.in/reader031/viewer/2022021912/5c62f0c409d3f268208bb818/html5/thumbnails/1.jpg)
3D ADI Method for Fluid Simulation on Multiple GPUs
Nikolai Sakharnykh, NVIDIA
Nikolay Markovskiy, NVIDIA
Introduction
Fluid simulation using direct numerical methods
— Gives the most accurate result
— Requires lots of memory and computational power
GPUs are very suitable for direct methods
— Have great instruction throughput and high memory bandwidth
How will it scale on multiple GPUs?
cmc-fluid-solver
Open source project on Google Code
— Started at CMC faculty of MSU, Russia
— CPU: OpenMP, GPU: CUDA
3D fluid simulation using ADI solver
Key people:
— MSU: Vilen Paskonov, Sergey Berezin
— NVIDIA: Nikolai Sakharnykh, Nikolay Markovskiy
Outline
Fluid Simulation in 3D domain
— Problem statement, applications
— ADI numerical method
— GPU implementation details, optimizations
— Performance analysis
Multi-GPU implementation
Problem Statement
Viscous incompressible fluid in 3D domain
Arbitrary closed geometry for boundaries
Eulerian coordinates: velocity and temperature
Boundary types (figure labels): free, injection, no-slip
Applications
Sea and ocean simulation
— Additional parameters: salinity, etc.
Low-speed gas flow
— Inside 3D channel
— Around objects
Definitions
Density $\rho$
Velocity $\mathbf{u} = (u, v, w)$
Temperature $T$
Pressure $p$
Equation of state
— Describes the relation between $p$ and $\rho$, $T$
— Example: $p = \rho R T$
$R$ – gas constant for air
Governing equations
Continuity equation
— For incompressible fluids: $\nabla \cdot \mathbf{u} = 0$
Navier-Stokes equations:
— Dimensionless form, use equation of state
$Re$ – Reynolds number (ratio of inertial to viscous forces)
Governing equations
Energy equation:
— Dimensionless form, use equation of state
$\gamma$ – heat capacity ratio
$Pr$ – Prandtl number
$\Phi$ – dissipative function
ADI numerical method
[Figure: one time step split into three sweeps: along X (fixed Y, Z), along Y (fixed X, Z), along Z (fixed X, Y)]
ADI numerical method
Benefits
— Doesn’t have hard requirements on time step
— Domain decomposition – each step can be well parallelized
Many applications
— Computational Fluid Dynamics
— Computational Finance
Linear 3D PDE
ADI method – iterations
Use global iterations for the whole system of equations
Some equations are not linear:
— Use local iterations to approximate the non-linear term
[Diagram: previous time step → Solve X-dir equations → Solve Y-dir equations → Solve Z-dir equations → Updating all variables → next time step; the solve and update steps repeat as global iterations]
Discretization
Use regular grid, implicit finite difference scheme
— Second order in space
— First order in time
Leads to a tridiagonal system for the unknowns along i
— Independent system for each fixed pair (j, k)
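In generic form, each such system for a fixed pair (j, k) can be written as below. The coefficient names $a_i$, $b_i$, $c_i$, $d_i$ follow the a, b, c, d labels used on the later slides; $x_i$ for the unknown and $n$ for the system size are assumed notation, not taken from the source.

```latex
a_i\,x_{i-1} + b_i\,x_i + c_i\,x_{i+1} = d_i, \qquad i = 1, \dots, n
```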
Tridiagonal systems
Need to solve lots of tridiagonal systems
Sizes of systems may vary across the grid
[Figure: grid cells marked as outside, inside or boundary cells; grid lines through the domain give system 1, system 2, system 3]
Implementation details
<for each direction X, Y, Z>
{
    <for each local iteration>
    {
        <for each equation u, v, w, T>
        {
            build tridiagonal matrices and rhs
            solve tridiagonal systems
        }
        update non-linear terms
    }
}
GPU implementation
Store all data arrays entirely in GPU memory
— Reduce number of PCI-E transfers to minimum
— Map 3D arrays to linear memory
Main kernel
— Build matrix coefficients
— Solve tridiagonal systems
(X, Y, Z) → Z + Y * dimZ + X * dimY * dimZ
Z – fastest-changing dimension
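The mapping above can be sketched as a small helper. The function name `linearIndex` and its parameter names are illustrative assumptions, not identifiers from the project's source; only the formula itself comes from the slide.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: linear offset of cell (x, y, z) in a 3D array
// stored with Z as the fastest-changing dimension, as on the slide.
inline std::size_t linearIndex(std::size_t x, std::size_t y, std::size_t z,
                               std::size_t dimY, std::size_t dimZ) {
    assert(y < dimY && z < dimZ);  // x is bounded by the allocation size
    return z + y * dimZ + x * dimY * dimZ;
}
```

Neighbouring Z values map to neighbouring addresses, while one step along X jumps by dimY * dimZ elements.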
Building matrices
Input data:
— Previous/non-linear 3D layers
Each thread computes:
— Coefficients of a tridiagonal matrix
— Right-hand side vector
Use C++ templates for direction and equation
[Figure: tridiagonal matrix with diagonals a, b, c and right-hand side d]
Building matrices – performance
Poor Z direction performance compared to X/Y
— Each thread walks its own contiguous memory region along Z
— Accesses across a warp are therefore uncoalesced, with lots of cache misses
[Chart, total time: seconds (axis 0.0 to 2.0) for Build and Build + Solve along X dir, Y dir, Z dir; Tesla C2050 (SP)]
Build kernels:
| Dir | Requests per warp | L1 global load hit % | IPC |
|---|---|---|---|
| X | 2 – 3 | 25 – 45 | 1.4 |
| Y | 2 – 3 | 33 – 44 | 1.4 |
| Z | 32 | 0 – 15 | 0.2 |
Building matrices – optimization
Run Z phase in transposed XZY space
— Better locality for memory accesses
— Additional overhead on transpose
[Diagram: X local iterations and Y local iterations run in XYZ layout; Transpose input arrays (XYZ to XZY); the Z phase runs as Y local iterations in XZY layout; Transpose output arrays (XZY back to XYZ)]
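A host-side sketch of the layout change may help here. It swaps the Y and Z axes of an array stored with the last axis fastest. The function name `transposeXYZtoXZY` is an assumption for illustration; the project's transpose runs on the GPU, and this CPU loop only shows the index mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: in is laid out as [x][y][z] (Z fastest),
// the result is laid out as [x][z][y] (Y fastest).
std::vector<float> transposeXYZtoXZY(const std::vector<float>& in,
                                     std::size_t dimX, std::size_t dimY,
                                     std::size_t dimZ) {
    assert(in.size() == dimX * dimY * dimZ);
    std::vector<float> out(in.size());
    for (std::size_t x = 0; x < dimX; ++x)
        for (std::size_t y = 0; y < dimY; ++y)
            for (std::size_t z = 0; z < dimZ; ++z)
                out[y + z * dimY + x * dimZ * dimY] =
                    in[z + y * dimZ + x * dimY * dimZ];
    return out;
}
```

Applying the same function again with dimY and dimZ swapped restores the original layout, which corresponds to the "Transpose output arrays" step.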
Building matrices - optimization
Tridiagonal solver time dominates over transpose
— Transpose takes a smaller share of the total time as the number of local iterations grows
[Chart, total time: seconds (axis 0.0 to 2.0) for X dir, Y dir, Z dir and Z dir OPT, split into Transpose and Build + Solve; annotation: 2.5x; Tesla C2050 (SP)]
Build kernels:
| Z dir | Requests per warp | L1 global load hit % | IPC |
|---|---|---|---|
| Original | 32 | 0 – 15 | 0.2 |
| Transposed | 2 – 3 | 30 – 38 | 1.3 |
Solving tridiagonal systems
Number of tridiagonal systems ~ grid size squared
Sweep algorithm is the most efficient in this case
— 1 thread solves 1 system
// Forward elimination: overwrite c and d with the modified coefficients
for( int p = 1; p < end; p++ ) {
    // .. compute tridiagonal coefficients a_val, b_val, c_val, d_val ..
    get(c,p) = c_val / (b_val - a_val * get(c,p-1));
    get(d,p) = (d_val - get(d,p-1) * a_val) / (b_val - a_val * get(c,p-1));
}
// Back substitution: x at index `end` is assumed to hold the boundary value
for( int i = end-1; i >= 0; i-- )
    get(x,i) = get(d,i) - get(c,i) * get(x, i+1);
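For reference, a self-contained CPU version of the same sweep (the Thomas algorithm) could look like the sketch below. The function name `sweepSolve` and the vector-based interface are assumptions for illustration; the slide's kernel computes coefficients on the fly and handles boundaries through its own `get` accessor. The sketch has no pivoting, so it assumes a well-conditioned (for example diagonally dominant) system.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1.
// a[0] and c[n-1] are not used. c and d are taken by value and overwritten
// with the modified coefficients, mirroring the slide's forward loop.
std::vector<double> sweepSolve(const std::vector<double>& a,
                               const std::vector<double>& b,
                               std::vector<double> c, std::vector<double> d) {
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);
    c[0] /= b[0];
    d[0] /= b[0];
    for (std::size_t p = 1; p < n; ++p) {  // forward elimination
        const double denom = b[p] - a[p] * c[p - 1];
        c[p] = c[p] / denom;
        d[p] = (d[p] - a[p] * d[p - 1]) / denom;
    }
    std::vector<double> x(n);
    x[n - 1] = d[n - 1];
    for (std::size_t i = n - 1; i-- > 0;)  // back substitution
        x[i] = d[i] - c[i] * x[i + 1];
    return x;
}
```

Each system is solved sequentially in O(n) operations, which is why assigning one system per thread works well when there are many independent systems.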
Solving tridiagonal systems
Matrix layout is crucial for performance
Matrices for the X and Y directions are interleaved by default
Matrices for the Z direction are interleaved as well when the Z phase runs in transposed space
Interleaved layout (sweep friendly): a0 a0 a0 a0 a1 a1 a1 a1
— Similar to ELLPACK for sparse matrices
[Figure labels: Thread 1, Thread 2, Thread 3]
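The interleaved layout can be expressed as an index function, assuming that element `row` of every system is stored contiguously across systems, which is what the a0 a0 a0 a0 a1 a1 a1 a1 picture suggests. The name `interleavedIndex` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: position of coefficient `row` of system `sys` when
// numSystems systems are interleaved (all row-0 values first, then row 1...).
inline std::size_t interleavedIndex(std::size_t row, std::size_t sys,
                                    std::size_t numSystems) {
    assert(sys < numSystems);
    return row * numSystems + sys;
}
```

With one thread per system, threads with consecutive `sys` values read consecutive addresses at every step of the sweep, so the loads of a warp can coalesce.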
Solving tridiagonal systems
L1/L2 effect on performance
— Using 48K L1 instead of 16K gives 10-15% speed-up
— Turning L1 off reduces performance by 10%
— L1 really helps with misaligned accesses and spatial reuse
Occupancy >= 50%
— Running 128 threads per block
— 26-42 registers per thread (different for u, v, w, T)
— No shared memory
Performance benchmark
CPU configuration:
— Intel Core i7-3930K CPU @ 3.2 GHz, 6 cores (12 threads)
— Use OpenMP for CPU parallelization
Mostly memory bandwidth bound
Some parts achieve ~4x speed-up vs 1 core
GPU configuration:
— NVIDIA Tesla C2070
Test cases
Box Pipe
Simple geometry
Systems of the same size
Need to compute in all rectangular grid points
[Figure: Box Pipe with axes X, Y, Z; cross-section 1 x 1, length L]
Test cases
White Sea
Complex geometry
Large variation in system sizes
Need to compute only inside the area
[Figure: White Sea domain in the X-Y plane]
Performance results – Box Pipe
Grid 128x128x128
[Charts: throughput in segments/ms (axis 0 to 2500) for Solve X, Solve Y, Solve Z and Total, CPU versus GPU; panels: SINGLE, DOUBLE; speed-up annotations: 9.3x, 8.4x]
Performance results – White Sea
Grid 256x192x160
[Charts: throughput in segments/ms (axis 0 to 4000) for Solve X, Solve Y, Solve Z and Total, CPU versus GPU; panels: SINGLE, DOUBLE; speed-up annotations: 10.3x, 9.5x]
Outline
Fluid Simulation in 3D domain
Multi-GPU implementation
— General splitting algorithm
— Running computations using CUDA
— Benchmarking and performance analysis
— Improving weak scaling
Multi-GPU motivation
Limited available amount of memory
— 3D arrays: grid, temporary arrays, matrices
— Max size of grid that can fit into Tesla M2050 ~ $224^3$
Distribute the computations between multiple GPUs and multiple nodes
— Can compute large grids
— Speed-up computations
Main Idea of mGPU
Systems along Y/Z are solved independently in parallel on each GPU
— No data transfer
Along X data must be synchronized
[Figure: grid with axes X, Y, Z split along X between GPU 0, GPU 1, GPU 2; alternating directions X, Y, Z are computed in turn]
CUDA - parallelization
Split the grid along X (the longest stride)
Z + Y * dimZ + X * dimY * dimZ
Launch kernels on several GPUs from one host thread
Data transfer
— Async P2P through PCI-E (cudaMemcpyPeerAsync)
for (int i = 0; i < numDev; i++)
{
    cudaSetDevice(i);                 // Switch device
    kernel<<<…>>>(devArray[i], ..);   // Computation
}
CUDA 4.x
Synchronization of Nonlinear Layer
• High aggregate throughput on 8 GPU system
• Communication impact is not significant
for (int i = 0; i < numDev-1; i++)
    cudaMemcpyPeerAsync(dHaloLeft[i+1], i+1, dDataRight[i], i, num_bytes, devStream[i]);
// might need multidev synchronization here
for (int i = 1; i < numDev; i++)
    cudaMemcpyPeerAsync(dHaloRight[i-1], i-1, dDataLeft[i], i, num_bytes, devStream[i]);
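The data movement of the two loops can be checked on the host with plain vectors. In this sketch the `Slab` struct, its field names and `exchangeHalos` are hypothetical, and a vector copy stands in for each `cudaMemcpyPeerAsync` call; it shows which layer goes where, not the asynchronous behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-device slab: interior X layers plus one halo layer per side.
struct Slab {
    std::vector<float> data;       // interior layers, X-major
    std::vector<float> haloLeft;   // copy of the left neighbour's last layer
    std::vector<float> haloRight;  // copy of the right neighbour's first layer
};

void exchangeHalos(std::vector<Slab>& slabs, std::size_t layerSize) {
    const auto n = static_cast<std::ptrdiff_t>(layerSize);
    for (const Slab& s : slabs) assert(s.data.size() >= layerSize);
    // Last layer of device i goes to the left halo of device i+1.
    for (std::size_t i = 0; i + 1 < slabs.size(); ++i)
        slabs[i + 1].haloLeft.assign(slabs[i].data.end() - n, slabs[i].data.end());
    // First layer of device i goes to the right halo of device i-1.
    for (std::size_t i = 1; i < slabs.size(); ++i)
        slabs[i - 1].haloRight.assign(slabs[i].data.begin(), slabs[i].data.begin() + n);
}
```

The outermost devices keep one empty halo each, matching the loop bounds `numDev-1` and `1` in the slide's code.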
Solve X (tridiagonal solver)
[Figure: X segments spanning GPU slabs, labelled bound, partially bound and unbound, with halo cells]
Solve X (tridiagonal solver)
• Process bound segments without intercommunication
• Interleave segments for better memory access – one segment per thread
• Align to the left
• Gauss elimination
• Communicate (forward and backward)
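To make the forward and backward communication concrete, the sketch below splits one tridiagonal sweep into chunks, as if each chunk lived on a different GPU. The only data crossing a chunk boundary is the pair of modified coefficients (forward) and one solution value (backward). The names `Carry`, `forwardChunk` and `backwardChunk` are hypothetical, and the hand-off is a plain function argument rather than a device-to-device copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Modified coefficients of the last row of the previous chunk.
struct Carry { double c; double d; };

// Forward elimination on rows [begin, end). For the first chunk pass {0, 0}.
Carry forwardChunk(const std::vector<double>& a, const std::vector<double>& b,
                   std::vector<double>& c, std::vector<double>& d,
                   std::size_t begin, std::size_t end, Carry in) {
    assert(begin < end && end <= b.size());
    for (std::size_t p = begin; p < end; ++p) {
        const double denom = b[p] - a[p] * in.c;
        c[p] = c[p] / denom;
        d[p] = (d[p] - a[p] * in.d) / denom;
        in = {c[p], d[p]};
    }
    return in;  // sent to the next chunk
}

// Back substitution on rows [begin, end). xNext is x[end] received from the
// chunk to the right; pass 0 for the last chunk. Returns x[begin].
double backwardChunk(const std::vector<double>& c, const std::vector<double>& d,
                     std::vector<double>& x, std::size_t begin, std::size_t end,
                     double xNext) {
    assert(begin < end && end <= x.size());
    for (std::size_t i = end; i-- > begin;) {
        x[i] = d[i] - c[i] * xNext;
        xNext = x[i];
    }
    return xNext;  // sent to the previous chunk
}
```

Because each chunk has to wait for its neighbour's carry, only one chunk is busy at a time in this form of the sweep.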
Solve X
Split the grid (“long X”)
• Array[i*dimz*dimy+…]
• Allocation of layers in mGPU
• 3D segment analysis
Forward sweep along X
Back sweep along X
[Figure, animation frames: grid with axes X, Y, Z split along X; one GPU is marked active at a time as the sweep moves from slab to slab]
Result: No speedup along X
Benchmarks
Multiple GPU: 8 Tesla M2050 with P2P
Multiple Nodes: 4 InfiniBand MPI nodes, 1 Tesla M2090 each
Sample tests:
Box Pipe
White Sea
[Figure: Box Pipe with axes X, Y, Z; cross-section 1 x 1, length L]
Results: 8 GPU, 1 MPI node
[Charts: total throughput in millions of points per second for 1, 2, 4 and 8 GPUs; panels: White Sea (axis 0 to 35), Box Pipe (axis 0 to 250); scaling annotations: x4.5, x1.4, x2.9, x1.35]
Tesla M2050, grid $224^3$
1 GPU Efficiency
[Charts: points/ms versus grid size (0 to 256) on one GPU; panels: Box Pipe (axis 0 to 80000), White Sea (axis 0 to 20000); Tesla M2090]
Estimate the amount of work per GPU in an 8-GPU system using a single GPU: $256^3 / 8 = 128^3$
— Box Pipe: enough work for a single GPU
— White Sea: the domain takes about 5% of the grid volume, so a grid size of $128^3$ is not enough
Results: 1 GPU, 4 MPI nodes
[Charts: total throughput in millions of points per second for 1, 2 and 4 MPI nodes; panels: White Sea (axis 0 to 35), Box Pipe (axis 0 to 200); scaling annotations: x2.8, x1.2]
Tesla M2090
Load Balancing
[Chart: number of segments (axis 0 to 1400) versus x (0 to 288); curve label: Y(x) + Z(X) + X(x)dX]
X splitting criteria:
— Equal volumes
— Equal number of segments
Performance benefit observed: up to 15.5%
Tesla M2090
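One way to turn a splitting criterion into X boundaries is a greedy prefix-sum split, sketched below. This is an illustration under assumptions, not the project's algorithm: `splitByWork` is a hypothetical name, and `work[x]` stands for whatever per-layer weight the criterion uses (for example the number of segments or the number of active cells in layer x).

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns numDev + 1 boundaries; device k owns X layers [bounds[k], bounds[k+1]).
// Every device receives at least one layer.
std::vector<std::size_t> splitByWork(const std::vector<double>& work,
                                     std::size_t numDev) {
    const std::size_t n = work.size();
    assert(numDev > 0 && n >= numDev);
    const double total = std::accumulate(work.begin(), work.end(), 0.0);
    std::vector<std::size_t> bounds(numDev + 1, n);
    bounds[0] = 0;
    std::size_t x = 0;
    double running = 0.0;
    for (std::size_t dev = 1; dev < numDev; ++dev) {
        const double target =
            total * static_cast<double>(dev) / static_cast<double>(numDev);
        const std::size_t maxEnd = n - (numDev - dev);  // keep one layer per later device
        do {
            running += work[x++];  // always take at least one layer
        } while (x < maxEnd && running + work[x] <= target);
        bounds[dev] = x;
    }
    return bounds;
}
```

With uniform weights this reduces to an even split along X; with a heavy layer near one end, that layer ends up in a smaller slab.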
Load Balancing. White Sea (288x320x320)
[Charts: time in ms (axis 0 to 20) of SweepX, SweepY, SweepZ and Transpose on GPU 0 to GPU 3; panels: Even X, Even Segments, Even Volumes; Tesla M2090]
ttotal values as extracted: 47.3, 44.3, 44.4
Analysis
All parts of the solver but one (Gauss elimination along X) are fully parallel
Communication (using P2P + InfiniBand) is not a big issue for the given problem size
Poor weak scaling
Use blocks to hide latency for X sweeps
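A rough model of why blocks help, under idealised assumptions that are not from the source: every GPU holds one X slab, the slab is cut into equal blocks, a GPU can start a block as soon as the previous GPU has finished the same block, and transfers and kernel launches cost nothing. The function name `pipelinedSweepTime` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Idealised duration of one sweep direction, in units of "one GPU sweeping
// its whole slab". The last GPU starts after (numGpus - 1) block-steps and
// then needs numBlocks more block-steps, each costing 1 / numBlocks.
double pipelinedSweepTime(std::size_t numGpus, std::size_t numBlocks) {
    assert(numGpus > 0 && numBlocks > 0);
    return static_cast<double>(numGpus - 1 + numBlocks) /
           static_cast<double>(numBlocks);
}
```

With a single block the time equals the number of GPUs, which matches the "No speedup along X" result; with more blocks it approaches 1. Real runs also pay a per-block overhead that this model ignores, so the measured optimum sits at a finite number of blocks.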
Improved Solve X
Split the grid (“long X”)
• Array[i*dimz*dimy+…]
• Allocation of layers in mGPU
• 3D segment analysis
Splitting the grid into XY blocks along the Z direction
• Segments sorting
• Sweep through all scalar fields at once
Forward sweep along X, async halo send forward
Move to the next block group
Backward sweep along X, async halo send backward
Equal work per node!
[Figure, animation frames: grid with axes X, Y, Z split along X between GPU0, GPU1, GPU2 and cut into blocks B0 to B4]
Algorithm
[Diagram, three animation frames: nodes $node_0 \dots i_{node} \dots N_{nodes}$ along X, blocks $block_0 \dots i_{block} \dots$ along Z; labels $2(N_{nodes} - i_{node} - 1)$, $i_{block} - i_{node}$, $N_Z$, $N_{blocks}$]
Frame 1: receive $X_{i_{node}-1}$, receive $X_{i_{node}+1}$ (cudaStream1, cudaStream2)
Frame 2: Forward, Backward (cudaStream1, cudaStream2)
Frame 3: send $X_{i_{node}}$, send $X_{i_{node}}$ (cudaStream1, cudaStream2)
Improved Solve XY
Y blocks
Separate buffer for Y sweeps
Block Y sweeps are performed independently in separate cudaStreams
Helps with data transfer/compute overlap
[Figure: grid with axes X, Y, Z cut into blocks B0 to B4]
Weak Scaling
[Chart: average time for Solve XYZ in ms (axis 100 to 400) versus number of GPUs (axis 0 to 10); Box Pipe]
Grids: $224^3$, $288^3$, $352^3$, $448^3$
Tesla M2050
Big Systems Limit
[Chart: average time for Solve XYZ in ms (axis 0 to 250) versus number of blocks (1, 2, 4, 8, 16, 32)]
Consider one scalar field: no physics, more available RAM
8 M2050 GPUs
Grid: $768^3$
With larger grid sizes, the curve minimum shifts down/right
Conclusions
GPU outperforms a multi-core CPU by roughly 10x (8.4x to 10.3x in the benchmarks above)
GPU works well with complex input domains
Performance and scaling factors depend heavily on input geometry and grid size
— Efficient work distribution methods are essential for performance
Using block-splitting for ADI improves the scaling factor by hiding the dependency of sweep processing
Future work
Test on large scale systems
— Potentially on “Lomonosov” supercomputer at MSU
— GPU part with peak performance of 863 TFlops
Memory usage optimizations
Explore different tridiagonal approaches
Questions?
Thank you!