![Page 1: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest](https://reader034.vdocument.in/reader034/viewer/2022042204/5ea567219a2cc226dc2a320a/html5/thumbnails/1.jpg)
Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA
Daniel P. Combest, Dr. P.A. Ramachandran, and Dr. M.P. Dudukovic
Optimization, HPC, and Pre- and Post-Processing I Session. 6th OpenFOAM Workshop, Penn State University. June 15th, 2011
Chemical Reaction Engineering Laboratory (CREL), Department of Energy, Environmental, and Chemical Engineering. Washington University, St. Louis, MO.
Objectives
Introduction to The GPU and CUDA
What exactly is CUDA?
Compute Unified Device Architecture, i.e. a parallel computing architecture used in graphics processing units (GPUs), developed by Nvidia.
What is CUDA C/C++?
A language that provides an interface so that parallel algorithms can be run on CUDA-enabled Nvidia GPUs.
GPU vs. CPU Calculations
Figure: CPU-GPU comparison of floating-point operations per second [1]
Why are we interested?
● Larger problems require more computing resources (LES, coupled physics)
● GPUs are fast when used properly
● They are relatively cheap
Where can GPUs be applied?
Where parallel algorithms live
● Linear algebra, i.e. sparse matrix math
Why don't we compile everything to work on the GPU?
Only programs written in the CUDA language can be parallelized on the GPU, so we cannot just recompile OpenFOAM (OF).
Integrating CUSP into OpenFOAM
“Cusp is a library for sparse linear algebra and graph computations on CUDA. Cusp provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems.”[2]
Provided Template Solvers:
• (Bi-)Conjugate Gradient (-Stabilized)
• GMRES
Matrix Storage:
• CSR, COO, HYB, DIA
Provided Preconditioners:
• Jacobi (diagonal) preconditioner
• Sparse Approximate Inverse preconditioner
• Smoothed-Aggregation Algebraic Multigrid preconditioner
cusp-Library http://code.google.com/p/cusp-library/
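To make the solver and preconditioner pairing listed above concrete, here is a minimal CPU-only sketch of a Jacobi (diagonal) preconditioned conjugate gradient iteration on a COO matrix. It is plain C++, not Cusp code: the `Coo` struct and the `multiply`, `dot`, and `jacobiPcg` names are hypothetical, the matrix is assumed to be symmetric positive definite with a nonzero diagonal, and the stopping test is a relative 2-norm of the residual rather than OpenFOAM's normalized residual.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical COO container: entry k is A(row[k], col[k]) = val[k].
struct Coo {
    std::vector<int> row, col;
    std::vector<double> val;
};

// y = A * x for a square COO matrix (entries may appear in any order).
std::vector<double> multiply(const Coo& A, const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t k = 0; k < A.val.size(); ++k) {
        y[A.row[k]] += A.val[k] * x[A.col[k]];
    }
    return y;
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Jacobi-preconditioned CG. x holds the initial guess on entry and the
// approximate solution on exit. Returns the number of iterations performed.
int jacobiPcg(const Coo& A, const std::vector<double>& b,
              std::vector<double>& x, double tol, int maxIter) {
    const std::size_t n = b.size();

    // Preconditioner: M^-1 = 1 / diag(A). Assumes every diagonal is nonzero.
    std::vector<double> diag(n, 0.0);
    for (std::size_t k = 0; k < A.val.size(); ++k) {
        if (A.row[k] == A.col[k]) diag[A.row[k]] += A.val[k];
    }

    // Initial residual r = b - A x.
    std::vector<double> r = b;
    const std::vector<double> Ax = multiply(A, x);
    for (std::size_t i = 0; i < n; ++i) r[i] -= Ax[i];

    std::vector<double> z(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / diag[i];
    std::vector<double> p = z;
    double rz = dot(r, z);

    double bNorm = std::sqrt(dot(b, b));
    if (bNorm == 0.0) bNorm = 1.0;

    for (int it = 0; it < maxIter; ++it) {
        // Relative 2-norm stopping test (an assumption of this sketch).
        if (std::sqrt(dot(r, r)) <= tol * bNorm) return it;

        const std::vector<double> Ap = multiply(A, p);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / diag[i];

        const double rzNew = dot(r, z);
        const double beta = rzNew / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rzNew;
    }
    return maxIter;
}
```

On the GPU the same loop structure applies, with the matrix-vector product and the vector updates being the parts that run in parallel.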
“Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity.” [3]
http://code.google.com/p/thrust/
Diagram: OpenFOAM solve(…); hands the system A X = b to a Cusp-based solver on the GPU through Thrust methods; the solver itself uses cusp methods. The steps are:
● The lduMatrix is converted to COO using thrust::copy() in C++.
● The COO matrix is transferred to the GPU in CUDA code.
● COO is converted to other formats on the GPU and passed to the CUSP-based solver with the convergence criteria.
● The residual is calculated using the OF normalized residual method.
● The X vector and solver performance data are passed back to OpenFOAM using Thrust methods.
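The lduMatrix-to-COO conversion step described above might look like the following plain C++ sketch. It does not use OpenFOAM, Thrust, or Cusp: `LduData`, `CooMatrix`, `lduToCoo`, and `cooMultiply` are hypothetical names. The layout assumption is that internal face f couples cells lowerAddr[f] and upperAddr[f], with upper[f] stored at (lowerAddr[f], upperAddr[f]) and lower[f] at (upperAddr[f], lowerAddr[f]). The triplets are emitted unsorted; a GPU library may expect them sorted by row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the lower/diagonal/upper data held by an lduMatrix.
struct LduData {
    std::vector<double> diag;    // one coefficient per cell
    std::vector<double> lower;   // one coefficient per internal face
    std::vector<double> upper;   // one coefficient per internal face
    std::vector<int> lowerAddr;  // lower-numbered cell of each face
    std::vector<int> upperAddr;  // higher-numbered cell of each face
};

// Coordinate (COO) storage: entry k is A(row[k], col[k]) = val[k].
struct CooMatrix {
    std::vector<int> row, col;
    std::vector<double> val;
};

CooMatrix lduToCoo(const LduData& A) {
    const std::size_t nCells = A.diag.size();
    const std::size_t nFaces = A.lowerAddr.size();
    assert(A.upperAddr.size() == nFaces);
    assert(A.lower.size() == nFaces && A.upper.size() == nFaces);

    CooMatrix coo;
    const std::size_t nnz = nCells + 2 * nFaces;
    coo.row.reserve(nnz);
    coo.col.reserve(nnz);
    coo.val.reserve(nnz);

    // Diagonal entries: one per cell.
    for (std::size_t c = 0; c < nCells; ++c) {
        coo.row.push_back(static_cast<int>(c));
        coo.col.push_back(static_cast<int>(c));
        coo.val.push_back(A.diag[c]);
    }

    // Off-diagonal entries: two per internal face.
    for (std::size_t f = 0; f < nFaces; ++f) {
        // Upper coefficient sits in the row of the lower-numbered cell.
        coo.row.push_back(A.lowerAddr[f]);
        coo.col.push_back(A.upperAddr[f]);
        coo.val.push_back(A.upper[f]);
        // Lower coefficient sits in the row of the higher-numbered cell.
        coo.row.push_back(A.upperAddr[f]);
        coo.col.push_back(A.lowerAddr[f]);
        coo.val.push_back(A.lower[f]);
    }
    return coo;
}

// y = A * x, used here only to check the conversion.
std::vector<double> cooMultiply(const CooMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t k = 0; k < A.val.size(); ++k) {
        y[A.row[k]] += A.val[k] * x[A.col[k]];
    }
    return y;
}
```

A quick sanity check is to multiply the converted matrix by a known vector and compare against the product computed directly from the LDU arrays.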
Preliminary Results
A Test Problem
2D Heat Equation: $\nabla^2 T = 0$
Vary N from 10 to 2000, where $N^2$ = nCells
Solver Settings
All CG solvers:
Tolerance = 1e-10; MaxIter 1000;
solver GAMG;
tolerance 1e-10;
smoother GaussSeidel;
nPreSweeps 0;
nPostSweeps 2;
cacheAgglomeration true;
nCellsInCoarsestLevel sqrt(nCells);
agglomerator faceAreaPair;
mergeLevels 1;
Setup
● CUDA version 4.0
● CUSP version 0.2
● Thrust version 1.4
● Ubuntu 10.04
CPU: Dual Intel Xeon Quad Core E5430, 2.66 GHz
Motherboard: Tyan S5396
RAM: 24 GB
GPU: Tesla C2050, 3 GB GDDR5
● 515 Gflops peak double precision
● 1.03 Tflops peak single precision
● 14 MP * 32 cores/MP = 448 cores
● Host-device memory bandwidth = 1566 MB/sec (motherboard specific)
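To put the host-device bandwidth figure in context, the following back-of-the-envelope sketch estimates how long one COO matrix transfer would take. Everything here is an assumption for illustration: a 5-point stencil on an N x N mesh, 32-bit row and column indices, 64-bit values, 1 MB taken as 1e6 bytes, and the hypothetical function names `fivePointNonZeros`, `cooBytes`, and `transferSeconds`. It is not a measurement from the talk.

```cpp
#include <cstdint>

// Nonzeros of a 5-point stencil on an n x n grid: n*n diagonal entries plus
// two off-diagonal entries for each of the 2*n*(n-1) internal faces.
std::int64_t fivePointNonZeros(std::int64_t n) {
    return n * n + 4 * n * (n - 1);
}

// Bytes of COO storage with 32-bit row and column indices and 64-bit values.
std::int64_t cooBytes(std::int64_t nnz) {
    return nnz * (2 * 4 + 8);
}

// Seconds to move `bytes` over a link rated in MB/s (1 MB taken as 1e6 bytes).
double transferSeconds(std::int64_t bytes, double megabytesPerSecond) {
    return static_cast<double>(bytes) / (megabytesPerSecond * 1.0e6);
}
```

Under these assumptions, N = 2000 (4,000,000 cells) gives 19,992,000 nonzeros, about 320 MB of COO data, and roughly 0.2 s for a one-way matrix transfer at 1566 MB/sec, before counting the x and b vectors.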
Solve Time
Figure: Solve() Time Comparison. Time [seconds] (0 to 1400) vs. nCells (0 to 4,500,000) for cusplink_SmAPCG, GAMG, cusplink_DPCG, cusplink_CG, DPCG-parallel4, DPCG-parallel6-s231, DPCG, and CG.
Solution Speedup
Figure: Speedup Comparison. Speedup (0 to 18) vs. nCells (0 to 4,500,000) for DPCG, DPCG-parallel4, DPCG-parallel6-s231, DPCG-parallel6-s161, cusplink_DPCG, and cusplink_CG.
$\text{Speedup} = T_s / T_p = T_{\text{OFCG}} / T_{\text{other}}$
Figure: Speedup Comparison. Speedup (0 to 140) vs. nCells (0 to 4,500,000) for DPCG, DPCG-parallel4, DPCG-parallel6-s231, DPCG-parallel6-s161, cusplink_CG, cusplink_DPCG, GAMG, GAMG6, and cusplink_SmAPCG.
Figure: Speedup Comparison. Speedup (0 to 60) vs. nCells (0 to 1,200,000) for DPCG, DPCG-parallel4, DPCG-parallel6-s231, DPCG-parallel6-s161, cusplink_CG, cusplink_DPCG, GAMG6, GAMG, and cusplink_SmAPCG.
Important Considerations
Next Steps
Take Home Messages
● The GPU only solves the Ax = b system.
● We have double precision.
● GPUs have been integrated into OpenFOAM using Thrust and CUSP.
● As Cusp and Thrust improve, nothing needs to change in this code; only Cusp and Thrust need to be updated.
● The GPU solvers were faster in the cases provided, because those cases mostly consist of solving Ax = b.
● Residuals are calculated the same way as in OpenFOAM.
● Multi-GPU still needs attention.
● The results show that memory bandwidth is still an issue with this particular setup; results could be faster with a different setup.
Acknowledgements
Funding and Support
Nvidia Professor Partnership Program
Chemical Reaction Engineering Laboratory (CREL) MRE Fund (http://crelonweb.eec.wustl.edu/)
OpenFOAM Developers Community
Advisors
Dr. Ramachandran
Dr. Dudukovic
Sources
1. Nvidia CUDA Programming Guide, Version 4.0, 2011. Nvidia Corporation.
2. Nathan Bell and Michael Garland, Cusp: Generic Parallel Algorithms for Sparse Matrix and Graph Computations, 2010, http://cusp-library.googlecode.com, Version 0.1.0.
3. Jared Hoberock and Nathan Bell, Thrust: A Parallel Template Library, 2010, http://www.meganewtons.com/, Version 1.3.0.
Figure: Speedup. Speedup (0 to 140) vs. nCells (0 to 4,500,000) for cusplink_CG, cusplink_DPCG, cusplink_SmAPCG, DPCG, DPCG-parallel4, DPCG-parallel6-s231, DPCG-parallel6-s161, GAMG, and GAMG6.
Residual Scaling
For the matrix system A x = b, the residual is defined as
res = b - Ax
We then apply residual scaling with the following normalisation factor procedure:
Type xRef = gAverage(x);
wA = A x; pA = A xRef;
normFactor = gSum(cmptMag(wA - pA) + cmptMag(source - pA)) + matrix.small_;
and the scaled residual is:
residual = gSum(cmptMag(source - wA))/normFactor;
I will save you from complications with vectors and tensors in my block solver. :-)
Enjoy,
Hrv
Source: http://www.cfd-online.com/Forums/openfoam-solving/57903-residuals-convergence-segregated-solvers.html
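The residual scaling procedure quoted above can be sketched for a scalar field in serial, plain C++. This is an illustration under stated assumptions, not OpenFOAM code: `Vec`, `MatVec`, and `normalizedResidual` are hypothetical names, plain sums and a plain mean stand in for the parallel reductions gSum and gAverage, absolute value stands in for cmptMag, and the default of 1e-20 for the small constant is an assumed value for matrix.small_.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
// Callback that returns A * v for the system matrix A.
using MatVec = std::function<Vec(const Vec&)>;

// Scaled residual of A x = source, following the quoted procedure.
// Assumes x is non-empty and x and source have the same length.
double normalizedResidual(const MatVec& applyA, const Vec& x,
                          const Vec& source, double small = 1.0e-20) {
    const std::size_t n = x.size();

    // xRef: average of x (serial stand-in for gAverage).
    double xRef = 0.0;
    for (double xi : x) xRef += xi;
    xRef /= static_cast<double>(n);

    const Vec wA = applyA(x);             // wA = A x
    const Vec pA = applyA(Vec(n, xRef));  // pA = A applied to the uniform field xRef

    double normFactor = small;
    double residualSum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        normFactor += std::fabs(wA[i] - pA[i]) + std::fabs(source[i] - pA[i]);
        residualSum += std::fabs(source[i] - wA[i]);
    }
    return residualSum / normFactor;
}
```

Passing the matrix as a callback keeps the sketch independent of the storage format, so the same function could be checked against either an LDU or a COO product.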