TRANSCRIPT
-
Page 1
Algebraic Multigrid (AMG)
Multigrid Methods and Parallel Computing
Klaus Stüben, Fraunhofer Institute SCAI, Schloss Birlinghoven, 53754 St. Augustin, [email protected]
2nd International FEFLOW User Conference, Potsdam, September 14-16, 2009
-
Page 2
Contents
• What are we talking about?
• (Algebraic) Multigrid
• Parallel computing nowadays
• Performance of Algebraic Multigrid
• Conclusions
-
Page 3
What Are We Talking About?
-
Page 4
Large-Scale Applications
Casting and Molding
Fluid Dynamics
Circuit Simulation
Groundwater modeling
Oil reservoir simulation
Electrochemical Plating
Process & Device Simulation
-
Page 5
Bottleneck: Linear Equation Solver
Huge but sparse "elliptic" systems:

$\sum_{j=1}^{N} a_{ij}\, u_j = f_i \quad (i = 1, \ldots, N)$, or $Au = f$

Calculation repeated many times over
• to follow the time evolution
• to resolve non-linearities
• for history matching
• for optimization processes

Required: "scalable" solvers, Work = O(N)

Scalability requires
• locally "operating" numerical methods
• exploitation of origin and nature of the discrete linear systems

⇒ hierarchical solvers: multigrid, algebraic multigrid
-
Page 6
Bottleneck: Linear Equation Solver

Additional aspects affecting numerical performance (besides size)
• extremely unstructured or locally refined grids, highly heterogeneous
-
Page 7
Bottleneck: Linear Equation Solver

Additional aspects affecting numerical performance (besides size)
• extremely unstructured or locally refined grids, highly heterogeneous
• not optimally shaped elements

Basin flow model:
• pentahedral prismatic mesh
• several fault zones
• vertical distortion of the prisms
-
Page 8
Bottleneck: Linear Equation Solver

Additional aspects affecting numerical performance (besides size)
• extremely unstructured or locally refined grids, highly heterogeneous
• not optimally shaped elements
• extreme parameter contrasts, discontinuities and anisotropies

High parameter contrast: permeabilities vary by 7 orders of magnitude
-
Page 9
Bottleneck: Linear Equation Solver

Additional aspects affecting numerical performance (besides size)
• extremely unstructured or locally refined grids, highly heterogeneous
• not optimally shaped elements
• extreme parameter contrasts, discontinuities and anisotropies
• non-symmetry and indefiniteness
• multiple (varying) degrees of freedom, ...

Nothing is for free! Compromise?

Requested: the solver should
• be fast (scalable)
• be general
• require low memory
• be easy to integrate
-
Page 10
(Algebraic) Multigrid
-
Page 11
Geometric versus Algebraic Multigrid

Classical one-level approach
-
Page 12
Geometric versus Algebraic Multigrid

Geometric Multigrid (GMG)

[Diagram: grid hierarchy $\Omega_h \to \Omega_{2h} \to \Omega_{4h} \to \Omega_{8h}$ with smoothers $S_h, S_{2h}, S_{4h}$ and transfer operators $I_h^{2h}, I_{2h}^{4h}, I_{4h}^{8h}$ (fine-to-coarse) and $I_{2h}^{h}, I_{4h}^{2h}, I_{8h}^{4h}$ (coarse-to-fine)]
-
Page 13
Geometric versus Algebraic Multigrid

Geometric Multigrid (GMG)

[Same grid-hierarchy diagram as on Page 12]

Basic idea: combine
• relaxation for smoothing
• coarse-grid correction
(a minimal two-grid sketch follows below)

Numerically scalable

Ideally suited for
• structured meshes
• "smooth" problems

Difficult to integrate into existing software:
• no "plug-in" software
• "technical" limitations

Hardly used in industry!
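To make the basic idea concrete, here is a minimal two-grid sketch for a 1D Poisson model problem (not part of the original slides; it assumes NumPy, and the grid size, damping factor and transfer operators are illustrative only): damped-Jacobi relaxation smooths the error, and the remaining smooth error is removed by a coarse-grid correction.

```python
import numpy as np

def two_grid_cycle(A, A2h, R, P, u, f, nu=2, omega=2.0/3.0):
    """One two-grid cycle: damped-Jacobi smoothing plus coarse-grid correction."""
    D = np.diag(A)
    for _ in range(nu):                               # pre-smoothing (relaxation)
        u = u + omega * (f - A @ u) / D
    e2h = np.linalg.solve(A2h, R @ (f - A @ u))       # coarse-grid problem for the error
    u = u + P @ e2h                                   # coarse-grid correction
    for _ in range(nu):                               # post-smoothing
        u = u + omega * (f - A @ u) / D
    return u

# 1D Poisson model problem with n interior points (n odd so the grid coarsens evenly)
n = 31
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)

# linear interpolation P (coarse -> fine) and full-weighting restriction R (fine -> coarse)
nc = (n - 1) // 2
P = np.zeros((n, nc))
for j in range(nc):
    P[2 * j, j], P[2 * j + 1, j], P[2 * j + 2, j] = 0.5, 1.0, 0.5
R = 0.5 * P.T
A2h = R @ A @ P                                       # Galerkin coarse-grid operator

u = np.zeros(n)
for k in range(8):
    u = two_grid_cycle(A, A2h, R, P, u, f)
    print(k, np.linalg.norm(f - A @ u))
```

Each cycle reduces the residual by a roughly constant factor independent of the mesh size, which is what the O(N) scalability mentioned earlier rests on.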
-
Page 14
Geometric versus Algebraic Multigrid

Geometric Multigrid (GMG): [grid-hierarchy diagram as on Page 12]

Algebraic Multigrid (AMG): [matrix-hierarchy diagram: given linear system $A_1 u = f$; the coarse-level matrices $A_2, A_3, A_4$ and the transfer operators $I_1^2, I_2^3, I_3^4$ and $I_2^1, I_3^2, I_4^3$ are constructed automatically as part of the AMG algorithm]
-
Page 15
Geometric versus Algebraic Multigrid
Algebraic Multigrid (AMG)

Given linear system: $A_1 u = f$. The matrix hierarchy $A_2, A_3, A_4$ and the transfer operators are constructed automatically as part of the AMG algorithm.

Two-stage process (a setup sketch follows below):

Recursive setup phase, $i = 1, 2, \ldots$:
1. Find subset of variables representing level $i+1$
2. Construct interpolation $I_{i+1}^{i}$
3. Construct restriction $I_i^{i+1}$
4. Compute coarse-level matrix via the sparse (Galerkin) matrix product $A_{i+1} = I_i^{i+1} A_i I_{i+1}^{i}$

Straightforward solution phase:
• select "simple" smoother (e.g., relaxation)
• cycling as before
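A sketch of this recursive setup loop (not taken from the slides): it uses SciPy sparse matrices, and the piecewise-constant "interpolation" that aggregates pairs of consecutive unknowns is a deliberately crude stand-in for the real coarse-variable selection and interpolation heuristics of classical AMG; only the structure of the loop and the sparse Galerkin product are the point here.

```python
import numpy as np
import scipy.sparse as sp

def crude_interpolation(A):
    """Stand-in for steps 1-2: aggregate pairs of consecutive unknowns into one
    coarse variable (real AMG uses a C/F splitting and matrix-dependent weights)."""
    n = A.shape[0]
    nc = (n + 1) // 2
    rows = np.arange(n)
    cols = rows // 2
    return sp.csr_matrix((np.ones(n), (rows, cols)), shape=(n, nc))

def amg_setup(A, max_levels=10, min_size=10):
    """Recursive setup phase: build interpolation, restriction and the
    Galerkin coarse-level matrix A_{i+1} = I_i^{i+1} A_i I_{i+1}^i per level."""
    levels = [{'A': A}]
    while levels[-1]['A'].shape[0] > min_size and len(levels) < max_levels:
        A_i = levels[-1]['A']
        P = crude_interpolation(A_i)        # step 2: interpolation
        R = P.T.tocsr()                     # step 3: restriction
        A_c = (R @ A_i @ P).tocsr()         # step 4: sparse Galerkin product
        levels[-1]['P'], levels[-1]['R'] = P, R
        levels.append({'A': A_c})
    return levels

# 1D Poisson as a toy "given linear system" A1 u = f
n = 1000
A1 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
hierarchy = amg_setup(A1)
print([lvl['A'].shape[0] for lvl in hierarchy])   # shrinking level sizes
```

A real AMG code would additionally attach the smoothers to each level and use this hierarchy in the solution phase.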
-
Page 16
Geometric versus Algebraic Multigrid
Algebraic Multigrid (AMG)

Relies on the same basic idea: combine
• relaxation for smoothing
• coarse-"grid" correction

"Virtually" scalable

Also suited for
• unstructured meshes, 2D/3D
• "non-smooth" problems

Easy to integrate into existing software:
• "plug-in" software
• requires only matrices (see the example below)

Very popular in industry!

[Matrix-hierarchy diagram as on Page 14: given linear system $A_1 u = f$; coarse levels constructed automatically as part of the AMG algorithm]
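The "plug-in, requires only matrices" character can be illustrated with the open-source PyAMG package (SAMG itself is a commercial library and its API is not shown here). Assuming PyAMG and SciPy are installed, the user hands over nothing but the sparse matrix and the right-hand side:

```python
import numpy as np
import pyamg

# Any sparse matrix in CSR format is enough; here a 2D Poisson model problem.
A = pyamg.gallery.poisson((500, 500), format='csr')   # ~250,000 unknowns
b = np.random.rand(A.shape[0])

ml = pyamg.ruge_stuben_solver(A)                  # setup phase: classical (Ruge-Stueben) AMG
residuals = []
x = ml.solve(b, tol=1e-8, residuals=residuals)    # solution phase: cycling
print(ml)                                         # prints the constructed level hierarchy
print(len(residuals), "iterations")
```

The same call works for matrices coming from unstructured meshes, since no grid information is needed.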
-
Page 17
Scalability
Durlofsky problem:
• flow from left to right
• heterogeneous conductivity (red = 1 m/s, blue = 10^-6 m/s)
• steady-state
• basic mesh: 20x20 (refined by halving mesh sizes)

[Figure: conductivity field and solution, color scale from h = 1 to h = 0]
-
Page 18
Scalability

[Plot: computational work per grid node vs. number of elements (1,600 to 1,638,400), comparing SAMG and standard PCG]
-
Page 19
Steady State Test Cases (USGS, MODFLOW)

Case 1: standard solver 662 iterations, CPU = 489 sec; SAMG 13 iterations, CPU = 59.8 sec → speedup = 8.2
Case 2: standard solver 4,474 iterations, CPU = 1,479 sec; SAMG 11 iterations, CPU = 23.4 sec → speedup = 63.2 (!)
-
Page 20
Distorted Mesh Case (Wasy)
Basin flow model:
• pentahedral prismatic mesh
• a number of fault zones
• vertical distortion of the prisms
• transient
• parameter contrast: 3 orders of magnitude
-
Page 21
Distorted Mesh Case (Wasy)
Standard one-level method:
• failed to converge below 10^-8
• solution fully unstable

SAMG:
• stable solution
• sufficiently accurate
-
Page 22
Electrochemical Machining - Extreme Aspect Ratios
Diesel injection system, ~4.2 million tetrahedra
Aspect ratios up to 1:10,000!
-
Page 23
Electrochemical Machining - Extreme Aspect Ratios
[Plot: convergence history, log(residual) from 0 down to -4.5 vs. iteration count (0 to 1000), comparing AMG/CG and ILU/CG]

Diesel injection system, ~4.2 million tetrahedra
Aspect ratios up to 1:10,000!
-
Page 24
Summary of AMG

Steady state applications (elliptic)
• highly efficient
• scalable
• increased robustness
• in particular: discontinuities, anisotropies, ...

Time dependent applications
• applicable in each time step
• increased robustness
• efficiency depends on various aspects such as
  - requested accuracy per time step
  - mesh size
  - "behavior" of classical solvers
  - time discretization, in particular, time step size
How does AMG perform on parallel computers?
-
Page 25
Parallel Computing Nowadays
-
Page 26
Hardware

Distributed memory
• MPI (process-based)

[Diagram: CPU1, CPU2, CPU3, each with its own RAM and cache, connected by a network (Gigabit, InfiniBand, ...)]
-
Page 27
Hardware

Shared memory (multicore)
• MPI (process-based)
• OpenMP (thread-based)
• mixed MPI/OpenMP

[Diagram: two quad-core sockets connected by a cross-node link]
-
Page 28
Hardware

Hybrid architectures
• mere MPI
• mixed MPI/OpenMP

[Diagram: multicore nodes, each with CPUs, RAM and caches, connected by a network (Gigabit, InfiniBand, ...)]
-
Page 29
Hardware

Distributed memory
• MPI (process-based)

Shared memory (multicore)
• MPI (process-based)
• OpenMP (thread-based)
• mixed MPI/OpenMP

Hybrid architectures
• mere MPI
• mixed MPI/OpenMP

Specialized hardware
• GPUs (CUDA, OpenCL)
• Cell
-
Page 30
Parallel Computing

General goal
• combine serial scalability with the scalability of parallel hardware

However, this is not trivial at all
• there is a conflict
  - truly parallel solvers are usually not serially scalable
  - serially scalable solvers are usually not truly parallel
• in particular, important AMG modules are not parallel
• parallelization requires changes in the algorithm/numerics
• this requires compromises
  - ensure "stable" convergence (essentially independent of the number of processors)
  - minimize fill-in / complexity / overhead
  - minimize communication / memory access

Classical (A)MG parallelization: MPI-based
-
Page 31
MPI-based Parallelization
[Diagram: the geometric grid hierarchy (smoothers $S_h, S_{2h}, S_{4h}$ with transfer operators $I_h^{2h}, \ldots$) side by side with the algebraic matrix hierarchy ($A_1, \ldots, A_4$ with transfer operators $I_1^2, \ldots, I_4^3$)]
-
Page 32
MPI-based Parallelization

[Diagram: grid and matrix hierarchies as on Page 31, with each level partitioned across the processes]

Partitioning of the 1st grid induces a partitioning on all levels (a small sketch follows below)!

The user just provides the 1st-level set of equations along with a partitioning.

The setup phase has to be parallelized!
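A small sketch of the statement above (illustrative, plain NumPy instead of real MPI; the array names are made up): each process owns a set of fine-level rows, and every coarse variable simply keeps the owner of the fine variable it was created from, so the user-supplied first-level partitioning induces the partitioning on every coarser level.

```python
import numpy as np

def induced_partition(owner_fine, coarse_of_fine):
    """owner_fine[i]     = rank owning fine-level variable i
       coarse_of_fine[i] = index of the coarse variable that fine variable i
                           becomes, or -1 if i stays a fine (F) point."""
    n_coarse = coarse_of_fine.max() + 1
    owner_coarse = np.empty(n_coarse, dtype=int)
    is_c = coarse_of_fine >= 0
    owner_coarse[coarse_of_fine[is_c]] = owner_fine[is_c]   # coarse var inherits its owner
    return owner_coarse

# 8 fine variables on 2 ranks; every second variable becomes a coarse (C) point
owner_fine     = np.array([0, 0, 0, 0, 1, 1, 1, 1])
coarse_of_fine = np.array([0, -1, 1, -1, 2, -1, 3, -1])
print(induced_partition(owner_fine, coarse_of_fine))        # -> [0 0 1 1]
```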
-
Page 33
MPI-based Parallelization

Parallel setup phase:
1. Find subset of variables — inherently sequential, most critical in parallelization! Various attempts have been made to construct parallel analogs (e.g., LLNL, Griebel). Unfortunately, ...
2. Construct interpolation — fully parallel
3. Construct restriction — fully parallel
4. Compute coarse-level matrix — fully parallel
However: highly communication-intensive.

Solution phase:
1. Smoothing — generally, good smoothers are serial (e.g., GS, ILU)! Compromise: hybrid smoothers (e.g., GS/Jacobi), as sketched below.
2. Restriction — fully parallel
3. Interpolation — fully parallel
4. Stepping from fine to coarse and back — inherently sequential! Towards coarser levels: increasingly bad computation-to-communication ratio. Various attempts have been made to improve this (e.g., McCormick, Griebel). Unfortunately, ...
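A serial sketch of the hybrid GS/Jacobi compromise (not from the slides; the real implementation operates on distributed rows via MPI): within each process's block of rows the smoother acts like Gauss-Seidel, while couplings to rows owned by other processes only see the values from the previous sweep, i.e. Jacobi across process boundaries.

```python
import numpy as np

def hybrid_gs_jacobi_sweep(A, u, f, blocks):
    """One sweep: Gauss-Seidel inside each block of rows (updated values used
    immediately), Jacobi across blocks (off-block couplings use old values,
    as they would in a parallel run without extra communication)."""
    u_old = u.copy()                      # values other "processes" would see
    u_new = u.copy()
    for lo, hi in blocks:                 # each block mimics one MPI process
        for i in range(lo, hi):
            s = 0.0
            for j in range(A.shape[1]):
                if j == i:
                    continue
                # inside the block: latest value; outside the block: old value
                s += A[i, j] * (u_new[j] if lo <= j < hi else u_old[j])
            u_new[i] = (f[i] - s) / A[i, i]
    return u_new

# toy SPD system, two "processes" owning rows 0..3 and 4..7
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)
u = np.zeros(n)
for _ in range(50):
    u = hybrid_gs_jacobi_sweep(A, u, f, [(0, 4), (4, 8)])
print(np.linalg.norm(f - A @ u))
```

With a single block this reduces to plain Gauss-Seidel; with one block per row it degenerates to Jacobi, which is why convergence may deteriorate as the number of processes grows.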
-
Page 34
Performance ofAlgebraic Multigrid
-
Page 35
Hybrid Architecture (MPI)

Test case: 3D Poisson problem, 27-point FE stencil

Mesh sizes:
p0: 884,736
p1: 1,560,896
p2: 3,176,523
p3: 6,331,625

Partitioning: Metis

Hardware: 64 nodes connected by InfiniBand; 2 dual-core sockets per node (AMD Opteron).
Used as: 64-core distributed memory hardware, or as: 64-core hybrid system.
-
Page 36
Hybrid Architecture (MPI)
[Plots: speedup vs. number of cores for the solution phase, once with 1 core per node and once with 4 cores per node]

Speedup reduced: cores within a node have to share the internal bandwidth!
-
Page 37
Hybrid Architecture (mixed MPI/OpenMP)

Mixed MPI/OpenMP programming? Performance on 32 cores:
• 32:1 — 32 MPI processes with 1 thread each
• 16:2 — 16 MPI processes with 2 OpenMP threads each
• 8:4 — 8 MPI processes with 4 OpenMP threads each

OpenMP on the nodes seems advantageous: 8:4 is most efficient!
-
Page 38
Shared Memory (mixed MPI/OpenMP)

Shared memory (multicore): AMD Opteron Processor 8384, 2.7 GHz
• 8 quad-core sockets
• per socket: 4 cores, 8 x 64 KB L1 cache (instr./data), 4 x 512 KB L2 cache, 1 x 6144 KB L3 cache (shared), 1 x 16144 MB RAM (shared)

[Diagram: quad-core sockets connected by cross-node links]

Programming: MPI or (straightforward!) OpenMP, mixed MPI/OpenMP

Comparison: (MPI processes : threads per process) = (1:16), (2:8), (4:4), (8:2), (16:1)
• 1:16 = pure OpenMP, 16:1 = pure MPI, the combinations in between are truly hybrid

Remark: We use only 16 cores in order to investigate thread/memory binding.
-
Page 39
Shared Memory (mixed MPI/OpenMP)

Process/thread binding (coh: coherent, scat: scattered, sys: system-provided)

[Plots: run times for the MPI/OpenMP combinations with sys-binding]
-
Page 40
Shared Memory (mixed MPI/OpenMP)

Process/thread binding (coh: coherent, scat: scattered, sys: system-provided)

[Plots: run times for the MPI/OpenMP combinations with coh-binding]
-
Page 41
Shared Memory (mixed MPI/OpenMP)

[Plots: small, medium and large problem sizes for 2D FD Poisson, 3D FE Poisson and 3D FE Elasticity]

Influence of different thread bindings: fairly unpredictable, practically does not pay.
-
Page 42
Shared Memory (OpenMP)

Problem with pure OpenMP:
• setup phase scales nicely
• scalability of solution phase is limited

[Plot: speedup]

Reasons:
• sparse matrix operations
  - matrix-matrix, matrix-vector
• lots of integer computations
  - indirect addressing, CSR data storage, ... (see the sketch below)
• efficient cache utilization very difficult
  - for multicore much more difficult than for unicore
  - not sufficient to just consider matrix-vector
  - efficient tools are required

Conclusions:
• pure MPI seems a reasonable choice
• truly hybrid is only rarely faster
• manual thread binding does not really pay
• pure OpenMP is always slowest
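To illustrate the "indirect addressing / CSR data storage" point, here is a minimal CSR matrix-vector product in plain Python (illustrative, not an optimized kernel): most of the work is integer index arithmetic plus the indirect loads x[col_idx[k]], which is exactly what makes cache-efficient multicore implementations hard.

```python
import numpy as np

def csr_matvec(val, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in CSR format."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_idx[k]]     # indirect access into x
    return y

# CSR storage of the 4x4 matrix [[2,-1,0,0],[-1,2,-1,0],[0,-1,2,-1],[0,0,-1,2]]
val     = np.array([2., -1., -1., 2., -1., -1., 2., -1., -1., 2.])
col_idx = np.array([0, 1, 0, 1, 2, 1, 2, 3, 2, 3])
row_ptr = np.array([0, 2, 5, 8, 10])
x = np.array([1., 2., 3., 4.])
print(csr_matvec(val, col_idx, row_ptr, x))    # -> [0. 0. 0. 5.]
```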
-
Page 43
Shared Memory (OpenMP)

Remedies (unicore):
• improve cache utilization by, e.g.,
  - register blocking
  - pre-fetching
  - cache blocking
• change data access to improve re-use of matrix entries, e.g.,
  - "wave-like" relaxation, restriction, ...
  - potential conflict with modular programming

Remedies (multicore):
• data blocking, re-ordering
  - minimize memory bank conflicts
• explicit thread and/or memory bindings
  - not (yet) really reliable
  - not enough information available
  - too hardware-dependent
-
Page 44
Conclusions

Clearly, results and conclusions strongly depend on
• hardware, network, memory access, cache hierarchy, ...
• application, in particular discretization, grid size, sparsity, ...

Distributed memory architecture
• complex programming via MPI
• easy to achieve high scalability (if the problem size per node is sufficiently large)

Hybrid architecture (nodes consisting of multicores)
• mere MPI suffers from reduced bandwidth inside the nodes
• mixed MPI/OpenMP programming seems best (if adjusted to the architecture)

Shared memory architecture (one or more multicore sockets)
• mere MPI is generally superior to pure (straightforward!) OpenMP
• using reasonable thread binding, mixed programming is occasionally better
• however, without substantial effort, OpenMP performance is currently limited
  - e.g., in the solution phase (dominated by matrix-vector operations)
  - demanding developments still required (tools as well as algorithms)
  - dynamic decisions are actually required at run time!
-
Page 45
Conclusions / Outlook

One should not forget numerics (solvers)!
• it is important to consider numerically "optimal" solvers (such as AMG)
• it does not make sense to use sub-optimal solvers just because they scale better

Unfortunately,
• depending on the situation, AMG may not be suitable
  - not applicable, problem too small, insufficient convergence
• there exists no solver which always works and is always efficient!
• in a time-dependent simulation, the best solver may even change over time
• solvers should be adjusted at run time

Challenge: dynamic and automatic decisions at run time (a toy sketch follows below)
• automatic switching between solvers and parameters ("intelligent solver")
• automatic adjustment of the parallelization strategy
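A toy sketch of such run-time switching (my illustration, not the SAMG control logic; it assumes SciPy and, optionally, PyAMG): try an AMG-preconditioned Krylov solver first and fall back to a classical one-level ILU-preconditioned method if AMG is unavailable or does not converge.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def intelligent_solve(A, b, maxiter=200):
    """Try AMG-preconditioned CG first; otherwise fall back to a
    one-level ILU-preconditioned BiCGStab."""
    try:
        import pyamg
        ml = pyamg.smoothed_aggregation_solver(A)           # AMG setup phase
        x, info = spla.cg(A, b, M=ml.aspreconditioner(), maxiter=maxiter)
        if info == 0:                                        # converged
            return x, "AMG/CG"
    except Exception:
        pass                                                 # AMG not applicable
    ilu = spla.spilu(A.tocsc(), drop_tol=1e-4)               # one-level fallback
    M = spla.LinearOperator(A.shape, ilu.solve)
    x, info = spla.bicgstab(A, b, M=M, maxiter=10 * maxiter)
    return x, "ILU/BiCGStab (info=%d)" % info

A = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(10000, 10000), format='csr')
b = np.ones(A.shape[0])
x, method = intelligent_solve(A, b)
print(method, np.linalg.norm(b - A @ x))
```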
-
Page 46
[email protected]
http://www.scai.fraunhofer.de/samg
Thank You!
• Reports• User Manuals• Benchmarks