Joe Eaton, November 19, 2015
AmgX 2.0: Scaling toward CORAL
2
Agenda
Introduction to AmgX
Current Capabilities
Scaling
V2.0
Roadmap for the future
3
AmgX
Fast, scalable linear solvers, emphasis on iterative methods
Flexible toolkit for GPU-accelerated Ax = b solvers
Simple API makes it easy to solve your problems faster
4
“Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation.”
Garf Bowen, Ridgeway Kite Software
5
AmgX in Reservoir Simulation
Solve Faster
Solve Larger Systems
Flexible High Level API
[Bar chart: Application Time (seconds), lower is better. CPU: 1150; GPU Custom: 197; AmgX: 98]
3-phase Black Oil Reservoir Simulation. 400K grid blocks solved fully implicitly.
CPU: Intel Xeon CPU E5-2670
GPU: NVIDIA Tesla K10
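Reading the chart values as CPU = 1150 s, custom GPU code = 197 s and AmgX = 98 s (the pairing of values with labels is an assumption based on the order in which they appear on the slide), the implied speedups work out roughly as follows:

```latex
\[
\frac{t_{\mathrm{CPU}}}{t_{\mathrm{AmgX}}} = \frac{1150}{98} \approx 11.7,
\qquad
\frac{t_{\mathrm{GPU\,custom}}}{t_{\mathrm{AmgX}}} = \frac{197}{98} \approx 2.0
\]
```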
6
AmgX 2.0: New Features since 1.0
Classical AMG with truncation, robust aggressive coarsening
Complex arithmetic
GPUDirect, RDMA-async
Power8 support, Maxwell support
Crash-proof object management
Re-usable setup phase
Adaptors for major solver packages:
HYPRE, PETSc, Trilinos
Import data structures directly to AmgX for solve, export solution
Host or Device pointer support
JSON configuration
7
Key Features
Un-smoothed Aggregation AMG
Krylov methods: CG, GMRES, BiCGStab, IDR
Smoothers and Solvers:
Block-Jacobi, Gauss-Seidel
Incomplete LU, Dense LU
KPZ-Polynomial, Chebyshev
Flexible composition system
Scalar or coupled block systems, multi-precision
MPI, OpenMP support
Auto-consolidation
Flexible, simple high level C API
8
Minimal Example With Config
//One header
#include "amgx_c.h"
//Initialize the library
AMGX_initialize();
//Read config file
AMGX_config_create_from_file(&cfg, cfgfile);
//Create resources based on config
AMGX_resources_create_simple(&res, cfg);
//Create solver object, A,x,b, set precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);
//Read coefficients from a file (handles passed by value: matrix, rhs, solution)
AMGX_read_system(A, b, x, matrixfile);
//Setup and Solve Loop
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);
//Download Result into a caller-allocated host buffer (x_host is an illustrative name)
AMGX_vector_download(x, x_host);
Config file (cfgfile):
solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
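The smoother entries in the config above (BLOCK_JACOBI, relaxation_factor, presweeps, postsweeps) can be pictured as damped Jacobi sweeps. The sketch below is not AmgX code. It assumes block size 1, a CSR matrix, and that relaxation_factor plays the role of the damping weight omega; the function name damped_jacobi is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Apply `sweeps` damped Jacobi iterations to A x = b, with A in CSR form:
 *   x_new[i] = x[i] + omega * (b[i] - sum_j A[i][j] * x[j]) / A[i][i]
 * Returns 0 on success, -1 on allocation failure, -2 on a zero diagonal. */
int damped_jacobi(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *b, double *x,
                  double omega, int sweeps)
{
    assert(n > 0 && row_ptr && col_idx && val && b && x && sweeps >= 0);

    double *x_new = malloc((size_t)n * sizeof *x_new);
    if (x_new == NULL) {
        return -1;
    }

    for (int s = 0; s < sweeps; ++s) {
        for (int i = 0; i < n; ++i) {
            double diag = 0.0;
            double residual = b[i];
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
                if (col_idx[k] == i) {
                    diag = val[k];
                }
                residual -= val[k] * x[col_idx[k]];
            }
            if (diag == 0.0) {
                free(x_new);
                return -2;
            }
            x_new[i] = x[i] + omega * residual / diag;
        }
        /* Jacobi updates all unknowns from the previous iterate. */
        for (int i = 0; i < n; ++i) {
            x[i] = x_new[i];
        }
    }

    free(x_new);
    return 0;
}
```

Under this reading, presweeps=1 and postsweeps=2 would correspond to calling the routine with sweeps set to 1 before and 2 after the coarse-grid correction on each level, with omega = 0.75.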
9
Integrates easily with MPI and OpenMP domain decomposition
Adding GPU support to existing applications raises new issues
Proper ratio of CPU cores / GPU?
How can multiple CPU cores (MPI ranks) share a single GPU?
How does MPI switch between two sets of ‘ranks’: one set for CPUs, one set for GPUs?
AmgX handles this via Consolidation
Consolidate multiple smaller sub-matrices into single matrix
Handled automatically during PCIe data copy
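To make the consolidation idea concrete, here is a minimal sketch of merging per-rank sub-matrices into one matrix. It does not show how AmgX implements consolidation. It assumes each rank owns a contiguous block of rows in CSR form, that column indices are already in global numbering, and that parts are supplied in rank order; the type and function names (LocalCsr, Csr, consolidate) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One rank's share: a contiguous block of rows, global column indices,
 * and a local row_ptr that starts at 0. */
typedef struct {
    int n_rows;
    const int *row_ptr;   /* n_rows + 1 entries */
    const int *col_idx;
    const double *val;
} LocalCsr;

/* Consolidated matrix; arrays are owned by the caller after success. */
typedef struct {
    int n_rows;
    int *row_ptr;
    int *col_idx;
    double *val;
} Csr;

/* Concatenate the row blocks in rank order into a single CSR matrix.
 * Returns 0 on success, -1 on allocation failure. */
int consolidate(const LocalCsr *parts, int n_parts, Csr *out)
{
    assert(parts != NULL && out != NULL && n_parts > 0);

    int rows = 0, nnz = 0;
    for (int p = 0; p < n_parts; ++p) {
        rows += parts[p].n_rows;
        nnz += parts[p].row_ptr[parts[p].n_rows];
    }

    out->n_rows = rows;
    out->row_ptr = malloc((size_t)(rows + 1) * sizeof *out->row_ptr);
    out->col_idx = malloc((size_t)(nnz > 0 ? nnz : 1) * sizeof *out->col_idx);
    out->val = malloc((size_t)(nnz > 0 ? nnz : 1) * sizeof *out->val);
    if (!out->row_ptr || !out->col_idx || !out->val) {
        free(out->row_ptr);
        free(out->col_idx);
        free(out->val);
        return -1;
    }

    int row = 0, offset = 0;
    out->row_ptr[0] = 0;
    for (int p = 0; p < n_parts; ++p) {
        const LocalCsr *lp = &parts[p];
        int local_nnz = lp->row_ptr[lp->n_rows];
        /* Shift local row pointers by the entries already copied. */
        for (int i = 0; i < lp->n_rows; ++i) {
            out->row_ptr[row + i + 1] = offset + lp->row_ptr[i + 1];
        }
        memcpy(out->col_idx + offset, lp->col_idx,
               (size_t)local_nnz * sizeof *out->col_idx);
        memcpy(out->val + offset, lp->val,
               (size_t)local_nnz * sizeof *out->val);
        row += lp->n_rows;
        offset += local_nnz;
    }
    return 0;
}
```

In this simplified picture, entries that referred to a neighbouring rank's unknowns before the merge simply become ordinary interior entries once both row blocks live in the same matrix.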
10
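[Diagram, three panels. "Original Problem": unknowns u1 to u7. "Partitioned to 2 MPI Ranks": Rank 0 and Rank 1 each hold part of the unknowns, with copies u'2 and u'4 at the partition boundary (boundary exchange). "Consolidated onto 1 GPU": both partitions are sent over PCIe to a single GPU, which holds u1 to u7.]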
11
Consolidation Examples
1 CPU socket <=> 1 GPU
Dual socket CPU <=> 2 GPUs
Dual socket CPU <=> 4 GPUs
Arbitrary Cluster:
4 nodes x [2 CPUs + 3 GPUs] IB
12
PETSc KSP vs AmgX performance test
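PDE:
$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)$
BCs:
$\left.\frac{\partial u}{\partial x}\right|_{x=0} = \left.\frac{\partial u}{\partial x}\right|_{x=1} = \left.\frac{\partial u}{\partial y}\right|_{y=0} = \left.\frac{\partial u}{\partial y}\right|_{y=1} = \left.\frac{\partial u}{\partial z}\right|_{z=0} = \left.\frac{\partial u}{\partial z}\right|_{z=1} = 0$
Exact solution:
$u(x,y,z) = \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)$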
13
PETSc vs AmgX
7x speedup @ 4M unknowns: 16 cores vs 1 GPU
8x speedup @ 100M unknowns: 512 cores vs 32 GPUs
Machine specification
GPU nodes:
GPU: two K20m per node
CPU nodes:
CPU: two Intel Xeon E5-2670 per node (16 cores per node in total)
PETSc KSP solver
14
SPE10 Cases
We derived several test cases from the SPE10 permeability distribution by fixing an x-y resolution and adding resolution in z, using a TPFA (two-point flux approximation) stencil.
15
SPE10 Matrix Tests
GPU: NVIDIA K40
CPU: HYPRE on 10 core IvyBridge Xeon E5-2690 V2 @ 3.0GHz
[Chart: 1 Socket vs 1 GPU. Speedup (axis 0 to 5) vs Millions of Unknowns (axis 0 to 10)]
16
Scaling up the right way
17
Poisson Equation / Laplace operator
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16 core AMD Opteron 6274 @ 2.2GHz
Aggregation and Classical Weak Scaling, 8 million DOF per GPU
[Chart: Setup. Time (s) (axis 0 to 12) vs Number of GPUs (1 to 512); series: AmgX 1.0 (PMIS), AmgX 1.0 (AGG)]
18
Poisson Equation / Laplace operator
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16 core AMD Opteron 6274 @ 2.2GHz
Aggregation and Classical Weak Scaling, 8 million DOF per GPU
[Chart: Time per Iteration vs Log(P). Solve Time (axis 0 to 0.16) vs Number of GPUs (1 to 512); series: ClassicalAMGSolve, AggregationAMGSolve, each with a linear trend line]
Linear fits shown on the chart:
y = 0.0062x + 0.0719, R² = 0.9249
y = 0.0022x + 0.0585, R² = 0.9437
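Trend lines of the form y = ax + b with an R² value, like those on the chart above, come from an ordinary least-squares fit. The sketch below shows that computation in general form. The names LinearFit and linear_fit are illustrative, and whether the slide fitted against log2(P) or against the position of each point on the axis is not stated, so the choice of x values is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double slope;
    double intercept;
    double r2;   /* coefficient of determination */
} LinearFit;

/* Ordinary least-squares fit of y = slope * x + intercept over n points. */
LinearFit linear_fit(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL && n >= 2);

    double mean_x = 0.0, mean_y = 0.0;
    for (size_t i = 0; i < n; ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    /* Centered sums of squares and cross products. */
    double sxx = 0.0, sxy = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double dx = x[i] - mean_x;
        double dy = y[i] - mean_y;
        sxx += dx * dx;
        sxy += dx * dy;
        syy += dy * dy;
    }
    assert(sxx > 0.0);   /* x values must not all be identical */

    LinearFit fit;
    fit.slope = sxy / sxx;
    fit.intercept = mean_y - fit.slope * mean_x;
    /* For a straight-line fit, R^2 equals the squared correlation. */
    fit.r2 = (syy > 0.0) ? (sxy * sxy) / (sxx * syy) : 1.0;
    return fit;
}
```

For a weak-scaling study, one natural choice is x_i = log2(P_i) and y_i = measured time per iteration at P_i GPUs; a small slope then indicates that the cost per iteration grows only slowly with the logarithm of the GPU count.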
19
Poisson Equation / Laplace operator
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16 core AMD Opteron 6274 @ 2.2GHz
Classical AMG Preconditioner, 8 million DOF per GPU
[Chart: Iterations (axis 0 to 120) vs Number of GPUs (1 to 512); series: PCG, GMRES]
20
Poisson Equation / Laplace operator
Titan (Oak Ridge National Laboratory)
GPU: NVIDIA K20x (one per node)
CPU: 16 core AMD Opteron 6274 @ 2.2GHz
Classical AMG Preconditioner, 8 million DOF per GPU
[Chart: Solve Time (s) (axis 0 to 16) vs Number of GPUs (1 to 512); series: GMRES, PCG]
21
AmgX 2.0: MPI with GPUDirect RDMA
4x lower latency, 3x Bandwidth, 45% lower CPU utilization
22
Basic Coarsening
23
Basic Coarsening
24
Aggressive Coarsening
25
Aggressive Coarsening
Less Memory, Faster Setup
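One way to see why aggressive coarsening needs less memory is to count unknowns across the multigrid hierarchy. The sketch below uses a deliberately simplified model in which every level shrinks by a constant factor, which real AMG hierarchies only approximate. The function name grid_complexity and the ratios used in the note are illustrative, not measured AmgX behaviour.

```c
#include <assert.h>

/* Grid complexity: total unknowns summed over all levels, divided by the
 * number of fine-level unknowns. Each coarser level is assumed to have
 * n / ratio unknowns (integer division); the hierarchy stops after
 * max_levels levels or when a level would be empty. */
double grid_complexity(long n_fine, int ratio, int max_levels)
{
    assert(n_fine > 0 && ratio > 1 && max_levels > 0);

    long total = 0;
    long n = n_fine;
    for (int level = 0; level < max_levels && n > 0; ++level) {
        total += n;
        n /= ratio;
    }
    return (double)total / (double)n_fine;
}
```

In this model a hierarchy that halves the unknowns per level stores close to 2x the fine-level unknowns in total, while one that shrinks by 8x per level stores only about 1.14x, and it also has fewer levels to build during setup.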
26
AmgX 2.0 Licensing
Developer/Academic License
Non-commercial use, free
Commercial License, Developer License, Premier Support Service
Subscription License (node/year)
Includes Support and Maintenance
Volume based pricing
Site License
Perpetual License
20% Maintenance and Support
27
AmgX Roadmap
Continuous Improvement
Availability Features
Classical AMG
- multi node
- multi GPU
- Aggressive coarsening
Complex Arithmetic + Aggregation
Easy interfaces, python
PETSc, HYPRE, Trilinos
Robust convergence on SPE10
GPUDirect v2.0
Scalable Sparse Eigensolvers
Scaling past 512 GPUs
Range Decomposition AMG
Guaranteed convergence aggregation
Commercial License
Premier Support
AmgX 2.5 Q2 2016
AmgX 2.0 Release Q4 2015
CUDA 8.0 with Pascal Support
Tuning for Maxwell
AmgX 2.0 was made by a great team of contributors.
AmgX 2.0 Team: Marat Arsaev, Joe Eaton, Alex Fender, Andrei Schaffer
AmgX 2.0 Devtechs: Simon Layton, Nikolai Sakharnykh, Nikolay Markovskiy
Interns: Rohit Gupta, Constantine Stulov