AmgX 2.0: Scaling toward CORAL

Joe Eaton, November 19, 2015

Page 2: Agenda

Introduction to AmgX

Current Capabilities

Scaling

V2.0

Roadmap for the future

Page 3: AmgX

Fast, scalable linear solvers, emphasis on iterative methods

Flexible toolkit for GPU-accelerated Ax = b solvers

Simple API makes it easy to solve your problems faster

Page 4

“Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation.”

Garf Bowen, Ridgeway Kite Software

Page 5: AmgX in Reservoir Simulation

Solve Faster

Solve Larger Systems

Flexible High Level API

[Chart: application time in seconds, lower is better. CPU: 1150; GPU Custom: 197; AmgX: 98]

3-phase black oil reservoir simulation, 400K grid blocks solved fully implicitly.

CPU: Intel Xeon E5-2670

GPU: NVIDIA Tesla K10

Page 6: AmgX 2.0: New Features since 1.0

Classical AMG with truncation, robust aggressive coarsening

Complex arithmetic

GPUDirect, RDMA-async

Power8 support, Maxwell support

Crash-proof object management

Re-usable setup phase

Adaptors for major solver packages:

HYPRE, PETSc, Trilinos

Import data structures directly to AmgX for solve, export solution

Host or Device pointer support

JSON configuration

Page 7: Key Features

Un-smoothed Aggregation AMG

Krylov methods: CG, GMRES, BiCGStab, IDR

Smoothers and Solvers:

Block-Jacobi, Gauss-Seidel

Incomplete LU, Dense LU

KPZ-Polynomial, Chebyshev

Flexible composition system

Scalar or coupled block systems, multi-precision

MPI, OpenMP support

Auto-consolidation

Flexible, simple high level C API
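To make the Krylov entry concrete, here is the textbook conjugate gradient iteration for a symmetric positive definite matrix stored in CSR form. It is a minimal serial sketch, not AmgX's GPU implementation; the function names (`cg`, `spmv`, `dot`) and the relative-residual stopping rule are assumptions for illustration. In AmgX an AMG hierarchy would normally be attached as the preconditioner, as the Page 8 config does with FGMRES.

```c
#include <assert.h>
#include <stdlib.h>

/* y = A x for a scalar CSR matrix. */
static void spmv(int n, const int *row_ptr, const int *col, const double *val,
                 const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            s += val[k] * x[col[k]];
        y[i] = s;
    }
}

static double dot(int n, const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* Unpreconditioned CG for SPD A, starting from the initial guess in x.
 * Stops when ||r||_2 <= tol * ||b||_2 (compared squared, so no sqrt).
 * Returns the iteration count, or -1 if max_iters is reached first. */
int cg(int n, const int *row_ptr, const int *col, const double *val,
       const double *b, double *x, double tol, int max_iters)
{
    double *r = malloc(3 * (size_t)n * sizeof(double));
    assert(r != NULL);
    double *p = r + n, *Ap = r + 2 * n;

    spmv(n, row_ptr, col, val, x, Ap);
    for (int i = 0; i < n; ++i) {
        r[i] = b[i] - Ap[i];
        p[i] = r[i];
    }
    double rr = dot(n, r, r);
    double stop = tol * tol * dot(n, b, b);
    int it = 0;
    while (rr > stop) {
        if (it == max_iters) {
            free(r);
            return -1;
        }
        spmv(n, row_ptr, col, val, p, Ap);
        double alpha = rr / dot(n, p, Ap);      /* step length */
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(n, r, r);
        for (int i = 0; i < n; ++i)             /* new A-conjugate direction */
            p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
        ++it;
    }
    free(r);
    return it;
}
```

For the 3x3 matrix tridiag(-1, 2, -1) with b = (1, 0, 1), `cg` reaches the exact solution (1, 1, 1) in two iterations, because b has components along only two eigenvectors of A.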

Page 8: Minimal Example With Config

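```c
// One AmgX header (plus C standard headers used below)
#include <stdio.h>
#include <stdlib.h>
#include "amgx_c.h"

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <config file> <matrix file>\n", argv[0]);
        return EXIT_FAILURE;
    }
    const char *cfgfile = argv[1], *matrixfile = argv[2];
    // Precision mode: data on device, double matrix and vectors, int indices
    AMGX_Mode mode = AMGX_mode_dDDI;
    AMGX_config_handle cfg;
    AMGX_resources_handle res;
    AMGX_solver_handle solver;
    AMGX_matrix_handle A;
    AMGX_vector_handle x, b;

    // AMGX_SAFE_CALL (from amgx_c.h) prints the error and aborts on failure
    AMGX_SAFE_CALL(AMGX_initialize());
    // Read config file
    AMGX_SAFE_CALL(AMGX_config_create_from_file(&cfg, cfgfile));
    // Create resources based on config
    AMGX_SAFE_CALL(AMGX_resources_create_simple(&res, cfg));
    // Create solver object, A, x, b; the mode sets the precision
    AMGX_SAFE_CALL(AMGX_solver_create(&solver, res, mode, cfg));
    AMGX_SAFE_CALL(AMGX_matrix_create(&A, res, mode));
    AMGX_SAFE_CALL(AMGX_vector_create(&x, res, mode));
    AMGX_SAFE_CALL(AMGX_vector_create(&b, res, mode));
    // Read coefficients from a file (handles by value: matrix, rhs, solution)
    AMGX_SAFE_CALL(AMGX_read_system(A, b, x, matrixfile));
    // Setup and solve
    AMGX_SAFE_CALL(AMGX_solver_setup(solver, A));
    AMGX_SAFE_CALL(AMGX_solver_solve(solver, b, x));
    // Download result into a host array
    int n, block_dim;
    AMGX_SAFE_CALL(AMGX_vector_get_size(x, &n, &block_dim));
    double *result = malloc((size_t)n * block_dim * sizeof(double));
    if (result == NULL)
        return EXIT_FAILURE;
    AMGX_SAFE_CALL(AMGX_vector_download(x, result));
    if (n > 0)
        printf("x[0] = %g\n", result[0]);
    // Release everything in reverse order of creation
    free(result);
    AMGX_SAFE_CALL(AMGX_solver_destroy(solver));
    AMGX_SAFE_CALL(AMGX_vector_destroy(b));
    AMGX_SAFE_CALL(AMGX_vector_destroy(x));
    AMGX_SAFE_CALL(AMGX_matrix_destroy(A));
    AMGX_SAFE_CALL(AMGX_resources_destroy(res));
    AMGX_SAFE_CALL(AMGX_config_destroy(cfg));
    AMGX_SAFE_CALL(AMGX_finalize());
    return EXIT_SUCCESS;
}
```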

Contents of the config file (cfgfile):

```
solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
```
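The config above smooths with `BLOCK_JACOBI`, `relaxation_factor=0.75`, one presweep and two postsweeps. With 1x1 blocks, block-Jacobi reduces to damped point Jacobi, x ← x + ω D⁻¹(b − Ax). The sketch below shows only that scalar case on a CSR matrix; the `csr_matrix` type and the function names are hypothetical, and AmgX's own smoother runs on the GPU and also handles coupled blocks.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scalar CSR matrix (1x1 blocks); not an AmgX type. */
typedef struct {
    int n;              /* number of rows */
    const int *row_ptr; /* n + 1 offsets into col/val */
    const int *col;     /* column index of each nonzero */
    const double *val;  /* value of each nonzero */
} csr_matrix;

/* One damped Jacobi sweep: x_new = x + omega * D^{-1} (b - A x).
 * Every row reads only the old iterate, so rows are independent,
 * which is why Jacobi-type smoothers suit GPUs. */
static void jacobi_sweep(const csr_matrix *A, const double *b, double *x,
                         double omega, double *work)
{
    for (int i = 0; i < A->n; ++i) {
        double diag = 0.0, r = b[i];
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k) {
            r -= A->val[k] * x[A->col[k]];
            if (A->col[k] == i)
                diag = A->val[k];
        }
        assert(diag != 0.0);
        work[i] = x[i] + omega * r / diag;
    }
    memcpy(x, work, (size_t)A->n * sizeof(double));
}

/* Apply `sweeps` sweeps, e.g. presweeps=1 or postsweeps=2 from the config. */
void jacobi_smooth(const csr_matrix *A, const double *b, double *x,
                   double omega, int sweeps)
{
    double *work = malloc((size_t)A->n * sizeof(double));
    assert(work != NULL);
    for (int s = 0; s < sweeps; ++s)
        jacobi_sweep(A, b, x, omega, work);
    free(work);
}
```

On the 2x2 system [[2, -1], [-1, 2]] x = (1, 1) starting from zero, one sweep with ω = 0.75 gives (0.375, 0.375) and each further sweep shrinks the error by a factor of 0.625, which is the point of a smoother: damp error cheaply, leave the rest to the coarse levels.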

Page 9: Integrates easily with MPI and OpenMP domain decomposition

Adding GPU support to existing applications raises new issues

What is the proper ratio of CPU cores to GPUs?

How can multiple CPU cores (MPI ranks) share a single GPU?

How does MPI switch between two sets of ‘ranks’: one set for CPUs, one set for GPUs?

AmgX handles this via consolidation:

Consolidate multiple smaller sub-matrices into a single matrix

Handled automatically during the PCIe data copy
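At its simplest, consolidation stacks the row blocks owned by several ranks into one matrix. The sketch below assumes each rank owns a contiguous range of global rows and already stores global column indices (the `csr_piece` layout is hypothetical); under that assumption the only arithmetic is shifting each rank's row offsets by the number of nonzeros copied before it. A real implementation also has to handle the halo (boundary) entries each rank keeps for its neighbors, which this sketch ignores.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical CSR block owned by one MPI rank: a contiguous range of
 * global rows whose column indices are already global. */
typedef struct {
    int n_rows;
    int *row_ptr; /* n_rows + 1 offsets, row_ptr[0] == 0 */
    int *col;     /* global column index of each nonzero */
    double *val;  /* value of each nonzero */
} csr_piece;

void csr_piece_free(csr_piece *m)
{
    free(m->row_ptr);
    free(m->col);
    free(m->val);
    m->row_ptr = NULL;
    m->col = NULL;
    m->val = NULL;
    m->n_rows = 0;
}

/* Stack the row blocks of n_parts ranks (in rank order) into one CSR
 * matrix, as if all were copied to a single GPU. Returns 0 on success. */
int consolidate(const csr_piece *parts, int n_parts, csr_piece *out)
{
    int rows = 0, nnz = 0;
    for (int p = 0; p < n_parts; ++p) {
        assert(parts[p].row_ptr[0] == 0);
        rows += parts[p].n_rows;
        nnz += parts[p].row_ptr[parts[p].n_rows];
    }
    out->n_rows = rows;
    out->row_ptr = malloc((size_t)(rows + 1) * sizeof(int));
    out->col = malloc((size_t)(nnz > 0 ? nnz : 1) * sizeof(int));
    out->val = malloc((size_t)(nnz > 0 ? nnz : 1) * sizeof(double));
    if (!out->row_ptr || !out->col || !out->val) {
        csr_piece_free(out);
        return -1;
    }
    out->row_ptr[0] = 0;
    int r = 0, k = 0; /* rows and nonzeros copied so far */
    for (int p = 0; p < n_parts; ++p) {
        const csr_piece *q = &parts[p];
        int q_nnz = q->row_ptr[q->n_rows];
        /* Shift this rank's row offsets past the nonzeros already copied. */
        for (int i = 1; i <= q->n_rows; ++i)
            out->row_ptr[r + i] = k + q->row_ptr[i];
        if (q_nnz > 0) {
            memcpy(out->col + k, q->col, (size_t)q_nnz * sizeof(int));
            memcpy(out->val + k, q->val, (size_t)q_nnz * sizeof(double));
        }
        r += q->n_rows;
        k += q_nnz;
    }
    return 0;
}
```

For a 4-row 1D Laplacian split as rows 0 to 1 on rank 0 and rows 2 to 3 on rank 1, the consolidated row offsets come out as 0, 2, 5, 8, 10, exactly those of the unpartitioned matrix.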

Page 10

[Figure: Original Problem (unknowns u1 to u7); Partitioned to 2 MPI Ranks (Rank 0 and Rank 1, with copies u'2 and u'4 for the boundary exchange); Consolidated onto 1 GPU (both ranks' data copied over PCIe)]

Page 11: Consolidation Examples

1 CPU socket <=> 1 GPU

Dual socket CPU <=> 2 GPUs

Dual socket CPU <=> 4 GPUs

Arbitrary Cluster:

4 nodes × [2 CPUs + 3 GPUs], InfiniBand interconnect
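Each configuration above amounts to a mapping from the MPI ranks on a node to that node's GPUs. One simple policy, assumed here for illustration and not necessarily the one AmgX uses, gives each GPU a contiguous, near-equal group of node-local ranks. If each node of the arbitrary cluster ran 16 ranks (a hypothetical count), its 3 GPUs would receive groups of 6, 5 and 5 ranks. The sketch assumes at least as many ranks as GPUs, since consolidation is many-to-one.

```c
#include <assert.h>

/* GPU used by a node-local MPI rank when `ranks` ranks share `gpus` GPUs:
 * contiguous groups whose sizes differ by at most one. */
int gpu_for_rank(int local_rank, int ranks, int gpus)
{
    assert(ranks > 0 && gpus > 0 && gpus <= ranks);
    assert(local_rank >= 0 && local_rank < ranks);
    return (int)((long long)local_rank * gpus / ranks);
}

/* Number of local ranks whose sub-matrices are consolidated onto `gpu`. */
int ranks_on_gpu(int gpu, int ranks, int gpus)
{
    int count = 0;
    for (int r = 0; r < ranks; ++r)
        if (gpu_for_rank(r, ranks, gpus) == gpu)
            ++count;
    return count;
}
```

With 16 ranks and 2 GPUs (a dual-socket node with one GPU per socket), ranks 0 to 7 land on GPU 0 and ranks 8 to 15 on GPU 1; keeping groups contiguous tends to keep ranks from the same socket together when ranks are numbered socket by socket.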

Page 12: PETSc KSP vs AmgX performance test

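PDE:

$$\frac{\partial^2 u}{\partial x^2}+\frac{\partial^2 u}{\partial y^2}+\frac{\partial^2 u}{\partial z^2}=-12\pi^2\cos(2\pi x)\cos(2\pi y)\cos(2\pi z)$$

BCs:

$$\left.\frac{\partial u}{\partial x}\right|_{x=0}=\left.\frac{\partial u}{\partial x}\right|_{x=1}=\left.\frac{\partial u}{\partial y}\right|_{y=0}=\left.\frac{\partial u}{\partial y}\right|_{y=1}=\left.\frac{\partial u}{\partial z}\right|_{z=0}=\left.\frac{\partial u}{\partial z}\right|_{z=1}=0$$

Exact solution:

$$u(x,y,z)=\cos(2\pi x)\cos(2\pi y)\cos(2\pi z)$$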
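A quick check that the stated exact solution is consistent with the PDE and the boundary conditions. Because the problem is pure Neumann, the right-hand side must also integrate to zero over the unit cube, which it does; the solution is then determined only up to an additive constant.

```latex
% Each factor satisfies d^2/ds^2 cos(2 pi s) = -4 pi^2 cos(2 pi s), so
\frac{\partial^2 u}{\partial x^2} = \frac{\partial^2 u}{\partial y^2}
  = \frac{\partial^2 u}{\partial z^2} = -4\pi^2 u
\quad\Longrightarrow\quad
\nabla^2 u = -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z).

% Neumann conditions hold because sin(0) = sin(2 pi) = 0
\frac{\partial u}{\partial x} = -2\pi \sin(2\pi x)\cos(2\pi y)\cos(2\pi z) = 0
\quad \text{at } x = 0 \text{ and } x = 1 \quad (\text{likewise in } y, z).

% Compatibility: the source integrates to zero over the unit cube
\int_{[0,1]^3} -12\pi^2 \cos(2\pi x)\cos(2\pi y)\cos(2\pi z)\,dV
  = -12\pi^2 \left(\int_0^1 \cos(2\pi s)\,ds\right)^3 = 0.
```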

Page 13: PETSc vs AmgX

7x speedup at 4M unknowns: 16 CPU cores vs 1 GPU

8x speedup at 100M unknowns: 512 CPU cores vs 32 GPUs

Machine specification

GPU nodes:

GPU: two K20m per node

CPU nodes:

CPU: two Intel Xeon E5-2670 per node (16 cores per node in total)

PETSc KSP solver

Page 14: SPE10 Cases

We derived several test cases from the SPE10 permeability distribution by fixing the x-y resolution and adding resolution in z, using a TPFA (two-point flux approximation) stencil.
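Under TPFA, the flux between neighboring cells i and j is approximated as T_ij (p_i − p_j), where T_ij combines the two half-cell transmissibilities harmonically, so the strong permeability contrasts of SPE10 enter the matrix coefficients directly. The sketch below assembles that operator for a single column of cells in z (the direction the test cases refine), using dense storage for brevity. The function names, uniform cell width and no-flow ends are assumptions; a reservoir simulator would also include fluid mobility and the x and y faces.

```c
#include <assert.h>

/* Half-cell transmissibility: face area * permeability / distance from the
 * cell center to the shared face (half the cell width). */
static double half_trans(double area, double perm, double width)
{
    assert(area > 0.0 && perm > 0.0 && width > 0.0);
    return area * perm / (0.5 * width);
}

/* TPFA transmissibility between two neighboring cells: the two half-cell
 * terms act in series, so they combine harmonically. */
double tpfa_trans(double area, double k1, double w1, double k2, double w2)
{
    double t1 = half_trans(area, k1, w1);
    double t2 = half_trans(area, k2, w2);
    return t1 * t2 / (t1 + t2);
}

/* Assemble the n x n TPFA operator (row-major, dense) for one column of
 * equal-width cells with no-flow ends. Each face adds +T to both diagonal
 * entries and -T to the two couplings, so every row sums to zero. */
void tpfa_column(int n, const double *k, double width, double area, double *M)
{
    for (int i = 0; i < n * n; ++i)
        M[i] = 0.0;
    for (int i = 0; i + 1 < n; ++i) {
        double T = tpfa_trans(area, k[i], width, k[i + 1], width);
        M[i * n + i] += T;
        M[(i + 1) * n + (i + 1)] += T;
        M[i * n + (i + 1)] -= T;
        M[(i + 1) * n + i] -= T;
    }
}
```

The harmonic combination means a single low-permeability cell throttles both of its faces: for unit geometry, permeabilities 1 and 3 give T = 1.5 rather than the arithmetic mean 2.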

Page 15: SPE10 Matrix Tests

GPU: NVIDIA K40

CPU: HYPRE on a 10-core Ivy Bridge Xeon E5-2690 v2 @ 3.0 GHz

[Chart: 1 Socket vs 1 GPU. Speedup vs. millions of unknowns]

Page 16: Scaling up the right way

Page 17: Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16-core AMD Opteron 6274 @ 2.2 GHz

Aggregation and Classical Weak Scaling, 8 million DOF per GPU

[Chart: setup time (s) vs. number of GPUs, 1 to 512; series: AmgX 1.0 (PMIS), AmgX 1.0 (AGG)]

Page 18: Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16-core AMD Opteron 6274 @ 2.2 GHz

Aggregation and Classical Weak Scaling, 8 million DOF per GPU

[Chart: Time per Iteration vs Log(P). Solve time per iteration vs. number of GPUs, 1 to 512; series: ClassicalAMGSolve, AggregationAMGSolve, each with a linear trend line]

Trend lines: y = 0.0062x + 0.0719 (R² = 0.9249) and y = 0.0022x + 0.0585 (R² = 0.9437)
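The trend lines above are fitted against position along the GPU axis, whose ticks double from 1 to 512. Assuming x advances by one per tick (the usual reading of a trend line on such an axis, not something the slide states), the slope is the extra time per iteration each time the GPU count doubles, and the intercept drops out when two GPU counts are compared. A small sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>

/* Number of doublings from p_from to p_to GPUs; the ratio must be a power of two. */
static int doublings(int p_from, int p_to)
{
    assert(p_from > 0 && p_to >= p_from);
    int d = 0;
    while (p_from < p_to) {
        p_from *= 2;
        ++d;
    }
    assert(p_from == p_to);
    return d;
}

/* Extra time per iteration predicted by a trend line y = slope * x + c when
 * x advances by one per doubling of the GPU count; c cancels out. */
double added_time_per_iter(double slope, int p_from, int p_to)
{
    return slope * doublings(p_from, p_to);
}
```

Under that reading, going from 1 to 512 GPUs (9 doublings) adds about 0.056 per iteration along the steeper trend line and about 0.020 along the shallower one, in the chart's time units (presumably seconds): growth proportional to log(P), as the slide title says.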

Page 19: Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16-core AMD Opteron 6274 @ 2.2 GHz

Classical AMG Preconditioner, 8 million DOF per GPU

[Chart: iterations vs. number of GPUs, 1 to 512; series: PCG, GMRES]

Page 20: Poisson Equation / Laplace operator

Titan (Oak Ridge National Laboratory)

GPU: NVIDIA K20x (one per node)

CPU: 16-core AMD Opteron 6274 @ 2.2 GHz

Classical AMG Preconditioner, 8 million DOF per GPU

[Chart: solve time (s) vs. number of GPUs, 1 to 512; series: GMRES, PCG]

Page 21: AmgX 2.0: MPI with GPUDirect RDMA

4x lower latency, 3x higher bandwidth, 45% lower CPU utilization
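One way to read the latency and bandwidth figures together is the simple latency-bandwidth (Hockney) model, t(n) = α + n/β for an n-byte message. Under that model, which is assumed here rather than taken from the slide, small messages speed up by close to the latency factor (4x) and large ones by close to the bandwidth factor (3x). The function names are hypothetical, and any baseline α and β passed in are placeholders, not measured values.

```c
#include <assert.h>

/* Hockney model: time to send n_bytes = latency + n_bytes / bandwidth. */
double msg_time(double latency, double bandwidth, double n_bytes)
{
    assert(latency >= 0.0 && bandwidth > 0.0 && n_bytes >= 0.0);
    return latency + n_bytes / bandwidth;
}

/* Speedup of one message when latency drops by `lat_gain` and bandwidth
 * rises by `bw_gain` (4 and 3 for the figures on this slide). */
double rdma_speedup(double latency, double bandwidth, double n_bytes,
                    double lat_gain, double bw_gain)
{
    double before = msg_time(latency, bandwidth, n_bytes);
    double after = msg_time(latency / lat_gain, bandwidth * bw_gain, n_bytes);
    return before / after;
}
```

Whatever baseline is plugged in, the speedup falls from 4 for an empty message toward 3 for very large ones. The small halo messages exchanged on coarse AMG levels tend to sit at the latency-bound end.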

Page 22: Basic Coarsening

Page 23: Basic Coarsening

Page 24: Aggressive Coarsening

Page 25: Aggressive Coarsening

Less Memory, Faster Setup
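The memory saving follows from the coarsening ratio. Assuming an idealized hierarchy in which every level has `ratio` times fewer unknowns than the level above (real hierarchies vary from level to level), the sketch computes the grid complexity, meaning the total unknowns stored on all levels relative to the fine level, and the number of levels needed to reach a given coarsest size. Grid complexity is only a rough proxy for memory, since the nonzeros per row of the coarse operators also matter; the function names are hypothetical.

```c
#include <assert.h>

/* Grid complexity: sum over levels of (level size / fine size), assuming
 * each level is `ratio` times smaller than the one above it. */
double grid_complexity(double ratio, int levels)
{
    assert(ratio > 1.0 && levels >= 1);
    double total = 0.0, frac = 1.0;
    for (int l = 0; l < levels; ++l) {
        total += frac;
        frac /= ratio;
    }
    return total;
}

/* Levels (including the fine level) needed to shrink n_fine unknowns to at
 * most n_coarse, coarsening by `ratio` per level. */
int levels_needed(double n_fine, double n_coarse, double ratio)
{
    assert(ratio > 1.0 && n_coarse > 0.0);
    int levels = 1;
    while (n_fine > n_coarse) {
        n_fine /= ratio;
        ++levels;
    }
    return levels;
}
```

For example, shrinking 8 million unknowns (the per-GPU size in the Titan runs) to at most 1000 (a hypothetical coarsest-grid target) takes 6 levels with a ratio of 8 but 14 with a ratio of 2, and the grid complexity tends to r/(r − 1): about 1.14 for r = 8 versus 2 for r = 2.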

Page 26: AmgX 2.0 Licensing

Developer/Academic License

Non-commercial use, free

Commercial License, Developer License, Premier Support Service

Subscription License (node/year)

Includes Support and Maintenance

Volume based pricing

Site License

Perpetual License

20% Maintenance and Support

Page 27: AmgX Roadmap

Continuous Improvement

Availability Features

Classical AMG

- multi node

- multi GPU

- Aggressive coarsening

Complex Arithmetic + Aggregation

Easy interfaces, Python

PETSc, HYPRE, Trilinos

Robust convergence on SPE10

GPUDirect v2.0

Scalable Sparse Eigensolvers

Scaling past 512 GPUs

Range Decomposition AMG

Guaranteed convergence aggregation

Commercial License

Premier Support

AmgX 2.5 Q2 2016

AmgX 2.0 Release Q4 2015

CUDA 8.0 with Pascal Support

Tuning for Maxwell

Page 28

AmgX 2.0 was made by a great team of contributors.

AmgX 2.0 Team: Marat Arsaev, Joe Eaton, Alex Fender, Andrei Schaffer

AmgX 2.0 Devtechs: Simon Layton, Nikolai Sakharnykh, Nikolay Markovskiy

Interns: Rohit Gupta, Constantine Stulov