Communication-avoiding & Pipelined Krylov solvers in Trilinos


  • Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

    Communication-avoiding & Pipelined Krylov solvers in Trilinos

    Ichitaro Yamazaki (SNL), Mark Hoemmen (SNL),

    Erik Boman (SNL), and Jack Dongarra (UTK)

    Joint Laboratory for Extreme-Scale Computing Workshop,

    Knoxville, TN, April 15, 2019

  • DOE ECP PEEKS Project

    § Krylov methods are powerful solvers for A*x = b
    § used in many applications, including DOE's and ECP's (some results later)

    § use two main kernels (based on Krylov subspace projection)
    § Matrix-vector multiply (SpMV)

    – for generating the Krylov subspace = span(q, A*q, A^2*q, …)

    – combined with a preconditioner to improve convergence

    – black box, provided by users

    § Orthogonalization
    – for generating orthonormal basis vectors

    – our current focus

    § Communication can be the bottleneck (in time, and maybe power)
    § P2P + irregular data access for SpMV+Precond

    § all-reduce + BLAS-1 or 2 for orthogonalization (see the sketch after this list)

    § becoming more expensive on newer architectures
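    To make the all-reduce cost concrete, here is a minimal sketch (plain MPI, not Trilinos code; the distributed-vector layout is assumed) of the global dot product behind every Gram-Schmidt coefficient:

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // Sketch: each process owns a contiguous slice of each vector.
    // Every orthogonalization coefficient costs one MPI_Allreduce,
    // i.e., one global synchronization -- the latency these solvers attack.
    double distributed_dot (const std::vector<double>& x,
                            const std::vector<double>& y, MPI_Comm comm)
    {
      double local = 0.0;
      for (std::size_t i = 0; i < x.size (); ++i) {
        local += x[i] * y[i];          // BLAS-1 work, purely local
      }
      double global = 0.0;
      MPI_Allreduce (&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
      return global;                   // same value on every process
    }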

    [Diagram: node architectures with CPUs, caches, and multiple memories, illustrating where communication happens]

  • DOE ECP PEEKS Project

    § Communication-avoiding & pipelined Krylov solvers
    § s-step methods aim to avoid communication by generating s vectors at a time

    – latency reduced by a factor of s (see the batching sketch after this list)

    § pipelined methods try to hide communication by overlapping/pipelining it with computation

    – max speedup of 2x from overlap, but maybe more through deeper pipelining

    Make these solvers available in Trilinos
    § Goal: perform "well" for ECP applications on exascale computers

    § also should be useful for non-exascale Trilinos users
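    A minimal sketch of the s-step batching idea (assumed names, not the Trilinos kernel): form all s local inner products first, then pay the all-reduce latency once instead of s times:

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // Sketch: dot each of the s basis vectors q[k] against p locally,
    // then combine all s partial sums in a single MPI_Allreduce,
    // reducing the number of global synchronizations by a factor of s.
    std::vector<double>
    batched_dots (const std::vector<std::vector<double>>& q,
                  const std::vector<double>& p, MPI_Comm comm)
    {
      const int s = static_cast<int> (q.size ());
      std::vector<double> local (s, 0.0), global (s, 0.0);
      for (int k = 0; k < s; ++k) {
        for (std::size_t i = 0; i < p.size (); ++i) {
          local[k] += q[k][i] * p[i];  // local BLAS-2-like work
        }
      }
      MPI_Allreduce (local.data (), global.data (), s,
                     MPI_DOUBLE, MPI_SUM, comm);
      return global;
    }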

    [Diagram: per-iteration kernel timelines. Standard: SpMV, all-reduce, P2P each iteration. s-step (s=2): two SpMVs, then block orthogonalization with one all-reduce. Pipelined: GEMVs and SpMV overlapped with the all-reduce and P2P.]

  • What is Trilinos?

    § Parallel math libraries for science & engineering applications
    § Sparse matrices & parallel distributions

    § Linear solvers & preconditioners

    § Nonlinear solvers, optimization, …

    § ~ 20 years’ continuous development

    § Mostly C++11, some C & Fortran

    § Supports many different platforms
    § CPUs: x86, KNL, POWER, ARM, …

    § GPUs: NVIDIA, AMD in progress, …

    § Future architectures (e.g., exascale)

    § github.com/trilinos/Trilinos

    § Users inside & outside Sandia


  • Trilinos’ linear solvers

    § Iterative linear solvers (Belos)

    § Parallel linear algebra (Tpetra): LA (sparse/dense mat/vec) operations

    § Thread parallelism (Kokkos): programming model for portability

    § Sparse direct solvers (Amesos2)

    § Direct+iterative solvers (ShyLU)

    § Algebraic preconditioners (Ifpack2)

    § Algebraic multigrid (MueLu)

    § Segregated block solvers (Teko)


    Legend: green = programming model; yellow = provides data & kernels; blue = uses data & kernels directly; red = uses kernels abstractly.

    Belos only uses the underlying linear algebra implementation through an abstract interface.

  • Krylov methods we implemented
    § available now in Trilinos (Belos)

    § CA-GMRES (Chronopoulos & Kim '90, Hoemmen 2010)
    § SpMV+Precond as black box

    – no matrix-powers kernel (so not fully communication-avoiding)
    – option to generate a Newton basis for stability

    § block orthogonalization (a few options)
    – standard: >2 all-reduces/iteration, BLAS-1 or 2
    – block: ~2 all-reduces per s steps, BLAS-3


    for j = 1 : s : m do
      // generate s Krylov vectors
      for i = j+1 : 1 : j+s do
        q_i = A M^{-1} q_{i-1}   // SpMV + Precond (black box)
      end for
      // orthogonalize the new block against the previous vectors
      Q_{j+1:j+s} = Ortho(Q_{j+1:j+s}, Q_{1:j})
      Q_{j+1:j+s} = TSQR(Q_{j+1:j+s})
    end for

  • Krylov methods we implemented

    § available now in Trilinos (Belos)
    § 1 all-reduce GMRES (Ghysels et al. 2016)

    § 1 all-reduce CG (Saad '85, D'Azevedo '93)
    – merges orthogonalization & normalization into one all-reduce

    § Pipelined GMRES (Ghysels et al. 2016)

    § Pipelined CG (Ghysels & Vanroose 2012)
    – non-blocking collective, MPI_Iallreduce (see the sketch after this list)
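    A minimal sketch of the non-blocking-collective pattern used by the pipelined methods (assumed layout, not the Belos implementation): start the reduction, overlap it with local work, and complete it only when the result is needed:

    #include <mpi.h>
    #include <vector>

    // Sketch of one pipelined step: the all-reduce for this iteration's
    // dot products runs while the next SpMV / preconditioner apply executes.
    void pipelined_step (std::vector<double>& local_dots,
                         std::vector<double>& global_dots, MPI_Comm comm)
    {
      MPI_Request req;
      MPI_Iallreduce (local_dots.data (), global_dots.data (),
                      static_cast<int> (local_dots.size ()),
                      MPI_DOUBLE, MPI_SUM, comm, &req);

      // ... overlap: apply the preconditioner and the SpMV for the
      // next iteration here, while the collective is in flight ...

      MPI_Wait (&req, MPI_STATUS_IGNORE);  // dot products now usable
    }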


    [Diagram: pipelined iteration timeline: SpMV and GEMVs overlapped with the all-reduce and P2P communication]

  • Belos solver factory interface

    § available through the Belos solver factory
    § no code change needed for Belos users

    § available by run-time option (along with solver parameters, convergence criteria)


    // Create a solver manager
    Belos::SolverFactory<scalar_type, MV, OP> factory;
    RCP<solver_manager_type> solverManager =
      factory.create (args.solverName, params);

    // Set the problem
    RCP<linear_problem_type> lp (new linear_problem_type (A, X, B));
    solverManager->setProblem (lp);

    // Solve the system
    Belos::ReturnType belosResult = solverManager->solve ();

    § The same solver manager may be reused for multiple solves, e.g., for different b or A.

    § A, b, x may be templated, e.g., on scalar, global/local ordinal, and node types (see the sketch below).
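    For concreteness, a hedged sketch of the typedefs the snippet above assumes, using Tpetra objects with their default ordinal and node template parameters (the names scalar_type, MV, OP, etc. are placeholders, not required by Belos):

    #include <BelosSolverFactory.hpp>
    #include <BelosLinearProblem.hpp>
    #include <BelosTpetraAdapter.hpp>
    #include <Tpetra_MultiVector.hpp>
    #include <Tpetra_Operator.hpp>

    using scalar_type         = double;
    using MV                  = Tpetra::MultiVector<scalar_type>; // default ordinals & node
    using OP                  = Tpetra::Operator<scalar_type>;
    using linear_problem_type = Belos::LinearProblem<scalar_type, MV, OP>;
    using solver_manager_type = Belos::SolverManager<scalar_type, MV, OP>;
    using Teuchos::RCP;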

  • Nalu Wind performance results (collaboration with the ECP ExaWind project)

    § Nalu Wind (CFD)
    § github.com/exawind/nalu-wind

    § Low Mach, unstructured, C++

    § Sierra Tool Kit (STK) meshes

    § Can handle >> 10^9 dofs

    § Our experiments (mesh from NREL)
    § Simulate air flow around wind turbine(s)

    § Hybrid RANS-LES (RANS near blade, LES in wake)

    § Segregated physics

    § Trilinos (Belos Krylov)

    § 95 M dofs / linear system
    § pressure and momentum systems

    § MueLu multigrid / Ifpack symmetric Gauss-Seidel

    § NERSC Cori: Haswell, 32 cores/node

    First test case: Vestas V27, 225 kW, 27 m rotor

    Image from Domino, Barone & Bruner; ExaWind poster, SAND2018-1050D, Exascale Computing Project 2nd Annual Meeting, Knoxville, TN, February 6, 2018.

    (Image also appears in Thomas et al., "High-fidelity simulation of wind-turbine incompressible flows with classical and aggregation AMG preconditioners," submitted to SIAM Journal on Scientific Computing, Copper Mountain Conference on Iterative Methods 2018 Special Issue, June 15, 2018.)


  • Time to solution: Pressure system


    [Plot: total solution time (log scale, 10^2 to 10^3) vs. process count (256, 512, 1024, 2048, 4096). Legend: GMRES+ICGS; s-step, Newton, CGS+CholQR; s-step, Newton, low-synch CGS2+CholQR2; s-step, Newton, low-synch CGS+CholQR2; s-step, monomial, low-synch CGS+CholQR2; linear scaling. Speedups over regular GMRES: 1.3x to 1.5x per variant and process count (one entry: Inf).]

    • MueLu algebraic multigrid + GMRES(100)
    • GMRES converges in ~200 iterations

    • CA-GMRES
    – Newton basis (Ritz values from the first s iterations as shifts) or monomial basis
    – converges in the same number of iterations
    – speedup of 1.4x due to fewer all-reduces through block orthogonalization


  • Time to solution: Momentum system


    • Symmetric Gauss-Seidel preconditioner + GMRES
    • GMRES converges in ~5 to ~20 iterations

    • Pipelined, non-blocking collective (pipeline depth = 1)

    [Plot: total solution time vs. process count (4096, 8192, 16384). Legend: GMRES+ICGS; pipelined+CGS; linear scaling. Speedups over GMRES+ICGS at 4096/8192/16384 processes: 0.9x, 1.0x, 1.1x.]

  • Conclusions

    § Communication-avoiding & pipelined Krylov solvers are now in Trilinos

    § more info at icl.utk.edu/peeks & trilinos.org

    § Improved solve performance in Nalu Wind by up to 1.5x

    § Did so without breaking software backwards compatibility

    § Working on performance improvements (e.g., communication/computation kernels on GPUs), algorithmic extensions (e.g., low-synch block orthogonalization), and other preconditioning

  • Thank you!!

    § Nalu developers (Stephen Thomas at NREL & Jonathan Hu at SNL)

    § ECP PEEKS, for funding

    This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.


  • Pipelined CA-GMRES (re)orthogonalization

    § CGS: Classical Gram-Schmidt

    § 1-reduce CGS + CholQR (P is the new block of s vectors; why one reduce suffices is derived after this list)
    1. [C; G] = [Q, P]^T * P   (one all-reduce: C = Q^T*P, G = P^T*P)

    2. P = P - Q*C, G = G - C^T*C

    3. R = chol(G), P = P / R

    § (Next iteration orthogonalizes against P)

    § Above + reorthogonalize P'
    1. [T, C; G1, G] = [Q, P]^T * [P', P]

    2. R' = chol(G1), P' = P' / R'

    3. Update C & G, then steps 2-3 above

    § MGS, CGS2, …: NREL talk today!

    § PDSEC’17; adding to Trilinos soon
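    Why one reduction suffices: since Q^T * Q = I, the Gram matrix of the projected block follows from C and G alone, with no second synchronization:

    (P - Q*C)^T * (P - Q*C) = P^T*P - C^T*(Q^T*P) - (P^T*Q)*C + C^T*(Q^T*Q)*C = G - C^T*C,

    using C = Q^T*P and G = P^T*P. The Cholesky factor of this corrected Gram matrix is exactly the R used to normalize P in step 3.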

    [Diagram: Q holds the previous s vectors and more previous vectors; P is the new block of s vectors, with start vector P'. The dot products (CGS against Q) and the CholQR of P share a single reduction.]