Communication-avoiding & Pipelined Krylov solvers in Trilinos


  • Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

    Communication-avoiding & Pipelined Krylov solvers in Trilinos

    Ichitaro Yamazaki (SNL), Mark Hoemmen (SNL),

    Erik Boman (SNL), and Jack Dongarra (UTK)

    Joint Laboratory for Extreme-Scale Computing Workshop,

    Knoxville, TN, April 15, 2019

  • DOE ECP PEEKS Project

    § Krylov methods are powerful solvers for A*x = b
    § used in many applications, including DOE's and ECP's (some results later)

    § use two main kernels (based on Krylov subspace projection)
    § Matrix-vector multiply (SpMV)

    – for generating the Krylov subspace = span(q, A*q, A^2*q, …)

    – combined with a preconditioner to improve convergence

    – black box, provided by users

    § Orthogonalization
    – for generating orthonormal basis vectors

    – our current focus

    § Communication can be the bottleneck (in time, and maybe power)
    § P2P + irregular data access for SpMV+Precond

    § all-reduce + BLAS-1 or 2 for orthogonalization (see the sketch after this list)

    § becoming more expensive on newer architectures
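    To make the all-reduce cost concrete, here is a minimal sketch (plain MPI, not Trilinos code; the distributed-vector layout is assumed) of the global dot product behind every Gram-Schmidt coefficient:

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // Sketch: each process owns a contiguous slice of each vector.
    // Every orthogonalization coefficient costs one MPI_Allreduce,
    // i.e., one global synchronization -- the latency these solvers attack.
    double distributed_dot (const std::vector<double>& x,
                            const std::vector<double>& y, MPI_Comm comm)
    {
      double local = 0.0;
      for (std::size_t i = 0; i < x.size (); ++i) {
        local += x[i] * y[i];          // BLAS-1 work, purely local
      }
      double global = 0.0;
      MPI_Allreduce (&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
      return global;                   // same value on every process
    }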

    [Diagram: node architectures with CPUs, caches, and multiple memories, illustrating where communication happens]

  • DOE ECP PEEKS Project

    § Communication-avoiding & pipelined Krylov solvers
    § s-step methods aim to avoid communication by generating s vectors at a time

    – latency reduced by a factor of s (see the batching sketch after this list)

    § pipelined methods try to hide communication by overlapping/pipelining it with computation

    – max speedup of 2x from overlap, but maybe more through deeper pipelining

    Make these solvers available in Trilinos
    § Goal: perform "well" for ECP applications on exascale computers

    § also should be useful for non-exascale Trilinos users
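    A minimal sketch of the s-step batching idea (assumed names, not the Trilinos kernel): form all s local inner products first, then pay the all-reduce latency once instead of s times:

    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    // Sketch: dot each of the s basis vectors q[k] against p locally,
    // then combine all s partial sums in a single MPI_Allreduce,
    // reducing the number of global synchronizations by a factor of s.
    std::vector<double>
    batched_dots (const std::vector<std::vector<double>>& q,
                  const std::vector<double>& p, MPI_Comm comm)
    {
      const int s = static_cast<int> (q.size ());
      std::vector<double> local (s, 0.0), global (s, 0.0);
      for (int k = 0; k < s; ++k) {
        for (std::size_t i = 0; i < p.size (); ++i) {
          local[k] += q[k][i] * p[i];  // local BLAS-2-like work
        }
      }
      MPI_Allreduce (local.data (), global.data (), s,
                     MPI_DOUBLE, MPI_SUM, comm);
      return global;
    }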

    [Diagram: per-iteration kernel timelines. Standard: SpMV, all-reduce, P2P each iteration. s-step (s=2): two SpMVs, then block orthogonalization with one all-reduce. Pipelined: GEMVs and SpMV overlapped with the all-reduce and P2P.]

  • What is Trilinos?

    § Parallel math libraries for science & engineering applications
    § Sparse matrices & parallel distributions

    § Linear solvers & preconditioners

    § Nonlinear solvers, optimization, …

    § ~ 20 years’ continuous development

    § Mostly C++11, some C & Fortran

    § Supports many different platforms
    § CPUs: x86, KNL, POWER, ARM, …

    § GPUs: NVIDIA, AMD in progress, …

    § Future architectures (e.g., exascale)

    § github.com/trilinos/Trilinos

    § Users inside & outside Sandia


  • Trilinos’ linear solvers

    § Iterative linear solvers (Belos)

    § Parallel linear algebra (Tpetra): LA (sparse/dense mat/vec) operations

    § Thread parallelism (Kokkos): programming model for portability

    § Sparse direct solvers (Amesos2)

    § Direct+iterative solvers (ShyLU)

    § Algebraic preconditioners (Ifpack2)

    § Algebraic multigrid (MueLu)

    § Segregated block solvers (Teko)


    Legend: green = programming model; yellow = provides data & kernels; blue = uses data & kernels directly; red = uses kernels abstractly.

    Belos only uses the underlying linear algebra implementation through an abstract interface.

  • Krylov methods we implemented
    § available now in Trilinos (Belos)

    § CA-GMRES (Chronopoulos & Kim '90, Hoemmen 2010)
    § SpMV+Precond as black box

    – no matrix-powers kernel (so not fully communication-avoiding)
    – option to generate a Newton basis for stability

    § block orthogonalization (a few options)
    – standard: >2 all-reduces/iteration, BLAS-1 or 2
    – block: ~2 all-reduces per s steps, BLAS-3


    for j = 1 : s : m do
      // generate s Krylov vectors
      for i = j+1 : 1 : j+s do
        q_i = A M^{-1} q_{i-1}   // SpMV + Precond (black box)
      end for
      // orthogonalize the new block against the previous vectors
      Q_{j+1:j+s} = Ortho(Q_{j+1:j+s}, Q_{1:j})
      Q_{j+1:j+s} = TSQR(Q_{j+1:j+s})
    end for

  • Krylov methods we implemented

    § available now in Trilinos (Belos)
    § 1 all-reduce GMRES (Ghysels et al. 2016)

    § 1 all-reduce CG (Saad '85, D'Azevedo '93)
    – merges orthogonalization & normalization into one all-reduce

    § Pipelined GMRES (Ghysels et al. 2016)

    § Pipelined CG (Ghysels & Vanroose 2012)
    – non-blocking collective, MPI_Iallreduce (see the sketch after this list)
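    A minimal sketch of the non-blocking-collective pattern used by the pipelined methods (assumed layout, not the Belos implementation): start the reduction, overlap it with local work, and complete it only when the result is needed:

    #include <mpi.h>
    #include <vector>

    // Sketch of one pipelined step: the all-reduce for this iteration's
    // dot products runs while the next SpMV / preconditioner apply executes.
    void pipelined_step (std::vector<double>& local_dots,
                         std::vector<double>& global_dots, MPI_Comm comm)
    {
      MPI_Request req;
      MPI_Iallreduce (local_dots.data (), global_dots.data (),
                      static_cast<int> (local_dots.size ()),
                      MPI_DOUBLE, MPI_SUM, comm, &req);

      // ... overlap: apply the preconditioner and the SpMV for the
      // next iteration here, while the collective is in flight ...

      MPI_Wait (&req, MPI_STATUS_IGNORE);  // dot products now usable
    }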


    [Diagram: pipelined iteration timeline: SpMV and GEMVs overlapped with the all-reduce and P2P communication]

  • Belos solver factory interface

    § available through the Belos solver factory
    § no code change needed for Belos users

    § available by run-time option (along with solver parameters, convergence criteria)


    // Create a solver manager
    Belos::SolverFactory<scalar_type, MV, OP> factory;
    RCP<solver_manager_type> solverManager =
      factory.create (args.solverName, params);

    // Set the problem
    RCP<linear_problem_type> lp (new linear_problem_type (A, X, B));
    solverManager->setProblem (lp);

    // Solve the system
    Belos::ReturnType belosResult = solverManager->solve ();

    § The same solver manager may be reused for multiple solves, e.g., for different b or A.

    § A, b, x may be templated, e.g., on scalar, global/local ordinal, and node types (see the sketch below).
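    For concreteness, a hedged sketch of the typedefs the snippet above assumes, using Tpetra objects with their default ordinal and node template parameters (the names scalar_type, MV, OP, etc. are placeholders, not required by Belos):

    #include <BelosSolverFactory.hpp>
    #include <BelosLinearProblem.hpp>
    #include <BelosTpetraAdapter.hpp>
    #include <Tpetra_MultiVector.hpp>
    #include <Tpetra_Operator.hpp>

    using scalar_type         = double;
    using MV                  = Tpetra::MultiVector<scalar_type>; // default ordinals & node
    using OP                  = Tpetra::Operator<scalar_type>;
    using linear_problem_type = Belos::LinearProblem<scalar_type, MV, OP>;
    using solver_manager_type = Belos::SolverManager<scalar_type, MV, OP>;
    using Teuchos::RCP;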

  • Nalu Wind performance results (collaboration with the ECP ExaWind project)

    § Nalu Wind (CFD)
    § github.com/exawind/nalu-wind

    § Low Mach, unstructured, C++

    § Sierra Tool Kit (STK) meshes

    § Can handle >> 10^9 dofs

    § Our experiments (mesh from NREL)
    § Simulate air flow around wind turbine(s)

    § Hybrid RANS-LES (RANS near blade, LES in wake)

    § Segregated physics

    § Trilinos (Belos Krylov)

    § 95 M dofs / linear system
    § pressure and momentum systems

    § MueLu multigrid / Ifpack symmetric Gauss-Seidel

    § NERSC Cori: Haswell, 32 cores/node

    First test case: Vestas V27, 225 kW, 27 m rotor

    Image from Domino, Barone & Bruner; ExaWind poster, SAND2018-1050D, Exascale Computing Project 2nd Annual Meeting, Knoxville, TN, February 6, 2018.

    (Image also appears in Thomas et al., "High-fidelity simulation of wind-turbine incompressible flows with classical and aggregation AMG preconditioners," submitted to SIAM Journal on Scientific Computing, Copper Mountain Conference on Iterative Methods 2018 Special Issue, June 15, 2018.)


  • Time to solution: Pressure system


    [Plot: total solution time (log scale, 10^2 to 10^3) vs. process count (256, 512, 1024, 2048, 4096). Legend: GMRES+ICGS; s-step, Newton, CGS+CholQR; s-step, Newton, low-synch CGS2+CholQR2; s-step, Newton, low-synch CGS+CholQR2; s-step, monomial, low-synch CGS+CholQR2; linear scaling. Speedups over regular GMRES: 1.3x to 1.5x per variant and process count (one entry: Inf).]

    • MueLu algebraic multigrid + GMRES(100)
    • GMRES converges in ~200 iterations

    • CA-GMRES
    – Newton basis (Ritz values from the first s iterations as shifts) or monomial basis
    – converges in the same number of iterations
    – speedup of 1.4x due to fewer all-reduces through block orthogonalization


  • Time to solution: Momentum system


    • Symmetric Gauss-Seidel preconditioner + GMRES
    • GMRES converges in ~5 to ~20 iterations

    • Pipelined, non-blocking collective (pipeline depth = 1)

    [Plot: total solution time vs. process count (4096, 8192, 16384). Legend: GMRES+ICGS; pipelined+CGS; linear scaling. Speedups over GMRES+ICGS at 4096/8192/16384 processes: 0.9x, 1.0x, 1.1x.]

  • Conclusions

    § Communication-avoiding & pipelined Krylov solvers are now in Trilinos

    § more info at icl.utk.edu/peeks & trilinos.org

    § Improved solve performance in Nalu Wind by up to 1.5x

    § Did so without breaking software backwards compatibility

    § Working on performance improvements (e.g., communication/computation kernels on GPUs), algorithmic extensions (e.g., low-synch block orthogonalization), and other preconditioning

  • Thank you!!

    § Nalu developers (Stephen Thomas at NREL & Jonathan Hu at SNL)

    § ECP PEEKS, for funding

    This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.


  • Pipelined CA-GMRES (re)orthogonalization

    § CGS: Classical Gram-Schmidt

    § 1-reduce CGS + CholQR (P is the new block of s vectors; why one reduce suffices is derived after this list)
    1. [C; G] = [Q, P]^T * P   (one all-reduce: C = Q^T*P, G = P^T*P)

    2. P = P - Q*C, G = G - C^T*C

    3. R = chol(G), P = P / R

    § (Next iteration orthogonalizes against P)

    § Above + reorthogonalize P'
    1. [T, C; G1, G] = [Q, P]^T * [P', P]

    2. R' = chol(G1), P' = P' / R'

    3. Update C & G, then steps 2-3 above

    § MGS, CGS2, …: NREL talk today!

    § PDSEC’17; adding to Trilinos soon
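    Why one reduction suffices: since Q^T * Q = I, the Gram matrix of the projected block follows from C and G alone, with no second synchronization:

    (P - Q*C)^T * (P - Q*C) = P^T*P - C^T*(Q^T*P) - (P^T*Q)*C + C^T*(Q^T*Q)*C = G - C^T*C,

    using C = Q^T*P and G = P^T*P. The Cholesky factor of this corrected Gram matrix is exactly the R used to normalize P in step 3.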

    [Diagram: Q holds the previous s vectors and more previous vectors; P is the new block of s vectors, with start vector P'. The dot products (CGS against Q) and the CholQR of P share a single reduction.]