Sparse Linear Algebra over DAG Runtimes (solhar.gforge.inria.fr/lib/exe/fetch.php?media=solhar.pdf)


  • November 15, 2013

    Sparse Linear Algebra over DAG Runtimes
    ANR SOLHAR - Kick-off, Bordeaux

    M. Faverge, X. Lacoste, P. Ramet
    HiePACS team, Inria Bordeaux Sud-Ouest

  • Introduction

    Introduction

    - Modern architectures are enhanced with accelerators
    - Dense linear algebra solvers on GPU ⇒ lots of solutions
      - Mono-GPU, static scheduler: MAGMA, CULA, ...
      - Tile algorithms over runtimes: PaRSEC (DPLASMA), StarPU (MAGMA-MORSE), XKaapi, SMPSs, ...
    - What about sparse linear solvers?
      - Many solutions for distributed and/or shared memory: PaStiX, MUMPS, SuperLU, Pardiso, ...
      - Few commercial solutions with GPUs: MatrixPro, Acceleware, BCSLib-GPU
      - Sparse QR factorization: QR-Mumps (GPU in progress)


  • Introduction

    Guideline

    PaStiX solver

    DAG schedulers

    Towards GPU accelerated solver over runtimes

    What is next?

    Conclusion


  • 2 PaStiX solver

  • PaStiX solver

    Major steps for solving sparse linear systems

    1. Analysis: the matrix is preprocessed to improve its structural
       properties (A'x' = b' with A' = P_n P D_r A D_c Q P^T)

    2. Factorization: the matrix is factorized as A = LU, LL^T or LDL^T

    3. Solve: the solution x is computed by means of forward and backward
       substitutions (a minimal dense sketch of this step follows)
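    As an illustration of step 3 only, here is a minimal, self-contained C sketch of the
    two triangular solves for a dense LL^T factor (forward substitution with L, then
    backward substitution with L^T); the names and the dense, column-major storage are
    assumptions for the example, not PaStiX's sparse supernodal kernels.

    /* Dense sketch of the solve step for A = L L^T (column-major storage).
     * forward_subst :  L   y = b
     * backward_subst:  L^T x = y                                            */
    #include <stddef.h>

    static void forward_subst(size_t n, const double *L, const double *b, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double s = b[i];
            for (size_t j = 0; j < i; j++)
                s -= L[i + j * n] * y[j];          /* L(i,j)                 */
            y[i] = s / L[i + i * n];               /* divide by L(i,i)       */
        }
    }

    static void backward_subst(size_t n, const double *L, const double *y, double *x)
    {
        for (size_t i = n; i-- > 0; ) {
            double s = y[i];
            for (size_t j = i + 1; j < n; j++)
                s -= L[j + i * n] * x[j];          /* L^T(i,j) = L(j,i)      */
            x[i] = s / L[i + i * n];
        }
    }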


  • PaStiX solver

    PaStiX solver

    - Supernodal method
    - Cholesky, LU and LDL^T factorizations
    - Exploits symmetric patterns, even for general matrices
    - Hybrid MPI/Pthread implementation
    - Static and dynamic schedulers
    - Adapted to NUMA architectures
    - Task-based algorithms


  • PaStiX solver

    Supernodal factorization tasks


  • PaStiX solver

    Static scheduling within PaStiX

    forall the Supernodes S1 attributed to thread t do
        wait (S1);
        factorize (S1);
        forall the off-diagonal blocks Bi of S1 do
            S2 <- supernode in front of (Bi);
            lock (S2);
            update (S1, S2);
            unlock (S2);
            if all updates applied on S2 then
                release (S2)
            end
        end
    end

    (a C sketch of this loop with explicit locking follows)
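    A C sketch of this loop for one worker thread, assuming hypothetical supernode fields
    (a ready semaphore, a mutex, a remaining-update counter) and hypothetical factorize /
    update / supernode_in_front_of routines; it only illustrates the synchronization
    pattern of the static scheduler, not PaStiX's actual data structures.

    #include <pthread.h>
    #include <semaphore.h>

    typedef struct supernode {
        sem_t            ready;          /* posted once all updates have been applied  */
        pthread_mutex_t  lock;           /* protects the supernode during updates      */
        int              updates_left;   /* remaining contributions to receive         */
        int              nb_offdiag;     /* number of off-diagonal blocks              */
        struct block   **blocks;         /* off-diagonal blocks of this supernode      */
    } supernode_t;

    /* hypothetical kernels and lookup */
    void         factorize(supernode_t *S);
    void         update(supernode_t *S1, supernode_t *S2);
    supernode_t *supernode_in_front_of(struct block *B);

    void worker_loop(supernode_t **mine, int nb)   /* supernodes attributed to thread t */
    {
        for (int s = 0; s < nb; s++) {
            supernode_t *S1 = mine[s];
            sem_wait(&S1->ready);                  /* wait(S1)                          */
            factorize(S1);                         /* factorize(S1)                     */
            for (int i = 0; i < S1->nb_offdiag; i++) {
                supernode_t *S2 = supernode_in_front_of(S1->blocks[i]);
                pthread_mutex_lock(&S2->lock);     /* lock(S2)                          */
                update(S1, S2);                    /* update(S1, S2)                    */
                int done = (--S2->updates_left == 0);
                pthread_mutex_unlock(&S2->lock);   /* unlock(S2)                        */
                if (done)
                    sem_post(&S2->ready);          /* release(S2)                       */
            }
        }
    }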


  • 3 DAG schedulers

  • DAG schedulers

    DAG schedulers considered

    StarPU
    - RunTime Team - Inria Bordeaux Sud-Ouest
    - C. Augonnet, R. Namyst, S. Thibault
    - Dynamic Task Discovery
    - Computes cost models on the fly
    - Multiple kernels on the accelerators
    - Heterogeneous Earliest Finish Time (HEFT) strategy

    PaRSEC (formerly DAGuE)
    - ICL - University of Tennessee, Knoxville
    - G. Bosilca, A. Bouteiller, A. Danalis, T. Herault
    - Parameterized Task Graph
    - Only the most compute-intensive kernel on accelerators
    - Simple scheduling strategy based on computing capabilities
    - GPU multi-stream enabled


  • DAG schedulers

    StarPU loop to submit tasks (DTD)

    forall the Supernodes S1 do
        submit factorize (S1);
        forall the off-diagonal blocks Bi of S1 do
            S2 <- supernode in front of (Bi);
            submit update (S1, S2);
        end
    end

    (a StarPU sketch of this submission loop follows)
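    A hedged C sketch of the same submission loop with StarPU's task-insert helper;
    cl_factorize / cl_update are hypothetical codelets, handle[] are data handles
    registered for each supernode's panel, and first_offdiag_block / last_block /
    facing_cblk_of are hypothetical lookups standing in for the symbolic structure.

    #include <starpu.h>

    /* hypothetical lookups into the symbolic factorization */
    int first_offdiag_block(int cblk);
    int last_block(int cblk);
    int facing_cblk_of(int blok);

    void submit_tasks(int cblknbr, starpu_data_handle_t *handle,
                      struct starpu_codelet *cl_factorize,
                      struct starpu_codelet *cl_update)
    {
        for (int c = 0; c < cblknbr; c++) {
            /* submit factorize(S1) */
            starpu_task_insert(cl_factorize, STARPU_RW, handle[c], 0);

            /* submit update(S1, S2) for every off-diagonal block of S1 */
            for (int b = first_offdiag_block(c); b <= last_block(c); b++) {
                int fc = facing_cblk_of(b);        /* S2: supernode in front of Bi */
                starpu_task_insert(cl_update,
                                   STARPU_R,  handle[c],
                                   STARPU_RW, handle[fc],
                                   0);
            }
        }
        starpu_task_wait_for_all();                /* dependencies come from data accesses */
    }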


  • DAG schedulers

    PaRSEC’s representation (PTG)

    panel(c)  [high_priority = on]
      /* execution space */
      c = 0 .. cblknbr-1
      /* Extra parameters */
      firstblock = diagonal_block_of( c )
      lastblock  = last_block_of( c )
      lastbrow   = last_brow_of( c )
      /* Locality */
      : A(c)
      RW A ← leaf ? A(c) : C update(lastbrow)
           → A update(firstblock+1 .. lastblock)
           → A(c)

    update(b)
      /* execution space */
      b = 0 .. bloknbr-1
      /* Extra parameters */
      c  = get_cblk_of( b )
      fc = get_facing_cblk_of( b )
      ...
      /* Locality */
      : A(fc)
      READ A ← A panel(c)
      RW   C ← previous ? C update(prev) : A(fc)
             → next ? C update(next) : A panel(fc)


  • DAG schedulers

    Experiments

    Matrix     Prec  Method  Size    nnzA    nnzL      TFlop
    Afshell10  D     LU      1.5e+6  27e+6    610e+6    0.12
    FilterV2   Z     LU      0.6e+6  12e+6    536e+6    3.6
    Flan       D     LLT     1.6e+6  59e+6   1712e+6    5.3
    Audi       D     LLT     0.9e+6  39e+6   1325e+6    6.5
    MHD        D     LU      0.5e+6  24e+6   1133e+6    6.6
    Geo1438    D     LLT     1.4e+6  32e+6   2768e+6   23
    Pmldf      Z     LDLT    1.0e+6   8e+6   1105e+6   28
    Hook       D     LU      1.5e+6  31e+6   4168e+6   35
    Serena     D     LDLT    1.4e+6  32e+6   3365e+6   47

    Table: Matrix description (Z: double complex, D: double).

    Machine
    - Two hexa-core Westmere Xeon X5650 (2.67 GHz)
    - 32 GB of memory


  • DAG schedulers

    CPU Scaling study

    [Figure: Performance (GFlop/s) of PaStiX, StarPU, and PaRSEC with 1, 3, 6, 9, and 12 cores
    on each test matrix: afshell10 (D, LU), FilterV2 (Z, LU), Flan (D, LLT), audi (D, LLT),
    MHD (D, LU), Geo1438 (D, LLT), pmlDF (Z, LDLT), HOOK (D, LU), Serena (D, LDLT).]

  • 4 Towards GPU accelerated solver over runtimes

  • Towards GPU accelerated solver over runtimes

    What can be offloaded to the GPU?

    - Panel factorization:
      - Call a MAGMA kernel?
      - Diagonal block size < 120
      - Panel is done on the CPU
      - → no GPU kernel for the factorization
    - Panel update:
      - GEMM variant
      - Highly efficient GEMM source code available
      - → an existing GEMM can easily be adapted to our problem
      - Extension of the ASTRA kernel (J. Kurzak, ICL)


  • Towards GPU accelerated solver over runtimes

    Sparse GEMM on GPU

    [Figure: tiled A × X panel update; each block of the source panel spans rows
    [fri,j, lri,j] of the facing panel.]

    blocknbr  = 3;
    blocktab  = [ fr1,1, lr1,1,
                  fr1,2, lr1,2,
                  fr1,3, lr1,3 ];

    fblocknbr = 2;
    fblocktab = [ fr2,1, lr2,1,
                  fr2,2, lr2,2 ];

    sparse_gemm_cuda( char TRANSA, char TRANSB, int m, int n, int k,
                      cuDoubleComplex alpha,
                      const cuDoubleComplex *d_A, int lda,
                      const cuDoubleComplex *d_B, int ldb,
                      cuDoubleComplex beta,
                      cuDoubleComplex *d_C, int ldc,
                      int blocknbr,  const int *blocktab,
                      int fblocknbr, const int *fblocktab,
                      CUstream stream );

    Figure : Panel update on GPU
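    A hedged host-side sketch of how the kernel above could be invoked for one panel
    update, reusing the blocktab / fblocktab layout from the slide; the transpose flags,
    the alpha = -1 / beta = 1 convention and the wrapper name submit_panel_update are
    assumptions for the example, not the solver's actual call site.

    #include <cuComplex.h>
    #include <cuda.h>

    /* prototype as given on the slide */
    void sparse_gemm_cuda( char TRANSA, char TRANSB, int m, int n, int k,
                           cuDoubleComplex alpha,
                           const cuDoubleComplex *d_A, int lda,
                           const cuDoubleComplex *d_B, int ldb,
                           cuDoubleComplex beta,
                           cuDoubleComplex *d_C, int ldc,
                           int blocknbr,  const int *blocktab,
                           int fblocknbr, const int *fblocktab,
                           CUstream stream );

    /* Submit one panel update C <- C - A * B^T on the given stream (assumed convention). */
    void submit_panel_update(int m, int n, int k,
                             const cuDoubleComplex *d_A, int lda,
                             const cuDoubleComplex *d_B, int ldb,
                             cuDoubleComplex *d_C, int ldc,
                             int blocknbr,  const int *blocktab,
                             int fblocknbr, const int *fblocktab,
                             CUstream stream)
    {
        cuDoubleComplex alpha = make_cuDoubleComplex(-1.0, 0.0);
        cuDoubleComplex beta  = make_cuDoubleComplex( 1.0, 0.0);
        sparse_gemm_cuda('N', 'T', m, n, k, alpha,
                         d_A, lda, d_B, ldb, beta, d_C, ldc,
                         blocknbr,  blocktab,
                         fblocknbr, fblocktab,
                         stream);
    }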


  • Towards GPU accelerated solver over runtimes

    GPU kernel experiment

    [Figure: sketch of the benchmarked panel update, multiplying A11 by A11^T
    (panel widths NcolA, NcolB; heights NrowA, NrowA11).]

    Parameters
    - NcolA = 128
    - NcolB = NrowA11 = 128
    - NrowA varies from 256 to 20000
    - Random number and size of blocks in A
    - Random blocks in B matching A
    - Mean time of 10 runs for a fixed NrowA, each with a different block distribution
      (a sketch of such a random block layout follows)
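    The random block layouts are not detailed on the slide; the following is only a
    hypothetical sketch of how non-overlapping (first row, last row) pairs for A could be
    drawn inside [0, NrowA), in the blocktab format of the previous slide.

    #include <stdlib.h>

    /* Fill blocktab with up to max_blocks random, non-overlapping (first, last) row
     * intervals inside [0, nrowA). Returns the number of blocks actually created.  */
    int random_blocktab(int nrowA, int max_blocks, int *blocktab)
    {
        int nb = 0, row = 0;
        while (row < nrowA && nb < max_blocks) {
            int first  = row + rand() % 64;           /* random gap before the block */
            int height = 1 + rand() % 256;            /* random block height         */
            int last   = first + height - 1;
            if (first >= nrowA) break;
            if (last  >= nrowA) last = nrowA - 1;
            blocktab[2 * nb]     = first;
            blocktab[2 * nb + 1] = last;
            row = last + 1;
            nb++;
        }
        return nb;
    }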

    Figure : GPU kernel experiment


  • Towards GPU accelerated solver over runtimes

    Multi-stream performance comparison (Tesla M2070)

    [Figure: GFlop/s versus matrix size M (N = K = 128) for the cuBLAS peak and for the
    cuBLAS, ASTRA, and sparse kernels with 1, 2, and 3 streams.]

  • Towards GPU accelerated solver over runtimes

    Data mapping over GPU (PaRSEC, 3 Tesla M2070)

    [Figure: Performance (GFlop/s) on each test matrix for four criteria used to map panels
    onto the GPUs: panel's size, FLOPs, priority, and number of updates.]

  • Towards GPU accelerated solver over runtimes

    GPU scaling study (No GPU)

    [Figure: Performance (GFlop/s) on each test matrix. Legend: PaStiX; StarPU and StarPU
    with 1, 2, or 3 GPUs; PaRSEC and PaRSEC with 1, 2, or 3 GPUs, using 1 or 3 streams.]

  • Towards GPU accelerated solver over runtimes

    GPU scaling study (1 GPU)

    [Figure: same layout and legend as the previous GPU scaling slide.]

  • Towards GPU accelerated solver over runtimes

    GPU scaling study (2 GPUs)

    [Figure: same layout and legend as the previous GPU scaling slide.]

  • Towards GPU accelerated solver over runtimes

    GPU scaling study (3 GPUs)

    [Figure: same layout and legend as the previous GPU scaling slide.]

  • Towards GPU accelerated solver over runtimes

    GPU scaling study (Multi-streams)

    [Figure: same layout and legend as the previous GPU scaling slide.]

  • 5 What is next?

  • What is next?

    Improvements on granularity

    1. Try to improve BLAS efficiency through larger blocking
       - Study the impact of the Scotch minimal subblock parameter (cmin)

    2. Create more parallelism while avoiding low-flops tasks
       - Improve the supernode splitting algorithm (a minimal sketch of the current
         regular splitting follows)
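    A minimal sketch of the current regular splitting only: a supernode of ncols columns
    is cut into panels of at most max_width columns (constant split size); the adapted
    strategy of the next slide, which chooses cut points from the supernode's structure,
    is not reproduced here.

    /* Regular splitting: panel p covers columns [panel_start[p], panel_start[p+1]). */
    int regular_split(int ncols, int max_width, int *panel_start, int max_panels)
    {
        int nb = 0;
        for (int col = 0; col < ncols && nb < max_panels; col += max_width)
            panel_start[nb++] = col;
        return nb;                                   /* number of panels created */
    }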


  • What is next?

    Idea of smarter splitting algorithm

    (a) Regular splitting (current): constant split size

    (b) Adapted splitting (new): a smarter split

  • What is next?

    Preliminary results

    [Figure: Factorization time (s, log scale) versus number of threads (1, 2, 4, 6, 8) for
    the Regular and Adapted splittings with cmin = 0 and cmin = 20. Matrix: Audi, LL^T, D.]

    Splitting alg.         Regular               Adapted
    Scotch cmin            0          20         0          20
    Analysis time          1.95 s     0.35 s     2.56 s     0.42 s
    Number of panels       118814     10082      118220     9491
    Number of blocks       2283029    338493     2213497    280722
    Created by splitting   65147      48284      18072      13081
    Avg. panel size        7.94262    93.602     7.98253    99.4305
    Avg. block height      10.1546    29.2206    9.08452    24.5355
    Memory usage           10.1 GB    10.7 GB    10.5 GB    11.1 GB

    Adapted splitting algorithm
    - 6-15% improvement in factorization time
    - 16-20% increase in analysis time

    Scotch cmin
    - 80% faster analysis
    - No impact on factorization time
    - 6% increase in memory consumption


  • 6 Conclusion

  • Conclusion

    Conclusion

    - Productive/easy solution to add accelerators to a sparse direct solver
    - Performance loss with both schedulers is low on multicore architectures
    - Accelerator integration is not always as easy as claimed by runtime programmers
    - Sparse factorization raises many challenges for generic runtimes:
      - Small granularity → runtime overhead becomes too important
      - Irregular task granularity →
        - Difficulty to generate cost models
        - Inefficient data movements (mostly designed for regular dense blocks)


  • Conclusion

    Future works

    - Experiments on larger architectures (Minotaure) / reduce task overhead
    - Distributed implementation (MPI)
    - Intel Xeon Phi:
      - Full Xeon Phi native implementation
      - Offload through runtimes
    - H-matrices:
      - Check the potential compression ratio on top-level blocks
      - Develop a prototype with:
        - low-rank compression on the larger supernodes
        - a compression tree built at each update
      - Study the coupling between nested dissection and compression tree ordering


  • Conclusion

    Thanks !

    PaStiX Team
    INRIA HiePACS team

    ANR SOLHAR, Bordeaux
