Sparse Linear Algebra over DAG Runtimes
solhar.gforge.inria.fr/lib/exe/fetch.php?media=solhar.pdf
TRANSCRIPT
-
November 15, 2013
Sparse Linear Algebra over DAG Runtimes
ANR SOLHAR - Kick-off, Bordeaux
M. Faverge, X. Lacoste, P. Ramet
HiePACS team, Inria Bordeaux Sud-Ouest
-
Introduction
- Modern architectures are enhanced with accelerators
- Dense linear algebra solvers on GPU ⇒ lots of solutions
  - Mono-GPU, static scheduler: MAGMA, CULA, ...
  - Tile algorithms over runtimes: PaRSEC (DPLASMA), StarPU (MAGMA-MORSE), XKaapi, SMPSs, ...
- What about sparse linear solvers?
  - Many solutions for distributed memory and/or shared memory: PaStiX, MUMPS, SuperLU, Pardiso, ...
  - Few commercial solutions with GPUs: MatrixPro, Acceleware, BCSLib-GPU
  - Sparse QR factorization: QR-Mumps (GPU in progress)
PaStiX Team - ANR SOLHAR November 15, 2013- 2
-
Outline
PaStiX solver
DAG schedulers
Towards GPU accelerated solver over runtimes
What is next?
Conclusion
-
2. PaStiX solver
-
Major steps for solving sparse linear systems
1. Analysis: the matrix is preprocessed to improve its structural properties (A′x′ = b′ with A′ = P_n P D_r A D_c Q P^T)
2. Factorization: the matrix is factorized as A = LU, LL^T or LDL^T
3. Solve: the solution x is computed by means of forward and backward substitutions
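The solve step can be illustrated with a minimal dense sketch (PaStiX works on sparse supernodal structures; this only shows the two substitutions, on a tiny hand-factorized example):

```python
# Sketch of the solve step: once A = LU is available, Ax = b is solved by
# a forward substitution (Ly = b) then a backward substitution (Ux = y).
def forward_sub(L, b):
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    return y

def backward_sub(U, y):
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# A = [[2, 1], [4, 5]] factorized as L (unit diagonal) times U:
L = [[1.0, 0.0], [2.0, 1.0]]
U = [[2.0, 1.0], [0.0, 3.0]]
x = backward_sub(U, forward_sub(L, [3.0, 9.0]))  # solves A x = [3, 9] -> x = [1, 1]
```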
-
PaStiX solver
- Supernodal method
- Cholesky, LU and LDL^T factorizations
- Exploits symmetric patterns, even for general matrices
- Hybrid MPI/Pthread implementation
- Static and dynamic schedulers
- Adapted to NUMA architectures
- Task-based algorithms
-
Supernodal factorization tasks
-
Static scheduling within PaStiX
    forall the Supernode S1 attributed to t do
        wait(S1);
        factorize(S1);
        forall the off-diagonal blocks Bi of S1 do
            S2 ← supernode_in_front_of(Bi);
            lock(S2);
            update(S1, S2);
            unlock(S2);
            if all updates applied on S2 then
                release(S2)
            end
        end
    end
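This static scheme can be mimicked with plain Python threads; a toy sketch under illustrative assumptions (a three-supernode elimination-tree fragment and a fixed two-thread work attribution, neither of which reflects PaStiX's actual data structures):

```python
import threading

# Toy elimination-tree fragment: each supernode lists the supernodes
# its off-diagonal blocks face (i.e. the targets of its updates).
updates_to = {"S1": ["S3"], "S2": ["S3"], "S3": []}
pending = {"S1": 0, "S2": 0, "S3": 2}        # updates still expected
locks = {s: threading.Lock() for s in updates_to}
ready = {s: threading.Event() for s in updates_to}
log = []

for s, n in pending.items():
    if n == 0:
        ready[s].set()                        # leaves can start immediately

def worker(supernodes):
    for s1 in supernodes:                     # static list attributed to this thread
        ready[s1].wait()                      # wait(S1): all updates received
        log.append(("factorize", s1))
        for s2 in updates_to[s1]:             # one update per facing supernode
            with locks[s2]:                   # lock(S2) ... unlock(S2)
                log.append(("update", s1, s2))
                pending[s2] -= 1
                if pending[s2] == 0:
                    ready[s2].set()           # release(S2)

t1 = threading.Thread(target=worker, args=(["S1"],))
t2 = threading.Thread(target=worker, args=(["S2", "S3"],))
t1.start(); t2.start(); t1.join(); t2.join()
# S3 is factorized last, only after both updates have been applied.
```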
-
3. DAG schedulers
-
DAG schedulers considered

StarPU
- RunTime team, Inria Bordeaux Sud-Ouest
- C. Augonnet, R. Namyst, S. Thibault
- Dynamic Task Discovery
- Computes cost models on the fly
- Multiple kernels on the accelerators
- Heterogeneous Earliest-Finish-Time (HEFT) strategy

PaRSEC (formerly DAGuE)
- ICL, University of Tennessee, Knoxville
- G. Bosilca, A. Bouteiller, A. Danalis, T. Herault
- Parameterized Task Graph
- Only the most compute-intensive kernel on the accelerators
- Simple scheduling strategy based on computing capabilities
- GPU multi-stream enabled
-
StarPU loop to submit tasks (DTD)
    forall the Supernode S1 do
        submit_factorize(S1);
        forall the off-diagonal blocks Bi of S1 do
            S2 ← supernode_in_front_of(Bi);
            submit_update(S1, S2);
        end
    end
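The point of Dynamic Task Discovery is that the runtime infers the DAG from the sequential submission order and the declared access mode of each datum. A toy sketch of that inference (the `ToyRuntime` class and its API are illustrative, not StarPU's; it tracks only each datum's last writer, so it covers read-after-write and write-after-write, ignoring write-after-read):

```python
# Toy DTD runtime: dependencies are inferred from the sequential
# submission order and the declared access mode of each piece of data.
class ToyRuntime:
    def __init__(self):
        self.last_writer = {}      # datum -> index of the task that last wrote it
        self.tasks = []            # list of (name, sorted dependency indices)

    def submit(self, name, reads=(), writes=()):
        tid = len(self.tasks)
        deps = {self.last_writer[d] for d in (*reads, *writes)
                if d in self.last_writer}
        self.tasks.append((name, sorted(deps)))
        for d in writes:
            self.last_writer[d] = tid
        return tid

rt = ToyRuntime()
# Submission loop mirrors the slide: factorize each supernode,
# then submit its updates onto the facing supernodes.
rt.submit("factorize(S1)", writes=["S1"])                 # task 0
rt.submit("update(S1,S3)", reads=["S1"], writes=["S3"])   # task 1, depends on 0
rt.submit("factorize(S2)", writes=["S2"])                 # task 2
rt.submit("update(S2,S3)", reads=["S2"], writes=["S3"])   # task 3, depends on 1, 2
rt.submit("factorize(S3)", writes=["S3"])                 # task 4, depends on 3
```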
-
PaRSEC’s representation (PTG)
    panel(j) [high_priority = on]
        /* execution space */
        c = 0 .. cblknbr-1
        /* Extra parameters */
        firstblock = diagonal_block_of(c)
        lastblock  = last_block_of(c)
        lastbrow   = last_brow_of(c)
        /* Locality */
        : A(c)
        RW A ← leaf ? A(c) : C update(lastbrow)
             → A update(firstblock+1..lastblock)
             → A(c)

    update(j)
        /* execution space */
        b = 0 .. bloknbr-1
        /* Extra parameters */
        c  = get_cblk_of(b)
        fc = get_facing_cblk_of(b)
        ...
        /* Locality */
        : A(fc)
        READ A ← A panel(c)
        RW   C ← previous ? C update(prev) : A(fc)
               → next ? C update(next) : A panel(fc)
-
Experiments

Matrix    | Prec | Method | Size   | nnzA  | nnzL    | TFlop/s
Afshell10 | D    | LU     | 1.5e+6 | 27e+6 | 610e+6  | 0.12
FilterV2  | Z    | LU     | 0.6e+6 | 12e+6 | 536e+6  | 3.6
Flan      | D    | LL^T   | 1.6e+6 | 59e+6 | 1712e+6 | 5.3
Audi      | D    | LL^T   | 0.9e+6 | 39e+6 | 1325e+6 | 6.5
MHD       | D    | LU     | 0.5e+6 | 24e+6 | 1133e+6 | 6.6
Geo1438   | D    | LL^T   | 1.4e+6 | 32e+6 | 2768e+6 | 23
Pmldf     | Z    | LDL^T  | 1.0e+6 | 8e+6  | 1105e+6 | 28
Hook      | D    | LU     | 1.5e+6 | 31e+6 | 4168e+6 | 35
Serena    | D    | LDL^T  | 1.4e+6 | 32e+6 | 3365e+6 | 47

Table : Matrix description (Z: double complex, D: double).

Machine:
- Two hexa-core Westmere Xeon X5650 processors (2.67 GHz)
- 32 GB of memory
-
CPU Scaling study
Figure : CPU scaling study. Performance (GFlop/s) of PaStiX, StarPU and PaRSEC with 1, 3, 6, 9 and 12 cores on the nine test matrices (afshell10, FilterV2, Flan, audi, MHD, Geo1438, pmlDF, HOOK, Serena).
-
4. Towards GPU accelerated solver over runtimes
-
What can be offloaded to the GPU?
- Panel factorization:
  - Call a MAGMA kernel?
  - Diagonal block size < 120
  - Panel is done on CPU
  - → No GPU kernel for factorization
- Panel update:
  - GEMM variant
  - Highly efficient GEMM source code available
  - → an existing GEMM can easily be adapted to our problem
  - Extension of the ASTRA kernel (J. Kurzak, ICL)
-
Sparse GEMM on GPU
Example: tiled A × X, where block i of panel j spans rows fr_{j,i} to lr_{j,i}:

    blocknbr  = 3;
    blocktab  = [ fr1,1, lr1,1,
                  fr1,2, lr1,2,
                  fr1,3, lr1,3 ];

    fblocknbr = 2;
    fblocktab = [ fr2,1, lr2,1,
                  fr2,2, lr2,2 ];

    sparse_gemm_cuda( char TRANSA, char TRANSB, int m, int n, int k,
                      cuDoubleComplex alpha,
                      const cuDoubleComplex *d_A, int lda,
                      const cuDoubleComplex *d_B, int ldb,
                      cuDoubleComplex beta,
                      cuDoubleComplex *d_C, int ldc,
                      int blocknbr, const int *blocktab,
                      int fblocknbr, const int *fblocktab,
                      CUstream stream );
Figure : Panel update on GPU
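The index bookkeeping that blocktab/fblocktab encode can be sketched on the host; this helper and its row numbering are illustrative, not the actual kernel's indexing:

```python
def facing_offsets(blocktab, fblocktab):
    """For each row of the update panel (described by blocktab), return its
    row offset inside the facing panel (described by fblocktab).
    Both tabs are flat [first_row, last_row, ...] lists, rows inclusive."""
    # Global row index -> storage offset within the facing panel.
    row_to_off, off = {}, 0
    for i in range(0, len(fblocktab), 2):
        first, last = fblocktab[i], fblocktab[i + 1]
        for r in range(first, last + 1):
            row_to_off[r] = off
            off += 1
    # Walk the update's blocks in order; every row must fall inside some
    # facing block (the facing panel's blocks cover the update's rows).
    offsets = []
    for i in range(0, len(blocktab), 2):
        first, last = blocktab[i], blocktab[i + 1]
        offsets.extend(row_to_off[r] for r in range(first, last + 1))
    return offsets

# Update panel has blocks covering rows 2-3 and 7-8; the facing panel
# stores rows 0-4 and 6-9 contiguously.
print(facing_offsets([2, 3, 7, 8], [0, 4, 6, 9]))  # -> [2, 3, 6, 7]
```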
-
GPU kernel experiment
Parameters:
- NcolA = 128
- NcolB = NrowA11 = 128
- NrowA varies from 256 to 20000
- Random number and size of blocks in A
- Random blocks in B matching A
- Mean time of 10 runs for a fixed NrowA with different block distributions
Figure : GPU kernel experiment
-
Multi-stream performance comparison (Tesla M2070)
Figure : GFlop/s versus matrix size M (N = K = 128) for the cuBLAS peak and for cuBLAS, ASTRA and the sparse kernel, each with 1, 2 and 3 streams.
-
Data mapping over GPU (PaRSEC, 3 Tesla M2070)
Figure : Performance (GFLOPS) on the nine test matrices for four data-mapping criteria: panel size, FLOPs, priority, and number of updates.
-
GPU scaling study (No GPU)
Figure : Performance (GFlop/s) of PaStiX, StarPU and PaRSEC without GPU on the nine test matrices.
-
GPU scaling study (1 GPU)
Figure : Performance (GFlop/s) with 1 GPU: PaStiX baseline, StarPU, and PaRSEC with 1 and 3 streams, on the nine test matrices.
-
GPU scaling study (2 GPUs)
Figure : Performance (GFlop/s) with 2 GPUs: PaStiX baseline, StarPU, and PaRSEC with 1 and 3 streams, on the nine test matrices.
-
GPU scaling study (3 GPUs)
Figure : Performance (GFlop/s) with 3 GPUs: PaStiX baseline, StarPU, and PaRSEC with 1 and 3 streams, on the nine test matrices.
-
GPU scaling study (Multi-streams)
Figure : Multi-stream comparison: performance (GFlop/s) of PaRSEC with 1 and 3 streams per GPU (1 to 3 GPUs) on the nine test matrices.
-
5. What is next?
-
Improvements on granularity
1. Try to improve BLAS efficiency with larger blocking
   - Study the impact of the Scotch minimal subblock parameter (cmin)
2. Create more parallelism while avoiding low-flops tasks
   - Improve the supernode splitting algorithm
-
Idea of smarter splitting algorithm
(a) Regular splitting (current): constant split size
(b) Adapted splitting (new): a smarter, structure-adapted split
-
Preliminary results
Figure : Audi, LL^T, D. Factorization time (s) versus number of threads (1, 2, 4, 6, 8) for regular and adapted splitting, with cmin = 0 and cmin = 20.
Splitting alg.       | Regular           | Adapted
Scotch cmin          | 0       | 20      | 0       | 20
Analyze time         | 1.95 s  | 0.35 s  | 2.56 s  | 0.42 s
Number of panels     | 118814  | 10082   | 118220  | 9491
Number of blocks     | 2283029 | 338493  | 2213497 | 280722
Created by splitting | 65147   | 48284   | 18072   | 13081
Avg. panel size      | 7.94262 | 93.602  | 7.98253 | 99.4305
Avg. block height    | 10.1546 | 29.2206 | 9.08452 | 24.5355
Memory usage         | 10.1 GB | 10.7 GB | 10.5 GB | 11.1 GB
Adapted splitting algorithm:
- 6-15% improvement in factorization time
- 16-20% increase in analyze time

Scotch cmin:
- 80% faster analyze time
- No impact on factorization time
- 6% increase in memory consumption
-
6. Conclusion
-
Conclusion
- Productive and easy way to add accelerators to a sparse direct solver
- The performance loss with both schedulers is low on multicore architectures
- Accelerator integration is not always as easy as claimed by runtime programmers
- Sparse factorization raised many challenges for generic runtimes:
  - Small granularity → runtime overhead is too high
  - Irregular task granularity →
    - Difficulty generating cost models
    - Inefficient data movements (mostly designed for regular dense blocks)
-
Future work
- Experiments on larger architectures (Minotaure) / reduce task overhead
- Distributed implementation (MPI)
- Intel Xeon Phi:
  - Full Xeon Phi native implementation
  - Offload through runtimes
- H-matrices:
  - Check the potential compression ratio on top-level blocks
  - Develop a prototype with:
    - low-rank compression on the larger supernodes
    - a compression tree built at each update
  - Study the coupling between nested dissection and compression tree ordering
-
Thanks !
PaStiX Team, Inria HiePACS team
ANR SOLHAR, Bordeaux