Sparse Linear Algebra over DAG Runtimes
solhar.gforge.inria.fr/lib/exe/fetch.php?media=solhar.pdf
TRANSCRIPT
-
November 15, 2013
Sparse Linear Algebra over DAG Runtimes
ANR SOLHAR - Kick-off, Bordeaux
M. Faverge, X. Lacoste, P. Ramet
HiePACS team, Inria Bordeaux Sud-Ouest
-
Introduction
- Modern architectures are enhanced with accelerators
- Dense linear algebra solvers on GPU ⇒ lots of solutions
  - Mono-GPU, static scheduler: MAGMA, CULA, ...
  - Tile algorithms over runtimes: PaRSEC (DPLASMA), StarPU (MAGMA-MORSE), XKaapi, SMPSs, ...
- What about sparse linear solvers?
  - Many solutions for distributed memory and/or shared memory: PaStiX, MUMPS, SuperLU, Pardiso, ...
  - Few commercial solutions with GPUs: MatrixPro, Acceleware, BCSLib-GPU
  - Sparse QR factorization: QR-Mumps (GPU in progress)
PaStiX Team - ANR SOLHAR November 15, 2013- 2
-
Outline
PaStiX solver
DAG schedulers
Towards GPU accelerated solver over runtimes
What is next?
Conclusion
-
2. PaStiX solver
-
Major steps for solving sparse linear systems
1. Analysis: the matrix is preprocessed to improve its structural properties (A′x′ = b′ with A′ = P_n P D_r A D_c Q P^T)
2. Factorization: the matrix is factorized as A = LU, LL^T or LDL^T
3. Solve: the solution x is computed by means of forward and backward substitutions
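The solve step can be illustrated with a minimal dense sketch (PaStiX works on sparse supernodal structures; this only shows the two substitutions, on a tiny hand-factorized example):

```python
# Sketch of the solve step: once A = LU is available, Ax = b is solved by
# a forward substitution (Ly = b) then a backward substitution (Ux = y).
def forward_sub(L, b):
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    return y

def backward_sub(U, y):
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# A = [[2, 1], [4, 5]] factorized as L (unit diagonal) times U:
L = [[1.0, 0.0], [2.0, 1.0]]
U = [[2.0, 1.0], [0.0, 3.0]]
x = backward_sub(U, forward_sub(L, [3.0, 9.0]))  # solves A x = [3, 9] -> x = [1, 1]
```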
-
PaStiX solver
- Supernodal method
- Cholesky, LU and LDL^T factorizations
- Exploits symmetric patterns, even for general matrices
- Hybrid MPI/Pthread implementation
- Static and dynamic schedulers
- Adapted to NUMA architectures
- Task-based algorithms
-
Supernodal factorization tasks
-
Static scheduling within PaStiX
    forall the Supernode S1 attributed to t do
        wait(S1);
        factorize(S1);
        forall the off-diagonal blocks Bi of S1 do
            S2 ← supernode_in_front_of(Bi);
            lock(S2);
            update(S1, S2);
            unlock(S2);
            if all updates applied on S2 then
                release(S2)
            end
        end
    end
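This static scheme can be mimicked with plain Python threads; a toy sketch under illustrative assumptions (a three-supernode elimination-tree fragment and a fixed two-thread work attribution, neither of which reflects PaStiX's actual data structures):

```python
import threading

# Toy elimination-tree fragment: each supernode lists the supernodes
# its off-diagonal blocks face (i.e. the targets of its updates).
updates_to = {"S1": ["S3"], "S2": ["S3"], "S3": []}
pending = {"S1": 0, "S2": 0, "S3": 2}        # updates still expected
locks = {s: threading.Lock() for s in updates_to}
ready = {s: threading.Event() for s in updates_to}
log = []

for s, n in pending.items():
    if n == 0:
        ready[s].set()                        # leaves can start immediately

def worker(supernodes):
    for s1 in supernodes:                     # static list attributed to this thread
        ready[s1].wait()                      # wait(S1): all updates received
        log.append(("factorize", s1))
        for s2 in updates_to[s1]:             # one update per facing supernode
            with locks[s2]:                   # lock(S2) ... unlock(S2)
                log.append(("update", s1, s2))
                pending[s2] -= 1
                if pending[s2] == 0:
                    ready[s2].set()           # release(S2)

t1 = threading.Thread(target=worker, args=(["S1"],))
t2 = threading.Thread(target=worker, args=(["S2", "S3"],))
t1.start(); t2.start(); t1.join(); t2.join()
# S3 is factorized last, only after both updates have been applied.
```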
-
3. DAG schedulers
-
DAG schedulers considered

StarPU
- RunTime team, Inria Bordeaux Sud-Ouest
- C. Augonnet, R. Namyst, S. Thibault
- Dynamic Task Discovery
- Computes cost models on the fly
- Multiple kernels on the accelerators
- Heterogeneous Earliest-Finish-Time (HEFT) strategy

PaRSEC (formerly DAGuE)
- ICL, University of Tennessee, Knoxville
- G. Bosilca, A. Bouteiller, A. Danalis, T. Herault
- Parameterized Task Graph
- Only the most compute-intensive kernel on the accelerators
- Simple scheduling strategy based on computing capabilities
- GPU multi-stream enabled
-
StarPU loop to submit tasks (DTD)
    forall the Supernode S1 do
        submit_factorize(S1);
        forall the off-diagonal blocks Bi of S1 do
            S2 ← supernode_in_front_of(Bi);
            submit_update(S1, S2);
        end
    end
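The point of Dynamic Task Discovery is that the runtime infers the DAG from the sequential submission order and the declared access mode of each datum. A toy sketch of that inference (the `ToyRuntime` class and its API are illustrative, not StarPU's; it tracks only each datum's last writer, so it covers read-after-write and write-after-write, ignoring write-after-read):

```python
# Toy DTD runtime: dependencies are inferred from the sequential
# submission order and the declared access mode of each piece of data.
class ToyRuntime:
    def __init__(self):
        self.last_writer = {}      # datum -> index of the task that last wrote it
        self.tasks = []            # list of (name, sorted dependency indices)

    def submit(self, name, reads=(), writes=()):
        tid = len(self.tasks)
        deps = {self.last_writer[d] for d in (*reads, *writes)
                if d in self.last_writer}
        self.tasks.append((name, sorted(deps)))
        for d in writes:
            self.last_writer[d] = tid
        return tid

rt = ToyRuntime()
# Submission loop mirrors the slide: factorize each supernode,
# then submit its updates onto the facing supernodes.
rt.submit("factorize(S1)", writes=["S1"])                 # task 0
rt.submit("update(S1,S3)", reads=["S1"], writes=["S3"])   # task 1, depends on 0
rt.submit("factorize(S2)", writes=["S2"])                 # task 2
rt.submit("update(S2,S3)", reads=["S2"], writes=["S3"])   # task 3, depends on 1, 2
rt.submit("factorize(S3)", writes=["S3"])                 # task 4, depends on 3
```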
-
PaRSEC’s representation (PTG)
    panel(j) [high_priority = on]
        /* execution space */
        c = 0 .. cblknbr-1
        /* Extra parameters */
        firstblock = diagonal_block_of(c)
        lastblock  = last_block_of(c)
        lastbrow   = last_brow_of(c)
        /* Locality */
        : A(c)
        RW A ← leaf ? A(c) : C update(lastbrow)
             → A update(firstblock+1..lastblock)
             → A(c)

    update(j)
        /* execution space */
        b = 0 .. bloknbr-1
        /* Extra parameters */
        c  = get_cblk_of(b)
        fc = get_facing_cblk_of(b)
        ...
        /* Locality */
        : A(fc)
        READ A ← A panel(c)
        RW   C ← previous ? C update(prev) : A(fc)
               → next ? C update(next) : A panel(fc)
-
Experiments

Matrix    | Prec | Method | Size   | nnzA  | nnzL    | TFlop/s
Afshell10 | D    | LU     | 1.5e+6 | 27e+6 | 610e+6  | 0.12
FilterV2  | Z    | LU     | 0.6e+6 | 12e+6 | 536e+6  | 3.6
Flan      | D    | LL^T   | 1.6e+6 | 59e+6 | 1712e+6 | 5.3
Audi      | D    | LL^T   | 0.9e+6 | 39e+6 | 1325e+6 | 6.5
MHD       | D    | LU     | 0.5e+6 | 24e+6 | 1133e+6 | 6.6
Geo1438   | D    | LL^T   | 1.4e+6 | 32e+6 | 2768e+6 | 23
Pmldf     | Z    | LDL^T  | 1.0e+6 | 8e+6  | 1105e+6 | 28
Hook      | D    | LU     | 1.5e+6 | 31e+6 | 4168e+6 | 35
Serena    | D    | LDL^T  | 1.4e+6 | 32e+6 | 3365e+6 | 47

Table : Matrix description (Z: double complex, D: double).

Machine:
- Two hexa-core Westmere Xeon X5650 processors (2.67 GHz)
- 32 GB of memory
-
CPU Scaling study
Figure : CPU scaling study. Performance (GFlop/s) of PaStiX, StarPU and PaRSEC with 1, 3, 6, 9 and 12 cores on the nine test matrices (afshell10, FilterV2, Flan, audi, MHD, Geo1438, pmlDF, HOOK, Serena).
-
4. Towards GPU accelerated solver over runtimes
-
What can be offloaded to the GPU?
- Panel factorization:
  - Call a MAGMA kernel?
  - Diagonal block size < 120
  - Panel is done on CPU
  - → No GPU kernel for factorization
- Panel update:
  - GEMM variant
  - Highly efficient GEMM source code available
  - → an existing GEMM can easily be adapted to our problem
  - Extension of the ASTRA kernel (J. Kurzak, ICL)
-
Sparse GEMM on GPU
Example: tiled A × X, where block i of panel j spans rows fr_{j,i} to lr_{j,i}:

    blocknbr  = 3;
    blocktab  = [ fr1,1, lr1,1,
                  fr1,2, lr1,2,
                  fr1,3, lr1,3 ];

    fblocknbr = 2;
    fblocktab = [ fr2,1, lr2,1,
                  fr2,2, lr2,2 ];

    sparse_gemm_cuda( char TRANSA, char TRANSB, int m, int n, int k,
                      cuDoubleComplex alpha,
                      const cuDoubleComplex *d_A, int lda,
                      const cuDoubleComplex *d_B, int ldb,
                      cuDoubleComplex beta,
                      cuDoubleComplex *d_C, int ldc,
                      int blocknbr, const int *blocktab,
                      int fblocknbr, const int *fblocktab,
                      CUstream stream );
Figure : Panel update on GPU
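The index bookkeeping that blocktab/fblocktab encode can be sketched on the host; this helper and its row numbering are illustrative, not the actual kernel's indexing:

```python
def facing_offsets(blocktab, fblocktab):
    """For each row of the update panel (described by blocktab), return its
    row offset inside the facing panel (described by fblocktab).
    Both tabs are flat [first_row, last_row, ...] lists, rows inclusive."""
    # Global row index -> storage offset within the facing panel.
    row_to_off, off = {}, 0
    for i in range(0, len(fblocktab), 2):
        first, last = fblocktab[i], fblocktab[i + 1]
        for r in range(first, last + 1):
            row_to_off[r] = off
            off += 1
    # Walk the update's blocks in order; every row must fall inside some
    # facing block (the facing panel's blocks cover the update's rows).
    offsets = []
    for i in range(0, len(blocktab), 2):
        first, last = blocktab[i], blocktab[i + 1]
        offsets.extend(row_to_off[r] for r in range(first, last + 1))
    return offsets

# Update panel has blocks covering rows 2-3 and 7-8; the facing panel
# stores rows 0-4 and 6-9 contiguously.
print(facing_offsets([2, 3, 7, 8], [0, 4, 6, 9]))  # -> [2, 3, 6, 7]
```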
-
GPU kernel experiment
Parameters:
- NcolA = 128
- NcolB = NrowA11 = 128
- NrowA varies from 256 to 20000
- Random number and size of blocks in A
- Random blocks in B matching A
- Mean time of 10 runs for a fixed NrowA with different block distributions
Figure : GPU kernel experiment
-
Multi-stream performance comparison (Tesla M2070)
Figure : GFlop/s versus matrix size M (N = K = 128) for the cuBLAS peak and for cuBLAS, ASTRA and the sparse kernel, each with 1, 2 and 3 streams.
-
Data mapping over GPU (PaRSEC, 3 Tesla M2070)
Figure : Performance (GFLOPS) on the nine test matrices for four data-mapping criteria: panel size, FLOPs, priority, and number of updates.
-
GPU scaling study (No GPU)
Figure : Performance (GFlop/s) of PaStiX, StarPU and PaRSEC without GPU on the nine test matrices.
-
GPU scaling study (1 GPU)
Figure : Performance (GFlop/s) with 1 GPU: PaStiX baseline, StarPU, and PaRSEC with 1 and 3 streams, on the nine test matrices.
-
GPU scaling study (2 GPUs)
Figure : Performance (GFlop/s) with 2 GPUs: PaStiX baseline, StarPU, and PaRSEC with 1 and 3 streams, on the nine test matrices.
-
GPU scaling study (3 GPUs)
Figure : Performance (GFlop/s) with 3 GPUs: PaStiX baseline, StarPU, and PaRSEC with 1 and 3 streams, on the nine test matrices.
-
GPU scaling study (Multi-streams)
Figure : Multi-stream comparison: performance (GFlop/s) of PaRSEC with 1 and 3 streams per GPU (1 to 3 GPUs) on the nine test matrices.
-
5. What is next?
-
Improvements on granularity
1. Try to improve BLAS efficiency with larger blocking
   - Study the impact of the Scotch minimal subblock parameter (cmin)
2. Create more parallelism while avoiding low-flops tasks
   - Improve the supernode splitting algorithm
-
Idea of smarter splitting algorithm
(a) Regular splitting (current): constant split size
(b) Adapted splitting (new): a smarter, structure-adapted split
-
Preliminary results
Figure : Audi, LL^T, D. Factorization time (s) versus number of threads (1, 2, 4, 6, 8) for regular and adapted splitting, with cmin = 0 and cmin = 20.
Splitting alg.       | Regular           | Adapted
Scotch cmin          | 0       | 20      | 0       | 20
Analyze time         | 1.95 s  | 0.35 s  | 2.56 s  | 0.42 s
Number of panels     | 118814  | 10082   | 118220  | 9491
Number of blocks     | 2283029 | 338493  | 2213497 | 280722
Created by splitting | 65147   | 48284   | 18072   | 13081
Avg. panel size      | 7.94262 | 93.602  | 7.98253 | 99.4305
Avg. block height    | 10.1546 | 29.2206 | 9.08452 | 24.5355
Memory usage         | 10.1 GB | 10.7 GB | 10.5 GB | 11.1 GB
Adapted splitting algorithm:
- 6-15% improvement in factorization time
- 16-20% increase in analyze time

Scotch cmin:
- 80% faster analyze time
- No impact on factorization time
- 6% increase in memory consumption
-
6. Conclusion
-
Conclusion
- Productive and easy way to add accelerators to a sparse direct solver
- The performance loss with both schedulers is low on multicore architectures
- Accelerator integration is not always as easy as claimed by runtime programmers
- Sparse factorization raised many challenges for generic runtimes:
  - Small granularity → runtime overhead is too high
  - Irregular task granularity →
    - Difficulty generating cost models
    - Inefficient data movements (mostly designed for regular dense blocks)
-
Future work
- Experiments on larger architectures (Minotaure) / reduce task overhead
- Distributed implementation (MPI)
- Intel Xeon Phi:
  - Full Xeon Phi native implementation
  - Offload through runtimes
- H-matrices:
  - Check the potential compression ratio on top-level blocks
  - Develop a prototype with:
    - low-rank compression on the larger supernodes
    - a compression tree built at each update
  - Study the coupling between nested dissection and compression tree ordering
-
Thanks !
PaStiX Team, Inria HiePACS team
ANR SOLHAR, Bordeaux