
Hierarchical QR factorization algorithms for multi-core cluster systems

Jack Dongarra, Mathieu Faverge, Thomas Herault, Julien Langou, Yves Robert
University of Tennessee Knoxville, USA; University of Colorado Denver, USA; École Normale Supérieure de Lyon, France

June 29, 2012

• reducing communication
  ◦ in sequential
  ◦ in parallel distributed

• increasing parallelism (or reducing the critical path, reducing synchronization)

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 2 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 3 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Reduce Algorithms: Introduction

The QR factorization of a long and skinny matrix, with its data partitioned vertically across several processors, arises in a wide range of applications.

[Figure: A (row blocks A1, A2, A3) = Q (row blocks Q1, Q2, Q3) · R]

Input: A is block distributed by rows.
Output: Q is block distributed by rows; R is global.
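To make the statement concrete, here is a small standalone check (an illustration written for this transcript, not the distributed code): factor two row blocks independently, factor the two stacked R factors, and compare with the R of a direct QR of the whole matrix. It uses LAPACKE; the sizes are arbitrary, and the comparison is up to the sign of each row of R.

    #include <lapacke.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        enum { M = 200, N = 5, MB = M / 2 };       /* tall and skinny: M >> N, two row blocks */
        double A[M * N], B[M * N];                 /* column-major, leading dimension M       */
        double tau0[N], tau1[N], tauB[N], tauS[N];
        double S[2 * N * N];                       /* the two stacked R factors, 2N-by-N      */

        srand(0);
        for (int i = 0; i < M * N; i++) A[i] = B[i] = rand() / (double)RAND_MAX;

        /* Reference: direct QR of the whole tall-and-skinny matrix. */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, M, N, B, M, tauB);

        /* Step 1: independent QR of each row block (the per-processor work). */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, MB, N, A,      M, tau0);   /* rows 0 .. MB-1 */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, MB, N, A + MB, M, tau1);   /* rows MB .. M-1 */

        /* Step 2: stack the two N-by-N R factors and factor the 2N-by-N result. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < 2 * N; i++) {
                double v = 0.0;
                if (i <  N && i     <= j) v = A[j * M + i];             /* R from block 0 */
                if (i >= N && i - N <= j) v = A[j * M + MB + (i - N)];  /* R from block 1 */
                S[j * 2 * N + i] = v;
            }
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 2 * N, N, S, 2 * N, tauS);

        /* The upper triangle of S is the global R, equal to the direct R up to row signs. */
        double maxdiff = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i <= j; i++) {
                double d = fabs(fabs(S[j * 2 * N + i]) - fabs(B[j * M + i]));
                if (d > maxdiff) maxdiff = d;
            }
        printf("max | |R_tsqr| - |R_direct| | = %g  (should be at round-off level)\n", maxdiff);
        return 0;
    }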

Julien Langou | University of Colorado Denver Hierarchical QR | 4 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Examples of applications: in block iterative methods.

a) in iterative methods with multiple right-hand sides (block iterative methods):
   1) Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
   2) Block GMRES, Block GCR, Block CG, Block QMR, …

b) in iterative methods with a single right-hand side:
   1) s-step methods for linear systems of equations (e.g. A. Chronopoulos),
   2) LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc,
   3) recent work from M. Hoemmen and J. Demmel (U. California at Berkeley).

c) in iterative eigenvalue solvers:
   1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),
   2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,
   3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
   4) PRIMME (A. Stathopoulos, Coll. William & Mary),
   5) and also TRLAN, BLZPACK, IRBLEIGS.

Julien Langou | University of Colorado Denver Hierarchical QR | 5 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Reduce Algorithms: Introduction. Examples of applications:

a) in linear least squares problems in which the number of equations is much larger than the number of unknowns,

b) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers),

c) in dense, large, and more square QR factorizations, where they are used as the panel factorization step.

Julien Langou | University of Colorado Denver Hierarchical QR | 6 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Blocked LU and QR algorithms (LAPACK)

[Figure: one step of each blocked factorization, a panel factorization followed by the update of the remaining submatrix.
 LU: lu( ) uses dgetf2 for the panel, then dtrsm (+dswp) and dgemm for the update, producing L\U in place.
 QR: qr( ) uses dgeqr2 + dlarft for the panel, then dlarfb for the update, producing V\R in place.]

LAPACK block LU (right-looking): dgetrf.  LAPACK block QR (right-looking): dgeqrf.

Julien Langou | University of Colorado Denver Hierarchical QR | 7 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Blocked LU and QR algorithms (LAPACK)

[Figure: LAPACK block LU (right-looking, dgetrf): panel factorization (dgetf2, dtrsm + dswp) and update of the remaining submatrix (dgemm)]

Latency bounded: more than nb AllReduce operations for n·nb² ops (the panel factorization).

CPU/bandwidth bounded: the bulk of the computation, n·n·nb ops, highly parallelizable, efficient and scalable (the update).

Julien Langou | University of Colorado Denver Hierarchical QR | 8 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Parallelization of LU and QR

Parallelize the update:
• Easy and done in any reasonable software.
• This is the 2/3·n³ term in the FLOPs count.
• Can be done efficiently with LAPACK + multithreaded BLAS.

[Figure: the dgemm update is the parallelized part of the right-looking factorization]

Julien Langou | University of Colorado Denver Hierarchical QR | 9 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Parallelization of LU and QR

Parallelize the update:
• Easy and done in any reasonable software.
• This is the 2/3·n³ term in the FLOPs count.
• Can be done efficiently with LAPACK + multithreaded BLAS.

Parallelize the panel factorization:
• Not an option in a multicore context (p < 16).
• See e.g. ScaLAPACK or HPL, but still by far the slowest part and the bottleneck of the computation.

Hide the panel factorization:
• Lookahead (see e.g. High Performance LINPACK)
• Dynamic scheduling

Julien Langou | University of Colorado Denver Hierarchical QR | 10 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Hiding the panel factorization with dynamic scheduling.

[Figure: execution trace over time. Courtesy of Alfredo Buttari, U Tennessee]

Julien Langou | University of Colorado Denver Hierarchical QR | 11 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

What about strong scalability?

Julien Langou | University of Colorado Denver Hierarchical QR | 12 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

What about strong scalability? N = 1536, NB = 64, procs = 16

[Figure: execution trace. Courtesy of Jakub Kurzak, U Tennessee]

Julien Langou | University of Colorado Denver Hierarchical QR | 13 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

What about strong scalability? N = 1536, NB = 64, procs = 16

We cannot hide the panel factorization in the MM (matrix multiply); actually, it is the MMs that are hidden by the panel factorizations!

[Figure: execution trace. Courtesy of Jakub Kurzak, U Tennessee]

Julien Langou | University of Colorado Denver Hierarchical QR | 14 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

What about strong scalability? N = 1536, NB = 64, procs = 16

We cannot hide the panel factorization (n²) with the MM (n³); actually, it is the MMs that are hidden by the panel factorizations!

NEED FOR NEW MATHEMATICAL ALGORITHMS

[Figure: execution trace. Courtesy of Jakub Kurzak, U Tennessee]

Julien Langou | University of Colorado Denver Hierarchical QR | 15 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

A new generation of algorithms? Algorithms follow hardware evolution along time.

LINPACK (80's) (vector operations): relies on Level-1 BLAS operations.

LAPACK (90's) (blocking, cache friendly): relies on Level-3 BLAS operations.

Julien Langou | University of Colorado Denver Hierarchical QR | 16 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

A new generation of algorithms? Algorithms follow hardware evolution along time.

LINPACK (80's) (vector operations): relies on Level-1 BLAS operations.

LAPACK (90's) (blocking, cache friendly): relies on Level-3 BLAS operations.

New algorithms (00's) (multicore friendly): rely on a DAG/scheduler, a block data layout, and some extra kernels.

These new algorithms
- have a very low granularity and scale very well (multicore, petascale computing, …),
- remove a lot of dependencies among the tasks (multicore, distributed computing),
- avoid latency (distributed computing, out-of-core),
- rely on fast kernels.
These new algorithms need new kernels and rely on efficient scheduling algorithms.

Julien Langou | University of Colorado Denver Hierarchical QR | 17 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

2005-2007: new algorithms based on 2D partitioning:

– U Texas (van de Geijn): SYRK, CHOL (multicore), LU, QR (out-of-core)
– U Tennessee (Dongarra): CHOL (multicore)
– HPC2N (Kågström) / IBM (Gustavson): CHOL (distributed)
– UC Berkeley (Demmel) / INRIA (Grigori): LU/QR (distributed)
– UC Denver (Langou): LU/QR (distributed)

A 3rd revolution for dense linear algebra?

Julien Langou | University of Colorado Denver Hierarchical QR | 18 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

• reducing communication
  ◦ in sequential
  ◦ in parallel distributed

• increasing parallelism (or reducing the critical path, reducing synchronization)

We start with reducing communication in parallel distributed, in the tall and skinny case.

Julien Langou | University of Colorado Denver Hierarchical QR | 19 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

On two processes

Step 1: each process computes the QR factorization of its own block: QR(A0) gives (V0(0), R0(0)) on process 0, and QR(A1) gives (V1(0), R1(0)) on process 1.
Step 2: the two R factors R0(0) and R1(0) are brought together (one communication).
Step 3: the stacked R factors are factored in turn: QR([R0(0); R1(0)]) gives (V0(1), V1(1)) and the factor R0(1), which is the final R.

[Figure: the three steps laid out as processes vs. time]

Julien Langou | University of Colorado Denver Hierarchical QR | 23 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

The big picture…

[Figure: processes vs. time for A = Q·R; each process holds a block Ai (A0 … A6) and produces the corresponding Qi (Q0 … Q6), while the local R factors are combined pairwise up a reduction tree into the final R]

Julien Langou | University of Colorado Denver Hierarchical QR | 24 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

The big picture…

[Figure: the same reduction, with the communication steps and the computation steps marked]

Julien Langou | University of Colorado Denver Hierarchical QR | 25 of 103


Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Latency, but also the possibility of fast panel factorization.

• DGEQR3 is the recursive algorithm (see Elmroth and Gustavson, 2000); DGEQRF and DGEQR2 are the LAPACK routines.
• Times include QR and DLARFT.
• Run on Pentium III.

QR factorization and construction of T, m = 10,000. Perf in MFLOP/sec (times in sec):

  n   | DGEQR3        | DGEQRF        | DGEQR2
  50  | 173.6 (0.29)  | 65.0 (0.77)   | 64.6 (0.77)
  100 | 240.5 (0.83)  | 62.6 (3.17)   | 65.3 (3.04)
  150 | 277.9 (1.60)  | 81.6 (5.46)   | 64.2 (6.94)
  200 | 312.5 (2.53)  | 111.3 (7.09)  | 65.9 (11.98)

[Figure: MFLOP/sec vs. n for m = 1,000,000]

Julien Langou | University of Colorado Denver Hierarchical QR | 28 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Q and R: strong scalability

[Figure: MFLOPs/sec/proc (0 to 800) vs. number of processors (32, 64, 128, 256) for ReduceHH (QR3), ReduceHH (QRF), ScaLAPACK QRF, and ScaLAPACK QR2]

In this experiment, we fix the problem: m = 1,000,000 and n = 50. Then we increase the number of processors. BlueGene/L, frost.ncar.edu.

Julien Langou | University of Colorado Denver Hierarchical QR | 29 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Communication :: TSQR :: parallel case. When only R is wanted.

[Figure: TSQR binary reduction tree on four processes; QR(Ai) gives Ri(0) for i = 0…3, then QR([R0(0); R1(0)]) gives R0(1), QR([R2(0); R3(0)]) gives R2(1), and QR([R0(1); R2(1)]) gives the final R]

Consider
  architecture: parallel case, P processing units;
  problem: QR factorization of an m-by-n TS matrix (TS: m/P ≥ n);
  (main) assumption: the operation is "truly" parallel distributed.
⇒ answer: binary tree.

Theory:

             | TSQR                    | ScaLAPACK-like      | Lower bound
  # flops    | 2mn²/P + (2n³/3)·log P  | 2mn²/P − 2n³/(3P)   | Θ(mn²/P)
  # words    | (n²/2)·log P            | (n²/2)·log P        | (n²/2)·log P
  # messages | log P                   | 2n·log P            | log P

Julien Langou | University of Colorado Denver Hierarchical QR | 30 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Any tree does.

When only R is wanted: the MPI_Allreduce

In the case where only R is wanted, instead of constructing our own tree, one can simply use MPI_Allreduce with a user-defined operation. The operation we give to MPI is basically Algorithm 2. It performs the operation

    QR( [ R1 ; R2 ] )  →  R

This binary operation is associative, and this is all MPI needs to use a user-defined operation on a user-defined datatype. Moreover, if we change the signs of the elements of R so that the diagonal of R holds positive elements, then the R-factor binary operation becomes commutative.

The code becomes two lines:

    lapack_dgeqrf( mloc, n, A, lda, tau, &dlwork, lwork, &info );
    MPI_Allreduce( MPI_IN_PLACE, A, 1, MPI_UPPER, LILA_MPIOP_QR_UPPER, mpi_comm );
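For concreteness, a minimal sketch of what such a user-defined reduction could look like. This is an illustration only, not the authors' LILA_MPIOP_QR_UPPER / MPI_UPPER implementation: it assumes a hypothetical fixed panel width NB, ships full square tiles instead of a packed upper-triangular datatype, and uses LAPACKE for the local factorization.

    #include <mpi.h>
    #include <lapacke.h>
    #include <string.h>

    #define NB 32   /* hypothetical fixed panel width; the real code carries n with the datatype */

    /* Reduction operation: both buffers hold *len upper triangular NB-by-NB R factors,
     * stored here as full column-major NB-by-NB tiles. For each pair, stack the two R
     * factors into a 2NB-by-NB matrix, factor it, and write the new R factor back into
     * the in/out buffer: exactly the binary operation QR([R1; R2]) -> R above.         */
    static void qr_upper_op(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
    {
        double *Rin = (double *)invec, *Rio = (double *)inoutvec;
        (void)dtype;
        for (int k = 0; k < *len; k++, Rin += NB * NB, Rio += NB * NB) {
            double S[2 * NB * NB], tau[NB];
            for (int j = 0; j < NB; j++) {                      /* stack column by column */
                memcpy(S + j * 2 * NB,      Rio + j * NB, NB * sizeof(double));
                memcpy(S + j * 2 * NB + NB, Rin + j * NB, NB * sizeof(double));
            }
            LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 2 * NB, NB, S, 2 * NB, tau);
            for (int j = 0; j < NB; j++)                        /* copy the new R back    */
                for (int i = 0; i <= j; i++)
                    Rio[j * NB + i] = S[j * 2 * NB + i];
        }
    }

    /* Registered once with:  MPI_Op op;  MPI_Op_create(qr_upper_op, 0, &op);
     * (0 = not commutative; see the remark about fixing the signs of the diagonal of R). */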

Julien Langou | University of Colorado Denver Hierarchical QR | 31 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Flat tree, binary tree, a weird tree, another weird tree

[Figure: four reduction trees, each drawn as parallelism vs. time: a flat tree, a binary tree, a weird tree, and another weird tree]

Julien Langou | University of Colorado Denver Hierarchical QR | 32 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 33 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

Latency (ms)   Orsay   Toulouse   Bordeaux   Sophia
Orsay          0.07    7.97       6.98       6.12
Toulouse               0.03       9.03       8.18
Bordeaux                          0.05       7.18
Sophia                                       0.06

Throughput (Mb/s)   Orsay   Toulouse   Bordeaux   Sophia
Orsay               890     78         90         102
Toulouse                    890        77         90
Bordeaux                               890        83
Sophia                                            890

Julien Langou | University of Colorado Denver Hierarchical QR | 34 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

[Figure: three clusters and their domains; Cluster 1: Domains 1,1 to 1,5; Cluster 2: Domains 2,1 to 2,4; Cluster 3: Domains 3,1 and 3,2]

Illustration of ScaLAPACK PDGEQR2 without reduce affinity

Julien Langou | University of Colorado Denver Hierarchical QR | 35 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

[Figure: the same three clusters and domains]

Illustration of ScaLAPACK PDGEQR2 with reduce affinity

Julien Langou | University of Colorado Denver Hierarchical QR | 36 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

[Figure: the same three clusters and domains]

Illustration of TSQR without reduce affinity

Julien Langou | University of Colorado Denver Hierarchical QR | 37 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

[Figure: the same three clusters and domains]

Illustration of TSQR with reduce affinity

Julien Langou | University of Colorado Denver Hierarchical QR | 38 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

Using 4 clusters of 32 processors.

[Figure, left: Classical Gram-Schmidt; performance in GFlop/sec for pg = 1, 2 or 4 clusters, pc = 32 nodes per cluster, m varies (x-axis), n = 32; experiments and model; 1 cluster: 9.27 GFlop/sec, 2 clusters: 17.19 GFlop/sec, 4 clusters: 28.19 GFlop/sec; markers at m = 8e+06 and m = 16e+06]

[Figure, right: TSQR with a binary tree; same setting; 1 cluster: 24.20 GFlop/sec, 2 clusters: 48.25 GFlop/sec, 4 clusters: 96.20 GFlop/sec; markers at m = 2e+05 and m = 5e+05]

Two effects at once: (1) avoiding communication with TSQR, (2) tuning of the reduction tree.

Julien Langou | University of Colorado Denver Hierarchical QR | 39 of 103

Tall and Skinny matrices | Minimizing communication in sequential (1)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 40 of 103

Tall and Skinny matrices | Minimizing communication in sequential (1)

Consider
  architecture: sequential case, one processing unit with a cache of size W;
  problem: QR factorization of an m-by-n TS matrix (TS: m ≥ n, and W ≥ (3/2)·n²).
⇒ answer: flat tree.

Theory:

             | flat tree | LAPACK-like  | Lower bound
  # flops    | 2mn²      | 2mn²         | Θ(mn²)
  # words    | 2mn       | m²n²/(2W)    | 2mn
  # messages | 3mn/W     | mn²/(2W)     | 2mn/W

Julien Langou | University of Colorado Denver Hierarchical QR | 41 of 103

Rectangular matrices | Minimizing communication in sequential (2)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 42 of 103

Rectangular matrices | Minimizing communication in sequential (2)

[Figure: percent of CPU peak performance vs. n (matrix size, 5000 to 35000), sequential case, LU (or QR) factorization of square matrices, with Mfast = 10⁶, Mslow = 10⁹, β = 10⁸, γ = 10¹⁰; vertical markers at n = Mfast^(1/2) and n = Mslow^(1/2); curves: problem lower bound, upper bound for CAQR-flat-tree, lower bound for the LAPACK algorithm]

Julien Langou | University of Colorado Denver Hierarchical QR | 43 of 103

Rectangular matrices | Minimizing communication in sequential (2)

[Figure: (# of slow memory references) / (problem lower bound) vs. n (size of the matrix), sequential case, LU (or QR) factorization of square matrices, for Mfast = 10³, 10⁶ and 10⁹; curves: lower bound for the LAPACK algorithm, upper bound for CAQR-flat-tree]

Julien Langou | University of Colorado Denver Hierarchical QR | 44 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 45 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Mission

Perform the QR factorization of an initial tiled matrix A on a multicore platform.


Julien Langou | University of Colorado Denver Hierarchical QR | 46 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

for (k = 0; k < TILES; k++) {
    dgeqrt(A[k][k], T[k][k]);
    for (n = k+1; n < TILES; n++)
        dlarfb(A[k][k], T[k][k], A[k][n]);
    for (m = k+1; m < TILES; m++) {
        dtsqrt(A[k][k], A[m][k], T[m][k]);
        for (n = k+1; n < TILES; n++)
            dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]);
    }
}
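As a side note, a minimal sketch of how this loop nest can be handed to a dynamic scheduler using OpenMP task dependences (an illustration only; PLASMA uses its own scheduler, and dgeqrt/dlarfb/dtsqrt/dssrfb here are placeholder prototypes standing in for the tile kernels described on the following slides):

    #include <omp.h>

    /* Prototypes only: placeholders for the tile kernels. */
    void dgeqrt(double *Akk, double *Tkk);
    void dlarfb(double *Akk, double *Tkk, double *Akn);
    void dtsqrt(double *Akk, double *Amk, double *Tmk);
    void dssrfb(double *Amk, double *Tmk, double *Akn, double *Amn);

    /* Tile QR expressed as OpenMP tasks; dependences are declared on the first
     * element of each tile, so the runtime discovers the same DAG as above.   */
    void tile_qr(int TILES, double *A[TILES][TILES], double *T[TILES][TILES])
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < TILES; k++) {
            #pragma omp task depend(inout: A[k][k][0]) depend(out: T[k][k][0])
            dgeqrt(A[k][k], T[k][k]);
            for (int n = k + 1; n < TILES; n++)
                #pragma omp task depend(in: A[k][k][0], T[k][k][0]) depend(inout: A[k][n][0])
                dlarfb(A[k][k], T[k][k], A[k][n]);
            for (int m = k + 1; m < TILES; m++) {
                #pragma omp task depend(inout: A[k][k][0], A[m][k][0]) depend(out: T[m][k][0])
                dtsqrt(A[k][k], A[m][k], T[m][k]);
                for (int n = k + 1; n < TILES; n++)
                    #pragma omp task depend(in: A[m][k][0], T[m][k][0]) \
                                     depend(inout: A[k][n][0], A[m][n][0])
                    dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]);
            }
        }
    }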

Julien Langou | University of Colorado Denver Hierarchical QR | 47 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Unrolling the first steps of this loop nest (here for a 4 × 4 tile matrix):

1. dgeqrt(A[0][0], T[0][0]);
2. dlarfb(A[0][0], T[0][0], A[0][1]);
3. dlarfb(A[0][0], T[0][0], A[0][2]);
4. dlarfb(A[0][0], T[0][0], A[0][3]);
5. dtsqrt(A[0][0], A[1][0], T[1][0]);
6. dssrfb(A[1][0], T[1][0], A[0][1], A[1][1]);
7. dssrfb(A[1][0], T[1][0], A[0][2], A[1][2]);
8. dssrfb(A[1][0], T[1][0], A[0][3], A[1][3]);

[Figures: the tiles read and written by each kernel: dgeqrt on the diagonal tile (V, R, T), dlarfb on the tiles to its right, dtsqrt coupling the diagonal R with a tile below it, dssrfb updating the corresponding pair of tiles in each column to the right]

Julien Langou | University of Colorado Denver Hierarchical QR | 58 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

DGEQRT: The kernel performs the QR factorization of a diagonal tile, producing an upper triangular factor R, the Householder reflectors V, and the matrix T resulting from accumulating the reflectors; R and the block of reflectors override the input. The T matrix is stored separately.

DTSQRT: The kernel performs the QR factorization of a matrix built by coupling the R factor, produced by DGEQRT or a previous call to DTSQRT, with a tile below the diagonal tile. The kernel produces an updated R factor, a square matrix V containing the Householder reflectors, and the matrix T resulting from accumulating the reflectors V. The new R factor overrides the old R factor. The block of reflectors overrides the square tile of the input matrix. The T matrix is stored separately.

DLARFB: The kernel applies the reflectors calculated by DGEQRT to a tile to the right of the diagonal tile, using the reflectors V along with the matrix T.

DSSRFB: The kernel applies the reflectors calculated by DTSQRT to two tiles to the right of the tiles factorized by DTSQRT, using the reflectors V and the matrix T produced by DTSQRT.

A naive implementation, where the full T matrix is built, results in 25% more floating point operations than the standard algorithm. In order to minimize this overhead, the idea of inner blocking is used, where the T matrix has a sparse (block-diagonal) structure (Figure 10) [32, 33, 34].

Figure 10: Inner blocking in the tile QR factorization.

Figure 11 shows the pseudocode of the tile QR factorization. Figure 12 shows the task graph of the tile QR factorization for a matrix of 5×5 tiles. Orders of magnitude larger matrices are used in practice. This example only serves the purpose of showing the complexity of the task graph, which is noticeably higher than that of the Cholesky factorization.

Figure 11: Pseudocode of the tile QR factorization.

[Figure 12: Task graph of the tile QR factorization (matrix of size 5 × 5 tiles)]

DAG for 5x5 matrix (QR factorization)

Julien Langou | University of Colorado Denver Hierarchical QR | 59 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

for (i = 0; i < p; i++) {
    DSCHED_dpotrf(dsched, 'L', nb[i], A[i][i], nb[i], info);
    for (j = i+1; j < p; j++)
        DSCHED_dtrsm(dsched, 'R', 'L', 'T', 'N', nb[j], nb[i], 1.0, A[i][i], nb[i], A[j][i], nb[j]);
    for (j = i+1; j < p; j++) {
        for (k = i+1; k < j; k++)
            DSCHED_dgemm(dsched, 'N', 'T', nb[j], nb[k], nb[i], -1.0, A[j][i], nb[j], A[k][i], nb[k], 1.0, A[j][k], nb[j]);
        DSCHED_dsyrk(dsched, 'L', 'N', nb[j], nb[i], -1.0, A[j][i], nb[j], +1.0, A[j][j], nb[j]);
    }
}

Fig. 9. Tile Cholesky factorization that calls the scheduled core linear algebra operations.


Fig. 10. DAG for a LU factorization with 20 tiles (block size 200 and matrix size 4000). The size of the DAG grows very fast with the number of tiles.

… more tasks until some are completed. The usage of a window of tasks has implications for how the loops of an application are unfolded and how much look-ahead is available to the scheduler. This paper discusses some of these implications in the context of dense linear algebra applications.

d) Data Locality and Cache Reuse: It has been shown in the past that the reuse of memory caches can lead to a substantial performance improvement in execution time. Since we are working with tiles of data that should fit in the local caches on each core, we have provided the algorithm designer with the ability to hint the cache locality behavior. A parameter in a call (e.g., Fig. 7) can be decorated with the LOCALITY flag in order to tell the scheduler that the data item (parameter) should be kept in cache if possible. After a computational core (worker) executes that task, the scheduler will by default assign any future task using that data item to the same core. Note that work stealing can disrupt the default assignment of tasks to cores.

The next section studies the performance impact of the locality flag and the window size on the LL and RL variants of the three tile factorizations.

V. EXPERIMENTAL RESULTS

This section describes the analysis of dynamically scheduled tile algorithms for the three factorizations (i.e., Cholesky, QR and LU) on different multicore systems. The tile sizes for these algorithms have been tuned and are equal to b = 200.

A. Hardware Descriptions

In this study, we consider two different shared-memory architectures. The first architecture (System A) is a quad-socket, quad-core machine based on an Intel Xeon EMT64 E7340 processor operating at 2.39 GHz. The theoretical peak is equal to 9.6 Gflops/s per core or 153.2 Gflops/s for the whole node, composed of 16 cores. The practical peak (measured by the performance of a GEMM) is equal to 8.5 Gflops/s per core or 136 Gflops/s for the 16 cores. The level-1 cache, local to the core, is divided into 32 kB of instruction cache and 32 kB of data cache. Each quad-core processor is actually composed of two dual-core Core2 architectures, and the level-2 cache has 2×4 MB per socket (each dual-core shares 4 MB). The machine is a NUMA architecture and it provides Intel Compilers 11.0 together with the MKL 10.1 vendor library.

The second system (System B) is an 8-socket, 6-core AMD Opteron 8439 SE machine (48 cores total at 2.8 GHz) with 128 GB of main memory. Each core has a theoretical peak of 11.2 Gflops/s and the whole machine 537.6 Gflops/s. The practical peak (measured by the performance of a GEMM) is equal to 9.5 Gflops/s per core or 456 Gflops/s for the 48 cores. There are three levels of cache. The level-1 cache consists of 64 kB and the level-2 cache of 512 kB. Each socket is composed of 6 cores, and the level-3 cache is a 6 MB 48-way associative shared cache per socket. The machine is a NUMA architecture and it provides Intel Compilers 11.1 together with the MKL 10.2 vendor library.

B. Performance Discussions

In this section, we evaluate the effect of the window size and the locality feature on the LL and RL tile algorithm variants.

The nested loops describing the tile LL variant codes are naturally ordered in a way that already promotes locality on the data tiles located on the panel. Fig. 11 shows the effect of the locality flag of the scheduler on the overall performance of the tile LL Cholesky variant. As expected, the locality flag does not really improve the performance when using small window …

DAG for 20x20 matrix (QR factorization)


Julien Langou | University of Colorado Denver Hierarchical QR | 60 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

[Figure 5: Gflop/s vs. matrix size (0 to 4000) for the tile QR factorization in single precision on a 3.2 GHz CELL processor with eight SPEs; square matrices were used; the solid horizontal line marks the performance of the SSSRFB kernel times the number of SPEs (22.16 × 8 = 177 Gflop/s)]

"The presented implementation of tile QR factorization on the CELL processor allows for factorization of a 4000-by-4000 dense matrix in single precision in exactly half of a second. To the author's knowledge, at present, it is the fastest reported time of solving such problem by any semiconductor device implemented on a single semiconductor die."

Jakub Kurzak and Jack Dongarra, LAWN 201 – QR Factorization for the CELLProcessor, May 2008.

Julien Langou | University of Colorado Denver Hierarchical QR | 61 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Perform the QR factorization of an initial tiled matrix A.

Our tool: Givens rotations

    ( cos θ   −sin θ )
    ( sin θ    cos θ )
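For reference, a small standalone sketch of the basic elimination step (my illustration; the actual tile kernels apply blocked Householder transformations, and LAPACK's dlartg computes the rotation with a different sign convention and safer scaling): choose c = cos θ and s = sin θ so that the rotation zeroes the second entry of a pair, then apply it to two full rows.

    #include <math.h>

    /* Choose c, s so that [ c  -s ; s  c ] * [ a ; b ] = [ r ; 0 ]. */
    static void givens(double a, double b, double *c, double *s, double *r)
    {
        if (b == 0.0) { *c = 1.0; *s = 0.0; *r = a; return; }
        double h = hypot(a, b);
        *c =  a / h;
        *s = -b / h;
        *r =  h;
    }

    /* Apply the same 2-by-2 rotation to every column of two rows of length n. */
    static void apply_givens(double *row_i, double *row_k, int n, double c, double s)
    {
        for (int j = 0; j < n; j++) {
            double ti = row_i[j], tk = row_k[j];
            row_i[j] = c * ti - s * tk;
            row_k[j] = s * ti + c * tk;
        }
    }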

Julien Langou | University of Colorado Denver Hierarchical QR | 62 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

(p = 15)              q = 1   q = 2   q = 3   q = 4
  binomial tree           4       8      12      16
  flat tree              14      15      16      17
  plasma tree (bs=4)      5       8      11      14
  greedy tree             4       6       8      10

First column: the goal is to introduce zeros using Givens rotations; only the top entry shall stay. At each step, one entry is used to eliminate another. (This is a reduction.)

The first step requires seven computing units, the second four, the third two, and the fourth one.

Note: one point of the plasma tree is to (1) enhance parallelism in the rectangular case, (2) increase data locality (when working within a domain).

[Figure: the time step at which each entry of the 15 × 4 tile grid is eliminated; time axis 1 … 18]

• Flat tree: Sameh and Kuck (1978).
• Plasma tree: Hadri et al. (2010). Note that the binomial tree corresponds to bs = 1, and the flat tree to bs = p.
• Greedy: Modi and Clarke (1984), Cosnard and Robert (1986).
• Cosnard and Robert (1986) proved that no matter the shape of the matrix (i.e., p and q), Greedy was optimal.
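A quick sanity check of the first two rows of this table, under the unit-time model it assumes (each elimination counts for one time step): the binomial tree costs roughly q·⌈log₂ p⌉ steps, and the flat tree (p − 1) + (q − 1) steps, while the plasma and greedy values come from simulating the schedules. A throwaway check of those two closed forms, assuming that model:

    #include <math.h>
    #include <stdio.h>

    /* Closed forms for the critical path (in elimination steps) of a p-by-q grid,
     * under the unit-time model of the table above.                              */
    static int binomial_cp(int p, int q) { return q * (int)ceil(log2((double)p)); }
    static int flat_cp(int p, int q)     { return (p - 1) + (q - 1); }

    int main(void)
    {
        const int p = 15;
        for (int q = 1; q <= 4; q++)
            printf("q=%d  binomial=%2d  flat=%2d\n", q, binomial_cp(p, q), flat_cp(p, q));
        /* Prints 4/14, 8/15, 12/16, 16/17, matching the first two rows of the table. */
        return 0;
    }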

Julien Langou | University of Colorado Denver Hierarchical QR | 63 of 103


Julien Langou | University of Colorado Denver Hierarchical QR | 63 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Critical path length (p = 15):

                     q = 1   q = 2   q = 3   q = 4
binomial tree          4       8      12      16
flat tree             14      15      16      17
plasma tree (bs=4)     5       8      11      14
greedy tree            4       6       8      10

First column: the goal is to introduce zeros using Givens rotations; only the top entry remains. (This is a reduction.)

The first step requires seven computing units, the second four, the third two, and the fourth one. Note: one point of the plasma tree is to (1) enhance parallelism in the rectangular case, and (2) increase data locality (when working within a domain).

[Figure: elimination diagram for the 15x4 tiled matrix; each zeroed entry is labeled with the time step at which it is eliminated.]

• Flat tree - Sameh and Kuck (1978).

• Plasma tree - Hadri et al. (2010). Note that the binomial tree corresponds to bs = 1, and the flat tree to bs = p.

• Greedy - Modi and Clarke (1984), Cosnard and Robert (1986).

• Cosnard and Robert (1986) proved that, no matter the shape of the matrix (i.e., p and q), Greedy is optimal.

[Figure: elimination timeline, time steps 1 to 18.]

Julien Langou | University of Colorado Denver Hierarchical QR | 63 of 103
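The critical path lengths in the table can be reproduced in the coarse-grain model, where every elimination costs one time unit and a row becomes usable in column k once its entries in columns 1..k-1 are zero. The sketch below is my own illustration, not the authors' code (the function name critical_path and the row bookkeeping are my assumptions); it only models the flat and greedy trees.

```python
# Minimal sketch (not the authors' code) of the coarse-grain elimination model:
# every elimination costs one time unit; a row whose entries in columns 1..k-1
# are already zero is "active" in column k. Only flat and greedy are modeled.

def critical_path(p, q, tree):
    """Number of coarse time steps needed to zero a p-by-q tile matrix."""
    z = [0] * (p + 1)              # z[i] = number of leading columns of row i already zeroed
    def done(i):
        return z[i] >= min(i - 1, q)   # row i has no sub-diagonal entry left
    t = 0
    while not all(done(i) for i in range(1, p + 1)):
        t += 1
        newly_zeroed = []
        for k in range(1, q + 1):
            active = [i for i in range(k, p + 1) if z[i] == k - 1]
            victims = active[1:]   # the topmost active row survives as the pivot
            nelim = 1 if tree == "flat" else len(active) // 2   # greedy: as many as possible
            nelim = min(nelim, len(victims))
            newly_zeroed += victims[len(victims) - nelim:]
        for i in newly_zeroed:     # apply the eliminations after scanning all columns
            z[i] += 1
    return t

for q in (1, 2, 3, 4):
    print(q, critical_path(15, q, "flat"), critical_path(15, q, "greedy"))
# flat gives 14 15 16 17 and greedy gives 4 6 8 10, as in the table above.
```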


Rectangular matrices | Maximizing parallelism on multicore nodes (30)

[Slides 64 to 67: step-by-step animation of the elimination, time steps 1 to 18.]

Julien Langou | University of Colorado Denver Hierarchical QR | 67 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Kernels for orthogonal transformations

Operation                              Panel            Update
                                       Name    Cost     Name    Cost
Factor square into triangle            GEQRT   4        UNMQR   6
Zero square with triangle on top       TSQRT   6        TSMQR   12
Zero triangle with triangle on top     TTQRT   2        TTMQR   6

Algorithm 1: Elimination elim(i, piv(i,k), k) via TS kernels.
  GEQRT(piv(i,k), k)
  for j = k+1 to q do
    UNMQR(piv(i,k), k, j)
  TSQRT(i, piv(i,k), k)
  for j = k+1 to q do
    TSMQR(i, piv(i,k), k, j)

(Triangle on top of square)

Algorithm 2: Elimination elim(i, piv(i,k), k) via TT kernels.
  GEQRT(piv(i,k), k)
  GEQRT(i, k)
  for j = k+1 to q do
    UNMQR(piv(i,k), k, j)
    UNMQR(i, k, j)
  TTQRT(i, piv(i,k), k)
  for j = k+1 to q do
    TTMQR(i, piv(i,k), k, j)

(Triangle on top of triangle)

Note: it is understood that if a tile is already in triangle form, then the associated GEQRT and update kernels do not need to be applied.

Julien Langou | University of Colorado Denver Hierarchical QR | 68 of 103
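For concreteness, here is a small transcription of the two eliminations above into task lists, with the weights taken from the kernel table (units of b³/3). This is my own illustration, not the authors' code; it charges GEQRT/UNMQR to every elimination, i.e., it ignores the note above about tiles that are already triangular.

```python
# Sketch: the two eliminations above as task lists, with the kernel weights
# from the table (units of b^3/3). GEQRT/UNMQR are charged to every elimination
# here, i.e., the "already triangular" rule is ignored for simplicity.
W = {"GEQRT": 4, "UNMQR": 6, "TSQRT": 6, "TSMQR": 12, "TTQRT": 2, "TTMQR": 6}

def elim_TS(i, piv, k, q):
    """elim(i, piv(i,k), k) via TS kernels (triangle on top of square)."""
    tasks = [("GEQRT", piv, k)]
    tasks += [("UNMQR", piv, k, j) for j in range(k + 1, q + 1)]
    tasks += [("TSQRT", i, piv, k)]
    tasks += [("TSMQR", i, piv, k, j) for j in range(k + 1, q + 1)]
    return tasks

def elim_TT(i, piv, k, q):
    """elim(i, piv(i,k), k) via TT kernels (triangle on top of triangle)."""
    tasks = [("GEQRT", piv, k), ("GEQRT", i, k)]
    tasks += [("UNMQR", piv, k, j) for j in range(k + 1, q + 1)]
    tasks += [("UNMQR", i, k, j) for j in range(k + 1, q + 1)]
    tasks += [("TTQRT", i, piv, k)]
    tasks += [("TTMQR", i, piv, k, j) for j in range(k + 1, q + 1)]
    return tasks

def weight(tasks):
    return sum(W[t[0]] for t in tasks)

# One elimination in column k = 2 of a matrix with q = 4 tile columns:
print(weight(elim_TS(5, 2, 2, 4)), weight(elim_TT(5, 2, 2, 4)))   # 46 46
# Charged this way, a TS and a TT elimination have the same weight, 10 + 18(q - k).
```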


Rectangular matrices | Maximizing parallelism on multicore nodes (30)

The analysis and coding is a little more complex ... [Figure: weighted elimination timeline, time steps 2 to 142.]

Julien Langou | University of Colorado Denver Hierarchical QR | 69 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Theorem. No matter which elimination list (any combination of TT and TS eliminations) is used, the total weight of the tasks for performing a tiled QR factorization is constant and equal to 6pq² − 2q³.

Proof:

L1 :: #GEQRT = #TTQRT + q

L2 :: #UNMQR = #TTMQR + (1/2) q (q − 1)

L3 :: #TTQRT + #TSQRT = pq − (1/2) q (q + 1)

L4 :: #TTMQR + #TSMQR = (1/2) pq (q − 1) − (1/6) q (q − 1)(q + 1)

Define L5 as 4 L1 + 6 L2 + 6 L3 + 12 L4 and we get

L5 :: 6pq² − 2q³ = 4 #GEQRT + 12 #TSMQR + 6 #TTMQR + 2 #TTQRT + 6 #TSQRT + 6 #UNMQR

L5 :: = total weight of the tasks (total number of flops)

Note: using our unit task weight of b³/3, with m = pb and n = qb, we obtain 2mn² − (2/3)n³ flops, which is exactly the same number as for a standard Householder reflection algorithm as found in LAPACK or ScaLAPACK.

Julien Langou | University of Colorado Denver Hierarchical QR | 70 of 103
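The identity can be checked by brute force on two concrete elimination schemes: a flat tree with TS kernels and a flat tree with TT kernels (this time applying the "already triangular" rule, so GEQRT/UNMQR are counted once per tile of the panel). This is my own counting sketch; the function names are assumptions, not the authors' code.

```python
# Sketch: total task weight (units of b^3/3) of a tiled QR on a p-by-q tile
# matrix, for two concrete elimination schemes, checked against 6*p*q^2 - 2*q^3.
W = {"GEQRT": 4, "UNMQR": 6, "TSQRT": 6, "TSMQR": 12, "TTQRT": 2, "TTMQR": 6}

def weight_flat_TS(p, q):
    """Sameh-Kuck flat tree with TS kernels: row k annihilates rows k+1..p in column k."""
    total = 0
    for k in range(1, q + 1):
        total += W["GEQRT"] + (q - k) * W["UNMQR"]       # factor the pivot tile, update its row
        total += (p - k) * (W["TSQRT"] + (q - k) * W["TSMQR"])
    return total

def weight_flat_TT(p, q):
    """Flat tree with TT kernels: GEQRT/UNMQR applied once per tile of the panel."""
    total = 0
    for k in range(1, q + 1):
        total += (p - k + 1) * (W["GEQRT"] + (q - k) * W["UNMQR"])
        total += (p - k) * (W["TTQRT"] + (q - k) * W["TTMQR"])
    return total

for p, q in [(15, 4), (40, 40), (7, 3)]:
    assert weight_flat_TS(p, q) == weight_flat_TT(p, q) == 6 * p * q * q - 2 * q ** 3
print("total weight matches 6*p*q^2 - 2*q^3 for the tested shapes")
```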

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Theorem. For a tiled matrix of size p × q, where p ≥ q, the critical path length of FLATTREE is

2p + 2            if p ≥ q = 1

6p + 16q − 22     if p > q > 1

22p − 24          if p = q > 1

[Figure: weighted FLATTREE elimination timeline, time steps 2 to 64, annotated with 2nd row, 3rd row, 4th row, 2nd column, 3rd column.]

Initial: 10; fill the pipeline: 6(p − 1); pipeline: 16(q − 2); end: 4.

Julien Langou | University of Colorado Denver Hierarchical QR | 71 of 103
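As a direct transcription of the theorem (the function name cp_flattree is mine):

```python
# Critical path length of FLATTREE on a p-by-q tile matrix (p >= q),
# in units of b^3/3, transcribed from the theorem above.
def cp_flattree(p, q):
    if q == 1:
        return 2 * p + 2
    if p == q:
        return 22 * p - 24
    return 6 * p + 16 * q - 22        # p > q > 1

print(cp_flattree(15, 1), cp_flattree(15, 4), cp_flattree(15, 15))   # 32 132 306
```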

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Greedy tree on a 15x4 matrix (weighted on top, coarse below). [Figure: weighted elimination timeline, time steps 2 to 132, and coarse elimination timeline, time steps 1 to 18.]

Julien Langou | University of Colorado Denver Hierarchical QR | 72 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Theorem. For a tiled matrix of size p × q, where p ≥ q, given the time COARSE(p, q) of any coarse-grain algorithm, the corresponding weighted algorithm time WEIGHTED(p, q) satisfies

10 (q − 1) + 6 COARSE(p, q − 1) + 4 + 2 ≤ WEIGHTED(p, q) ≤ 10 (q − 1) + 6 COARSE(p, q − 1) + 4 + 2 (COARSE(p, q) − COARSE(p, q − 1))

Julien Langou | University of Colorado Denver Hierarchical QR | 73 of 103
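Transcribed directly (the function name weighted_bounds is mine), and evaluated with the greedy coarse times of the earlier 15x4 table, COARSE(15,3) = 8 and COARSE(15,4) = 10:

```python
# Bounds of the theorem above: given the coarse-grain times COARSE(p, q) and
# COARSE(p, q-1), bound the weighted algorithm time WEIGHTED(p, q).
def weighted_bounds(coarse_q, coarse_qm1, q):
    lower = 10 * (q - 1) + 6 * coarse_qm1 + 4 + 2
    upper = 10 * (q - 1) + 6 * coarse_qm1 + 4 + 2 * (coarse_q - coarse_qm1)
    return lower, upper

print(weighted_bounds(10, 8, 4))   # (84, 86) for the greedy tree on a 15x4 tile matrix
```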

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Theorem. We can prove that the GRASAP algorithm is optimal.

Julien Langou | University of Colorado Denver Hierarchical QR | 74 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Platform

All experiments were performed on a 48-core machine composed of eight hexa-core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz. Each core has a theoretical peak of 11.2 GFlop/s, with a peak of 537.6 GFlop/s for the whole machine.

Experimental code is written using the QUARK scheduler.

Results are checked with ‖I − QᵀQ‖ and ‖A − QR‖/‖A‖. The code has been written in real and complex arithmetic, single and double precision.

Julien Langou | University of Colorado Denver Hierarchical QR | 75 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200, Model)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT) = [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ]

[Figure: overhead in critical path length with respect to GREEDY (GREEDY = 1), as a function of q = 1 to 40.]

Julien Langou | University of Colorado Denver Hierarchical QR | 76 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200, Experimental)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT) = [1 5 5 5 17 28 8]

[Figure: overhead in time with respect to GREEDY (GREEDY = 1), as a function of q = 1 to 40.]

Julien Langou | University of Colorado Denver Hierarchical QR | 77 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200, Model on left, Experimental on right)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT): model [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ], experimental [1 5 5 5 17 28 8]

[Figure: overhead in critical path length (model, left) and in time (experimental, right) with respect to GREEDY (GREEDY = 1), as a function of q = 1 to 40.]

The main difference is on the right, where there is enough parallelism so that all methods give "peak" performance (peak is the TTMQR performance).

Julien Langou | University of Colorado Denver Hierarchical QR | 78 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT) = [1 5 5 5 17 28 8]

[Figure: GFLOP/s as a function of q = 1 to 40.]

Julien Langou | University of Colorado Denver Hierarchical QR | 79 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200, Model on left, Experimental on right)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT): model [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ], experimental [1 5 5 5 17 28 8]

[Figure: predicted GFLOP/s (model, left) and measured GFLOP/s (experimental, right), as a function of q = 1 to 40.]

Julien Langou | University of Colorado Denver Hierarchical QR | 80 of 103

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Outline

Tall and Skinny matrices
Minimizing communication in parallel distributed (29)
Minimizing communication in hierarchical parallel distributed (6)
Minimizing communication in sequential (1)

Rectangular matrices
Minimizing communication in sequential (2)
Maximizing parallelism on multicore nodes (30)
Parallelism+communication on multicore nodes (2)
Parallelism+communication on distributed+multicore nodes (13)
Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 81 of 103

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Factor Kernels

code name   # flops
GEQRT       4
TTQRT       2
TSQRT       6

• We work on square b-by-b tiles.
• The unit for the task weights is b³/3.
• Each of our tasks performs O(b³) computation for O(b²) communication.
• ⇒ Thanks to this surface/volume effect, we can, in a first approximation, neglect communication and focus only on parallelism.

Julien Langou | University of Colorado Denver Hierarchical QR | 82 of 103
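The surface-to-volume argument in the last two bullets can be made concrete with a two-line computation (my own illustration, under the stated O(b³)/O(b²) assumption): the flop-to-word ratio of a tile kernel grows linearly with the tile size b.

```python
# Each kernel does O(b^3) flops on O(b^2) words, so the flop-to-word ratio is O(b).
for b in (100, 200, 400):
    flops = b ** 3          # order of magnitude of the computation per kernel
    words = b ** 2          # order of magnitude of the data touched per kernel
    print(b, flops // words)    # the ratio equals b
```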

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Using TT kernels

GEQRT, GEQRT, then TTQRT

Total # of flops (sequential time): 2 GEQRT + TTQRT = 2 · 4 + 2 = 10

Critical path length (parallel time): GEQRT + TTQRT = 4 + 2 = 6

Using TS kernels

GEQRT, then TSQRT

Total # of flops (sequential time): GEQRT + TSQRT = 4 + 6 = 10

Critical path length (parallel time): GEQRT + TSQRT = 4 + 6 = 10

One remarkable outcome is that the total number of flops is 10b³/3 in both cases. If we consider the number of flops for a standard LAPACK algorithm (based on standard Householder transformations), we also obtain 10b³/3.

Julien Langou | University of Colorado Denver Hierarchical QR | 83 of 103
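The arithmetic above as a two-line check (my own illustration; the weights are the kernel costs from the previous slide):

```python
# 2x1-tile reduction: same total weight, different critical path (units of b^3/3).
W = {"GEQRT": 4, "TTQRT": 2, "TSQRT": 6}

tt = ["GEQRT", "GEQRT", "TTQRT"]   # the two GEQRT are independent and can run in parallel
ts = ["GEQRT", "TSQRT"]            # strictly sequential chain

print("TT: flops", sum(W[t] for t in tt), "critical path", W["GEQRT"] + W["TTQRT"])   # 10, 6
print("TS: flops", sum(W[t] for t in ts), "critical path", W["GEQRT"] + W["TSQRT"])   # 10, 10
```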

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Curves: TSQRT, GEQRT, TTQRT, GEQRT+TTQRT, GEMM (each in and out of cache).

[Figure: kernel performance (GFLOP/s) as a function of the tile size (100 to 600).]

The TS kernels (factorization and update) are more efficient than the TT ones due to a variety of reasons (better vectorization, better volume-to-surface ratio).

Julien Langou | University of Colorado Denver Hierarchical QR | 84 of 103

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Real Arithmetic (p=40,b=200, Model on left, Experimental on right)

Curves: FLATTREE (TS), PLASMATREE (TS) (best), FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TS): model [ 1 1 1 1 5 5 5 5 5 5 10 ... 10 ], experimental [1 3 6 11 12 18 32]
Best domain size for PLASMATREE (TT): model [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ], experimental [1 3 10 5 17 27 19]

[Figure: predicted GFLOP/s (model, left) and measured GFLOP/s (experimental, right), as a function of q = 1 to 40.]

⇒ This motivates the introduction of a TS level. Square 1-by-1 tiles are grouped into larger rectangular tiles of size a-by-1, which are eliminated in a TS fashion. The regular TT algorithm is then applied on the rectangular tiles.

Pros: recovers the performance of the TS kernels. Cons: introduces a tuning parameter (a).

Julien Langou | University of Colorado Denver Hierarchical QR | 85 of 103

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Real Arithmetic (p=40,b=200, performance on left, overhead on right)

Curves: BEST, FT (TS), BINOMIAL (TT), FT (TT), FIBO (TT), Greedy (TT), Greedy+RRon+a=8, Greedy+RRoff+a=8 (P=40, MB=200, IB=32, 48 cores).

[Figure: performance (GFlop/s, left) and overhead with respect to BEST (right), as a function of q = 1 to 40.]

Idea: use TS domains of size a, use Greedy on top of them, and use round-robin on the domains. (This experiment was done with DAGuE.)

Julien Langou | University of Colorado Denver Hierarchical QR | 86 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Outline

Tall and Skinny matrices
Minimizing communication in parallel distributed (29)
Minimizing communication in hierarchical parallel distributed (6)
Minimizing communication in sequential (1)

Rectangular matrices
Minimizing communication in sequential (2)
Maximizing parallelism on multicore nodes (30)
Parallelism+communication on multicore nodes (2)
Parallelism+communication on distributed+multicore nodes (13)
Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 87 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Outline

Three goals:

1. minimize the number of local communications

2. minimize the number of parallel distributed communications

3. maximize the parallelism within a node

Julien Langou | University of Colorado Denver Hierarchical QR | 88 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Hierarchical trees

Julien Langou | University of Colorado Denver Hierarchical QR | 89 of 103


Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Hierarchical algorithm layout

Global view / Local view

Legend: P0, P1, P2
0: local TS
1: local TT
2: domino
3: global tree

Julien Langou | University of Colorado Denver Hierarchical QR | 90 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Outline

Three goals:

1. minimize the number of local communications

2. minimize the number of parallel distributed communications

3. maximize the parallelism within a node

Four hierarchical trees

1. TS level :: flat tree
   eliminate some tiles with TS kernels

2. node level :: flat/greedy/binary/fibonacci tree
   eliminate the remaining non-coupled tiles locally

3. local level :: domino tree
   take care of the coupling between local computations and distributed computations

4. distributed level :: flat/greedy/binary/fibonacci tree
   high level, reduce the tiles to one in parallel distributed

(A sketch of this level assignment, for a single tile column, is given after this slide.)

Julien Langou | University of Colorado Denver Hierarchical QR | 91 of 103
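As a rough illustration of how the levels share the work, here is my own sketch, not the authors' code: the function name, the perfectly balanced layout, and the use of flat reductions at every level are assumptions, and the domino level (which couples consecutive columns) is omitted in this single-column view.

```python
# Sketch: which level handles each elimination of a single tile column of p
# rows, when the rows are spread over `nodes` nodes, each holding
# `domains_per_node` TS domains of size a (flat reductions at every level).

def hierarchical_eliminations(p, nodes, domains_per_node, a):
    assert p == nodes * domains_per_node * a, "assume a perfectly balanced layout"
    ts, intra_node, inter_node = [], [], []
    for n in range(nodes):
        node_head = n * domains_per_node * a
        for d in range(domains_per_node):
            head = node_head + d * a
            for i in range(head + 1, head + a):        # level 1: TS inside the domain
                ts.append((i, head))
            if d > 0:                                  # level 2: domain heads within the node
                intra_node.append((head, node_head))
        if n > 0:                                      # level 4: node heads across nodes
            inter_node.append((node_head, 0))
    return ts, intra_node, inter_node

ts, intra, inter = hierarchical_eliminations(p=240, nodes=4, domains_per_node=6, a=10)
print(len(ts), len(intra), len(inter), len(ts) + len(intra) + len(inter))
# 216 TS, 20 intra-node, 3 inter-node eliminations: 239 = p - 1 in total.
```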

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Hierarchical trees

Julien Langou | University of Colorado Denver Hierarchical QR | 92 of 103


Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Platform

Cluster Edel from Grid5000, Grenoble
• 60 nodes
• 2 Nehalem Xeon E5520 at 2.27 GHz per node (8 cores)
• 24 GB per node
• Infiniband 20G network
• Theoretical peak performance:
◦ 9.08 GFlop/s per core
◦ 72.64 GFlop/s per node
◦ 4.358 TFlop/s for the whole machine

Julien Langou | University of Colorado Denver Hierarchical QR | 93 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Putting the results in perspective

Peak: the peak performance of a core is 9.08 GFlop/s; ×8 for a node gives 72.64 GFlop/s; ×60 for the parallel distributed platform gives 4358.4 GFlop/s.

Kernels (one core): the TSMQR kernel performs at 7.21 GFlop/s (79.41% of peak); the TTMQR kernel performs at 6.28 GFlop/s (69.17% of peak).

In shared memory on a node (one node, eight cores): flat tree TS gets 76.8% of peak with 55.83 GFlop/s; flat tree TT gets 66.4% of peak with 48.22 GFlop/s.

Our parallel distributed code (sixty nodes, 480 cores): on 60 nodes we top out at 3 TFlop/s, that is 68.8% of peak.

Julien Langou | University of Colorado Denver Hierarchical QR | 94 of 103
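The percentages quoted above can be re-derived from the raw numbers (my own check):

```python
# Re-derive the efficiencies quoted above (all performance numbers in GFlop/s).
peak_core = 9.08
peak_node = peak_core * 8          # 72.64
peak_machine = peak_node * 60      # 4358.4
for label, perf, peak in [("TSMQR kernel (one core)",         7.21,   peak_core),
                          ("TTMQR kernel (one core)",         6.28,   peak_core),
                          ("flat tree TS (one node)",         55.83,  peak_node),
                          ("flat tree TT (one node)",         48.22,  peak_node),
                          ("parallel distributed (60 nodes)", 3000.0, peak_machine)]:
    print(f"{label}: {100 * perf / peak:.1f}% of peak")
```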

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Influence of TS kernels and trees (N=4480)

[Figure: performance (GFlop/s) as a function of M; curves for TS-domain sizes a = 1, 4, 8 combined with greedy and binary trees (left) and with flat and fibonacci trees (right).]

Influence of the TS level size with a fixed local reduction tree (left: Greedy, right: Flat).

⇒ a has to be adapted to the matrix size.

Low-level tree: FLATTREE is slower than the three others.

High-level tree: FLATTREE is slightly better than the others.

Julien Langou | University of Colorado Denver Hierarchical QR | 95 of 103


Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Influence of the domino

[Figure: influence of the low-level tree and the domino optimization: performance (GFlop/s), number of tasks (×1000), and percentage of each category of tasks (GEQRT+UNMQR, TT kernels, TS kernels), as a function of M, with and without the domino, for low-level trees flat, fibonacci, greedy, and binary.]

N=4480, P=15, Q=4, MB=280, a=4, High level tree set to Fibonacci

Julien Langou | University of Colorado Denver Hierarchical QR | 96 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Scaling experiments (N=4480, M varies)

[Figure: performance (GFlop/s) as a function of M (N = 4,480), for HQR, ScaLAPACK, [BBD+10], and [SLHD10]; theoretical peak: 4358.4 GFlop/s.]

P=15, Q=4, MB=280, FIBONACCI/FIBONACCI, a=4, domino enabled.

The matrix is [16-100]x16 tiles.

Julien Langou | University of Colorado Denver Hierarchical QR | 97 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Scaling experiments (M=67200, N varies)

[Figure: performance (GFlop/s) as a function of N (M = 67,200), for HQR, ScaLAPACK, [BBD+10], and [SLHD10]; theoretical peak: 4358.4 GFlop/s.]

P=15, Q=4, MB=280, FIBONACCI/FLATTREE. For N ≤ 16800: a=1, domino enabled. For N > 16800: a=4, domino disabled.

The matrix is 240x[16-240] tiles.

Julien Langou | University of Colorado Denver Hierarchical QR | 98 of 103

Rectangular matrices | Scheduling on multicore nodes (4)

Outline

Tall and Skinny matrices
Minimizing communication in parallel distributed (29)
Minimizing communication in hierarchical parallel distributed (6)
Minimizing communication in sequential (1)

Rectangular matrices
Minimizing communication in sequential (2)
Maximizing parallelism on multicore nodes (30)
Parallelism+communication on multicore nodes (2)
Parallelism+communication on distributed+multicore nodes (13)
Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 99 of 103

Rectangular matrices | Scheduling on multicore nodes (4)

We consider an n-by-n matrix with nb-by-nb tiles, and set n = t · nb. We take nb³/3 as the unit of flops; this makes n and nb disappear from the problem.

The total number of flops in Cholesky is t³.

The length of the critical path is 9t − 10.

Therefore, on p threads, the execution time of Cholesky is at least

max( t³ / p , 9t − 10 ).

This is our first lower bound on any execution time (so in particular on the optimal execution time).

[Figure: speedup as a function of p (number of threads), scalability with t = 20, together with lower bound #1.]

Note: the expected speedup is therefore below this dashed line. (The lower bound on time becomes an upper bound on speedup.)

Julien Langou | University of Colorado Denver Hierarchical QR | 100 of 103
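The bound, and the resulting cap on speedup, as a small sketch (the function names are mine):

```python
# Lower bound on the execution time of tiled Cholesky on p threads
# (t tiles per dimension, unit of flops nb^3/3), and the induced upper bound
# on speedup, as used for the dashed line in the plot above.
def time_lower_bound(t, p):
    return max(t ** 3 / p, 9 * t - 10)

def speedup_upper_bound(t, p):
    return t ** 3 / time_lower_bound(t, p)

t = 20
for p in (8, 16, 32, 48, 64, 128):
    print(p, round(speedup_upper_bound(t, p), 1))
# The speedup saturates at t^3 / (9t - 10) = 8000 / 170, about 47, however many threads are used.
```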

Rectangular matrices | Scheduling on multicore nodes (4)

We consider three strategies:

1. max: list schedule with priority given to the task with the maximum flops in all of its children (and itself),

2. rand: list schedule with random task selection,

3. min: list schedule with priority given to the task with the minimum flops in all of its children (and itself).

(A minimal list-scheduling sketch follows after this slide.)

[Figure: speedup (left) and % overhead with respect to the basic lower bound (right), as a function of p (number of cores), with t = 20; curves: lower bound #1, strat. max, strat. rand, strat. min.]

If we are not happy with this 40% discrepancy between the lower bound and the upper bound, wehave two choices ... either find a greater lower bound or find a lower upper bound ... or both ...

Julien Langou | University of Colorado Denver Hierarchical QR | 101 of 103
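A generic list scheduler with the three priority rules might look as follows. This is my own sketch, not the authors' implementation: the toy DAG and the proxy priority (weight of the task plus its direct successors) are assumptions; the real experiments schedule the tiled Cholesky DAG.

```python
# Sketch of list scheduling on p threads with max / random / min priorities.
import heapq, random

def list_schedule(weights, succ, p, strategy, seed=0):
    rng = random.Random(seed)
    npred = {t: 0 for t in weights}
    for t in succ:
        for s in succ[t]:
            npred[s] += 1
    # proxy priority: weight of the task plus the weights of its direct successors
    prio = {t: weights[t] + sum(weights[s] for s in succ[t]) for t in weights}
    key = {"max": lambda t: -prio[t],
           "min": lambda t: prio[t],
           "rand": lambda t: rng.random()}[strategy]
    ready = [t for t in weights if npred[t] == 0]
    running, now = [], 0                      # running: heap of (finish time, task)
    while ready or running:
        while ready and len(running) < p:     # start as many ready tasks as threads allow
            ready.sort(key=key)
            t = ready.pop(0)
            heapq.heappush(running, (now + weights[t], t))
        now, t = heapq.heappop(running)       # advance to the next task completion
        for s in succ[t]:
            npred[s] -= 1
            if npred[s] == 0:
                ready.append(s)
    return now

# Toy DAG: A -> {B, C, D} -> E, with Cholesky-like kernel weights.
weights = {"A": 1, "B": 3, "C": 6, "D": 3, "E": 1}
succ = {"A": ["B", "C", "D"], "B": ["E"], "C": ["E"], "D": ["E"], "E": []}
for s in ("max", "rand", "min"):
    print(s, list_schedule(weights, succ, p=2, strategy=s))   # max: 8, min: 11
```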


Rectangular matrices | Scheduling on multicore nodes (4)

[Figure: speedup (left) and % overhead with respect to the current lower bound (right), as a function of the number of cores, scalability with p = 20; curves: lower bound in time, new bound, max, random, min.]

If we are not happy with this 5% discrepancy between the lower bound and theupper bound, we have two choices ... either find a greater lower bound or finda lower upper bound ... or both ...

Julien Langou | University of Colorado Denver Hierarchical QR | 103 of 103

Rectangular matrices | Scheduling on multicore nodes (4)

• reducing communication
◦ in sequential
◦ in parallel distributed

• increasing parallelism
(or reducing the critical path, reducing synchronization)

Tall and Skinny matrices
Minimizing communication in parallel distributed (29)
Minimizing communication in hierarchical parallel distributed (6)
Minimizing communication in sequential (1)

Rectangular matrices
Minimizing communication in sequential (2)
Maximizing parallelism on multicore nodes (30)
Parallelism+communication on multicore nodes (2)
Parallelism+communication on distributed+multicore nodes (13)
Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 104 of 103

Thank you!

Julien Langou | University of Colorado Denver Hierarchical QR | 105 of 103