
Hierarchical QR factorization algorithms for multi-core cluster systems

Jack Dongarra, Mathieu Faverge, Thomas Herault, Julien Langou, Yves Robert
University of Tennessee Knoxville, USA; University of Colorado Denver, USA; École Normale Supérieure de Lyon, France

June 29, 2012

• reducing communication
  ◦ in sequential
  ◦ in parallel distributed

• increasing parallelism (or reducing the critical path, reducing synchronization)

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 2 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 3 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Reduce Algorithms: Introduction

The QR factorization of a long and skinny matrix, with its data partitioned vertically across several processors, arises in a wide range of applications.

[Figure: A (row blocks A1, A2, A3) = Q (row blocks Q1, Q2, Q3) · R]

Input: A is block distributed by rows.
Output: Q is block distributed by rows; R is global.
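To make the statement concrete, here is a small standalone check (an illustration written for this transcript, not the distributed code): factor two row blocks independently, factor the two stacked R factors, and compare with the R of a direct QR of the whole matrix. It uses LAPACKE; the sizes are arbitrary, and the comparison is up to the sign of each row of R.

    #include <lapacke.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        enum { M = 200, N = 5, MB = M / 2 };       /* tall and skinny: M >> N, two row blocks */
        double A[M * N], B[M * N];                 /* column-major, leading dimension M       */
        double tau0[N], tau1[N], tauB[N], tauS[N];
        double S[2 * N * N];                       /* the two stacked R factors, 2N-by-N      */

        srand(0);
        for (int i = 0; i < M * N; i++) A[i] = B[i] = rand() / (double)RAND_MAX;

        /* Reference: direct QR of the whole tall-and-skinny matrix. */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, M, N, B, M, tauB);

        /* Step 1: independent QR of each row block (the per-processor work). */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, MB, N, A,      M, tau0);   /* rows 0 .. MB-1 */
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, MB, N, A + MB, M, tau1);   /* rows MB .. M-1 */

        /* Step 2: stack the two N-by-N R factors and factor the 2N-by-N result. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < 2 * N; i++) {
                double v = 0.0;
                if (i <  N && i     <= j) v = A[j * M + i];             /* R from block 0 */
                if (i >= N && i - N <= j) v = A[j * M + MB + (i - N)];  /* R from block 1 */
                S[j * 2 * N + i] = v;
            }
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 2 * N, N, S, 2 * N, tauS);

        /* The upper triangle of S is the global R, equal to the direct R up to row signs. */
        double maxdiff = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i <= j; i++) {
                double d = fabs(fabs(S[j * 2 * N + i]) - fabs(B[j * M + i]));
                if (d > maxdiff) maxdiff = d;
            }
        printf("max | |R_tsqr| - |R_direct| | = %g  (should be at round-off level)\n", maxdiff);
        return 0;
    }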

Julien Langou | University of Colorado Denver Hierarchical QR | 4 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Examples of applications: in block iterative methods.

a) in iterative methods with multiple right-hand sides (block iterative methods):
   1) Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
   2) Block GMRES, Block GCR, Block CG, Block QMR, …

b) in iterative methods with a single right-hand side:
   1) s-step methods for linear systems of equations (e.g. A. Chronopoulos),
   2) LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc,
   3) recent work from M. Hoemmen and J. Demmel (U. California at Berkeley).

c) in iterative eigenvalue solvers:
   1) PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC),
   2) HYPRE (Lawrence Livermore National Lab.) through BLOPEX,
   3) Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk),
   4) PRIMME (A. Stathopoulos, Coll. William & Mary),
   5) and also TRLAN, BLZPACK, IRBLEIGS.

Julien Langou | University of Colorado Denver Hierarchical QR | 5 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Reduce Algorithms: Introduction. Examples of applications:

a) in linear least squares problems in which the number of equations is much larger than the number of unknowns,

b) in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers),

c) in dense, large, and more square QR factorizations, where they are used as the panel factorization step.

Julien Langou | University of Colorado Denver Hierarchical QR | 6 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Blocked LU and QR algorithms (LAPACK)

[Figure: one step of each blocked factorization, a panel factorization followed by the update of the remaining submatrix.
 LU: lu( ) uses dgetf2 for the panel, then dtrsm (+dswp) and dgemm for the update, producing L\U in place.
 QR: qr( ) uses dgeqr2 + dlarft for the panel, then dlarfb for the update, producing V\R in place.]

LAPACK block LU (right-looking): dgetrf.  LAPACK block QR (right-looking): dgeqrf.

Julien Langou | University of Colorado Denver Hierarchical QR | 7 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Blocked LU and QR algorithms (LAPACK)

[Figure: LAPACK block LU (right-looking, dgetrf): panel factorization (dgetf2, dtrsm + dswp) and update of the remaining submatrix (dgemm)]

Latency bounded: more than nb AllReduce operations for n·nb² ops (the panel factorization).

CPU/bandwidth bounded: the bulk of the computation, n·n·nb ops, highly parallelizable, efficient and scalable (the update).

Julien Langou | University of Colorado Denver Hierarchical QR | 8 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Parallelization of LU and QR

Parallelize the update:
• Easy and done in any reasonable software.
• This is the 2/3·n³ term in the FLOPs count.
• Can be done efficiently with LAPACK + multithreaded BLAS.

[Figure: the dgemm update is the parallelized part of the right-looking factorization]

Julien Langou | University of Colorado Denver Hierarchical QR | 9 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Parallelization of LU and QR

Parallelize the update:
• Easy and done in any reasonable software.
• This is the 2/3·n³ term in the FLOPs count.
• Can be done efficiently with LAPACK + multithreaded BLAS.

Parallelize the panel factorization:
• Not an option in a multicore context (p < 16).
• See e.g. ScaLAPACK or HPL, but still by far the slowest part and the bottleneck of the computation.

Hide the panel factorization:
• Lookahead (see e.g. High Performance LINPACK)
• Dynamic scheduling

Julien Langou | University of Colorado Denver Hierarchical QR | 10 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Hiding the panel factorization with dynamic scheduling.

[Figure: execution trace over time. Courtesy of Alfredo Buttari, U Tennessee]

Julien Langou | University of Colorado Denver Hierarchical QR | 11 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

What about strong scalability?

Julien Langou | University of Colorado Denver Hierarchical QR | 12 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

What about strong scalability? N = 1536, NB = 64, procs = 16

[Figure: execution trace. Courtesy of Jakub Kurzak, U Tennessee]

Julien Langou | University of Colorado Denver Hierarchical QR | 13 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

What about strong scalability? N = 1536, NB = 64, procs = 16

We cannot hide the panel factorization in the MM (matrix multiply); actually, it is the MMs that are hidden by the panel factorizations!

[Figure: execution trace. Courtesy of Jakub Kurzak, U Tennessee]

Julien Langou | University of Colorado Denver Hierarchical QR | 14 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

What about strong scalability? N = 1536, NB = 64, procs = 16

We cannot hide the panel factorization (n²) with the MM (n³); actually, it is the MMs that are hidden by the panel factorizations!

NEED FOR NEW MATHEMATICAL ALGORITHMS

[Figure: execution trace. Courtesy of Jakub Kurzak, U Tennessee]

Julien Langou | University of Colorado Denver Hierarchical QR | 15 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

A new generation of algorithms? Algorithms follow hardware evolution along time.

LINPACK (80's) (vector operations): relies on Level-1 BLAS operations.

LAPACK (90's) (blocking, cache friendly): relies on Level-3 BLAS operations.

Julien Langou | University of Colorado Denver Hierarchical QR | 16 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

A new generation of algorithms? Algorithms follow hardware evolution along time.

LINPACK (80's) (vector operations): relies on Level-1 BLAS operations.

LAPACK (90's) (blocking, cache friendly): relies on Level-3 BLAS operations.

New algorithms (00's) (multicore friendly): rely on a DAG/scheduler, a block data layout, and some extra kernels.

These new algorithms
- have a very low granularity and scale very well (multicore, petascale computing, …),
- remove a lot of dependencies among the tasks (multicore, distributed computing),
- avoid latency (distributed computing, out-of-core),
- rely on fast kernels.
These new algorithms need new kernels and rely on efficient scheduling algorithms.

Julien Langou | University of Colorado Denver Hierarchical QR | 17 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

2005-2007: new algorithms based on 2D partitioning:

– U Texas (van de Geijn): SYRK, CHOL (multicore), LU, QR (out-of-core)
– U Tennessee (Dongarra): CHOL (multicore)
– HPC2N (Kågström) / IBM (Gustavson): CHOL (distributed)
– UC Berkeley (Demmel) / INRIA (Grigori): LU/QR (distributed)
– UC Denver (Langou): LU/QR (distributed)

A 3rd revolution for dense linear algebra?

Julien Langou | University of Colorado Denver Hierarchical QR | 18 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

• reducing communication
  ◦ in sequential
  ◦ in parallel distributed

• increasing parallelism (or reducing the critical path, reducing synchronization)

We start with reducing communication in parallel distributed, in the tall and skinny case.

Julien Langou | University of Colorado Denver Hierarchical QR | 19 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

On two processes

Step 1: each process computes the QR factorization of its own block: QR(A0) gives (V0(0), R0(0)) on process 0, and QR(A1) gives (V1(0), R1(0)) on process 1.
Step 2: the two R factors R0(0) and R1(0) are brought together (one communication).
Step 3: the stacked R factors are factored in turn: QR([R0(0); R1(0)]) gives (V0(1), V1(1)) and the factor R0(1), which is the final R.

[Figure: the three steps laid out as processes vs. time]

Julien Langou | University of Colorado Denver Hierarchical QR | 23 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

The big picture…

[Figure: processes vs. time for A = Q·R; each process holds a block Ai (A0 … A6) and produces the corresponding Qi (Q0 … Q6), while the local R factors are combined pairwise up a reduction tree into the final R]

Julien Langou | University of Colorado Denver Hierarchical QR | 24 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

The big picture…

[Figure: the same reduction, with the communication steps and the computation steps marked]

Julien Langou | University of Colorado Denver Hierarchical QR | 25 of 103


Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Latency, but also the possibility of fast panel factorization.

• DGEQR3 is the recursive algorithm (see Elmroth and Gustavson, 2000); DGEQRF and DGEQR2 are the LAPACK routines.
• Times include QR and DLARFT.
• Run on Pentium III.

QR factorization and construction of T, m = 10,000. Perf in MFLOP/sec (times in sec):

  n   | DGEQR3        | DGEQRF        | DGEQR2
  50  | 173.6 (0.29)  | 65.0 (0.77)   | 64.6 (0.77)
  100 | 240.5 (0.83)  | 62.6 (3.17)   | 65.3 (3.04)
  150 | 277.9 (1.60)  | 81.6 (5.46)   | 64.2 (6.94)
  200 | 312.5 (2.53)  | 111.3 (7.09)  | 65.9 (11.98)

[Figure: MFLOP/sec vs. n for m = 1,000,000]

Julien Langou | University of Colorado Denver Hierarchical QR | 28 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Q and R: strong scalability

[Figure: MFLOPs/sec/proc (0 to 800) vs. number of processors (32, 64, 128, 256) for ReduceHH (QR3), ReduceHH (QRF), ScaLAPACK QRF, and ScaLAPACK QR2]

In this experiment, we fix the problem: m = 1,000,000 and n = 50. Then we increase the number of processors. BlueGene/L, frost.ncar.edu.

Julien Langou | University of Colorado Denver Hierarchical QR | 29 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Communication :: TSQR :: parallel case. When only R is wanted.

[Figure: TSQR binary reduction tree on four processes; QR(Ai) gives Ri(0) for i = 0…3, then QR([R0(0); R1(0)]) gives R0(1), QR([R2(0); R3(0)]) gives R2(1), and QR([R0(1); R2(1)]) gives the final R]

Consider
  architecture: parallel case, P processing units;
  problem: QR factorization of an m-by-n TS matrix (TS: m/P ≥ n);
  (main) assumption: the operation is "truly" parallel distributed.
⇒ answer: binary tree.

Theory:

             | TSQR                    | ScaLAPACK-like      | Lower bound
  # flops    | 2mn²/P + (2n³/3)·log P  | 2mn²/P − 2n³/(3P)   | Θ(mn²/P)
  # words    | (n²/2)·log P            | (n²/2)·log P        | (n²/2)·log P
  # messages | log P                   | 2n·log P            | log P

Julien Langou | University of Colorado Denver Hierarchical QR | 30 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Any tree does.

When only R is wanted: the MPI_Allreduce

In the case where only R is wanted, instead of constructing our own tree, one can simply use MPI_Allreduce with a user-defined operation. The operation we give to MPI is basically Algorithm 2. It performs the operation

    QR( [ R1 ; R2 ] )  →  R

This binary operation is associative, and this is all MPI needs to use a user-defined operation on a user-defined datatype. Moreover, if we change the signs of the elements of R so that the diagonal of R holds positive elements, then the R-factor binary operation becomes commutative.

The code becomes two lines:

    lapack_dgeqrf( mloc, n, A, lda, tau, &dlwork, lwork, &info );
    MPI_Allreduce( MPI_IN_PLACE, A, 1, MPI_UPPER, LILA_MPIOP_QR_UPPER, mpi_comm );
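For concreteness, a minimal sketch of what such a user-defined reduction could look like. This is an illustration only, not the authors' LILA_MPIOP_QR_UPPER / MPI_UPPER implementation: it assumes a hypothetical fixed panel width NB, ships full square tiles instead of a packed upper-triangular datatype, and uses LAPACKE for the local factorization.

    #include <mpi.h>
    #include <lapacke.h>
    #include <string.h>

    #define NB 32   /* hypothetical fixed panel width; the real code carries n with the datatype */

    /* Reduction operation: both buffers hold *len upper triangular NB-by-NB R factors,
     * stored here as full column-major NB-by-NB tiles. For each pair, stack the two R
     * factors into a 2NB-by-NB matrix, factor it, and write the new R factor back into
     * the in/out buffer: exactly the binary operation QR([R1; R2]) -> R above.         */
    static void qr_upper_op(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
    {
        double *Rin = (double *)invec, *Rio = (double *)inoutvec;
        (void)dtype;
        for (int k = 0; k < *len; k++, Rin += NB * NB, Rio += NB * NB) {
            double S[2 * NB * NB], tau[NB];
            for (int j = 0; j < NB; j++) {                      /* stack column by column */
                memcpy(S + j * 2 * NB,      Rio + j * NB, NB * sizeof(double));
                memcpy(S + j * 2 * NB + NB, Rin + j * NB, NB * sizeof(double));
            }
            LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 2 * NB, NB, S, 2 * NB, tau);
            for (int j = 0; j < NB; j++)                        /* copy the new R back    */
                for (int i = 0; i <= j; i++)
                    Rio[j * NB + i] = S[j * 2 * NB + i];
        }
    }

    /* Registered once with:  MPI_Op op;  MPI_Op_create(qr_upper_op, 0, &op);
     * (0 = not commutative; see the remark about fixing the signs of the diagonal of R). */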

Julien Langou | University of Colorado Denver Hierarchical QR | 31 of 103

Tall and Skinny matrices | Minimizing communication in parallel distributed (29)

Flat tree, binary tree, a weird tree, another weird tree

[Figure: four reduction trees, each drawn as parallelism vs. time: a flat tree, a binary tree, a weird tree, and another weird tree]

Julien Langou | University of Colorado Denver Hierarchical QR | 32 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 33 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

Latency (ms)   Orsay   Toulouse   Bordeaux   Sophia
Orsay          0.07    7.97       6.98       6.12
Toulouse               0.03       9.03       8.18
Bordeaux                          0.05       7.18
Sophia                                       0.06

Throughput (Mb/s)   Orsay   Toulouse   Bordeaux   Sophia
Orsay               890     78         90         102
Toulouse                    890        77         90
Bordeaux                               890        83
Sophia                                            890

Julien Langou | University of Colorado Denver Hierarchical QR | 34 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

[Figure: three clusters and their domains; Cluster 1: Domains 1,1 to 1,5; Cluster 2: Domains 2,1 to 2,4; Cluster 3: Domains 3,1 and 3,2]

Illustration of ScaLAPACK PDGEQR2 without reduce affinity

Julien Langou | University of Colorado Denver Hierarchical QR | 35 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

[Figure: the same three clusters and domains]

Illustration of ScaLAPACK PDGEQR2 with reduce affinity

Julien Langou | University of Colorado Denver Hierarchical QR | 36 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

[Figure: the same three clusters and domains]

Illustration of TSQR without reduce affinity

Julien Langou | University of Colorado Denver Hierarchical QR | 37 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

[Figure: the same three clusters and domains]

Illustration of TSQR with reduce affinity

Julien Langou | University of Colorado Denver Hierarchical QR | 38 of 103

Tall and Skinny matrices | Minimizing communication in hierarchical parallel distributed (6)

Using 4 clusters of 32 processors.

[Figure, left: Classical Gram-Schmidt; performance in GFlop/sec for pg = 1, 2 or 4 clusters, pc = 32 nodes per cluster, m varies (x-axis), n = 32; experiments and model; 1 cluster: 9.27 GFlop/sec, 2 clusters: 17.19 GFlop/sec, 4 clusters: 28.19 GFlop/sec; markers at m = 8e+06 and m = 16e+06]

[Figure, right: TSQR with a binary tree; same setting; 1 cluster: 24.20 GFlop/sec, 2 clusters: 48.25 GFlop/sec, 4 clusters: 96.20 GFlop/sec; markers at m = 2e+05 and m = 5e+05]

Two effects at once: (1) avoiding communication with TSQR, (2) tuning of the reduction tree.

Julien Langou | University of Colorado Denver Hierarchical QR | 39 of 103

Tall and Skinny matrices | Minimizing communication in sequential (1)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 40 of 103

Tall and Skinny matrices | Minimizing communication in sequential (1)

Consider
  architecture: sequential case, one processing unit with a cache of size W;
  problem: QR factorization of an m-by-n TS matrix (TS: m ≥ n, and W ≥ (3/2)·n²).
⇒ answer: flat tree.

Theory:

             | flat tree | LAPACK-like  | Lower bound
  # flops    | 2mn²      | 2mn²         | Θ(mn²)
  # words    | 2mn       | m²n²/(2W)    | 2mn
  # messages | 3mn/W     | mn²/(2W)     | 2mn/W

Julien Langou | University of Colorado Denver Hierarchical QR | 41 of 103

Rectangular matrices | Minimizing communication in sequential (2)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 42 of 103

Rectangular matrices | Minimizing communication in sequential (2)

[Figure: percent of CPU peak performance vs. n (matrix size, 5000 to 35000), sequential case, LU (or QR) factorization of square matrices, with Mfast = 10⁶, Mslow = 10⁹, β = 10⁸, γ = 10¹⁰; vertical markers at n = Mfast^(1/2) and n = Mslow^(1/2); curves: problem lower bound, upper bound for CAQR-flat-tree, lower bound for the LAPACK algorithm]

Julien Langou | University of Colorado Denver Hierarchical QR | 43 of 103

Rectangular matrices | Minimizing communication in sequential (2)

[Figure: (# of slow memory references) / (problem lower bound) vs. n (size of the matrix), sequential case, LU (or QR) factorization of square matrices, for Mfast = 10³, 10⁶ and 10⁹; curves: lower bound for the LAPACK algorithm, upper bound for CAQR-flat-tree]

Julien Langou | University of Colorado Denver Hierarchical QR | 44 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Outline

Tall and Skinny matrices
  Minimizing communication in parallel distributed (29)
  Minimizing communication in hierarchical parallel distributed (6)
  Minimizing communication in sequential (1)

Rectangular matrices
  Minimizing communication in sequential (2)
  Maximizing parallelism on multicore nodes (30)
  Parallelism+communication on multicore nodes (2)
  Parallelism+communication on distributed+multicore nodes (13)
  Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 45 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Mission

Perform the QR factorization of an initial tiled matrix A on a multicore platform.


Julien Langou | University of Colorado Denver Hierarchical QR | 46 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

for (k = 0; k < TILES; k++) {
    dgeqrt(A[k][k], T[k][k]);
    for (n = k+1; n < TILES; n++)
        dlarfb(A[k][k], T[k][k], A[k][n]);
    for (m = k+1; m < TILES; m++) {
        dtsqrt(A[k][k], A[m][k], T[m][k]);
        for (n = k+1; n < TILES; n++)
            dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]);
    }
}
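As a side note, a minimal sketch of how this loop nest can be handed to a dynamic scheduler using OpenMP task dependences (an illustration only; PLASMA uses its own scheduler, and dgeqrt/dlarfb/dtsqrt/dssrfb here are placeholder prototypes standing in for the tile kernels described on the following slides):

    #include <omp.h>

    /* Prototypes only: placeholders for the tile kernels. */
    void dgeqrt(double *Akk, double *Tkk);
    void dlarfb(double *Akk, double *Tkk, double *Akn);
    void dtsqrt(double *Akk, double *Amk, double *Tmk);
    void dssrfb(double *Amk, double *Tmk, double *Akn, double *Amn);

    /* Tile QR expressed as OpenMP tasks; dependences are declared on the first
     * element of each tile, so the runtime discovers the same DAG as above.   */
    void tile_qr(int TILES, double *A[TILES][TILES], double *T[TILES][TILES])
    {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < TILES; k++) {
            #pragma omp task depend(inout: A[k][k][0]) depend(out: T[k][k][0])
            dgeqrt(A[k][k], T[k][k]);
            for (int n = k + 1; n < TILES; n++)
                #pragma omp task depend(in: A[k][k][0], T[k][k][0]) depend(inout: A[k][n][0])
                dlarfb(A[k][k], T[k][k], A[k][n]);
            for (int m = k + 1; m < TILES; m++) {
                #pragma omp task depend(inout: A[k][k][0], A[m][k][0]) depend(out: T[m][k][0])
                dtsqrt(A[k][k], A[m][k], T[m][k]);
                for (int n = k + 1; n < TILES; n++)
                    #pragma omp task depend(in: A[m][k][0], T[m][k][0]) \
                                     depend(inout: A[k][n][0], A[m][n][0])
                    dssrfb(A[m][k], T[m][k], A[k][n], A[m][n]);
            }
        }
    }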

Julien Langou | University of Colorado Denver Hierarchical QR | 47 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Unrolling the first steps of this loop nest (here for a 4 × 4 tile matrix):

1. dgeqrt(A[0][0], T[0][0]);
2. dlarfb(A[0][0], T[0][0], A[0][1]);
3. dlarfb(A[0][0], T[0][0], A[0][2]);
4. dlarfb(A[0][0], T[0][0], A[0][3]);
5. dtsqrt(A[0][0], A[1][0], T[1][0]);
6. dssrfb(A[1][0], T[1][0], A[0][1], A[1][1]);
7. dssrfb(A[1][0], T[1][0], A[0][2], A[1][2]);
8. dssrfb(A[1][0], T[1][0], A[0][3], A[1][3]);

[Figures: the tiles read and written by each kernel: dgeqrt on the diagonal tile (V, R, T), dlarfb on the tiles to its right, dtsqrt coupling the diagonal R with a tile below it, dssrfb updating the corresponding pair of tiles in each column to the right]

Julien Langou | University of Colorado Denver Hierarchical QR | 58 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

DGEQRT: The kernel performs the QR factorization of a diagonal tile, producing an upper triangular factor R, the Householder reflectors V, and the matrix T resulting from accumulating the reflectors; R and the block of reflectors override the input. The T matrix is stored separately.

DTSQRT: The kernel performs the QR factorization of a matrix built by coupling the R factor, produced by DGEQRT or a previous call to DTSQRT, with a tile below the diagonal tile. The kernel produces an updated R factor, a square matrix V containing the Householder reflectors, and the matrix T resulting from accumulating the reflectors V. The new R factor overrides the old R factor. The block of reflectors overrides the square tile of the input matrix. The T matrix is stored separately.

DLARFB: The kernel applies the reflectors calculated by DGEQRT to a tile to the right of the diagonal tile, using the reflectors V along with the matrix T.

DSSRFB: The kernel applies the reflectors calculated by DTSQRT to two tiles to the right of the tiles factorized by DTSQRT, using the reflectors V and the matrix T produced by DTSQRT.

A naive implementation, where the full T matrix is built, results in 25% more floating point operations than the standard algorithm. In order to minimize this overhead, the idea of inner blocking is used, where the T matrix has a sparse (block-diagonal) structure (Figure 10) [32, 33, 34].

Figure 10: Inner blocking in the tile QR factorization.

Figure 11 shows the pseudocode of the tile QR factorization. Figure 12 shows the task graph of the tile QR factorization for a matrix of 5×5 tiles. Orders of magnitude larger matrices are used in practice. This example only serves the purpose of showing the complexity of the task graph, which is noticeably higher than that of the Cholesky factorization.

Figure 11: Pseudocode of the tile QR factorization.

[Figure 12: Task graph of the tile QR factorization (matrix of size 5 × 5 tiles)]

DAG for 5x5 matrix (QR factorization)

Julien Langou | University of Colorado Denver Hierarchical QR | 59 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

for (i = 0; i < p; i++) {
    DSCHED_dpotrf(dsched, 'L', nb[i], A[i][i], nb[i], info);
    for (j = i+1; j < p; j++)
        DSCHED_dtrsm(dsched, 'R', 'L', 'T', 'N', nb[j], nb[i], 1.0, A[i][i], nb[i], A[j][i], nb[j]);
    for (j = i+1; j < p; j++) {
        for (k = i+1; k < j; k++)
            DSCHED_dgemm(dsched, 'N', 'T', nb[j], nb[k], nb[i], -1.0, A[j][i], nb[j], A[k][i], nb[k], 1.0, A[j][k], nb[j]);
        DSCHED_dsyrk(dsched, 'L', 'N', nb[j], nb[i], -1.0, A[j][i], nb[j], +1.0, A[j][j], nb[j]);
    }
}

Fig. 9. Tile Cholesky factorization that calls the scheduled core linear algebra operations.


Fig. 10. DAG for a LU factorization with 20 tiles (block size 200 and matrix size 4000). The size of the DAG grows very fast with the number of tiles.

… more tasks until some are completed. The usage of a window of tasks has implications for how the loops of an application are unfolded and how much look-ahead is available to the scheduler. This paper discusses some of these implications in the context of dense linear algebra applications.

d) Data Locality and Cache Reuse: It has been shown in the past that the reuse of memory caches can lead to a substantial performance improvement in execution time. Since we are working with tiles of data that should fit in the local caches on each core, we have provided the algorithm designer with the ability to hint the cache locality behavior. A parameter in a call (e.g., Fig. 7) can be decorated with the LOCALITY flag in order to tell the scheduler that the data item (parameter) should be kept in cache if possible. After a computational core (worker) executes that task, the scheduler will by default assign any future task using that data item to the same core. Note that work stealing can disrupt the default assignment of tasks to cores.

The next section studies the performance impact of the locality flag and the window size on the LL and RL variants of the three tile factorizations.

V. EXPERIMENTAL RESULTS

This section describes the analysis of dynamically scheduled tile algorithms for the three factorizations (i.e., Cholesky, QR and LU) on different multicore systems. The tile sizes for these algorithms have been tuned and are equal to b = 200.

A. Hardware Descriptions

In this study, we consider two different shared-memory architectures. The first architecture (System A) is a quad-socket, quad-core machine based on an Intel Xeon EMT64 E7340 processor operating at 2.39 GHz. The theoretical peak is equal to 9.6 Gflops/s per core or 153.2 Gflops/s for the whole node, composed of 16 cores. The practical peak (measured by the performance of a GEMM) is equal to 8.5 Gflops/s per core or 136 Gflops/s for the 16 cores. The level-1 cache, local to the core, is divided into 32 kB of instruction cache and 32 kB of data cache. Each quad-core processor is actually composed of two dual-core Core2 architectures, and the level-2 cache has 2×4 MB per socket (each dual-core shares 4 MB). The machine is a NUMA architecture and it provides Intel Compilers 11.0 together with the MKL 10.1 vendor library.

The second system (System B) is an 8-socket, 6-core AMD Opteron 8439 SE machine (48 cores total at 2.8 GHz) with 128 GB of main memory. Each core has a theoretical peak of 11.2 Gflops/s and the whole machine 537.6 Gflops/s. The practical peak (measured by the performance of a GEMM) is equal to 9.5 Gflops/s per core or 456 Gflops/s for the 48 cores. There are three levels of cache. The level-1 cache consists of 64 kB and the level-2 cache of 512 kB. Each socket is composed of 6 cores, and the level-3 cache is a 6 MB 48-way associative shared cache per socket. The machine is a NUMA architecture and it provides Intel Compilers 11.1 together with the MKL 10.2 vendor library.

B. Performance Discussions

In this section, we evaluate the effect of the window size and the locality feature on the LL and RL tile algorithm variants.

The nested loops describing the tile LL variant codes are naturally ordered in a way that already promotes locality on the data tiles located on the panel. Fig. 11 shows the effect of the locality flag of the scheduler on the overall performance of the tile LL Cholesky variant. As expected, the locality flag does not really improve the performance when using small window …

DAG for 20x20 matrix (QR factorization)


Julien Langou | University of Colorado Denver Hierarchical QR | 60 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

[Figure 5: Gflop/s vs. matrix size (0 to 4000) for the tile QR factorization in single precision on a 3.2 GHz CELL processor with eight SPEs; square matrices were used; the solid horizontal line marks the performance of the SSSRFB kernel times the number of SPEs (22.16 × 8 = 177 Gflop/s)]

"The presented implementation of tile QR factorization on the CELL processor allows for factorization of a 4000-by-4000 dense matrix in single precision in exactly half of a second. To the author's knowledge, at present, it is the fastest reported time of solving such problem by any semiconductor device implemented on a single semiconductor die."

Jakub Kurzak and Jack Dongarra, LAWN 201 – QR Factorization for the CELLProcessor, May 2008.

Julien Langou | University of Colorado Denver Hierarchical QR | 61 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Perform the QR factorization of an initial tiled matrix A.

Our tool: Givens rotations

    ( cos θ   −sin θ )
    ( sin θ    cos θ )
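For reference, a small standalone sketch of the basic elimination step (my illustration; the actual tile kernels apply blocked Householder transformations, and LAPACK's dlartg computes the rotation with a different sign convention and safer scaling): choose c = cos θ and s = sin θ so that the rotation zeroes the second entry of a pair, then apply it to two full rows.

    #include <math.h>

    /* Choose c, s so that [ c  -s ; s  c ] * [ a ; b ] = [ r ; 0 ]. */
    static void givens(double a, double b, double *c, double *s, double *r)
    {
        if (b == 0.0) { *c = 1.0; *s = 0.0; *r = a; return; }
        double h = hypot(a, b);
        *c =  a / h;
        *s = -b / h;
        *r =  h;
    }

    /* Apply the same 2-by-2 rotation to every column of two rows of length n. */
    static void apply_givens(double *row_i, double *row_k, int n, double c, double s)
    {
        for (int j = 0; j < n; j++) {
            double ti = row_i[j], tk = row_k[j];
            row_i[j] = c * ti - s * tk;
            row_k[j] = s * ti + c * tk;
        }
    }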

Julien Langou | University of Colorado Denver Hierarchical QR | 62 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

(p = 15)              q = 1   q = 2   q = 3   q = 4
  binomial tree           4       8      12      16
  flat tree              14      15      16      17
  plasma tree (bs=4)      5       8      11      14
  greedy tree             4       6       8      10

First column: the goal is to introduce zeros using Givens rotations; only the top entry shall stay. At each step, one entry is used to eliminate another. (This is a reduction.)

The first step requires seven computing units, the second four, the third two, and the fourth one.

Note: one point of the plasma tree is to (1) enhance parallelism in the rectangular case, (2) increase data locality (when working within a domain).

[Figure: the time step at which each entry of the 15 × 4 tile grid is eliminated; time axis 1 … 18]

• Flat tree: Sameh and Kuck (1978).
• Plasma tree: Hadri et al. (2010). Note that the binomial tree corresponds to bs = 1, and the flat tree to bs = p.
• Greedy: Modi and Clarke (1984), Cosnard and Robert (1986).
• Cosnard and Robert (1986) proved that no matter the shape of the matrix (i.e., p and q), Greedy was optimal.
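A quick sanity check of the first two rows of this table, under the unit-time model it assumes (each elimination counts for one time step): the binomial tree costs roughly q·⌈log₂ p⌉ steps, and the flat tree (p − 1) + (q − 1) steps, while the plasma and greedy values come from simulating the schedules. A throwaway check of those two closed forms, assuming that model:

    #include <math.h>
    #include <stdio.h>

    /* Closed forms for the critical path (in elimination steps) of a p-by-q grid,
     * under the unit-time model of the table above.                              */
    static int binomial_cp(int p, int q) { return q * (int)ceil(log2((double)p)); }
    static int flat_cp(int p, int q)     { return (p - 1) + (q - 1); }

    int main(void)
    {
        const int p = 15;
        for (int q = 1; q <= 4; q++)
            printf("q=%d  binomial=%2d  flat=%2d\n", q, binomial_cp(p, q), flat_cp(p, q));
        /* Prints 4/14, 8/15, 12/16, 16/17, matching the first two rows of the table. */
        return 0;
    }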

Julien Langou | University of Colorado Denver Hierarchical QR | 63 of 103


Julien Langou | University of Colorado Denver Hierarchical QR | 63 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Critical path length (p = 15):

                     q = 1   q = 2   q = 3   q = 4
binomial tree          4       8      12      16
flat tree             14      15      16      17
plasma tree (bs=4)     5       8      11      14
greedy tree            4       6       8      10

First column: the goal is to introduce zeros using Givens rotations; only the top entry remains. (This is a reduction.)

The first step requires seven computing units, the second four, the third two, and the fourth one. Note: one point of the plasma tree is to (1) enhance parallelism in the rectangular case, and (2) increase data locality (when working within a domain).

[Figure: elimination diagram for the 15x4 tiled matrix; each zeroed entry is labeled with the time step at which it is eliminated.]

• Flat tree - Sameh and Kuck (1978).

• Plasma tree - Hadri et al. (2010). Note that the binomial tree corresponds to bs = 1, and the flat tree to bs = p.

• Greedy - Modi and Clarke (1984), Cosnard and Robert (1986).

• Cosnard and Robert (1986) proved that, no matter the shape of the matrix (i.e., p and q), Greedy is optimal.

[Figure: elimination timeline, time steps 1 to 18.]

Julien Langou | University of Colorado Denver Hierarchical QR | 63 of 103
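The critical path lengths in the table can be reproduced in the coarse-grain model, where every elimination costs one time unit and a row becomes usable in column k once its entries in columns 1..k-1 are zero. The sketch below is my own illustration, not the authors' code (the function name critical_path and the row bookkeeping are my assumptions); it only models the flat and greedy trees.

```python
# Minimal sketch (not the authors' code) of the coarse-grain elimination model:
# every elimination costs one time unit; a row whose entries in columns 1..k-1
# are already zero is "active" in column k. Only flat and greedy are modeled.

def critical_path(p, q, tree):
    """Number of coarse time steps needed to zero a p-by-q tile matrix."""
    z = [0] * (p + 1)              # z[i] = number of leading columns of row i already zeroed
    def done(i):
        return z[i] >= min(i - 1, q)   # row i has no sub-diagonal entry left
    t = 0
    while not all(done(i) for i in range(1, p + 1)):
        t += 1
        newly_zeroed = []
        for k in range(1, q + 1):
            active = [i for i in range(k, p + 1) if z[i] == k - 1]
            victims = active[1:]   # the topmost active row survives as the pivot
            nelim = 1 if tree == "flat" else len(active) // 2   # greedy: as many as possible
            nelim = min(nelim, len(victims))
            newly_zeroed += victims[len(victims) - nelim:]
        for i in newly_zeroed:     # apply the eliminations after scanning all columns
            z[i] += 1
    return t

for q in (1, 2, 3, 4):
    print(q, critical_path(15, q, "flat"), critical_path(15, q, "greedy"))
# flat gives 14 15 16 17 and greedy gives 4 6 8 10, as in the table above.
```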


Rectangular matrices | Maximizing parallelism on multicore nodes (30)

[Slides 64 to 67: step-by-step animation of the elimination, time steps 1 to 18.]

Julien Langou | University of Colorado Denver Hierarchical QR | 67 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Kernels for orthogonal transformations

Operation                              Panel            Update
                                       Name    Cost     Name    Cost
Factor square into triangle            GEQRT   4        UNMQR   6
Zero square with triangle on top       TSQRT   6        TSMQR   12
Zero triangle with triangle on top     TTQRT   2        TTMQR   6

Algorithm 1: Elimination elim(i, piv(i,k), k) via TS kernels.
  GEQRT(piv(i,k), k)
  for j = k+1 to q do
    UNMQR(piv(i,k), k, j)
  TSQRT(i, piv(i,k), k)
  for j = k+1 to q do
    TSMQR(i, piv(i,k), k, j)

(Triangle on top of square)

Algorithm 2: Elimination elim(i, piv(i,k), k) via TT kernels.
  GEQRT(piv(i,k), k)
  GEQRT(i, k)
  for j = k+1 to q do
    UNMQR(piv(i,k), k, j)
    UNMQR(i, k, j)
  TTQRT(i, piv(i,k), k)
  for j = k+1 to q do
    TTMQR(i, piv(i,k), k, j)

(Triangle on top of triangle)

Note: it is understood that if a tile is already in triangle form, then the associated GEQRT and update kernels do not need to be applied.

Julien Langou | University of Colorado Denver Hierarchical QR | 68 of 103
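For concreteness, here is a small transcription of the two eliminations above into task lists, with the weights taken from the kernel table (units of b³/3). This is my own illustration, not the authors' code; it charges GEQRT/UNMQR to every elimination, i.e., it ignores the note above about tiles that are already triangular.

```python
# Sketch: the two eliminations above as task lists, with the kernel weights
# from the table (units of b^3/3). GEQRT/UNMQR are charged to every elimination
# here, i.e., the "already triangular" rule is ignored for simplicity.
W = {"GEQRT": 4, "UNMQR": 6, "TSQRT": 6, "TSMQR": 12, "TTQRT": 2, "TTMQR": 6}

def elim_TS(i, piv, k, q):
    """elim(i, piv(i,k), k) via TS kernels (triangle on top of square)."""
    tasks = [("GEQRT", piv, k)]
    tasks += [("UNMQR", piv, k, j) for j in range(k + 1, q + 1)]
    tasks += [("TSQRT", i, piv, k)]
    tasks += [("TSMQR", i, piv, k, j) for j in range(k + 1, q + 1)]
    return tasks

def elim_TT(i, piv, k, q):
    """elim(i, piv(i,k), k) via TT kernels (triangle on top of triangle)."""
    tasks = [("GEQRT", piv, k), ("GEQRT", i, k)]
    tasks += [("UNMQR", piv, k, j) for j in range(k + 1, q + 1)]
    tasks += [("UNMQR", i, k, j) for j in range(k + 1, q + 1)]
    tasks += [("TTQRT", i, piv, k)]
    tasks += [("TTMQR", i, piv, k, j) for j in range(k + 1, q + 1)]
    return tasks

def weight(tasks):
    return sum(W[t[0]] for t in tasks)

# One elimination in column k = 2 of a matrix with q = 4 tile columns:
print(weight(elim_TS(5, 2, 2, 4)), weight(elim_TT(5, 2, 2, 4)))   # 46 46
# Charged this way, a TS and a TT elimination have the same weight, 10 + 18(q - k).
```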


Rectangular matrices | Maximizing parallelism on multicore nodes (30)

The analysis and coding is a little more complex ... [Figure: weighted elimination timeline, time steps 2 to 142.]

Julien Langou | University of Colorado Denver Hierarchical QR | 69 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Theorem. No matter which elimination list (any combination of TT and TS eliminations) is used, the total weight of the tasks for performing a tiled QR factorization is constant and equal to 6pq² − 2q³.

Proof:

L1 :: #GEQRT = #TTQRT + q

L2 :: #UNMQR = #TTMQR + (1/2) q (q − 1)

L3 :: #TTQRT + #TSQRT = pq − (1/2) q (q + 1)

L4 :: #TTMQR + #TSMQR = (1/2) pq (q − 1) − (1/6) q (q − 1)(q + 1)

Define L5 as 4 L1 + 6 L2 + 6 L3 + 12 L4 and we get

L5 :: 6pq² − 2q³ = 4 #GEQRT + 12 #TSMQR + 6 #TTMQR + 2 #TTQRT + 6 #TSQRT + 6 #UNMQR

L5 :: = total weight of the tasks (total number of flops)

Note: using our unit task weight of b³/3, with m = pb and n = qb, we obtain 2mn² − (2/3)n³ flops, which is exactly the same number as for a standard Householder reflection algorithm as found in LAPACK or ScaLAPACK.

Julien Langou | University of Colorado Denver Hierarchical QR | 70 of 103
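The identity can be checked by brute force on two concrete elimination schemes: a flat tree with TS kernels and a flat tree with TT kernels (this time applying the "already triangular" rule, so GEQRT/UNMQR are counted once per tile of the panel). This is my own counting sketch; the function names are assumptions, not the authors' code.

```python
# Sketch: total task weight (units of b^3/3) of a tiled QR on a p-by-q tile
# matrix, for two concrete elimination schemes, checked against 6*p*q^2 - 2*q^3.
W = {"GEQRT": 4, "UNMQR": 6, "TSQRT": 6, "TSMQR": 12, "TTQRT": 2, "TTMQR": 6}

def weight_flat_TS(p, q):
    """Sameh-Kuck flat tree with TS kernels: row k annihilates rows k+1..p in column k."""
    total = 0
    for k in range(1, q + 1):
        total += W["GEQRT"] + (q - k) * W["UNMQR"]       # factor the pivot tile, update its row
        total += (p - k) * (W["TSQRT"] + (q - k) * W["TSMQR"])
    return total

def weight_flat_TT(p, q):
    """Flat tree with TT kernels: GEQRT/UNMQR applied once per tile of the panel."""
    total = 0
    for k in range(1, q + 1):
        total += (p - k + 1) * (W["GEQRT"] + (q - k) * W["UNMQR"])
        total += (p - k) * (W["TTQRT"] + (q - k) * W["TTMQR"])
    return total

for p, q in [(15, 4), (40, 40), (7, 3)]:
    assert weight_flat_TS(p, q) == weight_flat_TT(p, q) == 6 * p * q * q - 2 * q ** 3
print("total weight matches 6*p*q^2 - 2*q^3 for the tested shapes")
```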

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Theorem. For a tiled matrix of size p × q, where p ≥ q, the critical path length of FLATTREE is

2p + 2            if p ≥ q = 1

6p + 16q − 22     if p > q > 1

22p − 24          if p = q > 1

[Figure: weighted FLATTREE elimination timeline, time steps 2 to 64, annotated with 2nd row, 3rd row, 4th row, 2nd column, 3rd column.]

Initial: 10; fill the pipeline: 6(p − 1); pipeline: 16(q − 2); end: 4.

Julien Langou | University of Colorado Denver Hierarchical QR | 71 of 103
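As a direct transcription of the theorem (the function name cp_flattree is mine):

```python
# Critical path length of FLATTREE on a p-by-q tile matrix (p >= q),
# in units of b^3/3, transcribed from the theorem above.
def cp_flattree(p, q):
    if q == 1:
        return 2 * p + 2
    if p == q:
        return 22 * p - 24
    return 6 * p + 16 * q - 22        # p > q > 1

print(cp_flattree(15, 1), cp_flattree(15, 4), cp_flattree(15, 15))   # 32 132 306
```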

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Greedy tree on a 15x4 matrix (weighted on top, coarse below). [Figure: weighted elimination timeline, time steps 2 to 132, and coarse elimination timeline, time steps 1 to 18.]

Julien Langou | University of Colorado Denver Hierarchical QR | 72 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Theorem. For a tiled matrix of size p × q, where p ≥ q, given the time COARSE(p, q) of any coarse-grain algorithm, the corresponding weighted algorithm time WEIGHTED(p, q) satisfies

10 (q − 1) + 6 COARSE(p, q − 1) + 4 + 2 ≤ WEIGHTED(p, q) ≤ 10 (q − 1) + 6 COARSE(p, q − 1) + 4 + 2 (COARSE(p, q) − COARSE(p, q − 1))

Julien Langou | University of Colorado Denver Hierarchical QR | 73 of 103
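Transcribed directly (the function name weighted_bounds is mine), and evaluated with the greedy coarse times of the earlier 15x4 table, COARSE(15,3) = 8 and COARSE(15,4) = 10:

```python
# Bounds of the theorem above: given the coarse-grain times COARSE(p, q) and
# COARSE(p, q-1), bound the weighted algorithm time WEIGHTED(p, q).
def weighted_bounds(coarse_q, coarse_qm1, q):
    lower = 10 * (q - 1) + 6 * coarse_qm1 + 4 + 2
    upper = 10 * (q - 1) + 6 * coarse_qm1 + 4 + 2 * (coarse_q - coarse_qm1)
    return lower, upper

print(weighted_bounds(10, 8, 4))   # (84, 86) for the greedy tree on a 15x4 tile matrix
```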

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Theorem. We can prove that the GRASAP algorithm is optimal.

Julien Langou | University of Colorado Denver Hierarchical QR | 74 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Platform

All experiments were performed on a 48-core machine composed of eight hexa-core AMD Opteron 8439 SE (codename Istanbul) processors running at 2.8 GHz. Each core has a theoretical peak of 11.2 GFlop/s, with a peak of 537.6 GFlop/s for the whole machine.

Experimental code is written using the QUARK scheduler.

Results are checked with ‖I − QᵀQ‖ and ‖A − QR‖/‖A‖. The code has been written in real and complex arithmetic, single and double precision.

Julien Langou | University of Colorado Denver Hierarchical QR | 75 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200, Model)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT) = [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ]

[Figure: overhead in critical path length with respect to GREEDY (GREEDY = 1), as a function of q = 1 to 40.]

Julien Langou | University of Colorado Denver Hierarchical QR | 76 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200, Experimental)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT) = [1 5 5 5 17 28 8]

[Figure: overhead in time with respect to GREEDY (GREEDY = 1), as a function of q = 1 to 40.]

Julien Langou | University of Colorado Denver Hierarchical QR | 77 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200, Model on left, Experimental on right)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT): model [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ], experimental [1 5 5 5 17 28 8]

[Figure: overhead in critical path length (model, left) and in time (experimental, right) with respect to GREEDY (GREEDY = 1), as a function of q = 1 to 40.]

The main difference is on the right, where there is enough parallelism so that all methods give "peak" performance (peak is the TTMQR performance).

Julien Langou | University of Colorado Denver Hierarchical QR | 78 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT) = [1 5 5 5 17 28 8]

[Figure: GFLOP/s as a function of q = 1 to 40.]

Julien Langou | University of Colorado Denver Hierarchical QR | 79 of 103

Rectangular matrices | Maximizing parallelism on multicore nodes (30)

Complex Arithmetic (p=40,b=200, Model on left, Experimental on right)

Curves: FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TT): model [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ], experimental [1 5 5 5 17 28 8]

[Figure: predicted GFLOP/s (model, left) and measured GFLOP/s (experimental, right), as a function of q = 1 to 40.]

Julien Langou | University of Colorado Denver Hierarchical QR | 80 of 103

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Outline

Tall and Skinny matrices
Minimizing communication in parallel distributed (29)
Minimizing communication in hierarchical parallel distributed (6)
Minimizing communication in sequential (1)

Rectangular matrices
Minimizing communication in sequential (2)
Maximizing parallelism on multicore nodes (30)
Parallelism+communication on multicore nodes (2)
Parallelism+communication on distributed+multicore nodes (13)
Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 81 of 103

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Factor Kernels

code name   # flops
GEQRT       4
TTQRT       2
TSQRT       6

• We work on square b-by-b tiles.
• The unit for the task weights is b³/3.
• Each of our tasks performs O(b³) computation for O(b²) communication.
• ⇒ Thanks to this surface/volume effect, we can, in a first approximation, neglect communication and focus only on parallelism.

Julien Langou | University of Colorado Denver Hierarchical QR | 82 of 103
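The surface-to-volume argument in the last two bullets can be made concrete with a two-line computation (my own illustration, under the stated O(b³)/O(b²) assumption): the flop-to-word ratio of a tile kernel grows linearly with the tile size b.

```python
# Each kernel does O(b^3) flops on O(b^2) words, so the flop-to-word ratio is O(b).
for b in (100, 200, 400):
    flops = b ** 3          # order of magnitude of the computation per kernel
    words = b ** 2          # order of magnitude of the data touched per kernel
    print(b, flops // words)    # the ratio equals b
```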

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Using TT kernels

GEQRT, GEQRT, then TTQRT

Total # of flops (sequential time): 2 GEQRT + TTQRT = 2 · 4 + 2 = 10

Critical path length (parallel time): GEQRT + TTQRT = 4 + 2 = 6

Using TS kernels

GEQRT, then TSQRT

Total # of flops (sequential time): GEQRT + TSQRT = 4 + 6 = 10

Critical path length (parallel time): GEQRT + TSQRT = 4 + 6 = 10

One remarkable outcome is that the total number of flops is 10b³/3 in both cases. If we consider the number of flops for a standard LAPACK algorithm (based on standard Householder transformations), we also obtain 10b³/3.

Julien Langou | University of Colorado Denver Hierarchical QR | 83 of 103
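The arithmetic above as a two-line check (my own illustration; the weights are the kernel costs from the previous slide):

```python
# 2x1-tile reduction: same total weight, different critical path (units of b^3/3).
W = {"GEQRT": 4, "TTQRT": 2, "TSQRT": 6}

tt = ["GEQRT", "GEQRT", "TTQRT"]   # the two GEQRT are independent and can run in parallel
ts = ["GEQRT", "TSQRT"]            # strictly sequential chain

print("TT: flops", sum(W[t] for t in tt), "critical path", W["GEQRT"] + W["TTQRT"])   # 10, 6
print("TS: flops", sum(W[t] for t in ts), "critical path", W["GEQRT"] + W["TSQRT"])   # 10, 10
```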

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Curves: TSQRT, GEQRT, TTQRT, GEQRT+TTQRT, GEMM (each in and out of cache).

[Figure: kernel performance (GFLOP/s) as a function of the tile size (100 to 600).]

The TS kernels (factorization and update) are more efficient than the TT ones due to a variety of reasons (better vectorization, better volume-to-surface ratio).

Julien Langou | University of Colorado Denver Hierarchical QR | 84 of 103

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Real Arithmetic (p=40,b=200, Model on left, Experimental on right)

Curves: FLATTREE (TS), PLASMATREE (TS) (best), FLATTREE (TT), PLASMATREE (TT) (best), FIBONACCI (TT), GREEDY.
Best domain size for PLASMATREE (TS): model [ 1 1 1 1 5 5 5 5 5 5 10 ... 10 ], experimental [1 3 6 11 12 18 32]
Best domain size for PLASMATREE (TT): model [ 1 3 5 5 5 10 10 10 10 10 20 ... 20 ], experimental [1 3 10 5 17 27 19]

[Figure: predicted GFLOP/s (model, left) and measured GFLOP/s (experimental, right), as a function of q = 1 to 40.]

⇒ This motivates the introduction of a TS level. Square 1-by-1 tiles are grouped into larger rectangular tiles of size a-by-1, which are eliminated in a TS fashion. The regular TT algorithm is then applied on the rectangular tiles.

Pros: recovers the performance of the TS kernels. Cons: introduces a tuning parameter (a).

Julien Langou | University of Colorado Denver Hierarchical QR | 85 of 103

Rectangular matrices | Parallelism+communication on multicore nodes (2)

Real Arithmetic (p=40,b=200, performance on left, overhead on right)

Curves: BEST, FT (TS), BINOMIAL (TT), FT (TT), FIBO (TT), Greedy (TT), Greedy+RRon+a=8, Greedy+RRoff+a=8 (P=40, MB=200, IB=32, 48 cores).

[Figure: performance (GFlop/s, left) and overhead with respect to BEST (right), as a function of q = 1 to 40.]

Idea: use TS domains of size a, use Greedy on top of them, and use round-robin on the domains. (This experiment was done with DAGuE.)

Julien Langou | University of Colorado Denver Hierarchical QR | 86 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Outline

Tall and Skinny matrices
Minimizing communication in parallel distributed (29)
Minimizing communication in hierarchical parallel distributed (6)
Minimizing communication in sequential (1)

Rectangular matrices
Minimizing communication in sequential (2)
Maximizing parallelism on multicore nodes (30)
Parallelism+communication on multicore nodes (2)
Parallelism+communication on distributed+multicore nodes (13)
Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 87 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Outline

Three goals:

1. minimize the number of local communications

2. minimize the number of parallel distributed communications

3. maximize the parallelism within a node

Julien Langou | University of Colorado Denver Hierarchical QR | 88 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Hierarchical trees

Julien Langou | University of Colorado Denver Hierarchical QR | 89 of 103


Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Hierarchical algorithm layout

Global view / Local view

Legend: P0, P1, P2
0: local TS
1: local TT
2: domino
3: global tree

Julien Langou | University of Colorado Denver Hierarchical QR | 90 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Outline

Three goals:

1. minimize the number of local communications

2. minimize the number of parallel distributed communications

3. maximize the parallelism within a node

Four hierarchical trees

1. TS level :: flat tree
   eliminate some tiles with TS kernels

2. node level :: flat/greedy/binary/fibonacci tree
   eliminate the remaining non-coupled tiles locally

3. local level :: domino tree
   take care of the coupling between local computations and distributed computations

4. distributed level :: flat/greedy/binary/fibonacci tree
   high level, reduce the tiles to one in parallel distributed

(A sketch of this level assignment, for a single tile column, is given after this slide.)

Julien Langou | University of Colorado Denver Hierarchical QR | 91 of 103
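As a rough illustration of how the levels share the work, here is my own sketch, not the authors' code: the function name, the perfectly balanced layout, and the use of flat reductions at every level are assumptions, and the domino level (which couples consecutive columns) is omitted in this single-column view.

```python
# Sketch: which level handles each elimination of a single tile column of p
# rows, when the rows are spread over `nodes` nodes, each holding
# `domains_per_node` TS domains of size a (flat reductions at every level).

def hierarchical_eliminations(p, nodes, domains_per_node, a):
    assert p == nodes * domains_per_node * a, "assume a perfectly balanced layout"
    ts, intra_node, inter_node = [], [], []
    for n in range(nodes):
        node_head = n * domains_per_node * a
        for d in range(domains_per_node):
            head = node_head + d * a
            for i in range(head + 1, head + a):        # level 1: TS inside the domain
                ts.append((i, head))
            if d > 0:                                  # level 2: domain heads within the node
                intra_node.append((head, node_head))
        if n > 0:                                      # level 4: node heads across nodes
            inter_node.append((node_head, 0))
    return ts, intra_node, inter_node

ts, intra, inter = hierarchical_eliminations(p=240, nodes=4, domains_per_node=6, a=10)
print(len(ts), len(intra), len(inter), len(ts) + len(intra) + len(inter))
# 216 TS, 20 intra-node, 3 inter-node eliminations: 239 = p - 1 in total.
```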

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Hierarchical trees

Julien Langou | University of Colorado Denver Hierarchical QR | 92 of 103


Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Platform

Cluster Edel from Grid5000, Grenoble
• 60 nodes
• 2 Nehalem Xeon E5520 at 2.27 GHz per node (8 cores)
• 24 GB per node
• Infiniband 20G network
• Theoretical peak performance:
◦ 9.08 GFlop/s per core
◦ 72.64 GFlop/s per node
◦ 4.358 TFlop/s for the whole machine

Julien Langou | University of Colorado Denver Hierarchical QR | 93 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Putting the results in perspective

Peak: the peak performance of a core is 9.08 GFlop/s; ×8 for a node gives 72.64 GFlop/s; ×60 for the parallel distributed platform gives 4358.4 GFlop/s.

Kernels (one core): the TSMQR kernel performs at 7.21 GFlop/s (79.41% of peak); the TTMQR kernel performs at 6.28 GFlop/s (69.17% of peak).

In shared memory on a node (one node, eight cores): flat tree TS gets 76.8% of peak with 55.83 GFlop/s; flat tree TT gets 66.4% of peak with 48.22 GFlop/s.

Our parallel distributed code (sixty nodes, 480 cores): on 60 nodes we top out at 3 TFlop/s, that is 68.8% of peak.

Julien Langou | University of Colorado Denver Hierarchical QR | 94 of 103
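The percentages quoted above can be re-derived from the raw numbers (my own check):

```python
# Re-derive the efficiencies quoted above (all performance numbers in GFlop/s).
peak_core = 9.08
peak_node = peak_core * 8          # 72.64
peak_machine = peak_node * 60      # 4358.4
for label, perf, peak in [("TSMQR kernel (one core)",         7.21,   peak_core),
                          ("TTMQR kernel (one core)",         6.28,   peak_core),
                          ("flat tree TS (one node)",         55.83,  peak_node),
                          ("flat tree TT (one node)",         48.22,  peak_node),
                          ("parallel distributed (60 nodes)", 3000.0, peak_machine)]:
    print(f"{label}: {100 * perf / peak:.1f}% of peak")
```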

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Influence of TS kernels and trees (N=4480)

[Figure: performance (GFlop/s) as a function of M; curves for TS-domain sizes a = 1, 4, 8 combined with greedy and binary trees (left) and with flat and fibonacci trees (right).]

Influence of the TS level size with a fixed local reduction tree (left: Greedy, right: Flat).

⇒ a has to be adapted to the matrix size.

Low-level tree: FLATTREE is slower than the three others.

High-level tree: FLATTREE is slightly better than the others.

Julien Langou | University of Colorado Denver Hierarchical QR | 95 of 103


Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Influence of the domino

[Figure: influence of the low-level tree and the domino optimization: performance (GFlop/s), number of tasks (×1000), and percentage of each category of tasks (GEQRT+UNMQR, TT kernels, TS kernels), as a function of M, with and without the domino, for low-level trees flat, fibonacci, greedy, and binary.]

N=4480, P=15, Q=4, MB=280, a=4, High level tree set to Fibonacci

Julien Langou | University of Colorado Denver Hierarchical QR | 96 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Scaling experiments (N=4480, M varies)

[Figure: performance (GFlop/s) as a function of M (N = 4,480), for HQR, ScaLAPACK, [BBD+10], and [SLHD10]; theoretical peak: 4358.4 GFlop/s.]

P=15, Q=4, MB=280, FIBONACCI/FIBONACCI, a=4, domino enabled.

The matrix is [16-100]x16 tiles.

Julien Langou | University of Colorado Denver Hierarchical QR | 97 of 103

Rectangular matrices | Parallelism+communication on distributed+multicore nodes (13)

Scaling experiments (M=67200, N varies)

[Figure: performance (GFlop/s) as a function of N (M = 67,200), for HQR, ScaLAPACK, [BBD+10], and [SLHD10]; theoretical peak: 4358.4 GFlop/s.]

P=15, Q=4, MB=280, FIBONACCI/FLATTREE. For N ≤ 16800: a=1, domino enabled. For N > 16800: a=4, domino disabled.

The matrix is 240x[16-240] tiles.

Julien Langou | University of Colorado Denver Hierarchical QR | 98 of 103

Rectangular matrices | Scheduling on multicore nodes (4)

Outline

Tall and Skinny matrices
Minimizing communication in parallel distributed (29)
Minimizing communication in hierarchical parallel distributed (6)
Minimizing communication in sequential (1)

Rectangular matrices
Minimizing communication in sequential (2)
Maximizing parallelism on multicore nodes (30)
Parallelism+communication on multicore nodes (2)
Parallelism+communication on distributed+multicore nodes (13)
Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 99 of 103

Rectangular matrices | Scheduling on multicore nodes (4)

We consider an n-by-n matrix with nb-by-nb tiles, and set n = t · nb. We take nb³/3 as the unit of flops; this makes n and nb disappear from the problem.

The total number of flops in Cholesky is t³.

The length of the critical path is 9t − 10.

Therefore, on p threads, the execution time of Cholesky is at least

max( t³ / p , 9t − 10 ).

This is our first lower bound on any execution time (so in particular on the optimal execution time).

[Figure: speedup as a function of p (number of threads), scalability with t = 20, together with lower bound #1.]

Note: the expected speedup is therefore below this dashed line. (The lower bound on time becomes an upper bound on speedup.)

Julien Langou | University of Colorado Denver Hierarchical QR | 100 of 103
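The bound, and the resulting cap on speedup, as a small sketch (the function names are mine):

```python
# Lower bound on the execution time of tiled Cholesky on p threads
# (t tiles per dimension, unit of flops nb^3/3), and the induced upper bound
# on speedup, as used for the dashed line in the plot above.
def time_lower_bound(t, p):
    return max(t ** 3 / p, 9 * t - 10)

def speedup_upper_bound(t, p):
    return t ** 3 / time_lower_bound(t, p)

t = 20
for p in (8, 16, 32, 48, 64, 128):
    print(p, round(speedup_upper_bound(t, p), 1))
# The speedup saturates at t^3 / (9t - 10) = 8000 / 170, about 47, however many threads are used.
```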

Rectangular matrices | Scheduling on multicore nodes (4)

We consider three strategies:

1. max: list schedule with priority given to the task with the maximum flops in all of its children (and itself),

2. rand: list schedule with random task selection,

3. min: list schedule with priority given to the task with the minimum flops in all of its children (and itself).

(A minimal list-scheduling sketch follows after this slide.)

[Figure: speedup (left) and % overhead with respect to the basic lower bound (right), as a function of p (number of cores), with t = 20; curves: lower bound #1, strat. max, strat. rand, strat. min.]

If we are not happy with this 40% discrepancy between the lower bound and the upper bound, wehave two choices ... either find a greater lower bound or find a lower upper bound ... or both ...

Julien Langou | University of Colorado Denver Hierarchical QR | 101 of 103
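A generic list scheduler with the three priority rules might look as follows. This is my own sketch, not the authors' implementation: the toy DAG and the proxy priority (weight of the task plus its direct successors) are assumptions; the real experiments schedule the tiled Cholesky DAG.

```python
# Sketch of list scheduling on p threads with max / random / min priorities.
import heapq, random

def list_schedule(weights, succ, p, strategy, seed=0):
    rng = random.Random(seed)
    npred = {t: 0 for t in weights}
    for t in succ:
        for s in succ[t]:
            npred[s] += 1
    # proxy priority: weight of the task plus the weights of its direct successors
    prio = {t: weights[t] + sum(weights[s] for s in succ[t]) for t in weights}
    key = {"max": lambda t: -prio[t],
           "min": lambda t: prio[t],
           "rand": lambda t: rng.random()}[strategy]
    ready = [t for t in weights if npred[t] == 0]
    running, now = [], 0                      # running: heap of (finish time, task)
    while ready or running:
        while ready and len(running) < p:     # start as many ready tasks as threads allow
            ready.sort(key=key)
            t = ready.pop(0)
            heapq.heappush(running, (now + weights[t], t))
        now, t = heapq.heappop(running)       # advance to the next task completion
        for s in succ[t]:
            npred[s] -= 1
            if npred[s] == 0:
                ready.append(s)
    return now

# Toy DAG: A -> {B, C, D} -> E, with Cholesky-like kernel weights.
weights = {"A": 1, "B": 3, "C": 6, "D": 3, "E": 1}
succ = {"A": ["B", "C", "D"], "B": ["E"], "C": ["E"], "D": ["E"], "E": []}
for s in ("max", "rand", "min"):
    print(s, list_schedule(weights, succ, p=2, strategy=s))   # max: 8, min: 11
```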


Rectangular matrices | Scheduling on multicore nodes (4)

[Figure: speedup (left) and % overhead with respect to the current lower bound (right), as a function of the number of cores, scalability with p = 20; curves: lower bound in time, new bound, max, random, min.]

If we are not happy with this 5% discrepancy between the lower bound and theupper bound, we have two choices ... either find a greater lower bound or finda lower upper bound ... or both ...

Julien Langou | University of Colorado Denver Hierarchical QR | 103 of 103

Rectangular matrices | Scheduling on multicore nodes (4)

• reducing communication
◦ in sequential
◦ in parallel distributed

• increasing parallelism
(or reducing the critical path, reducing synchronization)

Tall and Skinny matrices
Minimizing communication in parallel distributed (29)
Minimizing communication in hierarchical parallel distributed (6)
Minimizing communication in sequential (1)

Rectangular matrices
Minimizing communication in sequential (2)
Maximizing parallelism on multicore nodes (30)
Parallelism+communication on multicore nodes (2)
Parallelism+communication on distributed+multicore nodes (13)
Scheduling on multicore nodes (4)

Julien Langou | University of Colorado Denver Hierarchical QR | 104 of 103

Thank you!

Julien Langou | University of Colorado Denver Hierarchical QR | 105 of 103