TRANSCRIPT
Scheduling and ordering issues in Sequential Task Flow parallel multifrontal methods and preliminary work on memory-aware scheduling
E. Agullo, A. Buttari, A. Guermouche and F. Lopez, INRIA-LaBRI, CNRS-IRIT, Université de Bordeaux, UPS-IRIT
SOLHAR meeting on scheduling, April 2014, Lyon
Context of the work
Runtime systems
[Figure: an application programmed directly for a heterogeneous architecture (xPU0/xM0, xPU1/xM1, yPU0/yM0).]
The classical approach is based on a mixture of technologies (e.g., MPI+OpenMP+CUDA) which
• requires a big programming effort
• is difficult to maintain and update
• is prone to (performance) portability issues
Runtime systems
[Figure: the application now runs on top of a runtime layer (scheduling engine, memory manager, xPU/yPU drivers) that maps a DAG of tasks A, B, C operating on data x, y onto the architecture (xPU0/xM0, xPU1/xM1, yPU0/yM0).]
• runtimes provide an abstraction layer that hides the architecture details
• the workload is expressed as a DAG of tasks where the dependencies are
  ◦ defined explicitly
  ◦ defined through rules
  ◦ automatically inferred
• the scheduler decides when/where to execute a task
• the drivers deploy the code on the devices
• the memory manager does the memory transfers and guarantees the consistency
Runtime systems
Runtime systems are becoming widely adopted in scientific computing, especially for dense linear algebra libraries:
• PLASMA (QUARK)
• DPLASMA (PaRSEC)
• MAGMA-MORSE (StarPU)
• FLAME (SuperMatrix)
It is, however, much more challenging to use them for complex and irregular workloads, as in sparse computations.
Objective
The objective of this work is to assess the usability of runtime systems for sparse factorization methods and evaluate their effectiveness on single-node, multicore systems¹
¹ Very related work from P. Ramet et al. on the PaStiX solver.
The multifrontal QR method
The Multifrontal QR method
The multifrontal QR factorization is guided by a graph called the elimination tree:
• each node is associated with a relatively small dense matrix called frontal matrix (or front) containing k pivots to be eliminated along with all the other coefficients concerned by their elimination
The Multifrontal QR method
The tree is traversed in topological order (i.e., bottom-up) and, at each node, two operations are performed:
• assembly: coefficients from the original matrix associated with the pivots and contribution blocks produced by the treatment of the child nodes are stacked to form the frontal matrix
• factorization: the k pivots are eliminated through a complete QR factorization of the frontal matrix. As a result we get:
  ◦ part of the global R and Q factors
  ◦ a triangular contribution block that will be assembled into the father's front
The multifrontal QR method
Notable differences with multifrontal LU:
• fronts are rectangular, either over- or under-determined
• assembly operations are just copies (with lots of indirect addressing) and not sums. They can thus be done in any order (like in LU) but also in parallel (most likely not efficient because of false sharing issues)
• fronts are not full: they have a staircase structure. The zeroes in the lower-leftmost part can be ignored. This irregular structure makes the modeling of performance rather difficult
• fronts are completely factorized and not just partially. This makes the overall size of the factors bigger and thus the active memory consumption less sensitive to the tree traversal
• contribution blocks are trapezoidal and not square
The qr_mumps approach
Parallelism
Fine granularity is achieved through a 1-D block partitioning of fronts and the definition of six elementary operations:
1. activate(front): just allocate the memory required to process the node
2. init(front): compute the structure of the front
3. panel(bcol): QR factorization of a column
4. update(bcol): update of a column in the trailing submatrix wrt a panel
5. assemble(bcol): assembly of a column of the contribution block into the father
6. clean(front): cleanup the front (deallocate memory and store factors)
Parallelism: a new approach
If a task is defined as the execution of one elementary operation on a block-column or a front, then the entire multifrontal factorization can be represented as a Directed Acyclic Graph (DAG).
[Figure: per-front task DAGs for three example fronts, built from the elementary operations (a = activate, p = panel, u = update, s = assemble, c = clean); the assembly tasks connect each front's DAG to its father's.]
From a DAG of DAGs to just one single huge DAG
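Concretely, the gluing comes from the assembly tasks. With access modes as in our illustrative reading of the kernels (the slides do not spell them out):

   call starpu_submit(assemble , c, j, f)  ! R: column j of the contribution
                                           ! block of child c; RW: the target
                                           ! columns of father f

such a task can only run after the updates that complete column j of c, and before the panel/update tasks of f that touch the assembled columns, which is exactly what stitches the per-front DAGs into one global DAG.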
Parallelism: a new approach
The scheduling is performed by a finely-tuned, hand-written code:
+ the fine-grained decomposition and the asynchronous/dynamic scheduling deliver high concurrency and much better performance compared to the classical approach (SPQR)
− the scheduler is not scalable (the search for ready tasks in the DAG is inefficient)...
− ... extremely difficult to maintain...
− ... and not really portable
All these problems may be overcome by replacing the scheduler with a modern runtime system.
StarPU multifrontal QR
The multifrontal QR factorization: StarPU integration
StarPU is a runtime environment that lets the programmer achieve parallelism through a Sequential Task Flow model:
• The parallel code looks exactly the same as the sequential one except that elementary operations are not executed but submitted to the system
• Depending on how elementary operations access data (whether in read or write mode) and on the (sequential) order of submission, StarPU can infer dependencies among them and build a DAG which is used to drive the parallel execution
• The StarPU scheduler is in charge of deploying the DAG on the underlying architecture
• The StarPU memory manager moves data from one memory to another and maintains the global memory coherency
Other approaches are possible, such as the Parameterized Task Graph model used in PaRSEC (Florent goes to Rocky Top).
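As a minimal illustration of how the dependencies are inferred (using the same simplified starpu_submit notation as the listings below; the access modes shown are our assumptions, chosen to match the kernels' semantics):

   call starpu_submit(panel , f, p)      ! RW: column p of f
   call starpu_submit(update, f, u, p)   ! R: column p, RW: column u
                                         !   -> inferred: after panel(f, p)
   call starpu_submit(panel , f, p+1)    ! RW: column p+1
                                         !   -> inferred: after update(f, p+1, p)

Updates of distinct columns u with respect to the same panel write disjoint data and can therefore run in parallel.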
The multifrontal QR factorization: StarPU integration
Sequential qr_mumps code
do f=1, nfronts ! in postorder
! activate front
call activate(f)
! init front
call init(f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call assemble(c, j, f)
end do
! cleanup child
call clean(c)
end do
do p=1, f%n
! panel reduction of column p
call panel(f, p)
do u=p+1, f%n
! update of column u with panel p
call update(f, u, p)
end do
end do
end do
The multifrontal QR factorization: StarPU integration
STF parallel qr_mumps code
do f=1, nfronts ! in postorder
! activate front
call starpu_submit(activate , f)
! init front
call starpu_submit(init , f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call starpu_submit(assemble , c, j, f)
end do
! cleanup child
call starpu_submit(clean , c)
end do
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
end do
! wait for the tasks to be executed
call starpu_waitall ()
2D partitioning + CA front factorization
1D block-column partitioning is not very well suited for the case where frontal matrices are (strongly) overdetermined... "Houston, we have a problem"
Thanks to the simplicity of the STF programming model it is possible to plug in communication-avoiding methods for factorizing the frontal matrices with a relatively moderate effort:
+ 2D block partitioning (not necessarily square)
+ more concurrency
− more complex dependencies
− many more tasks
− more sensitive to runtime overhead
2D partitioning + CA front factorization
do f=1, nfronts ! in sequential order
call starpu_submit(activate , f) ! activate front
call starpu_submit(init , f) ! init front
do c=1, f%nchildren ! for all the children of f
do i=1,c%m
do j=1,c%n
call starpu_submit(assemble , c, i, j, f) ! assemble block (i,j) of c into f
end do
end do
call starpu_submit(clean , c) ! cleanup child
end do
! CA factorization of the front: step s=0 factorizes each block-row of
! panel k independently (geqrt); each step s>0 merges pairs of block-rows
! at distance 2**(s-1) (tpqrt), forming a binary reduction tree
ca_facto: do k=1, min(f%m,f%n)
do s=0, log2(f%m-k+1)
do i = k, f%m, 2**s ! i runs over block-rows (f%m, not f%n)
if(s.eq.0) then
call starpu_submit(geqrt , f, k, i)
do j=k+1, f%n
call starpu_submit(gemqrt , f, k, i, j)
end do
else
l = i+2**(s-1)
call starpu_submit(tpqrt , f, k, i, l)
do j=k+1, f%n
call starpu_submit(tpmqrt , f, k, i, l, j)
end do
end if
end do
end do
end do ca_facto
end do
call starpu_waitall () ! wait for the tasks to be executed
Scheduling issues in shared memory settings
How to schedule the tasks of such a complex DAG?
• how to identify the critical path?
• how to speed up operations along the critical path?
• how to improve the locality of data? (see the work of Abdou and Andra on contexts)
• how to deal with NUMA settings?
• the 2D blocking does not have to be used on all fronts. When should we switch to 2D blocking? (see the sketch below)
  ◦ only larger fronts?
  ◦ only overdetermined fronts?
  ◦ only topmost fronts?
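For instance, the switch could be driven by a simple per-front test. This is a hypothetical heuristic sketching the options above, with made-up thresholds; it is not the qr_mumps policy:

   ! hypothetical heuristic: 2D blocking only for large and
   ! (strongly) overdetermined fronts
   logical function use_2d(m, n)
     implicit none
     integer, intent(in) :: m, n   ! front dimensions
     use_2d = (m*n > 4000000) .and. (m > 2*n)
   end function use_2d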
Scheduling issues in shared memory settings
qr_mumps uses a logical tree pruning based on the Geist+Ng algorithm (a simplified sketch follows the list below): a layer of the tree is computed such that the subtrees below it provide enough balanced work for all processes, and each pruned subtree is treated as a single, sequential task. This pruning:
• reduces the complexity of the scheduling
• reduces the overhead of parallelism
• improves the locality of data
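A minimal sketch of the layer construction, under our simplified reading of Geist+Ng (illustrative only, not the qr_mumps code; parent(i)=0 marks the root, w(i) is the workload of the subtree rooted at i, tau a balance threshold):

   subroutine find_layer(n, parent, w, tau, in_layer)
     implicit none
     integer, intent(in)  :: n, parent(n)
     real,    intent(in)  :: w(n), tau
     logical, intent(out) :: in_layer(n)
     integer :: v, h, c, nc
     real    :: total
     in_layer = (parent == 0)              ! start from the root(s)
     total = sum(w, mask=(parent == 0))
     do
        h = 0                              ! heaviest node in the layer
        do v = 1, n
           if (in_layer(v) .and. (h == 0 .or. w(v) > w(h))) h = v
        end do
        if (h == 0 .or. w(h) <= tau*total) exit
        nc = count(parent == h)
        if (nc == 0) exit                  ! a leaf cannot be refined
        do c = 1, n                        ! replace h by its children
           if (parent(c) == h) in_layer(c) = .true.
        end do
        in_layer(h) = .false.
     end do
   end subroutine find_layer

Everything strictly below the final layer is pruned, i.e., executed as one sequential task per subtree, which trims the DAG seen by the scheduler.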
Scheduling issues in shared memory settings
Problem is: for 2 processes, Geist+Ng would be perfectly happy with this layer
[Figure: example tree (nodes 1-7) with subtree weights 50, 50, 80, 20, 5, 1, 5 and a candidate layer whose subtrees are balanced in total weight between the 2 processes.]
but in shared memory we would like to stay as close as possible to the sequential traversal (we can afford it!); in this case the above layer leads to a severe imbalance
Memory aware scheduling
The memory consumption problem
The elimination tree has to be traversed in a topological order (i.e., each node after its children).
[Figure: example elimination tree on nodes a-o.]
Because memory is allocated (in the activation task) and partially deallocated (in the clean task) at each node, different traversal orders result in a different memory consumption:
• bfs1 = {a,b,d,e,g,h,i,c,f,l,m,n,o}: BFS normally results in orderings that are bad in terms of memory consumption
• dfs1 = {a,b,c,d,e,f,g,h,i,l,m,n,o}: among all the topological orderings, (postorder) DFS orderings commonly have a better memory behavior (both in terms of footprint and locality)
• dfs2 = {h,i,l,g,m,d,e,f,n,a,b,c,o}: among all the DFS orderings, some are better than others, and we know how to compute the one that minimizes the memory footprint (see the note below)
• in parallel we want to work on multiple branches concurrently, which means that we have to deviate from the DFS and thus consume more memory
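The "we know how" above is the classical result of Liu on multifrontal storage; as a reminder (stated under the simplifying model where a front of size front(v) is allocated at activation and only its contribution block cb(v) survives until the parent's assembly), the sequential peak of the subtree rooted at v with children c_1, ..., c_k satisfies

   peak(v) = max( front(v) + sum_{i=1..k} cb(c_i) ,
                  max_{j=1..k} ( peak(c_j) + sum_{i<j} cb(c_i) ) )

and the footprint is minimized by processing the children of each node in decreasing order of peak(c_i) - cb(c_i).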
Memory aware scheduling in a STF model
Approach: the activation of fronts is forced to respect the sequential (BFS/DFS) order, but the order in which active fronts are processed can be different.
[Figure: elimination tree where only a subset of the fronts, those whose activation fits in the sequential memory envelope, is active at a given time.]
This ensures that the parallel execution runs within the same memory envelope as the sequential one. The memory constraint can be relaxed in order to permit the activation of more fronts and, potentially, achieve more concurrency.
Memory aware scheduling in a STF model
Starting point: the unconstrained STF code, where the activation is submitted like any other task:
do f=1, nfronts ! in postorder
! activate front
call starpu_submit(activate , f)
! init front
call starpu_submit(init , f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call starpu_submit(assemble , c, j, f)
end do
! cleanup child
call starpu_submit(clean , c)
end do
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
end do
! wait for the tasks to be executed
call starpu_waitall ()
Memory aware scheduling in a STF model
Memory-aware variant: the activation becomes a synchronous call in the submission thread, guarded by a wait until enough memory is available:
do f=1, nfronts ! in postorder
do while(avail_mem < size(f)) wait()
! activate front
call activate(f)
! init front
call starpu_submit(init , f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call starpu_submit(assemble , c, j, f)
end do
! cleanup child
call starpu_submit(clean , c)
end do
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
end do
! wait for the tasks to be executed
call starpu_waitall ()
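A possible way to realize the relaxed constraint mentioned earlier (illustrative only; alpha and seq_peak are not in the slides, and alpha = 1 reproduces the sequential envelope):

   ! relaxed guard: allow up to alpha times the sequential peak
   do while(used_mem + size(f) > alpha * seq_peak)
      call wait()
   end do
   call activate(f)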
Experimental results
[Figure: speedup wrt sequential vs. memory (MB) under the memory constraint, DFS and BFS activation orders; Hirlam matrix, AMD Opteron, 24 cores. The annotations report per-node front sizes and the memory peak along the traversal.]
The BFS delivers (sometimes much) better performance at the price of a higher memory consumption and a more erratic behavior.
Experimental results
[Figure: speedup wrt sequential vs. memory (MB), DFS and BFS; Rucci1 matrix, AMD Opteron, 24 cores.]
When DFS reaches the peak memory consumption at the top of the tree, performance varies mildly wrt the memory constraint.
Experimental results
[Figure: speedup wrt sequential vs. memory (MB), DFS and BFS; EternityII_E matrix, AMD Opteron, 24 cores.]
When the peak is low in the tree, performance varies smoothly with the memory consumption.
Experimental results
[Figure: speedup wrt sequential vs. memory (MB), DFS and BFS; Karted matrix, AMD Opteron, 24 cores.]
In some cases no interesting behavior is observed, mostly because all orderings are equally good/bad in terms of performance.
Experimental results
[Figure: number of available tasks over time (s); Hirlam matrix, AMD Opteron, 24 cores: DFS, available tasks with no memory constraint vs. available tasks under the sequential-memory constraint.]
Scheduling issues with accelerators
High-level STF algorithm
do f=1, nfronts ! in postorder
do while(avail_mem < size(f)) wait()
! activate front
call activate(f)
! init front
call starpu_submit(init , f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call starpu_submit(assemble , c, j, f)
end do
! cleanup child
call starpu_submit(clean , c)
end do
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
end do
! wait for the tasks to be executed
call starpu_waitall ()
High-level STF algorithm: front factorization
(same algorithm as above; the slide highlights the front factorization loop)
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
High-level STF algorithm: blocked front factorization
(the column-wise loop is replaced by a loop over block-columns: f%np panels, f%nc block-columns)
do p=1, f%np
! panel reduction of block-column p
call starpu_submit(panel , f, p)
do u=p+1, f%nc
! update of block-column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
Coarse-grain factorization (adapted to GPUs)
[Figure: the front is partitioned with a coarse block size nbGPU; the factorization alternates panel and update tasks on these coarse blocks.]
Coarse-grain factorization: profile
[Figure: execution profile showing panel and outer update tasks; n = 20480, nbGPU = 512.]
Fine-grain panel factorizations
[Figure: the panel is further subdivided with a fine block size nbCPU inside the coarse nbGPU blocks, yielding panel, inner update and outer update tasks.]
Fine-grain panel factorizations: profile
[Figure: execution profile with panel, inner update and outer update tasks; n = 20480, nbGPU = 512, nbCPU = 128.]
Static workload balancing
[Figure: the outer update is split ("split outer update") so that its fine-grain part can be statically assigned to CPUs and its coarse-grain part to GPUs; profile with panel, inner update, outer update and split outer update tasks; n = 20480, nbGPU = 512, nbCPU = 128.]
Dynamic refinement of workload
Ideas:
• work stealing by GPUs on low-granularity updates;
• work stealing by CPUs on high-granularity updates.
Implementation:
• one queue per task type;
• queues polled in different orders by GPUs and CPUs (see the sketch below);
• a criterion for work stealing.
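A minimal sketch of such polling (illustrative only, not the actual qr_mumps/StarPU scheduler; task-type names and polling orders are assumptions):

   module poll_order
     implicit none
     ! task-type indices (names are assumptions for this sketch)
     integer, parameter :: PANEL = 1, INNER_UPD = 2, OUTER_UPD = 3, SPLIT_UPD = 4
     ! GPUs favor coarse-grain outer updates, CPUs fine-grain tasks
     integer, parameter :: gpu_order(4) = [OUTER_UPD, SPLIT_UPD, INNER_UPD, PANEL]
     integer, parameter :: cpu_order(4) = [PANEL, INNER_UPD, SPLIT_UPD, OUTER_UPD]
   contains
     function pick_task(is_gpu, nready) result(t)
       logical, intent(in) :: is_gpu
       integer, intent(in) :: nready(4)  ! ready tasks per type
       integer :: t, k, order(4)
       order = merge(gpu_order, cpu_order, is_gpu)
       t = 0                     ! 0: nothing ready, keep polling
       do k = 1, 4
         if (nready(order(k)) > 0) then
           t = order(k)          ! falls back to the other device's
           return                ! preferred types only when its own
         end if                  ! queues are empty (work stealing)
       end do
     end function pick_task
   end module poll_order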
Dynamic refinement of workload: profile
[Figure: execution profile with panel, inner update, outer update and split outer update tasks; n = 20480, nbGPU = 512, nbCPU = 128.]
Conclusion on scheduling issues with accelerators
Summary:
• "Which task?" matters! (not managed by HEFT);
• static knowledge from the application;
• dynamic correction at runtime;
• results are preliminary.
On-going work:
• multiple GPU case (limit data movement);
• multifrontal factorization (exploit tree parallelism).
Thanks! Questions?