TRANSCRIPT
Scheduling and ordering issues in Sequential Task Flow parallel multifrontal methods and preliminary work on memory-aware scheduling
E. Agullo, A. Buttari, A. Guermouche and F. Lopez, INRIA-LaBRI, CNRS-IRIT, Université de Bordeaux, UPS-IRIT
SOLHAR meeting on scheduling, April 2014, Lyon
Context of the work
Runtime systems
[Figure: an application programmed directly for a heterogeneous architecture (xPU0/xM0, xPU1/xM1, yPU0/yM0).]
The classical approach is based on a mixture of technologies (e.g., MPI+OpenMP+CUDA) which
• requires a big programming effort
• is difficult to maintain and update
• is prone to (performance) portability issues
Runtime systems
[Figure: the application now runs on top of a runtime layer (scheduling engine, memory manager, xPU/yPU drivers) that maps a DAG of tasks A, B, C operating on data x, y onto the architecture (xPU0/xM0, xPU1/xM1, yPU0/yM0).]
• runtimes provide an abstraction layer that hides the architecture details
• the workload is expressed as a DAG of tasks where the dependencies are
  ◦ defined explicitly
  ◦ defined through rules
  ◦ automatically inferred
• the scheduler decides when/where to execute a task
• the drivers deploy the code on the devices
• the memory manager does the memory transfers and guarantees the consistency
Runtime systems
Runtime systems are becoming widely adopted in scientific computing, especially for dense linear algebra libraries:
• PLASMA (QUARK)
• DPLASMA (PaRSEC)
• MAGMA-MORSE (StarPU)
• FLAME (SuperMatrix)
It is, however, much more challenging to use them for complex and irregular workloads, as in sparse computations.
Objective
The objective of this work is to assess the usability of runtime systems for sparse factorization methods and evaluate their effectiveness on single-node, multicore systems¹
¹ Very related work from P. Ramet et al. on the PaStiX solver.
The multifrontal QR method
The Multifrontal QR method
The multifrontal QR factorization is guided by a graph called the elimination tree:
• each node is associated with a relatively small dense matrix called frontal matrix (or front) containing k pivots to be eliminated along with all the other coefficients concerned by their elimination
The Multifrontal QR method
The tree is traversed in topological order (i.e., bottom-up) and, at each node, two operations are performed:
• assembly: coefficients from the original matrix associated with the pivots and contribution blocks produced by the treatment of the child nodes are stacked to form the frontal matrix
• factorization: the k pivots are eliminated through a complete QR factorization of the frontal matrix. As a result we get:
  ◦ part of the global R and Q factors
  ◦ a triangular contribution block that will be assembled into the father's front
The multifrontal QR method
Notable differences with multifrontal LU:
• fronts are rectangular, either over- or under-determined
• assembly operations are just copies (with lots of indirect addressing) and not sums. They can thus be done in any order (like in LU) but also in parallel (most likely not efficient because of false sharing issues)
• fronts are not full: they have a staircase structure. The zeroes in the lower-leftmost part can be ignored. This irregular structure makes the modeling of performance rather difficult
• fronts are completely factorized and not just partially. This makes the overall size of the factors bigger and thus the active memory consumption less sensitive to the tree traversal
• contribution blocks are trapezoidal and not square
The qr_mumps approach
Parallelism
Fine granularity is achieved through a 1-D block partitioning of fronts and the definition of six elementary operations:
1. activate(front): just allocate the memory required to process the node
2. init(front): compute the structure of the front
3. panel(bcol): QR factorization of a column
4. update(bcol): update of a column in the trailing submatrix wrt a panel
5. assemble(bcol): assembly of a column of the contribution block into the father
6. clean(front): cleanup the front (deallocate memory and store factors)
Parallelism: a new approach
If a task is defined as the execution of one elementary operation on a block-column or a front, then the entire multifrontal factorization can be represented as a Directed Acyclic Graph (DAG).
[Figure: per-front task DAGs for three example fronts, built from the elementary operations (a = activate, p = panel, u = update, s = assemble, c = clean); the assembly tasks connect each front's DAG to its father's.]
From a DAG of DAGs to just one single huge DAG
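Concretely, the gluing comes from the assembly tasks. With access modes as in our illustrative reading of the kernels (the slides do not spell them out):

   call starpu_submit(assemble , c, j, f)  ! R: column j of the contribution
                                           ! block of child c; RW: the target
                                           ! columns of father f

such a task can only run after the updates that complete column j of c, and before the panel/update tasks of f that touch the assembled columns, which is exactly what stitches the per-front DAGs into one global DAG.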
Parallelism: a new approach
The scheduling is performed by a finely-tuned, hand-written code:
+ the fine-grained decomposition and the asynchronous/dynamic scheduling deliver high concurrency and much better performance compared to the classical approach (SPQR)
− the scheduler is not scalable (the search for ready tasks in the DAG is inefficient)...
− ... extremely difficult to maintain...
− ... and not really portable
All these problems may be overcome by replacing the scheduler with a modern runtime system.
StarPU multifrontal QR
The multifrontal QR factorization: StarPU integration
StarPU is a runtime environment that lets the programmer achieve parallelism through a Sequential Task Flow model:
• The parallel code looks exactly the same as the sequential one except that elementary operations are not executed but submitted to the system
• Depending on how elementary operations access data (whether in read or write mode) and on the (sequential) order of submission, StarPU can infer dependencies among them and build a DAG which is used to drive the parallel execution
• The StarPU scheduler is in charge of deploying the DAG on the underlying architecture
• The StarPU memory manager moves data from one memory to another and maintains the global memory coherency
Other approaches are possible, such as the Parameterized Task Graph model used in PaRSEC (Florent goes to Rocky Top).
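As a minimal illustration of how the dependencies are inferred (using the same simplified starpu_submit notation as the listings below; the access modes shown are our assumptions, chosen to match the kernels' semantics):

   call starpu_submit(panel , f, p)      ! RW: column p of f
   call starpu_submit(update, f, u, p)   ! R: column p, RW: column u
                                         !   -> inferred: after panel(f, p)
   call starpu_submit(panel , f, p+1)    ! RW: column p+1
                                         !   -> inferred: after update(f, p+1, p)

Updates of distinct columns u with respect to the same panel write disjoint data and can therefore run in parallel.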
The multifrontal QR factorization: StarPU integration
Sequential qr_mumps code
do f=1, nfronts ! in postorder
! activate front
call activate(f)
! init front
call init(f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call assemble(c, j, f)
end do
! cleanup child
call clean(c)
end do
do p=1, f%n
! panel reduction of column p
call panel(f, p)
do u=p+1, f%n
! update of column u with panel p
call update(f, u, p)
end do
end do
end do
The multifrontal QR factorization: StarPU integration
STF parallel qr_mumps code
do f=1, nfronts ! in postorder
! activate front
call starpu_submit(activate , f)
! init front
call starpu_submit(init , f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call starpu_submit(assemble , c, j, f)
end do
! cleanup child
call starpu_submit(clean , c)
end do
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
end do
! wait for the tasks to be executed
call starpu_waitall ()
2D partitioning + CA front factorization
1D block-column partitioning is not very well suited for the case where frontal matrices are (strongly) overdetermined... "Houston, we have a problem"
Thanks to the simplicity of the STF programming model it is possible to plug in communication-avoiding methods for factorizing the frontal matrices with a relatively moderate effort:
+ 2D block partitioning (not necessarily square)
+ more concurrency
− more complex dependencies
− many more tasks
− more sensitive to runtime overhead
2D partitioning + CA front factorization
do f=1, nfronts ! in sequential order
call starpu_submit(activate , f) ! activate front
call starpu_submit(init , f) ! init front
do c=1, f%nchildren ! for all the children of f
do i=1,c%m
do j=1,c%n
call starpu_submit(assemble , c, i, j, f) ! assemble block (i,j) of c into f
end do
end do
call starpu_submit(clean , c) ! cleanup child
end do
! CA factorization of the front: step s=0 factorizes each block-row of
! panel k independently (geqrt); each step s>0 merges pairs of block-rows
! at distance 2**(s-1) (tpqrt), forming a binary reduction tree
ca_facto: do k=1, min(f%m,f%n)
do s=0, log2(f%m-k+1)
do i = k, f%m, 2**s ! i runs over block-rows (f%m, not f%n)
if(s.eq.0) then
call starpu_submit(geqrt , f, k, i)
do j=k+1, f%n
call starpu_submit(gemqrt , f, k, i, j)
end do
else
l = i+2**(s-1)
call starpu_submit(tpqrt , f, k, i, l)
do j=k+1, f%n
call starpu_submit(tpmqrt , f, k, i, l, j)
end do
end if
end do
end do
end do ca_facto
end do
call starpu_waitall () ! wait for the tasks to be executed
Scheduling issues in shared memory settings
How to schedule the tasks of such a complex DAG?
• how to identify the critical path?
• how to speed up operations along the critical path?
• how to improve the locality of data? (see the work of Abdou and Andra on contexts)
• how to deal with NUMA settings?
• the 2D blocking does not have to be used on all fronts. When should we switch to 2D blocking? (see the sketch below)
  ◦ only larger fronts?
  ◦ only overdetermined fronts?
  ◦ only topmost fronts?
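For instance, the switch could be driven by a simple per-front test. This is a hypothetical heuristic sketching the options above, with made-up thresholds; it is not the qr_mumps policy:

   ! hypothetical heuristic: 2D blocking only for large and
   ! (strongly) overdetermined fronts
   logical function use_2d(m, n)
     implicit none
     integer, intent(in) :: m, n   ! front dimensions
     use_2d = (m*n > 4000000) .and. (m > 2*n)
   end function use_2d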
Scheduling issues in shared memory settings
qr_mumps uses a logical tree pruning based on the Geist+Ng algorithm (a simplified sketch follows the list below): a layer of the tree is computed such that the subtrees below it provide enough balanced work for all processes, and each pruned subtree is treated as a single, sequential task. This pruning:
• reduces the complexity of the scheduling
• reduces the overhead of parallelism
• improves the locality of data
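A minimal sketch of the layer construction, under our simplified reading of Geist+Ng (illustrative only, not the qr_mumps code; parent(i)=0 marks the root, w(i) is the workload of the subtree rooted at i, tau a balance threshold):

   subroutine find_layer(n, parent, w, tau, in_layer)
     implicit none
     integer, intent(in)  :: n, parent(n)
     real,    intent(in)  :: w(n), tau
     logical, intent(out) :: in_layer(n)
     integer :: v, h, c, nc
     real    :: total
     in_layer = (parent == 0)              ! start from the root(s)
     total = sum(w, mask=(parent == 0))
     do
        h = 0                              ! heaviest node in the layer
        do v = 1, n
           if (in_layer(v) .and. (h == 0 .or. w(v) > w(h))) h = v
        end do
        if (h == 0 .or. w(h) <= tau*total) exit
        nc = count(parent == h)
        if (nc == 0) exit                  ! a leaf cannot be refined
        do c = 1, n                        ! replace h by its children
           if (parent(c) == h) in_layer(c) = .true.
        end do
        in_layer(h) = .false.
     end do
   end subroutine find_layer

Everything strictly below the final layer is pruned, i.e., executed as one sequential task per subtree, which trims the DAG seen by the scheduler.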
Scheduling issues in shared memory settings
Problem is: for 2 processes, Geist+Ng would be perfectly happy with this layer
[Figure: example tree (nodes 1-7) with subtree weights 50, 50, 80, 20, 5, 1, 5 and a candidate layer whose subtrees are balanced in total weight between the 2 processes.]
but in shared memory we would like to stay as close as possible to the sequential traversal (we can afford it!); in this case the above layer leads to a severe imbalance
Memory aware scheduling
The memory consumption problem
The elimination tree has to be traversed in a topological order (i.e., each node after its children).
[Figure: example elimination tree on nodes a-o.]
Because memory is allocated (in the activation task) and partially deallocated (in the clean task) at each node, different traversal orders result in a different memory consumption:
• bfs1 = {a,b,d,e,g,h,i,c,f,l,m,n,o}: BFS normally results in orderings that are bad in terms of memory consumption
• dfs1 = {a,b,c,d,e,f,g,h,i,l,m,n,o}: among all the topological orderings, (postorder) DFS orderings commonly have a better memory behavior (both in terms of footprint and locality)
• dfs2 = {h,i,l,g,m,d,e,f,n,a,b,c,o}: among all the DFS orderings, some are better than others, and we know how to compute the one that minimizes the memory footprint (see the note below)
• in parallel we want to work on multiple branches concurrently, which means that we have to deviate from the DFS and thus consume more memory
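The "we know how" above is the classical result of Liu on multifrontal storage; as a reminder (stated under the simplifying model where a front of size front(v) is allocated at activation and only its contribution block cb(v) survives until the parent's assembly), the sequential peak of the subtree rooted at v with children c_1, ..., c_k satisfies

   peak(v) = max( front(v) + sum_{i=1..k} cb(c_i) ,
                  max_{j=1..k} ( peak(c_j) + sum_{i<j} cb(c_i) ) )

and the footprint is minimized by processing the children of each node in decreasing order of peak(c_i) - cb(c_i).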
Memory aware scheduling in a STF model
Approach: the activation of fronts is forced to respect the sequential (BFS/DFS) order, but the order in which active fronts are processed can be different.
[Figure: elimination tree where only a subset of the fronts, those whose activation fits in the sequential memory envelope, is active at a given time.]
This ensures that the parallel execution runs within the same memory envelope as the sequential one. The memory constraint can be relaxed in order to permit the activation of more fronts and, potentially, achieve more concurrency.
Memory aware scheduling in a STF model
Starting point: the unconstrained STF code, where the activation is submitted like any other task:
do f=1, nfronts ! in postorder
! activate front
call starpu_submit(activate , f)
! init front
call starpu_submit(init , f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call starpu_submit(assemble , c, j, f)
end do
! cleanup child
call starpu_submit(clean , c)
end do
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
end do
! wait for the tasks to be executed
call starpu_waitall ()
Memory aware scheduling in a STF model
Memory-aware variant: the activation becomes a synchronous call in the submission thread, guarded by a wait until enough memory is available:
do f=1, nfronts ! in postorder
do while(avail_mem < size(f)) wait()
! activate front
call activate(f)
! init front
call starpu_submit(init , f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call starpu_submit(assemble , c, j, f)
end do
! cleanup child
call starpu_submit(clean , c)
end do
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
end do
! wait for the tasks to be executed
call starpu_waitall ()
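A possible way to realize the relaxed constraint mentioned earlier (illustrative only; alpha and seq_peak are not in the slides, and alpha = 1 reproduces the sequential envelope):

   ! relaxed guard: allow up to alpha times the sequential peak
   do while(used_mem + size(f) > alpha * seq_peak)
      call wait()
   end do
   call activate(f)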
Experimental results
[Figure: speedup wrt sequential vs. memory (MB) under the memory constraint, DFS and BFS activation orders; Hirlam matrix, AMD Opteron, 24 cores. The annotations report per-node front sizes and the memory peak along the traversal.]
The BFS delivers (sometimes much) better performance at the price of a higher memory consumption and a more erratic behavior.
Experimental results
[Figure: speedup wrt sequential vs. memory (MB), DFS and BFS; Rucci1 matrix, AMD Opteron, 24 cores.]
When DFS reaches the peak memory consumption at the top of the tree, performance varies mildly wrt the memory constraint.
Experimental results
[Figure: speedup wrt sequential vs. memory (MB), DFS and BFS; EternityII_E matrix, AMD Opteron, 24 cores.]
When the peak is low in the tree, performance varies smoothly with the memory consumption.
Experimental results
[Figure: speedup wrt sequential vs. memory (MB), DFS and BFS; Karted matrix, AMD Opteron, 24 cores.]
In some cases no interesting behavior is observed, mostly because all orderings are equally good/bad in terms of performance.
Experimental results
[Figure: number of available tasks over time (s); Hirlam matrix, AMD Opteron, 24 cores: DFS, available tasks with no memory constraint vs. available tasks under the sequential-memory constraint.]
Scheduling issues with accelerators
High-level STF algorithm
do f=1, nfronts ! in postorder
do while(avail_mem < size(f)) wait()
! activate front
call activate(f)
! init front
call starpu_submit(init , f)
do c=1, f%nc ! for all the children of f
do j=1,c%n
! assemble column j of c into f
call starpu_submit(assemble , c, j, f)
end do
! cleanup child
call starpu_submit(clean , c)
end do
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
end do
! wait for the tasks to be executed
call starpu_waitall ()
High-level STF algorithm: front factorization
(same algorithm as above; the slide highlights the front factorization loop)
do p=1, f%n
! panel reduction of column p
call starpu_submit(panel , f, p)
do u=p+1, f%n
! update of column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
High-level STF algorithm: blocked front factorization
(the column-wise loop is replaced by a loop over block-columns: f%np panels, f%nc block-columns)
do p=1, f%np
! panel reduction of block-column p
call starpu_submit(panel , f, p)
do u=p+1, f%nc
! update of block-column u with panel p
call starpu_submit(update , f, u, p)
end do
end do
Coarse-grain factorization (adapted to GPUs)
[Figure: the front is partitioned with a coarse block size nbGPU; the factorization alternates panel and update tasks on these coarse blocks.]
Coarse-grain factorization: profile
[Figure: execution profile showing panel and outer update tasks; n = 20480, nbGPU = 512.]
Fine-grain panel factorizations
[Figure: the panel is further subdivided with a fine block size nbCPU inside the coarse nbGPU blocks, yielding panel, inner update and outer update tasks.]
Fine-grain panel factorizations: profile
[Figure: execution profile with panel, inner update and outer update tasks; n = 20480, nbGPU = 512, nbCPU = 128.]
Static workload balancing
[Figure: the outer update is split ("split outer update") so that its fine-grain part can be statically assigned to CPUs and its coarse-grain part to GPUs; profile with panel, inner update, outer update and split outer update tasks; n = 20480, nbGPU = 512, nbCPU = 128.]
Dynamic refinement of workload
Ideas:
• work stealing by GPUs on low-granularity updates;
• work stealing by CPUs on high-granularity updates.
Implementation:
• one queue per task type;
• queues polled in different orders by GPUs and CPUs (see the sketch below);
• a criterion for work stealing.
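A minimal sketch of such polling (illustrative only, not the actual qr_mumps/StarPU scheduler; task-type names and polling orders are assumptions):

   module poll_order
     implicit none
     ! task-type indices (names are assumptions for this sketch)
     integer, parameter :: PANEL = 1, INNER_UPD = 2, OUTER_UPD = 3, SPLIT_UPD = 4
     ! GPUs favor coarse-grain outer updates, CPUs fine-grain tasks
     integer, parameter :: gpu_order(4) = [OUTER_UPD, SPLIT_UPD, INNER_UPD, PANEL]
     integer, parameter :: cpu_order(4) = [PANEL, INNER_UPD, SPLIT_UPD, OUTER_UPD]
   contains
     function pick_task(is_gpu, nready) result(t)
       logical, intent(in) :: is_gpu
       integer, intent(in) :: nready(4)  ! ready tasks per type
       integer :: t, k, order(4)
       order = merge(gpu_order, cpu_order, is_gpu)
       t = 0                     ! 0: nothing ready, keep polling
       do k = 1, 4
         if (nready(order(k)) > 0) then
           t = order(k)          ! falls back to the other device's
           return                ! preferred types only when its own
         end if                  ! queues are empty (work stealing)
       end do
     end function pick_task
   end module poll_order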
Dynamic refinement of workload: profile
[Figure: execution profile with panel, inner update, outer update and split outer update tasks; n = 20480, nbGPU = 512, nbCPU = 128.]
Conclusion on scheduling issues with accelerators
Summary:
• "Which task?" matters! (not managed by HEFT);
• static knowledge from the application;
• dynamic correction at runtime;
• results are preliminary.
On-going work:
• multiple GPU case (limit data movement);
• multifrontal factorization (exploit tree parallelism).
Thanks! Questions?