Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators
Azzam Haidar Yulu Jia
Piotr Luszczek Stanimire Tomov
Asim YarKhan Jack Dongarra
Problem Statement: Factorization
11/16/15, Austin, TX, USA ScalA 2015 2
$Ax = b$, $PA = LU$, $Ly = Pb$, $Ux = y$
Kernels: GETF2, TRSM, GEMM
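As a minimal sketch of the solve sequence on this slide (an unblocked, pure-Python illustration, not the authors' blocked GETF2/TRSM/GEMM implementation), the code below factors a small matrix with partial pivoting, applies the row permutation to b, and then runs forward and back substitution. The names `lu_factor`, `lu_solve`, and `perm` are assumptions made for this example.

```python
def lu_factor(A):
    """Return (LU, perm): unit-lower L and upper U packed in one matrix, plus row order."""
    n = len(A)
    LU = [row[:] for row in A]          # work on a copy
    perm = list(range(n))               # perm[i] = original index of row i
    for k in range(n):
        # partial pivoting: pick the largest entry in column k
        p = max(range(k, n), key=lambda r: abs(LU[r][k]))
        if LU[p][k] == 0.0:
            raise ValueError("matrix is singular")
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]        # multiplier, stored in the L part
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]   # Schur complement update
    return LU, perm


def lu_solve(LU, perm, b):
    """Solve Ax = b given the packed factors and row order from lu_factor."""
    n = len(LU)
    y = [b[p] for p in perm]            # permuted right-hand side
    for i in range(n):                  # forward substitution with unit-lower L
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):        # back substitution with U
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x


A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [4.0, 10.0, 24.0]
LU, perm = lu_factor(A)
x = lu_solve(LU, perm, b)   # each component is approximately 1.0
```

In the blocked algorithm the three nested loops are split into kernels: the pivot search and column scaling correspond to GETF2, the row of U to TRSM, and the trailing update to GEMM.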
![Page 3: Weighted Dynamic Scheduling with Many Parallelism Grains ...Runtime Scheduler Loop 11/16/15, Austin, TX, USA ScalA 2015 10 def main_thread_loop(user_code, queues, threshold): # if](https://reader034.vdocument.in/reader034/viewer/2022051604/6002da01880468788c348a91/html5/thumbnails/3.jpg)
Problem Statement: Algorithm
$Ax = b$, $PA = LU$, $Ly = Pb$, $Ux = y$
Kernels: GETF2, TRSM, GEMM
Problem Statement: Mapping to Hardware
$Ax = b$, $PA = LU$, $Ly = Pb$, $Ux = y$
Kernels: GETF2, TRSM, GEMM
From Code to DAG
Start with canonical code:

    GETRF(A[1:n, 1:n]) {
      for i = 1, nb, 2*nb, … {
        GETF2(A[i:n, i:i+nb])
        TRSM(A[i:i+nb, i:i+nb], A[i:i+nb, i:n])
        GEMM(A[i:n, i:i+nb], A[i:i+nb, i:n], A[i+nb:n, i+nb:n])
      }
    }

Unwind the function calls:

    GETF2(A[1,1], A[2,1], …)
    TRSM(A[1,1], A[1,2])
    TRSM(A[1,1], A[1,3])
    GEMM(A[2,1], A[1,2], A[2,2])

Track dependences:
The function calls and their dependences form a DAG.
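The dependence tracking can be sketched as follows. This is a minimal illustration, assuming each unwound call is annotated with the tiles it reads and writes; the slide gives only argument lists, so the read/write sets below are an assumption, and `Task` and `build_dag` are hypothetical names rather than the runtime's API.

```python
from collections import namedtuple

# Hypothetical task record: kernel name plus the tiles it reads and writes.
Task = namedtuple("Task", ["name", "reads", "writes"])


def build_dag(tasks):
    """Return sorted edges (i, j), meaning task j must run after task i."""
    last_writer = {}      # tile -> index of the last task that wrote it
    readers = {}          # tile -> tasks that read it since that write
    edges = set()
    for j, t in enumerate(tasks):
        for tile in t.reads:
            if tile in last_writer:                 # read-after-write
                edges.add((last_writer[tile], j))
        for tile in t.writes:
            if tile in last_writer:                 # write-after-write
                edges.add((last_writer[tile], j))
            for i in readers.get(tile, []):         # write-after-read
                if i != j:
                    edges.add((i, j))
        # record this task's accesses for the tasks that follow
        for tile in t.reads:
            readers.setdefault(tile, []).append(j)
        for tile in t.writes:
            last_writer[tile] = j
            readers[tile] = []
    return sorted(edges)


# The four unwound calls from the slide, with assumed access sets.
tasks = [
    Task("GETF2", reads=("A11", "A21"), writes=("A11", "A21")),
    Task("TRSM", reads=("A11",), writes=("A12",)),
    Task("TRSM", reads=("A11",), writes=("A13",)),
    Task("GEMM", reads=("A21", "A12"), writes=("A22",)),
]
edges = build_dag(tasks)   # [(0, 1), (0, 2), (0, 3), (1, 3)]
```

No edge connects the two TRSM calls, so a scheduler is free to run them concurrently or on different devices.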
Matrix-View of Dependences
[Figure: matrix view of dependences; labels: GPU, Xeon Phi]
From Code to DAG
Asynchronous Algorithm
Panel factorization
Pivoting (row swaps)
Triangular solve
Matrix multiply (Schur complement)
Schedule tasks according to:
- data sizes
- hardware weights
- affinity
- perform lookahead
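One way to picture lookahead is a priority-ordered ready queue in which the update that unblocks the next panel, and then the panel factorization itself, outrank the remaining trailing updates. The sketch below assumes made-up priority values and task labels; it illustrates the idea only and is not the scheduler's actual policy.

```python
import heapq
import itertools

ready = []                      # min-heap of (negated priority, arrival order, label)
arrival = itertools.count()     # tie-breaker that keeps equal priorities in FIFO order


def submit(label, priority):
    """Add a ready task; higher priority values are popped first."""
    heapq.heappush(ready, (-priority, next(arrival), label))


def run_next():
    """Pop and return the label of the highest-priority ready task."""
    return heapq.heappop(ready)[2]


# Trailing updates of step 1 become ready; the one feeding the next panel
# gets a higher (hypothetical) priority.
submit("GEMM step 1, column 3", 1)
submit("GEMM step 1, column 4", 1)
submit("GEMM step 1, column 2", 5)

executed = [run_next()]             # column 2 is updated first
submit("GETF2 step 2", 10)          # the next panel is now ready
while ready:
    executed.append(run_next())
```

The resulting order runs the step-2 panel factorization before the remaining step-1 updates, which is the overlap that lookahead is meant to expose.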
Main Features of the Scheduler
• Resource capability weight for the task
  – w(kernel, device)
• Adaptive scheduling with weights
• Dynamic lookahead
• Enhanced task priorities
  – Regulates lookahead (DAG exploration order)
• Transparent data movement
  – Uses (and tracks) asynchronous data transfers
• Dynamic data redistribution
  – Data is moved as needed and only if needed
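To make the role of w(kernel, device) concrete, here is a small sketch that places each task on the device with the earliest estimated finish time, computed as the device's current backlog plus the task's work divided by its weight. The earliest-finish-time rule, the weight values, and the names `WEIGHTS` and `pick_device` are assumptions chosen for illustration; the slides do not spell out the exact placement rule.

```python
# Hypothetical capability weights w(kernel, device): relative speed, higher is faster.
WEIGHTS = {
    ("GEMM", "cpu"): 1.0, ("GEMM", "gpu"): 10.0,
    ("GETF2", "cpu"): 1.0, ("GETF2", "gpu"): 0.5,
}


def pick_device(kernel, work, busy_until):
    """Choose the device with the earliest estimated finish time for this task.

    busy_until maps device -> time at which its queue is expected to drain;
    it is updated in place to account for the newly placed task.
    """
    def finish(device):
        return busy_until[device] + work / WEIGHTS[(kernel, device)]

    best = min(sorted(busy_until), key=finish)
    busy_until[best] = finish(best)
    return best


busy = {"cpu": 0.0, "gpu": 0.0}
placement = [pick_device(k, 10.0, busy) for k in ["GETF2", "GEMM", "GEMM", "GEMM"]]
# placement == ["cpu", "gpu", "gpu", "gpu"]
```

With these made-up weights the latency-bound panel kernel stays on the CPU while the matrix multiplies go to the GPU. Because the backlog enters the estimate, a heavily loaded device eventually loses tasks to a slower idle one.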
Runtime Scheduler Loop
    def main_thread_loop(user_code, queues, threshold):
        # if there are enough tasks for cores
        if queues.total_length() > threshold:
            # resume user's code for submitting tasks
            task = user_code.get_next_task()
            q = queues.find_closes_queue(task.devices())
            q.insert(task, task.priority())
        else:
            task = queues.steal_task(main_cpu)
            # if tasks available for stealing
            if not task.is_empty():
                task.execute()  # execute single task
            else:
                queues.wait_for_tasks()
Effect of Dynamic Scheduling
Performance: Kepler K20
Performance: Xeon Phi
Performance: Xeon Phi + Kepler K40
Summary of Contributions
• Fine- and coarse-grained tasks for scheduling
• Capability weights for hardware description
• Unified scheduling across CPUs, GPUs, and coprocessors
• Synchronous memory-transfer model with transparent asynchronous progress