An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi and Many-Core Desktop Platforms
Authors: Alécio P. D. Binotto, Carlos E. Pereira, Arjan Kuijper, André Stork, and Dieter W. Fellner
ytchen, 2012.09.19
Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion
Introduction
• High-performance platforms are commonly required for scientific and engineering algorithms that must meet timing constraints.
• Both computation time and performance need to be optimized.
• Efficiency matters both for huge domain sizes and for small problems.
Introduction
• Our dynamic scheduling method combines two phases:
  o a first assignment phase for a set of high-level tasks (algorithms, for example), based on a pre-processing benchmark that acquires basic performance samples of the tasks on the processing units (PUs);
  o a runtime phase that obtains real performance measurements of the tasks and feeds a performance database.
Motivation
• 3D Computational Fluid Dynamics (CFD)
• Large computations
  o velocity field
  o local pressure
• Examples
  o planes
  o cars
Motivation
• Three iterative solvers for systems of linear equations (SLEs): Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient
  o Jacobi: an iterative method for solving a system of linear equations whose matrix is diagonally dominant, that is, the diagonal element has the largest absolute value in each row and column.
  o Red-Black Gauss-Seidel: an iterative method for solving a linear system of equations resulting from the finite difference discretization of partial differential equations.
  o Conjugate Gradient: an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite.
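As a minimal sketch of the simplest of the three solvers, the Jacobi iteration below solves a small diagonally dominant system in plain Python. The function name `jacobi`, the tolerance, and the 3×3 example system are illustrative assumptions and are not taken from the paper.

```python
def jacobi(a, b, tol=1e-10, max_iter=1000):
    """Solve a x = b by Jacobi iteration (a should be diagonally dominant)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        # Every component is computed from the previous iterate only,
        # which is what makes the method easy to run in parallel.
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        # Stop once two successive iterates are practically identical.
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x


# Hypothetical diagonally dominant system whose exact solution is (1, 2, 3).
A = [[10.0, 1.0, 1.0],
     [1.0, 10.0, 1.0],
     [1.0, 1.0, 10.0]]
B = [15.0, 24.0, 33.0]
solution = jacobi(A, B)
```

Because each new component depends only on the previous iterate, the inner loop maps naturally onto many-core hardware, which is one reason such solvers are attractive scheduling candidates.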
System overview
• Unit of Allocation (UA): represented as a task.
Platform Independent Programming Model
• OpenCL
• In its basic principle, the API encapsulates implementations of a task (methods, algorithms, parts of code, etc.) for different PUs, leveraging intrinsic hardware features and making them platform independent.
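The encapsulation idea can be illustrated without any OpenCL code. The sketch below is an assumption-laden stand-in: the `Task` class, the PU labels, and the two `scale_*` functions are hypothetical and only show one task exposing interchangeable implementations per PU.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    """A high-level task with one implementation per processing unit (PU)."""
    name: str
    implementations: Dict[str, Callable[[List[float]], List[float]]] = field(
        default_factory=dict
    )

    def run(self, pu: str, data: List[float]) -> List[float]:
        # The caller only chooses a PU; the PU-specific code stays hidden.
        return self.implementations[pu](data)


def scale_cpu(data):
    # Stand-in for a CPU implementation: a plain sequential loop.
    return [2.0 * v for v in data]


def scale_gpu(data):
    # Stand-in for a GPU kernel launch; here it simply computes the same result.
    return list(map(lambda v: 2.0 * v, data))


scale = Task("scale", {"CPU": scale_cpu, "GPU": scale_gpu})
```

A scheduler built on such an interface can move a task between PUs by changing a single argument, for example `scale.run("GPU", data)` instead of `scale.run("CPU", data)`.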
Profiler and Database
• The profiler monitors tasks' execution times and characteristics and stores them in a timing performance database.
• Recorded characteristics include the input data (size and type) and data transfers between PUs, among others.
Profiler and Database
• Performance is measured in host (CPU) clock counts, which intrinsically accounts for the data transfer times between the CPU and the PU, possible initialization and synchronization times on the PUs, and latency.
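A host-side measurement of this kind might look like the following sketch. The `Profiler` class, its dictionary-based "database", and the key layout `(task, PU, domain size)` are assumptions made for illustration; the slides do not specify the actual storage format.

```python
import time
from collections import defaultdict


class Profiler:
    """Records host-side wall-clock times per (task, PU, domain size)."""

    def __init__(self):
        # Hypothetical in-memory stand-in for the timing performance database.
        self.database = defaultdict(list)

    def measure(self, task_name, pu, domain_size, func, *args):
        # Timing on the host around the whole call means that transfers,
        # initialization, synchronization and latency are all included.
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        self.database[(task_name, pu, domain_size)].append(elapsed)
        return result

    def mean_time(self, task_name, pu, domain_size):
        # Average of all samples recorded so far for this key.
        samples = self.database[(task_name, pu, domain_size)]
        return sum(samples) / len(samples)


profiler = Profiler()
total = profiler.measure("sum", "CPU", 1000, sum, range(1000))
```

Wrapping the complete call, rather than timing only the kernel, is what lets one number capture every cost the scheduler has to pay when it picks a PU.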
Dynamic Scheduler
• First, it establishes an initial scheduling guess over the PUs as soon as the application(s) start.
  o First Assignment Phase – FAP
• Second, for every newly arriving task, it performs a scheduling step that consults the timing database.
  o Runtime Assignment Phase – RAP
First Assignment Phase – FAP
• Given a set of tasks with predefined costs for the PUs stored in the database, the first assignment phase schedules the tasks over the asymmetric PUs.
• Goal: the lowest total execution time, with
  o m: the number of PUs (here m = 2)
  o n: the number of considered tasks
  o i: task index
  o j: processor index
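The slide's formula did not survive extraction, so the sketch below rests on a stated assumption: the PUs work concurrently, and "total execution time" is taken to be the finish time of the busiest PU (the makespan). The cost table and the function names are hypothetical; the search simply tries every one of the m^n assignments.

```python
from itertools import product


def makespan(assignment, costs, m):
    """Finish time of the busiest PU; costs[i][j] is the cost of task i on PU j."""
    load = [0.0] * m
    for i, j in enumerate(assignment):
        load[j] += costs[i][j]
    return max(load)


def best_assignment(costs, m=2):
    """Exhaustively try all m**n assignments (only feasible for small n)."""
    n = len(costs)
    return min(product(range(m), repeat=n),
               key=lambda assignment: makespan(assignment, costs, m))


# Hypothetical costs in ms: column 0 = GPU, column 1 = CPU.
costs = [[4.0, 9.0],
         [3.0, 5.0],
         [6.0, 7.0],
         [2.0, 8.0]]
assignment = best_assignment(costs)
```

With these made-up costs only the third task goes to the CPU, giving a makespan of 9 ms against 15 ms when everything runs on the GPU. The search space grows as m^n, which is why exhaustive search serves only as a reference and cheaper heuristics are needed in practice.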
Runtime Assignment Phase – RAP
• The arrival of new tasks is modeled as a FIFO (First In, First Out) queue.
• Assignment reconfiguration: tasks that were already scheduled but not yet executed change their assignment if doing so yields a performance gain.
• When the database has no entry for a task with a specific domain size, the lookup function retrieves the data from the task with the most similar domain size.
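The FIFO queue and the nearest-domain-size lookup can be sketched as follows. The database layout, the timings, and the greedy "earliest finish" rule are assumptions made for illustration; the reassignment of already scheduled tasks is not modeled here.

```python
from collections import deque


def lookup(database, task, pu, size):
    """Return the stored time for (task, pu) at the most similar domain size."""
    sizes = database[(task, pu)]
    nearest = min(sizes, key=lambda s: abs(s - size))
    return sizes[nearest]


def schedule_fifo(queue, database, pus=("GPU", "CPU")):
    """Assign queued tasks in arrival order to the PU that finishes them first."""
    ready_at = {pu: 0.0 for pu in pus}  # time at which each PU becomes free
    plan = []
    while queue:
        task, size = queue.popleft()  # FIFO: the oldest task is served first
        finish = {pu: ready_at[pu] + lookup(database, task, pu, size)
                  for pu in pus}
        chosen = min(finish, key=finish.get)
        ready_at[chosen] = finish[chosen]
        plan.append((task, size, chosen))
    return plan


# Hypothetical timing database: (task, PU) -> {domain size: time in ms}.
database = {
    ("jacobi", "GPU"): {1000: 2.0, 100000: 10.0},
    ("jacobi", "CPU"): {1000: 1.0, 100000: 80.0},
}
queue = deque([("jacobi", 1200), ("jacobi", 90000), ("jacobi", 900)])
plan = schedule_fifo(queue, database)
```

None of the three requested sizes is in the database, so each lookup falls back to the closest recorded size. In this example the two small problems land on the CPU and the large one on the GPU.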
Experiment results
• Domain sizes and execution costs of the tasks on the PUs
Experiment results
• Comparison of allocation heuristics
  o 0: GPU, 1: CPU
Experiment results
• Overhead of the dynamic scheduling using ALG.2 and its gain in comparison to scheduling all tasks to the GPU
Experiment results
• Scheduling techniques for 24 tasks
  o Overhead: the time to perform the scheduling
  o Solve time: the execution time to compute the tasks
  o Total time: overhead + solve time
  o Error: the total time of a technique relative to the optimal solution without its overhead
    • ex: (7660 - 6130) / 6130
  o Optimal: exhaustive search
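The error metric from the slide's example can be checked with a few lines. The function name `relative_error` is an assumption; the two numbers are the ones given on the slide, whose time unit is not stated.

```python
def relative_error(total_time, optimal_time):
    """Relative excess of a technique's total time over the optimal time."""
    return (total_time - optimal_time) / optimal_time


# Values from the slide's example: (7660 - 6130) / 6130.
error = relative_error(7660, 6130)
print(f"{error:.4f}")  # prints 0.2496
```

In this example the technique's total time exceeds the optimal solve time by roughly a quarter.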
Experiment results
• Scheduling 24 tasks in the FAP + 42 tasks arriving in the RAP
Related work
• Distributed processing on a CPU-GPU platform
• Scheduling on a CPU-GPU platform
  o HEFT (Heterogeneous-Earliest-Finish-Time)
Related work
StarPU this paper
execution model codelets OpenCL
method low-level high-level
motivation CFD matrix multiplication
system runtime system
scheduling database
Conclusion
• This paper presents a context-aware runtime and tuning system based on a compromise between reducing the execution time of engineering applications and the cost of computing the scheduling.
• We combined a model for a first scheduling, based on an off-line performance benchmark, with a runtime model that keeps track of the real execution time of the tasks, with the goal of extending the scheduling process of OpenCL.
Conclusion
• We achieved an execution time gain of 21.77% in comparison to the static assignment of all tasks to the GPU, with a scheduling error of only 0.25% compared to exhaustive search.
Thank you for listening!