Efficient Scheduling, Mapping and Resource Prediction
for Dynamic Run time Operating Systems
by
Ahmed Al-Wattar
A Thesis
presented to
The University of Guelph
In partial fulfilment of requirements
for the degree of
Doctor of Philosophy
in
Engineering
Guelph, Ontario, Canada
© Ahmed Al-Wattar, April 2015
ABSTRACT
Efficient Scheduling, Mapping and Resource Prediction for Dynamic
Run time Operating Systems
Ahmed Al-Wattar
University of Guelph, 2015
Advisor:
Professor Shawki Areibi
Several embedded application domains for reconfigurable systems tend to combine frequent
changes with high performance demands in their workloads, such as image processing, wearable
computing, and network processors. Time multiplexing of reconfigurable hardware resources
raises a number of new issues, ranging from run-time systems to complex programming models,
that usually form a Reconfigurable hardware Operating System (ROS). In this thesis a novel ROS
framework that aids the designer from the early design stages all the way down to the hardware
implementation is proposed. An efficient reconfigurable platform was implemented along with
several novel scheduling algorithms. The proposed algorithms reuse hardware tasks to
reduce reconfiguration overhead and migrate tasks between software and hardware to efficiently utilize
resources and reduce computation time. A framework for efficient mapping of execution units to
task graphs in a run-time reconfigurable system is also designed. The framework utilizes an Island
Based Genetic Algorithm flow that optimizes several objectives, including performance, area,
and power consumption. The proposed Island based GA framework achieves on average a 55.2%
improvement over a single GA implementation and an 80.7% improvement over a baseline random
allocation and binding approach. Finally, we present a novel adaptive and dynamic methodology
based on a Machine Learning approach for predicting and estimating the necessary resources for
an application based on past historical information. An important feature of the proposed methodology
is that the system is able to learn and generalize and, therefore, is expected to improve its
accuracy over time.
Acknowledgements
I would like to express my deep gratitude and special appreciation to my advisor Dr.
Shawki Areibi for his help and support during every moment of my work and for the
guidance he provided me while carrying out this research. Your advice on both research
and my career has been priceless. Thank you! I would like to thank Dr. Gary
Grewal for his valuable advice and feedback that greatly helped me enhance the quality
of my work and my thesis. Without them, my work would have hardly been possible
to complete. I would also like to thank my committee member Dr. Radu Muresan for
the kindness and assistance he provided at all levels of the research project. I would
like to express my thanks to the faculty and staff in the School of Engineering for their
support and help. I would like to give special thanks to my colleagues in the ISLAB and my
friends at the University of Guelph for their encouragement, support, and the great time
we spent together; special thanks to Elisha Colmenar, Ziad Abuowaimer, Omar Ahmed,
and Cynthia Mason. Finally, special thanks to my family. Words cannot express how
grateful I am to my mother, father, brothers, and sister for all of the sacrifices that you
have made on my behalf. Your prayers for me were what sustained me thus far.
List of Publications
• Journal Papers:
1. “An Efficient Framework for Floor-plan Prediction of Dynamic Runtime Reconfigurable
Systems,” submitted to the International Journal of Reconfigurable
and Embedded Systems (IJRES), February 19, 2015.
• Conference Papers:
1. “Efficient Mapping of Execution Units to Task Graphs using an Evolutionary
Framework,” accepted for publication at the International Symposium on
Highly Efficient Accelerators and Reconfigurable Technologies (HEART2015),
Boston, MA, USA, June 12, 2015.
2. “Efficient On-line Hardware/Software Task Scheduling for Dynamic Run-time
Reconfigurable Systems,” published in Parallel and Distributed Processing
Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th
International, pp. 401–406, 21–25 May 2012.
• Application Note:
1. “Real-Time Power Monitoring for Xilinx-based FPGAs (VC707 Board)” Pub-
lished in CMC Microsystems, Ontario - Canada, November 1, 2012.
Contents
1 Introduction 1
1.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivations and Proposed Approach . . . . . . . . . . . . . . . . . .. 6
1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 12
2.1 Reconfigurable Computing . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Data Flow Graph (DFG) . . . . . . . . . . . . . . . . . . . . 14
2.1.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 Run Time Reconfiguration (RTR) . . . . . . . . . . . . . . . . 18
2.1.4 Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . 21
2.2 Resource Management of Reconfigurable Systems . . . . . . . . . . . 27
2.2.1 Scheduling in Reconfigurable Systems . . . . . . . . . . . . . 28
2.3 Data Mining, Machine Learning and Classification . . . . . . . . . . 30
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3 Literature Review 33
3.1 Partial Dynamic Reconfigurable Systems . . . . . . . . . . . . . .. . 33
3.1.1 Dynamic Partial Reconfiguration Applications . . . . . .. . . 36
3.2 Reconfigurable Operating Systems . . . . . . . . . . . . . . . . . . .. 39
3.2.1 Scheduling for Reconfigurable Operating Systems . . . .. . . 46
3.3 Execution unit Allocation and Genetic Algorithm . . . . . .. . . . . . 49
3.4 Resource Prediction and Data-Mining . . . . . . . . . . . . . . . .. . 51
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4 Overall Methodology and Tools 57
4.1 Run-Time Reconfigurable Platform . . . . . . . . . . . . . . . . . . .. 57
4.1.1 PRR Uniformity . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 DFG Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 DFG Generator Sub-modules . . . . . . . . . . . . . . . . . . 63
4.3 Reconfigurable Simulator . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.1 Simulator Inputs . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.2 Simulator Output . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Methodology Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Modes of Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5 Reconfigurable Online Schedulers 83
5.1 Task Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.1.1 Task Placement Models . . . . . . . . . . . . . . . . . . . . . 86
5.2 Baseline Schedulers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.1 ILP Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.2 RCOffline Scheduler . . . . . . . . . . . . . . . . . . . . . . . 92
5.2.3 Meta-Offline Scheduler . . . . . . . . . . . . . . . . . . . . . . 93
5.2.4 RCSched-Base Scheduler . . . . . . . . . . . . . . . . . . . . 93
5.2.5 Baseline Schedulers Comparison . . . . . . . . . . . . . . . . . 93
5.3 Proposed Scheduling Algorithms . . . . . . . . . . . . . . . . . . . .. 95
5.3.1 RCSched-I and RCSched-II . . . . . . . . . . . . . . . . . . . 96
5.3.2 RCSched-III . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3.3 RCSched-III-Enhanced . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4.2 Preliminary Stage . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.4.3 Intermediate Stage . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4.4 Advanced Stage . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6 Allocation/Binding of Execution Units 130
6.1 Single Island GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.1.1 Initial population . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.1.2 Genetic Operators . . . . . . . . . . . . . . . . . . . . . . . . 133
6.1.3 Fitness Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 135
6.2 Island Based GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.2.1 Power Measurements . . . . . . . . . . . . . . . . . . . . . . . 138
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.3.1 Experimental Method . . . . . . . . . . . . . . . . . . . . . . . 139
6.3.2 Convergence of Island based GA . . . . . . . . . . . . . . . . . 140
6.3.3 The Pareto Front of Island GA Framework . . . . . . . . . . . 142
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7 Resource Prediction 158
7.1 Data Preparation Stage . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.2 Training Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.3 Classification Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.4.1 Classification Algorithms: Implementation . . . . . . . .. . . 170
7.4.2 Experimental Setup and Evaluation . . . . . . . . . . . . . . . 172
7.4.3 Results for Synthetic Benchmarks . . . . . . . . . . . . . . . . 173
7.4.4 Results for MediaBench DSP suite . . . . . . . . . . . . . . . . 177
7.4.5 Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . 179
7.4.6 Random Baseline . . . . . . . . . . . . . . . . . . . . . . . . . 184
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
8 Conclusions and Future Work 189
8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Bibliography 194
A Scheduling and Placement Restriction 213
A.1 Total Number of Reconfigurations . . . . . . . . . . . . . . . . . . . 213
A.2 Number of Hardware Task Reuse . . . . . . . . . . . . . . . . . . . . . 215
A.3 Busy Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
A.4 Hardware to Software Migration . . . . . . . . . . . . . . . . . . . 219
B Architecture Library 223
C FPGA Power Measurements 242
C.1 Power management of the VC707 board . . . . . . . . . . . . . . . . . 243
C.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
C.2.1 Monitoring FPGA resources using TI UCD9248PFC Power Con-
trollers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
List of Tables
2.1 Comparison of Representative Computing Architectures . . . . . . . . 14
2.2 Reconfiguration Speed for different interfaces . . . . . . .. . . . . . . 25
4.1 Reconfigurable Simulator Input Parameter . . . . . . . . . . . .. . . . 67
5.1 Description of task modes . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 Baseline schedulers comparison (values for Total time in # of cycles). . 95
5.3 Synthesized Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.4 MediaBench Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.5 Comparison of RCSched-I, RCSched-II and RCOffline . . . . .. . . . 108
5.6 Task type Constraints for restricted placement DFG . . . .. . . . . . . 115
5.7 Schedulers comparison for Total Execution Time (no learning) . . . . . 124
5.8 Schedulers comparison for Total Execution Time (after learning phase) . 125
5.9 Run-time comparison of RCOffline and RCSched-III-Enhanced . . . . 126
5.10 Schedulers comparison for Number of Hardware Reuses (no learning) . 127
5.11 Schedulers comparison for Number of Hardware Reuses (after learning) 128
6.1 Benchmark Specifications. . . . . . . . . . . . . . . . . . . . . . . . . 140
6.2 IBGA run-time for serial and parallel implementations .. . . . . . . . 141
6.3 The Average of 10 runs for the Best Fitness Values . . . . . . .. . . . 149
6.4 Fitness values for different weights (W) . . . . . . . . . . . . .. . . . 150
6.5 Exhaustive vs. IBGA (average of 10 runs) . . . . . . . . . . . . . .. . 150
7.1 Data Flow Graphs: Statistics . . . . . . . . . . . . . . . . . . . . . . 161
7.2 Extracted Features from DFGs . . . . . . . . . . . . . . . . . . . . . 165
7.3 Cases used to Evaluate the Framework along with Specifications . . . . 169
7.4 Accuracy with T-test evaluation (SVM) [synthesized benchmark]. . . . 176
7.5 Accuracy with T-test evaluation (J48) [synthesized benchmark]. . . . . 176
7.6 Training time for Case # 1 for individual classifiers . . . .. . . . . . . 177
7.7 Mediabench DSP Benchmark Specifications. . . . . . . . . . . . .. . 177
7.8 Accuracy with T-Test significant evaluation [synthesized benchmark] . . 181
7.9 Training time for different classification algorithms.. . . . . . . . . . . 181
7.10 Performance Enhancement for total time (Synthesized). . . . . . . . . . 185
7.11 Performance Enhancement for power (Synthesized). . . .. . . . . . . . 185
7.12 Performance Enhancement for total time (MediaBench DSP). . . . . . . 186
7.13 Performance Enhancement for power (MediaBench DSP). .. . . . . . 187
C.1 Power Rail Specification for UCD9248 PMBus Controllers . . . . . . 244
List of Figures
1.1 Essential Components of a Reconfigurable OS. . . . . . . . . . .. . . 3
1.2 (A) PRR layout, (B) Miscellaneous Platforms/Floorplans . . . . . . . . 6
2.1 Flexibility vs performance of processor classes . . . . . .. . . . . . . 13
2.2 Dataflow Graph for Quadratic Root . . . . . . . . . . . . . . . . . . . 15
2.3 FPGA structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Software Vs Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Including a customized IP within the RISC architecture (Customized
Instruction) [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Including a customized IP via the FSL interface onto MicroBlaze [1] . . 19
2.7 Partially Reconfigurable Region (PRR) X can be Loaded with Partial
Reconfiguration Module X1, X2, X4, or X3 . . . . . . . . . . . . . . . 22
2.8 Partial Reconfiguration Design Flow . . . . . . . . . . . . . . . . .. . 24
2.9 Loading a partial bit file . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.10 Data Mining Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.11 Supervised Learning Steps . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Framework: Major Blocks. . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Floorplan for the uniform (left) and non-uniform (right) implementations. 61
4.3 A node can be independent or can have several dependencies. . . . . . . 63
4.4 DFGs Generator Components . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Reconfigurable simulator layout . . . . . . . . . . . . . . . . . . . .. 66
4.6 Simulator task variant (architecture) library (example). . . . . . . . . . 69
4.7 Simulator platform (PRR) library (example). . . . . . . . . .. . . . . 70
4.8 Simulator DFG file example for the S1 benchmark . . . . . . . . .. . 71
4.9 Simulator output for the S1 benchmark, using the –verbose option. . . 72
4.10 Simulator graph file, where ’#’ represents reconfiguration and ’*’ execution;
the number is the task ID (S1 benchmark) . . . . . . . . . . . . . 73
4.11 Overall Methodology Flow . . . . . . . . . . . . . . . . . . . . . . . . 74
4.12 Operation Mode 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.13 Operation Mode 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.14 Operation Mode 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.15 Operation Mode 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1 A Data Flow Graph (DFG). . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 1D versus 2D Task Placement Area Models for Reconfigurable Devices 86
5.3 Meta-Offline scheduler processing time. . . . . . . . . . . . . .. . . . 94
5.4 Pseudo-code for RCSched-I & II (with reuse and task migration) . . . . 98
5.5 A simplified Pseudo-code for ROS, illustrates how the scheduler is called. 99
5.6 Task Type table structure . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.7 Pseudo-code for RCSched-III (with reuse and task migration) . . . . . . 104
5.8 Total Execution Time in mSec . . . . . . . . . . . . . . . . . . . . . . 111
5.9 Total Reconfiguration Time in mSec . . . . . . . . . . . . . . . . . . 112
5.10 Busy Counter: Increased when no Free Resources Available . . . . . . 113
5.11 Number of hardware to software Task Migration . . . . . . . .. . . . 114
5.12 Experimental Flow of the intermediate state testing . .. . . . . . . . . 115
5.13 Total Execution Time: Non-Restricted Placement . . . . .. . . . . . . 117
5.14 Total Execution Time : Restricted Placement . . . . . . . . .. . . . . 118
5.15 Total Reconfiguration Time: Non-Restricted Placement. . . . . . . . . 120
5.16 Total Reconfiguration Time: Restricted Placement . . . .. . . . . . . . 121
6.1 A Single GA Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.2 Task Graph to Chromosome Mapping (Binding/Allocation). . . . . . . 134
6.3 Proposed Island Based Framework . . . . . . . . . . . . . . . . . . . 137
6.4 Platforms with different floorplans . . . . . . . . . . . . . . . . .. . . 141
6.5 Convergence of synthesized benchmark S1 . . . . . . . . . . . . .. . 142
6.6 Convergence of synthesized benchmark S2 . . . . . . . . . . . . .. . 143
6.7 Convergence of synthesized benchmark S3 . . . . . . . . . . . . .. . 143
6.8 Convergence of synthesized benchmark S4 . . . . . . . . . . . . .. . 144
6.9 Convergence of MediaBench benchmark (DFG2) . . . . . . . . . .. . 144
6.10 Convergence of MediaBench benchmark (DFG6) . . . . . . . . .. . . 145
6.11 Convergence of MediaBench benchmark (DFG7) . . . . . . . . .. . . 145
6.12 Convergence of MediaBench benchmark (DFG12) . . . . . . . .. . . 146
6.13 Convergence of MediaBench benchmark (DFG14) . . . . . . . .. . . 146
6.14 Convergence of MediaBench benchmark (DFG16) . . . . . . . .. . . 147
6.15 Convergence of MediaBench benchmark (DFG19) . . . . . . . .. . . 147
6.16 The DFG for benchmark S3 . . . . . . . . . . . . . . . . . . . . . . . 148
6.17 Architecture binding for each platform (P1-P4), for the S3 Benchmark. . 148
6.18 Aggregated Pareto Front (Time vs Power) for synthesized benchmark
(S1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.19 Aggregated Pareto Front (Time vs Power) for synthesized benchmark
(S2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.20 Aggregated Pareto Front (Time vs Power) for synthesized benchmark
(S3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.21 Aggregated Pareto Front (Time vs Power) for synthesized benchmark
(S4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.22 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG2) . . . 153
6.23 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG6) . . . 153
6.24 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG7) . . . 154
6.25 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG12) . . . 154
6.26 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG14) . . . 155
6.27 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG16) . . . 155
6.28 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG19) . . . 156
7.1 Overall Methodology and Flow . . . . . . . . . . . . . . . . . . . . . . 159
7.2 Supervised learning: Training and testing . . . . . . . . . . .. . . . . 164
7.3 Individual Classifiers: Prediction Accuracy for the synthesized benchmarks . . . 175
7.4 Imbalance ratio between Minority/Majority Classes [synthesized benchmark] . . . 175
7.5 Individual Classifiers: Prediction Accuracy for MediaBench . . . . . . 178
7.6 Ensemble Classifiers: Prediction Accuracy for the Synthesized Benchmark . . . 182
7.7 Ensemble Classifiers: Prediction Accuracy for MediaBench . . . . . . 182
7.8 Ensemble Classifiers: Area Under ROC for the Synthesized Benchmark 183
7.9 Ensemble Classifiers: Area Under ROC for MediaBench . . . . . . . . 183
A.1 Total Number of Reconfigurations: Non-Restricted Placement . . . . . 214
A.2 Total Number of Reconfigurations: Restricted Placement. . . . . . . . 215
A.3 Total Number of HW reuse: Non-Restricted Placement . . . .. . . . . 216
A.4 Total Number of HW reuse: Restricted Placement . . . . . . . .. . . . 217
A.5 Idle Time Measurement: Non-Restricted Placement . . . . .. . . . . . 218
A.6 Idle Time Measurement: Restricted Placement . . . . . . . . .. . . . . 219
A.7 Number of hardware to software Task Migration: Non-Restricted Place-
ment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
A.8 Number of hardware to software Task Migration: Restricted Placement 222
C.1 Power distribution of Xilinx VC707 Board from [2]. . . . . .. . . . . 244
C.2 Texas Instrument USB to GPIO adapter . . . . . . . . . . . . . . . . 246
C.3 Xilinx VC707 board; the TI adapter should be connected to the highlighted
PMBus connector (J5) . . . . . . . . . . . . . . . . . . . . . . 246
C.4 Window showing Fusion software modes, press OK . . . . . . . .. . . 247
C.5 Select Device, click the first link . . . . . . . . . . . . . . . . . . .. . 248
C.6 Fusion Monitor View (measurements can be observed). The monitor
section on the top left shows or hides real-time graphs; the device and
rail can be changed from the top right corner. . . . . . . . . . . . . . . 249
C.7 Fusion Monitor View; the rail dashboard shows real-time readings of all
metrics, highlighted on the top left side. . . . . . . . . . . . . . . . . . 250
C.8 Select the rail and device to be monitored. . . . . . . . . . . . .. . . . 251
C.9 Fusion Monitor View; the rail dashboard shows real-time readings of all
metrics, highlighted on the top left side. . . . . . . . . . . . . . . . . . 251
Abbreviations
SoC : System on Chip
GPP : General Purpose Processor
ASIP : Application Specific Processor
CLB : Configurable Logic Block
PR : Partial Reconfiguration
PRR : Partial Reconfigurable Region
FPGA : Field Programmable Gate Array
ASIC : Application Specific Integrated Circuit
PRM : Partial Reconfigurable Module
OS : Operating System
RTOS : Real-Time Operating System
SM : Static Module
GA : Genetic Algorithm
PE : Processing Element
DM : Data-Mining
RC : Reconfigurable Computing
DFG : Data Flow Graph
EA : Evolutionary Algorithm
DAG : Directed Acyclic Graph
ROS : Reconfigurable Operating System
RTR : Run-Time Reconfiguration
RCS : Reconfigurable Computing Systems
IBGA : Island Based Genetic Algorithm
Chapter 1
Introduction
In the area of computer architecture, choices span a wide spectrum, with Application
Specific Integrated Circuits (ASICs) and General Purpose Processors (GPPs) at
opposite ends. General purpose processors are flexible but, unlike ASICs, are not optimized
for specific applications. Reconfigurable architectures in general, and Field Programmable
Gate Arrays (FPGAs) in particular, fill the gap between these two extremes
by combining the high performance of ASICs with the flexibility of GPPs. However,
FPGAs still match neither the lower power consumption of ASICs nor the performance
of the latter. One important feature of FPGAs is their capability to adapt during
the run-time of an application. The run-time reconfiguration capability of FPGAs
provides benefits such as adapting hardware algorithms during system run-time,
sharing hardware resources to reduce device count, reducing power consumption,
and shortening reconfiguration time [3].

Several embedded application domains for reconfigurable systems tend to combine
frequent changes with high performance demands in their workloads, such as image
processing, wearable computing, and network processors. In many of these embedded
systems, several wireless standards and technologies, such as WiMax, WLAN, GSM, and
WCDMA, have to be utilized and supported. However, it is unlikely that these protocols
will be used simultaneously. Accordingly, it is possible to dynamically load only the one
that is needed. Another example is deploying different machine vision algorithms on
an Unmanned Aerial Vehicle (UAV) and utilizing the most appropriate algorithm based on
the environment, or perhaps on the need to lower power consumption.
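The dynamic-loading idea above can be sketched as a simple lookup: only the accelerator for the standard currently in use is configured onto the fabric. This is an illustrative sketch; the function and bitstream file names below are hypothetical and not part of any vendor API.

```python
# Hypothetical registry mapping each supported wireless standard to the
# partial bitstream implementing its accelerator. File names are invented.
BITSTREAMS = {
    "wlan": "wlan_phy.bit",
    "gsm": "gsm_phy.bit",
    "wcdma": "wcdma_phy.bit",
}


def select_bitstream(active_protocol: str) -> str:
    """Return the partial bitstream for the single protocol in use.

    Since the protocols are not used simultaneously, only this one
    bitstream needs to occupy a reconfigurable region at any time.
    """
    try:
        return BITSTREAMS[active_protocol]
    except KeyError:
        raise ValueError(f"no accelerator available for {active_protocol!r}")
```

A run-time system would pass the selected file to the device's configuration interface; the point is that the fabric holds one accelerator at a time rather than all of them.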
Time multiplexing of reconfigurable hardware resources raises a number of new issues,
ranging from run-time systems to complex programming models, that usually form
a Reconfigurable hardware Operating System (ROS). The operating system performs on-line
task scheduling and handles resource management. The main objective of a ROS
is to reduce the complexity of developing applications by giving the developer a higher
level of abstraction with which to work. The basic components of any ROS are illustrated
in Figure 1.1; they include a bit-stream manager, scheduler, placer, and communications
network.
The mapping and partitioning of an application starts by decomposing the application
into constituent tasks. The tasks are then structured in a Data Flow Graph (DFG) to
identify dependencies between them. Finally, the tasks are scheduled and mapped
onto hardware resources [4].
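The decompose-then-schedule flow above can be made concrete with a small sketch: a DFG is a directed acyclic graph, and any schedule must respect its dependency edges, which a topological ordering (Kahn's algorithm) provides. The task names below are invented for illustration.

```python
from collections import deque

# A DFG as adjacency lists: task -> tasks that depend on its result.
# Task names are illustrative, not from any benchmark in the thesis.
dfg = {
    "load_a": ["mul"],
    "load_b": ["mul"],
    "mul": ["add"],
    "load_c": ["add"],
    "add": ["store"],
    "store": [],
}


def schedule_order(graph):
    """Kahn's algorithm: return tasks in an order respecting all dependencies."""
    indegree = {task: 0 for task in graph}
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1
    ready = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for s in graph[task]:  # releasing this task may unblock successors
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order
```

A real ROS scheduler layers resource constraints (free PRRs, reconfiguration cost, reuse) on top of this dependency order, but the DFG itself fixes which orderings are legal.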
1.1 Problem definition
Performance is one of the fundamental reasons for using Reconfigurable Computing
Systems (RCS). By mapping algorithms and applications to hardware, designers can
not only tailor the computation components, but also perform data-flow optimization to
match the algorithm.

Figure 1.1: Essential Components of a Reconfigurable OS.

One of the main problems encountered in Run-Time Reconfiguration (RTR) is
identifying the most appropriate framework or infrastructure that suits
an application. Typically, a designer would manually divide the FPGA fabric into both
static and dynamic regions [5]. The static regions accommodate modules that do
not change over time, such as the task manager and the buses used for communication.
The dynamic region is partitioned into a set of uniform or non-uniform regions
of a certain size (we refer to these as Partial Reconfigurable Regions (PRRs)) that
can accommodate Reconfigurable Modules (RMs), in other words application-specific
hardware accelerators for the incoming tasks that need to be executed. Every application
(e.g., machine vision, wireless sensor networks, etc.) requires specific types of
resources that optimize certain objectives, such as reducing power consumption and/or
improving execution time.
In current RTR systems, designers tend to manually perform resource allocation and
floor-planning of the FPGA fabric a priori. These allocated resources, however, might
not be appropriate for a new and different incoming application (e.g., streaming,
non-streaming, hybrid). If the FPGA fabric is not tailored to a particular application,
performance suffers whenever the floor-plan is a mismatch for the application. In
general, a one-size-fits-all approach not only hinders the performance sought by using
RTR, but may also adversely affect power consumption. The latter may occur since
meeting performance requirements might entail the use of multiple PRRs. Accordingly,
an adaptive and dynamic approach is necessary for performing both resource estimation
and floorplanning.
Another important issue is related to task scheduling and placement. In any type of
operating system a scheduler decides when to load new tasks to be executed. Efficient
task scheduling algorithms have to take task dependencies, data communication, task
resource utilization and system parameters into account to fully exploit the performance
of a dynamic reconfigurable system. Adapting a conventional scheduler is not an option
since schedulers in ROS differ from conventional OS schedulers in several ways:
1. ROS schedulers are heavily dependent on the placement of hardware tasks, while
in conventional systems the scheduler can be independent from memory allocation.
2. Computation resources are available as soon as a task is placed in the reconfig-
urable fabric, while in conventional systems the task may wait in a ready queue for
free processing resources.
3. ROS schedulers have to take into account reconfiguration time and should mini-
mize this time by taking into account task prefetching and reuse.
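The reuse and reconfiguration concerns above can be made concrete with a minimal placement sketch. This is an illustration, not the thesis schedulers; the `PRR` class, cost model and all names are hypothetical, assuming a scheduler that prefers a free region already configured with the required module:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class PRR:
    """A partial reconfigurable region and the module type currently loaded."""
    loaded: Optional[str] = None
    busy: bool = False

def place(task_type: str, prrs: List[PRR],
          reconfig_cost: float, exec_cost: float) -> Optional[Tuple[int, float]]:
    """Pick a PRR for an incoming hardware task, preferring module reuse.

    Returns (prr_index, total_cost), or None if no region is free (a caller
    could then fall back to software execution).
    """
    free = [i for i, p in enumerate(prrs) if not p.busy]
    if not free:
        return None
    # Task reuse: a free PRR that already holds this module type costs only
    # the execution time -- no reconfiguration overhead is paid.
    for i in free:
        if prrs[i].loaded == task_type:
            return i, exec_cost
    # Otherwise reconfigure the first free region before executing.
    i = free[0]
    prrs[i].loaded = task_type
    return i, reconfig_cost + exec_cost
```

With two empty regions, the first placement of a task type pays the reconfiguration cost, while a second placement of the same type does not.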
A scheduling algorithm for a reconfigurable architecture that assumes one architec-
ture variant for the hardware implementation of each task may also lead to inferior so-
lutions [4]. Having multiple hardware implementations per task helps in reducing the
imbalance between the processing throughput of interacting tasks [6]. Also, static re-
source allocation for RTR might lead to inferior results. The number of PRRs for one
application might be different than the number required by another. The type of PRRs
(uniform, non-uniform, hybrid) also plays a crucial role in determining both performance
and power consumption, as seen in Figure 1.2.
Figure 1.2: (A) PRR layout, (B) Miscellaneous Platforms/Floorplans
1.2 Motivations and Proposed Approach
The main goal of this dissertation is to propose and implement an infrastructure for
reconfigurable operating systems that aids the designer from the early stages of the
design all the way down to the actual implementation and submission of the application
onto the reconfigurable fabric. Therefore, we started by implementing a platform and
scheduling algorithms for reconfigurable computing that can manage both hardware and
software tasks, along with developing tools and benchmarks to validate and measure the
schedulers' performance. Three on-line scheduling algorithms were developed. All three
schedulers tend to reuse hardware resources for tasks to reduce reconfiguration overhead,
migrate tasks between software/hardware, and assign priorities to task types, while
respecting task priority and maintaining precedence. The first two schedulers, called
RCSched-I and RCSched-II, give priority to hardware tasks [7]. In the case of a lack of
hardware resources, both RCSched-I and RCSched-II initiate the migration of hardware
tasks to software. RCSched-III, on the other hand, migrates tasks to SW only when it is more
beneficial to do so. In particular, RCSched-III dynamically measures several system
metrics, such as execution time and reconfiguration time, then calculates the priority for
each task type. RCSched-III then migrates and assigns tasks to the most suitable
processing elements (SW or HW) based on the calculated priorities.
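The decision rule described above can be sketched in a few lines. This is an illustrative simplification, not the actual RCSched-III code; the function name and the cost model (measured averages per task type) are assumptions:

```python
def choose_pe(hw_exec: float, reconfig: float, sw_exec: float,
              module_loaded: bool) -> tuple:
    """Assign a task to HW or SW based on measured costs (illustrative only).

    hw_exec, reconfig and sw_exec stand for running averages a scheduler has
    measured for this task type; module_loaded indicates that a PRR already
    holds the task's bitstream, so reconfiguration can be skipped.
    """
    hw_cost = hw_exec + (0.0 if module_loaded else reconfig)
    # Migrate to software only when it is genuinely cheaper.
    return ("HW", hw_cost) if hw_cost <= sw_exec else ("SW", sw_exec)
```

Note how the same task type can land on different processing elements over time: once its bitstream is resident, hardware becomes the cheaper choice even when the reconfiguration penalty initially favoured software.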
A dynamic reconfigurable framework consisting of five reconfigurable regions and
two GPPs was implemented. The developed schedulers were tested using a collection
of tasks represented by DFGs. To verify the scheduler functionality and performance,
DFGs with different sizes and parameters were required; therefore, a DFG generator was
also designed and implemented. The DFG generator is able to generate benchmarks with
different sizes and features. However, using such a dynamic reconfigurable platform to
evaluate hundreds of DFGs based on different hardware configurations is both complex
and tedious. The FPGA platform used in this work requires a different floor-plan
and bit-stream for each new configuration, which limited the scope of testing and
evaluation. Accordingly, a reconfigurable architecture simulator was developed to
simulate the hardware platform while running the developed reconfigurable operating system.
The simulator allows for faster evaluation and flexibility, and thus can support
different hardware scenarios by varying the number of processing elements (PRRs, GPPs),
the size/shape of PRRs, and the schedulers.
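A DFG generator of the kind described can be sketched compactly. This illustrative version (all names hypothetical) guarantees acyclicity by only drawing dependency edges from lower- to higher-numbered nodes:

```python
import random

def generate_dfg(num_nodes: int, num_edges: int, task_types, seed=None):
    """Generate a random acyclic DFG as {node: {"type": ..., "deps": [...]}}.

    Dependencies only point from lower- to higher-numbered nodes, so the
    graph is guaranteed acyclic and 0..num_nodes-1 is a valid precedence
    order. num_edges is capped at the number of possible forward edges.
    """
    rng = random.Random(seed)
    dfg = {n: {"type": rng.choice(task_types), "deps": []}
           for n in range(num_nodes)}
    forward = [(i, j) for j in range(1, num_nodes) for i in range(j)]
    for i, j in rng.sample(forward, min(num_edges, len(forward))):
        dfg[j]["deps"].append(i)
    return dfg
```

Varying `num_nodes`, `num_edges` and `task_types` yields benchmark graphs of different sizes and features, in the spirit of the generator described above.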
Contrary to software tasks, hardware tasks may have multiple implementation variants
(execution units, or hardware implementations). These variants differ in terms of
performance, power consumption, and area. Since limiting a scheduler to one execution
variant may lead to inferior solutions, we propose an efficient optimization framework
that mitigates this deficiency. The optimization framework is based
on an Island Based Genetic Algorithm technique. Given a particular task graph, each
island seeks to optimize speed, power and/or area for a different floorplan. The
floorplans differ in terms of the number, size and layout of PRRs, as described earlier.
The framework uses the online schedulers, integrated with an island-based GA engine,
to evaluate the quality of a particular schedule (in terms of power and performance)
and floorplan.
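The island-based flow can be illustrated with a toy example. This sketch is not the thesis framework: it replaces the scheduler-based evaluation with a closed-form weighted time/power cost, assumes at least two tasks and two parents per island, and every name and parameter is an assumption made for illustration:

```python
import random

def island_ga(costs, island_weights, pop=20, gens=30, migrate_every=10, seed=0):
    """Toy island-based GA over execution-variant assignments.

    costs[t][v] = (time, power) of variant v for task t; an individual is a
    tuple with one variant index per task. Each island minimizes its own
    weighted time/power objective, and islands exchange their best
    individual in a ring every `migrate_every` generations.
    """
    rng = random.Random(seed)
    n = len(costs)

    def fit(ind, w):
        time = sum(costs[t][v][0] for t, v in enumerate(ind))
        power = sum(costs[t][v][1] for t, v in enumerate(ind))
        return w[0] * time + w[1] * power          # lower is better

    islands = [[tuple(rng.randrange(len(costs[t])) for t in range(n))
                for _ in range(pop)] for _ in island_weights]
    for g in range(gens):
        for k, w in enumerate(island_weights):
            ranked = sorted(islands[k], key=lambda ind: fit(ind, w))
            parents = ranked[:pop // 2]            # elitist selection
            children = []
            while len(parents) + len(children) < pop:
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, n)          # one-point crossover
                child = list(a[:cut] + b[cut:])
                m = rng.randrange(n)               # single-gene mutation
                child[m] = rng.randrange(len(costs[m]))
                children.append(tuple(child))
            islands[k] = parents + children
        if g % migrate_every == 0:                 # ring migration of elites
            elites = [min(isl, key=lambda ind, w=w2: fit(ind, w))
                      for isl, w2 in zip(islands, island_weights)]
            for k in range(len(islands)):
                islands[k][-1] = elites[(k - 1) % len(islands)]
    return [min(isl, key=lambda ind, w=w2: fit(ind, w))
            for isl, w2 in zip(islands, island_weights)]
```

With two islands weighted purely toward time and purely toward power, the run returns one champion variant assignment per island, mirroring how each island in the thesis framework specializes on a different objective/floorplan.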
One of the major understudied problems related to OS development for reconfig-
urable systems that this thesis seeks to address is predicting the required resources such
as soft cores, PRRs, and communication infrastructure to name just a few. Static re-
source allocation for RTR might lead to inferior results. The number of PRRs for one
application might be different than the number required by another. The type of PRRs
(uniform, non-uniform, hybrid) also plays a crucial role indetermining both performance
and power consumption. The type of scheduler used to determine when/where a task is
executed is also important for specific real-time operations. The type of communication
infrastructure that connects PRRs with the task manager plays an important role in
speeding up a certain application. Accordingly, in this thesis we seek to overcome the
limitation of static resource allocation with a more appealing approach that can adapt
the infrastructure of the reconfigurable computing platform to accommodate and match
the application rather than the reverse. Therefore, we present a novel adaptive and dynamic
methodology based on an intelligent machine learning approach that is used to predict
and estimate the necessary resources for an application based on past historical
information. An important feature of the proposed methodology is that the system is
able to learn as it gains more knowledge and, therefore, is expected to generalize and
improve its accuracy over time. Even though the approach is general enough to predict most if
not all types of resources from the number of PRRs, type of PRRs, the type of scheduler,
and communication infrastructure, we limit our results to the first three required for
an application. This task is accomplished by first extracting certain features from the
applications that are executed on the reconfigurable platform. The compiled features
are then used to train and build a classification model that is capable of predicting
the floorplan appropriate for an application. The classification model is a supervised
learning approach that can generalize and accurately predict the class of an incoming
application from previously seen patterns. Our proposed approach is based on several
modules including benchmark generation, data collection, pre-processing of data, data
classification, and post-processing. The goal of the entire process is to extract useful
hidden knowledge from the data. This knowledge is then used to predict and estimate
the necessary resources and appropriate floorplan for an unknown or previously unseen
application.
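As an illustration of such a supervised model, the sketch below trains a minimal nearest-centroid classifier on hypothetical application features. The thesis pipeline is more elaborate; the feature vectors, class labels and function names here are invented for illustration:

```python
def train_centroids(samples):
    """Fit a minimal nearest-centroid classifier.

    samples: list of (feature_vector, label) pairs; the features are
    hypothetical, e.g. [number of tasks, fraction of streaming tasks].
    Returns {label: centroid vector}.
    """
    sums, counts = {}, {}
    for x, y in samples:
        counts[y] = counts.get(y, 0) + 1
        prev = sums.get(y, [0.0] * len(x))
        sums[y] = [a + b for a, b in zip(prev, x)]
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """Predict the floorplan class whose centroid is nearest to x."""
    sq = lambda c: sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: sq(centroids[y]))
```

Once trained on labelled past applications, the model assigns a previously unseen feature vector to the class (and hence floorplan) of its nearest centroid, which is the essence of the prediction step described above.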
1.3 Research Contributions
The main contributions of this thesis can be categorized into three major paradigms:
1. Platform and Schedulers: The development and evaluation of several novel heuristics
for on-line scheduling of hard real-time tasks for partially reconfigurable devices.
The proposed schedulers use fixed predefined partial reconfigurable regions with
re-use, relocation, and task migration capability. In particular, RCSched-III uses
real-time data to make scheduling decisions. The scheduler dynamically measures
several performance metrics, such as reconfiguration time and execution time,
calculates a priority, and based on these metrics assigns incoming tasks to the
appropriate processing elements. In order to evaluate the proposed framework and
schedulers, a DFG generator was developed. It randomly generates benchmarks
with a predefined specification, such as the number of nodes, task types and total
number of dependencies per DFG. To the best of our knowledge, this is the first
actual implementation of a reconfigurable system with software/hardware task migration.
2. Multiple Execution Variants: A multi-objective optimization framework capable
of optimizing total execution time and power consumption for static and partial
dynamic reconfigurable systems is proposed. This framework is not only capable
of optimizing the above-mentioned objectives, but can also determine the most
appropriate reconfigurable floorplan/platform for the application. This way, a good
trade-off between performance and power consumption can be achieved, which results
in high energy efficiency. To the best of our knowledge, this is the first attempt
to use multiple GA instances for optimizing several objectives and aggregating the
results to further improve solution quality.
3. Resource Prediction: The majority of work on RCS relies on a static approach to
estimate and decide upon the resources required to solve a specific problem. In
this work, we instead propose a novel dynamic and adaptive approach. To the best
of our knowledge, the use of data-mining and machine-learning techniques has not
been proposed by any research group to exploit this specific type of Design
Exploration for Reconfigurable Systems in terms of predicting the appropriate
floorplan of an application.
1.4 Thesis Organization
The remainder of the thesis is organized as follows: Chapter 2 provides essential
background on reconfigurable computing, RTR, and machine learning. Chapter 3 introduces
the main published works in the fields of reconfigurable computing, reconfigurable
schedulers/OS, and resource prediction. The overall methodology that describes the
different phases of the research and modes of operation is introduced in Chapter 4. The
proposed dynamic run time reconfigurable platform along with three novel schedulers
are introduced in Chapter 5. Chapter 6 introduces the evolutionary framework devel-
oped for allocating and binding execution units to task graphs. The novel data-mining
resource prediction framework is discussed in Chapter 7. Finally, the thesis conclusion
is provided in Chapter 8 along with future work.
Chapter 2
Background
In this chapter, necessary background material describing reconfigurable computing
systems will be introduced. The concepts of run-time reconfiguration, FPGAs, and the
partial reconfiguration design flow will be explained as well. Topics related to the
research, such as data mining and scheduling, will also be introduced to help the reader
understand the remainder of this thesis.
2.1 Reconfigurable Computing
Reconfigurable systems refer to computational models that are based on dynamically
reconfigurable devices [8, 9]. A good survey on reconfigurable computing can be found
in [10–13]. Reconfigurable computers usually consist of one or more general purpose
processors coupled with at least one reconfigurable device. A reconfigurable device can
simply be defined as a special kind of hardware that can be programmed into whatever
logic the user desires [8, 14].
The two main performance measures used to evaluate any processor are speed and
flexibility. Von Neumann (VN) based processors are very flexible and accordingly are
referred to as General Purpose Processors (GPPs). They can compute almost any task
sequentially. The sequential nature of VN processors impedes their performance. On
the other hand, Application Specific Processors (ASIPs) deliver higher performance as
they are optimized for the application (the hardware is adapted to the application).
Reconfigurable computing can ideally take the best of both worlds: the flexibility of a
VN processor and the performance of an ASIP, as illustrated in Figure 2.1 [9].
Figure 2.1: Flexibility vs performance of processor classes
Digital systems are traditionally classified into processor-based systems and
Application Specific Integrated Circuits (ASICs). The former is general, while the
latter is more specific. An ASIC is an integrated circuit specifically designed to
perform specialized and unique functions in hardware. ASICs usually execute several
functions in parallel on a chip. An ASIC can replace a GPP for a single application,
delivering extreme performance and low power. However, an ASIC's fixed resources and
algorithm architecture make it inflexible and very expensive per application. As a
trade-off between the two extreme characteristics of GPPs and ASICs, reconfigurable
computing can combine the advantages of both. Table 2.1 compares the features of the
three systems [14].
Architecture       General Purpose   ASIC     Reconfigurable
Resources          Fixed             Fixed    Configware
Algorithms         Software          Fixed    Flowware
Performance        Low               High     Medium
Cost               Low               High     Medium
Power              Medium            Low      Medium
Flexibility        High              Low      High
Computing Model    Mature            Mature   Immature
NRE                Low               High     Medium
Table 2.1: Comparison of Representative Computing Architectures
Reconfigurable computing applications can be found in many fields. It can be applied
to embedded systems, network security applications, multimedia applications, vision,
scientific computing, unmanned airborne vehicles and SATisfiability solvers. More
details can be found in [9, 15].
2.1.1 Data Flow Graph (DFG)
A Data Flow Graph (DFG) provides the means to describe a computing task in a stream-
ing mode. A DFG can represent any high level language such as C or C++, where each
operator represents a node in the Data Flow Graph. The inputsof the nodes are the
operands on which the corresponding operator is applied. The output of the node repre-
sents the result of the operation on that node. The output of anode is used as an input for
another node, and accordingly data dependency is defined in the graph. As an example
the DFG shown in Figure 2.2 represents the quadratic root formula in Equation 2.1 [9].

f = ((b^2 - 4ac)^(1/2) - b) / (2a)    (2.1)
Figure 2.2: Dataflow Graph for Quadratic Root
A Data Flow Graph is a directed acyclic graph which can be represented by G = (V, E),
where V is the set of vertices and E is the set of edges. Each vertex can represent a
task; therefore, a set of nodes V can represent a set of tasks T = {T1, T2, T3, ..., Tn}.
An edge e = (vi, vj) ∈ E is defined through the data dependency between task Ti and task Tj.
To model hardware tasks with a DFG, a hardware implementation is needed for each
task (i.e., node) that occupies a rectangular area of the chip. The nodes and edges of a
DFG can possess hardware characteristics such as width, height, area, latency and
speed.
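As a concrete illustration, the quadratic-root DFG of Figure 2.2 can be encoded as such a dependency graph and evaluated in dependency order. The encoding and all node names below are invented for illustration; constants are fed in as primary inputs:

```python
import math

def eval_dfg(nodes, inputs):
    """Evaluate a DFG {node: (op, [operands])} respecting data dependencies.

    Operands are primary-input names or other node names. Because the graph
    is acyclic, memoized recursion visits nodes in dependency order.
    """
    ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y,
           "*": lambda x, y: x * y, "/": lambda x, y: x / y,
           "sqrt": math.sqrt}
    values = dict(inputs)

    def value(name):
        if name not in values:
            op, args = nodes[name]
            values[name] = ops[op](*(value(a) for a in args))
        return values[name]

    return {n: value(n) for n in nodes}

# One possible encoding of the quadratic-root DFG of Equation 2.1:
# f = ((b^2 - 4ac)^(1/2) - b) / (2a)
quad = {
    "b2":   ("*", ["b", "b"]),
    "4a":   ("*", ["four", "a"]),
    "4ac":  ("*", ["4a", "c"]),
    "disc": ("-", ["b2", "4ac"]),
    "sq":   ("sqrt", ["disc"]),
    "num":  ("-", ["sq", "b"]),
    "den":  ("*", ["two", "a"]),
    "f":    ("/", ["num", "den"]),
}
```

With a = 1, b = -3, c = 2 (and the constants four = 4, two = 2), the graph evaluates f to 2.0, a root of x^2 - 3x + 2.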
2.1.2 FPGA
FPGAs (Field Programmable Gate Arrays) are a specific family of integrated circuits
that are used for implementing custom digital circuits. One of their main properties is
the ability to be configured an infinite number of times [14]. More specifically, in
SRAM-based FPGAs the configuration data is stored in volatile memory. Reconfiguring an
FPGA changes its functionality to support a new application, which is equivalent to
mapping new hardware into the system for a different application.
A generic FPGA architecture is illustrated in Figure 2.3. The main building blocks of
an FPGA are Configurable Logic Blocks (CLBs), which contain the logic circuits;
Input/Output Blocks (IOBs), which interface the internal logic to the pins of the FPGA
package; and communication resources that allow arbitrary connection of CLBs and IOBs.
FPGAs usually have additional resources such as clock managers, dedicated multipliers,
dual-port RAMs and sometimes microprocessors [14].
2.1.2.1 Accelerator Coupling Strategies
FPGAs have evolved over the years from simple Programmable Logic Devices (PLDs) to
complete Systems On Chip (SOCs), containing microprocessors, memories, DSP blocks
and highly optimized connection paths with unlimited reconfigurability. This development
gives the designer a plethora of logic gates to use (millions of gates).
Figure 2.3: FPGA structure

There are different ways to connect a user IP into an embedded microprocessor-based
system. Generally speaking, an application can be implemented either in software
or hardware. Parallel execution of an algorithm is the main advantage of a hardware
implementation. This is especially important for strict timing-driven applications,
whereas a software implementation (e.g., C/C++) offers easier management of the user
IP. Figure 2.4 shows an example of how parallel execution can be used: the software
routine calculates the result of F in 12 clock cycles, whereas the hardware routine
computes the same result in only 2 clock cycles [1].
The customized IP core can be integrated inside the Reduced Instruction Set Computing
(RISC) processor architecture, as seen in Figure 2.5, or externally, as demonstrated in
Figure 2.6. The integration of a customized IP core within the execution unit is very
restrictive, though, for two reasons. The first is the nature of RISC processors, which
have an ALU with two inputs and one output. Instructions that need different I/Os will be very
Figure 2.4: Software vs Hardware
difficult to implement. The second reason is that changing the execution unit changes
the critical path for all instructions, which may delay the built-in instructions and
further reduce the overall frequency of the processor, as shown in Figure 2.5. On the
other hand, using a hardware accelerator does not affect the critical path and does not
alter the processor's capabilities.
2.1.3 Run Time Reconfiguration (RTR)
Run time reconfiguration (RTR) is a feature of reconfigurable systems that enables a
change in the functionality of a device during operation. This is a built-in feature of
FPGAs and can be used to reduce component count, power consumption and cost by
reconfiguring the same FPGA with different applications (tasks).
Figure 2.5: Including a customized IP within the RISC architecture (Customized Instruc-tion) [1]
Figure 2.6: Including a customized IP via the FSL interface onto MicroBlaze [1]
2.1.3.1 Local vs. Global Configuration
FPGAs may be configured completely (global configuration) or partially (local
configuration). Global configuration suspends the FPGA's operation during
configuration, which takes a relatively long time due to the large configuration
bitstream. On the other hand, local configuration is faster because it needs a smaller
configuration bitstream. Another benefit of local configuration is the ability (on some
FPGAs) of on-line configuration, i.e., without suspending the rest of the FPGA.
On-line partial reconfiguration, or Dynamic Partial Reconfiguration, allows the
configuration circuit to be embedded inside the FPGA.
2.1.3.2 Spatial/Temporal Partitioning
Partitioning of hardware tasks is important in reconfigurable computing. It seeks to
partition a large task into sub-tasks that fit within any FPGA. Spatial partitioning
maps a complex circuit onto several FPGAs, while temporal partitioning loads the
sub-tasks in sequence into the same FPGA. In temporal partitioning there is a need to
keep track of intermediate data between sub-tasks and data dependencies between tasks.
Another important factor for temporal partitioning is reconfiguration time, which needs
to be fast enough to meet the application's timing. Temporal partitioning can use
global reconfiguration with an external reconfiguration controller to reconfigure the
entire FPGA each time, or it can benefit from dynamic partial reconfiguration with an
embedded reconfiguration controller. Dynamic partial reconfiguration eliminates the
need for an external controller. This can reduce the overhead of reconfiguration, as a
result of using smaller bitstreams and special techniques such as task pre-fetching and
task re-use [16].
2.1.4 Partial Reconfiguration
Partial Reconfiguration is a unique feature of Xilinx FPGAs that allows the reconfigura-
tion of a part of the device while the remainder of the fabric remains in operation mode.
In recent years, Xilinx FPGAs have gone through multiple hardware improvements to
better support partial reconfiguration.
The two key hardware improvements are:
• Smaller units of reconfiguration granularity: from full-device-height
reconfiguration frames in the Virtex-II and Virtex-II Pro families to partial CLB
blocks in the newer FPGAs. For example, the configuration frame is 16 CLBs high in
the Virtex-4 family.
• Increased bandwidth of the internal configuration access port: from 800 Mbit/s
in the Virtex-II and Virtex-II Pro families to 3.2 Gbit/s in the Virtex-4 family and
beyond.
Xilinx also introduced the Early Access Partial Reconfiguration (EAPR) design flow,
which organizes the process to some extent [17]. Xilinx introduced the first commercial
partial reconfiguration design flow that further simplifies the process. Partial
reconfiguration is usually used for time-multiplexing multiple functions of an
application that are not required at the same time on the FPGA (mutually exclusive
functions). This approach reduces power consumption and enables the use of a smaller
FPGA device, hence reducing cost. Figure 2.7 illustrates the concept of partial
reconfiguration, where a PRR can be reconfigured with different reconfiguration modules.
Figure 2.7: Partially Reconfigurable Region (PRR) X can be Loaded with Partial Reconfiguration Module X1, X2, X4, or X3
2.1.4.1 Static and Partial Reconfigurable Regions
Any PR design consists of two types of regions: Static and Partial Reconfigurable
Regions (PRRs). Static regions contain the logic that does not change during partial
configurations. The static region may contain the circuit that controls the PR process.
The logic of a PRR can be reconfigured independently of other PRRs and the static
region. A PRR should have at least two Partial Reconfigurable Modules (PRMs) with
different functionality. Xilinx allows more than one PRR in a design. Those regions
must be specified manually on the FPGA using the PlanAhead floor-planner.
2.1.4.2 Partial Reconfiguration design flow
The existing flow from the Hardware Description Language (HDL) to configuration
bitstream is extremely complicated. Therefore, design hierarchy limitations exist to
aid the software tools in creating PR designs. The primary limitation requires that
the top-level module contain submodules that are either static modules (SMs) or
partially reconfigurable modules (PRMs) [18]. Special bus macros provided by Xilinx
are used and must be explicitly declared for communication between static and PRR
regions, with the exception of the clock (the latest design flow no longer requires bus
macros). The required hierarchy imposes a substantial amount of effort when converting
an existing static design into one that is suitable for PR. The restriction of having
all the PRMs at the top level will usually lead to routing many signals to other
modules within the main static module.
2.1.4.3 Partial Reconfiguration software and design flow
The Early Access PR software provided by Xilinx is only compatible with ISE 9.2i
Service Pack 4. The PR extension is a patch to the ISE 9.2 tools, and it is not
compatible with the newer ISE software. The new partition-based design flow started
with ISE 12.1, and it only supports the high-end Virtex-4, 5, 6 and 7 families (recent
lower-end Xilinx 7-Series FPGAs also support partial reconfiguration). The PR design
flow is illustrated in Figure 2.8. For this flow, each PRM, along with the static
design, must be implemented in a separate directory and then merged to generate the
full and partial configuration bitstreams. Steps 1-4 are similar to the non-PR design
flow; steps 5-7 are unique to the PR design flow.
2.1.4.4 Reconfiguration Speed
Reconfiguration times are highly dependent upon PRR sizes and organization. For
example, the Virtex-II allows partial reconfiguration of entire columns only, which may
lead to partial bitstreams that are significantly larger than necessary. On the other hand,
Figure 2.8: Partial Reconfiguration Design Flow
![Page 45: Efficient Scheduling, Mapping and Resource Prediction for](https://reader034.vdocument.in/reader034/viewer/2022051406/627d1db0b7c2ba6de278d7e1/html5/thumbnails/45.jpg)
CHAPTER 2. BACKGROUND 25
Port        Bus (bits)   Max Frequency (MHz)   µs/Frame
Serial      1            100                   13.12
JTAG        1            66                    19.88
SelectMap   8            100                   1.64
SelectMap   32           100                   0.41
ICAP        8            100                   1.64
ICAP        32           100                   0.41
Table 2.2: Reconfiguration Speed for different interfaces
Virtex-4 FPGAs and beyond allow arbitrarily-shaped PRRs, which was a great
improvement over the previous generations. Since reconfiguration time is greatly
affected by design size, the metric of µs/frame is used to express reconfiguration
speed. Each frame is composed of forty-one 32-bit words. The smallest of the Virtex-4
devices (the LX15) has 3,740 frames, while the largest (the FX140) has 41,152
frames [19]. The most common FPGA configuration techniques are the JTAG (Boundary
Scan) port, the serial configuration port or the SelectMap port externally, or the
internal configuration access port (ICAP) internally (using an embedded
microcontroller or state machine), as shown in Figure 2.9. Each of these methods has
its own appropriate applications. The supported methods for partial reconfiguration
are JTAG, SelectMap and ICAP. During development, JTAG is usually used to partially
reconfigure the FPGA. The ICAP performs the reconfiguration internally using an
embedded microcontroller, which provides a powerful and very flexible platform for PR
designs. A summary of configuration speeds using different interfaces is shown in
Table 2.2.
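The per-frame figures in Table 2.2 follow directly from the frame size. A small sketch of the calculation, assuming the forty-one-word frame and the port widths and clock rates quoted above:

```python
def frame_time_us(bus_bits, freq_mhz, words_per_frame=41):
    """Microseconds to load one configuration frame.

    A frame is forty-one 32-bit words; a `bus_bits`-wide port moving one
    transfer per cycle at `freq_mhz` MHz therefore needs
    words_per_frame * 32 / bus_bits cycles per frame.
    """
    cycles = words_per_frame * 32 / bus_bits
    return cycles / freq_mhz  # cycles / (cycles per microsecond)

# Reproducing Table 2.2 and scaling to a full device:
#   8-bit  ICAP/SelectMap @ 100 MHz -> 1.64 us/frame
#   32-bit ICAP/SelectMap @ 100 MHz -> 0.41 us/frame
# Smallest Virtex-4 (LX15, 3,740 frames) over 32-bit ICAP: ~1.53 ms
lx15_full_ms = 3740 * frame_time_us(32, 100) / 1000
```

The same formula reproduces the serial-port entry (1 bit at 100 MHz gives 13.12 µs/frame), and scaling by the device's frame count gives a first-order estimate of a full or partial reconfiguration time.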
Figure 2.9: Loading a partial bit file
2.1.4.5 Reconfiguration Using an Embedded Microcontroller

Xilinx provides FPGA products that include embedded hard processor cores, and supports an embedded soft processor core (i.e., Xilinx's MicroBlaze) in all Virtex-II and later FPGAs. One capability that makes these cores an extremely flexible option is the ability to run C/C++ code in reconfiguration designs. The need for an external controller (e.g., a PC) can be eliminated by controlling reconfiguration from a processor embedded within the FPGA. Depending on the specific embedded software design and functionality, having the processor embedded within the FPGA allows for autonomous operation.

Shown in Figure 2.9 is an example of an embedded system design for partial reconfiguration. The microprocessor loads the required configuration data from the external memory and reconfigures the PR region through the ICAP primitive. The external memory could consist of ROM, Flash memory, or static RAM that is loaded at start-up or even filled by the FPGA itself. Internal configuration can thus be triggered by the FPGA itself, driven by either a custom state machine or an embedded processor such as the MicroBlaze or the PowerPC 405 (PPC405). When the overhead of an embedded microcontroller is undesirable, it can be replaced by a custom state machine that handles the loading of configuration data [18].
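The control flow in Figure 2.9 can be sketched as follows. This is a behavioral mock-up only: the `Icap` class and the memory layout are invented stand-ins for the ICAP primitive and the external memory, not a real Xilinx driver API.

```python
# Sketch of the embedded-controller reconfiguration flow of Figure 2.9.
FRAME_WORDS = 41  # each frame is forty-one 32-bit words

class Icap:
    """Mock internal configuration access port: accepts 32-bit words."""
    def __init__(self):
        self.words_written = 0

    def write_word(self, word: int):
        assert 0 <= word < 2 ** 32
        self.words_written += 1

def reconfigure(icap, external_memory, module_name):
    """Load a partial bitstream from (mock) external memory and stream it,
    word by word, into the ICAP. Returns the number of frames configured."""
    bitstream = external_memory[module_name]          # list of 32-bit words
    assert len(bitstream) % FRAME_WORDS == 0, "bitstreams are frame-aligned"
    for word in bitstream:
        icap.write_word(word)
    return len(bitstream) // FRAME_WORDS

memory = {"filter_v2.bit": [0] * (3 * FRAME_WORDS)}   # toy 3-frame module
icap = Icap()
print(reconfigure(icap, memory, "filter_v2.bit"))     # -> 3
```

On real hardware the inner loop would be the vendor's ICAP driver and the memory a Flash or DDR controller, but the structure (fetch bitstream, stream it frame-aligned into the configuration port) is the same.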
2.2 Resource Management of Reconfigurable Systems
The field of reconfigurable computing has grasped researchers' attention for quite a while. Recent FPGA developments in reconfiguration technology, accompanied by improved reconfiguration speed, have made reconfigurable computing more practical and applicable. However, a main challenge for reconfigurable computing is resource management: hardware tasks must be mapped onto a finite FPGA fabric while taking into account task dependencies and timing criteria. Therefore, resource management becomes crucial for any reconfigurable computing system, and several researchers have studied resource management of reconfigurable systems [20].

The term reconfigurable Operating System is used to represent any sort of reconfiguration manager. There is neither a real full implementation nor a theoretical standard for the reconfigurable OS. Therefore, the term reconfigurable OS is used to describe a manager that can be as simple as a loader with a scheduler, up to a full reconfigurable OS that has an allocator, partitioner, scheduler, and loader.

From the literature, a reconfigurable OS usually consists of a processor and a reconfigurable fabric. Software runs on the processor to manage hardware tasks. Usually
part of the OS should be implemented in hardware to act as a bridge between the two systems and increase system efficiency.

A reconfigurable OS should not be confused with hardware OS acceleration. The first focuses on managing hardware resources to accelerate applications, while the latter attempts to accelerate OS functions with dedicated hardware such as ASICs or FPGAs. The goal of hardware OS acceleration is to reduce RTOS overhead by implementing critical OS components, such as context switching, scheduling and semaphores, in hardware. The idea is to make the RTOS more deterministic for critical applications. Many research efforts [21–24] were able to achieve promising speedups, which led to a commercial product from Sierra [25] that accelerates the MicroC/OS-II RTOS utilizing Xilinx FPGAs.
2.2.1 Scheduling in Reconfigurable Systems
The main benefit of scheduling is organizing tasks to complete an objective, or a set of objectives, subject to certain constraints. A well designed scheduler can increase system efficiency by reducing time and/or increasing throughput, and can achieve other goals such as power reduction. Many researchers have worked to improve the efficiency of scheduling. In general, scheduling can take two forms: static scheduling and dynamic scheduling. A static scheduler is employed at design time, before the system being scheduled is run. An example of static scheduling is the schedule used for an embedded system. All task information, such as task arrival times and execution times, needs to be known to a static scheduler in advance in order to decide the task execution sequence. Therefore, static schedulers are suitable only for systems whose behavior is known in advance, such as embedded systems. The advantage of static scheduling is that complex (time consuming) algorithms can be used to achieve an optimum schedule. On the other hand, a dynamic scheduler does not need all the information to be known in advance. It attempts to assign tasks while the system is running; therefore, it is used in operating systems where tasks may arrive randomly and need to be placed in the ready queue. Dynamic scheduling has the advantage of adapting to sudden changes by being able to handle (schedule) unexpected tasks that arrive during run time [14].
In real-time systems, tasks have timing constraints and their execution is bounded by a maximum delay that has to be precisely respected. The goal of scheduling is to allow tasks to fulfill these constraints when the application runs in a nominal mode. A schedule in real-time systems must be predictable: it must be proved in advance that all timing constraints are met in the nominal mode. When a malfunction occurs, alarm tasks may be triggered and execution time may increase, overloading the application and giving rise to timing faults [26].
In reconfigurable systems the decision of when to start the execution of a task depends on whether the task can be placed and configured into the reconfigurable logic. Therefore, scheduling and placement are dependent on each other, which is not the case in processor-based systems, where task scheduling and memory allocation can be performed separately. In conventional systems, a task can be loaded into memory and then wait in the ready queue for its turn to use the processor. In reconfigurable systems, a task can begin execution as soon as it is configured, since configuring a task performs both allocation and the assignment of processing resources [14].
Most static schedulers in reconfigurable systems are based on the list scheduling technique and use a queue to keep the ready tasks. Given a list of ready, priority-sorted tasks, the scheduler picks the task with the highest priority to execute. Static schedulers may differ from each other based on the method used to calculate task priorities. Some typical methods include random assignment, As Soon As Possible (ASAP), and As Late As Possible (ALAP) [14].
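The list-scheduling idea described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the priority scheme (lower value = more urgent, e.g. an ALAP level) and all names are assumptions, and task dependencies are omitted.

```python
import heapq

def list_schedule(tasks, num_units):
    """Static list scheduling.
    tasks: [(priority, name, exec_time)], lower priority value = more urgent.
    num_units: number of identical execution units.
    Returns {name: (start_time, finish_time, unit)}."""
    ready = sorted(tasks)                          # priority-sorted ready list
    units = [(0.0, u) for u in range(num_units)]   # (time unit becomes free, id)
    heapq.heapify(units)
    schedule = {}
    for _prio, name, exec_time in ready:           # highest priority first
        free_at, unit = heapq.heappop(units)       # earliest-free unit
        schedule[name] = (free_at, free_at + exec_time, unit)
        heapq.heappush(units, (free_at + exec_time, unit))
    return schedule

# Three tasks on two identical units; t1 is the most urgent.
print(list_schedule([(0, "t1", 4.0), (1, "t2", 2.0), (2, "t3", 1.0)], 2))
```

Because all priorities, arrival times and execution times are fixed before the loop runs, this is a static schedule; a dynamic scheduler would instead re-run the priority selection whenever a new task arrives at run time.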
2.3 Data Mining, Machine Learning and Classification
Data Mining is the core process of the knowledge discovery procedure. A data-mining
flow includes different stages, such as Pre-processing, Classification, Clustering and
Post-processing. The main objective of the entire process is to extract useful hidden
knowledge from the data. Each set of data can be mined using different data-mining techniques, depending on the data and the goal of the mining procedure, as shown in Figure 2.10.

Figure 2.10: Data Mining Flow
Classification is one of the main phases of data mining. It is a key function that categorizes items and records in a database into specific classes or categories. The main objective of classification is to accurately predict the target class for each item in the
data. Classification is considered a form of supervised learning, which infers a function from labeled data, as shown in Figure 2.11.

Figure 2.11: Supervised Learning Steps

The training data usually consists of a set of records. Each record is a pair consisting of an input object along with a desired output target. The training data is analyzed by the supervised learning algorithm, which produces a model or function that can be used for mapping new examples. The main objective is to produce a model that correctly determines the class label for unseen instances. In other words, the learning algorithm will have the capability to generalize from the provided training data to unseen situations in a reasonably accurate fashion.
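The supervised-learning loop above (labeled training pairs → model → prediction on unseen inputs) can be sketched with a deliberately simple 1-nearest-neighbour classifier; the toy data, feature values and labels below are invented for illustration.

```python
def train(records):
    """'Training' for 1-nearest-neighbour is just memorizing the labeled pairs."""
    return list(records)

def predict(model, x):
    """Assign the class label of the closest stored training input."""
    def dist(a, b):
        # squared Euclidean distance between feature vectors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(model, key=lambda pair: dist(pair[0], x))
    return label

# Toy training set: (input features, desired class label).
training = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
            ((8.0, 9.0), "large"), ((9.5, 8.5), "large")]
model = train(training)
print(predict(model, (1.1, 0.9)))   # classify an unseen instance
print(predict(model, (9.0, 9.0)))
```

Real classifiers (decision trees, neural networks, SVMs) build a more compact model than a memorized list, but the interface is the same: fit on labeled records, then generalize to unseen instances.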
2.4 Summary
Partial dynamic reconfiguration allows multiple independent configurations to be swapped
in and out of hardware independently, as one configuration can be selectively replaced
on the chip while the other is left intact. Partial reconfiguration provides the capability
to change certain parts of the hardware while other parts of the FPGA remain in use.
A Reconfigurable Computing system traditionally consists of a General Purpose Pro-
cessor (GPP) and one or more Reconfigurable Modules that run hardware tasks in paral-
lel [27]. A fundamental feature of a Partially Reconfigurable FPGA is that the logic and
interconnects are time multiplexed. Thus, for a circuit to be implemented on an FPGA,
it needs to be partitioned such that each sub-circuit can be executed at a different time.
The Xilinx Partial Reconfiguration design flow [5] uses a bottom-up synthesis approach, where Reconfigurable Modules have to be synthesized separately. Each Reconfigurable Module is treated as a separate project in which it is verified and synthesized independently. The top design treats each Partial Reconfigurable Region as a black box. After generating all net-lists (top design and Reconfigurable Modules), each Partial Reconfigurable Region must be manually floor-planned using the Xilinx PlanAhead design tool. The PRR can be rectangular or L-shaped, with some restrictions. More details can be found in [5].
Partial reconfiguration is appealing and attractive since it provides flexibility. How-
ever, multitasking reconfigurable hardware is complex and requires some overhead in
terms of management. In order for users to benefit from the flexibility of such systems,
an operating system must be developed to further reduce the complexity of application
development by giving the developer a higher level of abstraction.
Chapter 3
Literature Review
In this chapter, the important published works in the field of dynamic reconfigurable systems are examined. Research work on reconfigurable scheduling, ROS, and resource prediction is reviewed. The main objective of this chapter is to highlight the advantages and disadvantages of current approaches and implementations on reconfigurable computing platforms in order to provide guidance for the reader.
3.1 Partial Dynamic Reconfigurable Systems
Partial reconfiguration is a unique and important feature of reconfigurable systems, as it allows swapping hardware modules in and out of an FPGA device dynamically without the need to reset the entire device. The potential that partially reconfigurable devices offer for system adaptivity is outstanding. Not many devices support dynamic partial reconfigurability; some of the few that fall in this category are the Virtex series from Xilinx [9].
Several partial reconfiguration design flows have arisen over time, and only two of them are currently supported by their providers. The JBits [28] approach is an application programming interface developed by Xilinx to allow the end user to change connections inside the FPGA and set the contents of the LUTs using a set of Java classes that form the JBits API. JBits does not allow changes to be made directly to the FPGA, but to the configuration file. One of the major drawbacks of JBits is the difficulty of providing fixed routing for a direct connection between two modules, which may lead to unroutable paths after reconfiguration [9].
A design flow similar to JBits that supports a wider range of devices is the Difference-Based partial reconfiguration design flow [29]. This flow uses a different user interface than JBits and allows the designer to make small logic changes using the Xilinx FPGA Editor. It generates a bitstream that contains the differences between the two design versions. This approach has the same limitation as JBits and does not guarantee correct routing after configuration. Therefore, it is only recommended for small changes such as logic equations, filter parameters and I/O standards.
A more advanced design flow that starts at the HDL level and attempts to solve the routing problem of JBits is the Modular Design Flow [30]. The Modular design flow uses Bus Macro primitives to guarantee fixed communication channels among components that will be reconfigured at run time. The Modular design flow was initially developed to allow a group of engineers to cooperate on the same project; at a later stage it was adapted to support partial reconfiguration [9].
A follow-up to the Modular design flow from Xilinx is the Early Access Design Flow, which is basically an enhancement of the former. The main enhancements added were better usability, flow automation and the relaxation of some rigid constraints. For
example, signals passing through the reconfigurable region without being used in the reconfigurable modules do not have to be passed through Bus Macros. As a result of these enhancements, more devices were also supported by the Early Access Modular design flow [31].
Xilinx's latest design flow is the Partition-based design flow [17]. This design flow is integrated within the Xilinx ISE development tools. It is similar to the Early Access Design flow but with fewer limitations. The Xilinx Partition-based design flow utilizes Partitions, a mature feature that ensures exact preservation of previously generated results. This is the first commercial design flow provided, but it still has several limitations: it does not support automatic partitioning or relocation, and all partial reconfigurable hardware modules have to be placed manually using a floor-planner.
There have been several attempts to adapt the standard PR design flow to specific application fields in order to reduce complexity or add extra features. In [32] a tool called PARBIT was developed to easily transfer and regenerate bitfiles in order to implement dynamically loadable hardware modules. The tool was built on top of the JBits design flow and inherited most of the JBits limitations. An implementation of a reconfigurable packet processing circuit using the PARBIT tools is presented in [33]. Another tool based on the JBits design flow is JRTR [34]. This tool aimed to provide a simple and efficient model and implementation for partial run-time reconfiguration using a cache-based approach. The tool extended the JBits Java library and supports bitstream read-back. In [35] a modification to the Modular design flow is presented. The published work presented an implementation of a reconfigurable system using slices as connecting resources instead of TBUF elements. The benefit of this approach was to automate the design flow and reduce manual debugging. In [36] a complete framework
was built to enable hardware developers to use three different schedulers without going into the details of reconfiguration techniques. Reconfiguration modules can be written in a high-level Java-like language in order to handle the reconfiguration process in an easier way. The proposed system was tested with data acquisition and a streaming application. The design was based on the Early Access Design flow. Partial reconfiguration is currently supported only by the high-end and expensive FPGAs of the state-of-the-art Virtex series (the only exception being the Xilinx 7 series). In [37] an implementation of a virtual internal configuration port for realizing dynamic and partial self-reconfiguration on Spartan-3 FPGAs is presented. The proposed work attempts to overcome the limitation of the missing ICAP interface in Spartan-3 FPGAs. The researchers proposed a virtual interface that writes the configuration data to output ports on the FPGA, which are externally connected to the available JTAG pins.
3.1.1 Dynamic Partial Reconfiguration Applications
The number of application fields to which reconfigurable computing has been applied is increasing steadily. Some of the widely applied application domains are embedded systems, network security applications, multimedia applications, scientific computing, and SATisfiability solvers [14]. More detailed applications can be found in [38]. Despite the many benefits of dynamic partial reconfiguration [39, 40], only a few research efforts have used Dynamic Partial Reconfiguration for reconfigurable computing.
3.1.1.1 JBits and Difference Based Design Flows Applications
The JBits and Difference-Based partial reconfiguration design flows have been used in cryptographic applications. Partial dynamic reconfiguration has been used to reconfigure some FPGAs to accommodate changing keys and to modify key and data block widths [41, 42]. The published work in [41] exploits partial reconfiguration in the IDEA algorithm by changing keys via partial dynamic reconfiguration. In [42] an implementation of two cryptographic algorithms (IDEA and AES) using dynamic partial reconfiguration is presented. Again, JBits partial reconfiguration was used to reconfigure the elements involved with the keys for both algorithms.
3.1.1.2 Modular Design Flow Applications
In [43] researchers consider a hybrid DSP-FPGA platform for Software Defined Radio (SDR). In their system they were able to show the performance benefits of FPGA partial reconfiguration for software defined radio applications by reducing configuration time using the Modular-based partial reconfiguration design flow. In [44] the Modular design flow has been used in implementing a self-reconfigurable co-processor for accelerating a representative secure application (SSH) on a standard operating system (uClinux OS).
3.1.1.3 Early Access Modular Design Flow Applications
The Early Access Modular Design Flow has been used with SDR in [45–47]. SDR is a common hardware platform for multi-standard communications that is controlled by software. The goal of SDR is to produce uninterrupted communication devices that can support different standards [45]. Therefore, SDR is a typical application for dynamic partial reconfiguration, where the communications module can be changed without interrupting the rest of the system. An important direction, considered to be a killer application, is in the biometric field, where a full biometric reconfiguration algorithm can be implemented in a small FPGA device at a very low cost. Processing can be performed in real-time while preserving data in its hardware implementation by multiplexing functionality on the fly over a reduced set of resources placed in a partially reconfigurable region on the same device [48]. Partial reconfiguration was used for the first time to further increase the performance of High-Performance Reconfigurable Computers (HPRCs) in [49]. The authors investigated the performance potential of partial run-time reconfiguration on HPRCs from both theoretical and practical perspectives, comparing it to the full run-time reconfiguration approach. In [50] partial reconfiguration has been used to design a multi-protocol sensor reading system for road safety. The goal was to reduce power and cost by time-sharing multiple design modules (protocols) on the same FPGA.
In [51] the flow was used for accelerating video-based driver assistance applications in automotive systems. The idea was to separate pixel-based operations from high-level operations. The pixel-based operations were accelerated using a reconfigurable co-processor attached to a standard PowerPC processor. Multiple algorithms requiring the same pixel operation can share the co-processor to minimize chip area, and at the same time multiple pixel operations can be supported by reconfiguring the co-processor. The use of partial reconfiguration in automotive domains has been investigated in [52] as well. An evaluation based on experiments on a set of signal and image processing applications is presented in [47]. The work evaluated the benefits and limitations of using dynamic partial reconfiguration for real signal and image processing professional electronics applications and provided guidelines to enhance its applicability. An improvement to the reconfiguration interface provided by the vendor was implemented by using Direct Memory Access (DMA) to accelerate I/O operations. Reconfiguration speed is significant in applications, especially when fast module switching is required. Most applications use the internal ICAP configuration interface, and [53] attempted to further enhance the
ICAP throughput. The published work uses DMA, Master (MST) burst, and a dedicated BlockRAM (BRAM) cache to reduce reconfiguration time. Their experimental results concluded that the proposed DMA and MST burst controllers may achieve a processor-independent reconfiguration speed one order of magnitude faster than the vendor-provided modules. The proposed BRAM-based module can approach the reconfiguration speed limit of the ICAP interface, but this comes at the cost of large BRAM resource utilization on the FPGA. In [54] the researchers introduced an approach to reduce FPGA power consumption using dynamic partial reconfiguration by exploiting the time-varying nature of the system's environment. Power saving is achieved by adapting the implementation of a function to temporal changes in the environment. The authors pointed out the benefits of using dynamic partial reconfiguration with the type of applications where all application functions are required on the FPGA at the same time; in such cases, the system reduces power consumption by time multiplexing different implementations of the same function. The authors tested the system with a network application using different implementations of Viterbi decoders. In [55] the authors proposed the use of partial reconfiguration in networked portable multimedia appliances. In [56] researchers proposed the use of partial reconfiguration for dynamic fault tolerance, reconfiguring the faulty area without affecting the rest of the system.
3.2 Reconfigurable Operating Systems
The first proposed operating system for partially reconfigurable hardware was by Brebner [57, 58]. Some of the principles that influence an operating system for reconfigurable devices were discussed. The author introduced the notion of Swappable Logic Units (SLUs), which can be swapped in and out of a partially reconfigurable device by an operating system. The author proposed that applications should consist of relocatable SLUs that are loaded/unloaded by the operating system. The system was only simulated in C.
Merino [59] attempted to divide the reconfigurable logic into identical areas, called 'slots', and used an operating system to schedule the loading/unloading of tasks into the slots. Each task was designed according to a pre-defined template to standardize the inputs/outputs of all tasks. A table was used to keep track of the currently loaded tasks in the FPGA. The published work does not give enough details to reproduce it, and there was no indication of any successful implementation or even simulation of the system.
Hardware task preemption has been explored in [60]. The system suspends the running task, reads back the reconfiguration bitstream, and extracts the internal register status directly from the bitfile in a process called 'State Extraction'. In order to resume a suspended task, its bitstream has to be injected with the previous register values (updating the internal registers) before loading it to the FPGA, in another process called 'Task Reconstruction'. The system has many limitations, such as the need for an FPGA that supports read-back of configuration bitfiles; state extraction is also a processor-intensive operation.
Wigley [61–64] proposed an OS for a reconfigurable system. The author supported every decision with a logical conclusion based on previous research. The literature review, flow of the thesis, and methodology were outstanding. Wigley used the proposed OS in a real application [65], but the practicality of the proposed OS in that system was vague.

The OS consists of three tiers; each tier runs on a separate machine and they communicate over standard TCP/IP. Each HW application should be modeled with a data flow graph. Each node in the graph has to be implemented in advance, and the size of each node has to be stored with the application along with all the connections
between nodes (implemented in Java classes). The user tier runs a program and sends the EDIF file to the Colonel tier. The Colonel tier has an allocator, a partitioner and a queue. Every HW task is stored in the queue as it arrives. The allocator tries to allocate a position on the FPGA if enough space exists; otherwise it sends the task to the partitioner along with the largest available vacancy. The allocator and partitioner communicate to find a way to allocate the program. If no suitable place exists, the allocator issues a block signal to the program and inserts it back at the end of the queue. After finding suitable locations for the task, the Colonel tier generates a suitable UCF file and sends the result to the Xilinx tools to generate the bitfile. The bitfile is then sent to the Platform tier, which performs the configuration.
The authors presented a prototype for a hardware OS that performs allocation, partitioning and simple scheduling (a simple queue with non-preemptive behavior). In order for the OS to function efficiently, CAD tools should exist that support dynamic place-and-route. The author suggested an FPGA fabric that has a special layer for global routing to ease the connections between modules. Without such features the OS would be impractical to use: with current technology, CAD tools have to be invoked for every load/unload of any task, a complete place-and-route algorithm needs to be run, and a full-chip reconfiguration is required, which takes extensive CPU time. The author also suggests metrics and benchmarks to measure reconfigurable system performance. This was one of the first implementations of a pure reconfigurable OS. Most results were based on simulation, and the OS assumed a special hardware fabric which does not exist in any of the currently available commercial products.
All of the previous systems employ hardware tasks as co-processors. This approach limits the flexibility of the FPGA, and it requires a powerful external processor with a high speed bus. The same scenario would be completely impractical if the processor were embedded in the FPGA: according to [51], a PowerPC hardcore processor running at 300 MHz is not sufficient even for a simple video application. Therefore, it is very important to have autonomous reconfigurable tasks which can accept data from an external source and interact with the processor.
Walder and Platzner [66, 67], along with Steiger [66], were the pioneers who proposed the idea of using reconfigurable hardware not as a dependent co-processor but as standalone independent hardware. Hardware tasks can access a FIFO, a memory block and an I/O driver, or even signal the scheduler. The system uses a powerful external processor that connects to an FPGA via the PCI port. The proposed run-time environment can dynamically load and run variable-size tasks utilizing partial reconfiguration. [66] used the system as the foundation to explore on-line scheduling of real-time tasks for both 1D and 2D modules.
Hayden K. So et al. [8, 68, 69] extended the Unix OS to support hardware processes by providing native kernel support for FPGA hardware. In this work the FPGA is treated as a traditional computational resource. The BORPH OS introduced a new executable file format (thanks to the extensibility of Linux) that combines an FPGA bit-file, ELF data, and other information. A hardware process can be executed using the normal Unix 'exec' and 'fork' commands. Once hardware processes are loaded into the system, they appear to the user/software as normal software processes. Hardware processes are handled differently by the scheduler in that they are not added to the software run queue. Communication between hardware processes is achieved by a special packet passing mechanism which is compatible with the standard Linux message passing mechanism. The authors used the Berkeley Emulation Engine 2 (BEE2), which is based on five Xilinx FPGAs (Virtex II Pro 70). The FPGAs are arranged in a star topology, with four user FPGAs in a ring and one control FPGA connected to each user FPGA. A PowerPC processor runs the main kernel on the fifth (control) FPGA, while another PowerPC processor is assigned to every user FPGA (with every task), which limits the speed of the system.
The authors found that reading, writing, and loading of software processes were much
faster than for hardware processes. The hardware platform was redesigned by adding DMA
and direct communication, which enhanced the system with a speedup of up to 70% over
the initial prototype. Despite the 70% speedup, all of the test applications that used
hardware processes were still slower than their software counterparts.

The authors extended Linux to support hardware processes with the available technology.
Scheduling, partitioning, and allocation were not added for hardware processes;
however, a hardware process can be allocated to any free FPGA chip available. This work
can be extended to support partial reconfiguration without any major issues.
Lübbers and Platzner [70] attempted to extend the multi-threaded programming model
from software to hardware by supporting the same POSIX software functions for hardware
tasks. The authors developed the required software extension and hardware communication
model to achieve that. The software model runs as an application on top of a general
OS, such as Linux, and the software extension provides an interface to the hardware
communication model. There were two interesting points about this research: first, it
can use any POSIX function without any changes to the host OS kernel, and second, the
hardware accelerator does not need to act as a co-processor but can run as a
stand-alone unit.
Rodolfo and Marco [71] were the first to explore migration between hardware and
software tasks. The authors studied the co-existence of hardware (on an FPGA) and
software (running on a CPU) threads, and proposed a migration protocol between them
for real-time processing. The authors introduced a novel allocation scheme and online
admission control. They also proposed an architecture that can benefit from utilizing
dynamic partial reconfiguration. The importance of this work is attributed to it being
the first to study migration between hardware and software tasks. The results, however,
were based mainly on pure simulation.
Fakhreddine [72] presented a run-time hardware/software scheduler that accepts tasks
modeled as a DFG. The scheduler is based on a special architecture that consists of two
Von Neumann processors (master and slave), a Reconfigurable Computing Unit (RCU), and
shared memory. The RCU runs hardware tasks and can be partially reconfigured. Each task
can be represented in three forms: a hardware bitstream run on the RCU, a software task
for the master processor, or a software task for the slave processor. A scheduler
migrates tasks between hardware and software to achieve the highest possible
efficiency. The authors found the software implementation of the scheduler to be slow.
For example, a software scheduler running on an Intel Core 2 Duo processor with a
frequency of 2.8 GHz and 4 GB of RAM takes about 63% of one image-processing
computation. Therefore, the authors designed and implemented a hardware version of the
scheduler to minimize overhead. The paper focuses mainly on the scheduler and its
hardware architecture. It does not address other issues associated with the
loading/unloading of hardware tasks or the efficiency of inter-task communication.
Another drawback is the requirement that each task in the data flow graph be assigned
to one of the three processing elements mentioned above; this restriction conflicts
with the idea of software/hardware migration.
Antonio [16] designed and implemented an execution manager that loads/unloads tasks in
an FPGA using dynamic partial reconfiguration. The execution manager ensures task
dependency and proper execution based on the internal data dependencies and the
scheduler. It also uses prefetching and task reuse to reduce configuration time
overhead. Tasks must be modeled as scheduled data flow graphs in order to be usable by
the execution manager. The authors implemented software and hardware versions of the
manager and found the software version to be too slow for real-time applications. The
implementation used a simulated hardware task consisting of two timers to represent
execution and configuration time. Many issues remain unaddressed before the manager can
be used with real tasks, such as configuration management and inter-task communication.
Devaux et al. [73–75] studied the communication problem for on-line partially
reconfigurable hardware tasks. Dynamic reconfiguration of tasks may lead to
communication issues, since tasks are not present in the FPGA during the entire
computation time; this dynamicity needs to be supported by the interconnection network.
In their work they studied and tested several Network-on-Chip (NoC) topologies. They
found that the fat-tree meets the requirements of dynamic partial reconfiguration and
has higher network performance in terms of bandwidth and latency. The main drawback of
fat-tree networks is their high resource requirements. Therefore, the researchers
proposed a modified network, called DRAFT, to address the size issue. The published
work also proposed a tool, called DRAGOON, that parametrizes and automatically
generates the DRAFT topology. In [76] an architecture that supports both hardware and
software threads is introduced, which uses the DRAFT NoC. The goal of the work was to
introduce a new type of system partitioning, where hardware and software components
follow the same execution model. The architecture can be completely distributed, with
the entire platform being homogeneous from the application's point of view. The
published work was only briefly introduced and lacks enough information to replicate it.
Although [77] is not directly related to this work, it proposed an excellent design
exploration technique for studying the use of several RTOSs and processors with
reconfigurable hardware.
3.2.1 Scheduling for Reconfigurable Operating Systems

As presented in Chapter 2, there are generally two scheduling techniques: static
scheduling and dynamic scheduling. The first is employed at design time, while the
latter is used at run-time to schedule dynamically arriving tasks. The static scheduler
requires all task timing information in advance and uses intensive, complex algorithms
that can lead to optimal results; it has therefore been used extensively in embedded
systems. Schedulers in reconfigurable operating systems differ from conventional ones
in two major ways. First, reconfigurable operating system schedulers depend heavily on
the placement of hardware tasks, whereas in conventional systems, where software tasks
are stored in memory, the scheduler can be independent of memory allocation. Second,
computation resources are available as soon as a task is placed in the reconfigurable
fabric, while in conventional systems the task may wait in a ready queue for free
processing resources. Most static schedulers in reconfigurable systems are based on the
list scheduling technique and use a queue to keep the ready tasks [14]. Given a list of
ready and sorted tasks, the scheduler picks the task with the highest priority to
execute. Static schedulers differ in how task priority is calculated; typical methods
include random assignment, As Soon As Possible (ASAP), and As Late As Possible
(ALAP) [14].
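To make the list-scheduling idea above concrete, the following Python sketch picks the ready task with the best priority from a queue until all tasks are scheduled. The task-graph representation and priority values are illustrative assumptions, not taken from [14]; an ASAP priority would simply be each task's depth in the graph.

```python
import heapq

def list_schedule(tasks, priority):
    """Generic list scheduler: repeatedly run the ready task with the
    best (smallest) priority value, e.g. its ASAP level.

    tasks: dict mapping task name -> set of predecessor names.
    priority: dict mapping task name -> numeric priority.
    Returns the execution order as a list of task names.
    """
    done, order = set(), []
    # Ready queue keyed on priority; heapq pops the smallest key first.
    ready = [(priority[t], t) for t, preds in tasks.items() if not preds]
    heapq.heapify(ready)
    while ready:
        _, task = heapq.heappop(ready)
        done.add(task)
        order.append(task)
        # A task becomes ready once all of its predecessors have finished.
        for t, preds in tasks.items():
            if t not in done and preds <= done and \
               all(t != q for _, q in ready):
                heapq.heappush(ready, (priority[t], t))
    return order
```

Swapping the `priority` table is all that distinguishes random, ASAP, and ALAP variants of this scheme.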
In [78] a high-level model, called the reconfigurable system design model (RSDM), is
introduced. The scheduler, which is at the core of the RSDM model, takes as input
resource constraints, a hardware library (containing processors, communication
elements, and task-specific cores), a set of system tasks, and design constraints. The
scheduler produces a feasible task schedule and a high-level hardware system
description. The system uses random priority list scheduling along with simulated
annealing and a genetic algorithm to schedule hardware tasks. The scheduler operation
is very time consuming and is not suitable for use at run-time.
Most conventional dynamic scheduling algorithms can be adopted for hardware task
scheduling, as long as the module placement problem is addressed separately [14].
However, this is not practical, as mentioned earlier. In [79] several conventional
scheduling methods were used in an on-line scheduler for a block-partitioned
reconfigurable device. The First Come First Serve (FCFS) and Shortest Job First (SJF)
schedulers were used for non-preemptive tasks, while Shortest Remaining Processing Time
(SRPT) and Earliest Deadline First (EDF) were used as preemptive schedulers. In [80] a
dynamic version of ASAP/ALAP scheduling, called dynamic priority list scheduling, was
proposed for reconfigurable systems.
In [81] a scheduler is proposed to improve hardware resource utilization. The authors
presented two preemptive scheduling algorithms. EDF-Next Fit (EDF-NF) invokes the
conventional EDF algorithm when tasks can be configured; otherwise it uses the next
hardware task that can be placed on the FPGA. EDF-NF has good scheduling performance,
but only for a small number of tasks. Another algorithm that can handle large numbers
of tasks, but with lower performance, was also presented by the authors. The
Merge-Server Distribute Load (MSDL) algorithm uses the concept of servers that reserve
area and execution time for hardware tasks.
In [82–84] the authors presented an algorithm to reduce reconfiguration overhead. The
proposed scheduling algorithm integrates three modules into an existing hybrid
run-time/design-time framework called Task Concurrency Management (TCM) [85]. The three
modules were reuse (different applications use the same task), prefetch (place the task
ahead of time), and replace (increase the possibilities of reusing critical subtasks).
Two scheduling and placement techniques, referred to as horizon and stuffing, were
proposed in [86, 87] to reduce scheduling overhead. The horizon scheduler maintains
three lists: an execution list (the currently executing tasks), a reservation list
(scheduled tasks not yet executed), and a scheduling horizon list (the location and
last release time of the placed tasks). The advantage of horizon scheduling is its
simplicity. The stuffing scheduler is similar to the horizon scheduler, but differs in
the way it maintains unused space; it is more space efficient than the horizon
algorithm. An extension of both algorithms to support a 2D reconfigurable area model is
presented in [87]. Both the stuffing and horizon algorithms produce logic fragmentation
when there is a large variance in the size of hardware tasks. To reduce fragmentation,
the classified stuffing technique was introduced in [88].
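The horizon idea can be illustrated with a simplified 1D sketch (this is an assumption-laden illustration, not the authors' implementation from [86, 87]): each device column carries a release time, and an arriving task is placed at the span of columns whose horizon allows the earliest start.

```python
def horizon_schedule(num_cols, tasks):
    """Simplified 1D horizon-style scheduling sketch.

    tasks: list of (arrival, width, duration) tuples, in arrival order.
    horizon[c] holds the time at which column c is released by the
    last task placed on it.  Returns (start_time, leftmost_column)
    for each task.
    """
    horizon = [0] * num_cols
    placements = []
    for arrival, width, duration in tasks:
        # Pick the span of `width` columns with the earliest feasible start.
        best_start, best_col = None, None
        for col in range(num_cols - width + 1):
            start = max([arrival] + horizon[col:col + width])
            if best_start is None or start < best_start:
                best_start, best_col = start, col
        # Reserve the span: these columns are busy until the task ends.
        for c in range(best_col, best_col + width):
            horizon[c] = best_start + duration
        placements.append((best_start, best_col))
    return placements
```

Stuffing improves on this by also tracking the gaps left before each horizon, which is where its better space efficiency comes from.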
Scheduling algorithms for power reduction are presented in [89–91]. In [92, 93] the
authors presented a class of co-scheduling algorithms in which software/hardware tasks
can be relocated. Relocation implies that a hardware task can be preempted and
restarted as a software task, and vice versa.
Göhringer et al. [94] proposed a hardware OS, called CAP-OS, that performs run-time
scheduling, task mapping, and resource management on a run-time reconfigurable
multiprocessor system, using RAMPSoC [95] as the platform. The authors proposed two
schedulers: a static scheduler that generates a task graph and assigns resources, and
a dynamic scheduler that uses the task graph produced by the static scheduler. A
deadline constraint applies to the entire task graph rather than to individual tasks.
Most of their work was based on simulation, and the implementation lacked task
reconfiguration.
3.3 Execution Unit Allocation and Genetic Algorithms
Many researchers have considered the use of multiple architecture variants as part of
developing an operating system or a scheduler. In [96, 97] multiple core variants were
swapped to enhance execution time and/or reduce power consumption. In our previous
work [7], we alternated between software and hardware implementations of tasks to
reduce total execution time. The work presented here differs in its applicability,
since it is more comprehensive: it can be used during the design stage to help the
designer select the appropriate variant for each task in a task graph.
Genetic Algorithms (GAs) have been used by many researchers to map tasks onto FPGAs.
For example, in [98] a hardware-based GA partitioning and scheduling technique for
dynamically reconfigurable embedded systems was developed. The authors used a modified
list scheduling and placement algorithm as part of the GA approach to determine the
best partitioning. The work in [99] maps task graphs onto a single FPGA composed of
multiple Xilinx MicroBlaze processors and a hardware implementation. In [100] the
authors used a GA to map task graphs onto partially reconfigurable FPGAs, modeling the
reconfigurable area as tiles of resources with multiple configuration controllers
operating in parallel. However, none of the above works take multiple architecture
variants into account.
In [96] the authors modified the Rate Monotonic Scheduling (RMS) algorithm to support
hot swapping of architecture implementations. Their goal is to minimize power
consumption and re-schedule tasks on-the-fly while the system is running. The authors
approached the problem from the perspective of schedule feasibility, but their approach
neither targets a particular reconfigurable platform nor takes multiple objectives into
consideration, which distinguishes it from the work proposed in this thesis.
R. Chen et al. [101] used an Evolutionary Algorithm (EA) to optimize power and
temperature for heterogeneous FPGAs. The work in [101] is very different from our
proposed framework in terms of the types of objectives considered and their actual
implementation, since its focus is on swapping a limited number of GPP cores in an
embedded processor for the arriving tasks. In contrast, our proposed work focuses on
optimizing the hardware implementation of every single task in the incoming DFG, and
targets partially reconfigurable FPGAs.
In [4, 6] the authors used a GA for task graph partitioning along with a library of
hardware task implementations containing multiple architecture variants for each
hardware task. The variants reflect trade-offs between hardware resources and task
execution throughput. In their work, the selection of implementation variants dictates
how the task graph should be partitioned. There are a few important differences from
our proposed framework. First, our work targets dynamic partial RTR with hardware
operating systems. The system simultaneously runs multiple hardware tasks on PRRs and
can swap tasks in real-time between software and hardware (the reconfigurable OS
platform is discussed in [7]). Second, the authors use a single-objective approach,
selecting the variants that give minimum execution time. In contrast, our approach is
multi-objective: it seeks not only to optimize speed and power but also to select the
best reconfigurable platform. As each of these objectives is likely to have its own
optimal solution, the optimization does not give rise to a single superior solution,
but rather to a set of optimal solutions, known as Pareto optimal solutions (these
solutions are optimal in the sense that no one solution can be considered better than
any other with respect to all objectives). Unlike the work in [4, 6], we employ a
parallel, multi-objective island-based GA to generate the set of Pareto optimal
solutions (also known as the Pareto front).
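The Pareto notion used above can be stated compactly: one solution dominates another if it is no worse in every objective and strictly better in at least one, and the Pareto front is the set of non-dominated solutions. A minimal sketch, assuming objective vectors (e.g. execution time, power) that are all minimized:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives
    minimized): a is no worse everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b)) and
            any(x < y for x, y in zip(a, b)))

def pareto_front(solutions):
    """Keep only the non-dominated solutions (the Pareto front)."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]
```

A multi-objective GA such as the island-based flow described in this thesis returns such a front, leaving the final choice among the non-dominated layouts to the designer.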
3.4 Resource Prediction and Data-Mining
The use of both machine-learning and data-mining methods, as proposed in this work,
represents a new direction for reconfigurable-computing research. In contrast, it has
already become a fast-growing research area in physical design. Applications include
predicting defects in silicon wafers [102], identifying speed paths in processors as
guides for performance improvement [103], and design exploration for high-level
synthesis [104]. A notable effort in the area of CAD for ASICs is PADE [105], a new
ASIC placement flow which employs machine-learning and data-mining methods to predict
and evaluate potential data-paths using high-dimensional data from the original
design's net-list. PADE achieves 7% to 12% improvements in solution quality compared
to the state-of-the-art. A summary of other successful applications of
data-mining-driven prediction to problems in the area of physical design can be found
in [106].
There is an abundance of research work in the literature that covers the management of
reconfigurable computing systems. Most of the previous work concentrates mainly on the
development of OSs and managers along with the necessary modules, such as schedulers
and placers. Only very few articles discuss resource estimation and the use of machine
learning techniques for predicting the necessary resources for dynamic reconfigurable
computing systems. The authors in [3] present an automated tool to support dynamic
reconfiguration for high-performance regular expression searching, including a method
to quickly and accurately estimate the resource requirements of a given set of regular
expressions. However, this work is limited since it applies only to regular expression
searching; moreover, neither prediction nor learning from past history is applied in
this approach.
In [107] the authors investigate the use of advanced meta-heuristic techniques along
with machine learning to automate the optimization of a reconfigurable application's
parameter set. The approach proposed in [107], called the Machine Learning Optimizer
(MLO), involves a Particle Swarm Optimization (PSO) methodology with an underlying
surrogate fitness function model based on Support Vector Machines (SVMs) and a Gaussian
Process (GP). Their approach is mainly used to save time on analysis and
application-specific tool development. Our work is completely different from MLO in
that our framework predicts the necessary resources and floor-plan in dynamic
reconfigurable systems to optimize the execution of benchmarks as they arrive for
processing.
The authors in [108] propose a fast, a priori estimation of resources during
system-level design for FPGAs and ASICs targeting FIR/IIR filters. The prediction is
based on neural networks, and the targeted metrics include area, maximum frequency,
and dynamic power consumption. However, the work is very limited and is not applicable
to dynamic run-time reconfiguration applications.
An on-line predictor for a dynamic reconfigurable system is proposed in [109] to reduce
reconfiguration overhead by pre-fetching hardware modules. The proposed algorithm uses
a piecewise linear predictor to find correlations and load hardware modules a priori.
This work tries to optimize the use of fixed resources, while our work operates at a
much higher level and seeks to predict the necessary resources themselves.
The work in [110] proposed a dynamic-learning data mining technique for failure
prediction in high-performance computing. Its main contribution was to dynamically grow
the training set during system operation, which helps in predicting failures early in
deployment. The work in [110] is fundamentally different from ours, since it does not
predict resources based on an intelligent machine learning approach.
A multi-objective design space exploration tool is proposed in [111] to enable resource
management for the Molen reconfigurable architecture. The proposed approach analyzes an
application's source code and, using heuristic techniques, determines a set of
hardware/software candidate configurations (sub-tasks). The resource manager then uses
these candidates to exploit the available system resources more efficiently. Their work
tends to optimize the sub-tasks of the application to fit a fixed, pre-determined
platform, while our work predicts a suitable platform for a given application. Their
work also targeted a specific platform (Molen) and does not take partial
reconfiguration into account.
In [112], the authors proposed an algorithm for Partial Reconfiguration (PR) module
generation. The proposed technique can be integrated into a manual design flow to
automate the generation of PR partitions and modules. The authors formulated the PR
module generation problem as a standard Maximum-Weight Independent Set problem [113].
Their design supports multiple objectives, such as reconfiguration overhead and area,
with different constraints. This work differs from ours in many aspects; for example,
the techniques proposed in their paper neither use machine learning nor learn from
previous results, and the approach is limited to the generation of PR partitions.
The closest published research to our work can be found in [114], [115], [116]
and [117]. In [114], the authors present a high-level prediction modeling technique
that produces prediction models for miscellaneous platforms, tool chains, and
application domains. The framework proposed in that paper uses linear regression and
neural networks to accurately capture the relation between hardware and software
metrics. The framework takes an ANSI-C description as input and estimates various
FPGA-related measures, such as area, frequency, or latency. However, our approach is
different in the following aspects: (a) our framework does not predict hardware
resource consumption but instead predicts the optimal layout (PRRs, soft cores) that
would best execute a certain application such that power is reduced and performance is
enhanced; (b) our framework is associated with an OS for dynamic reconfigurable
systems; (c) our framework targets dynamic reconfigurable designs and not static
designs as in Quipu.
The work in [115] presents a two-layer framework for resource management of dynamic
reconfigurable platforms. The proposed system is capable of evaluating the performance
of a reconfigurable computing platform based on a prediction model, and the framework
is applied to an artificial-vision case study. However, this paper does not seek to
predict either the layout or the suitable resources required for the application. The
resources are, in fact, fixed, and the main task of the run-time resource manager (RRM)
is to allocate the best computational resource (software or hardware) for the
application. The approach in [115] also differs from ours in that the application-level
decision making runs a greedy optimization, which is relatively computationally
expensive, to find the mapping that returns the maximum performance with respect to a
trained data-mining classifier. In our approach, we use a supervised learning approach
to efficiently predict the best layout that would maximize performance and reduce power
consumption.
In [116], the authors propose an online adaptive algorithm that decides the best
implementation to use for the execution of an instance, based on features of the
process and its execution history. The work aims to improve the hardware/software
partitioning task by avoiding predetermined execution times and instead relying on the
system's run-time execution history. This work does not use any statistical or machine
learning technique to predict the resources or floor-plan of the reconfigurable system.
The work in [117] proposes a decision-making support framework, called DRuid, which
utilizes machine learning and a meta-heuristic (combining Genetic Algorithms and Random
Forests) to extract and learn the characteristics that make certain application
functionality more suitable for a certain computing technology. Starting from a 'C'
implementation, the framework either selects the best computational element to
accelerate the functionality or offers suggestions on code transformations that can be
applied. The expert system identifies, 88.9% of the time, the functionalities that are
efficiently accelerated by an FPGA. This work, however, is different from our proposed
work since it does not predict the most suitable floor-plan or layout of the
reconfigurable computing platform for a specific application, but rather predicts which
functionality of the application can be accelerated efficiently by an FPGA.
3.5 Summary
In this chapter we reviewed the state-of-the-art work on reconfigurable operating
systems and the progress of the partial reconfiguration design flow. We also reviewed
the work on the use of multiple architecture variants as part of developing an
operating system or a scheduler, along with the use of genetic algorithms in
reconfigurable scheduling. Finally, the work related to the use of machine-learning and
data-mining for hardware resource prediction was also reviewed.
Chapter 4
Overall Methodology and Tools
The main objective of this chapter is to provide the reader with an overview of the
overall methodology used in this thesis. The intent is to help the reader understand
the overall framework being proposed and to identify its key components, which are
presented later in Chapters 5, 6 and 7 respectively. We first explain the developed
run-time reconfigurable platform, which manages both hardware and software tasks, along
with the developed tools and benchmark generators needed to validate and measure the
performance of the proposed system. This is followed by a brief description of the
different phases required to design and implement the framework proposed in this
thesis.
4.1 Run-Time Reconfigurable Platform
Xilinx introduced an interesting commercial partial reconfiguration design flow with
ISE Design Suite version 12.1 in late 2010. Partial reconfiguration is usually used for
time multiplexing multiple functions of an application that are not required to run at
the same time on the FPGA (mutually exclusive functions).

CHAPTER 4. OVERALL METHODOLOGY AND TOOLS 58

This approach reduces power consumption and enables the use of a smaller FPGA device,
hence reducing costs. Figure 2.7, presented earlier in Chapter 2, illustrates the
concept of partial reconfiguration, where a PRR can be reconfigured with different
reconfigurable modules. The Xilinx partial reconfiguration design flow [5] uses a
bottom-up synthesis approach, where Reconfigurable Modules have to be synthesized
separately. Each Reconfigurable Module is treated as a separate project in which it is
verified and synthesized independently. The top design treats each Partial
Reconfigurable Region as a black box. After generating all net-lists (top design and
Reconfigurable Modules), each Partial Reconfigurable Region must be manually
floor-planned using the Xilinx PlanAhead design tool. The PRR can be rectangular or
L-shaped, with some restrictions. More details can be found in [5].
Xilinx ISE 12.4 with partial reconfiguration capability, along with EDK and PlanAhead,
was utilized to develop a platform capable of achieving dynamic run-time
reconfiguration in this thesis. The proposed RTR platform was initially mapped onto
Xilinx Virtex-4 and Virtex-5 boards, and eventually implemented on a Virtex-6 FPGA
board. The proposed platform consists of two Xilinx MicroBlaze processors running at
100 MHz and five partial reconfigurable regions for hardware tasks. One of the
processors is dedicated to running the reconfigurable operating system along with the
schedulers, while the other acts as a general-purpose processor, as shown in
Figure 4.1. Each processor has its own data/instruction caches and memory. A DDR
memory is shared between the two processors (using an MDM controller) to facilitate
data sharing. In addition, a message box is implemented as a means of inter-processor
communication; the message box interrupts the processors whenever new data arrives.
Each processor has its own serial and debug module for debugging. The controller
processor executes the schedulers and assigns tasks to the available processing
elements (software and hardware).
Xilinx partial reconfiguration design flow has been utilizedto assign five PRRs to
accommodateReconfigurable Modules(RM)as hardware accelerators. Figure 4.2 shows
the FPGA floor plan where the purple rectangles highlight thefive PRRs. Each PRR can
be reconfigured internally using the Xilinx ICAP controller. Hardware tasks are fetched
from a library of partial bit-streams stored on an external memory card. The controller
processor reads the appropriate partial bit-stream from the external memory, performs
some pre-processing operations, and then reconfigures the appropriate PRR. The use
of an embedded processor for reconfiguration limits reconfiguration speed. Therefore,
a dedicated hardware reconfiguration manager frees the controller processor
from the task of configuring and reconfiguring the system, thus enhancing performance.
Each PRR includes a programmable timer that triggers an interrupt upon task com-
pletion. The timer is used to emulate task processing times, which helps in studying the
effects of varying task processing times on the schedulers.
4.1.1 PRR Uniformity
The available dynamic reconfigurable area reserved for hardware tasks was partitioned
into five PRRs. Two floorplans were produced: one with uniform and one with non-uniform
partitions, as shown in Figure 4.2. The objective of having two implementations with
different uniformity was to have as many PRRs as possible and to investigate the effect
of PRR uniformity on reconfiguration and, hence, on total execution time and power con-
sumption. The five PRRs are identical in size in the "uniform" implementation
and differ in size in the "non-uniform" platform, yet the total area of the five PRRs
is identical in both cases.
Figure 4.1: Framework: Major Blocks.
Figure 4.2: Floorplan for the uniform (left) and non-uniform (right) implementations.
Reconfiguration time mainly depends on the size of the PRR dedicated to the RM.
Therefore, similar reconfiguration times are expected for each PRR in the uniform
floorplan, while the non-uniform PRRs will have reconfiguration times
proportional to their respective sizes. The floorplan based on non-uniform PRRs is
more flexible, as it can accommodate different task sizes. To efficiently exploit the
benefit of the non-uniform PRR platform, the developed schedulers should take the non-
uniformity into account; hence, a smart scheduler, called RCSched-III, is introduced in
Section 5.3.2.
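As a rough illustration of this size dependence, the following sketch (in C, the simulator's implementation language) models reconfiguration latency as linear in a PRR's frame count. The frame dimensions and per-frame cost below are invented for illustration, not measured Virtex-6 values.

```c
#include <stddef.h>

/* Hypothetical per-frame reconfiguration cost in microseconds. */
#define US_PER_FRAME 10

/* A PRR occupying `rows x columns` reconfiguration frames. */
typedef struct { int rows; int columns; } prr_t;

/* Reconfiguration time grows linearly with the region's frame count. */
static int reconfig_time_us(prr_t p) {
    return p.rows * p.columns * US_PER_FRAME;
}

/* In the uniform floorplan all PRRs share one latency; in the non-uniform
 * floorplan each PRR pays a cost proportional to its own area, while the
 * total area of the five PRRs stays the same in both layouts. */
static int total_area(const prr_t *prrs, size_t n) {
    int a = 0;
    for (size_t i = 0; i < n; i++) a += prrs[i].rows * prrs[i].columns;
    return a;
}
```

A non-uniform layout therefore lets small tasks pay a small reconfiguration cost, at the price of scheduler complexity.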
4.2 DFG Generator
Scheduling sequential tasks on multiple processing elements is an NP-hard problem
[118]. Developing, modifying, and evaluating a scheduler therefore requires extensive
testing. Scheduling validation can be achieved using workloads created either from
actual recorded data in user logs or, more commonly, from synthetically generated
random benchmarks [118]. In this work, a synthetic random DFG generator was
developed to assist in verifying system functionality and performance.
An overview of the DFG generator is given in Figure 4.4. The inputs to the DFG
generator include a library of task types (operations), the number of required nodes, the
total number of dependencies between nodes (tasks), and the maximum number of de-
pendencies per task. The number of dependencies per task is illustrated in the examples
shown in Figure 4.3. In addition, the user can control the percentage of occurrence in the
DFG for each task or group of tasks. The outputs of the DFG generator consist of two
files: a graph file and a platform file. The graph file is a graphical representation of the DFG
in a user-friendly format, while the platform file encodes the same DFG in a format
compatible with the targeted platform.
Figure 4.3: A node can be independent or can have several dependencies.
4.2.1 DFG Generator Sub-modules
The DFG generator consists of four major modules, as shown in Figure 4.4:
1. Matrix dependency generator: This module provides the core functionality of
the DFG generator and is responsible for constructing the DFG skeleton (i.e., the
nodes and dependencies). The nodes represent functions (tasks), and the edges in the
graph represent data communicated among tasks. Both the tasks and the edges be-
tween them are generated randomly based on user-supplied directives. The matrix
dependency generator builds the entire DFG without assigning task types
(operations) to the nodes.
2. Matrix operation generator: This module assigns task types (operations) to the
DFG skeleton generated by the Matrix dependency generator module. A task
type (operation) is assigned to each node from a library of task types based on
user-supplied parameters.
3. Platform file generator: Generates the DFG in a format compatible with the RTR
platform. This file is explained in more detail later in Section 4.3.1.
4. Graph file generator: Produces a graph representation of the DFG in the standard
open-source DOT graph description language [119].
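The first two modules can be sketched in C as follows. The adjacency-matrix construction below restricts edges to run from lower to higher node indices, so the generated DFG is acyclic by construction; the names and parameters are illustrative, not the generator's actual interface.

```c
#include <stdlib.h>

#define MAX_NODES 64

typedef struct {
    int n;                          /* number of nodes (tasks) */
    int adj[MAX_NODES][MAX_NODES];  /* adj[i][j] = 1: node j depends on node i */
    int op[MAX_NODES];              /* task type (operation) per node */
} dfg_t;

/* Build the DFG skeleton, then assign operations (the two-pass structure
 * mirrors the matrix-dependency / matrix-operation split). max_deps bounds
 * the dependencies per node. */
static void dfg_generate(dfg_t *g, int n, int max_deps, int n_ops,
                         unsigned seed) {
    srand(seed);
    g->n = n;
    for (int j = 0; j < n; j++) {
        int deps = rand() % (max_deps + 1);
        if (deps > j) deps = j;          /* node j may only depend on 0..j-1 */
        for (int i = 0; i < n; i++) g->adj[i][j] = 0;
        for (int d = 0; d < deps; d++)
            g->adj[rand() % j][j] = 1;   /* edge i -> j with i < j: acyclic */
        g->op[j] = rand() % n_ops;       /* second pass: assign operation */
    }
}

static int dfg_indegree(const dfg_t *g, int j) {
    int k = 0;
    for (int i = 0; i < g->n; i++) k += g->adj[i][j];
    return k;
}
```

Controlling the percentage of occurrence per task type would amount to replacing the uniform `rand() % n_ops` draw with a weighted one.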
Figure 4.4: DFG Generator Components
4.3 Reconfigurable Simulator
Using the hardware platform described in Section 4.1 has several advantages but intro-
duces some limitations. For example, the machine learning engine proposed in this thesis
(Phase III) requires the evaluation of hundreds of DFGs on different hardware con-
figurations to develop a model that can accurately predict an appropriate floorplan for an
incoming DFG or application. The FPGA platform used in this work requires a
different floorplan and bit-stream for each new configuration, which limits the scope of
testing and evaluation. Accordingly, a reconfigurable architecture simulator was de-
veloped to simulate the hardware platform discussed in Section 4.1 while running the
developed reconfigurable operating system.
The simulator consists of three distinct layers, as shown in Figure 4.5. The Platform
Emulator Layer (PEL) emulates the RTR platform functionality and acts as a virtual
machine for the upper layers. The ROS kernel, along with the developed schedulers, runs
on top of the PEL. The code of the ROS kernel and the schedulers runs on both
the simulator and the RTR platform (as discussed in Section 4.1) without
any further modification. The PEL thus enables the designer to easily develop
and test new schedulers on the simulator before porting them to the actual platform.
The simulator implements the three schedulers described in Chapter 5 and supports
any number of PRRs and/or GPPs (hardware or software processing elements). The
simulator was written in the C language and runs under Linux. The code was carefully
developed in a modular format, which makes it easily extensible with new schedulers.
The simulator uses several configuration files, similar to those used by the Linux OS.
We opted to publish the simulator under the open-source GPL license on GitHub [120] to enable
Figure 4.5: Reconfigurable simulator layout
researchers working in the area of reconfigurable operating systems to utilize it as a
supporting tool for their work.
4.3.1 Simulator Inputs
The simulator accepts different input parameters to control its operation, as shown in
Table 4.1.

Table 4.1: Reconfigurable Simulator Input Parameters

  Input Parameter         Description                                        Default
  -h, --help              Help                                               -
  -V, --version           Display simulator version                          -
  -v, --verbose           Show each node's execution and reconfiguration
                          times, in addition to the total execution time     -
  -i, --iteration         Number of times the scheduler runs the same
                          DFG; helps in the learning phase for RCSched-III   1
  -t, --task-migration    Enable task migration between SW and HW            0
  -q, --disable-q-search  Disable ready-queue search for reuse
                          (used by RCSched-III-Enhanced)                     0
  -k, --scheduler         Select the scheduler                               3
  -d, --dfg-file          DFG input file name                                dfg.conf
  -a, --arch-file         Architecture file name                             arch.conf
  -p, --prr-file          PRR (platform) file name                           prr.conf
  -s, --prrs-set          Which platform to use from the platform
                          libraries in the platform file                     0
  -g, --task-graph        Generate a graph file that contains detailed
                          placement and timing for each task                 -

The simulator emulates the hardware platform and expects the following configuration
files as input:

• A Task (architecture) Library file, which stores the task information used by the sim-
ulator. Task information includes the mode of operation (software, hardware, or
hybrid, where hybrid tasks can migrate between hardware and software), execution
time, area, reconfiguration time, reconfiguration power, and dynamic power con-
sumption, on a per-task-type (operation) basis. Some of these values are based on
analytic models found in [121], [122], while others were measured manually from
actual implementations on a Xilinx Virtex-6 platform. For example, Figure 4.6
illustrates a task library for two task types (operations). The first task type, called
Task1, has three hardware implementation variants (arch1, arch2, and arch3). Each
task variant includes the time and power consumption for executing the task and
reconfiguring it, along with the reconfiguration frame area of the task (represented
by rows and columns). Notice that unlike Task1, Task2 has one hardware
variant and one software variant.
• A Layout (platform) file, which specifies the FPGA floorplan. The layout includes
data that represent the size, shape, and number of PRRs, along with the types and
number of GPPs. Notice that the schedulers developed and used in this thesis re-
quire only the number of processing elements and their reconfiguration times; since
the area can be derived from the reconfiguration time, we combined area
and reconfiguration time into one parameter, as shown in Figure 4.7. The figure il-
lustrates an example of five different floorplans (layouts) with different numbers of
GPPs and PRRs.
• A DFG file, which stores the data flow graphs to be scheduled and executed. Fig-
ure 4.8 shows the S1 benchmark file as an example. This file is usually generated
by the DFG Generator, or by a script that reads a graph representation of the DFG
(the case for real-life benchmarks). The file defines the independent input and out-
put vertices using the keywords inputs and outputs. The intermediate vertices are
then identified using the regs keyword (for registers). Finally, each node (T0 - T9)
is defined with the node's inputs, outputs, and type. The type is an index to the
id of the task type (operation) in the Task (architecture) Library file, as seen in
Figure 4.6.
Name = "Architecture Library v1.0"
Date = "Jan 03, 2015"
######################
Task Task1 {            # Task name
  id = 1                # task ID
  arch arch1 {          # first architecture (task variant) for Task1
    exec_time    = 5
    config_time  = 5
    config_power = 5
    exec_power   = 10
    columns      = 1
    rows         = 1
    mode         = HW }
  arch arch2 {
    exec_time    = 10
    config_time  = 10
    config_power = 10
    exec_power   = 5
    columns      = 1
    rows         = 2
    mode         = HW }
  arch arch3 {
    exec_time    = 10
    config_time  = 20
    config_power = 20
    exec_power   = 3
    columns      = 2
    rows         = 2
    mode         = HW }
}
######################
Task Task2 {
  id = 2
  arch arch1 {
    exec_time    = 5
    config_time  = 5
    config_power = 5
    exec_power   = 20
    columns      = 1
    rows         = 1
    mode         = HW }
  arch arch2 {
    exec_time    = 20
    exec_power   = 15
    mode         = SW }
}
######################

Figure 4.6: Simulator task variant (architecture) library (example).
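The task-variant library of Figure 4.6 can be rendered in C as plain structs; the field and type names below are our own, and the simulator's internal layout may differ. The Task1 data is taken verbatim from the figure.

```c
typedef enum { MODE_HW, MODE_SW } exec_mode_t;

typedef struct {
    int exec_time, config_time;     /* execution / reconfiguration time */
    int config_power, exec_power;
    int columns, rows;              /* reconfiguration frame area (HW only) */
    exec_mode_t mode;
} arch_t;

typedef struct {
    int id;             /* task type (operation) ID, referenced by DFG files */
    int n_arch;
    arch_t arch[4];     /* task variants */
} task_type_t;

/* Task1 from Figure 4.6: three hardware variants trading area for power. */
static const task_type_t task1 = {
    1, 3, {
        { 5,  5,  5, 10, 1, 1, MODE_HW },  /* arch1 */
        {10, 10, 10,  5, 1, 2, MODE_HW },  /* arch2 */
        {10, 20, 20,  3, 2, 2, MODE_HW },  /* arch3 */
    }
};

/* Pick the variant with the lowest execution power, as a fitness evaluation
 * or scheduler might when PRR area is not the bottleneck. */
static const arch_t *lowest_power(const task_type_t *t) {
    const arch_t *best = &t->arch[0];
    for (int i = 1; i < t->n_arch; i++)
        if (t->arch[i].exec_power < best->exec_power) best = &t->arch[i];
    return best;
}
```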
Name = "PRR.conf v1.0"
Date = "Feb 4, 2015"
processors Set1 {
  PRRno = 5
  GPPno = 0
  PRRConfigTime = {20,20,20,20,20} }
processors Set2 {
  PRRno = 5
  GPPno = 0
  PRRConfigTime = {150,130,130,30,30} }
processors Set3 {
  PRRno = 5
  GPPno = 0
  PRRConfigTime = {150,90,30,10,5} }
processors Set4 {
  PRRno = 3
  GPPno = 0
  PRRConfigTime = {40,40,40} }
processors Set5 {
  PRRno = 5
  GPPno = 3
  PRRConfigTime = {20,20,20,20,20} }

Figure 4.7: Simulator platform (PRR) library (example).
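The same platform library can be held in memory as an array of structs, with set selection mirroring the simulator's `-s`/`--prrs-set` option; the C names here are illustrative, and the data is copied from Figure 4.7.

```c
#define MAX_PRR 8

typedef struct {
    int n_prr, n_gpp;
    int prr_config_time[MAX_PRR];   /* per-PRR reconfiguration time; also
                                       encodes the PRR's area */
} platform_t;

static const platform_t platform_lib[] = {
    {5, 0, {20, 20, 20, 20, 20}},       /* Set1: uniform */
    {5, 0, {150, 130, 130, 30, 30}},    /* Set2 */
    {5, 0, {150, 90, 30, 10, 5}},       /* Set3: strongly non-uniform */
    {3, 0, {40, 40, 40}},               /* Set4: three PRRs only */
    {5, 3, {20, 20, 20, 20, 20}},       /* Set5: adds three GPPs */
};

/* 0-based index, as passed on the command line via -s / --prrs-set. */
static const platform_t *select_platform(int set_index) {
    return &platform_lib[set_index];
}
```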
4.3.2 Simulator Output
The simulator reports many system parameters, including total time, reconfiguration
time, task migration information, and hardware reuse. For example, Figure 4.9 shows
the output with the verbose option ('-v' or '--verbose') for the S1 benchmark,
scheduled on a platform with 3 PRRs and 1 GPP. Figure 4.9 first displays data for the ex-
ecution of each node (task) by printing the task ID, execution start time, reconfiguration
start time, hardware or software execution (RECONF if the task ran on hardware and
SW COM otherwise), reconfiguration time, execution time, the index of the PRR/GPP,
task reuse, task type priority (RCSched-III), and finally the task type ID (operation). This is
followed by the final parameters for the entire DFG, which are self-explanatory.
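As a sketch of how the options of Table 4.1 might be handled, the fragment below parses a subset of them with getopt_long. The struct and defaults mirror the table, but this is an illustrative parser, not the simulator's actual code.

```c
#include <getopt.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int verbose, iterations, task_migration, scheduler;
    const char *dfg_file;
} sim_opts_t;

static sim_opts_t parse_opts(int argc, char **argv) {
    sim_opts_t o = {0, 1, 0, 3, "dfg.conf"};    /* defaults from Table 4.1 */
    static const struct option longopts[] = {
        {"verbose",        no_argument,       0, 'v'},
        {"iteration",      required_argument, 0, 'i'},
        {"task-migration", no_argument,       0, 't'},
        {"scheduler",      required_argument, 0, 'k'},
        {"dfg-file",       required_argument, 0, 'd'},
        {0, 0, 0, 0}
    };
    int c;
    optind = 1;                                 /* reset for repeated calls */
    while ((c = getopt_long(argc, argv, "vi:tk:d:", longopts, NULL)) != -1) {
        switch (c) {
        case 'v': o.verbose = 1; break;
        case 'i': o.iterations = atoi(optarg); break;
        case 't': o.task_migration = 1; break;
        case 'k': o.scheduler = atoi(optarg); break;
        case 'd': o.dfg_file = optarg; break;
        }
    }
    return o;
}
```

Invoking the simulator as `rsim -v -i 5 --scheduler 2` would then select scheduler 2 and repeat the DFG five times with verbose reporting.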
The simulator can further produce the task placement graph by enabling the '-g'
or '--task-graph' option. Figure 4.10 shows an example of the task placement
graph for the S1 benchmark scheduled on a platform with 3 PRRs and 1 GPP. The Y
axis of the graph represents time, while the X axis represents the processing elements
(PRR/GPP). Each column is assigned to a PRR or a GPP, except for column 1, which
represents the time. A PRR can be idle ("."), being reconfigured ("#"), or executing a task
("*"). To the left of each task marker (* or #) is the ID of the task being
executed or configured.

Name = "S1 Benchmark"
Date = "July 8th, 2013"
# this file is generated by the DFG generator
# input vertices (no dependencies)
inputs = { c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14 }
# output vertices
outputs = { o0, o1, o2, o3, o4, o5, o6 }
# intermediate vertices
regs = { r0, r1, r2 }

task T0 {        # task name
  type = 3       # task type ID, taken from the architecture library
  inputs = { c0, c1 }
  output = r0 }
task T1 {
  type = 4
  inputs = { c2, r0 }
  output = o0 }
task T2 {
  type = 2
  inputs = { c3, r0 }
  output = r2 }
task T3 {
  type = 3
  inputs = { c4, c5 }
  output = o1 }
task T4 {
  type = 2
  inputs = { c6, c7 }
  output = r1 }
task T5 {
  type = 1
  inputs = { c8, r1 }
  output = o2 }
task T6 {
  type = 3
  inputs = { c9, c10 }
  output = o3 }
task T7 {
  type = 2
  inputs = { r1, c11 }
  output = o4 }
task T8 {
  type = 1
  inputs = { r2, c12 }
  output = o5 }
task T9 {
  type = 2
  inputs = { c13, c14 }
  output = o6 }

Figure 4.8: Simulator DFG file example for the S1 benchmark.
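A scheduler consuming a DFG file like the S1 benchmark of Figure 4.8 must track which tasks are ready: a task can be dispatched once every intermediate register it reads has been produced. The C sketch below illustrates this with a few S1 nodes; the names and the two-input restriction are illustrative, not the simulator's data structures.

```c
#include <string.h>

typedef struct {
    const char *in0, *in1;   /* two inputs per node, as in the S1 file */
    const char *out;         /* produced register or output vertex */
    int done;                /* task has finished executing */
} node_t;

/* c*: primary inputs (always available); r*: produced by another task. */
static int value_ready(const node_t *g, int n, const char *name) {
    if (name[0] != 'r') return 1;            /* primary input */
    for (int i = 0; i < n; i++)
        if (g[i].done && strcmp(g[i].out, name) == 0) return 1;
    return 0;
}

/* A task is ready when it has not run yet and both inputs are available. */
static int task_ready(const node_t *g, int n, int t) {
    return !g[t].done && value_ready(g, n, g[t].in0)
                      && value_ready(g, n, g[t].in1);
}
```

In S1, for instance, T1 and T2 become ready only after T0 has produced r0, while T8 must further wait for T2 to produce r2.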
Node[0] --> T[333]  R[0]    RECONF Config 333 Exec 800 PRR[2] Reuse[NO]  Prio[0] Type[3]
Node[1] --> T[1467] R[-]    SW COM Config 0   Exec 800 GPP[0] Reuse[NO]  Prio[1] Type[4]
Node[2] --> T[1268] R[0]    RECONF Config 0   Exec 200 PRR[1] Reuse[YES] Prio[1] Type[2]
Node[3] --> T[334]  R[-]    SW COM Config 0   Exec 960 GPP[0] Reuse[NO]  Prio[0] Type[3]
Node[4] --> T[668]  R[335]  RECONF Config 333 Exec 200 PRR[1] Reuse[NO]  Prio[1] Type[2]
Node[5] --> T[1466] R[1133] RECONF Config 333 Exec 200 PRR[2] Reuse[NO]  Prio[1] Type[1]
Node[6] --> T[1001] R[668]  RECONF Config 333 Exec 800 PRR[0] Reuse[NO]  Prio[0] Type[3]
Node[7] --> T[1068] R[0]    RECONF Config 0   Exec 200 PRR[1] Reuse[YES] Prio[1] Type[2]
Node[8] --> T[1801] R[1468] RECONF Config 333 Exec 200 PRR[1] Reuse[NO]  Prio[1] Type[1]
Node[9] --> T[868]  R[0]    RECONF Config 0   Exec 200 PRR[1] Reuse[YES] Prio[1] Type[2]

Total Conf Time [1665], noGPP 1
[3]    Scheduler
[2267] Total Number of Cycles
[395]  Total Power
[5]    Number of Configurations
[0]    SW Busy
[131]  HW Busy
[0]    SW2HW MIG
[2]    HW2SW MIG
[3]    # of Reuse
[2]    # of SW tasks

Figure 4.9: Simulator output for the S1 benchmark, using the --verbose option.
4.4 Methodology Overview
In this thesis, an efficient operating system for reconfigurable computing is proposed,
designed, and implemented to ease application design and properly manage resources
within the reconfigurable system. Making such a reconfiguration manager available
alongside the current flow should further enhance and improve partial reconfiguration
on state-of-the-art FPGAs. The run-time reconfigurable manager consists of three main
cooperating modules, as seen in Figure 4.11. The first module is mainly responsible for
Time  PRR0    PRR1    PRR2    GPP0
 0--  .....   .....   0#####  ......
 1--  .....   .....   0#####  ......
 2--  .....   .....   0#####  ......
 3--  .....   .....   0#####  ......
 4--  .....   .....   0#####  ......
 5--  .....   .....   0#####  ......
 6--  .....   .....   0#####  ......
 7--  .....   4#####  0*****  3*****
 8--  .....   4#####  0*****  3*****
 9--  .....   4#####  0*****  3*****
10--  .....   4#####  0*****  3*****
11--  .....   4#####  0*****  3*****
12--  .....   4#####  0*****  3*****
13--  6#####  4*****  0*****  3*****
14--  6#####  4*****  0*****  3*****
15--  6#####  4*****  0*****  3*****
16--  6#####  4*****  0*****  3*****
17--  6#####  9*****  0*****  3*****
18--  6#####  9*****  0*****  3*****
19--  6#####  9*****  0*****  3*****
20--  6*****  9*****  0*****  3*****
21--  6*****  7*****  0*****  3*****
22--  6*****  7*****  0*****  3*****
23--  6*****  7*****  5#####  3*****
24--  6*****  7*****  5#####  3*****
25--  6*****  2*****  5#####  3*****
26--  6*****  2*****  5#####  ......
27--  6*****  2*****  5#####  ......
28--  6*****  2*****  5#####  ......
29--  6*****  8#####  5*****  1*****
30--  6*****  8#####  5*****  1*****
31--  6*****  8#####  5*****  1*****
32--  6*****  8#####  5*****  1*****
33--  6*****  8#####  .....   1*****
34--  6*****  8#####  .....   1*****
35--  6*****  8#####  .....   1*****
36--  .....   8*****  .....   1*****
37--  .....   8*****  .....   1*****
38--  .....   8*****  .....   1*****
39--  .....   8*****  .....   1*****
40--  .....   .....   .....   1*****
41--  .....   .....   .....   1*****
42--  .....   .....   .....   1*****
43--  .....   .....   .....   1*****
44--  .....   .....   .....   1*****

Figure 4.10: Simulator graph file, where '#' represents reconfiguration and '*' execution.
The number is the task ID. (S1 benchmark)
Figure 4.11: Overall Methodology Flow
placing and scheduling tasks of the application on the FPGA fabric. The second mod-
ule attempts to allocate and bind the appropriate execution unit to each task to further
reduce power consumption and improve performance. The third, and final, module uses
a supervised machine learning approach to predict the most appropriate floorplan for
a specific application. The development of the proposed framework occurred in three
distinct phases, each of which is described next.
1. Phase I: Involves the development and evaluation of several novel heuristics for
on-line scheduling on partially reconfigurable devices (discussed in Chapter 5).
The proposed schedulers use fixed, predefined partial reconfigurable re-
gions with reuse, relocation, and task migration capability. In particular, RCSched-
III uses real-time data to make scheduling decisions. The scheduler dynamically
measures several performance metrics, such as reconfiguration time and execution
time, calculates a priority for each task type, and, based on these metrics, assigns
incoming tasks to the appropriate processing elements. A dynamic reconfigurable
framework consisting of five reconfigurable regions and two General Purpose Pro-
cessors (GPPs) was implemented. The schedulers run as part of the ROS in the
developed reconfigurable platform, as discussed in Section 4.1. To verify sched-
uler functionality and performance, different DFGs are needed; therefore, a DFG
generator was developed, as described in Section 4.2. The DFG generator can
randomly generate benchmarks with different sizes and features, using predefined
specifications.
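The priority idea behind RCSched-III can be sketched as follows; the rule below is a plausible stand-in using the run-time metrics named above (the actual metric is defined in Chapter 5), and the numbers in the usage are taken from the Figure 4.9 output for flavor.

```c
typedef struct {
    int hw_exec, hw_config;   /* measured on the PRRs */
    int sw_exec;              /* measured on the GPP */
} metrics_t;

/* Priority 1: hardware pays off even when a fresh reconfiguration is needed.
 * Priority 0: hardware only wins if a loaded bitstream can be reused. */
static int hw_priority(metrics_t m) {
    return (m.hw_exec + m.hw_config < m.sw_exec) ? 1 : 0;
}

/* Placement: prefer hardware when the priority is high, or when the task
 * type is already loaded in a PRR (reuse avoids the reconfiguration cost). */
static int place_in_hw(metrics_t m, int already_loaded) {
    if (already_loaded) return m.hw_exec < m.sw_exec;
    return hw_priority(m);
}
```

With a 333-cycle reconfiguration, a 200-cycle hardware task beats an 800-cycle software run outright, whereas an 800-cycle hardware task is only worth placing in hardware when its bitstream is already resident.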
2. Phase II: Proposes the design and implementation of a parallel, island-based GA ap-
proach for efficiently mapping execution units to task graphs for partially dynamically
reconfigurable systems (described in Chapter 6). Each GA optimization mod-
ule consists of four main components: an Architecture Library, an Initial Population
Generation module, a GA Engine, and a Fitness Evaluation module (based on an
on-line scheduler). The Fitness Evaluation module requires the platform and sched-
ulers developed in Phase I. Since using a hardware platform as the GA fitness function
is not practical, we developed a reconfigurable simulator for the RTR
platform of Section 4.1. Each GA module optimizes several
objectives based on a single static floorplan/platform. Unlike previous works, our
approach is multi-objective and seeks not only to optimize speed and power but
also to select the best reconfigurable floorplan. The basic idea is to aggregate
the results obtained from the Pareto fronts of each island to enhance the overall so-
lution quality. Each solution on the Pareto front optimizes power consump-
tion, speed, and area based on a different platform (floorplan) within the FPGA.
Our approach was tested using both synthetic and real-world benchmarks. This is
the first attempt to use multiple GA instances for optimizing several objectives and
aggregating the results to further improve solution quality.
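The aggregation step can be sketched in C: each island returns a Pareto front of (time, power) solutions for its own floorplan, and merging the fronts while keeping only non-dominated points yields the combined front. Struct and function names are illustrative.

```c
typedef struct { int time, power, floorplan; } solution_t;

/* a dominates b if it is no worse in both objectives and better in one. */
static int dominates(solution_t a, solution_t b) {
    return a.time <= b.time && a.power <= b.power &&
           (a.time < b.time || a.power < b.power);
}

/* Filter n merged candidates in place; returns the size of the
 * non-dominated subset, compacted to the front of the array. */
static int pareto_filter(solution_t *s, int n) {
    int kept = 0;
    for (int i = 0; i < n; i++) {
        int dominated = 0;
        for (int j = 0; j < n; j++)
            if (j != i && dominates(s[j], s[i])) { dominated = 1; break; }
        if (!dominated) s[kept++] = s[i];
    }
    return kept;
}
```

Because each surviving solution carries its floorplan index, the combined front lets the user trade speed against power across platforms, not just within one.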
3. Phase III: Proposes a novel adaptive and dynamic methodology based on an intel-
ligent machine learning approach that predicts and estimates the necessary
resources for an application based on past historical information. Even though the
approach is general enough to predict most, if not all, types of resources (the
number of PRRs, the type of PRRs, the type of scheduler, and the communication in-
frastructure), we limit our results to the former three.
The framework is based on extracting certain features from the applications that
are executed on the reconfigurable platform. The compiled features are then used
to train and build a classification model that is capable of predicting the floorplan
appropriate for an application. The classification model is based on a
supervised learning approach that can generalize from previously observed patterns
and accurately predict the class of an incoming application. The
proposed approach is based on several modules, including benchmark generation,
data collection, pre-processing of data, data classification, and post-processing.
The goal of the entire process is to extract useful, hidden knowledge from the data;
this knowledge is then used to predict and estimate the necessary resources and
appropriate floorplan for a previously unseen application. Based
on the literature review, the use of data-mining and machine-learning techniques
has not been proposed by any research group to exploit this specific type of de-
sign exploration for reconfigurable systems, namely, predicting the appropriate
floorplan of an application. Such prediction is needed because of the long run-times
involved in evaluating many different platforms. The machine learning prediction
framework is introduced and explained in detail in Chapter 7.
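As a hedged sketch of the Phase III idea, the fragment below extracts a tiny feature vector from an incoming DFG and classifies it against centroids learned from previously observed applications. The feature set and the nearest-centroid rule are illustrative stand-ins; the actual features and model are developed in Chapter 7.

```c
typedef struct {
    double n_nodes;       /* DFG size */
    double avg_indeg;     /* average dependencies per task */
    double hw_ratio;      /* fraction of hardware-capable task types */
} features_t;

/* Squared distance in feature space (monotone in the true distance,
 * so no sqrt is needed for comparison). */
static double fdist2(features_t a, features_t b) {
    double dn = a.n_nodes - b.n_nodes;
    double di = a.avg_indeg - b.avg_indeg;
    double dh = a.hw_ratio - b.hw_ratio;
    return dn * dn + di * di + dh * dh;
}

/* Predict the floorplan class (e.g., an index into the platform library)
 * whose centroid is nearest to the incoming application's features. */
static int predict_floorplan(features_t x, const features_t *centroid,
                             int n_classes) {
    int best = 0;
    for (int c = 1; c < n_classes; c++)
        if (fdist2(x, centroid[c]) < fdist2(x, centroid[best])) best = c;
    return best;
}
```

A real model would normalize the features and use a trained classifier rather than raw centroids, but the pipeline shape (extract, compare to history, predict a floorplan) is the same.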
4.5 Modes of Operation
The proposed framework, introduced briefly in Section 4.1 of this dissertation, can be used
in several operational modes. The mode of operation depends on the degree of
integration of the different modules developed and requested by the user.
1. Mode #1: The hardware platform initially developed using the Xilinx partial recon-
figuration flow, along with the designed schedulers, can be used by the designer to
schedule and place tasks of the application, as shown in Figure 4.12. In this mode
of operation, only a single floorplan is employed, and a single execution unit
is associated with each task. We consider this the most basic mode of operation.
2. Mode #2: The hardware framework, the schedulers, and the GA opti-
mization modules are integrated, as shown in Figure 4.13. The system can be used
in the following capacities:
(a) Mode #2A: In this mode of operation, a single GA island is used to optimize
the mapping and binding of execution units to the tasks of the graph. A
single floorplan is used, but multiple execution units can be associated with
each task, thus reducing power and delay.
Figure 4.12: Operation Mode 1
(b) Mode #2B: In this mode of operation, an island-based GA is used to further
optimize the mapping and binding of execution units to the tasks of the graph.
A set of floorplans (designed a priori) is used within the framework, and the
user can pick and choose the solution generated by the island-based GA along
with the appropriate floorplan associated with it.
3. Mode #3: The hardware framework, the schedulers, and the Machine
Learning module are integrated. This system is capable of predicting the most
appropriate floorplan based on the incoming DFG of the application. The main
limitation of this mode of operation is that no variation of the execution units ap-
plies; only a single execution unit is associated with each task.
4. Mode #4: The final and most general mode of operation is based on the inte-
gration of the basic hardware framework with the schedulers, the Island-Based
Figure 4.13: Operation Mode 2
Figure 4.14: Operation Mode 3
GA module, and the Machine Learning module. This is the most general and efficient
framework, but also the most memory- and resource-demanding mode. In this mode of oper-
ation, the system utilizes the library of execution units that are allocated and bound
to the tasks of the DFG using the GA module. The Machine Learning module, on
the other hand, attempts to predict a specific floorplan for an incoming DFG. Oper-
ation Mode #4 can also be used to select the most appropriate scheduling technique
via the machine learning module, while the island-based GA optimizes the map-
ping and binding of execution units to the tasks of the incoming DFG.
Figure 4.15: Operation Mode 4
4.6 Summary
In this chapter we introduced the overall methodology flow of the proposed work, along
with the developed tools used throughout this thesis. The methodology
flow consists of three phases: Phase I deals with developing heuristic-based online re-
configurable schedulers for software and hardware tasks. In Phase II an island-based
genetic algorithm is designed and developed that maps hardware task variants to task
graphs, using the schedulers developed in Phase I as part of the fitness evaluation mod-
ule. Finally, Phase III proposes a novel technique based on machine learning to predict
the necessary resources for dynamic run-time reconfiguration. This chapter also de-
scribed the proposed hardware reconfigurable platform, along with the DFG generator
and the reconfigurable simulator used as part of the proposed framework.
Chapter 5
Reconfigurable Online Schedulers
In this chapter, several online scheduling algorithms for reconfigurable computing are
designed and implemented. The proposed schedulers manage both hardware and soft-
ware tasks. We also introduce an efficient offline scheduler and an exact mathematical
Integer Linear Programming (ILP) model for reconfigurable scheduling that can be used
to verify the quality of solutions obtained by the online schedulers. The developed sched-
ulers reuse hardware resources to reduce reconfiguration overhead, mi-
grate tasks between software and hardware, and assign priorities to task types, while
respecting task priority and maintaining precedence between tasks. The first two schedulers,
called RCSched-I and RCSched-II, give priority to hardware tasks. When there is
a shortage of hardware resources, both RCSched-I and RCSched-II initiate the migra-
tion of hardware tasks to software. In contrast, the third scheduler, RCSched-III,
migrates tasks to software only when it is beneficial to do so. RCSched-III dynam-
ically measures several system metrics, such as execution time and reconfiguration time,
and then calculates a priority for each task type. Based on these priorities, RCSched-III
migrates and assigns tasks to the most suitable processing element (software or
hardware).
The main methodology developed in our work is demonstrated as follows. First, the
task representation of the benchmarks used in this work is explained, along with the impact
of PRR uniformity on scheduling. We then introduce several baseline schedulers that were
adapted from the literature to further evaluate the performance of our proposed
schedulers. This is followed by a description of the three novel reconfigurable scheduling
algorithms (RCSched-I, RCSched-II, and RCSched-III), along with an enhanced version
of RCSched-III, named RCSched-III-Enhanced. Finally, in order to evaluate the
performance of the proposed schedulers, several experiments are carried out, along with
comparisons against the baseline schedulers.
5.1 Task Representation
Applications running on the proposed partial reconfigurable operating system are
modeled as a Data Flow Graph (DFG). Each node within a DFG represents a task. Each
task has a predefined operation, ID, priority and type. Edges between nodes represent
task dependencies. Figure 5.1 shows a sample DFG for an application along with the
corresponding modes as seen in the task model presented in Table 5.1.
| Mode | Description |
|---|---|
| HybridSW | Hybrid task that usually runs on the GPP but can migrate to a PRR |
| HybridHW | Hybrid task that preferably runs on PRRs but can migrate to the GPP |
| SW | Software-only task executed on the GPP (no migration) |
| HW | Hardware-only task that runs on PRRs (no migration) |
Table 5.1: Description of task modes
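To make the task model concrete, the modes of Table 5.1 and the DFG dependency check can be sketched in a few lines of Python (a minimal illustration; the class and field names are ours, not the thesis implementation):

```python
# Minimal sketch of the task model: each node has an operation, ID,
# priority and mode; edges (here: predecessor lists) encode dependencies.
HYBRID_SW, HYBRID_HW, SW, HW = range(4)  # the four modes of Table 5.1

class Task:
    def __init__(self, tid, op, priority, mode):
        self.tid, self.op, self.priority, self.mode = tid, op, priority, mode

    def can_migrate(self):
        # only hybrid tasks may move between the GPP and the PRRs
        return self.mode in (HYBRID_SW, HYBRID_HW)

dfg = {1: [], 2: [], 3: [1, 2]}          # task -> predecessor IDs
tasks = {1: Task(1, "add", 0, HYBRID_HW),
         2: Task(2, "fft", 1, SW),
         3: Task(3, "mul", 2, HYBRID_SW)}

def ready(tid, done):
    # a node enters the ready queue once all of its dependencies have run
    return all(p in done for p in dfg[tid])

print(tasks[2].can_migrate())  # False: SW-only tasks never migrate
print(ready(3, done={1, 2}))   # True: both predecessors finished
```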
Figure 5.1: A Data Flow Graph (DFG).
5.1.1 Task Placement Models
The task placement model in a reconfigurable device can be abstracted as a 1D or 2D
model. The 1D model divides the reconfigurable device into columns that can be recon-
figured separately, where a task size is assigned by width only. In the 2D based approach,
a task can have any width and height and be placed anywhere on the FPGA fabric. The
1D model simplifies the placement mechanism and trades this simplification for a sub-
optimal device utilization. Both models are shown in Figure5.2. The RTR platform
presented in Section 4.1, uses 2D placement model as previously shown in Figure 4.2.
Nonetheless, PRR sizes cannot be changed after design time due to design flow limita-
tion, therefore, the scheduling problem presented in this work is based on the 1D model
representation.
Figure 5.2: 1D versus 2D Task Placement Area Models for Reconfigurable Devices
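The 1D abstraction reduces placement to picking a free column that is wide enough, as the following sketch illustrates (column widths and the first-fit policy are hypothetical, for illustration only):

```python
# 1D area model: the fabric is a row of independently reconfigurable
# columns; a task is characterised by its width alone.
columns = [4, 6, 8]                  # column widths (hypothetical units)
busy = [False, False, False]

def place_1d(task_width):
    # first-fit over free columns; returns the column index or None
    for k, width in enumerate(columns):
        if not busy[k] and width >= task_width:
            busy[k] = True
            return k
    return None

print(place_1d(5))   # 1: the first free column at least 5 units wide
print(place_1d(9))   # None: no free column fits a 9-unit task
```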
The RTR platform introduced in Section 4.1 has two PRR floorplans: uniform and
non-uniform. Non-uniform PRRs have different reconfiguration times; therefore, task
placement affects total execution time. Accordingly, an efficient reconfigurable scheduler
should take uniformity into account.
5.2 Baseline Schedulers
One of the main challenges faced in this work, is the lack of compatible schedulers, tools,
and benchmarks to compare and evaluate our work. Therefore,throughout the research
process, we tend to consider several published algorithms and adapt them to represent
our RTR platform for the sake of comparing with our proposed schedulers. The adapted
scheduling algorithms are described briefly in the following sections. First, an exact ILP
model for scheduling tasks on reconfigurable platform is described in detail. This is fol-
lowed by RCOffline, a heuristic offline scheduler for reconfigurable platforms. Finally,
two baseline schedulers, a meta-heuristic offline based scheduler and a simple generic
online scheduler are used to further evaluate the schedulesproduced by the proposed
online schedulers, which are introduced in Section 5.3.
5.2.1 ILP Model
The ILP model in [123] was modified and extended to solve the task scheduling problem
in the proposed RTR platform. The ILP model takes into account hardware task reuse,
task prefetching, and scheduling, with the single objective of minimizing total time. The
model in [123] targets partial reconfigurable platforms with 1D task placement and
attempts to reduce defragmentation. Since our proposed approach uses preassigned PRRs,
where a task can be placed in one PRR at a time, the ILP model was modified
by treating each column as a PRR and restricting the placement of a task to one PRR
(column).
The following ILP model is adapted to suit the proposed RTR platform and uses N
uniform PRRs.
5.2.1.1 Constants

For every task i ∈ O, where O is the set of tasks in a DFG, let:

• l_i := latency of task i;
• r_i := reconfiguration time of task i;
• u_ij := 1 if tasks i and j perform the same operation (they can exploit task reuse).
It is also important to consider the upper limit of the time needed to schedule all the
tasks in a DFG (worst case), where

$$T = \sum_{i=1}^{|O|} (l_i + r_i). \qquad (5.1)$$
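A quick worked instance of this bound for a hypothetical three-task DFG:

```python
# Worst-case horizon of Eq. (5.1): every task reconfigured and executed
# back-to-back, with no reuse, prefetch or overlap.
latency = [3, 5, 2]      # l_i (cycles, hypothetical)
reconfig = [4, 4, 1]     # r_i (cycles, hypothetical)
T = sum(l + r for l, r in zip(latency, reconfig))
print(T)  # 19: no schedule needs to run past this instant
```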
5.2.1.2 Variables

The variables employed in the model are defined as follows (the variables are integers
unless otherwise noted):

• P_iek ∈ {0, 1} := task i is loaded on the FPGA at time e onto PRR k;
• t_ie := reconfiguration of task i starts at time e;
• m_i := task i exploits reuse (1 for reuse and 0 otherwise);
• S^in_i := arrival time of task i on the FPGA;
• S^out_i := end of execution time of task i;
• t_f := total execution time.
5.2.1.3 Constraints

PRR constraints At any given time, no more than |N| PRRs can be used:

$$\forall e, \quad \sum_{i=1}^{|O|} \sum_{k=1}^{|N|} P_{iek} \le |N| \qquad \text{PRR constraints.} \quad (5.2)$$

A PRR can only be used by one task at a time:

$$\forall e, k, \quad \sum_{i=1}^{|O|} P_{iek} \le 1 \qquad \text{No overlap.} \quad (5.3)$$
Time constraints The first (initial) instant is reserved:

$$\sum_{i=1}^{|O|} \sum_{k=1}^{|N|} P_{i0k} = 0 \qquad \text{Time zero.} \quad (5.4)$$
The 1's in P_iek are arranged in one PRR for every task i for the time it is on the FPGA:

$$\forall i, e, k, \quad \sum_{m=1}^{T} \left( \sum_{l=1}^{|N|} P_{iml} - P_{imk} \right) \le T \, (1 - P_{iek}) \qquad \text{Same PRR.} \quad (5.5)$$
Task arrival time must be less than or equal to the first instant for which P is 1:

$$\forall i, e, \quad S^{in}_i - e \sum_{k=1}^{|N|} P_{iek} \le T \left( 1 - \sum_{k=1}^{|N|} P_{iek} \right) \qquad \text{Task load time.} \quad (5.6)$$
Task leaving time must be greater than or equal to the last instant for which P is 1:

$$\forall i, e, \quad S^{out}_i \ge e \sum_{k=1}^{|N|} P_{iek} \qquad \text{Task unload time.} \quad (5.7)$$
Tasks cannot disappear and reappear from the FPGA:

$$\forall i, \quad \sum_{e=1}^{T} \sum_{k=1}^{|N|} P_{iek} = S^{out}_i - S^{in}_i + 1 \qquad \text{Continuous usage.} \quad (5.8)$$

Precedence should be enforced:

$$\forall (i, j) \in P, \quad S^{out}_j - l_j \ge S^{out}_i \qquad \text{Precedences.} \quad (5.9)$$
Reconfiguration constraints If a task cannot reuse an existing task, then it must
stay on the FPGA for at least the sum of its reconfiguration and execution times:

$$\forall i, \quad r_i + l_i - \sum_{e=1}^{T} \sum_{k=1}^{|N|} P_{iek} \le T \cdot m_i \qquad \text{Reconfigured time.} \quad (5.10)$$
Reconfiguration starts as soon as the task is on the FPGA:

$$\forall i, \quad S^{in}_i - \sum_{e=1}^{T} e \cdot t_{ie} = T \cdot m_i \qquad \text{Reconfiguration start.} \quad (5.11)$$
Only one reconfiguration can take place at a time:

$$\forall e, \quad \sum_{i=1}^{|O|} \sum_{m=\max(1,\, e-r_i+1)}^{e} t_{im} \le 1 \qquad \text{Single reconfiguration.} \quad (5.12)$$
A task can reuse a placed task in a PRR at time e if and only if at time e − 1 the
same task type occupies the same PRR position:

$$\forall i, e, k, \quad 1 - \sum_{j=1,\, j \ne i}^{|O|} u_{ij} \cdot P_{j(e-1)k} - P_{i(e-1)k} + T \, (1 - P_{iek}) \le T \, (1 - m_i) \qquad \text{Reuse.} \quad (5.13)$$
If a task is reused, it still has to stay on the FPGA for at least its execution
time:

$$\forall i, \quad l_i - \sum_{e=1}^{T} \sum_{k=1}^{|N|} P_{iek} \le T \, (1 - m_i) \qquad \text{Reused time.} \quad (5.14)$$
The reconfiguration time is unique; it is not defined if there is task reuse:

$$\forall i, \quad \sum_{e=1}^{T} t_{ie} \le 1 - m_i \qquad \text{No reconfiguration.} \quad (5.15)$$
5.2.1.4 Objective

Definition of t_f:

$$\forall i, \quad t_f \ge S^{out}_i, \quad t_f \le T. \qquad (5.16)$$

The objective is to minimize the total execution time:

$$\min t_f \qquad \text{Objective.} \quad (5.17)$$
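The semantics of the capacity, continuity and precedence constraints can be checked on a toy instance by exhaustive search over start times (a hypothetical three-task example with no reuse or prefetching; this illustrates the model, it is not the thesis solver):

```python
from itertools import product

tasks = {"t1": (2, 3), "t2": (2, 3), "t3": (2, 4)}  # (r_i, l_i), hypothetical
deps = [("t1", "t3")]     # t3 may start only after t1 finishes, cf. Eq. (5.9)
N = 2                     # number of PRRs
T = sum(r + l for r, l in tasks.values())           # horizon, Eq. (5.1)

def feasible(starts):
    for i, j in deps:                               # precedence
        ri, li = tasks[i]
        if starts[j] < starts[i] + ri + li:
            return False
    for e in range(T):                              # PRR capacity, cf. Eq. (5.2)
        resident = sum(1 for i, (r, l) in tasks.items()
                       if starts[i] <= e < starts[i] + r + l)
        if resident > N:
            return False
    return True

names = list(tasks)
best = T
for combo in product(range(T), repeat=len(names)):
    starts = dict(zip(names, combo))
    if feasible(starts):
        makespan = max(starts[i] + r + l for i, (r, l) in tasks.items())
        best = min(best, makespan)
print(best)  # 11: t1 and t2 run in parallel, then t3 follows t1
```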
The advantage of using an exact ILP model in this thesis is that it generates an
optimal solution (schedule), which can help in verifying the solution quality produced by
the proposed online schedulers. However, solving an ILP model is NP-complete in general;
therefore, even a small problem takes a long time to solve. For example, it
takes from 3 to 14 days to evaluate a small DFG with 10 nodes using CPLEX, a
commercial LP/ILP solver. Accordingly, we adapted the greedy heuristic scheduler in [124]
to our RTR platform, as explained in the next section. Compared with the
exact ILP model, RCOffline produced near-optimal solutions at a fraction of the
computation time.
5.2.2 RCOffline Scheduler
The second scheduler adapted to our RTR platform is called RCOffline. The
RCOffline scheduler is a greedy offline reconfiguration-aware scheduler for 2D
dynamically reconfigurable architectures. It supports task reuse, module prefetch and
anti-fragmentation techniques. The main advantage of utilizing this scheduler is that it
produces solutions (schedules) of high quality (close to those obtained by the ILP
model [123, 124]) in a fraction of the time, at least for small problems.

The offline scheduler proposed in [124] and modified in our work differs from
the online schedulers proposed in this thesis. Since it is an offline technique, it
expects the entire DFG in advance. It also assumes the existence of a reconfigurable
platform that supports task relocation and PRR resizing at run time; at the time of
writing this thesis, no such reconfigurable platform existed. Another key difference is
that the offline scheduler does not support task migration or software tasks, unlike the
schedulers proposed in this thesis.
5.2.3 Meta-Offline Scheduler
The third scheduler adapted for our RTR platform is a meta-heuristic based
offline scheduler [125] that uses different advanced meta-heuristic methods to achieve
a near-optimal solution for multi-processing systems. The scheduler is used specifically
for heterogeneous multi-processor environments, where the tasks in a DFG are efficiently
assigned to each processor. Because the Meta-Offline scheduler targets multiprocessor
environments, it had to be modified to work with reconfigurable platforms. The
processing elements in our RTR platform are PRRs and GPPs; each was treated as a
separate processor by the Meta-Offline scheduler. Hardware task reconfiguration time
also had to be accounted for, since it is not directly supported by the Meta-Offline scheduler.
5.2.4 RCSched-Base Scheduler
The final baseline scheduler adapted to our RTR platform is a simple online scheduler
that produces feasible solutions with no optimization applied (such as reuse or
placement priority). The RCSched-Base scheduler does, however, maintain dependencies,
task priority, and placement restrictions. In Section 5.4.3, RCSched-Base is used as a
baseline when comparing and evaluating the proposed schedulers.
5.2.5 Baseline Schedulers Comparison
The four baseline schedulers were evaluated based on different criteria, including
runtime, quality of solutions, and how faithfully they represent the RTR platform.

The RCSched-Base scheduler produces inferior solutions, as its main objective is to
come up with feasible solutions without optimization.
The Meta-Offline scheduler tends to use the GPP extensively for tasks whose
reconfiguration time is high with respect to their execution time, because it does not support
hardware task reuse. An important point to reflect upon is that the Meta-Offline
scheduler requires more CPU time to find a near-optimal schedule. Calculation times vary
dramatically with the number of processing elements used, as shown in Figure 5.3. The
Meta-Offline scheduler is therefore not suitable for embedded systems, given its extensive
use of resources and the long time needed to find a valid schedule.
Figure 5.3: Meta-Offline scheduler processing time.
The CPLEX solver for the ILP model takes a substantial amount of time to produce
the optimal schedule. Accordingly, it was difficult to utilize the ILP model to evaluate
solutions produced by the proposed online schedulers. For example, the solver took 74
hours to find the optimal schedule for the S1 benchmark on a Red Hat Linux workstation
with a 6-core Intel Xeon processor running at 3.2 GHz, equipped with 16 GB
of RAM. Table 5.2 compares the total time of the schedules produced by RCOffline and
RCSched-III-Enhanced with that of the ILP model for three small benchmarks. The
results show that RCOffline is on average 4.7% away from the optimal solution for the
three small benchmarks.
| Benchmark | # of nodes | ILP | RCOffline | RCSched-III-Enhanced |
|---|---|---|---|---|
| S1 | 10 | 160 | 160 | 160 |
| S11 | 13 | 160 | 160 | 160 |
| DFG14_2 | 11 | 120 | 140 | 140 |
Table 5.2: Baseline schedulers comparison (values for Total time in # of cycles).
For these reasons, RCOffline was selected as a baseline: it represents the
reconfigurable platform better than the Meta-Offline scheduler, it produces results in a
relatively short time, and it has been evaluated against an ILP model.
5.3 Proposed Scheduling Algorithms
The quality of the proposed online scheduling algorithms can have a significant impact
on the final performance of applications running on a reconfigurable computing platform.
The overall time taken to run applications on the same platform can vary considerably
under different schedulers.

In this section, three reconfigurable schedulers are introduced: RCSched-I, RCSched-II,
and RCSched-III. The first two schedulers are designed to handle both software and
hardware tasks, and migrate tasks to software in the event of hardware resource scarcity.
Unlike RCSched-I and RCSched-II, which prefer utilizing hardware resources, RCSched-III
assigns task priority dynamically based on live measured system metrics, and migrates
tasks only if the total execution time is reduced. RCSched-III also takes advantage of
non-uniform PRR implementations and tends to place tasks intelligently on the most
appropriate processing element. RCSched-III was further enhanced by modifying
the technique for selecting a task from the ready queue, which dramatically improves
performance, as will be shown in Section 5.4.4.
5.3.1 RCSched-I and RCSched-II
Two efficient scheduling algorithms (RCSched-I and RCSched-II) were designed and
implemented with different objectives. Both algorithms nominate free (i.e., inactive or
currently not running) PRRs for the next ready task. If all PRRs are busy and the GPP
is free, the scheduler attempts to migrate the ready task (the first task in the ready queue)
by changing the task type from hardware to software (if the task accepts hardware-to-
software migration). Busy PRRs, unlike free PRRs, accommodate hardware tasks that
are active (i.e., running). The two algorithms differ with respect to the way the next PRR
is nominated.
RCSched-I nominates the first free available PRR for reconfiguration, as seen on line
10 in the pseudo-code of Figure 5.4. The nominated PRR is then checked against the
ready task. The ready task is scheduled to execute immediately if there is a match
(i.e., if the PRR holds the bit-stream of the ready task); otherwise, the reconfiguration
controller reconfigures the PRR with the bit-stream of the ready task and executes it. The
RCSched-I scheduler tends to minimize the number of active PRRs and has less task
variety. Therefore, it has a lower task reuse ratio compared to RCSched-II.
Task reuse is noticeably increased with the second scheduler (RCSched-II), which
reconfigures the least recently configured PRR. The scheduler nominates all free PRRs
then checks them against the ready task for a match. If there is a mismatch, the scheduler
sends the least recently configured PRR for reconfiguration. This scheduler has more
active PRRs, i.e., PRRs with running tasks. In addition, this scheduler leads to PRRs with
higher task variety, which increases the task reuse ratio. Figure 5.4 shows the pseudo-code
for RCSched-I and RCSched-II. The schedulers are called by the ROS main loop at the
end of task execution and task reconfiguration, as shown in Figure 5.5. Assuming all
DFG tasks are available, the complexity of RCSched-I and RCSched-II is O(n), where n
is the total number of tasks in a given DFG.
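The difference between the two nomination policies can be sketched as follows (a simplified model in which each PRR records whether it is busy, which bit-stream it holds, and when it was last configured; function and field names are our assumptions):

```python
def nominate_rcsched1(prrs, task_type):
    # RCSched-I: take the first free PRR and check only it for reuse
    free = [p for p in prrs if not p["busy"]]
    if not free:
        return None, False
    prr = free[0]
    return prr, prr["loaded"] == task_type

def nominate_rcsched2(prrs, task_type):
    # RCSched-II: search all free PRRs for a bit-stream match; on a miss,
    # reconfigure the least recently configured free PRR
    free = [p for p in prrs if not p["busy"]]
    if not free:
        return None, False
    for p in free:
        if p["loaded"] == task_type:
            return p, True
    return min(free, key=lambda p: p["cfg_time"]), False

prrs = [{"busy": False, "loaded": "add", "cfg_time": 5},
        {"busy": False, "loaded": "fft", "cfg_time": 1}]
print(nominate_rcsched1(prrs, "fft")[1])  # False: first free PRR holds "add"
print(nominate_rcsched2(prrs, "fft")[1])  # True: the "fft" bit-stream is reused
```

Because RCSched-II scans every free PRR before falling back to the least recently configured one, it keeps a wider variety of bit-streams resident, which is exactly what raises its reuse ratio over RCSched-I.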
5.3.2 RCSched-III
RCSched-I and RCSched-II do not consider PRR uniformity or task types when making
placement decisions. To overcome these shortcomings, a third scheduler, called
RCSched-III, is proposed and designed, as seen in Figure 5.7. The main difference between
RCSched-III and the previous two schedulers is in the assignment of tasks to the
available processing elements. RCSched-III intelligently identifies and chooses the most
suitable processing element (PRR or GPP) for the task at hand based on knowledge
extracted from previous tasks' historical data. The scheduler dynamically calculates task
placement priorities based on data obtained from previously run tasks. During task
execution the scheduler learns about running tasks by classifying them according to their
types. The scheduler then dynamically updates a memory table, the task type table, with
the new task type information. The data stored in the table is later used by the scheduler
to classify and assign priorities to new incoming tasks.
 1 Start
 2   Read next task
 3   if (All dependencies have been met)
 4     {add task to ready queue}
 5   Fetch the first task from the ready queue
 6   switch (task->mode)
 7   { case Hybrid_HW :
 8     {if (A free PRR is available)
 9       {// to minimize reconfiguration time
10       (RCSched-I) : {Check the free PRR for reuse}
11       (RCSched-II): {Search free PRRs for a match
12                      against the ready task}
13       if (PRR reuse is available)
14         run ready task on available PRR
15       else{
16         (RCSched-I) : {reconfigure PRR with the
17                        ready task bit-stream.}
18         (RCSched-II): {reconfigure least recently configured
19                        PRR with the ready task bit-stream}}
20       }else // all PRRs are busy
21       {if (GPP is free)
22         { Change task type from HybridHW to HybridSW
23           Add to the beginning of the ready queue }
24       }else{
25         return Busy}
26       break; }
27     case HybridSW :
28     {if (GPP is free)
29       { load task into GPP}
30     }else if (there is a free PRR)
31       { change task type from HybridSW to HybridHW
32         Add to the beginning of the ready queue}
33     }else{
34       return Busy
35       break;}
36   }
37 End
Figure 5.4: Pseudo-code for RCSched-I & II (with reuse and task migration)
/*
 * Simplified version of the ROS main loop.
 * The OS state is global; it can be set by the configuration thread
 * and the task execution thread.
 */
do {
    switch (State) {            // OS states
    case CfgDone:               // end of configuration
        Call the scheduler;
        State = TaskDone;       // the OS state is set by the configuration thread
        break;
    case TaskDone:
        // This function checks if there is a new task
        // to be added to the ready queue.
        AddTask2Queue(ReadyQ);
        Call the scheduler;
        State = TaskDone;
        break;
    case Start:                 // The OS just started, and the ready queue is empty
        AddTask2Queue(ReadyQ);  // wait for a task
        State = TaskDone;
        break;
    default:
        print(Error: Unknown state);
        break;
    }
} while (TRUE);
Figure 5.5: Simplified pseudo-code for the ROS, illustrating how the scheduler is called.
The learning process is not an issue for embedded systems, since the same task sets
usually run repeatedly on the same platform. For example, video processing applications
use the same repeated task sets for every frame. That is to say, the scheduler will always
produce a valid schedule even during the learning phase; nonetheless, its efficiency
increases over time. The following definitions are needed to further understand the
RCSched-III algorithm:
PRR priority: calculated based on PRR size. Smaller PRRs have lower
reconfiguration times and thus higher priorities.

Placement priority: a dynamically calculated metric, frequently updated on
the arrival of new tasks. Placement priority is assigned based on the type (function)
of a task. RCSched-III uses this metric to decide where to place the current task.
Placement priority is an integer between 0 and the maximum number of processing
elements, where a lower number means a higher priority.
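The PRR priority rule amounts to ranking PRRs by size (a one-line sketch with hypothetical sizes):

```python
# Smaller PRR => shorter reconfiguration time => higher priority (0 = highest)
prr_sizes = {"PRR0": 120, "PRR1": 80, "PRR2": 200}   # hypothetical sizes
prr_priority = {p: rank for rank, p in
                enumerate(sorted(prr_sizes, key=prr_sizes.get))}
print(prr_priority)  # {'PRR1': 0, 'PRR0': 1, 'PRR2': 2}
```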
Based on the pseudo-code of Figure 5.7, RCSched-III takes the following steps to
efficiently schedule tasks:

1. The PRRs are given priority based on their respective sizes. PRR priority is a
number from 0 to PRRmax − 1.

2. Arriving tasks are checked for dependencies. If all dependencies are satisfied, the
task is appended to the ready queue.

3. The scheduler then selects the task with the highest priority from the ready queue;
if the task mode is HybridHW then:
(a) The scheduler checks the available PRRs for task reuse. If reuse is not possible,
the scheduler uses the placement priority and PRR priority to place the task either
on a PRR or migrate it to software (lines 26 to 39 in the pseudo-code of
Figure 5.7). This step is useful for the non-uniform implementation, since each
PRR has a different reconfiguration time. Task migration is performed when:

Migrate_task = True,
if [(HW exec. time + reconfig. time) > SW exec. time]
or [there are no free PRRs]
(b) If no resources are available (in either hardware or software), a busy counter
is incremented while waiting for a resource to become free (lines 44 to 48). The
busy counter is a performance metric that gives an indication of how busy (or idle)
the scheduler is; it counts the number of cycles a task has to wait for free resources.
4. When the task mode is HybridSW, RCSched-III first attempts to run the task on a GPP,
and then migrates it to hardware based on GPP availability. Otherwise, RCSched-III
increments the busy counter and waits for free resources (lines 49 to 60).

5. Software-only and hardware-only tasks are handled in a similar way to HybridSW and
HybridHW respectively, with no task migration.
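The migration condition in step 3(a) can be restated as a small predicate (a sketch; the function and parameter names are ours):

```python
def should_migrate_to_sw(hw_exec, reconfig, sw_exec, free_prrs):
    # Migrate a HybridHW task to the GPP when reconfiguring plus running
    # in hardware is slower than running in software, or when no PRR is free.
    return (hw_exec + reconfig) > sw_exec or free_prrs == 0

print(should_migrate_to_sw(10, 50, 40, free_prrs=2))  # True: HW path is slower
print(should_migrate_to_sw(10, 20, 40, free_prrs=2))  # False: HW still wins
print(should_migrate_to_sw(10, 20, 40, free_prrs=0))  # True: nowhere to place it
```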
Placement priority, which is calculated by the scheduler, differs from task priority,
which is assigned to DFG nodes at design time: placement priority determines
where a ready task should be placed, while task priority determines the order in which
tasks should run.
Priority Calculation The calculation of PRR placement priority takes into account the
PRRs' reconfiguration times in addition to task execution time:

1. At the end of task execution, RCSched-III adds any newly discovered task type to
the task type table.

2. The scheduler updates the reconfiguration time for the current PRR in the task type
table.

3. RCSched-III updates the execution time for the current processing element based
on the following formula:

E_new = E_old + (E_measured − E_old) × X

where E denotes execution time, E_measured is the most recently measured
execution time, and X is the learning factor. As a result of extensive
experimentation, X has been set to 0.2. This assists the scheduler in adapting
to changes in execution time for the same operation. The learning formula takes
into account the old accumulated average execution time (80%) and the most recent
instance of the task (20%). The goal of the task type table is to keep track of the
execution times (and, in the case of PRRs, the reconfiguration times) of each task
type on every processing element (PRRs and GPPs). Accordingly, when similar
new tasks arrive, the scheduler can use the data in the table to determine the
appropriate location to run the task. Figure 5.6 shows the task type structure
along with the task type table.
4. Finally, RCSched-III uses the measured reconfiguration and execution times to
calculate the placement priority for each task type.
struct TaskType {
    int ID;                  /* Task type ID */
    char *name;              /* Task type name, such as adder, sub, or FFT */
    Xuint32 SWET;            /* SW execution time */
    Xuint32 HWET;            /* HW execution time */
    int SWPriority;          /* Task type priority */
    int ConfigTime[NO_PRR];  /* Configuration time for each PRR */
    Xuint32 CanRun;          /* Flags that define placement restrictions */
};
struct TaskType TaskTypeTable[size];  /* Task type table */
Figure 5.6: Task Type table structure
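The running-average update of step 3 is an exponential moving average; a minimal sketch (the function name is ours, and the sample argument stands for the most recently measured execution time):

```python
def update_exec_time(e_old, e_measured, x=0.2):
    # 80% accumulated history, 20% newest sample (learning factor X = 0.2)
    return e_old + (e_measured - e_old) * x

e = 100.0
for sample in (120, 120, 120):       # the operation suddenly runs slower
    e = update_exec_time(e, sample)
print(round(e, 2))  # 109.76: the estimate drifts toward the new behaviour
```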
Assuming all tasks in a DFG are available, the complexity of RCSched-III is O(n),
where n is the number of tasks in the DFG.
5.3.3 RCSched-III-Enhanced
To further enhance the performance of RCSched-III, the selection methodology was
slightly modified. Instead of selecting the first task in the ready queue (i.e., the task
with the highest priority), the scheduler searches the ready queue for a task that matches
one already configured. This addition dramatically reduced the number of
reconfigurations, as will be discussed in Section 5.4.4.
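The enhanced selection step can be sketched as follows (queue entries are hypothetical (priority, type) pairs; the function name is an assumption):

```python
from collections import deque

def select_task(ready_queue, loaded_types):
    # RCSched-III takes the queue head; the enhanced variant first scans
    # for a task whose bit-stream is already configured in some PRR,
    # saving one reconfiguration.
    for task in ready_queue:
        if task[1] in loaded_types:
            ready_queue.remove(task)
            return task
    return ready_queue.popleft()      # fall back to the highest priority

q = deque([(0, "fft"), (1, "add"), (2, "mul")])
print(select_task(q, {"add"}))  # (1, 'add'): reuses the configured PRR
print(select_task(q, set()))    # (0, 'fft'): plain highest-priority pick
```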
5.4 Results and Analysis
The proposed online schedulers presented in Section 5.3 were implemented and
evaluated at different stages of the research carried out in this thesis. The developed online
schedulers were compared with RCOffline for the reasons highlighted in Section 5.2.5.
 1 /*
 2  * RCSched-III measures reconfiguration time and execution (SW/HW) time for each task
 3  * on every PRR it runs on in real time. Based on those measurements it dynamically
 4  * calculates the TaskTypePriority.
 5  */
 6
 7 Start
 8
 9 // The PRRs should be stored from the smallest to the largest so
10 // the one with a smaller ID always has a smaller reconfiguration time
11 Sort PRRs in ascending order based on size.
12
13 Read the node with the highest task priority.
14 if (All dependencies have been met)
15   {add task to ready queue}
16 Fetch the first task from the ready queue
17 switch (task->mode)
18 { case Hybrid_HW :
19
20   { if (TaskTypePriority == 0 and GPP is free)
21     {// TaskTypePriority == 0 indicates that the task is preferred to run on SW
22       Change task type from HybridHW to HybridSW
23       Add to the beginning of the ready queue }
24   }else{
25     if (there is a free PRR(s) that fits the current task)
26     {{Search free PRRs for a match
27       against the ready task }
28
29     if (task found)
30       run ready task on the free PRR
31     else{
32       currentPRR = smallest free PRR;
33       if (TaskTypePriority < currentPRR and GPP is free)
34       { // check if it is faster to reconfigure or to run on SW
35         Change task type from HybridHW to HybridSW
36         Add to the beginning of the ready queue }
37       }else{
38         Reconfigure currentPRR with the ready task bit-stream}}
39     }else // all PRRs are busy
40     {if (GPP is free)
41       { Change task type from HybridHW to HybridSW
42         Add to the beginning of the ready queue }
43     }else{
44       // return to main program and
45       // wait for free resources
46       Increase the Busy Counter
47       return Busy}
48     break; }
49   case HybridSW :
50   {if (GPP is free)
51     { load task into GPP}
52   }else if (there is a free PRR)
53     { change task type from HybridSW to HybridHW
54       Add to the beginning of the ready queue}
55   }else{
56     // return to main program and
57     // wait for free resources
58     Increase the Busy Counter
59     return Busy
60     break;}}
61 End
Figure 5.7: Pseudo-code for RCSched-III (with reuse and task migration)
This section introduces the three stages used to evaluate the developed online
schedulers.
• Preliminary Stage: In this stage we compared RCSched-I and RCSched-II to the
RCOffline scheduler introduced in Section 5.2.2. The parameters evaluated at
this stage were hardware reuse, flexibility in the use of GPPs, and the number
of PRRs. This stage uses the uniform implementation, with no placement
restrictions (tasks can be placed in any PRR). The results of this stage were published
in [7].
• Intermediate Stage: In this stage, RCSched-I, RCSched-II, and RCSched-III are
evaluated. Two PRR floorplans were utilized in this evaluation stage (uniform
and non-uniform). In addition to PRR uniformity, the schedulers' performance
was evaluated with and without task placement restrictions, where a task can only be
placed at specific locations. Placement restrictions can be due to size, I/O or
communication constraints.
• Advanced Stage: The third and final stage examines the performance enhance-
ment of RCSched-III-Enhanced. The latter is evaluated against RCSched-III and
RCOffline. The same benchmarks used in the intermediate stage were again used
in this comparison.
5.4.1 Benchmarks
Verifying the functionality of any developed scheduler requires extensive testing based
on benchmarks with different parameters, properties, and statistics. Accordingly, ten
DFGs were generated by the DFG generator presented in Section 4.2. These DFGs differ
with respect to the number of tasks (nodes), dependencies, and operations, as shown in
Table 5.3. Operation or task type refers to the unique function of a task. The generated
benchmarks have two sets of DFGs. The first set uses four operations while the second
set uses eight operations that have variable execution times in software and hardware.
In addition to the synthesized benchmarks, seven real-world benchmarks selected from the
well-known MediaBench DSP suite [126] are used in the schedulers' evaluation, as shown
in Table 5.4.
| Name | # of nodes | Total dep. | No. of Operations |
|------|-----------|-----------|-------------------|
| S1   | 10        | 5         | 4                 |
| S2   | 25        | 10        | 4                 |
| S3   | 50        | 50        | 4                 |
| S4   | 100       | 150       | 4                 |
| S5   | 150       | 200       | 4                 |
| S6   | 25        | 40        | 8                 |
| S7   | 50        | 60        | 8                 |
| S8   | 100       | 120       | 8                 |
| S9   | 150       | 200       | 8                 |
| S10  | 200       | 60        | 8                 |
Table 5.3: Synthesized Benchmarks
5.4.2 Preliminary Stage
To evaluate the performance of the developed schedulers we compared several param-
eters with different configurations. Each scheduler was tested with a different number
| Name  | # of nodes | Total dep. | Description                    |
|-------|-----------|-----------|--------------------------------|
| DFG2  | 51        | 52        | JPEG - Smooth Downsample       |
| DFG6  | 32        | 29        | MPEG - Motion Vector           |
| DFG7  | 56        | 73        | EPIC - Collapse-pyr            |
| DFG12 | 109       | 116       | MESA - Matrix Multiplication   |
| DFG14 | 11        | 8         | HAL                            |
| DFG16 | 40        | 39        | Finite Input Response Filter 2 |
| DFG19 | 66        | 76        | Cosine I                       |
Table 5.4: MediaBench Benchmarks
of PRRs with/without the use of a GPP. In addition, we attempted to measure the pa-
rameters of the on-line schedulers by enabling/disabling hardware task reuse. Our main
objective was to study the effect of varying the number and types of processing elements
and taking advantage of hardware reusability. The results are grouped by the mea-
sured parameters. Table 5.5 shows a comparison between RCSched-I, RCSched-II, and
the baseline RCOffline scheduler. The online schedulers use a platform of 5 PRRs and
one GPP, while RCOffline uses a hardware platform of 5 PRRs.
Table 5.5: Comparison of RCSched-I, RCSched-II and RCOffline
(I = RCSched-I, II = RCSched-II, Off = RCOffline.)

| Benchmark | I: Total time | I: Rec. time | I: HW2SW mig. | I: # Reuse | I: Busy Cnt. | II: Total time | II: Rec. time | II: HW2SW mig. | II: # Reuse | II: Busy Cnt. | Off: Total time | Off: # Reuse | Off: Prefetch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | 201 | 160 | 0 | 2 | 0 | 160 | 120 | 0 | 4 | 0 | 160 | 3 | 2 |
| S2 | 583 | 400 | 0 | 5 | 0 | 503 | 320 | 0 | 9 | 0 | 380 | 15 | 3 |
| S3 | 867 | 780 | 1 | 10 | 16 | 740 | 580 | 2 | 19 | 50 | 520 | 37 | 1 |
| S4 | 1735 | 1540 | 2 | 21 | 33 | 1293 | 1100 | 3 | 42 | 33 | 920 | 85 | 3 |
| S5 | 2466 | 2240 | 3 | 35 | 47 | 1767 | 1560 | 4 | 68 | 29 | 1160 | 137 | 2 |
| S6 | 478 | 440 | 0 | 3 | 0 | 441 | 400 | 0 | 5 | 0 | 300 | 11 | 6 |
| S7 | 896 | 880 | 0 | 6 | 0 | 818 | 800 | 0 | 10 | 0 | 420 | 34 | 2 |
| S8 | 1773 | 1740 | 0 | 13 | 0 | 1508 | 1480 | 0 | 26 | 0 | 775 | 73 | 1 |
| S9 | 2697 | 2660 | 0 | 17 | 0 | 2417 | 2120 | 0 | 44 | 0 | 995 | 127 | 1 |
| S10 | 3589 | 3540 | 1 | 22 | 10 | 2887 | 2860 | 0 | 57 | 0 | 1095 | 180 | 1 |
| DFG2 | 912 | 800 | 1 | 18 | 12 | 827 | 720 | 1 | 22 | 0 | 680 | 34 | 3 |
| DFG6 | 390 | 280 | 0 | 18 | 0 | 358 | 260 | 0 | 19 | 0 | 325 | 21 | 0 |
| DFG7 | 701 | 420 | 0 | 32 | 0 | 596 | 340 | 0 | 36 | 0 | 405 | 44 | 5 |
| DFG12 | 1270 | 1100 | 3 | 51 | 21 | 879 | 720 | 7 | 66 | 46 | 745 | 85 | 1 |
| DFG14 | 183 | 140 | 0 | 4 | 0 | 162 | 140 | 0 | 4 | 0 | 140 | 5 | 2 |
| DFG16 | 469 | 320 | 1 | 23 | 9 | 403 | 280 | 0 | 26 | 0 | 295 | 29 | 3 |
| DFG19 | 738 | 660 | 2 | 31 | 28 | 560 | 500 | 2 | 39 | 30 | 455 | 50 | 0 |
| Average | 1173.4 | 1064.7 | 0.8 | 18.3 | 10.4 | 959.9 | 841.2 | 1.1 | 29.2 | 11.1 | 574.7 | 57.1 | 2.1 |
Results obtained in Table 5.5 clearly indicate that RCSched-II is better than RCSched-
I. Total time is lower for RCSched-II due to more hardware task reuse, which in turn
reduces total reconfiguration time and hence total time. RCSched-II is 40% away from
RCOffline, with a performance improvement over RCSched-I of 122%.
The next subsections of the Preliminary stage illustrate the different performance
metrics used to evaluate the schedules, by using the DFG shown in Figure 5.1 as an
example.
5.4.2.1 Total Execution Time
The total execution time is defined as the time to load and execute all tasks, satisfy de-
pendencies, and output results for a particular DFG. Figure 5.8 shows the total execution
times for the schedulers with different configurations of the DFG presented in Figure 5.1.
Reducing the total execution time is an important objective in designing any scheduler
to maximize resource utilization.
The following are some important points that can be concluded from Figure 5.8:
1. The static schedule has a noticeably shorter time when GPPs are available as re-
sources. This can be attributed to the use of simple tasks with almost identical
execution times in software and hardware. It is important to remember that extra
time is needed for hardware tasks due to reconfiguration. While the off-line static
scheduler attempts to use the GPP as much as possible, the on-line scheduling al-
gorithms use the GPP as a backup processing element. This is not the case when
more complex tasks are presented that need less time to run on hardware or when
the reconfiguration overhead is minimal. Comparing the static scheduler without
GPP to the on-line schedulers (without GPP as well) shows a small difference in
time, which reflects the real efficiency of the on-line schedulers.
2. Task re-use leads to shorter total execution time even when compared to the off-
line static scheduler. The latter requires more time to find a feasible schedule on a
similar platform.
3. The use of GPPs to run software tasks was not only a flexibility advantage but
also reduced total execution time. From Figure 5.8 it can be concluded that using one
PRR with a GPP gives better results than using two PRRs without a GPP. This is
expected to change if the tasks need less time to run on hardware.
4. The RCSched-II algorithm has a lower total execution time because of its higher
hardware task reuse ratio, which reduces the total reconfiguration time, as shown
in Figure 5.9.
5. Without task reuse the on-line schedulers performed equally well except in two
cases where RCSched-I performed better. That can be attributed to the nature of the
DFG used. It is worth mentioning that RCSched-I has a simpler implementation;
therefore, it is recommended when the task reuse feature is not needed.
5.4.2.2 Reconfiguration Overhead
The reconfiguration overhead was examined by measuring two related parameters: the
total number of reconfigurations and the accumulated reconfiguration time (as seen in
Figure 5.9). The former represents the number of reconfigurations that occurred for the
Figure 5.8: Total Execution Time in mSec
sampled tasks' module, and the latter shows the total time consumed for hardware task
reconfiguration for a particular DFG.
RCSched-II tends to reduce the number of reconfigurations required (and hence re-
configuration time) for the following reasons: it checks all free PRRs for a match, and
also reconfigures the least reconfigured PRR, which gives more task variety. It is worth
mentioning that while increasing the number of PRRs reduces the number of reconfig-
urations, it has some side effects. This can be attributed to the nature of the application
(DFG) used, the number of concurrent tasks, and the sequence of similar tasks.
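The RCSched-II placement rule described above can be sketched as follows: prefer a free PRR that already holds the task's bitstream, and otherwise reconfigure the free PRR that has been reconfigured the fewest times, which keeps a wider variety of task types resident. The data layout and function name are assumptions for illustration.

```python
# Illustrative sketch of the RCSched-II PRR selection rule (not the thesis code).

def pick_prr(free_prrs, task_type):
    """free_prrs: list of dicts {'loaded': task type or None, 'reconf_count': int}.
    Returns (chosen PRR, reconfiguration_needed)."""
    for prr in free_prrs:                    # 1) reuse a matching bitstream if possible
        if prr['loaded'] == task_type:
            return prr, False                # no reconfiguration needed
    # 2) otherwise reconfigure the least-reconfigured free PRR
    victim = min(free_prrs, key=lambda p: p['reconf_count'])
    victim['loaded'] = task_type
    victim['reconf_count'] += 1
    return victim, True
```

Spreading reconfigurations over the least-used regions means more distinct task types stay configured at once, raising the chance that a later task finds a match.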
Figure 5.9: Total Reconfiguration Time in mSec
5.4.2.3 Busy Counter
The busy counter parameter measures the frequency with which the scheduler waits for a
free processing element. The busy counter is incremented when a task is present in the
ready queue, no free PRR is available, and the GPP is busy. Whenever the GPP is free,
the scheduler migrates hardware tasks to software in order to use the GPP effectively
without increasing the busy counter. The busy counter parameter is proportional to the
number of processing elements available, as shown in Figure 5.10. The busy counter can
be used to measure the efficiency of adding extra processing elements. For example, in
Figure 5.10 the maximum number of processing elements used was five, and increasing
the number of processing elements beyond three had minimal effect.
Figure 5.10: Busy Counter: Increased when no Free Resources Available
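The busy-counter update rule described above amounts to a single condition; the function and parameter names below are illustrative, not the thesis implementation.

```python
# Minimal sketch of the busy-counter rule: the counter grows only when a
# ready task finds no free PRR and the GPP is also busy (names assumed).

def busy_counter_step(ready_task_waiting, free_prr_count, gpp_free, busy_counter):
    if ready_task_waiting and free_prr_count == 0 and not gpp_free:
        return busy_counter + 1     # task must wait for a resource
    return busy_counter             # a PRR or the GPP can absorb the task
```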
5.4.2.4 Hardware to Software Migration
The hardware to software migration parameter is an indicator of the number of tasks that
migrate from hardware to software. A task tends to migrate from hardware to software
only if two conditions are met: all PRRs are busy and at the same time an idle GPP is
available. Figure 5.11, shows the number of hardware-to-software task migrations for
a sample DFG. It is important to note that all tasks within theDFG were initialized as
hardware tasks. Therefore, any task that eventually ran on the GPP was due to a migra-
tion. Tasks migration enhances the system dramatically, especially when the number of
PRRs is small compared to the number of concurrent tasks.
Figure 5.11: Number of hardware to software Task Migration
5.4.3 Intermediate Stage
Tests were conducted using two strategies with different placement restrictions as shown
in Figure 5.12. The tests were performed on the uniform and non-uniform floorplans
shown in Figure 4.2. Each floorplan was evaluated with two different placement con-
straints: restricted, where a task can only be placed in a particular processing element,
and non-restricted, where a task can be placed on any processing element. A task can be
restricted for many reasons, such as size, communication bus, or I/O requirements. Ta-
ble 5.6 shows the placement restriction for each task type (operation) for the restricted
case.
Figure 5.12: Experimental Flow of the intermediate stage testing
| OP ID | Mode     | SW  | PRR 1 | PRR 2 | PRR 3 | PRR 4 | PRR 5 |
|-------|----------|-----|-------|-------|-------|-------|-------|
| 1     | HybridHW | YES | YES   | YES   | YES   | YES   | YES   |
| 2     | SWOnly   | YES | NO    | NO    | NO    | NO    | NO    |
| 3     | HWOnly   | NO  | YES   | YES   | YES   | YES   | NO    |
| 4     | HybridHW | YES | YES   | YES   | NO    | NO    | NO    |
| 5     | HybridHW | YES | YES   | YES   | YES   | NO    | NO    |
| 6     | HybridHW | YES | YES   | NO    | NO    | NO    | NO    |
| 7     | HybridHW | YES | YES   | YES   | YES   | YES   | NO    |
| 8     | HybridHW | YES | YES   | YES   | YES   | YES   | YES   |
Table 5.6: Task type Constraints for restricted placement DFG
5.4.3.1 Schedule Evaluation
In this section, results obtained from testing the proposed online schedulers will be dis-
cussed. For a better and more effective evaluation, each DFG was run six times in sequence
under the same conditions; the first two runs were discarded and the average of the
remaining four runs was calculated. The motivation behind this can be explained as
follows:
1. The results were more uniform after the first two iterations, especially for small
DFGs. This is due to a better reuse ratio, since the currently loaded PRRs are now
relevant to the DFG under test. In addition, RCSched-III needs several iterations
to collect enough data to make more intelligent decisions regarding the placement
of the next incoming task.
2. In most cases, the same tasks are repeated many times. For example, if there is a
set of tasks for processing a video stream, the same DFG would run for every frame.
Therefore, the effect of the first two iterations can be ignored.
3. For higher iteration counts, the measurements of the first iterations would be mar-
ginal relative to the total average, since all measurements become constant after the
first few iterations. The first two iterations can be considered as a transient period
for the system to settle into steady state. Due to the time-consuming nature of
running the tests, only six iterations were performed for every DFG under each
configuration.
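The measurement protocol above reduces to discarding a warm-up prefix and averaging the rest; the sample values below are invented for illustration.

```python
# Sketch of the evaluation protocol: run each DFG six times, discard the
# first two (warm-up/learning) iterations, average the remaining four.

def steady_state_average(measurements, warmup=2):
    usable = measurements[warmup:]
    return sum(usable) / len(usable)

# six hypothetical total-time samples for one DFG:
samples = [980, 905, 860, 860, 858, 862]
avg = steady_state_average(samples)   # averages only the last four runs
```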
For the graphs in this stage, each point represents the average of the 10 benchmarks
for each measured parameter.
5.4.3.2 Total Execution Time
Overall Time (Non-Uniform), in ms, versus the number of PRRs:

| # of PRRs   | 1     | 2     | 3     | 4     | 5     |
|-------------|-------|-------|-------|-------|-------|
| RCSched-I   | 16181 | 11302 | 9908  | 10870 | 12302 |
| RCSched-II  | 16263 | 11008 | 9571  | 9594  | 10998 |
| RCSched-III | 16172 | 10976 | 9362  | 8434  | 8392  |
| Baseline    | 34570 | 20149 | 16852 | 17364 | 17821 |

Overall Time (Uniform), in ms, versus the number of PRRs:

| # of PRRs   | 1     | 2     | 3     | 4     | 5     |
|-------------|-------|-------|-------|-------|-------|
| RCSched-I   | 20085 | 16434 | 15951 | 15847 | 16072 |
| RCSched-II  | 20367 | 17340 | 15985 | 14661 | 13864 |
| RCSched-III | 19545 | 15546 | 12787 | 11706 | 10299 |
| Baseline    | 33192 | 21613 | 19460 | 18896 | 18538 |
Figure 5.13: Total Execution Time: Non-Restricted Placement
Non-Restricted Placement: Figure 5.13 shows the total execution time of the pro-
posed online schedulers for the non-restricted placement case. Figure 5.13 clearly shows
that RCSched-III produces better results in most cases. RCSched-III has a noticeably
lower total execution time when enough resources (more PRRs) are available. RCSched-
I and RCSched-II performance was close to RCSched-III when a restricted number of
PRRs (1 - 2) was used.
Restricted Placement: Figure 5.14 presents the total execution time of the pro-
posed online schedulers for the restricted placement case. Results from Figure 5.14
indicate that the overall time almost doubled when restricted placement was imposed.
The differences in performance among the schedulers were marginal, since the placement
constraints reduced the solution space, thus minimizing the effect of the scheduling al-
gorithm.
Overall Time (Non-Uniform), in ms, versus the number of PRRs:

| # of PRRs   | 1     | 2     | 3     | 4     | 5     |
|-------------|-------|-------|-------|-------|-------|
| RCSched-I   | 46811 | 31316 | 28167 | 26789 | 25766 |
| RCSched-II  | 46217 | 32217 | 27525 | 26021 | 23285 |
| RCSched-III | 47047 | 31194 | 29027 | 25420 | 25108 |
| Baseline    | 46999 | 33282 | 28895 | 26527 | 28084 |

Overall Time (Uniform), in ms, versus the number of PRRs:

| # of PRRs   | 1     | 2     | 3     | 4     | 5     |
|-------------|-------|-------|-------|-------|-------|
| RCSched-I   | 52212 | 34090 | 31857 | 28765 | 28080 |
| RCSched-II  | 47765 | 33497 | 29087 | 26397 | 23843 |
| RCSched-III | 54360 | 36829 | 29930 | 28033 | 26343 |
| Baseline    | 55431 | 36521 | 31455 | 27788 | 29132 |
Figure 5.14: Total Execution Time: Restricted Placement
5.4.3.3 Total Reconfiguration Time
Reconfiguration time is a measure of the total time spent downloading bitstreams
into the FPGA (reconfiguring PRRs) for a particular task set. Reconfiguration time is
an overhead of using hardware processing elements and tends to increase the overall time.
Reconfiguration time can be minimized using several techniques, such as:
1. Hardware reuse, which is supported by the proposed schedulers, namely RCSched-
I, RCSched-II, and RCSched-III.
2. Selecting the smallest possible processing element that can accommodate the cur-
rent task (supported by RCSched-III).
3. Migrating the task to software to further improve performance (supported by RCSched-
III).
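Technique 2 above can be sketched as a best-fit selection over the free regions; the function name and data layout are assumptions for illustration, not the thesis code.

```python
# Best-fit PRR selection: among free PRRs large enough for the task,
# pick the smallest, so larger regions stay available for larger tasks.

def smallest_fitting_prr(free_prrs, task_area):
    """free_prrs: list of (prr_id, area) tuples; returns a prr_id or None."""
    candidates = [(area, pid) for pid, area in free_prrs if area >= task_area]
    if not candidates:
        return None          # no fit: the task must wait or migrate to software
    return min(candidates)[1]

# e.g. three free PRRs of (invented) areas 400, 150, and 250 units:
smallest_fitting_prr([(0, 400), (1, 150), (2, 250)], 200)   # -> PRR 2
```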
Figure 5.15 along with Figure 5.16 presents the total reconfiguration time for
the non-restricted and restricted placement cases, respectively.
Non-Restricted Placement: It is clear from Figure 5.15 that RCSched-III has better
reconfiguration time (in all cases), especially for implementations with four and five
PRRs. As the number of PRRs decreases, the three schedulers' performance approaches a
similar point. Reconfiguration time was significantly lower for the non-uniform imple-
mentations due to the reasons highlighted earlier.
Restricted Placement: Figure 5.16 shows the total reconfiguration time for the re-
stricted placement. Reconfiguration time was lower for the non-uniform implementation
Total Reconfiguration Time (Non-Uniform), in ms, versus the number of PRRs:

| # of PRRs   | 1    | 2     | 3     | 4     | 5     |
|-------------|------|-------|-------|-------|-------|
| RCSched-I   | 2486 | 3733  | 5712  | 8700  | 10964 |
| RCSched-II  | 2212 | 3325  | 5193  | 7222  | 9531  |
| RCSched-III | 2243 | 3460  | 5215  | 6012  | 6776  |
| Baseline    | 8037 | 8974  | 11809 | 14825 | 17152 |

Total Reconfiguration Time (Uniform), in ms, versus the number of PRRs:

| # of PRRs   | 1     | 2     | 3     | 4     | 5     |
|-------------|-------|-------|-------|-------|-------|
| RCSched-I   | 7548  | 11393 | 13553 | 14594 | 15219 |
| RCSched-II  | 8090  | 12350 | 13596 | 13360 | 10663 |
| RCSched-III | 6738  | 10802 | 10439 | 10296 | 9204  |
| Baseline    | 13334 | 15926 | 17624 | 18179 | 14299 |
Figure 5.15: Total Reconfiguration Time: Non-Restricted Placement
for RCSched-II. However, the difference between the schedulers was marginal for the rea-
sons mentioned earlier.
Total Reconfiguration Time (Non-Uniform), in ms, versus the number of PRRs:

| # of PRRs   | 1    | 2    | 3    | 4     | 5     |
|-------------|------|------|------|-------|-------|
| RCSched-I   | 6105 | 7143 | 7740 | 7592  | 7253  |
| RCSched-II  | 5018 | 5711 | 6508 | 5998  | 5160  |
| RCSched-III | 5184 | 5854 | 6392 | 5898  | 4722  |
| Baseline    | 6247 | 7153 | 9076 | 11711 | 13468 |

Total Reconfiguration Time (Uniform), in ms, versus the number of PRRs:

| # of PRRs   | 1     | 2     | 3     | 4     | 5     |
|-------------|-------|-------|-------|-------|-------|
| RCSched-I   | 9873  | 11210 | 10529 | 9683  | 9094  |
| RCSched-II  | 8612  | 9070  | 8293  | 6724  | 5289  |
| RCSched-III | 14946 | 16179 | 13745 | 9930  | 8360  |
| Baseline    | 12942 | 14167 | 14593 | 14905 | 14857 |
Figure 5.16: Total Reconfiguration Time: Restricted Placement
5.4.3.4 Intermediate Stage Analysis
Based on results presented in the previous section the following conclusions can be
drawn:
1. The results obtained were based on two different implementations, uniform and
non-uniform. The total reconfigurable area (accumulated PRR size) was set equal
for both implementations. To further gain insight into the performance of the sched-
ulers, placement restrictions were imposed on the benchmarks.
2. The main objective of the proposed schedulers is to reduce the total execution
time. The non-uniform implementation has a lower total execution time, which is
mainly due to lower reconfiguration time.
3. The performance of RCSched-III was superior compared to RCSched-I and RCSched-
II when there is no resource scarcity. This can be attributed to the smarter task
migration between software and hardware, which takes task execution and reconfig-
uration time into account.
4. The difference in total execution time was marginal for the systems with limited
resources (1 and 2 PRRs).
5. Increasing the number of PRRs enhanced the performance by reducing the total
execution time.
The remainder of the measured metrics is presented in more detail in Appendix A.
5.4.4 Advanced Stage
In this section, results based on RCSched-III along with an enhanced version of RCSched-
III will be presented. In order to have a fair comparison with the RCOffline sched-
uler, task migration and software tasks were completely disabled in RCSched-III and
RCSched-III-Enhanced. We also assigned the same reconfigurable area available in the
hardware platform to the RCOffline scheduler. All results were based on the uniform
floorplan platform shown in Figure 4.2.
5.4.4.1 Total Execution Time
The total execution times (in cycles) of the proposed online scheduler (RCSched-III) and its
variant are presented in Table 5.7 and Table 5.8. The tables show a comparison of the to-
tal execution time for the RCSched-III, RCSched-III-Enhanced, and RCOffline schedulers.
They also show the gap percentage between RCSched-III-Enhanced and the RCOffline scheduler. Ta-
ble 5.7 shows the total execution time of the first run for all schedulers. Results are based
on the assumption that all PRRs are empty and RCSched-III is in the initial phase of
learning. Table 5.8, on the other hand, shows the total execution time after the first two it-
erations. At this stage the learning phase of RCSched-III is complete (i.e., third iteration).
Since RCOffline is an offline scheduler with no learning capability, the time it takes to
configure the first task of a DFG was subtracted (it is assumed to be already configured).
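The "Gap %" values in Tables 5.7 and 5.8 are consistent with the relative difference between the RCOffline and RCSched-III-Enhanced execution times, rounded half away from zero; the function below is my reconstruction for illustration, not the thesis code.

```python
# Reconstruction of the Gap % columns in Tables 5.7 and 5.8 (an assumption,
# verified only against the published table values).
import math

def gap_percent(rcoffline_cycles, enhanced_cycles):
    g = (rcoffline_cycles - enhanced_cycles) / enhanced_cycles * 100
    return int(g + math.copysign(0.5, g))   # round half away from zero

# Table 5.7, benchmark S2: RCOffline = 380 cycles, RCSched-III-En. = 460 cycles
gap_percent(380, 460)   # -> -17, matching the table
```

A negative gap means RCSched-III-Enhanced is slower than RCOffline, a positive gap means it is faster.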
The results in Table 5.7 clearly indicate an average performance gap of 40% between
RCSched-III-Enhanced and RCSched-III, and a gap of only −6% relative to the RCOf-
fline. This is a significant improvement, taking into account that RCSched-III-Enhanced is
an online scheduler that requires little CPU time compared to the RCOffline. The
RCSched-III-Enhanced performance improves dramatically following the learning
phase, as shown in Table 5.8: the performance gap was 49% relative to
RCSched-III and 2% relative to RCOffline. The performance is expected to improve even
more if task migration is enabled.
The CPU runtime comparison of RCSched-III-Enhanced and RCOffline is shown in
Table 5.9. The tests were performed on a RedHat Linux workstation with a 6-core Intel
Table 5.7: Schedulers comparison for Total Execution Time (no learning).

| Benchmark | RCOffline (cycles) | RCSched-III (cycles) | RCSched-III-En. (cycles) | Gap % |
|---|---|---|---|---|
| S1 | 160 | 160 | 160 | 0 |
| S2 | 380 | 500 | 460 | -17 |
| S3 | 520 | 740 | 540 | -4 |
| S4 | 920 | 1430 | 1000 | -8 |
| S5 | 1160 | 1750 | 1220 | -5 |
| S6 | 300 | 420 | 340 | -12 |
| S7 | 420 | 860 | 500 | -16 |
| S8 | 775 | 1470 | 820 | -5 |
| S9 | 995 | 1990 | 1000 | -1 |
| S10 | 1095 | 3000 | 1140 | -4 |
| DFG2 | 680 | 800 | 584 | 16 |
| DFG6 | 325 | 340 | 359 | -9 |
| DFG7 | 405 | 600 | 481 | -16 |
| DFG12 | 745 | 910 | 761 | -2 |
| DFG14 | 140 | 160 | 142 | -1 |
| DFG16 | 295 | 403 | 383 | -23 |
| DFG19 | 455 | 540 | 460 | -1 |
| Average | 575 | 945 | 609 | -6 |
Table 5.8: Schedulers comparison for Total Execution Time (after learning phase).

| Benchmark | RCOffline (cycles) | RCSched-III (cycles) | RCSched-III-En. (cycles) | Gap % |
|---|---|---|---|---|
| S1 | 140 | 140 | 120 | 17 |
| S2 | 360 | 440 | 345 | 4 |
| S3 | 500 | 660 | 520 | -4 |
| S4 | 900 | 1230 | 866 | 4 |
| S5 | 1140 | 1690 | 1160 | -2 |
| S6 | 280 | 390 | 280 | 0 |
| S7 | 400 | 700 | 430 | -7 |
| S8 | 755 | 1370 | 660 | 14 |
| S9 | 975 | 1950 | 950 | 3 |
| S10 | 1075 | 2620 | 1075 | 0 |
| DFG2 | 660 | 830 | 584 | 13 |
| DFG6 | 305 | 360 | 322 | -5 |
| DFG7 | 385 | 555 | 436 | -12 |
| DFG12 | 725 | 870 | 723 | 0 |
| DFG14 | 120 | 160 | 103 | 17 |
| DFG16 | 275 | 380 | 288 | -5 |
| DFG19 | 435 | 540 | 460 | -5 |
| Average | 555 | 876 | 548 | 2 |
Xeon processor, running at a speed of 3.2 GHz and equipped with 16 GB of RAM.
RCSched-III-Enhanced is on average 1000 times faster than the RCOffline.
Table 5.9: Run-time comparison of RCOffline and RCSched-III-Enhanced

| Benchmark | RCOffline (ms) | RCSched-III-En. (ms) | Speedup (X times) |
|---|---|---|---|
| S1 | 2 | 0.116 | 1724 |
| S2 | 5 | 0.614 | 814 |
| S3 | 12 | 1.52 | 789 |
| S4 | 31 | 4.98 | 622 |
| S5 | 59 | 9.14 | 646 |
| S6 | 5 | 0.418 | 1196 |
| S7 | 8 | 1.31 | 611 |
| S8 | 29 | 2.03 | 1429 |
| S9 | 58 | 6.28 | 924 |
| S10 | 110 | 9.74 | 1129 |
| DFG2 | 12 | 0.829 | 1448 |
| DFG6 | 3 | 0.553 | 542 |
| DFG7 | 7 | 1.15 | 609 |
| DFG12 | 29 | 3.69 | 786 |
| DFG14 | 2 | 0.111 | 1802 |
| DFG16 | 3 | 0.754 | 398 |
| DFG19 | 12 | 1.39 | 863 |
5.4.4.2 Hardware Task Reuse
Table 5.10 and Table 5.11 show the number of reused hardware tasks prior to the learning
phase and after learning, respectively. Hardware task reuse tends to reduce the number
of required reconfigurations and, hence, total execution time. It is clear from the tables
that the amount of hardware task reuse improved with RCSched-III-Enhanced. Ob-
viously this contributed to a reduction of the total execution time. Table 5.11 shows the
benefit of the learning phase, which tends to increase the variety of the available recon-
figured tasks.
Table 5.10: Schedulers comparison for Number of Hardware Reuses (no learning).

| Benchmark | RCOffline reuse | RCSched-III reuse | RCSched-III-En. reuse | RCOffline # Prefetch | Gap % |
|---|---|---|---|---|---|
| S1 | 3 | 4 | 4 | 2 | 33 |
| S2 | 15 | 9 | 14 | 3 | -7 |
| S3 | 37 | 22 | 37 | 1 | 0 |
| S4 | 85 | 40 | 78 | 3 | -8 |
| S5 | 137 | 74 | 128 | 2 | -7 |
| S6 | 11 | 6 | 10 | 6 | -9 |
| S7 | 34 | 8 | 27 | 2 | -21 |
| S8 | 73 | 28 | 67 | 1 | -8 |
| S9 | 127 | 53 | 118 | 1 | -7 |
| S10 | 180 | 54 | 173 | 1 | -4 |
| DFG2 | 34 | 23 | 38 | 3 | 12 |
| DFG6 | 21 | 19 | 19 | 0 | -10 |
| DFG7 | 44 | 36 | 42 | 5 | -5 |
| DFG12 | 85 | 72 | 84 | 1 | -1 |
| DFG14 | 5 | 4 | 5 | 2 | 0 |
| DFG16 | 29 | 26 | 26 | 3 | -10 |
| DFG19 | 50 | 42 | 49 | 0 | -2 |
| Average | 57 | 31 | 54 | 2 | -3 |
RCSched-III-Enhanced has 172% and 158% more average hardware reuse than RCSched-
III prior to and following the learning phase, respectively. RCSched-III-Enhanced exceeded
the amount of hardware reuse of RCOffline after the learning phase by 119%, and it
reached 97% of that hardware reuse without learning. RCOffline is able to prefetch tasks,
which influences the amount of reuse, but the number of prefetches was marginal, as can
be seen in the results presented in Table 5.10 and Table 5.11.
Table 5.11: Schedulers comparison for Number of Hardware Reuses (after learning).

| Benchmark | RCOffline reuse | RCSched-III reuse | RCSched-III-En. reuse | RCOffline # Prefetch | Gap % |
|---|---|---|---|---|---|
| S1 | 3 | 8 | 8 | 2 | 167 |
| S2 | 15 | 13 | 21 | 3 | 40 |
| S3 | 37 | 26 | 41 | 1 | 11 |
| S4 | 85 | 51 | 90 | 3 | 6 |
| S5 | 137 | 79 | 131 | 2 | -4 |
| S6 | 11 | 10 | 13 | 6 | 18 |
| S7 | 34 | 17 | 31 | 2 | -9 |
| S8 | 73 | 33 | 79 | 1 | 8 |
| S9 | 127 | 56 | 123 | 1 | -3 |
| S10 | 180 | 74 | 183 | 1 | 2 |
| DFG2 | 34 | 24 | 38 | 3 | 12 |
| DFG6 | 21 | 20 | 22 | 0 | 5 |
| DFG7 | 44 | 38 | 48 | 5 | 9 |
| DFG12 | 85 | 73 | 85 | 1 | 0 |
| DFG14 | 5 | 5 | 8 | 2 | 60 |
| DFG16 | 29 | 27 | 32 | 3 | 10 |
| DFG19 | 50 | 42 | 49 | 0 | -2 |
| Average | 57 | 35 | 59 | 2 | 19 |
5.5 Summary
In this chapter we presented three efficient reconfigurable online schedulers that are em-
ployed by the OS proposed in this thesis. The schedulers were evaluated and compared
with multiple baseline schedulers in addition to an exact ILP model. The main objective
of the proposed schedulers is to minimize the total execution time of the incoming task
graph. The proposed schedulers support hardware task reuse and software-to-hardware
task migration. RCSched-I and RCSched-II migrate tasks to software only if hardware
resources are scarce. On the other hand, RCSched-III makes intelligent decisions and
places tasks on the processing element (hardware or software) that leads to min-
imal total execution time. RCSched-III takes PRR size into account and hence works
best on the non-uniform PRR floorplan. RCSched-III learns about task types from pre-
vious history, and its performance tends to improve over time. RCSched-III-Enhanced
further improves upon RCSched-III and searches the ready queue for a task that can ex-
ploit hardware task reuse. RCSched-III-Enhanced has exceptionally good performance,
since it reached on average 97% of the performance of the RCOffline scheduler without the
learning phase, and 102% of the RCOffline scheduler following the learning phase.
Chapter 6
Allocation/Binding of Execution Units
Optimizing the hardware architecture (execution units) for a specific task in a task graph
is an NP-hard problem [4]. Therefore, in this chapter, we propose a heuristic-based
technique to select the type of execution units (implementation variants) needed for a
specific task. The proposed approach uses an Island Based Genetic Algorithm (IBGA) as
a meta-heuristic optimization technique.
The main feature of an IBGA is that each population of individuals (i.e., set of can-
didate solutions) is divided into several sub-populations, called islands. All traditional
genetic operators, such as crossover, mutation, and selection, are performed separately on
each island. Some individuals are selected from each island and migrated to different
islands periodically. In our proposed Island Based Genetic Algorithm (GA), we do not
perform any migration among the different sub-populations, to preserve the solution qual-
ity of each island, since each is based on a different platform (i.e., floorplan). The basic idea
is to aggregate the results obtained from the Pareto fronts of each island to enhance the
overall solution quality. Each solution on the Pareto front tends to optimize multiple ob-
jectives, including power consumption and speed. However, each island tends to optimize
this multi-objective optimization problem based on a distinct platform (floorplan).
The IBGA approach proposed here utilizes the framework proposed in Chapter 5 to evaluate the solutions created. However, using the proposed reconfigurable hardware platform of Chapter 5 directly would be impractical and limiting. Therefore, the reconfigurable simulator presented earlier in Chapter 4 was used instead.
6.1 Single Island GA
Each GA optimization module consists of four main components: an Architecture Library, an Initial Population Generation module, a GA Engine, and a Fitness Evaluation module (an on-line scheduler), as shown in Figure 6.1. The Architecture Library stores all of the different available architectures (execution units) for every task type. Architectures vary in execution time, area, reconfiguration time, and power consumption. For example, a multiplier can have a serial, semi-parallel, or parallel implementation. Each GA module tends to optimize several criteria (objectives) based on a single fixed floorplan/platform. The architecture library is explained in more detail in Appendix B.

Figure 6.1: A Single GA Module

The Initial Population Generator uses a given task graph along with the architecture library to generate a diverse initial population, as demonstrated by Figure 6.1. Each possible solution (chromosome) of the population assembles the same DFG with one execution unit (architecture) bound to each task. The initial population can be generated randomly or partially seeded using known solutions. The Genetic Algorithm Engine manipulates the initial population by creating new solutions using several recombination operators in the form of crossover and mutation. New solutions created are evaluated, and the fittest solutions replace the least fit. These iterations continue either for a specified number of generations or until no further improvement in solution quality is achieved. The fourth component is the Fitness Function, which evaluates each solution based on a specific criterion. The fitness function for our model is a reconfigurable scheduler that schedules each task graph and returns the time and overall power consumed by the DFG to the GA engine.
6.1.1 Initial population
The initial population represents the solution space that the GA will use to search for the best solution. The initial population needs to be as diverse as possible. In our framework, we evaluated the system with different population sizes, varying from a population equal to the number of nodes in the input DFG to 5 times the number of nodes.
Chromosome representation: Every individual (solution) in the population is represented using a string known as a chromosome. Each chromosome consists of genes. In our formulation, we use one gene for each of the N tasks in the task graph; each gene represents the choice of a particular implementation variant (as an integer value) for the task. A task graph is mapped to a chromosome, where each gene represents an operation within the DFG using a specific execution unit, as shown in Figure 6.2.
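This encoding can be sketched in a few lines of Python. The architecture library contents below (task names, variant tuples, and their figures) are invented purely for illustration and are not taken from the thesis' actual library:

```python
import random

# Hypothetical architecture library: task type -> list of implementation
# variants as (name, exec_time, area, power). All values are invented.
ARCH_LIBRARY = {
    "mult": [("serial", 8, 1, 0.2), ("semi_parallel", 4, 2, 0.5),
             ("parallel", 1, 6, 1.1)],
    "add":  [("ripple", 2, 1, 0.1), ("carry_lookahead", 1, 2, 0.3)],
}

def random_chromosome(task_types):
    """One integer gene per task: an index into that task's variant list."""
    return [random.randrange(len(ARCH_LIBRARY[t])) for t in task_types]

def initial_population(task_types, size):
    """Randomly generated, diverse initial population of bindings."""
    return [random_chromosome(task_types) for _ in range(size)]

# A toy DFG with four tasks; the population is sized at 5x the node count,
# the upper end of the range evaluated in the text.
tasks = ["mult", "add", "mult", "add"]
population = initial_population(tasks, size=5 * len(tasks))
```

Seeding part of the population with known good bindings, as mentioned above, would simply replace some of the random chromosomes.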
6.1.2 Genetic Operators
Figure 6.2: Task Graph to Chromosome Mapping (Binding/Allocation)

GAs begin execution based on an initial population of randomly chosen solutions, which are successively refined through several generations using the crossover and mutation operators. The following modules represent the basic operators used:
• The Selection module selects a set of individuals in the current population, called parents, that contribute their genes to their children. The selection module usually selects individuals that have better fitness values than their peers. Several selection schemes, including tournament and roulette-wheel selection, can be used.

• The Crossover module takes two individuals from the population (parents), chosen by the selection module, and produces two new individuals (children) by splitting the parent chromosomes at a random position and then recombining them. The probability of applying crossover to two chromosomes is called the crossover rate.

• The Mutation module randomly alters genes within a chromosome with a small probability and is usually applied following crossover. Crossover rapidly explores the search space, while mutation provides a modest amount of random search. Mutation helps to diversify the population of solutions and avoid premature convergence.

• The Replacement module replaces the current population (parents) with the newly formed population (children). A generational GA creates new offspring from the members of an old population using the genetic operators introduced earlier and places these individuals in a new population, which becomes the old population once the whole new population is created. A steady-state GA, on the other hand, differs from the generational model in that typically a single new member is inserted into the population at any one time. We used a replacement policy where the new population is a combination of the best individuals of both the parent and child populations.

• The Fitness function is an objective function used by the genetic algorithm engine to summarize, as a single figure of merit, the fitness of a given chromosome. The fitness function is unique to each problem and usually returns a numeric value to the GA engine. In our proposed framework, the fitness function is based on the outcome of a scheduler, which takes each chromosome, uses it to construct a schedule, and then returns a value representing time, power, or a weighted average of both.
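The operators above can be sketched for integer-gene chromosomes as follows. This is a minimal illustration under our own assumptions (tournament size, mutation rate, and crossover rate are placeholder values), not the thesis implementation:

```python
import random

def tournament_select(pop, fitness, k=2):
    """Tournament selection: fittest of k random individuals (lower is better)."""
    return min(random.sample(pop, k), key=fitness)

def crossover(p1, p2):
    """Single-point crossover: split both parents at a random position."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(chrom, n_variants, rate=0.05):
    """With small probability, re-bind a gene to a random variant."""
    return [random.randrange(n_variants[i]) if random.random() < rate else g
            for i, g in enumerate(chrom)]

def next_generation(pop, fitness, n_variants, cx_rate=0.9):
    """Produce children, then keep the best of parents and children combined,
    matching the elitist replacement policy described in the text."""
    children = []
    while len(children) < len(pop):
        a = tournament_select(pop, fitness)
        b = tournament_select(pop, fitness)
        if random.random() < cx_rate:
            a, b = crossover(a, b)
        children += [mutate(a, n_variants), mutate(b, n_variants)]
    return sorted(pop + children, key=fitness)[:len(pop)]
```

Because the replacement step selects the best individuals from parents and children combined, the best fitness in the population can never worsen from one generation to the next.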
6.1.3 Fitness Evaluation
The quality of the scheduling algorithm employed can have a significant impact on the final performance of applications running on a reconfigurable computing platform. The overall time it takes to run a set of tasks on the same platform can vary with different schedulers. Power consumption, reconfiguration time, and other factors can also be used as metrics to evaluate schedulers. Any of the three online schedulers described in Chapter 5 can be used to evaluate each chromosome as part of the fitness function in the GA framework. The results of this chapter are based on the RCSched-III scheduler.
6.2 Island Based GA
The proposed Island Based GA consists of multiple GA modules, as shown in Figure 6.3. Each GA module produces a multi-objective Pareto front based on power and performance (execution time) for a unique platform. An aggregated Pareto front is then constructed from all the GAs to give the user a framework that optimizes not only power and performance, but also the most suitable platform (floorplan). A platform in this case is defined by the number, size, and layout of the PRRs.

The proposed framework simultaneously sends the same DFG to every GA island, as shown in Figure 6.3. The islands then run in parallel to optimize performance and power consumption for each platform. Running multiple GAs in parallel helps reduce processing time, as shown in Table 6.2. The chromosomes in each island, which represent an allocation of execution units (implementation variants) and a binding of units to tasks, are used to build a feasible, near-optimal schedule. The schedule, allocation, and binding are then evaluated by the simulator to determine the associated power consumption and total execution time. These values are fed back to the GA as a measure of fitness of the different solutions in the population. The GA seeks to minimize a weighted sum of execution time and power consumption (Equation 6.1). The value of W determines the weight of each objective: a value of one indicates that the GA should optimize for performance only, while a value of 0.5 gives equal emphasis to performance and power (assuming the values of execution time and power are normalized).

Fitness Value = Power × (1 − W) + Exec. Time × W        (6.1)
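Equation 6.1 transcribes directly into code; the example values below assume power and execution time have already been normalized:

```python
def fitness_value(power, exec_time, w):
    """Equation 6.1: weighted sum of normalized power and execution time.
    w = 1.0 optimizes performance only; w = 0.0 optimizes power only."""
    return power * (1.0 - w) + exec_time * w

# With normalized power 0.8 and execution time 0.4:
perf_only  = fitness_value(0.8, 0.4, w=1.0)   # performance only -> 0.4
power_only = fitness_value(0.8, 0.4, w=0.0)   # power only -> 0.8
balanced   = fitness_value(0.8, 0.4, w=0.5)   # equal emphasis -> 0.6
```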
Figure 6.3: Proposed Island Based Framework
Every island produces a separate Pareto front for a different platform in parallel. An aggregation module then combines all the Pareto fronts into an aggregated Pareto front that yields not only a near-optimal performance and power trade-off, but also the hardware platform (floorplan) associated with each solution. Each island generates its Pareto front by varying the value of W in Equation 6.1 from 0.0 to 1.0 in steps of 0.1.
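The W sweep and the aggregation step can be sketched as follows, treating each solution as a (power, execution time) pair where lower is better in both dimensions. The helper names are ours, not the framework's:

```python
WEIGHTS = [w / 10 for w in range(11)]  # W = 0.0, 0.1, ..., 1.0

def pareto_front(points):
    """Keep the non-dominated (power, exec_time) points; lower is better."""
    return [p for p in points
            if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)]

def aggregated_front(island_fronts):
    """Merge the per-island fronts, then re-filter for global non-dominance."""
    return pareto_front([p for front in island_fronts for p in front])
```

Each island would run its GA once per value in WEIGHTS to trace out its own front; the aggregation then keeps only the globally non-dominated points across all four platforms.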
6.2.1 Power Measurements
Real-time power was initially measured using the Xilinx VC707 board, as explained in Appendix C. We used the Texas Instruments (TI) power controllers along with a TI dongle and a monitoring station to perform the power measurement. Other measurements were estimated using the Xilinx XPower software, and others were based on data from the literature.

The essential figures of merit of a digital circuit or system are speed and power consumption. Although Equation 6.1 uses power, switching energy can be used instead. The results will differ but the methodology remains the same. In the case of energy, the Architecture Library can be updated with energy values instead of power, as seen in Equation 6.4. The fitness function will then be calculated by the simulator accordingly.
P_dyn = C_L · V_dd² · α · f_C        (6.2)

E_s = switching energy = P_dyn / (α · f_C) = C_L · V_dd²        (6.3)
Energy = Power × Time        (6.4)

where α is the switching activity, C_L the load capacitance, V_dd the supply voltage, and f_C the clock frequency.
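A quick numeric check of Equations 6.2-6.4 with illustrative circuit values (these numbers are invented for the example, not measurements from this work):

```python
# Illustrative circuit values only -- not measurements from the thesis.
C_L   = 10e-12   # load capacitance (F)
V_dd  = 1.0      # supply voltage (V)
alpha = 0.15     # switching activity
f_C   = 100e6    # clock frequency (Hz)

P_dyn = C_L * V_dd**2 * alpha * f_C   # Equation 6.2: dynamic power (W)
E_s   = P_dyn / (alpha * f_C)         # Equation 6.3: switching energy (J)
energy = P_dyn * 1e-3                 # Equation 6.4 for a 1 ms run (J)
```

Note that E_s reduces to C_L · V_dd², independent of activity and frequency, which is why a library annotated with energy rather than power changes the numbers but not the methodology.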
To minimize power, one might consider replacing an idle partially reconfigurable module with a blank module. Although this has not been considered in this work, we believe doing so would have an adverse effect on power consumption in most cases: the overhead of the increased number of reconfigurations (and hence reconfiguration time and power), due to the lower task variety, would outweigh the slight advantage of replacing the idle module with a blank one. An alternative strategy is to gate the clock of the idle task without replacing it; this can be accomplished by the ROS or by the task itself.
6.3 Results
In this section, we first describe the experimental method used and then present the results based on the proposed island-based GA framework.
6.3.1 Experimental Method
The proposed framework was tested on both synthetic and real-world benchmarks, as shown in Table 6.1. In total, 5 synthetic benchmarks (S1-S5) were generated at random, subject only to connectivity constraints. One of the main advantages of using randomly generated problem instances is that it facilitates the comparison of task graphs with potentially different behaviors. Different models arise due to differences in the number and types of tasks, the number of vertices, the number of dependencies, the length of the critical path, and the number of opportunities to reuse modules. The remaining 7 benchmarks (DFG2-DFG19) are real-world benchmarks selected from the well-known MediaBench DSP suite [126]. As the GA is stochastic by nature, it was run 10 times for each benchmark, and solution quality was evaluated in terms of the average results produced.
Table 6.1: Benchmark Specifications

| ID | Name | # nodes | # edges | Avg. edges/node | Critical path | Parallelism (nodes/crit. path) |
|----|------|---------|---------|-----------------|---------------|--------------------------------|
| S1 | Synthesized (4 task types) | 50 | 50 | 1 | 6 | 8.3 |
| S2 | Synthesized (4 task types) | 150 | 200 | 1.33 | 15 | 10 |
| S3 | Synthesized (8 task types) | 25 | 40 | 1.6 | 7 | 3.5 |
| S4 | Synthesized (8 task types) | 100 | 120 | 1.2 | 7 | 14.2 |
| S5 | Synthesized (5 task types) | 13 | 10 | 0.77 | 3 | 4.3 |
| DFG2 | JPEG - Smooth Downsample | 51 | 52 | 1.02 | 16 | 3.2 |
| DFG6 | MPEG - Motion Vectors | 32 | 29 | 0.91 | 6 | 5.33 |
| DFG7 | EPIC - Collapsepyr | 56 | 73 | 1.3 | 7 | 8 |
| DFG12 | MESA - Matrix Multiplication | 109 | 116 | 1.06 | 9 | 12.11 |
| DFG14 | HAL | 11 | 8 | 0.72 | 4 | 2.75 |
| DFG16 | Finite Input Response Filter 2 | 40 | 39 | 0.975 | 11 | 3.64 |
| DFG19 | Cosine 1 | 66 | 76 | 1.15 | 8 | 8.25 |
The run-time performance of the IBGA for serial and parallel implementations is shown in Table 6.2. The IBGA was tested on a Red Hat Linux workstation with a six-core Intel Xeon processor running at 3.2 GHz and equipped with 16 GB of RAM.
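The parallel implementation can be sketched as one worker process per island using Python's multiprocessing; the body of run_island below is a deterministic toy stand-in for a full GA run against the simulator, not the thesis implementation:

```python
from multiprocessing import Pool

def run_island(platform):
    """Stand-in for one GA island optimizing a single floorplan.
    A real island would run the full GA loop against the simulator;
    here a deterministic toy search is used instead."""
    best = min((i * 37 + len(platform)) % 1000 for i in range(1000))
    return platform, best

if __name__ == "__main__":
    # One worker per island, mirroring the four platforms P1-P4.
    with Pool(processes=4) as pool:
        results = dict(pool.map(run_island, ["P1", "P2", "P3", "P4"]))
```

Since the islands never communicate (no migration), they parallelize trivially, which is consistent with the near-linear speedups reported in Table 6.2.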
6.3.2 Convergence of Island based GA
The benchmarks introduced in Table 6.1 were tested on the proposed Island Based GA framework, which consists of four GA islands. Each island targeted a different FPGA platform (floorplan) that was generated manually. Each platform is distinct in terms of the number, size, and layout of the PRRs; an example is shown in Figure 6.4. Platforms 1 and 2 have a uniform size distribution, while platforms 3 and 4 have a non-uniform size distribution.

Table 6.2: IBGA run-time for serial and parallel implementations

| DFG | Serial (min:sec) | Parallel (min:sec) | Speedup (×) |
|-----|------------------|--------------------|-------------|
| S1 | 00:52 | 00:17 | 3.1 |
| S2 | 04:53 | 01:41 | 2.9 |
| S3 | 00:23 | 00:10 | 2.3 |
| S4 | 03:06 | 01:16 | 2.5 |
| S5 | 00:07 | 00:02 | 3.0 |
| DFG2 | 01:12 | 00:28 | 2.6 |
| DFG6 | 00:12 | 00:08 | 1.5 |
| DFG7 | 00:35 | 00:13 | 2.7 |
| DFG12 | 02:45 | 00:52 | 3.2 |
| DFG14 | 00:05 | 00:02 | 2.5 |
| DFG16 | 00:26 | 00:09 | 2.9 |
| DFG19 | 01:13 | 00:23 | 3.2 |

Although the four islands optimize the same task graph, they converge to different solutions, since they target different floorplans on the same FPGA architecture. Figures 6.5 - 6.8 and Figures 6.9 - 6.15 show the convergence of the fitness values for the synthetic and MediaBench benchmarks, respectively, averaged over 10 runs.
Figure 6.4: Platforms with different floorplans
Figure 6.5: Convergence of synthesized benchmark S1
For example, the DFG for benchmark S3, shown in Figure 6.16, has a different architecture binding for each platform, as shown in Figure 6.17. Each number represents an index into the selected execution unit (architecture), and the position of each integer represents the node number.
6.3.3 The Pareto Front of Island GA Framework
The proposed multi-objective Island Based GA optimizes for execution time and/or power consumption. Since the objective functions are conflicting, a set of Pareto-optimal solutions was generated for every benchmark, as discussed in Section 6.2.

Figures 6.18 - 6.21 and Figures 6.22 - 6.28 show the Pareto fronts for the synthetic and MediaBench benchmarks, respectively, based on an average of 10 runs. The Pareto
Figure 6.6: Convergence of synthesized benchmark S2
Figure 6.7: Convergence of synthesized benchmark S3
Figure 6.8: Convergence of synthesized benchmark S4
Figure 6.9: Convergence of MediaBench benchmark (DFG2)
Figure 6.10: Convergence of MediaBench benchmark (DFG6)
Figure 6.11: Convergence of MediaBench benchmark (DFG7)
Figure 6.12: Convergence of MediaBench benchmark (DFG12)
Figure 6.13: Convergence of MediaBench benchmark (DFG14)
Figure 6.14: Convergence of MediaBench benchmark (DFG16)
Figure 6.15: Convergence of MediaBench benchmark (DFG19)
Figure 6.16: The DFG for benchmark S3
P1: 2533215124431321121115213
P2: 2533215124431321121115213
P3: 2322231111431211111313212
P4: 1322221114431111111111112
Figure 6.17: Architecture binding for each platform (P1-P4), for the S3 Benchmark.
front obtained by each island (P1 to P4) is displayed individually along with the aggregated Pareto front for each benchmark. It is clear from Figures 6.18 - 6.21 that each GA island tends to produce a different and unique Pareto front that optimizes performance and power for the targeted platform. On the other hand, the aggregated solution based on the four GA islands combines the best solutions obtained and thus improves upon the individual GA Pareto fronts. Hence, the system or designer can choose the most appropriate point based on the desired objective function.
Table 6.3: The Average of 10 Runs for the Best Fitness Values

| Benchmark | No opt. | 1-GA | IBGA |
|-----------|---------|------|------|
| S1 | 4435.7 | 1847 | 978.2 |
| S2 | 12514.7 | 4288.3 | 2244.4 |
| S3 | 2649.95 | 1973 | 660.9 |
| S4 | 8474.23 | 3856 | 1792.4 |
| S5 | 1392 | 877 | 286 |
| DFG2 | 5452.67 | 2554.1 | 1036.1 |
| DFG6 | 2942.99 | 1217.9 | 406.1 |
| DFG7 | 4558.4 | 1303.2 | 582.4 |
| DFG12 | 9249.97 | 2595.9 | 1218.1 |
| DFG14 | 1134.13 | 735 | 341.5 |
| DFG16 | 3614.28 | 1301.1 | 505.2 |
| DFG19 | 5805.24 | 2368.4 | 1528.1 |
Table 6.3 compares the best fitness values of randomly bound architectures with no optimization (No opt.) to those obtained using a single GA (1-GA) and the proposed aggregated Pareto-optimal approach based on four islands (IBGA). Each value in Table 6.3 is the average of 10 different runs. The single GA implementation achieves on average a 55.9% improvement over the baseline non-optimized approach, while the Island Based GA achieves on average an 80.7% improvement; the latter achieves on average a 55.2% improvement over the single GA approach. Table 6.4 shows the fitness values for different weight values (introduced in Equation 6.1) for randomly bound architectures (No opt.), a single GA (1-GA), and the Island Based GA (IBGA). On average, the single GA implementation achieves a 52.7% improvement over the baseline non-optimized approach, while the Island Based GA achieves a 75% improvement.
Table 6.4: Fitness values for different weights (W); W=1 corresponds to performance only

| Bench. | No opt. (W=0.5) | 1-GA (W=0.5) | IBGA (W=0.5) | No opt. (W=0.7) | 1-GA (W=0.7) | IBGA (W=0.7) | No opt. (W=0.875) | 1-GA (W=0.875) | IBGA (W=0.875) | No opt. (W=1) | 1-GA (W=1) | IBGA (W=1) |
|--------|------|------|------|------|------|------|------|------|------|------|------|------|
| S1 | 3327 | 2280 | 1469 | 3917 | 2134 | 1266 | 4435.7 | 1847 | 978.2 | 4661 | 1601 | 796 |
| S2 | 9483 | 5731 | 4351 | 11034 | 4872 | 3331 | 12514.7 | 4288.3 | 2244.4 | 13457 | 3724 | 1604 |
| S3 | 1894 | 1517 | 825 | 2303 | 1760 | 734 | 2649.95 | 1973 | 660.9 | 2914 | 2126 | 588 |
| S4 | 6286 | 4000 | 2747 | 7470 | 3890 | 2277 | 8474.23 | 3856 | 1792.4 | 9226 | 3715 | 1366 |
| S5 | 954 | 684 | 335 | 1194 | 790 | 300 | 1392 | 877 | 286 | 1550 | 940 | 290 |
| DFG2 | 3749 | 2237 | 1324 | 4673 | 2450 | 1189 | 5452.67 | 2554.1 | 1036.1 | 6025 | 2582 | 903 |
| DFG6 | 1851 | 906 | 387 | 2434 | 1073 | 409 | 2942.99 | 1217.9 | 406.1 | 3306 | 1320 | 396 |
| DFG7 | 2880 | 1148 | 668 | 3792 | 1231 | 640 | 4558.4 | 1303.2 | 582.4 | 5147 | 1324 | 537 |
| DFG12 | 5891 | 2400 | 1432 | 7722 | 2609 | 1342 | 9249.97 | 2595.9 | 1218.1 | 10415 | 2886 | 1044 |
| DFG14 | 775 | 557 | 338 | 963 | 656 | 345 | 1134.13 | 735 | 341.5 | 1249 | 785 | 340 |
| DFG16 | 2274 | 1003 | 446 | 2976 | 1166 | 505 | 3614.28 | 1301.1 | 505.2 | 4095 | 1392 | 507 |
| DFG19 | 4588 | 3152 | 2308 | 5232 | 2743 | 1907 | 5805.24 | 2368.4 | 1528.1 | 6146 | 2186 | 1197 |
Table 6.5 compares the IBGA with an exhaustive procedure in terms of quality of solution and CPU time. Only two small benchmarks are used, due to the combinatorial explosion of the CPU time when using the exhaustive-search-based procedure.

Table 6.5: Exhaustive vs. IBGA (average of 10 runs)

| Benchmark | IBGA Fitness | IBGA Time (sec) | Exhaustive Fitness | Exhaustive Time (min) |
|-----------|--------------|-----------------|--------------------|-----------------------|
| DFG14 (11 nodes) | 341 | 1.8 | 313 | 5.8 |
| S5 (13 nodes) | 286 | 2.4 | 283 | 257 |

It is clear from Table 6.5 that the IBGA technique produces near-optimal solutions for these small benchmarks: DFG14 (11 nodes) is 9% inferior to the optimal solution, while S5 (13 nodes) is 1% away from optimality.
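For reference, the exhaustive procedure amounts to enumerating every possible binding. The sketch below is an illustrative stand-in with the fitness callback left abstract; it makes the combinatorial explosion concrete, since the search space is the product of the variant counts per task:

```python
from itertools import product

def exhaustive_best(n_variants_per_task, fitness):
    """Enumerate every possible binding and return the fittest one.
    The space is the product of the per-task variant counts, so this is
    feasible only for tiny DFGs (e.g. roughly 1.6 million bindings for a
    13-node graph with 3 variants per task)."""
    best = min(product(*(range(n) for n in n_variants_per_task)), key=fitness)
    return list(best), fitness(best)
```

With a toy fitness such as the sum of gene indices, `exhaustive_best([2, 3, 2], sum)` returns the all-zero binding, illustrating why each added node multiplies the enumeration cost while the IBGA's cost grows far more slowly.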
Figure 6.18: Aggregated Pareto Front (Time vs Power) for synthesized benchmark (S1)
Figure 6.19: Aggregated Pareto Front (Time vs Power) for synthesized benchmark (S2)
Figure 6.20: Aggregated Pareto Front (Time vs Power) for synthesized benchmark (S3)
Figure 6.21: Aggregated Pareto Front (Time vs Power) for synthesized benchmark (S4)
Figure 6.22: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG2)
Figure 6.23: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG6)
Figure 6.24: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG7)
Figure 6.25: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG12)
Figure 6.26: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG14)
[Plot: Pareto fronts for GA islands P1-P4 and for all platforms combined, Benchmark DFG16; axes: execution time (1/1000 cycles) vs. power (mW).]
Figure 6.27: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark(DFG16)
[Plot: Pareto fronts for GA islands P1-P4 and for all platforms combined, Benchmark DFG19; axes: execution time (1/1000 cycles) vs. power (mW).]
Figure 6.28: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark(DFG19)
6.4 Summary
In this chapter, a parallel island-based GA approach was designed and implemented to
efficiently map execution units to task graphs for dynamic partially reconfigurable systems.
Unlike previous works, our approach is multi-objective and seeks not only to optimize
speed and power, but also to select the best reconfigurable floorplan. The algorithm
was tested using both synthetic and real-world benchmarks. Experimental results clearly
indicate the effectiveness of this approach: the proposed island-based GA framework
achieved on average a 55.2% improvement over a single GA implementation and an 80.7%
improvement over a baseline random allocation and binding approach.
Chapter 7
Resource Prediction
The ability to determine and predict the hardware resources required by an application in
a dynamic run-time reconfigurable environment can significantly improve the system's
overall performance in several ways. However, optimizing the necessary hardware re-
sources for a given task graph is an NP-hard problem [127]. Predicting resources for real-
time problems makes it even harder to solve. Therefore, in this chapter we propose
a machine learning based technique to estimate and predict the needed resources. Given
a new application, our proposed framework employs a database, a simulator and a set
of machine learning algorithms to construct a model capable of intelligently estimat-
ing the resources required to optimize various objectives such as run-time, area, power
consumption and cost. The proposed approach uses previous knowledge and features
extracted from benchmarks to create a supervised machine-learning model that is capa-
ble of predicting the necessary resources. Figure 7.1 illustrates the proposed
prediction framework and flow used in this work. The framework consists of three main
phases: data preparation, training and testing (classification/prediction).
CHAPTER 7. RESOURCE PREDICTION 159
Figure 7.1: Overall Methodology and Flow
The data preparation phase uses both synthetic and real-life benchmarks, a simulator and a database, as
shown in Figure 7.1-A. In this phase, benchmarks are generated and evaluated in terms
of power consumed and execution time using a simulator, as will be explained in
Section 7.1.
All necessary features that will be used to train the machine learning algorithms are
extracted both from the generated DFGs and from metrics provided by the simula-
tor. In the training phase (shown in Figure 7.1-B) the framework utilizes the features
extracted from the previous stage to train and create a model that learns from
historic information. This model extracts useful hidden knowledge (i.e., generalizes)
from the data and estimates/predicts resources of the reconfigurable computing system.
The third and final testing/prediction phase utilizes the developed model, given new un-
seen task graphs, to predict the necessary resources, as seen in Figure 7.1-C. Each of these
phases is explained in more detail in the following sections.
7.1 Data Preparation Stage
In this stage, a tabular database suitable for the data-mining engine is constructed. Each
record in the database corresponds to a Data-Flow Graph (DFG) along with its features
or attributes that are necessary for the training stage. The DFGs are either synthesized or
taken from real-life applications. The synthesized DFGs are evaluated under different
hardware set-ups. The evaluation can be performed either on a dedicated hardware
platform or with a simulator; the latter allows for faster evaluation and greater flexibility.
Each DFG is evaluated under different hardware scenarios, by varying the number of processing
elements (PRRs, GPPs), the size/shape of the PRRs, and the schedulers. The system performance
metrics (power consumption, execution time) for each case are then recorded accordingly.
Several features of each DFG, such as the number of nodes, dependencies, fan-out, critical
path, slack, sharable resources and many more, are extracted (see Table 7.2). These
features are treated as individual measurable attributes that are effectively exploited in
a supervised learning task such as classification.
The necessary modules to construct the database are:
1. Task Graph Generator: The first step in the proposed methodology involves us-
ing a task-graph generator (described in Section 4.2) to automatically synthesize a
large number of input tasks. These input tasks are later used to train and test the
proposed predictors. Each input task is represented as a DFG, where the nodes
represent particular operations to be scheduled and assigned either to
hardware or software, and the edges represent the dependen-
cies and flow between operations. Each DFG has randomly generated parameters,
as shown in Table 7.1. The probability distribution of the DFG parameters is currently
uniform, but any other distribution, including a distribu-
tion based on real-world applications, can be employed. A total of 258 DFGs are
generated at this step in the flow.
Parameter        Range      Note
Nodes            5 - 1000   # of nodes in a DFG
Edges            0 - 1863   # of edges (dependencies)
Edges per Node   0 - 2      Average # of edges per node
Subgraphs        1 - 968    # of subgraphs in the task
Task types       3 - 16     # of task types
HW tasks area    –          Avg/Min/Max/...
Table 7.1: Data Flow Graphs: Statistics
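The generator itself is not listed in the thesis; as a rough illustration, a uniform draw within the Table 7.1 ranges, with parents always drawn from earlier nodes so the graph stays acyclic, might look like the following (all names such as `gen_dfg` are hypothetical):

```python
import random

def gen_dfg(rng: random.Random) -> dict:
    """Draw one synthetic DFG whose parameters fall uniformly within
    the Table 7.1 ranges (illustrative sketch, not the thesis tool)."""
    n_nodes = rng.randint(5, 1000)
    n_types = rng.randint(3, 16)
    nodes = [{"id": i, "type": rng.randrange(n_types)} for i in range(n_nodes)]
    edges = []
    for child in range(1, n_nodes):
        # 0-2 incoming edges per node; parents are always earlier
        # nodes, which guarantees the graph is acyclic
        for parent in rng.sample(range(child), k=min(child, rng.randint(0, 2))):
            edges.append((parent, child))
    return {"nodes": nodes, "edges": edges, "task_types": n_types}

dfg = gen_dfg(random.Random(42))
```

Any other distribution could be substituted for the uniform draws without changing the rest of the flow.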
2. Simulator (Evaluation): The performance of all DFGs under different hardware
configurations is evaluated before the features and attributes
of each DFG are stored in the database. Accordingly, the reconfigurable architecture
simulator discussed in Section 4.3 was developed to simulate the hardware
platform while running the developed reconfigurable operating system.
The target FPGA fabric is first partitioned uniformly and then partitioned with
a 50% increase in size, while varying the number of PRRs. This results in different
layouts for the same number of PRRs.
In order to have a diverse database, we evaluated every DFG with various hard-
ware settings and then recorded the evaluation (performance) metrics. The two
most important evaluation metrics used were the total execution time and the total power
consumption for a given DFG. Each DFG was evaluated with different PRR num-
bers and PRR sizes, various numbers of GPPs, and different schedulers. This resulted in a
database containing on average several hundred records per DFG.
3. Schedulers: The three on-line scheduling algorithms described earlier in Chap-
ter 5 were used. The algorithms have different task-reuse capabilities to mini-
mize reconfiguration overhead. Tasks are capable of migrating between software
(GPP) and hardware (PRRs) to minimize total execution time and mitigate resource
scarcity issues. The schedulers record system metrics, learn, and accommodate fu-
ture tasks. The schedulers were first developed and implemented on a hardware
platform, and then incorporated into the simulator shown in Figure 7.1.
4. Feature Extraction: Since the DFG format cannot be used directly by the classi-
fication tool Waikato Environment for Knowledge Analysis (WEKA) [128], a soft-
ware application was designed and implemented to extract numerous numeric fea-
tures from a given DFG and store them in a tabular format. One of the challenges
was to extract as much numeric information as possible without adding irrelevant
information (noise). Features and attributes were represented either as a single
numeric entity or in a range format. Below are some of the key extracted features:
• DFG connectivity: number of edges, average edges per node, fan-in/fan-out,
maximum number of parent nodes and maximum number of dependent nodes.
• DFG size: number of sub-graphs, maximum number of nodes in a sub-graph,
length of the critical path, average length of the critical paths of the sub-graphs.
• Task types: the number of task types, the number of tasks of each type, the
number of hardware/software/hybrid tasks.
• Schedule flexibility: node mobilities and reuse possibilities.
• Task type metrics: hardware task area, latency, reconfiguration time, and
power consumption.
A complete list of the features is presented in Table 7.2.
These features are extracted from the training DFGs (a subset of the synthesized
DFGs) and then fed to the untrained machine learning module, as shown
in Figure 7.2-a. The module is trained against a known output
(label) for each of the input DFGs. After the training phase, the same features
are extracted from a new, unseen DFG that is unknown to the trained
module, which is then ready to predict the correct
label (class) accordingly, as shown in Figure 7.2-b.
Figure 7.2: Supervised learning: Training and testing
7.2 Training Stage
The second stage of the framework is related to training and model development, as
shown in Figure 7.1-B. In a prediction problem, a model is usually given a data-set of
known data on which training is performed (training data-set), and a data-set of unknown
data (or first time seen data) against which the model is tested (testing data-set). In the
training phase, a data-set is used to train the machine-learning algorithm so that a model
Feature Category           Range      Note
Nodes                      5 - 1000   # of nodes in a DFG
Root nodes                 0 - 254    # of root nodes (without a parent)
Internal nodes             0 - 267    # of nodes that are neither root nor leaf
Leaf nodes                 0 - 502    # of leaf nodes
Isolated nodes             0 - 940    # of nodes with no parent or children
Edges                      0 - 1863   # of edges (dependencies)
Edges per node             0 - 2      Average # of edges per node
Max parents                0 - 2      Maximum # of parents for a child node
Max children               0 - 19     Maximum # of children for a node
Sharable resources         0 - 19     Sum of the number of sharable resources
Subgraphs                  1 - 968    # of subgraphs in the task
Critical paths             –          Avg/Min of critical paths
Critical path              –          Count of the longest path
Task types                 3 - 16     # of task types
Task type N                –          Frequency of each task type
HW/SW task types           –          # of HW or SW tasks
Migratable tasks           –          # of migratable tasks
Slack                      0 - 563    Average slack
HW/SW latency              –          Avg/Min/Max values
HW/SW execution x power    –          Avg/Min/Max values
HW config time and power   –          Avg/Min/Max values
HW tasks area              –          Avg/Min/Max values
Table 7.2: Extracted Features from DFGs.
is developed, which can then be used in the actual prediction and classification phase. The
data entries generated in the previous phase consist of hundreds of records for each
DFG that differ in the evaluation metrics such as power and speed. These records were
obtained by evaluating the same DFG using different hardware configurations. Since the
DFG features are the only metrics that will be used to predict the class (resource type),
having multiple identical records per DFG, yet with different classes, would have a negative
effect on the learning capability of the classifiers. In this step, the fittest record is selected
for each DFG (to eliminate duplication). The term fittest may refer to speed, power,
cost, or a combination of them (depending on the designer's goal). For example, if the goal is to
train the classifier to predict a hardware configuration for the lowest execution time, the
fitness function keeps only the record (for each DFG) that has the lowest execution
time. Accordingly, several smaller databases are created, one for each objective (power,
speed, cost, execution time vs. cost). Each database contains as many records as there
are DFGs.
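The de-duplication step above amounts to a per-DFG minimum over the chosen metric. A minimal sketch (illustrative field names; the actual tooling is WEKA-based):

```python
def keep_fittest(records, key="exec_time"):
    """For each DFG, keep only the record with the lowest value of `key`
    (the 'fittest' record), eliminating duplicate feature rows that
    would otherwise carry conflicting class labels."""
    best = {}
    for rec in records:
        dfg = rec["dfg"]
        if dfg not in best or rec[key] < best[dfg][key]:
            best[dfg] = rec
    return list(best.values())

records = [
    {"dfg": "dfg1", "exec_time": 120, "config": "2xPRR"},
    {"dfg": "dfg1", "exec_time": 90,  "config": "4xPRR"},
    {"dfg": "dfg2", "exec_time": 50,  "config": "1xGPP+2xPRR"},
]
training_rows = keep_fittest(records)  # one surviving row per DFG
```

Swapping `key` for power or a weighted combination yields the per-objective databases described above.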
Cross validation is an important task in machine learning. It is mainly used to esti-
mate how precisely and accurately a classification model will perform in practice. One of
the main goals of cross validation is to avoid problems such as over-fitting [129] (mem-
orizing) and to give a sense of how well the model will generalize to independent data
sets. Several cross-validation techniques are available, such as (i) the holdout method,
(ii) K-fold cross validation, and (iii) leave-one-out cross validation. In the holdout method,
the data set is separated into two different sets, called the training set and the testing set.
The main advantage of the holdout method is that it takes a short time to compute.
Leave-one-out cross validation has a good validation error but is very
expensive. In our current work, we resorted to K-fold cross validation as it represents
an improvement over the holdout method and is not as expensive as leave-one-out.
In K-fold cross validation the data set is divided into k subsets,
and the holdout method is repeated k times. In each experiment, one of the k subsets is
used as the testing set while the remaining k-1 subsets form the training set. The average
error across all k trials is then computed.
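The K-fold procedure described above can be sketched as follows (a generic illustration, not the WEKA implementation used in the thesis):

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index lists for K-fold cross validation:
    each of the k folds serves once as the testing set while the
    remaining k-1 folds form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(kfold_indices(n=10, k=5))
```

With k = 10, as used later in Section 7.4.3, each model is trained on nine folds and tested on the tenth, and the ten accuracies are averaged.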
7.3 Classification Stage
The result of the training phase is a model capable of classifying and predicting resources,
as shown in Figure 7.1-C. Several supervised machine learning algorithms are deployed
and contrasted by measuring their prediction accuracy. Each classifier is executed with
its default parameter values and then evaluated using the test sample for each DFG group.
Finally, the mean accuracy of the validations is calculated. The classifiers used in this work
range from simple Naive Bayes [129] to more accurate and complex machine learning al-
gorithms in the form of ANNs and SVMs [130]. The classification algorithms are trained
and developed using the training database in WEKA. A detailed evaluation of the classi-
fiers is discussed in the next section. Since ensemble-based systems provide favorable
results compared to single-expert machine learning systems under certain scenarios, it
was worth considering them further in this work. Researchers from various disciplines
resort to ensemble-based classifiers whenever they seek to improve predic-
tion performance. The goal of the ensemble-based method is to create a more accurate
predictive model by integrating multiple models. The ensemble methodology attempts
to weigh and combine several individual classifiers in order to obtain a unified classifier
that outperforms each individual classifier, much as human beings seek sev-
eral opinions before making an important decision. Accordingly, to further improve the
performance of the proposed framework, ensemble-based classifiers [131] are employed.
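As a minimal illustration of the combination principle, an unweighted majority vote over the labels produced by several individual classifiers might look like this (the thesis uses WEKA's ensemble schemes [131], which also support weighted combinations; the scheduler names below are just example labels):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class labels predicted by several individual
    classifiers for one record into a single ensemble decision."""
    votes = Counter(predictions)
    return votes.most_common(1)[0][0]

# three base classifiers vote on the best scheduler for one DFG
label = majority_vote(["RCSched-I", "RCSched-II", "RCSched-II"])
```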
An important feature of the proposed methodology is that the system is able to learn and
generalize and, therefore, can be expected to improve its accuracy over time. The goal
of the entire process is to extract useful hidden knowledge from the data. This knowl-
edge is the prediction and estimation of the necessary resources for an unknown, not
previously seen application.
7.4 Experiments and Results
The main objective of this work is to demonstrate that our proposed system is able to
generate accurate prediction models, suitable for use in reconfigurable operating systems,
based on known features of DFGs extracted from previous designs. To achieve this
goal, five different cases were developed and evaluated. The first two (I-II) are based
on predicting schedulers (described in Chapter 5) that can minimize power consumption,
migrate hardware/software tasks, and minimize total execution time. The schedulers
can be distinguished only by power consumption, since RCSched-III performed better
in terms of latency. When optimizing for power, neither the frequency nor the tasks are
altered. The remaining three cases (III-V) are based on predicting the number of soft
cores (GPPs) and partially reconfigurable regions (PRRs) to be used on the reconfigurable
platform to minimize combinations of total execution time and power. For the first four
cases (I-IV) the area is fixed and only the layout and the number of PRRs are changed, while
for Case-V the area is proportional to the number of PRRs. In Case-V we introduced
a penalty factor (cost) proportional to the number of PRRs. The cost parameter can be
set by the designer to give a trade-off between area and speed. The proposed cases are
summarized as follows:
Case-I Predict the type of scheduler that will lead to a minimum power consumption
while allowing one or more GPPs to be employed in the design.
Case-II Predict the most appropriate scheduler that will produce the minimum power
schedule while only allowing hardware tasks (PRRs) to be used in the design.
Case-III Predict the number and type of the PRRs and GPPs required to minimize exe-
cution time.
Case-IV Predict the number and type of PRRs and GPPs required to achieve a balance
between power consumption and execution time. (Notice that this case involves
conflicting objectives.)
Case-V Predict the number of PRRs necessary to minimize both execution time and
the total area consumed by the design.
Table 7.3 summarizes the cases along with the classes to be predicted. Note
that the last case is based on three different layouts of PRRs.
Case  Class type  # of Classes  Objective                 Notes
I     Scheduler   2             Min power consumption     For GPPs and PRRs
II    Scheduler   2             Min power consumption     No GPPs (PRRs only)
III   GPPs/PRRs   4             Min execution time        -
IV    GPPs/PRRs   4             Min exec. time/power      -
V     PRR         3             Min exec. time and area   Figure 1.2-B
Table 7.3: Cases used to Evaluate the Framework along with Specifications
It is important to note that the results achieved and highlighted in this section are not
limited to the previous five cases. The framework and methodology proposed are general
and flexible enough to enable the development of prediction models for any combina-
tion of resources related to the application or the underlying run-time environment, including
the communication infrastructure, interfaces, etc.
7.4.1 Classification Algorithms: Implementation
Machine-learning algorithms proposed in the literature vary in complexity, accuracy
and performance. They range from simple to more complex models, depending also on the
parameters that need to be tuned, which usually influence the training procedure. Each
algorithm has its own advantages and disadvantages, and no single algorithm works best
in all cases. Consequently, in this work we first use the following five
individual classification algorithms, all of which differ in complexity and
accuracy, to implement and evaluate each of the five prediction cases (i.e., Cases
I-V). In Section 7.4.5 we resort to a more advanced classification technique (an ensem-
ble of classifiers) that combines several individual classification algorithms to further
improve prediction performance and accuracy.
• Naive Bayes (NB):Based on Bayes theorem, an NB classifier classifies a record
by first computing its probability of belonging to each class, and then assigning the
record to the class with the highest probability [132]. Advantages of this classi-
fier include its simplicity, computational efficiency, and classification performance.
However, for good performance, the classifier requires a large number of records
and assumes that the individual feature values extracted from the records are independent,
which is not always the case.
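A categorical Naive Bayes classifier in the spirit described above can be sketched as follows (no smoothing, hypothetical feature and class names; not WEKA's implementation):

```python
from collections import Counter, defaultdict

def nb_train(records):
    """Fit class priors and per-feature value counts from
    (feature_tuple, label) records (categorical NB, no smoothing)."""
    priors = Counter(label for _, label in records)
    likes = defaultdict(Counter)  # (class, feature_index) -> value counts
    for feats, label in records:
        for i, v in enumerate(feats):
            likes[(label, i)][v] += 1
    return priors, likes

def nb_predict(model, feats):
    """Assign the class with the highest posterior probability,
    assuming feature independence given the class."""
    priors, likes = model
    total = sum(priors.values())
    def score(label):
        p = priors[label] / total  # prior P(class)
        for i, v in enumerate(feats):
            p *= likes[(label, i)][v] / priors[label]  # P(feature|class)
        return p
    return max(priors, key=score)

model = nb_train([(("small", "few"), "gpp"), (("small", "few"), "gpp"),
                  (("large", "many"), "prr"), (("large", "few"), "prr")])
pred = nb_predict(model, ("small", "few"))
```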
• Multi-Layer Perceptron (MLP): Based on a model of biological activity in the
brain, MLP is a feed-forward neural network that has proved to be among the
most effective models for prediction in the context of data mining [133]. Although
it is able to capture highly complicated relationships between the predictors and
the response, its flexibility and performance rely heavily on having sufficient data
for training purposes.
• Support Vector Machine (SVM): Rooted in statistical learning theory, SVM is a
non-probabilistic supervised learning algorithm in which training records are mapped
into a higher-dimensional space so that the separate classes are divided by
an optimal separating hyperplane (i.e., a hyperplane with maximum separation
margin) in this space [102]. New input records are then mapped into this space
and predicted to belong to a class based upon which side of the hyperplane they
fall. Although directly applicable only to binary-class problems, SVM can still perform
effectively in high-dimensional, multi-class settings; however, this often requires the
overhead of solving a series of binary classification problems.
• K-Nearest Neighbor (K-NN): The K-NN method is simple and lacks the pa-
rameter tuning of many of the methods described above [134]. It is a lazy learning
algorithm, since it defers the decision to generalize beyond the training examples
until a new query is encountered. It works by first identifying the k nearest neighbors
to the new record in the training set; the new record is then assigned to the pre-
dominant class among these neighbors. Despite its simplicity, the K-NN method
performs remarkably well, especially when the target classes are characterized by
a number of related features.
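A minimal sketch of the K-NN decision rule just described (Euclidean distance over the feature vector is assumed here; other metrics are possible):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by the predominant class among its k nearest
    training records, using Euclidean distance (Python 3.8+ math.dist)."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# toy training set: (feature vector, class label)
train = [((0.0, 0.0), "small"), ((0.1, 0.2), "small"),
         ((5.0, 5.0), "large"), ((5.2, 4.8), "large"), ((4.9, 5.1), "large")]
cls = knn_predict(train, (4.5, 5.0), k=3)
```

Note that all the distance computations happen at query time, which is why the cost of this "lazy" classifier appears during testing rather than training.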
• J48: Tree methods like J48 are considered to be among the most robust methods
for performing classification and prediction. In general, these methods work by
recursively splitting the training data into subgroups based on the data's features.
In the case of J48, the decision to split is based on the concept of information
entropy and seeks to choose the features that maximize the normalized information
gain [102]. The recursive splitting stops, and leaf nodes are created, upon finding
subsets belonging to the same class.
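The entropy-based split criterion can be illustrated as follows (plain information gain; J48 additionally normalizes this by the split's intrinsic information to obtain the gain ratio):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, subsets):
    """Reduction in entropy obtained by splitting `labels` into
    `subsets` on some feature -- the quantity tree learners like
    J48 seek to maximize."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

labels = ["hw", "hw", "sw", "sw"]
gain = info_gain(labels, [["hw", "hw"], ["sw", "sw"]])  # a perfect split
```

A perfect split drives the child-subset entropies to zero, so the gain equals the parent entropy and the recursion creates leaf nodes.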
The classifiers listed above were used to classify and predict each of the five cases
presented earlier in Section 7.4. Thus a combination of 25 models was developed and
verified. Each of the 25 models was implemented in Java.
7.4.2 Experimental Setup and Evaluation
Each of the 25 models was first trained and tested using the 258 synthetic benchmarks
(DFGs) described in Section 7.1. Following that, all 25 models were tested using
20 real-world benchmarks from the MediaBench DSP benchmark suite [135]. All of the
models were evaluated based on their accuracy. The accuracy of a prediction model is
a measure of how well the predictor makes correct predictions, and is formally
defined as the ratio of correctly classified instances to the total number of instances, as
shown in Equation 7.1.
Accuracy = (TN + TP) / (TN + TP + FP + FN)    (7.1)
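Equation 7.1 translates directly into code; the confusion-matrix counts below are made-up values for illustration, not thesis results:

```python
def accuracy(tp, tn, fp, fn):
    """Equation 7.1: fraction of correctly classified instances
    out of all instances, from confusion-matrix counts."""
    return (tn + tp) / (tn + tp + fp + fn)

acc = accuracy(tp=40, tn=43, fp=9, fn=8)  # 83 correct out of 100
```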
Note: TP and TN are the numbers of true positives and true negatives, respectively,
while FP and FN are the numbers of false positives and false negatives, respectively. In
the context of this thesis, an accuracy of 1.0 would mean that the model was able to pre-
dict the necessary resources (i.e., scheduler, numbers and types of GPPs and/or PRRs)
to achieve a near-optimal solution (e.g., minimum execution time, power, and/or area)
100% of the time. In addition, we also evaluate each model using the well-known Re-
ceiver Operating Characteristic (ROC) curve [136-138]. The ROC curve is a plot of
the TP rate versus the FP rate, and shows the trade-off between sensitivity and specificity.
The ROC evaluation is a more appropriate measure than accuracy when dealing with im-
balanced data sets, as will be explained later. In the following subsections, the detailed
experimental results are discussed.
7.4.3 Results for Synthetic Benchmarks
The methodology used to evaluate the performance of the 5 cases introduced earlier
in Table 7.3 using 5 different classifiers (25 prediction models) was based on 10-fold
cross-validation. All 258 DFGs described in Section 7.1 were randomly divided into
10 subgroups, each containing an equal number of DFGs. Each of the 25 prediction
models was then trained using nine of the subgroups and tested using the tenth.
The process was repeated 10 times and the average accuracy for each model
was computed. The accuracies of the prediction models were then compared using a
t-test (with alpha = 0.05), corrected to avoid Type I errors due to the dependence between
samples.
Figure 7.3 presents the prediction accuracy of the 5 cases based on the different classi-
fiers. Tables 7.4 and 7.5 present the statistical significance of the accuracy achieved by
SVM and J48 respectively, using synthetic data. Based on Figure 7.3 and Tables 7.4 and
7.5, the results conclusively demonstrate, with a high level of statistical significance,
that J48, SVM and MLP achieve the highest accuracy rate of approximately 83%. This
indicates that 83% of the time the prediction model is able to predict the necessary re-
sources for obtaining an optimal objective. The other classification models, based on
K-NN and NB, obtain a lower accuracy rate of approximately 79%.
Interestingly, the graph in Figure 7.3 also shows that the average accuracy of all five
classification methods decreases as the number of classes increases. In particular, the
average accuracies of all five classification methods for Case I and Case II, which represent
binary classification problems, are 93.7% and 94.2%, respectively. For Case V, which is
a 3-class problem, the average accuracy of all five classification methods is 84.3%. The
worst average accuracy (62%) occurs for Case III, where the number of classes increases
to 4. This behavior can be attributed to imbalanced data among the various problem
instances, to the variance/bias of the classifiers, or to the ratio of records to feature-set size.
The class imbalance problem occurs when the distribution of instances is skewed between
classes. An imbalanced distribution causes typical classification algorithms, which are
designed to maximize classification accuracy, to have trouble learning the minority class
or classes. As can be seen from Figure 7.4, the class imbalance problem is more notice-
able in the multi-class data sets than in the binary-class data sets. The problem of imbal-
anced data can be dealt with using either undersampling or oversampling, which may
not be ideal when the data-set size is small, and duplicating data may adversely affect the models.
A more conservative and promising approach used by many researchers to reduce the
impact of within-class imbalance is to utilize ensemble-based methods, as explained in
the next subsection.
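One common way to quantify the imbalance plotted in Figure 7.4 is the ratio between the majority- and minority-class instance counts (the figure's exact definition is not spelled out here, so this formulation is an assumption):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the majority-class count to the minority-class count;
    1.0 means a perfectly balanced data set."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

# a toy binary data set with 90 instances of one class and 10 of the other
ratio = imbalance_ratio(["A"] * 90 + ["B"] * 10)
```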
[Bar chart: prediction accuracy (%) of NB, SVM, MLP, KNN and J48 for Case-1 through Case-5.]
Figure 7.3: Individual Classifiers: Prediction Accuracy for the synthesized benchmarks
[Bar chart: imbalance ratio between minority and majority classes for Case-1 through Case-5.]
Figure 7.4: Imbalance ratio between Minority/Majority Classes [synthesized benchmark]
Table 7.4: Accuracy with T-test evaluation (SVM) [synthesized benchmark].

| Dataset | SVM | NB | MLP | KNN | J48 |
|---|---|---|---|---|---|
| Case-1 | 95.12 | 90.12 • | 96.36 | 92.45 | 94.89 |
| Case-2 | 95.19 | 93.72 | 95.11 | 91.27 • | 95.71 |
| Case-3 | 71.61 | 66.30 • | 79.69 ◦ | 74.06 | 77.80 ◦ |
| Case-4 | 66.73 | 58.37 • | 61.79 | 60.45 • | 64.85 |
| Case-5 | 87.72 | 83.57 • | 84.34 | 81.20 • | 85.01 |
| Average | 83.27 | 78.42 | 83.46 | 79.89 | 83.65 |

◦, • statistically significant improvement or degradation
Table 7.5: Accuracy with T-test evaluation (J48) [synthesized benchmark].

| Dataset | J48 | NB | SVM | MLP | KNN |
|---|---|---|---|---|---|
| Case-1 | 94.89 | 90.12 • | 95.12 | 96.36 | 92.45 |
| Case-2 | 95.71 | 93.72 | 95.19 | 95.11 | 91.27 • |
| Case-3 | 77.80 | 66.30 • | 71.61 • | 79.69 | 74.06 |
| Case-4 | 64.85 | 58.37 | 66.73 | 61.79 | 60.45 |
| Case-5 | 85.01 | 83.57 | 87.72 | 84.34 | 81.20 |

◦, • statistically significant improvement or degradation
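The ◦/• markers in Tables 7.4 and 7.5 denote statistically significant improvement or degradation relative to the base classifier, typically established with a paired t-test over cross-validation folds. A minimal sketch of such a test follows; the per-fold accuracies are hypothetical illustrative values, not the thesis data:

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic of a paired t-test over matched samples (e.g. per-fold accuracies)."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)               # sample std. dev. of the differences
    return mean_d / (sd_d / math.sqrt(len(diffs)))

# Hypothetical 10-fold cross-validation accuracies for two classifiers.
svm_acc = [95.0, 94.8, 95.5, 95.3, 94.9, 95.2, 95.1, 94.7, 95.4, 95.0]
nb_acc  = [90.3, 89.8, 90.5, 90.0, 90.2, 89.9, 90.4, 90.1, 89.7, 90.3]

t = paired_t_statistic(svm_acc, nb_acc)
# Two-sided critical value of the t distribution for df = 9 at the 5% level ~ 2.262.
significant = abs(t) > 2.262
```

A positive significant `t` would correspond to a ◦ (improvement of the first classifier over the second), and a negative one to a • (degradation).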
With respect to the CPU time used by the algorithms, the training of each classifier
incurs a one-time computational cost. Table 7.6 shows the average run-times for training
the prediction models based on NB, MLP, SVM, K-NN, and J48, respectively. (Note:
each Java-based model ran on an Intel Xeon (W3670) workstation with 16 GB of memory.)
For example, the SVM algorithm is trained at a one-time cost of around 100 ms.
Naive Bayes is attractive since it can be trained faster than the SVM, J48, and
MLP algorithms. It should be noted that in the case of K-NN there is no initial training;
accordingly, K-NN is usually referred to as a lazy classifier due to the absence of training.
However, the computational cost is instead incurred when performing predictions during the
testing phase.
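The "lazy" behavior of K-NN can be illustrated with a tiny 1-nearest-neighbour classifier: "training" merely stores the data, and all distance computation happens at prediction time. This is an illustrative sketch, not the thesis implementation:

```python
import math

class OneNN:
    """Minimal 1-nearest-neighbour classifier: training only stores the data."""

    def fit(self, X, y):
        # A lazy classifier does no work here beyond remembering the records.
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        # All the real computation happens at query (test) time.
        dists = [math.dist(x, p) for p in self.X]
        return self.y[dists.index(min(dists))]

knn = OneNN().fit([(0, 0), (0, 1), (5, 5), (6, 5)], ["small", "small", "big", "big"])
label = knn.predict((5.5, 5.2))   # nearest stored point is (5, 5) -> "big"
```

Because every prediction scans the stored training set, the cost that eager learners pay once at training time is paid by K-NN on every test query.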
Table 7.6: Training time for Case # 1 for individual classifiers

| Classifier | Training Time |
|---|---|
| SVM | 100 ms |
| MLP | 60.4 s |
| NB | 10 ms |
| KNN | N/A |
| J48 | 200 ms |
7.4.4 Results for MediaBench DSP suite
In addition to the synthetic benchmarks used in the previous subsection, 20 real-world
DFGs based on DSP applications were used for evaluating the performance of the prediction
models. The DFGs were selected from MediaBench, a standard DSP benchmark
suite. Table 7.7 lists the characteristics of the DFGs, which range in complexity from 18
to 359 nodes and from 16 to 380 edges.

Table 7.7: MediaBench DSP Benchmark Specifications.

| ID | Name | # of nodes | # of edges | Avg. edges per node | Critical path length | Parallelism (nodes/critical path) |
|---|---|---|---|---|---|---|
| 1 | JPEG - Write BMP Header | 106 | 88 | 0.83 | 7 | 15.14 |
| 2 | JPEG - Smooth Downsample | 51 | 52 | 1.02 | 16 | 3.19 |
| 3 | JPEG - Forward Discrete Cosine Transform | 134 | 169 | 1.26 | 13 | 10.3 |
| 4 | JPEG - Inverse Discrete Cosine Transform | 122 | 162 | 1.33 | 14 | 8.71 |
| 5 | MPEG - Inverse Discrete Cosine Transform | 114 | 164 | 0.44 | 16 | 7.125 |
| 6 | MPEG - Motion Vectors | 32 | 29 | 0.91 | 6 | 5.33 |
| 7 | EPIC - Collapsepyr | 56 | 73 | 1.3 | 7 | 8 |
| 8 | MESA - Invert Matrix | 333 | 354 | 1.06 | 11 | 30.27 |
| 9 | MESA - Smooth Triangle | 197 | 196 | 0.99 | 11 | 17.9 |
| 10 | MESA - Horner Bezier | 18 | 16 | 0.89 | 8 | 2.25 |
| 11 | MESA - Interpolate Aux | 108 | 104 | 0.96 | 8 | 13.5 |
| 12 | MESA - Matrix Multiplication | 109 | 116 | 1.06 | 9 | 12.11 |
| 13 | MESA - Feedback Points | 53 | 50 | 0.94 | 7 | 7.57 |
| 14 | HAL | 11 | 8 | 0.72 | 4 | 2.75 |
| 15 | Finite Input Response Filter 11/h2 | 44 | 43 | 0.98 | 11 | 4 |
| 16 | Finite Input Response Filter 2 | 40 | 39 | 0.975 | 11 | 3.64 |
| 17 | Elliptic Wave Filter | 34 | 47 | 1.38 | 14 | 2.43 |
| 18 | Auto Regression Filter | 28 | 30 | 1.07 | 8 | 3.5 |
| 19 | Cosine 1 | 66 | 76 | 1.15 | 8 | 8.25 |
| 20 | Cosine 2 | 82 | 91 | 1.11 | 8 | 10.25 |

Figure 7.5 shows a similar trend to that of the synthetic data in Figure 7.3, with all of the classifiers performing better on Cases I, II, V,
IV and III in that order. On average, MLP performed the best with an average accuracy of
75%, followed by SVM and J48 (average accuracies of 73% and 72%, respectively). NB
Figure 7.5: Individual Classifiers: Prediction Accuracy for MediaBench.
was the worst, with an average accuracy of 68%, while KNN had an average accuracy of 70%.
For Case-IV, J48 performed the best with 57% accuracy. The prediction accuracy on the real-world
data was lower than on the synthetic data by an average of 10%; except for Case-I, all cases had
higher accuracy on the synthetic data. The difference ranged from 9% for Case-II to 21%
for Case-IV, with Case-III and Case-V in between at average accuracy differences of 14%
and 11%, respectively. The difference in accuracy between the synthetic benchmark and the
MediaBench DSP benchmark is not surprising: the models were trained and tested with synthetic
data in the first case, but trained with synthetic data and tested with real-world data in
the second. This shows that although our models can accurately predict classes for
unseen circuits, it is recommended to train the model on real-world data to achieve better
accuracy.
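The structural features in Table 7.7 can be computed directly from a DFG. A sketch over a small hypothetical DAG stored as adjacency lists (the node names and graph are illustrative, not a thesis benchmark):

```python
from functools import lru_cache

# Hypothetical 4-node DFG as adjacency lists (node -> successors); acyclic by construction.
dfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def critical_path_length(graph):
    """Longest path through the DAG, measured in nodes."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        succs = graph[node]
        return 1 + (max(longest_from(s) for s in succs) if succs else 0)
    return max(longest_from(n) for n in graph)

nodes = len(dfg)
edges = sum(len(s) for s in dfg.values())
cp = critical_path_length(dfg)
parallelism = nodes / cp   # the "parallelism" column of Table 7.7
```

For this toy graph the critical path is a → b → d (3 nodes), giving a parallelism of 4/3, exactly the nodes-over-critical-path definition used in Table 7.7.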
7.4.5 Ensemble Learning
In this subsection we seek to determine whether we can further improve the accuracy of the
prediction models by employing ensemble methods. Ensemble methods [131] employ not
one, but multiple classifiers to classify new data points, often by taking a weighted average
vote of their predictions. Ensemble learning algorithms are primarily used to improve
prediction accuracy by mitigating issues related to variance and bias, or over-fitting when
the data set is small. Many techniques have been proposed for combining the predictions
of multiple classifiers, but the most popular methods are:
• Bagging trains each classifier in the ensemble using a random redistribution of the
original data. If X is the size of the original training set, a training set is generated
for each classifier by randomly selecting X records with replacement. The predictions
of each trained classifier are combined using a (weighted) voting scheme. In practice,
bagging is especially useful in situations where even small changes in the training set
may lead to big changes in the prediction [139].
• Boosting is an ensemble technique where learning models are built sequentially.
Initial prediction models tend to be very simple and are used to determine particular
data points that are difficult to fit. Later prediction models focus primarily on those
hard-to-fit points, with the goal of predicting them correctly. All of the models are
finally given weights, and the set is combined to evolve an increasingly more complex
and accurate prediction model [140].
• Stacking applies different types of classification models to the original data. Instead
of using voting or a weighted-average approach, stacking uses a meta-level classifier
to determine the winning model. In general, stacking works well, but it is very hard
to analyze [102].
• Randomization (Random Forest) works with a large collection of decision trees.
The method generates a large set (forest) of independent decision trees using
different random samples of the original data set. Voting is then used to determine
the final class [102].
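The bagging scheme described above can be sketched compactly. The toy pure-Python ensemble below uses one-feature decision stumps as deliberately weak base learners and a majority vote as the combiner; it is illustrative only, whereas the thesis experiments used Java-based classifiers:

```python
import random
import statistics

class Stump:
    """One-feature threshold classifier: a deliberately weak base learner."""

    def fit(self, X, y):
        # Pick the threshold on feature 0 that minimises training error.
        best = None
        for t in sorted(x[0] for x in X):
            err = sum((1 if x[0] >= t else 0) != lbl for x, lbl in zip(X, y))
            if best is None or err < best[0]:
                best = (err, t)
        self.t = best[1]
        return self

    def predict(self, x):
        return 1 if x[0] >= self.t else 0

def bagging_fit(X, y, n_estimators=11, seed=0):
    """Bagging: train each learner on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(Stump().fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    # Combine the trained classifiers with a plain majority vote.
    return statistics.mode(m.predict(x) for m in models)

# Toy, linearly separable data set: class 1 iff the feature exceeds 5.
X = [(0,), (1,), (2,), (3,), (6,), (7,), (8,), (9,)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = bagging_fit(X, y)
pred = bagging_predict(ensemble, (8,))   # classify an unseen point
```

Boosting, stacking, and random forests follow the same train-many/combine pattern, differing in how the training sets are built and how the votes are weighted.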
In general, each of the previous ensemble methods seeks to create a more accurate
classifier by combining less-accurate ones. Figures 7.6 and 7.7 compare the prediction
accuracy of the single classifier, J48, to that of the four ensemble methods Random Forest,
Boosting, Bagging, and Stacking for the synthetic and MediaBench DSP benchmarks,
respectively. (J48 was selected for comparison since, along with MLP and SVM, it
achieved the highest average accuracy rate of 83%.) It is clear that in every case all
four ensembles achieve higher average accuracy rates than J48; the average improvements
in accuracy range from 1% to 11%. However, there is no statistically significant
difference in average accuracy between the four ensembles.
As observed in the case of the individual classifiers, the average accuracy of all four
ensembles decreases as the number of classes increases. In particular, the average
accuracies of all four ensembles for Case I and Case II, which represent binary classification
problems, are 96.15% and 96.03%, respectively. For Case V, which is a 3-class problem,
the average accuracy of the ensembles is 86.16%. Finally, the worst average accuracy
(70.9%) occurs for Case III, where the number of classes increases to 4. Although the
use of ensembles does not fully solve the problem of class imbalance, it does help to
mitigate its effects. Notice that the largest improvements in average accuracy compared
with J48 occurred in Case III and Case IV, both of which have more than 2 classes.

Table 7.8 shows the average accuracy achieved by all five individual classifiers and
all four ensembles. It is clear that the ensembles outperform all of the single-classifier
approaches. Table 7.9 shows the average training time for each of the classifiers. Although
there is no statistically significant difference between the four ensembles in terms of
accuracy, both bagging and stacking require an order of magnitude more training time
compared with random forest and boosting. This suggests that the latter two techniques
should be preferred due to their run-time efficiency.

The area under the ROC curve for the ensemble techniques is shown in Figures 7.8
and 7.9 for the synthetic and MediaBench DSP suites, respectively. It is clear that the
average performance of the ensemble-based techniques is better than that of the J48
machine learning technique in all cases.
Table 7.8: Accuracy with T-test significance evaluation [synthesized benchmark]

| Dataset | MLP | NB | SVM | KNN | J48 | Rand. Forest | AdaBoost | Bagging | Stacking |
|---|---|---|---|---|---|---|---|---|---|
| Case-1 | 94.74 | 90.12 • | 95.31 | 96.67 ◦ | 92.60 • | 96.36 ◦ | 96.29 ◦ | 95.86 ◦ | 96.09 ◦ |
| Case-2 | 95.71 | 93.72 • | 95.19 | 95.11 | 91.27 • | 96.01 | 96.32 | 95.97 | 95.82 |
| Case-3 | 77.80 | 66.30 • | 71.61 • | 79.69 ◦ | 74.06 • | 84.79 ◦ | 84.14 ◦ | 84.45 ◦ | 84.50 ◦ |
| Case-4 | 64.85 | 58.37 • | 66.73 | 61.79 • | 60.45 • | 70.41 ◦ | 70.90 ◦ | 71.27 ◦ | 71.00 ◦ |
| Case-5 | 85.01 | 83.57 • | 87.72 ◦ | 84.34 | 81.20 • | 87.01 ◦ | 86.94 ◦ | 87.14 ◦ | 87.56 ◦ |

◦, • statistically significant improvement or degradation

Table 7.9: Training time for different classification algorithms (seconds).

| Dataset | MLP | NB | SVM | KNN | J48 | Rand. Forest | AdaBoost | Bagging | Stacking |
|---|---|---|---|---|---|---|---|---|---|
| Case-1 | 51.90 | 0.01 ◦ | 0.10 ◦ | 0.00 ◦ | 0.10 ◦ | 0.40 ◦ | 0.40 ◦ | 3.60 ◦ | 5.80 ◦ |
| Case-2 | 40.90 | 0.01 ◦ | 0.10 ◦ | 0.00 ◦ | 0.10 ◦ | 0.40 ◦ | 0.40 ◦ | 3.50 ◦ | 5.70 ◦ |
| Case-3 | 83.90 | 0.02 ◦ | 0.50 ◦ | 0.00 ◦ | 0.60 ◦ | 1.60 ◦ | 1.70 ◦ | 14.60 ◦ | 27.40 ◦ |
| Case-4 | 63.30 | 0.02 ◦ | 0.50 ◦ | 0.00 ◦ | 0.40 ◦ | 1.20 ◦ | 1.30 ◦ | 10.70 ◦ | 20.80 ◦ |
| Case-5 | 44.00 | 0.01 ◦ | 0.20 ◦ | 0.00 ◦ | 0.20 ◦ | 0.70 ◦ | 0.70 ◦ | 6.00 ◦ | 10.60 ◦ |
| Average | 56.8 | 0.014 | 0.28 | 0.00 | 0.28 | 0.86 | 0.90 | 7.68 | 14.06 |

◦, • statistically significant improvement or degradation
Figure 7.6: Ensemble Classifiers: Prediction Accuracy for the Synthesized Benchmark
Figure 7.7: Ensemble Classifiers: Prediction Accuracy for MediaBench
Figure 7.8: Ensemble Classifiers: Area Under ROC for the Synthesized Benchmark
Figure 7.9: Ensemble Classifiers: Area Under ROC for MediaBench
7.4.6 Random Baseline
As a final demonstration of the efficiency of our approach, we provide a comparison
of our prediction models with a random predictor, as highlighted in Tables 7.10 - 7.13.
It is important to note that the best solutions entered in Tables 7.10 - 7.13 were obtained
by exhaustive search, while the random solutions indicated in these tables were picked
randomly from the pool of generated solutions. Working with synthetic data first, 11 DFGs
were randomly selected (with uniform probability) out of the pool of 258 DFGs. These
DFGs were then compared with the DFGs predicted by the models for Case-III and Case-I
using SVM classifiers, on the basis of minimizing execution time (# of cycles) and power.
The power was measured as explained previously in Section 6.2.1. The number of cycles
was used to make the model more general so that it can target multiple platforms with
different frequencies; the actual time can be calculated by dividing the number of cycles
by the frequency of the system (cycles/frequency). Tables 7.10
and 7.11 show the results for time and power, respectively. An asterisk (*) indicates that the
model missed the best class yet is still close enough, and much better than the random guess
in almost all cases. Results obtained for the synthetic benchmarks indicate that the ML-engine
proposed in this thesis is 13.9% and 0.1% away from the best available solutions
for latency and power respectively, and improves upon a random baseline solution
by 278% and 119% for latency and power respectively. Similarly, a baseline comparison
was performed for the MediaBench DSP benchmark for Case-III and Case-I, but using
the J48 classifier. The results are shown in Tables 7.12 and 7.13. Results obtained
for the MediaBench DSP benchmarks indicate that the proposed ML-engine is 4.2% and
0.0% away from the best available solutions for latency and power respectively, and improves
upon a random baseline solution by 403% and 103% for latency and power
respectively.
Table 7.10: Performance Enhancement for total time (Synthesized). Values are # of Cycles [Case 3 - SVM].

| DFG ID | # nodes | Random | Best | Data mining |
|---|---|---|---|---|
| 9 | 100 | 27,610 | 1,317 | *1,418 |
| 39 | 75 | 4,464 | 2,226 | 2,226 |
| 40 | 100 | 3,333 | 2,122 | *2,484 |
| 43 | 200 | 5,796 | 2,038 | 2,038 |
| 53 | 50 | 3,503 | 1,427 | *1,536 |
| 54 | 75 | 4,816 | 1,685 | 1,685 |
| 61 | 450 | 3,635 | 2,401 | *4,888 |
| 63 | 750 | 8,173 | 3,727 | *3,985 |
| 99 | 1,000 | 4,827 | 4,242 | 4,242 |
| 106 | 125 | 4,988 | 2,207 | 2,207 |
| 112 | 500 | 5,257 | 4,073 | *5,186 |
| Average | | 6,946 | 2,497 | 2,900 |
Table 7.11: Performance Enhancement for power (Synthesized). Values are Power (mW) [Case - SVM].

| DFG ID | # nodes | Random | Best | Data mining |
|---|---|---|---|---|
| 9 | 100 | 1,882 | 1,845 | 1,845 |
| 39 | 75 | 1,550 | 1,419 | 1,419 |
| 40 | 100 | 2,166 | 1,755 | 1,755 |
| 43 | 200 | 4,591 | 3,685 | 3,685 |
| 53 | 50 | 940 | 807 | 807 |
| 54 | 75 | 1,695 | 1,420 | 1,420 |
| 61 | 450 | 9,821 | 7,779 | 7,779 |
| 63 | 750 | 17,484 | 13,797 | 13,797 |
| 99 | 1,000 | 27,758 | 24,885 | *24,963 |
| 106 | 125 | 2,814 | 2,319 | 2,319 |
| 112 | 500 | 11,648 | 9,490 | 9,490 |
| Average | | 7,486 | 6,291 | 6,298 |
Table 7.12: Performance Enhancement for total time (MediaBench DSP). Values are # of Cycles [Case 3 - J48].

| DFG ID | # nodes | Random | Best | Data mining |
|---|---|---|---|---|
| 1 | 106 | 2,499 | 1,091 | *1,791 |
| 2 | 51 | 2,143 | 990 | 990 |
| 3 | 134 | 2,794 | 1,995 | 1,995 |
| 4 | 122 | 9,755 | 2,064 | *2,169 |
| 5 | 114 | 13,156 | 2,061 | *2,063 |
| 6 | 32 | 2,774 | 567 | 567 |
| 7 | 56 | 1,265 | 1,061 | 1,061 |
| 8 | 333 | 12,490 | 2,844 | 2,844 |
| 9 | 197 | 13,158 | 2,073 | 2,073 |
| 10 | 18 | 2,567 | 503 | *679 |
| 11 | 108 | 9,229 | 1,128 | 1,128 |
| 12 | 109 | 3,809 | 1,610 | *1,815 |
| 13 | 53 | 3,642 | 982 | 982 |
| 14 | 11 | 692 | 349 | 349 |
| 15 | 44 | 2,994 | 1,220 | *1,487 |
| 16 | 40 | 1,398 | 1,015 | *1,091 |
| 17 | 34 | 1,469 | 799 | 799 |
| 18 | 28 | 983 | 848 | 848 |
| 19 | 66 | 1,696 | 1,507 | 1,507 |
| 20 | 82 | 13,331 | 1,649 | *1,971 |
| Average | | 5,092 | 1,318 | 1,262 |
Table 7.13: Performance Enhancement for power (MediaBench DSP). Values are Power (mW) [Case 1 - J48].

| DFG ID | # nodes | Random | Best | Data mining |
|---|---|---|---|---|
| 1 | 106 | 2,954 | 2,801 | 2,801 |
| 2 | 51 | 1,455 | 1,366 | 1,366 |
| 3 | 134 | 4,257 | 4,065 | 4,065 |
| 4 | 122 | 3,514 | 3,399 | 3,399 |
| 5 | 114 | 3,747 | 3,737 | 3,737 |
| 6 | 32 | 781 | 781 | 781 |
| 7 | 56 | 1,416 | 1,373 | 1,373 |
| 8 | 333 | 7,100 | 6,974 | 6,974 |
| 9 | 197 | 5,657 | 5,597 | 5,597 |
| 10 | 18 | 398 | 396 | 396 |
| 11 | 108 | 2,788 | 2,700 | 2,700 |
| 12 | 109 | 2,597 | 2,471 | 2,471 |
| 13 | 53 | 1,299 | 1,259 | 1,259 |
| 14 | 11 | 230 | 227 | 227 |
| 15 | 44 | 988 | 978 | 978 |
| 16 | 40 | 1,019 | 990 | 990 |
| 17 | 34 | 1,068 | 1,049 | 1,049 |
| 18 | 28 | 697 | 677 | 677 |
| 19 | 66 | 1,319 | 1,253 | 1,253 |
| 20 | 82 | 1,832 | 1,741 | 1,741 |
| Average | | 2,256 | 2,192 | 2,192 |
7.5 Summary
Performance is one of the fundamental reasons for using Reconfigurable
Computing Systems (RCS). By mapping algorithms and applications to hardware,
designers can tailor not only the computation components, but also perform data-flow
optimization to match the algorithm. There are many challenges in adaptive computing
and dynamic reconfigurable systems; one of the major understudied challenges is
estimating the required resources and predicting an appropriate floorplan for the system.
In this thesis we propose a novel technique based on machine learning to predict the
necessary resources for dynamic run-time reconfiguration. In the work proposed
in this chapter, a classification model is used to predict the appropriate type of resources
for a reconfigurable computing platform given a specific application. Using ensemble-based
systems provides favorable results compared to single-expert machine learning
algorithms under a variety of scenarios.
Chapter 8
Conclusions and Future Work
Several embedded application domains for reconfigurable systems tend to combine
frequent changes with the high performance demands of their workloads, such as image
processing, wearable computing, and network processors. For example, in wireless
communication systems several standards and technologies are available, such as GSM,
WiMax, and WCDMA, but it is unlikely that all of these protocols will be used at the same time.
Accordingly, it is possible to dynamically load only the one that is needed. Dynamic run-time
reconfigurable architectures are considered an attractive avenue for composing efficient
platforms for today's embedded systems. Partial reconfiguration is appealing since it
provides flexibility and minimizes power, cost, and area. The resulting system
can provide high performance by implementing custom hardware functions in
the FPGA while remaining flexible by reprogramming the FPGA and/or using the attached
processor.
However, multitasking reconfigurable hardware is complex and requires some overhead
in terms of management. Time multiplexing of reconfigurable hardware resources
raises a number of new issues, ranging from run-time systems to complex programming
models. In order for users to benefit from the flexibility of such systems, an operating
system must be developed to not only reduce the complexity of application
development but also provide the developer with tools at a higher level of abstraction.
8.1 Conclusions
In this thesis an efficient operating system for run-time reconfigurable architectures
was designed and implemented to ease application design and help properly manage
resources within the reconfigurable system. Making such a reconfiguration manager
available along with the current flow should further enhance and improve partial
reconfiguration on state-of-the-art FPGAs. The thesis was carried out in three main phases.
1. The first phase of the thesis involved the development and evaluation of several
novel heuristics for on-line scheduling of hard real-time tasks running on partially
reconfigurable devices. The proposed schedulers used fixed predefined partial
reconfigurable regions with re-use, relocation, and task-migration capability. In
particular, RCSched-III used real-time data to make scheduling decisions: the
scheduler dynamically measures several performance metrics such as reconfiguration
time and execution time, calculates a priority, and based on these metrics assigns
incoming tasks to the appropriate processing elements. In order to evaluate the
proposed framework and schedulers, a DFG generator was carefully designed and
developed; it randomly generates benchmarks with predefined specifications, such
as the number of nodes, task types, and total number of dependencies per DFG. The
tools developed in this thesis should assist developers using dynamic reconfiguration
with more ease and flexibility.
2. The second phase of the thesis proposed the design and implementation of a parallel
island-based GA approach for efficiently mapping execution units to task graphs
for partially dynamically reconfigurable systems. Each GA optimization module
consisted of four main components: an Architecture Library, an Initial Population
Generation module, a GA Engine, and a Fitness Evaluation module (on-line
scheduler). The Architecture Library stores all of the different available architectures
(execution units) for every task type; architectures vary in execution time, area,
reconfiguration time, and power consumption. Each GA module tends to optimize
several criteria (objectives) based on a single fixed floorplan/platform. Unlike
previous works, our approach is multi-objective and not only seeks to optimize speed
and power, but also seeks to select the best reconfigurable floorplan. The basic
idea is to aggregate the results obtained from the Pareto fronts of each island to
enhance the overall solution quality. Each solution on the Pareto front tends to
optimize power consumption, speed, and area based on a different platform (floorplan)
within the FPGA. Our approach was tested using both synthetic and real-world
benchmarks. Experimental results have demonstrated the effectiveness of
this approach, where the proposed island-based GA framework achieved on average
a 55.2% improvement over a single GA implementation and an 80.7% improvement
over a baseline random allocation and binding approach. This is one of the few
attempts to use multiple GA instances for optimizing several objectives and
aggregating the results to further improve solution quality.
3. The third and final phase of this thesis proposed a novel adaptive and dynamic
methodology based on an intelligent machine learning approach that is used to
predict and estimate the necessary resources for an application based on past
historical information. An important feature of the proposed methodology is that the
system is able to learn as it gains more knowledge and, therefore, is expected to
generalize and improve its accuracy over time. Even though the approach is general
enough to predict most if not all types of resources, from the number and type
of PRRs to the type of scheduler and communication infrastructure, we limit our
results to the former three required for an application. The framework was based
on extracting certain features from the applications that are executed on the
reconfigurable platform. The compiled features were then used to train and build
a classification model capable of predicting the floorplan appropriate for
an application. Our proposed approach was based on several modules, including
benchmark generation, data collection, pre-processing of data, data classification,
and post-processing. The goal of the entire process was to extract useful, hidden
knowledge from the data; this knowledge is then used to predict and estimate
the necessary resources and appropriate floorplan for an unknown or not previously
seen application. Based on the literature review, the use of data-mining and
machine-learning techniques has not been proposed by any research group to exploit
this specific type of design exploration for reconfigurable systems in terms
of predicting the appropriate floorplan of an application.
The proposed framework in this thesis is flexible enough to be used in several
operational modes. The modes of operation will depend on the degree of integration of the
different phases developed above and requested by the user.
8.2 Future Work
Designing and implementing operating systems for partially reconfigurable systems is a
complex, tedious, and iterative process. Creating an effective and robust operating
system for reconfigurable embedded systems is an integral part of our ongoing research.
Accordingly, in our future work we will attempt to pursue the following directions:
• Develop more efficient scheduling algorithms that can produce near-optimal
solutions compared to offline schedulers.
• Overcome some of the current limitations of the prediction model by applying
it to predict other important resources such as the communication infrastructure
and communication links between soft cores and programmable reconfigurable
regions.
• To achieve real-time performance, our future work should consider developing
hardware acceleration of the supervised learning classifiers used in this work.
• Extend the ideas introduced in this thesis to specific applications, including real-time
image processing, computer vision, and communication systems.
Bibliography
[1] H. P. Rosinger,Connecting customized IP to the MicroBlaze soft processor using
the Fast Simplex Link (FSL) channel. Xilinx Inc., 2004.
[2] “VC707 Evaluation Board for the Virtex-7 FPGA, User Guide,” UG885 (v1.0), March
5, 2012.
[3] K. Eguro, “Automated Dynamic Reconfiguration for High Performance Regular
Expression Searching,” in IEEE Int’l Conference on Field Programmable Technology,
Sydney, Australia, December 2009, pp. 455–459.
[4] M. Huang, V. K. Narayana, M. Bakhouya, J. Gaber, and T. El-Ghazawi, “Efficient
mapping of task graphs onto reconfigurable hardware using architectural variants,”
Computers, IEEE Transactions on, vol. 61, no. 9, pp. 1354–1360, 2012.
[5] Xilinx, “Partial reconfiguration user guide,” UG702, May 2010.
[6] M. Huang, V. K. Narayana, and T. El-Ghazawi, “Efficient mapping of hardware
tasks on reconfigurable computers using libraries of architecture variants,” in Field
Programmable Custom Computing Machines, 2009. FCCM ’09. 17th IEEE Symposium
on, 2009, pp. 247–250.
[7] A. Al-wattar, S. Areibi, and F. Saffih, “Efficient On-line Hardware/Software Task
Scheduling for Dynamic Run-Time Reconfigurable Systems,” in IEEE Reconfigurable
Architectures Workshop (RAW), Shanghai, China, May 2012, pp. 401–406.
[8] H. K.-H. So, “BORPH: An operating system for FPGA-based reconfigurable
computers,” Ph.D. dissertation, University of California, Berkeley, 2007.
[9] C. Bobda, Introduction to Reconfigurable Computing: Architectures, Algorithms
and Applications. Springer, November 2007.
[10] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and
software,” ACM Comput. Surv., vol. 34, no. 2, pp. 171–210, 2002.
[11] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,”
in Design, Automation and Test in Europe, 2001. Conference and Exhibition 2001.
Proceedings, 2001, pp. 642–649.
[12] R. Tessier and W. Burleson, “Reconfigurable computing for digital signal process-
ing: A survey,” Journal of VLSI Signal Processing, vol. 28, pp. 7–27, 2000.
[13] T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk, and
P. Y. K. Cheung, “Reconfigurable computing: architectures and design methods,”
Computers and Digital Techniques, IEE Proceedings -, vol. 152, no. 2, pp. 193–
207, 2005.
[14] P.-A. Hsiung, M. D. Santambrogio, and C.-H. Huang, Reconfigurable System Design
and Verification. Boca Raton, FL, USA: CRC Press, Inc., 2009.
[15] R. Graml and G. Wigley, “Bushfire hotspot detection through uninhabited aerial
vehicles and reconfigurable computing,” in Aerospace Conference, 2008 IEEE,
2008, pp. 1–13.
[16] J. A. Clemente, C. Gonzalez, J. Resano, and D. Mozos, “A task graph execution
manager for reconfigurable multi-tasking systems,” Microprocessors and Microsystems,
vol. 34, no. 2-4, pp. 73–83, June 2010.
[17] Partial Reconfiguration User Guide. Xilinx Inc., October 5, 2010.
[18] E. J. McDonald, “Runtime FPGA partial reconfiguration,” in Aerospace
Conference, 2008 IEEE, 2008, pp. 1–7.
[19] Virtex-4 FPGA Configuration User Guide. Xilinx Inc., June 9, 2009.
[20] M. Sabeghi and K. Bertels, “Current trends in resource management of reconfigurable
systems,” in 19th Annual Workshop on Circuits, Systems and Signal Processing
(ProRISC 2008), November 2008.
[21] T. Nakano, “Hardware implementation of a real-time operating system,” M. Imai,
Ed., vol. 0, 1995, pp. 34–34.
[22] S. Nordstrom and L. Lindh, “Application specific real-time microkernel in hard-
ware,” in 14th IEEE-NPSS Real Time Conference 2005. IEEE, June 2005.
[23] L. Yan, L. Xian-yao, G. Ping-ping, Z. Hong-jie, and C. Ping, “Hardware implementation
of µC/OS-II based on FPGA,” Education Technology and Computer
Science, International Workshop on, vol. 3, pp. 825–828, 2010.
[24] T. Kerstan and S. Oberthur, “A configurable hybrid kernel for embedded real-time
systems,” A. Rettberg, Ed., 29 May – 1 June 2007.
[25] T. Klevin, “Get real fast RTOS with Xilinx FPGAs,” Xcell Journal, vol. 45, 2003.
[26] F. Cottet, J. Delacroix, C. Kaiser, and Z. Mammeri, Scheduling in Real-Time
Systems. John Wiley and Sons, Ltd, 2003.
[27] S. Hauck and A. Dehon, Reconfigurable Computing: The Theory and Practice of
FPGA-Based Computation. Amsterdam: Morgan Kaufmann, 2008.
[28] S. Guccione, D. Levi, and P. Sundararajan, “JBits: Java-based interface for reconfigurable
computing,” 1999.
[29] E. Eto, Difference-Based Partial Reconfiguration. Xilinx Inc., December 3, 2007.
[30] ——, Two Flows for Partial Reconfiguration: Module Based or Difference Based.
Xilinx Inc., September 9, 2004.
[31] Early Access Partial Reconfiguration. Xilinx Inc., 2008.
[32] E. L. Horta and J. W. Lockwood, “PARBIT: A tool to transform bitfiles to imple-
ment partial reconfiguration of field programmable gate arrays (FPGAs),” Tech.
Rep., 2001.
[33] E. L. Horta, J. W. Lockwood, D. E. Taylor, and D. Parlour, “Dynamic hardware
plugins in an FPGA with partial run-time reconfiguration,” in Proceedings of the
39th Annual Design Automation Conference, ser. DAC ’02. New York, NY, USA:
ACM, 2002, pp. 343–348.
[34] S. McMillan and S. Guccione, “Partial run-time reconfiguration using JRTR,” vol.
1896, pp. 352–360, 2000.
[35] M. Huebner, T. Becker, and J. Becker, “Real-time LUT-based network topologies
for dynamic and partial FPGA self-reconfiguration,” in Integrated Circuits and
Systems Design, 2004. SBCCI 2004. 17th Symposium on, 2004, pp. 28–32.
[36] N. Abel, S. Manz, F. Grull, and U. Kebschull, “Increasing design changeability
using dynamic partial reconfiguration,” Nuclear Science, IEEE Transactions on,
vol. 57, no. 2, pp. 602–609, 2010.
[37] K. Paulsson, M. Hubner, G. Auer, M. Dreschmann, L. Chen, and J. Becker, “Implementation
of a virtual internal configuration access port (JCAP) for enabling
partial self-reconfiguration on Xilinx Spartan-III FPGAs,” in Field Programmable
Logic and Applications, 2007. FPL 2007. International Conference on, 2007, pp.
351–356.
[38] M. B. Gokhale and P. S. Graham, Reconfigurable Computing: Accelerating Computation
with Field-Programmable Gate Arrays. Springer, December 2005.
[39] C. Cao, “Benefits of partial reconfiguration,” Xcell Journal, no. 55, 2005.
[40] J. McCaskill and D. Lautzenheiser, “FPGA partial reconfiguration goes mainstream,”
Xcell Journal, no. 73, 2010.
[41] I. Gonzalez, S. Lopez-Buedo, F. J. Gomez, and J. Martinez, “Using partial reconfiguration
in cryptographic applications: An implementation of the IDEA algorithm,”
in FPL, 2003, pp. 194–203.
[42] J. M. Granado, M. A. Vega-Rodriguez, J. M. Sanchez-Perez, and J. A. Gomez-
Pulido, “IDEA and AES, two cryptographic algorithms implemented using partial
and dynamic reconfiguration,” Microelectron. J., vol. 40, no. 6, pp. 1032–1040,
June 2009.
[43] J. P. Delahaye, G. Gogniat, C. Roland, and P. Bomel, “Software radio and dynamic
reconfiguration on a DSP/FPGA platform,” 2004.
[44] I. Gonzalez, S. Lopez-Buedo, and F. J. Gomez-Arribas, “Implementation of secure
applications in self-reconfigurable systems,” Microprocessors and Microsystems,
vol. 32, no. 1, pp. 23–32, February 2008.
[45] R. Kumar, R. C. Joshi, and K. S. Raju, “A FPGA partial reconfiguration design
approach for RASIP SDR,” in India Conference (INDICON), 2009 Annual IEEE,
2009, pp. 1–4.
[46] P. Ryser, “Software defined radio with reconfigurable hardware and software,” in
Embedded Systems Conference 2005, vol. 442. Xilinx Inc., 2005.
[47] P. Manet, D. Maufroid, L. Tosi, G. Gailliard, O. Mulertt, M. D. Ciano, J.-D. Legat,
D. Aulagnier, C. Gamrat, R. Liberati, V. L. Barba, P. Cuvelier, B. Rousseau, and
P. Gelineau, “An evaluation of dynamic partial reconfiguration for signal and image
processing in professional electronics applications,” EURASIP J. Embedded
Syst., vol. 2008, pp. 1:1–1:11, January 2008.
[48] F. Fons and M. Fons, Making Biometrics the Killer App of FPGA Dynamic Partial
Reconfiguration. Xilinx Inc., 2010, vol. 72.
[49] E. El-Araby, I. Gonzalez, and T. El-Ghazawi, “Exploiting partial run-time
reconfiguration for high-performance reconfigurable computing,” ACM
Trans. Reconfigurable Technol. Syst., vol. 1, no. 4, pp. 21:1–21:23, January 2009.
[50] C. S. Ibala and K. Arshak, “Using dynamic partial reconfiguration approach to
read sensor with different bus protocol,” in Sensors Applications Symposium,
2009. SAS 2009. IEEE, 2009, pp. 175–179.
[51] C. Claus, J. Zeppenfeld, F. Muller, and W. Stechele, “Using partial-run-time re-
configurable hardware to accelerate video processing in driver assistance system,”
in DATE ’07: Proceedings of the conference on Design, automation and test in
Europe. San Jose, CA, USA: EDA Consortium, 2007, pp. 498–503.
[52] J. Becker, M. Hubner, G. Hettich, R. Constapel, J. Eisenmann, and J. Luka, “Dynamic
and partial FPGA exploitation,” Proceedings of the IEEE, vol. 95, no. 2,
pp. 438–452, 2007.
[53] M. Liu, W. Kuehn, Z. Lu, and A. Jantsch, “Run-time partial reconfiguration speed
investigation and architectural design space exploration,” in Field Programmable
Logic and Applications, 2009. FPL 2009. International Conference on, 2009, pp.
498–502.
[54] J. Noguera and I. O. Kennedy, “Power reduction in network equipment through
adaptive partial reconfiguration,” in Field Programmable Logic and Applications,
2007. FPL 2007. International Conference on, 2007, pp. 240–245.
[55] J.-Y. Mignolet, S. Vernalde, D. Verkest, and R. Lauwereins, “Enabling hardware-software
multitasking on a reconfigurable computing platform for networked
portable multimedia appliances,” in Proceedings of the International Conference
on Engineering Reconfigurable Systems and Architecture 2002, 2002, pp. 116–
122.
[56] J. Emmert, C. Stroud, B. Skaggs, and M. Abramovici, “Dynamic fault tolerance
in FPGAs via partial reconfiguration,” in Field-Programmable Custom Computing
Machines, 2000 IEEE Symposium on, 2000, pp. 165–174.
[57] G. Brebner, “A virtual hardware operating system for the Xilinx XC6200,” in
Field-Programmable Logic Smart Applications, New Paradigms and Compilers,
ser. Lecture Notes in Computer Science, R. Hartenstein and M. Glesner, Eds.
Springer Berlin / Heidelberg, 1996, vol. 1142, pp. 327–336.
[58] ——, “The swappable logic unit: a paradigm for virtual hardware,” in FPGAs for
Custom Computing Machines, 1997. Proceedings., The 5th Annual IEEE Symposium
on, 1997, pp. 77–86.
[59] P. Merino, J. C. Lopez, and M. Jacome, “A hardware operating system for dynamic
reconfiguration of FPGAs,” p. 431, 1998.
[60] L. Levinson, R. Manner, M. Sessler, and H. Simmler, “Preemptive multitasking
on FPGAs,” in FCCM ’00: Proceedings of the 2000 IEEE Symposium on Field-
Programmable Custom Computing Machines. Washington, DC, USA: IEEE
Computer Society, 2000, p. 301.
[61] G. Wigley and D. Kearney, “The development of an operating system for reconfigurable
computing,” in Proceedings of the IEEE Symposium on FPGAs for
Custom Computing Machines (FCCM). IEEE CS Press, 2001.
[62] G. Wigley, D. Kearney, and D. Warren, “Introducing ReConfigME: An operating
system for reconfigurable computing,” pp. 687–697, 2002.
[63] G. Wigley and D. Kearney, “The first real operating system for reconfigurable
computers,” in Proceedings of the 6th Australasian Conference on Computer
Systems Architecture, ACSAC ’01. Washington, DC, USA: IEEE Computer Society,
2001, pp. 130–137.
[64] G. B. Wigley, “An operating system for reconfigurable computing,” 2005.
[65] G. Wigley and M. Jasiunas, “A low cost, high performance reconfigurable computing
based unmanned aerial vehicle,” in Aerospace Conference, 2006 IEEE, 2006,
13 pp.
[66] H. Walder and M. Platzner, “Reconfigurable hardware operating systems: From
design concepts to realizations,” in Proceedings of the 3rd International Conference
on Engineering of Reconfigurable Systems and Architectures (ERSA).
CSREA Press, 2003, pp. 284–287.
[67] ——, “A runtime environment for reconfigurable hardware operating systems,”
pp. 831–835, 2004.
[68] H. K.-H. So, A. Tkachenko, and R. Brodersen, “A unified hardware/software runtime
environment for FPGA-based reconfigurable computers using BORPH,” in
CODES+ISSS ’06: Proceedings of the 4th International Conference on Hardware/Software
Codesign and System Synthesis. New York, NY, USA: ACM, 2006,
pp. 259–264.
[69] H. K.-H. So and R. Brodersen, “A unified hardware/software runtime environment
for FPGA-based reconfigurable computers using BORPH,” Transactions on
Embedded Computing Systems, vol. 7, no. 2, pp. 1–28, 2008.
[70] E. Lubbers and M. Platzner, “ReconOS: Multithreaded programming for reconfigurable
computers,” ACM Trans. Embed. Comput. Syst., vol. 9, no. 1, pp. 1–33,
2009.
[71] R. Pellizzoni and M. Caccamo, “Real-time management of hardware and software
tasks for FPGA-based embedded systems,” IEEE Trans. Comput., vol. 56, no. 12,
pp. 1666–1680, 2007.
[72] F. Ghaffari, B. Miramond, and F. Verdier, “Run-time HW/SW scheduling of data
flow applications on reconfigurable architectures,” EURASIP J. Embedded Syst.,
vol. 2009, pp. 3–3, 2009.
[73] L. Devaux, S. B. Sassi, S. Pillement, D. Chillet, and D. Demigny, “DRAFT:
Flexible interconnection network for dynamically reconfigurable architectures,”
in Field-Programmable Technology, 2009. FPT 2009. International Conference
on, 2009, pp. 435–438.
[74] L. Devaux, D. Chillet, S. Pillement, and D. Demigny, “Flexible communication
support for dynamically reconfigurable FPGAs,” in Programmable Logic, 2009.
SPL. 5th Southern Conference on, 2009, pp. 65–70.
[75] L. Devaux, S. B. Sassi, S. Pillement, D. Chillet, and D. Demigny, “Flexible interconnection
network for dynamically and partially reconfigurable architectures,”
Int. J. Reconfig. Comput., vol. 2010, pp. 6:4–6:4, January 2010.
[76] F. Muller, J. L. Rhun, F. Lemonnier, B. Miramond, and L. Devaux, “A flexible
operating system for dynamic applications,” Xcell Journal, no. 73, 2010.
[77] B. Miramond, E. Huck, F. Verdier, M. E. A. Benkhelifa, B. Granado, M. Aichouch,
J. C. Prévotet, D. Chillet, S. Pillement, T. Lefebvre, and Y. Oliva, “OveRSoC: a
framework for the exploration of RTOS for RSoC platforms,” International Journal
on Reconfigurable Computing, vol. 2009, no. 450607, pp. 1–18, December 2009.
[78] S. M. Loo and B. E. Wells, “Task scheduling in a finite-resource, reconfigurable
hardware/software codesign environment,” INFORMS Journal on Computing,
vol. 18, no. 2, pp. 151–172, January 1, 2006.
[79] H. Walder and M. Platzner, “Online scheduling for block-partitioned reconfigurable
devices,” in Design, Automation and Test in Europe Conference and Exhibition,
2003, 2003, pp. 290–295.
[80] B. Mei, P. Schaumont, and S. Vernalde, “A hardware-software partitioning and
scheduling algorithm for dynamically reconfigurable embedded systems,” in Systems
and Signal Processing, Veldhoven, 2000.
[81] K. Danne and M. Platzner, “Periodic real-time scheduling for FPGA computers,”
in Intelligent Solutions in Embedded Systems, 2005. Third International Workshop
on, 2005, pp. 117–127.
[82] J. Resano, D. Mozos, and F. Catthoor, “A hybrid prefetch scheduling heuristic to
minimize at run-time the reconfiguration overhead of dynamically reconfigurable
hardware,” in Design, Automation and Test in Europe, 2005. Proceedings, 2005,
pp. 106–111, vol. 1.
[83] J. Resano, D. Mozos, D. Verkest, F. Catthoor, and S. Vernalde, “Specific scheduling
support to minimize the reconfiguration overhead of dynamically reconfigurable
hardware,” in Design Automation Conference, 2004. Proceedings. 41st,
2004, pp. 119–124.
[84] J. Resano, D. Mozos, D. Verkest, S. Vernalde, and F. Catthoor, “Run-time mini-
mization of reconfiguration overhead in dynamically reconfigurable systems,” vol.
2778, pp. 585–594, 2003.
[85] P. Yang, C. Wong, P. Marchal, F. Catthoor, D. Desmet, D. Verkest, and R. Lauw-
ereins, “Energy-aware runtime scheduling for embedded-multiprocessor SOCs,”
Design and Test of Computers, IEEE, vol. 18, no. 5, pp. 46–58, 2001.
[86] C. Steiger, H. Walder, and M. Platzner, “Heuristics for online scheduling real-time
tasks to partially reconfigurable devices,” in Proceedings of the 13th International
Conference on Field Programmable Logic and Applications (FPL03).
Springer, 2003, pp. 575–584.
[87] ——, “Operating systems for reconfigurable embedded platforms: online scheduling
of real-time tasks,” Computers, IEEE Transactions on, vol. 53, no. 11, pp.
1393–1407, 2004.
[88] Y.-H. Chen and P.-A. Hsiung, “Hardware task scheduling and placement in operating
systems for dynamically reconfigurable SoC,” vol. 3824, pp. 489–498, 2005.
[89] C.-W. Liu, “Hardware/software real-time relocatable task scheduling and placement
in dynamically partial reconfigurable systems,” June 2006.
[90] J. Noguera and R. M. Badia, “System-level power-performance trade-offs in task
scheduling for dynamically reconfigurable architectures,” in Proceedings of the
2003 International Conference on Compilers, Architecture and Synthesis for Embedded
Systems, ser. CASES ’03. New York, NY, USA: ACM, 2003, pp. 73–83.
[91] P.-A. Hsiung, P.-H. Lu, and C.-W. Liu, “Energy efficient co-scheduling in dynamically
reconfigurable systems,” in Proceedings of the 5th IEEE/ACM International
Conference on Hardware/Software Codesign and System Synthesis, ser.
CODES+ISSS ’07. New York, NY, USA: ACM, 2007, pp. 87–92.
[92] C.-C. Chiang, “Hardware/software real-time relocatable task scheduling and
placement in dynamically partial reconfigurable systems,” Master’s thesis, National
Chung Cheng University, June 2007.
[93] R. Pellizzoni and M. Caccamo, “Adaptive allocation of software and hardware
real-time tasks for FPGA-based embedded systems,” in Real-Time and Embedded
Technology and Applications Symposium, 2006. Proceedings of the 12th IEEE,
2006, pp. 208–220.
[94] D. Gohringer, M. Hubner, E. N. Zeutebouo, and J. Becker, “Operating system for
runtime reconfigurable multiprocessor systems,” 2011.
[95] D. Gohringer and J. Becker, “High performance reconfigurable multi-processor-based
computing on FPGAs,” in Parallel & Distributed Processing, Workshops and
PhD Forum (IPDPSW), 2010 IEEE International Symposium on, 2010, pp. 1–4.
[96] T.-M. Lee, J. Henkel, and W. Wolf, “Dynamic runtime re-scheduling allowing
multiple implementations of a task for platform-based designs,” in Design, Automation
and Test in Europe Conference and Exhibition, 2002. Proceedings, 2002,
pp. 296–301.
[97] W. Fu and K. Compton, “An execution environment for reconfigurable computing,”
in Field-Programmable Custom Computing Machines, 2005. FCCM 2005.
13th Annual IEEE Symposium on, 2005, pp. 149–158.
[98] B. Mei, P. Schaumont, and S. Vernalde, “A hardware-software partitioning and
scheduling algorithm for dynamically reconfigurable embedded systems,” 2000.
[99] J. Wang and S. M. Loo, “Case study of finite resource optimization in FPGA using
genetic algorithm,” I. J. Comput. Appl., vol. 17, no. 2, pp. 95–101, 2010.
[100] Y. Qu, J. P. Soininen, and J. Nurmi, “A genetic algorithm for scheduling tasks
onto dynamically reconfigurable hardware,” in Circuits and Systems, 2007. ISCAS
2007. IEEE International Symposium on, 2007, pp. 161–164.
[101] R. Chen, P. R. Lewis, and X. Yao, “Temperature management for heterogeneous
multi-core FPGAs using adaptive evolutionary multi-objective approaches,”
in Evolvable Systems (ICES), 2014 IEEE International Conference on, 2014, pp.
101–108.
[102] N. Sumikawa, “Screening customer returns with multivariate test analysis,” in
IEEE International Test Conference, Anaheim, California, 2012, pp. 1–75.
[103] J. Chen, “Mining AC delay measurements for understanding speed-limiting paths,”
in International Test Conference, Austin, Texas, 2010, pp. 553–562.
[104] H. Liu, “On learning-based methods for design-space exploration with high-level
synthesis,” in Design Automation Conference, California, 2013, pp. 51–57.
[105] S. Ward, “PADE: A high-performance placer with automatic datapath extraction
and evaluation through high-dimensional data learning,” in Design Automation
Conference, California, 2012, pp. 756–761.
[106] L. Wang, “Data mining in design and test processes: Basic principles and
promises,” in ISPD, California, 2013, pp. 41–42.
[107] M. Kurek and W. Luk, “Parametric Reconfigurable Designs with Machine Learning
Optimizer,” in IEEE Int’l Conference on Field Programmable Technology,
Seoul, Korea, December 2012, pp. 109–112.
[108] A. Monostori, H. Fruhauf, and G. Kokai, “Quick Estimation of Resources of FPGAs
and ASICs Using Neural Networks,” in LWA, December 2005, pp. 201–215.
[109] J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B.-H. Park, “Dynamic meta-learning
for failure prediction in large-scale systems: A case study,” in Int. Conference
on Parallel Processing, 2008, pp. 157–164.
[110] A. Lifa, P. Eles, and Z. Peng, “Dynamic configuration prefetching based on piecewise
linear prediction,” in Design, Automation Test in Europe Conference Exhibition
(DATE), 2013, pp. 815–820.
[111] G. Mariani, V. Sima, G. Palermo, V. Zaccaria, C. Silvano, and K. Bertels, “Using
multi-objective design space exploration to enable run-time resource management
for reconfigurable architectures,” in Design Automation and Test in Europe Conference
and Exhibition (DATE), 2012, pp. 1379–1384.
[112] R. H. Y. M. K. Z. J. Bian, “ISBA: An independent set-based algorithm for automated
partial reconfiguration module generation,” in IEEE/ACM International
Conference on Computer-Aided Design (ICCAD), 2012, pp. 500–507.
[113] V. Lozin and M. Milanic, “A polynomial algorithm to find an independent set
of maximum weight in a fork-free graph,” Journal of Discrete Algorithms, vol. 6,
no. 4, pp. 595–604, December 2008.
[114] R. Meeuws, S. Ostazadeh, C. Galuzzi, V. Sima, R. Nane, and K. Bertels, “Quipu:
A statistical model for predicting hardware resources,” ACM Transactions on Reconfigurable
Technology and Systems, vol. 6, no. 1, pp. 3:1–3:25, 2013.
[115] G. Mariani, V. Sima, G. Palermo, V. Zaccaria, G. Marchiori, C. Silvano, and
K. Bertels, “Run-time optimization of a dynamically reconfigurable embedded
system through performance prediction,” Porto, Portugal, pp. 1–8, September
2013.
[116] V. Sima and K. Bertels, “Runtime decision of hardware or software execution on
a heterogeneous reconfigurable platform,” Rome, Italy, pp. 1–6, May 2009.
[117] G. Mariani, G. Palermo, R. Meeuws, V. Sima, C. Silvano, and K. Bertels, “Druid:
Designing reconfigurable architectures with decision-making support,” Suntec,
Singapore, pp. 213–218, January 2014.
[118] D. Cordeiro, M. Gregory, S. Perarnau, D. Trystram, J.-M. Vincent, and F. Wagner,
“Random graph generation for scheduling simulations,” in Proceedings of the 3rd
International ICST Conference on Simulation Tools and Techniques, ser. SIMUTools
’10. Brussels, Belgium: ICST, 2010, pp. 60:1–60:10.
[119] “Dot (graph description language),” http://www.graphviz.org/Documentation.php.
[120] A. Al-wattar, S. Areibi, and G. Grewal, “rcSimulator, a simulator for reconfigurable
operating systems,” https://github.com/Aalwattar/rcSimulator.
[121] G. Vera, D. Llamocca, S. Pattichis, and J. Lyke, “A dynamically reconfigurable
computing model for video processing applications,” in Signals, Systems and
Computers Conference, Pacific Grove, CA, November 2009, pp. 327–331.
[122] L. Cai, S. Huang, Z. Lou, and H. Peng, “Measurement method of the system
reconfigurability,” Journal of Communications and Information Sciences, vol. 3,
no. 3, pp. 1–13, 2013.
[123] F. Redaelli, M. D. Santambrogio, and D. Sciuto, “Task scheduling with configuration
prefetching and anti-fragmentation techniques on dynamically reconfigurable
systems,” in Design, Automation and Test in Europe, 2008. DATE ’08, 2008, pp.
519–522.
[124] F. Redaelli, M. D. Santambrogio, and S. O. Memik, “An ILP formulation for the
task graph scheduling problem tailored to bi-dimensional reconfigurable architectures,”
in Reconfigurable Computing and FPGAs, 2008. ReConFig ’08. International
Conference on, 2008, pp. 97–102.
[125] A. Elhossini, J. Huissman, B. Debowski, S. Areibi, and R. Dony, “An efficient
scheduling methodology for heterogeneous multi-core processor systems,” in Microelectronics
(ICM), 2010 International Conference on, 2010, pp. 475–478.
[126] “Express benchmarks: Electrical & Computer Engineering Department at the UCSB,
USA,” http://express.ece.ucsb.edu/benchmark/.
[127] J. R. M. Potkonjak, “Optimizing resource utilization using transformation,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
vol. 13, no. 3, pp. 277–292, March 1994.
[128] M. H. E. F. G. H. B. P. P. R. I. Witten, “The WEKA data mining software: an update,”
SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18, November 2009.
[129] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning
Tools and Techniques, 3rd ed. Amsterdam: Morgan Kaufmann, 2011.
[130] S. Amari and S. Wu, “Improving support vector machine classifiers by modifying
kernel functions,” Journal of Neural Networks, vol. 12, no. 6, pp. 783–789, July
1999.
[131] L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence Review, vol. 33,
no. 1, pp. 1–39, February 2010.
[132] H. Zhang, “The optimality of naive Bayes,” in FLAIRS, Florida, 2004, pp. 562–567.
[133] P. Wasserman and T. Schwartz, “Neural networks: What are they and why is everybody
so interested in them now?” IEEE Expert, vol. 3, no. 1, pp. 10–15, 1988.
[134] P. Hall and R. Samworth, “Choice of neighbor order in nearest neighbor classification,”
Annals of Statistics, vol. 36, no. 5, pp. 2135–2152, 2008.
[135] D. Beasley, D. R. Bull, and R. R. Martin, “An overview of genetic algorithms:
Part 1, fundamentals,” 1993.
[136] S. H. Park, J. M. Goo, and C.-H. Jo, “Receiver operating characteristic (ROC) curve:
practical review for radiologists,” Korean Journal of Radiology, vol. 5, no. 1,
pp. 11–18.
[137] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver
operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, April
1982.
[138] J. Hilden, “The area under the ROC curve and its competitors,” Medical Decision
Making, vol. 11, no. 2, pp. 95–101, 1991.
[139] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140,
1996.
[140] R. Schapire, “The boosting approach to machine learning: An overview,” in MSRI
Workshop on Nonlinear Estimation and Classification, Florida, 2002, pp. 1–23.
[141] “Power methodology guide,” UG786 (v13.1), March 1, 2011.
[142] “Vivado design suite user guide: power analysis and optimization,”
UG907 (v2012.2), July 25, 2012.
[143] “Fusion digital power designer version 1.8230,” June 12, 2012.
Appendix A
Scheduling and Placement Restriction
In this appendix, RCSched-I, RCSched-II, and RCSched-III are evaluated with respect to
the RCSched-Base scheduler described in Section 5.2.4. The problem is examined using two
PRR floorplans (uniform and non-uniform). In addition to PRR uniformity, the schedulers'
performance is evaluated under task placement restrictions. In many cases a task
can be placed only at specific locations; such placement restrictions can be due to size, I/O,
or communication constraints. The benchmark used in this stage is larger, with a wide
variety of DFGs, as described in Section 5.4.1.
A.1 Total Number of Reconfigurations
The total number of reconfigurations represents how frequently the system uses
the ICAP controller to download a bitstream onto the FPGA fabric. The total number
of reconfigurations is related to the reconfiguration time, yet the two differ on a non-uniform
platform: an intelligent scheduler may perform a higher number of reconfigurations but tends
to use smaller PRRs, which leads to a lower reconfiguration time.
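The distinction between the two metrics can be sketched as follows. This is an illustrative model only, not the thesis scheduler: the PRR sizes, the bytes-per-unit scale, and the ICAP bandwidth are all assumed values, chosen so that bitstream size grows with PRR size.

```python
# Hypothetical sketch: why the reconfiguration COUNT and the reconfiguration
# TIME can diverge on a non-uniform floorplan. Bitstream size scales with the
# PRR size, so a scheduler that reconfigures more often, but into smaller
# PRRs, can still spend less total time on the ICAP.

ICAP_BANDWIDTH = 400e6  # bytes/s -- assumed ICAP throughput


def reconfiguration_time(prr_sizes_used, bytes_per_unit=100_000):
    """Total ICAP time for a sequence of reconfigurations.

    prr_sizes_used -- relative size of the PRR chosen for each reconfiguration
    """
    total_bytes = sum(size * bytes_per_unit for size in prr_sizes_used)
    return total_bytes / ICAP_BANDWIDTH


# Scheduler A: fewer reconfigurations, always into the largest PRR (size 4).
time_a = reconfiguration_time([4, 4, 4])
# Scheduler B: more reconfigurations, always into a small PRR (size 1).
time_b = reconfiguration_time([1, 1, 1, 1, 1])

assert time_b < time_a  # B reconfigures more often, yet uses less ICAP time
```

Under these assumptions, scheduler B performs five reconfigurations against A's three, yet its total reconfiguration time is lower, which is the behaviour described above for the non-uniform platform.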
Non-Restricted Placement: From Figure A.1, it can be noticed that the number of
reconfigurations is inversely proportional to the number of PRRs. Implementations with fewer
than three PRRs showed no significant difference among the three schedulers. RCSched-III's
improved performance can be attributed to its higher number of hardware-to-software
migrations.
Non-Uniform:
  # of PRRs      1     2     3     4     5
  RCSched-I    39.7  51.4  55.4  61.9  67.1
  RCSched-II   39.5  51.6  54.9  56.8  53.7
  RCSched-III  37.9  49.0  50.2  47.6  42.9
  Baseline     77.9  80.4  82.6  84.6  86.0

Uniform:
  # of PRRs      1     2     3     4     5
  RCSched-I    35.5  53.0  62.4  67.2  70.0
  RCSched-II   35.5  52.8  58.1  57.1  45.2
  RCSched-III  34.9  44.2  42.8  42.2  37.7
  Baseline     63    75    83    85    67

Figure A.1: Total Number of Reconfigurations: Non-Restricted Placement
Restricted Placement: As expected, the results obtained indicate that RCSched-I is
inferior to RCSched-II and RCSched-III, as shown in Figure A.2.
Non-Uniform:
  # of PRRs      1     2     3     4     5
  RCSched-I    47.3  51.1  50.0  46.8  42.9
  RCSched-II   48.3  49.8  46.7  37.6  30.6
  RCSched-III  48.5  49.9  45.7  39.0  32.0
  Baseline     61.4  64.6  66.9  68.0  68.0

Uniform:
  # of PRRs      1     2     3     4     5
  RCSched-I    46.5  52.6  49.4  45.5  42.8
  RCSched-II   49.4  50.5  45.8  37.2  29.2
  RCSched-III  46.5  49.8  43.9  36.8  30.3
  Baseline     61    67    68    70    70

Figure A.2: Total Number of Reconfigurations: Restricted Placement
A.2 Number of Hardware Task Reuse
Figures A.3 and A.4 show the total number of task reuses; reusing a hardware task
avoids a reconfiguration and minimizes the total execution time.
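The reuse check can be sketched as below. This is a minimal model with an assumed data representation (a list of PRR records), not the thesis implementation; size and placement-restriction checks are deliberately omitted.

```python
# Hypothetical sketch of the reuse check behind the reuse counts: before
# downloading a bitstream, the scheduler scans the PRRs for one that is idle
# and already configured with the requested task type, which avoids a
# reconfiguration entirely.

def place_task(prrs, task_type):
    """Return (prr_index, reused). Each PRR is a dict with 'task_type'
    (currently loaded bitstream) and a 'busy' flag."""
    # 1. Reuse: an idle PRR that already holds this task type.
    for i, prr in enumerate(prrs):
        if not prr["busy"] and prr["task_type"] == task_type:
            return i, True
    # 2. Otherwise reconfigure any idle PRR (size/placement checks omitted).
    for i, prr in enumerate(prrs):
        if not prr["busy"]:
            prr["task_type"] = task_type
            return i, False
    return None, False  # no free PRR: caller may migrate to software


prrs = [{"task_type": "FIR", "busy": False},
        {"task_type": "FFT", "busy": False}]
assert place_task(prrs, "FFT") == (1, True)   # reused, no reconfiguration
assert place_task(prrs, "DCT") == (0, False)  # reconfiguration required
```

With more PRRs there are more chances that a requested task type is already resident, which is consistent with the reuse counts growing with the number of PRRs in Figures A.3 and A.4.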
Non-Restricted Placement: Figure A.3 clearly shows that RCSched-II has more reuse
than RCSched-III, yet Figure 5.13 shows that RCSched-II has a higher overall time in most
cases. This is due to a lower hardware-to-software task migration ratio for RCSched-II.
Non-Uniform:
  # of PRRs      1     2     3     4     5
  RCSched-I     8.7   9.1  12.8  13.6  15.4
  RCSched-II    9.4  10.0  13.4  18.6  27.7
  RCSched-III   8.8  10.3  13.4  17.7  24.5
  Baseline      0     0     0     0     0

Uniform:
  # of PRRs      1     2     3     4     5
  RCSched-I     8.9  10.8  13.6  15.3  15.2
  RCSched-II    8.0  11.0  17.5  25.4  28.9
  RCSched-III   8.0  11.5  18.0  21.5  27.6
  Baseline      0     0     0     0     0

Figure A.3: Total Number of HW Reuse: Non-Restricted Placement
Restricted Placement: As more restrictions are imposed on placement, the performance
of the schedulers tends to converge. RCSched-I has the worst results,
while RCSched-II and RCSched-III were almost identical, as shown in Figure A.4.
A.3 Busy Counter
When there are no free hardware processing elements, the schedulers tend to migrate a
task from hardware to software, when applicable, rather than incrementing the busy counter.
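The dispatch decision just described can be sketched as follows. Names and the fixed wait penalty are illustrative assumptions, not taken from the thesis code: the point is only that a software fallback absorbs the task without the busy counter growing.

```python
# Hedged sketch of the busy-counter bookkeeping: when every PRR is occupied,
# the scheduler either migrates the task to a software processing element
# (if a software implementation exists) or stalls the task and accrues
# idle (busy-counter) time while waiting for a PRR to free up.

def dispatch(task, free_prrs, has_sw_version, busy_counter, wait_cycles=10):
    """Return (placement, updated_busy_counter)."""
    if free_prrs > 0:
        return "hardware", busy_counter           # normal HW placement
    if has_sw_version:
        return "software", busy_counter           # migrate; counter untouched
    return "stalled", busy_counter + wait_cycles  # task must wait for a PRR


assert dispatch("t1", 1, False, 0) == ("hardware", 0)
assert dispatch("t2", 0, True, 0) == ("software", 0)
assert dispatch("t3", 0, False, 0) == ("stalled", 10)
```

A scheduler that migrates aggressively therefore reports a lower busy counter for the same workload, which is the effect measured in Figures A.5 and A.6.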
Non-Uniform:
  # of PRRs      1     2     3     4     5
  RCSched-I    14.1  14.5  17.0  20.7  25.0
  RCSched-II   14.2  17.0  21.9  31.7  39.3
  RCSched-III  13.8  16.9  22.4  30.2  37.3
  Baseline      0     0     0     0     0

Uniform:
  # of PRRs      1     2     3     4     5
  RCSched-I    15.4  14.1  18.5  23.4  26.5
  RCSched-II   13.4  15.8  22.8  32.0  40.6
  RCSched-III  14.2  16.2  24.1  32.1  38.8
  Baseline      0     0     0     0     0

Figure A.4: Total Number of HW Reuse: Restricted Placement
Figures A.5 and A.6 show that the busy counter (idle timer) is inversely proportional
to the number of PRRs available in the system.
Non-Restricted Placement: RCSched-III has a lower idle period compared to RCSched-II
due to more optimized utilization of resources when the number of PRRs exceeds three,
as shown in Figure A.5.
Idle Time Measurements (Non Uniform): idle period vs. # of PRRs

# of PRRs        1      2      3     4     5
RCSched-I    45682  24651  13239  4652  1755
RCSched-II   64324  32979  17113  6811  2829
RCSched-III  51967  25967  12782  5970  2238
Baseline     74767  27333   7720  1795     0

Idle Time Measurements (Uniform): idle period vs. # of PRRs

# of PRRs        1      2      3     4     5
RCSched-I    41846  16762   7481  1901   353
RCSched-II   39738  15781   7004  2433  7121
RCSched-III  59590  20005   7872  2508   925
Baseline     52389  14188   3105   541  4159

Figure A.5: Idle Time Measurement: Non-Restricted Placement
Restricted Placement: RCSched-I has the worst results, while RCSched-II and RCSched-III are almost identical, as shown in Figure A.6.
Idle Time Measurements (Non Uniform): idle period vs. # of PRRs

# of PRRs        1      2      3     4      5
RCSched-I    54007  18334   6999  4859   4297
RCSched-II   53779  21412  10108  5696   6531
RCSched-III  79747  27481  13420  8620  10239
Baseline     54176  17767   5378  2401   1913

Idle Time Measurements (Uniform): idle period vs. # of PRRs

# of PRRs        1      2      3     4     5
RCSched-I    59202  19435   8475  6475  5969
RCSched-II   55099  18372   8883  5604  6696
RCSched-III  77509  22668  11702  8554  8744
Baseline     57927  14317   4595  1629  2492

Figure A.6: Idle Time Measurement: Restricted Placement
A.4 Hardware to Software Migration
The hardware-to-software migration parameter measures the total number of tasks that migrate from hardware to software.
Non-Restricted Placement: Recall from Section 5.4.1 that all tasks within a DFG were initialized as hybrid hardware tasks. Therefore, any task that was processed on the GPP was due to migration.
Figure A.7 clearly indicates that RCSched-III has a higher number of hardware-to-software migrations, since it performs migration based on an intelligent criterion. RCSched-I and RCSched-II, on the other hand, only migrate tasks to software when there is a lack of hardware processing elements.
Restricted Placement: RCSched-I has the worst performance (it has similar perfor-
mance to the baseline scheduler), while RCSched-II and RCSched-III were almost iden-
tical as shown in Figure A.8.
# of HW to SW Migration (Non Uniform): number of tasks vs. # of PRRs

# of PRRs       1     2     3     4     5
RCSched-I    37.6  25.5  17.9  10.6   3.5
RCSched-II   37.1  24.4  17.7  10.6   4.7
RCSched-III  39.4  26.7  22.4  20.7  18.6
Baseline      8.1   5.6   3.4   1.4   0.0

# of HW to SW Migration (Uniform): number of tasks vs. # of PRRs

# of PRRs       1     2     3     4     5
RCSched-I    41.6  22.3  10.0   3.5   0.9
RCSched-II   42.6  22.2  10.5   3.6  12.0
RCSched-III  43.1  30.3  25.2  22.3  20.7
Baseline       23    11     3     1     5

Figure A.7: Number of hardware to software Task Migration: Non-Restricted Placement
# of HW to SW Migration (Non Uniform): number of tasks vs. # of PRRs

# of PRRs       1     2     3     4     5
RCSched-I    23.1  19.0  17.5  17.0  16.7
RCSched-II    9.5   5.2   3.5   2.7   2.1
RCSched-III   9.7   5.2   4.0   2.8   2.8
Baseline     23.2  19.9  17.6  16.5  16.6

# of HW to SW Migration (Uniform): number of tasks vs. # of PRRs

# of PRRs       1     2     3     4     5
RCSched-I    24.1  19.3  18.1  17.2  16.8
RCSched-II    9.3   5.8   3.4   2.8   2.3
RCSched-III  11.3   6.1   4.0   3.1   2.9
Baseline       25    19    18    16    16

Figure A.8: Number of hardware to software Task Migration: Restricted Placement
Appendix B
Architecture Library
The architecture library used in Chapter 6 is shown below:
Name = " Architecture Library v1.0"
Date = "Jul 03, 2013"
######################
Task Task1 {
id = 1
arch arch1 {
exec_time = 20
config_time = 5
config_power = 3
exec_power = 10
columns = 1
rows = 1
mode = HW
}
arch arch2 {
exec_time = 10
config_time = 10
config_power = 5
exec_power = 5
columns = 1
rows = 2
mode = HW
}
arch arch3 {
exec_time = 5
config_time = 20
config_power = 5
exec_power = 3
columns = 2
rows = 2
mode = HW
}
}
######################
Task Task2 {
id = 2
arch arch1{
exec_time = 30
config_time = 5
config_power = 3
exec_power = 20
columns = 1
rows = 1
mode = HW
}
arch arch2 {
exec_time = 20
config_time = 10
config_power = 5
exec_power = 15
columns = 2
rows = 1
mode = HW
}
arch arch3 {
exec_time = 10
config_time = 20
config_power = 10
exec_power = 10
columns = 2
rows = 2
mode = HW
}
arch arch4 {
exec_time = 5
config_time = 30
config_power = 15
exec_power = 5
columns = 3
rows = 2
mode = HW
}
}
######################
Task Task3 {
id = 3
arch arch1 {
exec_time = 125
config_time = 20
config_power = 10
exec_power = 70
columns = 2
rows = 2
mode = HW
}
arch arch2 {
exec_time = 75
config_time = 30
config_power = 15
exec_power = 50
columns = 3
rows = 2
mode = HW
}
arch arch3 {
exec_time = 20
config_time = 20
config_power = 10
exec_power = 40
columns = 4
rows = 1
mode = HW
}
arch arch4 {
exec_time = 30
config_time = 75
config_power = 35
exec_power = 30
columns = 5
rows = 3
mode = HW
}
arch arch5 {
exec_time = 20
config_time = 125
config_power = 65
exec_power = 25
columns = 5
rows = 5
mode = HW
}
}
######################
Task Task4 {
id = 4
arch arch1 {
exec_time = 40
config_time = 75
config_power = 35
exec_power = 70
columns = 3
rows = 4
mode = HW
}
arch arch2 {
exec_time = 30
config_time = 125
config_power = 65
exec_power = 50
columns = 5
rows = 5
mode = HW
}
}
######################
Task Task5 {
id = 5
arch arch1 {
exec_time = 15
config_time = 5
config_power = 3
exec_power = 10
columns = 1
rows = 1
mode = HW
}
arch arch2 {
exec_time = 10
config_time = 10
config_power = 5
exec_power = 5
columns = 1
rows = 2
mode = HW
}
arch arch3 {
exec_time = 5
config_time = 20
config_power = 10
exec_power = 3
columns = 2
rows = 2
mode = HW
}
}
######################
Task Task6 {
id = 6
arch arch1 {
exec_time = 40
config_time = 5
config_power = 3
exec_power = 30
columns = 1
rows = 1
mode = HW
}
arch arch2 {
exec_time = 30
config_time = 10
config_power = 5
exec_power = 15
columns = 2
rows = 1
mode = HW
}
arch arch3 {
exec_time = 15
config_time = 20
config_power = 10
exec_power = 10
columns = 2
rows = 2
mode = HW
}
}
######################
Task Task7 {
id = 7
arch arch1 {
exec_time = 75
config_time = 20
config_power = 10
exec_power = 50
columns = 2
rows = 2
mode = HW
}
arch arch2 {
exec_time = 50
config_time = 40
config_power = 20
exec_power = 25
columns = 4
rows = 2
mode = HW
}
}
######################
Task Task8 {
id = 8
arch arch1 {
exec_time = 30
config_time = 75
config_power = 35
exec_power = 60
columns = 3
rows = 4
mode = HW
}
arch arch2 {
exec_time = 20
config_time = 125
config_power = 65
exec_power = 55
columns = 5
rows = 5
mode = HW
}
}
######################
Task Task9 {
id = 9
arch arch1 {
exec_time = 40
config_time = 5
config_power = 3
exec_power = 30
columns = 1
rows = 1
mode = HW
}
arch arch2 {
exec_time = 30
config_time = 10
config_power = 5
exec_power = 15
columns = 2
rows = 1
mode = HW
}
arch arch3 {
exec_time = 15
config_time = 20
config_power = 10
exec_power = 10
columns = 2
rows = 2
mode = HW
}
}
######################
Task Task10 {
id = 10
arch arch1 {
exec_time = 75
config_time = 20
config_power = 10
exec_power = 50
columns = 2
rows = 2
mode = HW
}
arch arch2 {
exec_time = 50
config_time = 40
config_power = 20
exec_power = 25
columns = 4
rows = 2
mode = HW
}
}
######################
Task Task11 {
id = 11
arch arch1 {
exec_time = 30
config_time = 35
config_power = 20
exec_power = 60
columns = 3
rows = 4
mode = HW
}
arch arch2 {
exec_time = 20
config_time = 75
config_power = 35
exec_power = 55
columns = 5
rows = 5
mode = HW
}
}
######################
Task Task12 {
id = 12
arch arch1{
exec_time = 30
config_time = 5
config_power = 3
exec_power = 20
columns = 1
rows = 1
mode = HW
}
arch arch2 {
exec_time = 20
config_time = 10
config_power = 5
exec_power = 15
columns = 2
rows = 1
mode = HW
}
arch arch3 {
exec_time = 10
config_time = 20
config_power = 10
exec_power = 10
columns = 2
rows = 2
mode = HW
}
arch arch4 {
exec_time = 5
config_time = 30
config_power = 15
exec_power = 5
columns = 3
rows = 2
mode = HW
}
}
######################
Task Task13 {
id = 13
arch arch1 {
exec_time = 125
config_time = 20
config_power = 10
exec_power = 70
columns = 2
rows = 2
mode = HW
}
arch arch2 {
exec_time = 75
config_time = 30
config_power = 15
exec_power = 50
columns = 3
rows = 2
mode = HW
}
arch arch3 {
exec_time = 20
config_time = 20
config_power = 10
exec_power = 40
columns = 4
rows = 1
mode = HW
}
arch arch4 {
exec_time = 30
config_time = 75
config_power = 35
exec_power = 30
columns = 5
rows = 3
mode = HW
}
arch arch5 {
exec_time = 20
config_time = 125
config_power = 65
exec_power = 25
columns = 5
rows = 5
mode = HW
}
}
######################
Task Task14 {
id = 14
arch arch1 {
exec_time = 40
config_time = 75
config_power = 35
exec_power = 70
columns = 3
rows = 4
mode = HW
}
arch arch2 {
exec_time = 30
config_time = 125
config_power = 65
exec_power = 50
columns = 5
rows = 5
mode = HW
}
}
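The listing above uses a simple nested block format: a `Task` block containing one or more `arch` blocks of `key = value` pairs, separated by `#` comment lines. As an illustration only, a minimal parser for this format might look as follows (inferred from the listing; this is not the actual tool used in the thesis):

```python
import re

def parse_library(text):
    """Parse the Task/arch block format into nested dictionaries."""
    tasks, stack = {}, []
    for raw in text.splitlines():
        line = raw.split("#")[0].strip()      # drop comment/separator lines
        if not line:
            continue
        m = re.match(r"(Task|arch)\s+(\w+)\s*{", line)
        if m:                                  # open a Task or arch block
            block = {}
            if m.group(1) == "Task":
                tasks[m.group(2)] = block
            else:
                stack[-1][m.group(2)] = block
            stack.append(block)
        elif line == "}":                      # close the current block
            stack.pop()
        elif "=" in line:
            key, value = (s.strip() for s in line.split("=", 1))
            # top-level header entries (Name, Date) are outside any block
            block = stack[-1] if stack else {}
            block[key] = int(value) if value.isdigit() else value.strip('"')
    return tasks
```

With this sketch, `parse_library(...)["Task1"]["arch1"]["exec_time"]` would return the integer execution time of Task1's first architecture variant.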
Appendix C
FPGA Power Measurements
Modern state-of-the-art FPGAs require multiple power supplies. The use of multiple voltages for different FPGA resources increases performance (signal strength) while at the same time increasing system immunity to noise and parasitic effects [141].
Based on Xilinx's power classification, the total power requirement for each supply source of an FPGA depends on three components [141]:
Device Static Power: transistor leakage power that is required for the device to operate and be ready for programming.
Design Static Power: additional power consumed when the device is configured but not active.
Design Dynamic Power: additional power drawn from design activity. This component is a function of the voltage levels, logic, and routing used.
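As a rough illustration, the three components add up to the total power drawn from a supply, and the dynamic component follows the standard CMOS switching-power model P_dyn ≈ α·C·V²·f. The function names and numeric values in the sketch below are illustrative assumptions, not Xilinx data:

```python
def dynamic_power(alpha, c_farads, v_volts, f_hz):
    """Classic CMOS switching power: activity * capacitance * V^2 * frequency."""
    return alpha * c_farads * v_volts ** 2 * f_hz

def total_supply_power(device_static, design_static, design_dynamic):
    """Total power of one FPGA supply = sum of the three components."""
    return device_static + design_static + design_dynamic

# Hypothetical example: 0.5 W leakage, 0.2 W configured-but-idle power,
# plus the switching power of the active design.
p_dyn = dynamic_power(alpha=0.15, c_farads=2e-9, v_volts=1.0, f_hz=100e6)  # ~0.03 W
p_total = total_supply_power(0.5, 0.2, p_dyn)                              # ~0.73 W
```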
The power consumed by an FPGA can either dissipate as heat due to internal activities, or be delivered to off-chip peripherals through the I/O pins and then dissipated by miscellaneous chip components [141].
A Xilinx FPGA goes through several power phases from power-on to power-down, and each phase has a different power requirement. The power phases, in order, are: Power-On, Configuration, Standby, Active, Suspend, and Hibernate. For a detailed explanation of these power modes and the Xilinx power requirements, refer to [141, 142].
C.1 Power management of the VC707 board
The Xilinx VC707 board uses several power regulators and many supporting chips in order to meet the power requirements of the FPGA and the other system components. The power distribution diagram is shown in Figure C.1. For a complete description of the power supply system of the VC707 board, refer to [2].
Figure C.1 clearly indicates that three power controllers (U42, U43 and U64) are involved. The power regulators are PMBus-compliant digital PWM system controllers from Texas Instruments (UCD9248PFC), and are mainly used to supply the core voltages of the FPGA. The Texas Instruments UCD9248PFC controller can be used to monitor voltage, current, and power for different power rails. Table C.1 shows the power resources supplied by each power rail. For detailed information about the role of the TI controllers in the VC707 board, refer to [2]. Texas Instruments' Fusion Digital Power graphical user interface is free software used to monitor real-time power readings of the UCD9248PFC controllers. A TI USB adapter (TI part number EVM USB-TP-GPIO) is needed to convert the signals of the PMBus on the board side to a standard USB signal suitable for the PC software (Fusion Digital Power Designer).
Figure C.1: Power distribution of Xilinx VC707 Board from [2].
Schematic rail name (power name) for each controller:

Rail  Controller at addr. 52    Controller at addr. 53     Controller at addr. 54
1     VCCINT_FPGA (VCCINT)      VCC2V5_FPGA (VCCO 2.5V)    VCCAUX_IO (VCCAUX_IO)
2     VCCAUX (VCCAUX)           VCC1V5 (VCCO 1.5V)         VCC_BRAM (VCC_BRAM)
3     VCC3V3 (VCCO 3.3V)        MGTAVCC                    VMGTVCCAUX
4     VCCADJ                    VMGTAVTT                   VCC1V8_FPGA (VCCO 1.8V)

Table C.1: Power Rail Specification for UCD9248 PMBus Controllers.
C.2 Methodology
In this appendix two different methods were used to monitor the internal FPGA power supplies. The first approach is based on using the TI power controllers via the TI USB adapter and TI Fusion software. The second approach is based on using the internal System Monitor component via the standard JTAG cable. The former has the advantage of monitoring more power supplies via voltage and current, and hence power. The latter can only monitor some of the internal voltages and the junction temperature; however, it does not require special adapters or software, and can therefore be applied to any FPGA with a built-in System Monitor component.
C.2.1 Monitoring FPGA resources using TI UCD9248PFC Power Controllers
To monitor and measure current, voltage, and accordingly power for the VC707 board, we need Texas Instruments' Fusion Digital Power graphical user interface, which can be downloaded for free from [143]. To connect the board to a PC, Texas Instruments' USB to GPIO adapter (TI part number EVM USB-TP-GPIO), shown in Figure C.2, is needed.
The steps to connect and measure system metrics using the TI Fusion software are:
1. Download and install TI Fusion Digital Power Designer from [143].
2. Connect the TI USB adapter to the PMBus connector (J5) on the VC707 board as shown in Figure C.3, and the USB end to a Windows-based computer.
3. Ensure the VC707 board is powered ON and run the TI Fusion software.
Figure C.2: Texas Instruments USB to GPIO adapter
Figure C.3: Xilinx VC707 board; the TI adapter should be connected to the highlighted PMBus connector (J5)
4. If this is the first time running the TI Fusion software, an informative window about the software modes will be shown; press OK, as shown in Figure C.4.
Figure C.4: Window showing Fusion software modes; press OK
5. A window showing the Device Scanning mode will pop up; click the first link (UCD Controller and Sequencers, Isolated Controller) as shown in Figure C.5. The software will scan for compatible power controllers connected to the PMBus, and should then detect three controllers at addresses 52, 53, and 54 respectively.
6. The main window of the TI Fusion software will show next; click on Monitor (bottom left) and select the device and rail to be monitored (top right) as shown in Figure C.6. Table C.1 shows the association between power rails and power sources for each controller.
Figure C.5: Select Device, click the first link
Figure C.6: Fusion Monitor View (measurements can be observed). The monitor section on the top left shows or hides real-time graphs; the device and rail can be changed from the top right corner.
7. A user can select/deselect the metrics to be monitored (input current, output currents, voltage and temperature) from the top left panel, as shown in Figure C.6 and Figure C.7.
Figure C.7: Fusion Monitor View; the Rail dashboard shows real-time readings of all metrics, highlighted on the top left side.
8. The active power rail/device under measurement can be changed from the top right drop-down list as shown in Figure C.8. Sometimes the TI Fusion software resets the view to the Configure view following a rail change; simply change it back to the Monitor view (bottom left) as shown in Figure C.7.
9. In order to view real-time measurements of all rails for all devices, press the System Dashboard button as shown in Figure C.7 (the Monitor view should be active); a new window will pop up as shown in Figure C.9.
Figure C.8: Select the rail and device to be monitored.
Figure C.9: Fusion Monitor View; the Rail dashboard shows real-time readings of all metrics, highlighted on the top left side.
The software is self-explanatory, and the steps described above are sufficient to monitor the core power sources supplying the Virtex-7 FPGA on the VC707 board.