
Page 1: Efficient Scheduling, Mapping and Resource Prediction for

Efficient Scheduling, Mapping and Resource Prediction

for Dynamic Run time Operating Systems

by

Ahmed Al-Wattar

A Thesis

presented to

The University of Guelph

In partial fulfilment of requirements

for the degree of

Doctor of Philosophy

in

Engineering

Guelph, Ontario, Canada

© Ahmed Al-Wattar, April 2015


ABSTRACT

Efficient Scheduling, Mapping and Resource Prediction for Dynamic

Run time Operating Systems

Ahmed Al-Wattar

University of Guelph, 2015

Advisor:

Professor Shawki Areibi

Several embedded application domains for reconfigurable systems, such as image processing, wearable

computing and network processors, tend to combine frequent changes with high performance

demands of their workloads. Time multiplexing of reconfigurable hardware resources

raises a number of new issues, ranging from run-time systems to complex programming models

that usually form a Reconfigurable hardware Operating System (ROS). In this thesis, a novel ROS

framework that aids the designer from the early design stages all the way down to the hardware

implementation is proposed. An efficient reconfigurable platform was implemented along with

several novel scheduling algorithms. The algorithms proposed tend to reuse hardware tasks to

reduce reconfiguration overhead, migrate tasks between software/hardware to efficiently utilize

resources and reduce computation time. A framework for efficient mapping of execution units to

task graphs in a runtime reconfigurable system is also designed. The framework utilizes an

Island Based Genetic Algorithm flow that optimizes several objectives including performance, area

and power consumption. The proposed Island based GA framework achieves on average 55.2%

improvement over a single GA implementation and 80.7% improvement over a baseline random

allocation and binding approach. Finally, we present a novel adaptive and dynamic methodology

based on a Machine Learning approach for predicting and estimating the necessary resources for

an application based on past historical information. An important feature of the proposed

methodology is that the system is able to learn and generalize and, therefore, is expected to

improve its accuracy over time.


Acknowledgements

I would like to express my deep gratitude and special appreciation to my advisor Dr.

Shawki Areibi for his help and support during every moment of my work and for the

guidance he provided me while carrying out this research. Your advice on both research

as well as on my career has been priceless. Thank you! I would like to thank Dr. Gary

Grewal for his valuable advice and feedback that greatly helped me enhance the quality

of my work and my thesis. Without them, my work would hardly have been possible

to complete. I would also like to thank my committee member Dr. Radu Muresan for

the kindness and assistance he provided at all levels of the research project. I would

like to express my thanks to the faculty and staff in the school of Engineering for their

support and help. I would like to give special thanks to my colleagues in the ISLAB and my

friends at the University of Guelph for their encouragement, support and the great time

we spent together, special thanks to Elisha Colmenar, Ziad Abuowaimer, Omar Ahmed,

and Cynthia Mason. Finally, special thanks to my family. Words cannot express how

grateful I am to my mother, father, brothers and sister, for all of the sacrifices that you

have made on my behalf. Your prayer for me was what sustained me thus far.


List of Publications

• Journal Papers:

1. “An Efficient Framework for Floor-plan Prediction of Dynamic Runtime Reconfigurable Systems,” submitted to the International Journal of Reconfigurable and Embedded Systems (IJRES), February 19, 2015.

• Conference Papers:

1. “Efficient Mapping of Execution Units to Task Graphs using an Evolutionary Framework,” accepted for publication at the International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART2015), Boston, MA, USA, June 12, 2015.

2. “Efficient On-line Hardware/Software Task Scheduling for Dynamic Run-time Reconfigurable Systems,” published in the Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, pp. 401-406, 21-25 May 2012.

• Application Note:

1. “Real-Time Power Monitoring for Xilinx-based FPGAs (VC707 Board),” published in CMC Microsystems, Ontario, Canada, November 1, 2012.


Contents

1 Introduction 1

1.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Motivations and Proposed Approach . . . . . . . . . . . . . . . . . . 6

1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Background 12

2.1 Reconfigurable Computing . . . . . . . . . . . . . . . . . . . . . . . . 12

2.1.1 Data Flow Graph (DFG) . . . . . . . . . . . . . . . . . . . . 14

2.1.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.1.3 Run Time Reconfiguration (RTR) . . . . . . . . . . . . . . . . 18

2.1.4 Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . 21

2.2 Resource Management of Reconfigurable Systems . . . . . . . . . . . 27

2.2.1 Scheduling in Reconfigurable Systems . . . . . . . . . . . . . 28

2.3 Data Mining, Machine Learning and Classification . . . . . . . . . . . 30

2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


3 Literature Review 33

3.1 Partial Dynamic Reconfigurable Systems . . . . . . . . . . . . . . . . 33

3.1.1 Dynamic Partial Reconfiguration Applications . . . . . . . . . 36

3.2 Reconfigurable Operating Systems . . . . . . . . . . . . . . . . . . . . 39

3.2.1 Scheduling for Reconfigurable Operating Systems . . . . . . . 46

3.3 Execution unit Allocation and Genetic Algorithm . . . . . . . . . . . . 49

3.4 Resource Prediction and Data-Mining . . . . . . . . . . . . . . . . . . 51

3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4 Overall Methodology and Tools 57

4.1 Run-Time Reconfigurable Platform . . . . . . . . . . . . . . . . . . . 57

4.1.1 PRR Uniformity . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2 DFG Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.2.1 DFG Generator Sub-modules . . . . . . . . . . . . . . . . . . 63

4.3 Reconfigurable Simulator . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.3.1 Simulator Inputs . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3.2 Simulator Output . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.4 Methodology Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.5 Modes of Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5 Reconfigurable Online Schedulers 83

5.1 Task Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.1.1 Task Placement Models . . . . . . . . . . . . . . . . . . . . . 86

5.2 Baseline Schedulers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87


5.2.1 ILP Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.2.2 RCOffline Scheduler . . . . . . . . . . . . . . . . . . . . . . . 92

5.2.3 Meta-Offline Scheduler . . . . . . . . . . . . . . . . . . . . . . 93

5.2.4 RCSched-Base Scheduler . . . . . . . . . . . . . . . . . . . . 93

5.2.5 Baseline Schedulers Comparison . . . . . . . . . . . . . . . . . 93

5.3 Proposed Scheduling Algorithms . . . . . . . . . . . . . . . . . . . . 95

5.3.1 RCSched-I and RCSched-II . . . . . . . . . . . . . . . . . . . 96

5.3.2 RCSched-III . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.3.3 RCSched-III-Enhanced . . . . . . . . . . . . . . . . . . . . . . 103

5.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.4.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.4.2 Preliminary Stage . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.4.3 Intermediate Stage . . . . . . . . . . . . . . . . . . . . . . . . 114

5.4.4 Advanced Stage . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6 Allocation/Binding of Execution Units 130

6.1 Single Island GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

6.1.1 Initial population . . . . . . . . . . . . . . . . . . . . . . . . . 133

6.1.2 Genetic Operators . . . . . . . . . . . . . . . . . . . . . . . . 133

6.1.3 Fitness Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 135

6.2 Island Based GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

6.2.1 Power Measurements . . . . . . . . . . . . . . . . . . . . . . . 138

6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139


6.3.1 Experimental Method . . . . . . . . . . . . . . . . . . . . . . . 139

6.3.2 Convergence of Island based GA . . . . . . . . . . . . . . . . . 140

6.3.3 The Pareto Front of Island GA Framework . . . . . . . . . . . 142

6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

7 Resource Prediction 158

7.1 Data Preparation Stage . . . . . . . . . . . . . . . . . . . . . . . . . . 160

7.2 Training Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

7.3 Classification Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

7.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . 168

7.4.1 Classification Algorithms: Implementation . . . . . . . . . . . 170

7.4.2 Experimental Setup and Evaluation . . . . . . . . . . . . . . . 172

7.4.3 Results for Synthetic Benchmarks . . . . . . . . . . . . . . . . 173

7.4.4 Results for MediaBench DSP suite . . . . . . . . . . . . . . . . 177

7.4.5 Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . 179

7.4.6 Random Baseline . . . . . . . . . . . . . . . . . . . . . . . . . 184

7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

8 Conclusions and Future Work 189

8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

Bibliography 194

A Scheduling and Placement Restriction 213

A.1 Total Number of Reconfigurations . . . . . . . . . . . . . . . . . . . . 213


A.2 Number of Hardware Task Reuse . . . . . . . . . . . . . . . . . . . . . 215

A.3 Busy Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

A.4 Hardware to Software Migration . . . . . . . . . . . . . . . . . . . . . 219

B Architecture Library 223

C FPGA Power Measurements 242

C.1 Power management of the VC707 board . . . . . . . . . . . . . . . . . 243

C.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

C.2.1 Monitoring FPGA resources using TI UCD9248PFC Power Controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245


List of Tables

2.1 Comparison of Representative Computing Architectures . . . . . . . . 14

2.2 Reconfiguration Speed for different interfaces . . . . . . . . . . . . . . 25

4.1 Reconfigurable Simulator Input Parameters . . . . . . . . . . . . . . . 67

5.1 Description of task modes . . . . . . . . . . . . . . . . . . . . . . . . 84

5.2 Baseline schedulers comparison (values for Total time in # of cycles). . 95

5.3 Synthesized Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.4 MediaBench Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.5 Comparison of RCSched-I, RCSched-II and RCOffline . . . . . . . . . 108

5.6 Task type Constraints for restricted placement DFG . . . . . . . . . . . 115

5.7 Schedulers comparison for Total Execution time (No learning) . . . . . 124

5.8 Schedulers comparison for Total Execution time (after learning phase) . 125

5.9 Run-time comparison of RCOffline and RCSched-III-Enhanced . . . . 126

5.10 Schedulers comparison for Number of Hardware Reuse (No learning) . 127

5.11 Schedulers comparison for Number of Hardware Reuse (after learning) . 128

6.1 Benchmark Specifications. . . . . . . . . . . . . . . . . . . . . . . . . 140

6.2 IBGA run-time for serial and parallel implementations . . . . . . . . . 141


6.3 The Average of 10 runs for the Best Fitness Values . . . . . . . . . . . 149

6.4 Fitness values for different weights (W) . . . . . . . . . . . . . . . . . 150

6.5 Exhaustive vs. IBGA (average of 10 runs) . . . . . . . . . . . . . . . . 150

7.1 Data Flow Graphs: Statistics . . . . . . . . . . . . . . . . . . . . . . . 161

7.2 Extracted Features from DFGs . . . . . . . . . . . . . . . . . . . . . . 165

7.3 Cases used to Evaluate the Framework along with Specifications . . . . 169

7.4 Accuracy with T-test evaluation (SVM) [synthesized benchmark]. . . . 176

7.5 Accuracy with T-test evaluation (J48) [synthesized benchmark]. . . . . 176

7.6 Training time for Case # 1 for individual classifiers . . . . . . . . . . . 177

7.7 MediaBench DSP Benchmark Specifications . . . . . . . . . . . . . . 177

7.8 Accuracy with T-Test significant evaluation [synthesized benchmark] . . 181

7.9 Training time for different classification algorithms . . . . . . . . . . . 181

7.10 Performance Enhancement for total time (Synthesized) . . . . . . . . . 185

7.11 Performance Enhancement for power (Synthesized) . . . . . . . . . . . 185

7.12 Performance Enhancement for total time (MediaBench DSP) . . . . . . 186

7.13 Performance Enhancement for power (MediaBench DSP) . . . . . . . . 187

C.1 Power Rail Specification for UCD9248 PMBus Controllers . . . . . . . 244


List of Figures

1.1 Essential Components of a Reconfigurable OS . . . . . . . . . . . . . 3

1.2 (A) PRR layout, (B) Miscellaneous Platforms/Floorplans . . . . . . . . 6

2.1 Flexibility vs performance of processor classes . . . . . . . . . . . . . 13

2.2 Dataflow Graph for Quadratic Root . . . . . . . . . . . . . . . . . . . 15

2.3 FPGA structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4 Software Vs Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.5 Including a customized IP within the RISC architecture (Customized Instruction) [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.6 Including a customized IP via the FSL interface onto MicroBlaze [1] . . 19

2.7 Partially Reconfigurable Region (PRR) X can be Loaded with Partial

Reconfiguration Module X1, X2, X4, or X3 . . . . . . . . . . . . . . . 22

2.8 Partial Reconfiguration Design Flow . . . . . . . . . . . . . . . . . . 24

2.9 Loading a partial bit file . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.10 Data Mining Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.11 Supervised Learning Steps . . . . . . . . . . . . . . . . . . . . . . . . 31

4.1 Framework: Major Blocks. . . . . . . . . . . . . . . . . . . . . . . . . 60


4.2 Floorplan for the uniform (left) and non-uniform (right) implementations. 61

4.3 A node can be independent or can have several dependencies. . . . . . . 63

4.4 DFGs Generator Components . . . . . . . . . . . . . . . . . . . . . . . 64

4.5 Reconfigurable simulator layout . . . . . . . . . . . . . . . . . . . . . 66

4.6 Simulator task variant (architecture) library (example) . . . . . . . . . 69

4.7 Simulator platform (PRR) library (example) . . . . . . . . . . . . . . . 70

4.8 Simulator DFG file example for the S1 benchmark . . . . . . . . . . . 71

4.9 Simulator output for the S1 benchmark, using the --verbose option . . . 72

4.10 Simulator graph file, where '#' represents reconfiguration and '*' execution; the number is the task ID (S1 benchmark) . . . . . . . . . . . . . 73

4.11 Overall Methodology Flow . . . . . . . . . . . . . . . . . . . . . . . . 74

4.12 Operation Mode 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.13 Operation Mode 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.14 Operation Mode 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

4.15 Operation Mode 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.1 A Data Flow Graph (DFG). . . . . . . . . . . . . . . . . . . . . . . . . 85

5.2 1D versus 2D Task Placement Area Models for Reconfigurable Devices 86

5.3 Meta-Offline scheduler processing time . . . . . . . . . . . . . . . . . 94

5.4 Pseudo-code for RCSched-I & II (with reuse and task migration) . . . . 98

5.5 A simplified Pseudo-code for ROS, illustrates how the scheduler is called. 99

5.6 Task Type table structure . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.7 Pseudo-code for RCSched-III (with reuse and task migration) . . . . . . 104

5.8 Total Execution Time in mSec . . . . . . . . . . . . . . . . . . . . . . 111


5.9 Total Reconfiguration Time in mSec . . . . . . . . . . . . . . . . . . . 112

5.10 Busy Counter: Increased when no Free Resources Available . . . . . . 113

5.11 Number of hardware to software Task Migration . . . . . . . . . . . . 114

5.12 Experimental Flow of the intermediate stage testing . . . . . . . . . . . 115

5.13 Total Execution Time: Non-Restricted Placement . . . . . . . . . . . . 117

5.14 Total Execution Time: Restricted Placement . . . . . . . . . . . . . . . 118

5.15 Total Reconfiguration Time: Non-Restricted Placement . . . . . . . . . 120

5.16 Total Reconfiguration Time: Restricted Placement . . . . . . . . . . . . 121

6.1 A Single GA Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

6.2 Task Graph to Chromosome Mapping (Binding/Allocation). . . . . . . 134

6.3 Proposed Island Based Framework . . . . . . . . . . . . . . . . . . . . 137

6.4 Platforms with different floorplans . . . . . . . . . . . . . . . . . . . . 141

6.5 Convergence of synthesized benchmark S1 . . . . . . . . . . . . . . . 142

6.6 Convergence of synthesized benchmark S2 . . . . . . . . . . . . . . . 143

6.7 Convergence of synthesized benchmark S3 . . . . . . . . . . . . . . . 143

6.8 Convergence of synthesized benchmark S4 . . . . . . . . . . . . . . . 144

6.9 Convergence of MediaBench benchmark (DFG2) . . . . . . . . . . . . 144

6.10 Convergence of MediaBench benchmark (DFG6) . . . . . . . . . . . . 145

6.11 Convergence of MediaBench benchmark (DFG7) . . . . . . . . . . . . 145

6.12 Convergence of MediaBench benchmark (DFG12) . . . . . . . . . . . 146

6.13 Convergence of MediaBench benchmark (DFG14) . . . . . . . . . . . 146

6.14 Convergence of MediaBench benchmark (DFG16) . . . . . . . . . . . 147

6.15 Convergence of MediaBench benchmark (DFG19) . . . . . . . . . . . 147


6.16 The DFG for benchmark S3 . . . . . . . . . . . . . . . . . . . . . . . 148

6.17 Architecture binding for each platform (P1-P4), for the S3 Benchmark. . 148

6.18 Aggregated Pareto Front (Time vs Power) for synthesized benchmark

(S1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

6.19 Aggregated Pareto Front (Time vs Power) for synthesized benchmark

(S2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

6.20 Aggregated Pareto Front (Time vs Power) for synthesized benchmark

(S3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

6.21 Aggregated Pareto Front (Time vs Power) for synthesized benchmark

(S4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

6.22 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

6.23 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

6.24 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG7) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6.25 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6.26 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG14) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

6.27 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG16) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

6.28 Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG19) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156


7.1 Overall Methodology and Flow . . . . . . . . . . . . . . . . . . . . . . 159

7.2 Supervised learning: Training and testing . . . . . . . . . . . . . . . . 164

7.3 Individual Classifiers: Prediction Accuracy for the synthesized benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

7.4 Imbalance ratio between Minority/Majority Classes [synthesized benchmark] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

7.5 Individual Classifiers: Prediction Accuracy for MediaBench . . . . . . 178

7.6 Ensemble Classifiers: Prediction Accuracy for the Synthesized Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

7.7 Ensemble Classifiers: Prediction Accuracy for the MediaBench . . . . . 182

7.8 Ensemble Classifiers: Area Under ROC for the Synthesized Benchmark 183

7.9 Ensemble Classifiers: Area Under ROC for the MediaBench . . . . . . 183

A.1 Total Number of Reconfigurations: Non-Restricted Placement . . . . . 214

A.2 Total Number of Reconfigurations: Restricted Placement . . . . . . . . 215

A.3 Total Number of HW reuse: Non-Restricted Placement . . . . . . . . . 216

A.4 Total Number of HW reuse: Restricted Placement . . . . . . . . . . . . 217

A.5 Idle Time Measurement: Non-Restricted Placement . . . . . . . . . . . 218

A.6 Idle Time Measurement: Restricted Placement . . . . . . . . . . . . . . 219

A.7 Number of hardware to software Task Migration: Non-Restricted Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

A.8 Number of hardware to software Task Migration: Restricted Placement 222

C.1 Power distribution of Xilinx VC707 Board from [2] . . . . . . . . . . . 244

C.2 Texas Instruments USB to GPIO adapter . . . . . . . . . . . . . . . . . 246


C.3 Xilinx VC707 board; the TI adapter should be connected to the highlighted PMBus connector (J5) . . . . . . . . . . . . . . . . . . . . . . 246

C.4 Window showing Fusion software modes, press OK . . . . . . . . . . . 247

C.5 Select Device, click the first link . . . . . . . . . . . . . . . . . . . . . 248

C.6 Fusion Monitor View (measurements can be observed). The monitor section on the top left shows or hides real-time graphs; the device and rail can be changed from the top right corner . . . . . . . . . . . . . . . 249

C.7 Fusion Monitor View: the Rail dashboard shows real-time readings of all metrics, highlighted on the top left side . . . . . . . . . . . . . . . . 250

C.8 Select the rail and device to be monitored . . . . . . . . . . . . . . . . 251

C.9 Fusion Monitor View: the Rail dashboard shows real-time readings of all metrics, highlighted on the top left side . . . . . . . . . . . . . . . . 251


Abbreviations

SoC : System on Chip

GPP : General Purpose Processor

ASIP : Application Specific Processors

CLB : Configurable Logic Blocks

PR : Partial Reconfiguration

PRR : Partial Reconfigurable Region

FPGA : Field Programmable Gate Array

ASICs : Application Specific Integrated Circuits

PRM : Partial Reconfigurable Module

OS : Operating System

RTOS : Real-Time Operating System

SM : Static Module

GA : Genetic Algorithms

PE : Processing Element

DM : Data-Mining

RC : Reconfigurable Computing


DFG : Data Flow Graph

EA : Evolutionary Algorithm

DAG : Directed Acyclic Graph

ROS : Reconfigurable Operating System

RTR : Run-Time Reconfiguration

RCS : Reconfigurable Computing Systems

IBGA : Island Based Genetic Algorithm


Chapter 1

Introduction

In the area of computer architecture, choices span a wide spectrum, with Application Specific Integrated Circuits (ASICs) and General Purpose Processors (GPPs) at opposite ends. General purpose processors are flexible but, unlike ASICs, are not optimized for specific applications. Reconfigurable architectures in general, and Field Programmable Gate Arrays (FPGAs) in particular, fill the gap between these two extremes by approaching the performance of ASICs while retaining the flexibility of GPPs. However, FPGAs still cannot match ASICs in either power consumption or raw performance. One important feature of FPGAs is their capability to adapt during the run-time of an application. Run-time reconfiguration in FPGAs makes it possible to adapt hardware algorithms during system run-time, share hardware resources to reduce device count, reduce power consumption, and shorten reconfiguration time [3].

Several embedded application domains for reconfigurable systems tend to combine frequent changes with high performance demands of their workloads, such as image processing, wearable computing, and network processors. In many of these embedded systems, several wireless standards and technologies, such as WiMax, WLAN, GSM, and WCDMA, have to be utilized and supported. However, it is unlikely that these protocols will be used simultaneously. Accordingly, it is possible to dynamically load only the one that is needed. Another example is employing different machine vision algorithms on an Unmanned Aerial Vehicle (UAV) and utilizing the most appropriate algorithm based on the environment, or perhaps the need to lower power consumption.

Time multiplexing of reconfigurable hardware resources raises a number of new issues, ranging from run-time systems to complex programming models, that usually form a Reconfigurable hardware Operating System (ROS). The operating system performs on-line task scheduling and handles resource management. The main objective of a ROS is to reduce the complexity of developing applications by giving the developer a higher level of abstraction with which to work. The basic components of any ROS are illustrated in Figure 1.1, which include a bit-stream manager, scheduler, placer, and communications network.

The mapping and partitioning of an application starts by decomposing the application into constituent tasks. The tasks are then structured in a Data Flow Graph (DFG) to identify dependencies between the tasks. Finally, the tasks are scheduled and mapped onto hardware resources [4].
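As a sketch of this flow, the snippet below builds a small DFG and derives a dependency-respecting task order with Kahn's algorithm, the usual first step before tasks are bound to hardware resources. The graph and task names are illustrative only, not taken from the thesis's benchmarks.

```python
from collections import deque

def topological_order(tasks, deps):
    """Return tasks in dependency order (Kahn's algorithm).

    tasks: iterable of task names
    deps:  dict mapping a task to the set of tasks it depends on
    """
    indegree = {t: len(deps.get(t, set())) for t in tasks}
    children = {t: [] for t in tasks}
    for t, parents in deps.items():
        for p in parents:
            children[p].append(t)
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(tasks):
        raise ValueError("cycle detected: a DFG must be acyclic")
    return order

# Illustrative DFG: D depends on B and C, which both depend on A.
tasks = ["A", "B", "C", "D"]
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(topological_order(tasks, deps))  # prints ['A', 'B', 'C', 'D']
```

Any topological order is a valid schedule skeleton; a real ROS scheduler would additionally weigh reconfiguration overhead and region availability when choosing among ready tasks.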

1.1 Problem definition

Figure 1.1: Essential Components of a Reconfigurable OS.

Performance is one of the fundamental reasons for using Reconfigurable Computing Systems (RCS). By mapping algorithms and applications to hardware, designers can tailor not only the computation components, but also perform data-flow optimization to match the algorithm. One of the main problems encountered in Run-Time Reconfiguration (RTR) is identifying the most appropriate framework or infrastructure that suits an application. Typically, a designer manually divides the FPGA fabric into both static and dynamic regions [5]. The static regions accommodate modules that do not change over time, such as the task manager and the buses used for communication. The dynamic region is partitioned into a set of uniform or non-uniform regions of a certain size (we refer to these as Partial Reconfigurable Regions (PRRs)) that can accommodate Reconfigurable Modules (RMs), in other words application-specific hardware accelerators for the incoming tasks that need to be executed. Every application (e.g., Machine Vision, Wireless Sensor Network, etc.) requires specific types of resources that optimize certain objectives, such as reducing power consumption and/or improving execution time.

In current RTR systems, designers tend to perform resource allocation and floor-planning of the FPGA fabric manually, a priori. These allocated resources, however, might not be appropriate for a new and different incoming application (e.g., streaming, non-streaming, hybrid). By not tailoring the FPGA fabric to a particular application, the application suffers whenever the floor-plan is a mismatch for it. In general, a one-size-fits-all approach not only hinders the performance sought by using RTR, but may also adversely affect power consumption. The latter may occur because meeting performance requirements might entail the use of multiple PRRs. Accordingly, an adaptive and dynamic approach is necessary for performing both resource estimation and floorplanning.

Another important issue is related to task scheduling and placement. In any type of

Page 25: Efficient Scheduling, Mapping and Resource Prediction for

CHAPTER 1. INTRODUCTION 5

operating system a scheduler decideswhento load new tasks to be executed. Efficient

task scheduling algorithms have to take task dependencies,data communication, task

resource utilization and system parameters into account tofully exploit the performance

of a dynamic reconfigurable system. Adapting a conventionalscheduler is not an option

since schedulers in ROS differ from conventional OS schedulers in several ways:

1. ROS schedulers are heavily dependent on the placement of hardware tasks, while in conventional systems the scheduler can be independent from memory allocation.

2. Computation resources are available as soon as a task is placed in the reconfigurable fabric, while in conventional systems the task may wait in a ready queue for free processing resources.

3. ROS schedulers have to take into account reconfiguration time and should minimize this time by taking into account task prefetching and reuse.
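The interaction between placement, module reuse, and reconfiguration overhead can be sketched as follows. This is an illustrative Python model only, not the thesis's implementation; the PRR structure, module names, and time values are hypothetical.

```python
# Illustrative sketch only (not the thesis implementation): picking a PRR for
# an incoming hardware task while accounting for reconfiguration overhead.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PRR:
    loaded_module: Optional[str] = None  # module currently configured, if any
    busy: bool = False

def place_task(task_type, exec_time, reconfig_time, prrs):
    """Return (prr_index, time_cost), or None if no PRR is free
    (the caller could then migrate the task to software)."""
    free = [i for i, p in enumerate(prrs) if not p.busy]
    if not free:
        return None
    # Reuse first: an idle PRR already holding this module costs no
    # reconfiguration time.
    for i in free:
        if prrs[i].loaded_module == task_type:
            prrs[i].busy = True
            return i, exec_time
    # Otherwise pay the reconfiguration penalty on any free region.
    i = free[0]
    prrs[i].loaded_module = task_type
    prrs[i].busy = True
    return i, exec_time + reconfig_time

prrs = [PRR("fir"), PRR("fft")]
print(place_task("fft", 10, 5, prrs))  # reuse of PRR 1: (1, 10)
print(place_task("dct", 10, 5, prrs))  # reconfigure PRR 0: (0, 15)
```

The sketch shows why a reuse-aware placement decision cannot be separated from scheduling: the time cost of the same task differs depending on which PRR receives it.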

A scheduling algorithm for a reconfigurable architecture that assumes one architecture variant for the hardware implementation of each task may also lead to inferior solutions [4]. Having multiple hardware implementations per task helps in reducing the imbalance between the processing throughput of interacting tasks [6]. Also, static resource allocation for RTR might lead to inferior results. The number of PRRs for one application might be different than the number required by another. The type of PRRs (uniform, non-uniform, hybrid) also plays a crucial role in determining both performance and power consumption, as seen in Figure 1.2.


Figure 1.2: (A) PRR layout, (B) Miscellaneous Platforms/Floorplans

1.2 Motivations and Proposed Approach

The main goal of this dissertation is to propose and implement an infrastructure for reconfigurable operating systems that aids the designer from the early stages of the design all the way down to the actual implementation and submission of the application onto the reconfigurable fabric. Therefore, we started by implementing a platform and scheduling algorithms for reconfigurable computing that can manage both hardware and software tasks, along with developing tools and benchmarks to validate and measure the schedulers' performance. Three on-line scheduling algorithms were developed. All three schedulers tend to reuse hardware resources for tasks to reduce reconfiguration overhead, migrate tasks between software/hardware, and assign priorities to task types, while respecting task priority and maintaining precedence. The first two schedulers, called RCSched-I and RCSched-II, give priorities to hardware tasks [7]. In the case of a lack of hardware resources, both RCSched-I and RCSched-II initiate the migration of hardware tasks to software. RCSched-III, on the other hand, migrates tasks to SW only when it is more


beneficial to do so. In particular, RCSched-III dynamically measures several system metrics, such as execution time and reconfiguration time, then calculates the priority for each task type. RCSched-III then migrates and assigns tasks to the most suitable processing elements (SW or HW) based on the calculated priorities.
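The decision rule at the heart of this migrate-only-when-beneficial policy can be illustrated with a toy cost comparison. The actual metric collection and priority calculation in the thesis are more elaborate; the function name and the numbers below are hypothetical.

```python
# Toy version of the migrate-only-when-beneficial rule: compare the cost of
# running in hardware (reconfiguration + execution, unless the loaded module
# can be reused) against the software execution time. Numbers are made up.
def choose_processing_element(sw_exec, hw_exec, reconfig, reusable):
    hw_cost = hw_exec if reusable else hw_exec + reconfig
    return "HW" if hw_cost <= sw_exec else "SW"

print(choose_processing_element(sw_exec=40, hw_exec=8, reconfig=25, reusable=False))  # HW
print(choose_processing_element(sw_exec=20, hw_exec=8, reconfig=25, reusable=False))  # SW
print(choose_processing_element(sw_exec=20, hw_exec=8, reconfig=25, reusable=True))   # HW
```

The second and third calls differ only in reuse: when the needed module is already configured, hardware wins even though reconfiguring from scratch would be slower than software.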

A dynamic reconfigurable framework consisting of five reconfigurable regions and two GPPs was implemented. The developed schedulers were tested using a collection of tasks represented by DFGs. To verify the scheduler functionality and performance, DFGs with different sizes and parameters were required; therefore, a DFG generator was also designed and implemented. The DFG generator is able to generate benchmarks with different sizes and features. However, using such a dynamic reconfigurable platform to evaluate hundreds of DFGs based on different hardware configurations is both complex and tedious. The FPGA platform to be used in this work requires a different floor-plan and bit-stream for each new configuration, which limited the scope of testing and evaluation. Accordingly, a reconfigurable architecture simulator was developed to simulate the hardware platform while running the developed reconfigurable operating system. The simulator allows for faster evaluation and flexibility, and thus can support different hardware scenarios, by varying the number of processing elements (PRRs, GPPs), the size/shape of PRRs, and the schedulers.

Contrary to software tasks, hardware tasks may have multiple implementation variants (execution units, or hardware implementations). These variants differ in terms of performance, power consumption, and area. As limiting a scheduler to one execution variant may lead to inferior solutions, we propose an efficient optimization framework that can mitigate this deficiency. The optimization framework is based on an Island Based Genetic Algorithm technique. Given a particular task graph, each


island seeks to optimize speed, power and/or area for a different floorplan. The floorplans differ in terms of the number, size and layout of PRRs as described earlier. The framework uses the online schedulers, integrated with an island based GA engine, for evaluating the quality of a particular schedule (in terms of power and performance) and floorplan.
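The island-model structure described above can be sketched as follows. This skeleton is illustrative only: each island evolves its own population against its own objective (a stand-in for evaluating a schedule's power or performance on a particular floorplan), and islands periodically exchange their best individuals. The toy objectives and parameters are not taken from the thesis.

```python
# Skeleton of an island-model GA (illustrative only). Individuals are plain
# floats; in the real framework they would encode schedules/bindings whose
# fitness comes from the online schedulers.
import random

def evolve(pop, fitness, generations=20, mut=0.1):
    for _ in range(generations):
        pop.sort(key=fitness)                      # lower fitness is better
        survivors = pop[: len(pop) // 2]           # truncation selection
        children = [x + random.uniform(-mut, mut) for x in survivors]
        pop = survivors + children
    return pop

def island_ga(objectives, pop_size=10, epochs=5):
    islands = [[random.uniform(-5, 5) for _ in range(pop_size)]
               for _ in objectives]
    for _ in range(epochs):
        islands = [evolve(isl, fit) for isl, fit in zip(islands, objectives)]
        # Migration: each island sends its best individual to its neighbour.
        bests = [min(isl, key=fit) for isl, fit in zip(islands, objectives)]
        for i, b in enumerate(bests):
            islands[(i + 1) % len(islands)][-1] = b
    return [min(isl, key=fit) for isl, fit in zip(islands, objectives)]

random.seed(0)
# Two toy objectives standing in for, e.g., power and execution time.
best = island_ga([lambda x: (x - 1) ** 2, lambda x: (x + 2) ** 2])
print(best)  # each value converges near its own island's optimum
```

Migration is what distinguishes the island model from independent GA runs: promising individuals found under one objective/floorplan are periodically tested under the others.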

One of the major understudied problems related to OS development for reconfigurable systems that this thesis seeks to address is predicting the required resources, such as soft cores, PRRs, and communication infrastructure, to name just a few. Static resource allocation for RTR might lead to inferior results. The number of PRRs for one application might be different than the number required by another. The type of PRRs (uniform, non-uniform, hybrid) also plays a crucial role in determining both performance and power consumption. The type of scheduler used to determine when/where a task is executed is also important for specific real-time operations. The type of communication infrastructure that connects PRRs with the Task Manager plays an important role in speeding up a certain application. Accordingly, in this thesis we seek to overcome the limitation of static resource allocation with a more appealing approach that can adapt the infrastructure of the reconfigurable computing platform to accommodate and match the application rather than the reverse. Therefore, we present a novel adaptive and dynamic methodology based on an intelligent machine learning approach that is used to predict and estimate the necessary resources for an application based on past historical information. An important feature of the proposed methodology is that the system is able to learn as it gains more knowledge and, therefore, is expected to generalize and improve its accuracy over time. Even though the approach is general enough to predict most if not all types of resources, from the number of PRRs, type of PRRs, the type of scheduler,


and communication infrastructure, we limit our results to the former three required for an application. This task is accomplished by first extracting certain features from the applications that are executed on the reconfigurable platform. The features compiled are then used to train and build a classification model that is capable of predicting the floorplan appropriate for an application. The classification model is a supervised learning approach that can generalize and accurately predict the class of an incoming application from previously seen patterns. Our proposed approach is based on several modules including benchmark generation, data collection, pre-processing of data, data classification, and post-processing. The goal of the entire process is to extract useful hidden knowledge from the data. This knowledge is then used to predict and estimate the necessary resources and appropriate floorplan for an unknown or not previously seen application.
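The prediction step itself can be sketched with a deliberately small example: classify an application's feature vector into a floorplan class using a 1-nearest-neighbour rule. The feature names (task count, dependency density, streaming ratio), the labels, and the training data below are fabricated for illustration; the thesis builds a full classification model rather than this minimal rule.

```python
# Illustrative 1-nearest-neighbour floorplan prediction. All features,
# labels, and training points are hypothetical toy data.
import math

training = [  # (feature vector, floorplan label)
    ((40, 0.8, 0.9), "uniform-4PRR"),
    ((12, 0.2, 0.1), "non-uniform-2PRR"),
    ((35, 0.7, 0.8), "uniform-4PRR"),
    ((10, 0.3, 0.2), "non-uniform-2PRR"),
]

def predict_floorplan(features):
    # Pick the label of the closest training example (Euclidean distance).
    _, label = min((math.dist(features, f), lbl) for f, lbl in training)
    return label

print(predict_floorplan((38, 0.75, 0.85)))  # "uniform-4PRR"
```

In practice the features would be scaled and a stronger supervised model trained on many generated benchmarks, but the mapping from feature vector to floorplan class is the same idea.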

1.3 Research Contributions

The main contributions in this thesis can be categorized into three major paradigms:

1. Platform and Schedulers: The development and evaluation of several novel heuristics for on-line scheduling of hard real-time tasks for partially reconfigurable devices. The proposed schedulers use fixed predefined partial reconfigurable regions with re-use, relocation, and task migration capability. In particular, RCSched-III uses real-time data to make scheduling decisions. The scheduler dynamically measures several performance metrics such as reconfiguration time and execution time, calculates a priority and, based on these metrics, assigns incoming tasks to the appropriate processing elements. In order to evaluate the proposed framework and


schedulers, a DFG generator was developed. It randomly generates benchmarks with predefined specifications, such as the number of nodes, task types and total number of dependencies per DFG. To the best of our knowledge this is the first actual implementation of a reconfigurable system with software/hardware task migration.

2. Multiple Execution Variants: A multi-objective optimization framework capable of optimizing total execution time and power consumption for static and partial dynamic reconfigurable systems is proposed. This framework is not only capable of optimizing the above-mentioned objectives, but can also determine the most appropriate reconfigurable floorplan/platform for the application. This way, a good trade-off between performance and power consumption can be achieved, which results in high energy efficiency. To the best of our knowledge, this is the first attempt to use multiple GA instances for optimizing several objectives and aggregating the results to further improve solution quality.

3. Resource Prediction: The majority of work on RCS relies on a static approach to estimate and decide upon the resources required to solve a specific problem. In this work, we instead propose a novel dynamic and adaptive approach. To the best of our knowledge, the use of data-mining and machine-learning techniques has not been proposed by any research group to exploit this specific type of Design Exploration for Reconfigurable Systems in terms of predicting the appropriate floorplan of an application.


1.4 Thesis Organization

The remainder of the thesis is organized as follows: Chapter 2 provides essential background on reconfigurable computing, RTR, and machine learning. Chapter 3 introduces the main published works in the field of reconfigurable computing, reconfigurable schedulers/OS, and resource prediction. The overall methodology that describes the different phases of the research and modes of operation is introduced in Chapter 4. The proposed dynamic run-time reconfigurable platform along with three novel schedulers are introduced in Chapter 5. Chapter 6 introduces the evolutionary framework developed for allocating and binding execution units to task graphs. The novel data-mining resource prediction framework is discussed in Chapter 7. Finally, the thesis conclusion is provided in Chapter 8 along with future work.


Chapter 2

Background

In this chapter the necessary background material describing reconfigurable computing systems will be introduced. The concepts of run-time reconfiguration, FPGAs, and the partial reconfiguration design flow will be explained as well. Topics related to the research, such as data mining and scheduling, will also be introduced to help the reader understand the remainder of this thesis.

2.1 Reconfigurable Computing

Reconfigurable systems refer to computational models that are based on dynamically reconfigurable devices [8,9]. A good survey on reconfigurable computing can be found in [10–13]. Reconfigurable computers usually consist of one or more general purpose processors coupled with at least one reconfigurable device. A reconfigurable device can be simply defined as a special kind of hardware that can be programmed into whatever logic the user desires [8,14].


The two main performance measures used to evaluate any processor are speed and flexibility. Processors based on Von Neumann (VN) architectures are very flexible and accordingly are referred to as General Purpose Processors (GPPs). They can compute almost any task sequentially. The sequential nature of VN processors, however, impedes their performance. On the other hand, Application Specific Processors (ASIPs) deliver higher performance as they are optimized for the application (the hardware is adapted to the application). Reconfigurable computing can ideally take the best of both worlds, combining the flexibility of a VN processor with the performance of an ASIP, as illustrated in Figure 2.1 [9].

Figure 2.1: Flexibility vs performance of processor classes

Digital systems are traditionally classified into processor-based systems and Application Specific Integrated Circuits (ASICs). The former is general while the latter is more specific. An ASIC is an integrated circuit specifically designed to perform specialized and unique functions in hardware. ASICs usually execute several functions in parallel


on a chip. An ASIC can replace a GPP for a single application, delivering extreme performance and low power. An ASIC's fixed resources and algorithm architecture, however, make it inflexible and very expensive per application. As a trade-off between the two extreme characteristics of GPPs and ASICs, reconfigurable computing can combine the advantages of both. Table 2.1 compares the features of the three systems [14].

Architecture       General Purpose   ASIC     Reconfigurable
Resources          Fixed             Fixed    Configware
Algorithms         Software          Fixed    Flowware
Performance        Low               High     Medium
Cost               Low               High     Medium
Power              Medium            Low      Medium
Flexibility        High              Low      High
Computing Model    Mature            Mature   Immature
NRE                Low               High     Medium

Table 2.1: Comparison of Representative Computing Architecture

Reconfigurable computing applications can be found in many fields. It can be applied to embedded systems, network security applications, multimedia applications, vision, scientific computing, unmanned airborne vehicles and SATisfiability (SAT) solvers. More details can be found in [9,15].

2.1.1 Data Flow Graph (DFG)

A Data Flow Graph (DFG) provides the means to describe a computing task in a streaming mode. A DFG can represent any high level language such as C or C++, where each operator represents a node in the Data Flow Graph. The inputs of a node are the operands on which the corresponding operator is applied. The output of the node represents the result of the operation on that node. The output of a node is used as an input for


another node, and accordingly data dependency is defined in the graph. As an example, the DFG shown in Figure 2.2 represents the quadratic root formula in Equation 2.1 [9].

f = \frac{(b^2 - 4ac)^{1/2} - b}{2a} \qquad (2.1)


Figure 2.2: Dataflow Graph for Quadratic Root

A Data Flow Graph is a directed acyclic graph which can be represented by G = (V, E), where V is the set of vertices and E is the set of edges. Each vertex can represent a task; therefore, a set of nodes V can represent a set of tasks T = {T1, T2, T3, ..., Tn}. An edge e = (vi, vj) ∈ E is defined through the data dependency between task Ti and task Tj.

To model hardware tasks with a DFG, a hardware implementation is needed for each


task (i.e., node) that occupies a rectangular area of the chip. The nodes and edges of a DFG can possess hardware characteristics such as width, height, area, latency and speed.
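The definition of G = (V, E) above can be made concrete with a small sketch: a DFG in adjacency-list form, with a topological order giving one legal execution sequence that respects the data dependencies. The node names and the attribute values are illustrative, not drawn from the thesis benchmarks.

```python
# A DFG as G=(V,E): nodes carry (illustrative) hardware characteristics,
# edges encode data dependencies. A topological sort yields an execution
# order in which every task runs only after its producers.
from collections import deque

tasks = {  # node -> hypothetical hardware characteristics
    "T1": {"area": 120, "latency": 3},
    "T2": {"area": 80,  "latency": 2},
    "T3": {"area": 200, "latency": 5},
}
edges = [("T1", "T3"), ("T2", "T3")]  # T3 consumes the outputs of T1 and T2

def topological_order(nodes, edges):
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(n for n in nodes if indeg[n] == 0)  # no pending inputs
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for v in succ[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order

print(topological_order(tasks, edges))  # ['T1', 'T2', 'T3']
```

A scheduler for such a graph may only dispatch a node once all of its predecessors have completed, which is exactly the constraint the in-degree bookkeeping enforces.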

2.1.2 FPGA

FPGAs (Field Programmable Gate Arrays) are a specific family of integrated circuits that are used for implementing custom digital circuits. One of their main properties is the ability to be configured an unlimited number of times [14]. More specifically, in SRAM-based FPGAs the configuration data is stored in a volatile memory. Reconfiguring an FPGA changes its functionality to support a new application, which is equivalent to mapping new hardware into the system for a different application.

A generic FPGA architecture is illustrated in Figure 2.3. The main building blocks of an FPGA are Configurable Logic Blocks (CLBs), which contain the logic circuits; Input/Output Blocks (IOBs), which interface the internal logic to a pin in the FPGA package; and communication resources that allow the arbitrary connection of CLBs and IOBs. FPGAs usually have additional resources such as clock managers, dedicated multipliers, dual-port RAMs and sometimes microprocessors [14].

2.1.2.1 Accelerator Coupling Strategies

FPGAs have evolved over the years from simple Programmable Logic Devices (PLDs) to a complete System On Chip (SOC), containing microprocessors, memories, DSP blocks and highly optimized connection paths with unlimited reconfigurability. This development gave the designer a plethora of logic gates to use (millions of gates).

Figure 2.3: FPGA structure

There are different ways to connect a user IP into an embedded microprocessor-based system. Generally speaking, an application can be implemented either in software

or hardware. Parallel execution of an algorithm is the main advantage of using a hardware implementation. This is especially important for strict timing-driven applications, while a software implementation (e.g., C/C++) eases the management of the user IP. Figure 2.4 shows an example of how parallel execution can be used. The software routine calculates the results of F in 12 clock cycles. In contrast, the hardware routine computes the same results in only 2 clock cycles [1].

The customized IP core can be integrated inside the Reduced Instruction Set Computing (RISC) processor architecture, as seen in Figure 2.5, or externally, as demonstrated by Figure 2.6. The integration of a customized IP core within the execution unit is very restrictive, though, for two reasons. The first is the nature of RISC processors, which have an ALU with two inputs and one output; instructions that need different I/Os will be very



Figure 2.4: Software Vs Hardware

difficult to implement. The second reason is that changing the execution unit changes the critical path for all instructions, which may delay the built-in instructions and further reduce the overall frequency of the processor, as shown in Figure 2.5. On the other hand, using a hardware accelerator does not affect the critical path and does not alter the processor capability.

2.1.3 Run Time Reconfiguration (RTR)

Run time reconfiguration (RTR) is a feature in reconfigurable systems that enables the change in functionality of a device during operation. This is a built-in feature in FPGAs and can be used to reduce component count, power consumption and cost by reconfiguring the same FPGA with different applications (tasks).


Figure 2.5: Including a customized IP within the RISC architecture (Customized Instruction) [1]

Figure 2.6: Including a customized IP via the FSL interface onto MicroBlaze [1]


2.1.3.1 Local vs. Global Configuration

FPGAs may be configured completely (global configuration) or partially (local configuration). Global configuration suspends the FPGA operation during configuration, which takes a relatively long time due to the large configuration bitstream. On the other hand, local configuration is faster, because it needs a smaller configuration bitstream. Another benefit of local configuration is the ability (for some FPGAs) of on-line configuration (i.e., without suspending the rest of the FPGA). On-line partial reconfiguration, or Dynamic Partial Reconfiguration, allows for embedding the configuration circuit inside the FPGA.

2.1.3.2 Spatial/Temporal Partitioning

Partitioning of hardware tasks is important in reconfigurable computing. It seeks to partition a large task into sub-tasks so that it fits within any FPGA. Spatial partitioning maps a complex circuit onto several FPGAs, while temporal partitioning loads the sub-tasks in sequence into the same FPGA. In temporal partitioning there is a need to keep track of intermediate data between sub-tasks and data dependencies between tasks. Another important factor for temporal partitioning is reconfiguration time, which needs to be fast enough to meet the application's timing. Temporal partitioning can use global reconfiguration with an external reconfiguration controller to reconfigure the entire FPGA each time, or it can benefit from dynamic partial reconfiguration with an embedded reconfiguration controller. Dynamic partial reconfiguration eliminates the need for an external controller. This can reduce the overhead of reconfiguration, as a result of using smaller bitstreams and special techniques such as task pre-fetching and task re-using [16].


2.1.4 Partial Reconfiguration

Partial Reconfiguration is a unique feature of Xilinx FPGAs that allows the reconfigura-

tion of a part of the device while the remainder of the fabric remains in operation mode.

In recent years, Xilinx FPGAs went through multiple hardware improvements to support

more features on partial reconfiguration.

The two key hardware improvements are:

• Smaller units of reconfiguration granularity: from the full device-height reconfiguration frames in the Virtex-II and Virtex-II Pro families to partial CLB blocks in the newer FPGAs. For example, the configuration frame is 16 CLBs high in the Virtex-4 family.

• Increased bandwidth in the internal configuration access port: from 800 Mbit/s in the Virtex-II and Virtex-II Pro families to 3.2 Gbit/s in the Virtex-4 family and beyond.

Xilinx also introduced the Early Access Partial Reconfiguration (EAPR) design flow, which organizes the process to some extent [17]. Xilinx later introduced the first commercial partial reconfigurable design flow, which further simplifies the process. Partial reconfiguration is usually used for time multiplexing multiple functions of an application that are not required at the same time on the FPGA (mutually exclusive functions). This approach reduces power consumption and enables the use of a smaller FPGA device, hence reducing cost. Figure 2.7 illustrates the concept of partial reconfiguration, where a PRR can be reconfigured with different reconfiguration modules.



Figure 2.7: Partially Reconfigurable Region (PRR) X can be Loaded with Partial Reconfiguration Module X1, X2, X4, or X3

2.1.4.1 Static and Partial Reconfigurable Regions

Any PR design consists of two kinds of regions: Static and Partial Reconfigurable Regions (PRRs). Static regions contain the logic that does not change during partial configurations. The static region may contain the circuit to control the PR process. The logic of each PRR can be reconfigured independently of the other PRRs and the static region. A PRR should have at least two Partial Reconfigurable Modules (PRMs) with different functionality. Xilinx allows more than one PRR region in the design. Those regions must be specified manually on the FPGA using the PlanAhead floor-planner.

2.1.4.2 Partial Reconfiguration design flow

The existing flow from Hardware Description Language (HDL) to configuration bitstream is extremely complicated. Therefore, design hierarchy limitations exist to aid the software tools in creating PR designs. The primary limitation requires that


the top-level module contain submodules that are either static modules (SMs) or partially reconfigurable modules (PRMs) [18]. Special bus macros provided by Xilinx are used and must be explicitly declared for communication between static and PRR regions, with the exception of the clock (the latest design flow does not require bus macros anymore). The required hierarchy imposes a substantial amount of effort when converting an existing static design into one that is suitable for PR. The restriction of having all the PRMs at the top level will usually lead to routing many signals to other modules within the main static module.

2.1.4.3 Partial Reconfiguration software and design flow

The Early Access PR software provided by Xilinx is only compatible with ISE 9.2i service pack 4. The PR extension is a patch to the ISE 9.2 tools, and it is not compatible with the newer ISE software. The new partition-based design flow started with ISE 12.1, and it only supports the high-end Virtex-4, 5, 6 and 7 families (recent lower-end Xilinx 7 Series FPGAs support partial reconfiguration). The PR design flow is illustrated in Figure 2.8. For this flow, each PRM along with the static design must be implemented in a separate directory, and then merged to generate the full and partial configuration bitstreams. Steps 1-4 are similar to the non-PR design flow; Steps 5-7 are unique to the PR design flow.

2.1.4.4 Reconfiguration Speed

Reconfiguration times are highly dependent upon PRR sizes and organization. For example, the Virtex-II allows partial reconfiguration of entire columns only, which may lead to partial bitstreams that are significantly larger than necessary. On the other hand,



Figure 2.8: Partial Reconfiguration Design Flow


Port        Bus (bits)   Max Frequency (MHz)   µSec/Frame
Serial      1            100                   13.12
JTAG        1            66                    19.88
SelectMap   8            100                   1.64
SelectMap   32           100                   0.41
ICAP        8            100                   1.64
ICAP        32           100                   0.41

Table 2.2: Reconfiguration Speed for different interfaces

Virtex-4 FPGAs and beyond allow for arbitrarily-shaped PRRs, which was a great improvement over the previous generations. Since reconfiguration time is greatly affected by design size, the metric of µSec/Frame is used to calculate the reconfiguration speed. Each frame is composed of forty-one 32-bit words. The smallest of the Virtex-4 devices (the LX15) has 3,740 frames, while the largest FPGA (the FX140) has 41,152 frames [19]. The most common FPGA configuration techniques are: the JTAG (Boundary Scan) port, externally through the serial configuration port or the SelectMap port, or internally through the internal configuration access port (ICAP), using an embedded microcontroller or state machine, as shown in Figure 2.9. Each of these methods has its own appropriate applications. The supported methods for partial reconfiguration are JTAG, SelectMap and ICAP. During development, JTAG is usually used to partially reconfigure the FPGA. The ICAP performs the reconfiguration internally using an embedded microcontroller, which provides a powerful and very flexible platform for PR designs. A summary of configuration speeds using different interfaces is shown in Table 2.2.
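Given the frame counts and per-frame times above, reconfiguration time follows by straightforward multiplication. The helper below is illustrative; the figures assume full-device reconfiguration at the interface's maximum rate, using the frame counts and µSec/Frame values quoted in this section.

```python
# Reconfiguration time = number of frames * time per frame (from Table 2.2).
def reconfig_time_us(frames, us_per_frame):
    return frames * us_per_frame

# Virtex-4 LX15 (3,740 frames) over a 32-bit ICAP at 0.41 us/frame:
print(reconfig_time_us(3740, 0.41))   # ~1,533 us, i.e. about 1.5 ms
# Largest Virtex-4 (FX140, 41,152 frames) over 8-bit SelectMap at 1.64 us/frame:
print(reconfig_time_us(41152, 1.64))  # ~67,489 us, i.e. about 67 ms
```

The two-orders-of-magnitude spread between these cases is why small PRRs, fast interfaces, and reuse techniques matter so much for RTR performance.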


Figure 2.9: Loading a partial bit file

2.1.4.5 Reconfiguration Using an Embedded Microcontroller

Xilinx provides FPGA products that include embedded hard processor cores, along with support for embedded soft processor cores (i.e., Xilinx's MicroBlaze) in all Virtex-II and later FPGAs. One capability that makes these cores an extremely flexible option is the ability to run C/C++ code for reconfiguration designs. The need for an external controller (i.e., a PC) can be eliminated by controlling reconfiguration with a processor that is embedded within the FPGA. Depending on the specific embedded software design and functionality, having the processor embedded within the FPGA allows for autonomous operation.

Shown in Figure 2.9 is an example of an embedded system design for partial reconfiguration. The microprocessor loads the required configuration data from the external memory and reconfigures the PR region through the ICAP primitive. The external memory could consist of ROM, Flash memory, or static RAM that is loaded at start-up or

even filled by the FPGA itself. Internal configuration can be triggered by the FPGA itself and can consist of either a custom state machine or an embedded processor such as the MicroBlaze processor or the PowerPC 405 processor (PPC405). When the overhead of an embedded microcontroller is undesirable, it can be replaced by a custom state machine that handles the loading of configuration data [18].

2.2 Resource Management of Reconfigurable Systems

The field of reconfigurable computing has grasped researchers' attention for quite a while. Recent FPGA developments in reconfiguration technology, accompanied by improved reconfiguration speed, have made FPGAs more practical and applicable for reconfigurable computing. However, a main challenge for reconfigurable computing is resource management. Reconfigurable computing needs to map hardware tasks onto a finite FPGA fabric taking into account task dependencies and timing criteria. Therefore, resource management becomes crucial for any reconfigurable computing system. For this reason, several researchers have studied resource management of reconfigurable systems [20].

The term reconfigurable Operating System is used to represent any sort of reconfiguration manager. There is neither a real full implementation of, nor a theoretical standard for, the reconfigurable OS. Therefore, the term reconfigurable OS is used to describe a manager that can range from as simple as a loader with a scheduler to a full reconfigurable OS that has an allocator, partitioner, scheduler, and loader.

From the literature, a reconfigurable OS usually consists of a processor and a reconfigurable fabric. Software runs on the processor to manage hardware tasks. Usually


part of the OS should be implemented in hardware to work as a bridge between the two

systems and increase the system efficiency.

A reconfigurable OS should not be confused with hardware OS acceleration. The former focuses on managing hardware resources to accelerate applications, while the latter attempts to accelerate OS functions with dedicated hardware such as ASICs or FPGAs. The goal of hardware OS acceleration is to reduce RTOS overhead by implementing critical OS components in hardware, such as context switching, scheduling and semaphore implementation. The idea is to make the RTOS more deterministic for critical applications. Many researchers [21–24] were able to achieve promising speedups, which led to a commercial product from Sierra [25] that accelerates the MicroC/OS-II RTOS utilizing Xilinx FPGAs.

2.2.1 Scheduling in Reconfigurable Systems

The main benefit of scheduling is organizing tasks to complete an objective, or a set of objectives, under certain constraints. A well-designed scheduler can increase system efficiency by reducing time and/or increasing throughput, and can achieve other goals such as power reduction. Many researchers have worked to improve the efficiency of scheduling. In general, scheduling takes two forms: static scheduling and dynamic scheduling. A static scheduler is employed at design time, before the system being scheduled is run; a schedule used in an embedded system is one example. All task information, such as task arrival times and task execution times, needs to be known to a static scheduler in advance in order to decide the task execution sequence. Therefore, static schedulers are suitable only for systems whose behavior is known in advance, such as embedded systems. The advantage of static scheduling is that complex (time consuming) algorithms can be used to achieve an optimal schedule. A dynamic scheduler, on the other hand, does not need all the information to be known in advance. It attempts to assign tasks while the system is running; it is therefore used in operating systems, where tasks that need to be placed in the ready queue may arrive randomly. Dynamic scheduling has the advantage of adapting to sudden changes, since it can handle (schedule) unexpected tasks that arrive during run time [14].

In real-time systems, tasks have timing constraints and their execution is bounded by a maximum delay that has to be precisely respected. The goal of scheduling is to allow tasks to fulfill these constraints when the application runs in a nominal mode. A schedule in real-time systems must be predictable: it must be proved in advance that all timing constraints are met in the nominal mode. When a malfunction occurs, alarm tasks may be triggered and execution times may increase, overloading the application and giving rise to timing faults [26].

In reconfigurable systems, the decision of when to start the execution of a task depends on whether the task can be placed and configured into the reconfigurable logic. Therefore, scheduling and placement depend on each other, which is not the case in processor-based systems, where task scheduling and memory allocation can be performed separately. In conventional systems, a task can be loaded into memory and then wait in the ready queue for its turn to use the processor. In reconfigurable systems, a task can begin execution as soon as it is configured, since configuration performs both allocation and the assignment of processing resources [14].
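This coupling between scheduling and placement can be sketched as a dispatcher that starts a task only when a region large enough for it is free. The best-fit policy, the region sizes and the task names below are illustrative assumptions, not taken from any cited work.

```python
# Placement-aware dispatch: a task is scheduled only if it can also be
# placed, i.e. a free reconfigurable region is large enough for it.
# All region sizes and task names here are illustrative.

def dispatch(tasks, regions):
    """tasks: list of (name, area); regions: list of free region sizes.
    Returns (started, deferred). A started task occupies its region, so
    the scheduling and placement decisions are made together."""
    free = sorted(regions)
    started, deferred = [], []
    for name, area in tasks:
        # best-fit placement: smallest free region that fits the task
        slot = next((r for r in free if r >= area), None)
        if slot is None:
            deferred.append(name)      # cannot place => cannot schedule
        else:
            free.remove(slot)
            started.append(name)       # configured => begins execution
    return started, deferred

started, deferred = dispatch(
    [("sobel", 40), ("aes", 80), ("fft", 30)],
    regions=[50, 35])
print(started)    # ['sobel', 'fft']
print(deferred)   # ['aes']
```

A processor-based scheduler would have started all three tasks in order; here the middle task is deferred purely because no placement exists for it.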

Most static schedulers in reconfigurable systems are based on the list scheduling technique and use a queue to keep the ready tasks. Given a list of ready, priority-sorted tasks, the scheduler picks the task with the highest priority to execute. Static schedulers may differ from each other in the method used to calculate task priorities. Some typical methods include random assignment, As Soon As Possible (ASAP), and As Late As Possible (ALAP) [14].
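A minimal sketch of list scheduling follows, using ASAP levels as the priority function. The task graph and the single-unit execution model are illustrative assumptions.

```python
# List scheduling sketch: tasks are kept in a priority-sorted ready list
# and the highest-priority ready task is picked first. Priorities here
# are ASAP levels computed from the task graph.

def asap_levels(deps):
    """deps maps task -> set of predecessor tasks; returns ASAP levels."""
    levels = {}
    def level(t):
        if t not in levels:
            levels[t] = 1 + max((level(p) for p in deps[t]), default=0)
        return levels[t]
    for t in deps:
        level(t)
    return levels

def list_schedule(deps):
    levels = asap_levels(deps)
    done, order = set(), []
    while len(done) < len(deps):
        # a task is ready once all of its predecessors have finished
        ready = [t for t in deps if t not in done and deps[t] <= done]
        ready.sort(key=lambda t: levels[t])   # highest priority = earliest ASAP
        chosen = ready[0]
        order.append(chosen)
        done.add(chosen)
    return order

# illustrative graph: C depends on A; D depends on B and C
deps = {"A": set(), "B": set(), "C": {"A"}, "D": {"B", "C"}}
print(list_schedule(deps))   # ['A', 'B', 'C', 'D']
```

Swapping `asap_levels` for an ALAP or random priority function changes only the sort key, which is exactly how the static schedulers described above differ from one another.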

2.3 Data Mining, Machine Learning and Classification

Data Mining is the core process of the knowledge discovery procedure. A data-mining flow includes different stages, such as pre-processing, classification, clustering and post-processing. The main objective of the entire process is to extract useful hidden knowledge from the data. Each set of data can be mined using different data-mining techniques, depending on the data and the goal of the mining procedure, as shown in Figure 2.10.

Figure 2.10: Data Mining Flow

Classification is one of the main phases of data mining. It is a key function that categorizes the items and records in a database into specific classes or categories. The main objective of classification is to accurately predict the target class for each item in the data. Classification is considered a form of supervised learning, which infers a function from labeled data, as shown in Figure 2.11.

Figure 2.11: Supervised Learning Steps

The training data usually consists of a set of records, each a pair consisting of an input object and a desired output target. The training data is analyzed by the supervised learning algorithm, which produces a model or function that can be used for mapping new examples. The main objective is to produce a model that correctly determines the class label for unseen instances. In other words, the learning algorithm should be able to generalize from the provided training data to unseen situations in a reasonably accurate fashion.
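The train-then-generalize loop can be illustrated with a deliberately tiny supervised learner. The nearest-centroid model, the labels and the data points are illustrative choices, not methods used elsewhere in this thesis.

```python
# Supervised learning in miniature: a nearest-centroid classifier.
# Training records are (input vector, class label) pairs; the "model"
# is one mean vector per class, used to label unseen instances.

def train(records):
    """Accumulate per-class sums, then return a centroid per class."""
    sums = {}
    for x, label in records:
        vec, n = sums.get(label, ([0.0] * len(x), 0))
        sums[label] = ([v + xi for v, xi in zip(vec, x)], n + 1)
    return {lbl: [v / n for v in vec] for lbl, (vec, n) in sums.items()}

def predict(model, x):
    """Label an unseen instance by its nearest class centroid."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda lbl: dist2(model[lbl], x))

records = [([1.0, 1.0], "small"), ([2.0, 1.0], "small"),
           ([8.0, 9.0], "large"), ([9.0, 8.0], "large")]
model = train(records)
print(predict(model, [1.5, 2.0]))   # small
print(predict(model, [7.0, 7.0]))   # large
```

Neither test point appears in the training data; assigning them the right labels is the generalization property described above.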


2.4 Summary

Partial dynamic reconfiguration allows multiple independent configurations to be swapped in and out of hardware independently: one configuration can be selectively replaced on the chip while the others are left intact. Partial reconfiguration thus provides the capability to change certain parts of the hardware while other parts of the FPGA remain in use.

A Reconfigurable Computing system traditionally consists of a General Purpose Pro-

cessor (GPP) and one or more Reconfigurable Modules that run hardware tasks in paral-

lel [27]. A fundamental feature of a Partially Reconfigurable FPGA is that the logic and

interconnects are time multiplexed. Thus, for a circuit to be implemented on an FPGA,

it needs to be partitioned such that each sub-circuit can be executed at a different time.

The Xilinx Partial Reconfiguration design flow [5] uses a bottom-up synthesis approach, where Reconfigurable Modules have to be synthesized separately. Each Reconfigurable Module is treated as a separate project and is verified and synthesized on its own. The top design treats each Partial Reconfigurable Region (PRR) as a black box. After generating all net-lists (for the top design and the Reconfigurable Modules), each PRR must be manually floor-planned using the Xilinx PlanAhead design tool. A PRR can be rectangular or L-shaped, with some restrictions. More details can be found in [5].

Partial reconfiguration is appealing and attractive since it provides flexibility. However, multitasking reconfigurable hardware is complex and requires some management overhead. In order for users to benefit from the flexibility of such systems, an operating system must be developed to further reduce the complexity of application development by giving the developer a higher level of abstraction.


Chapter 3

Literature Review

In this chapter, the important published works in the field of dynamic reconfigurable systems will be examined. Research work on reconfigurable scheduling, ROS, and resource prediction will be reviewed. The main objective of this chapter is to highlight the advantages and disadvantages of current approaches and implementations on reconfigurable computing platforms in order to provide guidance for the reader.

3.1 Partial Dynamic Reconfigurable Systems

Partial reconfiguration is a unique and important feature of reconfigurable systems, as it allows swapping hardware modules in and out of an FPGA device dynamically without the need to reset the entire device. The possibilities that partially reconfigurable devices offer for system adaptivity are outstanding. Not many devices support dynamic partial reconfigurability; some of the few devices that fall in this category are the Virtex series from Xilinx [9].


Several partial reconfiguration design flows have arisen over time, and only two of them are currently supported by their providers. The JBits [28] approach is an application interface developed by Xilinx to allow the end user to change connections inside the FPGA and set the contents of the LUTs by using a set of Java classes that are part of the JBits API. JBits does not allow the changes to be made directly to the FPGA, but rather to the configuration file. One of the major drawbacks of JBits is the difficulty of providing fixed routing for a direct connection between two modules, which may lead to unroutable paths after reconfiguration [9].

A newer design flow, similar to JBits but supporting a wider range of devices, is the Difference-Based partial reconfiguration design flow [29]. This flow uses a different user interface than JBits and allows the designer to make small logic changes using the Xilinx FPGA Editor. It generates a partial bitstream that contains the differences between the two design versions. This approach has the same limitations as JBits and does not guarantee correct routing after reconfiguration. Therefore, it is only recommended for small changes such as logic equations, filter parameters and I/O standards.

A more advanced design flow that starts at the HDL level and attempts to solve the routing problem of JBits is the Modular Design Flow [30]. The Modular design flow uses Bus Macro primitives to guarantee fixed communication channels among components that will be reconfigured at run time. The Modular design flow was initially developed to allow a group of engineers to cooperate on the same project; at a later stage it was adapted to support partial reconfiguration [9].

A follow-up to the Modular design flow from Xilinx is the Early Access Design Flow, which is basically an enhancement of the former. The main enhancements added were better usability, flow automation and the relaxation of some rigid constraints. For example, signals passing through a reconfigurable region without being used in the reconfigurable modules do not have to be passed through Bus Macros. As a result of these enhancements, more devices were supported by the Early Access Modular design flow [31].

Xilinx's latest design flow is the Partition-based design flow [17]. This design flow is integrated within the Xilinx ISE development tools. It is similar to the Early Access Design Flow but with fewer limitations. The Xilinx Partition-based design flow utilizes Partitions, a mature feature that ensures exact preservation of previously generated results. Although this is the first commercially provided design flow, it still has several limitations: it does not support automatic partitioning or relocation, and all partial reconfigurable hardware modules have to be placed manually using a floor-planner.

There have been several attempts to adapt the standard PR design flow to specific application fields, in order to reduce complexity or add extra features. In [32] a tool called PARBIT was developed to easily transfer and regenerate bitfiles in order to implement dynamically loadable hardware modules. The tool was developed based on the JBits design flow and inherited most of the JBits limitations. An implementation of a reconfigurable packet processing circuit using the PARBIT tools is presented in [33]. Another tool based on the JBits design flow is JRTR [34]. This tool aimed to provide a simple and efficient model and implementation for partial run-time reconfiguration using a cache-based approach. The tool extended the JBits Java library and supports bitstream read-back. In [35] a modification to the Modular design flow is presented. The published work presented an implementation of a reconfigurable system using slices as connecting resources instead of TBUF elements. The benefit of this approach was to automate the design flow and reduce manual debugging. In [36] a complete framework was built to enable hardware developers to use three different schedulers without going into the details of reconfiguration techniques. Reconfiguration modules can be written in a high-level Java-like language in order to handle the reconfiguration process in an easier way. The proposed system was tested with data acquisition and streaming applications. The design was based on the Early Access Design Flow. Partial reconfiguration is currently supported only by the high-end and expensive FPGAs of the state-of-the-art Virtex series

supported by the high end and expensive FPGAs in state of the art Virtex FPGAs series

(the only exception is Xilinx 7 series). In [37] an implementation of a virtual internal

configuration port for realizing dynamic and partial self-reconfiguration on Spartan III

FPGAs is presented. The proposed work attempts to overcome the limitation of miss-

ing ICAP interface in Spartan II FPGAs. The researchers proposed a virtual interface

that writes the configuration data to output ports on the FPGA, which are externally

connected to the available JTAG pins.

3.1.1 Dynamic Partial Reconfiguration Applications

The number of application fields to which reconfigurable computing has been applied is increasing steadily. Some of the widely targeted application domains are embedded systems, network security applications, multimedia applications, scientific computing, and SATisfiability solvers [14]. More detailed applications can be found in [38]. Despite the many benefits of dynamic partial reconfiguration [39, 40], only a few research efforts have used it for reconfigurable computing.

3.1.1.1 JBits and Difference Based Design Flows Applications

The JBits and Difference-Based partial reconfiguration design flows have been used in cryptographic applications. Partial dynamic reconfiguration has been used to reconfigure FPGAs to accommodate changing keys and to modify key and data block widths [41, 42]. The published work in [41] exploits partial reconfiguration in the IDEA algorithm by changing keys via partial dynamic reconfiguration. In [42] an implementation of two cryptographic algorithms (IDEA and AES) using dynamic partial reconfiguration is presented; again, JBits partial reconfiguration was used to reconfigure the elements involved with the keys for both algorithms.

3.1.1.2 Modular Design Flow Applications

In [43] researchers consider a hybrid DSP-FPGA platform for Software Defined Radio (SDR). In their system they were able to show the performance benefits of FPGA partial reconfiguration for software defined radio applications by reducing configuration time using the Modular-based partial reconfiguration design flow. In [44] the Modular design flow has been used to implement a self-reconfigurable co-processor for accelerating a representative secure application (SSH) on a standard operating system (uClinux OS).

3.1.1.3 Early Access Modular Design Flow Applications

The Early Access Modular Design Flow has been used with SDR in [45–47]. SDR is a common hardware platform for multi-standard communications that is controlled by software. The goal of SDR is to produce uninterrupted communication devices which can support different standards [45]. Therefore, SDR is a typical application for dynamic partial reconfiguration, where the communications module can be changed without interrupting the rest of the system. An important direction, considered to be a killer application, is in the biometric field, where a full biometric reconfiguration algorithm can be implemented in a small FPGA device at very low cost. Processing can be performed in real time while preserving data in its hardware implementation, by multiplexing functionality on the fly over a reduced set of resources placed in a partially reconfigurable region on the same device [48]. Partial reconfiguration was used for the first time to further increase the performance of High-Performance Reconfigurable Computers (HPRCs) in [49]. The authors investigated the performance potential of partial run-time reconfiguration on HPRCs from both theoretical and practical perspectives, comparing it to the full run-time reconfiguration approach. In [50] partial reconfiguration was used to design a multi-protocol sensor reading system for road safety. The goal was to reduce power and costs by time-sharing multiple design modules (protocols) on the same FPGA.

In [51] the flow was used for accelerating video-based driver assistance applications in automotive systems. The idea was to separate pixel-based operations from high-level operations. The pixel-based operations were accelerated using a reconfigurable co-processor attached to a standard PowerPC processor. Multiple algorithms requiring the same pixel operation can share the co-processor to minimize chip area, and at the same time multiple pixel operations can be supported by reconfiguring the co-processor. The use of partial reconfiguration in automotive domains has been investigated in [52] as well. An evaluation based on experiments on a set of signal and image processing applications is presented in [47]. The work evaluated the benefits and limitations of using dynamic partial reconfiguration for professional signal and image processing electronics applications and provided guidelines to enhance its applicability. An improvement to the reconfiguration interface provided by the vendor was implemented by using Direct Memory Access (DMA) to accelerate I/O operations. Reconfiguration speed is significant, especially in applications where fast module switching is required. Most applications use the internal ICAP configuration interface, and [53] attempted to further enhance the ICAP throughput. The published work uses DMA, Master (MST) burst, and a dedicated BlockRAM (BRAM) cache to reduce reconfiguration time. Their experimental results showed that the proposed DMA and MST burst controllers can achieve a processor-independent reconfiguration speed one order of magnitude faster than the vendor-provided modules. While the proposed BRAM-based module can approach the reconfiguration speed limit of the ICAP interface, this comes at a cost of large

BRAM resource utilization on the FPGA. In [54] the researchers introduced an approach

to reduce FPGA power consumption using dynamic partial reconfiguration by exploiting the time-varying nature of the system's environment. Power saving is achieved by adapting the implementation of a function to temporal changes in the environment. The authors pointed out the benefits of using dynamic partial reconfiguration with the type of applications where all application functions are required on the FPGA at the same time; in that case, the system reduces power consumption by time-multiplexing different implementations of the same function. The authors tested the system with a network application using different implementations of Viterbi decoders. In [55] the authors proposed the use of partial reconfiguration in networked portable multimedia appliances. In [56] researchers proposed the use of partial reconfiguration for dynamic fault tolerance, reconfiguring the faulty area without affecting the rest of the system.

3.2 Reconfigurable Operating Systems

The first proposed operating system for partially reconfigurable hardware was by Brebner [57, 58], who discussed some of the principles that influence an operating system for reconfigurable devices. The author introduced the notion of Swappable Logic Units (SLUs), which can be swapped in and out of a partially reconfigured device by an operating system, and proposed that applications should consist of relocatable SLUs that are loaded/unloaded by the operating system. The system was only simulated in C.

Merino [59] attempted to divide the reconfigurable logic into identical areas, called 'slots', and used an operating system to schedule the loading/unloading of tasks into the slots. Each task was designed according to a pre-defined template to standardize the inputs/outputs of all tasks. A table was used to keep track of the currently loaded tasks in the FPGA. The work does not give enough details to reproduce it, and there was no indication of any successful implementation, or even simulation, of the system.

Hardware task preemption has been explored in [60]. The system suspends the running task, reads back the configuration bitstream, and extracts the internal register status directly from the bitfile in a process called 'State Extraction'. In order to resume a suspended task, its bitstream has to be injected with the previous register values (updating the internal registers) before loading it to the FPGA, in another process called 'Task Reconstruction'. The system has many limitations, such as the need for an FPGA that supports read-back of the configuration bitfiles, and the process is also processor-intensive.

Wigley [61–64] proposed an OS for a reconfigurable system. The author supported every decision with a logical conclusion based on previous research; the literature review, flow of the thesis, and methodology were outstanding. Wigley used the proposed OS in a real application [65], but the practicality of the proposed OS in that system was vague.

The OS consists of three tiers; each tier runs on a separate machine and they communicate via standard TCP/IP. Each HW application is modeled as a data flow graph. Each node in the graph has to be implemented in advance, and the size of each node has to be stored with the application along with all the connections between nodes (implemented in Java classes). The user tier runs a program and sends the EDIF file to the Colonel tier. The Colonel tier has an allocator, a partitioner and a queue. Every HW task is stored in the queue as it arrives. The allocator tries to allocate a position on the FPGA if enough space exists; otherwise it sends the task to the partitioner along with the largest available vacancy. The allocator and partitioner communicate to find a way to allocate the program. If no suitable place exists, the allocator issues a block signal to the program and inserts it back at the end of the queue. After finding suitable locations for the task, the Colonel tier generates a suitable UCF file and sends the result to the Xilinx tools to generate the bit-file. The bitfile is then sent to the Platform tier, which performs the configuration.

The authors presented a prototype for a hardware OS that performs allocation, partitioning and simple scheduling (a simple queue with non-preemptive behavior). In order for the OS to function efficiently, CAD tools would have to support dynamic place-and-route. The author suggested an FPGA fabric that has a special layer for global routing to ease the connection between modules; without these features the OS would be impractical to use. With current technology, CAD tools have to be utilized for every load/unload of any task, a complete place-and-route algorithm needs to be run, and a full-chip reconfiguration performed, which takes extensive CPU time. The author also suggests metrics and benchmarks to measure reconfigurable system performance. This was one of the first implementations of a pure reconfigurable OS. Most results were based on simulation, and the OS assumed a special hardware fabric which does not exist in any of the currently available commercial products.

All of the previous systems employ hardware tasks as co-processors. This approach limits the flexibility of the FPGA, and it requires a powerful external processor with a high-speed bus. The same scenario would be completely impractical if the processor were embedded in the FPGA: according to [51], a PowerPC hardcore processor running at 300 MHz is not sufficient even for a simple video application. Therefore, it is very important to have autonomous reconfigurable tasks which can accept data from an external source and interact with the processor.

Walder and Platzner [66, 67], along with Steiger [66], were the pioneers who proposed the idea of using reconfigurable hardware not as a dependent co-processor but as standalone independent hardware. Hardware tasks can access a FIFO, a memory block and an I/O driver, or even signal the scheduler. The system uses a powerful external processor and connects to an FPGA via the PCI port. The proposed run-time environment can dynamically load and run variable-size tasks utilizing partial reconfiguration. In [66] the system was used as the foundation to explore online scheduling of real-time tasks for both 1D and 2D modules.

Hayden K. So et al. [8, 68, 69] extended the Unix OS to support hardware processes by providing native kernel support for FPGA hardware. In this work the FPGA is treated as a traditional computational resource. The BORPH OS introduced a new executable file format (thanks to the extensibility of Linux) that combines an FPGA bit-file, ELF data, and other information. A hardware process can be executed using the normal Unix 'exec' and 'fork' commands. Once a hardware process is loaded into the system, it appears to the user/software as a normal software process. Hardware processes are handled differently by the scheduler, in that they are not added to the software run queue. Communication between hardware processes is achieved by a special packet-passing mechanism which is compatible with the standard Linux message-passing mechanism. The authors used the Berkeley Emulation Engine 2 (BEE2), which is based on five Xilinx FPGAs (Virtex II Pro 70). The FPGAs are arranged in a star topology, with four user FPGAs in a ring and one control FPGA connected to each user FPGA. A PowerPC processor on the fifth (control) FPGA runs the main kernel, while another PowerPC processor is assigned to every user FPGA (with every task), which limits the speed of the system. The authors found that reading, writing and loading of software processes were much faster than for hardware processes. The hardware platform was redesigned by adding DMA and direct communication, which enhanced the system with a speedup of up to 70% over the initial prototype. Despite the 70% speedup, all of the test applications that used hardware processes were slower than software processes!

The authors extended Linux to support hardware processes with the available technology. Scheduling, partitioning, and allocation were not added for hardware processes; however, a HW process can be allocated to any free FPGA chip available. This work could be extended to support partial reconfiguration without any major issue.

Lübbers and Platzner [70] attempted to extend the multi-threaded programming model from software to hardware by supporting the same POSIX software functions for hardware tasks. The authors developed the required software extension and hardware communication model to achieve that. The software model runs as an application on top of a general OS, such as Linux, and the software extension provides an interface to the hardware communication model. There were two interesting points about this research: first, it can use any POSIX functions without any changes to the host OS kernel, and second, the hardware accelerator does not need to act as a co-processor but as a stand-alone unit.

Rodolfo and Marco [71] were the first to explore migration between hardware and software tasks. The authors studied the co-existence of hardware (on an FPGA) and software (running on a CPU) threads and proposed a migration protocol between them for real-time processing. They introduced a novel allocation scheme and online admission control, and also proposed an architecture that can benefit from utilizing dynamic partial reconfiguration. The importance of this work is attributed to it being the first to study migration between hardware and software tasks. The results were mainly based on pure simulation.

Fakhreddine [72] presented a run-time hardware/software scheduler that accepts tasks modeled as a DFG. The scheduler is based on a special architecture that consists of two Von Neumann processors (master and slave), a Reconfigurable Computing Unit (RCU) and shared memory. The RCU runs hardware tasks and can be partially reconfigured. Each task can be represented in three forms: a hardware bitstream run on the RCU, a software task for the master processor, or a software task for the slave processor. A scheduler migrates tasks between hardware and software to achieve the highest possible efficiency. The authors found the software implementation of the scheduler to be slow; for example, a software scheduler running on an Intel Core 2 Duo processor with a frequency of 2.8 GHz and 4 GB of RAM takes about 63% of the time of one image processing computation. Therefore, the authors designed and implemented a hardware version of the scheduler to minimize overhead. The paper focuses mainly on the scheduler and its hardware architecture; it does not address other issues associated with the loading/unloading of hardware tasks or the efficiency of inter-task communication. Another drawback is the requirement that each task in the data flow graph be assigned to one of the three processing elements mentioned above. This restriction conflicts with the idea of software/hardware migration!

Antonio [16] designed and implemented an execution manager that loads/unloads tasks in an FPGA using dynamic partial reconfiguration. The execution manager ensures task dependency and proper execution based on the internal data dependencies and the scheduler. It also uses prefetching and task reuse to reduce configuration time overhead. Tasks must be modeled as scheduled data flow graphs in order to be usable by the execution manager. The authors implemented software and hardware versions of the manager and found the software version to be too slow for real-time applications. The implementation used a simulated hardware task consisting of two timers that represent execution and configuration time. There are still many unaddressed issues before the manager can be used with real tasks, such as configuration management and inter-task communication.
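The task-reuse idea, i.e. skipping reconfiguration when the requested module is already resident, can be sketched as follows. The round-robin eviction policy, the region count and the cost figure are hypothetical simplifications, not taken from [16].

```python
# Configuration-reuse sketch: before reconfiguring, check whether the
# requested task is already resident in some region, and skip the
# (costly) reconfiguration if so.

class ConfigManager:
    def __init__(self, num_regions, reconfig_cost_ms=10.0):
        self.resident = [None] * num_regions   # task name per region
        self.next_evict = 0                    # simple round-robin eviction
        self.cost = reconfig_cost_ms
        self.time_spent = 0.0

    def acquire(self, task):
        """Return the region holding `task`, reconfiguring only on a miss."""
        if task in self.resident:              # reuse: zero reconfig cost
            return self.resident.index(task)
        slot = self.next_evict                 # miss: reconfigure a region
        self.next_evict = (self.next_evict + 1) % len(self.resident)
        self.resident[slot] = task
        self.time_spent += self.cost
        return slot

mgr = ConfigManager(num_regions=2)
for task in ["edge", "blur", "edge", "edge", "blur"]:
    mgr.acquire(task)
print(mgr.time_spent)   # 20.0 — only two reconfigurations were needed
```

Without reuse, the five requests above would have cost five reconfigurations; checking residency first cuts that to two, which is the overhead reduction the execution manager targets.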

Devaux et al. [73–75] studied the communication problem for on-line partially reconfigurable hardware tasks. Dynamic reconfiguration of tasks may lead to communication issues, since tasks are not present in the FPGA during the whole computation time; the dynamicity of the tasks needs to be supported by the interconnection network. In their work they studied and tested several Network-on-Chip (NoC) topologies. They found that the fat-tree meets the requirements of dynamic partial reconfiguration and has higher network performance in terms of bandwidth and latency. The main drawback of fat-tree networks is their high resource requirements; therefore, the researchers proposed a modified network, called DRAFT, to address the size issue. The published work also proposed a tool, called DRAGOON, that parametrizes and automatically generates the DRAFT topology. In [76] an architecture that supports both hardware and software threads, using the DRAFT NoC, is introduced. The goal of the work was to introduce a new type of system partitioning, where hardware and software components follow the same execution model. The architecture can be completely distributed, with the entire platform being homogeneous from the application's point of view. The published work was briefly introduced and lacks enough information to replicate it.

Although [77] is not directly related to this work, it proposed an excellent design exploration technique for studying the use of several RTOSs and processors with reconfigurable hardware.

3.2.1 Scheduling for Reconfigurable Operating Systems

As presented in Chapter 2, there are generally two scheduling techniques: static scheduling and dynamic scheduling. The first is employed at design time, while the latter is used at run-time to schedule dynamically arriving tasks. A static scheduler requires all task timing information in advance and uses intensive, complex algorithms that can lead to optimal results; it has therefore been used extensively in embedded systems. Schedulers in reconfigurable operating systems differ from conventional ones in two major respects. First, reconfigurable operating system schedulers depend heavily on the placement of hardware tasks, whereas in conventional systems, where software tasks are stored in memory, the scheduler can be independent of memory allocation. Second, computation resources are available as soon as a task is placed in the reconfigurable fabric, while in conventional systems a task may wait in a ready queue for free processing resources. Most static schedulers in reconfigurable systems are based on the list scheduling technique and use a queue to keep the ready tasks [14]. Given a list of ready and sorted tasks, the scheduler picks the task with the highest priority to execute. Static schedulers differ in how they calculate task priority; typical methods include random assignment, As Soon As Possible (ASAP), and As Late As Possible (ALAP) [14].
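The list scheduling idea described above can be sketched in a few lines. This is a generic, illustrative sketch (the function and argument names are assumptions), not any of the published schedulers or the schedulers developed in this thesis; priorities could come from ASAP/ALAP levels or be assigned randomly.

```python
def list_schedule(tasks, deps, priority, num_pes):
    """Greedy list scheduling: repeatedly run the highest-priority ready task.

    tasks    -- dict: name -> execution time
    deps     -- dict: name -> set of predecessor names
    priority -- dict: name -> priority value (e.g. from ASAP/ALAP levels)
    num_pes  -- number of identical processing elements
    Returns {task: (pe, start_time)}.
    """
    done_at = {}                     # task -> finish time
    pe_free = [0.0] * num_pes        # earliest free time per processing element
    schedule = {}
    pending = set(tasks)
    while pending:
        # a task is ready once all of its predecessors have been scheduled
        ready = [t for t in pending if deps.get(t, set()) <= done_at.keys()]
        t = max(ready, key=lambda x: priority[x])      # highest priority first
        pe = min(range(num_pes), key=lambda p: pe_free[p])
        start = max([pe_free[pe]] + [done_at[d] for d in deps.get(t, set())])
        done_at[t] = start + tasks[t]
        pe_free[pe] = done_at[t]
        schedule[t] = (pe, start)
        pending.remove(t)
    return schedule
```

A hardware-task scheduler would additionally have to decide placement, which is exactly the coupling discussed above.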


In [78] a high-level model, called the reconfigurable system design model (RSDM), is introduced. The scheduler, which is at the core of the RSDM model, takes resource constraints, a hardware library (containing processors, communication elements, and task-specific cores), a set of system tasks, and design constraints as input. The scheduler produces a feasible task schedule and a high-level hardware system description. The system uses random priority list scheduling along with simulated annealing and a genetic algorithm to schedule hardware tasks. The scheduler operation is very time consuming and is not suitable for use at run-time.

Most conventional dynamic scheduling algorithms can be adopted for hardware task scheduling, as long as the module placement problem is addressed separately [14]. However, this is not practical, as mentioned earlier. In [79] several conventional scheduling methods were used in an on-line scheduler for a block-partitioned reconfigurable device. The First Come First Serve (FCFS) and Shortest Job First (SJF) schedulers were used for non-preemptive tasks, while Shortest Remaining Processing Time (SRPT) and Earliest Deadline First (EDF) were used as preemptive schedulers. In [80] a dynamic version of ASAP/ALAP scheduling, called dynamic priority list scheduling, was proposed for reconfigurable systems.

In [81] a scheduler is proposed to improve hardware resource utilization. The authors presented two preemptive scheduling algorithms. EDF-Next Fit (EDF-NF) invokes the conventional EDF algorithm when tasks can be configured; otherwise it uses the next hardware task that can be placed on the FPGA. EDF-NF has good scheduling performance, but only for a small number of tasks. Another algorithm that can handle large numbers of tasks, but with lower performance, was also presented by the authors: the Merge-Server Distribute Load (MSDL) algorithm uses the concept of servers that reserve area and execution time for hardware tasks.
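The EDF-NF selection step described above can be illustrated as follows. This is a hedged sketch of the idea (the dictionary fields and the single contiguous `free_area` model are simplifying assumptions), not the implementation from [81]:

```python
def edf_next_fit(ready_tasks, free_area):
    """Pick the next hardware task to configure, EDF-Next Fit style.

    ready_tasks -- list of dicts with 'deadline' and 'area' keys (assumed fields)
    free_area   -- reconfigurable area currently available (simplified model)
    Returns the chosen task, or None if no ready task fits.
    """
    # Conventional EDF order: earliest deadline first.
    for task in sorted(ready_tasks, key=lambda t: t["deadline"]):
        if task["area"] <= free_area:   # fall through to the next task that fits
            return task
    return None                         # nothing can be placed right now
```

When the earliest-deadline task fits, this degenerates to plain EDF; otherwise it "stuffs in" the next placeable task, which is what improves utilization.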

In [82–84] the authors presented an algorithm to reduce reconfiguration overhead. The proposed scheduling algorithm integrates three modules into an existing hybrid run-time/design-time framework, called Task Concurrency Management (TCM) [85]. The three modules were reuse (different applications use the same task), prefetch (place the task ahead of time), and replace (increase the possibilities of reusing critical subtasks).

Two scheduling and placement techniques, referred to as horizon and stuffing, were proposed in [86, 87] to reduce scheduling overhead. The horizon scheduler maintains three lists: an execution list (the currently executing tasks), a reservation list (scheduled tasks not yet executed), and a scheduling horizon list (the location and last release time of the placed tasks). The advantage of horizon scheduling is its simplicity. The stuffing scheduler is similar to the horizon scheduler, but differs in the way it maintains unused space; it is more space efficient than the horizon algorithm. An extension of both algorithms to support a 2D reconfigurable area model is presented in [87]. Both stuffing and horizon produce logic fragmentation when there is a large variance in the size of hardware tasks. To reduce fragmentation, the classified stuffing technique was introduced in [88].
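The three lists maintained by the horizon scheduler map naturally onto a small data structure. The sketch below mirrors the description above; the class and field names are assumptions for illustration and do not reproduce the algorithm of [86, 87]:

```python
from dataclasses import dataclass, field

@dataclass
class HorizonScheduler:
    """The three bookkeeping lists of a horizon-style scheduler (sketch)."""
    executing: list = field(default_factory=list)  # currently executing tasks
    reserved: list = field(default_factory=list)   # scheduled, not yet started
    horizon: list = field(default_factory=list)    # (location, last_release_time)

    def place(self, task, location, release_time):
        # Reserve the task and record when/where its area becomes free again.
        self.reserved.append(task)
        self.horizon.append((location, release_time))

    def earliest_free(self):
        # Earliest time any recorded location is released (None if empty).
        return min((t for _, t in self.horizon), default=None)
```

Stuffing would differ mainly in `place`, by also tracking the gaps of unused area between reserved tasks.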

Scheduling algorithms for power reduction are presented in [89–91]. In [92, 93] the authors presented a class of co-scheduling algorithms, where software/hardware tasks can be relocated. Relocation implies that a hardware task can be preempted and restarted as a software task, and vice versa.

Gohringer et al. [94] proposed a hardware OS, called CAP-OS, that performs run-time scheduling, task mapping, and resource management on a run-time reconfigurable multiprocessor system, using RAMPSOC [95] as a platform. The authors proposed two schedulers: a static scheduler that generates a task graph and assigns resources, and a dynamic scheduler that uses the task graph produced by the static scheduler. There is a deadline constraint for the entire data graph instead of individual tasks. Most of their work was based on simulation, and the implementation lacked task reconfiguration.

3.3 Execution Unit Allocation and Genetic Algorithms

Many researchers have considered the use of multiple architecture variants as part of developing an operating system or a scheduler. In [96, 97] multiple core variants were swapped to enhance execution time and/or reduce power consumption. In our previous work [7], we alternated between software and hardware implementations of tasks to reduce total execution time. The work presented here differs in its applicability, since it is more comprehensive: it can be used during the design stage to aid the designer in selecting the appropriate implementation variant for each task in a task graph.

Genetic Algorithms (GA) have been used by many researchers to map tasks onto FPGAs. For example, in [98] a hardware-based GA partitioning and scheduling technique for dynamically reconfigurable embedded systems was developed. The authors used a modified list scheduling and placement algorithm as part of the GA approach to determine the best partitioning. The work in [99] maps task graphs onto a single FPGA composed of multiple Xilinx MicroBlaze processors and a hardware implementation. In [100] the authors used a GA to map task graphs onto partially reconfigurable FPGAs; they modeled the reconfigurable area as tiles of resources with multiple configuration controllers operating in parallel. However, none of the above works takes multiple architecture variants into account.

In [96] the authors modified the Rate Monotonic Scheduling (RMS) algorithm to support hot swapping of architecture implementations. Their goal was to minimize power consumption and re-schedule tasks on-the-fly while the system is running. The authors looked at the problem from the perspective of schedule feasibility, but their approach does not target any particular reconfigurable platform, nor does it take multiple objectives into consideration, which distinguishes it from the work proposed in this thesis.

R. Chen et al. [101] used an Evolutionary Algorithm (EA) to optimize power and temperature for heterogeneous FPGAs. The work in [101] is very different from our proposed framework in terms of the objectives considered and the actual implementation, since its focus is on swapping a limited number of GPP cores in an embedded processor for the arriving tasks. In contrast, our proposed work focuses on optimizing the hardware implementation of every single task in the incoming DFG, and targets partially reconfigurable FPGAs.

In [4, 6] the authors used a GA for task graph partitioning along with a library of hardware task implementations that contains multiple architecture variants for each hardware task. The variants reflect trade-offs between hardware resources and task execution throughput. In their work, the selection of implementation variants dictates how the task graph should be partitioned. There are a few important differences from our proposed framework. First, our work targets dynamic partial RTR with hardware operating systems: the system simultaneously runs multiple hardware tasks on PRRs and can swap tasks in real-time between software and hardware (the reconfigurable OS platform is discussed in [7]). Second, the authors use a single-objective approach, selecting the variants that give minimum execution time. In contrast, our approach is multi-objective: it seeks not only to optimize speed and power, but also to select the best reconfigurable platform. As each of these objectives is likely to have its own optimal solution, the optimization does not give rise to a single superior solution, but rather to a set of optimal solutions, known as Pareto optimal solutions (optimal in the sense that no solution can be considered better than any other with respect to all objectives). Unlike the work in [4, 6], we employ a parallel, multi-objective island-based GA to generate the set of Pareto optimal solutions (also known as the Pareto front).
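The dominance relation on which the Pareto front is defined can be stated precisely in a few lines. This is a generic minimization sketch for illustration, not the island-based GA developed in this thesis:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization assumed):
    a is no worse in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep only the non-dominated objective vectors (the Pareto front)."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]
```

For example, with objectives (execution time, power), the vectors (1, 5), (2, 2), and (3, 1) are mutually non-dominated and all belong to the front, while (4, 4) is dominated by (2, 2).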

3.4 Resource Prediction and Data-Mining

The use of both machine-learning and data-mining methods, as proposed in this work, represents a new direction for reconfigurable-computing research. In contrast, it has already become a fast-growing research area in physical design. Applications include predicting defects in silicon wafers [102], identifying speed paths in processors as guides for performance improvement [103], and design exploration for high-level synthesis [104]. A notable effort in the area of CAD for ASICs is PADE [105], a new ASIC placement flow that employs machine-learning and data-mining methods to predict and evaluate potential data-paths using high-dimensional data from the original design's net-list. PADE achieves 7% to 12% improvements in solution quality compared to the state-of-the-art. A summary of other successful applications of data-mining-driven prediction to problems in physical design can be found in [106].

There is an abundance of research work in the literature covering the management of reconfigurable computing systems. Most of the previous work concentrates on the development of OSs and managers along with the necessary modules, such as schedulers and placers. Only very few articles discuss resource estimation and the use of machine learning techniques for predicting the necessary resources for dynamic reconfigurable computing systems. The authors in [3] present an automated tool to support dynamic reconfiguration for high-performance regular expression searching, including a method to quickly and accurately estimate the resource requirements of a given set of regular expressions. However, this work is limited, since it applies only to regular expression searching; moreover, no prediction or learning from past history is performed in this approach.

In [107] the authors investigate the use of advanced meta-heuristic techniques along with machine learning to automate the optimization of a reconfigurable application's parameter set. The approach, called the Machine Learning Optimizer (MLO), involves a Particle Swarm Optimization (PSO) methodology along with an underlying surrogate fitness function model based on Support Vector Machines (SVM) and a Gaussian Process (GP). Their approach is mainly used to save time on analysis and application-specific tool development. Our work is completely different from MLO in the sense that our framework predicts the necessary resources and floor-plan in dynamic reconfigurable systems to optimize the execution of benchmarks as they arrive for processing.

The authors in [108] propose a fast, a priori estimation of resources during system-level design for FPGAs and ASICs targeting FIR/IIR filters. The prediction is based on neural networks, and the targeted resources include area, maximum frequency, and dynamic power consumption. However, the work is very limited and is not applicable to dynamic run-time reconfiguration applications.


An on-line predictor for a dynamic reconfigurable system is proposed in [109] to reduce reconfiguration overhead by pre-fetching hardware modules. The proposed algorithm uses a piecewise linear predictor to find correlations and load hardware modules a priori. This work tries to optimize the use of fixed resources, while our work operates at a much higher level and seeks to predict the necessary resources themselves.

The work in [110] proposed a dynamic-learning data mining technique for failure prediction in high-performance computing. Their main contribution was to dynamically grow the training set during system operation, which helps in predicting failures early in deployment. The work in [110] is fundamentally different from ours, since it does not predict resources based on an intelligent machine learning approach.

A multi-objective design space exploration tool is proposed in [111] to enable resource management for the Molen reconfigurable architecture. The proposed approach analyzes an application's source code and, using heuristic techniques, determines a set of hardware/software candidate configurations (sub-tasks). The resource manager then uses these candidates to exploit the available system resources more efficiently. Their work tends to optimize the sub-tasks of the application to fit a fixed, pre-determined platform, while our work predicts a suitable platform for a given application. Moreover, their work targets a specific platform (Molen) and does not take partial reconfiguration into account.

In [112], the authors proposed an algorithm for Programmable Reconfigurable (PR) module generation. The proposed technique can be integrated into a manual design flow to automate the generation of PR partitions and modules. The authors formulated the PR module generation problem as a standard Maximum-Weight Independent Set problem [113]. Their design supports multiple objectives, such as reconfiguration overhead and area, with different constraints. This work differs from ours in many aspects: the techniques proposed in their paper neither use machine learning nor learn from previous results; moreover, the work is limited to the generation of PR partitions.

The closest published research to our work can be found in [114], [115], [116] and [117]. In [114], the authors present a high-level prediction modeling technique that produces prediction models for miscellaneous platforms, tool chains, and application domains. The framework proposed in this paper uses linear regression and neural networks to accurately capture the relation between hardware and software metrics. It takes an ANSI-C description as input and estimates various FPGA-related measures, such as area, frequency, or latency. However, our approach is totally different in the following aspects: (a) our framework does not predict hardware resource consumption, but rather the optimal layout (PRRs, soft cores) that would best execute a certain application such that power is reduced and performance is enhanced; (b) our framework is associated with an OS for dynamic reconfigurable systems; (c) our framework targets dynamic reconfigurable designs, not static designs as in Quipu.

The work in [115] presents a two-layer framework for resource management of dynamic reconfigurable platforms. The proposed system is capable of evaluating the performance of a reconfigurable computing platform based on a prediction model, and the framework is applied to an artificial vision case study. However, this paper does not seek to predict either the layout or the suitable resources required for the application. The resources are, in fact, fixed, and the main task of the run-time resource manager (RRM) is to allocate the best computational resource (software or hardware) for the application. The approach in [115] differs from ours in that its application-level decision making runs a greedy optimization, which is relatively computationally expensive, to find the mapping that returns the maximum performance with respect to a trained data-mining classifier. In our approach, we use a smart supervised learning approach to efficiently predict the best layout, maximizing performance and reducing power consumption.

In [116], the authors propose an online adaptive algorithm that decides the best implementation to use for the execution of an instance, based on features of the process and its execution history. The work aims to improve the hardware/software partitioning task by avoiding predetermined execution times and relying instead on run-time system execution history. This work does not use any statistical or machine learning technique to predict resources or the floor-plan of the reconfigurable system.

The work in [117] proposes a decision-making support framework, called DRuid, which utilizes machine learning and a meta-heuristic (combining Genetic Algorithms and Random Forests) to extract and learn the characteristics that make certain application functionality more suitable for a given computing technology. Starting from a 'C' implementation, the framework either selects the best computational element for accelerating the functionality or offers suggestions on code transformations that can be applied. The expert system identifies the functionalities that are efficiently accelerated by an FPGA 88.9% of the time. This work, however, differs from our proposed work, since it does not predict the most suitable floor-plan or layout of the reconfigurable computing platform for a specific application, but rather predicts which functionality of the application can be accelerated efficiently by an FPGA.


3.5 Summary

In this chapter we reviewed the state-of-the-art work on reconfigurable OSs and the progress of partial reconfiguration design flows. We also reviewed work on the use of multiple architecture variants as part of developing an operating system or a scheduler, along with the use of genetic algorithms in reconfigurable scheduling. Finally, work related to the use of machine-learning and data-mining for hardware reuse prediction was also reviewed.


Chapter 4

Overall Methodology and Tools

The main objective of this chapter is to provide the reader with an overview of the overall methodology used in this thesis. The intent is to help the reader understand the overall framework being proposed and to identify its key components, which are presented later in Chapters 5, 6 and 7, respectively. We first explain the developed run-time reconfigurable platform, which can manage both hardware and software tasks, along with the developed tools and benchmark generators needed to validate and measure the performance of the proposed system. This is followed by a brief description of the different phases required to design and implement the proposed framework.

4.1 Run-Time Reconfigurable Platform

Xilinx introduced a commercial partial reconfiguration design flow with ISE Design Suite version 12.1 in late 2010. Partial reconfiguration is usually used for time multiplexing multiple functions of an application that are not required to run at the same time on the FPGA (mutually exclusive functions). This approach reduces power consumption and enables the use of a smaller FPGA device, hence reducing costs. Figure 2.7, presented earlier in Chapter 2, illustrates the concept of partial reconfiguration, where a PRR can be reconfigured with different reconfigurable modules. The Xilinx partial reconfiguration design flow [5] uses a bottom-up synthesis approach, where Reconfigurable Modules have to be synthesized separately: each Reconfigurable Module is treated as a separate project in which it is verified and synthesized. The top design treats each Partial Reconfigurable Region as a black box. After generating all net-lists (top design and Reconfigurable Modules), each Partial Reconfigurable Region must be manually floor-planned using the Xilinx PlanAhead design tool. A PRR can be rectangular or L-shaped, with some restrictions. More details can be found in [5].

Xilinx ISE 12.4 with partial reconfiguration capability, along with EDK and PlanAhead, was used to develop a platform capable of dynamic run-time reconfiguration in this thesis. The proposed RTR platform was initially mapped onto Xilinx Virtex-4 and Virtex-5 boards, and eventually implemented on a Virtex-6 FPGA board. The platform consists of two Xilinx MicroBlaze processors running at 100 MHz and five partial reconfigurable regions for hardware tasks. One of the processors is dedicated to running the reconfigurable operating system along with the schedulers, while the other acts as a general-purpose processor, as shown in Figure 4.1. Each processor has its own data/instruction caches and memory. A DDR memory is shared between the two processors (using an MDM controller) to facilitate data sharing. In addition, a message box is implemented as a means of inter-processor communication; the message box interrupts the processors whenever new data arrives. Each processor has its own serial and debug module for debugging. The controller processor executes the schedulers and assigns tasks to the available processing elements (software and hardware).

The Xilinx partial reconfiguration design flow has been utilized to assign five PRRs that accommodate Reconfigurable Modules (RMs) as hardware accelerators. Figure 4.2 shows the FPGA floor plan, where the purple rectangles highlight the five PRRs. Each PRR can be reconfigured internally using the Xilinx ICAP controller. Hardware tasks are fetched from a library of partial bit-streams stored on an external memory card. The controller processor reads the appropriate partial bit-stream from external memory, performs some pre-processing operations, and then reconfigures the appropriate PRR. The use of an embedded processor for reconfiguration limits reconfiguration speed; therefore, a dedicated hardware reconfiguration manager frees the controller processor from the task of configuring and reconfiguring the system, thus enhancing performance. Each PRR contains a programmable timer that triggers an interrupt upon task completion. The timer is used to emulate task processing times, which helps in studying the effects of varying task processing times on the schedulers.

4.1.1 PRR Uniformity

The available dynamic reconfigurable area reserved for hardware tasks was partitioned into five PRRs. The reconfigurable area was divided into uniform and non-uniform partitions, as shown in Figure 4.2. The objective of having two implementations with different uniformities was to have as many PRRs as possible and to investigate the effect of PRR uniformity on reconfiguration and, hence, on total execution time and power consumption. The five PRRs are identical in size in the "uniform" implementation and differ in size in the "non-uniform" platform; the total area of the five PRRs is identical in both cases.


Figure 4.1: Framework: Major Blocks.


Figure 4.2: Floorplan for the uniform (left) and non-uniform (right) implementations.


Reconfiguration time mainly depends on the size of the PRR dedicated to the RM. Therefore, similar reconfiguration times are expected for each PRR in the uniform floorplan, while the non-uniform PRRs have reconfiguration times proportional to their respective sizes. The floorplan based on non-uniform PRRs is more flexible, as it can accommodate different task sizes. To efficiently exploit the non-uniform PRR platform, the developed schedulers should take the non-uniformity into account; hence, a smart scheduler, called RCSched-III, is introduced in Section 5.3.2.
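Because reconfiguration time scales with the partial bit-stream size, and thus with PRR size, the proportionality above can be captured by a back-of-the-envelope estimate. The throughput figure below is an assumed value for illustration only, not a measured characteristic of this platform (real throughput depends on the ICAP width, clock, and controller):

```python
def reconfig_time_ms(bitstream_bytes, icap_bytes_per_sec=100e6):
    """Estimate PRR reconfiguration time from partial bit-stream size.

    icap_bytes_per_sec is an assumed configuration-port throughput.
    """
    return bitstream_bytes / icap_bytes_per_sec * 1000.0

# A PRR twice as large (roughly twice the bit-stream) takes about twice as long,
# which is why a non-uniformity-aware scheduler can exploit the smaller PRRs.
small = reconfig_time_ms(300_000)
large = reconfig_time_ms(600_000)
```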

4.2 DFG Generator

Scheduling sequential tasks on multiple processing elements is an NP-hard problem [118]. Developing, modifying, and evaluating a scheduler requires extensive testing. Scheduling validation can be achieved using workloads created either from actual recorded user logs or, more commonly, from randomly generated synthetic benchmarks [118]. In this work, a synthetic random DFG generator was developed to assist in verifying system functionality and performance.

An overview of the DFG generator is given in Figure 4.4. The inputs to the DFG generator include a library of task types (operations), the number of required nodes, the total number of dependencies between nodes (tasks), and the maximum number of dependencies per task. The number of dependencies per task is illustrated in the examples shown in Figure 4.3. In addition, the user can control the percentage of occurrence in the DFG of each task or group of tasks. The outputs of the DFG generator consist of two files: a graph file and a platform file. The graph file is a graphical representation of the DFG in a user-friendly format, while the platform file encodes the same DFG in a format compatible with the targeted platform.

Figure 4.3: A node can be independent or can have several dependencies.

4.2.1 DFG Generator Sub-modules

The DFG generator consists of four major modules, as shown in Figure 4.4:

1. Matrix dependency generator: This module provides the core functionality of the DFG generator and is responsible for constructing the DFG skeleton (i.e., the nodes and dependencies). The nodes represent functions (tasks), and the edges represent data communicated among tasks. Both tasks and the edges between them are generated randomly based on user-supplied directives. The matrix dependency generator builds the entire DFG skeleton without assigning task types (operations) to the nodes.

2. Matrix operation generator: This module assigns task types (operations) to the DFG skeleton produced by the matrix dependency generator. A task type (operation) is assigned to each node from a library of task types based on user-supplied parameters.

3. Platform file generator: Generates the DFG in a format compatible with the RTR platform. This file format is explained in more detail in Section 4.3.1.

4. Graph file generator: Produces a graph representation of the DFG in the standard open-source DOT graph description language format [119].
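The flow through these modules can be sketched as follows. This is an illustrative reconstruction (function and parameter names are assumptions), not the thesis tool itself; it generates an acyclic dependency skeleton, assigns operations, and emits DOT output:

```python
import random

def generate_dfg(num_nodes, num_edges, max_deps, task_types, seed=None):
    """Random DFG skeleton + operation assignment (sketch).

    Edges only go from lower- to higher-numbered nodes, so the graph is
    acyclic by construction; max_deps caps the dependencies per task.
    """
    rng = random.Random(seed)
    deps = {n: set() for n in range(num_nodes)}
    candidates = [(a, b) for b in range(1, num_nodes) for a in range(b)]
    rng.shuffle(candidates)
    edges = []
    for a, b in candidates:
        if len(edges) == num_edges:
            break
        if len(deps[b]) < max_deps:          # honor max dependencies per task
            deps[b].add(a)
            edges.append((a, b))
    ops = {n: rng.choice(task_types) for n in range(num_nodes)}
    return ops, edges

def to_dot(ops, edges):
    """Emit the DFG in DOT format, one line per node and edge."""
    lines = ["digraph dfg {"]
    lines += ['  n%d [label="%s"];' % (n, op) for n, op in sorted(ops.items())]
    lines += ["  n%d -> n%d;" % (a, b) for a, b in edges]
    lines.append("}")
    return "\n".join(lines)
```

A platform-file writer would walk the same `ops`/`edges` structures and serialize them in the target format instead of DOT.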

Figure 4.4: DFG Generator Components


4.3 Reconfigurable Simulator

Using the hardware platform described in Section 4.1 has several advantages, but introduces some limitations. For example, the machine learning engine proposed in this thesis (Phase III) requires the evaluation of hundreds of DFGs on different hardware configurations to develop a model that can accurately predict an appropriate floorplan for an incoming DFG or application. The FPGA platform used in this work requires a different floor-plan and bit-stream for each new configuration, which limits the scope of testing and evaluation. Accordingly, a reconfigurable architecture simulator was developed to simulate the hardware platform discussed in Section 4.1 while running the developed reconfigurable operating system.

The simulator consists of three distinct layers, as shown inFigure 4.5. ThePlatform

Emulator Layer (PEL)emulates the RTR platform functionality and acts as a virtual

machine for the upper layers. The ROS kernel along with the developed schedulers runs

on top of the PEL. The developed code of the ROS kernel along with the schedulers can

run on both, the simulator and on the RTR platform (as discussed in Section 4.1) without

any further modification. The use of the PEL layer enables thedesigner to easily develop

and test new schedulers on the simulator before porting themto the actual platform.

This simulator utilizes three schedulers, as described in Chapter 5, and supports any number of PRRs and/or GPPs (software or hardware). The simulator was written in the C language and runs under Linux. The code was carefully developed in a modular format, which makes it easily expandable for new schedulers. The simulator uses several configuration files, similar to those used by the Linux OS. We opted to publish the simulator under the open source GPL license on GitHub [120] to enable


Figure 4.5: Reconfigurable simulator layout

researchers working in the area of reconfigurable operating systems to utilize it as a supporting tool for their work.

4.3.1 Simulator Inputs

The simulator accepts different input parameters to control its operation, as shown in Table 4.1. The simulator is developed to emulate the hardware platform and expects the following configuration files as input:

• A Task (architecture) Library file, which stores task information used by the simulator. Task information includes the mode of operation (software, hardware or hybrid), execution time, area, reconfiguration time, reconfiguration power and dynamic power consumption (hybrid tasks can migrate between hardware and software), on a per task type (operation) basis. Some of these values are based on


Table 4.1: Reconfigurable Simulator Input Parameters

Input Parameter          Description                                            Default
-h, --help               Help                                                   -
-V, --version            Display simulator version                              -
-v, --verbose            Show each node's execution and reconfiguration         -
                         times, in addition to the total execution time
-i, --iteration          The number of times the scheduler runs the same        1
                         DFG; this helps in the learning phase for RCSched-III
-t, --task-migration     Enable task migration between SW and HW                0
-q, --disable-q-search   Disable ready queue search for reuse                   0
                         (used by RCSched-III-Enhanced)
-k, --scheduler          Select the scheduler                                   3
-d, --dfg-file           DFG input file name                                    dfg.conf
-a, --arch-file          Architecture file name                                 arch.conf
-p, --prr-file           PRR (platform) file name                               prr.conf
-s, --prrs-set           Which platform to use from the platform                0
                         libraries in the platform file
-g, --task-graph         Generate a graph file that contains detailed           -
                         placement and timing for each task


analytic models found in [121], [122], while others are measured manually from actual implementations on a Xilinx Virtex-6 platform. For example, Figure 4.6 illustrates a task library for two task types (operations). The first task type, called Task1, has three hardware implementation variants (arch1, arch2, and arch3). Each task variant includes the time and power consumption for executing and reconfiguring the task, along with the reconfiguration frame area of the task (represented by rows and columns). Notice that unlike Task1, Task2 has one hardware variant and one software variant.

• A Layout (platform) file, which specifies the FPGA floor-plan. The layout includes data that represent the size, shape, and number of PRRs, along with the types and number of GPPs. Notice that the schedulers developed and used in this thesis require the number of processing elements and their reconfiguration times. The area can also be derived from the reconfiguration time; therefore, we combined the area and reconfiguration time in one parameter, as shown in Figure 4.7. The figure illustrates an example of five different floorplans (layouts) with different numbers of GPPs and PRRs.

• A DFG file, which stores the data flow graphs to be scheduled and executed. Figure 4.8 shows the S1 benchmark file as an example. This file is usually generated by the DFG Generator, or by a script that reads a graph representation of the DFG (the case for real-life benchmarks). The file defines the independent input and output vertices, using the keywords inputs and outputs. The intermediate vertices are then identified using the regs keyword (for registers). Finally, each node (T0 - T9) is defined with the node's inputs, outputs and type. The type is an index to the


id of the task type (operation) in the Task (architecture) Library file, as seen in Figure 4.6.

Name = "Architecture Library v1.0"
Date = "Jan 03, 2015"
######################
Task Task1 {            # Task name
  id = 1                # task ID
  arch arch1 {          # first architecture (task variant) for Task1
    exec_time = 5
    config_time = 5
    config_power = 5
    exec_power = 10
    columns = 1
    rows = 1
    mode = HW }
  arch arch2 {
    exec_time = 10
    config_time = 10
    config_power = 10
    exec_power = 5
    columns = 1
    rows = 2
    mode = HW }
  arch arch3 {
    exec_time = 10
    config_time = 20
    config_power = 20
    exec_power = 3
    columns = 2
    rows = 2
    mode = HW }
}
######################
Task Task2 {
  id = 2
  arch arch1 {
    exec_time = 5
    config_time = 5
    config_power = 5
    exec_power = 20
    columns = 1
    rows = 1
    mode = HW }
  arch arch2 {
    exec_time = 20
    exec_power = 15
    mode = SW }
}
######################

Figure 4.6: Simulator task variant (architecture) library (example).
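As a worked example of how a scheduler might use these per-variant numbers, the sketch below picks, for one task type, the hardware variant that minimizes configuration plus execution time. The structure layout mirrors the fields of Figure 4.6 but is an assumption for illustration, not the simulator's internal representation.

```c
#include <limits.h>

/* One task variant (arch) record, mirroring the fields in Figure 4.6. */
typedef struct {
    int exec_time, config_time, config_power, exec_power;
    int columns, rows;
    int is_hw;                      /* mode = HW (1) or SW (0) */
} arch_t;

/* Pick the hardware variant with the smallest config + exec latency.
 * Returns the variant index, or -1 if no hardware variant exists. */
int fastest_hw_variant(const arch_t *archs, int n) {
    int best = -1, best_cost = INT_MAX;
    for (int i = 0; i < n; i++) {
        if (!archs[i].is_hw) continue;
        int cost = archs[i].config_time + archs[i].exec_time;
        if (cost < best_cost) { best_cost = cost; best = i; }
    }
    return best;
}
```

With the Task1 numbers from Figure 4.6, arch1 costs 5+5 = 10, arch2 10+10 = 20, and arch3 20+10 = 30, so arch1 would be selected; a power-aware policy could weigh exec_power and config_power instead.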


Name = "PRR.conf v1.0"
Date = "Feb 4, 2015"
processors Set1 {
  PRRno = 5
  GPPno = 0
  PRRConfigTime = {20,20,20,20,20} }
processors Set2 {
  PRRno = 5
  GPPno = 0
  PRRConfigTime = {150,130,130,30,30} }
processors Set3 {
  PRRno = 5
  GPPno = 0
  PRRConfigTime = {150,90,30,10,5} }
processors Set4 {
  PRRno = 3
  GPPno = 0
  PRRConfigTime = {40,40,40} }
processors Set5 {
  PRRno = 5
  GPPno = 3
  PRRConfigTime = {20,20,20,20,20} }

Figure 4.7: Simulator platform (PRR) library (example).

4.3.2 Simulator Output

The simulator generates many system parameters, including total time, reconfiguration time, task migration information, and hardware reuse. For example, Figure 4.9 shows the output with the verbose option ('-v' or '--verbose') for the S1 benchmark, scheduled on a platform with 3 PRRs and 1 GPP. Figure 4.9 first displays data for the execution of each node (task) by printing the task ID, execution start time, reconfiguration start time, hardware or software execution (RECONFIG if the task ran on hardware and SW_COM otherwise), reconfiguration time, execution time, the index of the PRR/GPP, task reuse, task type priority (RCSched-III), and finally the task type ID (operation). This is followed by the final parameters for the entire DFG, which are self-explanatory.

The simulator can further produce the task placement graph by enabling the '-g' or '--task-graph' option. Figure 4.10 shows an example of the task placement


Name = "S1 Benchmark"
Date = "July 8th, 2013"
# this file is generated by the DFG generator
# input vertices (no dependencies)
inputs = { c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14 }
# output vertices
outputs = { o0, o1, o2, o3, o4, o5, o6 }
# intermediate vertices
regs = { r0, r1, r2 }

task T0 {       # task name
  type = 3      # task type ID, taken from the architecture library
  inputs = {c0, c1}
  output = r0 }
task T1 {
  type = 4
  inputs = {c2, r0}
  output = o0 }
task T2 {
  type = 2
  inputs = {c3, r0}
  output = r2 }
task T3 {
  type = 3
  inputs = {c4, c5}
  output = o1 }
task T4 {
  type = 2
  inputs = {c6, c7}
  output = r1 }
task T5 {
  type = 1
  inputs = {c8, r1}
  output = o2 }
task T6 {
  type = 3
  inputs = {c9, c10}
  output = o3 }
task T7 {
  type = 2
  inputs = {r1, c11}
  output = o4 }
task T8 {
  type = 1
  inputs = {r2, c12}
  output = o5 }
task T9 {
  type = 2
  inputs = {c13, c14}
  output = o6 }

Figure 4.8: Simulator DFG file example for the S1 benchmark.
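The dependency structure encoded by the regs vertices reduces to a simple readiness test: a node may be dispatched once every register it reads has been produced. The sketch below is an illustrative reading of such a file, not the simulator's actual parser or scheduler structures.

```c
#define MAX_IN 2

/* A DFG node reads up to two operands; an operand is either a primary
 * input (always available, encoded as -1) or the index of a register
 * produced by another node. */
typedef struct {
    int reg_in[MAX_IN];   /* register index read, or -1 for a primary input */
    int n_in;
} node_t;

/* A node is ready when every register it reads has been produced. */
int is_ready(const node_t *n, const int *reg_done) {
    for (int i = 0; i < n->n_in; i++)
        if (n->reg_in[i] >= 0 && !reg_done[n->reg_in[i]])
            return 0;
    return 1;
}
```

For S1, T0 reads only primary inputs (c0, c1) and is ready immediately, while T1 reads r0 and must wait until T0 completes, which is exactly the precedence the schedulers of Chapter 5 must maintain.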


graph for the S1 benchmark scheduled on a platform with 3 PRRs and 1 GPP. The Y axis of the graph represents time, while the X axis represents the processing elements (PRR/GPP). Each column is assigned to a PRR or a GPP, except for column 1, which represents the time. A PRR can be idle ("."), being reconfigured ("#"), or executing a task ("*"). To the left of each task (* or #) is the task ID, which is a reference to the task being executed or configured.

Node[0] --> T[333]  R[0]    RECONFIG Config 333 Exec 800 PRR[2] Reuse[NO]  Prio[0] Type[3]
Node[1] --> T[1467] R[-]    SW_COM   Config 0   Exec 800 GPP[0] Reuse[NO]  Prio[1] Type[4]
Node[2] --> T[1268] R[0]    RECONFIG Config 0   Exec 200 PRR[1] Reuse[YES] Prio[1] Type[2]
Node[3] --> T[334]  R[-]    SW_COM   Config 0   Exec 960 GPP[0] Reuse[NO]  Prio[0] Type[3]
Node[4] --> T[668]  R[335]  RECONFIG Config 333 Exec 200 PRR[1] Reuse[NO]  Prio[1] Type[2]
Node[5] --> T[1466] R[1133] RECONFIG Config 333 Exec 200 PRR[2] Reuse[NO]  Prio[1] Type[1]
Node[6] --> T[1001] R[668]  RECONFIG Config 333 Exec 800 PRR[0] Reuse[NO]  Prio[0] Type[3]
Node[7] --> T[1068] R[0]    RECONFIG Config 0   Exec 200 PRR[1] Reuse[YES] Prio[1] Type[2]
Node[8] --> T[1801] R[1468] RECONFIG Config 333 Exec 200 PRR[1] Reuse[NO]  Prio[1] Type[1]
Node[9] --> T[868]  R[0]    RECONFIG Config 0   Exec 200 PRR[1] Reuse[YES] Prio[1] Type[2]

Total Conf Time [1665], noGPP 1
[3]    Scheduler
[2267] Total Number of Cycles
[395]  Total Power
[5]    Number of Configurations
[0]    SW Busy
[131]  HW Busy
[0]    SW2HW MIG
[2]    HW2SW MIG
[3]    # of Reuse
[2]    # of SW tasks

Figure 4.9: Simulator output for the S1 benchmark, using the --verbose option.

4.4 Methodology Overview

In this thesis, an efficient operating system for reconfigurable computing is proposed, designed and implemented to ease application design and properly manage resources within the reconfigurable system. Making such a reconfiguration manager available along with the current flow should further enhance and improve partial reconfiguration on state-of-the-art FPGAs. The run-time reconfigurable manager consists of three main cooperating modules, as seen in Figure 4.11. The first module is mainly responsible for


Time  PRR0    PRR1    PRR2    GPP0
 0--  .....   .....   0#####  ......
 1--  .....   .....   0#####  ......
 2--  .....   .....   0#####  ......
 3--  .....   .....   0#####  ......
 4--  .....   .....   0#####  ......
 5--  .....   .....   0#####  ......
 6--  .....   .....   0#####  ......
 7--  .....   4#####  0*****  3*****
 8--  .....   4#####  0*****  3*****
 9--  .....   4#####  0*****  3*****
10--  .....   4#####  0*****  3*****
11--  .....   4#####  0*****  3*****
12--  .....   4#####  0*****  3*****
13--  6#####  4*****  0*****  3*****
14--  6#####  4*****  0*****  3*****
15--  6#####  4*****  0*****  3*****
16--  6#####  4*****  0*****  3*****
17--  6#####  9*****  0*****  3*****
18--  6#####  9*****  0*****  3*****
19--  6#####  9*****  0*****  3*****
20--  6*****  9*****  0*****  3*****
21--  6*****  7*****  0*****  3*****
22--  6*****  7*****  0*****  3*****
23--  6*****  7*****  5#####  3*****
24--  6*****  7*****  5#####  3*****
25--  6*****  2*****  5#####  3*****
26--  6*****  2*****  5#####  ......
27--  6*****  2*****  5#####  ......
28--  6*****  2*****  5#####  ......
29--  6*****  8#####  5*****  1*****
30--  6*****  8#####  5*****  1*****
31--  6*****  8#####  5*****  1*****
32--  6*****  8#####  5*****  1*****
33--  6*****  8#####  .....   1*****
34--  6*****  8#####  .....   1*****
35--  6*****  8#####  .....   1*****
36--  .....   8*****  .....   1*****
37--  .....   8*****  .....   1*****
38--  .....   8*****  .....   1*****
39--  .....   8*****  .....   1*****
40--  .....   .....   .....   1*****
41--  .....   .....   .....   1*****
42--  .....   .....   .....   1*****
43--  .....   .....   .....   1*****
44--  .....   .....   .....   1*****

Figure 4.10: Simulator graph file, where '#' represents reconfiguration and '*' execution. The number is the task ID. (S1 benchmark)


Figure 4.11: Overall Methodology Flow

placing and scheduling tasks of the application on the FPGA fabric. The second module attempts to allocate and bind the appropriate execution unit to each task to further reduce power consumption and improve performance. The third and final module uses a supervised machine learning approach to predict the most appropriate floorplan given a specific application. The development of the proposed framework occurred in three distinct phases, each of which is described next.

1. Phase I: Involves the development and evaluation of several novel heuristics for on-line scheduling for partially reconfigurable devices (discussed in Chapter 5). The proposed schedulers use fixed predefined partial reconfigurable regions with reuse, relocation, and task migration capability. In particular, RCSched-III uses real-time data to make scheduling decisions. The scheduler dynamically


measures several performance metrics, such as reconfiguration time and execution time, calculates a priority for each task type, and, based on these metrics, assigns incoming tasks to the appropriate processing elements. A dynamic reconfigurable framework consisting of five reconfigurable regions and two General Purpose Processors (GPPs) was implemented. The schedulers run as part of the ROS in the developed reconfigurable platform, as discussed in Section 4.1. To verify scheduler functionality and performance, different DFGs are needed. Therefore, a DFG generator was developed, as described in Section 4.2. The DFG generator is able to randomly generate benchmarks with different sizes and features, using predefined specifications.

2. Phase II: Proposes a design and implementation of a parallel, island-based GA approach for efficiently mapping execution units to task graphs for partial dynamic reconfigurable systems (described in Chapter 6). Each GA optimization module consists of four main components: an Architecture Library, an Initial Population Generation module, a GA Engine, and a Fitness Evaluation module (based on an on-line scheduler). The Fitness Evaluation module requires the platform and schedulers developed in Phase I. Using a hardware platform as the GA fitness function is not practical. Therefore, we developed a reconfigurable simulator for the RTR platform (described in Section 4.1). Each GA module tends to optimize several objectives based on a single static floorplan/platform. Unlike previous works, our approach is multi-objective and not only seeks to optimize speed and power, but also seeks to select the best reconfigurable floorplan. The basic idea is to aggregate the results obtained from the Pareto fronts of each island to enhance the overall solution quality. Each solution on the Pareto front tends to optimize power consumption, speed and area based on a different platform (floorplan) within the FPGA. Our approach was tested using both synthetic and real-world benchmarks. This is the first attempt to use multiple GA instances for optimizing several objectives and aggregating the results to further improve solution quality.

3. Phase III: Proposes a novel adaptive and dynamic methodology based on an intelligent machine learning approach that is used to predict and estimate the necessary resources for an application based on past historical information. Even though the approach is general enough to predict most if not all types of resources, from the number of PRRs, the type of PRRs, and the type of scheduler to the communication infrastructure, we limit our results to the former three required for an application. The framework is based on extracting certain features from the applications that are executed on the reconfigurable platform. The features compiled are then used to train and build a classification model that is capable of predicting the floorplan appropriate for an application. The classification model developed is based on a supervised learning approach that can generalize and accurately predict the class of an incoming application from previously observed patterns. The proposed approach is based on several modules, including benchmark generation, data collection, pre-processing of data, data classification, and post-processing. The goal of the entire process is to extract useful, hidden knowledge from the data; this knowledge is then used to predict and estimate the necessary resources and appropriate floorplan for an unknown or not previously seen application. Based on the literature review, the use of data-mining and machine-learning techniques has not been proposed by any research group to exploit this specific type of Design Exploration for Reconfigurable Systems in terms of predicting the appropriate floorplan of an application. This is needed due to the long run-times of training a number of different platforms. The machine learning prediction framework is introduced and explained in detail in Chapter 7.

4.5 Modes of Operations

The proposed framework, introduced briefly in Section 4.1 of this dissertation, can be used in several operational modes. The mode of operation depends on the degree of integration of the different modules developed and requested by the user.

1. Mode #1: The hardware platform initially developed using the Xilinx partial reconfiguration flow, along with the designed schedulers, can be used by the designer to schedule and place tasks of the application, as shown in Figure 4.12. In this mode of operation, only a single floorplan is employed along with a single execution unit associated with a task. We consider this the most basic mode of operation.

2. Mode #2: The hardware framework along with the schedulers and the GA optimization modules are integrated, as shown in Figure 4.13. The system can be used in the following capacities:

(a) Mode #2A: In this mode of operation, a single GA island is used to optimize the mapping and binding of execution units to the tasks of the graph. A single floorplan is used, but multiple execution units can be associated with each task, thus reducing power and delay.


Figure 4.12: Operation Mode 1

(b) Mode #2B: In this mode of operation, an island-based GA is used to further optimize the mapping and binding of execution units to the tasks of the graph. A set of floorplans (designed a priori) is used within the framework, and the user can pick and choose the solution generated by the island-based GA along with the appropriate floorplan associated with it.

3. Mode #3: The hardware framework along with the schedulers and the Machine Learning module are integrated. This system is capable of predicting the most appropriate floorplan based on the incoming DFG of the application. The main limitation of this mode of operation is that no variation of the execution units applies; therefore, only a single execution unit is associated with a task.

4. Mode #4: The final and most general mode of operation is based on the integration of the basic hardware framework along with the schedulers, the Island-Based


Figure 4.13: Operation Mode 2


Figure 4.14: Operation Mode 3


GA module, and the Machine Learning module. This is the most general and efficient framework, but also the most memory- and resource-demanding mode. In this mode of operation, the system utilizes the library of execution units that are allocated and bound to the tasks of the DFG using the GA module. The Machine Learning module, on the other hand, attempts to predict a specific floorplan for an incoming DFG. Operation Mode #4 can also be used to select the most appropriate scheduling technique via the machine learning module, while the island-based GA optimizes the mapping and binding of execution units to the tasks of the incoming DFG.

Figure 4.15: Operation Mode 4


4.6 Summary

In this chapter we introduced the overall methodology flow of the proposed work, along with the developed tools that have been used throughout this thesis. The methodology flow consists of three phases: Phase I deals with developing heuristic-based online reconfigurable schedulers for software and hardware tasks. In Phase II an island-based genetic algorithm is designed and developed that can map hardware task variants to task graphs, using the schedulers developed in Phase I as part of the fitness evaluation module. Finally, Phase III proposes a novel technique based on machine learning to predict the necessary resources for Dynamic Run Time Reconfiguration. This chapter also described the proposed hardware reconfigurable platform, along with the DFG generator and a reconfigurable simulator used as part of the proposed framework.


Chapter 5

Reconfigurable Online Schedulers

In this chapter, several online scheduling algorithms for reconfigurable computing are designed and implemented. The proposed schedulers manage both hardware and software tasks. We also introduce an efficient offline scheduler and an exact mathematical Integer Linear Programming (ILP) model for reconfigurable scheduling that can be used to verify the quality of solutions obtained by the online schedulers. The developed schedulers tend to reuse hardware resources for tasks to reduce reconfiguration overhead, migrate tasks between software/hardware, and assign priorities to task types, while respecting task priority and maintaining precedence between tasks. The first two schedulers give priority to hardware tasks and are called RCSched-I and RCSched-II. When there is a shortage of hardware resources, both RCSched-I and RCSched-II initiate the migration of hardware tasks to software. In contrast, the third scheduler (called RCSched-III) migrates tasks to software only when it is more beneficial to do so. RCSched-III dynamically measures several system metrics, such as execution time and reconfiguration time, and then calculates the priority for each task type. Based on these priorities, RCSched-III migrates and assigns tasks to the most suitable processing elements (software or hardware).
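The kind of cost comparison behind such a migration decision can be illustrated with a minimal sketch: run a task in hardware only when reconfiguration plus hardware execution beats the software alternative. This is a simplification for illustration, not RCSched-III's actual priority formula, which also tracks per-task-type priorities from measured run-time metrics.

```c
/* Illustrative hardware-vs-software decision: hardware wins when its
 * reconfiguration plus execution time is below the software execution
 * time. reconfig_time is 0 when the bitstream is already loaded (reuse),
 * which is why reuse makes hardware much more attractive. */
int prefer_hardware(int reconfig_time, int hw_exec_time, int sw_exec_time) {
    return reconfig_time + hw_exec_time < sw_exec_time;
}
```

With the S1 numbers of Figure 4.9, a type-2 task (Config 333, Exec 200) beats its 800-cycle software version even with a fresh reconfiguration, whereas a slower hardware variant might only win when reused.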

The main methodology developed in our work is presented as follows. First, the task representation of the benchmarks used in this work is explained, along with the impact of PRR uniformity on scheduling. We then introduce several baseline schedulers that were adapted from the literature to evaluate the performance of our proposed schedulers. This is followed by a description of the three novel reconfigurable scheduling algorithms (RCSched-I, RCSched-II, and RCSched-III), along with an enhanced version of RCSched-III, renamed RCSched-III-Enhanced. Finally, in order to evaluate the performance of the proposed schedulers, several experiments are carried out, along with comparisons with the baseline schedulers.

5.1 Task Representation

Applications running on the proposed partial reconfigurable operating system are modeled as a Data Flow Graph (DFG). Each node within a DFG represents a task. Each task has a predefined operation, ID, priority and type. Edges between nodes represent task dependencies. Figure 5.1 shows a sample DFG for an application along with the corresponding modes, as seen in the task model presented in Table 5.1.

Mode        Description
HybridSW    Hybrid task that usually runs on a GPP but can migrate to a PRR
HybridHW    Hybrid task that preferably runs on PRRs but can migrate to a GPP
SW          Software-only task executed on a GPP (no migration)
HW          Hardware-only task that runs on PRRs (no migration)

Table 5.1: Description of task modes


Figure 5.1: A Data Flow Graph (DFG).


5.1.1 Task Placement Models

The task placement model in a reconfigurable device can be abstracted as a 1D or 2D

model. The 1D model divides the reconfigurable device into columns that can be recon-

figured separately, where a task size is assigned by width only. In the 2D based approach,

a task can have any width and height and be placed anywhere on the FPGA fabric. The

1D model simplifies the placement mechanism and trades this simplification for a sub-

optimal device utilization. Both models are shown in Figure5.2. The RTR platform

presented in Section 4.1, uses 2D placement model as previously shown in Figure 4.2.

Nonetheless, PRR sizes cannot be changed after design time due to design flow limita-

tion, therefore, the scheduling problem presented in this work is based on the 1D model

representation.

Figure 5.2: 1D versus 2D Task Placement Area Models for Reconfigurable Devices
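Under the 1D model, placement reduces to finding enough contiguous free columns for a task's width. The first-fit sketch below illustrates why this model is so much simpler than 2D packing; the column bookkeeping is an assumption for illustration.

```c
/* 1D placement: the device is an array of columns, each free (0) or
 * occupied (1). A task of the given width goes into the first run of
 * contiguous free columns; returns the start column, or -1 on failure. */
int place_1d(int *cols, int n_cols, int width) {
    for (int s = 0; s + width <= n_cols; s++) {
        int fits = 1;
        for (int c = s; c < s + width; c++)
            if (cols[c]) { fits = 0; break; }
        if (fits) {
            for (int c = s; c < s + width; c++) cols[c] = 1; /* claim */
            return s;
        }
    }
    return -1;
}
```

A 2D variant would have to search over both coordinates and task shapes, which is exactly the extra placement complexity the 1D abstraction avoids at the cost of fragmented, sub-optimal utilization.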

The RTR platform introduced in Section 4.1 has two PRR floorplans: uniform and non-uniform. Non-uniform PRRs have different reconfiguration times; therefore, task placement affects total execution time. Accordingly, an efficient reconfigurable scheduler should take uniformity into account.

5.2 Baseline Schedulers

One of the main challenges faced in this work is the lack of compatible schedulers, tools, and benchmarks against which to compare and evaluate our work. Therefore, throughout the research process, we considered several published algorithms and adapted them to our RTR platform for the sake of comparison with our proposed schedulers. The adapted scheduling algorithms are described briefly in the following sections. First, an exact ILP model for scheduling tasks on a reconfigurable platform is described in detail. This is followed by RCOffline, a heuristic offline scheduler for reconfigurable platforms. Finally, two baseline schedulers, a meta-heuristic offline-based scheduler and a simple generic online scheduler, are used to further evaluate the schedules produced by the proposed online schedulers, which are introduced in Section 5.3.

5.2.1 ILP Model

The ILP model in [123] was modified and extended to solve the task scheduling problem in the proposed RTR platform. The ILP model takes into account hardware task reuse, task prefetching, and scheduling, with the single objective of minimizing total time. The model in [123] targets partial reconfigurable platforms with 1D task placement and attempts to reduce defragmentation. Since our proposed approach uses preassigned PRRs, where a task can be placed in only one PRR at a time, the ILP model was modified by treating each column as a PRR and restricting the placement of a task to one PRR (column).

The following ILP model is adapted to suit the proposed RTR platform and uses N uniform PRRs.

5.2.1.1 Constants

For every task $i \in O$, where $O$ is the set of tasks in a DFG, let:

• $l_i$ := latency of task $i$;

• $r_i$ := reconfiguration time of task $i$;

• $u_{ij}$ := 1 if tasks $i$ and $j$ perform the same operation (they can exploit task reuse), and 0 otherwise.

It is also important to consider the upper limit of the time needed to schedule all the tasks in a DFG (worst case), where

$$T = \sum_{i=1}^{|O|} (l_i + r_i). \qquad (5.1)$$
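The bound in Eq. (5.1) is simply the schedule length when every task is reconfigured and then executed strictly back-to-back, with no prefetch, reuse, or parallelism. A direct computation (a sketch using the same $l$ and $r$ notation):

```c
/* Worst-case schedule length (Eq. 5.1): every task is reconfigured and
 * executed sequentially, so T bounds any feasible schedule's makespan. */
int worst_case_T(const int *latency, const int *reconfig, int n_tasks) {
    int T = 0;
    for (int i = 0; i < n_tasks; i++)
        T += latency[i] + reconfig[i];
    return T;
}
```

This $T$ is what makes the model finite: the time index $e$ in the variables below only needs to range over $1 \ldots T$.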

5.2.1.2 Variables

The variables employed in the model are defined as follows (the variables are integers unless otherwise noted):

• $p_{iek} \in \{0,1\}$ := 1 if task $i$ is loaded on the FPGA at time $e$ onto PRR $k$;

• $t_{ie}$ := 1 if the reconfiguration of task $i$ starts at time $e$;

• $m_i$ := 1 if task $i$ exploits reuse, and 0 otherwise;

• $S^{in}_i$ := arrival time of task $i$ on the FPGA;

• $S^{out}_i$ := end of execution time of task $i$;

• $t_f$ := total execution time.

5.2.1.3 Constraints

PRR constraints: At any given time, no more than $|N|$ PRRs can be used:

$$\forall e,\quad \sum_{i=1}^{|O|} \sum_{k=1}^{|N|} p_{iek} \le |N| \qquad \text{PRR constraints.} \quad (5.2)$$

A PRR can only be used by one task at a time:

$$\forall e,k,\quad \sum_{i=1}^{|O|} p_{iek} \le 1 \qquad \text{No overlap.} \quad (5.3)$$

Time constraints: The first (initial) instant is reserved:

$$\sum_{i=1}^{|O|} \sum_{k=1}^{|N|} p_{i0k} = 0 \qquad \text{Time zero.} \quad (5.4)$$

The 1's in $p_{iek}$ are arranged in a single PRR for every task $i$ for the time it is on the FPGA:

$$\forall i,e,k,\quad \sum_{m=1}^{T} \left( \sum_{l=1}^{|N|} p_{iml} - p_{imk} \right) \le T\,(1 - p_{iek}) \qquad \text{Same PRR.} \quad (5.5)$$


Task arrival time must be less than or equal to the initial instant for which $p$ is 1:

$$\forall i,e,\quad S^{in}_i - e \sum_{k=1}^{|N|} p_{iek} \le T \left( 1 - \sum_{k=1}^{|N|} p_{iek} \right) \qquad \text{Task load time.} \quad (5.6)$$

Task leaving time must be greater than or equal to the last instant for which $p$ is 1:

$$\forall i,e,\quad S^{out}_i \ge e \sum_{k=1}^{|N|} p_{iek} \qquad \text{Task unload time.} \quad (5.7)$$

Tasks cannot disappear and reappear from the FPGA:

$$\forall i,\quad \sum_{e=1}^{T} \sum_{k=1}^{|N|} p_{iek} = S^{out}_i - S^{in}_i + 1 \qquad \text{Continuous usage.} \quad (5.8)$$

Precedence should be enforced:

$$\forall (i,j) \in P,\quad S^{out}_j - l_j \ge S^{out}_i \qquad \text{Precedences.} \quad (5.9)$$

Reconfiguration constraints: If a task cannot reuse an existing task, then it must stay on the FPGA for at least the sum of its reconfiguration and execution times:

$$\forall i,\quad r_i + l_i - \sum_{e=1}^{T} \sum_{k=1}^{|N|} p_{iek} \le T \cdot m_i \qquad \text{Reconfigured time.} \quad (5.10)$$

Reconfiguration starts as soon as the task is on the FPGA:

$$\forall i,\quad S^{in}_i - \sum_{e=1}^{T} e \cdot t_{ie} = T \cdot m_i \qquad \text{Reconfiguration start.} \quad (5.11)$$


Only one reconfiguration can take place at a time:

$$\forall e,\quad \sum_{i=1}^{|O|} \sum_{m=\max(1,\,e-r_i+1)}^{e} t_{im} \le 1 \qquad \text{Single reconfiguration.} \quad (5.12)$$

A task can reuse a placed task in a PRR at time $e$ if and only if at time $e-1$ the same task type occupies the same PRR position:

$$\forall i,e,k,\quad 1 - \sum_{j=1,\, j \ne i}^{|O|} u_{ij}\, p_{j(e-1)k} - p_{i(e-1)k} + T\,(1 - p_{iek}) \le T\,(1 - m_i) \qquad \text{Reuse.} \quad (5.13)$$

If a task reuses an existing configuration, it still has to stay on the FPGA for at least its execution time:

$$\forall i,\quad l_i - \sum_{e=1}^{T} \sum_{k=1}^{|N|} p_{iek} \le T\,(1 - m_i) \qquad \text{Reused time.} \quad (5.14)$$

The reconfiguration start time is unique, and it is not defined if there is task reuse:

$$\forall i,\quad \sum_{e=1}^{T} t_{ie} \le 1 - m_i \qquad \text{No reconfiguration.} \quad (5.15)$$

5.2.1.4 Objective

Definition of $t_f$:

$$\forall i,\quad t_f \ge S^{out}_i, \qquad t_f \le T \quad (5.16)$$

The objective is to minimize the total execution time:

$$\min\; t_f \qquad \text{Objective.} \quad (5.17)$$


The advantage of using an exact ILP model in this thesis is that it generates an optimal solution (schedule), which helps in verifying the solution quality produced by the proposed online schedulers. However, solving an ILP is NP-Complete; therefore, even a small problem takes a long time to solve. For example, it takes from 3 to 14 days to evaluate a small DFG with 10 nodes using CPLEX, a commercial LP/ILP solver. Accordingly, we adapted a greedy heuristic scheduler from [124] to evaluate our RTR platform, as explained in the next section. Compared with the exact ILP model, RCOffline produced near-optimal solutions at a fraction of the computation time.

5.2.2 RCOffline Scheduler

The second scheduler adapted to our RTR platform is called RCOffline. The RCOffline scheduler is a greedy offline reconfiguration-aware scheduler for 2D dynamically reconfigurable architectures. It supports task reuse, module prefetch, and anti-fragmentation techniques. The main advantage of utilizing this scheduler is that it produces solutions (schedules) of high quality (close to those obtained by the ILP model [123, 124]) in a fraction of the time, at least for small problems.

The offline scheduler proposed in [124] and modified in our work is different from the online schedulers proposed in this thesis. Since it is an offline technique, it expects the entire DFG in advance. It also assumes the existence of a reconfigurable platform that supports task relocation and PRR resizing at run time; at the time of writing this thesis, no such reconfigurable platform existed. Another key difference is that the offline scheduler does not support task migration or software tasks, unlike the schedulers proposed in this thesis.


5.2.3 Meta-Offline Scheduler

The third scheduler adapted for our RTR platform is a meta-heuristic-based offline scheduler [125] that uses different advanced meta-heuristic methods to achieve a near-optimal solution for multiprocessing systems. The scheduler is used specifically for heterogeneous multi-processor environments, where the tasks of a DFG are efficiently assigned to each processor. Because the Meta-Offline scheduler targets multiprocessor environments, it had to be modified to work with reconfigurable platforms. The processing elements in our RTR platform are PRRs and GPPs, each of which was treated as a separate processor by the Meta-Offline scheduler. Hardware task reconfiguration time also had to be accounted for, since it was not directly supported by the Meta-Offline scheduler.

5.2.4 RCSched-Base Scheduler

The final baseline scheduler adapted to our RTR platform is a simple online scheduler that produces feasible solutions with no optimization applied (such as reuse and placement priority). The RCSched-Base scheduler does, however, maintain dependencies, task priority, and placement restrictions. In the results of Section 5.4.3, RCSched-Base is used as a baseline when comparing and evaluating the proposed schedulers.

5.2.5 Baseline Schedulers Comparison

The four baseline schedulers were evaluated based on different criteria, including runtime, quality of solutions, and how well they represent the RTR platform.

The RCSched-Base scheduler produces inferior solutions, as its main objective is to come up with feasible solutions without optimization.


The Meta-Offline scheduler tends to use the GPP extensively for tasks whose reconfiguration time is high relative to their execution time, due to the fact that it does not support hardware task reuse. An important point to reflect upon is that the Meta-Offline scheduler requires more CPU time to find a near-optimal schedule: calculation time varies dramatically with the number of processing elements used, as shown in Figure 5.3. The Meta-Offline scheduler is therefore not suitable for embedded systems, because of its extensive use of resources and the long time needed to find a valid schedule.

Figure 5.3: Meta-Offline scheduler processing time.

The CPLEX solver for the ILP model takes a substantial amount of time to produce the optimal schedule. Accordingly, it was difficult to utilize the ILP model to evaluate solutions produced by the proposed online schedulers. For example, the solver took 74 hours to find the optimal schedule for the S1 benchmark on a Red Hat Linux workstation with a 6-core Intel Xeon processor running at 3.2 GHz and equipped with 16 GB of RAM. Table 5.2 compares the total time of the schedules produced by RCOffline and RCSched-III-Enhanced with that of the ILP model for three small benchmarks. The results show that RCOffline is on average 4.7% away from the optimal solution for the three small benchmarks.

Benchmark   # of nodes   ILP   RCOffline   RCSched-III-Enhanced
S1          10           160   160         160
S11         13           160   160         160
DFG14_2     11           120   140         140

Table 5.2: Baseline schedulers comparison (values for total time in # of cycles).

For these reasons, RCOffline was selected as the baseline: it represents the reconfigurable platform better than the Meta-Offline scheduler, can produce results in a relatively short time, and has been evaluated against an ILP model.

5.3 Proposed Scheduling Algorithms

The quality of the proposed online scheduling algorithms can have a significant impact

on the final performance of applications running on a reconfigurable computing platform.

The overall time taken to run applications on the same platform can vary considerably depending on the scheduler used.

In this section, three reconfigurable schedulers are introduced: RCSched-I, RCSched-II, and RCSched-III. The first two schedulers are designed to handle both software and hardware tasks, and migrate tasks to software in the event of hardware resource scarcity. Unlike RCSched-I and RCSched-II, which prefer utilizing hardware resources, RCSched-III assigns task priority dynamically based on live measured system metrics, and migrates tasks only if the total execution time is reduced. RCSched-III also takes advantage of non-uniform PRR implementations and tends to place tasks intelligently on the most appropriate processing element. RCSched-III was further enhanced by modifying the technique for selecting a task from the ready queue, which dramatically improves performance, as will be shown in Section 5.4.4.

5.3.1 RCSched-I and RCSched-II

Two efficient scheduling algorithms (RCSched-I and RCSched-II) were designed and implemented with different objectives. Both algorithms nominate Free (i.e., inactive or currently not running) PRRs for the next ready task. If all PRRs are busy and the GPP is free, the scheduler attempts to migrate the ready task (the first task in the ready queue) by changing the task type from hardware to software (if the task accepts hardware-to-software migration). Busy PRRs, unlike Free PRRs, accommodate hardware tasks that are active (i.e., running). The two algorithms differ with respect to the way the next PRR is nominated.

RCSched-I nominates the first free available PRR for reconfiguration, as seen on line 10 in the pseudo-code of Figure 5.4. The nominated PRR is then checked against the ready task. The ready task is scheduled to execute immediately if there is a match (i.e., if the PRR already holds the bit-stream of the ready task); otherwise the reconfiguration controller reconfigures the PRR with the bit-stream of the ready task and executes it. The RCSched-I scheduler tends to minimize the number of active PRRs and has less task variety. Therefore, it has a lower task reuse ratio compared to RCSched-II.

Task reuse is noticeably increased with the second scheduler (RCSched-II). RCSched-II reconfigures the least recently configured PRR. The scheduler nominates all free PRRs and then checks them against the ready task for a match. If there is a mismatch, the scheduler sends the least recently configured PRR for reconfiguration. This scheduler has more active PRRs, i.e., PRRs with running tasks. In addition, this scheduler leads to PRRs with higher task variety, which increases the task reuse ratio. Figure 5.4 shows the pseudo-code for RCSched-I and RCSched-II. The schedulers are called by the ROS main loop at the end of task execution and task reconfiguration, as shown in Figure 5.5. Assuming all DFG tasks are available, the complexity of RCSched-I and RCSched-II is O(n), where n is the total number of tasks in a given DFG.
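The difference between the two nomination policies can be sketched as follows. This is an illustrative model only, not the thesis implementation; the PRR fields (loaded_type, busy, last_config) are assumed names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PRR:
    loaded_type: Optional[str] = None  # task type whose bit-stream is loaded
    busy: bool = False                 # a task is currently running on this PRR
    last_config: int = -1              # time of the most recent reconfiguration

def nominate_rcsched1(prrs, ready_type):
    """RCSched-I: nominate the first free PRR; reuse only if it happens to match."""
    free = [p for p in prrs if not p.busy]
    if not free:
        return None, False
    prr = free[0]
    return prr, prr.loaded_type == ready_type

def nominate_rcsched2(prrs, ready_type):
    """RCSched-II: search all free PRRs for a reuse match; on a miss,
    nominate the least recently configured free PRR for reconfiguration."""
    free = [p for p in prrs if not p.busy]
    if not free:
        return None, False
    for prr in free:
        if prr.loaded_type == ready_type:
            return prr, True
    return min(free, key=lambda p: p.last_config), False
```

Because RCSched-II scans every free PRR before falling back to the least recently configured one, it keeps a wider variety of loaded bit-streams alive, which is what raises its reuse ratio over RCSched-I.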

5.3.2 RCSched-III

Due to the limitations of RCSched-I and RCSched-II, which do not consider PRR uniformity and task types when making placement decisions, a third scheduler, called RCSched-III, was designed to overcome these shortcomings, as seen in Figure 5.7. The main difference between RCSched-III and the previous two schedulers is in the assignment of tasks to the available processing elements. RCSched-III intelligently identifies and chooses the most suitable processing element (PRR or GPP) for the task at hand based on knowledge extracted from previous tasks' historical data. The scheduler dynamically calculates task placement priorities based on data obtained from previously running tasks. During task execution the scheduler learns about running tasks by classifying them according to their types. The scheduler then dynamically updates a memory table, the task type table, with the new task type information. The data stored in the table is later used by the scheduler to classify and assign priorities to new incoming tasks.


 1 Start
 2 Read next task
 3 if ( All dependencies have been met)
 4   {add task to ready queue }
 5 Fetch the first task from the ready queue
 6 switch (task->mode)
 7 { case Hybrid_HW :
 8     {if ( A free PRR is available)
 9       {// to minimize reconfiguration time
10       (RCSched-I) : {Check the free PRR for reuse}
11       (RCSched-II): {Search Free PRRs for a match
12                      against the ready task }
13       if(PRR Reuse is available)
14         run ready task on available PRR
15       else{
16         (RCSched-I) : {reconfigure PRR with the
17                        ready task bit-stream.}
18         (RCSched-II): {reconfigure least recently configured
19                        PRR with the ready task bit-stream}}
20       }else // all PRRs are busy
21       {if ( GPP is free)
22         { Change task type from HybridHW to HybridSW
23           Add to the beginning of the ready queue }
24       }else{
25         return Busy}
26       break; }
27   case HybridSW :
28     {if (GPP is free)
29       { load task into GPP}
30     }else if (there is a free PRR)
31       { change task type from HybridSW to HybridHW
32         Add to the beginning of the ready queue}
33     }else{
34       return Busy
35       break;}
36 }
37 End

Figure 5.4: Pseudo-code for RCSched-I & II (with reuse and task migration)


/*
 * Simplified version of the ROS main loop.
 * The OS state is global; it can be set by the configuration
 * thread and by the task execution thread.
 */
do {
    switch (State) {              /* OS states */
    case CfgDone:                 /* end of configuration */
        Call the scheduler;
        State = TaskDone;         /* set by the configuration thread */
        break;
    case TaskDone:
        /* Check whether there is a new task to be added to the ready queue. */
        AddTask2Queue(ReadyQ);
        Call the scheduler;
        State = TaskDone;
        break;
    case Start:                   /* the OS just started; the ready queue is empty */
        AddTask2Queue(ReadyQ);    /* wait for a task */
        State = TaskDone;
        break;
    default:
        print(Error: Unknown state);
        break;
    }
} while (TRUE);

Figure 5.5: Simplified pseudo-code for the ROS, illustrating how the scheduler is called.


The learning process is not an issue for embedded systems, since the same task sets usually run repeatedly on the same platform. For example, video processing applications use the same repeated task sets for every frame. That is to say, the scheduler will always produce a valid schedule, even during the learning phase; nonetheless, its efficiency increases over time. The following definitions are needed to further understand the RCSched-III algorithm:

PRR priority: Calculated based on the PRR size. PRRs with smaller size have lower reconfiguration time and thus higher priority.

Placement priority: A dynamically calculated metric, frequently updated on the arrival of new tasks. Placement priority is assigned based on the type (function) of a task. RCSched-III uses this metric to decide where to place the current task. Placement priority is an integer between 0 and the maximum number of processing elements, where a lower number means a higher priority.

Based on the pseudo-code of Figure 5.7, RCSched-III takes the following steps to efficiently schedule tasks:

1. The PRRs are given priority based on their respective sizes. PRR priority is a number from 0 to PRRmax − 1.

2. Arriving tasks are checked for dependencies. If all dependencies are satisfied, the task is appended to the ready queue.

3. The scheduler then selects the task with the highest priority from the ready queue; if the task mode is HybridHW, then:

(a) The scheduler checks the available PRRs for task reuse. If reuse is not possible, the scheduler uses the placement priority and the PRR priority to place the task either on a PRR or to migrate it to software (lines 26 to 39 in the pseudo-code of Figure 5.7). This step is useful for the non-uniform implementation, since each PRR has a different reconfiguration time. Task migration is performed when:

    Migrate_task = True,
    if [(HW exec. time + reconfig. time) > SW exec. time]
    or [there are no free PRRs]

(b) If no resources are available (either in hardware or software), a busy counter is incremented while waiting for a resource to become free (lines 44 to 48). The busy counter is a performance metric that gives an indication of how busy (or idle) the scheduler is; it counts the number of cycles a task has to wait for free resources.

4. When the task mode is HybridSW, RCSched-III first attempts to run the task on a GPP, and then migrates it to hardware based on GPP availability. Otherwise, RCSched-III increments the busy counter and waits for free resources (lines 49 to 60).

5. Software-only and hardware-only tasks are handled in a similar way to HybridSW and HybridHW, respectively, with no task migration.
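The migration rule in step 3(a) can be expressed as a small predicate. This is a sketch under the stated conditions, with illustrative parameter names:

```python
def should_migrate_to_sw(hw_exec_time, reconfig_time, sw_exec_time,
                         free_prrs, gpp_free):
    """Decide whether a HybridHW task should migrate to software:
    migrate when software execution beats reconfiguration plus hardware
    execution, or when no PRR is free. Migration is only possible if
    the GPP is idle."""
    if not gpp_free:
        return False
    return (hw_exec_time + reconfig_time) > sw_exec_time or free_prrs == 0
```

On a non-uniform floorplan the reconfiguration time passed in would be that of the nominated (smallest free) PRR, so the same task type may migrate on one PRR but stay in hardware on another.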

Placement priority, which is calculated by the scheduler, is different from task priority, which is assigned to DFG nodes at design time: placement priority determines where a ready task should be placed, while task priority determines the order in which tasks should run.

Priority Calculation. The calculation of PRR placement priority takes into account the PRRs' reconfiguration times in addition to task execution times:

1. At the end of task execution, RCSched-III adds any newly discovered task type to the task type table.

2. The scheduler updates the reconfiguration time for the current PRR in the task type table.

3. RCSched-III updates the execution time for the current processing element based on the following formula:

$$E_{new} = E_{old} + (E_{measured} - E_{old}) \times X$$

where E denotes the execution time and X is the learning factor. As a result of extensive experimentation, X has been set to 0.2. This assists the scheduler in adapting to changes in execution time for the same operation. The learning formula takes into account the old accumulated average execution time (80%) and the most recent instance of the task (20%). The goal of the task type table is to keep track of the execution times (and reconfiguration times, in the case of PRRs) of each task type on every processing element (PRRs and GPPs). Accordingly, when similar new tasks arrive, the scheduler can use the data in the table to determine the appropriate location to run the task. Figure 5.6 shows the task type structure along with the task type table.


4. Finally, RCSched-III uses the measured reconfiguration and execution times to calculate the placement priority for each task type.
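The learning update in step 3 is an exponential moving average. A minimal sketch with X = 0.2, using illustrative names:

```python
LEARNING_FACTOR = 0.2  # X in the formula: weight of the newest measurement

def update_execution_time(old_avg, measured):
    """Blend the accumulated average (80%) with the newest
    measured execution time (20%)."""
    return old_avg + (measured - old_avg) * LEARNING_FACTOR

# A task type whose measured runs jump from 100 to 150 cycles:
avg = 100.0
for measured in (150.0, 150.0, 150.0):
    avg = update_execution_time(avg, measured)
# avg drifts toward 150: ~110, ~118, ~124.4
```

The small learning factor keeps one noisy measurement from skewing the stored average, while still letting the table converge when a task's true execution time changes.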

struct TaskType {
    int ID;                   /* Task type ID */
    char *name;               /* Task type name, such as adder, sub, or FFT */
    Xuint32 SWET;             /* SW execution time */
    Xuint32 HWET;             /* HW execution time */
    int SWPriority;           /* Task type priority */
    int ConfigTime[NO_PRR];   /* Configuration time for each PRR */
    Xuint32 CanRun;           /* Flags that define placement restriction */
};
struct TaskType TaskTypeTable[size];  /* Task type table */

Figure 5.6: Task Type table structure

Assuming all tasks in a DFG are available, the complexity of RCSched-III is O(n), where n is the number of tasks in the DFG.

5.3.3 RCSched-III-Enhanced

To further enhance the performance of RCSched-III, the selection methodology was slightly modified. Instead of selecting the first task in the ready queue (i.e., the task with the highest priority), the scheduler searches the ready queue for a task that matches one already reconfigured. This addition dramatically reduces the number of reconfigurations, as will be discussed in Section 5.4.4.
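The enhanced selection policy can be sketched as follows. This is an illustrative model, assuming tasks are (name, type) pairs and loaded_types is the set of task types currently configured in the PRRs:

```python
from collections import deque

def select_task(ready_queue, loaded_types):
    """RCSched-III-Enhanced selection: prefer a ready task whose
    bit-stream is already loaded in some PRR (reuse, no reconfiguration);
    fall back to the highest-priority task at the head of the queue.
    The queue is assumed to be priority-ordered."""
    for task in ready_queue:
        if task[1] in loaded_types:
            ready_queue.remove(task)
            return task
    return ready_queue.popleft()

queue = deque([("t1", "fft"), ("t2", "add"), ("t3", "mul")])
# "add" is already configured in a PRR, so t2 is chosen ahead of t1:
task = select_task(queue, {"add"})
```

The trade-off is that a lower-priority task may jump the queue, but since every skipped reconfiguration removes a fixed overhead, total time still drops as long as dependencies are respected.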

5.4 Results and Analysis

The proposed online schedulers presented in Section 5.3 were implemented and evaluated at different stages of the research carried out in this thesis. The developed online schedulers were compared with RCOffline for the reasons highlighted in Section 5.2.5.


 1 /*
 2  * RCSched-III measures reconfiguration time and execution (SW/HW) time for each task
 3  * on every PRR it runs on in real time. Based on those measurements it dynamically
 4  * calculates the TaskTypePriority.
 5  */
 6
 7 Start
 8
 9 // The PRRs should be stored from the smallest to the largest, so
10 // the one with a smaller ID always has a smaller reconfiguration time
11 Sort PRRs in ascending order based on size.
12
13 Read the node with the highest task priority.
14 if ( All dependencies have been met)
15   {add task to ready queue }
16 Fetch the first task from the ready queue
17 switch (task->mode)
18 { case Hybrid_HW :
19
20   { if(TaskTypePriority==0 and GPP is free)
21     {// TaskTypePriority==0 indicates that the task is preferred to run on SW
22       Change task type from HybridHW to HybridSW
23       Add to the beginning of the ready queue }
24   }else{
25     if (there is a free PRR(s) that fits the current task)
26     {{Search Free PRRs for a match
27       against the ready task }
28
29       if(task found)
30         run ready task on the free PRR
31       else{
32         currentPRR = smallest free PRR;
33         if(TaskTypePriority < currentPRR and GPP is free)
34         { // check if it is faster to reconfigure or to run on SW
35           Change task type from HybridHW to HybridSW
36           Add to the beginning of the ready queue }
37         }else{
38           Reconfigure currentPRR with the ready task bit-stream}}
39     }else // all PRRs are busy
40     {if ( GPP is free)
41       { Change task type from HybridHW to HybridSW
42         Add to the beginning of the ready queue }
43     }else{
44       // return to main program and
45       // wait for free resources
46       Increase the Busy Counter
47       return Busy}
48     break; }
49   case HybridSW :
50   {if (GPP is free)
51     { load task into GPP}
52   }else if (there is a free PRR)
53     { change task type from HybridSW to HybridHW
54       Add to the beginning of the ready queue}
55   }else{
56     // return to main program and
57     // wait for free resources
58     Increase the Busy Counter
59     return Busy
60     break;}}
61 End

Figure 5.7: Pseudo-code for RCSched-III (with reuse and task migration)


In this section we introduce the three stages used to evaluate the developed online schedulers.

• Preliminary Stage: In this stage we compared RCSched-I and RCSched-II against the RCOffline scheduler introduced in Section 5.2.2. The metrics evaluated at this stage were based on hardware reuse, the flexibility of using GPPs, and the number of PRRs. This stage uses the uniform implementation with no placement restrictions (tasks can be placed in any PRR). The results of this stage were published in [7].

• Intermediate Stage: In this stage, RCSched-I, RCSched-II, and RCSched-III are evaluated. Two PRR floorplans were utilized in this evaluation stage (uniform and non-uniform). In addition to PRR uniformity, the schedulers' performance was evaluated with and without task placement restrictions, where a task can only be placed onto specific locations. A placement restriction can be due to size, I/O, or communication constraints.

• Advanced Stage: The third and final stage examines the performance enhancement of RCSched-III-Enhanced. The latter is evaluated against RCSched-III and RCOffline. The same benchmarks used in the intermediate stage were used again in this comparison.

5.4.1 Benchmarks

Verifying the functionality of any developed scheduler requires extensive testing based on benchmarks with different parameters, properties, and statistics. Accordingly, ten DFGs were generated by the DFG generator presented in Section 4.2. These DFGs differ with respect to the number of tasks (nodes), dependencies, and operations, as shown in Table 5.3. Operation (or task type) refers to the unique function of a task. The generated benchmarks comprise two sets of DFGs: the first set uses four operations, while the second set uses eight operations with variable execution times in software and hardware. In addition to the synthesized benchmarks, 7 real-world benchmarks selected from the well-known MediaBench DSP suite [126] are used in the schedulers' evaluation, as shown in Table 5.4.

Name   # of nodes   Total dep.   No. of operations
S1     10           5            4
S2     25           10           4
S3     50           50           4
S4     100          150          4
S5     150          200          4
S6     25           40           8
S7     50           60           8
S8     100          120          8
S9     150          200          8
S10    200          60           8

Table 5.3: Synthesized Benchmarks

5.4.2 Preliminary Stage

Name    # of nodes   Total dep.   Description
DFG2    51           52           JPEG - Smooth Downsample
DFG6    32           29           MPEG - Motion Vector
DFG7    56           73           EPIC - Collapse-pyr
DFG12   109          116          MESA - Matrix Multiplication
DFG14   11           8            HAL
DFG16   40           39           Finite Input Response Filter 2
DFG19   66           76           Cosine I

Table 5.4: MediaBench Benchmarks

To evaluate the performance of the developed schedulers we compared several parameters under different configurations. Each scheduler was tested with a different number

of PRRs, with and without the use of a GPP. In addition, we measured the parameters of the online schedulers with hardware task reuse enabled and disabled. Our main objective was to study the effect of varying the number and types of processing elements and of taking advantage of hardware reusability. The results are grouped by the measured parameters. Table 5.5 shows a comparison between RCSched-I, RCSched-II, and the baseline RCOffline scheduler. The online schedulers use a platform of 5 PRRs and one GPP, while RCOffline uses a hardware platform of 5 PRRs.


Table 5.5: Comparison of RCSched-I, RCSched-II and RCOffline

            |            RCSched-I              |            RCSched-II             |    RCOffline
Benchmark   | Total  Rec.   HW2SW  #     Busy   | Total  Rec.   HW2SW  #     Busy   | Total  #      Prefetch
            | time   time   mig.   Reuse Cnt.   | time   time   mig.   Reuse Cnt.   | time   Reuse
S1          | 201    160    0      2     0      | 160    120    0      4     0      | 160    3      2
S2          | 583    400    0      5     0      | 503    320    0      9     0      | 380    15     3
S3          | 867    780    1      10    16     | 740    580    2      19    50     | 520    37     1
S4          | 1735   1540   2      21    33     | 1293   1100   3      42    33     | 920    85     3
S5          | 2466   2240   3      35    47     | 1767   1560   4      68    29     | 1160   137    2
S6          | 478    440    0      3     0      | 441    400    0      5     0      | 300    11     6
S7          | 896    880    0      6     0      | 818    800    0      10    0      | 420    34     2
S8          | 1773   1740   0      13    0      | 1508   1480   0      26    0      | 775    73     1
S9          | 2697   2660   0      17    0      | 2417   2120   0      44    0      | 995    127    1
S10         | 3589   3540   1      22    10     | 2887   2860   0      57    0      | 1095   180    1
DFG2        | 912    800    1      18    12     | 827    720    1      22    0      | 680    34     3
DFG6        | 390    280    0      18    0      | 358    260    0      19    0      | 325    21     0
DFG7        | 701    420    0      32    0      | 596    340    0      36    0      | 405    44     5
DFG12       | 1270   1100   3      51    21     | 879    720    7      66    46     | 745    85     1
DFG14       | 183    140    0      4     0      | 162    140    0      4     0      | 140    5      2
DFG16       | 469    320    1      23    9      | 403    280    0      26    0      | 295    29     3
DFG19       | 738    660    2      31    28     | 560    500    2      39    30     | 455    50     0
Average     | 1173.4 1064.7 0.8    18.3  10.4   | 959.9  841.2  1.1    29.2  11.1   | 574.7  57.1   2.1


The results in Table 5.5 clearly indicate that RCSched-II is better than RCSched-I. Total time is lower for RCSched-II due to more hardware task reuse, which in turn reduces total reconfiguration time and hence total time. RCSched-II is 40% away from RCOffline, and improves on RCSched-I, whose average total time is 122% of RCSched-II's.

The next subsections of the preliminary stage illustrate the different performance metrics used to evaluate the schedulers, using the DFG shown in Figure 5.1 as an example.

5.4.2.1 Total Execution Time

The total execution time is defined as the time to load and execute all tasks, satisfy dependencies, and output results for a particular DFG. Figure 5.8 shows the total execution times of the schedulers for different configurations of the DFG presented in Figure 5.1. Reducing the total execution time is an important objective in designing any scheduler, in order to maximize resource utilization.

The following important points can be concluded from Figure 5.8:

1. The static scheduler has a noticeably shorter time when GPPs are available as resources. This can be attributed to the use of simple tasks with almost identical execution times in software and hardware; recall that extra time is needed for hardware tasks due to reconfiguration. While the offline static scheduler attempts to use the GPP as much as possible, the online scheduling algorithms use the GPP as a backup processing element. This is not the case when more complex tasks are presented that need less time to run on hardware, or when the reconfiguration overhead is minimal. Comparing the static scheduler without a GPP to the online schedulers (also without a GPP) shows a small difference in time, which reflects the real efficiency of the online schedulers.

2. Task reuse leads to shorter total execution time, even when compared to the offline static scheduler. The latter requires more time to find a feasible schedule on a similar platform.

3. Using GPPs to run software tasks is not only a flexibility advantage, but also reduces total execution time. From Figure 5.8 it can be concluded that using one PRR with a GPP gives better results than using two PRRs without a GPP. This is expected to change if the tasks need less time to run on hardware.

4. The RCSched-II algorithm has a lower total execution time because of its higher hardware task reuse ratio, which reduces the total reconfiguration time, as shown in Figure 5.9.

5. Without task reuse, the online schedulers performed equally well, except in two cases where RCSched-I performed better. This can be attributed to the nature of the DFG used. It is worth mentioning that RCSched-I has a simpler implementation; therefore, it is recommended when the task reuse feature is not needed.

Figure 5.8: Total Execution Time in mSec

5.4.2.2 Reconfiguration Overhead

The reconfiguration overhead was examined by measuring two related parameters: the total number of reconfigurations and the accumulated reconfiguration time (as seen in Figure 5.9). The former represents the number of reconfigurations that occurred for the sampled tasks' modules, and the latter shows the total time consumed by hardware task reconfiguration for a particular DFG.

RCSched-II tends to reduce the number of reconfigurations required (and hence the reconfiguration time) for the following reasons: it checks all free PRRs for a match, and it reconfigures the least recently configured PRR, which gives more task variety. It is worth mentioning that while increasing the number of PRRs reduces the number of reconfigurations, it has some side effects. This can be attributed to the nature of the application (DFG) used, the number of concurrent tasks, and the sequence of similar tasks.


Figure 5.9: Total Reconfiguration Time in mSec

5.4.2.3 Busy Counter

The busy counter parameter measures the frequency with which the scheduler waits for a free processing element. The busy counter is incremented when a task is present in the ready queue, no free PRR is available, and the GPP is busy. Whenever the GPP is free, the scheduler migrates hardware tasks to software in order to use the GPP effectively without increasing the busy counter. The busy counter parameter is proportional to the number of processing elements available, as shown in Figure 5.10. The busy counter can be used to measure the efficiency of adding extra processing elements. For example, in Figure 5.10 the maximum number of processing elements used was five, and increasing the number of processing elements beyond three had minimal effect.
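The busy-counter bookkeeping described above can be sketched as follows; this is an illustrative model, not the thesis code:

```python
def try_schedule(free_prrs, gpp_free, state):
    """One scheduling attempt for the task at the head of the ready queue.
    The busy counter records the cycles a ready task spends waiting
    because every PRR is busy and the GPP is occupied."""
    if free_prrs > 0:
        return "run_on_prr"
    if gpp_free:
        return "migrate_to_sw"   # use the idle GPP instead of waiting
    state["busy_counter"] += 1   # no resource free: record a wait cycle
    return "busy"
```

A busy counter that stays near zero as PRRs are removed signals over-provisioning, which is how the five-versus-three observation above can be read off the measurements.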

Figure 5.10: Busy Counter: Incremented when no Free Resources are Available

5.4.2.4 Hardware to Software Migration

The hardware-to-software migration parameter indicates the number of tasks that migrate from hardware to software. A task migrates from hardware to software only if two conditions are met: all PRRs are busy, and at the same time an idle GPP is available. Figure 5.11 shows the number of hardware-to-software task migrations for a sample DFG. It is important to note that all tasks within the DFG were initialized as hardware tasks; therefore, any task that eventually ran on the GPP did so due to a migration. Task migration enhances the system dramatically, especially when the number of PRRs is small compared to the number of concurrent tasks.

Figure 5.11: Number of Hardware to Software Task Migrations

5.4.3 Intermediate Stage

Tests were conducted using two strategies with different placement restrictions, as shown in Figure 5.12. The tests were performed on the uniform and non-uniform floorplans shown in Figure 4.2. Each floorplan was evaluated with two different placement constraints: restricted, where a task can only be placed in a particular processing element, and non-restricted, where a task can be placed on any processing element. A task can be restricted for many reasons, such as size, communication bus, or I/O requirements. Table 5.6 shows the placement restriction for each task type (operation) for the restricted case.

Figure 5.12: Experimental flow of the intermediate stage testing

OP ID  Mode      SW   PRR 1  PRR 2  PRR 3  PRR 4  PRR 5
1      HybridHW  YES  YES    YES    YES    YES    YES
2      SWOnly    YES  NO     NO     NO     NO     NO
3      HWOnly    NO   YES    YES    YES    YES    NO
4      HybridHW  YES  YES    YES    NO     NO     NO
5      HybridHW  YES  YES    YES    YES    NO     NO
6      HybridHW  YES  YES    NO     NO     NO     NO
7      HybridHW  YES  YES    YES    YES    YES    NO
8      HybridHW  YES  YES    YES    YES    YES    YES

Table 5.6: Task type constraints for the restricted-placement DFG


5.4.3.1 Schedule Evaluation

In this section, results obtained from testing the proposed online schedulers are discussed. For an effective evaluation, each DFG was run six times in sequence under the same conditions; the first two runs were discarded and the average of the remaining four runs was calculated. The motivation behind this can be explained as follows:

1. The results were more uniform after the first two iterations, especially for small DFGs. This is due to a better reuse ratio, since the currently loaded PRRs are now relevant to the DFG under test. In addition, RCSched-III needs several iterations to collect enough data to make more intelligent decisions regarding the placement of the next incoming task.

2. In most cases, the same tasks are repeated many times. For example, if there is a set of tasks for processing a video stream, the same DFG would run for every frame. Therefore, the effect of the first two iterations can be ignored.

3. For higher iteration counts, the measurements of the first iterations would contribute only marginally to the total average, since all measurements become constant after the first few iterations. The first two iterations can be considered a transient period for the system to reach steady state. Due to the time-consuming nature of running the tests, only six iterations were performed for every DFG under each configuration.

For the graphs in this stage, each point represents the average of the 10 benchmarks for each measured parameter.
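The averaging protocol above can be captured in a few lines; the helper below is a hypothetical illustration, assuming exactly six timed runs per DFG:

```python
# Minimal sketch of the measurement protocol described above: six runs per
# DFG, the first two (transient) runs discarded, the remaining four averaged.

def steady_state_average(run_times):
    """run_times: the six measurements for one DFG, in execution order."""
    assert len(run_times) == 6
    steady = run_times[2:]            # drop the two warm-up iterations
    return sum(steady) / len(steady)

# e.g. a cold first pair followed by four steady-state runs
assert steady_state_average([120, 95, 80, 80, 80, 80]) == 80.0
```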


5.4.3.2 Total Execution Time

Overall Time (Non-Uniform), in ms:

# of PRRs        1      2      3      4      5
RCSched-I    16181  11302   9908  10870  12302
RCSched-II   16263  11008   9571   9594  10998
RCSched-III  16172  10976   9362   8434   8392
Baseline     34570  20149  16852  17364  17821

Overall Time (Uniform), in ms:

# of PRRs        1      2      3      4      5
RCSched-I    20085  16434  15951  15847  16072
RCSched-II   20367  17340  15985  14661  13864
RCSched-III  19545  15546  12787  11706  10299
Baseline     33192  21613  19460  18896  18538

Figure 5.13: Total Execution Time: Non-Restricted Placement

Non-Restricted Placement: Figure 5.13 shows the total execution time of the proposed online schedulers for the non-restricted placement case. Figure 5.13 clearly shows that RCSched-III produces better results in most cases. RCSched-III has a noticeably lower total execution time when enough resources (more PRRs) are available. RCSched-I and RCSched-II performance was close to RCSched-III when a restricted number of PRRs (1 - 2) was used.


Restricted Placement: Figure 5.14 presents the total execution time of the proposed online schedulers for the restricted placement case. Results from Figure 5.14 indicate that the overall time almost doubled when restricted placement was imposed. The differences in performance among the schedulers were marginal, since the placement constraints reduced the solution space, thus minimizing the effect of the scheduling algorithm.

Overall Time (Non-Uniform), in ms:

# of PRRs        1      2      3      4      5
RCSched-I    46811  31316  28167  26789  25766
RCSched-II   46217  32217  27525  26021  23285
RCSched-III  47047  31194  29027  25420  25108
Baseline     46999  33282  28895  26527  28084

Overall Time (Uniform), in ms:

# of PRRs        1      2      3      4      5
RCSched-I    52212  34090  31857  28765  28080
RCSched-II   47765  33497  29087  26397  23843
RCSched-III  54360  36829  29930  28033  26343
Baseline     55431  36521  31455  27788  29132

Figure 5.14: Total Execution Time: Restricted Placement


5.4.3.3 Total Reconfiguration Time

Reconfiguration time is a measure of the total time spent downloading bitstreams into the FPGA (reconfiguring PRRs) for a particular task set. Reconfiguration time is an overhead of using hardware processing elements and tends to increase overall time. Reconfiguration time can be minimized using several techniques, such as:

1. Hardware reuse, which is supported by the proposed schedulers, namely RCSched-I, RCSched-II, and RCSched-III.

2. Selecting the smallest possible processing element that can accommodate the current task (supported by RCSched-III).

3. Migrating the task to software to further improve performance (supported by RCSched-III).

Figure 5.15 along with Figure 5.16 present the total reconfiguration time for the non-restricted and restricted placement cases, respectively.

Non-Restricted Placement: It is clear from Figure 5.15 that RCSched-III has better reconfiguration time (in all cases), especially for implementations with four and five PRRs. As the number of PRRs decreases, the performance of the three schedulers converges. Reconfiguration time was significantly lower for the non-uniform implementations due to the reasons highlighted earlier.

Restricted Placement: Figure 5.16 shows the total reconfiguration time for the restricted placement case. Reconfiguration time was lower for the non-uniform implementation


Total Reconfiguration Time (Non-Uniform), in ms:

# of PRRs        1      2      3      4      5
RCSched-I     2486   3733   5712   8700  10964
RCSched-II    2212   3325   5193   7222   9531
RCSched-III   2243   3460   5215   6012   6776
Baseline      8037   8974  11809  14825  17152

Total Reconfiguration Time (Uniform), in ms:

# of PRRs        1      2      3      4      5
RCSched-I     7548  11393  13553  14594  15219
RCSched-II    8090  12350  13596  13360  10663
RCSched-III   6738  10802  10439  10296   9204
Baseline     13334  15926  17624  18179  14299

Figure 5.15: Total Reconfiguration Time: Non-Restricted Placement


for RCSched-II. However, the difference between the schedulers was marginal for the reasons mentioned earlier.

Total Reconfiguration Time (Non-Uniform), in ms:

# of PRRs        1      2      3      4      5
RCSched-I     6105   7143   7740   7592   7253
RCSched-II    5018   5711   6508   5998   5160
RCSched-III   5184   5854   6392   5898   4722
Baseline      6247   7153   9076  11711  13468

Total Reconfiguration Time (Uniform), in ms:

# of PRRs        1      2      3      4      5
RCSched-I     9873  11210  10529   9683   9094
RCSched-II    8612   9070   8293   6724   5289
RCSched-III  14946  16179  13745   9930   8360
Baseline     12942  14167  14593  14905  14857

Figure 5.16: Total Reconfiguration Time: Restricted Placement

5.4.3.4 Intermediate Stage Analysis

Based on the results presented in the previous section, the following conclusions can be drawn:

1. The results obtained were based on two different implementations, uniform and non-uniform. The total reconfigurable area (accumulated PRR size) was set equal for both implementations. To gain further insight into the performance of the schedulers, placement restrictions were imposed on the benchmarks.

2. The main objective of the proposed schedulers is to reduce the total execution time. The non-uniform implementation has a lower total execution time, which is mainly due to lower reconfiguration time.

3. The performance of RCSched-III was superior to RCSched-I and RCSched-II when there was no resource scarcity. This can be attributed to the smarter task migration between software and hardware, which takes task execution and reconfiguration time into account.

4. The difference in total execution time was marginal for the systems with limited resources (1 and 2 PRRs).

5. Increasing the number of PRRs enhanced the performance by reducing the total execution time.

The remaining measured metrics are presented in more detail in Appendix A.

5.4.4 Advanced Stage

In this section, results based on RCSched-III along with an enhanced version of RCSched-III are presented. For a fair comparison with the RCOffline scheduler, task migration and software tasks were completely disabled in RCSched-III and RCSched-III-Enhanced. The RCOffline scheduler was also assigned the same reconfigurable area available in the hardware platform. All results were based on the uniform floorplan platform shown in Figure 4.2.

5.4.4.1 Total Execution Time

The total execution times (in cycles) of the proposed online scheduler (RCSched-III) and its variant are presented in Table 5.7 and Table 5.8. The tables compare the total execution time of the RCSched-III, RCSched-III-Enhanced, and RCOffline schedulers, and also show the gap percentage between RCSched-III-Enhanced and the RCOffline scheduler. Table 5.7 shows the total execution time of the first run for all schedulers; these results are based on the assumption that all PRRs are empty and RCSched-III is in the initial phase of learning. Table 5.8, on the other hand, shows the total execution time after the first two iterations, at which point the learning phase of RCSched-III is complete (i.e., the third iteration). Since RCOffline is an offline scheduler with no learning capability, the time it takes to configure the first task of a DFG was subtracted (it is assumed to be already configured).

The results in Table 5.7 clearly indicate an average performance gap of 40% between RCSched-III-Enhanced and RCSched-III, and a gap of only -6% with respect to RCOffline. This is a significant improvement, taking into account that RCSched-III-Enhanced is an online scheduler that requires little CPU time compared to RCOffline. The performance of RCSched-III-Enhanced improves dramatically following the learning phase, as shown in Table 5.8: the performance gap was 49% with respect to RCSched-III and 2% with respect to RCOffline. The performance is expected to improve even more if task migration is enabled.

The CPU run-time comparison of RCSched-III-Enhanced and RCOffline is shown in Table 5.9. The tests were performed on a RedHat Linux workstation with a 6-core Intel


Table 5.7: Scheduler comparison for Total Execution Time (no learning).

Benchmark   RCOffline   RCSched-III   RCSched-III-En.   Gap %
             (cycles)      (cycles)          (cycles)
S1               160           160               160        0
S2               380           500               460      -17
S3               520           740               540       -4
S4               920          1430              1000       -8
S5              1160          1750              1220       -5
S6               300           420               340      -12
S7               420           860               500      -16
S8               775          1470               820       -5
S9               995          1990              1000       -1
S10             1095          3000              1140       -4
DFG2             680           800               584       16
DFG6             325           340               359       -9
DFG7             405           600               481      -16
DFG12            745           910               761       -2
DFG14            140           160               142       -1
DFG16            295           403               383      -23
DFG19            455           540               460       -1
Average          575           945               609       -6
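The Gap % column in Tables 5.7 and 5.8 is consistent with the relative difference between RCOffline and RCSched-III-Enhanced, rounded to the nearest percent; the helper below is a reconstruction of that metric, not code from the thesis:

```python
# Reconstructed gap metric: positive when RCOffline is slower than
# RCSched-III-Enhanced, negative when the offline schedule is faster.

def gap_percent(rcoffline_cycles, enhanced_cycles):
    return round(100.0 * (rcoffline_cycles - enhanced_cycles) / enhanced_cycles)

assert gap_percent(380, 460) == -17   # S2, no learning
assert gap_percent(680, 584) == 16    # DFG2, no learning
```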


Table 5.8: Scheduler comparison for Total Execution Time (after the learning phase).

Benchmark   RCOffline   RCSched-III   RCSched-III-En.   Gap %
             (cycles)      (cycles)          (cycles)
S1               140           140               120       17
S2               360           440               345        4
S3               500           660               520       -4
S4               900          1230               866        4
S5              1140          1690              1160       -2
S6               280           390               280        0
S7               400           700               430       -7
S8               755          1370               660       14
S9               975          1950               950        3
S10             1075          2620              1075        0
DFG2             660           830               584       13
DFG6             305           360               322       -5
DFG7             385           555               436      -12
DFG12            725           870               723        0
DFG14            120           160               103       17
DFG16            275           380               288       -5
DFG19            435           540               460       -5
Average          555           876               548        2


Xeon processor running at 3.2 GHz and equipped with 16 GB of RAM. RCSched-III-Enhanced is on average 1000 times faster than RCOffline.

Table 5.9: Run-time comparison of RCOffline and RCSched-III-Enhanced

Benchmark   RCOffline (ms)   RCSched-III-En. (ms)   Speedup (x times)
S1                  2               0.116                  1724
S2                  5               0.614                   814
S3                 12               1.52                    789
S4                 31               4.98                    622
S5                 59               9.14                    646
S6                  5               0.418                  1196
S7                  8               1.31                    611
S8                 29               2.03                   1429
S9                 58               6.28                    924
S10               110               9.74                   1129
DFG2               12               0.829                  1448
DFG6                3               0.553                   542
DFG7                7               1.15                    609
DFG12              29               3.69                    786
DFG14               2               0.111                  1802
DFG16               3               0.754                   398
DFG19              12               1.39                    863

5.4.4.2 Hardware Task Reuse

Table 5.10 and Table 5.11 show the number of reused hardware tasks prior to the learning phase and after learning, respectively. Hardware task reuse tends to reduce the number of required reconfigurations and, hence, total execution time. It is clear from the tables that the amount of hardware task reuse improved with RCSched-III-Enhanced, which contributed to a reduction of the total execution time. Table 5.11 shows the benefit of the learning phase, which tends to increase the variety of the available reconfigured tasks.

Table 5.10: Scheduler comparison for Number of Hardware Reuses (no learning).

Benchmark   RCOffline   RCSched-III   RCSched-III-En.   # Prefetch (RCOffline)   Gap %
S1                  3             4                 4                        2      33
S2                 15             9                14                        3      -7
S3                 37            22                37                        1       0
S4                 85            40                78                        3      -8
S5                137            74               128                        2      -7
S6                 11             6                10                        6      -9
S7                 34             8                27                        2     -21
S8                 73            28                67                        1      -8
S9                127            53               118                        1      -7
S10               180            54               173                        1      -4
DFG2               34            23                38                        3      12
DFG6               21            19                19                        0     -10
DFG7               44            36                42                        5      -5
DFG12              85            72                84                        1      -1
DFG14               5             4                 5                        2       0
DFG16              29            26                26                        3     -10
DFG19              50            42                49                        0      -2
Average            57            31                54                        2      -3

RCSched-III-Enhanced has 172% and 158% more average hardware reuse than RCSched-III prior to and following the learning phase, respectively. RCSched-III-Enhanced exceeded the amount of hardware reuse of RCOffline after the learning phase by 119%, and it reached 97% of RCOffline's hardware reuse without learning. RCOffline is able to prefetch tasks, which influences the amount of reuse, but the number of prefetches was marginal, as can be seen in the results presented in Table 5.10 and Table 5.11.


Table 5.11: Scheduler comparison for Number of Hardware Reuses (after learning).

Benchmark   RCOffline   RCSched-III   RCSched-III-En.   # Prefetch (RCOffline)   Gap %
S1                  3             8                 8                        2     167
S2                 15            13                21                        3      40
S3                 37            26                41                        1      11
S4                 85            51                90                        3       6
S5                137            79               131                        2      -4
S6                 11            10                13                        6      18
S7                 34            17                31                        2      -9
S8                 73            33                79                        1       8
S9                127            56               123                        1      -3
S10               180            74               183                        1       2
DFG2               34            24                38                        3      12
DFG6               21            20                22                        0       5
DFG7               44            38                48                        5       9
DFG12              85            73                85                        1       0
DFG14               5             5                 8                        2      60
DFG16              29            27                32                        3      10
DFG19              50            42                49                        0      -2
Average            57            35                59                        2      19


5.5 Summary

In this chapter we presented three efficient reconfigurable online schedulers that are employed by the OS proposed in this thesis. The schedulers were evaluated and compared with multiple baseline schedulers in addition to an exact ILP model. The main objective of the proposed schedulers is to minimize the total execution time for the incoming task graph. The proposed schedulers support hardware task reuse and software to hardware task migration. RCSched-I and RCSched-II migrate tasks to software only if hardware resources are scarce. On the other hand, RCSched-III makes intelligent decisions and places tasks on the processing element (hardware or software) that leads to minimal total execution time. RCSched-III takes PRR size into account and hence works best on the non-uniform PRR floorplan. RCSched-III learns about task types from previous history, and its performance tends to improve over time. RCSched-III-Enhanced further improves upon RCSched-III by searching the ready queue for a task that can exploit hardware task reuse. RCSched-III-Enhanced has exceptionally good performance: it reached, on average, 97% of the performance of the RCOffline scheduler without the learning phase, and 102% of the RCOffline scheduler following the learning phase.


Chapter 6

Allocation/Binding of Execution Units

Optimizing the hardware architecture (execution units) for a specific task in a task graph is an NP-hard problem [4]. Therefore, in this chapter, we propose a heuristic-based technique to select the type of execution units (implementation variants) needed for a specific task. The proposed approach uses an Island Based Genetic Algorithm (IBGA) as a meta-heuristic optimization technique.

The main feature of an IBGA is that the population of individuals (i.e., the set of candidate solutions) is divided into several sub-populations, called islands. All traditional genetic operators, such as crossover, mutation and selection, are performed separately on each island, and some individuals are periodically selected from each island and migrated to different islands. In our proposed island-based Genetic Algorithm (GA), we do not perform any migration among the different sub-populations, in order to preserve the solution quality of each island, since each island is based on a different platform (i.e., floorplan). The basic idea is to aggregate the results obtained from the Pareto fronts of each island to enhance the overall solution quality. Each solution on the Pareto front tends to optimize multiple objectives, including power consumption and speed. However, each island tends to optimize this multi-objective optimization problem based on a distinct platform (floorplan).

The proposed IBGA approach utilizes the framework introduced in Chapter 5 to evaluate the solutions created. However, using the reconfigurable hardware platform of Chapter 5 directly would be impractical and limiting. Therefore, a reconfigurable simulator, presented earlier in Chapter 4, was developed.

6.1 Single Island GA

Each GA optimization module consists of four main components: an Architecture Library, an Initial Population Generation module, a GA Engine, and a Fitness Evaluation module (an on-line scheduler), as shown in Figure 6.1. The Architecture Library stores all of the different available architectures (execution units) for every task type. Architectures vary in execution time, area, reconfiguration time, and power consumption. For example, a multiplier can have a serial, semi-parallel, or parallel implementation. Each GA module tends to optimize several criteria (objectives) based on a single fixed floorplan/platform. The architecture library is explained in more detail in Appendix B.

The Initial Population Generator uses a given task graph along with the architecture library to generate a diverse initial population of task graphs, as demonstrated by Figure 6.1. Each possible solution (chromosome) in the population assembles the same DFG with one execution unit (architecture) bound to each task. The initial population can be generated randomly or partially seeded using known solutions. The Genetic Algorithm Engine manipulates the initial population by creating new solutions using several recombination operators in the form of crossover and mutation. New solutions created


Figure 6.1: A Single GA Module


are evaluated, and the fittest solutions replace the least fit. These iterations either continue for a specified number of generations or until no further improvement in solution quality is achieved. The fourth component is the fitness function, which evaluates each solution based on specific criteria. The fitness function for our model is a reconfigurable scheduler that schedules each task graph and returns the time and overall power consumed by the DFG to the GA engine.

6.1.1 Initial Population

The initial population represents the solution space that the GA will use to search for the best solution, and it needs to be as diverse as possible. In our framework, we evaluated the system with different population sizes, varying from a population equal to the number of nodes in the input DFG to five times the number of nodes.

Chromosome representation: Every individual (solution) in the population is represented using a bit string known as a chromosome. Each chromosome consists of genes. In our formulation, we use one gene for each of the N tasks in the task graph; each gene represents the choice of a particular implementation variant (as an integer value) for the task. A task graph is mapped to a chromosome, where each gene represents an operation within the DFG using a specific execution unit, as shown in Figure 6.2.

6.1.2 Genetic Operators

GAs begin execution from an initial population of randomly chosen solutions, which are successively refined through several generations using the crossover and mutation


Figure 6.2: Task Graph to Chromosome Mapping (Binding/Allocation)

operators. The following modules represent the basic operators used:

• The Selection module selects a set of individuals in the current population, called parents, that contribute their genes to their children. The selection module usually selects individuals that have better fitness values than their peers. Several selection schemes, including tournament and roulette-wheel selection, can be used.

• The Crossover module takes two individuals from the population (parents), chosen by the selection module, and produces two new individuals (children) by splitting the parent chromosomes at a random position and then recombining them. The probability of applying crossover to two chromosomes is called the crossover rate.

• The Mutation module randomly alters genes within a chromosome with a small probability and is usually applied following crossover. Crossover rapidly explores the search space, while mutation provides a modest amount of random search. Mutation helps to diversify the population of solutions and avoid premature convergence.

Mutation helps to diversify the population of solutions to avoid rapid convergence.

• The Replacement module replaces the current population (parents) with the newly formed population (children). A generational GA creates new offspring from the members of an old population using the genetic operators introduced earlier and places these individuals in a new population, which becomes the old population once the whole new population is created. A steady-state GA, on the other hand, differs from the generational model in that typically a single new member is inserted into the population at any one time. We used a replacement policy where the new population is a combination of the best individuals of both the parent and child populations.

• The Fitness function is an objective function used by the genetic algorithm engine to summarize, as a single figure of merit, the fitness of a given chromosome. The fitness function is unique for each problem and usually returns a numeric value to the GA engine. In our proposed framework, the fitness function is based on the outcome of a scheduler, which takes each chromosome, uses it to construct a schedule, and then returns a value representing time, power, or a weighted average of both.
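The crossover and mutation operators above can be sketched as follows; this is a generic single-point crossover with per-gene mutation, and the 0.05 mutation rate is an illustrative value rather than the thesis setting:

```python
# Sketch of single-point crossover and per-gene mutation on integer-gene
# chromosomes (one variant index per task, as in the encoding above).
import random

def crossover(parent_a, parent_b):
    """Split both parents at one random point and swap the tails."""
    point = random.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chrom, n_variants, rate=0.05):
    """Replace each gene with a random variant index with probability `rate`."""
    return [random.randrange(n_variants) if random.random() < rate else g
            for g in chrom]

c1, c2 = crossover([0, 0, 0, 0], [1, 1, 1, 1])
assert sorted(c1 + c2) == [0, 0, 0, 0, 1, 1, 1, 1]   # genes are conserved
```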

6.1.3 Fitness Evaluation

The quality of the scheduling algorithms implemented can have a significant impact on the final performance of applications running on a reconfigurable computing platform. The overall time it takes to run a set of tasks on the same platform can vary with different schedulers. Power consumption, reconfiguration time, and other factors can also be used as metrics to evaluate schedulers. The three online schedulers described in Chapter 5 can be used to evaluate each chromosome as part of the fitness function in the GA framework. The results of this chapter are based on the RCSched-III scheduler.


6.2 Island Based GA

The proposed island-based GA consists of multiple GA modules, as shown in Figure 6.3. Each GA module produces a multi-objective Pareto front based on power and performance (execution time) for a unique platform. An aggregated Pareto front is then constructed from all the GAs to give the user a framework that optimizes not only power and performance, but also the choice of platform (floorplan). A platform in this case includes the number, size, and layout of the PRRs.

The proposed framework simultaneously sends the same DFG to every GA island, as shown in Figure 6.3. The islands then run in parallel to optimize performance and power consumption for each platform. Running multiple GAs in parallel helps reduce processing time, as shown in Table 6.2. The chromosomes in each island, which represent an allocation of execution units (implementation variants) and a binding of units to tasks, are used to build a feasible near-optimal schedule. The schedule, allocation, and binding are then evaluated by the simulator to determine the associated power consumption and total execution time. These values are fed back to the GA as a measure of fitness of the different solutions in the population. The GA seeks to minimize a weighted sum of performance and power consumption (Equation 6.1). The value of W determines the weight of each objective. For example, a value of one indicates that the GA should optimize for performance only, while a value of 0.5 gives equal emphasis to performance and power (assuming the values of execution time and power are normalized).

FitnessValue = Power × (1 − W) + Exec.Time × W        (6.1)
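Equation 6.1 translates directly into code; the sketch below assumes power and execution time have already been normalized to [0, 1], as the text requires:

```python
# Weighted-sum fitness from Equation 6.1: W = 1 optimizes execution time
# only, W = 0 optimizes power only (inputs assumed normalized to [0, 1]).

def fitness(power, exec_time, w):
    return power * (1.0 - w) + exec_time * w

assert fitness(0.4, 0.8, 1.0) == 0.8              # W = 1: time only
assert fitness(0.4, 0.8, 0.0) == 0.4              # W = 0: power only
assert abs(fitness(0.4, 0.8, 0.5) - 0.6) < 1e-9   # equal emphasis
```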


Figure 6.3: Proposed Island Based Framework


Every island produces a separate Pareto front for a different platform, in parallel. An aggregation module then combines all the Pareto fronts into an aggregated Pareto front that gives not only near-optimal performance and power, but also the hardware platform (floorplan) associated with each solution. Each island generates its Pareto front by varying the value of W in Equation 6.1 from 0.0 to 1.0 in steps of 0.1.
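A minimal sketch of the aggregation step, assuming both objectives are minimized and using a simple pairwise dominance test; the tuple layout (power, execution time, platform id) is illustrative:

```python
# Merge candidate points from all islands and keep the non-dominated set,
# so the aggregated front records which platform each surviving point uses.

def aggregate_pareto(points):
    """points: list of (power, exec_time, platform_id); smaller is better."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q[:2] != p[:2]
                        for q in points)
        if not dominated:
            front.append(p)
    return front

candidates = [(5, 10, 1), (4, 12, 2), (6, 9, 3), (7, 11, 4)]
front = aggregate_pareto(candidates)
assert (7, 11, 4) not in front              # dominated by (5, 10, 1)
assert {p[2] for p in front} == {1, 2, 3}   # the rest are trade-offs
```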

6.2.1 Power Measurements

Real-time power measurements were initially taken on the Xilinx VC707 board, as explained in Appendix C. We used the Texas Instruments (TI) power controllers along with a TI dongle and a monitoring station to perform the power measurements. Other measurements were estimated using the Xilinx XPower software, and the rest were based on data from the literature.

The essential figures of merit of a digital circuit or system are speed and power consumption. Although Equation 6.1 uses power, switching energy can be used instead; the results will differ, but the methodology remains the same. In the case of energy, the Architecture Library can be updated with energy values instead of power, as seen in Equation 6.4, and the fitness function will be calculated by the simulator accordingly.

P_dyn = C_L · V_dd² · α · f_C        (6.2)

E_s = switching energy = P_dyn / (α · f_C) = C_L · V_dd²        (6.3)


Energy = Power × Time        (6.4)

where α is the switching activity, f_C is the clock frequency, C_L is the load capacitance, and V_dd is the supply voltage.
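A quick numeric check of Equations 6.2 and 6.3 with made-up circuit values confirms that the switching energy reduces to C_L · V_dd²:

```python
# Sanity check of Equations 6.2-6.3; the circuit constants below are
# arbitrary illustrative values, not measurements from the platform.
C_L, Vdd, alpha, f_C = 1e-12, 1.0, 0.2, 100e6   # 1 pF, 1 V, 20%, 100 MHz

P_dyn = C_L * Vdd**2 * alpha * f_C              # Equation 6.2
E_s = P_dyn / (alpha * f_C)                     # Equation 6.3

assert abs(E_s - C_L * Vdd**2) < 1e-18          # E_s reduces to C_L * Vdd^2
```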

To minimize power, one might consider replacing an idle partial reconfigurable module with a blank module. Although this has not been considered in this work, we believe doing so would have an adverse effect on power consumption in most cases: the overhead of the increased number of reconfigurations (and hence reconfiguration time and power), caused by the lower task variety, would outweigh the slight advantage of replacing the idle module with a blank one. An alternative strategy is to gate the clock of the idle task without replacing it; this can be accomplished by the ROS or by the task itself.

6.3 Results

In this section, we first describe the experimental method used and then present the results of the proposed island-based GA framework.

6.3.1 Experimental Method

The proposed framework was tested on both synthetic and real-world benchmarks, as shown in Table 6.1. In total, 5 synthetic benchmarks (i.e., S1-S5) were generated at random, subject only to connectivity constraints. One of the main advantages of using randomly generated problem instances is that it facilitates the comparison of task graphs with potentially different behaviors. Different models arise due to differences in the number and types of tasks, the number of vertices, the number of dependencies, the length of the critical path, and the number of opportunities to reuse modules. The remaining 7 benchmarks (i.e., DFG2-DFG19) are real-world benchmarks selected from the well-known MediaBench DSP suite [126]. As the GA is stochastic by nature, it was run 10 times for each benchmark. Solution quality was then evaluated in terms of the average results produced.

Table 6.1: Benchmark Specifications.

ID      Name                             # nodes  # edges  Avg. edges/node  Critical path  Parallelism (nodes/crit. path)
S1      Synthesized (4 task types)          50       50        1.0                6             8.3
S2      Synthesized (4 task types)         150      200        1.33              15            10
S3      Synthesized (8 task types)          25       40        1.6                7             3.5
S4      Synthesized (8 task types)         100      120        1.2                7            14.2
S5      Synthesized (5 task types)          13       10        0.77               3             4.3
DFG2    JPEG - Smooth Downsample            51       52        1.02              16             3.2
DFG6    MPEG - Motion Vectors               32       29        0.91               6             5.33
DFG7    EPIC - Collapsepyr                  56       73        1.3                7             8
DFG12   MESA - Matrix Multiplication       109      116        1.06               9            12.11
DFG14   HAL                                 11        8        0.72               4             2.75
DFG16   Finite Input Response Filter 2      40       39        0.975             11             3.64
DFG19   Cosine 1                            66       76        1.15               8             8.25

The run-time performance of the IBGA, based on serial and parallel implementations, is shown in Table 6.2. The IBGA was tested on a RedHat Linux workstation with a six-core Intel Xeon processor running at 3.2 GHz and equipped with 16 GB of RAM.

6.3.2 Convergence of Island based GA

The benchmarks introduced in Table 6.1 were tested on the proposed Island-based GA framework, which consists of four GA islands. Each island targeted a different FPGA platform (floorplan) that was generated manually. Each platform is distinct in terms of the number, size, and layout of the PRRs. An example is shown in Figure 6.4. Platforms 1 and 2 have a uniform size distribution, while platforms 3 and 4 have a non-uniform size


Table 6.2: IBGA run-time for serial and parallel implementations

DFG    Serial (min:sec)  Parallel (min:sec)  Speedup (X times)
S1     00:52             00:17               3.1
S2     04:53             01:41               2.9
S3     00:23             00:10               2.3
S4     03:06             01:16               2.5
S5     00:07             00:02               3.0
DFG2   01:12             00:28               2.6
DFG6   00:12             00:08               1.5
DFG7   00:35             00:13               2.7
DFG12  02:45             00:52               3.2
DFG14  00:05             00:02               2.5
DFG16  00:26             00:09               2.9
DFG19  01:13             00:23               3.2

distribution. Despite the fact that the four islands are optimizing the same task graph, they converge to different solutions, since they are targeting different floorplans on the same FPGA architecture. Figures 6.5 - 6.8 and Figures 6.9 - 6.15 show the convergence of the fitness values for the synthetic and MediaBench benchmarks, respectively, for an average of 10 runs.

Figure 6.4: Platforms with different floorplans


Figure 6.5: Convergence of synthesized benchmark S1 (average fitness vs. generation, platforms P1-P4)

For example, the DFG for benchmark S3, shown in Figure 6.16, has a different architecture binding for each platform, as shown in Figure 6.17. Each number represents an index to the selected execution unit (architecture), and the position of the number identifies the node.

6.3.3 The Pareto Front of Island GA Framework

The proposed multi-objective Island-based GA optimizes for execution time and/or power consumption. Since the objective functions are conflicting, a set of Pareto-optimal solutions was generated for every benchmark, as discussed in Section 6.2.

Figures 6.18 - 6.21 and Figures 6.22 - 6.28 show the Pareto fronts for the synthetic and MediaBench benchmarks, respectively, based on an average of 10 runs. The Pareto


Figure 6.6: Convergence of synthesized benchmark S2 (average fitness vs. generation, platforms P1-P4)

Figure 6.7: Convergence of synthesized benchmark S3 (average fitness vs. generation, platforms P1-P4)


Figure 6.8: Convergence of synthesized benchmark S4 (average fitness vs. generation, platforms P1-P4)

Figure 6.9: Convergence of MediaBench benchmark DFG2 (average fitness vs. generation, platforms P1-P4)


Figure 6.10: Convergence of MediaBench benchmark DFG6 (average fitness vs. generation, platforms P1-P4)

Figure 6.11: Convergence of MediaBench benchmark DFG7 (average fitness vs. generation, platforms P1-P4)


Figure 6.12: Convergence of MediaBench benchmark DFG12 (average fitness vs. generation, platforms P1-P4)

Figure 6.13: Convergence of MediaBench benchmark DFG14 (average fitness vs. generation, platforms P1-P4)


Figure 6.14: Convergence of MediaBench benchmark DFG16 (average fitness vs. generation, platforms P1-P4)

Figure 6.15: Convergence of MediaBench benchmark DFG19 (average fitness vs. generation, platforms P1-P4)


Figure 6.16: The DFG for benchmark S3

P1: 2533215124431321121115213
P2: 2533215124431321121115213
P3: 2322231111431211111313212
P4: 1322221114431111111111112

Figure 6.17: Architecture binding for each platform (P1-P4) for the S3 benchmark.


front obtained by each island (P1 to P4) is displayed individually, along with the aggregated Pareto front for each benchmark. It is clear from Figures 6.18 - 6.21 that each GA island tends to produce a different and unique Pareto front that optimizes performance and power for the targeted platform. On the other hand, the aggregated solution based on the four GA islands combines the best solutions obtained, and thus improves upon the solutions obtained by the individual GA Pareto fronts. Hence, the system/designer can choose the most appropriate point based on the desired objective function.
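The aggregation of the four island fronts described above can be sketched as a non-dominated filter over the union of the per-island points. This is an illustrative sketch, not the thesis implementation; the (time, power) tuples and helper names are assumptions for the example.

```python
def dominates(a, b):
    """True if point a dominates point b when minimizing both
    objectives (execution time, power)."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def aggregate_pareto(fronts):
    """Merge per-island fronts and keep only non-dominated points."""
    candidates = {p for front in fronts for p in front}
    return sorted(p for p in candidates
                  if not any(dominates(q, p) for q in candidates))

# Hypothetical (time, power) fronts from two islands:
island1 = [(1.0, 5.0), (2.0, 3.0), (3.0, 2.0)]
island2 = [(1.5, 4.0), (2.5, 2.5), (2.0, 4.5)]
front = aggregate_pareto([island1, island2])
# (2.0, 4.5) is dominated by (2.0, 3.0) and is filtered out.
```

Because each island explores a different floorplan, the union typically contains points no single island found, which is why the aggregated front improves on the individual fronts.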

Table 6.3: The Average of 10 runs for the Best Fitness Values

Benchmark  No opt.   1-GA    IBGA
S1         4435.7    1847    978.2
S2         12514.7   4288.3  2244.4
S3         2649.95   1973    660.9
S4         8474.23   3856    1792.4
S5         1392      877     286
DFG2       5452.67   2554.1  1036.1
DFG6       2942.99   1217.9  406.1
DFG7       4558.4    1303.2  582.4
DFG12      9249.97   2595.9  1218.1
DFG14      1134.13   735     341.5
DFG16      3614.28   1301.1  505.2
DFG19      5805.24   2368.4  1528.1

Table 6.3 compares the best fitness values of randomly binding architectures with no optimization (No Opt.) to those of a single-GA approach (1-GA), along with the proposed aggregated Pareto-optimal-point approach based on four islands (IBGA). Each value in Table 6.3 is based on the average of 10 different runs. The single GA implementation achieves on average a 55.9% improvement over the baseline non-optimized approach, while the Island-based GA achieves on average an 80.7% improvement. The latter achieves on average a 55.2% improvement over the single GA approach. Table 6.4


shows the fitness values based on different weight values (introduced in Equation 6.1) for randomly binding architectures (No Opt.), a single GA (1-GA) and an Island-based GA (IBGA). On average, the single GA implementation achieves a 52.7% improvement over the baseline non-optimized approach, while the Island-based GA achieves on average a 75% improvement.

Table 6.4: Fitness values for different weights (W)

         W=0.5                   W=0.7                   W=0.875                    W=1 (Performance)
Bench.   No opt.  1-GA   IBGA    No opt.  1-GA   IBGA    No opt.   1-GA    IBGA     No opt.  1-GA   IBGA
S1       3327     2280   1469    3917     2134   1266    4435.7    1847    978.2    4661     1601   796
S2       9483     5731   4351    11034    4872   3331    12514.7   4288.3  2244.4   13457    3724   1604
S3       1894     1517   825     2303     1760   734     2649.95   1973    660.9    2914     2126   588
S4       6286     4000   2747    7470     3890   2277    8474.23   3856    1792.4   9226     3715   1366
S5       954      684    335     1194     790    300     1392      877     286      1550     940    290
DFG2     3749     2237   1324    4673     2450   1189    5452.67   2554.1  1036.1   6025     2582   903
DFG6     1851     906    387     2434     1073   409     2942.99   1217.9  406.1    3306     1320   396
DFG7     2880     1148   668     3792     1231   640     4558.4    1303.2  582.4    5147     1324   537
DFG12    5891     2400   1432    7722     2609   1342    9249.97   2595.9  1218.1   10415    2886   1044
DFG14    775      557    338     963      656    345     1134.13   735     341.5    1249     785    340
DFG16    2274     1003   446     2976     1166   505     3614.28   1301.1  505.2    4095     1392   507
DFG19    4588     3152   2308    5232     2743   1907    5805.24   2368.4  1528.1   6146     2186   1197
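Equation 6.1 itself is not reproduced in this excerpt; assuming the weighted objective takes the common weighted-sum form, the role of the weight W in Table 6.4 can be illustrated as follows (the function name and normalization are assumptions, not the thesis formulation):

```python
def weighted_fitness(exec_time, power, w):
    """Hypothetical weighted-sum objective: w = 1 scores performance
    only, while smaller w shifts weight toward power consumption."""
    return w * exec_time + (1.0 - w) * power
```

At W = 1 the power term vanishes, which is consistent with the "W=1 (Performance)" column of Table 6.4 reflecting execution time alone.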

Table 6.5 compares the IBGA with an exhaustive procedure in terms of quality of

solution and CPU time. Only two small benchmarks are used due to the combinatorial

Table 6.5: Exhaustive vs. IBGA (average of 10 runs)

Benchmark         IBGA                 Exhaustive search
                  Fitness  Time (sec)  Fitness  Time (min)
DFG14 (11 nodes)  341      1.8         313      5.8
S5 (13 nodes)     286      2.4         283      257

explosion of the CPU time when using the exhaustive-search-based procedure. It is clear from Table 6.5 that the IBGA technique produces near-optimal solutions for these small benchmarks: the DFG14 (11 nodes) solution is 9% inferior to the optimal solution, while S5 (13 nodes) is 1% away from optimality.
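The combinatorial explosion is easy to see in code: an exhaustive binding search must enumerate every assignment of execution units to nodes, so the search space grows exponentially in the number of nodes. The sketch below is illustrative only; the `fitness` callback is a hypothetical stand-in for the thesis objective function.

```python
from itertools import product

def exhaustive_best(num_nodes, num_archs, fitness):
    """Enumerate every architecture binding (num_archs choices per
    node) and return the lowest-fitness one. The space contains
    num_archs ** num_nodes bindings, so only tiny DFGs are tractable."""
    best_binding, best_fit = None, float("inf")
    for binding in product(range(num_archs), repeat=num_nodes):
        f = fitness(binding)
        if f < best_fit:
            best_binding, best_fit = binding, f
    return best_binding, best_fit
```

Even a modest DFG such as S3 (25 nodes) is far beyond this enumeration, which is why Table 6.5 only reports DFG14 and S5.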


Figure 6.18: Aggregated Pareto Front (Time vs Power) for synthesized benchmark (S1); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)

Figure 6.19: Aggregated Pareto Front (Time vs Power) for synthesized benchmark (S2); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)


Figure 6.20: Aggregated Pareto Front (Time vs Power) for synthesized benchmark (S3); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)

Figure 6.21: Aggregated Pareto Front (Time vs Power) for synthesized benchmark (S4); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)


Figure 6.22: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG2); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)

Figure 6.23: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG6); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)


Figure 6.24: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG7); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)

Figure 6.25: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG12); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)


Figure 6.26: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG14); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)

Figure 6.27: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG16); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)


Figure 6.28: Aggregated Pareto Front (Time vs Power) for MediaBench benchmark (DFG19); per-island fronts P1-P4 and the combined front over all platforms, plotted as execution time (1/1000 cycles) vs. power (mW)


6.4 Summary

In this chapter, a parallel island-based GA approach was designed and implemented to efficiently map execution units to task graphs for dynamic partially reconfigurable systems. Unlike previous works, our approach is multi-objective and not only seeks to optimize speed and power, but also seeks to select the best reconfigurable floorplan. Our algorithm was tested using both synthetic and real-world benchmarks. Experimental results clearly indicate the effectiveness of this approach: the proposed Island-based GA framework achieved on average a 55.2% improvement over a single GA implementation and an 80.7% improvement over a baseline random allocation and binding approach.


Chapter 7

Resource Prediction

The ability to determine and predict the hardware resources required by an application in a dynamic run-time reconfigurable environment can significantly improve the system's overall performance in different ways. However, optimizing the necessary hardware resources for a given task graph is an NP-hard problem [127], and the need to predict resources for real-time problems makes it even harder. Therefore, in this chapter we propose a machine-learning-based technique to estimate and predict the needed resources. Given a new application, our proposed framework employs a database, a simulator and a set of machine learning algorithms to construct a model capable of intelligently estimating the resources required to optimize various objectives such as run-time, area, power consumption and cost. The proposed approach uses previous knowledge and features extracted from benchmarks to create a supervised machine-learning model that is capable of predicting the estimated necessary resources. Figure 7.1 illustrates the proposed prediction framework and flow used in this work. The framework consists of three main phases: data preparation, training, and testing (classification/prediction).

Figure 7.1: Overall Methodology and Flow

The data preparation phase uses both synthetic and real-life benchmarks, a simulator and a database, as shown in Figure 7.1-A. In this phase, benchmarks are generated and evaluated in terms of the power consumed and execution time using a simulator, as will be explained in Section 7.1.

All necessary features that will be used to train the machine learning algorithms are extracted from the DFGs generated, in addition to metrics provided by the simulator. In the training phase (shown in Figure 7.1-B), the framework utilizes the features extracted from the previous stage to train and create a model that learns from previous historic information. This model extracts useful hidden knowledge (i.e., generalizes) from the data and estimates/predicts the resources of the reconfigurable computing system. The third and final testing/prediction phase utilizes the developed model, given new unseen task graphs, to predict the necessary resources, as seen in Figure 7.1-C. Each of these phases is explained in more detail in the following sections.

7.1 Data Preparation Stage

In this stage, a tabular database suitable for the data-mining engine is constructed. Each

record in the database corresponds to a Data-Flow Graph (DFG) along with its features

or attributes that are necessary for the training stage. The DFGs are either synthesized or taken from real-life applications. The synthesized DFGs are evaluated under different hardware set-ups. The evaluation can be performed either on a dedicated hardware platform or in a simulator; the latter allows for faster evaluation and greater flexibility. Each DFG is evaluated under different hardware scenarios by varying the number of processing elements (PRRs, GPPs), the size/shape of the PRRs, and the schedulers. The system performance


metrics (power consumption, execution time) for each case are then recorded accordingly. Several features of each DFG, such as the number of nodes, dependencies, fan-out, critical path, slack, sharable resources and many more, are extracted (see Table 7.2). These features are treated as individual measurable attributes that are effectively exploited in a supervised learning task such as classification.

The necessary modules to construct the database are:

1. Task Graph Generator: The first step in the proposed methodology involves using a task-graph generator (described in Section 4.2) to automatically synthesize a large number of input tasks. These input tasks are used later to train and test the proposed predictors. Each input task is represented as a DFG, where the nodes in the DFG represent particular operations to be scheduled and assigned either to hardware or software. The edges, on the other hand, represent the dependencies and flow between operations. Each DFG has randomly generated parameters, as shown in Table 7.1. The probability distribution of DFG parameters is currently a uniform distribution, but any other distribution, including one based on real-world applications, can be employed. A total of 258 DFGs are generated at this step in the flow.

Parameter       Range     Note
Nodes           5 - 1000  # of nodes in a DFG
Edges           0 - 1863  # of edges (dependencies)
Edges per Node  0 - 2     Average # of edges per node
Subgraphs       1 - 968   # of subgraphs in the task
Task types      3 - 16    # of task types
HW tasks area   –         Avg/Min/Max/..

Table 7.1: Data Flow Graphs: Statistics


2. Simulator (Evaluation): The performance of all DFGs under different hardware configurations is evaluated before the features and attributes of each DFG are stored in the database. Accordingly, the reconfigurable architecture simulator discussed in Section 4.3 was developed to simulate the hardware platform while running the developed reconfigurable operating system.

The target FPGA fabric is first partitioned uniformly and then partitioned with a 50% increase in size, while varying the number of PRRs. This results in different layouts for the same number of PRRs.

In order to build a diverse database, we evaluated every DFG with various hardware settings and then recorded the evaluation (performance) metrics. The two most important evaluation metrics used were the total execution time and the total power consumption for a given DFG. Each DFG was evaluated with different PRR numbers and PRR sizes, various numbers of GPPs, and different schedulers. This resulted in a database containing on average several hundred records per DFG.

3. Schedulers: The three on-line scheduling algorithms described earlier in Chapter 5 were used. The algorithms have different task-reuse capabilities to minimize reconfiguration overhead. Tasks are capable of migrating between software (GPP) and hardware (PRRs) to minimize total execution time and mitigate resource scarcity issues. The schedulers record system metrics, learn, and accommodate future tasks. The schedulers were first developed and implemented on a hardware platform, and then incorporated into the simulator shown in Figure 7.1.


4. Feature Extraction: Since the DFG format cannot be used directly by the classification tool Waikato Environment for Knowledge Analysis (WEKA) [128], a software application was designed and implemented to extract numerous numeric features from a given DFG and store them in a tabular format. One of the challenges was to extract as much numeric information as possible without adding irrelevant information (noise). Features and attributes were represented either as a single numeric entity or in a range format. Below are some of the key extracted features:

• DFG connectivity: number of edges, average edges per node, fan-in/fan-out, maximum number of parent nodes and maximum number of dependent nodes.

• DFG size: number of sub-graphs, maximum number of nodes in a sub-graph, length of the critical path, average critical-path length of the sub-graphs.

• Task types: the number of task types, the number of tasks of each task type, number of hardware/software/hybrid tasks.

• Schedule flexibility: node mobilities and reuse possibilities.

• Task-type metrics: hardware task area, latency, reconfiguration time, and power consumption.

A complete list of the features is presented in Table 7.2.
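Several of the structural features in Table 7.2 follow mechanically from a DFG's edge list. The sketch below is illustrative only: the thesis extractor is a separate tool feeding WEKA, and the node-list/edge-list graph representation here is an assumption made for the example.

```python
from collections import defaultdict

def extract_features(nodes, edges):
    """Compute a handful of Table 7.2-style structural features from
    a DFG given as a node list and a (parent, child) edge list."""
    parents, children = defaultdict(set), defaultdict(set)
    for u, v in edges:
        children[u].add(v)
        parents[v].add(u)

    # Critical path (longest path) by memoized DFS; DFGs are acyclic.
    memo = {}
    def longest(n):
        if n not in memo:
            memo[n] = 1 + max((longest(c) for c in children[n]), default=0)
        return memo[n]

    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "edges_per_node": len(edges) / len(nodes),
        "root_nodes": sum(1 for n in nodes if not parents[n] and children[n]),
        "leaf_nodes": sum(1 for n in nodes if parents[n] and not children[n]),
        "isolated_nodes": sum(1 for n in nodes
                              if not parents[n] and not children[n]),
        "max_parents": max((len(parents[n]) for n in nodes), default=0),
        "critical_path": max((longest(n) for n in nodes), default=0),
    }
```

Each returned value maps directly onto one row of Table 7.2 (Nodes, Edges, Root Nodes, Leaf nodes, Isolated nodes, Max parents, Critical path).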

These features are extracted from the training DFGs (a subset of the synthesized DFGs), and then fed to the untrained machine-learning module, as shown in Figure 7.2-a. The module is trained against a known output (label) for each of the input DFGs. After the training phase, the same features are extracted from a new unseen DFG, which is unknown to the trained machine


learning module. The trained module is then ready to predict the correct label (class) accordingly, as shown in Figure 7.2-b.

Figure 7.2: Supervised learning: Training and testing

7.2 Training Stage

The second stage of the framework is related to training and model development, as

shown in Figure 7.1-B. In a prediction problem, a model is usually given a data-set of

known data on which training is performed (training data-set), and a data-set of unknown

data (or first time seen data) against which the model is tested (testing data-set). In the

training phase, a data-set is used to train the machine-learning algorithm so that a model


Feature Category           Range     Note
Nodes                      5 - 1000  # of nodes in a DFG
Root Nodes                 0 - 254   # of root nodes (without a parent)
Internal nodes             0 - 267   # of nodes that are neither root nor leaf
Leaf nodes                 0 - 502   # of leaf nodes
Isolated nodes             0 - 940   # of nodes with no parent and no children
Edges                      0 - 1863  # of edges (dependencies)
Edges per Node             0 - 2     Average # of edges per node
Max parents                0 - 2     Maximum # of parents for a child node
Max Children               0 - 19    Maximum # of children for a parent node
Sharable resources         0 - 19    Sum of the number of sharable resources
Subgraphs                  1 - 968   # of subgraphs in the task
Critical paths             –         Avg, Min critical paths
Critical path              –         Length of the longest path
Task types                 3 - 16    # of task types
Task type N                –         Frequency of each task type
HW/SW task types           –         # of HW or SW tasks
Migratable tasks           –         # of migratable tasks
Slack                      0 - 563   Average slack
HW/SW latency              –         Avg/Min/Max values
HW/SW execution*pwr        –         Avg/Min/Max values
HW config time and power   –         Avg/Min/Max values
HW tasks area              –         Avg/Min/Max values

Table 7.2: Extracted Features from DFGs.


is developed, which can be used in the actual prediction and classification phase. The data entries generated in the previous phase consist of hundreds of records for each DFG that differ in evaluation metrics such as power and speed. These records were obtained by evaluating the same DFG under different hardware configurations. Since the DFG features are the only metrics that will be used to predict the class (resource type), having multiple identical records per DFG, yet with different classes, would have a negative effect on the learning capability of the classifiers. In this step, the fittest record is selected for each DFG (to eliminate duplication). The term fittest could refer to speed, power, cost, or a combination, depending on the designer's goal. For example, if the goal is to train the classifier to predict a hardware configuration for the lowest execution time, the fittest function keeps only the record (for each DFG) with the lowest execution time. Accordingly, several smaller databases are created, one for each objective (power, speed, cost, execution time vs. cost). Each database contains a number of records equivalent to the number of DFGs available.
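The fittest-record selection just described can be sketched as a group-by-DFG minimization. This is an illustrative sketch; the record fields `dfg_id` and `exec_time` are hypothetical names, and the real framework selects over several objectives (power, speed, cost).

```python
def fittest_per_dfg(records, key="exec_time"):
    """For each DFG keep only the single record with the best
    (lowest) value of the chosen objective, removing duplicates."""
    best = {}
    for rec in records:
        dfg = rec["dfg_id"]
        if dfg not in best or rec[key] < best[dfg][key]:
            best[dfg] = rec
    return list(best.values())
```

Calling this once per objective (e.g. `key="power"`) yields one pruned database per optimization goal, each with exactly one record per DFG.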

Cross validation is an important task in machine learning. It is mainly used to estimate how precisely and accurately a classification model will perform in practice. One of the main goals of cross validation is to avoid problems such as over-fitting [129] (memorizing) and to indicate how well the model will generalize to independent data sets. Several techniques for cross validation are available, such as (i) the holdout method, (ii) K-fold cross validation, and (iii) leave-one-out cross validation. In the holdout method, the data set is separated into two different sets, called the training set and the testing set. The main advantage of the holdout method is that it takes a short time to compute. Leave-one-out cross validation has a good validation error but is very expensive. In our current work, we resorted to K-fold cross validation as it represents an improvement over the holdout method and is not as expensive as the leave-one-out method. The data set in K-fold cross validation is divided into k subsets, and the holdout method is repeated k times. In each experiment, one of the k subsets is used as the testing set while the remaining k-1 subsets form the training set. The average error across all k trials is then computed.
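The K-fold procedure above can be sketched in a few lines. This is a minimal stdlib re-creation of the splitting and averaging steps only (the thesis's actual models were trained in Java with WEKA); `train_fn` and `error_fn` stand in for any trainer and error measure.

```python
# Minimal K-fold cross validation sketch (illustrative only).
def k_fold_splits(records, k):
    """Yield (training_set, testing_set) pairs for each of the k folds."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        yield train, test

def cross_validate(records, k, train_fn, error_fn):
    """Average the per-fold error across all k trials."""
    errors = []
    for train, test in k_fold_splits(records, k):
        model = train_fn(train)          # train on k-1 subsets
        errors.append(error_fn(model, test))  # test on the held-out subset
    return sum(errors) / k
```

Each record appears in exactly one testing set, so every record is used for both training and validation across the k trials.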

7.3 Classification Stage

The result of the training phase is a model capable of classifying and predicting resources, as shown in Figure 7.1-C. Several supervised machine learning algorithms are deployed and contrasted by measuring the accuracy of prediction. Each classifier is executed with its default parameter values, then evaluated using the test sample for each DFG group. Finally, the mean accuracy of the validations is calculated. Classifiers used in this work range from simple Naive Bayes [129] to more accurate and complex machine learning algorithms in the form of ANNs and SVMs [130]. The classification algorithms are trained and developed using the training database in WEKA. A detailed evaluation of the classifiers is discussed in the next section. Since ensemble based systems provide favorable results compared to single expert machine learning systems under certain scenarios, it was worth considering them further in this work. Researchers from various disciplines resort to ensemble based classifiers whenever they seek to improve prediction performance. The goal of the ensemble based method is to create a more accurate predictive model by integrating multiple models. The ensemble methodology attempts to weigh and combine several individual classifiers in order to obtain a unified classifier that outperforms each individual classifier, much as human beings seek several opinions before making any important decision. Accordingly, to further improve the performance of the proposed framework, ensemble based classifiers [131] are employed. An important feature of the proposed methodology is that the system is able to learn, generalize and, therefore, can be expected to improve its accuracy over time. The goal of the entire process is to extract useful hidden knowledge from the data. This knowledge is the prediction and estimation of the necessary resources for an unknown or previously unseen application.

7.4 Experiments and Results

The main objective of this work is to demonstrate that our proposed system is able to generate accurate prediction models, suitable for use in reconfigurable operating systems, based on known features of DFGs extracted from previous designs. To achieve this goal, five different cases were developed and evaluated. The first two (I-II) are based on predicting schedulers (described in Chapter 5) that can minimize power consumption, migrate hardware/software tasks, and minimize total execution time. The schedulers can be distinguished only by power consumption, since RCSched-III performed better in terms of latency. When optimizing for power, neither the frequency nor the tasks are altered. The remaining three cases (III-V) are based on predicting the number of soft cores (GPPs) and partial-reconfigurable regions (PRRs) to be used on the reconfigurable platform to minimize combinations of total execution time and power. For the first four cases (I-IV) the area is fixed and only the layout and the number of PRRs are changed, while for Case-V the area is proportional to the number of PRRs. In Case-V we introduced a penalty factor (cost) proportional to the number of PRRs. The cost parameter can be set by the designer to give a trade-off between area and speed. The proposed cases are summarized as follows:

Case-I: Predict the type of scheduler that will lead to minimum power consumption while allowing one or more GPPs to be employed in the design.

Case-II: Predict the most appropriate scheduler that will produce the minimum power schedule while only allowing hardware tasks (PRRs) to be used in the design.

Case-III: Predict the number and type of the PRRs and GPPs required to minimize execution time.

Case-IV: Predict the number and type of PRRs and GPPs required to achieve a balance between power consumption and execution time. (Notice that this case involves conflicting objectives.)

Case-V: Predict the number of PRRs necessary to minimize both execution time and the total area consumed by the design.

Table 7.3 summarizes the cases along with the classes that will be predicted. Note that the last case is based on three different layouts of PRRs.

Case # | Class type | # of Classes | Objective              | Notes
I      | Scheduler  | 2            | Min Power Consumption  | For GPPs and PRRs
II     | Scheduler  | 2            | Min Power Consumption  | No GPPs (PRR only)
III    | GPPs/PRRs  | 4            | Min Execution Time     | -
IV     | GPPs/PRRs  | 4            | Min Exec. Time/Power   | -
V      | PRR        | 3            | Min Exec. Time and Area | Figure 1.2-B

Table 7.3: Cases used to Evaluate the Framework along with Specifications


It is important to note that the results achieved and highlighted in this section are not limited to the previous five cases. The framework and methodology proposed are general and flexible enough to enable the development of prediction models for any combination of resources related to the application or underlying run-time environment, including communication infrastructure, interfaces, etc.

7.4.1 Classification Algorithms: Implementation

Machine-learning algorithms proposed in the literature vary in complexity, accuracy and performance. They range from simple to more complex models, depending in part on the parameters that need to be tuned, which usually influence the training procedure. Each algorithm has its own advantages and disadvantages, and no one algorithm works best in all cases. Consequently, in this work, we first use the following five individual classification algorithms, all of which differ in complexity and accuracy, to implement and evaluate each of the five prediction cases (i.e., Cases I-V). In Sec. 7.4.5 we resort to a more advanced classification technique (Ensemble of Classifiers) that combines several individual classification algorithms to further improve prediction performance and accuracy.

• Naive Bayes (NB): Based on Bayes' theorem, an NB classifier classifies a record by first computing its probability of belonging to each class, and then assigning the record to the class with the highest probability [132]. Advantages of this classifier include its simplicity, computational efficiency, and classification performance. However, for good performance, the classifier requires a large number of records and assumes that the individual feature values extracted from the records are independent, which is not always the case.

• Multi-Layer Perceptron (MLP): Based on a model of biological activity in the brain, MLP is a feed-forward neural network that has proved to be among the most effective models for prediction in the context of data mining [133]. Although it is able to capture highly complicated relationships between the predictors and the response, its flexibility and performance rely heavily on having sufficient data for training purposes.

• Support Vector Machine (SVM): Rooted in statistical learning theory, SVM is a non-probabilistic supervised learning algorithm in which training records are mapped into a higher dimensional input space so that the separate classes are divided by an optimal separating hyperplane (i.e., a hyperplane with maximum separation margin) in this space [102]. New input records are then mapped into this space and assigned to a class based upon which side of the hyperplane they fall. Although directly applicable only to binary-class problems, SVM can still perform effectively in high-dimensional spaces; however, multi-class prediction often requires the overhead of solving a series of binary classification problems.

• K-Nearest Neighbor (K-NN): The K-NN method is simple and lacks the parameter tuning of many of the methods described above [134]. It is a lazy learning algorithm, since it defers the decision to generalize beyond the training examples until a new query is encountered. It works by first identifying the k nearest neighbors in the training set to the new record. The new record is then assigned to the predominant class among these neighbors. Despite its simplicity, the K-NN method performs remarkably well, especially when the target classes are characterized by a number of related features.

• J48: Tree methods, like J48, are considered to be among the most robust methods for performing classification and prediction. In general, these methods work by recursively splitting the training data into subgroups based on the data's features. In the case of J48, the decision to split is based on the concept of information entropy and seeks to choose the features that maximize the normalized information gain [102]. The recursive splitting stops, and leaf nodes are created, upon finding subsets belonging to the same class.
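As one concrete illustration of the classifiers above, the K-NN decision rule fits in a few lines. This is an illustrative Python sketch using Euclidean distance (the thesis used WEKA's implementations); the DFG feature vectors and class labels are hypothetical.

```python
import math
from collections import Counter

# Minimal K-NN sketch (illustrative only). Each training record is a
# pair (feature_vector, class_label).
def knn_predict(training, query, k=3):
    """Assign `query` to the predominant class among its k nearest
    training records, using Euclidean distance."""
    dists = sorted(
        (math.dist(features, query), label) for features, label in training
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical DFG features: (number of nodes, critical path length),
# labelled with a hypothetical resource class.
training = [((100, 7), "4-PRR"), ((110, 8), "4-PRR"),
            ((20, 4), "2-PRR"), ((30, 5), "2-PRR")]
print(knn_predict(training, (105, 7)))  # -> 4-PRR
```

Note that all of the cost is paid at query time: there is no training step, which is why K-NN is called a lazy classifier.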

The classifiers listed above were used to classify and predict each of the five cases presented earlier in Section 7.4. Thus, a total of 25 models were developed and verified. Each of the 25 models was implemented in Java.

7.4.2 Experimental Setup and Evaluation

Each of the 25 models was first trained and tested using the 258 synthetic benchmarks (DFGs) described in Section 7.1. Following that, all 25 models were tested using 20 real-world benchmarks from the MediaBench DSP benchmark suite [135]. All of the models were evaluated based on their accuracy. The accuracy of a prediction model is a measure of how well the predictor is able to make correct predictions, and is formally defined as the ratio of correctly classified instances to the total number of instances, as shown in Equation 7.1.

Accuracy = (TN + TP) / (TN + TP + FP + FN)    (7.1)


Note: TP and TN are the number of true positives and true negatives, respectively, while FP and FN are the number of false positives and false negatives, respectively. In the context of this thesis, an accuracy of 1.0 would mean that the model was able to predict the necessary resources (i.e., scheduler, numbers and types of GPPs and/or PRRs) to achieve a near-optimal solution (e.g., minimum execution time, power, and/or area) 100% of the time. In addition, we also evaluate each model using the well-known Receiver Operating Characteristics (ROC) curve [136-138]. The ROC curve is a plot of the TP rate versus the FP rate, and shows the trade-off between sensitivity and specificity. The ROC evaluation is a more appropriate measure than accuracy when dealing with imbalanced data sets, as will be explained later. In the following subsections, the detailed experimental results are discussed.
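Equation 7.1 translates directly into code. The confusion-matrix counts below are hypothetical, chosen only to exercise the formula:

```python
# Accuracy per Equation 7.1: correctly classified instances (TP + TN)
# over all instances. Counts are hypothetical, for illustration only.
def accuracy(tp, tn, fp, fn):
    return (tn + tp) / (tn + tp + fp + fn)

# e.g. 83 correct predictions out of 100 instances
print(accuracy(tp=40, tn=43, fp=9, fn=8))  # -> 0.83
```

The same four counts also yield the TP rate (TP / (TP + FN)) and FP rate (FP / (FP + TN)) plotted on the ROC curve.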

7.4.3 Results for Synthetic Benchmarks

The methodology used to evaluate the performance of the five cases introduced earlier in Table 7.3, using five different classifiers (25 prediction models), was based on 10-fold cross-validation. All 258 DFGs described in Section 7.1 were randomly divided into 10 subgroups, each containing an equal number of DFGs. Each of the 25 prediction models was then trained using nine of the subgroups and tested using the tenth subgroup. The process was repeated 10 times and the average accuracy for each model was computed. The accuracies of the prediction models were then compared using a t-test (with alpha = 0.05) corrected to avoid Type I errors due to the dependence between samples.

Figure 7.3 presents the prediction accuracy of the five cases based on the different classifiers. Tables 7.4 and 7.5 present the statistical significance of the accuracy achieved by SVM and J48, respectively, using synthetic data. Based on Figure 7.3 and Tables 7.4 and 7.5, the results conclusively demonstrate, at a high level of statistical significance, that J48, SVM and MLP achieve the highest accuracy rate of approximately 83%. This indicates that 83% of the time the prediction model is able to predict the necessary resources for obtaining an optimal objective. The other classification models, based on K-NN and NB, obtain a lower accuracy rate of approximately 79%.

Interestingly, the graph in Figure 7.3 also shows that the average accuracy of all five classification methods decreases as the number of classes increases. In particular, the average accuracies of all five classification methods for Case I and Case II, which represent binary classification problems, are 93.7% and 94.2%, respectively. For Case V, which is a 3-class problem, the average accuracy of all five classification methods is 84.3%. The worst average accuracy (62%) occurs for Case III, where the number of classes increases to 4. This behavior can be attributed to imbalanced data among the various problem instances, the variance/bias of the classifiers, or the ratio of records to the feature set. The class imbalance problem occurs when the distribution of instances is skewed between classes. An unbalanced distribution causes typical classification algorithms, which are designed to maximize classification accuracy, to have trouble learning the minority class or classes. As can be seen from Figure 7.4, the class imbalance problem is more noticeable in the multi-class data sets than in the binary-class data sets. The problem of imbalanced data can be dealt with using either undersampling or oversampling, neither of which is ideal when the data set is small or when duplicating data may affect the models. A more conservative and promising approach used by many researchers to reduce the impact of within-class imbalance is to utilize ensemble-based methods, as explained in the next subsection.
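The imbalance plotted in Figure 7.4 is the ratio between the majority and minority class counts in each data set. A small sketch with hypothetical class distributions (the actual labels and counts of the thesis's data sets are not reproduced here):

```python
from collections import Counter

# Majority-to-minority imbalance ratio for a labelled data set.
# Class labels below are hypothetical examples.
def imbalance_ratio(labels):
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

binary = ["sched-A"] * 60 + ["sched-B"] * 40               # ratio 1.5
multi = ["2-PRR"] * 70 + ["3-PRR"] * 20 + ["4-PRR"] * 10   # ratio 7.0
print(imbalance_ratio(binary), imbalance_ratio(multi))
```

A ratio near 1 means balanced classes; the larger the ratio, the harder it is for an accuracy-maximizing classifier to learn the minority classes.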

Figure 7.3: Individual Classifiers: Prediction Accuracy for the Synthesized Benchmarks (accuracy in %, Cases 1-5; NB, SVM, MLP, KNN, J48)

Figure 7.4: Imbalance Ratio between Minority/Majority Classes (Synthesized Benchmarks, Cases 1-5)


Table 7.4: Accuracy with t-test evaluation (SVM) [synthesized benchmarks].

Dataset | SVM   | NB      | MLP     | KNN     | J48
Case-1  | 95.12 | 90.12 • | 96.36   | 92.45   | 94.89
Case-2  | 95.19 | 93.72   | 95.11   | 91.27 • | 95.71
Case-3  | 71.61 | 66.30 • | 79.69 ◦ | 74.06   | 77.80 ◦
Case-4  | 66.73 | 58.37 • | 61.79   | 60.45 • | 64.85
Case-5  | 87.72 | 83.57 • | 84.34   | 81.20 • | 85.01
Average | 83.27 | 78.42   | 83.46   | 79.89   | 83.65
◦, • statistically significant improvement or degradation

Table 7.5: Accuracy with t-test evaluation (J48) [synthesized benchmarks].

Dataset | J48   | NB      | SVM     | MLP   | KNN
Case-1  | 94.89 | 90.12 • | 95.12   | 96.36 | 92.45
Case-2  | 95.71 | 93.72   | 95.19   | 95.11 | 91.27 •
Case-3  | 77.80 | 66.30 • | 71.61 • | 79.69 | 74.06
Case-4  | 64.85 | 58.37   | 66.73   | 61.79 | 60.45
Case-5  | 85.01 | 83.57   | 87.72   | 84.34 | 81.20
◦, • statistically significant improvement or degradation

With respect to the CPU time used by the algorithms, the training of each classifier incurs a one-time computational cost. Table 7.6 shows the average run-times for training the prediction models based on NB, MLP, SVM, K-NN, and J48, respectively. (Note: each Java-based model ran on an Intel Xeon (W3670) workstation with 16 GB of memory.) For example, the SVM algorithm is trained at a one-time cost of around 200 ms. Naive Bayes is handy since it can be trained faster than the SVM, J48 and MLP algorithms. It should be noted that in the case of K-NN there is no initial training; accordingly, K-NN is usually referred to as a lazy classifier. However, the computational cost is instead incurred when performing the prediction during the testing phase.


Classifier | Training Time
SVM        | 100 ms
MLP        | 60.4 s
NB         | 10 ms
KNN        | N/A (lazy classifier)
J48        | 200 ms

Table 7.6: Training time for Case-1 for individual classifiers

7.4.4 Results for MediaBench DSP suite

In addition to the synthetic benchmarks used in the previous subsection, 20 real-world DFGs based on DSP applications were used to evaluate the performance of the prediction models. The DFGs were selected from MediaBench, a standard DSP benchmark suite. Table 7.7 lists the characteristics of the DFGs, which range in complexity from 11 to 333 nodes and from 8 to 354 edges.

Table 7.7: MediaBench DSP Benchmark Specifications.

ID | Name | # of nodes | # of edges | Avg. edges per node | Critical path length | Parallelism (nodes/critical path)
1  | JPEG - Write BMP Header                   | 106 | 88  | 0.83  | 7  | 15.14
2  | JPEG - Smooth Downsample                  | 51  | 52  | 1.02  | 16 | 3.19
3  | JPEG - Forward Discrete Cosine Transform  | 134 | 169 | 1.26  | 13 | 10.3
4  | JPEG - Inverse Discrete Cosine Transform  | 122 | 162 | 1.33  | 14 | 8.71
5  | MPEG - Inverse Discrete Cosine Transform  | 114 | 164 | 0.44  | 16 | 7.125
6  | MPEG - Motion Vectors                     | 32  | 29  | 0.91  | 6  | 5.33
7  | EPIC - Collapsepyr                        | 56  | 73  | 1.3   | 7  | 8
8  | MESA - Invert Matrix                      | 333 | 354 | 1.06  | 11 | 30.27
9  | MESA - Smooth Triangle                    | 197 | 196 | 0.99  | 11 | 17.9
10 | MESA - Horner Bezier                      | 18  | 16  | 0.89  | 8  | 2.25
11 | MESA - Interpolate Aux                    | 108 | 104 | 0.96  | 8  | 13.5
12 | MESA - Matrix Multiplication              | 109 | 116 | 1.06  | 9  | 12.11
13 | MESA - Feedback Points                    | 53  | 50  | 0.94  | 7  | 7.57
14 | HAL                                       | 11  | 8   | 0.72  | 4  | 2.75
15 | Finite Input Response Filter 1            | 44  | 43  | 0.98  | 11 | 4
16 | Finite Input Response Filter 2            | 40  | 39  | 0.975 | 11 | 3.64
17 | Elliptic Wave Filter                      | 34  | 47  | 1.38  | 14 | 2.43
18 | Auto Regression Filter                    | 28  | 30  | 1.07  | 8  | 3.5
19 | Cosine 1                                  | 66  | 76  | 1.15  | 8  | 8.25
20 | Cosine 2                                  | 82  | 91  | 1.11  | 8  | 10.25

Figure 7.5 shows a similar trend to that of the

synthetic data in Figure 7.3, with all of the classifiers performing better on Cases I, II, V, IV and III, in that order. On average, MLP performed the best with an average accuracy of 75%, followed by SVM and J48 (average accuracies of 73% and 72%, respectively). NB was the worst, with an average accuracy of 68%, while KNN had an average accuracy of 70%.

Figure 7.5: Individual Classifiers: Prediction Accuracy for MediaBench (accuracy in %, Cases 1-5; NB, SVM, MLP, KNN, J48)

For Case-IV, J48 performed the best with 57% accuracy. The prediction accuracy on the real-world data was lower than on the synthetic data by an average of 10%. Except for Case-I, all cases had higher accuracy on the synthetic data. The difference ranged from 9% for Case-II to 21% for Case-IV, while Case-III and Case-V were in between, with average accuracy differences of 14% and 11%, respectively. The difference in accuracy between the synthetic benchmarks and the MediaBench DSP benchmarks is not surprising: the models were trained and tested with synthetic data in the first case, but trained with synthetic data and tested with real-world data in the second. This shows that although our models can accurately predict classes for unseen circuits, it is recommended to train the model on real-world data to achieve better accuracy.


7.4.5 Ensemble Learning

In this subsection we seek to determine whether we can further improve the accuracy of the prediction models by employing ensemble methods. Ensemble methods [131] employ not one, but multiple classifiers to classify new data points, often by taking a weighted average vote of their predictions. Ensemble learning algorithms are primarily used to improve prediction accuracy by mitigating issues related to variance and bias, or over-fitting when the data set is small. Many techniques have been proposed for combining the predictions of multiple classifiers, but the most popular methods are:

• Bagging trains each classifier in the ensemble using a random redistribution of the original data. If X is the size of the original training set, a training set is generated for each classifier by randomly selecting X records with replacement. The predictions of each trained classifier are combined using a (weighted) voting scheme. In practice, bagging is especially useful in situations where even small changes in the training set may lead to big changes in the prediction [139].

• Boosting is an ensemble technique in which models are learned sequentially. Initial prediction models tend to be very simple and are used to determine which data points are difficult to fit. Later prediction models focus primarily on those hard-to-fit points, with the goal of predicting them correctly. All of the models are finally given weights, and the set is combined to evolve an increasingly complex and accurate prediction model [140].

• Stacking applies different types of classification models to the original data. Instead of using voting or a weighted-average approach, stacking uses a meta-level classifier to determine the winning model. In general, stacking works well, but it is very hard to analyze [102].

• Randomization (Random Forest) works with a large collection of decision trees. The method generates a large set (forest) of independent decision trees using different random samples of the original data set. Voting is then used to determine the final class [102].
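The common thread in all four methods is combining the votes of several base classifiers. A minimal (optionally weighted) majority-vote sketch in Python follows; the thesis used WEKA's ensemble implementations, and the toy base classifiers, feature names and class labels here are hypothetical.

```python
from collections import Counter

# Combine base-classifier predictions by (optionally weighted)
# majority vote. Classifiers and weights below are hypothetical.
def ensemble_predict(classifiers, record, weights=None):
    weights = weights or [1.0] * len(classifiers)
    votes = Counter()
    for clf, w in zip(classifiers, weights):
        votes[clf(record)] += w      # each classifier casts a weighted vote
    return votes.most_common(1)[0][0]

# Three toy base classifiers predicting a scheduler class from a
# hypothetical feature dictionary.
def clf_a(r):
    return "RCSched-I" if r["nodes"] < 100 else "RCSched-III"

def clf_b(r):
    return "RCSched-III"

def clf_c(r):
    return "RCSched-I" if r["edges"] < 90 else "RCSched-III"

print(ensemble_predict([clf_a, clf_b, clf_c], {"nodes": 50, "edges": 40}))
# two of three votes -> RCSched-I
```

Bagging, boosting and random forests differ mainly in how the base classifiers are trained and weighted; the final combination step is a vote of this form.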

In general, each of the previous ensemble methods seeks to create a more accurate classifier by combining less-accurate ones. Figures 7.6 and 7.7 compare the prediction accuracy of the single classifier, J48, to that of the four ensemble methods Random Forest, Boosting, Bagging, and Stacking for the synthetic and MediaBench DSP benchmarks, respectively. (J48 was selected for comparison since it achieved the highest average accuracy rate, 83%, along with MLP and SVM.) It is clear that in every case all four ensembles achieve higher average accuracy rates than J48. The average improvements in accuracy range from 1% to 11%. However, there is no statistically significant difference in average accuracy between the four ensembles.

As observed with the individual classifiers, the average accuracy of all four ensembles decreases as the number of classes increases. In particular, the average accuracies of all four ensembles for Case I and Case II, which represent binary classification problems, are 96.15% and 96.03%, respectively. For Case V, which is a 3-class problem, the average accuracy of the ensembles is 86.16%. Finally, the worst average accuracy (70.9%) occurs for Case III, where the number of classes increases to 4. Although the use of ensembles does not fully solve the problem of class imbalance, it does help to mitigate its effects. Notice that the largest improvements in average accuracy compared with J48 occurred in Case III and Case IV, both of which have more than 2 classes.

Table 7.8 shows the average accuracy achieved by all five individual classifiers and all four ensembles. It is clear that the ensembles outperform all of the single-classifier approaches. Table 7.9 shows the average training time for each of the classifiers. Although there is no statistically significant difference between the four ensembles in terms of accuracy, both bagging and stacking require an order of magnitude more training time than random forest and boosting. This suggests that the latter two techniques should be preferred for their run-time efficiency.

The area under the ROC curve for the ensemble techniques is shown in Figures 7.8 and 7.9 for the synthetic and MediaBench DSP suites, respectively. It is clear that the average performance of the ensemble-based techniques is better than that of the J48 machine learning technique in all cases.

Table 7.8: Accuracy with t-test significance evaluation [synthesized benchmarks]

Dataset | MLP   | NB      | SVM     | KNN     | J48     | Rand. Forest | AdaBoost | Bagging | Stacking
Case-1  | 94.74 | 90.12 • | 95.31   | 96.67 ◦ | 92.60 • | 96.36 ◦      | 96.29 ◦  | 95.86 ◦ | 96.09 ◦
Case-2  | 95.71 | 93.72 • | 95.19   | 95.11   | 91.27 • | 96.01        | 96.32    | 95.97   | 95.82
Case-3  | 77.80 | 66.30 • | 71.61 • | 79.69 ◦ | 74.06 • | 84.79 ◦      | 84.14 ◦  | 84.45 ◦ | 84.50 ◦
Case-4  | 64.85 | 58.37 • | 66.73   | 61.79 • | 60.45 • | 70.41 ◦      | 70.90 ◦  | 71.27 ◦ | 71.00 ◦
Case-5  | 85.01 | 83.57 • | 87.72 ◦ | 84.34   | 81.20 • | 87.01 ◦      | 86.94 ◦  | 87.14 ◦ | 87.56 ◦
◦, • statistically significant improvement or degradation

Table 7.9: Training time (seconds) for different classification algorithms.

Dataset | MLP   | NB     | SVM    | KNN    | J48    | Rand. Forest | AdaBoost | Bagging | Stacking
Case-1  | 51.90 | 0.01 ◦ | 0.10 ◦ | 0.00 ◦ | 0.10 ◦ | 0.40 ◦       | 0.40 ◦   | 3.60 ◦  | 5.80 ◦
Case-2  | 40.90 | 0.01 ◦ | 0.10 ◦ | 0.00 ◦ | 0.10 ◦ | 0.40 ◦       | 0.40 ◦   | 3.50 ◦  | 5.70 ◦
Case-3  | 83.90 | 0.02 ◦ | 0.50 ◦ | 0.00 ◦ | 0.60 ◦ | 1.60 ◦       | 1.70 ◦   | 14.60 ◦ | 27.40 ◦
Case-4  | 63.30 | 0.02 ◦ | 0.50 ◦ | 0.00 ◦ | 0.40 ◦ | 1.20 ◦       | 1.30 ◦   | 10.70 ◦ | 20.80 ◦
Case-5  | 44.00 | 0.01 ◦ | 0.20 ◦ | 0.00 ◦ | 0.20 ◦ | 0.70 ◦       | 0.70 ◦   | 6.00 ◦  | 10.60 ◦
Average | 56.80 | 0.014  | 0.28   | 0.00   | 0.28   | 0.86         | 0.90     | 7.68    | 14.06
◦, • statistically significant improvement or degradation


Figure 7.6: Ensemble Classifiers: Prediction Accuracy for the Synthesized Benchmarks (accuracy in %, Cases 1-5; J48, Random Forest, AdaBoost, Bagging, Stacking)

Figure 7.7: Ensemble Classifiers: Prediction Accuracy for MediaBench (accuracy in %, Cases 1-5; J48, Random Forest, AdaBoost, Bagging, Stacking)


Figure 7.8: Ensemble Classifiers: Area Under ROC for the Synthesized Benchmarks (Cases 1-5; J48, Random Forest, AdaBoost, Bagging, Stacking)

Figure 7.9: Ensemble Classifiers: Area Under ROC for MediaBench (Cases 1-5; J48, Random Forest, AdaBoost, Bagging, Stacking)


7.4.6 Random Baseline

As a final demonstration of the efficiency of our approach, we compare our prediction models with a random predictor, as highlighted in Tables 7.10-7.13. It is important to note that the best solutions entered in Tables 7.10-7.13 were obtained by exhaustive search, while the random solutions in these tables were picked randomly from the pool of solutions generated. Working with synthetic data first, 11 DFGs were randomly selected (with uniform probability) out of the pool of 258 DFGs. These DFGs were then compared with the DFGs predicted by the models for Case-III and Case-I, using SVM classifiers, on the basis of minimizing execution time (number of cycles) and power. The power was measured as explained previously in Section 6.2.1. The number of cycles was used to make the model more general, so it can target multiple platforms with different frequencies; the actual time can be calculated by dividing the number of cycles by the frequency of the system (cycles/frequency). Tables 7.10 and 7.11 show the results for time and power, respectively. An asterisk (*) indicates that the model missed the best class, yet the predicted solution remains close to the best and is much better than the random guess in almost all cases. Results obtained for the synthetic benchmarks indicate that the ML-engine proposed in this thesis is 13.9% and 0.1% away from the best available solutions for latency and power, respectively, and improves upon a random baseline solution by 278% and 119% for latency and power, respectively. Similarly, a baseline comparison was performed on the MediaBench DSP benchmarks for Case-III and Case-I, but using the J48 classifier. The results are shown in Tables 7.12 and 7.13. Results obtained for the MediaBench DSP benchmarks indicate that the proposed ML-engine is 4.2% and 0.0% away from the best available solution for latency and power, respectively, and improves upon a random baseline solution by 403% and 103% for latency and power, respectively.
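One plausible way to compute such percentages, sketched with hypothetical cycle counts (these are not the thesis's published averages, and the thesis does not spell out its exact formula): distance from the exhaustive-search best, and improvement over the random pick, both relative to the predicted solution.

```python
# Baseline-comparison metrics (illustrative; numbers are hypothetical).
def pct_from_best(predicted, best):
    """How far the predicted solution is from the best, in percent."""
    return 100.0 * (predicted - best) / best

def pct_improvement_over(predicted, baseline):
    """How much the predicted solution improves on a baseline, in percent."""
    return 100.0 * (baseline - predicted) / predicted

best, predicted, random_pick = 2000, 2100, 6300  # hypothetical cycle counts
print(pct_from_best(predicted, best))                # -> 5.0
print(pct_improvement_over(predicted, random_pick))  # -> 200.0
```

Averaging these two metrics over the rows of Tables 7.10-7.13 gives per-benchmark summaries of the kind quoted above.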

Table 7.10: Performance Enhancement for Total Time (Synthesized). Values are # of cycles [Case 3, SVM].

DFG ID  | # nodes | Random | Best  | Data mining
9       | 100     | 27,610 | 1,317 | *1,418
39      | 75      | 4,464  | 2,226 | 2,226
40      | 100     | 3,333  | 2,122 | *2,484
43      | 200     | 5,796  | 2,038 | 2,038
53      | 50      | 3,503  | 1,427 | *1,536
54      | 75      | 4,816  | 1,685 | 1,685
61      | 450     | 3,635  | 2,401 | *4,888
63      | 750     | 8,173  | 3,727 | *3,985
99      | 1,000   | 4,827  | 4,242 | 4,242
106     | 125     | 4,988  | 2,207 | 2,207
112     | 500     | 5,257  | 4,073 | *5,186
Average |         | 6,946  | 2,497 | 2,900

Table 7.11: Performance Enhancement for Power (Synthesized). Values are power in mW [Case 1, SVM].

DFG ID  | # nodes | Random | Best   | Data mining
9       | 100     | 1,882  | 1,845  | 1,845
39      | 75      | 1,550  | 1,419  | 1,419
40      | 100     | 2,166  | 1,755  | 1,755
43      | 200     | 4,591  | 3,685  | 3,685
53      | 50      | 940    | 807    | 807
54      | 75      | 1,695  | 1,420  | 1,420
61      | 450     | 9,821  | 7,779  | 7,779
63      | 750     | 17,484 | 13,797 | 13,797
99      | 1,000   | 27,758 | 24,885 | *24,963
106     | 125     | 2,814  | 2,319  | 2,319
112     | 500     | 11,648 | 9,490  | 9,490
Average |         | 7,486  | 6,291  | 6,298


Table 7.12: Performance Enhancement for Total Time (MediaBench DSP). Values are # of cycles [Case 3, J48].

DFG ID  | # nodes | Random | Best  | Data mining
1       | 106     | 2,499  | 1,091 | *1,791
2       | 51      | 2,143  | 990   | 990
3       | 134     | 2,794  | 1,995 | 1,995
4       | 122     | 9,755  | 2,064 | *2,169
5       | 114     | 13,156 | 2,061 | *2,063
6       | 32      | 2,774  | 567   | 567
7       | 56      | 1,265  | 1,061 | 1,061
8       | 333     | 12,490 | 2,844 | 2,844
9       | 197     | 13,158 | 2,073 | 2,073
10      | 18      | 2,567  | 503   | *679
11      | 108     | 9,229  | 1,128 | 1,128
12      | 109     | 3,809  | 1,610 | *1,815
13      | 53      | 3,642  | 982   | 982
14      | 11      | 692    | 349   | 349
15      | 44      | 2,994  | 1,220 | *1,487
16      | 40      | 1,398  | 1,015 | *1,091
17      | 34      | 1,469  | 799   | 799
18      | 28      | 983    | 848   | 848
19      | 66      | 1,696  | 1,507 | 1,507
20      | 82      | 13,331 | 1,649 | *1,971
Average |         | 5,092  | 1,318 | 1,262


Table 7.13: Performance Enhancement for Power (MediaBench DSP). Values are power in mW [Case 1, J48].

DFG ID  | # nodes | Random | Best  | Data mining
1       | 106     | 2,954  | 2,801 | 2,801
2       | 51      | 1,455  | 1,366 | 1,366
3       | 134     | 4,257  | 4,065 | 4,065
4       | 122     | 3,514  | 3,399 | 3,399
5       | 114     | 3,747  | 3,737 | 3,737
6       | 32      | 781    | 781   | 781
7       | 56      | 1,416  | 1,373 | 1,373
8       | 333     | 7,100  | 6,974 | 6,974
9       | 197     | 5,657  | 5,597 | 5,597
10      | 18      | 398    | 396   | 396
11      | 108     | 2,788  | 2,700 | 2,700
12      | 109     | 2,597  | 2,471 | 2,471
13      | 53      | 1,299  | 1,259 | 1,259
14      | 11      | 230    | 227   | 227
15      | 44      | 988    | 978   | 978
16      | 40      | 1,019  | 990   | 990
17      | 34      | 1,068  | 1,049 | 1,049
18      | 28      | 697    | 677   | 677
19      | 66      | 1,319  | 1,253 | 1,253
20      | 82      | 1,832  | 1,741 | 1,741
Average |         | 2,256  | 2,192 | 2,192


7.5 Summary

The performance metric is one of the fundamental reasons for using Reconfigurable Computing Systems (RCS). By mapping algorithms and applications to hardware, designers can tailor not only the computation components, but also perform data-flow optimization to match the algorithm. There are many challenges in adaptive computing and dynamic reconfigurable systems. One of the major under-studied challenges is estimating the required resources and predicting an appropriate floorplan for the system. In this thesis we propose a novel technique based on machine learning to predict the necessary resources for dynamic run-time reconfigurations. For our work proposed in this chapter, a classification model is used to predict the appropriate type of resources for a reconfigurable computing platform given a specific application. Using ensemble-based systems provides favorable results compared to single-expert machine-learning algorithms under a variety of scenarios.
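The ensemble-based prediction described above can be illustrated with a minimal majority-vote sketch. The DFG features (`nodes`, `edges`, `parallelism`), the thresholds, and the floorplan labels (`"2-PRR"`, `"4-PRR"`) are hypothetical stand-ins for the trained classifiers and resource classes used in this work:

```python
from collections import Counter

# Toy single-feature "experts" standing in for trained classifiers
# (the thesis uses models such as J48 decision trees). The features,
# thresholds and floorplan labels are illustrative assumptions.
def stump_nodes(dfg):
    return "4-PRR" if dfg["nodes"] > 100 else "2-PRR"

def stump_parallelism(dfg):
    return "4-PRR" if dfg["parallelism"] > 3.0 else "2-PRR"

def stump_edges(dfg):
    return "4-PRR" if dfg["edges"] > 150 else "2-PRR"

def ensemble_predict(features, experts):
    """Majority vote over the individual experts' predictions."""
    votes = Counter(expert(features) for expert in experts)
    return votes.most_common(1)[0][0]

experts = [stump_nodes, stump_parallelism, stump_edges]
dfg = {"nodes": 134, "edges": 120, "parallelism": 3.5}
print(ensemble_predict(dfg, experts))  # prints "4-PRR" (two of three experts agree)
```

In practice each expert would be a model trained on features extracted from the benchmark set, and the vote could additionally be weighted by each model's validation accuracy.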

Chapter 8

Conclusions and Future Work

Several embedded application domains for reconfigurable systems tend to combine frequent changes with high performance demands of their workloads, such as image processing, wearable computing, and network processors. For example, in wireless communication systems several standards and technologies are available, such as GSM, WiMAX and WCDMA, but it is unlikely that all of these protocols will be used at the same time. Accordingly, it is possible to dynamically load only the one that is needed. Dynamic run-time reconfigurable architectures are considered an attractive avenue for composing efficient platforms for today's embedded systems. Partial reconfiguration is appealing and attractive since it provides flexibility and minimizes power, cost and area. The resulting system is one that can provide high performance by implementing custom hardware functions in the FPGA and still be flexible by reprogramming the FPGA and/or using the attached processor.

However, multitasking reconfigurable hardware is complex and requires some overhead in terms of management. Time multiplexing of reconfigurable hardware resources raises a number of new issues, ranging from run-time systems to complex programming models. In order for users to benefit from the flexibility of such systems, an operating system must be developed not only to further reduce the complexity of application development but also to provide the developer with tools at a higher level of abstraction.

8.1 Conclusions

In this thesis an efficient operating system for run-time reconfigurable architectures was designed and implemented to ease application design and to help properly manage resources within the reconfigurable system. Making such a reconfiguration manager available along with the current flow should further enhance and improve partial reconfiguration on state-of-the-art FPGAs. The thesis was carried out in three main phases.

1. The first phase of the thesis involved the development and evaluation of several novel heuristics for on-line scheduling of hard real-time tasks running on partially reconfigurable devices. The proposed schedulers used fixed, predefined partially reconfigurable regions with re-use, relocation, and task-migration capability. In particular, RCSched-III used real-time data to make scheduling decisions. The scheduler dynamically measures several performance metrics, such as reconfiguration time and execution time, calculates a priority and, based on these metrics, assigns incoming tasks to the appropriate processing elements. In order to evaluate the proposed framework and schedulers, a DFG generator was carefully designed and developed. It randomly generates benchmarks with predefined specifications, such as the number of nodes, the task types and the total number of dependencies per DFG. The tools developed in this thesis should assist developers in using dynamic reconfiguration with more ease and flexibility.

2. The second phase of the thesis proposed a design and implementation of a parallel island-based GA approach for efficiently mapping execution units to task graphs for partially dynamically reconfigurable systems. Each GA optimization module consisted of four main components: an Architecture Library, an Initial Population Generation module, a GA Engine, and a Fitness Evaluation module (on-line scheduler). The Architecture Library stores all of the different available architectures (execution units) for every task type. Architectures vary in execution time, area, reconfiguration time, and power consumption. Each GA module tends to optimize several criteria (objectives) based on a single fixed floorplan/platform. Unlike previous works, our approach is multi-objective and not only seeks to optimize speed and power, but also seeks to select the best reconfigurable floorplan. The basic idea is to aggregate the results obtained from the Pareto fronts of each island to enhance the overall solution quality. Each solution on the Pareto front tends to optimize power consumption, speed and area based on a different platform (floorplan) within the FPGA. Our approach was tested using both synthetic and real-world benchmarks. Experimental results have demonstrated the effectiveness of this approach, where the proposed island-based GA framework achieved on average a 55.2% improvement over a single GA implementation and an 80.7% improvement over a baseline random allocation and binding approach. This is one of the few attempts to use multiple GA instances to optimize several objectives and aggregate the results to further improve solution quality.

3. The third and final phase of this thesis proposed a novel adaptive and dynamic methodology, based on an intelligent machine-learning approach, that is used to predict and estimate the necessary resources for an application based on past historical information. An important feature of the proposed methodology is that the system is able to learn as it gains more knowledge and, therefore, is expected to generalize and improve its accuracy over time. Even though the approach is general enough to predict most if not all types of resources, from the number and type of PRRs to the type of scheduler and the communication infrastructure, we limit our results to the former three required for an application. The framework was based on extracting certain features from the applications that are executed on the reconfigurable platform. The features compiled were then used to train and build a classification model that was capable of predicting the floorplan appropriate for an application. Our proposed approach was based on several modules, including benchmark generation, data collection, pre-processing of data, data classification, and post-processing. The goal of the entire process was to extract useful, hidden knowledge from the data; this knowledge is then used to predict and estimate the necessary resources and appropriate floorplan for an unknown or not previously seen application. Based on the literature review, the use of data-mining and machine-learning techniques has not been proposed by any research group to exploit this specific type of design exploration for reconfigurable systems in terms of predicting the appropriate floorplan of an application.
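The Pareto-front aggregation step from the second phase can be sketched as follows. The three-objective tuples (latency, power, area) and the per-island fronts below are illustrative assumptions, not results from the thesis:

```python
# Sketch of combining per-island Pareto fronts into one global front.
# Each solution is a tuple of objectives (latency, power, area), all
# minimized; the island fronts below are hypothetical values.

def dominates(a, b):
    """a dominates b: no worse in every objective, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep only the non-dominated solutions."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions if other != s)]

island_a = [(10, 5, 7), (12, 4, 6)]   # front found by GA island A
island_b = [(9, 6, 7), (12, 4, 8)]    # front found by GA island B

# Aggregation step: merge both fronts, then re-filter for dominance.
combined = pareto_front(island_a + island_b)
print(combined)  # (12, 4, 8) is dominated by (12, 4, 6) and drops out
```

Because dominance filtering is applied again after merging, any island solution that is beaten in every objective by a solution from another island is discarded, which is how aggregating the fronts improves overall solution quality.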

The proposed framework in this thesis is flexible enough to be used in several operational modes. The modes of operation will depend on the degree of integration of the different phases developed above and requested by the user.

8.2 Future Work

Designing and implementing operating systems for partially reconfigurable systems is a complex, tedious and iterative process. Creating an effective and robust operating system for reconfigurable embedded systems is an integral part of our ongoing research. Accordingly, in our future work we will attempt to pursue the following directions:

• Develop more efficient scheduling algorithms that can produce near-optimal solutions compared to offline schedulers.

• Overcome some of the current limitations of the prediction model by applying it to predict other important resources, such as the communication infrastructure and the communication links between soft cores and partially reconfigurable regions.

• Develop hardware acceleration of the supervised learning classifiers used in this work, in order to achieve real-time performance.

• Extend the ideas introduced in this thesis to specific applications, including real-time image processing, computer vision and communication systems.


systems,” inDesign, Automation and Test in Europe, 2008. DATE ’08, 2008, pp.

519–522, iD: 1.

[124] F. Redaelli, M. D. Santambrogio, and S. O. Memik, “An ilp formulation for the

task graph scheduling problem tailored to bi-dimensional reconfigurable architec-

tures,” inReconfigurable Computing and FPGAs, 2008. ReConFig ’08. Interna-

tional Conference on, 2008, pp. 97–102, iD: 1.

Page 231: Efficient Scheduling, Mapping and Resource Prediction for

BIBLIOGRAPHY 211

[125] A. Elhossini, J. Huissman, B. Debowski, S. Areibi, andR. Dony, “An efficient

scheduling methodology for heterogeneous multi-core processor systems,” inMi-

croelectronics (ICM), 2010 International Conference on, 2010, pp. 475–478.

[126] “Express benchmarks: Electrical & computer engineering department at the ucsb,

usa,” http://express.ece.ucsb.edu/benchmark/.

[127] J. R. M. Potkonjak, “Optimizing resource utilizationusing transformation,”IEEE

Transaction on Computer Aided Design of Integrated Circuits and Systems,

vol. 13, no. 3, pp. 277–292, March 1994.

[128] M. H. E. F. G. H. B. P. P. R. I. Witten, “The weka data mining software: an update,”

SIGKDD Explor.Newsl., vol. 11, no. 1, pp. 10–18, nov 2009.

[129] I. H. Witten, E. Frank, and M. A. Hall,Data Mining: Practical Machine Learning

Tools and Techniques, 3rd ed. Amsterdam: Morgan Kaufmann, 2011.

[130] S. Amari and S. Wu, “Improving support vector machine classifiers by modifying

kernel functions,”Journal of Neural Networks, vol. 12, no. 6, pp. 783–789, July

1999.

[131] L. Rokach, “Ensemble-based classifiers,”Artificial Intelligence Review, vol. 33,

no. 1, pp. 1–39, February 2010.

[132] Z. Harry, “The optimality of naive bayes,” inFLAIR, Florida, 2004, pp. 562 – 567.

[133] P. Wasserman and T. Schwartz, “Neural networks: What are they and why is ev-

erybody so interested in them now?”IEEE Expert, vol. 3, no. 1, pp. 10–15, 1988.

Page 232: Efficient Scheduling, Mapping and Resource Prediction for

BIBLIOGRAPHY 212

[134] P. Hall and R. Samworth, “Choice of neighbor order in nearest neighbor classifi-

cation,”Annals of Statistics, vol. 36, no. 5, pp. 2135–2152, 2008.

[135] D. Beasley, D. R. Bull, and R. R. Martin, “An overview ofgenetic algorithms:

Part 1, fundamentals,” 1993.

[136] S. H. Park, J. M. Goo, and C.-H. Jo, “Receiver operatingcharacteristic (roc) curve:

practical review for radiologists.”Korean journal of radiology : official journal of

the Korean Radiological Society, vol. 5, no. 1, pp. 11–18.

[137] J. A. Hanley and B. J. McNeil, “The meaning and use of thearea under a receiver

operating characteristic (roc) curve,”Radiology, vol. 143, no. 1, pp. 29–36, Apr

1982.

[138] J. Hilden, “The area under the roc curve and its competitors,” Medical Decision

Making, vol. 11, no. 2, pp. 95–101, 1991.

[139] L. Breiman, “Bagging predictors,”Machine Learning, vol. 24, no. 2, pp. 123–140,

1996.

[140] R. Schapire, “The boosting approach to machine learning: An overview,” inMSRI

Workshop on Nonlinear Estimation and Classification, Florida, 2002, pp. 1–23.

[141] “Power methodology guide,”UG786(V13.1), March 1, 2011.

[142] “Vivado design suite user guide power analysis and optimization,”

UG907(V2012.2), July 25, 2012.

[143] “Fusion digital power designer version 1.8230,” June12, 2012.


Appendix A

Scheduling and Placement Restriction

In this Appendix, RCSched-I, RCSched-II, and RCSched-III are evaluated with respect to the RCSched-Base described in Section 5.2.4. We looked at the problem using two PRR floorplans (uniform and non-uniform). In addition to the PRR uniformity, the schedulers' performance was evaluated with task placement restriction. In many cases a task can be placed onto only specific locations; the placement restriction can be due to size, I/O, or communication restrictions. The benchmark used at this stage is larger, with a wide variety of DFGs, as described in Section 5.4.1.

A.1 Total Number of Reconfigurations

The total number of reconfigurations represents how frequently the system uses the ICAP controller to download a bitstream onto the FPGA fabric. The total number of reconfigurations is related to reconfiguration time, yet the two differ for a non-uniform platform. An intelligent scheduler may have a higher number of reconfigurations but tends


to use smaller PRRs, which leads to lower reconfiguration time.
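The distinction between reconfiguration count and reconfiguration time can be made concrete with a small sketch. The bitstream sizes and ICAP throughput below are purely illustrative assumptions (not measured values from the platform); the sketch only shows why steering tasks to smaller PRRs can beat minimizing the raw count:

```python
# Illustrative only: bitstream sizes and ICAP throughput are assumed,
# not measured values from the thesis platform.

def total_reconfig_time_ms(events, prr_bitstream_kb, icap_kb_per_ms=400):
    """events: sequence of PRR indices reconfigured, in order."""
    return sum(prr_bitstream_kb[prr] / icap_kb_per_ms for prr in events)

sizes = [1200, 300]                                    # PRR 0 large, PRR 1 small (KB)
a = total_reconfig_time_ms([0] * 40, sizes)            # 40 reconfigs, all large PRR
b = total_reconfig_time_ms([1] * 50 + [0] * 5, sizes)  # 55 reconfigs, mostly small PRR
assert b < a   # more reconfigurations, yet lower total reconfiguration time
```

This is why, on a non-uniform floorplan, a scheduler can afford a higher reconfiguration count as long as it favours the smaller regions.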

Non-Restricted Placement: From Figure A.1, it can be noticed that the number of reconfigurations is inversely proportional to the number of PRRs. Implementations with fewer than three PRRs showed no significant difference among the three schedulers. RCSched-III's improved performance can be attributed to its higher number of hardware-to-software migrations.

Non-Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    39.7  51.4  55.4  61.9  67.1
RCSched-II   39.5  51.6  54.9  56.8  53.7
RCSched-III  37.9  49.0  50.2  47.6  42.9
Baseline     77.9  80.4  82.6  84.6  86.0

Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    35.5  53.0  62.4  67.2  70.0
RCSched-II   35.5  52.8  58.1  57.1  45.2
RCSched-III  34.9  44.2  42.8  42.2  37.7
Baseline     63    75    83    85    67

Figure A.1: Total Number of Reconfigurations: Non-Restricted Placement


Restricted Placement: As expected, the results obtained indicate that RCSched-I is inferior to RCSched-II and RCSched-III, as shown in Figure A.2.

Non-Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    47.3  51.1  50.0  46.8  42.9
RCSched-II   48.3  49.8  46.7  37.6  30.6
RCSched-III  48.5  49.9  45.7  39.0  32.0
Baseline     61.4  64.6  66.9  68.0  68.0

Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    46.5  52.6  49.4  45.5  42.8
RCSched-II   49.4  50.5  45.8  37.2  29.2
RCSched-III  46.5  49.8  43.9  36.8  30.3
Baseline     61    67    68    70    70

Figure A.2: Total Number of Reconfigurations: Restricted Placement

A.2 Number of Hardware Task Reuse

Figure A.3 and Figure A.4 show the total number of task reuses; reusing a hardware task avoids reconfiguration and minimizes total execution time.
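The reuse mechanism can be sketched as a reuse-first placement check. The function below is a hypothetical illustration of the idea (the names and the two-level preference order are assumptions, not the exact RCSched implementation): before reconfiguring, the scheduler looks for an idle PRR that already holds the requested task.

```python
# Hypothetical sketch of hardware-task reuse: prefer an idle PRR that
# already contains the task's bitstream; otherwise reconfigure any idle
# PRR; otherwise report that no PRR is free (a software-migration case).

def place_task(task_id, prr_contents, prr_busy):
    """Return (prr_index, reused) or (None, False) when every PRR is busy."""
    # Preference 1: reuse an idle PRR already configured with this task.
    for prr, resident in enumerate(prr_contents):
        if resident == task_id and not prr_busy[prr]:
            return prr, True              # reuse: no reconfiguration needed
    # Preference 2: reconfigure any idle PRR.
    for prr, busy in enumerate(prr_busy):
        if not busy:
            prr_contents[prr] = task_id
            return prr, False             # reconfiguration required
    return None, False                    # no free PRR: candidate for SW migration

contents, busy = [3, None, 7], [False, True, False]
assert place_task(7, contents, busy) == (2, True)    # hit: PRR 2 reused
assert place_task(5, contents, busy) == (0, False)   # miss: PRR 0 reconfigured
```

With more PRRs, more bitstreams stay resident, which matches the growth of the reuse counts with the number of PRRs in the figures below.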


Non-Restricted Placement: Figure A.3 clearly shows that RCSched-II has more reuse than RCSched-III, yet Figure 5.13 shows that RCSched-II has a higher overall time in most cases. This is due to a lower hardware-to-software task migration ratio for RCSched-II.

Non-Uniform:
# of PRRs      1     2     3     4     5
RCSched-I     8.7   9.1  12.8  13.6  15.4
RCSched-II    9.4  10.0  13.4  18.6  27.7
RCSched-III   8.8  10.3  13.4  17.7  24.5
Baseline      0     0     0     0     0

Uniform:
# of PRRs      1     2     3     4     5
RCSched-I     8.9  10.8  13.6  15.3  15.2
RCSched-II    8.0  11.0  17.5  25.4  28.9
RCSched-III   8.0  11.5  18.0  21.5  27.6
Baseline      0     0     0     0     0

Figure A.3: Total Number of HW Reuse: Non-Restricted Placement

Restricted Placement: As more restrictions are imposed on placement, the performance of the schedulers tends to converge. RCSched-I has the worst results, while RCSched-II and RCSched-III were almost identical, as shown in Figure A.4.

A.3 Busy Counter

The busy counter (idle timer) accumulates the time tasks spend waiting for a free hardware processing element. When there are no free hardware processing elements, the schedulers tend to migrate a task from hardware to software, when applicable, rather than letting it wait and incrementing the counter.
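A minimal sketch of this policy follows; the function name, the tick granularity, and the assumption that hybrid tasks may always fall back to the GPP are illustrative, not the exact scheduler logic:

```python
# Hypothetical busy-counter policy: when a PRR is free, run in hardware;
# when none is free, migrate a hybrid task to software instead of idling;
# only a hardware-only task waits, which advances the busy (idle) counter.

def dispatch(task_is_hybrid, any_prr_free, busy_counter, wait_ticks=1):
    if any_prr_free:
        return "hw", busy_counter             # run in hardware, no waiting
    if task_is_hybrid:
        return "sw", busy_counter             # migrate: counter not incremented
    return "wait", busy_counter + wait_ticks  # must wait for a PRR

assert dispatch(True, True, 0) == ("hw", 0)
assert dispatch(True, False, 0) == ("sw", 0)
assert dispatch(False, False, 0) == ("wait", 1)
```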


Non-Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    14.1  14.5  17.0  20.7  25.0
RCSched-II   14.2  17.0  21.9  31.7  39.3
RCSched-III  13.8  16.9  22.4  30.2  37.3
Baseline      0     0     0     0     0

Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    15.4  14.1  18.5  23.4  26.5
RCSched-II   13.4  15.8  22.8  32.0  40.6
RCSched-III  14.2  16.2  24.1  32.1  38.8
Baseline      0     0     0     0     0

Figure A.4: Total Number of HW Reuse: Restricted Placement


Figures A.5 and A.6 show that the busy counter (idle timer) is inversely proportional to the number of PRRs available in the system.

Non-Restricted Placement: RCSched-III has a lower idle period compared to RCSched-II due to more optimized utilization of resources when the number of PRRs exceeds three, as shown in Figure A.5.

Non-Uniform:
# of PRRs        1      2      3      4      5
RCSched-I    45682  24651  13239   4652   1755
RCSched-II   64324  32979  17113   6811   2829
RCSched-III  51967  25967  12782   5970   2238
Baseline     74767  27333   7720   1795      0

Uniform:
# of PRRs        1      2      3      4      5
RCSched-I    41846  16762   7481   1901    353
RCSched-II   39738  15781   7004   2433   7121
RCSched-III  59590  20005   7872   2508    925
Baseline     52389  14188   3105    541   4159

Figure A.5: Idle Time Measurements: Non-Restricted Placement


Restricted Placement: RCSched-I has the worst results, while RCSched-II and RCSched-III were almost identical, as shown in Figure A.6.

Non-Uniform:
# of PRRs        1      2      3      4      5
RCSched-I    54007  18334   6999   4859   4297
RCSched-II   53779  21412  10108   5696   6531
RCSched-III  79747  27481  13420   8620  10239
Baseline     54176  17767   5378   2401   1913

Uniform:
# of PRRs        1      2      3      4      5
RCSched-I    59202  19435   8475   6475   5969
RCSched-II   55099  18372   8883   5604   6696
RCSched-III  77509  22668  11702   8554   8744
Baseline     57927  14317   4595   1629   2492

Figure A.6: Idle Time Measurements: Restricted Placement

A.4 Hardware to Software Migration

The hardware-to-software migration parameter measures the total number of tasks that migrate from hardware to software.


Non-Restricted Placement: Recall from Section 5.4.1 that all tasks within a DFG were initialized as hybrid hardware tasks. Therefore, any task that was processed on the GPP was there due to migration.

Figure A.7 clearly indicates that RCSched-III has a higher number of hardware-to-software migrations, since it performs migration based on intelligent criteria. RCSched-I and RCSched-II, on the other hand, only migrate tasks to software when there is a lack of hardware processing elements.

Restricted Placement: RCSched-I has the worst performance (it has similar perfor-

mance to the baseline scheduler), while RCSched-II and RCSched-III were almost iden-

tical as shown in Figure A.8.


Non-Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    37.6  25.5  17.9  10.6   3.5
RCSched-II   37.1  24.4  17.7  10.6   4.7
RCSched-III  39.4  26.7  22.4  20.7  18.6
Baseline      8.1   5.6   3.4   1.4   0.0

Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    41.6  22.3  10.0   3.5   0.9
RCSched-II   42.6  22.2  10.5   3.6  12.0
RCSched-III  43.1  30.3  25.2  22.3  20.7
Baseline     23    11     3     1     5

Figure A.7: Number of Hardware-to-Software Task Migrations: Non-Restricted Placement


Non-Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    23.1  19.0  17.5  17.0  16.7
RCSched-II    9.5   5.2   3.5   2.7   2.1
RCSched-III   9.7   5.2   4.0   2.8   2.8
Baseline     23.2  19.9  17.6  16.5  16.6

Uniform:
# of PRRs      1     2     3     4     5
RCSched-I    24.1  19.3  18.1  17.2  16.8
RCSched-II    9.3   5.8   3.4   2.8   2.3
RCSched-III  11.3   6.1   4.0   3.1   2.9
Baseline     25    19    18    16    16

Figure A.8: Number of Hardware-to-Software Task Migrations: Restricted Placement


Appendix B

Architecture Library

The architecture library used in Chapter 6 is shown below:

Name = " Architecture Library v1.0"

Date = "Jul 03, 2013"

######################

Task Task1 {

id = 1

arch arch1 {

exec_time = 20

config_time = 5

config_power = 3

exec_power = 10

columns = 1


rows = 1

mode = HW

}

arch arch2 {

exec_time = 10

config_time = 10

config_power = 5

exec_power = 5

columns = 1

rows = 2

mode = HW

}

arch arch3 {

exec_time = 5

config_time = 20

config_power = 5

exec_power = 3

columns = 2

rows = 2

mode = HW

}

}

######################

Task Task2 {

id = 2


arch arch1 {

exec_time = 30

config_time = 5

config_power = 3

exec_power = 20

columns = 1

rows = 1

mode = HW

}

arch arch2 {

exec_time = 20

config_time = 10

config_power = 5

exec_power = 15

columns = 2

rows = 1

mode = HW

}

arch arch3 {

exec_time = 10

config_time = 20

config_power = 10

exec_power = 10

columns = 2

rows = 2

mode = HW


}

arch arch4 {

exec_time = 5

config_time = 30

config_power = 15

exec_power = 5

columns = 3

rows = 2

mode = HW

}

}

######################

Task Task3 {

id = 3

arch arch1 {

exec_time = 125

config_time = 20

config_power = 10

exec_power = 70

columns = 2

rows = 2

mode = HW

}

arch arch2 {


exec_time = 75

config_time = 30

config_power = 15

exec_power = 50

columns = 3

rows = 2

mode = HW

}

arch arch3 {

exec_time = 20

config_time = 20

config_power = 10

exec_power = 40

columns = 4

rows = 1

mode = HW

}

arch arch4 {

exec_time = 30

config_time = 75

config_power = 35

exec_power = 30

columns = 5

rows = 3

mode = HW

}


arch arch5 {

exec_time = 20

config_time = 125

config_power = 65

exec_power = 25

columns = 5

rows = 5

mode = HW

}

}

######################

Task Task4 {

id = 4

arch arch1 {

exec_time = 40

config_time = 75

config_power = 35

exec_power = 70

columns = 3

rows = 4

mode = HW

}

arch arch2 {

exec_time = 30


config_time = 125

config_power = 65

exec_power = 50

columns = 5

rows = 5

mode = HW

}

}

######################

Task Task5 {

id = 5

arch arch1 {

exec_time = 15

config_time = 5

config_power = 3

exec_power = 10

columns = 1

rows = 1

mode = HW

}

arch arch2 {

exec_time = 10

config_time = 10

config_power = 5

exec_power = 5


columns = 1

rows = 2

mode = HW

}

arch arch3 {

exec_time = 5

config_time = 20

config_power = 10

exec_power = 3

columns = 2

rows = 2

mode = HW

}

}

######################

Task Task6 {

id = 6

arch arch1 {

exec_time = 40

config_time = 5

config_power = 3

exec_power = 30

columns = 1

rows = 1

mode = HW


}

arch arch2 {

exec_time = 30

config_time = 10

config_power = 5

exec_power = 15

columns = 2

rows = 1

mode = HW

}

arch arch3 {

exec_time = 15

config_time = 20

config_power = 10

exec_power = 10

columns = 2

rows = 2

mode = HW

}

}

######################

Task Task7 {

id = 7


arch arch1 {

exec_time = 75

config_time = 20

config_power = 10

exec_power = 50

columns = 2

rows = 2

mode = HW

}

arch arch2 {

exec_time = 50

config_time = 40

config_power = 20

exec_power = 25

columns = 4

rows = 2

mode = HW

}

}

######################

Task Task8 {

id = 8

arch arch1 {

exec_time = 30

config_time = 75


config_power = 35

exec_power = 60

columns = 3

rows = 4

mode = HW

}

arch arch2 {

exec_time = 20

config_time = 125

config_power = 65

exec_power = 55

columns = 5

rows = 5

mode = HW

}

}

######################

Task Task9 {

id = 9

arch arch1 {

exec_time = 40

config_time = 5


config_power = 3

exec_power = 30

columns = 1

rows = 1

mode = HW

}

arch arch2 {

exec_time = 30

config_time = 10

config_power = 5

exec_power = 15

columns = 2

rows = 1

mode = HW

}

arch arch3 {

exec_time = 15

config_time = 20

config_power = 10

exec_power = 10

columns = 2

rows = 2

mode = HW

}

}


######################

Task Task10 {

id = 10

arch arch1 {

exec_time = 75

config_time = 20

config_power = 10

exec_power = 50

columns = 2

rows = 2

mode = HW

}

arch arch2 {

exec_time = 50

config_time = 40

config_power = 20

exec_power = 25

columns = 4

rows = 2

mode = HW

}

}

######################

Task Task11 {


id = 11

arch arch1 {

exec_time = 30

config_time = 35

config_power = 20

exec_power = 60

columns = 3

rows = 4

mode = HW

}

arch arch2 {

exec_time = 20

config_time = 75

config_power = 35

exec_power = 55

columns = 5

rows = 5

mode = HW

}

}

######################


Task Task12 {

id = 12

arch arch1 {

exec_time = 30

config_time = 5

config_power = 3

exec_power = 20

columns = 1

rows = 1

mode = HW

}

arch arch2 {

exec_time = 20

config_time = 10

config_power = 5

exec_power = 15

columns = 2

rows = 1

mode = HW

}

arch arch3 {

exec_time = 10

config_time = 20

config_power = 10

exec_power = 10

columns = 2


rows = 2

mode = HW

}

arch arch4 {

exec_time = 5

config_time = 30

config_power = 15

exec_power = 5

columns = 3

rows = 2

mode = HW

}

}

######################

Task Task13 {

id = 13

arch arch1 {

exec_time = 125

config_time = 20

config_power = 10

exec_power = 70

columns = 2

rows = 2

mode = HW

}


arch arch2 {

exec_time = 75

config_time = 30

config_power = 15

exec_power = 50

columns = 3

rows = 2

mode = HW

}

arch arch3 {

exec_time = 20

config_time = 20

config_power = 10

exec_power = 40

columns = 4

rows = 1

mode = HW

}

arch arch4 {

exec_time = 30

config_time = 75

config_power = 35

exec_power = 30

columns = 5

rows = 3

mode = HW


}

arch arch5 {

exec_time = 20

config_time = 125

config_power = 65

exec_power = 25

columns = 5

rows = 5

mode = HW

}

}

######################

Task Task14 {

id = 14

arch arch1 {

exec_time = 40

config_time = 75

config_power = 35

exec_power = 70

columns = 3

rows = 4

mode = HW

}


arch arch2 {

exec_time = 30

config_time = 125

config_power = 65

exec_power = 50

columns = 5

rows = 5

mode = HW

}

}


Appendix C

FPGA Power Measurements

Modern state-of-the-art FPGAs require multiple power supplies. The use of multiple voltages for different FPGA resources increases performance (signal strength) and, at the same time, increases system immunity to noise and parasitic effects [141].

Based on the Xilinx power classification, the total power requirement for each supply source of an FPGA depends on three components [141]:

Device Static Power: transistor leakage power that is required for the device to operate and be ready for programming.

Design Static Power: additional power consumed when the device is configured but not active.

Design Dynamic Power: additional power drawn from design activity. This component is a function of the voltage levels, logic, and routing used.
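The three-component model can be written as a small worked sketch. The dynamic term below uses the usual switching estimate (activity x capacitance x V^2 x f); the coefficient values are illustrative assumptions, not Xilinx figures:

```python
# Hypothetical sketch of the three-component power model:
# total = device static + design static + design dynamic,
# with dynamic power modelled as alpha * C * V^2 * f.
# All numbers are illustrative, not vendor data.

def total_power_w(device_static_w, design_static_w,
                  switched_cap_f, voltage_v, freq_hz, activity):
    dynamic_w = activity * switched_cap_f * voltage_v ** 2 * freq_hz
    return device_static_w + design_static_w + dynamic_w

# 0.15 W leakage + 0.05 W configured-idle + 0.25 * 2 nF * (1 V)^2 * 100 MHz
p = total_power_w(0.15, 0.05, 2e-9, 1.0, 100e6, 0.25)
assert abs(p - 0.25) < 1e-9   # 0.15 + 0.05 + 0.05 = 0.25 W
```

The static terms are fixed by the device and the loaded design, so only the dynamic term responds to the run-time activity that the schedulers influence.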

The power consumed by an FPGA can either be dissipated as heat due to internal activities, or it can be consumed by off-chip peripherals through the I/O pins and then be dissipated


by miscellaneous chip components [141].

A Xilinx FPGA goes through several power phases from power-on to power-down, and each phase has a different power requirement. The power phases, in order, are: Power-On, Configuration, Standby, Active, Suspend, and Hibernate. For a detailed explanation of these power modes and of the Xilinx power requirements, refer to [141, 142].

C.1 Power management of the VC707 board

The Xilinx VC707 board uses several power regulators and many supporting chips in order to meet the power requirements of the FPGA and the other system components. The power distribution diagram is shown in Figure C.1. For a complete description of the power supply system of the VC707 board, refer to [2].

Figure C.1 clearly indicates that three power controllers (U42, U43 and U64) are involved. The power regulators are PMBus-compliant digital PWM system controllers from Texas Instruments (UCD9248PFC), and are mainly used to supply the core voltages of the FPGA. The UCD9248PFC controller can be used to monitor voltage, current, and power for the different power rails. Table C.1 shows the power resources supplied by each power rail. For detailed information about the role of the TI controllers on the VC707 board, refer to [2]. Texas Instruments' Fusion Digital Power graphical user interface is free software used to monitor real-time power readings of the UCD9248PFC controllers. A TI USB adapter (TI part number EVM USB-TP-GPIO) is needed to convert the PMBus signals on the board side to a standard USB signal suitable for the PC software (Fusion Digital Power Designer).


Figure C.1: Power distribution of Xilinx VC707 Board from [2].

Rail  Controller at addr. 52    Controller at addr. 53    Controller at addr. 54
1     VCCINT_FPGA (VCCINT)      VCC2V5_FPGA (VCCO 2.5V)   VCCAUX_IO (VCCAUX_IO)
2     VCCAUX (VCCAUX)           VCC1V5 (VCCO 1.5V)        VCC_BRAM (VCC_BRAM)
3     VCC3V3 (VCCO 3.3V)        MGTAVCC                   MGTVCCAUX
4     VCCADJ                    MGTAVTT                   VCC1V8_FPGA (VCCO 1.8V)

Table C.1: Power Rail Specification for UCD9248 PMBus Controllers.


C.2 Methodology

In this appendix, two different methods are used to monitor the internal FPGA power supplies. The first approach is based on the TI power controllers, accessed via the TI USB adapter and the TI Fusion software. The second approach is based on the internal System Monitor component, accessed via the standard JTAG cable. The former has the advantage of monitoring more power supplies through voltage, current, and hence power. The latter can only monitor some of the internal voltages and the junction temperature; however, it does not require special adapters or software and can therefore be applied to any FPGA with a built-in System Monitor component.

C.2.1 Monitoring FPGA Resources Using the TI UCD9248PFC Power Controllers

To monitor and measure current, voltage, and hence power for the VC707 board, we need Texas Instruments' Fusion Digital Power graphical user interface, which can be downloaded for free from [143]. To connect the board to a PC, Texas Instruments' USB-to-GPIO adapter (TI part number EVM USB-TP-GPIO), shown in Figure C.2, is needed.

The steps to connect and measure system metrics using TI Fusion software are:

1. Download and install the TI Fusion Digital Power Designer from [143].

2. Connect the TI USB adapter to the PMBus connector (J5) on the VC707 board as shown in Figure C.3, and connect its USB side to a Windows-based computer.

3. Ensure the VC707 board is powered ON and run the TI Fusion software.


Figure C.2: Texas Instrument USB to GPIO adapter

Figure C.3: Xilinx VC707 board; the TI adapter should be connected to the highlighted PMBus connector (J5)


4. If this is the first time running the TI Fusion software, an informative window about the software modes will be shown; press OK, as shown in Figure C.4.

Figure C.4: Window showing Fusion software modes, press OK

5. A window showing the device scanning modes will pop up; click the first link (UCD Controllers and Sequencers, Isolated Controller) as shown in Figure C.5. The software will scan for compatible power controllers connected to the PMBus, and it should then detect three controllers at addresses 52, 53, and 54.

6. The main window of the TI Fusion software will show next; click on Monitor (bottom left) and select the device and rail to be monitored (top right) as shown in Figure C.6. Table C.1 shows the association between power rails and power sources for each controller.


Figure C.5: Select Device, click the first link


Figure C.6: Fusion Monitor view (measurements can be observed). The monitor section on the top left shows or hides the real-time graphs; the device and rail can be changed from the top right corner.


7. A user can select/deselect the metrics to be monitored (input current, output currents, voltage, and temperature) from the top left panel, as shown in Figure C.6 and Figure C.7.

Figure C.7: Fusion Monitor view; the rail dashboard shows real-time readings of all metrics, highlighted on the top left side.

8. The active power rail/device under measurement can be changed from the top-right drop-down list as shown in Figure C.8. Sometimes the TI Fusion software resets the view to the Configure view following a rail change; simply change it back to the Monitor view (bottom left) as shown in Figure C.7.

9. In order to view real-time measurements of all rails for all devices, press the System Dashboard button as shown in Figure C.7 (the Monitor view should be active); a new window will pop up as shown in Figure C.9.


Figure C.8: Select the rail and device to be monitored.

Figure C.9: Fusion Monitor view; the rail dashboard shows real-time readings of all metrics, highlighted on the top left side.


The software is self-explanatory, and the steps described above are sufficient to monitor the core power sources supplying the Virtex-7 FPGA on the VC707 board.