
KTH

Royal Institute of Technology

School of Information and Communication Technology

Electronic Systems

Automatic Software Synthesis from

High-Level ForSyDe Models Targeting

Massively Parallel Processors

Master of Science Thesis in System-on-Chip Design

June 2013

TRITA–ICT–EX–2013:139

Author: George Ungureanu

Examiner: Assoc. Prof. Ingo Sander

Supervisors: Seyed Hosein Attarzadeh Niaki, Gabriel Hjort Blindell


Author: Ungureanu, George <[email protected]>

Title: Automatic Software Synthesis from High-Level ForSyDe Models Targeting Massively Parallel Processors
Thesis number: TRITA–ICT–EX–2013:139

Royal Institute of Technology (KTH)
School of Information and Communication Technology (ICT)
Research Unit: Electronic Systems (ES)
Forum 105
164 40 Kista
Sweden

Copyright © 2013 George Ungureanu
All rights reserved.

This work is licensed under the Creative Commons Attribution-NoDerivs (CC BY-ND) 3.0 License. A copy of the license is found at http://creativecommons.org/licenses/by-nd/3.0/

This document was typeset in LaTeX with kp-fonts as the font package. Most of the figures were produced with the TikZ package and the rest were drawn using Google Draw.

Document build: Monday 24th June, 2013, 10:26


Abstract

In the past decade we have witnessed an abrupt shift to parallel computing, subsequent to the increasing demand for performance and functionality that can no longer be satisfied by conventional paradigms. As a consequence, the abstraction gap between applications and the underlying hardware has increased, prompting both industry and academia to pursue several research directions.

This thesis project aims at analyzing some of these directions in order to offer a solution for bridging the abstraction gap between the description of a problem at a functional level and its implementation on a heterogeneous parallel platform, using ForSyDe, a formal design methodology. The report treats applications employing data-parallel and time-parallel computation, and regards nvidia CUDA-enabled GPGPUs as the main backend platform.

The report proposes a heuristic transformation-and-refinement process based on analysis methods and design decisions to automate and aid a correct-by-design backend code synthesis. Its purpose is to identify potential data parallelism and time parallelism in a high-level system. Furthermore, based on a basic platform model, the algorithm load-balances and maps the execution onto the best computation resources in an automated design flow. This design flow is embedded into an already existing tool, f2cc (ForSyDe-to-CUDA C), and tested for correctness on an industrial-scale image processing application aimed at monitoring inkjet print-head reliability.

Keywords: system design flow, high abstraction-level models, ForSyDe, GPGPU, CUDA, time-parallel, data-parallel



Acknowledgements

In the course of this thesis project several people have helped me accomplish my tasks and contributed in one way or another, and to them I am deeply grateful.

First of all, I would like to thank my supervisors, Hosein and Gabriel, for all their support. Without Hosein's scientific feedback my report would have been much less valuable, and without Gabriel's active involvement in the software tool's development and implementation my progress would have been even further delayed.

Secondly, I would like to thank my mentors, Werner Zapka and Ingo Sander, for investing so much time and trust in my personal and professional development. Without their "leap of faith" regarding my trustworthiness I would never have had the chance to be involved in such exciting projects and work in such an amazing environment.

Thirdly, I would like to thank my colleagues from XaarJet AB, who proved to be not only excellent professionals in their area of research, helping me develop insight into areas I could never have explored alone, but great friends as well. I am also grateful to my Master's Program colleagues Marcus Miculcak and Ekrem Altinel. The excellent collaboration between us led to great outcomes, and the ideas presented in this report mostly resulted from the interesting debates and discussions that I had with them.

And last, but not least, I would like to show my deepest gratitude to my wife, Ana Maria, who has valiantly put up with me during the grim time when I worked on this thesis. Her unconditional support, care and understanding kept me going morally, and helped me yield results even when the workload was too heavy. This thesis is undoubtedly dedicated to her...

George Ungureanu
Stockholm, June 2013


Contents

List of Figures
List of Tables
Listings
List of Abbreviations

1 Introduction
  1.1 Problem statement
  1.2 Motivation
  1.3 Objectives
  1.4 Document overview
    1.4.1 Part I
    1.4.2 Part II
    1.4.3 Part III
    1.4.4 Part IV

I Understanding the Problem

2 ForSyDe
  2.1 Introduction
  2.2 The modeling framework
  2.3 System modeling in ForSyDe-SystemC
    2.3.1 Signals
    2.3.2 Processes
    2.3.3 Testbenches
    2.3.4 Intermediate XML representation

3 Understanding Parallelism
  3.1 Parallelism in the many-core era
  3.2 A theoretical framework for parallelism
    3.2.1 Kleene's partial recursive functions
    3.2.2 A functional taxonomy for parallel computation
  3.3 Parallel applications: the 13 Dwarfs
  3.4 Berkeley's view: design methodology
    3.4.1 Application point of view
    3.4.2 Software point of view
    3.4.3 Hardware point of view

4 GPGPUs and General Programming with CUDA
  4.1 Brief introduction to GPGPUs
  4.2 GPGPU architecture
  4.3 General programming with CUDA
  4.4 CUDA streams

5 The f2cc Tool
  5.1 f2cc features
  5.2 f2cc architecture
  5.3 Alternatives to f2cc
    5.3.1 SkelCL
    5.3.2 SkePU
    5.3.3 Thrust
    5.3.4 Obsidian

6 Challenges

II Development and Implementation

7 The Component Framework
  7.1 The ForSyDe model
    7.1.1 f2cc approach
    7.1.2 Model limitations and future improvements
  7.2 The intermediate model representation
    7.2.1 f2cc approach
    7.2.2 Limitations and future improvements
  7.3 The process function code
    7.3.1 f2cc approach
    7.3.2 Future improvements
  7.4 The GPGPU platform model
    7.4.1 Computation costs
    7.4.2 Communication costs
    7.4.3 Future improvements

8 Design Flow and Algorithms
  8.1 Model modifier algorithms
    8.1.1 Identifying data-parallel processes
    8.1.2 Optimizing platform mapping
    8.1.3 Load balancing the process network
    8.1.4 Pipelined model generation
    8.1.5 Future development
  8.2 Synthesizer algorithms
    8.2.1 Generating sequential code
    8.2.2 Scheduling and generating CUDA code
    8.2.3 Future development

9 Component Implementation
  9.1 The ForSyDe model architecture
  9.2 Module interconnection
  9.3 Component execution flow

III Final Remarks

10 Component Evaluation and Limitations
  10.1 Evaluation
  10.2 Limitations and future work

11 Concluding Remarks

IV Appendices

A Component documentation
  A.1 Building
  A.2 Preparations
  A.3 Running the tool
  A.4 Maintenance

B Proposing a ForSyDe Design Toolbox
  B.1 A simple explanatory example
  B.2 Layered refinements & Refinement loop
  B.3 Proposed architecture for the design flow tool

C Demonstrations

Bibliography


List of Figures

2.1 ForSyDe process network
2.2 ForSyDe process constructor
2.3 ForSyDe MoCs

3.1 Kleene's composition rule
3.2 Kleene's basic forms of composition
3.3 Kleene's primitive recursiveness and minimization

4.1 nvidia CUDA architecture
4.2 nvidia CUDA thread division
4.3 nvidia CUDA streams

5.1 f2cc identification pattern
5.2 f2cc component connections
5.3 f2cc internal model
5.4 Obsidian program pattern

7.1 f2cc v0.1 internal model
7.2 f2cc v0.2 internal model
7.3 The ParallelComposite process
7.4 f2cc cross-hierarchy connections
7.5 Generating variable declaration code
7.6 CFunction structure in f2cc v0.1
7.7 CFunction structure in f2cc v0.2
7.8 Extracting variable information

8.1 Grouping potentially parallel processes
8.2 Building individual data paths
8.3 Loop unrolling
8.4 Modeling streamed execution

9.1 f2cc v0.2 internal model architecture
9.2 f2cc execution flow

B.1 Simple model after analysis
B.2 Simple model after refinements
B.3 Model analysis aspects
B.4 Hierarchical separation of transformation layers
B.5 Refinement loop

C.1 Demo example: input model
C.2 Demo example: model after flattening
C.3 Model after grouping equivalent comb processes
C.4 Model after grouping potentially parallel leaf processes
C.5 Model after removing redundant zipx and unzipx processes
C.6 Model after platform optimization
C.7 Model after load balancing
C.8 Model after creating pipeline directives


List of Tables

3.1 The 13 Dwarfs of parallel computation
8.1 Assigning data bursts to streams
10.1 The component's status at the time of writing the report


Listings

2.1 ForSyDe-SystemC signal definition
2.2 ForSyDe-SystemC leaf process function definition
2.3 ForSyDe-SystemC composite process declaration
2.4 ForSyDe-SystemC testbench
2.5 ForSyDe-SystemC introspection
2.6 ForSyDe-SystemC intermediate XML format
4.1 Matrix multiplication in C
4.2 Matrix multiplication in CUDA - Host
4.3 Matrix multiplication in CUDA - Device
4.4 Concurrency in CUDA
5.1 SkelCL syntax example
5.2 SkePU function macros
5.3 SkePU syntax example
5.4 Thrust syntax example
5.5 Obsidian function declaration
5.6 Obsidian definition for pure and sync
7.1 GraphML port
7.2 XML port
7.3 Static type name declaration
7.4 Result of static type name declaration
7.5 Algorithm for parsing ForSyDe function code
8.1 Algorithm for identifying data parallel sections
8.2 Methods used by data parallel sections identification algorithm
8.3 Method used by data parallel sections identification algorithm
8.4 Proposed algorithm for identifying data parallel sections
8.5 Algorithm for platform optimization
8.6 Algorithm for load balancing
8.7 Method for data paths extraction, used by load balancing algorithm
8.8 Method for extracting and sorting contained sections
8.9 Method for splitting the process network into pipeline stages
8.10 Algorithm for code synthesis
8.11 Method for generating sequential code for composite processes
8.12 Top level for method for generating CUDA code
A.1 Platform model template
C.1 ForSyDe process function code
C.2 Extracted C code
C.3 Excerpt from the f2cc output logger
C.4 Sample sequential code: composite process execution wrapper
C.5 Sample parallel code: top level execution code
C.6 Sample parallel code: kernel function wrapper


List of Abbreviations

3D       three-dimensional
ANSI     American National Standards Institute
API      Application Program Interface
AST      Abstract Syntax Tree
CPU      Central Processing Unit
CT       Continuous Time (MoC)
CUDA     Compute Unified Device Architecture
DE       Discrete Event (MoC)
DI       Domain Interface
DRAM     Dynamic Random-Access Memory
DSL      Domain Specific Language
DUT      Design Under Test
EDSL     Embedded Domain Specific Language
ESL      Electronic System Level
f2cc     ForSyDe to CUDA C
ForSyDe  Formal System Design
GPGPU    General Purpose Graphical Processing Unit
GPU      Graphical Processing Unit
GraphML  Graph Markup Language
GUI      Graphical User Interface
HDL      Hardware Description Language
ILP      Instruction Level Parallelism
IP       Intellectual Property
ITRS     International Technology Roadmap for Semiconductors
MIMD     Multiple Instruction Multiple Data
MoC      Model of Computation
OS       Operating System
POM      Project Object Model
RTTI     Run-Time Type Information
SDF      Synchronous Data Flow (MoC)
SDK      Software Development Kit
SIMD     Single Instruction Multiple Data
SIMT     Single Instruction Multiple Thread
SM       Streaming Multiprocessor
SP       Streaming Processor
STL      Standard Template Library
SY       Synchronous (MoC)
UT       Untimed (MoC)
XML      Extensible Markup Language


Chapter 1: Introduction

This chapter will present the problem that will be approached throughout this thesis. The problem will be stated prior to a brief motivation for this project in the current industrial context. Afterwards, a set of overarching goals will be enumerated, followed by an overview of this report.

1.1 Problem statement

The current project aims at tackling the problem of mapping intensive parallel computation onto platforms with resource support for data- and time-parallel computation, with special consideration for the leading many-core platform in industry, the General Purpose Graphical Processing Unit (GPGPU) [Kirk and Hwu, 2010]. As the design language for describing systems at a high level of abstraction, ForSyDe [Sander and Jantsch, 2004] will be used. ForSyDe is associated with a formal high-level system design methodology that raises the abstraction level in designing real-time embedded systems in order to aid the mapping onto complex heterogeneous platforms through techniques like design space exploration, semantic-preserving transformations, refinement-through-replacement, etc.

The first problem that has to be treated is analyzing whether or not ForSyDe supports the description of parallel computation in harmony with the existing MoC-based framework. In this sense, a deep understanding of parallelism and its principles is necessary. The two main terms introduced in the current contribution, data parallelism and time parallelism, are presented in the context of parallel computation in Chapter 3.

The second problem that this project must attend to is the implementation of a mapping algorithm from a parallel ForSyDe model to a GPGPU backend. In order to do so, an existing tool called f2cc [Hjort Blindell, 2012]¹ has to be extended to support both the new ForSyDe-SystemC features and the new data-parallel and time-parallel models.

¹ ForSyDe to CUDA C (f2cc) was developed and implemented by Gabriel Hjort Blindell as part of his Master's Thesis in 2012.


The third and final problem treated by this thesis is the validation of the resulting software component against an industrial-scale application provided by XaarJet AB, a printing-oriented company.

1.2 Motivation

In the past decade we witnessed a dramatic shift of computation paradigms into the parallel domain, hence the dawn of the "many-core era". This shift was not so much a result of great innovation as a necessity to cope with the increasing demands for performance and functionality. This is compounded by the increasing complexity of both platforms and applications, which cannot be handled by traditional design methods anymore.

Faced with the "parallel problem", both industry and academia came up with a number of solutions, which will be presented further in Chapter 3, Chapter 4 and Section 5.3. The main issue is that most of these solutions are not built upon a commonly agreed formal basis which could constitute a theoretical foundation for the parallel paradigm, just like Turing's model was the foundation for the sequential paradigm. Furthermore, they represent points of view that are dispersed among research groups which try to mold the paradigms to their desired goals (productivity, backward-compatibility, verification, etc.).

Most of the aforementioned solutions treat many-core parallel platforms as means of high-throughput computation. Strangely, one point of view has been ignored until now, especially by high-level programming models: treating many-cores as complex, heterogeneous and analyzable platforms.

Hence, we invoke ForSyDe as a methodology to address these issues. Due to its inherent formalism, the complexity problem can be properly handled, enabling correct-by-design implementation solutions. Furthermore, the Model-of-Computation-based formalism [Sander and Jantsch, 2004] is a natural framework for expressing parallel computation; consequently, it offers a good environment for a foundation for parallelism. The design flows associated with the ForSyDe methodology are based on analysis, design space exploration and semantic-preserving transformations, providing means to take advantage of architectural traits that are hard to exploit otherwise.

The platform chosen for analysis is the GPGPU, since it is the most widely used many-core platform in industry. GPGPUs are notoriously difficult to program and verify due to their low-level style of programming based on a sequential model. One application that requires high-throughput computation on a parallel platform, and whose development stagnated due to these issues, is provided by XaarJet AB. This application will be implemented in ForSyDe and given as an example for testing the current project.

1.3 Objectives

The main goal of this project is to investigate and offer proper solutions to the problems stated in Section 1.1. This task has been split into the following set of sub-goals:

1. Extensive literature study. This will include:

• ForSyDe tutorials and research papers;


• The f2cc architecture, tool API, implementation, and the thesis report;

• Relevant material related to parallel computation;

• GPGPUs, their architecture and programming model;

• Alternatives to f2cc;

2. Devise a plan for expanding f2cc’s functionality.

3. Expand f2cc with new features provided by the ForSyDe-SystemC modeling framework, and implement a new frontend.

4. Implement a pipelined CUDA code synthesis flow for f2cc.

5. Provide high quality code documentation.

6. Evaluate the improved f2cc tool with an image processing application provided by XaarJet AB.

As an optional goal, the analysis and proposal of a generic development flow tool for ForSyDe will be presented. This tool should easily embed the implemented synthesis flow but keep a high degree of flexibility in order to enable any type of flow. As it is beyond the scope of an M.Sc. thesis to implement such a fully generic tool, only proposals for future research will be delivered. Another relevant optional goal is the implementation of other types of parallelism, if time permits.

1.4 Document overview

The document is divided into four parts. The first part includes the background study performed in order to understand the full scale of the problem which we will encounter. The second part presents the individual steps of implementing the component. The third part closes the report with some concluding remarks based on evaluation results. The fourth part contains supplementary material that could not be included in the report body. The following sections aim to offer a reading guide for the current report.

1.4.1 Part I

The first part of the document digs into the theoretical problem and tries to analyse it from different perspectives. Its purpose is to provide the reader with enough knowledge to understand the full scale of the problem and the challenges that will arise during the component implementation.

Chapter 2 briefly introduces the reader to the ForSyDe methodology and the ForSyDe-SystemC design framework. It presents the basic concepts and usage, focusing on the structures used in this project, and it points to further related material. The reader may skip this chapter provided he or she possesses previous knowledge of ForSyDe.

Chapter 3 paves the way into the current industrial and academic context of dealing with parallelism. This chapter is of pure scientific interest, and it does not contain information directly related to the project's implementation work. Its purpose is to dig into the heart of the problem at hand at a high level, and to analyze it from different perspectives. A reader interested only in the project's methodology may skip this chapter, keeping in mind that a few theoretical notions defined in these sections are referenced later. Still, a future ForSyDe developer is encouraged to read the provided material, since it offers valuable insight into the problems she or he might encounter.

Chapter 4 introduces the reader to the basic concepts of GPGPUs, as they are the main target platform. Material for further reading is referenced, and the basic usage of threads and streams is shown. This chapter constitutes the background for the implementation of the software component's synthesizer module².

Chapter 5 briefly presents the current component that has to be improved, f2cc. It also analyses four alternatives to f2cc for synthesis flows targeting GPGPUs, available in the research community. The reader is strongly encouraged to read this chapter in order to understand the content in Part II.

Chapter 6, the final chapter belonging to this part, lists the main challenges that were identified throughout Part I as needing to be treated by this project's work, and prioritizes them.

1.4.2 Part II

The second part concerns the development of the software component. An in-depth analysis of the software architecture and the algorithms used is presented. The part closes by putting together the previously presented components, in order to depict the proposed design flow.

Chapter 7 introduces the reader to the main development framework that had to be improved in order both to deliver the desired goals and to embed this project's design flow into the available design flow. The main features are presented in order to convey the magnitude of the work effort. Apart from the design decisions made, a comprehensive list of future improvements is proposed for a potential developer.

Chapter 8 presents the main theory and concepts behind the above-mentioned software tool. Its algorithms are both analyzed for scalability and provided with either optimized alternatives or proposals for future development, since they deal with still young and largely unexplored issues.

Finally, Chapter 9 binds together the previous two chapters, briefly presents the component's main implementation features, and plots its execution flow.

1.4.3 Part III

The third part closes this report. An evaluation of the current state of the project is offered with regard to the initial goals, along with a list of proposals for future development. Chapter 10 evaluates the current state of the software component while listing and prioritizing proposals for future development, as they emerge from an overview of Part II. Chapter 11 concludes the M.Sc. project and gives a verdict with respect to the delivered versus the initially proposed goals.

² See Section 5.2.


1.4.4 Part IV

The fourth and last part contains the appendices. Appendix A contains documentation of the software component, including installation, preparation, usage and maintenance tips. In Appendix B, a few of the author's personal reflections concerning the future development of ForSyDe with regard to a ForSyDe Development Toolkit are presented. Appendix C holds samples that demonstrate the software component's usage and the results output by its intermediate steps. As an example, a core part of the Linescan application has been used.


Part I: Understanding the Problem


Chapter 2: ForSyDe

This chapter will briefly present ForSyDe (Formal System Design), a system design methodology that starts from a high-level formal description. The first section will introduce ForSyDe in the system design environment. The second section will provide a brief overview of the modeling framework, while the third section will show an example of how to model systems using the SystemC implementation. It is out of the scope of this report to provide fully comprehensive documentation, which is why the reader is encouraged to consult related documents like [Sander and Jantsch, 1999, Sander, 2003, Sander and Jantsch, 2004, Attarzadeh Niaki et al., 2012, Jakobsen et al., 2011] or tutorials [ForSyDe, 2013].

2.1 Introduction

Keutzer et al. state that "in order to be effective, a design methodology that addresses complex systems must start at high levels of abstraction" [Keutzer et al., 2000]. ForSyDe is one such methodology for Electronic System Level (ESL) design that "raises the abstraction level of the design entry to cope with the increased complexity of (embedded systems)" [Attarzadeh Niaki et al., 2012].

ForSyDe's main objective is to "move the design refinement from the implementation to the functional domain" [Sander and Jantsch, 2004], by capturing a design's functionality inside a specification model. Thus, the designer works on this model, which hides the implementation details, and is able to "focus on what a system is supposed to do rather than how" [Hjort Blindell, 2012].

Working at a high level of abstraction has two main advantages. Firstly, it enables the designer to have an overview of the system and of the data flow. Secondly, it aids the identification of optimizations and of opportunities for better design decisions. One example, which will be extensively treated in this report, is the identification and exploitation of parallel patterns in algorithms described in ForSyDe. These patterns could not be exploited as naturally at compiler level, since the full context for the cause and effect in the execution is lost.

Another key feature of ForSyDe is design transformation and refinement. By applying semantic-preserving transformations to the high-level model, and by gradually refining it with backend-relevant information, one can achieve a correct implementation through a transparent process optimized for synthesis. By combining refinement with analysis at each design stage, the designer is able to reach an optimal implementation solution for the given problem.

Perhaps the most important feature of ForSyDe is its formalism. Practically, the design starts from a formal model that expresses functionality. This aids in developing a correct-by-design system that can be both tested and validated at all levels of refinement, an especially difficult task to achieve without a formal basis. Also, the computational and structural features can be captured and analyzed formally, dismissing all ambiguities. This eliminates, at least theoretically, the need for post-verification and debugging, which is often the most expensive stage of a product realization process.

2.2 The modeling framework

The following subsection is based on material found in [Attarzadeh Niaki et al., 2012] and [Lee and Sangiovanni-Vincentelli, 1997].

To understand the mechanisms behind ForSyDe one should have a clear picture of its modeling framework, which determines its formal basis. In the following paragraphs, the basic concepts will be explained.

Structure

The system model is structured as a concurrent hierarchical process network. The components of a process network are processes and domain interfaces, connected through signals, as shown in Figure 2.1. The processes are triggered and synchronized only through signals, and the functions encapsulated by them are side-effect free.

Hierarchy can be achieved through composite processes. They are formed by composing either leaf processes (like p1 . . . p5 in Figure 2.1), or other composite processes.

Models of Computation

The Models of Computation (MoCs) describe the semantics of concurrency and computation of the processes in the system. Each process belongs to a MoC which explicitly describes its timing behavior. Currently the ForSyDe-SystemC framework supports four MoCs [ForSyDe, 2013], but more are being researched and developed. The supported MoCs are:

• The Synchronous Data Flow MoC (SDF), which is a variant of the Untimed MoC (UT), where there is no explicit time description. In SDF the synchronization is done by passing tokens through signals, and it is suitable for describing analyzable streaming applications.

• The Synchronous MoC (SY), where it is assumed that neither computation nor communication takes time. It is suitable for describing digital systems or control systems, where the design model ignores timing details by implying synchronization with a master event.


[Figure 2.1: ForSyDe process network. Leaf processes p1 . . . p5 and domain interfaces di1 and di2 are connected through signals, forming composite process1 (MoC A) and composite process2 (MoC B).]

• The Discrete Event MoC (DE), where a time quantum is defined. It is suitable for describing test bench systems and modeling the environment.

• The Continuous Time MoC (CT), which describes physical time. It is suitable for modeling analog components and physical processes.

Process Constructors

The process constructors enforce formal restrictions upon the design, to ensure analyzability and an organized structure.

In order to create a leaf process in the model, the designer must choose a process constructor from the defined ForSyDe library. A process constructor takes side-effect-free functions and values as arguments and creates a process. The formal computation and communication semantics are embedded in the model based on the chosen constructor.

[Figure 2.2: Example of creating a Mealy process using a ForSyDe process constructor: the mealy process constructor takes the functions f and g and the value v as arguments and produces a Mealy process. Source: adapted from [ForSyDe, 2013]]

Figure 2.2 illustrates the concept of a process constructor by creating a process that implements a Mealy finite-state machine within the SY MoC. The process constructor defines the Model of Computation, the type of the process (finite-state machine), and the process interface. The functionality of the process is defined by a function f that specifies the calculation of the next state, another function g that specifies the calculation of the output, and a value v that specifies the initial value of the process.
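To make Figure 2.2 concrete, the following is a minimal sketch of how such a Mealy process could be created in ForSyDe-SystemC. It assumes a make_mealy helper analogous to the make_comb2 and make_delay helpers used later in Listing 2.3; the helper name and argument order are assumptions that should be checked against the API documentation in [ForSyDe, 2013].

#include <forsyde.hpp>

using namespace ForSyDe::SY;

// Next-state function f: computes the new state from the current state
// and the input, both wrapped as absent-extended values (see Listing 2.2).
void f_func(abst_ext<int>& next_st,
            const abst_ext<int>& st, const abst_ext<int>& in) {
    next_st = abst_ext<int>(st.from_abst_ext(0) + in.from_abst_ext(0));
}

// Output function g: computes the output from the current state and the input.
void g_func(abst_ext<int>& out,
            const abst_ext<int>& st, const abst_ext<int>& in) {
    out = abst_ext<int>(st.from_abst_ext(0) * in.from_abst_ext(0));
}

// Inside an SC_CTOR, with in_sig and out_sig of type SY2SY<int>, f_func,
// g_func and abst_ext<int>(0) would correspond to f, g and v in Figure 2.2.
// The call below is hypothetical and its exact signature may differ:
//
//     make_mealy("mealy1", f_func, g_func, abst_ext<int>(0), out_sig, in_sig);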

Domain Interfaces and Wrappers

The domain interfaces (DIs) are used to connect processes belonging to different MoCs. For MoCs of close abstraction levels, predefined DIs exist, as suggested by the bold lines in Figure 2.3. Other DIs (the dotted lines) are derived by composing existing DIs.

[Figure 2.3: ForSyDe MoCs (SDF, SY, DE, CT) and their DIs]

The wrappers are special processes which behave similarly to other processes, but which embed external models. They communicate their inputs/outputs to external simulators to co-simulate the model and assure system validation even if not all components are implemented in the ForSyDe framework. It is outside this thesis' scope to study the effects of and offer solutions for DIs and wrappers.

The Synchronous Model of Computation

This report will mainly focus on the SY MoC, since it is the only MoC considered in the design flow associated with this project's software component. The SY MoC describes a timed concurrent system, implying that its events are globally ordered. This means that any two distinct events are either synchronous (they happen at the same moment, and are associated with the same tag) or one unambiguously precedes the other [Lee and Sangiovanni-Vincentelli, 1997].

Two signals can be considered synchronous if all events in one signal are synchronous with the events from the other signal and vice-versa. A system is synchronous if every signal in the system is synchronous to every other signal in the system.
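In the tagged-signal notation of [Lee and Sangiovanni-Vincentelli, 1997], where an event e is a pair of a tag t and a value v and a signal s is a set of events, this condition can be sketched as follows (the formulation is added here for clarity and is not quoted from that paper):

\[
\mathrm{tags}(s) = \{\, t \mid (t,v) \in s \,\}, \qquad
s_1 \text{ synchronous with } s_2 \iff \mathrm{tags}(s_1) = \mathrm{tags}(s_2)
\]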

Apart from ForSyDe, there are several languages that describe synchronicity, such as Lustre [Halbwachs et al., 1991], Esterel [Berry and Gonthier, 1992] or Argos [Maraninchi, 1991]. These languages describe events tagged as present (⊤) or absent (⊥). A key property is that the order of these event tags is absolute and unambiguous.

2.3 System modeling in ForSyDe-SystemC

The following section is based on material found in [ForSyDe, 2013]. This report assumes that the reader is familiar with programming in C++, understanding XML, and using the SystemC platform. For a comprehensive SystemC tutorial, the reader is encouraged to consult [ASIC World, 2013].

ForSyDe-SystemC is an implementation of the ForSyDe design framework using the SystemC kernel. SystemC is a template library for C++, an object-oriented language, with the main purpose of co-simulating and validating hardware-software systems at a high level of abstraction. Many elements of the SystemC language are not allowed to be used in ForSyDe-SystemC, and the ones which are used may appear in a different terminological context, to enforce formalism.


All ForSyDe-SystemC elements are implemented as classes inside the ForSyDe namespace. Each element belongs to a MoC, which is in fact a sub-namespace of the ForSyDe namespace. For example, ForSyDe::SY holds all elements (processes, signals, DIs) belonging to the SY MoC.

2.3.1 Signals

Signals are bound to an input or an output of a ForSyDe process. They are typed and can be defined as belonging to a MoC by using their associated class from the respective MoC namespace. For the SY MoC there is a helper (template) class abst_ext<T> which is used to represent absent-extended values. Absent-extended values can be either absent or present with a value of type T. Listing 2.1 defines a signal of the SY MoC called my_sig which carries tokens of type abst_ext<double>.

ForSyDe::SY::SY2SY<double> my_sig;

Listing 2.1: ForSyDe-SystemC signal definition
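As a small illustration of how absent-extended values behave, the sketch below reads a token with a default fallback. It relies only on the from_abst_ext method as it is used in Listing 2.2, where it yields the carried value for a present token and the supplied default for an absent one.

#include <forsyde.hpp>

using namespace ForSyDe;

// Returns the token's int value if the token is present,
// and the default value 0 if the token is absent.
int token_or_zero(const abst_ext<int>& token) {
    return token.from_abst_ext(0);
}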

2.3.2 Processes

Leaf processes are created using process constructors. Process constructors are templates provided by the library that are parameterized in order to create a process. The parameters to a process constructor can be initial values (e.g., initial states) or functions. From the C++ point of view, creating a leaf process out of a process constructor is equivalent to instantiating a C++ class and passing the required parameters to its constructor.

void mul_func(abst_ext<int>& out1,
              const abst_ext<int>& a, const abst_ext<int>& b){
    int inp1 = a.from_abst_ext(0);
    int inp2 = b.from_abst_ext(0);

#pragma ForSyDe begin mul_func
    out1 = inp1 * inp2;
#pragma ForSyDe end
}

Listing 2.2: ForSyDe-SystemC leaf process function definition

Listing 2.2 shows an example of defining a process constructor's associated function. It looks like a regular C++ function definition, but there are a few particularities that have to be taken into account:

• The function header contains the function name and the function parameters, in the order defined in the API (please consult the API documentation in [ForSyDe, 2013]). In this example, the function has two inputs, which have to be declared const, and one output.

• The function body, where one can identify two separate parts: the computation part, between the pragmas, which holds the C function that can be analysed or further mapped to a platform; and the protocol part, outside the pragmas, with the sole purpose of wrapping/unwrapping the data variables into/from functional capsules like abst_ext<T>.


A composite process is the result of instantiating other processes and wiring them together using signals. A set of rules should be respected in order to benefit from ForSyDe features such as formal analysis, composability, etc. Otherwise, the system can still be simulated using the SystemC kernel, but will not be able to follow a design flow. These rules are [ForSyDe, 2013]:

• A composite process is in fact a SystemC module derived from the sc_module class.

• A composite process is the result of instantiation and interconnection of other valid ForSyDe processes; no ad-hoc SystemC processes or modules are allowed.

• Ports of all child processes in a composite process are connected together using signals of SystemC channel type ForSyDe::[MoC]::[signal] (for example ForSyDe::SY::SY2SY).

• A composite process includes zero or more inputs and outputs of SystemC port types [MoC]_in and [MoC]_out (for example SY_in and SY_out).

• If an input port of a composite process should be connected to several child processes, an additional fanout process (i.e., ForSyDe::SY::fanout) is needed in between.

 1  #ifndef MULACC_HPP
 2  #define MULACC_HPP
 3
 4  #include <forsyde.hpp>
 5  #include "mul.hpp"
 6  #include "add.hpp"
 7
 8  using namespace ForSyDe::SY;
 9
10  SC_MODULE(mulacc){
11      SY_in<int> a, b;
12      SY_out<int> result;
13
14      SY2SY<int> addi1, addi2, acci;
15
16      SC_CTOR(mulacc){
17          make_comb2("mul1", mul_func, addi1, a, b);
18
19          auto add1 = make_comb2("add1", add_func, acci, addi1, addi2);
20          add1->oport1(result);
21
22          make_delay("accum", abst_ext<int>(0), addi2, acci);
23      }
24  };
25
26  #endif

Listing 2.3: ForSyDe-SystemC composite process declaration

With this knowledge in mind, the code in Listing 2.3 can be explained. The file includes the forsyde header library and the functions referenced by the process constructors. The SystemC constructs SC_MODULE and SC_CTOR are used to declare the composite process called mulacc. The leaf processes can be declared either in SystemC fashion, by connecting channels (signals) to ports, or by using helper functions like make_comb2 (line 17).
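For comparison, the "SystemC fashion" alternative to line 17 would instantiate the underlying process-constructor class and bind its ports by hand. The fragment below is a sketch: the class name comb2 and the input port names iport1 and iport2 are assumed from the ForSyDe-SystemC API (only oport1 actually appears in Listing 2.3) and should be verified against [ForSyDe, 2013].

// Hypothetical equivalent of line 17, inside SC_CTOR(mulacc):
// instantiate the comb2 process-constructor class directly...
auto mul1 = new comb2<int, int, int>("mul1", mul_func);
// ...and connect the signals to its ports explicitly.
mul1->iport1(a);
mul1->iport2(b);
mul1->oport1(addi1);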


SC_MODULE(top){
    SY2SY<int> srca, srcb, result;

    SC_CTOR(top){
        make_constant("constant1", abst_ext<int>(3), 10, srca);

        make_source("siggen1", s_func, abst_ext<int>(1), 10, srcb);

        auto mulacc1 = new mulacc("mulacc1");
        mulacc1->a(srca);
        mulacc1->b(srcb);
        mulacc1->result(result);

        make_sink("report1", report_func, result);
    }
};

Listing 2.4: ForSyDe-SystemC testbench

2.3.3 Testbenches

There are processes in each MoC that only produce/consume values and can be used for testing purposes. As seen in Listing 2.4, the testbench can be seen as a top module which connects the design under test (DUT, in this case the mulacc composite process) with these source/sink processes.
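To actually run such a testbench, a standard SystemC entry point is needed. The following minimal sketch (plain SystemC usage, not shown in the thesis) instantiates the top module from Listing 2.4 and starts the simulation kernel:

#include <forsyde.hpp>
// Assumes the SC_MODULE(top) declaration from Listing 2.4 is visible here.

int sc_main(int argc, char* argv[]) {
    top top1("top1");   // instantiate the testbench's top module
    sc_start();         // run the SystemC simulation
    return 0;
}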

2.3.4 Intermediate XML representation

ForSyDe's introspection feature enables it to extract structural information from the SystemC project files and encapsulate it in an XML format. The XML files represent an intermediate format that will further be fed to the system design flow, and they capture essential structural information. This information can be easily accessed, analyzed and modified (refined) by an automatic process. To enable introspection, one has to invoke the ForSyDe::XMLExport::traverse function to traverse the DUT's top module at the start of the simulation, and to compile the design with the macro FORSYDE_INTROSPECTION defined. Listing 2.5 shows the syntax to enable the introspection feature, while Listing 2.6 shows an example XML output.

#ifdef FORSYDE_INTROSPECTION
void start_of_simulation() {
    ForSyDe::XMLExport dumper("");
    dumper.traverse(this);
}
#endif

Listing 2.5: Enabling the introspection feature


<?xml version="1.0" ?>
<!-- Automatically generated by ForSyDe -->
<!DOCTYPE process_network SYSTEM "forsyde.dtd">
<process_network name="CombSubMul">
  <port name="port_0" type="int" direction="in" bound_process="sub1" bound_port="port_0"/>
  <port name="port_1" type="int" direction="in" bound_process="y_fanout" bound_port="port_0"/>
  <port name="port_2" type="int" direction="out" bound_process="mul1" bound_port="port_2"/>
  <signal name="fifo_0" moc="sy" type="int" source="sub1" source_port="port_2" target="mul1" target_port="port_0"/>
  <signal name="fifo_1" moc="sy" type="int" source="y_fanout" source_port="port_1" target="sub1" target_port="port_1"/>
  <signal name="fifo_2" moc="sy" type="int" source="y_fanout" source_port="port_1" target="mul1" target_port="port_1"/>
  <leaf_process name="y_fanout">
    <port name="port_0" type="int" direction="in"/>
    <port name="port_1" type="int" direction="out"/>
    <process_constructor name="fanout" moc="sy"/>
  </leaf_process>
  <composite_process name="sub1" component_name="sub">
    <port name="port_0" type="int" direction="in"/>
    <port name="port_1" type="int" direction="in"/>
    <port name="port_2" type="int" direction="out"/>
  </composite_process>
  <composite_process name="mul1" component_name="mul">
    <port name="port_0" type="int" direction="in"/>
    <port name="port_1" type="int" direction="in"/>
    <port name="port_2" type="int" direction="out"/>
  </composite_process>
</process_network>

Listing 2.6: Example of intermediate XML format


Chapter 3

Understanding Parallelism

This chapter aims at tackling the problem that has arisen due to the abrupt leap from the industry standard of single-processor sequential computation to many-core parallel computation. First, a short background will attempt to place the current problem which industry is facing in the context of the many-core era. The second section will propose a theoretical framework defining parallelism starting from Kleene's computational model. The third and fourth sections will describe Berkeley's view of the parallel problems and its views regarding the design of hardware and software systems. The fourth section will also compare Berkeley's proposed methodologies with ForSyDe, and we will argue why ForSyDe is a proper methodology for designing heterogeneous systems embedding massively parallel many-core processors.

3.1 Parallelism in the many-core era

Today industry is facing an abrupt shift to parallel computing, which it is not yet ready to fully embrace. Over the past decades, the main means of pushing the IT industry forward was either increasing the clock frequency or other innovations that were inefficient in terms of transistors and power but kept the sequential programming model (ILP, deep pipelining, cache systems, etc.) [Hennessy and Patterson, 2011].

During this time, there were several attempts to develop parallel computers, like MasPar [Blank, 1990], Kendall Square [Dunigan, 1992] or nCUBE [Hayes et al., 1986], but they failed due to the rapid development and increase of sequential performance. Indeed, compatibility with legacy programs, like C programs, was more valuable to industry than new innovations, and programmers accustomed to continuous improvement in sequential performance saw little need to explore parallelism.

However, during the last decade industry has reached its most important turning point by hitting the power limit which a chip is able to dissipate, called "the power wall" in [Hennessy and Patterson, 2011]. As the International Technology Roadmap for Semiconductors (ITRS) was "replotted" during these years [ITRS, 2005, ITRS, 2007, ITRS, 2011], one could see an increasing discrepancy between earlier clock rate predictions (15 GHz in 2010, judging by the 2005 predictions [ITRS, 2005]) and actual processors' sequential performance (currently Intel products are far below even the conservative 2007 predictions [ITRS, 2007]).

This is an understandable phenomenon, due to the sudden changes in conventional wisdoms that had to be accepted by the industry. A comprehensive list of old versus new conventional wisdoms can be found in [Asanovic et al., 2006]. Apart from the well-known power wall, memory wall and ILP wall which constitute "the brick wall", we can point out the tenth conventional wisdom pair. According to it, programmers can no longer wait for sequential performance increases instead of parallelizing their programs, since the wait for a faster sequential computer will be much longer.

Thus the current leap to parallelism is not based on a breakthrough in programming or architecture, but "(it) is actually a retreat from the more difficult task of building power-efficient, high-clock-rate, single-core chips" [Asanovic et al., 2009]. Indeed, the current solution for general computing is still replicating sequential processors into multi-cores, which has proven to work for a small number of cores (2 to 16) without drastic changes from the sequential paradigms and way of thinking. But this strategy is likely to face diminishing returns once the number of cores increases beyond 32 [Hennessy and Patterson, 2011], stepping into the many-core domain.

Apart from that, the more pessimistic predictions in [ITRS, 2011] show an increased discrepancy between the performance users require and the performance devices deliver. Faced with this new knowledge and the new eleventh conventional wisdom stating that "increasing parallelism is the primary way of increasing a processor's performance", industry must decide on adopting new paradigms and new functional models that maximise productivity in environments with thousands of cores. Asanovic et al. state that the only solution lies in the research community, and that "researchers [have to] meet the parallel challenge".

3.2 A theoretical framework for parallelism

As seen in Section 3.1, the difference between multi- and many-processors is qualitative rather than quantitative. While multi-processors could be regarded as multiple machines running sequentially, extended with scheduling constructs and with parallel execution mainly at program level, many-processors have a completely different foundation principle.

Sequential processors have a strong foundation in Turing's computational model, which led to the von Neumann machine. Although this model has lasted for more than half a century, it no longer expresses execution platforms naturally. Maliţa et al. say that "Turing's model cannot be used directly to found many-processors" [Maliţa and Ştefan, 2008]. Unfortunately, industry is conservative, and many of the available solutions are rather non-formal extensions of available topologies.

This drawback is compounded by the theoretical weakness of the new domain. Parallel computation is still in its infancy and does not have a theoretical framework of its own that is unanimously accepted by both computer scientists and industry. During the past few decades, several groups of researchers adopted Kleene's model of partial recursive functions [Kleene, 1936] as a computational model for parallelism [Papadimitriou, 2003, Beggs and Tucker, 2006, Chen et al., 1992]. The following subsections will define a formalism for parallel computation based on Kleene's model, as presented in [Maliţa et al., 2006, Maliţa and Ştefan, 2009, Maliţa and Ştefan, 2008, Ştefan, 2010].


3.2.1 Kleene’s partial recursive functions

The following subsection is based on material found in [Maliţa et al., 2006, Maliţa and Ştefan, 2009, Maliţa and Ştefan, 2008].

In the same year that Turing published his paper, Kleene published the partial recursive functions. He defines computation using basic functions (zero, increment, projection) and rules (composition, primitive recursiveness and minimization). The main rule is composition, and Figure 3.1 depicts a structure which computes Equation 3.1:

$$f(x_0, x_1, \ldots, x_{n-1}) = g(h_0(x_0, x_1, \ldots, x_{n-1}), \ldots, h_{m-1}(x_0, x_1, \ldots, x_{n-1})) \qquad (3.1)$$

where each $h_i$ represents an increment function and $g$ represents a projection function. Both the first level of computing and the second level are parallel; the only restriction is that $g$ cannot start the computation before all $h_i$ have finished.

Figure 3.1: The structure associated with the composition rule

A Universal Turing Machine is a sequential composition of functions (for example the $h_i$), thus the parallel aspect of the computation is lost. The Kleene processes, on the other hand, are inherently parallel: the $h_i$ functions are independent, and $g$ could be independent as well if it works in a pipelined fashion on different input data $x_0, x_1, \ldots, x_{n-1}$. Thus, Kleene's model is a natural starting point for a parallel computation model.

From the general form of composition expressed in Equation 3.1 one can derive several simplified forms describing the other rules:

• data-parallel composition, described by Equation 3.2, is a limit case of Equation 3.1, where $n = m$ and $g$ is the identity function.

$$f(x_0, x_1, \ldots, x_{n-1}) = [h_0(x_0), \ldots, h_{n-1}(x_{n-1})] \qquad (3.2)$$

• serial composition, described by Equation 3.3, is defined for $p$ applications of the composition with $m = 1$; the function is applied on an input stream $\langle x_0, x_1, \ldots, x_{n-1} \rangle$ in a pipelined fashion.

$$f(x) = k_{p-1}(k_{p-2}(\ldots k_0(x)) \ldots) \qquad (3.3)$$


• reduction composition, described by Equation 3.4, is a special case of Equation 3.1, where $h$ is the identity function and the input vector $[x_0, x_1, \ldots, x_{n-1}]$ is reduced to a scalar.

$$f(x_0, x_1, \ldots, x_{n-1}) = g(x_0, x_1, \ldots, x_{n-1}) = out \qquad (3.4)$$

Figure 3.2: The basic forms of composition: (a) parallel composition, (b) serial composition, (c) reduction composition

The composition rule is strong and natural enough to describe almost all types of data-intensive problems and applications, and can be associated with many implementations. The last two rules, primitive recursiveness and minimization, are harder to express in implementations and are less natural to associate with structural implementations.

Figure 3.3: Structure for two of Kleene's rules: (a) primitive recursion, (b) minimization

Primitive recursion is described by Equation 3.5 and by a structure like the one in Figure 3.3a. This structure is fully parallel since, apart from the serial composition, it supports speculation at each level of computation through a reduction network. The function has an initial value, described by block $H$, which feeds the infinite pipeline. The reduction network $R$ inputs an infinite vector of pairs {scalar, predicate}, corresponding to the predicated result of each stage. Thus the result will always be the scalar which is paired with the predicate having value 1.

$$f(x, y) = g(x, y, f(x, y-1)) \quad \text{where} \quad f(x, 0) = h(x) \qquad (3.5)$$


The minimization rule is described by Equation 3.6, and it computes the function $f(x)$ as the minimal value of $y$ for which $g(x,y) = 0$. As with the previous rule, the structure depicted in Figure 3.3b is an example of applying minimization while keeping the concept of ideal parallelism by using speculation. Each block $G$ computes the predicated value and returns a pair of the form $\{i, g(x,i) == 0\}$, and the reduction network $R$ extracts the first pair having the predicated value 1 (if any).

$$f(x) = \min_y [g(x, y) = 0] \qquad (3.6)$$

3.2.2 A functional taxonomy for parallel computation

While new mathematical models are emerging to describe parallelism, the huge diversity of solutions involved in actual implementations tends to make the classic computer taxonomies [Flynn, 1972, Xavier and Iyengar, 1998] obsolete.

One such taxonomy, introduced in [Flynn, 1972], describes parallel machines from a structural point of view, where parallelism is symmetrically described using a two-dimensional space: data × programs. Current parallel applications cannot fit into a single one of Flynn's categories (for example SIMD or MIMD), since they require more than one type of parallelism.

In [Maliţa and Ştefan, 2008] and [Ştefan, 2010] a new, more consistent functional taxonomy is proposed, starting from the way a function is computed, as presented in Subsection 3.2.1. Five types of parallel computation are emphasized:

• Data-parallel computation, as seen in the data-parallel composition. It is applied on vectors, and each component of the output vector results from the predicated execution of the same program.

• Time-parallel computation, as seen in the serial composition. It applies a pipe of functions on input streams and, according to [Hennessy and Patterson, 2011], it is efficient if the length of the stream is much greater than the pipe's length.

• Speculative-parallel computation, extracted as a functional approach for solving primitive recursion and minimization. This computation can be described by replacing Equation 3.1 with the limit cases in Equation 3.7. It usually applies the same variable to slightly different functions.

• Reduction-parallel computation, deduced from the reduction composition. Each vector component is treated equivalently with respect to the reduction function.

• Thread-parallel computation, which is not directly presented in Subsection 3.2.1 but can be deduced by replacing Equation 3.1 with the limit case in Equation 3.8. It also describes the temporal behaviour of interleaved threads.

$$h_i(x_1, \ldots, x_m) = h_i(x), \quad g(h_1(x), \ldots, h_m(x)) = \{h_1(x), \ldots, h_m(x)\} \qquad (3.7)$$

$$h_i(x_1, \ldots, x_m) = h_i(x_i), \quad g(h_1(x_1), \ldots, h_m(x_m)) = \{h_1(x_1), \ldots, h_m(x_m)\} \qquad (3.8)$$

Based on this taxonomy, the types of computation can be separated into two categories:

• complex computation, where parallelism is tightly interleaved, allowing efficient complex computations. It includes the thread-parallel computation.


• intensive computation, where parallelism is strongly segregated, allowing large-sized simple computations. It groups together the data-parallel, time-parallel, speculative-parallel and reduction-parallel computation.

[Maliţa et al., 2006] concludes that "any computation, if it is intensive, can be performedefficiently in parallel".

3.3 Parallel applications: the 13 Dwarfs

The parallel problem has been studied intensely in the last decade by numerous research groups focusing on multi- and many-core processing [Georgia Tech, 2008, Habanero, 2013, PPL, 2013, Illinois, 2013, Par Lab, 2013]. One of the groups involved in this research originates from the University of California, Berkeley, and consists of multidisciplinary researchers. In [Asanovic et al., 2006] and [Asanovic et al., 2009] they discuss an application-oriented approach that treats the parallel problem from different perspectives and at different layers of abstraction.

They motivated this approach by examining parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. They argue that "these two ends of the computing spectrum have more in common looking forward than they had in the past" [Asanovic et al., 2006]. By studying the success driven by parallelism for many of the applications, it is possible to synthesize feasible and correct solutions based on application requirements. Thus, the main approach is to "mine the parallelism experience" to get a broader view of the computation mechanisms.

Also, since parallelism is not yet clearly described by formal means, benchmarking programs cannot be used as measurements of innovations. Asanovic et al. argue that "there is a need to find a higher level of abstraction for reasoning about parallel application requirements" [Asanovic et al., 2006]. This point is valid, judging from the experience of successfully mapping high-performance scientific applications on embedded platforms.

Extending the work of Phil Colella [Colella, 2004], the research team grouped similar applications into thirteen "dwarfs": equivalence classes based on similarity of computation and data movement, derived by studying the programming patterns. They are presented in Table 3.1.

| # | Dwarf | Data | Comm. patterns | Description | Application examples | Hardware |
|---|-------|------|----------------|-------------|----------------------|----------|
| 1 | Dense Linear Algebra | dense matrices or vectors | memory strides | usually vector-vector, matrix-vector and matrix-matrix operations | Block Tridiagonal Matrix, Symmetric Gauss-Seidel | Vector computers, Array computers |
| 2 | Sparse Linear Algebra | compressed matrices | indexed loads/stores | data includes many zero values, compressed for low storage and bandwidth | Conjugate Gradient | Vector computers with gather/scatter |
| 3 | Spectral Methods | frequency domain | multiple butterfly patterns | combination of multiply-add operations and specific data permutations | Fourier Transform | DSPs, Zalink PDSP |
| 4 | N-Body Methods | discrete points | interaction between points | particle-particle methods – O(N²); hierarchical particles – O(N log N) or O(N) | Fast Multipole Method | GRAPE, MGRAPE |
| 5 | Structured Grids | regular grids | high spatial locality | grid may be subdivided into finer grids ("Adaptive Mesh Refinement"); transition between granularity may happen dynamically | Multi-Grid, Scalar Pentadiagonal, Hydrodynamics | QCDOC, BlueGeneL |
| 6 | Unstructured Grids | irregular grids | multiple levels of memory reference | location and connectivity determined from neighboring elements | Unstructured Adaptive | Tera Multi Threaded Architecture |
| 7 | MapReduce | – | not dominant | calculations depend on statistical results of repeated random trials; considered embarrassingly parallel | Monte Carlo, Ray tracer | NSF Teragrid |
| 8 | Combinatorial Logic | large amount of data | bit-level operations | simple operations on very large amounts of data, often exploiting bit-level parallelism | Encryption, Cyclic Redundancy Codes, IP NAT | Hardwired algorithms |
| 9 | Graph Traversal | nodes, objects | many lookups | algorithms involving many levels of indirection and small amount of computation | Route Lookup, XML parsing, collision detection | Sun Niagara |
| 10 | Dynamic Programming | – | – | solve simpler overlapping subproblems; used in optimization of problems with many feasible solutions | Viterbi Decode, variable elimination | Dyna |
| 11 | Backtrack and Branch + Bound | – | – | optimal solutions by dividing into subdomains and pruning suboptimal subproblems | Kernel Regression, Network Simplex Algorithm | – |
| 12 | Graphical Models | nodes | – | graphs where random variables are nodes and conditions are edges | Bayesian Network, Hidden Markov Models | – |
| 13 | Finite State Machines | states | transitions | behavior defined by states, transitions and events | PNG, JPEG, MPEG-4, TCP, compiler | – |

Table 3.1: The 13 Dwarfs of parallel computation

While the first twelve dwarfs show inherent parallelism, the parallelization of the thirteenth constitutes a challenge. The main reason is that it is difficult to split the computation into several parallel finite state machines. Although the Berkeley research group favors the exclusion of the thirteenth dwarf from the parallel paradigm, considering it "embarrassingly sequential", architectures like Revolver [Öberg and Ellervee, 1998], the Integral Parallel Architecture [Ştefan, 2010] or the BEAM [Codreanu and Hobincu, 2010] demonstrate that these problems can successfully be parallelized, being derived from the complex computation class from Subsection 3.2.2.

3.4 Berkeley’s view: design methodology

The following section is based on material found in [Asanovic et al., 2006, Asanovic et al., 2009].

As stated in the previous section, the multidisciplinary research team from Berkeley studied the parallel problem from a broad range of perspectives. Like ForSyDe, they support the idea of raising the level of abstraction for both programming and system design. The main difference, though, is that they adopt a programmer-friendly, application-oriented approach for the productivity layer rather than a formal starting point.


The following section will present Berkeley's view in comparison to the ForSyDe methodology, in order to merge these two different schools of thought into an even stronger conceptual foundation. In addition, we will try to demonstrate that ForSyDe is a proper methodology for approaching parallel computational problems as well, not only real-time embedded problems.

3.4.1 Application point of view

Section 3.3 pointed out the need to mine applications that demand more computing power and can absorb the increasing number of cores for the next decades, in order to provide concrete goals and metrics to evaluate progress.

For this matter, a number of applications are studied and developed based on different criteria: compelling in terms of marketing and social impact, short-term feasibility, longer-term potential, speed-up or efficiency requirements, platform coverage, potential to enable technology for other applications, and involvement in usage and evaluation of technology. Among these applications one could count music and hearing, speech understanding, content-based image retrieval, intraoperative risk assessment, parallel browsers, 3D graphics, etc.

Currently, ForSyDe has a suite of case studies originating in industrial or academic applications, and some of them, for example Linescan, focus on industrial control. In the future, its application span could broaden into other user-oriented areas (for example actor-based parallel browsers [Jones et al., 2009]) by studying and following other successful attempts such as Berkeley's. ForSyDe has a good profile for many of the applications in Table 3.1. Since parallelism is expressed inherently, it could fit well in the above classes of applications.

3.4.2 Software point of view

The Berkeley research group admits that developing a software methodology bridging the gap between users and the parallel IT industry is the most vexing challenge. One reason is the fact that many programmers are unable to understand parallel software. Another reason is that both compilers and operating systems have grown so large that they are resistant to changes. Also, it is not possible to properly measure improvement in parallel languages, since most of them are prototypes that reflect solely the researchers' point of view. Eight main ideas are presented, which will be analysed separately in the following paragraphs.

Idea #1: Architecting parallel software with design patterns, not just parallel programming languages. Since "automatic parallelism doesn't work" [Asanovic et al., 2009], Asanovic et al. propose to re-architect software through a "design pattern language", explored in earlier works such as [Alexander, 1977, Gamma et al., 1993, Buschmann et al., 2007]. The pattern language is a collection of related and interlocking patterns, constructed such that the patterns flow into each other as the designer solves a design problem. The computation and structural patterns can be composed to create more complex patterns. These are conceptual tools that help a programmer reason about a software project and develop an architecture, but are not themselves implementation mechanisms for producing code.

Unfortunately, this is one of the ideas that arouse disputes between researchers belonging to the two schools of thought studied in this report. While the Berkeley group openly disagrees with using formalism as a starting point in a design methodology, ForSyDe enforces formal restrictions at early design stages. A reason against using a formal model as a starting point is the fact that it is understood only by a narrow group of researchers, and that it limits expressiveness in designing solutions, at least for the uninitiated.

We argue that this is a common misconception amongst the research groups and that it has to be overcome in order to fully take advantage of both concepts. Case studies have shown that with structured thinking the formal constraints do not limit expressiveness. On the contrary, describing computation through processes and signals aids the designer in keeping a clear picture of the entire system.

It is simpler to mask MoC details under a design pattern than to assure the formal correctness of large pattern-based systems. By masking, the designer needs only minimal prior knowledge of the mathematical principles behind MoCs, while still being able to respect the formalism and fully take advantage of it. Thus, ForSyDe could easily be extended into a pattern framework, since it allows composable patterns of process networks. This subject is further treated in Section 5.3 and could be a relevant point of entry for future research.

Idea #2: Split productivity and efficiency layers, not just a single general-purpose layer. Productivity, efficiency and correctness are inextricably linked and must be treated together during the system design stages. They are not a single-point solution and must be defined in separate layers.

The productivity layer uses a common composition and coordination language to glue together the libraries and programming frameworks produced by the efficiency-layer programmers. The implementation details are abstracted at this layer. Customizations are made only at specified points and do not break the harmony of the design pattern.

The efficiency layer is very close to machine language, allowing the best possible algorithm to be written in the primitives of the layer. This is the working ground for specialist programmers trained in the details of parallel technology.

This concept is powerfully rooted in the parallel programming community. It explains the multitude of template libraries, DSLs and language extensions for specific parallel platforms (e.g. the ones presented in Section 5.3) that appeared during the last decade.

Although ForSyDe is not only a programming language but rather a system design methodology, it follows this conceptual pattern. While the ForSyDe-Haskell and ForSyDe-SystemC design frameworks can be associated with the productivity layer, the suite of tools for analysis, transformation, refinement and synthesis can be associated with the efficiency layer. The schema proposed in Appendix B extends this seemingly simple but powerful idea.

Idea #3: Generating code with search-based autotuners, not compilers. Since compilers have grown so large and are resistant to changes, one cannot rely on them to identify and optimise parallel applications. Instead, one useful lesson can be learned from autotuners. These are optimization tools that generate many variants of a kernel and measure each variant by running it on the target platform. Autotuners are built by efficiency-layer programmers.

ForSyDe replaces the idea of auto-tuning with design space exploration mechanisms. Since they involve off-line analysis of systems before synthesizing implementation solutions, these mechanisms have a much higher potential for productivity performance. Even so, ForSyDe's development should be aware of autotuner mechanisms, since it can benefit from hybrid synthesis methods. One best-practice example is narrowing down the design space, then running several analyses in parallel on virtual platforms with different configurations and choosing the best solution.

Idea #4: Synthesis with sketching. This idea encourages programmers to write "incomplete sketches" of programs, in which they provide an algorithmic skeleton and let the synthesizer fill in the holes in the sketch.

As presented in Chapter 2, this is one of ForSyDe's ground rules, described by the abstraction of design details. Moreover, in earlier ForSyDe publications, process constructors were referred to as skeletons [Sander and Jantsch, 1999], which evokes the concept of a "sketch".

Idea #5: Verification and testing, not one or the other. The research group enforces modular verification and automated unit-test generation through high-level semantic constraints on the behavior of the individual modules (such as parallel frameworks and parallel libraries). They identified this to be a challenging problem, since most programmers find it convenient to specify local properties using assert statements or static program verification. As a consequence, these programmers would have a hard time adapting to the high-level constructs.

Since the ForSyDe methodology starts from a formal, correct-by-design model and reaches an implementation mostly through semantic-preserving transformations, this problem no longer applies. Validation may be elegantly taken care of by the design's formalism, while early-stage testing can be achieved by executing the model [Attarzadeh Niaki et al., 2012].

Idea #6: Parallelism for energy efficiency. Using multiple cores to complete a task is more efficient in terms of energy consumption [Hennessy and Patterson, 2011]. Several mechanisms are recommended, such as task multiplexing, use of parallel algorithms to amortize instruction delivery, and message passing instead of cache coherency.

There is a number of projects related to ForSyDe which address power estimation in system design [Zhu et al., 2008, Jakobsen et al., 2011]. Since energy is a pressing issue especially in embedded systems, this problem will remain a main topic for future ForSyDe research.

Idea #7: Space-time partitioning for deconstructed operating systems. The spatial partition contains the physical resources of a parallel machine. Space-time partitioning virtualizes spatial partitions by time-multiplexing whole partitions onto the available hardware. As seen in [Ştefan, 2010], the partitioning can be done at low instruction level or, as the Berkeley research group proposes, at a "deconstructed OS" level.

Currently there is no OS support in ForSyDe, but implementing it could be seen as a mapping problem. The temporal dimension has to be described along with the partitioning of tasks to resources. Therefore, it counts as a design space exploration problem and could be a relevant future research topic.

Idea #8: Programming model efforts inspired by psychological research. This idea has been mentioned in [Asanovic et al., 2006] and identifies the need for researching a human-centric programming (in our case, design) model, along with the other already widespread models: hardware-centric, application-centric and formalism-centric. Since humans write programs, studying human psychology could lead to breakthroughs in identifying sources of human errors, problem-solving abilities, maintaining complex systems, etc. The above-mentioned report also increases awareness of the risk of human experiments when testing a new model. Examples in academic environments have shown that the test subjects' intuition has been challenged and changed during these experiments.

Regarding this research topic, "there has been no substantial progress to date" [Asanovic et al., 2006]. Even so, examples like [Hochstein et al., 2005, Malik et al., 2012] and other research in the field could constitute valuable resources for future ForSyDe studies on user or market impact.

3.4.3 Hardware point of view

Driven by the power wall, there are changes both in the programming model and in the architectural designs of systems. Many-core architectures have implications that may potentially work to industry's advantage. Apart from program parallelism, many simple low-frequency cores imply lower resource consumption [Hennessy and Patterson, 2011], simpler design and verification, and a higher yield. Also, guided by Amdahl's law [Hennessy and Patterson, 2011], heterogeneous hardware platforms are encouraged since they optimally exploit different aspects of computation.

There is a number of guidelines for parallel hardware design in both [Asanovic et al., 2006] and [Asanovic et al., 2009]. Among them, we could point out new mechanisms for assuring data coherency, like transactional memory or message passing, and accurate, complete counters for performance and energy. Since these features favor both parallel computation and correct system design methods, ForSyDe's future development could be guided and eased by platforms implementing them.

As presented in Section 3.4, we can draw the conclusion that ForSyDe is indeed an adequate methodology both for synthesizing parallel software and for designing systems with parallel platforms. Its primary target, real-time embedded applications, can successfully be extended to parallel applications while keeping its concepts and philosophy intact. Furthermore, even adepts of functionalism should not be discouraged by the formal style of programming, since it is much easier for ForSyDe to "put on a functional coat" than for a functional approach to implement a formal methodology. This functional flavor can materialize in the form of MoC-based IP blocks. Nevertheless, the advantages of formalism are obvious.


Chapter 4

GPGPUs and General Programming with CUDA

In order to understand the tasks that f2cc needs to solve and the challenges further implied by the component implementation, one has to understand the target platform: the GPGPU. This chapter provides a short overview of the GPGPU landscape for a quick grasp of this very large domain. After a brief introduction, the reader is presented with the main GPGPU architectural features in Section 4.2, followed by an example of CUDA programming in Section 4.3. The chapter closes with an overview of CUDA streams, which constitute the main tool for delivering this thesis' objectives.

4.1 Brief introduction to GPGPUs

As their name suggests, GPUs were propelled by the graphics industry, especially gaming. Throughout their evolution, their main purpose was to render high-resolution 3D scenes in real time [Nickolls and Dally, 2010]. The marketing term GPU was coined by nvidia in 1999 when they released the GeForce 256, "the world's first GPU" [nvidia, 2013b]. Since then, the number of transistors has increased from 23 million to 7 billion, and the computing power has risen from 480 mega-operations per second to 4.5 single-precision Tera-FLOPS for the nvidia GeForce GTX Titan in 2013 [nvidia, 2013c].

This tremendous boost in computing power has attracted developers from areas other than graphics processing [Nickolls and Dally, 2010, Kirk and Hwu, 2010, Joldes et al., 2010, Owens et al., 2008]. Pioneering developers had to express non-graphical computations through the graphics API shader languages, which were not ideal for general-purpose use. Furthermore, inefficiencies due to poor load balancing between the CPU and the GPU made these platforms difficult to handle for other applications.

Because of the increased demand for general-purpose computing using massively parallel processors, manufacturers opted to unify GPUs and CPUs into a single entity [Lindholm et al., 2008]. Further modifications to the GPU architecture, like support for integer and floating-point arithmetic, synchronization barriers, etc., allowed them to be programmed using a general-purpose imperative language like C. This led to the dawn of the General Purpose GPUs (GPGPUs) in 2006, when nvidia launched the GeForce 8800, the first unified graphics and computing architecture [Nickolls and Dally, 2010].

4.2 GPGPU architecture

The following subsection is based on material found in [Kirk and Hwu, 2010] and [Hjort Blindell, 2012]. For a more detailed presentation, the reader is encouraged to read the latter report.

The GPU architecture reflects its purpose, namely the processing of 3D graphics. Most of the graphical algorithms in the graphics pipeline [Kirk and Hwu, 2010] can be grouped under the Dense Linear Algebra motif (see Section 3.3). Therefore, many architectural features can be explained in relation to the first of Berkeley's dwarfs and to legacy compatibility with the graphics pipeline.

The GPU is a throughput-oriented architecture, assuming plenty of parallelism and employing thousands of simple processing units, in contrast with the latency-oriented architecture of a CPU. Thus, staying true to the historical term "graphics accelerator", its purpose is not to replace the CPU but to act as a co-processor.

GPGPUs follow the principle that computation is much cheaper than memory transfers. The memory organization differs from that of a general-purpose CPU, with its own throughput-oriented memory hierarchy. Furthermore, raw computation is favored over lookup algorithms with precomputed values, since it is faster and consumes less power.

For processing individual graphical elements, such as pixels or triangle, a small programnamed in CUDA a kernel function is invoked. In order to keep the processing cores busy whilewaiting for long latency operations, GPGPUs apply hardware multithreading to make threadswitches with fine granularity. A kernel normally involves spawning, executing and retiring ofthousands of threads with minimal overhead.
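As a minimal illustration of this execution model, the sketch below (names and sizes are illustrative, not taken from the thesis' case study) launches one thread per array element; the hardware spawns, schedules and retires the threads.

// A trivial CUDA kernel: each thread scales one element independently.
__global__ void scale(float* data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)              // guard: the last block may be padded
        data[idx] *= factor;
}

// Host-side invocation: enough 256-thread blocks to cover n elements.
// scale<<<(n + 255) / 256, 256>>>(dev_data, 0.5f, n);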

Figure 4.1: Overview of the nvidia CUDA platform. Source: adapted from [Hjort Blindell, 2012]

Figure 4.1 illustrates a typical nvidia CUDA GPGPU. The following enumeration gives a brief explanation of the main architectural components:

• a cluster is a subdivision of the GPGPU which contains a pair of streaming multiprocessors (SM).


• each SM has its storing elements connected to high-bandwidth DRAM interfaces. The DRAM memory as a whole is called the global memory.

• each SM has 8 streaming processors (SP), known as CUDA cores. They are the primary computing units, and each one is fully pipelined and in-order. The number of SPs, as well as the number of SMs, is device dependent.

• the special-function unit (SFU) computes fast approximations for certain functions, such as $\sin x$ or $1/\sqrt{x}$.

• the register file holds the thread context. Parts of it can be spilled into a portion of the DRAM called local memory, a name referring to its scope rather than its locality.

• the shared memory is a small, low-latency, multibanked on-chip memory which allows application-controlled caching of DRAM data.

• the constant cache is used for reducing latencies of operations with constants.

• the texture memory caches neighboring elements of 2D matrices, for improved performance in this type of memory access.

• newer CUDA GPGPUs are equipped with L1 and L2 caches as well.

In order to manage the large population of threads and to achieve scalability, the threads are divided into a set of hierarchical groups: grids, thread blocks and warps [Kirk and Hwu, 2010]. Figure 4.2 illustrates this organization, shortly explained in the following bullet points:

Figure 4.2: Thread division and organization as seen from the programmer's point of view. Source: adapted from [Kirk and Hwu, 2010, Hjort Blindell, 2012]

• a grid is formed of all the threads belonging to the same kernel invocation. Its geometry and size are configurable by the programmer.

• a thread block contains the threads to execute, arranged in either a 1D, 2D or 3D geometry configurable by the programmer. The geometry limitations depend on the device's generation. Each SM supports up to 8 thread blocks, and the GPU dynamically balances the workload across SMs.

• the threads are executed in a manner similar to SIMD execution, called Single Instruction Multiple Thread (SIMT), where threads adjacent to each other are grouped in warps.


• warps eligible for execution are scheduled and executed as a whole. Thus it is the programmer's task to assure that the warps are distributed so that there are no idle resources during execution. Also, thread divergence within warps (caused by data-dependent branch instructions) leads to idle resources.

• the CPU and the GPGPU are logically separated as host and device. As suggested in Figure 4.2, each kernel call on the host executes the kernel on the device.
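The geometry parameters above translate into code as in the following hedged sketch (the kernel and variable names are illustrative): the host picks a block and grid shape, and each thread recovers its global coordinates from the built-in indices.

__global__ void touch(float* img, int width, int height) {
    // Recover this thread's global 2D coordinates.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] += 1.0f;
}

void launch(float* dev_img, int width, int height) {
    dim3 block(16, 16);                            // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,    // round up so the grid
              (height + block.y - 1) / block.y);   // covers the whole image
    touch<<<grid, block>>>(dev_img, width, height);
}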

4.3 General programming with CUDA

The following subsection is based on material found in [Kirk and Hwu, 2010]. This text will provide a very brief example of CUDA programming. For a more comprehensive tutorial, the reader is encouraged to consult [Hjort Blindell, 2012, Kirk and Hwu, 2010].

For CUDA programmers, GPUs are massively parallel processors programmed in C with extensions. nvidia provides a software development kit (SDK) consisting of a set of libraries and a compiler called nvcc, so that developers can build applications in a familiar sequential environment.

According to Section 3.4, CUDA's approach is a hardware-centric one, since it has constructs for explicit control of the hardware architecture. Although this enables an experienced developer to take advantage of the hardware features and tweak programs to boost performance, the approach is tedious and not optimal in terms of productivity on a large scale. Its similarity to C, however, enables skilled programmers to port appropriate C programs to CUDA C "in a matter of days" [Kirk and Hwu, 2010].

In order to understand the CUDA C programming style, an example program performing matrix multiplication is shown in Listings 4.1, 4.2 and 4.3. The multiplication between matrix A and matrix B is done by calculating the dot product between a row in A and a column in B, as suggested in Equation 4.1¹.

$$c_{i,j} = \mathrm{row}_{A,i} \cdot \mathrm{col}_{B,j} \qquad (4.1)$$

¹For more on matrix multiplication, the reader may consult [Hennessy and Patterson, 2011] or any linear algebra manual.

Listing 4.1 below illustrates a pure C approach to the given problem. It computes the dot product between two matrices A and B in an iterative fashion that can be optimized by compilers for sequential machines.

void matrixMult(int* a, int* b, int* c) {
    int i, j, k;
    for (i = 0; i < N; ++i) {
        for (j = 0; j < N; ++j) {
            int sum = 0;
            for (k = 0; k < N; ++k) {
                sum += a[i * N + k] * b[j + k * N];
            }
            c[i * N + j] = sum;
        }
    }
}

Listing 4.1: C code for matrix multiplication

The CUDA C implementation of the same problem takes a rather different approach. Developers can no longer rely on automatic compiler optimizations; they have to take full responsibility for the proper utilization of resources, which can differ from application to application.

As seen in Figure 4.2, there is a logical separation between host and device. Since there will be two programs synchronizing with each other, two files have to be written. Listing 4.2 shows the host code. Generally, a CUDA program follows five steps, identified in the given code:

1. allocate memory on the device, with the cudaMalloc library function;

2. copy data from host to device, using cudaMemcpy;

3. perform data calculations on the device, by calling a kernel function. The syntax shows the grid and thread configuration for the given kernel;

4. copy the result back to the host, with cudaMemcpy;

5. free the allocated memory, achieved with cudaFree.

//Memory allocation on device
cudaMalloc((void**) &Ma, size);
cudaMalloc((void**) &Mb, size);
cudaMalloc((void**) &Mc, size);

//Host to device memory copy
cudaMemcpy(Ma, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(Mb, b, size, cudaMemcpyHostToDevice);

//Kernel invocation
dim3 gridDimension(1, 1);
dim3 blockDimension(N, N);
matrixMult<<<gridDimension, blockDimension>>>
    ((int*) Ma, (int*) Mb, (int*) Mc);

//Device to host memory copy
cudaMemcpy(c, Mc, size, cudaMemcpyDeviceToHost);

//Device memory deallocation
cudaFree(Ma); cudaFree(Mb); cudaFree(Mc);

Listing 4.2: CUDA Host code for matrix multiplication

Listing 4.3 illustrates one choice of implementation for the given problem on the GPGPU device. This choice was selected because it shows the usage of threads, synchronization barriers and shared memory, the main mechanisms that will be employed in the current project.

The kernel declaration is marked by the keyword __global__. Two variables are allocated in the shared memory, Da and Db, to temporarily store intermediate blocks of data². These locations will be shared by multiple threads to drastically reduce the memory access time.

To allow computations on matrices larger than supported by the device resources, they can be split into multiple kernel invocations, and thus into multiple thread blocks. As seen in Listing 4.3, individual elements in large matrices can be identified by indexing them with displacements relative to the thread index (threadIdx) and block index (blockIdx).

Perhaps the most significant change in matrixMult is that the two outer loops have been removed. Since CUDA is a parallel platform, calculating each element iteratively would be a waste of resources, because the compiler cannot identify the potential for parallelism. Instead, each element is calculated in parallel as a thread, identified by its indices.

Synchronization barriers (__syncthreads) have been used to separate the computation steps. They ensure that all threads have finished loading their data or performed their calculations before proceeding to the next tile.

²For more on the blocked matrix multiplication algorithm, the reader may consult [Hennessy and Patterson, 2011] and [Kirk and Hwu, 2010].

__global__ void matrixMult(int* a, int* b, int* c) {
    __shared__ int Da[TILE_SIZE][TILE_SIZE];
    __shared__ int Db[TILE_SIZE][TILE_SIZE];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;

    //Calculate dot product
    int i, sum = 0;
    for (i = 0; i < N / TILE_SIZE; ++i) {
        int k, sum_tmp = 0;

        //Load tile
        Da[ty][tx] = a[row * N + TILE_SIZE * i + tx];
        Db[ty][tx] = b[col + (TILE_SIZE * i + ty) * N];
        __syncthreads();

        //Calculate partial dot product
        for (k = 0; k < TILE_SIZE; ++k) {
            sum_tmp += Da[ty][k] * Db[k][tx];
        }
        sum += sum_tmp;
        __syncthreads();
    }
    c[row * N + col] = sum;
}

Listing 4.3: CUDA Device code for matrix multiplication


4.4 CUDA streams

In March 2010, nvidia launched the Fermi architecture, which allows concurrency between CPU computation, multiple kernels, one memory transfer from host to device and one from device to host. This feature favours complex computation (see Subsection 3.2.2) through the use of CUDA streams.

A stream is a sequence of operations that execute in issue-order on the GPU [Sanders and Kandrot, 2010]. The programming model allows CUDA operations in different streams to run concurrently and to be interleaved. Figure 4.3 shows the pipeline behavior induced by stream concurrency.

Figure 4.3: Concurrency example with CUDA streams. Source: adapted from [nvidia, 2013a]

The default stream, or stream 0, is used when no stream is specified. It implies that all operations are synchronous between host and device. Listing 4.2 shows an example of a kernel invoked in the default stream.

In order to implement concurrency, a number of requirements have to be fulfilled [nvidia, 2013a]. First, the concurrent CUDA operations have to be placed in different, non-0 streams. Secondly, the data transfers have to be made using cudaMemcpyAsync, and from page-locked memory on the host (cudaMallocHost). Thirdly, the user has to assure that there are enough resources available for full concurrency. This means that data is transferred in different directions at any one point in time, and that there are enough device resources (like SMs, blocks, registers, etc.). Listing 4.4 demonstrates the use of streams.

cudaStream_t stream1, stream2, stream3, stream4;
cudaStreamCreate(&stream1);
...
cudaMalloc(&dev1, size);
cudaMallocHost(&host1, size);    // pinned memory required on host
...
//potentially overlapped section
cudaMemcpyAsync(dev1, host1, size, H2D, stream1);
kernel2<<<grid, block, 0, stream2>>>(..., dev2, ...);
kernel3<<<grid, block, 0, stream3>>>(..., dev3, ...);
cudaMemcpyAsync(host4, dev4, size, D2H, stream4);
some_CPU_method();
//end of potentially overlapped section
...

Listing 4.4: Using streams in CUDA. Source: [nvidia, 2013a]
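Listing 4.4 only shows how work is issued; a typical continuation (a sketch assuming the variables declared in the listing above) waits for the queued operations and then releases the streams and the pinned memory:

//Wait for the overlapped section to complete
cudaStreamSynchronize(stream1);   // wait for one particular stream
cudaDeviceSynchronize();          // or: wait for all outstanding work

//Release streams and memory
cudaStreamDestroy(stream1); cudaStreamDestroy(stream2);
cudaStreamDestroy(stream3); cudaStreamDestroy(stream4);
cudaFreeHost(host1);              // pinned host buffer
cudaFree(dev1);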


In this chapter the reader has been briefly introduced to the topic of GPGPUs and general programming with CUDA. While the reader is encouraged to read more about this topic in order to fully understand the mechanisms behind a successful CUDA program, the information provided is enough to stand as a technical background for the synthesizer module and its methods presented in Chapter 8.


Chapter 5

The f2cc Tool

This chapter will introduce the reader to the f2cc tool, the main component to be improved during the current project. Section 5.1 presents the tool's main features and its usage. Since the last chapter presented CUDA programming basics, it is proper now to offer an insight into the component's architecture in Section 5.2. The chapter ends with a presentation of alternative solutions for CUDA code generation and their methodologies, in order to compare them with ForSyDe.

5.1 f2cc features

f2cc stands for ForSyDe-to-CUDA C. It is a ForSyDe tool whose main purpose is to synthesize CUDA-enabled GPGPU backend code from a high-level ForSyDe model. It has been developed by Gabriel Hjort Blindell as part of his Master's Thesis in 2012 [Hjort Blindell, 2012].

The tool inputs high-level ForSyDe models as GraphML¹ files. Apart from structural information, these intermediate representations encapsulate C code for each leaf process². It offers support for the following ForSyDe process constructors: MapSY, ParallelMapSY, ZipWithNSY, UnzipxSY, ZipxSY, DelaySY³.

The component generates either CUDA C for running on nvidia GPGPUs or sequential C code for running on CPUs; the output is controlled via a command-line switch. For GPGPU code synthesis, it identifies contained Unzipx-Map-Zipx sections like the one in Figure 5.1, coalesces them into an internal ParallelMapSY process and wraps an optimized CUDA kernel around them. For sequential code synthesis, it uses an internal scheduling mechanism to identify a correct process execution order and generates C code afterwards.

¹At the time of the tool's development ForSyDe-SystemC was not yet released, thus f2cc inputted ForSyDe-Haskell-generated GraphML files.
²At the time of the tool's development there was not yet a tool for C code extraction from Haskell processes, thus the C code for each process had to be hand-written.
³The process nomenclature corresponds to the ForSyDe-Haskell naming convention.


Figure 5.1: Parallel patterns identified by f2cc

f2cc uses an internal object-based representation of the ForSyDe model, for easy user access and parsing. The frontend module of the tool extracts data from the GraphML input files and translates it into the internal representation, so that all transformations happen internally.

One last feature of f2cc is the option to make use of CUDA shared memory for minimising transfer costs within parallel processes.

5.2 f2cc architecture

f2cc is an independent tool implemented in an object-oriented style in C++. All its classes and components, with their APIs, are documented using Doxygen. To avoid naming clashes, all components that belong to the tool are implemented in their own namespace, f2cc.

Its main modules and their interconnections are shown in Figure 5.2. In principle, the tool's data flow follows the path described in the next enumeration. Further information about the whole process can be found in [Hjort Blindell, 2012].

1. parse input GraphML file;

2. translate the data into its own intermediate ForSyDe object-based representation;

3. perform modifications upon the intermediate model, like redundant process elimination, or finding, coalescing and wrapping of contained sections into ParallelMapSY;

4. input the resulting model to the synthesis process, which finds a correct sequential schedule and wraps parallel sections into CUDA kernels;

5. output the resulting host and device code, or the sequential code respectively.

The frontend module holds classes and definitions for frontend parsing. It contains general functions and the GraphML parser that converts input data into the internal format. As suggested in Figure 5.2, this module makes use of a third-party library called TinyXML++, included in the ticpp module. This library provides f2cc with XML parser functions.

The module mainly used in the next step of the component's execution path is forsyde. It includes classes for the internal representation of the ForSyDe model's components, and for model modifier methods called throughout the process. In order to avoid clashes with other tools or components, every class or method that belongs to this category resides under the f2cc::Forsyde namespace.


Figure 5.2: Component modules (frontend, forsyde, synthesizer, language, ticpp, logger, tools, exceptions, config) and their connections. Source: [Hjort Blindell, 2012]

The internal model is equipped with a class for each ForSyDe process constructor recognized by the component. Every object instantiated at runtime encapsulates all the information needed (extracted from the model or assumed, as will be presented in Chapter 7) to describe a ForSyDe process. The relations between the classes used in the internal representation follow the pattern in Figure 5.3.

Figure 5.3: Classes used for f2cc internal model representation and the relations between them: the process constructor classes (ZipWithNSY, DelaySY, ZipSY, UnzipSY, MapSY) derive from Process; a Model has many Processes and a Process has many Ports; function-based processes hold a CFunction; CoalescedMapSY is a MapSY and ParallelMapSY is a CoalescedMapSY. Source: adapted from [Hjort Blindell, 2012]
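Read together with the caption above, these relations can be compressed into code. The following is only a sketch of the hierarchy with simplified, hypothetical member names, not f2cc's actual headers.

#include <string>
#include <vector>

// Compressed sketch of the Figure 5.3 class relations.
class CFunction { std::string body_; };

class Port {
    std::string id_;
    Port* connectedPort_ = nullptr;                   // double-linked peer
};

class Process {                                       // common base class
public:
    virtual ~Process() {}
protected:
    std::string id_;
    std::vector<Port*> inPorts_, outPorts_;           // "has many" Ports
};

class DelaySY    : public Process {};
class ZipSY      : public Process {};
class UnzipSY    : public Process {};
class MapSY      : public Process { CFunction* function_ = nullptr; };  // "has a"
class ZipWithNSY : public Process { CFunction* function_ = nullptr; };

class CoalescedMapSY : public MapSY {};               // "is a" MapSY
class ParallelMapSY  : public CoalescedMapSY {};      // "is a" CoalescedMapSY

class Model { std::vector<Process*> processes_; };    // "has many" Processes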

The main model modifier identifies contained sections, coalesces them and transforms them into a ParallelMapSY process, for easy wrapping into a CUDA kernel. The analysis of contained sections is done by traversing the model in reverse order and identifying the processes between an UnzipxSY and a ZipxSY. Coalescences and conversions are done at the model level, and all model modifications consist mainly of adding/removing processes and redirecting signals.
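A minimal sketch of this reverse traversal follows, under the simplifying assumption of a chain-shaped model with a single predecessor per process; the names (ProcessInfo, findContainedSection) are hypothetical and the real modifier handles many more cases.

#include <string>
#include <vector>

// Stand-in for f2cc's Process; "pred" is its single upstream process
// in this simplified, chain-shaped model.
struct ProcessInfo {
    std::string type;        // e.g. "MapSY", "UnzipxSY", "ZipxSY"
    ProcessInfo* pred;       // upstream process (nullptr at the inputs)
};

// Walk backwards from a ZipxSY until an UnzipxSY is found, collecting the
// processes in between; empty result means no contained section ends here.
std::vector<ProcessInfo*> findContainedSection(ProcessInfo* zipx) {
    std::vector<ProcessInfo*> section;
    for (ProcessInfo* p = zipx->pred; p != nullptr; p = p->pred) {
        if (p->type == "UnzipxSY") return section;   // section is complete
        section.push_back(p);                        // candidate Map process
    }
    return {};                                       // reached an input: no match
}

int main() {
    ProcessInfo unzip{"UnzipxSY", nullptr};
    ProcessInfo map{"MapSY", &unzip};
    ProcessInfo zip{"ZipxSY", &map};
    return findContainedSection(&zip).size() == 1 ? 0 : 1;   // finds the MapSY
}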

When all the necessary model modifications are ready, the synthesizer module takes over. Its two main classes are Synthesizer, with methods for generating CUDA C or sequential C code, and the Schedule Finder, with methods for identifying a correct sequential schedule for running a ForSyDe model.

The language module is referenced throughout execution by the previous modules. It holds containers for storing the C code for each process, and contains string-based methods for manipulating the contained code.

Each executing module reports its runtime events to the logger module. The tools module contains common miscellaneous methods, and the config class contains user-specified settings for the current program invocation. The exceptions module defines all exception classes used throughout the program.

5.3 Alternatives to f2cc

GPGPUs are notoriously difficult to program, due to the low level of programming languages like CUDA and OpenCL. Because of this, code synthesis from a high-level language is an intensely studied domain in the research community. Several research groups have focused their work in the past few years on providing proper tools and methodologies for increasing yields and productivity in GPGPU programming.

In Chapter 3 it has been shown that ForSyDe is not just a parallel programming language like most of the research efforts in the field, and that it is flexible in describing and handling parallel constructs since it formally supports the parallel computation framework.

Since GPGPUs are the most popular parallel platforms in industry, it is a reasonable decision to start exploring the high-throughput parallel domain with applications that target them. In order to do so, a review of the available alternatives for GPU code synthesis has to be made. This will aid in comparing their approaches with ForSyDe and examining their conceptual compatibility with a formal methodology. [Hjort Blindell, 2012] presents two alternative DSLs for GPGPU synthesis: SkePU and Obsidian. The following subsections extend the list with additional approaches.

5.3.1 SkelCL

The following subsection is based on material found in [Steuwer et al., 2013].

SkelCL is a frontend for OpenCL, developed at the University of Münster. It is built as a C++ library that offers pre-implemented recurring computation and communication patterns which simplify programming for single- and multi-GPU systems. These patterns4 are called skeletons. Skeletons raise the abstraction level for programming and shield the programmer from boilerplate code, in the same manner as ForSyDe process constructors do5.

Formally, a skeleton is a "higher-order function that executes one or more user-defined functions in a pre-defined parallel manner, while hiding the details of parallelism and communication from the user" [Steuwer et al., 2013]. The SkelCL framework offers four basic skeletons, one extended skeleton and two data containers.

The containers offered by SkelCL are Vector and Matrix. Vector is an abstraction for a contiguous memory area that is accessible by both the host and device, and implements optimal transfers between them. Matrix behaves as a 2D vector and takes care of memory organization for 2D applications. The skeletons implemented by SkelCL are:

4 The reader may notice the resemblance with Idea #1 in Section 3.4.
5 In fact, in its early development stages, ForSyDe named the process constructors skeletons.


• Map: applies a unary function f to each element of an input vector $v_{in}$, and is described by Equation 5.1;

• Zip: operates on two input vectors and applies a binary operator ⊕ to all pairs of elements.It is described by Equation 5.2;

• Reduce: computes a scalar value r from a vector using a binary operator ⊕. It is describedby Equation 5.3;

• Scan: (or prefix-sum) yields an output vector with each element obtained by applying abinary operator ⊕ to the elements of the input vector up to the current element’s index. Itis described by Equation 5.4;

• MapOverlap is the extended skeleton, used with either the vector or the matrix data type; it applies f to each element of an input matrix $m_{in}$ while taking into account the neighboring elements within the range $[-d,+d]$. It is described by Equation 5.5.

$$v_{out}[i] = f(v_{in}[i]) \qquad (5.1)$$

$$v_{out}[i] = v_{in_l}[i] \oplus v_{in_r}[i] \qquad (5.2)$$

$$r = v[0] \oplus v[1] \oplus \dots \oplus v[n-1] \qquad (5.3)$$

$$v_{out}[i] = \bigoplus_{j=0}^{i-1} v_{in}[j] \qquad (5.4)$$

$$m_{out}[i,j] = f\begin{pmatrix}
m_{in}[i-d,\,j-d] & \cdots & m_{in}[i-d,\,j] & \cdots & m_{in}[i-d,\,j+d] \\
\vdots & & \vdots & & \vdots \\
m_{in}[i,\,j-d] & \cdots & m_{in}[i,\,j] & \cdots & m_{in}[i,\,j+d] \\
\vdots & & \vdots & & \vdots \\
m_{in}[i+d,\,j-d] & \cdots & m_{in}[i+d,\,j] & \cdots & m_{in}[i+d,\,j+d]
\end{pmatrix} \qquad (5.5)$$
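To ground the notation, the following is a small sequential C++ rendering of the Scan semantics in Equation 5.4: plain reference code written for this text, not SkelCL's implementation. It assumes an exclusive prefix (element i combines inputs up to index i-1, starting from a user-supplied identity element).

#include <functional>
#include <iostream>
#include <vector>

// Sequential reference semantics of the Scan skeleton (Eq. 5.4):
// vout[i] = vin[0] (+) ... (+) vin[i-1], with vout[0] = identity.
std::vector<float> scan(const std::vector<float>& vin,
                        std::function<float(float, float)> op,
                        float identity) {
    std::vector<float> vout(vin.size());
    float acc = identity;
    for (std::size_t i = 0; i < vin.size(); ++i) {
        vout[i] = acc;           // elements up to index i-1 only
        acc = op(acc, vin[i]);
    }
    return vout;
}

int main() {
    std::vector<float> v = {1, 2, 3, 4};
    auto sums = scan(v, [](float a, float b) { return a + b; }, 0.0f);
    for (float x : sums) std::cout << x << " ";   // prints: 0 1 3 6
    std::cout << "\n";
}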

int main (int argc, char const* argv[]) {
    SkelCL::init();                                       // initializes SkelCL

    // Create skeletons
    Reduce<float> sum("float func(float x, float y){ return x+y; }");
    Zip<float> mult("float func(float x, float y){ return x*y; }");

    Vector<float> A(SIZE); fillVector(A);                 // create input vectors
    Vector<float> B(SIZE); fillVector(B);

    Vector<float> C = sum( mult( A, B ) );                // execute skeletons
    cout << "Result: " << C.front();                      // print result
}

Listing 5.1: SkelCL program computing the dot product of two vectors. The same program in OpenCL would take 59 lines of code. Source: [Steuwer et al., 2013]

A strong resemblance with ForSyDe is that SkelCL also uses formal means for describing communication and computation. Despite that, one can see that the two methodologies started from different areas of interest. While SkelCL implements optimal solutions for increasing performance and productivity on GPGPUs, it implements no concept of MoC. As a result it cannot describe complex heterogeneous systems, and the programming model cannot be extended further than GPGPUs. Nevertheless, its strong mathematical foundation in parallel computation6 should be an exemplary model for ForSyDe in further exploring the parallel computation domain.

5.3.2 SkePU

The following subsection is based on material found in [Enmyren and Kessler, 2010].

SkePU is a skeleton programming library for multiple GPGPU backends, developed at Linköping University, in the Department of Computer and Information Science. It is a C++ template library which provides a simple and unified interface for specifying data-parallel computation, and its target platforms are CUDA, OpenCL, OpenMP and sequential CPU. It also provides support for multiple GPUs.

The approach is similar to SkelCL's in many ways. SkePU also provides a list of skeletons and data containers that operate similarly and generate backend code. The only data container at the time of writing this report was the vector, which employs lazy memory copying in order to minimize bottlenecks. The basic skeletons provided are Map and Reduce, with formal descriptions similar to those in Equation 5.1 and Equation 5.3. The other skeletons provided are compositions of these two basic ones: MapReduce, MapOverlap and MapArray.

UNARY_FUNC(name, type1, param1, func)
UNARY_FUNC_CONSTANT(name, type1, param1, const1, func)
BINARY_FUNC(name, type1, param1, param2, func)
BINARY_FUNC_CONSTANT(name, type1, param1, param2, const1, func)
TERTIARY_FUNC(name, type1, param1, param2, param3, func)
TERTIARY_FUNC_CONSTANT(name, type1, param1, param2, param3, const1, func)
OVERLAP_FUNC(name, type1, over, param1, func)
ARRAY_FUNC(name, type1, param1, param2, func)

Listing 5.2: SkePU function macros. Source: [Enmyren and Kessler, 2010]

Listing 5.2 shows the function macros available in SkePU. They offer a standardized yet large degree of expressiveness in solving problems with the aid of skeletons. Listing 5.3 demonstrates the use of the Map skeleton, associated with a binary function.

The main addition that SkePU offers is multi-platform support. Each skeleton contains member functions that correspond to the supported backends. Basically, the function macros expand into implementations of the same function in CUDA, OpenCL, OpenMP and even sequential CPU. Also, the authors state that the interface is general enough to be expanded to platforms other than the ones mentioned.

Despite the apparent flexibility and generality of SkePU's approach, it is still limited to static code mapping. Also, since ForSyDe was born in the real-time heterogeneous embedded platforms domain, it has a higher, more abstract view of systems, a view which can integrate parallel platforms from a formal perspective. Thus strong features like system analysis and manipulation are not used when regarding solely code generation.

6 The reader can clearly see the resemblance between the parallel taxonomy derived from Kleene's model (Subsection 3.2.2) and the skeletons. Basically, the first four skeletons are implementations of the data, reduction and speculative parallel computations, while the last one is a composition between the data and speculative parallel computations.


BINARY_FUNC(plus, double, a, b, return a+b;)

int main() {
    skepu::Map<plus> globalSum(new plus);
    skepu::Vector<double> v0(10, 10);
    skepu::Vector<double> v1(10, 5);
    skepu::Vector<double> r;

    globalSum(v0, v1, r);
    std::cout << "Result: " << r;
}

Listing 5.3: SkePU syntax example. Source: [Enmyren and Kessler, 2010]

int main(void) {
    // generate 16M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
}

Listing 5.4: A Thrust program which sorts data on the GPU. Source: [Bell and Hoberock, 2011]

Hence, SkePU's approach, although useful for future research, is not proper for a (full) system design methodology. Still, its optimization mechanisms will definitely aid in the implementation of a synthesis stage in the ForSyDe design flow.

5.3.3 Thrust

The following subsection is based on material in [Bell and Hoberock, 2011].

Thrust is a high-level API which mimics the C++ STL library, provided with CUDA version 4.0. It implements an abstraction layer for the CUDA platform meant for increased productivity7. It can be utilized in rapid prototyping of CUDA applications, as well as in production, since it provides robust implementations with increased performance.

Thrust provides two main data containers, device_vector and host_vector, which come with member functions and iterators. Among the most important tools of this library are its custom device iterators, like the permutation, constant and counting iterators. These features give the programming environment an application-oriented flavour, much like MATLAB or other similar languages, rather than a hardware-oriented one. Listing 5.4 illustrates a typical program written with Thrust.
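As an illustration of these iterators, the following is a small sketch using Thrust's counting and constant iterators; it is a minimal example written for this text, not taken from [Bell and Hoberock, 2011].

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

// Fills a device vector with 10*i for i = 0..n-1 without allocating any
// input data: the "inputs" are generated on the fly by fancy iterators.
int main() {
    const int n = 8;
    thrust::device_vector<int> result(n);

    thrust::transform(thrust::counting_iterator<int>(0),    // 0, 1, 2, ...
                      thrust::counting_iterator<int>(n),
                      thrust::constant_iterator<int>(10),   // 10, 10, 10, ...
                      result.begin(),
                      thrust::multiplies<int>());           // element-wise product

    for (int i = 0; i < n; ++i)
        std::cout << result[i] << " ";                      // 0 10 20 ... 70
    std::cout << "\n";
}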

7 Equivalent to the productivity layer, see Section 3.4.


As can be seen, Thrust's approach is completely different from ForSyDe's. It is merely a productivity layer targeting only CUDA, and there is no method that can be imported into the ForSyDe methodology, since there is no formalism involved. The reason why this set of tools has been included in the list of alternatives is to present an application-oriented approach which enforces a general programming style, much appreciated in industry. The patterns employed may be useful for future ForSyDe research on user impact. Design patterns may present themselves as formal IP blocks.

5.3.4 Obsidian

The following subsection is based on material found in [Svensson et al., 2010].

Obsidian is a domain-specific language (DSL) embedded in the functional programming language Haskell, targeting data-parallel programming on GPUs. It generates nvidia CUDA code and, at the time of writing this report, it considered only the generation of single kernels without coordination. Obsidian aims at simplifying the development of GPU kernels by using familiar constructs from Haskell, like map, reduce, foldr, permutation functions like rev, riffle and unriffle, recursion, or other special design combinators for combining GPU programs.

As in the previously presented environments, the computation is described as computation between arrays, and combinators are used instead of direct indexing into structures. This favors the development of prototypes and experimenting with different partitionings and choices when implementing an algorithm, without thinking of the architectural details.

Once a data array is formed, the programmer may use Obsidian versions of Haskell library functions, with the restriction that they only operate on Obsidian arrays. Listing 5.5 shows an example where fmap is used to declare a function incr that increments each value in an array.

incr :: Arr a -> Arr a
incr = fmap (+1)

Listing 5.5: Declaration of an Obsidian function incr. Source: [Hjort Blindell, 2012]

As mentioned, Obsidian features a set of combinators to construct GPGPU kernels. A typical GPGPU kernel is represented by the data type a :-> b, and is depicted in Figure 5.4. The combinators used are pure and sync, and they are defined as in Listing 5.6. pure turns one or more arrays into a kernel, and sync is used either to store data in the shared memory between kernels, or to apply synchronization barriers.

Figure 5.4: A GPU program of type a :-> b is represented in Obsidian as pure computation interspersed by syncs. Source: adapted from [Svensson et al., 2010]

The generated CUDA code has "satisfactory" [Svensson et al., 2010] performance for some applications, while for others it is not sufficient.


pure :: (a -> b) -> a :-> b
sync :: Flatten a => Arr a :-> Arr a

Listing 5.6: Definition of pure and sync combinators in Obsidian. Source: [Hjort Blindell, 2012]

Obsidian, like all the other approaches, is a platform-specific environment. Although it works on raising the abstraction level of development through formal methods (like computation and synchronization), it has little in common with a system design methodology like ForSyDe. Obsidian focuses on imposing restrictions on the programming model, forcing the developer to adapt his or her algorithm to fit them, in order to enable platform-specific optimization. ForSyDe, on the other hand, models platform-independent computation, and the platform optimizations are done through design space exploration. Thus, a relevant approach in a ForSyDe design flow would be to model GPGPUs as a target platform and enable design space exploration on the MoC-based algorithm.

There are many more similar research or commercial tools available that have more or less the same purpose: synthesizing low-level GPGPU code from a high-level model. Among them, we can enumerate DPSKEL (a dynamic programming skeletal-based environment targeting multi-GPU), CUDPP, Brook (other C++ template libraries), Vertigo (a Haskell EDSL optimized for DirectX9), Accelerate (another Haskell EDSL), PyGPU (a Python library), etc. The multitude of tools demonstrates the intensity of research in this particular area. Strangely though, among them there is no system design methodology like ForSyDe. Since it has a different approach than most of the tools available, ForSyDe can borrow only a few conceptual methods. Studying their optimization mechanisms, on the other hand, could prove useful for the development of specific design space exploration algorithms.


Chapter 6

Challenges

This chapter lists the main challenges that the component development and implementation will face.

Based on the material covered through Part I of this report, the following set of challenges has been identified. They are all related to the f2cc tool, since it will be the main software component extended and developed throughout this project. Based on the module that they are associated with, the new set of goals has been grouped into five categories, and annotated based on their priority as either High, Medium or Low. Since time is a limiting factor, only the challenges marked with High and Medium priorities will be treated. Challenges with Low priority shall be taken into account only if time permits, and are otherwise out of the scope of this M.Sc. thesis.

frontend-related challenges:

• Implement a new XML parser (High)
The current ForSyDe-Haskell generated GraphML representation is obsolete, and is replaced by the new XML intermediate representation. Therefore, it is necessary to build a parser inside the frontend module of f2cc in order to input and interpret ForSyDe-SystemC models.

• Identify a minimal set of additional XML annotations (High)
Data-parallel computation can be identified from the high-level functional model, but time-parallel computation needs additional information regarding architectural details and constraints. Since ForSyDe does not yet specify design constraints or platform descriptions, this information has to be assumed. The challenge is to find a minimal set of implied information so that it does not impose restrictions on future ForSyDe research.

• Propose topics for further investigation (Low)
Since the intermediate format is still under development and hasn't reached its final form, new requirements can be identified actively. One should be aware of future research and propose solutions for extracting information not yet available. It is more appropriate to extract missing information through various means than to infer it inside the tool.


model-related challenges:

• Offer model support for the new frontend (High)
The fact that the XML representation encapsulates more information than the previous GraphML could be considered a major advantage. Since both structural and computational information is easier to extract, this could ease further steps like model transformations, pattern identification and even code synthesis.

• Develop an identification algorithm for the data parallel sections (High)
Currently, f2cc is limited to identifying contained split-map-merge sections as data parallel. Consequently, it misses some opportunities to exploit models that do not fit this pattern. The composite process described by the new XML representation has an improved potential for expressing patterns, which can be exploited to identify parallel computation. Hence a new pattern recognition algorithm is necessary.

• Implement model modifiers for synthesis of time-parallel code (High)
In order to correctly wrap streamed CUDA kernels that employ time-parallelism, a set of model modifiers has to be implemented. These modifiers split "contained" sections into "pipelined" ones.

• Implement a load balancing algorithm (High)
Improper usage of streams can degrade performance instead of improving it. To avoid such situations, a correct load-balancing algorithm has to be employed, which splits contained sections correctly so that the best resource usage occurs. This load-balancing algorithm has to take into account the supplementary model annotations.

• Implement an internal model dumper (Medium)
An internal model dumper may be an invaluable addition for debugging purposes. By being able to dump the internal model, one can verify that the model modifiers are working correctly.

• Conform with the new ForSyDe naming convention (Low)
There have been some minor modifications in the ForSyDe naming convention regarding process constructors. In order to avoid confusion, the internal model of f2cc should conform with these changes as well.

• Implement a graph-based internal representation model (Low)
Although the current internal representation model is sufficient for the task at hand, it is a limiting factor for future ForSyDe development. All the analysis, traversal and modifier algorithms have to be implemented manually, and the object-based containers may not be optimized for performance. On the other hand, graph theory is intensely studied and there are enough optimized template-based libraries with "out-of-the-box" functions that could prove invaluable for future ForSyDe development. This problem will be treated only if time permits.

• Explore the possibility of identifying and implementing other types of parallelism (Low)
If time permits, new solutions for identifying and implementing various types of parallel computation can be researched. The different types of parallelism are presented in Subsection 3.2.2 and some implementation solutions can be inspired by Section 5.3.


synthesizer-related challenges:

• Implement an algorithm that takes care of signal conversion (High)
The GraphML representation does not include information regarding the data type contained by signals. Therefore, the association between a signal and its variable had to be inferred using a specific algorithm. Since this information comes "for free" in the new XML representation, signal conversion requires a different algorithm.

• Implement a CUDA stream wrapper (High)
This task is straightforward, but it is critical for enabling the synthesis of time-parallel code on an nvidia CUDA-enabled GPGPU. The synthesis of optimized streamed CUDA code strongly depends on a good load-balancing algorithm.

• Further optimize the use of shared memory (Low)
Currently, f2cc employs the use of CUDA shared memory, but more optimizations are possible. If time permits, this issue can be treated.

language-related challenges:

• Develop a separate tool for C code extraction (High)
f2cc uses string-based code parsers embedded in the language module. Although they are sufficient for parsing C code, they are not enough for identifying C++ or SystemC templates. Since developing an internal C++ parser cannot fit inside the time slot of this project, it will be developed as a separate tool based on an existing code parser with a C++ grammar. Since such a tool is developed in parallel with the current project, the work effort associated with this goal will not be assigned an important time frame. Instead it will be regarded as a temporary solution that will be improved in the future.

• Conform with the new C header model (High)
There is a discrepancy between the C code implied by the GraphML model inputted by f2cc and the ForSyDe-SystemC function declaration. This has to be taken care of as well for a proper interpretation of the processes' code.

• Take care of C++ and ForSyDe containers (High)
The ForSyDe-SystemC framework makes extensive use of C++ templates and data types. Also, the protocol part, where signal containers are wrapped and unwrapped, cannot be identified by a C parser either. There has to be an association algorithm for a correct interpretation of the source code.

General implementation challenges:

• Make use of current algorithms and modules as much as possible (High)
Since f2cc is a fully-functional tool, an important part of its code, classes and methods can be reused to increase productivity.

• Maintain backward-compatibility (High)
Backward-compatibility must not be damaged. The component has to be compatible with the initial GraphML models.

• Implement a new execution path for ForSyDe-SystemC models (High)
There are too many discrepancies between the ForSyDe-Haskell generated GraphML and ForSyDe-SystemC generated XML representations. The frontend, model modifier, synthesizer and language algorithms are more or less different. Therefore it is easier to implement a new execution path and ensure backward-compatibility than to try to embed the new functionality in the same path and risk damaging existing code.

• Merge the old and the new execution paths (Low)
Once the new component is validated, one can merge the two execution paths for code cleanness.


Part II

Development and Implementation


Chapter 7

The Component Framework

This chapter describes the software requirements and the architectural traits that a tool like f2cc demands in order to satisfy the desired functionality. Since the current project will develop and improve an already existing tool while maintaining backward compatibility, special attention will be paid to the existing features and possible ways to improve them. Finally, for each feature, the design decisions made in the implementation process will be presented and justified, along with solutions for future development. Throughout Part II, the existing tool implementation described in [Hjort Blindell, 2012] will be denoted f2cc v0.1, and the new component improved as part of this thesis' contribution will be denoted f2cc v0.2.

7.1 The ForSyDe model

As presented in Chapter 2, the ForSyDe modeling framework originally provided three main building blocks for describing systems: the process; the signal, which transports data between processes through ports; and the domain interface, which is not treated in the current contribution. Starting with the ForSyDe-SystemC modeling framework, a new building block was introduced: the composite process, which describes compositions of other processes.

The composite process enables the designer to express hierarchy more naturally in the design (similar to Hardware Description Languages – HDLs), while providing the potential tool developer with means for enhanced model analysis and manipulation. Models can now be analyzed and grouped locally, and replication or coalescing can be done without using intermediate data structures; thus both the designer and the tool developer can have a clear view of the model even during intermediate design flow steps. Also, patterns like data parallelism are easier to identify since it is not necessary to parse the whole model, favoring early pattern recognition.

7.1.1 f2cc approach

In order to use, analyze and manipulate the ForSyDe model, f2cc needs its own internal data representation that can be extracted from the XML or GraphML representations generated by either of the ForSyDe implementations. Since every step in the execution flow revolves around this model representation, it can be considered the "backbone" of f2cc, as it plays a similar role to the backbone presented in Appendix B.

f2cc v0.1

In Section 5.2 the existing tool's architecture is described as implemented in [Hjort Blindell, 2012]. From its point of view, a process is an object that derives from the Process class (Figure 5.3). These objects encapsulate data relevant for the model description outputted by the ForSyDe-Haskell design framework. The main features of f2cc can be depicted as in Figure 7.1.

Figure 7.1: Visual representation of the internal model in f2cc v0.1

As can be seen, f2cc's internal model supports the description and manipulation of a ForSyDe process network with respect to the development stage at the time. The process network is denominated in f2cc v0.1 as a Model. The Model contains a list of unique process IDs associated with processes. The entries in this list point to Process objects which have special properties as described by their equivalent ForSyDe process constructor.

Each Process object has a unique ID, a list of input ports, and a list of output ports. These Ports are objects identified by their ID, and they each contain a pointer to another port belonging to a different process. Two connected ports are double-linked, meaning that both objects encapsulate pointers to one another. A port is connected to only one other port, and if another connection is formed, the initial connection is broken.

Some processes, like mapSY, zipwithnSY, coalescedMapSY and parallelMapSY [Hjort Blindell, 2012], encapsulate C functions, further described in Section 7.3.

The Model also contains a list of inputs and a list of outputs, which are in fact "one-way" pointers to ports. These pointers can be considered starting and ending points for model parsing algorithms.

The implementation of this model has been optimized for a limited set of actions undertaken by the algorithms in f2cc. Special consideration for high-performance execution has been paid by using sets, lists, and optimized memory accesses for critical methods.


f2cc v0.2

Although the framework provided by f2cc v0.1 has an intuitive interface and facilitates fast code execution, its main weakness is its inflexibility towards the new features introduced with the ForSyDe-SystemC modeling framework, which were not known at the time of the tool's development. This weakness is amplified by the fact that every module of the software component is highly dependent on the internal model, so any changes in the existing structure could render the whole tool unusable.

Since maintaining backward compatibility is a high-priority goal, the main challenge is to keep the internal model API unmodified, while providing new functionalities transparent to the execution flow in v0.1. Facing this challenge, a number of design decisions were made which favoured the intuitive patterns in the object-based representation from v0.1 over an optimized execution time, which would have implied changing the whole structure of the tool.

The new features are depicted in Figure 7.2, which reflects the design decisions made. The Model from v0.1 is replaced by the Process Network. From outside, it has to be seen in the same manner: a collection of processes with two lists of pointers to the input/output ports. Now there are two types of processes having distinct natures and functions, leaf processes and composite processes. Due to this, they have been separated into two distinct lists, accessed with different methods.

The Process objects, although renamed into Leaf objects to disambiguate them from Composite objects, suffered only minor modifications concerning their interface. Their names have been conformed to the new ForSyDe terminology: mapSY and zipwithnSY have been renamed to comb, and copySY has been renamed to fanout. Also, support for coalescedMapSY and parallelMapSY has ceased, since the execution flow in v0.2 will not use these processes.

Internally however, a leaf process' structure is different. Instead of only one unique ID, all processes (including composites) are described by a hierarchical path that enables their correct placement in the process network. They have to have a context, and using a process without context results in an invalid model exception. The relation between two processes (child, parent, first child, sibling, sibling's child, etc.1) can be extracted from the hierarchy path, for further manipulation.
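As an illustration of how such relation queries might work, here is a minimal sketch assuming that paths are stored as ID lists from the root down; the type and function names are hypothetical and f2cc's actual representation may differ.

#include <algorithm>
#include <string>
#include <vector>

// Hypothetical hierarchy path, e.g. {"root", "composite1", "leaf2"}.
using Path = std::vector<std::string>;

// True if 'child' lies anywhere below 'parent' in the hierarchy,
// i.e. 'parent' is a strict prefix of 'child'.
bool isDescendantOf(const Path& child, const Path& parent) {
    if (parent.size() >= child.size()) return false;
    return std::equal(parent.begin(), parent.end(), child.begin());
}

// True if the two processes share the same parent (are siblings).
bool areSiblings(const Path& a, const Path& b) {
    if (a.size() != b.size() || a.empty()) return false;
    return std::equal(a.begin(), a.end() - 1, b.begin());
}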

Another main change is that actor leaf processes (comb) do not hold C functions any more; instead, the functions are held by the Process Network. An actor leaf process just points to a function in that list, enabling easy identification of data-parallel processes and reducing storage space in case they occur.

The Composite processes are a new class of processes, sharing some traits with both the Process Network and Leaf processes. As seen in Figure 7.2, they may contain leafs or other composites, but unlike the Process Network, their scope of vision includes only their first children. Their content is described by their name, which is the equivalent of the component name in the ForSyDe-SystemC design framework. By using this design framework property (further described in Section 7.2) it is now possible to identify data-parallel sets of processes by comparing their parents' component names.

The Ports now contain more information than in v0.1. They now encapsulate data type information, extracted directly from the XML representation. Thus, data type coherence does not have to be tested when a process network is built, since it is assured by the ForSyDe-SystemC design framework.

1 The nomenclature uses terms similar to the ones used in tree structures [Knuth, 1997].


Figure 7.2: Visual representation of the internal model in f2cc v0.2


Figure 7.3: Visual representation of the ParallelComposite process

Furthermore, ports belonging to actor leaf processes point directly to a C variable in the process' C function. In this way, model coherence, data coherence and enhanced access to information are enforced.

Because the "port-to-port" pattern had to be maintained, the access mechanisms had to remain unmodified. Therefore the Ports belonging to Composite objects were derived into a new class, the IOPort. An IOPort is similar to a Port, with the distinction that it has two connections: one outside the composite process and one inside it. The outside connection can lead either to sibling processes or to its first parent, while the inside connection can lead only to its first children. These restrictions are enforced by the internal mechanisms employed by the IOPort's methods.

Another compatibility feature is that composite processes have to be transparent as concerns the algorithms in v0.1. Therefore, port-to-port accesses offer the possibility of recursive access until the connected Leaf Port has been reached, ignoring the intermediate Composite IOPorts. The search direction is handled by the methods' internal mechanisms, which check the caller's hierarchical relation to the callee. Also, the methods for breaking connections offer the same recursive mechanisms, enabling the composite processes' transparency.
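A condensed sketch of this dual-connection idea follows. The member names are hypothetical and the real IOPort API is richer; in particular, a full implementation would pick the resolution direction from the caller's hierarchical relation to the composite, which is only hinted at here.

// Condensed sketch of the Port/IOPort distinction (not f2cc's real classes).
struct Port {
    Port* connected = nullptr;           // single peer, double-linked in practice

    // Follow connections recursively until a plain leaf Port is reached,
    // skipping over intermediate composite IOPorts (transparency).
    virtual Port* resolveLeafPort() { return this; }
    virtual ~Port() {}
};

struct IOPort : Port {
    Port* connectedOutside = nullptr;    // towards siblings / first parent
    Port* connectedInside  = nullptr;    // towards first children only

    Port* resolveLeafPort() override {
        // Simplification: a caller arriving from outside continues inwards.
        return connectedInside ? connectedInside->resolveLeafPort() : this;
    }
};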

In order to express parallelism internally, a new process type has been implemented in f2cc's internal model: the ParallelComposite. This process compensates for the lack of coalescedMapSY and parallelMapSY. It is structured as a Composite process, but it describes multiple instantiations of the same process. This is possible through a new property, the number of processes. This has multiple implications, the most important one being the correlation between the ports connected outside and the ones connected inside, as seen in Figure 7.3.

7.1.2 Model limitations and future improvements

Although this model improves usability and suffices for the desired functionality, it is rather inflexible for future development. Both the compatibility requirements and the new features made the implementation too complex to be scaled and handled efficiently. Also, since the time frame permitted implementing only a prototype for the component, some features still need to be added and existing ones need thorough testing and debugging.

Figure 7.4: Visualization of the cross-hierarchy connection mechanism by three examples. The processes are depicted by their tree structure. cpx denotes a composite process' ID, and lfx represents a leaf process' ID. An intersection between a red line and a dotted line represents a new IOPort that has to be automatically generated

For example, the design flow would be greatly aided by a mechanism for connecting two ports anywhere in the process network. The implications of such an action can be seen in Figure 7.4. For each transition to and from a composite process, an IOPort has to be generated. Its generation has to take into account the source where the connection method has been called. Based on the relation between the current process and the destination process, the framework should compute the position of the next IOPort.
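The core of such a mechanism is finding the lowest common ancestor of the two processes' hierarchy paths: every composite boundary crossed on either side of that ancestor needs one generated IOPort. A minimal sketch of that computation, written for this text and reusing the hypothetical Path type from the earlier sketch:

#include <cstddef>
#include <string>
#include <vector>

using Path = std::vector<std::string>;   // hierarchy path, root first

// Number of IOPorts a cross-hierarchy connection would have to generate:
// one per composite boundary crossed between the two leaf processes.
std::size_t ioPortsNeeded(const Path& a, const Path& b) {
    std::size_t common = 0;
    while (common < a.size() && common < b.size() && a[common] == b[common])
        ++common;                         // depth of the lowest common ancestor
    // Boundaries crossed going up from 'a' plus going down to 'b',
    // excluding the two leaf processes themselves.
    return (a.size() - common - 1) + (b.size() - common - 1);
}

// E.g. connecting lf000 (root/cp0/cp00/lf000) to lf11 (root/cp1/lf11)
// from Figure 7.4 yields 3: exit cp00, exit cp0, enter cp1.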

Another priority for future improvement is developing the link between ports and code variables even further. Ideally, data flow traceability should not stop at the process level; instead it should go even further, down to code level. By splitting the code into an abstract syntax tree (AST), the signals can be traced and analyzed from the process input to its output, a feature which may prove invaluable for model analysis.

Since introducing a new feature might require a high overhead in both development time and end performance, it is the author's belief that in order to continue the development of a design flow tool for ForSyDe, the current internal model has to be set aside in order to develop a more flexible one. This new internal framework has to satisfy three main requirements:

• it has to scale better with the new features that are continuously being developed in ForSyDe, and with custom features needed for different design flows;

• it has to provide an intuitive interface so that, ideally, a broader community of developers is able to contribute to the tool's development. The current interface is a good starting point;

• it has to be implemented with respect to high performance on very large models. Being the "backbone" of the tool set, performance is critical.

As presented in Appendix B, a proposed internal model will mainly use a fast tree library with storage facilities (most likely an XML or database library, like RapidXML2 or SQLite3) together with a graph library (like the Boost Graph Library4).

2 http://rapidxml.sourceforge.net/manual.html
3 http://www.sqlite.org/


Both library types offer the functionality and performance needed for the task at hand. The model can benefit from an XML database by using its inherent tree hierarchy, and from its data storage facilities, which can cope with the new or custom features introduced in ForSyDe. A graph library can aid the design flow with its fast out-of-the-box analysis algorithms, most needed in the design flow. The main task for future development is to bind these libraries into an intuitive API specific to ForSyDe models.

7.2 The intermediate model representation

The intermediate model representation is outputted by the ForSyDe modeling framework, and contains data extracted from the ForSyDe model. It encapsulates information about the process interconnections and the process constructors. Beginning with the ForSyDe-SystemC design framework, additional information is included regarding the data type transported by signals and arriving at / departing from ports.

7.2.1 f2cc approach

Based on the data encapsulated in the intermediate XML or GraphML files, f2cc builds an internal model using the functions included in the frontend module. Depending on which pieces of information are available, the design flow relies on extracted, inferred or assumed data.

The following paragraphs will present both the old and the new versions of f2cc, to justify and understand the context of the design decisions taken in the tool's development.

f2cc v0.1

A comprehensive documentation of the available information contained in the input GraphML files can be found in chapter 9 of [Hjort Blindell, 2012]. f2cc inputs a single GraphML file which contains a flat process network, as presented in Subsection 7.1.1.

The input file holds both structural information as XML nodes, and C code encapsulated by a data XML element. Since at the time of the tool's development the ForSyDe-Haskell modeling framework did not dump C code, the C functions had to be included manually for each actor process.

One important design decision taken in the development and implementation of f2cc v0.1 was inferring signal data types from the C code in the synthesis stage of its execution flow. This decision was influenced by the fact that there is no information regarding the data transported by signals or ports in the GraphML file. This fact is reflected in the internal model representation (Figure 7.1), since ports hold no data type information. Due to this lack of information, it was impossible to infer the sizes of array data types, thus sizes had to be provided manually in the GraphML file. Furthermore, structural information such as port direction had to be inferred from a strong naming convention, and any deviation from this convention results in a parse exception.

4 http://www.boost.org/doc/libs/1_53_0/libs/graph/doc/index.html


<port name="in">
    <!-- Array size provided by the user -->
    <data key="array_size">7</data>
</port>

Listing 7.1: Example of a port transporting an array in the GraphML intermediate format

Also, since the only function code present in the GraphML is C code, the only data types assumed and supported by f2cc v0.1 are the ANSI C basic data types.

f2cc v0.2

f2cc v0.2 uses the XML intermediate files dumped by ForSyDe-SystemC to harvest model data. Since this representation is very different from the ForSyDe-Haskell generated GraphML representation, assuring backward-compatibility was never an issue, and it will not be considered in future implementations either. A different frontend parser takes care of building an enhanced internal model from the input data available.

The ForSyDe-SystemC modeling framework dumps multiple XML files for one process network, one file for each composite process available. This efficiently reduces storage space for multiple instantiations of the same composite process and, as presented in Subsection 7.1.1, enables easy identification of data parallel sections.

Since the XML files contain data type information for ports and signals, and structural information is provided by explicit tags, there is no need for the inferences of v0.1. As can be seen in Listing 7.2 and in Figure 7.2, this information is present in the intermediate model from the early stages.

<port name="port_0" type="int" direction="in" bound_process="sub1" bound_port="port_0"/>

Listing 7.2: Example of a port transporting an integer in the XML intermediate format

ForSyDe-SystemC's introspection module dumps run-time type information (RTTI). The RTTI successfully identifies both ANSI C standard types and classes (custom structures). When it comes to STL types though, the information regarding the base data type wrapped by the template is lost. Since SystemC is a collection of C++ classes, and the ForSyDe-SystemC design framework often employs C++ STL types, a mechanism for preserving type information had to be provided. For the moment, static declaration of custom templates was employed, using the macro DEFINE_TYPE_NAME provided by ForSyDe-SystemC. This means that every time the designer uses a custom data type, he or she has to make sure to declare it. These names can afterwards be interpreted by the f2cc methods into an internal data type representation and translated into the ANSI C type necessary for CUDA C code synthesis.

DEFINE_TYPE_NAME(std::vector<abst_ext<double> >, "vector<abst_ext<double> >");

Listing 7.3: Example of using the static type name declaration during system design.


<!-- Without static data type declaration -->
<port name="port_0" type="St6vectorIN7ForSyDe8abst_extIdEESaIS2_EE" direction="in"
      bound_process="sampleVUnzip" bound_port="iport1"/>

<!-- With static data type declaration -->
<port name="port_0" type="vector<abst_ext<double> >" direction="in"
      bound_process="sampleVUnzip" bound_port="iport1"/>

Listing 7.4: The difference between not using and using the static type name declaration.

Concerning the data provided by the XML intermediate representation, some temporary methods for data size extraction have been provided, necessary for cost calculations; they are further presented in Section 7.4.

Unlike in v0.1, in v0.2 the function code for actor processes is extracted directly from the ForSyDe model, thus the designer does not have to provide the C code manually. The extraction process is further presented in Section 7.3.

As part of the implementation effort, an XML dumper class has been provided as well. It provides methods for dumping the internal model representation into an XML file similar to the intermediate ForSyDe model. Its usage can be seen in Appendix C.

7.2.2 Limitations and future improvements

The solutions in the previous subsection were presented as temporary, since they need further study and development before they can be considered syntactically and functionally correct and flexible enough to be included in a large-scale tool. Most of the improvements would imply increased and direct support from the ForSyDe design framework.

For example, since the ForSyDe-SystemC design framework is built over a C++ environment, the data types do not have fixed sizes. The current component, especially the analysis part, is highly dependent on information provided by the design framework. This means that if the system was designed on a different machine than the one that does the analysis (for example an x64 and an x86 machine), the results are likely to be erroneous. Due to this fact, a standardized ForSyDe set of data types should be developed (similar to u_int16, int32, etc.) and offered support for analysis. Also, design flows that target hardware backends like HDLs would greatly benefit from support for data types controlled at bit level.

Another feature that could benefit the whole design flow would be advanced support for complex template data types. This would mean either enhanced support for recognition and introspection of composite STL types or, as mentioned before, the development of a standardized ForSyDe set of fully analyzable data containers.

7.3 The process function code

As presented in Subsection 7.1.1, actor processes hold function code. This code stands as the basis for synthesizing code for the different ForSyDe backends. Currently C code is preferred for denoting these processes' functionality, since it is the most widely spread language for computing machines and most targeted platforms are equipped with C compilers.


7.3.1 f2cc approach

f2cc extracts the function code as text, analyzes it and encapsulates it into a CFunction object. The function body as such, since it is formed of C code, is left unmodified; it is relevant for the tool only in the final stage of the design flow, namely during backend code synthesis.

f2cc v0.1

Text manipulation functions are applied to the code extracted from the GraphML file. The C function is separated into body and header and stored in a CFunction object. The header is further analyzed and split into function name and variables, as in Figure 7.6. The header contains parameters as CVariable objects, each having an associated CDataType object. The objects structure all the information in such a way that it is accessible to the code generation mechanisms associated with these variables, as can be seen in the examples from Figure 7.5.

Name: sampIn    type: double    array: T    array size: 500    pointer: F    const: F

double *sampIn;
sampIn = (double*) malloc (500 * sizeof(double));

Figure 7.5: Generating C-style declaration code for a double array, based on information found in a CVariable object
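A sketch of how such declaration code could be generated from the stored fields follows. The struct is a hypothetical mirror of the CVariable/CDataType information, written for this text; it is not f2cc's actual implementation.

#include <iostream>
#include <sstream>
#include <string>

// Hypothetical mirror of f2cc's CVariable/CDataType information.
struct CVariable {
    std::string name;       // e.g. "sampIn"
    std::string type;       // e.g. "double"
    bool isArray;
    int arraySize;
};

// Emits a C-style declaration like the one shown in Figure 7.5.
std::string generateDeclaration(const CVariable& v) {
    std::ostringstream out;
    if (v.isArray) {
        out << v.type << " *" << v.name << ";\n"
            << v.name << " = (" << v.type << "*) malloc ("
            << v.arraySize << " * sizeof(" << v.type << "));";
    } else {
        out << v.type << " " << v.name << ";";
    }
    return out.str();
}

int main() {
    CVariable sampIn{"sampIn", "double", true, 500};
    std::cout << generateDeclaration(sampIn) << "\n";
}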

Figure 7.6: CFunction structure in f2cc v0.1 (function name, body, a list of input parameters with associated CDataTypes, and a return CDataType)

Figure 7.7: CFunction structure in f2cc v0.2 (function name, body, and lists of both input and output parameters with associated CDataTypes)

f2cc v0.2

f2cc v0.2 uses the same mechanism for storing a C function, and similar object containers for that purpose. Since code now has to be generated for composite processes as well as leaf processes, the functions are not restricted to only one output. Instead, a list of CVariable objects associated with output parameters is provided for each CFunction, as in Figure 7.7.

As mentioned in Section 7.2, the function code is extracted directly from the ForSyDe model. Since the ForSyDe-SystemC modeling framework uses C++ constructs, a basic set of methods for converting the function code into C code has been implemented, and included in a C parsing frontend. Currently, these methods employ only text manipulation, but they suffice for the desired functionality.


The function extraction algorithm is presented in Listing 7.5. The parsing methods are invoked every time a new comb process is built during the XML parsing. The source code file name is built from the process constructor function name. This means that the designer has to respect a set of conventions when naming both process functions and function code files, as suggested in Appendix A. The coding style suffers a few restrictions as well, also listed in Appendix A.

When the code parsing is complete, the associated CFunction object contains C code; connections between ports and variables are made immediately, and type sizes are adjusted accordingly. The extracted variables have ANSI C data types, and have direct equivalents in the function body. An example of extracting data type information from ForSyDe STL types is shown in Figure 7.8.

forsydeCode ← Read(source_file)
new_CFunction_obj ← empty CFunction()
equivalence_list ← empty list

for each line ∈ forsydeCode do
    if line ∈ function declaration then
        extract function_name, function_arguments from line
        for each argument ∈ function_arguments
            extract variable_name, variable_type from argument
            new_CVariable_obj ← CVariable(variable_name, variable_type)
            add new_CVariable_obj to the new_CFunction_obj input list / output list
    if line ∈ ("#pragma ForSyDe begin", "#pragma ForSyDe end") then
        rename macros with macro definitions in line
        add line to new_CFunction_obj body
    if line ∈ variable wrapping/unwrapping section then
        extract lhs_name, rhs_name from line
        add lhs_name, rhs_name to equivalence_list

for each equivalence ∈ equivalence_list
    analyze equivalence
    rename pointed variable from new_CFunction_obj variable lists with equivalent name

return new_CFunction_obj

Listing 7.5: Pseudocode for ForSyDe function code parser

ForSyDe function code (upper left):

    const abst_ext<std::array<double, BUFFER_SIZE+3>> &in_state
    std::array<double, BUFFER_SIZE+3> state = unsafe_from_abst_ext(in_state);

XML intermediate representation (lower left):

    <port name="iport1" type="std::array&lt;double&gt;" size="216" direction="in"/>

Extracted CVariable information (right):
Name: in_state / state   type: double   array: T   array size: 27   pointer: F   const: T

Figure 7.8: Extracting variable information (right) from the ForSyDe function code (upper left) and the XML intermediate representation (lower left). The variable is later renamed according to its occurrence in the function body (middle left)
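Note how the port size of 216 in the XML (given in bytes) becomes an array size of 27 elements once divided by sizeof(double). A minimal sketch of how such type information could be recovered from a resolved std::array type string follows; the names and the regex-based approach are illustrative assumptions, not f2cc's actual frontend code.

    // Illustrative sketch (assumed names, not the actual f2cc frontend):
    // recovering C data-type information from a resolved std::array
    // type string, as in Figure 7.8.
    #include <string>
    #include <regex>

    struct CDataTypeInfo {
        std::string base_type;   // e.g. "double"
        bool        is_array;
        int         array_size;  // number of elements, not bytes
    };

    // Accepts e.g. "std::array<double, 27>" (template arguments resolved).
    bool parseArrayType(const std::string &s, CDataTypeInfo &out) {
        static const std::regex re(
            R"(std::array<\s*([A-Za-z_]\w*)\s*,\s*(\d+)\s*>)");
        std::smatch m;
        if (!std::regex_search(s, m, re)) return false;
        out.base_type  = m[1];
        out.is_array   = true;
        out.array_size = std::stoi(m[2]);
        return true;
    }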

7.3.2 Future improvements

The current solution for code extraction, although it suffices for the proposed tasks, is very limited and inflexible. Since it is purely based on text parsing methods, a number of restrictions have to be respected, restrictions that limit the usability of the design framework.

A proper way to store function information is its AST or another description of the functionality, rather than the code text. An intermediate form for the code would thus greatly benefit the whole design flow. As suggested in Subsection 7.2.2, dataflow traceability through code may aid the design space exploration further than the current process-network-based analysis.

Because developing an intermediate model for code analysis may prove too resource-consuming, support for C code extraction needs to be further improved. Currently, ForSyDe-specific constructs residing in the body, such as the usage of absent values, are ignored or removed. A proper way of generating semantically correct C constructs has to be developed.

Also, STL data type support is currently rudimentary. A proper way of dealing with templates and containers has to be developed. For now, only std::array is supported, since it is static and its container size reflects the number of elements transported; thus structural data is easily extracted.

ForSyDe's range of applications can be broadened by implementing support for dynamically-sized vectors. Although they would greatly increase the difficulty of system analysis, heuristic algorithms associated with vectors may be developed. Also, extending the idea of vectors into processes with dynamic parameters (for example parallel processes) may be associated with mapping to systems with dynamic resource allocation (for example run-time thread spawning).

7.4 The GPGPU platform model

In Chapter 8 a heuristic algorithm for load balancing is presented. This algorithm is based on a platform model which is not currently described by any available ForSyDe tool. Due to the limited time frame, the current project developed a minimal set of descriptive empirical attributes that can provide enough information to support the load balancing algorithm.

An XML file has been written to model the GPGPU as a platform. Currently this file contains only seven nodes, each holding one attribute. These attributes are constants that roughly describe the platform's execution patterns.
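The sketch below suggests what such a platform file could look like. The node names and coefficient values are hypothetical; only the count and the purpose of the seven constants (the two execution coefficients and the five transfer coefficients introduced in the following subsections) are taken from the text.

    <!-- Hypothetical sketch of the GPGPU platform model file;
         node names and values are illustrative, not the actual format. -->
    <platform name="cuda_gpgpu">
      <k_seq value="1.0"/>   <!-- sequential (host) execution coefficient -->
      <k_par value="0.05"/>  <!-- parallel (device) execution coefficient -->
      <k_H2D value="0.8"/>   <!-- host-to-device transfer coefficient -->
      <k_D2H value="0.8"/>   <!-- device-to-host transfer coefficient -->
      <k_D2D value="0.1"/>   <!-- device inter-thread transfer coefficient -->
      <k_T2T value="0.2"/>   <!-- device intra-thread transfer coefficient -->
      <k_H2H value="0.3"/>   <!-- host-to-host transfer coefficient -->
    </platform>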

7.4.1 Computation costs

The computation costs describe the running time of individual processes. These costs are inferred from rough approximations of the processes' running time in the context of their ForSyDe model running on a test platform (e.g. a sequential CPU). Although the execution on a GPGPU is influenced by many more factors than provided, these coefficients suffice for proof-of-concept purposes.

Calculating cost values implies two factors: an average run-time estimation coefficient and a platform cost coefficient. The calculation is done with one of Equations 7.1, 7.2 and 7.3.


$$C_{leaf,seq} = k_{seq} \cdot C_{leaf} \qquad C_{leaf,par} = k_{par} \cdot C_{leaf} \tag{7.1}$$

$$C_{comp,seq} = \sum_{proc \,\in\, comp} C_{proc,seq} \qquad C_{comp,par} = \sum_{proc \,\in\, comp} C_{proc,par} \tag{7.2}$$

$$C_{pcomp,seq} = N \cdot C_{comp,seq} \qquad C_{pcomp,par} = C_{comp,par} \tag{7.3}$$

For a leaf process, the cost calculations are straightforward (Equation 7.1). The cost associated with sequential execution on the host CPU ($C_{leaf,seq}$) is obtained by multiplying the cost coefficient for running the process on a sequential platform ($k_{seq}$) with the process' run-time estimation coefficient ($C_{leaf}$). For computing the cost associated with parallel execution on the GPGPU device ($C_{leaf,par}$), a parallel platform cost coefficient is used instead ($k_{par}$).

For a composite process (Equation 7.2), both sequential ($C_{comp,seq}$) and parallel ($C_{comp,par}$) execution costs are calculated by summing the associated costs of all contained processes. For the communication happening internally between the contained processes, a different type of analysis is employed, as will be shown in the next section.

A parallel composite process (Equation 7.3) gets its sequential execution cost ($C_{pcomp,seq}$) by multiplying the number of processes ($N$) with the sequential execution cost of a single instantiation of this process ($C_{comp,seq}$). When calculating the parallel execution cost though, $N$ is ignored. This implies the assumption that the parallel platform (GPGPU) is infinitely parallel, with enough resources to execute all the process threads at once. Although not realistic, this assumption is sufficient for demonstrating the load balancing algorithm.

The cost coefficients $k_{seq}$ and $k_{par}$ are extracted from the platform model, while the rest of the factors are included in the ForSyDe intermediate model representation. The run-time execution coefficients for leaf processes ($C_{leaf}$) are provided manually by the designer, by writing a new attribute for the leaf_process nodes: cost. An example is shown below.

<leaf_process cost="260">
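For illustration, the cost rules of Equations 7.1 through 7.3 condense into a few functions. This is a minimal sketch under assumed names; it is not the component's actual implementation.

    // Sketch of Equations 7.1-7.3 (assumed names). k_seq and k_par come
    // from the platform model, c_leaf from the "cost" attribute above.
    #include <vector>

    struct Cost { double seq; double par; };

    // Equation 7.1: leaf costs scale the designer-provided estimate.
    Cost leafCost(double c_leaf, double k_seq, double k_par) {
        return { k_seq * c_leaf, k_par * c_leaf };
    }

    // Equation 7.2: a composite sums the costs of its contained processes.
    Cost compositeCost(const std::vector<Cost> &children) {
        Cost c{0.0, 0.0};
        for (const Cost &child : children) {
            c.seq += child.seq;
            c.par += child.par;
        }
        return c;
    }

    // Equation 7.3: a parallel composite multiplies the sequential cost
    // by N, while the parallel cost ignores N (the "infinitely parallel"
    // assumption).
    Cost parallelCompositeCost(const Cost &one_instance, int n) {
        return { n * one_instance.seq, one_instance.par };
    }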

7.4.2 Communication costs

The communication costs describe the time necessary for data to be transferred between computing resources. Unlike the computation costs, the main parameter, namely the size of the transported data, is precisely calculated, not approximated. Since part of the work effort was to implement a mechanism that extracts signal sizes⁵, the only assumed information is the transfer mechanism costs, which are synthesized into a set of transfer coefficients and provided in the platform model file. The operations revolving around transfer costs follow the pattern in Equation 7.4.

$$C_{transfer} = k_{transfer} \cdot n_{elements} \cdot size_{datatype} \tag{7.4}$$

where $C_{transfer}$ is the communication cost associated with one type of transfer, $k_{transfer}$ is the transfer mechanism cost for the same type of transfer, $n_{elements}$ is the fixed array size and $size_{datatype}$ is the signal data type size in bytes.

⁵ For this purpose a modified ForSyDe library has been used, with enhanced introspection features. Whether or not the modifications should feature in future releases of ForSyDe remains to be studied.


The types of transfer described in f2cc v0.2 are: host-to-device (H2D), device-to-host (D2H), device inter-thread (D2D), device intra-thread (T2T), and host-to-host (H2H).
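A sketch of Equation 7.4 over these five transfer types is shown below; the names are assumptions, and the coefficients would be read from the platform model file. For instance, transferring the 27-element double array of Figure 7.8 from host to device would cost $k_{H2D} \cdot 27 \cdot 8$.

    // Sketch of Equation 7.4 (assumed names); the coefficient array k
    // would be filled from the platform model file.
    enum class Transfer { H2D, D2H, D2D, T2T, H2H };

    double transferCost(Transfer type, int n_elements, int size_datatype,
                        const double k[5]) {   // k indexed by Transfer
        return k[static_cast<int>(type)] * n_elements * size_datatype;
    }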

7.4.3 Future improvements

As presented, the current solution for modeling GPGPU execution is minimal and destined only for proof-of-concept purposes. In order to obtain real improvements, two areas need to be further developed: the model analysis part and the platform description part.

Currently there is no ForSyDe analysis tool available, other than the run-time introspection and model testbench present in the ForSyDe-SystemC modeling framework. In Appendix B a separate tool is proposed for static and dynamic analyses. With its help, a more elaborate and proper run-time cost extraction is possible. Since process execution is influenced by different factors on different platforms, this aspect has to be reflected as correctly as possible in the analysis phase. For static analysis, a high-level code representation would be helpful, while for dynamic code analysis, better costs may be extracted from running partial testbenches on virtual platforms or target platforms.

Also, a thorough platform description has to be modelled in order to have realistic predictions for both execution and data transfer. The current cost calculations (Equations 7.1 through 7.4) take into account only a set of empirical constants. These equations need to be developed to depend on real architectural or implementation traits, or at least on statistical values.


Chapter 8

Design Flow and Algorithms

This chapter will present the main algorithms implemented by the current component. These algorithms are tightly connected to the component framework presented in Chapter 7, and many of the design decisions taken in their development depend on it. Following the last chapter's model, after describing the algorithms used, solutions for future improvement will be mentioned.

8.1 Model modifier algorithms

The first part of the design flow is presented in Chapter 7; it implies code and model parsing and extraction of the data necessary for building an internal model. It shall be assumed that the internal model has been built and verified by the new frontend, and that there is enough information to continue with the second major step of the design flow: the model modifications.

These modifications are necessary in order to synthesize the desired CUDA code, and are the results of decisions made after a thorough search and analysis of the intermediate model.

8.1.1 Identifying data-parallel processes

Identifying data-parallel processes is the first step of the model modifier algorithms. As presented in [Hjort Blindell, 2012], there are four approaches for this task:

1. let the software component decide where and when potential data parallelism can be exploited, and always execute it on the GPGPU device.

2. same as approach 1, but execute on the GPGPU only the parts of the code where there is sufficient data for such execution to be beneficial, and let the component take this decision.

3. let the model developer decide where and when potential data parallelism can be exploited, and always execute it on the GPGPU device.

4. same as approach 3, but execute on the GPGPU only the parts of the code where there is sufficient data for such execution to be beneficial, and let the component take this decision.


f2cc v0.1 applies approach 1, and identifies data-parallel sections by searching for unzipx-map-zipx patterns. In f2cc v0.2, methods for tackling approach 2 are implemented, laying a foundation for future development. The identification is made using different methods than in v0.1, since the model now holds more information.

Future development will consider approaches 3 and 4 as well, but at the moment they are out of the scope of this Master's Thesis project.

Implemented algorithm

The pseudo-code in Listing 8.1 describes the top level of the data-parallel process identification algorithm. It reflects one early design decision that was forced by the framework's context: flattening the process network before any analysis. Since flattening the process network is a decision imposed by the current framework, it is not optimal: it reduces the advantages of manipulating composite processes and renders a high algorithm time complexity, as shown in the following paragraphs. An improved algorithm associated with an improved framework is presented at the end of this section.

root ← ProcessNetwork.root

for each composite ∈ root.composites do
    FlattenCompositeProcess(composite)

equivalent_comb_groups ← ExtractEquivalentCombs(root)
for each comb_group ∈ equivalent_comb_groups do
    pcomp ← createParallelComposite(root, comb_group)
    add pcomp to root, ProcessNetwork

equivalent_leaf_groups ← ExtractEquivalentLeafs(root)
while ∃ unresolved equivalent_leaf_groups do
    for each leaf_group ∈ equivalent_leaf_groups do
        pcomp ← createParallelComposite(root, leaf_group)
        add pcomp to root, ProcessNetwork
    equivalent_leaf_groups ← ExtractEquivalentLeafs(root)

RemoveRedundantZipsUnzips(root)

Listing 8.1: Top level for the algorithms for identifying data parallel sections

The algorithms in Listing 8.2 expand the functions from Listing 8.1. They simplify the implementation mechanisms in order to express only their functionality, and may not fully reflect the implementation details. The algorithm results in a flat process network containing only leafs for non-parallel processes and parallel composites for potentially parallel processes.

As its name suggests, FlattenCompositeProcess() destroys the hierarchy of a composite process and brings all leaf processes to the same level. It is applied recursively to all child branches, and it systematically raises the hierarchy by one level for all leaf processes at the end of these branches. The method's results can be observed in Figure C.1 (input model) and Figure C.2 (modified model) from Appendix C.

Since this algorithm is a recursive one, its time complexity scales very quickly with $O(2^N)$, where $N$ is the number of composite processes, and it strongly depends on the complexity of the hierarchy tree. Another aspect that has to be regarded is that, due to the intermediate connections that have to be destroyed when moving a leaf process, the algorithm adds $O(n)$, where $n$ is the number of leaf processes, which scales with the number of ports contained by the process.

function FlattenCompositeProcess(composite)
    for each child_composite ∈ composite.composites do
        FlattenCompositeProcess(child_composite)
    for each child_leaf ∈ composite.leafs do
        move child_leaf to composite.parent
        redirect child_leaf.data_flow through child_leaf

function ExtractEquivalentCombs(composite)
    grouped_equivalent_processes ← empty list of lists
    table_of_equivalences ← empty combset (identifiers, comb lists)
    for each leaf ∈ composite.leafs do
        if leaf is comb and leaf ∈ table_of_equivalences then
            function_name ← table_of_equivalences.identifier(leaf)
            if FoundDependencyUp(/Down)stream(leaf, equivalence_list) then
                new_pair ← new pair (function_name, leaf)
                add new_pair to table_of_equivalences
            else add leaf to table_of_equivalences.lists(function_name)
    for each equivalence_list ∈ table_of_equivalences do
        if equivalence_list.size() > 1 then
            add equivalence_list to grouped_equivalent_processes
    return grouped_equivalent_processes

function ExtractEquivalentLeafs(composite)
    grouped_equivalent_processes ← empty list of lists
    table_of_equivalences ← empty combset (identifiers, leaf lists)
    for each leaf ∈ composite.leafs do
        if leaf is not zipx or unzipx then
            if leaf is connected to a zipx or unzipx then
                identifier ← the ID of the connected zipx or unzipx
                if identifier ∈ table_of_equivalences then
                    add leaf to table_of_equivalences.list(identifier)
                else
                    new_pair ← new pair (identifier, leaf)
                    add new_pair to table_of_equivalences
    for each equivalence_list ∈ table_of_equivalences do
        if equivalence_list.size() > 1 then
            add equivalence_list to grouped_equivalent_processes
    return grouped_equivalent_processes

function createParallelComposite(parent, group_of_equivalent_processes)
    for each leaf ∈ group_of_equivalent_processes do
        count number_of_processes
    new_pcomp ← new ParallelComposite(number_of_processes)
    add new_pcomp to parent
    reference_process ← group_of_equivalent_processes.pop_front()
    integrate reference_process into new_pcomp
    zips_and_unzips ← new Zipx/Unzipx() ∀ reference_process.ports
    redirect reference_process.data_flow through zips_and_unzips
    for each process ∈ group_of_equivalent_processes do
        redirect process.data_flow through zips_and_unzips
        erase process
    return new_pcomp

function FoundDependencyUp(/Down)stream(leaf, to_compare_with)
    mark leaf as visited
    if leaf is delay then return false
    for each port ∈ leaf.in/out_ports do
        connected_leaf ← port.connected_leaf_port.process
        if connected_leaf has been visited then return false
        if connected_leaf ∈ to_compare_with then return true
        else if FoundDependencyUp(/Down)stream(connected_leaf, to_compare_with) then return true
    return false

Listing 8.2: Methods used by data parallel sections identification algorithm


Since all the composite processes are destroyed, the $N$ term disappears from future complexity calculations.

ExtractEquivalentCombs() is a method that searches for all the comb processes in the system and groups them into equivalent processes. A set of equivalent combs that are potentially data-parallel point to the same function and are not data-dependent. In order to identify data-parallel processes, a system of lookup tables is employed, along with an additional parse through the process' data path to search for dependencies, through the function FoundDependencyUp(/Down)stream(). This function is still experimental, and it needs to be further studied whether the dependency verification should target the whole data path between the process network's input and its output, or only the data path between two delay processes. Currently this feature may be activated through a flag. The method's behavior can be observed in Figure C.3 from Appendix C.

This function traverses a flat process network for each comb process and could be greatly enhanced through systematic groupings of processes. Since all processes have to be verified as to whether they are combs, the search time scales with $O(n)$. And for every comb process found, a dependency check is activated which essentially parses the whole process network, with a worst-case scenario of $O(n)$ (since visited processes are ignored). Thus the full time complexity scales with $O(n^2)$. Another list swipe is performed at the end of the method, where all the lists in the built lookup table are grouped into a list of lists, but its complexity can be considered negligible compared to the process network parsing. Also, with a correct lookup table system, verifying whether a process is found in a group is reduced to $O(1)$. This holds for the following methods as well, thus this aspect will be neglected.

ExtractEquivalentLeafs() is similar to the previous method, but instead of parsing the full data path, only neighbour processes are visited. This method identifies leaf processes that have the same neighbours and groups them into separate lists, so that they too become parallel composites. This aids in simplifying process networks that contain potentially parallel processes which do not necessarily respect the unzipx-map-zipx pattern, but may have a more complex one, as in Figure 8.1. The method's effect upon the test model can be observed in Figure C.4 from Appendix C. This method is called systematically until the component can make sure that there is no further potential for data parallelism, since grouping parallel composites may generate further opportunities for parallelism.

Figure 8.1: Grouping potentially parallel processes with the same source(s) and target(s) (diagram: eight comb processes placed between two unzipx-zipx pairs are grouped into four pcomp[2] parallel composites)

Since the potentially parallel combs have been grouped into parallel composite processes, parsing the system will take less time. Thus, the worst-case scenario would run the method in $O(n)$ time, and the nested search would depend only on the number of processes, which will be named $n_p$. The full time complexity of this method is therefore $O(n \cdot n_p)$.


createParallelComposite() is a method that mainly integrates a set of equivalent and potentially parallel leaf processes into a parallel composite process. The integration implies creating the parallel composite, moving the leaves inside the new process, redirecting the data flow, and assuring data type and process integrity. To make the new parallel composite transparent, all its input and output ports are connected to the rest of the process network through sets of zip and unzip processes. Thus, when redirecting the data flow through the new parallel composite, these zips/unzips are used instead, aiding the future identification of potentially data-parallel processes. The results of this method applied upon the example model can be seen in both Figure C.3 and Figure C.4 from Appendix C.

As seen in Listing 8.1, this method is called for every set of equivalent processes found, and is continually called until there are no equivalent processes left. For each group of potentially parallel processes, the method's time complexity scales with $O(n \cdot n_p)$, depending on the number of equivalent processes, since for every one of them the data flow needs to be redirected.

function RemoveRedundantZipsUnzips(composite)
    for each leaf ∈ composite.leafs do
        if leaf is zipx or unzipx then
            equivalence_set ← empty combset (connected_id, current_id)
            for each port ∈ leaf.in(out)_ports do
                if port.connected_port.process ∈ equivalence_set then
                    integrate port.signal into equivalence_set.signal
                    erase port.connected_port, port
                else
                    new_equivalence ← new pair (connected_id, current_id)
                    add new_equivalence to equivalence_set
    for each leaf ∈ composite.leafs do
        if leaf is zipx or unzipx and leaf has only one in(out) port then
            redirect flow ignoring leaf
            erase leaf

Listing 8.3: Pseudo-code for RemoveRedundantZipsUnzips() method

The final method in this algorithm is RemoveRedundantZipsUnzips(), listed in Listing 8.3. It is used for grouping bundles of signals together into arrays, rendering fewer redundant data paths. After this grouping is made, the zipx and unzipx processes with only one connection are removed from the process network, since they do not play any role. The method's effect can be observed by comparing Figure C.4 with Figure C.5.

Since this method consists of two passes over the full list of leaf processes, the execution time would be $O(n+n)$; thus the time complexity scales with $O(n)$.

Proposed algorithm

As seen in Listing 8.1, the algorithm used for identifying data-parallel processes scales quickly when increasing the data set. Its time complexity grows exponentially with the number of composite processes, and predominantly quadratically with the number of leaf processes (due to repeated parsings of the full process network, for each leaf process). We propose an algorithm for future development that would increase the execution performance of the current identification algorithm, offering a logarithmic growth rate for both leaf and composite processes. This algorithm is described in Listing 8.4.

Although at first glance the algorithm seems to employ more nested loops than the previous one, this aspect is deceiving. The process network tree is parsed and flattened from root to branches, not vice versa.


root ← ProcessNetwork.root

while ∃ root.composites do
    equivalent_process_groups ← ExtractEquivalentProcesses(root)
    while ∃ unresolved equivalent_process_groups do
        for each process_group ∈ equivalent_process_groups do
            pcomp ← createParallelComposite(root, process_group)
            add pcomp to root, ProcessNetwork
        equivalent_process_groups ← ExtractEquivalentProcesses(root)

    for each composite ∈ root.composites do
        move composite.leafs and composite.composites to root
        redirect data flow through composite.leafs and composite.composites
        erase composite

RemoveRedundantZipsUnzips(root)

Listing 8.4: Top level for the proposed algorithm for identifying data parallel sections

For each level, all data parallelism is systematically identified, resolved and grouped for that specific level, after which the composite information along with its hierarchy is destroyed.

Since potentially data-parallel processes are identified early on, before reaching the branches, each composite that is transformed into a parallel composite "cuts" a branch from the hierarchical tree, thus all methods associated with that branch are skipped. The leaf processes in the ramifications are also destroyed and synthesized into the new parallel composites. This gives the algorithm a time complexity with a curve proportional to $O(\log n)$.

8.1.2 Optimizing platform mapping

As mentioned in Subsection 8.1.1, the current component tries to identify sections where execution on a GPGPU could be beneficial. This is done through a set of cost calculations, and it employs the platform model described as part of the component framework in Section 7.4. The algorithm is described by the pseudo-code in Listing 8.5.

root ← ProcessNetwork.root

while optimization is not finished do
    for each process ∈ root.processes do
        current_cost ← total_cost(process on current platform)
        changed_cost ← total_cost(process on changed platform)
        if changed_cost < current_cost then
            map process on changed platform
            mark optimization as not finished

Listing 8.5: Algorithm for platform optimization

The algorithm assumes that the data-parallel processes were identified and grouped into parallel composites. By default, all leaf processes are mapped for host (i.e. sequential execution) and parallel composite processes are mapped for device (i.e. parallel execution). The algorithm parses all processes in the process network root, and calculates their total cost assuming that they are mapped on either host or device. If the total cost on a changed platform ends up being less than the cost on the current platform, then that process' mapping directive is changed. This triggers another parse through the process network at the end of this algorithm, since new opportunities for optimizations may arise after the change.


In the worst-case scenario, all processes systematically change platforms, rendering a time complexity of $O(n^2)$², but this is very unlikely. In the average case the algorithm is finished after $O(n)$ or $O(2n)$.

$$C_{total,platform} = C_{proc,platform} + \sum_{port \,\in\, proc} C_{transfer,platform} \tag{8.1}$$

The total cost calculation is done with Equation 8.1. The execution cost $C_{proc,platform}$ is calculated depending on the platform with either Equation 7.1 or Equation 7.3. The transfer cost $C_{transfer,platform}$ is calculated with Equation 7.4, where $k_{transfer}$ can be $k_{H2D}$, $k_{D2H}$, $k_{D2D}$ or $k_{H2H}$, depending on the process' placement in the network.
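As an illustration of Listing 8.5 combined with Equation 8.1, the sketch below re-maps processes while any single move lowers the total cost. It simplifies the real component by treating each process' per-platform transfer sums as fixed, whereas in reality they depend on the placement of neighbouring processes; all names are assumptions.

    // Sketch of the platform mapping optimization (assumed names,
    // simplified transfer model).
    #include <vector>

    struct Proc {
        bool on_device;                 // current mapping directive
        double cost_host, cost_device;  // execution costs (Eq. 7.1/7.3)
        double xfer_host, xfer_device;  // summed port transfer costs (Eq. 7.4)
    };

    // Re-map processes while any move lowers the total cost of Eq. 8.1.
    void optimizeMapping(std::vector<Proc> &procs) {
        bool changed = true;
        while (changed) {
            changed = false;
            for (Proc &p : procs) {
                double current = p.on_device ? p.cost_device + p.xfer_device
                                             : p.cost_host + p.xfer_host;
                double swapped = p.on_device ? p.cost_host + p.xfer_host
                                             : p.cost_device + p.xfer_device;
                if (swapped < current) {    // moving the process pays off
                    p.on_device = !p.on_device;
                    changed = true;         // may open new opportunities
                }
            }
        }
    }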

8.1.3 Load balancing the process network

The main purpose of this component is to load balance the execution of a process network in order to efficiently map it to a parallel platform that supports time-parallel computation (see Subsection 3.2.2). The outcome should be a parallel execution of the process network in a pipelined fashion, as presented in Section 4.4. In order to do so, the critical section of the process network has to be identified.

1 root ← ProcessNetwork.root
2 datapaths ← ExtractDataPaths(root)
3 quantum_cost ← FindCriticalCost(root)
4 contained_sections ← ExtractAndSortContainedSectionsByCost(datapaths)
5 for each contained_section ∈ contained_sections do
6     SplitPipelineStages(contained_section)
7     if cost was modified then goto line 5

Listing 8.6: Top level for the algorithm for load balancing

This algorithm can be reused in many design flows that target platforms which provide resources for time parallelism. In our case, this platform is the GPGPU device, but it may very well be used for other platforms that may benefit from load balancing pipeline stages.

This algorithm is straightforward, and the only source of time complexity scaling at the top level is the parsing through contained sections at the end. The guard at line 7 can be activated only once, if the quantum cost belongs to a process' execution, and the calculation algorithm omits adding transfer costs (also unbreakable). The following paragraphs will present the main methods used by this algorithm.

Data paths extraction

The load balancing algorithm implies repeated analyses upon the paths followed by data. For this reason, a sensible decision would be to extract the data paths from the beginning and apply the algorithm upon them, instead of parsing the process network every time an analysis needs to be done.

2since the equivalent processes have been erased and the model contains only grouped parallel, the initial numberof processes is not relevant any more. From now on n will denote the new (reduced) number of processes


function ExtractDataPaths(composite)
    group_of_paths ← empty group (datapaths)
    for each port ∈ composite.out_ports do
        group_of_paths.append(ParsePath(port.connected_port.process, empty_path, root))
    return group_of_paths

function ParsePath(process, path, root)
    mark process as visited ∈ path
    new_group_of_paths ← empty group (datapaths)
    new_path ← path
    add process to new_path
    for each port ∈ process.in_ports do
        next_process ← port.connected_port.process
        if next_process is root then
            add new_path to new_group_of_paths
        else
            if next_process was visited in the same path then
                mark new_path as loop
                add new_path to new_group_of_paths
            else new_group_of_paths.append(ParsePath(next_process, new_path, root))
    return new_group_of_paths

Listing 8.7: Method for data paths extraction, used by load balancing algorithm

ExtractDataPaths() solves this task and is described by the pseudo-code in Listing 8.7. It parses through the process network starting from its outputs and adds a new data path every time it reaches either the inputs or a visited process. When reaching a visited process, it marks the current data path as a loop, since loops imply further analyses in future steps of the algorithm. This results in an unrolled loop, as in Figure 8.3.

Figure 8.2: Building individual data paths (a network of processes A through H is decomposed into the linear paths B-C-E-G-H, A-C-E-G-H, B-C-E-F-H, A-C-E-F-H, ...)

Figure 8.3: Loop unrolling (a loop through processes A-E is unrolled into the linear path D-E-A-B-C-D)

The method returns a set of linear paths, as in Figure 8.2. Its time complexity is difficult to describe, since it depends on the result of the previous algorithms. Had the method RemoveRedundantZipsUnzips() from Subsubsection 8.1.1 not been applied, this method would have scaled with $O(2^n)$, analysing and extracting many redundant paths. Instead, we can affirm that this method currently executes somewhere between $O(n^2)$ for a network filled with ramifications and $O(n)$ for a straightforward network. Again, the costs for searching through groups and verifying whether processes were visited are ignored, since they are implemented with combsets as lookup tables, rendering their time complexity $O(1)$.


Computing the quantum cost and the number of bursts

This section describes the quantum cost, a cost that cannot be split any further. Based on this cost, the load balancing algorithm will have to equilibrate the process network based on execution and transfer costs, so that the pipeline stages mirror the critical section for an optimum performance versus resource distribution ratio.

Another key parameter introduced at this stage is the number of data bursts. It can be defined as the number of interleaved execution tracks that can be performed in parallel, similar to CUDA streams. Splitting the data to be executed into separate streams has the advantage of overlapping communication with computation, but in order to be advantageous, these two mechanisms have to have the same execution time (see Figure 4.3). The optimum distribution of data streams is strongly dependent on the ratio between the maximum communication cost and the maximum computation cost.
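The mechanism resembles the classic CUDA streams pattern sketched below, where bursts issued on separate streams allow the copy engine and the compute engine to work concurrently. This is a generic illustration, not f2cc output; it assumes n >= nbursts, and for true overlap the host buffers would also have to be pinned (allocated with cudaMallocHost).

    // Generic CUDA sketch (not f2cc output): splitting work into
    // nbursts interleaved bursts so transfers overlap kernel execution.
    #include <cuda_runtime.h>

    __global__ void process_kernel(const double *in, double *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0;    // stand-in for a comb function
    }

    void run_bursts(const double *h_in, double *h_out, int n, int nbursts) {
        double *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(double));
        cudaMalloc(&d_out, n * sizeof(double));

        cudaStream_t streams[16];           // assumes nbursts <= 16
        for (int b = 0; b < nbursts; ++b) cudaStreamCreate(&streams[b]);

        int chunk = (n + nbursts - 1) / nbursts;
        for (int b = 0; b < nbursts; ++b) {
            int off = b * chunk;
            int len = (off + chunk <= n) ? chunk : n - off;
            // the H2D copy of burst b overlaps the kernel of burst b-1
            cudaMemcpyAsync(d_in + off, h_in + off, len * sizeof(double),
                            cudaMemcpyHostToDevice, streams[b]);
            process_kernel<<<(len + 255) / 256, 256, 0, streams[b]>>>(
                d_in + off, d_out + off, len);
            cudaMemcpyAsync(h_out + off, d_out + off, len * sizeof(double),
                            cudaMemcpyDeviceToHost, streams[b]);
        }
        cudaDeviceSynchronize();
        for (int b = 0; b < nbursts; ++b) cudaStreamDestroy(streams[b]);
        cudaFree(d_in); cudaFree(d_out);
    }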

The method FindCriticalCost() from Listing 8.6 performs the above-mentioned tasks. It involves browsing the processes in the network once and the data paths once. Among the data paths, only the ones containing loop sections are analyzed; the others are ignored. Judging by this, we can affirm that the method's time complexity scales with $O(n)$ in the average case, and $O(n^2)$ in the very unlikely case when all data paths are full loops that enclose all the processes in the network.

This method's main task is to search and apply Equations 8.2 through 8.6, and to extract the maximum costs afterwards through Equations 8.7 and 8.8. The main factors are:

• $C_{max,H2D}$ and $C_{max,D2H}$ are the transfer costs between host and device, which may constitute the main bottleneck in many applications. Since this cost is "unbreakable", it is a candidate as the quantum in pipelining.

• $C_{max,par}$ is the critical computation cost among the processes mapped for a device with resources for time parallelism. In this case, this platform is the GPGPU device.

• $C_{max,seq}$ is the total cost of processes running on a platform that does not offer resources for time parallelism. In our case, this platform is the CPU host. Since this execution cannot be split into pipeline stages, it may very well be considered a bottleneck and treated as such.

• $C_{max,\Delta}$ is the maximum cost of the processes in a loop. Since the current component is not equipped with an algorithm for splitting loops further while preserving the system's semantics, it is safe to assume that loops are allowed to be split into pipelined sections only as many times as there are delay processes available. Further methods will take this fact into account.

• $C_{max,comp}$ is the maximum computation cost, describing the critical execution time in the process network.

• $C_{max,comm}$ is the maximum communication cost, describing the critical transfer time in the process network.


$$C_{max,H2D} = \max(C_{H2D}) \tag{8.2}$$

$$C_{max,D2H} = \max(C_{D2H}) \tag{8.3}$$

$$C_{max,par} = \max(C_{proc,par}) \tag{8.4}$$

$$C_{max,seq} = \sum_{proc \,\in\, seq} C_{proc,seq} \tag{8.5}$$

$$C_{max,\Delta} = \max_{\forall loop}\left(\frac{\sum_{proc \,\in\, loop} C_{proc,par}}{n_\Delta + 1}\right) \tag{8.6}$$

$$C_{max,comp} = \max(C_{max,par}, C_{max,seq}, C_{max,\Delta}) \tag{8.7}$$

$$C_{max,comm} = \max(C_{max,H2D}, C_{max,D2H}) \tag{8.8}$$

After the critical costs are calculated, a re-evaluation is done in order to find the optimum number of bursts and to fix the quantum cost $Q$. The number of bursts is found by applying Equation 8.9. The equation can be justified by the fact that splitting the execution into pipeline stages on a platform where the main bottleneck is data transfer may prove useless if the cost for execution is greater than the cost for transfer. In that case, there will be no time parallelism involved, thus all the data will be processed in one burst at a time. Otherwise, the number of bursts may be described as the number of times that the slowest process execution may be overlapped with the slowest data transfer. When splitting the execution into data bursts, the maximum transfer cost lowers, as can be seen in Equation 8.10, and $Q$ will reflect this.

$$N_{bursts} = \begin{cases} \left\lceil \dfrac{C_{max,comm}}{C_{max,comp}} \right\rceil & \text{when } C_{max,comm} > C_{max,comp} \\[2ex] 1 & \text{otherwise} \end{cases} \tag{8.9}$$

$$Q = \max\left(C_{max,comp},\ \frac{C_{max,comm}}{N_{bursts}}\right) \tag{8.10}$$
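For illustration, with hypothetical costs $C_{max,comm} = 900$ and $C_{max,comp} = 400$, Equation 8.9 yields $N_{bursts} = \lceil 900/400 \rceil = 3$, and Equation 8.10 fixes the quantum at $Q = \max(400, 900/3) = 400$: the transfer is split until it hides behind the critical computation.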

The information extracted at this stage will be used both for further model transformations and for the synthesis of pipelined code in Section 8.2.

Sorting the contained sections by cost

After the extraction of individual data paths, a second extraction is performed. This time, contained sections are extracted from the individual data paths. A contained section is a stream of neighbouring processes in a data path that are mapped for parallel execution. While they are being extracted, the contained section groups are sorted by their cost. The contained section cost respects Equation 8.11.

$$C_{contained,par} = \sum_{proc \,\in\, contained} C_{proc,par} \tag{8.11}$$

The method ExtractAndSortContainedSectionsByCost() is used for the above-mentioned task. It takes into account whether the contained section is part of a larger loop and calculates the loop cost with Equation 8.6. Since loops can be broken only as much as the number of delay processes permits, these sections' costs should not be treated separately.


function ExtractAndSortContainedSectionsByCost(datapaths)
    sorted_csections ← empty group (contained_sections)
    for each datapath ∈ datapaths do
        extract contained_sections from datapath
        for each section ∈ contained_sections do
            if section is loop then loop_cost ← cost(loop)
            sect_cost ← cost(section)
            if loop_cost > sect_cost then sect_cost ← loop_cost
            csect_pair ← pair (sect_cost, section)
            add csect_pair to sorted_csections
    return sorted_csections

Listing 8.8: Method for extracting and sorting contained sections by their cost

Regarding its time complexity, we can say that it scales with $O((n/2) \cdot n)$ in the worst-case scenario, where all processes alternate platforms, making all contained sections one process wide, and all are part of one system-wide loop, forcing a second browse through the whole system every time. Otherwise, a balanced system would be parsed by this algorithm in around $O(n)$ time, since the supplementary inner loops tend to be balanced out by the groupings into paths.

Load balancing main method

The main method of the load balancing algorithm is SplitPipelineStages(), and it uses all the data extracted previously. It is applied upon the contained sections in reverse order of their cost, fixing a strict priority for resolving the load balance. This enables the analysis of the critical path first, since its analysis may change the quantum cost and render the load balancing invalid.

The pseudo-code in Listing 8.9 presents the current method. As can be seen, the method starts by filling up already-existing pipeline stages until they reach the maximum allowed cost $Q$. These stages were formed in previous steps for contained sections with a higher priority, but which have common processes with the current section. For this purpose, the synchronization cost plays a very important role, along with the computation cost. Depending on the process' context, the synchronization cost may be one of the following: $C_{H2D}$ or $C_{D2H}$ for transfers between host and device, $C_{D2D}$ for transfers between two processes belonging to the same pipeline stage (thus mapped to the same kernel), or $C_{T2T}$ for transfers between two processes belonging to different pipeline stages (thus mapped to different kernels, introducing a new type of cost).

The processes which are left unmapped are then analyzed for assignment to pipeline stages. After further splitting them into coherent sections, a further cost calculation is done, similar to the previous one. As seen in line 29, there is a guard that verifies whether the quantum cost $Q$ is still relevant or has been surpassed by the synchronization and computation cost of a single (critical) process. In that case, the algorithm is halted and invalidated, and a new stage splitting is performed with respect to the new quantum cost.

Since the algorithm is a heuristic one, based on empirical cost calculations and parsed from process to process, the result may not be optimal concerning the ratio between performance and resource allocation. This is why, to increase the chances of delivering an optimal solution, a second analysis is done, but this time in the reverse order of the process execution (i.e. from right to left). This doubles the amount of time needed to perform the method, but may provide a better mapping concerning synchronization costs and the number of pipeline stages.


 1 function SplitPipelineStages(contained_section)
 2     already_assigned_processes ← empty list
 3     for each process ∈ contained_section do
 4         if process is assigned to a pipeline stage then
 5             add process to already_assigned_processes
 6     for each assigned_proc ∈ already_assigned_processes do
 7         if assigned_proc is before/after an unassigned process ∈ contained_section then
 8             proc_to_assign ← that unassigned process
 9             assigned_stage ← assigned_proc.stage
10             new_sync_cost ← sync_cost(proc_to_assign)
11             old_sync_cost ← sync_cost(assigned_proc)
12             new_stage_cost ← assigned_stage.cost − old_sync_cost + new_sync_cost
13             if new_stage_cost < Q then
14                 assign proc_to_assign to assigned_stage
15                 mark flag ⇒ goto (non-interrupt) line 2
16     unassigned_sections ← contained_section.split(unassigned processes)
17     for each section ∈ unassigned_sections do
18         left_right_stages ← empty group (sections, costs)
19         right_left_stages ← empty group (sections, costs)
20         for each process ∈ section (forward order) do
21             if process is section.first then
22                 sync_cost ← sync_cost(process).input
23             sync_cost ← sync_cost + sync_cost(process).output
24             computation_cost ← comp_cost(process)
25             if computation_cost + sync_cost < Q then
26                 add process, sync_cost, computation_cost to left_right_stages.last
27             else if process is the only one in the stage then
28                 Q ← computation_cost + sync_cost
29                 return, invalidating the algorithm so far
30             else
31                 new_stage ← new stage (process, sync_cost, computation_cost)
32                 add new_stage to left_right_stages
33         for each process ∈ section (reverse order) do
34             if process is section.last then
35                 sync_cost ← sync_cost(process).output
36             sync_cost ← sync_cost + sync_cost(process).input
37             repeat steps 24−32, filling right_left_stages
38         demands ← right_left_stages.sync_cost < left_right_stages.sync_cost and right_left_stages.size < left_right_stages.size
39         if demands are satisfied then
40             for each process in right_left_stages do
41                 assign process to right_left_stages.stage
42         else
43             for each process in left_right_stages do
44                 assign process to left_right_stages.stage

Listing 8.9: Method for splitting the process network into pipeline stages based on cost information

The actual mapping of processes to pipeline stages is done after deciding which of the previous results is optimal, a decision that is subject to change in future improvements. The ideal number of pipeline stages should respect Equation 8.12 and should be either the number of processes in the section or the ratio between the section cost and the quantum cost, but due to the mapping mechanisms it may be higher. This is why a second search is beneficial.

$$N_{sections} = \min\left(n_{proc \,\in\, contained},\ \left\lceil \frac{C_{contained,par}}{Q} \right\rceil\right) \tag{8.12}$$
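For example, a contained section of five processes with a hypothetical $C_{contained,par} = 1000$ and a quantum cost $Q = 400$ would ideally be split into $N_{sections} = \min(5, \lceil 1000/400 \rceil) = 3$ pipeline stages.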

The method's effect on a ForSyDe model can be seen by studying the differences between Figure C.6³ and Figure C.7 from Appendix C. In the former model all processes are assigned to stage 0, while in the latter model they hold different stage mapping information.

³ See the notes in Appendix C for the reason why Figure C.6 is different from Figure C.5.


Due to the algorithm's complexity and its strong dependencies upon all the previous steps, it is difficult to make an analysis regarding its scalability. It employs many switch statements and guards, and its execution flow is hard to follow. The only certainty is that the number of calculations decreases as the data paths are analyzed, since once a process has been assigned to a pipeline stage, all the algorithm steps revolving around it are skipped. Also, processes mapped for the sequential platform are not even taken into consideration. Thus we can affirm that the time complexity roughly scales with $O(n)$.

8.1.4 Pipelined model generation

In order to enable the component to synthesize CUDA code, a series of directives needs to be provided to the synthesizer module. The model already holds enough information to enable mapping of processes to kernels, but a "cleaner" way is to modify the model so that it does most of the synthesizer's work: gathering the correct functions under a kernel and generating wrappers around them. Thus a final model modifier algorithm is necessary, to group processes associated with a pipeline stage under a parallel composite. The effect of this algorithm on the ForSyDe model can be seen in Figure C.8 from Appendix C.

The algorithm is similar to the method createParallelComposite() from Listing 8.2. It is a succession of creating new parallel composites and filling them with processes, while assuring model integrity. Because of the current framework's instability, the model integrity checks take most of the execution time. Since all processes mapped for time-parallel execution are visited only once, we can affirm that the algorithm scales with $O(n)$.

8.1.5 Future development

Many aspects of the algorithms and the implementation mechanisms are still to be studied and developed. The current component is just a prototype and provides only a theoretical basis for a much larger area: model manipulation and design space exploration as part of the ForSyDe system design flow. As in the previous chapter, a few proposed directions for further research will be listed.

As mentioned, the current algorithms and the design decisions taken in their implementations were forced by the current model framework. The aspect that suffered the most was the performance during the identification and extraction of data-parallel sections. The algorithm proposed in Subsubsection 8.1.1 can greatly increase performance, from $O(2^{n_c})$ and $O(n_l^2)$ respectively to $O(\log n_l)$, where $n_c$ is the number of composite processes and $n_l$ the number of leaf processes. Also, the new algorithm should be implemented with minimum code and development overhead, on a framework like the one proposed in Section B.3 and Subsection 7.1.2. Therefore, after developing the ForSyDe Toolbox backbone, the development of this algorithm has the highest priority.

In Listing 8.6, the load balancing algorithm employs two extractions: one for data paths, and another for contained sections. While this separation is natural considering that they employ different analyses, these two algorithms may be merged into a single browse of the whole system. While increasing performance, this improvement may lower the algorithm's flexibility and reusability, thus its effect needs to be studied. In Listing 8.8, the sorting method may trigger the calculation of the same loop cost multiple times if there exist multiple contained sections included in the same loop. This value could be stored in a separate lookup table, or a container may be provided for a loop object to avoid such a situation; the development overhead, though, would not have been justified compared to the performance gain. Still, it may be regarded as a possible enhancement. The platform optimization algorithm in Listing 8.5 studies one process at a time to take the decision of changing the platform. If there exists a chain of processes that would benefit from changing the platform altogether instead of separately, it will be ignored. A method for identifying such processes needs to be developed.

Apart from purely implementation issues, there is a number of algorithmic issues that have to be intensely studied and formally validated in order to continue the component's implementation. First of all, the problems revolving around splitting loops while preserving the semantics of the system need to be solved. Currently a safe solution is adopted, one that still needs to be formally verified: allowing a loop to be split only as much as there are delay elements in the data path. This restricts the splitting position and diminishes the potential for parallelism in case of system-wide loops. This issue needs to be studied and further treated. There are also still many effects that the separation into pipeline stages may have upon the process network, which have to be studied as well, for example removing a delay element at a separation.

As can be seen, most of the algorithms and features provided in this contribution have a predominantly engineering profile: we present the problem and then try to find the best solutions in the given time frame. For this matter, an in-depth study of the related research in order to find the roots of the problems is definitely necessary. For example, automatic parallelization has been intensely studied in the past years, and the scientific literature may hold answers for our problems as well. [Feautrier, 1996] describes an automatic parallelization process which transforms the code concerned into a polyhedron called a polytope, and uses mathematical methods to perform automatic partitioning of data depending on the frequency of their access. An enhanced version of this algorithm is presented in [Baskaran et al., 2008], containing a method for estimating when such allocations are beneficial.

The load balancing algorithm still needs to be formally validated. Although it delivers the desired results, it is still an engineering solution that may not be ideal in all situations. An analytical way of tackling the problem and delivering the optimum result needs to be studied in the research community.

8.2 Synthesizer algorithms

The second type of algorithms provided in the current contribution are the code synthesis algorithms. They are employed by the synthesizer module, as part of the flow depicted in Figure 5.2. These transformations target only CUDA platforms and are the equivalents of the code optimizations in the flow depicted in Figure B.4.

The core function of this stage, called several times, is the scheduler, which is presented and thoroughly described in [Hjort Blindell, 2012] and was developed as part of f2cc v0.1. Its purpose is to find an optimal sequential schedule for executing processes so that the system's semantics are preserved. The current contribution did not develop the algorithm any further, except for integrating it into f2cc v0.2.

The algorithm is described by the pseudo-code in Listing 8.10. Its purpose is to perform a final set of parsings and generate a function code file (.c for C and .cu for CUDA) and a header file (.h). It treats each type of process differently:

• comb leaf processes are not processed any further, since they already have the function body; their header instead is used for function invocation inside their parent's (a composite process) wrapper function.


check ProcessNetwork
root ← ProcessNetwork.root
for each composite ∈ root.composites do
    FindSchedule(composite)
for each composite ∈ ProcessNetwork.composites do
    if target platform is CUDA and composite is root then
        generateCudaKernelWrapper(composite)
    else generateWrapperForComposite(composite)
generateKernelConfigurationFunction()
functions ← ProcessNetwork.functions
code ← generateDocumentation()
for each fct ∈ functions do
    code ← code + fct.string
generateCodeHeader(functions.last)

Listing 8.10: Top level for the algorithm for code synthesis


• other leaf processes are treated only in their parent's wrapper function, where specific code is generated for each one, as will be presented in Subsection 8.2.1.

• composite processes are analyzed for a sequential schedule of their contained processes. The synthesizer module generates a wrapper function that invokes its child processes' functions in the order determined by the found schedule.

• parallel composite processes are treated like composite processes. If they are targeted for sequential platforms they are invoked multiple times. Otherwise, on parallel platforms, their execution implies the usage of parallel threads.

The following subsections will present the specific methods of the two approaches for synthesizing either C or CUDA code.

8.2.1 Generating sequential code

The sequential code generation methods are used system-wide for both target platforms. Only the root process is skipped for CUDA mapping, since that requires a CUDA kernel. As previously mentioned, only composite and parallel composite processes are analyzed and are associated with wrapper functions. The generation of these wrappers is described in Listing 8.11.

The method starts by extracting signals and delay variables from the composite process. These new elements and their generation are presented in section 8.5 of [Hjort Blindell, 2012]; they are necessary for generating the variables that transmit data between the process functions in the wrapper body. In f2cc v0.1 their role was crucial, since the ForSyDe model did not carry information about the data types transported between processes, and a full system parse was necessary to back-trace these types from the function code. In f2cc v0.2 this problem no longer exists, since all necessary information is available in the model. Storing this information in a set of containers that will directly be transformed into variables, though, eases the synthesis process. Their extraction does not imply a full network parse, but instead gathers information from existing ports and encapsulates it in signal and delay variable object containers, respectively.


function generateWrapperForComposite(composite)
    new_function ← empty function
    extract signals from composite
    extract delay_variables from composite
    for each port ∈ composite.ports do
        new_input(output)_parameter ← variable(port)
        add new_input(output)_parameter to new_function.parameters
    body ← GenerateCompositeDefinitionCode(composite.schedule)
    RenameVariables(body)
    add body to new_function
    add new_function to ProcessNetwork and composite.wrapper
    if composite is root then
        GenerateRootExecutionCode(new_function)
        add new_function to ProcessNetwork.functions

Listing 8.11: Method for generating sequential wrapper code for composite processes

The second step in creating a sequential function wrapper is building the function header. For this, the composite process' inside ports are parsed and transformed into input/output function parameters. Since all the information is already present in the ports, creating parameters is a matter of binding this information into new containers.

The next step is generating the function body with GenerateCompositeDefinitionCode(). This method takes the composite and its schedule as arguments and builds the execution code, in the following order:

1. generate variable declaration code from signals.

2. generate delay variable declaration and initialization code.

3. generate the first step of the delay execution code (i.e. loading the previous value from the delay variable).

4. generate execution code for all other processes. Depending on the process type, the processes are treated differently:

• fanout processes just copy the input value to their output variables;

• zipx and unzipx distribute their input values to/from multiple other variables;

• comb processes invoke the process function through a function call generated from its header;

• composite processes invoke the previously generated wrapper execution code, in the same way as for comb processes;

• parallel composite processes invoke the associated composite wrapper execution code, but inside a for loop that reflects the number of processes;

5. generate the final step of the delay execution code (i.e. storing the next state value in the delay variable).

6. generate signal clean-up code. All the variables that have been allocated during execution need to be deallocated from memory.

The next step is to browse again through the function body and rename variables. Since signals have unique IDs which are combinations of characters involving process IDs, port IDs, and hierarchy paths, they are hard to follow when manual debugging is necessary. Thus a set of more intuitive names is assigned to the variables.


As seen in Listing 8.11, two wrappers are generated for the root composite process. The second one is the top module function accessed by the C header.

An example of generated code for a composite process wrapper can be seen in Listing C.4 from Appendix C.
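To give a feeling for the shape of this output without consulting the appendix, the fragment below is a minimal hand-written sketch of what such a generated wrapper might look like. All names (f2cc_wrapper_example, proc1_func, proc2_func) are hypothetical, and keeping the delay state in a static variable is a simplification of the generated delay variable handling.

/* hypothetical process functions (stand-ins for generated headers) */
static void proc1_func(float* out, const float* in, const float* state)
{ *out = *in + *state; }
static void proc2_func(float* out, const float* in)
{ *out = *in; }

void f2cc_wrapper_example(const float* in, float* out)
{
    /* 1. variables generated from signals */
    float sig_a;
    float sig_b;
    /* 2.-3. delay variable: declare, initialize, load previous value */
    static float delay_var = 0.0f;
    sig_a = delay_var;
    /* 4. invoke the process functions in schedule order */
    proc1_func(&sig_b, in, &sig_a);   /* comb process function call */
    proc2_func(out, &sig_b);          /* child composite wrapper call */
    /* 5. store the next state value in the delay variable */
    delay_var = sig_b;
    /* 6. signal clean-up would deallocate heap-allocated variables here */
}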

8.2.2 Scheduling and generating CUDA code

The CUDA code generation mechanism is invoked if the component's target platform is set to GPGPU. It consists of three steps:

• generate execution code for the parallel composite processes, with a method similar to the one in Listing 8.11.

• generate a scheduler for splitting and executing pipelined code streams with the contained sections belonging to the root process.

• generate kernel wrapper functions (ideally one for each contained section), configure them for optimal execution, and run them through an elaborate set of mechanisms.

Basic concepts for the mapping algorithm

Before presenting the method's mechanisms, the theoretical background that the method builds on needs to be introduced.

All the development efforts for the current component have been targeted at providing enough information so that it is possible to implement a ForSyDe system on a platform that supports data and time parallelism. Cost analyses, data extraction, model modifications and all other operations built up to this purpose: offering an optimized data parallel and pipelined solution.

Although the final purpose of this component is to generate CUDA code that employs streams for an interleaved execution, all the stages of the design flow were developed to be as general as possible, so that they can be treated separately from nvidia GPGPUs. Ideally, they should be flexible enough to be reused for other design flows than the current one, targeting a large set of parallel platforms. Following the same philosophy, even this last algorithm was conceived to be as independent as possible of the specific GPU implementation traits (although a fully independent approach is impossible). This way, future development may have a starting point for different research directions.

The mapping algorithm revolves around two parameters that are extracted during different stages of the design flow:

• the number of stages or Nstages denotes the number of pipeline stages that may be interleaved for parallel execution. This parameter is extracted further on in Listing 8.12. Apart from the number of parallel composites associated with stages, this parameter needs to add up the data transfers between host and device that happen during the scheduled execution of processes.

• the number of data bursts or Nbursts denotes the number of times the transfer cost is to be split so that it is balanced by the load of the slowest executing process. This parameter has been described in Equation 8.9.


As presented in Section 4.4, the streams have the potential of overlapping the data transfer with the device execution and (for devices supporting this type of action) interleaving the execution of multiple kernels and the host. Although this is a device-specific property, the architecture provided by devices with the above-mentioned compute capability may be regarded (at least in early stages) as a general pipelined architecture.

The execution model targeted by this component is depicted in Figure 8.4, which shows the performance advantages of pipelined execution. The plot models the repeated execution of three data sets on a GPGPU. Since the ratio between the maximum computation time and the maximum transfer time is approximately 3, the data transfer has been split into three bursts. Since the number of pipeline stages is 3 (kernels) + 2 (transfers) + 1 (CPU) = 6, which is less than the number of streams available in the GPU (16 for compute capability 2.0 or higher), the component will allocate 6 streams for interleaved execution.
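As a small numeric sketch of this reasoning, with assumed cost values that are not taken from a real platform model:

#include <math.h>
#include <stdio.h>

int main(void) {
    const double max_comp = 3.0, max_transfer = 1.0;   /* assumed costs */
    const int n_kernels = 3, max_streams = 16;         /* compute capability >= 2.0 */

    int n_bursts = (int)ceil(max_comp / max_transfer);          /* = 3 */
    int n_stages = n_kernels + 2 /* transfers */ + 1 /* cpu */; /* = 6 */
    int n_streams = n_stages < max_streams ? n_stages : max_streams; /* Eq. 8.14 */

    printf("bursts=%d stages=%d streams=%d\n", n_bursts, n_stages, n_streams);
    return 0;
}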

If the system is fed an infinite stream of data sets, provided there are no resource conflicts during execution and the pipeline stages are perfectly split (there is no synchronization overhead) so that the resource utilization becomes 1, the execution has the potential to finish in 1/Nstages of the initial time. In other words, the system execution time is reduced to the critical execution time. Practically though, Figure 8.4 shows that these overheads are inevitable.

The figure presents two approaches for executing the system in a pipelined fashion. The first one involves synchronizing the device after each stage. This results in a predictable execution pattern that respects the semantics of the ForSyDe SY MoC. This model is a first approach for future research in using GPGPUs in systems which require a predictable description of their timing behavior. As seen in Figure 8.4, the synchronization induces an overhead for the stages that have lower costs than the quantum cost.

The second approach does not synchronize the device after each pipeline stage, leaving the CUDA scheduler to assign processes as long as there are resources available. This leads to an unpredictable behavior and a number of resource clashes that may affect performance and cancel the head start over the synchronous approach (as happens in Figure 8.4). Such an execution model could be compared with the ForSyDe SDF MoC, where processes are executed as soon as there is data available.

Assigning data bursts to streams is a "revolving barrel" problem. Each stream is assigned a displacement with which it browses through the data sets, as suggested in Table 8.1.

stream   1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1
burst    1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2

Table 8.1: Assigning data bursts to streams for a system with Nbursts = 7 and Nstreams = 5
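The displacement pattern of Table 8.1 can be reproduced by cycling the stream index and the burst index with different periods; the snippet below is a minimal illustration of this idea (the counts and the printing are purely for demonstration):

#include <stdio.h>

#define N_BURSTS  7
#define N_STREAMS 5

int main(void) {
    /* Issue slot i picks stream (i mod N_STREAMS) and burst (i mod N_BURSTS),
       reproducing the revolving pattern of Table 8.1 (1-based in the table). */
    for (int i = 0; i < 16; ++i)
        printf("slot %2d -> stream %d, burst %d\n",
               i, i % N_STREAMS + 1, i % N_BURSTS + 1);
    return 0;
}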

The synthesizer algorithm

The pseudocode in Listing 8.12 describes the top level of the CUDA code generation algorithm. It is applied to a process network's root, and it generates several functions associated with either kernel functions or kernel wrappers. The generation of these kernels is thoroughly described in [Hjort Blindell, 2012], thus only the important aspects will be stressed; the reader is encouraged to consult the documentation of f2cc v0.1 for further details.

The first action after extracting signals and delay variables is to add the prefix __device__ to all the functions belonging to processes mapped for device execution. This prefix lets the compiler know that the specific function will be executed on the device.

[Figure 8.4: Streamed execution model. Comparison between not using and using streams in two configurations. The three panels (Non-Streamed Execution, Streamed Execution with Device Synchronization, and Streamed Execution without Device Synchronization) plot H2D transfers, kernels K1, K2, K3, D2H transfers and CPU activity across the default queue and the allocated streams.]

function generateCudaKernelWrapper(root)
    new_function ← empty function
    extract signals from root
    extract delay_variables from root
    for each port ∈ root.ports do
        new_input(output)_parameter ← variable(port)
        add new_input(output)_parameter to new_function.parameters
    for each process ∈ root.processes do
        if process is mapped for device execution then
            process.cfunction.prefix ← "__device__"
    kernel_schedules ← empty list of schedules
    num_stages ← 0
    for each contained_section ∈ root.schedule do
        num_stages ← num_stages + 2
        add empty_schedule to kernel_schedules
        for each process ∈ contained_section do
            num_stages ← num_stages + 1
            add process to kernel_schedules.last
        compute n_proc
        kernel_exec ← GenerateExecutionForKernelComposite(kernel_schedules.last, n_proc)
        kernel_func ← generateCudaKernelWrapperFunction(kernel_exec)
        add kernel_exec, kernel_func to ProcessNetwork.functions
    new_body ← generateCudaRootCode(root, kernel_schedules, num_bursts, num_stages)
    add new_body to new_function
    add new_function to ProcessNetwork.functions and root.wrapper

Listing 8.12: Method for generating wrappers associated with CUDA code


The next step is to browse all the previously identified contained sections (chains of processes mapped for parallel execution on the device). For each section, the number of pipeline stages Nstages is calculated as suggested in the previous subsection. Also, the processes are added into a new schedule which will aid in generating the kernel execution function for that particular contained section.

Before generating the kernel execution function and the kernel wrapper associated with a contained section, the number of parallel processes (which will further be used to calculate the number of threads) is calculated using Equation 8.13. Since the data "width" has been split into bursts, the number of processes contained by the parallel composites is not relevant any more, and needs to be readjusted.

Nproc,kernel = ⌈Nproc,pcomp / Nbursts⌉ (8.13)
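In code, Equation 8.13 is just an integer ceiling division; as a hypothetical illustration, 1000 parallel processes split into 3 bursts yield ⌈1000/3⌉ = 334 processes per kernel launch:

int n_proc_kernel(int n_proc_pcomp, int n_bursts) {
    /* integer ceiling division, equivalent to ceil((double)a / b) */
    return (n_proc_pcomp + n_bursts - 1) / n_bursts;
}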

The synthesis of a kernel execution function with GenerateExecutionForKernelComposite() is similar to the method described in Listing 8.11. The main difference is that device synchronization code is now added, in case the first approach in Figure 8.4 is chosen. The function prefix is set to __device__ as well. The kernel wrapper created with generateCudaKernelWrapperFunction() sets the CUDA indexes according to the future kernel configuration, and invokes the execution function. Its prefix is set to __global__. More information about the kernel wrappers is found in [Hjort Blindell, 2012].
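A minimal sketch of this wrapper pattern is shown below; the names my_section_exec and my_kernel_wrapper are hypothetical, and the single multiply stands in for the contained section's actual work (the real generated code is shown in Listing C.6):

__device__ void my_section_exec(const float* in, float* out, int idx) {
    out[idx] = in[idx] * 2.0f;  /* placeholder for the contained section's work */
}

__global__ void my_kernel_wrapper(const float* in, float* out, int n_proc) {
    /* derive the process index from the CUDA thread indexes and guard
       against threads beyond the number of parallel processes */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_proc)
        my_section_exec(in, out, idx);
}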

The top level function is the last one synthesized by the current component. It contains code for gathering device information and, based on it, calculating the optimum configuration for the invoked kernels. Its mechanisms are also explained in [Hjort Blindell, 2012]. The main contribution of v0.2 to this method is the introduction of the mechanism that maps kernels to streams and distributes the data associated with them.

Since the component generates a (presumably smaller) kernel for each contained section, each having its own Nproc,kernel, the top module function's main role is to invoke them while preserving the execution model in Figure 8.4. This function allocates an array of Nstreams streams, where Nstreams is calculated with Equation 8.14 and may be considered the number of pipeline stages, as long as it is smaller than the maximum number of resources provided by the GPGPU device (for example, the maximum number of streams provided by an nvidia GPGPU with compute capability 2.1 is 16).

Nstreams = min(Nmax,streams, Nstages) (8.14)

All the signals between the contained sections and the non-contained processes in the top module are split as well into Nstreams parts, since each stream operates on its own set of data. For each signal, the component generates code for allocating device and host memory at the beginning of the system's execution, and for deallocating it at the end.

The system execution is then wrapped inside a while loop that waits for streams to finish their execution, interrogating them in a revolving manner. Each time a stream is found to have finished its execution while still having data left in its work set, its associated kernel is invoked (after calculating an ideal kernel configuration) and its work set index is incremented.
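The sketch below illustrates such a polling loop under simplifying assumptions: launch_burst is a hypothetical helper that would enqueue the H2D copy, kernel and D2H copy of one burst, and the bursts are dealt to the streams in a simple blocked round-robin rather than the exact displacement pattern of Table 8.1.

#include <cuda_runtime.h>

#define N_STREAMS 6

/* void launch_burst(cudaStream_t s, int burst);  -- assumed helper */

void run_pipeline(int n_bursts) {
    cudaStream_t streams[N_STREAMS];
    int next_burst[N_STREAMS];            /* per-stream work set index */
    for (int s = 0; s < N_STREAMS; ++s) {
        cudaStreamCreate(&streams[s]);
        next_burst[s] = s;                /* initial displacement */
    }
    int remaining = n_bursts;             /* bursts not yet issued */
    while (remaining > 0) {
        for (int s = 0; s < N_STREAMS; ++s) {
            /* cudaStreamQuery returns cudaSuccess once all work queued on
               the stream has completed, without blocking the host */
            if (next_burst[s] < n_bursts &&
                cudaStreamQuery(streams[s]) == cudaSuccess) {
                /* launch_burst(streams[s], next_burst[s]); */
                next_burst[s] += N_STREAMS;
                --remaining;
            }
        }
    }
    cudaDeviceSynchronize();              /* wait for the last bursts */
    for (int s = 0; s < N_STREAMS; ++s)
        cudaStreamDestroy(streams[s]);
}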

An example of a generated kernel execution function and its associated kernel wrapper can be seen in Listing C.6, and a top level function that invokes this kernel is found in Listing C.5. For page layout reasons, parts of the code had to be removed and replaced with (...). This code is either reused from f2cc v0.1, with its meaning explained in [Hjort Blindell, 2012], or is repeated over multiple data sets.

8.2.3 Future development

All the theoretical foundation provided in Part I of this document and the implementation efforts presented in Part II built up to one particular task: offering a data parallel and time parallel solution for a ForSyDe model. For this purpose, GPGPUs have been used as platform. But since the main contribution is intended to provide a foundation for multiple research directions, more resources have been invested in the conceptual part and the algorithms than in platform-specific optimizations for performance. Therefore the end result of this component may not be fully optimal or fully correct from a CUDA developer's point of view.

The synthesizer module has the greatest potential for future optimization, and will require the largest work effort for future development and improvement. Although GPGPUs are notorious for demanding efficient hardware resource usage through clever coding, no specific optimizations were employed apart from the features already implemented in f2cc v0.1 and the features presented in this report. This can be justified both by the limited time frame available and by the predominant proof-of-concept profile of this project.

Regarding the sequential code generation, there is still room for much performance improvement. For example, although the inputs and outputs of generated composite execution functions are passed as pointers to spare execution time, the internal variables associated with signals and signal manipulation processes (zipx, unzipx, fanout) still employ data copying mechanisms for passing values, for data integrity reasons. The sequential execution time would be greatly reduced if a correct mechanism for transporting pointers were found.

The current CUDA code generated by the component could not be validated because of the limited time frame, and both the schedule and the execution model still require thorough research and development. Currently there is no support for models with more than one contained section, which would require the execution of more than one kernel. Although the model is built up for this specific purpose, there are still many implementation issues that have to be handled before a stable result can be offered. Also, both the execution model and the generated code are far from optimal. For example, the kernel configuration function is invoked every time a kernel is invoked, even if its parameters may not have changed, introducing unnecessary overhead on the CPU execution. Also, the invocation of the same CUDA kernel on multiple streams may not be considered a best practice in the GPGPU programming community, but was again chosen for purely proof-of-concept reasons.

Concerning the synthesizer algorithms, many improvements are still to be made in that area as well. For example, there is currently no support for generated parallel composite processes whose child processes do not have the same fixed width (this explains why the model in Figure C.5 could not be fed to the synthesizer, and the stricter model in Figure C.6 was adopted instead), nor kernel generation for this type of processes. Also, several stages may be merged or mechanisms changed in order to further increase the tool's execution performance. For example, as mentioned, the signal objects may be considered redundant in a model where ports already contain data type information, but their usage aided the reusability of several methods from f2cc v0.1, and hastened the tool implementation.

The development of a robust synthesizer tool will be possible only with help from the research community. Since there are many related research efforts that target the generation of optimized CUDA code, many of their ideas may be harvested and compiled for our own purpose. For example, [Udupa et al., 2009] presents a method for generating pipelined execution models using CUDA streams from a high level language called StreamIt. A more extensive study of the available research than the one already performed in Part I will be conducted as part of future development.



Chapter 9

Component Implementation

This chapter will present a few of the main implementation traits. Although the main concepts behind the framework have already been presented, showing a part of the implementation is beneficial in order to get a grip on the internal mechanisms which are used during the design flow, and on the complexity of the project. Detailed descriptions of the architecture can be found in the component's API documentation provided as part of the tool suite, and a potential developer is encouraged to consult it.

9.1 The ForSyDe model architecture

The ForSyDe model presented in Section 7.1 is implemented in a predominantly object-oriented manner. It was conceived as a superset of the model used in f2cc v0.1, depicted in Figure 5.3. Its inheritance and collaboration graph is presented in Figure 9.1.

The relations between the model classes can be determined from the above-mentioned figure. Each Process contains a Hierarchy object, which is a list of Ids with a set of methods for finding relations. The Process class acts as parent for the Leaf and Composite classes, thus both these child classes can be regarded as processes. Also, their lists of Ports and IOPorts, respectively, can be regarded from the outside as interfaces, since they both inherit basic traits from the Interface class.

A Model is a class that contains lists with pointers to leafs and composites. The Composite is both a Process and a Model, since its main role is to encapsulate other processes, but it has to be seen as a process from the outside. The ProcessNetwork is another class that inherits traits from the Model. Its main role is to serve as a "via" for selecting the main components in the ForSyDe model. For the backward compatibility reasons mentioned earlier in Section 7.1, the process network points to ports which are inputs or outputs for the system.

The CFunction, apart from its body which is plain text, contains CVariables, which themselves encapsulate a CDataType each. As mentioned in Section 7.1, the comb processes point to a CFunction while Map processes still encapsulate them, to preserve backward compatibility. Also, the fact that ports and "ioports" contain data type information can be seen from their inclusions.


[Figure 9.1: The restructured classes used for the f2cc v0.2 internal model. An inheritance and collaboration graph of Process, Leaf, Composite, ParallelComposite, the leaf process classes (delay, zipX, unzipX, fanout, zipWithN, comb, map, CoalescedMap, ParallelMap), Interface, Port, IOPort, Model, ProcessNetwork, Hierarchy, CFunction, CVariable and CDataType, with edges marking "inherits (is a)", "encapsulates (has)" and "points to (has)" relations.]

Although not previously mentioned, the composite processes do include a CFunction container. These will be filled later, in the synthesis stage, when the tool generates execution functions for each composite process.
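The containment relations described above can be summarized by the following simplified C++ sketch; it is not the actual f2cc header code, and the member types are deliberately reduced to the bare minimum:

#include <string>
#include <vector>

struct CDataType { std::string name; };             // e.g. "float"
struct CVariable { std::string id; CDataType type; };
struct CFunction {
    std::string body;                               // plain-text function body
    std::vector<CVariable> variables;               // each wraps a CDataType
};

struct Hierarchy { std::vector<std::string> ids; }; // path of Ids to the process

struct Process   { Hierarchy hierarchy; };          // base class
struct Leaf      : Process {};                      // comb, delay, zipx, ...
struct Model     { std::vector<Process*> processes; };
struct Composite : Process, Model {                 // both a Process and a Model
    CFunction wrapper;                              // filled during synthesis
};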

9.2 Module interconnection

The interconnection between the main modules has been maintained as in Figure 5.2. Since the component has grown considerably since v0.1, most of the modules have been restructured or even split into sub-modules. The following paragraphs will present only the main changes. For a full view of the component the reader is encouraged to consult the API documentation.

As in f2cc v0.1, all methods and classes are bound under the f2cc namespace to avoid naming clashes with other libraries. The tools module, using the f2cc::tools namespace, still contains methods common to all modules and has been broadened with new methods. The same can be said about the config module, which now contains both data about the platform model and new switches and flags associated with the new execution flow. The logger and exceptions modules were left unmodified.

The frontend module has been provided with a new set of classes associated with the new component. The Frontend main class has been linked to an XmlParser class which extracts the information necessary to build the internal ForSyDe model. A subclass of XmlParser is CParser, which parses ForSyDe code to transform it into CFunction objects, as presented in Section 7.3. Separate from these classes, a new XmlDumper class has been provided with methods for generating XML files from intermediate stages of the model modification.

The forsyde module underwent the most modifications. It has been split into three submodules. One resides under the f2cc::Forsyde namespace and holds the classes associated with the internal ForSyDe model. The second resides under the f2cc::Forsyde::SY namespace and is a subset of the ForSyDe model specific to the SY MoC. This leaves space for future development, should it be decided to offer model support for more MoCs; this would imply describing process constructors with the same name but different semantics, and grouping them into different namespaces takes care of the naming issues. The third submodule contains the model modifier classes for both f2cc v0.1 and v0.2.

The language module as such underwent minor modifications. Since it contains the language-related container classes, just a few methods were added to support the new type of functions. The whole C++ and ForSyDe-SystemC language support is handled by the frontend.

The synthesizer module was developed separately from the existing one, thus a new class with synthesizer methods was added.

9.3 Component execution flow

Now that the main component's algorithm, implementation and architectural features have been presented, the reader should be familiar with its internal mechanisms for creating, analysing, modifying and generating code from ForSyDe models. This enables a review of all the steps executed in the tool's design flow, as depicted in Figure 9.2.

[Figure 9.2: The f2cc component execution flow. The v0.1 flow inputs GraphML (ForSyDe-Haskell) files with C code and passes through the GraphmlParser, ModelModifier and Synthesizer; the v0.2 flow inputs XML files from ForSyDe-SystemC with C++ function code plus a platform model (XML) and passes through the XmlParser/CParser, ModelModifierSysC (with XmlDumper output of intermediate XML files) and Synthesizer02. Both flows build an internal model representation and end in synthesized C or CUDA code.]

The component starts its execution by invoking the frontend, which inputs a ForSyDe model. While v0.1 inputs graphml files with C code annotations, v0.2 inputs xml files generated from ForSyDe-SystemC and hpp files with C++ code directly from the framework model. The choice of following a particular execution flow is made automatically by verifying the input file extension. The frontend has specific parsers for each execution flow, which are invoked according to the above-mentioned choice.

After the input files have been parsed, both execution flows result in an internal model representation using the new internal model. This model serves as input for the following stage, the modification stage. This stage is handled by different modifiers for each tool version, and ends up in another (modified) internal representation. As suggested in the depiction, in v0.2 the intermediate states are continuously monitored and the XmlDumper is invoked for dumping these models, resulting in multiple xml files.

The model outputted by the modifier serves as input for the synthesizer module. Again, there are different synthesizers for each version of the component. The synthesizer in v0.2 inputs platform information stored as cost information in another xml file. This final step generates either CUDA C code or sequential C code, in accordance with the user's choice.


Part III

Final Remarks


Chapter 10

Component Evaluation and Limitations

This chapter will attempt to evaluate the current state of the Master's Thesis project. The evaluation will be performed based on the output generated so far and a final verdict will be given. The second part will review all the limitations of the current component and will group together the proposals for future development based on their priority.

10.1 Evaluation

At the time of writing this report, the component's status is reflected in Table 10.1. The table presents the main tasks performed by the f2cc tool, grouped by the module which performs them. All other features are ignored since they cannot be measured and deliver no result.

Judging the frontend module, we can affirm that its performance is satisfactory. The XML parser works well: all XML files generated by the enhanced version of ForSyDe-SystemC (see Subsection 7.4.2) are parsed and analyzed for data extraction. The C function code parser performs acceptably as well: process function files from the ForSyDe model, written in C++ with annotations (see Appendix A), are parsed and encapsulated in semantically-correct C function objects. The internal model construction is also done properly according to the f2cc model (see Section 7.1), being subject to further manipulation along the execution flow. Finally, the XML model dumper works correctly, therefore plotting the intermediate structures is possible.

Analyzing the forsyde module, the results are satisfactory as well, although much development is still to be done. The internal f2cc model representation respects the ForSyDe protocol and stores enough information to make further model manipulations possible. Moreover, the manipulations are correctly performed according to Chapter 8. The data parallel sections are identified using an advanced model analysis algorithm that determines potentially parallel processes from random ForSyDe patterns that would otherwise be difficult to point out by a human designer. The platform optimization algorithm performs well when considering single processes in their context, but still needs improvement when chains of processes would benefit from platform re-mapping. The load balancing employs a novel algorithm based on cost analysis extracted from a platform model (presumably provided by future ForSyDe tools), which performs well. Although functioning within satisfactory parameters, these algorithms require further research and development.

Module       Function                                           Status                  Sample(s)
frontend     parse XML structural files                         working, satisfactory   Figure C.1
             parse C++ function code                            working, satisfactory   Listings C.1 & C.2
             build internal ForSyDe model                       working, satisfactory   Figure C.1
             dump XML from model                                working, satisfactory   Figures C.1 to C.8
forsyde      storage of internal model                          working, satisfactory
             model manipulation                                 working, limited        Figures C.1 to C.8
             advanced identification of data parallel sections  working, satisfactory   Figures C.1 to C.5
             platform optimization                              working, limited
             load balancing based on platform model             working, satisfactory   Figures C.6 to C.8
synthesizer  process sequential scheduler                       working, satisfactory   Listing C.3
             pipeline execution scheduler                       working, satisfactory   Listings C.3 & C.5
             C code generation                                  usable, incomplete      Listing C.4
             CUDA code generation                               not usable, incomplete  Listings C.5 & C.6

Table 10.1: The component's status at the time of writing this report


The synthesizer module could not be finalized in the time frame assigned to this project; thus, as Table 10.1 states, the code synthesis features, although functioning, do not generate proper code that can be used or tested. The C code, although usable, still needs some manual debugging to be run and tested (mostly concerning variable allocations, or syntax). The CUDA code, on the other hand, as mentioned in Subsection 8.2.3, still requires implementation effort in order to generate proper code. The state of the current output may be verified in Listing C.5, and it can be seen that it respects the principles presented in Subsection 8.2.2. Still, it is a long way until syntactically correct CUDA code is generated, especially for ForSyDe models that would require the synthesis of multiple kernels. Even so, the schedulers work correctly and deliver the expected outcome.

As a result, both performance evaluation and functional verification of the generated code are impossible. Therefore, the final verdict is that the project is still too young to be properly evaluated. For now, the demonstration provided in Appendix C, which shows intermediate steps of the tool's execution flow, will have to suffice as validation of the project's current development stage. Based on this material, we can affirm that although the final results are not available for the time being, the f2cc component behaves in the desired way, as described in Chapter 8.

The f2cc tool version 0.2 is therefore currently in its pre-alpha stage, and the release of a stable version will have to be delayed.


10.2 Limitations and future work

Part II of this report presents the contributions made in implementing the current component. Each time a new feature, concept or mechanism is introduced, its limitations and proposals for future contributions are stated. This section is intended to provide a short summary of these proposals, with the purpose of offering a clear and condensed list of goals for future development to be prioritized. As in Chapter 6, the priorities have been labeled High, Medium and Low.

The proposals for improving the component framework are:

• Develop a new internal model based on existing tree and graph libraries (High)
Since it has been shown on numerous occasions that the current model is very limited in its flexibility and scalability, a new model has been proposed both in Appendix B and Subsection 7.1.2. Its development is crucial for the project's further development.

• Develop an API that masks all operations employed in the model manipulations (Medium)
The manipulation of the internal model, although greatly aided by the current API, is still laborious and unstable. Since the new hierarchy structure introduces a new dimension for internal management, this has to be masked in order to increase development productivity and attract a larger community of developers. One of the proposed operations is presented in Subsection 7.1.2.

• Create a link between the structural model and the functional code (Low)
Although invaluable for research and experimental purposes, this feature, as presented in Subsection 7.1.2, does not have a high priority yet, as compared to the other more urgent goals.

• Improve support for/from the ForSyDe-SystemC design framework (High)
In order to embed the component into a fully automated design flow, direct support from ForSyDe-SystemC is crucial. In Subsection 7.2.2 a few proposals are made in this sense.

• Impose a standardized set of data types (High)
Although this feature would seem obvious in the context of system design and analysis, a study of its full impact on the design framework and language is necessary.

• Enhance support for recognition and extraction of complex data types and containers (High)
The importance of complex data types has been seen in the development of the Linescan model and other ForSyDe models. As proposed in Subsection 7.2.2, mechanisms for the recognition and manipulation of these data types need to be embedded into the ForSyDe framework.

• Offer support for dynamic systems with variable parameters (Low)
These systems would greatly broaden ForSyDe's range of applications, as suggested in Subsection 7.3.2. Currently though, this goal is merely speculative and further research is necessary to determine its validity.

• Develop a model for storing the code's functionality based on its AST (Medium)
As long as code manipulation and storage is based on text methods, it is subject to errors and imposes unnecessary restrictions upon the design framework, weakening ForSyDe's potential and attractiveness for new users.


• Develop mechanisms for functional code analysis and manipulation (Low)
This task connects the previous goal with creating a link to the ForSyDe structural model. For the time being it is a distant goal for purely scientific purposes.

• Develop a tool for model run-time analysis in addition to the run-time functional validation (Medium)
Since the current component employs an empirical model for cost description, it would be proper to extract these costs from run-time analyses of the execution patterns. Such a tool is proposed in Appendix B.

• Provide a real model of the platforms based on real measurements and architectural traits (High)
This issue is discussed in Subsection 7.4.3. For the design space exploration algorithms to make sense, a real model of the platform is needed.

• Improve the platform description algorithms (High)
Currently, the running platform is described using the simple (linear) equations enumerated in Section 7.4, involving some empirical constants. These formulas need to be further developed.

Regarding the algorithmic background, the following proposals can be made:

• Replace the algorithm for identifying data parallel sections (High)
In Subsection 8.1.1 a much faster algorithm than the one currently implemented for identifying potentially data parallel processes was presented, which was dropped due to the model's inflexibility to complex manipulations. Once the new internal model library is ready, the implementation of this algorithm has the highest priority.

• Increase the performance of the current algorithms (Medium)
Although some proposals have been stated in Subsection 8.1.5, algorithm performance is currently a much weaker issue than the proof of concept. Nevertheless, performance needs to be taken into account when considering the component's scalability.

• Formally validate the algorithms (High)
The algorithms developed as part of the current contribution have been presented, and their effect has been shown. A formal verification, though, is out of the scope of this Master's Thesis, but necessary for employing them in future design flows.

• Further develop model manipulation and design space exploration algorithms (High)
In the present contribution, only a small part of ForSyDe's full potential has been explored. Further research is still necessary in order to provide better solutions. Continuing with an extensive study of the current scientific literature may provide valuable resources.

• Complete the implementation of the synthesizer and improve its algorithms (High)
As stated in Section 10.1, the project is not completed yet. In order to be fully evaluated, it has to be finalized by providing a fully functional and correct synthesizer for both C and CUDA code.

• Optimize the generated code (Medium)
Since the project is still in its youth, proof of concept and scientific exploration are more important than platform-specific optimizations. In order to offer a full-scale industrial tool, however, this task will have to be assigned the highest priority. Even so, the platform optimizations will strongly depend on the targeted industry or market, which are not known yet. As for the design space exploration algorithms, the potential for optimization and its exploitation can be aided by the research community, which has proven to be very active in this particular domain in the past few years.

• Offer support for generating multiple kernels (High)
All the design flow stages support and build models that would result in multiple kernel invocations. Even so, the absolute last step, namely the top level invocation function synthesis, still needs a theoretical basis to generate code that invokes multiple kernels. This issue is stated in Subsection 8.2.3.


Chapter 11

Concluding Remarks

This chapter closes the current project by summarizing the main contributions to both system design and the academic community. Its achievements will be listed with respect to the initial goals fixed in the introduction, and to the challenges that were identified in Chapter 6.

This report presents a method for analyzing and manipulating formally-described systems in order to understand their potential for parallel computation and to map them onto platforms offering resources for time and data parallelism. The language chosen for describing systems is ForSyDe, a language that has been shown to meet all the demands for describing parallel systems, and which is associated with a methodology with high potential for developing automatic design flows. As design framework, ForSyDe-SystemC has been used. As target platforms, GPGPUs were chosen, as they are the leading platform for massively parallel processing in industry, and there is much ongoing research revolving around them. Finally, as tool for the design flow, an existing component called f2cc, previously developed in [Hjort Blindell, 2012], which already offered code synthesis mechanisms for nvidia GPGPUs, was chosen to be expanded.

As concerns the scientific contribution, both the theoretical background and the implementation efforts offer a strong foundation in multiple directions for future research. The "parallel paradox" as stated in Chapter 3 is analyzed from different angles, while attempting to reach the heart of the problem. While offering an insight into the available paradigms in Part I, the current contribution builds its own views on dealing with parallelism in Part II, proposing methods and solutions for the main challenges that arose during the development process. Also, a secondary track has been kept throughout the project with the main purpose of building specifications for a potential ForSyDe development toolkit, synthesized in Appendix B.

Considering the foundation built by the current project, it may be considered a success. Regarding the initial objectives, however, not every target could be achieved in the available time frame, some of them remaining open problems for future work:

• An extensive literature study has been conducted and relevant ideas have been extracted.


• A plan for expanding f2cc's functionality has been presented.

• New features specific to the previously non-supported ForSyDe-SystemC design framework have been provided.

• A synthesis flow for data and time parallel code targeting CUDA-enabled GPGPUs has been developed and embedded into the existing f2cc flow.

• A high-level model for the industrial-scale application Linescan provided by XaarJet AB was implemented and serves as demonstration model in Appendix C.

• High-quality code documentation was provided with both in-line comments and Doxygen-generated API documentation.

• The evaluation of the improved f2cc tool was impossible in the given time frame, due to the unfinished state of the code generation mechanisms.

As concerns the component challenges identified in Chapter 6, the project places itself in the following situation:

• all the frontend related goals have been achieved, even the low priority ones.

• the model related challenges have been met, except for the low priority ones, which fell too far outside the scope of this Master's Thesis.

• the synthesizer related goals have been partially achieved. Optimizing the use of shared memory was ruled out due to its low priority, and the stream wrapper, although provided, needs further development in order to generate correct CUDA code.

• all language related tasks were accomplished, although a different approach was chosen for code extraction: text-based manipulation from inside f2cc, with enhanced support from the ForSyDe-SystemC design framework.

• the general challenges were faced successfully. Although the low priority goal which proposed merging the two execution flows was dropped, since it was unrealistic from the start, both backward compatibility and execution flow integration were delivered in good condition.


Part IV

Appendices


Appendix A

Component documentation

This appendix presents how to build and use the software component in its current development state. Since the tool is a part of the ForSyDe design flow, a few guides on how to set up its context will be listed, with regard to the user's presumed already-acquired ForSyDe knowledge. Since the tool is still in its pre-alpha stage, a few guides for maintenance and future development will be provided in the last section of this chapter.

A.1 Building

The current software tool has been developed and run on Ubuntu Linux 12.04, Ubuntu Linux 12.10 and Xubuntu Linux 12.04, and it should compile and run on any Linux distribution (no warranties can be made for compiling and running under Windows, since it would most likely require a few minor modifications in the source code regarding file handling and data acquisition). The component as such uses standard libraries and tools that come with the Linux distribution, such as make and g++. In case that, for some reason, these dependencies are not fulfilled, the user must install them. On a Debian distribution, running the following command in the terminal will fix this.

$ sudo apt-get install build-essential

The source code comes with its own makefile. To build the component, go to the path where the source code was extracted and run:

$ make

This produces the binary file along with its library, object, and all the other files associated with the tool. The program has its own folder structure that is generated and placed in the bin subfolder. To clean the build, execute:

$ make clean



A.2 Preparations

Although the tool itself uses tools and libraries provided with the Linux distribution, it is part of a design flow based on existing ForSyDe tools. The system that needs to be synthesized into parallel CUDA code is designed in the ForSyDe-SystemC design framework.

For setting up and using ForSyDe-SystemC, it is strongly advisable to consult the ForSyDe wiki page [ForSyDe, 2013]. As stated in the official documentation, the design framework depends on SystemC (a good tutorial on setting up SystemC with popular IDEs is found at [Bjerge, 2007]), and it uses features from the Boost libraries and the C++11 standard.

At the time of writing this report, a custom ForSyDe library with enhanced introspection features was used. These features implement data type recognition and data size extraction, and it will be discussed whether or not they should appear in the official releases. If it is decided that another solution is to be used, the component's wiki page will hold this information and further guides.

The ForSyDe system development is performed according to [ForSyDe, 2013], with the following constraints:

• each process function needs to be defined in a separate file called [procname]_func.hpp, where [procname] is the name given to the process. Also, the function has to be named following the same convention: [procname]_func.

• each time a new complex data type is used, it has to be defined for proper introspection using the DEFINE_TYPE_NAME macro, as in the following example:

DEFINE_TYPE_NAME(std::array<double,BUFFER_SIZE>, "std::array<double>");

• the only STL data type allowed for the time being is std::array, since it holds structural information that can be extracted by the ForSyDe introspection module.

• code between #pragmas needs to be pure C code, since no semantic modification is done to that part.

• the code outside #pragmas has the main role of wrapping variables into or unwrapping them from signals. This area must contain no functionality, hence it will be ignored. A sketch illustrating these conventions follows the list.
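As an illustration of the file naming and #pragma conventions above, the following is a hand-written sketch of a process function file for a hypothetical process named scale; the exact pragma spelling is assumed from ForSyDe-SystemC examples and may differ in the custom library used here:

// File: scale_func.hpp (hypothetical process "scale")
void scale_func(float& out1, const float& inp1)
{
#pragma ForSyDe begin scale_func
    /* pure C code between the pragmas */
    out1 = inp1 * 2.0f;
#pragma ForSyDe end
}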

After the system model has been created and validated, and its structural information has been dumped, its xml files need to be annotated with cost information. This means that the designer needs to add run time approximations for each process (for a run-time analysis tool, the user may consult Callgrind or KCacheGrind), by adding a cost attribute to each process_constructor. This step may be automated by further offering introspection support from the ForSyDe library, but this kind of support would lead to a lot of non-formal workarounds, thus a correct way of extracting run-time information is necessary.

The platform model (see Section 7.4) needs to be provided as well. The tool comes with a template that looks like the one in Listing A.1. This file is created in the bin/config subfolder the first time the component is built. Depending on the user's own platform model, the values contained in this file need to be adjusted.



<?xml version="1.0" ?>
<f2cc_cost_coefficients>
    <host_to_device_transfer value="100"/>
    <device_to_host_transfer value="100"/>
    <device_inter_thread_transfer value="1"/>
    <device_intra_thread_transfer value="0"/>
    <host_to_host_transfer value="0"/>
    <host_execution value="5"/>
    <device_execution value="5"/>
</f2cc_cost_coefficients>

Listing A.1: The platform model XML template.

To set up the synthesis of the ForSyDe model, the internal tool paths need to be filled accordingly:

• bin/inputs: contains all the input structural XML or GraphML files that serve as input model for the f2cc tool.

• bin/outputs: will hold the generated .h, .c or .cu files.

• bin/func: contains the process function files, as they were used for the validation process in ForSyDe-SystemC. These files are needed for the f2cc v0.2 execution flow.

• bin/config: as mentioned above, holds the platform model.

• bin/intermediate: will hold the intermediate XML representations of the internal f2cc model, as they are dumped between the main steps of the model modification stage.

A.3 Running the tool

After all the preparations in Section A.2 have been made, the tool can be run. It is controlled from the command prompt through command-line arguments. To synthesize a model file with the default options, execute:

$ ./f2cc top_level_input_file

The argument top_level_input_file coincides with the XML structural file corresponding to the top module of a ForSyDe-SystemC system, or the GraphML structural file associated with a ForSyDe-Haskell system.

All the command-line option arguments need to be placed before the input file, but their order is not important. For a full list of the option arguments and their explanation, run:

$ ./f2cc -h

A.4 Maintenance

Since f2cc version 0.2 is still in a pre-alpha state, it does not arrive with any documentation, because it is due to change. The source code, though, has been documented using Doxygen annotations. To dynamically generate API documentation that reflects all the source code changes, the makefile is equipped with a command that invokes Doxygen to parse through the source. This requires that Doxygen is installed, which is done using the following command line:

$ sudo apt-get install doxygen

The API documentation, in HTML format, resides in the source/docs subfolder and is generated using the following command line:

$ make docs

The source project's module structure suffered only minor modifications from version 0.1, so the guidelines from Appendix A in [Hjort Blindell, 2012] are still valid, with the additions presented in Section 9.2. Improving the component by adding process support therefore implies:

• adding the process’ class description into its correct MoC folder and namespace, andinheriting its base attributes from the Leaf class.

• providing frontend support, so that the new process is recognized and parsed correctly and the internal model is able to include it.

• if the process requires any further analysis, or affects the current analysis or model modifier algorithms, adding support for it in the model modifier classes (ModelModifier for the v0.1 execution flow and ModelModifierSysC for the v0.2 execution flow).

• providing the synthesizer with methods for interpreting the new process and generating code from it.
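
As a minimal sketch of the first step, a new process type in the SY MoC namespace could look as follows. Only the Leaf base class is named by the guidelines; the stub standing in for it, the namespaces and every other identifier are illustrative assumptions, not f2cc's actual API.

#include <string>

namespace f2cc { namespace ForSyDe { namespace SY {

// Stand-in for f2cc's existing base class of leaf processes; in the real
// source tree the new class would inherit the existing Leaf instead.
class Leaf {
  public:
    explicit Leaf(const std::string& id) : id_(id) {}
    virtual ~Leaf() = default;
    const std::string& id() const { return id_; }
  private:
    std::string id_;
};

// A hypothetical new process type, placed in its MoC folder and namespace.
class MyNewProcess : public Leaf {
  public:
    MyNewProcess(const std::string& id, const std::string& function_name)
        : Leaf(id), function_name_(function_name) {}

    // A type tag which the frontend and the synthesizer could dispatch on.
    std::string type() const { return "mynewprocess"; }

  private:
    std::string function_name_; // C function the synthesizer generates calls to
};

}}} // namespace f2cc::ForSyDe::SY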

As expected, any new algorithm that performs model analysis or manipulation needs to be exposed as a public function in one of the model modifier classes (depending on the design flow it concerns). Likewise, any improvement or new feature regarding platform synthesis, code generation or optimization should be reflected in one of the two synthesizer classes.


Appendix B: Proposing a ForSyDe Design Toolbox

This chapter presents some of the author's personal reflections that arose while being immersed in the ForSyDe methodology and studying the ForSyDe-SystemC framework. While trying to solve an industrial problem using ForSyDe, a number of features were identified and analyzed from the author's point of view. A series of reports was written while identifying these features.

The following sections do not intend to solve any given problem, and will not be specifically treated during this project. Their purpose is to briefly summarize the work reports' content, to help settle some concepts that may aid the development of future ForSyDe research.

B.1 A simple explanatory example

In order to explain the full scale of the ForSyDe design flow, a very simplistic example is presented. It is assumed that there is a tool, used after the designer has modeled a system in the modeling framework, that analyzes the model for its computation and communication behavior. This tool will be presented further in Section B.2. The analysis tool should provide quantitative measurements as costs (empirically denominated as low, medium or high), and identify potential for further manipulation or design space exploration. The analyzed system looks like in Figure B.1.

Following a set of refinements based on analysis information, design space exploration and other similar mechanisms, the system in Figure B.1 would result in the refined system in Figure B.2. A suitable set of transformations is:

• process 2, identified as potentially parallelizable (e.g. matrix multiplication), could be suggested to be mapped on an intensively parallel1 platform. If such is the case, its computation cost will be recalculated to medium.

• process 6 could be suggested to be mapped to a cache-based / branch-prediction-based platform. This suggestion is supported by its low communication intensity.

1 More on parallel taxonomies can be found in Subsection 3.2.2.



[Figure B.1: Example of a ForSyDe model after analysis. Six processes carry computation-cost annotations: 1 Med (distribution role); 2 High (high potential for parallelization, identified potential for code optimization); 3 Med (no identified potential for optimization); 4 Low; 5 Low; 6 High (high complexity). The signals between them are annotated with low (L), medium (M) or high (H) communication costs.]

[Figure B.2: Example of the ForSyDe model after transformations & refinements. The annotated processes are partitioned between Platform A (complex + intensively parallel) and Platform B (sequential) and organized into pipeline stages 1 through 4.]

• a good practice would be load balancing the pipeline stages on a complex parallel1 platform. As seen in Figure B.1, process 3 cannot be further optimized for parallelism, thus it is not susceptible to further design space exploration.

• the high communication between process 4 and process 5 could be considered a potential bottleneck. Consequently, the best choice is to merge these two processes as shown in Figure B.2, and to lower the communication overhead using low-latency memories and other code optimization techniques.

B.2 Layered refinements & Refinement loop

As seen in Section B.1, the bridge from a high-level description to a low-level implementation is a long and tedious one, and definitely not straightforward. It involves several design decisions based on a proper analysis of the system.


[Figure B.3: Some aspects of ForSyDe analysis. The analysis flow splits into static code analysis, covering cyclomatic complexity and channel structure, and run-time analysis, covering computation intensity (instructions vs. memory accesses, simple access patterns), performance analysis (relative execution time / CPI) and communication intensity (channel calls).]

[Figure B.4: Hierarchical separation of refinements. Between the high-level, MoC-based description and the low-level backend implementation sit three layers: purely HL model transformations, model transformations with mapping information, and code optimizations (application-specific, then platform-specific); MoC information thins out and backend information accumulates along the way.]



First, the analysis mechanism, in order to be efficient and favor proper design decisions, has to be far more complex than presented in Section B.1. Figure B.3 depicts a few aspects of model behavioral and structural analysis. These analyses are grouped into static analyses, achieved by parsing and analyzing either the code or the model, and run-time analyses, accomplished by running the specified process on virtual platforms described by relevant features:

• cyclomatic complexity is a measurement that describes a program's complexity (e.g. loops, jumps, decision points) [McCabe, 1976]; a worked example follows this list. Usually, complex programs benefit from specific types of platforms (cache-based, bubble-free [Codreanu and Hobincu, 2010], etc.).

• channel structure captures information about the type and amount of data encapsulated and transported by each channel associated with a ForSyDe signal.

• computation intensity measurements can be done by analyzing the program's behavior at run-time. Some computation details can be identified, for example arithmetic intensity, defined as computation vs. memory accesses [Hennessy and Patterson, 2011], and simple memory access patterns, which could prove useful in further code / mapping optimizations.

• performance analysis. Although counting the CPI or the execution time may not be relevant, especially for high-level models, a run-time analysis would provide early estimates of potential bottlenecks and critical paths in the system.

• communication intensity describes how much a channel is used. A good analysis is done by running the model in a real-case scenario. By counting the channel calls and knowing the channel's structure, a relevant measurement of the communication intensity is possible.
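
To make the first of these measurements concrete: for structured, single-entry single-exit code, McCabe's metric over the control-flow graph (with E edges, N nodes and P connected components) reduces to one plus the number of binary decision points D:

M = E - N + 2P = D + 1

For the acorr_od function of Listing C.1, for instance, the two for loops and the single if statement give D = 3 and hence M = 4, a low value that marks the kernel as control-flow-light and computation-heavy rather than a candidate for a cache-based or branch-prediction-based platform.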

A thorough analysis of the transformation patterns leads to adopting a layered topology for describing the ForSyDe design flow. In Figure B.4 a hierarchical model for separating design transformation layers is depicted. It encourages re-use of the same refinements for different design flows, based on their similarity or on the similarities between targeted platforms or MoCs.

As the graphic suggests, the farther one goes into refinement, the more MoC information is lost or becomes irrelevant. On the other hand, more implementation / mapping details show up as backend / platform information. Three main hierarchical levels can be identified:

• purely high-level model transformations, which affect only the model of the design and do not directly shape the final implementation.

• model transformations with mapping information, which include mainly model transformations dependent on the platform and relevant only for a specific class of applications. Further classifications can be made, for example: class of applications → class of platforms → platform.

• code optimizations, the final stages of the refinement process, when the high-level model is no longer relevant. This implies that the means of communication and computation have been described and/or mapped, but code optimizations are still possible.

Finally, a refinement loop is defined as the elementary means to achieve the proper transformation for a design stage. It is applied upon an intermediate representation of the design and results in another intermediate representation.


[Figure B.5: A sketch for a general refinement loop. An intermediate representation (structure, code, mapping) passes through analysis, decision and transformation into the next intermediate representation; design constraints, the platform model & information, an analysis tool and a transformation library feed the loop, in which both human and machine take part.]

Each hierarchical layer may have several refinement loops or none at all, depending on the desired goal.

A simple sketch is drawn in Figure B.5; it implies that the transformation is done after a decision based upon a proper analysis that has the proper input data. The picture also implies that, in the author's view, the refinement loop should not be completely transparent to the designer; at the very least, the designer should be allowed to take part in the decision process if he wishes so.
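
A minimal sketch of such a loop is given below, under the assumption that analysis, decision and transformation can be expressed as interchangeable functions over an intermediate representation. Every type and function name is an illustration invented for this sketch, not an existing ForSyDe API.

#include <functional>
#include <vector>

// Illustrative intermediate representation: structure, code and mapping
// would live here in a real tool.
struct IR { /* process network, mapping, annotations ... */ };

// One refinement: an analysis producing a metric, a decision gating the
// transformation, and the transformation itself.
struct Refinement {
    std::function<double(const IR&)> analyze;
    std::function<bool(double)>      decide;    // may involve the designer
    std::function<IR(const IR&)>     transform;
};

// Apply a chain of refinement loops: each analyzes the current intermediate
// representation, decides, and possibly transforms it into the next one.
IR refine(IR ir, const std::vector<Refinement>& refinements) {
    for (const auto& r : refinements) {
        double metric = r.analyze(ir);
        if (r.decide(metric))
            ir = r.transform(ir);
    }
    return ir;
}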

B.3 Proposed architecture for the design flow tool

Currently ForSyDe as a design methodology is still in its youth, thus there is still no unified toolbox that embeds all its features. Driven by this fact, we propose future research into a unified software architecture that implements the ForSyDe design flow, based on examples found in the open-source community.

One especially attractive project model is inspired by the Apache Maven build automation tool [Maven, 2007]. Maven fixes a Project Object Model (POM) which, like in the object-oriented paradigm, describes a hierarchical structure of projects and sub-projects. It also describes dependencies between projects, dependency inheritance and plugin usage. The most important feature is that the primary means to extend a project is through plugins. Once a project has set its foundation, technically anyone can write plugins without possessing thorough knowledge of the project's internal architecture.

By following Maven's model, the ForSyDe Toolkit can be built from two main entities:

• the backbone, which will play the role of hub for the toolkit. It will be built following the Model-View-Controller (MVC) [Krasner and Pope, 1988] architecture, and will serve as both interface and controller for the plugins. It will implement three sets of classes: an interface for interacting with the plugins, a control set for running and automating plugins and taking care of dependencies, and a data set encapsulating intermediate data for plugins (a minimal interface sketch follows this list).


• the plugins, which will be individual components of the ForSyDe Toolkit, developed as separate tools. They will be controlled by, and will communicate only with, the backbone through its API. They can be transformation libraries, parsing tools, analysis tools, backend synthesizers, automation tools (which implement design flows), visualization tools, or even GUIs.
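
The sketch below illustrates how the three sets of backbone classes and the plugin contract could fit together. All identifiers are assumptions made for illustration; no such ForSyDe Toolkit API exists yet, and dependency resolution is elided for brevity.

#include <memory>
#include <string>
#include <vector>

// Data set: intermediate data exchanged between plugins via the backbone.
struct IntermediateModel {
    std::string xml; // e.g. a dumped intermediate XML representation
};

// Interface set: the contract every plugin implements.
class Plugin {
  public:
    virtual ~Plugin() = default;
    virtual std::string name() const = 0;
    virtual std::vector<std::string> dependencies() const = 0;
    virtual IntermediateModel run(const IntermediateModel& in) = 0;
};

// Control set: the backbone registers plugins and runs them in order
// (a real implementation would order them by their declared dependencies).
class Backbone {
  public:
    void register_plugin(std::unique_ptr<Plugin> p) {
        plugins_.push_back(std::move(p));
    }
    IntermediateModel run_all(IntermediateModel m) {
        for (auto& p : plugins_)
            m = p->run(m);
        return m;
    }
  private:
    std::vector<std::unique_ptr<Plugin>> plugins_;
};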

The advantages of such an architecture would be the separation and individual development of different components by different research parties, with minimum overhead. By following the POM, backward compatibility will be easy to assure, and the handling of dependencies will be taken care of through automation rather than software development. Furthermore, it will enable parallel execution of tools, since each has its own separate environment, hence high-performance processing can be exploited.


Appendix C: Demonstrations

This appendix demonstrates the usage of the current component by showing the intermediate results output at the time of writing this report.

As mentioned in the project's introduction, the main application studied and used as an example to demonstrate f2cc's usage is called Linescan and was provided by the company XaarJet AB. It is an industrial-scale image processing application with the purpose of real-time monitoring of hardware systems. Since it involves processing massive amounts of data and safety-critical system control loops, it is a good application for demonstrating the potential of ForSyDe as a system design methodology.

As part of the Master's Thesis contribution, core parts of Linescan have been modelled in ForSyDe in order to be fed into experimental design flows that would result in correct-by-design implementations on heterogeneous platforms. The core functionality chosen for testing f2cc was profiled in the provider's reports as being potentially time- and data-parallel.

The material found on pages 117 through 129 shows samples from different stages of the design flow associated with f2cc v0.2. Although the end result of the component could not be tested and validated due to the lack of time available for debugging, the results are nevertheless shown to reflect the current state of the tool's development.

The test system has been modelled in ForSyDe-SystemC as described in Appendix A. A sample of the function code, both as implemented in ForSyDe and as extracted by the f2cc frontend module, can be seen in Listing C.1 and Listing C.2. The ForSyDe model has been dumped into its XML structural representation and parsed by f2cc, and an internal model was successfully built. This is demonstrated by Figure C.1, which plots the internal model.

Afterwards, the model modifier algorithms applied their methods upon this internal representation: the model flattening (Figure C.2), the grouping of potentially data-parallel comb processes (Figure C.3), the grouping of the rest of the potentially parallel processes (Figure C.4), the removal of redundant processes (Figure C.5), the load balancing of the process network (Figure C.7) and finally the remodelling of the system in order to describe pipelined execution (Figure C.8).



Figures C.3 to C.5 show the result of grouping potentially parallel processes even from different data flows, provided there is no data dependency between them. As can be seen, the data flow is merged into xcorr_ns and then split again, while still respecting the correct semantics of the system. Since it has been shown that further development is still needed to provide a solution for synthesizing kernels containing processes with different widths, we give up this optimization, thus leaving the two data paths separated in Figure C.6. Also, since the network contains no process that would benefit from changing the initially-mapped platform, platform optimization could not be observed on the provided model.

Excerpts from the logger output are provided in Listing C.3 to demonstrate the working state of both the sequential scheduler and the pipeline stage scheduler.

Listing C.4, Listing C.5 and Listing C.6 show samples of backend code generated by the current component. As can be seen, the sequential code is in an almost finished state. With minor adjustments and polishing, the generated sequential program can even be compiled and run. The CUDA code, although complex, still needs much further adjustment in order to be validated.


// type name definition used by the ForSyDe dumper to represent types in the XML files
DEFINE_TYPE_NAME(std::array<double, BUFFER_SIZE>, "std::array<double>");
DEFINE_TYPE_NAME(std::array<double, BUFFER_SIZE + 3>, "std::array<double>");

// process function definition
void acorr_od(abst_ext<std::array<double, BUFFER_SIZE>> &output,
              const abst_ext<std::array<double, BUFFER_SIZE + 3>> &in_state) {

    std::array<double, BUFFER_SIZE + 3> state = unsafe_from_abst_ext(in_state);

#pragma ForSyDe begin od
    unsigned int position = (unsigned int) state[0];
    int sw = (int) state[1];
    int W = sw * 2;
    int sampWin_begin_idx = (position + (BUFFER_SIZE / 2 - sw));

    double retValue[BUFFER_SIZE] = {0};
    int k, d, c_idx, d_idx;
    for (int b = -sw; b < sw; b++) {
        for (int c = 0; c <= W; c++) {
            d = c + b + 1;
            c_idx = (sampWin_begin_idx + c) % (BUFFER_SIZE + 1);
            d_idx = (sampWin_begin_idx + d) % (BUFFER_SIZE + 1);
            k = b + sw;
            if ((d >= 0) && (d < W)) {
                retValue[BUFFER_SIZE/2 + sw - k - 1] += *(&state[2] + c_idx)
                                                      * *(&state[2] + d_idx);
            }
        }
    }
#pragma ForSyDe end

    std::array<double, BUFFER_SIZE> out_vector =
        reinterpret_cast<std::array<double, BUFFER_SIZE>&>(retValue);
    output = abst_ext<std::array<double, BUFFER_SIZE>>(out_vector);
};

Listing C.1: Sample ForSyDe process function code written in C++ for simulation in the ForSyDe-SystemC framework

void acorr_od(double *retValue, const double *state) {

    unsigned int position = (unsigned int) state[0];
    int sw = (int) state[1];
    int W = sw * 2;
    int sampWin_begin_idx = (position + (24 / 2 - sw));

    int k, d, c_idx, d_idx;
    for (int b = -sw; b < sw; b++) {
        for (int c = 0; c <= W; c++) {
            d = c + b + 1;
            c_idx = (sampWin_begin_idx + c) % (24 + 1);
            d_idx = (sampWin_begin_idx + d) % (24 + 1);
            k = b + sw;
            if ((d >= 0) && (d < W)) {
                retValue[24/2 + sw - k - 1] += *(&state[2] + c_idx)
                                             * *(&state[2] + d_idx);
            }
        }
    }
}

Listing C.2: Sample C process function code extracted by the CParser module from the code in Listing C.1


[Figure C.1: Input model (hallo.xml), zoomed detail. The plot shows the top-level process f2cc0 containing sixteen composite processes f2cc0_DDFA0 through f2cc0_DDFA15, each built from ACorr, AvgSub, OutB, SubDiv and XCorr subcomponents (leaf processes xcorr_ns, xcorr_od, acorr_od, averager, sub, div, oblock_ns, oblock_od) interconnected through delay, fanout, zipX and unzipX processes over double and unsigned int array signals.]


[Figure C.2: The model after flattening (flattened.xml), zoomed detail. The composite hierarchy has been removed: all leaf processes (xcorr_ns, xcorr_od, acorr_od, averager, sub, div, oblock_ns, oblock_od, delay, fanout) now reside directly in f2cc0, framed by the top-level zipX and unzipX processes.]


[Figure: process-network graph from flattened1.xml, top composite f2cc0, parallel composites pcomp_1–pcomp_9. The grouped comb leaves (averager, acorr_od, xcorr_ns, xcorr_od, sub, div, oblock_ns, oblock_od) are each flanked by fanout, delay, zipX and unzipX processes over packed signals such as double[24] : double[384], double[27] : double[432] and double[26] : double[416].]

Figure C.3: The model after grouping equivalent comb processes. Zoomed detail
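Grouping equivalent comb processes replaces the sixteen duplicated instances of each functional leaf with a single process applied across the packed signal. As a rough sketch of the effect (illustrative C++ with an assumed element-wise sub leaf, not generated code), sixteen identical calls collapse into one loop over the zipped double[384] signal:

// Leaf function of one comb process; element-wise subtraction is assumed here.
void sub(double* out, const double* a, const double* b) {
    for (int i = 0; i < 24; ++i)
        out[i] = a[i] - b[i];
}

// One grouped process covering all 16 former instances, operating on the
// packed double[384] signals produced by zipX.
void sub_grouped(double* out384, const double* a384, const double* b384) {
    for (int p = 0; p < 16; ++p)
        sub(&out384[p * 24], &a384[p * 24], &b384[p * 24]);
}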


[Figure: process-network graph from flattened2.xml, top composite f2cc0, parallel composites pcomp_1–pcomp_18. The equivalent comb leaves are now grouped into potentially parallel sections, but back-to-back zipX/unzipX pairs still sit between adjacent sections.]

Figure C.4: The model after grouping potentially parallel leaf processes. Full scale


[Figure: process-network graph from flattenAndParallelize.xml, top composite f2cc0, parallel composites pcomp_1–pcomp_18. The redundant zipX/unzipX pairs between adjacent sections have been removed, leaving direct typed connections such as double[24] : double[384] and double[432] : double[27].]

Figure C.5: The model after removing redundant zipx and unzipx processes. Zoomed detail
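The removal step exploits the fact that an unzipx immediately followed by a zipx of the same shape (or vice versa) reproduces its input, so the pair can be deleted and the producer wired directly to the consumer. A small self-contained C++ check (illustrative only) makes the identity explicit:

#include <cassert>
#include <cstring>

int main() {
    double packed[384], parts[16][24], repacked[384];
    for (int i = 0; i < 384; ++i)
        packed[i] = i;

    // unzipX: double[384] -> 16 x double[24]
    for (int p = 0; p < 16; ++p)
        std::memcpy(parts[p], &packed[p * 24], 24 * sizeof(double));

    // zipX: 16 x double[24] -> double[384]
    for (int p = 0; p < 16; ++p)
        std::memcpy(&repacked[p * 24], parts[p], 24 * sizeof(double));

    // The pair is the identity, so it can be removed from the network.
    assert(std::memcmp(packed, repacked, sizeof packed) == 0);
    return 0;
}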


[Figure: process-network graph from optimizePlatform.xml, top composite f2cc0. Every parallel composite is annotated pcomp_N[16] <device, stage:0>, and every leaf carries a cost annotation: acorr_ns = cost:10, xcorr_ns = cost:10, xcorr_od = cost:267, acorr_od = cost:260, averager = cost:30, div = cost:23, sub = cost:25, oblock_ns = cost:30, oblock_od = cost:5, while delay and fanout processes have cost:0.]

Figure C.6: The model after platform optimization for a different example. Zoomed detail


[Figure: process-network graph from loadBalanced.xml, top composite f2cc0. The composites keep their cost annotations but are now distributed over pipeline stages, e.g. pcomp_1[16] <device, stage:1>, pcomp_10[16] <device, stage:2>, pcomp_12[16] <device, stage:3> and pcomp_2[16] <device, stage:4>.]

Figure C.7: The model after applying the load-balancing algorithm. Zoomed detail
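The stage assignment in the figure can be read as a cost-driven split of the sequential schedule. The sketch below shows one plausible greedy heuristic, not the algorithm implemented in f2cc: the stage quantum is set by the most expensive process, a new stage is opened whenever the accumulated cost would exceed it, and the zero-cost delay and fanout processes are left out. With the costs from Figure C.6 and the schedule order assumed below, it happens to reproduce the four stages of Figure C.7.

#include <cstdio>
#include <vector>

struct Proc { const char* name; int cost; };

int main() {
    // Assumed sequential schedule order; costs taken from Figure C.6.
    std::vector<Proc> sched = {
        {"acorr_ns", 10},  {"xcorr_ns", 10}, {"xcorr_od", 267},
        {"averager", 30},  {"sub", 25},      {"div", 23},      {"sub", 25},
        {"oblock_ns", 30}, {"oblock_od", 5}, {"acorr_od", 260}};

    // Quantum: the largest single process cost (267 here).
    int quantum = 0;
    for (const Proc& p : sched)
        if (p.cost > quantum) quantum = p.cost;

    int stage = 1, acc = 0;
    for (const Proc& p : sched) {
        if (acc + p.cost > quantum) { ++stage; acc = 0; }  // open a new stage
        acc += p.cost;
        std::printf("%-10s -> stage %d\n", p.name, stage);
    }
    return 0;
}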


[Figure: process-network graph from pipelined.xml, top composite f2cc0, after merging the load-balanced composites into four pipeline stages: pcomp_1[16] <device, stage:1>, pcomp_10[16] <device, stage:2>, pcomp_12[16] <device, stage:3> and pcomp_2[16] <device, stage:4>.]

Figure C.8: The model after creating pipeline directives. Zoomed detail


 1  (...)
 2
 3  2013-06-10 15:57:10 [INFO] - NEW MODEL INFO:
 4      Number of leafs: 20
 5      Number of composites: 5
 6      Number of inputs: 2
 7      Number of outputs: 1
 8  2013-06-10 15:57:10 [INFO] - Checking that the internal process network is valid for
 9      synthesis...
10  2013-06-10 15:57:10 [INFO] - All checks passed
11  2013-06-10 15:57:10 [INFO] - Generating sequential schedules for all composite
12      processes...
13  2013-06-10 15:57:10 [INFO] - Generating process schedule for ParDDFA...
14  2013-06-10 15:57:10 [INFO] - Process schedule for ParDDFA:
15      pcomp_1, pcomp_2, pcomp_10, pcomp_12
16  2013-06-10 15:57:10 [INFO] - Generating process schedule for stage_1...
17  2013-06-10 15:57:10 [INFO] - Process schedule for stage_1:
18      f2cc0_DDFA0_ACorr_buffer, f2cc0_DDFA0_ACorr_bufFan,
19      f2cc0_DDFA0_XCorr_buffer, f2cc0_DDFA0_XCorr_bufFan,
20      f2cc0_DDFA0_sampFan, f2cc0_DDFA0_swFan,
21      f2cc0_DDFA0_ACorr_acorr_ns, f2cc0_DDFA0_XCorr_xcorr_ns
22  2013-06-10 15:57:10 [INFO] - Generating process schedule for stage_2...
23  2013-06-10 15:57:10 [INFO] - Process schedule for stage_2:
24      f2cc0_DDFA0_XCorr_xcorr_od
25  2013-06-10 15:57:10 [INFO] - Generating process schedule for stage_3...
26  2013-06-10 15:57:10 [INFO] - Process schedule for stage_3:
27      f2cc0_DDFA0_OutB_buffer, f2cc0_DDFA0_OutB_bufFan,
28      f2cc0_DDFA0_SubDiv_inFan, f2cc0_DDFA0_SubDiv_sub,
29      f2cc0_DDFA0_SubDiv_div, f2cc0_DDFA0_AvgSub_inFan,
30      f2cc0_DDFA0_AvgSub_averager, f2cc0_DDFA0_AvgSub_sub,
31      f2cc0_DDFA0_OutB_oblock_ns, f2cc0_DDFA0_OutB_oblock_od
32  2013-06-10 15:57:10 [INFO] - Generating process schedule for stage_4...
33  2013-06-10 15:57:10 [INFO] - Process schedule for stage_4:
34      f2cc0_DDFA0_ACorr_acorr_od
35  2013-06-10 15:57:10 [INFO] - Generating wrapper functions for composite processes...
36  2013-06-10 15:57:10 [INFO] - Creating signal variables for "pcomp_2"...
37  2013-06-10 15:57:10 [INFO] - Created 2 signal(s).
38  2013-06-10 15:57:10 [INFO] - Creating delay variables for "pcomp_2"...
39  2013-06-10 15:57:10 [INFO] - Created 0 delay variable(s)
40  2013-06-10 15:57:10 [INFO] - Creating signal variables for "pcomp_12"...
41  2013-06-10 15:57:10 [INFO] - Created 15 signal(s).
42  2013-06-10 15:57:10 [INFO] - Creating delay variables for "pcomp_12"...
43  2013-06-10 15:57:10 [INFO] - Created 1 delay variable(s)
44
45  (...)
46
47  2013-06-10 15:57:10 [INFO] - Generating streamed CUDA kernel execution functions for
48      adjacent parallel composite processes...
49
50  (...)
51
52  2013-06-10 15:57:10 [INFO] - Creating a CUDA kernel wrapper from the contained section
53      "pcomp_1--pcomp_12"...
54  2013-06-10 15:57:10 [INFO] - USING SHARED MEMORY FOR INPUT DATA: NO
55  2013-06-10 15:57:10 [INFO] - Creating signal variables for "f2cc0"...
56  2013-06-10 15:57:10 [INFO] - Created 7 signal(s).
57
58  (...)
59
60  2013-06-10 15:57:10 [INFO] - Creating the top level execution function...
61  2013-06-10 15:57:10 [INFO] - Optimizing kernel for 1 burst(s) and 6 stage(s)...
62
63  (...)

Listing C.3: Excerpt from the f2cc output logger. The highlighted lines 14, 17, 23, 26 and 33 show the sequential schedule, while line 61 shows the calculated parameters necessary for pipeline stage mapping


void stage_1_exec_wrapper(double* out1, double* out2, double in1, unsigned int in2) {
    int i; // Can safely be removed if the compiler warns
           // about it being unused

    // Declare signal variables
    double* fanout1 = new double[27];
    double fanout2;
    unsigned int fanout3;
    double* delay4 = new double[27];
    double* comb5 = new double[27];
    double* delay6 = new double[27];
    double* comb7 = new double[27];
    double* fanout8 = new double[27];
    double fanout9;
    unsigned int fanout10;

    // Declare delay variables
    static double v_delay_element2[27] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    static double v_delay_element1[27] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    // Execute leafs
    for (i = 0; i < 27; i++) {
        delay4[i] = v_delay_element1[i];
    }
    for (i = 0; i < 27; i++) {
        delay6[i] = v_delay_element2[i];
    }
    for (i = 0; i < 27; i++) {
        out2[i] = delay4[i];
    }
    for (i = 0; i < 27; i++) {
        fanout1[i] = delay4[i];
    }
    for (i = 0; i < 27; i++) {
        out1[i] = delay6[i];
    }
    for (i = 0; i < 27; i++) {
        fanout8[i] = delay6[i];
    }
    fanout2 = in1;
    fanout9 = in1;
    fanout3 = in2;
    fanout10 = in2;
    acorr_ns(comb5, fanout1, fanout2, fanout3);
    xcorr_ns(comb7, fanout8, fanout9, fanout10);
    for (i = 0; i < 27; i++) {
        v_delay_element1[i] = comb5[i];
    }
    for (i = 0; i < 27; i++) {
        v_delay_element2[i] = comb7[i];
    }

    // Clean up memory
    delete[] fanout1;
    delete[] delay4;
    delete[] comb5;
    delete[] delay6;
    delete[] comb7;
    delete[] fanout8;
}

Listing C.4: Sample from a generated sequential function. This is the wrapper function for process pcomp_1 from Figure C.8
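Note how the static v_delay_element buffers implement the ForSyDe delay semantics: each invocation first emits the value stored by the previous invocation, and only after the comb leaves have run does it overwrite the buffer with their new outputs. Reduced to a scalar, the pattern looks as follows (an illustration, not generated code):

#include <cstdio>

// One delay process reduced to a scalar: emit the previous token first,
// then store the new one for the next invocation.
double delay_step(double next) {
    static double element = 0.0;  // initial delay value
    double out = element;
    element = next;
    return out;
}

int main() {
    for (int k = 1; k <= 4; ++k)
        std::printf("in=%d out=%g\n", k, delay_step(k));  // out: 0 1 2 3
    return 0;
}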


void cudaParDDFA(double* out1, double* in1, unsigned int* in2, unsigned long long N) {
    int i; // Can safely be removed if the compiler warns about it being unused
    cudaStream_t stream[6];
    struct cudaDeviceProp prop;
    int max_threads_per_block;
    int shared_memory_per_sm;
    int num_multicores;
    int full_utilization_thread_count;
    int is_timeout_activated;

    // Get GPGPU device information
    // method presented and explained in [Hjort Blindell, 2012]
    // (...)

    // Declare signal variables
    double* out1_device[6];
    // @todo Better error handling
    if (cudaMalloc((void**) &out1_device[1], 16 * sizeof(double)) != cudaSuccess) {
        printf("ERROR: Failed to allocate GPU memory\n");
        exit(-1);
    }
    // (...)
    unsigned int* in2_device[6];
    // (...)

    unsigned long data_index[6] = {0, 1, 2, 3, 4, 5};
    unsigned long number_of_bursts = N * 1;
    char finished = 0;

    // Start executing the kernels in a revolving barrel pattern, as suggested in Table 8.1
    while (!finished) { // while there is still data needed to be processed
        for (i = 0; i < 6; i++) {
            if ((data_index[i] < number_of_bursts) && (cudaStreamQuery(stream[i]) == cudaSuccess)) {
                // H2D transfer
                if (cudaMemcpyAsync((void*) in1_device[i], (void*) in1[data_index[i]],
                        16 * sizeof(double), cudaMemcpyHostToDevice, stream[i]) != cudaSuccess) {
                    printf("ERROR: Failed to copy data to GPU\n");
                    exit(-1);
                }
                if (cudaMemcpyAsync((void*) in2_device[i], (void*) in2[data_index[i]],
                        16 * sizeof(unsigned int), cudaMemcpyHostToDevice, stream[i]) != cudaSuccess) {
                    printf("ERROR: Failed to copy data to GPU\n");
                    exit(-1);
                }
                // Execute kernel
                if (is_timeout_activated) {
                    // Prevent the kernel from timing out by splitting up the work into
                    // smaller pieces through multiple kernel invocations
                    int num_threads_left_to_execute = 16;
                    int index_offset = 0;
                    while (num_threads_left_to_execute > 0) {
                        // method presented and explained in [Hjort Blindell, 2012]
                        // (...)
                    }
                }
                else {
                    struct KernelConfig config = calculateBestKernelConfig(16,
                        max_threads_per_block, 1 * sizeof(double), shared_memory_per_sm);
                    pcomp_1_kernel_stage_kernel<<<config.grid, config.threadBlock,
                        config.sharedMemory, stream[i]>>>(out1_device[i], in1_device[i],
                        in2_device[i], index_offset_device[i], 0);
                }
                // D2H transfer
                if (cudaMemcpyAsync((void*) out1[data_index[i]], (void*) out1_device[i],
                        16 * sizeof(double), cudaMemcpyDeviceToHost, stream[i]) != cudaSuccess) {
                    printf("ERROR: Failed to copy data to GPU\n");
                    exit(-1);
                }
                data_index[i] += 6;
            }
        }
        finished = (data_index[0] >= number_of_bursts) && (data_index[1] >= number_of_bursts)
            && (data_index[2] >= number_of_bursts) && (data_index[3] >= number_of_bursts)
            && (data_index[4] >= number_of_bursts) && (data_index[5] >= number_of_bursts);
    }
    // Free allocated memory
    if (cudaFree((void*) out1[1]) != cudaSuccess) {
        printf("ERROR: Failed to free GPU memory\n");
        exit(-1);
    }
    // (...)
}

Listing C.5: Sample from a generated CUDA function. This is the top-level execution code for the process network in Figure C.8. It reflects the current stage of the tool's ability to generate CUDA code.
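The initialization data_index[6] = {0, 1, 2, 3, 4, 5} together with the stride data_index[i] += 6 is what realizes the revolving barrel: stream i processes bursts i, i+6, i+12 and so on, so no two streams ever touch the same burst. A trivial stand-alone illustration of the resulting interleaving:

#include <cstdio>

int main() {
    const int num_streams = 6;
    const int number_of_bursts = 12;  // example value
    for (int i = 0; i < num_streams; ++i) {
        std::printf("stream %d:", i);
        for (int b = i; b < number_of_bursts; b += num_streams)
            std::printf(" burst %d", b);  // same stride as data_index[i] += 6
        std::printf("\n");
    }
    return 0;
}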


__device__
void pcomp_1_kernel_stage_exec(double* out1, double in1, unsigned int in2) {
    int i; // Can safely be removed if the compiler warns
           // about it being unused

    // Declare signal variables
    double* par_composite1 = new double[27];
    double* par_composite2 = new double[24];
    double* par_composite3 = new double[24];
    double* par_composite4 = new double[27];

    // Execute leafs
    cudaDeviceSync();
    stage_1_exec_wrapper(par_composite1, par_composite4, in1, in2);
    cudaDeviceSync();
    stage_4_exec_wrapper(par_composite2, par_composite4);
    cudaDeviceSync();
    stage_2_exec_wrapper(par_composite3, par_composite1);
    cudaDeviceSync();
    stage_3_exec_wrapper(out1, par_composite2, par_composite3);
    cudaDeviceSync();

    // Clean up memory
    delete[] par_composite1;
    delete[] par_composite2;
    delete[] par_composite3;
    delete[] par_composite4;
}

__global__
void pcomp_1_kernel_wrapper(double* out1, double in1, unsigned int in2, int index_offset) {
    unsigned int global_index = (blockIdx.x * blockDim.x + threadIdx.x) + index_offset;
    if (global_index < 16) {
        int in1_index = global_index * 0;
        int in2_index = global_index * 0;
        int out1_index = global_index * 0;
        pcomp_1_kernel_stage(&out1[out1_index], in1[in1_index], in2[in2_index]);
    }
}

Listing C.6: Sample from a generated CUDA function. This is a kernel function generated for executing the data-parallel section in the process network from Figure C.8. It reflects the current stage of the tool's ability to generate CUDA kernel functions.
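The grid configuration requested through calculateBestKernelConfig in Listing C.5 only has to cover the sixteen data-parallel instances; any surplus threads in the last block are masked off by the global_index < 16 guard above. A minimal sketch of the rounding arithmetic involved (with an assumed block size, not the tool's actual heuristic):

#include <cstdio>

int main() {
    const int work_items = 16;         // parallel leaf instances per burst
    const int threads_per_block = 32;  // assumed choice within the device limit
    int blocks = (work_items + threads_per_block - 1) / threads_per_block;
    std::printf("grid: %d block(s) x %d thread(s), %d masked off by the guard\n",
                blocks, threads_per_block, blocks * threads_per_block - work_items);
    return 0;
}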


Bibliography

[Alexander, 1977] Alexander, C. (1977). A pattern language: towns, buildings, construction, volume 2. Oxford University Press, USA.

[Asanovic et al., 2006] Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J., Husbands, P., Keutzer, K., Patterson, D. A., Plishker, W. L., Shalf, J., and Williams, S. W. (2006). The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley.

[Asanovic et al., 2009] Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., and Wawrzynek, J. (2009). A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67.

[ASIC World, 2013] ASIC World (last update: 17/03/2013). SystemC tutorial. Available from: http://www.asic-world.com/systemc/tutorial.html.

[Attarzadeh Niaki et al., 2012] Attarzadeh Niaki, S. H., Jakobsen, M. K., Sulonen, T., and Sander, I. (2012). Formal heterogeneous system modeling with SystemC. In 2012 Forum on Specification and Design Languages (FDL), pages 160–167.

[Baskaran et al., 2008] Baskaran, M. M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., and Sadayappan, P. (2008). Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 1–10, New York, NY, USA. ACM.

[Beggs and Tucker, 2006] Beggs, E. J. and Tucker, J. V. (2006). Embedding infinitely parallel computation in Newtonian kinematics. Applied Mathematics and Computation, 178(1):25–43.

[Bell and Hoberock, 2011] Bell, N. and Hoberock, J. (2011). Thrust: A productivity-oriented library for CUDA. GPU Computing Gems: Jade Edition, pages 359–372.

[Berry and Gonthier, 1992] Berry, G. and Gonthier, G. (1992). The Esterel synchronous programming language: Design, semantics, implementation. Science of Computer Programming, 19(2):87–152.


[Bjerge, 2007] Bjerge, K. (2007). Guide for getting started with SystemC development. Technical report, Danish Technological Institute. Available from: http://www.dti.dk/_root/media/27325_SystemC_Getting_Started_artikel.pdf.

[Blank, 1990] Blank, T. (1990). The MasPar MP-1 architecture. In Compcon Spring '90. Intellectual Leverage. Digest of Papers. 35th IEEE Computer Society International Conference, pages 20–24.

[Buschmann et al., 2007] Buschmann, F., Henney, K., and Schmidt, D. C. (2007). Pattern-Oriented Software Architecture: On Patterns and Pattern Languages, volume 6. John Wiley & Sons.

[Chen et al., 1992] Chen, G.-H., Wang, B.-F., and Lu, C.-J. (1992). On the parallel computation of the algebraic path problem. Parallel and Distributed Systems, IEEE Transactions on, 3(2):251–256.

[Codreanu and Hobincu, 2010] Codreanu, V. and Hobincu, R. (2010). Performance gain from data and control dependency elimination in embedded processors. In Electronics and Telecommunications (ISETC), 2010 9th International Symposium on, pages 47–50.

[Colella, 2004] Colella, P. (2004). Defining software requirements for scientific computing. Presentation.

[Ştefan, 2010] Ştefan, G. M. (2010). Integral parallel architecture in system-on-chip designs. Faculty of Electronics, Tc. and IT, Politehnica University of Bucharest, România. Available from: http://arh.pub.ro/gstefan/2010ucas.pdf.

[Dunigan, 1992] Dunigan, T. H. (1992). Kendall Square multiprocessor: Early experiences and performance. Technical report, Mathematical Sciences Section, Oak Ridge National Laboratory.

[Enmyren and Kessler, 2010] Enmyren, J. and Kessler, C. W. (2010). SkePU: a multi-backend skeleton programming library for multi-GPU systems. In Proceedings of the Fourth International Workshop on High-Level Parallel Programming and Applications, pages 5–14.

[Feautrier, 1996] Feautrier, P. (1996). Automatic parallelization in the polytope model. In Perrin, G.-R. and Darte, A., editors, The Data Parallel Programming Model, number 1132 in Lecture Notes in Computer Science, pages 79–103. Springer Berlin Heidelberg.

[Flynn, 1972] Flynn, M. J. (1972). Some computer organizations and their effectiveness. Computers, IEEE Transactions on, 100(9):948–960.

[ForSyDe, 2013] ForSyDe (2013). The ForSyDe Homepage. Page Version ID: 24. Available from: https://forsyde.ict.kth.se/trac.

[Gamma et al., 1993] Gamma, E., Helm, R., Johnson, R., and Vlissides, J. (1993). Design patterns: Abstraction and reuse of object-oriented design. ECOOP'93 – Object-Oriented Programming, pages 406–431.

[Georgia Tech, 2008] Georgia Tech (last update: 10/25/2008). STI Center of Competence for the Cell Broadband Engine processor. Available from: http://sti.cc.gatech.edu/.

[Habanero, 2013] Habanero (last update: 11/03/2013). Habanero multicore software research project - Rice University campus wiki. Available from: https://wiki.rice.edu/confluence/display/HABANERO/Habanero+Multicore+Software+Research+Project.


[Halbwachs et al., 1991] Halbwachs, N., Caspi, P., Raymond, P., and Pilaud, D. (1991). The synchronous data flow programming language LUSTRE. Proceedings of the IEEE, 79(9):1305–1320.

[Hayes et al., 1986] Hayes, J. P., Mudge, T. N., Stout, Q. F., Colley, S., and Palmer, J. (1986). Architecture of a hypercube supercomputer. In Proceedings of the 1986 International Conference on Parallel Processing, pages 653–660.

[Hennessy and Patterson, 2011] Hennessy, J. L. and Patterson, D. A. (2011). Computer Architecture, Fifth Edition: A Quantitative Approach. Morgan Kaufmann, 5 edition.

[Hjort Blindell, 2012] Hjort Blindell, G. (2012). Synthesizing software from a ForSyDe model targeting GPGPUs. Master's thesis, Dept. ICT, Royal Institute of Technology (KTH), Stockholm, Sweden.

[Hochstein et al., 2005] Hochstein, L., Carver, J., Shull, F., Asgari, S., and Basili, V. (2005). Parallel programmer productivity: A case study of novice parallel programmers. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, pages 35–35.

[Illinois, 2013] Illinois, U. (accessed: 17/03/2013). Universal Parallel Computing Research Center at the University of Illinois. Available from: http://www.upcrc.illinois.edu/.

[ITRS, 2005] ITRS (2005). International Technology Roadmap for Semiconductors, 2005 edition, Executive Summary. Technical report. Available from: http://www.itrs.net/Links/2005ITRS/ExecSum2005.pdf.

[ITRS, 2007] ITRS (2007). International Technology Roadmap for Semiconductors, 2007 edition, Executive Summary. Technical report. Available from: http://www.itrs.net/Links/2007ITRS/ExecSum2007.pdf.

[ITRS, 2011] ITRS (2011). International Technology Roadmap for Semiconductors, 2011 edition, System Drivers. Technical report. Available from: http://www.itrs.net/Links/2011ITRS/2011Chapters/2011SysDrivers.pdf.

[Jakobsen et al., 2011] Jakobsen, M. K., Madsen, J., Niaki, S. H. A., Sander, I., and Hansen, J. (2011). System level modelling with open source tools. In Embedded World Conference 2011, Nuremberg, Germany.

[Joldes et al., 2010] Joldes, G. R., Wittek, A., and Miller, K. (2010). Real-time nonlinear finite element computations on GPU - application to neurosurgical simulation. Computer Methods in Applied Mechanics and Engineering, 199(49):3305–3314.

[Jones et al., 2009] Jones, C. G., Liu, R., Meyerovich, L., Asanovic, K., and Bodik, R. (2009). Parallelizing the web browser. In Proceedings of the First USENIX Workshop on Hot Topics in Parallelism.

[Keutzer et al., 2000] Keutzer, K., Newton, A. R., Rabaey, J. M., and Sangiovanni-Vincentelli, A. (2000). System-level design: Orthogonalization of concerns and platform-based design. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 19(12):1523–1543.

[Kirk and Hwu, 2010] Kirk, D. B. and Hwu, W.-m. W. (2010). Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 1 edition.

[Kleene, 1936] Kleene, S. C. (1936). General recursive functions of natural numbers. Mathematische Annalen, 112(1):727–742.


[Knuth, 1997] Knuth, D. E. (1997). The Art of Computer Programming: Fundamental Algorithms,volume 1, chapter 2.3. Addison-Wesley Professional.

[Krasner and Pope, 1988] Krasner, G. E. and Pope, S. T. (1988). A description of the model-view-controller user interface paradigm in the smalltalk-80 system. Journal of object orientedprogramming, 1(3):26–49.

[Lee and Sangiovanni-Vincentelli, 1997] Lee, E. A. and Sangiovanni-Vincentelli, A. (1997).Comparing models of computation. In Proceedings of the 1996 IEEE/ACM internationalconference on Computer-aided design, pages 234–241.

[Lindholm et al., 2008] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. (2008). nvidiatesla: A unified graphics and computing architecture. Micro, IEEE, 28(2):39–55.

[Maliţa and Ştefan, 2008] Maliţa, M. and Ştefan, G. (2008). On the many-processor paradigm.In Proceedings of the 2008 World Congress in Computer Science, Computer Engineering andApplied Computing, vol. PDPTA, volume 8.

[Maliţa and Ştefan, 2009] Maliţa, M. and Ştefan, G. (2009). Integral parallel architecture &berkeley’s motifs. In Application-specific Systems, Architectures and Processors, 2009. ASAP2009. 20th IEEE International Conference on, pages 191–194.

[Maliţa et al., 2006] Maliţa, M., Ştefan, G., and Stoian, M. (2006). Complex vs. intensive in parallel computation. In Computing in the Global Information Technology, 2006. ICCGI'06. International Multi-Conference on, pages 26–26.

[Malik et al., 2012] Malik, M., Li, T., Sharif, U., Shahid, R., El-Ghazawi, T., and Newby, G. (2012). Productivity of GPUs under different programming paradigms. Concurrency and Computation: Practice and Experience.

[Maraninchi, 1991] Maraninchi, F. (1991). The Argos language: Graphical representation of automata and description of reactive systems. In IEEE Workshop on Visual Languages, volume 3.

[Maven, 2007] Apache Software Foundation (2007). Apache Maven project. Available from: http://maven.apache.org/.

[McCabe, 1976] McCabe, T. (1976). A complexity measure. IEEE Transactions on Software Engineering, SE-2(4):308–320.

[Nickolls and Dally, 2010] Nickolls, J. and Dally, W. J. (2010). The GPU computing era. Micro,IEEE, 30(2):56–69.

[nvidia, 2013a] nvidia (2013a). CUDA C/C++ streams and concurrency. Available from: http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf.

[nvidia, 2013b] nvidia (accessed: 24/03/2013). GeForce 256. Available from: http://www.nvidia.com/page/geforce256.html.

[nvidia, 2013c] nvidia (accessed: 24/03/2013). GeForce GTX TITAN graphics card design details | nvidia. Available from: http://www.nvidia.com/titan-graphics-card/design.

[Öberg and Ellervee, 1998] Öberg, J. and Ellervee, P. (1998). Revolver: A high-performance MIMD architecture for collision free computing. In Euromicro Conference, 1998. Proceedings. 24th, volume 1, pages 301–308.


[Owens et al., 2008] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5):879–899.

[Papadimitriou, 2003] Papadimitriou, C. H. (2003). Computational complexity. In Encyclopedia of Computer Science, 4th edition, ISBN: 0-470-86412-5, pages 260–265. John Wiley and Sons Ltd.

[Par Lab, 2013] Par Lab (accessed: 17/03/2013). The Parallel Computing Laboratory. Available from: http://parlab.eecs.berkeley.edu/.

[PPL, 2013] PPL (accessed: 17/03/2013). Stanford Pervasive Parallelism Laboratory. Available from: http://ppl.stanford.edu/main/.

[Sander, 2003] Sander, I. (2003). System modeling and design refinement in ForSyDe. PhD thesis, Dept. IMIT, KTH, Stockholm, Sweden.

[Sander and Jantsch, 1999] Sander, I. and Jantsch, A. (1999). System synthesis based on a formal computational model and skeletons. In VLSI'99. Proc. IEEE Computer Society Workshop On, pages 32–39.

[Sander and Jantsch, 2004] Sander, I. and Jantsch, A. (2004). System modeling and transformational design refinement in ForSyDe [Formal System Design]. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 23(1):17–32.

[Sanders and Kandrot, 2010] Sanders, J. and Kandrot, E. (2010). CUDA by Example: An Introduction to General-Purpose GPU Programming. ISBN: 0132180138. Addison-Wesley Professional.

[Steuwer et al., 2013] Steuwer, M., Gorlatch, S., Buß, M., and Breuer, S. (2013). Using the SkelCL library for high-level GPU programming of 2D applications. In Euro-Par 2012: Parallel Processing Workshops, pages 370–380.

[Svensson et al., 2010] Svensson, J., Claessen, K., and Sheeran, M. (2010). GPGPU kernel implementation and refinement using Obsidian. Procedia Computer Science, 1(1):2065–2074.

[Udupa et al., 2009] Udupa, A., Govindarajan, R., and Thazhuthaveetil, M. J. (2009). Software pipelined execution of stream programs on GPUs. In International Symposium on Code Generation and Optimization, 2009. CGO 2009, pages 200–209.

[Xavier and Iyengar, 1998] Xavier, C. and Iyengar, S. S. (1998). Introduction to Parallel Algorithms, volume 1 of Wiley Series on Parallel and Distributed Computing. Wiley-Interscience.

[Zhu et al., 2008] Zhu, J., Sander, I., and Jantsch, A. (2008). Energy efficient streaming applications with guaranteed throughput on MPSoCs. In Proceedings of the 8th ACM international conference on Embedded software, pages 119–128.