
Task Graph Clustering with Internal State

Joachim Falk, Christian Haubelt, and Jürgen Teich

Department of Computer Science 12, Hardware-Software-Co-Design
University of Erlangen-Nuremberg
Am Weichselgarten 3, D-91058 Erlangen, Germany

Co-Design-Report 1-2007

Dec. 12, 2007


Contents

1 Introduction
2 Non-Hierarchical SysteMoC Models
3 Hierarchical Network Graphs
   3.1 Clustering Operation
   3.2 Semantics of Hierarchical Network Graphs
4 Cluster Composition
5 Cluster Composition in SDF Graphs
   5.1 Cluster Composition to SDF Actors
   5.2 Cluster Composition to CSDF Actors
6 SDF Cluster Composition in General Network Graphs
   6.1 Cluster Composition Examples
   6.2 Cluster Composition Algorithm
7 Proofs
8 Results
References


1 Introduction

Multi-Processor Systems-on-Chip (MPSoCs) are becoming more and more important as implementation platforms for computationally intensive applications, e.g., multimedia or networking. As MPSoCs are mainly designed for a given set of applications, their advantage lies in low power dissipation due to a lower clock frequency while still providing high computational power by exploiting the parallelism of the architecture. However, this high degree of parallelism makes programming MPSoCs a challenging task. While many research groups have conducted research on optimally mapping an application onto a given MPSoC platform, e.g., Daedalus [TNS+07], SystemCoDesigner [HFK+07], Koski [KKO+06], etc., there is still only little work on defining the right programming model for MPSoCs.

At the Electronic System Level (ESL), a trend towards actor-oriented programming models [JBP06, ET05] can be identified. These models are often domain specific and are typically already used in the mapping step. For instance, data flow models are well suited to model multimedia applications. Beside the application mapping, further refinements are needed to get from a high-level model to an implementation. In particular, scheduling actors bound to one processor is one of the key issues. Often, a straightforward scheduling idea is to postpone any scheduling decision to runtime. Such dynamic scheduling often results in a significant scheduling overhead and, hence, reduced system performance. On the other hand, static or quasi-static scheduling for multi-processor systems is only solved for models with limited expressiveness, e.g., static data flow graphs. For instance, efficient single-processor scheduling algorithms [BBHL95, HB07] exist for synchronous data flow (SDF) models [LM87], a widely accepted static data flow model.

Unfortunately, these algorithms are constrained to pure SDF models, while real-world multimedia applications require actors modeled at different levels of complexity. As an example, consider the top-level data flow model of a Motion-JPEG decoder depicted in Figure 1.

[Figure 1: Example of a Motion-JPEG decoder including dynamic as well as static data flow actors (shaded). The decoder consists of the actors JPEG Source, Parser, Huff. Decoder, Inverse ZRL, Inverse Quant., Inverse ZigZag, IDCT, Frame Shuffler, and Image Sink.]

Beside static data flow actors (shaded vertices), i.e., actors with constant consumption or production rates known from SDF models or cyclo-static data flow (CSDF) [BELP96] actors, it also includes dynamic data flow actors like the Parser, which can be modeled as a Kahn process [Kah74]. The 2-dimensional Inverse Discrete Cosine Transform (IDCT) actor shown in Figure 1 represents a static data flow subcluster, i.e., a subgraph consisting of SDF and CSDF actors only, as it is composed of two 1-dimensional IDCTs modeled as SDF graphs and a CSDF transpose actor (see Figure 17). After mapping the Motion-JPEG application to an MPSoC, all actors bound onto one processor have to be scheduled either dynamically or statically. As the model includes dynamic actors, a static schedule is not possible in all cases. However, a dynamic schedule may degrade the performance by introducing scheduling overhead even in the schedules of static data flow subclusters. To permit the generation of an efficient schedule, a remedy could be the replacement of the static data flow subcluster by an actor, i.e., clustering all static data flow actors into a new actor. In the following, this operation will be called the composite operation. Unfortunately, existing algorithms might result in infeasible schedules or greatly restrict the clustering design space by considering only SDF subclusters that can be composed into monolithic SDF actors without introducing deadlock. Moreover, all these approaches are limited to integrating the resulting monolithic SDF actor only with an enclosing static data flow system representation.

In this report, we will propose a novel composite operation for static data flow subclusters connected to dynamic data flow graphs. In contrast to prior work, where the subcluster is replaced by an SDF actor only, our approach computes a quasi-static schedule (QSS) for a static data flow subcluster and replaces this subcluster by a dynamic data flow (DDF) actor implementing the QSS. This dynamic actor can be easily integrated into a dynamic schedule of all remaining actors mapped onto the same processor, reducing the overall scheduling overhead. Thus, the advantages of static data flow subclusters can be exploited by our algorithm in the context of more general models of computation, and we improve previous work by increasing the design space for clustering. We formally prove the correctness of our algorithm using the denotational semantics of Kahn process networks [Kah74] and show the benefits of our scheduling approach using a scalable benchmark as well as a real-world example, the 2-dimensional IDCT of a Motion-JPEG decoder mapped onto a 4-processor system. In this case study, we could improve throughput by about 80% and latency by about 75%.

The remainder of this report is organized as follows: Sections 2 and 3 formally define the mathematical input model for our composite operation. Section 4 formally introduces the composite operation and presents the constraints a valid composite operation has to satisfy. Known composite operations for pure SDF graphs are discussed in Section 5. Subsequently, Section 6 presents the proposed composite algorithm, which is proved to conform to the constraints for valid composite operations in Section 7. Finally, experimental results are presented in Section 8.


2 Non-Hierarchical SysteMoC Models

In this section we repeat the definitions from [FHT05, FHT06] relevant to clustering.

In actor-oriented design, actors are objects which execute concurrently and can only communicate with each other via channels instead of method calls as known from object-oriented design. Each actor a ∈ A may only communicate with other actors through its dedicated actor input ports a.I and actor output ports a.O via channels connecting these ports. The basic entity of data transfer is the token, and tokens are transmitted via these channels. An actor itself is subdivided into three parts: (i) the actor input ports and output ports for communication, (ii) the actor functionality for transforming tokens, and (iii) the actor finite state machine, in the following called firing FSM, which encodes the consumption and production of tokens by this actor. These three parts are also clearly visible in the SqrLoop actor depicted in Figure 2. More formally, we can derive the following definition:

Definition 2.1 (Actor) An actor is a tuple a = (P, F, R) containing a set of actor ports P = I ∪ O partitioned into actor input ports I and actor output ports O, the actor functionality F, and the firing FSM R.

[Figure 2: Visual representation of the SqrLoop actor a2 used in the network graph gsqr,flat displayed in Figure 3. The SqrLoop actor is composed of its input ports a2.I = {i1, i2}, its output ports a2.O = {o1, o2}, its functionality a2.F = {fcopyInput, fcopyStore, fcopyApprox, fcheck}, and its firing FSM a2.R with the states qstart and qloop and the transitions i1(1)&o1(1)/fcopyStore, i2(1)&¬fcheck&o1(1)/fcopyInput, and i2(1)&fcheck&o2(1)/fcopyApprox, which determines the communication behavior of the actor.]

Actor-oriented designs are often represented by bipartite graphs consisting of channels c ∈ C and actors a ∈ A, which are connected via point-to-point connections from an actor output port o to a channel and from a channel to an actor input port i. In the following, we call such representations network graphs; an example of such a representation can be seen in Figure 3. More formally, we can derive the following definition for a network graph:

Definition 2.2 (Non-Hierarchical Network Graph) A non-hierarchical network graph is a directed bipartite graph gn = (V, E, P) containing a set of vertices V = C ∪ A partitioned into channels C and actors A, and a set of directed edges e = (vsrc, vsnk) ∈ E ⊆ (C × A.I) ∪ (A.O × C) from channels c ∈ C to actor input ports i ∈ A.I as well as from actor output ports o ∈ A.O to channels.1 These edges are further constrained such that exactly one edge is incident to each actor port and the in-degree and out-degree of each channel in the graph is exactly one. Additionally, the network graph contains a channel parameter function P : C → N∞ × V∗ which associates with each channel c ∈ C its buffer size n ∈ N∞ = {1, 2, 3, . . . , ∞} and a possibly non-empty sequence ν ∈ V∗ of initial tokens.2
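As a purely illustrative aid (not part of the SysteMoC library; all class and field names below are our own), the following Python sketch shows one possible in-memory encoding of the tuples from Definitions 2.1 and 2.2:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Port:
    actor: str           # name of the owning actor
    name: str            # e.g. "i1" or "o1"
    is_input: bool       # True for ports in a.I, False for ports in a.O

@dataclass
class Actor:
    name: str
    inputs: list         # actor input ports a.I
    outputs: list        # actor output ports a.O

@dataclass
class Channel:
    name: str
    capacity: float                # buffer size n; float("inf") models an unbounded FIFO
    initial_tokens: tuple = ()     # possibly empty sequence of initial tokens

@dataclass
class NetworkGraph:
    actors: dict = field(default_factory=dict)     # name -> Actor
    channels: dict = field(default_factory=dict)   # name -> Channel (this also encodes P)
    edges: set = field(default_factory=set)        # (channel, input port) and (output port, channel) pairs

# Example (our reading of Figure 3): the Src actor a1 feeding channel c1.
a1 = Actor("a1", inputs=[], outputs=[Port("a1", "o1", False)])
g = NetworkGraph(actors={"a1": a1},
                 channels={"c1": Channel("c1", capacity=float("inf"))},
                 edges={(Port("a1", "o1", False), "c1")})
```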

The execution of SysteMoC models can be divided into three phases: (i) checking for enabled transitions for each actor, (ii) selecting and executing one enabled transition per actor, and (iii) consuming and producing the tokens needed by the transition. Note that step (iii) might enable new transitions. More formally, the firing FSM of an actor is defined as follows:

Definition 2.3 (Firing FSM) The firing FSM of an actor a ∈ A is a tuple a.R = (T, Qfiring, q0firing) containing a finite set of firing transitions T, a finite set of firing states Qfiring, and an initial firing state q0firing ∈ Qfiring. A firing transition is a tuple t = (qfiring, k, faction, q′firing) ∈ T containing the current firing state qfiring ∈ Qfiring, an activation pattern k, the associated action faction ∈ a.F, and the next firing state q′firing ∈ Qfiring. The activation pattern k is a Boolean function which decides if transition t can be taken (true) or not (false).

Each transition is annotated with an activation pattern, a Boolean expression which decides if the transition can be taken, and an action from the actor functionality which is executed if the transition is taken. Moreover, our activation patterns encode both step (i) and step (iii) of the execution phases, because each transition communicates the minimal number of tokens and free places on each input and output port still satisfying the activation pattern. The activation patterns are partitioned into: (i) predicates on the number of available tokens on the input ports, called input patterns, e.g., i1(1) denotes a predicate that tests the availability of at least one token at the actor input port i1, (ii) predicates on the number of free places on the output ports, called output patterns, e.g., o1(1) checks if at least one free place is available at the output port o1, and (iii) more general predicates called functionality conditions, e.g., ¬fcheck, depending on the functionality state or the token values on the input ports.

1 We use the '.'-operator, e.g., a.I, for member access of tuples whose members have been explicitly named in their definition, e.g., member I of actor a ∈ A from Definition 2.1. Moreover, this member access operator has a trivial pointwise extension to sets of tuples, e.g., A.I = ⋃a∈A a.I, which is also used throughout this document.
2 We use V∗ to denote the set of all possible finite sequences of tokens v ∈ V, i.e., V∗ = ⋃n∈{0,1,...} Vn.
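To make the activation-pattern mechanics concrete, here is a small hedged sketch (our own helper names, not the SysteMoC API) that checks and fires a transition of the form defined above:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Transition:
    src_state: str
    input_pattern: Dict[str, int]    # input port  -> minimum number of available tokens
    output_pattern: Dict[str, int]   # output port -> minimum number of free places
    guard: Callable[[], bool]        # functionality condition, e.g. lambda: not f_check()
    action: Callable[[], None]       # action f_action from the actor functionality
    dst_state: str

def enabled(t: Transition, tokens: Dict[str, int], free: Dict[str, int]) -> bool:
    """Step (i): evaluate the activation pattern k (input, output, and guard predicates)."""
    return (all(tokens[i] >= n for i, n in t.input_pattern.items())
            and all(free[o] >= n for o, n in t.output_pattern.items())
            and t.guard())

def fire(t: Transition, tokens: Dict[str, int], free: Dict[str, int]) -> str:
    """Steps (ii) and (iii): execute the action, then consume/produce the declared tokens."""
    t.action()
    for i, n in t.input_pattern.items():
        tokens[i] -= n               # tokens consumed from the input channels
    for o, n in t.output_pattern.items():
        free[o] -= n                 # places occupied on the output channels
    return t.dst_state               # next firing state
```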

[Figure 3: The network graph gsqr,flat implements Newton's iterative algorithm for calculating the square roots of an infinite input sequence generated by the Src actor a1. The SqrLoop actor a2 performs the error bound checking, while the actors a3 (Approx) and a4 (Dup) of the approximation loop body perform an approximation step. After the error bound has been satisfied, the result is transported to the Sink actor a5. The actors are connected via the channels c1-c6; for example, a3 ∈ gsqr,flat.A is an actor of type Approx, c6 ∈ gsqr,flat.C is a channel, o2 ∈ a4.O is an actor output port, i1 ∈ a5.I is an actor input port, and P(c3) = (∞, (ν)) states that channel c3 has unbounded capacity and one initial token ν.]

3 Hierarchical Network Graphs

We next introduce the notion of clusters, which we will use to introduce hierarchy into our non-hierarchical network graph model.

Definition 3.1 (Cluster) A cluster is a directed bipartite graph gγ = (P, V, E, P) containing a set of cluster ports P = I ∪ O partitioned into cluster input ports I and cluster output ports O, a set of vertices V = C ∪ A ∪ Gγ partitioned into channels C, actors A, and subclusters Gγ, and a set of directed edges e = (vsrc, vsnk) ∈ E ⊆ ((C ∪ I) × (A.I ∪ Gγ.I)) ∪ ((A.O ∪ Gγ.O) × (C ∪ O)) from channels c ∈ C or cluster input ports i′ ∈ I to actor or subcluster input ports i ∈ A.I ∪ Gγ.I as well as from actor or subcluster output ports o ∈ A.O ∪ Gγ.O to channels or cluster output ports o′ ∈ O. These edges are further constrained such that exactly one edge is incident to each actor, subcluster, or cluster port and the in-degree and out-degree of each channel in the graph is exactly one. Additionally, the cluster contains a channel parameter function P : C → N∞ × V∗ which associates with each channel c ∈ C its buffer size n ∈ N∞ and a possibly non-empty sequence ν ∈ V∗ of initial tokens.

As can be seen from Definition 2.2 and Definition 3.1, a non-hierarchical network graph gn is simply a cluster without subclusters and cluster ports, i.e., gn.P = ∅ and gn.V ∩ Gγ = ∅.3 Hierarchy is introduced into a network graph by also allowing it to contain subclusters, i.e., the vertices gn.V of a hierarchical network graph gn are partitioned into actors a ∈ gn.A, channels c ∈ gn.C, and additionally also subclusters gγ ∈ gn.Gγ. More formally, we define a hierarchical network graph as follows:

Definition 3.2 (Hierarchical Network Graph) A hierarchical network graph gn is a cluster without cluster ports, i.e., gn.P = ∅.

We will call a cluster (network graph) gγ non-hierarchical if it contains no subclusters, i.e., gγ.Gγ = ∅, and hierarchical if it contains subclusters, i.e., gγ.Gγ ≠ ∅, respectively. In the following, we will assume that a network graph, if not explicitly stated to be non-hierarchical, can contain subclusters.

3.1 Clustering Operation

The transformation of a possibly non-hierarchical original cluster into a hierarchical cluster will be denoted by the cluster operator Γ : Gγ × 2V → Gγ × Gγ. This operation takes an original cluster and a set of vertices Vsubcluster and generates a clustered network graph containing a subcluster which replaces the set of clustered vertices Vsubcluster. An example of such a cluster operation can be seen in Figure 3, depicting the original network graph gsqr,flat, and Figure 4, displaying the clustered network graph gsqr containing the subcluster gsqr,γ which replaces the clustered actors and channels Vsubcluster = {a3, a4, c3, c4} ⊆ gsqr,flat.V, i.e., Γ(gsqr,flat, Vsubcluster) = (gsqr, gsqr,γ).

Without loss of generality, we treat in the following definition all subclusters contained inside the original graph gunclustered like actors, because only actor ports are referenced inside this definition, which are also provided by subclusters. Should an actor indeed be a cluster, the actor ports refer to the equivalent cluster ports.

3 In contrast to gγ.Gγ, which denotes the set of subclusters inside the cluster gγ, we use Gγ alone to denote the set of all possible clusters. This convention is used accordingly for the set of all possible actors A, channels C, edges E, ports P, and vertices V = A ∪ C ∪ Gγ.


[Figure 4: The hierarchical network graph gsqr is derived from the non-hierarchical network graph gsqr,flat depicted in Figure 3 by clustering its actors and channels {a3, a4, c3, c4} ⊂ gsqr,flat.V into a subcluster gsqr,γ ∈ gsqr.Gγ. The subcluster gsqr,γ|ApproxBody contains the actors a3|Approx and a4|Dup, the channel c4 ∈ gsqr,γ.C, the cluster input port i1 ∈ gsqr,γ.I, and the cluster output port o1 ∈ gsqr,γ.O.]

Definition 3.3 (Cluster Operation) The cluster operation Γ : Gγ × 2V → Gγ × Gγ is a partial function which maps an unclustered graph gunclustered ∈ Gγ and a set of channels and actors Vsubcluster ⊆ gunclustered.V contained in the unclustered graph to a clustered graph gclustered ∈ Gγ and a subcluster gclustered,γ ∈ gclustered.Gγ contained in the clustered graph. A prerequisite for the cluster operation is that all channels to cluster c ∈ Csubcluster = Vsubcluster ∩ gunclustered.C are connected to actors to cluster a ∈ Asubcluster = Vsubcluster ∩ gunclustered.A, i.e., ∀c ∈ Csubcluster : ∃i ∈ Asubcluster.I, o ∈ Asubcluster.O : (c, i), (o, c) ∈ gunclustered.E. The clustered graph and its subcluster are derived from the unclustered graph and the set of channels and actors to cluster Vsubcluster in the following way:

• The clustered vertices Vsubcluster are replaced in the clustered graph gclustered by the subcluster gclustered,γ ∈ gclustered.Gγ, which now contains the replaced vertices Vsubcluster, i.e., gclustered,γ.V = Vsubcluster and gclustered.V = (gunclustered.V \ Vsubcluster) ∪ {gclustered,γ}.

• The ports of the clustered graph are taken from the unclustered graph without any modifications, i.e., gclustered.P = gunclustered.P. The input ports of the subcluster are derived via a one-to-one correspondence from the edges ein ∈ Ein directed from vertices of the clustered graph to vertices of the subcluster, i.e., ∃ fin : Ein → gclustered,γ.I : fin is a bijection, where Ein = {e ∈ gunclustered.E | e.vsrc ∈ gclustered.C ∪ gclustered.I, e.vsnk ∈ gclustered,γ.A.I}.4 The output ports of the subcluster are derived accordingly, i.e., ∃ fout : Eout → gclustered,γ.O : fout is a bijection, where Eout = {e ∈ gunclustered.E | e.vsrc ∈ gclustered,γ.A.O, e.vsnk ∈ gclustered.C ∪ gclustered.O}.

• After defining the vertices and ports of the clustered graph and its subcluster, we have to define the edges connecting these vertices. These edges are derived from the edges e ∈ gunclustered.E of the unclustered graph. Two cases have to be distinguished: (i) edges ein ∈ Ein or eout ∈ Eout which cross the boundary between the clustered graph and its subcluster and have to be transformed into two edges, one in the clustered graph to or from a subcluster port and one edge in the subcluster from or to a subcluster port, respectively; (ii) the remaining edges e ∈ gunclustered.E \ (Ein ∪ Eout), which can be taken verbatim into their corresponding cluster. Thus, more formally, for the edges of the clustered graph we declare gclustered.E = {e ∈ gunclustered.E | e.vsrc, e.vsnk ∈ gclustered.P ∪ gclustered.C ∪ gclustered.A.P} ∪ {(ein.vsrc, fin(ein)) | ein ∈ Ein} ∪ {(fout(eout), eout.vsnk) | eout ∈ Eout}, and for the edges of the subcluster accordingly gclustered,γ.E = {e ∈ gunclustered.E | e.vsrc, e.vsnk ∈ gclustered,γ.C ∪ gclustered,γ.A.P} ∪ {(fin(ein), ein.vsnk) | ein ∈ Ein} ∪ {(eout.vsrc, fout(eout)) | eout ∈ Eout}.

• Finally, the channel parameter function gunclustered.P is split into two new channel parameter functions gclustered.P and gclustered,γ.P which are only defined on the channels of the clustered graph and the channels of the subcluster, respectively, i.e., gclustered.P : gclustered.C → N∞ × V∗, ∀c ∈ gclustered.C : gclustered.P(c) = gunclustered.P(c), and accordingly for the channel parameter function of the subcluster.

For a better understanding of the preceding definition, consider again the clustering operation Γ(gsqr,flat, {a3, a4, c3, c4}) = (gsqr, gsqr,γ) depicted in Figure 3 and Figure 4. For this example, the edges crossing the boundary between the clustered graph and its subcluster are ein = (c2, a3.i1) and eout = (a4.o2, c5), which are mapped via the bijections fin and fout to the subcluster ports gsqr,γ.P = {i1, o1} as follows: fin(ein) = gsqr,γ.i1 and fout(eout) = gsqr,γ.o1. Furthermore, these two edges are translated into two edges of the clustered graph (ein → (c2, gsqr,γ.i1) ∈ gsqr.E and eout → (gsqr,γ.o1, c5) ∈ gsqr.E) and two edges of the subcluster (ein → (gsqr,γ.i1, a3.i1) ∈ gsqr,γ.E and eout → (a4.o2, gsqr,γ.o1) ∈ gsqr,γ.E).

4 Here we use the '.'-operator again to refer to the source and sink of an edge e with the denotations e.vsrc and e.vsnk, respectively, as given in Definition 2.2 and Definition 3.1.
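The following sketch (our own illustration, not the authors' implementation) mirrors the core of Definition 3.3: it partitions the edges of the unclustered graph and creates one subcluster port per boundary edge. In this sketch channel endpoints are plain strings and actor ports are (actor, port) tuples.

```python
def cluster(edges, vertices_to_cluster, is_channel):
    """Split a graph, given as a set of (src, snk) edges, into the edge sets of the
    clustered graph and of the subcluster replacing `vertices_to_cluster`."""
    def owner(endpoint):
        return endpoint if is_channel(endpoint) else endpoint[0]

    def inside(endpoint):
        return owner(endpoint) in vertices_to_cluster

    outer, inner, port_map = set(), set(), {}
    n_in = n_out = 0
    for (src, snk) in edges:
        if inside(src) and inside(snk):            # edge stays inside the subcluster
            inner.add((src, snk))
        elif not inside(src) and not inside(snk):  # edge stays in the clustered graph
            outer.add((src, snk))
        elif inside(snk):                          # boundary edge e_in: new cluster input port
            n_in += 1
            p = ("g_gamma", f"i{n_in}")
            port_map[(src, snk)] = p
            outer.add((src, p))                    # clustered graph: channel -> cluster port
            inner.add((p, snk))                    # subcluster: cluster port -> actor port
        else:                                      # boundary edge e_out: new cluster output port
            n_out += 1
            p = ("g_gamma", f"o{n_out}")
            port_map[(src, snk)] = p
            outer.add((p, snk))
            inner.add((src, p))
    return outer, inner, port_map

# Boundary edges of the example above: e_in = (c2, a3.i1) and e_out = (a4.o2, c5).
edges = {("c2", ("a3", "i1")), (("a3", "o1"), "c3"),
         ("c3", ("a4", "i1")), (("a4", "o2"), "c5")}
outer, inner, ports = cluster(edges, {"a3", "a4", "c3"},
                              is_channel=lambda v: isinstance(v, str))
```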


3.2 Semantics of Hierarchical Network Graphs

The semantics of a hierarchical network graph gn is defined by the semantics of its corresponding non-hierarchical network graph g′n. This non-hierarchical network graph is obtained by recursively applying the dissolve operation Γ−1 until the resulting network graph no longer contains any subclusters. The dissolve operation Γ−1 is the inverse of the cluster operation.
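As a hedged companion to the cluster() sketch above (again our own illustration, not the authors' implementation), the dissolve operation can be pictured as re-joining the boundary edges at the cluster ports:

```python
def dissolve(outer_edges, inner_edges, port_map):
    """Inverse of cluster(): substitute the subcluster's contents back into the graph."""
    ports = set(port_map.values())
    # Edges of the clustered graph that end at / start from a cluster port.
    to_port = {snk: src for (src, snk) in outer_edges if snk in ports}
    from_port = {src: snk for (src, snk) in outer_edges if src in ports}
    edges = {e for e in outer_edges if e[0] not in ports and e[1] not in ports}
    for (src, snk) in inner_edges:
        if src in ports:            # cluster input port -> inner actor port
            edges.add((to_port[src], snk))
        elif snk in ports:          # inner actor port -> cluster output port
            edges.add((src, from_port[snk]))
        else:                       # purely internal edge
            edges.add((src, snk))
    return edges
```

Applying dissolve to the result of the cluster() call above returns the original edge set, which is exactly the inverse property used here to define the semantics of hierarchical network graphs.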

4 Cluster Composition

Cluster composition is the conversion of a subcluster into a corresponding composite actor of equivalent behavior. For this purpose, we introduce the composite operation A which transforms an original network graph gn ∈ Gn and its subcluster gγ into a composite network graph g̃n containing the actor aγ ∈ g̃n.A replacing the subcluster while keeping the semantics of the original network graph. This actor aγ will be called the corresponding composite actor of gγ. This operation is exemplified in Figure 4 and Figure 5, depicting the original network graph gsqr and its subcluster gsqr,γ, which is transformed by the composite operation A(gsqr, gsqr,γ) = (g̃sqr, asqr,γ) into the resulting composite network graph g̃sqr and the corresponding composite actor asqr,γ derived from gsqr,γ. More formally, we define the composite operation as follows:

Definition 4.1 (Composite Operation) Given a data flow graph g and a deterministic data flow subgraph gγ ∈ g.Gγ of g, the composite operation A : Gn × Gγ → Gn × A replaces gγ by a single DDF actor aγ, called composite actor, resulting in a network graph g̃, i.e., A(gn, gγ) = (g̃, aγ) where g̃ = g except that g̃.V = g.V + {aγ} − {gγ} and aγ.P = gγ.P. Furthermore, gγ and aγ satisfy the equivalence condition from Definition 4.2.

In order to maintain semantic equivalence, we first have to define what semantic equivalence means. In the following, we will disregard the channel size constraints of the channels connected to the composite actor ports. With this simplification, we can use the denotational semantics of Kahn [Kah74] to compare the subcluster and its corresponding composite actor for equivalence. More formally, we define as follows:

Definition 4.2 (Equivalence Condition for Deterministic Subclusters) If a subcluster gγ ∈ g.Gγ with deterministic behavior and its corresponding composite actor aγ ∈ g̃.A are connected to infinite input and output channels, i.e., ∀c ∈ g.C : (∃p ∈ gγ.P : (c, p) ∈ g.E ∨ (p, c) ∈ g.E) =⇒ g.P(c).n = ∞ and ∀c ∈ g̃.C : (∃p ∈ aγ.P : (c, p) ∈ g̃.E ∨ (p, c) ∈ g̃.E) =⇒ g̃.P(c).n = ∞, then the subcluster gγ and its corresponding composite actor aγ are equivalent iff their corresponding Kahn descriptions Fgγ and Faγ are equivalent, i.e., gγ and aγ have equivalent semantics ⇐⇒ Fgγ ≡ Faγ. Here, the functions Fgγ, Faγ : (V∗∗)|gγ.I| → (V∗∗)|gγ.O|, mapping (finite or infinite) sequences of tokens νi ∈ V∗∗ on the cluster/actor input ports i ∈ gγ.I, represented as a tuple νinputs = (νi1, νi2, . . . νi|gγ.I|) ∈ (V∗∗)|gγ.I|, to sequences of tokens νo ∈ V∗∗ on the cluster/actor output ports o ∈ gγ.O, represented as a tuple νoutputs = (νo1, νo2, . . . νo|gγ.O|) ∈ (V∗∗)|gγ.O|, are used to describe the semantics of the subcluster gγ and the composite actor aγ, respectively.

5 Following the Gγ notation for the set of all possible clusters, we use Gn to denote the set of all possible network graphs, i.e., Gn = {gγ ∈ Gγ | gγ.P = ∅}.
6 The relation between Kahn's denotational semantics and the semantics of data flow models with the notion of firings is presented in [Lee97].
7 We use V∗∗ to denote the set of all possible finite and infinite sequences of tokens v ∈ V, i.e., V∗∗ = ⋃n∈{0,1,...,∞} Vn.

[Figure 5: The composite network graph g̃sqr is derived from the hierarchical network graph gsqr depicted in Figure 4 by composing its subcluster gsqr,γ into the corresponding composite actor asqr,γ|ApproxBody, with actor input port i1 ∈ asqr,γ.I and actor output port o1 ∈ asqr,γ.O. The firing FSM of asqr,γ consists of the single state qstart with the self-loop transition t1: i1(1)&o1(1)/(a3, a4).]

5 Cluster Composition in SDF Graphs

The composition algorithms discussed in this section are constrained to network graphs containing only actors exhibiting SDF semantics, called SDF graphs for short. We will consider the cluster composition of the three subclusters gsdf,γ1, gsdf,γ2, and gsdf,γ3 of the SDF graph gsdf displayed in Figure 6(a). As our notion of firing FSM is cumbersome for mere SDF actors, we will use the usual notation to describe consumption and production rates of SDF actors, e.g., we will use prod(a1, a2) to denote the token production per firing invocation of the actor a1 on the edge from a1 to a2, cons(a1, a2) to denote the token consumption per firing invocation of the actor a2 on the edge from a1 to a2, and rmin,gγ to denote the repetition vector [LM87] of the SDF graph gγ, e.g., rmin,gsdf = (ra1, ra2, ra3, ra4) = (3, 2, 4, 2).8 Furthermore, we will abstain from the explicit depiction of channels in the visual representation of SDF network graphs to conform to the usual visual representation of these graphs. However, we will still assume that these graphs are given in the notation of Definition 3.2.

8 To be more succinct, we will also regard the repetition vector as a function from an actor to its repetition count, e.g., rmin,gsdf(a1) = ra1 = 3.

For SDF graphs, arbitrarily connected subclusters can be composed into CSDF actors. However, the composition of subclusters into SDF actors is not always possible, due to the introduction of additional constraints into the scheduling of the resulting network graph. These constraints may introduce artificial deadlocks due to input dependencies of a subcluster, e.g., i2 in Figure 6(a), generated by subcluster outputs, e.g., o1. Only compositions of subclusters into actors which do not generate cycles in the acyclic precedence graph (APG) corresponding to the SDF graph are valid transformations. An example of an SDF graph and its corresponding APG is given in Figure 6.

5.1 Cluster Composition to SDF Actors

If we compose a subcluster gγ contained in the network graph gsdf, i.e., A(gsdf, gγ) = (g̃sdf, aγ), into a transformed network graph g̃sdf and a composite actor aγ, each firing invocation of the resulting actor aγ will fire each actor a ∈ gγ.A contained in the subcluster gγ exactly rmin,gγ(a) times, which is the minimum number of actor invocations for the actor a to return the subcluster back to its original state. Therefore, the resulting actor aγ has a repetition count of rmin,g̃sdf(aγ) = rmin,gsdf(a)/rmin,gγ(a) where a ∈ gγ.A. This repetition count corresponds to the number of clusters introduced by this composition operation into the APG of the original network graph gsdf.

For example, let us consider the subcluster gsdf,γ1 from Figure 6(a). The actor a1 has a repetition count of rmin,gsdf,γ1(a1) = cons(a1, a2)/gcd(prod(a1, a2), cons(a1, a2)) = 3. Therefore, the resulting composite actor asdf,γ1 has a repetition count of rmin,gsdf(a1)/rmin,gsdf,γ1(a1) = 3/3 = 1, and one cluster is depicted in the clustering of the APG shown in Figure 7(a) corresponding to this composition of the subcluster to an SDF actor. As can be seen from this figure, this cluster does not introduce any cycles into the APG; therefore, this composition operation is a valid transformation. The clustering of the APG corresponding to the composition of subcluster gsdf,γ2 is displayed in Figure 7(b). However, in this case the repetition count of the resulting actor asdf,γ2 is two; therefore, two clusters are displayed in the figure.
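As an aside, the repetition vector used above can be computed from the SDF balance equations prod(a, b) · r(a) = cons(a, b) · r(b). The following sketch (our own, with its own graph encoding) illustrates this for the a1-a2 edge, whose rates are consistent with prod(a1, a2) = 2 and cons(a1, a2) = 3 given the values stated above; the remaining rates of Figure 6(a) are not reproduced here:

```python
from fractions import Fraction
from math import lcm

def repetition_vector(rates):
    """Minimal repetition vector of a connected, consistent SDF graph.
    `rates` maps each edge (src, snk) to (prod, cons); the result solves the
    balance equations prod(src, snk) * r(src) == cons(src, snk) * r(snk)."""
    actors = {a for edge in rates for a in edge}
    ratio = {next(iter(actors)): Fraction(1)}   # firing ratios relative to one actor
    changed = True
    while changed:                              # propagate ratios along the edges
        changed = False
        for (src, snk), (prod, cons) in rates.items():
            if src in ratio and snk not in ratio:
                ratio[snk] = ratio[src] * prod / cons
                changed = True
            elif snk in ratio and src not in ratio:
                ratio[src] = ratio[snk] * cons / prod
                changed = True
    scale = lcm(*(r.denominator for r in ratio.values()))  # make all entries integral
    return {a: int(r * scale) for a, r in ratio.items()}

# Subcluster g_sdf,gamma1 = {a1, a2} with the inferred rates on its only edge:
r_gamma1 = repetition_vector({("a1", "a2"): (2, 3)})
assert r_gamma1 == {"a1": 3, "a2": 2}
# With r_min,gsdf(a1) = 3 from the full graph, the composite actor's repetition
# count is r_min,gsdf(a1) / r_min,gsdf,gamma1(a1) = 3 / 3 = 1, as in the text.
```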

[Figure 6: An SDF graph and its corresponding marked graph. Disregarding the dotted edges in the marked graph gives the acyclic precedence graph (APG) corresponding to the SDF graph. (a) A non-hierarchical SDF network graph gsdf and three possible subclusters gsdf,γ1, gsdf,γ2, and gsdf,γ3. (b) Marked graph conversion of the SDF graph given in Figure 6(a).]

5.2 Cluster Composition to CSDF Actors

As can be seen in Figure 7, the subclusters gsdf,γ1 and gsdf,γ2 can be composed into SDF actors. But this composition is not valid for arbitrary subclusters, as shown by the counterexample depicted in Figure 8(a). The example corresponds to the composition of the gsdf,γ3 subcluster to an SDF actor, which introduces the cycle a¹sdf,γ3 → a¹1 → a¹sdf,γ3 into the APG, indicating that the composition operation is an invalid transformation. To avoid this cycle, the two clusters in the APG can be reshaped as depicted in Figure 8(b). However, this changes the semantics of the composite actor asdf,γ3 to be CSDF instead of SDF.

[Figure 7: Resulting APG clusters for the SDF actor composition from the gsdf,γ1 and gsdf,γ2 subclusters shown in Figure 6(a). (a) Cluster in the APG corresponding to the composition of the gsdf,γ1 subcluster to an SDF actor. (b) Clusters in the APG corresponding to the composition of the gsdf,γ2 subcluster to an SDF actor.]

6 SDF Cluster Composition in General Network Graphs

In this section we describe the cluster composition of a non-hierarchical SDF subcluster gγ contained in a general network graph g ∈ Gn, e.g., as depicted in Figure 9(a) for the network graph g1 and its subcluster g1,γ. As our notion of firing FSM is cumbersome for SDF actors, we will use the usual notation to describe consumption and production rates of SDF actors, i.e., we use prod : A.O → Z+0 (where Z+0 = {0, 1, 2, 3, . . .} denotes the set of non-negative integers) to denote the number of tokens produced per actor invocation on an output port, e.g., prod(a1.o1) = 1 to denote that one token is produced per firing of the actor a1 on the output port o1, and we use cons : A.I → Z+0 to denote the number of tokens consumed per actor invocation on an input port, e.g., cons(a1.i1) = 1 to denote that one token is consumed per firing of the actor a1 on the input port i1. As is customary, the constant consumption and production rates of SDF actors are annotated on the edges in the network graph figures, e.g., the 2 on the edge from c4 to a2 in Figure 9(a) indicates that firing a2 once consumes two tokens on the actor port i1.

More formally, we are concerned with the algorithm implementing the composite operation A(g, gγ) = (g̃, aγ) from Definition 4.1 under the constraint that the subcluster gγ ∈ g.Gγ is a non-hierarchical cluster containing only static data flow actors. The composite operation transforms the original network graph g and its subcluster gγ into a transformed network graph g̃ containing the composite actor aγ ∈ g̃.A replacing the subcluster while keeping the semantics of the original network graph, e.g., as depicted in Figure 10 for the composite network graph g̃1 and its composite actor a1,γ.

The goal of our composite operation is the reduction of the possible scheduling sequences for the actors in the subcluster while still providing enough flexibility in the scheduling of these actors to prevent the introduction of artificial deadlocks. The reduction of the possible scheduling sequences to a quasi-static schedule (QSS) eliminates the overhead required by a fully dynamic scheduling of these actors. For this purpose, we will present a methodical way to construct the firing FSM aγ.R of the composite actor aγ as defined in Definition 2.3. This firing FSM represents the QSS for a given static data flow subcluster gγ, i.e., the actions faction ∈ aγ.F ⊂ gγ.A∗ are static scheduling sequences of actor firings of the actors of the substituted subcluster gγ. In the following, we will call this form of firing FSM a cluster FSM. Furthermore, we will no longer distinguish between the firing FSM aγ.R of the composite actor and the subcluster gγ from which this composite actor has been derived, and instead just talk about the cluster FSM gγ.R of a subcluster.

[Figure 8: Resulting APG clusters for the SDF actor composition and a CSDF actor composition from the gsdf,γ3 subcluster shown in Figure 6(a). (a) Clusters in the APG corresponding to the composition of the gsdf,γ3 subcluster to an SDF actor. (b) Clusters in the APG corresponding to the composition of the gsdf,γ3 subcluster to a CSDF actor.]

6.1 Cluster Composition Examples

As motivation for our proposed composite algorithm, consider the example with the SDF actors a1, a2, and a3 depicted in Figure 9(a). Note that our algorithm is also able to consider CSDF actors. However, for the sake of readability, the examples in this report use static data flow subclusters containing only SDF actors. As a first try, we compute the static schedule (a1, a1, a2, a3, a3) and substitute the static data flow subcluster in Figure 9(a) by an SDF actor a1,γ, cf. Figure 10.

[Figure 9: Example network graph g1 with non-hierarchical SDF subcluster g1,γ containing only SDF actors and the firing FSMs of the actors a4 and a5. (a) The network graph g1 with the subcluster g1,γ (actors a1, a2, a3 and their internal channels) and the external actors a4 and a5 connected via the channels c4-c8. (b) Firing FSM of actor a4 with the transitions i1(1)&fguard&o1(1) and i1(1)&¬fguard&o2(1). (c) Firing FSM of actor a5 with the transitions i1(1)&o1(1) and i2(1)&o1(1).]

Furthermore, the composite operation may need to increase the size of the channels connected to the composite actor input ports aγ.I and output ports aγ.O, due to a possibly later consumption of input tokens and a possibly atomic production of output tokens, compared to an earlier consumption of input tokens and a production of output tokens spread over multiple firings by the corresponding subcluster gγ. In case of the composite network graph g̃1, the channels c4-c7 must at least be able to contain two tokens, while in the original network graph g1 a size of one for the channels c5 and c7 is sufficient. However, in the following this effect is neglected, i.e., we assume FIFO channels of infinite size on the subcluster input and output ports. This restriction is also formalized in Definition 4.2.

[Figure 10: Composite network graph g̃1 resulting from the substitution of g1,γ by its corresponding composite SDF actor a1,γ. The firing FSM of a1,γ consists of the single state qstart with the self-loop transition t1: i1(2)&i2(2)&o1(2)&o2(2)/(2a1, a2, 2a3).]

The substitution of the static data flow subcluster by a corresponding SDF composite actor might, however, be infeasible due to the environment of the subcluster, e.g., the actors a4 and a5. The firing FSMs of both actors are depicted in Figures 9(b) and 9(c). Considering Figure 11, the conversion of g2,γ into an SDF actor is infeasible due to the feedback loop from the output o2 to the input i1 over the path (c7, a5, c8, a4, c4). This feedback loop imposes the additional constraint on the schedule that actor a3 has to be executed at least once before executing actor a2 in order to avoid a deadlock. Therefore, transforming the cluster g2,γ into its corresponding SDF actor would introduce an artificial deadlock.

[Figure 11: Example network graph g2 with non-hierarchical SDF subcluster g2,γ.]

In the worst case, feedback loops may only contain the minimal number of required initial tokens to enable a deadlock-free execution of the original network graph. This form of feedback loop will in the following be called a maximal tight feedback loop. Adding initial tokens to c4 does not really help in this regard due to the guard fguard in actor a4, which may divert an indeterminable number of tokens to channel c5. The same argument can be made for all feedback paths from cluster outputs to cluster inputs. Therefore, the subclusters g1,γ, g2,γ, g3,γ, and g4,γ are embedded in worst-case cluster environments. The above situation can be solved by converting the subcluster g2,γ into a CSDF composite actor, e.g., cf. Figure 12.

[Figure 12: Composite network graph g̃2 resulting from the substitution of g2,γ by its corresponding composite CSDF actor a2,γ. The cluster FSM of a2,γ cycles through two states with the transitions i1(2)&i2(1)&o1(2)&o2(1)/(a1, a2, a3) and i2(1)&o2(1)/(a1, a3).]

The above examples demonstrate that not all static data flow subclusters can be converted into SDF actors. However, even the conversion of a subcluster into a CSDF actor is not always sufficient to prevent the introduction of a deadlock. To exemplify this, we consider Figure 13 with the two feedback loops o2 → i1 and o1 → i2, which can be enabled alternately by the DDF actor a4. In this situation, the subcluster g3,γ cannot be converted into an SDF or CSDF actor without possibly introducing a deadlock. The feedback loop o1 → i2 implies firing actor a2 first, while o2 → i1 implies firing the scheduling sequence (a1, a3) first. Therefore, the actual selection of the scheduling sequence to execute first depends on the available tokens and free spaces on the subcluster input and output ports. This flexibility can be represented neither by an SDF nor by a CSDF actor.

[Figure 13: Example network graph g3 with non-hierarchical SDF subcluster g3,γ.]

However, the cluster FSM representation for quasi-static schedules provides this flexibility. With this notation, the composite actor a3,γ can be expressed as depicted in Figure 14. As can be seen, two transitions t1 and t2 leave the start state qstart: t1 requires at least two tokens on input port i1 and two free spaces on output port o1, whereas t2 requires at least one input token on input port i2 and one free space on output port o2. Furthermore, we see that taking t1 fires actor a2 and taking t2 fires the scheduling sequence (a1, a3), as required by the two feedback loops in Figure 13. In the following section, we will present a methodical way to construct the cluster FSM gγ.R of a given subcluster gγ.
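For illustration, the cluster FSM of Figure 14 can also be written down as plain data (our own encoding, not the authors' format): each transition carries its input pattern, output pattern, scheduling sequence, and successor state.

```python
# Cluster FSM of g_3,gamma (cf. Figure 14): (source, input pattern, output pattern,
# scheduling sequence executed as the transition's action, target).
cluster_fsm_g3 = [
    ("qstart", {"i1": 2}, {"o1": 2}, ("a2",),      "q1"),
    ("qstart", {"i2": 1}, {"o2": 1}, ("a1", "a3"), "q2"),
    ("q1",     {"i2": 1}, {"o2": 1}, ("a1", "a3"), "q3"),
    ("q2",     {"i1": 2}, {"o1": 2}, ("a2",),      "q3"),
    ("q3",     {"i2": 1}, {"o2": 1}, ("a1", "a3"), "qstart"),
]
```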

[Figure 14: Cluster FSM of subcluster g3,γ, consisting of the states qstart, q1, q2, and q3 and five transitions whose activation patterns and scheduling sequences alternate between i1(2)&o1(2)/(a2) and i2(1)&o2(1)/(a1, a3).]

6.2 Cluster Composition Algorithm

The key idea of our algorithm is that each output o ∈ gγ.O of gγ might have a feedback path via other data flow actors to each input i ∈ gγ.I. As we have learned from the examples in Section 6.1, any produced token on an output o ∈ gγ.O might cause, through these feedback loops, the activation of an actor a ∈ gγ.A in the same subcluster. In particular, postponing the production of an output token may result in a deadlock of the entire system. Hence, the QSS determined by our clustering algorithm guarantees the production of a maximal number of output tokens from the consumption of a minimal number of input tokens. However, the proposed cluster composition algorithm requires the following composite condition as a prerequisite:

Definition 6.1 (Composite Condition) A static data flow subcluster gγ can be clustered by the given algorithm if the subcluster, disregarding its inputs and outputs, is deadlock free itself and if for each pair of actors (asrc, adest), where asrc possesses a subcluster input port and adest possesses a subcluster output port, there exists a directed path p ∈ gγ.V∗ from actor asrc to actor adest, i.e., ∀asrc, adest ∈ gγ.A, asrc ≠ adest : (asrc.I ∩ gγ.I ≠ ∅ ∧ adest.O ∩ gγ.O ≠ ∅) =⇒ ∃ a directed path p = ((asrc, c1), (c1, a2), . . . , (cn−1, adest)) ∈ gγ.V∗.

Otherwise, an unbounded accumulation of tokens will result. To exemplify this, consider the subcluster g3,γ depicted in Figure 13, which satisfies the composite condition. However, removing the channel c3, as depicted in Figure 15, would contradict this condition due to the missing path from i2 to o1. This violation would introduce an unbounded accumulation of tokens on channel c1, caused by the production of tokens on port o1, which does not require any tokens on port i2 and therefore never fires actor a3. However, static data flow subclusters violating the composite condition can be partitioned, by removing the unbounded channels, into subclusters satisfying the composite condition.

[Figure 15: Example network graph g4 with non-hierarchical SDF subcluster g4,γ violating the composite condition.]

Our proposed clustering algorithm works in three steps: (step 1) preprocessing, (step 2) computing the set of input/output states, each representing the maximal production of output tokens with a minimal consumption of input tokens, and (step 3) constructing the cluster FSM. In the following, we discuss these individual steps in detail.

The preprocessing computes some termination criteria for step 2 and step 3. In particular, the number of firings of each actor to bring the subcluster back into its initial state as well as the number of tokens consumed and produced by these firings will be computed. More formally:

Step 1.1: Compute the repetition vector [LM87] rmin,gγ for subcluster gγ, i.e., a positive integer rmin,gγ(a) is assigned to each actor a ∈ gγ.A in the subcluster denoting the minimal number of firings of a to return gγ back to its initial state. For the subcluster g3,γ in Figure 13, the repetition vector is rmin,g3,γ = (ra1, ra2, ra3) = (2, 1, 2), i.e., actors a1 and a3 have to be fired twice, whereas actor a2 has to be fired once.10

Step 1.2: Compute the so-called input/output repetition vector nmin,gγ that assigns to each input i ∈ gγ.I the number of consumed tokens nmin,gγ(i) and to each output o ∈ gγ.O the number of produced tokens nmin,gγ(o) by firing each actor a ∈ gγ.A exactly rmin,gγ(a) times. For example, the input/output repetition vector of g3,γ computes to nmin,g3,γ = (ni1, ni2, no1, no2) = (2, 2, 2, 2), i.e., firing actors a1 and a3 twice and actor a2 once consumes two tokens from each input port and produces two tokens on each output port.
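As a small illustration of steps 1.1 and 1.2 (our own encoding; the attachment of subcluster ports to actors is our reading of the example, chosen to reproduce the numbers above):

```python
def io_repetition_vector(r_min, port_rates):
    """Step 1.2: tokens consumed per subcluster input / produced per subcluster output
    when every actor a fires exactly r_min[a] times. `port_rates` maps each subcluster
    port to (actor, rate), the SDF rate of the actor port bound to it."""
    return {port: rate * r_min[actor] for port, (actor, rate) in port_rates.items()}

# Repetition vector of g_3,gamma from step 1.1 and the assumed port attachment:
r_min_g3 = {"a1": 2, "a2": 1, "a3": 2}
port_rates = {"i1": ("a2", 2), "i2": ("a1", 1), "o1": ("a2", 2), "o2": ("a3", 1)}
assert io_repetition_vector(r_min_g3, port_rates) == {"i1": 2, "i2": 2, "o1": 2, "o2": 2}
```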

In order to avoid deadlocks, we assume the worst case, i.e., each produced output token causes the activation of an actor in the subcluster via a feedback loop. Hence, it is required that the resulting QSS always produces a maximal number of output tokens with a minimal number of input tokens. Each end point of such a production is marked by an input/output state of the subcluster. As an exhaustive evaluation is prohibitive, we propose the following three steps to determine the input/output states:

Step 2.1: Compute for each output port o the input/output dependency tuples encoding the minimal numbers of consumed tokens on the input ports to produce n tokens on o. For this purpose, we formally define an input/output dependency function depgγ.

10 To be more succinct, we will also regard the repetition vector as a function from an actor to its repetition count, e.g., rmin,g3,γ(a1) = ra1 = 2.

Definition 6.2 (Input/Output Dependency Function) Given a subcluster gγ, the input/output dependency function depgγ : gγ.O × Z+0 → (Z+0)|gγ.I| is a function that associates with each subcluster output port o ∈ gγ.O and a requested number of tokens n ∈ Z+0 a vector of the minimal numbers of input tokens (ni1, ni2, . . . ni|gγ.I|) consumed on each subcluster input port i ∈ gγ.I to produce the requested number n of tokens on the output port o.

Note that the set of input/output dependency values iodepgγ we need to consider can be bounded by the projection mmin,gγ of the input/output repetition vector nmin,gγ to its inputs, i.e., mmin,gγ = (nmin,gγ(i1), nmin,gγ(i2), . . . nmin,gγ(i|gγ.I|)). The set iodepgγ is bounded by the condition that we only need one tuple per output port greater than or equal to the input/output repetition vector, i.e., iodepgγ = {depgγ(o, n) | o ∈ gγ.O, n ∈ Z+0, ∄n′ ∈ Z+0, n′ < n : mmin,gγ ≤ depgγ(o, n′) < depgγ(o, n)}. This tuple is needed in step 3.3 to generate cycles in the resulting cluster FSM. For the subcluster g3,γ, the input/output dependencies are shown in Table 1.

Step 2.2: Compute the so-called input/output state set iostatesgγ by determining the maximal number of output tokens producible on each output port for the number of input tokens available in each input/output dependency value. Additionally, add the zero input/output state (0, 0, . . . 0), which may be missing if the subcluster can produce output tokens without consuming any inputs. More formally, we define the input/output state set as follows:

Definition 6.3 (Input/Output State Set) The input/output state set iostatesgγ ⊆ (Z+0)|gγ.I∪gγ.O| consists of input/output states (ni1, ni2, . . . ni|gγ.I|, no1, no2, . . . no|gγ.O|) encoding the minimal number of required input tokens ni per input port i ∈ gγ.I and the maximal number of producible output tokens no per output port o ∈ gγ.O from these input tokens:

iostatesgγ = {(0, 0, . . . 0)} ∪ {(ni1, ni2, . . . ni|gγ.I|, no1, no2, . . . no|gγ.O|) | (ni1, ni2, . . . ni|gγ.I|) ∈ iodepgγ ∧ ∀o ∈ gγ.O : no = max({n ∈ Z+0 | depgγ(o, n) ≤ (ni1, ni2, . . . ni|gγ.I|)})}.

For the subcluster g3,γ in Figure 13, the input/output state set iostatesg3,γ computes to {(0,0,0,0), (2,0,2,0), (0,1,0,1), (2,2,2,2)} (cf. Table 1). Note that the entries for n = 3 are not contained in iodepg3,γ due to their redundancy.

Table 1: Input/output dependency values for subcluster g3,γ from Figure 13 and corresponding input/output states. For example, to produce at least n = 1 token on output o = o1, at least two tokens are required from input port i1.

        depg3,γ(o,n)          iostatesg3,γ
        o = o1   o = o2       o = o1       o = o2
n = 0   (0,0)    (0,0)        (0,0,0,0)    (0,0,0,0)
n = 1   (2,0)    (0,1)        (2,0,2,0)    (0,1,0,1)
n = 2   (2,0)    (2,2)        (2,0,2,0)    (2,2,2,2)
n = 3   (4,2)    (2,3)        —            —

Step 2.3: In the previous two steps, we have neglected the different interleavings of actor firings that are permitted by the partial order of these actor firings. Thus, we compute the least fixpoint lfp(iostatesgγ) of the pointwise maximum of all pairs of input/output states n ∈ iostatesgγ. For our example in Figure 13, the least fixpoint computes to lfp(iostatesg3,γ) = {(0,0,0,0), (2,0,2,0), (0,1,0,1), (2,1,2,1), (2,2,2,2)}. Note that the state (2,1,2,1) has been added to the input/output state set.
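The following sketch (ours, not the authors' code) reproduces steps 2.2 and 2.3 from the tabulated dependency values of Table 1; for simplicity it takes every input vector occurring in the already bounded table instead of re-deriving iodepgγ:

```python
def leq(a, b):
    """Pointwise comparison of token vectors."""
    return all(x <= y for x, y in zip(a, b))

def io_states(dep, num_inputs):
    """Step 2.2: pair each input vector of the dependency table with the maximal
    number of tokens producible on every output port from exactly those inputs."""
    outputs = sorted(dep)
    in_vectors = {vec for table in dep.values() for vec in table.values()}
    states = {(0,) * (num_inputs + len(outputs))}          # zero input/output state
    for iv in in_vectors:
        ov = tuple(max(n for n, need in dep[o].items() if leq(need, iv))
                   for o in outputs)
        states.add(iv + ov)
    return states

def lfp_pointwise_max(states):
    """Step 2.3: close the state set under the pointwise maximum of state pairs."""
    states, changed = set(states), True
    while changed:
        changed = False
        for a in list(states):
            for b in list(states):
                m = tuple(max(x, y) for x, y in zip(a, b))
                if m not in states:
                    states.add(m)
                    changed = True
    return states

# Dependency values of Table 1 for g_3,gamma: dep[o][n] = (tokens on i1, tokens on i2).
dep_g3 = {"o1": {0: (0, 0), 1: (2, 0), 2: (2, 0)},
          "o2": {0: (0, 0), 1: (0, 1), 2: (2, 2)}}
states = lfp_pointwise_max(io_states(dep_g3, num_inputs=2))
assert states == {(0, 0, 0, 0), (2, 0, 2, 0), (0, 1, 0, 1), (2, 1, 2, 1), (2, 2, 2, 2)}
```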

After computing the input/output states, the cluster FSM can be constructed byordering the input/output states and computing the transitions between input/outputstates. This can be done by the following five steps:Step 3.1: Compute the partial order n1 ≤ n2 on lfp(iostatesgγ

) where n1 ≤ n2 iff∀p ∈ gγ .I∪gγ .O : n1(p)≤ n2(p). We use Hasse diagrams to visualize partial orders,where vertices represent input/output states and directed edges (n1,n2) representthe relation n1 ≤ n2. The resulting Hasse diagram of lfp(iostatesg3,γ ) computed instep 2.3 is depicted in Figure 16. This partial order implies the state transitions forthe cluster FSM as can be seen later. Furthermore, we use T Ogγ

to denote the set ofso called tightly ordered pairs (nsrc,ndest) ∈ lfp(iostatesgγ

)2 where nsrc < ndest butno input/output state n′ exists between nsrc and ndest, i.e., T Ogγ

= {(nsrc,ndest) ∈lfp(iostatesgγ

)2 | nsrc < ndest ∧ @n′ ∈ lfp(iostatesgγ) : nsrc < n′ < ndest}. Hence,

the tightly ordered pairs are the edges in the corresponding Hasse diagram, e.g.,e3 = ((2,0,2,0),(2,1,2,1)) ∈ T Og3,γ in Figure 16. Finally, we define a modulooperation n mod nmin,gγ

to be the greatest vector n′ = n+n ·nmin,gγ,n∈Z not greater

than or equal to nmin,gγ, e.g., (1,3) mod (2,2) = (1,3) but (2,3) mod (2,2) = (0,1).

Later, the modulo operation is used in step 3.3 to determine the set of transitions Tof the cluster FSM.Step 3.2: Compute the state set Qfiring of the cluster FSM R by generating a state

q ∈ Qfiring for each input/output state n not greater than or equal to the input/outputrepetition vector, i.e.,R.Q = {n ∈ lfp(iostatesgγ

) | n 6≥ nmin,gγ}. The bottom element

in the partial order, i.e., the zero input/output state (0,0, . . .0), corresponds to

23

Page 24: Task Graph Clustering with Internal State€¦ · Task Graph Clustering with Internal State Joachim Falk, Christian Haubelt, and Jürgen Teich Department of Computer Science 12 Hardware-Software-Co-Design

(0,0,0,0)

(2,1,2,1)

(2,2,2,2)

(0,1,0,1)

(2,0,2,0)

(ni1 ,ni2,no1 ,no2)

e1

e2

e3

e4

e5

Figure 16: Hasse diagram depicting the partial order derived from the input/outputstate set lfp(iostatesg3,γ ) corresponding to the subcluster g3,γ shown in Fig-ure 13.

Step 3.3: Compute the transition set T of the cluster FSM R by generating a transition t ∈ T for each tightly ordered pair of input/output states, i.e., R.T = {(q_src, N_gγ(q_src, q_dest), R_gγ(q_src, q_dest), q_dest) | (n_src, n_dest) ∈ TO_gγ, q_src = n_src mod n_min,gγ, q_dest = n_dest mod n_min,gγ}. The functions N_gγ and R_gγ are used to encode the activation pattern k and the scheduling sequence f_action of each transition t and will be derived in step 3.4 and step 3.5, respectively. Considering TO_g3,γ = {e1, e2, e3, e4, e5} in Figure 16, the resulting set of transitions is T = {(qstart, k_e1, f_e1, q1), (qstart, k_e2, f_e2, q2), (q1, k_e3, f_e3, q3), (q2, k_e4, f_e4, q3), (q3, k_e5, f_e5, qstart)}. Note that only for transition t5 = (q3, k_e5, f_e5, qstart) the modulo operation was necessary.

Step 3.4: Compute the activation pattern k of each transition t ∈ T by generating an activation pattern N_gγ(q_src, q_dest) encoding the minimal number of tokens on each subcluster input port required to enable the transition t, derived from each tightly ordered pair of input/output states, i.e., ∀(n_src, n_dest) ∈ TO_gγ : N_gγ(n_src mod n_min,gγ, n_dest mod n_min,gγ) = i1(n_i1) & i2(n_i2) & ... & i_|gγ.I|(n_i|gγ.I|) & o1(n_o1) & o2(n_o2) & ... & o_|gγ.O|(n_o|gγ.O|), where n_p = n_dest(p) − n_src(p) for all p ∈ gγ.I ∪ gγ.O. Considering e5, at least one token, i.e., ((2,2,2,2) − (2,1,2,1))(i2) = 1, on port i2 and one free space on port o2 is necessary to enter state qstart.

Step 3.5: Compute the scheduling sequence f_action of each transition t ∈ T by generating a partial repetition vector r_gγ,qsrc,qdest that assigns to each actor a ∈ gγ.A a non-negative integer r_gγ,qsrc,qdest(a) indicating the number of actor firings required to go from state q_src to state q_dest. This partial repetition vector is derived from each tightly ordered pair of input/output states, i.e., r_gγ,qsrc,qdest = r_dest − r_src, where q_src = n_src mod n_min,gγ and q_dest = n_dest mod n_min,gγ. Furthermore, the repetition vectors r_dest and r_src correspond to the input/output states n_dest and n_src, e.g., firing each actor a ∈ gγ.A exactly r_src(a) times produces n_src(o) tokens on output port o and consumes n_src(i) tokens from input port i. Finally, for each transition, a single processor schedule is computed on the basis of the partial repetition vector by a version of the scheduling algorithm presented in [HB07] modified to support partial repetition vectors. This schedule is assigned to R_gγ(q_src, q_dest) and is one of the schedule phases of the resulting DDF actor replacing the static data flow subcluster gγ.
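To summarize steps 3.2 to 3.4, the following Python sketch derives the FSM states and, for every tightly ordered pair, the corresponding transition together with its activation pattern given as token deltas per port. It is a simplified illustration under our own naming that reuses the helpers leq, vec_mod, and tightly_ordered_pairs sketched above; step 3.5, i.e., attaching the single processor schedule computed from the partial repetition vector, is omitted.

def cluster_fsm(lfp_iostates, n_min, ports):
    """Derive the cluster FSM states and transitions (steps 3.2 to 3.4).

    lfp_iostates -- set of input/output states (tuples ordered as `ports`)
    n_min        -- input/output repetition vector
    ports        -- port names, e.g. ("i1", "i2", "o1", "o2")
    Returns (states, initial_state, transitions); each transition is a triple
    (q_src, pattern, q_dest) where `pattern` maps every port to the token
    delta n_dest(p) - n_src(p) of the underlying tightly ordered pair.
    """
    # Step 3.2: one FSM state per input/output state not >= the repetition vector.
    states = {n for n in lfp_iostates if not leq(n_min, n)}
    q0 = tuple(0 for _ in ports)                 # zero input/output state

    transitions = []
    for n_src, n_dest in tightly_ordered_pairs(lfp_iostates):
        # Step 3.3: map both endpoints back into the state set via the modulo op.
        q_src, q_dest = vec_mod(n_src, n_min), vec_mod(n_dest, n_min)
        # Step 3.4: activation pattern as per-port token deltas.
        pattern = {p: n_dest[k] - n_src[k] for k, p in enumerate(ports)}
        # Step 3.5 would attach a schedule for the partial repetition
        # vector r_dest - r_src here (omitted in this sketch).
        transitions.append((q_src, pattern, q_dest))
    return states, q0, transitions

states, q0, trans = cluster_fsm(
    {(0,0,0,0), (2,0,2,0), (0,1,0,1), (2,1,2,1), (2,2,2,2)},
    n_min=(2,2,2,2), ports=("i1", "i2", "o1", "o2"))
# states == {(0,0,0,0), (2,0,2,0), (0,1,0,1), (2,1,2,1)}; the edge e5 from
# (2,1,2,1) to (2,2,2,2) becomes a transition back to the initial state and
# requires one token on i2 and one free space on o2, as derived above.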

7 Proofs

In this section, we will formally prove the correctness of our proposed clustering algorithms by showing the equivalence of the semantics of an SDF subdomain gγ and its corresponding composite actor aγ. This is done by defining the semantics of both gγ and aγ via the denotational semantics for Kahn process networks (KPNs) [Kah74] and abstracting from the data values.

Proposition 7.1 Let gγ be an SDF subdomain of a data flow graph g satisfying the clustering condition in Definition 6.1, and let aγ be its corresponding composite actor constructed by our clustering algorithm in Section 6.2. Then the behaviors of g and the data flow graph g̃ resulting from replacing gγ with aγ are sequence equivalent (the definition of sequence equivalence can be found in [LSV98]).

To prove the previous proposition we will first introduce some formal notation: A repetition state is a tuple r = (r_a1, r_a2, ..., r_a|gγ.A|) ∈ (Z≥0)^|gγ.A| encoding the number of firings r_a of each actor a in the cluster gγ counted from the startup of the system (to be more succinct, we will also regard the repetition state as a function from an actor to its repetition count, e.g., r(a1) = r_a1). Analogously, we define an input/output state as a tuple n = (n_i1, n_i2, ..., n_i|gγ.I|, n_o1, n_o2, ..., n_o|gγ.O|) ∈ (Z≥0)^|gγ.P| encoding the number of consumed tokens n_i on each subcluster input port i ∈ gγ.I and the number of produced tokens n_o on each subcluster output port o ∈ gγ.O counted from the startup of the system (analogous to the repetition state, the input/output state will also be regarded as a function from ports to the consumed or produced tokens on that port, e.g., n(i1) = n_i1).

A repetition vector is called reachable iff there exists at least one valid scheduling sequence ρ from the startup of the system which contains each actor a ∈ gγ.A exactly r_a = ρ(a) times (we use ρ(a) to denote the number of occurrences of the actor a in the scheduling sequence ρ; e.g., for r = (r_a2, r_a3, r_a4) = (2,1,2) and ρ = (a2, a2, a3, a4, a4) we have ρ(a2) = r(a2) = 2, ρ(a3) = r(a3) = 1, and ρ(a4) = r(a4) = 2). The set of reachable repetition states is denoted by R ⊆ (Z≥0)^|gγ.A|. Equivalently, we define an input/output state n as reachable if a corresponding reachable repetition state r exists which consumes exactly n(i) tokens on each subcluster input port i ∈ gγ.I and produces exactly n(o) tokens on each subcluster output port o. This corresponding repetition state will be denoted by f_repstate(n). The set of reachable input/output states will be denoted by N ⊆ (Z≥0)^|gγ.P|.
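The notion of reachability can also be checked operationally by simulating the subcluster: fire enabled actors that still have pending firings until the target repetition state is reached or no progress is possible. The following Python sketch illustrates this; the explicit encoding of channels with production and consumption rates is our own assumption and is not taken from this report, and the subcluster is simulated disregarding its input and output ports, as in Definition 6.1.

def reachable(target, channels, initial_tokens):
    """Check whether the repetition state `target` (actor -> firing count)
    is reachable in an SDF subcluster simulated without its external ports.

    channels       -- dict: channel -> (src_actor, prod_rate, dst_actor, cons_rate)
    initial_tokens -- dict: channel -> number of initial tokens
    """
    fired = {a: 0 for a in target}
    tokens = dict(initial_tokens)

    def enabled(actor):
        # An actor is enabled iff every incoming channel holds enough tokens.
        return all(tokens[c] >= cons
                   for c, (_, _, dst, cons) in channels.items() if dst == actor)

    progress = True
    while progress and fired != target:
        progress = False
        for a in target:
            if fired[a] < target[a] and enabled(a):
                for c, (src, prod, dst, cons) in channels.items():
                    if dst == a:
                        tokens[c] -= cons      # consume from input channels
                    if src == a:
                        tokens[c] += prod      # produce onto output channels
                fired[a] += 1
                progress = True
    return fired == target

Because of the conflict-free property used below, the order in which enabled actors are picked does not matter, so this greedy simulation reaches the target repetition state whenever it is reachable.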

As a first step, we prove that all input/output states in lfp(iostates_gγ) are reachable. This guarantees that for each transition generated in step 3.5 a valid schedule exists. From Definition 6.1, i.e., gγ disregarding its input and output ports is deadlock free, and from the definition of iostates_gγ, we know that all input/output states in iostates_gγ are reachable. From Section 6.2 we know that lfp(iostates_gγ) is defined as the least fixpoint of the pointwise maximum operation applied to all pairs of input/output states in iostates_gγ. Therefore, proving that the pointwise maximum max(n1, n2) of two reachable input/output states n1 and n2 is also reachable proves the reachability of all input/output states in lfp(iostates_gγ) by induction. From the observation that each input and output port of a subcluster is associated with a unique actor, we derive the following relation: max(n1, n2) = f_repstate^{-1}(max(f_repstate(n1), f_repstate(n2))). With this relation, the proof of the reachability of the pointwise maximum of pairs of reachable input/output states can be reduced to the proof of the reachability of the pointwise maximum of pairs of reachable repetition states, i.e., (∀r1, r2 ∈ R : max(r1, r2) ∈ R) =⇒ (∀n1, n2 ∈ N : max(n1, n2) ∈ N).

Proposition 7.2 (Repetition State Reachability) If two repetition vectors r1 and r2 are reachable, then the pointwise minimum and maximum of these two repetition vectors are reachable as well, i.e., r1 is reachable ∧ r2 is reachable =⇒ min(r1, r2) is reachable ∧ max(r1, r2) is reachable, where ∀a ∈ gγ.A : min(r1, r2)(a) = min(r1(a), r2(a)) and ∀a ∈ gγ.A : max(r1, r2)(a) = max(r1(a), r2(a)).
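Before the formal proof, the proposition can be spot-checked with the simulation helper sketched above on a small, purely illustrative subcluster; the graph, rates, and initial tokens below are invented for this example and do not come from the report.

# Invented two-actor cycle: a1 -> a2 over channel c12 and a2 -> a1 over c21,
# all rates 1, one initial token on each channel.
channels = {
    "c12": ("a1", 1, "a2", 1),   # (src, prod_rate, dst, cons_rate)
    "c21": ("a2", 1, "a1", 1),
}
initial = {"c12": 1, "c21": 1}

def pointwise(op, r1, r2):
    return {a: op(r1[a], r2[a]) for a in r1}

r1 = {"a1": 2, "a2": 1}          # reachable, e.g., by the sequence (a1, a2, a1)
r2 = {"a1": 1, "a2": 2}          # reachable, e.g., by the sequence (a1, a2, a2)
assert reachable(r1, channels, initial) and reachable(r2, channels, initial)
assert reachable(pointwise(max, r1, r2), channels, initial)   # max is reachable
assert reachable(pointwise(min, r1, r2), channels, initial)   # min is reachable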

Proof: We rely on the well-known conflict-free property of SDF systems. This property ensures that an actor a, once enabled, cannot be disabled by firing another actor a′ ∈ gγ.A\{a}.

Let us first prove the repetition state reachability proposition for the pointwise maximum. Given two reachable repetition vectors r1 and r2, let A′ be the set of actors which have more repetitions in r2 than in r1, i.e., A′ = {a′ ∈ gγ.A | r2(a′) > r1(a′)}. If A′ is empty, the proposition is trivially true due to max(r1, r2) = r1. Otherwise, we note that a valid scheduling sequence ρ2 for the repetition vector r2 exists. Let ρ′2 be the longest prefix of ρ2 with the property that each actor a′ ∈ A′ occurs at most r1(a′) times, i.e., ρ′2 ⊏ ρ2, ∀a′ ∈ A′ : ρ′2(a′) ≤ r1(a′), and ρ′2 is the longest prefix with this property. Therefore, after executing the scheduling sequence ρ′2, at least one enabled actor a′ ∈ A′ exists, i.e., ρ′2 ⌢ (a′) ⊑ ρ2 (we use the ⌢-operator to denote the concatenation of sequences, e.g., (a2, a2, a3) ⌢ (a4, a4) = (a2, a2, a3, a4, a4)). Furthermore, due to ∀a ∈ gγ.A\A′ : ρ′2(a) ≤ ρ2(a) = r2(a) ≤ r1(a), all actors a ∈ gγ.A occur at most r1(a) times in the scheduling sequence ρ′2. Due to the SDF conflict-free property, this actor a′ must also be enabled after executing the scheduling sequence ρ1 corresponding to the repetition vector r1. Firing this actor a′ proves the proposition by induction, i.e., let r′1 = r1 except r′1(a′) = r1(a′) + 1. Additionally, max(r1, r2) = max(r′1, r2), and we note that max(r′1, r2) is reachable due to induction. □

The proof of the reachability of the pointwise minimum additionally relies on the dedicated dependency property of SDF systems. This property ensures that each data dependency of an actor a′′ can only be satisfied by firing a dedicated actor a′, i.e., if ρ ⌢ (a′, a′′) is a valid schedule and ρ ⌢ (a′′) is an invalid schedule, then ∀ρ′ ∈ gγ.A* : ρ′ is a valid schedule ∧ ρ′(a′′) > ρ(a′′) =⇒ ρ′(a′) > ρ(a′). Given two reachable repetition vectors r1 and r2, let A′ be the set of actors which have fewer repetitions in r2 than in r1, i.e., A′ = {a′ ∈ gγ.A | r2(a′) < r1(a′)}. If A′ is empty, the proposition is trivially true. Let ρ′1 be the longest prefix of ρ1 with the property that after executing the scheduling sequence ρ′1 at least one enabled actor a′ ∈ A′ exists, i.e., ∃a′ ∈ A′, ρ1, ρ′1 ∈ gγ.A* : ρ′1 ⌢ (a′) ⊑ ρ1, ρ1 is a valid schedule for r1, and ρ′1 is a longest prefix with this property. If ρ′1 ⌢ (a′) ⊏ ρ1, then there exists a valid schedule ρ′1 ⌢ (a′, a′′) with a′′ ∈ gγ.A\A′, but ρ′1 ⌢ (a′′) must be invalid. Otherwise, ρ′′1 = ρ′1 ⌢ (a′′) would be a longer prefix than ρ′1, because the SDF conflict-free property ensures that ρ′′1 ⌢ (a′) is a valid schedule for r1. This triggers the SDF dedicated dependency property, which ensures ∀ρ′ ∈ gγ.A* : ρ′ is a valid schedule ∧ ρ′(a′′) > ρ′1(a′′) =⇒ ρ′(a′) > ρ′1(a′). However, this contradicts the existence of the valid schedule ρ2 corresponding to r2, which satisfies ρ2(a′′) ≥ ρ1(a′′) > ρ′1(a′′) but fails ρ2(a′) > ρ′1(a′) due to a′ ∈ A′ and ρ2(a′) = r2(a′) ≤ r1(a′) − 1 = ρ′1(a′). Therefore, the last actor in the schedule ρ1 must be a′, i.e., ρ′1 ⌢ (a′) = ρ1. Unfiring this actor a′ proves the proposition by induction, i.e., let r′1 = r1 except r′1(a′) = r1(a′) − 1. Additionally, min(r1, r2) = min(r′1, r2), and we note that min(r′1, r2) is reachable due to induction. □

Proof: We will prove Proposition 7.1 by showing that the Kahn descriptions

F_gγ and F_aγ of gγ and aγ are equivalent, i.e., F_gγ ≡ F_aγ (the relation between Kahn's denotational semantics and the semantics of data flow models with the notion of firing is presented in [Lee97]). These functions map tuples of sequences of tokens ν_inputs = (ν_i1, ν_i2, ..., ν_i|gγ.I|) ∈ (V**)^|gγ.I| on the input ports to tuples of sequences of tokens ν_outputs = (ν_o1, ν_o2, ..., ν_o|gγ.O|) ∈ (V**)^|gγ.O| on the output ports, i.e., F_gγ, F_aγ : (V**)^|aγ.I| → (V**)^|aγ.O| (we use V** to denote the set of all possible finite and infinite sequences of tokens v ∈ V, i.e., V** = ∪_{n ∈ {0,1,...,∞}} V^n). Due to the data-independent nature of synchronous data flow and CSDF actors, we can safely abstract from the token values in the sequences. The production of the correct token values is guaranteed by the firing of the contained actors on the same tokens as in the original subgraph. The corresponding abstract functions F′_gγ, F′_aγ of the denotational Kahn functions F_gγ, F_aγ substitute sequences of tokens by their lengths, i.e., F′_gγ : (Z≥0)^|gγ.I| → (Z≥0)^|gγ.O|, ∀ν = (ν_i1, ν_i2, ..., ν_i|gγ.I|) ∈ (V**)^|gγ.I|, o ∈ gγ.O : F′_gγ(|ν_i1|, |ν_i2|, ..., |ν_i|gγ.I||)(o) = |F_gγ(ν)(o)|, and analogously for F_aγ and the corresponding abstracted function F′_aγ (we use |ν| to denote the length of the sequence ν, e.g., |(11,7,33)| = 3). With this abstraction, the equivalence of the denotational Kahn functions is reduced to the equivalence of their corresponding abstracted functions, i.e., F_gγ ≡ F_aγ ⇐⇒ F′_gγ ≡ F′_aγ. We note that the behaviors of F_aγ can only be a subset of the behaviors of F_gγ due to the validity of all transitions t ∈ gγ.R.T. Therefore, the non-equivalence of F_gγ and F_aγ could only be caused by missing transitions in T.

We prove the existence of all necessary transitions by induction and assume in

the following an infinite input/output state set iostates′_gγ = {n | n mod n_min,gγ ∈ iostates_gγ}. This is permissible because it introduces only redundant input/output states (see steps 3.2–3.5). Let us assume that F′_gγ is equivalent to F′_aγ for all m ≤ m_bound = (m_b,i1, m_b,i2, ..., m_b,i|gγ.I|), i.e., ∀m ∈ (Z≥0)^|gγ.I|, m ≤ m_bound : F′_gγ(m) = F′_aγ(m), and that the boundary m_bound has minimal input for its generated output, i.e., ∄m′ < m_bound : F′_gγ(m_bound) = F′_gγ(m′). We note that the input/output state corresponding to m_bound is n_bound = (m_b,i1, m_b,i2, ..., m_b,i|gγ.I|, n_b,o1, n_b,o2, ..., n_b,o|gγ.O|), where ∀o ∈ gγ.O : n_b,o = F′_gγ(m_bound)(o) = max({n ∈ Z≥0 | dep_gγ(o, n) ≤ m_bound}), and that the cluster FSM aγ.R, after having consumed m_bound(i) tokens on the input ports i ∈ aγ.I, is in state n_bound mod n_min,gγ = q_bound ∈ Q_firing.

The induction proceeds as follows: (i) we identify the set of minimal input extensions for which the Kahn function F_gγ generates additional outputs, (ii) we prove that each such input extension corresponds to a transition leaving the state q_bound, and (iii) finally, by firing this transition leading to a new state q_next, for which we can derive a new greater bound m′_bound > m_bound, we complete our induction step.

More formally, we identify the set M_next of minimal input extensions that generate additional outputs, i.e., M_next = {m_next ∈ (Z≥0)^|gγ.I| | m_next > m_bound ∧ F′_gγ(m_next) > F′_gγ(m_bound) ∧ ∄m′ ∈ (Z≥0)^|gγ.I| : m_next > m′ > m_bound ∧ F′_gγ(m′) > F′_gγ(m_bound)}. Without loss of generality, we select m_next ∈ M_next and define n_next from m_next analogously to n_bound from m_bound. To complete the induction step, we have to invoke the transition t = (q_bound, q_next), which exists per definition from step 3.3 if q_next = n_next mod n_min,gγ ∈ Q_firing. To prove q_next ∈ Q_firing, we observe the property ∄m′ < m_next : F′_gγ(m_next) = F′_gγ(m′), which is equivalent to ∀i ∈ gγ.I : ∃o = f_tightio(i) ∈ gγ.O : m_next(i) = dep_gγ(o, F′_gγ(m_next)(o))(i). Otherwise, a contradiction to the definition of M_next or to the property that m_bound has minimal input for its generated output would exist. With the existence of the tight input/output dependency i → o for each input port i ∈ gγ.I where o = f_tightio(i), we define m_next,i = dep_gγ(f_tightio(i), F′_gγ(m_next)(f_tightio(i))). Moreover, we derive n_next,i from m_next,i analogously to n_bound from m_bound. Furthermore, we note that n_next,i ∈ iostates′_gγ, which can be trivially derived from Definition 6.3 itself, i.e., {n_next,i | i ∈ gγ.I} ⊆ iostates′_gγ. Additionally, we note that n_next,i ≤ n_next ∧ n_next,i(i) = n_next(i) ∧ n_next,i(f_tightio(i)) = n_next(f_tightio(i)), and therefore n_next ∈ lfp({n_next,i | i ∈ gγ.I}). Finally, A ⊆ B =⇒ lfp(A) ⊆ lfp(B), and therefore n_next ∈ lfp(iostates′_gγ) and at last q_next ∈ Q_firing.

ially true if the subgraph can not produce output tokens without consuming any in-puts. Otherwise we note that exactly one transition t0 exists leaving q0firing whichproduces this initial outputs. Invoking t0 leads to the cluster FSM state we use as newqbound. �

8 Results

In order to illustrate the benefits of our clustering algorithm developed in Section 6.2, we have applied it both to synthetic data flow graphs and to the IDCT of a Motion-JPEG decoder. The synthetic graphs are generated based on Figure 13 by adding a variable number of actors on the edges (a4,a2), (a2,a4), and (a1,a3).

Table 2 shows the obtained measurements depending on the number of actors in the subgraph. It compares the runtime when applying the QSS determined by the clustering algorithm against a dynamic scheduler. The latter polls each actor in a round-robin fashion to determine whether it can execute or not. The achieved improvements clearly show the benefits of our clustering algorithm, which is able to derive quasi-static schedules even when the cluster cannot be represented by a static data flow actor.

Table 2: Measured runtimes for different synthetic graphs and 1000 cluster FSM cycles.

  #Actors   QSS [s]   dynamic [s]   reduction [s]
       12       1.7           4.3             2.6
       24       6.8          11.7             4.9
       39      13.2          21.5             8.3
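The dynamic scheduler used as the baseline can be pictured as a plain round-robin polling loop. The following Python sketch is our own simplified rendition (the actual implementation on the target differs) and only serves to show where the polling overhead comes from: every actor and every FIFO connected to it is checked in each round, regardless of whether the actor can fire.

def dynamic_round_robin(actors, can_fire, fire, rounds):
    """Baseline dynamic scheduler: poll every actor in round-robin order.

    actors   -- list of actor identifiers mapped to this processor
    can_fire -- callable(actor) -> bool; checks the fill levels of all FIFOs
                connected to the actor, including purely internal ones
    fire     -- callable(actor); executes one firing of the actor
    rounds   -- number of polling rounds to simulate
    """
    failed_polls = 0
    for _ in range(rounds):
        for a in actors:              # every actor is polled in every round ...
            if can_fire(a):
                fire(a)
            else:
                failed_polls += 1     # ... and unsuccessful polls waste cycles
    return failed_polls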


In order to evaluate the optimization potential for real-world examples, we have additionally applied our clustering algorithm to the two-dimensional IDCT of our Motion-JPEG decoder (see Figure 17). It transforms blocks of 8×8 frequency coefficients into equally sized image blocks and encompasses both synchronous data flow and CSDF semantics.


Figure 17: Clustered multi-processor implementation of the two-dimensional IDCT. Inter-processor communication is realized via hardware FIFO links.

Figure 17 shows the chosen partitioning into four static data flow subgraphs. Each of them satisfies the clustering condition from Definition 6.1. Each of the four resulting composite actors is then implemented on an embedded MicroBlaze processor from Xilinx. The source (s) and the sink (d) are implemented as hardware modules.

Comparing the dynamic round-robin scheduler with the QSS resulting from the clustering algorithm, the latter improves the throughput by 83% and the latency by 76%, where the latency is the time required to completely process one 8×8 block. The underlying reason for the latency improvement can be explained particularly well by means of the cluster running on MicroBlaze 1. Here, actor a1 writes eight values per invocation, whereas the other actors only consume one token each. The resulting QSS consequently executes actor a1 once, followed by eight invocations of the other cluster actors. For the round-robin scheduler, however, each actor has the same priority, and thus actor a1 already starts to process the second block while the first block is not yet finished. As a consequence, two blocks must share the processing power, which results in a poor latency.

The increase in throughput can be explained by two facts. First of all, the dynamic round-robin scheduler polls each actor individually to determine whether it can fire or not. Due to the unequal numbers of invocations, however, several pollings fail, wasting precious computational power. Additionally, the dynamic scheduler has to check the fill level of each FIFO connecting two actors, even if both of them are mapped to the same processor. The QSS, on the other hand, only checks the hardware input and output FIFOs. As soon as there are enough input tokens, a sequence of actor firings determined at compile time is executed without checking the internal FIFOs.
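In contrast to the round-robin polling loop sketched above, the quasi-static schedule only guards each cluster FSM transition by its activation pattern on the hardware input and output FIFOs and then replays the precomputed firing sequence. A minimal Python sketch of this execution loop, again with our own names and data layout rather than the report's implementation, looks as follows:

def run_qss(fsm, q0, input_fill, output_space, fire_sequence, cycles):
    """Execute a quasi-static schedule for `cycles` cluster FSM transitions.

    fsm           -- dict: state -> list of (needed_in, needed_out, sequence, next_state)
                     where needed_in maps input ports to required token counts,
                     needed_out maps output ports to required free space, and
                     sequence is the statically computed list of actor firings
    q0            -- initial cluster FSM state
    input_fill    -- callable(port) -> tokens available in the hardware input FIFO
    output_space  -- callable(port) -> free space in the hardware output FIFO
    fire_sequence -- callable(sequence); fires the actors without any FIFO checks
    """
    q = q0
    for _ in range(cycles):
        for needed_in, needed_out, sequence, q_next in fsm[q]:
            # Only the hardware boundary FIFOs are inspected at run time; the
            # FIFOs internal to the cluster are never checked.
            if (all(input_fill(p) >= k for p, k in needed_in.items()) and
                    all(output_space(p) >= k for p, k in needed_out.items())):
                fire_sequence(sequence)   # statically scheduled firing sequence
                q = q_next
                break
        else:
            break                         # no transition enabled, stop
    return q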

References

[BBHL95] Shuvra S. Bhattacharyya, Joseph T. Buck, Soonhoi Ha, and Edward A. Lee. Generating Compact Code from Dataflow Specifications of Multirate Signal Processing Algorithms. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 42(3):138–150, March 1995.

[BELP96] Greet Bilsen, Marc Engels, Rudy Lauwereins, and Jean Peperstraete. Cyclo-Static Dataflow. IEEE Transactions on Signal Processing, 44(2):397–408, February 1996.

[ET05] Stephen A. Edwards and Olivier Tardieu. SHIM: A Deterministic Model for Heterogeneous Embedded Systems. In Proceedings of EMSOFT, pages 264–272, 2005.

[FHT05] Joachim Falk, Christian Haubelt, and Jürgen Teich. Syntax and Execution Behavior of SysteMoC. Technical Report 02-2005, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Am Weichselgarten 3, D-91058 Erlangen, Germany, December 2005.

[FHT06] Joachim Falk, Christian Haubelt, and Jürgen Teich. Efficient Representation and Simulation of Model-Based Designs in SystemC. In Proc. FDL'06, Forum on Design Languages 2006, pages 129–134, Darmstadt, Germany, September 2006.

[HB07] C. Hsu and S. S. Bhattacharyya. Cycle-Breaking Techniques for Scheduling Synchronous Dataflow Graphs. Technical Report UMIACS-TR-2007-12, Institute for Advanced Computer Studies, University of Maryland at College Park, February 2007.

[HFK+07] Christian Haubelt, Joachim Falk, Joachim Keinert, Thomas Schlichter, Martin Streubühr, Andreas Deyhle, Andreas Hadert, and Jürgen Teich. A SystemC-based Design Methodology for Digital Signal Processing Systems. EURASIP Journal on Embedded Systems, Special Issue on Embedded Digital Signal Processing Systems, 2007:Article ID 47580, 2007.

[JBP06] Ahmed A. Jerraya, Aimen Bouchhim, and Frédéric Pétrot. Programming Models and HW-SW Interfaces Abstraction for Multi-Processor SoC. In Proceedings of DAC, pages 280–285, 2006.

[Kah74] Gilles Kahn. The Semantics of a Simple Language for Parallel Programming. In IFIP Congress, pages 471–475, 1974.

[KKO+06] Tero Kangas, Petri Kukkala, Heikki Orsila, Erno Salminen, Marko Hännikäinen, Timo D. Hämäläinen, Jouni Riihimäki, and Kimmo Kuusilinna. UML-Based Multiprocessor SoC Design Framework. ACM Transactions on Embedded Computing Systems, 5(2):281–320, May 2006.

[Lee97] Edward A. Lee. A Denotational Semantics for Dataflow with Firing. Technical report, EECS, University of California, Berkeley, CA, USA 94720, 1997.

[LM87] Edward A. Lee and David G. Messerschmitt. Synchronous Data Flow. Proceedings of the IEEE, 75(9):1235–1245, September 1987.


[LSV98] Edward A. Lee and Alberto Sangiovanni-Vincentelli. A Framework for Comparing Models of Computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(12):1217–1229, December 1998.

[TNS+07] Mark Thompson, Hristo Nikolov, Todor Stefanov, Andy Pimentel, Cagkan Erbas, Simon Polstra, and Ed Deprettere. A Framework for Rapid System-level Exploration, Synthesis, and Programming of Multimedia MP-SoCs. In Proceedings of CODES-ISSS'07, pages 9–14, 2007.
