
Extending Force-directed Scheduling with Explicit Parallel and Timed Constructs for High-level Synthesis

Rohit Sinha
Electrical and Computer Engineering
University of Waterloo, Waterloo, Canada
[email protected]

Hiren D. Patel
Electrical and Computer Engineering
University of Waterloo, Waterloo, Canada
[email protected]

Abstract—This work extends force-directed scheduling (FDS) to support specification constructs that express parallelism and timing behaviours. We select the FDS algorithm because it maximizes the amount of resource sharing, and it naturally supports constructs for parallelism. However, timed constructs are not supported. As a result, we propose timed FDS (TFDS) that optimizes over parallel, timed and untimed constructs. In doing so, we make the following four contributions: 1) we extend the definition of control data flow graphs (CDFGs) to define timed CDFGs (TCDFGs), 2) we define a scheduling algorithm for timed constructs called TIME, 3) we extend the definition of mobility used in FDS, and 4) we present optimizations for a composition of parallel, timed and untimed constructs to better aid FDS. We implement our extensions in a high-level synthesis framework based on the abstract state machine formalism, and we generate synthesizable VHDL. We experiment with several examples such as FIR, edge detector, and a differential equation solver, and target them onto an Altera DE2 FPGA. Some of these experiments show improvements of up to 52% in circuit area when compared to their unoptimized counterparts.

Index Terms—High-Level Synthesis, Force-Directed Scheduling, Timing Semantics.

I. INTRODUCTION

High-level synthesis (HLS) frameworks generate efficient hardware designs from high-level algorithmic specifications. The primary advantage of HLS frameworks is that they improve design productivity: non-experts can quickly design hardware without any prior hardware design experience, and even experienced hardware designers benefit from the increased productivity. However, HLS frameworks are also under scrutiny [1], [2]. The main criticism is that designers cannot make area and latency trade-offs at the algorithmic level. For example, traditional algorithmic specifications use sequential programming languages such as C and C++. However, these languages do not natively provide any mechanisms to make the trade-offs between area and latency, which we find are necessary for designing efficient hardware. Therefore, designers are left at the mercy of HLS frameworks to decide the performance of the resulting hardware circuit.

While most HLS frameworks do not address this issue [1], [3], Handel-C [4], SynASM [2], Kiwi [5] and Forte's Cynthesizer [6] are a select few that do. They extend the algorithmic specification language with constructs that allow expressing parallelism or timing behaviours. Parallel constructs enable designers to explicitly expose parallelism that a compiler would be unable to extract automatically from a sequential specification. Timed constructs allow designers to specify their timing requirements in the specification, which is essential for building components of a larger system that must have strict timing behaviours. Designers use a combination of parallel and timed constructs to make area and latency trade-offs.

Handel-C extends C with a par construct, and a timing model where every assignment takes one clock cycle. The underlying model of computation is communicating sequential processes [7]. SynASM uses the abstract state machine (ASM) formal language for its specification [8]. ASMs natively support explicit parallelism through the par construct, but a timing model is absent. Therefore, the authors of SynASM provide timed construct extensions that allow designers to control the timing behaviour of the specification [2]. Kiwi [5] uses C#'s threading concurrency semantics to introduce parallelism in the synthesized hardware. However, Kiwi does not provide control over timing behaviour. Cynthesizer [6] differs from the others because it uses the discrete-event semantics of SystemC to introduce concurrency. It also provides language extensions to SystemC for timing control, and SynASM borrows its timed constructs from these extensions. Note that while these frameworks provide some support for explicit parallelism and/or control over the timing behaviour of the design, they do not examine the impact of these constructs on scheduling algorithms. We contend that by incorporating parallel and timed constructs into scheduling, an HLS framework can generate efficient circuits that also meet the designer's constraints.

This brings us to the focus of our work: incorporating explicit parallel and timed constructs into an HLS scheduling algorithm. In particular, we select the force-directed scheduling (FDS) [9] algorithm because it minimizes the number of functional and storage units in the design by sharing resources amongst operations across different clock cycles. Note that FDS does not exploit resource sharing across parallel


computations specified by the parallel constructs, nor does it define a method to schedule timed constructs. To accommodate this, our work presents appropriate extensions to FDS, which we call timed FDS (TFDS). Our work has the following main contributions: 1) we extend CDFG representations to support parallel and timed specifications, which we call timed CDFGs (TCDFGs), 2) we define an algorithm TIME that computes clock cycle values for timed constructs, 3) we redefine the mobility function in FDS to support timed constructs, and we define timed FDS (TFDS), and 4) we present two optimizations, one for the parallel composition of timed constructs and one for the parallel composition of a mixture of timed and untimed constructs, that aid TFDS in further resource savings. Our experiments show improvements of up to 52% in design area.

II. RELATED WORK

Timed constructs impose specific timing constraints on hardware designs. For that reason, we choose a time-constrained scheduling algorithm, which minimizes the functional units in a design given a fixed number of clock cycles.

Integer Linear Programming (ILP) formulates the scheduling problem as an optimization problem to minimize functional units. Although it solves for the global minimum, ILP does not scale with the problem size.

FDS is a heuristic method for time-constrained scheduling. It minimizes functional units in a design by efficiently sharing them amongst operations in different clock cycles [10]. The advantages are: 1) FDS extracts parallelism amongst sequential operations within a CDFG, 2) FDS can be extended to support timing constraints via the concept of mobility, and 3) FDS is computationally inexpensive compared to ILP, with very little loss in quality of results. For these reasons, we select the FDS scheduling algorithm.
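For reference, the following is a minimal Python sketch of the standard FDS self-force computation from [9] that our extensions build on; the function name and the dict-based distribution graph are our own illustrative simplifications. An operation with mobility [a, b] contributes a uniform probability 1/(b − a + 1) to the distribution graph DG of its operation type, and the self force of pinning it to a cycle weighs DG against the resulting change in probabilities.

# Illustrative sketch of the FDS self-force computation [9].
# dg maps each clock cycle to the distribution-graph value for one
# operation type; frame is the operation's inclusive mobility range.
def self_force(dg, frame):
    a, b = frame
    p = 1.0 / (b - a + 1)          # uniform probability per cycle
    forces = {}
    for j in range(a, b + 1):
        # Pinning the operation to cycle j raises its probability at j
        # to 1 (a change of 1 - p) and lowers it elsewhere by p.
        forces[j] = dg[j] * (1 - p) - sum(dg[i] * p
                                          for i in range(a, b + 1) if i != j)
    return forces

# An operation with mobility [0, 2] is pushed towards the least
# congested cycle (here, cycle 2 has the lowest force).
print(self_force({0: 2.0, 1: 1.5, 2: 0.5}, (0, 2)))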

III. BACKGROUND: SYNTHESIS FROM ASMS

We implement our scheduling back-end as an extension to the SynASM [2] framework. The constructs that we describe in this article are the seqblock, tseqblock, nseqblock, and par block constructs. These describe sequential and parallel computations, with tseqblock and nseqblock being the timed constructs. Note that any of these constructs can nest multiple other constructs.

1) Parallel Block Construct: The constructs nested within a par block execute in parallel. SynASM generates parallel hardware for the nested constructs within the par block. The Updates set of the par block is computed as the union of the Updates sets of the nested constructs.

2) Sequential Block Construct: The constructs nested within a seqblock execute in program order. The Updates set is a sequential composition of the Updates of the nested constructs. For the design in ASM Spec 1 (Table I), the sequential block yields {(x, 2), (y, 1)} as the Updates set. Notice that a seqblock has no well-defined timing model: the operations can be scheduled in any clock cycle as long as the updates accumulate in program order.
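As a quick illustration, the two composition rules can be phrased over sets of (location, value) pairs. The following Python sketch is ours, not SynASM code, and seq_compose and par_compose are hypothetical helper names:

# Sketch of Updates-set composition. An Updates set is a set of
# (location, value) pairs.
def seq_compose(blocks):
    # Sequential composition: a later update to a location overrides
    # an earlier one, preserving program order.
    acc = {}
    for updates in blocks:
        for loc, val in updates:
            acc[loc] = val
    return set(acc.items())

def par_compose(blocks):
    # Parallel composition: the union of the nested Updates sets.
    return set().union(*blocks)

# ASM Spec 1 (Table I below): seqblock y := 1; x := 1; x := 2
print(seq_compose([{("y", 1)}, {("x", 1)}, {("x", 2)}]))
# -> {('x', 2), ('y', 1)}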

ASM Spec 1         ASM Spec 2           ASM Spec 3
seqblock           par                  par
  y := 1             tseqblock            tseqblock
  x := 1               w := k + 1           w := k + 1
  x := 2               x := k ∗ 2           x := k ∗ 2
endseqblock          endtseqblock         endtseqblock
                     tseqblock            seqblock
                       y := k + 3           y := k + 3
                       z := k ∗ 4           z := k ∗ 4
                     endtseqblock         endseqblock

TABLE I: ASM Specifications

3) Timed Sequential Block Construct: A tseqblock construct is operationally the same as a seqblock, with the exception that tseqblock incorporates a strict timing model: each update takes one clock cycle. Since all constructs in SynASM are eventually composed of update statements, the designer can constrain individual state updates to specific clock cycles. If we replace seqblock with tseqblock in ASM Spec 1, we get an Updates set of {(y, 1, 0), (x, 1, 1), (x, 2, 2)}. Note that the tuple for each update now contains a clock cycle value at the end. For instance, (y, 1, 0) means that the update y := 1 happens in clock cycle 0. SynASM generates a finite state machine that schedules the updates to consecutive clock cycles.
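Continuing the sketch above with a hypothetical helper (not SynASM's implementation), a tseqblock simply stamps consecutive clock cycles onto its updates:

# Sketch: a tseqblock extends each (location, value) pair with the
# clock cycle in which it commits, one update per cycle.
def tseq_compose(updates):
    return {(loc, val, cycle) for cycle, (loc, val) in enumerate(updates)}

print(tseq_compose([("y", 1), ("x", 1), ("x", 2)]))
# -> {('y', 1, 0), ('x', 1, 1), ('x', 2, 2)}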

4) N-Timed Sequential Block Construct: An nseqblock presents a relaxed timing model compared to the tseqblock. An nseqblock accepts an additional parameter n that guarantees that the constituent operations occur within n clock cycles. Therefore, n specifies a hard deadline on the entire block's timing behaviour; the final schedule is driven by optimizations.

IV. SYNTHESIS FRAMEWORK DESIGN FLOW

Figure 1 shows the design flow of our scheduling back-end. The design flow starts with the ASM specification stage, where the designer provides a behavioural specification of the intended design. The designer can use the parallel and timed constructs to explore space and time trade-offs, and guide the synthesis process to generate efficient hardware. While we specify our designs directly in ASMs, it is possible to automatically translate C-based specifications into ASMs [2].

SynASM transforms the ASM specification into a control data flow graph (CDFG), which we extend beyond traditional CDFG structures to support a combination of parallel and timed constructs. The next stages perform scheduling via the FDS algorithm, followed by allocation. The scheduling algorithm partitions the CDFG into subgraphs, and determines the starting clock cycle of the operations in the subgraphs. The allocation algorithm is divided into two stages: unit selection and unit binding. The final stage generates synthesizable VHDL for the entire design, which we then target onto the Altera DE2 FPGA platform.

V. EXTENDING CDFGS TO SUPPORT PARALLEL AND TIMED SPECIFICATIONS

We support explicit parallelism using the ASM par block construct, and timed constructs with tseqblock and nseqblock. To capture a design that uses these constructs, we construct a CDFG.


Fig. 1: Design flow. Stages: design specification in ASMs; construction of the internal IR (TCDFG); scheduling (ASAP, ALAP, FDS, TFDS); allocation (unit selection); binding (unit binding); synthesizable RTL.

However, conventional CDFGs [11] do not provide a method for supporting parallel and timed constructs. This is because traditional HLS methods automatically extract parallelism from sequential specifications, and there is no ability to specify explicitly parallel and timed behaviours. As a result, we extend CDFGs with special nodes and edge labels that enable us to use the extended CDFG for scheduling parallel and timed computations. We term these extended CDFGs timed CDFGs (TCDFGs). TCDFGs contain six new node types: ParNode, TseqNode, NseqNode, and their corresponding ending nodes. We also extend the set of edge types TE with two types: d for data dependency edges, and c for signifying that the next node occurs in the next clock cycle. Therefore, TE = {d, c}.
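To make the representation concrete, the following minimal Python sketch captures the new node and edge types; the class and field names are our own illustrative choices, not SynASM's internal IR:

# Sketch of a TCDFG representation with the new node and edge types.
from dataclasses import dataclass, field
from typing import List, Tuple

NODE_TYPES = {"ParNode", "TseqNode", "NseqNode",
              "EndParNode", "EndTseqNode", "EndNseqNode", "Operation"}
EDGE_TYPES = {"d", "c"}   # d: data dependency, c: next clock cycle

@dataclass
class TCDFGNode:
    kind: str             # one of NODE_TYPES
    label: str = ""       # operation text for Operation nodes
    n: int = 0            # deadline parameter, used by NseqNode only

@dataclass
class TCDFG:
    nodes: List[TCDFGNode] = field(default_factory=list)
    # edges are (source index, target index, edge type) triples
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def add_edge(self, src: int, dst: int, kind: str) -> None:
        assert kind in EDGE_TYPES
        self.edges.append((src, dst, kind))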

VI. SCHEDULING OPERATIONS IN TIMED CONSTRUCTS

A. Extending FDS to Support Timed Constructs

In order to use FDS scheduling for timed constructs, we make two extensions to traditional FDS: 1) we compute an additional schedule specific to timed constructs, and 2) we extend the definition of mobility.

1) Scheduling Timed Constructs: We call this the TIME schedule. TIME maps all operations within tseqblock and nseqblock constructs to clock cycles, as shown in Algorithm 1. The arguments to TIME are the TCDFG G, the starting clock cycle (usually 0), and the empty set S denoting the schedule. At completion of TIME, S is populated with tuples that map operations to clock cycles. The schedule S of ASM Spec 2 is {(w := k + 1, 0), (x := k ∗ 2, 1), (y := k + 3, 0), (z := k ∗ 4, 1)}.

2) Extending the Definition of Mobility: In FDS, increasing an operation's mobility creates more opportunities for resource-sharing optimizations. Therefore, we use an extended definition of mobility in our scheduling. An operation's mobility depends on the type of the enclosing block. For instance, w := k + 1 in ASM Spec 2 is enclosed by a tseqblock, which means it has the mobility defined for TseqNode. Timed force-directed scheduling (TFDS) is FDS with mobility being the mob function from Definition 1.

Definition 1. The mobility of an operation is a function mob : V → Z+ × Z+ where

mob(vi) =
  (ρASAP(vi), ρALAP(vi))                                if type(vi) = SeqNode
  (ρASAP(vi), ρTIME(vi))                                if type(vi) = TseqNode
  (max(ρASAP(vi), ρTIME(top(vi))), ρTIME(top(vi)) + n)  if type(vi) = NseqNode

where vi ∈ Vops and top(vi) returns the first operation node within the enclosing nseqblock. ρ denotes the clock cycle for a particular node, and we subscript ρ with the respective scheduling algorithm. ASAP stands for as soon as possible, and ALAP for as late as possible.
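Definition 1 can be transcribed directly into code. The following Python sketch assumes the ρ schedules have already been computed and stored in dicts, and that the NseqNode deadline is measured from the block's first operation, per Section III; accessor and field names are ours:

# Sketch of the extended mobility function of Definition 1.
def mobility(v, asap, alap, time, top, n=0):
    # asap, alap, time: dicts mapping an operation node to its clock
    # cycle under the respective schedule; top(v) is the first
    # operation node of v's enclosing nseqblock.
    if v.block_type == "SeqNode":       # untimed: the usual FDS range
        return (asap[v], alap[v])
    if v.block_type == "TseqNode":      # upper bound pinned by TIME
        return (asap[v], time[v])
    if v.block_type == "NseqNode":      # must finish within n cycles
        start = time[top(v)]            # of the block's first operation
        return (max(asap[v], start), start + n)
    raise ValueError("unknown enclosing block type")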

Algorithm 1: TIME(G, t, S)
  Let duration ← 0
  Let subNodes ← [] be an empty list
  Let vi be the root node of G
  if type(vi) = ParNode then
    foreach vj ∈ {vk : (vi, vk) ∈ E} do
      duration ← max(duration, TIME(vj, t, S))
    end
    return duration
  end
  if type(vi) = TseqNode ∨ type(vi) = NseqNode then
    subNodes ← getSubNodeList(vi)
    foreach vj ∈ subNodes do
      duration ← duration + TIME(vj, t + duration, S)
    end
    return duration
  end
  if type(vi) = Operation then
    S ← S ∪ {(vi, t)}
    return 1
  end
  return duration
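The following is a runnable Python sketch of Algorithm 1 over a simplified TCDFG (the node class and the time_schedule name are illustrative); it reproduces the TIME schedule of ASM Spec 2 given above:

# Sketch of Algorithm 1 over a simplified TCDFG.
class Node:
    def __init__(self, kind, label=None, children=None):
        self.kind = kind                  # "ParNode", "TseqNode",
        self.label = label                # "NseqNode" or "Operation"
        self.children = children or []

def time_schedule(v, t, S):
    # Map every operation under v to a clock cycle, starting at t;
    # populate S with (operation, cycle) pairs and return the duration.
    if v.kind == "ParNode":
        # Parallel branches all start at t; the block lasts as long
        # as its longest branch.
        return max((time_schedule(c, t, S) for c in v.children), default=0)
    if v.kind in ("TseqNode", "NseqNode"):
        duration = 0
        for c in v.children:              # sub-nodes run back-to-back
            duration += time_schedule(c, t + duration, S)
        return duration
    if v.kind == "Operation":
        S.append((v.label, t))            # each update takes one cycle
        return 1
    return 0

# ASM Spec 2: two tseqblocks composed in parallel.
spec2 = Node("ParNode", children=[
    Node("TseqNode", children=[Node("Operation", "w := k + 1"),
                               Node("Operation", "x := k * 2")]),
    Node("TseqNode", children=[Node("Operation", "y := k + 3"),
                               Node("Operation", "z := k * 4")]),
])
S = []
time_schedule(spec2, 0, S)
print(S)   # [('w := k + 1', 0), ('x := k * 2', 1),
           #  ('y := k + 3', 0), ('z := k * 4', 1)]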

B. TFDS for Parallel Composition of Timed Blocks

SynASM supports a strict timing model with tseqblock constructs, and we utilize TFDS to optimize their synthesis. However, to better assist TFDS, we perform an optimization that increases the mobility of the operations in tseqblocks. In particular, we perform computations earlier and temporarily store the result in a register. We then update the state from the register at the clock cycle specified by the TIME schedule. This preserves the semantics of tseqblock constructs, and it increases mobility.
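A sketch of this rewrite on a single timed update is given below; the string-based representation and the function name are illustrative only:

# Sketch of the Section VI-B rewrite: split a timed update into an
# early computation into a temporary register plus a register-to-state
# copy pinned at the cycle fixed by the TIME schedule.
def widen_timed_update(update):
    op, cycle = update                    # e.g. ("z := k * 4", 1)
    lhs, rhs = (s.strip() for s in op.split(":="))
    tmp = lhs + "'"                       # temporary register, e.g. z'
    compute = (tmp + " := " + rhs, None)  # mobility widens to [ASAP, cycle]
    commit = (lhs + " := " + tmp, cycle)  # stays pinned at the TIME cycle
    return [compute, commit]

print(widen_timed_update(("z := k * 4", 1)))
# -> [("z' := k * 4", None), ("z := z'", 1)]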

We illustrate this optimization with the example in Figure 2 for ASM Spec 2. Notice that we store the computed value in a temporary register, and update z at the clock cycle denoted by the tseqblock. An additional node symbolizes the assignment (z := z′) to state z from the temporary register z′. The optimized schedule in Figure 2(b) uses one fewer multiplier in exchange for a register. We reflect this optimization in Definition 1, where the lower bound of the mobility of tseqblock constructs is determined by the ASAP schedule instead of the TIME schedule.


Fig. 2: Optimization for parallel timed blocks. (a) Unoptimized TCDFG; (b) Optimized TCDFG.

C. TFDS for Parallel Composition of Timed and Untimed Blocks

We schedule the composition of seqblock and tseqblock constructs such that we share resources amongst the enclosed operations. Since the seqblock construct has no timing model, we are able to explore area versus latency trade-offs. Our approach has three steps: 1) schedule timed operations within tseqblock constructs using TFDS, 2) schedule operations within seqblock constructs that can share functional units with one of the timed operations, and 3) delay the remaining operations within the seqblock such that they occur after all tseqblock operations have finished.
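The sketch below illustrates steps 2 and 3 greedily, assuming a single instance of each functional-unit type and ignoring data dependencies; it is a simplification of TFDS, with our own function and variable names:

# Sketch of co-scheduling untimed operations around a fixed timed
# schedule (steps 2 and 3 of Section VI-C).
def co_schedule(timed_ops, untimed_ops):
    # timed_ops: (functional-unit type, cycle) pairs fixed by TFDS.
    # untimed_ops: functional-unit types of the seqblock operations.
    used = set(timed_ops)
    horizon = max((c for _, c in timed_ops), default=-1) + 1
    schedule = []
    for fu in untimed_ops:
        # Step 2: slot into a cycle where the unit is idle.
        slot = next((c for c in range(horizon) if (fu, c) not in used), None)
        if slot is None:
            # Step 3: otherwise delay past the timed blocks.
            slot, horizon = horizon, horizon + 1
        used.add((fu, slot))
        schedule.append((fu, slot))
    return schedule

# ASM Spec 3: the tseqblock fixes (+ @0, * @1); the seqblock's + and *
# then reuse those units in the cycles where they are idle.
print(co_schedule([("+", 0), ("*", 1)], ["+", "*"]))
# -> [('+', 1), ('*', 0)]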

We demonstrate this optimization on ASM Spec 3. The design in Figure 3b uses 1 adder and 1 multiplier and has a latency of 4 clock cycles. Contrast this with another valid schedule in Figure 3a, which has identical resource utilization but a latency of 3 clock cycles. The hardware in Figure 3b is the result of a scheduling algorithm that does not optimize for latency. Our algorithm, on the other hand, generates a schedule with the minimum possible resources, and then minimizes latency without increasing resources.

Fig. 3: Optimization for parallel timed and untimed blocks. (a) Latency: 3 cycles; (b) Latency: 4 cycles.

VII. RESULTS

Table II shows our experiments synthesized using the Altera Quartus II tools for the DE2 FPGA platform. The strictly constrained version of each design uses tseqblock constructs, and the loosely constrained version uses nseqblock and seqblock constructs. The optimized designs undergo the optimizations presented in Sections VI-B and VI-C. For each example, the loosely constrained version enables more resource sharing, and hence lower area, at the expense of higher latency.

Strictly Constrained     Unoptimized            Optimized              Gain
Design                   LUTs  FFs   MHz        LUTs  FFs   MHz        (LUTs)
Parallel FIR             248   102   101.3      248   102   101.3      0%
Sequential FIR           271   102   94.9       183   102   91.4       32%
Kalman Filter            310   783   126.4      266   802   118.3      14%
Diff. Eq. Solver         119   62    139.6      86    62    135.2      27%
Edge Detector            167   1021  137.8      138   1033  133.5      17%

Loosely Constrained      Unoptimized            Optimized              Gain
Design                   LUTs  FFs   MHz        LUTs  FFs   MHz        (LUTs)
Parallel FIR             248   104   128.1      180   116   121.4      27%
Sequential FIR           271   102   94.9       185   102   91.2       32%
Kalman Filter            310   784   126.1      178   811   113.2      43%
Diff. Eq. Solver         119   62    138.7      57    62    129.0      52%
Edge Detector            167   1023  139.8      97    1037  127.6      41%

TABLE II: Synthesis results of example designs

VIII. CONCLUSION

This paper presents extensions to FDS that incorporate explicit parallel constructs and timed sequential constructs. We augment the definition of CDFGs, redefine mobility, and present two optimizations that utilize resources across parallel computations to assist in making better scheduling decisions. We call the resulting scheduling algorithm timed FDS. Our future work involves exploring an allocation algorithm that leverages timed constructs to perform pipeline optimizations.

REFERENCES

[1] S. Edwards, "The challenges of synthesizing hardware from C-like languages," IEEE Design and Test of Computers, vol. 23, no. 5, pp. 375–386, 2006.

[2] R. Sinha and H. D. Patel, "Abstract state machines as an intermediate representation for high-level synthesis," in Proceedings of the IEEE Design, Automation and Test in Europe, March 2011, pp. 1406–1411.

[3] G. Martin and G. Smith, "High-Level Synthesis: Past, Present, and Future," IEEE Design & Test of Computers, pp. 18–25, 2009.

[4] Mentor Graphics, "Handel-C high-level synthesis," http://www.mentor.com/products/fpga/handel-c/.

[5] S. Singh and D. Greaves, "Kiwi: Synthesis of FPGA circuits from parallel programs," in Proceedings of the 16th International Symposium on Field-Programmable Custom Computing Machines. IEEE Computer Society, 2008, pp. 3–12.

[6] P. Coussy and A. Morawiec, High-Level Synthesis: From Algorithm to Digital Circuit. Springer Publishing Company, 2008.

[7] A. Butterfield, "A denotational semantics for Handel-C," in Formal Methods and Hybrid Real-Time Systems, C. B. Jones, Z. Liu, and J. Woodcock, Eds. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 45–66.

[8] E. Börger and R. Stärk, Abstract State Machines: A Method for High-Level System Design and Analysis. Springer-Verlag, 2003.

[9] P. Paulin and J. Knight, "Force-Directed Scheduling for the Behavioral Synthesis of ASIC's," IEEE Transactions on Computer-Aided Design, vol. 8, no. 6, pp. 661–679, 1989.

[10] D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, 1992.

[11] R. Namballa, N. Ranganathan, and A. Ejnioui, "Control and data flow graph extraction for high-level synthesis," in IEEE Computer Society Annual Symposium on VLSI, 2004, p. 187.
