
PACE: A Dynamic Programming Algorithm for Hardware/Software Partitioning

Peter Voigt Knudsen and Jan Madsen
Department of Computer Science, Technical University of Denmark, DK-2800 Lyngby, Denmark

Abstract. This paper presents the PACE partitioning algorithm which is used in the LYCOS co-synthesis system for partitioning control/dataflow graphs into hardware and software parts. The algorithm is a dynamic programming algorithm which solves both the problem of minimizing system execution time with a hardware area constraint and the problem of minimizing hardware area with a system execution time constraint. The target architecture consists of a single microprocessor and a single hardware chip (ASIC, FPGA, etc.) which are connected by a communication channel. The algorithm incorporates a realistic communication model and thus attempts to minimize communication overhead. The time complexity of the algorithm is O(n² · A) and the space complexity is O(n · A), where A is the total area of the hardware chip and n the number of code fragments which may be placed in either hardware or software.

1 Introduction

The hardware/software partitioning of a system specification onto a target architecture consisting of a single CPU and a single ASIC has been investigated by a number of research groups [2, 5, 1, 7, 8, 11]. This target architecture is relevant in many areas where the performance requirements cannot be met by general-purpose microprocessors and where a complete ASIC solution is too costly. Such areas may be found in DSP design, construction of embedded systems, software execution acceleration, and hardware emulation and prototyping [10].

One of the major differences among partitioning approaches is in the way communication between hardware and software is taken into account during partitioning. Henkel, Ernst et al. [2, 1, 6] present a simulated annealing algorithm which moves chunks of software code (in the following called blocks) to hardware until timing constraints are met. The algorithm accounts for communication, and only variables which actually need to be transferred are taken into account, i.e., the possibility of local store is exploited. Gupta and De Micheli [5] present a partitioning approach


which starts from an all-hardware solution. Their algorithm takes communication into account and is able to reduce communication when neighboring vertices are placed together in either software or hardware. The system model presented by Jantsch et al. [7] ignores communication. They present a dynamic programming algorithm based on the Knapsack algorithm which solves the partitioning problem for the case where some blocks include other blocks and are therefore mutually exclusive. The algorithm has exponential memory requirements, which makes it impractical to use for large applications. To solve this problem they propose a pre-selection scheme which only selects blocks which induce a speedup greater than 10%. However, this pre-selection scheme may fail to produce good results as communication overhead is ignored. Kalavade and Lee [8] present a partitioning algorithm which does take communication into account by attributing a fixed communication time to each pair of blocks. This approach may overestimate the communication overhead as more variables than actually needed are transferred.

In this paper we present a dynamic programming algorithm called PACE [9] which solves the hardware/software partitioning problem taking communication overhead into account.

2 System Model

This section presents the system model used by the partitioning algorithm and describes how it is obtained from the functional specification.

2.1 The CDFG Format

The functional specification, which is currently described in VHDL or C, is internally represented as a control/data flow graph (CDFG) which can be defined as follows:

Definition 1 A CDFG is a set of nodes and directed edges (N, E) where an edge e_{i,j} = (n_i, n_j) from n_i ∈ N to n_j ∈ N, i ≠ j, indicates that n_j depends on n_i because of data dependencies and/or control dependencies.

Definition 2 A node n_i ∈ N is recursively defined as

n_i     = DFG | Cond | Loop | FU | Wait
Cond    = (Branch1, Branch2)
Loop    = (Test, Body)
Branch1 = CDFG
Branch2 = CDFG
Test    = CDFG
Body    = CDFG
FU      = CDFG


where a DFG is a pure dataflow graph without control structures, FU represents a function call, Wait is used for synchronization with the environment, Branch1 and Branch2 are the CDFGs to be executed in the "true" and "false" branch case of a conditional Cond respectively, and Test and Body are the test and body CDFGs of a Loop.
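
To make the recursive definition concrete, the grammar can be transcribed almost mechanically into code. The following Python sketch is ours, not part of LYCOS; class and field names are illustrative only.

    from dataclasses import dataclass, field
    from typing import List, Tuple, Union

    @dataclass
    class DFG:                 # pure dataflow graph, no control structures
        operations: list = field(default_factory=list)

    @dataclass
    class Cond:                # conditional with "true" and "false" branch CDFGs
        branch1: "CDFG"
        branch2: "CDFG"

    @dataclass
    class Loop:                # loop with test and body CDFGs
        test: "CDFG"
        body: "CDFG"

    @dataclass
    class FU:                  # function call, expanding to a CDFG
        body: "CDFG"

    @dataclass
    class Wait:                # synchronization with the environment
        pass

    Node = Union[DFG, Cond, Loop, FU, Wait]

    @dataclass
    class CDFG:                # nodes plus directed dependence edges (n_i, n_j)
        nodes: List[Node]
        edges: List[Tuple[int, int]]   # indices into nodes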

2.2 Derivation of Basic Scheduling Blocks

In order to be able to partition the CDFG, it must first be divided into fragments, in the following called Basic Scheduling Blocks or BSBs. For each node n_i in the CDFG, a BSB is created as shown in figure 1.

[Figure 1: the CDFG hierarchy (Wait, DFG, Loop with Test and Body, Cond with Branch1 and Branch2) and the corresponding, identically nested BSB hierarchy.]

Figure 1: The BSB hierarchy and its correspondence with the CDFG.

Each BSB can have child BSBs, which are shown indented under the BSB. In this way a BSB hierarchy which reflects the hierarchy of the CDFG is obtained. A BSB stores information which is used by the partitioning algorithm to determine whether it should be placed in hardware or software:

Definition 3 A BSB, B_i, is defined as a six-tuple

B_i = < a_{s,i}, t_{s,i}, a_{h,i}, t_{h,i}, r_i, w_i >

where a_{s,i} and t_{s,i} are the area and execution time of B_i when placed in software, a_{h,i} and t_{h,i} are the area and execution time of B_i when placed in hardware, and r_i and w_i contain the read-set and write-set variables of B_i.
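
The six-tuple maps directly onto a record type. A minimal sketch in Python (our naming, not the LYCOS source):

    from dataclasses import dataclass
    from typing import FrozenSet

    @dataclass(frozen=True)
    class BSB:
        a_s: int                 # software area a_{s,i}
        t_s: int                 # software execution time t_{s,i}
        a_h: int                 # hardware area a_{h,i}
        t_h: int                 # hardware execution time t_{h,i}
        r: FrozenSet[str]        # read-set variables r_i
        w: FrozenSet[str]        # write-set variables w_i

    # A system specification (Definition 4 below) is then simply an
    # ordered list of leaf BSBs: S = [B_1, B_2, ..., B_n].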

In order to be able to control the number and sizes of the BSBs which are considered by the partitioning algorithm, parent BSBs can be collapsed so as to appear as single BSBs instead of the child BSBs they are composed of. This is illustrated in figure 2.

[Figure 2: the original hierarchy with seven leaf BSBs; the Cond BSB collapsed, giving six leaf BSBs; and the Test and Body BSBs collapsed, giving five leaf BSBs.]

Figure 2: Adjusting BSB granularity by hierarchical collapsing.

The partitioning algorithm only considers leaf BSBs, which are BSBs that have no children. The leaf BSBs are marked with a dot in the figure. When BSBs are collapsed, the number of leaf BSBs decreases. As the execution time of the PACE algorithm depends quadratically on the number of BSBs, it is relevant to be able to control the number of BSBs in this way.

As all leaf BSBs together make up the total system functionality, we can now define the system specification in terms of leaf BSBs:

Definition 4 A system specification S is described as an ordered list of n leaf BSBs, i.e. {B_1, B_2, ..., B_n}, where B_i denotes BSB number i.

In order to estimate performance, it is necessary to know how many times each BSB is executed for typical input data. This information is obtained from profiling. It is convenient to define two global functions which return profiling information for individual BSBs and individual variables:

Definition 5 The function pc : B_i ∈ S → Nat returns the number of times B_i has been executed in a profiling run¹.

Definition 6 Let V denote the set of all variables from the read-sets and write-sets of the BSBs in S. Then, for a given variable v from the read-set or write-set of B_i, the function ac : v ∈ V → Nat returns the number of times the variable is accessed by B_i²:

\[ (v \in r_i) \lor (v \in w_i) \Rightarrow ac(v) = pc(B_i) \]

¹ "pc" is short for "profiling count". ² "ac" is short for "access count".


3 Problem formulation

The partitioning model which the PACE algorithm uses is illustrated in figure 3.

Figure 3: Partitioning model used by PACE: a) Example of actual data-dependencies between hardware and software BSBs, b) How data-dependencies between adjacent hardware BSBs and software BSBs are interpreted in the model.

In this model, hardware BSBs and software BSBs cannot execute in parallel. Furthermore, adjacent hardware BSBs are assumed to be able to communicate the read/write variables they have in common directly between them, without involving the software side. As illustrated in the figure, a given hardware/software partition can be thought of as composed of sequences of adjacent BSBs which only communicate their effective read- and write-sets from/to the software side. The following definitions formalize these assumptions.

Definition 7 S_{i,j}, j ≥ i, denotes the sequence of BSBs {B_i, B_{i+1}, ..., B_j}.

Definition 8 The effective read-set and the effective write-set of a sequence S_{i,j} are denoted r_{i,j} and w_{i,j} respectively and are defined as

\[ r_{i,j} = (r_i \cup r_{i+1} \cup \dots \cup r_j) \setminus (w_i \cup w_{i+1} \cup \dots \cup w_j) \]
\[ w_{i,j} = (w_i \cup w_{i+1} \cup \dots \cup w_j) \setminus (r_i \cup r_{i+1} \cup \dots \cup r_j) \]
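
Expressed over the BSB record sketched in section 2.2, the effective sets are a direct transcription of the definition (0-based, inclusive indices; a hypothetical helper, not LYCOS code):

    def effective_sets(S, i, j):
        """Return (r_ij, w_ij) for the sequence B_i..B_j of Definition 8."""
        reads  = set().union(*(b.r for b in S[i:j + 1]))
        writes = set().union(*(b.w for b in S[i:j + 1]))
        return reads - writes, writes - reads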

Using these definitions and the BSB definitions given in section 2.2, we can now compute the speedup induced by moving a sequence of BSBs from software to hardware:

Definition 9 The total (possibly negative) speedup induced by moving a BSB sequence S_{i,j} to hardware is denoted s_{i,j} and is computed as

\[ s_{i,j} = \sum_{k=i}^{j} pc(B_k)\,(t_{s,k} - t_{h,k}) - \Bigl( \sum_{v \in r_{i,j}} ac(v)\, t_{s \to h} + \sum_{v \in w_{i,j}} ac(v)\, t_{h \to s} \Bigr) \]

where t_{s→h} and t_{h→s} denote the software-to-hardware and hardware-to-software communication times for a single variable, respectively.

Definition 10 The area penalty a_{i,j} of moving S_{i,j} to hardware is computed as the sum of the individual BSB hardware areas:

\[ a_{i,j} = \sum_{k=i}^{j} a_{h,k} \]
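
Both quantities are one-liners over the earlier sketches; pc and ac are the profiling functions of Definitions 5 and 6, and t_s2h, t_h2s are the per-variable transfer times (all names ours):

    def speedup(S, i, j, pc, ac, t_s2h, t_h2s):
        """Total (possibly negative) speedup s_ij of Definition 9."""
        r_eff, w_eff = effective_sets(S, i, j)
        gain = sum(pc(b) * (b.t_s - b.t_h) for b in S[i:j + 1])
        comm = (sum(ac(v) * t_s2h for v in r_eff) +
                sum(ac(v) * t_h2s for v in w_eff))
        return gain - comm

    def area(S, i, j):
        """Area penalty a_ij of Definition 10: sum of hardware areas."""
        return sum(b.a_h for b in S[i:j + 1])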

In section 4.2 we discuss how the effect of hardware sharing is taken into account. Note that in calculating the speedup and area of a sequence, it is not considered that hardware synthesis may synthesize the sequence as a whole, which would probably reduce both sequence area and execution time as compared to just summing the individual area and execution time components as described above. Incorporating such sequence optimizations in the estimations would be fairly straightforward but has not been carried out yet. Note, however, that the improvement in speedup induced by all BSBs within the sequence being able to communicate directly with each other is taken into account.

The partitioning problem can now be formulated as that of finding the combination of non-overlapping hardware sequences which yields the best speedup while having a total area penalty less than or equal to the available hardware area A.

4 Software, Hardware and Communication Estimation

This section describes how hardware area and execution time, software execution time, and communication time are estimated.

4.1 Software Estimation

Software execution time for a pure DFG (i.e., no control flow) is estimated by performing a topological sort (linearization) of the nodes in the DFG. The nodes are then translated into a generic instruction set, with the addressing modes of the instructions determined by data-dependencies and a greedy register allocation scheme. The execution times of the generic instructions


are then determined from a technology file corresponding to the target microprocessor. This is similar to the approach described in [3], where good estimation results are reported, and the same technology files for the 8086, 80286, 68000 and 68020 microprocessors are used. The execution time of the DFG is obtained by summing the execution times of the generic instructions and multiplying the sum by the profiling count for the DFG. Execution times for higher-level constructs such as loop BSBs and branch BSBs are obtained on the basis of the execution times of their child BSBs.
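
As a sketch of the final summation step (the generic instruction set and technology-file format are not detailed here, so the names and numbers below are illustrative only):

    def sw_time(generic_instrs, pc_count, tech_file):
        """Software time of a DFG: sum the technology-file execution
        times of its generic instructions, scaled by the profiling count."""
        return pc_count * sum(tech_file[op] for op in generic_instrs)

    # e.g. sw_time(["MOV", "ADD", "MOV"], 100, {"MOV": 2, "ADD": 3})  ->  700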

4.2 Hardware area estimation

A common way of estimating the hardware area of a BSB is to estimate how much area a full hardware implementation of the BSB will occupy. This includes hardware to execute the calculations of the BSB and hardware to control the sequencing of these calculations. If the total chip area is divided into a datapath area and a controller area, each BSB moved to hardware may be viewed as occupying a part of the datapath and a part of the controller. Figure 4a shows this model when one BSB has been moved to hardware.

[Figure 4: a) Area = A_C1 + A_D1; b) Area = (A_C1 + A_C2) + (A_D1 + A_D2); c) Area = (A_C1 + A_C2) + A_D.]

Figure 4: BSB area estimation which accounts for hardware sharing: a) Controller and datapath area for a single BSB, b) When sharing hardware, the total area for multiple BSBs is less than the summation of the individual areas, c) Variable controller area and fixed datapath area for multiple BSBs with hardware sharing.

When several BSBs are moved to hardware, they may share hardware modules as they execute in mutual exclusion. Hence, an approach which estimates area as the summation of datapath and control areas for all hardware BSBs will probably overestimate the total area. This problem is depicted in figure 4b, where the area of the datapath is not equal to the sum of the individual BSB datapaths.

In our approach the datapath area is the area of a set of preallocated hardware modules in the datapath, as illustrated in figure 4c. The BSBs share these modules, and the controller area is the area left for the BSB controllers. The hardware area of a BSB is then estimated as the hardware area of the corresponding controller and will depend on the number of timesteps required for executing the BSB. The hardware area of a BSB therefore depends on its execution time.

4.3 Hardware execution time estimation

The hardware execution time for a DFG is determined by dynamic list-based scheduling [4], which attempts to utilize the hardware modules in the given allocation in order to maximize parallelism and thereby minimize execution time. The execution time obtained in this way is multiplied by the profiling count for the DFG. The execution time for higher-level constructs is obtained as in the software case.

4.4 Communication estimation

Communication is currently assumed to be memory-mapped I/O. The transfer of k variables from software to hardware is assumed to require k generic MOV microprocessor instructions and k Import operations as defined in the hardware library. Communication from hardware to software is estimated in the same way, just using the hardware Export operation instead.
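
In other words, writing t_MOV for the time of one generic MOV instruction and t_Imp, t_Exp for the times of the hardware Import and Export operations (our shorthand, assuming the two costs simply add), the transfer of k variables costs

\[ t_{s \to h}(k) = k\,(t_{MOV} + t_{Imp}), \qquad t_{h \to s}(k) = k\,(t_{MOV} + t_{Exp}) \]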

5 The PACE algorithm

The idea behind the PACE algorithm is best illustrated by an example. Figure 5 shows four BSBs which must be partitioned so as to reach the largest speedup on the available area A = 3. The speedup and area penalty for a single BSB which is moved to hardware are shown below each BSB. The numbers between two BSBs denote the extra speedup which is incurred when the BSBs are both placed in hardware and can communicate directly with each other.

[Figure 5: four BSBs A, B, C, D, each with area penalty 1 (a_A = a_B = a_C = a_D = 1) and inherent speedups s_A = 5, s_B = 10, s_C = 2, s_D = 10; the extra communication speedups between the adjacent pairs A-B, B-C and C-D are 2, 2 and 4 respectively.]

Figure 5: Example of partitioning problem with communication cost considerations.

Obviously B and D should be placed in hardware as they have large inherent speedups. This leaves room for one more BSB. Should it be A or C? The answer is not obvious, as A induces a large inherent speedup but a small communication speedup when placed together with B in hardware, whereas C induces a smaller inherent speedup but, on the other hand, a large communication speedup when placed together with B and D in hardware. The following paragraphs explain how the PACE algorithm solves this problem.

The algorithm utilizes the previously mentioned fact that any possible partition can be thought of as


composed of sequences of BSBs. If A, C and D are chosen for hardware, it corresponds to choosing the sequences S_{A,A} and S_{C,D}. The speedup of sequence S_{C,D} is larger than the sum of the speedups of its components C and D, due to the extra communication speedup induced by both blocks being chosen for hardware. So a natural approach will be to calculate the areas and speedups of all sequences of BSBs and choose the combination of sequences that induces the largest speedup. The areas and speedups of all sequences are calculated and shown in table 1. The ordering and grouping of BSBs is explained below.
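
With the numbers from figure 5 this is easy to verify for S_{C,D}: the extra communication speedup of 4 between C and D makes the sequence worth more than its parts,

\[ s_{C,D} = s_C + s_D + 4 = 2 + 10 + 4 = 16 > s_C + s_D = 12. \]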

| Sequence | Elements | Area | Speedup |
Group A (sequences ending with A):
| S_{A,A} | A          | 1 | 5  |
Group B (sequences ending with B):
| S_{A,B} | A, B       | 2 | 17 |
| S_{B,B} | B          | 1 | 10 |
Group C (sequences ending with C):
| S_{A,C} | A, B, C    | 3 | 21 |
| S_{B,C} | B, C       | 2 | 14 |
| S_{C,C} | C          | 1 | 2  |
Group D (sequences ending with D):
| S_{A,D} | A, B, C, D | 4 | 35 |
| S_{B,D} | B, C, D    | 3 | 28 |
| S_{C,D} | C, D       | 2 | 16 |
| S_{D,D} | D          | 1 | 10 |

Table 1: Grouping of sequences.

The problem is to find the combination of non-overlapping sequences which fits the available area A and whose speedup sum is as large as possible. This problem cannot be solved with an ordinary Knapsack stuffing algorithm, as some of the sequences are mutually exclusive (because they contain identical BSBs) and therefore cannot be moved to hardware at the same time. But if the sequences are ordered and grouped as shown in the table, a dynamic programming algorithm can be constructed which does not attempt to choose mutually exclusive sequences for hardware at the same time.

The algorithm works as follows. Assume first that for each group up to and including group C, the best (maximum speedup) combination of sequences has been found and stored for each (integer) area a from zero up to the available area A. Assume then that, for instance, sequence S_{C,D} with area a_{C,D} is selected for hardware at the available area a. How is the optimal combination of sequences on the remaining area a − a_{C,D} then found? As C and D have been chosen for hardware, only A and B remain. So the best solution on the remaining area must be found in group B, which contains the best combination of sequences for all BSBs from the set {A, B}. Similarly, if the "sequence" S_{D,D} is chosen for hardware, the best combination on the remaining area is found in group C. The optimal combination is always found in the group whose letter in the alphabet comes immediately before the letter of the first index in the chosen sequence.

[Figure 6: the Speedup, BestSpeedup and BestChoice matrices for the four-BSB example, computed for areas 0 to 4; the bottom-right entries give BestSpeedup[D, 3] = 28 and BestSpeedup[D, 4] = 35.]

Figure 6: The PACE algorithm employed for a simple example.

The important thing to note is that when a sequence from group X has been chosen, the optimal combination of sequences on the remaining area can be found in one of the groups A to pred(X), and, when sequences are selected as above, no mutually exclusive BSBs are selected simultaneously. In this way the best solutions for a given group can always be determined on the basis of the best solutions found for the previous groups.

Figure 6 shows how the best combination of sequences can be found using three matrices: Speedup[1..ns, 0..A], BestSpeedup[1..n, 0..A] and BestChoice[1..n, 0..A]. Here ns is the number of sequences, n is the number of BSBs and A is the available area. Zero entries are not shown. Arrows indicate where values are copied from, but arrows are not shown for all entries in order to make the figure more readable.

The Speedup matrix contains, for each sequence and each available area, the best speedup that can be achieved if that sequence is first moved to hardware and then sequences from the previous groups are moved to hardware. In the figure, Speedup[S_{B,C}, 3] is 19 and is found as the inherent speedup of S_{B,C}, which is 14, plus the best obtainable speedup 5 on area 3 − a_{B,C} = 3 − 2 = 1 in group A (as B and C have been chosen). The BestSpeedup matrix contains, for each group (of which there are n) and each area, the best speedup that can be achieved by first selecting a sequence from that group or one of the previous groups. It can be calculated as

\[ BestSpeedup[g, a] = \max\Bigl( \max_{S \in g} Speedup[S, a],\; BestSpeedup[pred(g), a] \Bigr) \]

The BestChoice[g, a] matrix identifies the choice of sequence that gave this maximum value. The last two matrices are interleaved and typeset with bold letters in the figure.

In the example, BestSpeedup[C, 3] is 21, as this is the maximum speedup that can be found in group C with available area 3 and it is larger than the largest speedup that could be found in the previous groups, namely 17. The corresponding choice of sequence is S_{A,C}. In contrast, BestSpeedup[C, 1] and BestChoice[D, 1] are copied from the corresponding entries of the previous group. For group C this is because all Speedup entries in that group for area 1 are smaller than the best speedup 10 achieved with only sequences from groups up to and including B. For group D, Speedup[S_{D,D}, 1] is also 10, so the choice of best sequence for this group is arbitrary.

The solution to the posed problem is found in the BestChoice[D, 3] and BestSpeedup[D, 3] entries. The best initial choice is sequence S_{B,D}, with a corresponding total speedup of 28. As the area of this sequence is 3, no other sequences were taken, and none thus need to be found by backtracking. This shows that it was best to choose C for hardware instead of A. The area 4 was included in the figure to show that the algorithm correctly chooses all four BSBs for hardware when there is room for them. This can be seen from the [D, 4] entries.

Once the BestSpeedup and BestChoice lines have been calculated for each group, the Speedup values are no longer needed. Actually, the Speedup matrix is not needed at all, as it can be replaced by the BestSpeedup matrix, whose maximum values can be calculated on the fly. This is because we are only interested in the maximum values and the corresponding choice of sequences for each group. This means that instead of the memory requirements being proportional to the number of sequences ns, they are now proportional to n, as only the BestSpeedup and BestChoice matrices are needed. The PACE algorithm is shown as algorithm 1.

After the algorithm has been run, the best speedup that can be obtained is found in the entry BestSpeedup[NumBSBs, AvailableArea]. But as for the simple Knapsack algorithm, reconstruction of the chosen sequences, and thereby of the chosen BSBs, is necessary.

5.1 Algorithm Analysis

Direct inspection of the PACE algorithm shows that the time complexity is O(n² · A) and the space complexity is O(n · A) (the PACE-reconstruct algorithm obviously has smaller time and space complexity and can hence be disregarded). Note that areas must be expressed as integral values. A can be reduced (at the expense of partitioning quality) by using a larger "area granularity", for example by expressing BSB sizes in terms of RTL modules.

PACE(n, A):
{
  for all groups g = 1 to n do
    for all areas a = 0 to A do {
      BS[g,a] <- 0;  BC[g,a] <- {};          -- clear best-speedup/best-choice tables
    }
  for all groups g = 1 to n do {
    HighBSB <- g;
    for LowBSB = 1 to HighBSB do {           -- each sequence ending in B_g
      SeqArea    <- area(S_LowBSB,HighBSB);
      SeqSpeedup <- speedup(S_LowBSB,HighBSB);
      for all areas a = SeqArea to A do {
        if LowBSB = 1 then {
          if SeqSpeedup > BS[g,a] then {
            BS[g,a] <- SeqSpeedup;
            BC[g,a] <- S_LowBSB,HighBSB;
          }
        } else {                             -- combine with best of group LowBSB-1
          if SeqSpeedup + BS[LowBSB-1, a-SeqArea] > BS[g,a] then {
            BS[g,a] <- SeqSpeedup + BS[LowBSB-1, a-SeqArea];
            BC[g,a] <- S_LowBSB,HighBSB;
          }
        }
      }
    }
    if HighBSB > 1 then                      -- inherit better entries from group g-1
      for all areas a = 0 to A do
        if BS[g-1,a] > BS[g,a] then {
          BS[g,a] <- BS[g-1,a];
          BC[g,a] <- BC[g-1,a];
        }
  }
  return BS[], BC[];
}

PACE-reconstruct(n, A, BS[], BC[]):
{
  HwBSBList <- {};
  AStart <- 0;  Found <- false;              -- smallest area achieving the best speedup
  while (AStart <= A) and not Found do
    if BS[n, AStart] = BS[n, A] then
      Found <- true
    else
      AStart <- AStart + 1;
  a <- AStart;  g <- n;
  repeat {                                   -- backtrack through the chosen sequences
    Seq <- BC[g, a];
    if Seq <> {} then {
      LowBSB  <- first index of Seq;
      HighBSB <- second index of Seq;
      for BSB = LowBSB to HighBSB do
        add BSB to HwBSBList;
      a <- a - area(Seq);
      g <- LowBSB - 1;
    }
  } until (a < 0) or (Seq = {}) or (g = 0);
  return HwBSBList;
}

Algorithm 1: PACE - A Partitioning Algorithm withCommunication Emphasis.
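
For concreteness, the following compact Python re-implementation of Algorithm 1 is a sketch of ours (not the LYCOS source). It folds the Speedup matrix into BestSpeedup as described above, uses 0-based BSB indices, and takes seq_area and seq_speedup as callbacks for the pre-calculated a_ij and s_ij. Unlike the paper's PACE-reconstruct, the sketch starts backtracking at the full area A rather than at the smallest area that achieves the best speedup.

    def pace(n, A, seq_area, seq_speedup):
        """Return (BS, BC): BS[g][a] is the best speedup using BSBs 0..g
        within area a; BC[g][a] is the sequence (lo, hi) chosen there."""
        BS = [[0] * (A + 1) for _ in range(n)]
        BC = [[None] * (A + 1) for _ in range(n)]
        for g in range(n):                    # group g: sequences ending in B_g
            if g > 0:                         # inherit the previous group's best
                BS[g] = BS[g - 1][:]
                BC[g] = BC[g - 1][:]
            for lo in range(g + 1):           # candidate sequence S_lo,g
                sa, sp = seq_area(lo, g), seq_speedup(lo, g)
                for a in range(sa, A + 1):
                    rest = BS[lo - 1][a - sa] if lo > 0 else 0
                    if sp + rest > BS[g][a]:
                        BS[g][a] = sp + rest
                        BC[g][a] = (lo, g)
        return BS, BC

    def pace_reconstruct(BC, A, seq_area):
        """Backtrack through BC to recover the BSBs moved to hardware."""
        hw, g, a = [], len(BC) - 1, A
        while g >= 0 and BC[g][a] is not None:
            lo, hi = BC[g][a]
            hw[:0] = range(lo, hi + 1)        # prepend the chosen sequence
            a -= seq_area(lo, hi)
            g = lo - 1
        return hw

    # The four-BSB example (areas all 1, sequence speedups from table 1):
    sp = {(0, 0): 5, (0, 1): 17, (1, 1): 10, (0, 2): 21, (1, 2): 14,
          (2, 2): 2, (0, 3): 35, (1, 3): 28, (2, 3): 16, (3, 3): 10}
    BS, BC = pace(4, 3, lambda i, j: j - i + 1, lambda i, j: sp[(i, j)])
    print(BS[3][3], pace_reconstruct(BC, 3, lambda i, j: j - i + 1))
    # prints: 28 [1, 2, 3]  -- i.e. B, C and D in hardware, as in the text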


The n² term can be reduced by enlarging BSB granularity through hierarchical collapsing, or by only considering BSB sequences which induce a speedup greater than zero (or greater than some given percentage). Also, there is no need to pre-calculate areas and speedups for sequences whose total area will be greater than A. Another optimization could be to only consider sequences of length ≤ m, where m is an arbitrary value selected prior to execution of the algorithm.

As for the simple Knapsack problem, the dual problem of minimizing area with a fixed time-constraint can be solved by swapping the area and speedup entries calculated for each sequence. Another approach could be to scan the bottom line of the BestSpeedup matrix from the left (see figure 6) until an entry is found which violates the time-constraint, as the PACE algorithm in effect calculates the best speedup for all areas.

Note that the areas and speedups of all sequences must be pre-calculated before the algorithm can execute. This operation basically has time complexity O(n²), but this can also be reduced.

6 Experiments

A series of experiments which demonstrate the capabilities of the PACE algorithm has been carried out. The experiments were carried out in LYCOS, the LYngby CO-Synthesis system, an experimental co-synthesis system currently being developed at our institute.

The sample application used for the experiments is a VHDL behavioral description taken from an optical-flow image processing application. The application calculates eigenvectors which are used to obtain local orientation estimates for the cloud movements in a sequence of Meteosat thermal images. The application consists of 448 lines of behavioral VHDL, and the corresponding CDFG contains 1511 nodes and 1520 edges. BSB software execution time is estimated for an 8086 processor, and a hardware library for an Actel ACT 3 FPGA is used to estimate hardware datapath and controller area. Partitioning is performed for a sequence of total hardware areas ranging from 1000 to 2000 in steps of 20, where an area unit equals the area of a logic/sequential module in the FPGA. Table 2 summarizes the characteristics of the most important modules.

[Table 2: columns Operations, Area and Cycles for the hardware modules used in the experiments; the rows cover the add/subtract/less/equal module, the multiply modules and the divide module.]

Table 2: Area and execution-time estimates for hardware modules and operations.

Figure 7 shows the results of partitioning the sample application using three different partitioning models:

[Figure 7: resulting speedup versus total chip area (1000 to 2400) for the three partitioning models.]

Figure 7: The PACE algorithm compared with Knapsack algorithms which ignore communication or do not account for adjacent hardware block communication optimization (allocation A).

ignoring communication; simple communication, where the read- and write-sets of a BSB are always transferred regardless of other BSBs placed in hardware; and adjacent block communication, which is the one used by the PACE algorithm. All three approaches are assumed to be implemented with local hardware store and are thus evaluated in the model domain according to the adjacent block communication model.

For chip areas less than the allocated area for the datapath (760 in the figure), no speedup is obtained as no control area is available. As soon as control area is available, the approach which ignores communication starts to move BSBs to hardware. For the approaches taking communication into account, moving BSBs to hardware is not beneficial before the total area reaches around 1040. It can be seen that as the chip area increases, more and more BSBs are moved to hardware. From the figure it is clear that the approach using the simple communication model does not move as many BSBs to hardware as the approach which takes adjacent block communication into account. This is mainly due to the fact that many of the BSBs have a communication overhead which is larger than the speedup they induce. In any case the best results are obtained by partitioning according to the adjacent block communication model.

[Table 3: the three datapath allocations, built from add-sub, multiply and divide modules (combinatorial and serial variants).]

| Allocation | A   | B    | C   |
| Area       | 760 | 1148 | 427 |

Table 3: Modules and corresponding area for each of the three allocations.


[Figure 8: resulting speedup versus total chip area (roughly 500 to 2500) for allocations A, B and C.]

Figure 8: The PACE algorithm employed for different allocations.

Figure 8 shows the results of partitioning with the PACE algorithm for three different allocations: A, B and C, all listed in table 3.

Widely different results are obtained for given available areas, and a specific allocation which is optimal for all areas cannot be found. Allocation C has the smallest datapath area, which means that for relatively small areas the partitioning algorithm is able to move BSBs to hardware and thus obtain the best partition. Around area 1500 this changes; allocation A now becomes more attractive due to the fact that the larger datapath allocation can benefit from the inherent parallelism of the sample application, i.e., larger speedups may be achieved for the individual BSBs. The figure also illustrates the problem of allocating too much datapath area, as allocation B, which has the largest datapath area, never manages to give the best partitioning even for large chip areas.

7 Conclusion

This paper has presented the PACE hardware/software partitioning algorithm which, within the presented partitioning model, gives an exact solution to the problem of minimizing total execution time with a hardware area constraint as well as to the problem of minimizing hardware area with a global execution time constraint. The partitioning model recognizes 1) that sequences of hardware BSBs only need to communicate their effective read- and write-sets from/to software, and 2) that both area and execution times can be calculated on a BSB-sequence basis and therefore may be smaller than the sum of the individual areas and execution times of the BSBs in the sequence. The algorithm has quadratic time complexity, but as shown, this may be reduced.

Experiments have been carried out that show that the algorithm is superior to algorithms which ignore communication or do not recognize adjacent block communication optimization. Also, it has been demonstrated how the algorithm can be used with a BSB hardware area model which assumes hardware sharing among BSBs.

Acknowledgements

This work is supported by the Danish Technical Research Council under the "Codesign" program.

References

[1] R. Ernst, J. Henkel, and T. Benner. Hardware/software co-synthesis of microcontrollers. Design and Test of Computers, pages 64-75, December 1992.

[2] Rolf Ernst, Wei Ye, Thomas Benner, and Jörg Henkel. Fast timing analysis for hardware/software co-design. In ICCD '93, 1993.

[3] Jie Gong, Daniel D. Gajski, and Sanjiv Narayan. Software estimation from executable specifications. Technical Report ICS-93-5, Dept. of Information and Computer Science, University of California, Irvine, CA 92717-3425, March 8, 1993.

[4] Jesper Grode. Scheduling of control flow dominated data-flow graphs. Master's thesis, Technical University of Denmark, 1995.

[5] Rajesh K. Gupta and Giovanni De Micheli. System synthesis via hardware-software co-design. Technical Report CSL-TR-92-548, Computer Systems Laboratory, Stanford University, October 1992.

[6] D. Herrmann, J. Henkel, and R. Ernst. An approach to the adaptation of estimated cost parameters in the COSYMA system. In CODES '94, 1994.

[7] Axel Jantsch, Peeter Ellervee, Johnny Öberg, Ahmed Hemani, and Hannu Tenhunen. Hardware/software partitioning and minimizing memory interface traffic. In EURO-DAC '94, 1994.

[8] Asawaree Kalavade and Edward A. Lee. A global criticality/local phase driven algorithm for the constrained hardware/software partitioning problem. In Third International Workshop on Hardware/Software Codesign, pages 42-48, September 1994.

[9] Peter V. Knudsen. Fine-grain partitioning in codesign. Master's thesis, Technical University of Denmark, 1995.

[10] Giovanni De Micheli. Computer-aided hardware-software codesign. IEEE Micro, 14(4):10-16, August 1994.

[11] Frank Vahid, Jie Gong, and Daniel D. Gajski. A binary-constraint search algorithm for minimizing hardware during hardware/software partitioning. In EURO-DAC '94, pages 214-219, 1994.
