partitioning and exploration strategies in the tosca...

!~

Partitioning and Exploration Strategies in the TOSCA Co-Design Flow

A. Balboni (a), W. Fornaciari (b, c), D. Sciuto (c)

(a) ITAL TEL-SIT, Central Research Labs, CL TE,20019 Castel/etto di Settimo Milanese (MI), Italy

.(b) CEFRIEL, via Emanue/i 15, 20126 Milano, Italy,

[

I

(c) Po/itecnico di Milano, Dip. Elettronica e Informazione,P.zza L. Da Vinci 32, Milano, Italy

Abstract

The TaSCA environment for hardware/softwareco-design of control dominated systems implemented ona single chip includes a novel approach to the systemexploration phase for the evaluation of alternativearchitectures. The paper presents the metrics and thepartitioning algorithm defined for the identification ofthe best hardware and software bindings andmodularization, given the design constraints and goals.The system exploration phase is implemented as aniterative process directed by the user, based on theformal internal design representation adopted inTaSCA. The application of the metrics is then shown ona simple example to illustrate the approach.

1. Introduction

In many embedded applications heterogeneoushardware/software architectures may provide a moreeffective design solution for some target cost andperformance figures with respect to fully dedicatedhardware implementations. Some of the most importantaspects to be considered in the concurrent design ofsuch systems is the early prediction of the final results,the possibility of generating different alternativesolutions of the hardware/software partitioning processsatisfying the user requirements, and the possibility ofsystem designing by incrementing the functionalities ofexisting systems, i.e. the possibility of re-using existingmodules (software or hardware).

The tools that implement such strategies shouldcooperate with existing design flows and standards toenable a realistic use within industrial designenvironments.

Aim of this paper is to introduce a novelmethodology to manage the co-design process for aspecific application field, i.e. control-dominated ASICs,

0-8186-7243-9/96 $05.00 @ 1996 IEEE

,

- - ~

such as those embedded into telecom digital switchingsubsystems. In particular, this paper will focus on thedescription of the implemented exploration strategies ofalternative system architectures aiming at balancinghardware cost/performance and software flexibility.

Research in co-design has produced a number ofdifferent approaches ([1], [2], [3], [4]) which can beroughly partitioned in strategies starting from a fullysoftware system implementation moving pieces ofsoftware toward the hardware domain and, viceversa,strategies aiming at obtaining the minimum cost byreplacing pieces of hardware with software code. Ourapproach does not assume an initial solution, apart fromthe possibility of tagging some modules as software orhardware if they are already available in the designenvironment and no alternative solution is sought. Itprovides an environment enabling the exploration ofdifferent partitioning which can be either user-driven orautomatically performed. All the decisions areevaluated and possibly taken according to the results ofan iterative analysis performed through a set of metricstailored to predict their effects onto the finalimplementation.

The paper is organized as follows. Next sectiongives an overview of the TaSCA framework and thetarget architecture adopted. Section 3 defines theinternal system representation allowing the applicationof formally correct transformations of the system toidentify and evaluate the different partitioningalternatives. The system exploration strategies andevaluation metrics are presented in Section 4. Finally anapplication of the partitioning algorithm is presented inSection 5.

2. The TOSCA design flow

The TaSCA environment represents a pragmaticapproach to co-design, accepted as prototype in theR&D department of an Italian telecom company. In

62

l

fact, the proposed framework is integrated within anindustrial design flow and allows the user to approve ormodify all decisions pertaining system exploration,with the possibility of backtracking along the designprocess and of reusing already designed submodules.The high level of integration with the existingcommercial EDA tools has been obtained by directlyinterfacing with existing design entry environments(e.g. speedCHAR7). a VHDL representation of thehardware-bound parts, the direct synthesis of thesoftware modules and the ability of achieving co-simulation of the entire mixed hw/sw system within thesame VHDL-based environment. Moreover, the CADenvironment allows the user to cover the whole designprocess, ranging from the system-level specificationcapture down to the synthesis, by considering low leveleffects of the software timing properties (down toassembly level). by providing the code for softwareprocesses as well as the necessary operating systemsupport and, finally, by generating the interfaces amonghw and sw modules.

Multiple formalisms are accepted as systemspecification, based on the consideration that inindustrial design environment projects are seldomunconstrained but usually their specifications arederived from previous experiences, and therefore somecomponents are a priori hw or sw bound, some arealready synthesized. some are specified by graphicalformalisms, others by textual specification. A unifiedinternal representation level based on a process algebracomputational model with an Occam II more pragmaticsyntax has been adopted. Other modules, describedthrough different formalisms. can be mapped orconnected to the internal Occam II representation. Theoverall system representation is stored within an objectoriented database tailored to support high-levelarchitectural exploration. shared by all the toolscomposing the TaSCA environment.The system exploration and partitioning strategy isbased on the internal process algebra basedrepresentation, allowing the application of formal, welldefined transformations. This process is managed bythe Exploration Manager which drives manipulation ofthe initial system modularization to produce a new setof system partitions and their association (if stillfloating) either with software or dedicated hardwareunits. Our approach. apart from possible bindingsforced by users, does not start from an a priori defaultsolution (e.g. initially fully software bound). A set ofstrategies and basic algorithms can be iterated onto thesystem representation, until the design constraints aresatisfied. The output of this process is a set ofmonolithic architectural units with a bindingestablishing either a hardware or a softwareimplementation. Each architectural unit is then passedas input to the following co-synthesis stages.The partitioning and synthesis processes have toconsider different functional and non functional design

requirements, that are tightly related to the targetimplementation architecture. Our class of applicationsrequires a single chip implementation including anoff-the-shelf microprocessor core with its memory(even if part of the memory can be external) and thededicated logic implementing a set of coprocessors, i.e.the set of synthesized hardware-bound modulesidentified during system exploration. The termcoprocessor includes also arithmetic/logic operationsand possible private storage capability, while high-levelsynthesis tools typically separate controllers from data-paths. The master processor is programmable and thesoftware can be either on-chip resident or read from anexternal memory; dedicated units operate as peripheralcoprocessors. To satisfy the requirement of interfacingamong hardware and software bound elements, amaster-slave shared bus communication strategy hasbeen adopted. All hardware to hardwarecommunications are managed through dedicated lines(local buses). The RAM memory required forprogram/data storage shares with the s;oprocessors themain data bus, but can be accessed only by the masterCPU. Communications among CPU and coprocessorsare based on a memory mapped I/O scheme with onebus interface manager per coprocessor based on acommon I/O buffered protocol manager.

The last stage defined within the TOSCA designflow covers the synthesis of all the elements:software-bound modules including the basic operatingsystem support, hardware-bound parts and interfaces.The embedded system software is usually built around alightweight operating system that provides processscheduling and interrupt handling facilities aiming atguaranteeing fairness and real-time behavior to all thesoftware actors. Our solution is to consider the softwaredescription at the level of a virtual assembly instructionset (VIS) whose structure can be mapped onto differentCPU cores with fully predictable translation rules and,consequently, reliable performance estimation [5].

This solution provides the possibility of achieving afully VHDL based co-simulation of each proposedhw/sw architecture [5], to get feedback on theeffectiveness of the implementation. In fact. a suitableVHDL model for VIS instruction-level execution. hasbeen developed, so that the entire system can be co-simulated through a unified VHDL-based environment.

The model is parametric to allow the analysis oflow-level timing/cost/performance, for different classesof microprocessors.

The hardware synthesis can be performed usingthree different strategies: direct mapping ofhardware-bound FSM into a VHDL code compliantcommercial RTL synthesis tools ([6]); transparentpassing of imported VHDL modules descriptions to thelogic synthesis tool; definition of behavioral VHDLdescriptions of the hardware-bound modules to be inputto a commercial high-level synthesis tool.

63

3. Design representation

The internal model onto which the systemexploration model is defined is based on acustomized process algebraJOccamII representation.The system representation can be obtained byimporting modules descriptions obtained throughcommercial design entry tools (such as speedCHAR7)as well as by using the built-in Occam II editor (seefig. I).A key issue of the Occamll formalism is thepossibility of easily representing both parallel andsequential execution of processes at any abstractionlevel. For example, by considering a pair of processesp2 and pI, the termination of p2-after-p I executioncan be captured by the construct SEQ, while and theimpossibility of statically predicting the terminationorder between pI and p2 is expressed via the PARconstruct. It is important to underline that norestriction on the process granularity is required, sothat processes executed in parallel can be constitutedby assignments, input or output statements as well asparallel or sequential composition of simplerprocesses. Details on the Occam paradigm can befound in [7], [8].The main advantages provided by the chosen processalgebraJOccamII for co-design purposes, with respectto the other formalisms (e.g. the statechart family,flow-chart diagrams, CDFG, VHDL, SDL, ...) are thefollowing:. the computational model is based upon a small set

of primitives (assignment, input and output) andcomposition operators (sequential, parallel,

alternative) allowing the modeling of arbitrarycomplex behaviors;. the simple and regular syntax makes easier theparsing, the redisplay of transformedspecifications and the internal representation;

. a formal and well assessed theory oftransformations is available;. multiple paradigms (event-driven, dataflow,rendez-vou~, FSM,...) are supported through thesame syntax and semantics;. a powerful concurrency model suitable forsystem-level specification is defined: channels,user-defined communication protocols, regularparallelism (array of processes and channels),timers and timeouts, process priorities.An important advantage of using channels to

represent communication is the possibility ofapplying the same compact model of data exchangebetween heterogeneous processes (and consequentlyinterfaces). The channels behave as variables with anattached unidirectional pipe management, leavingunmodified the original value (the channel receivesonly a copy). Since a channel can be shared by twoprocesses only, broadcast communication can bemodeled via multiple channels. Synchronizationamong processes is realized through channels,because they implement a rendez-vous mechanism.Shared variables among parallel processes areforbidden, they can only transfer values via channels.

Time related issues are supported through the useof timer objects, acting as read-only channelsproviding an integer value corresponding to thecurrent time.

Figure I.' The visual front-end of the TOSCA environment.

,

64

.

Delays can be introduced by using deferred inputsthrough the AFTER clause. Typical real-time schemescan be modeled by combining the statements formanaging time with the construct allowing alternativecomposition of processes.

The Occam model of the system description isstored within the TOSCA object oriented designdatabase, storing classes of objects for modeling thesystem functionality and supporting the specific co-design activities such as simulation and design entry.All the classes inherit from a top-level class thefollowing methods:. general purpose methods implementing

functionalities for database management;. methods realizing a user-level visualback-annotation interface of information concerningthe objects;. methods activating specific actions operating on the'associated object.

4. System exploration and partitioning

After the acquisition of the system model has beencompleted, architectural tradeoffs may be carried out byiterated manipulation of the internal model storedwithin the TOSCA database. This stage manages twoclasses of objects: processes and architectural units.The processes can be subdivided in two classes:. processes to be mapped onto predefined hardware

or software components;. processes whose specification is not biased by theimplementation and therefore different designalternatives can be explored at system level (e.g. theones captured via the graphic OccamII built-ineditor ofTOSCA).

The partitioning process can be viewed as anincremental activity of modification of the initialspecification through the application of transformations.The TOSCA project has considered this processthrough:. The definition and implementation of formal

transformation rules working onto the internalmodel stored within the design database.. The definition and implementation of a set ofmetrics to drive the partitioning process.. The definition of a partitioning algorithm and itsimplementation within the Exploration Managerframework to obtain a tool allowing both directintervention of the user and automatic selection of

the strategies by following the built-in evaluationcriteria.

Since Occam has been derived from CSP [8], a numberof formal properties are available as a background todefine a set of transformations preserving the semanticsof the original specification.Some preliminary studies on transformations tailoredfor state-based synchronous specifications have been

--

experimented in [6], but the current version of TOSCAincludes:

. Parallelization of sequential processes andviceversa, that can be useful to improveperformance or reuse of basic building blocks (hw 0sw) of processes belonging to the same architecturalunit.. Replacement of channels with variables andviceversa; this action is usually performed when twoprocesses will become part of the same sw-boundarchitectural uhit. For instance, processes belongingto the same coprocessor share the local variableswhile if they are located on different coprocessorscommunication channels are required.. Merge of code sections.. Splitting of code sections.

The partitioning process can be executed step by step,thus allowing direct control of the user through theexploration manager interface. This allows to takeadvantage of user expertise whenever possible, but itcan be automatically performed by applying the defaultpartitioning flow. Here. the default partitioning flow isintroduced.

The top-level algorithm of the default flow uses agreedy strategy where the initial model is first of allexpanded in a top-level process containing a PARassert, by properly applying the formal transformationlaws. The system allows modification of the granularityof the decomposition according to the user selection.After this preliminary step an initial graph G containingall the not yet allocated processes (leaf nodes) is built.A set of actions is applied onto such a graph. Theseactions can be partitioned in two phases:I. Pre-allocation: this phase consists of marking the

processes nodes according to the most suitableimplementation technology (hw or sw). The usercan accept the proposed solution or manuallymodify the decisions taken.

2. During a second phase, according to some closenesscriteria defined by the user and based on the metricsavailable, new process leaves are built by pairwisecollapsing nodes to build a unified PAR statement;this activity of collapsing nodes is iterativelyperformed until all the architectural units areidentified and their binding is defined.

The initial pre-allocation strongly affects both theconvergence speed of the algorithm and the resultquality. The collapsing of processes in the secondphase, will occur whenever the measure of thecloseness criteria, during the selection of the closestprocesses, stays under a certain a priori definedthreshold. This solution is particularly flexible since itis possible to consider a set of criteria, with differentpriorities that can be changed dynamically. according tothe current nodes granularity or user choice.To support these activities, a set of metrics has beendefined, according to the co-design target. Ourapproach focuses on providing a modular design flow

65

sufficiently open to be easily extended to cover anynew strategy defined. For this purpose the mechanismsimplementing the system transformations have beenkept disjoint from their application policy. A specificset of metrics for early prediction of cost/performance.as well as to evaluate the system synthesis results, has

.been developed at first. The activity of identifying somecloseness properties among parts of the systemspecification, can be performed by means of a staticanalysis based on metrics for early prediction ofcost/performance. It aims at producing an initialnearly-optimum allocation and binding to be iterativelyimproved by the user until design goals are satisfied.Afterwards, to update the predicted data, a final stage ofactual synthesis is performed. This is the most timeconsuming task and it is strongly sensitive to the qualityof the pre-allocation performed during the formerphase; it returns information that will be back-annotatedas replacement of the predicted data.

The metrics consider the system analysis from athreefold point of view:. Statically, by analyzing composition and structure

of the description contained within the database.. Dynamically, but still independent of theimplementation, through an high-level executionand profiling of the specification to extractinformation such as communication bottlenecks and

statistics on operators applications to better tune thedecision on the initial partitioning and binding.. After-synthesis, requiring at least a completesynthesis cycle. This returns information that will beback-annotated as a replacement of predicted dataand will also contribute to the final designevaluation.

The application of metrics is based on the association ofeach Occam process with a module, including theexported interface and the process body that is hiddento the other processes. Under a structural point of view,each Occam module composing the system descriptioncan be considered as a hierarchy of modules. beingeach component a module itself.The most relevant factors considered on such modulesto evaluate the quality of the resulting design, are thefollowing.. Communication: costs related to the number of lines

and bandwidth for hw-hw and sw-hwcommunication.. Interfacing: according to the adoptedcommunication/synchronization technique amongdifferent modules (e.g. between hw and sw, oneinterface unit per coprocessor vs a common busmanager/arbiter queuing messages), costs can beaffected by number and granularity of modules.Alternative bus protocol templates can be evaluated.. Area: this is the most important aspect because ofthe single chip implementation. Overalloptimization takes into account possibleuser-defined binding along the evaluation and

r

-- ~ --

comparison among different alternatives.. Resources exploitation: on the software side, sincethe microprocessor is anyhow present, it isimportant to increase as much as possible the CPOutilization while fulfilling the timing requirementsof the programmed modules and the effectiveness ofmemory. Another relevant issue is related to thepower consumption that is improved by amodularization able to identify the minimum set ofarchitectural units with the lowest amount of idletime.

Currently, the TOSCA environment implements a set ofmethods to evaluate the following static metrics, basedon the following static properties of the Object Orientedrepresentation of the Occam specification:. Declarations count.. Data dependencies among data declarations.. Module use.. Structure of modules composing the system graph.The relevant parameters considered to determine thedeclaration count are:. Local(module): representing the number of local

data declarations.. Global(module): representing the number of datavisible within a module.. Scope(module): representing the visibility area ofdata declaration overlapped to the moduledefinition.

The properties representing the relations amongmodules are captured by considering the number ofother modules both used (imported) and using theconsidered (exported) one. Other importantcharacteristics are obtained through the analysis of thegraph representing the hierarchy of the modulescomposing the system. In particular maximum andaverage depth and the number of possibly completepaths (from the root to a leaf) are taken into account.These parameters can be considered as a bias toestimate the system complexity.Another set of metrics has been introduced to report thepresence of possible inconsistencies when modules arecollapsed. They describe the level of internal cohesionfor each module and the coupling between modules.Two types of interactions have been evaluated:declaration and declaration which analyzes if adeclaration interacts with another by considering if amodification of the first implies also a modification inthe second one: declaration and subroutine whichconsiders that a declaration interacts with a subroutineif it has a declarattion and declaration interaction withat least one of the subroutine declarations.

A significant goal of the entire partitioning process,is to achieve an high level of internal cohesion togetherwith a loose coupling with other modules. This isparticularly important e.g. to make easier the moving ofcommunicating modules within differentimplementation domains while saving bus bandwidth,

66

l.

since it is the only mean to obtain hw-sw dataexchange. Moreover, it is also possible to avoidunnecessary replication of data variables (or registersfor the hw domain) to implement the corresponding-data handler.

In addition to the static properties, a profilingthrough simulation of the internal specification can becarried out. This allows to better understand the real

bottleneck and critical parts of the system description.Since the purpose of this analysis is to provide statisticsand predictions of the execution times, a parametrizablemodel has been defined to cover both the hardware andsoftware domain. In fact, a process composed of basicactions cannot be represented by simply adding theexecution times of each component. It is necessary toconsider also the effect of process synchronization andcommunication that are non explicitly managed byOccam descriptions. For instance, if software executionis considered it is necessary to model the overhead dueto context switching among threads and the presence ofsubroutine calls.

This modeling of the time properties, allows theidentification of the following process properties:. Latency of a process execution.. Execution frequency of a process.. CPU usage (summation of the latency-frequency

products, for all the software bound processes).. Channel utilization (bandwidth, idle time).. Computational load, representing the total numberof operations executed by process p duringsimulation.

All the data obtained through metrics evaluation aregathered within matrices that will be used to evaluatethe closeness among the processes composing thesystem graph, and possibly to trigger the application ofsystem-level transformations, according to ahierarchical multi-stages clustering algorithm. Ingeneral. the cost function used to evaluate the quality ofcandidate solutions has the form of a weighted sum:

cost = Lwm m = W x M

mEMwhere M is the set of the considered metrics and W the

matrix expressing the relevance of the metric accordingto the designer's goal. In case of direct intervention bydesigner, the results of metrics evaluations can beback-annotated and considered separately. Theevaluation of metrics computed onto the designdescription stored within the database is basicallytailored to support the system level exploration phase,i,e, as a support in the process of collapsing processesto identify a set of architectural units with the properhw or sw binding and to modify the process granularity.

The algorithm operates by considering the set G ofprocesses constituting the system graph and thecloseness matrix V(c,G) according to the criterion cconsidered; in other words, each V[i,j] entry represents

I

f

--

the measure of the distance between nodes i and) basedon the evaluation of the c criterium. The decision on theprocesses to be collapsed is carried out by the functionscoupleWithMaxCloseness(V) which selects thecloseness pair of nodes of the graph, andmaxCloseness(Vj which returns the correspondingdistance. Given a certain criterion, the opportunity tocollapse nodes is driven by the crossing of a userdefined threshold VI.The decision on whether or not thenodes have to b~ collapsed is related with the thresholdCt defined onto the cost function summarizing theadherence between design results and initialrequirements. This method allows to establish anordering in terms of importance among the consideredcriteria, which can be modified dynamically. Apseudo-code representation of the algorithm is herereported:while costFunction > Ct {

select c in criteria

while maxCloseness(V(c,G»:?: Vt(c) {(i,j) := coupleWithMaxCloseness(V(c,G»GI:=G-{i,j}G2 := G I + { collapse(ij)}if not G2,violatesConstraintsO {

G :=G2}modify (the system)

}}The procedure modifY allows the introduction of somemodifications into the system model such as a new setof bindings, the serialization of operations withinprocesses obtained through previous collapsing and soon. G2vioiatesConstraints evaluates if the consideredsystem instance is acceptable according to the definedconstraints; to reduce the possibility to be cached in alocal optimum, a non deterministic function can beintroduced. The complexity of the implementedalgorithm is O( c x n x log ,n) where c is the numberof criteria, n is the number of processes of the system.The derivation of the closeness matrix starting from agiven metric can be performed by following twodifferent approaches according to the type ofconsidered metric. Metricscan expresspropertiesbothintrinsic the module (e,g. the Local(m) metric) andrelated with its connection to the system (e.g. theExported(m) metric) or concern the interaction betweenpairs of modules (e.g. the Contemporaneous Presence(m I, m2».In the first case a vector M(c,G) is built such that eachelement M[g] represents the evaluation of metric c forthe process gE G. For each pair of processes {i,j}, thevalues of matrix V[i,j] are given by:

V[i,j] = :t.~ if MU] 2: M[i],

67

V[iJ] = :~jif M[i] > M[jJ.

For the latter case the distance matrix is obtained bynormalizing the entries of M(c,G) so that each of itselements M[iJ] represents the evaluation of metric cwith respect to the processes i and}. As a consequence:

V[i,j] = 1 if {ij}=coupleWithMaxCloseness(M(c,G»

V [i ' J= M[i,jJ,J M[k,mJ

if {k,m}=coupleWithMaxCloseness(M(c,G)).If a criterion is associated with more than a singlemetric, the distance matrix is obtained by building amatrix whose elements are the weighted sum of thecorresponding entries belonging to the matricesconcerning the considered metrics (one per metric), andfinally by normalizing the elements to the highest valueobtained, The entries of the matrix all range from 0 toI, so that the same range of values can be consideredfor the threshold VI.

For example, by considering for simplicity only thecohesion metric, i.e. the metric which evaluates thedegree of interrelation between two modules, it ispossible to define a threshold that, if exceeded, enablesthe collapsing of two processes pI and p2 within asingle module m. This choice, according to theconsidered metric, is advantageous if the cohesion of mis greater than the cohesion of the set constituted bymodules pI and p2. In other terms, the decisiondepends on the following relation:

CI(m1) + CI(m2) CI(m1) + CI(m2) + CII2>M(m1)+M(m2) M(m)

where C/(mi) represents the number of cohesiveinteractions of module mi, M(mi) is the maximumnumber of possible cohesive interactions of module miand Ch2 expresses the number of cohesive interactionsbelonging to mi and m2 when they are collapsed. Thevalues of Cll2 obtained by solving the above relation,can be considered as the threshold for the cohesionmetric.

It has to be PQinted out the high flexibility of theproposed approach that can be extended with minimalre-design effort to cover new design goals orcharacteristics, simply. by adding a proper method operating onto the

design DB to compute the corresponding metric;. by defining its relative importance with respect tothe other existing metrics in case of user-drivendesign to produce a consistent set of analysis dataevaluating the system quality, as well as if theautomatic system partitioning is enabled to modifythe selection strategy.

5. Example of the partitioning process

Let us consider a small example to show the default

,

- - - -~-- -

partitioning flow, using only static properties due tospace reasons. The system implements a convolverwhose functionality is expressed by the following

equation: Yi =f xi-I x Wj x aij=/

where I~ i ~ 2n-l, Wj+I = b x x I x Wj and ai is givenby ai= Ci+di with Ci= di if Xi ;CO,Ci= x;/2 if Xi < 0and di = Xi+/ ifxi+/ ;CO,di = (xi+/)/2ifxi+/ < O.

The corresponding Occam representation, storedwithin the TOSCA design database, is composed ofthree processes whose Occamll hierarchy andinteractions are depicted in figure 2:pO models the environment and produces the values

for x and w;pI implements the convolver;p models the parallel composition of pOand pI.The communication between pO and pi is perfoipledvia channels.

The system analysis starts by taking into account forsimplicity only the metric measuring the number ofpaths of the overall system (Paths) to define thecost-function weighted sum (i.e. the coefficientmultiplying the other metrics are zeros) and to acceptsolutions producing cost < 35.

p

pO p1

Figure 2: Schematic diagram of the example.

Therefore, the cost function is: cost = k * Paths (withk= I). In addition we introduced some limitations on themetrics measuring the number of variables local to eachprocess (that are related to the size) and to the processlatency. The corresponding constraint for each processp is thus: (Latency (p) < 12) AND (Local (p) < 20).

A static analysis of the system model considered asa whole has been carried out, the result of the staticmetrics evaluation shows that the system does notsatisfy the above conditions (Paths evaluates 43 that isgreater than 35) so that it needs to be restructured tofulfill the user's goal. The first step has been toconsider the system at a finer granularity bydecomposing p I into a set of parallel processes, whosestructure is shown in figure 3.

To identify a better system modularization, a set ofmetrics has to be evaluated. Here, only five staticmetrics have been computed. The data obtained afterprocessing the system description are reported in figA.Based on these results, for each metric thecorresponding closeness matrix between processes canbe derived. For instance, the matrix corresponding to

68

l

I

I

I

I

I

II

--

.----

the local(pj metric is shown in figS

p

pO

wX

~w

Figure 3: The system after the/irs! decomposition.

Figure 4: Results of the analysis offive properties ofthe system description.

Similarly, according to the data contained infigure 4, it is possible to compute the values for eachone of the considered metrics. All these values are then

gathered within a matrix which is evaluated to computethe closeness among the processes composing thesystem graph.

By considering as global closeness criterion only thelocal metric, processes p I and p2 in figure 3 are theclosest and their collapsing does not violate theconstraints.

Figure 5: An example of computation of the localmetric. The values. representing the number ofdeclarations local to a module. have been normalized.

Since p I and p2 show a low level of cohesiveinteraction, they can be composed to produce a uniquesoftware-bound process, To allow their execution on asingle microprocessor, a sequential composition of theoriginal pair of processes has to be applied. Thetransformation process proceeds in a similar way andprocesses p3 and p4 are collapsed. According to aprofiling obtained via a dynamic analysis. thecorresponding architectural unit is hardware boundsince it accounts to 73% of the overall computationalload. This modularization corresponds to an acceptableand final redesign of the system, since a new

computation of the metrics produces a set of valuessatisfying user's goal and constraints:cost = 28 < 35 latency =9 < 12 AND local =17 < 20

The partitioning and binding process is thuscompleted and the system is ready to be processed bythe TaSCA co-synthesis tools to produce VIS andVHDL code for the pl2 and p34 architectural units,respectively. Eventually, the channel used forcommunication "among processes belonging to the samepartition are replaced with global-scope sharedvariables and a serialization of the operations of all thesw-bound processes is automatically performed.

p

pOx

~w

Figure 6: Thefinal system modularization aftercollapsing of p3 and p4.

References

I Hardware/Software Co-Design, NATO Advanced StudyInstitute, G. De Micheli. M.G. Sami eds.. to appear byKluwer. 1996.

2. Benner T.. Ernst R.. Henkel 1.. "Hardware-Software

Cosynthesis for Microcontrollers", IEEE Design&Test.Vo1.10. No.4. December 1993.

3. Barros E.. Rosenstiel W.. Xiong X.. "Hardware/SoftwarePartitioning with UNITY". Proc. 2nd Workshop onHW/SW Co-Design. Cambridge. MA. 1993.

Ismail T.. O'Brien K.. Jerraya A.. "InteractiveSystem-level Partitioning with PARTIF". Proc.EDAC'94. Paris. France. February. 1994.

4.

Antoniazzi S.. Balboni A.. Fornaciari W.. Sciuto D..

"The Role of VHDL within the TOSCA Co-designFramework". Proc. of Euro-VHDL'94. September 1994.

6. Antoniazzi S.. Balboni A.. Fornaciari W.. Sciuto D..

"HW/SW Co-design for Embedded Telecom Systems",Proc. ICCD'94. pp.278-29I, 1994.

5.

7. Jifeng H.. Page I.. Bowen 1.. "Towards a ProvablyCorrect Hardware Implementation of Occam". Technical

Report. Oxford University Computing Laboratory, 1994.

Hoare C. A. Roo"Communicating Sequential Processes",Communications of the ACM, Vol. 18, No.8, 66-77.

August 1978.

8.

69

Property/proc. pI p2 p3 p4

Local 3 4 10 17

Global 3 4 10 17

Depth 3 3 3 4Paths 4 5 18 16

Cohesive Inter, I I 5 5

LOCAL pI p2 p3 p4

pI - I A .24p2 - - .53 .34p3 - - - .79p4 - - - -

partitioning and exploration strategies in the tosca...

Documents