
CHAPTER 7
System-Level Design and Hardware/Software Co-design

CHAPTER POINTS

• Electronic system-level design

• Hardware/software co-synthesis

• Thermal-aware design and design-for-reliability

• System-level co-simulation

7.1 Introduction

Embedded computing systems are designed to a wide range of competing criteria that have given rise to several different fields. On the one hand, they must perform increasingly complex tasks. Not only have user interfaces become more sophisticated, but the underlying control and signal processing algorithms have become much more complex in order to meet more demanding system requirements. The design of systems with millions of lines of code running on complex, heterogeneous platforms has given rise to the field of electronic system-level design.

On the other hand, embedded systems must meet tight cost, power consumption, and performance constraints. If one design requirement dominated, life would be much easier for embedded system designers: they could use fairly standard architectures with easy programming models. But because the three constraints must be met simultaneously, embedded system designers have to mold hardware and software architectures to fit the needs of applications. Specialized hardware helps to meet performance requirements with lower energy consumption and at less cost than would be possible from a general-purpose system. Hardware/software co-design is the study of algorithms and tools for the detailed design of hardware/software systems.

Section 7.2 looks at performance estimation methods useful in co-design. Section 7.3 is a large section that surveys the state of the art in hardware/software co-synthesis. Section 7.4 introduces system-level design, a field that makes extensive use of results from co-design. We then look at two particular design challenges critical to systems built from nanometer-scale VLSI: thermal-aware design in Section 7.5 and reliability in Section 7.6. Finally, Section 7.7 looks at system-level simulation.


High-Performance Embedded Computing. http://dx.doi.org/10.1016/B978-0-12-410511-9.00007-1
Copyright © 2014 Elsevier Inc. All rights reserved.


7.2 Performance estimation

During co-synthesis, we need to evaluate the hardware being designed. High-level synthesis was developed to raise the level of abstraction for hardware designers, but it is also useful as an estimation tool for co-synthesis. This section surveys some high-level synthesis algorithms to provide a better understanding of the costs of hardware and fast algorithms for area and performance estimation. We also look at techniques specifically developed to estimate the costs of accelerators in heterogeneous multiprocessors.

7.2.1 High-level synthesis

High-level synthesis starts from a behavioral description of hardware and creates a register-transfer design. High-level synthesis schedules and allocates the operations in the behavior as well as maps those operations into component libraries.

Figure 7.1 shows a simple example of a high-level specification and one possible register-transfer implementation. The data dependency edges carry variable values from one operator or from the primary inputs to another operator or a primary output.

The register-transfer implementation shows which steps need to be taken to turn the high-level specification, whether it is given as text or a data flow graph, into a register-transfer implementation:

• Operations have been scheduled to occur on a particular clock cycle.
• Variables have been assigned to registers.
• Operations have been assigned to function units.
• Some connections have been multiplexed to save wires.

In this case, it has been assumed that we can execute only one operation per clock cycle. A clock cycle is often called a control step or time step in high-level synthesis. We use a coarser model of time for high-level synthesis than is used in logic synthesis. Because we are farther from the implementation, delays cannot be predicted as accurately, so detailed timing models are not of much use in high-level synthesis. Abstracting time to the clock period or some fraction of it makes the combinatorics of scheduling tractable.

Allocating variables to registers must be done with care. Two variables can share a register if their values are not required at the same time; for example, if one input value is used only early in a sequence of calculations and another variable is defined only late in the sequence. But if the two variables are needed simultaneously, they must be allocated to separate registers.
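The register-sharing test above can be sketched as a lifetime-overlap check. This is a minimal illustration, not from the text: `can_share_register` is a hypothetical helper, and a lifetime is approximated as a (definition step, last use step) pair of control steps.

```python
# Sketch: deciding register sharing from variable lifetimes. A variable is
# live from the control step where it is defined through the last step
# where it is used; two variables may share a register only if their live
# ranges do not overlap.

def can_share_register(lifetime_a, lifetime_b):
    """Each lifetime is a (def_step, last_use_step) pair of control steps."""
    a_start, a_end = lifetime_a
    b_start, b_end = lifetime_b
    # One variable must be dead before the other is defined.
    return a_end < b_start or b_end < a_start

# An input used only early and a variable defined only late can share:
print(can_share_register((0, 1), (2, 3)))   # True
# Two simultaneously live variables cannot:
print(can_share_register((0, 2), (1, 3)))   # False
```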

Sharing either registers or function units requires adding multiplexers to the design. For example, if two additions in the data flow graph are allocated to the same adder unit in the implementation, we use multiplexers to feed the proper operands to the adder. The multiplexers are controlled by a control finite-state machine (FSM), which supplies the select signals to the muxes. (In most cases, we don't need demultiplexers at the outputs of shared units because the hardware is generally




designed to ignore values that aren't used on any given clock cycle.) Multiplexers add three types of cost to the implementation:

1. Delay, which may stretch the system clock cycle.
2. Logic, which consumes area on the chip.
3. Wiring, which is required to reach the multiplexer and also requires area.

Sharing hardware isn't always a win. For example, in some technologies, adders are sufficiently small that you gain in both area and delay by never sharing an adder. Some of the information required to make good implementation decisions must come from a technology library, which gives the area and delay costs of some components. Other information, such as wiring cost estimates, can be made algorithmically. The ability of a program to accurately measure implementation costs for a large number of candidate implementations is one of the strengths of high-level synthesis algorithms.

FIGURE 7.1
An example of behavior specification and register-transfer implementation. [The figure shows the language specification (x1 <= a + b; x2 <= b - c; y <= x1 + x2), the equivalent data flow specification with inputs a, b, and c, and an implementation in which registers hold x1, x2, and Y, multiplexers select operands, a shared function unit performs the operations, and a controller generates the select, load, and operation control signals.]




When searching for a good schedule, the as-soon-as-possible (ASAP) and as-late-as-possible (ALAP) schedules are useful bounds on schedule length. Some of the scheduling algorithms discussed in Section 4.2.2 are also useful for high-level synthesis.
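ASAP and ALAP bounds can be sketched as longest-path computations over the data flow graph. This is an illustrative sketch, not code from the text, assuming the unit-delay timing model used above; the graph maps each node to its successors.

```python
# Sketch: ASAP and ALAP schedules for a data flow graph under unit delays.

def asap(graph):
    """ASAP step of a node = longest path from any source to it."""
    steps = {}
    def step_of(n):
        if n not in steps:
            preds = [m for m, succs in graph.items() if n in succs]
            steps[n] = 0 if not preds else 1 + max(step_of(p) for p in preds)
        return steps[n]
    for n in graph:
        step_of(n)
    return steps

def alap(graph, length):
    """ALAP step = latest step that still lets all descendants finish."""
    steps = {}
    def step_of(n):
        if n not in steps:
            succs = graph[n]
            steps[n] = length - 1 if not succs else min(step_of(s) for s in succs) - 1
        return steps[n]
    for n in graph:
        step_of(n)
    return steps

# The example of Figure 7.3: +1 feeds *1, which joins +2 at +3.
g = {"+1": ["*1"], "+2": ["+3"], "*1": ["+3"], "+3": []}
# asap: +1 -> 0, +2 -> 0, *1 -> 1, +3 -> 2
# alap(length=3): +1 -> 0, +2 -> 1, *1 -> 1, +3 -> 2
print(asap(g))
print(alap(g, length=3))
```

Note that +1 and +3 have identical ASAP and ALAP steps (they are critical), while +2 has the range 0-1, matching the discussion of Figure 7.3 below.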

A very simple heuristic that can handle constraints is first-come-first-served (FCFS) scheduling. FCFS walks through the data flow graph from its sources to its sinks. As soon as it encounters a new node, it tries to schedule that operation in the current control step; if all the resources are occupied, it starts another control step and schedules the operation there. FCFS schedules generally handle the nodes from source to sink, but nodes that appear at the same depth in the graph can be scheduled in arbitrary order. The quality of the schedule, as measured by its length, can change greatly depending on exactly which order the nodes at a given depth are considered.
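The FCFS walk can be sketched as follows. This is an illustrative implementation, not from the text, assuming unit delays and a single pool of identical function units; `num_units` bounds how many operations fit in one control step.

```python
from collections import deque

# Sketch: FCFS scheduling under a resource constraint. Walk the data flow
# graph from sources to sinks; place each operation at the earliest control
# step that respects its data dependencies and still has a free unit.

def fcfs_schedule(graph, num_units):
    """graph: node -> list of successors. Returns node -> control step."""
    preds = {n: [] for n in graph}
    for n, succs in graph.items():
        for s in succs:
            preds[s].append(n)
    indeg = {n: len(preds[n]) for n in graph}
    queue = deque(n for n in graph if indeg[n] == 0)
    schedule, used = {}, {}          # used[t] = units busy at step t
    while queue:
        n = queue.popleft()
        # earliest step allowed by data dependencies (unit delays)
        t = max((schedule[p] + 1 for p in preds[n]), default=0)
        while used.get(t, 0) >= num_units:
            t += 1                   # all resources occupied: next step
        schedule[n] = t
        used[t] = used.get(t, 0) + 1
        for s in graph[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    return schedule

# Two independent additions feeding a third, with only one unit:
g = {"+1": ["+3"], "+2": ["+3"], "+3": []}
print(fcfs_schedule(g, num_units=1))   # {'+1': 0, '+2': 1, '+3': 2}
```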

FCFS, because it chooses nodes at equal depth arbitrarily, may delay a critical operation. An obvious improvement is a critical-path scheduling algorithm, which schedules operations on the critical path first.

List scheduling is an effective heuristic that tries to improve on critical-path scheduling by providing a more balanced consideration of off-critical-path nodes. Rather than treat all nodes that are off the critical path as equally unimportant, list scheduling estimates how close a node is to being critical by measuring D, the number of descendants the node has in the data flow graph. A node with few descendants is less likely to become critical than another node at the same depth that has more descendants.

List scheduling also traverses the data flow graph from sources to sinks, but when it has several nodes at the same depth vying for attention, it always chooses the node with the most descendants. In our simple timing model, where all nodes take the same amount of time, a critical node will always have more descendants than any noncritical node. The heuristic takes its name from the list of nodes currently waiting to be scheduled.
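The descendant-count priority can be sketched as below. This is an illustrative version, not from the text, again assuming unit delays and a single pool of identical units; ties in descendant count are broken arbitrarily.

```python
# Sketch: list scheduling. Among ready operations competing for units in a
# control step, prefer the one with the most descendants in the data flow
# graph, since it is closest to becoming critical.

def count_descendants(graph, node):
    seen, stack = set(), list(graph[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph[n])
    return len(seen)

def list_schedule(graph, num_units):
    preds = {n: [] for n in graph}
    for n, succs in graph.items():
        for s in succs:
            preds[s].append(n)
    priority = {n: count_descendants(graph, n) for n in graph}
    schedule, t = {}, 0
    while len(schedule) < len(graph):
        # the "list": unscheduled nodes whose predecessors are all done
        ready = [n for n in graph if n not in schedule
                 and all(p in schedule and schedule[p] < t for p in preds[n])]
        ready.sort(key=lambda n: priority[n], reverse=True)
        for n in ready[:num_units]:
            schedule[n] = t
        t += 1
    return schedule

# +1 heads a two-operation chain, so it outranks +2 (one descendant):
g = {"+1": ["*1"], "+2": ["+3"], "*1": ["+3"], "+3": []}
print(list_schedule(g, num_units=1))
```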

Force-directed scheduling [Pau89] is a well-known scheduling algorithm that tries to minimize hardware cost to meet a particular performance goal by balancing the use of function units across cycles. The algorithm selects one operation to schedule using forces, as shown in Figure 7.2. It then assigns a control step to that operation. Once an operation has been scheduled, it does not move, so the algorithm's outer loop executes once for each operation in the data flow graph.

To compute the forces on the operators, we first need to find the distributions of various operations in the data flow graph, as represented by a distribution graph. The ASAP and ALAP schedules tell us the range of control steps at which each operation can be scheduled. We assume that each operation has a uniform probability of being assigned to any feasible control step. A distribution graph shows the expected value of the number of operators of a given type being assigned to each control step, as shown in Figure 7.3. The distribution graph gives us a probabilistic view of the number of function units of a given type (adders in this case) that will be required at each control step. In this example, there are three additions, but they cannot all occur on the same cycle.




If we compute the ASAP and ALAP schedules, we find that +1 must occur in the first control step, +3 in the last, and the +2 addition can occur in either of the first two control steps. The distribution graph DG+(t) shows the expected number of additions as a function of control step; the expected value at each control step is computed by assuming that each operation is equally probable at every legal control step.
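The adder distribution graph of this example can be computed directly from the ASAP/ALAP ranges. This is an illustrative sketch, not from the text, assuming each operation is uniformly distributed over its feasible range.

```python
# Sketch: building a distribution graph from ASAP/ALAP ranges. Each
# operation contributes probability 1/(range width) to every control step
# in its feasible range.

def distribution_graph(ranges, num_steps):
    """ranges: op -> (asap_step, alap_step). Returns the expected number
    of operations of this type at each control step."""
    dg = [0.0] * num_steps
    for first, last in ranges.values():
        p = 1.0 / (last - first + 1)    # uniform over the feasible range
        for t in range(first, last + 1):
            dg[t] += p
    return dg

# +1 fixed at step 0, +2 free over steps 0-1, +3 fixed at step 2:
adder_ranges = {"+1": (0, 0), "+2": (0, 1), "+3": (2, 2)}
print(distribution_graph(adder_ranges, 3))   # [1.5, 0.5, 1.0]
```

The expected adder demand is highest in the first control step, which is why force-directed scheduling will prefer to push +2 to the second step.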

FIGURE 7.2
How forces guide operator scheduling. [The figure plots force against time for the candidate control steps of an operation being scheduled.]

FIGURE 7.3
Distribution graphs for force-directed scheduling. [The figure shows a data flow graph with additions +1, +2, and +3 and a multiplication *1, together with the distribution graph for adders, which plots the number of adders (0 to 2) at each control step.]



We build a distribution for each type of function unit that we will allocate. The total number of function units required for the data path is the maximum number needed in any control step, so minimizing hardware requirements requires choosing a schedule that balances the need for a given function unit over the entire schedule length. The distribution graphs are updated each time an operation is scheduled: when an operation is assigned to a control step, its probability at that control step becomes 1 and at any other control step becomes 0. As the shape of the distribution graph for a function unit changes, force-directed scheduling tries to select control steps for the remaining operations that keep the operator distribution balanced.

Force-directed scheduling calculates forces like those exerted by springs to help balance the utilization of operators. The spring forces are a linear function of displacement, as given by Hooke's Law:

F(x) = kx (EQ 7.1)

where x is the displacement and k is the spring constant, which represents the spring's stiffness. When computing forces on the operator we are trying to schedule, we first choose a candidate schedule time for the operator, and then compute the forces to evaluate the effects of that scheduling choice on the allocation.

There are two types of forces applied by and on operators: self-forces and predecessor/successor forces. Self-forces are designed to equalize the utilization of function units across all control steps. Since we are selecting a schedule for one operation at a time, we need to take into account how fixing that operation in time will affect other operations, either by pulling forward earlier operations or pushing back succeeding ones. When we choose a candidate time for the operator being scheduled, restrictions are placed on the feasible ranges of its immediate predecessors and successors. (In fact, the effects of a scheduling choice can ripple through the whole data flow graph, but this approximation ignores effects at a distance.)

The predecessor/successor forces, Po(t) and Xo(t), are those imposed on predecessor/successor operations. The scheduling choice is evaluated based on the total forces in the system exerted by this scheduling choice: the self-forces, the predecessor forces, and the successor forces are all added together. That is, the predecessor and successor operators do not directly exert forces on the operator being scheduled, but the forces exerted on them by that scheduling choice help determine the quality of the allocation. At each step, we choose the operation with the lowest total force to schedule and place it at the control step at which it feels the lowest total force.
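The self-force can be sketched as a weighted sum of probability changes. This is a simplified illustration of the idea rather than the full formulation in [Pau89]: fixing an operation at step t raises its probability there from p to 1 and lowers it to 0 elsewhere in its range, and the distribution graph weights those changes the way the spring constant k weights displacement in EQ 7.1.

```python
# Sketch: self-force of scheduling one operation at candidate step t,
# given the distribution graph dg for its operation type and its
# feasible (ASAP, ALAP) range.

def self_force(dg, feasible_range, t):
    first, last = feasible_range
    p = 1.0 / (last - first + 1)         # current uniform probability
    force = 0.0
    for i in range(first, last + 1):
        delta = (1.0 - p) if i == t else -p   # change in probability
        force += dg[i] * delta           # dg acts like the constant k
    return force

dg = [1.5, 0.5, 1.0]                     # adder graph from Figure 7.3
# Candidate steps for +2, which is feasible over steps 0-1:
print(self_force(dg, (0, 1), 0))         # 0.5: crowds the busy step
print(self_force(dg, (0, 1), 1))         # -0.5: balances utilization
```

The lower (more negative) force at step 1 is why force-directed scheduling places +2 in the second control step, evening out adder demand.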

Path-based scheduling [Cam91] is another well-known scheduling algorithm for high-level synthesis. Unlike the previous methods, path-based scheduling is designed to minimize the number of control states required in the implementation's controller, given constraints on data path resources. The algorithm schedules each path independently, using an algorithm that guarantees the minimum number of control states on each path. It then optimally combines the path schedules into a system schedule. The schedule for each path is found using minimum clique covering; this step is known as as-fast-as-possible (AFAP) scheduling.




7.2.2 Accelerator estimation

Estimating the hardware costs of an accelerator in a heterogeneous multiprocessor requires balancing accuracy and efficiency. Estimates must be good enough to avoid misguiding the overall synthesis process. However, the estimates must be generated quickly enough that co-synthesis can explore a large number of candidate designs. Estimation methods for co-synthesis generally avoid very detailed estimates of physical characteristics, such as might be generated by placement and routing. Instead, they rely on scheduling and allocation to measure execution time and hardware size.

Hermann et al. [Her94] used numerical methods to estimate accelerator costs in COSYMA. They pointed out that as additional layers of a loop nest are peeled off for implementation in the accelerator, the cost of the modified accelerator is not equal to the sum of the costs of the new and additional pieces. For example, if a function is described in three blocks, then eight terms may be necessary to describe the cost of the accelerator that implements all three blocks. They used numerical techniques to estimate accelerator costs as new blocks were added to the accelerator.

Henkel and Ernst [Hen95; Hen01] developed an algorithm to quickly and accurately estimate accelerator performance and cost given a control data flow graph (CDFG). They use path-based scheduling to estimate the length of the schedule and the resources that must be allocated. To reduce the runtime of the AFAP and overlapping steps of path-based scheduling, they decompose the CDFG and schedule each subgraph independently; this technique is known as path-based estimation.

Henkel and Ernst use three rules to guide the selection of cut points. First, they cut at nodes with a smaller iteration count, since any penalty incurred by cutting the graph will be multiplied by the number of iterations. Second, they cut at nodes where many paths join, usually generated by the end of a scope in the language. Third, they try to cut the graph into roughly equal-size pieces. The total number of cuts that can be made is controlled by the designer. Their estimation algorithm, including profiling, generating, and converting the CDFG; cutting the CDFG; and superposing the schedules, is shown in Figure 7.4.

Vahid and Gajski [Vah95] developed an algorithm for fast incremental estimation of data path and controller costs. They model hardware cost as a sum:

C_H = k_FU S_FU + k_SU S_SU + k_M S_M + k_SR S_SR + k_C S_C + k_W S_W (EQ 7.2)

where S_x is the size of a given type of hardware element and k_x is the size-to-cost ratio of that type. In this formula:

• FU is function units.
• SU is storage units.
• M is multiplexers.
• SR is state registers.
• C is control logic.
• W is wiring.
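EQ 7.2 can be evaluated as a simple weighted sum. The sketch below is illustrative only: the size and k values are made-up numbers, not calibrated constants from [Vah95].

```python
# Sketch: the hardware cost estimate of EQ 7.2 as a weighted sum over
# element types (FU, SU, M, SR, C, W).

def hardware_cost(sizes, k):
    """sizes: S_x per element type; k: size-to-cost ratio per type."""
    return sum(k[x] * sizes[x] for x in sizes)

# Hypothetical sizes and size-to-cost ratios for one candidate design:
sizes = {"FU": 700, "SU": 75, "M": 400, "SR": 3, "C": 50, "W": 8}
k = {"FU": 1.0, "SU": 1.0, "M": 1.0, "SR": 10.0, "C": 2.0, "W": 5.0}
print(hardware_cost(sizes, k))   # 1345.0
```

Because the cost is a sum of per-type terms, an incremental update (as in Vahid and Gajski's algorithm) only needs to recompute the terms whose sizes changed.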




path_based_estim_tech() {
    CP := {};
    collect_profiling_data(CDFG);
    DAG := convert_to_DAG(CDFG);
    #paths := compute_num_of_paths(DAG);
    if (#paths > max_paths) {
        CP := compute_cutpoints(DAG);
        DAG := split_DAG(DAG, CP);
    }
    for each dag_i ∈ DAG {
        P := calculate_all_paths(dag_i);
        CS := {};
        for each p ∈ P {
            CS := CS ∪ do_AFAP_schedule(p);
        }
        superpose_all_schedules(P, CS);
    }
}

compute_cutpoints(DAG) {
    CP_list_initialize();
    for each v_i ∈ V {                 /* apply rule 2 */
        if (fulfills_rule2(v_i)) {
            CP_list_insert(v_i);
        }
    }
    sort_by_profiling(CP_list);        /* apply rule 1 */
    #cut_pts := input();
    if (len(CP_list) > #cut_pts) {
        if (ambiguous_cut(CP_list, #cut_pts)) {
            CP_list := apply_rule3(CP_list, #cut_pts);   /* apply rule 3 */
        }
    }
    return(CP_list);
}

FIGURE 7.4
An algorithm for path-based estimation [Hen01] ©2001 IEEE.



They compute these from more direct parameters such as the number of states in the controller, the sources of operands for operations, and so on.

Vahid and Gajski preprocess the design to compute a variety of information for each procedure in the program:

• A set of data path inputs and a set of data path outputs.
• The set of function and storage units. Each unit is described by its size and the number of control lines it accepts.
• A set of functional objects, each described by the number of possible control states for the object and the destinations to which the object may write. Each destination is itself described with an identifier, the number of states for which the destination is active for this object, and the sources that the object can assign to this destination.

A tabular form of this information is shown in Figure 7.5. This information can be used to compute some intermediate values, as shown in Figure 7.6. These values in turn can be used to determine the parameters in EQ 7.2.

Vahid and Gajski use an update algorithm that is given a change to the hardware configuration. Given a design object, they first add the object to the design if it does not yet exist. They then update multiplexer sources and sizes and update control line active states for this destination. Finally, they update the number of controller states. Their update algorithm executes in constant time if the number of destinations per object is approximately constant.

FIGURE 7.5
Preprocessed information used for cost estimation [Vah95] ©1995 IEEE. [The table lists functional objects (adders, comparators, storage, and procedures) along with their number of states and, for each destination, its sources and active states.]



7.3 Hardware/software co-synthesis algorithms

Hardware/software co-design is a collection of techniques that designers use to help them create efficient application-specific systems. If you don't know anything about the characteristics of the application, then it is difficult to know how to tune the system's design. But if you do know the application, as the designer, not only can you add features to the hardware and software that make it run faster using less power, but you also can remove hardware and software elements that do not help with the application at hand. Removing excess components is often as important as adding new features.

As the name implies, hardware/software co-design means jointly designing hardware and software architectures to meet performance, cost, and energy goals. Co-design is a radically different methodology than the layered abstractions used in general-purpose computing. Because co-design tries to optimize many different parts of the system at the same time, it makes extensive use of tools for both design analysis and optimization.

When designing a distributed embedded system, developers need to deal with several design problems:

• We must schedule operations in time, including communication on the network and computations on the processing elements. Clearly, the scheduling of operations on PEs and the communications between PEs are linked. If one PE finishes its computations too late, it may interfere with another communication on the network as it tries to send its result to the PEs that need it. This is bad for both the PEs that need the result and the other PEs whose communication is interfered with.

• We must allocate computations to the processing elements. The allocation of computations to the PEs determines which communications are required: if a value computed on one PE is needed on another, it must be transmitted over the network.

• We must partition the functional description into computational units. In co-design, the hardwired PEs are generally called accelerators. In contrast, a coprocessor is controlled by the execution unit of a CPU. Partitioning in this sense is different from hardware/software partitioning, in which we divide a function between the CPU and the accelerator. The granularity with which we describe the function to be implemented affects our choice of search algorithm and our methods for performance and energy analysis.

• We also need to map processing elements and communication links onto specific components. Some algorithms allocate functions to abstract PEs and links that do not model the behavior of a particular component in detail. This allows the synthesis algorithm to explore a larger design space using simple estimates. Mapping selects specific components that can be associated with more precise cost, performance, and power models.

FIGURE 7.6
Tabular representation of hardware resources [Vah95] ©1995 IEEE. [The table lists, for each destination, its sources, the contributing functional objects, the required components (multiplexers, an adder, a comparator, a register), their sizes, control lines, and active states, which together feed the hardware size function Hwsize(wires, srcs_list, units, size_list, ctrl, active_list, states).]

Co-synthesis designs systems to satisfy objective functions and meet constraints. The traditional co-synthesis problem is to minimize hardware cost while satisfying deadlines. Power consumption was added to objective functions as low-power design became more important. Several groups have developed algorithms that try to satisfy multiple objective functions simultaneously.

In the first two sections, we consider design representations. Section 7.3.1 surveys methods used to specify the programs that describe the desired system behavior; Section 7.3.2 looks at ways to describe the characteristics of the hardware. Section 7.3.3 considers techniques to synthesize based on a hardware architecture template. Section 7.3.4 looks at co-synthesis algorithms that generate arbitrary hardware topologies without the use of templates. In Section 7.3.5, we study multi-objective algorithms for co-synthesis, and Section 7.3.6 considers co-synthesis for control and I/O systems. Section 7.3.7 looks at some co-synthesis techniques designed for memory-intensive systems, and we conclude this topic with a discussion of co-synthesis algorithms for reconfigurable systems in Section 7.3.8.

7.3.1 Program representations

Co-synthesis algorithms generally expect system functionality to be specified in one of two forms: a sequential/parallel program or a task graph.

Programs provide operator-level detail of a program's execution. Most programming languages are also sequential, meaning that synthesis must analyze the available parallelism in a program. Some co-synthesis systems accept fairly straightforward subsets of languages such as C. Others add constructs that allow the designer to




specify operations that can be performed in parallel or must be performed in a particular order; an example is the HardwareC language used by Vulcan [Gup93]. The system of Eles et al. [Ele96] takes VHDL programs as behavioral specifications; they use VHDL processes to capture coarse-grained concurrency and use VHDL operations to describe interprocess communication.

Task graphs, as we saw in Section 6.3.2, have been used for many years to specify concurrent software. Task graphs are generally not described at the operator level, and so provide a coarser-grained description of the functionality. They naturally describe the parallelism available in a system at that coarser level.

Task graph variants that allow conditional execution of processes have been developed. For example, Eles et al. [Ele98] used a conditional task graph model. Some edges in their task graph model are designated as conditional; each such edge is labeled with the condition, called a guard, that determines whether that edge is traversed. A node that is the source of conditional edges is called a disjunction node and a node that is the sink of conditional edges is a conjunction node. They require that, for any nodes i and j, where j is not a conjunction node, there can be an edge i -> j only if the guard for i being true implies that the guard for j is true. The control dependence graph (CDG) introduced in Section 4.3 can be used to analyze the conditions under which a node will be executed.

Task Graphs for Free (TGFF) [Dic98c; Val08] is a Web tool that generates pseudorandom task graphs. The user can control the range of deadlines, graph connectivity parameters, and laxity.

A very different specification was used by Barros et al. [Bar94], who used the UNITY language [Cha88] to specify system behavior. UNITY was designed as an abstract language for parallel programming. The assignment portion of a UNITY program specifies the desired computations. It consists of a set of assignments that may be performed either synchronously or asynchronously on a statement-by-statement basis. Execution starts from an initial condition, statements are fairly executed, and execution continues until the program reaches a fixed point at which the program's state does not change.

7.3.2 Platform representations

In addition to representing the program to be implemented, we must also represent the hardware platform being designed. Because a platform is being created and is not given, our representation of the platform itself must be flexible. We need to describe components of the platform so that those components can be combined into a complete platform.

Some of the information about the platform is closely linked to the representation of the program. For example, processor performance is usually not captured abstractly; instead, we generally capture the speed at which various pieces of the program execute on different types of PEs. One data structure that is used with some variation in many systems is the technology table, as shown in Figure 7.7. This table gives the execution time of each process for each type of processing element.


352 CHAPTER 7 System-Level Design and Hardware/Software Co-design


(Some processes may not run on some types of PEs, particularly if the element is not programmable.) A co-synthesis program can easily look up the execution time for a process once it knows the type of PE to which the process has been allocated.
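As a concrete (and purely illustrative) sketch, such a technology table can be held as a nested dictionary, with an empty entry marking a process that cannot run on a given PE type. All process names, PE names, and times below are hypothetical, not taken from Figure 7.7:

```python
# Hypothetical technology table: execution time of each process on each PE
# type; None marks a process that cannot run on that PE (e.g., the PE is a
# fixed-function accelerator). All names and numbers are illustrative.
tech_table = {
    "P1": {"PE1": 5, "PE2": 4, "PE3": 7},
    "P2": {"PE1": 4, "PE2": 9, "PE3": None},
    "P3": {"PE1": 14, "PE2": 5, "PE3": 6},
}

def exec_time(process, pe_type):
    """Look up execution time once allocation has fixed the PE type."""
    t = tech_table[process][pe_type]
    if t is None:
        raise ValueError(f"{process} cannot run on {pe_type}")
    return t
```

Once a process has been allocated, the co-synthesis tool's inner loop reduces to such constant-time lookups rather than repeated performance estimation.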

The performance of a point-to-point communication link is generally easier to characterize since transmission time is independent of the data being sent. In these cases we can characterize the link with a data rate.

In general, a multidimensional table could have several entries for each row/column pair.

• The CPU time entry shows the computation time required for a process.
• The communication time entry gives the time required to send data across a link; the amount of data is specified by the source–destination pair.
• The cost entry gives the manufacturing cost of a processing element or communication link.
• The power entry gives the power consumption of a PE or communication link; this entry can be further subdivided into static and dynamic power components.

We also need to represent the multiprocessor's structure, which in general changes during co-synthesis. A graph is often used, as shown in Figure 7.8, with nodes for the PEs and edges for the communication links. Complex communication structures, such as multistage networks, are generally not directly represented in the platform description. Instead, the structure of the links shows possible connections that can be implemented using different physical layers. The platform model also does not usually include the location of the memory within the communication topology.
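A minimal sketch of such a platform graph, with PEs as nodes and rate-characterized point-to-point links as edges; the PE names and link rates are assumptions made for illustration:

```python
# Illustrative platform graph under co-synthesis: nodes are PE instances,
# edges are point-to-point links characterized only by a data rate, since
# transmission time is independent of the data being sent.
platform_graph = {
    "pes": {"PE1": "cpu", "PE2": "cpu", "PE3": "accel"},
    "links": {("PE1", "PE2"): 100e6, ("PE2", "PE3"): 50e6},  # bytes/second
}

def transfer_time(src, dst, nbytes):
    """A data rate suffices to characterize a point-to-point link."""
    links = platform_graph["links"]
    rate = links.get((src, dst)) or links.get((dst, src))
    if rate is None:
        raise ValueError(f"no link between {src} and {dst}")
    return nbytes / rate
```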

FIGURE 7.7

A process-to-PE technology table, giving the execution time of each process (P1–P3) on each processing element (PE 1–PE 3).

FIGURE 7.8

A multiprocessor connectivity graph (PE 1–PE 4 as nodes, communication links as edges).

7.3 Hardware/software co-synthesis algorithms 353


7.3.3 Template-driven synthesis algorithms

Most early co-synthesis algorithms, and many co-synthesis algorithms in general, generate hardware architectures based on an architectural template. The most common architectural template is a bus-based system, with one or more CPUs and one or more accelerators. Co-synthesis that maps the design onto this bus-based template is generally known as hardware/software partitioning because the bus defines a boundary between two partitions, making it possible to apply traditional graph partitioning algorithms to co-synthesis.

The template provided by the CPU/bus/accelerator configuration is a powerful element of the hardware/software co-synthesis methodology. The template provides some important knowledge that restricts the design space and helps to decouple some operations:

• The type of CPU is known. Once we know the CPU type, we can estimate software performance much more accurately. We can also generate a great deal of performance data in advance since knowledge of the CPU type greatly limits the amount of data that we need to generate.

• The number of processing elements is known. We can more easily build performance models because we know the interconnection topology.

• Only one processing element can multitask. Performance analysis is more complicated when several processing elements can switch between tasks.

When we move to more general architectures, co-synthesis algorithms become considerably more complex. As we will see, different algorithm designers have made different choices about simplifying assumptions and strategies to break the problem into manageable pieces.

Two early hardware/software partitioning systems exemplify different approaches to co-synthesis. COSYMA [Ern93] started with all functions on the CPU and moved some functions to the accelerator to improve performance; Vulcan [Gup93] started with all functions in the accelerator and moved some functions to the CPU to reduce cost. Comparing these two systems gives us a good understanding of the major hardware/software partitioning problems.

The block diagram of the COSYMA system is shown in Figure 7.9. The application is described in Cx, which extends C with some task constructs and timing constraints. That description is compiled into an intermediate form called an ES graph. The ES graph represents the control and data flow in the program in a way that can be rewritten either as a hardware description or an executable C program. (COSYMA also includes a simulator for ES graphs.) The ES graph is composed of basic scheduling blocks (BSBs) that are the primary units of co-synthesis. COSYMA uses simulated annealing to search the design space to determine which BSBs should be implemented on the accelerator.

The Vulcan system started with its own programming language extension known as HardwareC. Vulcan was designed to synthesize multiple tasks running at different rates, so HardwareC allows the description of timing and rate constraints. Input and


output operations are nondeterministic operations that do not take a predictable amount of time. The HardwareC description is automatically broken into tasks by Vulcan by splitting the description at the nondeterministic operation points.

Vulcan schedules and allocates threads using iterative improvement. The initial solution puts all threads in the accelerator, which is a high-performance but also high-cost solution. Vulcan iteratively moves threads to the CPU, testing whether the implementation still meets its performance goals after the move. Once a thread is successfully moved, its successors become the next candidates for movement.

Vulcan estimates performance and cost on the CPU by directly analyzing the threads. The latency of each thread has been predetermined; the thread reaction rate is given. Vulcan computes the processor utilization and bus utilization for each allocation. Vulcan estimates hardware size based on the costs of the operators in the function.

Because the space of feasible designs is large, co-synthesis algorithms rely on efficient search. Vahid et al. [Vah94] developed a binary constraint search algorithm. The algorithm minimizes the size of the required hardware in a hardware/software partitioned system while meeting the performance constraints.

Vahid et al. formulate their objective function so as to meet a performance goal and to have hardware area no larger than a specified bound:

O = k_perf · Σ_{1≤j≤m} V_perf(C_j) + k_area · V_area(W)   (EQ 7.3)

In this objective function, V_perf(C_j) is the amount by which performance constraint j is violated, and V_area(W) is the amount by which the hardware area exceeds the prescribed hardware size bound W.

This objective function is zero when both the performance and area bounds are satisfied. If solutions are arranged on a line, then all solutions to the right of a certain

FIGURE 7.9

The COSYMA system. (Block diagram: Cx system description, ES graph, partitioning, cost estimation, software and hardware translation, runtime analysis, and hardware synthesis.)



point will be zero; these are satisfying implementations. All the nonsatisfying solutions will be to the left of that boundary. The goal of a search algorithm is to find the boundary between the zero and nonzero objective function regions.

The search algorithm is shown in Figure 7.10. It collects hardware (H) and software (S) elements of the system during the search procedure. PartAlg() generates new solutions to probe different regions of the objective function. The hardware/software partition at the boundary between a zero and nonzero objective function is returned as {Hzero, Szero}.
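The loop of Figure 7.10 can be sketched in runnable form. Here the combination of PartAlg() and cost() is collapsed into a hypothetical feasible(size) predicate that returns true when some partition meeting the performance constraints fits within the given hardware size; this is a sketch of the search strategy, not Vahid et al.'s implementation:

```python
def binary_constraint_search(all_hw_size, feasible):
    """Binary search for the smallest hardware size bound whose partition
    satisfies all constraints. `feasible(size)` stands in for PartAlg() plus
    cost(): it returns True when the objective function is zero at `size`.
    Returns None when no bound up to all_hw_size is feasible."""
    low, high = 0, all_hw_size
    best = None  # corresponds to {Hzero, Szero} in Figure 7.10
    while low < high:
        mid = (low + high + 1) // 2
        if feasible(mid):          # zero objective: remember, search left
            high = mid - 1
            best = mid
        else:                      # nonzero objective: search right
            low = mid
    return best

# With a monotone feasibility boundary at size 3, the search finds it.
smallest = binary_constraint_search(10, lambda size: size >= 3)
```

Because all satisfying solutions lie to one side of the boundary, halving the interval on each probe locates the boundary in O(log AllHWSize) partitioning runs.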

The CoWare system [Ver96] supports the design of heterogeneous embedded systems. The system behavior is described as communicating processes. A model can be encapsulated so that existing design tools, such as compilers and simulators, can be reused. The system description is refined over several steps to create an implementation.

A process specifies the behavior of part of the system. Processes can be described in several different host languages such as C or VHDL, and they can be defined hierarchically by composing them from other processes. A thread is a single flow of control in a process; a process can have several threads. A slave thread is associated with a slave port; a time-loop thread is not associated with a port and is repeatedly executed. Objects communicate through ports; like processes, ports can be described hierarchically. A protocol defines the communication syntax and semantics. Protocols can also be described hierarchically.

Co-synthesis implements communicating processes, some of which execute in software on a CPU while others are implemented as hardware attached to the CPU. A library describes the CPU and its interface. On the hardware and software sides of the interface, concurrent processes are combined using inlining into a product process. On the software side, device drivers need to be added to the specified functionality.

Eles et al. [Ele96] compared simulated annealing and tabu search for co-synthesis. Their simulated annealing algorithm has three major parameters: initial temperature TI, temperature length TL, and cooling ratio α. Their simple move procedure randomly selects a node to be moved from one partition to the other. Their improved

low = 0; high = AllHWSize;
while (low < high) {
    mid = (low + high + 1)/2;
    {Hprime,Sprime} = PartAlg(H,S,constraints,mid,cost());
    if (cost(Hprime,Sprime,constraints,mid) == 0) {
        high = mid-1;
        {Hzero,Szero} = {Hprime,Sprime};
    }
    else
        low = mid;
}
return {Hzero,Szero};

FIGURE 7.10

Hardware/software partitioning by binary search [Vah94].



move procedure randomly selects a node to be moved and also moves directly connected nodes whose movement would improve the objective function.
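The two move procedures can be sketched as follows. This is an illustration of the idea, not the authors' code: the partition representation (a node-to-"HW"/"SW" map), the neighbor table, and the objective are all assumptions:

```python
import random

def simple_move(partition, rng):
    """Randomly pick one node and flip it to the other partition."""
    node = rng.choice(sorted(partition))
    flipped = dict(partition)
    flipped[node] = "SW" if partition[node] == "HW" else "HW"
    return flipped

def improved_move(partition, neighbors, objective, rng):
    """Flip one random node, then also flip directly connected nodes
    whose movement further improves (lowers) the objective function."""
    cand = simple_move(partition, rng)
    moved = next(n for n in cand if cand[n] != partition[n])
    for n in neighbors.get(moved, ()):
        trial = dict(cand)
        trial[n] = "SW" if cand[n] == "HW" else "HW"
        if objective(trial) < objective(cand):
            cand = trial
    return cand
```

The improved move gives annealing larger, more coherent steps, which is why it tends to converge with fewer accepted moves than the one-node flip.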

Tabu search uses two data structures, short-term memory and long-term memory. Short-term memory holds information on recent search moves. Long-term memory holds information about moves over a longer time period. For each node, it holds the number of iterations the node has spent in each partition; this information can be used to determine the priority with which the node should be moved.
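A sketch of this bookkeeping, with an assumed tabu tenure and a simple diversification priority; these are illustrative data structures, not the ones used by Eles et al.:

```python
from collections import deque, defaultdict

class TabuMemory:
    """Short-term memory forbids undoing recent moves; long-term memory
    counts the iterations each node has spent in each partition."""
    def __init__(self, tenure=5):
        self.short_term = deque(maxlen=tenure)   # recently moved nodes
        self.long_term = defaultdict(lambda: {"HW": 0, "SW": 0})

    def record(self, moved_node, partition):
        self.short_term.append(moved_node)
        for node, side in partition.items():
            self.long_term[node][side] += 1

    def is_tabu(self, node):
        return node in self.short_term

    def move_priority(self, node):
        # A node that has sat mostly in one partition gets a higher
        # priority for diversification moves.
        counts = self.long_term[node]
        return abs(counts["HW"] - counts["SW"])
```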

Eles et al. use an extended form of VHDL to describe the system to be designed; they use VHDL primitives for interprocess communication. They analyze the code to identify processes, loops, and statement blocks. They both statically analyze the code and profile the code during execution. Their co-synthesis system uses an objective function:

C = Q1 · Σ_{(i,j)∈cut} W1E_ij + Q2 · [Σ_{i∈HW} (Σ_j W2E_ij) / W1N_i] / N_H − Q3 · [Σ_{i∈HW} W2N_i / N_H − Σ_{i∈SW} W2N_i / N_S]   (EQ 7.4)

In this formula, Q1, Q2, and Q3 are weights. HW and SW represent the hardware and software partitions, while N_H and N_S represent the number of nodes in those sets, respectively. W1E_ij denotes the total amount of data transferred between processes i and j. W2E_ij counts the total number of interactions between the two processes, with each interaction counting as one event. W1N_i is the total number of operations performed by process i. We define

W2N_i = M^CL · K^CL_i + M^U · K^U_i + M^P · K^P_i − M^SO · K^SO_i   (EQ 7.5)

where the Ms are weights and the Ks represent measures of the processes. K^CL_i is the relative computational load of process i: the total number of operations performed by a block of statements in the program divided by the total number of operations in the complete program. K^U_i is the ratio of the number of different types of operations in the process to the total number of operations, which is a measure of the uniformity of the computation. K^P_i is the ratio of the number of operations in the process to the length of the computation path, which measures potential parallelism. K^SO_i is the ratio of the number of software-style operations (floating-point calculations, recursive subroutine calls, pointer operations, and so on) to the total number of operations.
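Under the definitions above, the suitability metric of EQ 7.5 can be sketched as a small function; the weight values and operation counts passed in are hypothetical stand-ins for real profiling data:

```python
def w2n(ops_total, ops_in_block, op_kinds, path_len, sw_ops,
        m_cl=1.0, m_u=1.0, m_p=1.0, m_so=1.0, program_ops=100):
    """Illustrative evaluation of EQ 7.5 for one process. A higher value
    suggests the process is better suited to hardware implementation."""
    k_cl = ops_in_block / program_ops  # relative computational load
    k_u = op_kinds / ops_total         # uniformity of the computation
    k_p = ops_total / path_len         # potential parallelism
    k_so = sw_ops / ops_total          # software-style operation fraction
    return m_cl * k_cl + m_u * k_u + m_p * k_p - m_so * k_so
```

Note the sign of the last term: software-style operations (pointer chasing, recursion, floating point) pull the measure down, steering such processes toward the software partition.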

Eles et al. concluded that simulated annealing and tabu search provided similar quality results but that tabu search ran about 20 times faster. However, tabu search algorithms took considerably longer to develop. They also compared these algorithms to Kernighan–Lin style partitioning; they found that for equal-quality solutions, both simulated annealing and tabu search ran faster than Kernighan–Lin on large problems.

The LYCOS co-synthesis system [Mad97] works from a single design representation that can be derived from multiple languages and feed multiple tools.


LYCOS represents designs internally as CDFGs, using a format known as Quenya, and it can translate subsets of both C and VHDL into its internal representation. Quenya's execution semantics are based on colored Petri nets. Tokens arrive at nodes along input edges; the nodes can remove tokens from their inputs and place tokens on their outputs based on firing rules. Figure 7.11 shows the model for an if-then-else statement. The right block evaluates the condition and sends a token along the b edges. If b is true, then the s1 block is executed, else the s2 block is executed.

Figure 7.12 shows the model for a while statement. Procedures are translated into separate graphs, with call nodes denoting calls to the procedure body and an edge for each input or output parameter. Shared variables are represented by a combination of nodes: import/export nodes to sample the value of a shared variable, and wait nodes for synchronization.

LYCOS uses profiling tools to estimate performance. The Quenya model is translated to C++ and executed for profiling. Hardware models are constructed in the Architectural Construction Environment (ACE), which is used to estimate area and performance.

LYCOS breaks the functional model into basic scheduling blocks (BSBs). BSBs for hardware and software do not execute concurrently. A sequence of BSBs <B_i, ..., B_j> is denoted by S_{i,j}. The read and write sets for the sequence are denoted by r_{i,j} and w_{i,j}. The speedup given by moving a BSB sequence to hardware is computed using

S_{i,j} = Σ_{i≤k≤j} pc(B_k) · (t_{s,k} − t_{h,k}) − [Σ_v ac(v) · t_{s→h} + Σ_v ac(v) · t_{h→s}]   (EQ 7.6)

In this formula, pc(B) is the number of times that a BSB is executed in a profiling run and ac(v) is the number of times that a variable is accessed in a profiling run.
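A hedged sketch of evaluating this speedup from profiling data; the argument names and the numbers in the test are illustrative, not LYCOS output:

```python
def speedup(bsbs, accesses, t_sw_to_hw, t_hw_to_sw):
    """Evaluate the EQ 7.6 trade-off for one candidate BSB sequence.
    bsbs: list of (pc, ts, th) per BSB -- profiled execution count,
          software time, and hardware time.
    accesses: list of ac(v) counts for variables crossing the boundary."""
    gain = sum(pc * (ts - th) for pc, ts, th in bsbs)
    comm = sum(ac * t_sw_to_hw for ac in accesses) + \
           sum(ac * t_hw_to_sw for ac in accesses)
    return gain - comm
```

A sequence is only worth moving when the profiled compute gain outweighs the communication charged for every boundary-crossing variable access.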

FIGURE 7.11

A Quenya model for an if-then-else statement. After Madsen et al. [Mad97].


t_{s→h} represents the software-to-hardware communication time, and t_{h→s} represents the hardware-to-software communication time. The area penalty for moving a BSB sequence to hardware is the sum of the hardware areas of the individual BSBs.

LYCOS evaluates sequences of BSBs and tries to find a combination of nonoverlapping ones that gives the largest possible speedup and that fits within the available area. A sequence has some BSBs mapped to hardware and others mapped to software. With proper construction of a table, this problem can be solved using dynamic programming. The table is organized into groups such that all members of a group have the same BSB as the last member of the sequence. Each entry gives the area and speedup for a sequence. Once some BSBs are chosen for the solution, other sequences that are chosen cannot include those BSBs, since they have already been allocated. The table allows sequences that end at certain BSBs to be easily found during search.
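The selection problem can be sketched as a small dynamic program over (BSB index, remaining area). The sequence tuples and integer area units below are assumptions made for illustration; grouping candidates by their last BSB mirrors the table organization described above:

```python
def best_speedup(seqs, n_bsbs, area_budget):
    """Choose nonoverlapping BSB sequences maximizing total speedup within
    an area budget. seqs: tuples (start, end, area, speedup) with inclusive
    0-based BSB indices; areas and the budget are small integers."""
    # dp[i][a] = best speedup using only BSBs < i within area a
    dp = [[0] * (area_budget + 1) for _ in range(n_bsbs + 1)]
    for i in range(1, n_bsbs + 1):
        for a in range(area_budget + 1):
            dp[i][a] = dp[i - 1][a]  # leave BSB i-1 in software
            for (s, e, area, sp) in seqs:
                if e == i - 1 and area <= a:  # sequences ending at BSB i-1
                    dp[i][a] = max(dp[i][a], dp[s][a - area] + sp)
    return dp[n_bsbs][area_budget]
```

Because each candidate is indexed by its final BSB, the inner lookup only ever combines sequences that cannot overlap, which is exactly the property the table construction is meant to guarantee.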

Xie and Wolf [Xie00] used high-level synthesis to more accurately estimate performance and area of accelerators during co-synthesis. They targeted a bus-based architecture with multiple CPUs and multiple accelerators. After finding an initial feasible solution, they first iteratively reduced the number of accelerators and the cost of CPUs; they then reduced accelerator cost using more accurate analysis; finally,

FIGURE 7.12

A Quenya model for a while statement. After Madsen et al. [Mad97].


they allocated and scheduled tasks and data transfers. When choosing whether to allocate a task to an accelerator or to a CPU, they used two heuristics. First, if the speed difference between the CPU and accelerator implementations was small, they chose the task as a candidate for software implementation. They also chose tasks that were not on the critical path as candidates for software implementation.

The accelerator cost-reduction phase looks at two metrics of the schedule. Global slack is the slack between the deadline and the completion of the tasks; local slack is the slack between the accelerator's completion time and the start time of its successor tasks. This phase relies on the use of a high-level synthesis tool that, given the number of cycles in which an accelerator has to finish, can generate a register-transfer implementation that meets that performance goal and is also area efficient.

This phase first tries to replace each accelerator with the smallest possible implementation of that accelerator. If the schedule is still feasible, they keep that allocation; otherwise they replace it with the accelerator's fastest implementation. They next calculate the global slack and use it to determine the amount by which accelerators can be slowed down. Each accelerator that uses its fastest implementation is slowed down by the sum of the global and local slacks; other accelerators are slowed down by the local slack. High-level synthesis is used to generate the implementation that meets this time bound; from the synthesis result, the accelerator's area can be accurately estimated. For each accelerator, the minimum-cost implementation at the chosen speed is selected.

The Serra system [Moo00] combines static and dynamic scheduling of tasks. Static scheduling, which is handled by a hardware executive manager, is efficient for low-level tasks. Dynamic scheduling, which is handled by a preemptive static-priority scheduler, handles variations in execution time and event arrival time. Serra targets an architecture with a single CPU and multiple accelerators. Serra represents the task set as a forest of directed acyclic graphs (DAGs). Each task graph has a never set that specifies tasks that cannot execute at the same time. The never set makes scheduling NP-complete; Serra finds a schedule using a dynamic-programming-style heuristic. The scheduler works from the end of the schedule backward, finding a good schedule given the tasks already scheduled.

7.3.4 Co-synthesis of general multiprocessors

In this section, we reconsider hardware/software co-design for more general multiprocessor architectures. Many useful systems can be designed by hardware/software partitioning algorithms based on the CPU + accelerator template. Hardware/software partitioning can also be used to design PEs that are part of larger multiprocessors. But if we want to design a complete application-specific multiprocessor system, we need to use more general co-synthesis algorithms that do not rely on the CPU + accelerator template.

In the most general case, all these tasks are related. Different partitions of functionality into processes clearly change scheduling and allocation. Even if we choose a partitioning of functions, scheduling, allocating, and binding are closely related.

Cost reduction using

high-level synthesis

Static and dynamic

scheduling

The Gordian knot of

co-synthesis

360 CHAPTER 7 System-Level Design and Hardware/Software Co-design

Page 21: CHAPTER System-Level Design and Hardware/Software 7 Co-design · Hardware/Software Co-design 7 CHAPTER POINTS † Electronic system-level design † Hardware/software co-synthesis

We want to choose the processing element for a process, both the general allocation and the binding to a specific type, based on the overall system schedule and when that process has to finish. But we can't determine the schedule and the completion time of a process until we at least choose an allocation and most likely a binding. This is the Gordian knot that co-synthesis designers must face: a set of intertwined problems that must somehow be unraveled.

Kalavade and Lee [Kal97] developed the Global Criticality/Local Phase (GCLP) algorithm to synthesize complex embedded system designs. Their algorithm performs the co-synthesis operations of resolving whether a task should be implemented in hardware or software and determining the schedule of the tasks in the system; it also allows for several different hardware and/or software implementations of a given task and selects the best implementation for the system. The application is specified as a DAG.

The target architecture includes a CPU and a custom data path. The hardware is limited to a maximum size and the software is limited to a maximum memory size. The various hardware and software implementations of each task are precharacterized. Each node in the task graph has a hardware implementation curve and a software implementation curve that describe hardware and software implementation costs, respectively, for various bins that represent different implementations.

Their GCLP algorithm adaptively changes the objective at each step to try to optimize the design. It looks at two factors: the global criticality and the local phase. The global criticality is a global scheduling measure. It computes the fraction of as-yet unmapped nodes that would have to be moved from software to hardware implementations in order to meet the schedule's deadline. The global criticality is averaged over all the unmapped nodes at each step.
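One plausible way to compute such a fraction is sketched below; the greedy feasibility estimate is a stand-in for Kalavade and Lee's actual schedule-based computation, and the node times in the test are invented:

```python
def global_criticality(unmapped, sw_time, hw_time, slack_needed):
    """Estimate the fraction of unmapped nodes that must move to hardware
    to recover `slack_needed` time. Greedily counts moves, taking the
    largest software-to-hardware savings first (illustrative heuristic)."""
    if not unmapped:
        return 0.0
    savings = sorted((sw_time[n] - hw_time[n] for n in unmapped),
                     reverse=True)
    recovered, moved = 0, 0
    for s in savings:
        if recovered >= slack_needed:
            break
        recovered += s
        moved += 1
    return moved / len(unmapped)
```

A value near 1 means the schedule is globally critical and the per-step objective should favor speed; a value near 0 lets the algorithm optimize area instead.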

The local phase computation improves hardware/software mapping. Heuristics are used to determine whether a node is best implemented in hardware or software; bit manipulations have a bias toward hardware implementation and extensive memory operations have a bias toward software. These heuristics are termed repelling forces: for example, bit manipulations repel a node away from software implementation. The algorithm uses several repeller properties, each with its own repeller value. The sum of all the repeller properties for a node gives its repeller measure.

The algorithm computes extremity measures to help optimize mappings. A software extremity node has a long runtime but a small hardware area cost. Similarly, a hardware extremity node has a large area cost but a small software execution time.

The co-synthesis algorithm iteratively maps nodes in the task graph. After selecting a node to map, it classifies the node as a repeller, an extremity, or a normal node.

SpecSyn [Gon97b; Gaj98] is designed to support a specify–explore–refine methodology. As shown in Figure 7.13, SpecSyn transforms functional specifications into an intermediate form known as SLIF. A variety of tools can then work on the design. A refiner generates software and hardware descriptions. SpecSyn represents designs using a program-state machine model, which uses a statechart-like description that allows complex sequential programs at the leaf states. The SLIF representation is


annotated with attributes such as area, profiling information, number of bits per transfer, and so on.

The allocation phase can allocate standard or custom processors, memory, or buses. Partitioning assigns operations in the functional specification to allocated hardware units. During partitioning, performance is estimated using parameters such as bus frequency, bit width of transfers, and profiling values. Hardware size is estimated using estimates for the processing elements, buses, and so on, while software size is estimated by compiling code into generic three-operand instructions, then estimating code size for a specific processor by multiplying generic code size by a processor-specific code size coefficient.

A refined design adds detail but is simulatable and synthesizable, so its components can be used for further verification and synthesis. A variety of refinements can be performed, including:

• Control-related refinement: this preserves the execution order of a specification when a behavior is split among several behaviors during partitioning. Signals, either unidirectional or handshaking, must be added to ensure that the two modules perform their respective operations in the proper order.

• Data-related refinement: this updates the values of variables that are shared by several behaviors. For example, as shown in Figure 7.14, a bus protocol can be used to send a value from one processor to another.

• Architecture-related refinements: these add behaviors that resolve conflicts and allow data transfer to happen smoothly. For example, a bus arbiter can be inserted when several behaviors may use the bus at the same time. Bus interfaces are inserted when message passing is used for data communication.

Wolf [Wol97b] developed a co-synthesis algorithm that performed scheduling, allocating, and binding for architectures with arbitrary topologies. The application is described as a task graph with multiple tasks. The set of processing elements

FIGURE 7.13

The SpecSyn system [Gaj98].


that can be used in the design is described in a technology table that gives, for each PE and each process, the execution time of that process on each PE as well as the total cost of the processing elements. No initial template for the architecture is given. Synthesis proceeds through several steps.

1. An initial processor allocation and binding is constructed by selecting for each process the fastest PE for that process in the technology table. If the system cannot

FIGURE 7.14

Data refinement in SpecSyn [Gon97b]. (a) Variable x mapped to a memory. (b) Data access implemented by bus protocols.


be scheduled in this allocation, then there is no feasible solution. The initial schedule tells the algorithm the rates at which processing elements must communicate with each other.

2. Processes are reallocated to reduce PE cost. This step is performed iteratively by first pairwise merging processes to try to eliminate PEs, then moving processes to balance the load. These operations are repeated until the total cost of the hardware stops going down. At the end of this step, we have a good approximation of the number and types of PEs required.

3. Processes are reallocated again to minimize inter-PE communication. This step allows us to minimize communication bandwidth, which is an important aspect of design cost. At the end of this step, we have completely determined the number and types of PEs in the architecture.

4. We then allocate communication channels as required to support the inter-PE communication.

5. Finally, we allocate input/output devices, as required, to support the I/O operations required by the processes.
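Step 1 of this flow can be sketched directly from a technology table; the table contents are hypothetical, and None marks an infeasible process-to-PE mapping:

```python
def initial_binding(tech_table):
    """For each process, pick the PE type with the smallest execution time
    in the technology table (the fastest feasible binding)."""
    binding = {}
    for process, times in tech_table.items():
        feasible = {pe: t for pe, t in times.items() if t is not None}
        binding[process] = min(feasible, key=feasible.get)
    return binding
```

If even this all-fastest binding cannot be scheduled, no slower binding can be, which is why step 1 doubles as the feasibility test for the whole synthesis run.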

Xie and Wolf [Xie00] coupled hardware/software co-synthesis to a high-level synthesis engine that used integer programming, rather than heuristic graph algorithms, to schedule and allocate hardware. This engine was fast enough to be used as a subroutine during co-synthesis. They developed a performance analysis tool based on the high-level synthesis engine. At the start of synthesis, they constructed an initial solution with every task on the fastest PE, then iteratively improved the design.

The first phase tries to move tasks from the accelerators to CPUs and also to reduce the cost of CPUs; these procedures analyze slack to find good candidates for movement. The second phase reduces the cost of the accelerators. The system calculates slacks in the system schedule and assigns them to various accelerators; it uses the performance analysis tool to determine whether each accelerator can be slowed down by the prescribed amount and the area of the modified accelerator.

Large sets of tasks present additional problems for co-synthesis. Dave et al. [Dav99a] developed the COSYN system to synthesize heterogeneous distributed embedded systems from large task graphs. COSYN allows task graphs to be formulated using repeated tasks: a prototype task is defined and then copied many times. Communication systems often have many copies of identical tasks; each task represents a call. This technique is also useful in other types of large real-time systems.

COSYN task graphs are directed sets of processes. Each task graph has an earliest start time, period, and deadline. The implementation is built from PEs and communication links. Several tables and vectors define the tasks and their underlying hardware.

• A technology table defines the execution time of each process on each feasible PE.
• A communication vector is defined for each edge in the task graph; the value of the ith entry defines the time required to transmit the data sent from source to sink over the ith type of communication link.

• A preference vector is indexed by the processing element number; the ith value is 0 if the process cannot be mapped onto the ith PE and 1 if it can be.


364 CHAPTER 7 System-Level Design and Hardware/Software Co-design


• An exclusion vector is indexed by process number; the ith value is 0 if this process and process i cannot be mapped onto the same processing element, and 1 if they can co-exist on the same processor.

• An average power vector for each process defines the average power consumption of the task on each type of PE.

• A memory vector for each process consists of three scalar values: program storage, data storage, and stack storage. There is one memory vector for each task.

• The preemption overhead time is specified for each processor.
• The set of processes allocated to the same PE is called a cluster.
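The tables and vectors above can be captured in a few small records. The field names and types below are assumptions for the sketch, not COSYN's actual data structures.

```python
# Illustrative encoding of the COSYN problem-formulation tables and vectors.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ProcessSpec:
    # Technology table row: execution time of this process on each PE type.
    exec_time: Dict[str, float]
    # Preference vector: 1 if the process may be mapped onto the PE type, else 0.
    preference: Dict[str, int]
    # Average power vector: average power drawn on each PE type.
    avg_power: Dict[str, float]
    # Memory vector: (program storage, data storage, stack storage).
    memory: Tuple[int, int, int]

@dataclass
class EdgeSpec:
    # Communication vector: transfer time of the edge's data on each link type.
    comm_time: Dict[str, float]

def feasible_pes(proc: ProcessSpec) -> List[str]:
    """PE types onto which the process may be mapped, per its preference vector."""
    return [pe for pe, ok in proc.preference.items() if ok]
```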

COSYN will adjust the periods of tasks by a small amount (3% by default) to reduce the length of the hyperperiod. The number of executions of a task in the hyperperiod causes problems: each execution must be checked. By adjusting the period of a task that has many copies in the hyperperiod, COSYN can reduce the number of times it must consider that task in the hyperperiod; COSYN uses this count to rank tasks for period adjustment. Period adjustment stops when the total number of executions of a task in the hyperperiod falls below a threshold.
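The hyperperiod-shrinking idea can be illustrated with a small sketch. The rounding rule and the 3% bound below are a simplified stand-in for COSYN's actual adjustment policy.

```python
# Shrink a period by at most 3% (rounding it down to a multiple of a
# divisor-friendly base) so the hyperperiod, and with it the number of task
# executions that must be checked, becomes much smaller.
from math import gcd
from functools import reduce

def lcm(a, b):
    return a * b // gcd(a, b)

def hyperperiod(periods):
    """Least common multiple of all task periods."""
    return reduce(lcm, periods)

def adjust_period(period, base, max_shrink=0.03):
    """Round `period` down to the nearest multiple of `base` if that costs
    no more than `max_shrink` of the period; otherwise leave it alone."""
    adjusted = (period // base) * base
    if adjusted > 0 and (period - adjusted) / period <= max_shrink:
        return adjusted
    return period
```

For example, shrinking a 103-time-unit period to 100 (a 2.9% change) collapses a hyperperiod of 10,300 with a 100-unit task down to 100.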

COSYN also uses an association graph to manage the representation of multiple executions of a task in the hyperperiod. If all tasks have deadlines shorter than their periods, then the association graph need only be one-dimensional; its entries include the PE to which the task is allocated and its priority, deadline, best-case projected finish time, and worst-case projected finish time. If a task’s deadline can extend beyond its period, then the association graph must be two-dimensional because there may be more than one instance of the task executing at a time. The second dimension indexes over the simultaneously executing copies of the task.

The COSYN synthesis algorithm is summarized in Figure 7.15. COSYN performs three major phases: (1) it clusters tasks to reduce the search space for allocation; (2) it allocates tasks to processing elements; and (3) the tasks and processes are scheduled.

assign priorities by deadline;
form association array;
form clusters of processes;
initialize architecture to empty;
allocate clusters;
foreach unallocated cluster Ci {
    form allocation array for Ci;
    foreach allocation in allocation array {
        schedule allocated clusters;
        evaluate completion time, energy, power;
        if deadline is met in best case then {
            save current architecture;
            break;
        } else
            save best allocation;
        tag cluster Ci as allocated;
    }
}

FIGURE 7.15

The COSYN synthesis algorithm [Dav99].


7.3 Hardware/software co-synthesis algorithms 365


COSYN forms clusters to reduce the combined communication costs of a set of processes that is on the critical path. This is done by putting the set of processes on the same PE. It first assigns a preliminary priority to each task based on its deadline. It then uses a greedy algorithm to cluster processes: it starts with the highest-priority task, puts it in a cluster, and walks through the fan-in of the task to find further tasks for the cluster. To avoid overly unbalanced loads on the PEs, a threshold limits the maximum size of a cluster.
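A minimal sketch of this greedy, deadline-driven clustering follows; the data layout and the size threshold are illustrative assumptions.

```python
# Greedy clustering in the spirit of COSYN: seed each cluster with the
# highest-priority (earliest-deadline) unclustered task, then grow the
# cluster along the task's fan-in, up to a size threshold.

def cluster_tasks(deadlines, fan_in, threshold=3):
    """deadlines: dict task -> deadline; fan_in: dict task -> predecessor list."""
    unclustered = set(deadlines)
    clusters = []
    while unclustered:
        # Highest priority = earliest deadline among remaining tasks.
        seed = min(unclustered, key=lambda t: deadlines[t])
        cluster, frontier = [], [seed]
        while frontier and len(cluster) < threshold:
            t = frontier.pop(0)
            if t in unclustered:
                cluster.append(t)
                unclustered.discard(t)
                frontier.extend(fan_in.get(t, []))  # walk the fan-in
        clusters.append(cluster)
    return clusters
```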

After clusters are formed, they are allocated, with allocation decisions driven by the dollar cost of the allocation. Clusters are allocated in order of priority, starting with the highest-priority clusters. COSYN checks for the compatibility of connections between PEs during allocation.

After allocation, the processes are scheduled using priorities. COSYN concentrates on scheduling the first copy of each task. The association array is used to update the start and finish times of the remaining copies of the task; other copies need to be explicitly scheduled only when an execution slot is taken by a higher-priority task.

COSYN allows the supply voltages of parts to be mixed so that some noncritical sections of the design can be run at lower power levels. It checks the compatibility of connections between components at different supply voltages.

COSYN evaluates the feasible allocations by comparing the sums of their worst-case finish times. It chooses the solution with the largest total worst-case finish times for the processes, since the longest finish times are usually associated with the lowest-cost hardware.

COSYN tries to allocate concurrent instances of tasks to allow pipelining. As shown in Figure 7.16, several instances of a task may execute concurrently; the instances of each process in the task are identified by a superscript. If we allocate t1 to PE A, t2 to PE B, and t3 to PE C, then we pass data from process to process in the task at low cost and pipeline the execution of the task instances.

Dave and Jha [Dav98] also developed methods for hierarchical co-synthesis. As shown in Figure 7.17, their COHRA system both accepts hierarchical task graphs and produces hierarchically organized hardware architectures. A hierarchical task is a node in the task graph that contains its own task graph. A hierarchical hardware architecture is built from several layers of PEs that are composed in an approximate tree structure, though some nontree edges are possible.

COHRA’s synthesis algorithm resembles COSYN’s algorithm: clustering, then allocation, then scheduling. COHRA also uses association tables to keep track of multiple copies of tasks.

7.3.5 Multi-objective optimization
Embedded system designs must meet several different design criteria. The traditional operations research approach of defining a single objective function and perhaps some minor objective functions, along with design constraints, may



not adequately describe the system requirements. The economist Vilfredo Pareto (http://www.en.wikipedia.org, Pareto efficiency) developed a theory for multi-objective analysis known as Pareto optimality. An important concept provided by this theory is the way in which optimal solutions are judged: an optimal solution cannot be improved without making some other part of the solution worse.
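Pareto optimality reduces to a simple dominance test; a minimal sketch, assuming all objectives are minimized (for example, cost and power):

```python
# A solution is noninferior (Pareto-optimal) if no other solution is at least
# as good in every objective and strictly better in at least one.

def dominates(a, b):
    """True if solution a dominates b: all objectives <=, at least one <."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]
```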

D’Ambrosio and Hu [D’Am94] developed the GOPS system to co-synthesize real-time embedded systems; they called this level of design configuration synthesis because it does not use detailed models of the processing elements.

GOPS uses a feasibility factor to estimate the schedulability of a set of tasks. Feasibility is determined relative to the speed of the PE and is defined in terms of several throughput factors. The upper-bound throughput TRU is an upper bound on the throughput required for the PE to finish the tasks. If rate-monotonic scheduling is used, then TRU for a set of N tasks is

TR_U = \frac{1}{N\left(2^{1/N} - 1\right)} \sum_{1 \le i \le N} \frac{c_i}{d_i - a_i}   (EQ 7.7)

[Figure: three overlapping task instances in time, with processes t1, t2, and t3 of each instance allocated to PEs A, B, and C, respectively.]

FIGURE 7.16

Allocating concurrent task instances for pipelining.



where ci, di, and ai are the computation time, deadline, and activation time, respectively, of task i. If earliest-deadline-first scheduling is used, then

TR_U = \sum_{1 \le i \le N} \frac{c_i}{d_i - a_i}   (EQ 7.8)
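The two upper-bound throughput estimates can be computed directly from the task parameters; a sketch, with tasks given as (computation time, deadline, activation time) triples:

```python
# Upper-bound throughput TRU for earliest-deadline-first (EQ 7.8) and for
# rate-monotonic scheduling (EQ 7.7, using the Liu-Layland utilization bound
# N*(2^(1/N) - 1) in the denominator).

def tru_edf(tasks):
    """tasks: list of (c, d, a) triples."""
    return sum(c / (d - a) for c, d, a in tasks)

def tru_rm(tasks):
    n = len(tasks)
    return tru_edf(tasks) / (n * (2 ** (1 / n) - 1))
```

Because the rate-monotonic utilization bound is below 1, the RM estimate always demands at least as much throughput as the EDF estimate for the same task set.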

However, because these bounds are loose, D’Ambrosio and Hu provided some tighter bounds. They first showed that a set of n tasks arranged in ascending order of deadlines cannot be feasibly scheduled if

\sum_{1 \le i \le n} \frac{k_i c_i}{d_n - a_i} > TR_P   (EQ 7.9)

where

k_i = \begin{cases} \lfloor (d_n - a_i)/p_i \rfloor & \text{if } \lceil (d_n - a_i)/p_i \rceil \, p_i + d_i > d_n \\ \lceil (d_n - a_i)/p_i \rceil & \text{otherwise} \end{cases}   (EQ 7.10)

They also showed that

\sum_{1 \le i \le n} \frac{h_i c_i}{d_n - a_i} > TR_P   (EQ 7.11)

where

h_i = \begin{cases} k_i - \lfloor (a_n - a_i)/p_i \rfloor & \text{if } a_i < a_n \\ k_i & \text{otherwise} \end{cases}   (EQ 7.12)

[Figure: a hierarchical task graph whose top-level task T1 contains subtasks ta, tb, and tc, mapped onto a hardware architecture of PEs arranged in an approximate tree.]

FIGURE 7.17

Hierarchical specifications and architectures.


Based on these results, they showed that TRL can be computed as

TR_L = \max_{1 \le n \le N} \left( \sum_{1 \le i \le n} \frac{k_i c_i}{d_n - a_i}, \; \sum_{1 \le i \le n} \frac{h_i c_i}{d_n - a_i} \right)   (EQ 7.13)

The actual throughput required to finish all tasks on time is TRP, which lies between TRL and TRU. They define the feasibility factor λP in terms of these throughput estimates:

\lambda_P = \begin{cases} \dfrac{TR_P - TR_L}{TR_U - TR_L} & \text{if } (TR_P - TR_L) < (TR_U - TR_L) \\ 1 & \text{otherwise} \end{cases}   (EQ 7.14)

During optimization, the feasibility factor can be used to prune the search space. Candidate architectures with negative feasibility factors can be eliminated as infeasible.

The feasibility factor is also used as an optimization objective. The two criteria of GOPS are cost and feasibility factor.
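A literal reading of the feasibility-factor definition and its use in pruning can be sketched as follows; the function names and tuple layout are illustrative, not GOPS's actual interface.

```python
# Feasibility factor (EQ 7.14): the normalized position of the required
# throughput TRP between the lower bound TRL and the upper bound TRU,
# clamped to 1. A negative value signals an infeasible candidate.

def feasibility_factor(tr_p, tr_l, tr_u):
    if (tr_p - tr_l) < (tr_u - tr_l):
        return (tr_p - tr_l) / (tr_u - tr_l)
    return 1.0

def prune(candidates):
    """Drop candidate (TRP, TRL, TRU) triples with negative feasibility."""
    return [c for c in candidates if feasibility_factor(*c) >= 0.0]
```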

Genetic algorithms have been used for co-synthesis. This class of algorithms is modeled on genes and mutations. A solution to the problem is represented by a string of symbols. Strings can be modified and combined in various ways to create new designs. There are three basic types of moves in genetic algorithms:

1. A reproduction step makes a copy of a string.
2. A mutation step randomly changes a string.
3. A crossover step interchanges parts of two strings.

The new designs are then evaluated for quality and possibly reused to create still further mutations. One advantage of genetic algorithms is that they can easily handle complex objective functions.
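The three moves can be illustrated on plain symbol strings; the selection policy that drives them in a real co-synthesis tool is omitted here.

```python
# The three genetic moves in miniature; strings are lists of symbols.
import random

def reproduce(s):
    """Reproduction: copy a string unchanged."""
    return list(s)

def mutate(s, symbols, rng=random):
    """Mutation: randomly change one position of the string."""
    t = list(s)
    t[rng.randrange(len(t))] = rng.choice(symbols)
    return t

def crossover(s1, s2, point):
    """Crossover: interchange the tails of two strings at a cut point."""
    return s1[:point] + s2[point:], s2[:point] + s1[point:]
```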

Dick and Jha [Dic97; Dic98] developed a genetic algorithm called MOGAC for hardware/software co-synthesis. Their algorithm can generate architectures with multiple processors and communication links; the architecture is optimized for both performance and power consumption. The synthesis procedure characterizes PEs as either independent or grouped. A processor can perform only one task at a time; an IC can have multiple cores on one chip so that a chip can simultaneously run several tasks.

The tasks are characterized by technology tables that give the worst-case execution time and power consumption of each task on each feasible processor. MOGAC also keeps arrays that give the worst-case performance, average power consumption, and peak power consumption of each task on each feasible core.

A communication link has several attributes: packet size, average power consumption per packet, worst-case communication time per packet, cost, number of contacts (the number of points connected by the link), number of pins, and idle power consumption. Buses are modeled as communication links with more than two contacts.



MOGAC keeps track of multiple objective functions and ranks designs on each objective function. The designer also specifies hard constraints on design characteristics. The genetic model for the design has several elements.

• The processing element allocation string lists all the PEs and their types that are allocated in the architecture. The grouped processing element allocation string records the allocation of grouped PEs.

• The task allocation string shows which processing element has been assigned to each task.

• The link allocation string shows how communication within the tasks is mapped to communication links. The link connectivity string records how chips and independent PEs are connected to communication links.

• The IC allocation string records how tasks are assigned to chips.
• The PE allocation string shows how PEs are allocated to chips; in general, the architecture can include multiple chips with different numbers of PEs on each chip.

The MOGAC optimization procedure is summarized in Figure 7.18. MOGAC constructs an initial solution and then repeatedly performs the evolve-evaluate cycle. The evaluation phase, consisting of evaluation and ranking, determines which solutions are noninferior, that is, which solutions are as good as any other for some design objective. Noninferior solutions are more likely to be selected for evolution, but some inferior solutions will be selected as well. If the optimization procedure is not done, some low-rank solutions are terminated while other high-rank solutions are replicated. This pool of solutions is modified using crossover and mutation.

To reduce the time required to search the design space, MOGAC organizes possible solutions into clusters. Each member of a cluster has the same allocation

[Figure: flowchart: initialize; then repeatedly evaluate, rank, prune, reproduce/terminate, and crossover/mutate until done.]

FIGURE 7.18

The MOGAC optimization procedure [Dic98].



string but different link allocations and assignments. Some operations are applied to the cluster and affect every member of the cluster. Other operations are applied only to a single solution within the cluster.

Real-time performance constraints are treated as hard constraints. The system’s real-time constraint violation is the sum of all the violations of all nodes in the system’s task graphs. Price and average power consumption are treated as soft constraints. Solutions are ranked by their Pareto rank, which is the number of other current solutions that do not dominate it in the multi-objective space. The rank of a cluster is defined by

clust\_domination(x, y) = \max_{a \in nis(x)} \sum_{b \in nis(y)} dom(a, b)   (EQ 7.15)

rank[x] = \sum_{y \in \text{set of clusters}, \, y \ne x} clust\_domination(x, y)   (EQ 7.16)

where nis(x) is the set of noninferior solutions in x and dom(a, b) is 1 if a is not dominated by b and 0 if a is dominated by b.

A variable called solution_selection_elitism is used to weight the probability that a solution is selected for reproduction; its value monotonically increases during the optimization procedure. Near the end of its run, MOGAC uses greedy procedures to help it converge on local minima. The number of crossovers and mutations per generation is specified by the user.

Yang et al. [Yan01] developed an energy-aware task-scheduling algorithm for multiprocessors. Their method combines design-time and runtime scheduling methods. At design time, a scheduler evaluates different scheduling and allocation choices for the set of threads to run on the multiprocessor. The cost function is described as a Pareto curve over the performance-energy space. They use a genetic algorithm to find a schedule and assignment for the tasks. Yang et al. generate a table that is used at runtime to select the best schedule. They use heuristic algorithms to select which of the available scheduling/allocation patterns should be used.

Dick and Jha [Dic04] developed the COWLS system to co-synthesize client-server systems that use low-power clients communicating over wireless links with servers. Allocation of wireless links and scheduling of data transfers over those links is an important part of this problem, since wireless links are both relatively expensive and consume large amounts of energy.

COWLS uses a parallel recombinative simulated annealing approach similar to that used in MOGAC. Clusters of candidate solutions are analyzed, mutated, and recombined to search a large solution space. It Pareto-ranks candidate solutions by finding whether one cluster of candidates is better than or equal in all solution dimensions to all the members of another cluster. It uses three costs to guide mutation of solutions: communication time, computation time, and utilization. COWLS ranks



candidate PEs during mutation by these costs; the candidate processing elements are sorted by their ranks and mutation favors highly ranked PEs.

Scheduling helps determine both the power consumption and timing of the solution. COWLS uses slack to assign task priorities, with tasks that have little slack assigned high priorities. Inter-task communication on the same PE is considered to be free, while communication between PEs uses the wireless links. The scheduler allocates communications to wireless links. The scheduler also models bus contention.

7.3.6 Control and I/O synthesis
Control is often central to communication between processes. Control operations also dominate most I/O routines. Control-dominated systems pose different challenges for co-synthesis.

The co-design finite-state machine (CFSM) model [Chi94] was developed to model control-dominated systems. In contrast to traditional FSMs, which operate synchronously, CFSMs have finite, nonzero, unbounded reaction times. The inputs and outputs of CFSMs are events. CFSMs are used as an intermediate form for languages such as Esterel, hardware description languages, and so on; compilation results in a network of communicating CFSMs.

Design partitioning assigns each component machine to either hardware or software implementation. A hardware implementation of a CFSM uses combinational logic guarded by latches to implement the next-state and output functions. A software implementation is created by generating another intermediate form known as an s-graph, then translating that model into C code. An s-graph is a DAG that is a reduced form of a control-flow graph.

Chou et al. [Cho98] developed a modal process model as a framework for the description and implementation of distributed control. A modal process can be in one of several modes; its I/O behavior depends on its current mode as well as the input events presented to it. Abstract control types (ACTs) define control operations with known properties. Examples of abstract control types include unify, which keeps the modes in several processes the same; mutual exclusion; and preemption. The mode manager interprets the constraints on process modes and implements ACT call requests. A mode manager can be implemented in either centralized or distributed form.

Chou et al. [Cho95b] developed a methodology to synthesize interfaces for embedded controllers. The I/O behavior is represented as control flow graphs. The designer manually selects which input/output tasks are to be implemented in hardware or software. They first generate hardware or software implementations of the I/O tasks. Next, they allocate I/O ports to the processes, which may require adding multiplexers to share ports. The algorithm can split an operation into several steps if the device data is wider than the bit width of the I/O device. If not enough I/O ports are available, they implement some devices using memory-mapped input/output. They then generate an I/O sequencer that ensures devices meet their response time and rate requirements.



Daveau et al. [Dav97] took an allocation-oriented approach to communication synthesis. A library describes the properties of the available communication units: the cost of a component, the protocol it implements, the maximum bus rate at which data can be transferred, a set of services it provides, and the maximum number of parallel communication operations it can perform simultaneously. These authors model the system behavior as a process graph. Each abstract channel has several constraints: the protocol it wants to use, the services it provides to the processes, and its average and peak transfer rates.

A communication unit can implement an abstract channel if it provides the required services, uses the right protocol, and has a large enough maximum data rate to satisfy at least the channel’s average communication rate. Synthesis builds a tree of all possible implementations, then performs a depth-first search of the tree. An operation allocates several abstract channels to the same communication unit.

7.3.7 Memory systems
Memory accesses dominate many embedded applications. Several co-synthesis techniques have been developed to optimize the memory system.

Li and Wolf [Li99] developed a co-synthesis algorithm that determines the proper cache size to tune the performance of the memory system. Their target architecture was a bus-based multiprocessor. As described in Section 4.2.5, they used a simple model for processes in the instruction cache. A process occupied a contiguous range of addresses in the cache; this assumes that the program has a small kernel that is responsible for most of the execution cycles. They used a binary variable ki to represent the presence or absence of a process in the cache: ki = 1 if process i is in the cache and 0 otherwise.

Their algorithm uses simulation data to estimate the execution times of programs. Co-synthesis would be infeasible if a new simulation had to be run for every new cache configuration. However, when direct-mapped caches change in size by powers of two, the new placement of programs in the cache is easy to determine. The example in Figure 7.19 shows several processes originally placed in a 1 KB direct-mapped cache. If we double the cache size to 2 KB, then some overlaps disappear but new overlaps cannot be created. As a result, we can easily predict cache conflicts for larger caches based on the simulation results from a smaller cache.
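The power-of-two argument is easy to check with a small sketch that maps each task's main-memory range into a direct-mapped cache; the byte-granularity model below is purely for illustration.

```python
# Map each task's contiguous main-memory range [start, end) into a
# direct-mapped cache of a given size, and find which pairs of tasks
# overlap in the cache. Doubling the cache size can only remove pairwise
# overlaps, never create new ones.

def cache_lines(start, end, cache_size):
    """Set of cache offsets occupied by the address range [start, end)."""
    return {addr % cache_size for addr in range(start, end)}

def conflicts(tasks, cache_size):
    """Pairs of tasks whose footprints overlap in the cache."""
    out = set()
    names = list(tasks)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cache_lines(*tasks[a], cache_size) & cache_lines(*tasks[b], cache_size):
                out.add((a, b))
    return out
```

With the address ranges from Figure 7.19, tasks a (2048-2560) and b (3200-3472) conflict in a 1 KB cache but not in a 2 KB cache.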

During co-synthesis, they compute the total system cost as

C(system) = \sum_{i \in CPUs} \left[ C(CPU_i) + C(icache_i) + C(dcache_i) \right] + \sum_{j \in ASICs} \left[ C(ASIC_j) + C(dcache_j) \right] + \sum_{k \in links} C(commlink_k)   (EQ 7.17)

where C(x) is the cost of component x.



The authors use a hierarchical scheduling algorithm that builds the full hyperperiod schedule using tasks, then individually moves processes to refine the schedule. Co-synthesis first finds an allocation that results in a feasible schedule, then reduces the cost of the hardware. To reduce system cost, it tries to move processes from lightly

[Figure: placements of tasks a (2048-2560), b (3200-3472), c (4384-4928), and d (6912-7296) in 1 KB and 2 KB direct-mapped caches; overlaps present in the 1 KB cache disappear in the 2 KB cache, and no new overlaps appear.]

FIGURE 7.19

How code placement changes with cache size [Li99].


loaded PEs to other processing elements; once all the processes have been removed from a PE, that processing element can be eliminated from the system. It also tries to reduce the cost of a PE by reducing its cache size. However, when processes are moved onto a PE, the size of that processing element’s cache may have to grow to maintain the schedule’s feasibility.

Because the execution time of a process is not constant, we must find a measure other than simple CPU time to guide allocation decisions. Dynamic urgency describes how likely a process is to reuse the cache state to reduce misses:

DU(task_i, PE_i) = SU(task_i) - \max(ready(task_i) - available(task_i)) + \left[ medianWCET_{base}(task_i) - WCET(task_i, PE_i) \right]   (EQ 7.18)

In this formula, SU is the static urgency of a task, or the difference between its execution time and its deadline; the worst-case execution times are measured relative to the current cache configuration.

Wuytack et al. [Wuy99] developed a methodology for the design of memory management for applications, such as networking, that require dynamic memory management. Their methodology refines the memory system design through several steps.

1. The application is defined in terms of abstract data types (ADTs).
2. The ADTs are refined into concrete data structures. The proper data structures are chosen based on size, power consumption, and so on.
3. The virtual memory is divided among one or more virtual memory managers. Data structures can be grouped or separated based on usage. For example, some data structures that are linked to each other may be put together.
4. The virtual memory segments are split into basic groups. The groups are organized to allow parallel access to data structures that require high memory performance.
5. Background memory accesses are ordered to optimize memory bandwidth; this step looks at scheduling conflicts.
6. The physical memory is allocated. Multiport memory can be used to improve memory bandwidth.

7.3.8 Co-synthesis for reconfigurable systems
Field-programmable gate arrays (FPGAs) are widely used as implementation vehicles for digital logic. One use of SRAM-based FPGAs is reconfigurable systems: machines whose logic is reconfigured on the fly during execution. As shown in Figure 7.20, an FPGA may be able to hold several accelerators; the logic for these accelerators is embedded in the two-dimensional FPGA fabric. The configuration can be changed during execution to remove some accelerators and add others.

Reconfiguration during execution imposes new costs on the system.

• It takes time to reconfigure the FPGA. Reconfiguration times may be in the milliseconds for commercial FPGAs. Some experimental FPGA architectures can be reconfigured in a small number of clock cycles.



• Reconfiguration consumes energy.
• Not all combinations of accelerators can be accommodated in the FPGA simultaneously. Scheduling is influenced by which combinations of accelerators can fit at any given time.

Determining the feasibility of a schedule is much more difficult for a reconfigurable system than for a traditional digital system. If we want to schedule a task at a given time, we have to first determine whether its accelerator is already resident in the FPGA. If not, we have to determine what combination of accelerators must be removed to make room for the new one (a task known in IC design as floorplanning [Wol09]). Reconfiguration time must be added to the schedule’s execution time and reconfiguration energy must be added to the cost of the schedule.
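A toy sketch of this feasibility bookkeeping follows; the FIFO eviction policy and all names are illustrative assumptions, not a published algorithm.

```python
# If the accelerator for the next task is not resident, evict accelerators
# (oldest first, purely for illustration) until the new one fits, and charge
# the reconfiguration time to the schedule.

def schedule_task(resident, capacity, accel, size, exec_time, reconfig_time):
    """resident: dict accel -> area currently in the FPGA (insertion-ordered).
    Returns the elapsed time charged to the schedule for this task."""
    elapsed = exec_time
    if accel not in resident:
        while sum(resident.values()) + size > capacity:
            resident.pop(next(iter(resident)))   # evict the oldest accelerator
        resident[accel] = size
        elapsed += reconfig_time                 # pay for reconfiguration
    return elapsed
```

A real co-synthesis tool would also account for the two-dimensional floorplanning of the fabric and for reconfiguration energy, which this sketch ignores.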

CORDS [Dic98b] uses an evolutionary algorithm to synthesize programs onto reconfigurable platforms. Its basic algorithms are similar to those used by MOGAC. CORDS adds reconfiguration delay to the costs evaluated during optimization; reconfiguration delay must be adjusted as the state of the schedule changes. The dynamic priority of a task is equal to the negation of the sum of its slack and its reconfiguration delay. CORDS increases the dynamic priority of tasks with low reconfiguration times to encourage several similar tasks to be scheduled together, reducing total reconfiguration time.

The Nimble system [Li00] performs fine-grained partitioning that exploits instruction-level parallelism to map an algorithm onto a reconfigurable platform, which consists of an embedded CPU coupled to a reconfigurable data path. The details of the platform are configurable and are described in an architecture description language. The program to be implemented is represented as a control flow graph. Loops in the program may contain multiple kernels; profiling information is attached to the basic blocks and loop kernels.

[Figure: an FPGA holding accelerators M1, M2, and M3 in configuration 1, and M3 and M4 in configuration 2.]

FIGURE 7.20

Successive configurations of an FPGA.



The execution time of a loop that is implemented in whole or in part in hardware depends on several factors: the execution time of the hardware loop, the execution time of any software portion of the loop, the communication time between hardware and software, and the time required to configure the hardware unit on the FPGA. Configuration time depends on program state: if the hardware unit has been previously instantiated and not removed by subsequent activity, then no configuration time is required.

Synthesis concentrates on interesting loops that consume most of the execution time. Given an interesting loop, the portions of that loop selected as hardware candidates are determined by the execution and communication time, but not the configuration time, since the order of hardware loop body invocations is not yet known. Inter-loop selection determines the overall hardware/software allocation based on global costs. The synthesis algorithm walks through a graph of the loops and procedures; this graph is similar to the control dependence graph of Ferrante et al. [Fer87]. An example is shown in Figure 7.21.

This graph helps identify loops that compete. Two loops that have different first-level predecessors do not conflict in the trace and could be put into different clusters. Loops that share a common predecessor may compete with each other and are preferentially placed in the same cluster. Loops are clustered based on a bottom-up walk of the loop-procedure hierarchy graph, with the size of each cluster limited by a parameter based on the size of the FPGA.

FIGURE 7.21

An example loop-procedure hierarchy graph [Li00]. The graph for a wavelet image-compression program spans five levels, with procedures and loops for initialization (I), reading the image (R), the forward wavelet (FW), quantization (Q) and block quantization (BQ), run-length encoding (RLE), entropy encoding (E), and writing the compressed file (W); related loops are grouped into loop clusters.



7.4 Electronic system-level design

Electronic system-level (ESL) design has become widely accepted as a term for the design of mixed hardware/software systems such as systems-on-chips, driven from high-level modeling languages such as SystemC or MATLAB. ESL methodologies leverage existing hardware and software tools and design flows, concentrating on the refinement of an abstract system description into concrete hardware and software components.

An early representation for tasks in system-level design was the Gajski-Kuhn Y-chart [Gaj83], which was developed for VLSI system design at a time when hardware and software design were largely separate. As shown in Figure 7.22, the Y-chart combines levels of abstraction with forms of specification. Three forms of specification make up the Y: behavioral, structural, and physical. Levels of abstraction can be represented in each of these forms and so are drawn as circles around the center of the Y: system, register-transfer, gate, and transistor. This chart has been referenced and updated by many researchers over the years.

Nikolov et al. [Nik08] developed the Daedalus system for multimedia MPSoCs. They represent the application as a Kahn process network. Design space exploration uses high-level models of the major hardware components to explore candidate partitionings into hardware and software.

FIGURE 7.22

The Gajski-Kuhn Y-chart [Gaj83]. The three axes of the Y represent behavioral, structural, and physical specifications; concentric circles mark the system, register-transfer, gate, and transistor levels of abstraction.




Domer et al. [Dom08] used a three-level design hierarchy for their System-on-Chip Environment (SCE):

• Specification model: The system is specified as a set of behaviors interconnected by abstract communication channels.

• Transaction-level model (TLM): The behaviors are mapped onto the platform architecture, including processors, communication links, and memories. The execution of behaviors is scheduled.

• Implementation model: A cycle-accurate model in which the RTOS and hardware abstraction layer (HAL) are explicitly modeled.

Gerstlauer et al. [Ger09] extended the Y-chart concept to an X-chart for SoC design, as shown in Figure 7.23. The behavior and nonfunctional constraints together form the system specification. A synthesis process transforms that specification into an implementation consisting of a structure and a set of quality numbers that describe properties such as performance and energy consumption.

Kwon et al. [Kwo09] proposed a common intermediate code (CIC) representation for system-level design. Their CIC includes hardware and software representations. Software is represented at the task level using stylized C code that includes function calls to manage task-level operations, such as initialization, the main body, and completion. The target architecture is described using XML notation in a form known as an architectural information file. A CIC translator compiles the CIC code into executable C code for the tasks on the chosen processors. Translation is performed in four steps: translation of the API, generation of the hardware interface code, OpenMP translation, and generation of task-scheduling code.

Keinert et al. [Kei09] proposed SystemCoDesigner to automate design space exploration for the hardware and software components of an ESL design. They

FIGURE 7.23

The X-chart model for system-on-chip design [Ger09]: behavior and constraints enter synthesis, which produces a structure and a set of quality numbers.



represent the design problem using three elements: an architecture template describes the possible CPUs/function units and their interconnection; a process graph describes the application; and a set of constraints restricts the possible mappings. Multi-objective evolutionary algorithms explore the design space using heuristics derived from biological evolution.

Popovici et al. [Pop10] divided MPSoC design into four stages:

• System architecture: The system is partitioned into hardware and software components with abstract communication links. The types of hardware and software components are specified only as abstract tasks.

• Virtual architecture: Communication between tasks is refined in terms of the hardware-dependent software application programming interface (HdS API), which hides the details of the operating system and communication systems. Each task is modeled as sequential C code. Memory is explicitly modeled.

• Transaction-accurate architecture: Hardware/software interfaces are described in terms of the hardware abstraction layer. The operating system is explicitly modeled. The communication networks are explicitly modeled.

• Virtual prototype: The architectures of the processors are explicitly modeled. Communication is modeled by loads and stores. The memory system, including all caches, is fully modeled.

Their system-level synthesis system allows applications to be described in restricted forms of either Simulink or C. They model the virtual architecture and below using SystemC. Design space exploration between levels is guided by the designer.

7.5 Thermal-aware design

Thermal-aware design has become a prominent aspect of microprocessor and system-on-chip design due to the large thermal dissipation of modern chips. In very high-performance systems, heat may be difficult to dissipate even with fans and thermally controlled ambient environments. Many consumer devices avoid the use of cooling fans, although such fans have become common in laptop computers.

Heat transfer in integrated circuits depends on the thermal resistance and thermal capacitance of the chip, its package, and the associated heat sink. The mathematical models for the thermal resistance and capacitance of materials are isomorphic to those for electrical resistance and capacitance. Thermal RC models can be constructed to describe a heat transfer problem. These models can be solved in the same manner as electrical RC circuit models. Figure 7.24 shows the configuration of components used to transfer heat for a high-power chip: a heat spreader, made of a good thermal conductor like copper, is attached to the chip; a heat sink is attached to the heat spreader to transfer heat to the ambient environment.
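To make the RC analogy concrete, the sketch below integrates a three-node thermal ladder (chip, spreader, sink) with forward Euler. The element values are invented and the node granularity is far coarser than any real model; the point is only that heat flow is computed exactly like current through a resistor and temperature like voltage on a capacitor.

```python
# Toy thermal RC ladder solved like an electrical RC circuit.
def simulate(power_w, steps=200000, dt=1e-3):
    t = [0.0, 0.0, 0.0]      # temperatures above ambient: chip, spreader, sink
    c = [0.05, 0.5, 5.0]     # thermal capacitances (J/K), invented
    r = [0.4, 0.3, 0.8]      # resistances (K/W): chip-spreader, spreader-sink, sink-ambient
    for _ in range(steps):
        q01 = (t[0] - t[1]) / r[0]   # heat flow chip -> spreader, like I = V/R
        q12 = (t[1] - t[2]) / r[1]   # spreader -> sink
        q2a = t[2] / r[2]            # sink -> ambient
        t[0] += dt * (power_w - q01) / c[0]
        t[1] += dt * (q01 - q12) / c[1]
        t[2] += dt * (q12 - q2a) / c[2]
    return t

temps = simulate(20.0)   # a 20 W chip
```

At steady state the same heat flow passes through every resistance, so the chip settles at power times the sum of the resistances above ambient, exactly as the electrical analogy predicts.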

Skadron et al. [Ska04, Ska08] developed the HotSpot tool for thermal modeling of chips. A chip is modeled as a set of architectural units. The die, package, and heat sink




are all modeled as separate elements. An RC network is constructed to describe the heat flow between elements. The activity of the architectural units determines the amount of heat generated in each unit. The flow of that heat throughout the system can be found to determine the temperature of each of the architectural units as a function of time.

Xie and Hung [Xie06] developed algorithms for temperature-aware task scheduling and allocation based on the work of Xie and Wolf [Xie01]. They formulated two heuristics, one aimed at spatial characteristics of temperature and the other at temporal characteristics. In both cases, they used HotSpot to analyze the thermal characteristics of the chip during floorplanning. They analyzed spatial behavior by adding temperature information to the calculation of the dynamic criticality of a task:

DC(task_i, PE_j) = SC(task_i) − WCET(task_i, PE_j) − max[avail(PE_j), ready(task_i)] − temp (EQ 7.19)

In this formula, temp can be defined as either the average chip temperature or the maximum temperature at any point on the chip. A scheduling and allocation for a task that reduces chip temperature will increase its dynamic criticality. Xie and Hung incorporated temporal behavior by allowing the allocation of a task to a PE to change over time. This heuristic balances loads across PEs to balance thermal generation across different parts of the chip.
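Under one reading of EQ 7.19, in which the temperature term is subtracted directly from the criticality, a toy calculation (all numbers invented) shows how a cooler candidate placement becomes more attractive:

```python
# Hypothetical illustration of temperature-augmented dynamic criticality.
def dynamic_criticality(sc, wcet, avail, ready, temp):
    """Higher DC = schedule sooner; hotter placements are penalized."""
    return sc - wcet - max(avail, ready) - temp

# Same task, same PE timing, two different thermal conditions:
hot  = dynamic_criticality(sc=100.0, wcet=12.0, avail=5.0, ready=3.0, temp=70.0)
cool = dynamic_criticality(sc=100.0, wcet=12.0, avail=5.0, ready=3.0, temp=45.0)
```

The cooler placement ends up with the larger dynamic criticality, which is the behavior the heuristic relies on.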

Murali et al. [Mur07] formulated an optimization problem for temperature-aware dynamic voltage and frequency scaling of multicore and multiprocessor chips. They divide the chip and heat spreader into cells. Each cell has a thermal capacitance that can be written as a diagonal matrix C. Adjacent cells have a thermal resistance

FIGURE 7.24

Heat transfer components for an integrated circuit: the heat spreader is attached to the chip, and the heat sink is attached to the spreader.




between them as described by a thermal conductance matrix G; the thermal conductance of silicon is an approximately linear function of temperature, so the G matrix is itself a function of temperature. The heat flow equation can be written as

C ṫ_k = −G(t_k) t_k + p_k (EQ 7.20)

where t_k is the temperature at time k and p_k is the power consumption of the chip at time k.

The equation can be solved for temperature as a function of time using numerical integration. Murali et al. model the power consumption of a processor i as a function of clock frequency as p_{i,k} = p_max f_{i,k}^2 / f_max^2. They formulated a steady-state problem that maximizes the total number of instructions per second of all the processors on the chip while keeping the temperature and power consumption of all processors below a threshold. They also formulated a dynamic case in which processor frequency is a function of time while satisfying the temperature and power constraints. The strict formulation of these problems is nonconvex. They solve the problems by relaxing the power-versus-frequency equation to an inequality. They also use the maximum thermal conductivity value. They then iterate over the solutions, using the result of one relaxed convex optimization problem to determine the steady-state temperature for the next iteration until the temperature values converge.

Hartman et al. [Har10] proposed ant colony optimization for lifetime-aware task mapping. They noted that different failure mechanisms depend on different characteristics: electromigration depends on current density, time-dependent dielectric breakdown depends on power supply voltage, and thermal cycling depends on the ambient temperature. As a result, there is no simple relationship between temperature and lifetime. The optimization algorithm alternates between selecting a task to be mapped and a component to which the task is to be mapped. The choice of which task to map next is determined by a weighted random selection. They estimate component lifetime using a lognormal distribution for each of electromigration, time-dependent dielectric breakdown, and thermal cycling.
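A simplified Monte Carlo stand-in illustrates why per-mechanism lognormal distributions matter: the component fails at the earliest of the mechanism failure times, so its expected lifetime is shorter than any single mechanism's median suggests. The distribution parameters here are invented, and Hartman et al. use analytic lifetime models rather than this sampling sketch.

```python
import random

random.seed(0)   # deterministic for illustration

def mttf_years(mechanisms, trials=20000):
    """Monte Carlo mean time to failure: the component fails at the
    earliest of the per-mechanism lognormal failure times."""
    total = 0.0
    for _ in range(trials):
        total += min(random.lognormvariate(mu, sigma) for mu, sigma in mechanisms)
    return total / trials

mechs = [(2.3, 0.5),   # electromigration: median exp(2.3), about 10 years
         (2.7, 0.4),   # dielectric breakdown: median about 15 years
         (3.0, 0.6)]   # thermal cycling: median about 20 years
years = mttf_years(mechs)
```

Even though every individual mechanism has a median lifetime of ten years or more, the minimum over the three pulls the estimate well below ten.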

Ebi et al. [Ebi11] used market-based learning algorithms to dynamically manage power budgets and thermal profiles in multicore processors. Each processor has a local agent that manages the clock frequency and voltage of that processor. Each local agent is assigned an income of available power; if a processor runs at less than its maximum frequency, the excess power can be added to its savings. A thermal domain may have one or more market agents that negotiate with other market agents for available clock cycles. Each local agent sends a request to its market agent based on the tasks it needs to run in the next period. Each market agent distributes power to the local agents using a round-robin, prioritized algorithm. If a local agent does not receive as much power as it requested, it must scale back its frequency and/or voltage.

Ebi et al. categorize the state of each local agent depending on its temperature: agents near the ambient temperature are classified as good, agents near the maximum allowable temperature are classified as bad, and those whose temperature




is in between the two are classified as neutral. The income of a processor is a function of temperature: income to a processor decreases as its temperature increases, with the slope of the income-versus-temperature curve becoming more negative as temperature increases. The boundaries of the good/neutral/bad regions determine the inflection points of the income-versus-temperature curve. The system learns the proper selection of these regions by using weights to adjust the positions of the boundaries between the good/neutral/bad regions. State changes from good to bad are negatively weighted, while state changes from bad to good are positively weighted.

Wang and Bettati [Wan08] developed a reactive scheduling algorithm that ensures the processor temperature remains below a threshold T_H. Let S(t) represent the processor frequency at time t. The processor power consumption is modeled as P(t) = K S^α(t), where K is a constant and α > 1. We can model the thermal properties as

a = k / C_th (EQ 7.21)

b = 1 / (R_th C_th) (EQ 7.22)

They define the equilibrium speed, the speed at which the processor stays at the threshold temperature, as

S_E = (b T_H / a)^(1/α) (EQ 7.23)

When the processor has useful work to do and is at its threshold temperature, the processor clock speed is set to S_E; the processor runs at its maximum speed when it is below its threshold and processes are ready to run.
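The reactive policy can be sketched as a few lines around EQ 7.21 through EQ 7.23. All constants below are invented, and we make one labeled assumption: the k of EQ 7.21 is treated as the K of the power model, so that the equilibrium condition a S_E^α = b T_H balances heating against cooling at the threshold.

```python
ALPHA = 3.0                      # exponent of the power law P = K * S**ALPHA
K, C_TH, R_TH = 20.0, 0.1, 0.5   # invented power and thermal constants
T_H = 60.0                       # threshold temperature above ambient

a = K / C_TH                     # EQ 7.21 (assumption: its k is the power-model K)
b = 1.0 / (R_TH * C_TH)          # EQ 7.22
S_E = (b * T_H / a) ** (1.0 / ALPHA)   # EQ 7.23: equilibrium speed

def speed(temperature, work_pending, s_max=2.0):
    """Reactive policy: full speed while cool, S_E once at the threshold."""
    if not work_pending:
        return 0.0
    return S_E if temperature >= T_H else s_max
```

At S_E the heating term a S_E^α exactly cancels the cooling term b T_H, which is why the temperature holds steady at the threshold.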

Chaturvedi et al. [Cha12] used simulation studies to show that a constant-speed schedule does not always minimize peak processor temperature. They went on to give theorems that show the conditions under which a step-down schedule, one in which the clock speed goes down at some point in the schedule, gives the lowest peak temperature.

Chen et al. [Che09] developed several algorithms for proactive speed scheduling for thermal management. One minimizes response time under a thermal constraint; another minimizes temperature under a timing constraint.

Thiele et al. [Thi13] developed a thermal analysis model for multitasking systems running on MPSoCs. The critical accumulated computing time Q[k] is the worst-case computation time for node k that maximizes that node's temperature in a given interval. Intuitively, this critical accumulated computing time occurs when a processor is active for the maximum amount of time required to compute its tasks. Computing the value of Q[k] yields the set of tasks that produces the worst-case temperature. They then use this task set to drive a detailed simulation of the thermal behavior of the chip over time as the task set executes.




7.6 Reliability

Fault tolerance and reliable computing have a very long history. The book by Siewiorek and Swarz [Sie98] provides a comprehensive introduction to the topic. Real-time constraints, power considerations, and thermal problems all pose additional challenges; knowledge of the task set to be run can be used to optimize the design at the system level.

Dave and Jha [Dav99b] developed the COFTA system to co-synthesize fault-tolerant systems. They used two types of checks at the task level.

• Assertion tasks: These compute assertion operations, such as checking an address against a valid range, and issue an error when the asserted condition is false.

• Compare tasks: These compare the results of duplicate copies of tasks and issue an error when the results do not agree.

The system designer specifies the assertions to be used; duplicate copies of tasks and the associated compare task are generated for tasks that do not have any assertions. The authors argue that assertion tasks are much more efficient than running duplicate copies of all tasks. An assertion task can catch many errors in a task without requiring that the entire computation of the task be repeated.

COFTA uses the clustering phase to enforce a fault-tolerant task allocation. It assigns an assertion overhead and fault-tolerance level to each task. The assertion overhead of a task is the computation and communication time for all the processes in the transitive fan-in of the process with an assertion. The fault-tolerance level of a task is the assertion overhead of the task plus the maximum fault-tolerance level of all tasks in its fanout. These levels must be recomputed as the clustering changes. During clustering, the fault-tolerance level is used to select new tasks for the cluster: the fanout task with the highest fault-tolerance level is chosen as the next addition to the cluster.
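The level computation described above can be sketched as a sinks-first traversal of the task graph: level(t) = overhead(t) plus the maximum level of t's fanout tasks. The graph and the overhead numbers below are invented, and the assertion overheads are taken as given rather than derived from fan-in times.

```python
# Sketch: fault-tolerance levels over a task DAG, computed sinks first.
def ft_levels(fanout, overhead):
    levels = {}
    def level(task):
        if task not in levels:
            succ = [level(s) for s in fanout.get(task, [])]
            levels[task] = overhead[task] + (max(succ) if succ else 0)
        return levels[task]
    for t in overhead:
        level(t)
    return levels

fanout = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"]}
overhead = {"t1": 2, "t2": 5, "t3": 1, "t4": 3}
levels = ft_levels(fanout, overhead)
```

During clustering, the fanout task with the highest resulting level would be pulled into the cluster next; here that means t2 is preferred over t3 when growing a cluster from t1.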

COFTA tries to share assertion tasks whenever possible. If several tasks use the same assertion but at nonoverlapping times, then one assertion task can be used to compute the assertion for all those tasks, saving hardware resources.

COFTA also generates a fault-tolerant hardware architecture. Figure 7.25 shows a simple 1-by-n failure group that consists of n service modules that perform the actual work of the system and one protection module. The protection module checks the service modules and substitutes results for bad service modules. In general, a failure group may use m-by-n protection with m protection modules. COFTA uses a restricted form of subgraph isomorphism to build failure groups into the architecture.

Ayav et al. [Aya08] use program transformations to make a set of periodic tasks tolerant of single failures. They implement fault tolerance using a combination of two methods. Each task performs a heartbeat operation: it writes a location periodically to demonstrate to a monitor task that it is alive. Each task also periodically writes a checkpoint of its current state; in case of a fault, the task state is reset to the checkpoint value and the task is restarted. These methods require periodic checking. They




use program transformations to equalize the execution time of each task across execution paths. Other transformations insert heartbeating and checkpointing code at the proper intervals in the program. A monitor program runs on a spare processor. It checks the heartbeat and performs rollbacks when necessary.
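The two mechanisms can be illustrated with a toy task object; in the real scheme these calls are inserted automatically by program transformation and the monitor runs on a spare processor, so everything below is a hypothetical simplification.

```python
import time

class Task:
    """Toy task with heartbeat and checkpoint/rollback hooks."""
    def __init__(self):
        self.state = 0
        self.checkpoint_state = 0
        self.last_heartbeat = 0.0

    def heartbeat(self):              # "I am alive" signal for the monitor
        self.last_heartbeat = time.monotonic()

    def checkpoint(self):             # save restartable state
        self.checkpoint_state = self.state

    def rollback(self):               # on a fault, restore the last checkpoint
        self.state = self.checkpoint_state

def alive(task, timeout):
    """Monitor check: has the task heartbeat recently enough?"""
    return time.monotonic() - task.last_heartbeat < timeout

task = Task()
task.heartbeat()
task.state = 41
task.checkpoint()
task.state = 42      # progress after the checkpoint...
task.rollback()      # ...discarded when a fault forces a restart
```

Work done after the last checkpoint is lost on rollback, which is why the transformation must place checkpoints at regular intervals.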

Meyer et al. [Mey11] introduce architectural features to allow added flexibility in the scheduling of redundant tasks. They allow copies of a critical task to execute asynchronously. Buffers save the results of task copies for checking. They use simulated annealing to generate a schedule for the execution of tasks.

Huang et al. [Hua11] developed algorithms to analyze and optimize the execution of a set of tasks. They model the schedule of a set of tasks using execution slots. They use binary search to analyze a task schedule. At each branch, one branch models non-faulty execution of the task while the other models faulty behavior. A task executes successfully if at least one of its instances executes successfully in the schedule; a task is faulty if no instances execute successfully. The reliability of a set of jobs is computed iteratively, moving from the highest-priority to the lowest-priority job. They use static priority slack sharing to schedule jobs with redundant copies of a task, and use a multi-objective evolutionary algorithm to optimize the task schedule.

FIGURE 7.25

1-by-n protection in a failure group [Dav99b]: n service modules, each with its own input and output, are checked by a single protection module.




They facilitate using task migration to handle permanent faults by keeping track of task migrations within the evolutionary algorithm.

Ejlali et al. [Ejl09] developed a low-power redundant architecture for real-time systems. They initiate backup copies of a task before the task deadline, then kill the backup if the primary copy executes successfully. An offline algorithm computes the worst-case start time for the spare copy of each task. At runtime they compute the actual delay and operating voltage of the processor needed to complete the redundant copy by the deadline. They use dynamic power management to manage the backup processor.
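The offline/online split can be sketched in a few lines. This is our simplification, not the published algorithm: the offline step assumes the backup runs for its full worst-case execution time, and the online step picks the slowest discrete frequency level that still meets the deadline. All numbers are invented.

```python
def latest_backup_start(deadline, backup_wcet):
    """Offline: latest start time that still meets the deadline at WCET."""
    return deadline - backup_wcet

def runtime_frequency(now, deadline, remaining_cycles, f_levels):
    """Online: slowest frequency level that still finishes in time;
    falls back to the fastest level if none suffices."""
    for f in sorted(f_levels):
        if remaining_cycles / f <= deadline - now:
            return f
    return max(f_levels)

start = latest_backup_start(deadline=100e-3, backup_wcet=30e-3)   # seconds
f = runtime_frequency(now=70e-3, deadline=100e-3,
                      remaining_cycles=40e6, f_levels=[0.5e9, 1e9, 2e9])
```

Starting the backup as late as possible maximizes the chance it can be killed before consuming energy, while the runtime frequency choice lets a backup that must run do so at the lowest adequate speed.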

Reimann et al. [Rei08] optimize the allocation of tasks and voters to meet reliability requirements. They model the problem as an integer linear program. Binary variables model the possible placement of voters and the allocation of tasks to processors.

7.7 System-level simulation

Even after we verify the hardware and software components of a system independently, we need to make sure that the components work together properly. Because hardware/software systems are both large and operate at many different time scales, powerful verification tools are needed to identify improper behavior and help the designer determine the origin of a bug. Designers should run the hardware against the software to find bugs on both sides of the interface. Hardware/software co-simulation gives the designer traditional debugging tools; co-simulators also run fast enough to give satisfactory turnaround times for design experiments.

Co-simulators provide mechanisms that allow different types of simulators to communicate. A brute-force approach to simulating software that talks to hardware would be to simulate a register-transfer implementation of the CPU along with the custom hardware, setting the software bits as state in the CPU simulation and running the software by exercising the RTL model. Even if we have a register-transfer model of the processor, which is not always the case, a logic-level simulation of the CPU would be unreasonably slow. Because we have cycle-accurate simulators that run considerably faster than RTL models of processors, we can use them to execute software and simulate only the custom hardware using traditional event-driven hardware simulation.

As shown in Figure 7.26, a simulation backplane is the mechanism that allows different simulators to communicate and synchronize. Each simulator uses a bus interface module to connect to the backplane. The backplane uses concurrent programming techniques to pass data between the simulators. The backplane must also ensure that the simulators receive the data at the proper time. The bus interface includes controls that allow the backplane to pause a simulator. The backplane controls the temporal progress of the simulators so that they see data arriving at the correct times.
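The timing role of the backplane can be sketched with a toy lockstep scheme: simulators advance together, and a message posted by one simulator is only released to its destination once simulated time reaches the message's timestamp. The simulator interface (a callable taking the current time and the backplane) is invented for illustration; real backplanes use concurrent simulators, not this single loop.

```python
import heapq

class Backplane:
    """Toy backplane: delivers cross-simulator messages at their timestamps."""
    def __init__(self):
        self.queue = []                          # (deliver_time, dest, payload)

    def post(self, deliver_time, dest, payload):
        heapq.heappush(self.queue, (deliver_time, dest, payload))

    def run(self, simulators, until, step):
        t, log = 0, []
        while t <= until:
            # Release only the messages whose timestamp has been reached.
            while self.queue and self.queue[0][0] <= t:
                when, dest, payload = heapq.heappop(self.queue)
                log.append((t, dest, payload))
            for sim in simulators:               # advance every simulator to t
                sim(t, self)
            t += step
        return log

def cpu_sim(t, bp):
    if t == 2:                                   # software writes a device register
        bp.post(t + 1, "hw", "write 0xFF")

def hw_sim(t, bp):
    pass                                         # stub event-driven hardware model

events = Backplane().run([cpu_sim, hw_sim], until=5, step=1)
```

Holding the write until its timestamp is exactly the "data arriving at the correct times" guarantee described above: the hardware model never sees a bus transaction from its own simulated future.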

Becker et al. [Bec92] built an early co-simulator to simulate a large network system. They used the programming language interface (PLI) of the Cadence Verilog-XL simulator to add C code that could communicate with software simulation modules.




They used UNIX networking operations to connect the hardware simulator to the other elements of the system being simulated, the firmware, and a monitor program. They did not simulate a processor at the cycle level.

Ghosh et al. [Gho95] later built a more general simulation environment that also used the Verilog-XL PLI to coordinate the hardware and software simulators over a backplane. Their simulator included a CPU simulator. Zivojnovic and Meyr [Ziv96] compiled software instructions to be simulated into native instructions on the host simulation processor; they attached these models to a hardware simulator for co-simulation.

The next example describes a commercial co-verification tool.

Example 7.1 Seamless Co-Verification

The Mentor Graphics Seamless system (http://www.mentor.com/seamless) simulates heterogeneous hardware/software systems. Hardware modules can be described using standard hardware description languages; software can be loaded into the simulator as binary code or as C language models. A bus interface model connects the designer's hardware modules with the processor's instruction set simulator. A coherent memory server allows some parts of the memory to be shared between hardware and software models while isolating other parts of memory; only shared memory segments need to be mediated by the co-simulation framework. A graphical profiler allows the designer to visualize system behavior.

The next example describes a standard for the design of distributed simulators.

Example 7.2 High-Level Architecture

High-Level Architecture (HLA) is an IEEE standard for the architecture of distributed simulators [IEE10]. An HLA simulator is composed of a set of simulators known as federates. Federates exchange objects that describe simulation data. Their interaction is managed by the Run-Time Infrastructure (RTI), which provides an interface by which the federates may manage simulation time, data objects, and other elements necessary for the federated simulation.

FIGURE 7.26

A simulation backplane: two CPU simulators and a Verilog or VHDL simulator each connect to the backplane through a bus interface.




7.8 Summary

Modern embedded systems must implement very sophisticated algorithms while meeting stringent power, performance, and cost goals. Efficient design space exploration is necessary to find the best design solution. Hardware/software co-design algorithms analyze a functional specification and map it onto a set of software and customized hardware to meet the nonfunctional requirements. System-level design expands on this goal by allowing designers to work in high-level languages such as SystemC or MATLAB and to map their systems onto large multiprocessors. Successive refinement methodologies allow the system to be designed incrementally, adding detail as it moves from higher to lower levels of abstraction.

What we have learned

• Hardware/software co-synthesis can start either from programs or from task graphs.

• Platform-driven co-synthesis allocates processes onto a predefined hardware architecture, tuning the parameters of that architecture during synthesis.

• Co-synthesis algorithms that do not rely on platforms often must make other assumptions, such as fixed component libraries, to make their search spaces more reasonable.

• System-level design is driven by high-level models and explores the hardware/software design space of multiprocessor systems.

Further reading

Staunstrup and Wolf's edited volume [Sta97b] surveys hardware/software co-design, including techniques for accelerated systems like those described in this chapter. Gupta and De Micheli [Gup93] and Ernst et al. [Ern93] describe early techniques for co-synthesis of accelerated systems. Callahan et al. [Cal00] describe an on-chip reconfigurable co-processor connected to a CPU.

Questions

Q7-1 Compare and contrast a co-processor and an accelerator.

Q7-2 What factors determine the time required for two processes to communicate? Does your analysis depend on whether the processes are implemented in hardware or software?

Q7-3 Which is better suited to implementation in an accelerator: Viterbi decoding or discrete cosine transform? Explain.

Q7-4 Estimate the execution time and required hardware units for each data flow graph. Assume that one operator executes in one clock cycle and that each operator type is implemented in a distinct module (no ALUs).

Q7-5 Show how to calculate the parameters of EQ 7.1 using the information in the table of Figure 7.7.

Q7-6 How might task partitioning affect the results of scheduling and allocation?

Q7-7 Compare and contrast the CDFG and the task graph as representations of system function for co-synthesis.

Q7-8 How could you use an execution-time technology table to capture the characteristics of configurable processors with custom instruction sets?

Q7-9 Explain the characteristics of EQ 7.3 that allow it to be used in a binary search algorithm.

Q7-10 Compare and contrast the objective functions of EQ 7.3 and EQ 7.4.

Q7-11 Explain the physical motivation behind the terms of the speedup formula (EQ 7.5).

Q7-12 Explain how global slack and local slack can be used to reduce the cost of an accelerator.

Q7-13 How do scheduling and allocation interact when co-synthesizing systems with arbitrary architectures?

Q7-14 Give examples of control-related refinement, data-related refinement, and architecture-related refinement.

Q7-15 How can we take advantage of multiple copies of a task graph during scheduling and allocation?

Q7-16 Derive the equilibrium speed of EQ 7.23.

Q7-17 Does scheduling for minimum power also minimize temperature?

Q7-18 Explain why making multiple copies of a task for fault tolerance may not require much additional hardware.

Q7-19 Show how to model the co-synthesis of a video motion-estimation engine using genetic algorithms, assuming that all functions will be implemented on a single chip. Show the form of the PE allocation string, task allocation string, and link allocation string.

Lab exercises

L7-1 Develop a simple hardware/software partitioning tool that accepts as input a task graph and the execution times of the software implementations of the tasks. Allocate processes to hardware or software and generate a feasible system schedule.

L7-2 Develop a tool that allows you to quickly estimate the performance and area of an accelerator, given a hardware description.

L7-3 Develop a simple hardware/software partitioning tool that uses genetic algorithms to search the design space.

L7-4 Develop a simple co-simulation system that allows one CPU model to talk to one accelerator model.

L7-5 Use a thermal simulation tool to simulate the temperature profile of a processor running a program.
