
Eindhoven University of Technology

MASTER

Parallel simulation of handshake circuits

Dieben, M.P.

Award date:1994


EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics and Computing Science

MASTER'S THESIS Parallel Simulation of Handshake Circuits

by M.P.Dieben

Supervisors: Drs. R.H. Mak, Ir. A.M.G. Peeters, prof.dr. M. Rem

January 1994


Abstract

Programs, written in the VLSI-programming language Tangram, can automatically be compiled into circuit descriptions, with handshake circuits as intermediary. Due to the transparent compilation method the programmer can infer cost and performance directly from the Tangram program. Simulation results are helpful to discriminate between various Tangram programs. Tests on the dynamic properties of a program, such as speed and power, are first possible on the level of handshake circuits. The highly parallel structure of handshake circuits allows for a concurrent approach. This report contains the design of a parallel simulator of handshake circuits on a Transputer.

Preface

The work presented in this report has been done in order to fulfil the requirements for the Master's thesis in computing science of the Eindhoven University of Technology. The assignment has been performed in the group of Parallelism & Architecture at the Eindhoven University of Technology during the periods February 1993 - March 1993 and August 1993 - January 1994.

Contents

0 Introduction
0.0 Assignment
0.1 Approach

1 VOC project
1.0 Introduction
1.1 VOICE
1.2 Tools for VOICE
1.2.0 Analysis
1.2.1 Optimisation and Visualisation
1.3 Relation between VOC project and Tangram project
1.4 Intentions of the Handshake Circuit Simulator
1.4.0 Concurrency
1.4.1 Front-end analysis

2 Delay insensitive circuits
2.0 Processes
2.1 Systems
2.2 Operation of Systems

3 Handshake circuits
3.0 Handshake channels
3.1 Handshake protocol
3.2 Handshake components
3.2.0 Commands
3.2.1 Specification of a basic component, example MIX
3.2.2 Specification of a basic component, example SEQ
3.3 Handshake circuits
3.3.0 Handshake circuits: delay insensitive circuits

4 The Transputer Network
4.0 The hardware
4.1 The Pascal system
4.1.0 Parallelism
4.1.1 File i/o
4.2 Communication primitives
4.2.0 Semaphores

5 Functional simulation model
5.0 Modelling handshake synchronisation
5.1 Modelling handshake communication
5.2 Modelling the handshake components
5.2.0 Preliminaries
5.2.1 Translation of components, example SEQ
5.2.2 Translation of components, example CON
5.2.3 Translation of components, example PAR
5.2.4 Translation of components, example NewPAR
5.2.5 Translation of components, example BAR
5.3 Simulation of Handshake Circuits
5.3.0 Correctness of the simulation
5.3.1 I/O
5.3.2 Handshake circuit as Transputer Pascal program

6 Timing & Power dissipation
6.0 Timing analysis
6.1 Timing model
6.1.0 Timing analysis of components, example Newpar
6.1.1 Timing analysis of components, example MON
6.2 Word based timing versus bit based timing
6.2.0 Timing function of an adder
6.3 Timing of the Arbiter
6.4 Power dissipation

7 HCL2TP compiler
7.0 Specification of HCL2TP
7.1 Lexical scanner
7.2 Parser
7.3 Code generation
7.3.0 Syntax Directed Translation Scheme
7.3.1 Code Attributes
7.3.2 On the fly code generation
7.3.3 Translation of file-i/o components

8 Evaluation and conclusion
8.0 Evaluation and results
8.0.0 Functional simulation
8.0.1 Timing analysis
8.0.2 Power dissipation analysis
8.0.3 Single Transputer simulator
8.0.4 Surrounding software
8.1 Concluding remarks
8.2 Suggestions for future work
8.2.0 Functional simulation
8.2.1 Timing analysis
8.2.2 Power dissipation analysis
8.2.3 Multi Transputer simulator
8.2.4 Surrounding software

Bibliography

A APPENDIX in supplement


0 Introduction

0.0 Assignment

Designing a Very Large Scale Integrated (VLSI) circuit is a complicated task. One way to make the task more manageable is to use a high level programming language to specify the computations that have to be realized by the VLSI circuit. This language has to have certain properties. First of all it must be expressive enough to define a wide range of desired computations. Second it must provide the designer with means to manage circuit resources like area, speed and power consumption. The latter must be accomplished without demanding the designer to have extensive knowledge of the circuit technology and layout strategy used. In particular, the timing constraints of the circuit technology should be of no concern to the designer on the level of the VLSI programming language.

As a next step in the design, a VLSI program has to be translated into a circuit description. This translation can be performed automatically by a tool called a silicon compiler. At Eindhoven University of Technology such a compiler is currently in development, as a part of the VOC project [Kor92]. Another project is the Tangram project at Philips Research Laboratories, Eindhoven.

The compilation of VLSI programs is divided in two parts. The front-end of the compiler translates the VLSI program into handshake circuits. The back-end of the compiler maps the handshake circuits to a VLSI circuit description.

The division is justified by one of the critical elements of circuit design: the timing problem. The theory of delay insensitive circuits [Ver92] provides a method for handling this timing problem. The theory defines a delay insensitive circuit to be a system consisting of parallel operating components and interconnecting wires, for which the functional correctness is independent of the delay in both components and connecting wires. Delay insensitive circuits can be used as building blocks in delay insensitive systems. A practical example is a handshake circuit, i.e. a network of asynchronous components that communicate according to a (4-phase) handshake protocol.


For the translation of VLSI programs into handshake circuits only a small number of different handshake components is required. Handshake circuits provide a narrow interface between the front-end and the back-end of the compiler. Besides this, timing and functional correctness are separated on the level of handshake circuits. Therefore, this is a good point to analyze the results with simulation tools.

Simulation is important as it gives the programmer an idea of the effects of his design decisions, with respect to the efficiency of the resulting circuit. The simulation results can be used to evaluate and develop VLSI programming design heuristics. Therefore it is of utmost importance that the translation method is transparent. Otherwise it is not possible to deduce performance characteristics of the VLSI circuit from the VLSI program. Note that the compiler could have affected the results with internal optimisation.

Since handshake circuits are fine grained networks of parallel operating components, it is expected that simulation of the circuit in an environment which supports fine grain parallelism is more efficient with respect to both simulator design and simulation speed. At Eindhoven University of Technology such an environment is available: a Transputer network. It is the subject of this Master's thesis to design a simulator for handshake circuits on this Transputer network.

0.1 Approach

In Chapter 1 the assignment is situated in the context of the VOC project. The necessity of simulation is explained. Chapter 2 contains a formal definition of delay insensitive circuits. In Chapter 3 the handshake circuits are described, according to the handshake component definitions of Van Berkel [Ber92]. The relation between the formal model of delay insensitive circuits and the handshake circuit is pointed out. The relevant aspects of the Transputer network are recorded in Chapter 4. After the preliminary work, the construction of the simulator is treated in the following chapters. The construction is divided into a few stages. First, in Chapter 5, the functional aspects of the handshake circuit components are captured in a set of Transputer Pascal procedures. This leads to the executable functional model. Semaphores are used to implement the handshake protocol. In Chapter 6 the functional model is extended with calculations to record the timing behaviour and the power dissipation of the circuit. The compilation of the handshake circuit description into a Transputer Pascal program that can be executed on a single Transputer is automated with the design of a compiler in Chapter 7. Conclusions and suggestions for a multi Transputer simulator are presented in Chapter 8.


1 VOC project

1.0 Introduction

In 1988 the VOC [Vertragings Ongevoelige Circuits, English: Delay Insensitive Circuits] project started at the Department of Mathematics and Computing Science at Eindhoven University of Technology [Kor92]. The VOC project investigates the design process of a special class of VLSI circuits, viz. DI-circuits. DI-circuits have some very important properties which make research on them worthwhile. DI-circuits are constructed as a network of components. The circuits do not contain a clock; instead, the components of the circuit are synchronized by handshake communications. The components are called handshake components, the network is called a handshake circuit. Timing and functional correctness are separated on the level of the handshake circuit, thus facilitating the design of VLSI circuits considerably (separation of concerns). The absence of the clock introduces another big advantage: the power dissipation is lower, since only the active components of the circuit consume energy.

The complexity of the design process of VLSI circuits demands support by proper software. The aim of the VOC project is to develop a system that automatically generates a VLSI circuit design from a program in a high level description language. The system is called VOICE (Vertragingsongevoelige IC Compiler Eindhoven).

1.1 VOICE

VOICE is specifically designed for the following three applications:

• Research on the compiler: Once the first implementation of the compiler is made, experiments can be performed such as application of alternative data encoding and various target technologies, like FPGAs [Bru91].

• Research with the compiler: VOICE is intended to be used for evaluation of design heuristics for VLSI programming. VOICE shows the effect, on aspects like speed and dissipation, of high level methods for developing programs on the low level VLSI implementation of these programs.

• Education: The compiler will be used in courses on VLSI programming to provide students with hands-on experience in generating VLSI designs.


To support these applications VOICE has a modular structure which facilitates changes of parts of the system. The central part of the VOICE compiler transforms a VLSI program into a handshake circuit. The translation is not performed in one step, but several intermediate representations are introduced that partition the translation in a number of independent steps. This way the compiler can be set up as a sequence of modules that each perform one of the translation steps.

The translation method described in, for example, [Bekp90] is syntax-directed. An important property is its transparency, i.e., the important characteristics of the resulting VLSI design such as area, speed and power dissipation can be deduced from the VLSI program in a straightforward way. The VLSI programmer will therefore not only be able to control the functional and behavioural correctness of the program, but the efficiency of the resulting circuit as well.

The top level representation of a circuit is the program in a VLSI programming language (VPL). A VPL program is first translated into the kernel language (VOKEL). The translation of a VOKEL program results in a handshake circuit: a network of asynchronous components that interact through handshake communications. Handshake circuits provide a narrow interface between the front-end and the back-end of the compiler. Therefore this is a good point to analyze the results with simulation tools. The back-end of the compiler translates the handshake circuits into VLSI circuits.

Figure 1.0: VPL → VOKEL → (front end) → Handshake circuits → (back end) → VLSI circuits

1.2 Tools for VOICE

VOICE contains a number of tools which perform tasks that are not directly part of the translation. The main tasks of these tools are analysis and optimisation at various stages of the translation and visualisation of the (intermediate) results.


1.2.0 Analysis

An important property of the compilation method is its transparency, i.e. the VLSI programmer can easily make, for a VLSI program, a rough estimate of important properties of the resulting VLSI implementation such as area, speed, power dissipation and testability. This is important because the programmer will have to meet requirements for these properties that depend on the application at hand. In general the VLSI programmer can produce several programs which all satisfy the functional specification but result in VLSI circuits with varying degrees of efficiency. In order to support the VLSI programmer in finding an optimal program for the problem, the silicon compilation system must support analysis of the properties that influence the efficiency of the program. This analysis can be performed at various stages in the translation; analysis at an early stage will be rough but provides a quick answer, and estimations at a later stage will probably be more accurate but require more time.

An estimation of the speed and power dissipation of a VLSI circuit can, in contrast to the area, not be determined statically, because these criteria depend on the way the circuit is used. To measure these aspects VOICE must include simulators. Within VOICE simulators are considered on three levels: a VOKEL interpreter, a simulator for handshake circuits, and a simulator at the level of the IC-CAD system.

The VOKEL interpreter is useful for fast verification of the functional correctness of the program and for detecting problems in the communication behaviour. It will, however, not provide information on the speed or power dissipation of the resulting circuit.

Simulation on handshake circuits can be used for both checking the behaviour of the program and estimating the speed and the power dissipation of the resulting circuit. A handshake circuit is simulated as a network of parallel processes that interact through four-phase handshake communications. Such a network is controlled by simulating the communication behaviour of the environment on the external channels of the circuit. In such a simulation the timing is determined by estimated delay times for the components and the wires. Power dissipation is measured by keeping track of the number of communications over the links in the circuit. The energy consumption is proportional to the number of links [Ber92].
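The bookkeeping behind such a power estimate can be sketched as follows. This is a minimal illustration in Python, not the Transputer Pascal code of the simulator; the class name `ChannelCounter`, the channel names and the assumption that every communication costs the same amount of energy are all hypothetical choices made for the example.

```python
from collections import Counter


class ChannelCounter:
    """Counts completed handshake communications per channel (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, channel):
        # Called once for every completed communication on `channel`.
        self.counts[channel] += 1

    def total(self):
        # Total number of communications observed on all channels.
        return sum(self.counts.values())

    def energy_estimate(self, energy_per_communication=1.0):
        # Assumption: each communication dissipates the same amount of energy.
        return self.total() * energy_per_communication


counter = ChannelCounter()
for channel in ["a", "b", "a", "c", "a"]:
    counter.record(channel)

print(counter.counts["a"], counter.total(), counter.energy_estimate(2.0))
```

A per-channel count, rather than a single total, keeps it possible to attribute the estimated dissipation to the parts of the circuit that are of interest.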


There is a handshake simulator currently in development [Lin92]. This simulator, written in C, runs on a SUN workstation. As the handshake circuit consists of a network of parallel processes, it is one of the intentions of the VOICE project to develop a parallel simulator. This simulator should run on a Transputer network. It is expected that this results in a notable speed-up. This report contains the design of a simulator for handshake circuits on a Transputer.

It is possible to use the more accurate simulator which is incorporated in the IC-CAD system. However, this simulation will take more time. The question is whether the accuracy of the handshake simulation is sufficient for the purpose it is used for.

1.2.1 Optimisation and Visualisation

Tools must also be developed for optimisation and visualisation. Optimisation is a difficult topic as the transparency of the translation should not be violated. Only those optimisations that do not interfere with the goals of the programmer should be performed automatically. At the level of VLSI programming languages, optimisation methods are considered that are commonly used in conventional compilers. On the basis of one of these methods, sharing of common subexpressions, a program has been developed that supplies the user with suggestions for optimisation of the VLSI program [Jan92]. This tool searches the VLSI program for fragments that occur more than once and reports these fragments to the VLSI programmer. The programmer can then decide to adjust the program so that such a fragment occurs only once in the circuit and is shared between the occurrences. This decision cannot be made automatically, because sharing of program fragments may reduce the size but introduces delays through the infrastructure required for sharing. This speed/size trade-off must be made by the programmer. HCS [Handshake Circuit Simulator] can help by showing the penalty (lower speed) versus the benefit (size reduction).

The simulator can also be helpful as an extension of an interactive editor, by visualising the behaviour of a circuit.

1.3 Relation between VOC project and Tangram project

There is an extensive cooperation between the Technical University of Eindhoven and Philips Research Laboratories Eindhoven in the field of delay insensitive circuit design. This simulator is intended to be used within the VOC project. However, the Tangram definitions of the basic components are used. There are three reasons for this choice. First, Tangram offers a more complete set of handshake components in which data is also dealt with in a neat way, whereas VOC handshake circuits so far basically ignore data. Second, Tangram offers a stable bug-free front-end compiler for an expressive language. Third, Tangram handshake components form a well defined stable set.

1.4 Intentions of the Handshake Circuit Simulator

1.4.0 Concurrency

Simulation of asynchronous circuits allows for a concurrent approach. It is expected that the Transputer network, which supports concurrency, will facilitate an elegant design of the simulator. The Transputer network at the Technical University of Eindhoven consists of 50 parallel operating Transputers. The Transputer network can treat parallelism efficiently. Especially for large circuits with 100,000 processes this may result in a notable speed-up.

1.4.1 Front-end analysis

One of the three main goals of VOICE is the evaluation of VLSI programming design heuristics. Discrimination between various VLSI programs, with the same functionality but with different speed or power consumption, is important. This requires more results than just the execution time. In fact, the simulator has to follow the architecture of the VLSI program precisely, to be able to check the performance of those parts of the program that are of interest to the programmer.


2 Delay insensitive circuits

To be able to argue about the properties of the circuit in a more formal way, the formal model of delay-insensitive circuits is discussed here. This chapter is a summary of the course notes of Delay insensitive circuits theory [Ver92].

2.0 Processes

Let $\Sigma$ be an infinite set of symbols. A subset of $\Sigma$ is called an alphabet. A process $P$ is a triple $(iP, oP, tP)$ such that
• $iP$ is a finite subset of $\Sigma$,
• $oP$ is a finite subset of $\Sigma$,
• $iP \cap oP = \emptyset$,
• $tP \subseteq (iP \cup oP)^*$,
• $tP$ is non-empty and prefix closed.

Subset $iP$ is called the input alphabet, $oP$ the output alphabet, and $tP$ the trace set of $P$. The alphabet $aP$ of $P$ is defined by $aP = iP \cup oP$. The set of all processes is called PROC. Typically, variables $P$, $Q$ and $R$ range over PROC.

The trace set specifies the behavioural properties of a process. The current state of process $P$ is characterized by a trace $t \in tP$. The intention is as follows:

• For $a \in oP$: $ta \in tP$ if and only if $P$ can send an output signal via $a$ in state $t$.

• For $a \in iP$: $ta \in tP$ if and only if $P$ can receive an input signal via $a$ in state $t$.

The trace set prescribes restrictions (for proper operation) on the process itself and on its environment.
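As an illustration of the conditions on a process, the following Python sketch checks them for a finite trace set. It is only an aid to reading the definition: real trace sets may be infinite, and the representation (traces as tuples of symbols, alphabets as Python sets) and the function names are assumptions made for this example.

```python
def prefixes(trace):
    """All prefixes of a trace (a tuple of symbols), including the empty trace."""
    return [trace[:k] for k in range(len(trace) + 1)]


def is_process(inputs, outputs, traces):
    """Check the conditions on a process (iP, oP, tP) for a finite trace set."""
    alphabet = inputs | outputs
    if inputs & outputs:          # iP and oP must be disjoint
        return False
    if not traces:                # tP must be non-empty
        return False
    for t in traces:
        if any(symbol not in alphabet for symbol in t):   # tP within (iP u oP)*
            return False
        if any(p not in traces for p in prefixes(t)):     # tP must be prefix closed
            return False
    return True


def can_extend(traces, t, a):
    """True if ta is in tP, i.e. symbol a can occur in state t."""
    return t + (a,) in traces


# Hypothetical process: receive a, then send b.
example_traces = {(), ("a",), ("a", "b")}
print(is_process({"a"}, {"b"}, example_traces))
print(can_extend(example_traces, ("a",), "b"))
```

Whether `can_extend` describes sending or receiving depends only on whether the symbol belongs to the output or the input alphabet, exactly as in the two cases above.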

The reflection operator is defined on PROC by $-P = (oP, iP, tP)$.

2.1 Systems

A system $S$ is a finite set of processes such that for all $a \in \Sigma$:
• $(\# P : P \in S : a \in iP) \le 1$ and
• $(\# P : P \in S : a \in oP) \le 1$.


The set of all systems is denoted by SYS. Typically, variables S, T, U range over SYS.

For system $S$ its input and output alphabets are defined by respectively: $iS = (\bigcup P : P \in S : iP)$, $oS = (\bigcup P : P \in S : oP)$.

For system S its internal symbols, external symbols, external inputs and external outputs are defined by:

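$nS = iS \cap oS$, $xS = iS \div oS$, $xiS = iS \setminus oS$, $xoS = oS \setminus iS$,

where $\div$ denotes symmetric difference.

A system is called closed when $xS = \emptyset$.

Systems $S$ and $T$ are called connectable whenever $xiS \cap xiT = \emptyset$ and $xoS \cap xoT = \emptyset$.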
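The alphabet definitions translate directly into set operations. The sketch below is a hypothetical Python rendering in which a system is a list of (input alphabet, output alphabet) pairs, one per process; trace sets are left out because they play no role in these definitions.

```python
from collections import Counter


def is_system(system):
    """Each symbol is an input of at most one process and an output of at most one."""
    in_count = Counter(a for inputs, _ in system for a in inputs)
    out_count = Counter(a for _, outputs in system for a in outputs)
    return all(n <= 1 for n in in_count.values()) and \
        all(n <= 1 for n in out_count.values())


def alphabets(system):
    """Return iS, oS, nS, xS, xiS and xoS for a list of (iP, oP) pairs."""
    i_s = set().union(*(inputs for inputs, _ in system))
    o_s = set().union(*(outputs for _, outputs in system))
    return {
        "iS": i_s,
        "oS": o_s,
        "nS": i_s & o_s,    # internal symbols
        "xS": i_s ^ o_s,    # external symbols (symmetric difference)
        "xiS": i_s - o_s,   # external inputs
        "xoS": o_s - i_s,   # external outputs
    }


def is_closed(system):
    return not alphabets(system)["xS"]


def connectable(s, t):
    a_s, a_t = alphabets(s), alphabets(t)
    return not (a_s["xiS"] & a_t["xiS"]) and not (a_s["xoS"] & a_t["xoS"])


# Hypothetical systems: S turns a into b, T turns b into c, R is a closed ring.
S = [({"a"}, {"b"})]
T = [({"b"}, {"c"})]
R = [({"a"}, {"b"}), ({"b"}, {"a"})]
print(connectable(S, T), is_closed(R), alphabets(R)["nS"])
```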

The internal symbols of connectable systems S and T can be renamed, yielding systems S' and T' respectively, such that

$aS' \cap nT' = \emptyset$ and $nS' \cap aT' = \emptyset$.

In that case, $S' \cup T'$ is again a system. It is called the composition of $S$ and $T$ and is denoted by $S$ par $T$. Thus, par is a partial operator on SYS.

2.2 Operation of Systems

Let $S$ be a system. In order to define its operation it is convenient to add explicit processes to model the behaviour of the internal wires. For that purpose system $S^\wedge$ is defined, obtained from $S$ by taking

• for each process $P \in S$, process $f_P.P \in S^\wedge$, where $f_P$ is a renaming of $aP$ defined by

$f_P.a = a?$ if $a \in iP$, $f_P.a = a!$ if $a \in oP$,

• for each symbol $a \in nS$, process $W(a!;a?) \in S^\wedge$.

Example For system $S$,

$S = \{ I(a;b), I(b;a) \}$, $S^\wedge = \{ I(a?;b!), I(b?;a!), W(a!;a?), W(b!;b?) \}$.

The trace set $tS$ is defined inductively by
• $\varepsilon \in tS$ and
• if $t \in tS$, $P \in S^\wedge$, $a \in oP$, and $ta \upharpoonright aP \in tP$, then $ta \in tS$.

Note that $tS \subseteq (aS^\wedge)^*$.
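The inductive definition can be read as a breadth-first enumeration: start with the empty trace and extend a trace with an output symbol whenever the process that produces it allows that output. The following Python sketch enumerates the traces up to a given length. It assumes finite, prefix-closed trace sets and takes the processes as already expanded; the renaming and the wire processes are not constructed here, and all names and the two example processes are hypothetical.

```python
def project(trace, alphabet):
    """Projection of a trace on an alphabet: keep only the symbols in it."""
    return tuple(symbol for symbol in trace if symbol in alphabet)


def trace_set(processes, max_len):
    """Traces of length <= max_len generated by the inductive definition.

    processes: list of (inputs, outputs, traces) triples.
    """
    result = {()}                 # the empty trace is always included
    frontier = [()]
    for _ in range(max_len):
        next_frontier = []
        for t in frontier:
            for inputs, outputs, traces in processes:
                alphabet = inputs | outputs
                for a in outputs:
                    ta = t + (a,)
                    # Only the producing process is consulted.
                    if project(ta, alphabet) in traces and ta not in result:
                        result.add(ta)
                        next_frontier.append(ta)
        frontier = next_frontier
    return result


# Two hypothetical independent producers: x at most twice, y at most once.
producer_x = (set(), {"x"}, {(), ("x",), ("x", "x")})
producer_y = (set(), {"y"}, {(), ("y",)})
traces = trace_set([producer_x, producer_y], 3)
print(sorted(traces))
```

Note that the receiver of a symbol is never asked whether it can accept it; that asymmetry is what makes interference possible.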


The empty trace $\varepsilon$ models the initial state of $S$. The (global) state $t$ of system $S$ induces the (local) state $t \upharpoonright aP$ at process $P$. State changes can occur whenever a process (possibly an added wire) can produce an output in its current state. Such an output appears at the same time at the receiving end, regardless of whether that input is acceptable in the current state of the receiver.

The safe set $safe.S$ of $S$ is defined by $safe.S = \{\, t : t \in (aS^\wedge)^* \wedge (\forall P : P \in S^\wedge : t \upharpoonright aP \in tP) : t \,\}$.

System $S$ is called free of interference when $tS \subseteq safe.S$.

A closed system S is free of interference if and only if safe.S = tS.

Such a system is called Correct.
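A bounded check for freedom of interference can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not part of the formal model: trace sets are finite, exploration stops at a fixed length, and sender and receiver share a symbol directly instead of being connected through a wire process. All names are hypothetical.

```python
def project(trace, alphabet):
    """Projection of a trace on an alphabet."""
    return tuple(symbol for symbol in trace if symbol in alphabet)


def reachable(processes, max_len):
    """Traces of length <= max_len in which every output is allowed by its producer."""
    result, frontier = {()}, [()]
    for _ in range(max_len):
        next_frontier = []
        for t in frontier:
            for inputs, outputs, traces in processes:
                for a in outputs:
                    ta = t + (a,)
                    if project(ta, inputs | outputs) in traces and ta not in result:
                        result.add(ta)
                        next_frontier.append(ta)
        frontier = next_frontier
    return result


def is_safe(trace, processes):
    """A trace is safe if its projection on every process is in that process's trace set."""
    return all(project(trace, i | o) in traces for i, o, traces in processes)


def free_of_interference(processes, max_len):
    return all(is_safe(t, processes) for t in reachable(processes, max_len))


# A sender that may emit a twice, with receivers accepting a once or twice.
sender = (set(), {"a"}, {(), ("a",), ("a", "a")})
receiver_once = ({"a"}, set(), {(), ("a",)})
receiver_twice = ({"a"}, set(), {(), ("a",), ("a", "a")})

print(free_of_interference([sender, receiver_once], 3))
print(free_of_interference([sender, receiver_twice], 3))
```

With `receiver_once` the trace `("a", "a")` is reachable because the sender may produce it, yet it is not safe because the receiver cannot accept the second `a`; with `receiver_twice` every reachable trace is safe.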

For system $S$ its pass set $pass.S$ is defined by $pass.S = \{\, U : U \in SYS \wedge Correct.(S \text{ par } U) : U \,\}$.


3 Handshake circuits

A handshake circuit is a special kind of delay insensitive circuit. It consists of handshake components connected by handshake channels. The components communicate with each other via a handshake protocol along the channel. This chapter includes an extensive description of the constituent parts of handshake circuits. Section 3.0 deals with handshake channels, Section 3.1 describes the handshake protocol and Section 3.2 presents the definitions of handshake components. The connection between handshake circuits and delay insensitive circuits as defined in Chapter 2 is presented in Section 3.3.

3.0 Handshake channels

A handshake channel connects exactly two ports belonging to distinct components. In their mutual communication actions, one component is the active partner and the other one is the passive partner. The port of the active partner is called the active port, denoted by •, and the port of the passive partner is called the passive port, denoted by ○.

The purpose of a communication action is twofold: firstly, it synchronizes the partners involved and secondly, it serves to transport data between the partners. As a consequence two kinds of channels are distinguished: nonput channels, which provide synchronization only, and data channels, which provide both synchronization and data transport. Channels which realize data transport can be divided into two categories: straight channels, in which data is transported from the active partner to the passive partner, and anti channels, in which the data is transported in the opposite direction. Table 3.0 summarizes the various types.

Table 3.0

CATEGORY          CHANNEL TYPE   SYMBOL
data-driven       straight       •──→──○
demand-driven     anti           •──←──○
nondata/control   nonput         •─────○


Similarly, directions can be defined for ports. A straight port is a port connected to a straight channel. Hence a straight port is either an active output or a passive input port. Anti ports are defined in the same way.
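The classification of ports follows mechanically from the channel types. The sketch below is a hypothetical Python rendering; the enumeration names and the function `port_direction` are assumptions made for this example and do not come from the Tangram tools.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class Data(Enum):
    INPUT = "input"
    OUTPUT = "output"
    NONE = "none"


def port_direction(role, data):
    """Classify a port as 'straight', 'anti' or 'nonput'."""
    if data is Data.NONE:
        return "nonput"
    # Straight: data travels from the active to the passive partner,
    # so the active side outputs and the passive side inputs.
    if (role, data) in {(Role.ACTIVE, Data.OUTPUT), (Role.PASSIVE, Data.INPUT)}:
        return "straight"
    # Anti: data travels from the passive to the active partner.
    return "anti"


for role in Role:
    for data in Data:
        print(role.value, data.value, port_direction(role, data))
```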

Besides a direction, channels and ports have a second attribute, viz. their width, indicated by an integer value W. This attribute defines the number of bits used to represent the data items that are communicated. The bits are numbered over the range [0 .. |W|). The width of a port can be positive, negative or zero. The sign of W is used to distinguish signed from unsigned data. Negative widths will be used exclusively for those basic components for which the specified behaviour depends on the distinction between signed and unsigned data (e.g. addition). Therefore, most ports are specified with non-negative widths, even if the data type is a signed integer sub-range. In particular, the width of a channel can only be positive or zero when the behaviour of the channel is independent of the type of the data that is transported. For convenience the width of nonput ports and nonput channels is chosen to be zero.

3.1 Handshake protocol

Two components connected by a channel c can communicate through handshaking. The active partner initiates the communication by sending the request and subsequently waits for an acknowledge. The passive partner awaits a request and subsequently sends the acknowledge. Together they implement a 2-phase handshake protocol.

A nonput channel c is implemented by two wires. One wire, c.req, is used for the request whereas the other wire, c.ack, is used for the acknowledge. In a 2-phase handshake protocol a request or an acknowledge is sent by changing the voltage level of the respective wires (transition signalling). However, this protocol leads to complex implementations of some handshake components. Therefore the communication is implemented by a 4-phase handshake protocol using level signalling. During the first 2-phase handshake the voltage level of the wires is raised to the 'high' voltage level. To be able to repeat the protocol, the voltage level of the wires has to be Returned To Zero first. This is done during the so-called RTZ phase.

As a consequence, a communication consists of a Data-phase and an RTZ-phase. The Data-phase on channel c is implemented with a 2-phase handshake "up", denoted by c↑. The RTZ-phase is implemented with a 2-phase handshake "down", denoted by c↓. Combined and in this order they form a complete 4-phase handshake protocol.
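The order of events in one 4-phase handshake on a nonput channel can be made concrete with a small executable sketch. The Python code below models the two wires of a channel as booleans guarded by a condition variable and runs the active and the passive partner in separate threads. It is an illustration only: the simulator in this report is written in Transputer Pascal and uses semaphores, and the class `Channel` and its methods are hypothetical names.

```python
import threading


class Channel:
    """Nonput channel modelled as two wires, req and ack (hypothetical model)."""

    def __init__(self):
        self.req = False
        self.ack = False
        self.log = []                      # wire changes in order of occurrence
        self._cond = threading.Condition()

    def set_wire(self, name, level):
        with self._cond:
            setattr(self, name, level)
            self.log.append((name, level))
            self._cond.notify_all()

    def wait_wire(self, name, level):
        # Level-based wait: returns at once if the wire already has the level.
        with self._cond:
            self._cond.wait_for(lambda: getattr(self, name) == level)


def active_handshake(c):
    c.set_wire("req", True)    # Data-phase: raise request ...
    c.wait_wire("ack", True)   # ... and wait for the acknowledge
    c.set_wire("req", False)   # RTZ-phase: lower request ...
    c.wait_wire("ack", False)  # ... and wait until acknowledge is low


def passive_handshake(c):
    c.wait_wire("req", True)   # Data-phase: wait for request ...
    c.set_wire("ack", True)    # ... and raise acknowledge
    c.wait_wire("req", False)  # RTZ-phase: wait until request is low ...
    c.set_wire("ack", False)   # ... and lower acknowledge


c = Channel()
passive = threading.Thread(target=passive_handshake, args=(c,))
passive.start()
active_handshake(c)
passive.join()
print(c.log)
```

Because each partner waits for the other's wire before changing its own, the four wire changes always occur in the same order (request up, acknowledge up, request down, acknowledge down), whatever the relative speed of the two threads.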


Table 3.1 shows all the possible combinations. For a wire w, the assignment w↑ means "raise w to high voltage level". Waiting is denoted by a pair of square brackets, i.e. [w] means "wait until wire w is raised to high voltage level". Likewise, w↓ lowers w and [¬w] waits until w is low.

Table 3.1

2-PHASE HANDSHAKE    ACTIONS
c↑°                  [c.req]; c.ack↑
c↑•                  c.req↑; [c.ack]
c↓°                  [¬c.req]; c.ack↓
c↓•                  c.req↓; [¬c.ack]

4-PHASE HANDSHAKE    ACTIONS
c                    c↑; c↓
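The rows of Table 3.1 can be spelled out mechanically. The following Python sketch lists the wire actions one partner performs during a complete 4-phase handshake; the function name four_phase and the textual encoding of the actions ("up", "down", "[...]" for waiting) are illustrative choices, not notation from the thesis.

```python
def four_phase(channel: str, active: bool) -> list:
    """Return the wire actions of one partner in a 4-phase handshake.

    "[w]" means: wait until wire w is high; "[not w]" means: wait until
    wire w is low again (see Table 3.1).
    """
    req, ack = f"{channel}.req", f"{channel}.ack"
    if active:
        up = [f"{req} up", f"[{ack}]"]          # active 2-phase "up"
        down = [f"{req} down", f"[not {ack}]"]  # active 2-phase "down"
    else:
        up = [f"[{req}]", f"{ack} up"]          # passive 2-phase "up"
        down = [f"[not {req}]", f"{ack} down"]  # passive 2-phase "down"
    return up + down  # Data-phase followed by RTZ-phase
```

Calling four_phase("c", True) gives the active side's view, and four_phase("c", False) the passive side's view of the same communication.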

In data channels one of the wires is replaced by a set of wires depending on the direction of the channel. In case of a straight channel the data is transported along the request whereas anti channels transport the data together with the acknowledge. The number of wires depends on the specific form of data encoding that is chosen. A common choice is dual rail encoding in which each data bit is represented by two wires [Vcr88]. Commands which involve data communication use an exclamation mark to indicate an output of data. The question mark denotes an input communication.

c↑°!E : denotes a passive output of E, e.g. the contents of a variable x.
c↑•?x : denotes an active input, which is to be stored in variable x.

3.2 Handshake components

A handshake circuit is an interconnection of handshake components, which are connected by point to point handshake channels. Handshake components are the smallest units for which the functional behaviour is independent of the timing of the environment. Behaviours of elementary handshake components are defined by means of a command. A command specifies the order in which the various parts of the handshake protocols on the (external) ports are executed.


In addition some components require their environment to behave in a restricted manner. In particular, mutual exclusion has to be guaranteed for a certain pair of inputs. Violation of this restriction is called a conflict. A set of restrictions is indicated by means of a conflict set.

3.2.0 Commands

Behaviours of handshake components are defined by commands. A command prescribes an interleaving of the handshakes the component performs on its individual (external) ports.

Regard the different 2-phase handshakes, as defined in Section 3.1, as the elementary commands. New commands are built from other commands by means of a well-chosen set of command operators. The command operators, in decreasing binding power, are:

• infinite repetition      *[ _ ]
• finite repetition        _ #[ _ ]
• parallel composition     _ || _
• enclose                  _ : _
• sequential composition   _ ; _
• choice                   [ _ | _ ]
• guarded selection        [ _ → _ [] _ → _ ]

The finite repetition "#" is defined by:

0#[c] = skip                  (N = 0)
N#[c] = c; (N-1)#[c]          (1 ≤ N)

The infinite repetition "*" is defined by:

*[c] = c; *[c]
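The two recursive definitions of repetition can be unfolded directly. The Python sketch below represents a command as a list of elementary commands; this representation and the function names are assumptions made for illustration only.

```python
import itertools

def finite_repetition(n: int, c: list) -> list:
    """Unfold N#[c] into the sequence of elementary commands it performs."""
    if n == 0:
        return []  # 0#[c] = skip
    return c + finite_repetition(n - 1, c)  # N#[c] = c; (N-1)#[c]

def infinite_repetition(c: list):
    """*[c] = c; *[c], produced lazily because it never terminates."""
    return itertools.cycle(c)
```

A finite prefix of the infinite repetition can be inspected with itertools.islice.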

The "||" operator denotes parallel execution of its operands. It awaits the termination of the execution of all operands before it terminates itself.


The ":" is called enclose. It is only used between 2-phase handshakes ("up" or "down") as a shorthand notation. The first operand must be a passive 2-phase handshake.

c↑°:a↑• = c.req↑; a.req↑; a.ack↑; c.ack↑
c↑°:a↑° = a↑°:c↑° = c.req↑ || a.req↑; c.ack↑ || a.ack↑

The ";" operator is the usual sequencing operator.

The "|" is the notation for both the choice operator and the separator, which is used to separate the auxiliaries and the command in a block. The choice operator implements a choice on communication. The choice operator "|" restricts the allowed behaviour of the environment; e.g. in a↑° | b↑° the environment must avoid an overlap of the handshakes a↑° and b↑°. The set {a°, b°} is called a conflict of the component. The specification of a basic component contains a set of such conflicts.

The "[]" operator denotes the alternatives of the guarded selection. Each "[]" is followed by a guarded command.

3.2.1 Specification of a basic component, example MIX

Consider for example, the specification of basic component MIX.

requirement: W ≥ 0
conflict set: { {a,b} }

MIX[straight, 0](a°, b°, c•) = (a°, b°, c•).
  *[ [ a↑°:c↑•; a↓°:c↓•
     | b↑°:c↑•; b↓°:c↓•
     ]
   ]

MIX[straight, W](a°, b°, c•) = (a°?W, b°?W, c•!W).
  *[ [ x: var W
     | [ a↑°?x:c↑•!x; a↓°?x:c↓•!x
       | b↑°?x:c↑•!x; b↓°?x:c↓•!x
       ]
     ]
   ]

MIX[anti, W](a°, b°, c•) = (a°!W, b°!W, c•?W).
  *[ [ x: var W
     | [ a↑°!x:c↑•?x; a↓°!x:c↓•?x
       | b↑°!x:c↑•?x; b↓°!x:c↓•?x
       ]
     ]
   ]


The above definitions belong to the three different versions of the component MIX. The distinction is caused by the absence or presence, and in the latter case the direction, of the data flow. This is determined by the parameters direction and width of the component. The exclamation mark and the question mark also denote the direction of the data flow and are therefore superfluous. The specification begins with a conflict set and a set of requirements, both of which are optional. The conflict set is a restriction on the use of the component. The conflict set contains pairs of input ports for which an overlap in handshake is not allowed. In this case there is a requirement W ≥ 0 and a conflict set { {a,b} } defined. W defines the width of all channels of the MIX component. The name of the component, written in bold capitals, is followed by two lists. The second list is the port list (between parentheses) which contains the names of the ports. The first list (between square brackets) is the parameter list which specifies the properties of the ports. The equality operator and the alphabet followed by a dot separate the header of the component from the body of the component. The body consists of the necessary declarations and a 4-phase command which defines the communication behaviour of the component. For each component version there is also a graphical symbol defined.

Remark: Note that the RTZ-phase of any handshake does not start until the Data-phase of all the handshakes in the command is completed. The command can therefore be rewritten into two sequenced commands which are identical apart from the direction of the arrows. Note that, the environment completes the handshake on channel a or b.

MIX[straight, 0](a°, b°, c•) = (a°, b°, c•).
  *[ [ a↑°:c↑• | b↑°:c↑• ]; [ a↓°:c↓• | b↓°:c↓• ] ]

3.2.2 Specification of a basic component, example SEQ

SEQ(a°, b•, c•) = (a°, b•, c•). *[ a↑°:(b•; c↑•); a↓°:c↓• ]

The sequencer is a control element which enforces sequential execution of two program parts. It is an interesting example because the command of the sequencer cannot be divided into a Data-phase and an RTZ-phase that are identical apart from the directions of the arrows.

3.3 Handshake circuits


A handshake circuit is a composition of basic components. The circuit definition contains a name, a set of external ports and a list of the constituent components. Among the external ports there is one port, often named Z, which is the initialising port. When the environment starts a handshake communication on port Z the circuit starts its operation. A handshake circuit is denoted in the Handshake Circuit Language (HCL) [A.O].

3.3.0 Handshake circuits: delay insensitive circuits

The relation between handshake circuits and the formal model is pointed out here so that it can be used in Chapter 5, where the operation of handshake circuits is modelled. Handshake circuits are delay insensitive circuits. In terms of the formal model of delay insensitive circuits (Chapter 2), a handshake circuit is a system. A system is a collection of processes and wires, in which a wire connects at most two processes. The basic components correspond with processes and the wires correspond with channels.

Within the formal model wires are treated as processes, and according to the definition of a process, this requires an input alphabet, an output alphabet and a trace set describing the possible states. To stay in line with the formal model, which treats a wire as a process, the channel together with its handshake protocol can be treated as a process too. The straight data channel has two input ports (data?, ack?) and two output ports (data!, ack!). The 4-phase handshake protocol defines a strict order of actions upon these ports, which can be described by the command: *(data?; data!; ack?; ack!). One could also decompose the channel by observing that a port of the component consists of two ports. The channel is a collection of two wires. The second observation will be used in Chapter 5 to define a model for handshake synchronisation.


4 The Transputer Network

This chapter contains a summary of the Transputer network user manual [Luk92]. It mainly describes Transputer Pascal primitives that implement communication, synchronisation and parallelism.

4.0 The hardware

The Transputer network consists of 50 INMOS Transputers. Between the processors are cables and link switches that define the topology of the network. Some of the connections between the Transputers are hardwired and some of the link switches can be set by software. Each Transputer can be connected to at most 4 others. The network is controlled by a Transputer, number 51, that resides in a host machine.

The only program a user of the system can call is tmon. Tmon is capable of performing all the required actions either by performing these actions itself or by calling the appropriate programs. Hence, in order to compile and run a program on the network, tmon is executed with the appropriate arguments. Program tmon requires at least one argument: a configuration file. This file contains a complete description of the network. The file contains amongst others:

• A spanning tree of the network
• For every Transputer, the name of the Transputer Pascal program it has to execute.

4.1 The Pascal system

It has been the aim of the designers of Transputer Pascal (TP) to stay as close as possible to the standard of Pascal. The main difference with standard Pascal is the lack of the file type and of the keyword packed. All the other differences can be regarded as extensions. The most important extension is the parallelism.


4.1.0 Parallelism

A conventional Pascal program can be viewed as a single process program. On a Transputer it is very cheap and very easy to have several processes running in parallel. The memory of the Transputer is shared among these processes. Parallelism is supported on two levels: at the statement level and the procedure level. In the latter case a process is either the main program or a procedure. One achieves parallelism by specifying that a procedure must be executed in parallel with the caller.

This is specified by the newly defined keyword fork. The caller can await the termination of the procedure by performing the statement join. A process that executes join awaits the termination of all the processes started by the statement fork in the activation record in which the join is performed. If a process does not await the termination of all the processes it started, these processes must not refer to local variables of the procedure. Furthermore, these processes should not terminate; if they do, it is an error.

The fork statement supports dynamic process creation without restrictions. This causes the fork to be a rather slow statement. One might want a simpler way of having parallelism at the expense of having a more limited construct. Besides the fork - join pair there is the cobegin - coend pair. They enclose a compound statement just like begin and end. Such a statement cannot contain calls of user-declared procedures or calls of the more complicated standard procedures, nor can it contain a fork or join statement.

4.1.1 File i/o

The monitor is the procedure that establishes input/output with the host in the root of the Transputer network. It cooperates with tmon. Currently there are two monitors: the linkmonitor and the filemonitor. The latter one is chosen to take care of the I/O, as the first one is not very efficient with file transfer. The filemonitor, available in the include file "file monitor", is started by calling the procedure monitor at the beginning of the program. It provides standard i/o along the channels input and output and offers a number of routines to perform file i/o. Note: File access is only possible from the Transputer in the host.


4.2 Communication primitives

The communication protocols of Transputer Pascal can be divided into the following two categories: synchronous and asynchronous. Synchronous communication is not suitable as it implies synchronisation of the asynchronous components of handshake circuits.

4.2.0 Semaphores

In the include file "semaphore" the type Semaphore and three routines are provided: the P-operation, the V-operation and an initialisation routine InitSemaphore. InitSemaphore must be called before a semaphore is used. Using semaphores to provide mutual exclusion is in general more efficient than using communications.


5 Functional simulation model

In this chapter a simulation model of the functional behaviour of handshake circuits is designed. The model includes a description of the communication behaviour of handshake circuits. In the next chapter this model will be extended to incorporate various performance measures, like timing and power dissipation of a circuit.

A simulation model of a system is a description of its components and their mutual connections, in such a manner that it describes the behaviour of the system.

In case of handshake circuits the connections consist of channels which transfer symbols between the components. The behaviour of the handshake circuit must be described, before an attempt can be made at modelling this behaviour. Each of the following sections begins therefore with an accurate description of the behaviour of parts of handshake circuits and ends with a definition of a suitable model for this behaviour. First handshake synchronisation is modelled by describing the behaviour of a nonput channel. Next handshake communication is modelled. This leads to the introduction of a data structure to contain the communicated value. After the definition of the translation model for the various command operators, the way is cleared for the translation of component specifications into Transputer Pascal procedures.

5.0 Modelling handshake synchronisation

Consider handshake circuit H consisting of two components P1 and P2 connected by nonput channel c (see Figure 5.0). P1 and P2 communicate with each other via channel c. Process P1 is the active partner in the mutual communication actions. Only the external behaviour of P1 and P2 restricted to actions on channel c is taken into account.

Figure 5.0


The channel consists of two wires named c.req and c.ack, along which P1 and P2 communicate using a handshake protocol, *(c.req;c.ack).

Figure 5.1 (wires c.req and c.ack)

The handshake circuit above can be described by system S. System S consists of two processes, P1 and P2.

S = {P1, P2}
P1 = ({c.ack}, {c.req}, *(c.req; c.ack)) = I(c.ack; c.req)
P2 = -P1 = W(c.req; c.ack)

Note, that the processes are not defined by means of a trace set, but by means of a so-called command. However, a command is a description of the behaviour of a process or system, from which the corresponding trace set can be generated.

In order to define the behaviour of S it is necessary to add explicit processes to model the internal wires. This results in the system SA, where

SA = {I(c.ack?; c.req!), W(c.req!; c.req?), W(c.req?; c.ack!), W(c.ack!; c.ack?)}

Figure 5.2 (wire ends c.req!, c.req?, c.ack?, c.ack!)

The 2-phase handshake protocol on channel c is obtained by weaving the commands of the constituent processes.

  *(c.req!; c.ack?) w *(c.req!; c.req?) w *(c.req?; c.ack!) w *(c.ack!; c.ack?)
= {calc}
  *(c.req!; c.req?; c.ack!; c.ack?)          (1)



The trace set of SA is the set of traces that can be generated from command (1). The process { {c.req!,c.ack!}, {c.req?,c.ack?}, *(c.req!;c.req?; c.ack!;c.ack?)} is an accurate description of the behaviour of the 2-phase handshake protocol.

Now it is time to model this behaviour within the simulation environment. In the simulation model there are processes and various communication primitives. The communication primitives can be divided into two categories: synchronous and asynchronous primitives. The synchronisation of the handshake components is already implemented by the handshake protocol. Therefore asynchronous primitives, in particular semaphores, are chosen to model the communications of handshake circuit H. The wires c.req and c.ack are implemented by semaphores c.x and c.y. Output actions, from the point of view of the processes P1 and P2, are modelled by V-operations. Input actions are modelled by P-operations. In short:

V(c.x) = c.req!          P(c.y) = c.ack?
V(c.y) = c.ack!          P(c.x) = c.req?

Components P1 and P2 are modelled by the processes P1 and P2. Initially c.x=0 ∧ c.y=0.

P1 = *(V(c.x); P(c.y))
P2 = *(P(c.x); V(c.y))

The order of the P- and V-operations on the semaphores corresponds precisely with the order of the actions of the 2-phase handshake protocol on the channel.

Processes P1 and P2 respect the following invariant, where c(op) denotes the number of completed executions of operation op during the execution of P1 and P2 so far.

Inv1:  0 ≤ c(V(c.x)) - c(P(c.y)) ≤ 1  ∧  0 ≤ c(P(c.x)) - c(V(c.y)) ≤ 1

The behaviour of the model cannot be computed by simply taking the weave of both processes, as the operations P and V require satisfaction of their respective preconditions before they are executed. Let b(op) and c(op) denote the number of blocked and completed executions of operation op, respectively. Semaphore s with initial value s0 is specified as follows:

Def:    s = s0 + c(V(s)) - c(P(s))
Inv:    0 ≤ s ∧ 0 ≤ b(P(s))
Axiom:  s = 0 ∨ b(P(s)) = 0


V-operations are never blocked and P-operations on s are blocked only when they are executed in a situation in which s=0. The sequence of events and states is deterministic because of the semaphore invariants, as can be deduced from the derivation below. Initially c.x=0 and c.y=0.

{c.x=0 ∧ c.y=0}     {(inv1 ⇒ ¬P(c.y) ∧ ¬V(c.y)) ∧ P(c.x) is blocked}
V(c.x)
{c.x=1 ∧ c.y=0}     {(inv1 ⇒ ¬V(c.y) ∧ ¬V(c.x)) ∧ P(c.y) is blocked}
P(c.x)
{c.x=0 ∧ c.y=0}     {(inv1 ⇒ ¬V(c.x) ∧ ¬P(c.x)) ∧ P(c.y) is blocked}
V(c.y)
{c.x=0 ∧ c.y=1}     {(inv1 ⇒ ¬V(c.x) ∧ ¬V(c.y)) ∧ P(c.x) is blocked}
P(c.y)
{c.x=0 ∧ c.y=0}     { }

Note that the P- and V-operations are not booleans: ¬V(c.y) denotes that a V-operation on semaphore c.y is not possible in this state. The repeated behaviour can be denoted with the following command:

*(V(c.x); P(c.x); V(c.y); P(c.y))          (2)

Command 2 corresponds precisely with command 1. Hence, nonput channels between handshake components can be modelled by a pair of semaphores. This is a suitable simulation model for 2-phase handshake synchronisation, as long as the implementations of the handshake components P1 and P2 do not affect the external behaviour of P1 and P2 restricted to channel c.
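The correspondence can also be observed experimentally. The following Python sketch runs P1 and P2 as two threads and records the order of the semaphore operations; Python's threading.Semaphore stands in for the TP semaphore, with release playing the role of V and acquire the role of P. The function name and the logging scheme are assumptions of this sketch.

```python
import threading

def simulate_handshakes(rounds: int) -> list:
    """Run P1 = *(V(c.x); P(c.y)) and P2 = *(P(c.x); V(c.y)) for `rounds` rounds."""
    x = threading.Semaphore(0)  # models wire c.req, initially 0
    y = threading.Semaphore(0)  # models wire c.ack, initially 0
    trace = []

    def p1():  # active partner
        for _ in range(rounds):
            trace.append("V(c.x)")  # logged before the release, so the log order is exact
            x.release()
            y.acquire()
            trace.append("P(c.y)")  # logged after the acquire has completed

    def p2():  # passive partner
        for _ in range(rounds):
            x.acquire()
            trace.append("P(c.x)")
            trace.append("V(c.y)")
            y.release()

    threads = [threading.Thread(target=p1), threading.Thread(target=p2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace
```

Whatever the scheduling of the two threads, the recorded trace is a repetition of V(c.x); P(c.x); V(c.y); P(c.y), in line with the deterministic derivation above.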

Handshake circuits, however, use 4-phase handshakes for both synchronisation and communication. The model of the RTZ part of the 4-phase protocol is identical to the first part, as the model does not depend on directions of arrows of 2-phase commands.

The two semaphores needed to model the handshake protocol are placed in a global data structure of type Wire. The name Wire refers to the physical connection between components, but deliberately deviates from the TP type channel.

type Wire = record
            ; x,y : semaphore
            end

The model is defined for the following four 2-phase commands:

Let a: Wire

a↑° :  P(a.x); V(a.y)
a↓° :  P(a.x); V(a.y)
a↑• :  V(a.x); P(a.y)
a↓• :  V(a.x); P(a.y)

5.1 Modelling handshake communication


There are four different 2-phase commands which define data transport along channel a:

a↑•!E :  active sending of data (straight)
a↑°?x :  passive receiving of data (straight)
a↑°!E :  passive sending of data (anti)
a↑•?x :  active receiving of data (anti)

The exclamation mark denotes output whereas the question mark denotes input. E is an expression in which input channels may be specified (footnote 1). This construct allows a reduction of the required amount of variables (area).

The semaphores used to describe the handshake protocol almost solve the problem of communication. The processes involved are already synchronised, but a shared data structure is needed for the transmission of data. This data structure is automatically protected from concurrent read or write actions by the two semaphores. Access to the data happens in mutual exclusion.

The communicated data is of the type word: a finite sequence of W bits. Integer W specifies the width of the connected ports (channel). In a circuit implementation the cost of every channel is closely related to its width. The channels within the simulator, however, do not show such a relation. It is efficient to give all shared data structures the same size, namely sizeof(integer), as TP contains a set of efficient instructions for the manipulation of integers. This implies that only the |W| least significant bits of the integer v are used. Problems arise when the width of a channel is larger than 32 bits. In practice this will rarely be the case. The simulator does not accept circuits with any channel wider than 32 bits.

Footnote 1: Expression E has to be unfolded.


Data communication requires an extension of the type Wire with variable v to contain the communicated value.

type Wire = record
            ; v : integer {value}
            ; x,y : semaphore
            end

Straight communication is implemented as follows:

Let a: Wire

a↑•!E :  a.v:=E; V(a.x); P(a.y)
a↑°?x :  P(a.x); x:=a.v; V(a.y)

Data transport in the opposite direction (anti communication) is realised by:

Let a: Wire

a↑°!E :  P(a.x); a.v:=E; V(a.y)
a↑•?x :  V(a.x); P(a.y); x:=a.v

Only the correctness of the implementation of straight communication is proved here, but the implementation of communication in the opposite direction can be verified similarly. Consider a handshake communication model consisting of two processes P1 and P2. P1 and P2 implement straight communication via a. Process P1 sends the evaluation of expression E to process P2. Only the external behaviour of P1 and P2 restricted to actions on a is taken into account.

P1 = *(a.v:=E; V(a.x); P(a.y))
P2 = *(P(a.x); z:=a.v {z=E}; V(a.y))

{a.x=0 ∧ a.y=0}
a.v:=E
{a.x=0 ∧ a.y=0 ∧ a.v=E}
V(a.x)
{a.x=1 ∧ a.y=0 ∧ a.v=E}
P(a.x)
{a.x=0 ∧ a.y=0 ∧ a.v=E}
z:=a.v
{a.x=0 ∧ a.y=0 ∧ a.v=E ∧ z=E}
V(a.y)
{a.x=0 ∧ a.y=1}
P(a.y)
{a.x=0 ∧ a.y=0}
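The proof outline for straight communication can be mirrored in a short Python sketch. The class Wire below is a Python counterpart of the record Wire, and transfer is a hypothetical helper that sends a list of values from P1 to P2; release and acquire on threading.Semaphore stand in for V and P.

```python
import threading

class Wire:
    """Counterpart of the record Wire: value v plus semaphores x and y."""
    def __init__(self):
        self.v = 0
        self.x = threading.Semaphore(0)
        self.y = threading.Semaphore(0)

def transfer(values: list) -> list:
    """Send `values` over one straight channel and return what arrives."""
    a = Wire()
    received = []

    def sender():  # active send: a.v:=E; V(a.x); P(a.y)
        for e in values:
            a.v = e
            a.x.release()
            a.y.acquire()

    def receiver():  # passive receive: P(a.x); x:=a.v; V(a.y)
        for _ in values:
            a.x.acquire()
            received.append(a.v)  # a.v is stable here: the sender waits on a.y
            a.y.release()

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

No extra lock protects a.v; as argued in the text, the two semaphores already guarantee that the write and the read never overlap.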


Note, that the commands which denote the RTZ-phase of the 4-phase protocol can be implemented without the assignments. The RTZ-phase is not meant for data communication. However in the code of the simulator these assignments are sometimes added to make the implementation of the RTZ-phase identical to the Data-phase. This leads to code reduction, when the RTZ-phase is implemented as another execution of the code that implements the Data-phase.

5.2 Modelling the handshake components

Modelling components requires translation of their constituent parts into TP. A component is translated into a procedure that is executed as a process (see Chapter 4). Within TP it is not possible to declare a procedure with two lists of parameters, hence the list of parameters and the list of channels have to be merged into one list of parameters. Requirements and conflict set can be found in the precondition of a procedure. The alphabet is redundant. Command operators are translated into semantically equivalent operations in TP. The translation of most command operators is trivial. The 'enclose' operator requires extra explanation. This is given in subsection 5.2.0, together with some shorthand notations for elementary handshake commands. Subsections 5.2.1 through 5.2.5 contain translation examples of components.

5.2.0 Preliminaries

A few shorthand notations for the implementations of the various handshakes are introduced to increase the readability of the code.

Active_2cycle(a)  :  V(a.x); P(a.y)
Active_4cycle(a)  :  Active_2cycle(a); Active_2cycle(a)
Passive_2cycle(a) :  P(a.x); V(a.y)

Commands are constructed from elementary commands using a set of operators. The translations of most operators do not present any problem. The 'enclose' operator is more difficult and will be investigated here. The translation of the elementary nonput handshake commands can be deduced from the definition of the 'enclose' operator (see Chapter 3). Note that the implementation of a↑°:b↑° is symmetric. As V-operations cannot block it is efficient to rewrite fork V(a.y); V(b.y); join into an ordered sequence of V-operations, for example V(a.y); V(b.y), thus avoiding dynamic creation of unnecessary processes.

a↑°:b↑• :  P(a.x); Active_2cycle(b); V(a.y)
a↑°:b↑° :  fork P(a.x); P(b.x); join; fork V(a.y); V(b.y); join

Consider the implementations of the following commands:

a↑°?x:b↑•!x :  P(a.x); {a.v=E} x:=a.v; b.v:=x; {b.v=E} Active_2cycle(b); V(a.y)

a↑°!x:b↑•?x :  P(a.x); Active_2cycle(b); {b.v=E} a.v:=b.v; {a.v=E} V(a.y)

a↑°!x:b↑°?x :  fork P(a.x); P(b.x); {b.v=E} join; a.v:=b.v; {a.v=E} fork V(a.y); V(b.y); join

The assignment x:=a.v cannot be placed after the assignment b.v:=x, as this would leave b.v undefined. The two statements can be replaced by the single statement b.v:=a.v. Other combinations lead to similar implementations.

The implementation of the other command operators is listed in Table 5.0.

Table 5.0

Command                    Transputer Pascal
*[b → A]                   while b do A
*[A]                       while true do A
A;B                        A;B
A || B                     fork A; B; join
[ ¬b → A [] b → B ]        if ¬b then A else B
(N)#A                      for i:=1 to N do A
A | B = A || B             fork A; B; join
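Two rows of Table 5.0 can be mimicked with Python threads as a rough illustration. In this sketch, which is not TP, every operand of the parallel composition gets its own thread, whereas the TP fork statement starts a procedure in parallel with its caller; the helper names par and repeat are hypothetical.

```python
import threading

def par(*commands):
    """A || B: run every command in its own thread and wait for all of them."""
    threads = [threading.Thread(target=c) for c in commands]
    for t in threads:
        t.start()  # fork
    for t in threads:
        t.join()   # join: returns only after every operand has terminated

def repeat(n, command):
    """(N)#A: execute command n times in sequence."""
    for _ in range(n):
        command()
```

As in the definition of the parallel operator, par terminates only when all its operands have terminated.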

5.2.1 Translation of components, example SEQ

After the preliminary work the components are translated into TP procedures.

SEQ(a°, b•, c•) = (a°, b•, c•). *[ a↑°:(b•; c↑•); a↓°:c↓• ]

;procedure SEQ(var a,b,c : Wire)
;begin while true do
   begin P(a.x)
   ; Active_4cycle(b)
   ; Active_2cycle(c)
   ; V(a.y)
   ; P(a.x)
   ; Active_2cycle(c)
   ; V(a.y)
   end
 end
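The behaviour of procedure SEQ can be tried out with a Python sketch that places the sequencer between a simple environment on a and two passive responders on b and c. The loop is bounded by a hypothetical parameter rounds so that the run terminates; the responders log the moment at which they see a request.

```python
import threading

class Wire:
    def __init__(self):
        self.x = threading.Semaphore(0)  # request
        self.y = threading.Semaphore(0)  # acknowledge

def active_2cycle(w):
    w.x.release()  # V(w.x)
    w.y.acquire()  # P(w.y)

def seq(a, b, c, rounds):
    for _ in range(rounds):
        a.x.acquire()
        active_2cycle(b)
        active_2cycle(b)  # together: Active_4cycle(b)
        active_2cycle(c)
        a.y.release()
        a.x.acquire()
        active_2cycle(c)
        a.y.release()

def run_seq(rounds):
    a, b, c = Wire(), Wire(), Wire()
    log = []

    def environment_of_a():  # activates the sequencer with 4-phase handshakes
        for _ in range(2 * rounds):
            active_2cycle(a)

    def responder(w, name):  # passive partner on b or c
        for _ in range(2 * rounds):
            w.x.acquire()
            log.append(name)
            w.y.release()

    threads = [
        threading.Thread(target=seq, args=(a, b, c, rounds)),
        threading.Thread(target=environment_of_a),
        threading.Thread(target=responder, args=(b, "b")),
        threading.Thread(target=responder, args=(c, "c")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In every round the log shows both 2-phase handshakes on b before the two on c, which is the sequencing the component is meant to enforce.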

5.2.2 Translation of components, example CON


The specification of CON[d,W](a,b) is divided in three parts. Depending on the value of [d,W] one of the component implementations is chosen to be used in the circuit. Within the simulator, the code of the three different CON specifications is put together in one procedure CON. Besides this, the RTZ-phase, being identical to the Data-phase, is not written down explicitly, as it would introduce redundant code. The statement "b.v:=a.v" is not needed in the RTZ phase but it does no harm either.

requirement: W ≥ 0

CON[straight, 0](a,b) = (a°, b•).
  *[ a↑°:b↑•; a↓°:b↓• ]

CON[straight, W](a,b) = (a°?W, b•!W).
  *[ [ x: var W
     | a↑°?x:b↑•!x; a↓°?x:b↓•!x
     ]
   ]

CON[anti, W](a,b) = (a°!W, b•?W).
  *[ [ x: var W
     | a↑°!x:b↑•?x; a↓°!x:b↓•?x
     ]
   ]

;procedure CON(d : Type; W : integer; var a,b : Wire)
;begin
   if W=0
   then begin while true do
          begin {a↑°:b↑• and a↓°:b↓•}
            P(a.x)
          ; Active_2cycle(b)
          ; V(a.y)
          end
        end
   else {W<>0}
     if d=STRAIGHT
     then begin while true do
            begin {a↑°?x:b↑•!x and a↓°?x:b↓•!x}
              P(a.x)
            ; b.v:=a.v
            ; Active_2cycle(b)
            ; V(a.y)
            end
          end
     else {d=ANTI, W<>0}
          begin while true do
            begin {a↑°!x:b↑•?x and a↓°!x:b↓•?x}
              P(a.x)
            ; Active_2cycle(b)
            ; a.v:=b.v
            ; V(a.y)
            end
          end
 end

5.2.3 Translation of components, example PAR


This component introduces a new phenomenon: parallelism within the component. TP offers two primitives for implementing parallelism: the co-statement and the fork-statement. In this case the statements that have to be executed are complex and according to Chapter 4 the fork-statement is the appropriate solution here. The join-statement awaits the termination of all the processes started by the statement fork in the program block in which the join is performed. For example, in program fragment fork P1; fork P2; fork P3; join the statement join awaits the termination of P1, P2 and P3.

PAR(a°, b•, c•) = (a°, b•, c•). *[ a↑°:(b• || c•); a↓° ]

;procedure PAR(var a,b,c : Wire)
;begin while true do
   begin P(a.x)
   ; fork Active_4cycle(b); Active_4cycle(c); join
   ; V(a.y)
   ; Passive_2cycle(a)
   end
 end

5.2.4 Translation of components, example NewPAR


The component NewPAR cannot be translated following the present translation rules. Observe the partial order of the actions of NewPar:

Figure 5.3 (partial order of the request and acknowledge events a.r, a.a, b.r, b.a, c.r, c.a of NewPAR)

The problem is that a.a↑ cannot start until both b.a↑ and c.a↑ are finished. This demands a join operation. However, the join-statement awaits the termination of all the processes started by the previous fork-statement. Using the join-statement after b.a↑ and c.a↑ would enforce c.a↑ before b.r↓ and b.a↑ before c.r↓, leading to the following partial order:

Figure 5.4 (partial order that results from using the join-statement)

Actions started by two different processes cannot be 'joined' without joining the two processes completely. Apparently the degree of parallelism within the NewPAR component is too complex to describe it with its actions on the external channels only. The problem can be described more accurately when the components are decomposed into more elementary parts. The components PAR and NewPAR are modelled with a composition of elements from the set PROC.


def PROC = {Toggle, Merge, C_elt, Fork, wire, I_wire}

Toggle: { {a}, {b,c}, (a?;b!;a?;c!)* }
Fork:   { {a}, {b,c}, (a?;b!,c!)* }
I_wire: { {a}, {b}, (b!;a?)* }
wire:   { {a}, {b}, (a?;b!)* }
Merge:  { {a,b}, {c}, (a?|b?;c!)* }
C_elt:  { {a,b}, {c}, (a?,b?;c!)* }
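To make two of these elements concrete, the Python sketch below models Toggle and C_elt as small state machines that map an incoming event to the output event it triggers, if any. The class and method names are hypothetical, and the sketch ignores timing and illegal inputs.

```python
class Toggle:
    """Toggle: (a?;b!;a?;c!)*. Successive inputs on a go alternately to b and c."""
    def __init__(self):
        self.next_output = "b"

    def a(self):
        out = self.next_output
        self.next_output = "c" if out == "b" else "b"
        return out

class CElement:
    """C_elt: (a?,b?;c!)*. Output c fires only after both inputs have arrived."""
    def __init__(self):
        self.seen = set()

    def input(self, port):
        self.seen.add(port)
        if self.seen == {"a", "b"}:
            self.seen.clear()  # ready for the next round
            return "c"
        return None  # still waiting for the other input
```

The C_elt is what makes a join of independently started actions possible, which is the role it plays in the NewPAR decomposition.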

The components PAR and NewPAR are systems modelled with elements of PROC. Figure 5.5 (PAR) and 5.6 (NewPar) show the internal structure of the two handshake components. Note that the double circled C denotes a 3-input C_elt. The ports are modelled with a system consisting of a toggle and a merge. It converts a 2-phase protocol on the channel into a 4-phase protocol. The model is derived from the partial order of the component.

Figure 5.5 (internal structure of PAR)

Figure 5.6 (internal structure of NewPAR)


The behaviour of PAR can be computed by weaving the commands of the constituent processes. The external behaviour can be computed by projecting the resulting command on the external alphabet of PAR. The question marks and exclamation marks are omitted, for reasons of readability. The operators in order of decreasing binding power are: "*", ":", "||", ";", "|".

PAR

= Toggle(ar,arl,arr) w Merge(aar,aal,aa) w Toggle(ba,bal,bar) w
  Merge(brl,brr,br) w Toggle(ca,cal,car) w Merge(crl,crr,cr) w
  C_elt(bar,car,aar) w Fork(arl,brl,crl) w
  W(bal;brr) w W(cal;crr) w W(arr;aal) w
  I(ar;aa) w W(br;ba) w W(cr;ca) w W(ar,aa)          {environment}

= ( (ar;arl;ar;arr)* w ((aar|aal);aa)* w (ba;bal;ba;bar)* w
    ((brl|brr);br)* w (ca;cal;ca;car)* w ((crl|crr);cr)* w
    (bar||car;aar)* w (arl;brl||crl)* w
    (bal;brr)* w (cal;crr)* w (arr;aal)* w
    (ar;aa)* w (br;ba)* w (cr;ca)* ) | {ar,aa,br,ba,cr,ca}

= (ar;arl;(brl;br;ba;bal;brr;br;ba;bar) || (crl;cr;ca;cal;crr;cr;ca;car);aar;aa;ar;arr;aal;aa)* | {ar,aa,br,ba,cr,ca}

= (ar;(br;ba;br;ba) || (cr;ca;cr;ca);aa;ar;aa)*

= {handshake command notation}
  *[ a↑°:(b• || c•); a↓° ]

For NewPAR this method does not work as the structure of the command, (e.g. the desired partial order) cannot be maintained during the projection. In other words handshake commands containing external channels only, cannot express the behaviour of NewPAR.

(ar; ( (br;ba;(br;ba || u)) || (cr;ca;(cr;ca || v)) || (u || v;aa;ar) ); aa)*

This leads to a specification of NewPar with a subcomponent C_elt.

NewPAR(a°, b•, c•) = (a°, b•, c•).
  *[ [ u,v : wire
     ; C_elt(u,v : wire) = ((u? || v?); a.a↑; a.r↓)
     | (a.r↑; C_elt(u,v,w) || (b↑•; b↓• || u!) || (c↑•; c↓• || v!); a.a↓)
     ]
   ]


The specification can be translated into TP using the defined translation rules with a few extensions for as far as it concerns the translation of the elementary processes.

Let a be a wire then

a? :  P(a)
a! :  V(a)

Apart from the actual translation an optimisation can be applied to the resulting implementation, according to the rule that sequenced actions do not require concurrency. Unnecessary concurrency is expensive as it requires creation of extra processes. This results in the following implementation. Note, that the C_elt+ is a result of an optimisation.

;procedure NewPar(var a,b,c : Wire)
;var u,v : Semaphore

;procedure C_elt+(var s,t : Semaphore)
;begin
   fork P(s); P(t); join
 ; V(a.y)
 ; P(a.x)
 end

;procedure Active_4cycle&sync(var w : Wire; s : Semaphore)
;begin
   Active_2cycle(w)
 ; V(s)
 ; Active_2cycle(w)
 end

;begin while true do
   begin P(a.x)
   ; fork Active_4cycle&sync(b,u)
   ; fork Active_4cycle&sync(c,v)
   ; fork C_elt+(u,v)
   ; join
   ; V(a.y)
   end
 end

This implementation is interesting for a number of reasons. Firstly, the speed of a circuit that contains NewPARs instead of PARs is increased, as the concurrency is enlarged. Secondly, NewPAR shows that handshake components are not black boxes with commands which describe their external behaviour. It is possible to simulate the internal behaviour of the component, when a component is specified by a list of subcomponents. Note that these subcomponents are not 4-phase handshake components but more elementary parts. However, when each component is treated as a system, the simulation speed will decrease.

5.2.5 Translation of components, example BAR

The component BAR introduces another command operator, "the choice" ( | ). The translation of this operator requires the proof of a small theorem.

Theorem 5.1

Let {b°, c°} be a conflict, then

*(b° | c°) == *b° || *c°

Proof:

Command 1, *(b° | c°), is rewritten into: *((b↑°; b↓°) | (c↑°; c↓°)). Command 2, *b° || *c°, is rewritten into: *(b↑°; b↓°) || *(c↑°; c↓°).

Consider the state graphs of both commands. They are not identical, but the conflict set implies that state 4 in the state graph of command 2 is not reachable, since the sequences b↑;c↑ and c↑;b↑ are not allowed.

Figure 5.7 State graph of command 1

Figure 5.8 State graph of command 2


The first operand of "|" is implemented by procedure Guard, the second operand is implemented by procedure Execute. Execution of the critical sections of Guard and Execute may not overlap, as they operate on a shared variable "local". Guard is activated along channel b and Execute is activated along channel c. Thus the environment makes the decision between both procedures. According to Theorem 5.1 this can be implemented by the fork-statement, and as both procedures are nonterminating processes there is no use for the join-statement.

;procedure BAR(var b, c, b0, c0, b1, c1 : Wire)
;var local : integer

 procedure Guard
 begin while true do
   begin P(b.x)
     fork Active_2cycle(b0) ;Active_2cycle(b1) ;join
     local := b0.v
     b.v := Or(b0.v, b1.v)
     V(b.y)
     {2-phase}
     P(b.x)
     fork Active_2cycle(b0) ;Active_2cycle(b1) ;join
     V(b.y)
   end
 end

 procedure Execute
 begin while true do
   begin P(c.x)
     if local=0 then Active_2cycle(c1) else Active_2cycle(c0)
     V(c.y)
     {2-phase}
     P(c.x)
     if local=0 then Active_2cycle(c1) else Active_2cycle(c0)
     V(c.y)
   end
 end

;begin fork Guard; Execute;
 end


5.3 Simulation of Handshake Circuits

Modelling its components is not sufficient for the simulation of a handshake circuit. Firstly, the following question requires an answer: does parallel composition of processes result in the same behaviour as parallel composition of the modelled components? Secondly, handshake circuits have to be initiated and offered the right input to make them operate. The same is required for the simulation model.

5.3.0 Correctness of the simulation

The above question can be answered by stating that the parallel composition is exactly the same in both the handshake circuit and the simulator. Hence parallel composition of the processes that define the components results in a correct simulation model of a handshake circuit.

5.3.1 I/O

After the initiation, the circuit determines the order of communication with the environment (i.e. the system is active with respect to its external ports). The environment must be willing to participate in each communication action in order not to block the circuit.

Let S be the system that represents the handshake circuit. A system can operate only when all its ports are properly connected. Dangling inputs are not allowed, as these may pick up or radiate stray signals. System S is Correct when it is closed and free of interference. When system S is not closed, a system U is to be constructed for which the following predicate holds:

Correct.S par U

Consider a system S with only one port, say a. The behaviour of S is equivalent to the behaviour of some process P, which is described with a 4-phase handshake command. There are only six possibilities: the port can be active or passive, and it can be used for input, output or control. In addition, the initiation port is a passive nonput port willing to communicate only once. When port a is decomposed into two wires, aa and ar, this results in the following process descriptions of system S.

{ {ar}, {aa}, a° }        initiation port
{ {aa}, {ar}, *a• }       active nonput
{ {aa}, {ar}, *a•?x }     active input
{ {aa}, {ar}, *a•!E }     active output
{ {ar}, {aa}, *a° }       passive nonput
{ {ar}, {aa}, *a°?x }     passive input
{ {ar}, {aa}, *a°!E }     passive output

For each of the above process descriptions there must be a process Q for which:

Correct. { P, Q}

This can be established by choosing Q=-P as:

Correct.{P,-P}

for all P.


For example, the reflection of process { {aa}, {ar}, *a•!E } is { {ar}, {aa}, *a°?x }. For a system S with more ports the recipe is repeated by connecting every individual port a to a process which is the reflection of S↾a. When environment system U is constructed according to this method, the predicate Correct.S par U holds.
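The reflection of the single-port descriptions can be sketched in a few lines of Python. This is an illustration under assumptions: a description is modelled as a record of input wires, output wires, activity and kind, and reflection is taken to swap inputs with outputs, active with passive, and input with output, which is what the example above shows. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortProcess:
    inputs: frozenset      # wires read by the process, e.g. {"ar"}
    outputs: frozenset     # wires driven by the process, e.g. {"aa"}
    activity: str          # "active" or "passive"
    kind: str              # "nonput", "input" or "output"


def reflect(p: PortProcess) -> PortProcess:
    """Return the environment process that can be connected to p."""
    flip_activity = {"active": "passive", "passive": "active"}
    flip_kind = {"input": "output", "output": "input", "nonput": "nonput"}
    return PortProcess(p.outputs, p.inputs,
                       flip_activity[p.activity], flip_kind[p.kind])


# { {aa}, {ar}, *a•!E }  and  { {ar}, {aa}, *a°?x }
ACTIVE_OUTPUT = PortProcess(frozenset({"aa"}), frozenset({"ar"}), "active", "output")
PASSIVE_INPUT = PortProcess(frozenset({"ar"}), frozenset({"aa"}), "passive", "input")
```

Applying `reflect` twice returns the original description, so every port description has exactly one partner in this model.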

A handshake circuit environment built by the simulator consists of newly defined components of type Extern. The components read from or write to files when data is involved. Components connected to nonput ports do not include file I/O. There are seven different components, as one component is required for the initiation port. For example, the component connected to an active output port has the following definition:

EXTERN(STRAIGHT, w, ACT)(a)

= *[ a↑°?x ; a↓°?x ]

These components are translated into TP procedures following the translation rules. A programmer of handshake circuits can also decide to include the environment within the circuit description, thus avoiding the need for files. This will lead to considerably faster simulations, as file I/O is slow. Implementation of the above definition leads to the following Transputer Pascal procedure:

;procedure EXTERN(f : File; d : Type; W : integer; act : Activity; a : Wire)
;begin
   if (d=STRAIGHT) and (W>0) and (act=ACT)
   then begin while true do
          begin
            P(a.x) Fwrite(f, a.v) V(a.y) Passive_2cycle(a)
          end
        end
   else {d=ANTI ∨ W=0 ∨ act=PAS}
 end

5.3.2 Handshake circuit as Transputer Pascal program


Now it is time to integrate all the constituent parts of the translation, into a TP program. This program will be automatically generated from a HCL file. This translation is performed by a compiler HCL2TP [Chapter 7]. The syntax definition of a handshake circuit simulation program can be found in Appendix A.


6 Timing & Power dissipation

In this chapter the functional simulator is extended with timing analysis and computation of power dissipation. Performance analysis allows the VLSI programmer to discriminate between several programs and program parts that all satisfy the functional specification but result in VLSI circuits with varying degrees of efficiency. This is important because the programmer will have to meet requirements for these properties that depend on the application at hand. The transparency of the translation of VLSI programs into VLSI circuits allows the programmer to make a rough estimate of the speed and power dissipation of the resulting VLSI implementation of his program.

The first section explains how the timing information can be extracted from the circuit. In Section 6.1 the timing functions are specified, and the functional model of the component is extended with timing functions which generate the timing information. Section 6.2 discusses advantages and disadvantages of bit based versus word based timing analysis, and shows the data dependency of both timing analyses. Section 6.3 explains why the timing specification of the component Arbiter cannot be realized within the current simulator structure. The last section discusses power dissipation of handshake circuits.

6.0 Timing analysis

Timing analysis of handshake circuits is based on the time at which data items are communicated between the components. When the starting time of every action of all handshake communications is registered, the behaviour of the entire circuit is captured. Due to the transparency of the translation, the structure of the VLSI program is closely related to the structure of the handshake circuit. The speed of the program can be deduced from the speed of its parts. It is possible to relate an elementary program part (statement) to the time during which it is executed within the circuit, by observing the handshake communications on its initiating channel only. Every complete 4-phase handshake on this channel corresponds with one execution of the statement. The timing behaviour of larger program parts is deduced from the timing behaviour of its elementary parts. The elementary parts of a VLSI program are assignments and communications. In fact, an assignment x:=E can also be viewed as a collection of communications. Subcircuits that implement the communications of the VLSI program have a fixed structure. There are two possibilities: either the subcircuit is shared between occurrences within the program or it is not. If the subcircuit is not shared, it can be monitored by observing the handshake actions on the nonput port of the Transferrer component. The Transferrer is the basic component of a subcircuit that implements a communication. Figure 6.0 shows a circuit which implements the not shared statement: x:=y.

Figure 6.0

A shared subcircuit has more than one initiating channel and timing analysis requires a monitor for each of them. Figure 6.1 shows the solution in which the statement x:=y is shared. This circuit is an optimisation of the circuit in Figure 6.2.

Figure 6.1

Figure 6.2

The monitor components that are required for timing registration of shared subcircuits are automatically added by the Tangram compiler. So the monitors of Figure 6.1 are added by the compiler. The other channels which have to be monitored are not replaced by monitor components. In the latter case the channels are always connected to the passive port of a Transferrer, hence the monitor function can easily be fulfilled inside this component. Note that not all channels connected to passive ports of Transferrers need to be monitored (see Figure 6.1). Therefore, the Tangram compiler generates a file, called _.lnk, of the channels which must be monitored inside a Transferrer. This file also contains the relation between all the monitored channels and the textual position of the corresponding communications and assignments in the Tangram program. The monitors store their timing information in a file which can be viewed by a program called 'Dicyview'.

This leaves a few problems to be solved. The time at which the individual handshake actions occur must be computed and registered. Registration is done by attaching a time stamp to individual actions of the handshake protocol. The time stamp is stored in the global data structure of type Wire. The time stamp of a channel specifies the time on which the last output action to this channel occurred. So only two of the four handshake actions generate a new time stamp. The time stamp of an input equals the time stamp of the preceding output action.

The receiving component uses the time stamp to compute the time stamps of its outputs. The time at which an output occurs depends on the time stamps of related inputs, the value of those inputs and the time spent on the various actions between the inputs and the output. This differs from one component to another. Time stamps are computed on the basis of local information only. This is correct for a certain class of components, but the Arbiter component presents problems (Section 6.3).

Timing calculations have consequences for the definition of type Wire. Time stamps are passed on from one component to another. Therefore, the definition is extended with an additional field time denoted with t.

type Wire = record x, y : Semaphore
                   v : integer      {value}
                   t : Time         {time}
            end

6.1 Timing model

The time at which an output occurs depends on the time stamps of related inputs, the value of those inputs and the time spent on the various actions between the inputs and the output. The actions differ from one component to another, hence every component type has its own timing function. The type of a component is determined by its name, the width of its channels and its other parameters.


The set of inputs related to a particular output is deduced from the partial order of the port operations of a component. Consider the partial order of component Newpar:

a.r↑ < b.r↑ < b.a↑ < b.r↓ < b.a↓ < a.a↓
a.r↑ < c.r↑ < c.a↑ < c.r↓ < c.a↓ < a.a↓
{b.a↑, c.a↑} < a.a↑ < a.r↓ < a.a↓

Figure 6.3 Partial order of the port operations of Newpar

A port a consists of a request wire, denoted by a.r, and an acknowledge wire, denoted by a.a. From the point of view of a component, one of the wires of a particular port is used for input whereas the other is used for output. All the actions in Figure 6.3 are output actions; however, some of them are performed by the component Newpar and the others are performed by the environment. The set {a.r↑, b.a↑, c.a↑, a.r↓, b.a↓, c.a↓} contains the outputs of the environment and {b.r↑, c.r↑, a.a↑, b.r↓, c.r↓, a.a↓} are the outputs of component Newpar. The outputs of the environment are the inputs of the component Newpar.

The time stamp of an output is computed from the time stamps of those inputs which immediately precede the output in the partial order. The set of inputs from which a certain output o is computed is called Pred.o. For example:

Pred.(a.a↑) = {b.a↑, c.a↑}

An input i is called productive with respect to output o when it is the last input that is required for the computation of the time stamp of output o. The delay between productive input i and output o is computed by d.i.o. Function d is called a timing function. It depends on the architecture of the handshake component as well as the value of the processed data. When the value of the data is not taken into account, function d is constant and can be computed once, just before the simulation starts. Note that this is correct when the timing function d.i.o is independent of the time stamps of the other inputs of Pred.o. In practice this may not be the case, but it is a good approximation.

A time stamp t of a certain output o is computed with the following formula:

o.t = (MAX i : i ∈ Pred.o : i.t + d.i.o)
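As a small worked illustration of this formula, the following Python sketch evaluates the time stamp of a.a↑ of Newpar from Pred.(a.a↑) = {b.a↑, c.a↑}. The event names, time stamps and delays are made-up values chosen for the example; only the formula itself is taken from the text.

```python
def output_time(pred_times: dict, delay: dict, output: str) -> int:
    """o.t = (MAX i : i in Pred.o : i.t + d.i.o).

    pred_times maps each input in Pred.o to its time stamp i.t;
    delay maps a pair (input, output) to the timing function value d.i.o.
    """
    return max(t + delay[(i, output)] for i, t in pred_times.items())


# Made-up numbers: b.a↑ arrives at time 7, c.a↑ at time 4.
pred = {"b.a_up": 7, "c.a_up": 4}
d = {("b.a_up", "a.a_up"): 1, ("c.a_up", "a.a_up"): 5}
t_aa_up = output_time(pred, d, "a.a_up")   # max(7 + 1, 4 + 5)
```

In this example the later input (b.a↑) is not the one that determines the output time, because the delay from c.a↑ is larger.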


Timing functions are not placed directly in the code of the components, as the performance characteristics of a component are more volatile than its functional behaviour. The timing functions of all components are placed in a table. A timing function d.(b.a↑).(b.r↓) of component Newpar is called:

delay_bb_ud

In which u and d denote the direction of the arrows: "up" and "down".

6.1.0 Timing analysis of components, example Newpar

In this section the Transputer Pascal implementation of Newpar is discussed.

At the beginning of a simulation the timing functions are evaluated and stored in a local array d. For this purpose timing functions are numbered according to the alphabetical order. For example:

delay_aa_dd    1
delay_ab_uu    2
delay_ac_uu    3
delay_ba_dd    4
delay_ba_uu    5
delay_bb_ud    6
delay_ca_dd    7
delay_ca_uu    8
delay_cc_ud    9
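The numbering rule can be reproduced with a short Python sketch. It assumes nothing beyond what the text states: the names are sorted alphabetically and numbered from 1. The function name `number_timing_functions` is hypothetical.

```python
def number_timing_functions(names):
    """Number timing functions from 1 in alphabetical order of their names."""
    return {name: k for k, name in enumerate(sorted(names), start=1)}


# The nine timing functions of Newpar, deliberately given out of order.
NEWPAR_DELAYS = [
    "delay_cc_ud", "delay_ba_uu", "delay_aa_dd",
    "delay_ca_uu", "delay_ab_uu", "delay_bb_ud",
    "delay_ac_uu", "delay_ca_dd", "delay_ba_dd",
]
index = number_timing_functions(NEWPAR_DELAYS)
```

The resulting indices are the positions used for the local array d in the implementation of Newpar.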

Assignments to time stamps are placed immediately in front of all V-operations which implement output actions of the component.

procedure NewPAR(var a, b, c : Wire)
;var u,v : Wire
     d : array [1..9] of integer

 procedure C_elt
 begin
   fork P(u); P(v); join
   a.t := MAX(u.t+d[5], v.t+d[8])
   V(a.y); P(a.x)
 end

 procedure Active_4cycle_on_b
 begin
   b.t := a.t+d[2]
   Active_2cycle(b); u.t := b.t; V(u)
   b.t := b.t+d[6]
   Active_2cycle(b)
 end

 procedure Active_4cycle_on_c
 begin
   c.t := a.t+d[3]
   Active_2cycle(c); v.t := c.t; V(v)
   c.t := c.t+d[9]
   Active_2cycle(c)
 end

;begin while true do
   begin P(a.x)
     fork Active_4cycle_on_b
     fork Active_4cycle_on_c
     fork C_elt
     join
     a.t := MAX(a.t+d[1], b.t+d[4], c.t+d[7]); V(a.y)
   end
 end


Note that the code is slightly different from the one presented in Chapter 5. The reason for this is that semaphores are not sufficient for the internal synchronisation, as the time stamps of channels b and c are not accessible when they are needed to compute the time stamp of a.a↑. Type Wire is chosen for u and v, as the introduction of a new type serves no purpose.

6.1.1 Timing analysis of components, example MON

The component MON stores the timing information of each 4-phase handshake on channel a in the file mon. The timing functions of component MON are of course equal to zero.

;procedure MON(var a,b : Wire)
;var t0,t1 : integer
     mon : File
;begin while true do
   begin P(a.x)
     t0 := a.t
     b.t := a.t
     Active_2cycle(b)
     a.t := b.t
     V(a.y)
     P(a.x)
     b.t := a.t
     Active_2cycle(b)
     a.t := b.t
     t1 := a.t
     Fwrite(mon,'a', t0,t1)
     V(a.y)
   end
 end

6.2 Word based timing versus bit based timing


Data items can be treated as words but also as a collection of bits. Bit based simulation is more accurate, especially for arithmetic calculations. Word based simulation is more or less a worst case analysis of bit based simulation. When data items are treated as words, all the bits of a word are given the same time stamp. In reality the bits of a word are received at different points in time. This can have a large impact on the timing results of circuits that implement large arithmetic computations. The consequences of this phenomenon can best be shown by an example of a circuit that implements the addition of two numbers.

Example: addition

The purpose of this example is to show that bit based and word based timing analyses lead to different timing results. This example requires the specification of the functions ha and fa [Haa92] which stand for half-adder and full-adder respectively.

ha : {0,1}^2 → {0,1}^2

ha.[a0,a1] = [s,c], such that a0 + a1 = s + 2*c

fa : {0,1}^3 → {0,1}^2

fa.[a0,a1,a2] = [s,c], such that a0 + a1 + a2 = s + 2*c

Because the binary representation is unique, the s and c values from the previous specification are also unique.

For rows x ∈ {0,1}^3 and y ∈ {0,1}^4 the addition is illustrated in Figure 6.4. 'HA' stands for half-adder and 'FA' for full-adder. In this figure, the rows x and y are added, denoted by x ⊕ y. The definition of ⊕ gives a computation scheme for adding binary numbers.

Definition of ⊕. For x ∈ {0,1}^m, y ∈ {0,1}^n and m ≤ n

⊕ : {0,1}^m × {0,1}^n → {0,1}^(n+1)

⊕.[x,y] = [s_0, s_1, ..., s_(n-1), c_(n-1)] = [z_0, ..., z_n] where

[s_0, c_0] = ha.[x_0, y_0]

[s_i, c_i] = fa.[x_i, y_i, c_(i-1)]    for 1 ≤ i < m
[s_i, c_i] = ha.[y_i, c_(i-1)]         for m ≤ i < n

For m > n there is a symmetrical definition. □

Assume that an adder (HA & FA) produces a carry as soon as possible (ASAP scheduling). For this the adder does not require a value on all its input channels. A half-adder can compute the value of the carry after the receipt of one input that equals 0. A full-adder can compute the value of the carry after the receipt of two identical inputs; the value of the carry equals the value of the majority of the inputs.
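The definitions of ha, fa and ⊕ translate directly into a few lines of Python. The sketch below is an illustration only; it assumes that rows are lists of bits with index 0 as the least significant bit, and the helper names `add_rows`, `to_bits` and `to_int` are hypothetical.

```python
def ha(a0, a1):
    """Half-adder: returns [s, c] with a0 + a1 = s + 2*c."""
    total = a0 + a1
    return total % 2, total // 2


def fa(a0, a1, a2):
    """Full-adder: returns [s, c] with a0 + a1 + a2 = s + 2*c."""
    total = a0 + a1 + a2
    return total % 2, total // 2


def add_rows(x, y):
    """x ⊕ y for bit rows with len(x) = m <= len(y) = n; result has n+1 bits."""
    m, n = len(x), len(y)
    assert 1 <= m <= n
    s, c = ha(x[0], y[0])
    z = [s]
    for i in range(1, n):
        # Full-adders while both rows have a bit, half-adders afterwards.
        s, c = fa(x[i], y[i], c) if i < m else ha(y[i], c)
        z.append(s)
    return z + [c]


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def to_int(bits):
    return sum(b << i for i, b in enumerate(bits))
```

With these helpers, `to_int(add_rows(to_bits(x, 3), to_bits(y, 4)))` equals `x + y` for all 3-bit x and 4-bit y.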

Figure 6.4


Let S ⊆ I be such that j.t can be computed from the inputs in S, and S has no proper subset of received inputs from which j.t can be computed. The time between the last required input and the resulting output is chosen to be 1. As every action has delay 1, the model is called the unit delay model. This model is not very accurate, but it serves the purpose for now.

j.t = (MAX i : i ∈ S : i.t) + 1

Consider the following computation, in which the tuples denote (value, time stamp).

Figure 6.5

Each row of adders in Figure 6.5 can be regarded as a BIN component. When the duration of the computation is computed with word based timing the computation takes at least 6 time units since the time of the output of the BIN component is defined as the maximum of the time stamps of all the bits. After the computation of the first BIN the maximum time stamp of any bit is 3, so the time stamp of all bits is given the value 3. Bit based timing analysis leads to a result of 4 units, see Figure 6.6.

Figure 6.6

End of example.
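The difference between the two analyses can also be reproduced with a short Python sketch. It is an illustration under assumptions: the unit delay model and the ASAP carry rules are taken from the text, the input rows are made up (they are not the values of Figure 6.5), rows are lists of (value, time stamp) tuples with the least significant bit first, and all function names are hypothetical.

```python
def timed_ha(a, b):
    """Half-adder on (value, time stamp) pairs under the unit delay model."""
    total = a[0] + b[0]
    latest = max(a[1], b[1])
    zeros = [t for v, t in (a, b) if v == 0]
    # ASAP carry: a single input that equals 0 already fixes the carry.
    carry_ready = min(zeros) if zeros else latest
    return (total % 2, latest + 1), (total // 2, carry_ready + 1)


def timed_fa(a, b, c):
    """Full-adder on (value, time stamp) pairs under the unit delay model."""
    ins = (a, b, c)
    total = sum(v for v, _ in ins)
    carry = total // 2
    # ASAP carry: known once two inputs with the majority value have arrived.
    agreeing = sorted(t for v, t in ins if v == carry)
    latest = max(t for _, t in ins)
    return (total % 2, latest + 1), (carry, agreeing[1] + 1)


def timed_add(x, y):
    """Timed version of x ⊕ y for len(x) <= len(y)."""
    m, n = len(x), len(y)
    s, c = timed_ha(x[0], y[0])
    z = [s]
    for i in range(1, n):
        s, c = timed_fa(x[i], y[i], c) if i < m else timed_ha(y[i], c)
        z.append(s)
    return z + [c]


def as_word(row):
    """Word based view: every bit gets the latest time stamp of the row."""
    latest = max(t for _, t in row)
    return [(v, latest) for v, _ in row]


def finish_time(row):
    return max(t for _, t in row)


# Made-up inputs, all available at time 0: w + (x + y) with x = 7, y = 1, w = 7.
x = [(1, 0), (1, 0), (1, 0)]
y = [(1, 0), (0, 0), (0, 0)]
w = [(1, 0), (1, 0), (1, 0)]
first = timed_add(x, y)
bit_based = finish_time(timed_add(w, first))
word_based = finish_time(timed_add(w, as_word(first)))
```

Since every timing rule used here is monotone in the input time stamps, the bit based result can never be later than the word based result; for these inputs it is strictly earlier.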


So the accuracy of the timing results is better with bit based timing analysis. Bit based timing analysis requires a few changes of the current simulator. For some components the timing functions need to generate a time stamp for each individual bit. The amount of required time stamps equals the width of the channel. Hence, the simulator has to keep track of the width of a channel as well. The type Wire has to be extended, or a new Wire type has to be defined for each channel width. More changes are not required. For other components the time stamps of all bits are always the same, hence one time stamp is sufficient in these cases.

6.2.0 Timing function of an adder

This subsection gives some insight in some aspects of the construction of timing functions. Consider for example an addition of two positive numbers. The addition of two integers is different and will not be discussed here. The maximum delay of an output during the data-phase is reached when every adder needs the carry-in to compute its carry-out. Figure 6.7 shows the maximum delay of a BIN( +,5,3,4,a,b,c). The tuples denote (value, time stamp). The time stamps are computed according to the unit delay model.

Figure 6.7

BIN(+,U,V,W, ...) contains 1+|V-W| half-adders (HA) and (V min W)-1 full-adders (FA). The model is simplified by stating that the delay of an adder is independent of the values of its inputs. This implies also that it does not matter which of the two operands arrives first at the BIN component, as the function is symmetric in both operands. The timing formula can be rewritten into:

j.t = (MAX i : i ∈ Pred.j : i.time) + d.j

d_HA and d_HA(c) define the maximum delays between the last input and any output and between the last input and the carry. For FA similar definitions exist.

1 ≤ V ∧ 1 ≤ W ∧ V ≠ W:    delay_a_u = (|V-W|)*d_HA(c) + d_HA + ((V min W)-1)*d_FA(c)

1 < V ∧ 1 < W ∧ V = W:    delay_a_u = d_HA(c) + (V-1)*d_FA(c) + d_FA

1 = V ∧ 1 = W:            delay_a_u = d_HA


In the RTZ-phase the adder never needs the carry to return to zero, leading to constant RTZ-delays. RTZ-delays are independent of parameters U,V,W.

1 < V ∧ 1 < W:    delay_a_d = MAX(d_HA, d_FA)

1 = V ∧ 1 ≤ W:    delay_a_d = d_HA
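The delay formulas for the data phase and the RTZ phase can be evaluated once per component, as in the following Python sketch. It transcribes the formulas as printed; the parameter names are hypothetical, and the case V > 1 with W = 1, which the RTZ formulas do not list, is assumed here to behave like the case 1 = V.

```python
def delay_a_u(V, W, d_HA, d_HA_c, d_FA, d_FA_c):
    """Maximum data-phase delay of BIN(+,U,V,W,...).

    d_HA, d_FA   : delay between the last input and any output
    d_HA_c, d_FA_c : delay between the last input and the carry
    """
    assert V >= 1 and W >= 1
    if V == 1 and W == 1:
        return d_HA
    if V == W:
        return d_HA_c + (V - 1) * d_FA_c + d_FA
    return abs(V - W) * d_HA_c + d_HA + (min(V, W) - 1) * d_FA_c


def delay_a_d(V, W, d_HA, d_FA):
    """Constant RTZ-phase delay; independent of the carry chain."""
    if V > 1 and W > 1:
        return max(d_HA, d_FA)
    return d_HA   # assumption: also used for V > 1, W = 1
```

For example, with all four adder delays set to 1, `delay_a_u(3, 4, 1, 1, 1, 1)` evaluates 1*1 + 1 + 2*1.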

For the outputs b1, b2, c1 and c2 the formulas are simpler. It is obvious that these functions should be evaluated once at the beginning of the simulation. Therefore each component has an array to store its timing parameters. Not all the components have such a difficult timing function; some of the components, for example CON and COMBINE, do not introduce any delay as they are equivalent to wires. As timing functions depend largely on the architecture of the components, which is not known to the author of this report, they are currently assigned the constant value 1.

6.3 Timing of the Arbiter

The Arbiter is a component to which the current time model cannot be applied.

command:        *(a↑°; c↑•; a↓°; c↓•) || *(b↑°; d↑•; b↓°; d↓•)
restriction:    conflict.{c•, d•}

Figure 6.8


The Arbiter guarantees mutual exclusion of handshakes along the channels c and d. The Arbiter component treats the requests in order of arrival. When request a and b arrive at approximately the same time, one of them is chosen to be granted first. The other one is executed when the first one has finished.

Note, that the functional behaviour can easily be implemented. The functionality of the circuit does not change when the requests are granted in different order, as this is the main property of Delay Insensitive circuits!

Within the simulation it is very well possible that the order in which requests arrive does not correspond with the order of their time stamps. When the requests are also granted in this wrong order, the timing specification of the Arbiter is not respected. There are a few possibilities to solve this problem, but as they are all beyond the scope of this report, they will be mentioned briefly.

• One could introduce global time such that the actions of the simulator are executed in order of their time stamps.

• The Arbiter can wait for an input on its other channel and grant the request with the lower time stamp. However, this second input may never come and the simulation would starve. This can be determined with a termination algorithm. When the simulation does not make progress any more, the Arbiter with the input with the lowest time stamp grants the request, after which the simulation proceeds.

• Another possibility is to grant a request immediately and terminate the simulation when a wrong choice is made. This strategy calls for backtracking, which is very difficult if not impossible.
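The core of the first two options, granting by time stamp rather than by arrival order, can be sketched in Python with a priority queue. This is an illustration only: it assumes that all pending requests are already known, which is exactly the condition that is hard to establish in the distributed simulator, and the function name is hypothetical.

```python
import heapq


def grant_in_timestamp_order(arrivals):
    """Return the channels in the order in which they should be granted.

    arrivals holds (time_stamp, channel) pairs in the order in which the
    simulator happens to deliver them, which may differ from time order.
    """
    heap = list(arrivals)
    heapq.heapify(heap)                      # smallest time stamp on top
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Request b is delivered first, but request a carries the lower time stamp.
order = grant_in_timestamp_order([(12, "b"), (9, "a")])
```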

The Mixer could suffer from the same deficiency when its conflict set is not respected. When the inputs are unrelated the conflict restriction can be violated. Inputs that arrive in the wrong order will be processed in the wrong order. This problem is solved by adding an Arbiter to the circuit every time a Mixer is used in a situation in which mutual exclusion on the two passive channels is not already guaranteed by the environment.

In conclusion, in a distributed discrete event simulator, Arbiters have a large impact on the simulation speed.


6.4 Power dissipation

The goal of power dissipation analysis on handshake circuit level, is to give an estimation of the dynamic power dissipation before the layout of the circuit is computed. The results must be achieved at low cost, that is: fast.

A good approximation of the energy consumption of a static CMOS circuit is calculated by summing over all wires the quantity N·C·V²/2, where N is the number of transitions on that wire, C the capacitance of the wire including the transistor gates connected to it, and V the supply voltage. Accurate values of all capacitances can be extracted from the layout [Ber92]. However, the layout is only partially available. The layout of the components is constant, but the length of the interconnecting wires and their capacitances may vary. The power dissipation model defines, similar to the timing model, the length of all wires to be equal. This unit wire load model gives a good approximation of the dynamic power dissipation.
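The energy estimate can be written as a short Python sketch. The function names and the numbers in the usage note are made up for illustration; only the quantity N·C·V²/2 per wire and the unit wire load assumption come from the text.

```python
def dynamic_energy(wires, supply_voltage):
    """Sum N*C*V^2/2 over all wires.

    wires holds (transitions, capacitance) pairs, one per wire.
    """
    return sum(n * c * supply_voltage ** 2 / 2 for n, c in wires)


def unit_wire_load_energy(transition_counts, unit_capacitance, supply_voltage):
    """Same estimate when every wire is given the same capacitance."""
    wires = [(n, unit_capacitance) for n in transition_counts]
    return dynamic_energy(wires, supply_voltage)
```

For instance, two wires with 4 and 2 transitions, a unit capacitance of 1.0 and a supply voltage of 2.0 (arbitrary units) give 4·1·4/2 + 2·1·4/2 under the unit wire load model.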

Energy consumption can now be calculated by summing the consumption of all the components and their interconnecting wires. The dissipation of each component depends on its architecture and the values of the inputs it processes. Similar to the timing model, data dependencies are not taken into account.

Energy consumption during a certain time span is of interest to the designer of VLSI circuits. The smaller the time span between two registrations the more accurate the results become. When every component registers its dissipation between an input and a related output, it gives the best results. However this requires an enormous amount of file operations which makes the simulation very slow. One could also compute the average power dissipation during the whole execution time. This requires only one monitor but the result is not very detailed, and high charges cannot be detected. It is also possible to register the dissipation of the complete 4-phase handshake command of a component.

A component requires an energy consumption function for each pair of related input and output. The identification of these functions is identical to the ones in the timing model. All power dissipation functions are put in a table called Power.

Any contribution to the total dissipation quantity is to be counted exactly once. To guarantee that every contribution is accounted for the contribution is immediately registered in a file. This leads to a slow but correct analysis. There are many ways to optimise file i/o (for example batch wise), which will not be discussed here.

Consider for example the component Par:

PAR(a°, b•, c•) = (a°, b•, c•). *[a↑° : (b• || c•); a↓°]

The dissipation functions are derived from the partial order.

power_aa_dd    1
power_ab_uu    2
power_ac_uu    3
power_ba_du    4
power_bb_ud    5
power_ca_du    6
power_cc_ud    7

This leads to the following implementation of Par:

;procedure PAR(var a,b,c : Wire)
;var f : File
     p : array [1..7] of integer
     d : array [1..7] of integer

;procedure Active_4cycle(n : integer; var w : Wire)
;begin
   Active_2cycle(w)
   Fwrite(f, (w.t, p[n], w.t+d[n]))
   w.t := w.t+d[n]
   Active_2cycle(w)
 end

;begin while true do
   begin
     P(a.x)
     Fwrite(f, (a.t, p[2]+p[3], a.t+d[2])) {assume that d[2]=d[3]}
     b.t := a.t+d[2]; c.t := a.t+d[3]
     fork Active_4cycle(4,b); Active_4cycle(6,c) ;join
     Fwrite(f, (b.t, p[4], b.t+d[4]))
     Fwrite(f, (c.t, p[6], c.t+d[6]))
     a.t := Max(b.t+d[4], c.t+d[6])
     V(a.y)
     P(a.x)
     Fwrite(f, (a.t, p[1], a.t+d[1]))
     a.t := a.t+d[1]
     V(a.y)
   end
 end


Note that, the procedure Active_ 4cycle cannot be used any more without the introduction of extra parameters.


7 HCL2TP compiler

In this chapter the translation of the HCL language to Transputer Pascal is automated using standard compilation techniques. HCL is the language in which the handshake circuits are specified. An HCL program is nothing more than a large list of the constituent components of a handshake circuit. The program contains no explicit declaration or initialisation of interconnecting wires. Transputer Pascal, the language used for the simulator, requires explicit declaration and initialisation of the interconnecting wires. This is the main difference between the HCL program and the TP simulator program. Moreover, the compiler extends the operation of Transferrers connected to channels specified in the file program_name.lnk, by giving them a monitor function for timing analysis. Monitors required for the registration of power dissipation are not defined in the HCL program.

7.0 Specification of HCL2TP

The translation of a HCL program into TP program is automated by a compiler called HCL2TP. Let T: HCL -> TP be the desired translation function. A compiler C from source language HCL to target language TP is specified by:

{input ∈ HCL} C {output = T(input)}

However, a compiler must be able to treat incorrect input, i.e. input which is not an element of HCL. Taking this into account, the specification can be rewritten.

{TRUE} C {(b ≡ input ∈ HCL) ∧ (b ⇒ output = T(input))}

However, the HCL program itself is generated by a compiler. Therefore one may expect that the input is always an element of the source language. Unfortunately the simulator can handle but a subset of the correct programs that can be written in HCL. In particular, not all the components that appear have a simulation implementation (Arbiter). Besides this the width of all channels is restricted to 32 bits. Therefore the compiler HCL2TP is specified by:

{TRUE} HCL2TP {(b ≡ input ∈ HCL') ∧ (b ⇒ output = T(input))}


The set of programs that can be generated with HCL' is a subset of the set of programs that can be generated with HCL. An HCL' program is an HCL program with the following properties:

• The width of all channels is restricted to 32 bits

• No Arbiter and Char components

• The first 12 characters of an external channel name are sufficient for the identification of the channel.
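The three restrictions and the postcondition of HCL2TP can be sketched as a check in Python. This is a hedged illustration, not the compiler of this report: the representation of a program as a list of (component type, channel widths) pairs plus a list of external channel names is an assumption, "restricted to 32 bits" is read as "at most 32", and the names `is_hcl_prime` and `compile_checked` are hypothetical.

```python
def is_hcl_prime(components, external_channels):
    """Check the three HCL' properties on a simplified program description."""
    for type_name, widths in components:
        if type_name in ("Arbiter", "Char"):
            return False                     # no simulation implementation
        if any(w > 32 for w in widths):
            return False                     # channel too wide
    # The first 12 characters must identify an external channel uniquely.
    prefixes = [name[:12] for name in external_channels]
    return len(prefixes) == len(set(prefixes))


def compile_checked(components, external_channels, translate):
    """Establish (b ≡ input ∈ HCL') ∧ (b ⇒ output = T(input))."""
    b = is_hcl_prime(components, external_channels)
    return b, (translate(components) if b else None)
```

A caller passes its own translation function as `translate`; when the check fails, no output is produced and only the boolean b is meaningful.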

Remark: From now on HCL refers to HCL'.

The compilation process consists of four phases: lexical scanning, context-free analysis, context-dependent analysis and code generation. Although in practice these phases are often merged, they are treated separately in the following sections.

7.1 Lexical scanner

The main task of a lexical scanner is to partition the input file, usually a file of characters, into tokens. The efficiency and complexity of the scanner are related to the complexity of the strings that have to be recognized. Therefore it is wise to restrict the complexity by demanding that the strings corresponding to a token form a regular language. Moreover, the first non-separator character of the input must give enough information to decide which token is scanned first. In A.2 the set of tokens is described as a collection of regular expressions. It is easy to check that both requirements are fulfilled. The scanner can be constructed using the "acceptor recipe" [Hem90]. Note that the sets of characters and the regular expressions define the syntax of the non-terminals Name and Number in the HCL grammar [A.0].

7.2 Parser

A parser checks whether the structure of the output of the lexical scanner corresponds with the grammar of the source language. A parser is usually divided into two parts: context-free analysis and context-dependent analysis. The grammar of HCL [A.0] is context-free. Because the HCL program is constructed by a compiler, the parser considers the input to be correct with respect to the HCL syntax. The parser is merely used to generate the (code) attributes required for the translation. The HCL grammar is not ambiguous and is LL(1). For such a grammar the construction of a recursive descent parser is straightforward [Hem90].

7.3 Code generation

The grammar of the target language is certainly not context-free [A.1], as a TP program requires declaration and initialisation of the channels. The translation from the language of grammar HCL to the target language TP is described with a syntax directed translation scheme.

7.3.0 Syntax Directed Translation Scheme

Let $G = (N, \Sigma, P, S)$ be a non-ambiguous context-free grammar and let $Z$ be some set (target language). A syntax directed translation scheme for $G$ and $Z$ consists of:

- for each $A \in N$ a function $T_A : L(A) \to Z$,

- for each production rule $p = (A_0 \to x_0 A_1 x_1 \cdots A_n x_n)$, with $x_i \in \Sigma^*$ and $A_i \in N$, a function $S_p : Z^n \to Z$,

such that for all $A \in N$:

if $x \in L(A)$ is generated as follows: $A \Rightarrow x_0 A_1 x_1 \cdots A_n x_n \Rightarrow^* x_0 y_1 x_1 \cdots y_n x_n = x$, where $p = (A \to x_0 A_1 x_1 \cdots A_n x_n) \in P$, then

$$T_A(x) = S_p(T_{A_1}(y_1), \ldots, T_{A_n}(y_n)).$$

The function $T_A$ is called the translation function corresponding to $A$.

7.3.1 Code Attributes

When the translation of the language of some grammar $G$ to some target language $Z$ can be described by a syntax directed translation scheme, then this translation can easily be generated by transforming $G$ to an attribute grammar in which all nonterminals have a synthesized attribute with domain $Z$. Each production rule $p = (A_0 \to x_0 A_1 x_1 \cdots A_n x_n)$ of $G$ is replaced by

$$A_0\langle{+}z_0\rangle \to x_0\, A_1\langle{+}z_1\rangle\, x_1 \cdots A_n\langle{+}z_n\rangle\, x_n, \qquad z_0 = S_p(z_1, \ldots, z_n).$$

The attributes associated with the nonterminals here are called code attributes. The code attributes have already been added to the HCL grammar [A.0].
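A minimal sketch may help to see how a synthesized code attribute is computed by a recursive descent parser. The two-rule grammar, the component names and the target strings below are invented for this illustration; they are neither the HCL grammar of [A.0] nor real TP code. Each parse function plays the role of a translation function: it returns the code attribute z of its nonterminal, built from the attributes of the nonterminals on the right-hand side.

```python
# Toy grammar (an assumption for illustration only):
#   Circuit ::= Comp { ';' Comp }
#   Comp    ::= Name '(' Name { ',' Name } ')'


def parse_comp(tokens, pos):
    """Return (z, next position), where z is the code attribute of Comp."""
    name = tokens[pos]
    assert tokens[pos + 1] == "("
    pos += 2
    args = [tokens[pos]]
    pos += 1
    while tokens[pos] == ",":
        args.append(tokens[pos + 1])
        pos += 2
    assert tokens[pos] == ")"
    z = name + "(" + ", ".join(args) + ")"
    return z, pos + 1


def parse_circuit(tokens, pos=0):
    """Return (z, next position), where z combines the attributes of all Comps."""
    z1, pos = parse_comp(tokens, pos)
    parts = [z1]
    while pos < len(tokens) and tokens[pos] == ";":
        z, pos = parse_comp(tokens, pos + 1)
        parts.append(z)
    # The combining function for this rule: wrap the parts in a hypothetical
    # 'par(...)' construct of the target language.
    return "par(" + " | ".join(parts) + ")", pos
```

For the token list of `SEQ(a,b);CON(b,c)` the sketch yields the string `par(SEQ(a, b) | CON(b, c))`.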

7.3.2 On the fly code generation

It is efficient to use the 'on the fly' code generation method when the translation functions satisfy the on the fly condition:

for each production rule $p = (A_0 \to x_0 A_1 x_1 \cdots A_n x_n)$ of $G$ there exist strings $v_0, \ldots, v_n$ of $Z$ such that

$$S_p(z_1, \ldots, z_n) = v_0\, z_1\, v_1 \cdots z_n\, v_n.$$

This means that if $x \in L(A)$ is generated by

$$A \Rightarrow x_0 A_1 x_1 \cdots A_n x_n \Rightarrow^* x_0 y_1 x_1 \cdots y_n x_n = x$$

then

$$T_A(x) = v_0\, T_{A_1}(y_1)\, v_1 \cdots v_{n-1}\, T_{A_n}(y_n)\, v_n.$$

However, the translation does not fulfil the on the fly condition. Consider the following production rule.

HCLtext(z) ::= Name Circuit(+d,+z1)

z = 'program' Name Header Decs(-d) ';begin monitor' Inits(d) Circuit(+d,+z1) 'end.'

The productions of TP nonterminals Decs and Inits both depend on inherited attribute d, which is constructed from HCL nonterminal Circuit. So when z1 is written to a file, the code produced from Decs and Inits has to be placed in front of z1 afterwards. However, file operations that insert text in front of existing contents do not exist. The translation produced from nonterminal Circuit does obey the on the fly condition and is therefore written to a temporary file, as it is more efficient to copy a file to another file than to pass large fragments of code as a parameter.
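The temporary-file approach can be sketched as follows. This is an illustration under assumptions, not the code of HCL2TP: the component representation, the function names and the generated text (declarations, `init` calls) are hypothetical and only mimic the shape 'header, declarations, initialisations, body'. The body is emitted on the fly to a temporary file while the set of channels d is collected; only when d is complete are the declarations written, after which the temporary file is copied behind them.

```python
import io
import shutil
import tempfile


def generate_program(name, components, out):
    """Write a program text to 'out'; components is a list of (kind, channels)."""
    channels = []  # plays the role of attribute d, completed while the body is emitted
    with tempfile.TemporaryFile("w+") as body:
        for kind, chans in components:
            for c in chans:
                if c not in channels:
                    channels.append(c)
            # On the fly: each fragment is written as soon as it is known.
            body.write(f"  {kind}({', '.join(chans)});\n")
        # d is complete now, so the part that depends on it can be written first.
        out.write(f"program {name};\n")
        for c in channels:
            out.write(f"var {c}: channel;\n")   # hypothetical declaration syntax
        out.write("begin\n")
        for c in channels:
            out.write(f"  init({c});\n")        # hypothetical initialisation syntax
        body.seek(0)
        shutil.copyfileobj(body, out)           # copy the body behind the declarations
        out.write("end.\n")


def generate_to_string(name, components):
    """Convenience wrapper that returns the generated text as a string."""
    out = io.StringIO()
    generate_program(name, components, out)
    return out.getvalue()
```

The large body text is never held in memory as one value; only the channel list is kept until the end.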

7.3.3 Translation of file-i/o components.

Components which include file-i/o require a little explanation. The TP type filename is an array of exactly 12 characters. This implies that the names of external channels are truncated when they are longer than 12 characters and padded with spaces up to 12 characters when they are shorter. This can be found in production rule File(d) in the grammar of TP. It also explains why Monitors and Transferrers write to a file called 'Mon         ' ('Mon' followed by nine spaces) instead of 'Mon'. As a consequence the first 12 characters of the name of an external channel must be sufficient for the identification of the channel. The compiler does not check this. Finally, the translation of componenttype Transferrer checks whether the Transferrer has a monitor function or not. If so, the name of the passive channel of the Transferrer can be found in the file LNK.

Componenttype(d,z) ::= 'TRF[' n1 '](' d1 ',' d2 ',' d3 ')'

( d = {d1} ∪ {d2} ∪ {d3} ∧

m0 = d1 if d1 ∈ LNK else m0 = ' '

z = 'TRF(' 'Mon ' ',' n1 ',' m0 ',' d1 ',' d2 ',' d3 ')' )
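The effect of the 12-character filename type can be made concrete with a short sketch. The function names are hypothetical and the code is written in Python rather than TP. The first function truncates or pads a channel name as described above; the second performs the check that the compiler omits, namely whether two different external channel names map onto the same filename.

```python
FILENAME_LENGTH = 12  # length of the TP type filename, as stated in the text


def tp_filename(channel_name):
    """Truncate or pad a channel name to exactly 12 characters."""
    return channel_name[:FILENAME_LENGTH].ljust(FILENAME_LENGTH)


def find_collisions(channel_names):
    """Return pairs of distinct names that share their first 12 characters."""
    seen = {}
    collisions = []
    for name in channel_names:
        key = tp_filename(name)
        if key in seen and seen[key] != name:
            collisions.append((seen[key], name))
        else:
            seen.setdefault(key, name)
    return collisions
```

For example, `tp_filename("Mon")` gives 'Mon' followed by nine spaces, and two names that differ only after the twelfth character are reported as a collision.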

8 Evaluation and conclusion

In this chapter the simulator is evaluated for its functional performance and the accuracy of its timing and power dissipation analysis. A couple of multipliers are tested with the simulator. Furthermore the suitability of Transputer Pascal for the implementation of this simulator is discussed. Finally some suggestions are given for future work.

8.0 Evaluation and results

The aim of this thesis is to construct a simulator for handshake circuits on a Transputer network. The simulator should incorporate various performance measures, such as timing and power dissipation.

8.0.0 Functional simulation

Presently it is possible to simulate the functional behaviour of handshake circuits which consist of components that are included in the HCL grammar [A.0]. Such circuits may also contain Arbiters when they are used properly (i.e. the functionality of the circuit is independent of decisions made by an Arbiter).

8.0.1 Timing analysis

The implemented timing analysis is data independent, hence the timing functions can be computed in advance. At this moment timing functions are either 0 or 1. A timing function is defined as 0 if the component has the structure of a wire or if the component does not belong to the circuit description (monitors and I/O components). The reason for using such a rough approximation is that more accurate values are not available to the author of this thesis. However, the accuracy of the simulator can easily be increased by substituting more accurate values. The Arbiter has no timing model.

To be able to argue about the performance of the simulator, it was tested on some multiplication algorithms [Haa92]. These algorithms had already been tested with VHDL, another simulator. The results obtained with VHDL can be compared with those obtained with SIM, as both use word based, data independent timing analysis on the level of handshake circuits. VHDL, however, has more precise timing functions, whereas SIM uses a unit delay model. The differences in speed of the various multipliers are detected by both simulators. Note that the two columns of Table 8.0 are expressed in different units.

Table 8.0

Multiplier   VHDL   SIM

A.1          887    1303

A.8          526    919

A.7          379    599

A.10         170    323

A.11         107    243
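Since the two columns of Table 8.0 have different units, only the ordering of the multipliers can be compared directly. The following sketch, which merely restates the table values in Python, checks that both simulators rank the five multipliers in the same order; the variable and function names are chosen for this example.

```python
# Values copied from Table 8.0: multiplier -> (VHDL, SIM); units differ per column.
RESULTS = {
    "A.1": (887, 1303),
    "A.8": (526, 919),
    "A.7": (379, 599),
    "A.10": (170, 323),
    "A.11": (107, 243),
}


def ranking(column):
    """Multipliers ordered from slowest to fastest according to one column."""
    return sorted(RESULTS, key=lambda m: RESULTS[m][column], reverse=True)


same_order = ranking(0) == ranking(1)
```

With the values above, `same_order` is true: both columns order the multipliers as A.1, A.8, A.7, A.10, A.11.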

8.0.2 Power dissipation analysis

The accuracy of the simulator with respect to power dissipation is not as good as is possible on the level of handshake circuits, because the power functions are not accurate. In the current solution every component can write its dissipation results to a file. This solution is not very fast, but it can easily be implemented for a single Transputer simulator.

8.0.3 Single Transputer simulator

The simulator uses only one Transputer. Hence the circuits that can be simulated cannot be too large: fewer than 4000 processes, as each process requires about 1K of memory and the host has 4M. Some optimisations are possible to decrease the number of components in handshake circuits; one of them is mentioned in subsection 8.2.4. An advantage of single processor simulation is the simple communication protocol. Furthermore, every process can easily write to files directly from the host Transputer, whereas this is not possible from the other Transputers in the network. Unfortunately a distributed simulator has not been realised. Hence possible benefits with respect to speed cannot be quantified.
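The bound on the number of processes follows from a simple division, sketched below. It is assumed here that 1K means 1024 bytes and 4M means 4 × 1024 × 1024 bytes; the amount of memory reserved for anything other than processes is a hypothetical parameter, not a figure from this thesis.

```python
HOST_MEMORY_BYTES = 4 * 1024 * 1024   # 4M on the host Transputer (assumed binary units)
BYTES_PER_PROCESS = 1024              # about 1K per process (assumed binary units)


def max_processes(reserved_bytes=0):
    """Upper bound on the number of processes that fit in the host memory."""
    return (HOST_MEMORY_BYTES - reserved_bytes) // BYTES_PER_PROCESS
```

Without any reserved memory the bound is 4096 processes; as soon as some memory is needed for other purposes, the bound drops below the 4000 mentioned above.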

8.0.4 Surrounding software

The simulator is designed to cooperate with the Tangram tools, as the Tangram project has more software support. The set of Tangram tools contains, amongst others, a viewer (Dicyview) with which the timing results can be viewed. The simulator is constructed in such a way that it fits within the current set of Tangram tools. The results of dissipation analysis cannot be viewed.

8.1 Concluding remarks

The Transputer network and Transputer Pascal are suitable for the simulation of handshake circuits for two important reasons.

• The main constructor of Delay Insensitive Circuits, the system operator par, is mapped on the TP construct fork, which implements parallelism on procedure level.

• Asynchronous communication is efficiently implemented by the available synchronisation primitives (i.e. semaphores).

It is possible to define a simple syntax directed translation function from circuit component specification to simulator procedure. The structure of the simulating program stays very close to the original circuit description, which facilitates the writing of the compiler.
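The two reasons given above can be illustrated with a small sketch. It is written in Python, with threads standing in for the TP fork construct and `threading.Semaphore` standing in for the TP semaphores; the class and function names are assumptions made for this example and do not reproduce the simulator code. A channel is modelled by a request and an acknowledge semaphore, and two component procedures run in parallel and synchronise on each handshake.

```python
import threading


class HandshakeChannel:
    """One channel: the active side requests, the passive side acknowledges."""

    def __init__(self):
        self._request = threading.Semaphore(0)
        self._acknowledge = threading.Semaphore(0)
        self._data = None

    def active_send(self, value):
        self._data = value
        self._request.release()       # signal the request
        self._acknowledge.acquire()   # wait for the acknowledge

    def passive_receive(self):
        self._request.acquire()       # wait for the request
        value = self._data
        self._acknowledge.release()   # signal the acknowledge
        return value


def producer(channel, values):
    for v in values:
        channel.active_send(v)


def consumer(channel, count, received):
    for _ in range(count):
        received.append(channel.passive_receive())


def run_par(*procedures):
    """Run (function, arguments) pairs in parallel and wait for all of them."""
    threads = [threading.Thread(target=f, args=a) for f, a in procedures]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def demo():
    channel = HandshakeChannel()
    received = []
    run_par((producer, (channel, [1, 2, 3])), (consumer, (channel, 3, received)))
    return received
```

Because the sender waits for the acknowledge before it continues, a value cannot be overwritten before it has been read, whatever the relative speed of the two procedures.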

8.2 Suggestions for future work

8.2.0 Functional simulation

The functional model of handshake components is achieved by translating the component specification into TP procedures. This translation can also be automated when the components are properly specified. This way the simulator can be extended by someone who does not know anything about the Transputer Pascal implementation.

8.2.1 Timing analysis

The timing results of the simulator can be improved by simply assigning more accurate values to the timing functions. Furthermore word-based analysis can be replaced by bit-based analysis. The most accurate results are achieved when the data dependency of timing analysis is taken into account too.

8.2.2 Power dissipation analysis

Each component registers its dissipation at the moment it occurs. This leads to a large number of relatively slow file operations. The number of file operations can be decreased by saving the results in batches.
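A possible form of such batching is sketched below. The class, its record format and the batch size are assumptions for this illustration. Records are collected in memory and written with one file operation per batch. In a simulator with parallel component processes the buffer would additionally have to be protected, for instance by a semaphore; that is left out here.

```python
import io


class DissipationLog:
    """Collect dissipation records and write them to a file in batches."""

    def __init__(self, sink, batch_size=100):
        self._sink = sink              # any object with a write() method
        self._batch_size = batch_size
        self._pending = []
        self.write_calls = 0           # counts the actual file operations

    def record(self, component, time, energy):
        self._pending.append(f"{component} {time} {energy}\n")
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Write all pending records with a single file operation."""
        if self._pending:
            self._sink.write("".join(self._pending))
            self.write_calls += 1
            self._pending.clear()


def demo(records=10, batch_size=4):
    sink = io.StringIO()
    log = DissipationLog(sink, batch_size)
    for t in range(records):
        log.record("C0", t, 1)
    log.flush()                        # write the last, incomplete batch
    return log.write_calls, sink.getvalue().count("\n")
```

With ten records and a batch size of four, the sketch performs three write operations instead of ten.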

8.2.3 Multi Transputer simulator

Presently the simulator uses only the host Transputer of the Transputer network. Hence the capacity of 50 Transputers is unused. Changing the single Transputer simulator into a multi Transputer version requires changes in the compiler HCL2TP.

HCL2TP must generate a program for every Transputer. These programs consist of disjoint subsets of the set of components which defines the circuit. Components which communicate are either located on the same Transputer or on distinct ones. The latter case requires an inter processor communication mechanism. Transparent communication primitives have to be defined. HCL2TP must keep track of the location of all the components in order to be able to give each component the right address of its partners in communication.

Another interesting problem is the distribution of the components. Sequential computations can best be situated on one Transputer whereas parallel computations benefit from distribution over distinct processors. The distribution can be determined once, before the simulation, or it can also be changed during execution to ensure that the work is equally divided among the processors. The latter is called (dynamic) load balancing.
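The bookkeeping that such a compiler would need can be sketched as follows. The round-robin placement is only a placeholder for a real distribution strategy, and all names are hypothetical. Given a placement of components on processors, the second function determines which channels connect components on distinct processors and therefore need the inter processor communication mechanism.

```python
def place_round_robin(components, n_processors):
    """Static distribution: component i is placed on processor i mod n."""
    return {name: i % n_processors for i, name in enumerate(components)}


def remote_channels(channels, placement):
    """Channels whose two components are located on distinct processors.

    channels maps a channel name to the pair of components it connects.
    """
    return sorted(
        ch for ch, (a, b) in channels.items() if placement[a] != placement[b]
    )
```

A distribution strategy could use the number of remote channels as one of its cost measures, since every remote channel adds communication between Transputers.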

8.2.4 Surrounding software

The results of timing analysis can be viewed with a tool called Dicyview. An interpreter for the results of power dissipation analysis is not available yet, but is currently being constructed.

Circuit descriptions tend to contain a lot of components which do not contribute to the functional or timing behaviour in any way. Hence these components can be filtered from the circuit. The reduction is considerable: in multiplier A.11, for example, 25% of the components are Connectors.
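One way in which such a filter could work is sketched below. The sketch rests on assumptions that are not taken from this thesis: a Connector is represented as a component of kind 'CON' with exactly two channels, and removing it means that its two channels are identified with each other in the remaining components. The other component kinds in the example are placeholders.

```python
def remove_connectors(components):
    """Remove 'CON' components and merge the two channels each one connects.

    components is a list of (kind, [channel names]).
    """
    alias = {}  # maps a removed channel name to the channel that replaces it

    def find(channel):
        while channel in alias:
            channel = alias[channel]
        return channel

    kept = []
    for kind, chans in components:
        if kind == "CON":
            a, b = find(chans[0]), find(chans[1])
            if a != b:
                alias[b] = a          # from now on b is known as a
        else:
            kept.append((kind, chans))
    # Rename the channels of the remaining components.
    return [(kind, [find(c) for c in chans]) for kind, chans in kept]
```

In the example used for testing, a chain of two Connectors between two other components disappears and the two components end up sharing one channel.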

Acknowledgements

I would like to thank my mentors Rudolf Mak and Ad Peeters from Eindhoven University of Technology for their continuing support. Furthermore I want to thank Johan Lukkien for immediately answering all my questions on the Transputer network. I also want to express my gratitude to Annette for her support in any area except programming. Finally, I would like to thank prof.dr.M.Rem for being my supervising professor.

Bibliography

[Ber92] C.v.Berkel. Handshake Circuits: an intermediary between communicating processes and VLSI. PhD thesis, Philips Research Laboratories - Eindhoven University of Technology, May 1992.

[Bekp90] H.Bisseling, H.Eemers, M.Kamps, A.Peeters. Designing Delay Insensitive Circuits. Institute for Continuing Education, Eindhoven University of Technology, September 1990.

[Bru91] E.Brunvand. A cell set for self-timed design using Actel FPGAs. Technical Report UUCS-91-013, Dept. of Comp. Science, Univ. of Utah, Salt Lake City, August 1991.

[Haa92] J.M.P.W.Haans. VLSI Programming of Multipliers. Master's thesis, Eindhoven University of Technology, July 1992.

[Hem90] C.Hemerik. Course notes, Eindhoven University of Technology, 1990.

[Jan92] A.Jansen. A common subexpression recognizer for VLSI programs. VOC internal report, Dept. of Math. and C.S., Eindhoven University of Technology, February 1992.

[Kor92] M.v.d.Korst. VOICE, a Silicon Compiler for Asynchronous Circuits. Institute for Continuing Education, Eindhoven University of Technology, August 1992.

[Lin92] M.Lindwer. Handshake circuit simulator. VOC internal report, Dept. of Math. and C.S., Eindhoven University of Technology, May 1992.

[Luk92] J.J.Lukkien. The Eindhoven Transputer System an overview. User manual, Eindhoven University of Technology, September 1992.

[Ver92] T.Verhoeff. Delay-Insensitive Circuits. Course notes, September 1992.

[Ver88] T.Verhoeff. Delay insensitive codes, an overview. Distributed Computing, 3(1):1-8, 1988.
