tima lab. research reportstima.univ-grenoble-alpes.fr/publications/files/rr/pla_246.pdf · a...

ISSN 1292-862

TIMA Lab. Research Reports

CNRS INPG UJF

TIMA Laboratory, 46 avenue Félix Viallet, 38000 Grenoble France

A PROGRAMMABLE LOGIC ARCHITECTURE FOR PROTOTYPING CLOCKLESS CIRCUITS

Laurent Fesquet, Marc Renaudin

TIMA Laboratory, 46 avenue Felix Viallet, 38031 Grenoble, France

{Laurent.Fesquet, Marc.Renaudin}@imag.fr
WWW home page: http://tima.imag.fr/cis

ABSTRACT

This paper presents a novel Programmable Logic Device (PLD) architecture for implementing and prototyping various styles of clockless, or asynchronous, circuits. Many classes of asynchronous circuits exist, depending on the timing assumptions made at the logical level and on the handshake communication protocol adopted. The main objective of this work is to break the dependency between the PLD architecture dedicated to asynchronous logic and the logic style: existing PLDs dedicated to asynchronous logic are always style-oriented. The innovative aspects of the architecture are described in detail. Moreover, because the architecture is generic, the structure can be adapted to future evolutions of asynchronous logic. The programmable structure is flexible enough to be used with different logic styles and asynchronous design flows. As an example, a full adder was implemented in two different logic styles to demonstrate the flexibility of the PLD architecture. This work is part of a larger project called TAST, dedicated to the synthesis and prototyping of multi-style asynchronous circuits.

1. INTRODUCTION

1.1. Context of the work

The advantages of programmable logic, such as its validation role in the logic design flow and its flexibility, no longer need to be presented. With the ever-increasing integration level of synchronous designs, the industry now faces problems (heat dissipation, clock tree distribution, noise) that have made asynchronous logic increasingly popular in recent years. Many asynchronous demonstrators have been implemented in the last few years [1] [2] [3] [4] [5]. Methodologies have also been developed to synthesize asynchronous circuits automatically, and dedicated synthesis tools have appeared, such as CAST [6], TANGRAM [7], BALSA [8], and TAST [9]. Some dedicated FPGAs have also been developed in the last decade to test asynchronous designs. Unfortunately, these FPGAs are closely tied to one design style. Using synchronous FPGAs is possible, but most of the FPGA resources remain unexploited [10].

1.2. State of the art

Most of the research in this area has focused either on deriving an asynchronous FPGA from a synchronous one, like MONTAGE [11] and PGA-STC [12], or on targeting application-specific FPGAs, which are locally synchronous, like GALSA [13] and STACC [14], or completely asynchronous, like PAPA [15]. PGA-STC was developed to implement two-phase bundled-data systems such as micropipelines, GALSA for massively parallel computing architectures, and STACC for reconfigurable computation, while PAPA was mainly created and optimized for pipelined processes. Clearly, all the research done to date targets a specific application, protocol, or asynchronous logic style.

2. PRINCIPLES OF ASYNCHRONOUS LOGIC

Whereas a clock globally controls activity in synchronous circuits, activity in asynchronous circuits is controlled locally through communication channels able to detect the presence of data at their inputs and outputs [16] [17] [18] [19]. This corresponds to the so-called handshaking, or request/acknowledge, protocol.

This communication protocol is the basis of the following sequencing rules of asynchronous circuits [18]:

• a module starts the computation if and only if all the data required for the computation are available,

• as soon as the result can be stored, the module releases its input ports,

• it outputs the result on the output port if and only if this port is available.

Fig. 1. Request/Acknowledge communication between asynchronous operators to guarantee a timing-independent synchronization
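The three sequencing rules can be sketched as a small behavioural model. This is only an illustration: the `Channel` and `Module` classes, the one-place channel abstraction, and the adder example are assumptions made for this sketch and do not come from the TAST tools or from the PLD itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Channel:
    """One-place channel: holds at most one data item (hypothetical model)."""
    data: Optional[int] = None

    def has_data(self) -> bool:
        return self.data is not None


@dataclass
class Module:
    """Operator obeying the three sequencing rules listed above."""
    inputs: List[Channel]
    output: Channel
    func: Callable[..., int]
    stored: Optional[int] = None  # result waiting to be sent

    def step(self) -> bool:
        """Perform at most one action; return True if anything happened."""
        if self.stored is not None:
            # Rule 3: drive the output only if the output port is available.
            if self.output.has_data():
                return False
            self.output.data = self.stored
            self.stored = None
            return True
        # Rule 1: start only when all required data are available.
        if not all(ch.has_data() for ch in self.inputs):
            return False
        self.stored = self.func(*(ch.data for ch in self.inputs))
        # Rule 2: the result is stored, so the input ports are released.
        for ch in self.inputs:
            ch.data = None
        return True


a, b, out = Channel(), Channel(), Channel()
adder = Module([a, b], out, lambda x, y: x + y)
a.data = 2
waited = not adder.step()   # b is still missing, so nothing happens
b.data = 3
adder.step()                # compute and release the inputs
adder.step()                # send the result
print(out.data)             # 5
```

No clock appears anywhere in the model: each action is enabled purely by the presence of data or by the availability of a port, which is the point of the request/acknowledge synchronization.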

Asynchronous modules thus communicate with each other using requests and acknowledges. A single transition on a request signal activates the module connected to it. Signals must therefore be valid at all times, and no hazard is allowed on them. Asynchronous circuit synthesis must consequently be stricter, i.e. hazard-free. In fact, different timing assumptions are considered for different types of asynchronous circuits. The most robust style is called Delay Insensitive (DI) because no timing assumption is made: the circuit works correctly whatever the delays in wires and gates. Designing such a circuit is very constraining for the designer and costly in terms of area. To reduce the complexity of these circuits, an assumption can be introduced on forks: some forks must be "isochronic" (the delays in the branches of the fork are equal). Circuits of this style, named Quasi-Delay Insensitive (QDI), only work correctly under this "isochronic fork" assumption. The "isochronic fork" condition is very weak, and many asynchronous circuits rely on stronger timing assumptions, as micropipeline circuits do. Micropipeline circuits differ from synchronous circuits only by the controllers that replace the clock. Many other asynchronous logic styles exist but are not presented in this paper.

Beyond the logic style, and in contrast to synchronous design, the designer can also change the handshake protocol or the data encoding. It is thus possible to implement QDI logic with different protocols or data encodings, such as dual-rail (1-of-2 encoding) or multi-rail (1-of-N encoding). These choices make it possible to implement the same design while varying the electrical properties of the circuit, such as speed, power consumption, or electromagnetic emission.
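To make the dual-rail and multi-rail encodings concrete, here is a minimal sketch of a 1-of-N code. The function names, the convention that rail i high means digit i, and the all-zero "empty" state (as used by a return-to-zero protocol) are assumptions of this sketch.

```python
from typing import Sequence, Tuple


def encode_one_of_n(digit: int, n: int) -> Tuple[int, ...]:
    """Encode a radix-n digit on n rails: exactly one rail is high."""
    if not 0 <= digit < n:
        raise ValueError("digit out of range")
    return tuple(1 if i == digit else 0 for i in range(n))


def classify(rails: Sequence[int]) -> str:
    """Tell whether the rails carry data, nothing, or an illegal code."""
    ones = sum(rails)
    if ones == 0:
        return "empty"      # no data present (return-to-zero state)
    if ones == 1:
        return "valid"      # exactly one rail high: a data value is present
    return "illegal"        # never produced by a correct circuit


def decode_one_of_n(rails: Sequence[int]) -> int:
    """Recover the digit from a valid codeword."""
    if classify(rails) != "valid":
        raise ValueError("not a valid codeword")
    return tuple(rails).index(1)


print(encode_one_of_n(1, 2))     # (0, 1): dual-rail encoding of bit 1
print(encode_one_of_n(2, 4))     # (0, 0, 1, 0): 1-of-4 encoding of digit 2
print(classify((0, 0)))          # empty
```

Because validity can be read from the rails themselves, a receiver can detect the arrival of data without any timing assumption, which is what makes such encodings suitable for QDI logic.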

3. ARCHITECTURE

The PLD architecture has been designed as the best compromise between the high flexibility needed for style independence and the optimal use of PLD resources.

3.1. Global view of the chip

The high flexibility is achieved by choosing an "island-style" top view for the chip: the Programmable Logic Blocks (PLBs), which implement the required logical functions, are embedded in a routing network. This network is a grid of interconnection busses, connection boxes, and switch boxes (Figure 2). The connection boxes map the signals used by the PLBs onto the busses, and the switch boxes connect horizontal busses to vertical ones to route the signals through the chip. Finally, ignoring edge effects (the I/Os on each side of the PLD), the chip can be seen as the 2D repetition of the pattern formed by a PLB, 4 connection boxes, and a switch box, as described in Figure 2.

Fig. 2. Top view of the FPGA architecture

3.2. The Programmable Logic Block (PLB)

The PLB implements the programmable logical functions; it consists of an Interconnection Matrix (IM), two Logic Elements (LEs), and a Programmable Delay Element (PDE), as shown in Figure 3. The Logic Elements are programmable combinatorial logic components that host the programmed functions, and the delay element lets the PLB implement logic based on delay elements (such as micropipelines). In addition, the Interconnection Matrix connects the PLB inputs, the LE inputs and outputs, and the PDE.

The architecture of the PLB is designed to ensure a correct implementation of the memory elements typically needed by asynchronous logic, such as Muller gates. These memory elements are implemented by mapping looped combinatorial logic; this is why the feedback paths are as short as possible and entirely contained within the PLB.

Fig. 3. Internal schematic view of a PLB
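The idea of building a memory element from looped combinatorial logic can be illustrated with a two-input Muller gate: its output rises when both inputs are high, falls when both are low, and otherwise keeps its value. The sketch below models it as a 3-input look-up table whose third input is its own output; the `LoopedLut` class is a hypothetical helper, not a description of the actual PLB circuitry.

```python
from itertools import product

# Truth table of a 3-input LUT indexed by (a, b, fed-back output).
# Standard Muller gate equation: out = a.b + prev.(a + b)
MULLER_TABLE = {
    (a, b, prev): (a & b) | (prev & (a | b))
    for a, b, prev in product((0, 1), repeat=3)
}


class LoopedLut:
    """Combinatorial LUT whose output is fed back to one of its inputs."""

    def __init__(self, table, state=0):
        self.table = table
        self.state = state      # value currently held on the feedback wire

    def eval(self, a, b):
        self.state = self.table[(a, b, self.state)]
        return self.state


gate = LoopedLut(MULLER_TABLE)
trace = [gate.eval(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
print(trace)  # [0, 0, 1, 1, 0]
```

The gate only behaves as a memory if the feedback settles before the inputs change again, which is consistent with the requirement that feedback paths stay short and local to the PLB.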

3.2.1. The Logic Element

A Logic Element consists of a "multi-output LUT" (1), a LUT7-3, and a LUT2-1 connected together as shown in Figure 4. As presented in the section describing asynchronous logic, this type of logic uses N signals to code a radix-N digit (1-of-N encoding). This specificity needs to be taken into account at the hardware level to obtain the best filling of the programmable logic elements. The adopted solution was to make some internal signals of a LUT externally available; in particular, a LUT7-3 was chosen for the LE. It thus becomes easy to implement dual-rail encoding (which is very widely used), as 2 auxiliary outputs per LE are available. Moreover, multi-rail signals can be implemented with a reduced number of Logic Elements thanks to the large number of inputs per LE.

Asynchronous logic also requires a protocol between communicating modules, which basically consists in computing an acknowledgement signal. The LE supports this with a LUT2 directly connected to the multi-output LUT, which makes it possible to check the validity and the invalidity of the two-wire channel coming out of the LUT7-3 (cf. example in Section 5.2).
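The following sketch models a Logic Element as a LUT7-3 followed by a LUT2-1. Several points are assumptions made for illustration: the `Lut` class, the choice of which LUT7-3 outputs feed the LUT2-1, and the dual-rail XOR used as the programmed function. The XOR is written as plain combinatorial logic only to show the data flow; it is not claimed to be a hazard-free QDI implementation.

```python
from itertools import product


class Lut:
    """Look-up table with n_in inputs and n_out outputs (a LUTn_in-n_out)."""

    def __init__(self, n_in, n_out, func):
        # Configuration memory: one output tuple per input combination.
        self.table = {
            bits: tuple(func(*bits)) for bits in product((0, 1), repeat=n_in)
        }
        if any(len(out) != n_out for out in self.table.values()):
            raise ValueError("function does not produce n_out outputs")

    def eval(self, *bits):
        return self.table[bits]


def xor_rails(a0, a1, b0, b1, *unused):
    """Dual-rail XOR: returns (false rail, true rail, unused third output)."""
    false_rail = (a0 & b0) | (a1 & b1)
    true_rail = (a0 & b1) | (a1 & b0)
    return false_rail, true_rail, 0


lut7_3 = Lut(7, 3, xor_rails)
# Assumed wiring: the LUT2-1 ORs the two rails to signal "data present".
lut2_1 = Lut(2, 1, lambda r0, r1: (r0 | r1,))


def logic_element(a0, a1, b0, b1):
    rail0, rail1, _ = lut7_3.eval(a0, a1, b0, b1, 0, 0, 0)
    (valid,) = lut2_1.eval(rail0, rail1)
    return rail0, rail1, valid


print(logic_element(0, 1, 1, 0))  # (0, 1, 1): 1 XOR 0 = 1, output valid
print(logic_element(0, 0, 0, 0))  # (0, 0, 0): empty inputs, output invalid
```

Both rails of the result and the validity signal come out of a single LE in this model, which is the benefit of exposing extra LUT outputs and adding the small LUT2-1 stage.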

3.2.2. The interconnection matrix

This routing component connects all the internal components of the logic block together (Figure 3). Having a fully customizable routing component in each logic block costs a lot of silicon area, but it is essential for the style independence of the architecture, as asynchronous logic styles differ precisely in the way the LEs are routed.

(1) Note: LUTX-Y denotes a LUT with X inputs and Y outputs.

Fig. 4. Internal schematic view of a LE

3.2.3. The Programmable Delay Element

The PDE, located in the PLB (Figure 3), allows the implementation of asynchronous circuits that rely on timing assumptions. A PDE has one input and one output, and can be programmed to delay the rising and the falling edges of the signal independently.
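The behaviour of a delay element with independent rising and falling delays can be sketched as follows. The function name, the event-list representation, and the time unit are assumptions; the model also assumes that pulses are wide enough for the delayed edges to stay in order.

```python
def programmable_delay(transitions, rise_delay, fall_delay):
    """Delay each edge of a signal given as a list of (time, new_value).

    Rising edges (new_value == 1) are delayed by rise_delay and falling
    edges (new_value == 0) by fall_delay, in arbitrary time units.
    """
    delayed = []
    for time, value in transitions:
        delay = rise_delay if value == 1 else fall_delay
        delayed.append((time + delay, value))
    return delayed


request = [(0, 1), (20, 0)]   # request rises at t=0 and falls at t=20
print(programmable_delay(request, rise_delay=8, fall_delay=1))
# [(8, 1), (21, 0)]
```

An asymmetric setting such as the one above is one plausible use: the rising edge of a request is held back long enough to match the logic it accompanies, while the return-to-zero edge is passed almost immediately.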

3.3. The interconnection network

As in synchronous FPGAs, the interconnection network and the routing components take up a large part of the chip area. Area optimization is therefore the major concern in this part of the study.

3.3.1. The connection box

The connection box (Figure 5a) is the link between the Programmable Logic Block and the global routing network. The number of outputs of the connection box (which is the number of PLB inputs) is a key factor in the area usage of the chip. To ensure good area utilization, the PLB can access 1/3 of the bus width per connection box. As a PLB has three input connection boxes, this choice gives the PLB a number of inputs equivalent to the bus width.

Fig. 5. a) The connection box b) The "Wilton" switch box (bus width not exact)

3.3.2. The switch box

A switch box is located at each intersection between a horizontal and a vertical bus on the chip (Figure 5b). This component has exactly the same functionality as in synchronous FPGAs: it routes the signals between busses. The switch box is of the "Wilton" style, which favors routability [20].

3.4. Structure genericity

It is important to size the architecture optimally so that it efficiently handles a very large number of asynchronous synthesis styles. But since asynchronous synthesis rules are constantly improving, these sizes will have to change in the future. Furthermore, it had to be possible to generate such a PLD on demand, built specifically to optimize a particular library or synthesis methodology. This is why all the components were described with generic characteristic dimensions, leaving the freedom to resize every part of the design.

3.5. Programming chain

With this need for a generic structure in mind, we also needed a way to program the PLD efficiently. As every part of the architecture can change or be reorganized, we created a programming protocol that can easily be used for all types of components, regardless of the amount of SRAM they need. This protocol is therefore based on a daisy chain (Figure 6), in which each LE, Interconnection Matrix, Connection Box, and Switch Box uses an interface connected to the signals described below:

Fig. 6. Synoptic view of the programming chain

The protocol is synchronous: each interface (which represents a component) loads the bit present on serial in at each rising edge of clock, provided that it has received the token from the previous component. The first token is sent by the external programming computer to initiate the programming phase. Serial out gives feedback to the computer, which can perform a parity check of the transmission. The protocol is illustrated in the following example:

Fig. 7. A typical component programming sequence

The interface was implemented in synchronous logic to ensure the best area utilization and compatibility with the host computer used for programming the FPGA.
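The token-based daisy chain can be sketched with a small simulation. The `ConfigInterface` class, the component names, and the bit counts below are hypothetical; the sketch only models the token passing and the loading of serial in, and leaves out the serial out feedback, whose exact behaviour is not detailed in the text.

```python
class ConfigInterface:
    """Programming interface of one component (LE, IM, CB, or SB)."""

    def __init__(self, name, n_bits):
        self.name = name
        self.n_bits = n_bits      # size of the component's configuration memory
        self.bits = []
        self.has_token = False

    def rising_edge(self, serial_in):
        """Load one bit if holding the token; return True once full."""
        if not self.has_token:
            return False
        self.bits.append(serial_in)
        if len(self.bits) < self.n_bits:
            return False
        self.has_token = False    # configuration complete: give up the token
        return True


def program_chain(chain, bitstream):
    if len(bitstream) != sum(c.n_bits for c in chain):
        raise ValueError("bitstream length does not match the chain")
    chain[0].has_token = True     # the first token comes from the host computer
    for bit in bitstream:
        # Every interface sees the same clock edge and the same serial_in bit.
        done = [c.rising_edge(bit) for c in chain]
        for i, finished in enumerate(done):
            if finished and i + 1 < len(chain):
                chain[i + 1].has_token = True


chain = [ConfigInterface("LE", 4), ConfigInterface("IM", 3), ConfigInterface("CB", 2)]
program_chain(chain, [1, 0, 1, 1, 0, 0, 1, 1, 0])
print([c.bits for c in chain])  # [[1, 0, 1, 1], [0, 0, 1], [1, 0]]
```

Since each interface only needs to know its own memory size, components can be resized or reordered without changing the protocol, which is the property the generic architecture requires.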

4. DESIGN FLOW

4.1. TAST Project

The incompatibility of tools created specifically for synchronous design, the need to make sure that the design is hazard-free, and the lack of high-level design methodologies are some of the reasons why asynchronous circuits are not widely used today. The TAST (TIMA Asynchronous Synthesis Tool) [9] project was created to facilitate the design of asynchronous circuits. It mainly consists of a compiler/synthesizer able to target several outputs from a high-level hardware description language (CHP: Communicating Hardware Processes):

• QDI-style gate-level model

• Micropipeline-style gate-level model

Before creating the QDI or micropipeline netlist, it is possible to choose the communication protocol. Once the netlist has been created, the design is mapped onto standard cell libraries or specific libraries like TAL (TIMA Asynchronous Library) to create an ASIC, or alternatively onto a dedicated netlist to program the proposed PLD. The TAST design flow quite often generates C-OR patterns (Muller gates followed by an OR gate). The half-buffer, a circuit that can be produced by TAST and is composed of two Muller gates and one OR gate, is a good illustration (see Figure 8). This pattern fits neatly into one LE thanks to the LUT2-1 following the LUT7-3 (the Muller gates are mapped in the two LUT7-3 and the OR equation in the remaining LUT2-1).

This study was mainly done for dual-rail designs, but asynchronous mappings with other encodings, such as 1-of-N, fit correctly into one or multiple PLBs.
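The C-OR pattern of a dual-rail half-buffer can be sketched behaviourally: one Muller gate per rail, and an OR gate that produces the acknowledge returned to the sender. The wiring below follows the usual weak-condition half-buffer and is an assumption of this sketch; in particular, the inversion of the acknowledge coming from the next stage is not taken from Figure 8.

```python
def muller(a, b, prev):
    """Two-input Muller gate: out = a.b + prev.(a + b)."""
    return (a & b) | (prev & (a | b))


class HalfBuffer:
    """Dual-rail half-buffer: two Muller gates followed by an OR gate."""

    def __init__(self):
        self.out0 = 0
        self.out1 = 0

    def step(self, in0, in1, ack_from_next):
        enable = 1 - ack_from_next        # assumed active-high acknowledge
        self.out0 = muller(in0, enable, self.out0)
        self.out1 = muller(in1, enable, self.out1)
        ack_to_prev = self.out0 | self.out1   # the OR of the C-OR pattern
        return self.out0, self.out1, ack_to_prev


hb = HalfBuffer()
trace = [
    hb.step(0, 1, 0),  # sender presents a 1: data captured, acknowledge raised
    hb.step(0, 0, 0),  # sender returns to zero: the output is held
    hb.step(0, 0, 1),  # receiver acknowledges: the output returns to zero
    hb.step(0, 0, 0),  # receiver releases its acknowledge: ready again
]
print(trace)  # [(0, 1, 1), (0, 1, 1), (0, 0, 0), (0, 0, 0)]
```

The four steps correspond to one complete cycle of a 4-phase, return-to-zero handshake.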

4.2. Mapping Efficiency

One of the objectives was to design a PLD with the highest filling ratio whatever the asynchronous logic style, so the best trade-off between the different styles was sought. Benchmarks are synthesized with different styles, encodings, and protocols to check this property. The filling ratio of the LEs is calculated as the number of used inputs over the number of available inputs. For the QDI style, the mean filling ratio is around 75%.

Fig. 8. The half buffer implementation in a LE

5. EXAMPLES

To demonstrate the capabilities of the architecture, a toy application, a full adder, has been implemented in two different asynchronous logic styles: QDI and micropipeline. To simplify the demonstration, the data encoding of the QDI adder is limited to dual-rail, and the data encoding of the micropipeline adder is bundled data (as in synchronous logic). Moreover, both styles use the same protocol: a 4-phase protocol with return to zero.

5.1. Full adder in micropipeline

For the micropipeline version (Figure 9), 5 LEs are used, which means that 3 PLBs are needed to build this function. The overall filling ratio is 51% (23 inputs out of 45). The dashed lines around the gates symbolize the mapping of the adder onto the LEs. The combinatorial logic, simply a 1-bit adder, fills one LE at 33%. The Muller gate with reset, alone in one LE, has a filling ratio of 44%, the latches a ratio of 55%, and the latch with the Muller gate a ratio of 66%. A programmable delay element is used to implement the matched delay required by this logic. The micropipeline style is thus correctly supported by this architecture.

Fig. 9. A full adder in micropipeline
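The filling ratios quoted for the micropipeline adder can be cross-checked with a few lines of arithmetic. Two points are inferred rather than stated in the text: that each LE offers 9 inputs (45 available inputs over 5 LEs, consistent with 7 LUT7-3 inputs plus 2 LUT2-1 inputs), and that two LEs hold a latch alone, which is what makes the per-LE figures add up to 23 used inputs.

```python
INPUTS_PER_LE = 9  # inferred: 45 available inputs / 5 LEs

# Used inputs per LE, derived from the quoted ratios (3/9, 4/9, 5/9, 6/9).
used_inputs = {
    "1-bit adder logic": 3,
    "Muller gate with reset": 4,
    "latch (first LE)": 5,
    "latch (second LE, inferred)": 5,
    "latch with Muller gate": 6,
}


def percent(used, available):
    """Filling ratio in percent, truncated as the quoted figures appear to be."""
    return 100 * used // available


per_le = [percent(used, INPUTS_PER_LE) for used in used_inputs.values()]
total_used = sum(used_inputs.values())
total_available = INPUTS_PER_LE * len(used_inputs)

print(per_le)                                   # [33, 44, 55, 55, 66]
print(total_used, total_available)              # 23 45
print(percent(total_used, total_available))     # 51
```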

5.2. Full adder in QDI

To evaluate the flexibility of the architecture, the full adder has also been implemented in the QDI style. Figure 10 shows that the C2OR pattern appears 6 times (see Section 4.1). Each LE is filled at 89%. The QDI adder fits into 4 PLBs, and the overall filling ratio is 76%. The QDI style fits correctly in the PLD architecture.

Fig. 10. A full adder in QDI

6. CONCLUSION AND FUTURE WORK

A novel PLD architecture has been presented that is able to target multiple styles of asynchronous logic. Asynchronous logic fits well into this dedicated architecture, with a good filling ratio of PLB resources. Moreover, the architecture, which is fully generic, is well suited to generating new PLDs adapted to the evolutions of asynchronous logic. The next step will be to improve the filling ratio of asynchronous logic by adding an initialization signal to the PLB. Some architectural improvements could be made to obtain a better ratio between the area of communication resources (SB, CB) and the area of programmable resources (PLB). Efforts are currently focused on a specific synthesis methodology to target those components. The routing capabilities of the connection boxes and interconnection matrices could be reduced while still keeping a good routing efficiency. Finally, a silicon implementation of the PLD is under design. This requires fine tuning of the block designs (PLB, SB, and CB) to save silicon area. This circuit will be a tool to evaluate asynchronous designs and to spread this technology to a larger community (beyond the asynchronous community!). Moreover, this PLD architecture could be used in SoCs as asynchronous reconfigurable blocks.

7. REFERENCES

[1] A.J. Martin and M. Nystrom et al. The Lutonium: A sub-nanojoule asynchronous 8051 microcontroller. In ASYNC 2003, pages 14–23, Vancouver, Canada, May 12–15 2003.

[2] Alain J. Martin and Andrew Lines et al. The Design of an Asynchronous MIPS R3000 Microprocessor. In ARVLSI '97, pages 164–181, Ann Arbor, MI, Sep. 15–16 1997.

[3] M. Renaudin, P. Vivet, and F. Robin. A Standard-Cell Q.D.I. 16-Bit RISC Asynchronous Microprocessor. In ASYNC 98, pages 22–31, San Diego, CA, Mar. 30–Apr. 2 1998.

[4] A. Abrial, J. Bouvier, M. Renaudin, P. Senn, and P. Vivet. A New Contactless Smart Card IC using On-Chip Antenna and Asynchronous Microcontroller. IEEE Journal of Solid-State Circuits, 36(7):1101–1107, 2001.

[5] J.D. Garside et al. AMULET3i - an Asynchronous System-on-Chip. In ASYNC 2000, pages 162–175, Eilat, Israel, Apr. 2–6 2000.

[6] CAST http://www.async.caltech.edu/.

[7] F. Schalij. Tangram manual. Technical Report LR 008/93, Philips Research Laboratories, Eindhoven, The Netherlands, 1993.

[8] A. Bardsley and D.A. Edwards. The Balsa Asynchronous Circuit Synthesis System. In Forum on Design Languages (FDL2000), Tubingen, Germany, Sep. 4–8 2000.

[9] M. Renaudin, J.-B. Rigaud, A.V. Dinh-Duc, A. Rezzag, A. Sirianni, and J. Fragoso. TAST CAD Tools. In ASYNC 02 Tutorial, Manchester, UK, Apr. 8–11 2002.

[10] Quoc Thai Ho, J.-B. Rigaud, L. Fesquet, M. Renaudin, and R. Rolland. Implementing asynchronous circuits on LUT based FPGAs. In 12th International Conference on Field Programmable Logic and Applications (FPL), Montpellier (La Grande-Motte), France, Sep. 2–4 2002.

[11] S. Hauck, S. Burns, G. Borriello, and C. Ebeling. A FPGA for Implementing Asynchronous Circuits. IEEE Design and Test of Computers, 11(3):60–69, 1994.

[12] K. Maheswaran. Implementing Self-Timed Circuits in Field Programmable Gate Arrays. Master's thesis, U.C. Davis, 1995.

[13] B. Gao. A Globally Asynchronous Locally Synchronous Configurable Array Architecture for Algorithm Embeddings. PhD thesis, University of Edinburgh, Dec. 1996.

[14] R. Payne. Self-Timed Field Programmable Gate Array Architectures. PhD thesis, University of Edinburgh, 1997.

[15] J. Teifel and R. Manohar. Programmable Asynchronous Pipeline Arrays. In 13th International Conference on Field Programmable Logic and Applications, number 2778 in Lecture Notes in Computer Science, pages 345–354, Lisbon, Portugal, Sep. 2003.

[16] Al Davis and Steven M. Nowick. An Introduction to Asynchronous Circuits. Technical Report UUCS-97-013, University of Utah, Sep. 1997.

[17] Marc Renaudin. Etat de l'art sur la conception des circuits asynchrones : perspectives pour l'integration des systemes complexes, Jan. 2000. Internal Report.

[18] Marc Renaudin. Asynchronous circuits and systems: a promising design alternative. Microelectronic Engineering, 54(1–2):133–149, 2000.

[19] Ivan E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, Jun. 1989.

[20] M. Imran Masud. FPGA routing structures: a novel switch block and depopulated interconnect matrix architectures. Master's thesis, University of British Columbia, Dec. 1999.
