a flexible high-performance network router with hybrid wave

A Flexible High-Performance Network Router with

Hybrid Wave-Pipelining

Jabulani Nyathi and Jos�e G. Delgado-Frias

Department of Electrical Engineering

State University of New York

Binghamton, NY 13902-6000

Abstract

In this paper a novel hybrid wave-pipelined bit-pattern associative router is pre-

sented. A router is an important component in any communication network systems.

The bit-pattern associative router (BPAR) allows for exibility and can accommodate

a large number of routing algorithms. Wave-pipelining is a high performance approach

which implements pipelining in logic without the use of intermediate registers. In this

study a hybrid wave-pipelined approach has been proposed and implemented. Hybrid

wave-pipelining allows for the reduction of the delay di�erence between the maximum

and minimum delays by narrowing the gap between each stage of the system. This

approach yields narrow \computing cones" that could allow faster clocks to be run.This is the �rst study in wave-pipelining that deals with a system that has a sub-

stantially di�erent set of pipeline stages. The bit-pattern associative router has three

stages: condition match, selection function, and port assignment. In each stage data

delay paths are tightly controlled in order to optimize the proper propagation of signals.

Internal control signals are generated to ensure that data propagates between stages

in a proper fashion. The simulation results show that using a hybrid wave-pipelining

signi�cantly reduces the clock period and the circuit delays become the limiting factor,

preventing further clock cycle time reduction.

1 Introduction

Communication channels between any two nodes regardless of their physical location withina communication network system can be established by using routers at each node. The pur-pose of the router would be to handle messages on the network. A router receives, forwards,and delivers messages. The router system transfers messages based on a routing algorithmwhich is a crucial element of any communication network [2, 3]. Given the recon�guring re-quirements for many computing systems, a number of routing algorithms as well as networktopologies must be supported. This in turn leads to a need for a very high performance exi-ble router to support these requirements. To maximize a machine's overall performance andto accommodate recon�guration in a distributed environment requires matching the applica-tion characteristics with a suitable routing algorithm and topology. It has been shown that

1

machine performance is fairly sensitive to the routing function and tra�c [4]. In addition,having the capability of changing the routing algorithm at run time could facilitate smartinterconnects and adaptability that allow changes on the machine's topology for di�erentapplications.

A router intended to be used for a number of routing algorithms and/or networktopologies has to be able to accommodate a number of routing requirements. It is of greatimportance that the routing algorithm execution time be extremely short. This time dictateshow fast a message can advance through the network, since a message cannot be transferreduntil an output port has been selected by the routing algorithm. Thus, the routing algorithmexecution time must be reduced to decrease message delays. Other requirements may include: exibility to accommodate modi�cations to a network, algorithm and/or topology switchingwith minimum delay, and programmability to support a large number of routing algorithmsand network topologies.

The major approaches to the realization of exible routers are dedicated processorsand look-up table [5]. A dedicated processor allows bit manipulation, required by mostalgorithms for ease of execution. This type of router is capable of handling a number ofinterconnection networks, but due to its sequential execution of routing algorithms it is aslow approach. Also the resulting router can be expensive in terms of hardware. The look-up table approach requires the storage in memory of a predetermined routing path for allpossible destination nodes. The look-up table approach reduces the time of determining theoutput port to a memory access delay, however, large tables are required to store the routinginformation. The bit-pattern approach uses an associative approach to determine in parallelthe potential output ports (according to the routing algorithm). The output port is selectedby means of a selection function. This scheme has been shown to be able to accommodatea large number of routing algorithms and to have fast execution [7, 6]. This approach hasbeen mapped onto a pipelined VLSI architecture [5].

In this paper we present a high performance hybrid wave-pipelined VLSI routerthat constitutes of several modules and uses dynamic circuitry within the modules. Wave-pipelining is a design method that enables pipelining in logic without the use of intermediateregisters. In order to realize practical systems using wave-pipelining, it is a requirement thataccurate system level and circuit level timing analysis be done. At system level, general-ized timing constraints for proper clocking and system optimization need to be considered.Since the performance of the system depends on the timing between multiple signals withinthe data path and also the timing between the clock signal and data at the input register,there is, therefore, room for performance optimization within the data path and the clockgeneration circuitry.

At the circuit level, performance is determined by the maximum circuit delay dif-ference in propagating signals within a given module of the system. Accurate analysis andstrict control of these delays are required in the study of worst case delay paths of circuits.Data dependent delays also present a problem that needs to be considered in the use of wave-pipelining to improve performance. In this study a di�erent approach to minimizing the clockperiod is undertaken. It is termed hybrid wave-pipelining and hinges on wave-pipelining.

2

2 Router Functional Organization

The bit-pattern associative router (BPAR) scheme supports the execution of routing algo-rithms that are used in most communication switches. Using routing algorithms to determinethe output port requires that the following issues be considered:

� A routing algorithm examines the status of addressing bits that are important indetermining the output port. Some addressing bits need to be ignored, depending onthe current and destination nodes.

� Output port alternatives need to be prioritized. This priority is used to favor a portthat would yield a short path. This feature ensures deterministic routing algorithms.

In this section, an overall description of the router system is presented. This descrip-tion is at a functional level, in order to illustrate the requirements and capabilities of thesystem.

2.1 Router System Description

The associative router uses a content addressable memory as its bit-pattern associative unit,and this enables the destination address alternatives to be considered in parallel. The des-tination address is presented as the input to the bit-pattern associative unit for comparisonwith the stored data. This operation constitutes the normal operation of this design. Thebit-pattern associative router shown in Figure 1 consists of several modules [5]. In this illus-tration arrows show the general ow of data from one stage to the next, with the destinationaddress being presented to the bit-pattern associative unit, which compares (for normal op-eration) the presented destination address to the current address (stored in the bit-patternassociative unit). The results of the comparison are passed to the selection function, whichthen passes the match output with the highest priority to the port assignment to select theword corresponding to the selected output port.

De�nitions of some the modules and labels that appear in the �gure are providedbelow.

1. Destination address. The destination address is passed to the routing function froman input port. It becomes the search argument for the pattern associative mechanism.

2. Bit-Pattern Associative Unit. The destination address is compared to the currentaddress by means of a bit-pattern matching mechanism. An associative computationwith all the bit-patterns is performed to determine potential routing paths.

3. Selection Function. If more than one condition is met only one assignment should beprocessed based on a priority scheme. For an oblivious router, the other matches areignored. For an adaptive router, the other matches may be considered as alternativesto the port assignment.

4. Port assignment. The output port assignment is made by selecting the output portassociated with the selection condition.

3

destination address priority output port

functionRouting port assignment

sele

ctio

n fu

nctio

n

Figure 1: The bit-pattern associative router scheme.

The patterns stored in the bit-pattern associative unit allow the router to make a decisionabout the destination port based on the routing algorithm. The address of a destinationnode D (dn�1; : : : d0) needs to be compared to the current node's address C (cn�1; : : : c0). Arouting algorithm compares the bits of the two addresses; some bits are ignored since theydo not a�ect the current routing decision. These \don't care" bits usually occur at di�erentpositions for each potential path being considered and need to be customized according tothe routing algorithm requirements. This in turn imposes a requirement for the bit-patternassociative unit to include \don't care" bits at di�erent positions in each entry. To providethe exibility required to support multiple interconnection networks and routing algorithmsthe bit-pattern associative router must be programmable.

Modules of the bit-pattern associative router include:

1. A search argument register (SAR), responsible for receiving the destination addressand passing it to the next stage at the appropriate time.

2. The Bit-Pattern Associative Unit, for storing the bit patterns for a given interconnec-tion network node and for performing a comparison between the destination addressand the current address (speci�ed by the bit patterns). All the patterns that matchthe input address are passed to the following stage; the selection function.

3. The selection function which operates as a priority encoder, determines the priority ofthe bit-pattern associative unit outputs. If more than one pattern has been matched,only one match gets selected. The other matching outputs are ignored. The selectedmatch pointer gets passed to the next stage. If no matching pattern is found theselection function sets a no match signal.

4

4. The port assignment stores the output port assignments. The address where the assign-ment is read is speci�ed by a pointer from the selection function and this assignmentis passed to the port assignment register.

5. The port assignment register is responsible for storing the output port assignment tobe used by the interconnection network.

6. The shift register serves as a pointer to a row of DCAMs (bit-pattern associative unitarray) and DRAMs (port assignment) to be programmed to or to be refreshed. It isnot used during normal operation.

7. The refresh circuitry is responsible for refreshing stored data, since dynamic circuitsare used and it is transparent to the other modes of operation.

The modules described above are shown in Figure 2, where the normal operation of theassociative router is depicted.

2.2 BPAR Normal Operation

Even though programming precedes the normal operation of the bit-pattern associativerouter, the organization of the normal operation is presented �rst, since this operation con-stitutes the main function of the router. The system uses a content addressable memory(CAM) as the basic element of the bit-pattern associative unit. The CAM has a propertythat enables some addressing bits to be ignored. Figure 2 shows the bit-pattern associativerouter organization for normal operation. The destination address is presented as inputto the search and argument register (SAR) and passed to the bit-pattern associative unit,which already has the current address stored. The presented destination address is com-pared simultaneously to all the stored bits within the associative unit. Several matchingconditions can result from this comparison. All the matching conditions are presented to theselection function which then prioritizes these inputs and selects the one determined to havethe highest priority. The output of the selection function selects a word to be read from theport assignment. This word is the output port speci�er and is passed to the port assignmentregister.

Refreshing the stored data is inter-leaved with the comparison of the destinationaddress to the current address. This makes the refresh operation transparent to the matching(comparison of destination address with current address) operation. The row select blockfor both the bit-pattern associative unit and the port assignment serve to provide a pointerto a word to be read or written during the refresh operation. Figure 2 depicts a situation inwhich the destination address matches two words (two current addresses) in the bit-patternassociative unit. The selection function, therefore, receives two inputs from the associativeunit and has to select one with the highest priority, which will then be passed as the selectedmatch to the port assignment. The example presented shows that the topmost match hasa higher priority than the one below it. The organization for the programming mode is thesame as that of the normal mode of operation with minor di�erences.

5

Search ArgumentRegister

Row

Select

matchselected

Row

Select

Associative UnitBit - Pattern

match

match

no matchRefresh (DRAM)Refresh (DCAM)

Port AssignmentRegister

To Switching Network (Output Port)From Input Port)

Destination Address(

Port Assignment(DCAM) (DRAM)

selection function (SF)

Figure 2: The bit-pattern associative router organization (normal operation.)

2.3 BPAR Programming Mode

During the programming phase the row select is used to select a word to be written inboth the matching unit and the port assignment while the match (comparing of destinationaddress to current address), the selection function, and the refresh logic are ignored. Themodules that are ignored during programming appear as dotted lines in the �gure. Thesearch and argument register is used as an input latch during this mode of operation. Alsothe output port register serves to latch the input to the port assignment. The row select shiftregister points to the row of DCAMs or DRAMs to be programmed to, and this selectionoccurs simultaneously for both memories. The procedure for writing into the memories isillustrated in Figure 3. The selected word to be written to during programming is shownas a shaded block in both memories (the bit-pattern associative unit and the output portassignment).

2.4 Dynamic CAM Cell

The DCAM cell implements a comparison between an input and the ternary digit conditionstored in the cell. The proposed cell has a small transistor count. The circuit of a singleDCAM cell, shown in Figure 4, consists of eight and a half transistors; transistor Te is sharedby two cells. The read, write, evaluation, and match line signals are shared by the cells in aword while the BIT STORE, NBIT STORE, BIT COMPARE and NBIT COMPARE linesare shared by the corresponding bit in all words of the matching unit. The design usesa precharged match line to allow fast and simple evaluation of the match condition. Thiscondition is represented by the DCAM cell state which is set by Sb1 and Sb0 node status.

6

Row

Select

matchselected

Row

Select

Search ArgumentRegister

Associative Unit

Bit - Pattern

match

no matchRefresh (DRAM)Refresh (DCAM)

Input Latch Port AssignmentRegister

Input Latch

Current Address)Program Data

(Output Port Assignment)Program Data

(

Port Assignment

(DCAM) (DRAM)

selectiuon function (SF)

Figure 3: Router organization (Programming mode).

This state is stored in the gate capacitance of the transistors Tc1 and Tc0. In Table 1the possible DCAM cell stored values are listed. Transistors Tref0 and Tref1 are used forrefreshing.

The DCAM cell performs a match between its stored ternary value and a bit input(provided by BIT COMPARE and NBIT COMPARE which represent a bit and its oppositevalue, respectively). The match line is precharged to \1"; thus, when a no match exists thisline is discharged by means of transistor Tm.

Table 1: Representation of stored data in a DCAM cell.Sb1 Sb0 state

0 0 X(don't care

0 1 1(one)

1 0 0(zero)

1 1 not allowed

Normal operation involves comparing the input data to the patterns stored in theDCAM and determining if a match has been found. During match operation, the input datais presented on the BIT COMPARE line and its inverse value on the NBIT COMPARE line.Before the actual matching of these two values is performed, the match line is prechargedto \1" which indicates a match condition. The matching of the input data and the storeddata is performed by means of an exclusive-or operation which is implemented by the twotransistors (Tc1 and Tc0) that hold the stored value.

7

m

Tc1

Tw1 Tw0

T

T

Tref1

Te

Tr

ref0

NBIT_COMPARE

Sb0Sb1

evaluate

match line

write

read

BIT_COMPARE

NBIT_STOREBIT_STORE

Tc0

Figure 4: CMOS circuit for a DCAM cell.

2.5 Selection Function and Port Assignment

The selection function should be designed to ensure the deterministic execution of the routingalgorithms. With a given input (i.e. destination address) and a set of patterns stored in thematching unit, the port assignment should always be the same. The priority allows only thehighest priority pattern that matches the current input to pass on to the port assignmentmemory. The encoded priority (EPi) output depends on the match at the current bit-patternand the priority for this row. If both match and priority are \1"; then the encoded priority istrue. A priority lookahead scheme has been proposed and implemented; it has been reportedin [12].

The port assignment memory or RAM holds information about the output port thathas to be assigned after a bit-pattern that matches the current input is found. This memoryis proposed to be implemented using a dynamic approach. The selected row address is passedfrom the priority encoder. The cells in this row are read and their data is latched in the portassignment register. The DRAM structure is able to perform an OR function per columnwhen multiple RAM rows are selected; this is when multiple matches are passed on. TheDRAM structure is explained in [5].

3 Wave-Pipelining

Wave-pipelining is an approach to achieve high-performance in pipelined digital systems byremoving intermediate latches or registers [10]. Idle time of individual logic gates within

8

combinational logic blocks can be minimized using wave-pipelining.Wave-pipelining reduces clock loads and associated area, power and latency while

retaining the external functionality and timing of a synchronous circuit [10]. Using wave-pipelining, new data can be applied to the input of combinational blocks or stages withina digital system before the previous outputs are available, thus e�ectively pipelining thecombinational logic and maximizing the utilization of logic without inserting registers [8, 9].

3.1 Modelling Wave-Pipelining

Conventional circuit pipelining uses intermediate latches in addition to the input and outputregisters. Intermediate latches (registers) ensure that when the leading edge of the systemclock comes data gets propagated from one stage to the next in a synchronous manner. Ina system set-up like this there is only one set of data between register stages. By usingwave-pipelining in digital systems, the clock frequency of digital circuits can be increased.This set up allows for coherent waves of data to be sent through a block of combinationallogic and also allows for new inputs to be applied before the previous data propagates to theoutput latch. There may be more than one data wave within the system.

The rate at which data propagates through the circuit depends not on the longest pathbut on the di�erence between the longest and the shortest path delays [10]. Wave-pipeliningrequires that delay paths within a combinational logic circuit be balanced so that the datawaves applied at the input simultaneously arrive at the next stage. An illustration of awave-pipelined system appears in Figure 5. The intermediate latches have been removedand replaced by \intrinsic" latches. Also the system clock is no longer distributed andmultiple coherent \waves" of data are available between storage elements.

3.2 Wave-Pipelining Requirements

Wave-pipelining requires studies across a variety of levels such as process, layout, circuit,logic, timing and architecture. Several research groups have explored some of these areas.Some of the challenges of designing wave-pipelined systems within any of the areas mentionedabove include:

� Preventing intermixing of unrelated data \waves." There must be no data overrun ineach circuit block, and it must be ensured that there is no over committing of the datapath. This is achieved by determining an appropriate range of clock frequencies atwhich to apply data at the input.

� Designing dedicated control circuitry. Control logic circuits must be designed to operatesynchronously with the circuitry of the pipeline stages. The number of these controllogic circuits must be minimal in order to be area e�cient.

� Balancing delay paths. Delay paths must be equalized to ensure that data waves appliedto the input latch propagate through all the stages synchronously. This is achievedby inserting delay elements in the shortest paths within logic blocks to equalize theirdelays to those of the longest paths. This approach allows for the elimination of theintermediate latches or registers.

9

intermediatelatches removed

wave 2 wave 1 wave 0

��

�� O

UTPUTS

output latch

clocking signaldesigned from worst

case operation

��

��

��

��

��

��

input latch

INPUTS

clock

Logic Circut Array

Logic

Circuit Array

LCA )1((LCA0

Logic Circuit Array

) (LCA2

input latch

INPUTS

clock

)

OUTPUTS

output latch

clocking signaldesigned from worst

case operation

Logic

Circuit

Array

LCAn-1 )(

wave n-1

Figure 5: Block diagram of a wave-pipelined digital system.

The requirements stated above are not inclusive but they represent some of the most impor-tant design issues in wave-pipelining.

In this section some of the parameters of importance in wave-pipelining are discussed.These include delays within combinational logic, clock skew, and sampling time at outputregisters. Following an example from Wayne et al [10], a single combinational logic block isconsidered in order to come up with the terminology that is used to determine the timingrequirements for wave-pipelining. The timing constraints derived from using this single blockshould hold for any design with more than a single stage of combinational logic.

3.3 Wave-Pipelining Formulation

To present the clock parameters and delays, a simple pipeline stage is shown in Figure 6.Some important labels are:

� Dmin; the minimum propagation delay through the combinational logic.

� Dmax; the Maximum (worst case) delay in the combinational logic.

� Tclk; the clock period.

10

��

��

��

��

��

��

clk

∆clk

edge

∆clk

skewed edge

clock edge

Dmin

Dmax

skewed

∆

T

clock skew

combinational logic block output registerinput register

clock

Figure 6: Longest and shortest path delays of a combinational logic block.

� Ts and Th; the register setup and hold times.

� �; the constructive clock skew.

� �clk; the register's worst case uncontrolled clock skew.

� DR; the register's propagation delay.

� TL; the time at which data at the output register can be sampled.

� Tsx; the temporal separation between the waves at an intermediate node x.

For this discussion it will be assumed that the combinational logic block has two inputs, thesetup and hold times of the input and output registers are equal and the propagation delaythrough the register is the same for both the input and output registers. From the diagramof Figure 6 the propagation delay of the signals through the combinational logic can berelated to the input and output clocks. Figure 7 shows the maximum and minimum delaysin relation to logic depth. The shaded region in the �gure represents a period during whichdata is being processed within the combinational logic block. Figure 7 can be extended torepresent several data sets being processed as time progresses.

11

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

L

Ts + ∆clk Th

T

T

+ ∆clk

sx

T

clk

maxD

minDlo

gic

dept

h

time

x i - 1 i i + 1

output clock

input clock

Figure 7: Temporal/spatial diagram of a wave-pipelined system.

For wave-pipelining to operate properly, the system clocking must be such that theoutput data is clocked after the latest data has arrived at the output and the earliest datafrom the next clock cycle arrives at the output [10]. The time at which data can be sampledat the output register TL is given by:

TL = N Tclk +� (1)

where N is the number of clock cycles required to propagate the input through the combi-national logic block before it could be latched at the output register. Latching the latestdata at the output register requires that the latest possible signal arrives early enough to beclocked by this register during the N th clock cycle. Thus, the lower bound of TL is:

TL > DR +Dmax + Ts +�clk (2)

To latch the same data requires that the arrival of the next wave not interfere with thelatching of the current wave output. The clock skew �clk must also be accounted for,resulting in the upper bound of TL being:

TL < Tclk +DR +Dmin � (�clk + Th) (3)

The di�erence of these two equations results in a value of Tclk given by:

Tclk > (Dmax �Dmin) + Ts + Th + 2�clk (4)

12

From Equation 4 it is apparent that the minimum clock period is limited by thedi�erence in path delays (Dmax � Dmin) and the clocking overhead Ts + Th + 2�clk. Thisoverhead occurs as a result of the inclusion of the input and output registers and the clockskew. Internal nodes within the system need to be considered in this analysis. It is importantto ensure that there is no data overlap within the system. Therefore, the next earliest possibleset of data should not arrive at a node until the latest possible wave has propagated through.If x is the output of an internal node within a logic network at a point on the logic depthaxis of Figure 7, its longest and shortest delays from its inputs would be represented bythe variables dmax and dmin, respectively. The equation that describes this internal node'sconstraints is:

Tclk � dmax(x) � dmin(x) + Tsx +�clk (5)

where Tsx is the minimum time that node x must be stable in order to correctly propagatea signal through the gate.

3.4 Minimizing the Clock Period

In order to minimize the clock period, both register constraints and internal node constraintsmust be taken into account. According to [11] the feasibility region of valid clock period Tclkis not continuous. It is composed of a �nite set of disjoint regions. Equations 2 and 3 arerevisited in order to derive a two sided constraint on the clock period. It was established thatthe register latching time is given by TL = NTclk + � and if this is applied to Equations 2and 3 to obtain the intermediate values between the lower and upper bound of the registerlatching time, the following equation results:

DR +Dmax + Ts +�clk < NTclk +� < Tclk +DR +Dmin +Dmax � (�clk + Th) (6)

To avoid writing long expressions two variables Tmax and Tmin are introduced.

� Tmax; maximum delay through the logic including clocking overhead and clock skews.

� Tmin; minimum delay through the logic including clocking overhead and clock skews.

Tmax = DR +Dmax + Ts +�clk �� (7)

andTmin = DR +Dmin � (�clk � Th)�� (8)

Having established these new variables Equation 6 can be simpli�ed to read:

Tmax

N< Tclk <

Tmin + Tclk

N(9)

The inequality to the right can be simpli�ed as follows:NTclk < Tmin + TclkNTclk � Tclk < Tmin

(N � 1)Tclk < Tmin

13

Equation 9 becomes:Tmax

N< Tclk <

Tmin

N � 1(10)

The above analysis shows that for N = 1, the clock period is not continuous and is onlybounded below by Tmax.

4 Hybrid Wave-Pipelining

Having presented the timing constraints of wave-pipelining, attention is now turned towardsdescribing these timing constraints as they apply to a hybrid wave-pipelined approach. Equa-tions to describe the timing constraints for this approach are derived and compared to thoseof the previous section. In many computer/digital systems each stage has a signi�cantlydi�erent function and circuitry, therefore, wide variations in delays (Dmin and Dmax) maynot be tolerated. Figure 8 shows an example of a wave-pipelined system. The inputs tothe system are passed in a synchronous manner by means of an external clock. Assumingthat three variables need to be computed in this stage and passed on to the following stage,it follows that for a given set of inputs, these variables would have delays associated witheach one of them. These delays are denoted as dA, dB, and dC , for inputs A, B, and C

respectively. The di�erence in delay may cause stage 2 (Figure 8) to start in a di�erentcomputation path than what is expected. This in turn produces a false start. This falsestart creates unnecessary changes in stage 2 as well as additional delays, that need not haveoccurred. These delays dA, dB and dC would depend on the gates associated with each path,and also on the set of input values.

N

U

I

P

T

S

dB

dC

dA

dD

dE

input latch

stage 1stage 2

clock

false start

Figure 8: Example of delays associated with pipeline stages.

These problems are avoided using a hybrid wave-pipelining approach. In order toprovide some insight into the problem de�nition and solution, a brief summary is providedbelow. A common engineering practice is to consider the worst case delay (Dmax), to ensurethat the system runs properly. Dmax plays a very important role in the system's performanceand safe regions of operation. Dmin (the shortest delay path), on the other hand imposes

14

a restriction in the valid input window. Getting Dmin closer to Dmax could increase thiswindow; in other words it could decrease clock cycle time.

��

��

��

��

��

��

��

��

��

��

��

��

��

��

min_hold

Dmax

dhold(0)

∆clk

D+ Th minD

Tclk

DR

dmin(0)

dmin(1)

Tsx

Ts + ∆clk

ii + 1i - 1

d (1)hold

time

output clock

input clock

logi

c de

pth stag

e 1

stag

e 0

stag

e 2

Figure 9: Temporal/Spatial diagram of a hybrid wave-pipelined system.

The equations derived for the hybrid wave-pipelining are denoted by the subscript hto di�erentiate them from those of the wave-pipelining time constraints. The de�nitions forthe variables presented in the previous subsection still hold. To derive the equations thatdescribe the timing constraints for the hybrid wave-pipeline, the temporal/spatial diagramrepresenting this scheme is presented �rst. The shaded regions of Figure 9 indicate that datais not stable; therefore, register outputs cannot be sampled. The computational cones inthis diagram have been arranged to represent each stage within the design. Some variablesneed to be de�ned. They are:

� dmin(n); the minimum delay encountered in propagating data within a single stage n.

� Dmin hold; the overall minimum delay of all the stages and it includes the intrinsicregisters' hold times.

For hybrid wave-pipelining, TL's lower bound is described as follows:

TLh > DR +Dmax + Ts +�clk (11)

The upper bound of TLh is

TLh < Tclk +DR +Dmin hold � (�clk + Th) (12)

where Dmin hold = dmin(0) + dhold(0) + dmin(1) + dhold(1) + dmin(2) + dhold(2)

15

This equation takes into consideration the intermediate stages of the design. Theminimum delays and the hold times of each stage are considered. From the above equationit can be determined that:

Dmin hold � Dmin (13)

This implies that this delay di�erence is less than Dmax�Dmin for the wave-pipeliningscheme. If further derivations are carried out, the clock period for the hybrid approach isdetermined to be:

Tclk(h) > (Dmax �Dmin hold) + Ts + Th + 2�clk (14)

Comparing Equations 4 and 14 and having Dmin hold � Dmin, a conclusion thatTclkh � Tclk can be drawn. This implies that the clock period for the hybrid wave-pipelinedapproach allows for the clock signal period to be reduced, hence an increase in performance.A complete analysis of the hybrid wave-pipelining scheme must include clock cycle minimiza-tion, taking into consideration the constraints of the internal nodes of the system and theregister constraints. Based on the analysis of Equation 6, the minimum delay of the hybridapproach can be re-written to include the stage hold times as follows:

Tmin(h) = DR +Dmin hold ��clk � Th �� (15)

The maximum delay through the logic including the overhead and clock skews Tmax, remainsunchanged for the hybrid scheme. From Equation 15, and by taking into consideration thefact that Dmin hold � Dmin, it is determined that: Tmin < Tmin(h). Also from Figure 9it can be noticed that the region in which data is not stable, i.e. the di�erence betweenDmax �Dmin hold, is short. It can then be safely stated that Dmax � Dmin hold. The signallatching time, Equation 6 now becomes:

DR +Dmin hold + Ts +�clk �� < NTclk < Tclk +DR +Dmin hold � (�clk + Th)�� (16)

This analysis is graphically presented in Figures 10 and 11. Figure 10 shows thatthere is room to make the clock cycle smaller, since the distance between labels windowh

and windoww can be reduced. Figure 11 shows how e�ectively this could a�ect the clockperiod, reducing it from Tclk to T

0

clk.

4.1 Timing Approach

Propagating coherent data waves from one pipeline stage to the next without the use ofa distributed system clock is achieved by balancing the delays within each pipeline stage.Delay balancing is necessary since the data within a circuit module takes di�erent pathsand experiences di�erent delays depending on the operation, resulting in the data arrivingat the input of the next stage at di�erent times. The data that has the shortest delayarrives at the input to the next stage earlier than data that takes the critical path (longestdelay), hence the use of intermediate latches between pipeline stages (modules). Wayne [10]et al, have suggested bu�er insertion as a method of delay balancing. Bu�ers are inserted

16

��

��

��

��

��

��

��

min_hold

Dmax

dhold(0)

Tsx

hD

w

dmin(0)

dmin

window

window

clk

(1)

min

T

D

DR

ii + 1

d (1)hold

time

input clock

logi

c de

pth stag

e 1

stag

e 0

stag

e 2

Figure 10: Temporal/spatial diagram before clock period reduction.

��

��

��

��

��

��

��

��

maxwindowh

Tsx

T’

D

DR

min_holdD

clk

logi

c de

pth stag

e 1

stag

e 0

minD

stag

e 2

i

time

i + 1

Figure 11: Temporal/spatial diagram after clock period reduction.

17

in the shortest signal path to force these delays to be equal to the worst case delay, whichin turn permits the signals to arrive at the input of the next stage at the same time withoutthe use of intermediate latches nor the system clock. A di�erent approach is considered inthis research. It does not involve bu�er insertion, but uses the delays in the design of theappropriate signals that will enable data to ripple through the stages coherently without theuse of distributed clock nor intermediate latches. The method used will not be limited in useto the bit-pattern associative router circuitry and should �nd application in many circuitsthat employ the wave pipelining technique to improve performance.

The di�erence between the maximum (worst case) delay and the minimum delay is ofparticular interest in designing the timing scheme of a wave-pipelined bit-pattern associativerouter. There will be no bu�er insertion to balance the delays, instead specialized circuitryis designed to provide very precise timing sequence. The advantage of this approach is thatit is not a�ected by scaling or use of a di�erent technology since all the modules are a�ectedby these changes the same. The only major limiting factor becomes the worst case delay ofthe circuit. The valid clock pulse width (PW) can be made short, and it cannot be less thanDmin + (Dmax �Dmin) i.e. PW � Dmin + (Dmax �Dmin). This expression includes the riseand fall times of the clock and is based on measurements taken from SPICE simulations.This gives the minimum pulse width that will enable application of wave-pipelining to thebit-pattern associative router without intermixing the data \waves" within any given pipelinestage. This provides the basis for the design of the clock pulses required to cause the data toripple through the pipeline stages. The system clock is responsible for passing data from thesearch argument register (SAR) to the bit-pattern associative unit. It has been establishedthat the bit-pattern associative router requires the following signals for its operation:

� Read and write signals for the memories (DCAM and DRAM). The write signalmust be raised to a logic 1 only during programming (storing of current address) andrefreshing of the stored data; this is done when the row select signal selects a speci�cword to write to within the bit-pattern associative unit array. Reading occurs whenthere is a need to refresh the contents of the words stored within the array. The readand write signals are generated whenever the stored data has been sensed to degrade.

� Bit lines. They pass the input to the bit-pattern associative unit array during pro-gramming as well as normal operation and they need to be precharged before readingfrom the array. There is thus a need to design specialized circuitry to generate theprecharge signal for the bit lines.

� The evaluate signal. The evaluate signal is responsible for determining if a matchingcondition exists in the bit-pattern associative unit. Special circuitry to provide theproper evaluate signal has to be designed based on dmax(x) and dmin(x), where dmax(x)is the maximum delay path for a signal propagating through stage x and dmin(x) theminimum delay in propagating a signal through stage x of the bit-pattern associativerouter.

� The precharge signal. The match line of the bit-pattern associative unit is designedto be normally on, and this is achieved by precharging it before evaluation occurs. Aprecharge signal, therefore, has to be designed and it must occur at appropriate times.

18

The signals dmin(x) and dmax(x) play an important role in the design of the prechargesignal for the match line.

� The signal that permits the data \wave" to be passed to the selection function. Thesignal that allows the match output to be passed to the selection function is generatedby using the evaluate signal and sizing transistors with the threshold voltage of thetransistor used to latch the match outputs.

� Precharge for the selection function. This signal is simply the inverse of the evaluatesignal.

� The priority status signal. Within the selection function the priority status signal is ofimportance. If a match has been found in any of the entries other than the last entry ofthe matching unit, the selection function has to detect it as a match with the highestpriority and must then propagate a priority status indicating to the entries below thata match has been found. In this design a priority status Pi = 0 indicates that a matchwith the highest priority has been found and hence kills all the other possible matchesbelow the current one. The delays experienced by this signal are of importance sinceit can be used to determine when new input can be applied to the SAR.

� The signal required to pass the selected match output to the port assignment. This signalis required to gate the selected match output which is the one determined to have thehighest priority by the selection function, to the next stage. The system clock is notused in the selection function, therefore, the signals already generated have to be usedto provide precise timing to pass the selected match output to the port assignmentmemory. This signal serves to select the appropriate output port assignment.

The signals described above form the basis for proper timing of the hybrid wave-pipelined bit-pattern associative router. The primary objective is to reduce clock cycletime by allowing the delays to propagate data from stage to stage with the use ofneither intermediate latches nor distributed clocks.

As clock rates continue increasing, clock distribution is becoming a major problem;wave-pipelining could alleviate this problem by signi�cantly reducing the clock load as wellas the number of inter-stage latches [1].

4.2 Hybrid Wave-Pipelining Design

In this section circuits to generate some of the signals discussed above are presented. Re-moval of intermediate latches introduces the possibility that two or more data waves canbe intermixed. A wave-pipelined design must then ensure that this data collusion does notoccur. Circuitry to ensure that data propagates within its own wave has been designed.

4.3 Control Signals for The Bit-Pattern Associative Unit

The requirement for precharging the match line is that this occurs before evaluation takesplace and after the output of the match line has been passed to the next stage. As a result

19

the precharge signal is designed using the signals used to evaluate and pass the match lineoutput to the next stage. The circuit in Figure 12 is designed to generate the match lineprecharge signal. Transistor Tc, when conducting, charges up node V and the status of nodeV is passed to the inverter input when the pass signal is at \0". At evaluation time it isrequired that the match line precharge signal be at \1" to prevent a con ict of operations onthe match line (attempting to precharge the match line while evaluating it). Also, care mustbe taken to prevent precharging the match line before its output is passed to the next stage.The match line precharge signal, therefore, has to be low only when both the evaluate andpass signals are at \0", that is, the match line must be precharged when both the evaluateand pass signals are at \0". This condition is enforced by use of the evaluate signal to turnon transistor Td, providing a path to ground during evaluation. This occurs when pass is atlogic level 0, ensuring that node V gets discharged to \0", resulting in match line prechargesignal being at logic 1. The precharge device is a p-type transistor; it is, therefore, notconducting when the match line precharge signal is at logic 1. Transistor Toff passes a \1"to the gate of Tc when the evaluate signal is at logic 1, preventing a \1" from being passedto node V by turning o� transistor Tc. Since the pass and evaluate signals are never at logic1 at the same time, pass at the gate of Ton, after allowing a time lapse of two inverter delays,enables Tc to conduct. This arrangement ensures that the precharge signals will operateprecisely as desired.

Ton

Toff

Tc

Tr

Td

match lineprecharge

pass

evaluate

V

Figure 12: Generating the match line precharge signal.

The evaluate signal must go to \1" after the data passed to the DCAM for comparisonhas stabilized on the bit lines (BIT COMPARE and NBIT COMPARE). There is circuitrythat generates this signal once all the lines have been set to their appropriate input values.This circuitry takes into account setting a \0" or \1" in the particular CAM array and isshown in Figure 13. Its operation is governed by the status of the match line. The inverseof the clock gets passed to an intermediate node when the match line is at \1" and thisintermediate node gets precharged to \1" whenever the match line is at \0". The inverse ofthe intermediate node is the evaluate signal. Using the match line to pass the clock signal

20

and set the evaluate signal to \0" when evaluation is completed provides the proper durationfor the evaluate signal to stay at \0" or \1". Passing the clock signal when the match line isat logic 1 basically determines that the data on the bit line has had enough time to stabilizeon the bit lines.

clock

match line

evaluate

Figure 13: Generating the evaluate signal.

Once evaluation has been completed the match line outputs are passed to the selectionfunction. Thus, the pass signal must immediately be active following completion of theevaluation process. The circuit used to generate the pass signal is shown in Figure 14.The pass signal is designed to mimic the path that a zero on the match line (non-matchingcondition) takes once evaluation completes. Passing a \0" to the selection function providesthe maximum delay that can be experienced in passing the matching unit's outputs tothe selection function and, therefore, constitutes the worst case propagation delay for thisoperation. The node pass is initialized to \1" by precharging node z when the evaluatesignal is \1". A \1" at the pass node ensures that transistor Tpz passes a \0" which getsinverted and becomes input to transistor Tpd. When transistor Tpd conducts discharging thepass node, this becomes an indication that the pass signal has stayed at \1" long enough topass the output of the match line to the selection function. The duration and voltage levelof both the evaluate and pass signals need to be just long and high enough to accomplishtheir purpose. The circuits in Figures 13 and 14 have been designed to sense the durationand voltage levels of these signals. All the signals presented to this point depend on thesystem clock.

Tp

Tpd

Tpz

Tprech

Tpu

pass

evaluate

p

z

Figure 14: Generating the pass signal.

21

Another signal of importance is within the selection function. This is the signal thatenables the port assignment select signals i.e. the output of the selection function; the pointerto the selected word of the port assignment. It is required that if a match with the highestpriority has been established, the ones below this match output must be disabled. Thepriority status performs this function along with an enable signal, which has been designedbased on the fact that it takes longer to pass \0" from the bit-pattern associative unit. Therequired delay between sets of four selection function cells is arrived at by inserting delayelements within the circuit. This approach ensures the prevention of false starts.

enable0

enablen

PLn-1

enable1

pass precharge

PL0

Figure 15: Enabling the selection function outputs.

4.4 Hybrid Wave-Pipelining Signals

The signals generated by the above circuits appear in Figure 16 along with the clock. Thedelays of the �rst two stages are clearly marked in the �gure to show a correlation with thehybrid wave-pipelined approach. On Figure 16 the minimum delays, dmin and hold timesdhold of each stage are shown. Also the duration of the evaluate and pass signals are depicted.Using 1�m technology 2.5 V is su�cient to determine a logic 1.

Once the output of the match lines have been received by the selection functiona decision needs be made when more than one matching condition has been received. Theselection function is designed to propagate a priority status Pi to the entry below it indicatingwhether it has registered a match. This priority must be propagated very fast to prevent falsestarts. Some of the signals of importance in this scheme are shown in Figure 17. The selectionfunction's critical operation occurs when the �rst and last entries of the selection functionsimultaneously receive inputs indicating that matching conditions have been found in thecorresponding bit-pattern associative unit entries. The �rst entry of the selection functionhas to propagate a \0" to the last entry of the selection function to prevent generation of

22

a pointer to the DRAM by the last entry. Signals to lessen the possibility of false startsare generated and these are labeled enable. They are designed to ensure that even if a falsestart occurs the pointer to the DRAM array is not established. In Figure 17 delays from thesecond stage of the BPAR and those of the third stage are shown. In Figure 17 signals usedto enable the DRAM pointers for the �rst and last entries from a setup in which the last entryalways �nds a match and the �rst entry �nds a match every other clock cycle are shown.The delays involved are shown. The plots in Figure 18 are those of the DRAM pointers tothe �rst and last entries of the DRAM. They serve to show that an output port assignmentcan be read from the DRAM array every one and a half clock cycles. Also ensuring thatthe matching condition occurs every other clock cycle for the �rst entry of the bit-patternassociative unit, while the last entry always �nds a match, makes for a good check for falsestarts.

0

1

2

3

4

5

0.00000 0.00001 0.00249 0.00500 0.00639 0.01000 0.01166

Time (µs)

Vol

tage

(V

) dholdCAM

d CAM d SF

passevaluate clock

dminCAM

SF = selection function

Figure 16: Plot of the bit-pattern associative unit control signals.

23

0

1

2

3

4

5

0.000 0.001 0.004 0.006 0.010 0.011 0.016 0.019 0.021

Time (µs)

Vo

ltag

e (V

)output0 DRAMselect15

enable3

DRAMselect0

pass

enable0

output15

dpass(SF)

d (SF)

d (RAM) dhold(RAM)

SF = selection function

Figure 17: Plots of the enable signals.

0

1

2

3

4

5

0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

time (µs)

Vol

tage

(V

)

DRAMselect0 clock

DRAMselect15

DRAMoutput bit0

Figure 18: Plot of the DRAM pointers.

24

5 BPAR's Space/Time Hybrid Wave-Pipelining

Figure 19 shows the delays associated with each stage (as shown in Figures 16, 17, and 18).The computational cones of Figure 19 show the three stages of the bit-pattern associativerouter. Hybrid wave-pipelining has enabled for the delay di�erence per stage to be reduced.Unlike wave-pipelining where Dmin for the system provides the upper bound for the clockperiod, hybrid wave-pipelining shows that each stage has it's maximum delay that can bemanipulated independent of other stages to obtain optimal timing.

The bit-pattern associative unit completes worst case operation (computation) in 5.1nano seconds (ns). This time includes the input latch delays, clock skew and hold and setuptimes for this stage. To accommodate the latest data wave arriving at this stage additionaltime during which this data can still be passed to the next stage is allocated. This timeappears in the �gure as dpass with a value of 0.9 ns. For data arriving late at the bit-pattern associative unit at least 2 ns is the time during which this data can still be processedwithin this stage. The hold time for the selection function is very short, almost equal to theminimum delay of this stage. This short hold time is a direct result of the selection functiondesign, which ensures propagation of the priority status to entries below the current onevery fast by means of the priority lookahead. The port assignment has a minimum delayof 0.4 ns and requires 3 ns of hold time. It also accommodates the latest \wave" for thecurrent data. Reduction of the gap between Dmax and Dmin at each stage is presented heregraphically with the delays displayed to show how the clock period is made shorter in thisdesign. Comparing this �gure with Figure 20 it can be seen how the time during which datacan be sampled at output latches is reduced.

A combined temporal and spatial diagram for a wave-pipelined router is shown inFigure 20. The horizontal axis represents time while the vertical axis represents the circuitryof the system which includes: input latch, CAM (bit-pattern associative unit), selectionfunction, RAM (port assignment), and output latch. The shadowed regions represent thetimes when a computation is being performed and data may be unstable. The space betweentwo adjacent regions corresponds to stable data. In this �gure several waves of data are shown(from data i � 1 to i + 1). It should be pointed out that the worst time for the selectionfunction occurs when the top priority CAM entry matches and produces a read. This inturn creates a delay for the last entry which is ready before the priority lookahead signalreaches it. In the �gure, it is shown how the �rst entry a�ects the last entry. In a traditionalpipeline this creates a very long delay in the selection function stage; since the CAM outputis not known until the pipeline clock comes. The output data should be read when it isstable; thus the output clock should be synchronized to achieve this.

The hybrid wave-pipelined router described in this paper has been fabricated using a0.5�m technology and the chip tested. Some traces of the signals described in this sectionappear in the Appendix.

6 Concluding Remarks

An extremely fast exible router capable of accommodating a number of routing algorithmsand networks has been presented in this paper. The bit-pattern associative router is designed

25

CA

Md

dmin

dmin

hold

d

CA

M

min

SF

hold RA

Md

hold

devaluate

dpass

SF

dR

AM

time

logic depth

input latch and CAM selection function (SF) RAM and output latch

1.6 ns3.5ns

0.7 ns0.8 ns

0.4 ns3 ns

2 ns

0.9 ns

Figu

re19:

Hybrid

wave-p

ipelin

eddelay

stran

slatedto

computation

alcon

e.

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

min_hold

Dm

axw

indowh

min

D

T’clk

D

Tsx

DR

dhold C

AM

dmin C

AM

dm

in SFd

hold SF

dmin R

AM

dhold R

AM

RAMCAM

i + 1

i - 1

time

ouput latch &

logic depth

selection function

input latch &

i

Figu

re20:

Router's

Tem

poral/sp

atialdiagram

.

usin

ganovel

schem

ethat

combines

wave-p

ipelin

ingandcon

vention

alpipelin

ingproducin

ghybrid

wave-p

ipelin

ing.

Thehybrid

wave-p

ipelin

ingapproach

narrow

sthegap

betw

eenD

min

andD

maxresu

ltinginared

uction

oftheclo

ckcycle

time.

Tofurth

erim

prove

theperform

ance

ofthebit-p

atternasso

ciativerou

terdynam

iccircu

itsalon

gwith

ahigh

perform

ance

selectionfunction

tored

uce

priority

statusprop

agationdelay

sare

used

.This

studyis

the�rst

ofit's

kind;it

exam

ines

thedelay

swith

ineach

module

ofthedesign

andminim

izesthem

26

individually.Issues such as signal generation, uneven delays, and timing have been addressed to

synchronize the pipeline. The bit-pattern associative router presented in this paper has beendesigned, implemented and tested. The test results appear in the appendex.

27

References

[1] C. Thomas Gray, Wentai Liu, Ralph K. Cavin, III,Wave Pipelining: Theory and CMOSImplementation, Kluwer Academic Publisher 1994.

[2] D. Clark and J. Pasquale, \Strategic Directions in Networks and Telecommunications,"ACM Computing Surveys, vol. 28, no. 4, pp. 679-690, Dec. 1996.

[3] T. Leighton, \Average Case Analysis of Greedy Routing Algorithms on Arrays," 2ndAnnual ACM Symposium on Parallel Algorithms and Architectures, pp. 2-10, Crete,Greece, 1990.

[4] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina, \Architectural Requirementsof Parallel Scienti�c Applications with Explicit Communication," 20th InternationalSymposium on Computer Architecture, pp. 2-13, May 1993.

[5] J. G. Delgado-Frias, J. Nyathi and D. H. Summerville, \A Programmable DynamicInterconnection Network Router with Hidden Refresh" IEEE Trans. on Circuits andSystems, I vol. 45, no.11, pp. 1182-1190, Nov. 1998.

[6] D. H. Summerville, J. G. Delgado-Frias, and S. Vassiliadis, \A Flexible Bit-AssociativeRouter for Interconnection Networks," IEEE Trans. on Parallel and Distributed Sys-tems, vol. 7, no. 5, pp. 477-485, May 1996.

[7] D. H. Summerville, J. G. Delgado-Frias, and S. Vassiliadis, \A High Performance Pat-tern Associative Oblivious Router for Tree Topologies," IPPS '94: The 8th InternationalParallel Processing Symposium, pp. 541-545, Cancun, Mexico, April 1994.

[8] C. Thomas Gray, Wentai Liu, and Ralph K. Cavin, III, \Timing Constraints for Wave-Pipelined Systems," IEEE Trans. on Computer-Aided Design of Integrated Circuits andSystems, Vol. 13, No. 8, pp. 987-1004, August 1994.

[9] D. Ghosh, and S. K. Nandy. \Design and Realization of High-Performance Wave-Pipelined 8 X 8 b Multiplier in CMOS Technology," IEEE Transactions on Very LargeScale Integration (VLSI) Systems, Vol. 3, No. 1, pp. 36-48, March 1995.

[10] W. P. Burleson, M. Ciesielski, F. Klass and W. Liu, \Wave-Pipelining: A Tutorial andResearch Survey," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol.6, No. 3, pp. 464-474, September 1998.

[11] W. K. C. Lam, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, \Valid Clock Frequen-cies and Their Computation in Wave-pipelined Circuits," IEEE Trans. on Computer-Aided Design, vol. 15, no. 7, pp. 791-807, July 1996.

[12] J. G. Delgado-Frias and J. Nyathi, \A High Performance Encoder with Priority Looka-head," IEEE Sixth Great Lakes Symp on VLSI, pp. 59-64, Lafayette, Louisiana, Feb.1998.

28

evaluate

match-alternately (first and last entries)

Figure 21: Matching the �rst and last entries alternately.

matchline pass

Figure 22: Periodic discharge of the matchline.

RAM_output (first entry) RAM-output (last entry)

Figure 23: Port Assignment output.

29

a flexible high-performance network router with hybrid wave

Documents