
Presentation at NOCS’15 on September 29, 2015

On-Chip Decentralized Routers with Balanced Pipelines for Avoiding Interconnect Bottleneck

Ryota Yasudo¹, Hiroki Matsutani¹, Michihiro Koibuchi², Hideharu Amano¹, Tadao Nakamura¹

¹ Keio University, Japan

² National Institute of Informatics, Japan

Outline

- Introduction
- Related work
- Architecture and delay models
- Results
- Conclusions and future work

Interconnect Bottleneck is real

[Figure: normalized delay (log scale, 0.1 to 100) versus process technology (250, 180, 130, 90, 65, 45, 32 nm) for global interconnect without repeaters and with repeaters]

Source: M. Anders, “High-performance Energy-efficient NoC fabrics”, NOCS’14


Designers now face difficulties dealing with wire delay of long interconnects.



New clash between logical and physical distance/delay

Hop count vs. link length / wire delay

Well-balanced implementation of topologies with long links becomes increasingly difficult as technology evolves.

To reduce logical distances, high-radix topologies such as Flattened Butterfly are used.

To reduce physical distances, we should shorten links.
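The tension between the two goals can be illustrated with a small sketch. It assumes a 2D tile grid, a mesh with links between adjacent tiles only, and a 2D flattened butterfly in which every router connects directly to all routers in its row and column. The function names and the unit of "tile pitches" are illustrative assumptions, not taken from the slides.

```python
def mesh_hops(src, dst):
    """Hop count on a 2D mesh: one hop per tile in each dimension."""
    (sx, sy), (dx, dy) = src, dst
    return abs(dx - sx) + abs(dy - sy)


def mesh_longest_link(src, dst):
    """Longest link on a mesh route, in tile pitches (1 unless src == dst)."""
    return 0 if src == dst else 1


def fbfly_hops(src, dst):
    """Hop count on a 2D flattened butterfly: at most one hop per dimension."""
    (sx, sy), (dx, dy) = src, dst
    return int(dx != sx) + int(dy != sy)


def fbfly_longest_link(src, dst):
    """Longest link used, in tile pitches: one hop spans the whole offset."""
    (sx, sy), (dx, dy) = src, dst
    return max(abs(dx - sx), abs(dy - sy))


src, dst = (0, 0), (3, 2)
print("mesh:  hops =", mesh_hops(src, dst), "longest link =", mesh_longest_link(src, dst))
print("fbfly: hops =", fbfly_hops(src, dst), "longest link =", fbfly_longest_link(src, dst))
```

Under these assumptions the high-radix topology cuts the hop count but pays for it with links that span several tiles, which is where wire delay grows.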

[Figure: a route from a source router to a destination router, shown logically (hops 1 and 2) and physically (router placement on the chip, asking how long each link is)]

Our contributions

- We propose decentralization of on-chip routers as a universal methodology to solve the problems described above.
- We formulate simple delay models of conventional routers and decentralized routers.
- Four case studies in a 28-nm process demonstrate the impact of our proposal.
  - The case studies vary the routing algorithm, the number of VCs, and so forth.




Preceding studies about decentralized router architectures

- Proposed for different purposes from ours
  - Rotary Router [ISCA’07]
  - ElastiNoC [NOCS’14]
- Proposed for a similar purpose to ours
  - Distributed Switch [ICPP’11]

Rotary Router [Abad et al., ISCA’07]

[Figure 2 of Abad et al.: Rotary Router sketch, with an input stage, an output stage, and a buffering segment stage built from dual-port FIFO buffers, FIFO buffers, multiplexers, and demultiplexers; ports N, S, E, and W plus an injector and a consumer]

Source: Abad et al., “Rotary Router: An Efficient Architecture for CMP Interconnection Networks”, in Proc. of ISCA’07, p.117, Figure 2.

It eliminates HoL blocking to improve performance.

It is the first to show signs of a decentralized router for NoCs.

ElastiNoC [Seitanidis et al., NOCS’14]

[Fig. 3 of Seitanidis et al.: the modular construction of an example ElastiNoC 4×4 VC-based router using the proposed Merge Unit (MU) primitive that supports 2 VCs]

Source: Seitanidis et al., “ElastiNoC: A Self-Testable Distributed VC-based Network-on-Chip Architecture”, in Proc. of NOCS’14, p.138, Fig. 3.

It provides a distributed self-testing mechanism that achieves high fault coverage.

The study applies a decentralized architecture to building fault-tolerant networks.

Distributed Switch [Roca et al., ICPP’11]

[Fig. 4(b) of Roca et al.: distributed output port controller of a distributed switch]

Source: Roca et al., “A Distributed Switch Architecture for On-Chip Networks”, in Proc. of ICPP’11, p.24, Fig. 4(b).

It presents the idea of reducing the negative impact of links.

Its starting point is a modular switch, which is a specific architecture.

Summary of related work

- Related work shows that decentralized architectures are beneficial for different purposes.
- We follow the concept of reducing the impact of wire delay.
- Our starting point is standard routers, and our case studies generalize decentralized architectures.
- We propose alternative approaches:
  - Decentralized buffer design
  - Optimization of the arrangement of modules based on our delay models




Baseline router: architecture

[Figure: baseline router with input channels (VCs and RC logic), an arbiter, a crossbar switch, and output channels]

[Figure 3: Decentralized buffer composed of D latches. Data (1 flit wide) comes from the previous router and passes through columns 1 to 5; back pressure (1 bit) comes from the arbiter; the buffer delay is, e.g., 2 cycles.]

Back pressure is transferred in the direction opposite to data, at the same speed as data. It passes through D latches that are added to each column and stops data transfer, as shown in Figure 3. When an arbiter wants to stop data (i.e., packets conflict), the arbiter asserts back pressure. Back pressure then arrives in order, starting from the header flit.

The design is implemented in Verilog HDL, then synthesized, placed, and routed with Synopsys Design Compiler and Synopsys IC Compiler. Data are transferred in two clock cycles when back pressure is low and are stored in the D latches after the arbiter asserts back pressure. After back pressure is negated, data transfer restarts. This method inherently transfers data correctly without performance overhead.

Based on the above, the proposed decentralized router architecture is illustrated in Figure 4. It is basically identical to the naive design, but the buffers are split up as shown in the figure. The columns of the buffer are placed at regular intervals. Flits from the previous router are transferred on the data path, while control signals are extracted and forwarded on the control path. Data transfer and control processing are then handled independently.

In contrast to the naive architecture, flow control does not change in the proposed architecture, because the buffers are separated from all the submodules. The same flow control that the baseline router uses, namely virtual cut-through, is performed. Consequently, the proposed and baseline architectures perform identically at the cycle level. This means that performance is determined by the critical path delay rather than by the number of execution cycles.

The delay model of the proposed router is formulated as follows:

$$T_{\mathrm{RC}} = G_{\mathrm{rc}} + W_{\mathrm{segment}_a}, \tag{9}$$
$$T_{\mathrm{VSA}} = G_{\mathrm{arb}} + W_{\mathrm{segment}_b}, \tag{10}$$
$$T_{\mathrm{ST}} = G_{\mathrm{cb}} + W_{\mathrm{segment}_c}. \tag{11}$$

$T_{\mathrm{VSA}}$ is improved compared with the naive architecture because the grant delay is eliminated; more precisely, back pressure replaces grant. Grant requires the wire delay $W_{\mathrm{segment}_b}$, whereas back pressure does not, because the arbiter is adjacent to the head column of the buffer. Furthermore, $W_{\mathrm{segment}_b}$ also becomes unnecessary in $T_{\mathrm{ST}}$, because data arrive at submodule C before the ST stage.

Besides the delay of each pipeline stage, the data delay $T_{\mathrm{data}}$ must be considered for the proposed router. It is formulated as follows:

$$T_{\mathrm{data}} \approx G_{\mathrm{buffer}} + \frac{W_{\mathrm{link}}}{C}. \tag{12}$$

Table 1: Routers used in four case studies.

#  Routing algorithm  # of VCs  Pipeline structure
1  DOR                2         [RC][VSA][ST][LT]
2  DOR                8         [RC][VSA][ST][LT]
3  West-first         2         [RC][RS][VSA][ST][LT]
4  Duato’s protocol   2         [RC][RS][VSA][ST][LT]

$G_{\mathrm{buffer}}$ is the critical path delay of the buffer, and $C$ is the number of cycles corresponding to the buffering duration. The critical path delay of the proposed router is the maximum $T$, including $T_{\mathrm{data}}$.
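As a rough illustration, the proposed-router model can be evaluated numerically. The sketch below reads Eq. (12) as G_buffer + W_link / C. Every number in it is a hypothetical placeholder in nanoseconds, not a measured value from the paper, and the function and key names are assumptions made for this example.

```python
def proposed_stage_delays(g, w_seg, w_link, buffer_cycles):
    """Per-stage delays following Eqs. (9)-(12); all delays share one unit."""
    return {
        "RC": g["rc"] + w_seg["a"],                     # Eq. (9)
        "VSA": g["arb"] + w_seg["b"],                   # Eq. (10)
        "ST": g["cb"] + w_seg["c"],                     # Eq. (11)
        "data": g["buffer"] + w_link / buffer_cycles,   # Eq. (12), approximate
    }


def critical_path(stage_delays):
    """Return (stage name, delay) of the slowest stage, including T_data."""
    stage = max(stage_delays, key=stage_delays.get)
    return stage, stage_delays[stage]


# Hypothetical gate and wire delays in ns (assumptions for illustration only).
gate = {"rc": 0.30, "arb": 0.50, "cb": 0.40, "buffer": 0.20}
segment = {"a": 0.25, "b": 0.10, "c": 0.15}

delays = proposed_stage_delays(gate, segment, w_link=0.6, buffer_cycles=2)
print(delays)
print("critical stage:", critical_path(delays))
```

With these placeholder values the VSA stage sets the clock period. Moving a submodule along the link changes the segment delays, which is the knob that pipeline balancing turns.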

3.4 Case study

We now return to the case studies, since our work targets general rather than specific router architectures. We select four common router architectures and show how to decentralize them. We then examine the effect of decentralization in each case and discuss the differences between these cases and the preferred router architecture. Table 1 outlines the routers used in the four case studies.

3.4.1 The simple router

The simple router corresponds exactly to the baseline router, with Dimension Order Routing (DOR) as its routing algorithm. We assume a very simple router that has only two VCs per input channel, allocated with fixed priority. Specifically, the next VC is computed by the RC logic, and packets never change VCs to pass. We also use this router for area evaluation.
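For readers unfamiliar with DOR, a minimal sketch of the route computation on a 2D mesh follows. It assumes X-before-Y ordering, (x, y) coordinates with north as +y, and port names E/W/N/S/LOCAL. These conventions are assumptions for illustration, and VC selection is not modeled.

```python
def dor_next_port(cur, dst):
    """Dimension-order (XY) routing: finish the X offset, then the Y offset."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "LOCAL"  # packet has arrived; eject to the local core


def dor_path(src, dst):
    """List the output ports taken hop by hop from src to dst."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    cur, ports = src, []
    while True:
        port = dor_next_port(cur, dst)
        if port == "LOCAL":
            return ports
        ports.append(port)
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])


print(dor_path((0, 0), (2, 1)))
```

Because the output port depends only on the current and destination coordinates, the RC logic stays small, which fits the description of this case as the simplest one.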

3.4.2 The router with many VCs

The second case adds many VCs to the simple router. This case is common, since many VCs improve the efficiency of resource allocation by allowing more packets/flits to participate in arbitration [15]. The router architecture and routing algorithm are essentially the same as before, but eight VCs are allocated with round-robin priority. Thus, the critical path of VSA becomes long. Many VCs are implemented only in this case.

3.4.3 The partially adaptive router

So far we have assumed deterministic routing; from here on we adopt adaptive routing. In this case the router architecture changes, because adaptive routing requires additional hardware such as a selection function. It is inserted between an input channel and an arbiter, as shown in Figure 5. It receives multiple requests from input channels and selects one of them. The selected request is sent to an arbiter. An additional pipeline stage called route selection (RS) is therefore implemented. Its delay is formulated as follows:

$$T_{\mathrm{RS}} = G_{\mathrm{rs}} + W_{\mathrm{segment}_d}. \tag{13}$$

$W_{\mathrm{segment}_d}$ is the segmented wire delay between submodule B and submodule D (see Figure 5b).

We employ minimal west-first routing as the partially adaptive routing algorithm. It is based on the turn model [4], which makes it deadlock-free. This algorithm first routes a packet west, and then adaptively in the other directions. We assume a round-robin mechanism as its selection algorithm: that is, the selection function selects an output channel in rotation, rather than considering network congestion. Consequently, the router architecture remains simple for an adaptive router.
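The combination of minimal west-first routing and a round-robin selection function can be sketched as follows. The coordinate convention (west is decreasing x, north is +y), the port names, and the `RoundRobinSelector` class are assumptions for illustration. The hardware selection function in the paper may differ in detail.

```python
def west_first_candidates(cur, dst):
    """Minimal west-first routing: productive output ports at this router."""
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        return ["W"]  # all westward hops come first, with no adaptivity
    candidates = []
    if dx > cx:
        candidates.append("E")
    if dy > cy:
        candidates.append("N")
    if dy < cy:
        candidates.append("S")
    return candidates or ["LOCAL"]


class RoundRobinSelector:
    """Picks among candidates in rotation, ignoring network congestion."""

    def __init__(self):
        self._turn = 0

    def select(self, candidates):
        choice = candidates[self._turn % len(candidates)]
        self._turn += 1
        return choice


selector = RoundRobinSelector()
options = west_first_candidates((0, 0), (2, 1))
print(options, [selector.select(options) for _ in range(3)])
```

Since the selector keeps only a counter and never inspects buffer occupancy, the added RS logic stays small, in line with the remark that the router remains simple for an adaptive design.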

FF-based FIFO buffer; virtual cut-through flow control

Baseline router: delay model

[Figure: timing diagram (time t) for the current node and the next node, showing the RC stage (FIFO write, route compute), VSA stage, ST stage, and LT stage]

and decentralized routers, and formulates delay models. In Section 4, we show the simulation results and then optimize the decentralized routers with balanced pipelines. Finally, Section 5 concludes the paper.

2. RELATED WORK

Several decentralized router architectures have been proposed for different purposes. To the best of our knowledge, Rotary Router (RR) [1] is the first to show signs of a decentralized architecture for on-chip networks. RR provides no crossbar switch or arbiter. Instead, it has distributed modules on two independent rings, which force packets to circulate either clockwise or anti-clockwise, traveling from port to port. It eliminates head-of-line blocking to improve performance. Its architecture and concept are very distant from ours, but this study reveals the potential of decentralized architectures.

Afterward, moving into the nanoscale era, the distributed switch architecture [11] presented the idea of reducing the negative impact of links through a decentralized architecture, which is similar to our concept. It improves the trade-off between power consumption and operating frequency; that is, it increases the maximum operating frequency while reducing the peak power consumption. A characteristic of this study is that its starting point is a modular switch, which is a specific architecture. In contrast, our starting point is the canonical router architecture, because we aim to generalize decentralized router architectures.

The latest implementation of a decentralized architecture is ElastiNoC [12]. This study introduces decentralization of routers based on Virtual Channels (VCs), which allow for traffic separation and isolation to enable deadlock avoidance and improve network performance. Moreover, it provides a scalable distributed self-testing mechanism. This mechanism enables testing sessions to be conducted in a modular manner over multiple phases and achieves high fault coverage. Applying a decentralized architecture to building fault-tolerant networks is an original idea of that work.

These preceding studies show that a decentralized architecture is beneficial, although they differ in architecture and concept. We generalize decentralized architectures using four case studies of common routers, and we follow the precedents, especially the concept of reducing the impact of wire delay. Furthermore, we propose an alternative approach: decentralized buffer design and optimization of the arrangement of modules based on balanced pipelines.

3. ARCHITECTURE AND DELAY MODELS

3.1 Baseline router

First, we describe a baseline conventional router. Figure 1 shows an overview of our baseline virtual-channel router. It consists of n input channels, an arbiter, a crossbar switch, and m output channels. If the topology is a 2D mesh, both n and m are five, for connecting to the neighboring routers and a local core. An input channel provides multiple VCs, each of which has an input buffer and Route Computation (RC) logic that computes the next route by using a routing algorithm such as dimension order routing or west-first routing. For input buffers, FF-based FIFO buffers or RAM are used. Since we target high-performance routers, we adopt FF-based FIFO buffers and virtual cut-through flow control. An arbiter allocates a pair of an output channel and a VC to each incoming packet on the basis of the state of the next router, which is sent through the output channels. A crossbar switch consists of m n-to-1 multiplexers, each of which is controlled by a select signal from the arbiter. In an output channel, there is a register that can store one flit.

[Figure 1: Overview of baseline router architecture (input channel with VCs and RC logic, arbiter, crossbar switch, output channel).]

The routing process typically consists of four steps: Route Computation (RC), Virtual channel Allocation (VA), Switch Allocation (SA), and Switch Traversal (ST). The baseline router is a conventional three-cycle router; that is, it has three pipeline stages: RC, VSA, and ST. In the VSA stage, VA and SA are performed speculatively in parallel. If VA fails, the SA result is ignored even if SA succeeds. In addition, Link Traversal (LT) is required to transfer a flit to the next router, so four cycles are required in total. Since the latency of a NoC is directly related to the pipeline depth of a router, LT stages limit performance. To make matters worse, the latency of LT stages increases as technology progresses, because it is determined by wire delay. For this reason, LT stages are an obvious drawback of the conventional router.
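The speculative VSA rule and the per-hop cycle count can be captured in a few lines. This is a simplified sketch that ignores contention, retries after a failed allocation, and serialization of body flits. The function names and the `BASELINE_STAGES` constant are illustrative assumptions.

```python
BASELINE_STAGES = ("RC", "VSA", "ST", "LT")


def vsa_stage(va_granted, sa_granted):
    """Speculative VSA: VA and SA run in the same cycle.

    The SA grant only counts when VA has also succeeded.
    """
    return va_granted and sa_granted


def zero_load_header_cycles(hops, stages=BASELINE_STAGES):
    """Cycles for a header flit to cross `hops` routers with no contention."""
    return hops * len(stages)


print(vsa_stage(va_granted=False, sa_granted=True))        # SA result is discarded
print(zero_load_header_cycles(3))                          # baseline, LT included
print(zero_load_header_cycles(3, ("RC", "VSA", "ST")))     # if LT were absorbed
```

The last call shows why removing the LT stage is attractive: the cycle count per hop drops, provided the clock period does not grow to compensate.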

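The delay model of the baseline router is formulated as follows:

$$T_{\mathrm{RC}} = \max(G_{\mathrm{fifowr}}, G_{\mathrm{rc}}), \tag{1}$$
$$T_{\mathrm{VSA}} = G_{\mathrm{arb}}, \tag{2}$$
$$T_{\mathrm{ST}} = G_{\mathrm{fiford}} + G_{\mathrm{cb}}, \tag{3}$$
$$T_{\mathrm{LT}} = W_{\mathrm{link}}, \tag{4}$$

where $T$, $G$, and $W$ are the delay of each pipeline stage, the gate delay of each processing step, and the wire delay of the link between routers, respectively. The critical path delay of the router, $T_{\mathrm{router}}$, is determined by the maximum $T$ as follows:

$$T_{\mathrm{router}} = \max(T_{\mathrm{RC}}, T_{\mathrm{VSA}}, T_{\mathrm{ST}}, T_{\mathrm{LT}}). \tag{5}$$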

Naturally, the best performance is achieved by balancing the pipeline stages. We present the measured values and balance the pipeline stages in Section 4.
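Equations (1) to (5) translate directly into a small calculation. The sketch below uses hypothetical delays in nanoseconds, chosen only to show an unbalanced pipeline in which the link dominates. They are not the measured values reported in Section 4, and the function names are assumptions.

```python
def baseline_stage_delays(g_fifo_wr, g_rc, g_arb, g_fifo_rd, g_cb, w_link):
    """Per-stage delays of the baseline router, Eqs. (1)-(4)."""
    return {
        "RC": max(g_fifo_wr, g_rc),   # Eq. (1): FIFO write runs beside RC
        "VSA": g_arb,                 # Eq. (2)
        "ST": g_fifo_rd + g_cb,       # Eq. (3): FIFO read, then crossbar
        "LT": w_link,                 # Eq. (4): pure wire delay
    }


def router_critical_path(stages):
    """Eq. (5): the slowest stage sets the clock period."""
    return max(stages.values())


def stage_slack(stages):
    """Idle time per stage relative to the critical path (0 for the slowest)."""
    t_router = router_critical_path(stages)
    return {name: t_router - delay for name, delay in stages.items()}


# Hypothetical delays in ns (assumptions for illustration only).
stages = baseline_stage_delays(
    g_fifo_wr=0.2, g_rc=0.3, g_arb=0.5, g_fifo_rd=0.2, g_cb=0.4, w_link=0.9
)
print(stages)
print("T_router =", router_critical_path(stages))
print("slack    =", stage_slack(stages))
```

Large slack in RC, VSA, and ST next to a dominant LT stage is the imbalance that balanced pipelines are meant to remove.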

3.2 Naive decentralized router

We decentralize the baseline router to reduce the negative impact of LT stages. We divide the router function into smaller functions and design a submodule for each. Since router pipelining processes each step in one clock cycle at a dedicated stage, each function corresponds to one pipeline stage. The basic idea is to segment the function of a router into several modules in this manner and intersperse them along a link. A link between routers is thereby divided into shorter links, and LT stages are segmented. The segmented wire delay is distributed to the remaining pipeline stages: RC, VSA, and ST. Figure 2 shows the naively decentralized router. The decentralized router consists of three submodules, the functions of which are explained below.

[Figure legend: gate delay (G) and wire delay (W); stage labels: arbitration, FIFO read, crossbar]

















Unbalanced pipelines & increasing LT delay are serious issues.

Solution: distributing LT stages



[Figure: timing diagram over time t at the current node and the next node.]

How? By decentralizing routers. The abstract delay model is as follows.

[Figure: abstract delay model over time t at the current node and the next node, showing the RC, VSA, and ST stages of the submodules; labelled delays: arbitration, FIFO read, crossbar.]

LT can disappear!!

Naïve design

[Figure: naïve decentralized router placed along the link between two routers (R). Labels: submodule A, submodule B, submodule C, (submodule C of previous router), RC logic, arbiter, crossbar switch, VC, state machine, output channel, request / grant.]

Problems with naïve design

1. Buffers are required in each submodule. Buffers consume a large amount of power and delay!

2. Round-trip time in VSA stage is an obvious bottleneck.

[Figure: timing diagram over time t of the round trip between submodule B and submodule C in the VSA stage. Labels: FIFO write, request, arbitration, grant.]

New buffer design using D-latches

[Figure: buffer built from D-latches arranged in columns 1 to 5. Labels: data (1 flit) from the previous router, backpressure (1 bit) from the arbiter; annotation: 2 cycles.]

Proposed Router: architecture

[Figure: proposed router placed along the link between two routers (R). Labels: submodule A, submodule B, submodule C, RC logic, arbiter, crossbar switch, VC (1 flit), output channel; data path; control path carrying request and backpressure.]

Grant signals are replaced by backpressure without wire delays.

Buffers are separated from the control path.

Proposed Router: delay model

[Figure: timing diagram over time t for submodule A, submodule B, submodule C, and the next submodule A, covering the RC, VSA, and ST stages (route compute, arbitration, crossbar). Legend: gate delay (G), wire delay (W), buffer delay.]



Simulation methodology

[Figure: evaluation flow. Stages: RTL source, Synopsys DC synthesis, Synopsys ICC place & route, timing reports, delay models, performance estimates. Std. cell library: STMicroelectronics FD-SOI 28-nm. VDD: 1.0 V; body bias: 0 V.]

Case studies

① The simple router
Ø  DOR (XY routing)
Ø  2 VCs / channel & fixed allocation

② The router with many VCs
Ø  DOR (XY routing)
Ø  8 VCs / channel & round-robin allocation

③ The partially adaptive router
Ø  Minimal west-first routing
Ø  2 VCs / channel & fixed allocation

④ The fully adaptive router
Ø  Duato’s protocol based on west-first routing
Ø  2 VCs / channel & fixed allocation

Architectural modification for adaptive routers

Figure 4: Overview of proposed router architecture. [Labels: submodule A, RC logic, arbiter, crossbar switch, VC (1 flit), output channel; segments a, b, and c; data path; control path carrying request and backpressure.]

Figure 5: Insertion of selection functions when adaptive routing is used. (a) Outline drawing of selection functions. [Labels: input channel, multiple request, selection function, selected request, arbiter, grant.] (b) Selection functions are implemented in submodule D. [Labels: submodule A, submodule D, submodule B, submodule C, multiple request, selected request, request; "Selection function is inserted here".]

3.4.4 The fully adaptive router

Next, we adopt Duato’s protocol [3] as fully adaptive routing. It has an escape path for deadlock avoidance, for which we use west-first routing. Since it is fully adaptive, it outperforms west-first routing. When fully adaptive routing is used, however, output channels must send the state of the next router to the selection functions to avoid deadlock. Therefore, the delay of RS stages changes as follows:

\[ T_{\mathrm{RS}} = G_{\mathrm{rs}} + 2W_{\mathrm{segment}_d} + W_{\mathrm{segment}_b}. \tag{14} \]

This requires extra segmented wire delay. The effect is evaluated in Section 4.

4. RESULTS AND DISCUSSION

4.1 Simulation methodology

We design the proposed router models in STMicroelectronics 28-nm FD-SOI process technology to evaluate wire delay, gate delay, and area. They are synthesized with Synopsys Design Compiler and placed and routed with Synopsys IC Compiler. A packet consists of 5 flits, and a flit is 16 bits wide.

For the performance evaluation, the baseline and proposed routers are compared on the basis of the delay models, gate delay, and wire delay. As described in Section 3, there are no functional differences between the baseline and proposed routers. Therefore, if the pipeline depths are equal, performance is determined mainly by the critical path delay rather than by the number of cycles.

Table 2: Simulation results of gate delay. Each symbol corresponds to that of the delay models.

Symbol      Variation                     Delay [ns]
G_rc        DOR                           0.27
            West-first                    0.27
            Duato’s protocol              0.27
G_fifowr    –                             0.34
G_rs        West-first                    0.38
            Duato’s protocol              0.70
G_arb       fixed VA                      0.92
            round-robin VA with 8 VCs     1.45
G_fiford    –                             0.18
G_cb        –                             0.44
G_buffer    –                             0.21

In addition, the reduction of LT stages can be taken into account by cycle-accurate simulation.

We implement each router function separately to analyze gate delay on the basis of the delay models described above. After place-and-route, a static timing analysis is carried out, and we measure the actual wire delay reported by IC Compiler. To measure wire delay, we implement a special design in which two small macros, each containing an inverter, are placed an arbitrary distance apart. Wire delay is measured by sending data from one macro to the other. Since the inverters are small, the delay within a macro is negligible.

4.2 Critical path delay

Table 2 shows the gate delay results based on the delay models. G_rc is constant regardless of the routing algorithm, because the RC logic only compares the current node with the destination node. Instead, the difference between routing algorithms appears in the RS stages. G_arb is the longest, and the use of many VCs prolongs it further. G_fifowr, G_fiford, G_cb, and G_buffer are constant in all cases. On the basis of these data, the critical path delay is evaluated.

Figure 6 shows the wire delay evaluated by IC Compiler. With loose timing constraints, IC Compiler automatically inserts few repeaters. The Manhattan distance is defined as the length of a link when routers are regarded as points. Since a tile measures 1 mm by 1 mm in our design, the Manhattan distance corresponds to the real length in millimeters. The figure shows that repeated wire delay increases almost linearly. Consequently, we can evaluate segmented wire delay by linear approximation and arrange each submodule optimally. Table 3 shows the wire delay of each example topology. The maximum Manhattan distance is given in general form, together with the case of a 4 × 4 network.


Route Select (RS) stages are added.

Delays of RS stages

[Figure: RS stage delays on segment b and segment d, shown for partially adaptive routing and for fully adaptive routing; in the fully adaptive case, the state of the next router is sent back in addition to the request.]

Gate delay results


Wire delay results

[Plot: repeated wire delay [ns] (axis 0 to 2) against Manhattan distance (1 to 3).]

Figure 6: Repeated wire delay vs. Manhattan distance.

Table 3: Wire delay of each example topology.

Example topology    Max. Manhattan distance    Max. Manhattan distance    Wire delay
                    (size: 2n × 2n)            (size: 4 × 4)              (size: 4 × 4)
2D-Mesh             1                          1                          0.628 ns
Folded Torus        2                          2                          1.140 ns
Flat. Butterfly     2n − 1                     3                          1.667 ns
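The linear approximation mentioned above can be illustrated with an ordinary least-squares fit through the three wire delays of Table 3. This is only a sketch: the helper names are hypothetical, and fitting exactly these three points is our assumption about how such an approximation could be obtained, not a description of the authors' tooling.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (Manhattan distance in mm, repeated wire delay in ns) from Table 3.
TABLE3 = [(1, 0.628), (2, 1.140), (3, 1.667)]
slope, intercept = fit_line(TABLE3)


def wire_delay_ns(distance_mm):
    """Estimate the repeated wire delay of a link of the given length."""
    return slope * distance_mm + intercept


print(f"slope = {slope:.4f} ns/mm, intercept = {intercept:.4f} ns")
for distance, measured in TABLE3:
    print(f"{distance} mm: measured {measured:.3f} ns, "
          f"fitted {wire_delay_ns(distance):.3f} ns")
```

The fitted line stays within a few picoseconds of each measured point, which is consistent with the near-linear trend of Figure 6.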

We now have all the numerical data needed to discuss balanced pipelines and evaluate performance. In the following, we do so for each case.

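4.2.1 Case 1: The simple router

The delay of the baseline router is as follows:

\[ T_{\mathrm{RC}} = \max(G_{\mathrm{fifowr}}, G_{\mathrm{rc}}) = 0.34\ \text{ns}, \tag{15} \]
\[ T_{\mathrm{VSA}} = G_{\mathrm{arb}} = 0.92\ \text{ns}, \tag{16} \]
\[ T_{\mathrm{ST}} = G_{\mathrm{fiford}} + G_{\mathrm{cb}} = 0.62\ \text{ns}, \tag{17} \]
\[ T_{\mathrm{LT}} = \begin{cases} 0.63\ \text{ns} & (M = 1) \\ 1.14\ \text{ns} & (M = 2) \\ 1.67\ \text{ns} & (M = 3). \end{cases} \tag{18} \]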

M is the maximum Manhattan distance of the topology. The gate delays are not balanced, and VSA is an apparent bottleneck. The wire delay becomes the critical path once M reaches two.

Decentralization changes the delay as follows:

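\[ T_{\mathrm{RC}} = 0.27\ \text{ns} + W_{\mathrm{segment}_a}, \tag{19} \]
\[ T_{\mathrm{VSA}} = 0.92\ \text{ns} + W_{\mathrm{segment}_b}, \tag{20} \]
\[ T_{\mathrm{ST}} = 0.44\ \text{ns} + W_{\mathrm{segment}_c}, \tag{21} \]
\[ T_{\mathrm{data}} \approx 0.21\ \text{ns} + \frac{W_{\mathrm{link}}}{2} = \begin{cases} 0.52\ \text{ns} & (M = 1) \\ 0.78\ \text{ns} & (M = 2) \\ 1.04\ \text{ns} & (M = 3). \end{cases} \tag{22} \]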

W_segmenta, W_segmentb, and W_segmentc appear in T_RC, T_VSA, and T_ST, respectively. Thus no segmented delay appears in more than one pipeline stage. We can therefore optimize the segmented wire delays to balance the pipeline on the basis of linear approximation. Specifically, the following procedure balances the pipeline. Note that, in contrast to gate sizing and pipeline refactoring, this method affects only the wire delay, not the gate delay.

• STEP1: Initially set every segmented wire delay to 0.

• STEP2: Extend W_segmenta until T_RC becomes equal to T_ST. If W_segmenta reaches W_link in the interval, W_segmenta = W_link and the remaining segmented delays become 0. The procedure is then complete. Otherwise go to the next step.

• STEP3: Extend W_segmentc in the same way. If W_segmentc + W_segmenta reaches W_link in the interval, the procedure ends at that point. Otherwise go to the next step.

• STEP4: Allocate the remaining wire delay, i.e., W_link − (W_segmenta + W_segmentc), to each segmented wire delay equally. As a result, all segmented delays become equal.
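The procedure can be sketched in a few lines of Python. This is a simplified reading, not the authors' implementation: because the critical path is the maximum stage delay, the sketch raises T_RC and then T_ST directly up to the bottleneck T_VSA rather than first equalizing T_RC with T_ST, which should yield the same critical path. The function names are hypothetical, and the gate and wire delays are taken from Tables 2 and 3.

```python
def balance_segments(g_rc, g_arb, g_cb, w_link):
    """Split w_link into (w_a, w_b, w_c); all delays in ns.

    Assumption: stages are filled greedily up to the VSA gate delay,
    and any wire delay left over is shared equally (STEP4).
    """
    target = g_arb                                    # T_VSA while w_b = 0
    w_a = min(w_link, max(0.0, target - g_rc))        # extend segment a
    w_c = min(w_link - w_a, max(0.0, target - g_cb))  # then segment c
    share = (w_link - w_a - w_c) / 3.0                # STEP4: equal split
    return w_a + share, share, w_c + share


def stage_delays(g_rc, g_arb, g_cb, w_link):
    """Return (T_RC, T_VSA, T_ST) after balancing, following Eqs. (19)-(21)."""
    w_a, w_b, w_c = balance_segments(g_rc, g_arb, g_cb, w_link)
    return g_rc + w_a, g_arb + w_b, g_cb + w_c


# Case 1 (simple router): G_rc = 0.27, G_arb = 0.92, G_cb = 0.44 ns.
for m, w_link in {1: 0.628, 2: 1.140, 3: 1.667}.items():
    t_rc, t_vsa, t_st = stage_delays(0.27, 0.92, 0.44, w_link)
    print(f"M={m}: T_RC={t_rc:.3f}  T_VSA={t_vsa:.3f}  T_ST={t_st:.3f} ns")
```

For M = 2 and M = 3 all three stages come out equal, at roughly 0.923 ns and 1.10 ns, while for M = 1 the whole link delay fits into segment a and T_VSA stays at 0.92 ns.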

If the topology is a 2D-mesh (M = 1), the procedure ends at Step 1. Consequently, the critical path remains VSA and stays constant, but LT stages vanish even in this case. Meanwhile, if the topology is a Folded Torus (M = 2), the optimized delay becomes the following.

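\[ T_{\mathrm{RC}} \approx 0.27\ \text{ns} + 0.653\ \text{ns} = 0.923\ \text{ns}, \tag{23} \]
\[ T_{\mathrm{VSA}} \approx 0.92\ \text{ns} + 0.003\ \text{ns} = 0.923\ \text{ns}, \tag{24} \]
\[ T_{\mathrm{ST}} \approx 0.44\ \text{ns} + 0.483\ \text{ns} = 0.923\ \text{ns}. \tag{25} \]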

In this case, the critical path is improved by 19% compared with the baseline router (from 1.14 ns to 0.923 ns). The submodules are placed along the link at intervals proportional to the segmented wire delays.

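In the case of the Flattened Butterfly (M = 3), the optimized delay becomes as follows.

\[ T_{\mathrm{RC}} \approx 0.27\ \text{ns} + 0.83\ \text{ns} = 1.10\ \text{ns}, \tag{26} \]
\[ T_{\mathrm{VSA}} \approx 0.92\ \text{ns} + 0.18\ \text{ns} = 1.10\ \text{ns}, \tag{27} \]
\[ T_{\mathrm{ST}} \approx 0.44\ \text{ns} + 0.66\ \text{ns} = 1.10\ \text{ns}. \tag{28} \]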

The critical path is improved by 34% compared with the baseline router; we refer to this rate as the improvement rate. The above shows that the decentralized router improves performance more as the maximum Manhattan distance increases. Once Step 4 is reached, the critical path delay grows at only one third of the rate of the baseline router. In that sense, our proposal broadens the applicability of low-latency topologies.
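The text gives no explicit formula for the improvement rate. The sketch below assumes it is the relative reduction of the critical path delay with respect to the baseline; the function name is hypothetical. Under that assumption, the Flattened Butterfly values of this case (1.667 ns for the baseline link, 1.10 ns after balancing) give the quoted 34%.

```python
def improvement_rate(t_baseline_ns, t_proposed_ns):
    """Relative reduction of the critical path delay, in percent (assumed definition)."""
    return 100.0 * (t_baseline_ns - t_proposed_ns) / t_baseline_ns


# Case 1, Flattened Butterfly (M = 3): baseline T_LT vs. balanced stage delay.
rate = improvement_rate(1.667, 1.10)
print(f"improvement rate: {rate:.1f}%")
```

Because the clock period is set by the critical path, the same ratio also describes the attainable reduction in cycle time when the pipeline depths are equal.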

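4.2.2 Case 2: The router with many VCs

The use of many VCs drastically lengthens T_VSA as follows:

\[ T_{\mathrm{VSA}} = G_{\mathrm{arb}} = 1.45\ \text{ns}. \tag{29} \]

The delay of the decentralized router is summarized as follows:

\[ T_{\mathrm{RC}} \approx \begin{cases} 0.27\ \text{ns} + 0.63\ \text{ns} = 0.90\ \text{ns} & (M = 1) \\ 0.27\ \text{ns} + 1.14\ \text{ns} = 1.41\ \text{ns} & (M = 2) \\ 0.27\ \text{ns} + 1.18\ \text{ns} = 1.45\ \text{ns} & (M = 3), \end{cases} \tag{30} \]
\[ T_{\mathrm{VSA}} \approx 1.45\ \text{ns} + 0.00\ \text{ns} = 1.45\ \text{ns} \quad (M \le 3), \tag{31} \]
\[ T_{\mathrm{ST}} \approx \begin{cases} 0.44\ \text{ns} + 0.00\ \text{ns} = 0.44\ \text{ns} & (M \le 2) \\ 0.44\ \text{ns} + 0.49\ \text{ns} = 0.93\ \text{ns} & (M = 3). \end{cases} \tag{32} \]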

In this case T_VSA is so large that the critical paths of the baseline and proposed routers are the same when M is less than three. When M is three, the improvement rate becomes 13%. This shows that routers with many VCs are insulated from the influence of link delay.

We assume that a Manhattan distance of 1 corresponds to 1 mm.

[Plot: repeated wire delay [ns] (axis 0 to 2) against Manhattan distance (1 to 3) between two routers (R).]

Manhattan distance vs. improvement rates

45% Deep pipeline

Best-case scenarios

Unbalanced pipeline delays

No signals sent backward (they are required in the case of fully adaptive routing)

[Plot: improvement rate [%] (axis 0 to 50) against Manhattan distance (1 to 3). Series: simple, many VCs, partially adaptive, fully adaptive.]




Conclusions

q  Decentralized routers eliminate LT stages and improve the critical path by up to 45% in 28-nm process technology.

q  As technology advances, decentralized routers become more and more beneficial.

q  Our study suggests that decentralized high-speed routers with deep pipelines and low-latency topologies are efficient solutions in the nano-scale era.

Future work

q  We plan to propose new topologies exploiting decentralized routers.

q  More detailed evaluation including the energy consumption of links (another interconnect bottleneck) will be performed.
