A NOC closed-loop performance monitor and adapter



Microprocessors and Microsystems 37 (2013) 661–671

Contents lists available at ScienceDirect

Microprocessors and Microsystems

journal homepage: www.elsevier.com/locate/micpro

A NOC closed-loop performance monitor and adapter ☆

Debora Matos a, Caroline Concatto a,⇑, Anelise Kologeski a, Luigi Carro a, Marcio Kreutz b, Fernanda Kastensmidt a, Altamiro Susin a

a Informatics Institute, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
b Federal University of Rio Grande do Norte, Natal, Brazil

Article info

Article history: Available online 20 May 2011

Keywords: Network-on-chip; Buffer; Depth; Router; Latency; Throughput; Adaptability; Power consumption; Fault tolerance

Abstract

In a NoC, the number of buffers allocated to each communication channel has a significant impact on performance and power consumption. Moreover, since the application communication pattern will change, or a new application may be loaded in a SoC, a design based on the worst-case scenario will probably either oversize the buffers, with obvious power implications, or compromise performance, since not enough buffers will be available. A runtime mechanism is required to automatically adapt the buffer size as a function of the communication pattern. This paper proposes a control mechanism to resize the buffers of an adaptive router. The runtime mechanism is able to monitor the traffic behavior and to control, for each channel, the required buffer size of the adaptive router. Besides, as the complexity of designs increases and technologies scale down, devices are subject to new types of malfunctions and failures. Network-on-chip routers are responsible for ensuring the proper communication of on-chip cores, and the buffers present in the router channels are crucial to ensure the communication performance. Therefore, a technique to isolate faulty buffers is also presented. Experimental results using the proposed architecture have shown that, in the absence of faults, the latency has been decreased by 80%, and throughput has been increased by 45%, in the worst case. In the presence of faults, the proposed architecture was able to sustain the same performance as the equivalent homogeneous router, but with up to 25% power savings.

© 2011 Elsevier B.V. All rights reserved.

0141-9331/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2011.05.001

☆ The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 249059. The information presented is provided as is and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability. The opinions expressed in the document are of the authors only and in no way reflect the European Commission’s opinions. Moreover the research also received support by CNPq and CAPES.

⇑ Corresponding author at: UFRGS - Av. Bento Gonçalves, 9500 - Campus do Vale - Bloco IV, Agronomia - Porto Alegre - RS - Brasil. Tel.: +55 51 3308-6165; fax: +55 51 3308-7308.

E-mail addresses: [email protected] (D. Matos), [email protected] (C. Concatto), [email protected] (A. Kologeski), [email protected] (L. Carro), [email protected] (M. Kreutz), [email protected] (F. Kastensmidt), susin@inf.ufrgs.br (A. Susin).

1. Introduction

The communication among cores of an MPSoC having reusable interconnections is currently being provided by networks-on-chip (NoC) [1]. A NoC is a general purpose on-chip interconnection network that offers some strategies to mitigate the ever increasing communication complexity of modern SoCs. Current NoCs are usually static, in the sense that their performance and power are defined and balanced at design time [1–3].

Modern MPSoCs present a high increase in complexity, since they must efficiently handle some situations not foreseen at design time. In such cases, the communication infrastructure needs to adapt to the requirements of the MPSoC. One solution for handling situations not foreseen at design time is to design for the worst-case scenario. However, this solution will certainly present excessive power dissipation for the mean case [4]. Another solution is to have a NoC that adapts itself at run-time, as proposed in [6–8].

One of the first ideas to adapt router resources was presented in [7,8], where virtual channels are dynamically created and allocated to store the flits coming from the output channels. In [6] a NoC can change its buffer size at run-time, according to the system needs, showing a better power-performance product. In any case, none of the previous works ([6–8]) presents an automatic mechanism to allocate the buffer slots for the channels. Actually, the absence of an entity responsible for giving the right amount of buffer slots to each channel is a problem in [7,8]. Packets larger than the size of the buffer may block the creation of other virtual channels, since buffer slots are dynamically allocated until another header appears, or until there are no more free buffer slots available.

Our first contribution is to cover exactly this gap. We propose a set of sensors and an automatic control mechanism to dynamically compute the buffer depth for each input channel of the router, according to the network traffic. Besides, this work uses a control system that verifies when the network can update its buffer depth.

As technology scales and transistor feature sizes shrink, circuits become prone to faults, so there is a probability of a device having faults from its manufacturing process [9,10]. The routers in the NoC are responsible for ensuring the proper communication of on-chip cores, and the buffers present in the router channels are crucial to ensure the communication performance. Hence, the ability of the network to function in the presence of faults (reliability and fault tolerance) becomes an important issue. It is thus very important to provide fault tolerance in the buffers, and hence our second contribution is to propose a low-cost technique to isolate the faulty buffer slot.

The proposed fault-tolerant architecture takes advantage of the fact that one can change the size of each buffer in accordance with the application needs. When a fault occurs in a buffer slot, instead of disabling the entire input channel, one isolates the faulty slot. To sustain performance, the input channel can borrow buffer slots from the neighbors according to the monitoring architecture.

Our proposed strategy can be applied to any NoC buffering architecture to provide adaptability in the buffer resizing. In the absence of faults this work demonstrates that it is possible to increase the average throughput by around 50%, and to reduce the average latency by around 80%, when compared to the homogeneous router using real applications as examples. In the presence of faults, the average latency increased by less than 10%, showing that one can obtain fault tolerance with minimum performance and power impact, thanks to the adaptivity mechanism.

The rest of the paper is organized as follows. After presenting the related works in Section 2, in Section 3 we present the architecture of the adaptive router, its homogeneous counterpart and the fault-tolerant solution. Section 4 shows the architecture of the control system that is the main proposal of this paper. The performance and synthesis results are shown in Section 5, and we conclude our paper in Section 6.

2. Related works

Already in the nineties, Tamir and Frazier [5] focused on internal buffer organization for multiprocessor and multicomputer systems. The DAMQ (Dynamically Allocated Multi-Queue) buffer is evaluated against the SAMQ (Statically Allocated Multi-Queue) and SAFC (Statically Allocated Fully Connected) buffer organizations. At that time, the higher performance of the DAMQ compared with the static buffer organizations was attributed to a more efficient use of storage resources. SAMQ and SAFC switches statically allocate buffer slots to each port. The DAMQ buffer can allocate buffer slots dynamically in each channel, wherefore DAMQ performs better than any static switch with the same amount of storage at any traffic rate. Around 10 years later, the study of buffer organization in switches restarted, this time in intrachip fabrics like NoCs. This section presents some related works that have as their baseline the dynamic or static buffer organization in routers for NoCs.

NoC designs typically target a specific application or a limited class of applications. Thus, the NoC architecture is customized at design time for each specific application to achieve the best energy, performance, and trade-off costs [3]. However, as mentioned previously, there are reasons to have a dynamically reconfigurable network to sustain general-purpose computations. In the literature some works present solutions that go in this direction. Many works demonstrate that there is a variable communication rate among different cores during the execution of different applications, and even in a single program there are several communication phases, hence the need to use adaptability at run-time.

In [3] one can find a buffer allocation strategy at the system level for each specific application, that is, given a traffic characteristic of a target application and the total budget of the available buffering space, an algorithm optimizes the allocation of buffering resources across different channels. The buffer distribution algorithm is based on the architecture parameters (routing algorithm, delay parameters and others) and the application parameters (probability of the packet being delivered to the destination and the packet injection rate). These characteristics are modeled in C++, and the algorithm gives a certain number of buffer slots to each channel. Buffer sizing, however, is static and is done at design time for each target application. If the communication behavior changes, the system will probably not deliver the required performance, or the resources will be under-utilized.

In [7,18] one can find proposals for adaptive buffer allocation with virtual channels. ViChaR [7] dynamically allocates virtual channels and buffer slots according to network traffic conditions. Each input channel manages its virtual channels according to the number of header flits that arrive in the input channel. For each new packet that reaches the input channel, a new virtual channel is allocated. The buffer slots are allocated as the flits flow, that is, for each new flit a new slot of the buffer is allocated. For packet sizes smaller than the size of the buffer there are no problems; however, for packet sizes bigger than the buffer size, the strategy proposed in [7] can lose the ability to have more than one virtual channel. Bigger packets may block the channel by consuming the entire set of buffers, without leaving space for other packets. The strategy in [18] combines ViChaR with the loan of buffer slots from other channels in the router. The router architecture enables both dynamic virtual channel allocation and sharing of buffer slots among input channels. This way, when a flit arrives in the channel, it is stored in the buffer of this channel, but when there is no space in this channel, the flit is stored in the buffer of another channel. In [18] there is no monitor mechanism to compute the buffer size for each channel; buffer slots are given according to flit arrival and buffer space in the router. As a consequence, Neishaburi and Zilic [18] can only reduce the latency by 7.1% with their router architecture. In the present proposal we have an automatic mechanism to compute the optimal buffer size, and this process guarantees a significant performance gain.

An adaptive architecture with runtime observability has been proposed in [8] to avoid faults in NoCs, providing adaptability at system level and at architecture level. At system level the architecture can re-map the system tasks, and at architecture level it can re-route the packets and re-allocate the virtual channel buffers (VCB). The changes at architecture level are based on the occurrence of faults, and these events occur when the packets do not reach the destination or when the VCB is full. The presence of faults triggers the need for NoC adaptation at architecture level, and the necessary steps to reconfigure the NoC for the new infrastructure are invoked. The adaptive process only occurs in the presence of a fault, and hence no performance or power advantage can be obtained during the normal operation of the system. It would be interesting to have an architecture that observes and adapts not only in the presence of faults, but rather whenever the communication pattern changes within the network, either for performance or fault tolerance reasons.

The strategy in [11] proposes a NoC architecture with bidirectional channels that can be self-reconfigured at run-time. BiNoC allows each communication channel to be dynamically self-configured to transmit flits in either direction in order to increase the performance. A finite state machine implements an inter-router transmission channel control block scheme for each channel, and it makes sure that only one direction is valid on each bidirectional channel at any time. If the requested channel is available, this means that the corresponding buffers at the neighboring router have enough storage space. The allocation of the channel is based only on the request. Increasing the channel width is insufficient to increase the performance, since larger links require a larger buffer. In such cases, as the BiNoC has input buffering and allows one to use the output links as input links, it needs buffering in both links, and then the power consumption problem is severely aggravated.

The related works discussed in this section do not take into account, at the same time, the traffic behavior and the policy to distribute the resources. Whenever there is a policy, it is fixed during the design-time analysis step. In [7,8,11,18] the resources are given without a policy or analysis of the traffic behavior. In [7,11] the rule is based on which channel first asks for resources. In [8,18] the policy to distribute the resources is based on the fault occurrence. In this case, these related works do not consider the traffic behavior, as explained previously. Our proposal covers this gap, presenting a control that adapts the buffer slots based on a policy to distribute resources at run-time according to the need of each channel in the router.

3. Proposed router architecture

3.1. Homogeneous router architecture

The proposed router architecture has been embedded in the SoCIN NoC [12]. SoCIN has a regular 2D-torus topology and a parametric router architecture. The router architecture uses a routing switch with up to five bi-directional ports (Local, North, South, West and East), each port with two unidirectional channels, and each router is connected to four neighboring routers (North, South, West and East). This router is a VHDL soft-core. It uses the wormhole switching approach and a deterministic source-based routing algorithm. The wormhole strategy breaks a packet into multiple flow control units called flits, which are sized as an integer multiple of the channel width. The first flit is a header with the destination address, followed by a set of payload flits and a tail flit. Two bits of each flit are used to indicate its kind (header, payload or tail).
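As an illustration of the flit framing described above, the following Python sketch splits a packet into a header flit, payload flits and a tail flit, and packs the two type bits above the data bits. The concrete bit encoding (`FlitType`), the 32-bit data width and the function names are hypothetical choices made for this sketch; the text only states that two bits per flit identify its kind.

```python
from enum import IntEnum


class FlitType(IntEnum):
    # Hypothetical 2-bit encoding; the text only says two bits mark the flit kind.
    PAYLOAD = 0b00
    HEADER = 0b01
    TAIL = 0b10


def packetize(dest, payload_words, data_bits=32):
    """Split a packet into (type, data) flits: header, payload flits, tail."""
    if not payload_words:
        raise ValueError("need at least one word so the packet has a tail flit")
    mask = (1 << data_bits) - 1
    flits = [(FlitType.HEADER, dest & mask)]      # header carries the destination
    for word in payload_words[:-1]:
        flits.append((FlitType.PAYLOAD, word & mask))
    # Assumption: the last data word travels in the tail flit.
    flits.append((FlitType.TAIL, payload_words[-1] & mask))
    return flits


def encode(flit, data_bits=32):
    """Pack one flit into an integer: two type bits above the data bits."""
    kind, data = flit
    return (int(kind) << data_bits) | data


flits = packetize(dest=5, payload_words=[10, 11, 12])
words = [encode(f) for f in flits]
```

With wormhole switching, only the header flit needs to be routed; the remaining flits follow the path it reserved until the tail flit releases it.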

There is a round-robin arbiter at each output channel. The routing algorithm used is XY routing, which supports deadlock-free data transmission. The flow control is based on a handshake protocol, which is connected to the buffer protocol, as shown in Fig. 1d. In Fig. 1d, val is connected to rok of the buffer. The rok signal indicates when the buffer has data to be read. In the destination buffer, val is connected to wr, which controls the write process in the buffer. The wok signal, together with ack in the destination buffer, informs the source when its buffer is full. Signals wok and ack control the read process in the source by controlling rd. When the source sends data to the destination, it activates the related valid signal (val). When the receiver is ready to consume the validated data, it activates the corresponding acknowledge (ack) signal. Buffering is present only at the input channels and is implemented as a FIFO. Each flit is stored in a buffer slot and all channels have the same buffer depth, which is defined at design time.

Fig. 1. Input buffer: (a) homogeneous; (b) adaptive; (c) fault-tolerant adaptive; and (d) flow control based on handshake protocol. (Signal labels in the figure: din, dout, din_S, din_E, din_W, d_E_S, d_W_S, d_S_E, d_S_W, control East/West, faulty_buffer.)

3.2. Adaptive router architecture (AR)

In a previous work we developed an adaptive router where the buffer slots can be redistributed among the channels [6]. In this context, we verified that a larger buffer size in a channel guarantees a larger throughput and a smaller latency in the network, since fewer flits will stall in the network.

The adaptive router architecture gives each channel the ability to lend part or the whole of its buffer in accordance with the requirements of the neighboring channels. To reduce connection costs, each channel may only use the available buffer slots of its right and left neighbor channels. In such cases, each channel may have up to three times the buffer size originally planned at design time.

Fig. 1a and b shows the homogeneous and the proposed adaptive input buffer, respectively. Comparing the two architectures, the adaptive router (Fig. 1b) uses more multiplexers to allow the reconfiguration. Fig. 1b presents the South Channel as an example. In this architecture it is possible to dynamically configure different buffer depths for the channels. In Fig. 1b, the multiplexers labeled t to v control the read of buffer slots, and they grow with the buffer depth. The multiplexer r controls the input, that is, the data that will be stored in the buffers. The multiplexer s controls the output of data in the channel, selecting which data will be sent to the crossbar. These multiplexers (r and s) have a fixed size. Multiplexers u and v are responsible for reading the buffer data requested by the neighbors.


On each channel some signals must be sent to the neighboring channels in order to control its stored flits. The information about how many units of the buffer are used by each channel is set by the Buffer Depth Controller (this information is received through an input pin of the router). Each channel can receive three data inputs. Let us consider the South Channel as an example, having the following inputs: its own input (din_S), the right neighbor input (din_E) and the left neighbor input (din_W).

The stored-flits control uses counters to know how many buffer slots are available in the buffer queue. The control has the information of how many flits are stored in its own buffer and in the neighboring channel buffers. These controllers are registers. The channels need six counters to manage the read and write processes of the left and right neighbors (one to read and one to write for each neighbor). The counters have the same number of bits as the buffer depth.

When the South Channel has a flit stored in the East Channel and this flit must be sent to the output, it is passed from the East Channel to the South Channel (d_E_S), and then it is sent directly to the output of the South Channel (dout_S) by a multiplexer. The South Channel has the following outputs: its own output (dout_S) and two more outputs (d_S_E and d_S_W) to send the flits stored in its channel but belonging to neighbor channels.

Each channel can have flits stored in its own buffer or in the neighbor channel buffers. Fig. 2 shows an example of the buffer reconfiguration to sustain QoS. First, a buffer depth is defined for all channels at design time. In this work we defined the buffer size as 4, and all input channels receive the same buffer depth, as illustrated in Fig. 2a. Afterwards, the traffic in each channel is verified and a control string defines the buffer depth needed in each channel, as shown in Fig. 2b. With the adaptive router, the distribution of the buffer words among the neighbor channels is carried out as shown in Fig. 2c.

In this router architecture each channel knows how many of its own buffers are being used in the channel, and how many are being borrowed from the neighbors. Each channel controls its stored flits, and these flits are stored in its own channel buffer or in the neighboring channel buffers. In this design we are not considering the possibility of the Local Channel using neighbor buffers; only the South, North, West and East Channels of a router can make use of their adjacent neighbors.
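The lending of buffer slots between adjacent channels can be sketched as follows. This is a minimal Python illustration, not the router's actual allocation protocol: the ring order of the channels, the "right neighbor first" preference and the example depths are assumptions made here. The only facts taken from the text are that the design-time depth is 4, that the Local Channel does not borrow, that each channel may use only its left and right neighbors, and that East and West are the right and left neighbors of South.

```python
# Assumed ring order of the four lending channels (Local is excluded).
# With this order, South's right neighbor is East and its left one is West.
RING = ["North", "East", "South", "West"]


def distribute(need, depth=4):
    """Return {channel: {owner: slots}} saying where each channel's slots live."""
    spare = {ch: max(depth - need[ch], 0) for ch in RING}
    alloc = {ch: {ch: min(need[ch], depth)} for ch in RING}
    for i, ch in enumerate(RING):
        deficit = need[ch] - depth
        right, left = RING[(i - 1) % 4], RING[(i + 1) % 4]
        for neighbor in (right, left):   # preference order is an assumption
            if deficit <= 0:
                break
            take = min(deficit, spare[neighbor])
            if take:
                alloc[ch][neighbor] = take
                spare[neighbor] -= take
                deficit -= take
    return alloc


# Hypothetical required depths (not the values of Fig. 2).
partial = distribute({"North": 2, "East": 1, "South": 9, "West": 4})
# Extreme case: South takes every slot of both neighbors.
full = distribute({"North": 0, "East": 0, "South": 12, "West": 0})
```

In the second call South ends up with 12 slots, which matches the bound stated above: a channel can reach at most three times its design-time depth, because it can only reach into its two adjacent neighbors.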

3.3. Fault tolerant adaptive router architecture (FTAR)

It is straightforward to reuse the capacity of the adaptive buffer architecture, with small modifications, to make it tolerate permanent faults that affect the buffers, and this is possible with minimal area penalty. The main modification has been performed in the control of the buffers, to allow each channel to bypass the faulty buffer word, as shown in Fig. 1c. The control is told which buffer word is faulty through faulty_buffer and isolates the fault.

Fig. 2. (a) Router designed with buffer depth 4. (b) An example of required configuration of the router. (c) Reconfiguration of the buffers to attend the need.

We assume that the NoC undergoes an off-line test where faulty buffer words are detected and identified. Then, each router has one signal per input channel to indicate which buffer word presents a defect. The indication of the faulty buffer word is programmed by an external controller. For instance, Fig. 3a shows a buffer of size 4 in which all slots are fault-free. When a fault is detected in a set of buffer words (Fig. 3b) in the NoC, at least three solutions can be taken: (i) to avoid using the entire router; (ii) to avoid using the channel; (iii) to isolate the faulty buffer unit and to continue to use the channel. In this paper we take the last solution, because the buffer is not totally useless; only the faulty buffer unit must be discarded. Using the same adaptive hardware as for buffer lending, this faulty buffer unit is isolated, and the next buffer word is used (Fig. 3c).

Nevertheless, when all fault-free buffer units of the channel are being used by the application (due to a certain traffic pattern, for example), the faulty buffer word needs to be replaced. In this case, buffers of a neighbor channel are used to substitute the defective buffer word. As cores connected to the NoC present different communication requirements, not all channels need to use all their buffers, in their full size, all the time. Of course, for some buffers there will be a tradeoff between performance and fault tolerance. For example, if all words of the buffer are faulty in a single channel, then at least one buffer word must be borrowed from a neighbor. A channel can borrow more than one buffer unit from its right and left neighbor channels.

The pseudocode of Algorithm 1 presents the control of the buffer to isolate the faulty buffer slot. Variable nxt_buffer_word holds the address of the next buffer slot, buffer_size is the buffer size defined at design time (in this case it is 4), faulty_buffer holds the status of each buffer slot and buffer_word holds the address of the buffer slot that is being used. The control checks whether the next buffer word is fault-free. If it is, buffer_word receives the value of nxt_buffer_word; otherwise nxt_buffer_word is incremented until it reaches the next fault-free buffer slot.

Algorithm 1. Pseudocode to control the wr/rd on buffer

for i = 0; i < buffer_size; i++
    nxt_buffer_word = buffer_word + 1
    if (faulty_buffer(nxt_buffer_word) = 0)
        buffer_word = nxt_buffer_word
    else
        while (faulty_buffer(nxt_buffer_word) = 1)
            nxt_buffer_word = nxt_buffer_word + 1
        end while
        buffer_word = nxt_buffer_word
    end if
end for
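The slot-skipping idea of Algorithm 1 can be sketched as runnable Python. This is a minimal illustration under stated assumptions: `faulty_buffer` is modeled as a list of 0/1 flags, the search stops at the end of the buffer instead of wrapping around, and the helper names are hypothetical. The bounds check is an addition of this sketch, not something Algorithm 1 states.

```python
def next_fault_free(buffer_word, faulty_buffer):
    """Index of the first fault-free slot after buffer_word, or None if none is left."""
    nxt = buffer_word + 1
    # Skip over faulty slots; the bounds check is an assumption of this sketch.
    while nxt < len(faulty_buffer) and faulty_buffer[nxt] == 1:
        nxt += 1
    return nxt if nxt < len(faulty_buffer) else None


def usable_slots(faulty_buffer):
    """Walk the buffer from the start and collect every slot the control would use."""
    slots, word = [], -1
    while True:
        word = next_fault_free(word, faulty_buffer)
        if word is None:
            return slots
        slots.append(word)


one_fault = usable_slots([0, 1, 0, 0])   # slot 1 is faulty
all_faulty = usable_slots([1, 1, 1, 1])  # nothing usable in this channel
```

When `usable_slots` returns fewer slots than the channel needs (an empty list in the extreme case), the channel has to borrow slots from a neighbor, as described in the text.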

Fig. 3. Buffer slots: (a) fault free; (b) faulty; (c) isolating a fault.

Fig. 5. Router architecture with flow control detailed.

4. Buffer Depth Controller

In this paper we developed a Buffer Depth Controller (BDC) block. This controller has been implemented for each input channel of the router, and it is shown in detail with the blocks that compose the input channel in Fig. 4. The Input Buffer is the block that contains the buffers used in the input channel, and it is responsible for the storage of the flits. This buffer is controlled by the Input Channel Controller, which controls the input flow, the handshake of the buffer and the routing of the flits that arrive in the input channel. The Buffer Depth Controller is the novelty of this work, and this block is detailed in Fig. 5. The BDC is used to resize and distribute the buffer depth for the adaptive router architecture.

The BDC block of Fig. 4 encloses four other blocks: Monitor, Integrator, Buffer Slots Allocation (BSA) and Resizing Decision, as depicted in Fig. 5. The Monitor block observes the traffic of the channel and the Integrator calculates the new buffer depth for each channel, according to the traffic behavior and the application. Buffer Slots Allocation implements a protocol to distribute the buffer slots to each channel according to the buffer depth given by the Integrator block. The Resizing Decision block verifies when each channel is allowed to change its buffer depth. Each block is detailed in the next sub-sections.

4.1. Calculating the buffer depth according to the traffic behavior

Fig. 4. Router architecture with input channel detailed.

The BDC block contains a Monitor that is basically a counter. Each channel monitor counts how many packets pass through its channel. The Central Controller block is a timer that holds the total number of packets that go through the router; when the sum of all packets from a router reaches the limit value, the timer is activated. The timer defines the time at which the router must reallocate the buffer depth for each of its channels. Each Monitor sends the number of packets received in its channel to the Central Controller. The numbers of packets of all channels are summed in the Central Controller until the limit value is reached. The quantity of packets that the Central Controller must wait for is defined at design time. When this value is reached, the Central Controller stops the monitors for the reallocation of buffers in the router. For this reallocation we have used the simple first-order control in the following equation:

$$\mathrm{buffer\_need}[n+1] = \alpha \cdot \mathrm{buffer\_need}[n] + (1-\alpha) \cdot \mathrm{traffic\_rate} \qquad (1)$$

The Integrator block is responsible for computing Eq. (1), and it takes the weighting factor α as an input. An operating system could change this value depending on the desired behavior. Eq. (1) weighs the past traffic according to α, a value between 0 and 1: higher α values give the past a greater weight, whereas lower α values favor the instantaneous traffic. Multiplications in the Integrator block are done using shift and add (for instance, multiplying a number X by 0.125 is equivalent to shifting X three positions to the right, i.e., dividing it by 8). Using shift and add helps to keep power consumption and area overhead low.
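As an illustration of the shift-and-add computation, the following Python sketch evaluates one update of Eq. (1) for weighting factors that are negative powers of two. The function name `integrate` and the parameter `shift` are hypothetical, and plain integers are assumed; the actual word width and rounding of the hardware Integrator are not described here.

```python
def integrate(buffer_need: int, traffic_rate: int, shift: int = 3) -> int:
    """One update of Eq. (1) with a weighting factor of 2**-shift.

    Only shifts, one subtraction and one addition are used, so no
    multiplier is required. Right shifts truncate, which is an assumption
    about the rounding behavior.
    """
    weighted_past = buffer_need >> shift                   # factor * buffer_need
    weighted_now = traffic_rate - (traffic_rate >> shift)  # (1 - factor) * traffic_rate
    return weighted_past + weighted_now


print(40 >> 3)           # 40 * 0.125 = 5
print(integrate(16, 8))  # 0.125 * 16 + 0.875 * 8 = 9
```

With `shift = 3` the weighting factor is 0.125, so most of the weight goes to the current traffic rate.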

In Eq. (1), buffer_need[n] refers to the number of buffer slots used up to that moment, and traffic_rate indicates the traffic rate in the channel, measured by the number of packets that pass through the channel. When the sum of the traffic_rate values of all channels reaches the established stop value (called stop_value), the traffic_rate needs to be normalized to define the ideal buffer depth. For example, suppose that stop_value is 128 packets and that the buffer depth defined at design time is 4 for each of the four channels of the router; the sum of all buffer depths is then sixteen (#channels × buffer_depth = 16). The traffic_rate is then normalized to the maximum buffer depth value, and the buffer slots are distributed proportionally among the channels.

As buffers are allocated through a borrowing/lending process among adjacent neighbor channels, the maximum number of buffer slots that a channel can ever get is three times the original buffer depth defined at design time. The limit of three times is due to the router design, which only allows flits to be stored in the neighbor channels.
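The normalization and the three-times limit can be sketched as follows. This is only an illustration under stated assumptions: the per-channel packet counts are normalized to the total number of slots in the router (#channels × buffer_depth), integer floor division is used because the rounding rule is not specified, and the names `normalize_traffic`, `DESIGN_DEPTH` and `MAX_DEPTH` are hypothetical.

```python
from typing import List

DESIGN_DEPTH = 4              # buffer depth per channel defined at design time
MAX_DEPTH = 3 * DESIGN_DEPTH  # own slots plus the slots of both neighbors


def normalize_traffic(packet_counts: List[int]) -> List[int]:
    """Scale per-channel packet counts to buffer slots, capped at MAX_DEPTH."""
    total_slots = len(packet_counts) * DESIGN_DEPTH
    total_packets = sum(packet_counts)
    if total_packets == 0:
        # No traffic observed: keep the design-time depth (assumption).
        return [DESIGN_DEPTH] * len(packet_counts)
    # Floor division is an assumption; the rounding rule is not given.
    shares = [count * total_slots // total_packets for count in packet_counts]
    return [min(share, MAX_DEPTH) for share in shares]


# 128 packets observed over four channels (stop_value = 128).
print(normalize_traffic([64, 32, 16, 16]))  # [8, 4, 2, 2]
print(normalize_traffic([112, 8, 8, 0]))    # [12, 1, 1, 0] after the cap
```

In the second call the busiest channel would be entitled to 14 slots, but it is limited to 12 because it can only use its own slots and those of its two neighbors.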

4.2. Buffer Slots Allocation (BSA) Policy

Eq. (1) is used to define the buffer depth according to the traffic behavior and the application, but this computed depth is not always achievable; e.g., when two adjacent channels need to borrow buffer slots, they have to do so according to some priority. The BSA allocates the buffer slots for each channel following the policy shown in the pseudocode of Algorithm 2.


The BSA block verifies whether the required buffer depth is greater than the buffer depth defined at design time. If so, the algorithm tries to borrow buffer slots from the right neighbor. A buffer slot is only lent when it is not needed by its own channel. When the channel cannot borrow buffer slots from the right neighbor, it tries to borrow them from the left neighbor. In Algorithm 2, each channel knows the number of buffer slots borrowed from (or lent to) its right or left neighbor channel, thanks to state registers in the buffer architecture.

It is possible that a channel does not receive the entire buffer depth defined by the Integrator. When this happens, the channel receives the original buffer depth, defined at design time, plus the buffer slots available in the neighbors. Consider the following situation: the West channel of a router has been designed with four buffer slots, but for a specific application it needs seven buffer slots, as calculated by the Integrator. However, the left neighbor channel can only lend one buffer slot, and the right neighbor channel needs all of its buffer depth. In this case, the final buffer depth defined by the BSA block will be 5 (four buffer slots defined at design time + 1 buffer slot borrowed from the left neighbor).

Algorithm 2. Pseudocode for resizing the buffer depth

CASE fsm IS
  WHEN borrow_right_channel
    IF buffer_need > buffer_depth THEN
      IF right_buffer_need >= buffer_depth THEN
        fsm = borrow_left_channel;
        buffer_self = buffer_depth;
      ELSE
        IF (buffer_depth - right_buffer_need) >= (buffer_need - buffer_depth) THEN
          buffer_self = buffer_need;
          fsm = finished;
        ELSE
          buffer_self = buffer_depth + (buffer_depth - right_buffer_need);
          fsm = borrow_left_channel;
        END IF
      END IF
    ELSE
      buffer_self = buffer_need;
      fsm = finished;
    END IF
  WHEN borrow_left_channel
    IF buffer_depth > left_buffer_need THEN
      IF (buffer_depth - left_buffer_need) >= (buffer_need - buffer_self) THEN
        buffer_self = buffer_self + (buffer_need - buffer_self);
      ELSE
        buffer_self = buffer_self + (buffer_depth - left_buffer_need);
      END IF
    END IF
    fsm = finished;
  WHEN finished
    fsm = borrow_right_channel;
    final_buffer = buffer_self;
END CASE
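For readers who prefer an executable form, the borrowing policy of Algorithm 2 can be sketched in Python as a single function that walks through the right-neighbor and left-neighbor states in sequence. The function name `allocate_buffer_slots` is hypothetical and the state machine is collapsed into straight-line code, so this is an illustration of the policy rather than of the hardware implementation.

```python
def allocate_buffer_slots(buffer_need: int, buffer_depth: int,
                          right_buffer_need: int, left_buffer_need: int) -> int:
    """Return the final buffer depth granted to a channel.

    A neighbor can lend only the slots it does not need itself, i.e.
    buffer_depth minus its own need (never a negative amount).
    """
    if buffer_need <= buffer_depth:
        # Nothing to borrow: the channel keeps only what it needs.
        return buffer_need

    # borrow_right_channel: take as much as possible from the right neighbor.
    right_spare = max(buffer_depth - right_buffer_need, 0)
    buffer_self = buffer_depth + min(right_spare, buffer_need - buffer_depth)
    if buffer_self == buffer_need:
        return buffer_self

    # borrow_left_channel: take the remainder from the left neighbor.
    left_spare = max(buffer_depth - left_buffer_need, 0)
    buffer_self += min(left_spare, buffer_need - buffer_self)
    return buffer_self


# West channel example: needs 7 slots, the right neighbor needs all 4 of
# its slots, and the left neighbor needs 3 (so it can lend 1).
print(allocate_buffer_slots(7, 4, right_buffer_need=4, left_buffer_need=3))  # 5
```

The call reproduces the West channel example from the text, where the final depth is 5.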

4.3. Resizing Decision without loss of performance

The main problem in adaptive systems is knowing the exact moment to change the features of the system while ensuring that the application keeps working correctly. Moreover, for the adaptation to be worthwhile, the reconfiguration time should be kept to a minimum so that it has no effect on the system performance.

As the buffer slots can be lent among the channels, it is necessary to ensure that, while a new buffer size is being defined, no flits are lost in the network or arbitrated to a wrong channel. The basic idea is that for each lending process, the buffers return to their original size, and only then the distribution proceeds. To guarantee that there is no performance loss, the decision of when to reconfigure is based on the buffer usage. If the buffer is not full, there is some free buffer slot in the channel, and at least one buffer unit can be lent. This is possible due to the large underutilization of the router, since the buffers are not used all the time.

For instance, in the example shown in Fig. 2, while the west and east channels decrease their buffer sizes, the south channel borrows buffer slots from these neighbor channels. When a channel is lending buffer slots it may stop receiving flits, because the handshake control only accepts incoming flits if at least two buffer slots are free: one to be lent and one for the incoming flit. This situation causes an insignificant local impact, because it happens only in the channel with the lower traffic and only during the buffer resizing process. If there is a free buffer slot, the resizing block takes one clock cycle to lend it. Therefore, if a channel needs to borrow Z buffer slots, it takes Z clock cycles in the best case (all needed buffer slots are free at the needed moment); in the worst case (no buffer slot is free to be borrowed at that moment), the time depends on when the neighbor channels have buffer slots available to be borrowed.
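The timing argument above can be made concrete with a small Python sketch. Both helper names are hypothetical. The per-cycle availability of a free slot in the lending neighbor is modeled as a sequence of booleans, and the rule that a channel which is not lending accepts a flit with a single free slot is an assumption.

```python
from typing import Iterable, Optional


def accepts_flit(free_slots: int, lending: bool) -> bool:
    """Handshake rule: while lending, two free slots are required
    (one to be lent and one for the incoming flit)."""
    required = 2 if lending else 1  # the non-lending case is an assumption
    return free_slots >= required


def cycles_to_borrow(slots_needed: int,
                     neighbor_has_free_slot: Iterable[bool]) -> Optional[int]:
    """Count clock cycles until `slots_needed` slots have been borrowed.

    One slot is transferred per cycle, and only in cycles where the
    neighbor has a free slot. Returns None if the trace ends first.
    """
    if slots_needed == 0:
        return 0
    borrowed = 0
    for cycle, free in enumerate(neighbor_has_free_slot, start=1):
        if free:
            borrowed += 1
        if borrowed == slots_needed:
            return cycle
    return None


print(cycles_to_borrow(3, [True, True, True]))         # best case: 3 cycles
print(cycles_to_borrow(2, [False, True, False, True])) # delayed: 4 cycles
```

The best case matches the Z clock cycles stated in the text, while the second trace shows how busy neighbor buffers stretch the resizing time.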

Instead of verifying how many free buffer slots each channel has available, we decided to increment or decrement only one slot at a time, because this reduces the hardware resources and simplifies the implementation. Slots are decremented and incremented until the buffer depth reaches the size calculated by the BSA, and this value is maintained until a new buffer reallocation is requested.

To guarantee that a channel is never blocked forever, one buffer slot is always left free for each channel, so a channel can start to receive packets at any time. The buffer size computation is always related to the number of flits that pass through the channel and to the buffer size: if the buffer size of a channel were zero, no packet would pass through this channel, and the channel would never receive buffer slots. As a design decision, one buffer slot is therefore always left to its own channel.

5. Results

5.1. Performance evaluation

Two different experiments have been carried out to analyze the performance of the proposal. First, results are shown for video decoders in the absence of faults, and then in the presence of faults.

A cycle-accurate traffic simulator written in Java evaluates the average latency and throughput of the network, using fixed-length packets of 80 flits and an 8-bit link width. We simulated three real applications to analyze the adaptive router with the BDC block: MPEG4, VOPD [13] and MWD [14], all with 12 cores but with different communication patterns.

We used a mapping tool to distribute the cores at design time in order to obtain the best throughput in the network for the MPEG4 application. This would be the best core distribution if the communication pattern of the application could be fixed at design time, but in a real system the application can change, as can the traffic behavior in each link.

The adaptive router uses an α equal to 0.125, and the buffer depth defined at design time is equal to 4. Experiments show that with α equal to 0.125 a channel can reach, in the worst case, the maximum amount of buffer slots in only three iterations when the requested buffer slots are free. For α equal to 0.875 the architecture takes eight iterations in the worst case. For these experiments, the buffer depth is monitored and changed every 128 packets, and each channel calculates a new buffer depth based on its own traffic. To obtain the ideal buffer depth in less time, one can simply decrease the stop_value variable, which defines the time at which the buffer slots are redistributed in the router. For the experiments carried out in this work, stop_value is 128.
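The effect of the weighting factor on the settling time can be explored with a short Python sketch of Eq. (1). The function name `iterations_to_settle`, the floating-point arithmetic and the tolerance of half a slot are assumptions. Because the hardware rounding and normalization are not modeled, the iteration counts produced here are not expected to match the three and eight iterations reported above; the sketch only shows the trend.

```python
def iterations_to_settle(alpha: float, start: float, target: float,
                         tolerance: float = 0.5, limit: int = 1000) -> int:
    """Iterate Eq. (1) with a constant traffic_rate equal to `target`
    until buffer_need is within `tolerance` slots of it."""
    need = float(start)
    for iteration in range(1, limit + 1):
        need = alpha * need + (1.0 - alpha) * target
        if abs(target - need) < tolerance:
            return iteration
    return limit


# Design-time depth 4, channel suddenly needs the maximum of 12 slots.
fast = iterations_to_settle(0.125, start=4, target=12)
slow = iterations_to_settle(0.875, start=4, target=12)
print(fast, slow)
```

A low weighting factor follows the new traffic almost immediately, whereas a high one reacts slowly but filters out short bursts.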

5.1.1. Results without fault tolerance

Consider an application such as MPEG4 and a NoC defined at design time, with channel buffer sizes specific to this initial application. One could wonder how the system would behave if the application changed from MPEG4 to MWD, or to VOPD. Considering each application using CPUs and memories mapped in the network according to the MPEG4 application, Fig. 6 shows the average latency results for four situations, considering changes in the system application from MPEG4 to MWD and later to VOPD:

(i) a heterogeneous router using the buffer size defined at design time for MPEG4,

(ii) a homogeneous router where all channels present a fixed buffer size equal to 4,

(iii) a homogeneous router where all channels present a fixed buffer size equal to 12,

(iv) and finally our proposal, where the buffer depth can dynamically change at runtime.

Fig. 6. Latency results for three applications using a homogeneous, a heterogeneous and an adaptive router.

Fig. 7. Throughput results for: (a) homogeneous router and our adaptive router for the MWD application; (b) heterogeneous router with buffer depth defined for MPEG4 and our adaptive router for the MWD application.

As one can see in Fig. 6, when the MWD application uses the heterogeneous router with buffer depths defined for MPEG4, the average latency increases. This occurs because when one maps MWD onto a NoC with buffer sizes defined for MPEG4, many channels present inappropriate buffer sizes, i.e., channels with high bandwidth have been allocated smaller buffers, and unused or underused channels larger ones. These buffer sizes are not the best ones to reduce the latency of MWD (since they have been designed to support MPEG4). The same does not happen with VOPD, because the hotspots of VOPD coincide with those of MPEG4.

As one can see in Fig. 6, the BDC and adaptive router architecture show a reduction of approximately 91% in the average latency for the MWD application, and 83% for VOPD, when compared with the heterogeneous router with buffer depth defined for another application (in this case MPEG4). MWD shows lower latency than VOPD because it has less communication among the cores; hence fewer resources are needed to decrease the latency. Besides, when compared with the homogeneous router with three times more buffer depth than our adaptive router, our proposal shows an insignificant difference in the average latency. Fig. 6 shows that with the adaptive router and the monitor mechanism, the average latency can be drastically reduced. The experiment also shows that the buffer depth greatly influences the average latency of the network, and it illustrates that it is possible to use a single NoC with the adaptive buffer architecture to obtain low latency for any traffic pattern.

Following the same configuration previously adopted, we verified the throughput. Fig. 7 shows the throughput results for the MWD application and Fig. 8 for VOPD. The X axis presents the input channels of all 12 routers of the NoC, and the Y axis represents the throughput in Mbits/s. Figs. 7 and 8 show the throughput for three situations:

(i) a homogeneous router where all channels present a fixed buffer size equal to 4,

(ii) a heterogeneous router using the buffer size defined at design time for MPEG4,

(iii) and our proposal, where the buffer depth can dynamically change at runtime.

Figs. 7a and 8a compare the throughput of the homogeneous router with our proposal. Channels for which no throughput is reported are not used. One can see in Figs. 7a and 8a that the proposed architecture always increases the throughput. This is because not all buffers are used at the same time, so with an adaptive architecture the resources can be distributed according to the application needs. In these cases the throughput increases on average by 40% using the adaptive router and BDC, in comparison to the homogeneous router. This experiment shows that, for the same buffer size, our proposal makes better use of the storage resources and presents a higher data flow.

Fig. 8. Throughput results for: (a) homogeneous router and our adaptive router for the VOPD application; (b) heterogeneous router with buffer depth defined for MPEG4 and our adaptive router for the VOPD application.

In Figs. 7b and 8b we compare our proposal with a heterogeneous router designed for the MPEG4 application. With this experiment we want to show that one cannot apply a static architecture to any application, even when a heterogeneous distribution of buffers is allowed at design time. In this situation, some channels have larger buffer depths, but the throughput is not always increased. In the worst case (only 5% of all situations) the heterogeneous and the proposed architecture have the same throughput. In all other cases, the proposed architecture increases the throughput. Moreover, with its larger buffer depths, the simple heterogeneous architecture does not save power.

Fig. 9 compares the latency of the adaptive router with α equal to 0.125 and of a homogeneous router. Fig. 9a presents the latency versus packet size, and Fig. 9b presents the latency versus buffer size. The proposal decreases the latency by around 50% for packet sizes larger than 40 flits and a buffer size equal to 4, as depicted in Fig. 9a. The adaptive architecture with the BDC decreases congestion by allocating more buffer slots to the channels that need them most. For different buffer sizes (Fig. 9b), our proposal only improves the performance for buffer sizes smaller than 10 positions. This happens because, for buffer sizes larger than 10 positions, the traffic is not heavy enough to use all buffers all the time and, hence, to create congestion in the NoC. The BDC does not redistribute buffer slots because all the channels in the router have enough storage for this traffic behavior.

Fig. 9. Latency results for MPEG4: (a) homogeneous and adaptive router for different packet sizes; (b) homogeneous and adaptive router for different buffer sizes.

5.1.2. Results with fault tolerance

To simulate the fault-tolerant proposal we have implemented the buffer depth controller in the Java cycle-accurate simulator and added the ability to bypass faulty buffers. In the presence of the latter, the algorithm bypasses the faulty buffer and, according to the traffic behavior, more buffers can be obtained from neighbor channels.

In order to verify the use of the fault-tolerant adaptive router and the controller, we performed simulations by randomly assigning faults only to the buffers. First, we verified the possibility of the occurrence of defects in a router connected to a processor and a cache. To define the occurrence of faults, the defect density per cm² shown in [9] has been used. We considered a 90 nm process technology that presents 0.28 defects/cm² for memory circuits and 0.14 defects/cm² for a microprocessor. We also considered that each NoC node has a microprocessor with small instruction and data caches (16 K/16 K), like the ARM11 MPCore processor, with a total area equal to 2.54 mm² [15]. As the router area is very small compared with the microprocessor and memories, it does not influence the calculation of the number of defects. With the defect densities mentioned above, the average value is 0.21 defects/cm². Let us imagine the worst case (and as of today, probably unlikely) scenario where the faults occurring in the node that includes processor, memory and a NoC router are concentrated only in the buffers of the NoC. Then the number of NoC faults is given by the equation below:

$$\mathrm{Faults} = DD \times NA \times TN \qquad (2)$$

where DD refers to the defect density, NA is the area of a single node containing a router, microprocessor and memory, and TN is the total number of NoC nodes. Considering the 4 × 3 NoC used in our experiments, the expected number of faults is approximately 0.065 defects for the entire NoC. In the near future, either CMOS technology will aggressively scale or it will be replaced by nanoscale technologies. In both cases, for the same die size, SoCs will show a great increase in complexity and more permanent faults will occur due to process parameters [16]. In this context, we first analyze the occurrence of a single fault in a NoC (reflecting current technologies) and then verify the behavior of the NoC if more faults happen (in a future technological scenario). For this latter case we considered 2 faults in the whole NoC and, as an extreme case, 6 faults in the NoC, hence assuming a higher defect rate than that of the current technology, in line with what is foreseen in [17].

Fig. 11. Average throughput results for the homogeneous router and the adaptive router with the fault-tolerant technique for MPEG4.
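As a quick numerical check of Eq. (2), the short Python sketch below plugs in the values quoted in the text. The function and variable names are hypothetical; the only step added is the unit conversion of the node area from mm² to cm².

```python
def expected_faults(defect_density: float, node_area_cm2: float,
                    total_nodes: int) -> float:
    """Eq. (2): Faults = DD x NA x TN."""
    return defect_density * node_area_cm2 * total_nodes


# Average of the memory (0.28) and microprocessor (0.14) defect densities.
average_density = (0.28 + 0.14) / 2   # defects/cm^2
node_area_cm2 = 2.54 / 100.0          # 2.54 mm^2 expressed in cm^2
total_nodes = 4 * 3                   # 4 x 3 NoC

faults = expected_faults(average_density, node_area_cm2, total_nodes)
print(f"{faults:.3f}")  # 0.064
```

The result of about 0.064 defects is consistent with the approximate value of 0.065 given for the entire NoC.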

To assess the efficiency of the fault-tolerant strategy, we repeated the experiments done with the fault-free proposal, this time with faults inserted in the buffers (application changing from MPEG4 to MWD, or to VOPD, in the presence of faults). Considering each application using CPUs and memories mapped in the network according to the MPEG4 application, the latency of these experiments is shown in Fig. 10.

First, we verified the behavior of the NoC considering a single fault; then we introduced 2 and 6 faulty buffers and analyzed the latency and throughput of the NoC for those cases. All the fault scenarios defined above have been simulated with random injection, totaling 50 simulations for each fault scenario (1, 2 and 6 faults) and each application. We set the buffer depth to 4 and the packet size to 80 flits. We verified the latency results for the fault-tolerant adaptive router (FTAR) with the Buffer Depth Controller (BDC) for the three fault scenarios (1, 2 and 6 simultaneous faults). Fig. 10 presents the average latency results obtained for the MPEG4, VOPD and MWD applications.
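A random injection campaign of this kind could be organized as in the following Python sketch. It is only an illustration: the identification of a buffer slot by a (router, channel, slot) tuple, the four input channels per router, the fixed seed and the function names are all assumptions, and the sketch only selects the faulty slots without simulating the network.

```python
import random
from typing import List, Tuple

ROUTERS = 12       # 4 x 3 NoC
CHANNELS = 4       # input channels per router (assumption)
BUFFER_DEPTH = 4   # buffer slots per channel

BufferSlot = Tuple[int, int, int]  # (router, channel, slot), hypothetical layout


def inject_faults(num_faults: int, rng: random.Random) -> List[BufferSlot]:
    """Pick `num_faults` distinct buffer slots uniformly at random."""
    all_slots = [(router, channel, slot)
                 for router in range(ROUTERS)
                 for channel in range(CHANNELS)
                 for slot in range(BUFFER_DEPTH)]
    return rng.sample(all_slots, num_faults)


def fault_campaign(num_faults: int, runs: int = 50,
                   seed: int = 0) -> List[List[BufferSlot]]:
    """Generate the fault sets for `runs` independent simulations."""
    rng = random.Random(seed)
    return [inject_faults(num_faults, rng) for _ in range(runs)]


for scenario in (1, 2, 6):
    campaign = fault_campaign(scenario)
    print(scenario, len(campaign))  # 50 fault sets per scenario
```

Each fault set would then be loaded into the simulator before running one application, giving 50 runs per scenario and per application.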

Fig. 10. Latency results for three applications with the adaptive and fault-tolerant router.

According to Fig. 10, for the VOPD and MPEG4 applications, the FTAR with faulty buffers shows a minimal increase in latency when compared with the reconfigurable router. For VOPD and MPEG4 with 6 faults in the buffers the latency increases by less than 10%; for MWD no latency increase is observed. Thanks to the available reconfiguration, even in the presence of random fabrication faults in the buffers, the adaptive router with the BDC can sustain the required performance.

Between MPEG4 and VOPD, the routing resources are requested more often in the MPEG4 application, so the faulty buffers have a greater influence on its latency. In MWD, however, the required bandwidth is lower, so not all buffers are requested at the same time, which allows the faulty buffers to be replaced by fault-free buffers from the neighbors, thus sustaining performance.

Fig. 11 compares the throughput, in the presence of faults, of the homogeneous and adaptive routers with the proposed fault tolerance strategy for the same buffer depth. We verified that the behavior depends strongly on the channel where the fault is detected. For instance, when the channel has low traffic, the throughput decreases only slightly. However, if a fault occurs in a channel with high traffic, the loss in performance can reach up to 20% for the benchmarks used in this paper. In these cases, the adaptability mechanism can compensate for this loss through the loan of buffers among adjacent channels provided by the BDC block.

5.2. Area, power and frequency results

The proposed routers have been described in VHDL, and we used the ModelSim tool to simulate the code. We analyzed the average power consumption for a 90 nm CMOS process technology using the Synopsys Power Compiler tool. Table 1 presents the results obtained with this analysis. The channel width contains n + 2 bits: n data bits and 2 control bits.

Table 1. Power consumption, frequency and area results for the proposed and homogeneous router architectures.

| Router | Buffer depth | Area (µm²) | Maximum frequency (MHz) | Power consumption @ max. freq. (mW) | Power consumption @ 200 MHz (mW) |
|---|---|---|---|---|---|
| Homogeneous Router (HR) | 9 | 28,471 | 700 | 11.94 | 3.39 |
| Homogeneous Router (HR) | 4 | 15,871 | 757 | 6.10 | 1.61 |
| Adaptive Router (AR) | 4 | 38,862 | 675 | 6.79 | 2.00 |
| Buffer Depth Controller (BDC) | 4 | 13,004 | 3.3 GHz | 9.25 | 0.55 |
| AR + BDC | 4 | 51,900 | 675 | 9.24 | 2.55 |

The power results have been obtained with the circuit operating at the maximum frequency and at 200 MHz. We fixed the buffer depth of the adaptive router at 4 and sized the buffer of the homogeneous router so as to obtain the same latency as the adaptive one.

5.2.1. Without fault tolerance

To reach the same average latency obtained with the adaptive router, the homogeneous router needed much larger buffers. For the application examples, the homogeneous router needs a buffer depth of approximately 9, in comparison to a buffer depth of 4 in the adaptive router. While one saves flip-flops by using the buffer depth controller and adaptive router, one also needs extra multiplexers to support the borrowing/lending process. The flip-flops have almost constant (and high) power consumption due to the clock, whereas the multiplexers do not. The proposed architecture uses 55% smaller buffer depths to reach the same performance as a homogeneous router with buffer depth equal to 9. Results show a reduction in power consumption of 25% on average. As one can see in Table 1, the homogeneous router with buffer depth equal to 9 consumes 3.39 mW at 200 MHz, whereas our proposal consumes 2.55 mW at the same frequency. The area is larger due to the extra hardware used to define the buffer depth allocation at runtime; nevertheless, most of the overhead consists of multiplexers and interconnections.
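The two percentages quoted above follow directly from the buffer depths and from the 200 MHz power figures; the small Python sketch below reproduces the arithmetic. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which `proposed` is smaller than `baseline`."""
    return (baseline - proposed) / baseline


# Buffer depth: homogeneous router (9) versus adaptive router (4).
depth_saving = relative_reduction(9, 4)
# Power at 200 MHz: homogeneous router with depth 9 versus AR + BDC.
power_saving = relative_reduction(3.39, 2.55)

print(f"{depth_saving:.1%}")  # 55.6%
print(f"{power_saving:.1%}")  # 24.8%
```

Both values agree with the rounded figures of 55% and 25% used in the text.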

For a buffer depth equal to 4, the BDC and the adaptive router incur a frequency penalty when compared to the homogeneous router, while the throughput is increased by around 40% on average. The area of the BDC block for a single channel is 3251 µm², which is 10% of the total area of the adaptive router with buffer size equal to 4. The area overhead of the buffer depth controller for four channels is 40% considering the whole architecture; however, one can increase the throughput by 40% on average, and the latency is reduced by 75%. The maximum frequency of the adaptive router alone and of the adaptive router plus buffer depth controller is the same, since the BDC runs in parallel with the adaptive router.

5.2.2. With fault tolerance

Table 2 shows synthesis results for routers with the fault-tolerant strategy. The area overhead of the fault tolerance is only 2% when compared to routers without FT for the same buffer depth, and the power consumption is almost the same, since only one multiplexer has been included to isolate the faulty buffers.

For the same buffer depth, FTAR + BDC shows overheads when compared to FTHR; however, the adaptive strategy decreases the latency by 70% on average. For the same performance, the power gains of FTAR + BDC are almost the same as those of AR + BDC.

Table 2. Power consumption, frequency and area results for the proposed and homogeneous router architectures.

| Router | Buffer depth | Area (µm²) | Maximum frequency (MHz) | Power consumption @ max. freq. (mW) | Power consumption @ 200 MHz (mW) |
|---|---|---|---|---|---|
| Fault tolerant HR (FTHR) | 4 | 16,195 | 724 | 6.51 | 1.79 |
| Fault tolerant AR (FTAR) | 4 | 40,000 | 617 | 6.24 | 2.01 |
| FTAR + BDC | 4 | 53,004 | 617 | 8.69 | 2.56 |

6. Conclusion

In this paper we have shown an automatic controller that dynamically changes the buffer depth according to the traffic measured in each channel. The proposed router can dynamically change the buffer depth of each channel according to the needs of the application, even when the application changes radically. A fault-tolerant technique has also been combined with the controller, since the same mechanism that ensures extra performance can deliver fault tolerance against manufacturing defects.

With this work we verified the advantages of using a NoC with the proposed router instead of a homogeneous one. Moreover, even if custom sizing had been done at design time for an application, whenever the application changes we sustain performance thanks to the buffer adaptability. The proposed architecture presents gains in throughput and average latency, even in the presence of faults, when one considers the same buffer depth defined at design time. Besides, the FT technique is very lightweight: it presents less than 3% area overhead and less than 5% power overhead, and it has almost the same performance (even in the presence of faults) as a router without this mechanism.

With the adaptive router it is possible to use the same NoC for different traffic behaviors, using the same links, which might change their communication patterns at run time without any performance loss. Besides, the system uses a mechanism to configure new buffer depths without pauses in the network. The proposed routers and buffer depth controller, while reaching the same performance as the original architecture, obtained a reduction of approximately 25% in power consumption.

The detection of faults in the buffers is not the focus of this work; however, as future work we intend to add a BIST structure to the NoC platform to test the buffers and to evaluate the cost of this platform. NoC plus BIST can dynamically test the buffers and update the system with the faulty buffer words. The test structure could apply the test vectors while a channel is not in use, and then update the registers that indicate which buffer slots are faulty after the test phase.

References

[1] M. Dall’Osso, G. Biccari, L. Giovannini, D. Bertozzi, L. Benini, P. Tavel, Xpipes: A latency insensitive parameterized network-on-chip architecture for multiprocessor SoCs, in: Proceedings 21st International Conference on Computer Design, ICCD’03, 2003, pp. 536–539.

[2] X. Yang, Huimin Du, J. Han, Research on node coding and routing algorithm for network on chip, in: ISECS International Colloquium on Computing, Communication, Control, and Management, CCCM ’08, 2008, pp. 198–203.

[3] H. Jingcao, U.Y. Ogras, R. Marculescu, System-level buffer allocation for application-specific networks-on-chip router design, in: IEEE Transaction on Computer-Aided Design of Integrated Circuits and Systems, 2006, pp. 2919–2933.

[4] C. Xuning, L. Peh, Leakage power modeling and optimization in interconnection networks, in: Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003, pp. 90–95.

[5] Y. Tamir, G.L. Frazier, Dynamically-allocated multi-queue buffers for VLSI communication switches, IEEE Transaction on Computers 41 (6) (1992) 725–737.

[6] D. Matos, C. Concatto, L. Carro, M. Kreutz, A. Susin, F. Kastensmidt, NoC power optimization using a reconfigurable router, in: IEEE Computer Society Annual Symposium on VLSI, 2009, pp. 235–240.

[7] C. Nicopoulos, Kim D. Park, N. Vijaykrishnan, S. Yousif, C. Das, ViChaR: A dynamic virtual channel regulator for network-on-chip routers, in: Proceedings of 39th Annual International Symposium Microarchitecture (MICRO), 2006.

[8] M.A. Al Faruque, T. Ebi, J. Henkel, ROAdNoC: Runtime observability for an adaptive network on chip architecture, in: IEEE/ACM International Conference on Computer-Aided Design, ICCAD, 2008, pp. 543–548.

Page 11: A NOC closed-loop performance monitor and adapter

D. Matos et al. / Microprocessors and Microsystems 37 (2013) 661–671 671

[9] M.J. Flynn, P. Hung, Microprocessor desing issues: thoughts on the road ahead,IEEE Micro. 25 (3) (2005) 16–31.

[10] M. Ali, M. Welzl, S. Hessler, An efficient fault tolerant mechanism to deal withpermanent and transient failures in a network on chip, Int. J. High Perform.Syst. Archit. 1 (2) (2007) 113–123.

[11] Ying-Cherng Lan, Shih-Hsin Lo, Yueh-Chi Lin, Yu-Hen Hu, Soa-Jie Chen, BiNoC:Bidirectional NoC architecture with dynamic self-reconfigurable channel, in:3rd ACM/IEEE International Symposium on Networks-on-Chip, 2009.

[12] C. Zeferino, A. Susin, SoCIN: A parametric and scalable network-on-Chip,Symposium on Integrated Circuits and Systems Design (SBCCI), 2203, pp. 169–174.

[13] D. Bertozzi, et al., NoC synthesis flow for customized domain specificmultiprocessor systems-on-chip, in: IEEE Transaction on Parallel andDistributed System, 2005, pp. 113–129.

[14] K. Srinivasan, K.S. Chatha, A low complexity heuristic for design of customnetwork-on-chip architectures, in: Proceedings of Design, Automation andTest in Europe Conference 1 (2006) 1–6.

[15] http://www.arm.com/products/CPUs/ARM11MPCoreMultiprocessor.html.[16] T. Dumitras, S. Kerner, R. Marculescu, Towards on-chip fault-tolerant

communication, Proc. Asia and South Pacific Design Automation Conference(ASP-DAC), 2003, pp. 225–232.

[17] A.D. Hon, H. Naeimi, Seven strategies for tolerating highly defective fabrication, in: Proceedings of the 21st Annual Symposium on Integrated Circuits and System Design, 2008, pp. 34–39.

[18] M.H. Neishaburi, Z. Zilic, Reliability aware NoC router architecture using input channel buffer sharing, in: Proceedings of the 19th ACM Great Lakes Symposium on VLSI, 2009, pp. 511–516.

Débora Matos was born in Pelotas-RS, Brazil. She received the B.S. degree in digital system engineering from Universidade Estadual do Rio Grande do Sul (UERGS), Guaíba, Brazil, in 2007 and the M.Sc. degree in computer science from Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 2010, where she is currently pursuing the Ph.D. degree in computer science. Her main research interests include networks-on-chip (NoCs), reconfigurable systems, system on chip, and multiprocessor system on chip (MPSoC).

Caroline Concatto received the B.S. degree in digital system engineering from Universidade Estadual do Rio Grande do Sul (UERGS), Guaíba, Brazil, and the M.Sc. degree from Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, where she is currently pursuing the Ph.D. degree in computer science. Her research interests include adaptive systems, networks-on-chip (NoCs), and fault tolerance techniques.

Anelise Kologeski received the B.S. degree in digital system engineering from Universidade Estadual do Rio Grande do Sul (UERGS), Guaíba, Brazil. She is currently pursuing the Master's degree in Microelectronics. Her research interests include adaptive systems, networks-on-chip (NoCs), and fault tolerance techniques.

Luigi Carro was born in Porto Alegre, Brazil, in 1962. He received the electrical engineering degree and the M.Sc. and Ph.D. degrees in computer science from Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 1985, 1989, and 1996, respectively. From 1989 to 1991, he worked with the R&D Group, ST-Microelectronics, Agrate, Italy. He is currently a Professor with the Applied Informatics Department, Informatics Institute, UFRGS, where he is in charge of computer architecture and organization disciplines at the undergraduate level. He is also a member of the Graduate Program in computer science at UFRGS, where he is co-responsible for courses on embedded systems, digital signal processing, and VLSI design. His primary research interests include embedded systems design, validation, automation and test, fault tolerance for future technologies, and rapid system prototyping. He has advised over 20 graduate students (Master's and Ph.D. levels). He has published over 150 technical papers on those topics and is the author of the book Digital Systems Design and [...] (2001, in Portuguese) and coauthor of Fault-Tolerance Techniques for SRAM-Based FPGAs (Springer, 2006) and Dynamic Reconfigurable Architectures and Transparent Optimization Techniques (Springer, 2010). His most up-to-date resume is available at http://lattes.cnpq.br/8544491643812450. For the latest news, please check www.inf.ufrgs.br/~carro. Dr. Carro received the FAPERGS Researcher of the Year in Computer Science prize in 2007.

Márcio Kreutz received the B.S. degree in computer science and the M.Sc. and Ph.D. degrees in computer science and microelectronics from the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 1994, 1997, and 2005, respectively. His thesis was developed on the topic of networks-on-chip architectural optimization. He is currently an Adjunct Professor with Federal University of Rio Grande do Norte (UFRN), Natal, Brazil. His research interests include embedded architectures modeling and specification, embedded software mapping, and communication/processing architectures optimization.

Fernanda Kastensmidt received the B.S. degree in electrical engineering and the M.Sc. and Ph.D. degrees in computer science and microelectronics from the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 1997, 1999, and 2003, respectively. She is a Professor with the Department of Computer Science, UFRGS. Her professional research experiences include internships with the Grenoble National Polytechnic Institute (INPG), France, in 1999, with Xilinx Corporation, San Jose, CA, in 2001, and with the Laboratory of Materials and Systems Integration (IMS), Bordeaux University, France, in 2008. Her research interests include VLSI testing and design, fault effects, fault tolerant techniques, and programmable architectures. She is the author of the book Fault-Tolerance Techniques for SRAM-based FPGAs (Springer, 2006).

Altamiro Susin was born in Vacaria-RS, Brazil. He received the Electrical Engineering and M.Sc. degrees from Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, in 1972 and 1977, respectively, and the Dr.Eng. degree from Institut National Polytechnique de Grenoble, Grenoble, France, in 1981. He has worked with digital computers since 1968, when he was part of a group that started the computer centers of two local universities. He is a Full Professor with the Electrical Engineering Department, UFRGS, where he is in charge of analog and digital electronics disciplines at the graduate and undergraduate levels. He is also a member of the Graduate Programs in Computer Science, Electrical Engineering, and Microelectronics of UFRGS. His main research interests include integrated circuit architecture, system-on-chip design, and signal processing, with over 200 technical papers published in those domains. He has been responsible for several R&D projects funded with public and/or industry resources, and is presently coordinating a research network for Digital TV.