[ieee 2009 17th ifip international conference on very large scale integration (vlsi-soc) -...


Page 1: [IEEE 2009 17th IFIP International Conference on Very Large Scale Integration (VLSI-SoC) - Florianopolis, Brazil (2009.10.12-2009.10.14)] 2009 17th IFIP International Conference on

Highly Efficient Reconfigurable Routers in Networks-on-Chip

Débora Matos1, Caroline Concatto1, Luigi Carro1, Fernanda Kastensmidt1, Márcio Kreutz2, Altamiro Susin1

1UFRGS – Federal University of Rio Grande do Sul – PPGC, PGMICRO, Porto Alegre, Brazil {debora.matos, cconcatto, carro, fglima, susin}@inf.ufrgs.br

2UFRN – Federal University of Rio Grande do Norte, Natal, Brazil [email protected]

Abstract: NoC designs are based on a compromise among latency, power dissipation and energy, usually defined at design time. However, setting all parameters at design time can cause either excessive power dissipation (caused by router underutilization) or higher latency. The situation worsens whenever the application changes its communication pattern, e.g., when a portable phone downloads a new service. Buffer depth is an important resource for ensuring performance, and it has a great impact on power. In this paper we propose the use of a reconfigurable router, in which the buffers are dynamically allocated to increase router efficiency in a NoC, even under rather different communication loads. The reconfigurable router allows up to 52% power savings while maintaining the same performance as the original homogeneous router, with roughly the same area.

I. INTRODUCTION

Systems-on-Chip (SoCs) are emerging as one of the technologies that support the growing design complexity of embedded systems, since they provide processor architectures adapted to selected problem classes, combined with programming flexibility.

The increased interconnection complexity and the limited scalability of buses require another interconnection model. Communication among the cores of a SoC through reusable interconnections is provided by Networks-on-Chip (NoCs) [1]. NoCs have been proposed to integrate several IP cores while providing high communication bandwidth and parallelism.

Azimi et al. [2] state that it is necessary to find a way to keep the off-die bandwidth manageable in system architectures that trade off cost, power and performance. Moreover, in a hardware context, the system must offer considerable flexibility together with scalable high bandwidth, low latency and power efficiency. The interconnection fabric allows cores to access memory and to communicate with each other and with the rest of the system. Manferdelli et al. [3] argue that many applications use heterogeneous processors with several memory controllers to provide a large memory interface.

Decisions on parameters such as throughput, frequency, and bandwidth are currently made at design time, in an attempt to guarantee the performance of the system. However, whenever the product needs an update or has to change its functionality, most likely a large change in the communication pattern will be required, and hence the decisions made at design time would mean either a loss in performance or excessive power dissipation.

Besides, considering the NoC components, such as crossbars, arbiters, buffers and links, in the experiments reported in [4] the buffers were the largest leakage power consumers, dissipating approximately 64% of the whole power budget. For this reason, the buffers were considered candidates for leakage power optimization, since even at high loads 85% of the buffers were still idle [4]. Regarding dynamic power, buffer consumption is also high, and it increases rapidly as packet flow throughput increases [5].

In this work we show the efficiency loss in the number of buffers used within a homogeneous router, and the gains that can be achieved using a reconfigurable router. In this particular contribution, we focus on providing a reconfigurable router that can optimize power while sustaining high performance, even when the application changes the communication pattern. A NoC built with the proposed router allows the use of buffers with smaller depths, with the same performance as that obtained with a NoC using a fixed router with a large buffer depth. With this architecture, the reconfigurable router saves power and improves energy usage in the NoC. Moreover, it compares favorably with other dynamic approaches such as virtual channels.

This paper is organized as follows. In section II we discuss related works that identify the need for reconfiguration. Section III presents an analysis of the problem and identifies low efficiency in homogeneous routers. The reconfigurable router is proposed in section IV, where we describe the differences between the original and the new architecture. In section V we present results for latency, buffer utilization, frequency, area and power consumption, and finally the conclusions are presented in section VI.

II. RELATED WORKS

In an MPSoC (Multi-Processor SoC) it is usual to find different interconnection needs among processors, memories, peripherals and other elements. Because of this, the need for distinct bandwidth in each node of a NoC has been recognized. In the literature, some works present solutions that point in this direction, but generally with a static approach, used in the design phase only. In this section we review several of these works, which demonstrate that a single application needs routers with different features and communication capabilities.

Cardoso et al. [6] observed the need to have heterogeneous links in a network-on-chip, where for each application a link with appropriate size is used. A heterogeneous router was presented based on heterogeneous links with two wrappers for each router. These wrappers control the traffic between the heterogeneous channels and each channel works with the message compatible with its width. Following this path, Kreutz et al. [7] analyzed the architecture of three routers, each one having different characteristics.

Ahmad et al. [8] proposed a network router designed with a bus-like interface. A built-in wrapper is used, and thus any component compatible with a bus can be integrated into the NoC architecture. The interface of this router becomes a simple bus. The objective of this proposal was to reduce the design time and to ease integration, since knowledge of the NoC architecture is not required. When the network has a channel that requires high bandwidth, the NoC changes the switching mode, obtaining a dedicated path between the IPs.

Ahonen and Nurmi [9] proposed a hierarchical NoC, able to cope with the inefficiencies of a regular NoC. They used two types of on-chip network: the Global Network (NoC) and the Local Network (bus-based). The Local Network connects slaves to a master, which together form a local cluster. The NoC connects all local clusters, all of them having the same capabilities.

Eun Lee and Bagherzadeh [10] proposed the use of different clocks while sending flits in the NoC. Body flits operate faster than head flits. According to them, since FIFOs (First-In First-Out buffers) work faster than the routing decision logic, it is possible to use different clocks for body flits and head flits. While the head flit is analyzed to define the route, body flits can continue advancing along the reserved path already established, improving the performance of the router.

The problem with the proposals reported above is that all of them must be applied at design time, which limits scalability whenever the NoC is used for a new application on the same platform. Hence, product updates, or even the customization of an MPSoC for a different market using the same components but with a different communication pattern, are not possible without a costly redesign.

Another line of work uses virtual channels, and was presented in [11]. The authors proposed a unified buffer structure called ViChaR, which dynamically allocates virtual channels and buffers according to network traffic conditions. In that case, instead of individual and statically partitioned buffers, they used a unified buffer unit. However, due to the complexity of this architecture, its latency and power consumption are higher than those of our proposal. The proposed router obtained on average 15% more gains in power consumption when compared with the ViChaR architecture. Besides, for the same performance as a generic router, the proposed router obtained up to 64% of buffer size reduction, against 50% when using a ViChaR architecture (these results were obtained from the gains that both architectures reach when compared with a generic router). This paper proposes a simple and effective solution to obtain better efficiency from the input channel buffers, and with this it is possible to obtain gains in performance and power consumption.

III. A QUANTITATIVE ANALYSIS OF THE PROBLEM

We simulated four examples of real applications to analyze the router's behavior. The applications used were MPEG4, VOPD [12], MWD (Multi-Window Display) [13] and Xbox [14], all with 12 cores but with different communication patterns, as represented by the bandwidth of each link depicted in fig. 1.

A cycle-accurate traffic simulator written in Java was used to evaluate the network hotspots and the average latency with the reconfigurable and original routers. The placement of the cores in the NoC was specified in accordance with the communication needs of the cores, reflecting a design-time choice based on the original application.

Figure 1. MPEG4 (a), VOPD (b), Xbox (c) and MWD (d) task graphs.

Fig. 2 shows the mean efficiency of MPEG4, VOPD, MWD and Xbox when mapped to a 4x3 NoC with homogeneous buffer sizes. In this work, efficiency represents how many buffer units are being appropriately used, in accordance with the necessity of the application. The efficiency results in fig. 2 were obtained in accordance with equation 1.

$$\eta = \frac{1}{\#\text{routers}} \sum_{i=1}^{\#\text{routers}} \frac{\#\,\text{used\_buffers\_router}_i}{\#\,\text{total\_buffers\_router}_i} \qquad (1)$$

This equation indicates the number of buffers effectively used per number of available buffers in each channel of the NoC. Looking at fig. 2 one can observe that homogeneous routers use excessive buffers in some channels, since not all channels present the same communication rate. In such case, the extra buffers of the channel will consume unnecessary area and power.
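As an illustration of equation (1), the following minimal sketch averages the used-to-total buffer ratio over all routers. The function names and the three-router example values are hypothetical and are not taken from the paper's measurements.

```python
def router_efficiency(used_buffers, total_buffers):
    """Fraction of one router's buffer slots that the application uses."""
    if total_buffers <= 0:
        raise ValueError("total_buffers must be positive")
    return used_buffers / total_buffers


def mean_efficiency(routers):
    """Average of used/total over all routers, following equation (1).

    `routers` is a list of (used_buffers, total_buffers) pairs.
    """
    if not routers:
        raise ValueError("at least one router is required")
    return sum(router_efficiency(u, t) for u, t in routers) / len(routers)


# Hypothetical NoC with three routers of 16 buffer slots each.
example = [(8, 16), (12, 16), (4, 16)]
print(mean_efficiency(example))  # 0.5
```

A value of 0.5 here would mean that, on average, half of the buffer slots provisioned at design time are not needed by the application.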

Usually, at design time the buffers are sized to guarantee that all channels in a router will have low latency, or to guarantee lower power consumption and area. This means that each FIFO might have a different number of flip-flops to provide performance and/or optimal power consumption and/or area for that specific link. However, when an optimal point is defined at design time and the application is later changed, the latency and the power consumption will probably increase, since in some links there might not be enough buffers to ensure QoS (low latency, high throughput, a required bit rate, delay, jitter).

One can see in fig. 2 that, when the buffers are sized for the best performance case, only around 54% of them are utilized. The remaining buffers still consume power, but they do not contribute to reducing the latency or the number of hotspots in the network.

Figure 2. Efficiency of a homogeneous router.

In the next section we present the proposed reconfigurable router, in which NoC efficiency is increased by reconfiguring the buffer size at run time according to the requirements of each channel of the router, without the need to oversize buffers to guarantee performance.

IV. PROPOSED ROUTER ARCHITECTURE

A. Original Router Architecture

The proposed router architecture was embedded in the SoCIN NoC [15]. SoCIN has a regular 2D-mesh topology and a parametric router architecture. The router architecture used is RaSoC, a routing switch with up to five bi-directional ports (Local, North, South, West and East), each port with two unidirectional channels; each router is connected to four neighboring routers (North, South, West and East). This router is a VHDL soft-core, parameterized in three dimensions: communication channel width, input buffer depth and routing information width. The architecture uses the wormhole switching approach and a deterministic source-based routing algorithm. The routing algorithm used is XY routing, which supports deadlock-free data transmission, and the flow control is based on the handshake protocol. The wormhole strategy breaks a packet into multiple flow control units called flits, which are sized as an integer multiple of the channel width. The first flit is a header with the destination address, followed by a set of payload flits and a tail flit. Two bits of each flit are used to indicate the flit type (header, payload or tail).
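The XY ordering mentioned above can be sketched as follows: a packet first travels along the X dimension until the column matches, and only then along Y. This is a generic illustration of XY routing, not the RaSoC implementation; the coordinate convention (x grows toward East, y grows toward North) and the function names are assumptions.

```python
def xy_next_port(current, destination):
    """Return the output port for one hop of XY routing on a 2D mesh.

    Assumed convention (not from the paper): coordinates are (x, y),
    x grows toward East and y grows toward North.
    """
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "East"
    if dx < cx:
        return "West"
    if dy > cy:
        return "North"
    if dy < cy:
        return "South"
    return "Local"  # the packet has arrived


def xy_route(source, destination):
    """List the output ports taken from source to destination."""
    step = {"East": (1, 0), "West": (-1, 0), "North": (0, 1), "South": (0, -1)}
    position, ports = source, []
    while True:
        port = xy_next_port(position, destination)
        ports.append(port)
        if port == "Local":
            return ports
        position = (position[0] + step[port][0], position[1] + step[port][1])


print(xy_route((0, 0), (2, 1)))  # ['East', 'East', 'North', 'Local']
```

Because every packet exhausts its X offset before turning into Y, no packet ever turns from Y back into X, which is the property that makes XY routing deadlock-free on a mesh.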

There is a Round-Robin arbiter at each output channel. Buffering is present only at the input channels, in the form of a FIFO, and each flit is stored in one FIFO buffer unit. The same input channel is instantiated for all channels of the NoC, so that all channels have the same buffer depth, defined at design time.
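A round-robin arbiter of the kind placed at each output channel can be modeled behaviorally as below. This is a software sketch under the usual definition of round-robin arbitration (the most recently granted requester gets the lowest priority next time); the class and method names are hypothetical and the actual VHDL arbiter may differ.

```python
class RoundRobinArbiter:
    """Grants one requester per call, rotating priority after each grant."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        # Start so that input 0 has the highest priority on the first call.
        self.last_granted = num_inputs - 1

    def grant(self, requests):
        """`requests` is a list of booleans; returns the granted index or None."""
        for offset in range(1, self.num_inputs + 1):
            candidate = (self.last_granted + offset) % self.num_inputs
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None  # nobody is requesting this output


arbiter = RoundRobinArbiter(4)
print(arbiter.grant([True, False, True, False]))  # 0
print(arbiter.grant([True, False, True, False]))  # 2
print(arbiter.grant([True, False, True, False]))  # 0
```

With two inputs competing for the same output, the grants alternate, so neither input can starve the other.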

B. Reconfigurable Router Architecture

If a NoC router has a larger FIFO, the throughput will be larger and the latency in the network smaller, since fewer flits will be stalled in the network [16]. Nevertheless, there is a limit to the increase of the FIFO depth. Since each communication has its peculiarities, sizing the FIFO for the worst-case communication scenario compromises not only the routing area, but power as well. However, if the router has a small FIFO depth, the latency will be larger, and QoS can be compromised. The proposed solution is to have a heterogeneous router, in which each channel can have a different buffer size. In this situation, if a channel has a communication rate smaller than its neighbor, it may lend some of its unused buffer slots. Under a different communication pattern, the roles may be inverted or changed at run time, without a redesign step.

The proposed architecture is able to sustain performance because not all buffers are used all the time. In our architecture it is possible to dynamically reconfigure different buffer depths for each channel. A channel can lend part or all of its buffer words according to the requirements of the neighboring buffers. To reduce connection costs, each channel may only use the available FIFOs of its right and left neighbor channels. This way, each channel may have up to 3 times the number of buffer words of its original buffer size planned at design time.

Fig. 3 shows the original and the proposed input FIFO. Compared with the original architecture, the new proposal uses more multiplexers to allow the reconfiguration. Fig. 3(b) presents the South Channel as an example. In this architecture it is possible to dynamically configure different buffer depths for the channels. According to this figure, each channel has five multiplexers, two of which are responsible for controlling the input and output of data.

Figure 3. Input FIFO: (a) original; (b) proposed router. Signal labels in the figure: din, dout (original); din_S, din_E, din_W, dout, d_S_E, d_S_W, d_E_S, d_W_S, control East / West (proposed).

These multiplexers have a fixed size, independent of the buffer size. Three other multiplexers are necessary to control the read and write process of the FIFO. The size of the multiplexers that control the buffer slots increases with the depth of the buffer. These multiplexers are controlled by the FSM of the FIFO. In order to reduce routing and extra multiplexers, we adopted the strategy of changing the control part of each channel.

Some rules were defined in order to enable the use of buffers from one channel by other channels. When a channel fills its whole FIFO, it can borrow more buffer words from its neighbors. First the channel asks the right neighbor for buffer words, and if it still needs more buffers, it tries to borrow from the left neighbor FIFO. For this purpose, some signals of each channel must be sent to the neighboring channels in order to control its stored flits.

The information about how many units of the buffer are used for each channel is set by an external control (this information is received by an input pin of the router), and this control can be dynamically altered outside the router.

Each channel can receive three data inputs. Let us consider the South Channel as an example, with the following inputs: its own input (din_S), the right neighbor input (din_E) and the left neighbor input (din_W). For illustration purposes, let us assume we are using a router with FIFO depth equal to 4, and that the router needs to be configured as follows: South Channel with buffer depth equal to 9, East Channel with buffer depth equal to 2, West Channel with buffer depth equal to 1 and North Channel with FIFO depth equal to 4. In such case, the South Channel needs to borrow buffer slots from its neighbors. As the East Channel occupies 2 of its 4 slots, it can lend 2 slots to its neighbor, but even then the South Channel still needs 3 more buffer slots. As the West Channel occupies only 1 slot, the 3 missing slots can be lent by it to the South Channel. When the South Channel has a flit stored in the East Channel, and this flit must be sent to the output, it is passed from the East Channel to the South Channel (d_E_S), and then sent directly to the output of the South Channel (dout_S) by a multiplexer. The South Channel has the following outputs: its own output (dout_S) and two more outputs (d_S_E and d_S_W) to send the flits stored in its buffer that belong to neighbor channels.
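The worked example above (base depth 4, South configured to 9) can be checked with a short sketch of the right-then-left borrowing rule. Only the South row of the neighbor table comes from the text; the other three rows, as well as the function and variable names, are assumptions made for illustration.

```python
# (right neighbor, left neighbor) for each channel. Only the South entry is
# stated in the text; the other rows assume each port "faces" the router center.
NEIGHBORS = {
    "South": ("East", "West"),   # from the paper's example
    "North": ("West", "East"),   # assumed
    "East": ("North", "South"),  # assumed
    "West": ("South", "North"),  # assumed
}


def plan_borrowing(base_depth, configured_depth, channel):
    """Return {neighbor: slots_borrowed} for `channel`, right neighbor first."""
    needed = configured_depth[channel] - base_depth
    plan = {}
    for neighbor in NEIGHBORS[channel]:
        if needed <= 0:
            break
        # A neighbor can lend only the slots it is not configured to use.
        spare = max(base_depth - configured_depth[neighbor], 0)
        taken = min(spare, needed)
        if taken:
            plan[neighbor] = taken
            needed -= taken
    if needed > 0:
        raise ValueError(f"{channel} cannot reach the configured depth")
    return plan


depths = {"South": 9, "East": 2, "West": 1, "North": 4}
print(plan_borrowing(4, depths, "South"))  # {'East': 2, 'West': 3}
```

The upper bound of 3 times the original depth follows from the same rule: with both neighbors lending all 4 slots, a channel reaches 4 + 4 + 4 = 12 slots.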

The choice of returning flits stored in a neighbor channel to their own channel before sending them to the output was made in order to avoid changes in other mechanisms of the architecture, for example in the routing algorithm. With this decision, the implementation complexity required for the router to function correctly was reduced.

In the proposed router architecture each channel knows how many slots of its own buffer are being used by the channel, and how many are being borrowed from neighbors.

Each channel controls the storage of its flits, whether they are stored in its own buffer or in the neighbor channel buffers. In this design we do not consider the possibility of the Local Channel using neighboring buffers; only the South, North, West and East Channels of a router can make use of their adjacent neighbors.

V. RESULTS

A. Performance Evaluation

In order to define the buffer size of each channel according to its need, we use a simple decision mechanism described by Algorithm 1. This algorithm distributes buffers among the channels of each router. It analyzes each channel individually, and performs the buffer distribution according to the number of hotspots. First the algorithm identifies the hotspots of each router. We consider as hotspots those channels that receive a large number of flits and therefore require many buffers because of contention. Whenever hotspots are detected, the algorithm checks whether the buffers of the neighbor channels can be borrowed.

When there is only one hotspot in the router, the buffer loaning process involves both the right and left neighbors. Otherwise, if there are two hotspots in the router, the neighbor buffers are divided between the hotspots. For these experiments, fixed-length packets of 80 flits were assumed, and the link width was defined as 16 bits. From these decisions, with the bandwidth information presented in fig. 1 and with a defined frequency, we determined the interval, in number of cycles, at which each packet is sent to the link. The bandwidth of each link considers the needs of all cores that use it. When a neighbor channel is in use but is not a hotspot, the hotspot channel leaves only one buffer to the neighbor channel and uses the remaining ones. When there are three hotspots in a router, no buffer can be lent. The constant buffer_increase in the pseudocode refers to the channel buffer depth.
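The text does not give the formula used to derive the packet interval, so the following is only one plausible reading: the interval is the time needed for the required link bandwidth to account for one 80-flit, 16-bit packet, expressed in clock cycles. The unit of bandwidth (MB/s with 1 MB = 10^6 bytes), the example frequency and the function name are all assumptions.

```python
def injection_interval_cycles(bandwidth_mb_per_s, frequency_hz,
                              flits_per_packet=80, flit_bits=16):
    """Cycles between consecutive packet injections on one link.

    Assumption: bandwidth is given in MB/s with 1 MB = 10**6 bytes.
    """
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    packet_bits = flits_per_packet * flit_bits        # 80 * 16 = 1280 bits
    bits_per_second = bandwidth_mb_per_s * 8 * 10**6
    # packet_bits / bits_per_second is the time per packet in seconds;
    # multiplying by the clock frequency converts it to cycles.
    return round(packet_bits * frequency_hz / bits_per_second)


# Hypothetical link: 100 MB/s required bandwidth, 500 MHz clock.
print(injection_interval_cycles(100, 500_000_000))  # 800
```

Under this reading, an interval shorter than the 80 cycles needed to transmit the packet itself would indicate a link that cannot sustain the requested bandwidth.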

Algorithm 1. Pseudocode for the use of the buffers of the router channels

FOR router = 0 TO number_of_routers DO
  FOR channel = 0 TO 3 DO
    IF channel is hotspot THEN
      IF number_hotspots_router = 1 THEN
        IF buffer_right_neighbor is not used THEN
          buffer_channel = buffer_channel + buffer_increase;
          buffer_right_neighbor = buffer_right_neighbor - buffer_increase;
        ELSE
          buffer_channel = buffer_channel + (buffer_increase-1);
          buffer_right_neighbor = buffer_right_neighbor - (buffer_increase-1);
        END IF
        IF buffer_left_neighbor is not used THEN
          buffer_channel = buffer_channel + buffer_increase;
          buffer_left_neighbor = buffer_left_neighbor - buffer_increase;
        ELSE
          buffer_channel = buffer_channel + (buffer_increase-1);
          buffer_left_neighbor = buffer_left_neighbor - (buffer_increase-1);
        END IF
      ELSE IF number_hotspots_router = 2 THEN
        IF buffer_right_neighbor is not used THEN
          buffer_channel = buffer_channel + buffer_increase;
          buffer_right_neighbor = buffer_right_neighbor - buffer_increase;
        ELSE IF right_neighbor_channel is not hotspot THEN
          buffer_channel = buffer_channel + (buffer_increase-1);
          buffer_right_neighbor = buffer_right_neighbor - (buffer_increase-1);
        ELSE IF buffer_left_neighbor is not used THEN
          buffer_channel = buffer_channel + buffer_increase;
          buffer_left_neighbor = buffer_left_neighbor - buffer_increase;
        ELSE
          buffer_channel = buffer_channel + (buffer_increase-1);
          buffer_left_neighbor = buffer_left_neighbor - (buffer_increase-1);
        END IF
      END IF
    END IF
  END FOR
END FOR
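Algorithm 1 can be transcribed for a single router as the sketch below. It is an interpretation under stated assumptions: the neighbor table is hypothetical except for the South row, "not used" is modeled as a channel absent from a `used` set, and the function name and data layout are illustrative rather than the authors' simulator code.

```python
# (right, left) neighbors; only the South row is stated in the text.
NEIGHBORS = {
    "South": ("East", "West"),
    "North": ("West", "East"),   # assumed
    "East": ("North", "South"),  # assumed
    "West": ("South", "North"),  # assumed
}


def distribute_buffers(base_depth, hotspots, used):
    """Return channel -> buffer depth for one router after lending.

    hotspots: set of channels flagged as hotspots.
    used: set of channels that carry any traffic (hotspots included).
    """
    depth = {channel: base_depth for channel in NEIGHBORS}

    def borrow(channel, neighbor):
        # An unused neighbor lends everything; a used one keeps one slot.
        amount = base_depth if neighbor not in used else base_depth - 1
        depth[channel] += amount
        depth[neighbor] -= amount

    for channel in NEIGHBORS:
        if channel not in hotspots:
            continue
        right, left = NEIGHBORS[channel]
        if len(hotspots) == 1:
            borrow(channel, right)
            borrow(channel, left)
        elif len(hotspots) == 2:
            # Prefer the right neighbor unless it is itself a busy hotspot.
            if right not in used or right not in hotspots:
                borrow(channel, right)
            else:
                borrow(channel, left)
        # With three or more hotspots no buffer is lent.
    return depth


print(distribute_buffers(4, hotspots={"South"}, used={"South", "East"}))
# {'South': 11, 'North': 4, 'East': 1, 'West': 0}
```

In this example the idle West channel lends all 4 slots, while the active East channel keeps one slot for its own traffic, matching the rule described before the algorithm.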

Using this algorithm for buffer loaning, we obtained the results of fig. 4, which presents the behavior of the NoC for the VOPD application using a 4x3 NoC with buffer depth equal to 4. The Y axis represents the number of flits that must wait for a buffer to become available before being sent to the next router, and the X axis presents the input channels of the NoC. Fig. 4(a) illustrates the results obtained with homogeneous routers and fig. 4(b) presents the hotspots obtained with the reconfigurable router. One can observe that with the reconfigurable router the number of hotspots was drastically reduced.

We analyzed four applications (MPEG4, MWD, VOPD and Xbox) that present different traffic conditions. We fixed the buffer size of the reconfigurable router to 4, and sized the buffer of the original router so as to obtain the same latency as the reconfigurable one. To reach the same average latency obtained with the reconfigurable router, the homogeneous router needed much larger buffers, as can be seen in fig. 5.

In fig. 5 each application has two columns: the first is the reconfigurable router with FIFO depth equal to 4, and the second is the buffer size that a router fixed at design time requires to reach the same latency. The results presented in fig. 5 show that the buffer depth greatly influences the average latency. As these applications present different link bandwidths and different numbers of connections among the cores, we can confirm that, to reach the same average latency obtained with the reconfigurable router, the homogeneous router required larger and in many cases useless buffer depths. In fig. 5 we observe that the MWD application presented the smallest latency with the reconfigurable router, and to obtain the same latency value the homogeneous router required a buffer with 11 positions. This can be explained by the traffic behavior of the MWD application, which has few connections among the cores. In this case, since many buffers are not used, they can be lent to the channels in use. This experiment shows that it is possible to use a single NoC with the reconfigurable router to obtain low latency for each of the applications considered.

With the reconfigurable router we also verified that there was a better distribution of flits in the network. Fig. 6 shows the number of flits that must wait for a buffer to become available before being sent to the next router. We call this situation buffer overhead, and high values of buffer overhead represent the hotspots, i.e., channels that need a larger buffer depth.

In the X axis of fig. 6 one can see how many channels have approximately the same number of flits that could not be stored at the first attempt. Hence, fig. 6 shows the number of channels that generate network congestion, for the original router in fig. 6(a) and for the reconfigurable one in fig. 6(b).

Figure 4. Number of flits overhead per router (a) in the original architecture and (b) in the new router, for VOPD.

Figure 5. Four applications for the same latency, with the buffer depth of the original router sized in order to obtain the same latency as a reconfigurable router with buffer depth equal to 4.

Figure 6. Occurrences of flits that need to wait for the availability of the buffer to be sent to the next router, for the MPEG4 application: (a) original router; (b) reconfigurable router.

One can observe that for the original router there are 10 channels (first two columns in fig. 6(a)) that potentially decrease the performance of the NoC. For the proposed router, however, only 2 channels decrease the performance, and with a much smaller buffer overhead value, since the buffer lending/borrowing process better distributes the storage of flits in the NoC. This experiment shows that, for the same buffer size, the reconfigurable router makes better use of the channel FIFOs, and hence less power is wasted.

B. Area, Power and Frequency Results

The proposed router was described in VHDL, and we used the ModelSim tool to simulate the code. We analyzed the average power consumption results to a CMOS 0.18um process technology using the Synopsys Power Compiler tool. Table I presents the results obtained with this analysis. The channel width contains n+2 bits, for n data bits and 2 bits for control.

The power results were obtained at the maximum frequency of each architecture. To reach the latency of a reconfigurable router with buffer depth 4, the original router needed different buffer sizes for the four applications mentioned. The original router needs more buffers to reach the same average latency as the reconfigurable router, as demonstrated in Fig. 5.

In a router, the largest power dissipation comes from the flip-flops of the buffers. The reconfigurable router has fewer flip-flops (for the same performance), but needs more multiplexers to make the borrowing/lending process work. The flip-flops have an almost constant (and high) power consumption due to the clock, which is not the case for the multiplexers.


The power increase is not a linear function of the buffer depth because of the multiplexers presented in Fig. 3. These multiplexers define which flit in the FIFO must be sent to the channel output. Hence, the gate increase from FIFO depth 7 to 9 is lower than that from FIFO depth 9 to 11.

With the applications simulated in this work, we confirmed that the original homogeneous NoC underutilizes the router considerably, since not all of its channels are used. In such cases, the extra buffers on the unused channels of the original router consume power unnecessarily.

The reconfigurable router does not present penalties in the maximum frequency, except when compared with the original router with buffer size 7. For the same performance, the reconfigurable router presents a large reduction of power dissipation for 16-bit and 32-bit data (plus 2 control bits). The larger the link size, the larger the power savings allowed by the reconfigurable router, since in this case the impact of the extra circuits required to allow reconfiguration is diminished. With respect to area, the reconfigurable router also presents gains with 32-bit links when compared with an original router that needs a buffer depth greater than 7 (Table I).

Considering the four applications used in this work, the reconfigurable router reduces power consumption by 41% on average without the use of low-power techniques and, for the same performance, uses a 64% smaller buffer depth (4 instead of 11).
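These figures follow directly from the power values in Table I. As a check, the short Python sketch below recomputes them, assuming the reduction is defined as 1 - P_reconfigurable / P_homogeneous with each power taken at the maximum frequency of its architecture. The variable and function names are illustrative only.

```python
# Power consumption in mW at maximum frequency, copied from Table I.
# Keys are channel widths in bits; the reconfigurable router uses depth 4.
RECONFIGURABLE_MW = {18: 13.45, 34: 21.67}
HOMOGENEOUS_MW = {
    18: {11: 25.20, 9: 21.88, 7: 21.25},
    34: {11: 45.62, 9: 38.37, 7: 29.57},
}


def reduction_percent(original, new):
    """Relative saving of `new` with respect to `original`, in percent."""
    return 100.0 * (1.0 - new / original)


reductions = [
    reduction_percent(power, RECONFIGURABLE_MW[width])
    for width, by_depth in HOMOGENEOUS_MW.items()
    for power in by_depth.values()
]

average = sum(reductions) / len(reductions)      # about 41%
worst, best = min(reductions), max(reductions)   # about 26.7% and 52.5%
depth_saving = reduction_percent(11, 4)          # about 64%

print(f"average {average:.1f}%, worst {worst:.1f}%, best {best:.1f}%")
print(f"buffer depth saving {depth_saving:.1f}%")
```

The six individual values match the "Power Reduction" column of Table I to one decimal place, and their mean rounds to the 41% quoted in the text.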

With the new router, a single NoC can connect different applications that may change their communication patterns at run time, whereas with homogeneous buffers the design would have to be modified at design time to reach the optimum case. The proposed technique avoids costly redesigns and new manufacturing.

VI. CONCLUSION

This paper presented the advantage of using a NoC with reconfigurable routers instead of homogeneous ones. Using reconfiguration, one can dynamically change the buffer depth of each channel according to the needs of the application, increasing the power efficiency of the system for the same performance level. We verified that, to reach the same performance obtained with the reconfigurable router, the original architecture needs many more buffers.

The new router, while reaching the same performance as the original architecture, obtained a reduction in power consumption of approximately 27% in the worst case and 53% in the best case analyzed.

Moreover, with the new architecture it is possible to reconfigure the router in accordance with the application, obtaining similar performance even when the application radically changes.

As future work, we plan to use hardware monitors to adjust buffer depths without external intervention.

REFERENCES

[1] Benini, L., De Micheli, G., “Network on Chips: A new SoC Paradigm”, IEEE Computer, 2002, pp. 70-78.

[2] Azimi, M., et al., “Integration Challenges and Tradeoff for Tera-scale Architectures”, Intel Technology Journal, vol. 11, no. 3, 2007.

[3] Manferdelli, L. Govindaraju, N.K. Crall, C., “Challenges and Opportunities in Many-Core Computing”. Proceeding of the IEEE, 96, no. 5, 2008, pp. 808-815.

[4] Xuning, C. and Peh, L., “Leakage power modeling and optimization in interconnection networks”, in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003, pp. 90-95.

[5] Benini, T., and De Micheli, G., “Analysis of power consumption on switch fabrics in network routers”, in Proceedings of the 39th Design Automation Conference - (DAC), 2002, pp. 524-529.

[6] Cardoso, R., Kreutz, M., Carro, L., Susin, A, “Design Space Exploration on Heterogeneous Network-on-chip”. International Symposium on Circuits and Systems, vol. 1, 2005, pp. 428-431.

[7] Kreutz, M., Marcon, C., Carro, L., Wagner, F., Susin A., “Design Space Exploration Comparing Homogenous and Heterogeneous Network-on-Chip Architectures”. Symposium on Integrated Circuits and Systems Design, SBCCI’05, 2005, pp. 190-195.

[8] Ahmad, B., Ahmadinia, A., Arslan, T., “Dynamically Reconfigurable NOC with Bus Based Interface for Ease of Integration and Reduced Designed Time”. NASA/ESA Conference on Adaptive Hardware and Systems, AHS’08, 2008, pp. 309-314.

[9] Ahonen, T., Nurmi J., “Hierarchically Heterogeneous Network-on-Chip”, The International Conference on Computer as a Tool, EUROCON, 2007, pp. 2580-2586.

[10] Eun Lee, S., Bagherzadeh, N., “Increasing the Throughput of an Adaptive Router in Network-on-Chip (NoC)”, International Conference on Hardware/ Software Codesign and System. Synthesis, 2006, pp. 82-87.

[11] Nicopoulos, C., Park, Kim, D., Vijaykrishnan, N., Yousif, S., Das C., “ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers”, Proc. 39th Ann. Int. Symp. Microarchitecture (MICRO), 2006, pp. 333 – 346.

[12] Bertozzi, D. et al., “NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip”, IEEE Transaction on Parallel and Distributed System, 2005, pp. 113-129.

[13] K. Srinivasan and K. S. Chatha, “A Low Complexity Heuristic for Design of Custom Network-on-Chip Architectures,” in Proceedings of Design, Automation and Test in Europe Conf.,vol. 1, 2006, pp. 1-6.

[14] Andrews J., Baker N., “Xbox 360 System Architecture”, IEEE Micro, vol. 26, no. 2, 2006, pp. 25-37.

[15] Zeferino, A., Susin, A., “SoCIN: A Parametric and Scalable Network-on-Chip” in 17th Symposium on Integrated Circuits and System (SBCCI), 2003, pp. 169-174.

[16] Wu C. and Chi H., “Design of a High-Performance Switch for Circuit-Switched On-Chip Networks”, Asian Solid-State Circuits Conference, 2005, pp. 481-484.

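TABLE I. POWER CONSUMPTION, FREQUENCY AND AREA RESULTS FOR THE RECONFIGURABLE AND ORIGINAL ROUTER ARCHITECTURES FOR THE SAME LATENCY VALUES

| Channel width | Homogeneous: buffer depth | Homogeneous: power (mW) | Homogeneous: area (um²) | Homogeneous: max. frequency (MHz) | Reconfigurable: buffer depth | Reconfigurable: power (mW) | Reconfigurable: area (um²) | Reconfigurable: max. frequency (MHz) | Power reduction @ max. frequency |
|---|---|---|---|---|---|---|---|---|---|
| 18 bits | 11 | 25.20 | 242,615 | 231 | 4 | 13.45 | 230,645 | 232 | 46.6% |
| 18 bits | 9 | 21.88 | 208,335 | 229 | 4 | 13.45 | 230,645 | 232 | 38.5% |
| 18 bits | 7 | 21.25 | 170,385 | 273 | 4 | 13.45 | 230,645 | 232 | 36.7% |
| 34 bits | 11 | 45.62 | 423,175 | 230 | 4 | 21.67 | 337,471 | 230 | 52.5% |
| 34 bits | 9 | 38.37 | 356,295 | 228 | 4 | 21.67 | 337,471 | 230 | 43.5% |
| 34 bits | 7 | 29.57 | 287,005 | 271 | 4 | 21.67 | 337,471 | 230 | 26.7% |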