

[IEEE 2010 18th IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC) - Madrid, Spain (2010.09.27-2010.09.29)]

Network Interface to Synchronize Multiple Packets on NoC-based Systems-on-Chip

Debora Matos, Miklecio Costa, Luigi Carro, Altamiro Susin Instituto de Informatica - PPGC

Federal University of Rio Grande do Sul - UFRGS, Porto Alegre, Brazil

{debora.matos, miklecio.costa, carro, susin}@inf.ufrgs.br

Abstract- The cores of a System-on-Chip (SoC) connected by Networks-on-Chip (NoCs) need interfaces to properly send and receive packets. However, in this interfacing, different situations can occur when heterogeneous cores are applied. Applications may require, for example, an irregular traffic behavior or present a large bandwidth variation. These situations may lead to problems in data synchronization. In this paper we show a simple and efficient synchronization solution which, although known in the literature, has not yet been applied to the NoC-based systems scenario. Using a network interface as the synchronization mechanism, the proposed circuit handles data dependencies instead of letting each core solve the synchronization problems at higher levels. As a case study, an H.264 video decoder was used to show the need for and advantage of our approach. The proposed design is FIFO-based and can be applied when multiple packets from different sources need to be synchronized in a single destination. Simulations were performed to verify the functionality and efficiency of the synchronization solution. The interfaces were implemented in VHDL and synthesized using a 180nm CMOS technology.

I. INTRODUCTION

Networks-on-Chip have gained notable attention as a solution to the System-on-Chip interconnection problem. SoCs integrate numerous heterogeneous cores, which are responsible for different functions and operate at different bandwidths. However, there are many particularities that need to be solved when a NoC is used to connect a system. In data flow systems, like multimedia applications, the NoC needs to ensure that the correct sequence of data is delivered to each core. In these situations a circuit to synchronize the data is indispensable, since working with tens of cores under a cycle-accurate synchronous approach can easily demand a huge design effort. Synchronization in data flow systems was first raised in [1], and the solution adopted for this problem was the use of static buffering. That work describes an application as a data flow graph, in which each node consumes some quantity of tokens, performs some computation with them and produces some quantity of other tokens for the next node.

The synchronization problem in NoCs has been discussed for globally asynchronous locally synchronous (GALS) systems [2-6]. The main problem reported in GALS architectures is the possibility of synchronization failure between two different clock domains (metastability). This need for synchronization also occurs when the same clock frequency is used but in different phases, the so-called mesochronous clock domains.

978-1-4244-6471-5/10/$26.00 ©2010 IEEE

However, in this work we present the need for a solution to another synchronization problem, one that is present even in synchronous NoC architectures and may be required in any NoC-based design. An example of this need arises when a core A sends data to different cores (B and C), and the output data produced by nodes B and C need to be added in the correct sequence by another core.

Let us imagine that cores B and C have different processing times, or that, depending on the input data, the time to produce each output datum varies in each core. Consequently, the adder core would receive operands at different rates, which necessarily demands data synchronization. However, even if one uses some alternative to synchronize the clocks, the problem is still not solved, since the processing time of each core can differ or, in the worst case, the traffic might have an irregular or unknown behavior, something very common in multimedia applications.
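The scenario above can be made concrete with a small behavioral sketch (illustrative Python only; the actual design is hardware, and all names here are ours): two producers emit operands for the same logical sequence at different, data-dependent rates, and per-source FIFOs let the consumer pair operands by sequence position rather than by arrival time.

```python
from collections import deque

# Two producer streams (cores B and C) emit results for the same logical
# sequence, but their interleaved arrival order at the adder is arbitrary
# because each core's processing time varies with the input data.
arrivals = [("B", 1), ("C", 10), ("B", 2), ("B", 3), ("C", 20), ("C", 30)]

# Per-source FIFOs decouple arrival order from consumption order.
fifos = {"B": deque(), "C": deque()}
sums = []
for src, value in arrivals:
    fifos[src].append(value)
    # The consumer fires only when one operand from EACH source is ready.
    if all(fifos[s] for s in fifos):
        sums.append(fifos["B"].popleft() + fifos["C"].popleft())

print(sums)  # [11, 22, 33]: operands paired by sequence, not by arrival
```

Without the buffering, the adder would have to consume operands in raw arrival order and would combine values belonging to different sequence positions.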

As the cores have heterogeneous behavior, they may present different clock domains, require different bandwidths, or exhibit an irregular traffic behavior that depends on the input data being processed. Moreover, with the simple use of a NoC it is not possible to ensure a constant time for sending the packets, since the NoC links can be shared by packets belonging to different nodes. All these situations might happen at the same time, and because of them an application can demand synchronization solutions to associate the packets.

The synchronization requirement shown in this work has not been properly discussed in the NoC literature. These problems have been treated at higher levels of abstraction. In MPSoCs (multiprocessor SoCs), software mechanisms allow the synchronization of different processors by means of access to shared variables. Completely hardware-based solutions, in contrast, have been little discussed for systems interconnected by a NoC. However, as presented in [18], the task of packet assembly, indispensable for interfacing a core to the NoC, achieved a very low latency when implemented in hardware instead of software. In the same manner, our strategy was developed in hardware to obtain high communication performance.

To connect cores using a network-on-chip, one needs an interface module between the cores and the NoC routers to pack and unpack the data. These modules are known as network interface (NI) modules, and they are the best candidates to take responsibility for the synchronization problem.


Our main contribution in this paper is to propose a network interface that solves the synchronization problem existing in applications interconnected by NoC architectures. The main idea of our proposal is to use FIFOs to synchronize the packets that arrive from different sources at a single destination. The interface modules have a flexible network configuration and provide reliable synchronization among packets coming from different cores. As a case study we describe the implementation of the H.264 decoder and show the need for our NI in this application.

According to Fig. 1, the NI blocks are responsible for unpacking the packets from the NoC and packing the data from the core to the network. In our solution we added to this architecture a synchronizer, together with the unpack block, whose function is to associate packets received from different nodes before sending them to the core.

Figure 1. Example of NoC-core communication using a network interface.

The paper is organized as follows. In the next section we mention some related works. In section III we briefly introduce the H.264 decoder that was used in our case study. The network interfaces are described in section IV, with focus on the synchronizer module. Synthesis results are shown in section V, and finally in section VI we conclude the paper.

II. RELATED WORKS

In the literature one can find solutions to the synchronization problem in GALS systems. GALS architectures have been proposed to solve the problems of distributing a skew-free synchronous clock over the whole chip [8] and of jitter correction. When this solution is adopted, some strategy needs to be chosen to synchronize the data [2-6].

As MPSoCs present isolated heterogeneous synchronous Processing Elements (PEs), some synchronization solution among the PEs is needed. A simple synchronization solution to provide skew tolerance when different clock domains are used in a system is a FIFO or a latch with two clock inputs, one for the transmitter and the other for the receiver [9]. This method can avoid synchronization failure under skew variations. A similar idea was proposed in [10], which presented a bi-synchronous FIFO to interface systems that work at different clock frequencies and phases. To synchronize the write and read pointers of the FIFO, the authors used an encoding algorithm based on a token ring with two consecutive tokens to avoid changes in the register. A pointer, a data buffer and a full and empty detector were proposed and detailed in that paper.

It is very common to use a FIFO in these cases because it provides a simple interface that decouples the transmitter from the receiver and maximizes throughput [11]. Following this idea, in [11] the author proposed a modular FIFO design for a standard-cell library, configurable for different NoC requirements such as FIFO capacity, clocked or clockless interface, and synchronization latency. This FIFO is composed of a ring of stages and keeps the FIFO control logic separated from the synchronizers. In this way, it is possible to define the synchronizer latency according to the clock frequency and reliability requirements.

Based on the ideas presented in [10] and [11], a recent work proposed a mesochronous synchronizer and a dual-clock FIFO for synchronization between systems with the same frequency but different phases, and between systems with independent clocks and offsets, respectively [12]. The mesochronous synchronizer is composed of a set of parallel latches controlled by a counter and driven by the incoming clock. The output data are sent to a flip-flop through a multiplexer, also controlled by a counter driven by the receiver clock.

In [5] the same goal of synchronization in NoCs whose cores operate in multiple clock domains was attacked, and the authors also presented as a solution the use of an asynchronous FIFO with dual clocks. In [6] the authors developed a wrapper with handshake, synchronizer, buffer and logic circuits for a GALS system.

In [2] three FIFO architectures were presented for interfacing asynchronous NoCs with synchronous subsystems or two adjacent multi-synchronous routers. In this proposal, according to the synchronizer selected, one has a trade-off between latency and robustness.

In [3] the authors proposed on-chip and off-chip interfaces for a GALS NoC architecture to resynchronize between the synchronous and asynchronous NoC domains. The interfaces are based on the Gray code principle for encoding the read and write pointers of the FIFO. According to the authors, this option was adopted because in a Gray code only a single bit changes between two successive values, avoiding erroneous transient values and metastability problems, which could occur with a binary code. Later the same authors proposed another FIFO solution, arguing that the Gray code presents limitations such as implementation complexity, encoding of only powers of two, problems in pointer increment, and others [4]. As the new solution, they used Johnson encoding for the FIFO read and write pointers.
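The single-bit-change property that motivates Gray-coded FIFO pointers can be checked with a short sketch (illustrative Python, not taken from the cited works):

```python
def gray(n: int) -> int:
    # Binary-reflected Gray code: successive values differ in exactly one bit.
    return n ^ (n >> 1)

# A pointer sampled in another clock domain while it is changing can only be
# read as the old or the new value, never a transient third value, because
# only one bit is in flight at a time.
for i in range(15):
    assert bin(gray(i) ^ gray(i + 1)).count("1") == 1

# Binary counterexample: 7 -> 8 flips four bits at once, so a mid-transition
# sample could be any of several wrong values.
print(f"{7:04b} -> {8:04b}")  # 0111 -> 1000
```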

Another proposal, called NIUGAP, used Gray code to reorder packets [7]. In this case, the Gray code is generated and added to the packet header as a tag at every new time stamp; each packet carries a time tag and a sequence tag. This proposal considers the possibility of each packet taking a different path through the network, and the NIUGAP is used to reorder out-of-order packets that arrive at the network interface. However, this solution does not attack the same problem as this paper: NIUGAP rebuilds a single stream, while here we are concerned with several streams that must arrive in order to be processed by a single destination core.

The need for synchronization in NoCs to associate different streams from different cores, as shown in this paper, has not yet been addressed in the literature. The solution is needed in both synchronous and asynchronous architectures, and nevertheless,



none of the proposals in the related works solves this problem. Besides, although some works report the use of NoCs for the H.264 decoder [13-15], their designs show no solution for the synchronization problem. In [13] the focus was a design methodology for an application-specific NoC (ASNoC): the authors proposed a methodology to automatically generate hierarchical topologies with distributed shared memories, using an H.264 HDTV and Smart Camera system to present their proposal. In [14] a methodology for modeling concurrent system architectures was proposed, and a Nokia H.264 decoder was used to illustrate it. In [15] the proposal consists in exploring several mapping possibilities of an H.264 decoder design onto a Star-Mesh NoC architecture. In none of these articles do the authors comment on whether they used some high-level synchronization mechanism or a cycle-by-cycle design (with severe impact on design time).

III. H.264 VIDEO DECODER

The synchronization solution was investigated for a NoC-based H.264 video decoder. We developed in VHDL a minimal H.264 video decoder with the following blocks: Network Adaptation Layer (NAL), Parser, Entropy Decoding, Inverse Transforms (IT), Inverse Quantization (IQ), Intra Prediction and Descrambler. In this first design we did not aggregate the Inter Prediction module to the decoder. We first modeled this application as six cores to be connected in a NoC according to the graph of Fig. 2. In this project, due to the data dependence of each block, we decided to use a single router for the Parser and Entropy modules, and a single router for the Inverse Transforms (IT) and Inverse Quantization (IQ) modules.

Figure 2. H.264 decoder mapped to a 2D-mesh NoC.

First, the NAL block receives the packets with the incoming bitstream and sends the video elements to the Parser block. The Parser decodes the slice header and slice data from the syntactic elements to obtain prediction and residual encoding data. The prediction data are sent to Intra Prediction, and the residuals are decoded by Entropy and sent to IQ and then to IT [16]. The Inverse Quantization module performs a scalar multiplication, and the Inverse Transforms consist of the Inverse Discrete Cosine Transform (IDCT) and the Inverse Hadamard Transform. Intra Prediction reconstructs the macroblocks (16x16 samples) from the neighboring macroblocks of the same frame. IQ and IT generate the decoded residuals, and those data are summed with the prediction data by the Adder block to obtain the final reconstructed video. The reconstructed video samples return to the Intra Prediction block to be used as reference for the next blocks [16].
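As a behavioral illustration of the Adder block's per-sample operation (a Python sketch, not the paper's VHDL, under the standard H.264 assumption that reconstructed samples are clipped to the 8-bit range):

```python
def reconstruct(predictions, residuals):
    # The Adder sums each decoded residual with its predicted sample and
    # clips the result to [0, 255], as H.264 sample reconstruction requires.
    return [max(0, min(255, p + r)) for p, r in zip(predictions, residuals)]

# Four samples per data word (32 bits = 4 x 8-bit samples), matching the
# parallelism of the Adder block described later in the paper.
print(reconstruct([100, 250, 10, 128], [30, 20, -15, 0]))  # [130, 255, 0, 128]
```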

The Intra module can use up to 9 prediction modes for 4x4 Luma blocks, 4 modes for 16x16 Luma and 4 more modes for 8x8 Chroma. The mode to be used is defined by the encoder for each macroblock. Then, according to the predicted block size, the correct neighbors are selected and the calculations are performed for the defined mode. Due to the possibility of different prediction modes and the use of blocks (4x4 samples) or macroblocks (16x16 samples) for prediction, the execution time of the Intra module is not constant, and the output data are sent to the other modules at a rate that is unknown at design time and variable at run time.

The decoded residuals need to be summed with the correct samples predicted by the Intra module. As the IQ + IT and Intra modules work at different rates, present an irregular traffic behavior and sit in different nodes of the NoC, the sequence of data arriving at the Adder needs to be properly associated. Fig. 3 illustrates the minimum, mean and maximum traffic rates for each link of the H.264 video decoder.

Figure 3. H.264 Decoder NoC graph.

The Intra block was designed to produce four output samples at a time, while IT and IQ produce one output sample at a time. For this reason, the Intra block operates at a clock frequency lower than that of IT + IQ. A possibility would be to have IT + IQ operating at a lower clock frequency; in this case, one can use two IT + IQ blocks to produce the decoded residuals to be added to the prediction data in the Adder block, as illustrated in Fig. 4. Other partitionings could be defined, e.g., grouping the Adder together with the Intra block. However, even in this case one would need some solution to synchronize the data from Intra and IT + IQ. In the next section we present the proposed network interface developed to solve this problem.

Figure 4. H.264 decoder NoC graph with 2 IT + IQ blocks.



IV. PROPOSED NETWORK INTERFACE

The network used in this project was the SoCIN NoC [17]. It has a regular 2D-mesh topology and a configurable router architecture. SoCIN is a VHDL soft-core; it uses wormhole switching, and the routing algorithm is X-Y. This NoC has a Round-Robin arbiter at each output channel, and the flow control is based on a handshake protocol.
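The deterministic X-Y routing can be sketched behaviorally (illustrative Python; the coordinate representation and function names are ours, not SoCIN's):

```python
def xy_route(src, dst):
    # Deterministic X-Y routing on a 2D mesh: resolve the X offset first,
    # then the Y offset. Every packet between a given source/destination
    # pair takes the same path, so packets arrive in order -- the property
    # the synchronizer relies on later in the paper.
    x, y = src
    dx, dy = dst
    step = lambda a, b: 1 if b > a else -1
    path = []
    while x != dx:                 # travel along X
        x += step(x, dx)
        path.append((x, y))
    while y != dy:                 # then travel along Y
        y += step(y, dy)
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))  # [(1, 0), (2, 0), (2, 1)]
```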

All network interfaces in Fig. 2 present the same blocks, except for the Adder's, which additionally contains the synchronizer and buffer blocks. The NI also provides typical functionalities such as packing, unpacking and adaptation of the communication protocol between core and router. In the next subsections we detail the blocks that compose the network interfaces.

A. Network Interface Blocks

As the NoC uses the wormhole strategy, the packing in the NI divides the data into multiple flow control units called flits. One needs an unpacking circuit in the network interface to reassemble the flits that compose the complete data, and a packing circuit to divide the data generated in each core before sending it into the network. Each of these blocks was implemented as a Finite State Machine (FSM). Many cores can share the same pack and unpack NIs; for this, one simply uses a multiplexer to define the source and destination cores.
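A behavioral sketch of the packing step follows (illustrative Python; the 16-bit flit width matches the channel width used later in the paper, but the field layout and flit labels here are assumptions, not the actual packet encoding):

```python
def pack(data_word, routing_info, source_id, flit_width=16):
    # Split a 32-bit data word into flit_width-bit payload flits, preceded
    # by a header flit carrying the routing definition and a flit carrying
    # the source identification. The last flit is marked as the trailer.
    flits = [("HEADER", routing_info), ("SOURCE", source_id)]
    mask = (1 << flit_width) - 1
    for shift in range(32 - flit_width, -1, -flit_width):
        flits.append(("PAYLOAD", (data_word >> shift) & mask))
    flits[-1] = ("TRAILER", flits[-1][1])  # trailer closes the packet
    return flits

print(pack(0xAABBCCDD, routing_info=0x21, source_id=3))
# [('HEADER', 33), ('SOURCE', 3), ('PAYLOAD', 43707), ('TRAILER', 52445)]
```

The unpacking circuit performs the inverse walk over the flits, concatenating the payload fields and discarding the header information.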

Fig. 5 illustrates the blocks that compose the proposed network interface, including the synchronizer module. The interface is able to transport data in parallel in two directions: from core to router and from router to core. In the first direction, core to router, the Packing block allows the core to send packets to different destinations; each destination has different routing information, and the Packing block is responsible for putting this information into the packet header. In the opposite direction, router to core, the interface allows the core to receive data from N different sources simultaneously.

Figure 5. Proposed network interface.

As the NoC uses a handshake protocol, one needs to control the data sent and received according to the network availability. Before a packet is injected into the router (core to router direction), the Packing block performs the communication protocol with the NoC.

The FSM implemented for the Packing block controls when the flits can be sent to the network. When the Packing block has a valid flit, the signal val is set and, while the NoC signal ack is not equal to 1, the flit waits for network availability. That way, the network interface avoids sending flits to the NoC while the router buffers are full, which guarantees that flits are sent only when there is a free buffer slot in the input channel of the router. Similarly, when packets arrive on the local channel (router to core direction), the Unpacking block receives a val signal and, when the flit can be accepted, the ack signal is sent and the packet is unpacked. The extra information inserted during the packing process is removed, leaving just the packet payload: the Unpacking block receives the flits and groups them to compose the original data, removing the information that composes the header, such as control, routing definition and source information.
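The val/ack flow control can be modeled behaviorally as follows (illustrative Python; the toy router and the cycle counting are ours, added only to show the stall-until-ack behavior):

```python
def send_flit(flit, router_ready):
    # Sender side of the val/ack handshake: assert val together with the
    # flit and hold both until the router answers ack = 1 (a free buffer
    # slot exists in its input channel).
    cycles_stalled = 0
    while not router_ready():      # ack stays 0: router buffers full
        cycles_stalled += 1        # the flit simply waits; nothing is lost
    return cycles_stalled          # ack = 1: flit accepted this cycle

# Toy router whose input buffer frees a slot after two cycles of backpressure.
state = {"busy_cycles": 2}
def router_ready():
    if state["busy_cycles"] > 0:
        state["busy_cycles"] -= 1
        return False
    return True

print(send_flit("flit0", router_ready))  # 2 cycles of backpressure
```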

Data are packed according to the packet format of the applied NoC. This packet format was slightly adapted in order to be used with the proposed network interface, as shown in Fig. 6. Besides data, the packet contains extra information such as control codes, to identify the type of flit, and the routing definition. This information is presented in the header flit, which describes the number of hops needed to reach the destination according to the X-Y routing algorithm. Therefore, during the packing step, it is necessary to use a static table to obtain the routing definition associated with the destination core.

Due to the irregular traffic behavior of the cores, one still needs FIFOs at the outputs of the cores to store the data and guarantee that no data word is lost while the flits are being sent to the network. Loss could occur when the latency to send a packet to the network is larger than the number of clock cycles needed to generate another output data word. Even when the traffic is regular this need exists, since the sending of packets depends on NoC availability, buffer depth, channel width and the path taken by the flits.

Figure 6. Packet format of the NoC adapted to the proposed network interface: each flit is n+2 bits wide, with a 2-bit control code distinguishing the header flit (routing definition), a reserved flit (source identification), payload flits (data) and the trailer flit (last data).

B. Synchronizer

The synchronization of data received from different cores is the main contribution of this paper. The block in the network interface responsible for this functionality has the internal organization detailed in the block diagram of Fig. 7. This block was designed to be expanded in a scalable way, according to the number of source cores that send data to the destination core. For that, a static FIFO buffer was associated with each of the N source cores that need to have their packets synchronized at the node. Before writing into a buffer, it is necessary to identify which source node sent the packet. The source signal is identified in the Unpacking block, and this code indicates the FIFO used to store the data word. After identifying the FIFO where the data word should be stored, if the data word is totally reconstituted (data_ok = 1) and the corresponding FIFO is not full, the wr signal is activated and the data word is written into that buffer.
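The write path just described can be summarized in a behavioral sketch (illustrative Python; the signal names source, data_ok and wr follow the text, everything else is an assumption):

```python
def write_side(fifos, source, data_word, data_ok, fifo_depth=8):
    # Write path of the synchronizer: the source code from the Unpacking
    # block selects the FIFO; wr is asserted only when the data word is
    # fully reconstituted (data_ok = 1) and that FIFO is not full.
    wr = data_ok and len(fifos[source]) < fifo_depth
    if wr:
        fifos[source].append(data_word)
    return wr

fifos = {0: [], 1: []}
assert write_side(fifos, source=0, data_word=0x11, data_ok=True)
assert not write_side(fifos, source=1, data_word=0x22, data_ok=False)  # incomplete word
print(fifos)  # only FIFO 0 received its word
```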



Figure 7. Detailed structure of the Network Interface.

The data_ok signal is also responsible for controlling the handshake protocol between the NI and the router. The Unpacking block can only receive flits while a packet is not yet completely unpacked or when the FSM of the Unpacking block is in idle mode, in which case the NI is ready to receive a new packet (in all these cases data_ok is equal to 0).

Each of the N buffers has an empty signal that indicates when the buffer holds no data word. If there is at least one data word in each buffer, the core can receive a new data word, and the buffer that stores the output data of the core is not full, then the rd signal is enabled simultaneously for all the read signals of the FIFOs. As the packets are delivered to the NI in the correct order, thanks to the deterministic routing algorithm used, the data received from different nodes and stored in their respective buffers present the same sequence for a given FIFO position. Besides, the appropriate buffer depth should be chosen according to the rates of the cores whose packets will be associated, i.e., the buffer depth should be large enough to ensure that no FIFO is full while another synchronization FIFO is empty.
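The read path can likewise be sketched behaviorally (illustrative Python; names and the dictionary representation are ours):

```python
def read_side(fifos, core_ready, out_fifo_free):
    # Read path of the synchronizer: rd is enabled for ALL synchronization
    # FIFOs at once, but only when every FIFO holds at least one word, the
    # core can accept a new word, and the core's output FIFO is not full.
    rd = all(fifos.values()) and core_ready and out_fifo_free
    if not rd:
        return None
    # The same position in each FIFO corresponds to the same sequence
    # number, guaranteed by the deterministic X-Y routing.
    return {src: fifo.pop(0) for src, fifo in fifos.items()}

fifos = {"intra": [10, 11], "it_iq": [20]}
print(read_side(fifos, core_ready=True, out_fifo_free=True))   # one word per source
print(read_side(fifos, core_ready=True, out_fifo_free=True))   # None: it_iq empty
```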

This solution can be used even when a large number of packets from many different cores need to be synchronized simultaneously. For that reason, the synchronization problem can be solved with the proposed network interface both for the decoder of Fig. 3 and for the partitioned decoder of Fig. 4. In the first case, one needs to associate packets from 2 cores, and for this 2 FIFOs are used. In the second case, packets from 3 cores need to be associated, using 3 FIFOs. After the synchronization of the packets received from each source core, the data can be sent to the destination core to start the computation on these input data.

V. RESULTS

The proposed network interface was described in VHDL, and we used the ModelSim tool to simulate the code. We analyzed the results for a 180nm CMOS process technology using the Cadence RTL Compiler tool. The network interfaces have been successfully integrated into the NoC. For the H.264 decoder, we verified that a channel width of 16 bits is sufficient to reach the rate required by the application. For this reason, the synthesis results were obtained for an NI data width of 16 bits.

Two experiments were performed to evaluate the costs of the NI design. In the first one, we verified the NI scalability as the number of cores to be synchronized varies. The second one consisted of an analysis of the impact of the buffer depth on the results. The H.264 decoder bitstream was considered to define those configurations. The Intra and IT + IQ blocks need to send samples to the Adder block, and each data word contains 32 bits (4 samples of 8 bits).

The Adder block is able to perform 4 operations in parallel, so the packets were configured to transport 4 samples at a time. For this reason, the width of the FIFOs was defined as 32 bits. Based on the variation in the decoder traffic behavior, the FIFOs were defined to store up to 8 words, deep enough to tolerate discrepancies between traffic rates. To simulate a variation in the number of synchronized cores, we varied the number of 8-word FIFOs in the synchronizer block.
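The sizing rationale above can be restated as a short sketch. The constants (4 samples of 8 bits per word, depth of 8 words) come from the paper; the helper `depth_covers_skew` is our own illustrative check, not part of the design: a FIFO overflows only if one source can run more than `depth` words ahead of the slowest source.

```python
# FIFO sizing for the H.264 case study (illustrative sketch).
SAMPLES_PER_WORD = 4                 # the Adder performs 4 operations in parallel
BITS_PER_SAMPLE = 8
FIFO_WIDTH = SAMPLES_PER_WORD * BITS_PER_SAMPLE   # 32-bit FIFO words
FIFO_DEPTH = 8                       # words; tolerates rate discrepancies

def depth_covers_skew(max_lead_words, depth=FIFO_DEPTH):
    """True if the worst-case lead of the fastest source over the
    slowest one (in words) fits in the FIFO, i.e. no FIFO fills up
    while its counterpart is still empty."""
    return max_lead_words <= depth
```

Choosing the depth thus reduces to bounding the worst-case rate mismatch between the associated sources.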

Table I presents the synthesis results of the network interfaces. For this analysis, we considered 3 cases of packet association, verifying the synthesis results when packets from 2 up to 4 cores need to be synchronized. The results present a linear growth in area and power consumption, directly related to the increase in the number of buffers. The maximum operating frequency decreases slightly as the number of buffers grows, because the circuit critical path lies in the buffer-writing mechanism, which depends on the source identification signal from the Unpacking block, as seen in Fig. 7.

To complete the evaluation of the network interface, the FIFO depth was varied. In this second experiment, the number of cores was the constant parameter; as in the H.264 decoder, the number of cores to be synchronized was set to 2. The graphs with the results of area, maximum frequency, and power dissipation at 200 MHz and at the maximum frequency for the network interfaces are presented in Fig. 8, 9, 10, and 11, respectively.


TABLE I. SYNTHESIS RESULTS FOR THE NETWORK INTERFACES

                             Area (um2)   Max. Freq. (MHz)   Power Dissip. @200MHz (mW)   Power Dissip. @max. freq. (mW)
NI to synchronize 2 cores      55,880          490                    13.6                          30.5
NI to synchronize 3 cores      80,046          472                    19.4                          42.3
NI to synchronize 4 cores     103,817          457                    25.6                          54.0

These graphs show the results obtained for buffer depths from 4 up to 32 words. As one can see, the area and power plots present a linear behavior. Again, the maximum frequency decreases with the buffer depth, but this reduction presents a logarithmic behavior. As in the first experiment, the circuit critical path is in the buffer-writing process, which also depends on the buffer status signal full. In other words, the buffer-writing delay depends on the buffer depth, which must be chosen according to the requirements of the application.

Figure 8. Area of the NI for different buffer depths.

Figure 9. Maximum frequency of the NI for different buffer depths.

Figure 10. Power dissipation of the NI at 200 MHz for different buffer depths.

Figure 11. Power consumption of the NI at maximum frequency for different buffer depths.

VI. CONCLUSION

In this paper we have shown a solution to the data synchronization problem in NoCs. We presented an NI with a synchronizer that is needed for asynchronous as well as synchronous communication architectures. We designed scalable network interfaces to associate packets from different sources. For the designed NI, we verified the synthesis results for different cases, exploring the number of cores whose packets are associated and the buffer depth. With this proposal, we solve the packet synchronization problem present in NoC-based applications, treating the dependencies in hardware as an efficient alternative to traditional software solutions. It must be noticed that, without this solution, the synchronization problem would have to be solved by a circuit embedded in each core. As seen in the synthesis results, the number of cores and the buffer depth have a linear impact on the costs of the proposed solution.

