
Associating Packets of Heterogeneous Cores Using a Synchronizer Wrapper for NoCs

Débora Matos, Luigi Carro, Altamiro Susin
Instituto de Informática - PPGC, Federal University of Rio Grande do Sul - UFRGS, Porto Alegre, Brazil
{debora.matos, carro, susin}@inf.ufrgs.br

Abstract— MPSoC systems are composed of heterogeneous cores; as a result, the cores can present different bandwidths, different clock domains, or an irregular traffic behavior. When networks-on-chip (NoCs) are used to connect these cores, some synchronization solution is very often needed, and due to the problems mentioned above, it may be required for both synchronous and asynchronous NoCs. In this paper we present a network interface (NI) with a synchronizer wrapper solution. We verified its applicability for different channel widths and buffer depths of a NoC. These network interfaces were used to connect an H.264 decoder, and the simulation results demonstrate that the wrapper provides a reliable synchronization solution without compromising the latency of the network. The interfaces have been successfully implemented in a 0.18 µm CMOS technology.

I. INTRODUCTION

Networks-on-Chip (NoCs) have gained notable attention as a solution to the Multi-Processor System-on-Chip (MPSoC) interconnection problem. MPSoCs integrate numerous heterogeneous cores, which are responsible for different functions and operate at different bandwidths. However, some problems are still not fully resolved when a NoC is used to interconnect a system. In data flow systems, such as multimedia applications, the NoC needs to ensure that the correct sequence of data is delivered to each core. In these situations a circuit to synchronize the data is indispensable, since working with tens of cores under a cycle-accurate synchronous approach can easily require a huge design effort. Synchronization in data flow systems was first raised in [1], where the solution adopted was static buffering.

The synchronization problem in NoCs has been discussed for globally asynchronous locally synchronous (GALS) systems [2]-[6]. The main problem reported in GALS architectures is the possibility of synchronization failure between two different clock domains (metastability). In this work, however, we show the need for a solution to another synchronization problem, one that is present even in synchronous NoC architectures. Fig. 1 shows a simple example of the need for synchronization that arises in any NoC architecture. In this example, core A sends data to two different cores (B and C), and the output data produced by nodes B and C must be added by another block.

Let us imagine that cores B and C have different processing times, or that, depending on the input data, the time to produce each output can vary in each core. Even if one uses some mechanism to synchronize the clocks, the problem shown in fig. 1 remains unresolved, since the processing time of each core can differ or, in the worst case, the traffic may have an irregular or unknown behavior, which is very common in multimedia applications. Thus, there are at least three situations, caused by the heterogeneity of the cores, that demand synchronization solutions when networks-on-chip are used in data flow systems. All of these situations might happen at the same time, and they have not been properly explored in the literature. These circumstances are:

• When each core presents a different required bandwidth;
• When the cores present different clock domains;
• When the cores present an irregular traffic behavior, which can change depending on the data being processed.

Figure 1. Example of the need for synchronization.

To connect cores using a network-on-chip, one needs an interface module between the cores and the NoC routers to pack and unpack the data. These modules are known as network interface (NI) modules, and they are the best candidates to take responsibility for the synchronization problem.

Our main contribution in this article is a new proposal for a network interface that solves the synchronization problem present in data flow applications connected by NoC architectures. We propose a synchronizer NI to associate packets that arrive from different cores in a NoC. The interface modules have a flexible network configuration and provide reliable synchronization among packets coming from different cores. Moreover, our proposal does not interfere with the latency of the packets in the network. As a case study, we describe the implementation of an H.264 decoder and show the NI needed for this application.

The paper is organized as follows. In the next section we discuss related work. In Section III we briefly introduce the H.264 decoder used in our case study. The network interfaces are described in Section IV, with focus on the synchronizer module. Synthesis results are shown in Section V, and finally, in Section VI, we conclude the paper and indicate future work.

II. RELATED WORKS

In the literature one can find solutions to the synchronization problem in globally asynchronous, locally synchronous (GALS) systems. GALS architectures have been proposed to solve the problem of distributing a skew-free synchronous clock over the whole chip [8]. When this solution is adopted, some strategy must be chosen to synchronize the data [2]-[6].

In [2], three FIFO architectures were presented for interfacing asynchronous NoCs with synchronous subsystems, or two adjacent multi-synchronous routers. In this proposal, depending on the synchronizer selected, there is a trade-off between latency and robustness.

In [5], the need for synchronization in NoCs due to cores operating in multiple clock domains is raised. As a solution, the authors proposed the use of an asynchronous FIFO with dual clocks. In [6] the authors also developed a wrapper with handshake, synchronizer, buffer and logic circuits for a GALS system.

In [3] the authors proposed on-chip and off-chip interfaces for a GALS NoC architecture to resynchronize data between the synchronous and asynchronous NoC domains. The interfaces are based on the Gray code principle for encoding the read and write pointers of the FIFO. According to the authors, this option was adopted because in a Gray code only a single bit changes between two successive values, avoiding the erroneous transient values and metastability problems that could occur with a binary code. Later, the same authors proposed another FIFO solution, arguing that the Gray code has limitations such as implementation complexity, support for only power-of-two encodings, problems with pointer increment, and others [4]. In the new solution they used a Johnson encoding for the FIFO read and write pointers.
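To make the single-bit-change property concrete, the following is a minimal VHDL sketch of a binary-to-Gray converter. It illustrates the general principle only; the entity name and interface are our own and are not taken from [3] or [4].

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative binary-to-Gray converter: between two successive
-- binary counter values, the Gray output changes in exactly one
-- bit, so a pointer sampled in another clock domain can be off by
-- at most one position instead of taking a transient garbage value.
entity bin2gray is
  generic (WIDTH : natural := 4);
  port (bin  : in  std_logic_vector(WIDTH-1 downto 0);
        gray : out std_logic_vector(WIDTH-1 downto 0));
end entity bin2gray;

architecture rtl of bin2gray is
begin
  -- gray(i) = bin(i+1) xor bin(i); the MSB is passed through.
  gray(WIDTH-1) <= bin(WIDTH-1);
  g: for i in 0 to WIDTH-2 generate
    gray(i) <= bin(i+1) xor bin(i);
  end generate g;
end architecture rtl;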

Another proposal, called NIUGAP, used Gray code to reorder packets [7]. In this case, the Gray code is generated and added to the packet header as a tag at every new time stamp, so each packet carries a time tag and a sequence tag. This proposal considers the possibility of each packet taking a different path through the network, and the NIUGAP is used to reorder the out-of-order packets that arrive at the network interface. Similarly to our proposal, NIUGAP uses a sequence tag in the packets; however, it does not attack the same problem as this paper, since NIUGAP rebuilds a single stream, while here we are concerned with several streams that must arrive in order to be processed together by a subsequent core.

The need for synchronization in NoCs to associate different streams from different cores, as shown in this paper, has not yet been addressed in the literature. The solution is needed in both synchronous and asynchronous architectures, and none of the proposals in the related works solves this problem. Moreover, although some works report the use of NoCs for the H.264 decoder, no solution for the synchronization problem has been shown in those designs [9]-[10]. In [10] the authors do not state whether they used a high-level synchronization mechanism or a cycle-by-cycle design (with severe impact on design time). As reported in [10], although the flit header contains a time stamp, this information was used only to determine the latency of flit delivery, not for synchronization purposes.

III. H.264 VIDEO DECODER

The synchronization solution was investigated for a NoC-based H.264 video decoder. We developed in VHDL a minimal H.264 video decoder with the following blocks: Network Adaptation Layer (NAL), Parser, Entropy Decoding, Inverse Transforms (IT), Inverse Quantization (IQ), Intra-Prediction and Descrambler. In this work, we model this application as six cores to be connected in a NoC according to the graph of fig. 2. In this project, due to the data dependence between the blocks, we decided to use a single router for the Parser and Entropy modules, and a single router for the Inverse Transforms (IT) and Inverse Quantization (IQ) modules.

First, the NAL block receives the packets with the incoming bitstream and sends the video elements to the Parser block. The Parser decodes the slice header and slice data from the syntactic elements to obtain prediction and residual encoding data. The prediction data is sent to Intra-Prediction, and the residuals are decoded by Entropy and sent to IQ and then to IT [11]. The Inverse Quantization module performs a scalar multiplication, and the Inverse Transforms consist of the Inverse Discrete Cosine Transform (IDCT) and the Hadamard transform. Intra-Prediction reconstructs the macroblocks (16x16 samples) from the neighboring macroblocks of the same frame. IQ and IT generate the decoded residuals, and this data is summed with the prediction data by the Adder block to obtain the final reconstructed video. This latter block returns the reconstructed video to the Intra-Prediction block to be used as reference for the next blocks [11].

The Intra module can use up to 9 prediction modes for 4x4 Luma blocks, 4 modes for 16x16 Luma and 4 more modes for 8x8 Chroma. The mode to be used is defined by the encoder for each macroblock. Then, according to the predicted block size, the correct neighbors are selected and the calculations are performed for the defined mode. Due to the possibility of different prediction modes, and to the use of either blocks (4x4 samples) or macroblocks (16x16 samples) for prediction, the execution time of the Intra module is not constant, and the output data are sent to the other modules at a rate that is unknown at design time and variable at run time. A similar fact occurs with the IT + IQ modules, since when the encoder selects predictions for macroblocks instead of blocks, the decoder must also execute the inverse Hadamard transform.

As each macroblock can have a different mode definition, the same synchronization problem discussed previously appears here. The decoded residuals must be summed with the correct sample predicted by the Intra-Prediction module. Since the IQ + IT and Intra modules work at different rates and present an irregular traffic behavior, the sequence of data arriving at the Adder needs to be properly associated. Fig. 2 shows the minimum, mean and maximum rates for each link of the H.264 decoder. In the next section we present the wrapper proposed to solve this problem.

Figure 2. H.264 Decoder NoC graph, connecting the NAL, Parser + Entropy, Intra, IT + IQ, Adder and Descrambler cores; each link is annotated with its (minimum, mean, maximum) rate in Mb/s.

IV. PROPOSED NETWORK INTERFACES

The network used in this project is the SoCIN NoC [12]. It has a regular 2D-mesh topology and a parametric router architecture. It is a VHDL soft core that uses wormhole switching and the XY routing algorithm. There is a round-robin arbiter at each output channel, and the flow control is based on a handshake protocol.

To illustrate the proposed wrapper, we use the same case study shown in fig. 1 and in fig. 2, but now with the cores mapped onto a NoC, as depicted in fig. 3. This example reproduces the same problem present in the H.264 decoder. Fig. 3 presents the blocks that constitute the network interfaces. All network interfaces contain the same blocks, with the exception of the Adder NI, which additionally contains the synchronizer and memory blocks. In the next sub-sections we detail the blocks that compose the network interfaces.

A. Synchronizer Wrapper

To solve the synchronization problem, we use tags to indicate the sequence of each packet. In this case, core A sends the same tag in the packets destined for cores B and C. These packets are disassembled in the respective NIs of cores B and C. After the processing in each core, new output data is generated and the flits are sent to the network by the NI. The NIs send the sequence tag and the source address of the core in the header; a possible layout is sketched below.
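As an illustration of how these fields can share a header flit, the sketch below shows one possible layout for a 16-bit channel. The field positions and widths are our assumptions for illustration and are not part of the SoCIN packet format.

-- Hypothetical 16-bit header-flit layout carrying the fields named
-- in the text: destination (routing), source address and sequence
-- tag. A 4-bit tag matches the 16-word synchronizer memories of
-- Section V, since the tag also serves as the memory address.
constant DEST_HI : natural := 15;  -- bits 15..12: XY destination
constant DEST_LO : natural := 12;
constant SRC_HI  : natural := 11;  -- bits 11..8: source core address
constant SRC_LO  : natural := 8;
constant TAG_HI  : natural := 7;   -- bits 7..4: sequence tag
constant TAG_LO  : natural := 4;   -- bits 3..0: control/reserved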

Figure 3. Network interfaces with the synchronizer module. Each NI contains Pack, Unpack and Handshake Control blocks; the Adder NI additionally contains the Synchronizer and the Mem B and Mem C buffers.

When a packet is received in the NI of the Adder core, after the unpacking process, the wrapper executes the pseudocode of Algorithm 1. First, it verifies which core sent the packet; for example, if it has received a packet from core B, the wrapper checks whether the data with the same sequence tag from core C has already arrived. In the affirmative case, the sum can be executed; otherwise, the data that came from core B needs to be stored in its corresponding memory. To store the data that has arrived and must wait for its sequence-tag match, two small memories are used, one for each core. The sequence tag is incremented for each new data item up to an established limit value, and the same code is used to address the memory. To indicate whether a data word stored in the memory is current, we use a validity bit: when this bit is equal to 1, the data is valid; otherwise, the matching data has not yet arrived.

Algorithm 1. Synchronizer pseudocode.

IF source == A THEN
    data_A = data_unpack;
    addr_B = tag_A;
    data_B = data_out_mem_B;
    validity_bit_B = data_B[msb];
    IF validity_bit_B == 1 THEN
        WE_mem_B = 1;
        data_in_mem_B = 0;
        data_ok = 1;
    ELSE
        WE_mem_A = 1;
        data_in_mem_A = 1 & data_A;
        data_ok = 0;
    END IF;
ELSE IF source == B THEN
    data_B = data_unpack;
    addr_A = tag_B;
    data_A = data_out_mem_A;
    validity_bit_A = data_A[msb];
    IF validity_bit_A == 1 THEN
        WE_mem_A = 1;
        data_in_mem_A = 0;
        data_ok = 1;
    ELSE
        WE_mem_B = 1;
        data_in_mem_B = 1 & data_B;
        data_ok = 0;
    END IF;
END IF;
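A minimal VHDL sketch of Algorithm 1 is shown below, assuming 32-bit data, a 4-bit sequence tag that directly addresses the 16-word memories, and the validity bit stored in the MSB of each 33-bit memory word (the dimensions reported in Section V). The memories are assumed to have combinational reads, and all port and signal names are illustrative; this is a sketch of the technique, not the exact implementation synthesized in Section V.

library ieee;
use ieee.std_logic_1164.all;

entity synchronizer is
  port (
    clk             : in  std_logic;
    flit_valid      : in  std_logic;                      -- unpacked data ready
    source          : in  std_logic_vector(1 downto 0);   -- decoded source id
    tag             : in  std_logic_vector(3 downto 0);   -- sequence tag of arrival
    data_unpack     : in  std_logic_vector(31 downto 0);
    data_out_mem_a  : in  std_logic_vector(32 downto 0);  -- {valid, data}
    data_out_mem_b  : in  std_logic_vector(32 downto 0);
    addr_a, addr_b  : out std_logic_vector(3 downto 0);
    we_mem_a        : out std_logic;
    we_mem_b        : out std_logic;
    data_in_mem_a   : out std_logic_vector(32 downto 0);
    data_in_mem_b   : out std_logic_vector(32 downto 0);
    data_ok         : out std_logic);                     -- pair complete: Adder may sum
end entity synchronizer;

architecture rtl of synchronizer is
  constant SRC_A : std_logic_vector(1 downto 0) := "00";
  constant SRC_B : std_logic_vector(1 downto 0) := "01";
begin
  -- The sequence tag doubles as the memory address; the memories
  -- are assumed to have combinational reads, so the partner word
  -- is visible in the same cycle the tag arrives.
  addr_a <= tag;
  addr_b <= tag;

  process (clk)
  begin
    if rising_edge(clk) then
      we_mem_a <= '0'; we_mem_b <= '0'; data_ok <= '0';
      if flit_valid = '1' then
        if source = SRC_A then
          if data_out_mem_b(32) = '1' then        -- partner already stored
            we_mem_b      <= '1';
            data_in_mem_b <= (others => '0');     -- consume: clear validity bit
            data_ok       <= '1';                 -- both operands ready
          else
            we_mem_a      <= '1';
            data_in_mem_a <= '1' & data_unpack;   -- store and mark valid
          end if;
        elsif source = SRC_B then
          if data_out_mem_a(32) = '1' then
            we_mem_a      <= '1';
            data_in_mem_a <= (others => '0');
            data_ok       <= '1';
          else
            we_mem_b      <= '1';
            data_in_mem_b <= '1' & data_unpack;
          end if;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;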

The source and tag information are already known from the header flit. Let us imagine that the source address refers to core B. In this case, while the payload flits are still arriving at the NI, the corresponding data from core C for the same sequence tag can already be checked: the data relative to core C is read from its memory in the NI, and its validity is verified in parallel with the processing of the remaining flits sent by core B. For this reason, the proposal does not add to the network latency. When the data read from the memory buffer of core C is valid, the sum is executed and the validity bit is reset. If the data is not valid, the arriving data is stored in its corresponding memory and the validity bit is set.

B. Pack and Unpack Interfaces

As the NoC uses the wormhole strategy, the Pack block in the NI divides the packet into multiple flow control units called flits. In the example of fig. 3, one needs an unpacking circuit to assemble the flits that compose the complete data, and a packing circuit to divide the data generated by each core before it is sent over the network. Each of these blocks was implemented as a Finite State Machine (FSM), and they can be used in a NoC with configurable buffer depth and channel width, as sketched below. Many cores can share the same pack and unpack NIs; to do so, one simply uses a multiplexer to select the source and destination cores. For example, if a core receives packets from several different cores over the NoC, a single unpack block is required.
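As a sketch of this configurability, the entity declaration below exposes the channel width and buffer depth as VHDL generics. The entity name and port list are hypothetical, chosen for illustration rather than copied from the SoCIN or NI sources.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical top level of a shared pack/unpack NI, with the two
-- configuration parameters discussed in the text as generics.
entity ni_pack_unpack is
  generic (
    CHANNEL_WIDTH : natural := 16;    -- flit width in bits
    BUFFER_DEPTH  : natural := 4);    -- router buffer depth
  port (
    clk, rst     : in  std_logic;
    -- core side (one of several cores, selected by a multiplexer)
    core_data    : in  std_logic_vector(31 downto 0);
    core_valid   : in  std_logic;
    -- network side, handshake flow control as in SoCIN
    flit_out     : out std_logic_vector(CHANNEL_WIDTH-1 downto 0);
    val          : out std_logic;     -- valid flit on the channel
    ack          : in  std_logic;     -- flit accepted by the router
    flit_in      : in  std_logic_vector(CHANNEL_WIDTH-1 downto 0);
    data_to_core : out std_logic_vector(31 downto 0));
end entity ni_pack_unpack;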

As the NoC uses a handshake protocol, the data sent and received must be controlled according to the network availability. The FSM implemented for the Pack block controls when the flits can be sent to the network according to the ack (acknowledge) signal, as sketched below. When the Pack block has a valid flit, the val signal is set, and while ack is not equal to 1 the flit to be sent must wait for the network to become available. This control is realized by the Handshake Control block. The Unpack block receives the flits and groups them to reconstruct the original data, removing the information that composes the header (control, routing definition, sequence tag and source information) inserted by the Pack block.
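The following is a minimal sketch of such a control FSM for the Pack side, holding val asserted until ack arrives. The two-state machine and the signal names are our illustration of the handshake described above, not the exact SoCIN protocol implementation.

library ieee;
use ieee.std_logic_1164.all;

entity pack_handshake is
  port (
    clk, rst   : in  std_logic;
    flit_ready : in  std_logic;   -- a flit is available to send
    ack        : in  std_logic;   -- router accepted the flit
    val        : out std_logic;   -- flit on the channel is valid
    next_flit  : out std_logic);  -- advance to the next flit
end entity pack_handshake;

architecture rtl of pack_handshake is
  type state_t is (IDLE, SEND);
  signal state : state_t;
begin
  val       <= '1' when state = SEND else '0';
  next_flit <= '1' when (state = SEND and ack = '1') else '0';

  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= IDLE;
      else
        case state is
          when IDLE =>
            if flit_ready = '1' then
              state <= SEND;        -- present the flit, assert val
            end if;
          when SEND =>
            if ack = '1' then       -- network took the flit
              if flit_ready = '1' then
                state <= SEND;      -- send flits back to back
              else
                state <= IDLE;
              end if;
            end if;                 -- else: hold val until ack
        end case;
      end if;
    end if;
  end process;
end architecture rtl;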

V. RESULTS

The proposed network interfaces were described in VHDL, and we used the ModelSim tool to simulate the code. We obtained results for a 0.18 µm CMOS process technology using the Synopsys Power Compiler tool. Table I presents the synthesis results for the network interfaces: NI Standard refers to the network interfaces used in cores A, B and C, and NI Standard + Synchronizer + Memories refers to the network interface used in the Adder core. The network interfaces have been successfully integrated in the SoCIN network.

For the H.264 decoder, we verified that a channel width of 16 bits is sufficient to reach the required throughput. For this reason, the synthesis results were obtained for an NI data width of 16 bits. The memory word size used was 33 bits (1 validity bit plus 32 data bits), and the memory depth was 16 words. Although the proposed circuit incurs an overhead over the standard NI, it must be noticed that, without this solution, the synchronization problem would have to be solved by a circuit embedded in each core, with the same overhead. On the other hand, the proposed solution does not require any modification of the cores.
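For reference, the sketch below declares one of the two synchronizer memories with exactly these dimensions; the type and signal names are illustrative.

-- One synchronizer memory as dimensioned in the text: 16 words of
-- 33 bits, with bit 32 holding the validity flag and bits 31..0
-- the data. The 4-bit sequence tag doubles as the address, so tags
-- wrap after 16 items, matching the memory depth.
type sync_mem_t is array (0 to 15) of std_logic_vector(32 downto 0);
signal mem_b : sync_mem_t := (others => (others => '0'));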

TABLE I. SYNTHESIS RESULTS FOR THE NETWORK INTERFACES

                                         Area     Area      Max. Freq.  Power @ 100 MHz  Power @ max. freq.
                                         (gates)  (um²)     (MHz)       (mW)             (mW)
NI Standard                              752      18,735    457         0.73             3.35
NI Standard + Synchronizer               2,122    50,019    127         1.95             2.47
NI Standard + Synchronizer + Memories    10,748   169,817   126         6.54             8.24

VI. CONCLUSION

In this paper we have shown a solution to the synchronization problem in NoCs. We showed that an NI with a synchronizer is needed for asynchronous as well as synchronous communication architectures. We designed network interfaces with a wrapper to associate packets from different sources that need to be placed in sequence, thus solving the synchronization problem present in data flow applications. The network interfaces were designed with a flexible network configuration and, in addition, a single set of pack and unpack network interfaces can be shared by multiple cores. As future work, we intend to extend the NI to associate more than two sources of data.

REFERENCES

[1] E. Lee and D. Messerschmitt, “Synchronous Data Flow: Describing Signal Processing Algorithm for Parallel Computation”, COMPCON, pp. 310-315, 1987.

[2] A. Sheibanyrad and A. Greiner, “Hybrid-Timing FIFOs to Use on Networks-on-Chip in GALS Architectures”, International Conference on Embedded Systems and Applications - ESA, pp. 27-33, 2007.

[3] E. Beigné and P. Vivet, “Design of On-chip and Off-chip Interfaces for a GALS NoC Architecture”, ASYNC, pp. 172-183, 2006.

[4] Y. Thonnart et al., “Design and Implementation of a GALS Adapter for ANoC Based Architectures”, ASYNC, pp. 13-22, 2009.

[5] S. Kundu and S. Chattopadhyay, “Interfacing Cores and Routers in Network-on-Chip Using GALS”, ISIC, pp. 154-157, 2007.

[6] W. Ning, G. Fen and W. Fei, “Design of a GALS Wrapper for Network on Chip”, World Congress on Computer Science and Information Engineering - CSIE, pp. 592-595, 2009.

[7] D. Kim et al., “NIUGAP: Low Latency Network Interface Architecture with Gray Code for Networks-on-Chip”, ISCAS, pp. 3901-3905, 2006.

[8] D. M. Chapiro, “Globally-Asynchronous Locally-Synchronous Systems”, PhD thesis, Stanford University, 1984.

[9] J. Xu et al., “A Design Methodology for Application-Specific Networks-on-Chip”, ACM Transactions on Embedded Computing Systems - TECS, pp. 263-280, 2006.

[10] A. Agarwal et al., “System-Level Modeling of a NoC-Based H.264 Decoder”, Annual IEEE Systems Conference, pp. 1-7, 2008.

[11] L. Agostini et al., “FPGA Design of a H.264/AVC Main Profile Decoder for HDTV”, International Conference on Field Programmable Logic and Applications - FPL, pp. 501-506, 2006.

[12] C. Zeferino and A. Susin, “SoCIN: A Parametric and Scalable Network-on-Chip”, Symposium on Integrated Circuits and Systems Design - SBCCI, pp. 169-174, 2003.

