[ieee 2014 ieee international symposium on circuits and systems (iscas) - melbourne vic, australia...

Adaptive Multiple Switching Strategy Toward an Ideal NoC

Debora Matos¹, Marcio Kreutz², Cezar Reinbrecht¹, Luigi Carro¹, Altamiro Susin¹ ¹UFRGS - Institute of Informatics, Porto Alegre, Brazil

{debora.matos, cezar.reinbrecht, carro, susin}@inf.ufrgs.br

Abstract— The demand for heterogeneous many-core systems has brought an exponential growth in the complexity of their interconnections. As a result, alternative Network-on-Chip (NoC) designs are being sought to meet the requirements in terms of power consumption and performance. Nevertheless, several of these proposals present very complex architectures, with virtual channels, tables and extra controls. In this paper we propose the combination of two advantageous strategies: hierarchical topology and adaptability. The combined use of these two techniques is novel in the literature, and it ensures high performance even when the communication rates of the application change. The gains in power and performance are possible due to the use of low-cost components in a hierarchical structure.

I. INTRODUCTION

With technology scaling, it has become possible to integrate tens of devices in a single chip [1-2]. However, with the increase in the number of these elements, a new concern has been raised about how these devices communicate and are interconnected, since these features are essential for the performance, energy and power of the application. This need led to the advent of NoCs, and countless studies have already analyzed such interconnect devices. However, the current acceleration in technology brings the need for even more complex systems that consume little energy and support constant application updates without losing performance, so other interconnect alternatives need to be investigated [3].

The performance and energy dissipation of a system depend strongly on the network topology and on how the cores are interconnected in the NoC. Selecting the network topology is one of the first steps in designing a complex system interconnection [4]. General alternatives such as regular topologies cannot deliver higher performance and/or lower power dissipation, since they present poor performance and a large power overhead [5]. Moreover, a general-purpose solution is not a good alternative in the embedded system context, where communication patterns are irregular and strongly application-dependent and the cores are completely heterogeneous, including in size [6].

The new philosophy of systems-on-chip has brought an enormous increase in the degree of integration, along with new challenges in designing the interconnection infrastructure. This work addresses these requirements by using different strategies for NoCs in order to meet both the performance and the low-power requirements of current and future many-core systems. The proposed solution combines hierarchy with adaptability in order to ensure high performance even when the communication rates of the system change. Combining these two strategies was only possible because the adaptability is implemented in a low-cost architecture, which largely reduces power consumption while maintaining high performance. The advantages of this strategy are illustrated in Fig. 1.

Figure 1: The advantages in the junction of hierarchy with adaptability.

II. PROPOSED NOC ARCHITECTURE

A. The use of hierarchy is essential in interconnect devices

Considering the scenario of future systems with several cores, different bandwidths will be required for different regions of the system. For this reason, a hierarchical topology is fundamental for the future of many-core systems in order to improve performance without using expensive interconnect devices.

A hierarchical NoC topology brings many advantages for complex designs, since it can exploit the communication locality of the system while maintaining the advantages of a NoC. The proposed topology is formed by crossbars at the local level and a mesh router architecture at the global level, as illustrated in Fig. 2. Clearly, the size of a crossbar is limited by its scalability, but its use is justified in a clustered communication approach. A recent study considers the possibility of interconnecting many cores with a crossbar switch [7].

978-1-4799-3432-4/14/$31.00 ©2014 IEEE

In the proposed hierarchical NoC, each application needs an appropriate mapping onto the proposed architecture and an adequate crossbar granularity [8]. The crossbar switch architecture allows parallel communications if the data are sent to different output ports. However, if there is a conflict for the same port, a Round Robin arbiter is used in this architecture to avoid starvation. Each crossbar switch is composed of a set of multiplexers and one arbiter for each output port. The architecture uses the wormhole approach and a deterministic source-based routing algorithm (XY). The XY routing algorithm supports deadlock-free data transmission, and the flow control is based on the handshake protocol. The use of a simple routing algorithm is an advantage of this proposal, unlike other topologies such as [9-10] that require a specific routing strategy. The packet header carries inter-cluster and intra-cluster information. The mesh topology level requires the routing and flit type information. Whenever a flit arrives at the destination router, the ID of the destination core in that cluster is read from the respective packet ID field.
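The two mechanisms described above, XY routing at the mesh level and Round Robin arbitration at each crossbar output port, can be sketched as follows. This is a minimal illustration and not the authors' RTL: the port names, the coordinate orientation, the per-hop evaluation of the route and the class `RoundRobinArbiter` are all assumptions introduced here.

```python
from typing import List, Optional, Tuple


def xy_next_port(cur: Tuple[int, int], dst: Tuple[int, int]) -> str:
    """XY routing: correct the X coordinate first, then the Y coordinate.

    Port names and axis orientation are hypothetical.
    """
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # arrived: deliver to the cluster crossbar


class RoundRobinArbiter:
    """One arbiter per output port; the last winner gets lowest priority."""

    def __init__(self, n_inputs: int) -> None:
        self.n = n_inputs
        self.last = n_inputs - 1  # input 0 has the highest priority at start

    def grant(self, requests: List[bool]) -> Optional[int]:
        # Scan the inputs starting right after the previous winner.
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None  # no input is requesting this output port
```

Because the search always starts after the previous winner, two inputs that keep requesting the same output port are served alternately, which is how the arbiter avoids starvation.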

Figure 2: Hierarchical proposed topology.

B. Adaptive NoCs are another need

In order to achieve the performance requirements of different systems, some works propose the use of adaptability in the switching mechanisms, such as VIP [11], HCS [12], and EVC [3]. In all those solutions, some type of circuit switching (CS) is implemented together with the packet switching (PS) mechanism. In [12], the authors proposed HCS (Hybrid Circuit Switching), which uses two separate mesh networks: one for data and another for setup. In VIP [11], similarly to the work presented in [12], the connections between two cores can bypass intermediate routers. In this case, in order to avoid starvation, a threshold in cycles is defined for VIPs and for PS. The latency advantages of VIPs are achieved by bypassing the pipeline stages of the PS mode whenever the CS mode is active. However, that strategy presents some limitations. Firstly, defining the VIPs dynamically requires a centralized control, which is sometimes not possible to implement, mainly in large systems. Besides, a setup network to construct the VIPs is required, as in the HCS architecture. EVC (Express Virtual Channel) [3] is another solution that allows bypassing intermediate routers along the path.

A different strategy to configure different topologies was presented in [10]. In that work, the NoC reconfiguration is achieved by inserting several inter-node switches (switch box) between the routers. Nevertheless, this NoC needs to apply a table-based routing scheme, increasing the complexity.

All presented proposals have similar goals and use some type of circuit-switched strategy to improve performance. However, these proposals use excessive area resources and, in some cases, the real costs of the architecture are unclear. Our proposed strategy presents many advantages in comparison with the others: it dynamically adapts the switching, it provides two circuit switching modes that allow the minimal latency to be obtained without limiting the frequency, and it uses minimal hardware to provide the adaptability.

Our adaptive proposal dynamically reconfigures the router among three switching modes according to the wire information extracted from floorplanning. These switching modes are: UCS (unbuffered circuit switching), BCS (buffered circuit switching) and PS (packet switching). The difference among the modes is related to the storage of the flits. In PS, the flits are stored in the conventional input channel buffer (FIFO). In UCS, the flits are not stored in the FIFO and pass directly to the output channel. In BCS, a single register (flip-flop) is required, since all flits will proceed to the destination without interruption. When a message enters the NoC, it is sent in UCS mode. This UCS mode is speculative, i.e., the circuit switching is set and the header flit tries to establish the path in this mode up to the destination router. If part of the path is in use by another message, the circuit switching is changed to packet mode, and the flits are stored in the input channel.

In the input channel, a set of multiplexers selects whether the data will be sent in packet, unbuffered circuit or buffered circuit mode, as shown in Fig. 3. In circuit mode, the flit is sent directly to MUX 2. MUX 3 defines whether the circuit switching is buffered or not.

Figure 3: Input channel architecture.

The adaptive router architecture is illustrated in Fig. 4. Each router port is composed of an Input Channel, an Output Channel and an Operation Mode Controller (OMC). The routers use a control to identify the wire length between two routers according to the design floorplan. In order to avoid long interconnections, the OMC performs a wire length estimation when in circuit switching mode, checking whether the estimated delay for the total wire length traversed by a packet is close to the clock period. When the two values are almost the same, the flit needs to be stored, and the mechanism changes from UCS to BCS. This choice is fully automatic.
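The wire length check can be illustrated with a small sketch that walks along a circuit path and decides, router by router, whether the flit can pass unbuffered (UCS) or must be registered (BCS). This is only an illustration under stated assumptions: the function `plan_circuit`, the per-hop delays in picoseconds and the `margin` that defines "close to the clock period" are hypothetical and do not come from the paper. The 1000 ps default corresponds to the 1GHz operating frequency used in the results.

```python
from typing import List


def plan_circuit(hop_delays_ps: List[int],
                 clock_period_ps: int = 1000,
                 margin: float = 0.9) -> List[str]:
    """Return "UCS" or "BCS" for each router along a circuit path.

    hop_delays_ps[i] is the estimated delay of the wire leading into
    router i (hypothetical values, as would be derived from the floorplan).
    """
    threshold = margin * clock_period_ps  # assumed meaning of "close to"
    modes = []
    accumulated = 0  # delay of the unregistered wire run so far
    for delay in hop_delays_ps:
        accumulated += delay
        if accumulated >= threshold:
            modes.append("BCS")  # store the flit in the single register
            accumulated = 0      # the register starts a new timing path
        else:
            modes.append("UCS")  # bypass this router without storage
    return modes


# With 350 ps per hop, every third router registers the flit.
print(plan_circuit([350] * 6))
```

With these illustrative delays, the flit bypasses two routers in UCS before a register is inserted.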

To define the switching mode, three conditions are taken into account: whether the input channel has flits to receive (in_val); whether the selected arbiter is free (xbar_free); and whether the input port of the destination router can receive data (circuit_allowed). If all three conditions are met, the circuit switching mode can be enabled.
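The enabling condition can be written as a simple conjunction of the three signals. The sketch below is an assumption-laden illustration, not the OMC implementation: the signal names come from the text, while the function names and the `near_clock_limit` flag (standing in for the outcome of the wire length estimation) are hypothetical. Returning PS when no flit is present is a simplification.

```python
def circuit_enabled(in_val: bool, xbar_free: bool, circuit_allowed: bool) -> bool:
    """Circuit switching may be enabled only when all three conditions hold."""
    return in_val and xbar_free and circuit_allowed


def switching_mode(in_val: bool, xbar_free: bool, circuit_allowed: bool,
                   near_clock_limit: bool) -> str:
    """Pick PS, UCS or BCS for the current flit (illustrative only)."""
    if not circuit_enabled(in_val, xbar_free, circuit_allowed):
        return "PS"   # fall back to the input FIFO
    if near_clock_limit:
        return "BCS"  # wire delay close to the clock period: use the register
    return "UCS"      # pass the flit straight to the output channel
```

For example, a flit that finds the arbiter busy (`xbar_free` false) is handled in PS regardless of the wire length estimate.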


Figure 4: MiNoC router architecture.

C. Ideal Solution: Hierarchical Topology and Adaptability

The solutions presented in Sections A and B each have advantages and disadvantages in different aspects (Fig. 1). In a system composed of a large number of heterogeneous cores, adopting some hierarchical solution is unavoidable. However, the primary problem of this decision arises when a hierarchical topology is designed for a specific application and, later, this application requires updates or changes in its communication rates, affecting the system performance.

The proposal presented in this paper covers exactly this case. Such situations are very common. For example, consider an application designed for a specific architecture in which some cores later receive functions not foreseen at design time (the system can be updated and processors can receive more tasks); the interconnection device will then need to fit this new scenario. An interconnect solution for this case can be achieved by integrating the two proposals, hierarchy and adaptability, as depicted in Fig. 1. With this solution, it is possible to mitigate the impact on performance whenever the system presents any change in its communication behavior. Taking into account that some cores of an application can have their rates increased due to an update, if the increase in bandwidth occurs among cores inside the same cluster, there will be minimal or no degradation in performance, since the crossbar already provides the maximum communication rate. However, if the messages increase at the global level, the adaptive strategy can reduce the loss of performance, as will be presented in the results.

III. RESULTS

A. Synthesis Results

Synthesis results for a 65nm process technology were analyzed for two benchmarks: NCS [13] and TVOPD [14]. In order to obtain accurate link lengths, the core areas were treated as black boxes in the synthesis. In this way, the correct wire information and architectural costs were considered in the logic synthesis (performed with the RTL Compiler tool), using the parasitic extraction obtained from the physical synthesis (performed with the First Encounter tool). The operating frequency considered in the synthesis results was 1GHz. The NoC was configured with 16-bit wide links and 4-flit deep buffers. Power consumption and area results are presented in Table I. The power consumption was evaluated with the tool's default toggle rate. As can be verified in this table, the power and area reductions are substantial, with a power reduction larger than 70% when compared to a conventional mesh NoC with a packet switching strategy.

Table I: Power and area reductions for 65nm process technology.

Metric                              NCS      TVOPD
Power (mW), mesh NoC                58.08    173.44
Power (mW), proposed NoC solution   15.17    41.28
Power reduction (%)                 73.88    76.2
Area (mm²), mesh NoC                0.56     1.62
Area (mm²), proposed NoC solution   0.12     0.32
Area reduction (%)                  78.41    80.04
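The reduction rows of Table I follow from the usual relative-difference formula, 100 × (mesh − proposed) / mesh. The short sketch below recomputes them from the tabulated values. The function name `reduction_pct` and the dictionary layout are illustrative choices, not part of the original work.

```python
def reduction_pct(baseline: float, proposed: float) -> float:
    """Percentage reduction of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Values copied from Table I (65nm synthesis): (mesh NoC, proposed NoC).
table = {
    "NCS":   {"power_mw": (58.08, 15.17), "area_mm2": (0.56, 0.12)},
    "TVOPD": {"power_mw": (173.44, 41.28), "area_mm2": (1.62, 0.32)},
}

for bench, metrics in table.items():
    for name, (mesh, proposed) in metrics.items():
        pct = reduction_pct(mesh, proposed)
        print(f"{bench:6s} {name:9s} reduction = {pct:.2f}%")
```

The power figures reproduce the table (73.88% and 76.20%). The area figures recomputed from the two-decimal values come out slightly higher than the reported 78.41% and 80.04%, which is presumably because the reported percentages were derived from unrounded areas; that explanation is an assumption.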

B. Performance Results

The performance results were obtained with a cycle-accurate traffic simulator written in Java. For these experiments, the benchmarks TVOPD, NCS and TMPEG4 (triple MPEG4 [15], constructed similarly to TVOPD) were considered. Simulations were performed for the same NoC configuration, for a total of 5000 packets per node and an operating frequency of 1GHz in all experiments. To verify the network behavior when the traffic increases, different packet sizes were used (8, 16, 32, 64 and 128 flits per packet). The analysis of long wire interconnections to define the circuit switching mode was taken into account. On average, a message in UCS mode switches to BCS after bypassing two routers in UCS.

For all these benchmarks, a large reduction in the average latency is observed when compared to a conventional mesh architecture or to the hierarchical proposal without adaptability, as can be observed in Fig. 5. For the TVOPD benchmark, the reduction was smaller than for the other benchmarks. This is due to the well-behaved traffic of this application, in which the majority of the communication flows from a single core to another one, like streaming messages. Because of this, the mesh level presents practically no contention, and the adaptive strategy does not obtain a considerable reduction in latency. The situation is different for the NCS and TMPEG4 benchmarks, where the reduction in latency is significant. The reduction in the average latency for the hierarchical topology with adaptability, relative to the same hierarchical NoC using only packet switching in the mesh, is up to 37% for NCS and more than 45% for TMPEG4. If this comparison is made with a conventional mesh topology, the reduction is around 70% for TMPEG4 and 76% for the NCS benchmark.

Another experiment was carried out to analyze the robustness of our combined strategy. It is interesting to verify whether this solution responds well when the bandwidth of some cores is increased. For TVOPD and NCS, since the mapping of these applications greatly reduced the inter-cluster communications, there was little latency impact for the adaptive mechanism to absorb. For TMPEG4, however, which presents many inter-cluster communications, the adaptive mechanism minimized the increment in latency. In this case, in order to observe whether the adaptability really mitigates the latency, all inter-cluster communications had their bandwidths doubled (2x). The results for these traffic conditions are presented in Fig. 6. These results compare the impact on latency for the hierarchical topology with and without adaptability. If the percentage increment in latency is smaller for the combined solution than for the hierarchical topology without adaptability, the purpose of combining the adaptive mechanism with a hierarchical topology has been achieved.
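The comparison criterion stated above can be expressed as a short check. The sketch below is illustrative only: the function names are hypothetical, and the latency numbers in the usage line are invented to exercise the code; they are not taken from Fig. 6.

```python
def latency_increase_pct(base_latency: float, doubled_latency: float) -> float:
    """Percentage increment in average latency when bandwidth is doubled."""
    return 100.0 * (doubled_latency - base_latency) / base_latency


def adaptability_mitigates(hier_base: float, hier_2x: float,
                           adapt_base: float, adapt_2x: float) -> bool:
    """True when the hierarchical NoC with adaptability suffers a smaller
    percentage increment than the hierarchical NoC without it."""
    return (latency_increase_pct(adapt_base, adapt_2x)
            < latency_increase_pct(hier_base, hier_2x))


# Made-up latencies in cycles, for illustration only:
# without adaptability 100 -> 150 (+50%), with adaptability 80 -> 100 (+25%).
print(adaptability_mitigates(100.0, 150.0, 80.0, 100.0))
```

The criterion compares relative increments rather than absolute latencies, so it isolates how well each design absorbs the extra traffic from how fast it was to begin with.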

The gains of this approach are possible thanks to the communication locality allowed by the hierarchical NoC, combined with a suitable mapping of the application, and to the dynamic adaptability at the top level. As one can observe, the adaptive switching strategy alone is not enough to reach good results in both power and performance. Clearly, different techniques need to be combined to obtain advantages in different aspects.

Figure 5: Average latency results considering different NoC proposals for (a) NCS, (b) TMPEG4 and (c) TVOPD benchmarks.

Figure 6: Average latency results considering a double bandwidth in the mesh level.

IV. CONCLUSION

This work joined two techniques that, applied together in a system, yield an efficient interconnection solution. The proposed hierarchical approach can cope with specific communication behaviors. Moreover, thanks to the adaptive strategy, it is also flexible enough to support different traffic patterns. The adaptability of this work is completely different from other strategies, since it considers floorplan information to define the circuit switching mode. The gains of our proposal come from the use of an adaptive mechanism allied to a low-cost hierarchical NoC composed of routers and crossbars. The results present significant reductions in power consumption and in average latency when compared with other strategies.

REFERENCES

[1] Jerraya, A., Tenhunen, H., Wolf, W., "Guest editors' introduction: Multiprocessor systems-on-chips," pp. 36-40, 2005.

[2] ITRS, System Drivers (2011). http://www.itrs.net/Links/2011ITRS/2011Chapters/2011SysDrivers.pdf

[3] Kumar, A. et al., "Toward Ideal On-Chip Communication Using Express Virtual Channels," Micro, IEEE, vol. 28, no. 1, pp. 80-90, 2008.

[4] Dally, W. and Towles, B., Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, 2003.

[5] Lai, G. and Lin, X., "Floorplan-Aware Application-Specific Network-on-Chip Topology Synthesis Using Genetic Algorithm Technique," The Journal of Supercomputing, pp. 1-20, 2011.

[6] Chan, J. et al., "NoCOUT: NoC Topology Generation with Mixed Packet-Switched and Point-to-Point Networks," ASPDAC, 2008.

[7] Passas, G. et al., "Crossbar NoCs Are Scalable Beyond 100 Nodes," TCAD, pp. 573-585, 2012.

[8] Matos, D. et al., "Floorplanning-Aware Design Space Exploration for Application-Specific Hierarchical Network-on-Chip," in NoCArc, 2011.

[9] Hollstein, T. et al., "HiNoC: A Hierarchical Generic Approach for on-Chip Communication, Testing and Debugging of SoCs," in IFIP, Springer, pp. 39-54, ISBN 978-0-387-33404-3, ISSN 1571-5736, 2006.

[10] Modarressi, M. et al., "Application-Aware Topology Reconfiguration for On-Chip Networks," TVLSI, Nov. 2011.

[11] Modarressi, M. et al., "Virtual Point-to-Point Connections for NoCs," TCAD, vol. 29, no. 6, pp. 855-868, June 2010.

[12] Jerger, E. et al., "Circuit-Switched Coherence," NoCS ACM/IEEE International Symposium on, pp. 193-202, 7-10 April 2008.

[13] Tino, A., Khan, G., "Power and Performance Tabu Search Based Multicore Network-on-Chip Design," in International Conference on Parallel Processing Workshop, pp. 74-81, 2010.

[14] Murali, S. et al., "Synthesis of networks on chips for 3D systems on chips," ASP-DAC, pp. 242-247, 19-22 Jan. 2009.

[15] Bertozzi, D. et al., "NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip," TPDS, pp. 113-129, 2005.
