
NETWORK ON CHIP (NoC) PLATFORM FOR

HIGH PERFORMANCE COMPUTING

Thesis submitted to Goa University

for the award of the degree of

Doctor of Philosophy

In

Electronics

By

Udaysing Vithalrao Rane

Supervisor

Dr. R. S. Gad

Department of Electronics,

Goa University, Goa-403206

January 2017


CERTIFICATE

This is to certify that the thesis entitled "Network on Chip (NoC) Platform for High Performance Computing", submitted by Mr. Udaysing Vithalrao Rane for the award of the degree of Doctor of Philosophy in Electronics, is based on his original and independent work, carried out during the period of study under my supervision. The thesis or any part thereof has not been previously submitted for any other degree or diploma in any University or institute.

Place: Goa University (R. S. Gad)

Date: Research Guide


STATEMENT

I state that the present thesis, "Network on Chip (NoC) Platform for High Performance Computing", is my original contribution and has not been submitted on any occasion for any other degree or diploma of this University or any other University/Institute. To the best of my knowledge, the present study is the first comprehensive work of its kind in the area mentioned. The literature related to the problem investigated has been cited. Due acknowledgements have been made wherever facilities and suggestions have been availed of.

Place: Goa University (Udaysing V. Rane)

Date: Candidate


Dedicated with love to my parents

Shri Vithalrao Daulatrao Rane and Smt. Vijaya Vithalrao Rane

My wife

Mrs. Aparna U. Rane

& My children

Master Arush U. Rane & Master Aryan U. Rane


ACKNOWLEDGEMENTS

With deep regards and profound respect, I avail this opportunity to express my deep sense of gratitude and indebtedness to my supervisor, Dr. Rajendra S. Gad, Associate Professor, Department of Electronics, Goa University, for his endless support, encouragement, inspiration, ideas, comments, criticism, endless discussions and articles during our interactions. He has also helped me broaden my perspective on many things in life. I express a deep sense of gratitude to Dr. Gourish M. Naik, Professor and Head, Department of Electronics, Goa University, for his valuable guidance, constant encouragement, suggestions, advice and ideas.

I acknowledge the technical comments and assistance received from Dr. Y. S. Valaulikar, Member FRC, Department of Mathematics, Prof. A. Salkar, Dean, Faculty of Natural Science, and Prof. J. A. E. Desa, Former Dean, Faculty of Natural Science, for their valuable suggestions and timely advice during the faculty research meetings.

I am very grateful to Dr. D. B. Arolkar, Principal, D.M's College, Assagao, Goa, and Shri Shrikrishna Pokle, Chairman, Dnyanprassarak Mandal's, Mapusa, Goa, for their encouragement and support in pursuing research.

My sincere thanks to Dr. Jyoti Pawar, Associate Professor, Department of Computer Science and Technology, Goa University; Dr. Jivan S. Parab, Assistant Professor, Department of Electronics; Dr. Vinaya R. Gad, Associate Professor and Head, Department of Computer Science, G.V.M's College, Ponda; Dr. Prakah Pareinkar, Head, Department of Konkani, Goa University; and Dr. Sulaxana Raikar, Assistant Professor, Department of Computer Science, G.V.M's College, Ponda, for their precious advice that greatly helped me in developing this thesis.

I acknowledge Dr. Vithal Tilvi (Arizona State University, USA) for his moral support, constant encouragement, and the discussions and arguments over telephonic conversations or whenever he was down in India. I also thank Mr. Saish Amonkar, Technical Head, Intel Bangalore, for his time over the technical discussions we had.

Special thanks to all the research scholars of our department, Mr. Narayan Vetrekar, Dr. Ingrid, Shaila Ghanti, Noel, Aniket, Charan, Yogini and Marlan, for endless deliberations regarding technical writing skills, and to Mr. William and Mr. Agnelo Lopez for constantly assisting me in various administrative work.

I am indebted to the funding agency ALTERA INC., USA, for setting up the System on Chip (SoC) laboratory at Goa University.

I enjoyed all the moments I experienced during my research work, which transformed me into my present form, and I am sure that this exposure has effectively changed my perception of life. I thank all my colleagues at D.M's College and Research Centre, Assagao, Goa, who have greatly contributed to a good, inspiring working and social environment.

I express a deep sense of gratitude to my wife Aparna, my sister Sharavati and my brothers Abasaheb and Sanjay, for their patience and understanding as well as for taking on much of the responsibility at home. I shall forever remain indebted to my children, Arush and Aryan; the long hours and holidays that I owed to them I spent on this thesis.

Udaysing V. Rane, January 2017.


TABLE OF CONTENTS

PREFACE

1. INTRODUCTION
1.1 Introduction
1.2 Multi-Processor System-on-Chip (MPSoC)
1.3 Network-on-Chip (NoC)
1.3.1 NoC architecture
1.3.2 Topologies
1.3.3 Topology metrics
1.3.4 NoC switching techniques
1.3.5 Routing algorithms
1.4 Objectives of the thesis and outline

2. LITERATURE SURVEY AND STATE OF THE ART IN NoC
2.1 NoC architectures
2.2 NoC router architectures
2.3 NoC routing algorithms
2.4 NoC link interfaces
2.5 NoC flow control

3. NoC SWITCH/ROUTER ARCHITECTURE DESIGN AND ITS IMPLEMENTATION
3.1 Proposed switch architecture design
3.2 Working of the switch/router
3.2.1 ATM packet format
3.2.2 The scheduler
3.2.3 Two-dimensional ripple carry arbiters
3.2.4 Comparison between DPA and RPA
3.2.5 The crossbar organisation
3.3 Development design flow
3.4 Description of the switch using a hardware description language
3.4.1 Switch design synthesis
3.4.2 Switch design simulation

4. IMPLEMENTATION OF COMPUTING NODE
4.1 Overview of the computing node
4.2 Task of converting the design into a block diagram
4.3 Soft core processor architecture
4.4 SoC generation using the Nios-II soft core
4.5 Integrating the SoC with the switch
4.6 Synthesizing and porting the computing node on the target device Cyclone-II
4.7 Programming and configuring the FPGA device
4.8 Validating the design for the application program

5. IMPLEMENTATION OF NoC PLATFORM USING MESH TOPOLOGY
5.1 Overview of mesh topology
5.2 Design of the NoC mesh topology platform
5.3 The 3x2 NoC mesh topology verification with RTOS

6. PERFORMANCE EVALUATION AND ANALYSIS OF MESH TOPOLOGY
6.1 Performance evaluation and analysis of mesh topology
6.2 Simulation environment for m x n mesh topology
6.3 Scenario 1: Throughput and delay performance with varying packet size
6.4 Scenario 2: Throughput and delay performance with varying queue size for low load and high load packets
6.5 Scenario 3: Throughput and delay performance with varying link bandwidth for low load and high load packets
6.6 Scenario 4: Throughput and delay performance with varying link propagation delay with low load and high load packets

7. CONCLUSION AND FUTURE TRENDS
7.1 Conclusion
7.2 Future trends

ANNEXURE
Appendix-I
Appendix-II
Appendix-III

REFERENCES


LIST OF TABLES

Table 3.1: Areas and delays for each arbiter design.
Table 3.2: Resource usage on the Cyclone II EP2C35F672C6 FPGA device for the proposed switch design.
Table 3.3: Static look up table used for routing.
Table 3.4: Input ATM packets used for simulation and expected output packets.
Table 3.5: Comparison of the proposed router design with other router designs.
Table 4.1: Embedded soft core processors for FPGA.
Table 4.2: Pin assignment for EP2C35F672C6 of the Cyclone II family for the DE2 board.
Table 4.3: Input ATM packet sent at port one of the switch.
Table 4.4: Part of the look up table.
Table 4.5: Output ATM packet expected at port number 3.
Table 6.1: Simulation parameters.
Table 6.2: Various parameter values for scenario 1.
Table 6.3: Throughput and delay performance with varying packet size for 4x4 mesh topology.
Table 6.4: Throughput and delay performance with varying packet size for 4x4 mesh topology.
Table 6.5: Throughput and delay performance with varying packet size for 8x8 mesh topology.
Table 6.6: Throughput and delay performance with varying packet size for 8x8 mesh topology.
Table 6.7: Various parameter values for scenario 2.
Table 6.8: Throughput and delay performance by varying queue size with low load packets (512 bytes) on 4x4 mesh topology.
Table 6.9: Throughput and delay performance by varying queue size with high load packets (64 Kbytes) on 4x4 mesh topology.
Table 6.10: Throughput and delay performance by varying queue size with low load packets (512 bytes) on 8x8 mesh topology.
Table 6.11: Throughput and delay performance by varying queue size with high load packets (64 Kbytes) on 8x8 mesh topology.
Table 6.12: Various parameter values for scenario 3.
Table 6.13: Throughput and delay performance by varying link bandwidth with packet size of 0.512 Kbytes on 4x4 mesh topology.
Table 6.14: Throughput and delay performance by varying link bandwidth with packet size of 64 Kbytes on 4x4 mesh topology.
Table 6.15: Throughput and delay performance by varying link bandwidth with packet size of 0.512 Kbytes on 8x8 mesh topology.
Table 6.16: Throughput and delay performance by varying link bandwidth with packet size of 64 Kbytes on 8x8 mesh topology.
Table 6.17: Various parameter values for scenario 4.
Table 6.18: Throughput and delay performance by varying link propagation delay with packet size of 512 bytes on 4x4 mesh topology.
Table 6.19: Throughput and delay performance by varying link propagation delay with packet size of 64 Kbytes on 4x4 mesh topology.
Table 6.20: Throughput and delay performance by varying link propagation delay with packet size of 512 bytes on 8x8 mesh topology.
Table 6.21: Throughput and delay performance by varying link propagation delay with packet size of 64 Kbytes on 8x8 mesh topology.


LIST OF FIGURES

Figure 1.1: Growth in processor performance
Figure 1.2: Evolution of multi-core systems
Figure 1.3: General architecture of NoC
Figure 1.4: 2D Mesh topology
Figure 1.5: 2D Torus network
Figure 1.6: Star topology
Figure 1.7: Fat Tree network
Figure 1.8: Butterfly Fat Tree topology
Figure 1.9: SPIN topology
Figure 1.10: Polygon (Hexagon) topology
Figure 1.11: Spidergon (Octagon) topology
Figure 3.1: The proposed switch architecture
Figure 3.2: ATM packet format structure
Figure 3.3: Input module of the proposed switch
Figure 3.4: Look up table structure for the proposed switch
Figure 3.5: 4x4 two-dimensional ripple carry arbiter
Figure 3.6: Embedded development flow
Figure 3.7: Quartus II design flow
Figure 3.8: Compilation report flow summary
Figure 3.9: RTL view of the switch
Figure 3.10: Input waveform vector file showing input ATM packets
Figure 3.11: Output showing ATM packets received at the respective output ports of the switch as described in the LUT
Figure 4.1: Block symbol of the proposed switch design
Figure 4.2: Block diagram of the Nios II processor core with relevant buses
Figure 4.3: Customized soft core processor for the switch design
Figure 4.4: Schematic view of the computing node consisting of an SoC and a switch
Figure 4.5: The DE2 board with relevant components (source: ALTERA INC.)
Figure 4.6: Block diagram of the DE2 board with various I/O and communication interfaces
Figure 4.7: Program flow for verification of the computing node on FPGA
Figure 4.8: Output ATM packet at port number 3 of the switch on the Nios-II IDE console
Figure 5.1: 4x4 mesh topology with switch (router) and computing element (core)
Figure 5.2: Close view of the 4x4 mesh topology interconnection platform using Quartus-II
Figure 5.3: Full view of the 4x4 mesh topology interconnection platform using Quartus-II with SoC and PLL
Figure 5.4: Compilation report for the 4x4 switch matrix demanding more utilization
Figure 5.5: Compilation report for the 3x2 mesh topology utilizing 97% of logic elements
Figure 5.6: 3x2 mesh topology interconnection platform
Figure 5.7: Real-time system priority scheduler for transmitting and receiving the ATM packets
Figure 6.1: Throughput in Kbytes v/s packet size in Kbytes
Figure 6.2: Delay per packet in ms v/s packet size in Kbytes
Figure 6.3: Throughput in Kbytes v/s queue size
Figure 6.4: Delay per packet in ms v/s queue size
Figure 6.5: Throughput in Kbytes v/s bandwidth in Mbps
Figure 6.6: Delay per packet in ms v/s bandwidth in Mbps
Figure 6.7: Throughput in Kbytes v/s link delay in ms
Figure 6.8: Delay per packet in ms v/s link delay in ms


KEYWORDS

ASIC: Application Specific Integrated Circuit
ALU: Arithmetic Logic Unit
API: Application Program Interface
AS: Active Serial
ATM: Asynchronous Transfer Mode
BFT: Butterfly Fat Tree
BiNoC: Bidirectional channel Network on Chip
CBR: Constant Bit Rate
CDC: Channel Direction Control
CDLSI: Capacitively Driven Low-Swing Interconnects
CLICHÉ: Chip-Level Integration of Communicating Heterogeneous Elements
CMP: Chip Multi-Processor
CoNoC: Contention-free optical NoC
CPLD: Complex Programmable Logic Device
CPU: Central Processing Unit
DE: Development and Education
DOR: Dimension Ordered Routing
DPA: Diagonal Propagation Arbiter
DPXY: Distance Prediction XY
DSB: Distributed Shared Buffer
DSM: Deep Submicron
DSP: Digital Signal Processor
EDA: Electronic Design Automation
EPC: Express Physical Channels
EVC: Express Virtual Channels
FIFO: First In First Out
FPGA: Field Programmable Gate Array
FT: Fat Tree
FTP: File Transfer Protocol
FWFT: First Word Fall Through
GCA: Global Congestion Awareness
GUI: Graphical User Interface
GWOR: Generic Wavelength routed Optical Router
HAL: Hardware Abstraction Layer
HDL: Hardware Description Language
HNoC: Hierarchical Network-on-Chip
IBR: Input Buffer Router
IC: Integrated Circuit
IDE: Integrated Development Environment
IP: Intellectual Property
IPC: Inter Process Communication
IRQ: Interrupt Request
ISA: Instruction Set Architecture
JTAG: Joint Test Action Group
LAFT: Look Ahead Fault Tolerant
LEDR: Level Encoded Dual Rail
LUT: Look Up Table
MMU: Memory Management Unit
MPSoC: Multi-Processor System-on-Chip
MPU: Memory Protection Unit
MR: Micro-ring Resonator
NFR: Neighbor Flow Regulation
NI: Nanophotonic Interconnect
NI: Network Interface
NoC: Network-on-Chip
OCMP: On-Chip Multilayer Photonic
ONoC: Optical NoC
PCC: Packet-Connected Circuit
PDN: Perfect Difference Network
PIO: Programmed Input/Output
PLL: Phase Locked Loop
PNoC: Programmable NoC
RAM: Random Access Memory
RCWRON: Recursive Wavelength Routed Optical Network
RISC: Reduced Instruction Set Computer
RPA: Rectilinear Propagation Arbiter
RPNoC: Ring based Packet switched NoC
RTL: Register Transfer Level
SA: Simulated Annealing
SAF: Store and Forward
SDM: Space Division Multiplexing
SDRAM: Synchronous Dynamic Random Access Memory
SGM: State Graph Manipulators
SoC: System-on-Chip
SOPC: System-On-a-Programmable-Chip
SPIN: Scalable, Programmable, Integrated Network
SRAM: Static Random Access Memory
SNN: Spiking Neural Networks
STorus: Screwy Torus
TCP: Transmission Control Protocol
TSV: Through-Silicon Via
UART: Universal Asynchronous Receiver/Transmitter
USB: Universal Serial Bus
VCI: Virtual Circuit Identifier
VCT: Virtual Cut-Through
VHDL: Very high-speed integrated circuit Hardware Description Language
WDM: Wavelength Division Multiplexing
WH: Wormhole
WRON: Wavelength Routed Optical Network
XDMesh: Extended Diagonal Mesh


PREFACE

The thesis describes the design and implementation of a Network on Chip platform. NoC architectures not only provide power-efficient solutions for interconnecting network resources but also enhance network performance while containing implementation cost. The thesis discusses the various building blocks required to design the NoC platform and the different architectures and routing schemes with their pros and cons. The m x n mesh topology based platform, the complete process of its hardware implementation, and the reason for choosing the mesh topology are described in detail.

The thesis first describes the proposed design of the switch, which is the basic building block of the NoC architecture. The model is described using HDL and synthesized using the ALTERA QUARTUS II software for the Cyclone-II FPGA device. The switch component is then tested for its routing functionality using a static look up table configured inside the design. The computing element required to connect to the switch is generated using Altera's Nios-II soft core processor. The thesis describes in detail the process of implementing and testing the computing node for its functionality; the computing node is the switch embedded with the computing element generated using the soft core Nios-II processor.

The m x n mesh topology platform has been designed, implemented and synthesized on the Cyclone II 2C35 FPGA device using Altera's Development and Education (DE2) board and verified for data communication using the real time operating system µC/OS.

The thesis also gives a detailed implementation of 4x4 and 8x8 mesh topology NoC simulation models and analyses the performance of these models with respect to throughput and packet delay by varying the packet size, link bandwidth, queue size and link delay parameters using FTP and CBR traffic patterns. The results obtained give a fair understanding for deciding the efficient usage of such platforms for different applications.

Udaysing Vithalrao Rane


LIST OF RESEARCH RELATED COMMUNICATED PAPERS

1. Udaysing V. Rane, V. R. Gad, R. S. Gad, G. M. Naik, 'Reliable and scalable architecture for Internet of things for sensors using soft-core processor', International Conference on Internet of Things, Smart Spaces, and Next Generation Networks and Systems, Proc. 15th Int. Conf. NEW2AN-2015 and 8th Conf. ruSMART-2015, St. Petersburg, Russia; published in Lecture Notes in Computer Science, 9247, 2015, 367-382.

2. Vinaya R. Gad, Udaysing V. Rane, Rajendra S. Gad, Gourish M. Naik, 'Performance Evaluation of Low Density Parity Check (LDPC) Codes Over Gigabit Ethernet Protocol', Transactions on Networks and Communications (TNC), Volume 4, Issue 5, October 2016.

3. Rajendra S. Gad, Udaysing V. Rane, John Fernandes, Mahesh Salgaonkar, 'Virtual Extended Memory Symmetric Multiprocessor (SMP) Organization Design Using LC-3 Processor: Case Study of Dual Processor', International Journal of Engineering Research and Development (IJERD), Volume 10, Issue 3 (March 2014), pp. 29-39.

4. Udaysing V. Rane, Rajendra S. Gad, 'Low power routing: Migration from Electronics to Optical switching', 8th Annual National Symposium on VLSI and Embedded Systems, VSI Goa, India, 8th March 2014, PCCE College of Engineering, Verna, Goa, India.

5. Udaysing V. Rane, Rajendra S. Gad, 'Design of NoC Platform with Mesh Topology using Soft Core NIOS-II Processor', 6th Annual National Symposium on VLSI and Embedded Systems, VSI Goa, India, 22nd March 2012, Goa University, Goa, India.

6. Charan Arur Panem, A. A. Gaonkar, Udaysing V. Rane, A. B. Pandit, R. S. Gad, 'Sensors Data Fusion Architecture Over MIMO: Case Study of Quad copter', International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), March 2016, Chennai, India.

7. Udaysing V. Rane, Rajendra S. Gad, 'Implementation of Switch design on FPGA for Asynchronous Transfer Mode Network', 5th National Conference on Electronic Technologies (NCET-14), 25th-26th April 2014, GEC, Goa, India.

8. Udaysing V. Rane, Rajendra S. Gad, G. M. Naik, 'NoC Mesh Topology Platform Synthesis on ALTERA FPGA: Verification using µC/OS Real Time Operating System', under review with Real-Time Systems: The International Journal of Time-Critical Computing Systems, Springer.

9. Udaysingh V. Rane, Charan Arur Panem, Vinaya R. Gad, Rajendra S. Gad, Gourish M. Naik, 'Performance Study of 2-D Mesh Topologies on Network on Chip and Verification Using Real-Time Operating System on 90nm Technologies', under review with Microprocessors and Microsystems: Embedded Hardware Design International Journal, Elsevier.


Chapter 1

INTRODUCTION


1.1 Introduction

The complexity of a semiconductor Integrated Circuit (IC) is measured by its transistor count, i.e. the number of transistors on the chip. Very Large Scale Integration (VLSI) technology, which began in the 1970s, has scaled transistor size drastically (from 10,000 nm down to 14 nm), allowing billions of transistors to fit on an IC.

In 1971 Intel introduced its 4004 processor containing only 2,300 transistors on a 12 mm² die fabricated in a 10,000 nm process; in 2016 it introduced the Broadwell-EP Xeon E5-2600 V4 with up to 7.2 billion transistors packed into a 456 mm² die using a 14 nm process [1]. The 14 nm process allows Intel to significantly cut down the die size while increasing transistor count, core count and instructions-per-cycle (IPC) performance [2]. Oracle Corporation in October 2015 introduced the 32-core SPARC M7 with 10 billion transistors [3]. At the 2016 GPU Technology Conference in San Jose, NVIDIA announced the new Tesla P100 with 15.3 billion transistors on 16 nm technology and a 610 mm² die area [4].

It can be observed that the transistor count of integrated circuits doubles approximately every two years, as predicted by Moore's law [5]. This increase in the number of transistors, which is the stepping stone to developing complex semiconductor and communication technologies, allows more and more components to be integrated within the same chip area. As a consequence, complex systems that once required many microchips can now be placed on a single microchip containing all the logic of the system and the interconnection channels connecting it. At the same time, new real time applications, such as audio and video transmission, video conferencing, video on demand, distance education, e-commerce, etc., demand fast computation, higher information transmission rates and minimal delay.
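As a rough illustrative check of this doubling rate against the data points quoted above (a sketch, not a figure from the thesis), a few lines of C suffice:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double count_1971 = 2300.0;              /* Intel 4004 transistor count */
        double doublings = (2016 - 1971) / 2.0;  /* one doubling every ~2 years */
        /* Projected 2016 count: ~1.4e10, the same order of magnitude as the
           7.2e9 transistors of the Broadwell-EP Xeon quoted above. */
        printf("projected 2016 count: %.2e\n", count_1971 * pow(2.0, doublings));
        return 0;
    }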

A survey conducted by SPEC on their benchmark series SPECint, which was designed to measure the single-threaded integer performance of a computing machine, is shown in figure 1.1; the results, grouped by CPU brand, consist of 5,052 test results from 715 different CPU models and show that the growth in uniprocessor performance slowed markedly after 2004 [6]. The combined pressures of ever-increasing power consumption and the diminishing returns in performance of uniprocessor architectures have thus led to the advent of multicore chips (figure 1.2) [7]. For example, the Intel SCC [8] contains 48 cores, the Tilera Tile64 has 64 cores [9], and the experimental Intel Polaris chip incorporates 80 cores [10]. Consequently, embedded systems have led to the Multi-Processor System-on-Chip (MPSoC) design, and high performance computer architectures have evolved into Chip Multi-Processor (CMP) platforms.

1.2 Multi-Processor System-on-Chip (MPSoC)

The Multi-Processor System-on-Chip (MPSoC) is a System-on-Chip (SoC) which incorporates multiple heterogeneous processing cores, a memory hierarchy and I/O components on a single die, usually targeted at embedded applications. These architectures meet the performance needs of multimedia applications, telecommunication architectures, network protection and security, and other application domains while limiting power consumption through the use of specialized processing elements and architectures [11]. All these components are linked to each other by an on-chip interconnect.

Figure 1.1: Growth in processor performance (source: original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten; dotted line extrapolations by Moore)

Figure 1.2: Evolution of multi-core systems


These on-chip interconnects usually consist of conventional bus and crossbar structures. The bus interconnect is just wires interconnecting all Intellectual Property (IP) cores, combined with an arbiter that manages access to the bus. System-level latency and bandwidth constraints led to a natural evolution towards multi-tiered bus architectures, typically consisting of a high-performance low latency processor bus, a high-bandwidth memory bus and a peripheral bus. Examples of bus-based architectures are ARM™ AMBA [12], OpenCore's WISHBONE SoC interconnection [13], and IBM CoreConnect™ [14-15].

However, there are several problems associated with standard bus architectures [16]. First, a global bus implies a large capacitive load for the bus drivers, which in turn implies large delays and high power consumption. Second, the performance of a shared bus architecture is inherently not scalable, as there can be at most one transaction over the shared bus at any point in time; moreover, the bus performance is degraded if a slow device is accessing the bus. Some sophisticated modern bus architectures address these problems through the concept of bus hierarchy and separation; however, such a temporary solution falls short when hundreds or even thousands of processors have to be integrated on a single chip [17]. Third, in the Deep Submicron (DSM) era, the design of long and wide buses becomes a real challenge: while physical information is extremely important for successful bus design, the environment in which the bus is embedded is very hard to predict and characterize early in the design stages due to crosstalk. With increasing IC performance requirements, designers started to implement crossbar structures, which improve latency predictability and significantly increase aggregate bandwidth at the cost of a larger number of wires.
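To make the arbitration bottleneck concrete, the following is a minimal sketch of a round-robin arbiter granting a shared bus to one of N requesting masters per cycle (an illustrative model, not any specific bus standard). With a single grant per cycle, all other masters stall, which is precisely the serialization problem described above.

    /* Minimal round-robin bus arbiter sketch: one grant per cycle. */
    #define N 4  /* number of masters on the shared bus (illustrative) */

    /* 'request' is a bitmask of masters requesting the bus this cycle;
       'last' is the master granted in the previous cycle.
       Returns the granted master's index, or -1 if there are no requests. */
    int arbitrate(unsigned request, int last) {
        for (int i = 1; i <= N; i++) {
            int candidate = (last + i) % N;   /* rotate priority for fairness */
            if (request & (1u << candidate))
                return candidate;             /* single grant: bus is serialized */
        }
        return -1;
    }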

The advent of the SoC, incorporating tens to hundreds of IP cores, created a significant integration challenge. The buses and crossbars described above are 'coupled' solutions: the interfaces of all IP cores connected to a single bus or crossbar must all be exactly the same, both logically (signals) and in physical design parameters (clock frequency, timing margins). This turned out to be a significant obstacle to the rapid integration and re-use of existing IP in increasingly complex SoCs. Systems-on-chip can be found in many product categories ranging from consumer devices to industrial systems [18]:

• Television production equipment uses systems-on-chip to encode/decode video. Encoding high-definition video in real time requires extremely high computation rates.

• Digital televisions and set-top boxes use sophisticated multiprocessors to perform real-time video and audio decoding and user interface functions.

• Telecommunications and networking use specialized systems-on-chip, such as network processors, to handle the huge error detection and correction data rates presented by modern transmission equipment.

• Cell phones use several programmable processors to handle the signal processing and communication protocol tasks required by telephony. These architectures must be designed to operate at the very low power levels provided by batteries.

• Video games use several complex parallel processing machines to render gaming action in real time.

The scalability and success of switch-based networks and packet-based communication in parallel computing and the Internet have inspired researchers to propose the Network-on-Chip (NoC) architecture as a viable solution to the complex on-chip communication problem [19].

1.3 Network-on-Chip (NoC)


For a complex on-chip computing platform, an efficient way to manage the communication among the various on-chip resources becomes critically important. The communicating resources traditionally use bus and crossbar architectures to communicate with each other. The problems with a bus-based communication strategy are that it occupies a large wire area, that it scales poorly as the number of resources grows since communication is serialized, and that arbitration for the shared medium can impose significant latency, as can the fan-out [20]. A crossbar architecture eliminates serialization, but its area and power costs increase quadratically with the number of network endpoints [21]. To address these SoC design challenges, the Network-on-Chip has been proposed as an alternative approach to designing the communication subsystem between on-chip resources. NoC proposes networks as a scalable, reusable and global communication architecture.

1.3.1 NoC Architecture

Figure 1.3 shows the general architecture of a NoC. The major components of a NoC are routers, Network Interfaces (NI), IP cores (processors, memory, sub-systems, etc.) and links [22]. A router is the communication backbone of a NoC system and undertakes the crucial task of steering and coordinating the data flow; it should be designed for maximum efficiency and throughput. Links are sets of wires used to physically connect the routers, enabling communication in the network. The NI makes the logical connection between an IP core and the network. The IP cores can be Digital Signal Processors (DSPs), embedded memory blocks, Central Processing Units (CPUs), video processors, etc.


Figure 1.3: General Architecture of NoC

1.3.2 Topologies

Topology refers to the physical layout of, and the connections between, nodes and links in the network. The performance of the network depends directly on the topology: it determines the number of routers a message must traverse as well as the interconnect lengths between routers, thus influencing network latency significantly. Network energy consumption also depends on the topology, as routers and links consume energy whenever messages traverse them. Further, the topology dictates the total number of alternative paths between routers, affecting how smoothly traffic is distributed and hence how well bandwidth requirements are supported. The commonly used on-chip topologies are explained in detail below.

1.3.2.1 2D Mesh Topology

The mesh-based interconnect architecture called CLICHÉ (Chip-Level Integration of Communicating Heterogeneous Elements) was proposed by Kumar et al. [23]. The architecture consists of an m x n mesh of switches interconnecting computational resources (IPs) placed alongside the switches, as shown in figure 1.4. Each switch is connected through communication channels to four adjacent switches (except for switches placed at the edges) and to one IP block. This topology is widely used in many NoC based multicore prototypes, e.g. MIT's 16-tile RAW chip [24] and Intel's 80-tile TFLOPS chip [25]. The major advantages of the mesh are its simplicity, very regular layout and scalability.
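This regularity can be illustrated with a small sketch (assumed code, not taken from the thesis): a switch's neighbours follow directly from its (row, column) coordinates, with edge switches simply having fewer than four neighbours.

    /* Neighbours of switch (r, c) in an m x n mesh (illustrative sketch). */
    #define M 4  /* rows */
    #define N 4  /* columns */

    int mesh_neighbours(int r, int c, int out[4][2]) {
        int k = 0;
        if (r > 0)     { out[k][0] = r - 1; out[k][1] = c;     k++; }  /* north */
        if (r < M - 1) { out[k][0] = r + 1; out[k][1] = c;     k++; }  /* south */
        if (c > 0)     { out[k][0] = r;     out[k][1] = c - 1; k++; }  /* west  */
        if (c < N - 1) { out[k][0] = r;     out[k][1] = c + 1; k++; }  /* east  */
        return k;  /* 2 at corners, 3 along edges, 4 in the interior */
    }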

Figure 1.4: 2D Mesh topology

1.3.2.2 2D Torus Topology

Figure 1.5: 2D Torus Network


The 2D Torus architecture (figure 1.5) was proposed by Dally and Towles [26]. It is the same as a regular mesh except that the switches at the edges are connected to the switches at the opposite edge through wrap-around channels, as shown in figure 1.5.
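The wrap-around links amount to applying the mesh neighbour rule with modular arithmetic, as in the following sketch (assumed code, not from the thesis):

    /* Neighbours of switch (r, c) in an m x n torus: indices wrap around. */
    #define M 4
    #define N 4

    void torus_neighbours(int r, int c, int out[4][2]) {
        int k = 0;
        out[k][0] = (r + M - 1) % M; out[k][1] = c;             k++;  /* north, wraps */
        out[k][0] = (r + 1) % M;     out[k][1] = c;             k++;  /* south, wraps */
        out[k][0] = r;               out[k][1] = (c + N - 1) % N; k++;  /* west, wraps */
        out[k][0] = r;               out[k][1] = (c + 1) % N;     k++;  /* east, wraps */
    }
    /* Every switch now has degree 4, and the network diameter is roughly
       halved relative to the mesh, at the cost of the long wrap-around wires. */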

1.3.2.3 Star Topology

Figure 1.6: Star Topology

In a star topology network, the computational resources are connected in a star fashion to a central router, as shown in figure 1.6. The capacity requirements of the central router are quite large, because all the traffic between the spokes goes through the central router, which creates a considerable possibility of congestion in the middle of the star.

1.3.2.4 Fat Tree Topology

This topology uses a tree structure where the nodes are routers and the leaves are computational resources, as shown in figure 1.7. The routers above a leaf are called the leaf's ancestors, and correspondingly the leaves below an ancestor are its children. In a fat tree topology each node has replicated ancestors, which means that there are many alternative routes between nodes.


Figure 1.7: Fat Tree Network

1.3.2.5 Butterfly Fat Tree Topology

C. Grecu, P. P. Pande, A. Ivanov and R. Saleh [27] proposed the Butterfly Fat Tree architecture shown in figure 1.8. In this topology the computational resources are placed at the leaves of the tree and the switches are placed at the vertices. The architecture may be unidirectional or bidirectional. Each node in the topology is labeled with a pair of coordinates (l, p), where l is its level and p is its position within that level. In general, at level 0 (the lowest level) there are N computational resources with addresses ranging from 0 to N-1. Each switch in the Butterfly Fat Tree architecture has two parent ports and four child ports.

Figure 1.8: Butterfly Fat Tree topology


1.3.2.6 SPIN Topology

Guerrier and Greiner [28] proposed a generic interconnect template called SPIN (Scalable, Programmable, Integrated Network), using a fat tree based architecture to connect computational resources for on-chip packet-switched interconnections. In this architecture, every node has four children, and the parent is replicated four times at every level of the tree. The leaves of the tree contain the computational resources and the vertices contain the switches, as shown in figure 1.9.

Figure 1.9: SPIN topology

1.3.2.7 Polygon Topology

The Polygon topology is a circular network in which packets travel in a loop from one router to another. The network becomes more diverse when chords are added to the circle. Two special cases of the polygon network are the Hexagon network shown in figure 1.10, proposed by ST Microelectronics [29], and the Spidergon/Octagon network shown in figure 1.11 [30][31][32][33]. The Hexagon network topology consists of six routers; each router has six bidirectional links, five connecting it to the other routers and one connecting it to its computational resource.

Figure 1.10: Polygon (Hexagon) Topology

The Spidergon/Octagon consists of eight routers and 12 bidirectional links: each router is connected in a circular fashion, and there is additionally a connection between opposite routers. Communication between any pair of nodes takes at most N/4 hops, where N is the number of nodes, as compared to a single hop in the Hexagon topology.
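To see where the N/4 bound comes from, consider the following sketch of the shortest hop count on a Spidergon (illustrative code, not from the thesis), where each router links to its two ring neighbours and to the opposite router:

    /* Shortest hop count between routers a and b on an N-node Spidergon
       (n even): walk around the ring, or cross once and then walk back. */
    int spidergon_hops(int a, int b, int n) {
        int d = (b - a + n) % n;
        if (d > n / 2) d = n - d;      /* ring distance, shorter way round */
        int cross = 1 + (n / 2 - d);   /* take the cross link, then walk  */
        return d < cross ? d : cross;
    }
    /* For n = 8 (Octagon): opposite routers are 1 hop apart, and no pair
       is more than 2 hops apart, consistent with the N/4 bound above. */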

Figure 1.11: Spidergon (Octagon) Topology

1.3.3 Topology Metrics

Node degree: The number of links at each node is termed the degree of a topology. For example, a ring topology has a degree of 2, since there are two links at each node, while a mesh has a degree of 4, as each node has 4 links connecting it to 4 neighboring nodes.


The higher the degree, the more ports are required at each router, which increases the implementation complexity.

Hop count: The hop count is the number of hops a message takes from source to destination, i.e. the number of links it traverses. The hop count depends on the diameter of the network and the routing algorithm used; an increase in the number of hops increases the latency, affecting the performance of the network.

Topologies typically trade off node degree against hop count: one topology may have a low node degree but a high average hop count (e.g. a ring), while another may have a high node degree but a low average hop count (e.g. a mesh), so comparisons between topologies become tricky. Implementation details have to be factored in before an astute choice can be made.
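As a concrete illustration of this trade-off (an illustrative computation under uniform traffic, not a result from the thesis), the average hop count of a 16-node ring can be compared with that of a 4x4 mesh:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int n = 16;                       /* 16 nodes: ring vs 4x4 mesh */
        long ring = 0, mesh = 0;
        for (int s = 0; s < n; s++)
            for (int d = 0; d < n; d++) {
                int delta = abs(s - d);
                if (delta > n / 2) delta = n - delta;  /* ring: shorter way */
                ring += delta;
                /* mesh: Manhattan distance between (row, col) coordinates */
                mesh += abs(s / 4 - d / 4) + abs(s % 4 - d % 4);
            }
        /* ring: degree 2, average 4.00 hops; mesh: degree 4, average 2.50 hops */
        printf("avg hops: ring %.2f, mesh %.2f\n",
               (double)ring / (n * n), (double)mesh / (n * n));
        return 0;
    }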

Maximum channel load: This is the maximum number of bits per second that can be injected by every node into the network before it saturates; the metric is useful for estimating the maximum bandwidth the network can support. The higher the maximum channel load on a topology, the greater the congestion caused by the topology and routing protocol, and the lower the resulting throughput. The specific traffic pattern affects the maximum channel load substantially, so a representative traffic pattern should be used when estimating maximum channel load and throughput.

Significant research exists on all of the topologies discussed above. It is observed through the survey that the majority of on-chip network proposals gravitate towards either mesh or ring topologies. For example, the IBM Cell processor [34], the first product with an on-chip network, uses a ring topology due to its simple design, its ordering properties and its low power consumption; in total, four rings were employed to boost bandwidth and alleviate latency. The proposed Intel Larrabee [35] also adopted a two-ring topology. The MIT Raw chip, the first chip with an on-chip network, adopted a mesh topology, as did its follow-on commercialization, the Tilera TILE64 chip [36]; each parallel network in the chip is a mesh, and different types of traffic are routed over the distinct networks.

It is also seen through the survey that researchers have attempted comparative evaluations of different NoC topologies, evaluating throughput and delay under different injection rates and switching techniques. Gen Fen, Wu Ning, and Wang Qi [37] performed a simulation study using 2D Mesh, Fat-Tree (FT) and Butterfly Fat-Tree (BFT) topologies. The switching techniques used were wormhole (WH) and virtual cut-through (VCT), and the performance of the topologies was analyzed for throughput and delay under a random traffic pattern in which every node generates a packet and sends it to a randomly chosen destination in the network. Their results concluded that latency and throughput increase with the injection rate for all topologies under WH switching; as the injection load approaches the accepted traffic limit, message contention rises and latency increases. FT has the lowest latency and highest throughput at medium injection rates. A performance evaluation along similar lines was carried out by Abdul Quaiyum Ansari, Mohammad Rashid Ansari and Mohammad Ayoub Khan [38]. They compared Mesh, Torus, Cmesh and Fat Tree topologies for throughput and delay using a uniform traffic pattern in which each source sends packets to each destination with equal probability. It was concluded that as the injection rate increases, throughput and delay increase, rising sharply at one point before saturating. The torus topology can withstand high injection rates without saturation and performs better than the other topologies. It was further concluded that the topologies all have different injection rate ranges for the same network size.

1.3.4 NoC Switching Techniques

Another important parameter of a NoC architecture is its switching technique: the flow of data through the routers of the network is determined by the switching technique [39]. The different switching techniques used are described below.

1.3.4.1 Circuit Switching

In circuit switching, the physical path between the source node and the destination node is reserved before data transmission. Since there is a dedicated path, there is a very low probability of packet loss. However, the bandwidth is sometimes used inefficiently, because the path reserved for one connection cannot be used by others until it is released [40][41]. This technique does not scale well with NoC size.

1.3.4.2 Packet Switching

In the packet switching technique, messages are divided into packets at the source node and these packets are transmitted from the source to the destination node independently through different routers. Each packet may follow a different route depending on the current status of the network. Packet switching is the most commonly used technique in NoC because of its potential for providing simultaneous data communication between many source-destination pairs. Packet switching is classified as follows:

I. Wormhole Switching

In wormhole switching, packets are divided into smaller fixed-length flow control units (flits) [42]. The header flit contains the information required for making the path decision from source to destination. Only the header flit incurs the delay of deciding the path; the remaining flits belonging to the same packet simply follow the path taken by the header flit, significantly reducing the average message latency. A packet can therefore spread across consecutive routers like a worm, hence the name wormhole switching. Since packets are divided into smaller flits, the buffer size required at each router is reduced to the size of a flit instead of the size of a packet. The main disadvantage of wormhole switching is that when the header flit is blocked during transmission, the complete packet gets blocked [43]. This switching is also more susceptible to deadlock due to dependencies between links.
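The division of a packet into flits can be sketched as follows (the flit size and field layout here are assumed for illustration and are not the thesis's packet format):

    /* Segment a payload into fixed-size flits (illustrative sketch). */
    #include <string.h>

    #define FLIT_BYTES 4
    enum flit_type { HEAD, BODY, TAIL };

    struct flit {
        enum flit_type type;          /* HEAD carries the routing info */
        unsigned char dest;           /* destination, inspected in HEAD only */
        unsigned char data[FLIT_BYTES];
    };

    int make_flits(const unsigned char *payload, int len,
                   unsigned char dest, struct flit *out) {
        int n = (len + FLIT_BYTES - 1) / FLIT_BYTES;  /* number of flits */
        for (int i = 0; i < n; i++) {
            out[i].type = (i == 0) ? HEAD : (i == n - 1) ? TAIL : BODY;
            out[i].dest = dest;       /* routers route only on the HEAD flit */
            memcpy(out[i].data, payload + i * FLIT_BYTES,
                   (i == n - 1) ? len - i * FLIT_BYTES : FLIT_BYTES);
        }
        return n;
    }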

II. Store and Forward Switching

In store and forward switching, the complete packet is stored at a node before being forwarded to the next node on the path. Each node along the path to the destination must therefore have a buffer large enough to store the whole packet. Store and forward switching introduces considerable overhead and limits performance, given the area and power constraints of on-chip networks [19].

III. Virtual Cut-through Switching

In virtual cut-through switching, the data is forwarded in the form of packets, and the buffer size required at each router equals the size of a packet. A packet can cut through to the input of the next router before the entire packet has been received at the current router, but only if there is enough storage at the next downstream router to hold the entire packet. The router can start forwarding the header and following data bytes as soon as it receives them, drastically reducing latency compared with the store and forward technique. At high network load, virtual cut-through switching behaves like store and forward switching.


1.3.5 Routing Algorithms

The routing algorithm decides which path a packet follows through the network to reach its destination. The main goal of a routing algorithm is to distribute traffic evenly among the paths supplied by the network topology so as to minimize contention and avoid hotspots, thus improving throughput and reducing network latency. Routing algorithms are generally divided into three classes: deterministic, oblivious and adaptive.

In a deterministic algorithm, all packets from a source to a destination node always follow the same path; the most commonly used deterministic algorithm for network on chip is Dimension Ordered Routing (DOR). In an oblivious algorithm, packets may traverse different paths from the source node to the destination node, but the path is selected without regard to network congestion. In an adaptive algorithm, the path taken by packets from the source node to the destination node depends on the state of the network traffic.
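As an illustration of DOR on the 2D mesh considered later in this thesis, the following minimal sketch (assumed code, not the thesis's router logic) routes a packet fully along the X dimension before the Y dimension:

    /* XY dimension-ordered routing on a 2D mesh (illustrative sketch). */
    enum port { LOCAL, EAST, WEST, NORTH, SOUTH };

    /* Given the current router (x, y) and the destination (dx, dy),
       return the output port. Because X is fully resolved before Y,
       the route is deterministic and deadlock-free on a mesh. */
    enum port xy_route(int x, int y, int dx, int dy) {
        if (dx > x) return EAST;
        if (dx < x) return WEST;
        if (dy > y) return NORTH;
        if (dy < y) return SOUTH;
        return LOCAL;   /* arrived: deliver to the attached IP core */
    }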

Routing algorithms are also classified as minimal or non-minimal. Minimal algorithms select a path from source to destination with the minimum number of hops. Non-minimal routing algorithms allow paths to be selected from source to destination that may increase the number of hops; such algorithms are mostly used when there is congestion.

1.4 Objectives of the Thesis and Outline

Network on chip is an active, upcoming research area. Many aspects of NoC are still under investigation and need further exploration and understanding. In this thesis our main focus is to design and implement an efficient switch for a NoC platform; the switch is the main component that determines the latency, throughput, reliability and efficiency of the entire NoC design. We also implement the computing element using a Hardware Description Language (HDL). The computing element is a soft core processor used for generating, processing and receiving packets, and it is embedded with the switch element. Further, a 2D m x n mesh topology of the above switch is implemented and tested using the packet switching technique for data communication with ATM packets. Finally, we study the performance of the scalable topology. The objectives are:

I. To synthesize the switch using HDL and test it for its functionality.

II. To implement the switch with an embedded computing element using HDL and test it for its functionality.

III. To synthesize a two-dimensional m x n mesh topology of the above switch node on FPGA hardware and verify it using the packet switching technique for communication with the soft core processor under a suitable real-time operating system.

IV. To create a static IP table for a fixed-size m x n matrix topology to establish links over the interconnects for the routing algorithm, verified using a Real Time Operating System (a minimal sketch of such a table follows this list).

V. To carry out a performance analysis of the scalable topology for throughput, delay, bandwidth, packet size and queue size.
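The idea behind objective IV can be sketched as follows (the table values and field names here are hypothetical; the thesis's actual look up table layout is presented in chapter 3):

    /* Static routing look up table for one switch (illustrative sketch). */
    struct lut_entry {
        unsigned short vci;   /* packet identifier, e.g. the ATM VCI */
        unsigned char  port;  /* output port of this switch */
    };

    /* Fixed at design time for a fixed m x n topology (hypothetical values). */
    static const struct lut_entry lut[] = {
        { 0x0010, 0 }, { 0x0020, 1 }, { 0x0030, 2 }, { 0x0040, 3 },
    };

    int lookup_port(unsigned short vci) {
        for (unsigned i = 0; i < sizeof lut / sizeof lut[0]; i++)
            if (lut[i].vci == vci)
                return lut[i].port;
        return -1;  /* unknown destination: drop or flag an error */
    }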

The thesis is organized as follows:

Chapter 2 discusses the NoC literature survey and the state of the art in NoC. Various surveys with reference to NoC architecture, router design, routing algorithms, link interfaces and flow control are described in detail.

Chapter 3 discusses the proposed NoC router architecture, its implementation using a hardware description language, and its functional simulation. We describe in detail the input module, scheduler and crossbar logic components.

Chapter 4 discusses the implementation of the computing node and the testing of the design for its functionality by targeting it on an FPGA device. The chapter elaborates on the design of the computing node using the Nios-II soft core processor and its integration with the switch.

Chapter 5 discusses the implementation of the NoC platform using a 2-D mesh topology and the testing of the platform for inter-communication using a real time operating system. We describe in detail the verification methodology for data communication over various tasks using the real time operating system µC/OS.

Chapter 6 discusses the performance evaluation and analysis of 4x4 and 8x8 mesh topologies for throughput and delay by varying packet size, bandwidth, queue size and link delay.

Chapter 7 gives concluding remarks along with a discussion and future trends.


Chapter 2

LITERATURE SURVEY

AND

STATE OF THE ART IN NoC


The NoC design can be broken down into its various building blocks: its architecture, topologies, router architecture design, routing algorithm, link interface and flow control. Each of these blocks can be a separate topic of research. We present a state of the art research survey and discussion based on these components.

2.1 NoC Architecture

Researchers have studied various NoC architectures to identify the best NoC for a given application, with the analysis based on various network parameters. P. P. Pande, Cristian Grecu, Michael Jones, Andre Ivanov and Resve Saleh [44] performed a comparative evaluation study of the SPIN, CLICHÉ, Torus, Folded Torus, Octagon and BFT NoC architectures. The performance metrics used were throughput, transport latency, energy and area requirements. To carry out the comparison they developed a simulator employing flit-level event-driven wormhole routing, with self-similar and Poisson-distributed injected traffic. They also used similar types of switching and routing circuits to ensure the consistency of the comparisons. The results indicate that the throughput of BFT, CLICHÉ and Folded Torus is lower than that of SPIN and Octagon under a uniform traffic pattern, due to more links between source and destination. With respect to energy, SPIN and Octagon have greater energy dissipation at saturation than the other architectures, although on average these architectures provide high throughput and low latency; their die area overhead is also higher than that of the other architectures. Their conclusion was that some architectures can sustain very high data rates at the expense of high energy dissipation and considerable silicon area overhead, while others provide lower data rates and lower energy dissipation levels.

Abdul Quaiyum Ansari, Mohammad Rashid Ansari and Mohammad Ayoub Khan [45] carried out a study along similar lines. They performed a performance study using mesh, torus, c-mesh and fat-tree network topologies, with analysis parameters such as latency, throughput, injection rate and hop count, using the Booksim 2.0 simulator. They concluded that network latency increases as the injection rate increases, that throughput increases linearly with the injection rate, and that topologies of the same network size have different injection rate ranges. The analysis concludes that the c-mesh topology has a very small injection rate range and that the torus topology performs better than the other topologies.

Bharat Phanibhushana and Sandip Kundu [46] presented a synthesis method for the hierarchical design of a NoC for a given task graph and system deadline which optimizes the router area. They proposed a 2-phase design flow, namely topology generation and statistical analysis, and used proportioned Monte-Carlo test cases with meeting the deadline as the metric of goodness. The proposed solution was compared with Simulated Annealing (SA) based network generation and a static design approach. Their results claim a 10% performance benefit over SA, 16% over a standard mesh and 30% over the static design, and a total router area benefit of 59% over SA, 48% over the mesh and 55% over the static design.

Mahendra Gaikwad and Rajendra Patrikar [47] proposed an energy-aware model for a NoC platform using a Perfect Difference Network (PDN) based on the mathematical notion of Perfect Difference Sets (PDS). They considered a chordal ring structure of the PDN for n=13, based on the PDS {0, 1, 3, 9}; nodes in the PDN are connected either directly or through an intermediate node at distance 2, so messages can be routed through the network in one or two hops. The results were compared with a 4x4 mesh inter-tile CLICHÉ NoC architecture, where it was observed that significant energy savings are achieved in their proposed model for transferring random streams of data from one node to another.

Chia-Hsin Owen Chen, Niket Agarwal, Tushar Krishna, Kyung-Hoae Koo, Li-Shiuan Peh and Krishna C. Saraswat [48] performed a comparative study of physical express topologies (Express Physical Channels (EPC), based on express cubes) and virtual express topologies (Express Virtual Channels (EVC)) for meshes with large node counts in on-chip networks, under bisection bandwidth, router area and power constraints. The impact of Capacitively Driven Low-Swing Interconnect (CDLSI) links on the above designs was also evaluated. They observed that virtual express topologies give higher throughput and are more robust across traffic patterns, while physical express topologies give better throughput per watt and can leverage CDLSI to reduce latency and increase throughput; both designs have similar low-load latencies.

Ying-Cherng Lan, Hsiao-An Lo, Yu Hen Hu and Sao-Jie Chen [49] proposed a Bidirectional channel Network on Chip (BiNoC) architecture. A novel on-chip router architecture was also developed which allows the communication channels to dynamically self-reconfigure to transmit flits in either direction, with the direction of flow controlled by a Channel Direction Control (CDC) algorithm. The proposed architecture was evaluated using synthetic and real-world traffic patterns; the results show a significant reduction in packet delivery latency at all packet injection rates, and bandwidth utilization and traffic consumption rates also exhibited higher efficiency than conventional NoCs. Better latency results were also achieved under different traffic patterns with less buffer usage.

Snaider Carrillo, Jim Harkin, Liam J. McDaid, Fearghal Morgan, Sandeep Pande, Seamus Cawley and Brian McGinley [50] presented a novel Hierarchical Network-on-Chip (HNoC) architecture for Spiking Neural Network (SNN) hardware. This architecture incorporates a spike traffic compression technique to exploit SNN traffic patterns and locality between neurons, which improves throughput and reduces traffic overhead. The results show a high throughput of 3.33 x 10^9 spikes per second, and synthesis results for the proposed HNoC demonstrate an efficient low-cost area utilization of 0.587 mm² and a low power consumption of 13.16 mW for a single cluster of 400 neurons.

Md. Hasan Furhad and John-Myyon Kim [51] proposed an Extended Diagonal Mesh (XDMesh) topology for NoC architectures which includes diagonal links between remote nodes. Experimental results show that XDMesh outperforms existing state-of-the-art NoC topologies in terms of silicon area and energy consumption.

Wu Chang, Li Yubai and Chai Song [52] proposed a new torus topology structure and a corresponding routing algorithm for NoC applications, obtained by redefining the router denotations and changing the original router locations in the traditional torus topology. The proposed structure and corresponding algorithm were implemented using SystemC. A performance evaluation of the average delay and normalized throughput of the proposed structure against an XY-routed mesh for four different traffic patterns claims that the proposed torus structure is more suitable for NoC applications.

2.2 NoC Router Architectures

Router is an important component in NoC architecture, it is also called as the communication

backbone of the NoC architecture, and an efficient router design directly affects the system

performance. Many researchers have contributed their research work in this area. Jawwad

Latif,Hassan Nazeer Chadhry, Sadia Azam, Naveed Khan baloch[53] proposed a folded dual

Xbar architecture which is a combination of dual Xbar and folding technique. The dual Xbar

router architecture with buffer and buffer less feature will reduce buffer r/w energy with dual

crossbars and switch folding technique will increase resource utilization by reducing wire

density and decreasing the logic multiplesxures in crossbar. Simulations were performed

using OMNET++ platform to compare the performance of proposed architecture with the

conventional architecture. The results depicts that in proposed 2-folded dual Xbar architecture

26

there is slight increase in throughput and reduce in buffer r/w energy by average of 46% at

high load as compared to conventional architecture similarly for 3-Folded Dual Xbar

architecture there is 16.6% increase in throughput with 43 to 45% reduced buffer r/w energy

but little increase in crossbar. Umamaheswari S., Meganathan D and Raja Paul Perinbam

J[54] proposed a heterogeneous adaptable router which will reduce latency in irregular mesh

NoC architectures. Large input buffers in the router will incur large hardware overhead

followed by excessive power consumption although improves the efficiency of NoC

communication. Router with small buffers result in high latency although reduces power

consumption. In the proposed NoC router input buffers can be allocated dynamically reducing

the latency. The results shows that there is 20% decrease in latency and 9% increase in

throughput in 4 X 4 irregular mesh topology NoC with a buffer depth of 4 slots. The proposed

router architecture was also exposed to the synthetic traffics like hotspot, tornado, uniform

and bit complement in 8X8 irregular mesh topology NoC architecture and it offered 30.42%

reduction in latency and 18.33% increase in saturation throughput. Further the router was also

used for E3S benchmark applications where the latency was reduced by 22.63% as compared

to static router with 53% less power consumption and 55% less reduction in buffer

requirement. Anbu chozhan. P,D. Muralidharan and R. Muthaiah[55] proposed five port NoC

router design with a First Word Fall Through (FWFT) based asynchronous First In First Out (FIFO) buffering technique which improves timing and power consumption. The proposed router design uses a mask-based round-robin arbiter, where the mask is generated using a round-robin pointer. Two priority arbiters are used, of which one handles the entire request and the other handles the masked request; based on the results of the two priority arbiters, the round-robin pointer is updated. Priority scheduling is enforced by mask-based round-robin arbitration while maintaining fairness to all participants. The router design was simulated using the ISIM simulator of Xilinx ISE 13.2 to verify its functionality. Qing Sun, Lijun Zhang and Anding Du [56] proposed and designed a low-latency NoC router with wormhole virtual-channel switching: wormhole switching saves cache space effectively, while virtual-channel switching sets up several logical virtual channels over a physical link. The proposed router uses two switching modes in addition to two clock cycles to reduce latency. The router was developed using the Xilinx ISE tool and simulations were performed using ModelSim. Rohit Sunkam Ramanujam, Vassos Soteriou, Bill Lin and Li-Shiuan [57] proposed a Distributed Shared Buffer (DSB) based router for NoC; the DSB router architecture is extensively used in high-performance Internet packet routers. The proposed router uses two crossbar stages, and packets are buffered between these two crossbar stages rather than at the output ports. Compared to the Input Buffered Router (IBR) architecture, the proposed DSB router achieves up to 19% higher saturation throughput on synthetic traffic patterns, reaching up to 94% of the ideal saturation throughput; further, for SPLASH-2 benchmarks with high contention, packet latency was reduced by 60% on average, and under synthetic workloads the achieved saturation throughput was within 10% of the theoretical ideal.

Weiwei Fu, Jingcheng Shao, Bin Xie, Tianzhou Chen and Li Liu [58] proposed a router micro-architecture design with Neighbor Flow Regulation (NFR); the design builds an additional regulation network using low-cost wires, which collects information on flows in neighboring routers and regulates the arbitration schemes applied to them to prevent congestion and starvation. The simulation results show a 6.7% increase in network throughput and a 28% improvement in switch matching efficiency under hotspot traffic. Yaniv Ben-Itzhak, Israel Cidon, Avinoam Kolodny, Michael Shabun and Nir [59] proposed a heterogeneous NoC router architecture based on a shared-buffer architecture, which supports different link bandwidths and different numbers of virtual channels per unidirectional port, with the advantage of decoupling ingress and egress bandwidth. They also presented a new approach to reduce the number of shared buffers required for a conflict-free router, which reduces area and power consumption dramatically. The saturation throughput improved by 6% to 47% for standard traffic patterns compared to an optimal input-buffer homogeneous router, and a significant run-time improvement was achieved for a NoC-based CMP running the PARSEC benchmark. A comparative performance study of the proposed router with predefined traffic patterns against optimal input-buffer homogeneous and heterogeneous routers for NoC-based CMPs of various sizes shows that the proposed router offers better scalability and an area and power reduction of 15% to 60%. Ahmed Bin Dammas, Adel Soudani and Abdullah Al-Dhelaan [60] proposed an enhanced router architecture for NoC that ensures specific management according to a service classification with an enhanced routing process. They proposed a new flow-control mechanism to avoid congestion in the NoC, with an appropriate approach to manage flit buffering which allows dropping low-priority flits when the router is in a congestion state. The performance of this router was evaluated along with its hardware characteristics, indicating its suitability for low-power NoC applications. Bidhas Ghoshal, Kanchan Manna, Santanu Chattopadhyay and Indrani Sengupta [61] proposed an online transparent test technique to detect latent hard faults developed in the FIFO buffers of NoC routers during field operation. They also proposed a transparent SOA-MATS++ test generation algorithm which performs periodic online testing of the buffers within the NoC routers and prevents the accumulation of faults. The simulation results show that the test circuit does not affect the overall throughput of the system. Wooyoung Jang and David Z. Pan [62] presented a NoC router with explicit Synchronous Dynamic Random Access Memory (SDRAM)-aware flow control. This SDRAM-aware flow controller, based on priority-based arbitration, schedules memory requests so as to prevent data contention, bank conflicts and short-turnaround bank interleaving, improving memory utilization and latency. The experimental results show that the proposed design significantly improves memory utilization and latency compared to a conventional NoC design without the SDRAM-aware router.

2.3 NoC Router Algorithms

The performance of a NoC system also depends on the underlying routing algorithm. The routing algorithm decides the path of a packet and is generally classified as deterministic or adaptive. In deterministic routing the path is determined by the source and destination addresses, while in adaptive routing the path is decided based on dynamic network conditions. Considerable research into designing enhanced routing algorithms, and comparative analysis to decide the best routing algorithm for a given NoC application, is ongoing.
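For instance, the XY algorithm referred to throughout this chapter is deterministic: a packet first travels along the X dimension until its column matches the destination, then along Y. A minimal C model is sketched below; the coordinate arguments and direction names are illustrative choices, not part of any cited design.

    /* Deterministic XY routing on a 2-D mesh: correct the X
     * coordinate first, then Y; a packet matching both
     * coordinates is ejected to the local processing element. */
    typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } direction_t;

    direction_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
    {
        if (cur_x < dst_x) return EAST;
        if (cur_x > dst_x) return WEST;
        if (cur_y < dst_y) return NORTH;
        if (cur_y > dst_y) return SOUTH;
        return LOCAL;
    }

Because the output port depends only on the current and destination addresses, the same packet always takes the same path, which is what distinguishes deterministic from adaptive schemes.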

Gaoming Du, Jing He, Yukun Song, Duoli Zhang and Huajie Wu [63] carried out a comparative study of the XY routing algorithm, the turn routing algorithm and the retrograde algorithm on a NoC based on the Packet-Connected Circuit (PCC) scheme with different packet lengths and injection rates. The experimental results show that the retrograde and turn algorithms give better performance than the XY routing algorithm; the average throughput and average latency improved by 32.99% and 12.16% respectively for a high injection rate and long packet length. Ebrahim Behrouzian-Nezhad and Ahmad Khademzadeh [64] presented a routing algorithm for NoC called BIOS, which is based on Best Input and Output Selection. Simulation results for the proposed routing algorithm show better performance than other deterministic and adaptive routing algorithms for different traffic patterns. Andrew DeOrio, David Fick, Valeria Bertacco, Dennis Sylvester, David Blaauw, Jin Hu and Gregory Chen [65] proposed Vicis, a fault-tolerant NoC architecture and associated routing algorithm which allows communication to continue even with a large number of transistor failures. The proposed architecture first detects errors within the router and reconfigures the architecture around them; if link connectivity is lost or a router is disabled due to faults, rerouting is performed using a distributed routing algorithm for meshes and tori. This maintains high reliability while incurring an overhead of 51%.

Wei Ni and Zhenwei Liu [66] introduced a new routing algorithm called Distance Prediction XY (DPXY) for self-similar traffic, based on buffering information and center distance. A comparative study of the proposed routing algorithm against the XY, West-First and Odd-Even routing algorithms shows that the proposed algorithm can effectively avoid centric congestion, decrease latency and improve network performance. Jan Moritz Joseph, Christopher Blochwitz and Thilo [67] proposed a routing technique with adaptive allocation of default paths in routers. It retains the standard network load, since the method is non-speculative, and packets routed through non-prioritized links are not penalized by the default paths. Routers configure themselves locally, and hence virtual point-to-point paths emerge temporarily and accelerate interleaved data streams. Simulations were performed using PARSEC benchmarks; the results show a 4.8% to 12.2% reduction in average packet latency. K. Tatas and C. Chrysostomou [68] proposed an adaptive routing scheme for buffered and bufferless NoCs based on fuzzy logic. To select the output port for an incoming flit, the dynamic traffic and power consumption on neighboring router links are taken into account, and the input link cost is dynamically calculated using fuzzy logic control. The proposed scheme was implemented in Application Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA) technologies, demonstrating that the hardware area overhead is minimal and that no additional latency is imposed. Salih Bayar and Arda Yurdakul [69] proposed a custom 2-D NoC architecture with reconfigurable switches. These switches can be configured according to the application requirements during both the design and runtime phases. They also proposed a mapping algorithm which sets up paths through these switches; the experimental results show that the proposed mapping algorithm reduces routing cost by 79.96% for real-life embedded applications.

2.4 NoC Link Interfaces

Recent advances in silicon photonics technology, including Micro-ring Resonators (MRs) to build photonic switches and waveguides to transfer messages over optical links, have attracted researchers to use this optical technology in NoC architectures. Optics is attractive due to its high bandwidth, low power consumption and low error rate. Researchers are adopting optical interconnects or proposing optical router designs, with corresponding routing algorithms, that can be used for power-efficient and high-performance NoC applications. A few research articles are found in this area. Cisse Ahmadou Dit Adi, Ping Qiu, Hidetsugu Irie, Takefumi Miyoshi and Tsutomu [70] proposed OREX, a hybrid NoC architecture consisting of an optical ring and an electrical crossbar central router. The electrical crossbar switch is used to set up a path between the source and destination nodes, and the messages are transmitted over the optical ring. The performance of OREX using static wavelength allocation under a probabilistic traffic pattern has been evaluated; the results show that the latency and power consumption of the proposed hybrid NoC architecture are much lower than those of a pure electrical NoC architecture. Xianfang Tan, Mei Yang, Lei Zhang, Yingtao

Jiang and Jianyi Yang [71] proposed a generic, passive, scalable and non-blocking router architecture, namely the Generic Wavelength-routed Optical Router (GWOR), which is scalable from 4x4 to any size. Different input wavelengths and corresponding micro-ring resonators (MRRs) are used to realize the routing in GWOR. A comparative analysis between the proposed design and existing router designs shows that the 4x4 GWOR uses the fewest MRRs and has the least power loss. Parisa Khadem Hamedani, Natalie Enright Jerger and Shaahin Hessabi [72] presented QuT, an all-optical NoC using passive microrings along with a deterministic wavelength routing algorithm for optical switch optimization to reduce the number of wavelengths and microrings. The proposed QuT is compared with the 128-node Optical Spidergon, λ-router and Corona under different synthetic traffic patterns; the results show that power is reduced by 23%, 86.3% and 52.7% respectively for these optical topologies, with competitive latency. Xiaolu Wang, Huaxi Gu, Yintang Yang, Kun Wang and Qinfen Hao [73] proposed a ring-based packet-switched NoC architecture called RPNoC, which uses fewer optical devices than other optical NoCs. The proposed architecture has a small diameter compared with many other packet-switched optical NoCs, which contributes to low energy consumption, high saturation bandwidth and low latency. The use of both Wavelength Division Multiplexing (WDM) and Space Division Multiplexing (SDM) technology makes the architecture highly scalable, and the proposed design also assures deadlock freedom and low resource consumption. An evaluation under different realistic and synthetic traffic for a 64-node RPNoC was carried out; the simulation results show that the proposed design achieves lower latency and higher throughput than other packet-switched optical NoCs (ONoCs), and it also achieves the lowest energy consumption. Xiuhua Li, Huaxi Gu, Ke Chen, Liang Song and Qingfen Hao [74] proposed a Screwy Torus Topology (STorus) along with a new optical router design. The optical network is divided into two subnets connected by a gateway, each subnet consisting of the STorus architecture. The diameter of the proposed architecture is half that of a torus and a quarter that of a mesh of the same scale. The hierarchical waveguide layout gives fewer crossings, leading to lower optical power loss, with an average loss of 0.4 dB. Compared with the traditional mesh topology in simulation, the throughput of the proposed architecture improves by 59.1% under hotspot traffic and by 52.3% under uniform traffic. Somayyeh Koohi and Shaahin Hessabi [75] proposed the Contention-free Optical NoC (CoNoC) architecture for on-chip routing of packets. The CoNoC architecture is built upon all-optical switches that passively route data based on their wavelength, eliminating the need for resource reservation at intermediate nodes; this improves the scalability and performance of the network. The proposed architecture is compared with an electrical NoC and alternative ONoC architectures under different synthetic traffic patterns. The power consumption per packet is reduced by 19%, 28%, 29% and 91%, and the energy reduction per packet is 28%, 40%, 20% and 99%, against Columbia, Phastlane, λ-router and the electric torus respectively. Lei Zhang, Mei Yang, Yingtao Jiang and Emma [76] presented the Wavelength Routed Optical Network (WRON), an interconnection architecture suitable for building on-chip optical micro-networks. They also generalized the routing scheme for WRON using any two routing parameters out of the source node address, the destination node address and the routing wavelength. They further proposed a Recursive Wavelength Routed Optical Network (RCWRON) architecture using WRON as the primitive platform; RCWRON serves as the basis for the Redundant Wavelength Routed Optical Network (RDWRON) architecture.


2.5 NoC Flow Control

Flow control determines how network resources such as channel bandwidth, buffer capacity and control state are allocated to packets traversing the network. Buffered flow control has several variants, such as credit-based flow control, handshaking signals, acknowledge flow control, stall/go flow control and T-error flow control. A good flow control protocol lowers the delay experienced by messages at low loads by not imposing high overhead in resource allocation, and drives up network throughput by enabling effective sharing of buffers and links across messages.
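As an illustration of the credit-based variant, the C sketch below models the sender-side credit counter for one link; the names are our own, and a real router implements this logic in hardware per virtual channel.

    #include <stdbool.h>

    /* Credit-based flow control, sender side: the counter starts at
     * the depth of the downstream input buffer. Sending a flit
     * consumes a credit; a returned credit signals a freed slot. */
    typedef struct { int credits; } link_state_t;

    bool try_send_flit(link_state_t *l)
    {
        if (l->credits == 0)
            return false;      /* downstream buffer full: stall, never drop */
        l->credits--;          /* the flit will occupy one remote slot */
        return true;
    }

    void on_credit_return(link_state_t *l)
    {
        l->credits++;          /* receiver forwarded or consumed a flit */
    }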

Zhiliang Qian et al. [77] proposed a new NoC latency model that accounts for arrival traffic burstiness, general service-time distributions, finite buffer depth and arbitrary packet length combinations. A link dependency analysis technique is proposed to determine the order in which the queuing analysis is applied. The accuracy of the model is demonstrated using both synthetic traffic and real applications: a 70x speedup over simulation is achieved with less than 13% error in the proposed analytical model, which benefits the NoC synthesis process. They further reviewed several techniques for NoC performance evaluation, ranging from traffic-model construction to analytical and simulation-based latency models. Their work summarized the typical workloads employed in NoC-based system evaluation and reviewed approaches in traffic analysis for capturing both short- and long-range dependence behaviors; in addition, a multi-fractal based approach to model non-stationary traffic was introduced. Advances in NoC architecture are also being researched in higher-dimensional spaces, i.e. the 3D domain. Such 3D NoC architectures drastically improve throughput and latency at the cost of power, area and the higher cost of 3D dies for Integrated Circuit fabrication. These 3D architectures are discussed in a later part of the thesis.

Chapter 3

NoC SWITCH/ROUTER ARCHITECTURE DESIGN AND ITS IMPLEMENTATION

3.1 Proposed Switch Architecture Design

The switch/router is the main building block of a NoC, or for that matter of any network architecture, and undertakes the important task of coordinating the flow of data in the network. It is also important to keep the design of the switch as simple as possible, since its complexity leads to an increase in implementation cost. The proposed architecture consists mainly of three parts: the input module, the crossbar and the scheduler.

Figure 3.1: Proposed switch architecture

The number of input and output ports of a NoC switch depends on the network topology. The proposed switch design is for a mesh NoC topology; thus our design consists of 5 input ports and 5 output ports. Four of the five ports are used to connect the four neighboring routers in the north, south, east and west directions, and one port is used to connect the local processing element.

3.1.1 Input Module:

For each input port there is an input module which is responsible for handling, storing and processing the arriving data packets. Each input module has a data line, a frame pulse input, a clock input, a reset input and a global reset input.

3.1.2 The Crossbar:

The crossbar handles the task of physically connecting an input port to its destined output port, based on the grant signals issued by the scheduler.

3.1.3 The Scheduler:

The scheduler is responsible for receiving requests from all the input modules and, based on the scheduling algorithm, determining the best realizable configuration for the crossbar.

3.2 Working of Switch

The input to the switch/router is a data packet arriving from an adjacent switch/router on the 8-bit-wide input data line of the respective input module. In our design we used fixed 53-byte Asynchronous Transfer Mode (ATM) packets for implementation purposes. Another input to the switch is the clock, which is global to all switch components, and the frame pulse input, a one-bit-wide signal used to signal the start of a data packet. One data byte can be input to the switch in every clock cycle. On every falling edge of the clock the frame pulse signal is checked; this pulse should be at least one clock cycle wide. Once the frame pulse is detected, the input module takes in the first data byte of the packet on the second rising edge of the clock. The reset input of the switch is used to reset all the counters of the design and initialize them to their initial values. The global reset signal is used to reset all input buffers. The waveforms are explained in the functional simulation of the router in figure 3.9 and figure 3.10.

3.2.1 ATM packet format:

Figure 3.2 depicts the ATM packet format. An ATM packet is a fixed 53-byte cell with a 48-byte payload and 5 bytes of header information. The second, third and fourth bytes of the ATM packet contain the Virtual Circuit Identifier (VCI), which is used by the input module to determine the output port.


Figure 3.2: (a) ATM Packet format structure (b) ATM header

Every data byte arriving at the input module (figure 3.3) is first stored in the Random Access Memory (RAM) component of the input module. As the input packet is read by the input module, it simultaneously extracts the second, third and fourth bytes of the packet into the VCI register; this information is then sent to the Look Up Table (LUT) module maintained inside the input module.

Figure 3.3: Input module of the proposed switch

The LUT is a Read Only Memory (ROM) based component that can be initialized with an arbitrary set of data to form the routing table of the switch. The ROM has eight rows and each row is 36 bits wide. These bits consist of a 16-bit input VCI, a 16-bit output VCI and a 4-bit output port number, as shown in figure 3.4. This LUT information is used by the input module to update the old VCI bytes of the header with the new VCI bytes, along with the output destination port number for that packet. The input module then sends a request for that specific output port to the scheduler and waits for a grant signal. Once a grant is issued, the data bytes are removed from the RAM in the order in which they arrived and dispatched over the port for which the grant was requested. After the entire packet is sent, the same process is repeated for the other packets.
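The byte-level behaviour can be reconstructed from tables 3.3 and 3.4: the 16-bit VCI occupies the low nibble of header byte 2, all of byte 3 and the high nibble of byte 4. The C sketch below is a software model of this extraction and rewrite, not the VHDL implementation itself.

    #include <stdint.h>

    /* Extract the 16-bit VCI from the 5-byte ATM header
     * (bytes indexed from 0 here, so h[1], h[2], h[3]). */
    uint16_t extract_vci(const uint8_t h[5])
    {
        return (uint16_t)(((h[1] & 0x0F) << 12) | (h[2] << 4) | (h[3] >> 4));
    }

    /* Overwrite the old VCI with the one found in the LUT,
     * preserving the surrounding header nibbles. */
    void rewrite_vci(uint8_t h[5], uint16_t vci)
    {
        h[1] = (uint8_t)((h[1] & 0xF0) | (vci >> 12));
        h[2] = (uint8_t)(vci >> 4);
        h[3] = (uint8_t)(((vci & 0x0F) << 4) | (h[3] & 0x0F));
    }

For packet 1 of table 3.4, the header 72 75 6C 95 76 yields VCI 56C9; rewriting with 1234 from table 3.3 gives 72 71 23 45 76, matching the expected output.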

Figure 3.4: Look Up Table structure for the proposed switch

3.2.2 The Scheduler

The scheduler is responsible for receiving requests from the input modules for the various output ports and grants some of these requests by determining the best configuration for the crossbar. It keeps track of which ports are free and which ports are communicating with each other; this helps it to control arbitration and resolve contention. The decision mainly depends on the scheduling algorithm used. The algorithm has to be fast, efficient, fair and, most importantly, easy to implement in hardware.

There are many ways in which a crossbar scheduler can be implemented. A group of researchers at UC San Diego designed and implemented the Rectilinear Propagation Arbiter (RPA) and the Diagonal Propagation Arbiter (DPA), which use a round-robin priority scheme. The scheduling decision in RPA and DPA is implemented in combinational logic using a two-dimensional ripple carry arbiter.

3.2.3 Two Dimensional Ripple Carry Arbiters

A two-dimensional ripple carry arbiter is shown in figure 3.5. For simplicity we explain a 4x4 two-dimensional ripple carry arbiter, but the design can be scaled to m x n. The input and output ports are realized using arbiter cells; the row number of a cell corresponds to an input port and the column number to an output port. The pair of numbers marked on each cell specifies the request handled by that cell: for example, the pair (2, 3) on a cell handles a packet destined to go from input port 2 to output port 3. Each arbiter cell has an input signal R (request) and an output signal G (grant). R[i, j] is active when there is a packet at input port i destined for output port j, and G[i, j] is active when the request from input port i to output port j has been granted.

The scheduling decision is based on the following algorithm:

I. Start from the top leftmost cell (1, 1).

II. Once the cell is reached, move to its right and bottom cells.

III. If the R signal is active, activate the G signal if and only if no request has been granted in the cells at the top and to the left of the current cell.

IV. If a request is granted, inform the cells to the right and at the bottom by activating the appropriate signals.

Figure 3.5: 4 X 4 two dimensional ripple carry arbiter

This ensures that at any given time only one request is granted in each row or column (the one higher up or further to the left is granted).
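The same decision can be modelled in C as one raster scan over the request matrix; the fixed priority of cell (1, 1) falls directly out of the scan order. The array size and names below are illustrative.

    #include <stdbool.h>
    #define N 4   /* ports; the scheme scales to m x n */

    /* Two-dimensional ripple carry arbitration: grant R[i][j] only
     * if no grant has already been issued in row i or column j.
     * Scanning from (0,0) reproduces the top-left priority. */
    void ripple_arbitrate(const bool R[N][N], bool G[N][N])
    {
        bool row_busy[N] = { false }, col_busy[N] = { false };

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                G[i][j] = R[i][j] && !row_busy[i] && !col_busy[j];
                if (G[i][j]) {
                    row_busy[i] = true;   /* blocks the rest of row i */
                    col_busy[j] = true;   /* blocks the rest of column j */
                }
            }
    }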

The two-dimensional ripple carry architecture described in figure 3.5 is not fair, because it gives priority to the cells that are higher and to the left, i.e. towards cell (1, 1). To make this architecture fair, the researchers [78] modified the design into a cyclic two-dimensional ripple carry arbiter, which has some fundamental problems of its own: it uses combinational feedback loops, which are not supported by logic synthesis tools and are difficult to design and test. To overcome these problems, new architectures were introduced, the Rectilinear Propagation Arbiter (RPA) and the Diagonal Propagation Arbiter (DPA) [78].

DPA is a modified version of the two-dimensional RPA. Here individual cells are grouped into diagonal rows; the cells in a diagonal are those whose grants do not prevent granting the others, since they share no row or column. The arbitration process considers the first diagonal row and grants all requests in it; in the next time slot it moves to the next diagonal row, grants all requests in that row, and so on.
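A literal software rendering of that description is sketched below, under the assumption that diagonal d of an N x N arbiter contains the cells (i, (i + d) mod N); this indexing is our illustrative choice, and the hardware in [78] evaluates the diagonals combinationally rather than one per call.

    #include <stdbool.h>
    #define N 4

    /* Diagonal propagation arbitration, one time slot: serve the
     * rotating priority diagonal d. Its cells occupy distinct rows
     * and columns, so all requests on it can be granted at once. */
    void dpa_arbitrate(const bool R[N][N], bool G[N][N], int slot)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                G[i][j] = false;

        int d = slot % N;                 /* diagonal served this slot */
        for (int i = 0; i < N; i++)
            G[i][(i + d) % N] = R[i][(i + d) % N];
    }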

3.2.4 Comparison between DPA and RPA

Table 3.1 gives the results from timing simulations of the DPA and RPA architectures for 16-port and 32-port crossbar switches [78]. The results show that for the DPA architecture the number of cells (and therefore the cost) is almost half that of the RPA architecture. The propagation delay of the DPA design is also almost half of the delay in the RPA.

Table 3.1: Area and delay for each arbiter design

                  RPA(16)   DPA(16)   RPA(32)   DPA(32)
No. of cells      5766      2480      23814     10080
Area (mm²)        0.497     0.184     2.052     0.746
Delay (ns)        10.23     6.15      21.10     13.61

The comparative results in table 3.1 are quite convincing, so the DPA architecture was selected for our design.

3.2.5 The crossbar organization

The crossbar module in the design is responsible for physically connecting an input port to its

destined output port, based on the grants issued by the scheduler. The inputs to the crossbar

are taken from input modules of our switch. The outputs of crossbar are connected to the

output ports of the switch. The crossbar makes the appropriate connection between each input

and its corresponding output.
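Functionally, the crossbar amounts to one multiplexer per output port steered by the grant matrix; the C sketch below models one cycle of it with illustrative names (the scheduler guarantees at most one grant per column).

    #include <stdint.h>
    #include <stdbool.h>
    #define PORTS 5

    /* Crossbar model: each output port takes the byte from the
     * unique input port granted to it in this cycle, if any. */
    void crossbar(const uint8_t in[PORTS], const bool G[PORTS][PORTS],
                  uint8_t out[PORTS], bool out_valid[PORTS])
    {
        for (int j = 0; j < PORTS; j++) {
            out_valid[j] = false;
            for (int i = 0; i < PORTS; i++)
                if (G[i][j]) {
                    out[j] = in[i];
                    out_valid[j] = true;
                }
        }
    }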

3.3 Development Design Flow

For development and synthesis purposes we have used Altera's Quartus II 8.0 software. The

Altera Quartus II design software provides a complete, multiplatform design environment that

easily adapts to our design needs. It is a comprehensive environment for system-on-a-

programmable-chip (SOPC) design. The Quartus II software includes solutions for all phases

of FPGA and Complex Programmable Logic Device (CPLD) design. A Nios II soft-core processor is constructed in SOPC Builder, and the design is synthesized with the soft core on a Cyclone II FPGA using the Quartus II 8.0 IDE (Integrated Development Environment). The SoC is developed with the help of the Nios II IDE, which integrates the MicroC/OS-II RTOS and the Nios II soft-core processor. The embedded development flow is shown in figure 3.6 and proceeds as described below.

• System concepts and System Requirement Analysis

• Defining and Generating the System in SOPC Builder using Nios II with standard

required components along with custom instructions and peripherals


Figure 3.6: Embedded development flow

• Integrating the SOPC Builder System into the Quartus II Project along with custom

hardware modules

• Assigning pin locations, timing requirement and other design constraints

• Compile hardware and download to target board

• Develop the software project with the Nios II IDE and design the application code. Altera provides component drivers and a hardware abstraction layer (HAL), which allow Nios II programs to be written quickly and independently of the low-level hardware details. In addition to the application code, custom libraries can be designed and reused across Nios II IDE projects.

• Download the software to the Nios II system on the target board.

• Finally, run and debug the software on the target board.

3.4 Description of Switch using Hardware Description Language

The switch design described above needs to be realized as a digital system module using a hardware description language. Several hardware description languages are available, of which the most common are VHDL, Verilog and SystemC. We have chosen VHDL (Very High Speed Integrated Circuit Hardware Description Language) to model the digital system using dataflow, behavioral and structural styles of modeling [79]. VHDL is an IEEE standard, supported by all CAD tools, technology independent and flexible, and it supports easy modeling at various abstraction levels.

3.4.1 Switch Design Synthesis

The top-level entity Switch is described using VHDL code which defines the input and output interface of the switch as well as the architectural structure of its three components, namely the input module, scheduler and crossbar. The detailed code for the input module, scheduler and crossbar components is written in three separate VHDL files, along with the code for the look-up table. The designed switch is synthesized using the Altera Quartus II (version 8.0) software, whose design flow covers all phases of FPGA and CPLD design (figure 3.7) [80].


Figure 3.7: Quartus II design flow for SoC design

The Quartus II software includes a modular compiler with the following modules: Analysis & Synthesis, Partition Merge, Fitter, Assembler, TimeQuest Timing Analyzer, Design Assistant, Electronic Design Automation (EDA) Netlist Writer and HardCopy Netlist Writer. We used the Quartus II Graphical User Interface (GUI) to create the switch project described above, with switch.vhdl as the top-level file, and added the remaining files to the project. The project was then compiled; the successful compilation summary is shown in figure 3.8. The compilation summary shows that the total logic elements used to generate one switch is 4241 out of 33216, which is only 13% usage. The usage of other resources on the Cyclone II EP2C35F672C6 FPGA device is shown in table 3.2.

Table 3.2: Resource usage on the Cyclone II EP2C35F672C6 FPGA device for the proposed switch design

Resource                              Usage            Percentage
Total logic elements                  4241 / 33216     13%
Total combinational functions         2987 / 33216     9%
Dedicated logic registers             2605 / 33216     8%
Total pins                            130 / 475        27%
Total memory bits                     28224 / 483840   6%
Embedded multiplier 9-bit elements    0 / 70           0%
Total PLLs                            0 / 4            0%

After successful compilation the graphical representation of the design was generated, as shown in figure 3.9. The design is viewed using the Quartus II RTL Viewer, which provides a gate-level schematic view of the design and a hierarchy list of the instances, primitives, pins and nets of the entire design netlist. One can navigate through the entire design to examine the connections of the different components and their input and output interfaces. This only indicates whether the blocks are correct and properly connected; it does not guarantee the working of the design.

3.4.2 Switch Design Simulation

After successful synthesis, the design is checked using simulation. There are two ways in which the designed circuit can be simulated. The simplest is to assume that the logic elements and interconnection wires are perfect, causing no delay in the propagation of signals through the circuit; this is called functional simulation. A more complex alternative is to take all propagation delays into account, which leads to timing simulation.

3.4.2.1 Functional Simulation

Functional simulation is performed to verify the functional correctness of a circuit as it is being designed. It takes much less time, because the simulation can be performed simply by using the logic expressions that define the circuit.

Before running the functional simulation, the look-up table information was initialized with the data shown in table 3.3 using the lut1.mif, lut2.mif, lut3.mif and lut4.mif files. Four different .mif files are required since we have four input ports; after compilation, the data entered in these files is loaded into the respective LUT modules of the input modules. The look-up table holds the following information: input VCI, output VCI and output port number.
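Each 36-bit LUT row can be computed as a packed word when preparing the .mif contents. The helper below assumes the field order of figure 3.4, input VCI in the top 16 bits, output VCI in the next 16 and the port number in the low 4; this ordering is our reading of the figure rather than a documented layout.

    #include <stdint.h>
    #include <stdio.h>

    /* Pack one 36-bit routing-table row: [35:20] input VCI,
     * [19:4] output VCI, [3:0] output port number. */
    static uint64_t pack_lut_row(uint16_t in_vci, uint16_t out_vci, uint8_t port)
    {
        return ((uint64_t)in_vci << 20) | ((uint64_t)out_vci << 4) | (port & 0xF);
    }

    int main(void)
    {
        /* First row of table 3.3: 56C9 -> 1234, output port 3. */
        printf("%09llX\n", (unsigned long long)pack_lut_row(0x56C9, 0x1234, 3));
        /* prints 56C912343: nine hex digits, i.e. one 36-bit row */
        return 0;
    }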

Figure 3.8: Compilation report flow summary


Figure 3.9: RTL view of the Switch


The input VCI is carried in the second, third and fourth header bytes of the ATM packet and is used for routing. As per the data stored in the look-up table, the input VCI 56C9 of any ATM packet will be replaced with 1234 by the input module, and the packet will be given output port number 3. Similarly, any ATM packet with input VCI 5BED will have it replaced with ABCD and will be given output port number 2, and so on.

Table 3.3: Static look up table used for routing

INPUT VCI OUTPUT VCI OUTPUT PORT NUMBER

56C9 1234 3

5BED ABCD 2

B9E0 8193 1

0204 DEBA 4

BCCD 1213 4

3747 56C9 3

2C3C 8104 2

BBCD 7685 1

As stated earlier, we use ATM packets for our simulation. An ATM packet is a 53-byte cell whose first five bytes are header information and whose remaining bytes, i.e. bytes six to fifty-three, are data. In our design we generated the following input ATM packets and expected output ATM packets, as shown in table 3.4.

The simulation takes a waveform vector file as input. Using the waveform editor, the input.vwf file was created as shown in figure 3.10. The input waveform vector file specifies the clock, global_reset, reset, frame pulse (fp) signals and data_in packets; global_reset and reset were set active low, and four fp signals (fp1 to fp4) and four data_in packets (data_in1 to data_in4) were supplied for the four input ports respectively.

Table 3.4: Input ATM packets used for simulation and expected output packets

Packet   Input ATM packet (bytes)              Expected output ATM packet (bytes)
No.      1st   2nd   3rd   4th   5th   6-53    1st   2nd   3rd   4th   5th   6-53
1        72    75    6C    95    76    11      72    71    23    45    76    11
2        A9    A5    BE    DC    AD    22      A9    AA    BC    DC    AD    22
3        DF    EB    9E    02    E3    33      DF    E8    19    32    E3    33
4        15    10    20    48    1A    44      15    1D    EB    A8    1A    44
5        AA    BB    CC    DD    EE    11      AA    B1    21    3D    EE    11
6        10    20    30    40    50    AA      10    2C    CD    D0    50    AA
7        C1    C2    C3    C4    C5    BB      C1    C8    10    44    C5    BB
8        DA    DB    DC    DD    DE    22      DA    D7    68    5D    DE    22

As mentioned in the description of the working of the switch, once the fp signal is detected, the first data byte is supplied at the second rising edge of the clock. We supplied the first four input ATM packets of table 3.4 to the switch input modules as follows: the first packet to input port 1, the second to input port 2, the third to input port 3 and the fourth to input port 4.

Figure 3.10: Input waveform vector file showing input ATM packets

Before running the functional simulation the required netlist was created, and then the functional simulation was run. The output report generated by the functional simulation is shown in figure 3.11. As per the routing information stored in the LUTs of the input modules, packet number 1, injected at input port 1 with input VCI 56C9, is updated with the new VCI 1234 and is expected to go to output port 3. Packet number 2, injected at input port 2 with input VCI 5BED, is updated with the new VCI ABCD and is expected to go to output port 2. Packet number 3, injected at input port 3 with input VCI B9E0, is updated with the new VCI 8193 and is expected to go to output port 1. Lastly, packet number 4, injected at input port 4 with input VCI 0204, is updated with the new VCI DEBA and is expected to go to output port 4. The expected output results exactly match the results generated by the functional simulation, which confirms the working of the proposed switch design.

Figure 3.11: Output showing ATM packets received at the respective output ports of the switch as described in the LUT

Bouraoui Chemli and Abdelkrim [81] proposed the architecture of a NoC router to be used in a mesh topology and implemented the deadlock-free Negative-First turn model routing algorithm. The router is designed in VHDL, synthesized with the ISE 13.1 tool and implemented on a Xilinx Virtex-5 FPGA board. The proposed router architecture consists of the input module, a semi-crossbar, a switch allocator and the credit switcher. The data flow and the communication between adjacent blocks are managed by the input port. The switch allocator issues grant signals in response to requests for the output ports; the arbiter module of the switch allocator uses round-robin and priority scheduling schemes to assign the highest priority packet to the appropriate output port. The credit switcher indicates the status of the input ports of neighboring routers by issuing credit signals. The link controller receives flits from the adjacent router and, if required, stores them in the buffer. The router can serve up to five packets simultaneously; this parallel operation reduces the latency of the proposed router and the average latency of the NoC. They compared their design with two other router designs. The results, shown in table 3.5, claim that the router is 1.15 and 1.5 times faster than the other routers because of its distributed routing function, but in terms of area their design is larger than the other router designs.

Table 3.5: Comparison of the proposed router design with other router designs

                                      Router 1            Router 2            Proposed router
Topology                              2D Mesh             2D Torus            2D Mesh
Routing algorithm                     XY                  XY                  Negative-First
Frequency (MHz)                       52                  40                  60
Area (slices)                         824                 611                 3174
Power estimation (mW)                 58.8                NA                  43
Estimated peak performance per port   0.812               0.312               19.20
FPGA device                           Virtex-2 X2VC6000   Virtex-2 X2VC6000   Virtex-2 X2VC6000

Another router implementation was proposed by Hung K. Nguyen and Xuan-Tu [82]: a hybrid switching router based on a combination of the virtual cut-through and wormhole switching schemes. The router is dynamically reconfigurable at run time to change between the switching schemes, thereby achieving higher average performance than wormhole switching while reducing the implementation cost compared with virtual cut-through switching. The RTL model of the proposed router was synthesized on a Xilinx Virtex-7 xc7vx485 FPGA chip using the Xilinx Vivado Design Suite. The resource utilization of the proposed router is about 0.36%, 1.02% and 0.25% of the xc7vx485 chip in terms of flip-flops, LUTs and memory LUTs respectively; compared with a generic wormhole router of the same configuration, the resource utilization overhead is about 0.82%, 2.51% and 9.83% for flip-flops, LUTs and memory LUTs respectively. The experimental results show that the router guarantees reliability, reduces latency by approximately 30.2% and increases average throughput by about 38.9% compared with the generic router; the power consumption and area overhead are also acceptable. An FPGA-based design of a reconfigurable router for NoC applications is another router implementation, proposed by Amit Bhanwala, Mayank Kumar and Yogendera [83]. The design consists of four channels and a crossbar switch; each channel has FIFO buffers to store data and multiplexers for controlling the input and output of the data. The proposed router is described in Verilog HDL and simulated using ModelSim, the RTL view is obtained using Xilinx ISE 13.4, and the entire design is synthesized on a Xilinx Spartan-6 FPGA device. A power gating technique is used to reduce the power dissipation of the proposed reconfigurable router. The results show that the proposed design consumes less power than previously designed reconfigurable routers.


Chapter 4

IMPLEMENTATION OF COMPUTING NODE


4.1 Overview of Computing Node

A computing node is an active electronic device attached to a network, capable of creating, receiving and transmitting information over a communication channel. The computing node in our design is a soft-core processor integrated with the switch block. The soft-core processor is capable of generating and forwarding data packets to the attached switch, receiving data packets from the switch and processing data to some extent. The implementation of the computing node is accomplished by converting the RTL design of the switch into a block symbol file and then interfacing the block to the generated soft-core processor SoC. We first discuss the process of converting the switch design into a block diagram.

4.2 Task of Converting the Design into a Block Diagram

After testing the switch design for its functionality, we created the block symbol: we opened the switch.vhdl file, compiled the code and created the symbol file by selecting Create Symbol File from the menu, which produced switch.bsf. Figure 4.1 shows the block symbol of our switch design. At the input interface it has four one-bit frame pulse signals (fp1 to fp4) used to indicate the arrival of a data packet at the respective port, four 8-bit data buses (data_in1 to data_in4) for receiving data packets at the respective ports, and the clock, reset and global reset signals. The output interfaces are fp_out_port(1 to 4), the frame pulse input signals for the adjacent switch; data_out_port(1 to 4), the 8-bit data buses on which the data packets are output; data_valid(1 to 4), used to check the validity of the data; and incoming_port_to_output(1 to 4), which indicates from which incoming port the data packet arrived, along with the request and grant signals.

Figure 4.1: Block symbol of the proposed switch design

4.3 Soft Core Processor Architecture

A soft processor is an Intellectual Property (IP) core that is implemented using the logic

primitives of the FPGA. Key benefits of using a soft processor include configurability to trade

between price and performance, faster time to market, easy integration with the FPGA fabric,

and avoiding obsolescence. Commonly used embedded soft-core processors for FPGAs [84] are shown in table 4.1.

LEON3 is a 32-bit processor based on the SPARC V8 architecture and is designed and maintained by Aeroflex Gaisler AB in Sweden. The model is highly configurable and particularly suitable for system-on-a-chip (SoC) designs. The LEON3 processor follows the Harvard architecture and uses the AMBA Advanced High-performance Bus (AHB) for all on-chip communications.


Table 4.1: Embedded soft-core processors for FPGA

Soft Core Processor   Architecture   Logic Elements       Supported FPGA
LEON3                 SPARC V8       3500                 Xilinx FPGA
MicroBlaze            MicroBlaze     1324                 Xilinx FPGA
PicoBlaze             PicoBlaze      192                  Xilinx FPGA
Cortex-M1             ARMv6          2600                 Xilinx & Altera FPGA
Nios II               Nios II        Nios II/f: 1800      Altera FPGA
                                     Nios II/s: 1700
                                     Nios II/e: 390

Register windowing makes it possible to select a new set of local registers upon a procedure call, reducing the time required for storing register contents in memory on every call. By default, a basic implementation has 8 global registers and 8 sets of register windows, with each window consisting of 24 registers. Thus at any time 32 registers are available to a program, out of the 200 existing registers. The unique debug interface of LEON3 allows non-intrusive hardware debugging and provides access to all registers and memory. A main advantage of the LEON3 processor is that it uses a structured organization of packages, folders and VHDL records. This processor is very reliable and hence is used in a large number of military and space applications.

MicroBlaze is a proprietary 32-bit RISC reconfigurable soft processor from Xilinx and can be customized with different peripheral and memory configurations. MicroBlaze uses a Harvard memory architecture. The processor has a three-stage pipeline with variable-length instruction latencies, typically ranging from one to three cycles. Xilinx Platform Studio is available for creating a MicroBlaze-based system. MicroBlaze uses two Local Memory Buses (LMB) to connect the instruction and data memories; the user can define the sizes of the instruction and data memories and the number of peripherals based on the application requirements. An On-Chip Peripheral Bus (OPB) is used to boost system performance and is designed to support low-speed peripherals such as UARTs, GPIO, USB and external bus controllers. This soft-core processor can be used only in Xilinx FPGAs and has many configuration options. It also uses the Advanced eXtensible Interface (AXI) standard bus. Being a proprietary soft core, its source code is not available.

PicoBlaze is the designation of a series of three free soft processor cores from Xilinx for use in their FPGA and CPLD products. They are based on an 8-bit RISC architecture and can reach speeds of up to 100 MIPS on the Virtex-4 FPGA family. The processors have an 8-bit address and data port for access to a wide range of peripherals. The license of the cores allows their free use, albeit only on Xilinx devices, and they come with development tools. Third-party tools are available from Mediatronix and others.

Cortex-M1 is an implementation of the proprietary 32-bit ARMv6 architecture designed for FPGAs. Using Cortex-M1 requires a license from ARM Limited. There are Actel flash-based FPGA chips which ship with a Cortex-M1 license, and Cortex-M1 can be used with Xilinx and Altera FPGAs as well. ARM (Advanced RISC Machine) is a popular RISC architecture particularly suitable for applications that demand low power consumption. Many operating systems have been ported to ARM, including Linux and Windows CE.

In our design we selected the Nios II processor because it supports all Altera SoC and FPGA families.

4.3.1 Altera Nios II Soft-Core Processor

The Nios II soft processor [85] is a 32-bit fixed-point high-performance processor created to be used only in FPGAs. This processor has separate buses for data and program memories,

which is often called Harvard architecture, and also has Reduced Instruction Set Computer

(RISC) architecture. A Nios II processor system is equivalent to a microcontroller or

“computer on a chip” that includes a processor and a combination of peripherals and memory

on a single chip. Like a microcontroller family, all Nios II processor systems use a consistent

instruction set and programming model. The Nios II soft processor has a number of features

that can be configured by the user to meet the demands of a desired system. The processor can

be implemented in three different configurations:

• Nios II/f is a "fast" version designed for high performance. It has the widest scope of

configuration options that can be used to optimize the processor for performance.

• Nios II/s is a "standard" version that requires fewer resources in an FPGA device as a trade-off for reduced performance.

• Nios II/e is an "economy" version which requires the least amount of FPGA resources,

but also has the most limited set of user-configurable features.

Its arithmetic and logic operations are performed on operands in the general purpose registers.

The data is moved between the memory and these registers by means of Load and Store

instructions. The word length of the Nios II processor is 32 bits. All registers are 32 bits long.

Byte addresses in a 32-bit word are assigned in little-endian style, in which the lower byte

addresses are used for the less significant bytes (the rightmost bytes) of the word. A Nios II

processor may operate in the following modes:


• Supervisor mode – allows the processor to execute all instructions and perform all

available functions. When the processor is reset, it enters this mode.

• User mode – the intent of this mode is to prevent execution of some instructions that

should be used for systems purposes only. This mode is available only when the

processor is configured to use the Memory Management Unit (MMU) or the Memory

Protection Unit (MPU).

Application programs can be run in either the User or Supervisor modes.

4.3.2 Characteristics of Nios II Processor

The Nios II soft processor is very flexible and offers a number of configuration options that can be set at the design stage. These configurations allow the user to optimize the processor for various

applications [86]. The most relevant are the clock frequency, the debug level, the performance

level and the user defined instructions.

Features of Nios II Soft Processor:

• Full 32-bit instruction set, data path and address space, 32 general-purpose registers,

optional shadow registers sets

• Single-instruction 32 x 32 multiply and divide unit producing a 32-bit result

• Dedicated instructions for computing 64-bit and 128-bit product of multiplication

• Floating-point instructions for single-precision floating point operations

• Single instruction barrel shifter

• Hardware-assisted debug module enabling processor start, stop, step and trace under

control of the Nios-II software development tools

• Optional memory management unit (MMU) to support operating systems requiring

MMUs


• Instruction Set Architecture (ISA) compatible across all Nios-II processor systems

• Performance up to 250 DMIPS

One of the most important features of the Nios II processor is the ability to add user-defined functional units called Custom Logic. Custom instructions make it possible to tailor the Nios II processor core to the needs of a particular application; in this way, time-critical software algorithms can be accelerated by implementing some of their steps in specialized hardware blocks. These blocks must be created using either VHDL or Verilog. Physically, the Custom Logic block is placed inside the Nios II processor in parallel with the ALU, as shown in the block diagram of the Nios II processor core in figure 4.2.

4.4 SoC Generation Using the Nios II Soft Core

We customize the Nios II soft-core processor as per our requirements; this soft core can be generated using Altera's SOPC Builder. SOPC Builder is a powerful system development tool [87] which enables the user to define and generate a complete System-On-a-Programmable-Chip (SOPC) in much less time than traditional, manual integration methods. It is a part of the Quartus II software and provides a system which is a combination of hardware and software [88]. The SOPC maintains programmable and reconfigurable hardware, which is the most significant form of SoC. It automates the task of integrating hardware components: the user simply specifies the system components in a GUI, and SOPC Builder generates the interconnect logic automatically. SOPC Builder generates HDL files that define all the components of the system, and a top-level HDL file that connects all the components

components of the system, and a top-level HDL file that connects all the components

Figure 4.2: Block diagram of the Nios II processor core together. SOPC builder generates either Verilog HDL or VHDL equally [

connects multiple modules together to create a top

system. It generates system interconnect fabric that contains logic to manage the connectivity

of all modules in the system.

shown in the figure 4.3. The design

top-level HDL file called SOPC b

The following modules were selected as required by our design

• CPU (Nios-II/s processor)

• Static Random Access Memory

• JTAG-UART (interface for

• PIO ( 8-bit for input data packets

65

.2: Block diagram of the Nios II processor core with relevant buses

uilder generates either Verilog HDL or VHDL equally [89]. SOPC

connects multiple modules together to create a top-level HDL file called the SOPC b

system. It generates system interconnect fabric that contains logic to manage the connectivity

of all modules in the system. The customized soft core processor for the switch design

. The design shows the multiple modules by SOPC builder to create a

level HDL file called SOPC builder system.

The following modules were selected as required by our design

II/s processor)

Static Random Access Memory(SRAM)

UART (interface for communicating with the host computer)

bit for input data packets -4 in numbers)

]. SOPC builder

el HDL file called the SOPC builder

system. It generates system interconnect fabric that contains logic to manage the connectivity

soft core processor for the switch design is

uilder to create a

• PIO (8-bit, for output data packets; 4 in number)

• PIO (1-bit for the frame pulse signal, 1-bit for reading the clock and 1-bit for reading the output frame pulse; 3 in total)

SOPC builder modules use Avalon interfaces, such as memory-mapped, streaming, and IRQ,

for the physical connection of components. The Avalon interface [90] family defines

interfaces appropriate for streaming high-speed data, reading and writing registers and

memory, and controlling off-chip devices. The Avalon interface is a synchronous interface

defined by a set of signal types with roles for supporting data transfers. There are two types of

Avalon interface ports; master ports and slave ports. Avalon master ports initiate transfers.

Avalon slave ports respond to transfer requests according to the requirements. After adding all

the components we have to assign priority of the components which is IRQ number. As

shown in figure 4.3, here CPU is set to IRQ 0 which gets higher priority. SOPC builder

automatically assign base address to all components. The device family was set to Cyclone-II

and external source clock of 50MHz was selected. With this configuration the soft core

processor is generated.

4.5 Integrating SoC with Switch

The final computing node is implemented by integrating the configured Nios-II processor

generated by SOPC Builder with the switch. This is done by opening a new project in the Quartus II software and instantiating the Nios II soft-core processor module in the Quartus II project along with the switch block component, as shown in figure 4.4.

The following connections were made:

• The four output PIO buses were connected to the four input data pins of the switch block.

• The output data pins of the switch block were connected to the four input PIO buses.

• A 1-bit PIO output pin was connected to the frame pulse signal of the switch.

• The soft-core processor was assigned the external clock of 50 MHz.

• The clock of the switch was generated as described below.

Figure 4.3: Customized soft-core processor for the switch design

The system thus generated has two components, namely the SoC and the switch, and the two components require different clock frequencies. The SoC must be assigned a clock frequency at configuration time, before generation; the only clock which can be assigned to this soft core is the external clock of the DE2 development board on which the design is implemented, so the clock assigned to the soft-core processor is the 50 MHz clock of the DE2 board. The designed switch requires a clock in the range of 350 Hz to 400 Hz, which is a factor of 15-20 for proper synchronization with the SoC.


Figure 4.4: Schematic view of the computing node consisting of the SoC and a switch

The only remaining possibility is to generate a clock frequency in the range of 350 Hz to 400 Hz from the 50 MHz clock of the SoC. This is done by using a PLL and a few stages of JK flip-flops to divide the clock down to the required frequency. Further, it was difficult to synchronize the two components, one running at a high frequency and the other at a low frequency.


We resolved this issue by polling the low-frequency clock signal back via one of the input PIOs of the SoC and synchronizing the data transfer from the SoC to the switch component with appropriate C code executing on the soft-core processor.
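A minimal sketch of that C code is given below. It assumes the Altera HAL's PIO accessor macros and hypothetical base-address names that would come from the generated system.h (CLOCK_SENSE_BASE and DATA_OUT_BASE are placeholders): the processor watches the polled-back slow clock and changes the output byte only once per slow-clock period, so the 50 MHz SoC never outruns the switch. For reference, seventeen divide-by-two stages would bring 50 MHz down to about 381 Hz, inside the stated 350-400 Hz window.

    #include <stdint.h>
    #include "system.h"                    /* generated base addresses (assumed names) */
    #include "altera_avalon_pio_regs.h"    /* IORD/IOWR PIO macros from the Nios II HAL */

    /* Spin until the slow switch clock, read back on a 1-bit input
     * PIO, reaches the requested level. CLOCK_SENSE_BASE is a
     * placeholder for the PIO wired to the divided clock. */
    static void wait_slow_clock(uint32_t level)
    {
        while ((IORD_ALTERA_AVALON_PIO_DATA(CLOCK_SENSE_BASE) & 1) != level)
            ;                              /* busy-wait at 50 MHz on the ~380 Hz clock */
    }

    /* Present one packet byte to the switch, held stable across a
     * full rising edge of the slow clock. */
    static void send_byte(uint8_t byte)
    {
        wait_slow_clock(0);                /* set up data while the clock is low */
        IOWR_ALTERA_AVALON_PIO_DATA(DATA_OUT_BASE, byte);
        wait_slow_clock(1);                /* data held through the rising edge */
    }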

4.6 Synthesizing and Porting the Computing Node on the Target Device Cyclone II

The generated computing node needs to be tested by porting it onto the target FPGA device. We

have used Altera’s Development and Education board DE2. The DE2 board has many

features that allow the user to implement a wide range of designed circuits, from simple

circuits to various multimedia projects. Figure 4.5 shows the layout and components on DE2

board.

The following hardware components are provided on the DE2 board: Altera Cyclone® II 2C35 FPGA device; Altera Serial Configuration device EPCS16; on-board USB Blaster for programming and user Application Program Interface (API) control, with both JTAG and Active Serial (AS) programming modes supported; 512-Kbyte SRAM; 8-Mbyte SDRAM; 4-Mbyte Flash memory (1 Mbyte on some boards) and an SD Card socket; 50-MHz and 27-MHz oscillators as clock sources; a 10/100 Ethernet controller with a connector; and a USB Host/Slave controller with USB type A and type B connectors. In addition to these hardware features, the DE2 board has software support for standard I/O interfaces and a control panel facility for accessing various components.

The block diagram of the DE2 board is shown in figure 4.6. To provide maximum flexibility

for the user, all connections are made through the Cyclone II FPGA device. Thus, the user can

configure the FPGA to implement any system design.


Figure 4.5: The DE2 board

The detailed information about the blocks in Figure 4.6 is as follows:

Cyclone II 2C35 FPGA

• 33,216 LEs, 105 M4K RAM blocks, 483,840 total RAM bits, 35 embedded multipliers, 4 PLLs, 475 user I/O pins, Fine Line BGA 672-pin package

Serial Configuration device and USB Blaster circuit

• Altera’s EPCS16 Serial Configuration device

• On-board USB Blaster for programming and user API control

• JTAG and AS programming modes are supported

SRAM

• 512-Kbyte Static RAM memory chip organized as 256K x 16 bits accessible as memory for

the Nios II processor and by the DE2 Control Panel


SDRAM

• 8-Mbyte Single Data Rate Synchronous Dynamic RAM memory chip organized as 1M x 16

bits x 4 banks accessible as memory for the Nios II processor and by the DE2 Control Panel

Flash memory

• 4-Mbyte NOR Flash memory (1 Mbyte on some boards), 8-bit data bus accessible as

memory for the Nios II processor and by the DE2 Control Panel

SD card socket

• Provides SPI mode access to the SD Card, accessible as memory for the Nios II processor with the DE2 SD Card Driver

Figure 4.6: Block diagram of the DE2 board with various I/O and communication interfaces

Clock inputs

• 50-MHz oscillator

• 27-MHz oscillator

• SMA external clock input

Before compiling the computing node design, the appropriate FPGA pin assignment is done as shown in table 4.2. After compilation, the assembler module of the compiler creates a binary file with the extension .sof (SRAM Object File), which contains the data needed to configure the FPGA device. The device selected is the EP2C35F672C6 of the Cyclone II family, the FPGA device used on the DE2 board.

Table 4.2: Pin assignment for EP2C35F672C6 of Cyclone II family for DE2 board

Signal Name FPGA Pin No

SRAM_ADDR[0] PIN_AE4

SRAM_ADDR[1] PIN_AF4

SRAM_ADDR[2] PIN_AC5

SRAM_ADDR[3] PIN_AC6

SRAM_ADDR[4] PIN_AD4

SRAM_ADDR[5] PIN_AD5

SRAM_ADDR[6] PIN_AE5

SRAM_ADDR[7] PIN_AF5

SRAM_ADDR[8] PIN_AD6

SRAM_ADDR[9] PIN_AD7

SRAM_ADDR[10] PIN_V10


SRAM_ADDR[11] PIN_V9

SRAM_ADDR[12] PIN_AC7

SRAM_ADDR[13] PIN_W8

SRAM_ADDR[14] PIN_W10

SRAM_ADDR[15] PIN_Y10

SRAM_ADDR[16] PIN_AB8

SRAM_ADDR[17] PIN_AC8

SRAM_DQ[0] PIN_AD8

SRAM_DQ[1] PIN_AE6

SRAM_DQ[2] PIN_AF6

SRAM_DQ[3] PIN_AA9

SRAM_DQ[4] PIN_AA10

SRAM_DQ[5] PIN_AB10

SRAM_DQ[6] PIN_AA11

SRAM_DQ[7] PIN_Y11

SRAM_DQ[8] PIN_AE7

SRAM_DQ[9] PIN_AF7

SRAM_DQ[10] PIN_AE8

SRAM_DQ[11] PIN_AF8

SRAM_DQ[12] PIN_W11

SRAM_DQ[13] PIN_W12

SRAM_DQ[14] PIN_AC9

SRAM_DQ[15] PIN_AC10

SRAM_WE_N PIN_AE10

SRAM_OE_N PIN_AD10


SRAM_UB_N PIN_AF9

SRAM_LB_N PIN_AE9

SRAM_CE_N PIN_AC11

CLOCK_50 PIN_N2

SW[1] PIN_N26

4.7 Programming and configuring the FPGA device

The FPGA device must be programmed and configured to implement the designed system. The required configuration file (.sof) is generated by the assembler module of the Quartus II Compiler. Altera's DE2 board allows the configuration to be done in two different ways, known as Joint Test Action Group (JTAG) and Active Serial (AS) modes. The configuration data is transferred from the host computer (which runs the Quartus II software) to the board by means of a cable that connects a USB port on the host computer to the leftmost USB connector on the board; this connection requires the USB-Blaster driver. In the JTAG mode, the configuration data is loaded directly into the FPGA device, which retains its configuration as long as the power remains turned on. The configuration information is lost when the power is turned off.

The second possibility is to use the AS mode. In this case, a configuration device that includes some flash memory is used to store the configuration data. The Quartus II software places the configuration data into the configuration device on the DE2 board, and this data is then loaded into the FPGA upon power-up or reconfiguration. Thus, the FPGA need not be reconfigured by the Quartus II software after the power is turned off and on. The choice between the two modes is made with the RUN/PROG switch on the DE2 board: the RUN position selects the JTAG mode, while the PROG position selects the AS mode. We used the JTAG programming mode and downloaded the design onto the Cyclone II FPGA device by performing the following steps.

1. Connected the DE2 board to the host computer by means of a USB cable plugged into the USB-Blaster port, powered on the DE2 board, and ensured that the RUN/PROG switch was in the RUN position.

2. Under Tools > Programmer, selected JTAG in the Mode box and selected the USB-Blaster.

3. The configuration file with the extension .sof appears in the window.

4. Clicked the box under Program/Configure to select this action.

5. Pressed Start to configure the FPGA.

With this process we ported the complete computing node system, along with the configuration file, onto the Cyclone II FPGA device of the DE2 board.

4.8 Validating the design for the application program

Having configured the required hardware in the FPGA device, it is now necessary to confirm the design by creating and executing an application program that performs the desired operation. The software development environment of the Nios II processor is the Nios II Integrated Development Environment (IDE), which facilitates writing, compiling and executing the application program. Application code can be written in either C or C++ [91]. We have used the C programming language; the complete code is shown in the ANNEXURE (Appendix-I). The application first drives the fp signal high for two clock cycles and generates four 53-byte ATM data packets. After the fp signal is activated, the program starts sending the ATM packets to the switch input ports, with the first byte sent at the second rising edge of the clock. The application reads the clock signal obtained by dividing the original 50MHz clock, thus synchronizing itself with the switch input clock. The program continues to send one byte of data at every clock cycle to the input port of the switch until the fp_out signal is detected, which indicates the arrival of a data packet at the output port. The program control then switches over to reading the output ports of the switch and directs the data packet to the Nios-II IDE console. This logic is represented in the flow chart shown in figure 4.7. The program is then compiled and downloaded into the Nios II system implemented on the DE2 board; the output result is shown in figure 4.8. As seen from the program in the ANNEXURE (Appendix-I), four different ATM packets were generated and transferred to the four input ports of the switch. For example, we provided the ATM packet of table 4.3 to input port_1 of the switch; as per the look-up table information (chapter 3, table 3.2), one of the entries is as shown in table 4.4, and the expected output ATM packet received at output port_3 of the switch is shown in table 4.5. We obtained exactly this result, as shown in figure 4.8, which confirms that the computing node implemented on the target Cyclone II FPGA of the DE2 board works as per our objectives.
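The control flow of figure 4.7 can be sketched in C as below. This is a simplified illustration rather than the Annexure code; FP_PIO_BASE, FP_OUT_PIO_BASE and DATA_IN_PIO_BASE are assumed PIO names, and wait_rising_edge()/send_byte() are the helpers sketched in section 4.5.

#include <stdio.h>
#include "system.h"
#include "altera_avalon_pio_regs.h"

#define CELL_SIZE 53   /* one ATM cell: 5 header bytes + 48 payload bytes */

/* Simplified sketch of the verification flow in figure 4.7. */
void run_verification(const unsigned char cell[CELL_SIZE])
{
    int i, sent = 0;
    unsigned char rx[CELL_SIZE];

    /* Drive the frame pulse high for two clock cycles, then drop it. */
    IOWR_ALTERA_AVALON_PIO_DATA(FP_PIO_BASE, 1);
    wait_rising_edge();
    wait_rising_edge();
    IOWR_ALTERA_AVALON_PIO_DATA(FP_PIO_BASE, 0);

    /* Send one byte per switch clock cycle until fp_out is detected, */
    /* which indicates that a packet has reached an output port.      */
    while (!(IORD_ALTERA_AVALON_PIO_DATA(FP_OUT_PIO_BASE) & 0x1)) {
        send_byte(cell[sent % CELL_SIZE]);
        sent++;
    }

    /* Switch over to reading: capture the 53 output bytes and print  */
    /* them on the Nios-II IDE console.                                */
    for (i = 0; i < CELL_SIZE; i++) {
        wait_rising_edge();
        rx[i] = (unsigned char)IORD_ALTERA_AVALON_PIO_DATA(DATA_IN_PIO_BASE);
        printf("%02X ", rx[i]);
    }
    printf("\n");
}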

Table 4.3: Input ATM packet sent to port one of the switch

1st byte | 2nd byte | 3rd byte | 4th byte | 5th byte | bytes 6-48
72       | 73       | 74       | 75       | 76       | 01-48

Table 4.4: Part of the look-up table

Input VCI | Output VCI | Output port number
3747      | 56C9       | 3


Table 4.5: Output ATM packet expected at port number 3

1st byte | 2nd byte | 3rd byte | 4th byte | 5th byte | bytes 6-48
72       | 75       | 6C       | 95       | 76       | 01-48
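The per-switch VCI translation that takes the cell of table 4.3 to that of table 4.5 can be modelled with a small look-up structure. The sketch below is illustrative only; the struct and field names are assumptions, and the VCI values of table 4.4 are taken to be hexadecimal.

/* Illustrative model of one look-up table entry (assumed names). */
struct lut_entry {
    unsigned short in_vci;    /* VCI carried by the incoming cell      */
    unsigned short out_vci;   /* replacement VCI written to the header */
    unsigned char  out_port;  /* output port the cell is forwarded to  */
};

/* The entry of table 4.4: input VCI 3747 maps to VCI 56C9 on port 3. */
static const struct lut_entry lut[] = {
    { 0x3747, 0x56C9, 3 },
};

/* Return the matching entry, or 0 if the VCI is unknown. */
static const struct lut_entry *lookup(unsigned short vci)
{
    unsigned i;
    for (i = 0; i < sizeof lut / sizeof lut[0]; i++)
        if (lut[i].in_vci == vci)
            return &lut[i];
    return 0;
}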

Figure 4.7: Program flow for verification of computing node on FPGA


Figure 4.8: Output ATM packet at port number 3 of the switch on Nios-II IDE console


Chapter 5

IMPLEMENTATION OF

NoC PLATFORM USING MESH TOPOLOGY

5.1 Overview of Mesh Topology

The basic building block of any mesh topology is the switch/router. The switches in the mesh topology are connected in the north, south, east and west directions to the adjacent switches in the topology. Every switch has an embedded computing element, shown as a core interfaced with the switch, shown as a router, in figure 5.1.

Figure 5.1: 4x4 Mesh topology with switch (router) and computing element (core)

5.2 Design of NoC Mesh Topology Platform

The 4 x 4 NoC mesh topology requires 16 computing nodes. The computing node designed and tested in chapter 4 occupies 32% of the logic elements of the Cyclone II FPGA on the DE2 board. It is thus clear that the 4 x 4 mesh topology will not fit on the Cyclone II FPGA device: three such computing nodes alone would occupy 96% of the logic elements.

Figure 5.2: Close view of 4X4 Mesh topology interconnection platform using Quartus-II

The option is to generate 16 instances of switches without processing elements. We used only one processing element, connected to the first switch, and connected the switches as shown in figures 5.2 and 5.3 to build the mesh topology platform. Though the design requires one processing element for every switch, we embedded only one processing element at the first switch in our design, because multi-processor support is not strong in Quartus on the Cyclone-II device.

Figure 5.3: Full view of 4X4 Mesh topology interconnection platform using Quartus-II with

SoC and PLL.

[Schematic netlist of figure 5.3 omitted: it comprises sixteen voq_switch instances (each exposing fp1-fp4, data_in1-4[7..0], global_reset, reset, clock, fp_out_port1-4, data_out_port1-4[7..0], data_valid1-4, incoming_port_to_output1-4[2..0], P[7..1], request[16..1] and grant[16..1] ports), the noc_platform SoC block with its PIO and SRAM signals, a Cyclone II PLL (inclk0 = 50.000 MHz; output clock ratios c0 = 1/5 and c1 = 1/2), and chains of T flip-flop (TFF) divider stages.]

Figure 5.4: Compilation report for the 4 x 4 switch matrix, demanding more logic utilization than the device provides


Also, multi-core support for the µCOS Real Time Operating System (RTOS) is not available in the Altera Monitor Program software development tool for Nios-II. Hence, we customized the design for a single processing element and established the design interconnection for the two-dimensional mesh topology. The processing element connected to the first switch can be used to generate, send and receive ATM data packets.

After compiling the 4 x 4 NoC mesh topology, the compilation summary report shown in figure 5.4 indicates that the design was unable to fit on the Cyclone II FPGA. Over various trials of decreasing the number of switches, we could fit only 6 switches and one processing element, connected to the first switch, on the FPGA. The resource utilization for 6 switches and one processing element on the Cyclone II FPGA is 97%: 32,208 out of 33,216 logic elements were utilized, as shown in figure 5.5.

5.3 The 3x2 NoC Mesh Topology Verification with RTOS

We scaled down the 4 x 4 (16-node) mesh topology platform to a 3 x 2 (6-node) mesh topology platform. The reason is the logic element utilization on the selected device, the EP2C35F672C6 of the Cyclone II family used on the DE2 board. To design the 3x2 mesh topology platform we used our switch design shown in figure 4.1; six such switches are required to build the platform. All switches were configured with the appropriate look-up tables required to decide the deterministic routing path, as per the shortest path algorithm, from any node to any node in the network.

We first performed the functional simulation of all switches by creating input waveform vector files to test the functionality of the switches. Before connecting the switches into the 3x2 mesh topology, we connected the processing element to the first switch.


Figure 5.5: Compilation report for 3 x 2 mesh topology utilizing 97% logic elements.


The customized soft core processor is interfaced with the switch as the computing node. As noted above, the Cyclone-II device in Quartus-II does not fully support multi-core processors, and multiple instances of the µCOS Real Time Operating System (RTOS) are not supported in the Altera Monitor Program software development tool for Nios-II. Hence, we connected 6 switches into the 3x2 mesh topology with one processing element connected to the first switch, as shown in figure 5.6. The 3x2 mesh topology was then synthesized on the target FPGA device; the compilation result shows a total logic element utilization of 97%. We then verified the design for the two cases described below.

Figure 5.6: 3x2 mesh topology interconnection platform

Case-I: Verification of 3x2 mesh topology at first switch

This test verifies the design ported onto the target FPGA device of the DE2 board. The application program, written in the C programming language, generates four different 53-byte ATM packets. The first five bytes of each generated ATM packet contain the VCI information shown in table 3.3, and the switch is configured as per look-up table 3.2. The program is compiled and downloaded into the SoC system ported on the DE2 board. The generated ATM packet is sent to one of the input ports of the connected switch.


As the packet is read by the switch, the VCI information of the packet is checked, the old VCI is updated with the new VCI information maintained in the look-up table of the switch, and the output port for the packet is decided. The packet is then forwarded to the decided output port once the scheduler grants the signal. To check whether the packet reached the respective output port correctly, we need to capture it using the soft core processor; we therefore connect the output port of the switch to an input PIO of the SoC. The processor is involved in two tasks, one for sending the ATM packet and the other for receiving it. These two tasks need to be performed in real time, which is implemented using the µCOS Real Time Operating System. Two tasks, a low priority task and a high priority task, are defined in the program; the scheduling diagram is shown in figure 5.7. The low priority task sends the ATM packet to the input port of the switch, transmitting one byte at every clock cycle. The fp_out signal at the output of the switch indicates the arrival of the packet. The high priority task loops continuously, checking for the fp_out signal; as soon as it detects this signal, the processor suspends the low priority task and jumps to the high priority task to receive the ATM packet. The receive loop of the high priority task runs for 53 clock cycles, which equals the size of the ATM packet. The processor then switches back to the low priority task and continues to send another packet, while polling through the high priority task.
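A minimal µC/OS-II-style sketch of this two-task structure is shown below, assuming the standard µC/OS-II task API; the stack sizes, priorities and the send/read helper names are illustrative, not the thesis code.

#include "includes.h"                    /* µC/OS-II kernel headers        */
#include "altera_avalon_pio_regs.h"

#define STACK_SIZE 2048
OS_STK tx_stk[STACK_SIZE];               /* low priority: send packet      */
OS_STK rx_stk[STACK_SIZE];               /* high priority: receive packet  */

/* High priority task: poll fp_out; on detection read the 53-byte cell. */
static void rx_task(void *pdata)
{
    while (1) {
        if (IORD_ALTERA_AVALON_PIO_DATA(FP_OUT_PIO_BASE) & 0x1) {
            int i;
            for (i = 0; i < 53; i++)
                read_byte();             /* assumed helper: one byte/cycle */
        }
        OSTimeDly(1);                    /* yield so the sender can run    */
    }
}

/* Low priority task: transmit one byte of the cell per switch clock. */
static void tx_task(void *pdata)
{
    while (1)
        send_next_cell_byte();           /* assumed helper */
}

int main(void)
{
    OSInit();
    /* In µC/OS-II a lower number means a higher priority. */
    OSTaskCreate(rx_task, 0, &rx_stk[STACK_SIZE - 1], 1);
    OSTaskCreate(tx_task, 0, &tx_stk[STACK_SIZE - 1], 2);
    OSStart();                           /* hand control to the scheduler */
    return 0;
}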

Case II: Verification of 3x2 mesh topology at last switch

In order to verify the 3x2 mesh topology, we connected the switches as shown in figure 5.6 and fixed the first switch as the source node and the sixth switch as the destination node. The static routing decision that selects the shortest path from the source node to the destination node is made through the proper LUTs in the respective switches; the packet then simply follows the deterministic path so set. The switches are configured with the routing table information in the form of look-up tables. The program to generate the ATM packet is written in C and, after successful compilation, downloaded into the SoC system implemented on the DE2 board. The ATM packet is injected into the input port of the first switch. As the packet enters the input module of the first switch, the switch checks the input VCI information and assigns the output port, updating the header with the new VCI information as per the look-up table maintained in the switch. The packet is then forwarded to the output port once the scheduler grants permission. The packet similarly follows the path through the different switches as decided by the routing algorithm: each switch updates the VCI information of the packet and forwards it to the next switch as per its look-up table, so that the packet finally reaches the destination node.

Figure 5.7: Real Time system priority scheduler for transmitting and receiving the ATM packets

Here again we defined two tasks, a low priority task and a high priority task, as discussed in Case-I. The low priority task generates the ATM packet and forwards it to the first switch. The high priority task polls for the fp_out signal, which indicates the arrival of a packet at the output port. The output port of the destination node, i.e. the sixth switch, is connected to the input PIO of the Nios-II soft core processor. Once the fp_out signal is detected, the processor stops sending the input packet and starts reading the ATM packet at the input PIO of the Nios-II soft core processor.

NoC verification techniques can be classified into simulation, formal verification, and mathematical proof. Simulation is the most popular and widely used technique: many NoC simulators, such as NS2, Noxim and XMulator, have been designed to evaluate performance in terms of throughput and latency, and various interconnection topologies, router and switch designs, routing algorithms, arbitration algorithms and traffic patterns have been modeled and simulated. During simulation, however, exhaustive checking of all system behaviors is close to impossible and time-consuming for a complex NoC-based SoC. Some functional properties, like mutual exclusion, deadlock-freedom, and starvation-freedom, require exhaustive exploration; thus simulation cannot complete such verifications. To resolve these issues, one can use a conventional mathematical proof method. One such method was the deductive proof used by Jones et al. [92] to verify that the ordering of packet transactions is correct. Different deduction methods have further been used to prove the deadlock-freedom of routing algorithms [93], [94]. A channel dependency graph method was proposed by Dally and Seitz [93] to prove that the routing algorithm in an NoC is deadlock free if, and only if, there is no cycle in the channel dependency graph. By applying the proposed theorem, Lin et al. [94] proved that XY-based wormhole routing on a mesh-based NoC design is deadlock free. Though mathematical proof is powerful, it has two defects. First, its automation is difficult: little automation means that more non-trivial human effort is required, and applying this proof method to complex real cases is not very practical. Second, if some functional properties are disproved, a mathematical proof may not be able to point out how the given properties are

violated. These issues can be addressed by some popular formal verification techniques, such

as model checking [95] and theorem proving. Different kinds of communication properties in

an NoC were verified by theorem proving [96], [97], like deadlock freedom in a routing

algorithm. Model checking had also been used to prove deadlock freedom and the correctness

of a protocol design [98]. One such model-checking-based formal verification procedure was developed by Yean-Ru Chen et al. [99] to verify and validate the routing microarchitecture in a Network-on-Chip (NoC) communication infrastructure; they investigated four properties of an NoC router, namely mutual exclusion, starvation freedom, deadlock freedom and conditions for traffic congestion. They proposed guidelines for constructing formal models of an NoC router, and analyzed the minimal formal models essential for verifying these four properties. To perform the verification task they applied a popular formal verification model checking tool, State Graph Manipulators (SGM). The results obtained through formal verification of these four properties provide useful insights for refining the protocol design.

Another such study was performed by Anam Zaman and Osman Hasan [100], who carried out the formal verification of circuit-switched NoCs using the SPIN model checker. Their methodology provides generic modelling guidelines and identifies properties, including deadlock freedom, starvation freedom, mutual exclusion and liveness, that are quite useful in the context of circuit-switched NoCs. They used the proposed methodology to verify the programmable NoC (PNoC) architecture, one of the most widely used circuit-switched NoCs.


Chapter 6

PERFORMANCE EVALUATION AND

ANALYSIS OF MESH TOPOLOGY


6.1 Performance Evaluation and Analysis of Mesh Topology

Two-dimensional 4x4 and 8x8 mesh topologies were modeled and simulated using the network simulator NS2 [1]. NS2 is an open source, object-oriented, discrete event driven network simulator written in C++ and OTcl. It is a very common and widely used tool for simulating small and large area networks. Due to the similarities between NoCs and networks, NS2 has been the choice of many NoC researchers for simulating and observing the behavior of a NoC at a higher abstraction level of design. Mesh based NoC topologies are receiving attention because of their modularity and the ability to expand by adding new nodes and links without any modification of the existing node structure. One advantage of the mesh is that it can be partitioned into smaller meshes, which is a desirable feature for parallel applications. In this topology each switch is connected to four neighboring switches and one resource, and the number of switches required is equal to the number of resources. Switch, resource and link are the three basic elements of the topology.

An inter-communication path between the switches is composed of links. Each node is connected with point-to-point bidirectional links, and the bandwidth and latency of a link are configurable. When a link between two nodes is down, packets cannot travel between these nodes in either direction. An inter-communication path is composed of the set of links identified by the routing strategy. The simulator models 4 x 4 (16-node) and 8 x 8 (64-node) 2-D meshes in which the routing decision is taken at the source node using source routing. A shortest communication path is selected for each traffic pair using deterministic XY routing before a simulation.


To evaluate the performance of the mesh interconnection networks, the following parameters were considered: latency, throughput and bandwidth. Latency is the time required to transfer an n-byte packet from source to destination; it consists of routing delay, contention delay, channel occupancy and overhead. Throughput is the total number of packets received by the destination per unit time. Bandwidth is the amount of data that can be moved over a communication link in a unit time period. These parameters can be specified and varied by writing Tcl code; Tcl is used for specifying the mesh interconnection network simulation model and running the simulation. The implementation of the mesh interconnection networks uses source routing to send packets from source node to destination node. The complete Tcl code for building the 4 x 4 and 8 x 8 mesh topologies is given in the ANNEXURE (Appendix-II and III).
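Summarizing these definitions symbolically (a restatement of the above, not an additional model):

    Latency = t_routing + t_contention + t_channel_occupancy + t_overhead
    Throughput = (packets received at destination) / (unit time)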

6.2 Simulation Environment for m x n mesh topology

For the evaluation, a detailed event-driven simulator has been developed. This simulator models a 4 x 4 (16-node) and an 8 x 8 (64-node) 2-D mesh in which the routing decision is taken at the source node using source routing. Each node is connected with point-to-point bidirectional serial links. The Transmission Control Protocol (TCP) is used to transfer data, with a static routing table maintained at each router to establish the path from source router to destination router using the shortest path algorithm. The simulation uses a DropTail queue, which drops packets at the end of the queue during congestion; in other words, the queue serves on a first-in first-out basis. The transport agent TCP is attached to source node N(0), and two different traffic pattern applications, namely the File Transfer Protocol (FTP) and Constant Bit Rate (CBR), were tested. FTP generates bulk data in a random fashion and CBR generates packets at a constant bit rate. We studied the performance of the 4 x 4 and 8 x 8 mesh topologies with respect to the link bandwidth, link delay, queue size and packet size parameters using FTP and CBR traffic. Different scenarios were created by varying the above parameters, and the performance of the topology in terms of throughput and delay was studied. All these model parameters are described in a Tcl script file. The parameters chosen for simulation are shown in table 6.1.

Table 6.1: Simulation parameters

NoC Model Parameter Parameter Constraint applied in NS2

Topology Mesh

Connections Resource-Router, Router-Router

Transmission Protocols Transmission Control Protocol(TCP)

Routing Scheme Static

Routing Protocol Shortest Path

Queue Mechanism Drop Tail(FIFO)

Link Queue Varying

Link Delay 10ms

Traffic Generation FTP/CBR

Traffic rate Varying

Number of Nodes 4 x 4 (16-node), 8 x 8 (64-node)

Simulation time 20 sec


For the simulations, node N(0) and node N(15) were fixed as the source and destination nodes for the 4 x 4 mesh topology, and node N(0) and node N(63) as the source and destination nodes for the 8 x 8 mesh topology.

6.3 Scenario-1 Throughput and delay performance with varying packet size

In Scenario 1 we implemented the 4x4 and 8x8 mesh topology simulation models, and a comparative performance study of these models was carried out in terms of throughput and delay.

Table 6.2: Various parameter values for scenario 1

SCENARIO-1
Topology    | 4x4    | 4x4    | 8x8    | 8x8
CBR rate    | 5Mbps  | 10Mbps | 5Mbps  | 10Mbps
Link delay  | 10ms   | 10ms   | 10ms   | 10ms
Link BW     | 10Mbps | 10Mbps | 10Mbps | 10Mbps
Queue size  | 1      | 1      | 1      | 1
Packet size | 0.1 to 16384 Kbytes in all four configurations

Table 6.3: Throughput and delay performance with varying packet size for the 4x4 mesh topology (CBR at 5Mbps)

Packet size (Kbytes) | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

0.1 22 0.060673645 22 0.060676365

0.2 38 0.061154818 38 0.061159502


0.4 70 0.062117192 70 0.062125849

0.8 129 0.064042039 131 0.064058853

1.6 248 0.067892214 248 0.067926051

3.2 455 0.075594281 461 0.075665262

6.4 596 0.14557829 603 0.14513425

12.8 598 0.3481888 601 0.34803309

25.6 603 0.74818304 601 0.75165174

51.2 601 1.3935586 592 1.4013148

64 608 1.696563 591 1.6889675

128 619 2.8467846 573 2.8816504

256 601 3.9446304 569 3.9241282

512 574 5.3739556 540 5.5336179

1024 508 7.708185 484 7.9549006

2048 367 17.44648 295 18.805834

4096 333 14.259669 148 13.167424

8192 340 28.459136 151 26.274624

16384 0.040 0.060192 0.040 0.060224


Table 6.4: Throughput and delay performance with varying packet size for the 4x4 mesh topology (CBR at 10Mbps)

Packet size (Kbytes) | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

0.1 22 0.060673645 22 0.060676365

0.2 38 0.061154818 38 0.061159502

0.4 70 0.062117192 70 0.062125849

0.8 129 0.064042039 131 0.064058853

1.6 248 0.067892214 248 0.067926051

3.2 455 0.075594281 461 0.075665262

6.4 596 0.14557829 610 0.14362082

12.8 598 0.3481888 620 0.33858433

25.6 603 0.74881053 615 0.73858948

51.2 597 1.4003808 604 1.4039356

64 605 1.711896 595 1.7167413

128 619 2.8817046 582 2.8930515

256 594 3.9421148 585 4.0341319


512 565 5.4201113 540 5.4407722

1024 508 7.708185 484 7.9549006

2048 367 17.44648 295 18.805834

4096 333 14.252736 267 18.619701

8192 340 28.459136 151 26.274624

10000 343 34.726869 0.040 0.060224

16384 0.040 0.060192 0.040 0.060224

Table 6.5: Throughput and delay performance with varying packet size for the 8x8 mesh topology (CBR at 5Mbps)

Packet size (Kbytes) | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

0.1 9 0.14157171 10 0.14157815

0.2 16 0.14269432 16 0.14270533

0.4 30 0.14493961 30 0.14495999

0.8 56 0.14943039 56 0.14946999

1.6 104 0.15841294 105 0.15849282

3.2 193 0.17638191 195 0.17654995

6.4 344 0.2123365 348 0.21270924


12.8 569 0.2843086 573 0.28521036

25.6 589 0.67421239 591 0.6721586

51.2 554 1.3370749 559 1.341439

64 555 1.6462719 552 1.6551183

128 549 2.7536893 512 2.7953078

256 492 3.9690716 476 3.9890783

512 395 6.136064 381 6.1676202

1024 317 11.773117 207 11.861354

2048 233 22.054084 161 22.805051

4096 160 31.816192 75 26.35488

8192 0.040 0.140448 0.040 0.14048

16384 0.040 0.140448 0.040 0.140448

Table 6.6: Throughput and delay performance with varying packet size for the 8x8 mesh topology (CBR at 10Mbps)

Packet size (Kbytes) | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

0.1 9 0.14157171 10 0.14157815

0.2 16 0.14269432 16 0.14270533


0.4 30 0.14493961 30 0.14495999

0.8 56 0.14943039 56 0.14946999

1.6 104 0.15841294 105 0.15849282

3.2 193 0.17638191 195 0.17654995

6.4 344 0.2123365 348 0.21270924

12.8 569 0.2843086 573 0.28521036

25.6 589 0.67421239 596 0.67000229

51.2 554 1.3370749 559 1.341439

64 555 1.6462719 552 1.6551183

128 549 2.7536893 512 2.7953078

256 492 3.9690716 476 3.9901596

512 395 6.136064 381 6.1765245

1024 317 11.650232 241 12.010315

2048 233 22.054084 161 22.805051

4096 160 31.816192 75 26.35488

8192 0.040 0.140448 0.040 0.14048

16384 0.040 0.140448 0.040 0.140448


Here, we kept the link bandwidth constant at 10Mbps and the propagation delay of the link constant at 10ms, fixed the queue size to hold a maximum of 1 packet at a time, and varied the packet size from 0.1 Kbytes to 16384 Kbytes. Two types of traffic pattern applications were executed on these models: FTP, which generates bulk data randomly, and CBR at 5Mbps and 10Mbps generation rates. The packets generated at source node N(0) were transmitted to destination node N(15) or N(63) for the 4x4 and 8x8 mesh topologies respectively, using the shortest path algorithm. The simulations were run for 20 seconds. The results are shown in tables 6.3, 6.4, 6.5 and 6.6. It is observed (figure 6.1) that for both the FTP and CBR (5Mbps and 10Mbps) applications, as the packet size increases the throughput initially increases linearly, saturates for packet sizes in the range of 6.4 Kbytes to 512 Kbytes, with maximum values of 619 Kbytes for FTP and 620 Kbytes for CBR (corresponding to 4.951Mbps and 4.960Mbps respectively), and later degrades. Though we set the link bandwidth to 10Mbps, the bandwidth is shared between the two directions of the bidirectional link, so each direction gets 5Mbps; the result thus depicts close to 100% bandwidth exploitation. The throughput for the 4x4 mesh topology is higher than for the 8x8 mesh topology because the path from source node N(0) to destination node N(63) in the 8x8 topology is longer than the path from source node N(0) to destination node N(15) in the 4x4 topology.

The performance of the two mesh topology models was also studied for delay characteristics. Here we kept the link bandwidth, link delay and queue size constant, as shown in table 6.2 (Scenario 1). Packets with sizes varying from 0.1 Kbytes to 16384 Kbytes were generated at source node N(0) of the 4 x 4 and 8 x 8 mesh topology models using the two traffic patterns FTP and CBR (at 5Mbps and 10Mbps rates) and transmitted to destination nodes N(15) and N(63) of the 4 x 4 and 8 x 8 mesh topologies respectively, using the shortest path algorithm. The protocol used was TCP and the simulation was run for 20 seconds. The results obtained are shown in tables 6.3, 6.4, 6.5 and 6.6, and the plot is shown in figure 6.2. It is observed from figure 6.2 that for both FTP and CBR traffic, as the packet size increases the delay also increases gradually; it is highest for packet sizes of 1024 Kbytes to 8192 Kbytes for the 4 x 4 mesh topology and 1024 Kbytes to 4096 Kbytes for the 8 x 8 mesh topology, and then drops drastically. This is the saturation point at which the throughput also reaches its peak, fully exploiting the parameter values set. A further increase in packet size leads to a gradual decrease in delay. The delay for the 8 x 8 mesh topology is higher than for the 4 x 4 mesh topology model, due to the increased number of hops and the longer path length between the source node and the destination node.

Figure 6.1: Throughput in Kbytes v/s Packet size in Kbytes


Figure 6.2: Delay per packet in ms v/s Packet size in Kbytes

6.4 Scenario-2 Throughput and delay performance with varying Queue size for low load

and high load packets

Table 6.7: Various parameter values for scenario 2

SCENARIO-2
Topology    | 4x4                  | 4x4                   | 8x8                  | 8x8
CBR rate    | 10Mbps               | 10Mbps                | 10Mbps               | 10Mbps
Link delay  | 10ms                 | 10ms                  | 10ms                 | 10ms
Link BW     | 10Mbps               | 10Mbps                | 10Mbps               | 10Mbps
Queue size  | 5-200                | 5-200                 | 5-200                | 5-200
Packet size | 512 bytes (low load) | 64 Kbytes (high load) | 512 bytes (low load) | 64 Kbytes (high load)

Table 6.8: Throughput and delay performance by varying queue size with low load packets (512 bytes) on the 4x4 mesh topology

Queue size | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

5 86 0.062656301 85 0.062660796

10 87 0.062656136 86 0.062660648

20 87 0.062656136 87 0.06266704

40 87 0.062656136 87 0.06266704

80 87 0.062656136 87 0.06266704

100 87 0.062656136 87 0.06266704

200 87 0.062656136 87 0.06266704

Table 6.9: Throughput and delay performance by varying queue size with high load packets (64 Kbytes) on the 4x4 mesh topology

Queue size | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

5 511 0.46477903 515 0.46720931

10 583 0.63688067 520 0.6113077

20 580 0.97672589 581 0.98149923


40 605 1.711896 595 1.7167413

80 605 1.711896 595 1.7167413

100 605 1.711896 595 1.7167413

200 605 1.711896 595 1.7167413

Table 6.10: Throughput and delay performance by varying queue size with low load packets (512 bytes) on the 8x8 mesh topology

Queue size | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

5 35 0.14619776 35 0.14620876

10 37 0.14619701 36 0.14620821

20 37 0.14619701 37 0.1462227

40 37 0.14619701 37 0.1462227

80 37 0.14619701 37 0.1462227

100 37 0.14619701 37 0.1462227

200 37 0.14619701 37 0.1462227


Table 6.11: Throughput and delay performance by varying queue size with high load packets (64 Kbytes) on the 8x8 mesh topology

Queue size | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

5 433 0.91622105 432 0.92468632

10 470 1.0218441 477 1.0341071

20 505 1.2934178 502 1.3265322

40 555 1.6462719 552 1.6551183

80 555 1.6462719 552 1.6551183

100 555 1.6462719 552 1.6551183

200 555 1.6462719 552 1.6551183

In this scenario (table 6.7) we studied the performance of the two network models, namely the 4 x 4 and 8 x 8 mesh topologies, in terms of throughput in Kbytes and delay per packet in milliseconds, by changing the queue size of the nodes. The queue size is the number of packets a node can hold during data transfer, and it plays a very important role in the performance of the network. If the size is small, the node may discard packets due to unavailability of buffer space; this causes retransmission of the same packet from source to destination, which adds delay. Declaring the queue size too large may also lead to area overhead and delay due to the service time problem: the larger the size, the longer packets wait in the buffer before being processed by the node. It may happen that by the time a packet reaches the head of the queue for processing, it is discarded because the time-to-live field of the packet has hit a zero count; this again leads to retransmission, thereby increasing packet delay. Thus the size of the queue plays a crucial role in the performance of the network. We kept the link bandwidth constant at 10Mbps and the link delay at 10ms and, varying the queue size from 5 to 200 packets, studied the performance for two different network loads: a low load, where the packet size is 512 bytes, and a high load, where the packet size is 64 Kbytes. The packets were generated at node N(0) using FTP and CBR (at a 10Mbps rate) and transmitted to destination node N(15) of the 4x4 mesh topology and N(63) of the 8x8 mesh topology using the shortest path algorithm. The results are shown in tables 6.8, 6.9, 6.10 and 6.11. It is observed from figure 6.3 that the throughput in both the FTP and CBR applications is 4 to 5 times higher for the high load than for the low load in both mesh topology models. This is because the amount of data generated in the high load case (64 Kbytes packets) is large, so a large amount of data is transmitted from the source node and received at the destination node, increasing the throughput compared to the low load (512 bytes packets). It is further observed that CBR (at the 10Mbps rate) at high load gives higher throughput than the FTP application, and that the throughput of the 4x4 mesh topology at both high and low load is higher than that of the 8x8 mesh topology, the reason being the shorter path length from source to destination in the 4x4 mesh topology. We also noticed that the throughput saturates when the queue size reaches 20 for the 4x4 mesh topology and 40 for the 8x8 mesh topology, indicating that the queue size need not exceed 20 for the 4x4 and 40 for the 8x8 mesh topology. These values correspond to a factor of 5 times the matrix dimension of the mesh topology (5 x 4 = 20 for the 4x4 mesh and 5 x 8 = 40 for the 8x8 mesh).


We also carried out a performance study of the 4x4 and 8x8 mesh topologies for the average transmission delay, in milliseconds, that a packet takes to travel from source node to destination node. We kept the link bandwidth constant at 10Mbps and the link delay at 10ms, while varying the queue size from 5 to 200 packets per node. The simulation was carried out for the low load, i.e. a packet size of 512 bytes, and the high load, with a packet size of 64 Kbytes. The packets were generated at source node N(0) using FTP and CBR (at a 10Mbps rate) traffic generation and transmitted to destination nodes N(15) and N(63) of the 4x4 and 8x8 mesh topology models respectively, using the shortest path algorithm. The simulation was set for 20 seconds. The results obtained are shown in tables 6.8, 6.9, 6.10 and 6.11. It is observed from figure 6.4 that the transmission delay for the high load packets is higher than for the low load in both FTP and CBR (at the 10Mbps rate), because at high load more data bytes are transmitted than at low load. In the 4 x 4 mesh topology at high load, when the queue size is 5 packets per node, the transmission delay per packet is 0.464ms for FTP and 0.4672ms for CBR; it increases linearly up to 1.711ms and 1.716ms for FTP and CBR respectively at a queue size of 40, and thereafter remains constant. In the 8 x 8 mesh topology at high load, the delay per packet is 0.916ms and 0.92ms when the queue size is 5 packets per node, increasing linearly to 1.64ms and 1.65ms and then remaining constant at these values for queue sizes from 40 to 200, for FTP and CBR traffic respectively. Within the high load, the performance of the 4x4 mesh topology is better than that of the 8x8 mesh topology, because the path length between source and destination is shorter and the number of hops is smaller in the 4x4 than in the 8x8 mesh topology.


Figure 6.3: Throughput in Kbytes v/s Queue size.

Figure 6.4: Delay per packet in ms v/s Queue size


6.5 Scenario-3 Throughput and delay performance with varying link Bandwidth for low

load and high load packet

Table 6.12: Various parameter values for scenario 3

SCENARIO-3
Topology    | 4x4                  | 4x4                   | 8x8                  | 8x8
CBR rate    | 10Mbps               | 10Mbps                | 10Mbps               | 10Mbps
Link delay  | 10ms                 | 10ms                  | 10ms                 | 10ms
Link BW     | 10Mbps-200Mbps       | 10Mbps-200Mbps        | 10Mbps-200Mbps       | 10Mbps-200Mbps
Queue size  | 1                    | 1                     | 1                    | 1
Packet size | 512 bytes (low load) | 64 Kbytes (high load) | 512 bytes (low load) | 64 Kbytes (high load)

Table 6.13: Throughput and delay performance by varying link bandwidth with a packet size of 0.512 Kbytes on the 4x4 mesh topology

Link BW (Mbps) | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

10 87 0.062656136 87 0.06266704

20 87 0.061328028 88 0.061333417

40 87 0.060664004 89 0.060666683

80 89 0.060332001 89 0.060333337


100 89 0.0602656 89 0.060266666

200 89 0.060132799 89 0.060133332

Table 6.14: Throughput and delay performance by varying link bandwidth with a packet size of 64 Kbytes on the 4x4 mesh topology

Link BW (Mbps) | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

10 605 1.711896 595 1.7167413

20 1243 0.94410154 1228 0.95028378

40 2463 0.28046868 1228 0.27797226

80 4939 0.099205179 1228 0.10170933

100 6224 0.093661094 1228 0.094316757

200 9210 0.075634592 1228 0.076104124

Table 6.15: Throughput and delay performance by varying link bandwidth with a packet size of 0.512 Kbytes on the 8x8 mesh topology

Link BW (Mbps) | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

10 37 0.14619701 37 0.1462227

20 37 0.1430984 37 0.14311116

40 38 0.14154918 38 0.14155549


80 37 0.14077458 38 0.14077774

100 37 0.14061966 38 0.14062218

200 38 0.14030983 38 0.14031109

Table 6.16: Throughput and delay performance by varying link bandwidth with a packet size of 64 Kbytes on the 8x8 mesh topology

Link BW (Mbps) | Throughput FTP (Kbytes) | Delay per packet FTP (ms) | Throughput CBR (Kbytes) | Delay per packet CBR (ms)

10 555 1.6462719 552 1.6551183

20 1207 0.87346923 1208 0.87257966

40 2389 0.32198518 1228 0.32419287

80 3327 0.23144778 1228 0.23236135

100 3507 0.21259688 1228 0.21403118

200 3915 0.17607773 1228 0.17653149

In Scenario 3 (table 6.12), the performance of the 4x4 and 8x8 mesh topology models was studied for throughput and transmission delay by varying the link bandwidth. The simulations were carried out for 20 seconds, keeping the propagation delay of the link constant at 10 milliseconds and varying the link bandwidth from 10Mbps to 200Mbps. Two different traffic patterns were used, namely FTP for bulk generation of data and CBR generating data at a 10Mbps bit rate. We also used two different packet sizes: a low load, where the packet size is kept constant at 512 bytes, and a high load, where the packet size is kept at 64 Kbytes. The traffic generated at source node N(0) was transmitted to destination node N(15) or N(63) of the respective mesh topology model using the shortest path algorithm. The results obtained are shown in tables 6.13, 6.14, 6.15 and 6.16. We observed from figure 6.5 that the throughput for the low load, i.e. 0.512 Kbytes packets, is very low, below 100 Kbytes, compared to the high load with a 64 Kbytes packet size; for the high load the throughput is 20 to 25 times more than for the low load. This is because the amount of data generated for the low load packets is very small compared to the high load, so the amount of data transmitted and received is also low, leading to low throughput; the bandwidth is not efficiently utilized in the low load case. The results also show that at high load the FTP traffic performance is better than the CBR traffic, since we kept the CBR rate constant at 10Mbps, which is too low to exploit the given bandwidth, so the CBR throughput remains saturated even as the bandwidth increases. In FTP traffic, on the other hand, the throughput increases linearly as the bandwidth increases, exploiting the bandwidth almost up to 99%, and it is best for the 4x4 mesh topology compared to the 8x8 mesh topology due to the shorter path length between source and destination nodes.

We also carried out a performance study of the 4x4 and 8x8 mesh topology models for transmission delay, varying the link bandwidth from 10Mbps to 200Mbps. We calculated the time a packet takes to travel from source node to destination node for the two loads, namely the low load (0.512 Kbytes) and the high load (64 Kbytes), keeping the link delay at 10 milliseconds and using FTP and CBR traffic. Nodes N(0) and N(15), and nodes N(0) and N(63), were fixed as the source and destination nodes for the 4x4 and 8x8 mesh topologies respectively. The results are shown in tables 6.13, 6.14, 6.15 and 6.16. It is observed from figure 6.6 that for the low load, i.e. a 0.512 Kbytes packet size, packets take less transmission time than for the high load with a 64 Kbytes packet size.

Figure 6.5: Throughput in Kbytes v/s Bandwidth in Mbps

The better delay performance is due to the packet size, which is 128 times smaller for low load than for high load. For high load the transmission delay is initially high for link bandwidths from 10 Mbps to 40 Mbps, and from 40 Mbps onwards the performance is almost the same as for low load. For both high and low load, under CBR as well as FTP traffic, the packet transmission delay of the 4x4 and 8x8 mesh topology models is best for link bandwidths above 40 Mbps.


Figure 6.6: Delay per packet in ms v/s Bandwidth in Mbps
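The bandwidth sweep of this scenario maps directly onto the link definitions used in the simulation scripts (Appendix II and III). The fragment below is a minimal sketch, assuming a helper variable for the swept value; the actual scripts hard-code 10Mb on every link.

# Minimal sketch of the Scenario-3 bandwidth sweep ($bw and $pdel are
# illustrative helper variables, not part of the Appendix II/III scripts).
set bw   40Mb    ;# one point of the 10-200 Mbps sweep
set pdel 10ms    ;# propagation delay, held constant in Scenario 3
$ns duplex-link $n1 $n2 $bw $pdel DropTail   ;# repeated for every mesh link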

6.6 Scenario-4: Throughput and delay performance by varying the propagation delay of the link with low load and high load packets

Table 6.17: Various parameter values for scenario 4

SCENARIO-4

Parameter        4x4 (low load)    4x4 (high load)    8x8 (low load)    8x8 (high load)
CBR rate         10 Mbps           10 Mbps            10 Mbps           10 Mbps
Link delay       10 ms - 200 ms    10 ms - 200 ms     10 ms - 200 ms    10 ms - 200 ms
Link BW          10 Mbps           10 Mbps            10 Mbps           10 Mbps
Queue size       1                 1                  1                 1
Packet size      512 bytes         64 Kbytes          512 bytes         64 Kbytes

Table 6.18: Throughput and delay performance by varying link propagation delay with packet size of 512 bytes on 4x4 mesh topology

Link delay (ms)  FTP Throughput    FTP Delay per      CBR Throughput    CBR Delay per
                 (Kbytes)          packet (ms)        (Kbytes)          packet (ms)
10               87                0.062656136        87                0.06266704
20               44                0.1226626          44                0.1226843
40               21                0.24267566         22                0.24271969
80               11                0.48270278         11                0.48279377
100              8                 0.60271764         8                 0.60283319
200              4                 1.2027966          4                 1.2030367

Table 6.19: Throughput and delay performance by varying link propagation delay with packet size of 64 Kbytes on 4x4 mesh topology

Link delay (ms)  FTP Throughput    FTP Delay per      CBR Throughput    CBR Delay per
                 (Kbytes)          packet (ms)        (Kbytes)          packet (ms)
10               605               1.711896           595               1.7167413
20               600               1.8568241          596               1.8768459
40               595               1.7423041          585               1.7633757
80               575               1.5251242          563               1.547276
100              564               1.4170311          555               1.440462
200              393               1.5264711          382               1.5615011

Table 6.20: Throughput and delay performance by varying link propagation delay with packet size of 512 bytes on 8x8 mesh topology

Link delay (ms)  FTP Throughput    FTP Delay per      CBR Throughput    CBR Delay per
                 (Kbytes)          packet (ms)        (Kbytes)          packet (ms)
10               38                0.14030983         38                0.14031109
20               18                0.28621169         18                0.28626355
40               9                 0.56624181         9                 0.56634856
80               4                 1.1263099          4                 1.1265313
100              3                 1.4063443          3                 1.4066442
200              0.461             2.8061652          0.461             2.806381

Table 6.21: Throughput and delay performance by varying link propagation delay with packet size of 64 Kbytes on 8x8 mesh topology

Link delay (ms)  FTP Throughput    FTP Delay per      CBR Throughput    CBR Delay per
                 (Kbytes)          packet (ms)        (Kbytes)          packet (ms)
10               555               1.6462719          552               1.6551183
20               575               1.7246999          566               1.7469654
40               543               1.4816446          538               1.5045655
80               359               1.8569539          351               1.8950532
100              291               2.1415174          286               2.1904111
200              43                3.5095102          42                3.5295024

In Scenario 4 (table 6.17) the throughput of the 4x4 and 8x8 mesh topologies was studied by varying the propagation delay of the link from 10 milliseconds to 200 milliseconds while keeping the bandwidth fixed at 10 Mbps and the queue size at one. The study was performed for FTP and CBR (at 10 Mbps) traffic with low load (512 bytes) and high load (64 Kbytes) packets. As in the previous scenarios, we fixed N(0) and N(15) as source and destination nodes for the 4x4 topology and N(0) and N(63) as source and destination nodes for the 8x8 topology. The traffic generated was transmitted from the source node to the destination node using the shortest path algorithm, and the throughput was measured at the destination node after running the simulation for 20 seconds. The results obtained are shown in tables 6.18, 6.19, 6.20 and 6.21. We observe from figure 6.7 that the throughput for low load (512 bytes) is very low, a factor of 4 to 5 lower than for high load, in both FTP and CBR traffic; it stays below 100 Kbytes and decreases further as the propagation delay of the link increases. At higher load the throughput is better for CBR traffic than for FTP traffic. In all cases the throughput decreases as the propagation delay of the link increases, and the 4x4 topology performs better than the 8x8 mesh topology because of its shorter path length. The drop in performance at higher propagation delays is expected, since every hop adds the full link delay to each packet.
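Since both the FTP and the CBR sources in these simulations run over TCP (see the scripts in Appendix II and III), the falling throughput has a simple first-order explanation. For a window-limited TCP flow the standard relation, stated here as a sketch rather than a computation performed in the thesis, is

\[ \text{Throughput} \le \frac{W}{RTT}, \qquad RTT \approx 2\,H\,d_p, \]

where W is the window size, H the hop count and d_p the per-link propagation delay. Doubling d_p roughly halves the achievable throughput, which matches the near-halving of the FTP figures in tables 6.18 and 6.20 each time the configured link delay doubles.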


Figure 6.7: Throughput in Kbytes v/s Link delay in ms.

Figure 6.8: Delay per packet in ms v/s Link delay in ms


We also studied the transmission delay of the 4x4 and 8x8 mesh topologies by varying the propagation delay of the link from 10 milliseconds to 200 milliseconds while keeping the bandwidth fixed at 10 Mbps and the queue size at one. We used FTP and CBR (at 10 Mbps) traffic with low load (512 bytes) and high load (64 Kbytes) packets. Keeping N(0) and N(15) as source and destination nodes for the 4x4 topology and N(0) and N(63) for the 8x8 topology, the traffic was generated at the source node and transmitted to the destination node using the shortest path algorithm. We then computed the transmission delay per packet in milliseconds at the destination node after running the simulation for 20 seconds. The results obtained are shown in tables 6.18, 6.19, 6.20 and 6.21. We observe from figure 6.8 that the delay for low load (512 bytes) is better than for high load (64 Kbytes), but as the propagation delay of the link increases the transmission delay also increases, degrading the performance. Once again the reason is self-explanatory: as the propagation delay of the link increases, so does the end-to-end transmission delay.
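A first-order latency model, offered as a sketch under idealized assumptions (no queueing, uniform links) rather than as the measurement procedure used in the simulator, makes this behaviour explicit. A packet of S bits crossing H hops, each with bandwidth B and propagation delay d_p, experiences roughly

\[ T_{\text{packet}} \approx H\left(\frac{S}{B} + d_p\right). \]

With B fixed at 10 Mbps, as in Scenario 4, the d_p term dominates as the configured link delay grows, which is why the measured per-packet delay for the small 512-byte packets rises almost in proportion to the link delay.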


Chapter 7

CONCLUSION AND FUTURE TRENDS


7.1 Conclusion

Network on Chip is a relatively new research field, and there are many pressing research challenges that need to be addressed. The major ones include NoC architecture design and implementation, switch/router design and implementation, and routing algorithms. In our study we first explored, through a literature survey, the various NoC architectures, topologies, router designs and algorithms proposed by the research community. We then undertook the design and implementation of a two-dimensional m x n mesh topology NoC platform and performed functional simulation and formal verification of it. Mesh topology was selected for its scalability and its simplicity of hardware implementation. We first proposed a switch design with four first-in first-out (FIFO) input ports and four output ports, along with one local port. The four ports connect to the four adjacent switches in the north, south, east and west directions to build the mesh topology platform. The switch design comprises three components: input module, scheduler and crossbar. We restricted ourselves to fixed-size ATM packets, although the design can be scaled to any packet size; the main reason is that the FPGA on which the design is ported has limited logic elements and little memory space to implement the LUT. We tested the design for its functionality by realizing it as a digital system module using HDL. All three components of the switch were realized using HDL and synthesized with ALTERA QUARTUS II (version 8.0) software. The gate-level schematic of the generated design was viewed using the Quartus II RTL viewer to examine the design connectivity. The proposed switch design was verified for its functionality using functional simulation: we injected various ATM packets through the input ports and verified that each appeared at the output port decided by the look-up table information configured within the switch.

Further, we implemented a computing node, which is a soft-core processor embedded with the switch design. The task of the processor is to generate packets, forward packets to the connected switch and receive packets from the switch. We used Altera's Nios II soft-core processor, customized to our requirements, to generate an SOPC using Altera's SOPC Builder system tool. The generated soft core was integrated with our switch by connecting the appropriate buses and control signals, thereby implementing the computing node. For implementation we used Altera's Development and Education board DE2, which carries a Cyclone II 2C35 FPGA device. The computing node system, along with the configuration file, was synthesized and ported onto the Cyclone II 2C35 FPGA device of the DE2 board. The design was tested, using the Integrated Development Environment, by writing a C application program which generates ATM packets, forwards each packet to the input port of the attached switch and receives it at the respective output port. Using this computing node we then built a two-dimensional 3x2 mesh topology NoC platform, synthesized on the Cyclone II FPGA device EP2C35F672C6 and formally verified using the µCOS real-time operating system.

We also developed a simulation and verification platform to measure NoC performance in terms of network delay and throughput. The platform was designed and implemented using 4x4 and 8x8 mesh topologies, and a comparative performance study was performed by varying different NoC parameters: the throughput and delay of the NoC platform were measured by varying packet size, link bandwidth, link delay and queue size using FTP and CBR traffic patterns. We conclude that as the packet size increases the throughput increases, achieving almost 100% of the link bandwidth. The transmission delay also increases with packet size, and is better for CBR traffic; thus a mesh topology with a small number of nodes can be used for applications where data packets are generated at constant bit rates.

With reference to queue size, the throughput and delay for low load remain constant for both the 4x4 and 8x8 mesh topologies, but for high load the throughput and delay are better at lower queue sizes. In the implemented design, the queue size should never exceed 5 times the matrix value of the mesh topology.
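One reading of this rule of thumb, expressed in the ns-2 style of the appendix scripts, is sketched below; interpreting the "matrix value" as the dimension n of an n x n mesh is an assumption, and the variable names are illustrative.

# Sketch of the queue-size bound (assumes "matrix value" = mesh dimension n).
set n    8                       ;# mesh dimension of an n x n topology
set qmax [expr {5 * $n}]         ;# suggested upper bound: 5 x n = 40
$ns queue-limit $n1 $n2 $qmax    ;# applied per link, as in Appendix III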

Further, with respect to link bandwidth, we conclude that FTP traffic outperforms CBR traffic at high load. The throughput with FTP traffic increases linearly with bandwidth, exploiting almost 99% of it, and is best for the 4x4 mesh topology compared with the 8x8 mesh topology. Increasing the bandwidth does not affect the transmission delay much at low load, but a drastic change is seen at high load: the higher the bandwidth, the lower the delay.

We also conclude that the throughput of the network decreases as the propagation delay of the link increases. At higher load the throughput is better for CBR traffic than for FTP traffic, and the 4x4 mesh performs better than the 8x8 mesh. The delay for low load is better than for high load, but as the propagation delay of the link increases the transmission delay also increases, degrading the performance.

NoCs are a key enabling technology for the provision of many additional services, ranging from different quality of service (QoS) levels to fault tolerance. Apart from global communication, the major challenge now faced by designers is high power dissipation; it has grown to such importance that it directly constrains attainable performance, and a designer must always take this parameter into consideration. When designing a router, the type of link interconnect, the number of ports and the queue size play important roles in the energy budget. The additional energy required to route a packet is dominated by the data path. Performance analysis also shows that dynamic resource allocation leads to the lowest network latencies, while static allocation may be used to meet QoS goals. Combining the power and performance figures then allows an energy-latency product to be calculated to judge the efficiency of each network. Studies along similar lines have been performed by various researchers. Arnab Banerjee, Pascal T. Wolkotte, Robert D. Mullins, Simon W. Moore, and Gerard J. M. Smit [101] performed a power analysis on a circuit-switched router, a wormhole router, a quality-of-service (QoS) supporting virtual channel router and a speculative virtual channel router for NoC architectures, and evaluated their energy-performance. The router designs were synthesized, placed and routed on a CMOS 90-nm high-performance process technology with a core voltage of 1.2 V and nominal threshold voltage. The results show that router power significantly exceeds link power. They also found that all router designs dissipate significant leakage and clock tree power, an overhead for any particular architecture, which highlights the need for efficient architectures combined with standby power reduction techniques to obtain power-efficient designs. The dynamic energy cost per packet was calculated to measure the additional power dissipated while routing a packet; the data path, being wider than the control path, dominates the energy need. They implemented dynamic allocation and virtual channels in the SpecVC router design and observed that packet latency under random packet injection is reduced, while the QoS-supporting design of the GuarVC router offered low latency for GT traffic. The measured energy and latency results are combined into an energy-latency product (ELP) metric that represents the efficiency of a router architecture for a specific traffic scenario.
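The ELP itself is simply the product of the two measured quantities. A hypothetical helper in the Tcl style of the appendices (the proc name and the numbers are illustrative, not values from [101]):

# Hypothetical ELP helper; inputs and units are illustrative only.
proc elp {energy_pJ latency_ns} {
    return [expr {$energy_pJ * $latency_ns}]   ;# lower ELP = more efficient
}
puts "ELP = [elp 12.5 4.0] pJ-ns"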

Another way to reduce power dissipation is to use asynchronous circuits, which eliminate the global clock signal on the chip. In an asynchronous NoC, multiple communications take place simultaneously over asynchronous routers and links between processing cores. Naoya Onizawa et al. [102] proposed a high-throughput, compact, delay-insensitive asynchronous NoC router based on LEDR encoding with a packet-structure constraint. Since routing computation is performed using only single-phase information, the hardware complexity of two-phase encoding is alleviated while maintaining timing robustness; the proposed NoC thus benefits maximally from two-phase encoding, in which the communication steps and the number of signals used are halved in comparison with four-phase encoding. As a result, the proposed router achieves a throughput of 526 Mflit/s, almost double that of a conventional four-phase router, with 34% energy saving and small area overhead in post-layout simulation under a 0.13 µm CMOS technology. Moreover, the proposed asynchronous NoCs achieve 113 and 69 Gbps in the 4x4 2-D mesh and 16-core Spidergon NoCs respectively, almost 2.4x and 2.3x the throughput of the four-phase asynchronous NoCs. The asynchronous NoC based on the proposed router was fabricated and demonstrated under supply voltages from 0.6 to 1.8 V.

The energy issue cannot be solved completely by designing efficient systems; certain in-built factors may affect system performance, one of which is the interconnect material. Conventional metallic interconnects are becoming the bottleneck of NoC performance because of their limited bandwidth, long delay, large area and high power consumption. Progress in photonic technologies has diverted researchers' focus towards optical interconnects for on-board inter-chip interconnect, switching fabrics in core routers, and so on. Huaxi Gu, Kwai Hung Mo, Jiang Xu and Wei Zhang [103] proposed a low-power, low-loss and low-cost 5x5 optical router design called Cygnus for optical NoCs. Cygnus is non-blocking and based on silicon micro-resonators. They carried out a comparative performance study of Cygnus against other optical routers; the results show that Cygnus consumes 50% less power, has 51% less loss and requires 20% fewer micro-resonators than the traditional crossbar, and consumes only 3.8% of the power of a high-performance 45 nm electronic router.

In our switch design we implemented a FIFO input buffer, so packets are served in first-in first-out order. The arbitration strategy we used works well for most applications, but given the differing requirements of different applications the platform may not scale well. Different packets can also have very different effects on system performance owing to the different levels of memory-level parallelism (MLP) in applications: certain packets may be performance-critical because they cause the processor to stall, whereas others may be delayed for a number of cycles with no effect on application-level performance because their latencies are hidden by other outstanding packets' latencies. Deciding the arbitration strategy at switch-design time is therefore difficult unless one knows what type of application will run on the platform. Thus, characterizing, understanding and optimizing the interference behaviour of applications in the NoC is an important problem for enhancing system performance and fairness. One can capture a packet's importance to its application's performance by differentiating packets based on their slack. Reetuparna Das et al. [104] explored this area by proposing Aérgia, which exploits packet latency slack in on-chip networks. They introduce the concept of packet slack and characterize it in the context of on-chip networks. To exploit packet slack they propose and evaluate a novel architecture, Aérgia, which identifies and prioritizes low-slack (critical) packets; its key components are techniques for online estimation of slack in packet latency and slack-aware arbitration policies. Averaged over 56 randomly generated multiprogrammed workload mixes on a 64-core 8x8-mesh CMP, Aérgia improves overall system throughput by 10.3% and network fairness by 30.8% on average, and the results show that it outperforms three state-of-the-art NoC scheduling, prioritization and fairness techniques. They conclude that the proposed network architecture is effective at improving overall system throughput as well as reducing application-level network unfairness.

Adequate research work exists on latency and energy optimization for quality of service (QoS). Researchers have worked on the latency of operation, and various latency metrics are offered. One can find various architectures for optimizing guaranteed-bandwidth and best-effort QoS using arbitration algorithms such as round robin, FCFS, priority, and priority-based round robin, the latter being used for guaranteed traffic. Sunghyun Park et al. [105] proposed a 16-node 4x4 mesh NoC fabricated in 45 nm SOI, and defined and analyzed the theoretical limits of a mesh NoC in latency, throughput and energy. Off-chip signaling techniques have been shown to deliver substantial energy savings when applied to NoC interconnects, but circuit- or system-level solutions to their increased vulnerability to process variations need to be developed to ensure future viability. Owing to the characteristics of on-chip networks and the systems built on them, service guarantees and predictability are usually provided through hardware implementations, as opposed to the statistical guarantees of macro-networks. In real-time or other critical applications one really needs guarantees rather than mere predictability; to give hard guarantees one must use specific routing schemes (virtual channels, adaptive routing, fault-tolerant router and link structures, etc.) to ensure the required levels of performance and/or reliability. Naturally, for guaranteed-service (GS) NoCs, implementation cost and complexity grow with the complexity of the system and the QoS requirements, while best-effort (BE) NoCs tend to present better utilization of the available resources.

To evaluate NoC performance, many researchers have simulated NoC models using different topologies and routing algorithms and carried out performance evaluation in terms of throughput and delay by varying network parameters such as packet size, link delay, queue size and bandwidth under different traffic patterns. J. Flich, P. López, M. P. Malumbres and J. Duato [106] discussed improving the performance of regular networks with source routing. Here mesh interconnection networks use source routing to send packets from the source node to the destination node: the information about the whole path from source to destination is pre-computed and provided in the packet header. For a 2D mesh of size 4x4 they found that the FTP mechanism performs better than the CBR mechanism for parallel transmission: FTP lost no packets, while the constant-bit-rate mechanism lost a few packets. FTP is therefore the most secure and reliable mechanism when parallel transmission with handshaking is considered in mesh interconnection networks. C. Marcon et al. [107] observed that NoC performance is also affected by the packet injection rate in their proposed Tiny-optimized 3D mesh NoC designed for area and latency minimization. They compared Tiny with a straightforward 3D mesh NoC such as Lasio and concluded that (i) Tiny always reduces the overall area, even with a small area increase in the border routers, and (ii) Tiny reduces the average packet latency when there are few concurrent communications and/or a low packet injection rate; in contrast, when the packet injection rate increases and there are many concurrent communications, Lasio provides a lower average packet latency owing to its larger number of buffered paths. In our simulations we studied low load and high load traffic and observed that heavy traffic degrades the delay performance of the model. This traffic can also be balanced using various routing techniques.

Mukund Ramakrishna et al. [108] discuss Global Congestion Awareness (GCA) for load balancing in NoC. They present a novel adaptive routing technique for on-chip networks which makes routing decisions based on global knowledge of the network state. GCA is a lightweight, low-complexity adaptive routing algorithm which makes per-hop routing decisions based on awareness of the congestion of links throughout the network. It differs from other globally aware routing schemes in that it utilizes existing packets within the network to convey ("piggyback") congestion information, instead of requiring a sideband network dedicated to congestion status propagation, which makes it a more scalable solution than other existing techniques. Their experiments show that GCA consistently performs better than Regional Congestion Awareness (RCA-1D) on a variety of workloads: on SPLASH-2 traffic, GCA is 51% better in latency than RCA in the best case and 15% better on average, and it also betters a competing globally aware routing algorithm, DAR, by 8% on average on realistic workloads.

The size of the network also plays a vital role in performance: a smaller network has shorter links and hence a lower average link delay. Jie Chen et al. [109] presented a performance evaluation of three NoC architectures, studying torus, Metacube and hypercube under bit-complement traffic and a Poisson assumption. Although the hypercube performs best, it is expensive in terms of link complexity and node degree. When networks are small (32-64 nodes) and channel loads are not heavy, the torus is a viable choice, being cheap and simple to implement. For the 32-node and 64-node cases the Metacube performs the worst of the three because it has the smallest node degree and the largest diameter; for 128-node and 512-node networks the Metacube starts to outperform the torus and exhibits performance similar to the hypercube. The 128-node and 512-node Metacubes have lower link complexity and fewer long wires, which makes them a viable choice under moderate load.

Further, Zvika Guz et al. [110] explored network delay and link capacities in application-specific wormhole NoCs. They present a novel analytical delay model for virtual-channeled wormhole networks with non-uniform links and apply the analysis in an efficient capacity allocation algorithm which assigns link capacities such that the packet delay requirement of each flow is satisfied. NoC-based application-specific systems on chip, where information traffic is heterogeneous and delay requirements may vary widely, require individual capacity assignment for each link in the NoC, in contrast to the standard approach of on- and off-chip interconnection networks, which employ uniform-capacity links. The allocation of link capacities is therefore an essential step in the automated design process of NoC-based systems, and the algorithm should minimize communication resource costs under quality-of-service timing constraints. It is also possible to customize link wire lengths at placement time. Junbok You et al. [111] proposed bandwidth optimization in asynchronous NoCs by customizing link wire length, estimating the benefit of bandwidth optimization in the design of an asynchronous network-on-chip. They developed a tool to optimize NoC bandwidth and energy through topology and router placement, which minimizes the wire lengths and router hops of high-bandwidth network links; a second optimization adds pipeline latches to long links that require further bandwidth improvement. This design process creates a multi-frequency, bandwidth-optimized asynchronous NoC, which was compared to a clocked NoC with the same optimizations applied but without multi-frequency design. The results show that exploiting the natural multi-frequency nature of asynchronous design yields substantial improvements: the topology and placement optimizations create an asynchronous design with an average link bandwidth of 1.54 Gflits/s which, compared to the 1.54 GHz clocked design, has 46% less average packet latency and 19% less energy consumption at 3x offered load. The asynchronous design performs similarly to the clocked network with a link bandwidth of 2.12 Gflits/s but demands 29% less energy at 3x offered load. Adding pipeline latches to the clocked design has no substantial benefit; for the asynchronous design, however, this optimization significantly improves performance when the network is highly congested: at 7x offered load the pipeline latches result in a 35% reduction in average packet latency for only 6.1% more energy consumption.

7.2 Future Trends

The future trend is towards three-dimensional (3D) NoC: as applications become more and more complex, 2-D NoCs may not scale well. As a network grows, its diameter also grows, which affects performance, so current researchers are trying to optimize NoC-based architectures. Brett Stanley Feero and Partha Pratim Pande [112] performed a simulation study to evaluate the performance of five different topologies in 3D space: 3D mesh, stacked mesh, ciliated 3D mesh, butterfly fat tree (BFT) and SPIN. They demonstrated that mesh- and tree-based NoCs achieve better performance when instantiated in a 3D IC environment than in a 2D implementation: 3D mesh-based NoCs achieve significant gains in latency, throughput and energy dissipation, while 3D tree-based NoCs show a significant gain only in energy dissipation, with little area overhead in both cases.

Awet Yemane Weldezion, Matt Grange, Dinesh Pamunuwa, Zhonghai Lu, Axel Jantsch, Roshan Weerasekera and Hannu Tenhunen [113] studied the performance and scalability of three 3D NoC communication topologies, namely 2-D mesh, 3-D mesh (with switch connectivity between layers) and 3-D bus (with bus connectivity between layers), using through-silicon vias (TSVs) for inter-die connectivity. Using different traffic patterns they performed cycle-accurate RTL-level simulation of two communication schemes based on a 7-port switch and a centrally arbitrated vertical bus. The results show that among the three designs the 3-D 7-port switch is the best performer in terms of throughput and normalized latency.

Amir Mohammad Rahmani, Kameswar Rao Vaddina, Khalid Latif, Pasi Liljeberg, Juha Plosila and Hannu Tenhunen [114] proposed an efficient inter-layer communication scheme for a 3D NoC-Bus hybrid mesh architecture to improve system reliability, enhance system performance and reduce power consumption. They also presented a congestion-aware and bus-failure-tolerant routing algorithm called AdaptiveZ for vertical communication, and implemented a generic monitoring platform called ARB-NET on top of the 3D stacked NoC mesh architecture, which can be used for traffic monitoring and fault tolerance.

Seyyed Hossein Seyyedaghaei Rezaei, Abbas Mazloumi, Mehdi Modarressi and Pejman Lotfi-Kamran [115] presented a router microarchitecture for 3D NoC called FRESH, with fine-grained resource sharing capability. The proposed NoC takes advantage of ultra-fast vertical links to forward a blocked flit to a vertically adjacent router if the corresponding link and crossbar ports are idle, effectively eliminating packet blocking time. Cycle-accurate simulation results show up to 21% lower packet latency compared to other state-of-the-art 3D NoCs.

Randy W. Morris Jr, Avinash Karanth Kodi, Ahmed Louri and Ralph D. Whaley Jr [116] proposed a high-bandwidth, energy-efficient interconnect architecture called On-Chip Multilayer Photonic (OCMP), which uses nanophotonic interconnects (NIs) and 3D stacking technologies with reconfiguration. They also proposed a reconfiguration algorithm that maximizes the available bandwidth by dynamically reallocating channel bandwidth through runtime monitoring of network resources. Simulation results for a 64-core reconfigured network show that execution time can be reduced by up to 25% on PARSEC, SPLASH-2 and SPEC CPU2006 benchmarks; simulations of a 256-core version of the proposed architecture against on-chip electrical and optical networks indicate more than 25% throughput improvement and 23% energy saving on synthetic traffic.

Yaoyao Ye, Jiang Xu, Baihan Huang, Xiaowen Wu, Wei Zhang, Xuan Wang, Mahdi Nikdast, Zhehui Wang, Weichen Liu and Zhe Wang [117] proposed a 3-D mesh-based optical NoC for MPSoC, along with 4x4 and 5x5 routers for the network corners and edges and 6x6 and 7x7 optical routers for dimension-order routing. The routers are strictly non-blocking and built from silicon MRs. In addition, an optimized floorplan for the 3-D mesh-based ONoC was proposed which follows the regular 3-D mesh topology but implements all optical routers in a single optical layer. Simulations comparing the 3-D 8x8x2 mesh-based ONoC with a 2-D 16x8 mesh-based optical NoC show that the proposed architecture achieves 17% performance improvement and 32% less energy consumption; compared with a 2-D 16x8 mesh-based electronic NoC it achieves 9% performance improvement and a 47% reduction in energy consumption.

Akram Ben Ahmed and Abderazek Ben Abdallah [118] proposed the Look Ahead Fault Tolerant (LAFT) routing algorithm for 3D NoC. LAFT enhances system performance and reduces communication latency while ensuring fault tolerance with reasonable hardware complexity. The algorithm was implemented on a real 3D NoC architecture (3D-OASIS-NoC) and prototyped on FPGA; evaluation results show a 38% average latency reduction and up to 46% enhanced throughput compared to the conventional XYZ routing algorithm. Sara Akbari, Ali Shafiee, Mahmoud Fathy and Reza Berangi [119] proposed AFRA, a deadlock-free routing algorithm for 3D mesh-based NoC which tolerates faults on vertical links. AFRA uses ZXY routing whenever possible and switches to XZXY routing when a fault is detected on the baseline path, achieving high performance and simplicity. AFRA is highly robust, supporting connections between all possible pairs of communicating nodes at high fault rates. Compared with planar adaptive routing, AFRA showed 70% and 54.1% improvement in saturation injection rate for uniform traffic under a single fault and 207% and 44% for bit-complement traffic under five faults.

Wireless 3D NoC Architecture

A wireless 3D NoC architecture has been proposed for building-block SiPs, in which the number and types of chips stacked in a package can be changed after the chips have been fabricated, by virtue of inductive coupling. These chips can be formed into a single unidirectional ring network so as to fully exploit the flexibility of the wireless connections, enabling chips to be added, removed and swapped in a package without updating any routing information. Yasuhiro Take et al. [120] described such a 3D NoC with inductive-coupling links for building-block systems-in-package (SiPs). The unidirectional ring can be extended to a bidirectional ring network by using inductive-coupling transceivers that can dynamically change their communication direction in a single cycle. The proposed vertical bubble flow network outperforms the conventional VC-based approach by 7.9-12.5% with a 33.5% smaller router area for building-block SiPs connecting up to eight chips. The next step in mobility is the Internet of Things (IoT) [121], where everyday devices, objects and physical assets are equipped with computing power, sensors and wireless connections that enable them to communicate with each other and with the Internet. These connected objects become more and more intelligent with tiny integrated processors, such as the Intel Quark processor the size of an SD card, including Bluetooth and WiFi communication capabilities. Ultimately, these SoCs with their on-chip interconnects will be connected into large wireless off-chip networks carrying an ever-increasing amount of data.


ANNEXURE

Appendix-I: C application for packet generation and reception on the computing node

#include "system.h"
#include "altera_avalon_pio_regs.h"
#include "alt_types.h"   /* added so alt_u8 is declared (assumed available in the Nios II BSP) */
#include "stdio.h"

int recieve_data(void);  /* forward declaration added; definition follows main() */

int j = -2;

int main(void)
{
    int loop = 0;
    //for(loop=0;loop<10;loop++)   //run simulation for 10 times
    //{
    //  printf("............................................SIMULATION %d\n",loop);
    int fp = 0;
    int k = 0, p = 0;
    alt_u8 d = 0x00;              /* uncommented: d is read back from PIO_5 below */
    alt_u8 clk_edge;
    alt_u8 fp_out = 0x00;
    alt_u8 direct_data = 0x00;

    /* Four 53-byte ATM test packets: a 5-byte header followed by 48 payload bytes. */
    unsigned char data1[53] = {
        0x72,0x73,0x74,0x75,0x76,0x01,0x02,0x03,0x04,0x05,0x06,0x07,0x08,0x09,
        0x10,0x11,0x12,0x13,0x14,0x15,0x16,0x17,0x18,0x19,0x20,0x21,0x22,0x23,
        0x24,0x25,0x26,0x27,0x28,0x29,0x30,0x31,0x32,0x33,0x34,0x35,0x36,0x37,
        0x38,0x39,0x40,0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48 };
    unsigned char data2[53] = {
        0xA9,0xAA,0xAB,0xAC,0xAD,
        0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,
        0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,
        0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa };
    unsigned char data3[53] = {
        0xDF,0xE0,0xE1,0xE2,0xE3,
        0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,
        0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,
        0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb };
    unsigned char data4[53] = {
        0x15,0x16,0x17,0x18,0x1A,
        0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,
        0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,
        0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc };

    IOWR(PIO_BASE, 0, 0x01);  /* drive fp_in high (the original "fp1 low" comment was inaccurate) */
    printf("\nfp_in is made high to generate fp_out\n");

    while (fp_out == 0) {                     /* wait until fp_out goes high */
        /* wait for the switch clock to go low, then return high */
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_high %2x", clk_edge);
        } while (clk_edge == 0x01);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_low %2x", clk_edge);
        } while (clk_edge == 0x00);
        if (fp == 1) {
            IOWR(PIO_BASE, 0, 0x00);          /* fp1 low */
            printf(" fp is made low\n");
        }
        fp++;
        fp_out = IORD_ALTERA_AVALON_PIO_DATA(PIO_7_BASE);
        printf("\n fp_out%2x \n\n", fp_out);
    }
    printf(" first fp_out is detected \n");

    while (fp_out == 1) {
        fp_out = IORD_ALTERA_AVALON_PIO_DATA(PIO_7_BASE);
        printf("\n fp_out%2x \n\n", fp_out);
    }
    printf(" first fp_out is low \n");

    /* after the first fp_out goes low, wait for some more clock pulses, then raise fp_in */
    for (p = 0; p < 10; p++) {
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_high %2x", clk_edge);
        } while (clk_edge == 0x01);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_low %2x", clk_edge);
        } while (clk_edge == 0x00);
    }
    do {
        clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
        printf(" clk_edge_ for fp to make high =%2x\n", clk_edge);
        if (clk_edge == 1) {
            IOWR(PIO_BASE, 0, 0x1);           /* enable fp1 high */
            printf("fp_in high\n");
        }
    } while (clk_edge == 0);

    while (fp_out == 0 && j <= 47) {          /* stream one byte of each packet per clock */
        /* wait for the falling edge of the switch clock */
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_high %2x", clk_edge);
        } while (clk_edge == 0x01);
        printf("\n falling edge detected.............");
        /* send the next byte of each packet to its input port */
        j++;
        if (j >= 0) {
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_1_BASE, data1[j]);
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_8_BASE, data2[j]);
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_8_BASE, data3[j]);   /* note: data2 and data3 both target PIO_8_BASE in the original listing */
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_10_BASE, data4[j]);
        }
        printf(" \n the value of j=%d", j);
        direct_data = IORD_ALTERA_AVALON_PIO_DATA(PIO_5_BASE);
        printf("\n data1_send_at_clk_low=%02x\n", data1[j]);
        printf("\n data2_send_at_clk_low=%02x\n", data2[j]);
        printf("\n data3_send_at_clk_low=%02x\n", data3[j]);
        printf("\n data4_send_at_clk_low=%02x\n", data4[j]);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_low %2x", clk_edge);
        } while (clk_edge == 0x00);
        printf("\n Rising edge detected.............");
        d = IORD_ALTERA_AVALON_PIO_DATA(PIO_5_BASE);
        printf("\ndata_at_clk_high=%02x\n", d);
        fp_out = IORD_ALTERA_AVALON_PIO_DATA(PIO_7_BASE);
        printf("\n fp_out %2x \n\n", fp_out);
    }
    printf(" ......fp_out detected............\n");
    recieve_data();
    // }
    return 0;
}

int recieve_data(void)
{
    int i = 1;
    alt_u8 result2 = 0x00;
    alt_u8 result3 = 0x00;
    alt_u8 result4 = 0x00;
    alt_u8 result5 = 0x00;
    alt_u8 clk_edge;
    printf(" in recieve fn \n");
    do {
        //result2 = IORD_ALTERA_AVALON_PIO_DATA(PIO_2_BASE);
        //printf("data_1=%02x\n",result2);
        //result3 = IORD_ALTERA_AVALON_PIO_DATA(PIO_3_BASE);
        //printf("data_2=%02x\n",result3);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            //printf(" \n clk_high %2x", clk_edge);
        } while (clk_edge == 0x01);
        result4 = IORD_ALTERA_AVALON_PIO_DATA(PIO_4_BASE);   /* sample the output port on the low phase */
        printf("data_out_port_3 = %02x\n", result4);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            //printf(" \n clk_low %2x", clk_edge);
        } while (clk_edge == 0x00);
        //result5 = IORD_ALTERA_AVALON_PIO_DATA(PIO_5_BASE);
        //printf("data_4=%02x\n",result5);
        i++;
    } while (i < 1000000);
    return 0;
}


Appendix-II: Tcl script for 4 x 4 mesh topology

#Create a simulator object
set ns [new Simulator]

#Define different colors for data flows (for NAM)
$ns color 1 Blue
$ns color 2 Red
#$ns color 1 green

#Open the NAM trace file
set nf [open out.nam w]
$ns namtrace-all $nf

#Open the Trace file
set tf [open out.tr w]
$ns trace-all $tf

#Define a 'finish' procedure
proc finish {} {
    global ns nf tf
    $ns flush-trace
    #Close the NAM trace file
    close $nf
    close $tf
    #Execute NAM on the trace file
    exec nam out.nam &
    #Call xgraph to display the results
    exec xgraph out1.xg -geometry 800x800 &
    exit 0
}

#Create sixteen nodes
set n1 [$ns node]
set n2 [$ns node]
set n3 [$ns node]
set n4 [$ns node]
set n5 [$ns node]
set n6 [$ns node]
set n7 [$ns node]
set n8 [$ns node]
set n9 [$ns node]
set n10 [$ns node]
set n11 [$ns node]
set n12 [$ns node]
set n13 [$ns node]
set n14 [$ns node]
set n15 [$ns node]
set n16 [$ns node]

#Create links between the nodes
$ns duplex-link $n1 $n2 10mb 10ms DropTail
$ns duplex-link $n2 $n3 10mb 10ms DropTail
$ns duplex-link $n3 $n4 10mb 10ms DropTail
$ns duplex-link $n1 $n5 10mb 10ms DropTail
$ns duplex-link $n2 $n6 10mb 10ms DropTail
$ns duplex-link $n3 $n7 10mb 10ms DropTail
$ns duplex-link $n4 $n8 10mb 10ms DropTail
$ns duplex-link $n5 $n6 10mb 10ms DropTail
$ns duplex-link $n6 $n7 10mb 10ms DropTail
$ns duplex-link $n7 $n8 10mb 10ms DropTail
$ns duplex-link $n5 $n9 10mb 10ms DropTail
$ns duplex-link $n6 $n10 10mb 10ms DropTail
$ns duplex-link $n7 $n11 10mb 10ms DropTail
$ns duplex-link $n8 $n12 10mb 10ms DropTail
$ns duplex-link $n9 $n10 10mb 10ms DropTail
$ns duplex-link $n10 $n11 10mb 10ms DropTail
$ns duplex-link $n11 $n12 10mb 10ms DropTail
$ns duplex-link $n9 $n13 10mb 10ms DropTail
$ns duplex-link $n10 $n14 10mb 10ms DropTail
$ns duplex-link $n11 $n15 10mb 10ms DropTail
$ns duplex-link $n12 $n16 10mb 10ms DropTail
$ns duplex-link $n13 $n14 10mb 10ms DropTail
$ns duplex-link $n14 $n15 10mb 10ms DropTail
$ns duplex-link $n15 $n16 10mb 10ms DropTail

#Queue limits (left commented out, as in the original run)
#$ns queue-limit $n1 $n2 200
#$ns queue-limit $n2 $n3 200
#$ns queue-limit $n3 $n4 200
#$ns queue-limit $n1 $n5 200
#$ns queue-limit $n2 $n6 200
#$ns queue-limit $n3 $n7 200
#$ns queue-limit $n4 $n8 200
#$ns queue-limit $n5 $n6 200
#$ns queue-limit $n6 $n7 200
#$ns queue-limit $n7 $n8 200
#$ns queue-limit $n5 $n9 200
#$ns queue-limit $n6 $n10 200
#$ns queue-limit $n7 $n11 200
#$ns queue-limit $n8 $n12 200
#$ns queue-limit $n9 $n10 200
#$ns queue-limit $n10 $n11 200
#$ns queue-limit $n11 $n12 200
#$ns queue-limit $n9 $n13 200
#$ns queue-limit $n10 $n14 200
#$ns queue-limit $n11 $n15 200
#$ns queue-limit $n12 $n16 200
#$ns queue-limit $n13 $n14 200
#$ns queue-limit $n14 $n15 200
#$ns queue-limit $n15 $n16 200

#Setup first TCP connection
Agent/TCP set packetSize_ 64000
set tcp1 [new Agent/TCP]
$tcp1 set class_ 2
$ns attach-agent $n1 $tcp1
set sink1 [new Agent/TCPSink]
$ns attach-agent $n16 $sink1
$ns connect $tcp1 $sink1
$tcp1 set fid_ 1

#Setup Second TCP connection
set tcp2 [new Agent/TCP]
$tcp2 set class_ 3
$ns attach-agent $n1 $tcp2
set sink2 [new Agent/TCPSink]
$ns attach-agent $n16 $sink2
$ns connect $tcp2 $sink2
$tcp2 set fid_ 2

#Setup a FTP over TCP connection
set ftp1 [new Application/FTP]
$ftp1 attach-agent $tcp1
$ftp1 set type_ FTP1

#SETUP UDP CONNECTION (left commented out, as in the original)
#set udp [new Agent/UDP]
#$ns attach-agent $n1 $udp
#set null [new Agent/Null]
#$ns attach-agent $n100 $null
#$ns connect $udp $null
#$udp set fid_ 3

#Setup a CBR over TCP connection
set cbr [new Application/Traffic/CBR]
$cbr attach-agent $tcp2
$cbr set type_ CBR
$cbr set packet_size_ 64000
$cbr set rate_ 10Mb
$cbr set random_ off
#$cbr set interval_ 0.1

#Schedule events for the CBR and FTP agents
$ns at 0.1 "$ftp1 start"
$ns at 0.1 "$cbr start"
$ns at 60 "$cbr stop"
$ns at 60 "$ftp1 stop"

#Detach tcp and sink agents (not really necessary)
#$ns at 4.5 "$ns detach-agent $n1 $tcp ; $ns detach-agent $n100 $sink"

#Call the finish procedure after 62 seconds of simulation time
$ns at 62.0 "finish"

#Print CBR packet size and interval
puts "CBR packet size = [$cbr set packet_size_]"
puts "CBR interval = [$cbr set interval_]"
puts "CBR Rate = [$cbr set rate_]"

#Run the simulation
$ns run
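The throughput and delay figures of Chapter 6 are obtained by post-processing the trace file (out.tr) written by scripts such as the one above. The fragment below is a hypothetical sketch, not part of the thesis: it assumes the default single-space-separated ns-2 trace layout (event, time, from-node, to-node, packet type, packet size, ...) and sums the bytes received at the destination. Note that trace node ids are 0-based, so n16 of the script appears as node 15.

# Hypothetical trace post-processing sketch (assumes the default ns-2
# trace format with single-space-separated fields).
set sink  15      ;# destination node id: n16 of the 4x4 script is node 15
set bytes 0
set f [open out.tr r]
while {[gets $f line] >= 0} {
    set fld [split $line " "]
    if {[lindex $fld 0] eq "r" && [lindex $fld 3] == $sink} {
        incr bytes [lindex $fld 5]   ;# count bytes of packets received at the sink
    }
}
close $f
puts "received [expr {$bytes / 1024.0}] Kbytes at node $sink"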


Appendix-III: Tcl script for 8 x 8 mesh topology

#Create a simulator object
set ns [new Simulator]

#Define different colors for data flows (for NAM)
$ns color 1 Blue
$ns color 2 Red
#$ns color 1 green

#Open the NAM trace file
set nf [open out.nam w]
$ns namtrace-all $nf

#Open the Trace file
set tf [open out.tr w]
$ns trace-all $tf

#Define a 'finish' procedure
proc finish {} {
    global ns nf tf
    $ns flush-trace
    #Close the NAM trace file
    close $nf
    close $tf
    #Execute NAM on the trace file
    exec nam out.nam &
    #Call xgraph to display the results
    exec xgraph out1.xg -geometry 800x800 &
    exit 0
}

#Create 64 nodes
set n1 [$ns node]
set n2 [$ns node]
set n3 [$ns node]
set n4 [$ns node]
set n5 [$ns node]
set n6 [$ns node]
set n7 [$ns node]
set n8 [$ns node]
set n9 [$ns node]
set n10 [$ns node]
set n11 [$ns node]
set n12 [$ns node]
set n13 [$ns node]
set n14 [$ns node]
set n15 [$ns node]
set n16 [$ns node]
set n17 [$ns node]
set n18 [$ns node]
set n19 [$ns node]
set n20 [$ns node]
set n21 [$ns node]
set n22 [$ns node]
set n23 [$ns node]
set n24 [$ns node]
set n25 [$ns node]
set n26 [$ns node]
set n27 [$ns node]
set n28 [$ns node]
set n29 [$ns node]
set n30 [$ns node]
set n31 [$ns node]
set n32 [$ns node]
set n33 [$ns node]
set n34 [$ns node]
set n35 [$ns node]
set n36 [$ns node]
set n37 [$ns node]
set n38 [$ns node]
set n39 [$ns node]
set n40 [$ns node]
set n41 [$ns node]
set n42 [$ns node]
set n43 [$ns node]
set n44 [$ns node]
set n45 [$ns node]
set n46 [$ns node]
set n47 [$ns node]
set n48 [$ns node]
set n49 [$ns node]
set n50 [$ns node]
set n51 [$ns node]
set n52 [$ns node]
set n53 [$ns node]
set n54 [$ns node]
set n55 [$ns node]
set n56 [$ns node]
set n57 [$ns node]
set n58 [$ns node]
set n59 [$ns node]
set n60 [$ns node]
set n61 [$ns node]
set n62 [$ns node]
set n63 [$ns node]
set n64 [$ns node]

#Create links between the nodes
#--------- horizontal connections ---------
$ns duplex-link $n1 $n2 10Mb 10ms DropTail
$ns duplex-link $n2 $n3 10Mb 10ms DropTail
$ns duplex-link $n3 $n4 10Mb 10ms DropTail
$ns duplex-link $n4 $n5 10Mb 10ms DropTail
$ns duplex-link $n5 $n6 10Mb 10ms DropTail
$ns duplex-link $n6 $n7 10Mb 10ms DropTail
$ns duplex-link $n7 $n8 10Mb 10ms DropTail
$ns duplex-link $n9 $n10 10Mb 10ms DropTail
$ns duplex-link $n10 $n11 10Mb 10ms DropTail
$ns duplex-link $n11 $n12 10Mb 10ms DropTail
$ns duplex-link $n12 $n13 10Mb 10ms DropTail
$ns duplex-link $n13 $n14 10Mb 10ms DropTail
$ns duplex-link $n14 $n15 10Mb 10ms DropTail
$ns duplex-link $n15 $n16 10Mb 10ms DropTail
$ns duplex-link $n17 $n18 10Mb 10ms DropTail
$ns duplex-link $n18 $n19 10Mb 10ms DropTail
$ns duplex-link $n19 $n20 10Mb 10ms DropTail
$ns duplex-link $n20 $n21 10Mb 10ms DropTail
$ns duplex-link $n21 $n22 10Mb 10ms DropTail
$ns duplex-link $n22 $n23 10Mb 10ms DropTail
$ns duplex-link $n23 $n24 10Mb 10ms DropTail
$ns duplex-link $n25 $n26 10Mb 10ms DropTail
$ns duplex-link $n26 $n27 10Mb 10ms DropTail
$ns duplex-link $n27 $n28 10Mb 10ms DropTail
$ns duplex-link $n28 $n29 10Mb 10ms DropTail
$ns duplex-link $n29 $n30 10Mb 10ms DropTail
$ns duplex-link $n30 $n31 10Mb 10ms DropTail
$ns duplex-link $n31 $n32 10Mb 10ms DropTail
$ns duplex-link $n33 $n34 10Mb 10ms DropTail
$ns duplex-link $n34 $n35 10Mb 10ms DropTail
$ns duplex-link $n35 $n36 10Mb 10ms DropTail
$ns duplex-link $n36 $n37 10Mb 10ms DropTail
$ns duplex-link $n37 $n38 10Mb 10ms DropTail
$ns duplex-link $n38 $n39 10Mb 10ms DropTail
$ns duplex-link $n39 $n40 10Mb 10ms DropTail
$ns duplex-link $n41 $n42 10Mb 10ms DropTail
$ns duplex-link $n42 $n43 10Mb 10ms DropTail
$ns duplex-link $n43 $n44 10Mb 10ms DropTail
$ns duplex-link $n44 $n45 10Mb 10ms DropTail
$ns duplex-link $n45 $n46 10Mb 10ms DropTail
$ns duplex-link $n46 $n47 10Mb 10ms DropTail
$ns duplex-link $n47 $n48 10Mb 10ms DropTail
$ns duplex-link $n49 $n50 10Mb 10ms DropTail
$ns duplex-link $n50 $n51 10Mb 10ms DropTail
$ns duplex-link $n51 $n52 10Mb 10ms DropTail
$ns duplex-link $n52 $n53 10Mb 10ms DropTail
$ns duplex-link $n53 $n54 10Mb 10ms DropTail
$ns duplex-link $n54 $n55 10Mb 10ms DropTail
$ns duplex-link $n55 $n56 10Mb 10ms DropTail
$ns duplex-link $n57 $n58 10Mb 10ms DropTail
$ns duplex-link $n58 $n59 10Mb 10ms DropTail
$ns duplex-link $n59 $n60 10Mb 10ms DropTail
$ns duplex-link $n60 $n61 10Mb 10ms DropTail
$ns duplex-link $n61 $n62 10Mb 10ms DropTail
$ns duplex-link $n62 $n63 10Mb 10ms DropTail
$ns duplex-link $n63 $n64 10Mb 10ms DropTail

#--------- vertical connections ---------
$ns duplex-link $n1 $n9 10Mb 10ms DropTail
$ns duplex-link $n2 $n10 10Mb 10ms DropTail
$ns duplex-link $n3 $n11 10Mb 10ms DropTail
$ns duplex-link $n4 $n12 10Mb 10ms DropTail
$ns duplex-link $n5 $n13 10Mb 10ms DropTail
$ns duplex-link $n6 $n14 10Mb 10ms DropTail
$ns duplex-link $n7 $n15 10Mb 10ms DropTail
$ns duplex-link $n8 $n16 10Mb 10ms DropTail
$ns duplex-link $n9 $n17 10Mb 10ms DropTail
$ns duplex-link $n10 $n18 10Mb 10ms DropTail
$ns duplex-link $n11 $n19 10Mb 10ms DropTail
$ns duplex-link $n12 $n20 10Mb 10ms DropTail
$ns duplex-link $n13 $n21 10Mb 10ms DropTail
$ns duplex-link $n14 $n22 10Mb 10ms DropTail
$ns duplex-link $n15 $n23 10Mb 10ms DropTail
$ns duplex-link $n16 $n24 10Mb 10ms DropTail
$ns duplex-link $n17 $n25 10Mb 10ms DropTail
$ns duplex-link $n18 $n26 10Mb 10ms DropTail
$ns duplex-link $n19 $n27 10Mb 10ms DropTail
$ns duplex-link $n20 $n28 10Mb 10ms DropTail
$ns duplex-link $n21 $n29 10Mb 10ms DropTail
$ns duplex-link $n22 $n30 10Mb 10ms DropTail
$ns duplex-link $n23 $n31 10Mb 10ms DropTail
$ns duplex-link $n24 $n32 10Mb 10ms DropTail
$ns duplex-link $n25 $n33 10Mb 10ms DropTail
$ns duplex-link $n26 $n34 10Mb 10ms DropTail
$ns duplex-link $n27 $n35 10Mb 10ms DropTail
$ns duplex-link $n28 $n36 10Mb 10ms DropTail
$ns duplex-link $n29 $n37 10Mb 10ms DropTail
$ns duplex-link $n30 $n38 10Mb 10ms DropTail
$ns duplex-link $n31 $n39 10Mb 10ms DropTail
$ns duplex-link $n32 $n40 10Mb 10ms DropTail
$ns duplex-link $n33 $n41 10Mb 10ms DropTail
$ns duplex-link $n34 $n42 10Mb 10ms DropTail
$ns duplex-link $n35 $n43 10Mb 10ms DropTail
$ns duplex-link $n36 $n44 10Mb 10ms DropTail
$ns duplex-link $n37 $n45 10Mb 10ms DropTail
$ns duplex-link $n38 $n46 10Mb 10ms DropTail
$ns duplex-link $n39 $n47 10Mb 10ms DropTail
$ns duplex-link $n40 $n48 10Mb 10ms DropTail
$ns duplex-link $n41 $n49 10Mb 10ms DropTail
$ns duplex-link $n42 $n50 10Mb 10ms DropTail
$ns duplex-link $n43 $n51 10Mb 10ms DropTail
$ns duplex-link $n44 $n52 10Mb 10ms DropTail
$ns duplex-link $n45 $n53 10Mb 10ms DropTail
$ns duplex-link $n46 $n54 10Mb 10ms DropTail
$ns duplex-link $n47 $n55 10Mb 10ms DropTail
$ns duplex-link $n48 $n56 10Mb 10ms DropTail
$ns duplex-link $n49 $n57 10Mb 10ms DropTail
$ns duplex-link $n50 $n58 10Mb 10ms DropTail
$ns duplex-link $n51 $n59 10Mb 10ms DropTail
$ns duplex-link $n52 $n60 10Mb 10ms DropTail
$ns duplex-link $n53 $n61 10Mb 10ms DropTail
$ns duplex-link $n54 $n62 10Mb 10ms DropTail
$ns duplex-link $n55 $n63 10Mb 10ms DropTail
$ns duplex-link $n56 $n64 10Mb 10ms DropTail

#Set queue size of each link to 100 packets
#--------- horizontal connections ---------
$ns queue-limit $n1 $n2 100
$ns queue-limit $n2 $n3 100
$ns queue-limit $n3 $n4 100
$ns queue-limit $n4 $n5 100
$ns queue-limit $n5 $n6 100
$ns queue-limit $n6 $n7 100
$ns queue-limit $n7 $n8 100
$ns queue-limit $n9 $n10 100
$ns queue-limit $n10 $n11 100
$ns queue-limit $n11 $n12 100
$ns queue-limit $n12 $n13 100
$ns queue-limit $n13 $n14 100
$ns queue-limit $n14 $n15 100
$ns queue-limit $n15 $n16 100
$ns queue-limit $n17 $n18 100
$ns queue-limit $n18 $n19 100
$ns queue-limit $n19 $n20 100
$ns queue-limit $n20 $n21 100
$ns queue-limit $n21 $n22 100
$ns queue-limit $n22 $n23 100
$ns queue-limit $n23 $n24 100
$ns queue-limit $n25 $n26 100
$ns queue-limit $n26 $n27 100
$ns queue-limit $n27 $n28 100
$ns queue-limit $n28 $n29 100
$ns queue-limit $n29 $n30 100
$ns queue-limit $n30 $n31 100
$ns queue-limit $n31 $n32 100
$ns queue-limit $n33 $n34 100
$ns queue-limit $n34 $n35 100
$ns queue-limit $n35 $n36 100
$ns queue-limit $n36 $n37 100
$ns queue-limit $n37 $n38 100
$ns queue-limit $n38 $n39 100
$ns queue-limit $n39 $n40 100
$ns queue-limit $n41 $n42 100
$ns queue-limit $n42 $n43 100
$ns queue-limit $n43 $n44 100
$ns queue-limit $n44 $n45 100
$ns queue-limit $n45 $n46 100
$ns queue-limit $n46 $n47 100
$ns queue-limit $n47 $n48 100
$ns queue-limit $n49 $n50 100
$ns queue-limit $n50 $n51 100
$ns queue-limit $n51 $n52 100
$ns queue-limit $n52 $n53 100
$ns queue-limit $n53 $n54 100
$ns queue-limit $n54 $n55 100
$ns queue-limit $n55 $n56 100
$ns queue-limit $n57 $n58 100
$ns queue-limit $n58 $n59 100
$ns queue-limit $n59 $n60 100
$ns queue-limit $n60 $n61 100
$ns queue-limit $n61 $n62 100
$ns queue-limit $n62 $n63 100
$ns queue-limit $n63 $n64 100

#--------- vertical connections ---------
$ns queue-limit $n1 $n9 100
$ns queue-limit $n2 $n10 100
$ns queue-limit $n3 $n11 100
$ns queue-limit $n4 $n12 100
$ns queue-limit $n5 $n13 100
$ns queue-limit $n6 $n14 100
$ns queue-limit $n7 $n15 100
$ns queue-limit $n8 $n16 100
$ns queue-limit $n9 $n17 100
$ns queue-limit $n10 $n18 100
$ns queue-limit $n11 $n19 100
$ns queue-limit $n12 $n20 100
$ns queue-limit $n13 $n21 100
$ns queue-limit $n14 $n22 100
$ns queue-limit $n15 $n23 100
$ns queue-limit $n16 $n24 100
$ns queue-limit $n17 $n25 100
$ns queue-limit $n18 $n26 100
$ns queue-limit $n19 $n27 100
$ns queue-limit $n20 $n28 100
$ns queue-limit $n21 $n29 100
$ns queue-limit $n22 $n30 100
$ns queue-limit $n23 $n31 100
$ns queue-limit $n24 $n32 100
$ns queue-limit $n25 $n33 100
$ns queue-limit $n26 $n34 100
$ns queue-limit $n27 $n35 100
$ns queue-limit $n28 $n36 100
$ns queue-limit $n29 $n37 100
$ns queue-limit $n30 $n38 100
$ns queue-limit $n31 $n39 100
$ns queue-limit $n32 $n40 100
$ns queue-limit $n33 $n41 100
$ns queue-limit $n34 $n42 100
$ns queue-limit $n35 $n43 100
$ns queue-limit $n36 $n44 100
$ns queue-limit $n37 $n45 100
$ns queue-limit $n38 $n46 100
$ns queue-limit $n39 $n47 100
$ns queue-limit $n40 $n48 100
$ns queue-limit $n41 $n49 100
$ns queue-limit $n42 $n50 100
$ns queue-limit $n43 $n51 100
$ns queue-limit $n44 $n52 100
$ns queue-limit $n45 $n53 100
$ns queue-limit $n46 $n54 100
$ns queue-limit $n47 $n55 100
$ns queue-limit $n48 $n56 100
$ns queue-limit $n49 $n57 100
$ns queue-limit $n50 $n58 100
$ns queue-limit $n51 $n59 100
$ns queue-limit $n52 $n60 100
$ns queue-limit $n53 $n61 100
$ns queue-limit $n54 $n62 100
$ns queue-limit $n55 $n63 100
$ns queue-limit $n56 $n64 100

###################################
#Setup first TCP connection
Agent/TCP set packetSize_ 512
set tcp1 [new Agent/TCP]
$tcp1 set class_ 2
$ns attach-agent $n1 $tcp1
set sink1 [new Agent/TCPSink]
$ns attach-agent $n64 $sink1
$ns connect $tcp1 $sink1
$tcp1 set fid_ 1

#Setup Second TCP connection
set tcp2 [new Agent/TCP]
$tcp2 set class_ 3
$ns attach-agent $n1 $tcp2
set sink2 [new Agent/TCPSink]
$ns attach-agent $n64 $sink2
$ns connect $tcp2 $sink2
$tcp2 set fid_ 2

#Setup a FTP over TCP connection
set ftp1 [new Application/FTP]
$ftp1 attach-agent $tcp1
$ftp1 set type_ FTP1

#SETUP UDP CONNECTION (left commented out, as in the original)
#set udp [new Agent/UDP]
#$ns attach-agent $n1 $udp
#set null [new Agent/Null]
#$ns attach-agent $n50 $null
#$ns connect $udp $null
#$udp set fid_ 3

#Setup a CBR over TCP connection
set cbr [new Application/Traffic/CBR]
$cbr attach-agent $tcp2
$cbr set type_ CBR
$cbr set packet_size_ 512
$cbr set rate_ 10Mb
$cbr set random_ off
#$cbr set interval_ 0.1

#Schedule events for the CBR and FTP agents
$ns at 0.1 "$ftp1 start"
$ns at 0.1 "$cbr start"
$ns at 60 "$cbr stop"
$ns at 60 "$ftp1 stop"

#Detach tcp and sink agents (not really necessary)
#$ns at 4.5 "$ns detach-agent $n1 $tcp ; $ns detach-agent $n100 $sink"

#Call the finish procedure after 62 seconds of simulation time
$ns at 62.0 "finish"

#Print CBR packet size and interval
puts "CBR packet size = [$cbr set packet_size_]"
puts "CBR interval = [$cbr set interval_]"
puts "CBR Rate = [$cbr set rate_]"

#Run the simulation
$ns run


REFERENCES

[1] http://wccftech.com/intel-broadwell-ep-xeon-e5-v4/#ixzz49YMiOki1.

[2] http://www.intel.com/content/www/us/en/silicon-innovations/intel-14nmtechnology.html.

[3] http://siliconangle.com/blog/2015/10/26/oracle-debuts-first-systems-with-10-billion-transistor-sparc-m7-chip/.

[4] https://devblogs.nvidia.com/parallelforall/inside-pascal/.

[5] S. W. Keckler et al. (eds.), Multicore Processors and Systems, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-0263-42, Springer Science+Business Media, LLC, 2009.

[6] http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance.

[7] "Design of energy-efficient on-chip networks," http://www.rle.mit.edu/isg/documents/2010_NOC_tutorial_vladimir.pdf, 2010.

[8] T. G. Mattson, R. F. Van der Wijngaart, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe, "The 48-core SCC processor: The programmer's view," in Proc. SC, Nov. 2010, pp. 1-11.

[9] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, "TILE64 processor: A 64-core SoC with mesh interconnect," in Proc. ISSCC, Feb. 2008, pp. 88-598.

[10] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar, "An 80-tile sub-100-W teraflops processor in 65-nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 29-41, Jan. 2008.

[11] Wayne Wolf, Ahmed Amine Jerraya, and Grant Martin, "Multiprocessor System-on-Chip (MPSoC) Technology," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, October 2008.

[12] AMBA™ Specification Rev. 2.0, http://www.arm.com, 1999.

[13] Specification for the: WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores, OpenCores, 2002.

[14] The CoreConnect Bus Architecture, http://www-03.ibm.com/chips/products/coreconnect/, 1999.

[15] A Comparison of Network-on-Chip and Busses, http://www.arteris.com/noc_whitepaper.pdf, 2005.

[16] http://chipdesignmag.com/sld/files/2009/11/NoC-Evolution_v11.pdf.

[17] Naveen Choudhary, "Network on Chip: A New SoC Communication Infrastructure Paradigm," International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, vol. 1, issue 6, January 2012.

[18] Ahmed Jerraya and Wayne Wolf, Multiprocessor Systems-on-Chips, Elsevier Inc., 2004.

[19] W. J. Dally and B. Towles, "Route packets, not wires: on-chip interconnection networks," in Proceedings DAC, pp. 684-689, June 2001.

[20] Maurizio Palesi and Masoud Daneshtalab (Eds.), Routing Algorithms in Networks-on-Chip, Springer, 2014.

[21] Daniel Sanchez, George Michelogiannakis, and Christos Kozyrakis, "An Analysis of On-Chip Interconnection Networks for Large-Scale Chip Multiprocessors," ACM Transactions on Architecture and Code Optimization, vol. 7, no. 1, article 4, April 2010.

[22] David Atienza, Federico Angiolini, Srinivasan Murali, Antonio Pullini, Luca Benini, and Giovanni De Micheli, "Network-on-Chip design and synthesis outlook," Integration, the VLSI Journal, vol. 41, pp. 340-359, 2008.

[23] A. Kumar, A. Jantsch, M. Millberg, J. Öberg, J.-P. Soininen, M. Forsell, K. Tiensyrjä, and A. Hemani, "A network on chip architecture and design methodology," in Proc. Symp. VLSI, p. 117, Washington, DC, USA, 2002. IEEE Computer Society.

[24] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, Jae-Wook Lee, W. Lee, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw microprocessor: a computational fabric for software circuits and general purpose programs," Micro,

IEEE, vol.22, no.2, pp 25-35,2002

159

[25] S.R.Vangal, J.Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. singh,

T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar and S. Borkar “ An

80-tile sub-100-w teraflops processor in 65-nm cmos”, IEEE Journal of Solid State

Circuits, vol.43, no.1,pp29-41, 2008

[26] W.J. Dally and B. Towles “Route packets, not wires: On-chip interconnection

networks. In Procesdings of the Design Automation Conference, pp 684-689, June

2001

[27] C.Grecu, P.P. Pande, A. Ivanov and R.Saleh “ Structured interconnect architecture: a

solution for the non-scalability of bus based SoCs” in 14th ACM Great Lakes

Symposium on VLSI(GLSVLSI’04) pages 192-195, April 2004

[28] A. Greiner L. Mortiez A. Adriahantenaina, H. Charley and C.A. Zeferino “SPIN: A

Scalable, packet switched, on-chip micro-network” In Design, Automation and Test in

Europe (DATE’05), page 20070, Washington, DC, USA, 2003.IEEE Computer

Society.

[29] Karim and Faraydon O. Octagon interconnection network for linking processing nodes

on an SOC device and method of operating same. US Patent 7218616

[30] M. Coppola, M.D. Grammatikakis, R. Locatelli, G. Maruccia and L. Pieralisi “Design

of cost- Efficient Interconnect Processing Units: Spidegon STNoC.CRC Press, Inc.

Boca Raton, FL, USA, 2008.

[31] M. Coppola, R. Locatelli, G. Maruccia, L. Pieralisi and A. Scandurra “Spidergon: A

NoC modeling paradigm. In Proc. 2004 International Symposium on System-on-

Chip,page 15, November 2004.

160

[32] L. Bononi and N. Concer “Simulation and analysis of network on chip architectures:

ring, spidegon and 2d mesh”, In Design, Automation and Test in Europe

(DATE’06), pages 154-159, 2006.

[33] L. Bononi, N. Concer, M. Grammatikakis, M. Coppola and R. Locatelli “ NoC

topologies exploration based on mapping and simulation models” In DSD’ 07:

Proceeding of the 10th Euromicro Conference on Digital System Design Architectures,

Methods and Tools, pages 543-546, 2007.

[34] Michael Kistler, Michael Perrone, and Fabrizio Petrini. “Cell multiprocessor

communication network :Built for Speed.” IEEE MICRO, 26(3): 10-23, May 2006.

[35] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep

Dubey, StephenJunkins, Adam Lake, Jeremy Sugerman, Robeert Cavin, Roger

Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan “Larrabee: A many-core x86

architecture for visual computing” ACM Transactions on Graphics, 27, August

2008.

[36] David Wentzlaff, Patric Griffin, Henry Hoffman, Liewei Bao, Bruce Edwards, Carl

Ramey, Matthew Mattina, Chyi- Chang Miao, John Brown III, and Anant Agarwal.

“On-Chip interconnection architecture of the Tile Processor” IEEE Micro, 27(5): 15-

31, 2007.

[37] Ge Fen, Wu Ning, Wang Qi “Simulation and Performance Evaluation for Network on

Chip Design using OPNET” IEEE region 10 conference TENCON 2007 Oct-Nov

2007.

161

[38] Abdul Ansari, Mohammad Ansari and Mohammad Khan “ Performance Evaluation of

Various Parameters of Network- on- Chip (NoC) for Different Topologies” IEEE

conference 2015.

[39] Partha pratim Pande, Cristian Grecu, Andre Ivanov, Res Saleh, “ Design of a Switch

fornetwork on Chip Applications” Proceedings of the 2003 International Symposium

on Circuits and Systems ISCAS’03 vol. 5 pg 217-220 2003.

[40] Jose Duato, Sudhakar Yalamanchili and Ni Lionel “Interconnection Network: An

Engineering Approach”, Morgan Kaufmann Publishers Inc., San Fransisco, CA,

USA, 2002

[41] N.E. Jergeer and L.S. Peh “On- chip Networks, Morgan, New York, 2009.

[42] Shashi Kumar, Axel Jantsch, Juha-Pekka Soininen, Martti Forsell, Mikael Millberg,

Johny Obergl, Kari Tiensyrja and Ahmed Hemani, “ A Network on Chip Architecture

and Design Methodology”, IEEE Computer Society Annul Symposium on VLSI

(ISVLSI), pp 105-112, 2002

[43] Partha Pratim Pande, Cristian Grecu, Andrew Ivanov, Res Saleh, “Design of Switch

for Network-on-Chip Applications” Proceedings of the 2003 International

Symposium on Circuits and Systems, ISCAS 2003

[44] P.P.Pande, Cristian Grecu, Michael Jones, Andre Ivanov and Resve Saleh “

Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect

Architectures” IEEE Transactions on Computers, vol. 54 no 8 August 2006

162

[45] Abdul Quaiyum Ansari, Mohammad Rashid Ansari and Mohammad Ayoub Khan”

Performance Evaluation of various Parameters of Network-on-chip (NoC) for different

topologies-978-1-4673-6540, June 2015 IEEE.

[46] Bharat Phanibhushana and Sandip Kundu “Network-on-Chip design for Heterogenous

Multiprocessor System-on-Chip” IEEE Computer Society Annual Symposium on

VLSI 2014.

[47] Mahendra Gaikwad and Rajendra Patrikar “Perfect Difference Network for Network-

on-Chip Architecture” International Journal of Computer Science and Network

Security, vol.9 No.12 December 2009

[48] Chia-Hsin OwenChen, Niket Agarwal,Tushar Krishna, Kyung-Hoae Koo, Li-Shiuan

Peh, Krisna C. Saraswat “ Physical vs. Virtual Express Topologies with Low-Swing

Links for Future Many-core NoCs” fourth ACM/IEEE International Symposium on

Network-on-Chip” 2010

[49] Ying-Cherng Lan, Hsiao-An Lo, Yu Hen Hu and Sao-Jie Chen “ A Bidirectional NoC

(BiNoC) Architecture with Dynamic Self-Reconfigurable Channel” IEEE

Transactions on computer –aided design of integrated circuits and systems, vol. 30

No.3, march 2011

[50] Snaider Carrillo, Jim Harkin, Liam j. McDaid, Fearghal Morgan,Sandeep Pande,

Seamus Cawley and Brian McGinley “ Scalable Hierarchical Network-on-Chip

Architecture for Spiking Neural Network Hardware Implementations” IEEE

Transactions on Parallel and Distributed systems, vol. 24, No. 12,December 2013

163

[51] Md. Hasan Furhad and John-Myyon Kim “An Extended Diagonal Mesh Topology

for Network-on-Chip Architectures” International Journal of Multimedia and

Ubiquitous Engineering, vol. 10,No.10, 2015

[52] Wu Chang, Li Yubai and Chai Song “ Design and Simulation of a Torus topology for

network on chip chip” Journal of Systems Engineering and Electronics, vol. 19,No. 4,

2008

[53] Jawwad Latif,Hassan Nazeer Chadhry, Sadia Azam, Naveed Khan baloch “Design

Trade off and Performance Analysis of Router Architectures in Network-on-

Chip” 2nd International workshop on Design and Performance of Networks on Chip

(DPNoC 2015) Procedia Computer Science 56 (2015) 421-426

[54] Umamaheswari S., Meganathan D and Raja Paul Perinbam J “Runtime buffer

management to improve the performance of irregulat Network-on-Chip architecture”

© Indian Academy of Sciences (Sadhana) vol. 40, part 4, june 2015 pp.1117-1137

[55] Anbu chozhan. P,D. Muralidharan and R. Muthaiah “ Performance enhanced router

design for network on chip” International journal of Engineering and

Technology(IJET), vol. 5 No.2 April-May 2013

[56] Qing Sun, Lijun Zhang and Anding Du “Design and Implementation of the Wormhole

Virtual Channel NoC Router” 4th International Conference on Computer Science and

Network Technology (ICCSNT 2015)

164

[57] Rohit Sunkam Ramanujam, Vassos Soterious, Bill Lin and Li-Shiuan “ Design of a

High Throughput Distributed Shared-Buffer NoC Router” Fourth ACM/IEEE

International Symposium on Network-on-Chip 2010

[58] Weiwei Fu, Jingcheng Shao, Bin Xie , Tianzhou Chen and Li Liu “ Design of a High-

Throughput NoC Router with Neighbor Flow Regulation” 14th International

Conference on High Performance and Communications. IEEE 2012

[59] YanivBen-Itzhak,Israel Cidon, Avinoam Kolodny, Michael Shabun and Nir Shmuel

“Heterogeneous NoC Router Architecture” IEEE Transactions on Parallel and

Distributed Systems Vol. 26,No. 9. Sept. 1 2015

[60] Ahmed Bin Dammas, Adel Soudani and Abdullah Al-Dhelaan “ Design of NoC router

with flow control mechanism for congestion avoidance” IEEE conference 22-24

June 2013.

[61] Bidhas Ghoshal, Kanchan Manna, Santanu Chattopadhyay and Indrani Sengupta “ In-

Field Test for Permanent Faults Buffers of NoC Routers” IEEE Transactions on

Very Large Scale Integration (VLSI) systems vol.24 No.1 2016.

[62] Wooyoung Jang and David Z. Pan “SDRAM-Aware Router for Network-on-Chip”

IEEE Transactions on Computer –Aided Design of Integrated Circuits and Systems,

vol.29 No. 10 October 2010

[63] Gaoming Du, Jing He, Yukun Song, Duoli Zhang and Huajie Wu “ comparison of

NoC Routing Algorithms Based on Packet-circuit Switching” Third International

Conference on Information Science and Technology, March 23-25 2013.

165

[64] Ebrahim Behrouzian-Nezhad and Ahmad Khademzadeh “BIOS: A New Efficient

Routing Algorithm for Network on Chip” Contemporary Engineering Sciences, vol.2,

No.1 2009

[65] Andrew DeOrio, David Fick, Valeria Bertacco, Dennis Sylvester, David Blaauw, Jin

Hu and Gregory Chen “ A Reliable Routing Architecture and Agorithm for

NoCs” IEEE Transactions on Computer-Aided Design of Integrated Circuits and

Systems. Vol. 31, No. 5 May 2012.

[66] Wei Ni and Zhenwei Liu “A routing algorithm for Network on Chip with self-similar

traffic” 11th International Conference on ASIC (ASICON) 2015.

[67] Jan Moritz Joseph, Christopher Blochwitz and Thilo Pionteck “Adaptive allocation of

default router paths in Network on Chip for latency reduction” International

Conference on High Performance Computing & Simulation (HPCS) 2016.

[68] K.Tatas and C. Chrysostomou “Adaptive Network on Chip Routing with Fuzzy Logic

Control” Euromicro Conference on Digital System Design (DSD) 2016.

[69] Salih Bayar and Arda Yurdakul “ An Efficient Mapping Algorithm on 2-D Mesh

Network-on-Chip with Reconfigurable Switches” 11th International Conference on

Design & Technology of Integrated Systems in Nanoscale Era (DTIS) 2016.

[70] Cisse Ahmadou Dit Adi, Ping Qiu, Hidetsugu Irie, Takefumi Miyoshi and Tsutomu

Yoshinaga “ OREX: An Optical Ring with Electrical Crossbar Hybrid Photonic

Network –on- Chip” International Workshop on Innovative Architecture for Future

Generation High-Performance Processors and Systems 2010

166

[71] Xianfang Tan, Mei Yang, Lei Zhang, Yingtao Jiang, Jianyi Yang “ A Generic Optical

Router DesignPhotonic Network-on-Chips” Journal of Lightwave Technology

vol.30,No 3, February 1, 2012.

[72] Parisa Khadem Hamedani, Natalie Enright Jerger and Shaahin Hessabi “QuT: A Low-

Power Optical Network-on-Chip” Eighth IEEE/ACM International Symposium on

Network-on-Chip (NoCS) 2014.

[73] Xiaolu Wang, Huaxi Gu, Yintang AYang, Kun Wang and Qinfen Hao “A Highly

Scalable Optical Network-on-Chip with Small Network Diameter and Deadlock

Freeedom” IEEE Transactions on Very Large Scale Integration (VLSI) Systems,

vol.24 No. 12 2016.

[74] Xiuhua Li, Huaxi Gu, Ke Chen, Liang Song and Qingfen Hao “STorus: A New

Tolpology for Optical Network-on-Chip” Optical Switching and Networking, vol.22

November 2016.

[75] Somayyeh Koohi and Shaahin Hessabi “Scalable architecture for a contention-free

optical network-on-chip” Journal of Parallel and Distributed Computing vol. 72, No

11 November 2012.

[76] Lei Zhang, Mei Yang, Yingtao Jiang, Emma Regentova “ Architectures and routing

schemes for optical network-on-chips” Computers and Electrical Engineering 2009

[77] Zhiliang Qian, Da-Cheng Juan, Paul Bogdan, Chi-Ying Tsui, Diana Marculescu and

Radu Marculescu, “A Comprehensive and Accurate Latency Model for Network-on-

Chip Performance Analysis”, IEEE/ACM ASP-DAC’14, 22 Jan., 2014, Singapore.

167

[78] James Hurt, Andrew May, Xiaohan Zhu, and Bill Lin, “Design and implementation of

high-speed symmetric crossbar schedulers,” Proc. ICC’99, Vancouver, Canada, June

1999, S37-6.

[79] Introduction to Structured VLSI Design by Joachim Rodriguz

[80] https://www.altera.com/content/dam/altera; www/global/en_US/pdfs/

literature/manual/intro_to_quartus2.pdf

[81] Bouraoui Chemli and Abdelkrim Zitouni “Design of a NetworkonChip router based on

turn model”16th international conference on Sciences and Techniques of Automatic

control & computer engineering- STA'2015,Monastir,Tunisia,December 1-

23,2015.

[82] Hung K. Nguyen and Xuan-Tu “Design and Implementation of a Hybrid Switching

Router for the Reconfigurable Network-on-Chip” International Conference on

Advanced Technologies for Communications (ATC) 2016.

[83] Amit Bhanwala, Mayank Kumar and Yogendera Kumar “FPGA based Design of Low

Power Reconfigurable Router for Network on Chip (NoC)”International Conference

on Computing, Communication and Automation (ICCCA2015) IEEE -2015.

[84] Avalon Interface Specifications: http://www.altera.com/mnl_avalon_spec.pdf

[85] Zuo Zhen, Tang Guilin , Dong Zhi , Huang Zhiping ,”Design and realization of the

hardware platform based on the Nios soft-core processor” 8th International

Conference on Electronic Measurement and Instruments, ICEMI '07, Proc.

IEEE,pg.865-869,July2007

168

[86] Alcalde, Ortmann, M.S, Mussa, S.A. ,“NIOS II Processor Implemented in FPGA: An

Application on Control of a PFC Converter” Conference on Power Electronics

Specialists , PESC 2008.Proc. IEEE , pg.4446-4451,June 2008.

[87] Wang Ziting ,Guo Haili ,Sun Yan ,“Design of VGA Image Controller Based on SOPC

Technology” International Conference on New Trends in Information and Service

Science, NISS '09, Proc. IEEE,pg.825-827 , July 2009.

[88] Lin Fei-yu , Jiao Xi-xiang , Guo Yu-Hui , Zhang Jian-chuan , Qiao Wei-ming , Jing

Lan , Wang Yan-Yu , Ma Yun-hai ,“System On Programmable Chip Development

System” Second International Workshop on Education Technology and Computer

Science (ETCS), proc. IEEE , Volume 1,pg.471-474, March2010.

[89] First time designer guide http://www.altera.com/literature/hb/nios2/edh_ed51001.pdf

[90] QuartusIIVersion 7.2 Handbook Volume 4: SOPC Builder:

http://www.cs.columbia.edu/~sedwards/classes/2008/4840/qts_qii5v4.pdf

[91] NIOS II integrated development environment. [Online]. Available:

http://www.altera.com/

[92] M. Jones and G. Gopalakrishnan”Verifying Transaction Ordering Properties in

Unbounded Bus Networks through Combined Deductive/ Algorithmic Methods” In

Proceedings of the Third International Conference on Formal Methods in

Computer-Aided Design, pages 505- 519. Springer-Verlag, November 2000.

[93] W. 1. Dally and C. L. Seitz “Deadlock-Free Message Routing in Multiprocessor

Interconnection Networks” IEEE Transactions on Computers,C-36(5):547-553, May

1987.

169

[94] X. Lin, P. K. McKinley, and L. M. Ni “ Deadlock-free multicast wormhole routing in

2-D mesh multicomputers” IEEE Transactions on Parallel and Distributed Systems,

5(8):793-804, August 1994.

[95] E. M. Clarke, O. Grumberg, and D. A. Peled.Model Checking.MIT Press, 1999.

[96] D. Borrione, A. Helmy, L. Pierre, and 1. Schmaltz “A Formal Approach to the

Verification of Networks on Chip” EURASIP Journal on Embedded Systems,

2009(548324):1-14, February 2009.

[97] 1. Schmaltz and D. Borrione “Towards a Formal Theory of On Chip Communications

in the ACL2 Logic” In Proceedings of the Sixth International Workshop on the

ACL2 Theorem Prover and ItsApplication, pages 47-56. ACM, August 2006.

[98] Salaun, W. Serwe, Y. Thonnart, and P. Vivet “Formal Verification of CHP

Specifications with CADP Illustration on an Asynchronous Network-on-Chip” In

Proceedings of International Symposium on Asynchronous Circuits and Systems,

pages 73-82. IEEE Computer Society Press, March 2007.

[99] Yean-Ru Chen ,Wan-Ting SU, Pao-Ann Hsiung, Ying-Chemg Lan , Yu-Hen Hu and

Sao-Jie Chen “Formal Modeling and Verification for Network-on-Chip” IEEE

InternationalConference on Green Circuits and Systems 2010

[100] Anam Zaman and Osman Hasan “Formalverification of circuit-switchedNetwork on

chip (NoC) architecturesusingSPIN “ IEEE International Symposium on System-

on-Chip (SoC) 2014

170

[101] Arnab Banerjee, Pascal T. Wolkotte, Robert D. Mullins,Simon W. Moore, and Gerard

J. M. Smit “An Energy and Performance Exploration of Network-on-Chip

Architectures” IEEE Transactions on VLSI systems, vol. 17. No.3, 319-March 2009

[102] Naoya Onizawa et al,“High-Throughput Compact Delay-Insensitive Asynchronous

NoC Router”, IEEE Transactions On Computers, Digital Object Identifier

10.1109/TC.2013.81,0018-9340/13/$31.00 © 2013 IEEE.

[103] Huaxi Gu, Kwai Hung Mo, Jiang Xu, Wei Zhang “A Low-power Low-cost Optical

Router for Optical Networks-on-Chip in Multiprocessor Systems-on-Chip” IEEE

Computer Society Annual Symposium on VLSI 2009

[104] Reetuparna Das, Onur Mutlu, Thomas Moscibroda, Chita R. Das, “Aérgia: Exploiting

Packet Latency Slack in On-Chip Networks”, ISCA’10, June 19–23, 2010, Saint-

Malo, France. Copyright

[105] Sunghyun Park, Tushar Krishna, Chia-Hsin Owen Chen, Bhavya Daya, Anantha P.

Chandrakasan, Li-Shiuan Peh, “Approaching the Theoretical Limits of a Mesh NoC

with a 16-Node Chip Prototype in 45nm SOI”, DAC 2012, June 3-7, 2012, San

Francisco, California, USA. Copyright 2012 ACM ACM 978-1-4503-1199-

1/12/02010 ACM 978-1-4503-0053-7/10/06.

[106] J. Flich, P. lopez, M. P. Malumbers, J. Duato, “Improving the Performance of Regular

Networks with Source Routing”, Proceeding of the IEEE International Conference on

Parallel Processing. 21-24 Aug. 2000. Page(s) 353-361.

171

[107] C. Marcon et al., “Tiny–optimised 3D mesh NoC for area and latency minimization”,

Electronics Letters, 30th January 2014 Vol. 50 No. 3 pp. 165–166.

[108] Mukund Ramakrishna et al, “GCA:Global Congestion Awareness for Load Balance in

Networks-on-Chip”,IEEE Transactions On Parallel And Distributed Systems (TPDS),

DOI10.1109/TPDS.2015.2477840.

[109] Jie Chen et al, “Performance Evaluation of Three Network-on-Chip(NoC)

Architectures (Invited)”, IEEE International Conference on Communications in

China: Communications QoS and Reliability (CQR), 978-1-4673-2815-9/12/$31.00

©2012 IEEE.

[110] Zvika Guz et al, “Network Delays and Link Capacities in Application-Specific

Wormhole NoCs”, Hindawi Publishing Corporation, VLSI Design ,Volume 2007,

Article ID 90941,15 pages, Doi:10.1155/2007/90941.

[111] Junbok You et al, “Bandwidth Optimization in Asynchronous NoCs by Customizing

Link Wire Length”,ICCD 2010.

[112] Brett Stanley Feero and Partha Pratim Pande “Network-on-Chip in a Three-

dimensional Environment: A Performance Evaluation” IEEE Transactions on

Computer, vol.58, No. 1 January 2009

[113] Awet Yemane Weldezion, Matt Grange, Dinesh Pamunuwa, Zhonghai Lu, Axel

Jantsch, Roshan Weerasekera and Hannu Tenhunen “Scalability of network-on-chip

communication architecture for 3D meshes” 3rd ACM/IEEE International Symposium

on Network-on-Chip 2009

172

[114] Amir Mohammad Rahmani, Kameswar Rao Vaddina, Khalid Latif, Pasi Liljeberg,

Juha Plosila and Hannu Tenhunen “High-Performance and Fault-Tolerant 3D NoC-

Bus Hybrid.Architecture using ARB-NET based Adaptive Monitoring Platform” IEEE

Transactions onComputers Vol.63, No.3, 2014

[115] Seyyed Hossein Seyyedaghaei Rezaei, Abbas Mazloumi, Mehdi Modarressi and

Pejman Lotfi-Kamran “Dynamic Resource Sharing for High-Performance 3-D

Network-on-Chip” IEEE Computer Architecture Letters vol.15, No.1, 2016

[116] Randy W. Morris Jr, Avinash Karanth Kodi, Ahmed Louri and Ralph D. Whaley Jr

“Three-Dimensional Stacked Nanophotonic Network-on-Chip Architecture with

Minimal Reconfiguration” IEEE Transactions on Computers, vol. 63, Co.1, January

2014.

[117] Yaoyao Ye, Jiang Xu, Baihan Huang, Xiaowen Wu, Wei Zhang, Xuan Wang, Mahdi

Nikdast, Zhehui Wang, Weichen Liu and Zhe Wang “3-D Mesh-Based Optical

Network-on-Chip for Multiprocessor System-on-Chip” IEEE Transactions on

Computer-Aided design og Integrated Circuits and Systems, vol. 32, No.4 April 2014.

[118] Akram Ben Ahmed and Abderazek Ben Abdallah “Architecture and design of high-

throughput, low-latency, and fault-tolerant routing algorithm for 3D-network-on-chip

(3D- NoC)” The Journal of Supercomputing , vol. 66, No. 3 December 2013.

[119] Sara Akbari, Ali Shafiee, Mahmoud Fathy and Reza Berangi “AFRA: A low cost high

performance reliable routing for 3D mesh NoCs” IEEE conference on Design,

Automation & Test in Europe Conference & Exhibition 2012.

173

[120] Yasuhiro Take et al, “3D NoC with Inductive-Coupling Links for Building-Block

SiPs”, IEEE Transactions On Computers, Vol. 63, No. 3, March 2014.

[121] KrzanichB., CES 2014, Keynote. http://www.intel.com/ content/www/us/en/ events/

intel-ces-keynote.html.