NETWORK ON CHIP (NoC) PLATFORM FOR
HIGH PERFORMANCE COMPUTING
Thesis submitted to Goa University
for the award of the degree of
Doctor of Philosophy
In
Electronics
By
Udaysing Vithalrao Rane
Supervisor
Dr. R. S. Gad
Department of Electronics,
Goa University, Goa-403206
January 2017
CERTIFICATE
This is to certify that the thesis entitled “Network on Chip (NoC) Platform for High Performance
Computing ", submitted by Mr. Udaysing Vithalrao Rane, for the award of the degree of Doctor
of Philosophy in Electronics, is based on his original and independent work carried out by him
during the period of study, under my supervision. The thesis or any part thereof has not been
previously submitted for any other degree or diploma in any University or institute.
Place: Goa University (R. S. Gad)
Date: Research Guide
STATEMENT
I state that the present thesis “Network on Chip (NoC) Platform for High Performance
Computing " is my original contribution and the same has not been submitted on any occasion
for any other degree or diploma of this University or any other University / Institute. To the best
of my knowledge, the present study is the first comprehensive work of its kind in the area
mentioned. The literature related to the problem investigated has been cited. Due
acknowledgements have been made wherever facilities and suggestions have been availed of.
Place: Goa University (Udaysing V. Rane)
Date: Candidate
Dedicated with love to my parents
Shri. Vithalrao Daulatrao Rane and Smt. Vijaya Vithalrao Rane
My wife
Mrs. Aparna U. Rane
& My children
Master Arush U. Rane & Master Aryan U. Rane
ACKNOWLEDGEMENTS
With deep regards and profound respect, I avail this opportunity to express my deep sense of gratitude and indebtedness to my supervisor, Dr. Rajendra S. Gad, Associate Professor, Department of Electronics, Goa University, for his endless support, encouragement, inspiration, ideas, comments, criticism, endless discussions and articles during our interactions. He has also helped me broaden my perspective on many things in life. I express a deep sense of gratitude to Dr. Gourish M. Naik, Professor and Head, Department of Electronics, Goa University, for his valuable guidance, constant encouragement, suggestions, advice and ideas.
I acknowledge the technical comments and assistance received from Dr. Y. S. Valaulikar, Member FRC, Department of Mathematics, and thank Prof. A. Salkar, Dean, Faculty of Natural Science, and Prof. J.A.E. Desa, Former Dean, Faculty of Natural Science, for their valuable suggestions and timely advice during the faculty research meetings.
I am very grateful to Dr. D. B. Arolkar, Principal, D.M.'s College, Assagao, Goa, and Shri Shrikrishna Pokle, Chairman, Dnyanprassarak Mandal's, Mapusa, Goa, for their encouragement and support in pursuing this research.
My sincere thanks to Dr. Jyoti Pawar, Associate Professor, Department of Computer Science and Technology, Goa University, Dr. Jivan S. Parab, Assistant Professor, Department of Electronics, Dr. Vinaya R. Gad, Associate Professor and Head, Department of Computer Science, G.V.M.'s College, Ponda, Dr. Prakash Pareinkar, Head, Department of Konkani, Goa University, and Dr. Sulaxana Raikar, Assistant Professor, Department of Computer Science, G.V.M.'s College, Ponda, for their precious advice that greatly helped me in developing this thesis.
I acknowledge Dr. Vithal Tilvi (Arizona State University, USA) for his moral support, constant encouragement, discussions and arguments over telephonic conversations and whenever he was down in India. I also thank Mr. Saish Amonkar, Technical Head, Intel Bangalore, for his time over the technical discussions we had.
Special thanks to all the research scholars of our department, Mr. Narayan Vetrekar, Dr. Ingrid, Shaila Ghanti, Noel, Aniket, Charan, Yogini and Marlan, for endless deliberations regarding technical writing skills, and to Mr. William and Mr. Agnelo Lopez for constantly assisting me with various administrative work.
I am indebted to the funding agency ALTERA Inc., USA, for setting up the System on Chip (SoC) laboratory at Goa University.
I enjoyed all the moments I experienced during my research work, which transformed me into my present form, and I am sure this exposure has changed my perception of life. I thank all my colleagues at D.M.'s College and Research Centre, Assagao, Goa, who have greatly contributed to an inspiring working and social environment.
I express a deep sense of gratitude to my wife Aparna, my sister Sharavati and my brothers Abasaheb and Sanjay for their patience and understanding, as well as for taking on much of the responsibility at home. I shall forever remain indebted to my children, Arush and Aryan; the long hours and holidays that I owed to them were spent on this thesis.
Udaysing V. Rane, January 2017.
TABLE OF CONTENTS
I PREFACE XVI
1. INTRODUCTION
1.1 Introduction 1
1.2 Multi-Processor System-on-Chip (MPSoC) 3
1.3 Network-on-Chip (NoC) 7
1.3.1 NoC architecture 7
1.3.2 Topologies 8
1.3.3 Topologies metrics 13
1.3.4 NoC switching techniques 16
1.3.5 Routing algorithms 18
1.4 Objectives of the thesis and outline 18
2. LITERATURE SURVEY AND STATE OF THE ART IN NoC
2.1 NoC architectures 22
2.2 NoC router architectures 25
2.3 NoC routing algorithms 29
2.4 NoC link interfaces 31
2.5 NoC flow control 34
3. NoC SWITCH/ROUTER ARCHITECTURE DESIGN AND ITS IMPLEMENTATION
3.1 Proposed switch architecture design 36
3.2 Working of switch/router 37
3.2.1 ATM packet format 38
3.2.2 The scheduler 40
3.2.3 Two dimensional ripple carry arbiters 41
3.2.4 Comparison between DPA and RPA 43
3.2.5 The crossbar organisation 44
3.3 Development design flow 44
3.4 Description of switch using hardware descriptive language 46
3.4.1 Switch design synthesis 46
3.4.2 Switch design simulation 48
4. IMPLEMENTATION OF COMPUTING NODE
4.1 Overview of computing node 58
4.2 Task of converting the design into block diagram 58
4.3 Soft core processor architecture 59
4.4 SoC generation using Nios-II soft core 64
4.5 Integrating SoC with switch 66
4.6 Synthesizing and porting the computing node on target device Cyclone-II 69
4.7 Programming and configuring the FPGA device 74
4.8 Validating the design for the application program 75
5. IMPLEMENTATION OF NoC PLATFORM USING MESH TOPOLOGY
5.1 Overview of mesh topology 80
5.2 Design of NoC Mesh topology platform 81
5.3 The 3x2 NoC Mesh Topology verification with RTOS 84
6. PERFORMANCE EVALUATION AND ANALYSIS OF MESH TOPOLOGY
6.1 Performance evaluation and analysis of mesh topology 92
6.2 Simulation environment for m x n mesh topology 93
6.3 Scenario-1 Throughput and delay performance with varying packet size 95
6.4 Scenario-2 Throughput and delay performance with varying queue size for low
load and high load packets 103
6.5 Scenario-3 Throughput and delay performance with varying link bandwidth for
low load and high load packet 110
6.6 Scenario-4 Throughput and delay performance with varying link propagation
delay with low load and high load packets 115
7. CONCLUSION AND FUTURE TRENDS
7.1 Conclusion 121
7.2 Future Trends 132
ANNEXURE
Appendix-I 137
Appendix-II 142
Appendix-III 146
REFERENCES 156
LIST OF TABLES
Table 3.1: Areas and delays for each arbiter design.
Table 3.2: Resources usage on Cyclone II EP2C35F672C6 FPGA device for proposed switch
design.
Table 3.3: Static look up table used for routing.
Table 3.4: Input ATM packets used for simulation and expected output packets.
Table 3.5: Comparison of the proposed router design with other router designs.
Table 4.1: Embedded Soft Core Processor for FPGA.
Table 4.2: Pin assignment for EP2C35F672C6 of Cyclone II family for DE2 board.
Table 4.3: Input ATM packet sent at port one of the switch.
Table 4.4: Part of look up table.
Table 4.5: Output ATM packet expected at port number 3.
Table 6.1: Simulation parameters.
Table 6.2: Various parameter values for scenario 1.
Table 6.3: Throughput and delay performance with varying packet size for 4x4 mesh topology.
Table 6.4: Throughput and delay performance with varying packet size for 4x4 mesh topology.
Table 6.5: Throughput and delay performance with varying packet size for 8x8 mesh topology.
Table 6.6: Throughput and delay performance with varying packet size for 8x8 mesh topology.
Table 6.7: Various parameter values for scenario 2.
Table 6.8: Throughput and delay performance by varying Queue size with low load packets
(512bytes) on 4x4 mesh topology.
Table 6.9: Throughput and delay performance by varying Queue size with high load packets
(64kbytes) on 4x4 mesh topology.
Table 6.10: Throughput and delay performance by varying Queue size with low load packets
(512bytes) on 8x8 mesh topology.
Table 6.11: Throughput and delay performance by varying Queue size for high load packets
(64kbytes) on 8x8 mesh topology.
Table 6.12: Various parameter values for scenario 3.
Table 6.13: Throughput and delay performance by varying link Bandwidth with packet size of
0.512Kbytes on 4x4 mesh topology.
Table 6.14: Throughput and delay performance by varying link Bandwidth with packet size of
64Kbytes on 4x4 mesh topology.
Table 6.15: Throughput and delay performance by varying link Bandwidth with packet size of
0.512Kbytes on 8x8 mesh topology.
Table 6.16: Throughput and delay performance by varying link Bandwidth with packet size of
64Kbytes on 8x8 mesh topology.
Table 6.17: Various parameter values for scenario 4.
Table 6.18: Throughput and delay performance by varying link propagation delay with packet size of 512 bytes on 4x4 mesh topology.
Table 6.19: Throughput and delay performance by varying link propagation delay with packet size 64Kbytes on 4x4 mesh topology.
Table 6.20: Throughput and delay performance by varying link propagation delay with packet
size 512bytes on 8x8 mesh topology.
Table 6.21: Throughput and delay performance by varying link propagation delay with packet
size 64Kbytes on 8x8 mesh topology.
LIST OF FIGURES
Figure 1.1: Growth in processor performance
Figure 1.2: Evolution of multi-core system
Figure 1.3: General Architecture of NoC
Figure 1.4: 2D Mesh topology
Figure 1.5: 2D Torus Network
Figure 1.6: Fat Tree Network
Figure 1.7: Butterfly Fat Tree topology
Figure 1.8: SPIN topology
Figure 1.9: Polygon (Hexagon) Topology
Figure 1.10: Spidergon (Octagon) Topology
Figure 1.11: Star Topology
Figure 3.1: Depicts the proposed switch architecture
Figure 3.2: ATM Packet format structure
Figure 3.3: Input module of the proposed switch
Figure 3.4: Look Up Table structure for the proposed switch
Figure 3.5: 4 X 4 two dimensional ripple carry arbiter
Figure 3.6: Embedded development flow
Figure 3.7: Quartus II design flow
Figure 3.8: Compilation report flow summary
Figure 3.9: RTL view of the Switch
Figure 3.10: Input waveform vector file showing input ATM packets
Figure 3.11: Output showing ATM packets received at respective output ports of the switch as
described in LUT
Figure 4.1: Block symbol of the proposed switch design
Figure 4.2: Block diagram of the Nios II processor core with relevant buses
Figure 4.3: Customized soft core processor for switch design
Figure 4.4: Schematic view of the computing node consisting of an SoC and a switch
Figure 4.5: The DE2 board with relevant components (source: ALTERA INC.)
Figure 4.6: Block diagram of the DE2 board with various I/O and communication Interfaces
Figure 4.7: Program flow for verification of computing node on FPGA
Figure 4.8: Output ATM packet at port number 3 of the switch on Nios-II IDE console
Figure 5.1: 4x4 Mesh topology with switch (router) and computing element (core)
Figure 5.2: Close view of 4X4 Mesh topology interconnection platform using Quartus-II
Figure 5.3: Full view of 4X4 Mesh topology interconnection platform using Quartus-II with SoC
and PLL
Figure 5.4: Compilation report for 4 x 4 switch matrix demanding more utilization.
Figure 5.5: Compilation report for 3 x 2 mesh topology utilizing 97% logic elements.
Figure 5.6: 3x2 mesh topology interconnection platform
Figure 5.7: Real Time system priority scheduler for transmitting and receiving the ATM packets
Figure 6.1: Throughput in Kbytes v/s Packet size in Kbytes
Figure 6.2: Delay per packet in ms v/s Packet size in Kbytes
Figure 6.3: Throughput in Kbytes v/s Queue size
Figure 6.4: Delay per packet in ms v/s Queue size
Figure 6.5: Throughput in Kbytes v/s Bandwidth in Mbps
Figure 6.6: Delay per packet in ms v/s Bandwidth in Mbps
Figure 6.7: Throughput in Kbytes v/s Link delay in ms
Figure 6.8: Delay per packet in ms v/s Link delay in ms
KEYWORDS
ASIC: Application Specific Integrated Circuit
ALU: Arithmetic Logic Unit
API: Application Program Interface
AS: Active Serial
ATM: Asynchronous Transfer Mode
BFT: Butterfly Fat Tree
BiNoC: Bidirectional channel Network on Chip
CBR: Constant Bit Rate
CDC: Channel Direction Control
CDLSI: Capacitively Driven Low-Swing Interconnects
CLICHÉ: Chip-Level Integration of Communicating Heterogeneous Elements
CMP: Chip Multi-Processor
CoNoC: Contention-free optical NoC
CPLD: Complex Programmable Logic Device
CPU: Central Processing Unit
DE: Development and Education
DOR: Dimension Ordered Routing
DPA: Diagonal Propagation Arbiter
DPXY: Distance Prediction XY
DSB: Distributed Shared Buffer
DSM: Deep Submicron
DSP: Digital Signal Processors
EDA: Electronic Design Automation
EPC: Express Physical Channels
EVC: Express Virtual Channels
FIFO: First In First Out
FPGA: Field Programmable Gate Array
FT: Fat Tree
FTP: File Transfer Protocol
FWFT: First Word Fall Through
GCA: Global Congestion Awareness
GUI: Graphical User Interface
GWOR: Generic Wavelength routed Optical Router
HAL: Hardware Abstraction Layer
HDL: Hardware Description Language
HNoC: Hierarchical Network-on-Chip
IBR: Input Buffer Router
IC: Integrated Circuit
IDE: Integrated Development Environment
IP: Intellectual Property
IPC: Inter Process Communication
IRQ: Interrupt Request
ISA: Instruction Set Architecture
JTAG: Joint Test Action Group
LAFT: Look Ahead Fault Tolerant
LEDR: Level Encoded Dual Rail
LUT: Look Up Table
MMU: Memory Management Unit
MPSoC: Multi-Processor System-on-Chip
MPU: Memory Protection Unit
MR: Micro-ring Resonator
NFR: Neighbor Flow Regulation
NI: Nanophotonic Interconnect
NI: Network Interface
NoC: Network-on-Chip
OCMP: On-Chip Multilayer Photonic
ONoC: Optical NoCs
PCC: Packet-Connected Circuit
PDN: Perfect Difference Network
PIO: Programmed Input/Output
PLL: Phase Locked Loop
PNoC: Programmable NoC
RAM: Random Access Memory
RCWRON: Recursive Wavelength Routed Optical Network
RISC: Reduced Instruction Set Computer
RPA: Rectilinear Propagation Arbiter
RPNoC: Ring based Packet switched NoC
RTL: Register Transfer Level
SA: Simulated Annealing
SAF: Store and Forward
SDM: Space Division Multiplexing
SDRAM: Synchronous Dynamic Random Access Memory
SGM:State Graph Manipulators
SoC: System-on-Chip
SOPC: System-On-a-Programmable-Chip
SPIN: Scalable, Programmable, Integrated Network
SRAM: Static Random Access Memory
SNN: Spiking Neural Networks
STorus: Screwy Torus
TCP: Transmission Control Protocol
TSV: Through-Silicon-Vias
UART: Universal Asynchronous Receiver/Transmitter
USB: Universal Serial Bus
VCI: Virtual Circuit Identifier
VCT: Virtual Cut-Through
VHDL: Very high-speed integrated circuit Hardware Description Language
WDM: Wavelength Division Multiplexing
WH: Worm-Hole
WRON: Wavelength Routed Optical Network
XDMesh: Extended Diagonal Mesh
PREFACE
The thesis describes the design and implementation of a Network on Chip (NoC) platform. NoC architectures not only provide power-efficient solutions for interconnecting network resources but also enhance network performance while containing implementation cost. The thesis discusses the various building blocks required to design the NoC platform and the different architectures and routing schemes with their pros and cons. The m x n mesh topology based platform, the complete process of its hardware implementation and the reasons for choosing the mesh topology are described in detail.
The thesis initially describes the proposed design of the switch, which is the basic building block of the NoC architecture. The model is described using HDL and synthesized using ALTERA QUARTUS II software for the Cyclone-II FPGA device. The switch component is then tested for its routing functionality using a static look up table configured inside the design. The computing element required to connect to the switch is generated using Altera's Nios-II soft core processor. The thesis describes in detail the process of implementing and testing the computing node for its functionality; the computing node is the switch embedded with the computing element generated using the soft core NIOS-II processor.
The m x n mesh topology platform has been designed, implemented and synthesized on the Cyclone II 2C35 FPGA device using Altera's Development and Education (DE2) board and verified for data communication using the real time operating system µCOS.
The thesis also gives the detailed implementation of 4 x 4 and 8 x 8 mesh topology NoC simulation models and analyses the performance of these models with respect to throughput and packet delay by varying the packet size, link bandwidth, queue size and link delay parameters under FTP and CBR traffic patterns. The results obtained give a fair understanding for deciding on the efficient usage of such platforms for different applications.
Udaysing Vithalrao Rane
LIST OF RESEARCH RELATED COMMUNICATED PAPERS
1. Udaysing V. Rane, V.R. Gad, R.S. Gad, G.M. Naik, ‘Reliable and scalable architecture for Internet of things for sensors using soft-core processor’, International Conference on Internet of Things, Smart Spaces, and Next Generation Networks and Systems, Proc. 15th Int. Conf. NEW2AN 2015 and 8th Conf. ruSMART 2015, St. Petersburg, Russia, published in Lecture Notes in Computer Science, vol. 9247, 2015, pp. 367-382.
2. Vinaya R. Gad, Udaysing V. Rane, Rajendra S. Gad, Gourish M. Naik, ‘Performance Evaluation of Low Density Parity Check (LDPC) Codes Over Gigabit Ethernet Protocol’, Transactions on Networks and Communications (TNC), Volume 4, Issue 5, October 2016.
3. Rajendra S. Gad, Udaysing V. Rane, John Fernandes, Mahesh Salgaonkar, ‘Virtual Extended Memory Symmetric Multiprocessor (SMP) Organization Design Using LC-3 Processor: Case Study of Dual Processor’, International Journal of Engineering Research and Development (IJERD), Volume 10, Issue 3, March 2014, pp. 29-39.
4. Udaysing V. Rane, Rajendra S. Gad, ‘Low power routing: Migration from Electronics to Optical switching’, 8th Annual National Symposium on VLSI and Embedded Systems, VSI Goa, India, 8th March 2014, PCCE College of Engineering, Verna, Goa, India.
5. Udaysing V. Rane, Rajendra S. Gad, ‘Design of NoC Platform with Mesh Topology using Soft Core NIOS-II Processor’, 6th Annual National Symposium on VLSI and Embedded Systems, VSI Goa, India, 22nd March 2012, Goa University, Goa, India.
6. Charan Arur Panem, A. A. Gaonkar, Udaysing V. Rane, A. B. Pandit, R. S. Gad, ‘Sensors Data Fusion Architecture Over MIMO: Case Study of Quad copter’, International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), March 2016, Chennai, India.
7. Udaysing V. Rane, Rajendra S. Gad, ‘Implementation of Switch design on FPGA for Asynchronous Transfer Mode Network’, 5th National Conference on Electronic Technologies (NCET-14), 25th-26th April 2014, GEC, Goa, India.
8. Udaysing V. Rane, Rajendra S. Gad, G.M. Naik, ‘NoC Mesh Topology Platform Synthesis on ALTERA FPGA: Verification using µC/OS Real Time Operating System’, under review with Real-Time Systems: The International Journal of Time-Critical Computing Systems, Springer.
9. Udaysing V. Rane, Charan Arur Panem, Vinaya R. Gad, Rajendra S. Gad, Gourish M. Naik, ‘Performance Study of 2-D Mesh Topologies on Network on Chip and Verification Using Real-Time Operating System on 90nm Technologies’, under review with Microprocessors and Microsystems: Embedded Hardware Design, Elsevier.
1.1 Introduction
The complexity of a semiconductor Integrated Circuit (IC) is measured using the transistor count, i.e. the number of transistors on the chip. Very Large Scale Integration (VLSI) technology, which began in the 1970s, has scaled the transistor size drastically (from 10000nm to 14nm), allowing billions of transistors to fit on an IC.
In 1971 Intel introduced its 4004 processor containing only 2300 transistors on a 12mm2 die in a 10000nm process, and in 2016 it introduced the Broadwell-EP Xeon E5-2600 V4 with up to 7.2 billion transistors packed into a 456mm2 die in a 14nm process [1]. The 14nm process allows Intel to significantly cut down the die size while increasing the transistor count, the core count and the instructions-per-cycle (IPC) performance [2]. Oracle Corporation in October 2015 introduced the 32-core SPARC M7 with 10 billion transistors [3]. At the 2016 GPU Technology Conference in San Jose, NVIDIA announced the new Tesla P100 [4] with 15.3 billion transistors on 16nm technology and a 610mm2 die area.
It can be observed that the transistor count of integrated circuits doubles approximately every two years, as predicted by Moore's law [5]. This increase in the number of transistors is the stepping stone for developing complex semiconductor and communication technologies, since it allows more and more components to be integrated within the same chip area. As a consequence, complex systems that once required many microchips can now be built on a single microchip containing all the logic of the system and the interconnection channels connecting it. At the same time, new real time applications such as audio and video transmission, video conferencing, video on demand, distance education, e-commerce, etc., demand fast computation, higher information transmission rates and minimal delay.
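As a rough illustrative check (a sketch we add here; the doubling period and starting point are the figures quoted above, and the calculation is not part of the cited references), doubling the 1971 transistor count every two years lands in the same range as the 2016 devices mentioned:

# Illustrative sketch: Moore's-law doubling from the Intel 4004 (1971, ~2300
# transistors) to 2016, assuming a doubling period of two years.
start_year, end_year = 1971, 2016
start_count = 2300
doublings = (end_year - start_year) / 2            # 22.5 doublings
projected = start_count * 2 ** doublings
print(f"Projected 2016 transistor count: {projected / 1e9:.1f} billion")
# Prints roughly 13.6 billion, the same order of magnitude as the
# 7.2-15.3 billion transistor devices cited above.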
A survey conducted by SPEC on their benchmark series SPECint, which was designed to measure the single-threaded integer performance of a computing machine, is shown in figure 1.1. The results are grouped by CPU brand and consist of 5052 test results from 715 different CPU models; they show that the growth in uniprocessor performance slowed markedly after 2004 [6]. Thus the combined pressures of ever-increasing power consumption and the diminishing returns in performance of uniprocessor architectures have led to the advent of multicore chips (figure 1.2) [7]. For example, the Intel SCC [8] contains 48 cores, the Tilera Tile64 has 64 cores [9], and the experimental Intel Polaris chip incorporates 80 cores [10]. Consequently, embedded systems have moved to Multi-Processor System-on-Chip (MPSoC) designs, and high performance computer architectures have evolved into Chip Multi-Processor (CMP) platforms.
1.2 Multi-Processor System-on-Chip (MPSoC)
The Multi-Processor System-on-Chip (MPSoC) is a System-on-Chip (SoC) which incorporates multiple heterogeneous processing cores, a memory hierarchy and I/O components on a single die, usually targeted at embedded applications. These architectures meet the performance needs of multimedia applications, telecommunication architectures, network protection and security, and other application domains while limiting the power consumption through the use of specialized processing elements and architectures [11]. All these components are linked to each other by an on-chip interconnect.
Figure 1.1: Growth in processor performance (source: original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten; dotted line extrapolations by Moore)
Figure 1.2: Evolution of multi-core systems
These on-chip interconnects usually consist of conventional bus and crossbar structures. The bus interconnect is simply a set of wires interconnecting all Intellectual Property (IP) cores, combined with an arbiter that manages access to the bus. System-level latency and bandwidth constraints led to a natural evolution towards multi-tiered bus architectures, typically consisting of a high-performance low-latency processor bus, a high-bandwidth memory bus and a peripheral bus. Examples of bus-based architectures are ARM™ AMBA [12], OpenCores' WISHBONE SoC interconnection [13], and IBM CoreConnect™ [14-15].
However, there are several problems associated with standard bus architectures [16]. First, a global bus presents a large capacitive load to the bus drivers, which in turn implies large delays and high power consumption. Second, the performance of a shared bus architecture is inherently not scalable, as there can be at most one transaction over the shared bus at any point in time; moreover, the bus performance is degraded whenever a slow device is accessing the bus. Some sophisticated modern bus architectures address these problems through the concept of bus hierarchy and separation, but such a temporary solution falls short when hundreds or even thousands of processors have to be integrated on a single chip [17]. Third, in the Deep Submicron (DSM) era, the design of long and wide buses becomes a real challenge: while physical information is extremely important for successful bus design, the environment in which the bus is embedded is very hard to predict and characterize early in the design stages due to crosstalk. With increasing IC performance requirements, designers started to implement crossbar structures, which improve latency predictability and significantly increase aggregate bandwidth, at the cost of a larger number of wires.
The advent of the SoC, incorporating tens to hundreds of IP cores, created a significant integration challenge. The buses and crossbars described above are 'coupled' solutions: the interfaces of all IP cores connected to a single bus or crossbar must all be exactly the same, both logically (signals) and in physical design parameters (clock frequency, timing margins). This turned out to be a significant obstacle to rapid integration and re-use of existing IP into increasingly complex SoCs. Systems-on-chip can be found in many product categories ranging from consumer devices to industrial systems [18]:
• Television production equipment uses systems-on-chips to encode/decode video.
Encoding high-definition video in real time requires extremely high computation rates.
• Digital televisions and set-top boxes use sophisticated multiprocessors to perform real-
time video and audio decoding and user interface functions.
• Telecommunications and networking use specialized systems-on-chips, such as network processors, to handle the huge data rates and the error detection and correction demanded by modern transmission equipment.
• Cell phones use several programmable processors to handle the signal processing and
communication protocol tasks required by telephony. These architectures must be
designed to operate at the very low-power levels provided by batteries.
• Video games use several complex parallel processing machines to render gaming
action in real time.
The scalability and success of switch-based networks and packet-based communication in parallel computing and the Internet have inspired researchers to propose the Network-on-Chip (NoC) architecture as a viable solution to complex on-chip communication problems [19].
1.3 Network-on-Chip (NoC)
For a complex on-chip computing platform, an efficient way to manage the communication among the various on-chip resources becomes critically important. The communicating resources traditionally use bus and crossbar architectures to communicate with each other. The problems with a bus-based communication strategy are that the buses occupy a large wire area, that scalability suffers as the number of resources grows since communication is serialized, and that arbitration for the shared medium, together with the large fan-out, can impose significant latency [20]. A crossbar architecture eliminates serialization, but its area and power costs increase quadratically with the number of network endpoints [21]. To address these SoC design challenges, the Network-on-Chip has been proposed as an alternative approach to designing the communication subsystem between on-chip resources. NoC proposes networks as a scalable, reusable and global communication architecture.
1.3.1 NoC Architecture
Figure 1.3 shows the general architecture of a NoC. The major components in a NoC are routers (R), Network Interfaces (NI), IP cores (processors, memory, sub-systems, etc.) and links [22]. A router is the communication backbone of a NoC system; it undertakes the crucial task of steering and coordinating the data flow and should be designed for maximum efficiency and throughput. Links are the sets of wires used to physically connect the routers and enable communication in the network. The NI makes the logical connection between an IP core and the network. The IP cores can be Digital Signal Processors (DSPs), embedded memory blocks, Central Processing Units (CPUs), video processors, and so on.
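As an illustrative aside (our own sketch, not taken from the referenced architecture), the relationship between these components can be expressed as a small data structure: each router serves one IP core through a network interface and is joined to neighbouring routers by links.

# Illustrative sketch only: the NoC building blocks named above (router,
# network interface, IP core, link) modelled as plain Python objects.
from dataclasses import dataclass, field
from typing import List

@dataclass
class IPCore:
    name: str                      # e.g. a CPU, DSP or embedded memory block

@dataclass
class NetworkInterface:
    core: IPCore                   # logical connection between core and network

@dataclass
class Router:
    rid: int
    ni: NetworkInterface           # local core attached through its NI
    links: List[int] = field(default_factory=list)  # ids of neighbouring routers

# Two routers joined by one bidirectional link, each serving one IP core.
r0 = Router(0, NetworkInterface(IPCore("cpu0")), links=[1])
r1 = Router(1, NetworkInterface(IPCore("dsp0")), links=[0])
print(r0.rid, "<->", r1.rid)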
Figure 1.3: General Architecture of NoC
1.3.2 Topologies
Topology refers to the physical layout of, and the connections between, the nodes and links in the network. The performance of the network depends directly on the topology: it determines the number of routers a message must traverse as well as the interconnect lengths between routers, and thus significantly influences the network latency. The network energy consumption also depends on the topology, since the routers and links consume energy as messages traverse them. Further, the topology dictates the total number of alternative paths between routers, affecting how smoothly traffic can be distributed and hence how well bandwidth requirements are supported. The commonly used on-chip topologies are explained in detail below.
1.3.2.1 2D Mesh Topology
The mesh-based interconnect architecture called CLICHÉ (Chip-Level Integration of Communicating Heterogeneous Elements) was proposed by Kumar et al. [23]. The architecture consists of an m x n mesh of switches interconnecting computational resources (IPs) placed alongside the switches, as shown in figure 1.4. Each switch is connected through communication channels to one IP block and to four adjacent switches, except for the switches placed at the edges, which have fewer neighbours. This topology is widely used in many NoC based multicore prototypes, e.g. MIT's 16-tile RAW chip [24] and Intel's 80-tile TFLOPS chip [25]. The major advantages of the mesh are its simplicity, very regular layout and scalability.
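The switch-to-switch connectivity of such an m x n mesh can be enumerated with a few lines of code; the sketch below is only our illustration (it is not the HDL description used later in the thesis): interior switches get four neighbours, edge and corner switches fewer, and every switch additionally serves one local IP block.

# Illustrative sketch: neighbour lists for the switches of an m x n 2D mesh.
# Switch (r, c) connects to (r-1, c), (r+1, c), (r, c-1) and (r, c+1) whenever
# those coordinates exist, plus one local IP block (not listed here).
def mesh_neighbours(m, n):
    adj = {}
    for r in range(m):
        for c in range(n):
            adj[(r, c)] = [(r + dr, c + dc)
                           for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                           if 0 <= r + dr < m and 0 <= c + dc < n]
    return adj

adj = mesh_neighbours(4, 4)
print(len(adj[(1, 1)]))   # 4 neighbours for an interior switch
print(len(adj[(0, 0)]))   # 2 neighbours for a corner switch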
Figure 1.4: 2D Mesh topology
1.3.2.2 2D Torus Topology
Figure 1.5: 2D Torus Network
The 2D Torus architecture (figure 1.5) was proposed by Dally and Towles [26]. It is the same as a regular mesh; the only difference is that the switches at the edges are connected to the switches on the opposite edge through wrap-around channels, as shown in figure 1.5.
1.3.2.3 Star Topology
Figure 1.6: Star Topology
In a star topology network, the computational resources are connected in a star fashion to a central router, as shown in figure 1.6. The capacity requirements of the central router are quite large, because all the traffic between the spokes goes through it. This creates a considerable risk of congestion in the middle of the star.
1.3.2.4 Fat Tree Topology
This topology uses a tree structure where the nodes are routers and the leaves are computational resources, as shown in figure 1.7. The routers above a leaf are called the leaf's ancestors, and correspondingly the leaves below an ancestor are its children. In a fat tree topology each node has replicated ancestors, which means that there are many alternative routes between nodes.
Figure 1.7: Fat Tree Network
1.3.2.5 Butterfly Fat Tree Topology
Researchers C. Grecu, P. P. Pande, A. Ivanov and R. Saleh [27] proposed the Butterfly Fat Tree architecture shown in figure 1.8. In this topology the computational resources are placed at the leaves of the tree and the switches are placed at the vertices. The architecture may be unidirectional or bidirectional. Each node in the topology is labelled with a pair of coordinates (l, p), where l is the level and p is the position within that level. In general, at level 0 (the lowest level) there are N computational resources with addresses ranging from 0 to N-1. Each switch in the Butterfly Fat Tree architecture has two parent ports and four child ports.
Figure 1.8: Butterfly Fat Tree topology
1.3.2.6 SPIN Topology
Guerrier and Greiner [28] proposed a generic interconnect template called SPIN (Scalable, Programmable, Integrated Network) that uses a fat tree based architecture to connect computational resources for on-chip packet switched interconnection. In this architecture, every node has four children and the parent is replicated four times at every level of the tree. The leaves of the tree contain the computational resources and the vertices contain the switches, as shown in figure 1.9.
Figure 1.9: SPIN topology
1.3.2.7 Polygon Topology
The Polygon topology is a circular network in which packets travel in a loop from one router to the next. The network becomes more diverse when chords are added to the circle. Two special cases of the polygon network are the polygon (Hexagon) network shown in figure 1.10, proposed by ST Microelectronics [29], and the Spidergon/Octagon network shown in figure 1.11 [30][31][32][33]. The Hexagon network topology consists of six routers; each router has six bidirectional links, five connecting it to the other routers and one connecting it to its computational resource.
Figure 1.10: Polygon (Hexagon) Topology
The Spidergon/Octagon consists of eight routers and 12 bidirectional links. Each router is connected in a circular fashion, and there is also a connection between opposite routers. Communication between any pair of nodes takes at most N/4 hops, where N is the number of nodes, as compared to a single hop in the Hexagon topology.
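The N/4 bound quoted above can be checked with a small breadth-first search over the Spidergon graph (ring links plus one link between each pair of opposite routers). The code below is only an illustrative sketch added here, not part of the referenced proposals.

# Illustrative check of the Spidergon hop bound: N routers on a ring, plus a
# link between each router i and its opposite router (i + N/2) mod N.
from collections import deque

def spidergon_max_hops(N):
    links = {i: {(i - 1) % N, (i + 1) % N, (i + N // 2) % N} for i in range(N)}
    worst = 0
    for src in range(N):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in links[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

print(spidergon_max_hops(8))   # 2 hops = N/4 for the 8-router Octagon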
Figure 1.11: Spidergon (Octagon) Topology
1.3.3 Topologies metrics
Node degree: The number of links at each node is termed the degree of a topology. For example, a ring topology has a degree of 2, since there are two links at each node, while a mesh has a degree of 4, as each node has 4 links connecting it to 4 neighbouring nodes. The higher the degree, the more ports are required at each router, which increases the implementation complexity.
Hop count: The hop count is the number of hops a message takes from source to destination, i.e. the number of links it traverses. The hop count depends on the diameter of the network and the routing algorithm used. An increase in the number of hops increases the latency, affecting the performance of the network.
Topologies typically trade off node degree against hop count: one topology may have a low node degree but a high average hop count (e.g. a ring topology), while another may have a high node degree but a low average hop count (e.g. a mesh topology), so comparisons between topologies become tricky. Implementation details have to be factored in before an astute choice can be made.
Maximum channel load: This is the maximum number of bits per second that can be injected by every node into the network before it saturates. This metric is useful for estimating the maximum bandwidth the network can support. The higher the channel load that a topology and routing protocol place on the network, the greater the congestion and the lower the throughput. The specific traffic pattern affects the maximum channel load substantially, so a representative traffic pattern should be used when estimating maximum channel load and throughput.
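To make the node-degree versus hop-count trade-off concrete, the short sketch below (an illustration we add here, not a result from the thesis) computes the average minimal hop count of a 16-node ring (degree 2) and of a 4x4 mesh (degree up to 4) under uniform traffic, i.e. averaged over all ordered source-destination pairs.

# Illustrative sketch of the degree vs hop-count trade-off for 16 nodes.
# Ring: degree 2, average distance ~4.27 hops; 4x4 mesh: degree up to 4,
# average Manhattan distance ~2.67 hops (uniform traffic, all ordered pairs).
def avg_hops_ring(N):
    total = sum(min(abs(i - j), N - abs(i - j))
                for i in range(N) for j in range(N) if i != j)
    return total / (N * (N - 1))

def avg_hops_mesh(m, n):
    nodes = [(r, c) for r in range(m) for c in range(n)]
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1) in nodes for (r2, c2) in nodes
                if (r1, c1) != (r2, c2))
    return total / (len(nodes) * (len(nodes) - 1))

print(round(avg_hops_ring(16), 2), round(avg_hops_mesh(4, 4), 2))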
Significant research exists on all of the topologies discussed above. It is observed through the survey that the majority of on-chip network proposals gravitate towards either mesh or ring topologies. For example, the IBM Cell processor [34], the first product with an on-chip network, uses a ring topology because of its simple design, its ordering properties and its low power consumption; in total, four rings were employed to boost the bandwidth and alleviate the latency. The proposed Intel Larrabee [35] also adopted a two-ring topology. The MIT Raw chip, the first chip with an on-chip network, adopted a mesh topology, as did its follow-on commercialization, the Tilera TILE64 chip [36]; each parallel network in the chip is a mesh, and different types of traffic are routed over the distinct networks. It is also seen through the survey that researchers have carried out comparative evaluations of different NoC topologies, evaluating throughput and delay under different injection rates and switching techniques. Gen Fen, Wu Ning and Wang Qi [37] performed a simulation study using 2D Mesh, Fat Tree (FT) and Butterfly Fat Tree (BFT) topologies. The switching techniques used were wormhole (WH) and virtual cut-through (VCT). The performance of the above topologies was analysed for throughput and delay under a random traffic pattern in which every node generates a packet and sends it to a randomly chosen destination in the network. Their results conclude that the latency and throughput increase as the injection rate increases for all the topologies under the WH switching technique; as the injection load approaches the accepted traffic limit, there is more message contention and the latency rises. FT has the lowest latency and the highest throughput at medium injection rates. A performance evaluation along similar lines was carried out by Abdul Quaiyum Ansari, Mohammad Rashid Ansari and Mohammad Ayoub Khan [38]. They compared the Mesh, Torus, Cmesh and Fat Tree topologies for throughput and delay using a uniform traffic pattern in which each source sends packets to each destination with equal probability. It was concluded that as the injection rate increases, the throughput and delay increase, and at one point they rise sharply and then saturate. The torus topology can withstand a high injection rate without saturation and performs better compared to the other topologies. It was further concluded that the different topologies have different injection rate ranges for the same network size.
1.3.4 NoC Switching Techniques
Another important aspect of a NoC architecture is the switching technique. The flow of data through the routers of the network is determined by the switching technique [39]. The different types of switching techniques used are described below.
1.3.4.1 Circuit Switching
In circuit switching, the physical path between the source node and the destination node is reserved before data transmission. Since there is a dedicated path, there is a very low probability of packet loss. However, the bandwidth can be used inefficiently, because the path reserved for one connection cannot be used by others until it is released [40][41]. This technique does not scale well with NoC size.
1.3.4.2 Packet Switching
In the packet switching technique, messages are divided into packets at the source node and these packets are transmitted from source to destination independently through different routers. Each packet may follow a different route depending on the current status of the network. Packet switching is the most commonly used technique in NoCs because of its potential for providing simultaneous data communication between many source-destination pairs. Packet switching is classified as follows.
I. Wormhole Switching
In wormhole switching the packets are divided into smaller fixed-length flow control units (flits) [42]. In this switching technique the header flit contains the information required for making the path decision from source to destination. Only the header flit incurs a delay in deciding the path; the remaining flits belonging to the same packet simply follow the path taken by the header flit, significantly reducing the average message latency. The packet can therefore spread across consecutive routers like a worm, hence the name wormhole switching. Since the packets are divided into smaller flits, the buffer size required at the router is reduced to the size of a flit instead of the size of a packet. The main disadvantage of wormhole switching is that when the header flit is blocked during transmission, the complete packet gets blocked [43]. This switching is more susceptible to deadlock due to dependencies between links.
II. Store and Forward Switching
In store and forward switching the complete packet is stored at a node before being forwarded to the next node on the path. Each node along the path to the destination must have a buffer large enough to store the whole packet. Store and forward switching introduces considerable overhead and limits performance, given the area and power constraints of on-chip networks [19].
III. Virtual Cut-through Switching
In virtual cut-through switching the data is forwarded in the form of packets, and the buffer size required at a router is equal to the size of the packet. A packet can cut through to the input of the next router before the entire packet has been received at the current router, but only if there is enough storage at the next downstream router to hold the entire packet. The router can start forwarding the header and the following data bytes as soon as it receives them, thus reducing the latency drastically compared with the store and forward switching technique. Virtual cut-through switching behaves like store and forward switching at high network load.
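The latency difference between the techniques can be illustrated with the usual first-order, contention-free formulas; the sketch below uses assumed example numbers (an ATM-cell-sized packet, a 32-bit flit, a 1 Gbit/s link and a 2 ns per-hop routing delay) and is not a measurement from this thesis. Store and forward pays the full packet serialization delay at every hop, while wormhole and virtual cut-through pay it essentially once, plus a small per-hop header delay.

# Illustrative first-order latency model (no contention) for the switching
# techniques described above.  All numbers below are assumed examples.
packet_bits = 424            # e.g. one 53-byte ATM cell
flit_bits = 32
bandwidth = 1e9              # 1 Gbit/s link
hops = 4
t_router = 2e-9              # assumed per-hop header/routing delay

t_packet = packet_bits / bandwidth
t_flit = flit_bits / bandwidth

saf = hops * (t_packet + t_router)             # whole packet stored at every hop
cut_through = hops * (t_flit + t_router) + (packet_bits - flit_bits) / bandwidth

print(f"store-and-forward: {saf * 1e9:.0f} ns, wormhole/VCT: {cut_through * 1e9:.0f} ns")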
1.3.5 Routing Algorithms
The routing algorithm decides which path a packet will follow through the network to reach its destination. The main goal of a routing algorithm is to distribute traffic evenly among the paths supplied by the network topology so as to minimize contention and avoid hotspots, thus improving throughput and reducing network latency. Routing algorithms are generally divided into three classes: deterministic, oblivious and adaptive.
In a deterministic algorithm, all packets from a source to a destination node always follow the same path. The most commonly used deterministic algorithm for networks on chip is Dimension Ordered Routing (DOR). In an oblivious algorithm the packets may traverse different paths from the source node to the destination node, but the path is selected without regard to network congestion. In an adaptive algorithm the path taken by the packets from the source node to the destination node depends on the state of the network traffic.
Routing algorithms are also classified as minimal or non-minimal. Minimal algorithms select a path from source to destination with the minimum number of hops. Non-minimal routing algorithms allow paths to be selected from source to destination that may increase the number of hops; this type of algorithm is mostly used when there is congestion.
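A minimal sketch of XY dimension-ordered routing, the deterministic DOR algorithm mentioned above, is given below for a 2D mesh. It is an illustration we add here (not necessarily the exact scheme used later in this work): the packet is first routed along the X dimension until the destination column is reached, and only then along Y, which makes the path unique and deadlock-free on a mesh.

# Illustrative sketch of XY dimension-ordered routing (DOR) on a 2D mesh.
# Routers are addressed as (x, y); the output port is chosen deterministically.
def xy_route(current, destination):
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"             # deliver to the attached IP core

# Walk a packet from router (0, 0) to router (2, 3): X first, then Y.
step = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
hop = (0, 0)
while hop != (2, 3):
    port = xy_route(hop, (2, 3))
    hop = (hop[0] + step[port][0], hop[1] + step[port][1])
    print(port, "->", hop)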
1.4 Objectives of the Thesis and Outline
Network on chip is an active and growing research area. Many aspects of NoC are still under investigation and need further exploration and understanding. In this thesis our main concern is to design and implement an efficient switch for a NoC platform. The switch is the main component that determines the latency, throughput, reliability and efficiency of the entire NoC design. We will also implement the computing element using a Hardware Description Language (HDL). The computing element is a soft core processor used for generating, processing and receiving packets, and it is embedded with the switch element. Further, a 2D m x n mesh topology using the above described switch will be implemented and tested using a packet switching technique for data communication with ATM packets. Finally, we will study the performance of the scalable topology.
I. To synthesize the switch using HDL and test it for its functionality.
II. The switch with an embedded computing element will be implemented using HDL and tested for its functionality.
III. A two-dimensional 'm x n' mesh topology of the above switch nodes will be synthesized on FPGA hardware and verified using a packet switching technique for communication with the soft core processor under a suitable real-time operating system.
IV. A static IP table will be created for the fixed-size m x n mesh topology to establish links over the interconnects for the routing algorithms and will be verified using a Real Time Operating System (a toy illustration of such a table is sketched after this list).
V. A performance analysis of the scalable topology will be carried out for throughput, delay, bandwidth, packet size and queue size.
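As a toy illustration of the static routing table mentioned in objective IV (the actual table used in this work is defined in Chapter 3, Table 3.3; the router and port numbers below are hypothetical), each router simply holds a fixed mapping from destination address to output port:

# Hypothetical static routing look-up table for one router of a small mesh;
# the real table used in this work is given in Chapter 3 (Table 3.3).
# Key: destination router id, value: output port of this router.
ROUTING_LUT_R0 = {
    0: "LOCAL",    # packet has arrived; hand it to the attached computing element
    1: "PORT_1",
    2: "PORT_2",
    3: "PORT_1",   # reached through the neighbour on PORT_1
}

def route(destination_id):
    return ROUTING_LUT_R0[destination_id]

print(route(3))    # -> PORT_1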
The thesis is organized as follows:
Chapter 2 discusses the NoC literature survey and the state of the art in NoC. Here various works on NoC architectures, router design, routing algorithms, link interfaces and flow control are described in detail.
Chapter 3 discusses the proposed NoC router architecture, its implementation using a hardware description language and its functional simulation. We describe in detail the input module, scheduler and crossbar logic components.
Chapter 4 discusses the implementation of the computing node and the testing of the design for its functionality by targeting it to an FPGA device. The chapter elaborates on the design of the computing node using the Nios-II soft core processor and its integration with the switch.
Chapter 5 discusses the implementation of the NoC platform using a 2-D mesh topology and the testing of the platform for inter-communication using a real time operating system. Here we describe in detail the verification methodology for data communication over various tasks using the real time operating system µCOS.
Chapter 6 discusses the performance evaluation and analysis of 4 x 4 and 8 x 8 mesh
topologies for throughput and delay by varying packet size, bandwidth, queue size and link
delay.
Chapter 7 gives concluding remarks along with a discussion of future trends.
A NoC design can be broken down into its various building blocks, such as the overall architecture, topologies, router architecture design, routing algorithm, link interface and flow control. Each of these blocks can be a separate topic of research. We present a state of the art research survey and discussion based on these components.
2.1 NoC Architecture
Researchers have studied various NoC architectures to find the best NoC for a given application, basing the analysis on various network parameters. P. P. Pande, Cristian Grecu, Michael Jones, Andre Ivanov and Resve Saleh [44] performed a comparative evaluation of the SPIN, CLICHÉ, Torus, Folded Torus, Octagon and BFT NoC architectures. The performance metrics used were throughput, transport latency, energy and area requirements. To carry out the comparison they developed a simulator that employed flit-level event-driven wormhole routing, and the injected traffic followed self-similar and Poisson distributions. They also used similar types of switching and routing circuits to ensure the consistency of the comparisons. The results indicate that the throughput of BFT, CLICHÉ and Folded Torus is lower than that of SPIN and Octagon under a uniform traffic pattern, owing to the larger number of links between source and destination. With respect to energy, SPIN and Octagon have greater energy dissipation at saturation than the other architectures, although on average these architectures provide high throughput and low latency; their die area overhead is also higher than that of the other architectures. Their conclusion was that some architectures can sustain very high data rates at the expense of high energy dissipation and considerable silicon area overhead, while others provide a lower data rate at lower energy dissipation levels. Abdul Quaiyum Ansari, Mohammad Rashid Ansari and Mohammad Ayoub Khan [45] carried out a study along similar lines. They performed a performance study using mesh, torus, c-mesh and fat-tree network topologies, with analysis parameters such as latency, throughput, injection rate and hop count, using the Booksim 2.0 simulator. They concluded that the latency of the network increases as the injection rate increases, and that throughput increases linearly with the injection rate. Various topologies with the same network size have different injection rate ranges. The analysis concludes that the c-mesh topology has a very small injection rate range and that the torus topology performs better in comparison with the other topologies.
Bharat Phanibhushana and Sandip Kundu [46] presented a synthesis method for the hierarchical design of a NoC for a given task graph and system deadline which optimizes the router area. They proposed a two-phase design flow, namely topology generation and statistical analysis, and used the proportion of Monte-Carlo test cases that meet the deadline as the metric for goodness. The proposed solution was compared with Simulated Annealing (SA) based network generation and a static design approach. Their results claim a 10% performance benefit over SA, 16% over a standard mesh and 30% over the static design, and a total router area benefit of 59% over SA, 48% over the mesh and 55% over the static design. Mahendra Gaikwad and Rajendra Patrikar [47] proposed an energy-aware model for a NoC platform using the Perfect Difference Network (PDN), based on the mathematical notion of Perfect Difference Sets (PDS). They considered a chordal ring structure of the PDN for n = 13, based on the PDS {0, 1, 3, 9}; nodes in the PDN are connected either directly or through an intermediate node (a path of length 2), so messages can be routed through one or two hops. The results were compared with a 4x4 mesh inter-tile CLICHÉ NoC architecture, where it was observed that significant energy savings are achieved for transferring a random stream of data from one node to another in their proposed model.
Chia-Hsin Owen Chen, Niket Agarwal, Tushar Krishna, Kyung-Hoae Koo, Li-Shiuan Peh and Krishna C. Saraswat [48] performed a comparative study of physical express topologies (Express Physical Channels (EPC), based on express cubes) and virtual express topologies (Express Virtual Channels (EVC)) for meshes with large node counts in on-chip networks, under bisection bandwidth, router area and power constraints. The impact of Capacitively Driven Low-Swing Interconnect (CDLSI) links on the above designs was also evaluated. They observed that virtual express topologies give higher throughput and are more robust across traffic patterns, while physical express topologies give better throughput per watt and can leverage CDLSI to reduce the latency and increase the throughput; both designs, however, have similar low-load latencies. Ying-Cherng Lan, Hsiao-An Lo, Yu Hen Hu and Sao-Jie Chen [49] proposed a Bidirectional channel Network on Chip (BiNoC) architecture. A novel on-chip router architecture was also developed that allows the communication channels to be dynamically self-reconfigured to transmit flits in either direction, with the direction of flow controlled by a Channel Direction Control (CDC) algorithm. The proposed architecture was evaluated for performance using synthetic and real-world traffic patterns; the results show a significant reduction in packet delivery latency at all packet injection rates, and the bandwidth utilization and traffic consumption rate also exhibited higher efficiency compared with conventional NoCs. It was also observed that better latency results were achieved under different traffic patterns with less buffer usage. Snaider Carrillo, Jim Harkin, Liam J. McDaid, Fearghal Morgan, Sandeep Pande, Seamus Cawley and Brian McGinley [50] presented a novel Hierarchical Network-on-Chip (HNoC) architecture for Spiking Neural Network (SNN) hardware. This architecture incorporates a spike traffic compression technique to exploit SNN traffic patterns and locality between neurons, which improves throughput and reduces traffic overhead. The results show a high throughput of 3.33 x 10^9 spikes per second, and synthesis results of the proposed HNoC demonstrate an efficient low-cost area utilization of 0.587 mm2 and a low power consumption of 13.16 mW for a single cluster of 400 neurons. Md. Hasan Furhad and John-Myyon Kim [51] proposed an Extended Diagonal Mesh (XDMesh) topology for NoC architectures which includes diagonal links between remote nodes. The experimental results show that XDMesh outperforms existing state-of-the-art NoC topologies in terms of silicon area and energy consumption.
Wu Chang, Li Yubai and Chai Song [52] proposed a new torus topology structure and a corresponding routing algorithm for NoC applications by redefining the router designations and changing the original router locations in the traditional torus topology. The proposed structure and corresponding algorithm were implemented using SystemC. A performance evaluation of the average delay and normalized throughput of the proposed structure against an XY-routed mesh for four different traffic patterns was carried out, and the results claim that the proposed torus structure is more suitable for NoC applications.
2.2 NoC Router Architectures
The router is an important component of a NoC architecture; it is also called the communication backbone of the NoC, and an efficient router design directly affects system performance. Many researchers have contributed to this area. Jawwad Latif, Hassan Nazeer Chadhry, Sadia Azam and Naveed Khan Baloch [53] proposed a folded dual Xbar architecture, which is a combination of the dual Xbar and folding techniques. The dual Xbar router architecture, with buffered and bufferless features, reduces buffer read/write energy through the dual crossbars, while the switch folding technique increases resource utilization by reducing wire density and decreasing the logic multiplexers in the crossbar. Simulations were performed on the OMNET++ platform to compare the performance of the proposed architecture with the conventional architecture. The results show that the proposed 2-folded dual Xbar architecture gives a slight increase in throughput and reduces buffer read/write energy by an average of 46% at high load as compared to the conventional architecture; similarly, the 3-folded dual Xbar architecture gives a 16.6% increase in throughput with a 43% to 45% reduction in buffer read/write energy, but with a small increase in crossbar cost.
but little increase in crossbar. Umamaheswari S., Meganathan D and Raja Paul Perinbam
J[54] proposed a heterogeneous adaptable router which will reduce latency in irregular mesh
NoC architectures. Large input buffers in the router will incur large hardware overhead
followed by excessive power consumption although improves the efficiency of NoC
communication. Router with small buffers result in high latency although reduces power
consumption. In the proposed NoC router input buffers can be allocated dynamically reducing
the latency. The results shows that there is 20% decrease in latency and 9% increase in
throughput in 4 X 4 irregular mesh topology NoC with a buffer depth of 4 slots. The proposed
router architecture was also exposed to the synthetic traffics like hotspot, tornado, uniform
and bit complement in 8X8 irregular mesh topology NoC architecture and it offered 30.42%
reduction in latency and 18.33% increase in saturation throughput. Further the router was also
used for E3S benchmark applications where the latency was reduced by 22.63% as compared
to static router with 53% less power consumption and 55% less reduction in buffer
requirement. Anbu chozhan. P,D. Muralidharan and R. Muthaiah[55] proposed five port NoC
Anbu Chozhan P., D. Muralidharan and R. Muthaiah [55] proposed a five-port NoC router design with a First Word Fall Through (FWFT) based asynchronous First In First Out (FIFO) buffering technique, which improves timing and power consumption. The proposed router design uses a mask-based round robin arbiter, with the mask generated from a round robin pointer. Two priority arbiters are used, of which one handles the entire set of requests and the other handles the masked requests; based on the results of the two priority arbiters, the round robin pointer is updated. The priority scheduling is enforced by mask-based round robin arbitration, maintaining fairness to all participants. The router design was simulated using the ISIM simulator of Xilinx ISE 13.2 to verify its functionality. Qing Sun, Lijun Zhang and Anding Du [56] proposed and designed a low latency NoC router with wormhole virtual channel switching: the wormhole switching saves buffer space effectively, while virtual channel switching sets up several logical virtual channels over a physical link. The proposed router uses the two switching modes and requires only two clock cycles, reducing the latency. The router was developed using the Xilinx ISE tool and simulations were performed using Modelsim.
proposed a Distributed Shared Buffer (DSB) based router for NoC, DSB router architecture is
extensively used in high performance Internet packet routers. The proposed router uses two
crossbar stages and buffering of the packets are done between this two crossbar stages rather
than buffering packets at the output ports. The proposed DSB router as compared to Input
Buffer Router (IBR) architecture achieves up to 19% higher saturation on synthetic traffic
pattern and up to 94% of the ideal saturation throughput, further for SPLASH-2 benchmark
with high contention the packet latency was reduced on an average by 60% and under
synthetic workload evaluation the theoretical ideal saturation throughput was within 10%.
Weiwei Fu, Jingcheng Shao, Bin Xie, Tianzhou Chen and Li Liu [58] proposed a router microarchitecture design with Neighbor Flow Regulation (NFR); the design builds an additional regulation network using low cost wires, which collects information on flows in neighboring routers and regulates the arbitration schemes on them to prevent congestion and starvation. The simulation results show a 6.7% increase in network throughput and a 28% improvement in switch matching efficiency under hotspot traffic. Yaniv Ben-Itzhak, Israel Cidon, Avinoam Kolodny, Michael Shabun and Nir [59] proposed a heterogeneous NoC router architecture based on a shared-buffer architecture which supports different link bandwidths and a different number of virtual channels per unidirectional port, with the advantage of ingress and egress bandwidth decoupling. They also presented a new approach to reduce the number of shared buffers required for a conflict-free router, which reduces area and power consumption dramatically. The saturation throughput was improved by 6% to 47% for standard traffic patterns compared to an optimal input-buffer homogeneous router. A significant runtime improvement was achieved for a NoC-based CMP running the PARSEC benchmark. The comparative performance study of the proposed router with predefined traffic patterns against optimal input-buffer homogeneous and heterogeneous routers for NoC-based CMPs at various sizes shows that the proposed router offers better scalability and an area and power reduction of 15% to 60%. Ahmed Bin Dammas, Adel Soudani and
Abdullah Al-Dhelaan [60] proposed an enhanced router architecture for NoC that ensures specific management according to a service classification with an enhanced routing process. They proposed a new mechanism for flow control to avoid congestion in NoC, with an appropriate approach to manage flit buffering which allows dropping the flits with low priority when the router is in a congestion state. The performance evaluation of this router was carried out along with its hardware characteristics, indicating its suitability for low power NoC applications. Bidhas Ghoshal, Kanchan Manna, Santanu Chattopadhyay and Indrani Sengupta [61] proposed an online transparent test technique to detect latent hard faults developed in the FIFO buffers of NoC routers during field operation. They also proposed a transparent SOA-MATS++ test generation algorithm which performs periodic online testing of the buffers within the NoC routers and prevents accumulation of faults. The simulation results show that the test circuit does not affect the overall throughput of the system.
Woo young Jang and David Z. Pan [62] presented a NoC router with an explicit Synchronous Dynamic Random Access Memory (SDRAM) aware flow control. This SDRAM-aware flow controller schedules memory requests using priority-based arbitration, which prevents data contention, bank conflicts and short turn-around bank interleaving, thereby improving memory utilization and latency. The experimental results show that the proposed design significantly improves memory utilization and latency compared to a conventional NoC design without an SDRAM-aware router.
2.3 NoC Router Algorithms
The performance of a NoC system also depends on the underlying routing algorithm. The routing algorithm decides the path for a packet and is generally classified into deterministic and adaptive routing. In deterministic routing the path is determined by the source and destination addresses, while in adaptive routing the path is decided based on dynamic network conditions. Considerable research into designing enhanced routing algorithms and their comparative analysis is going on to decide the best routing algorithm for a given NoC application.
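As a concrete illustration of the deterministic class, the sketch below models XY routing on a 2-D mesh, where a packet is first moved along the X dimension until the destination column is reached and only then along Y; this is a generic illustration written for this discussion, not code taken from any of the surveyed works.

#include <stdio.h>

typedef enum { LOCAL, EAST, WEST, NORTH, SOUTH } port_t;

/* Deterministic XY routing on a 2-D mesh: route along X first, then along Y.
 * The chosen output port depends only on the current and destination nodes. */
port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return EAST;   /* still left of the destination column */
    if (dst_x < cur_x) return WEST;
    if (dst_y > cur_y) return NORTH;  /* column reached, now move along Y     */
    if (dst_y < cur_y) return SOUTH;
    return LOCAL;                     /* packet has arrived at its node       */
}

int main(void)
{
    /* A packet at node (1,1) destined for node (3,2) leaves on the EAST port. */
    printf("%d\n", xy_route(1, 1, 3, 2));   /* prints 1 (EAST) */
    return 0;
}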
Gaoming Du, Jing He, Yukun Song, Duoli Zhang and Huajie Wu [63] carried out the
comparative study using XY routing algorithm, turn routing algorithm and retrograde
algorithm on the NoC based on Packet-Connected Circuit (PCC) scheme with different packet
lengths and injection rates. The experimental results show that the retrograde and turn algorithms give better performance compared to the XY routing algorithm; the average throughput and average latency improved by 32.99% and 12.16% respectively for a high injection rate and long packet length. Ebrahim Behrouzian-Nezhad and Ahmad Khademzadeh [64] presented a routing algorithm for NoC called BIOS, which is based on Best Input and Output Selection. The simulation results for the proposed routing algorithm show better performance than other deterministic and adaptive routing algorithms for different traffic patterns. Andrew DeOrio, David Fick, Valeria Bertacco, Dennis Sylvester, David Blaauw, Jin Hu and Gregory Chen [65] proposed Vicis, a fault tolerant NoC architecture and associated routing algorithm which allows communication to continue even with a large number of transistor failures. The proposed architecture first detects errors within the router and reconfigures the architecture around them; if link connectivity is lost or a router is disabled due to faults, rerouting is performed using a distributed routing algorithm for meshes and tori. This maintains high reliability while incurring an overhead of 51%.
Wei Ni and Zhenwei Liu [66] introduced a new routing algorithm called Distance Prediction XY (DPXY) for self-similar traffic, based on buffering information and center distance. A comparative study of the proposed routing algorithm with the XY, West-First and Odd-Even routing algorithms was performed, and the results show that the proposed algorithm can effectively avoid centric congestion, decrease latency and improve network performance. Jan Moritz Joseph, Christopher Blochwitz and Thilo [67] proposed a routing technique with an adaptive allocation of default paths in routers. It retains the standard network load since the method is non-speculative, and packets routed through non-prioritized links are not penalized by the default paths. Routers configure themselves locally, and hence virtual point-to-point paths emerge temporarily and accelerate interleaved data streams. The simulations were performed using PARSEC benchmarks; the results show a 4.8% to 12.2% reduction in average packet latency. K. Tatas and C. Chrysostomou [68] proposed an adaptive routing scheme
for buffered and bufferless NoC based on fuzzy logic. To select the output port for an incoming flit, the dynamic traffic and power consumption on neighboring router links are taken into account and the input link cost is dynamically calculated using fuzzy logic control. The proposed scheme was implemented in Application Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA) technologies, which demonstrated that the hardware area overhead is minimal and that no additional latency is imposed. Salih Bayar and Arda Yurdakul [69] proposed a custom 2-D NoC architecture with reconfigurable switches. These reconfigurable switches can be configured as per the requirements of the application during both the design and the runtime phase. They also proposed a mapping algorithm which sets up paths through these switches; the experimental results show that the proposed mapping algorithm reduces routing cost by 79.96% for real-life embedded applications.
2.4 NoC link interfaces
The recent advances in silicon photonics technology including Micro-ring Resonators (MRs)
to build photonic switches and a waveguide to transfer messages over optical link attracted
the researchers to use this optical technology in NoC architecture. Optics is attractive due to
its high bandwidth, low power consumption and low error rate. Researchers are trying to use optical interconnects or are proposing optical router designs with corresponding routing algorithms, which can be used prominently for power efficient and high performance NoC applications. A few research articles are found in this area. Cisse Ahmadou Dit Adi, Ping
Qiu, Hidetsugu Irie, Takefumi Miyoshi and Tsutomu [70] proposed OREX, a hybrid NoC
architecture consisting of optical ring and an electrical crossbar central router. The electrical
crossbar switch is used to set up path between source and destination node and the messages
are transmitted using optical ring. The performance of the OREX using static wavelength
allocation under a probabilistic traffic pattern has been evaluated. The results show that the latency and power consumption of the proposed hybrid NoC architecture are much lower compared to a pure electrical NoC architecture. Xianfang Tan, Mei Yang, Lei Zhang, Yingtao Jiang and Jianyi Yang [71] proposed a generic, passive, scalable and non-blocking router architecture, namely the Generic Wavelength routed Optical Router (GWOR), which is scalable from 4x4 to any size. Different input wavelengths and corresponding micro-ring resonators (MRRs) are used to realize the routing in GWOR. A comparative analysis between the proposed design and existing router designs was carried out, which shows that the 4x4 GWOR uses the least number of MRRs and has the least power loss. Parisa Khadem Hamedani,
Natalie Enright Jerger and Shaahin Hessabi [72] presented QuT, an all-optical NoC using
passive microrings along with a deterministic wavelength routing algorithm for optical switch optimization to reduce the number of wavelengths and microrings. The proposed QuT is compared with the 128-node Optical Spidergon, λ-router and Corona under different synthetic traffic patterns; the results show that power is reduced by 23%, 86.3% and 52.7% respectively for the above optical topologies, with competitive latency. Xiaolu Wang, Huaxi Gu, Yintang Yang, Kun Wang and Qinfen Hao [73] proposed a Ring based Packet switched NoC
architecture called RPNoC, which uses few optical devices as compared to other Optical
NoCs. The proposed architecture has small diameter compared with many other packet-
switched Optical NoCs which contributes to low energy consumption, high saturation
bandwidth and low latency. The use of both Wavelength Division Multiplexing (WDM) and
Space Division Multiplexing (SDM) technology makes the architecture highly scalable. The
proposed design also assures deadlock freedom and low resource consumption. The
evaluation under different realistic and synthetic traffic for 64-node RPNoC was carried out.
The simulation results show that the proposed design achieves low latency and high throughput compared to other packet switched Optical NoCs (ONoCs), and it also achieves the lowest energy consumption. Xiuhua Li, Huaxi Gu, Ke Chen, Liang Song and Qingfen Hao [74] proposed a Screwy Torus Topology (STorus) along with a new optical router design for it. The optical network is divided into two subnets connected using a gateway, each subnet consisting of the STorus architecture. The diameter of the proposed architecture is half that of a torus and a quarter that of a mesh of the same network scale. The hierarchical waveguide layout gives fewer waveguide crossings, leading to an average optical power loss of 0.4 dB. The simulation results show that, compared with the traditional mesh topology, the throughput of the proposed architecture is improved by 59.1% under hotspot traffic and by 52.3% under uniform traffic. Somayyeh Koohi and Shaahin Hessabi [75] proposed a Contention-free optical NoC
(CoNoC) architecture for on-chip routing of packets. The CoNoC architecture is built upon all-optical switches that passively route data based on their wavelength, eliminating the need for resource reservation at the intermediate nodes. This improves the scalability and performance of the network. The proposed architecture was compared with an electrical NoC and alternative ONoC architectures under different synthetic traffic patterns. The power consumption per packet was reduced by 19%, 28%, 29% and 91%, and the energy reduction per packet achieved was 28%, 40%, 20% and 99%, against Columbia, Phastlane, λ-router and an electric torus respectively. Lei Zhang, Mei Yang, Yingtao Jiang, Emma [76] presented
Wavelength Routed Optical Network (WRON) an interconnection architecture suitable to
build on-chip optical micro-networks. They also generalized the routing scheme for WRON
using any two routing parameters out of the source node address, the destination node address
and the routing wavelength parameters. Here they proposed a Recursive Wavelength Routed
Optical Network (RCWRON) architecture using WRON as the primitive platform. The
RCWRON serves as the basis for the redundant wavelength routed optical network
(RDWRON) architecture.
2.5 NoC flow control
Flow control determines how network resources like channel bandwidth, buffer capacity and control state are allocated to packets traversing the network. Buffered flow control has several variants such as credit based flow control, handshaking signals, acknowledge flow control, stall/go flow control and T-error flow control. A good flow control protocol lowers the delay experienced by messages at low loads by not imposing a high overhead on resource allocation, and drives up network throughput by enabling effective sharing of buffers and links
across messages. Zhiliang Qian et al. [77] proposed a new NoC latency model covering the combination of arrival traffic burstiness, general service time distribution, finite buffer depth and arbitrary packet length. A link dependency analysis technique is proposed to determine the order in which the queuing analysis is applied. The accuracy of the model is demonstrated using both synthetic traffic and real applications. A 70x speedup over simulation is achieved with less than 13% error in the proposed analytical model, which benefits the NoC synthesis process. They further reviewed several techniques for NoC performance evaluation. These techniques range from building traffic models to developing analytical and simulation-based latency models. Their work summarized the typical workloads that are employed in NoC-based system evaluation and then reviewed the approaches in traffic analysis for capturing both short- and long-range dependence behaviors. In addition, a multi-fractal based approach to model non-stationary traffic was introduced. Also, advances in NoC architectures are being researched in higher dimensional spaces, i.e. the 3D domain. Such 3D NoC architectures drastically improve throughput and latency at the cost of power, area and the higher end of 3D die fabrication for integrated circuits. These 3D architectures are discussed in a later part of the thesis.
3.1 Proposed Switch Architecture Design
The switch/router is the main building block of a NoC, or for that matter of any network architecture, and undertakes the important task of coordinating the flow of data in the network. It is also important to keep the design of the switch as simple as possible, since its complexity will lead to an increase in implementation cost. The proposed architecture consists mainly of three parts: the Input Module, the Crossbar and the Scheduler.
Figure 3.1: Proposed switch architecture
The number of input port and output port of any NoC switch depends on network topology.
The above proposed switch design is for mesh NoC topology. Thus our design consists of 5
input ports and 5 output ports. Out of five ports, four ports are used to connect four
neighboring routers in north, south, east and west directions and one port is used to connect to
local processing element.
3.1.1 Input Module:
For each of the input ports there is an input module which is responsible for handling, storing
and processing the arriving data packets. Each input module has a data line, a frame pulse input, a clock input, a reset input and a global reset input.
3.1.2 The Crossbar:
The crossbar handles the task of physically connecting an input port to its destined output port, based on the grant signals issued by the scheduler.
3.1.3 The Scheduler:
The scheduler is responsible for receiving requests from all the input modules and, based on the scheduling algorithm, determines the best realizable configuration for the crossbar.
3.2 Working of Switch
The input to the switch/router is a data packet arriving from an adjacent switch/router on the 8-bit wide input data line of the respective input module. In our design we use a fixed 53-byte Asynchronous Transfer Mode (ATM) packet for implementation purposes. Another input to the switch is the clock input, which is global to all switch components, and the frame pulse input, which is a one bit wide signal used to signal the start of a data packet. One data byte can be input to the switch in every clock cycle. On every falling edge of the clock input the frame pulse signal is checked; this pulse should be at least one clock cycle wide. Once the frame pulse is detected, the input module considers the first data byte of the packet on the second rising edge of the clock. The reset input of the switch is used to reset all the counters of the design and initialize them to their initial values. The global reset signal is used to reset all input buffers. The wave diagram is explained in the functional simulation of the router in figure 3.9 and figure 3.10.
3.2.1 ATM packet format:
Figure 3.2 depicts the ATM packet format structure. The ATM packet is a fixed 53-byte cell having a 48-byte payload and 5 bytes of header information. The second, third and fourth bytes of the ATM packet contain the Virtual Circuit Identifier (VCI), which is used by the input module to determine the output port.
(a) (b)
Figure 3.2: (a) ATM Packet format structure (b) ATM header
Every data byte arriving at the input module (figure 3.3) is first stored in the Random Access Memory (RAM) component of the input module. As the input packet is read by the input module, it simultaneously extracts the second, third and fourth bytes of the packet into the VCI register; this information is then sent to the Look Up Table (LUT) module maintained inside the input module.
Figure 3.3: Input module of the proposed switch
The LUT is a Read Only Memory (ROM) based component that can be initialized with an
arbitrary set of data, to form the routing table of the switch. The ROM has eight rows and
each row is 36 bits wide. These bits consist of: a 16-bit input VCI, a 16-bit output VCI, and a
4-bit output port number as shown in figure 3.4. This LUT information is used by the input
module to update the old VCI bytes of the header with the new VCI bytes along with the
output destination port number for that packet. The input module then sends a request for that
specific output port to the scheduler and waits for a grant signal from the scheduler. Once a grant is issued, the data bytes are removed from the RAM in the same order in which they arrived and dispatched over the port for which the grant was issued. After the entire packet is sent, the same process is repeated for the other packets.
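The control flow of one input module described above can be summarised in software form. The following C sketch is only an illustration of that flow (RAM buffering, LUT lookup, header update and request/grant handshake); it is not the VHDL implementation, and the hook functions and the rewrite_vci() helper (whose nibble-level behaviour is shown in the worked example after table 3.4) are hypothetical names introduced here.

#include <stdint.h>

#define CELL_SIZE 53   /* fixed ATM cell: 5-byte header + 48-byte payload */
#define LUT_ROWS  8    /* routing table held in the input-module ROM      */

/* One 36-bit LUT row: 16-bit input VCI, 16-bit output VCI, 4-bit output port. */
typedef struct { uint16_t in_vci; uint16_t out_vci; uint8_t out_port; } lut_row_t;

/* Hypothetical handshake hooks standing in for the request/grant wires and the
 * crossbar data path of the real hardware.                                    */
typedef struct {
    void (*send_request)(uint8_t port);            /* raise the request line */
    void (*wait_grant)(void);                      /* block until granted    */
    void (*send_byte)(uint8_t port, uint8_t byte); /* drive one data byte    */
} switch_hooks_t;

void rewrite_vci(uint8_t *header, uint16_t new_vci);  /* see sketch after table 3.4 */

/* Process one cell already stored in the input-module RAM: match its VCI in the
 * LUT, rewrite the header, then dispatch the bytes in arrival order once granted. */
void handle_cell(uint8_t ram[CELL_SIZE], uint16_t vci,
                 const lut_row_t lut[LUT_ROWS], const switch_hooks_t *hw)
{
    for (int i = 0; i < LUT_ROWS; i++) {
        if (lut[i].in_vci != vci)
            continue;
        rewrite_vci(ram, lut[i].out_vci);        /* update old VCI with new VCI    */
        hw->send_request(lut[i].out_port);       /* ask the scheduler for the port */
        hw->wait_grant();
        for (int b = 0; b < CELL_SIZE; b++)
            hw->send_byte(lut[i].out_port, ram[b]);  /* same order as arrival */
        return;                                  /* then repeat for the next cell  */
    }
}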
Figure 3.4: Look Up Table structure for the proposed switch
3.2.2 The Scheduler
The scheduler is responsible for receiving requests from the input module for various input
ports and grants some of these requests by determining the best configuration for the crossbar.
It keeps the information of which ports are free and which ports are communicating with each
other. This helps it to control the arbitration and resolve contention problems. This decision mainly depends on the scheduling algorithm used. The algorithm has to be fast, efficient, fair and, most important, easy to implement in hardware.
There are many ways in which a crossbar scheduler can be implemented. A group of researchers at UC San Diego designed and implemented the Rectilinear Propagation Arbiter (RPA) and the Diagonal Propagation Arbiter (DPA), which use a round robin priority scheme. The scheduling decision in RPA and DPA is implemented in combinational logic using a two dimensional ripple carry arbiter.
3.2.3 Two Dimensional Ripple Carry Arbiters
A two dimensional ripple carry arbiter is shown in figure 3.5. For simplicity we explain a 4 x 4 two dimensional ripple carry arbiter, but the design can be scaled to an m x n two dimensional ripple carry arbiter. The input ports and output ports are realized using arbiter cells; the row numbers of the cells correspond to input ports and the column numbers to output ports. The pair of numbers mentioned on each cell specifies the request that is handled by that specific cell. For example, the pair 2, 3 on a cell means it handles a packet that is destined to go from input port 2 to output port 3. Each arbiter cell has a signal R (request), which is an input signal, and a signal G (grant), which is an output signal. R[i, j] is active when there is a packet at input port i which is destined for output port j, and G[i, j] is active when the request from input port i to output port j has been granted.
The scheduling decision is based on the following algorithm:
I. Start from the top left most cell (1,1).
II. Once a cell is reached, move to its right and bottom cells.
III. Check if the R signal is active; if so, activate the G signal if and only if no request has been granted in the cells above and to the left of the current cell.
IV. If a request is granted, inform the cells on the right and at the bottom by activating the appropriate signals.
Figure 3.5: 4 X 4 two dimensional ripple carry arbiter
This ensures that at any given point of time only one request is granted in the same row or column (the one higher up or further to the left is granted).
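The behaviour of this arbitration can be modelled in a few lines of software; the sketch below is only an illustrative model of the granting rule described above, not the combinational hardware of the arbiter.

#include <stdbool.h>

#define N 4   /* 4 x 4 arbiter; the same scheme scales to m x n */

/* Software model of the two-dimensional ripple-carry arbitration: R[i][j] is
 * the request from input port i to output port j and G[i][j] the grant. A cell
 * grants its request only if no cell above it (same column) or to its left
 * (same row) has already been granted, which fixes the priority at cell (1,1). */
void ripple_carry_arbiter(const bool R[N][N], bool G[N][N])
{
    bool row_granted[N] = { false };   /* a grant already issued in this row    */
    bool col_granted[N] = { false };   /* a grant already issued in this column */

    for (int i = 0; i < N; i++) {          /* rows = input ports     */
        for (int j = 0; j < N; j++) {      /* columns = output ports */
            G[i][j] = R[i][j] && !row_granted[i] && !col_granted[j];
            if (G[i][j]) {
                row_granted[i] = true;     /* inform cells to the right */
                col_granted[j] = true;     /* inform cells below        */
            }
        }
    }
}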
The two dimensional ripple carry architecture described in figure 3.5 is not fair, because it gives priority to the cells that are higher and to the left, i.e. cell number (1, 1). To make this architecture fair, the researchers [78] modified this design into a cyclic two dimensional ripple carry arbiter, which too has some fundamental problems. This design uses combinational feedback loops, which are not supported by logic synthesis tools and are difficult to design and test. To overcome these problems, new architectures were introduced: the Rectilinear Propagation Arbiter (RPA) and the Diagonal Propagation Arbiter (DPA) [78].
DPA is a modified version of the two dimensional RPA. Here certain individual cells are selected and placed in diagonal rows. These are cells whose requests, if granted, will not prevent granting the others. The arbitration process considers the first diagonal row and, if there are requests in that diagonal row, all of them are granted; then in the next time slot the arbitration process moves to the next diagonal row, grants all requests in that row, and so on.
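A behavioural model of this diagonal idea is sketched below, assuming that cells on a wrapped diagonal never share a row or a column and that the highest-priority diagonal rotates every time slot; for completeness the sketch also lets lower-priority diagonals pick up any non-conflicting remaining requests in the same pass. It is an illustration of the principle only, not the DPA circuit of [78].

#include <stdbool.h>

#define N 4   /* 4 x 4 arbiter */

/* Diagonal-propagation style arbitration: cells on one wrapped diagonal have
 * distinct rows and columns, so their requests can all be granted together.
 * start_diag selects the highest-priority diagonal and is rotated each slot. */
void dpa_arbitrate(const bool R[N][N], bool G[N][N], int start_diag)
{
    bool row_granted[N] = { false }, col_granted[N] = { false };

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            G[i][j] = false;

    for (int d = 0; d < N; d++) {                 /* walk the diagonals      */
        int diag = (start_diag + d) % N;          /* priority diagonal first */
        for (int i = 0; i < N; i++) {
            int j = (i + diag) % N;               /* one cell per row/column */
            if (R[i][j] && !row_granted[i] && !col_granted[j]) {
                G[i][j] = true;
                row_granted[i] = true;
                col_granted[j] = true;
            }
        }
    }
}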
3.2.4 Comparison between DPA and RPA
Table 3.1 describes the results from timing simulations of DPA and RPA architectures for 16
port and 32 port crossbar switches [78]. The results show that for DPA architecture the
number of cells (and therefore the cost) is almost half of that of RPA architecture. Also, the
propagation delay of the DPA design is almost half of the delay in RPA.
Table 3.1: Areas and delays for each arbiter design
                  RPA(16)   DPA(16)   RPA(32)   DPA(32)
No. of cells      5766      2480      23814     10080
Area (mm2)        0.497     0.184     2.052     0.746
Delay (ns)        10.23     6.15      21.10     13.61
The comparative results in table 3.1 are quite convincing for selecting the DPA architecture for our design.
3.2.5 The crossbar organization
The crossbar module in the design is responsible for physically connecting an input port to its
destined output port, based on the grants issued by the scheduler. The inputs to the crossbar
are taken from input modules of our switch. The outputs of crossbar are connected to the
output ports of the switch. The crossbar makes the appropriate connection between each input
and its corresponding output.
3.3 Development design flow
For development and synthesis purposes we have used Altera's Quartus II 8.0 software. The
Altera Quartus II design software provides a complete, multiplatform design environment that
easily adapts to our design needs. It is a comprehensive environment for system-on-a-
programmable-chip (SOPC) design. The Quartus II software includes solutions for all phases
of FPGA and Complex Programmable Logic Device (CPLD) design. A Nios II soft-core processor is constructed in SOPC Builder, and the design is synthesized with the soft core in a Cyclone II FPGA with the help of the Quartus 8.0 IDE (Integrated Development Environment). The SoC is developed with the help of the Nios IDE, which integrates the MicroC/OS-II RTOS and the Nios-II soft core processor. The embedded development flow is shown in figure 3.6 and goes as described herewith.
• System concepts and System Requirement Analysis
• Defining and Generating the System in SOPC Builder using Nios II with standard
required components along with custom instructions and peripherals
Figure 3.6: Embedded development flow
• Integrating the SOPC Builder System into the Quartus II Project along with custom
hardware modules
• Assigning pin locations, timing requirement and other design constraints
• Compile hardware and download to target board
• Develop the software project with the Nios-II IDE and design the application code. Altera provides component drivers and a hardware abstraction layer (HAL) which allows you to write Nios II programs quickly and independently of the low-level hardware details. In addition to your application code, you can design and reuse custom libraries in your Nios II IDE projects.
• Download the software to the Nios II system on the target board.
• Finally, run and debug the software on the target board.
3.4 Description of Switch using Hardware Descriptive Language
The switch design described above needs to be realized as a digital system module by using a hardware description language. There are several hardware description languages available, of which the most common ones are VHDL, Verilog and SystemC. We have chosen VHDL (Very high-speed integrated circuit Hardware Description Language) as the language to model the digital system using dataflow, behavioral and structural styles of modeling [79]. VHDL is an IEEE standard, supported by all CAD tools, technology independent, flexible, and supports easy modeling at various abstraction levels.
3.4.1 Switch Design Synthesis
The top level entity Switch is described using VHDL code which defines the input and output interface of the switch as well as the structure of its three components, namely the input module, the scheduler and the crossbar. The detailed code for the components (input module, scheduler and crossbar) is written as three different VHDL files, along with the code for the look up table. The designed switch is synthesized using the ALTERA QUARTUS II (version 8.0) software. The Altera Quartus II design software provides a complete, multiplatform design environment that easily adapts to your specific design needs. It is a comprehensive environment for system-on-a-programmable-chip (SOPC) design. The Quartus II software includes solutions for all phases of FPGA and CPLD design (Figure 3.7) [80].
Figure 3.7: Quartus II design flow for SoC design
The Quartus II software includes a modular Compiler which has the following modules: Analysis & Synthesis, Partition Merge, Fitter, Assembler, TimeQuest Timing Analyzer, Design Assistant, Electronic Design Automation (EDA) Netlist Writer and HardCopy® Netlist Writer. We used the Quartus II software Graphical User Interface (GUI) to create the Switch project described above with switch.vhdl as the top level file and added the remaining files to the project. The project was then compiled; the successful compilation summary is shown in figure 3.8. It is observed from the compilation summary that the total number of logic elements used to generate one switch is 4241 out of 33216, which is only 13% usage. The other component usage on the Cyclone II EP2C35F672C6 FPGA device is shown in table 3.2.
Table 3.2: Resources usage on Cyclone II EP2C35F672C6 FPGA device for proposed switch design
Cyclone II EP2C35F672C6 FPGA device resources    Total usage    Percentage
Total logic elements 4241/33216 13%
Total combinational functions 2987/33216 9%
Dedicated logic registers 2605/33216 8%
Total pins 130/475 27%
Total memory bits 28224/483840 6%
Embedded multiplier 9- bit elements 0/70 0%
Total PLLs 0/4 0%
After successful compilation, the graphical representation of the design was generated as shown in figure 3.9. The design is viewed using the Quartus II RTL Viewer, which provides a gate-level schematic view of the design and a hierarchy list that lists the instances, primitives, pins, and nets for the entire design netlist. One can navigate through the entire design to examine the connection of the different components and their input and output interfaces. This only gives an idea of whether the blocks are correct and have proper connectivity, but it does not guarantee the working of the design.
3.4.2 Switch Design Simulation
After successful synthesis, the design is checked using simulations. There are two ways in which the designed circuit can be simulated. The simplest way is to assume that logic elements and interconnection wires are perfect, thus causing no delay in the propagation of signals through the circuit. This is called functional simulation. A more complex alternative is to take all propagation delays into account, which leads to timing simulation.
3.4.2.1 Functional Simulation
The functional simulation is performed to verify the functional correctness of a circuit as it is
being designed. This takes much less time, because the simulation can be performed simply
by using the logic expressions that define the circuit.
Before running the functional simulation, the look up table information was initialized with the data shown in table 3.3 using the lut1.mif, lut2.mif, lut3.mif and lut4.mif files. Four different .mif files are required, as we have four input ports; after compilation, the data entered in these files is loaded into the respective LUT modules of the input modules. The look up table holds the following information: input VCI, output VCI and output port number.
Figure 3.8: Compilation report flow summary
The input VCI is the second, third and fourth byte header information of the ATM packet, which is used for routing purposes. As per the data stored in the look up table, any ATM packet having input VCI number 56C9 will have it replaced with the number 1234 by the input module and will be given port number 3. Similarly, any ATM packet having input VCI number 5BED will have it replaced with ABCD by the input module and will be given port number 2, and so on.
Table 3.3: Static look up table used for routing
INPUT VCI OUTPUT VCI OUTPUT PORT NUMBER
56C9 1234 3
5BED ABCD 2
B9E0 8193 1
0204 DEBA 4
BCCD 1213 4
3747 56C9 3
2C3C 8104 2
BBCD 7685 1
As said earlier, we are using ATM packets for our simulation. The ATM packet is a 53-byte cell with the first five bytes as header information and the remaining bytes, i.e. bytes six to fifty-three, as the data. For our design we generated the following input ATM packets and expected output ATM packets, as shown in table 3.4.
The simulation takes a waveform vector file as the input file. Using the waveform editor, the input.vwf file was created as shown in figure 3.10. The input waveform vector file specifies the clock, global_reset, reset and frame pulse (fp) signals and the data_in packets. The global_reset and reset signals were set to active low, and four different fp signals (fp1, fp2, fp3 and fp4) and four data_in packets (data_in1, data_in2, data_in3 and data_in4) were supplied to the four input ports respectively.
Table 3.4: Input ATM packets used for simulation and expected output packets
Packet   Input ATM packet                         Expected output ATM packet
No.      1st   2nd   3rd   4th   5th   6-53       1st   2nd   3rd   4th   5th   6-53
         byte  byte  byte  byte  byte  bytes      byte  byte  byte  byte  byte  bytes
1        72    75    6C    95    76    11         72    71    23    45    76    11
2        A9    A5    BE    DC    AD    22         A9    AA    BC    DC    AD    22
3        DF    EB    9E    02    E3    33         DF    E8    19    32    E3    33
4        15    10    20    48    1A    44         15    1D    EB    A8    1A    44
5        AA    BB    CC    DD    EE    11         AA    B1    21    3D    EE    11
6        10    20    30    40    50    AA         10    2C    CD    D0    50    AA
7        C1    C2    C3    C4    C5    BB         C1    C8    10    44    C5    BB
8        DA    DB    DC    DD    DE    22         DA    D7    68    5D    DE    22
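The header rewriting in table 3.4 can be checked in a few lines of C. The sketch below reproduces the first row; the exact nibble packing (VCI in the low nibble of the second byte, the whole third byte and the high nibble of the fourth byte, as in the standard ATM UNI header) is inferred from the byte patterns in the table rather than stated explicitly in the text.

#include <stdint.h>
#include <stdio.h>

/* Extract the 16-bit VCI from header bytes 2..4 (inferred nibble layout). */
uint16_t extract_vci(const uint8_t *h)
{
    return (uint16_t)((h[1] & 0x0F) << 12) | (uint16_t)(h[2] << 4) | (h[3] >> 4);
}

/* Replace the VCI while preserving the surrounding VPI and PT/CLP bits. */
void rewrite_vci(uint8_t *h, uint16_t vci)
{
    h[1] = (uint8_t)((h[1] & 0xF0) | (vci >> 12));          /* keep VPI nibble  */
    h[2] = (uint8_t)((vci >> 4) & 0xFF);
    h[3] = (uint8_t)(((vci & 0x0F) << 4) | (h[3] & 0x0F));  /* keep PT/CLP bits */
}

int main(void)
{
    /* Header of packet 1 from table 3.4 */
    uint8_t h[5] = {0x72, 0x75, 0x6C, 0x95, 0x76};

    printf("input VCI  = %04X\n", extract_vci(h));  /* prints 56C9           */
    rewrite_vci(h, 0x1234);                         /* LUT maps 56C9 -> 1234 */
    printf("new header = %02X %02X %02X %02X %02X\n",
           h[0], h[1], h[2], h[3], h[4]);           /* 72 71 23 45 76        */
    return 0;
}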
As mentioned in the working of the switch, when the fp signal is detected, the first data byte is supplied at the second rising edge of the clock. We supplied the first four input ATM packets of table 3.4 to the switch input modules as follows: the first packet to input port 1, the second to input port 2, the third to input port 3 and the fourth packet to input port 4.
Figure 3.10: Input waveform vector file showing input ATM packets
Before running the functional simulation the required netlist was created, and then the functional simulation was run. The output report generated after functional simulation is shown in figure 3.11. As per the routing information stored in the LUT of the input module, packet number 1, injected at input port_1 of the switch with input VCI number 56C9, is updated with the new VCI number 1234 and is supposed to go to output port number 3. Further, packet number 2, injected at input port_2 with input VCI number 5BED, is updated with the new VCI number ABCD and is supposed to go to output port number 2. Also, packet number 3, injected at input port_3 with input VCI number B9E0, is updated with the new VCI number 8193 and is supposed to go to output port number 1, and lastly packet number 4, injected at input port_4 with input VCI number 0204, is updated with the new VCI number DEBA and is supposed to go to output port number 4. The expected output results exactly match the results generated after the functional simulation, and this confirms the working of the proposed switch design.
Figure 3.11: Output showing ATM packets received at respective output ports of the switch as described in the LUT
Bouraoui Chemli and Abdelkrim [81] proposed the architecture of a NoC router to be used in a mesh topology and implemented the deadlock-free Negative-First turn model routing algorithm. The router is designed in VHDL, synthesized with the ISE 13.1 tool and implemented on a Xilinx Virtex-5 FPGA board. The proposed router architecture consists of the input module, semi-crossbar, switch allocator and the credit-switcher. The data flow and the communication between adjacent blocks are managed by the input port. The switch allocator issues grant signals indicating the response to the requests for the output ports. The arbiter module of the switch allocator uses round-robin and priority scheduling schemes to assign the highest priority packet to the adequate output port. The credit switcher indicates the status of the input ports of neighboring routers by issuing credit signals. The link controller receives flits from the adjacent router and, if required, stores them in the buffer. The router can serve up to five packets simultaneously; this parallel operation reduces the latency of the proposed router and the average latency of the NoC. Here they compared their design with two other router designs. The results are shown in table 3.5, which claims that the router is 1.15 and 1.5 times faster than the other routers because of its distributed routing function, but in terms of area their design is bigger than the other router designs.
Table 3.5: comparison of the proposed router design with other router design
                                     Router 1            Router 2            Proposed router
Topology                             2D Mesh             2D Torus            2D Mesh
Routing algorithm                    XY                  XY                  Negative-first
Frequency (MHz)                      52                  40                  60
Area (Slices)                        824                 611                 3174
Power estimation (mW)                58.8                NA                  43
Estimated peak performance per port  0.812               0.312               19.20
FPGA device                          Virtex-2 X2VC6000   Virtex-2 X2VC6000   Virtex-2 X2VC6000
Another implementation of router was proposed by Hung K. Nguyen and Xuan-Tu [82]. Here
they proposed a hybrid switching router based on the combination of virtual cut-through and
wormhole switching schemes. The router is dynamically reconfigurable at run time to
exchange between switching schemes, therefore achieving higher average performance than
wormhole switching, while reducing the implementation cost in comparison with the virtual
cut-through switching. The RTL model of the proposed router was synthesized on Xilinx
Virtex-7 xc7vx485 FPGA chip by using the Xilinx Vivado Design Suite. The resources
utilization for the proposed router is about 0.36%, 1.02%, and 0.25% of the xc7vx485 chip in
terms of Flip-Flop, LUTs and Memory LUT respectively. When compared with generic
wormhole router with same configuration, the resource utilization overhead of the proposed
router is about 0.82%, 2.51%, and 9.83% in terms of Flip-Flop, LUTs and Memory LUTs
respectively. The experimental results show that the router guarantees reliability, reduces latency by approximately 30.2% and increases average throughput by about 38.9% compared with the generic router. The power consumption and area overhead are also acceptable when
compared with the generic router. FPGA based design of reconfigurable router for NoC
applications is another router implementation proposed by Amit Bhanwala, Mayank Kumar
and Yogendera [83]. The design consists of four channels and crossbar switch, each channel
has FIFO buffers to store data and multiplexers for controlling input and output of the data.
The proposed router is described using Verilog HDL and simulated using Modelsim. The
RTL view is obtained using Xilinx ISE 13.4. The entire design is synthesized on a Xilinx SPARTAN-6 FPGA device. The power gating technique is used to reduce the power dissipation
of the proposed reconfigurable router. The results show that the proposed design consumes
less power compared to the previously designed reconfigurable routers.
4.1 Overview of Computing Node
A computing node is an active electronic device that is attached to a network, and is capable
of creating, receiving and transmitting information over a communications channel. The
computing node in our design is a soft core processor integrated with the switch block. The
soft core processor is capable of generating and forwarding data packets to attached switch,
receiving data packets from the switch and processing data to some extent. The implementation of the computing node is accomplished by converting the RTL design of the switch into a block symbol file and then interfacing that block to the generated soft core processor SoC. We first discuss the process of converting the switch design into a block diagram.
4.2 Task of converting the Design into Block Diagram
After testing the switch design for its functionality, to create the block symbol we opened the switch.vhdl file, compiled the code and created the symbol file by selecting the create symbol file option from the menu, which created switch.bsf. Figure 4.1 shows the block symbol of our switch design. At the input interface it has four one-bit frame pulse signals (fp1 to fp4) used to indicate the arrival of a data packet at the respective port number, four 8-bit data buses (data_in1 to data_in4) for receiving data packets at the respective port numbers, and the clock, reset and global reset signals. The output interfaces are fp_out_port (1 to 4), which are the frame pulse input signals for the adjacent switch, data_out_port (1 to 4), which are the 8-bit data buses on which the data packets are output, data_valid (1 to 4) to check the validity of the data, and incoming_port_to_output (1 to 4), which indicates from which incoming port the data packet has arrived, along with the request and grant signals.
Figure 4.1: Block symbol of the proposed switch design
4.3 Soft Core Processor architecture
A soft processor is an Intellectual Property (IP) core that is implemented using the logic
primitives of the FPGA. Key benefits of using a soft processor include configurability to trade
between price and performance, faster time to market, easy integration with the FPGA fabric,
and avoiding obsolescence. Commonly used embedded soft core processors for FPGA [84]
are shown in table 4.1
LEON3 is a 32-bit processor based on the SPARC V8 architecture and is designed and
maintained by Aeroflex Gaisler AB in Sweden. The model is highly configurable and
particularly suitable for system-on-a-chip (SOC) designs. The structure of LEON3 processor
follows Harvard architecture and uses the AMBA Advanced High performance Bus (AHB)
for all on–chip communications.
Table 4.1: Embedded Soft Core Processor for FPGA
Soft Core Processor   Architecture   Logic Elements                                        Supported FPGA
LEON3                 SPARC-V8       3500                                                  Xilinx FPGA
MicroBlaze            MicroBlaze     1324                                                  Xilinx FPGA
PicoBlaze             PicoBlaze      192                                                   Xilinx FPGA
Cortex-M1             ARMv6          2600                                                  Xilinx & Altera FPGA
Nios II               Nios II        Nios II(f): 1800, Nios II(s): 1700, Nios II(e): 390   Altera FPGA
Register windowing makes it possible to select a new set of local registers upon a procedure call, thereby reducing the time required for storing register contents in memory every time. By default a basic implementation has 8 global registers and 8 sets of register windows, with each window consisting of 24 registers. Thus at any time 32 registers are available to a program, out of the existing 200 registers. The unique debug interface of LEON3 allows non-intrusive hardware debugging and provides access to all registers and memory. A main advantage of the LEON3 processor is that it uses a structured organization of packages, folders and VHDL records. This processor is very reliable and hence is used in a large number of military and
space applications.
MicroBlaze is a proprietary 32-bit RISC reconfigurable soft processor from Xilinx, and can be customized with different peripheral and memory configurations. The Harvard memory architecture is used for the development of MicroBlaze. This processor has a three-stage pipeline
with variable length instruction latencies, typically ranging from one to three cycles. Xilinx
Platform Studio is available for creating a MicroBlaze based system. It uses 2 Local Memory
Busses (LMB) to connect instruction and data memories. User can define the sizes of
instruction and data memory and the number of peripheral based on application requirements.
An On-Chip Peripheral Bus (OPB) is used to boost system performance and is designed to support low-performance/low-speed peripherals such as UART, GPIO, USB and external bus controllers. This soft-core processor is used only in Xilinx FPGAs and has many configuration options. It also uses the Advanced Extensible Interface (AXI) standard bus. Being a proprietary soft core, the source code is not available.
PicoBlaze is the designation of a series of three free soft processor cores from Xilinx for use
in their FPGA and CPLD products. They are based on an 8-bit RISC architecture and can reach speeds of up to 100 MIPS on the Virtex-4 FPGA family. The processors have an 8-bit
address and data port for access to a wide range of peripherals. The license of the cores allows
their free use, albeit only on Xilinx devices, and they come with development tools. Third
party tools are available from Mediatronix and others.
Cortex-M1 is an implementation of the proprietary 32-bit ARMv6 architecture designed for
FPGA. Using Cortex-M1 requires license from ARM Limited. There are Actel flash-based
FPGA chips which are shipped with a Cortex-M1 license. Cortex-M1 can be used with Xilinx and Altera FPGAs as well. ARM (Advanced RISC Machine) is a popular RISC architecture particularly suitable for applications that demand low power consumption. Many operating systems have been ported to ARM, including Linux and Windows CE.
In our design we selected the Nios II processor because it supports all Altera® SoC and FPGA families.
4.3.1 Altera Nios-II Soft Core Processor
The Nios II Soft Processor [85] is a 32-bit fixed-point high-performance processor created to be
used only in FPGAs. This processor has separate buses for data and program memories, which is often called Harvard architecture, and also has a Reduced Instruction Set Computer (RISC) architecture. A Nios II processor system is equivalent to a microcontroller or
“computer on a chip” that includes a processor and a combination of peripherals and memory
on a single chip. Like a microcontroller family, all Nios II processor systems use a consistent
instruction set and programming model. The Nios II soft processor has a number of features
that can be configured by the user to meet the demands of a desired system. The processor can
be implemented in three different configurations:
• Nios II/f is a "fast" version designed for high performance. It has the widest scope of
configuration options that can be used to optimize the processor for performance.
• Nios II/s is a "standard" version that requires less resource in an FPGA device as a
trade-off for reduced performance.
• Nios II/e is an "economy" version which requires the least amount of FPGA resources,
but also has the most limited set of user-configurable features.
Its arithmetic and logic operations are performed on operands in the general purpose registers.
The data is moved between the memory and these registers by means of Load and Store
instructions. The word length of the Nios II processor is 32 bits. All registers are 32 bits long.
Byte addresses in a 32-bit word are assigned in little-endian style, in which the lower byte
addresses are used for the less significant bytes (the rightmost bytes) of the word. A Nios II
processor may operate in the following modes:
• Supervisor mode – allows the processor to execute all instructions and perform all
available functions. When the processor is reset, it enters this mode.
• User mode – the intent of this mode is to prevent execution of some instructions that
should be used for systems purposes only. This mode is available only when the
processor is configured to use the Memory Management Unit (MMU) or the Memory
Protection Unit (MPU).
Application programs can be run in either the User or Supervisor modes.
4.3.2 Characteristics of Nios II Processor
The Nios II soft processor is very flexible and has some configurations that can be set at the design stage. These configurations allow the user to optimize the processor for various applications [86]. The most relevant are the clock frequency, the debug level, the performance level and the user-defined instructions.
Features of Nios II Soft Processor:
• Full 32-bit instruction set, data path and address space, 32 general-purpose registers,
optional shadow registers sets
• Single-instruction 32 x 32 multiply and divide unit producing a 32-bit result
• Dedicated instructions for computing 64-bit and 128-bit product of multiplication
• Floating-point instructions for single-precision floating point operations
• Single instruction barrel shifter
• Hardware-assisted debug module enabling processor start, stop, step and trace under
control of the Nios-II software development tools
• Optional memory management unit (MMU) to support operating systems requiring
MMUs
• Instruction Set Architecture (ISA) compatible across all Nios-II processor systems
• Performance up to 250 DMIPS
One of the most important features of the Nios-II processor is the possibility to add user-defined functional units called Custom Logic. Custom instructions allow the Nios-II processor core to be tailored to the needs of a particular application. In this way it is possible to accelerate time-critical software algorithms by implementing some steps in specialized hardware blocks. These blocks must be created using either VHDL or Verilog. Physically, the Custom Logic block is placed inside the Nios-II processor in parallel to the ALU. Figure 4.2 shows the block diagram of the Nios II processor core.
Figure 4.2: Block diagram of the Nios II processor core with relevant buses
4.4 SoC generation using Nios-II Soft Core
We customize the Nios-II Soft Core Processor as per our requirements; this soft core can be
generated by using Altera's SOPC Builder. SOPC Builder is a powerful system development tool [87] which enables the user to define and generate a complete System-On-a-Programmable-Chip (SOPC) in much less time than using traditional, manual integration methods. It is a part of the Quartus II software. SOPC provides a system which is a combination of hardware and software [88]. The SOPC maintains programmable and reconfigurable hardware, which is the most significant form of SoC. It automates the task of integrating hardware components: the system components are simply specified in a GUI and SOPC Builder generates the interconnect logic automatically. SOPC Builder generates HDL files that define all components of the system, and a top-level HDL file that connects all the components together; it generates either Verilog HDL or VHDL equally [89]. SOPC Builder connects multiple modules together to create a top-level HDL file called the SOPC Builder system, and generates system interconnect fabric that contains the logic to manage the connectivity of all modules in the system. The customized soft core processor for the switch design is shown in figure 4.3, which shows the multiple modules assembled by SOPC Builder to create the top-level SOPC Builder system.
The following modules were selected as required by our design:
• CPU (Nios-II/s processor)
• Static Random Access Memory (SRAM)
• JTAG-UART (interface for communicating with the host computer)
• PIO (8-bit, for input data packets - 4 in number)
• PIO (8-bit, for output data packets - 4 in number)
• PIO (1-bit for the frame pulse signal, 1-bit for reading the clock and 1-bit for reading the output frame pulse - 3 in number)
SOPC builder modules use Avalon interfaces, such as memory-mapped, streaming, and IRQ,
for the physical connection of components. The Avalon interface [90] family defines
interfaces appropriate for streaming high-speed data, reading and writing registers and
memory, and controlling off-chip devices. The Avalon interface is a synchronous interface
defined by a set of signal types with roles for supporting data transfers. There are two types of
Avalon interface ports; master ports and slave ports. Avalon master ports initiate transfers.
Avalon slave ports respond to transfer requests according to the requirements. After adding all
the components we have to assign priority of the components which is IRQ number. As
shown in figure 4.3, here CPU is set to IRQ 0 which gets higher priority. SOPC builder
automatically assign base address to all components. The device family was set to Cyclone-II
and external source clock of 50MHz was selected. With this configuration the soft core
processor is generated.
4.5 Integrating SoC with Switch
The final computing node is implemented by integrating the configured Nios-II processor
generated by SOPC Builder with the switch. This is done by opening a new project in Quartus
–II software and instantiating the Nios-II soft core processor module in the Quartus-II project
along with the switch block component as shown in the figure 4.4.
The following connections were performed:
• The four output PIO buses were connected to the four input data pins of the switch block.
• The output data pins of the switch block were connected to the four input PIO buses.
• A 1-bit PIO output pin was connected to the frame pulse signal of the switch.
• The soft core processor was assigned the external clock of 50 MHz.
• The clock of the switch was assigned as described herewith.
Figure 4.3: Customize soft core processor for switch design
The system thus generated has two components, namely the SoC and the switch. These two components require different clock frequencies. The SoC needs to be assigned a clock frequency at configuration time, before generation. The only clock which can be assigned to this soft core is the external clock of the DE2 development board on which this design is implemented. Thus the clock assigned to the soft core processor is the 50 MHz clock of the DE2 board. The designed switch requires a clock in the range of 350 Hz to 400 Hz, which is of a factor of 15-20 for proper synchronization with the SoC.
Figure 4.4: Schematic view of the computing node consisting of the SoC and a switch
The only possibility remaining is to generate a clock frequency in the range of 350 Hz to 400 Hz from the 50 MHz clock of the SoC. This is done by using a PLL and a few stages of JK flip-flops to divide down to the required clock frequency. Further, it was difficult to synchronize the two components, one running at a high frequency and the other at a low frequency.
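As a rough check of the divider depth (assuming, purely for illustration, that the chain simply halves the 50 MHz board clock at every flip-flop stage; the actual PLL settings used are not reproduced here), the following snippet shows that seventeen divide-by-two stages land inside the required window:

#include <stdio.h>

/* Back-of-the-envelope check of the clock-divider depth, assuming each JK
 * flip-flop stage halves the 50 MHz DE2 board clock.                      */
int main(void)
{
    double f = 50e6;                       /* DE2 board clock            */
    for (int stages = 1; stages <= 20; stages++) {
        f /= 2.0;                          /* each stage divides by two  */
        if (f >= 350.0 && f <= 400.0)
            printf("%d stages -> %.1f Hz\n", stages, f);  /* 17 -> 381.5 */
    }
    return 0;
}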
We resolved this issue by polling the low frequency clock signal back via one of the input PIOs of the SoC and synchronizing the data transfer from the SoC to the switch component by writing appropriate C code which executes on the soft core processor.
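A minimal sketch of what such synchronization code might look like on the Nios II is given below, assuming the divided switch clock is looped back on a 1-bit input PIO and one data byte is driven out per switch clock period; the PIO base-address macros (normally generated into system.h by SOPC Builder) are placeholder names here, and the exact frame-pulse timing of the real software may differ.

#include <stdint.h>
#include "system.h"                      /* PIO base addresses from SOPC Builder */
#include "altera_avalon_pio_regs.h"      /* IORD/IOWR PIO register access macros */

/* SWITCH_CLK_PIO_BASE, DATA_OUT_PIO_BASE and FP_PIO_BASE are placeholder names
 * for the PIO components defined in the SOPC system.                           */
static void wait_rising_edge(void)
{
    while (IORD_ALTERA_AVALON_PIO_DATA(SWITCH_CLK_PIO_BASE) & 0x1)
        ;                                /* wait for the slow clock to go low  */
    while (!(IORD_ALTERA_AVALON_PIO_DATA(SWITCH_CLK_PIO_BASE) & 0x1))
        ;                                /* then wait for its rising edge      */
}

/* Send one 53-byte ATM cell to the switch, one byte per switch clock period. */
void send_cell(const uint8_t cell[53])
{
    wait_rising_edge();
    IOWR_ALTERA_AVALON_PIO_DATA(FP_PIO_BASE, 1);          /* assert frame pulse */
    wait_rising_edge();
    IOWR_ALTERA_AVALON_PIO_DATA(FP_PIO_BASE, 0);
    for (int i = 0; i < 53; i++) {
        IOWR_ALTERA_AVALON_PIO_DATA(DATA_OUT_PIO_BASE, cell[i]); /* one byte   */
        wait_rising_edge();                                      /* per clock  */
    }
}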
4.6 Synthesizing and porting the Computing node on target device Cyclone-II
The computing node generated needs to be tested by porting it onto the target FPGA device. We
have used Altera’s Development and Education board DE2. The DE2 board has many
features that allow the user to implement a wide range of designed circuits, from simple
circuits to various multimedia projects. Figure 4.5 shows the layout and components on DE2
board.
The following hardware components are provided on the DE2 board: Altera Cyclone® II 2C35 FPGA device, Altera Serial Configuration device - EPCS16, USB Blaster (on board) for programming and user Application Program Interface (API) control; both JTAG and Active
Serial(AS) programming modes are supported, 512-Kbyte SRAM, 8-Mbyte SDRAM, 4-
Mbyte Flash memory (1 Mbyte on some boards) and SD Card socket, 50-MHz oscillator and
27-MHz oscillator for clock sources, 10/100 Ethernet Controller with a connector, USB
Host/Slave Controller with USB type A and type B connectors. In addition to these hardware
features, the DE2 board has software support for standard I/O interfaces and a control panel
facility for accessing various components.
The block diagram of the DE2 board is shown in figure 4.6. To provide maximum flexibility
for the user, all connections are made through the Cyclone II FPGA device. Thus, the user can
configure the FPGA to implement any system design.
Figure 4.5: The DE2 board
The detailed information about the blocks in Figure 4.6 is as follows
Cyclone II 2C35 FPGA
• 33,216 LEs, 105 M4K RAM blocks, 483,840 total RAM bits, 35 embedded multipliers, 4
PLLs, 475 user I/O pins, Fine Line BGA 672-pin package
Serial Configuration device and USB Blaster circuit
• Altera’s EPCS16 Serial Configuration device
• On-board USB Blaster for programming and user API control
• JTAG and AS programming modes are supported
SRAM
• 512-Kbyte Static RAM memory chip organized as 256K x 16 bits accessible as memory for
the Nios II processor and by the DE2 Control Panel
SDRAM
• 8-Mbyte Single Data Rate Synchronous Dynamic RAM memory chip organized as 1M x 16
bits x 4 banks accessible as memory for the Nios II processor and by the DE2 Control Panel
Flash memory
• 4-Mbyte NOR Flash memory (1 Mbyte on some boards), 8-bit data bus accessible as
memory for the Nios II processor and by the DE2 Control Panel
SD card socket
Figure 4.6: Block diagram of the DE2 board with various I/O and communication Interfaces
• Provides SPI mode for SD Card access accessible as memory for the Nios II processor with
the DE2 SD Card Driver
Clock inputs
• 50-MHz oscillator
• 27-MHz oscillator
• SMA external clock input
Before compiling the computing node design, the appropriate FPGA pin assignments are made as
shown in table 4.2. After compilation, the assembler module of the compiler creates a binary file
with extension xxx.sof (SRAM object file), which contains the data needed to configure the
FPGA device. The device selected is EP2C35F672C6 of Cyclone II family, which is the
FPGA device used on the DE2 board.
Table 4.2: Pin assignment for EP2C35F672C6 of Cyclone II family for DE2 board
Signal Name FPGA Pin No
SRAM_ADDR[0] PIN_AE4
SRAM_ADDR[1] PIN_AF4
SRAM_ADDR[2] PIN_AC5
SRAM_ADDR[3] PIN_AC6
SRAM_ADDR[4] PIN_AD4
SRAM_ADDR[5] PIN_AD5
SRAM_ADDR[6] PIN_AE5
SRAM_ADDR[7] PIN_AF5
SRAM_ADDR[8] PIN_AD6
SRAM_ADDR[9] PIN_AD7
SRAM_ADDR[10] PIN_V10
SRAM_ADDR[11] PIN_V9
SRAM_ADDR[12] PIN_AC7
SRAM_ADDR[13] PIN_W8
SRAM_ADDR[14] PIN_W10
SRAM_ADDR[15] PIN_Y10
SRAM_ADDR[16] PIN_AB8
SRAM_ADDR[17] PIN_AC8
SRAM_DQ[0] PIN_AD8
SRAM_DQ[1] PIN_AE6
SRAM_DQ[2] PIN_AF6
SRAM_DQ[3] PIN_AA9
SRAM_DQ[4] PIN_AA10
SRAM_DQ[5] PIN_AB10
SRAM_DQ[6] PIN_AA11
SRAM_DQ[7] PIN_Y11
SRAM_DQ[8] PIN_AE7
SRAM_DQ[9] PIN_AF7
SRAM_DQ[10] PIN_AE8
SRAM_DQ[11] PIN_AF8
SRAM_DQ[12] PIN_W11
SRAM_DQ[13] PIN_W12
SRAM_DQ[14] PIN_AC9
SRAM_DQ[15] PIN_AC10
SRAM_WE_N PIN_AE10
SRAM_OE_N PIN_AD10
SRAM_UB_N PIN_AF9
SRAM_LB_N PIN_AE9
SRAM_CE_N PIN_AC11
CLOCK_50 PIN_N2
SW[1] PIN_N26
4.7 Programming and configuring the FPGA device
The FPGA device must be programmed and configured to implement the designed system.
The required configuration file xxx.sof is generated by the Quartus II Compiler's assembler
module. Altera’s DE2 board allows the configuration to be done in two different ways, known
as Joint Test Action Group (JTAG) and Active Serial (AS) modes. The configuration data is
transferred from the host computer (which runs the Quartus II software) to the board by
means of a cable that connects a USB port on the host computer to the leftmost USB
connector on the board. This connection requires the USB-Blaster driver. In the JTAG mode,
the configuration data is loaded directly into the FPGA device, which retains its
configuration as long as the power remains turned on. The configuration information is lost
when the power is turned off.
The second possibility is to use the Active Serial (AS) mode. In this case, a configuration device that
includes some flash memory is used to store the configuration data. Quartus II software places
the configuration data into the configuration device on the DE2 board. Then, this data is
loaded into the FPGA upon power-up or reconfiguration. Thus, the FPGA need not be
configured by the Quartus II software if the power is turned off and on. The choice between
the two modes is made by the RUN/PROG switch on the DE2 board. The RUN position
selects the JTAG mode, while the PROG position selects the AS mode. We configured our
design in the JTAG programming mode and downloaded it onto the Cyclone II FPGA
device by performing the following steps.
1. Connected the DE2 board to the host computer by means of a USB cable plugged into the
USB-Blaster port. Powered on the DE2 board and ensured that the RUN/PROG switch is in
the RUN position.
2. Under Tools > Programmer option select JTAG in the Mode box and select the USB-
Blaster.
3. The configuration file with extension xxx.sof appears in the window.
4. Click the box under Program/Configure to select this action.
5. Press Start to configure the FPGA.
With this process we have ported the complete computing node system, along with the
configuration file, onto the Cyclone II FPGA device of the DE2 board.
4.8 Validating the design for the application program
Having configured the required hardware in the FPGA device, it is now necessary to confirm
the design by creating and executing an application program that performs the desired
operation. The software development environment of the NIOS II processor is called the NIOS II
Integrated Development Environment (IDE). The IDE is used to write, compile and execute the
application program. Application code can be written in either C or C++ [91]. We have used the C
programming language; the complete code is shown in ANNEXURE (Appendix-I). The application
first drives the fp signal high for two clock cycles and generates four 53-byte ATM data packets.
After the fp signal is activated, the program starts sending the ATM packets to the switch input
ports, with the first byte placed at the second rising edge of the clock. The application reads the
clock signal generated by dividing the original 50MHz clock, thus synchronizing itself with the
switch input clock. The program continues to send one byte of data at every clock cycle to the
input port of the switch until the fp_out signal is detected, which indicates the arrival of a data
packet at the output port. Program control is then switched over to reading the output ports of the
switch and directing the data packet to the Nios-II IDE console. The above logic is represented as a
flow chart in figure 4.7. The program is then compiled and downloaded into the Nios II system
implemented on the DE2 board; the output is shown in figure 4.8. As seen from the program in
ANNEXURE (Appendix-I), four different ATM packets were generated and transferred to the four
input ports of the switch. For example, we provided the ATM packet of table 4.3 to input port_1 of
the switch; as per the look-up table information of chapter 3 (table 3.2), one of the entries is as
shown in table 4.4, and the expected output ATM packet received at output port_3 of the switch is
shown in table 4.5. We obtained the same result, as shown in figure 4.8, which confirms that the
computing node implemented on the target Cyclone II FPGA of the DE2 board works as per our
objectives.
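The flow of figure 4.7 can be summarised by the following C sketch. The actual program is the one listed in Appendix-I; FP_PIO_BASE, FP_OUT_PIO_BASE and DATA_IN_PIO_BASE are placeholder PIO names, and the two helpers come from the earlier sketch. fp is asserted for two cycles of the divided clock, the 53-byte cell is pushed one byte per cycle while fp_out is monitored, and once fp_out is seen the cell arriving at the output port is read back and printed on the Nios II IDE console.

#include <stdio.h>
#include "system.h"
#include "altera_avalon_pio_regs.h"

#define CELL_SIZE 53

void wait_rising_edge(void);                     /* helpers from the previous sketch */
void send_byte_synchronised(unsigned char b);

void send_and_capture_cell(const unsigned char cell[CELL_SIZE])
{
    unsigned char rx[CELL_SIZE];
    int i;

    IOWR_ALTERA_AVALON_PIO_DATA(FP_PIO_BASE, 1); /* assert fp for two cycles        */
    wait_rising_edge();
    wait_rising_edge();
    IOWR_ALTERA_AVALON_PIO_DATA(FP_PIO_BASE, 0);

    for (i = 0; i < CELL_SIZE; i++) {            /* one byte per divided clock cycle */
        send_byte_synchronised(cell[i]);
        if (IORD_ALTERA_AVALON_PIO_DATA(FP_OUT_PIO_BASE) & 0x1)
            break;                               /* packet has reached the output    */
    }

    for (i = 0; i < CELL_SIZE; i++) {            /* capture the outgoing cell        */
        wait_rising_edge();
        rx[i] = IORD_ALTERA_AVALON_PIO_DATA(DATA_IN_PIO_BASE) & 0xFF;
    }

    for (i = 0; i < CELL_SIZE; i++)
        printf("%02X ", rx[i]);                  /* dump to the IDE console          */
    printf("\n");
}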
Table 4.3: Input ATM packet sent at port one of the switch.
1st byte | 2nd byte | 3rd byte | 4th byte | 5th byte | bytes 6-48
72 | 73 | 74 | 75 | 76 | 01-48
Table 4.4: Part of the look-up table
Input VCI Output VCI Output port number
3747 56C9 3
Table 4.5: Output ATM packet expected at port number 3
1st byte | 2nd byte | 3rd byte | 4th byte | 5th byte | bytes 6-48
72 | 75 | 6C | 95 | 76 | 01-48
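The header transformation of tables 4.3 to 4.5 can be reproduced with a few lines of C. This is only an illustration of the VCI update, assuming the standard UNI cell header layout in which the 16-bit VCI spans the low nibble of the second byte, the third byte and the high nibble of the fourth byte; the switch itself performs this operation in hardware from its look-up table.

#include <stdio.h>
#include <stdint.h>

/* Extract the 16-bit VCI from a 5-byte UNI cell header (bytes indexed 0..4). */
static uint16_t get_vci(const uint8_t h[5])
{
    return ((h[1] & 0x0F) << 12) | (h[2] << 4) | (h[3] >> 4);
}

/* Write a new VCI back into the same header fields. */
static void set_vci(uint8_t h[5], uint16_t vci)
{
    h[1] = (h[1] & 0xF0) | ((vci >> 12) & 0x0F);
    h[2] = (vci >> 4) & 0xFF;
    h[3] = ((vci & 0x0F) << 4) | (h[3] & 0x0F);
}

int main(void)
{
    uint8_t hdr[5] = { 0x72, 0x73, 0x74, 0x75, 0x76 };  /* header of table 4.3 */

    if (get_vci(hdr) == 0x3747)      /* matches the LUT entry of table 4.4    */
        set_vci(hdr, 0x56C9);        /* new VCI for output port 3             */

    for (int i = 0; i < 5; i++)
        printf("%02X ", hdr[i]);     /* prints 72 75 6C 95 76, as in table 4.5 */
    printf("\n");
    return 0;
}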
Figure 4.7: Program flow for verification of computing node on FPGA
5.1 Overview of Mesh Topology
The basic building block of any mesh topology is the switch/router. The switches in the mesh
topology are connected to the north, south, east and west directions of the adjacent switches in the
topology. Every switch has an embedded computing element, shown as a core interfaced with the
switch, which is shown as a router in figure 5.1.
Figure 5.1: 4x4 Mesh topology with switch (router) and computing element (core)
5.2 Design of NoC Mesh Topology Platform
The 4 X 4 NoC mesh topology requires 16 computing nodes. The computing node designed
and tested in chapter 4 occupies 32% of the logical elements of the Cyclone II FPGA on the DE2
board. Thus it is quite clear that the 4 X 4 mesh topology will not fit on the Cyclone II FPGA
device: three such computing nodes already occupy 96% of the logical elements of this device.
Figure 5.2: Close view of 4X4 Mesh topology interconnection platform using Quartus-II
The option is to generate 16 instances of the switch without a processing element. We
used only one processing element, connected to the first switch, and connected the switches as
shown in figures 5.2 and 5.3 to build the mesh topology platform. Though the design requires one
processing element for every switch, we embedded only one processing element, attached to the
first switch, because multi-processor support in Quartus on the Cyclone-II device is limited.
Figure 5.3: Full view of 4X4 Mesh topology interconnection platform using Quartus-II with
SoC and PLL.
(Figure 5.3 schematic detail: the Quartus-II netlist view lists the SoC block noc_platform with its clk, reset, PIO in/out ports and SRAM address/data/control signals, sixteen voq_switch instances with their fp, data_in/data_out, data_valid, incoming_port_to_output, request and grant ports, a chain of T flip-flop (TFF) clock-divider stages, and a Cyclone II PLL with inclk0 = 50.000 MHz and outputs c0 and c1 at ratios 1/5 and 1/2 of the input clock.)
Also, multi-core support for the µCOS Real Time Operating System (RTOS) is not available in the
Altera Monitor Program software development tool for NIOS-II. Hence, we customized the design
for a single processing element and established the design interconnection for the two-dimensional
mesh topology. The processing element connected to the first switch can be used to generate, send
and receive ATM data packets.
After compiling the 4 X 4 NoC mesh topology, the compilation summary report shown in
figure 5.4 indicates that the design was unable to fit on the Cyclone II FPGA. Over various trials
of decreasing the number of switches, we could fit only 6 switches and one processing element,
connected to the first switch, on the FPGA. The logic utilization for 6 switches and one processing
element on the Cyclone II FPGA is 97% of the 33,216 logical elements, as shown in figure 5.5;
32,208 logical elements of the Cyclone II device were utilized.
5.3 The 3X2 NoC Mesh Topology verification with RTOS
We scaled the 4 x 4 (16-node) mesh topology platform down to a 3 x 2 (6-node) mesh topology
platform. The reason is the logical-element utilization on the selected device EP2C35F672C6 of
the Cyclone II family, which is the FPGA device used on the DE2 board. In order to design the
3x2 mesh topology platform we used our switch design shown in figure 4.1; six such switches are
required to build the platform. All switches were configured with the appropriate look-up tables
required to decide the deterministic routing path, as per the shortest path algorithm, from any node
to any node in the network.
We first performed a functional simulation of all switches by creating input waveform vector
files to test the functionality of the switches. Before connecting the switches into the 3x2 mesh
topology, we connected the processing element to these switches.
The customized soft core processor is interfaced with the switch as a computing node. It was also
observed that the Cyclone-II device in Quartus-II does not fully support multi-core processors or
multiple instances of the µCOS Real Time Operating System (RTOS) in the Altera Monitor
Program software development tool for NIOS-II. Hence, we connected the 6 switches into the 3x2
mesh topology with one processing element attached to the first switch, as shown in figure 5.6.
The 3x2 mesh topology was then synthesized on the target FPGA device. The compilation result
shows a total logical-element utilization of 97%. We then verified the design for two different
cases, as described below.
Figure 5.6: 3x2 mesh topology interconnection platform
Case-I: Verification of 3x2 mesh topology at first switch
This test verifies the design ported onto the target FPGA device of the DE2 board. The application
program, written in the C programming language, generates four different 53-byte ATM packets.
The ATM packets generated contain the VCI information in the first five bytes, as shown in table
3.3, and the switch is configured as per look-up table 3.2. The program is then compiled and
downloaded into the SoC system ported on the DE2 board. The generated ATM packet is sent to
one of the input ports of the connected switch. As the packet
is read by the switch, the switch checks the VCI information of the packet, updates the old VCI
with the new VCI information as per the look-up table maintained in the switch, and decides the
output port for the packet. The packet is then forwarded to the decided output port once the
scheduler grants the request. To check whether the packet reached the respective output port
correctly, we capture it using the soft core processor; we therefore connect the output port of the
switch to an input PIO pin of the SoC. The processor is involved in two tasks, one for sending the
ATM packet and the other for receiving it. These two tasks need to be performed in real time,
which is implemented using the µCOS Real Time Operating System. Two tasks, a low priority task
and a high priority task, are defined in the program; the scheduling diagram is shown in figure 5.7.
The low priority task sends the ATM packet to the input port of the switch, transmitting one byte
at every clock cycle. The fp_out signal at the output of the switch indicates the arrival of the
packet. The high priority task loops continuously, checking for the fp_out signal; as soon as it
detects this signal, the processor suspends the low priority task and jumps to the high priority task
to receive the ATM packet. The high priority receive task loops for 53 clock cycles, which equals
the size of an ATM packet. The processor then switches back to the low priority task and continues
to send another packet, while polling through the high priority task.
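A skeleton of this two-task structure under µC/OS-II is sketched below; the priorities, stack sizes and the three placeholder routines are illustrative only (the actual tasks are those of the application code), and it assumes the Altera BSP has already performed OSInit() before main() is entered. In µC/OS-II a lower priority number means a higher priority, so the receive task pre-empts the send task as soon as fp_out is seen.

#include "includes.h"                 /* uC/OS-II + BSP headers (Altera project)  */

#define TASK_STACKSIZE 2048
#define RX_TASK_PRIO   1              /* high priority: capture the arriving cell */
#define TX_TASK_PRIO   2              /* low priority: generate and inject cells  */

OS_STK tx_stk[TASK_STACKSIZE];
OS_STK rx_stk[TASK_STACKSIZE];

void send_next_atm_byte(void);        /* placeholders for the application code    */
int  fp_out_detected(void);
void read_atm_cell(void);             /* loops for the 53 bytes of one cell       */

void tx_task(void *pdata)             /* low priority sender                      */
{
    while (1)
        send_next_atm_byte();         /* one byte per divided clock cycle         */
}

void rx_task(void *pdata)             /* high priority receiver                   */
{
    while (1) {
        if (fp_out_detected())
            read_atm_cell();
        OSTimeDly(1);                 /* yield so the sender can make progress    */
    }
}

int main(void)
{
    OSTaskCreate(tx_task, NULL, &tx_stk[TASK_STACKSIZE - 1], TX_TASK_PRIO);
    OSTaskCreate(rx_task, NULL, &rx_stk[TASK_STACKSIZE - 1], RX_TASK_PRIO);
    OSStart();                        /* hand control to the uC/OS-II scheduler   */
    return 0;
}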
Case II: Verification of 3x2 mesh topology at last switch
In order to verify the 3x2 mesh topology, we connected the switches as shown in figure 5.6. We
fixed the first switch as the source node and the sixth switch as the destination node. The static
routing decision that fixes the shortest path from the source node to the destination node is made
through the appropriate LUT entries in the respective switches; the packet then simply follows the
deterministic path set. The switches are configured with the routing table information in the form
of look-up tables. The program to generate the ATM packet is written in the C language and, after
successful compilation, downloaded into the SoC system implemented on the DE2 board. The
ATM packet is injected into the input port of the first switch. As the packet enters the input module
of the first switch, the switch checks the input VCI information and assigns the output port,
updating the packet with the new VCI information as per the look-up table maintained in the
switch. The packet is then forwarded to the output port once the scheduler grants permission. The
packet similarly follows the path through the different switches as decided by the routing
algorithm. Each switch updates the VCI information of the packet and forwards it to the next
switch as per the look-up table maintained in that switch, thereby reaching the destination node.
Here again we defined two tasks, a low priority and a high priority task, as discussed in Case-I.
The lower priority task generates the ATM packet and forwards it to the first switch. The high
priority task polls for the fp_out signal, which indicates the arrival of the packet at the output port.
The output port of the destination node, i.e. the sixth switch, is connected to the input PIO of the
Nios-II soft core processor. Once the fp_out signal is detected, the processor stops sending the
input packet and starts reading the ATM packet at the input PIO of the Nios-II soft core processor.
Figure 5.7: Real Time system priority scheduler for transmitting and receiving the ATM packets
NoC verification techniques can be classified into simulation, formal verification, and
mathematical proof. Simulation is the most popular and widely used technique. Many NoC
simulators, like NS2, Noxim and XMulator, have been designed to evaluate performance in terms
of throughput and latency. Various interconnection topologies, router/switch designs, routing
algorithms, arbitration algorithms and traffic patterns have been modeled and simulated. During
simulation, exhaustive checking of all system behaviors is time-consuming and nearly impossible
for a complex NoC-based SoC. However, some functional properties, like mutual exclusion,
deadlock-freedom, and starvation-freedom, require exhaustive exploration; thus simulation cannot
be used to complete such verifications. To resolve these issues, one can use a conventional
mathematical proof method. One such method, used by Jones et al. [92] to verify that the ordering
of packet transactions is correct, is deductive proof. Further,
to prove the deadlock-freedom of routing algorithms different deduction methods are used
[93], [94]. A channel dependency graph method is proposed by Dally and Seitz [93] to prove
that the routing algorithm in NoC is deadlock free, if and only if, there is no cycle in the
channel dependency graph. Also by applying the proposed theorem Lin et al. [94] proved that
the XY-based wormhole routing on a mesh-based NoC design is deadlock free. Though
mathematical proof is powerful, it has two defects. First, it is difficult to automate; little
automation means that more non-trivial human effort is required, and applying this proof method
to complex real cases is not very practical. Second, if some functional properties are disproved, a
mathematical proof may not be able to point out how the given properties are violated. These
issues can be addressed by some popular formal verification techniques, such
as model checking [95] and theorem proving. Different kinds of communication properties in
an NoC were verified by theorem proving [96], [97], like deadlock freedom in a routing
algorithm. Model checking had also been used to prove deadlock freedom and the correctness
of a protocol design [98]. One such model-checking-based formal verification procedure was
developed to verify and validate the routing microarchitecture in a Network-on-chip (NoC)
communication infrastructure by Yean-Ru Chen et al [99], where they investigated four
properties of an NoC router namely mutual exclusion, starvation freedom, deadlock freedom
and conditions for traffic congestions. They proposed some guidelines for constructing formal
models of an NoC router, and analyzed minimal formal models essential for verifying these
four properties. To perform the verification task they applied a popular formal verification
model checking tool State Graph Manipulators (SGM). Results obtained through formal
verification of these four properties provide useful insights to refine the protocol design.
Another such study was performed by Anam Zaman and Osman Hasan [100], where they
performed the formal verification of circuit-switched NoC using the SPIN model checker. The
proposed methodology provides generic modelling guidelines and identifies some properties,
including deadlock freedom, starvation freedom, mutual exclusion and liveness that are quite
useful in the context of circuit-switched NoC. They used the proposed methodology to verify
the programmable NoC (PNoC) architecture, which is one of the most widely used circuit-
switched NoCs.
6.1 Performance Evaluation and Analysis of Mesh Topology
Two-dimensional 4x4 and 8x8 mesh topologies were modeled and simulated using the network
simulator NS2 [1]. NS2 is an open source, object-oriented and discrete event driven network
simulator written in C++ and OTcl. It is a very common and widely used tool to simulate
small and large area networks. Due to similarities between NoCs and networks, NS2 has been
the choice of many NoC researchers to simulate and observe the behavior of a NoC at a higher
abstraction level of design. Mesh based NoC topologies are receiving attention because of
their modularity and the ability to expand by adding new nodes and links without any
modification of the existing node structure. One advantage of mesh is that it can be
partitioned into smaller meshes, which is a desirable feature for parallel applications. In this
topology each switch is connected to four neighboring switches and one resource. The
number of switches required is equal to the number of resources. Switch, resource and link
are three basic elements in the topology.
An inter-communication path between the switches is composed of links. Each node is
connected with point-to-point bidirectional links. The bandwidth and latency of a link are
configurable. When a link between two nodes is down, packets cannot travel between these
nodes in either direction.
An inter-communication path is composed of a set of links identified by the routing
strategy. The simulator models 4 x 4 (16-node) and 8 x 8 (64-node) 2-D meshes in which the
routing decision is taken at the source node using a source routing methodology. A shortest
communication path is selected for each traffic pair using deterministic XY routing before a
simulation run.
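As an illustration of such a deterministic decision (not the Appendix Tcl code), XY routing on a mesh with a given number of columns can be written as a simple next-hop function, assuming the row-major node numbering N(0)..N(15) and N(0)..N(63) used for the two models:

/* Deterministic XY routing on a 2-D mesh with 'cols' columns: move along X
 * (east/west) until the destination column is reached, then along Y.
 * Node ids are assumed row-major: node = row * cols + col. */
typedef enum { LOCAL, EAST, WEST, NORTH, SOUTH } direction_t;

direction_t xy_next_hop(int current, int dest, int cols)
{
    int cx = current % cols, cy = current / cols;
    int dx = dest % cols,    dy = dest / cols;

    if (cx < dx) return EAST;
    if (cx > dx) return WEST;
    if (cy < dy) return SOUTH;        /* towards higher row indices */
    if (cy > dy) return NORTH;
    return LOCAL;                     /* already at the destination */
}

Under this assumption, the route from N(0) to N(15) on the 4x4 mesh is three hops east followed by three hops south, i.e. a six-hop shortest path.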
To evaluate the performance of the mesh interconnection networks, the following
parameters were considered: latency, throughput and bandwidth. Latency is the time required to
transfer an n-byte packet from source to destination; it consists of routing delay, contention delay,
channel occupancy and overhead. Throughput is the total number of packets received by the
destination per unit time. Bandwidth of a communication link is the amount of data that can be
moved over the link in a unit time period. These parameters can be specified and varied by writing
Tcl code; Tcl is used for specifying the mesh interconnection network simulation model and
running the simulation. The implementation of the mesh interconnection networks uses source
routing to send packets from the source node to the destination node. The complete Tcl code
written to build the 4 X 4 and 8 X 8 mesh topologies is given in ANNEXURE (Appendix-II and III).
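Written out explicitly (a simplified decomposition consistent with the components listed above, where n is the packet size and B the link bandwidth, so that n/B is the channel occupancy term):

    Latency ≈ routing delay + contention delay + n/B + overhead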
6.2 Simulation Environment for m x n mesh topology
For the evaluation, a detailed event-driven simulator has been developed. This simulator
models 4 x 4 (16-node) and 8 x 8 (64-node) 2-D meshes in which the routing decision is taken at
the source node using a source routing methodology. Each node is connected with point-to-point
bidirectional serial links. The Transmission Control Protocol (TCP) is used to transfer data, with a
static routing table maintained at each router to establish the path from source router to destination
router using the shortest path algorithm. The simulation uses a DropTail queue, which drops
packets at the end of the queue during congestion; in other words, the queue is served on a first-in
first-out basis. The TCP transport agent is attached to source node N(0), and two different traffic
pattern applications, namely File Transfer Protocol (FTP) and Constant Bit Rate (CBR), were
tested. FTP generates bulk data in a random fashion and CBR generates packets at a constant bit
rate. We studied the performance of the 4 x 4 and 8 x 8 mesh topologies with respect to
link bandwidth, link delay, queue size and packet size parameters using FTP and CBR traffic.
Different scenarios were created by varying the above parameters, and the performance of the
topology in terms of throughput and delay was studied. All these modeled parameters are
described as a script file in Tcl. The parameters chosen for simulation are shown in table 6.1.
Table 6.1: Simulation parameters
NoC Model Parameter Parameter Constraint applied in NS2
Topology Mesh
Connections Resource-Router, Router-Router
Transmission Protocols Transmission Control Protocol(TCP)
Routing Scheme Static
Routing Protocol Shortest Path
Queue Mechanism Drop Tail(FIFO)
Link Queue Varying
Link Delay 10ms
Traffic Generation FTP/CBR
Traffic rate Varying
Number of Nodes 4 x 4 (16-node), 8 x 8 (64-node)
Simulation time 20 sec
Node N (0) and node N (15) were fixed as source and destination node for 4 x 4 mesh
topology and node N(0) and node N(63) were fixed as source and destination node for 8 x 8
mesh topology for simulation.
6.3 Scenario-1 Throughput and delay performance with varying packet size
In Scenario 1 we implemented the 4x4 and 8x8 mesh topology simulation models, and a
comparative performance study of these models was carried out in terms of throughput and
delay.
Table 6.2: Various parameter values for scenario 1
SCENARIO-1
4x4 4x4 8x8 8x8
CBR 5Mbps 10Mbps 5Mbps 10Mbps
Link delay 10ms 10ms 10ms 10ms
Link BW 10Mbps 10Mbps 10Mbps 10Mbps
Queue size 1 1 1 1
Packet size 0.1 to 16384 kbytes 0.1 to 16384 kbytes 0.1 to 16384 kbytes 0.1 to 16384 kbytes
Table 6.3: Throughput and delay performance with varying packet size for 4x4 mesh topology
Packet size in Kbytes | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
0.1 22 0.060673645 22 0.060676365
0.2 38 0.061154818 38 0.061159502
0.4 70 0.062117192 70 0.062125849
0.8 129 0.064042039 131 0.064058853
1.6 248 0.067892214 248 0.067926051
3.2 455 0.075594281 461 0.075665262
6.4 596 0.14557829 603 0.14513425
12.8 598 0.3481888 601 0.34803309
25.6 603 0.74818304 601 0.75165174
51.2 601 1.3935586 592 1.4013148
64 608 1.696563 591 1.6889675
128 619 2.8467846 573 2.8816504
256 601 3.9446304 569 3.9241282
512 574 5.3739556 540 5.5336179
1024 508 7.708185 484 7.9549006
2048 367 17.44648 295 18.805834
4096 333 14.259669 148 13.167424
8192 340 28.459136 151 26.274624
16384 0.040 0.060192 0.040 0.060224
Table 6.4: Throughput and delay performance with varying packet size for 4x4 mesh topology
Packet size in Kbytes | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
0.1 22 0.060673645 22 0.060676365
0.2 38 0.061154818 38 0.061159502
0.4 70 0.062117192 70 0.062125849
0.8 129 0.064042039 131 0.064058853
1.6 248 0.067892214 248 0.067926051
3.2 455 0.075594281 461 0.075665262
6.4 596 0.14557829 610 0.14362082
12.8 598 0.3481888 620 0.33858433
25.6 603 0.74881053 615 0.73858948
51.2 597 1.4003808 604 1.4039356
64 605 1.711896 595 1.7167413
128 619 2.8817046 582 2.8930515
256 594 3.9421148 585 4.0341319
512 565 5.4201113 540 5.4407722
1024 508 7.708185 484 7.9549006
2048 367 17.44648 295 18.805834
4096 333 14.252736 267 18.619701
8192 340 28.459136 151 26.274624
10000 343 34.726869 0.040 0.060224
16384 0.040 0.060192 0.040 0.060224
Table 6.5: Throughput and delay performance with varying packet size for 8x8 mesh topology
Packet size in Kbytes | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
0.1 9 0.14157171 10 0.14157815
0.2 16 0.14269432 16 0.14270533
0.4 30 0.14493961 30 0.14495999
0.8 56 0.14943039 56 0.14946999
1.6 104 0.15841294 105 0.15849282
3.2 193 0.17638191 195 0.17654995
6.4 344 0.2123365 348 0.21270924
12.8 569 0.2843086 573 0.28521036
25.6 589 0.67421239 591 0.6721586
51.2 554 1.3370749 559 1.341439
64 555 1.6462719 552 1.6551183
128 549 2.7536893 512 2.7953078
256 492 3.9690716 476 3.9890783
512 395 6.136064 381 6.1676202
1024 317 11.773117 207 11.861354
2048 233 22.054084 161 22.805051
4096 160 31.816192 75 26.35488
8192 0.040 0.140448 0.040 0.14048
16384 0.040 0.140448 0.040 0.140448
Table 6.6: Throughput and delay performance with varying packet size for 8x8 mesh topology
Packet size in Kbytes | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
0.1 9 0.14157171 10 0.14157815
0.2 16 0.14269432 16 0.14270533
0.4 30 0.14493961 30 0.14495999
0.8 56 0.14943039 56 0.14946999
1.6 104 0.15841294 105 0.15849282
3.2 193 0.17638191 195 0.17654995
6.4 344 0.2123365 348 0.21270924
12.8 569 0.2843086 573 0.28521036
25.6 589 0.67421239 596 0.67000229
51.2 554 1.3370749 559 1.341439
64 555 1.6462719 552 1.6551183
128 549 2.7536893 512 2.7953078
256 492 3.9690716 476 3.9901596
512 395 6.136064 381 6.1765245
1024 317 11.650232 241 12.010315
2048 233 22.054084 161 22.805051
4096 160 31.816192 75 26.35488
8192 0.040 0.140448 0.040 0.14048
16384 0.040 0.140448 0.040 0.140448
Here, we kept the link bandwidth constant at 10Mbps and the propagation delay of the link
constant at 10ms; the queue size was fixed to hold a maximum of 1 packet at a time, and the
packet size was varied from 0.1 Kbytes to 16384 Kbytes. Two types of traffic pattern applications
were executed on these models: FTP, which generates bulk data randomly, and CBR with 5Mbps
and 10Mbps generation rates. The packets were generated at source node N(0) and transmitted to
the destination node N(15) or N(63) for the 4x4 and 8x8 mesh topologies respectively, using the
shortest path algorithm. The simulations were run for 20 seconds. The results are shown in tables
6.3, 6.4, 6.5 and 6.6. It is observed (figure 6.1) that for the FTP and CBR (5Mbps and 10Mbps)
applications, as the packet size increases, the throughput initially increases linearly and then
saturates for packet sizes in the range of 6.4 Kbytes to 512 Kbytes, with maximum values of
619 Kbytes for FTP and 620 Kbytes for CBR, corresponding to 4.951Mbps and 4.960Mbps
respectively, before degrading for larger packets. Though we set the link bandwidth to 10Mbps,
the bandwidth is shared between the two directions of the bidirectional link, so each direction gets
5Mbps; the result therefore depicts close to 100% bandwidth exploitation. The throughput for the
4x4 topology is higher than for the 8x8 mesh topology because the path from source node N(0) to
destination node N(63) in the 8x8 mesh topology is longer than the path from source node N(0) to
destination node N(15) in the 4x4 mesh topology.
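As a quick check on the units (treating the tabulated throughput as Kbytes per second): 619 Kbytes/s x 8 bits per byte ≈ 4.95 Mbps and 620 Kbytes/s x 8 ≈ 4.96 Mbps, essentially the full 5 Mbps per-direction share of the 10 Mbps link.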
The performance of the two mesh topology models was also studied for delay characteristics.
Here we kept the link bandwidth, link delay and queue size constant, as shown in table 6.2
(Scenario 1). Packets with sizes varying from 0.1 Kbytes to 16384 Kbytes were generated at
source node N(0) of the 4 x 4 and 8 x 8 mesh topology models using two different traffic patterns,
FTP and CBR (at 5 Mbps and 10 Mbps rates), and transmitted to destination nodes N(15) and
N(63) of the 4 x 4 and 8 x 8 mesh topologies respectively using the shortest path algorithm. The
protocol used is TCP and the simulation was run for 20 seconds. The results obtained are shown in
tables 6.3, 6.4, 6.5 and 6.6, and the plot is shown in figure 6.2. It is observed from figure 6.2 that
for both FTP and CBR traffic, as the packet size increases the delay also increases gradually; it is
highest for packet sizes of 1024 Kbytes to 8192 Kbytes for the 4 x 4 mesh topology and 1024
Kbytes to 4096 Kbytes for the 8 x 8 mesh topology, and then drops drastically. This is the
saturation point at which the throughput also reaches its peak, fully exploiting the parameter values
set. Further increase in packet size leads to a gradual decrease in delay. The delay for the 8 x 8
mesh topology is higher than for the 4 x 4 mesh topology model, due to the increase in the number
of hops and the path length between source node and destination node.
Figure 6.1: Throughput in Kbytes v/s Packet size in Kbytes
Figure 6.2: Delay per packet in ms v/s Packet size in Kbytes
6.4 Scenario-2 Throughput and delay performance with varying Queue size for low load
and high load packets
Table 6.7: Various parameter values for scenario 2
SCENARIO-2
4x4 4x4 8x8 8x8
CBR 10Mbps 10Mbps 10Mbps 10Mbps
Link delay 10ms 10ms 10ms 10ms
Link BW 10Mbps 10Mbps 10Mbps 10Mbps
Queue size 5-200 5-200 5-200 5-200
Packet size 512 bytes (low load) 64 kbytes (high load) 512 bytes (low load) 64 kbytes (high load)
Table 6.8: Throughput and delay performance by varying Queue size with low load packets (512bytes) on 4x4 mesh topology
Queue size | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
5 86 0.062656301 85 0.062660796
10 87 0.062656136 86 0.062660648
20 87 0.062656136 87 0.06266704
40 87 0.062656136 87 0.06266704
80 87 0.062656136 87 0.06266704
100 87 0.062656136 87 0.06266704
200 87 0.062656136 87 0.06266704
Table 6.9: Throughput and delay performance by varying Queue size with high load packets (64kbytes) on 4x4 mesh topology
Queue size | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
5 511 0.46477903 515 0.46720931
10 583 0.63688067 520 0.6113077
20 580 0.97672589 581 0.98149923
40 605 1.711896 595 1.7167413
80 605 1.711896 595 1.7167413
100 605 1.711896 595 1.7167413
200 605 1.711896 595 1.7167413
Table 6.10: Throughput and delay performance by varying Queue size with low load packets (512bytes) on 8x8 mesh topology
Queue size | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
5 35 0.14619776 35 0.14620876
10 37 0.14619701 36 0.14620821
20 37 0.14619701 37 0.1462227
40 37 0.14619701 37 0.1462227
80 37 0.14619701 37 0.1462227
100 37 0.14619701 37 0.1462227
200 37 0.14619701 37 0.1462227
Table 6.11: Throughput and delay performance by varying Queue size for high load packets (64kbytes) on 8x8 mesh topology
Queue size | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
5 433 0.91622105 432 0.92468632
10 470 1.0218441 477 1.0341071
20 505 1.2934178 502 1.3265322
40 555 1.6462719 552 1.6551183
80 555 1.6462719 552 1.6551183
100 555 1.6462719 552 1.6551183
200 555 1.6462719 552 1.6551183
In this scenario (table 6.7) we studied the performance of the two network models, namely the
4 x 4 and 8 x 8 mesh topologies, in terms of throughput in Kbytes and delay per packet in
milliseconds by changing the queue size of the nodes. The queue size is the number of packets a
node can hold during the transfer of data, and it plays a very important role in the performance of
the network. If the size is small, the node may discard packets due to unavailability of buffer
space; this causes retransmission of the same packet from source to destination, which leads to
delay. Declaring the queue size too large may also lead to area overhead and delay due to the
service-time problem: the larger the queue, the longer packets wait in the buffer before being
processed by the node. It may happen that by the time a packet reaches the head of the queue for
processing, it is discarded because its time-to-live field has hit zero. This again leads to
retransmission of the packet, thereby increasing packet delay. Thus the size of the queue plays a
crucial role in the performance of the network. We kept the bandwidth of the link constant at
10Mbps and the link delay at 10 ms and, by varying the queue size from 5 to 200 packets, studied
the performance for two different network loads: low load, where the packet size is 512 bytes, and
high load, where the packet size is 64 Kbytes. The packets were generated at node N(0) using FTP
and CBR (at a 10Mbps rate) and were transmitted to destination node N(15) of the 4x4 mesh
topology and N(63) of the 8x8 mesh topology using the shortest path algorithm. The results are
shown in tables 6.8, 6.9, 6.10 and 6.11. It is observed from figure 6.3 that the throughput for both
the FTP and CBR applications is 4 to 5 times higher for high load than for low load in both mesh
topology models. This is because the amount of data generated under high load (64 Kbytes
packets) is large, and hence a large amount of data is transmitted from the source node and
received at the destination node, increasing the throughput compared to low load (512 bytes
packets). It is further observed that CBR (at a 10Mbps rate) under high load gives higher
throughput than the FTP application. Also, the throughput of the 4x4 mesh topology at high load
as well as low load is higher than that of the 8x8 mesh topology; the reason is the shorter path
length from source to destination in the 4x4 mesh topology. We also noticed that the throughput of
the 4x4 mesh topology saturates when the queue size reaches 20, while that of the 8x8 mesh
topology saturates when the queue size reaches 40. This indicates that a queue size of about 20 is
sufficient for the 4x4 mesh topology and about 40 for the 8x8 mesh topology, corresponding to a
factor of 5 of the mesh dimension.
We also carried out a performance study of the 4x4 and 8x8 mesh topologies for the average
transmission delay, in milliseconds, that a packet takes to travel from the source node to the
destination node. We kept the bandwidth of the link constant at 10Mbps and the link delay at
10ms, and varied the queue size from 5 to 200 packets per node. The simulation was carried out
for low load, i.e. a packet size of 512 bytes, and for high load with a packet size of 64 Kbytes. The
packets were generated at source node N(0) using FTP and CBR (at a 10Mbps rate) traffic
generation and transmitted to destination nodes N(15) and N(63) of the 4x4 and 8x8 mesh
topology models respectively using the shortest path algorithm. The simulation was set for 20
seconds. The results obtained are shown in tables 6.8, 6.9, 6.10 and 6.11. It is observed from
figure 6.4 that the transmission delay for high load packets is higher than for low load in both FTP
and CBR (at a 10Mbps rate), because at high load more data bytes are transmitted than at low
load. In the 4 x 4 mesh topology at high load, when the queue size is 5 packets per node, the
transmission delay per packet is 0.464ms and 0.4672ms for FTP and CBR respectively; it
increases linearly up to 1.711ms and 1.716ms for FTP and CBR respectively at a queue size of 40
and thereafter remains constant. In the 8 x 8 mesh topology at high load, the delay per packet is
0.916ms and 0.92ms when the queue size is 5 packets per node, increasing linearly to 1.64ms and
1.65ms and remaining constant at these values for queue sizes from 40 to 200 for FTP and CBR
traffic respectively. Under high load the performance of the 4x4 mesh topology is better than that
of the 8x8 mesh topology; the reason is that the path length between source and destination is
shorter and the number of hops is smaller in the 4x4 than in the 8x8 mesh topology.
Figure 6.3: Throughput in Kbytes v/s Queue size.
Figure 6.4: Delay per packet in ms v/s Queue size
6.5 Scenario-3 Throughput and delay performance with varying link Bandwidth for low
load and high load packet
Table 6.12: Various parameter values for scenario 3
SCENARIO-3
4x4 4x4 8x8 8x8
CBR 10Mbps 10Mbps 10Mbps 10Mbps
Link delay 10ms 10ms 10ms 10ms
Link BW 10Mbps-200Mbps 10Mbps-200Mbps 10Mbps-200Mbps 10Mbps-200Mbps
Queue size 1 1 1 1
Packet size 512 bytes (low load) 64 kbytes (high load) 512 bytes (low load) 64 kbytes (high load)
Table 6.13: Throughput and delay performance by varying link Bandwidth with packet size of 0.512Kbytes on 4x4 mesh topology
BW in Mbps | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
10 87 0.062656136 87 0.06266704
20 87 0.061328028 88 0.061333417
40 87 0.060664004 89 0.060666683
80 89 0.060332001 89 0.060333337
100 89 0.0602656 89 0.060266666
200 89 0.060132799 89 0.060133332
Table 6.14: Throughput and delay performance by varying link Bandwidth with packet size of 64Kbytes on 4x4 mesh topology
BW in Mbps | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
10 605 1.711896 595 1.7167413
20 1243 0.94410154 1228 0.95028378
40 2463 0.28046868 1228 0.27797226
80 4939 0.099205179 1228 0.10170933
100 6224 0.093661094 1228 0.094316757
200 9210 0.075634592 1228 0.076104124
Table 6.15: Throughput and delay performance by varying link Bandwidth with packet size of 0.512Kbytes on 8x8 mesh topology
BW in Mbps | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
10 37 0.14619701 37 0.1462227
20 37 0.1430984 37 0.14311116
40 38 0.14154918 38 0.14155549
80 37 0.14077458 38 0.14077774
100 37 0.14061966 38 0.14062218
200 38 0.14030983 38 0.14031109
Table 6.16: Throughput and delay performance by varying link Bandwidth with packet size of 64Kbytes on 8x8 mesh topology
BW in Mbps | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
10 555 1.6462719 552 1.6551183
20 1207 0.87346923 1208 0.87257966
40 2389 0.32198518 1228 0.32419287
80 3327 0.23144778 1228 0.23236135
100 3507 0.21259688 1228 0.21403118
200 3915 0.17607773 1228 0.17653149
In Scenario 3 (table 6.12) the performance of the 4x4 and 8x8 mesh topology models was studied
for throughput and transmission delay by varying the link bandwidth. The simulations were
carried out for 20 seconds, keeping the propagation delay of the link constant at 10 milliseconds
and varying the link bandwidth from 10Mbps to 200Mbps. Two different traffic patterns were
used, namely FTP for bulk generation of data and CBR, which generates data at a 10Mbps bit rate.
We also used two different packet sizes: low load, where the size of the packet is kept constant at
512 bytes, and high load, where the size of the packet was kept at 64 Kbytes. The traffic generated
at the source node N(0) was transmitted to destination node N(15) or N(63) of the respective mesh
topology model using the shortest path algorithm. The results obtained are shown in tables 6.13,
6.14, 6.15 and 6.16. We observed from figure 6.5 that the throughput for low load, i.e. 0.512
Kbytes packets, is very low, below 100 Kbytes, compared to high load with a packet size of 64
Kbytes; for high load the throughput is 20 to 25 times higher than for low load. This is because for
low load packets the amount of data generated is very small compared to high load, and the
amount of data transmitted and received is correspondingly low, leading to low throughput; the
bandwidth is not efficiently utilized in the low load case. The results also show that at high load
the FTP traffic performance is better than that of CBR traffic, since we kept the CBR rate constant
at 10Mbps, which is too low to exploit the given bandwidth, and hence the CBR throughput
remains saturated even after increasing the bandwidth. With FTP traffic the throughput increases
linearly as the bandwidth increases, exploiting the bandwidth almost up to 99%, and it is found to
be best for the 4x4 mesh topology compared to the 8x8 mesh topology due to the shorter path
length between source node and destination node.
We also carried out a performance study of the 4x4 and 8x8 mesh topology models for
transmission delay by varying the link bandwidth from 10Mbps to 200Mbps. We calculated the
amount of time a packet takes to travel from the source node to the destination node for two
different loads, namely low load (0.512 Kbytes) and high load (64 Kbytes), keeping the link
propagation delay at 10 milliseconds and using FTP and CBR traffic. Nodes N(0) and N(15), and
nodes N(0) and N(63), were fixed as source and destination nodes for the 4x4 and 8x8 mesh
topologies respectively. The results are shown in tables 6.13, 6.14, 6.15 and 6.16. It is observed
from figure 6.6 that for low load, i.e. a 0.512 Kbytes packet size, packets take less transmission
time than for high load with a 64 Kbytes packet size.
Figure 6.5: Throughput in Kbytes v/s Bandwidth in Mbps
The lower delay is due to the packet size, which is 128 times smaller for low load than for high
load. For high load the transmission delay is initially high for link bandwidths from 10Mbps to
40Mbps, and the performance is almost the same as for low load from 40Mbps onwards. The best
packet transmission delay performance of the 4x4 and 8x8 mesh topology models is observed for
link bandwidths above 40Mbps, for both high and low load, under the CBR and FTP traffic
patterns.
Figure 6.6: Delay per packet in ms v/s Bandwidth in Mbps
6.6 Scenario-4 Throughput and delay performance with varying propagation delay of link with
low load and high load packets
Table 6.17: Various parameter values for scenario 4
SCENARIO-4
4x4 4x4 8x8 8x8
CBR 10Mbps 10Mbps 10Mbps 10Mbps
Link delay 10ms- 200ms 10ms- 200ms 10ms- 200ms 10ms- 200ms
Link BW 10Mbps 10Mbps 10Mbps 10Mbps
Queue size 1 1 1 1
Packet size 512 bytes (low load) 64 kbytes (high load) 512 bytes (low load) 64 kbytes (high load)
Table 6.18: Throughput and delay performance by varying link propagation delay with packet size of 512 bytes on 4x4 mesh topology
Link delay in ms | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
10 87 0.062656136 87 0.06266704
20 44 0.1226626 44 0.1226843
40 21 0.24267566 22 0.24271969
80 11 0.48270278 11 0.48279377
100 8 0.60271764 8 0.60283319
200 4 1.2027966 4 1.2030367
Table 6.19: Throughput and delay performance by varying link propagation delay of link with
packet size 64Kbytes on 4x4 mesh topology
Link delay in ms | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
10 605 1.711896 595 1.7167413
20 600 1.8568241 596 1.8768459
40 595 1.7423041 585 1.7633757
80 575 1.5251242 563 1.547276
100 564 1.4170311 555 1.440462
200 393 1.5264711 382 1.5615011
Table 6.20: Throughput and delay performance by varying link propagation delay with packet size 512bytes on 8x8 mesh topology
Link delay in ms | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
10 38 0.14030983 38 0.14031109
20 18 0.28621169 18 0.28626355
40 9 0.56624181 9 0.56634856
80 4 1.1263099 4 1.1265313
100 3 1.4063443 3 1.4066442
200 0.461 2.8061652 0.461 2.806381
Table 6.21: Throughput and delay performance by varying link propagation delay with packet size 64Kbytes on 8x8 mesh topology
Link delay in ms | Throughput in Kbytes (FTP) | Delay per packet in ms (FTP) | Throughput in Kbytes (CBR) | Delay per packet in ms (CBR)
10 555 1.6462719 552 1.6551183
20 575 1.7246999 566 1.7469654
40 543 1.4816446 538 1.5045655
80 359 1.8569539 351 1.8950532
100 291 2.1415174 286 2.1904111
200 43 3.5095102 42 3.5295024
In Scenario 4 (Table 6.17), the throughput performance of the 4x4 and 8x8 mesh topologies was studied by varying the propagation delay of the link from 10 milliseconds to 200 milliseconds while keeping the bandwidth fixed at 10 Mbps and the queue size at one. The study was performed for FTP and CBR (at 10 Mbps rate) traffic with low-load (0.512 Kbytes) and high-load (64 Kbytes) packets. As in the previous scenarios, we fixed N(0) and N(15) as the source and destination nodes for the 4x4 topology and N(0) and N(63) as the source and destination nodes for the 8x8 topology. The generated traffic was transmitted from the source node to the destination node using the shortest path algorithm, and the throughput was measured at the destination node after running the simulation for 20 seconds. The results obtained are shown in Tables 6.18, 6.19, 6.20 and 6.21. We observe from figure 6.7 that the throughput for low load (0.512 Kbytes) is very low, a factor of 4 to 5 below that of high load for both FTP and CBR traffic; it stays below 100 Kbytes and decreases further as the propagation delay of the link increases. At higher load the throughput is better for CBR traffic than for FTP traffic, and it also decreases as the propagation delay of the link increases. The 4x4 topology performs better than the 8x8 mesh topology because of its shorter path length. The drop in performance at higher propagation delays is expected, since every packet spends longer on each link of the path.
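The 10 ms to 200 ms sweep maps directly onto the delay argument of the duplex-link commands in the ns-2 script of Appendix II. The sketch below is a condensed, hypothetical variant of that script (not the exact script used for the measurements) in which the link delay is read from the command line so that each point of the sweep can be generated with one run:

# Scenario-4 sketch: build a 4x4 mesh whose link propagation delay is a command-line argument
set ns [new Simulator]
set linkdelay [lindex $argv 0]                 ;# e.g. 10ms, 20ms, ..., 200ms

# create the 16 nodes of the 4x4 mesh
for {set i 0} {$i < 16} {incr i} {
    set n($i) [$ns node]
}

# horizontal and vertical links, all 10 Mbps, DropTail, queue size 1
for {set r 0} {$r < 4} {incr r} {
    for {set c 0} {$c < 4} {incr c} {
        set id [expr {$r*4 + $c}]
        if {$c < 3} {
            $ns duplex-link $n($id) $n([expr {$id+1}]) 10Mb $linkdelay DropTail
            $ns queue-limit $n($id) $n([expr {$id+1}]) 1
        }
        if {$r < 3} {
            $ns duplex-link $n($id) $n([expr {$id+4}]) 10Mb $linkdelay DropTail
            $ns queue-limit $n($id) $n([expr {$id+4}]) 1
        }
    }
}

# attach the FTP/TCP and CBR agents between n(0) and n(15) exactly as in Appendix II,
# enable tracing to out.tr, and then run for 20 s of simulated time
$ns at 20.0 "exit 0"
$ns run

Invoking the script as, for example, ns scenario4.tcl 40ms builds the 4x4 mesh with 40 ms links; changing the loop bounds from 4 to 8 (and the destination to n(63)) gives the corresponding 8x8 case.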
Figure 6.7: Throughput in Kbytes v/s Link delay in ms.
Figure 6.8: Delay per packet in ms v/s Link delay in ms
We also studied the transmission delay of the 4x4 and 8x8 mesh topologies by varying the propagation delay of the link from 10 milliseconds to 200 milliseconds while keeping the bandwidth fixed at 10 Mbps and the queue size at one. We used FTP and CBR (at 10 Mbps rate) traffic with low-load (0.512 Kbytes) and high-load (64 Kbytes) packets. Keeping N(0) and N(15) as the source and destination nodes for the 4x4 topology and N(0) and N(63) as the source and destination nodes for the 8x8 topology, traffic was generated at the source node and transmitted to the destination node using the shortest path algorithm. We then computed the transmission delay per packet in milliseconds at the destination node after running the simulation for 20 seconds. The results obtained are shown in Tables 6.18, 6.19, 6.20 and 6.21. We observe from figure 6.8 that the delay for low load (0.512 Kbytes) is better than for high load (64 Kbytes), but as the propagation delay of the link increases the transmission delay also increases, thereby decreasing the performance. Once again the reason is self-evident: the longer each link holds a packet, the longer the end-to-end transmission delay.
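A rough back-of-the-envelope model makes this trend explicit. If the shortest path crosses H links (H = 6 from N(0) to N(15) on the 4x4 mesh and H = 14 from N(0) to N(63) on the 8x8 mesh, assuming minimal XY-style routing), each link has propagation delay d and bandwidth B, and the packet is L bits long, then a simple store-and-forward estimate of the per-packet delay is

    T_packet ≈ H x (d + L/B)

so once d reaches tens of milliseconds the propagation term H x d dominates. This is consistent with the near-linear growth of the delay columns in Tables 6.18 and 6.20, whose slopes scale roughly with the hop counts 6 and 14 respectively.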
7.1 Conclusion
Network on Chip is a relatively new research field, and many pressing research challenges remain to be addressed. The major ones include NoC architecture design and implementation, switch/router design and implementation, and routing algorithms. In our study we first explored, through a literature survey, the various NoC architectures, topologies, router designs and routing algorithms proposed by the research community. We then undertook the design and implementation of a two-dimensional m x n mesh topology NoC platform and performed functional simulation and formal verification of the same. Mesh topology was selected for its scalability and its simplicity of hardware implementation. We first proposed a switch design with four first-in first-out (FIFO) input ports, four output ports and one local port. The four ports connect to the four adjacent switches in the north, south, east and west directions to build the mesh topology platform. The switch design comprises three components: the input module, the scheduler and the crossbar. We restricted ourselves to fixed-size ATM packets, although the design can be scaled to any packet size; the main reason is that the FPGA on which the design is ported has limited logic elements and little memory for implementing the LUT. We tested the design for its functionality by realizing it as a digital system module using HDL. All three components of the switch were realized in HDL and synthesized using the ALTERA QUARTUS II (version 8.0) software. The gate-level schematic of the generated design was examined for design connectivity using the Quartus II RTL viewer. The proposed switch design was verified for its functionality using functional simulation: various ATM packets were injected through the input ports and the functionality was verified at the respective output ports, as decided by the look-up table information configured within the switch.
Further, we implemented a computing node, which is a soft-core processor embedded with the switch design. The task of the processor is to generate packets, forward packets to the connected switch and receive packets from the switch. We used Altera's Nios II soft-core processor, customized to our requirements, to generate an SOPC using Altera's SOPC Builder system tool. The generated soft core was integrated with our switch by connecting the appropriate buses and control signals, thereby implementing the computing node. For implementation we used Altera's Development and Education board DE2, which carries a Cyclone II 2C35 FPGA device. The computing node system, along with its configuration file, was synthesized and ported onto the Cyclone II 2C35 FPGA device of the DE2 board. The design was tested by writing a C application, using the Integrated Development Environment, that generates ATM packets, forwards each packet to the input port of the attached switch and receives it at the respective output port. Using this computing node we then built a two-dimensional 3x2 mesh topology NoC platform, synthesized it on the Cyclone II 2C35 FPGA device EP2C35F672C6 and formally verified it using the µCOS real-time operating system.
We also developed a simulation and verification platform to measure NoC performance in terms of network delay and throughput. The platform was designed and implemented using 4x4 and 8x8 mesh topologies, and a comparative performance study was carried out by varying different NoC parameters. The throughput and delay of the NoC platform were measured by varying the packet size, link bandwidth, link delay and queue size using FTP and CBR traffic patterns. We conclude that as the packet size increases the throughput increases, reaching almost 100% of the link bandwidth. The transmission delay also increases with packet size, and it is better for CBR traffic; thus a mesh topology with a small number of nodes can be used for applications where data packets are generated at constant bit rates.
With reference to queue size, the throughput and delay for low load remain constant for both the 4x4 and 8x8 mesh topologies, but for high load the throughput and delay are better at lower queue sizes. In the design implemented, the queue size should never exceed five times the matrix value of the mesh topology.
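These two parameters correspond to one line each per link in the ns-2 scripts of Appendices II and III; a minimal sketch (with n1 and n2 standing for any pair of adjacent mesh nodes, values shown for illustration only) is:

# per-link bandwidth and propagation delay
$ns duplex-link $n1 $n2 10Mb 10ms DropTail
# per-link queue size; this is the value referred to by the rule of thumb above
$ns queue-limit $n1 $n2 1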
Further, with respect to link bandwidth, we conclude that FTP traffic outperforms CBR traffic at high load. The throughput with FTP traffic increases linearly with bandwidth, exploiting almost 99% of the available bandwidth, and the 4x4 mesh topology is found to perform best compared with the 8x8 mesh topology. Increasing the bandwidth does not affect the transmission delay much at low load, but a drastic change is seen at high load: the higher the bandwidth, the better the delay.
We also conclude that the throughput of the network decreases as the propagation delay of the link increases. At higher load the throughput performance is better for CBR traffic than for FTP traffic, and the 4x4 mesh performs better than the 8x8 mesh. The delay for low load is better than for high load, but as the propagation delay of the link increases the transmission delay also increases, thereby decreasing the performance.
NoCs are a key enabling technology for the provision of many additional services, ranging from different quality-of-service (QoS) levels to fault tolerance. Apart from global communication, the major challenge now faced by designers is high power dissipation; it has grown to such importance that it now directly constrains attainable performance, and a designer must always take this parameter into consideration. When designing a router, the type of link interconnect, the number of ports and the queue size all play an important role in energy considerations. The additional energy required to route a packet is shown to be dominated by the data path. Performance analysis also shows that dynamic resource allocation leads to the lowest network latencies, while static allocation may be used to meet QoS goals. Combining the power and performance figures then allows an energy-latency product to be calculated to judge the efficiency of each network.
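Expressed compactly (in our own notation, not that of any particular paper), this metric is simply

    ELP = E_packet x T_packet

the product of the average energy spent per packet and the average packet latency, so a lower ELP indicates a more efficient network for the traffic scenario considered.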
Studies along these lines have been performed by various researchers. Arnab Banerjee, Pascal T. Wolkotte, Robert D. Mullins, Simon W. Moore and Gerard J. M. Smit [101] performed power analysis on a circuit-switched router, a wormhole router, a quality-of-service (QoS) supporting virtual channel router and a speculative virtual channel router for NoC architectures and evaluated their energy-performance. The router designs were synthesized, placed and routed on a 90-nm, high-performance CMOS process technology with a core voltage of 1.2 V and nominal threshold voltage. The results show that router power significantly exceeds link power. They also found that all router designs dissipate significant power through leakage and the clock tree, an overhead for any particular architecture, which highlights the need to combine efficient architectures with standby power reduction techniques to obtain power-efficient designs. The dynamic energy cost per packet was calculated to measure the additional power dissipated while routing a packet; the data path, being much wider than the control path, dominates the energy need. They implemented dynamic allocation and virtual channels in the SpecVC router design and observed that packet latency under random packet injection is reduced, while the QoS-supporting GuarVC router design offered low latency for GT traffic. The measured energy and latency results are combined into an energy-latency product (ELP) metric that represents the efficiency of a router architecture for a specific traffic scenario.
Another way to reduce power dissipation is to use asynchronous circuits, which eliminate the global clock signal on a chip. In an asynchronous NoC, multiple communications take place simultaneously over asynchronous routers and links between processing cores. Naoya Onizawa et al. [102] proposed a high-throughput, compact, delay-insensitive asynchronous NoC router based on LEDR encoding with a packet-structure constraint. Since the routing computation is performed using only single-phase information, the hardware complexity of two-phase encoding is alleviated while maintaining timing robustness. The proposed NoC thus benefits maximally from two-phase encoding, in which the number of communication steps and the number of signals used are halved compared with four-phase encoding. As a result, the proposed router achieves a throughput of 526 Mflit/s, almost double that of a conventional four-phase router, with 34% energy saving and small area overhead, as shown by post-layout simulation under a 0.13 µm CMOS technology. Moreover, the proposed asynchronous NoCs achieve 113 and 69 Gbps in the 4x4 2-D mesh and 16-core Spidergon NoCs respectively, almost 2.4x and 2.3x the throughput of the four-phase asynchronous NoCs. The asynchronous NoC based on the proposed router was fabricated and demonstrated under supply voltages of 0.6 to 1.8 V.
The issue of energy cannot be solved completely by designing efficient systems; certain in-built factors may also affect system performance, one of which is the material used for the interconnect. Conventional metallic interconnects are becoming the bottleneck of NoC performance due to their limited bandwidth, long delay, large area and high power consumption. Progress in photonic technologies has diverted the focus of researchers towards exploring optical interconnects for on-board inter-chip interconnect, switching fabrics in core routers, and so on. Huaxi Gu, Kwai Hung Mo, Jiang Xu and Wei Zhang [103] proposed a low-power, low-loss and low-cost 5x5 optical router design called Cygnus for optical NoCs. Cygnus is non-blocking and based on silicon micro-resonators. They carried out a comparative performance study of Cygnus against other optical routers; the results show that Cygnus consumes 50% less power, has 51% less loss and requires 20% fewer micro-resonators than the traditional crossbar, and consumes only 3.8% of the power of a high-performance 45 nm electronic router.
In our switch design we implemented a FIFO input buffer, so packets are served in first-in first-out order. The arbitration strategy we used works well for most applications, but given the potentially different requirements of different applications the platform may not scale well. Different packets can also have very different effects on system performance because applications exhibit different levels of memory-level parallelism (MLP). Certain packets may be performance-critical because they cause the processor to stall, whereas others may be delayed for a number of cycles with no effect on application-level performance, as their latencies are hidden by other outstanding packets' latencies. Deciding the arbitration strategy at switch design time is therefore difficult unless one knows what type of application will run on the platform. Thus, characterizing, understanding and optimizing the interference behaviour of applications in the NoC is an important problem for enhancing system performance and fairness. One can capture a packet's importance to its application's performance by differentiating packets based on their slack. Reetuparna Das et al. [104] explored this area by proposing Aérgia: Exploiting Packet Latency Slack in On-Chip Networks. They introduce the concept of packet slack and characterize it in the context of on-chip networks. To exploit packet slack, they propose and evaluate a novel architecture, called Aérgia, which identifies and prioritizes low-slack (critical) packets. The key components of the proposed architecture are techniques for online estimation of slack in packet latency and slack-aware arbitration policies. Averaged over 56 randomly generated multiprogrammed workload mixes on a 64-core 8x8-mesh CMP, Aérgia improves overall system throughput by 10.3% and network fairness by 30.8% on average. The results show that Aérgia outperforms three state-of-the-art NoC scheduling, prioritization and fairness techniques. They conclude that the proposed network architecture is effective at improving overall system throughput as well as reducing application-level network unfairness.
Adequate research work can be found on latency and energy optimization for quality of service (QoS). Researchers have worked on latency of operation, and hence various levels of latency metrics are offered. One can find various architectures for optimizing guaranteed bandwidth and best effort for QoS using arbitration algorithms such as round robin, FCFS, priority, and priority-based round robin, the latter being used for guaranteed traffic. Sunghyun Park et al. [105] proposed a 16-node 4x4 mesh NoC fabricated in 45 nm SOI; they defined and analyzed the theoretical limits of a mesh NoC in latency, throughput and energy. Off-chip signaling techniques have been shown to deliver substantial energy savings when applied to NoC interconnects, but circuit- or system-level solutions to their increased vulnerability to process variations need to be developed to ensure future viability. Due to the characteristics of on-chip networks and of the systems built on them, service guarantees and predictability are usually given through hardware implementations, as opposed to the statistical guarantees of macro-networks. In real-time or other critical applications one really needs guarantees rather than predictability. In order to give hard guarantees, one must use specific routing schemes (virtual channels, adaptive routing, fault-tolerant router and link structures, etc.) to ensure the required levels of performance and/or reliability. Naturally, for guaranteed services (GS) NoCs, implementation cost and complexity grow with the complexity of the system and the QoS requirements. Best-effort (BE) NoCs, on the other hand, tend to present a better utilization of the available resources.
In order to evaluate the performance of NoCs, many researchers have simulated NoC models with different topologies and routing algorithms and carried out performance evaluation in terms of throughput and delay by varying network parameters such as packet size, link delay, queue size and bandwidth under different traffic patterns. J. Flich, P. Lopez, M. P. Malumbres and J. Duato [106] discussed improving the performance of regular networks with source routing. Here a mesh interconnection network uses source routing to send packets from the source node to the destination node; in source routing, the information about the whole path from source to destination is pre-computed and provided in the packet header. For a 2D mesh of size 4x4 they found that the FTP mechanism performs better than the CBR mechanism for parallel transmission: the FTP mechanism did not lose any packets, while the constant bit rate mechanism lost a few packets in parallel transmission. The FTP mechanism is therefore the most secure and reliable choice when parallel transmission with handshaking is considered in mesh interconnection networks. That NoC performance is also affected by the packet injection rate was observed by C. Marcon et al. [107] in their proposed Tiny, a 3D mesh NoC optimised for area and latency minimization. In their study they compared Tiny with a straightforward 3D mesh NoC such as Lasio and concluded that: (i) Tiny always reduces the overall area, even with a small area increase in the border routers, and (ii) Tiny reduces the average packet latency when there are few concurrent communications and/or a low packet injection rate. In contrast, when the packet injection rate increases and there are many concurrent communications, Lasio provides a lower average packet latency due to its larger quantity of buffered paths. In our simulation we carried out the study for low-load and high-load traffic and observed that heavy traffic affects the delay performance of the model. Such traffic can also be balanced using various routing techniques.

Mukund Ramakrishna et al. [108] discuss GCA, Global Congestion Awareness, for load balancing in NoCs. They present a novel adaptive routing technique for on-chip networks which aims to make routing decisions based on global knowledge of the network state. GCA is a light-weight, low-complexity adaptive routing algorithm that makes per-hop routing decisions based on awareness of the congestion of links throughout the network. It differs from other globally aware routing schemes in that it uses packets already in the network to convey ("piggyback") congestion information instead of requiring a sideband network dedicated to congestion status propagation, which makes it a more scalable solution than other existing techniques. Their experiments show that GCA consistently performs better than Regional Congestion Awareness (RCA-1D) on a variety of workloads: on SPLASH-2 traffic, GCA is 51% better in latency than RCA in the best case and 15% better on average, and it also betters a competing globally aware routing algorithm, DAR, by 8% on average on realistic workloads.

The size of the network also plays a vital role with respect to performance: a smaller network has shorter links and therefore a lower average link delay. Jie Chen et al. [109] presented a performance evaluation of three NoC architectures, studying the torus, Metacube and hypercube under bit-complement traffic and a Poisson assumption. Although the hypercube performs best, it is expensive in terms of link complexity and node degree. When the network is small (32-64 nodes) and channel loads are not heavy, the torus is a viable choice, being cheap and simple to implement. For the 32-node and 64-node cases the Metacube network performs the worst of the three because it has the smallest node degree and the largest diameter, but for 128-node and 512-node networks the Metacube starts to outperform the torus and exhibits performance similar to the hypercube. The 128-node and 512-node Metacubes have lower link complexity and fewer long wires, which makes them a viable choice under moderate load.

Further, Zvika Guz et al. [110] explored network delay and link capacities in application-specific wormhole NoCs. They present a novel analytical delay model for virtual-channeled wormhole networks with non-uniform links and apply the analysis in devising an efficient capacity allocation algorithm which assigns link capacities such that the packet delay requirements of each flow are satisfied. Network-on-chip (NoC) based application-specific systems-on-chip, where the information traffic is heterogeneous and delay requirements may vary widely, require individual capacity assignment for each link in the NoC; this is in contrast to the standard approach for on- and off-chip interconnection networks, which employ uniform-capacity links. The allocation of link capacities is therefore an essential step in the automated design process of NoC-based systems, and the algorithm should minimize communication resource costs under quality-of-service timing constraints.

It is also possible to customize the link wire length at placement time. Junbok You et al. [111] proposed bandwidth optimization in asynchronous NoCs by customizing link wire lengths, estimating the benefit of bandwidth optimization in the design of an asynchronous network-on-chip (NoC). They developed a tool to optimize NoC bandwidth and energy through topology and router placement; the placement minimizes the wire lengths and router hops of high-bandwidth network links. A second optimization is performed on long links that require further bandwidth improvement by adding pipeline latches. This design process creates a multi-frequency, bandwidth-optimized asynchronous NoC, which is compared to a clocked NoC with the same optimizations applied but without multi-frequency design. The results show that exploiting the natural multi-frequency nature of asynchronous designs yields substantial improvements. The topology and placement optimizations create an asynchronous design with an average link bandwidth of 1.54 Gflits/s. Compared to the 1.54 GHz clocked design, the asynchronous design has 46% less average packet latency and 19% less energy consumption at 3x offered load. The asynchronous design performs similarly to the clocked network with a link bandwidth of 2.12 Gflits/s, but demands 29% less energy at 3x offered load. Adding pipeline latches to the clocked design does not have a substantial beneficial effect; for the asynchronous design, however, this optimization significantly improves performance when the network is highly congested: at 7x offered load, the pipeline latches result in a 35% reduction in average packet latency for only 6.1% more energy consumption.
7.2 Future Trends
The future trend is towards the third dimension in NoC design: as applications become more and more complex, 2-D NoCs may not scale well. As a network grows, its diameter also grows, which affects performance, so current researchers are trying to optimize NoC-based architectures. Brett Stanley Feero and Partha Pratim Pande [112] performed a simulation study to evaluate the performance of five different topologies in 3D space: 3D mesh, stacked mesh, ciliated 3D mesh, butterfly fat tree (BFT) and SPIN. They demonstrated that mesh- and tree-based NoCs achieve better performance when implemented in a 3D IC environment than in a 2D implementation. 3D mesh-based NoCs achieve significant gains in latency, throughput and energy dissipation, while 3D tree-based NoCs show a significant gain only in energy dissipation, with little area overhead in both cases.

Awet Yemane Weldezion, Matt Grange, Dinesh Pamunuwa, Zhonghai Lu, Axel Jantsch, Roshan Weerasekera and Hannu Tenhunen [113] studied the performance and scalability of different 3D NoC communication topologies, namely the 2-D mesh, the 3-D mesh (with switch connectivity between layers) and the 3-D bus (with bus connectivity between layers), using through-silicon vias (TSVs) for inter-die connectivity. Using different traffic patterns they performed cycle-accurate RTL-level simulation of two communication schemes based on a 7-port switch and a centrally arbitrated vertical bus. The results show that among the three designs the 3-D 7-port switch is the best performer in terms of throughput and normalized latency. Amir Mohammad Rahmani, Kameswar Rao Vaddina, Khalid Latif, Pasi Liljeberg, Juha Plosila and Hannu Tenhunen [114] proposed an efficient inter-layer communication scheme for a 3D NoC-bus hybrid mesh architecture to improve system reliability, enhance system performance and reduce power consumption. They also presented a congestion-aware and bus-failure-tolerant routing algorithm called AdaptiveZ for vertical communication, and further implemented a generic monitoring platform called ARB-NET on top of the 3D stacked NoC mesh architecture, which can be used for traffic monitoring and fault tolerance purposes. Seyyed Hossein Seyyedaghaei Rezaei, Abbas Mazloumi, Mehdi Modarressi and Pejman Lotfi-Kamran [115] presented a router microarchitecture for 3D NoC called FRESH, with fine-grained resource sharing capability. The proposed NoC takes advantage of ultra-fast vertical links to forward a blocked flit to the vertically adjacent router if the corresponding link and crossbar ports are idle, effectively eliminating the packet blocking time. Cycle-accurate simulation results show up to 21% lower packet latency compared to other state-of-the-art 3D NoCs.

Randy W. Morris Jr, Avinash Karanth Kodi, Ahmed Louri and Ralph D. Whaley Jr [116] proposed a high-bandwidth and energy-efficient interconnect architecture called On-Chip Multilayer Photonic (OCMP), which uses nanophotonic interconnects (NIs) and 3D stacking technologies with reconfiguration. They also proposed a reconfiguration algorithm that maximizes the available bandwidth by dynamically reallocating channel bandwidth through runtime monitoring of network resources. Simulation results for a 64-core reconfigured network show that execution time can be reduced by up to 25% for the PARSEC, SPLASH-2 and SPEC CPU2006 benchmarks; simulations comparing a 256-core version of the proposed architecture with on-chip electrical and optical networks indicate more than 25% throughput improvement and 23% energy saving on synthetic traffic for the proposed architecture. Yaoyao Ye, Jiang Xu, Baihan Huang, Xiaowen Wu, Wei Zhang, Xuan Wang, Mahdi Nikdast, Zhehui Wang, Weichen Liu and Zhe Wang [117] proposed a 3-D mesh-based optical NoC for MPSoCs, along with 4x4 and 5x5 routers to be used at the network corners and edges and 6x6 and 7x7 optical routers for dimension-order routing. The routers are strictly non-blocking and built on silicon MRs. In addition, an optimized floorplan for the 3-D mesh-based ONoC was proposed which follows the regular 3-D mesh topology but implements all optical routers in a single optical layer. Simulations comparing the 3-D 8x8x2 mesh-based ONoC with a 2-D 16x8 mesh-based optical NoC show that the proposed architecture achieves 17% performance improvement and 32% less energy consumption; compared with a 2-D 16x8 mesh-based electronic NoC, it achieves 9% performance improvement and a 47% reduction in energy consumption.

Akram Ben Ahmed and Abderazek Ben Abdallah [118] proposed the Look Ahead Fault Tolerant (LAFT) routing algorithm for 3D NoCs. LAFT enhances system performance and reduces communication latency while ensuring fault tolerance with reasonable hardware complexity. The proposed algorithm was implemented on a real 3D NoC architecture (3D-OASIS-NoC) and prototyped on FPGA; the evaluation results show a 38% average latency reduction and up to 46% throughput enhancement compared with the conventional XYZ routing algorithm. Sara Akbari, Ali Shafiee, Mahmoud Fathy and Reza Berangi [119] proposed AFRA, a deadlock-free routing algorithm for 3D mesh-based NoCs which tolerates faults on vertical links. The algorithm uses ZXY routing whenever possible and switches to XZXY routing when a fault in the baseline path is detected, achieving high performance and simplicity. AFRA is highly robust; it supports connections between all possible pairs of communicating nodes at high fault rates. Compared with planar adaptive routing, AFRA showed 70% and 54.1% improvement in saturation injection rate for uniform traffic under a single fault, and 207% and 44% for bit-complement traffic under five faults.
Wireless 3D NoC Architecture
A wireless 3D NoC architecture has been proposed for building-block SiPs, in which the number and types of chips stacked in a package can be changed after the chips have been fabricated, by virtue of inductive coupling. These chips can be formed into a single unidirectional ring network so as to fully exploit the flexibility of the wireless connections, which allows chips to be added, removed and swapped in a package without updating any routing information. Yasuhiro Take et al. [120] described a 3D NoC with inductive-coupling links for building-block systems-in-package (SiPs). The unidirectional ring can be extended to a bidirectional ring network by using inductive-coupling transceivers that can dynamically change their communication direction in a single cycle. The proposed vertical bubble flow network outperforms the conventional VC-based approach by 7.9-12.5% with a 33.5% smaller router area for building-block SiPs connecting up to eight chips. The next step in mobility is the Internet of Things (IoT) [121], where everyday devices, objects and physical assets are equipped with computing power, sensors and wireless connections that enable them to communicate with each other and with the Internet. These connected objects become more and more intelligent with tiny integrated processors, such as the Intel Quark processor the size of an SD card, including Bluetooth and WiFi communication capabilities. In the end, these SoCs with their on-chip interconnects will be connected, forming large wireless off-chip networks with an ever increasing amount of data to be transmitted.
ANNEXURE
Appendix-I

/* Nios II test application: drives the frame-pulse (fp) and data PIO ports of the
 * attached switch, injects the ATM test packets and reads back the data received
 * on the switch output port. */
#include "system.h"
#include "altera_avalon_pio_regs.h"
#include "alt_types.h"   /* alt_u8 */
#include "stdio.h"

int recieve_data(void);

int j = -2;

int main(void)
{
    int loop = 0;
    //for(loop=0;loop<10;loop++)   //run simulation for 10 times
    //{
    //  printf("............................................SIMULATION %d\n",loop);
    int fp = 0;
    int k = 0, p = 0;
    alt_u8 d = 0x00;               /* data sampled back at the clock-high phase */
    alt_u8 clk_edge;
    alt_u8 fp_out = 0x00;
    alt_u8 direct_data = 0x00;
    unsigned char data1[53] = {
        0x72,0x73,0x74,0x75,0x76,
        0x01,0x02,0x03,0x04,0x05,0x06,0x07,0x08,0x09,0x10,0x11,0x12,
        0x13,0x14,0x15,0x16,0x17,0x18,0x19,0x20,0x21,0x22,0x23,0x24,
        0x25,0x26,0x27,0x28,0x29,0x30,0x31,0x32,0x33,0x34,0x35,0x36,
        0x37,0x38,0x39,0x40,0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48 };
    unsigned char data2[53] = {
        0xA9,0xAA,0xAB,0xAC,0xAD,
        0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,
        0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,
        0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,
        0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa };
    unsigned char data3[53] = {
        0xDF,0xE0,0xE1,0xE2,0xE3,
        0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,
        0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,
        0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,
        0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb,0xbb };
    unsigned char data4[53] = {
        0x15,0x16,0x17,0x18,0x1A,
        0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,
        0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,
        0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,
        0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc,0xcc };

    IOWR(PIO_BASE, 0, 0x01);       //fp1
    printf("\nfp_in is made high to generate fp_out\n");

    while (fp_out == 0)            //checks whether fp_out is high
    {
        //checks for falling edge of switch clock
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_high %2x", clk_edge);
        } while (clk_edge == 0x01);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_low %2x", clk_edge);
        } while (clk_edge == 0x00);
        if (fp == 1) {
            IOWR(PIO_BASE, 0, 0x00);   //fp1 low
            printf(" fp is made low\n");
        }
        fp++;
        fp_out = IORD_ALTERA_AVALON_PIO_DATA(PIO_7_BASE);
        printf("\n fp_out%2x \n\n", fp_out);
    }
    printf(" first fp_out is detected \n");

    while (fp_out == 1) {
        fp_out = IORD_ALTERA_AVALON_PIO_DATA(PIO_7_BASE);
        printf("\n fp_out%2x \n\n", fp_out);
    }
    printf(" first fp_out is low \n");

    //after first fp_out is low, wait for some more clock pulses and then make fp_in high
    for (p = 0; p < 10; p++) {
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_high %2x", clk_edge);
        } while (clk_edge == 0x01);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_low %2x", clk_edge);
        } while (clk_edge == 0x00);
    }
    do {
        clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
        printf(" clk_edge_ for fp to make high =%2x\n", clk_edge);
        if (clk_edge == 1) {
            IOWR(PIO_BASE, 0, 0x1);    // Enable the fp1 high
            printf("fp_in high\n");
        }
    } while (clk_edge == 0);

    while (fp_out == 0 && j <= 47)     //checks whether fp_out is high
    {
        //checks for falling edge of switch clock
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_high %2x", clk_edge);
        } while (clk_edge == 0x01);
        printf("\n falling edge detected.............");
        //sending data
        j++;
        if (j >= 0) {
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_1_BASE, data1[j]);
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_8_BASE, data2[j]);
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_8_BASE, data3[j]);
            IOWR_ALTERA_AVALON_PIO_DATA(PIO_10_BASE, data4[j]);
        }
        printf(" \n the value of j=%d", j);
        direct_data = IORD_ALTERA_AVALON_PIO_DATA(PIO_5_BASE);
        printf("\n data1_send_at_clk_low=%02x\n", data1[j]);
        printf("\n data2_send_at_clk_low=%02x\n", data2[j]);
        printf("\n data3_send_at_clk_low=%02x\n", data3[j]);
        printf("\n data4_send_at_clk_low=%02x\n", data4[j]);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            printf(" \n clk_low %2x", clk_edge);
        } while (clk_edge == 0x00);
        printf("\n Rising edge detected.............");
        d = IORD_ALTERA_AVALON_PIO_DATA(PIO_5_BASE);
        printf("\ndata_at_clk_high=%02x\n", d);
        fp_out = IORD_ALTERA_AVALON_PIO_DATA(PIO_7_BASE);
        printf("\n fp_out %2x \n\n", fp_out);
    }
    printf(" ......fp_out detected............\n");
    recieve_data();
    //}
    return 0;
}

int recieve_data(void)
{
    int i = 1;
    alt_u8 result2 = 0x00;
    alt_u8 result3 = 0x00;
    alt_u8 result4 = 0x00;
    alt_u8 result5 = 0x00;
    alt_u8 clk_edge;
    printf(" in recieve fn \n");
    do {
        //result2 = IORD_ALTERA_AVALON_PIO_DATA(PIO_2_BASE);
        //printf("data_1=%02x\n",result2);
        //result3 = IORD_ALTERA_AVALON_PIO_DATA(PIO_3_BASE);
        //printf("data_2=%02x\n",result3);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            // printf(" \n clk_high %2x", clk_edge);
        } while (clk_edge == 0x01);
        result4 = IORD_ALTERA_AVALON_PIO_DATA(PIO_4_BASE);
        printf("data_out_port_3 = %02x\n", result4);
        do {
            clk_edge = IORD_ALTERA_AVALON_PIO_DATA(PIO_6_BASE);
            // printf(" \n clk_low %2x", clk_edge);
        } while (clk_edge == 0x00);
        //result5 = IORD_ALTERA_AVALON_PIO_DATA(PIO_5_BASE);
        //printf("data_4=%02x\n",result5);
        i++;
    } while (i < 1000000);
    return 0;
}
Appendix-II Tcl script for 4 x 4 mesh topology #Create a simulator object set ns [new Simulator] #Define different colors for data flows (for NAM) $ns color 1 Blue $ns color 2 Red #$ns color 1 green #Open the NAM trace file set nf [open out.nam w] $ns namtrace-all $nf #Open the Trace file set tf [open out.tr w] $ns trace-all $tf #Define a 'finish' procedure proc finish {} { global ns nf tf $ns flush-trace #Close the NAM trace file close $nf close $tf #Execute NAM on the trace file exec nam out.nam & #Call xgraph to display the results exec xgraph out1.xg -geometry 800x800 & exit 0 } #Create sixteen nodes set n1 [$ns node] set n2 [$ns node] set n3 [$ns node] set n4 [$ns node]
set n5 [$ns node] set n6 [$ns node] set n7 [$ns node] set n8 [$ns node] set n9 [$ns node] set n10 [$ns node] set n11 [$ns node] set n12 [$ns node] set n13 [$ns node] set n14 [$ns node] set n15 [$ns node] set n16 [$ns node] #Create links between the nodes #connecting 1--2--3-------10 $ns duplex-link $n1 $n2 10mb 10ms DropTail $ns duplex-link $n2 $n3 10mb 10ms DropTail $ns duplex-link $n3 $n4 10mb 10ms DropTail $ns duplex-link $n1 $n5 10mb 10ms DropTail $ns duplex-link $n2 $n6 10mb 10ms DropTail $ns duplex-link $n3 $n7 10mb 10ms DropTail $ns duplex-link $n4 $n8 10mb 10ms DropTail $ns duplex-link $n5 $n6 10mb 10ms DropTail $ns duplex-link $n6 $n7 10mb 10ms DropTail $ns duplex-link $n7 $n8 10mb 10ms DropTail $ns duplex-link $n5 $n9 10mb 10ms DropTail $ns duplex-link $n6 $n10 10mb 10ms DropTail $ns duplex-link $n7 $n11 10mb 10ms DropTail $ns duplex-link $n8 $n12 10mb 10ms DropTail $ns duplex-link $n9 $n10 10mb 10ms DropTail $ns duplex-link $n10 $n11 10mb 10ms DropTail $ns duplex-link $n11 $n12 10mb 10ms DropTail $ns duplex-link $n9 $n13 10mb 10ms DropTail $ns duplex-link $n10 $n14 10mb 10ms DropTail $ns duplex-link $n11 $n15 10mb 10ms DropTail $ns duplex-link $n12 $n16 10mb 10ms DropTail $ns duplex-link $n13 $n14 10mb 10ms DropTail $ns duplex-link $n14 $n15 10mb 10ms DropTail $ns duplex-link $n15 $n16 10mb 10ms DropTail #--------------------------------------
#------------------------------------- #$ns queue-limit $n1 $n2 200 #$ns queue-limit $n2 $n3 200 #$ns queue-limit $n3 $n4 200 #$ns queue-limit $n1 $n5 200 #$ns queue-limit $n2 $n6 200 #$ns queue-limit $n3 $n7 200 #$ns queue-limit $n4 $n8 200 #$ns queue-limit $n5 $n6 200 #$ns queue-limit $n6 $n7 200 #$ns queue-limit $n7 $n8 200 #$ns queue-limit $n5 $n9 200 #$ns queue-limit $n6 $n10 200 #$ns queue-limit $n7 $n11 200 #$ns queue-limit $n8 $n12 200 #$ns queue-limit $n9 $n10 200 #$ns queue-limit $n10 $n11 200 #$ns queue-limit $n11 $n12 200 #$ns queue-limit $n9 $n13 200 #$ns queue-limit $n10 $n14 200 #$ns queue-limit $n11 $n15 200 #$ns queue-limit $n12 $n16 200 #$ns queue-limit $n13 $n14 200 #$ns queue-limit $n14 $n15 200 #$ns queue-limit $n15 $n16 200 #------------------------------------- #Setup first TCP connection Agent/TCP set packetSize_ 64000 set tcp1 [new Agent/TCP] $tcp1 set class_ 2 $ns attach-agent $n1 $tcp1 set sink1 [new Agent/TCPSink] $ns attach-agent $n16 $sink1 $ns connect $tcp1 $sink1 $tcp1 set fid_ 1 #Setup Second TCP connection set tcp2 [new Agent/TCP] $tcp2 set class_ 3 $ns attach-agent $n1 $tcp2 set sink2 [new Agent/TCPSink] $ns attach-agent $n16 $sink2
$ns connect $tcp2 $sink2 $tcp2 set fid_ 2 #Setup a FTP over TCP connection set ftp1 [new Application/FTP] $ftp1 attach-agent $tcp1 $ftp1 set type_ FTP1 #SETUP UDP CONNECTION #set udp [new Agent/UDP] #$ns attach-agent $n1 $udp #set null [new Agent/Null] #$ns attach-agent $n100 $null #$ns connect $udp $null #$udp set fid_ 3 #Setup a CBR over TCP connection set cbr [new Application/Traffic/CBR] $cbr attach-agent $tcp2 $cbr set type_ CBR $cbr set packet_size_ 64000 $cbr set rate_ 10Mb $cbr set random_ off #$cbr set interval_ 0.1 #Schedule events for the CBR and FTP agents $ns at 0.1 "$ftp1 start" $ns at 0.1 "$cbr start" $ns at 60 "$cbr stop" $ns at 60 "$ftp1 stop" #Detach tcp and sink agents (not really necessary) #$ns at 4.5 "$ns detach-agent $n1 $tcp ; $ns detach-agent $n100 $sink" #Call the finish procedure after 5 seconds of simulation time $ns at 62.0 "finish" #Print CBR packet size and interval puts "CBR packet size = [$cbr set packet_size_]" puts "CBR interval = [$cbr set interval_]" puts "CBR Rate = [$cbr set rate_]" #Run the simulation $ns run
Appendix-III Tcl script for 8 x 8 mesh topology #Create a simulator object set ns [new Simulator] #Define different colors for data flows (for NAM) $ns color 1 Blue $ns color 2 Red #$ns color 1 green #Open the NAM trace file set nf [open out.nam w] $ns namtrace-all $nf #Open the Trace file set tf [open out.tr w] $ns trace-all $tf #Define a 'finish' procedure proc finish {} { global ns nf tf $ns flush-trace #Close the NAM trace file close $nf close $tf #Execute NAM on the trace file exec nam out.nam & #Call xgraph to display the results exec xgraph out1.xg -geometry 800x800 & exit 0 } #Create 100 nodes set n1 [$ns node] set n2 [$ns node] set n3 [$ns node] set n4 [$ns node] set n5 [$ns node] set n6 [$ns node]
set n7 [$ns node] set n8 [$ns node] set n9 [$ns node] set n10 [$ns node] set n11 [$ns node] set n12 [$ns node] set n13 [$ns node] set n14 [$ns node] set n15 [$ns node] set n16 [$ns node] set n17 [$ns node] set n18 [$ns node] set n19 [$ns node] set n20 [$ns node] set n21 [$ns node] set n22 [$ns node] set n23 [$ns node] set n24 [$ns node] set n25 [$ns node] set n26 [$ns node] set n27 [$ns node] set n28 [$ns node] set n29 [$ns node] set n30 [$ns node] set n31 [$ns node] set n32 [$ns node] set n33 [$ns node] set n34 [$ns node] set n35 [$ns node] set n36 [$ns node] set n37 [$ns node] set n38 [$ns node] set n39 [$ns node] set n40 [$ns node] set n41 [$ns node] set n42 [$ns node] set n43 [$ns node] set n44 [$ns node] set n45 [$ns node] set n46 [$ns node] set n47 [$ns node]
set n48 [$ns node] set n49 [$ns node] set n50 [$ns node] set n51 [$ns node] set n52 [$ns node] set n53 [$ns node] set n54 [$ns node] set n55 [$ns node] set n56 [$ns node] set n57 [$ns node] set n58 [$ns node] set n59 [$ns node] set n60 [$ns node] set n61 [$ns node] set n62 [$ns node] set n63 [$ns node] set n64 [$ns node] #Create links between the nodes #--------- horizontal connection--------- $ns duplex-link $n1 $n2 10Mb 10ms DropTail $ns duplex-link $n2 $n3 10Mb 10ms DropTail $ns duplex-link $n3 $n4 10Mb 10ms DropTail $ns duplex-link $n4 $n5 10Mb 10ms DropTail $ns duplex-link $n5 $n6 10Mb 10ms DropTail $ns duplex-link $n6 $n7 10Mb 10ms DropTail $ns duplex-link $n7 $n8 10Mb 10ms DropTail $ns duplex-link $n9 $n10 10Mb 10ms DropTail $ns duplex-link $n10 $n11 10Mb 10ms DropTail $ns duplex-link $n11 $n12 10Mb 10ms DropTail $ns duplex-link $n12 $n13 10Mb 10ms DropTail $ns duplex-link $n13 $n14 10Mb 10ms DropTail $ns duplex-link $n14 $n15 10Mb 10ms DropTail $ns duplex-link $n15 $n16 10Mb 10ms DropTail $ns duplex-link $n17 $n18 10Mb 10ms DropTail $ns duplex-link $n18 $n19 10Mb 10ms DropTail $ns duplex-link $n19 $n20 10Mb 10ms DropTail $ns duplex-link $n20 $n21 10Mb 10ms DropTail $ns duplex-link $n21 $n22 10Mb 10ms DropTail $ns duplex-link $n22 $n23 10Mb 10ms DropTail
$ns duplex-link $n23 $n24 10Mb 10ms DropTail $ns duplex-link $n25 $n26 10Mb 10ms DropTail $ns duplex-link $n26 $n27 10Mb 10ms DropTail $ns duplex-link $n27 $n28 10Mb 10ms DropTail $ns duplex-link $n28 $n29 10Mb 10ms DropTail $ns duplex-link $n29 $n30 10Mb 10ms DropTail $ns duplex-link $n30 $n31 10Mb 10ms DropTail $ns duplex-link $n31 $n32 10Mb 10ms DropTail $ns duplex-link $n33 $n34 10Mb 10ms DropTail $ns duplex-link $n34 $n35 10Mb 10ms DropTail $ns duplex-link $n35 $n36 10Mb 10ms DropTail $ns duplex-link $n36 $n37 10Mb 10ms DropTail $ns duplex-link $n37 $n38 10Mb 10ms DropTail $ns duplex-link $n38 $n39 10Mb 10ms DropTail $ns duplex-link $n39 $n40 10Mb 10ms DropTail $ns duplex-link $n41 $n42 10Mb 10ms DropTail $ns duplex-link $n42 $n43 10Mb 10ms DropTail $ns duplex-link $n43 $n44 10Mb 10ms DropTail $ns duplex-link $n44 $n45 10Mb 10ms DropTail $ns duplex-link $n45 $n46 10Mb 10ms DropTail $ns duplex-link $n46 $n47 10Mb 10ms DropTail $ns duplex-link $n47 $n48 10Mb 10ms DropTail $ns duplex-link $n49 $n50 10Mb 10ms DropTail $ns duplex-link $n50 $n51 10Mb 10ms DropTail $ns duplex-link $n51 $n52 10Mb 10ms DropTail $ns duplex-link $n52 $n53 10Mb 10ms DropTail $ns duplex-link $n53 $n54 10Mb 10ms DropTail $ns duplex-link $n54 $n55 10Mb 10ms DropTail $ns duplex-link $n55 $n56 10Mb 10ms DropTail $ns duplex-link $n57 $n58 10Mb 10ms DropTail $ns duplex-link $n58 $n59 10Mb 10ms DropTail $ns duplex-link $n59 $n60 10Mb 10ms DropTail $ns duplex-link $n60 $n61 10Mb 10ms DropTail $ns duplex-link $n61 $n62 10Mb 10ms DropTail $ns duplex-link $n62 $n63 10Mb 10ms DropTail $ns duplex-link $n63 $n64 10Mb 10ms DropTail #---------------vertcal conectio------------------------------- $ns duplex-link $n1 $n9 10Mb 10ms DropTail $ns duplex-link $n2 $n10 10Mb 10ms DropTail $ns duplex-link $n3 $n11 10Mb 10ms DropTail $ns duplex-link $n4 $n12 10Mb 10ms DropTail $ns duplex-link $n5 $n13 10Mb 10ms DropTail $ns duplex-link $n6 $n14 10Mb 10ms DropTail
$ns duplex-link $n7 $n15 10Mb 10ms DropTail $ns duplex-link $n8 $n16 10Mb 10ms DropTail $ns duplex-link $n9 $n17 10Mb 10ms DropTail $ns duplex-link $n10 $n18 10Mb 10ms DropTail $ns duplex-link $n11 $n19 10Mb 10ms DropTail $ns duplex-link $n12 $n20 10Mb 10ms DropTail $ns duplex-link $n13 $n21 10Mb 10ms DropTail $ns duplex-link $n14 $n22 10Mb 10ms DropTail $ns duplex-link $n15 $n23 10Mb 10ms DropTail $ns duplex-link $n16 $n24 10Mb 10ms DropTail $ns duplex-link $n17 $n25 10Mb 10ms DropTail $ns duplex-link $n18 $n26 10Mb 10ms DropTail $ns duplex-link $n19 $n27 10Mb 10ms DropTail $ns duplex-link $n20 $n28 10Mb 10ms DropTail $ns duplex-link $n21 $n29 10Mb 10ms DropTail $ns duplex-link $n22 $n30 10Mb 10ms DropTail $ns duplex-link $n23 $n31 10Mb 10ms DropTail $ns duplex-link $n24 $n32 10Mb 10ms DropTail $ns duplex-link $n25 $n33 10Mb 10ms DropTail $ns duplex-link $n26 $n34 10Mb 10ms DropTail $ns duplex-link $n27 $n35 10Mb 10ms DropTail $ns duplex-link $n28 $n36 10Mb 10ms DropTail $ns duplex-link $n29 $n37 10Mb 10ms DropTail $ns duplex-link $n30 $n38 10Mb 10ms DropTail $ns duplex-link $n31 $n39 10Mb 10ms DropTail $ns duplex-link $n32 $n40 10Mb 10ms DropTail $ns duplex-link $n33 $n41 10Mb 10ms DropTail $ns duplex-link $n34 $n42 10Mb 10ms DropTail $ns duplex-link $n35 $n43 10Mb 10ms DropTail $ns duplex-link $n36 $n44 10Mb 10ms DropTail $ns duplex-link $n37 $n45 10Mb 10ms DropTail $ns duplex-link $n38 $n46 10Mb 10ms DropTail $ns duplex-link $n39 $n47 10Mb 10ms DropTail $ns duplex-link $n40 $n48 10Mb 10ms DropTail $ns duplex-link $n41 $n49 10Mb 10ms DropTail $ns duplex-link $n42 $n50 10Mb 10ms DropTail $ns duplex-link $n43 $n51 10Mb 10ms DropTail $ns duplex-link $n44 $n52 10Mb 10ms DropTail $ns duplex-link $n45 $n53 10Mb 10ms DropTail $ns duplex-link $n46 $n54 10Mb 10ms DropTail $ns duplex-link $n47 $n55 10Mb 10ms DropTail $ns duplex-link $n48 $n56 10Mb 10ms DropTail $ns duplex-link $n49 $n57 10Mb 10ms DropTail $ns duplex-link $n50 $n58 10Mb 10ms DropTail $ns duplex-link $n51 $n59 10Mb 10ms DropTail $ns duplex-link $n52 $n60 10Mb 10ms DropTail
$ns duplex-link $n53 $n61 10Mb 10ms DropTail $ns duplex-link $n54 $n62 10Mb 10ms DropTail $ns duplex-link $n55 $n63 10Mb 10ms DropTail $ns duplex-link $n56 $n64 10Mb 10ms DropTail #------------------------------------------- #Set Queue Size of link (n1-n64) to 10 ######################################## $ns queue-limit $n1 $n2 100 $ns queue-limit $n2 $n3 100 $ns queue-limit $n3 $n4 100 $ns queue-limit $n4 $n5 100 $ns queue-limit $n5 $n6 100 $ns queue-limit $n6 $n7 100 $ns queue-limit $n7 $n8 100 $ns queue-limit $n9 $n10 100 $ns queue-limit $n10 $n11 100 $ns queue-limit $n11 $n12 100 $ns queue-limit $n12 $n13 100 $ns queue-limit $n13 $n14 100 $ns queue-limit $n14 $n15 100 $ns queue-limit $n15 $n16 100 $ns queue-limit $n17 $n18 100 $ns queue-limit $n18 $n19 100 $ns queue-limit $n19 $n20 100 $ns queue-limit $n20 $n21 100 $ns queue-limit $n21 $n22 100 $ns queue-limit $n22 $n23 100 $ns queue-limit $n23 $n24 100 $ns queue-limit $n25 $n26 100 $ns queue-limit $n26 $n27 100 $ns queue-limit $n27 $n28 100 $ns queue-limit $n28 $n29 100 $ns queue-limit $n29 $n30 100 $ns queue-limit $n30 $n31 100 $ns queue-limit $n31 $n32 100 $ns queue-limit $n33 $n34 100 $ns queue-limit $n34 $n35 100
$ns queue-limit $n35 $n36 100 $ns queue-limit $n36 $n37 100 $ns queue-limit $n37 $n38 100 $ns queue-limit $n38 $n39 100 $ns queue-limit $n39 $n40 100 $ns queue-limit $n41 $n42 100 $ns queue-limit $n42 $n43 100 $ns queue-limit $n43 $n44 100 $ns queue-limit $n44 $n45 100 $ns queue-limit $n45 $n46 100 $ns queue-limit $n46 $n47 100 $ns queue-limit $n47 $n48 100 $ns queue-limit $n49 $n50 100 $ns queue-limit $n50 $n51 100 $ns queue-limit $n51 $n52 100 $ns queue-limit $n52 $n53 100 $ns queue-limit $n53 $n54 100 $ns queue-limit $n54 $n55 100 $ns queue-limit $n55 $n56 100 $ns queue-limit $n57 $n58 100 $ns queue-limit $n58 $n59 100 $ns queue-limit $n59 $n60 100 $ns queue-limit $n60 $n61 100 $ns queue-limit $n61 $n62 100 $ns queue-limit $n62 $n63 100 $ns queue-limit $n63 $n64 100 #---------------vertcal conectio------------------------------- $ns queue-limit $n1 $n9 100 $ns queue-limit $n2 $n10 100 $ns queue-limit $n3 $n11 100 $ns queue-limit $n4 $n12 100 $ns queue-limit $n5 $n13 100 $ns queue-limit $n6 $n14 100 $ns queue-limit $n7 $n15 100 $ns queue-limit $n8 $n16 100 $ns queue-limit $n9 $n17 100 $ns queue-limit $n10 $n18 100 $ns queue-limit $n11 $n19 100 $ns queue-limit $n12 $n20 100 $ns queue-limit $n13 $n21 100 $ns queue-limit $n14 $n22 100 $ns queue-limit $n15 $n23 100 $ns queue-limit $n16 $n24 100
$ns queue-limit $n17 $n25 100 $ns queue-limit $n18 $n26 100 $ns queue-limit $n19 $n27 100 $ns queue-limit $n20 $n28 100 $ns queue-limit $n21 $n29 100 $ns queue-limit $n22 $n30 100 $ns queue-limit $n23 $n31 100 $ns queue-limit $n24 $n32 100 $ns queue-limit $n25 $n33 100 $ns queue-limit $n26 $n34 100 $ns queue-limit $n27 $n35 100 $ns queue-limit $n28 $n36 100 $ns queue-limit $n29 $n37 100 $ns queue-limit $n30 $n38 100 $ns queue-limit $n31 $n39 100 $ns queue-limit $n32 $n40 100 $ns queue-limit $n33 $n41 100 $ns queue-limit $n34 $n42 100 $ns queue-limit $n35 $n43 100 $ns queue-limit $n36 $n44 100 $ns queue-limit $n37 $n45 100 $ns queue-limit $n38 $n46 100 $ns queue-limit $n39 $n47 100 $ns queue-limit $n40 $n48 100 $ns queue-limit $n41 $n49 100 $ns queue-limit $n42 $n50 100 $ns queue-limit $n43 $n51 100 $ns queue-limit $n44 $n52 100 $ns queue-limit $n45 $n53 100 $ns queue-limit $n46 $n54 100 $ns queue-limit $n47 $n55 100 $ns queue-limit $n48 $n56 100 $ns queue-limit $n49 $n57 100 $ns queue-limit $n50 $n58 100 $ns queue-limit $n51 $n59 100 $ns queue-limit $n52 $n60 100 $ns queue-limit $n53 $n61 100 $ns queue-limit $n54 $n62 100 $ns queue-limit $n55 $n63 100 $ns queue-limit $n56 $n64 100 # connecting 11-----20
################################### #Setup first TCP connection Agent/TCP set packetSize_ 512 set tcp1 [new Agent/TCP] $tcp1 set class_ 2 $ns attach-agent $n1 $tcp1 set sink1 [new Agent/TCPSink] $ns attach-agent $n64 $sink1 $ns connect $tcp1 $sink1 $tcp1 set fid_ 1 #Setup Second TCP connection set tcp2 [new Agent/TCP] $tcp2 set class_ 3 $ns attach-agent $n1 $tcp2 set sink2 [new Agent/TCPSink] $ns attach-agent $n64 $sink2 $ns connect $tcp2 $sink2 $tcp2 set fid_ 2 #Setup a FTP over TCP connection set ftp1 [new Application/FTP] $ftp1 attach-agent $tcp1 $ftp1 set type_ FTP1 #SETUP UDP CONNECTION #set udp [new Agent/UDP] #$ns attach-agent $n1 $udp #set null [new Agent/Null] #$ns attach-agent $n50 $null #$ns connect $udp $null #$udp set fid_ 3 #Setup a CBR over TCP connection set cbr [new Application/Traffic/CBR] $cbr attach-agent $tcp2 $cbr set type_ CBR $cbr set packet_size_ 512
$cbr set rate_ 10Mb $cbr set random_ off #$cbr set interval_ 0.1 #Schedule events for the CBR and FTP agents $ns at 0.1 "$ftp1 start" $ns at 0.1 "$cbr start" $ns at 60 "$cbr stop" $ns at 60 "$ftp1 stop" #Detach tcp and sink agents (not really necessary) #$ns at 4.5 "$ns detach-agent $n1 $tcp ; $ns detach-agent $n100 $sink" #Call the finish procedure after 5 seconds of simulation time $ns at 62.0 "finish" #Print CBR packet size and interval puts "CBR packet size = [$cbr set packet_size_]" puts "CBR interval = [$cbr set interval_]" puts "CBR Rate = [$cbr set rate_]" #Run the simulation $ns run
REFERENCES
[1] http://wccftech.com/intel-broadwell-ep-xeon-e5-v4/#ixzz49YMiOki1.
[2] http://www.intel.com/content/www/us/en/silicon-innovations/intel-14nmtechnology.html.
[3] http://siliconangle.com/blog/2015/10/26/oracle-debuts-first-systems-with-10-billion-transistor-sparc-m7-chip/.
[4] https://devblogs.nvidia.com/parallelforall/inside-pascal/.
[5] S.W. Keckler et al. (eds.), Multicore Processors and Systems, Integrated Circuits and
Systems, DOI 10.1007/978-1-4419-0263-42, C Springer Science+Business Media,
LLC 2009.
[6] http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance.
[7] “Design of energy-efficient on-chip networks,” http://www.rle.mit.edu/isg/documents/2010_NOC_tutorial_vladimir.pdf, 2010.
[8] T. G. Mattson, R. F. Van der Wijngaart, M. Riepen, T. Lehnig, P. Brett, W. Haas, P.
Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe, “The 48-core SCC
processor: The programmer’s view,” in Proc. SC, Nov. 2010, pp. 1–11.
[9] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L.
Bao, J. Brown, M. Mattina, C.-C. Miao,C. Ramey, D. Wentzlaff, W. Anderson, E.
Berger, N. Fairbanks, D. Khan,F. Montenegro, J. Stickney, and J. Zook, “TILE64
processor: A 64-core SoC with mesh interconnect,” in Proc. ISSCC, Feb. 2008, pp.
88–598.
[10] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A.
Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts,Y. Hoskote, N. Borkar, and S.
Borkar, “An 80-tile sub-100-w teraflops processor in 65-nm CMOS,” IEEE J. Solid-
State Circuits, vol. 43, no. 1, pp. 29–41, Jan. 2008.
[11] Wayne Wolf, Ahmed Amine Jerraya, and Grant Martin, “Multiprocessor System-on-Chip (MPSoC) Technology,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 27, No. 10, October 2008.
[12] AMBA™ Specification Rev. 2.0, http://www.arm.com, 1999.
[13] Specification for the: WISHBONE System-on-Chip (SoC) Interconnection
Architecture for Portable IPCores, OpenCore, 2002.
[14] The CoreConnect Bus Architecture, http://www-03.ibm.com/chips/products/coreconnect/, 1999.
[15] A Comparison of Network-on-Chip and Busses, http://www.arteris.com/noc_whitepaper.pdf, 2005.
[16] http://chipdesignmag.com/sld/files/2009/11/NoC-Evolution_v11.pdf.
[17] Naveen Choudhary, “Network on Chip: A New SoC Communication Infrastructure Paradigm,” International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume-1, Issue-6, January 2012.
[18] Ahmed Jerraya and Wayne Wolf “Multiprocessor Systems-on‐Chips” Elsevier Inc,
2004.
[19] W. J. Dally, B. Towles, "Route packets, not wires: on -chip interconnection networks",
in Proceedings DAC, pp. 684-689, June2001.
[20] Maurizio Palesi and Masoud Daneshtalab (Eds.), Routing Algorithms inNetworks-on-
Chip, Springer, 2014.
[21] Daniel Sanchez, George Michelogiannakis and Christos Kozyrakis “An Analysis of
On-Chip Interconnection Networks for Large-Scale Chip Multiprocessors” ACM
Transactions on Architecture and Code Optimization, Vol. 7, No. 1, Article 4, April
2010.
[22] David Atienza, Federico Angiolini, Srinivasan Murali, Antonio Pullini, Luca Benini,
and Giovanni De Micheli, “Network-on-Chip design and synthesis outlook,”
Integration, the VLSI journalvol. 41, pp. 340–359, 2008.
[23] A.Kumar, A.Jantesh, M.Millberg, J.Obeerg, J.P.Soininen, M.Forsell, K.Tiensyrja and
A.Hemani “A Network on chip architecture and Design methodology” In Proc. Symp.
VLSI, page 117, Washington, DC,USA,2002. IEEE Computer Society
[24] M.B.Taylor,j.Kim,J. Miller,D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P.
Johnson, Jae-Wook Lee, W. Lee, A. Saraf, M. Seneski, N. Shnidman, V.
Strumpen, M.Frank, S. Amarasinghe and A. Agarwal “ The raw microprocessor: a
computational fabric for software circuits and general purpose programs”, Micro,
IEEE, vol.22, no.2, pp 25-35,2002
[25] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar and S. Borkar, "An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS", IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 29-41, 2008.
[26] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks", in Proceedings of the Design Automation Conference, pp. 684-689, June 2001.
[27] C. Grecu, P. P. Pande, A. Ivanov and R. Saleh, "Structured interconnect architecture: a solution for the non-scalability of bus-based SoCs", in 14th ACM Great Lakes Symposium on VLSI (GLSVLSI'04), pages 192-195, April 2004.
[28] A. Greiner, L. Mortiez, A. Adriahantenaina, H. Charlery and C. A. Zeferino, "SPIN: A Scalable, Packet Switched, On-Chip Micro-Network", in Design, Automation and Test in Europe (DATE'03), page 20070, Washington, DC, USA, 2003, IEEE Computer Society.
[29] Faraydon O. Karim, "Octagon interconnection network for linking processing nodes on an SoC device and method of operating same", US Patent 7,218,616.
[30] M. Coppola, M. D. Grammatikakis, R. Locatelli, G. Maruccia and L. Pieralisi, "Design of Cost-Efficient Interconnect Processing Units: Spidergon STNoC", CRC Press, Inc., Boca Raton, FL, USA, 2008.
[31] M. Coppola, R. Locatelli, G. Maruccia, L. Pieralisi and A. Scandurra, "Spidergon: A NoC modeling paradigm", in Proc. 2004 International Symposium on System-on-Chip, page 15, November 2004.
[32] L. Bononi and N. Concer, "Simulation and analysis of network on chip architectures: ring, spidergon and 2D mesh", in Design, Automation and Test in Europe (DATE'06), pages 154-159, 2006.
[33] L. Bononi, N. Concer, M. Grammatikakis, M. Coppola and R. Locatelli, "NoC topologies exploration based on mapping and simulation models", in DSD'07: Proceedings of the 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools, pages 543-546, 2007.
[34] Michael Kistler, Michael Perrone, and Fabrizio Petrini, "Cell Multiprocessor Communication Network: Built for Speed", IEEE Micro, 26(3): 10-23, May 2006.
[35] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan, "Larrabee: A many-core x86 architecture for visual computing", ACM Transactions on Graphics, 27, August 2008.
[36] David Wentzlaff, Patrick Griffin, Henry Hoffman, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John Brown III, and Anant Agarwal, "On-Chip Interconnection Architecture of the Tile Processor", IEEE Micro, 27(5): 15-31, 2007.
[37] Ge Fen, Wu Ning and Wang Qi, "Simulation and Performance Evaluation for Network on Chip Design using OPNET", IEEE Region 10 Conference (TENCON 2007), Oct.-Nov. 2007.
[38] Abdul Ansari, Mohammad Ansari and Mohammad Khan, "Performance Evaluation of Various Parameters of Network-on-Chip (NoC) for Different Topologies", IEEE conference, 2015.
[39] Partha Pratim Pande, Cristian Grecu, Andre Ivanov and Res Saleh, "Design of a Switch for Network on Chip Applications", Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS'03), vol. 5, pp. 217-220, 2003.
[40] Jose Duato, Sudhakar Yalamanchili and Lionel Ni, "Interconnection Networks: An Engineering Approach", Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
[41] N. E. Jerger and L. S. Peh, "On-Chip Networks", Morgan & Claypool Publishers, 2009.
[42] Shashi Kumar, Axel Jantsch, Juha-Pekka Soininen, Martti Forsell, Mikael Millberg, Johnny Oberg, Kari Tiensyrja and Ahmed Hemani, "A Network on Chip Architecture and Design Methodology", IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 105-112, 2002.
[43] Partha Pratim Pande, Cristian Grecu, Andre Ivanov and Res Saleh, "Design of Switch for Network-on-Chip Applications", Proceedings of the 2003 International Symposium on Circuits and Systems, ISCAS 2003.
[44] P. P. Pande, Cristian Grecu, Michael Jones, Andre Ivanov and Resve Saleh, "Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures", IEEE Transactions on Computers, vol. 54, no. 8, August 2006.
[45] Abdul Quaiyum Ansari, Mohammad Rashid Ansari and Mohammad Ayoub Khan, "Performance Evaluation of Various Parameters of Network-on-Chip (NoC) for Different Topologies", 978-1-4673-6540, IEEE, June 2015.
[46] Bharat Phanibhushana and Sandip Kundu, "Network-on-Chip Design for Heterogeneous Multiprocessor System-on-Chip", IEEE Computer Society Annual Symposium on VLSI, 2014.
[47] Mahendra Gaikwad and Rajendra Patrikar, "Perfect Difference Network for Network-on-Chip Architecture", International Journal of Computer Science and Network Security, vol. 9, No. 12, December 2009.
[48] Chia-Hsin Owen Chen, Niket Agarwal, Tushar Krishna, Kyung-Hoae Koo, Li-Shiuan Peh and Krishna C. Saraswat, "Physical vs. Virtual Express Topologies with Low-Swing Links for Future Many-core NoCs", Fourth ACM/IEEE International Symposium on Network-on-Chip, 2010.
[49] Ying-Cherng Lan, Hsiao-An Lo, Yu Hen Hu and Sao-Jie Chen, "A Bidirectional NoC (BiNoC) Architecture with Dynamic Self-Reconfigurable Channel", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, No. 3, March 2011.
[50] Snaider Carrillo, Jim Harkin, Liam J. McDaid, Fearghal Morgan, Sandeep Pande, Seamus Cawley and Brian McGinley, "Scalable Hierarchical Network-on-Chip Architecture for Spiking Neural Network Hardware Implementations", IEEE Transactions on Parallel and Distributed Systems, vol. 24, No. 12, December 2013.
[51] Md. Hasan Furhad and Jong-Myon Kim, "An Extended Diagonal Mesh Topology for Network-on-Chip Architectures", International Journal of Multimedia and Ubiquitous Engineering, vol. 10, No. 10, 2015.
[52] Wu Chang, Li Yubai and Chai Song, "Design and Simulation of a Torus topology for network on chip", Journal of Systems Engineering and Electronics, vol. 19, No. 4, 2008.
[53] Jawwad Latif, Hassan Nazeer Chadhry, Sadia Azam and Naveed Khan Baloch, "Design Trade-off and Performance Analysis of Router Architectures in Network-on-Chip", 2nd International Workshop on Design and Performance of Networks on Chip (DPNoC 2015), Procedia Computer Science 56 (2015), pp. 421-426.
[54] Umamaheswari S., Meganathan D. and Raja Paul Perinbam J., "Runtime buffer management to improve the performance of irregular Network-on-Chip architecture", Sadhana (Indian Academy of Sciences), vol. 40, part 4, June 2015, pp. 1117-1137.
[55] Anbu Chozhan P., D. Muralidharan and R. Muthaiah, "Performance enhanced router design for network on chip", International Journal of Engineering and Technology (IJET), vol. 5, No. 2, April-May 2013.
[56] Qing Sun, Lijun Zhang and Anding Du, "Design and Implementation of the Wormhole Virtual Channel NoC Router", 4th International Conference on Computer Science and Network Technology (ICCSNT 2015).
[57] Rohit Sunkam Ramanujam, Vassos Soteriou, Bill Lin and Li-Shiuan Peh, "Design of a High-Throughput Distributed Shared-Buffer NoC Router", Fourth ACM/IEEE International Symposium on Network-on-Chip, 2010.
[58] Weiwei Fu, Jingcheng Shao, Bin Xie, Tianzhou Chen and Li Liu, "Design of a High-Throughput NoC Router with Neighbor Flow Regulation", 14th International Conference on High Performance Computing and Communications, IEEE, 2012.
[59] Yaniv Ben-Itzhak, Israel Cidon, Avinoam Kolodny, Michael Shabun and Nir Shmuel, "Heterogeneous NoC Router Architecture", IEEE Transactions on Parallel and Distributed Systems, Vol. 26, No. 9, September 2015.
[60] Ahmed Bin Dammas, Adel Soudani and Abdullah Al-Dhelaan, "Design of NoC router with flow control mechanism for congestion avoidance", IEEE conference, 22-24 June 2013.
[61] Bibhas Ghoshal, Kanchan Manna, Santanu Chattopadhyay and Indrani Sengupta, "In-Field Test for Permanent Faults in Buffers of NoC Routers", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, No. 1, 2016.
[62] Wooyoung Jang and David Z. Pan, "SDRAM-Aware Router for Network-on-Chip", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, No. 10, October 2010.
[63] Gaoming Du, Jing He, Yukun Song, Duoli Zhang and Huajie Wu, "Comparison of NoC Routing Algorithms Based on Packet-Circuit Switching", Third International Conference on Information Science and Technology, March 23-25, 2013.
[64] Ebrahim Behrouzian-Nezhad and Ahmad Khademzadeh, "BIOS: A New Efficient Routing Algorithm for Network on Chip", Contemporary Engineering Sciences, vol. 2, No. 1, 2009.
[65] Andrew DeOrio, David Fick, Valeria Bertacco, Dennis Sylvester, David Blaauw, Jin Hu and Gregory Chen, "A Reliable Routing Architecture and Algorithm for NoCs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 31, No. 5, May 2012.
[66] Wei Ni and Zhenwei Liu, "A routing algorithm for Network on Chip with self-similar traffic", 11th International Conference on ASIC (ASICON), 2015.
[67] Jan Moritz Joseph, Christopher Blochwitz and Thilo Pionteck, "Adaptive allocation of default router paths in Network on Chip for latency reduction", International Conference on High Performance Computing & Simulation (HPCS), 2016.
[68] K. Tatas and C. Chrysostomou, "Adaptive Network on Chip Routing with Fuzzy Logic Control", Euromicro Conference on Digital System Design (DSD), 2016.
[69] Salih Bayar and Arda Yurdakul, "An Efficient Mapping Algorithm on 2-D Mesh Network-on-Chip with Reconfigurable Switches", 11th International Conference on Design & Technology of Integrated Systems in Nanoscale Era (DTIS), 2016.
[70] Cisse Ahmadou Dit Adi, Ping Qiu, Hidetsugu Irie, Takefumi Miyoshi and Tsutomu Yoshinaga, "OREX: An Optical Ring with Electrical Crossbar Hybrid Photonic Network-on-Chip", International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, 2010.
[71] Xianfang Tan, Mei Yang, Lei Zhang, Yingtao Jiang and Jianyi Yang, "A Generic Optical Router Design for Photonic Network-on-Chips", Journal of Lightwave Technology, vol. 30, No. 3, February 1, 2012.
[72] Parisa Khadem Hamedani, Natalie Enright Jerger and Shaahin Hessabi, "QuT: A Low-Power Optical Network-on-Chip", Eighth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), 2014.
[73] Xiaolu Wang, Huaxi Gu, Yintang Yang, Kun Wang and Qinfen Hao, "A Highly Scalable Optical Network-on-Chip with Small Network Diameter and Deadlock Freedom", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, No. 12, 2016.
[74] Xiuhua Li, Huaxi Gu, Ke Chen, Liang Song and Qingfen Hao, "STorus: A New Topology for Optical Network-on-Chip", Optical Switching and Networking, vol. 22, November 2016.
[75] Somayyeh Koohi and Shaahin Hessabi, "Scalable architecture for a contention-free optical network-on-chip", Journal of Parallel and Distributed Computing, vol. 72, No. 11, November 2012.
[76] Lei Zhang, Mei Yang, Yingtao Jiang and Emma Regentova, "Architectures and routing schemes for optical network-on-chips", Computers and Electrical Engineering, 2009.
[77] Zhiliang Qian, Da-Cheng Juan, Paul Bogdan, Chi-Ying Tsui, Diana Marculescu and Radu Marculescu, "A Comprehensive and Accurate Latency Model for Network-on-Chip Performance Analysis", IEEE/ACM ASP-DAC'14, 22 Jan. 2014, Singapore.
[78] James Hurt, Andrew May, Xiaohan Zhu, and Bill Lin, "Design and implementation of high-speed symmetric crossbar schedulers", Proc. ICC'99, Vancouver, Canada, June 1999, S37-6.
[79] Joachim Rodrigues, Introduction to Structured VLSI Design.
[80] https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/manual/intro_to_quartus2.pdf
[81] Bouraoui Chemli and Abdelkrim Zitouni, "Design of a Network-on-Chip router based on turn model", 16th International Conference on Sciences and Techniques of Automatic Control & Computer Engineering (STA'2015), Monastir, Tunisia, December 21-23, 2015.
[82] Hung K. Nguyen and Xuan-Tu Tran, "Design and Implementation of a Hybrid Switching Router for the Reconfigurable Network-on-Chip", International Conference on Advanced Technologies for Communications (ATC), 2016.
[83] Amit Bhanwala, Mayank Kumar and Yogendera Kumar, "FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)", International Conference on Computing, Communication and Automation (ICCCA 2015), IEEE, 2015.
[84] Avalon Interface Specifications: http://www.altera.com/mnl_avalon_spec.pdf
[85] Zuo Zhen, Tang Guilin, Dong Zhi and Huang Zhiping, "Design and realization of the hardware platform based on the Nios soft-core processor", 8th International Conference on Electronic Measurement and Instruments (ICEMI '07), Proc. IEEE, pp. 865-869, July 2007.
[86] Alcalde, Ortmann, M. S. and Mussa, S. A., "NIOS II Processor Implemented in FPGA: An Application on Control of a PFC Converter", Power Electronics Specialists Conference (PESC 2008), Proc. IEEE, pp. 4446-4451, June 2008.
[87] Wang Ziting, Guo Haili and Sun Yan, "Design of VGA Image Controller Based on SOPC Technology", International Conference on New Trends in Information and Service Science (NISS '09), Proc. IEEE, pp. 825-827, July 2009.
[88] Lin Fei-yu, Jiao Xi-xiang, Guo Yu-Hui, Zhang Jian-chuan, Qiao Wei-ming, Jing Lan, Wang Yan-Yu and Ma Yun-hai, "System On Programmable Chip Development System", Second International Workshop on Education Technology and Computer Science (ETCS), Proc. IEEE, Volume 1, pp. 471-474, March 2010.
[89] First Time Designer's Guide, http://www.altera.com/literature/hb/nios2/edh_ed51001.pdf
[90] Quartus II Version 7.2 Handbook, Volume 4: SOPC Builder, http://www.cs.columbia.edu/~sedwards/classes/2008/4840/qts_qii5v4.pdf
[91] NIOS II integrated development environment. [Online]. Available: http://www.altera.com/
[92] M. Jones and G. Gopalakrishnan, "Verifying Transaction Ordering Properties in Unbounded Bus Networks through Combined Deductive/Algorithmic Methods", in Proceedings of the Third International Conference on Formal Methods in Computer-Aided Design, pages 505-519, Springer-Verlag, November 2000.
[93] W. J. Dally and C. L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks", IEEE Transactions on Computers, C-36(5):547-553, May 1987.
[94] X. Lin, P. K. McKinley, and L. M. Ni, "Deadlock-free multicast wormhole routing in 2-D mesh multicomputers", IEEE Transactions on Parallel and Distributed Systems, 5(8):793-804, August 1994.
[95] E. M. Clarke, O. Grumberg, and D. A. Peled, Model Checking, MIT Press, 1999.
[96] D. Borrione, A. Helmy, L. Pierre, and J. Schmaltz, "A Formal Approach to the Verification of Networks on Chip", EURASIP Journal on Embedded Systems, 2009(548324):1-14, February 2009.
[97] J. Schmaltz and D. Borrione, "Towards a Formal Theory of On Chip Communications in the ACL2 Logic", in Proceedings of the Sixth International Workshop on the ACL2 Theorem Prover and Its Applications, pages 47-56, ACM, August 2006.
[98] G. Salaun, W. Serwe, Y. Thonnart, and P. Vivet, "Formal Verification of CHP Specifications with CADP: Illustration on an Asynchronous Network-on-Chip", in Proceedings of the International Symposium on Asynchronous Circuits and Systems, pages 73-82, IEEE Computer Society Press, March 2007.
[99] Yean-Ru Chen, Wan-Ting Su, Pao-Ann Hsiung, Ying-Cherng Lan, Yu-Hen Hu and Sao-Jie Chen, "Formal Modeling and Verification for Network-on-Chip", IEEE International Conference on Green Circuits and Systems, 2010.
[100] Anam Zaman and Osman Hasan, "Formal verification of circuit-switched Network-on-Chip (NoC) architectures using SPIN", IEEE International Symposium on System-on-Chip (SoC), 2014.
[101] Arnab Banerjee, Pascal T. Wolkotte, Robert D. Mullins, Simon W. Moore, and Gerard J. M. Smit, "An Energy and Performance Exploration of Network-on-Chip Architectures", IEEE Transactions on VLSI Systems, vol. 17, No. 3, p. 319, March 2009.
[102] Naoya Onizawa et al., "High-Throughput Compact Delay-Insensitive Asynchronous NoC Router", IEEE Transactions on Computers, DOI 10.1109/TC.2013.81, 2013.
[103] Huaxi Gu, Kwai Hung Mo, Jiang Xu and Wei Zhang, "A Low-power Low-cost Optical Router for Optical Networks-on-Chip in Multiprocessor Systems-on-Chip", IEEE Computer Society Annual Symposium on VLSI, 2009.
[104] Reetuparna Das, Onur Mutlu, Thomas Moscibroda and Chita R. Das, "Aérgia: Exploiting Packet Latency Slack in On-Chip Networks", ISCA'10, June 19-23, 2010, Saint-Malo, France.
[105] Sunghyun Park, Tushar Krishna, Chia-Hsin Owen Chen, Bhavya Daya, Anantha P. Chandrakasan and Li-Shiuan Peh, "Approaching the Theoretical Limits of a Mesh NoC with a 16-Node Chip Prototype in 45nm SOI", DAC 2012, June 3-7, 2012, San Francisco, California, USA.
[106] J. Flich, P. Lopez, M. P. Malumbres and J. Duato, "Improving the Performance of Regular Networks with Source Routing", Proceedings of the IEEE International Conference on Parallel Processing, 21-24 Aug. 2000, pages 353-361.
[107] C. Marcon et al., "Tiny-optimised 3D mesh NoC for area and latency minimization", Electronics Letters, 30th January 2014, Vol. 50, No. 3, pp. 165-166.
[108] Mukund Ramakrishna et al., "GCA: Global Congestion Awareness for Load Balance in Networks-on-Chip", IEEE Transactions on Parallel and Distributed Systems (TPDS), DOI 10.1109/TPDS.2015.2477840.
[109] Jie Chen et al., "Performance Evaluation of Three Network-on-Chip (NoC) Architectures (Invited)", IEEE International Conference on Communications in China: Communications QoS and Reliability (CQR), 2012.
[110] Zvika Guz et al., "Network Delays and Link Capacities in Application-Specific Wormhole NoCs", Hindawi Publishing Corporation, VLSI Design, Volume 2007, Article ID 90941, 15 pages, DOI: 10.1155/2007/90941.
[111] Junbok You et al., "Bandwidth Optimization in Asynchronous NoCs by Customizing Link Wire Length", ICCD 2010.
[112] Brett Stanley Feero and Partha Pratim Pande, "Network-on-Chip in a Three-Dimensional Environment: A Performance Evaluation", IEEE Transactions on Computers, vol. 58, No. 1, January 2009.
[113] Awet Yemane Weldezion, Matt Grange, Dinesh Pamunuwa, Zhonghai Lu, Axel Jantsch, Roshan Weerasekera and Hannu Tenhunen, "Scalability of network-on-chip communication architecture for 3D meshes", 3rd ACM/IEEE International Symposium on Networks-on-Chip, 2009.
[114] Amir Mohammad Rahmani, Kameswar Rao Vaddina, Khalid Latif, Pasi Liljeberg, Juha Plosila and Hannu Tenhunen, "High-Performance and Fault-Tolerant 3D NoC-Bus Hybrid Architecture using ARB-NET based Adaptive Monitoring Platform", IEEE Transactions on Computers, Vol. 63, No. 3, 2014.
[115] Seyyed Hossein Seyyedaghaei Rezaei, Abbas Mazloumi, Mehdi Modarressi and Pejman Lotfi-Kamran, "Dynamic Resource Sharing for High-Performance 3-D Network-on-Chip", IEEE Computer Architecture Letters, vol. 15, No. 1, 2016.
[116] Randy W. Morris Jr., Avinash Karanth Kodi, Ahmed Louri and Ralph D. Whaley Jr., "Three-Dimensional Stacked Nanophotonic Network-on-Chip Architecture with Minimal Reconfiguration", IEEE Transactions on Computers, vol. 63, No. 1, January 2014.
[117] Yaoyao Ye, Jiang Xu, Baihan Huang, Xiaowen Wu, Wei Zhang, Xuan Wang, Mahdi Nikdast, Zhehui Wang, Weichen Liu and Zhe Wang, "3-D Mesh-Based Optical Network-on-Chip for Multiprocessor System-on-Chip", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, No. 4, April 2014.
[118] Akram Ben Ahmed and Abderazek Ben Abdallah, "Architecture and design of high-throughput, low-latency, and fault-tolerant routing algorithm for 3D-network-on-chip (3D-NoC)", The Journal of Supercomputing, vol. 66, No. 3, December 2013.
[119] Sara Akbari, Ali Shafiee, Mahmoud Fathy and Reza Berangi, "AFRA: A low cost high performance reliable routing for 3D mesh NoCs", IEEE Conference on Design, Automation & Test in Europe (DATE), 2012.