On-Chip Communication: Networks on Chip (NoCs)
Sudeep Pasricha, Colorado State University
CS/ECE 561, Fall 2011
Outline
- Introduction
- NoC Topology
- Switching strategies
- Router Microarchitecture
- Routing algorithms
- Flow control schemes
- Clocking schemes
- QoS
- NoC Architecture Examples
- Status and Open Problems
Introduction
Evolution of on-chip communication architectures [figure]
Introduction
A network-on-chip (NoC) is a packet-switched on-chip communication network designed using a layered methodology. NoCs use packets to route data from the source to the destination PE via a network fabric that consists of:
- network interfaces (NIs)
- switches (routers)
- interconnection links (wires)
[Figure: a tile containing registers, ALU, and memory attaching to the network through its NI.]
Introduction
NoCs are an attempt to scale down the concepts of large-scale networks and apply them to the system-on-chip (SoC) domain.
NoC properties:
- Reliable and predictable electrical and physical properties
- Regular geometry that is scalable
- Flexible QoS guarantees
- Higher bandwidth
- Reusable components (buffers, arbiters, routers, protocol stack)
Introduction
ISO/OSI network protocol stack model [figure]
Building Blocks: Network Interface (NI)
Front end: a standardized node interface at the session layer; the initiator vs. target distinction is blurred. It defines:
1. the supported transactions (e.g., QoS read, ...)
2. the degree of parallelism
3. session protocol control flow and negotiation
Back end: NoC-specific (layers 1-4):
1. physical channel interface
2. link-level protocol
3. network layer (packetization)
4. transport layer (routing)
The front end presents a session-layer point-to-point (P2P) interface to the node over a standard P2P node protocol; the back end manages the interface with the switches over a proprietary link protocol, providing decoupling logic and synchronization.
Building Blocks: Switch
A router (switch) receives and forwards packets. Its main components: input buffers with flow control, a crossbar, an allocator/arbiter, QoS and routing logic, output buffers with flow control, and data ports with flow-control wires. The buffers have a dual function: synchronization and queueing.
On-chip vs. Off-chip Networks
Cost:
- Off-chip: cost is in channels: huge pads, expensive connectors, cables, optics
- On-chip: cost is silicon area and power (storage!); wires are not infinite, but plentiful
Channel characteristics:
- On-chip: wires are short, latency is comparable with logic, huge amount of bandwidth, can put logic in links; a lot of uncertainty (process variations) and interference (e.g., noise)
- Off-chip: wires are long, link latency dominates, bandwidth is precious, links are strongly decoupled
Workload:
- On-chip: non-homogeneous traffic, but much is known
- Off-chip: very little is known, and it may change
Design issues:
- On-chip: must fit in the floorplan (die-area constraint)
- Off-chip: dictates board/rack organization
NoC Concepts
- Topology: how the nodes are connected together
- Switching: allocation of network resources (bandwidth, buffer capacity, ...) to information flows
- Routing: path selection between a source and a destination node in a particular topology
- Flow control: how the downstream node communicates forwarding availability to the upstream node
NoC Topology
1. Direct Topologies
- each node has a direct point-to-point link to a subset of other nodes in the system, called neighboring nodes
- as the number of nodes in the system increases, the total available communication bandwidth also increases
- the fundamental trade-off is between connectivity and cost
NoC Topology
Most direct network topologies have an orthogonal implementation, where nodes can be arranged in an n-dimensional orthogonal space, e.g., an n-dimensional mesh, torus, folded torus, hypercube, or octagon.
The 2D mesh is the most popular topology:
- all links have the same length, which eases physical design
- area grows linearly with the number of nodes
- must be designed in such a way as to avoid traffic accumulating in the center of the mesh
NoC Topology
The torus topology, also called a k-ary n-cube, is an n-dimensional grid with k nodes in each dimension.
- a k-ary 1-cube (1-D torus) is essentially a ring network with k nodes; it has limited scalability, as performance decreases when more nodes are added
- a k-ary 2-cube (2-D torus) is similar to a regular 2D mesh, except that nodes at the edges are connected to switches at the opposite edge via wrap-around channels
- the long end-around connections can, however, lead to excessive delays
NoC Topology
The folded torus topology overcomes the long-link limitation of a 2-D torus: all links have the same length.
Meshes and tori can be extended by adding bypass links to increase performance, at the cost of higher area.
NoC Topology
The octagon topology is another example of a direct network:
- messages sent between any 2 nodes require at most two hops
- more octagons can be tiled together to accommodate larger designs, using one of the nodes as a bridge node
NoC Topology
2. Indirect Topologies
- each node is connected to an external switch, and switches have point-to-point links to other switches
- switches do not perform any information processing, and correspondingly nodes do not perform any packet switching
- e.g., SPIN and crossbar topologies
Fat tree topology:
- nodes are connected only to the leaves of the tree
- more links near the root, where bandwidth requirements are higher
NoC Topology
k-ary n-fly butterfly topology:
- a blocking multi-stage network: packets may be temporarily blocked or dropped in the network if contention occurs
- k^n nodes, and n stages of k^(n-1) k x k crossbars; e.g., a 2-ary 3-fly butterfly network (see the sketch below)
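As a quick check of that counting rule, here is a minimal Python sketch (the function name and example values are illustrative, not from the slides):

```python
def kary_nfly_size(k: int, n: int):
    """Return (terminal nodes, stages, crossbars per stage) of a k-ary n-fly."""
    nodes = k ** n                       # k^n terminal nodes
    crossbars_per_stage = k ** (n - 1)   # each stage has k^(n-1) k x k switches
    return nodes, n, crossbars_per_stage

# 2-ary 3-fly: 8 nodes, 3 stages of four 2x2 crossbars
print(kary_nfly_size(2, 3))  # (8, 3, 4)
```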
NoC Topology
3. Irregular or ad hoc network topologies
- customized for an application
- usually a mix of shared-bus, direct, and indirect network topologies
- e.g., reduced mesh, cluster-based hybrid topology
Messaging Units
Data is transmitted based on a hierarchical data structuring mechanism: messages are split into packets, packets into flits (flow control digits), and flits into phits (physical flow control digits). Flits and phits are fixed size; packets and messages may be variable sized. A phit is the unit of data transferred on a link in a single cycle; typically, phit size = flit size. A packet begins with a head flit carrying the header (destination info, sequence number, flit type, and miscellaneous fields), followed by body flits and a tail flit.
Switching determines "when" messaging units are moved (see the sketch below).
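A minimal sketch of this hierarchy, assuming phit size equals flit size; the field sizes FLIT_BYTES and MAX_PAYLOAD_FLITS are hypothetical, chosen only for illustration:

```python
FLIT_BYTES = 4          # assumed flit/phit width
MAX_PAYLOAD_FLITS = 8   # assumed maximum packet payload

def packetize(message: bytes, dest: int, seq: int):
    """Split a variable-sized message into packets of fixed-size flits."""
    packets = []
    chunk = FLIT_BYTES * MAX_PAYLOAD_FLITS
    for off in range(0, len(message), chunk):
        payload = message[off:off + chunk]
        flits = [("head", {"dest": dest, "seq": seq, "misc": 0})]
        for i in range(0, len(payload), FLIT_BYTES):
            flits.append(("body", payload[i:i + FLIT_BYTES]))
        flits.append(("tail", None))
        packets.append(flits)
        seq += 1
    return packets

pkts = packetize(b"hello, network-on-chip!", dest=7, seq=0)
print(len(pkts), "packet(s);", [len(p) for p in pkts], "flits each")
```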
Circuit Switching
- a hardware path is set up by a routing header or probe
- an end-to-end acknowledgment initiates transfer at full hardware bandwidth
- the system is limited by the signaling rate along the circuits
- routing, arbitration, and switching overhead are experienced once per message
[Timing diagram: the header probe incurs routing (t_r) and switching (t_s) delay at each hop during t_setup; once the acknowledgment returns, data flows for t_data while the links remain busy.]
Circuit Switching Example
1. Request for circuit establishment: a probe travels from the source end node toward the destination end node; routing and arbitration are performed during this step, with small buffers holding the "request" tokens.
2. Acknowledgment and circuit establishment: as the "ack" token travels back to the source, the connections along the path are established.
3. Packet transport: data then streams over the reserved circuit; neither routing nor arbitration is required.
Drawback: under high contention, reserved circuits block other traffic while their own links may sit idle; low utilization leads to low throughput.
Virtual Circuit Switching
Goal: reduce the cost associated with circuit switching. Multiple virtual circuits (channels) are multiplexed on a single physical link; each flit carries a virtual-circuit identifier and type field along with the packet data. Two buffer organizations:
- allocate one buffer per virtual link: can be expensive due to the large number of shared buffers
- allocate one buffer per physical link: uses time-division multiplexing (TDM) to statically schedule link usage; less expensive routers
Virtual Circuit Switching Example
[Figure: three routers and their network interfaces, with several virtual circuits multiplexed over the physical links (e.g., input 2 of router 1 is output 1 of router 2). Each router holds a TDM slot table: for every time slot, the table records which input is routed to which output (entries such as i1 -> o2), with unused slots left empty. Slots are assigned so that no two circuits contend for the same output in the same slot; the schedule divides up the link bandwidth, as in the sketch below.]
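A minimal sketch of the TDM slot-table idea; the slot count and port numbers are illustrative, not taken from the figure:

```python
NUM_SLOTS = 4

class SlotTable:
    """Per-router TDM table: each output has one entry per time slot,
    holding the input routed to that output in that slot (or None)."""
    def __init__(self, num_outputs: int):
        self.table = {o: [None] * NUM_SLOTS for o in range(num_outputs)}

    def reserve(self, slot: int, inp: int, out: int) -> bool:
        """Reserve input->output in a slot; fail if the output is taken."""
        if self.table[out][slot] is not None:
            return False  # contention: another circuit owns this slot
        self.table[out][slot] = inp
        return True

router = SlotTable(num_outputs=3)
assert router.reserve(slot=0, inp=1, out=2)       # circuit A admitted
assert not router.reserve(slot=0, inp=0, out=2)   # circuit B must pick another slot
assert router.reserve(slot=1, inp=0, out=2)       # ...and succeeds there
```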
Packet Switching
Packets are transmitted from the source and make their way independently to the receiver, possibly along different routes and with different delays. There is zero start-up time, followed by a variable delay due to contention in routers along the packet path; QoS guarantees are therefore harder to make.
Three main packet switching scheme variants:
- Store and Forward (SAF) switching
- Virtual Cut-Through (VCT) switching
- Wormhole (WH) switching
Packet Switching: Store and Forward (SAF)
- routing, arbitration, and switching overheads are experienced for each packet
- increased storage requirements at the nodes
- packetization and in-order delivery requirements
- alternative buffering schemes: use of local processor memory, or queues central to the switch
Packets are completely stored at each switch before any portion is forwarded (store, then forward). Requirement: buffers must be sized to hold an entire packet (MTU).
[Timing diagram: at every hop the link stays busy for the full packet time t_packet (header plus data) before the next hop's routing delay t_r begins.]
Packet Switching: Virtual Cut-Through (VCT)
- messages cut through to the next router when feasible; in the absence of blocking, messages are pipelined
- the pipeline cycle time is the larger of the intra-router and inter-router flow control delays
- when the header is blocked, the complete message is buffered at a switch
- high-load behavior approaches that of SAF
Portions of a packet may be forwarded ("cut through") to the next switch before the entire packet is stored at the current switch.
[Timing diagram: after t_r + t_s at each router, the packet header cuts through while the data streams behind it; when blocked, the message waits for t_blocking and is buffered locally.]
Packet Switching: Wormhole (WH)
- messages are pipelined, but buffer space is on the order of a few flits
- small buffers + message pipelining = small, compact switches/routers
- supports variable-sized messages
- messages cannot be interleaved over a channel: routing information is associated only with the header flit
- base latency is equivalent to that of virtual cut-through
[Timing diagram: the header flit incurs t_r + t_s at each hop; single body flits follow in a pipeline for a total of t_wormhole.]
Virtual Cut-Through vs. Wormhole
- VCT: buffers hold whole data packets; buffers must be sized to hold an entire packet (MTU)
- Wormhole: buffers hold flits; packets can be larger than the buffers
When blocked at a busy link, a VCT packet is completely stored at one switch, whereas a wormhole packet remains stored in flit buffers along the path, holding resources at several switches.
Comparison of Packet Switching Techniques
SAF packet switching and virtual cut-through:
- consume network bandwidth proportional to the network load
- VCT behaves like wormhole at low loads and like SAF packet switching at high loads
- high buffer costs
Wormhole switching:
- provides low (unloaded) latency
- lower saturation point
- higher variance of message latency than SAF packet or VCT switching
(See the latency sketch below.)
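The latency contrast can be made concrete with the standard unloaded-latency approximations; this is a sketch, and the parameter values (t_r, flit_time) are illustrative:

```python
def saf_latency(hops, packet_flits, t_r=1.0, flit_time=1.0):
    # store-and-forward: the whole packet is serialized at every hop
    return hops * (t_r + packet_flits * flit_time)

def cut_through_latency(hops, packet_flits, t_r=1.0, flit_time=1.0):
    # VCT / wormhole without blocking: the header pays the per-hop cost,
    # the rest of the packet pipelines behind it and is serialized once
    return hops * t_r + packet_flits * flit_time

print(saf_latency(4, 16))          # 68.0 cycles
print(cut_through_latency(4, 16))  # 20.0 cycles
```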
Generic Router Architecture
[Figure: block diagram of a generic NoC router.]
Pipelined Router Microarchitecture
[Figure: a five-stage router pipeline. Stage 1: link traversal and input buffering (LT & IB); Stage 2: routing computation (RC); Stage 3: virtual-channel allocation (VCA); Stage 4: switch allocation (SA); Stage 5: switch traversal and output buffering (ST). Flits arrive over the physical channels through link control and a demux into the input buffers, traverse the crossbar, and leave through output buffers and a mux onto the outgoing physical channels.]
Virtual Channel Router
[Figure: each input port (input 0 ... input 4) holds n virtual-channel input buffers (VC 1 ... VC n) feeding a crossbar switch toward the outputs (output 0 ... output 4), under the control of a route-computation unit, a VC allocator, and a switch allocator.]
Pipeline stages: RC (route computation), VCA (VC allocation), SA (switch allocation), ST (switch traversal), LT (link traversal). By using prediction/speculation, the pipeline can be made more compact.
Router Power Dissipation
[Figure: router power breakdown for a 6x6 mesh with a 4-stage pipeline at over 3 GHz, using wormhole switching.]
Source: Partha Kundu, "On-Die Interconnects for Next-Generation CMPs", talk at On-Chip Interconnection Networks Workshop, Dec 2006
Head-of-Line (HOL) Blocking
Can be a problem in NoC routers: with a single FIFO per input, it limits the throughput of switches to 58.6%.
Solution: Virtual Output Queues (VOQs), an input-queuing strategy in which each input port maintains a separate queue for each output port (see the sketch below).
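A minimal VOQ sketch; the class and port names are illustrative:

```python
from collections import deque

class VOQInputPort:
    """Each input port keeps one queue per output port, so a flit blocked
    toward one output never delays flits destined for other outputs."""
    def __init__(self, num_outputs: int):
        self.queues = [deque() for _ in range(num_outputs)]

    def enqueue(self, flit, out_port: int):
        self.queues[out_port].append(flit)

    def dequeue_for(self, out_port: int):
        q = self.queues[out_port]
        return q.popleft() if q else None

port = VOQInputPort(num_outputs=4)
port.enqueue("A", out_port=2)  # suppose output 2 is currently blocked
port.enqueue("B", out_port=3)  # this flit is not stuck behind "A"
print(port.dequeue_for(3))     # "B" departs even while output 2 is busy
```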
Routing Algorithms
Responsible for correctly and efficiently routing packets or circuits from the source to the destination: path selection between a source and a destination node in a particular topology. Goals:
- ensure load balancing
- minimize latency
- flexibility with respect to faults in the network
- deadlock- and livelock-free solutions
Static vs. Dynamic Routing
Static routing: fixed paths are used to transfer data between a particular source and destination; it does not take into account the current state of the network.
- advantages: easy to implement, since very little additional router logic is required; in-order packet delivery if a single path is used
Dynamic routing: routing decisions are made according to the current state of the network, considering factors such as availability and load on links.
- the path between source and destination may change over time, as traffic conditions and the requirements of the application change
- more resources are needed to monitor the state of the network and dynamically change routing paths
- able to better distribute traffic in the network
Example: Dimension-Order Routing
Static XY routing (commonly used): a deadlock-free, shortest-path routing which routes packets in the X dimension first and then in the Y dimension. Used for tori and meshes; the destination address is expressed as absolute coordinates (see the sketch below).
[Figure: two grids of nodes 00-23. For a torus, a preferred direction (e.g., -x or +y around the wrap-around links) may have to be selected; for a mesh, the preferred direction is the only valid direction.]
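A minimal XY dimension-order routing sketch for a 2D mesh; the port names and the convention that NORTH means +y are assumptions for illustration:

```python
def xy_route(cur, dst):
    """Return the output port for the next hop, X dimension first."""
    cx, cy = cur
    dx, dy = dst
    if dx != cx:                      # route in X first
        return "EAST" if dx > cx else "WEST"
    if dy != cy:                      # then in Y
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                    # arrived: eject to the local PE

print(xy_route((0, 0), (2, 3)))  # EAST
print(xy_route((2, 0), (2, 3)))  # NORTH
print(xy_route((2, 3), (2, 3)))  # LOCAL
```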
Example: Adaptive Routing
Chooses the shortest path to the destination with minimum congestion. The path can change over time, because congestion changes over time.
[Figure: grid of nodes 00-23 showing alternative paths around congestion.]
Pitfall: Dynamic Routing
A locally optimal decision may lead to a globally sub-optimal route.
[Figure: to avoid slight congestion on link (01-02), packets are diverted and then incur more congested links.]
Non-Algorithmic vs. Algorithmic Routing
Non-algorithmic schemes do not use an algorithm in the routers to compute the route direction, e.g., table-based routing schemes: a simple lookup into a local table that stores route information determines where to send each flit from the input buffers.
Algorithmic routing
For a given topology and routing algorithm, simple logic computes the route:
- usually implemented as combinational logic
- may require information carried in the packet header
- specific to a network configuration
Example (2D torus): the packet header contains a signed offset to the destination in each dimension; at each hop, the router updates the offset in one dimension; when x == 0 and y == 0, the packet is at the correct processor (see the sketch below).
[Figure: per-dimension sign/offset fields (sx, x, sy, y) compared against zero to produce a productive direction vector.]
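A sketch of this offset-based torus routing; the direction labels and the X-before-Y order are illustrative choices:

```python
def step(offsets):
    """Consume one hop from the (x, y) offset header; None when arrived."""
    x, y = offsets
    if x != 0:
        return ("+X" if x > 0 else "-X"), (x - (1 if x > 0 else -1), y)
    if y != 0:
        return ("+Y" if y > 0 else "-Y"), (x, y - (1 if y > 0 else -1))
    return None, offsets  # x == 0 and y == 0: at the correct processor

hdr = (2, -1)  # signed offsets computed at the source (may use wrap-around)
while True:
    direction, hdr = step(hdr)
    if direction is None:
        break
    print(direction, hdr)  # +X (1, -1) / +X (0, -1) / -Y (0, 0)
```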
Distributed vs. Source Routing
Distributed routing: each packet carries the destination address, e.g., XY coordinates or a number identifying the destination node/router; routing decisions are made in each router by looking up the destination address in a routing table or by executing a hardware function.
Source routing: the packet carries the routing information itself. Pre-computed routing tables are stored at a node's NI; the routing information is looked up at the source NI and added to the header of the packet (increasing the packet size). When a packet arrives at a router, the routing information is extracted from the routing field in the packet header. This requires no destination address in the packet, no intermediate routing tables, and no functions to calculate the route (see the sketch below).
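A minimal source-routing sketch; the table contents and port names are hypothetical:

```python
ROUTE_TABLE = {  # destination id -> list of output ports, one per router
    7: ["EAST", "EAST", "NORTH", "LOCAL"],
}

def build_packet(dst: int, payload: bytes):
    """Source NI: look up the precomputed route and prepend it."""
    return {"route": list(ROUTE_TABLE[dst]), "payload": payload}

def router_forward(packet):
    """Each router simply consumes the next hop from the header."""
    return packet["route"].pop(0)

pkt = build_packet(7, b"data")
while True:
    port = router_forward(pkt)
    print(port)              # EAST, EAST, NORTH, LOCAL
    if port == "LOCAL":
        break
```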
Minimal vs. Non-minimal Routing
Minimal routing: the length of the routing path from source to destination is the shortest possible between the 2 nodes; the source does not start sending a packet if a minimal path is not available.
Non-minimal routing: can use longer paths if a minimal path is not available. Allowing non-minimal paths increases the number of alternative paths, which can be useful for avoiding congestion; the disadvantage is the overhead of additional power consumption.
[Figure: grid of nodes 00-23 where minimal adaptive routing is unable to avoid congested links in the absence of minimal-path diversity.]
Ordered vs. Unordered Routing
Ordered: one route per source-destination (S-D) pair; no traffic splitting.
Unordered: potentially multiple routes per S-D pair.
[Figure: unordered routing vs. ordered routing on a grid.]
Example: Toggle XY (TXY)
XY alone is unbalanced: high cost in uniform-capacity grids. TXY splits packets evenly between the XY and YX routes; deadlock is avoided with 2 VCs. Near-optimal for symmetric traffic (permutations) [Seo et al. 05; Towles & Dally 02]. Simple and better balanced.
Drawbacks of split routes: packets of the same flow take different paths; delays may cause out-of-order arrivals; re-ordering buffers are costly; and the scheme does not take the traffic pattern into account. It is a static, minimal scheme. Extension: stochastically bias the XY/YX routes at design time.
[Figure: a flow of rate f split into f/2 on the XY route (VC v1) and f/2 on the YX route (VC v2).]
Example: Source Toggle XY (STXY)
The route (XY or YX) is a function of the source and destination IDs: a bitwise XOR. A very simple algorithm; maximum capacity is similar to TXY (see the sketch below).
[Figure: alternating assignment of XY and YX routes across source-destination pairs.]
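A sketch of the idea; the slides only say the choice is a function of the bitwise XOR of the IDs, so reducing the XOR to its parity here is an assumption made for illustration:

```python
def stxy_route_choice(src_id: int, dst_id: int) -> str:
    """Pick XY or YX per (source, destination) pair from the XOR of the IDs.
    Parity of the XOR bits is an assumed reduction, not from the slides."""
    return "XY" if bin(src_id ^ dst_id).count("1") % 2 == 0 else "YX"

for s, d in [(0, 5), (1, 5), (2, 7)]:
    print(s, d, stxy_route_choice(s, d))
```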
Weighted Ordered Toggle (WOT)
The route per S-D pair is chosen at programming time: each source stores a routing bit for each destination. Objective: minimize the maximum link capacity. The optimal route assignment is difficult to find.
WOT Min-Max Route Assignment
Initial assignment: STXY. Then make changes that reduce the capacity: find the most loaded link; among the S-D pairs sharing this link, change the one that minimizes the maximum capacity (if possible). The result is sub-optimal.
[Figure: flows S1-D1, S2-D2, S3-D3 sharing links on a grid.]
Deadlock-Free Routing Requirements
The routing algorithm must ensure freedom from deadlocks, which are common in WH switching, e.g., the cyclic dependency shown below. Freedom from deadlocks can be ensured by allocating additional hardware resources or by imposing restrictions on the routing. Usually a channel dependency graph (CDG) of the shared network resources is built and analyzed, either statically or dynamically.
Dependencies
When a packet holds a channel and requests another channel, there is a direct dependency between them. The channel dependency graph is D = G(C, E), with channels as vertices and dependencies as edges. For deterministic routing there is a single dependency at each node; for adaptive routing, all requested channels produce dependencies, and the dependency graph may contain cycles.
[Figure: in SAF and VCT, a dependency exists only between the channel holding the packet and the requested channel; in wormhole, the header flit can hold several channels along the path while data flits stretch behind it, creating longer dependency chains.]
Breaking Cyclic Dependencies
The four-node ring configuration to the left (nodes n0-n3, channels c0-c3) can deadlock. Remedies: add (virtual) channels, splitting each channel ci into two virtual channels c0i and c1i, and/or make the channel dependency graph acyclic via routing restrictions (i.e., via the routing function).
[Figure: ring n0-n1-n2-n3 with channels c0-c3, and the same ring with each channel split into virtual channels c00/c10, c01/c11, c02/c12, c03/c13.]
Breaking Cyclic Dependencies
The routing function selects channel c0i when j <= i and c1i when j > i, where j is the hop distance from the source (here we assume i = 3). Channels c00 and c13 are unused. This routing function breaks the cycles in the channel dependency graph (see the sketch below).
[Figure: the resulting acyclic channel dependency graph over c10, c11, c12, c01, c02, c03.]
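A sketch of the CDG analysis the slides describe: build the graph and check it for cycles (DFS-based detection); the channel names mirror the figure, but the dependency sets are illustrative:

```python
def has_cycle(deps):
    """deps: channel -> iterable of channels it may request next.
    Every channel must appear as a key. Returns True if the CDG is cyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def dfs(c):
        color[c] = GRAY
        for nxt in deps[c]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and dfs(nxt)):
                return True   # back edge found: cycle
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)

ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
print(has_cycle(ring))   # True: the 4-node ring can deadlock

split = {"c01": ["c02"], "c02": ["c03"], "c03": ["c10"],
         "c10": ["c11"], "c11": ["c12"], "c12": []}
print(has_cycle(split))  # False: virtual channels + routing restriction
```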
Turn Model Based Routing
What is a turn? From one dimension to another: a 90-degree turn. To another virtual channel in the same direction: a 0-degree turn. To the reverse direction: a 180-degree turn. Turns combine to form cycles. Goal: prohibit the least number of turns needed to break all possible cycles. Examples: west-first, north-last, negative-first; these are partially adaptive schemes (a west-first sketch follows below).
[Figure: the abstract cycles in a 2D mesh, and the turns prohibited by YX routing and by west-first routing.]
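A minimal sketch of the west-first restriction, assuming the standard rule that all westward hops must come first, so turns into the west direction are prohibited (the direction encoding is illustrative):

```python
# Allowed (previous hop, next hop) pairs under west-first routing in a 2D
# mesh; 180-degree reversals are omitted, and no turn *into* W appears.
ALLOWED_WEST_FIRST = {
    ("W", "W"), ("W", "E"), ("W", "N"), ("W", "S"),  # west leg comes first
    ("E", "E"), ("E", "N"), ("E", "S"),
    ("N", "N"), ("N", "E"),
    ("S", "S"), ("S", "E"),
}

def turn_allowed(prev_dir: str, next_dir: str) -> bool:
    return (prev_dir, next_dir) in ALLOWED_WEST_FIRST

print(turn_allowed("N", "W"))  # False: turning into west is prohibited
print(turn_allowed("W", "N"))  # True: west first, then north
```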
Other Routing Algorithm Requirements
The routing algorithm must ensure freedom from livelocks. Livelocks are similar to deadlocks, except that the states of the resources involved constantly change with regard to one another without making any progress; they occur especially when dynamic (adaptive) routing is used, e.g., in deflective "hot potato" routing, if a packet is bounced around over and over again between routers and never reaches its destination. Livelocks can be avoided with simple priority rules.
The routing algorithm must also ensure freedom from starvation: under scenarios where certain packets are prioritized during routing, some of the low-priority packets may never reach their intended destination. Starvation can be avoided by using a fair routing algorithm, or by reserving some bandwidth for low-priority data packets.
Flow Control
Required in non-circuit-switched networks to deal with congestion (it regulates the inflow of flits into the NoC) and to recover from transmission errors. Commonly used schemes:
- ACK/NACK flow control
- Credit-based flow control
- Xon/Xoff (STALL-GO) flow control
[Figure: "backpressure": when node C blocks, its buffer fills and C tells B "buffer full, don't send"; B's buffer then fills, and B tells A the same.]
Flow Control Schemes
ACK/NACK:
- when flits are sent on a link, a local copy is kept in a buffer by the sender
- when an ACK is received, the sender deletes its copy of the flit from its local buffer
- when a NACK is received, the sender rewinds its output queue and starts resending flits, starting from the corrupted one
- implemented either end-to-end or switch-to-switch
- the sender needs a buffer of size at least 2N + k, where N is the number of buffers encountered between source and destination and k depends on the latency of the logic at the sender and receiver (see the sketch below)
- the fault-handling support comes at the cost of greater power and area overhead
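A sketch of that sizing rule; the example values are illustrative:

```python
def ack_nack_sender_buffer(n_path_buffers: int, k_logic_latency: int) -> int:
    """Minimum sender buffer under ACK/NACK: 2N + k flits may be in flight
    (N outbound + N for the returning ACK/NACK, plus k cycles of
    sender/receiver logic latency) before the oldest flit is confirmed."""
    return 2 * n_path_buffers + k_logic_latency

print(ack_nack_sender_buffer(n_path_buffers=6, k_logic_latency=2))  # 14
```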
Flow Control Schemes
Credit-based flow control: the sender maintains a credit counter, initialized to the receiver's buffer depth (10 in the example), and sends packets whenever the counter is not zero, decrementing it per packet; transfers are pipelined. If the receiver's queue is not serviced, the counter drains to zero and the sender stalls. As the receiver frees buffer slots, it sends credits back (e.g., +5); the sender's counter is replenished and injection resumes (see the sketch below).
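A minimal sketch of the sender side; the buffer depth of 10 matches the slide's example, the class name is illustrative:

```python
class CreditSender:
    def __init__(self, credits: int = 10):
        self.credits = credits        # receiver's free buffer slots

    def try_send(self, flit) -> bool:
        if self.credits == 0:
            return False              # stall: receiver buffer may be full
        self.credits -= 1
        return True                   # flit goes out on the link

    def receive_credits(self, n: int):
        self.credits += n             # receiver freed n buffer slots

tx = CreditSender()
sent = sum(tx.try_send(i) for i in range(12))
print(sent, tx.credits)               # 10 flits sent, counter at 0: stall
tx.receive_credits(5)                 # credit update (+5, as in the slide)
print(tx.try_send("next"))            # True: injection resumes
```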
Flow Control Schemes
Xon/Xoff flow control: the sender keeps a control bit that is either Xon or Xoff, and a packet is injected only while the bit is Xon; transfers are pipelined. When the receiver's queue fills to the Xoff threshold (e.g., because the queue is not serviced), an Xoff notification is sent and the sender can no longer inject packets. When the queue drains back to the Xon threshold, an Xon notification is sent and injection resumes (see the sketch below).
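A minimal Xon/Xoff sketch; the threshold values are illustrative assumptions:

```python
XOFF_THRESHOLD = 8   # stop the sender when the queue grows to this depth
XON_THRESHOLD = 2    # restart it once the queue drains to this depth

class XonXoffReceiver:
    def __init__(self):
        self.queue = []
        self.sender_may_inject = True        # the sender's control bit

    def on_flit(self, flit):
        self.queue.append(flit)
        if len(self.queue) >= XOFF_THRESHOLD:
            self.sender_may_inject = False   # send Xoff notification

    def service_one(self):
        if self.queue:
            self.queue.pop(0)
        if len(self.queue) <= XON_THRESHOLD:
            self.sender_may_inject = True    # send Xon notification

rx = XonXoffReceiver()
for i in range(8):
    rx.on_flit(i)
print(rx.sender_may_inject)  # False: Xoff threshold reached
for _ in range(6):
    rx.service_one()
print(rx.sender_may_inject)  # True: drained to the Xon threshold
```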
Comparing Credit-Based & Xon/Xoff Flow Control
- Both schemes can fully utilize the buffers
- Restart latency is lower for credit-based schemes; therefore credit-based flow control has higher average buffer occupancy at high loads, leads to higher throughput at high loads, and gives a smaller inter-packet gap
- Control traffic is higher for credit schemes; block credits can be used to tune link behavior
- Buffer sizes are independent of the round-trip latency for credit schemes (at the expense of performance)
- Credit schemes carry higher information content, useful for QoS schemes
Some Commercial Examples
- HyperTransport: credit based; info packets are not buffered and have guaranteed reception
- InfiniBand: credit based; link-based schemes + end-to-end
- Myrinet: credit based
- Ethernet: Xon/Xoff
- PCI Express: data link layer uses ACK/NACK; transaction layer is credit based
Clocking Schemes
Fully synchronous: a single global clock is distributed to synchronize the entire chip; hard to achieve in practice, due to process variations and clock skew.
Mesochronous: local clocks are derived from a global clock; not sensitive to clock skew, but the phase between clock signals in different modules may differ (deterministic for regular topologies such as a mesh, non-deterministic for irregular topologies); synchronizers are needed between clock domains.
Plesiochronous: clock signals are produced locally, at nominally the same frequency.
Asynchronous: clocks do not have to be present at all.
Quality of Service (QoS)
QoS refers to the level of commitment for packet delivery: bounds on performance (bandwidth, delay, and jitter). Two basic categories:
- Best effort (BE): only correctness and completion of communication are guaranteed; usually packet switched; worst-case times cannot be guaranteed
- Guaranteed service (GS): makes a tangible guarantee on performance, in addition to the basic guarantees of correctness and completion; usually (virtual) circuit switched
Intel's Teraflops Research Processor
Goals:
- Deliver tera-scale performance: a single-precision TFLOP at desktop power; frequency target of 5 GHz; bisection bandwidth on the order of terabits/s; link bandwidth in hundreds of GB/s
- Prototype two key technologies: an on-die interconnect fabric and 3D stacked memory
- Develop a scalable design methodology: tiled design approach, mesochronous clocking, power-aware capability
[Die photo: an array of tiles (each 1.5 mm x 2.0 mm) in a 12.64 mm x 21.72 mm die, with I/O areas, PLL, and TAP.]
Technology: 65 nm, 1 poly, 8 metal (Cu). Transistors: 100 million (full chip), 1.2 million (tile). Die area: 275 mm2 (full chip), 3 mm2 (tile). C4 bumps: 8390. [Vangal08]
Main Building Blocks
- Special-purpose cores: high-performance dual FPMACs
- 2D mesh interconnect: high-bandwidth, low-latency router; phase-tolerant tile-to-tile communication
- Mesochronous clocking: modular, scalable, lower power
- Workload-aware power management: sleep instructions; chip voltage and frequency control
[Tile diagram: a processing engine (PE) with 3 KB instruction memory (IMEM), 2 KB data memory (DMEM), a 6-read, 4-write 32-entry register file, two FPMACs (multiply, add, normalize) operating on 32-bit operands, and a router interface block (RIB), connected through mesochronous interfaces (MSINT) to a 5-port crossbar router with 39-bit links carrying 40 GB/s.]
Fine-Grain Power Management
Scalable power to match workload demands, via dynamic sleep:
- STANDBY: memory retains data; 50% less power per tile
- FULL SLEEP: memories fully off; 80% less power per tile
There are 21 sleep regions per tile (not all shown).
[Figure: per-block savings when sleeping: FP engine 1: 90% less power; FP engine 2: 90% less power; router: 10% less power (it stays on to pass traffic); data memory: 57% less power; instruction memory: 56% less power.]
Router Features
- 5 ports, wormhole switching, 5-cycle pipeline
- 39-bit (32 data, 6 control, 1 strobe) bidirectional mesochronous P2P links per port
- 2 logical lanes, each with 16 flit buffers
- Performance, area, power: frequency 5.1 GHz at 1.2 V; 102 GB/s raw bandwidth; area 0.34 mm2 (65 nm); power 945 mW (1.2 V), 470 mW (1 V), 98 mW (0.75 V)
- Fine-grained clock gating + sleep (10 regions)
KAIST BONE Project
Timeline (2003-2007): PROTONE (star topology); Slim Spider (hierarchical star); Memory-Centric NoC (hierarchical star + shared memory); IIS (configurable star). Mesh-based contemporaries: RAW (MIT) and the 80-tile NoC (Intel); also a baseband processor NoC (STMicro et al.). [KimNOC07]
On-Chip Serialization
[Figure: a SERDES between each processing unit's network interface and the crossbar switch yields a reduced link width and a reduced crossbar. Serialization trades off operating frequency, driver size, and wire spacing against capacitance load, coupling capacitance, energy consumption, buffer resources, and switching energy.]
A proper level of on-chip serialization improves NoC performance.
Memory-Centric NoC Architecture
Overall architecture:
- 10 RISC processors
- 8 dual-port memories (1.5 KB each, ports A and B)
- 4 channel controllers
- hierarchical-star topology packet-switching network (400 MHz)
- mesochronous communication
[Figure: a control processor (RISC) and an external memory interface connect, through network interfaces (NIs) and crossbar switches arranged in a hierarchical star, to the RISCs, the dual-port memories, and the channel controllers over 36-bit links.]
Implementation Results
Chip photograph and power breakdown:
- 8 dual-port memories: 905.6 mW
- 10 RISCs: 354 mW
- RISC processor: 52 mW
- Memory-Centric NoC: 96.8 mW
[Kim07]
Summary of NoCs in Emerging Processors

System             | Topology           | Routing                 | Switching           | Flow ctrl
MIT RAW            | 2D mesh (32-bit)   | XY DOR                  | WH, no VC           | Credit
UPMC SPIN          | Fat tree (32-bit)  | Up*/down*               | WH, no VC           | Credit
QuickSilver ACM    | H-tree (32-bit)    | Up*/down*               | 1-flit, no VC       | Credit
UMass Amherst aSOC | 2D mesh            | Shortest-path           | Pipelined CS, no VC | Timeslot
Sun T1             | Crossbar (128-bit) | -                       | -                   | ACK/NACK
Cell BE EIB        | Ring (128-bit)     | Shortest-path           | Pipelined CS, no VC | Credit
TRIPS (operand)    | 2D mesh (109-bit)  | YX DOR                  | 1-flit, no VC       | On/off
TRIPS (on-chip)    | 2D mesh (128-bit)  | YX DOR                  | WH, 4 VCs           | Credit
Intel SCC          | 2D torus (32-bit)  | XY, YX DOR, odd-even TM | WH, no VC           | On/off
TILE64 iMesh       | 2D mesh (32-bit)   | XY DOR                  | WH, no VC           | Credit
Intel 80-core NoC  | 2D mesh (32-bit)   | Source routing          | WH, 2 lanes         | On/off
Status and Open Problems
Power: complex NI and switching/routing logic blocks are power hungry, several times more than current bus-based approaches.
Latency: additional delay to packetize/de-packetize data at NIs; flow/congestion control and fault-tolerance protocol overheads; delays at the numerous switching stages encountered by packets; even circuit switching has overhead (e.g., SOCBUS). Latency lags behind what can be achieved with bus-based/dedicated wiring.
Lack of tools and benchmarks.
Simulation speed: GHz clock frequencies, large network complexity, and a greater number of PEs slow down simulation.
Trends
NoCs have high latency and power dissipation. What if we used photonic interconnects on chip?
Trends
Hybrid electro-photonic NoCs
[Figures: normalized power consumption for the electrical torus and mesh, Firefly, Corona, a hybrid optical torus/mesh, and METEOR, broken down into thermal tuning, optical components, laser power, electrical links, buffers, crossbar and routing, and static power; normalized latency, power, throughput, and energy-delay across benchmarks (Ft, Rd, Ch, Ry, Fm, Br, Lu); and electrical and optical area for 8x8 and 12x12 configurations with 128 or 256 waveguides.]
S. Bahirat, S. Pasricha, "METEOR: Hybrid Photonic Ring-Mesh Network-on-Chip for Multicore Architectures", ACM JETC (under review)
S. Bahirat, S. Pasricha, "UC-PHOTON: A Novel Hybrid Photonic Network-on-Chip for Multiple Use-Case Applications", IEEE ISQED 2010 (Best Paper Award)
S. Pasricha, S. Bahirat, "OPAL: A Multi-Layer Hybrid Photonic NoC for 3D ICs", IEEE/ACM ASPDAC 2011
Trends
What if we used carbon nanotube interconnects on a chip?
- vs. Cu: better conductivity, highly resistant to electromigration and other sources of physical breakdown, can support high current densities
- Options: SWCNTs, MWCNTs, SWCNT bundles, mixed bundles (metallic and semiconducting tubes)
- Exploration results: [Pasricha et al. NanoArch '08, VLSID '09, TVLSI '10]
Summary: Trends
Move towards hybrid interconnection fabrics: NoC-bus hybrids; custom, heterogeneous topologies.
New interconnect paradigms: optical, wireless, carbon nanotube.
See my research and publications page for more details: http://www.engr.colostate.edu/~sudeep/pubs/pubs.htm