On-Chip Communication: Networks on Chip (NoCs)
Sudeep Pasricha, Colorado State University
CS/ECE 561, Fall 2011
Outline
- Introduction
- NoC Topology
- Switching strategies
- Router Microarchitecture
- Routing algorithms
- Flow control schemes
- Clocking schemes
- QoS
- NoC Architecture Examples
- Status and Open Problems
Introduction
Evolution of on-chip communication architectures [figure]
Introduction
A network-on-chip (NoC) is a packet-switched on-chip communication network designed using a layered methodology. NoCs use packets to route data from the source to the destination PE via a network fabric that consists of:
- network interfaces (NIs)
- switches (routers)
- interconnection links (wires)
[Figure: a tile containing registers, ALU, and memory attaching to the network through its NI.]
Introduction
NoCs are an attempt to scale down the concepts of large-scale networks and apply them to the system-on-chip (SoC) domain.
NoC properties:
- Reliable and predictable electrical and physical properties
- Regular geometry that is scalable
- Flexible QoS guarantees
- Higher bandwidth
- Reusable components (buffers, arbiters, routers, protocol stack)
Introduction
ISO/OSI network protocol stack model [figure]
Building Blocks: Network Interface (NI)
Front end: a standardized node interface at the session layer; the initiator vs. target distinction is blurred. It defines:
1. the supported transactions (e.g., QoS read, ...)
2. the degree of parallelism
3. session protocol control flow and negotiation
Back end: NoC-specific (layers 1-4):
1. physical channel interface
2. link-level protocol
3. network layer (packetization)
4. transport layer (routing)
The front end presents a session-layer point-to-point (P2P) interface to the node over a standard P2P node protocol; the back end manages the interface with the switches over a proprietary link protocol, providing decoupling logic and synchronization.
Building Blocks: Switch
A router (switch) receives and forwards packets. Its main components: input buffers with flow control, a crossbar, an allocator/arbiter, QoS and routing logic, output buffers with flow control, and data ports with flow-control wires. The buffers have a dual function: synchronization and queueing.
On-chip vs. Off-chip Networks
Cost:
- Off-chip: cost is in channels: huge pads, expensive connectors, cables, optics
- On-chip: cost is silicon area and power (storage!); wires are not infinite, but plentiful
Channel characteristics:
- On-chip: wires are short, latency is comparable with logic, huge amount of bandwidth, can put logic in links; a lot of uncertainty (process variations) and interference (e.g., noise)
- Off-chip: wires are long, link latency dominates, bandwidth is precious, links are strongly decoupled
Workload:
- On-chip: non-homogeneous traffic, but much is known
- Off-chip: very little is known, and it may change
Design issues:
- On-chip: must fit in the floorplan (die-area constraint)
- Off-chip: dictates board/rack organization
NoC Concepts
- Topology: how the nodes are connected together
- Switching: allocation of network resources (bandwidth, buffer capacity, ...) to information flows
- Routing: path selection between a source and a destination node in a particular topology
- Flow control: how the downstream node communicates forwarding availability to the upstream node
NoC Topology
1. Direct Topologies
- each node has a direct point-to-point link to a subset of other nodes in the system, called neighboring nodes
- as the number of nodes in the system increases, the total available communication bandwidth also increases
- the fundamental trade-off is between connectivity and cost
NoC Topology
Most direct network topologies have an orthogonal implementation, where nodes can be arranged in an n-dimensional orthogonal space, e.g., an n-dimensional mesh, torus, folded torus, hypercube, or octagon.
The 2D mesh is the most popular topology:
- all links have the same length, which eases physical design
- area grows linearly with the number of nodes
- must be designed in such a way as to avoid traffic accumulating in the center of the mesh
NoC Topology
The torus topology, also called a k-ary n-cube, is an n-dimensional grid with k nodes in each dimension.
- a k-ary 1-cube (1-D torus) is essentially a ring network with k nodes; it has limited scalability, as performance decreases when more nodes are added
- a k-ary 2-cube (2-D torus) is similar to a regular 2D mesh, except that nodes at the edges are connected to switches at the opposite edge via wrap-around channels
- the long end-around connections can, however, lead to excessive delays
NoC Topology
The folded torus topology overcomes the long-link limitation of a 2-D torus: all links have the same length.
Meshes and tori can be extended by adding bypass links to increase performance, at the cost of higher area.
NoC Topology
The octagon topology is another example of a direct network:
- messages sent between any 2 nodes require at most two hops
- more octagons can be tiled together to accommodate larger designs, using one of the nodes as a bridge node
NoC Topology
2. Indirect Topologies
- each node is connected to an external switch, and switches have point-to-point links to other switches
- switches do not perform any information processing, and correspondingly nodes do not perform any packet switching
- e.g., SPIN and crossbar topologies
Fat tree topology:
- nodes are connected only to the leaves of the tree
- more links near the root, where bandwidth requirements are higher
NoC Topology
k-ary n-fly butterfly topology:
- a blocking multi-stage network: packets may be temporarily blocked or dropped in the network if contention occurs
- k^n nodes, and n stages of k^(n-1) k x k crossbars; e.g., a 2-ary 3-fly butterfly network (see the sketch below)
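As a quick check of that counting rule, here is a minimal Python sketch (the function name and example values are illustrative, not from the slides):

```python
def kary_nfly_size(k: int, n: int):
    """Return (terminal nodes, stages, crossbars per stage) of a k-ary n-fly."""
    nodes = k ** n                       # k^n terminal nodes
    crossbars_per_stage = k ** (n - 1)   # each stage has k^(n-1) k x k switches
    return nodes, n, crossbars_per_stage

# 2-ary 3-fly: 8 nodes, 3 stages of four 2x2 crossbars
print(kary_nfly_size(2, 3))  # (8, 3, 4)
```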
NoC Topology
3. Irregular or ad hoc network topologies
- customized for an application
- usually a mix of shared-bus, direct, and indirect network topologies
- e.g., reduced mesh, cluster-based hybrid topology
Messaging Units
Data is transmitted based on a hierarchical data structuring mechanism: messages are split into packets, packets into flits (flow control digits), and flits into phits (physical flow control digits). Flits and phits are fixed size; packets and messages may be variable sized. A phit is the unit of data transferred on a link in a single cycle; typically, phit size = flit size. A packet begins with a head flit carrying the header (destination info, sequence number, flit type, and miscellaneous fields), followed by body flits and a tail flit.
Switching determines "when" messaging units are moved (see the sketch below).
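A minimal sketch of this hierarchy, assuming phit size equals flit size; the field sizes FLIT_BYTES and MAX_PAYLOAD_FLITS are hypothetical, chosen only for illustration:

```python
FLIT_BYTES = 4          # assumed flit/phit width
MAX_PAYLOAD_FLITS = 8   # assumed maximum packet payload

def packetize(message: bytes, dest: int, seq: int):
    """Split a variable-sized message into packets of fixed-size flits."""
    packets = []
    chunk = FLIT_BYTES * MAX_PAYLOAD_FLITS
    for off in range(0, len(message), chunk):
        payload = message[off:off + chunk]
        flits = [("head", {"dest": dest, "seq": seq, "misc": 0})]
        for i in range(0, len(payload), FLIT_BYTES):
            flits.append(("body", payload[i:i + FLIT_BYTES]))
        flits.append(("tail", None))
        packets.append(flits)
        seq += 1
    return packets

pkts = packetize(b"hello, network-on-chip!", dest=7, seq=0)
print(len(pkts), "packet(s);", [len(p) for p in pkts], "flits each")
```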
Circuit Switching
- a hardware path is set up by a routing header or probe
- an end-to-end acknowledgment initiates transfer at full hardware bandwidth
- the system is limited by the signaling rate along the circuits
- routing, arbitration, and switching overhead are experienced once per message
[Timing diagram: the header probe incurs routing (t_r) and switching (t_s) delay at each hop during t_setup; once the acknowledgment returns, data flows for t_data while the links remain busy.]
Circuit Switching Example
1. Request for circuit establishment: a probe travels from the source end node toward the destination end node; routing and arbitration are performed during this step, with small buffers holding the "request" tokens.
2. Acknowledgment and circuit establishment: as the "ack" token travels back to the source, the connections along the path are established.
3. Packet transport: data then streams over the reserved circuit; neither routing nor arbitration is required.
Drawback: under high contention, reserved circuits block other traffic while their own links may sit idle; low utilization leads to low throughput.
Virtual Circuit Switching
Goal: reduce the cost associated with circuit switching. Multiple virtual circuits (channels) are multiplexed on a single physical link; each flit carries a virtual-circuit identifier and type field along with the packet data. Two buffer organizations:
- allocate one buffer per virtual link: can be expensive due to the large number of shared buffers
- allocate one buffer per physical link: uses time-division multiplexing (TDM) to statically schedule link usage; less expensive routers
Virtual Circuit Switching Example
[Figure: three routers and their network interfaces, with several virtual circuits multiplexed over the physical links (e.g., input 2 of router 1 is output 1 of router 2). Each router holds a TDM slot table: for every time slot, the table records which input is routed to which output (entries such as i1 -> o2), with unused slots left empty. Slots are assigned so that no two circuits contend for the same output in the same slot; the schedule divides up the link bandwidth, as in the sketch below.]
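A minimal sketch of the TDM slot-table idea; the slot count and port numbers are illustrative, not taken from the figure:

```python
NUM_SLOTS = 4

class SlotTable:
    """Per-router TDM table: each output has one entry per time slot,
    holding the input routed to that output in that slot (or None)."""
    def __init__(self, num_outputs: int):
        self.table = {o: [None] * NUM_SLOTS for o in range(num_outputs)}

    def reserve(self, slot: int, inp: int, out: int) -> bool:
        """Reserve input->output in a slot; fail if the output is taken."""
        if self.table[out][slot] is not None:
            return False  # contention: another circuit owns this slot
        self.table[out][slot] = inp
        return True

router = SlotTable(num_outputs=3)
assert router.reserve(slot=0, inp=1, out=2)       # circuit A admitted
assert not router.reserve(slot=0, inp=0, out=2)   # circuit B must pick another slot
assert router.reserve(slot=1, inp=0, out=2)       # ...and succeeds there
```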
Packet Switching
Packets are transmitted from the source and make their way independently to the receiver, possibly along different routes and with different delays. There is zero start-up time, followed by a variable delay due to contention in routers along the packet path; QoS guarantees are therefore harder to make.
Three main packet switching scheme variants:
- Store and Forward (SAF) switching
- Virtual Cut-Through (VCT) switching
- Wormhole (WH) switching
Packet Switching: Store and Forward (SAF)
- routing, arbitration, and switching overheads are experienced for each packet
- increased storage requirements at the nodes
- packetization and in-order delivery requirements
- alternative buffering schemes: use of local processor memory, or queues central to the switch
Packets are completely stored at each switch before any portion is forwarded (store, then forward). Requirement: buffers must be sized to hold an entire packet (MTU).
[Timing diagram: at every hop the link stays busy for the full packet time t_packet (header plus data) before the next hop's routing delay t_r begins.]
Packet Switching: Virtual Cut-Through (VCT)
- messages cut through to the next router when feasible; in the absence of blocking, messages are pipelined
- the pipeline cycle time is the larger of the intra-router and inter-router flow control delays
- when the header is blocked, the complete message is buffered at a switch
- high-load behavior approaches that of SAF
Portions of a packet may be forwarded ("cut through") to the next switch before the entire packet is stored at the current switch.
[Timing diagram: after t_r + t_s at each router, the packet header cuts through while the data streams behind it; when blocked, the message waits for t_blocking and is buffered locally.]
Packet Switching: Wormhole (WH)
- messages are pipelined, but buffer space is on the order of a few flits
- small buffers + message pipelining = small, compact switches/routers
- supports variable-sized messages
- messages cannot be interleaved over a channel: routing information is associated only with the header flit
- base latency is equivalent to that of virtual cut-through
[Timing diagram: the header flit incurs t_r + t_s at each hop; single body flits follow in a pipeline for a total of t_wormhole.]
Virtual Cut-Through vs. Wormhole
- VCT: buffers hold whole data packets; buffers must be sized to hold an entire packet (MTU)
- Wormhole: buffers hold flits; packets can be larger than the buffers
When blocked at a busy link, a VCT packet is completely stored at one switch, whereas a wormhole packet remains stored in flit buffers along the path, holding resources at several switches.
Comparison of Packet Switching Techniques
SAF packet switching and virtual cut-through:
- consume network bandwidth proportional to the network load
- VCT behaves like wormhole at low loads and like SAF packet switching at high loads
- high buffer costs
Wormhole switching:
- provides low (unloaded) latency
- lower saturation point
- higher variance of message latency than SAF packet or VCT switching
(See the latency sketch below.)
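The latency contrast can be made concrete with the standard unloaded-latency approximations; this is a sketch, and the parameter values (t_r, flit_time) are illustrative:

```python
def saf_latency(hops, packet_flits, t_r=1.0, flit_time=1.0):
    # store-and-forward: the whole packet is serialized at every hop
    return hops * (t_r + packet_flits * flit_time)

def cut_through_latency(hops, packet_flits, t_r=1.0, flit_time=1.0):
    # VCT / wormhole without blocking: the header pays the per-hop cost,
    # the rest of the packet pipelines behind it and is serialized once
    return hops * t_r + packet_flits * flit_time

print(saf_latency(4, 16))          # 68.0 cycles
print(cut_through_latency(4, 16))  # 20.0 cycles
```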
Generic Router Architecture
[Figure: block diagram of a generic NoC router.]
Pipelined Router Microarchitecture
[Figure: a five-stage router pipeline. Stage 1: link traversal and input buffering (LT & IB); Stage 2: routing computation (RC); Stage 3: virtual-channel allocation (VCA); Stage 4: switch allocation (SA); Stage 5: switch traversal and output buffering (ST). Flits arrive over the physical channels through link control and a demux into the input buffers, traverse the crossbar, and leave through output buffers and a mux onto the outgoing physical channels.]
Virtual Channel Router
[Figure: each input port (input 0 ... input 4) holds n virtual-channel input buffers (VC 1 ... VC n) feeding a crossbar switch toward the outputs (output 0 ... output 4), under the control of a route-computation unit, a VC allocator, and a switch allocator.]
Pipeline stages: RC (route computation), VCA (VC allocation), SA (switch allocation), ST (switch traversal), LT (link traversal). By using prediction/speculation, the pipeline can be made more compact.
Router Power Dissipation
[Figure: router power breakdown for a 6x6 mesh with a 4-stage pipeline at over 3 GHz, using wormhole switching.]
Source: Partha Kundu, "On-Die Interconnects for Next-Generation CMPs", talk at On-Chip Interconnection Networks Workshop, Dec 2006
Head-of-Line (HOL) Blocking
Can be a problem in NoC routers: with a single FIFO per input, it limits the throughput of switches to 58.6%.
Solution: Virtual Output Queues (VOQs), an input-queuing strategy in which each input port maintains a separate queue for each output port (see the sketch below).
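A minimal VOQ sketch; the class and port names are illustrative:

```python
from collections import deque

class VOQInputPort:
    """Each input port keeps one queue per output port, so a flit blocked
    toward one output never delays flits destined for other outputs."""
    def __init__(self, num_outputs: int):
        self.queues = [deque() for _ in range(num_outputs)]

    def enqueue(self, flit, out_port: int):
        self.queues[out_port].append(flit)

    def dequeue_for(self, out_port: int):
        q = self.queues[out_port]
        return q.popleft() if q else None

port = VOQInputPort(num_outputs=4)
port.enqueue("A", out_port=2)  # suppose output 2 is currently blocked
port.enqueue("B", out_port=3)  # this flit is not stuck behind "A"
print(port.dequeue_for(3))     # "B" departs even while output 2 is busy
```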
Routing Algorithms
Responsible for correctly and efficiently routing packets or circuits from the source to the destination: path selection between a source and a destination node in a particular topology. Goals:
- ensure load balancing
- minimize latency
- flexibility with respect to faults in the network
- deadlock- and livelock-free solutions
Static vs. Dynamic Routing
Static routing: fixed paths are used to transfer data between a particular source and destination; it does not take into account the current state of the network.
- advantages: easy to implement, since very little additional router logic is required; in-order packet delivery if a single path is used
Dynamic routing: routing decisions are made according to the current state of the network, considering factors such as availability and load on links.
- the path between source and destination may change over time, as traffic conditions and the requirements of the application change
- more resources are needed to monitor the state of the network and dynamically change routing paths
- able to better distribute traffic in the network
Example: Dimension-Order Routing
Static XY routing (commonly used): a deadlock-free, shortest-path routing which routes packets in the X dimension first and then in the Y dimension. Used for tori and meshes; the destination address is expressed as absolute coordinates (see the sketch below).
[Figure: two grids of nodes 00-23. For a torus, a preferred direction (e.g., -x or +y around the wrap-around links) may have to be selected; for a mesh, the preferred direction is the only valid direction.]
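A minimal XY dimension-order routing sketch for a 2D mesh; the port names and the convention that NORTH means +y are assumptions for illustration:

```python
def xy_route(cur, dst):
    """Return the output port for the next hop, X dimension first."""
    cx, cy = cur
    dx, dy = dst
    if dx != cx:                      # route in X first
        return "EAST" if dx > cx else "WEST"
    if dy != cy:                      # then in Y
        return "NORTH" if dy > cy else "SOUTH"
    return "LOCAL"                    # arrived: eject to the local PE

print(xy_route((0, 0), (2, 3)))  # EAST
print(xy_route((2, 0), (2, 3)))  # NORTH
print(xy_route((2, 3), (2, 3)))  # LOCAL
```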
Example: Adaptive Routing
Chooses the shortest path to the destination with minimum congestion. The path can change over time, because congestion changes over time.
[Figure: grid of nodes 00-23 showing alternative paths around congestion.]
Pitfall: Dynamic Routing
A locally optimal decision may lead to a globally sub-optimal route.
[Figure: to avoid slight congestion on link (01-02), packets are diverted and then incur more congested links.]
Non-Algorithmic vs. Algorithmic Routing
Non-algorithmic schemes do not use an algorithm in the routers to compute the route direction, e.g., table-based routing schemes: a simple lookup into a local table that stores route information determines where to send each flit from the input buffers.
Algorithmic routing
For a given topology and routing algorithm, simple logic computes the route:
- usually implemented as combinational logic
- may require information carried in the packet header
- specific to a network configuration
Example (2D torus): the packet header contains a signed offset to the destination in each dimension; at each hop, the router updates the offset in one dimension; when x == 0 and y == 0, the packet is at the correct processor (see the sketch below).
[Figure: per-dimension sign/offset fields (sx, x, sy, y) compared against zero to produce a productive direction vector.]
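A sketch of this offset-based torus routing; the direction labels and the X-before-Y order are illustrative choices:

```python
def step(offsets):
    """Consume one hop from the (x, y) offset header; None when arrived."""
    x, y = offsets
    if x != 0:
        return ("+X" if x > 0 else "-X"), (x - (1 if x > 0 else -1), y)
    if y != 0:
        return ("+Y" if y > 0 else "-Y"), (x, y - (1 if y > 0 else -1))
    return None, offsets  # x == 0 and y == 0: at the correct processor

hdr = (2, -1)  # signed offsets computed at the source (may use wrap-around)
while True:
    direction, hdr = step(hdr)
    if direction is None:
        break
    print(direction, hdr)  # +X (1, -1) / +X (0, -1) / -Y (0, 0)
```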
Distributed vs. Source Routing
Distributed routing: each packet carries the destination address, e.g., XY coordinates or a number identifying the destination node/router; routing decisions are made in each router by looking up the destination address in a routing table or by executing a hardware function.
Source routing: the packet carries the routing information itself. Pre-computed routing tables are stored at a node's NI; the routing information is looked up at the source NI and added to the header of the packet (increasing the packet size). When a packet arrives at a router, the routing information is extracted from the routing field in the packet header. This requires no destination address in the packet, no intermediate routing tables, and no functions to calculate the route (see the sketch below).
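A minimal source-routing sketch; the table contents and port names are hypothetical:

```python
ROUTE_TABLE = {  # destination id -> list of output ports, one per router
    7: ["EAST", "EAST", "NORTH", "LOCAL"],
}

def build_packet(dst: int, payload: bytes):
    """Source NI: look up the precomputed route and prepend it."""
    return {"route": list(ROUTE_TABLE[dst]), "payload": payload}

def router_forward(packet):
    """Each router simply consumes the next hop from the header."""
    return packet["route"].pop(0)

pkt = build_packet(7, b"data")
while True:
    port = router_forward(pkt)
    print(port)              # EAST, EAST, NORTH, LOCAL
    if port == "LOCAL":
        break
```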
Minimal vs. Non-minimal Routing
Minimal routing: the length of the routing path from source to destination is the shortest possible between the 2 nodes; the source does not start sending a packet if a minimal path is not available.
Non-minimal routing: can use longer paths if a minimal path is not available. Allowing non-minimal paths increases the number of alternative paths, which can be useful for avoiding congestion; the disadvantage is the overhead of additional power consumption.
[Figure: grid of nodes 00-23 where minimal adaptive routing is unable to avoid congested links in the absence of minimal-path diversity.]
Ordered vs. Unordered Routing
Ordered: one route per source-destination (S-D) pair; no traffic splitting.
Unordered: potentially multiple routes per S-D pair.
[Figure: unordered routing vs. ordered routing on a grid.]
Example: Toggle XY (TXY)
XY alone is unbalanced: high cost in uniform-capacity grids. TXY splits packets evenly between the XY and YX routes; deadlock is avoided with 2 VCs. Near-optimal for symmetric traffic (permutations) [Seo et al. 05; Towles & Dally 02]. Simple and better balanced.
Drawbacks of split routes: packets of the same flow take different paths; delays may cause out-of-order arrivals; re-ordering buffers are costly; and the scheme does not take the traffic pattern into account. It is a static, minimal scheme. Extension: stochastically bias the XY/YX routes at design time.
[Figure: a flow of rate f split into f/2 on the XY route (VC v1) and f/2 on the YX route (VC v2).]
Example: Source Toggle XY (STXY)
The route (XY or YX) is a function of the source and destination IDs: a bitwise XOR. A very simple algorithm; maximum capacity is similar to TXY (see the sketch below).
[Figure: alternating assignment of XY and YX routes across source-destination pairs.]
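A sketch of the idea; the slides only say the choice is a function of the bitwise XOR of the IDs, so reducing the XOR to its parity here is an assumption made for illustration:

```python
def stxy_route_choice(src_id: int, dst_id: int) -> str:
    """Pick XY or YX per (source, destination) pair from the XOR of the IDs.
    Parity of the XOR bits is an assumed reduction, not from the slides."""
    return "XY" if bin(src_id ^ dst_id).count("1") % 2 == 0 else "YX"

for s, d in [(0, 5), (1, 5), (2, 7)]:
    print(s, d, stxy_route_choice(s, d))
```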
Weighted Ordered Toggle (WOT)
The route per S-D pair is chosen at programming time: each source stores a routing bit for each destination. Objective: minimize the maximum link capacity. The optimal route assignment is difficult to find.
WOT Min-Max Route Assignment
Initial assignment: STXY. Then make changes that reduce the capacity: find the most loaded link; among the S-D pairs sharing this link, change the one that minimizes the maximum capacity (if possible). The result is sub-optimal.
[Figure: flows S1-D1, S2-D2, S3-D3 sharing links on a grid.]
Deadlock-Free Routing Requirements
The routing algorithm must ensure freedom from deadlocks, which are common in WH switching, e.g., the cyclic dependency shown below. Freedom from deadlocks can be ensured by allocating additional hardware resources or by imposing restrictions on the routing. Usually a channel dependency graph (CDG) of the shared network resources is built and analyzed, either statically or dynamically.
Dependencies
When a packet holds a channel and requests another channel, there is a direct dependency between them. The channel dependency graph is D = G(C, E), with channels as vertices and dependencies as edges. For deterministic routing there is a single dependency at each node; for adaptive routing, all requested channels produce dependencies, and the dependency graph may contain cycles.
[Figure: in SAF and VCT, a dependency exists only between the channel holding the packet and the requested channel; in wormhole, the header flit can hold several channels along the path while data flits stretch behind it, creating longer dependency chains.]
Breaking Cyclic Dependencies
The four-node ring configuration to the left (nodes n0-n3, channels c0-c3) can deadlock. Remedies: add (virtual) channels, splitting each channel ci into two virtual channels c0i and c1i, and/or make the channel dependency graph acyclic via routing restrictions (i.e., via the routing function).
[Figure: ring n0-n1-n2-n3 with channels c0-c3, and the same ring with each channel split into virtual channels c00/c10, c01/c11, c02/c12, c03/c13.]
Breaking Cyclic Dependencies
The routing function selects channel c0i when j <= i and c1i when j > i, where j is the hop distance from the source (here we assume i = 3). Channels c00 and c13 are unused. This routing function breaks the cycles in the channel dependency graph (see the sketch below).
[Figure: the resulting acyclic channel dependency graph over c10, c11, c12, c01, c02, c03.]
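A sketch of the CDG analysis the slides describe: build the graph and check it for cycles (DFS-based detection); the channel names mirror the figure, but the dependency sets are illustrative:

```python
def has_cycle(deps):
    """deps: channel -> iterable of channels it may request next.
    Every channel must appear as a key. Returns True if the CDG is cyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def dfs(c):
        color[c] = GRAY
        for nxt in deps[c]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and dfs(nxt)):
                return True   # back edge found: cycle
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)

ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
print(has_cycle(ring))   # True: the 4-node ring can deadlock

split = {"c01": ["c02"], "c02": ["c03"], "c03": ["c10"],
         "c10": ["c11"], "c11": ["c12"], "c12": []}
print(has_cycle(split))  # False: virtual channels + routing restriction
```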
Turn Model Based Routing
What is a turn? From one dimension to another: a 90-degree turn. To another virtual channel in the same direction: a 0-degree turn. To the reverse direction: a 180-degree turn. Turns combine to form cycles. Goal: prohibit the least number of turns needed to break all possible cycles. Examples: west-first, north-last, negative-first; these are partially adaptive schemes (a west-first sketch follows below).
[Figure: the abstract cycles in a 2D mesh, and the turns prohibited by YX routing and by west-first routing.]
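A minimal sketch of the west-first restriction, assuming the standard rule that all westward hops must come first, so turns into the west direction are prohibited (the direction encoding is illustrative):

```python
# Allowed (previous hop, next hop) pairs under west-first routing in a 2D
# mesh; 180-degree reversals are omitted, and no turn *into* W appears.
ALLOWED_WEST_FIRST = {
    ("W", "W"), ("W", "E"), ("W", "N"), ("W", "S"),  # west leg comes first
    ("E", "E"), ("E", "N"), ("E", "S"),
    ("N", "N"), ("N", "E"),
    ("S", "S"), ("S", "E"),
}

def turn_allowed(prev_dir: str, next_dir: str) -> bool:
    return (prev_dir, next_dir) in ALLOWED_WEST_FIRST

print(turn_allowed("N", "W"))  # False: turning into west is prohibited
print(turn_allowed("W", "N"))  # True: west first, then north
```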
Other Routing Algorithm Requirements
The routing algorithm must ensure freedom from livelocks. Livelocks are similar to deadlocks, except that the states of the resources involved constantly change with regard to one another without making any progress; they occur especially when dynamic (adaptive) routing is used, e.g., in deflective "hot potato" routing, if a packet is bounced around over and over again between routers and never reaches its destination. Livelocks can be avoided with simple priority rules.
The routing algorithm must also ensure freedom from starvation: under scenarios where certain packets are prioritized during routing, some of the low-priority packets may never reach their intended destination. Starvation can be avoided by using a fair routing algorithm, or by reserving some bandwidth for low-priority data packets.
Flow Control
Required in non-circuit-switched networks to deal with congestion (it regulates the inflow of flits into the NoC) and to recover from transmission errors. Commonly used schemes:
- ACK/NACK flow control
- Credit-based flow control
- Xon/Xoff (STALL-GO) flow control
[Figure: "backpressure": when node C blocks, its buffer fills and C tells B "buffer full, don't send"; B's buffer then fills, and B tells A the same.]
Flow Control Schemes
ACK/NACK:
- when flits are sent on a link, a local copy is kept in a buffer by the sender
- when an ACK is received, the sender deletes its copy of the flit from its local buffer
- when a NACK is received, the sender rewinds its output queue and starts resending flits, starting from the corrupted one
- implemented either end-to-end or switch-to-switch
- the sender needs a buffer of size at least 2N + k, where N is the number of buffers encountered between source and destination and k depends on the latency of the logic at the sender and receiver (see the sketch below)
- the fault-handling support comes at the cost of greater power and area overhead
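A sketch of that sizing rule; the example values are illustrative:

```python
def ack_nack_sender_buffer(n_path_buffers: int, k_logic_latency: int) -> int:
    """Minimum sender buffer under ACK/NACK: 2N + k flits may be in flight
    (N outbound + N for the returning ACK/NACK, plus k cycles of
    sender/receiver logic latency) before the oldest flit is confirmed."""
    return 2 * n_path_buffers + k_logic_latency

print(ack_nack_sender_buffer(n_path_buffers=6, k_logic_latency=2))  # 14
```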
Flow Control Schemes
Credit-based flow control: the sender maintains a credit counter, initialized to the receiver's buffer depth (10 in the example), and sends packets whenever the counter is not zero, decrementing it per packet; transfers are pipelined. If the receiver's queue is not serviced, the counter drains to zero and the sender stalls. As the receiver frees buffer slots, it sends credits back (e.g., +5); the sender's counter is replenished and injection resumes (see the sketch below).
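A minimal sketch of the sender side; the buffer depth of 10 matches the slide's example, the class name is illustrative:

```python
class CreditSender:
    def __init__(self, credits: int = 10):
        self.credits = credits        # receiver's free buffer slots

    def try_send(self, flit) -> bool:
        if self.credits == 0:
            return False              # stall: receiver buffer may be full
        self.credits -= 1
        return True                   # flit goes out on the link

    def receive_credits(self, n: int):
        self.credits += n             # receiver freed n buffer slots

tx = CreditSender()
sent = sum(tx.try_send(i) for i in range(12))
print(sent, tx.credits)               # 10 flits sent, counter at 0: stall
tx.receive_credits(5)                 # credit update (+5, as in the slide)
print(tx.try_send("next"))            # True: injection resumes
```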
Flow Control Schemes
Xon/Xoff flow control: the sender keeps a control bit that is either Xon or Xoff, and a packet is injected only while the bit is Xon; transfers are pipelined. When the receiver's queue fills to the Xoff threshold (e.g., because the queue is not serviced), an Xoff notification is sent and the sender can no longer inject packets. When the queue drains back to the Xon threshold, an Xon notification is sent and injection resumes (see the sketch below).
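A minimal Xon/Xoff sketch; the threshold values are illustrative assumptions:

```python
XOFF_THRESHOLD = 8   # stop the sender when the queue grows to this depth
XON_THRESHOLD = 2    # restart it once the queue drains to this depth

class XonXoffReceiver:
    def __init__(self):
        self.queue = []
        self.sender_may_inject = True        # the sender's control bit

    def on_flit(self, flit):
        self.queue.append(flit)
        if len(self.queue) >= XOFF_THRESHOLD:
            self.sender_may_inject = False   # send Xoff notification

    def service_one(self):
        if self.queue:
            self.queue.pop(0)
        if len(self.queue) <= XON_THRESHOLD:
            self.sender_may_inject = True    # send Xon notification

rx = XonXoffReceiver()
for i in range(8):
    rx.on_flit(i)
print(rx.sender_may_inject)  # False: Xoff threshold reached
for _ in range(6):
    rx.service_one()
print(rx.sender_may_inject)  # True: drained to the Xon threshold
```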
Comparing Credit-Based & Xon/Xoff Flow Control
- Both schemes can fully utilize the buffers
- Restart latency is lower for credit-based schemes; therefore credit-based flow control has higher average buffer occupancy at high loads, leads to higher throughput at high loads, and gives a smaller inter-packet gap
- Control traffic is higher for credit schemes; block credits can be used to tune link behavior
- Buffer sizes are independent of the round-trip latency for credit schemes (at the expense of performance)
- Credit schemes carry higher information content, useful for QoS schemes
Some Commercial Examples
- HyperTransport: credit based; info packets are not buffered and have guaranteed reception
- InfiniBand: credit based; link-based schemes + end-to-end
- Myrinet: credit based
- Ethernet: Xon/Xoff
- PCI Express: data link layer uses ACK/NACK; transaction layer is credit based
Clocking Schemes
Fully synchronous: a single global clock is distributed to synchronize the entire chip; hard to achieve in practice, due to process variations and clock skew.
Mesochronous: local clocks are derived from a global clock; not sensitive to clock skew, but the phase between clock signals in different modules may differ (deterministic for regular topologies such as a mesh, non-deterministic for irregular topologies); synchronizers are needed between clock domains.
Plesiochronous: clock signals are produced locally, at nominally the same frequency.
Asynchronous: clocks do not have to be present at all.
Quality of Service (QoS)
QoS refers to the level of commitment for packet delivery: bounds on performance (bandwidth, delay, and jitter). Two basic categories:
- Best effort (BE): only correctness and completion of communication are guaranteed; usually packet switched; worst-case times cannot be guaranteed
- Guaranteed service (GS): makes a tangible guarantee on performance, in addition to the basic guarantees of correctness and completion; usually (virtual) circuit switched
Intel's Teraflops Research Processor
Goals:
- Deliver tera-scale performance: a single-precision TFLOP at desktop power; frequency target of 5 GHz; bisection bandwidth on the order of terabits/s; link bandwidth in hundreds of GB/s
- Prototype two key technologies: an on-die interconnect fabric and 3D stacked memory
- Develop a scalable design methodology: tiled design approach, mesochronous clocking, power-aware capability
[Die photo: an array of tiles (each 1.5 mm x 2.0 mm) in a 12.64 mm x 21.72 mm die, with I/O areas, PLL, and TAP.]
Technology: 65 nm, 1 poly, 8 metal (Cu). Transistors: 100 million (full chip), 1.2 million (tile). Die area: 275 mm2 (full chip), 3 mm2 (tile). C4 bumps: 8390. [Vangal08]
Main Building Blocks
- Special-purpose cores: high-performance dual FPMACs
- 2D mesh interconnect: high-bandwidth, low-latency router; phase-tolerant tile-to-tile communication
- Mesochronous clocking: modular, scalable, lower power
- Workload-aware power management: sleep instructions; chip voltage and frequency control
[Tile diagram: a processing engine (PE) with 3 KB instruction memory (IMEM), 2 KB data memory (DMEM), a 6-read, 4-write 32-entry register file, two FPMACs (multiply, add, normalize) operating on 32-bit operands, and a router interface block (RIB), connected through mesochronous interfaces (MSINT) to a 5-port crossbar router with 39-bit links carrying 40 GB/s.]
Fine-Grain Power Management
Scalable power to match workload demands, via dynamic sleep:
- STANDBY: memory retains data; 50% less power per tile
- FULL SLEEP: memories fully off; 80% less power per tile
There are 21 sleep regions per tile (not all shown).
[Figure: per-block savings when sleeping: FP engine 1: 90% less power; FP engine 2: 90% less power; router: 10% less power (it stays on to pass traffic); data memory: 57% less power; instruction memory: 56% less power.]
Router Features
- 5 ports, wormhole switching, 5-cycle pipeline
- 39-bit (32 data, 6 control, 1 strobe) bidirectional mesochronous P2P links per port
- 2 logical lanes, each with 16 flit buffers
- Performance, area, power: frequency 5.1 GHz at 1.2 V; 102 GB/s raw bandwidth; area 0.34 mm2 (65 nm); power 945 mW (1.2 V), 470 mW (1 V), 98 mW (0.75 V)
- Fine-grained clock gating + sleep (10 regions)
KAIST BONE Project
Timeline (2003-2007): PROTONE (star topology); Slim Spider (hierarchical star); Memory-Centric NoC (hierarchical star + shared memory); IIS (configurable star). Mesh-based contemporaries: RAW (MIT) and the 80-tile NoC (Intel); also a baseband processor NoC (STMicro et al.). [KimNOC07]
On-Chip Serialization
[Figure: a SERDES between each processing unit's network interface and the crossbar switch yields a reduced link width and a reduced crossbar. Serialization trades off operating frequency, driver size, and wire spacing against capacitance load, coupling capacitance, energy consumption, buffer resources, and switching energy.]
A proper level of on-chip serialization improves NoC performance.
Memory-Centric NoC Architecture
Overall architecture:
- 10 RISC processors
- 8 dual-port memories (1.5 KB each, ports A and B)
- 4 channel controllers
- hierarchical-star topology packet-switching network (400 MHz)
- mesochronous communication
[Figure: a control processor (RISC) and an external memory interface connect, through network interfaces (NIs) and crossbar switches arranged in a hierarchical star, to the RISCs, the dual-port memories, and the channel controllers over 36-bit links.]
Implementation Results
Chip photograph and power breakdown:
- 8 dual-port memories: 905.6 mW
- 10 RISCs: 354 mW
- RISC processor: 52 mW
- Memory-Centric NoC: 96.8 mW
[Kim07]
Summary of NoCs in Emerging Processors

System             | Topology           | Routing                 | Switching           | Flow ctrl
MIT RAW            | 2D mesh (32-bit)   | XY DOR                  | WH, no VC           | Credit
UPMC SPIN          | Fat tree (32-bit)  | Up*/down*               | WH, no VC           | Credit
QuickSilver ACM    | H-tree (32-bit)    | Up*/down*               | 1-flit, no VC       | Credit
UMass Amherst aSOC | 2D mesh            | Shortest-path           | Pipelined CS, no VC | Timeslot
Sun T1             | Crossbar (128-bit) | -                       | -                   | ACK/NACK
Cell BE EIB        | Ring (128-bit)     | Shortest-path           | Pipelined CS, no VC | Credit
TRIPS (operand)    | 2D mesh (109-bit)  | YX DOR                  | 1-flit, no VC       | On/off
TRIPS (on-chip)    | 2D mesh (128-bit)  | YX DOR                  | WH, 4 VCs           | Credit
Intel SCC          | 2D torus (32-bit)  | XY, YX DOR, odd-even TM | WH, no VC           | On/off
TILE64 iMesh       | 2D mesh (32-bit)   | XY DOR                  | WH, no VC           | Credit
Intel 80-core NoC  | 2D mesh (32-bit)   | Source routing          | WH, 2 lanes         | On/off
Status and Open Problems
Power: complex NI and switching/routing logic blocks are power hungry, several times more than current bus-based approaches.
Latency: additional delay to packetize/de-packetize data at NIs; flow/congestion control and fault-tolerance protocol overheads; delays at the numerous switching stages encountered by packets; even circuit switching has overhead (e.g., SOCBUS). Latency lags behind what can be achieved with bus-based/dedicated wiring.
Lack of tools and benchmarks.
Simulation speed: GHz clock frequencies, large network complexity, and a greater number of PEs slow down simulation.
Trends
NoCs have high latency and power dissipation. What if we used photonic interconnects on chip?
Trends
Hybrid electro-photonic NoCs
[Figures: normalized power consumption for the electrical torus and mesh, Firefly, Corona, a hybrid optical torus/mesh, and METEOR, broken down into thermal tuning, optical components, laser power, electrical links, buffers, crossbar and routing, and static power; normalized latency, power, throughput, and energy-delay across benchmarks (Ft, Rd, Ch, Ry, Fm, Br, Lu); and electrical and optical area for 8x8 and 12x12 configurations with 128 or 256 waveguides.]
S. Bahirat, S. Pasricha, "METEOR: Hybrid Photonic Ring-Mesh Network-on-Chip for Multicore Architectures", ACM JETC (under review)
S. Bahirat, S. Pasricha, "UC-PHOTON: A Novel Hybrid Photonic Network-on-Chip for Multiple Use-Case Applications", IEEE ISQED 2010 (Best Paper Award)
S. Pasricha, S. Bahirat, "OPAL: A Multi-Layer Hybrid Photonic NoC for 3D ICs", IEEE/ACM ASPDAC 2011
Trends
What if we used carbon nanotube interconnects on a chip?
- vs. Cu: better conductivity, highly resistant to electromigration and other sources of physical breakdown, can support high current densities
- Options: SWCNTs, MWCNTs, SWCNT bundles, mixed bundles (metallic and semiconducting tubes)
- Exploration results: [Pasricha et al. NanoArch '08, VLSID '09, TVLSI '10]
Summary: Trends
Move towards hybrid interconnection fabrics: NoC-bus hybrids; custom, heterogeneous topologies.
New interconnect paradigms: optical, wireless, carbon nanotube.
See my research and publications page for more details: http://www.engr.colostate.edu/~sudeep/pubs/pubs.htm