Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service
Guarantees
Boris Grot, The University of Texas at Austin
Technology Trends
[Chart: transistor count (1,000 to 10,000,000,000) vs. year of introduction (1970-2010), tracing the 4004, 8086, 286, 386, 486, Pentium, Pentium 4, Pentium D, Core i7, and Xeon Nehalem-EX]
2
Perf / $
Perf / $ / Watt
3
Technology Applications
4
Networks-on-Chip (NOCs)
The backbone of highly integrated chips
  Transport of memory, operand, and control traffic
  Structured, packet-based, multi-hop networks
  Increasing importance with greater levels of integration
Major impact on chip performance, energy, and area
TRIPS: 28% performance loss on SPEC 2K in NOC
Intel Polaris: 28% of chip power consumption in NOC
"Moving data is more expensive [energy-wise] than operating on it" - William Dally, SC '10
5
On-chip vs. Off-chip Interconnects
Topology, Routing, Flow control
Pins, Bandwidth, Power, Area
Future NOC Requirements
100s to 1000s of network clients: cores, caches, accelerators, I/O ports, ...
Efficient topologies: high performance, small footprint
Intelligent routing: performance through better load balance
Light-weight flow control: high performance, low buffer requirements
Service guarantees: cloud computing and real-time apps demand QOS support
6
HPCA ‘09
HPCA ‘08
MICRO ‘09
under submission
under submission
7
Outline
Introduction
Service Guarantees in Networks-on-Chip
  Motivation
  Desiderata, prior work
  Preemptive Virtual Clock
  Evaluation highlights
Efficient Topologies for On-chip Interconnects
Kilo-NOC: A Network for 1000+ Nodes
Summary and Future Work
8
Why On-chip Quality-of-Service?
Shared on-chip resources (memory controllers, accelerators, network-on-chip) require QOS support: fairness, service differentiation, performance isolation
End-point QOS solutions are insufficient: data has to traverse the on-chip network
Need QOS support at the interconnect level: hard guarantees in NOCs
9
NOC QOS Desiderata
Fairness
Isolation of flows
Bandwidth efficiency
Low overhead: delay, area, energy
10
Conventional QOS Disciplines
Fixed schedule
  Pros: algorithmic and implementation simplicity
  Cons: inefficient BW utilization; per-flow queuing
  Example: Round Robin
Rate-based
  Pros: fine-grained scheduling; BW efficient
  Cons: complex scheduling; per-flow queuing
  Example: Weighted Fair Queuing (WFQ) [SIGCOMM '89]
Frame-based
  Pros: good throughput at modest complexity
  Cons: throughput-complexity trade-off; per-flow queuing
  Example: Rotating Combined Queuing (RCQ) [ISCA '96]
Per-flow queuing: area overhead, energy overhead, delay overhead, scheduling complexity
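The virtual-finish-time idea behind WFQ can be sketched in a few lines. This is a minimal illustration of the mechanism named on the slide, not the full SIGCOMM '89 algorithm (real WFQ also tracks a system-wide virtual time); class and variable names are my own.

```python
import heapq

class WFQ:
    """Minimal Weighted Fair Queuing sketch: each arriving packet is stamped
    with a per-flow virtual finish time (previous finish + size/weight), and
    the scheduler always transmits the packet with the smallest stamp."""

    def __init__(self, weights):
        self.weights = weights                      # flow -> weight (BW share)
        self.finish = {f: 0.0 for f in weights}     # last virtual finish per flow
        self.heap, self.seq = [], 0                 # seq breaks ties FIFO

    def enqueue(self, flow, size):
        self.finish[flow] += size / self.weights[flow]
        heapq.heappush(self.heap, (self.finish[flow], self.seq, flow))
        self.seq += 1

    def dequeue(self):
        return heapq.heappop(self.heap)[2]          # flow with earliest finish

# Flow A has twice B's weight, so A's packets earn earlier finish times
# and drain first even though both flows are backlogged.
wfq = WFQ({"A": 2.0, "B": 1.0})
for _ in range(3):
    wfq.enqueue("A", 1.0)
for _ in range(2):
    wfq.enqueue("B", 0.9)
order = [wfq.dequeue() for _ in range(5)]           # ["A", "B", "A", "A", "B"]
```

The per-flow `finish` and heap state is exactly the per-flow queuing whose area, energy, and delay overheads the slide objects to.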
11
Preemptive Virtual Clock (PVC) [HPCA ‘09]
Goal: high-performance, cost-effective mechanism for fairness and service differentiation in NOCs.
Full QOS support: fairness, prioritization, performance isolation
Modest area and energy overhead: minimal buffering in routers & source nodes
High performance: low latency, good BW efficiency
12
PVC: Scheduling
Combines rate-based and frame-based features
Rate-based: evolved from Virtual Clock [SIGCOMM '90]
  Routers track each flow's bandwidth consumption
  Cheap priority computation: f(provisioned rate, consumed BW)
Problem: history effect
[Animation: Flow X illustrates the history effect]
13
PVC: Scheduling (cont.)
Framing: PVC's solution to the history effect
  Frame rollover clears all BW counters
  Fixed frame duration
14
PVC: Scheduling (cont.)
Frame rollover: BW counters reset, priorities reset
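The rate-based bookkeeping with a frame rollover can be sketched as follows. The class and its single aggregate frame clock are illustrative assumptions; the HPCA '09 paper defines the actual counter formats and priority function.

```python
class PVCBookkeeping:
    """Sketch of PVC's scheduling state: routers count each flow's consumed
    bandwidth within the current frame; priority is a function of provisioned
    rate and consumed BW, and a frame rollover clears every counter, which
    bounds the history effect to one frame."""

    def __init__(self, quotas, frame_flits):
        self.quotas = quotas                       # flow -> flits/frame quota
        self.frame_flits = frame_flits             # fixed frame duration
        self.consumed = dict.fromkeys(quotas, 0)
        self.elapsed = 0

    def record_flit(self, flow):
        self.consumed[flow] += 1
        self.elapsed += 1
        if self.elapsed == self.frame_flits:       # frame rollover
            self.consumed = dict.fromkeys(self.quotas, 0)
            self.elapsed = 0

    def priority(self, flow):
        # Lower consumed/quota ratio = higher priority (smaller is better).
        return self.consumed[flow] / self.quotas[flow]

pvc = PVCBookkeeping({"X": 4, "Y": 2}, frame_flits=8)
for _ in range(3):
    pvc.record_flit("X")
pvc.record_flit("Y")
# X has used 3/4 of its quota, Y only 1/2, so Y currently outranks X;
# after 8 flits total, the frame rolls over and both start fresh.
```

Note what is cheap here: per flit, a router only increments a counter and compares ratios, with no per-flow queues.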
15
PVC: Freedom from Priority Inversion
PVC uses simple routers without per-flow buffering and with no BW reservation
Problem: high-priority packets may be blocked by lower-priority packets (priority inversion)
16
PVC: Freedom from Priority Inversion (cont.)
Solution: preemption of lower-priority packets
17
PVC: Preemption Recovery
Retransmission of dropped packets
  Buffer outstanding packets at the source node
  ACK/NACK protocol via a dedicated network
    All packets acknowledged
    Narrow, low-complexity network
    Lower overhead than timeout-based recovery
64-node network: a 30-flit backup buffer per node suffices
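The source-side recovery path might look like this sketch. The class and its interface are hypothetical; only the 30-flit buffer size and the ACK/NACK-per-packet behavior come from the talk.

```python
from collections import OrderedDict

class SourceBackup:
    """Sketch of PVC preemption recovery at a source node: every injected
    packet stays in a backup buffer until the dedicated ACK/NACK network
    confirms delivery; a NACK means the packet was preempted in-flight and
    must be retransmitted from the backup copy."""

    def __init__(self, capacity_flits=30):         # 30 flits/node per the talk
        self.capacity_flits = capacity_flits
        self.outstanding = OrderedDict()           # packet id -> flit count

    def flits_held(self):
        return sum(self.outstanding.values())

    def inject(self, pkt_id, flits):
        if self.flits_held() + flits > self.capacity_flits:
            return False                           # buffer full: source stalls
        self.outstanding[pkt_id] = flits
        return True

    def on_ack(self, pkt_id):
        del self.outstanding[pkt_id]               # delivered; free the copy

    def on_nack(self, pkt_id):
        return pkt_id                              # retransmit the backup copy

src = SourceBackup()
src.inject("p0", 20)
ok = src.inject("p1", 15)                          # 20 + 15 > 30: rejected
src.on_ack("p0")                                   # ACK frees 20 flits
ok2 = src.inject("p1", 15)                         # now fits
```

Because every packet is explicitly acknowledged, backup space is freed promptly, which is why a small fixed buffer beats timeout-based recovery.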
18
PVC: Preemption Throttling
Relaxed definition of priority inversion
  Reduces preemption frequency
  Small fairness penalty
Per-flow bandwidth reservation
  Flits within the reserved quota are non-preemptible
  Reserved quota is a function of rate and frame size
Coarsened priority classes
  Mask out lower-order bits of each flow's BW counter
  Induces coarser priority classes
  Enables a fairness/throughput trade-off
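The coarsening technique reduces to a single shift. This one-line sketch is mine; the mask width on real hardware is a configuration knob the talk does not fix.

```python
def priority_class(bw_counter, masked_bits):
    """Coarsened priority classes: masking out the low-order bits of a
    flow's bandwidth counter maps nearby counts into one class, so small
    BW differences no longer look like priority inversion and no longer
    trigger preemption (trading a little fairness for throughput)."""
    return bw_counter >> masked_bits

# With 2 bits masked, counters 0..3 share class 0 and 4..7 share class 1,
# so a flow 3 flits ahead of another is not considered higher priority.
classes = [priority_class(c, 2) for c in range(8)]  # [0, 0, 0, 0, 1, 1, 1, 1]
```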
19
PVC: Guarantees
Minimum bandwidth: based on reserved quota
Fairness: subject to BW counter resolution
Worst-case latency: a packet entering the source buffer in frame N is guaranteed delivery by the end of frame N+1
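The frame-based delivery rule implies a concrete latency bound, a simple consequence of the statement above with F denoting the fixed frame duration in cycles:

```latex
% A packet that enters its source buffer during frame N is guaranteed
% delivery by the end of frame N+1. The worst case is arrival at the
% very start of frame N, so with frame duration F:
T_{\mathrm{worst}} \;\le\; 2F
```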
20
Performance Isolation
PARSEC + Stream
21
Performance Isolation
Baseline NOC: no QOS support
Globally Synchronized Frames (GSF) [J. Lee et al., ISCA 2008]
  Frame-based scheme adapted for on-chip implementation
  Source nodes enforce bandwidth quotas via self-throttling
  Multiple frames in flight for performance
  Network prioritizes packets based on frame number
Preemptive Virtual Clock (PVC)
  Highest fairness setting (unmasked bandwidth counters)
22
Performance Isolation
[Chart: PARSEC network slowdown (1x-8x scale) under No QOS, GSF, and PVC; No-QOS bars run off-scale, labeled 34x-55x]
23
PVC Summary
Full QOS support: fairness & service differentiation; strong performance isolation
High performance: simple routers, low latency; good bandwidth efficiency
Modest area and energy overhead: 3.4 KB of storage per node (1.8x a no-QOS router); 12-20% extra energy per packet
24
Will it scale to 1000 nodes?
25
Outline
Introduction
Service Guarantees in Networks-on-Chip
Efficient Topologies for On-chip Interconnects
  Mesh-based networks
  Toward low-diameter topologies
  Multidrop Express Channels
Kilo-NOC: A Network for 1000+ Nodes
Summary and Future Work
26
NOC Topologies
Topology is the principal determinant of network performance, cost, and energy efficiency
Topology desiderata
  Rich connectivity: reduces router traversals
  High bandwidth: reduces latency and contention
  Low router complexity: reduces area and delay
On-chip constraints
  2D substrates limit implementable topologies
  Logic area/energy constrains use of wire resources
  Power constraints restrict routing choices
2-D Mesh
27
2-D Mesh
Pros: low design & layout complexity; simple, fast routers
Cons: large diameter; energy & latency impact
[Figure: mesh tile with processor (P) and cache ($)]
28
Concentrated Mesh
Pros: multiple terminals at each node; fast nearest-neighbor communication via the crossbar; hop count reduction proportional to concentration degree
Cons: benefits limited by crossbar complexity
29
Concentrated Mesh (Balfour & Dally, ICS '06)
Objectives: improve connectivity; exploit the wire budget
30
Flattened Butterfly (Kim et al., MICRO '07)
Point-to-point links; nodes fully connected in each dimension
31
Flattened Butterfly
Pros: excellent connectivity; low diameter (2 hops)
Cons: high channel count (k²/2 per row/column); low channel utilization; control complexity
32
Flattened Butterfly
Objectives: connectivity; more scalable channel count; better channel utilization
33
Multidrop Express Channels (MECS) [Grot et al., MICRO '09]
34
Multidrop Express Channels (MECS)
Point-to-multipoint channels: single source, multiple destinations
Drop points: propagate further, or exit into a router
35
Multidrop Express Channels (MECS)
Pros: one-to-many topology; low diameter (2 hops); k channels per row/column
Cons: I/O asymmetry; control complexity
MECS Summary
MECS: a novel one-to-many topology
  Excellent connectivity
  Effective wire utilization
  Good fit for planar substrates
Results summary
  MECS: lowest latency, high energy efficiency
  Mesh-based topologies: best throughput
  Flattened butterfly: smallest router area
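The topology trade-offs above can be made concrete with the counts quoted on the slides. This is a back-of-the-envelope sketch for a k x k network; taking the mesh's per-row channel count as the usual k-1 links is my assumption, not a number from the talk.

```python
def topology_metrics(k):
    """Diameter (router-to-router hops) and per-row channel count for a
    k x k network: a 2-D mesh pays O(k) hops; flattened butterfly and MECS
    both reach any node in at most 2 hops, but FBfly needs ~k^2/2 channels
    per row/column while MECS needs only k (one multidrop channel per source)."""
    return {
        "mesh":  {"diameter": 2 * (k - 1), "channels_per_row": k - 1},
        "fbfly": {"diameter": 2,           "channels_per_row": k * k // 2},
        "mecs":  {"diameter": 2,           "channels_per_row": k},
    }

m = topology_metrics(8)   # an 8x8, 64-node network
```

At k = 8, MECS matches the flattened butterfly's 2-hop diameter with 8 rather than ~32 channels per row, which is the scalability argument the summary makes.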
37
38
Outline
Introduction
Service Guarantees in Networks-on-Chip
Efficient Topologies for On-chip Interconnects
Kilo-NOC: A Network for 1000+ Nodes
  Requirements and obstacles
  Topology-centric Kilo-NOC architecture
  Evaluation highlights
Summary and Future Work
39
Scaling to a Kilo-Node NOC
Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees
MECS scalability obstacles
  Buffer requirements (more ports, deeper buffers): area, energy, latency overheads
PVC scalability obstacles
  Flow state and other storage: area, energy overheads
  Preemption: energy, latency overheads
  Prioritization and arbitration: latency overheads
40
Kilo-NOC: Addresses topology and QOS scalability bottlenecks
This talk: reducing QOS overheads
41
NOC QOS: Conventional Approach
Multiple virtual machines (VMs) sharing a die
Shared resources (e.g., memory controllers); VM-private resources (cores, caches)
[Figure: 4x4 grid of QOS-enabled routers (Q) partitioned among VM #1, VM #2, and VM #3]
42
NOC QOS: Conventional Approach
NOC contention scenarios:
  Shared resource accesses (memory access)
  Intra-VM traffic (shared cache access)
  Inter-VM traffic (VM page sharing)
[Figure: QOS-enabled routers (Q) at every node, partitioned among VM #1, VM #2, and VM #3]
43
NOC QOS: Conventional Approach
Goal: network-wide guarantees without network-wide QOS support
46
Kilo-NOC QOS: Topology-Centric Approach
Dedicated, QOS-enabled regions; rest of die is QOS-free
A richly-connected topology (MECS): traffic isolation
Special routing rules: ensure interference freedom
[Figure: die with a QOS-enabled column (Q) serving VM #1, VM #2, and VM #3; remaining nodes are QOS-free]
47
Performance Isolation
[Figure: grid of Stream nodes (S) with four memory controllers (MC); PVC-enabled MECS topology]
52
Performance Isolation
[Figure: grid of Stream nodes (S) with memory controllers (MC) and measured nodes M2, M4, Ma; MECS topology with & without network-wide PVC QOS]
53
Performance Isolation
[Chart: PARSEC network slowdown (1.0x-2.0x scale) for MECS, MECS + PVC, and K-MECS]
54
Summary: Scaling NOCs to 1000+ Nodes
Objectives: good performance, high energy- and area-efficiency, service guarantees
MECS topology
  Point-to-multipoint interconnect fabric
  Rich connectivity improves performance and efficiency
PVC QOS scheme
  Preemptive architecture reduces buffer requirements
  Strong guarantees, performance isolation
55
Summary: Scaling NOCs to 1000+ Nodes
Topology-aware QOS architecture
  Limits the extent of QOS support to a fraction of the die
  Reduces network cost, improves performance
  Enables efficiency-boosting optimizations in QOS-free regions of the chip
Kilo-NOC compared to MECS+PVC:
  NOC area reduction of 47%
  NOC energy reduction of 26-53%
56
Acknowledgements
Faculty: Steve Keckler (advisor), Doug Burger, Onur Mutlu, Emmett Witchel
Collaborators: Paul Gratz, Joel Hestness
Special thanks: the awesome CART group
57