![Page 1: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/1.jpg)
On-chip Network for Manycore Architecture
Myong Hyon “Brandon” Cho
![Page 2: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/2.jpg)
Multicore to Manycore?
MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY
| | Intel Xeon E7-x8xx | AMD FX 8-core | Tilera TILE-Gx72 |
|---|---|---|---|
| Cores | 10 | 8 | 72 |
| Process | 32nm | 32nm | 40nm |
| Year | 2011 | 2012 | 2013 |
| Architecture | Westmere-EX | Vishera (Bulldozer/Piledriver) | TILE-Gx |
| Clock, L3, power | 2.4GHz, 30MB L3, 130W (E7-8870) | 4.0GHz, 8MB L3, 125W (FX-8350) | 1.0GHz, 18MB L3, ~60W |

Image credits: © Intel Corporation, © Advanced Micro Devices, Inc., © Tilera Corporation
![Page 3: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/3.jpg)
Multicore as the only way out
[Figure: log-scale trends, 1965 to 2010: transistors (in thousands), frequency (MHz), performance, and number of cores]
Data credited to Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanovic
![Page 4: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/4.jpg)
vs. Other possibilities
© Wikipedia / Jurii
SiGe?
© Wikipedia / AlexanderAIUS
Graphene?
© iStockphoto / Andrey Volodin
Organic?
© The Economist
Quantum?
![Page 5: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/5.jpg)
vs. Other possibilities
Intel Server Microarchitecture Roadmap (according to computerbase.de, 2011)
Nehalem, Tylersburg, Westmere
Sandy Bridge, Romley, Ivy Bridge
Haswell, Haswell, Rockwell
Skylake, Skylake, Skymont
Timeline: 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
Process: 45nm, 32nm, 22nm, 14nm, 10nm
![Page 6: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/6.jpg)
NoC as the key to manycore success
The on-chip network
• realizes every communication between cores.
• consumes energy in proportion to traffic size.
• provides key mechanisms for parallel programming.
![Page 7: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/7.jpg)
Outline
NoC for Manycore
• Network-level Optimization: PROM (NoCArc’09), BAN (PACT’09)
• System-level Optimization: ENC (NOCS’11)
• Physical-level Design @ 45nm: EM2 Chip (’12/’13)
![Page 8: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/8.jpg)
Network-level Optimization:
As simple as an oblivious network, as efficient as an adaptive network
![Page 9: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/9.jpg)
PROM – path-based oblivious routing
Path-based, Randomized, Oblivious, Minimal Routing
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, and Srinivas Devadas
NoCArc’09
PROM overcomes the limitations of oblivious routing through enhanced path diversity.
![Page 10: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/10.jpg)
![Page 11: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/11.jpg)
Oblivious routing vs Adaptive routing
• Oblivious routing: local and simple, but possibly poor resource utilization.
• Adaptive routing: possibly better resource utilization, but global information required.
For on-chip networks, the simplicity of oblivious routing matters, because the performance/area overhead of adaptive routing is more significant in on-chip networks than in large-scale networks.
![Page 12: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/12.jpg)
Poor utilization of oblivious routing
[Figure: flows SA → DA and SB → DB routed with DOR (XY)]
![Page 13: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/13.jpg)
Path diversity improves oblivious routing
[Figure: flows SA → DA and SB → DB routed with O1TURN]
• Diversity helps improve utilization and reduce congestion.
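As a rough illustration of the path diversity discussed here, the sketch below builds the single DOR (XY) path and lets an O1TURN-style router pick between the XY and YX paths at random. The function names and the `(x, y)` coordinate convention are assumptions made for this example, not taken from the slides.

```python
import random

def dor_path(src, dst, order="xy"):
    """Nodes visited by dimension-order routing on a 2D mesh.

    order="xy" finishes the X dimension first, order="yx" the Y dimension.
    """
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    for axis in order:
        if axis == "x":
            while x != dx:
                x += 1 if dx > x else -1
                path.append((x, y))
        else:
            while y != dy:
                y += 1 if dy > y else -1
                path.append((x, y))
    return path

def o1turn_path(src, dst, rng=random):
    """O1TURN-style choice: pick the XY or the YX path with equal chance."""
    return dor_path(src, dst, rng.choice(["xy", "yx"]))
```

With DOR every packet of a flow uses the same links; with the random XY/YX choice the same flow is spread over two link sets.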
![Page 14: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/14.jpg)
Path diversity improves oblivious routing
• Diversity helps improve utilization and reduce congestion.
[Figure: flows SA → DA and SB → DB routed through intermediate nodes IA and IB under Valiant and ROMM (2-phase)]
![Page 15: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/15.jpg)
![Page 16: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/16.jpg)
![Page 17: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/17.jpg)
![Page 18: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/18.jpg)
Network-level deadlock
• A dependency cycle on network resources causes network-level deadlocks.
[Figure: four blocked queues Q1, Q2, Q3, and Q4 on a mesh, and the corresponding Channel Dependency Graph (CDG) in which Q1 to Q4 form a cycle]
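A minimal sketch of the check implied by the CDG: a routing scheme can deadlock at the network level only if its channel dependency graph contains a cycle. The dictionary encoding and the name `has_cycle` are assumptions for this example, and the direction of the edges between Q1 to Q4 is chosen only for illustration.

```python
def has_cycle(cdg):
    """Return True if the channel dependency graph contains a cycle.

    cdg maps each channel to the channels it may wait on.
    Uses a depth-first search with the usual three-colour marking.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in cdg.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:            # back edge: dependency cycle found
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(cdg))

# Hypothetical encoding of the four queues from the figure.
cyclic_cdg = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"], "Q4": ["Q1"]}
acyclic_cdg = {"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q4"], "Q4": []}
```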
![Page 19: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/19.jpg)
Deadlock prevention
• DOR never creates dependency cycles.
• O1TURN: the XY and YX paths together cause cycles, so O1TURN requires 2 networks to separate them.
• Valiant: the two phases together cause cycles, so Valiant requires 2 networks to separate them.
• n-phase ROMM: the phases together cause cycles, so n-phase ROMM uses n networks to separate them…
…which we found to be wrong! n-phase ROMM only requires 2 networks.
![Page 20: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/20.jpg)
Various oblivious routing schemes
| | DOR | O1TURN | 2-phase ROMM | n-phase ROMM | Valiant |
|---|---|---|---|---|---|
| Path diversity | None | Minimum | Limited | Fair~Large | Large |
| # networks for deadlock prevention | 1 | 2 | 2 | n* | 2 |
| # hops | minimal | minimal | minimal | minimal | non-minimal |
| Comm. overhead | None | None | log2(N) bits/pkt | (n-1) log2(N) bits/pkt | log2(N) bits/pkt |

* erroneously proposed
![Page 21: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/21.jpg)
PROM Routing
Path-based, Randomized, Oblivious, Minimal
Goal: Best minimal-path diversity
- Use ALL possible minimal routes.
- Each minimal route has the SAME CHANCE to be taken.
![Page 22: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/22.jpg)
![Page 23: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/23.jpg)
![Page 24: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/24.jpg)
![Page 25: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/25.jpg)
![Page 26: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/26.jpg)
![Page 27: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/27.jpg)
![Page 28: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/28.jpg)
PROM Routing
Path-based, Randomized, Oblivious, Minimal
At each hop where there are multiple choices, compare the number of possible minimal paths after each choice.
[Figure: a route from SA to DA; the successive choices are 75% vs. 25%, 67% vs. 33%, 50% vs. 50%, and finally 100%]
The chance of this path being taken is:
75% × 67% × 50% × 100% = 25%
![Page 29: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/29.jpg)
Probability Calculation
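• The probability function is reduced to a simple ratio.
[Figure: current node SA and destination DA, x hops apart along the X axis and y hops apart along the Y axis]
Number of minimal paths remaining after one hop along X or along Y:
$$N_X = \frac{(x+y-1)!}{(x-1)!\,y!}, \qquad N_Y = \frac{(x+y-1)!}{x!\,(y-1)!}$$
$$P_Y = \frac{N_Y}{N_X+N_Y} = \frac{\dfrac{1}{x!\,(y-1)!}}{\dfrac{1}{x!\,(y-1)!}+\dfrac{1}{(x-1)!\,y!}} = \frac{y}{x+y}, \qquad P_X = \frac{x}{x+y} \qquad \text{when } x>0 \text{ and } y>0$$

| P_X | P_Y |
|---|---|
| x/(x+y) | y/(x+y) |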
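To make the ratio concrete, the following sketch multiplies the per-hop probabilities along one minimal path and compares the result with one over the number of minimal paths. The helper names `prom_probs`, `path_probability`, and `num_minimal_paths` are assumptions for this example; exact fractions are used so that the equality is not blurred by rounding.

```python
from fractions import Fraction
from math import comb

def prom_probs(x, y):
    """(P_X, P_Y) for a packet with x hops left along X and y along Y."""
    if x < 0 or y < 0 or x + y == 0:
        raise ValueError("need x >= 0, y >= 0 and at least one hop left")
    return Fraction(x, x + y), Fraction(y, x + y)

def num_minimal_paths(x, y):
    """Number of minimal paths inside an x-by-y minimal-path box."""
    return comb(x + y, x)

def path_probability(moves, x, y):
    """Probability that PROM follows the given sequence of 'X'/'Y' hops.

    moves must be a valid minimal path: exactly x 'X' hops and y 'Y' hops.
    """
    p = Fraction(1)
    for move in moves:
        p_x, p_y = prom_probs(x, y)
        if move == "X":
            p *= p_x
            x -= 1
        else:
            p *= p_y
            y -= 1
    return p
```

For a destination 3 hops away on one axis and 1 hop on the other, `path_probability("XXYX", 3, 1)` multiplies 3/4, 2/3, 1/2, and 1, which matches the 25% of the earlier walk-through, and `num_minimal_paths(3, 1)` is 4.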
![Page 30: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/30.jpg)
Large-box Problem
• Paths are taken equally, but links are not.
[Figure: link utilization on the minimal-path box (MPB) between src (SA) and dst (DA)]
When the MPB is large:
- edges are underutilized.
- inner links are congested, possibly with other flows inside.
![Page 31: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/31.jpg)
Uniform PROM
| Immediate Upstream Router | P_X | P_Y |
|---|---|---|
| Don’t care | x/(x+y) | y/(x+y) |
![Page 32: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/32.jpg)
Parameterized PROM
| Immediate Upstream Router | P_X | P_Y |
|---|---|---|
| On the X axis | (x+f)/(x+y+f) | y/(x+y+f) |
| On the Y axis | x/(x+y+f) | (y+f)/(x+y+f) |
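A small sketch of the parameterized probabilities, assuming an integer parameter `f` and assuming that a node with no upstream router (the source) falls back to the uniform ratio; that fallback, the forced choice when only one dimension remains, and the function name are assumptions of this example rather than statements from the slides.

```python
from fractions import Fraction

def parameterized_prom_probs(x, y, f, upstream=None):
    """(P_X, P_Y) with x and y hops left and integer parameter f >= 0.

    upstream is "X" or "Y" for the axis on which the immediate upstream
    router lies, or None (assumed here to mean the uniform ratio).
    """
    if x < 0 or y < 0 or x + y == 0 or f < 0:
        raise ValueError("need x >= 0, y >= 0, f >= 0 and at least one hop left")
    if x == 0:                       # only Y hops remain: the choice is forced
        return Fraction(0), Fraction(1)
    if y == 0:                       # only X hops remain: the choice is forced
        return Fraction(1), Fraction(0)
    if upstream == "X":
        p_x = Fraction(x + f, x + y + f)
    elif upstream == "Y":
        p_x = Fraction(x, x + y + f)
    else:
        p_x = Fraction(x, x + y)
    return p_x, 1 - p_x
```

With `f = 0` both rows reduce to the uniform ratio; a larger `f` makes a packet more likely to keep moving along the axis it arrived on, which pushes traffic toward the edges of the minimal-path box.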
![Page 33: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/33.jpg)
Parameterized PROM
[Figure: link utilization on the minimal-path box under parameterized PROM, for f=0, f=10, and f=25]
![Page 34: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/34.jpg)
Deadlock prevention
• Turn Models [Glass et al./J.ACM’94]:
- Each turn model is a set of allowed turns.
- No deadlock if all routes conform to the same turn model.
[Figure: West-First Turn Model, North-Last Turn Model]
![Page 35: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/35.jpg)
Deadlock prevention
Any minimal routing on a 2D mesh network conforms to one of two turn models.*
• No north-east nor south-east turns: conforms to the West-First turn model
• No north-west nor south-west turns: conforms to the North-Last turn model
* Keun Sup Shim, Myong Hyon Cho, Michel Kinsy, Tina Wen, Mieszko Lis, Edward Suh, and Srinivas Devadas, Static Virtual Channel Allocation in Oblivious Routing, NOCS’09
![Page 36: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/36.jpg)
Performance Evaluation
![Page 37: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/37.jpg)
![Page 38: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/38.jpg)
Various oblivious routing schemes
| | DOR | O1TURN | 2-phase ROMM | n-phase ROMM | Valiant | PROM |
|---|---|---|---|---|---|---|
| Path diversity | None | Minimum | Limited | Fair~Large | Large | Fair~Large |
| # networks for deadlock prevention | 1 | 2 | 2 | n* | 2 | 2 |
| # hops | minimal | minimal | minimal | minimal | non-minimal | minimal |
| Comm. overhead | None | None | log2(N) bits/pkt | (n-1) log2(N) bits/pkt | log2(N) bits/pkt | None |
| Heavy-load performance | Fair | Good | Bad | Worst | Worst | Best |

* erroneously proposed; n-phase ROMM only requires 2 networks.
![Page 39: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/39.jpg)
BAN – bandwidth adaptive network
Oblivious Routing in On-Chip Bandwidth-Adaptive Networks
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Michel Kinsy, Tina Wen, and Srinivas Devadas
PACT’09
BAN achieves adaptivity with oblivious routing, using locally arbitrated bidirectional network links.
![Page 40: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/40.jpg)
Oblivious routing failure
[Figure: flows SA → DA and SB → DB sharing a congested link]
![Page 41: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/41.jpg)
Where can we do better?
![Page 42: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/42.jpg)
Adaptive Network, not routing
[Figure: flows SA → DA and SB → DB; the shared link has increased bandwidth]
• A set of bidirectional links connects network nodes.
- The bandwidth of a link in one direction can be increased at the expense of the other direction.
![Page 43: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/43.jpg)
Adaptive Network, not routing
[Figure: link directions between SA, SB, DA, and DB (a) when the yellow flow is dominant and (b) when the gray flow is dominant]
Routes do not change, and arbitration is all local.
![Page 44: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/44.jpg)
BAN Hardware
Most of the hardware overhead is in the crossbar.
[Figure: two adjacent routers, each with 1-to-v DEMUXes, virtual channels (1, …, v), v-to-1 MUXes, and an Xbar switch, connected to and from other nodes; a Bandwidth Allocator between them takes a pressure signal from each router and sets the link direction]
![Page 45: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/45.jpg)
Crossbar – 2 links, Unidirectional
• 4-input, 4-output, 4 Virtual Channels
![Page 46: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/46.jpg)
Crossbar – 2 links, Bidirectional
• 4-input, 4-output, 4 Virtual Channels
![Page 47: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/47.jpg)
Crossbar Size – 2 links
• 4-input, 4-output, 4 Virtual Channels

| Links | Switch | # xBar Inputs | # xBar Outputs | Relative xBar Size |
|---|---|---|---|---|
| Unidirectional | VC-to-Port (fully connected) | 16 | 4 | 64 |
| Bidirectional | VC-to-Port (fully connected) | 16 | 8 | 128 |
![Page 48: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/48.jpg)
Crossbar Size – 4 links

| Links | Switch | # xBar Inputs | # xBar Outputs | Relative xBar Size |
|---|---|---|---|---|
| Unidirectional | VC-to-Port (fully connected) | 16 | 8 | 128 |
| Bidirectional | VC-to-Port (fully connected) | 16 | 16 | 256 |
| Hybrid | VC-to-Port (fully connected) | 16 | 12 | 192 |

The hybrid configuration has a crossbar 1.5 times larger than the unidirectional one (192 vs. 128), which typically increases the node size by around 15%.
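The table values are consistent with a relative crossbar size equal to the number of inputs times the number of outputs. The sketch below reproduces the 4-link rows under that assumption; the names `relative_xbar_size`, `configs`, and `sizes` are hypothetical.

```python
def relative_xbar_size(inputs, outputs):
    """Relative size of a fully connected crossbar (assumed: inputs x outputs)."""
    return inputs * outputs

# (inputs, outputs) per configuration, taken from the 4-link table.
configs = {
    "unidirectional": (16, 8),
    "bidirectional": (16, 16),
    "hybrid": (16, 12),
}

sizes = {name: relative_xbar_size(i, o) for name, (i, o) in configs.items()}
hybrid_vs_unidirectional = sizes["hybrid"] / sizes["unidirectional"]
```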
![Page 49: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/49.jpg)
Bandwidth Allocation
• Local arbiters between any two adjacent routers
[Figure: a Bandwidth Arbiter between two routers, one with 3 flits to send and the other with 1 flit]
The arbitration follows the demands from each router, always leaving at least one link in a direction if there is any flit that can move in that direction.
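One way to picture the arbitration rule is the sketch below. The only property taken from the slide is that a direction with at least one movable flit keeps at least one link; the proportional split, the even split when both sides are idle, and the name `allocate_links` are assumptions made for illustration, not the arbiter described in the paper.

```python
def allocate_links(total_links, demand_a_to_b, demand_b_to_a):
    """Split bidirectional links between two adjacent routers A and B.

    Returns (links_a_to_b, links_b_to_a). Demands are flit counts.
    """
    if total_links < 2:
        raise ValueError("need at least two links to serve both directions")
    if demand_a_to_b == 0 and demand_b_to_a == 0:
        half = total_links // 2              # idle: assumed even split
        return half, total_links - half
    if demand_b_to_a == 0:
        return total_links, 0                # nothing to send toward A
    if demand_a_to_b == 0:
        return 0, total_links                # nothing to send toward B
    # Assumed policy: share links in proportion to demand ...
    share = round(total_links * demand_a_to_b / (demand_a_to_b + demand_b_to_a))
    # ... but always leave at least one link for each direction with demand.
    a_to_b = min(max(share, 1), total_links - 1)
    return a_to_b, total_links - a_to_b
```

With four links and demands of 3 flits and 1 flit, as in the figure, this policy returns three links in one direction and one in the other.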
![Page 50: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/50.jpg)
Symmetry vs. Anti-symmetry
[Figure: Bit-complement and Transpose traffic patterns, both under dimension order routing]
![Page 51: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/51.jpg)
Anti-symmetric Traffic
![Page 52: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/52.jpg)
Symmetric Traffic
![Page 53: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/53.jpg)
Symmetric Traffic with Burstiness
| Traffic Pattern | Non-bursty | Bursty |
|---|---|---|
| Bit-complement | 0% | 20% |
| Uniform Random | 8% | 26% |
![Page 54: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/54.jpg)
How about real application traffic…?
• The traffic patterns in many real applications are not symmetric, as data is processed by a sequence of modules.
![Page 55: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/55.jpg)
System-level Optimization:
autonomous & fine-grained thread migration protocol by NoC
![Page 56: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/56.jpg)
ENC – exclusive native context
Deadlock-Free Fine-Grained Thread Migration
Myong Hyon Cho, Mieszko Lis, Keun Sup Shim, Omer Khan, and Srinivas Devadas
NOCS’11 – Best Paper Award
ENC provides the first deadlock-free protocol for autonomous thread migration for any microarchitecture.
![Page 57: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/57.jpg)
Why thread migrations again?
• For a simple reason: it’s cheaper on a single die (so we can do it more often).
ThreadMotion [Rangan et al., ISCA09]
[Figure: threads migrate between a higher voltage/frequency core and a lower voltage/frequency core on cache misses and cache hits]
![Page 58: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/58.jpg)
Why thread migrations again?
• For a simple reason: it’s cheaper on a single die (so we can do it more often).
Architectural Core Salvaging [Powell et al., ISCA09]
[Figure: on floating-point ops, a thread migrates from a core that has a defective floating-point unit to a core that has no defects]
![Page 59: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/59.jpg)
Why thread migrations again?
• For a simple reason: it’s cheaper on a single die (so we can do it more often).
Execution Migration Machine (EM2) [Lis et al., SPAA11/CSAIL-TR]
[Figure: on data misses, a thread migrates between cores; each core has the only copy of its data on-chip]
![Page 60: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/60.jpg)
Migration protocols aren’t catching up...
• …use a centralized scheduler (e.g., an OS).
- slow!
• …store contexts in extra buffers or in the memory hierarchy.
- expensive and inefficient!
• …bring restrictions on how threads can migrate.
- cannot exploit the full power of migration!
![Page 61: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/61.jpg)
Need a fast migration protocol that...
• …provides functional correctness for arbitrary migrations.
• …supports autonomous migration scheduling.
• …has a simple & small implementation.
![Page 62: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/62.jpg)
Protocol-level Deadlock
[Figure: Cores A to F, each attached to its router (Router A to F), with in-flight thread contexts labeled A to F queued in the network]
![Page 63: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/63.jpg)
If an autonomous migration protocol is careless…
• SWAP: a deadlock-prone autonomous migration protocol
• An eviction swaps the locations of two threads.
[Figure: threads swapping locations]
![Page 64: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/64.jpg)
Protocol-level Deadlock
• 100 random, synthetic migration patterns
• 64 threads on 64 cores, migrating every 100 cycles
• Network-level deadlock-free routing (DOR-XY)
[Figure: deadlock (%), from 0 to 100, vs. number of hotspots (1 to 4), for 2 VCs / No Buffer, 4 VCs / No Buffer, 2 VCs / 4 contexts, and 2 VCs / 8 contexts]
![Page 65: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/65.jpg)
Exclusive Native Context (ENC) protocol
• Key idea: always accept arrived packets.
- evict a running thread in an unblockable way!
![Page 66: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/66.jpg)
![Page 67: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/67.jpg)
![Page 68: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/68.jpg)
Exclusive Native Context (ENC) protocol
• Migrating threads must not block evicted threads.
[Figure: a migration arrives at a core and evicts running thread A]
![Page 69: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/69.jpg)
Exclusive Native Context (ENC) protocol
• Separating virtual channel sets for migrations and evictions is a simple solution.
![Page 70: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/70.jpg)
Exclusive Native Context (ENC) protocol
• Each thread has its own native core.
[Figure: a migration arrives at a core and evicts running thread A, which moves to the exclusive space on its native core]
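The acceptance rule can be sketched as a toy model: every core keeps one slot reserved for its native thread and one slot for a visiting thread, an arriving thread is always accepted, and a displaced visitor is sent to its native core, where its reserved slot is free by construction. The classes, the single guest slot, and the function `arrive` are assumptions of this sketch; network transport and the separate virtual channel sets are not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Thread:
    tid: int
    native_core: int          # index of the core that reserves space for it

@dataclass
class Core:
    native: Optional[Thread] = None   # exclusive slot for the native thread
    guest: Optional[Thread] = None    # slot for a visiting thread

def arrive(cores: List[Core], thread: Thread, dest: int) -> List[Tuple[int, int]]:
    """Deliver a thread to core dest; return the (tid, core) placements made.

    The arrival is never refused: a visiting thread already on the core is
    evicted to its native core, which cannot refuse it either.
    """
    core = cores[dest]
    if thread.native_core == dest:
        core.native = thread                  # reserved slot, always available
        return [(thread.tid, dest)]
    evicted, core.guest = core.guest, thread  # accept, displacing any guest
    placements = [(thread.tid, dest)]
    if evicted is not None:
        placements += arrive(cores, evicted, evicted.native_core)
    return placements
```

Because an eviction always ends in a reserved slot, the chain of displacements triggered by one arrival has length at most one in this model.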
![Page 71: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/71.jpg)
Application performance results
• Total migration distance: no overhead in real applications
[Figure: normalized total hop count (0 to 1.2) of SWAP, SWAPinf, and ENC for RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]
![Page 72: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/72.jpg)
Application performance results
• Completion time: 11.7% overhead of ENC over SWAPinf (on avg.)
[Figure: normalized completion time (0 to 2) of SWAP, SWAPinf, and ENC for RANDOM, FFT, RADIX, LU, OCEAN, and WATER; three bars are marked DEADLOCK]
![Page 73: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/73.jpg)
Physical-level Design:
NoC router implementation for EM2 (IBM SOI 45nm)
![Page 74: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/74.jpg)
EM2 Implementation - Overview
110-core Shared Memory Processor

| Item | Description |
| --- | --- |
| ISA | EM2 Stack ISA |
| Shared Memory Architecture | 1. EM2; 2. RA (Remote Access); 3. EM2+RA |
| Cache | 8KB I$ / 32KB D$ at each core; total 4.4MB on chip; single-cycle read hits, two-cycle write hits |
| Technology | IBM SOI12SO 45nm |
| IP | ARM sc12 library (high voltage threshold), IBM SRAM compiler, IBM IO library (wire-bonding), IBM PLL, etc. |
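The total cache figure follows from the per-core sizes. A quick check, assuming the slide counts 1MB as 1000KB (with 1024KB per MB the total would round to 4.3MB instead); the variable names are illustrative:

```python
CORES = 110
ICACHE_KB = 8
DCACHE_KB = 32

# Per-core cache and the on-chip total, using decimal megabytes (an assumption).
per_core_kb = ICACHE_KB + DCACHE_KB
total_mb = CORES * per_core_kb / 1000

print(f"{per_core_kb} KB per core, {total_mb} MB on chip")
```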
![Page 75: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/75.jpg)
NoC router specification for EM2
| Item | Specification |
| --- | --- |
| Channels | Six independent 64-bit channels: Remote Access (Request, Response), Migration (EM2) (Migration, Eviction), DRAM Access (Request, Response) |
| Communication | Unicast, in-order |
| Architectural Performance | 1 cycle/hop |
| Scheduling Algorithm | Maximal scheduling |
| Routing | DOR |
| Network Buffer | Single 4-flit ingress buffer for each port |
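Dimension-order routing (DOR) picks the output port from the current and destination coordinates alone. The sketch below assumes a 2D mesh with (x, y) coordinates, Y resolved before X (the YX order mentioned on later slides), and "N" meaning increasing y; the function names and the port orientation are assumptions made for illustration, not details from the router design.

```python
def next_port(cur, dst):
    """Return the output port for YX dimension-order routing on a 2D mesh."""
    (cx, cy), (dx, dy) = cur, dst
    if cy != dy:                      # resolve the Y dimension first
        return "N" if dy > cy else "S"
    if cx != dx:                      # then the X dimension
        return "E" if dx > cx else "W"
    return "C"                        # arrived: deliver to the local core


def hop_count(src, dst):
    """Follow next_port from src to dst and count the router-to-router hops."""
    step = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
    cur, hops = src, 0
    while (port := next_port(cur, dst)) != "C":
        cur = (cur[0] + step[port][0], cur[1] + step[port][1])
        hops += 1
    return hops
```

At 1 cycle/hop, `hop_count` also gives the zero-load network latency in cycles for a single flit under these assumptions.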
![Page 76: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/76.jpg)
6 Independent Physical Networks
[Figure: 6-network router with maximal scheduling, 330um x 330um.]

| Metal Layers | Usage |
| --- | --- |
| m1, m2, m3 | Local logic |
| c1, c2 | Local routing |
| b1, b2, b3 | Remote routing / power grid |
| ua, ub | Global power grid |
| lb | Chip IO |

Six 64-bit networks need a width of 222um.
![Page 77: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/77.jpg)
Tile Floorplanning
[Figure: tile floorplan with Router, Core, Predictor, 32KB D$, and 8KB I$ blocks.]
![Page 78: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/78.jpg)
Tile Floorplanning
[Figure: tile floorplan for the EM2 tile, 855um x 917um.]
![Page 79: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/79.jpg)
Tile Floorplanning
[Figure: placement results showing the ROUTER, CORE, and PREDICTOR regions.]
![Page 80: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/80.jpg)
EM2 tile
| Parameter | Value |
| --- | --- |
| Width | 855um |
| Height | 917um |
| Working frequency (RC-extracted STA @ typical) | 105MHz |
| Hold time slack (RC-extracted STA @ typical) | 0.2ns |
| Power estimation (10% activity) | 50mW |

[Figure: tile layout showing four D$ blocks, two I$ blocks, D$ tags, and I$ tags.]
![Page 81: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/81.jpg)
Chip Floorplanning
![Page 82: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/82.jpg)
Connecting Router Links
[Figure labels: chip-level clock tree (B); tile-level clock tree (A).]
![Page 83: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/83.jpg)
EM2 chip
| Parameter | Value |
| --- | --- |
| Width | 10mm |
| Height | 10mm |
| Transistors | ~357 million |
| Tile array | 11-by-10 EM2 tiles |
| Effort | 18 man-months |

[Figure: chip layout with CLK, D-CAPs, and I/O regions; the EM2 tile array sits below the top 2 metal layers.]
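The tile and chip dimensions can be cross-checked with simple arithmetic. The sketch below assumes the tiles abut with no spacing and ignores the I/O ring and decoupling capacitors, so it is only a rough sanity check; the summed tile power is likewise a naive estimate from the per-tile figure, not a number from the slides.

```python
TILE_W_UM, TILE_H_UM = 855, 917          # tile dimensions from the tile slide
CHIP_W_UM, CHIP_H_UM = 10_000, 10_000    # 10mm x 10mm die
TILE_POWER_MW = 50                       # per-tile estimate at 10% activity


def array_size_um(cols, rows):
    """Footprint of a cols-by-rows array of abutting tiles, in micrometers."""
    return cols * TILE_W_UM, rows * TILE_H_UM


def fits(cols, rows):
    """True if the tile array fits inside the die outline."""
    width, height = array_size_um(cols, rows)
    return width <= CHIP_W_UM and height <= CHIP_H_UM


wide = array_size_um(11, 10)   # 11 tiles along the 855um side
tall = array_size_um(10, 11)   # 11 tiles along the 917um side
tiles_power_w = 11 * 10 * TILE_POWER_MW / 1000
```

Under these assumptions only the orientation with 11 tiles along the 855um side fits in the 10mm die, and the tiles alone would draw roughly 5.5W at 10% activity.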
![Page 84: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/84.jpg)
More Link Bandwidth?
Wires connecting to router pins
![Page 85: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/85.jpg)
Application Migration Patterns
Thread Concentration on 64-core EM2

| EM2 only (no RA) | Barnes | LU-contiguous | Ocean-contiguous | Radix | Water-n-squared |
| --- | --- | --- | --- | --- | --- |
| Maximum | 5 | 18 | 15 | 64 | 5 |
| Average | 2.2 | 1.6 | 6.8 | 4.1 | 2.1 |

* Simulated for a 64-core version of EM2.
![Page 86: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/86.jpg)
Applications can saturate the resource cap
In YX routing, threads going into the ‘hot core’ are more congested on the horizontal links.
![Page 87: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/87.jpg)
Applications can saturate the resource cap
In YX routing, threads evicted from the ‘hot core’ are more congested on the vertical links.
![Page 88: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/88.jpg)
Applications can saturate the resource cap
In YX routing, threads evicted from the ‘hot core’ are more congested on the vertical links.
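The asymmetry between horizontal and vertical links can be reproduced with a small link-load count. This is a toy model, not the simulation behind the slides: it assumes an 8x8 mesh, a hot core at (3, 3), one thread from every other core migrating in, and one eviction from the hot core to every other core. The mesh size, hot-core position, and function names are all illustrative choices.

```python
from collections import Counter


def yx_links(src, dst):
    """List the directed links (from, to) on a YX route: Y first, then X."""
    x, y = src
    dx, dy = dst
    links = []
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    return links


def max_link_loads(flows):
    """Return (max horizontal load, max vertical load) over all directed links."""
    load = Counter()
    for src, dst in flows:
        for link in yx_links(src, dst):
            load[link] += 1
    horizontal = max((n for (a, b), n in load.items() if a[1] == b[1]), default=0)
    vertical = max((n for (a, b), n in load.items() if a[0] == b[0]), default=0)
    return horizontal, vertical


SIZE = 8
HOT = (3, 3)
others = [(x, y) for x in range(SIZE) for y in range(SIZE) if (x, y) != HOT]
into_hot = max_link_loads([(core, HOT) for core in others])
out_of_hot = max_link_loads([(HOT, core) for core in others])
```

With YX routing the last leg into the hot core is horizontal, so inbound migrations pile up on the hot core's row, while evictions start on a vertical leg and pile up on its column.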
![Page 89: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/89.jpg)
BAN on EM2 (Simulation study)
[Figure: average migration latency, normalized (y-axis 0 to 1.2), for BARNES, LU, OCEAN, RADIX, and WATER; legend: UN, BAN.]

| EM2 only (no RA) | Barnes | LU-contiguous | Ocean-contiguous | Radix | Water-n-squared |
| --- | --- | --- | --- | --- | --- |
| Maximum | 5 | 18 | 15 | 64 | 5 |
| Average | 2.2 | 1.6 | 6.8 | 4.1 | 2.1 |

* Simulated for a 64-core version of EM2.
![Page 90: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/90.jpg)
Outline
NoC for Manycore
• Network-level Optimization
• Physical-level Design @ 45nm
• System-level Optimization
PROM (NoCARC’09), ENC (NOCS’11), BAN (PACT’09), EM2 Chip (’12/’13)
![Page 91: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/91.jpg)
Extra slides
![Page 92: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/92.jpg)
Crossbar Size – 2 lanes
• 4-input, 4-output, 4 Virtual Channels

| Links | Switch | # xBar Inputs | # xBar Outputs | Relative xBar Size |
| --- | --- | --- | --- | --- |
| Unidirectional | VC-to-Port (fully connected) | 16 | 4 | 64 |
| Bidirectional | VC-to-Port (fully connected) | 16 | 8 | 128 |
| Unidirectional | Port-to-Port (w/ input VC mux) | 4 | 4 | 16 |
| Bidirectional | Port-to-Port (w/ input VC mux) | 8 | 8 | 64 |
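The relative crossbar sizes in the table are consistent with a simple inputs-times-outputs crosspoint count. The check below uses the table's own numbers; treating crossbar size as the product of input and output counts is an assumption about how the column was derived, and `relative_size` is an illustrative helper.

```python
# (links, switch, inputs, outputs, relative size) as listed on the slide
ROWS = [
    ("Unidirectional", "VC-to-Port", 16, 4, 64),
    ("Bidirectional", "VC-to-Port", 16, 8, 128),
    ("Unidirectional", "Port-to-Port", 4, 4, 16),
    ("Bidirectional", "Port-to-Port", 8, 8, 64),
]


def relative_size(inputs, outputs):
    """Crosspoint count of a full crossbar with the given port counts."""
    return inputs * outputs


# True for each row whose listed size equals inputs x outputs.
matches = [relative_size(i, o) == size for _, _, i, o, size in ROWS]
```

By this measure, going bidirectional doubles the VC-to-Port crossbar and quadruples the Port-to-Port one.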
![Page 93: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/93.jpg)
Link Arbitration Frequency
• How frequently do link directions need to change?
• Few links change their directions in 10~20 cycles.
![Page 94: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/94.jpg)
Infrequent Link Arbitration
[Figure legend: unidirectional, N=100, N=1.]
![Page 95: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/95.jpg)
Infrequent Link Arbitration
[Figure legend: unidirectional, N=100, N=1.]
![Page 96: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/96.jpg)
Protocol-level Deadlock
[Figure: channels between cores C and D: Core C to Router C, Router C to Router D, Router D to Core D, Core D to Router D, Router D to Router C, Router C to Core C; traffic flows C to D and D to C.]
Packets are assumed to be consumed at the destination.
![Page 97: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/97.jpg)
Cyclic Resource Dependency Graph
[Figure: resource dependency graph for migration between core1 (node1) and core2 (node2) across the network; resources: C1N1, N1C1, C2N2, N2C2, NetN1, NetN2.]
![Page 98: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/98.jpg)
Acyclic Resource Dependency Graph
[Figure: resource dependency graph for migration and eviction between core1 (node1) and core2 (node2) across the network; resources: C1N1, N1C1, C2N2, N2C2, NetN1, NetN2, N1Net, N2Net, NetNative (two instances).]
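Whether a protocol can deadlock comes down to whether its resource dependency graph has a cycle, which a depth-first search can test. The two graphs below are hypothetical reconstructions: the slides do not list the edges, so the edge sets and the `.mig` / `.evict` suffixes are assumptions that only loosely follow the figure labels. An edge A -> B means a packet holding A may wait for B.

```python
def has_cycle(graph):
    """Detect a directed cycle with a three-color depth-first search."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for succ in graph.get(node, []):
            if color.get(succ, WHITE) == GRAY:
                return True            # back edge: a cycle exists
            if color.get(succ, WHITE) == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# One shared channel set: an arrival at a core may wait on that core's outbound
# channel, which closes the loop (assumed edges).
shared_channels = {
    "C1N1": ["NetN2"], "NetN2": ["N2C2"], "N2C2": ["C2N2"],
    "C2N2": ["NetN1"], "NetN1": ["N1C1"], "N1C1": ["C1N1"],
}

# Separate eviction channels that end at a native core, which always accepts
# (assumed edges).
separate_channels = {
    "C1N1.mig": ["NetN2.mig"], "NetN2.mig": ["N2C2.mig"],
    "N2C2.mig": ["C2N2.evict"], "C2N2.evict": ["NetNative.evict"],
    "C2N2.mig": ["NetN1.mig"], "NetN1.mig": ["N1C1.mig"],
    "N1C1.mig": ["C1N1.evict"], "C1N1.evict": ["NetNative.evict"],
    "NetNative.evict": [],
}

cyclic = has_cycle(shared_channels)
acyclic = not has_cycle(separate_channels)
```

Under these assumed edges the shared-channel graph contains a cycle and the separated one does not, matching the titles of the two slides.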
![Page 99: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/99.jpg)
• ENC0 : A thread always visits its native core first!
[Figure: threads and their native cores.]
Exclusive Native Context Zero (ENC0)
![Page 100: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/100.jpg)
Exclusive Native Context (ENC)
• ENC0 : A thread always visits its native core first!
• ENC : A thread goes to its native core only if evicted by another thread.
[Figure: threads and their native cores.]
• ENC saved 10 network hops (52.6%) in this example.
• Moving out a thread context must be atomic (extra logic cost).
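The difference between ENC0 and ENC can be illustrated by counting hops for one thread's migration sequence. The sketch assumes a 2D mesh with Manhattan-distance hop counts and no evictions in the ENC case; the core coordinates and the path are made up and are not the example drawn on the slide. The last two lines only check the slide's percentage: saving 10 hops at 52.6% is consistent with 19 hops under ENC0, which is an inferred figure.

```python
def manhattan(a, b):
    """Hop distance between two cores on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def enc0_hops(path, native):
    """ENC0: every migration first visits the native core, then the target."""
    return sum(manhattan(src, native) + manhattan(native, dst)
               for src, dst in zip(path, path[1:]))


def enc_hops(path):
    """ENC with no evictions: every migration goes straight to its target."""
    return sum(manhattan(src, dst) for src, dst in zip(path, path[1:]))


native_core = (0, 0)                        # hypothetical native core
path = [(0, 0), (3, 0), (3, 2), (0, 0)]     # hypothetical migration sequence
hops_enc0 = enc0_hops(path, native_core)
hops_enc = enc_hops(path)

# Slide arithmetic: 10 hops saved out of an inferred 19.
saved_percent = round(10 / 19 * 100, 1)
```

For this made-up path ENC0 takes 16 hops and ENC takes 10; an eviction under ENC would add a trip to the evicted thread's native core, which the sketch leaves out.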
![Page 101: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/101.jpg)
Exclusive Native Context (ENC)
[Figure: threads A and B and their native cores.]
• ENC saved 10 network hops (52.6%) in this example.
• Moving a thread context onto the network must be atomic.
![Page 102: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/102.jpg)
Execution Migration Machine (EM2)
• In many parallel applications, each thread mostly works on its private data.
• In EM2, a migrating thread mostly returns to a specific core.
[Figure: memory accesses on home core.]
![Page 103: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/103.jpg)
Round Robin Scheduling
[Figure: round-robin scheduler. An RR counter (+1) drives a MUX over the input ports “N”, “E”, “W”, “S”, “C”; the selected input wins the output port.]
“Bubble” cycles occur when no flit is available on the selected input port (non-maximal).
![Page 104: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/104.jpg)
Maximal Scheduling
[Figure: maximal scheduler. Five MUXes, driven by an RR counter (+1), present the input ports in rotated orders (“C N E W S”, “S C N E W”, “W S C N E”, “E W S C N”, “N E W S C”); fixed priority logic (left-to-right) picks the input that wins the output port.]
Maximal scheduling without bubbles
Area cost: 6.7% (Tile)
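The two arbiters can be compared with a short behavioral model of a single output port. This is a sketch under simplifying assumptions rather than the router's RTL: `has_flit` is a static map of which input ports hold a flit for this output, the counter advances once per cycle, and the function names are illustrative.

```python
PORTS = ["N", "E", "W", "S", "C"]


def round_robin_grant(counter, has_flit):
    """Plain round robin: only the port selected by the counter may win."""
    port = PORTS[counter % len(PORTS)]
    return port if has_flit[port] else None   # None models a bubble cycle


def maximal_grant(counter, has_flit):
    """Rotate the priority order by the counter, then take the first ready port."""
    start = counter % len(PORTS)
    order = PORTS[start:] + PORTS[:start]      # rotated order from the counter
    for port in order:                         # fixed priority, left to right
        if has_flit[port]:
            return port
    return None                                # idle only if no input is ready


def count_bubbles(grant, cycles, has_flit):
    """Count cycles in which the output port stays idle."""
    return sum(1 for cycle in range(cycles) if grant(cycle, has_flit) is None)


only_west = {"N": False, "E": False, "W": True, "S": False, "C": False}
rr_bubbles = count_bubbles(round_robin_grant, 5, only_west)
max_bubbles = count_bubbles(maximal_grant, 5, only_west)
```

With a single busy input, the plain round-robin arbiter idles in four of every five cycles, while the rotated fixed-priority arbiter grants the busy input every cycle and still rotates priority when several inputs compete.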
![Page 105: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/105.jpg)
Application performance results
• Total migration distance: no overhead in real applications
[Figure: normalized hop count (y-axis 0 to 1.6) for RANDOM, FFT, RADIX, LU, OCEAN, and WATER under SWAP, SWAPinf, ENC0, and ENC; three bars are labeled DEADLOCK.]
![Page 106: On-chip Network for Manycore Architecture Myong Hyon “Brandon” Cho](https://reader038.vdocument.in/reader038/viewer/2022110403/56649e615503460f94b5cbdf/html5/thumbnails/106.jpg)
Application performance results
• Completion time: 11.7% overhead of ENC over SWAPinf (on avg.)
[Figure: normalized completion time (y-axis 0 to 2) for RANDOM, FFT, RADIX, LU, OCEAN, and WATER under SWAP, SWAPinf, ENC0, and ENC; three bars are labeled DEADLOCK.]