ARIADNEAgnostic Reconfiguration In A
Disconnected Network Environment
Konstantinos Aisopos (Princeton, MIT), Andrew DeOrio (Michigan),Li-Shiuan Peh (MIT), Valeria Bertacco (Michigan)
What is “reconfiguration”? Silicon technologies move into the nanometer regime …transistors become unreliable
for architects: permanent faults
Our focus in this talk: Network-on-Chip
cannot resend
need to re-route around the fault
reconfiguration: “the process of replacing the routing algorithm”
S D
Shekhar Borkar
In future chips of 100 billion transistors, 10%
of transistors will eventually fail over
the lifetime of the chip
P
P$ S$
NIC
R
Why is reconfiguration challenging?
• XY routing
S
D
X
Y
S
D
• XY routing
Why is reconfiguration challenging?
Agnostic Reconfiguration algorithm
In ADisconnectedNetworkEnvironment
S
D
• XY routing
Why is reconfiguration challenging?
Agnostic Reconfiguration algorithm
In ADisconnectedNetworkEnvironment
Outline• Motivation• Ariadne
–Baseline–Deadlocks–Synchronization
• Evaluation–Overhead–Performance–Reliability
• Conclusions
S
D
How will S find a path to D?
?
S
D
How will S find a path to D?
RT E
…… D:
RT S
…… D:
RT W
…… D:
RT N
…… D:
S
D
How will S find a path to D?
RT E,S
…… D:
RT S
…… D:
RT
W,N
…… D:
RT E,N
…… D:
RT W,S
…… D:
RT N
…… D:
S
D
How will S find a path to D?
How will S find a path to D?
D
S
RT W
…… D:
RT W
…… D: RT
S
…… D:
• Upon a fault that changes the topology…– a node can let everyone know how it can be reached
with a single broadcast– N nodes can let everyone know how they can be
reached with N broadcasts
ARIADNE: baseline
• Upon a fault that changes the topology…– Every node broadcasts “in-turn” to let others know how it can be reached
ARIADNE: baseline
1st
19
2nd
20
3rd …17
last fault
detector: 18
statically assigned node IDs
24
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 21 22 23
24 25 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
20
• Upon a fault that changes the topology…– Every node broadcasts “in-turn” to let others know how it can be reached
ARIADNE: baseline
1st
19
2nd
20
3rd …17
last fault
detector: 18
• Issues:– deadlock avoidance– synchronization (when to broadcast, multiple detectors)
S
D
ARIADNE: deadlocks
S
D
ARIADNE: deadlocks
D
S
D
S
r
ARIADNE: deadlocks
first bcast ONLY:nodes are assigned ranks
0 bcaster“root”
1immediate neighbors
2 2-hop neighbors
3 3-hop neighbors
up*/down* disable routes where rank goes
higher
up
down
lower
unique ordering:among nodes with same rank, arbitrarily select a higher one
assume routing circle:1 node will have higher rank than its neighbors, breaking the circular route
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 19 21 22 23
24 25 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
2018
26
r
ARIADNE: deadlocks
first bcast ONLY:nodes are assigned ranks
0 bcaster“root”
1immediate neighbors
2 2-hop neighbors
3 3-hop neighbors
up*/down* disable routes where rank goes
higher
up
down
lower
unique ordering:among nodes with same rank, arbitrarily select a higher one
assume routing circle:1 node will have higher rank than its neighbors, breaking the circular route
r
ARIADNE: deadlocks
first bcast ONLY:nodes are assigned ranks
0 bcaster“root”
1immediate neighbors
2 2-hop neighbors
3 3-hop neighbors
up*/down* disable routes where rank goes
higher
up
down
lower
unique ordering:among nodes with same rank, arbitrarily select a higher one
connectivity:can reach any node via the root
S
D
SD
assume routing circle:1 node will have higher rank than its neighbors, breaking the circular route
• Upon a fault that changes the topology…– Every node broadcasts “in-turn” to let others know how it can be reached
ARIADNE: deadlocks
• Issues:– deadlock avoidance– synchronization
• Upon a fault that changes the topology…– Every node broadcasts “in-turn” to let others know how it can be reached RULE: (i) first broadcast ranks nodes (ii) remaining broadcasts spread
ONLY via enabled turns
ARIADNE: deadlocks
• Issues:– synchronization
• “in turn” broadcasts: when does previous broadcast complete? can broadcasts overlap?
ARIADNE: synchronization
1-bit 1-bit
1-bit
1-bit
1-bit
1-bit
1-bit 1-bit
• how does the recipient of a flag know the broadcasting node?
NO
arbitration
Solution : Atomic Broadcasts
• Nodes utilize the cycle count as a global reference point
• Each node is assigned a unique broadcast slot from the “global” cycle counter
ARIADNE: synchronization
0
4
1
5
2
6
3
7
8
12
9
13
10
14
11
15
11 0010 10 1101 10
cycle count (same for all nodes)
XX XXXX 10 1101 10
bcast cycle
bcast node
9 75
waits for
5 0 …
11 1110 00 4 15
00 0010 10 5 05 initiates
bcast …11 1110 10 5 15
5’s bcastcompletes6 initiates
bcast 00 0010 01 6 0…11 1110 01 6 15
6’s bcastcompletes …
4’s bcastcompletes
…
11 1110 00 4 15reconfiguration completes in
(16)2 =(number of nodes)2 cycles
log(16) bits
log(16) bits
longest (in hops) broadcast
ARIADNE: synchronization
0
4
1
5
2
6
3
7
8
12
9
13
10
14
11
11 0010 10 1110 10
cycle count (same for all nodes)
XX XXXX 10 1101 10
bcast cycle
bcast node
9 75
waits for
5 0 …
11 1110 00 4 15
00 0010 10 5 05 initiates
bcast
15
11 1110 10 5 155’s bcast
completes
8
waits for
8 0
1st hop
2nd
hop
00 10 5 1
00 01 5 2…
8 resigns from becoming the
root node
(!) we need to reconfigure once even for multiple faults
ARIADNE: synchronization
Outline• Motivation• Ariadne
–Baseline–Deadlocks–Synchronization
• Evaluation–Overhead–Performance–Reliability
• Conclusions
overheadperformancereliability
• On-chip routing algorithms for irregular topologies
Immunet (V. Puente, ISCA’04)
Vicis routing algo (D. Fick, DATE’09)
reserves an escape VC for deadlock freedom (routes deterministically in a ring)
exceptions to turn model to apply it to an arbitrary topology
Evaluation
ARIADNE
synthesized a baseline 5-stage pipelined router (5 ports, 2 VCs, 5-flit buffer/VC)with Synopsys Design Compiler (IBM 130nm target library):router area (mm2): baseline=2.708, Ariadne=2.761, Vicis=2.748, Immunet=2.870
1.5%2.0% 6.0%
Evaluation: Overhead
• Experimental Setup: Garnet + GEMS
network topology 8x8 2D mesh
memory controllers 4 at chip corners
channel width 64 bits
router architecture 5-stage pipeline
router ports, VCs 5, 2 (private)
router buffers/port 5-flit for each VC
processors In-order SPARC cores
coherence MOESI protocol
L1 caching private unified 32KB/node
ways: 2 latency: 3 cycles
L2 caching shared distributed 1MB/node
ways: 16 latency: 15 cycles
Network Architecture (GARNET)
System Configuration (GEMS)
Evaluation: Performance
0 20 40 60 80 1000
20
40
60
80
100
Ariadne (average)Vicis (average)Immunet (average)
Injected Faults
Ave
rage
Lat
ency
(cyc
les)
Average over 10 PARSEC benchmarks
1000 fault configurations lower is better
deadlocks trafficrouting ina ring
overheadperformancereliability
• On-chip routing algorithms for irregular topologies
Immunet (V. Puente, ISCA’04)
Vicis routing algo (D. Fick, DATE’09)
reserves an escape VC for deadlock freedom (routes deterministically in a ring)
exceptions to turn model to apply it to an arbitrary topology
Evaluation: Performance + Reliability
ARIADNE
1.5%2.0% 6.0%
Outline• Motivation• Ariadne
–Baseline–Deadlocks–Synchronization
• Evaluation–Overhead–Performance–Reliability
• Conclusions
ConclusionsWe have presented Ariadne.• a reconfiguration algorithm that provides
deadlock-free routing paths in irregular network topologies that result from faulty links
• is implemented in a fully distributed mode, resulting in simple hardware and low complexity
• enables a trade-off between performance and reliable functionality on unreliable silicon
Thank You!
Questions?The Greek legend of Princess Ariadne
“Ariadne (Αριάδνη), was the daughter of King Minos of Crete. Minos attacked Athens after his son was killed there. The Athenians asked for terms, and
were required to sacrifice seven young men and seven maidens every nine years to the Minotaur, a monster with the head of a bull on the body of a man. One year, the sacrificial party included Theseus, a young man who
volunteered to come and kill the Minotaur. Ariadne fell in love at first sight, and helped him by giving him a ball of red fleece thread that she was
spinning, to find his way out of the Minotaur's labyrinth.”
…similarly to Princess Ariadne, our Ariadne algorithm helps packets find their way in the labyrinth-like topology of a faulty network.
[source: wikipedia]
BACKUP SLIDES
0 1K
2K
3K
4K
5K
6K
7K
8K
9K
10K
11K
12K
13K
14K
15K
16K
17K
18K
19K
20K
21K
22K
23K
24K
25K
26K
27K
28K
29K
30K
31K
32K
33K
34K
35K
36K
37K
38K
39K
40K
41K
42K
43K
44K
45K
46K
47K
48K
49K
50K
0
200
400
600
800
1000
1200
0 to 1 faults
0 to 10 faults
0 to 50 faults
time (cycles)
aver
age
pack
et la
tenc
y in
last
1K
cyc
les
(in c
ycle
s)
reconfiguration initiated
reconfiguration completed
Evaluation: Results (reconfiguration dynamics)
Ariadne Architecture
Ariadne vs. off-chip approaches• Off-chip networks utilize
centralized software algorithms for reconfiguration
C
(-) topology needs to be communicated to a central node
(-) interfacing with OS(-) updated routing tables
need to be delivered back to each node OS
Ariadne: Motivation
• Many transistor failures are expected to occur at NoCs fabricated at advanced technology nodes
0 20 40 60 80 1000
20
40
60
# transistor failures
# lin
k fa
ilure
s
• These failures will result in router link failures and disconnected routers
Evaluation: Related Work• On-chip routing algorithms for irregular
topologies implemented for comparisonImmunet
(V. Puente, ISCA 2004) Vicis routing algorithm
(D. Fick, DATE 2009)
• 1 escape VC reserved for deadlock freedom: routes deterministically in a ring
S
Dlatency when traffic 6% overhead (3 routing tables) reliability
• Other VCs route adaptively. BUT, if no other VC is available, a packet switches to escape VC
Evaluation: Related Work• On-chip routing algorithms for irregular
topologies implemented for comparisonImmunet
(V. Puente, ISCA 2004) Vicis routing algorithm
(D. Fick, DATE 2009)
• 1 escape VC reserved for deadlock freedom: routes deterministically in a ring
latency when traffic 6% overhead (3 routing tables) reliability
• Other VCs route adaptively. BUT, if no other VC is available, a packet switches to escape VC
• No faults? use turn model• Upon fault occurrence: re-enable disabled turns to
increase connectivity
(assuming north last)
?(assuming north last)
(assuming north last)
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 19 21 22 23
24 25 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
2018
26