deadlock recovery in asynchronous networks on chip...
Post on 27-May-2018
219 Views
Preview:
TRANSCRIPT
Deadlock Recovery in Asynchronous Networks on Chip in the Presence of Transient Faults
Guangda Zhang, Jim Garside, Wei Song,Javier Navaridas, Zhiying Wang
APT Group, School of Computer ScienceUniversity of Manchester, UK
1ASYNC 2015 – Mountain View, Silicon Valley, CA, USA, May 4‐6, 2015
Asynchronous Networks on Chip (NoCs)
2
Router Router
Router Router
Asyn. domainIndependent Syn. domains
link
Syn. IP0 Syn. IP1
Syn. IP2 Syn. IP3
NI
NI
NI
NI
Network Interface
SpiNNaker MPSoC with 18 ARM cores
1. Multi-core era --- Network-on-chip (NoC) 2. Asynchronous Networks on Chip --- GALS system
A GALS system constructed by an asynchronous NoC
Deadlock in synchronous NoCs
3
1. Traditional deadlock: cyclic dependence
2. Deadlock avoidance: turn models, virtual channels3. Deadlock recovery: deadlock or congestion?
(0,0) (0,1)
(1,0) (1,1)
A B
CD
Deadlock in asynchronous NoCs
Deadlocked QDI pipeline caused by permanent faults:
data
ackdo
doa
di
dia
data
ackdo
doa
di
dia
Pre-fault Post-fault
1. Traditional deadlock2. Permanent faults*
*G. Zhang, etc., “An asynchronous SDM network-on-chip tolerating permanent faults,” ASYNC 2014
Transient faults?
4
Transient faults
5
Sources of Transient faults Positive/negative
positive fault
negative fault
0‐‘1’‐0
1‐‘0’‐1transient faults
crosstalk
power supply noise
electromagnetic interference
electrostatic discharge
alpha particles
cosmic rays
transient errors (soft errors)
radiation
Single Event Transient (SET)
Single Event Upset (SEU)
Impact of transient faults on QDI pipelines(4-phase 1-of-4 pipeline)
6
1. Symbol corruption*
2. Symbol insertion*
3. Deadlock ?
1001 0000 0100
0001 01000000 0100 0000
*G. Zhang, etc, “Protecting QDI interconnects from transient faults using delay-insensitive redundant check codes,” Microprocessors and Microsystems, 2014
Why this work is important?
7
Deadlock ?• Usual network deadlock (network/data link layer)• Congestion• Deadlock due to permanent faults (physical layer)• Deadlock due to transient faults (physical layer)
1. Differentiate all deadlock types2. All deadlock detection could use a common time-out
mechanism3. A fine-grained recovery mechanism
(system reboot? NO!!!)
Modelling QDI Asynchronous NoCs
(0,0) (0,1) (0,2) (0,3)
(1,0)
(2,0)
(3,0)
y
x
(x,y)
RT RTlink
data
ack Stagedi
dia
output buf. input buf.
Stage
Stage
Stage...
...
...
... Stage
Stage do
doa
8
Deadlocked QDI pipeline
9
Deadlock: In a deadlocked QDI pipeline, no transitions could be fired any more and the pipeline gets stuck at a “stable” state.
Deadlock state: ( A==1, B==0) or (A==0, B==1)
A
B
Q
0
1
0/1
Stage i
i0
in‐1
Stage i+1acki acki+1
Ai Ai+1
CD
Single-word 4-phase 1-of-n pipeline
10
Theorem 1: A transient fault can cause symbol corruption and insertion, but NOT deadlock.
1-of-4 word:
0 0 1 0
Active wire (A)
Victim region
Deadlock of a multi-word pipeline
11
i1,0
i1,n‐1
Stage i+1
ackiacki+1
iN,0
iN,n‐1
... ...
... ...
... ...
......
......
...
Stage i
victim region
Ai,1 A1+1,1
Ai,N Ai+1,N
slice1
…
slice N
Deadlock of a multi-word pipeline (1/7)
12
i1,0
i1,n‐1
Stage i+1
ackiacki+1
iN,0
iN,n‐1
... ...
... ...
... ...
......
......
...
Stage i
victim region
0000
0001 0001
0000
00 0
0000
0001
… … …
Delayed data
Deadlock of a multi-word pipeline (2/7)
13
i1,0
i1,n‐1
Stage i+1
ackiacki+1
iN,0
iN,n‐1
... ...
... ...
... ...
......
......
...
Stage i
victim region
0000
0001 0001
0100
0 00
01
0000
0001
… … …
Deadlock of a multi-word pipeline (3/7)
14
i1,0
i1,n‐1
Stage i+1
ackiacki+1
iN,0
iN,n‐1
... ...
... ...
... ...
......
......
...
Stage i
victim region
0000
0001 0001
0100
0 00
01
0001
0100
… … …
Deadlock of a multi-word pipeline (4/7)
15
i1,0
i1,n‐1
Stage i+1
ackiacki+1
iN,0
iN,n‐1
... ...
... ...
... ...
......
......
...
Stage i
victim region
0000
0001 0001
0000
0 00
010
0001
0100
… … …
Deadlock of a multi-word pipeline (5/7)
16
i1,0
i1,n‐1
Stage i+1
ackiacki+1
iN,0
iN,n‐1
... ...
... ...
... ...
......
......
...
Stage i
victim region
0000
0001 0001
1 10
0100
0001
0000
010
… … …
Deadlock of a multi-word pipeline (6/7)
17
i1,0
i1,n‐1
Stage i+1
ackiacki+1
iN,0
iN,n‐1
... ...
... ...
... ...
......
......
...
Stage i
victim region
0000
0001 0001
1 10
0000
0001
0000
010
… … …
Deadlock of a multi-word pipeline (7/7)
i1,0
i1,n‐1
Stage i+1
ackiacki+1
iN,0
iN,n‐1
... ...
... ...
... ...
......
......
...
Stage i
victim region
0100
0001 0001
1 10
0000
0001
0000
010
… … …Ai,1 A1+1,1
Ai,N Ai+1,N
18
Deadlock pattern comparison
({Ai,1,…, Ai,N} , acki ,{Ai+1,1,…, Ai+1,N}, acki+1 ) = ({1,…,1}, 0, {0,1,…,1}, 1)almost fullcomplete
Positive transient fault: Post-fault
i1,0
i1,n‐1
Stage i+1
ackiacki+1
iN,0
iN,n‐1
... ...
... ...
...
...
......
......
...Stage i
victim region
0100
0001 0001
110
0000
0001
0000
010
… … …
Ai,1 A1+1,1
Ai,N Ai+1,N
Pre-fault
19
Deadlock pattern comparison (transient or permanent)
& ack‐ & ack+
& ack+ & ack‐Transient
Permanent
spacer
almost_full
almost_empty
complete data
G. Zhang, etc., “An asynchronous SDM network-on-chip tolerating permanent faults,” ASYNC 2014
*
*
ft_type ={(almost_full & !ack) | (almost_empty & ack);
((almost_full & ack) | almost_empty & !ack)}
00: default; 01:transient; 10: permanent; 11: invalid
20
A general deadlock detection architecture
21
Stage
Stage
Stage
Stage
Stage
Deadlockdetection
Deadlockdetection
Deadlockdetection
Pipeline piece Pipeline piece
SYNC.
ASYNC.
Deadlock detection
22
• Usual network deadlock/congestion
• Deadlock due to permanent/transient faults
1. No transitions (a time-out)
2. All pipeline stages downstream of the fault have the
same ack while the ack signals in stages upstream
of fault are alternately valued (deadlock by faults)
3. ft_type ={(almost_full & !ack) | (almost_empty & ack);
((almost_full & ack) | almost_empty & !ack)}
Network configuration
23
SouthWestNorthEastLocal
......
Crossbar SouthWestNorthEastLocal
Switch Allocator
BufferBuffer
data
eop
addr.datadata ...
headbodybodytail tail
...ack
...
...
• 2D mesh QDI NoC • 4-phase 1-of-4• Wormhole & XY-DOR
Deadlock detection
24
data
ack Stagedi
dia
output buf. input buf.
Stage
Stage
Stage...
...
...
... Stage
Stage do
doa
Deadlock detection
Patternchecker
Protected region
clock
almost_empty
cd1 cd2 cd3 cd4 cd5 cd6 cd7 cd8
almost_full
cd1 cd2 cd3 cd4 cd5 cd6 cd7 cd8 Idle
Start
timeout
Enquiry
TF_Confirm(transient)
timeout
timeouttimeout
DK_Confirm(permenent)
Permanent fault?
No
Yes
timeout
Deadlock detection
25
Idle
Start
timeout
Enquiry
TF_Confirm(transient)
timeout
timeouttimeout
DK_Confirm(permenent)
Permanent fault?
No
Yes
timeoutIdle
start
Enquiry
Deadlock detection
26
Idle
Start
timeout
Enquiry
TF_Confirm(transient)
timeout
timeouttimeout
DK_Confirm(permenent)
Permanent fault?
No
Yes
timeout
DK_Confirm(permanent)
Deadlock confirmed!!
Deadlock recovery
27
1. Permanent fault?Spatial division Multiplexing (SDM) to divide each link into physically separated sub-links
• Block the defective sub-link• Reconfigure the switch allocator• Drain the flits polluted by the faults from the network
*G. Zhang, etc., “An asynchronous SDM network-on-chip tolerating permanent faults,” ASYNC 2014
Deadlock recovery
28
Idle
Start
timeout
Enquiry
TF_Confirm(transient)
timeout
timeouttimeout
DK_Confirm(permenent)
Permanent fault?
No
Yes
timeoutTF_Confirm(transient)
Resume the blocked sub-link to use
Experimental Results
UMC 130nm standard cell library
Synchronous IP cores (SystemC) + post-synthesis routers
4×4 mesh 4-phase 1-of-4 SDM NoC
packet size: 64 bytes; Local clock: 100MHz; Time-out: 1.5MHz
29
Original Protected OverheadArea (um2) 63446 72394 14.1%
Throughput (Mbyte/s/node) 693 648 ‐6.5%
Energy (pJ/Byte) 3.7 4.3 16%
Conclusion
If the time difference between two slowest sub‐pipelines are longer than the loop latency, a transient fault at the slowest sub‐pipeline could cause deadlock.
The patterns of the deadlock caused by transient faults, congestion and the usual deadlock are different, which can be used to detect the deadlock and tell its kind.
For deadlock caused by transient faults, a fine‐grained recovery mechanism is proposed to recovery the network to avoid expensive system reboot.
30
top related