Online Diagnosis of Networks-on-Chip
Sebastian Klotz
Supervisor: Dipl.-Inf. Stefan Holst
Reliable NoC in the Many Core Era
6/15/2009
Agenda
1. Objectives of Online Diagnosis
2. System Level Fault Models
3. Control Fault Detection/Localization Methods
4. Data Fault Detection/Localization Methods
5. Conclusion
Reliable NoC in the Many Core Era - Online Diagnosis of Networks-on-Chip
Objectives of Online Diagnosis
Online Diagnosis:
Detection and localization of faulty switches and inter-switch links
Fault classification: distinguish between transient, intermittent and permanent faults
Observe network behavior during operation
Provide service (input) for recovery mechanism
NoC Switch Architecture
Switch Components:
First-In-First-Out (FIFO) Buffer
Multiplexer (MUX)
Crossbar Switch
Router
[Figure: NoC switch block diagram with ports N, S, E, W and P (processor), five FIFO buffers, MUXes, a crossbar and a router]
System Level Fault Models
Control Faults:
Dropped Data Fault
Direction Fault
Multiple Copies in Space Fault
Multiple Copies in Time Fault
Data Fault:
Corrupted Data Fault
Goal: Route data from W to N
Fault-free case
Dropped Data Fault: fault in Router, FIFO or MUX
Direction Fault: fault in the Router
Multiple Copies in Space Fault: fault in the Router or MUX
Multiple Copies in Time Fault: fault in the FIFO Buffer (the same data leaves the switch at time t and again at time t+1)
Corrupted Data Fault: Routing is ok, but data is affected!
Example: 10100101 vs. 10101101 (one bit differs)
Control Fault Detection/Localization
Distraction Detection
Switch Count (TTL)
Time-out Detection
Sequence Number
Trapped Packet Detection
Example network: 3x3 Network-on-Chip / XY-Routing, switch addresses [x,y]:
S7[0,2]  S8[1,2]  S9[2,2]
S4[0,1]  S5[1,1]  S6[2,1]
S1[0,0]  S2[1,0]  S3[2,0]
XY-Routing condition at each switch:
$(Y_{Source} = Y_{Switch}) \lor (X_{Switch} = X_{Destination}) = 1$
Distraction Detection
Direction Faults ("Stuck-at N, E, S, W") and Multiple Copies in Space Faults
Ex. 1: Route from S4 [0,1] to S9 [2,2], with a stuck-at-N (SaN) fault on the route
Comparator in each switch: the Switch Address is compared with the Source Address and the Destination Address of every packet arriving from other switches.
No error: $(Y_{Source} = Y_{Switch}) \lor (X_{Switch} = X_{Destination}) = 1$
Error: $(X_{Switch} \neq X_{Destination}) \land (Y_{Source} \neq Y_{Switch})$
In Ex. 1 (route from S4 [0,1] to S9 [2,2], SaN fault) the packet is misrouted to S8 [1,2]: Distraction Detected!
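The comparator check can be sketched in a few lines of Python. This is a minimal illustration, not the presented hardware: the function names `xy_check_ok` and `distraction_detected` are hypothetical, and addresses are modelled as `(x, y)` tuples matching the [x,y] labels of the 3x3 example network.

```python
from typing import Tuple

Coord = Tuple[int, int]  # (x, y) switch address, as in the [x,y] labels


def xy_check_ok(source: Coord, switch: Coord, destination: Coord) -> bool:
    """True if a packet seen at `switch` is consistent with XY-routing.

    Condition: (Y_source == Y_switch) or (X_switch == X_destination).
    """
    return source[1] == switch[1] or switch[0] == destination[0]


def distraction_detected(source: Coord, switch: Coord, destination: Coord) -> bool:
    """Hypothetical comparator output: True means the error flag is raised."""
    return not xy_check_ok(source, switch, destination)


# Ex. 1: route S4 [0,1] -> S9 [2,2]; the misrouted packet shows up at S8 [1,2].
print(distraction_detected((0, 1), (1, 2), (2, 2)))  # True

# On the fault-free XY path S5 -> S6 -> S9 no switch raises the flag.
fault_free_path = [(1, 1), (2, 1), (2, 2)]
print([distraction_detected((0, 1), s, (2, 2)) for s in fault_free_path])
```

The check only needs the packet header and the local switch address, which is why it can run at every switch input during normal operation.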
Ex. 2: Route from S2 [1,0] to S6 [2,1], with a stuck-at-N (SaN) fault at S3
This stuck-at port fault cannot be discovered; S3 acts as expected, because the correct XY route S2, S3, S6 leaves S3 to the north anyway.
Ex. 3: Route from S8 [1,2] to S4 [0,1], with a stuck-at-E (SaE) fault
$Y_{Source} = Y_{Switch}$ still holds: No violation of XY-Routing!
Livelock!
Switch Count (TTL)
Direction Faults - "Reflection"
Ex. 3: Route from S8 [1,2] to S4 [0,1], stuck-at-E (SaE) fault, Livelock!
Idea:
Decrement the "switch count field" on each hop: 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0.
When the field reaches 0: drop the packet.
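The switch count idea can be illustrated with a small simulation. This is a sketch under stated assumptions: the SaE fault is placed at S8 [1,2] (the slide does not give its location in text form), the initial switch count is 10 as in the countdown, and the names `xy_direction` and `route` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]
MOVES: Dict[str, Coord] = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def xy_direction(switch: Coord, destination: Coord) -> Optional[str]:
    """Fault-free XY-routing: resolve X first, then Y; None means 'deliver here'."""
    if switch[0] != destination[0]:
        return "E" if destination[0] > switch[0] else "W"
    if switch[1] != destination[1]:
        return "N" if destination[1] > switch[1] else "S"
    return None


def route(source: Coord, destination: Coord, stuck_at: Dict[Coord, str],
          switch_count: int = 10) -> Tuple[str, List[Coord]]:
    """Forward a packet hop by hop; drop it once the switch count is used up."""
    position, trace = source, [source]
    while True:
        direction = xy_direction(position, destination)
        if direction is None:
            return "delivered", trace
        # A faulty router (assumed stuck-at model) overrides the computed port.
        direction = stuck_at.get(position, direction)
        if switch_count == 0:
            return "dropped", trace
        switch_count -= 1  # decrement on each hop
        dx, dy = MOVES[direction]
        position = (position[0] + dx, position[1] + dy)
        trace.append(position)


# Ex. 3 with an assumed SaE fault at S8 [1,2]: the packet bounces S8 <-> S9.
status, trace = route((1, 2), (0, 1), stuck_at={(1, 2): "E"})
print(status, len(trace) - 1)  # dropped 10

# Fault-free reference run: S8 -> S7 -> S4.
print(route((1, 2), (0, 1), stuck_at={}))
```

Dropping the packet breaks the livelock but does not by itself tell the source what happened, which is where a time-out at the source becomes useful.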
Time-out Detection
Dropped Data Faults
Ex. 3: Route from S8 [1,2] to S4 [0,1], stuck-at-E (SaE) fault
Idea:
Start a timer whenever data leaves the source.
Stop the timer when reception is confirmed (ACK/NACK?).
Expiration of the timer indicates dropped data.
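A source-side timer of this kind might look as follows. The sketch uses logical clock ticks instead of real time so that it stays deterministic; the class `TimeoutMonitor`, its method names and the timeout value of 20 ticks are illustrative assumptions, not part of the presentation.

```python
from typing import Dict, List


class TimeoutMonitor:
    """Source-side watchdog: one deadline per outstanding packet (logical ticks)."""

    def __init__(self, timeout: int) -> None:
        self.timeout = timeout
        self.deadlines: Dict[int, int] = {}

    def sent(self, packet_id: int, now: int) -> None:
        """Start the timer when the packet leaves the source."""
        self.deadlines[packet_id] = now + self.timeout

    def confirmed(self, packet_id: int) -> None:
        """Stop the timer when reception is confirmed."""
        self.deadlines.pop(packet_id, None)

    def expired(self, now: int) -> List[int]:
        """Packets whose timer ran out: presumed dropped."""
        return sorted(p for p, d in self.deadlines.items() if now >= d)


monitor = TimeoutMonitor(timeout=20)
monitor.sent(1, now=0)
monitor.sent(2, now=5)
monitor.confirmed(1)             # packet 1 was acknowledged
print(monitor.expired(now=10))   # []
print(monitor.expired(now=25))   # [2]
```

The timeout has to be chosen longer than the worst-case round trip of a fault-free packet and its confirmation, otherwise congestion would be reported as dropped data.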
Sequence Number
Dropped Data Faults and Multiple Copies in Space / Time Faults
Components: Sequence Number, Watchdog Counter
[Figure: packets travel from the source node through the Network-on-Chip to the destination node; #7 to #11 are still at the source, #6 is in the network, and #1, #2, #2, #3, #5 have arrived]
Duplicated packet (#2 arrives twice): Multiple Copies Fault
Packet #4 missing: Dropped Data Fault
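The sequence check at the destination can be sketched as below. It assumes in-order delivery and ignores sequence number wrap-around; the function name `check_sequence` and the variable `expected` (standing in for the watchdog counter) are hypothetical.

```python
from typing import Iterable, List, Tuple


def check_sequence(received: Iterable[int], first: int = 1) -> Tuple[List[int], List[int]]:
    """Return (duplicated, missing) sequence numbers for an in-order stream."""
    expected = first  # counter holding the next sequence number we expect
    duplicated: List[int] = []
    missing: List[int] = []
    for number in received:
        if number < expected:
            duplicated.append(number)                # already seen: multiple copies
        else:
            missing.extend(range(expected, number))  # skipped numbers: dropped data
            expected = number + 1
    return duplicated, missing


# Stream from the slide: #1, #2, #2, #3, #5
print(check_sequence([1, 2, 2, 3, 5]))  # ([2], [4])
```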
Trapped Packet Detection
Direction Faults - "Stuck-at Processor" (SaP Fault)
[Figure: router with ports N, S, E, W and P (processor)]
Comparison that is performed in the PE:
$XY_{Destination} = XY_{Switch}$
Violation indicates SaP Fault!
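Expressed as code, the comparison in the PE is a single equality test on the full address. The function name `sap_fault_detected` and the tuple representation of addresses are assumptions made for this sketch.

```python
from typing import Tuple

Coord = Tuple[int, int]  # (x, y) address


def sap_fault_detected(switch: Coord, destination: Coord) -> bool:
    """True if a packet handed to the local processor was meant for another switch."""
    return destination != switch


# A packet for S9 [2,2] is delivered to the processor at S5 [1,1]: trapped packet.
print(sap_fault_detected(switch=(1, 1), destination=(2, 2)))  # True
# A packet for S5 [1,1] delivered at S5 [1,1] is fine.
print(sap_fault_detected(switch=(1, 1), destination=(1, 1)))  # False
```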
Data Fault Detection/Localization
End-to-End (e2e)
Switch-to-Switch (s2s)
Code-Disjoint-Detection (cdd)
Detect Corrupted Data Faults
Perform Data Fault Detection by means of Error Detection and Correction (EDC) Codes.
ED:
Parity-Check Codes
Cyclic Redundancy Check (CRC) Codes
EDC:
Hamming Codes (SEC/DED)
SEC: Single Error Correction
DED: Double Error Detection
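As a small illustration of the simplest error-detecting code in this list, the sketch below applies a single even parity bit to the bit pattern 10100101 / 10101101 used earlier for the Corrupted Data Fault. The function names are hypothetical, and the choice of even parity is an assumption; the presentation does not fix a concrete code.

```python
def even_parity_bit(word: str) -> int:
    """Parity bit that makes the total number of ones even."""
    return word.count("1") % 2


def parity_ok(word: str, parity_bit: int) -> bool:
    """Receiver-side check: recompute the parity and compare."""
    return even_parity_bit(word) == parity_bit


sent = "10100101"
p = even_parity_bit(sent)            # 4 ones -> parity bit 0

print(parity_ok("10100101", p))      # True: data arrived unchanged
print(parity_ok("10101101", p))      # False: single bit flip is detected
print(parity_ok("10101100", p))      # True: a double bit flip goes unnoticed
```

The last line shows why stronger codes such as CRC or Hamming SEC/DED are listed as well: a single parity bit only detects an odd number of flipped bits and cannot correct anything.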
End-to-End (e2e)
[Figure: sender NI (PE, encoder, packet buffer), Switch A and Switch B (queuing buffers, data, credit signal), receiver NI (decoder, PE)]
Encode data at the sending node
Data Fault Detection at the destination
Clear localization is not possible
PE: Processing Element
NI: Network Interface
Switch-to-Switch (s2s)
[Figure: sender NI (PE, encoder, packet buffer), Switch A and Switch B each with a decoder and circular (queuing and retransmission) buffers, receiver NI (decoder, PE); signals: data, ACK, NACK]
Encode data at the sending node
Perform "checking" at every switch input
The last switch and link are suspect
Code-Disjoint-Detection (cdd)
[Figure: switch with ports N, S, E, W, P, FIFO and router; input side: Xi, Xip, Pi(Xi), link error flag; output side: Xo, Xop, Po(Xo), switch error flag]
Encode data at the sending node
Perform "checking" at every switch input as well as output
Clear localization of the fault (switch/link)
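A rough software model of the two checks is sketched below. It assumes a single even parity bit as the code and assumes that the switch error flag is suppressed when the data was already corrupted at the input; both are assumptions of this sketch, as are the names `parity` and `cdd_check`.

```python
from typing import Dict


def parity(word: int) -> int:
    """Even parity of an integer data word."""
    return bin(word).count("1") % 2


def cdd_check(x_in: int, x_in_parity: int, x_out: int, x_out_parity: int) -> Dict[str, bool]:
    """Check the code at the switch input and output and set the error flags."""
    # Input check: data arrived corrupted, so the incoming link is blamed.
    link_error = parity(x_in) != x_in_parity
    # Output check: data was fine on arrival but leaves corrupted (assumed masking).
    switch_error = (not link_error) and parity(x_out) != x_out_parity
    return {"link_error_flag": link_error, "switch_error_flag": switch_error}


good, bad = 0b10100101, 0b10101101   # the corrupted word differs in one bit
p = parity(good)

print(cdd_check(bad, p, bad, p))     # link error: corrupted before the switch
print(cdd_check(good, p, bad, p))    # switch error: corrupted inside the switch
print(cdd_check(good, p, good, p))   # no error
```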
Data Fault Detection/Localization
Comparison of the Fault Localization capabilities:
End-to-End, Switch-to-Switch and Code-Disjoint-Detection
[Figure: network with sources S1, S2, destinations D1, D2, switches Sw1 to Sw7 and links L1 to L9; paths I and II]
Path I (S1 to D1), Fault in Sw1:
e2e = {S1, L1, Sw1, L2, Sw2, L3, D1}
s2s = {Sw1, L2}
cdd = {Sw1}
Path II (S2 to D2), Fault in L5:
e2e = {S2, L4, Sw3, L5, Sw4, L6, Sw5, L7, Sw6, L8, Sw7, L9, D2}
s2s = {Sw3, L5}
cdd = {L5}
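The suspect sets listed for both paths follow a simple pattern that can be reproduced in code. The sketch assumes that a path is given as an alternating list of nodes and links and that s2s blames the last switch before the detecting input together with that switch's outgoing link; the function name `suspects` is hypothetical.

```python
from typing import List, Set


def suspects(path: List[str], faulty: str, scheme: str) -> Set[str]:
    """Suspect set reported for a fault at `faulty` on `path`.

    `path` alternates nodes and links: [source, link, switch, link, ..., destination],
    so links sit at odd indices.
    """
    i = path.index(faulty)
    if scheme == "e2e":   # only the destination checks: the whole path is suspect
        return set(path)
    if scheme == "cdd":   # input and output checks separate link from switch
        return {faulty}
    if scheme == "s2s":   # input checks only: last switch plus its outgoing link
        switch_index = i - 1 if i % 2 == 1 else i
        return {path[switch_index], path[switch_index + 1]}
    raise ValueError(f"unknown scheme: {scheme}")


path_1 = ["S1", "L1", "Sw1", "L2", "Sw2", "L3", "D1"]
path_2 = ["S2", "L4", "Sw3", "L5", "Sw4", "L6", "Sw5",
          "L7", "Sw6", "L8", "Sw7", "L9", "D2"]

for scheme in ("e2e", "s2s", "cdd"):
    print(scheme, sorted(suspects(path_1, "Sw1", scheme)),
          sorted(suspects(path_2, "L5", scheme)))
```

The size of the suspect set shrinks from the whole path (e2e) to two components (s2s) to a single component (cdd), at the price of more checking hardware per switch.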
Conclusion
Objectives of Online Diagnosis:
Fault Detection/Localization
Fault Classification (transient, intermittent & permanent)
Fault Modeling
Regard altered switch behavior (abstraction)
Classify models into either Control or Data Faults
Control Fault Detection/Localization
Distraction Detection with different extensions
Data Fault Detection/Localization
ED(C) codes: Detection
e2e, s2s and cdd: Localization
Thank you for your attention!