Pratt 1 MAPLD 2005/202
Improving FPGA Design Robustness with Partial TMR
Brian Pratt 1,2
Michael Caffrey, Paul Graham 2
Eric Johnson, Keith Morgan, Michael Wirthlin 1
1 Brigham Young University, Department of Electrical Engineering
2 Los Alamos National Laboratory
Motivation for Partial TMR
• Factors of fault-tolerant computing:
  – Availability
  – Reliability
  – Mitigation cost
• Full TMR
  – Expensive in terms of power, speed, area, etc.
  – Worthwhile if affordable!
[Figure: MTBF versus area cost, with the reliability constraint and the area constraint marked]
Motivation for Partial TMR
• Partial TMR offers:
  – Mitigation of the most sensitive design structures
  – Increased system availability by decreasing the number of system resets
  – Decreased mitigation cost compared with full TMR
• Suitability of partial TMR is application dependent
  – Reduced reliability compared to full TMR
Scrubbing
• Must be included with Partial Mitigation
• Continuously ‘read’ and ‘clean’ configuration memory
• A single bit will be upset no longer than ts, where ts = time for one scrub
[Figure: configuration bitstream, 1001011010101000110001]
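The scrub pass described above can be sketched as follows. This is a minimal illustration, assuming a golden copy of the configuration is available for comparison; the function name `scrub` and the list-of-bits model are hypothetical, and an actual scrubber works on configuration frames through the device's configuration interface.

```python
from typing import List


def scrub(config: List[int], golden: List[int]) -> int:
    """One scrub pass: restore every bit that differs from the golden copy.

    Returns the number of bits repaired. A bit upset just after the
    scrubber has passed it stays wrong until the next pass, so it is
    upset for at most one scrub period ts.
    """
    repaired = 0
    for i, good in enumerate(golden):
        if config[i] != good:
            config[i] = good
            repaired += 1
    return repaired


# Bit pattern taken from the slide; used here only as example data.
golden = [int(b) for b in "1001011010101000110001"]
config = list(golden)
config[5] ^= 1                      # inject a single-event upset
upset_before = config != golden     # True: configuration is corrupted
repaired = scrub(config, golden)    # one pass repairs the single bit
```

After the pass, `config` equals `golden` again and `repaired` is 1.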
Non-Persistent Errors
• An SEU in the non-persistent cross-section will cause a temporary interruption of service
• Requires partial reconfiguration to correct
[Figure: error magnitude versus time (cycles), annotated "Scrubbing Repairs Configuration" and "Correct Output". Error = delta between the outputs of a golden circuit and the DUT circuit.]
Persistent Errors
• An SEU in the persistent cross-section will cause a permanent interruption of service
• Requires full system reset to correct
[Figure: error magnitude versus time (cycles), annotated "Scrubbing Repairs Configuration" and "Incorrect Output". Error = delta between the outputs of a golden circuit and the DUT circuit.]
Non-Persistent Circuit Structures
• Generally consist of circuit components and routing in a feed-forward path
[Figure: circuit diagram of logic blocks and flip-flops (FF)]
Persistent Circuit Structures
• Generally consist of circuit components and routing in, or contributing to, a feed-back path
[Figure: circuit diagram of logic blocks and flip-flops (FF)]
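The difference between the two kinds of structure can be illustrated with a small simulation. This is a sketch under stated assumptions: the two toy circuits (`run_pipeline`, `run_accumulator`) are hypothetical, and an upset is modeled as a single corrupted register value with the configuration assumed to be repaired by scrubbing right afterwards.

```python
def run_pipeline(inputs, upset_cycle=None):
    """Feed-forward circuit: output is the input delayed by two registers."""
    r1 = r2 = 0
    out = []
    for t, x in enumerate(inputs):
        out.append(r2)
        r2, r1 = r1, x
        if t == upset_cycle:
            r1 ^= 0x4          # corrupt one state bit
    return out


def run_accumulator(inputs, upset_cycle=None):
    """Feedback circuit: running sum whose state feeds back into itself."""
    acc = 0
    out = []
    for t, x in enumerate(inputs):
        acc = (acc + x) & 0xFF
        if t == upset_cycle:
            acc ^= 0x4         # corrupt one state bit
        out.append(acc)
    return out


def error_trace(golden, dut):
    """Error = delta between outputs of a golden and a DUT circuit."""
    return [abs(g - d) for g, d in zip(golden, dut)]


def is_persistent(trace):
    """Treat the error as persistent if it is still nonzero at the end."""
    return trace[-1] != 0


stimulus = [1, 2, 3, 4, 5, 6, 7, 8]
ff_err = error_trace(run_pipeline(stimulus), run_pipeline(stimulus, 2))
fb_err = error_trace(run_accumulator(stimulus), run_accumulator(stimulus, 2))
```

In the feed-forward case the bad value is flushed out after it passes through the pipeline, so the error trace returns to zero. In the feedback case the corrupted state is reused every cycle, so the error never clears without a reset.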
Partial Mitigation
• Apply a mitigation technique to just the persistent cross section
[Figure: circuit diagram of logic blocks and flip-flops (FF), with part of the circuit labeled TMR]
Limitations of Partial Mitigation
• Does not prevent all errors
  – System must be corrected with configuration bitstream scrubbing
  – Circuit configuration can be incorrect between scrubs
• Non-persistent errors remain
Automated Partial TMR
• Analyze an EDIF source file for feedback structures
  – Protect these sections with TMR to reduce persistent cross section
[Figure: the circuit diagram of logic blocks and flip-flops (FF) before and after mitigation; in the mitigated version part of the logic and flip-flops is triplicated and voters are inserted]
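One way to find feedback structures in a netlist graph is to mark every cell that can reach itself along the signal flow. The sketch below is illustrative only and is not taken from the tool: the adjacency-dict representation, the cell names, and the functions `reachable` and `feedback_cells` are assumptions.

```python
from typing import Dict, List, Set


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Cells reachable from `start` by following at least one edge."""
    seen: Set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        cell = stack.pop()
        if cell not in seen:
            seen.add(cell)
            stack.extend(graph.get(cell, []))
    return seen


def feedback_cells(graph: Dict[str, List[str]]) -> Set[str]:
    """A cell lies on a feedback path if it can reach itself."""
    return {cell for cell in graph if cell in reachable(graph, cell)}


# Hypothetical netlist: driver -> list of driven cells.
netlist = {
    "in": ["logic_a"],
    "logic_a": ["ff_a"],
    "ff_a": ["logic_b"],
    "logic_b": ["ff_b"],
    "ff_b": ["logic_b", "logic_c"],   # ff_b feeds back into logic_b
    "logic_c": ["ff_c"],
    "ff_c": [],
}
feedback = feedback_cells(netlist)
```

Here only `logic_b` and `ff_b` form a loop. For large designs a strongly-connected-components algorithm would be the more efficient choice; the per-cell search above is kept for brevity.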
BLTmr Partial TMR Tool
• BYU-LANL Triple Modular Redundancy: Configurable Reliability
  – Limit mitigation to minimize:
    • design resource requirements
    • power consumption
  – Mitigation focused on persistent circuit structures
BLTmr Partial TMR Tool
• Design divided into three sections:
  – Feedback
  – Input to feedback (FB)
  – Output
[Figure: circuit diagram of logic blocks and flip-flops (FF)]
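The three-way division can be sketched as a backward walk from the feedback cells: anything upstream of the feedback is "input to FB", and everything else is "output". This is a hedged illustration, not the tool's code; the graph format, the cell names, the section labels, and the function `partition` are assumptions, and the feedback set is taken as given.

```python
from collections import deque
from typing import Dict, List, Set


def partition(graph: Dict[str, List[str]], feedback: Set[str]) -> Dict[str, str]:
    """Label each cell as 'feedback', 'input_to_fb', or 'output'."""
    # Build reverse edges so the walk can go from a cell to its drivers.
    preds: Dict[str, List[str]] = {cell: [] for cell in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            preds.setdefault(dst, []).append(src)

    # Breadth-first walk backward from every feedback cell.
    upstream: Set[str] = set()
    queue = deque(feedback)
    while queue:
        cell = queue.popleft()
        for driver in preds.get(cell, []):
            if driver not in feedback and driver not in upstream:
                upstream.add(driver)
                queue.append(driver)

    labels = {}
    for cell in graph:
        if cell in feedback:
            labels[cell] = "feedback"
        elif cell in upstream:
            labels[cell] = "input_to_fb"
        else:
            labels[cell] = "output"
    return labels


# Hypothetical netlist: driver -> list of driven cells.
netlist = {
    "in": ["logic_a"],
    "logic_a": ["ff_a"],
    "ff_a": ["logic_b"],
    "logic_b": ["ff_b"],
    "ff_b": ["logic_b", "logic_c"],
    "logic_c": ["ff_c"],
    "ff_c": [],
}
sections = partition(netlist, {"logic_b", "ff_b"})
```

In this example `in`, `logic_a`, and `ff_a` are labeled as input to feedback, while `logic_c` and `ff_c` fall in the output section.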
BLTmr Tool Options
• BLTmr Tool applies TMR mitigation to subsections of the design:
  – Feedback Only
  – Feedback + Input to Feedback
  – FB + Input to FB + Output (Full TMR)
[Figures: the circuit diagram of logic blocks and flip-flops (FF), shown unmitigated and then with each option applied in turn; the mitigated logic and flip-flops are triplicated and voters are inserted]
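The three options amount to choosing which sections get triplicated. A minimal sketch, assuming each cell already carries a section label; the option names, label strings, and the function `cells_to_triplicate` are hypothetical and do not reflect the tool's actual command-line interface.

```python
from typing import Dict, Set

# Which design sections each option covers (names are illustrative).
SECTIONS_BY_OPTION = {
    "feedback_only": {"feedback"},
    "feedback_and_input": {"feedback", "input_to_fb"},
    "full_tmr": {"feedback", "input_to_fb", "output"},
}


def cells_to_triplicate(sections: Dict[str, str], option: str) -> Set[str]:
    """Return the cells whose section is covered by the chosen option."""
    wanted = SECTIONS_BY_OPTION[option]
    return {cell for cell, section in sections.items() if section in wanted}


# Hypothetical labeled design.
labeled = {
    "in_logic": "input_to_fb",
    "fb_logic": "feedback",
    "fb_ff": "feedback",
    "out_logic": "output",
}
fb_only = cells_to_triplicate(labeled, "feedback_only")
fb_input = cells_to_triplicate(labeled, "feedback_and_input")
full = cells_to_triplicate(labeled, "full_tmr")
```

Each option is a superset of the previous one, which is what lets the user trade area against reliability in steps.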
BLTmr Tool Flow
• BYU EDIF development environment reads in user design
• Design organized into graph structure for analysis
• User may direct mitigation
• Design analyzed to classify components as described
• Circuit elements triplicated
• Voters inserted
• Mitigated design written in EDIF format
[Figure: tool flow. Original Design -> Parse EDIF -> Create Design Database -> Analysis (Feedback, Input to FB, etc.), guided by User Constraints -> Cell Triplication -> Voter Insertion -> Partially Mitigated Design]
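The effect of triplication followed by voter insertion can be sketched behaviorally. This models only the voting logic, not the netlist transformation itself; the wrapper `tmr`, its `upset_copy` and `upset_mask` parameters, and the example function are assumptions made for illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority vote."""
    return (a & b) | (a & c) | (b & c)


def tmr(fn):
    """Run three copies of a combinational function and vote on the result.

    `upset_copy` selects one copy (0, 1, or 2) whose output is corrupted
    by XOR with `upset_mask`, standing in for an upset in that copy.
    """
    def voted(x: int, upset_copy=None, upset_mask: int = 0) -> int:
        results = [fn(x) for _ in range(3)]
        if upset_copy is not None:
            results[upset_copy] ^= upset_mask
        return majority(*results)
    return voted


def logic(x: int) -> int:
    """Hypothetical 8-bit combinational block."""
    return (x * 3 + 1) & 0xFF


protected = tmr(logic)
clean = protected(5)
masked = protected(5, upset_copy=1, upset_mask=0xFF)
```

With a single corrupted copy the voter still returns the correct value, so `masked` equals `clean`. Two corrupted copies would defeat the vote, which is why scrubbing is still needed to repair upsets before they accumulate.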
Example Circuits
• Tests on two designs
  1. DSP Kernel
  2. Synthetic Design
    – LFSR modules feeding into an add-multiply tree
Unmitigated Fault Analysis

Design             FPGA Editor Layout    Sensitivity Map        Persistence Map
DSP Kernel         5,746 slices (46%)    575,448 bits (9.9%)    13,841 bits (0.23%)
Synthetic Design   2,538 slices (20%)    189,835 bits (3.3%)    77,159 bits (1.3%)
Experimental Results – Design #1: DSP Kernel

Configuration                                    FPGA Editor Layout    Sensitivity Map    Persistence Map
Unmitigated                                      5,746 slices (46%)    575,448 (9.90%)    13,841 (0.24%)
Partial TMR applied to Feedback & Input to FB    8,036 slices (65%)    569,700 (9.81%)    152 (0.0026%)
Experimental Results – Design #2: Synthetic (LFSR/Mult)

Configuration       FPGA Editor Layout     Sensitivity Map    Persistence Map
Unmitigated         2,538 slices (20%)     189,835 (3.27%)    77,159 (1.33%)
Full TMR Applied    11,961 slices (97%)    20,256 (0.35%)     671 (0.012%)
Experimental Results
[Figure: Virtex 1000 proton cross section (cm²), log scale from 1.00E-12 to 1.00E-05, for Unmitigated, TMR Feedback, TMR Feedback + Input to FB, and Max TMR. Series: Static X-Section, DSP Kernel* Dynamic, DSP Kernel Persistent, Synthetic Dynamic, Synthetic Persistent.]
* Full TMR could not be applied to DSP Kernel due to FPGA resource constraints
“Qpro Virtex 2.5V radiation hardened FPGAs”, Xilinx Inc., DS028 (v1.2), Nov. 5, 2001.
Experimental Results
• GPS orbit (22,200 km altitude, 55° inclination)
• AP-8 Solar Minimum, JPL Solar Proton Quiet, CREME96 Solar Minimum
[Figure: MTBF (days), log scale from 1 to 100000, for the DSP Kernel and Synthetic designs. Series: Static X-Section (Sensitive), Unmitigated (Sensitive), Unmitigated (Persistent), Feedback TMR (Persistent), Feedback + Input TMR (Persistent), Max TMR (Persistent).]
Summary of Results
Design               Size Increase    Sensitivity Decrease    Persistence Decrease    Average MTBF Increase‡‡
DSP Kernel*          40%              3%                      99%                     90x
Synthetic Design‡    370%             89%                     99%                     114x

* Unmitigated to Partial TMR of Feedback + Input to FB
‡ Unmitigated to Full TMR
‡‡ GPS orbit; AP-8 Solar Minimum, JPL Solar Proton Quiet, CREME96 Solar Minimum
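The size, persistence, and MTBF columns can be cross-checked against the slice and persistent-bit counts reported on the per-design result slides. The sketch below rests on one assumption that is not stated in the slides: that the rate of persistent failures is proportional to the number of persistent configuration bits, so the MTBF gain is the ratio of those bit counts. The function names are illustrative.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


def percent_decrease(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def mtbf_gain(bits_before: int, bits_after: int) -> float:
    """MTBF ratio, assuming failure rate is proportional to persistent bits."""
    return bits_before / bits_after


# Slice counts and persistent-bit counts from the results slides.
dsp_size = percent_increase(5746, 8036)          # about 40%
syn_size = percent_increase(2538, 11961)         # about 371%
dsp_persist = percent_decrease(13841, 152)       # about 98.9%
syn_persist = percent_decrease(77159, 671)       # about 99.1%
dsp_gain = mtbf_gain(13841, 152)                 # about 91x
syn_gain = mtbf_gain(77159, 671)                 # about 115x
```

These estimates land close to the tabulated 90x and 114x. The table's values are averages over the listed orbit models, so only approximate agreement should be expected.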
Conclusions
• Pros: Partial TMR (BLTmr) as fault mitigation offers:
  – Increased system availability due to fewer system resets
  – More “affordable” fault mitigation than full TMR
  – Critical design areas are mitigated with an automated tool
• Cons:
  – Much of the design may be unmitigated, leaving sensitive sections
    • May result in temporary errors