Post on 19-Mar-2020
Early Dependability Assessment of FPGA-Based
Space Applications Using Formal Verification
Khaza Anuarul Hoque
Mentor: Dr. Taylor T. Johnson
Dept. of Computer Science & Engineering,
University of Texas at Arlington
Thanks to Otmane Ait Mohamed (Concordia Univ., Canada) and Yvon Savaria (Polytechnique Montreal, Canada) for supervising the project
Outline
Motivation
FPGA and SEUs
Proposed methodology
Design options analysis
DAL analysis and scrub optimization
TMR partitioning optimization
Lessons learned
Summary
Future directions
Motivation: Cosmic Radiation
Motivation (cont.)
Background: FPGAs and SEUs
SRAM-based FPGAs
(+) Cheaper than Rad-hard FPGAs
(+) Better performance than other types of FPGAs
(+) On-field programmability
(-) Susceptible to cosmic-radiation-induced SEUs
(-) Mitigation required
- Redundancy (such as TMR)
- Scrubbing/reconfiguration
SEU Estimation
Three main ways to analyze SEUs:
1. Hardware testing (beam testing/laser testing)
(+) Most realistic and accurate
(-) Requires finished implementation
(-) May damage the device
(-) Costly
2. Fault injection (emulation/simulation)
(+) Very useful, though less accurate than hardware testing
(-) Test time grows with the number of possible test cases
SEU Estimation (cont.)
3. Analytical techniques
(+) Better controllability, quick estimation of SEUs
(+) No risk of damaging the device
(+) Estimation at early design stage
(-) Can be relatively less accurate in some cases
An early estimate of SEUs helps to:
- Build a more reliable design
- Adopt the required mitigation strategy
- Reduce the design time
- Reduce the design cost
Objectives
To propose a methodology for early SEU analysis of SRAM-based
FPGA designs for aerospace applications:
Based on a formal verification technique (probabilistic model checking)
Design options analysis
What is the trade-off among dependability, performability, and area?
Which design option should a designer choose?
Scrub optimization and DAL analysis
Is the adopted mitigation enough?
Can the scrub frequency be optimized?
Early assessment of TMR partitioning to increase reliability
Can we find the number of partitions early?
Formal Verification
The application of rigorous, mathematics-based techniques to establish the correctness of computerised systems
Main techniques:
- Model checking
- Equivalence checking
- Theorem proving
Many properties other than correctness are important
Quantitative requirements:
- “how reliable is my car’s Bluetooth network?”
- “how efficient is my phone’s power management policy?”
- “how secure is my bank’s web-service?”
Probabilistic model checking is a formal verification technique for modelling and analysing systems that exhibit probabilistic behaviour
Probabilistic Model Checking
Image credit: David Parker, Probabilistic Model Checking, Michaelmas 2011
Probabilistic Model Checking (cont.)
Models: variants of Markov chains
- Discrete-time Markov chains (DTMCs)
- Continuous-time Markov chains (CTMCs)
- Markov decision processes (MDPs)
Property specifications:
- PCTL, CSL, PCTL*, LTL
Transient and steady-state analysis: P=? [ F[t,t] oper ]
- Instantaneous availability of the system, e.g., the probability that the system is in a specific state at time instant t
Timing and ordering of events: P=? [ !fail_B U[3600,7200] fail_B ]
- Probability that component B fails for the first time during the second hour of operation
Reward-based properties: R{"oper"}=? [ C<=t ]
- Expected cumulative operational time of the system in the time interval [0, t]
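As a rough illustration of what a transient query such as P=? [ F[t,t] oper ] computes, the sketch below evaluates the state distribution of a small CTMC at time t by uniformization. The two-state operational/failed model, the rate values, and all function and variable names are assumptions made for this example; none of them come from the slides, and this is not the model checker's own implementation.

```python
import math

def transient_distribution(Q, p0, t, eps=1e-12, max_terms=100000):
    """Transient distribution p(t) of a CTMC with generator Q (rows sum to 0),
    starting from distribution p0, computed by uniformization."""
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))      # uniformization rate
    if q == 0.0 or t == 0.0:
        return list(p0)
    # Embedded DTMC of the uniformized chain: P = I + Q / q
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-q * t)                # Poisson probability of 0 jumps
    acc = weight
    v = list(p0)
    result = [weight * x for x in v]
    k = 0
    while 1.0 - acc > eps and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k                  # Poisson probability of k jumps
        acc += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

# Hypothetical rates (per hour): state 0 = operational, state 1 = failed.
lam, mu, t = 1e-3, 0.1, 24.0
Q = [[-lam, lam],
     [mu, -mu]]
availability_t = transient_distribution(Q, [1.0, 0.0], t)[0]

# Closed form for the two-state chain, used only as a cross-check.
closed_form = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)
```

For this two-state chain the closed-form expression is available, so it serves as a sanity check; for larger models only the numerical route is practical.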
Proposed Methodology
Design Options Analysis
Analyze design options with respect to
Reliability
Availability
Safety
Performability-area trade-off
Throughput: 0.33
Component failed!
Solution: rescheduling
Throughput after rescheduling: 0.25
Design Option Analysis Methodology
[Flowchart; boxes: Dataflow graph, Configuration, Characterization library, Mitigation(s), Fault coverage, PRISM model, Properties, Rewards, PRISM model checker, Results]
Markov Modeling Example
Table: Characterization library
Scrub Optimization and DAL Analysis
DO-254:
- Baseline of required design flow steps for airborne components
- Five levels of compliance, commonly known as Design Assurance Levels (DALs)
- A failure condition in the flight control system (which may lead to a catastrophic event) is not equivalent to a failure condition in the entertainment system (even though it could spoil your day!)
- Engineers designing to level A or B face a much more rigorous test, verification, and documentation process than for levels C, D, or E.
Can a lower scrub frequency (to reduce power consumption) still meet the DAL requirement?
DAL Analysis Methodology
[Flowchart; boxes: High-level description, CDFG extraction (example operator nodes +, *, +), Resource estimation, Characterization library, Number of total essential bits, Mitigation strategy, Failure & scrub parameter, PRISM model (Erlang distribution), DAL properties, PRISM MC, decision "DAL met?" with branches Yes (Finish) and No]
Level | Classification | Failure condition description | Pb/h
A | Catastrophic | Failure conditions that would prevent continued safe flight and landing. | < 10^-9 (extremely improbable)
B | Hazardous / Severe-Major | Large reduction in safety margins or functional capabilities, physical distress or higher workload such that the flight crew could not be relied on to perform their tasks accurately or completely, or adverse effects on occupants including serious or potentially fatal injuries to a small number of those occupants. | < 10^-7 (extremely remote)
C | Major | Significant reduction in safety margins or functional capabilities, a significant increase in flight crew workload or in conditions impairing flight crew efficiency, or discomfort to occupants, possibly including injuries. | < 10^-5 (remote)
D | Minor | Slight reduction in safety margins or functional capabilities, a slight increase in flight crew workload, such as routine flight plan changes, or some inconvenience to occupants. | < 10^-3 (probable)
E | No Effect | Failure conditions that do not affect the operational capability of the aircraft or increase flight crew workload. | -
Modeling and Parameters
Set of states
- The set of states can be classified into “fully operational”, “faulty with one or more faults”, and “failed” states.
Modeling parameters
- Number of critical bits: from the characterization library
- Environmental: design failure rate, $\lambda_{design} = \lambda_{bit} \times \#\,\text{critical bits}$
- Target system: SelectMap bus width $B$ and configuration clock frequency $f_{cclk}$
- Mitigation: correction rate, $\mu_{design} = \dfrac{B \times f_{cclk}}{\#\,\text{configuration bits}}$
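The two rate formulas above can be evaluated directly. The sketch below is a minimal illustration; the function names and every numeric value (per-bit upset rate, bit counts, bus width, clock frequency) are hypothetical placeholders chosen for the arithmetic, not figures from the slides or from any device datasheet.

```python
def design_failure_rate(lambda_bit, critical_bits):
    """lambda_design = lambda_bit * (number of critical bits)."""
    return lambda_bit * critical_bits

def correction_rate(bus_width_bits, f_cclk_hz, configuration_bits):
    """mu_design = (B * f_cclk) / (number of configuration bits).

    With B in bits per clock cycle and f_cclk in Hz, the result is in 1/s,
    i.e. the number of full configuration passes per second."""
    return bus_width_bits * f_cclk_hz / configuration_bits

# Hypothetical inputs, for illustration only.
lambda_bit = 1e-10            # upsets per bit per second (assumed)
critical_bits = 100_000       # would come from the characterization library
B = 8                         # bus width in bits (assumed)
f_cclk = 50e6                 # configuration clock in Hz (assumed)
configuration_bits = 20e6     # total configuration bits (assumed)

lam_design = design_failure_rate(lambda_bit, critical_bits)   # 1e-5 per second
mu_design = correction_rate(B, f_cclk, configuration_bits)    # 20 per second
scrub_pass_time = 1.0 / mu_design                             # 0.05 seconds
```

With these assumed inputs one full configuration pass takes 0.05 s, which is many orders of magnitude shorter than the mean time between upsets implied by the assumed failure rate.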
Deterministic Delay Modeling
Erlang process:
Exponential delay: $S_0 \xrightarrow{\lambda} S_1$
Erlang-$k$ chain: $S_0 \xrightarrow{k/\tau} S_1 \xrightarrow{k/\tau} S_2 \cdots S_{k-1} \xrightarrow{k/\tau} S_k$
[Plot: probability against time, with the delay $\tau$ marked on the time axis]
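To see why a chain of k exponential phases with rate k/τ can stand in for a fixed delay τ, the following sketch evaluates the Erlang-k cumulative distribution function for increasing k. The function name and the sample values of k and τ are assumptions made for this illustration.

```python
import math

def erlang_cdf(t, k, tau):
    """P(T <= t) for an Erlang delay built from k exponential phases,
    each with rate k / tau, so that the mean delay equals tau."""
    x = (k / tau) * t
    term = math.exp(-x)        # Poisson(x) probability of 0 events
    total = term
    for n in range(1, k):
        term *= x / n          # Poisson(x) probability of n events
        total += term
    return 1.0 - total         # probability that all k phases have completed

tau = 1.0
# Probability of finishing well before tau shrinks as k grows ...
early = [erlang_cdf(0.5 * tau, k, tau) for k in (1, 10, 100)]
# ... while the probability of finishing shortly after tau approaches 1.
late = [erlang_cdf(1.5 * tau, k, tau) for k in (1, 10, 100)]
```

The mean stays at τ for every k, while the variance τ²/k shrinks, so the distribution concentrates around τ. The cost is k extra states per delayed transition, which is the usual accuracy versus state-space trade-off of this approximation.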
Modeling Periodic Blind Scrub
Modeling TMR & Blind Scrub
Triple Modular Redundancy (TMR)
TMR Partitioning: Example
TMR Partitioning Methodology
[Flowchart; boxes: High-level description, Extracted CDFG, Characterization library, User, Requirement specification, No. of partitions, No. of components in each module, Failure rate of each module, Scrub rate, PRISM model, CTMC, Reliability/Availability properties, PRISM MC, Quantitative results, decision "Req. met?" with branches Yes (Finish) and No]
Markov Modeling of Single Bit Upset
A system with N partitions can be defined as a set of partitions,
where each partition is represented by a CTMC.
The final model of the system is the parallel
composition (||) of the CTMCs of all partitions.
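The effect of parallel composition on model size can be sketched as follows. This is an illustrative assumption, not the authors' model: each partition is taken to be a three-state CTMC (operational, faulty, failed) with made-up rates, and the partitions are composed by interleaving, so every global transition changes exactly one partition.

```python
from itertools import product

def partition_ctmc(lam, mu):
    """Hypothetical 3-state partition: 0 = operational, 1 = faulty, 2 = failed.
    Returned as a dict mapping (source, target) to a transition rate."""
    return {(0, 1): lam, (1, 2): lam, (1, 0): mu, (2, 0): mu}

def parallel_compose(components, n_local_states=3):
    """Interleaving composition of independent CTMCs.

    A global state is a tuple with one local state per component; each
    global transition applies one local transition to one component."""
    states = list(product(range(n_local_states), repeat=len(components)))
    transitions = {}
    for s in states:
        for i, comp in enumerate(components):
            for (src, dst), rate in comp.items():
                if s[i] == src:
                    target = s[:i] + (dst,) + s[i + 1:]
                    transitions[(s, target)] = rate
    return states, transitions

# Assumed rates, for illustration only.
two = parallel_compose([partition_ctmc(1e-5, 2.0) for _ in range(2)])
four = parallel_compose([partition_ctmc(1e-5, 2.0) for _ in range(4)])
state_counts = (len(two[0]), len(four[0]))   # 3**2 and 3**4
```

The number of global states grows as 3^N under this assumption, which is why the number of partitions has a direct impact on the cost of model checking.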
SBU Model for FIR: 2 Partitions
Combined Model for FIR: 2 Partitions
Key Lessons
The extra reliability provided by redundancy does not always
justify the additional area overhead.
Rescheduling with scrubbing is good enough to serve as a
fault recovery and repair mechanism where optimization of
reliability, area, and performance is required.
It is possible to find an appropriate scrub interval (the slowest
acceptable scrub rate) that saves power while meeting the
dependability requirements, instead of choosing the highest
scrub frequency.
Key Lessons (cont.)
There exists an optimal number of TMR partitions.
The larger the number of partitions (that is, the smaller the
modules), the less frequently scrubbing is required to meet a
target reliability.
For availability, the number of partitions matters in the
cases where the scrub interval is long.
With frequent scrubbing, increasing the number of partitions
improves availability only minimally.
For longer scrub intervals, the availability improvement from
an increased number of partitions is quite significant.
Lessons Learned
Brief experience from some of our previous research:
- K. A. Hoque, O. A. Mohamed and Y. Savaria, “Applying Formal Verification to Early Assessment of FPGA-based Aerospace Applications: Methodology and Experience”, 10th IEEE Systems Conference (SysCon 2016), Orlando, USA, 2016.
- K. A. Hoque, O. A. Mohamed and Y. Savaria, “Towards an accurate reliability, availability and maintainability analysis approach for satellite systems based on probabilistic model checking”, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, 2015.
- K. A. Hoque, O. A. Mohamed, Y. Savaria and C. Thibeault, “Probabilistic Model Checking Based DAL Analysis to Optimize a Combined TMR-Blind-Scrubbing Mitigation Technique for FPGA-Based Aerospace Applications”, 12th ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE), IEEE, 2014, Lausanne, Switzerland.
- K. A. Hoque, O. A. Mohamed, Y. Savaria and C. Thibeault, “Early Analysis of Soft Error Effects for Aerospace Applications using Probabilistic Model Checking”, 2nd International Workshop on Formal Techniques for Safety-Critical Systems (FTSCS13), CCIS, Springer-Verlag, 2013, Queenstown, New Zealand.
New modeling results related to TMR partitioning are not published yet.
Summary
FPGAs are commonly used in aerospace, so early dependability
analysis helps to save:
Time
Effort
Cost
We proposed a methodology based on formal verification for:
Design options analysis
Scrub interval optimization with DAL analysis
Optimal partitioning of TMR
Probabilistic model checking can be used for SEU analysis as a
complementary approach alongside other techniques.
Future Directions
Inclusion of other fault models, such as design failures due to
aging, electromigration, hot-electron effects, negative-bias
temperature instability (NBTI), and single-event functional
interrupts (SEFIs).
Inclusion of an adaptive mitigation model based on radiation
sensitivity.
Extension of the models to handle upsets of three or more bits.
Extension of the models to support read-back scrubbing and
other customizable scrubbing schemes.
Acknowledgement
This research work is a part of the AVIO-403 project financially supported by the Consortium for Research and Innovation in Aerospace in Quebec (CRIAQ), Fonds de Recherche du Québec - Nature et Technologies (FRQNT) and the Natural Sciences and Engineering Research Council of Canada (NSERC). The presenter would also like to thank Bombardier Aerospace, MDA Space Missions and the Canadian Space Agency (CSA) for their technical guidance and financial support.
The presenter would also like to thank Dr. Taylor T. Johnson of the VeriVITAL group, University of Texas at Arlington, for his financial support to attend the S5 Symposium.
Thank You
Questions?
EXTRA SLIDES
Quantitative Analysis: 16-tap FIR Filter
Table: Design options to evaluate
No | Configuration | Spare | Scrubbing | Rescheduling
C1 | 2A 2M | None | ✓ | ✓
C2 | 2A 3M | 1 Mul | ✓ | ✓
C3 | 3A 2M | 1 Add | ✓ | ✓
C4 | 3A 3M | 1 Add, 1 Mul | ✓ | ✓
Expected Throughput: Overall Reward
Interval (days) | Configuration | Overall reward
1 | C1 | 1.432
1 | C2 | 1.045
1 | C3 | 1.326
1 | C4 | 0.993
4 | C1 | 1.216
4 | C2 | 0.940
4 | C3 | 1.166
4 | C4 | 0.931
9 | C1 | 0.942
9 | C2 | 0.769
9 | C3 | 0.932
9 | C4 | 0.790
R{"Expected throughput"}=? [ S ]
DAL Verification
“For a given scrub interval, the probability that the FIR filter will
fail in the last 20 minutes of the flight is less than 0.01”
P<0.01 [ F[344400,345600] "failure" ]
Figure: Fault tree of the system
Table: Verification of DAL requirement
Scrub interval (s) | DAL-A met (scrub only) (A = 0.0001, B = 0.001) | DAL-A met (scrub & TMR) (B = 0.001)
0.5 | False | True
1.0 | False | True
1.5 | False | True
2.0 | False | True
2.5 | False | True
3.0 | False | True
Quantitative Analysis: Availability Verification
“For a given scrub interval, does the system meet the requirement for
five 9s in the long run?”
S>=0.99999 [ "operational" ]
Table: Verification of availability requirements
Scrub interval (s) | Availability requirement met? (FIR) | Availability requirement met? (EWF)
0.5 | True | True
1.0 | False | True
1.5 | False | False
2.0 | False | False
2.5 | False | False
3.0 | False | False
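The shape of a steady-state check such as S>=0.99999 can be illustrated with the simplest possible model. The sketch below assumes a two-state (operational/failed) CTMC in which scrubbing acts as an exponential repair with mean equal to the scrub interval; this simplification, the failure rate, and the function names are all assumptions for illustration, so the outcomes are not expected to reproduce the table above.

```python
def steady_state_availability(failure_rate, repair_rate):
    """Long-run fraction of time in the operational state of a two-state CTMC."""
    return repair_rate / (failure_rate + repair_rate)

def meets_requirement(failure_rate, scrub_interval_s, target=0.99999):
    """Check the availability target, treating 1 / scrub interval as the repair rate."""
    repair_rate = 1.0 / scrub_interval_s
    return steady_state_availability(failure_rate, repair_rate) >= target

failure_rate = 1e-5   # failures per second (assumed, not from the slides)
results = {interval: meets_requirement(failure_rate, interval)
           for interval in (0.5, 3.0)}
```

Even in this crude model the qualitative trend is visible: a shorter scrub interval raises the effective repair rate and therefore the long-run availability.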
Quantitative Analysis: 64-tap FIR Filter
No. of partitions | No. of states | No. of transitions (SBU) | No. of transitions (SBU+DBU)
0 | 3 | 6 | N/A
2 | 9 | 26 | 30
4 | 81 | 361 | 578
8 | 6561 | 478858 | 129506
Property 1: Reliability: P=? [ G[0,T] operational ], T = 1 month
Property 2: Availability: R{"up time"}=? [ C<=T ]/T, T = 1 month
Reliability/Availability: Combined Model
Optimal number: 4 partitions