
Early Dependability Assessment of FPGA-Based Space Applications Using Formal Verification

Khaza Anuarul Hoque

Mentor: Dr. Taylor T. Johnson

Dept. of Computer Science & Engineering,

University of Texas at Arlington

Thanks to Otmane Ait Mohamed (Concordia Univ., Canada) and Yvon Savaria (Polytechnique Montreal, Canada) for supervising the project.

Outline

Motivation

FPGA and SEUs

Proposed methodology

Design options analysis

DAL analysis and scrub optimization

TMR partitioning optimization

Lessons learned

Summary

Future directions

Motivation: Cosmic Radiation

Motivation (cont.)

Background: FPGAs and SEUs

SRAM-based FPGAs

(+) Cheaper than Rad-hard FPGAs

(+) Better performance than other types of FPGAs

(+) On-field programmability

Susceptible to cosmic radiation induced SEUs

(-) Mitigation required

(-) Redundancy (such as TMR)

(-) Scrubbing/Reconfiguration

SEU Estimation

Three main ways to analyze SEUs:

1. Hardware testing (beam testing/laser testing)

(+) Most realistic and accurate

(-) Requires finished implementation

(-) May damage the device

(-) Costly

2. Fault injection (emulation/simulation)

(+) Less accurate than hardware testing but still very useful

(-) Test time grows with possible test cases

SEU Estimation (cont.)

3. Analytical techniques

(+) Better controllability, quick estimation of SEUs

(+) No risk of damaging the device

(+) Estimation at early design stage

(-) Can be relatively less accurate in some cases

An early estimation of SEUs will help:

- To build a more reliable design

- To adopt the required mitigation strategy

- To reduce the design time

- To reduce the design cost

Objectives

To propose a methodology for early SEU analysis of SRAM-based FPGA designs for aerospace applications:

Based on a formal verification technique (probabilistic model checking)

What is the trade-off between dependability, performability, and area?

Which design option should a designer choose?

Scrub optimization and DAL analysis

Is the adopted mitigation enough?

Can the scrub frequency be optimized?

Early assessment of TMR partitioning to increase reliability

Can we find the number of partitions early?

Formal Verification

The application of rigorous, mathematics-based techniques to establish the correctness of computerised systems

Main techniques: model checking, equivalence checking, theorem proving

Many properties other than correctness are important

Quantitative requirements: “how reliable is my car’s Bluetooth network?”, “how efficient is my phone’s power management policy?”, “how secure is my bank’s web service?”

Probabilistic model checking is a formal verification technique for modelling and analysing systems that exhibit probabilistic behaviour

Probabilistic Model Checking

Image credit: David Parker, Probabilistic Model Checking, Michaelmas 2011

Probabilistic Model Checking (cont.)

Models: variants of Markov chains

- Discrete-time Markov chains (DTMCs)

- Continuous-time Markov chains (CTMCs)

- Markov decision processes (MDPs)

Property specifications:

- PCTL, CSL, PCTL*, LTL

Transient and steady-state analysis: P=? [ F[t,t] oper ]

- Instantaneous availability of the system, e.g. the probability that the system is in a specific state at time instant t

Timing and ordering of events: P=? [ !fail_B U[3600,7200] fail_B ]

- Probability that component B fails for the first time during the second hour of operation

Reward-based properties: R{"oper"}=? [ C<=t ]

- Expected cumulative operational time of the system in the time interval [0, t]
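As a rough illustration of what the first two kinds of property compute, the sketch below evaluates them in closed form for the simplest possible CTMC: a single component with a constant failure rate `lam` and repair rate `mu`. The rates and function names are illustrative assumptions, not values from the presentation. A probabilistic model checker computes the same kind of quantity numerically for much larger models.

```python
import math


def instantaneous_availability(lam, mu, t):
    """Probability of being operational at time t (two-state CTMC, starts operational)."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def first_failure_in_window(lam, t1, t2):
    """Probability that the first failure happens within [t1, t2]."""
    # The time to first failure is exponential with rate lam.
    return math.exp(-lam * t1) - math.exp(-lam * t2)


# Hypothetical rates in events per second, chosen only for illustration.
lam = 1e-4
mu = 1e-2

print("availability at t = 3600 s:", instantaneous_availability(lam, mu, 3600.0))
print("first failure in second hour:", first_failure_in_window(lam, 3600.0, 7200.0))
```

The second function mirrors the "second hour of operation" property: repairs do not matter there, because the path is only followed up to the first failure.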

Proposed Methodology


Design Options Analysis

Analyze design options with respect to:

- Reliability

- Availability

- Safety

- Performability-area trade-off

[Figure: animated example. Labels shown: Reliability, Availability, Safety; Throughput: 0.33; "Component failed!"; "Solution: rescheduling"; Throughput: 0.25 after rescheduling.]

Design Option Analysis Methodology

[Figure: design option analysis flow. Blocks: Dataflow; Configuration; Characterization library; Mitigation(s) coverage; Properties; Rewards; PRISM model checker; Results.]

Markov Modeling Example

Table: Characterization library


Scrub Optimization and DAL Analysis

DO-254:

- Baseline of required design flow steps for airborne components

- Five levels of compliance, commonly known as Design Assurance Levels (DALs)

- A failure condition in the flight control system (which may lead to a catastrophic event) ≠ a failure condition in the entertainment system (even though it could spoil your day!)

- Engineers designing to level A or B face a much more rigorous test, verification, and documentation process than for levels C, D, or E.

Can a lower scrub frequency (to reduce power consumption) still meet the DAL requirement?

DAL Analysis Methodology

[Figure: DAL analysis flow. Blocks: High-level description; Extraction; Resource estimation; Number of total essential bits; Characterization library; Mitigation strategy; Failure & scrub parameters; PRISM model (Erlang distribution); Properties; PRISM model checker (MC); decision "Properties met?"; Finish.]


Table: Design assurance levels (columns: Level, Classification, Failure condition description, Pb/h)

Level A, Catastrophic. Pb/h: < 10^-9 (extremely improbable). Failure conditions that would prevent continued safe flight and landing.

Level B, Hazardous / Severe-Major. Pb/h: extremely remote. Large reduction in safety margins or functional capabilities, physical distress or higher workload such that the flight crew could not be relied on to perform their tasks accurately or completely, or adverse effects on occupants including serious or potentially fatal injuries to a small number of those occupants.

Level C, Major. Pb/h: remote. Significant reduction in safety margins or functional capabilities, a significant increase in flight crew workload or in conditions impairing flight crew efficiency, or discomfort to occupants, possibly including injuries.

Level D, Minor. Pb/h: probable. Slight reduction in safety margins or functional capabilities, a slight increase in flight crew workload, such as routine flight plan changes, or some inconvenience to occupants.

Level E, No Effect. Pb/h: -. Failure conditions that do not affect the operational capability of the aircraft or increase flight crew workload.


Modeling and Parameters

Set of states

- The set of states can be classified into “fully operational”, “faulty with one or more faults”, and “failed” states.

Modeling parameters

- Number of critical bits: characterization library

- Environmental: design failure rate, $\lambda_{design} = \lambda_{bit} \times \#\text{critical bits}$

- Target system: SelectMap bus width $B$ and configuration clock frequency $f_{cclk}$

- Mitigation: correction rate, $\mu_{design} = \dfrac{B \times f_{cclk}}{\#\text{configuration bits}}$
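The two rate formulas above are simple enough to evaluate by hand, but a small sketch makes the units explicit. All numeric values below (per-bit upset rate, bit counts, bus width, clock frequency) are hypothetical placeholders, not figures from the characterization library or from any specific device.

```python
def design_failure_rate(lambda_bit, critical_bits):
    """lambda_design = lambda_bit * (number of critical bits)."""
    return lambda_bit * critical_bits


def correction_rate(bus_width_bits, f_cclk_hz, configuration_bits):
    """mu_design = B * f_cclk / (number of configuration bits)."""
    return bus_width_bits * f_cclk_hz / configuration_bits


# Hypothetical inputs, for illustration only.
lambda_bit = 1e-12             # upsets per bit per second (assumed)
critical_bits = 200_000        # would come from the characterization library (assumed)
bus_width_bits = 8             # SelectMap bus width B (assumed)
f_cclk_hz = 50e6               # configuration clock frequency (assumed)
configuration_bits = 20_000_000

lam = design_failure_rate(lambda_bit, critical_bits)
mu = correction_rate(bus_width_bits, f_cclk_hz, configuration_bits)

print("design failure rate (1/s):", lam)
print("correction rate (1/s):", mu)
print("time for one full scrub pass (s):", 1.0 / mu)
```

With these assumed inputs, one full pass over the configuration memory takes 1/µ seconds, which is the quantity the scrub interval is compared against.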

Deterministic Delay Modeling

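Erlang process: the single transition $S_0 \rightarrow S_1$ is replaced by a chain of $k$ exponential stages, $S_0 \rightarrow S_1 \rightarrow S_2 \rightarrow \dots \rightarrow S_{k-1} \rightarrow S_k$, each with rate $k/\tau$, so that the total delay has mean $\tau$.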
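The reason an Erlang chain can stand in for a deterministic delay is that, with k stages of rate k/τ, the mean stays at τ while the variance shrinks as τ²/k. The sketch below checks this numerically; the function names and the chosen values of k and τ are assumptions made for illustration.

```python
import math


def erlang_cdf(t, k, tau):
    """P(total delay <= t) for k exponential stages, each with rate k / tau."""
    x = (k / tau) * t
    term = 1.0      # x**n / n! for n = 0
    partial = 1.0   # running sum of x**n / n! for n = 0 .. k-1
    for n in range(1, k):
        term *= x / n
        partial += term
    return 1.0 - math.exp(-x) * partial


def erlang_mean_and_variance(k, tau):
    """Mean and variance of the total delay through the k-stage chain."""
    rate = k / tau
    return k / rate, k / rate ** 2


tau = 1.0  # target deterministic delay (assumed unit: seconds)
for k in (1, 10, 100):
    mean, var = erlang_mean_and_variance(k, tau)
    early = erlang_cdf(0.5 * tau, k, tau)        # finished well before tau
    late = 1.0 - erlang_cdf(1.5 * tau, k, tau)   # still running well after tau
    print(k, mean, var, early, late)
```

As k grows, both tail probabilities fall towards zero, so the delay concentrates around τ. The price is k extra states per delayed transition in the model.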

Modeling Periodic Blind Scrub

Modeling TMR & Blind Scrub


Triple Modular Redundancy (TMR)

TMR Partitioning: Example

TMR Partitioning Methodology

[Figure: TMR partitioning flow. Blocks: High-level description; partitions; No. of components in each module; Failure rate of each module; Scrub; Characterization library; Extracted PRISM model; Reliability/Availability properties; Requirement specification; PRISM model checker (MC); Quantitative results; decision "Met?" (Yes / No); Finish.]

Markov Modeling of Single Bit Upset

A system with N partitions can be defined by a set of partitions, where each partition is represented by a CTMC.

The final model of the system is defined by the parallel composition (||) of the CTMCs of all partitions.
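For partitions that evolve independently, the parallel composition of CTMCs can be sketched as a Kronecker sum of their generator matrices, which makes the exponential growth of the state space visible. The 3-state partition model, its rates, and the function names below are assumptions made for illustration. They are not the extracted PRISM model, and a real composition may also synchronise on shared actions, which this sketch ignores.

```python
def kronecker_sum(q1, q2):
    """Generator of two independent CTMCs running side by side (pure interleaving)."""
    n1, n2 = len(q1), len(q2)
    size = n1 * n2
    q = [[0.0] * size for _ in range(size)]
    for i1 in range(n1):
        for i2 in range(n2):
            row = i1 * n2 + i2
            for j1 in range(n1):          # first chain moves, second stays
                q[row][j1 * n2 + i2] += q1[i1][j1]
            for j2 in range(n2):          # second chain moves, first stays
                q[row][i1 * n2 + j2] += q2[i2][j2]
    return q


def compose(generators):
    """Fold a list of generators into one product generator."""
    result = generators[0]
    for g in generators[1:]:
        result = kronecker_sum(result, g)
    return result


def partition_generator(lam, mu):
    """Hypothetical TMR partition: 0 = all good, 1 = one replica faulty, 2 = failed."""
    return [
        [-3 * lam, 3 * lam, 0.0],
        [mu, -(mu + 2 * lam), 2 * lam],
        [mu, 0.0, -mu],
    ]


g = partition_generator(lam=1e-6, mu=1.0)   # assumed rates
for n in (1, 2, 4):
    product = compose([g] * n)
    print(n, "partitions ->", len(product), "states")
```

Under this assumption the state count is 3^N, so the model grows quickly with the number of partitions even though each partition is tiny.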

SBU Model for FIR: 2 Partitions

Combined Model for FIR: 2 Partitions

Key Lessons

The extra reliability provided by redundancy does not always justify the additional area overhead.

Rescheduling with scrubbing is good enough to serve as a fault recovery and repair mechanism where reliability, area, and performance must be optimized together.

It is possible to find an appropriate scrub interval (the slowest acceptable scrub rate) that saves power while still meeting the dependability requirements, instead of choosing the highest scrub frequency.
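The scrub-interval lesson amounts to a search: among the candidate intervals, pick the longest one that still meets the requirement. The sketch below does this for a deliberately simplified two-state availability model in which the repair rate is taken as the reciprocal of the scrub interval. That model, the failure rate, and the function names are assumptions for illustration and are much cruder than the Erlang-based PRISM models described earlier.

```python
def steady_state_availability(lam, scrub_interval):
    """Long-run availability of a two-state model with repair rate 1 / scrub_interval."""
    mu = 1.0 / scrub_interval
    return mu / (lam + mu)


def slowest_interval_meeting(lam, intervals, target):
    """Longest candidate scrub interval whose availability is at least target."""
    passing = [tau for tau in intervals
               if steady_state_availability(lam, tau) >= target]
    return max(passing) if passing else None


# Hypothetical failure rate (1/s) and candidate scrub intervals (s).
lam = 6e-6
candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

for tau in candidates:
    print(tau, steady_state_availability(lam, tau))

print("slowest acceptable interval:", slowest_interval_meeting(lam, candidates, 0.99999))
```

In practice each candidate would be checked by the model checker against the real property rather than by a closed-form expression, but the selection logic is the same.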

Key Lessons (cont.)

There exists an optimal number of TMR partitions

The larger the number of partitions (that is, the smaller the modules), the less frequently scrubbing is required to meet a target reliability.

For availability, the number of partitions matters mainly when the scrub interval is long.

With frequent scrubbing, increasing the number of partitions improves availability only marginally.

With longer scrub intervals, the availability improvement from increasing the number of partitions is quite significant.

Lessons Learned

Brief experience from some of our previous research:

K. A. Hoque, O. A. Mohamed and Y. Savaria, “Applying Formal Verification to Early Assessment of FPGA-based Aerospace Applications: Methodology and Experience”, 10th IEEE Systems Conference (SysCon 2016), Orlando, USA, 2016.

K. A. Hoque, O. A. Mohamed and Y. Savaria, “Towards an accurate reliability, availability and maintainability analysis approach for satellite systems based on probabilistic model checking”, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, 2015.

K. A. Hoque, O. A. Mohamed, Y. Savaria and C. Thibeault, “Probabilistic Model Checking Based DAL Analysis to Optimize a Combined TMR-Blind-Scrubbing Mitigation Technique for FPGA-Based Aerospace Applications”, 12th ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE), IEEE, 2014, Lausanne, Switzerland.

K. A. Hoque, O. A. Mohamed, Y. Savaria and C. Thibeault, “Early Analysis of Soft Error Effects for Aerospace Applications using Probabilistic Model Checking”, 2nd International Workshop on Formal Techniques for Safety-Critical Systems (FTSCS13), CCIS, Springer-Verlag, 2013, Queenstown, New Zealand.

New modeling results related to TMR partitioning are not published yet.

Summary

The use of FPGAs in aerospace is common, so early dependability analysis helps save:

Effort

We proposed a methodology based on formal verification for:

- Design options analysis

- Scrub interval optimization with DAL analysis

- Optimal partitioning of TMR

Probabilistic model checking can be used for SEU analysis as a complementary approach alongside other techniques.

Future Directions

Inclusion of other fault models, such as design failures due to aging, electromigration, hot electron effects, Negative-Bias Temperature Instability (NBTI), and Single-Event Functional Interrupts (SEFIs).

Inclusion of an adaptive mitigation model based on radiation sensitivity.

Extension of the models to handle three or more bit upsets.

Extension of the models to support read-back scrubbing and other customizable scrubbing schemes.

Acknowledgement

This research work is a part of the AVIO-403 project financially supported by the Consortium for Research and Innovation in Aerospace in Quebec (CRIAQ), Fonds de Recherche du Québec - Nature et Technologies (FRQNT) and the Natural Sciences and Engineering Research Council of Canada (NSERC). The presenter would also like to thank Bombardier Aerospace, MDA Space Missions and the Canadian Space Agency (CSA) for their technical guidance and financial support.

The presenter would also like to thank Dr. Taylor T. Johnson, from VeriVITAL group, University of Texas at Arlington for his financial support to attend the S5 Symposium.

Thank You

Questions?

EXTRA SLIDES

Quantitative Analysis:

16-tap FIR Filter

Table: Design options to evaluate

No   Configuration   Spare          Scrubbing   Rescheduling
C1   2A 2M           None           ✓           ✓
C2   2A 3M           1 Mul          ✓           ✓
C3   3A 2M           1 Add          ✓           ✓
C4   3A 3M           1 Add, 1 Mul   ✓           ✓

Expected Throughput: Overall Reward

[Table columns: Interval (days); Configurations; Overall reward]

R{"Expected throughput"}=? [ S ]
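A steady-state reward of this kind is the long-run average of a per-state reward, weighted by the steady-state probability of each state. The sketch below computes it for a hypothetical three-state chain (full throughput, degraded after rescheduling, failed). The throughput values 0.33 and 0.25 are the ones shown in the design options example, while the chain structure, the rates, and the function names are assumptions for illustration.

```python
def steady_state(fail_rate, second_fail_rate, scrub_rate):
    """Steady-state distribution of a hypothetical full -> degraded -> failed chain.

    Scrubbing returns the design to 'full' from both 'degraded' and 'failed'.
    """
    p_full = scrub_rate / (fail_rate + scrub_rate)
    p_degraded = fail_rate * p_full / (second_fail_rate + scrub_rate)
    p_failed = 1.0 - p_full - p_degraded
    return {"full": p_full, "degraded": p_degraded, "failed": p_failed}


def expected_throughput(distribution, rewards):
    """Long-run expected reward: sum over states of probability times reward."""
    return sum(distribution[state] * rewards[state] for state in distribution)


# Throughput per state; rates below are hypothetical (events per day).
rewards = {"full": 0.33, "degraded": 0.25, "failed": 0.0}
dist = steady_state(fail_rate=0.02, second_fail_rate=0.03, scrub_rate=1.0)

print(dist)
print("expected throughput:", expected_throughput(dist, rewards))
```

Comparing this number across design options and scrub intervals is what the overall-reward table summarises.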

DAL Verification

“For a given scrub interval, the probability that the FIR filter will fail in the last 20 minutes of the flight is less than 0.01”

P<0.01 [ F[344400,345600] "failure" ]

Figure: Fault tree of the system

Table: Verification of DAL requirement

Interval (s)   DAL-A met (scrub only; A = 0.0001, B = 0.001)   DAL-A met (scrub & TMR; B = 0.001)
0.5            False                                           True
1.0            False                                           True
1.5            False                                           True
2.0            False                                           True
2.5            False                                           True
3.0            False                                           True

Quantitative Analysis:

Availability Verification

“For a given scrub interval, does the system meet the requirement for five 9s in the long run?”

S>=0.99999 [ "operational" ]
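To give the five-nines bound a concrete scale, the short sketch below converts a long-run availability into the downtime it allows over a period. The function name and the choice of a 365-day year are assumptions for illustration.

```python
def allowed_downtime_seconds(availability, period_seconds):
    """Maximum total downtime over a period for a given long-run availability."""
    return (1.0 - availability) * period_seconds


SECONDS_PER_YEAR = 365 * 24 * 3600

downtime = allowed_downtime_seconds(0.99999, SECONDS_PER_YEAR)
print("allowed downtime per year (s):", downtime)
print("allowed downtime per year (min):", downtime / 60.0)
```

Five nines leaves a little over five minutes of downtime per year, which is why the result is so sensitive to the scrub interval.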

Table: Verification of availability requirements

Scrub interval   Availability requirement   Availability requirement
0.5              True                       True
1.0              False                      True
1.5              False                      False
2.0              False                      False
2.5              False                      False
3.0              False                      False

Quantitative Analysis: 64-tap FIR Filter

Partitions   No. of states   No. of transitions   No. of transitions (SBU+DBU)
0            3               6                    N/A
2            9               26                   30
4            81              361                  578
8            6561            478858               129506

Property 1: Reliability: P=? [ G[0,T] operational ], T = 1 month

Property 2: Availability: R{"up time"}=? [ C<=T ] / T, T = 1 month

Reliability/Availability: Combined Model

Optimal no.: 4 partitions
