5. Software Redundancy
Reliable System Design 2010 by: Amir M. Rahmani
matlab1.ir
Software
There are many kinds of software
System software
• Operating system (Windows, Linux, Solaris, etc.)
• Device driver (for printer, graphics card, etc.)
• Compiler (gcc)
• Library (DLLs)
• Distributed system (software shared memory, Napster, etc.)
User-level software
• E.g., simulator, word processor, spreadsheet, game, etc.
Software Faults/Errors
Operating system software (including device drivers)
• Deadlock (may be able to escape with Ctrl-Alt-Delete)
• Crash and reboot
• Incorrect I/O
User-level software
• Deadlock (can escape with Control-C)
• Incorrect algorithm
• Array bounds violation
• Memory leak (C, C++, but not Java)
  – Allocating memory, but not deallocating it
• Reference to a NULL pointer (C, C++, but not Java)
• Incorrect synchronization in multithreaded code
  – Allowing more than one thread in critical section at a time
  – Blocking when holding a lock
• Inability to handle unanticipated inputs
• Exception that triggers OS to kill process
  – Segmentation fault
  – Bus error
Specification vs. Implementation
There are many, many techniques
The problem is in two parts:
• 1- Correct specification, erroneous implementation
• 2- Erroneous specification, correct implementation
Both parts concern us as software engineers
Both parts need attention - why a system fails is not important to the users
Different techniques for the two parts
Static (Pre-Release) Fault Detection in Software
As with hardware, we can try to find faults before shipping the product
• Design reviews
• Formal verification (analysis of design)
• Testing (analysis of implementation)
We can try to add redundancy to mask potential faults
• N-version programming
We can try to proactively "scrub" the software to remove latent errors (due to aging) before failures occur
• Software rejuvenation: stopping the running software occasionally, "cleaning" its internal state (e.g., garbage collection, flushing operating system kernel tables, and reinitializing internal data structures), and restarting it. E.g., periodically reboot the system to flush out remaining latent problems due to aging
Static Fault Detection with Formal Verification
Formal verification is a systematic, mathematical way to prove that a system (software or hardware) is correct or incorrect
Correctness is based on a specification
Examples of mathematical objects often used to model systems are:
• finite state machines, Petri nets, timed automata, process algebra
Two broad approaches for formal verification
• Theorem proving
• Model checking
Formal Verification: Theorem Proving
Theorem proving consists of using a formal version of mathematical reasoning (logical inference) about the system.
Example: theorem proving software such as
• HOL (Higher Order Logic) theorem prover
• ACL2 (A Computational Logic for Applicative Common Lisp)
Develop logical/mathematical equations that describe:
• System to be verified
• Specification of correctness for the system
In the rules of this logic/mathematics, prove that the system is equivalent to its specification
Theorem proving is difficult for very large complex systems, but can work on small sub-systems
Formal Verification: Model Checking
Model checking consists of a systematically exhaustive exploration of the mathematical model (possible for finite models)
• Example: Describe system as finite state machine (FSM)
Develop logical/mathematical equations that describe required properties of the FSM
Example properties:
• Never ends up in state X
• Can reach every desired state in FSM
Software has been developed to perform model checking; logically, this is an exhaustive search
• Example: Murφ model checker (from Stanford)
Similar to theorem proving, model checking is difficult for large complicated systems
• Algorithms tend to be exponential in number of states
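The exhaustive search behind model checking can be sketched in a few lines of Python. This is only an illustration on a toy FSM: the state names, the `transitions` table, and the `reachable_states` helper are hypothetical and are not taken from Murφ or any other real tool.

```python
from collections import deque

# Hypothetical toy FSM: each state maps to the states it can move to.
transitions = {
    "idle": ["request"],
    "request": ["granted", "idle"],
    "granted": ["idle"],
    "error": ["idle"],  # defined, but no transition leads into it
}

def reachable_states(fsm, start):
    """Breadth-first exploration of every state reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in fsm[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

reached = reachable_states(transitions, "idle")

# Property 1: the system never ends up in state "error".
never_in_error = "error" not in reached
# Property 2: every desired state can be reached.
all_desired_reachable = {"idle", "request", "granted"} <= reached
```

Because every reachable state is visited once, the cost grows with the size of the state space, which is why the approach becomes difficult for large systems.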
Verification and Validation
Validation: "Are we trying to make the right thing?", i.e., is the product specified to the user's actual needs?
Verification: "Have we made what we were trying to make?", i.e., does the product conform to the specifications?
The overall checking process is often referred to as V & V
Software Tools for Static Analysis
There are tools that can analyze software to determine if it has bugs
• In most cases the analysis is performed on some version of the source code, and in other cases on some form of the object code.
Can check to see if:
• All code is reachable
• Deadlock is possible
Advantage of static analysis tools
• Checks all possible control flow paths through the application, so it can detect any possible specified problem, even if it would only occur very rarely in practice
Disadvantages
• Must have access to entire code base, e.g., can't deal with dynamically loaded libraries
• Difficult to assess probability of error occurring in practice
Dynamic Fault Detection in Software
Must add code to check software as it is running
• Unless you're willing to wait for it to crash
Added code = redundancy!
Most common form of error detection: assertions
• E.g., assert (Grade >= 0 && Grade <= 20)
Challenges
• Knowing which invariants to check
• Knowing when to check these invariants
• Dealing with black box code (e.g., libraries)
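As a small illustration of the assertion from the slide, the following Python sketch checks the grade invariant at the point where a grade is stored. The `record_grade` function and the sample data are hypothetical.

```python
def record_grade(grades, student, grade):
    """Store a grade, checking the invariant from the slide (0..20)."""
    # Invariant check: fail fast instead of silently storing bad data.
    assert 0 <= grade <= 20, f"grade out of range: {grade}"
    grades[student] = grade
    return grades

grades = record_grade({}, "alice", 17)

try:
    record_grade(grades, "bob", 25)   # violates the invariant
    detected = False
except AssertionError:
    detected = True
```

Note that Python removes `assert` statements when run with the `-O` option, so checks that must stay active in production are usually written as explicit `if` tests.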
Automatic Dynamic Fault Detection with Meta-Compilation
Recent research from Berkeley explores how to have the compiler automatically integrate error checking into code
User can specify general high-level invariants
Compiler automatically integrates invariant checking into the code
Example
• 1- If 99% of lock_acquire() calls have a corresponding lock_release(), the other 1% is probably wrong
• 2- if (ptr == NULL) { printf("%d", ptr->data); } // what's wrong here?
Other Forms of Dynamic Fault Detection
Java has automatic array bounds checking, and it won’t let you write beyond the bounds of the array
Operating system will not let an application process access memory that doesn’t belong to it. This is what is happening when you see “segmentation fault”!
FTP software uses a checksum to make sure that the data that was received is the same as the data that was sent
Other examples?
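The checksum idea can be sketched as follows. This is a minimal illustration that assumes CRC-32 (via Python's standard `zlib` module) as the checksum; it does not claim to reproduce what any particular file transfer program does, and the `send` and `receive` helpers are hypothetical.

```python
import zlib

def send(payload: bytes):
    """Sender attaches a CRC-32 checksum to the payload."""
    return payload, zlib.crc32(payload)

def receive(payload: bytes, checksum: int) -> bool:
    """Receiver recomputes the checksum and compares it with the one sent."""
    return zlib.crc32(payload) == checksum

data, crc = send(b"reliable system design")
intact_ok = receive(data, crc)

# Simulate a transmission error that changes the last byte.
corrupted = b"reliable system desigm"
corrupted_ok = receive(corrupted, crc)
```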
Self-Checking Code
Can we write software that checks that its output is correct?
• Example: if we divide A/B = C, we can check the result by multiplying B*C. If B*C != A, then the division was incorrect.
  – Detects hardware faults (famous Pentium bug)
  – Detects software faults (assuming more complicated operation than just division, which is a single instruction)
Key idea: checking a computation is always at least as easy as performing it (result from computational complexity theory)
Other examples? Finding paper.
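The division example can be sketched in Python as below. Two details are assumptions added for the sketch: the comparison uses a tolerance because floating-point results are rounded, and `faulty_divide` is a hypothetical stand-in for a faulty divider.

```python
import math

def checked_divide(a, b, divide=lambda x, y: x / y):
    """Divide a by b, then verify the result by multiplying back."""
    c = divide(a, b)
    # Floating-point results are rounded, so compare with a tolerance.
    if not math.isclose(b * c, a, rel_tol=1e-9):
        raise ArithmeticError(f"self-check failed: {b} * {c} != {a}")
    return c

def faulty_divide(x, y):
    """Hypothetical faulty divider that is off by 0.1 percent."""
    return x / y * 1.001

good = checked_divide(10.0, 4.0)

try:
    checked_divide(10.0, 4.0, divide=faulty_divide)
    fault_detected = False
except ArithmeticError:
    fault_detected = True
```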
Hardware for Software Fault-Tolerance
Difficult for HW to know that SW is in error, because HW doesn’t know what SW is trying to do
Example
• It's unlikely that a program really wants to divide by zero
• Any others?
• Watchdog timer
Current work at Duke is exploring hardware support for detecting starvation
Software for Hardware Fault-Tolerance
Many examples of using software to tolerate HW faults
In fact, all schemes for tolerating software errors will detect hardware errors that manifest themselves in the same way (i.e., they have the same error model)
• E.g., self-checking software will detect a hardware fault if it leads to an incorrect result
• Example: if we divide A/B = C, we can check the result by multiplying B*C. If B*C != A, then the division was incorrect.
What is Software Fault Tolerance?
The term "software fault-tolerance" can mean two things:
1. "the tolerance of software faults", or
2. "the tolerance of faults by the use of software"
Definition 1 is more commonly used.
The term "software redundancy" corresponds to definition 2.
Remember: All software faults are design faults (Specification and Implementation mistakes)!
Cause-and-Effect Relationship
Software Redundancy
Software redundancy techniques can be divided into two major classes:
• With diversity
  – Design or data diversity
  – Aim is to tolerate design faults
• Without diversity
  – Implements error detection, recovery, etc.
  – Aim is to handle errors of any origin (physical faults, design faults, operator faults)
Design Diversity
Design diversity is used to tolerate design faults in hardware and software
Two techniques for tolerating software design faults:
• N-version programming
• Recovery blocks
N-version programming
Heterogeneous redundancy
• TMR is homogeneous redundancy
• Question: Why would TMR not work here?
Uses majority voting on results produced by N program versions
Program versions are developed by different teams of programmers
Assumes that programs fail independently
Resembles masking hardware redundancy
Uses Forward Error Recovery
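Majority voting over N versions can be sketched as follows. The three versions here are hypothetical toy functions (one with a deliberately planted fault), and the versions run one after another rather than in parallel to keep the sketch short.

```python
from collections import Counter

# Three hypothetical, independently written versions of the same function.
def version_a(x):
    return x * x

def version_b(x):
    return x ** 2

def version_c(x):
    if x == 3:
        return x * x + 1   # planted design fault, triggered only by x == 3
    return x * x

def n_version_execute(versions, x):
    """Run every version and return the result a strict majority agrees on."""
    results = [version(x) for version in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value

masked = n_version_execute([version_a, version_b, version_c], 3)
```

The fault in `version_c` is masked because the other two versions outvote it; if two versions shared the same fault, the vote would return the wrong value, which is why independence of failures matters.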
N-version programming
Achieving Version Independence-Diversity
Different design teams for each version
Diverse specifications
Versions with differing capabilities
Teams working on different modules are forbidden to directly communicate
Diverse programming languages, development tools, compilers, hardware, operating systems, etc.
Questions regarding ambiguities in specifications or any other issue have to be addressed to some central authority who makes any necessary corrections and updates all teams
…
Causes of Version Correlation
Common specifications: errors in specifications will propagate to software
Inherent difficulty of problem: algorithms may be more difficult to implement for some inputs, causing faults triggered by same inputs
Common algorithms: algorithm itself may contain instabilities in certain regions of input space - different versions have instabilities in same region
Cultural factors: programmers make similar mistakes in interpreting ambiguous specifications
Common software and hardware platforms: if same hardware, operating system, and compiler are used - their faults can trigger a correlated failure
N-version programming depends on
Initial specification: The majority of software faults stem from inadequate specification? A specification error will manifest itself in all N versions of the implementation
Independence of effort: Experiments produce conflicting results. Where part of a specification is complex, this leads to a lack of understanding of the requirements. If these requirements also refer to rarely occurring input data, common design errors may not be caught during system testing
Adequate budget: The predominant cost is software. A 3-version system will triple the budget requirement and cause problems of maintenance. Would a more reliable system be produced if the resources potentially available for constructing N versions were instead used to produce a single version?
Evaluation of N-version programming
Few experimental studies of effectiveness of N-version programming
Published results only for work in universities.
Program: Anti-missile application
• 27 versions produced by students at University of Virginia and University of California at Irvine. Published in 1985.
• Some had no prior industrial experience, while others had over ten years
• All students were given the same specification
• All versions were written in Pascal
• 200 test cases to validate each program
• 1 million test cases to test independence (simulation of production)
• 93 correlated faults were identified by standard statistical hypothesis-testing methods
• No correlation observed between quality of programs produced and experience of programmer
Recovery Block
N-versions; one running - if it fails, execution is switched to a backup
Uses one primary software module and one or several secondary (back-up) software modules
Assumes that program failures can be detected by acceptance tests
Executes only the primary module under error-free conditions
Resembles dynamic hardware redundancy
Recovery Block
Recovery Block Mechanism
[Flowchart: Establish Recovery Point; Any Alternatives Left? (Yes / No); Execute Next Alternative; Evaluate Acceptance Test (Pass / Fail); Restore Recovery Point; Discard Recovery Point; Fail Recovery Block]
Recovery Block Format
Acceptance test is provided to check if answers are reasonable
Format:
ensure acceptance test
by primary module
else by first alternative
else by second alternative
….
else error
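The ensure / by / else by structure can be sketched in Python as below. This is an illustrative sketch, not a standard library facility: `recovery_block`, the dictionary-based state, and the two sorting modules (one with a planted fault) are all hypothetical.

```python
import copy

def recovery_block(state, acceptance_test, alternatives):
    """Run alternatives in order until one passes the acceptance test."""
    checkpoint = copy.deepcopy(state)          # establish recovery point
    for module in alternatives:                # primary first, then backups
        try:
            result = module(state)
            if acceptance_test(result):
                return result                  # recovery point is discarded
        except Exception:
            pass                               # a crash counts as a failed test
        state.clear()                          # restore recovery point
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("recovery block failed: no alternatives left")

def primary(state):
    state["data"].pop()                        # planted fault: loses an element
    return sorted(state["data"])

def secondary(state):
    return sorted(state["data"])

def is_acceptable(result):
    ascending = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == 3 and ascending

state = {"data": [3, 1, 2]}
result = recovery_block(state, is_acceptable, [primary, secondary])
```

The primary module damages the state and fails the acceptance test, so the state is rolled back before the secondary module runs. This rollback is what makes the scheme a backward error recovery technique.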
Example: Solution to Differential Equation
Explicit Kutta Method: fast but inaccurate when equations are stiff
Implicit Kutta Method: more expensive but can deal with stiff equations
ensure Rounding_err_has_acceptable_tolerance
by Explicit Kutta Method
else by Implicit Kutta Method
else error
• The above will cope with all equations
• It will also potentially tolerate design errors in the Explicit Kutta Method if the acceptance test is flexible enough
Construction of Acceptance Tests
An acceptance test is a software-implemented check designed to detect errors in the results produced by a primary or a secondary module
The design of the acceptance test is crucial to the efficacy of the Recovery Block scheme
Acceptance tests often rely on application-specific information
All the previously discussed error detection techniques can be used to form the acceptance tests
There is a trade-off between providing comprehensive acceptance tests and keeping overhead to a minimum, so that fault-free execution is not affected
Note that the term used is acceptance not correctness; this allows a component to provide a degraded service
However, care must be taken as a faulty acceptance test may lead to residual errors going undetected
Success of Recovery Block approach depends on failure independence of different versions (modules) and quality of acceptance test
Examples of how acceptance tests can be constructed
Satisfaction of requirements (Structural checks)
• Inversion of mathematical functions; e.g. squaring the result of a square-root operation to see if it equals the original operand.
• Checking sort functions; result should have elements in descending order
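Both checks can be sketched as small Python predicates. The function names are hypothetical, the square-root check uses a tolerance because of floating-point rounding, and the sort check additionally verifies that no elements were lost or invented, which goes slightly beyond the ordering check named on the slide.

```python
import math
from collections import Counter

def sqrt_is_acceptable(operand, result, rel_tol=1e-9):
    """Inversion check: squaring the result should give back the operand."""
    return math.isclose(result * result, operand, rel_tol=rel_tol)

def sort_is_acceptable(original, result):
    """Result must be in descending order and hold the same elements."""
    descending = all(a >= b for a, b in zip(result, result[1:]))
    same_elements = Counter(original) == Counter(result)
    return descending and same_elements

sqrt_ok = sqrt_is_acceptable(2.0, math.sqrt(2.0))
sqrt_bad = sqrt_is_acceptable(2.0, 1.5)
sort_ok = sort_is_acceptable([3, 1, 2], [3, 2, 1])
sort_bad = sort_is_acceptable([3, 1, 2], [3, 2])
```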
Examples of how acceptance tests can be constructed
Reasonableness checks
• Checking physical constraints; e.g. speed, pressure, etc.
• Checking sequence of application states
Structural checks
• Structural checks are based on known properties of data structures
  – The number of elements in a list can be counted, or links and pointers can be verified
Evaluation of Recovery Blocks
Naval command and control system (8000 statements in the Coral language)
117 abnormal events
• Correct recovery: 78 %
• Incorrect recovery, program failure: 3 %
• Incorrect recovery, no program failure: 15 %
• Unnecessary recovery: 3 %
Anderson, T., et al., "Software Fault Tolerance: An Evaluation," IEEE Trans. on Software Engineering, vol. SE-11, no. 12, Dec 1985, pp. 1502-1510.
N-Version vs. Recovery Block
N-version programming
• Applied at the program level
• Runs N programs at the same time
• Resembles static hardware redundancy
• Vote comparison (error masking)
• Assumes that independence among program versions is achieved by random differences in programming style among programmers
Recovery block
• Applied at the module (subprogram) level
• Runs only the primary module under error-free conditions
• Resembles dynamic hardware redundancy
• Error detection: acceptance test
• Independence is achieved by intentionally designing the primary and secondary modules to be as different as possible (different algorithms)
Data Diversity
This technique is cheaper to implement than the design diversity technique.
Popular techniques which are based on the data diversity concept for fault tolerance in software are:
• Retry blocks
• N-copy programming
Retry Blocks
A retry block is a modification of the recovery block structure that uses data diversity instead of design diversity (the original data and re-expressed data, such as the complement of the data).
Rather than the multiple alternate algorithms used in a recovery block, a retry block uses only one algorithm.
A retry block's acceptance test has the same form and purpose as a recovery block's acceptance test.
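A retry block can be sketched in Python as below. Everything here is hypothetical: `buggy_sum` is a single algorithm with a planted input-dependent fault, and `reexpress` reverses the list, which is a valid re-expression because it does not change the sum. The acceptance test assumes non-negative inputs.

```python
def buggy_sum(values):
    """Single algorithm with a planted input-dependent fault."""
    if values and values[0] == 0:
        return -1   # wrong answer triggered only by this input pattern
    return sum(values)

def reexpress(values):
    """Re-expression: reversing the list does not change its sum."""
    return list(reversed(values))

def retry_block(algorithm, acceptance_test, data, reexpress, max_tries=3):
    """Rerun the same algorithm on re-expressed data until a result passes."""
    for _ in range(max_tries):
        result = algorithm(data)
        if acceptance_test(result):
            return result
        data = reexpress(data)   # same algorithm, logically equivalent input
    raise RuntimeError("retry block failed")

# Acceptance test: a sum of non-negative numbers cannot be negative.
result = retry_block(buggy_sum, lambda r: r >= 0, [0, 4, 5], reexpress)
```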
N-Copy Programming
N-copy programming is similar to N-version programming but uses data diversity instead of design diversity.
N copies of a program execute in parallel, each on a set of data produced by re-expression.
The system selects the output to be used by an enhanced voting scheme.
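N-copy programming can be sketched in Python as follows. The sketch makes two simplifications that are assumptions, not part of the technique as described: the copies run one after another rather than in parallel, and a plain majority vote stands in for the enhanced voting scheme. The `program` function and the re-expressions are hypothetical.

```python
from collections import Counter

def program(values):
    """The single program; it has a planted input-dependent fault."""
    if values and values[0] == 0:
        return -1
    return sum(values)

# Each copy receives a differently re-expressed, logically equivalent input.
reexpressions = [
    lambda v: list(v),                   # original data
    lambda v: list(reversed(v)),         # reversed order
    lambda v: sorted(v, reverse=True),   # sorted in descending order
]

def n_copy_execute(program, reexpressions, data):
    """Run one copy per re-expression and vote on the results."""
    results = [program(reexpress(data)) for reexpress in reexpressions]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among copies")
    return value

output = n_copy_execute(program, reexpressions, [0, 4, 5])
```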
Airbus A330
• National origin: Multi-national
• Manufacturer: Airbus
• First flight: 2 November 1992
• Status: In production, in service
• Primary users: Cathay Pacific, Delta Air Lines, Qatar Airways, Emirates
• Produced: 1993–present
• Number built: 1,016 as of 10 October 2013
• Unit cost: A330-300, €215 million (2011)
http://en.wikipedia.org/wiki/Airbus_A330 (3 Nov. 2013)
A340
• National origin: Multi-national
• Manufacturer: Airbus
• First flight: 25 October 1991
• Status: Out of production, in service
• Primary users: Lufthansa, Iberia, South African Airways, Virgin Atlantic Airways
• Produced: 1993-2011
• Number built: 375
• Unit cost: A340-600, US$275.4 million
Design Diversity in Airbus A330/A340
Two types of computers
• 3 primary computers
• 2 secondary computers
Each computer is internally duplicated and consists of two channels
• Command channel
• Monitor channel
Architecture for A330/A340
[Figure: flight control primary computers, flight control secondary computers, flight control data concentrators]
Design Diversity in Airbus A330/A340
Implementation of primary computers
• Supplier: Aerospatiale (HW & SW)
• Hardware: Two Intel 80386 (one for each channel)
• Software: assembler for command channel, PL/M for monitor channel
Implementation of secondary computers
• Supplier: Sextant Avionique (HW), Aerospatiale (SW)
• Hardware: Two Intel 80186 (one for each channel)
• Software: assembler for command channel, Pascal for monitor channel
Exception Handling
An exception indicates that something happened during execution that needs attention
Control is transferred to an exception handler, a routine which takes appropriate action
Example: When executing y = a*b, if overflow => result incorrect => signal an exception
Effective exception handling can make a significant improvement to system fault tolerance
Over half of code lines in many programs are devoted to exception handling
Exception handling is a Forward Error Recovery mechanism, as there is no roll back to a previous state; instead control is passed to the handler so that recovery procedures can be initiated
However, the exception handling facility can be used to provide Backward Error Recovery
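The y = a*b overflow example can be sketched in Python as below. Python integers do not overflow, so the sketch assumes a 32-bit signed range and signals the exception explicitly; `Int32Overflow`, `multiply_int32`, and the saturating handler are hypothetical names chosen for the illustration.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

class Int32Overflow(Exception):
    """Signalled when a product does not fit in a 32-bit signed integer."""

def multiply_int32(a, b):
    y = a * b
    if not INT32_MIN <= y <= INT32_MAX:
        raise Int32Overflow(f"{a} * {b} does not fit in 32 bits")
    return y

def saturating_multiply(a, b):
    try:
        return multiply_int32(a, b)
    except Int32Overflow:
        # Forward error recovery: no rollback to an earlier state; the
        # handler continues with a substitute (saturated) value.
        same_sign = (a < 0) == (b < 0)
        return INT32_MAX if same_sign else INT32_MIN

normal = saturating_multiply(1000, 1000)
clamped = saturating_multiply(2**20, 2**20)
```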
Example: Domain and Range Failure
Exceptions can be used to deal with
• domain or range failure
• out-of-ordinary event (not failure) needing special attention
• timing failure
A domain failure happens when illegal input is used
Example: if X, Y are real numbers and X = √Y is attempted with Y = -1, a domain failure occurs
A range failure occurs when the program produces an output or carries out an operation that is seen to be incorrect in some way
Examples include:
• Encountering an end-of-file while reading data from file
• Producing a result that violates an acceptance test
• Trying to print a line that is too long
• Generating an arithmetic overflow or underflow
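Both kinds of failure can be sketched in Python. For the domain failure, `math.sqrt` raises `ValueError` for a negative real argument. For the range failure, `RangeFailure` and `checked_percentage` are hypothetical names, with a simple bounds check standing in for an acceptance test.

```python
import math

def safe_sqrt(y):
    """Return the square root of y, or None on a domain failure."""
    try:
        return math.sqrt(y)
    except ValueError:
        # Domain failure: the input lies outside the function's domain.
        return None

class RangeFailure(Exception):
    """Hypothetical exception for an output that fails its check."""

def checked_percentage(part, whole):
    result = 100.0 * part / whole
    if not 0.0 <= result <= 100.0:
        # Range failure: the output violates the acceptance test.
        raise RangeFailure(f"percentage out of range: {result}")
    return result

domain_result = safe_sqrt(-1)

try:
    checked_percentage(5, 4)
    range_failure_seen = False
except RangeFailure:
    range_failure_seen = True
```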
Timing Failure
Timing Checks: Timing checks are an effective form of software check for detecting errors even in cases of running programs in a dual redundant execution mode, if the specification of a component includes timing constraints.
Watch-dog timer
• Used to guard against program hang-ups
• Also used in communications between CPU and main store
• Also used in periodic "hello" exchanges (network surveillance) and in I/O operations
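A software watchdog can be sketched with Python's standard `threading.Timer`. The `Watchdog` class, its `kick` method, and the timeout values are hypothetical choices for this illustration; a real watchdog is often a hardware timer that resets the processor when it expires.

```python
import threading
import time

class Watchdog:
    """Calls on_expire unless kick() is called again within the timeout."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire
        self._timer = None

    def kick(self):
        """Restart the countdown; a healthy program calls this regularly."""
        self.stop()
        self._timer = threading.Timer(self.timeout, self.on_expire)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()

expired = threading.Event()
dog = Watchdog(timeout=0.5, on_expire=expired.set)

dog.kick()
for _ in range(3):
    time.sleep(0.01)        # healthy work finishes well within the timeout
    dog.kick()
healthy_ok = not expired.is_set()

# Simulated hang: the program stops kicking, so the watchdog expires.
hang_detected = expired.wait(timeout=5.0)
dog.stop()
```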