Fault Tolerance Mechanisms
ITV Model-based Analysis and Design of Embedded Software
Techniques and methods for Critical Software
Anders P. Ravn, Aalborg University
August 2011
Fault Tolerance
Means to isolate component faults ... and mask them
Prevents system failures
May increase system dependability
Fault Tolerance
FT - levels
• Full tolerance
• Graceful Degradation
• Fail safe
BW p. 107
FT basis: Redundancy
• Time: Try, Retry, Retry, ...
• Space: Try, Try, Try, ...
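The two forms of redundancy can be sketched in a few lines. This is only an illustration, not code from the course: the class RedundancyDemo and the helpers retryInTime and tryInSpace are hypothetical names, and the replicas are tried one after another here, whereas real space redundancy would typically use separate hardware or processes.

```java
import java.util.List;
import java.util.function.Supplier;

class RedundancyDemo {
    /** Time redundancy: repeat the same attempt, at most maxTries times. */
    static <T> T retryInTime(Supplier<T> attempt, int maxTries) {
        RuntimeException last = new IllegalStateException("no attempt made");
        for (int i = 0; i < maxTries; i++) {
            try {
                return attempt.get();          // Try, Retry, Retry, ...
            } catch (RuntimeException e) {
                last = e;                      // remember the failure and try again
            }
        }
        throw last;                            // every attempt failed
    }

    /** Space redundancy: several replicas exist; use the first one that delivers. */
    static <T> T tryInSpace(List<Supplier<T>> replicas) {
        RuntimeException last = new IllegalStateException("no replica given");
        for (Supplier<T> replica : replicas) {
            try {
                return replica.get();
            } catch (RuntimeException e) {
                last = e;                      // this replica failed, ask the next
            }
        }
        throw last;                            // every replica failed
    }
}
```

Retrying in time only helps against transient faults; a permanent fault needs a different replica.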
Fault Tolerance
Basic Strategies
Dynamic Redundancy
1. Error detection
2. Damage confinement and assessment
3. Error recovery
4. Fault treatment and continued service
BW p. 114
Error Detection
f: State × Input → State × Output
• Environment (exception)
• Application
Assertion:
• precondition (input)
• postcondition (input, output)
• invariant (state, state')
Timing:
• WCET(f, input)
• Deadline(f, input)
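A minimal sketch of how these checks could wrap a call of f. The names DetectedError and CheckedCall.run are assumptions made for the example; f is taken to be stateless, so the invariant over (state, state') is left out, and the deadline is only checked after the call has returned.

```java
import java.util.function.BiPredicate;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

/** Hypothetical exception type used to report a detected error. */
class DetectedError extends RuntimeException {
    DetectedError(String what) { super(what); }
}

class CheckedCall {
    /** Runs f on input and reports an error if an assertion or the deadline is violated. */
    static int run(IntUnaryOperator f, int input,
                   IntPredicate precondition,
                   BiPredicate<Integer, Integer> postcondition,
                   long deadlineNanos) {
        if (!precondition.test(input)) throw new DetectedError("precondition");
        long start = System.nanoTime();
        int output;
        try {
            output = f.applyAsInt(input);
        } catch (RuntimeException e) {         // detection by the environment
            throw new DetectedError("exception: " + e);
        }
        long elapsed = System.nanoTime() - start;
        if (!postcondition.test(input, output)) throw new DetectedError("postcondition");
        if (elapsed > deadlineNanos) throw new DetectedError("deadline");
        return output;
    }
}
```

For an integer square root, the precondition would be input >= 0 and the postcondition output² <= input < (output + 1)².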
Damage Confinement
• Static structure
• Dynamic structure (transaction)
Error Recovery
• Forward
Repair the state, if you can!
• Backward
  • define recovery points
  • checkpoint state at recovery point
  • roll back
  • retry
Domino effect
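Backward recovery for a single object can be sketched as follows. The classes Account and BackwardRecoveryDemo are hypothetical and only illustrate the steps: establish a recovery point, checkpoint the state, roll back on error, retry with an alternative.

```java
import java.util.ArrayList;
import java.util.List;

class Account {
    private List<Integer> entries = new ArrayList<>();
    private List<Integer> checkpoint = new ArrayList<>();   // state saved at the recovery point

    void establishRecoveryPoint() { checkpoint = new ArrayList<>(entries); } // a copy, not an alias
    void rollBack() { entries = new ArrayList<>(checkpoint); }

    void add(int amount) { entries.add(amount); }

    int balance() {
        int sum = 0;
        for (int e : entries) sum += e;
        return sum;
    }
}

class BackwardRecoveryDemo {
    /** Runs primary; if it signals an error, rolls the account back and runs alternate. */
    static void withBackwardRecovery(Account a, Runnable primary, Runnable alternate) {
        a.establishRecoveryPoint();
        try {
            primary.run();
        } catch (RuntimeException e) {
            a.rollBack();       // discard the possibly damaged state
            alternate.run();    // retry from the recovery point
        }
    }
}
```

The partial update made by the failing primary is discarded, so the alternative starts from the checkpointed state.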
Recovery blocks
ENSURE acceptance_test
BY { module_1 }
ELSE BY { module_2 }
...
ELSE BY { module_m }
ELSE ERROR
BW p. 120
Implementation of Recovery Blocks
Abstract class RecoveryBlock
public abstract class RecoveryBlock {
    /** Acceptance test for the result; it must be implemented by the application. */
    abstract boolean acceptanceTest();

    /** Method to produce the result; it must be implemented by the application.
     * @param module 0, ..., MaxBlocks-1 */
    abstract void block(int module);

    /** MaxBlocks must be set by the application to the number of blocks. */
    protected int MaxBlocks;
RecoveryBlock execution
    /** Method to execute recovery modules 0, 1, ..., MaxBlocks-1 until one succeeds.
     * @throws NoAccept if no module passes acceptanceTest.
     */
    public final void do_it() throws NoAccept, CloneNotSupportedException {
        save();
        int i = 0;
        do {
            try {
                block(i++);
                if (acceptanceTest()) return;
            } catch (Exception e) { /* if the block fails, we continue: not accepted */ }
            restore(copy);
        } while (i < MaxBlocks);
        throw new NoAccept();
    }
}
RecoveryBlock cache
public abstract class RecoveryBlock {
    /** The recovery cache is implemented by a clone of the original object. */
    RecoveryBlock copy;

    /** Save object to recovery cache; uses Java clone, which must be a deep clone.
     * Object.clone() throws CloneNotSupportedException unless the concrete class implements Cloneable. */
    private final void save() throws CloneNotSupportedException {
        copy = (RecoveryBlock) this.clone();
    }

    /** Method to restore data from the recovery cache; it must be implemented by the application.
     * @param copy the saved object whose state is to be restored */
    abstract void restore(RecoveryBlock copy);
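The requirement of a deep clone matters because Object.clone() copies fields one by one, so an array field in the copy still refers to the same array as the original. A small sketch with two hypothetical classes, ShallowState and DeepState, shows the difference:

```java
class ShallowState implements Cloneable {
    int[] data = {3, 1, 2};

    @Override
    public ShallowState clone() {
        try {
            return (ShallowState) super.clone();   // field-by-field copy: data is shared
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);           // cannot happen, the class is Cloneable
        }
    }
}

class DeepState implements Cloneable {
    int[] data = {3, 1, 2};

    @Override
    public DeepState clone() {
        try {
            DeepState c = (DeepState) super.clone();
            c.data = data.clone();                 // copy the array itself as well
            return c;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}
```

With the shallow variant, a module that sorts the array in place also changes the recovery cache, so a later restore would bring back the damaged data.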
Application
import java.awt.TextArea; // assumed: the AWT TextArea, whose append(String) matches the log calls

/** Extends the basic abstract RecoveryBlock with faulty sorting
 * algorithms and logs calls, returns etc. to a TextArea. */
public class RecoveringSort extends RecoveryBlock {
    /** checksum for acceptance test */
    private int checksum;
    /** data to be saved in recovery cache */
    private int[] argument;
    /** text area receiving the log */
    private TextArea log;

    public RecoveringSort(TextArea t) {
        MaxBlocks = 3;
        log = t;
    }
Acceptance criteria
    /* Acceptance test for sorting; it shall verify:
     * 1) the return value is an ordered list,
     * 2) the return value is a permutation of the initial values */
    boolean acceptanceTest() {
        boolean result = true;
        // check ordering
        int i = argument.length - 1;
        while (i > 0) {
            if (argument[i] < argument[--i]) { result = false; break; }
        }
        // check permutation; this is a partial check through a checksum.
        // A full check is as expensive computationally as sorting,
        // thus, we use a partial check.
        i = argument.length;
        int sum = 0;
        while (i > 0) sum += argument[--i];
        return result && (sum == checksum);
    }
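The checksum is only a partial check, because two different arrays can have the same sum. The following sketch contrasts it with one possible full check, assuming a copy of the original input is still available. The class PermutationCheck and its methods are illustrative names, not part of the course code.

```java
import java.util.Arrays;

class PermutationCheck {
    static int sum(int[] a) {
        int s = 0;
        for (int x : a) s += x;
        return s;
    }

    /** Partial check: the sums are equal. */
    static boolean sameChecksum(int[] original, int[] result) {
        return sum(original) == sum(result);
    }

    /** Full check: both arrays hold the same values with the same multiplicities. */
    static boolean isPermutation(int[] original, int[] result) {
        int[] a = original.clone();
        int[] b = result.clone();
        Arrays.sort(a);                 // relies on a trusted library sort
        Arrays.sort(b);
        return Arrays.equals(a, b);
    }
}
```

For the original {1, 4} and the result {2, 3}, sameChecksum accepts while isPermutation rejects. The full check itself sorts, which illustrates the cost remark in the comment above.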
Application - modules
    /** Starts sorting using the recovery block mechanism.
     * @param data integer array containing elements to be sorted. */
    public int[] sort(int[] data) {
        argument = (int[]) data.clone(); // copy needed for recovery to work
        checksum = 0;
        int i = argument.length;
        while (i > 0) checksum += argument[--i];
        try {
            do_it();
        } catch (NoAccept e) {
            log.append("All blocks failed\n");
        } catch (CloneNotSupportedException e) {
            // do_it() declares this checked exception, so it must be handled here
            log.append("Recovery cache could not be saved\n");
        }
        return argument;
    }

    void block(int i) {
        switch (i) {
            case 0: BucketSort(argument); break;
            case 1: BadSort(argument); break;
            case 2: AlmostGoodSort(argument); break;
            default: break;
        }
    }
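The fragments above can be condensed into one small, self-contained program. This is a simplified sketch and not the course implementation: the names MiniRecoveryBlock, MiniSort and NoAcceptError are made up, the recovery cache is a plain array copy instead of a cloned object, and the two modules are stand-ins (a deliberately faulty one and the library sort).

```java
import java.util.Arrays;

class NoAcceptError extends Exception {}

abstract class MiniRecoveryBlock {
    protected int maxBlocks;                   // number of alternative modules
    abstract boolean acceptanceTest();
    abstract void block(int module);
    abstract void save();                      // fill the recovery cache
    abstract void restore();                   // roll back to the recovery cache

    final void doIt() throws NoAcceptError {
        save();
        for (int i = 0; i < maxBlocks; i++) {
            try {
                block(i);
                if (acceptanceTest()) return;
            } catch (RuntimeException e) { /* a failing module counts as not accepted */ }
            restore();
        }
        throw new NoAcceptError();
    }
}

class MiniSort extends MiniRecoveryBlock {
    private int[] argument;
    private int[] cache;
    private int checksum;

    MiniSort() { maxBlocks = 2; }

    void save() { cache = argument.clone(); }
    void restore() { argument = cache.clone(); }   // fresh copy, so the cache stays intact

    boolean acceptanceTest() {
        int sum = 0;
        for (int i = 0; i < argument.length; i++) {
            if (i > 0 && argument[i - 1] > argument[i]) return false;  // not ordered
            sum += argument[i];
        }
        return sum == checksum;                    // partial permutation check
    }

    void block(int module) {
        switch (module) {
            case 0: Arrays.fill(argument, 0); break;   // faulty: ordered, but values are lost
            case 1: Arrays.sort(argument); break;      // stand-in for a correct algorithm
            default: break;
        }
    }

    int[] sort(int[] data) throws NoAcceptError {
        argument = data.clone();
        checksum = 0;
        for (int x : argument) checksum += x;
        doIt();
        return argument;
    }
}
```

Calling new MiniSort().sort(...) on an array with a non-zero sum first runs the faulty module, whose result is rejected by the checksum. The state is then restored and the second module produces the accepted result.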
Fault classes (scope of R-B)
• Origin
  • physical (internal/external): +
  • logical (design/interaction): +
• Kind
  • omission: (+)
  • value: +
  • timing: +
  • byzantine: (-)
• Property
  • duration (permanent, transient): + / (+)
  • consistency (determinate, nondeterminate): + / +
  • autonomy (spontaneous, event-dependent): + / +
The ideal FT-component
Parts: Normal mode, Exception Handler
Interfaces (each shown twice in the figure):
• Request/response
• Interface exception
• Failure exception
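One way to read the figure in code is sketched below, with hypothetical names throughout (InterfaceException, FailureException, SensorReader). An invalid request is answered with an interface exception. An exception from a sub-component is passed to the exception handler, which either recovers and returns a normal response or gives up with a failure exception.

```java
import java.util.function.IntSupplier;

class InterfaceException extends Exception {
    InterfaceException(String message) { super(message); }
}

class FailureException extends Exception {
    FailureException(String message, Throwable cause) { super(message, cause); }
}

class SensorReader {
    private final IntSupplier primary;   // sub-component used in normal mode
    private final IntSupplier backup;    // sub-component used by the exception handler

    SensorReader(IntSupplier primary, IntSupplier backup) {
        this.primary = primary;
        this.backup = backup;
    }

    /** Request: read the sensor scaled by gain. Response: the scaled value. */
    int read(int gain) throws InterfaceException, FailureException {
        if (gain <= 0) throw new InterfaceException("gain must be positive");
        try {
            return gain * primary.getAsInt();        // normal mode
        } catch (RuntimeException e) {               // exception from a sub-component
            return handle(gain);
        }
    }

    /** Exception handler: recover with the backup, or signal failure to the client. */
    private int handle(int gain) throws FailureException {
        try {
            return gain * backup.getAsInt();         // fault masked, normal response
        } catch (RuntimeException e) {
            throw new FailureException("no sensor available", e);
        }
    }
}
```

The client sees only three outcomes: a response, an interface exception for its own mistake, or a failure exception when the component cannot deliver its service.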
N-version programming
V1 V2 V3
Driver (comparator)
Comparison vectors (votes)
Comparison status indicators
Comparison points
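The driver's comparison step can be sketched as a majority vote over the results of the versions. The class NVersionDriver and its method vote are hypothetical; for simplicity the versions are plain functions called one after another at a single comparison point, and a version that throws casts no vote.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntUnaryOperator;

class NVersionDriver {
    /** Runs all versions on the input and returns the result backed by a strict majority. */
    static int vote(List<IntUnaryOperator> versions, int input) {
        Map<Integer, Integer> votes = new HashMap<>();   // result -> number of versions
        for (IntUnaryOperator version : versions) {
            try {
                votes.merge(version.applyAsInt(input), 1, Integer::sum);
            } catch (RuntimeException e) {
                // a crashed version casts no vote
            }
        }
        for (Map.Entry<Integer, Integer> entry : votes.entrySet()) {
            if (2 * entry.getValue() > versions.size()) return entry.getKey();
        }
        throw new IllegalStateException("no majority among the versions");
    }
}
```

With three versions, one faulty result is outvoted by the two that agree. If all three disagree, the driver cannot decide and reports that instead of guessing.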
Fault classes (scope of N-VP)
• Origin
  • physical (internal/external): +
  • logical (design/interaction): +
• Kind
  • omission: (+)
  • value: +
  • timing: +
  • byzantine: +
• Property
  • duration (permanent, transient): + / (+)
  • consistency (determinate, nondeterminate): + / +
  • autonomy (spontaneous, event-dependent): + / +