Feedback Based Real-Time Fault Tolerance: Issues and Possible Solutions
1
Feedback Based Real-Time Fault Tolerance
Issues and Possible Solutions
Xue Liu, Hui Ding, Kihwal Lee,
Marco Caccamo, Lui Sha
2
Major Issues in Software Reliability
• Software becoming more and more complex
– More features → larger code size
– Rapid evolution → introduction of new code
E.g. Apache
1998: 0.8 MLOC
2002: 10 MLOC
2004: 27 MLOC
E.g. Windows XP: 40-50 MLOC
Gray’s estimate: 1 bug / KLOC
3
Growing Software Complexity
Poorly managed or maintained; software bugs and errors.
• Managed by human operators
– Shortage of skilled operators due to the growing complexity
– Costly
– To err is human
• Faults
Sources of computing system downtime (cited from: Candea, Stanford ’03)
Category           Source of downtime (percentage)
Hardware           20%
Software           40%
Human operators    40%
Complexity adds difficulty to management and breeds bugs.
- Control the complexity in computer systems!
- Build systems that are robust against software bugs
4
Feedback Control Reflection
• Successful track record in controlling electro/mechanical systems
• Observation 1: Computing systems have been crucial in the success of feedback control
– Digital designs & implementations etc.
• Observation 2: Feedback control has appealing properties
– Tolerance of errors (model/sensing/actuation etc.) in the physical process
• Utilize runtime feedback for error correction
Reflection: Can feedback control help to solve the fault tolerance problem in computing systems?
(Diagram labels: Computing Systems, Feedback Control, Fault tolerance)
5
Idea 1: Feedback Control of Software Execution
Q: Feedback control can help to tolerate errors in mechanical systems; can it also help to tolerate software errors?
Mechanical systems: Sense (feedback) -> Control (error correction) -> Actuation
Software systems: Sense (feedback) -> Control (error correction) -> Execution
Idea 2: Using Simplicity to Control Complexity
• A simple and reliable core which gives acceptable performance
• The system under complex control software remains in states that are recoverable by the simple core (this achieves fault tolerance)
Targeted applications: Real-time control systems
(Diagram labels: Feedback Control; Tolerant of Errors in Mechanical Systems; Tolerant of Errors in Software Systems)
6
A Typical Feedback Control Loop for Mechanical Systems
• Sense: System output, identify if error exists
• Control: Decision
• Actuation: Execution
(Block diagram: Reference Input, Controller (decision), Actuator (execution), Mechanical System (Plant), Sensor (sensing/error identification))
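The sense, control, actuate cycle of such a loop can be sketched in a few lines of Python. This is only an illustration: the toy plant (the state moves by exactly the commanded amount), the proportional gain, and the function name `run_feedback_loop` are assumptions, not taken from the slides.

```python
def run_feedback_loop(reference, x0, gain=0.5, steps=50):
    """Drive a toy plant toward `reference` with proportional feedback."""
    x = x0
    for _ in range(steps):
        measured = x                  # Sense: read the plant output
        error = reference - measured  # Sense: identify whether an error exists
        command = gain * error        # Control: decide on a correction
        x = x + command               # Actuation: apply it to the (toy) plant
    return x


final_state = run_feedback_loop(reference=1.0, x0=0.0)
```

With a gain of 0.5 the error halves on every iteration, so the loop corrects the initial error without the controller needing an exact model of the plant.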
7
Related Work – Simplex Architecture
(Data flow block diagram: Simple high assurance control subsystem (HAC), Complex high performance control subsystem (HPC), Decision, Plant)
• A simple reliable core (HAC)
• Diversity in the form of 2 alternatives (HAC, HPC)
• Feedback control of the software execution.
Sense (feedback)->Decision (control/error correction) -> Execution (actuation)
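A minimal sketch of the Simplex decision step, assuming a one-dimensional toy plant whose state moves by the commanded amount and a recovery region of the form |x| < limit. The controller functions and `simplex_step` are hypothetical names used only for illustration.

```python
def simplex_step(x, hpc, hac, limit=1.0):
    """One period of a toy Simplex loop; returns (next_state, controller_used)."""
    u_hpc = hpc(x)  # In Simplex, both controllers run every period
    u_hac = hac(x)
    predicted = x + u_hpc  # Sense/predict the effect of the HPC command
    if abs(predicted) < limit:  # Decision: is the result still recoverable?
        return x + u_hpc, "HPC"
    return x + u_hac, "HAC"  # Otherwise execute the simple controller's command


def safe_controller(x):  # hypothetical HAC: simple and conservative
    return -0.5 * x


def good_hpc(x):  # hypothetical HPC behaving correctly
    return -0.9 * x


def buggy_hpc(x):  # hypothetical HPC with a software fault
    return 5.0


state_ok, used_ok = simplex_step(0.4, good_hpc, safe_controller)
state_bad, used_bad = simplex_step(0.4, buggy_hpc, safe_controller)
```

Here the faulty HPC command would push the state out of the region, so the decision module falls back to the HAC output.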
8
Drawbacks of Simplex
• P1: Analytically redundant high assurance controller (HAC) runs in parallel with complex controller (HPC)
– Lowers system performance, increases operating costs
– Limits the application of Simplex to safety-critical domains only
• P2: HAC and HPC must run at the same period
Our new proposal: On-demand Real-Time Guard (ORTGA)
HAC only runs when a fault occurs!
Design Goals of ORTGA
1. Similar functionality to Simplex
2. Much less resource usage
3. Flexibility
9
ORTGA Architecture: Key Ideas
(1): Reduce resource usage of Simplex
Solution:
• “On-demand” execution of HAC
– Only when the control under HPC is detected as faulty is the HAC switched in to take over the plant
(2): Flexibility
Solution:
• The periods of HAC and HPC are multiples of a subperiod
• HAC and HPC can have different periods
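Both ideas can be illustrated with a small dispatcher that works in units of the subperiod. This is a sketch under assumptions: the fault detection time is supplied by the caller, the HAC is released at the moment of detection, and `ortga_releases` is a hypothetical helper, not part of ORTGA itself.

```python
def ortga_releases(ticks, hpc_every, hac_every, fault_at=None):
    """List which controller job is released in each subperiod tick.

    hpc_every / hac_every are the controller periods in subperiods.
    fault_at is the tick at which an HPC fault is detected (None = no fault).
    """
    releases = []
    for tick in range(ticks):
        if fault_at is None or tick < fault_at:
            # Fault-free operation: only the HPC consumes CPU time.
            job = "HPC" if tick % hpc_every == 0 else "-"
        else:
            # After detection: the HAC is switched in, with its own period.
            job = "HAC" if (tick - fault_at) % hac_every == 0 else "-"
        releases.append(job)
    return releases


normal = ortga_releases(8, hpc_every=2, hac_every=4)
faulty = ortga_releases(8, hpc_every=2, hac_every=4, fault_at=3)
```

In the fault-free run the HAC never appears in the schedule, which is where the resource saving over Simplex comes from.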
10
Background: Maximum Stability Region
• The largest region of the state space in which the system is still stable under the current controller
(Diagram labels: Maximum Stability Region (Recovery Region), Stability Region, Lyapunov Functions, State Constraints)
11
How to determine the Maximum Stability Region?
• In the operation of a plant, there is a set of state constraints representing safety, device physical limitations, environmental and other operational requirements.
• They can be represented as a normalized polytope, $C^T X < 1$, in the N-dimensional state space. We must be able to take the control away from a faulty
(Figure: Operation Constraints and Admissible States; labels: State constraints, Admissible States)
12
Maximum Stability Region
• A stability region is closed with respect to the operations of the simple controller. It is described by a Lyapunov function inside the polytope.
• The maximum recovery region can be found using LMI.
(Figure labels: State constraints, Recovery Region, Lyapunov function)
State constraints and the switching rule (Lyapunov function):
System dynamics: $\dot{X} = A X$
$\min \log \det Q$ subject to $A^T Q + Q A < 0$
State constraints: $C^T X < 1$
Switching rule: $X^T Q X < 1$
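The checks behind the switching rule can be sketched for a two-dimensional plant in plain Python. Everything numeric here is an assumption for illustration: the matrix A, the hand-picked Q (chosen so that A^T Q + Q A = -I, not the LMI optimum), and the polytope rows in C. Solving the actual LMI would need a semidefinite programming solver and is not shown.

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def lyapunov_lhs(A, Q):
    """Return A^T Q + Q A."""
    left, right = mat_mul(transpose(A), Q), mat_mul(Q, A)
    n = len(A)
    return [[left[i][j] + right[i][j] for j in range(n)] for i in range(n)]


def is_negative_definite_2x2(M):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return M[0][0] < 0 and det > 0


def quad_form(x, Q):
    """Return x^T Q x."""
    n = len(x)
    return sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))


def hac_can_recover(x, Q, C):
    """State constraints C^T X < 1 and switching rule X^T Q X < 1."""
    in_polytope = all(sum(c * v for c, v in zip(row, x)) < 1 for row in C)
    return in_polytope and quad_form(x, Q) < 1


# Assumed example data (not from the slides).
A = [[0.0, 1.0], [-2.0, -3.0]]       # stable closed-loop dynamics
Q = [[1.25, 0.25], [0.25, 0.25]]     # hand-picked Lyapunov matrix
C = [[0.5, 0.0], [-0.5, 0.0], [0.0, 0.4], [0.0, -0.4]]  # |x1| < 2, |x2| < 2.5

stable = is_negative_definite_2x2(lyapunov_lhs(A, Q))
inside = hac_can_recover([0.5, 0.5], Q, C)
outside = hac_can_recover([1.0, 1.0], Q, C)
```

A state for which `hac_can_recover` is false lies outside the ellipsoid used by this sketch, so control should have been handed to the simple controller before the plant reached it.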
13
Research Issues of ORTGA
• How to detect faults in HPC
– Timing faults:
  • Application level support: Monitor detects missed heartbeat messages
  • OS support: Scheduler detects task deadline misses
– Other faults:
  • A wide range of traditional fault detection techniques can be used.
• When to recover if a fault in HPC is detected?
– Recover early?
  • Too early: False alarms
– Recover late?
  • Too late: could not recover in time
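Application-level timing fault detection can be sketched as a heartbeat monitor. This is a minimal illustration: the class `HeartbeatMonitor`, its `max_misses` threshold, and the explicit `now` timestamps are assumptions, not part of the ORTGA middleware.

```python
class HeartbeatMonitor:
    """Flags a timing fault when too many expected heartbeats are missing."""

    def __init__(self, period, max_misses=0):
        self.period = period          # expected time between heartbeats
        self.max_misses = max_misses  # misses tolerated before flagging a fault
        self.last_beat = 0.0

    def beat(self, now):
        """Record a heartbeat message from the HPC."""
        self.last_beat = now

    def misses(self, now):
        """Number of whole periods that have elapsed since the last heartbeat."""
        return int((now - self.last_beat) // self.period)

    def timing_fault(self, now):
        return self.misses(now) > self.max_misses


monitor = HeartbeatMonitor(period=1.0, max_misses=1)
monitor.beat(1.0)
monitor.beat(2.0)
```

Tolerating a small number of misses before raising a fault is one way to trade off the two risks listed above: a threshold of zero reacts to every delay, while a large threshold may react too late.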
14
When to recover
• Why not recover too early?
– Control tasks have been shown to tolerate several deadline misses
– Sometimes the system just has some delay (overload, communication delay etc.)
– These are not “real” faults
– Try to minimize recoveries due to false alarms
• Why not recover too late?
– If you recover too late, there is no time left to make the system stable!
15
Right Time To Recover (RTTR)
• An example of a “desirable” late but timely recovery (under RM)
(Figure: three schedules on a time axis from 0 to 8: (a) normal schedule of tasks 1 and 2, (b) recover task 2 immediately, (c) recover task 2 late)
Assumption: The fault is detected at t=2.0, before the task deadline D=8
Observation: Sometimes a late but timely recovery makes the system more schedulable
Find the RTTR instead of minimizing MTTR!
16
A possible solution to determine RTTR
• Idea
– Recover as late as possible,
– But not too late
• If the plant state under HPC is about to leave the HAC-established stability region, recover!
• Otherwise, wait (maybe HPC is still OK)
(Figure “When to recover?”; labels: HB1 (t1), HB2 (t2), Monitor finds HB3 missing (t3), Prediction, ts, tr, Recovered Threads, Stability Region S of Controlled Plant)
Performance Gain of ORTGA
Reduce Resource Usage: On-demand Execution of HAC
HPC’s timing parameters: {Cp, Tp}; HAC’s timing parameters: {Ca, Ta};
A total savings of:
Relative saving:
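The two formulas did not survive in this transcript. One plausible reading, offered purely as an assumption, is that the saving is measured in CPU utilization during fault-free operation: Simplex runs both controllers (Cp/Tp + Ca/Ta), while ORTGA runs only the HPC (Cp/Tp). The sketch below computes the saving under that assumption; the numeric values are hypothetical.

```python
def ortga_saving(cp, tp, ca, ta):
    """Return (absolute, relative) utilization saving in fault-free operation.

    Assumes Simplex utilization is cp/tp + ca/ta and ORTGA utilization is cp/tp.
    """
    simplex_utilization = cp / tp + ca / ta
    saving = ca / ta  # the HAC share that ORTGA does not spend
    return saving, saving / simplex_utilization


# Hypothetical timing parameters: HPC {Cp=4, Tp=10}, HAC {Ca=1, Ta=10}.
absolute_saving, relative_saving = ortga_saving(cp=4, tp=10, ca=1, ta=10)
```

Under these example numbers ORTGA would free 10% of the CPU, which is one fifth of what Simplex would use.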
18
Ongoing Work: A proof-of-concept System
Double Inverted Pendulum System
- Double Quanser inverted pendulum with custom-made tracks
- PC/104 sized, i486 compatible system
- Customized Linux 2.6 kernel and root image in flash memory
- ORTGA middleware layer
19
Conclusions
• Feedback Based Real-Time Fault Tolerance
– Leverage feedback control of software execution
• ORTGA Architecture
– On-demand execution of the reliable core (HAC) only when a fault occurs
– Significantly reduces resource usage
• Issues and possible solutions
– How to detect faults
– When to recover to maintain system stability
– How to find the RTTR (instead of minimizing MTTR)
20
Backup Slides
21
Software Fault Model in RT Control systems
• Timing fault: misses its deadlines
• Capability abuse:
– Corrupt others’ code or data
– Unauthorized acquisition of process/resource management capability
• Semantic fault: incorrect results that can lead to:
– Poor control performance
– Instability in the plant
Timing fault → GRMS
Semantic fault → Analytic redundancy (simple & complex controllers)
Capability abuse → Privilege management