feedback based real-time fault tolerance issues and possible solutions

1

Feedback Based Real-Time Fault Tolerance

Issues and Possible Solutions

Xue Liu, Hui Ding, Kihwal Lee,

Marco Caccamo, Lui Sha

2

Major Issues in Software Reliability

• Software is becoming more and more complex
  – More features → larger code size
  – Rapid evolution → introduction of new code

E.g. Apache:
  1998: 0.8 MLOC
  2002: 10 MLOC
  2004: 27 MLOC

E.g. Windows XP: 40-50 MLOC

Gray’s estimate: 1 bug / KLOC

3

Growing Software Complexity

Poorly managed or maintained; Software bugs and errors.

• Managed by human operators
  – Shortage of skilled operators due to the growing complexity
  – Costly
  – To err is human

• Faults

Sources of computing system downtime (cited from: Candea, Stanford ’03)

Category          Share of downtime
Hardware          20%
Software          40%
Human operators   40%

Complexity adds difficulty to management and breeds bugs.

- Control the complexity in computer systems!

- Build systems that are robust against software bugs

4

Feedback Control Reflection

• Successful track record in controlling electro/mechanical systems
• Observation 1: Computing systems have been crucial to the success of feedback control
  – Digital designs and implementations, etc.
• Observation 2: Feedback control has appealing properties
  – Tolerance of errors (model, sensing, actuation, etc.) in the physical process
    • Utilizes runtime feedback for error correction

[Figure labels: Computing Systems, Feedback Control, Fault tolerance]

Reflection: Can feedback control help to solve the fault tolerance problem in computing systems?

5

Q: Feedback control helps to tolerate errors in mechanical systems. Can it also help to tolerate software errors?

[Figure labels: Feedback Control; Tolerant of Errors in Mechanical Systems; Tolerant of Errors in Software Systems]

Idea 1: Feedback Control of Software Execution

Mechanical systems: Sense (feedback) -> Control (error correction) -> Actuation

Software systems: Sense (feedback) -> Control (error correction) -> Execution

Idea 2: Using Simplicity to Control Complexity

• A simple and reliable core that gives acceptable performance
• The system under the complex control software remains in states that are recoverable by the simple core (this achieves fault tolerance)

Targeted applications: real-time control systems

6

A Typical Feedback Control Loop for Mechanical Systems

• Sense: Measure the system output and identify whether an error exists

• Control: Decision

• Actuation: Execution

[Figure: feedback loop. Reference Input → (−) → Controller (Decision) → Actuator (Execution) → Mechanical System (Plant) → Sensor (Sensing/error identification) → back to the controller input]
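The sense, control, actuate cycle can be illustrated with a minimal Python sketch. The integrator plant, the proportional gain, and the names `run_feedback_loop`, `gain`, and `dt` are illustrative assumptions, not taken from the slides.

```python
def run_feedback_loop(reference, x0=0.0, gain=2.0, dt=0.1, steps=50):
    """Simulate sense -> control -> actuate on a toy integrator plant.

    Assumed plant (hypothetical): x[k+1] = x[k] + dt * command.
    """
    x = x0
    for _ in range(steps):
        measured = x                   # Sense: read the plant output
        error = reference - measured   # Sense: identify the error
        command = gain * error         # Control: decide on a correction
        x = x + dt * command           # Actuation: apply it to the plant
    return x


final_state = run_feedback_loop(reference=1.0)
print(final_state)
```

With these values the error shrinks by a factor of 0.8 each step, so the state approaches the reference even though the controller has no exact model of the disturbance or initial condition.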

7

Related Work – Simplex Architecture

[Figure: data flow block diagram. Blocks: simple high assurance control subsystem (HAC), complex high performance control subsystem (HPC), Decision, Plant]

• A simple reliable core (HAC)

• Diversity in the form of 2 alternatives (HAC, HPC)

• Feedback control of the software execution.

Sense (feedback)->Decision (control/error correction) -> Execution (actuation)
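A minimal sketch of such a decision step, assuming the HAC always runs alongside the HPC and that a predicate tells whether the current state is still recoverable by the HAC. The function name `simplex_step` and the toy controllers are hypothetical, not the Simplex implementation.

```python
def simplex_step(state, hpc, hac, is_recoverable):
    """Pick the command for one control period.

    hpc, hac: callables mapping a state to a command (assumed interfaces).
    is_recoverable: callable returning True while the HAC could still
    stabilize the plant from `state`.
    """
    safe_command = hac(state)          # HAC runs every period in Simplex
    try:
        fast_command = hpc(state)      # HPC may be buggy and raise
    except Exception:
        return safe_command, "HAC"
    if is_recoverable(state):
        return fast_command, "HPC"     # normal case: use the complex controller
    return safe_command, "HAC"         # error correction: fall back to the core


# Toy scalar example (all values are illustrative assumptions).
hac = lambda x: -0.5 * x
hpc = lambda x: -2.0 * x
recoverable = lambda x: abs(x) < 1.0

print(simplex_step(0.2, hpc, hac, recoverable))
print(simplex_step(1.5, hpc, hac, recoverable))
```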

8

Drawbacks of Simplex

• P1: The analytically redundant high assurance controller (HAC) runs in parallel with the complex controller (HPC)
  – Lowers system performance and increases operating costs
  – Limits the application of Simplex to safety-critical domains
• P2: HAC and HPC must run at the same period

Design Goals of ORTGA

1. Functionality similar to Simplex
2. Much lower resource usage
3. Flexibility

Our new proposal: On-demand Real-Time Guard (ORTGA). The HAC runs only when a fault occurs!

9

ORTGA Architecture: Key Ideas

(1) Reduce the resource usage of Simplex

Solution:
• “On-demand” execution of the HAC
  – Only when control under the HPC is detected as faulty is the HAC switched in to take over the plant

(2) Flexibility

Solution:
• The periods of the HAC and HPC are multiples of a subperiod
• The HAC and HPC can have different periods
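One way to picture the two ideas together is a dispatcher that ticks once per subperiod and releases only one controller at a time. This is a sketch under assumptions: the function `run_schedule`, the alignment of releases to global tick counts, and the instant switch at the fault tick are all illustrative choices, not the ORTGA design.

```python
def run_schedule(num_ticks, hpc_every, hac_every, fault_at_tick=None):
    """Return, per subperiod tick, which controller job is released.

    hpc_every / hac_every: controller periods expressed as multiples of
    the subperiod (they may differ).
    fault_at_tick: tick at which the HPC is detected as faulty, or None.
    """
    released = []
    hac_active = False
    for tick in range(num_ticks):
        if fault_at_tick is not None and tick >= fault_at_tick:
            hac_active = True          # on-demand switch to the HAC
        if hac_active:
            released.append("HAC" if tick % hac_every == 0 else None)
        else:
            released.append("HPC" if tick % hpc_every == 0 else None)
    return released


timeline = run_schedule(num_ticks=10, hpc_every=4, hac_every=2, fault_at_tick=5)
print(timeline)
```

In the fault-free case the HAC never appears in the timeline, which is where the resource saving comes from.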

10

Background: Maximum Stability Region

• The largest region of the state space in which the system is still stable under the current controller

[Figure labels: Maximum Stability Region (Recovery Region), Stability Region, Lyapunov Functions, State Constraints]

11

How to determine the Maximum Stability Region?

• In the operation of a plant, there is a set of state constraints representing safety, physical device limitations, environmental and other operational requirements.

• They can be represented as a normalized polytope, $C^T X < 1$, in the N-dimensional state space. We must be able to take control away from a faulty …

[Figure: operation constraints and admissible states. Labels: State constraints, Admissible States]
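Checking whether a state is admissible then amounts to evaluating each face of the normalized polytope. The sketch below assumes the constraints are stored as a list of vectors, one per face, and uses a hypothetical 2-D box as example data.

```python
def is_admissible(constraints, x):
    """Return True if every normalized constraint c . x < 1 holds.

    constraints: list of vectors, one per face of the polytope (assumed layout).
    x: state vector of the same dimension.
    """
    return all(
        sum(c_i * x_i for c_i, x_i in zip(c, x)) < 1.0
        for c in constraints
    )


# Hypothetical limits |x1| < 2 and |x2| < 4, normalized so each face reads c . x < 1.
BOX = [
    [0.5, 0.0],
    [-0.5, 0.0],
    [0.0, 0.25],
    [0.0, -0.25],
]

print(is_admissible(BOX, [1.0, 1.0]))
print(is_admissible(BOX, [2.5, 0.0]))
```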

12

Maximum Stability Region

• A stability region is closed with respect to the operation of the simple controller: a state that starts inside the region stays inside it. It is bounded by a Lyapunov function inside the polytope.

• The maximum recovery region can be found using linear matrix inequalities (LMI).

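[Figure: state constraints and the switching rule (Lyapunov function). Labels: State constraints, Recovery Region, Lyapunov function]

System: $\dot{X} = A X$

$$
\begin{aligned}
\min\ & \log \det Q^{-1} \\
\text{subject to}\ & A^T Q + Q A < 0, \\
& C^T X < 1
\end{aligned}
$$

Switching rule: $X^T Q X < 1$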
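Given a candidate matrix Q, the stability condition and the switching rule can be checked numerically. The sketch below does this for a 2-D case in plain Python; the matrices `A` and `Q` are made-up example values (Q was chosen so that A^T Q + Q A equals the negative identity), and the helper names are hypothetical. It does not solve the LMI itself, which would need an optimization solver.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lyapunov_lhs(a, q):
    """Compute A^T Q + Q A."""
    left = mat_mul(transpose(a), q)
    right = mat_mul(q, a)
    return [[left[i][j] + right[i][j] for j in range(len(q))]
            for i in range(len(q))]


def is_negative_definite_2x2(m):
    """A symmetric 2x2 matrix is negative definite iff m00 < 0 and det > 0."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return m[0][0] < 0 and det > 0


def inside_recovery_region(q, x):
    """Switching rule: X^T Q X < 1."""
    value = sum(x[i] * q[i][j] * x[j]
                for i in range(len(x)) for j in range(len(x)))
    return value < 1.0


# Example values (assumptions): a stable closed loop and a matching Q.
A = [[0.0, 1.0], [-2.0, -3.0]]
Q = [[1.25, 0.25], [0.25, 0.25]]

print(is_negative_definite_2x2(lyapunov_lhs(A, Q)))
print(inside_recovery_region(Q, [0.5, 0.5]))
print(inside_recovery_region(Q, [1.0, 1.0]))
```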

13

Research Issues of ORTGA

• How to detect faults in the HPC
  – Timing faults:
    • Application-level support: a monitor detects missed heartbeat messages
    • OS support: the scheduler detects task deadline misses
  – Other faults:
    • A wide range of traditional fault detection techniques can be used
• When to recover once a fault in the HPC is detected?
  – Recover early?
    • Too early: false alarms
  – Recover late?
    • Too late: cannot recover in time
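Application-level detection of timing faults can be sketched as a heartbeat monitor that counts overdue beats and only raises a fault after a tolerated number of misses. The class name, the millisecond timestamps, and the `allowed_misses` threshold are illustrative assumptions.

```python
class HeartbeatMonitor:
    """Count missed heartbeats from the HPC (hypothetical interface)."""

    def __init__(self, period_ms, allowed_misses):
        self.period_ms = period_ms            # expected heartbeat period
        self.allowed_misses = allowed_misses  # misses tolerated before a fault
        self.last_beat_ms = 0

    def beat(self, now_ms):
        """Record a heartbeat received at time now_ms."""
        self.last_beat_ms = now_ms

    def missed_beats(self, now_ms):
        """Number of heartbeats that were due strictly before now_ms but never arrived."""
        overdue = (now_ms - self.last_beat_ms - 1) // self.period_ms
        return max(0, overdue)

    def is_faulty(self, now_ms):
        return self.missed_beats(now_ms) > self.allowed_misses


monitor = HeartbeatMonitor(period_ms=10, allowed_misses=2)
monitor.beat(0)
monitor.beat(10)
print(monitor.missed_beats(25), monitor.is_faulty(25))
print(monitor.missed_beats(45), monitor.is_faulty(45))
```

Raising `allowed_misses` trades fewer false alarms against later detection, which is the early-versus-late tension in the bullets above.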

14

When to recover

• Why not recover too early?
  – Control tasks have been shown to tolerate several deadline misses
  – Sometimes the system is merely delayed (overload, communication delay, etc.)
  – These are not “real” faults
  – Try to minimize recoveries caused by false alarms
• Why not recover too late?
  – If you recover too late, there is no time left to make the system stable!

15

Right Time To Recover (RTTR)

• An example of a “desirable” late but timely recovery (under RM)

[Figure: three schedule timelines, time axis 0 to 8. (a) Normal schedule of tasks 1 and 2; (b) task 2 recovered immediately; (c) task 2 recovered late]

Observation: Sometimes a late but timely recovery makes the system more schedulable

Assumption: The fault is detected at t=2.0, before the task deadline D=8

Find the RTTR instead of minimizing MTTR!

16

A possible solution to determine RTTR

• Idea
  – Recover as late as possible,
  – But not too late
• If the plant state under the HPC is about to leave the HAC-established stability region, recover!
• Otherwise, wait (maybe the HPC is still OK)

[Figure: when to recover? Labels: HB1 (t1), HB2 (t2), monitor finds HB3 missing (t3), prediction ts, tr, recovered threads, stability region S of the controlled plant]
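The "as late as possible, but not too late" rule can be sketched by predicting the plant state forward and counting how many steps remain before it would leave the stability region. Everything here is an assumption for illustration: the one-step predictor `predict_next`, the region test, the `recovery_steps` margin, and the toy unstable scalar plant.

```python
def steps_until_unsafe(state, predict_next, in_stability_region, horizon=100):
    """Count how many further steps the predicted state stays inside the region."""
    steps = 0
    x = state
    while steps < horizon:
        x = predict_next(x)
        if not in_stability_region(x):
            return steps
        steps += 1
    return horizon


def should_recover_now(state, predict_next, in_stability_region, recovery_steps):
    """Recover once the remaining safe steps no longer exceed the
    number of steps the HAC needs to take over (assumed margin)."""
    remaining = steps_until_unsafe(state, predict_next, in_stability_region)
    return remaining <= recovery_steps


# Toy example: without correct control the state grows by 50% per step,
# and the stability region is |x| < 1.
drift = lambda x: 1.5 * x
in_region = lambda x: abs(x) < 1.0

print(steps_until_unsafe(0.2, drift, in_region))
print(should_recover_now(0.2, drift, in_region, recovery_steps=1))
print(should_recover_now(0.5, drift, in_region, recovery_steps=1))
```

Waiting while the prediction stays inside the region gives a merely delayed HPC the chance to resume, which avoids a recovery triggered by a false alarm.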

17

Performance Gain of ORTGA

Reduce Resource Usage: On-demand Execution of HAC

HPC’s timing parameters: {Cp, Tp}; HAC’s timing parameters: {Ca, Ta};

A total savings of:

Relative saving:
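One plausible way to quantify the gain is in CPU utilization. The sketch below assumes that C is a worst-case execution time and T a period, that Simplex runs both controllers every period while ORTGA runs only the HPC in the fault-free case, and that monitoring overhead is ignored. These expressions are an assumption and are not necessarily the formulas from the original slide.

```python
def utilization_saving(cp, tp, ca, ta):
    """Return (absolute, relative) fault-free utilization saving.

    Assumed model: Simplex uses cp/tp + ca/ta, ORTGA uses cp/tp.
    """
    simplex_util = cp / tp + ca / ta
    ortga_util = cp / tp
    absolute = simplex_util - ortga_util   # equals ca/ta under this model
    relative = absolute / simplex_util
    return absolute, relative


# Hypothetical timing parameters: HPC {Cp=2, Tp=10}, HAC {Ca=1, Ta=10}.
absolute, relative = utilization_saving(cp=2.0, tp=10.0, ca=1.0, ta=10.0)
print(absolute, relative)
```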

18

Ongoing Work: A proof-of-concept System

Double Inverted Pendulum System

- Double Quanser inverted pendulum with custom-made tracks

- PC/104 sized, i486 compatible system

- Customized Linux 2.6 kernel and root image in flash memory

- ORTGA middleware layer

19

Conclusions

• Feedback Based Real-Time Fault Tolerance
  – Leverage feedback control of software execution
• ORTGA Architecture
  – On-demand execution of the reliable core (HAC) only when a fault occurs
  – Significantly reduces resource usage
• Issues and possible solutions
  – How to detect faults
  – When to recover to maintain system stability
  – How to find the RTTR (instead of minimizing MTTR)

20

Backup Slides

21

Software Fault Model in RT Control systems

• Timing fault: a task misses its deadlines
• Capability abuse:
  – Corrupting others’ code or data
  – Unauthorized acquisition of process/resource management capability
• Semantic fault: incorrect results that can lead to:
  – Poor control performance
  – Instability in the plant

Fault              Countermeasure
Timing fault       GRMS
Semantic fault     Analytic redundancy (simple & complex controllers)
Capability abuse   Privilege management
