Self-Stabilizing Operating Systems

Shlomi Dolev and Reuven Yagel
Computer Science Department, Ben-Gurion University of the Negev, Beer-Sheva, Israel

SOSP’05 / SIGOPS Doctoral Workshop, Oct 23, 2005



SOS - Motivation

• Growing use of autonomous and remote systems (e.g. RFID)

• Human management is too expensive, risky or just unavailable

• The combination and type of faults cannot be totally anticipated in on-going systems (due to, e.g., SEUs / soft errors)

• Pentium HALTING problem: “… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down…”


Proposed solution

• To build according to the well defined and understood paradigm of self-stabilization (traditionally used in distributed systems)

• Thereby achieving: trustworthiness, dependability, self-healing, automatic recovery, adaptive systems, …


Self-stabilization

• Elegant fault-tolerant approach
  – Started at any state, the system converges to a desired behavior

• Generally used in distributed systems
  – Routing, clock synchronization, leader election, etc.

• Overcomes transient faults in the system
  – Transient faults: soft errors (“98% of RAM errors are soft errors”), wrong CRC during communication, etc.


Self-stabilization

• The combination and type of faults cannot be totally anticipated in on-going systems

• Any on-going system must be self-stabilizing (or manually monitored)

• “Self-Stabilization in Spite of Distributed Control” [Dijkstra ’74]

• Self-Stabilization [Dolev ’00]
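The convergence property can be illustrated with the K-state token ring from Dijkstra's 1974 paper cited above. The following is a minimal Python sketch, not code from the talk: the function names, the round-robin scheduler and the `max_steps` guard are assumptions made for illustration. Started from any combination of machine states, the ring reaches a configuration with exactly one privileged machine and stays there.

```python
"""Sketch of Dijkstra's K-state token ring (illustrative, not from the slides)."""


def privileged(states, i):
    # Machine 0 is privileged when it equals its left neighbour (the last
    # machine); every other machine when it differs from its left neighbour.
    if i == 0:
        return states[0] == states[-1]
    return states[i] != states[i - 1]


def step(states, i, k):
    # A machine that is not privileged does nothing.
    if not privileged(states, i):
        return
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def count_privileges(states):
    return sum(privileged(states, i) for i in range(len(states)))


def stabilize(states, k, max_steps=10_000):
    # Assumed scheduler: visit the machines round-robin, one at a time,
    # until exactly one privilege (the "token") remains.
    n = len(states)
    steps = 0
    while count_privileges(states) != 1:
        if steps >= max_steps:
            raise RuntimeError("did not converge; is k >= len(states)?")
        step(states, steps % n, k)
        steps += 1
    return steps


if __name__ == "__main__":
    ring = [3, 1, 4, 1, 5]  # arbitrary, possibly corrupted, start state
    stabilize(ring, k=6)
    print(ring, count_privileges(ring))
```

No initialization step is needed: the same rules that circulate the token in normal operation also repair an arbitrary starting configuration.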


OS Reliability

• Past examples:
  – Dijkstra, “THE” Multiprogramming System ’68 (Layered Approach)
  – Denning, Fault tolerant operating systems ’76 (Protection)
  – KeyKOS ’85, EROS ’92 (Capabilities, Checkpoints)
  – Micro-kernel ~’90, Exo-kernel ’94 (Minimal TCB)

• Current:
  – JHU: The Coyotos Secure Operating System
  – IBM: K42, Autonomic Computing
  – SUN: Solaris 10, Predictive Self-Healing
  – MSR: Singularity, managed code OS


Goal: Autonomic Computer

• Following any sequence of transient faults (e.g. soft-errors), the (operating) system converges

• Using self-stabilization:
  – A system can be started in an arbitrary state and converge to a desired behavior
  – Using fair composition to run hardware + OS

• BGU: Self-stabilizing systems, tools & paradigms
  – Microprocessor [DH’04]
  – Operating System [DY’04]
  – Compiler [DH’05]
  – Framework: autonomic recoverer [BDK’03]
  – Middleware: File System [DK’02], Group Comm. [DS’01]


SOS - Directions

• Black-box
  – Take existing (Desktop/Real-time) OS
  – Add stabilization layer
  – Detailed formal specification needed

• Carefully tailoring a tiny kernel
  – Processor scheduling [SAACS04]
  – Memory management [SSS05]
  – Device drivers


Method

• Additional requirements for each OS function

• Evolve self-stabilizing solutions that follow computer-architecture/OS progress

• Detailed proof for self-stabilization of algorithms AND implementation

• Processor (e.g. Pentium) instruction manual defines a transition function
  – Don’t rely on existing compilers


Assumptions

• Whole soft-state can be corrupted (e.g. Program Counter)

• Stabilization of other layers


Solution Foundations

• Program loading & process scheduling

• Code portions in ROM

• Truly non-maskable interrupt and watchdog architecture

• Periodic reset reinstall & execute (weak)

• Continuous monitoring and consistency enforcement
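The "periodic reset, reinstall & execute" idea can be sketched as a toy simulation. Everything here is an assumption made for illustration: the `Machine` class, the four-byte `ROM_IMAGE`, the `WATCHDOG_PERIOD` value and the way the watchdog counter is modelled are not taken from the talk. The point it shows is that when the trusted copy lives in ROM and the interrupt cannot be masked, any corruption of RAM, program counter or even the watchdog counter is repaired within one watchdog period.

```python
"""Toy model of watchdog-driven reinstall from ROM (hypothetical names)."""

ROM_IMAGE = bytes([0x90, 0x90, 0xEB, 0xFC])  # stand-in for OS code kept in ROM
WATCHDOG_PERIOD = 8                          # assumed number of ticks per period


class Machine:
    def __init__(self):
        # All fields below are soft state: faults may set them to anything.
        self.ram = bytearray(len(ROM_IMAGE))
        self.pc = 0
        self.watchdog = 0

    def nmi_handler(self):
        # Reinstall the code from ROM and restart execution at its entry point.
        self.ram[:] = ROM_IMAGE
        self.pc = 0

    def tick(self):
        # Taking the counter modulo the period bounds the wait even when the
        # counter itself has been corrupted to an arbitrary value.
        self.watchdog = (self.watchdog + 1) % WATCHDOG_PERIOD
        if self.watchdog == 0:
            self.nmi_handler()
        else:
            # Stand-in for executing one instruction of the installed code.
            self.pc = (self.pc + 1) % len(self.ram)


if __name__ == "__main__":
    m = Machine()
    m.ram[:] = b"\xff\xff\xff\xff"  # inject a fault into code, pc and counter
    m.pc, m.watchdog = 3, 12345
    for _ in range(WATCHDOG_PERIOD):
        m.tick()
    print(bytes(m.ram) == ROM_IMAGE)
```

The slides mark this scheme as "(weak)": work done since the last reinstall is lost, which is what the continuous monitoring and consistency enforcement approach aims to avoid.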


Example: Memory Management - Requirements

• Consistency of memory hierarchy

• Self-stabilization preservation

[Figure: stack of App1, App2, …, AppN, OS and HW]


Solution 1: Full Swapping

• Allocate whole available memory to the running application

• Consistency: kept all the time

• Stabil. Preserving: no mutual sharing

[Figure: RAM and Disk, with labels OS, App1, App2, App3, App…]


Solution 2: Fixed Partitioning

• Fixed slots in main memory for several programs

[Figure: CD-ROM, Disk and RAM; labels c1 … c5, d1 … d5 and OS; RAM slots showing c5 d5, c2 d2, c3 d3]


Solution 2: Fixed Partitioning

• Consistency: through continuous checks and consistency establishment of OS data structures

• Stabil. Preserving: via segmentation + code refreshing

Process Table

#     F     R
P1    2     4
P2    -1    3
...

Frame Table

#     P
F1
F2    1
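One way to read "continuous checks and consistency establishment" is a routine that runs repeatedly and forces the process table and the frame table to agree. The sketch below is an assumption, not the authors' algorithm: it treats the F column of the process table as the authority, rebuilds the P column of the frame table from it, and drops invalid or duplicate claims. The names `enforce_consistency`, `NO_FRAME` and `NO_PROCESS` are hypothetical, and the R column is left out because the slides do not define it.

```python
"""Sketch of a process-table / frame-table consistency pass (hypothetical)."""

NO_FRAME = -1      # matches the -1 shown in the F column of the slide
NO_PROCESS = None  # an empty P cell in the frame table


def enforce_consistency(process_frame, frame_owner):
    """Make both tables agree, whatever values they currently hold.

    process_frame maps process number -> frame number (or NO_FRAME).
    frame_owner maps frame number -> process number (or NO_PROCESS).
    """
    claimed = {frame: NO_PROCESS for frame in frame_owner}
    for process in sorted(process_frame):
        frame = process_frame[process]
        if frame not in claimed or claimed[frame] is not NO_PROCESS:
            # Unknown frame number, or a frame another process already
            # holds: this process ends up holding no frame.
            process_frame[process] = NO_FRAME
        else:
            claimed[frame] = process
    frame_owner.update(claimed)


if __name__ == "__main__":
    # The state shown on the slide: P1 holds F2, P2 holds nothing.
    pt, ft = {1: 2, 2: -1}, {1: None, 2: 1}
    enforce_consistency(pt, ft)
    print(pt, ft)
```

Because the pass only reads the current table contents and always writes a consistent result, running it from a corrupted state and running it from a correct state are the same code path.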


Solution 3: Dynamic allocations

• We want to allow applications to dynamically allocate memory

• How can we avoid a process that (faultily) allocates the whole available memory?

• What happens if a process “forgets” about its ownership?

• Leasing


Solution 3: Dynamic allocations

[Figure: animated Process Table (columns #, F, R; rows P1, P2, ...), Frame Table (column #, P; rows F1, F2), a column labeled L, and a Request Queue holding P2]

• Consistency: dynamic memory is temporarily leased & garbage collected, verification of PCB & queue

• Stabil. Preserving: access through special segment selector
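Leasing can be sketched as follows. This is an illustrative model under stated assumptions, not the implementation from the talk: the class `LeasedFrames`, the constant `MAX_LEASE` and the tick-based clock are hypothetical. A frame is granted for a bounded number of ticks, the owner must renew it to keep it, and an expired frame is reclaimed. Clamping the lease counter on every tick means a corrupted counter cannot keep a frame allocated forever.

```python
"""Sketch of lease-based frame allocation (hypothetical names and constants)."""

MAX_LEASE = 4  # assumed lease length, in ticks
FREE = None


class LeasedFrames:
    def __init__(self, n_frames):
        self.owner = [FREE] * n_frames
        self.lease = [0] * n_frames

    def request(self, pid):
        # Grant the first free frame for MAX_LEASE ticks, or None if full.
        for frame, owner in enumerate(self.owner):
            if owner is FREE:
                self.owner[frame] = pid
                self.lease[frame] = MAX_LEASE
                return frame
        return None

    def renew(self, pid, frame):
        # Only the current owner may extend a lease.
        if 0 <= frame < len(self.owner) and self.owner[frame] == pid:
            self.lease[frame] = MAX_LEASE
            return True
        return False

    def tick(self):
        for frame in range(len(self.owner)):
            # Clamp first: a corrupted counter is pulled back into range.
            self.lease[frame] = min(max(self.lease[frame], 0), MAX_LEASE)
            if self.owner[frame] is FREE:
                self.lease[frame] = 0
                continue
            self.lease[frame] -= 1
            if self.lease[frame] <= 0:
                # Lease expired: garbage-collect the frame.
                self.owner[frame] = FREE
                self.lease[frame] = 0


if __name__ == "__main__":
    frames = LeasedFrames(2)
    f = frames.request(pid=1)
    for _ in range(MAX_LEASE):
        frames.tick()           # process 1 "forgets" to renew
    print(frames.owner[f])      # the frame has been reclaimed
```

A process that forgets about its ownership simply stops renewing, so the frame returns to the pool without any explicit free call.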


Implementation

• Pentium in real-mode, single address space
  – Simple
  – Common for sensors/microcontrollers
  – Protected mode & VM mechanisms can be handled accordingly

• Code size: ~1-2K
  – TinyOS ~1K
  – VxWorks ~102K
  – Linux kernel ~4M

• Fault injection with the Bochs simulator


Implementation

MM_FindFrame:                               ; (PT, FT, i)
    ; al contains current frame suggestion
    ; nf <- (frame[PT[i]] + 1) modulo M
    and byte [bx+FRAME_COL], FRAME_MASK     ; force the stored frame number into range
    inc al
    and al, FRAME_MASK

    ; Check all slots for an empty one.
    ; while nf != frame[PT[i]] and FT[nf] != nil
while1:
    cmp al, [bx+FRAME_COL]
    jz endwhile1                            ; wrapped around: no empty slot found
    lea si, [frames]
    add si, ax                              ; adds all of ax, so ah must be 0 here
    mov dl, [si]
    cmp dl, NULL_PROCESS
    jz endwhile1                            ; slot nf is empty
    ; do nf <- (nf + 1) modulo M
    inc al
    and al, FRAME_MASK
    jmp while1
endwhile1:
    ; return found frame number in register 'al'
    ret
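The pseudocode in the comments of MM_FindFrame can be written out in Python to make the control flow easier to follow. This is a sketch only: the values chosen for `M` and `NULL_PROCESS` are assumptions, since the slides do not give them, and `find_frame` is a hypothetical name. As in the assembly, masking with `FRAME_MASK` keeps every frame number in range even if the stored value is corrupted, and the loop visits at most M slots, so it always terminates.

```python
"""Python rendering of the MM_FindFrame pseudocode (constants are assumed)."""

M = 8                # assumed number of frames; must be a power of two
FRAME_MASK = M - 1   # masking with M - 1 is the same as "modulo M"
NULL_PROCESS = -1    # assumed marker for an unowned frame


def find_frame(current_frame, frame_table):
    """Return the next empty frame after current_frame, scanning circularly.

    If no frame is empty, the (masked) current frame is returned, so the
    caller has to check whether the result is actually free.
    """
    start = current_frame & FRAME_MASK  # sanitize a possibly corrupted value
    nf = (start + 1) & FRAME_MASK
    while nf != start and frame_table[nf] != NULL_PROCESS:
        nf = (nf + 1) & FRAME_MASK
    return nf


if __name__ == "__main__":
    table = [1] * M            # every frame owned by process 1 ...
    table[5] = NULL_PROCESS    # ... except frame 5
    print(find_frame(2, table))
```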


Future Work

• I/O device drivers
  – Major cause of operating systems failures
  – Co-operation of more than one microprocessor
  – Detailed driver / General monitoring layer

• Gather the different parts

• Micro-kernel / VMM


Conclusion

• The work shows theoretical and practical ways to achieve the goal of a self-stabilizing OS

• The (system) research community & industry can benefit from the foundation of self-stabilization

• http://www.cs.bgu.ac.il/~yagel/sos


Space Vehicle Failure

• …The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems…The operating system is Wind River Systems' Vx-Works..

• …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended…

• …Spirit fell silent, alone on the emptiness of Mars…

• …the rover was in fact listening and rebooting, the team commanded Spirit to reboot without mounting the flash file system…But just in case, the team is working on an exception-handler routine that will more gracefully recover from an allocation failure

http://www.eetimes.com/story/OEG20040220S0046