Fault Tolerant Distributed Computing System.
What is a fault? A fault is a blemish, weakness, or shortcoming
of a particular hardware or software component.
Fault, error and failures
Why fault tolerant? Availability, reliability, dependability, …
How to provide fault tolerance?
• Replication
• Checkpointing and message logging
• Hybrid approaches
Fundamentals
Message Logging
Tolerate crash failures. Each process periodically records its local state and logs the messages it received after that checkpoint. Once a crashed process recovers, its state must be consistent with the states of the other processes.
Orphan processes
• surviving processes whose states are inconsistent with the recovered state of a crashed process
Message logging protocols guarantee that upon recovery no process is an orphan.
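A minimal sketch of this recovery idea in Java (all names here are illustrative, not from any of the systems below): restore the last checkpoint, then replay the logged messages in their original order, so the recovered state matches what the surviving processes have already observed.

    import java.util.List;

    interface LoggedMessage {}
    interface State { State deliver(LoggedMessage m); }   // application state
    interface Checkpoint { State restore(); }             // saved local state

    class RecoveringProcess {
        private State state;

        void recover(Checkpoint lastCheckpoint, List<LoggedMessage> log) {
            state = lastCheckpoint.restore();  // roll back to the checkpoint
            for (LoggedMessage m : log) {      // re-deliver logged messages in
                state = state.deliver(m);      // their original receipt order
            }
            // execution now continues as if the crash never happened, so no
            // surviving process becomes an orphan
        }
    }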
Message logging protocols
Pessimistic Message Logging
• avoids the creation of orphans during execution
• no process p sends a message m until it knows that all messages delivered before sending m are logged; this gives quick recovery
• can block a process for each message it receives, which slows down throughput
• allows processes to communicate only from recoverable states: any information that may be needed for recovery is synchronously logged to stable storage before the process is allowed to communicate (sketched below)
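A sketch of that discipline with hypothetical names: each received message is forced to stable storage before the process may send, so every state it communicates from is recoverable.

    import java.io.*;

    class PessimisticLogger {
        private final FileOutputStream fileOut;
        private final ObjectOutputStream stableLog;

        PessimisticLogger(File logFile) throws IOException {
            fileOut = new FileOutputStream(logFile, true);
            stableLog = new ObjectOutputStream(fileOut);
        }

        synchronized void onReceive(Serializable message) throws IOException {
            // log the received message and force it to disk before returning:
            // this synchronous write is the step that slows down throughput
            // but enables quick recovery
            stableLog.writeObject(message);
            stableLog.flush();
            fileOut.getFD().sync();
        }

        void send(Serializable message, Channel out) {
            // safe: everything delivered before this send is already on disk
            out.transmit(message);
        }

        interface Channel { void transmit(Serializable m); }
    }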
Message Logging
Optimistic Message Logging
• takes appropriate actions during recovery to eliminate all orphans
• better performance during failure-free runs
• allows processes to communicate from non-recoverable states; failures may make these states permanently unrecoverable, forcing rollback of any process that depends on such states
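The trade-off, as a hypothetical Java sketch: logging happens asynchronously on a background thread, so receives never block, but any messages still buffered at crash time are lost, and processes that depend on them must roll back.

    import java.io.*;
    import java.util.concurrent.*;

    class OptimisticLogger {
        private final BlockingQueue<Serializable> buffer = new LinkedBlockingQueue<>();

        OptimisticLogger() {
            // the application thread never blocks on logging; a background
            // thread drains the buffer to disk when convenient
            Executors.newSingleThreadExecutor().submit(this::flushLoop);
        }

        void onReceive(Serializable message) {
            buffer.add(message);  // non-blocking: fast failure-free runs
        }

        private void flushLoop() {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream("msg.log", true))) {
                while (true) {
                    out.writeObject(buffer.take());
                    out.flush();      // messages that never reach this point
                }                     // are lost in a crash, forcing dependent
            } catch (Exception e) {}  // processes to roll back
        }
    }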
Causal Message Logging
• no orphans when failures happen, and processes are not blocked when failures do not occur
• weakens the condition imposed by pessimistic protocols
• allows the possibility that the state from which a process communicates is unrecoverable because of a failure, but only if this does not affect consistency
• appends to all communication the information needed to recover the state from which the communication originates; this information is replicated in the memory of the processes that causally depend on the originating state
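The core mechanism, sketched with hypothetical structures: each outgoing message piggybacks the determinants of the not-yet-stable non-deterministic events its sending state depends on, so every causally dependent process holds a replica of them in volatile memory.

    import java.util.*;

    class CausalProcess {
        // determinant of a receive event: sender, send sequence number,
        // receiver, and the order in which the message was delivered
        record Determinant(int source, int ssn, int dest, int rsn) {}
        record Packet(byte[] payload, Set<Determinant> piggybacked) {}

        private final Set<Determinant> unstable = new HashSet<>();
        private final int myId;
        private int nextRsn = 0;

        CausalProcess(int myId) { this.myId = myId; }

        Packet send(byte[] payload) {
            // piggyback the determinants the receiver will causally depend on
            return new Packet(payload, Set.copyOf(unstable));
        }

        void receive(int source, int ssn, Packet p) {
            unstable.addAll(p.piggybacked());  // replicate sender's determinants
            unstable.add(new Determinant(source, ssn, myId, nextRsn++));
            // deliver p.payload() to the application (elided)
        }
    }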
KAN – A Reliable Distributed Object System
Developed at UC Santa Barbara
Project goals:
• Language support for parallelism and distribution
• Transparent location, migration, and replication
• Optimized method invocation
• Fault tolerance
• Composition and proof reuse
System Description
[Figure: the Kan compiler translates Kan source into Java bytecode plus the Kan run-time libraries; the result runs on multiple JVMs that communicate over UNIX sockets.]
Fault Tolerance in Kan
Log-based forward recovery scheme: the log of recovery information for a node is maintained externally on other nodes. Failed nodes are recovered to their pre-failure states, and the correct nodes keep their states as of the time of the failures.
Only node crash failures are considered: a processor stops taking steps, and failures are eventually detected.
Basic Architecture of the Fault Tolerance Scheme
[Figure: each physical node i hosts logical nodes (x, y), a fault detector, a failure handler, and a request handler on top of a communication layer; physical nodes are reached over the network by IP address, and recovery information is kept in an external log on other nodes.]
Logical Ring
Use a logical ring to minimize the need for global synchronization and recovery. The ring is used only for logging (of remote method invocations).
Two parts:
• a static part containing the active correct nodes; it has a leader and a sense of direction: upstream and downstream
• a dynamic part containing the nodes that are trying to join the ring
A logical node is logged at the next T physical nodes in the ring, where T is the maximum number of node failures to tolerate (see the placement sketch below).
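A small sketch of that placement rule (all names illustrative): given the ring order and the physical node a logical node maps to, the log is replicated on the next T physical nodes downstream.

    import java.util.*;

    class RingPlacement {
        /** The T physical nodes that keep the log for the node at `home`. */
        static List<Integer> logSites(List<Integer> view, int home, int t) {
            List<Integer> sites = new ArrayList<>();
            int start = view.indexOf(home);
            for (int k = 1; k <= t; k++) {
                sites.add(view.get((start + k) % view.size()));  // wrap around
            }
            return sites;
        }

        public static void main(String[] args) {
            // ring of 5 physical nodes, tolerating T = 2 failures
            System.out.println(logSites(List.of(0, 1, 2, 3, 4), 3, 2));  // [4, 0]
        }
    }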
Logical Ring Maintenance
Each node i participating in the protocol maintains the following variables:
• Failed_i(j): true if i has detected the failure of j
• Map_i(x): the physical node on which logical node x resides
• Leader_i: i's view of the leader of the ring
• View_i: i's view of the logical ring (membership and order)
• Pending_i: the set of physical nodes that i suspects of having failed
• Recovery_count_i: the number of logical nodes that still need to be recovered
• Ready_i: records whether i is active; there is an initial set of ready nodes, and new nodes become ready when they are linked into the ring
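Collected into one illustrative Java class (the subscripted variables become fields and maps):

    import java.util.*;

    class NodeState {
        final Map<Integer, Boolean> failed = new HashMap<>();  // Failed_i(j)
        final Map<Integer, Integer> map = new HashMap<>();     // Map_i(x): logical -> physical
        int leader;                                            // Leader_i
        List<Integer> view = new ArrayList<>();                // View_i: membership and order
        final Set<Integer> pending = new HashSet<>();          // Pending_i: suspected nodes
        int recoveryCount;                                     // Recovery_count_i
        boolean ready;                                         // Ready_i: linked into the ring?
    }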
Failure Handling
When node i is informed of the failure of node j:
• If every node upstream of i has failed, then i must become the new leader. It remaps all logical nodes from the failed upstream physical nodes, informs the other correct nodes by sending a remap message, and then recovers those logical nodes.
• If the leader has failed but there is some upstream node k that will become the new leader, then i just updates its map and leader variables to reflect the new situation.
• If the failed node j is upstream of i, then i just updates its map. If i is the next downstream node from j, it also recovers the logical nodes from j.
• If j is downstream of i and there is some node k downstream of j, then i just updates its map.
• If j is downstream of i and there is no node downstream of j, then i waits for the leader to update the map.
• If i is the leader and must recover j, then it changes its map, sends a remap message to update the correct nodes' maps, and recovers all logical nodes that are now mapped locally.
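A compilable sketch of this case analysis, reusing the NodeState fields sketched above; the ring-geometry tests and the remap and recovery actions are stubs, since the slides leave their details open.

    class FailureHandler {
        private final int me;
        private final NodeState s;

        FailureHandler(int me, NodeState s) { this.me = me; this.s = s; }

        void onFailure(int j) {
            s.failed.put(j, true);
            if (allUpstreamFailed()) {
                s.leader = me;                    // become the new leader:
                remapFrom(j);                     // remap upstream logical nodes,
                broadcastRemap();                 // tell the other correct nodes,
                recoverLogicalNodesOf(j);         // then recover
            } else if (j == s.leader) {
                s.leader = newLeaderUpstream();   // some upstream k takes over;
                remapFrom(j);                     // just fix map and leader
            } else if (isUpstream(j)) {
                remapFrom(j);                     // fix the map; recover j's
                if (nextDownstreamOf(j) == me)    // nodes if i is j's next
                    recoverLogicalNodesOf(j);     // downstream node
            } else if (existsDownstreamOf(j)) {
                remapFrom(j);                     // downstream failure: fix map
            }
            // otherwise j was the last downstream node: wait for the leader's
            // remap message (the leader itself remaps and recovers j)
        }

        // stubs standing in for machinery the slides assume but do not define
        private boolean allUpstreamFailed()          { return false; }
        private boolean isUpstream(int j)            { return false; }
        private boolean existsDownstreamOf(int j)    { return false; }
        private int     newLeaderUpstream()          { return s.leader; }
        private int     nextDownstreamOf(int j)      { return -1; }
        private void    remapFrom(int j)             {}
        private void    broadcastRemap()             {}
        private void    recoverLogicalNodesOf(int j) {}
    }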
Physical Node and Leader Recovery
When a physical node comes back up, it sends a join message to the leader, and the leader tries to link the node into the ring:
• Acquire <-> Grant
• Add, Ack_add
• Release
When the leader fails, the next downstream node in the ring becomes the new leader.
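One plausible reading of that handshake, drawn as a message sequence (the message names are from the slide; the participants and ordering are our assumption):

    recovering node            leader               ring neighbors
         |---- join ----------->|                         |
         |                      |---- Acquire ----------->|  lock the splice point
         |                      |<--- Grant --------------|
         |                      |---- Add --------------->|  link the new node in
         |                      |<--- Ack_add ------------|
         |                      |---- Release ----------->|  unlock; node is ready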
AQuA
Adaptive Quality of Service Availability. Developed at UIUC and BBN.
Goal: allow distributed applications to request and obtain a desired level of availability.
Fault tolerance:
• replication
• reliable messaging
Features of AQuA
Uses the QuO runtime to process and make availability requests.
Uses the Proteus dependability manager to configure the system in response to faults and availability requests.
Uses Ensemble to provide group communication services.
Provides a CORBA interface to application objects through the AQuA gateway.
Proteus functionality
How to provide fault tolerance for an application:
• style of replication (active, passive)
• voting algorithm to use
• degree of replication
• type of faults to tolerate (crash, value, or time)
• location of replicas
How to implement the chosen fault-tolerance scheme:
• dynamic configuration modification
• start/kill replicas; activate/deactivate monitors and voters
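These choices can be pictured as one configuration record per application; the sketch below is entirely illustrative (AQuA's real interfaces are CORBA-based and not shown here).

    import java.util.*;

    record FaultToleranceConfig(
            ReplicationStyle style,       // active or passive replication
            VotingAlgorithm voter,        // which voting algorithm to use
            int degreeOfReplication,      // how many replicas to run
            Set<FaultType> tolerated,     // crash, value, and/or time faults
            List<String> replicaHosts) {  // where to place the replicas

        enum ReplicationStyle { ACTIVE, PASSIVE }
        enum VotingAlgorithm  { MAJORITY, FIRST_RESPONSE }
        enum FaultType        { CRASH, VALUE, TIME }
    }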
Group structure
For reliable multicast and point-to-point communication:
Replication groups
Connection groups
Proteus Communication Service group, for the replicated Proteus manager
• contains the replicas and the objects that communicate with the manager
• e.g., notification of view changes, new QuO requests
• ensures that all replica managers receive the same information
Point-to-point groups
• Proteus manager to object factory
AQuA Architecture
Fault Model, Detection, and Handling
Object fault model:
Object crash failure - occurs when an object stops sending out messages; its internal state is lost
• a crash failure of an object is due to the crash of at least one element composing the object
Value faults - a message arrives in time but with the wrong content (caused by the application or the QuO runtime)
• detected by a voter
Time faults
• detected by a monitor
Leaders report faults to Proteus; Proteus will kill faulty objects if necessary and generate new objects.
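A minimal majority voter of the kind assigned to value-fault detection here, as an illustrative sketch (not AQuA's actual voter): replies that disagree with the majority value are the value faults to report.

    import java.util.*;

    class MajorityVoter {
        /** Returns the majority reply, or empty if no value has a majority. */
        static Optional<String> vote(List<String> replies) {
            Map<String, Integer> counts = new HashMap<>();
            for (String r : replies) counts.merge(r, 1, Integer::sum);
            return counts.entrySet().stream()
                    .filter(e -> e.getValue() > replies.size() / 2)
                    .map(Map.Entry::getKey)
                    .findFirst();
        }

        public static void main(String[] args) {
            // one replica returns a wrong value: the voter masks it, and the
            // dissenting replica can be reported to Proteus as value-faulty
            System.out.println(vote(List.of("42", "42", "41")));  // Optional[42]
        }
    }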
AQuA Gateway Structure
Egida
Developed at UT Austin.
An object-oriented, extensible toolkit for low-overhead fault tolerance.
Provides a library of objects that can be used to compose log-based rollback recovery protocols.
Provides a specification language to express arbitrary rollback-recovery protocols.
Log-based Rollback Recovery
Checkpointing
• independent, coordinated, or induced by specific patterns of communication
Message Logging
• pessimistic, optimistic, causal
Core Building Blocks
Almost all log-based rollback recovery protocols share an event-driven structure.
The common events are:
• Non-deterministic events (the source of determinants and, when lost, of orphans)
• Dependency-generating events
• Output-commit events
• Checkpointing events
• Failure-detection events
A grammar for specifying rollback-recovery protocols
Protocol := <non-det-event-stmt>* <output-commit-event-stmt>* <dep-gen-event-stmt> <ckpt-stmt>opt <recovery-stmt>opt

<non-det-event-stmt> := <event> : determinant : <determinant-structure>
                        <Log <event-info-list> <how-to-log> on <stable-storage>>opt

<output-commit-event-stmt> := <output-commit-proto> output commit on <event-list>

<event> := send | receive | read | write
<determinant-structure> := {source, ssn, dest, rsn}
<output-commit-proto> := independent | co-ordinated
<how-to-log> := synchronously | asynchronously
<stable-storage> := local disk | volatile memory of self
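As an illustration (our reading of the grammar, not an example taken from the slides), a pessimistic receiver-based message-logging protocol could be specified as:

    receive : determinant : {source, ssn, dest, rsn}
              <Log <message> synchronously on local disk>
    independent output commit on <send>

i.e., every receive event's determinant and contents are forced to local disk synchronously, which matches the pessimistic protocol described earlier.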
Egida Modules
• EventHandler
• Determinant
• HowToOutputCommit
• LogEventDeterminant
• LogEventInfo
• HowToLog
• WhereToLog
• StableStorage
• VolatileStorage
• Checkpointing
• …