Fault-containment in Weakly Stabilizing Systems Anurag Dasgupta Sukumar Ghosh Xin Xiao University of Iowa

Post on 04-Jan-2016

Fault-containment in Weakly Stabilizing Systems

Anurag Dasgupta, Sukumar Ghosh, Xin Xiao

University of Iowa

Preview

• Weak stabilization (Gouda 2001) guarantees only reachability of a legal configuration (from every configuration, some computation leads to a legal one) and closure of the set of legal configurations. Once the system is “stable”, a minor perturbation apparently comes with no recovery guarantee, let alone “efficient recovery”.

• We take a weakly stabilizing leader election algorithm, and add fault-containment to it.

Our contributions

• An exercise in adding fault-containment to a weakly stabilizing leader election algorithm on a line topology. Processes are anonymous.

• Containment time = O(1) from all single failures

• As m → ∞, the contamination number is O(1) (precisely 4), where m is a tuning parameter

(Contamination number = max. no. of non-faulty processes that change their states during recovery)

The big picture

Model and Notations

Consider n processes in a line topology.
N(i) = neighbors of process i
Variable P(i) ∈ N(i) ∪ {⊥} (parent of i)
Macro C(i) = {q ∈ N(i) : P(q) = i} (children of i)
Predicate Leader(i) ≡ (P(i) = ⊥)

Legal configuration:
• For exactly one process i: P(i) = ⊥
• ∀ j ≠ i: P(j) = k ⇒ P(k) ≠ j

[Figure: a line of processes showing node i, its parent P(i), its children C(i), and the leader]
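The notation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: processes are numbered 0..n-1 along the line, `None` stands for ⊥, and the helper names (`neighbors`, `children`, `is_legal`) are hypothetical, not taken from the slides.

```python
from typing import List, Optional

Parent = Optional[int]  # None plays the role of ⊥ (no parent, i.e. leader)


def neighbors(i: int, n: int) -> List[int]:
    """N(i) on a line of n processes numbered 0..n-1 (assumed numbering)."""
    return [q for q in (i - 1, i + 1) if 0 <= q < n]


def children(i: int, P: List[Parent]) -> List[int]:
    """C(i) = {q in N(i) : P(q) = i}."""
    return [q for q in neighbors(i, len(P)) if P[q] == i]


def is_leader(i: int, P: List[Parent]) -> bool:
    """Leader(i) holds exactly when P(i) = ⊥."""
    return P[i] is None


def is_legal(P: List[Parent]) -> bool:
    """Exactly one leader, and no non-leader j has P(j) = k with P(k) = j."""
    n = len(P)
    if sum(1 for i in range(n) if is_leader(i, P)) != 1:
        return False
    for j in range(n):
        k = P[j]
        if k is None:
            continue
        # Parents must be neighbors, and two nodes may not point at each other.
        if k not in neighbors(j, n) or P[k] == j:
            return False
    return True
```

For example, `[1, 2, None, 2, 3]` describes five processes whose parent pointers all lead to the leader at position 2.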

Model and Notations

Shared memory model and central scheduler
Weak fairness of the scheduler
Guarded action by a process: g → A
Computation is a sequence of (global) states and state transitions

[Figure: a line of processes showing node i, its parent P(i), its children C(i), and the leader]

Stabilization

A stable (or legal) configuration satisfies a predicate LC defined in terms of the primary variables p that are observable by the application. However, fault-containment often needs secondary variables s (a.k.a. auxiliary or state variables). Thus,

Local state of process i = (pi, si)
Global state of the system = (p, s), where p = the set of all pi, and s = the set of all si

(p, s) ∈ LC ⇔ p ∈ LCp and s ∈ LCs

Definitions

Containment time is the maximum time needed to establish LCp from a 1-faulty configuration

Containment in space means that the primary variables of only O(1) processes change during recovery from any 1-faulty configuration

Fault-gap is the time to reach LC (both LCp and LCs) from any 1-faulty configuration

[Figure: timeline marking the points "LCp restored" and "LCs restored", with the fault-gap labeled]

Weakly stabilizing leader election

We start from the weakly stabilizing leader election algorithm by Devismes, Tixeuil, and Yamashita [ICDCS 2007], and then modify it to add fault-containment. Here is the DTY algorithm for an array of processes.

DTY algorithm: Program for any process in the array

Guarded actions:

R1 :: not leader ∧ N(i) = C(i) → be a leader

R2 :: not leader ∧ N(i) \ (C(i) ∪ {P(i)}) ≠ ∅ → switch parent

R3 :: leader ∧ N(i) ≠ C(i) → parent := k : k ∉ C(i)
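As a rough illustration of the three DTY guards, the sketch below reports which rule (if any) is enabled at a process. It assumes a line numbered 0..n-1 with `None` standing for ⊥; the function name `dty_enabled` is hypothetical and the code is a reading of the guards above, not the authors' implementation.

```python
from typing import List, Optional


def neighbors(i: int, n: int) -> List[int]:
    """N(i) on a line of n processes numbered 0..n-1 (assumed numbering)."""
    return [q for q in (i - 1, i + 1) if 0 <= q < n]


def dty_enabled(i: int, P: List[Optional[int]]) -> Optional[str]:
    """Return the name of the DTY rule enabled at process i, or None."""
    N = set(neighbors(i, len(P)))
    C = {q for q in N if P[q] == i}          # children of i
    leader = P[i] is None
    if not leader and N == C:
        return "R1"                           # every neighbor is a child
    if not leader and N - C - {P[i]}:
        return "R2"                           # a neighbor is neither child nor parent
    if leader and N != C:
        return "R3"                           # a leader with a non-child neighbor
    return None


# Legal configuration with leader 2, then the same line after the leader's
# parent pointer is corrupted to point at node 3.
legal = [1, 2, None, 2, 3]
faulty = [1, 2, 3, 2, 3]
```

In the legal configuration no guard is enabled, while in the faulty one R1 is enabled at both node 2 and node 3, so either of them can restore a single leader in one move.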

Effect of a single failure

With a randomized scheduler, the weakly stabilizing system will recover to a legal configuration with probability 1. However, if a single failure occurs, the recovery time can be as large as n (using situations similar to Gambler’s ruin). For fault-containment, we need something better.

We bias a randomized scheduler to achieve our goal. The technique is borrowed from [Dasgupta, Ghosh, Xiao: SSS 2007]. Here we show that the technique is indeed powerful enough to solve a larger class of problems.

Biasing a random scheduler

For fault-containment, each process i uses a secondary variable x(i). A node i updates its primary variable P(i) when the following conditions hold:

• The guard involving the primary variables is true
• The randomized scheduler chooses i
• x(i) ≥ x(k), where k ∈ N(i)

Biasing a random scheduler

After the action, x(i) is updated as x(i) := max {x(q) : q ∈ N(i)} + m, m ∈ Z+ (call it update x(i); here m is a tuning parameter). When x(i) < x(k) but the first two conditions hold, the primary variable P(i) remains unchanged; only x(i) is incremented by 1 (call it increment x(i)).

[Figure: two before/after examples on nodes i, j, k with m = 5. UPDATE x(i): x(i)=10, x(j)=8, x(k)=7 becomes x(i)=13, x(j)=8, x(k)=7. INCREMENT x(i): x(i)=10, x(j)=8, x(k)=7 becomes x(i)=10, x(j)=8, x(k)=8.]
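The two operations on the secondary variable can be sketched as follows. The values mirror the example on this slide (m = 5); the topology i, j, k in a line with j in the middle is an assumption made for the sketch, and the names `update_x`, `increment_x` and `biased_step` are hypothetical.

```python
from typing import Dict, List


def update_x(i: str, x: Dict[str, int], nbrs: List[str], m: int) -> None:
    """update x(i): x(i) := max over q in N(i) of x(q), plus m."""
    x[i] = max(x[q] for q in nbrs) + m


def increment_x(i: str, x: Dict[str, int]) -> None:
    """increment x(i): x(i) := x(i) + 1."""
    x[i] += 1


def biased_step(i: str, k: str, x: Dict[str, int], nbrs: List[str], m: int) -> bool:
    """Return True if i may change P(i) towards k; adjust x(i) either way."""
    if x[i] >= x[k]:
        update_x(i, x, nbrs, m)   # the move is taken and x(i) jumps ahead
        return True
    increment_x(i, x)             # the move is postponed; x(i) creeps up by 1
    return False


# Assumed line topology: i has the single neighbor j, and so does k.
x = {"i": 10, "j": 8, "k": 7}
moved_i = biased_step("i", "j", x, nbrs=["j"], m=5)   # 10 >= 8: update
moved_k = biased_step("k", "j", x, nbrs=["j"], m=5)   # 7 < 8: increment only
```

A larger m makes a process that has just moved much less likely to move again soon, which is the bias the slides rely on.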

The Algorithm

Algorithm 1 (containment) : program for process i

Guarded actions:

R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥

R2 :: (P(i) = ⊥) ∧ (∃k ∈ N(i) \ C(i)) → P(i) := k

R3a :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ x(i) ≥ x(k) → P(i) := k; update x(i)

R3b :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ x(i) < x(k) → increment x(i)

R4a :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) = ⊥) ∧ x(i) ≥ x(k) → P(i) := k

R4b :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) = ⊥) ∧ x(i) < x(k) → increment x(i)

R5 :: (P(i) = j) ∧ (P(j) = ⊥) ∧ (∃k ∈ N(i) : P(k) ∉ {i, ⊥}) → P(i) := k
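To make the guards concrete, here is a minimal sketch that lists the rules enabled at a process. It is one possible reading of the guards, not the authors' code. Assumptions: the line is numbered 0..n-1, `None` stands for ⊥, and in R3-R5 the witness k is taken to differ from the current parent j (as in DTY rule R2); without that assumption a legal configuration would still have enabled guards. The name `enabled_moves` is hypothetical.

```python
from typing import List, Optional, Tuple

Move = Tuple[str, Optional[int]]  # (rule name, target k or None)


def neighbors(i: int, n: int) -> List[int]:
    return [q for q in (i - 1, i + 1) if 0 <= q < n]


def enabled_moves(i: int, P: List[Optional[int]], x: List[int]) -> List[Move]:
    """List the (rule, k) pairs whose guard holds at process i."""
    N = neighbors(i, len(P))
    C = [q for q in N if P[q] == i]
    if P[i] is None:
        # R2: a leader with a neighbor that is not its child adopts it as parent.
        return [("R2", k) for k in N if k not in C]
    moves: List[Move] = []
    j = P[i]
    if set(N) == set(C):
        moves.append(("R1", None))            # all neighbors are children
    for k in N:
        if k == j:                            # ASSUMPTION: k is not the current parent
            continue
        if P[k] is not None and P[k] != i:    # P(k) not in {i, ⊥}
            moves.append(("R3a" if x[i] >= x[k] else "R3b", k))
            if P[j] is None:
                moves.append(("R5", k))
        if P[k] is None:                      # k is a leader
            moves.append(("R4a" if x[i] >= x[k] else "R4b", k))
    return moves


# Six processes with leader 0; then node 3's parent is corrupted from 2 to 4.
legal = [None, 0, 1, 2, 3, 4]
faulty = [None, 0, 1, 4, 3, 4]
```

With all x values equal, node 3 in the faulty line can repair itself through R3a, while node 4 is tempted to become a leader through R1; raising x(2) above x(3) turns node 3's option into R3b, which only increments x(3).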

Analysis of containment

Consider six cases

1. Fault at the leader

2. Fault at distance-1 from the leader

3. Fault at distance-2 from the leader

4. Fault at distance-3 from the leader

5. Fault at distance-4 from the leader

6. Fault at distance-5 or greater from the leader

Case 1: fault at leader node

[Figure: three snapshots of a line of nodes 0-7 after a fault at the leader. Annotations: "R1 applied by node 5"; "R1 applied by node 4: node 4 is the new leader".]

R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥

Case 2: fault at distance-1 from the leader node

[Figure: three snapshots of a line of nodes 0-7 after a fault at distance 1 from the leader. Annotations: "R1: node 3"; "R2: node 5".]

R2 :: (P(i) = ⊥) ∧ (∃k ∈ N(i) \ C(i)) → P(i) := k

Case 5: fault at distance-4 from the leader node

[Figure: five snapshots of a line of nodes 0-7 after a fault at distance 4 from the leader, ending in a stable configuration. Annotations: "R4a(2): x(2) > x(1)"; "R5(4)"; "R2(5)"; "R3a(3): x(3) > x(2)"; "stable".]

Non-faulty processes up to distance 4 from the faulty node are affected.

R4a :: (P(i) = j) ∧ (∃k ∈ N(i) : P(k) = ⊥) ∧ x(i) ≥ x(k) → P(i) := k

Case 6: fault at distance ≥ 5 from the leader node

[Figure: five snapshots of a line of nodes 0-7 after a fault at distance ≥ 5 from the leader, with the current leader marked. Annotations: "R4a(2): x(2) > x(1)"; "R3a(3); R5(2)"; "R2(1)"; "R3a(3): x(3) > x(2), x(4)"; "Recovery complete".]

With a high m, it is difficult for node 4 to change its parent, but node 3 can easily do it.

Fault-containment in space

Theorem 1. As m → ∞, the effect of a single failure is restricted to within distance 4 of the faulty process, i.e., the algorithm is spatially fault-containing.

Proof idea. Uses an exhaustive case-by-case analysis. The worst case occurs when a node at distance 4 from the leader node fails, as shown earlier.

Fault-containment in time

Theorem 2. The expected number of steps needed to contain a single fault is independent of n. Hence algorithm containment is fault-containing in time.

Proof idea. Case-by-case analysis. When a node beyond distance 4 from the leader fails, its impact on the time complexity remains unchanged.

Fault-containment in time

Case 1: leader fails

[Figure: line of nodes 0-7 after the leader fails]

Recovery completed in a single move regardless of whether node 3 or 4 executes a move.

Case 2: A node i at distance 1 from the leader fails.

• P(i) becomes ⊥: recovery completed in one step

• P(i) switches to a new parent: recovery time = 2 + ∑_{n=1}^{∞} n/2^n = 4
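The value 4 in Case 2 follows from the standard identity ∑ n/2^n = 2. A quick numeric check with exact fractions is sketched below; the helper name `partial_sum` is hypothetical.

```python
from fractions import Fraction


def partial_sum(terms: int) -> Fraction:
    """Exact value of sum_{n=1}^{terms} n / 2^n."""
    return sum((Fraction(n, 2 ** n) for n in range(1, terms + 1)), Fraction(0))


# The partial sums equal 2 - (N + 2) / 2^N, so they approach 2 from below,
# and the expected recovery time 2 + sum approaches 4.
approx_recovery_time = float(2 + partial_sum(60))
```

After 60 terms the partial sum differs from 2 by less than 1e-15, so `approx_recovery_time` is numerically indistinguishable from 4.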

Fault-containment in time

Summary of expected containment times

Fault location       P(i) becomes ⊥    P(i) switches
Fault at leader      -                 1
Fault at dist-1      1                 4
Fault at dist-2      2                 151/108
Fault at dist-3      131/54            115/36
Fault at dist-4      10/9              29/27
Fault at dist ≥ 5    33/32             115/36

Thus, the expected containment time is O(1)

Another proof of convergence

Theorem 3. The proposed algorithm recovers from all single faults to a legal configuration in O(1) time.

Proof (using the martingale convergence theorem)

A martingale is a sequence of random variables X1, X2, X3, … s.t. ∀n
• E(|Xn|) < ∞, and
• E(Xn+1 | X1 … Xn) = Xn (for a super-martingale use ≤ for =, and for a sub-martingale use ≥ for =)

We use the following corollary of the martingale convergence theorem:

Corollary. If Xn ≥ 0 is a super-martingale then as n → ∞, Xn converges to X with probability 1, and E(X) ≤ E(X0).

Proof of convergence (continued)

Let Xi be the number of processes with enabled guards in step i. After 0 or 1 failure, X can be 0, 2, or 3 (exhaustive enumeration).

When Xi = 0, Xi+1 = 0 (already stable)
When Xi = 2, E(Xi+1) = 1/2 × 1 + 1/2 × 2 = 3/2 ≤ 2
When Xi = 3, E(Xi+1) = 1/3 × 0 + 1/3 × 2 + 1/3 × 4 = 2 ≤ 3

Thus X1, X2, X3, … is a super-martingale. Using the Corollary, as n → ∞, E(Xn) ≤ E(X0). Since X is non-negative by definition, Xn converges to 0 with probability 1, and the system stabilizes.

Proof idea of weak stabilization

DTY algorithm: R1, R2, R3 (weakly stabilizing)

Our algorithm: R1, R2, R3, R4, R5 (weakly stabilizing)

Our algorithm executes the same action (P(i) := k) as in DTY, but the guards are biased differently.

Stabilization from multiple failures

Theorem 4. When m → ∞, the expected recovery time from multiple failures is O(1) if the faults occur at distance 9 or more apart.

Proof sketch. Since the contamination number is 4, no non-faulty process is influenced by both failures.

[Figure: two faults on the line, each with a contamination range of 4 on either side]
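The distance bound in the theorem can be read as a simple arithmetic condition: two contamination ranges of radius 4 share no node when the faults are at least 4 + 4 + 1 = 9 positions apart. A small sketch of that check, with positions taken as indices on the line and a hypothetical helper name `zones_disjoint`:

```python
from typing import Iterable


def zones_disjoint(faults: Iterable[int], radius: int = 4) -> bool:
    """True if no node lies within `radius` of two different fault positions."""
    pts = sorted(faults)
    # Adjacent faults must be at least 2 * radius + 1 apart (9 for radius 4).
    return all(b - a >= 2 * radius + 1 for a, b in zip(pts, pts[1:]))
```

For instance, faults at positions 3 and 12 are far enough apart, whereas faults at 3 and 11 are not, because node 7 is at distance 4 from both.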

Conclusion

1. With increasing m, the containment in space is tighter, but stabilization from arbitrary initial configurations slows down.

2. LCs = true, so the system is ready to deal with the next single failure as soon as LCp holds. This reduces the fault-gap and increases system availability.

3. The unbounded secondary variable x can be bounded using the technique discussed in [Dasgupta, Ghosh, Xiao: SSS 2007].

4. It is possible to extend this algorithm to a tree topology (but we did not do it here).