Chapter 7 - Local Stabilization 1
Chapter 7: roadmap
7.1 Super stabilization
7.2 Self-Stabilizing Fault-Containing Algorithms
7.3 Error-Detection Codes and Repair
Introduction
We present a scheme that can be used to correct the state of algorithms for ongoing long-lived tasks.
The scheme converts a non-stabilizing algorithm for such a task into a self-stabilizing algorithm for the same task.
The Malicious Fault Model
Starting from a safe configuration c, k processors experience a transient fault, and a new configuration c’ is reached.
In this model, the states of the faulty processors can be chosen as the states that result in the longest convergence time.
The Malicious Fault Model (2)
Designing for this worst-case measure minimizes the convergence time in the worst-case scenario
However, algorithms designed for the worst-case measure may have a larger average convergence time than other algorithms
The Non-malicious Fault Model
In this model, a transient fault assigns to a processor a state chosen with equal probability from the state space of that processor
Average Convergence Time
Pr(c, k, c’) : The probability of reaching a particular configuration c’ from a safe configuration c due to the occurrence of k faults
WorstCase(c) : The maximal number of cycles before the system reaches a safe configuration when it starts in c
Average Convergence Time (2)
The average convergence time following the occurrence of k non-malicious transient faults is:
$\sum_{c'} \Pr(c, k, c') \cdot \mathrm{WorstCase}(c')$
computed over all possible configurations c’
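The sum above can be evaluated directly once the probabilities and the worst-case times are known. The following is a minimal sketch; the function name, the configuration labels and the numbers are made up for illustration and do not come from the chapter.

```python
import math

def average_convergence_time(pr, worst_case):
    """Return the sum over c' of Pr(c, k, c') * WorstCase(c').

    pr         -- maps each reachable configuration c' to Pr(c, k, c')
    worst_case -- maps each configuration c' to WorstCase(c'), in cycles
    """
    if not math.isclose(sum(pr.values()), 1.0):
        raise ValueError("the probabilities Pr(c, k, c') must sum to 1")
    return sum(p * worst_case[c_prime] for c_prime, p in pr.items())

# Hypothetical example: three configurations reachable after k faults.
pr = {"c1": 0.5, "c2": 0.25, "c3": 0.25}
worst_case = {"c1": 0, "c2": 4, "c3": 8}
average = average_convergence_time(pr, worst_case)
print(average)  # 3.0
```

In the malicious model the corresponding measure would be the maximum of WorstCase(c’) over the same configurations, which is 8 in this toy example.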
Error Detection Codes
We use error-detection codes to reduce average convergence time
For each processor we maintain a variable ErrorDetect that holds the error-detection code ed of its current state s
The error-detecting function computes a pair <s, ed> given s
Converting the Algorithm
Replace every step a by a step a’ that does the following:
1. Examines whether the value of ErrorDetect fits the current state
2. If (1) holds, execute a
3. Otherwise, execute a special repair step a’’
4. Compute the new ed’ by using the error-detecting function on the resulting state s’
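The four steps above can be sketched as a wrapper around an arbitrary step function. This is only an illustration: the use of CRC-32 (via `zlib.crc32`) as the error-detection code, and the names `error_detect`, `converted_step`, `step` and `repair`, are assumptions made for the example, not the scheme's prescribed choices.

```python
import zlib

def error_detect(state):
    """Error-detecting function: given s, return the pair <s, ed>.

    CRC-32 over repr(state) is an assumed, illustrative code.
    """
    ed = zlib.crc32(repr(state).encode("utf-8"))
    return state, ed

def converted_step(state, ed, step, repair):
    """Step a': check ErrorDetect, then run a or the repair step a''."""
    _, expected = error_detect(state)
    if ed == expected:            # 1. ErrorDetect fits the current state
        new_state = step(state)   # 2. execute the original step a
    else:
        new_state = repair(state) # 3. execute the repair step a''
    return error_detect(new_state)  # 4. compute ed' for the new state s'

# Hypothetical step and repair functions on an integer state.
increment = lambda s: s + 1
reset_to_zero = lambda s: 0

s, ed = error_detect(41)
ok = converted_step(s, ed, increment, reset_to_zero)       # code fits
bad = converted_step(s, ed ^ 1, increment, reset_to_zero)  # code corrupted
print(ok[0], bad[0])  # 42 0
```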
Converting the Algorithm (2)
A transient fault can corrupt all the memory bits of a processor
Thus, the probability that the value of ErrorDetect fits the state of the faulty processor decreases as the number of bits in ErrorDetect increases
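A small exhaustive check makes this concrete. Assume, as in the non-malicious model, that a fault sets every bit (state and ErrorDetect) uniformly at random, and that the code has b bits. Each state has exactly one fitting code value, so the probability that a corrupted processor passes the check is 2^-b. The truncated CRC-32 used as the code below and the function names are assumptions for illustration only.

```python
import zlib
from fractions import Fraction

def code(state, code_bits):
    """Illustrative error-detection code: low bits of CRC-32 of one byte."""
    return zlib.crc32(bytes([state])) & ((1 << code_bits) - 1)

def fit_probability(state_bits, code_bits):
    """Fraction of all (state, ErrorDetect) pairs in which the code fits.

    state_bits must be at most 8 because a state is encoded as one byte.
    """
    fits = sum(
        1
        for s in range(1 << state_bits)
        for ed in range(1 << code_bits)
        if code(s, code_bits) == ed
    )
    total = (1 << state_bits) * (1 << code_bits)
    return Fraction(fits, total)

for b in (1, 3, 8):
    print(b, fit_probability(4, b))  # 1/2, 1/8, 1/256
```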
Pyramids
Every processor Pi maintains a pyramid ∆i = vi[0], vi[1], vi[2], …, vi[d] of views, where vi[h] is a view of all the processors within distance at most h from Pi, h time units ago.
In particular, vi[d] is a view of the entire system, d time units ago.
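To make the definition concrete, the sketch below builds the pyramid of one processor from a global history of configurations. A real processor never has access to such a history; it is used here only to show what each view contains. The names `build_pyramid`, `graph` and `history`, and the assumption that d is at least the diameter of a connected graph, are illustrative.

```python
from collections import deque

def distances_from(graph, src):
    """Breadth-first distances from src in an undirected graph."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def build_pyramid(graph, history, pid, d):
    """Return [v[0], ..., v[d]] for processor pid.

    history[t][p] is the state of p at time t; the last entry is "now".
    v[h] maps every processor within distance h of pid to its state
    h time units ago.
    """
    now = len(history) - 1
    dist = distances_from(graph, pid)
    return [
        {p: history[now - h][p] for p, dp in dist.items() if dp <= h}
        for h in range(d + 1)
    ]

# Hypothetical path graph 0 - 1 - 2 with three recorded configurations.
graph = {0: [1], 1: [0, 2], 2: [1]}
history = [
    {0: "a0", 1: "b0", 2: "c0"},  # two time units ago
    {0: "a1", 1: "b1", 2: "c1"},  # one time unit ago
    {0: "a2", 1: "b2", 2: "c2"},  # now
]
pyramid = build_pyramid(graph, history, pid=0, d=2)
```

Here `pyramid[0]` holds only the current state of processor 0, while `pyramid[2]` is the entire configuration two time units ago.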
[Figures: the views in the pyramid of processor V1; the vertices covered by each view are colored]
V1[0] : View of V1 now.
V1[1] : View of colored vertices, one time unit ago.
V1[2] : View of colored vertices, two time units ago.
V1[3] : View of colored vertices, three time units ago.
V1[4] : View of the entire system, four time units ago.
V1[5] and V1[6] are views of the entire system as well; they differ only in the time at which the views were taken.
Neighboring Pyramids
Neighboring processors exchange pyramids between themselves, and check agreement on the shared portions
If shared portions are equal, then all the v[d] views are equal
In addition, every processor checks that vi[d] is a consistent configuration for the input algorithm AL and the current task (the configuration is reachable from the initial state of AL)
Checking Consistent Configuration
Pi checks that its state in the view vi[h], for 0 ≤ h ≤ d-1, is obtained by executing AL using the states of Pi and its neighbors in vi[h+1].
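This check can be sketched as a loop over adjacent views. The representation of a view as a dictionary, the function name `is_consistent`, and the toy algorithm `max_step` standing in for AL are all assumptions made for the example.

```python
def is_consistent(pyramid, pid, neighbors, step):
    """Check that pid's state in v[h] follows from v[h+1] for 0 <= h <= d-1.

    pyramid   -- list of views; each view maps processor id -> state
    neighbors -- ids of the neighbors of pid
    step      -- stand-in for AL: step(own_state, neighbor_states) -> state
    """
    for h in range(len(pyramid) - 1):
        older = pyramid[h + 1]
        expected = step(older[pid], [older[n] for n in neighbors])
        if pyramid[h][pid] != expected:
            return False
    return True

def max_step(own, neighbor_states):
    """Toy AL: adopt the maximum value seen in the neighborhood."""
    return max([own] + list(neighbor_states))

# Pyramid of processor 0 on the hypothetical path graph 0 - 1 - 2 (d = 2).
good = [{0: 5}, {0: 5, 1: 5}, {0: 5, 1: 1, 2: 3}]
corrupted = [{0: 7}, {0: 5, 1: 5}, {0: 5, 1: 1, 2: 3}]

print(is_consistent(good, 0, [1], max_step))       # True
print(is_consistent(corrupted, 0, [1], max_step))  # False
```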
Updating the Pyramids
In every time unit, Pi receives the pyramid ∆j = vj[0], vj[1], vj[2], …, vj[d] of every neighbor Pj, and uses the values of vj[d-1] to construct the value of the new vi[d]
The values of vj[d-1] contain information about every processor within distance d of Pi, d-1 time units ago
In the same way, Pi uses the received values of vj[k-1], for 1 ≤ k ≤ d-1, (together with vi[k-1]) to compute vi[k]
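A minimal sketch of this update, assuming a synchronous system and views stored as dictionaries from processor id to state. The new vi[k] is the union of the old vi[k-1] and the old vj[k-1] of every neighbor, and the new vi[0] is the state Pi has just computed. The function name and the state labels are hypothetical.

```python
def update_pyramid(own, neighbor_pyramids, pid, new_state):
    """Return the pyramid of pid for the next time unit.

    own               -- current pyramid [v[0], ..., v[d]] of pid
    neighbor_pyramids -- current pyramids received from the neighbors
    new_state         -- state of pid after executing its step
    """
    d = len(own) - 1
    updated = [{pid: new_state}]          # new v[0]: the state of pid now
    for k in range(1, d + 1):
        merged = dict(own[k - 1])         # old v_i[k-1] ...
        for received in neighbor_pyramids:
            merged.update(received[k - 1])  # ... joined with old v_j[k-1]
        updated.append(merged)
    return updated

# Hypothetical path graph 0 - 1 - 2, d = 2; "a2" is the state of 0 at time 2.
p0 = [{0: "a2"}, {0: "a1", 1: "b1"}, {0: "a0", 1: "b0", 2: "c0"}]
p1 = [{1: "b2"}, {0: "a1", 1: "b1", 2: "c1"}, {0: "a0", 1: "b0", 2: "c0"}]

updated = update_pyramid(p0, [p1], pid=0, new_state="a3")
```

After the update, the top view `updated[2]` is the entire configuration at time 1, which is again two time units old.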
The Repair Scheme
First, we assume that the error-detection code identifies all the faults
In general, the faulty processors initialize their states and collect state information from non-faulty processors to reconstruct their pyramids
The Repair Scheme (2)
Let c’ be a configuration reached after several faults
The processors fall into three groups: faulty, border-non-faulty, and operating.
A processor that identifies an error assigns faulty to its local status variable and resets its pyramid
Border-Non-Faulty and Operating
The pyramid of a non-faulty processor that is neighbor to a faulty processor has almost all the information stored in the faulty processor before the fault.
Such a processor assigns its local status variable the value border-non-faulty.
The remaining non-faulty processors are defined as operating.
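The three groups can be computed from the communication graph and the set of processors that detected an error. The sketch below takes a global view purely for illustration (each real processor sets only its own status variable); the function name and the example graph are assumptions.

```python
def classify(graph, faulty):
    """Map every processor to faulty, border-non-faulty or operating.

    graph  -- adjacency lists of the communication graph
    faulty -- set of processors that detected an error
    """
    status = {}
    for p, neighbors in graph.items():
        if p in faulty:
            status[p] = "faulty"
        elif any(n in faulty for n in neighbors):
            status[p] = "border-non-faulty"  # non-faulty neighbor of a faulty one
        else:
            status[p] = "operating"
    return status

# Hypothetical path graph 1 - 2 - 3 - 4 in which processor 1 is faulty.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
status = classify(graph, faulty={1})
print(status)
```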
[Figure: a system partitioned into faulty, border-non-faulty, and operating processors]
Freezing the Pyramids
A border-non-faulty processor does not change its pyramid until all the faulty processors have finished reconstructing theirs
The Topology Collection procedure is used to verify that.
Topology Collection
Every faulty and border-non-faulty processor sends the topology it knows at that moment to its neighbors
After several rounds (the diameter of the corrupted region + 1), all the information in the pyramids of processors next to a faulty one has arrived
Topology Collection (2)
Every processor checks if there exists a faulty processor which has an edge connected to a processor with an unknown state
When this test returns false, the pyramids of the faulty processors can be reconstructed
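The test can be sketched as a scan over the edges collected so far. A processor whose status has not been learned yet is simply absent from the `status` dictionary; this representation and the function name are assumptions made for the example. The function returns True when collection is complete, that is, when the test described above returns false.

```python
def has_all_topology(edges, status):
    """True iff no faulty processor has an edge to a processor of unknown state.

    edges  -- collected undirected edges as (u, v) pairs
    status -- statuses learned so far; unknown processors are missing
    """
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if status.get(a) == "faulty" and b not in status:
                return False
    return True

# Hypothetical collected topology: 1 - 2 - 3, with processor 2 faulty.
edges = [(1, 2), (2, 3)]
partial = {1: "border-non-faulty", 2: "faulty"}
complete = {1: "border-non-faulty", 2: "faulty", 3: "border-non-faulty"}

print(has_all_topology(edges, partial))   # False: the state of 3 is unknown
print(has_all_topology(edges, complete))  # True
```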
Reconstruction
The faulty processors reconstruct their pyramids using the collected information from the other pyramids and the transition functions of the processors
Back to Operating
Using a local counter and the collected topology, the faulty and border-non-faulty processors determine when the others have finished reconstructing their pyramids
At the end of the repair process, all the processors change their status to operating
The algorithm
State variables:
Status = {operating, faulty, border non faulty}
Topology = {V , E}
Pyramid (explained earlier)
RoundCounter – counts the number of rounds since the most recent fault.
The algorithm (cont.)
Upon a clock tick:
1. If (status = operating)
  1.1 if (DetectError())
    1.1.1 status = faulty
    1.1.2 Pyramid = nil
    1.1.3 RoundCounter = 0
  1.2 else if (HaveFaultyNeighbor())
    1.2.1 status = border non faulty
    1.2.2 RoundCounter = 0
  1.3 else UpdatePyramid()
2. Else
  2.1 ExchangeLocalTopologyInformation()
  2.2 if (HasAllTopology() & status = faulty)
    2.2.1 ReconstructPyramid()
  2.3 RoundCounter++
  2.4 If (Diameter(Topology) = RoundCounter)
    2.4.1 status = operating
DetectError() : Detects if a transient error occurred, using the error-detection codes
HaveFaultyNeighbor() : True if one of the neighbors is faulty
ExchangeLocalTopologyInformation() : Sends information to the immediate neighbors, and receives information from them
HasAllTopology() : Returns true iff there is no edge from a faulty processor to a processor with an unknown state
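The clock-tick handler can be sketched as a small state machine. The procedures named on the slide are passed in as callbacks so that the control flow can be run in isolation; the class layout, the field names and the stub callbacks in the example are assumptions, not part of the algorithm's specification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional

class Status(Enum):
    OPERATING = "operating"
    FAULTY = "faulty"
    BORDER_NON_FAULTY = "border non faulty"

@dataclass
class Processor:
    detect_error: Callable[[], bool]
    have_faulty_neighbor: Callable[[], bool]
    update_pyramid: Callable[[], None]
    exchange_topology: Callable[[], None]
    has_all_topology: Callable[[], bool]
    reconstruct_pyramid: Callable[[], list]
    topology_diameter: Callable[[], int]
    status: Status = Status.OPERATING
    pyramid: Optional[list] = field(default_factory=list)
    round_counter: int = 0

    def on_clock_tick(self):
        if self.status is Status.OPERATING:                 # step 1
            if self.detect_error():                         # 1.1
                self.status = Status.FAULTY
                self.pyramid = None
                self.round_counter = 0
            elif self.have_faulty_neighbor():               # 1.2
                self.status = Status.BORDER_NON_FAULTY
                self.round_counter = 0
            else:                                           # 1.3
                self.update_pyramid()
        else:                                               # step 2
            self.exchange_topology()                        # 2.1
            if self.has_all_topology() and self.status is Status.FAULTY:
                self.pyramid = self.reconstruct_pyramid()   # 2.2
            self.round_counter += 1                         # 2.3
            if self.topology_diameter() == self.round_counter:
                self.status = Status.OPERATING              # 2.4

# Stub callbacks: an error is detected at the first tick, the collected
# topology is complete immediately, and its diameter is 2.
p = Processor(
    detect_error=lambda: True,
    have_faulty_neighbor=lambda: False,
    update_pyramid=lambda: None,
    exchange_topology=lambda: None,
    has_all_topology=lambda: True,
    reconstruct_pyramid=lambda: ["rebuilt"],
    topology_diameter=lambda: 2,
)
trace = []
for _ in range(3):
    p.on_clock_tick()
    trace.append(p.status)
```

With these stubs the processor is faulty for two ticks and returns to operating on the third, once RoundCounter equals the diameter of the collected topology.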
Undetected Faults
What happens in case the faults are not detected?
Transient fault detectors and watchdog counters are used in this situation
When an error is detected by the transient fault detector, the faulty processor starts counting while letting the repair scheme try to fix the problem
Undetected Faults (2)
When the counter reaches its upper bound, the system is examined again
If the repair failed, a system reset is triggered
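The watchdog behavior can be sketched as follows. The class name, the callbacks `repair_succeeded` and `reset_system`, and the bound of 3 ticks are hypothetical; the slides do not specify how the upper bound is chosen or how the re-examination is performed.

```python
class WatchdogCounter:
    def __init__(self, upper_bound, repair_succeeded, reset_system):
        self.upper_bound = upper_bound
        self.repair_succeeded = repair_succeeded  # re-examines the system
        self.reset_system = reset_system          # triggers a global reset
        self.count = None                         # None: watchdog is idle

    def on_error_detected(self):
        """Called when the transient fault detector reports an error."""
        if self.count is None:
            self.count = 0

    def on_clock_tick(self):
        """Count while the repair scheme runs; examine at the upper bound."""
        if self.count is None:
            return
        self.count += 1
        if self.count >= self.upper_bound:
            self.count = None
            if not self.repair_succeeded():
                self.reset_system()

# Hypothetical run in which the repair scheme fails.
resets = []
watchdog = WatchdogCounter(
    upper_bound=3,
    repair_succeeded=lambda: False,
    reset_system=lambda: resets.append("reset"),
)
watchdog.on_error_detected()
for _ in range(3):
    watchdog.on_clock_tick()
print(resets)  # ['reset']
```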