CS717
Algorithm-Based Fault Tolerance: Theory of Check Placement
Greg Bronevetsky
So Far…
• Learned how certain computations could be checked using algorithm-specific checks.
• In any algorithm we can develop checks to verify any set of data items.
• How effective are these checks?
• How many faults can a given set of checks detect?
Abstract Checks
• Suppose we are given (g,h)-checks
• Check defined on g data elements
• If all elements correct, returns 0
• If >0 and ≤h elements erroneous, returns 1
• If >h elements erroneous, result undefined
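The behaviour of an abstract (g,h)-check can be modelled in a few lines. This is only a sketch under the usual reading of the definition (0 when nothing is wrong, 1 for between one and h erroneous elements, undefined beyond h); the function name `check_response` and the use of `None` for "undefined" are assumptions made for illustration, not part of the papers.

```python
from typing import Optional


def check_response(num_erroneous: int, h: int) -> Optional[int]:
    """Response of an abstract (g,h)-check, given how many of its g elements are wrong."""
    if num_erroneous == 0:
        return 0      # all checked elements correct
    if num_erroneous <= h:
        return 1      # between 1 and h errors: the check fires
    return None       # more than h errors: the check is overwhelmed, result undefined


# A (2,1) check: fires on one error, is undefined on two.
responses = [check_response(n, h=1) for n in range(3)]
```

Returning `None` rather than 0 or 1 keeps the "overwhelmed" case visible to any later analysis code.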
Checking Example
• Assume (2, 1) checks – 2 elements, 1-failure detect
• Both sets of checks can detect single errors
• Neither can locate individual errors
[Figure: two check placements over d1 … dn and their sum. Left: n checks (for each i, di and sum). Right: 1 check (sum).]
But with one more check…
• If also check sum
  – can detect any pair of errors
  – can locate single errors
• Need general theory of effective and efficient check placement
[Figure: d1 … dn and sum, with n checks (for each i, di and sum) plus 1 more check (sum).]
Goals
• Need models for correlating processor faults to data errors
• Given fault model and set of checks need to derive fault detectability and locatability
Papers covered
• V.S.S. Nair, J.A. Abraham, P. Banerjee. "Efficient techniques for the analysis of algorithm-based fault tolerance (ABFT) schemes", 1996.
• Choon-Sik Park and Mineo Kaneko, "An Efficient Technique for Design of ABFT Systems Based on Modified PD Graph".
• Choon-Sik Park, "Algorithm-Based Fault Tolerant Systems Based on Graph-Theoretic Error Occurence+Propagation Models", 2000. (PhD Thesis)
• V.S.S. Nair, J.A. Abraham. "Hierarchical design and analysis of fault-tolerant multiprocessor systems using concurrent error detection", 1990.
Outline
• Matrix-based formalism of Nair et al
• Dependence graph-based formalism of Park et al
  – Includes fault propagation models
• Framework for hierarchical fault tolerant systems by Nair et al
  – Building fault tolerant systems out of fault tolerant components
Basic Framework
• Each processor and check associated with set of elements
[Figure: processors P1–P4 and checks C1–C3, each linked to its associated data elements among d1–d7.]
Basic Framework
• Data(Pi) = set of data elements affected by processor i
  – If Pi fails, any subset of Data(Pi) may be erroneous
  – No notion of errors propagating based on data dependences
• Data() defines the Processor-Data (PD) Matrix
Associated PD Matrix
Rows: processors. Columns: data elements.
      d1 d2 d3 d4 d5 d6 d7
P1     1  1  0  0  0  0  0
P2     0  0  1  0  0  0  0
P3     0  0  0  1  0  1  0
P4     0  0  0  0  1  1  1
Basic Framework
• Check(di) = set of checks that check data element di
  – Must be non-empty if we expect to detect errors
• Check() defines the Data-Check (DC) Matrix
• Paper focuses on (g,1) checks
  – g data elements
  – can detect up to 1 fault
Associated DC Matrix
Rows: data elements. Columns: checks.
      C1 C2 C3
d1     1  0  0
d2     0  1  0
d3     0  0  1
d4     1  1  0
d5     0  0  1
d6     0  1  0
d7     0  0  1
• C1 and C2 are (3,1) checks
• C3 is a (2,1) check
The PC Matrix
• Finally, associate processors and checks:
• Processor-check (PC) matrix = PD × DC
[Figure: PD (processors × data elements) × DC (data elements × checks) = PC (processors × checks). Each PC entry is the # elements verified by that check; the PC rows shown are (1 2 0) and (0 1 2).]
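The product PC = PD × DC can be reproduced directly. The sketch below copies PD and DC from the "Associated PD Matrix" and "Associated DC Matrix" slides and multiplies them with a small helper; `mat_mul` is a hypothetical name introduced here, and plain lists are used to keep the example dependency-free.

```python
# PD: rows P1..P4, columns d1..d7 (from the "Associated PD Matrix" slide).
PD = [
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 1, 1],
]

# DC: rows d1..d7, columns C1..C3 (from the "Associated DC Matrix" slide).
DC = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
]


def mat_mul(a, b):
    """Plain integer matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# PC[i][j] = number of Pi's data elements verified by check Cj.
PC = mat_mul(PD, DC)
print(PC)  # [[1, 1, 0], [0, 0, 1], [1, 2, 0], [0, 1, 2]]
```

The rows computed for P3 and P4, (1 2 0) and (0 1 2), are the two PC rows visible on the slide.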
Using the PC Matrix
• PC matrix shows if we can detect single-processor errors:
• Assume all checks are (g,h) checks
• If a row of PC has all entries ≤ h, failure of that processor will be detected
  – Regardless of which entries actually become erroneous
[Figure: PC matrix (processors × checks, entries = # elements verified by check); rows shown: (1 2 0) and (0 1 2).]
Using the PC Matrix
• If a row of PC has all entries ≤ h, failure of that processor will be detected
[Figure: processors P1–P4, data elements d1–d7 and checks C1–C3, shown next to the PC matrix rows (1 2 0) and (0 1 2).]
Relaxing Detectability
• Condition is too conservative
• Suppose we have (3, 2) checks
• Pi’s PD row (columns d1 … d5) is: 1 1 0 1 1
• There are 2 checks. DC matrix (rows d1 … d5, columns C1 C2):
    1 1
    1 0
    1 0
    0 1
    0 1
• PC Matrix: 2 3
[Figure: P1, data elements d1–d5, checks C1 and C2.]
Relaxing Detectability
• C1 may be overwhelmed by errors
  – Will not notice error <d1, d2, d5>
• By above criterion system can’t detect failure in P1
[Figure: P1, data elements d1–d5, checks C1 and C2.]
Reaching New Detectability Definition
• But how could C1 be overwhelmed?
• When all 3 of its elements have errors
  – Recall, these are (3,2) checks
[Figure: P1, data elements d1–d5, checks C1 and C2.]
Reaching New Detectability Definition
• But C1 and C2 overlap on d5
• Thus if C1 overwhelmed, C2 detects error
  – It is not overwhelmed
• Thus, for any error pattern can see if any check will notice
[Figure: P1, data elements d1–d5, checks C1 and C2.]
Trivial Algorithm 2
• Try every possible error pattern
  – Exponentially many of them
• For each pattern see if some check will detect it
  – Before: ensured that no check overwhelmed
• Pro: Correct and not conservative
• Con: Expensive
New Definition of Detectability
• Work with error patterns
  – Ex: <d1, d2, d5>, <d1, d3, d4>, <d3>, etc.
• If one check detects given error pattern, no problem if other checks overwhelmed
• Repeat until all error patterns detected:
  – If some check not overwhelmed, eliminate all error patterns it detects from consideration
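The elimination loop described above can be sketched as follows. This is an illustrative reading of the slides rather than the algorithm as published by Nair et al: a check that overlaps the still-unresolved elements in between 1 and h places is "not overwhelmed", so every error pattern touching those elements is detected and the elements can be dropped. The function name `failure_detectable` and the example check sets are hypothetical.

```python
def failure_detectable(data, checks, h):
    """True if every error pattern over `data` is caught by some (g,h) check.

    `data` is the set of elements a failed processor may corrupt and
    `checks` is an iterable of sets, one per check (assumed representation).
    """
    checks = list(checks)
    remaining = set(data)
    progress = True
    while remaining and progress:
        progress = False
        for covered in checks:
            overlap = remaining & covered
            if 1 <= len(overlap) <= h:   # check sees errors but is not overwhelmed
                remaining -= overlap     # every pattern touching these is detected
                progress = True
    return not remaining


# Hypothetical (2,1) checks: C3 resolves d3 first, then C2 resolves d2, then C1 resolves d1.
checks = {"C1": {"d1", "d2"}, "C2": {"d2", "d3"}, "C3": {"d3", "d4"}}
result = failure_detectable({"d1", "d2", "d3"}, checks.values(), h=1)
print(result)  # True
```

The loop stops either when nothing is left to cover or when a full pass over the checks makes no progress, in which case the failure is reported as not detectable by this criterion.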
Example of Detectability Algorithm
• Is failure of P1 detectable?
• P1 fails ⇒ d1, d2 and/or d3 may have errors
• C1, C2 overwhelmed
• C3 not overwhelmed
[Figure: processors P1 and P2, data elements d1–d5, (2,1) checks C1–C4.]
Example of Detectability Algorithm
• Look at errors C3 can detect: d3
• Remove them from consideration
  – Since any error pattern involving d3 will be detected
[Figure: same system as on the previous slide.]
Example of Detectability Algorithm
• Look at remaining error patterns: combinations of d1 and/or d2
• Now C2 not overwhelmed
• Remove any error patterns involving d2
[Figure: same system with d3 removed.]
Example of Detectability Algorithm
• Look at remaining error patterns: d1
• C1 not overwhelmed
• Remove any of its error patterns
[Figure: same system with d2 and d3 removed.]
Example of Detectability Algorithm
• All of P1’s error patterns detected
• We are done!
[Figure: same system with d1, d2 and d3 removed.]
Failing Check Processors
• What if processor performing check fails?
• Add “pseudo” data elements to represent processors
• Each check will also check its processor’s pseudo-data element
  – New element has weight ∞, so error in it will overwhelm any check
Final System
• Check C3 is on P1
• Checks C1, C2 and C4 on P2
[Figure: processors P1 and P2, their data elements, and (2,1) checks C1–C4.]
The Infinities
[Figure: processors P1 and P2, data elements d1–d7 and (2,1) checks C1–C4, with the system's PD, DC and PC matrices. PC entries give the # elements verified by each check and include ∞ for pseudo-data elements.]
The Infinities
• If P1 fails, C1 and C2 overwhelmed
• C3 also overwhelmed by ∞+1
  – Because C3 runs on failed P1
• Only C4 not overwhelmed
[Figure: same system with its PC matrix.]
The Infinities
• Remove all error patterns detected by C4
  – Any that include d2
[Figure: same system with its PC matrix.]
The Infinities
• C1 and C2 no longer overwhelmed
• Remove error patterns detected by C1 and C2
  – Any that include d1 or d3
• C4’s entry must become 0; others may go lower
[Figure: same system with d2 removed and the updated PC matrix.]
The Infinities
• Now P1’s row is all 0’s and ∞’s
• All real data elements successfully checked
• Only pseudo-elements remain
  – Don’t care
• C1’s and C2’s entries must become 0; others may go lower
[Figure: same system with d1, d2 and d3 removed and the updated PC matrix.]
The Infinities
• Note failure of P2 not detectable
• d5 only checked by C4, which runs on P2
• Thus, C4 stays overwhelmed and error patterns involving d5 are never eliminated
[Figure: same system and PC matrix as on the previous slide.]
Multi-Process Errors
• Want to know if system can detect failures of r processors
• For every subset of r processors
  – Take union of all data elements they touched
  – Pretend each r-set is single processor
• Use above algorithm to check if all resulting error patterns detectable
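The reduction from r-processor failures to the single-processor case might look like the sketch below. It repeats a single-processor elimination routine so that the block is self-contained; both function names, the set-based representation, and the three-processor example are assumptions made for illustration.

```python
from itertools import combinations


def failure_detectable(data, checks, h):
    """Single-'processor' test: eliminate elements covered by non-overwhelmed checks."""
    checks = list(checks)
    remaining = set(data)
    progress = True
    while remaining and progress:
        progress = False
        for covered in checks:
            overlap = remaining & covered
            if 1 <= len(overlap) <= h:
                remaining -= overlap
                progress = True
    return not remaining


def r_failures_detectable(data_of, checks, h, r):
    """Treat every r-subset of processors as one processor owning the union of their data."""
    for group in combinations(data_of, r):
        merged = set().union(*(data_of[p] for p in group))
        if not failure_detectable(merged, checks, h):
            return False
    return True


# Hypothetical system: three processors, one element each, three (2,1) checks.
data_of = {"P1": {"d1"}, "P2": {"d2"}, "P3": {"d3"}}
checks = [{"d1", "d2"}, {"d2", "d3"}, {"d1", "d3"}]
results = [r_failures_detectable(data_of, checks, h=1, r=r) for r in (1, 2, 3)]
print(results)  # [True, True, False]
```

With all three processors failed, every check sees two errors and is overwhelmed, so the triple failure is not detectable in this example.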
Fault Locatability
• We only see errors, not faults
• For each error pattern, want to know which fault caused it
• Given two fault patterns, are they distinguishable?
• Only if they have different patterns of failed checks
• Will give intuition for analysis
0-1 Disagreement
• Take rows Ri and Rj of rPC (faults Fi and Fj)
• For every possible error pattern in Ri and Rj look at what each check says on this pattern
• If check responses different on each pattern: Fi and Fj can be differentiated
1-0 Disagreement
• Want to differentiate faults Fi and Fi∪Fj
• Compare each error pattern of Fi and Fj: Eik and Ejl
• If some check meets Eik on ≥1 & ≤h spots and meets Ejl on 0 spots then Eik and Eik∪Ejl distinguishable
• If this is true for all error patterns then Fi and Fi∪Fj distinguishable
1-0 Disagreement Example
DC (rows: data elements; columns: checks):
    0 0 1
    0 1 1
    1 0 1
Error Eik = (1 1 0), checks on Eik = (0 1 2)
Error Ejl = (1 0 1), checks on Ejl = (1 0 2)
• 1-0 disagreement in both directions
1-0 Disagreement Example
• Clearly, Eik and Ejl look different
• Eik∪Ejl corresponds to fault pattern: (1 1 1)
• Checks would say: (1 1 2)
• Different from Eik or Ejl: Distinguishable!
[Figure: DC matrix, error patterns Eik = (1 1 0) and Ejl = (1 0 1), and their check responses (0 1 2) and (1 0 2), as on the previous slide.]
Fault Locatability
• If can show 1-0 disagreement between every single-process fault and every r-process fault: system is r-fault locatable
• Algorithm for locatability is obscure
• Read the paper
Summary
• Presented matrix-based framework for evaluating error detectability & locatability
• Framework deals with arbitrary errors
• More work by V.S.S. Nair with other coauthors
Graph-Based Framework
• Developed by Choon-Sik Park
• Does in graphs what Nair et al work does in matrices
• Assumes (g,1) checks
• Differences:
  – Different definition of fault locatability
    • Unknown if equivalent
  – Presents more limited fault→error models
    • As opposed to “anything and everything”
• Will first present general view, then specific error models
Basic Picture
[Figure: faults (…, Fi, Fj, …) lead to errors (…, eiu, ejv, …) over the data; checks (…, c, c', …) cover the data.]
• Processor→Data, Data→Data dependence info maintained
k-Faults
• Faults may cause number of possible errors
  – For given fault, many errors possible
  – If given error happens, all associated data elements definitely corrupted
• k-Faults: faults generating errors that corrupt k data elements
[Figure: Faults → Errors → Data, with fault Fi leading to error eiu.]
Fault Detectability
• System is k-fault detectable if for every error pattern eiu ∃ check c s.t. |c∩eiu|=1
  – ∩ means intersection of affected data elements
• Proof:
  – If there exists such check then every error pattern induced by fault will be detected
  – If k-fault detectable then there must be some check that reliably yells for any possible error pattern
    • Can allow the check that yells to be the check in definition
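With (g,1) checks, the detectability condition reads: every error pattern must meet some check in exactly one element. A minimal sketch of that test, assuming error patterns and checks are both given as Python sets of data-element names (the function name and the sample data are hypothetical):

```python
def k_fault_detectable(error_patterns, checks):
    """True if each error pattern e has some check c with |c ∩ e| == 1."""
    return all(
        any(len(check & pattern) == 1 for check in checks)
        for pattern in error_patterns
    )


checks = [{"a", "b"}, {"b", "c"}]

# {"a", "b"} overwhelms the first check but meets the second in exactly one element.
ok = k_fault_detectable([{"a"}, {"a", "b"}, {"b"}], checks)

# {"a", "b", "c"} meets both checks in two elements, so no (g,1) check detects it.
bad = k_fault_detectable([{"a", "b", "c"}], checks)
```

This is the brute-force form: it expects the caller to enumerate the error patterns each fault can produce.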
Fault Management
• k-fault detectability: If a fault affects k data elements then checks will detect it
• k-fault locatability: For all faults that affect k data elements, can tell any pair of faults apart
• Will examine all fault patterns Fi that come from k data elements failing
Fault Locatability 1
• To locate faults, must ensure that different faults cause different errors
• Theorem 1: System k-fault locatable only if for error patterns eiu, ejv (from faults Fi and Fj) eiu⊕ejv ≠ ∅
  – ⊕ means symmetric difference
• Proof clear: If two faults can show up as same error, can’t tell them apart
Fault Locatability 2
• Theorem 2: System k-fault locatable only if for ∀ error patterns eiu, ejv ∃ checks c and c' s.t.
  – |c∩(eiu∪ejv)|=1 (recall: all checks are (g,1))
  – |c∩(eiu∩ejv)|=0
  – If |c∩(eiu-ejv)|=1 then |c'∩ejv|=1
  – If |c∩(ejv-eiu)|=1 then |c'∩eiu|=1
• Intuition: Trying to make tuple <c,c'> be different and ≠<0,0> on errors eiu and ejv
Fault Locatability Illustration
[Figure: Venn diagram of error patterns eiu and ejv with regions (eiu∪ejv), (eiu∩ejv), (eiu-ejv) and (ejv-eiu).]
Fault Locatability Illustration
• |c∩(eiu∪ejv)|=1
• i.e. c overlaps one element of (eiu∪ejv) (because of (g,1) checks)
[Figure: same Venn diagram of eiu and ejv, with check c.]
Fault Locatability Illustration
• |c∩(eiu∩ejv)|=0
• i.e. c only touches on the part that is unique to ejv
[Figure: same Venn diagram of eiu and ejv, with check c.]
Fault Locatability Illustration
• If |c∩(ejv-eiu)|=1 then |c'∩eiu|=1
• If c notices ejv make sure that c' notices eiu
[Figure: same Venn diagram of eiu and ejv, with checks c and c'.]
Fault Locatability Illustration
• Error eiu: <c,c'>=<0,1>
• Error ejv: <c,c'>=<1,?>
• Patterns distinguishable
• Either error detected
[Figure: same Venn diagram of eiu and ejv, with checks c and c'.]
Fault Locatability 2
• Theorem 2: System k-fault locatable only if for ∀ error patterns eiu, ejv ∃ checks c and c' s.t.
  – |c∩(eiu∪ejv)|=1 (recall: all checks are (g,1))
  – |c∩(eiu∩ejv)|=0
  – If |c∩(eiu-ejv)|=1 then |c'∩ejv|=1
  – If |c∩(ejv-eiu)|=1 then |c'∩eiu|=1
• Thus, if above is true for every pair of error patterns, system is k-fault locatable
Extra Fault Detectability
• Theorem: if system is k-fault locatable then it is 2k-fault detectable
• Must show: for any fault Fl in 2k processors, ∀ resulting errors elw, ∃ check c. |c∩elw|=1
• Note: Failures of 2k processors result in the same errors as 2 failures of k data elements each
• Thus, can break up elw = (eiu∪ejv), coming from k-fault patterns Fi and Fj
Extra Fault Detectability
• Theorem: if system is k-fault locatable then it is 2k-fault detectable
• Must show: ∀ eiu, ejv ∃ check c. |c∩(eiu∪ejv)|=1
• If (eiu∪ejv) happens, both c and c' will notice
[Figure: Venn diagram of eiu and ejv with regions (eiu∩ejv), (eiu-ejv) and (ejv-eiu), and checks c and c'.]
Fault→Error Models
• So far trying to deal with arbitrary errors
• Actual model of how faults turn into errors not defined
  – i.e. arbitrary
• This is unnecessarily general
• Should focus on realistic models of error generation and propagation
  – Makes it easier to design reliable systems
Single-Input-Driven Model
• Output of computation erroneous if any input(s) are
  – Even if processor is faulty
• If processor is faulty, its computations may or may not be erroneous (this is where we use data dependence information)
• Will focus on how model treats single-processor failures
SID Model Picture
• $D_{i1} \ldots D_{iW_i}$: data elements on Pi
  – Synonymous with sets of data elements on Pi
• Focus on single-processor failures
[Figure: processor Pi with its data elements $D_{i1}, D_{i2}, \ldots, D_{iw}, \ldots, D_{iW_i}$ among the data.]
Fault Model in Practice
• If Pi fails, any subset of Diw’s may have error
• If Diw has error, any data depending on it has error
  – Bijection between Diw and errors Eiw
[Figure: processor Pi with data elements $D_{i1} \ldots D_{iW_i}$, each $D_{iw}$ mapped to its error $E_{iw}$ over the data.]
Single-Fault Detectability in SID
• Brute-Force algorithm: ∀ sets of Eiw’s
  – If ∃ check c s.t. |c∩(∪Eiw’s)|=1 then this error pattern detectable
  – If all patterns detectable, system is single-fault detectable
[Figure: processor Pi with data elements $D_{i1} \ldots D_{iW_i}$ and a check c over the data.]
Too Conservative
• Like before, algorithm too conservative
• Examines exponentially many error patterns
• Suppose set of errors $E = \{E_1, E_2, \ldots, E_r\}$ detected via check c
  – i.e. |c∩E|=1
• Look at the $E_j \in E = \{E_1, E_2, \ldots, E_r\}$ s.t. $|c \cap E_j| = 1$
[Figure: data elements D1, D2, D3, check c and error set E.]
Too Conservative
• Clearly, all Ej’s overlap with c on one element
  – Thus, each one detectable
  – Similarly, all unions containing Ej’s detectable
• Therefore, if a set of errors detectable, all unions containing suberrors also detectable
  – And thus, no need to check them
• Can ignore: E1, E2, E1∪E2, E1∪E3, E2∪E3, E1∪E2∪E3
• Can’t ignore: E3
[Figure: data elements D1, D2, D3, check c and error set E.]
New Definition of Detectability
• $E_i^0 = E_i$ (start with all possible errors)
• For each check $c_s$:
  – Check that $E_i^{s-1}$ detectable: $\exists E_{iw} \in E_i^{s-1}$ s.t. $|c_s \cap E_{iw}| = 1$
    • Now ignore detectable subsets of $E_i^{s-1}$
    • Remove detectable subsets: $E_i^s = E_i^{s-1} - \{E_{iw} : |c_s \cap E_{iw}| = 1\}$
• Repeat to ensure rest of $E_i^s$ also detectable
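The per-check elimination over a processor's error sets can be traced in code. This sketch follows the loop literally (one pass over the checks, dropping every error set that the current check meets in exactly one element) and records what remains after each step. The element names, the contents of each error set and the function name `eliminate` are invented for illustration; they are chosen so that the trace mirrors a six-error walk-through in which c1 covers E1 and E2, c2 covers E3 and E4, c3 covers E5 and c4 covers E6.

```python
def eliminate(errors, checks):
    """Return the error sets still unresolved after each check is applied in turn.

    `errors` maps a name such as "E1" to the set of data elements it corrupts;
    `checks` is a list of sets of data elements (assumed (g,1) checks).
    """
    remaining = set(errors)
    trace = []
    for check in checks:
        detected = {name for name in remaining if len(check & errors[name]) == 1}
        remaining -= detected
        trace.append(sorted(remaining))
    return trace


# Hypothetical data: E2 touches two elements, one seen by c1 and one by c2.
errors = {
    "E1": {"a1"},
    "E2": {"a2", "b2"},
    "E3": {"a3"},
    "E4": {"a4"},
    "E5": {"a5"},
    "E6": {"a6"},
}
checks = [{"a1", "a2"}, {"b2", "a3", "a4"}, {"a5"}, {"a6"}]

trace = eliminate(errors, checks)
for step, left in enumerate(trace, start=1):
    print(f"after c{step}: {left}")
```

An empty final entry in `trace` means every error set was resolved by some check.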
Detectability Example
• Check $E_i^0$ (= $E_i$)
• c1 meets E1 and E2
• Remove them to get $E_i^1$
[Figure: data elements D1–D6 with check c1.]
Detectability Example
• Check $E_i^1 = \{E_3, E_4, E_5, E_6\}$
• c2 meets E3 and E4
  – Also meets E2 but on error E2, c1 will ring
• Remove them to get $E_i^2$
[Figure: data elements D1–D6 with checks c1 and c2.]
Detectability Example
• Check $E_i^2 = \{E_5, E_6\}$
• c3 meets E5
• Remove it to get $E_i^3$
[Figure: data elements D1–D6 with checks c1, c2 and c3.]
Detectability Example
• Check $E_i^3 = \{E_6\}$
• c4 meets E6
  – Recall: circles on left are data on processor i
• Remove it to get $E_i^4$
[Figure: data elements D1–D6 with checks c1, c2, c3 and c4.]
Detectability Example
DONE!
Single-Fault Locatability in SID
• Basic definition: Must exist enough checks s.t. all error patterns produced by failure of Pi differentiable from error patterns of Pj
• Involves a lot of error patterns
• Start with brute-force definition
Brute-Force Definition
• ∀ error patterns Eq={Ei1, Ei5, Eiw, …} from Pi ∃ checks $c_q$ and $c_1 \ldots c_r$ s.t.
  – $|c_q \cap E_q| = 1$
    • Detects error Eq
  – $|c_q \cap E_j| = 0$
    • Ignores any error from Pj
  – $c_1 \ldots c_r$ detect Ej and all subsets via above algorithm
  – And vice versa (since $c_k$’s may ring on Pi’s errors)
• Result:
  – Any error pattern in Ei, none in Ej will ring some cq
  – Every pattern in Ej detectable
Responses of Checks
• On error pattern Eq (due to failure of Pi): $(c_q, c_1, \ldots, c_r) = (1, ?, \ldots, ?)$
• On any error Ej due to failure of Pj: $(c_q, c_1, \ldots, c_r) = (0, 1/0, \ldots, 1/0)$
  – At least one of $c_1 \ldots c_r$ must be =1 (else Ej not detectable)
• Can brute-force evaluate test on every possible Eq
Brute Force Too Exhaustive
• Recall that if $|c \cap \{E_1, E_2, \ldots, E_r\}| = 1$ then same true for all sets containing E1, … Er
• Thus, can eliminate many of the steps above
New Definition of Locatability
• $E_i^0 = E_i$ (start with all possible Pi errors)
• For each check $c_s$:
  – Check $c_s$ detects $E_i^{s-1}$: $\exists E_{iw} \in E_i^{s-1}$ s.t. $|c_s \cap E_{iw}| = 1$
  – But not Ej: $|c_s \cap E_j| = 0$
• Ensure that Ej is detectable via above algorithm
New Definition of Locatability
• Syndrome of Ei and detectable subsets: $(c_q, c_1, \ldots, c_r) = (1, ?, \ldots, ?)$
• Syndrome of Ej and all subsets: $(c_q, c_1, \ldots, c_r) = (0, 1/0, \ldots, 1/0)$
  – At least one must be =1 (else Ej not detectable)
• Can now ignore detectable subsets of $E_i^{s-1}$
• Remove detectable subsets: $E_i^s = E_i^{s-1} - \{E_{iw} : |c_s \cap E_{iw}| = 1\}$
• Repeat until all covered
• Do same for $E_j$
  – In paper, steps for $E_i$ and $E_j$ interleaved
Summary
• Presented graph-based framework for evaluating error detectability & locatability
• Framework deals with arbitrary errors
• Can be specialized to a simpler fault model: Single-Input Driven
• Choon-Sik Park’s thesis presents the Multiple-Input Driven model
  – More realistic but complex
Building Larger Systems
• Now know how to analyze systems for detectability & locatability
• For large systems this can be very hard/expensive
• Large systems typically made up of smaller components
• Simplifies fault tolerance design
Basic Idea
• Have component with known detectability (=t) & locatability (=l)
• Construct system S out of k components
• What is resulting fault tolerance?
Basic Idea
• System fault tolerance no better than for individual component
• If >t data elements fail in same component, error not detected
• If >l elements fail in component, will not locate
• Detectability & locatability ratio tends to 0 as system size increases!
Hierarchical Design
• To build fault tolerant systems must introduce checks with new components
• Will present hierarchical design scheme with specific detectability & locatability guarantees
• Assumptions:
  – All (g,h) checks have same h
    • No restriction on g
  – Every processor produces only one data element
    • Same true for blocks of processors
  – Checks are fault tolerant
    • Claims that this doesn’t change problem
Basic Component
• Start off with basic system:
• System has internal checks
• Fault detectability = t
• Fault locatability = l
[Figure: basic system B.]
Basic Component
• Then replicate it k-fold
• Assumptions:
  – copies are independent
    • (i.e. do not affect each other’s data)
  – Each system produces one data element
[Figure: k copies B1, B2, …, Bk.]
Basic Component
• Then replicate it k-fold
• And add additional checks across all copies
• Process repeated d-1 times to get d-level hierarchical system
[Figure: k copies B1, B2, …, Bk with checks c1, c2, …, cr across them.]
Detectability 1≤k≤h
• Theorem 1:
  – If 1≤k≤h then hierarchical system can detect $|B|k^{d-1}$ errors
• Proof:
  – Base case: d=2
  – Suppose every element has error
  – Each check must deal with k≤h errors
  – But they are (g,h) checks and will detect such errors
  – Thus, system can detect |B|k errors
[Figure: k copies B1, B2, …, Bk with checks c1, c2, …, cr across them.]
Detectability 1≤k≤h
• Theorem 1:
  – If 1≤k≤h then hierarchical system can detect $|B|k^{d-1}$ errors
• Proof:
  – Inductive case: d+1
  – Components Bi each have $|B|k^{d-2}$ elements
  – By argument above, system detects $(|B|k^{d-2})k=|B|k^{d-1}$ errors
    • Argument works because sub-systems at each level produce one data element
[Figure: k copies B1, B2, …, Bk with checks c1, c2, …, cr across them.]
Detectability k>h
• Theorem 2:
  – If k>h then hierarchical system can detect $(t+1)(h+1)^{d-1}-1$ errors
• Proof:
  – Base case: d=2
  – Suppose (t+1)(h+1) errors with h+1 copies of B having t+1 errors each
  – Detectability of B = t, so internal checks will not notice errors
  – 2nd level checks will get h+1 errors each: will not notice
  – Thus, ∃ error pattern of size (t+1)(h+1) that will not be detected
[Figure: k copies B1, B2, …, Bk with checks c1, c2, …, cr across them.]
Detectability k>h
• Theorem 2:
  – If k>h then hierarchical system can detect $(t+1)(h+1)^{d-1}-1$ errors
• Proof:
  – Base case: d=2
  – Suppose (t+1)(h+1)-1 errors
  – By pigeonhole principle, some unit has ≤t errors or some 2nd level check has ≤h errors
  – Thus, some check at 1st or 2nd level will ring
  – Thus, system detectability = (t+1)(h+1)-1
[Figure: k copies B1, B2, …, Bk with checks c1, c2, …, cr across them.]
Detectability k>h
• Theorem 2:
  – If k>h then hierarchical system can detect $(t+1)(h+1)^{d-1}-1$ errors
• Proof:
  – Inductive case: d+1
  – Components Bi detect $T_d$ errors
  – By induction, $T_d=(t+1)(h+1)^{d-1}-1$
  – By argument above, system detects $(T_d+1)(h+1)-1$ errors
  – Thus, system detectability = $(t+1)(h+1)^{d}-1$
[Figure: k copies B1, B2, …, Bk with checks c1, c2, …, cr across them.]
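The inductive step for the k>h case reduces to one line of algebra. Writing $T_d$ for the detectability of a d-level system and substituting the induction hypothesis $T_d=(t+1)(h+1)^{d-1}-1$ into the base-case bound applied to components of detectability $T_d$:

```latex
\begin{aligned}
T_{d+1} &= (T_d + 1)(h+1) - 1 \\
        &= \bigl((t+1)(h+1)^{d-1} - 1 + 1\bigr)(h+1) - 1 \\
        &= (t+1)(h+1)^{d} - 1 .
\end{aligned}
```

Setting d=1 recovers $T_1 = t$, the detectability of the basic component.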
Locatability
• Theorem 3:
  – If k>1 then hierarchical system can locate $2^{d-1}(l+1)-1$ errors
• Proof:
  – Base case: d=2
  – Suppose fault pattern of 2(l+1) errors, l+1 errors in two Bi’s
  – Bi & Bj can’t locate the errors
  – 2nd level checks may locate erroneous rows, not columns
  – Thus, ∃ unlocatable fault pattern of size 2(l+1)
[Figure: copies B1, B2, …, Bk with checks c1, c2, …, cr across them.]
Locatability
• Theorem 3:
  – If k>1 then hierarchical system can locate $2^{d-1}(l+1)-1$ errors
• Proof:
  – Base case: d=2
  – Suppose fault pattern of 2(l+1)-1 errors
  – At most one Bi may have ≥l+1 errors
    • If none do, we’re done
  – Remaining ≤l errors distributed among other Bj’s
[Figure: copies B1, B2, …, Bk with checks c1, c2, …, cr across them.]
Locatability
• Let Bi have l+r errors (r≥1)
[Figure: copies …, Bi, …, Bj, …, Bk with checks c1, c2, …, cr across them.]
Locatability
• Let Bi have l+r errors (r≥1)
• Remaining Bj’s share remaining l-r+1 errors
• At least (l+r)-(l-r+1)=2r-1 rows only have errors in Bi
  – =2r-1 rows when all l-r+1 errors are in same Bj
[Figure: copies …, Bi, …, Bj, …, Bk with checks c1, c2, …, cr across them.]
Finding Overwhelmed Unit
• First, find the Bi that have >l errors
• All but one sub-system detects and locates errors correctly
• Overwhelmed subsystem:
  – Detects correctly
    • Locatability = l ⇒ Detectability > 2*l
    • Citation of 1973 paper by Russel & Kime
  – Error location mistakes
Finding Overwhelmed Unit
• In 2r-1 rows only Bi has error
  – Thus, no other row will claim an error there
• 2nd-level checks will catch these errors
  – Bi’s checks can’t lie about it
  – Will definitely know these are errors
[Figure: grid over Bi, Bj, …, Bk with legend (known errors, unknown errors, no error) and labels l+1 and 2r-1.]
Finding Overwhelmed Unit
• Number of errors in Bi = l+r
• Number of known errors ≥ 2r-1
• Number of unknown errors in Bi ≤ (l+r)-(2r-1) = l-r+1
• Since r≥1, l-r+1≤l
• Bi’s checks can identify l errors
  – Error patterns of size ≤l produce unique check alert patterns
  – This data enough to identify remaining unknown errors
Locatability
• Theorem 3:
  – If k>1 then hierarchical system can locate $2^{d-1}(l+1)-1$ errors
• Proof:
  – Base case: d=2
    • Can locate errors of size 2(l+1)-1
  – Inductive case: d+1
    • Components Bi can locate $2^{d-1}(l+1)-1$ errors
    • By argument above, system locates $2[(2^{d-1}(l+1)-1)+1]-1 = 2^{d}(l+1)-1$ errors
Summary
• Presented systematic way to build hierarchical systems with good fault-detection properties
• For d-level system composed of identical independent components
  – Component detectability=t, locatability=l
$$T_d = |B|\,k^{d-1} \quad \text{for } 1 \le k \le h$$
$$T_d = (t+1)(h+1)^{d-1} - 1 \quad \text{for } k > h$$
$$L_d = 2^{d-1}(l+1) - 1 \quad \text{for } k > 1$$
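The three closed forms for a d-level system are easy to tabulate. The sketch below assumes the case split reads 1 ≤ k ≤ h versus k > h for detectability and k > 1 for locatability; the function and parameter names (`size_b` standing for |B|) are hypothetical.

```python
def detectability(d: int, k: int, h: int, t: int, size_b: int) -> int:
    """T_d for a d-level system built from k copies per level of a component B."""
    if d < 1 or k < 1:
        raise ValueError("d and k must be at least 1")
    if k <= h:
        return size_b * k ** (d - 1)          # T_d = |B| k^(d-1)
    return (t + 1) * (h + 1) ** (d - 1) - 1   # T_d = (t+1)(h+1)^(d-1) - 1


def locatability(d: int, k: int, l: int) -> int:
    """L_d = 2^(d-1) (l+1) - 1, stated for k > 1."""
    if d < 1 or k <= 1:
        raise ValueError("formula is stated for d >= 1 and k > 1")
    return 2 ** (d - 1) * (l + 1) - 1


# Example: component with t = 1, l = 1, |B| = 4 and (g,1) checks, replicated 3-fold.
table = [(d, detectability(d, k=3, h=1, t=1, size_b=4), locatability(d, k=3, l=1))
         for d in (1, 2, 3)]
print(table)  # [(1, 1, 1), (2, 3, 3), (3, 7, 7)]
```

For k > h both quantities grow geometrically with the number of levels d, while the number of processors grows as k^(d-1).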
Conclusion
• Formalisms for analyzing fault detectability & locatability
  – Matrix-based formalism of Nair et al
  – Dependence graph-based formalism of Park et al
    • Includes fault propagation models
• Framework for hierarchical fault tolerant systems by Nair et al
  – Building fault tolerant systems out of fault tolerant components
Conclusion
• These schemes have complex rules for acceptable check placements
• Requires detailed analysis of system to place them manually
• More detailed analysis if checks are hand-designed
  – Likely since few known automatic techniques
• Overall, approach can support automatic solutions but currently very manual