cs184b: computer architecture (abstractions and optimizations)
DESCRIPTION
CS184b: Computer Architecture (Abstractions and Optimizations). Day 21: May 30, 2003 Defect and Fault Tolerance. Today. Defect and Fault Tolerance Problem Defect Tolerance Fault Tolerance EAS Questionnaire??? (10 min). Probabilities. Given: N objects P yield probability - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/1.jpg)
Caltech CS184 Spring2003 -- DeHon1
CS184b:Computer Architecture
(Abstractions and Optimizations)
Day 21: May 30, 2003
Defect and Fault Tolerance
![Page 2: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/2.jpg)
Caltech CS184 Spring2003 -- DeHon2
Today
• Defect and Fault Tolerance– Problem– Defect Tolerance– Fault Tolerance
• EAS Questionnaire??? (10 min)
![Page 3: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/3.jpg)
Caltech CS184 Spring2003 -- DeHon3
Probabilities
• Given:– N objects– P yield probability
• What’s the probability for yield of composite system of N items?– Asssume iid faults– P(N items good) = PN
![Page 4: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/4.jpg)
Caltech CS184 Spring2003 -- DeHon4
Probabilities
• P(N items good) = PN
• N=106, P=0.999999
• P(all good) ~= 0.37
• N=107, P=0.999999
• P(all good) ~= 0.000045
![Page 5: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/5.jpg)
Caltech CS184 Spring2003 -- DeHon5
Simple Implications
• As N gets large– must either increase reliability– …or start tolerating failures
• N– memory bits– disk sectors– wires – transmitted data bits– processors
![Page 6: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/6.jpg)
Caltech CS184 Spring2003 -- DeHon6
Increase Reliability?
• Psys = PN
• Want: Psys = constant
• c=ln(Psys)=N ln(P)
• ln(P)=ln(Psys)/N
• P=Nth root of Psys
![Page 7: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/7.jpg)
Caltech CS184 Spring2003 -- DeHon7
Problems
![Page 8: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/8.jpg)
Caltech CS184 Spring2003 -- DeHon8
Two “problems”
• Shorts, breaks– wire/node X shorted to power, ground,
another node
• Noise– node X value flips
• crosstalk• alpha particle• bad timing
![Page 9: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/9.jpg)
Caltech CS184 Spring2003 -- DeHon9
Defects
• Shorts example of defect
• Persistent problem– reliably manifests
• Occurs before computation
• Can test for at fabrication / boot time and then avoid
![Page 10: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/10.jpg)
Caltech CS184 Spring2003 -- DeHon10
Faults
• Alpha particle bit flips is an example of a fault
• Fault occurs dynamically during execution
• At any point in time, can fail– (produce the wrong result)
![Page 11: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/11.jpg)
Caltech CS184 Spring2003 -- DeHon11
First Step to Recover
Admit you have a problem
(observe that there is a failure)
![Page 12: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/12.jpg)
Caltech CS184 Spring2003 -- DeHon12
Detection
• Determine if something wrong?– Some things easy
• ….won’t start
– Others tricky• …one and gate computes F*T=>T
• Observability– can see effect of problem– some way of telling if defect/fault present
![Page 13: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/13.jpg)
Caltech CS184 Spring2003 -- DeHon13
Detection• Coding
– space of legal values < space of all values– should only see legal– e.g. parity, redundancy, ECC
• Explicit test (defects, recurring faults)– ATPG = Automatic Test Pattern Generation– Signature/BIST=Built-In Self-Test– POST = Power On Self-Test
• Direct/special access– test ports, scan paths
![Page 14: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/14.jpg)
Caltech CS184 Spring2003 -- DeHon14
Coping with defects/faults?
• Key idea: redundancy
• Detection:– Use redundancy to detect error
• Mitigating: use redundant hardware– Use spare elements in place of faulty
elements (defects)– Compute multiple times so can discard faulty
result (faults)
![Page 15: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/15.jpg)
Caltech CS184 Spring2003 -- DeHon15
Defect Tolerance
![Page 16: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/16.jpg)
Caltech CS184 Spring2003 -- DeHon16
Two Models
• Disk Drives
• Memory Chips
![Page 17: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/17.jpg)
Caltech CS184 Spring2003 -- DeHon18
VM/Disk Drive Model
• On startup OS checks physical memory– Notes bad pages– Treat all bad pages as permanently
allocated to the “bad page” process
• Never map user (OS) to a “bad page”– User never sees physical addresses– Doesn’t expect contiguous physical
memory
![Page 18: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/18.jpg)
Caltech CS184 Spring2003 -- DeHon19
Page Tables and Defective Memory
Real MemoryTask 1Page Table
Disk
Task 2
Task 3
xxx
xxx
![Page 19: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/19.jpg)
Caltech CS184 Spring2003 -- DeHon20
Memory Chips
• Provide model in hardware of perfect chip
• Model of perfect memory at capacity X
• Use redundancy in hardware to provide perfect model
• Yielded capacity fixed– discard part if not achieve
![Page 20: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/20.jpg)
Caltech CS184 Spring2003 -- DeHon21
Example: Memory
• Correct memory:– N slots– each slot reliably stores last value written
• Millions, billions, etc. of bits…– have to get them all right?
![Page 21: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/21.jpg)
Caltech CS184 Spring2003 -- DeHon22
Memory defect tolerance
• Idea:– few bits may fail– provide more raw bits– configure so yield what looks like a perfect
memory of specified size
![Page 22: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/22.jpg)
Caltech CS184 Spring2003 -- DeHon23
Memory Techniques
• Row Redundancy
• Column Redundancy
• Block Redundancy
![Page 23: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/23.jpg)
Caltech CS184 Spring2003 -- DeHon24
Row Redundancy
• Provide extra rows
• Mask faults by avoiding bad rows
• Trick:– have address decoder substitute spare
rows in for faulty rows– use fuses to program
![Page 24: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/24.jpg)
Caltech CS184 Spring2003 -- DeHon25
Spare Row
![Page 25: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/25.jpg)
Caltech CS184 Spring2003 -- DeHon26
Row Redundancy
[diagram from Keeth&Baker 2001]
![Page 26: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/26.jpg)
Caltech CS184 Spring2003 -- DeHon27
Column Redundancy
• Provide extra columns
• Program decoder/mux to use subset of columns
![Page 27: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/27.jpg)
Caltech CS184 Spring2003 -- DeHon28
Spare Memory Column
• Provide extra columns
• Program output mux to avoid
![Page 28: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/28.jpg)
Caltech CS184 Spring2003 -- DeHon29
Column Redundancy
[diagram from Keeth&Baker 2001]
![Page 29: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/29.jpg)
Caltech CS184 Spring2003 -- DeHon30
Block Redundancy
• Substitute out entire block– e.g. memory subarray
• include 5 blocks– only need 4 to yield perfect
• (N+1 sparing more typical for larger N)
![Page 30: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/30.jpg)
Caltech CS184 Spring2003 -- DeHon31
Spare Block
![Page 31: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/31.jpg)
Caltech CS184 Spring2003 -- DeHon32
Yield M of N
• P(M of N) = P(yield N) + (N choose N-1) P(exactly N-1)+ (N choose N-2) P(exactly N-2)…+ (N choose N-M) P(exactly N-M)…
[think binomial coefficients]
![Page 32: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/32.jpg)
Caltech CS184 Spring2003 -- DeHon33
M of 5 example• 1*P5 + 5*P4(1-P)1+10P3(1-P)2+10P2(1-
P)3+5P1(1-P)4 + 1*(1-P)5
• Consider P=0.9– 1*P5 0.59 M=5 P(sys)=0.59
– 5*P4(1-P)1 0.33 M=4 P(sys)=0.92
– 10P3(1-P)2 0.07 M=3 P(sys)=0.99
– 10P2(1-P)3 0.008
– 5P1(1-P)4 0.00045
– 1*(1-P)5 0.00001
![Page 33: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/33.jpg)
Caltech CS184 Spring2003 -- DeHon34
Repairable Area
• Not all area in a RAM is repairable– memory bits spare-able – io, power, ground, control not redundant
![Page 34: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/34.jpg)
Caltech CS184 Spring2003 -- DeHon35
Repairable Area
• P(yield) = P(non-repair) * P(repair)
• P(non-repair) = PN
– N<<Ntotal– Maybe P > Prepair
• e.g. use coarser feature size
• P(repair) ~ P(yield M of N)
![Page 35: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/35.jpg)
Caltech CS184 Spring2003 -- DeHon36
Consider HSRA
• Contains– wires– luts – switches
![Page 36: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/36.jpg)
Caltech CS184 Spring2003 -- DeHon37
HSRA
• Spare wires– most area in wires and switches– most wires interchangeable
• Simple model– just fix wires
![Page 37: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/37.jpg)
Caltech CS184 Spring2003 -- DeHon38
HSRA “domain” model
• Like “memory” model
• spare entire domains by remapping
• still looks like perfect device
![Page 38: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/38.jpg)
Caltech CS184 Spring2003 -- DeHon39
HSRA direct model
• Like “disk drive” model
• Route design around known faults– designs become device specific
![Page 39: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/39.jpg)
Caltech CS184 Spring2003 -- DeHon40
HSRA: LUT Sparing
• All LUTs are equivalent
• In pure-tree HSRA– placement irrelevant– i.e. which endpoint as
long as match logical tree
• skip faulty LUTs
![Page 40: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/40.jpg)
Caltech CS184 Spring2003 -- DeHon41
Simple LUT Sparing
• Promise N-1 LUTs in subtree of some size– e.g. 63 in 64-LUT subtree– shift try to avoid fault LUT– tolerate any one fault in each subtree
![Page 41: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/41.jpg)
Caltech CS184 Spring2003 -- DeHon42
More general LUT sparing
• “Disk Drive” Model
• Promise M LUTs in N-LUT subtree– do unique placement around faulty LUTs
![Page 42: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/42.jpg)
Caltech CS184 Spring2003 -- DeHon43
SCORE Array
• Has memory and HSRA LUT arrays
![Page 43: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/43.jpg)
Caltech CS184 Spring2003 -- DeHon44
SCORE Array
• …but already know how to spare– LUTs– interconnect
• in LUT array• among LUT arrays and memory blocks
– memory blocks
• Example how can spare everything in universal computing block
![Page 44: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/44.jpg)
Caltech CS184 Spring2003 -- DeHon45
…but
• Important not to expose adjacency– E.g. physically contiguous real memory– E.g. Physical adjacency in mesh
• How spare mesh?
![Page 45: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/45.jpg)
Caltech CS184 Spring2003 -- DeHon46
Transit Multipath
• Butterfly (or Fat-Tree) networks with multiple paths– showed last time
![Page 46: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/46.jpg)
Caltech CS184 Spring2003 -- DeHon47
Multiple Paths
• Provide bandwidth• Minimize congestion• Provide redundancy
to tolerate faults
![Page 47: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/47.jpg)
Caltech CS184 Spring2003 -- DeHon48
Routers May be faulty(links may be faulty)
• Static– always corrupt
message– not route or misroute
message
• Dynamic – occasionally corrupt
or misroute
![Page 48: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/48.jpg)
Caltech CS184 Spring2003 -- DeHon49
Metro: Static Faults
• Turn off – faulty ports (make look busy)– ports connected to faulty channels– ports connected to faulty routers
• As long as paths remain between all communication endpoints– still functions
![Page 49: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/49.jpg)
Caltech CS184 Spring2003 -- DeHon50
Multibutterfly Yield
![Page 50: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/50.jpg)
Caltech CS184 Spring2003 -- DeHon51
Multibutterfly Performancew/ Faults
![Page 51: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/51.jpg)
Caltech CS184 Spring2003 -- DeHon52
Fault Tolerance
![Page 52: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/52.jpg)
Caltech CS184 Spring2003 -- DeHon53
Metro: dynamic faults
• Detection: Check success– checksums on packets to see data intact– check destination (arrived at right place)– acknowledgement from receiver
• know someone received correctly
• If fail– resend message
• same as blocked route case
![Page 53: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/53.jpg)
Caltech CS184 Spring2003 -- DeHon54
Metro: dynamic faults
• Consequence– may have faulty components– as long as
• detection strong• there is a non-faulty path
– will eventually deliver an intact message• may deliver multiple times if fault in ack• hence earlier concern about idempotence
![Page 54: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/54.jpg)
Caltech CS184 Spring2003 -- DeHon55
Memory: Dynamic Faults
• Error Correcting Codes (ECC)
• Provide enough redundancy to – detect most any errors– correct typical errors
• Simple scheme:– row and column parity
• 2 sqrt(N) bits – correct any single bit error
• …better schemes in practice– [Caltech has whole course on this…EE127]
![Page 55: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/55.jpg)
Caltech CS184 Spring2003 -- DeHon56
Processing Faults?
• Simplest model detection:– parallel checking
• run N copies in parallel• compare results
![Page 56: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/56.jpg)
Caltech CS184 Spring2003 -- DeHon57
Processor/Gate Fault Handling
• What do on fault?– Stop (not do anything wrong)
• maybe just restart – adequate if soft error
• maybe “reconfigure” to substitute out faulty processor
– Vote• if have enough redundancy take majority• Von Neumann shows:
– Can contain error rate of system to error rate of single component
![Page 57: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/57.jpg)
Caltech CS184 Spring2003 -- DeHon58
Checkpoint and Rollback
• Commit state of computation at key points– to memory (ECC, RAID protected...)– …reduce to previously solved problem…
• On faults– recover state from last checkpoint– like going to last backup….– …(snapshot)
![Page 58: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/58.jpg)
Caltech CS184 Spring2003 -- DeHon59
Together
• Examples of handling faults in– processing– storage– interconnect
• All components of our system
![Page 59: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/59.jpg)
Caltech CS184 Spring2003 -- DeHon60
Big Ideas
• Left to itself:– reliability of system << reliability of parts
• Can design– system reliability >> reliability of parts [defects]– system reliability ~= reliability of parts [faults]
• For large systems– must engineer reliability of system
![Page 60: CS184b: Computer Architecture (Abstractions and Optimizations)](https://reader035.vdocument.in/reader035/viewer/2022070410/5681465b550346895db3794b/html5/thumbnails/60.jpg)
Caltech CS184 Spring2003 -- DeHon61
Big Ideas
• Detect failures– static: directed test– dynamic: use redundancy to guard
• Repair with Redundancy
• Model– establish and provide model of correctness
• perfect model part (memory model)• visible defects in model (disk drive model)