penn ese534 spring2014 -- dehon 1 ese534: computer organization day 26: april 30, 2014 defect and...
TRANSCRIPT
![Page 1: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/1.jpg)
Penn ESE534 Spring2014 -- DeHon1
ESE534:Computer Organization
Day 26: April 30, 2014
Defect and Fault Tolerance
![Page 2: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/2.jpg)
Penn ESE534 Spring2014 -- DeHon2
Today
• Defect and Fault Tolerance– Problem– Defect Tolerance– Fault Tolerance
![Page 3: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/3.jpg)
Penn ESE534 Spring2014 -- DeHon3
Warmup Discussion
• Where do we guard against defects and faults today?– Where do we accept imperfection today?
![Page 4: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/4.jpg)
Penn ESE534 Spring2014 -- DeHon4
Motivation: Probabilities
• Given:– N objects
– Pg yield probability
• What’s the probability for yield of composite system of N items? [Preclass 1]
– Assume iid faults
– P(N items good) = (Pg)N
![Page 5: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/5.jpg)
Penn ESE534 Spring2014 -- DeHon5
Probabilities
• Pall_good(N)= (Pg)N
• P=0.999999
N Pall_good(N)
104
105
106
107
![Page 6: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/6.jpg)
Penn ESE534 Spring2014 -- DeHon6
Probabilities
• Pall_good(N)= (Pg)N
• P=0.999999
N Pall_good(N)
104 0.99105 0.90106 0.37107 0.000045
![Page 7: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/7.jpg)
Penn ESE534 Spring2014 -- DeHon7
Simple Implications
• As N gets large– must either increase reliability– …or start tolerating failures
• N– memory bits– disk sectors– wires – transmitted data bits– processors– transistors – molecules
– As devices get smaller, failure rates increase
– chemists think P=0.95 is good
– As devices get faster, failure rate increases
![Page 8: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/8.jpg)
Failure Rate Increases
Penn ESE534 Spring2014 -- DeHon8
[Nassif / DATE 2010]
![Page 9: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/9.jpg)
Quality Required for Perfection?
• How high must Pg be to achieve 90% yield on a collection of 1010 devices?
[preclass 3]
Penn ESE534 Spring2014 -- DeHon9
€
Pg( )1010
> 0.9
Pg>1-10-11
![Page 10: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/10.jpg)
Failure Rate Increases
Penn ESE534 Spring2014 -- DeHon10
[Nassif / DATE 2010]
![Page 11: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/11.jpg)
Penn ESE534 Spring2014 -- DeHon11
Defining Problems
![Page 12: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/12.jpg)
Penn ESE534 Spring2014 -- DeHon12
Three Problems1. Defects: Manufacturing imperfection
– Occur before operation; persistent• Shorts, breaks, bad contact
2. Transient Faults: – Occur during operation; transient
• node X value flips: crosstalk, ionizing particles, bad timing, tunneling, thermal noise
3. Lifetime “wear” defects– Parts become bad during operational lifetime
• Fatigue, electromigration, burnout….– …slower
• NBTI, Hot Carrier Injection
![Page 13: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/13.jpg)
Penn ESE534 Spring2014 -- DeHon13
Sherkhar BokarIntel Fellow
Micro37 (Dec.2004)
![Page 14: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/14.jpg)
Penn ESE534 Spring2014 -- DeHon14
Defect Rate• Device with 1011 elements (100BT)
• 3 year lifetime = 108 seconds
• Accumulating up to 10% defects
• 1010 defects in 108 seconds1 new defect every 10ms
• At 10GHz operation:• One new defect every 108 cycles
• Pnewdefect=10-19
![Page 15: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/15.jpg)
Penn ESE534 Spring2014 -- DeHon15
First Step to Recover
Admit you have a problem(observe that there is a failure)
![Page 16: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/16.jpg)
Penn ESE534 Spring2014 -- DeHon16
Detection
• How do we determine if something wrong?– Some things easy
• ….won’t start
– Others tricky• …one and gate computes False & TrueTrue
• Observability– can see effect of problem– some way of telling if defect/fault present
![Page 17: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/17.jpg)
Penn ESE534 Spring2014 -- DeHon17
Detection• Coding
– space of legal values << space of all values– should only see legal– e.g. parity, ECC (Error Correcting Codes)
• Explicit test (defects, recurring faults)– ATPG = Automatic Test Pattern Generation– Signature/BIST=Built-In Self-Test– POST = Power On Self-Test
• Direct/special access– test ports, scan paths
![Page 18: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/18.jpg)
Penn ESE534 Spring2014 -- DeHon18
Coping with defects/faults?
• Key idea: redundancy• Detection:
– Use redundancy to detect error
• Mitigating: use redundant hardware– Use spare elements in place of faulty
elements (defects)– Compute multiple times so can discard faulty
result (faults)
![Page 19: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/19.jpg)
Penn ESE534 Spring2014 -- DeHon19
Defect Tolerance
![Page 20: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/20.jpg)
Penn ESE534 Spring2014 -- DeHon20
Two Models
• Disk Drives (defect map)
• Memory Chips (perfect chip)
![Page 21: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/21.jpg)
Penn ESE534 Spring2014 -- DeHon21
Disk Drives
• Expose defects to software– software model expects faults
• Create table of good (bad) sectors
– manages by masking out in software• (at the OS level)• Never allocate a bad sector to a task or file
– yielded capacity varies
![Page 22: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/22.jpg)
Penn ESE534 Spring2014 -- DeHon22
Memory Chips
• Provide model in hardware of perfect chip
• Model of perfect memory at capacity X
• Use redundancy in hardware to provide perfect model
• Yielded capacity fixed– discard part if not achieve
![Page 23: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/23.jpg)
Penn ESE534 Spring2014 -- DeHon23
Example: Memory
• Correct memory:– N slots– each slot reliably stores last value written
• Millions, billions, etc. of bits…– have to get them all right?
![Page 24: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/24.jpg)
Failure Rate Increases
Penn ESE534 Spring2014 -- DeHon24
[Nassif / DATE 2010]
![Page 25: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/25.jpg)
Penn ESE534 Spring2014 -- DeHon25
Memory Defect Tolerance
• Idea:– few bits may fail– provide more raw bits– configure so yield what looks like a perfect
memory of specified size
![Page 26: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/26.jpg)
Penn ESE534 Spring2014 -- DeHon26
Memory Techniques
• Row Redundancy
• Column Redundancy
• Bank Redundancy
![Page 27: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/27.jpg)
Penn ESE534 Spring2014 -- DeHon27
Row Redundancy
• Provide extra rows
• Mask faults by avoiding bad rows
• Trick:– have address decoder substitute spare
rows in for faulty rows– use fuses to program
![Page 28: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/28.jpg)
Penn ESE534 Spring2014 -- DeHon28
Spare Row
![Page 29: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/29.jpg)
Penn ESE534 Spring2014 -- DeHon29
Column Redundancy
• Provide extra columns
• Program decoder/mux to use subset of columns
![Page 30: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/30.jpg)
Penn ESE534 Spring2014 -- DeHon30
Spare Memory Column
• Provide extra columns
• Program output mux to avoid
![Page 31: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/31.jpg)
Penn ESE534 Spring2014 -- DeHon31
Bank Redundancy
• Substitute out entire bank– e.g. memory subarray
• include 5 banks– only need 4 to yield perfect
• (N+1 sparing more typical for larger N)
![Page 32: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/32.jpg)
Penn ESE534 Spring2014 -- DeHon32
Spare Bank
![Page 33: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/33.jpg)
Yield M of N
• Preclass 4: Probability of yielding 3 of 5 things?– Symbolic?
– Numerical for Pg=0.9?
Penn ESE534 Spring2014 -- DeHon33
![Page 34: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/34.jpg)
Penn ESE534 Spring2014 -- DeHon34
Yield M of N
• P(M of N) = P(yield N) + (N choose N-1) P(exactly N-1)+ (N choose N-2) P(exactly N-2)…+ (N choose N-M) P(exactly N-M)…
[think binomial coefficients]
![Page 35: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/35.jpg)
Penn ESE534 Spring2014 -- DeHon35
M of 5 example• 1*P5 + 5*P4(1-P)1+10P3(1-P)2+10P2(1-
P)3+5P1(1-P)4 + 1*(1-P)5
• Consider P=0.9– 1*P5 0.59 M=5 P(sys)=0.59– 5*P4(1-P)1 0.33 M=4 P(sys)=0.92– 10P3(1-P)2 0.07 M=3 P(sys)=0.99– 10P2(1-P)3 0.008– 5P1(1-P)4 0.00045– 1*(1-P)5 0.00001 Can achieve higher
system yield than
individual components!
![Page 36: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/36.jpg)
Penn ESE534 Spring2014 -- DeHon36
Repairable Area
• Not all area in a RAM is repairable– memory bits spare-able – io, power, ground, control not redundant
![Page 37: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/37.jpg)
Penn ESE534 Spring2014 -- DeHon37
Repairable Area
• P(yield) = P(non-repair) * P(repair)
• P(non-repair) = PNnr
– Nnr<<Ntotal
– P > Prepair
• e.g. use coarser feature size• Differential reliability
• P(repair) ~ P(yield M of N)
![Page 38: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/38.jpg)
Penn ESE534 Spring2014 -- DeHon38
Consider a Crossbar
• Allows us to connect any of N things to each other– E.g.
• N processors• N memories• N/2 processors+ N/2 memories
![Page 39: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/39.jpg)
Penn ESE534 Spring2014 -- DeHon39
Crossbar Buses and Defects
• Two crossbars
• Wires may fail
• Switches may fail
• Provide more wires– Any wire fault avoidable
• M choose N
![Page 40: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/40.jpg)
Penn ESE534 Spring2014 -- DeHon40
Crossbar Buses and Defects
• Two crossbars
• Wires may fail
• Switches may fail
• Provide more wires– Any wire fault avoidable
• M choose N
![Page 41: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/41.jpg)
Penn ESE534 Spring2014 -- DeHon41
Crossbar Buses and Faults
• Two crossbars
• Wires may fail
• Switches may fail
• Provide more wires– Any wire fault avoidable
• M choose N
![Page 42: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/42.jpg)
Penn ESE534 Spring2014 -- DeHon42
• Two crossbars• Wires may fail• Switches may fail
• Provide more wires– Any wire fault avoidable
• M choose N
– Same idea
Crossbar Buses and Faults
![Page 43: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/43.jpg)
Penn ESE534 Spring2014 -- DeHon43
Simple System
• P Processors
• M Memories
• Wires
Memory, Compute, Interconnect
![Page 44: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/44.jpg)
Penn ESE534 Spring2014 -- DeHon44
Simple System w/ Spares
• P Processors
• M Memories
• Wires
• Provide spare– Processors– Memories– Wires
![Page 45: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/45.jpg)
Penn ESE534 Spring2014 -- DeHon45
Simple System w/ Defects
• P Processors• M Memories• Wires• Provide spare
– Processors– Memories– Wires
• ...and defects
![Page 46: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/46.jpg)
Penn ESE534 Spring2014 -- DeHon46
Simple System Repaired
• P Processors• M Memories• Wires• Provide spare
– Processors– Memories– Wires
• Use crossbar to switch together good processors and memories
![Page 47: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/47.jpg)
Penn ESE534 Spring2014 -- DeHon47
In Practice
• Crossbars are inefficient [Day16-19]
• Use switching networks with– Locality– Segmentation
• …but basic idea for sparing is the same
![Page 48: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/48.jpg)
Defect Tolerance Questions?
Penn ESE534 Spring2014 -- DeHon48
![Page 49: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/49.jpg)
Penn ESE534 Spring2014 -- DeHon49
Fault Tolerance
![Page 50: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/50.jpg)
Penn ESE534 Spring2014 -- DeHon50
Faults
• Bits, processors, wires– May fail during operation
• Basic Idea same:– Detect failure using redundancy– Correct
• Now– Must identify and correct online with the
computation
![Page 51: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/51.jpg)
Transient Sources
• Effects– Thermal noise– Timing– Ionizing particles
particle 105 to 106 electrons• Calculated gates with 15--30 electrons Day 6
– Even if CMOS restores, takes time
Penn ESE534 Spring2014 -- DeHon51
![Page 52: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/52.jpg)
Voltage and Error Rate
Penn ESE534 Spring2014 -- DeHon52
[Austin et al.--IEEE Computer, March 2004]
![Page 53: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/53.jpg)
Penn ESE534 Spring2014 -- DeHon53
Scaling and Error Rates
Source: Carter/Intel –53
–Increasing Error Rates
–1
–10
–180 –130 –90 –65 –45 –32
–Technology (nm)
– SE
U/b
it N
orm
to
130
nm
–cache –arrays
–logic–2X bit/latch count increase per generation
![Page 54: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/54.jpg)
Errors versus Frequency
Penn ESE534 Spring2014 -- DeHon54
0.0
1.0
2.0
3.0
4.0
5.0
2100 2400 2700 3000 3300 3600
Clock Frequency (MHz)
Th
rou
gh
pu
t (B
IPS
)
1.0E-13
1.0E-10
1.0E-07
1.0E-04
1.0E-01
1.0E+02
Erro
r Ra
te (%
)
–Conventional Design –Max TP
–Resilient Design Max TP
–VCC & Temperature
–FCLK Guardband
[Bowman, ISSCC 2008]
![Page 55: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/55.jpg)
Penn ESE534 Spring2014 -- DeHon55
Simple Memory Example
• Problem: bits may lose/change value– Alpha particle – Molecule spontaneously switches
• Idea: – Store multiple copies– Perform majority vote on result
![Page 56: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/56.jpg)
Penn ESE534 Spring2014 -- DeHon56
Redundant Memory
![Page 57: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/57.jpg)
Penn ESE534 Spring2014 -- DeHon57
Redundant Memory
• Like M-choose-N
• Only fail if >(N-1)/2 faults
• P=0.9
• P(2 of 3)All good: (0.9)3 = 0.729+ Any 2 good: 3(0.9)2(0.1)=0.243= 0.971
![Page 58: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/58.jpg)
Penn ESE534 Spring2014 -- DeHon58
Better: Less Overhead
• Don’t have to keep N copies
• Block data into groups
• Add a small number of bits to detect/correct errors
![Page 59: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/59.jpg)
Penn ESE534 Spring2014 -- DeHon59
Row/Column Parity
• Think of NxN bit block as array
• Compute row and column parities– (total of 2N bits)
![Page 60: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/60.jpg)
Penn ESE534 Spring2014 -- DeHon60
Row/Column Parity
• Think of NxN bit block as array
• Compute row and column parities– (total of 2N bits)
• Any single bit error
![Page 61: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/61.jpg)
Penn ESE534 Spring2014 -- DeHon61
Row/Column Parity
• Think of NxN bit block as array
• Compute row and column parities– (total of 2N bits)
• Any single bit error
• By recomputing parity– Know which one it is– Can correct it
![Page 62: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/62.jpg)
InClass Exercise
• Which Block has an error?
• What correction do we need?
Penn ESE534 Spring2014 -- DeHon62
![Page 63: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/63.jpg)
Penn ESE534 Spring2014 -- DeHon63
Row/Column Parity
• Simple case is 50% overhead – Add 8 bits to 16– Better than 200%
with 3 copies– More expensive
than used in practice
![Page 64: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/64.jpg)
Penn ESE534 Spring2014 -- DeHon64
In Use Today
• Conventional DRAM Memory systems– Use 72b ECC (Error Correcting Code)– On 64b words [12.5% overhead]– Correct any single bit error– Detect multibit errors
• CD and flash blocks are ECC coded– Correct errors in storage/reading
![Page 65: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/65.jpg)
RAID
• Redundant Array of Inexpensive Disks
• Disk drives have ECC on sectors– At least enough to detect failure
• RAID-5 has one parity disk– Tolerate any single disk failure– E.g. 8-of-9 survivability case– With hot spare, can rebuild data on spare
Penn ESE534 Spring2014 -- DeHon65
![Page 66: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/66.jpg)
Penn ESE534 Spring2014 -- DeHon66
Interconnect
• Also uses checksums/ECC– Guard against data transmission errors– Environmental noise, crosstalk, trouble
sampling data at high rates…
• Often just detect error
• Recover by requesting retransmission– E.g. TCP/IP (Internet Protocols)
![Page 67: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/67.jpg)
Penn ESE534 Spring2014 -- DeHon67
Interconnect
• Also guards against whole path failure
• Sender expects acknowledgement
• If no acknowledgement will retransmit
• If have multiple paths– …and select well among them– Can route around any fault in interconnect
![Page 68: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/68.jpg)
Penn ESE534 Spring2014 -- DeHon68
Interconnect Fault Example
• Send message• Expect
Acknowledgement
![Page 69: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/69.jpg)
Penn ESE534 Spring2014 -- DeHon69
Interconnect Fault Example
• Send message• Expect
Acknowledgement• If Fail
![Page 70: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/70.jpg)
Penn ESE534 Spring2014 -- DeHon70
Interconnect Fault Example
• Send message• Expect
Acknowledgement• If Fail
– No ack
![Page 71: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/71.jpg)
Penn ESE534 Spring2014 -- DeHon71
Interconnect Fault Example
• If Fail no ack– Retry– Preferably with different resource
![Page 72: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/72.jpg)
Penn ESE534 Spring2014 -- DeHon72
Interconnect Fault Example
• If Fail no ack– Retry– Preferably with different resource
Ack signals success
![Page 73: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/73.jpg)
Penn ESE534 Spring2014 -- DeHon73
Compute Elements
• Simplest thing we can do:– Compute redundantly– Vote on answer– Similar to redundant memory
![Page 74: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/74.jpg)
Penn ESE534 Spring2014 -- DeHon74
Compute Elements
• Unlike Memory– State of computation important– Once a processor makes an error
• All subsequent results may be wrong
• Response– “reset” processors which fail vote– Go to spare set to replace failing processor
![Page 75: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/75.jpg)
Penn ESE534 Spring2014 -- DeHon75
In Use
• NASA Space Shuttle– Uses set of 4 voting processors
• Boeing 777– Uses voting processors
• Uses different architectures for processors• Uses different software• Avoid Common-Mode failures
– Design errors in hardware, software
![Page 76: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/76.jpg)
Penn ESE534 Spring2014 -- DeHon76
Forward Recovery
• Can take this voting idea to gate level– VonNeuman 1956
• Basic gate is a majority gate– Example 3-input voter
• Alternate stages– Compute– Voting (restoration)
• Number of technical details…• High level bit:
– Requires Pgate>0.996– Can make whole system as reliable as individual
gate
![Page 77: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/77.jpg)
Penn ESE534 Spring2014 -- DeHon77
Majority Multiplexing
[Roy+Beiu/IEEE Nano2004]
Maybe there’s a better way…
![Page 78: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/78.jpg)
Detect vs. Correct
• Detection is cheaper than correction
• To handle k-faults– Voting correction requires 2k+1
• K=1 3
– Detection requires k+1• K=1 2
Penn ESE534 Spring2014 -- DeHon78
![Page 79: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/79.jpg)
Penn ESE534 Spring2014 -- DeHon79
Rollback Recovery
• Commit state of computation at key points– to memory (ECC, RAID protected...)– …reduce to previously solved problem of
protecting memory
• On faults (lifetime defects)– recover state from last checkpoint– like going to last backup….– …(snapshot)
![Page 80: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/80.jpg)
Rollback vs. Forward
Penn ESE534 Spring2014 -- DeHon80
![Page 81: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/81.jpg)
Penn ESE534 Spring2014 -- DeHon81
Defect vs. Fault Tolerance
• Defect– Can tolerate large defect rates (10%)
• Use virtually all good components• Small overhead beyond faulty components
• Fault – Require lower fault rate (e.g. VN <0.4%)
• Overhead to do so can be quite large
![Page 82: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/82.jpg)
Fault/Defect Models
• i.i.d. fault (defect) occurrences easy to analyze
• Good for?
• Bad for?
• Other models?– Spatially or temporally clustered– Burst – Adversarial
Penn ESE534 Spring2014 -- DeHon82
![Page 83: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/83.jpg)
Penn ESE534 Spring2014 -- DeHon83
Summary• Possible to engineer practical, reliable systems from
– Imperfect fabrication processes (defects)– Unreliable elements (faults)
• We do it today for large scale systems– Memories (DRAMs, Hard Disks, CDs)– Internet
• …and critical systems– Space ships, Airplanes
• Engineering Questions– Where invest area/effort?
• Higher yielding components? Tolerating faulty components?
![Page 84: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/84.jpg)
Penn ESE534 Spring2014 -- DeHon84
Big Ideas• Left to itself:
– reliability of system << reliability of parts
• Can design– system reliability >> reliability of parts [defects]– system reliability ~= reliability of parts [faults]
• For large systems– must engineer reliability of system– …all systems becoming “large”
![Page 85: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/85.jpg)
Penn ESE534 Spring2014 -- DeHon85
Big Ideas
• Detect failures– static: directed test– dynamic: use redundancy to guard
• Repair with Redundancy
• Model– establish and provide model of correctness
• Perfect component model (memory model)• Defect map model (disk drive model)
![Page 86: Penn ESE534 Spring2014 -- DeHon 1 ESE534: Computer Organization Day 26: April 30, 2014 Defect and Fault Tolerance](https://reader030.vdocument.in/reader030/viewer/2022032611/56649e755503460f94b76119/html5/thumbnails/86.jpg)
Penn ESE534 Spring2014 -- DeHon86
Admin
• Discussion period ends today– From midnight on, no discussion of
approaches to final
• FM2 due today– Get them in today and feedback by weekend
• Final due Monday, May 12 (10pm)– No late finals
• André traveling next two weeks