self repair technology for logic circuits
Computer Engineering
CREDES / ZUSYS / DAAD Summer School 2011, Tallinn
Self Repair Technology for Logic Circuits
Architecture, Overhead and Limitations
Heinrich T. Vierhaus, BTU Cottbus
Computer Engineering Group
Outline
1. Introduction: Nano Structure Problems
2. The Problem of Wear-Out
3. Repair for Memory and FPGAs
4. Basic Logic Repair Strategies & Structures
5. Test and Repair Administration
6. De-Stressing Strategies
7. Cost, Overhead, Single Points of Failure
8. Summary and Conclusions
1. Introduction
A bunch of new problems from nano structures ...
Nanoelectronic Problems
Lithography:
The wavelength used to „map“ structural information from masks to wafers is larger (by a factor of 4 or more) than the minimum structural features (193 nm versus 90 / 65 / 45 nm).
Layouts have to be adapted to correct mapping faults.
Statistical Parameter Variations:
The number of doping atoms in MOS transistor channels becomes so small that statistical variations of doping densities have an impact on device parameters such as threshold voltages.
New Problems with Nano-Technologies
[Figure: lithography setup with light source, mask (reticle), resist, exposed resist and wafer. Wavelength: 193 nm. Feature size: down to 28 nm.]
Layout Correction
Modified layout for compensation of mapping faults
Compensation is critical and non-ideal
Faults are not random but correlated!
Requires fast fault diagnosis
Doping Fluctuations in MOS Transistors
[Figure: two MOS transistor cross-sections (p-substrate, n regions, poly-Si gate) with different numbers and positions of doping atoms.]
Density and distribution of doping atoms cause shifts in transistor threshold voltages!
Nanostructure Problems
Individual device characteristics such as Vth depend more strongly on statistical variations of underlying physical features such as doping profiles. Primary relevance: yield.
A significant share of basic devices will be „out of specs“ and will need replacement by backup elements for yield improvement after production. Primary relevance: yield.
Smaller features mean higher stress (field strength, current density) and foster new mechanisms of early wear-out. Primary relevance: lifetime.
Transient error recognition and compensation „in time“ is becoming a must, e. g. because charged particles can discharge circuit nodes. Primary relevance: dependability.
Fault Tolerant Computing
[Figure: a fault event can be handled at three levels.]
Software-based fault detection & compensation: works only for transient faults!
HW logic & RT-level detection & compensation: typically works for transient and permanent faults!
Transistor- and switch-level compensation: typically works for specific types of transient faults only!
Scope labels in the figure: specific, very specific, universal.
2. Wear-Out Problems and Mechanisms
Structures on ICs used to live longer than either their application or even their users. Not any more ...
IC Structures May Get Tired
„Wear-out“ effects in nano-electronic ICs are likely to appear much earlier, causing a lot of problems for dependable long-term applications!
Fault Effects on ICs
[Figure: IC cross-section with n-well, n and p regions, gate oxide (high-k), field oxide, polyimide (low-k) insulator, metal layers 1 to 3 and a via.]
Fault effects marked in the figure: metal migration; low-k insulator deterioration; transistor deterioration (HCI, NBTI), eventually gate oxide shorts!
Wear-Out Mechanisms
Metal Migration:
Metal atoms (Al, Cu) tend to migrate under high current density and high temperature.
Stress migration:
Migration effects may be enhanced under mechanical stress conditions.
Effect:
Migration may interrupt metal lines and vias. The effect is partly reversible by changing current directions.
Metal Migration
[Figure: a metal wire under high current density, new and after some time in operation. Voids (holes) lead to open defects; shorts to neighbor wires can also appear.]
Vias are especially prone to such defects.
The effect is reversible by reversing the direction of current flow!
Transistor Degradation
Negative Bias Temperature Instability (NBTI): Reduced switching speed for p-channel MOS transistors that have operated under long-time constant negative gate bias. The effect is partly reversible.
Hot Carrier Injection (HCI): Reduced switching speed for n-channel MOS transistors, induced by positive gate bias and frequent switching. Not reversible.
Gate Oxide Deterioration: Induced by high field strength. Not reversible.
Dielectric Breakdown: Insulating layers between metal lines may break, causing shorts between signal lines.
Design technology has to include a prospective „lifetime budget“!
Management of Wear-Out by „Fault Tolerant Computing“?
Built-in fault tolerance and error compensation are needed in nano-technologies anyway for the management of transient faults.
Wear-out induced faults may show up as „intermittent“ faults first, which become more and more frequent.
Faults in synchronous circuits and systems are detected „by clock cycle“. Hence, for many types of fault tolerant architecture, the detection does not even recognize whether a fault is permanent or not.
Triple Modular Redundancy
[Figure: the input signal feeds Execution Units 1, 2 and 3; a comparator / voter delivers the result out (majority) and an error detect signal.]
Can detect and compensate almost any type of fault.
Overhead about 200-300 %, additional signal delays.
The voter itself is not covered but must be a „self checking checker“.
Standard (by law) in avionics applications!
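The voting step can be illustrated in software. The following is a minimal sketch, not the circuit from the slide: the function name `majority_vote` and the example values are assumptions, and the voter is taken to be fault-free, which the slide explicitly does not guarantee.

```python
def majority_vote(a: int, b: int, c: int) -> tuple[int, bool]:
    """Bitwise 2-out-of-3 majority over the results of three execution units.

    Returns (voted_result, error_detected). The voter itself is assumed
    to be fault-free here (an assumption, not a property of real hardware).
    """
    voted = (a & b) | (a & c) | (b & c)
    error_detected = not (a == b == c)
    return voted, error_detected


# Hypothetical example: unit 2 delivers a result with one flipped bit.
correct = 0b1011
results = [correct, correct ^ 0b0100, correct]
voted, error_detected = majority_vote(*results)
print(f"voted={voted:04b} error_detected={error_detected}")
```

The vote works identically in every clock cycle, so it masks a permanent fault in one unit just as it masks a transient one, as long as the other two units agree.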
Error Detecting / Correcting Codes
[Figure: a signature is computed from the data; data and signature pass through transmission / storage; the signature is recomputed from the received data and compared with the received signature, yielding fault detect and error correction.]
Often applicable to 1- or 2-bit faults only
Becomes expensive if applied to computational units
Often limited to certain fault models (uni-directional)
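As one concrete instance of such a code, the sketch below uses a Hamming(7,4) code, which corrects any single-bit error in a 7-bit word. The slide does not name a specific code, so this choice, the bit ordering and the function names are assumptions made for illustration.

```python
def encode(data_bits):
    """Hamming(7,4): add three check bits (the "signature") to four data bits."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Word layout by position 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(word):
    """Recompute the checks, compare, and correct a single-bit error.

    Returns (data_bits, syndrome); a non-zero syndrome is the 1-based
    position of the bit that was flipped back.
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        w[syndrome - 1] ^= 1
    return [w[2], w[4], w[5], w[6]], syndrome


data = [1, 0, 1, 1]
stored = encode(data)
stored[4] ^= 1  # a single-bit fault during storage (position 5)
recovered, syndrome = decode(stored)
print(recovered, syndrome)
```

With two flipped bits the same decoder miscorrects, which is the limitation to small numbers of faulty bits mentioned above.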
Can TMR and Codes Compensate Permanent Faults?
Fault / error detection circuitry typically works on a clock-cycle basis. It does not „know“ if a fault is transient or permanent.
A permanent fault is a fault event that occurs repeatedly in several to many successive clock cycles.
Error correction technology can detect and compensate such permanent faults as well as transient faults.
A critical condition occurs if transient faults occur on top of permanent faults. Then the superposition of fault effects is likely to exceed the system‘s fault handling capacity.
System components that run actively „in parallel“ suffer from the same wear-out effects. Therefore there is an increase in dependability before the wear-out limits are reached, but no significant lifetime extension!
Redundancy and Wear-Out
During the normal lifetime of the system, duplication or triplication can enhance reliability significantly. But area and power consumption are also roughly tripled.
And by the end of the normal operating time (out of fuel / steam), all three systems will fail shortly one after the other!
Reliability enhancement is not equal to lifetime extension!
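The difference between reliability enhancement and lifetime extension can be made concrete with the textbook TMR reliability formula. The sketch below rests on strong simplifying assumptions that are not from the slides: independent units, a perfect voter, and a constant failure rate `lam` (real wear-out has a rising failure rate, which makes the effect stronger). Under these assumptions R_TMR = 3R² − 2R³, and the mean time to failure of TMR is 5/(6·lam), which is shorter than 1/lam for a single unit.

```python
import math


def r_single(t: float, lam: float) -> float:
    """Reliability of one unit with constant failure rate lam (assumption)."""
    return math.exp(-lam * t)


def r_tmr(t: float, lam: float) -> float:
    """Reliability of TMR with a perfect voter: at least 2 of 3 units work."""
    r = r_single(t, lam)
    return 3 * r**2 - 2 * r**3


def mttf_single(lam: float) -> float:
    return 1.0 / lam


def mttf_tmr(lam: float) -> float:
    # Integral of 3*exp(-2*lam*t) - 2*exp(-3*lam*t) over t from 0 to infinity.
    return 5.0 / (6.0 * lam)


lam = 1e-5  # hypothetical failure rate per hour
for fraction in (0.1, 0.5, 1.0, 2.0):
    t = fraction / lam
    print(f"t = {fraction:3.1f}/lam  single: {r_single(t, lam):.3f}  TMR: {r_tmr(t, lam):.3f}")
```

Early in life TMR is clearly more reliable than a single unit; once the single-unit reliability drops below 0.5 the triplicated system is the less reliable one.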
Self Repair?
[Figure: the three fault handling levels from the slide „Fault Tolerant Computing“, extended by an additional entry.]
Self Repair for permanent faults!
3. Repair for Memory and FPGAs
Compensation of transient faults is not enough.
Some technologies for transient compensation can handle permanent faults, too, but not in the long run and not in the presence of additional transient faults!
Memory Test & Repair
[Figure: memory array with lines and columns, line address, read / write lines and a spare column.]
Memory Test & Repair (2)
[Figure: the same memory array with lines, columns, line address, read / write lines and spare column, extended by a memory BIST controller.]
... is already state-of-the-art!
FPGA-based Self Repair
[Figure: FPGA array of logic blocks (L) and wiring blocks (W) with FPGA macro-blocks working as CPUs, connected to a memory holding configuration software and application software & data.]
FPGA-based embedded controller: 8051
* e. g. proposed by McCluskey et al., IEEE Design and Test 2004
In-System FPGA Repair
[Figure, shown in two steps: FPGA-based CPUs under repair. The array of logic blocks (L) and wiring blocks (W) contains a fault; labels mark the system function, the repair function, and the memory holding configuration software and application software & data.]
Repair Mechanism: Row/Line-Shift
[Figure: array of CLBs in rows with occupied CLBs, a row with a faulty CLB, and a reserve row.]
Little overhead for the re-configuration process
Loss of many “good” CLBs for every fault
Distributed Backup CLBs
[Figure: array of CLBs with an additional column of backup CLBs. Legend: functionally occupied CLB, non-occupied CLB (reserve), faulty CLB, selected replacement CLB.]
Minimum loss of functional CLBs
High effort for re-wiring requires massive „embedded“ computing power (32-bit CPU, 500 MHz)
Self Repair within FPGA Basic Blocks
Heterogeneous repair strategies required (memory, logic)
Logic blocks may use methods known from memory BISR
Additional repair strategies are necessary for logic elements
The basic overhead of FPGAs versus standard logic (a factor of about 10) is increased further.
Repair strategies for logic may use some features already used in FPGAs (e. g. switched interconnects).
Structure of a CLB Slice
[Figure: CLB slice with a logic field (logic in, program in, logic out) including a redundant row, SRAM cells, multiplexers (MUX) and flip-flops (FF) with FF inputs and outputs.]
FPGAs for a Solution?
The granularity of re-configurable logic blocks (CLBs) in most FPGAs is in the order of several thousand transistors. Replacement strategies must work at a granularity of blocks of about 100-500 transistors for fault densities between 0.01 % and 0.1 %.
An efficient FPGA repair mechanism requires detailed fault diagnosis plus specific repair schemes, which cannot be kept as pre-computed reconfiguration schemes.
Computation of specific repair schemes requires „in-system EDA“ (re-placement and routing) with a massive demand for computing power.
There is no source of such „always available“ computing power.
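The link between fault density and block size can be checked with a short calculation. The sketch assumes that every transistor fails independently with the same probability, which is a simplification (the slides note elsewhere that faults can be correlated). The function names and the group size of three functional blocks plus one spare are assumptions for illustration.

```python
def p_block_faulty(n_transistors: int, fault_density: float) -> float:
    """Probability that a block contains at least one faulty transistor."""
    return 1.0 - (1.0 - fault_density) ** n_transistors


def p_group_repairable(n_transistors: int, fault_density: float, k: int = 3) -> float:
    """Probability that a group of k blocks plus 1 spare has at most one faulty block."""
    q = p_block_faulty(n_transistors, fault_density)
    blocks = k + 1
    return (1 - q) ** blocks + blocks * q * (1 - q) ** k


for density in (0.0001, 0.001):          # 0.01 % and 0.1 %
    for size in (100, 500, 5000):        # transistors per replaceable block
        print(f"density {density:.4f}, block size {size:5d}: "
              f"block faulty {p_block_faulty(size, density):.3f}, "
              f"group repairable {p_group_repairable(size, density):.3f}")
```

At a fault density of 0.1 %, a block of several thousand transistors is almost certainly faulty, so a single spare per group cannot help; at 100 transistors, fewer than one block in ten is affected.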
Self-Repairing FPGA ?
[Figure: reconfigurable logic made of CLBs and wiring blocks (WB), part of which implements a virtual CPU. A memory holds the program and the configuration scheme; the virtual CPU computes a new configuration for the reconfigurable logic.]
Advanced FPGA Structures
[Figure: FPGA with hard-wired CPUs, ALUs and multipliers (MULT) next to the array of CLBs and wiring blocks (WB).]
... are only partly re-configurable for performance reasons!
FPGA / CPLD Repair
Looks pretty easy at first glance because of the regular architecture!
Requires lines / columns of switches for configuration at inputs and between AND / OR matrices.
Requires additional programmability of cross-points by double-gate transistors as in EEPROMs or Flash memory.
Not fully compatible with standard CMOS
Limited number of (re-) configurations
Floating gate (FAMOS) transistors are fault-sensitive!
4. Basic Logic Repair Strategies
Repair techniques that replace failing building blocks by redundant elements from a „silent“ storage are not new.
IBM has been selling such computer systems, specifically for applications in banks, for decades.
But always with few (2-10) backup elements (CPUs), assuming a small number of failures (< 10) within years.
Mainframes
.. will often contain „redundant“ CPUs for eventual fault compensation. But one faulty transistor then „costs“ a whole CPU, limiting the fault handling to a few (about 10) permanent fault cases.
Granularity of Replacement
[Figure: replacement granularity in transistors on a scale from 10^0 to 10^6 (transistor, gate, macro, FPGA block, core, CPU), set against the expected fault density (1 out of ...). Marked regions: block-level replacement (e. g. FPGAs), core replacement (e. g. CPU), and a region for logic labeled „hardly explored“.]
Repair Overhead versus Element Loss
[Figure: repair procedure overhead and functioning elements lost, plotted over the size of the replaced blocks (granularity, 1 to 10M). Marked regions: prohibitive overhead, prohibitive fault density, and „New Methods and Architectures“.]
Built-in Self Repair (BISR)
BISR is well understood for highly regular structures such as embedded memory blocks.
BISR essentially depends on built-in self test (BIST) with high diagnostic resolution.
[Figure: fault detection, fault diagnosis, fault isolation and redundancy allocation in sequence, all under fault / redundancy management.]
Redundancy management must monitor faults, replacements and available redundancy, and must also re-establish a „working“ system state after power-down states.
Levels of Repair
Transistors - Switch Level
Replace transistors or transistor groups
Losses by reconfiguration (switched-off „good“ devices): Potentially small (20 to 50 %) for transistor faults
Overhead for test and diagnosis: Very high
Gate Level
Replace gates or logic cells
Losses by reconfiguration: Medium (60 to 90 %) for single transistor faults
Overhead for test and diagnosis: High
Macro-Block Level
Replace functional macros (ALU, FPU, CPU)
Losses by reconfiguration: High, 99 % or more
Overhead for test and diagnosis: Maybe acceptable
Repair overhead will dominate reliability!
The Fault Isolation Problem
[Figure: a driver feeding Load 1 and Load 2, with a gate short at one load input.]
GND-shorts of input gates affect the whole fan-in network and make redundancy obsolete!
Block-Level Repair
[Figure: a block of four AND gates (&) with switching elements (SE).]
Blocks of logic / RT elements (gates and larger) contain a redundant element each that can replace a faulty unit.
Switching Concept (1)
[Figure: configurations 1 and 2 of Functional Blocks 1 to 3 and the Replacement Block, each with switched inputs and outputs and with test in / test out terminals.]
Switching Concept (2)
[Figure: configurations 3 and 4 of Functional Blocks 1 to 3 and the Replacement Block, each with switched inputs and outputs and with test in / test out terminals.]
A Regular Switching Scheme
The scheme is regular and scalable by nature, always comprising k functional blocks of the same type plus 1 additional block for backup.
Building blocks are separated by (pass-) transistor switches at inputs and outputs, providing full isolation of a faulty block.
There are always 2 additional pass-transistors between two functional blocks.
The reconfiguration scheme is regular in shifting functionality between blocks, which results in a simple scheme of administration.
The functional access to the „spare“ block can be used for testing purposes. In any state of (re-) configuration, the potentially „faulty“ block is connected to the test input / output terminals.
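How a configuration state could translate into switch settings is sketched below. This is an illustration under assumed conventions, not the circuit from the slides: blocks are numbered 0 to k with the spare at index k, each function j can be connected either to its own block j or to the neighbor j + 1, and the name `configuration` is hypothetical.

```python
def configuration(k: int, isolated: int) -> dict:
    """Switch settings for k functions on k + 1 blocks with one block isolated.

    isolated == k means the spare is idle (normal operation). The isolated
    block is the one connected to the test input / output terminals.
    """
    if not 0 <= isolated <= k:
        raise ValueError("isolated block index out of range")
    # One control bit per function: False = own block, True = next neighbor.
    use_neighbor = [j >= isolated for j in range(k)]
    assignment = [j + 1 if shifted else j for j, shifted in enumerate(use_neighbor)]
    return {"switch_bits": use_neighbor, "assignment": assignment, "on_test": isolated}


k = 3
for isolated in range(k, -1, -1):
    print(configuration(k, isolated))
```

Each function only ever moves to its direct neighbor, which is why a fixed pair of extra pass-transistors between adjacent blocks is enough and the administration reduces to storing one small state value.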
Overhead Depending on Block Size
Transistor counts:
Basic Element    Functional   Backup   Norm. switch   Ext. switch
3/4 2-NAND       12           4        18             24
3/4 2-AND        18           6        18             24
3/4 2-XOR        18           6        18             24
H-Adder          36           12       24             30
F-Adder          90           30       30             36
For small basic blocks, the switches make up the essential overhead (200 %)!
For larger basic blocks, the overhead can be reduced to about 30-50 %
... not counting test and administration overhead!
Extract larger basic units from seemingly irregular logic netlists!
Overhead
Transistors per RLB (3 functional units):
Basic Block   Functional   Backup   Switches min. / ext.   Overhead
2-NAND        12           4        18 / 24                230 %
2-AND         18           6        18 / 24                160 %
XOR           18           6        18 / 24                160 %
Half Adder    36           12       24 / 30                116 %
Full Adder    90           30       30 / 36                73 %
8-bit ALU     4500         1500     168 / 224              38 %
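The slide does not state how the overhead percentage is derived. The sketch below assumes it is (backup transistors + extended switch transistors) divided by the functional transistors; this formula is an assumption, and it reproduces the listed values only approximately for the smallest blocks (for example about 233 % against the listed 230 % for the 2-NAND).

```python
# (functional, backup, extended switches) per RLB, copied from the table.
blocks = {
    "2-NAND": (12, 4, 24),
    "2-AND": (18, 6, 24),
    "XOR": (18, 6, 24),
    "Half Adder": (36, 12, 30),
    "Full Adder": (90, 30, 36),
    "8-bit ALU": (4500, 1500, 224),
}


def overhead_percent(functional: int, backup: int, switches: int) -> float:
    """Assumed formula: redundancy plus switches relative to functional logic."""
    return 100.0 * (backup + switches) / functional


for name, (functional, backup, switches) in blocks.items():
    print(f"{name:10s} {overhead_percent(functional, backup, switches):6.1f} %")
```

The backup share is a constant 33 % for three functional units plus one spare, so the falling total is entirely due to the switch count growing much more slowly than the block size.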
5. Test and Repair Administration
[Figure, centralized control: logic composed of RLBs, served by a test generator, a test analyzer, a configurator and status memory, and system monitoring.]
[Figure, de-centralized test and control: logic composed of RLBs, each with its own BIST and configuration unit.]
May be faulty!
Blocks, Switching, Administration
[Figure, local (re-) configuration: groups of functional units (F-Unit) and a redundant unit (Red.-Unit) between columns of switches, each group with its own configuration unit (Conf.-Unit) under a global control unit.]
[Figure, remote (re-) configuration: the same groups between columns of switches, each with a decoder driven by configuration units placed at the global control unit.]
Combining Test and Re-Configuration
[Figure: the logic under test receives the test input; its test output is compared with a reference. The fault detect signal and the next state control a configuration memory / counter.]
Test and Administration
Each of the elements in a block is testable via specific test inputs.
Test is done by comparison with reference outputs. The system is run through the states of re-configuration with the same input test pattern applied. At test, a functional unit is always removed from normal operation and connected to the test I/Os.
In case of a „fault detect“, the system is fixed in the current status.
Such a procedure of self-test and self-reconfiguration can run at every system start-up, avoiding a central „fault memory“.
[Figure: Functional Blocks 1 to n and the Replacement Block between input switches and output switches, with test in / test out. A state register and decoder drive the switches; a self test circuit with test clock, fault indicator and fault flag fixes the state at a fault.]
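The start-up procedure described above can be sketched as a loop over configuration states. This is a behavioral illustration only: the blocks are modeled as Python callables, a single test pattern is used, and the state numbering (state s puts block s on the test terminals, the spare is tested last) is an assumption.

```python
from typing import Callable, Sequence


def self_test_and_configure(
    blocks: Sequence[Callable[[int], int]],
    test_pattern: int,
    reference: int,
) -> tuple[int, bool]:
    """Step through the configuration states until a faulty block is found.

    Returns (final_state, fault_flag). On a fault the state is frozen, so the
    faulty block stays isolated on the test terminals. Without a fault the
    loop ends with the spare (last block) isolated, i.e. normal operation.
    """
    state = 0
    for state in range(len(blocks)):
        if blocks[state](test_pattern) != reference:
            return state, True   # "fix at fault"
    return state, False


def good(x: int) -> int:
    return ~x & 0xF              # hypothetical function: 4-bit inverter


def faulty(x: int) -> int:
    return good(x) | 0b0001      # hypothetical defect: output bit 0 stuck at 1


pattern = 0b0101
reference = good(pattern)
print(self_test_and_configure([good, faulty, good, good], pattern, reference))
print(self_test_and_configure([good, good, good, good], pattern, reference))
```

Because the procedure is repeated at every start-up, the configuration state is rediscovered each time and does not have to survive power-down in a central fault memory.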
Controller for (Re-) Configuration
[Figure: controller for an RLB with functions f1 to f3 and a reference block between two columns of switches. Flip-flops in a scan path (scan in, scan out) hold the control bits; a decoder generates the switch signals s1 to s4 for configurations 1 to 4. Further signals: test in, test out, BISR clock, reset, act, fault.]
Controller minimum complexity: 80 transistors (3 + 1 configuration)
A controller may drive one or several re-configurable blocks in parallel, depending on their size
Local Interconnects
The block-based repair scheme described so far cannot cover faults on wires between re-configurable blocks.
For small basic blocks (such as logic gates), the majority of wiring is between re-configurable units and not covered.
For larger (RT-level) basic blocks, the majority of wiring is within basic blocks and covered.
Schemes that can also cover inter-block wiring are possible, but require FPGA-like configurable switching and complex switching schemes.
Essentials of the Repair Scheme
Logic self repair is feasible at a cost below triple modular redundancy (TMR).
There is a trade-off between the size of the reconfigurable logic blocks (RLBs) and the maximum tolerable fault density.
Administration, not redundancy, makes up the critical overhead.
Effort can be saved by administrating several RLBs in parallel.
Low-level interconnects between RLBs are the essential „single point of failure“ in the repair scheme!
6. De-Stressing
[Figure: component failure rates (10^-4 to 10^-1) over system lifetime with time marks t1 to t4, showing a failure curve without de-stressing and a failure curve with de-stressing.]
The Purpose of De-Stressing
Building blocks of equal type in digital systems may be more or less heavily used.
Blocks running with the highest dynamic load and at the highest temperature are candidates for early failure.
Using otherwise „silent“ resources to relieve such units from stress periodically may extend the overall lifetime of the system.
The re-configuration scheme developed for repair may also serve this purpose with slight modifications.
... and the scheme must be compatible with repair architectures!
The Scheme of De-Stressing
[Figure: states 0 to 3 of a group with building blocks BB1, BB2, BB3 and replacement block RB. Task 1 (heavy load), Task 2 (medium load), Task 3 (low load) and the backup / test role are assigned to different blocks in each state.]
A better initial distribution of tasks and stress makes a better re-distribution.
Repair capabilities can be preserved.
But:
De-stressing may need re-organisation within an active system, while repair has been off-line so far!
Modified Control Scheme
For de-stressing, functions have to be shifted while the system is in „hot“ operation.
As long as all building blocks are fully functional, running two functional blocks in parallel, serving the same inputs and outputs, is possible.
With a total of k building blocks (including the spare one) there are k „stable“ states of re-configuration (for k = 4: 1 normal, 3 repairs) and (k-1) intermediate states for „handover“ in case of de-stressing.
No extra switches are necessary, but there is additional overhead in state management and state decoding.
FSM including Transitional States
[Figure: state diagram with stable states 0, 1, 2, 3 and transitional states 0/1, 1/2, 2/3; the transitions are labeled tr = 1 and tr = 0.]
If a „flying“ transition between repair states becomes necessary, the control logic will have seven states instead of four!
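The seven-state controller can be sketched as a small state machine. The exact transition conditions cannot be recovered from the slide, so the reading used here is an assumption: `tr = 1` moves a stable state into the following handover state, `tr = 0` completes the handover, and a set fault flag freezes the controller. All names are hypothetical.

```python
# Stable states 0..3 alternate with handover states in which two blocks
# serve the same function in parallel.
STATES = ["0", "0/1", "1", "1/2", "2", "2/3", "3"]


def next_state(index: int, tr: bool, fault: bool) -> int:
    """One clock step of the assumed controller; index points into STATES."""
    if fault:
        return index                      # a detected fault fixes the state
    is_stable = index % 2 == 0
    if is_stable and tr and index < len(STATES) - 1:
        return index + 1                  # start the handover
    if not is_stable and not tr:
        return index + 1                  # complete the handover
    return index


index = 0
trace = [STATES[index]]
for tr in (True, False, True, False, True, False):
    index = next_state(index, tr, fault=False)
    trace.append(STATES[index])
print(" -> ".join(trace))
```

For k = 4 blocks this gives k stable plus k - 1 intermediate states, i.e. the seven states named above.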
Control Logic Functionality
Test access to each of the four basic blocks is possible through the extra test access.
[Figure: three basic blocks (BB) and the replacement block (RB) with test in and test out.]
With a test input pattern applied, the RBB is run through the 4 states.
If a BB or the RB is found to be faulty through the test access, the control is fixed in this state. The faulty block is then not in functional use.
The controller has a „fault“ flag, which indicates the status „backup in use“.
Once an RBB has a fault detected, it cannot be used for de-stressing operations.
As long as an RBB has no fault detected, it can activate the re-configuration for de-stressing with an extra control signal, which makes the FSM run through the scheme of extended logic states for „hot“ re-configuration.
Extended Control Logic
[Figure: reconfigurable block (RB) with test in and test out, controlled by an FSM and a decoder that generates the switch control signals. A fault flag flip-flop is set to „1“ on fault detect; further signals are clock, FF reset, FSM reset, test and tr.]
7. Overhead and Limitations
BISR requires additional overhead.
The inevitable extra circuitry used for fault administration is not fault-free by definition.
But we can assume that such circuitry, if fabricated correctly, is not in heavy use all the time and will exhibit a much reduced failure rate from stress.
Memory cells used for repair state administration are prone to transient fault effects from particle radiation.
With suitable state encoding (1-out-of-n code), a parity check can be applied.
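Why a 1-out-of-n encoding works well with a parity check can be shown in a few lines. The sketch is an illustration with assumed names; it models the state register as an integer bit vector.

```python
def one_hot(state: int) -> int:
    """1-out-of-n encoding: exactly one bit is set."""
    return 1 << state


def parity(word: int) -> int:
    """Number of set bits modulo 2."""
    return bin(word).count("1") & 1


def state_is_plausible(word: int) -> bool:
    """A valid 1-out-of-n word always has odd parity."""
    return parity(word) == 1


n = 4
stored = one_hot(2)
upset = stored ^ (1 << 0)        # a particle hit flips one memory cell
print(state_is_plausible(stored), state_is_plausible(upset))
```

Any single flipped bit leaves either zero or two bits set, so the parity becomes even and the upset is detected; a double flip would escape this check.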
Overhead
Overhead factors:
- Number and size of redundant elements,
- Number of switches for (re-) configuration,
- Control logic,
- Test and fault diagnosis,
- Extra overhead for system management.
Cost / Overhead
Basic Block   Trans. funct.   Trans. backup   Switch Trans.   Contr. Unit Tr. *   Overhead % *
2-NAND        3 * 4           4               30              81 / 200            960 / 3600
H-Adder       3 * 12          12              40              81 / 200            369 / 700
F-Adder       3 * 30          30              50              81 / 200            179 / 311
2-bit ALU     3 * 352         352             140             81 / 200            54.2 / 65.5
4-bit ALU     3 * 699         699             180             81 / 200            45.8 / 51.5
8-bit ALU     3 * 1367        1367            260             81 / 200            41.6 / 44.5
* without / with extensions for de-stressing, controller design optimized for supervision by parity control.
(3 functional blocks plus 1 backup in RLB)
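The overhead column can be cross-checked with a short calculation. The formula is an assumption, since the slide does not state it: (backup + switch + controller transistors) divided by the functional transistors of three blocks. It matches the listed values to within rounding for all rows except the 2-NAND with the 200-transistor controller, where it gives about 1950 % rather than the listed 3600 %.

```python
# (transistors per block, switch transistors) copied from the table.
rows = {
    "2-NAND": (4, 30),
    "H-Adder": (12, 40),
    "F-Adder": (30, 50),
    "2-bit ALU": (352, 140),
    "4-bit ALU": (699, 180),
    "8-bit ALU": (1367, 260),
}
CONTROLLERS = (81, 200)   # controller transistors, smaller and larger variant


def overhead_percent(block: int, switches: int, control: int, k: int = 3) -> float:
    """Assumed formula: everything beyond the k functional blocks, in percent."""
    functional = k * block
    return 100.0 * (block + switches + control) / functional


for name, (block, switches) in rows.items():
    small, large = (overhead_percent(block, switches, c) for c in CONTROLLERS)
    print(f"{name:10s} {small:7.1f} / {large:7.1f} %")
```

The calculation makes the trend explicit: the controller is a fixed cost, so it dominates for gate-sized blocks and becomes marginal once a block holds several hundred transistors.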
Sources of Overhead
Overhead in % of the functional transistors:
Basic Block   Complexity (trans.)   Redund.   Switches   Control   Ctrl / destr.
2-NAND        4                     33        250        675       1666
H-Adder       12                    33        111        225       555
F-Adder       30                    33        55         90        222
2-bit ALU     352                   33        13         7.6       18.9
4-bit ALU     699                   33        8.5        3.8       9.5
8-bit ALU     1367                  33        6.2        2         4.8
Switches and control overhead dominate; a reasonable lower bound for the complexity of basic blocks is around 100-200 transistors.
Overhead and Block Size
[Figure: overhead in % (10 to 1000, logarithmic) over basic block size in transistors (10 to 10^4), with one curve for self repair and one for self repair plus de-stressing; a level of 33 % is marked.]
The Switching Problem (1)
[Figure: three arrangements of redundant transistor switches driven by switch control signals. One compensates „always on“, one compensates „always off“, and one compensates both „always on“ and „always off“.]
... always in one single transistor.
Single Points of Failure: Transistor Switches
[Figure: a pass-transistor switch between the signal wiring and a reconfigurable logic block (RLB), driven by a switch control line from the configuration control network. Fault locations 1 to 3 are marked.]
1: short gate - signal input
2: short gate - block input
3: channel short
Pass Transistor Faults
[Figure: pass transistor with a short between gate and signal input.]
A short between the signal input (Usign) and the control input (Uctrl) may be handled by designing the gate input line (Rbr) as a fuse. Then one additional transistor is needed as a „power sink“.
Blowing Fuses
[Figure: n-channel pass transistor between s in and s out with a gate short; the control input (CTL in) reaches the gate through a fuse. A power-sink transistor and a raised supply voltage (VDD high) are used to blow the fuse.]
8. Summary and Conclusions
Logic self-repair is not impossible, but not cheap either.
The lower bound for logic blocks is about 100 transistors.
Experience shows that most logic designs „yield“ some potential for logic extraction.
Repair technologies work even (much) better for regular processor architectures such as VLIW processors.
In real-life designs, a large part of the system (memory, 50-90 %; functional units, 10-40 %) is regular. Only a small fraction is truly „irregular“ and needs higher overhead.
No such strategy exists yet for analog and mixed-signal circuits!
Real Embedded Systems
[Figure: floor plan of an embedded system with two CPUs (each with data path, control and cache), a DSP, memory blocks and a mixed-signal / RF part.]
... only a small fraction of the real system is truly irregular and needs „expensive“ logic repair!
Regular Processor Architectures
[Figure: register file, control logic and multiple parallel processing units (Add, Mult); the control logic is marked „needs logic BISR“.]
Regular processor structures with multiple parallel units need expensive logic (self-) repair only for their control logic. Reconfiguration of data-path elements can be arranged by software, which does not wear out!
Design for Repairability
[Figure: design flow. Starting from the RT netlist, obvious regular blocks are extracted and composed into RT-level RLBs. In the remaining random logic, regular entities are found and extracted; the random rest logic is composed into gate-level RLBs. Then the RLB control scheme and the RLB control circuitry are composed and the reliability is estimated. Done.]
This is the END !
Thank you for not falling asleep !(I would have....)