Self Repair Technology for Logic Circuits

Computer Engineering, CREDES / ZUSYS / DAAD Summer School 2011, Tallinn. Self Repair Technology for Logic Circuits: Architecture, Overhead and Limitations. Heinrich T. Vierhaus, BTU Cottbus, Computer Engineering Group


Post on 08-Feb-2016


TRANSCRIPT

Page 1: Self Repair Technology  for Logic Circuits

Computer Engineering

CREDES / ZUSYS / DAAD Summer School 2011, Tallinn

Self Repair Technology for Logic Circuits

Architecture, Overhead and Limitations

Heinrich T. Vierhaus

BTU Cottbus

Computer Engineering Group

Page 2: Self Repair Technology  for Logic Circuits

Outline

1. Introduction: Nano Structure Problems

2. The Problem of Wear-Out

3. Repair for Memory and FPGAs

4. Basic Logic Repair Strategies & Structures

5. Test and Repair Administration

6. De-Stressing Strategies

7. Cost, Overhead, Single Points of Failure

8. Summary and Conclusions

Page 3: Self Repair Technology  for Logic Circuits


1. Introduction

A bunch of new problems from nano structures ...

Page 4: Self Repair Technology  for Logic Circuits

Nanoelectronic Problems

Lithography:

The wavelength used to „map“ structural information from masks to wafers is larger (up to 4 times or more) than the minimum structural features (193 nm versus 90 / 65 / 45 nm).

Adaptation of layouts for correction of mapping faults.

Statistical Parameter Variations:

The number of atoms in MOS transistor channels becomes so small that statistical variations of doping densities have an impact on device parameters such as threshold voltages.

Page 5: Self Repair Technology  for Logic Circuits

New Problems with Nano-Technologies

[Figure: lithography setup with light source, mask (reticle), resist, exposed resist and wafer]

Wavelength: 193 nm

Feature size: down to 28 nm

Page 6: Self Repair Technology  for Logic Circuits

Layout Correction

Modified layout for compensation of mapping faults

Compensation is critical and non-ideal

Faults are not random but correlated!

Requires fast fault diagnosis

Page 7: Self Repair Technology  for Logic Circuits

Doping Fluctuations in MOS Transistors

[Figure: two MOS transistor cross-sections (p-substrate, n regions, poly-Si gate) with different distributions of doping atoms]

Density and distribution of doping atoms cause shifts in transistor threshold voltages!

Page 8: Self Repair Technology  for Logic Circuits

Nanostructure Problems

Individual device characteristics such as Vth are more dependent on statistical variations of underlying physical features such as doping profiles. Primary relevance: Yield

A significant share of basic devices will be „out of specs“ and needs replacement by backup elements for yield improvement after production. Primary relevance: Yield

Smaller features mean higher stress (field strength, current density) and also foster new mechanisms of early wear-out. Primary relevance: Lifetime

Transient error recognition and compensation „in time“ is becoming a must, e. g. due to charged particles that can discharge circuit nodes. Primary relevance: Dependability

Page 9: Self Repair Technology  for Logic Circuits

Fault Tolerant Computing

[Figure: a fault event and three levels of fault handling, rated „specific“, „very specific“ and „universal“]

Software-based fault detection & compensation

HW logic & RT-level detection & compensation

Transistor- and switch-level compensation

Remarks in the figure: „Works only for transient faults!“, „Typically works for transient and permanent faults!“, „Typically works for specific types of transient faults only!“

Page 10: Self Repair Technology  for Logic Circuits

2. Wear-Out Problems and Mechanisms

Structures on ICs used to live longer than either their application or even their users. Not any more ...

Page 11: Self Repair Technology  for Logic Circuits

IC Structures May Get Tired

„Wear-out“ effects in nano-electronic ICs are likely to appear much earlier, causing a lot of problems for dependable long-term applications!

Page 12: Self Repair Technology  for Logic Circuits

Fault Effects on ICs

[Figure: IC cross-section with n-well and n / p regions, gate oxide (high-k), field oxide, polyimide (low-k) insulation, metal layers 1 to 3 and a via]

Fault effects marked in the figure:

Metal migration

Low-k insulator deterioration

Transistor deterioration (HCI, NBTI), eventually gate oxide shorts!

Page 13: Self Repair Technology  for Logic Circuits

Wear-Out Mechanisms

Metal migration:

Metal atoms (Al, Cu) tend to migrate under high current density and high temperature.

Stress migration:

Migration effects may be enhanced under mechanical stress conditions.

Effect:

Migration in metal lines and vias may eventually cause line interrupts. The effect is partly reversible by changing current directions.

Page 14: Self Repair Technology  for Logic Circuits

Metal Migration

[Figure: a metal wire under high current density, new and after some time in operation; voids (holes) lead to an open defect, and a short to a neighbor wire is shown]

Vias are especially prone to such defects

The effect is partly reversible by reversing the direction of current flow!

Page 15: Self Repair Technology  for Logic Circuits

Transistor Degradation

Negative Bias Temperature Instability (NBTI): Reduced switching speed for p-channel MOS transistors that have operated under long-time constant negative gate bias. The effect is partly reversible.

Hot Carrier Injection (HCI): Reduced switching speed for n-channel MOS transistors, induced by positive gate bias and frequent switching. Not reversible.

Gate Oxide Deterioration: Induced by high field strength. Not reversible.

Dielectric Breakdown: Insulating layers between metal lines may break, causing shorts between signal lines.

Design technology including a prospective „life time budget“!!

Page 16: Self Repair Technology  for Logic Circuits

Management of Wear-Out by „Fault Tolerant Computing“?

Built-in fault tolerance and error compensation are needed in nano-technologies anyway, for the management of transient faults.

Wear-out induced faults may show up as „intermittent“ faults first, which become more and more frequent.

Faults in synchronous circuits and systems are detected „by clock cycle“. Hence, for many types of fault tolerant architecture, the detection does not even recognize whether the fault is permanent or not.

Page 17: Self Repair Technology  for Logic Circuits

Triple Modular Redundancy

[Figure: the input signal feeds Execution Units 1 to 3; a comparator / voter produces the majority result and an error-detect signal]

Can detect and compensate almost any type of fault

Overhead about 200-300 %, additional signal delays

The voter itself is not covered but must be a „self checking checker“

Standard (by law) in avionics applications!
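The majority vote on this slide can be sketched in a few lines. This is only a minimal software model, assuming three results that can be compared for equality; the function name `tmr_vote` is hypothetical and the voter itself is taken to be fault-free, which the slide explicitly does not guarantee.

```python
from collections import Counter

def tmr_vote(r1, r2, r3):
    """Return (majority_result, error_detected) for three redundant results."""
    value, votes = Counter([r1, r2, r3]).most_common(1)[0]
    if votes < 2:
        # All three units disagree: no majority exists.
        raise ValueError("no majority: all three units disagree")
    # error_detected is True when exactly one unit deviates.
    return value, votes < 3

print(tmr_vote(42, 42, 42))
print(tmr_vote(42, 7, 42))
```

A single deviating unit is outvoted and flagged; two simultaneous deviations that happen to agree would win the vote, which is why transient faults on top of a permanent fault are critical.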

Page 18: Self Repair Technology  for Logic Circuits

Error Detecting / Correcting Codes

[Figure: data and its signature pass through transmission / storage; a signature computed from the received data is compared with the received signature, driving fault detection and error correction]

Often applicable to 1- or 2-bit faults only

Becomes expensive if applied to computational units

Often limited to certain fault models (uni-directional)
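As an illustration of the first limitation, here is a small sketch using a Hamming(7,4) code. The slide does not name a specific code, so this choice is an assumption; the function names are hypothetical. The code corrects any single-bit fault but mis-corrects double-bit faults.

```python
def hamming74_encode(d):
    """Encode 4 data bits as 7 bits: positions 1..7 = p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    """Return (data_bits, syndrome); a nonzero syndrome is the 1-based
    position of the bit that was assumed faulty and flipped back."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                      # inject a single-bit fault at position 5
print(hamming74_decode(word))     # data restored, syndrome points to position 5
```

With two flipped bits the syndrome points to a third, healthy position, so the decoder returns wrong data without noticing.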

Page 19: Self Repair Technology  for Logic Circuits

Can TMR and Codes Compensate Permanent Faults?

Fault / error detection circuitry typically works on a clock-cycle basis. It does not „know“ if a fault is transient or permanent.

A permanent fault is a fault event that occurs repeatedly in several to many successive clock cycles.

Error correction technology can detect and compensate such permanent faults as well as transient faults.

A critical condition occurs if transient faults occur on top of permanent faults. Then the superposition of fault effects is likely to exceed the system's fault handling capacity.

System components that run actively „in parallel“ suffer from the same wear-out effects. Therefore there is an increase in dependability before wear-out limits, but no significant life time extension!

Page 20: Self Repair Technology  for Logic Circuits

Redundancy and Wear-Out

During the normal life time of the system, duplication or triplication can enhance reliability significantly. But area and power consumption are also roughly tripled.

And by the end of normal operating time (out of fuel / steam) all three systems will fail shortly one after the other!!

Reliability enhancement is not equal to life time extension!!
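The last statement can be made concrete with the textbook reliability of a 2-out-of-3 system with a perfect voter, R_TMR = 3R^2 - 2R^3. The sketch below assumes a Weibull wear-out model for a single unit; the parameters `eta` and `beta` are purely illustrative and not taken from the slides.

```python
import math

def unit_reliability(t, eta=10.0, beta=4.0):
    """Survival probability of one unit at time t (Weibull, assumed wear-out model)."""
    return math.exp(-((t / eta) ** beta))

def tmr_reliability(r):
    """Probability that at least 2 of 3 independent units work (perfect voter)."""
    return 3 * r**2 - 2 * r**3

for t in (2.0, 6.0, 9.0, 11.0):
    r = unit_reliability(t)
    print(f"t={t:5.1f}  single={r:.4f}  TMR={tmr_reliability(r):.4f}")
```

As long as the single-unit reliability is above 0.5, triplication helps; once wear-out pushes it below 0.5, the triplicated system is less reliable than a single unit, so the usable life time is not extended.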

Page 21: Self Repair Technology  for Logic Circuits

Self Repair?

[Figure: a fault event and three levels of fault handling, rated „specific“, „very specific“ and „universal“]

Software-based fault detection & compensation

HW logic & RT-level detection & compensation

Transistor- and switch-level compensation

Remarks in the figure: „Works only for transient faults!“, „Typically works for transient and permanent faults!“, „Typically works for specific types of transient faults only!“

Self Repair for permanent faults!

Page 22: Self Repair Technology  for Logic Circuits

3. Repair for Memory and FPGAs

Compensation of transient faults is not enough.

Some technologies for transient compensation can handle permanent faults, too, but not in the long run and not with additional transient faults!

Page 23: Self Repair Technology  for Logic Circuits

Memory Test & Repair

[Figure: memory array with lines and columns, line address, read / write lines and a spare column]

Page 24: Self Repair Technology  for Logic Circuits

Memory Test & Repair (2)

[Figure: the same memory array with spare column, extended by a memory BIST controller]

... is already state-of-the-art!

Page 25: Self Repair Technology  for Logic Circuits

FPGA-based Self Repair

[Figure: three FPGA macro-blocks working as CPUs, each an array of logic blocks (L) and wiring blocks (W), connected to a configuration SW memory and to application SW & data]

FPGA-based embedded controller: 8051

* e. g. proposed by McCluskey et al. IEEE Design and Test 2004

Page 26: Self Repair Technology  for Logic Circuits

In-System FPGA Repair

[Figure: FPGA-based CPUs built from logic blocks (L) and wiring blocks (W), with configuration SW memory and application SW & data. One CPU with a fault is under repair; the labels „system function“ and „repair function“ mark the roles of the CPUs.]

Page 27: Self Repair Technology  for Logic Circuits

Repair Mechanism: Row/Line-Shift

[Figure: array of CLB rows (4 CLBs each) with occupied CLBs, one row with a faulty CLB and a reserve row]

Little overhead for the re-configuration process

Loss of many “good” CLBs for every fault
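The row-shift mechanism amounts to a simple remapping, which is why its re-configuration overhead is small. A minimal sketch, assuming rows are numbered from 0 and the reserve rows are simply the last physical rows; the function name `row_shift_repair` is hypothetical.

```python
def row_shift_repair(n_rows_needed, n_rows_total, faulty_rows):
    """Return the physical row used for each logical row, skipping faulty rows."""
    healthy = [r for r in range(n_rows_total) if r not in faulty_rows]
    if len(healthy) < n_rows_needed:
        raise RuntimeError("reserve rows exhausted")
    return healthy[:n_rows_needed]

# 4 occupied rows plus 1 reserve row; a faulty CLB sits in physical row 1.
print(row_shift_repair(4, 5, faulty_rows={1}))  # [0, 2, 3, 4]
```

The whole row 1 is dropped, so with 4 CLBs per row three working CLBs are lost for one faulty CLB.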

Page 28: Self Repair Technology  for Logic Circuits

Distributed Backup CLBs

[Figure: 4 x 4 array of CLBs with additional backup CLBs distributed over the array. Legend: functionally occupied CLB, non-occupied CLB (reserve), faulty CLB, selected replacement CLB]

Minimum loss of functional CLBs

High effort for re-wiring requires massive „embedded“ computing power (32-bit CPU, 500 MHz)

Page 29: Self Repair Technology  for Logic Circuits

Self Repair within FPGA Basic Blocks

Heterogeneous repair strategies required (memory, logic)

Logic blocks may use methods known from memory BISR

Additional repair strategies are necessary for logic elements

The basic overhead of FPGAs versus standard logic (a factor of about 10) is increased further.

Repair strategies for logic may use some features already used in FPGAs (e. g. switched interconnects).

Page 30: Self Repair Technology  for Logic Circuits

Structure of a CLB Slice

[Figure: CLB slice with a logic field (logic in, program in, logic out) including a redundant row, followed by multiplexers (MUX), flip-flops (FF, FF in) and SRAM cells driving the outputs]

Page 31: Self Repair Technology  for Logic Circuits

FPGAs for a Solution?

The granularity of re-configurable logic blocks (CLBs) in most FPGAs is in the order of several thousand transistors. Replacement strategies must work on a granularity of blocks in the range of 100-500 transistors for fault densities between 0.01 % and 0.1 %.

An efficient FPGA repair mechanism requires detailed fault diagnosis plus specific repair schemes, which cannot be kept as pre-computed reconfiguration schemes.

Computation of specific repair schemes requires „in-system EDA“ (re-placement and routing) with a massive demand for computing power.

There is no source of such „always available“ computing power.
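The granularity argument can be checked with a rough estimate. The sketch below assumes that transistor faults are independent and occur with the given fault density, which is a simplification (the lithography slides note that faults may be correlated); the function name is hypothetical.

```python
def block_fault_free_probability(n_transistors, fault_density):
    """Probability that a block of n transistors contains no faulty transistor,
    assuming independent faults with probability fault_density each."""
    return (1.0 - fault_density) ** n_transistors

for n in (100, 500, 3000):
    p_ok = block_fault_free_probability(n, 0.001)   # fault density 0.1 %
    print(f"{n:5d} transistors: {100 * p_ok:5.1f} % of blocks fault-free")
```

Under these assumptions, at a fault density of 0.1 % roughly 90 % of 100-transistor blocks and 61 % of 500-transistor blocks are fault-free, but only about 5 % of blocks with 3000 transistors, so CLB-sized replacement units would nearly all be hit.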

Page 32: Self Repair Technology  for Logic Circuits

Self-Repairing FPGA ?

[Figure: reconfigurable logic made of CLBs and wiring blocks (WB) hosting a virtual CPU; the memory holds the program, the configuration scheme and a new configuration]

Page 33: Self Repair Technology  for Logic Circuits

Advanced FPGA Structures

[Figure: FPGA fabric of CLBs and wiring blocks (WB) with embedded CPU, ALU and multiplier (MULT) blocks]

... are only partly re-configurable for performance reasons!

Page 34: Self Repair Technology  for Logic Circuits

FPGA / CPLD Repair

Looks pretty easy at first glance because of the regular architecture!

Requires lines / columns of switches for configuration at inputs and between AND / OR matrices.

Requires additional programmability of cross-points by double-gate transistors as in EEPROMs or Flash memory.

Not fully compatible with standard CMOS

Limited number of (re-) configurations

Floating gate (FAMOS) transistors are fault-sensitive!

Page 35: Self Repair Technology  for Logic Circuits

4. Basic Logic Repair Strategies

Repair techniques that replace failing building blocks by redundant elements from a „silent“ storage are not new.

IBM has been selling such computer systems, specifically for applications in banks, for decades.

But always with few (2-10) backup elements (CPUs), assuming a small number of failures (< 10) within years.

Page 36: Self Repair Technology  for Logic Circuits


Mainframes

.. will often contain „redundant“ CPUs for eventual fault compensation. But one faulty transistor then „costs“ a whole CPU, limiting the fault handling to a few (about 10) permanent fault cases.

Page 37: Self Repair Technology  for Logic Circuits

Granularity of Replacement

[Figure: granularity of replacement on an axis from 10^0 to 10^6 transistors (transistor, gate, macro, FPGA block, cores, CPU) against the expected fault density (1 out of ...). Marked ranges: block-level replacement (e. g. FPGAs), core replacement (e. g. CPU), and a range for logic labeled „hardly explored“.]

Page 38: Self Repair Technology  for Logic Circuits

Repair Overhead versus Element Loss

[Figure: repair procedure overhead and functioning elements lost, plotted against the size of replaced blocks (granularity, 1 to 10M transistors). Regions are labeled „prohibitive overhead“, „prohibitive fault density“ and „new methods and architectures“.]

Page 39: Self Repair Technology  for Logic Circuits

Built-in Self Repair (BISR)

BISR is well understood for highly regular structures such as embedded memory blocks.

BISR essentially depends on built-in self test (BIST) with high diagnostic resolution.

[Figure: flow of fault detection, fault diagnosis, fault isolation and redundancy allocation, all under fault / redundancy management]

Redundancy management must monitor faults, replacements and available redundancy, and must also re-establish a „working“ system state after power-down states.

Page 40: Self Repair Technology  for Logic Circuits

Levels of Repair

Transistors - Switch Level
Replace transistors or transistor groups
Losses by reconfiguration (switched-off „good“ devices): Potentially small (20 to 50 %) for transistor faults
Overhead for test and diagnosis: Very high

Gate Level
Replace gates or logic cells
Losses by reconfiguration: Medium (60 to 90 %) for single transistor faults
Overhead for test and diagnosis: High

Macro-Block Level
Replace functional macros (ALU, FPU, CPU)
Losses by reconfiguration: High, 99 % or more
Overhead for test and diagnosis: Maybe acceptable

Repair overhead will dominate reliability!

Page 41: Self Repair Technology  for Logic Circuits

The Fault Isolation Problem

[Figure: a driver feeding Load 1 and Load 2, with a gate short marked]

GND shorts of input gates affect the whole fan-in network and make redundancy obsolete!!

Page 42: Self Repair Technology  for Logic Circuits

Block-Level Repair

[Figure: a block of four gates (&) together with elements labeled SE]

Blocks of logic / RT elements (gates and larger) contain a redundant element each that can replace a faulty unit.

Page 43: Self Repair Technology  for Logic Circuits

Switching Concept (1)

[Figure: configuration states 1 and 2 of a block with Functional Blocks 1 to 3 and a replacement block between inputs and outputs, with Test in / Test out terminals]

Page 44: Self Repair Technology  for Logic Circuits

Switching Concept (2)

[Figure: configuration states 3 and 4 of the same block with Functional Blocks 1 to 3 and a replacement block between inputs and outputs, with Test in / Test out terminals]

Page 45: Self Repair Technology  for Logic Circuits

A Regular Switching Scheme

The scheme is regular and scalable by nature, always comprising k functional blocks of the same type plus 1 additional block for backup.

Building blocks are separated by (pass-) transistor switches at inputs and outputs, providing full isolation of a faulty block.

There are always 2 additional pass-transistors between two functional blocks.

The reconfiguration scheme is regular in shifting functionality between blocks, which results in a simple scheme of administration.

The functional access to the „spare“ block can be used for testing purposes. In any state of (re-) configuration, the potentially „faulty“ block is connected to test input / output terminals.
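The shifting of functionality can be sketched as a mapping from logical functions to physical blocks. The block numbering (physical blocks 0 to k, with block k as the spare) and the function name `configuration` are assumptions made for illustration, not taken from the slides.

```python
def configuration(k, isolated):
    """Map logical functions 0..k-1 onto physical blocks 0..k.

    `isolated` is the physical block taken out of functional use (and connected
    to the test terminals). All functions at or above it shift by one block
    towards the spare, which is physical block k.
    """
    if not 0 <= isolated <= k:
        raise ValueError("isolated block index out of range")
    return [i if i < isolated else i + 1 for i in range(k)]

print(configuration(3, 3))  # normal operation, spare isolated: [0, 1, 2]
print(configuration(3, 1))  # block 1 faulty: [0, 2, 3]
```

Each logical function only ever moves to its direct neighbor, which is why two pass-transistors per block boundary and a small state register are enough to administrate the scheme.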

Page 46: Self Repair Technology  for Logic Circuits

Overhead Depending on Block Size

Transistors:

Basic Element    Functional   Backup   Norm. switch   Ext. switch
3/4 2-NAND           12          4          18            24
3/4 2-AND            18          6          18            24
3/4 2-XOR            18          6          18            24
H-Adder              36         12          24            30
F-Adder              90         30          30            36

For small basic blocks, the switches make the essential overhead (200 %)!

For larger basic blocks, the overhead can be reduced to about 30-50 %

... not counting test and administration overhead!

Extract larger basic units from seemingly irregular logic netlists!!

Page 47: Self Repair Technology  for Logic Circuits

Overhead

Transistors per RLB (3 functional units)

Basic Block   Functional   Backup   Switches (min. / ext.)   Overhead
2-NAND            12          4          18 / 24              230 %
2-AND             18          6          18 / 24              160 %
XOR               18          6          18 / 24              160 %
Half Adder        36         12          24 / 30              116 %
Full Adder        90         30          30 / 36               73 %
8-bit ALU       4500       1500         168 / 224              38 %
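The overhead column appears to follow from (backup + extended switches) / functional transistors. This formula is inferred from the numbers in the table, not stated on the slide; the sketch below recomputes it and reproduces the listed values to within a few percentage points.

```python
# (functional, backup, extended switches) transistor counts per RLB, from the table
ROWS = {
    "2-NAND": (12, 4, 24),
    "2-AND": (18, 6, 24),
    "XOR": (18, 6, 24),
    "Half Adder": (36, 12, 30),
    "Full Adder": (90, 30, 36),
    "8-bit ALU": (4500, 1500, 224),
}

def overhead_percent(functional, backup, switches):
    """Extra transistors (backup block plus switches) relative to functional ones."""
    return 100.0 * (backup + switches) / functional

for name, (functional, backup, switches) in ROWS.items():
    print(f"{name:10s} {overhead_percent(functional, backup, switches):6.1f} %")
```

The switch count grows only slowly with block size, so its share shrinks from about 200 % for a 2-NAND to a few percent for an 8-bit ALU, while the backup block stays at one third.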

Page 48: Self Repair Technology  for Logic Circuits

5. Test and Repair Administration

[Figure: two options. Centralized control: a test generator, a test analyzer, a configurator and status memory, and system monitoring serve all RLBs of the logic. De-centralized test and control: each RLB has its own BIST and configuration (Conf.) unit.]

Remark in the figure: „May be faulty!“

Page 49: Self Repair Technology  for Logic Circuits

Blocks, Switching, Administration

[Figure: two variants of groups of functional units (F-Unit) with a redundant unit (Red.-Unit) and columns of switches under a global control unit. Local (re-) configuration: each group has its own configuration unit (Conf.-Unit). Remote (re-) configuration: the configuration units sit at the global control unit, and each group only has a decoder.]

Page 50: Self Repair Technology  for Logic Circuits

Combining Test and Re-Configuration

[Figure: a test input drives the logic under test; its test output is compared with a reference; the resulting „fault detect“ signal controls the „next state“ of the configuration memory / counter]

Page 51: Self Repair Technology  for Logic Circuits

Test and Administration

Each of the elements in a block is testable via specific test inputs.

Test is done by comparison with reference outputs. The system is run through states of re-configuration with the same input test pattern applied. At test, a functional unit is always removed from normal operation and connected to test I/Os.

In case of a „fault detect“, the system is fixed in the current status („fix at fault“).

Such a procedure of self-test and self-reconfiguration can run at every system start-up, avoiding a central „fault memory“.

[Figure: Functional Blocks 1 to n and a replacement block between input switches and output switches (inputs, outputs, Test in, Test out); a state register with decoder drives the switches; a self-test circuit with test clock and fault indicator sets the fault flag]
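The start-up procedure can be sketched as a loop over the configuration states. This is a software model only: blocks are plain callables, the replacement block is assumed to be the last entry, and the test order (spare first) as well as the name `power_on_self_test` are assumptions.

```python
def power_on_self_test(blocks, test_pattern, reference_output):
    """Step through the configuration states and stop at the first faulty block.

    Returns (isolated_block_index, fault_flag). Without a fault, the replacement
    block (last index) stays isolated and the fault flag is False.
    """
    k = len(blocks) - 1                       # index of the replacement block
    for isolated in [k] + list(range(k)):     # spare first, then functional blocks
        if blocks[isolated](*test_pattern) != reference_output:
            return isolated, True             # fix at fault
    return k, False

def nand(a, b):
    return int(not (a and b))

def stuck_at_0(a, b):
    return 0                                  # models a faulty block

blocks = [nand, stuck_at_0, nand, nand]       # block 1 is faulty
print(power_on_self_test(blocks, (0, 1), reference_output=1))  # (1, True)
```

Because the result is recomputed at every start-up, no non-volatile fault memory is needed; a real test would of course apply more than one pattern.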

Page 52: Self Repair Technology  for Logic Circuits

Controller for (Re-) Configuration

[Figure: (re-) configuration controller for an RLB with functional units f1 to f3 and a reference unit between input and output switches; a decoder turns the control bits into the switch signals s1 to s4; further signals: test in, test, reset, BISR clock, act, fault, scan path in / out]

Controller minimum complexity: 80 transistors (3 + 1 configuration)

A controller may drive one or several re-configurable blocks in parallel, depending on their size

Page 53: Self Repair Technology  for Logic Circuits

Local Interconnects

The block-based repair scheme so far cannot cover faults on wires between re-configurable blocks.

For small basic blocks (such as logic gates) the majority of wiring is between re-configurable units and not covered.

For larger (RT-level) basic blocks the majority of wiring is within basic blocks and covered.

Schemes that can also cover inter-block wiring are possible, but require FPGA-like configurable switching and complex switching schemes.

Page 54: Self Repair Technology  for Logic Circuits

Essentials of the Repair Scheme

Logic self repair is feasible at a cost below triple modular redundancy (TMR).

There is a trade-off between the size of the reconfigurable logic blocks (RLBs) and the maximum tolerable fault density.

Administration, not redundancy, makes the critical overhead.

Effort can be saved by administrating several RLBs in parallel.

Low-level interconnects between RLBs make for the essential „single point of failure“ in the repair scheme!

Page 55: Self Repair Technology  for Logic Circuits

6. De-Stressing

[Figure: component failure rates (10^-4 to 10^-1) over system life time (t1 to t4), showing a failure curve without de-stressing and a failure curve with de-stressing]

Page 56: Self Repair Technology  for Logic Circuits

The Purpose of De-Stressing

Building blocks of equal type in digital systems may be more or less heavily used.

Blocks running with the highest dynamic load and at the highest temperature are candidates for early failure.

Using otherwise „silent“ resources to relieve such units from stress periodically may extend the overall life time of the system.

The re-configuration scheme developed for repair may also serve this purpose with slight modifications.

... and the scheme must be compatible with repair architectures!

Page 57: Self Repair Technology  for Logic Circuits

The Scheme of De-Stressing

[Figure: states 0 to 3 of a block with building blocks BB1, BB2, BB3 and a replacement block RB. Task 1 (heavy load), Task 2 (medium load), Task 3 (low load) and the backup / test role are assigned to the blocks in each state.]

A better initial distribution of tasks and stress makes a better re-distribution.

Repair capabilities can be preserved.

But: De-stressing may need re-organisation within an active system, while repair has been off-line so far!

Page 58: Self Repair Technology  for Logic Circuits

Modified Control Scheme

For de-stressing, functions have to be shifted while the system is in „hot“ operation.

As long as all building blocks are fully functional, running two functional blocks in parallel, serving the same inputs and outputs, is possible.

With a total of k building blocks (including the spare one) there are k „stable“ states of re-configuration (for k = 4: 1 normal, 3 repairs) and (k-1) intermediate states for „handover“ in case of de-stressing.

No extra switches are necessary, but there is additional overhead in state management and state decoding.

Page 59: Self Repair Technology  for Logic Circuits

FSM including Transitional States

[Figure: state diagram with stable states 0, 1, 2, 3 and transitional states 0/1, 1/2, 2/3; the edges are labeled tr = 1 and tr = 0]

If a „flying“ transition between repair states becomes necessary, the control logic will have seven states instead of four!
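A minimal sketch of such a seven-state controller follows. The slide only shows the state names and the edge labels, so the exact meaning of `tr` is an assumption here: tr = 1 is taken to start a handover from a stable state, tr = 0 to complete it, and no wrap-around from state 3 is modeled.

```python
STATES = ["0", "0/1", "1", "1/2", "2", "2/3", "3"]   # stable and transitional states

def next_state(state, tr):
    """Advance the (re-)configuration FSM by one clock cycle.

    Assumed semantics: in a stable state, tr = 1 enters the following
    transitional (handover) state; in a transitional state, tr = 0 completes
    the handover. In all other cases the state is held.
    """
    i = STATES.index(state)
    stable = i % 2 == 0
    if stable and tr == 1 and i < len(STATES) - 1:
        return STATES[i + 1]
    if not stable and tr == 0:
        return STATES[i + 1]
    return state

state = "0"
for tr in (1, 1, 0, 0, 1, 0):
    state = next_state(state, tr)
    print(tr, state)
```

With four blocks this gives 4 stable plus 3 handover states, matching the seven states named on the slide.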

Page 60: Self Repair Technology  for Logic Circuits

Control Logic Functionality

Test access to each of the four basic blocks is possible through the extra test access.

[Figure: three basic blocks (BB) and the replacement block (RB) between Test in and Test out]

With a test input pattern applied, the RBB is run through the 4 states.

If a BB or the RB is found to be faulty through the test access, the control is fixed in this state. The faulty block is then not in functional use.

The controller has a „fault“ flag, which indicates the status of „backup in use“.

Once a RBB has a fault detected, it cannot be used for de-stressing operations.

As long as a RBB has no fault detected, it can activate the re-configuration for de-stressing with an extra control signal, which makes the FSM run through the scheme of extended logic states for „hot“ re-configuration.

Page 61: Self Repair Technology  for Logic Circuits

Extended Control Logic

[Figure: extended control logic for a reconfigurable block (RB): an FSM with decoder generates the switch control signals; a fault flag flip-flop (FF) is set to „1“ for fault detect; signals: clock, FF reset, FSM reset, test, tr, Test in, Test out]

Page 62: Self Repair Technology  for Logic Circuits

7. Overhead and Limitations

BISR requires additional overhead.

The inevitable extra circuitry used for fault administration is not fault-free by definition.

But we can assume that such circuitry, if fabricated correctly, is not in heavy use all the time and will exhibit much reduced failure from stress.

Memory cells used for repair state administration are prone to transient fault effects from particle radiation.

With suitable state encoding (1-out-of-n code) a parity check can be applied.
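Why a parity check suffices for a 1-out-of-n code: every valid code word has exactly one bit set, so its parity is odd, and any single bit flip produces a word with zero or two bits set. A short sketch, with hypothetical function names and the state register modeled as an integer:

```python
def one_hot(state, n_states):
    """Encode a state number as a 1-out-of-n code word."""
    if not 0 <= state < n_states:
        raise ValueError("state out of range")
    return 1 << state

def parity_is_odd(word):
    """True if an odd number of bits is set (expected for a valid code word)."""
    return bin(word).count("1") % 2 == 1

def single_flips_detected(n_states):
    """Check that every single-bit flip of every code word fails the parity check."""
    for state in range(n_states):
        word = one_hot(state, n_states)
        for bit in range(n_states):
            if parity_is_odd(word ^ (1 << bit)):
                return False
    return True

print(single_flips_detected(4))  # True
```

Two simultaneous flips can turn one valid code word into another, so the check covers single upsets only.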

Page 63: Self Repair Technology  for Logic Circuits

Overhead

Overhead factors:

- Number and size of redundant elements,

- Number of switches for (re-) configuration,

- Control logic,

- Test and fault diagnosis,

- Extra overhead for system management.

Page 64: Self Repair Technology  for Logic Circuits

Cost / Overhead

(3 functional blocks plus 1 backup in RLB)

Basic Block   Trans. funct.   Trans. backup   Switch Trans.   Contr. Unit Tr. *   Overhead % *
2-NAND          3 * 4              4              30            81 / 200          960 / 3600
H-Adder         3 * 12            12              40            81 / 200          369 / 700
F-Adder         3 * 30            30              50            81 / 200          179 / 311
2-bit ALU       3 * 352          352             140            81 / 200          54.2 / 65.5
4-bit ALU       3 * 699          699             180            81 / 200          45.8 / 51.5
8-bit ALU       3 * 1367        1367             260            81 / 200          41.6 / 44.5

* without / with extensions for de-stressing, controller design optimized for supervision by parity control.

Page 65: Self Repair Technology  for Logic Circuits

Sources of Overhead

Overhead in %:

Basic Block   Complexity (trans.)   Redund.   Switches   Control   Ctrl/destr.
2-NAND               4                33        250        675       1666
H-Adder             12                33        111        225        555
F-Adder             30                33         55         90        222
2-bit ALU          352                33         13          7.6       18.9
4-bit ALU          699                33          8.5        3.8        9.5
8-bit ALU         1367                33          6.2        2          4.8

Switches and control overhead dominate; a reasonable lower bound for the complexity of basic blocks is around 100-200 transistors.
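The percentages appear to be each overhead source divided by the transistors of the three functional blocks. That reading is inferred from the numbers, not stated on the slide, and the switch counts (50 for the full adder, 260 for the 8-bit ALU) are taken from the cost table of the preceding slide. A sketch for two rows:

```python
def overhead_shares(block_transistors, switch_transistors, control_transistors,
                    n_functional=3):
    """Overhead sources in percent of the functional transistors of one RLB."""
    functional = n_functional * block_transistors
    return {
        "redundancy": 100.0 * block_transistors / functional,   # one backup block
        "switches": 100.0 * switch_transistors / functional,
        "control": 100.0 * control_transistors / functional,
    }

# Full adder: 30 transistors per block, 50 switch transistors, 81 or 200 control
print(overhead_shares(30, 50, 81))     # control without de-stressing
print(overhead_shares(30, 50, 200))    # control with de-stressing
# 8-bit ALU: 1367 transistors per block, 260 switch transistors
print(overhead_shares(1367, 260, 81))
```

The redundancy share is fixed at one third, while the roughly constant controller (81 or 200 transistors) only becomes negligible once a block has a few hundred transistors.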

Page 66: Self Repair Technology  for Logic Circuits

Overhead and Block Size

[Figure: overhead in % (10 to 1000) versus basic block size in transistors (10 to 10^4), with one curve for self repair and one for self repair plus de-stressing, and a reference level at 33 %]

Page 67: Self Repair Technology  for Logic Circuits

The Switching Problem (1)

[Figure: arrangements of pass-transistor switches, each transistor driven by a switch control signal]

Compensates „always on“

Compensates „always off“

Compensates „always on“ and „always off“

... always in one single transistor.

Page 68: Self Repair Technology  for Logic Circuits

Single Points of Failure: Transistor Switches

[Figure: a pass-transistor switch between the signal wiring and a reconfigurable logic block (RLB), driven via switch control from the configuration control network; fault locations 1 to 3 are marked]

1: short gate - signal input

2: short gate - block input

3: channel short

Page 69: Self Repair Technology  for Logic Circuits

Pass Transistor Faults

[Figure: pass transistor with a short between signal input and control input]

A short between the signal input (Usign) and the control input (Uctrl) may be handled by designing the gate input line (Rbr) as a fuse. Then one additional transistor is needed as a „power sink“.

Page 70: Self Repair Technology  for Logic Circuits

Blowing Fuses

[Figure: pass-transistor switch (signal in „sin“, signal out „sout“, control input „CTL in“) with a fuse in the gate line and a gate short; a power-sink transistor and a high VDD are shown for blowing the fuse]

Page 71: Self Repair Technology  for Logic Circuits

8. Summary and Conclusions

Logic self-repair is not impossible, but not cheap either.

The lower bound for logic blocks is about 100 transistors.

Experience shows that most logic designs „yield“ some potential for logic extraction.

Repair technologies work even (much) better for regular processor architectures such as VLIW processors.

In real-life designs, a large part of the system (memory, 50-90 %; functional units, 10-40 %) is regular. Only a small fraction is truly „irregular“ and needs higher overhead.

No such strategy yet for analog and mixed-signal circuits!

Page 72: Self Repair Technology  for Logic Circuits

Real Embedded Systems

[Figure: embedded system with two CPUs (each with data path, control and cache), a DSP, memory blocks and a mixed-signal / RF part]

... only a small fraction of the real system is truly irregular and needs „expensive“ logic repair!

Page 73: Self Repair Technology  for Logic Circuits

Regular Processor Architectures

[Figure: processor with register file, control logic and multiple parallel processing units (Add, Mult); the label „Needs Logic-BISR“ marks the control logic]

Regular processor structures with multiple parallel units need expensive logic (self-) repair only for their control logic. Reconfiguration of data-path elements can be arranged by software, which does not wear out!

Page 74: Self Repair Technology  for Logic Circuits

Design for Repairability

[Figure: design flow]

RT netlist

Extract obvious regular blocks; compose RT-RLBs

Random logic: find and extract regular entities

Random rest logic: compose gate-level RLBs

Compose RLB control scheme; RLB control circuitry

Estimate reliability

Done

Page 75: Self Repair Technology  for Logic Circuits


This is the END !

Thank you for not falling asleep !(I would have....)