
Dealing with Multiple Simultaneous Faults in Future Technologies

Carlos A. L. Lisbôa, Erik Schüler, Luigi Carro
Embedded Systems Laboratory, Informatics Institute
Federal University of Rio Grande do Sul, Porto Alegre – RS – Brazil
SRC TechCon 2005, Portland, Oregon, USA





Carlos A. L. Lisbôa, SRC TechCon 2005, October 26, 2005, Paper #20.4

Why Multiple Simultaneous Faults?

• Future technologies (2010 and beyond):
  • very small transistors, with fewer electrons to form the channel (SETs)
  • transient pulses caused by radiation strikes will last longer than the propagation delays of gates
  • devices will be more sensitive to the effects of electromagnetic noise, neutrons, and alpha particles


Single Event Upset Origin

[Figure: bit patterns illustrating the origin of a single event upset]


Why Should One Study Multiple Faults?

Change in paradigm: gates will behave statistically, producing correct outputs only a fraction of the time.



How to Deal with Multiple Faults?

• New paradigm: multiple simultaneous faults
  • new fault tolerance techniques will be required (TMR will no longer provide enough protection)

• How to deal with this problem?
  • new materials and manufacturing technologies must be developed, OR
  • new design approaches must be taken

Our bet: new design approaches must be taken.
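To make "TMR will no longer provide enough protection" concrete, the textbook TMR reliability expression can be worked out. This is a standard sketch, not a result from these slides; it assumes each module is correct independently with probability R and that the voter never fails.

$$R_{\mathrm{TMR}} = R^3 + 3R^2(1-R) = 3R^2 - 2R^3$$

$$R_{\mathrm{TMR}} - R = R(2R-1)(1-R)$$

So TMR improves on a single module only when R > 1/2. For R = 0.9 it gives R_TMR = 0.972: the residual failures are exactly the cases where two or more modules are wrong at the same time, which become more likely as gates "behave statistically". If the voter itself is correct only with probability R_V, system reliability drops further to R_V · R_TMR.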


Research Approaches

• Use of stochastic operators

• Use of bit stream operators

• Ensuring voter reliability to use n-MR while dealing with multiple simultaneous faults

• Next steps: 2005 - 2007 time frame


Research Evolution

Stochastic Operators (OK for some DSP applications) → looking for more speed → Bit Stream Operators (small footprint and fast) → looking for a tolerant converter → Analog Voter (tolerant to multiple faults in n-MR solutions)



Using Stochastic Operators

• SEU-induced transient errors are random in nature

• Stochastic operators rely on randomness to produce approximate results

• Injecting random faults into the input signals processed by stochastic operators did not affect the precision of the results

% Errors in 1,000 additions

Faults injected      0        2        4        8
Stochastic adder     0.1412   0.2580   0.1768   0.2196
Conventional adder   0.0000   -        -        -
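The table suggests that a few flipped bits barely move a stochastic result. A minimal Python sketch of why, assuming the common unipolar encoding (a value p in [0, 1] is the fraction of 1s in an N-bit stream) and a hypothetical fault model that inverts k random bits: each flip moves the decoded value by exactly 1/N, so k flips change it by at most k/N, whereas one flip in the most significant bit of a conventional 8-bit word moves it by about half of full scale. The stream length, seed, and function names are illustrative assumptions, not the authors' setup.

```python
import random


def to_stream(p, n, rng):
    """Unipolar stochastic encoding: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def stream_value(stream):
    """Decode a unipolar stream as the fraction of 1s."""
    return sum(stream) / len(stream)


def flip_bits(bits, k, rng):
    """Hypothetical fault model: invert k distinct, randomly chosen bits."""
    out = list(bits)
    for i in rng.sample(range(len(out)), k):
        out[i] ^= 1
    return out


def binary_flip_error(x, bit, width=8):
    """Error, relative to full scale, caused by flipping one bit of a binary word."""
    full_scale = (1 << width) - 1
    return abs((x ^ (1 << bit)) - x) / full_scale


if __name__ == "__main__":
    rng = random.Random(2005)
    n = 1024
    stream = to_stream(0.3, n, rng)
    for k in (2, 4, 8):
        err = abs(stream_value(flip_bits(stream, k, rng)) - stream_value(stream))
        print(f"{k} flips: stochastic error {err:.5f} (bound {k / n:.5f})")
    print(f"MSB flip in 8-bit binary: error {binary_flip_error(77, 7):.3f}")
```

The bound holds for any stream, because the decoder only counts 1s and no bit position carries more weight than another.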

• Several application areas (e.g., DSP) can tolerate approximate values and still produce acceptable results (outputs)


Using Stochastic Operators

• Benefit: reduced area of the operators

[Figure: stochastic multiplier circuit and stochastic adder circuit (inputs S1 and S2, select S3, output Sum), each annotated with example bit streams]
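The small area comes from how simple the gates are. A minimal sketch of the textbook unipolar stochastic operators: an AND gate multiplies two independent streams, and a 2:1 multiplexer whose select stream has P = 0.5 computes the scaled sum (P(S1) + P(S2)) / 2. That the slide's circuits (inputs S1, S2, select S3) follow exactly this AND/MUX structure is an assumption; the function names and parameters are hypothetical.

```python
import random


def to_stream(p, n, rng):
    """Unipolar encoding: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def stream_value(stream):
    """Decode a unipolar stream as the fraction of 1s."""
    return sum(stream) / len(stream)


def stochastic_multiply(a, b):
    """AND gate per bit position: P(a & b) = P(a) * P(b) for independent streams."""
    return [x & y for x, y in zip(a, b)]


def stochastic_add(s1, s2, s3):
    """2:1 multiplexer: take s1 where the select stream s3 is 1, else s2.

    With P(s3) = 0.5 the output encodes the scaled sum (P(s1) + P(s2)) / 2.
    """
    return [x if sel else y for x, y, sel in zip(s1, s2, s3)]


if __name__ == "__main__":
    rng = random.Random(26)
    n = 4096
    a, b = to_stream(0.6, n, rng), to_stream(0.5, n, rng)
    sel = to_stream(0.5, n, rng)
    print("product ", stream_value(stochastic_multiply(a, b)))  # expected to be close to 0.30
    print("sum / 2 ", stream_value(stochastic_add(a, b, sel)))  # expected to be close to 0.55
```

One gate per operator is what keeps the footprint small; the price is that results are approximate and precision grows only with stream length.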


Using Stochastic Operators

How does it work?

Come and see the posters!

No free drinks, but an answer to this question is guaranteed!



Using Bit Stream Operators

• Computation principles similar to those of the stochastic adder and multiplier

• Operators can produce bit streams which represent the exact results of the operation

Proposed multiplication algorithm: bit stream product (the count of 1’s in the stream equals the product value)

            F12        F11        F10
   x        F22        F21        F20
   ------------------------------------
        F20.F12    F20.F11    F20.F10
        F21.F12    F21.F11    F21.F10
        F22.F12    F22.F11    F22.F10

Resulting stream: b48 .. b33 | b32 .. b17 | b16 .. b5 | b4 .. b1 | b0
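One reading of the slide: partial product F1i·F2j carries binary weight 2^(i+j), so writing it 2^(i+j) times makes the count of 1s equal to the product. For 3-bit operands this gives 16 + 16 + 12 + 4 + 1 = 49 bits, matching the groups b48 .. b33, b32 .. b17, b16 .. b5, b4 .. b1, and b0. A minimal Python sketch under that reading; the ordering of terms that share a weight, the use of lists instead of wires, and the function names are assumptions, not the authors' circuit.

```python
def operand_bits(x, width=3):
    """Bits of x, least significant first: [F_0, F_1, F_2] for width 3."""
    return [(x >> i) & 1 for i in range(width)]


def bitstream_product(a, b, width=3):
    """Unary bit stream whose count of 1s equals a * b.

    Each partial product F1i . F2j has binary weight 2**(i + j), so it is
    written 2**(i + j) times. Highest weights come first, so index 0 plays
    the role of b48 and the last index the role of b0 for 3-bit operands.
    """
    f1, f2 = operand_bits(a, width), operand_bits(b, width)
    stream = []
    for exp in range(2 * (width - 1), -1, -1):
        for i in range(width):
            j = exp - i
            if 0 <= j < width:
                stream.extend([f1[i] & f2[j]] * (1 << exp))
    return stream


if __name__ == "__main__":
    s = bitstream_product(5, 6)
    print(len(s), sum(s))  # 49 30
```

No carries are propagated: every partial product is just an AND gate fanned out to its group of stream positions, which is what keeps the operator exact and simple.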


• Redundancy is added to the bit streams so that they withstand multiple bit flips

Adding robustness to the bit stream through redundancy: every stream bit (b48 .. b0) is repeated 8 times and the bits 1 1 1 1 0 0 0 are appended, so the total count of 1’s = 8 × product + 4.
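A minimal sketch of the redundant stream, assuming the decoder divides the count of 1s by 8 and rounds down (the slide gives only the encoding, so this decoder is a hypothetical choice). Under that assumption the +4 offset centers each value inside its window of 8 counts: with count = 8p + 4 + e, the result is still p for any error e from -4 to +3, so any three bit flips, in either direction, leave the decoded product unchanged.

```python
import random

REPEAT = 8                   # each product bit is written 8 times
PAD = [1, 1, 1, 1, 0, 0, 0]  # tail as drawn on the slide: four extra 1s (+4)


def add_redundancy(stream):
    """Repeat every bit REPEAT times and append PAD: count of 1s = 8 * p + 4."""
    out = []
    for bit in stream:
        out.extend([bit] * REPEAT)
    return out + PAD


def decode(stream):
    """Hypothetical decoder: floor(count / 8).

    With count = 8 * p + 4 + e, the result equals p whenever -4 <= e <= 3.
    """
    return sum(stream) // REPEAT


def flip_bits(bits, k, rng):
    """Hypothetical fault model: invert k distinct, randomly chosen bits."""
    out = list(bits)
    for i in rng.sample(range(len(out)), k):
        out[i] ^= 1
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    product = 30
    base = [1] * product + [0] * (49 - product)  # any 49-bit stream with 30 ones
    protected = add_redundancy(base)
    print(sum(protected), decode(flip_bits(protected, 3, rng)))  # 244 30
```

The cost is an 8x longer stream, which is why stream size appears among the open issues.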

• Conversion of bit streams to binary-coded values is delayed as long as possible, and the conversion circuits must use TMR or n-MR for protection against faults

• Issues for further investigation: bit stream length and area of the conversion circuits


Using Bit Stream Operators

How does it work?

Come and see the posters!

No free food, but more information on this subject will be provided!


What is Wrong with TMR?

• TMR protects only against single faults in one of the modules

[Figure: Modules 1, 2, and 3 all produce the correct output; the voter output is correct]


[Figure: Module 2 produces a wrong output while Modules 1 and 3 produce the correct output; the voter output is still correct]


• TMR does not protect against double faults in different modules

[Figure: Modules 1 and 3 produce wrong outputs while Module 2 produces the correct output; the voter output is wrong]
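A behavioral sketch of these scenarios, assuming a conventional bitwise 2-out-of-3 digital majority voter that is itself fault-free (the slides do not give the voter's logic). It also shows a nuance: a double fault defeats TMR when the two faulty modules corrupt the same output bit, while faults that hit different bits are still masked bit by bit.

```python
def majority3(a, b, c):
    """Bitwise 2-out-of-3 majority voter over integer words."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    correct = 0b1011
    # Single fault: Module 2 flips bit 2, and the voter masks it.
    print(bin(majority3(correct, correct ^ 0b0100, correct)))           # 0b1011
    # Double fault: Modules 1 and 3 both flip bit 0, and the voter outputs a wrong word.
    print(bin(majority3(correct ^ 0b0001, correct, correct ^ 0b0001)))  # 0b1010
    # Double fault in different bits: still masked bit by bit.
    print(bin(majority3(correct ^ 0b0001, correct, correct ^ 0b1000)))  # 0b1011
```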


• When a single fault occurs in the voter circuit, the voter output may be wrong

[Figure: Modules 1, 2, and 3 all produce the correct output, which the voter passes on]


[Figure: the same configuration, with the voter output now marked "correct output?"]



Making TMR (n-MR) More Reliable

• Known solutions imply:
  • area, performance, and/or power penalties
  • a deadlock: how to protect the output generator?

• Proposed solution:
  • use TMR to cope with single faults in the modules
  • replace the digital voter with an analog voter that:
    • uses a comparator to generate the output
    • can tolerate some noise while still producing the correct result
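A purely behavioral sketch of what "uses a comparator" and "can support some noise" could mean, under stated assumptions: the module outputs are averaged at one node (for example, through equal resistors) and a comparator checks that average against VDD/2. The 3.3 V supply, the averaging node, and the threshold are all hypothetical modeling choices, not the authors' transistor-level design. With one faulty module the ideal average sits at 2/3 VDD (or 1/3 VDD), so the combined noise on the average must exceed VDD/6 before the decision flips.

```python
VDD = 3.3  # assumed supply voltage for a 0.35 um CMOS process


def analog_vote(voltages, vdd=VDD):
    """Behavioral analog voter: average the module outputs, then compare.

    The averaging node and the vdd / 2 comparator threshold are modeling
    assumptions; the function returns the comparator's logic-level output.
    """
    v_avg = sum(voltages) / len(voltages)
    return vdd if v_avg > vdd / 2 else 0.0


if __name__ == "__main__":
    print(analog_vote([3.3, 3.3, 0.0]))  # one faulty module -> 3.3
    print(analog_vote([3.0, 3.5, 0.4]))  # noisy inputs, same decision -> 3.3
    print(analog_vote([0.0, 0.3, 3.3]))  # majority low -> 0.0
```

Because the same function accepts any number of inputs, the model extends directly to n-MR; whether a real comparator keeps these margins under faults is what the electrical simulations are meant to check.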


The Analog Voter


Injection of faults in the comparator (*)

Minimum Area Comparator

(*) using CMOS 0.35 µm


Electrical Simulation: Multiple Faults (SPICE and CMOS 0.35 µm)


Dealing with Multiple Simultaneous Faults: n-MR

The Analog Voter with 5 Inputs (for 5-MR)


Simulations injecting 2 simultaneous faults also succeeded
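The 5-input result matches the general n-MR rule: a majority over n = 2t + 1 modules masks up to t simultaneous faults. A quick exhaustive check in Python, modeled at the logic level with a fault-free voter and worst-case faults (every faulty module inverts the same output bit); it says nothing about the analog implementation itself, and the function names are illustrative.

```python
from itertools import combinations


def nmr_vote(bits):
    """Majority of n single-bit module outputs (n odd)."""
    return 1 if 2 * sum(bits) > len(bits) else 0


def masks_all(n, k):
    """True if n-MR outputs the correct bit for every choice of k faulty modules."""
    for correct in (0, 1):
        for faulty in combinations(range(n), k):
            outputs = [correct ^ int(m in faulty) for m in range(n)]
            if nmr_vote(outputs) != correct:
                return False
    return True


if __name__ == "__main__":
    for n in (3, 5):
        print(n, [k for k in range(n + 1) if masks_all(n, k)])
    # 3 [0, 1]
    # 5 [0, 1, 2]
```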


The Analog Voter... Oops!

Does this work???


The Analog Voter

Let’s see the posters!



Future Work - Short Term (2005-2006)

• use of signal redundancy with other number representation forms, such as Sigma-Delta

• use of the analog voter as an efficient way to implement robust n-MR circuits

• investigate the application of statistical methods and neural networks to the design of fault-tolerant circuits with minimum redundancy



Future Work - Long Term (2006-2007)

• use of logic properties to develop low-cost signal redundancy

• apply the developed techniques to actual processors with DSP and VLIW architectures

• discuss the architectural impact of new technologies together with fault tolerance



Research Evolution

[Figure: timeline spanning previous work (2004-2005), 2005, 2006, and 2007: Stochastic Operators, Bit Stream Operators, and Analog Voter, followed by Sigma-Delta, Logic Properties (low-cost redundancy), and DSP / VLIW (application to actual DSP and VLIW processors)]


Questions?

Looking forward to answering them at the poster booth! (#20.4)

Contact: [email protected]

Thank You!

No free anything, but a nice chat about these matters will be a pleasure!