Accelerated Stress Testing and Reliability Conference

Thermal HALT - a tool for discovery Signal Integrity and Software reliability issues

Kirk A. Gray

Accelerated Reliability Solutions, L.L.C.

[email protected]

August 2, 2016

SI operational failures

• The differential voltage margin (the “eye” opening in the eye diagram) shrinks as clock and bus frequencies increase

Figure: undistorted eye diagram of a band-limited signal

Figure: eye diagram of a signal with amplitude (noise) and phase (timing) errors

From “Analyzing Signals Using the Eye Diagram”, High Frequency Electronics, November 2005
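The slide refers to the voltage opening of the eye; the timing opening narrows in a similar way as bit rates rise. As a rough illustration only, the sketch below holds a hypothetical total jitter constant and shows how the horizontal eye opening shrinks as the unit interval gets shorter. The function name and the jitter figure are assumptions made for this example, not values from the presentation.

```python
def eye_width_ps(bit_rate_gbps, total_jitter_ps):
    """Horizontal eye opening: one unit interval (UI) minus total jitter, floored at 0."""
    unit_interval_ps = 1000.0 / bit_rate_gbps  # 1 Gb/s corresponds to a 1000 ps bit time
    return max(unit_interval_ps - total_jitter_ps, 0.0)


if __name__ == "__main__":
    jitter_ps = 60.0  # hypothetical peak-to-peak jitter, assumed constant across rates
    for rate_gbps in (1.0, 2.5, 5.0, 10.0):
        ui_ps = 1000.0 / rate_gbps
        width_ps = eye_width_ps(rate_gbps, jitter_ps)
        print(f"{rate_gbps:5.1f} Gb/s: UI {ui_ps:7.1f} ps, eye width {width_ps:6.1f} ps "
              f"({100.0 * width_ps / ui_ps:5.1f}% of UI)")
```

With the jitter budget fixed, the same absolute timing error consumes a growing share of each bit period, which is one way to picture why margins that were comfortable at lower speeds become critical.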

SI operational failures

Little data exist on relationship of PWBA hardware variations and effects on signal integrity and software failures at

Effects in data transmission that were 2nd or 3rd order in previous designs begin to dominate as bus speeds increase

– Effects that were not a big deal before become a big deal for SI

– Many new variables that are difficult to model correctly

– Shrinking IC metallization and higher frequencies will result in higher sensitivity to fabrication variations

No Fault Found (NFF) Field Returns

Warranty returns that are NFF have many causes

Some are intermittent failures due to low SI margin

Signal integrity operational margin depends on voltage, board impedance, crosstalk, noise, etc.

No Fault Found (NFF) Field Returns

Many companies do not consider an NFF return a “failure”

• No root cause investigation is performed

• The unit may be returned to the field or used as a replacement part at a repair depot

• Operating margin may be smaller in another system, e.g. graphics cards or DIMMs (DRAM memory)

Signal Integrity (SI) and HALT

SI operational issues may significantly contribute to NFF

Very difficult to observe in the field and on a test bench

– May take hundreds of operational cycles to observe

– Marginality may only occur in the tolerance stack-up of one specific whole-system hardware configuration

– NFF when tested on bench or in the “golden” system

Using Thermal Stress for Timing variations

• Thermal stress is very useful for stimulation of timing variations

– Both high and low temperature limits

– High temperature – lower speed

– Low temperature – higher speed
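To make the two limits concrete, here is a minimal sketch that assumes a simple linear delay-versus-temperature model. Every number in it (nominal delay, temperature coefficient, clock period, setup time, minimum required delay) is hypothetical and chosen only so that the slow path fails setup when hot and the fast path fails hold when cold.

```python
def prop_delay_ns(temp_c, delay_25c_ns=4.0, tempco_per_c=0.003):
    """Data-path delay under a simple linear model (both defaults are hypothetical)."""
    return delay_25c_ns * (1.0 + tempco_per_c * (temp_c - 25.0))


def timing_slack_ns(temp_c, clock_period_ns=6.0, setup_ns=1.5, min_delay_ns=3.4):
    """Return (setup_slack, hold_slack); a negative value means a timing violation.

    min_delay_ns stands in for hold time plus clock skew (hypothetical value).
    """
    delay = prop_delay_ns(temp_c)
    setup_slack = clock_period_ns - setup_ns - delay  # shrinks as the path slows down
    hold_slack = delay - min_delay_ns                 # shrinks as the path speeds up
    return setup_slack, hold_slack


if __name__ == "__main__":
    for temp_c in (-40, 25, 100):
        setup_slack, hold_slack = timing_slack_ns(temp_c)
        print(f"{temp_c:4d} C: setup slack {setup_slack:+.2f} ns, hold slack {hold_slack:+.2f} ns")
```

In this toy model the unit passes at room temperature but loses setup margin at the hot extreme and hold margin at the cold extreme, which is why stepping to both limits is informative.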

Effect of Temperature on Signal Propagation

Figure: measured low-to-high propagation delay versus case temperature in Fairchild octal buffer MM74HC244N (rated for -40 to 85 °C)

Referenced from L. Condra, D. Das, N. Pendse, and M. Pecht, “Junction Temperature Considerations in Evaluating Electronic Parts for Use Outside Manufacturers-Specified Temperature Ranges,” IEEE Transactions on Components and Packaging Technologies, Vol. 21, No. 4, pp. 721-728, Dec. 2001

Variation of Signal Propagation

Thermal stress provides stimulation of signal propagation variation

Lot to Lot Variation of Signal Propagation?

Predicting the future variations

• How much propagation variation die to die, and lot to lot?

• How close to the specified maximum delay?

• How will variation impact operational reliability in each in-circuit application of the component?

Stimulation of Variations

Temperature can skew signal propagation in ICs and conductors

Using Thermal Stress for Timing variations

Timing and quality of SI variations come from:

• Lot to lot manufacturing

• Within lots

• Board impedance variations

• Second and third source components

• Interconnects

• Parametric drift - Aging

Signal Integrity

• Electric and Magnetic fields - noise, crosstalk, reflections

• Every conductor is non-ideal: its frequency-dependent inductance, capacitance, and resistance affect the quality of signal transmission at each node

• The copper surface affects L and R; it is roughened for adhesion to FR4

Figure: typical transmission lines in PCB cross section, between ground/power planes, with the electric field and magnetic field shown

Referenced from S. H. Hall and H. L. Heck, “Advanced Signal Integrity for High-Speed Digital Designs”, Wiley and Sons, 2009

Thermal Stress to Skew Parametrics

Marginal designs may not be discovered until a sufficient number of units are in the field

The field is a costly place to discover these marginal conditions, or to incur high NDF (no defect found) returns if they are not discovered

Figure: two distributions of number of units versus parametric timing value, each marked with the parameter specification and the lower and upper operating limits. One shows limited samples during development (about 100 units); the other shows mass production variation (unit counts of 1,000 and 100,000), where the tails of the distribution reach the marginal operation regions.
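The sample-size point can be put in numbers. The sketch below assumes, purely for illustration, that a fixed fraction of production units is marginal and that units are independent; the chance of seeing at least one such unit in a sample of n is then 1 - (1 - p)^n. The fraction used here is hypothetical, not a figure from the presentation.

```python
def prob_at_least_one(n_units, marginal_fraction):
    """Probability that a sample of n_units contains at least one marginal unit."""
    return 1.0 - (1.0 - marginal_fraction) ** n_units


def expected_marginal_units(n_units, marginal_fraction):
    """Expected number of marginal units in a population of n_units."""
    return n_units * marginal_fraction


if __name__ == "__main__":
    p = 0.001  # hypothetical: 1 unit in 1,000 sits in a marginal operation region
    for n in (100, 1_000, 100_000):
        print(f"n = {n:7d}: P(at least one marginal unit) = {prob_at_least_one(n, p):.4f}, "
              f"expected count = {expected_marginal_units(n, p):.1f}")
```

Under these assumptions a 100-unit development build would most likely contain no marginal unit at all, while a production run of 100,000 would be expected to ship on the order of a hundred of them.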

Parametric Skewing from Thermal Stress

Applying thermal stress stimulates a timing shift

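Figure: distribution of parametric timing value across 100 units, relative to the parameter specification and the lower and upper operating limits, with Cold and Hot marking the thermally induced shift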

• Thermal step stress skews the signal propagation speeds in components and assemblies

• Rapid thermal transitions produce higher thermal gradients across components and the PWBA, giving a mix of parametric skews

Signal Integrity

• Fiber weave effect: the weave of the dielectric cannot be assumed homogeneous at Gb/s transmission rates

• RH (relative humidity) has an impact on the electrical performance of the dielectric: a dramatic increase in insertion loss from dry Arizona to humid Malaysia

Figure: typical transmission lines in PCB cross section, between ground/power planes

Referenced from S. H. Hall and H. L. Heck, “Advanced Signal Integrity for High-Speed Digital Designs”, Wiley and Sons, 2009

Thermal stimulation for Signal Integrity Margin

Thermal stress expands and contracts material dimensions

Heat expands materials and their dimensions

Figure: PCB cross section between ground/power planes, with heat applied

Thermal stimulation for Signal Integrity Margin

Thermal stress can stimulate the potential effects and impact of variation in parametrics, noise, L, C, and R resulting from manufacturing and materials variation

Cooling contracts materials and their dimensions

Figure label: Cold
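One way to see why shifts in L and C matter: for a lossless transmission line, the characteristic impedance is sqrt(L/C) and the propagation delay per unit length is sqrt(L·C), with L and C taken per unit length. The sketch below applies these textbook relations to hypothetical per-metre values and a hypothetical 5% rise in capacitance; it does not model how temperature actually changes a given board.

```python
import math


def characteristic_impedance_ohm(l_per_m, c_per_m):
    """Lossless-line characteristic impedance, Z0 = sqrt(L / C)."""
    return math.sqrt(l_per_m / c_per_m)


def delay_per_m_s(l_per_m, c_per_m):
    """Lossless-line propagation delay per metre, t = sqrt(L * C)."""
    return math.sqrt(l_per_m * c_per_m)


def reflection_coefficient(z_load_ohm, z0_ohm):
    """Fraction of the incident voltage reflected at an impedance discontinuity."""
    return (z_load_ohm - z0_ohm) / (z_load_ohm + z0_ohm)


if __name__ == "__main__":
    l_nom, c_nom = 3.0e-7, 1.2e-10  # hypothetical: 300 nH/m and 120 pF/m
    c_shifted = c_nom * 1.05        # hypothetical 5% capacitance increase
    z_nom = characteristic_impedance_ohm(l_nom, c_nom)
    z_shifted = characteristic_impedance_ohm(l_nom, c_shifted)
    print(f"nominal: Z0 = {z_nom:.2f} ohm, delay = {delay_per_m_s(l_nom, c_nom) * 1e9:.3f} ns/m")
    print(f"shifted: Z0 = {z_shifted:.2f} ohm, delay = {delay_per_m_s(l_nom, c_shifted) * 1e9:.3f} ns/m")
    print(f"reflection at the mismatch: {reflection_coefficient(z_shifted, z_nom):+.4f}")
```

A change in C alone moves both impedance (adding reflections) and delay (adding skew), so a single dimensional shift can erode voltage and timing margin together.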

Thermal stimulation for SI Margin

Thermal cycling adds an additional variation: thermal gradients create differential parametric shifts from shifts in dimensions and impedance

Thermal gradients provide differential mechanical and parametric stresses

Figure label: Transitions

No Fault Found (NFF) Field Returns

Two computers returned from two different customers with the same reported intermittent failure condition

– After five days of bench testing, OEM Failure Analysis could not duplicate the failure mode

– Units were declared NFF

– The same units, placed in thermal cycling (+65 to -10 °C), reproduced the same (soft) failure mode as reported by the customer 3 times in a 24-hour period

Combined HALT Stress Interactions

Stress combinations can have significant interactions for multi-dimensional limit or boundary maps

Clock/bus Frequency margining limits

Voltage margining limits

www.ieee-astr.org September 28-30, 2016, Pensacola Beach, Florida

Stress Boundary Maps

Figure: stress boundary map with temperature and voltage axes, showing the normal user operating ranges, the derating boundary with 5% guard band, the “Four Corner” test points, and the HALT operational limit

Distributions in the boundary identify reliability margin risks
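As a small illustration of the “Four Corner” idea, the sketch below enumerates every combination of the temperature and voltage range endpoints and checks each against a set of operational limits. The ranges and limits are hypothetical placeholders; real values would come from the product specification and from HALT results.

```python
from itertools import product


def four_corners(temp_range_c, voltage_range_v):
    """All (temperature, voltage) combinations of the two range endpoints."""
    return list(product(temp_range_c, voltage_range_v))


def inside_limits(point, temp_limits_c, voltage_limits_v):
    """True if a (temperature, voltage) point lies within the given operational limits."""
    temp_c, volts = point
    return (temp_limits_c[0] <= temp_c <= temp_limits_c[1]
            and voltage_limits_v[0] <= volts <= voltage_limits_v[1])


if __name__ == "__main__":
    user_temp_c = (0, 40)           # hypothetical normal user operating range
    user_volts = (11.4, 12.6)       # hypothetical supply range
    halt_temp_limits_c = (-30, 80)  # hypothetical operational limits found in HALT
    halt_volt_limits = (10.0, 14.0)
    for corner in four_corners(user_temp_c, user_volts):
        ok = inside_limits(corner, halt_temp_limits_c, halt_volt_limits)
        print(corner, "inside operational limits" if ok else "outside operational limits")
```

Four corners sample only the extremes of the user range; the boundary map adds the distance from those corners to the measured limits, which is the margin.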

Stress Boundary Maps

Wide distributions in limits mean a higher risk of stress-strength overlap

Figure: stress boundary map with temperature and voltage axes, showing the end-use stress conditions, the reliability margin on either side, and the region of stress-strength interference
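Stress-strength interference can be estimated in closed form if one is willing to assume that both the applied stress and the unit's limit (its strength) are independent and normally distributed. That assumption, and all the numbers below, are illustrative only; the presentation does not specify distributions.

```python
import math


def interference_probability(stress_mean, stress_sd, strength_mean, strength_sd):
    """P(strength < stress) for independent normal stress and strength."""
    z = (strength_mean - stress_mean) / math.hypot(stress_sd, strength_sd)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # standard normal CDF evaluated at -z


if __name__ == "__main__":
    # Hypothetical end-use temperature stress: mean 40 C, standard deviation 5 C.
    # Hypothetical upper operating limit: mean 70 C, with a narrow or a wide spread.
    narrow = interference_probability(40.0, 5.0, 70.0, 5.0)
    wide = interference_probability(40.0, 5.0, 70.0, 15.0)
    print(f"narrow limit distribution: P(interference) = {narrow:.2e}")
    print(f"wide limit distribution:   P(interference) = {wide:.2e}")
```

The mean margin is identical in both cases; only the spread of the limit changes, yet the overlap probability grows by orders of magnitude, which is the risk the wide-distribution map is pointing at.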

Allied Telesis | White Paper

Software Fault Isolation using HALT and HASS

First Presented at the IEEE/CPMT 2010 ASTR Workshop

• Donovan Johnson, Senior Hardware & Reliability Test Engineer

• Ken Franks, Hardware & Reliability Test Manager

Background

In 2004, Gregg Hobbs, Ph.D., gave a “Mastering HALT and HASS” class at the New Zealand research and development centre

The term “software fault” is defined at Allied Telesis (formerly Allied Telesyn) as a fault found in:

• The firmware of a product, such as code in a Programmable Logic Device (PLD)

• The boot code of a product, such as EPROM boot code.

• The operating system of a product.

Allied Telesyn HALT

Test at each thermal step during HALT

• External traffic test – use industry standard equipment

• Power Cycling – margin voltage and frequency

• Internal memory test – RAM and NVS testing

• Internal packet generator test – CPU, Encryption engine and RAM test

• Other product specific tests
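The idea of running a test suite at each thermal step can be sketched as a simple loop. Everything here is hypothetical: `run_tests` stands in for whatever combination of traffic, power-cycling, memory, and packet-generator tests a product uses, and the simulated unit below merely pretends to work between -25 and 75 °C. A real setup would also drive a chamber and wait for a dwell time at each setpoint.

```python
def step_temperatures(start_c, stop_c, step_c):
    """Yield setpoints from start_c toward stop_c (inclusive) in fixed steps."""
    step = abs(step_c) if stop_c >= start_c else -abs(step_c)
    temp_c = start_c
    while (step > 0 and temp_c <= stop_c) or (step < 0 and temp_c >= stop_c):
        yield temp_c
        temp_c += step


def find_operating_limit(setpoints, run_tests):
    """Return (last passing setpoint, first failing setpoint or None)."""
    last_pass = None
    for temp_c in setpoints:
        if not run_tests(temp_c):  # run_tests is a hypothetical callable: True means pass
            return last_pass, temp_c
        last_pass = temp_c
    return last_pass, None


if __name__ == "__main__":
    def simulated_unit(temp_c):
        # Hypothetical unit under test: functional only from -25 C to 75 C.
        return -25 <= temp_c <= 75

    hot = find_operating_limit(step_temperatures(20, 120, 10), simulated_unit)
    cold = find_operating_limit(step_temperatures(20, -80, 10), simulated_unit)
    print("hot step stress (last pass, first fail):", hot)
    print("cold step stress (last pass, first fail):", cold)
```

Recording both the last passing and first failing setpoint brackets the operating limit to within one step, so a smaller step near the limit gives a tighter estimate.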

Software Fault Isolation

Nearly one-third of the issues found in HALT were software related

Failure Types Found in HALT

Failure percentage:

• Software issues: 28%

• Hardware issues: 70%

• To be determined: 2%

Software Issues found using HALT

Abnormal LED Activity

• This anomaly was found during cold step testing at -10°C and was attributed to the reset pulse timing inside the PLD code.

• After a change to the PLD code, the system operated down to -50°C.

Software Issues found using HALT

Switch Tuning

• The UOL (upper operating limit) changed from 70°C to greater than 100°C, and the LOL (lower operating limit) from -20°C to less than -60°C.

• Changes in software increased the operational temperature range from 90°C to 160°C
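The range figures follow directly from the limits quoted on this slide. As a quick check, the short sketch below takes the operating range as the upper limit minus the lower limit; the function name is just for this example, and the "after" limits are treated as exactly 100°C and -60°C even though the slide gives them as bounds.

```python
def operating_range_c(lower_limit_c, upper_limit_c):
    """Width of the operating temperature window in degrees Celsius."""
    return upper_limit_c - lower_limit_c


if __name__ == "__main__":
    before_c = operating_range_c(-20, 70)   # limits before the software changes
    after_c = operating_range_c(-60, 100)   # limits after, taken at the quoted bounds
    print(f"before: {before_c} C wide, after: {after_c} C wide, gain: {after_c - before_c} C")
```

Because the slide states the new limits as "greater than" and "less than", the 160°C figure is a lower bound on the widened range.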

Software Issues found using HALT

System Crash

• A product that had been in the field for six months

• In the first HALT, the UOL was 70°C; the failure was a system crash

• A changed register setting inside the boot code allowed operation to 100°C.

• In addition to the software fault, a flaw within the CPU silicon was revealed, which amplified the effects of the software fault.

Software Issues found using HALT

System Silent Reboot

• Rapid Thermal Transitions exposed a flaw in software during temperature ramps even though the initial failure occurred in a moderate climate inside a server room.

• Failure mode was only apparent when running one particular test. A software patch fixed this problem.

Software Issues found using HALT

Silent Reboot

• The same fault took weeks to replicate intermittently using traditional methods.

• The same failure mode was repeatedly replicated in HALT in less than one day of testing

Summary

Thermal HALT has multiple benefits in electronics systems testing

– Well established for hardware latent defects

– Secondary and less recognized (an opportunity): thermally induced skewing of signal speeds in components and PWB assemblies helps to discover marginal SI that may result in failures of software and firmware.

Thank you – Q & A

The material in this presentation is contained in our new book, Next Generation HALT and HASS: Robust Design of Electronics and Systems, published by John Wiley & Sons, June 2016

Co-authored with John J. Paschkewitz