IEEE TRANSACTIONS ON COMPUTERS, VOL. C-29, NO. 6, JUNE 1980

0018-9340/80/0600-0417$00.75 © 1980 IEEE

Fault-Tolerant Computing

THE first Special Issue on Fault-Tolerant Computing was published in this TRANSACTIONS nearly a decade ago [1]. Five other Special Issues devoted to this same topic followed [2]-[6]; this is the seventh in the series. These seven issues contain a representative sample (over 120 papers and correspondences) of the research activities that have taken place in fault-tolerant computing over the past decade. It is therefore tempting, in perusing these issues, to look for possible trends in this activity.

At least two general trends are discernible, neither of which should be particularly surprising to readers of this TRANSACTIONS. The most obvious trend is the increasing concern with the effects of large-scale integration. Fault-tolerance techniques that were effective for computers implemented with SSI and MSI circuitry are not necessarily so for computers using LSI or VLSI circuits. Different failure modes may be anticipated as the scale of integration increases; testing and diagnostic procedures that were appropriate ten years ago may be totally inadequate for the much more highly integrated circuitry of today. The influence of such considerations on the work being done in fault-tolerant computing is pervasive, as even a cursory reading of the papers in this Special Issue will reveal.

The second trend suggested by an examination of this series of Special Issues has to do with the widening scope of application for fault-tolerant computing techniques. Much of the early work in this area was motivated by space applications, and in particular by the need for computers to be able to survive unattended for long periods of time. While this application is still an important one, and while other potential applications were certainly not unnoticed ten years ago, fault tolerance has now come to be recognized as a desirable, and in some cases an essential, feature of a wide range of computing systems. Interest in computers capable of long maintenance-free operation has been matched by interest in low-maintenance or scheduled maintenance commercial systems; in flight-control avionics systems that provide extremely high reliability for relatively short periods; and in high availability process-control and telephone switching systems. The influence of these various requirements on the work being done in fault-tolerant computing will also be apparent in the papers in this Special Issue, especially in those papers dealing with system design and evaluation.

The effect of large-scale integration on fault-tolerant computing research is particularly evident in this issue in the papers on testing. The paper by Suk and Reddy, for example, is concerned with the fact that "as more and more memory cells are packed into a single chip, the number of failure modes increases and the need for efficient algorithms to detect faults in them becomes more critical." The authors postulate a model for pattern-sensitive faults. Bounds are then derived on the number of test operations needed to detect and locate a pattern-sensitive fault, and efficient test procedures are described.

Thatte and Abraham [address] the problems of testing [highly integrated] circuitry, in this case microprocessors. Their [approach is to] model a microprocessor as a device for manipulating and transferring the contents of registers and then to model faults as events that interfere with these manipulations. Procedures are devised to detect these faults. Since the models developed are independent of the details of implementation, this approach should be appealing to microprocessor users who do not normally have access to these details.

The paper by Savir and the correspondence by Smith are both concerned with test procedures for complex logical circuits that obviate the very large lists of fault-free responses that would be needed were standard stimulus-response tests used. Savir's procedure involves counting the number of logical ones appearing at the output of a circuit in response to the set of all possible input combinations, and comparing this count with that which would be expected were the circuit fault-free. In his paper, he describes design procedures for ensuring that faulty circuits cannot yield the same count as fault-free circuits. Smith investigates the use of a linear feedback shift register rather than a counter to compress the serial output produced by a circuit under test. He shows that the effectiveness of this procedure can be improved through judicious selection of the shift-register feedback connections.

Abramovici and Breuer investigate a technique for diagnosing (i.e., detecting and locating) faults in combinational circuits that also avoids the need to store large amounts of data. Since they are concerned with multiple as well as single faults, the size of the fault dictionary (list of test outputs that would be observed were any of the faults of concern present) would be prohibitive, even for relatively simple circuits. Their method entails an "effect-cause" analysis; that is, they attempt to deduce the values of internal points in the circuit under test by observing the test outputs. On the basis of this analysis, they are able to identify fault conditions that are compatible with these observed outputs.

The next four correspondences are also devoted to various aspects of the fault detection/diagnosis problem. David and Thevenod-Fosse describe a procedure for generating what they call "minimal detecting transition sequences" guaranteed to detect a given fault in a sequential circuit regardless of the initial state of that circuit. They then apply this concept to the evaluation of the probability that a random test sequence of some specified length will detect the fault in question.

Karpovsky and Su examine the problem of detecting and locating a class of bridging faults in combinational circuits in which a short circuit exists between a primary input and an output. Conditions under which such faults can be detected are presented and test procedures for locating faults of this class are described.

Agarwal addresses the problem of detecting faults in programmable logic arrays (PLA's). He points out that such devices are vulnerable to a unique class of contact faults, and he develops a PLA model whereby these faults can be represented. Using this model, he is able to estimate the effectiveness of single-contact-fault tests when applied to a PLA actually containing multiple faults.

The growing recognition that traditional "stuck-at" fault models may not be appropriate for highly integrated digital circuits is reflected in the correspondence by Galiay, Crouzet, and Vergniault. The authors show that, in particular, such models do not adequately account for the effects of the most common physical failures observed in single-channel MOS LSI circuits. They then describe test procedures that are effective in uncovering these physical failures and suggest design rules that enhance the capability of such tests.

In a related correspondence, Crouzet and Landrault contend that an LSI circuit can be made self-checking if it is designed from the outset with that goal in mind and if its various possible failure modes are well understood. They apply this notion to the design of a self-checking MOS microprocessor and examine the efficiency of several self-checking techniques for this purpose.

The correspondence by Etiemble also investigates the interrelationship of technology (in this case, I²L), failure modes, and self-checking design. Designs are presented for two totally self-checking I²L checkers. (Self-checking checkers are circuits designed to check the status of another circuit, for example, by monitoring the coded output of that circuit, and at the same time provide outputs guaranteed to expose their own malfunctions.)
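The two response-compression ideas attributed earlier to Savir and to Smith can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-input example circuit, the stuck-at-1 fault, the 4-bit register width, and the tap positions are hypothetical and are not taken from either contribution.

```python
from itertools import product


def responses(circuit, n_inputs):
    """Yield the circuit output for every input combination, in counting order."""
    for bits in product((0, 1), repeat=n_inputs):
        yield circuit(*bits)


def ones_count(circuit, n_inputs):
    """Count the logical ones produced over all 2**n_inputs input combinations."""
    return sum(responses(circuit, n_inputs))


def lfsr_signature(bitstream, taps, width):
    """Compress a serial bit stream into a width-bit signature.

    taps lists the register bit positions XORed with the incoming bit
    to form the feedback (a serial-input signature register).
    """
    mask = (1 << width) - 1
    state = 0
    for bit in bitstream:
        feedback = bit
        for t in taps:
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & mask
    return state


def good(a, b, c):
    """Hypothetical fault-free circuit."""
    return (a & b) | c


def faulty(a, b, c):
    """Same circuit with input a stuck at 1 (hypothetical fault)."""
    return (1 & b) | c


# Only one number (the count or the signature) has to be stored per circuit,
# rather than the full list of fault-free responses.
count_differs = ones_count(good, 3) != ones_count(faulty, 3)
signature_differs = (lfsr_signature(responses(good, 3), (3, 2), 4)
                     != lfsr_signature(responses(faulty, 3), (3, 2), 4))
```

In this example the fault changes both the count and the signature. A fault that happened to leave the number of ones unchanged would escape the count, which is the situation the design procedures credited to Savir are meant to rule out; the choice of `taps` plays the analogous role for the shift-register scheme.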

Systems consisting of independent units (e.g., microprocessors) capable of monitoring each other are the subject of a paper by Mallela and Masson and of a correspondence by Simoncini, Saheban, and Friedman. A problem in such systems is that faulty units can erroneously indict healthy units; the ability of the system as a whole to identify those of its units that are faulty is therefore a function both of the number of its faulty units and of the way in which its units are interconnected (i.e., which units are capable of monitoring which other units). Mallela and Masson examine the relationship between connectivity and self-diagnostic capability when the units are subject to a combination of permanent and intermittent faults. Simoncini, Saheban, and Friedman investigate the fault tolerance of such a system when each unit performs diagnostics only when it is not needed for some other processing task. Since units may be available for diagnostic purposes at different times, this means that, in contrast to the assumptions usually made in analyses of this sort, some units may perform ordinary computations while others are engaged in diagnostics.
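The dependence of self-diagnosis on both the number of faulty units and the interconnection pattern can be made concrete with a small brute-force sketch. It assumes a simple permanent-fault rule that is not claimed to be the model used in either contribution: a fault-free unit reports the true status of the unit it tests, while a faulty unit may report anything. The five-unit ring and the test outcomes are hypothetical.

```python
from itertools import combinations


def consistent_fault_sets(n_units, outcomes, max_faults):
    """Return every fault set of size <= max_faults that explains the outcomes.

    outcomes maps (tester, tested) -> 0 (pass) or 1 (fail).
    Assumed rule: reports from fault-free testers are correct;
    reports from faulty testers are unconstrained.
    """
    found = []
    for k in range(max_faults + 1):
        for candidate in combinations(range(n_units), k):
            fault_set = set(candidate)
            if all(report == (tested in fault_set)
                   for (tester, tested), report in outcomes.items()
                   if tester not in fault_set):
                found.append(fault_set)
    return found


# Hypothetical ring: unit i tests unit (i + 1) % 5. Unit 2 is faulty and
# wrongly indicts healthy unit 3.
ring_outcomes = {(0, 1): 0, (1, 2): 1, (2, 3): 1, (3, 4): 0, (4, 0): 0}

one_fault = consistent_fault_sets(5, ring_outcomes, max_faults=1)
two_faults = consistent_fault_sets(5, ring_outcomes, max_faults=2)
```

If at most one unit can be faulty, the outcomes single out unit 2. If two faults are allowed, the same outcomes are also explained by units 2 and 3 both being faulty, so this ring can no longer identify the faulty units unambiguously; more test links would be needed.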

The influence of technology on fault-tolerant design can also be seen in the paper by Pradhan. Pradhan observes that the failure modes in certain memory technologies tend to be unidirectional; that is, the effect of a failure, at least in some cases, is to produce error patterns in which either 0 → 1 errors (logical zeros erroneously converted to logical ones) are significantly more likely than 1 → 0 errors, or conversely. He characterizes codes in terms of their ability to detect or correct certain combinations of unidirectional and bidirectional errors. He then presents a technique for constructing codes that are capable of combatting both types of errors.

The next two papers describe specific fault-tolerant computer designs. The AXE telephone-exchange switching system is described in the paper by Ossfeldt and Jonsson. AXE, first deployed in 1977, was designed to support the extremely high availability (no more than a few minutes' down-time per year) demanded of modern telephone systems. Ossfeldt and Jonsson describe the architecture used to ensure AXE's high availability and comment on its fault-tolerant performance both in the presence of simulated faults and in the field.

A design approach for fault-tolerant general-purpose computers implemented with VLSI is presented in the paper by Sedmak and Liebergot. The authors note that "there are significant problems in using some conventional fault-tolerant techniques in VLSI implementations." Their approach is to implement all the logic needed to detect faults in a VLSI chip directly on the chip itself and to design and partition this logic so as to minimize the possibility that any failure mode is capable both of causing a chip to malfunction and of simultaneously making it incapable of reporting this fact.

The influence of application on the evaluation of fault-tolerant systems is illustrated in the paper by Meyer, Furchgott, and Wu. The authors contend that a realistic assessment of a system capable of various degraded levels of performance involves both the system and its environment; in particular, such an assessment must concern itself with the relationship between the system's capability at any instant and the computational demands that are to be placed on it. They then apply this notion in an analysis of the SIFT computer being developed by SRI International for NASA's Langley Research Center.

Computer errors, of course, are not always caused by hardware failures. Faults in improperly designed software can also produce errors that may seriously jeopardize a computer's performance unless the effect of such errors can be contained. This containment problem is the subject of the final correspondence in this Special Issue, that by Lee, Ghani, and Heron. They point out that error containment can be facilitated through the use of a recovery cache. The function of a cache is to store all data that would be needed to restore a machine to the state it was in prior to the execution of any defective software module. The authors describe such a recovery cache designed to be used with PDP-11 computers.
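The recovery-cache idea summarized above can be sketched in software as follows. This is only an illustration of the save-and-restore principle; the class, its method names, and the dictionary standing in for machine state are hypothetical and do not describe the PDP-11 hardware design of Lee, Ghani, and Heron.

```python
_ABSENT = object()  # marks an address that held nothing before the module ran


class RecoveryCache:
    """Record prior values so that state can be rolled back (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for machine state
        self.saved = {}        # address -> value held at the last checkpoint

    def checkpoint(self):
        """Start a new recovery region, e.g. on entry to a software module."""
        self.saved.clear()

    def write(self, address, value):
        """Write through to memory, saving the old value on the first write only."""
        if address not in self.saved:
            self.saved[address] = self.memory.get(address, _ABSENT)
        self.memory[address] = value

    def restore(self):
        """Undo every write made since the last checkpoint."""
        for address, old in self.saved.items():
            if old is _ABSENT:
                self.memory.pop(address, None)
            else:
                self.memory[address] = old
        self.saved.clear()


memory = {"x": 1}
cache = RecoveryCache(memory)
cache.checkpoint()
cache.write("x", 2)   # old value 1 is saved
cache.write("x", 3)   # already saved; nothing more to record
cache.write("y", 9)   # address did not exist before
cache.restore()       # the module is judged defective: roll back
```

Saving a value only on the first write to each address keeps the stored data proportional to the number of distinct locations changed, not to the number of writes.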

ACKNOWLEDGMENT

The Guest Editor would like to thank the authors for their contributions to this Special Issue and the referees for their generous help in reviewing them. He would particularly like to thank Dr. Charles R. Kime, Associate Editor for Fault-Tolerant Computing and Design Automation, for his invaluable counsel and cooperation.

REFERENCES

[1] IEEE Trans. Comput., vol. C-20, Nov. 1971.
[2] IEEE Trans. Comput., vol. C-22, Mar. 1973.
[3] IEEE Trans. Comput., vol. C-23, July 1974.
[4] IEEE Trans. Comput., vol. C-24, May 1975.
[5] IEEE Trans. Comput., vol. C-25, June 1976.
[6] IEEE Trans. Comput., vol. C-27, June 1978.

J. J. STIFFLER, Guest Editor
Raytheon Company
Sudbury, MA 01776

J. J. Stiffler (M'63-F'76) received the A.B. degree in physics, magna cum laude, from Harvard University, Cambridge, MA, in 1956 and the M.S. degree in electrical engineering from the California Institute of Technology, Pasadena, in 1957. After a year in Paris as a Fulbright scholar, he returned to Caltech where he received the Ph.D. degree in 1962.

From 1961 to 1967 he was on the Technical Staff of the Jet Propulsion Laboratory, Pasadena, CA. He is at present a Consulting Engineer at the Raytheon Company, Sudbury, MA. His current interests include the design and analysis of highly reliable data processing systems.

Dr. Stiffler is the author of many papers in the field of communications. He wrote the book, Theory of Synchronous Communications, and also contributed to two other books. He is a member of Phi Beta Kappa and Sigma Xi.

Test Procedures for a Class of Pattern-Sensitive Faults in Semiconductor Random-Access Memories

DONG S. SUK, MEMBER, IEEE, AND SUDHAKAR M. REDDY, MEMBER, IEEE

Abstract-A class of pattern-sensitive faults in semiconductor random-access memories is studied. Efficient test procedures to detect and locate modeled faults are presented.

Index Terms-Pattern-sensitive faults, semiconductor random-access memories.

I. INTRODUCTION

Rapid developments in semiconductor technology have made larger and denser semiconductor memories on a single chip a reality. As more and more memory cells are packed into a single chip the number of failure modes increases and the need for efficient algorithms to detect faults in them becomes more critical. One of the more difficult fault diagnosis problems is to detect what are known as pattern sensitive faults [1]-[3]. The impracticality of attempting to detect unrestricted pattern sensitive faults in large semiconductor random-access read/write memories was shown earlier [2]. However, by taking into consideration the general features of memory design and access methods it appears to be possible to derive a restricted model of pattern sensitive faults that are most likely to occur. In this paper we propose a fault model and derive bounds on algorithms and propose procedures to detect and locate modeled faults. We consider only semiconductor random-access read/write memories (called RAM's henceforth) with R = 2^K bits, K a positive integer. Furthermore we assume that the RAM's are 1 bit wide (i.e., one bit of information is read or written into the memory at a time).

Manuscript received August 6, 1979; revised January 25, 1980. This research was supported by the Air Force Office of Scientific Research under Grant AFOSR-78-3582 and by the Rome Air Development Center under Contract F30602-78-C-0083.

D. S. Suk was with the Department of Electrical Engineering, University of Iowa, Iowa City, IA 52242. He is now with Bell Laboratories, Naperville, IL 60540.

S. M. Reddy is with the Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA 52242.

It should be mentioned that several algorithms to detect different classes of faults are given in [12]-[19]. In [12] and [13] different classes of pattern sensitive faults are also studied.

The paper is organized in the following way. In Section II the faults to be considered are given. In Section III a procedure to derive Eulerian paths in certain graphs is given. Lower bounds on the number of operations required in procedures to detect and locate modeled faults are given. Section V contains the test procedures for modeled faults.

II. FAULT MODEL AND PRELIMINARIES

As indicated earlier it is essential to derive a model for restricted pattern sensitive faults that would include the most probable failure modes. The basic idea behind this approach is the concept of "neighborhood" of a memory cell, introduced by Hayes [2]. The contents of a memory cell are potentially

0018-9340/80/0600-0419$00.75 © 1980 IEEE
