
Distributed Fault-Tolerant Computer Systems

Modular systems employing building-block VLSI circuits may provide fault tolerance to a variety of applications.


David A. Rennels
University of California, Los Angeles, and the Jet Propulsion Laboratory, California Institute of Technology

Distributed computers offer new ways of achieving fault-tolerant operation. Their salient feature is the distribution of computing power: when one computer fails, others may be able to aid in recovery. This distributed intelligence supports software recovery algorithms that are much more complex and powerful than those that a fault-tolerant uniprocessor can support. Uniprocessor systems require recovery controlled by relatively simple specialized hardware, since the faulty computer cannot be depended upon to execute software until it is repaired.1,2 Distributed computers allow other computers to perform detailed diagnosis of a faulty machine and then to reconfigure it, replace it with a spare, or take over portions of its computations to effect graceful degradation. Many of the same techniques of fault detection, containment, diagnosis, and local redundancy are employed in both distributed computers and their single-computer counterparts. It is instructive to examine the applications requirements and fault tolerance techniques which influence the design of ultrareliable distributed computing systems.

Four elements can be found in fault-tolerant computer design:

* fault detection: hardware and software mechanisms used to determine if a fault exists;
* fault containment: techniques used to prevent fault-damaged information from propagating through a system after a fault occurs but before it is detected;
* fault diagnosis: hardware and software techniques to locate and identify a fault; and
* fault recovery: mechanisms called upon to correct the fault by "voting out" incorrect results or replacing faulty components with spares.

These elements of fault-tolerant design are present in both centralized and distributed computer systems, and a large collection of design techniques has been developed for their implementation. Largely through NASA and DoD sponsorship, a number of fault-tolerant computer designs have been completed, and many of these have added to a growing fault-tolerance design methodology. Among the most influential work in this area has been the development of

* coding techniques for concurrent detection of specific faults,3
* self-checking circuits,4
* software-implemented fault tolerance,5
* memory systems tolerating multiple faults,6 and
* reliable clocking and communications networks.7,8

These and other techniques make up the "bag of tricks" available to the designer of fault-tolerant systems. The way they are used depends on the fault tolerance requirements, technology level, and economics of a system.

Requirements for fault-tolerant computing systems

The decision to employ fault tolerance in a computer system involves trading off the cost of failure against the cost of implementing fault detection and recovery. The cost of failure involves both the expense of unscheduled maintenance and the losses associated with incorrect or unperformed computations. We will use these criteria to define a set of application types.



Computation-critical applications. The most stringent fault tolerance requirements arise in real-time control systems in which faulty computations can jeopardize human life or expensive equipment. Such real-time applications require not only that a computation be correct, but also that any delay associated with fault recovery be very small (on the order of milliseconds). As computers take on more important roles in factories, hospitals, transportation systems, and other critical applications, the number of computation-critical applications will grow. A major program in this area (sponsored by NASA9) is the development of avionics computers for dynamically unstable aircraft.

Long-life applications. Long-life systems are ones which are never maintained. Unmanned spacecraft are the most dramatic examples; manual repair is impossible and they must operate reliably for five years or more. Computer systems for these applications are highly redundant, providing enough spare hardware to maintain nominal performance until the end of a mission. Long-life systems may or may not perform critical computations. Some systems rely on remote ground-based fault diagnosis and external reconfiguration, while others provide on-board automated fault recovery.10,11

Applications requiring high availability. In large, resource-sharing systems, the occasional loss of one user's computations is acceptable, but a systemwide outage or the destruction of a common data base is unacceptable. Examples are telephone switching computers and a variety of commercial timesharing services.12

Signal processing applications. High-performance computing systems will reach a point of speed and complexity where expected performance cannot be achieved without the use of fault tolerance. Supercomputers are moving toward this limit, as can be seen in the 4-hour MTBF of the Cray-1.13 The introduction of submicron VLSI technology, coupled with signal processing applications being studied by NASA and DoD, will increase system complexity one or more orders of magnitude beyond the current supermachines. For example, a proposed synthetic aperture radar processor requires an array of 1000 processing elements, each with a complexity equivalent to 40,000 gates.14 In such systems, transient errors due to complexity and limited clock margins can be expected frequently, along with a relatively high permanent failure rate.

Maintenance postponement. In a number of applications the life-cycle cost of unscheduled maintenance can be higher than the cost of fault tolerance. This is especially true for some military systems which must have many repairmen assigned to them. In an environment where on-site repair is expensive, fault tolerance becomes attractive. If, after a fault occurs, a computer system can continue operation while a defective module is shipped to a central repair facility, the number of "front line" repairmen can be reduced, along with their associated test equipment (which can also fail). VLSI technology should become sufficiently reliable that, for a modest amount of redundancy, it may be possible to "postpone" on-site repair of small computing systems for their entire operational lives.

Applicability of distributed architectures

The choice of a centralized or distributed architecture is determined by system considerations of which fault tolerance requirements are only a part.

The first two categories, computation-critical and long-life applications, are largely associated with dedicated real-time control systems. These systems are closely matched to distributed computing for several reasons. Typically, a number of subsystems (e.g., inertial reference, telemetry, guidance) perform complex functions. Embedding small, dedicated computers into these subsystems yields several advantages:

* The subcontractor who is most familiar with his own equipment can best develop the software necessary for its control and fault diagnosis.
* The resulting interface between subsystems can be greatly simplified (over a central time-shared computer), since the embedded computer can reduce intercommunication requirements by generating detailed timing and control signals locally.
* By off-loading time-critical local subsystem functions, higher-level (global) control and data handling programs can be made simpler and less time-dependent, and therefore more reliable.15

A number of studies have concluded that distributed computer architectures are superior to centralized computer architectures for a variety of real-time control applications.16,17 A model of a distributed system for spacecraft control and data handling, shown in Figure 1, is typical for many applications in this class. The system consists of a set of low-level computers, embedded within various subsystems. These computers, being dedicated, must have dedicated spares or the capability for internal reconfiguration for fault recovery, if the functions they provide are not discardable. At the next level is a set of nondedicated "high-level" computers which are responsible for controlling and providing computational services to the terminal computers. The high-level computers can, through software and communications protocols, be configured into single- or multiple-level hierarchical structures. Being nondedicated, a set of spare machines can back up all of the high-level computers, which are in turn responsible for fault detection and reconfiguration of the terminal computers and of a redundant intercommunications bus system.18

Fault-tolerant multiple-computer systems for real-time control applications can be viewed as "top of the line" fault-tolerant architectures.


Figure 1. A distributed system for spacecraft control and data handling. (HLC: high-level computer; TC: terminal computer; S2: an internally redundant subsystem; nondedicated spares back up the high-level computers.)

Designed for critical applications, these systems have extra hardware and unusual architectures which tend to be quite expensive. Specially designed hardware is often used which provides concurrent detection of faults before damaged information leaves a faulty module. These systems are typically designed to detect and recover nearly instantly from a wide range of hardware failures. Later we will briefly examine three such systems: SIFT, FTMP, and the FTBBC.5,18,19

The third category of applications, those requiring high availability and security, is much more oriented toward general-purpose computing than the dedicated control systems. Architectures for these applications tend to use a number of computers, so that if one fails, other machines can continue computation in a degraded mode. Since these architectures are oriented toward the commercial marketplace, they use minimal modifications to existing computers to attain fault tolerance, and what redundant hardware is used is aimed toward increased performance. Processors in this class of applications seldom employ specialized hardware for concurrent and complete detection of hardware faults. They depend upon isolation of processors in the distributed system (i.e., by protective memory mappings and communications protocols) to provide containment of erroneous data within a faulty processor, until the fault is detected by software and other diagnostic tests. Unlike dedicated control systems, these systems can be expected to run a variety of user programs whose demands cannot be anticipated. They can also be expected to support a full complement of shared peripheral devices. Thus, the problems of concurrency, resource sharing, and executive software become much more complex. Much more research in operating systems is involved in the development of these applications than in the other categories. Two excellent examples of the high-availability type of architecture are the Cm* and Pluribus systems.20,21

The fourth category, signal processing applications, is new, but several general characteristics can already be discerned. Sensing devices often generate hundreds of megabits per second that must be processed in a highly parallel fashion.14 Distributed architectures are often required to achieve the parallelism needed to process the incoming data. Processor arrays and interconnection structures must be tailored to the demands of the target application.

For many signal processing systems (which can also generate extremely large quantities of output data), it is acceptable for an occasional fault to disrupt processing for a few seconds, as long as automatic recovery follows. Examples in the literature indicate that high availability is more important than perfect computations. In any case, these processing arrays must contain redundant elements, procedures must exist to detect faults within a reasonable time after they occur, and a mechanism must exist to effect recovery from transient and permanent faults by restarting and replacing faulty processing elements. Micronet* is an interesting design having these attributes.22 (*Micronet is a registered trademark of the Westinghouse Corporation.)

The fifth category of fault tolerance, maintenance postponement, has not received a great deal of attention. NASA and DoD have typically supported specific applications in computation-critical avionics and long-life spacecraft systems. Much of the university work has focused on high-availability systems. Some recent research has shown that self-checking computers (i.e., computers which can detect a variety of internal hardware faults during normal operation) can be implemented relatively inexpensively.11,23 Using microprocessors, we can assemble small self-checking machines which can be backed up with spares to provide low-cost fault tolerance. The FTBBC architecture11 (Fault-Tolerant Building Block Computer) addresses the maintenance avoidance problem.

Distributed architectures provide the greatest number of design options.

The following sections deal with architectural techniques employed to achieve fault tolerance in selected multiple-computer architectures in each of the five applications categories outlined above. The similarities and differences between these approaches help identify designs which can apply to a wide range of applications requirements.

Architectural tradeoffs for fault-tolerant distributed systems

Having briefly looked at some applications of fault-tolerant systems, it is useful to re-examine the "bag of tricks" available to the architect. Although most of the same design techniques apply to both centralized and distributed processors, distributed architectures provide a greater freedom of choice. Specifically, there are the local fault tolerance techniques employed within each node (i.e., local computer) in the network, and global fault tolerance techniques applied across the collection of computers and their interconnections.

At one end of the design spectrum is a network made up of a collection of independently fault-tolerant computers, connected by a redundant communications structure. This approach is seldom used because it is more efficient to rely on network-wide resources for backup redundancy (i.e., shared spares) than to dedicate redundant units at each node. At the other end of the spectrum is a collection of nonredundant off-the-shelf machines connected by a redundant communications network, where fault tolerance is achieved through network-wide sparing and load sharing. Real systems often find a balance somewhere between these two extremes, with fault tolerance techniques employed at both the local and global level. Local modules may contain circuitry to enhance fault detection and to recover from some, but often not all, local faults. External system requirements may force some nodes to be independently fault-tolerant, due to a requirement for special, dedicated connections or very high-speed recovery.

Many aspects of reconfiguration and software recovery are global concerns.

In order to examine some existing fault-tolerant architectures, it is useful to develop a checklist of some of the design approaches currently used.

Custom design. In the design of the basic computer module which makes up the distributed system, we find two approaches. The first uses off-the-shelf computers with little or no internal modification. The second approach uses custom computers in which commercial processors and memories are connected with specially designed logic and internal buses to enhance local fault detection and, in some cases, provide a degree of local fault recovery. The external structure connecting the computers can vary in its degree of customization. At one end are structures which are modifications to existing buses (e.g., Unibus, MIL-STD-1553A), and at the other end are highly specialized, system-specific structures.

Redundant partitioning. Within computing "nodes" in a distributed system we find two common approaches to partitioning. With whole-computer partitioning, faults cannot be repaired within an individual computer, and recovery must be effected by other redundant computers. Alternatively, subcomputer partitioning may be used, where individual processors, memory modules, or "bit-planes" may be spared to make computers individually repairable. Whole-computer partitioning uses off-the-shelf machines, whereas subcomputer partitioning is associated with custom designs.

At the global level, partitioning can be more complex. If communication faults and computer faults can be recovered independently, a system has fully redundant communications. With fully redundant communications, a communications failure can be recovered (assuming spares are not exhausted) without the loss of any computer. Similarly, no computer failure precludes the use of any other computer or any portion of the communications system.

In many distributed computing structures, the communications system is only partially redundant, and intercommunications failures and processor failures are not independent. For example, a failure of a node in a tree-type communications structure may effectively disable all computers "below" that node.

Fault detection. In a distributed system, fault detection mechanisms can be employed both in local computers and in intercommunications between computers within the network. Several basic approaches are employed in fault detection. The first, concurrent fault detection, employs special hardware to detect faults in logic modules within the computers or communications system. The goal is rapid detection of faults, before damaged information propagates to memory or adjacent modules, as the computer performs its intended function. This can be done by running the same programs on two or more computers and comparing all data transfers among the processor, memory, and I/O circuits. Another low-cost approach is to implement a machine using a combination of error-detecting codes and self-checking logic.23 The second basic approach, stepwise comparison, involves running the same programs on two or more computers and, after executing various program segments, comparing or voting on their outputs to see if the computers agree. This approach can use off-the-shelf computers and provides highly effective fault detection, since it is unlikely that two machines will fail in the same way simultaneously and produce agreeing wrong outputs. However, a fault may greatly damage a memory's contents before it is detected.

A third approach is periodic testing, in which diagnostics are periodically invoked to check if a computer or communication path is still working. This approach detects few transient faults and provides less effective fault detection than the others. It is used in high-availability systems and attempts to confine faulty data to a single faulty machine supporting a single expendable user. A number of other fault detection techniques fall into an ad hoc category. They include software reasonableness checks, time-out counters, and testing with dummy input data.
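Time-out counters and reasonableness checks are simple enough to sketch. The fragment below (in present-day Python, with illustrative names and thresholds that are not taken from the article) shows both: a detector that declares a computer faulty when it stops reporting progress, and a range check on computed values.

    import time

    HEARTBEAT_DEADLINE = 0.1  # seconds; an illustrative real-time budget

    class TimeoutDetector:
        """Declare a computer faulty if it stops reporting progress."""
        def __init__(self):
            self.last_report = {}

        def report(self, node_id):
            # Each computer calls this periodically while healthy.
            self.last_report[node_id] = time.monotonic()

        def faulty_nodes(self):
            now = time.monotonic()
            return [n for n, t in self.last_report.items()
                    if now - t > HEARTBEAT_DEADLINE]

    def reasonable(value, low=-100.0, high=100.0):
        # Software reasonableness check: reject results outside the
        # physically possible range for this (hypothetical) quantity.
        return low <= value <= high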

Fault recovery. There are several basic approaches to fault recovery. The first, voted recovery, determines a correct result by a majority vote of the outputs of several modules. If disagreeing modules can be replaced with spares (under control of agreeing modules), this self-healing form of voted recovery is termed hybrid redundancy.24 A second approach might be called duplex recovery, in which the outputs of two modules are compared. If a disagreement occurs, diagnostic routines identify the faulty unit, and it is replaced or disabled.

A variation of this approach is duplex self-checking recovery. Here, two modules with concurrent internal fault detection hardware perform identical computations. If one fails, its checking hardware disables it, and the other module continues the computation. A third approach is standby recovery: when a permanent fault is detected, the module is replaced with an inactive spare, the spare is initialized, and software is restarted. Standby recovery is usually limited to processes that can be disrupted and restarted; such processes are common in control systems.
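Voted recovery is equally direct to express. A minimal sketch (function and variable names are mine, not from any of the systems described): take the majority value of the redundant outputs and report the disagreeing modules so they can be replaced with spares, as in hybrid redundancy.

    from collections import Counter

    def voted_recovery(outputs):
        # Majority vote over the redundant modules' outputs; the
        # disagreeing modules are candidates for replacement by spares.
        winner, count = Counter(outputs).most_common(1)[0]
        if count <= len(outputs) // 2:
            raise RuntimeError("no majority; recovery not possible")
        disagreeing = [i for i, v in enumerate(outputs) if v != winner]
        return winner, disagreeing

    # Example: module 2 has failed and produced a wrong result.
    result, bad = voted_recovery([42, 42, 41])
    assert result == 42 and bad == [2]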

Intercommunications structure and redundancy. Intercommunications in existing fault-tolerant systems usually fall into two categories. The first is redundant bus networks, in which each computer can communicate with other computers over a shared bus, which is replicated for fault recovery. The second is redundant point-to-point communications. Here each node, consisting of a computer, group of computers, memory, or I/O facility, can communicate with at least two other nodes. The redundant connection allows any node to be reached by a different path if one path should fail.

Hard core items. Hard core items are circuit elements whose failure could disable the complete computer network or large portions of it. Hard core items require thorough identification and careful protection against faults. Some hard core items are

* clocks (for synchronizing software),
* common control in the communications network,
* fault recovery mechanisms, and
* power supplies.

Selective redundancy. A distributed computing system is selectively redundant if critical computations can be thoroughly protected with a large amount of redundancy, while less important computations can be less expensively protected with less redundant hardware. This is equivalent to freeing redundant hardware to increase performance when noncritical computations are carried out, and can make the system more cost-effective.

Effectiveness of fault detection and recovery. A commonly accepted measure of the effectiveness of fault detection and recovery mechanisms is a parameter designated coverage. Coverage is the conditional probability, given that a fault occurs, that a proper recovery takes place. Modeling studies have shown that coverage must be very high (between 99 and 100 percent) to achieve the reliability goals of most fault-tolerant systems.25
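The sensitivity to coverage is easy to see in a first-order model. The sketch below uses a standard standby-redundancy approximation in the spirit of the modeling studies cited above: one active module with constant failure rate, one cold spare, and a switchover that succeeds with probability c, giving R(t) = e^(-lam*t) * (1 + c*lam*t). The failure rate and mission time are illustrative values, not figures from the article.

    import math

    def reliability(lam, t, c):
        # R(t) = exp(-lam*t) * (1 + c*lam*t): either the active module
        # survives, or it fails, the fault is covered (probability c),
        # and the spare survives the remaining time.
        return math.exp(-lam * t) * (1.0 + c * lam * t)

    lam = 1e-5    # failures per hour (hypothetical module failure rate)
    t = 1000.0    # hours of unmaintained operation
    for c in (0.90, 0.99, 0.999, 1.0):
        print(f"coverage={c:5.3f}  unreliability={1 - reliability(lam, t, c):.2e}")

Dropping coverage from 1.0 to 0.99 roughly triples the probability of failure in this example; the uncovered-fault term quickly dominates everything else.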

Fault-tolerant distributed architectures for critical applications

By far the most stringent fault tolerance requirements to date are associated with the control of dynamically unstable commercial aircraft. Although current planes are stable, future aircraft will be lighter and more fuel-efficient and hence unstable, requiring active computer control to stay in the air. Two fault-tolerant computers for this application have been designed under NASA sponsorship: FTMP, the Fault-Tolerant Multiprocessor, by the C. S. Draper Laboratory; and SIFT, the Software-Implemented Fault Tolerance system, by SRI International. Figure 2 compares the two designs.

Both designs have much in common, due to their shared goals of extremely high coverage and a failure probability of less than 10^-9 for a ten-hour mission. Each uses triplicated computations with voting to detect and correct faults, since this provides the most comprehensive fault coverage of all available choices. A fault limited to one module will be detected and corrected. Both designs are concerned with "lurking faults," i.e., faults which rarely cause incorrect outputs and are therefore hard to detect. Both periodically run diagnostics to "flush out" these faults, which could, if undetected, cause an error coincident with a second fault occurring at a later time in a different module, and thus upset the system. Each uses the voting process to mask independently occurring errors, and each quickly reconfigures itself to take a faulty module out of service.


These architectures differ in the tradeoffs they make between hardware and software in implementing fault tolerance features. FTMP employs fully synchronous hardware partitioned into individually reconfigurable processor/cache, memory, and I/O modules. These are connected by common bused clock and data lines and employ serial bit-by-bit voting in each processor/cache and each memory module. Dual "bus guardian" submodules control access to the buses to prevent bus pollution by "babbling" transmitters; they are also used to reconfigure module assignments. The processor/cache and memory modules are configured into "triads" (groups of three processors and memories), each of which is a voting set. Each triad carries out an independent computation, and the collection of triads makes up a multiprocessor.
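FTMP's voting is worth pinning down: a bit-by-bit majority of three words reduces to two levels of Boolean logic per bit. A sketch follows (FTMP does this serially in hardware; the Python below is only the word-level analog).

    def bitwise_majority(a, b, c):
        # Each result bit is 1 iff at least two of the three
        # corresponding input bits are 1.
        return (a & b) | (a & c) | (b & c)

    # A single-bit error in any one copy is masked:
    word = 0b10110010
    corrupted = word ^ 0b00001000        # one bit flipped in one copy
    assert bitwise_majority(word, word, corrupted) == word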

SIFT is partitioned at the computer level, and the computers run asynchronously. Each computer broadcasts its results to the others; process synchronization, voting, and reconfiguration are carried out by software. A faulty computer which cannot be disabled and "babbles" unwanted outputs is ignored by the other machines in the system. Software voting gives SIFT the property of selective redundancy. The number of redundant machines assigned to a task can be adjusted according to the task's criticality. Noncritical tasks can be run concurrently in different machines, while critical tasks are run in triplicate.

FTMP and SIFT have evolved over several design iterations and represent many years of experience. Prototypes are being constructed for evaluation.

Figure 2. Architectures for critical computations: FTMP, partitioned at the subsystem level, and SIFT, partitioned at the computer level.

Fault-tolerant distributed systems for general applications


For general commercial applications, fault-tolerant design tends to use existing machines to take advantage of their existing program libraries, low-cost high-volume production, and peripheral support and maintenance. Redundant hardware is not dedicated to triplicated processing or left idle as standby spares, but is utilized to increase performance. Fault tolerance goals are rather modest: to minimize system crashes and to prevent a faulty computer from disturbing other computers until the fault is recognized and the offender taken out of service. An acknowledged weak point in many of these systems is their lack of fault detection capability. Memory parity, software reasonableness checks, time-out counters, and software diagnostics are typically the only checks available in existing commercial processors; this results in fairly low coverage. However, if a faulty computer is identified, such systems can disable the faulty machine and continue operation.

Two early architectures, Prime and C.mmp, used a crossbar switch to connect local processors to a shared memory system. Present machines use computers with both local and shared memory, the latter located either within the various computers or in special memory modules. Redundant buses or redundant point-to-point connections are used and are often extensions of "standard" processor buses (e.g., Unibus). Two such architectures are Cm*, developed at Carnegie-Mellon University, and Pluribus, developed at Bolt Beranek and Newman, Inc. (Figure 3).

The Cm* is a two-level structure. At the lower level, several LSI-11 computers are connected through a local bus to form clusters. The local bus interface to each computer provides a degree of fault detection and logging for its associated computer. Each cluster interfaces with two intercluster buses through a more elaborate unit called a K.map. The K.map provides message switching and can correct single-bit errors in messages (through vertical and longitudinal parity) or reroute messages over redundant paths when an intercluster link fails. This two-level structure can connect a large number of machines. The shared memory space is diffused throughout the various computer clusters, but, due to program locality, most memory references are within clusters, with an occasional reference to memory in other clusters through the intercommunication system. A sophisticated virtual memory mapping system allows memory sharing, with protection against unwanted process interaction. Diagnostics are periodically run to detect faulty machines and prune them from the network. The goal of this system is "to provide distributed intelligence ... such that there is no critical system resource whose loss could cause system failure."20

The Pluribus system consists of a set of modules which fall into three types. The first type contains one or more computers (Lockheed SUE processors plus local memory); the second, collections of memory modules (to serve as shared memory); and the third, I/O and clock devices. All these modules are connected by a redundant point-to-point communication system. For example, a small system might consist of two computer modules having independent connections to two memory modules, and two I/O modules, each connected to the two memory modules.

Pluribus uses software to provide most of its fault tolerance. Data structures are redundantly constructed so that they can be checked for correctness. Time-out counters are employed for hardware/software fault detection, and many system functions require concurrence of several processors before they are executed.

The Pluribus serves as an IMP (interface message processor) in the Arpanet. Its goal is high availability. An occasional dropped message or brief outage is acceptable, but the system must recover within a few seconds so as not to disrupt network users. Pluribus is operational and has exceeded its requirements, with an availability of 99.9 percent.21

Cm* and Pluribus are not suitable for critical real-time control applications where no downtime is allowed and computations must always be correct. But by meeting a somewhat relaxed set of fault tolerance requirements (i.e., improved availability), they can utilize resources more efficiently than critical-application machines.

Fault-tolerant distributed systems for signal processing

One of the newest application areas for fault-tolerant computing is in highly parallel, very high performance signal processors, which require both parallelism and pipelined computations in order to handle very high incoming data rates.22 The Micronet architecture has been designed for this application area.

A group of computers performing identical computations in parallel is called a string. Each computer in a string receives data from a common input bus and delivers data to a common output bus. There are spare computers in each string, and the input and output buses are redundant. Strings can be serially connected to provide a pipelined set of parallel processors (Figure 4).

Figure 3. Architectures for general applications: Cm* connects several LSI-11s through a local bus to form clusters; Pluribus connects three module types through a redundant point-to-point communication system.

Figure 4. The Micronet signal processing structure: strings of computers are serially connected to provide a pipelined set of parallel processors. (C: active computer; Cs: spare computer; one spare serves as the checker.)

This approach to fault tolerance is based on the fact that the system processes high-volume data in batches, and that the loss of a few batches when an occasional fault occurs is acceptable. Thus, the goal is to quickly detect a fault and quickly replace the faulty module with a spare so that subsequent data batches will be processed correctly.

This scheme is simple and effective. One of the spare computers within each string is used as a checker. It is paired with one of the working (active) machines for processing a batch of data, and its outputs are compared with the machine under test. For the next batch of data the checker is paired with the next computer. Thus, the checker steps through a continuous sequence of pairings with each of the working machines. A special redundant hardware mechanism sequences this check procedure and effects recovery if disagreements are found. If the checker disagrees with one of the working computers, the working computer is assumed to be faulty and is replaced by a spare. If the checker disagrees with several computers in the string, the checker is assumed to be faulty and is replaced. Checking of this type is carried out in each string.

When a fault occurs, the system must wait until enough batches of data have been processed (incorrectly) so that the checker will be advanced to the faulty computer. When the checker identifies the computer as faulty, recovery can take place. Thus, the system is described as "self-healing."
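The checker's decision rule can be made concrete with a short simulation. The sketch below (all names are mine; the real mechanism is a special redundant hardware sequencer) pairs the checker with each active computer for one batch, then applies the rule described above: one disagreement implicates the checked computer, several implicate the checker.

    def check_rotation(active, checker, batch_stream):
        # Pair the checker with each active computer for one batch of
        # data and record which pairings disagree.
        suspects = []
        for compute in active:
            batch = next(batch_stream)
            if compute(batch) != checker(batch):
                suspects.append(compute)
        return suspects

    def recover(active, checker, suspects, spares):
        # One disagreement: replace the checked computer with a spare.
        # Several disagreements: the checker itself is presumed faulty.
        if len(suspects) == 1:
            active[active.index(suspects[0])] = spares.pop()
        elif len(suspects) > 1:
            checker = spares.pop()
        return active, checker

    good = lambda x: 2 * x               # a healthy processing element
    bad = lambda x: 2 * x + 1            # a faulty one
    active, spares = [good, bad, good], [good]
    suspects = check_rotation(active, good, iter(range(100)))
    active, checker = recover(active, good, suspects, spares)
    assert bad not in active             # the faulty element was replaced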

Architectures for multiple classes of fault tolerance applications

In current fault-tolerant architectures we see two major directions. Systems for critical control applications pay a considerable price in redundancy and specialized design in order to achieve high fault tolerance.

Table 1. Building block configurations.

Configuration: Uniprocessor (two CPUs, error-correcting memory, redundant I/O).
Application: Low-cost maintenance postponement.

Configuration: Two (duplex) self-checking computers.
Application: Moderate-cost maintenance postponement, error-free computations.

Configuration: Distributed self-checking computers.
Application: High-availability general-purpose computations with graceful degradation.

Configuration: Distributed self-checking computers; executive functions run in duplex computers, many spare units.
Application: Long-life control systems.

Configuration: Distributed self-checking computers; critical computations run on three machines, output voting.
Application: Critical computation applications.

Systems for general applications are economically constrained to avoid custom design whenever possible and to be hardware-efficient. Thus, they usually exhibit less effective fault detection and often allow some computations to be damaged when a fault occurs.

A look at the various architectures and applications leads to an obvious but important question: Is there an architecture which will efficiently meet the fault tolerance requirements of a variety of applications? SIFT has addressed this problem by allowing programs to be run with varying degrees of redundancy, depending on their individual criticality. Another approach is to develop a modular architecture using building block circuits, so that the system architect can assemble configurations efficient for differing applications, from simple memory error correction for maintenance deferral to highly redundant voted systems for critical systems. Using VLSI circuits, building blocks can be constructed with considerable capability (over 10,000 gates per chip). To be effective, however, the number of different building blocks must be reasonably small. The following is a set of desirable capabilities for a building block system:

* Concurrent fault detection. Recent work has shown that computers can be made self-checking at an additional hardware cost of 20 to 30 percent using LSI devices.11,23 This cost is low enough to make self-checking applicable to all configurations of the building block system. Such a capability provides standard local fault detection coverage close to 99 percent. Augmentation at the system level (e.g., triplicated modules with voting) is required to achieve very high coverage for certain highly critical applications.
* Use of existing processors. To have access to previously developed software and low-cost hardware, a building block approach should use existing processors. Preferably, a building block architecture should support any of several processors (e.g., TI 9900, 8086, etc.).
* Wide configurability. A building block architecture should allow the use of a single computer, or groups of computers, connected into a variety of distributed systems. It should be possible to add computers to either augment performance or increase sparing for long-life applications.

Table 1 shows typical building block configurations for various fault tolerance applications.

A fault-tolerant building block architecture

A fault-tolerant building block architecture designated the FTBBC has been developed at the Jet Propulsion Laboratory, and a feasibility breadboard is under construction. This architecture uses several building block circuits (each able to be implemented as a single VLSI device) to combine existing microprocessor and memory chips into fault-tolerant systems. The building block circuits provide the framework on which to build self-checking computer modules with interfaces to a redundant busing system. This redundant busing allows the modules to be combined into distributed fault-tolerant computer networks.

The FTBBC's self-checking computer module (SCCM) is shown in Figure 5. The SCCM contains four types of building block circuits which interface memories, processors, I/O, and external buses to an internal SCCM bus. The building blocks provide concurrent fault detection within themselves and in their associated circuitry. The internal bus employs error-detecting codes to verify transmission of addresses and data between the building blocks. All building blocks can be addressed by out-of-range addresses (memory-mapped I/O) and can be accessed both by the local processor and by other SCCMs through the external busing system. Thus, an external computer can read out status, access memory, or command reconfiguration within the SCCM by accessing local memory addresses through the external bus. The building blocks are:

The memory interface building block. The MIBB interfaces RAM chips to the SCCM internal bus to form a memory module. It supports single-error correction or double-error detection. In addition, the MIBB can be commanded to replace any two specified bits (in all words) with the two spare bit planes.
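A single-error-correcting Hamming code of the kind the MIBB applies to memory words can be sketched compactly. The version below covers only four data bits, where the MIBB works on full memory words with spare bit-planes; it is illustrative, not the MIBB's actual format.

    def hamming74_encode(d):
        # 4 data bits -> 7-bit codeword; parity bits sit at (1-indexed)
        # positions 1, 2, and 4, each covering the positions whose
        # binary index contains that bit.
        d3, d2, d1, d0 = (d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1
        p1, p2, p4 = d3 ^ d2 ^ d0, d3 ^ d1 ^ d0, d2 ^ d1 ^ d0
        return (p1 << 6) | (p2 << 5) | (d3 << 4) | (p4 << 3) \
               | (d2 << 2) | (d1 << 1) | d0

    def hamming74_correct(c):
        # Recomputing the three checks yields a syndrome equal to the
        # position of a single flipped bit (0 means no error).
        bits = [(c >> (7 - i)) & 1 for i in range(1, 8)]
        s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
        s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
        s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
        syndrome = 4 * s4 + 2 * s2 + s1
        return c ^ (1 << (7 - syndrome)) if syndrome else c

    cw = hamming74_encode(0b1010)
    assert hamming74_correct(cw ^ (1 << 2)) == cw   # one flipped bit, corrected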

The core building block. The Core BB controls two microprocessors which carry out identical computations. It continuously compares their outputs and signals a fault if it detects a disagreement. The Core BB also serves as a bus arbiter and collects all fault indicators from other building blocks and from its own internal circuitry. If a fault is detected, the Core BB attempts either a program rollback or restart. If the fault recurs, it disables its host computer module by halting the processors and disabling the SCCM outputs.

The Core BB uses internal duplication and self-checking logic so that most failures in the checking logic will also shut down the module. When a module is thus disabled, other SCCMs can access its internal building blocks via the external bus system to read out its status, correct its memory, and command internal reconfiguration. One optional mode of operation allows computations with only one of the pair of duplex processors after the other has failed. Processor fault detection is lost in this degraded mode, which is intended primarily for maintenance deferral applications.
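The Core BB's two-step response (roll back once, then fail silent) is easy to capture. A minimal sketch, with names and structure of my own invention rather than the FTBBC's:

    class CoreBBPolicy:
        # Illustrative recovery policy: retry once on a miscompare to
        # clear transient faults; a recurring miscompare is treated as
        # permanent, and the module halts so that other SCCMs can read
        # its state over the external bus and reconfigure it.
        def __init__(self):
            self.retried = False
            self.halted = False

        def on_miscompare(self):
            if not self.retried:
                self.retried = True
                return "rollback"        # re-run from the last checkpoint
            self.halted = True
            return "halt"                # fail silent; outputs disabled

    policy = CoreBBPolicy()
    assert policy.on_miscompare() == "rollback"
    assert policy.on_miscompare() == "halt"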

Several bus interface building blocks. BIBBs are used in each SCCM to provide communications through redundant buses with other computers in the network. These devices are microprogrammable, providing bus controller and bus terminal functions. Status messages and coding verify proper transmission, and redundant buses provide backup transmission paths. The building blocks are described in more detail elsewhere.11

A multifunction I/O building block has been specified but is not being implemented in the initial JPL breadboard.

The cost of a building block self-checking computer is not excessive, compared to an equivalent nonredundant module. A typical nonredundant module consists of 32 memory chips, one processor chip, one I/O chip, and one bus interface chip, for a total of 35 LSI devices (plus a small set of MSI circuits). A similar self-checking module requires 36 memory chips, one memory interface, two processors, two bus interfaces, one core building block, and one I/O building block, for a total of 43 devices. This represents an approximate 23 percent increase in cost. If a spare bit-plane is included in memory to provide single-fault recovery, the cost increase is 29 percent; if a full Hamming single-error-correcting/double-error-detecting capability is employed in memory with two spare bit-planes, the increase rises to 60 percent.11
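The 23 percent figure follows directly from the chip counts given; a quick check is below. (The 29 and 60 percent configurations depend on memory chip mixes the text does not itemize, so only the base case is reproduced.)

    # 32 memory + 1 processor + 1 I/O + 1 bus interface chips
    nonredundant = 32 + 1 + 1 + 1                      # 35 LSI devices
    # 36 memory + 1 MIBB + 2 processors + 2 BIBBs + 1 Core BB + 1 I/O BB
    self_checking = 36 + 1 + 2 + 2 + 1 + 1             # 43 devices
    print(f"{(self_checking - nonredundant) / nonredundant:.0%}")   # -> 23%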

Figure 5. The self-checking computer module of JPL's Fault-Tolerant Building Block Computer.


The FTBBC architecture was designed to meet a set of requirements for military control and long-life spacecraft applications; these requirements can range from low-cost maintenance deferral to long-life unattended operation. Software will run critical functions in duplex or in triplicated SCCMs, while noncritical tasks will be less expensively carried out in single standby-redundant SCCMs.18

The FTBBC is a step toward multi-application fault-tolerant architectures. However, many extensions must still be made to broaden its range of applications. The intercommunications buses of the FTBBC are tailored to real-time control applications, for example. It is clear that general-purpose applications require a much more powerful intercommunications system. Moreover, it is yet to be determined whether a single building block architecture can meet the communications demands of a Cm*-class system or the extreme reliability requirements of a SIFT or FTMP. The solutions to these important questions are linked to VLSI technology: approaches like the FTBBC architecture can use that technology to build reliable, inexpensive modules of very high complexity.

With the knowledge gained from a number of interesting and successful systems, fault-tolerant computing has developed into a mature discipline. Perhaps we are at a threshold where much of the divergent work can be consolidated and synthesized into a few highly modular architectures. Modular approaches could provide good fault coverage and allow cost-effective system configuration in a wide range of applications.

It is important to realize that fault-tolerant design, like the rest of computer architecture, is driven by technology. The systems we have described are based on existing technology plus or minus five years. New technology will be VLSI, and if history is any guide, it will take us years to understand how best to use it. Distributed modular systems will be a natural outgrowth of VLSI, since we can deal with enormous complexity only by breaking a system into smaller intelligent pieces. Using the very large number of logic elements that will be available will require highly parallel algorithms, a problem also pointing to distributed systems. High-performance signal-processing architectures are beginning to open a new and relatively unexplored area of fault-tolerant system design that is tightly coupled to new technology.

Fault-tolerant computing has progressed through two generations. The first included fault-tolerant uniprocessors like the Apollo and STAR computers. The second encompasses the multiple-computer architectures described here. The VLSI generation of architectures, building on a substantial body of results, should be even more challenging.


References

1. A. Avizienis et al., "The STAR (Self-Testing-And-Repairing) Computer: An Investigation of the Theory and Practice of Fault-Tolerant Computer Design," IEEE Trans. Computers, Vol. C-20, No. 11, Nov. 1971, pp. 1312-1321.

2. D. D. Burchby et al., "Specification of the Fault-Tolerant Spaceborne Computer (FTSC)," Proc. 1976 Int'l Symp. Fault-Tolerant Computing, Pittsburgh, Penn., June 1976, pp. 129-133.*

3. A. Avizienis, "Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design," IEEE Trans. Computers, Vol. C-20, No. 11, Nov. 1971, pp. 1322-1331.

4. W. C. Carter et al., "Computer Error Control by Testable Morphic Boolean Functions: A Way of Removing Hardcore," Digest of Papers, 1972 Int'l Symp. Fault-Tolerant Computing, Newton, Mass., June 1972, pp. 154-159.*

5. J. H. Wensley et al., "SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control," Proc. IEEE, Vol. 66, No. 10, Oct. 1978, pp. 1240-1255.

6. W. C. Carter and C. E. McCarthy, "Implementation of an Experimental Fault-Tolerant Memory System," IEEE Trans. Computers, Vol. C-25, No. 6, June 1976, pp. 557-568.

7. W. M. Daly et al., "A Fault-Tolerant Digital Clocking System," Digest of Papers, 1973 Int'l Symp. Fault-Tolerant Computing, Palo Alto, Calif., June 1973, pp. 17-22.*

8. T. B. Smith, III, "A Damage- and Fault-Tolerant Input/Output Network," Digest of Papers, Fourth Ann. Int'l Symp. Fault-Tolerant Computing, Urbana, Ill., June 1974, pp. 4-7 to 4-11.*

9. N. D. Murray et al., "Highly Reliable Multiprocessor," AGARDograph No. 224, Integrity in Electronic Flight Control Systems, Technical Editing and Reproduction Ltd., Harford House, 7-9 Charlotte St., London W1P 1HD, Apr. 1977, pp. 17.1 to 17.16.

10. NSSC: NASA Standard Spacecraft Computer, IBM Federal Systems Division, Space Systems, Huntsville, Ala., Feb. 1977.

11. Fault-Tolerant Building Block Computer Study, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, Calif., July 1978, JPL Publication 78-67.

12. W. N. Toy, "Fault-Tolerant Design of Local ESS Processors," Proc. IEEE, Vol. 66, No. 10, Oct. 1978, pp. 1126-1145.

13. A. Avizienis, "Fault Tolerance: The Survival Attribute of Digital Systems," Proc. IEEE, Vol. 66, No. 10, Oct. 1978, pp. 1109-1125.

14. V. C. Tyree, "A Custom Microcircuit for Spaceborne Synthetic Aperture Radar Processors," Digest of Papers, Government Microcircuit Applications Conf., Monterey, Calif., Nov. 1978, pp. 385-389.

15. D. A. Rennels, "Reconfigurable Modular Computer Networks for Spacecraft On-Board Processing," Computer, Vol. 11, No. 7, July 1978, pp. 49-59.

16. P. S. Kilpatrick et al., "All Semiconductor Distributed Processor/Memory Study," Avionics Processing Requirements, Vol. 1, Honeywell Inc., AFAL TR-72, performed for the Air Force Avionics Laboratory, Wright-Patterson Air Force Base, Ohio, Nov. 1972.

17. C. O. Beum, "Standardization of Avionics Information Systems," System Development Corporation, Santa Monica, Calif., TM-5159/000/00A, performed at the ARPA Institute for Defense Analysis, Aug. 1973.

18. D. A. Rennels, "Architectures for Fault-Tolerant Spacecraft Computers," Proc. IEEE, Vol. 66, No. 10, Oct. 1978, pp. 1255-1268.

19. A. L. Hopkins, Jr., et al., "FTMP: A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft," Proc. IEEE, Vol. 66, No. 10, Oct. 1978, pp. 1221-1239.

20. D. P. Siewiorek et al., "A Case Study of C.mmp, Cm*, and C.vmp: Part 1, Experiences with Fault Tolerance in Multiprocessor Systems," Proc. IEEE, Vol. 66, No. 10, Oct. 1978, pp. 1178-1199.

21. D. Katsuki et al., "Pluribus: An Operational Fault-Tolerant Multiprocessor," Proc. IEEE, Vol. 66, No. 10, Oct. 1978, pp. 1146-1159.

22. P. K. DeGonia et al., "Micronet: A Self-Healing Network for Signal Processing," Digest of Papers, Government Microcircuit Applications Conf., Monterey, Calif., Nov. 1978, pp. 370-377.

23. W. C. Carter et al., "Cost Effectiveness of Self-Checking Computer Design," Proc. Seventh Ann. Int'l Conf. Fault-Tolerant Computing, Los Angeles, Calif., June 1977, pp. 117-123.*

24. F. P. Mathur and A. Avizienis, "Reliability Analysis and Architecture of a Hybrid-Redundant Digital System: Generalized Triple Modular Redundancy with Self-Repair," AFIPS Conf. Proc., Vol. 36, 1970 SJCC, pp. 375-383.

25. W. G. Bouricius et al., "Reliability Modeling Techniques for Self-Repairing Computer Systems," Proc. ACM Ann. Conf., San Francisco, Calif., Aug. 1969, pp. 295-309.

*These proceedings and digests are available from the IEEE Computer Society Publications Office, 5855 Naples Plaza, Suite 301, Long Beach, CA 90803.

David A. Rennels is an acting assistant professor in the Computer Science Department at UCLA and is an academic member of the technical staff at the Jet Propulsion Laboratory of the California Institute of Technology, where he has been employed since 1966. His fields of interest are distributed computer architectures for real-time computing and fault-tolerant computing. He has been a consultant to various companies in the field of fault-tolerant computer design.

Rennels received the BSEE from the Rose-Hulman Institute of Technology, Terre Haute, Indiana, in 1964, the MSEE from Caltech in 1965, and the PhD in computer science from UCLA in 1973.
