    Published in IET Computers & Digital Techniques

    Received on 1st February 2009

    Revised on 21st July 2009

    doi: 10.1049/iet-cdt.2009.0011

    In Special Issue on selected papers from the 18th International Conference on Field Programmable Logic and Applications (FPL 2008)

    ISSN 1751-8601

    Fault tolerance and reliability in field-programmable gate arrays

    E. Stott, P. Sedcole, P. Cheung

    Department of Electrical and Electronic Engineering, Imperial College, Exhibition Road, London, UK

    E-mail: [email protected]

    Abstract: Reduced device-level reliability and increased within-die process variability will become serious issues for

    future field-programmable gate arrays (FPGAs), and will result in faults developing dynamically during the lifetime

    of the integrated circuit. Fortunately, FPGAs have the ability to reconfigure in the field and at runtime, thus

    providing opportunities to overcome such degradation-induced faults. This study provides a comprehensive

    survey of fault detection methods and fault-tolerance schemes specifically for FPGAs and in the context of

    device degradation, with the goal of laying a strong foundation for future research in this field. All methods and

    schemes are quantitatively compared and some particularly promising approaches are highlighted.

    1 Introduction

    As process technology scaling continues, integrated circuits face greater challenges from defects, process variability and reliability. Field-programmable gate arrays (FPGAs) are no exception to this; one recent study suggested defect tolerance will be necessary in future large FPGAs at and beyond the 45 nm technology node [1]. FPGAs have some key advantages over application-specific integrated circuits (ASICs) for achieving fault tolerance. Firstly, they are (mostly) composed of regular arrays of generic resources, giving them inherent redundancy. Secondly, they can be reconfigured in the field. These have been exploited in a wealth of research and some promising fault-tolerant systems have been developed.

    There have been different motivations for designs of fault-tolerant FPGA systems. Early work concentrated on increasing manufacturing yield through defect tolerance and some of this has found its way into commercial use [2]. The advent of SRAM FPGAs presented the problem of single-event upsets (SEUs), which are sporadic flips of configuration bits causing connectivity, logic and state errors. This has also led to a great deal of research, the benefits of which can be widely found in space and nuclear applications [3].

    The focus of this study, however, is on work relating to the reliability of FPGAs and in-field tolerance of permanent faults that are caused by device degradation. This is a less established field of research, although techniques developed in defect and SEU tolerance schemes are highly relevant to degradation fault tolerance. This aspect of fault tolerance is set to become increasingly important with the continuing development of silicon technology. This paper is based on material previously published by the authors in [4]. It is extended with new sections on fault modelling and future work in the field. Existing sections are discussed at greater depth, with 17 additional papers surveyed and eight new figures.

    A fault-tolerant system consists of two main components: fault detection and fault repair. Section 3 surveys fault detection methods and Section 4 considers fault repair. Causes of faults, modelling and application issues are discussed in Section 2 and the possibilities for future development of the field are explored in Section 5.

    2 Background

    2.1 Causes of degradation

    Degradation is the permanent deterioration of a circuit over time, resulting in a negative impact on performance. The effects can be progressive (a gradual change of a circuit parameter) or catastrophic (a sudden onset of a failed state in a circuit component). Degradation in VLSI circuits can be attributed to a number of mechanisms [5]. The hot-carrier effect leads to a build-up of trapped charges in the gate-channel interface region [6]. This causes a gradual reduction in channel mobility and an increase in threshold voltage in CMOS transistors. The effect on the circuit is that switching speeds become slower, leading to delay faults. Negative-bias temperature instability (NBTI) has similar consequences for circuits and is also caused by a build-up of trapped charges [7].

    IET Comput. Digit. Tech., 2010, Vol. 4, Iss. 3, pp. 196-210

    © The Institution of Engineering and Technology 2010 doi: 10.1049/iet-cdt.2009.0011

    www.ietdl.org

    Electromigration is a mechanism by which metal ions migrate over time, leading to voids and deposits in interconnects. Eventually, these can cause faults because of the creation of open and short circuits [8].

    Time-dependent dielectric breakdown (TDDB) affects the gates of transistors, causing an increase in leakage current and eventually a short circuit. The mechanism here is the creation of charge traps within the gate dielectric, diminishing the potential barrier it forms [9, 10].

    All of these degradation mechanisms have the potential to become more severe with the shrinking of process geometry. This is due to increasing gate field strength, higher current density, smaller feature size, thinner gate dielectrics and increasing variability [11]. In the case of TDDB, the situation is complicated by the introduction of new processes such as high-K dielectrics and metal gates [12].

    2.2 Other types of fault

    In addition to degradation, there are two other types of fault that can affect FPGAs. These are relevant to this study as some of the techniques that have been developed in response to them can also be applied to faults caused by degradation.

    The first of these is manufacturing defects. Manufacturing defects can be exhibited as circuit nodes which are stuck-at 0 or 1 or switch too slowly to meet the timing specification. Defects also affect the interconnect network and can cause short or open circuits and stuck open or closed pass transistors [13]. Testing for manufacturing defects is well established in VLSI and defect tolerance techniques are currently used in some types of device, including FPGAs [2], to increase yield.

    The second class of fault which is widely discussed in relation to FPGAs comprises SEUs and single-event transients (SETs), caused by certain types of radiation [14]. This is of particular concern to aviation, nuclear research and space applications where devices are exposed to higher levels of radiation and high levels of reliability are required. The most commonly considered failure mode is the flipping of an SRAM cell in the configuration memory, leading to an error in the logic function that persists until the configuration is refreshed in a process known as scrubbing. Although this recovery method is not applicable to permanent faults caused by degradation, ways of detecting SEU faults are relevant.
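
    The distinction between an SEU, which scrubbing clears, and a permanent fault, which it does not, can be illustrated with a toy model. The sketch below is not taken from any of the surveyed work: the 16-bit LUT configuration and the names `read_back`, `scrub` and `stuck` are assumptions made purely for illustration.

```python
# Toy model of configuration scrubbing (all names are illustrative assumptions).

def read_back(config, stuck_cells):
    """Return the bits the fabric actually sees; stuck cells ignore what was written."""
    return [stuck_cells.get(i, bit) for i, bit in enumerate(config)]

def scrub(config, golden):
    """Rewrite every configuration bit from the golden copy."""
    config[:] = golden

# Golden configuration: truth table of a 4-input XOR held in a 16-bit LUT.
golden = [bin(i).count("1") % 2 for i in range(16)]
config = list(golden)

# Transient fault: an SEU flips bit 5 of the live configuration.
config[5] ^= 1
seu_visible_before = read_back(config, {}) != golden
scrub(config, golden)
seu_visible_after = read_back(config, {}) != golden

# Permanent fault: cell 8 is stuck at 0 because of degradation.
stuck = {8: 0}
scrub(config, golden)
permanent_visible_after = read_back(config, stuck) != golden

print(seu_visible_before, seu_visible_after, permanent_visible_after)
```

    In this model the flipped bit disappears after one scrub, whereas the stuck cell still reads back incorrectly, which is why a separate repair step is needed for degradation faults.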

    2.3 Modelling of faults

    In order to effectively detect, locate and repair faults, a model is needed of how they affect the circuit. Fault modelling has several aspects including (a) determining which fault mechanisms may occur; (b) simulating the effect that possible faults will have on the system; (c) predicting the rate and distribution of failures; and (d) establishing fault scenarios for evaluating potential repair strategies.

    Faults can be modelled at different layers of the FPGA, as shown in Fig. 1. Although faults occur in the silicon structures which make up transistors and interconnect, fault-tolerant systems deal with them at various levels of abstraction. A repair at each level of abstraction aims to be transparent to the level above it.

    Logic: A low-level approach considers the underlying logic of the FPGA and models faults on particular circuit nets.

    Fabric: Some fault-tolerant systems consider faults in the FPGA fabric, that is the set of LUTs, registers, interconnect and so on that is available to the designer [15]. This has the advantage that these elements are easy to test with reconfiguration and BIST, though the behaviour of the configuration logic is obscured.

    Array: A popular option is to consider the FPGA at an array level, that is to mark off entire clusters or interconnect lines as faulty. This best exploits the regular structure of FPGAs.

    Application: A higher level of abstraction is possible when the application is modular and adaptable. This allows the fault model to extend to other parts of the circuit outside the FPGA for a very robust system.

    Figure 1 Design of an FPGA and its application can be abstracted to several levels

    Fault modelling and tolerance can be approached at numerous points in the hierarchy


    Within this study, fault repair is defined to be the repair of a faulty system so that it returns to being fully operational. Invariably, at some level of the FPGA this repair is achieved by the replacement of a failed component with a functional one. The size and nature of the replaced component varies from scheme to scheme and this represents the granularity of the approach. All of the studies surveyed here fall into the fabric-level, array-level or application-level categories.

    Approaching fault tolerance at different levels of abstraction places the burden of dealing with faults on different parties over the design, manufacturing and service phases of the product lifetime. A fabric-level repair, for example, may be completely transparent to the engineer who designs the application circuit and requires no alteration of the configuration data. On the other hand, an application-level strategy is likely to be embedded into the system design and be tailored to the application.

    An important part of fault modelling is to determine the possible failure modes at the design level under consideration. At the circuit level, the simplest of fault models assumes that faulty circuit nodes can be either stuck at 0, stuck at 1, shorted to another node or an open circuit [13]. Although these hard-failure modes have been an effective approach to defect testing for a long period, worsening process variation and degradation require marginal and timing faults to be considered [16]. Since some of the VLSI wear-out mechanisms are progressive in nature, marginal faults are likely to be more prevalent in field failures than in failures because of manufacturing defects. Examples of marginal faults include slow switching, intermittent switching, weak drivers and unstable registers.
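
    The stuck-at model described above can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions: the three-gate netlist, the net names and the helper functions are hypothetical and are not drawn from [13]. It injects a stuck-at fault on a net and lists the input vectors that expose it at the output.

```python
from itertools import product

# Tiny combinational netlist: out = (a AND b) OR (NOT c).
# Each entry maps a net name to (gate, input nets). All names are illustrative.
NETLIST = {
    "n1": ("and", ("a", "b")),
    "n2": ("not", ("c",)),
    "out": ("or", ("n1", "n2")),
}
GATES = {
    "and": lambda x, y: x & y,
    "or": lambda x, y: x | y,
    "not": lambda x: 1 - x,
}

def evaluate(inputs, stuck_at=None):
    """Evaluate the netlist; stuck_at maps a net name to a forced value (0 or 1)."""
    stuck_at = stuck_at or {}
    values = {name: stuck_at.get(name, v) for name, v in inputs.items()}
    for net, (gate, srcs) in NETLIST.items():  # insertion order is topological here
        result = GATES[gate](*(values[s] for s in srcs))
        values[net] = stuck_at.get(net, result)
    return values["out"]

def detecting_vectors(stuck_at):
    """Input vectors (a, b, c) whose output differs from the fault-free circuit."""
    vectors = []
    for a, b, c in product((0, 1), repeat=3):
        inputs = {"a": a, "b": b, "c": c}
        if evaluate(inputs) != evaluate(inputs, stuck_at):
            vectors.append((a, b, c))
    return vectors

print(detecting_vectors({"n1": 0}))  # net n1 stuck at 0
print(detecting_vectors({"n2": 1}))  # net n2 stuck at 1
```

    A hard-fault model of this kind says nothing about marginal or timing faults; capturing those would require a delay model, which this sketch deliberately omits.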

    Another aspect to a degradation fault model is the rate at which faults occur and how this varies over time [5]. Traditionally, a bathtub curve of failures is described for VLSI circuits, in common with many other manufacturing processes. High numbers of infant mortality failures occur shortly after manufacture, then the failure rate remains low until the end of the design life. Greater process variation and degradation will make these phases less distinct; for example, a significant background rate of failure may be observed over the entire life of the product [11]. This is illustrated in Fig. 2.
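
    One common way to sketch a bathtub-shaped failure rate is to sum Weibull hazard functions, whose rate falls over time for a shape parameter below 1 and rises for a shape parameter above 1. The example below takes that approach as an assumption; it is not the model used in [5] or [11], and every parameter value is invented for illustration.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def failure_rate(t, background=0.0):
    """Infant mortality + optional constant background + wear-out.

    All parameter values are made up for illustration.
    """
    infant = weibull_hazard(t, shape=0.5, scale=200.0)   # decreasing with time
    wear_out = weibull_hazard(t, shape=5.0, scale=10.0)  # increasing with time
    return infant + background + wear_out

years = [0.1, 1, 3, 5, 7, 9, 12]
trace_a = [failure_rate(t) for t in years]                   # classic bathtub
trace_b = [failure_rate(t, background=0.05) for t in years]  # raised mid-life floor

for t, a, b in zip(years, trace_a, trace_b):
    print(f"t={t:>4} years  a={a:.4f}  b={b:.4f}")
```

    The `background` term stands in for the mid-life failures that the text attributes to increased process variation and degradation; raising it flattens the trough of the curve.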

    FPGA systems can be reconfigured in the field, either to change their functionality or as required by a fault-tolerance system. This raises the possibility of dormant faults: faults that occur on resources which are unused when the fault occurs but which may be used in the future. The implication of this is that multiple faults may become apparent on reconfiguration and the system must be prepared for this.
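
    A dormant fault can be thought of as a faulty resource that lies outside the set of resources used by the current configuration. The sketch below illustrates this with set operations; the (row, column) cluster coordinates and both configurations are hypothetical.

```python
def manifest_faults(faulty_resources, used_resources):
    """Faults on resources that the given configuration actually uses."""
    return faulty_resources & used_resources

# Hypothetical resource names: (row, column) coordinates of logic clusters.
faulty = {(0, 3), (2, 1), (4, 4)}
config_before = {(0, 0), (0, 1), (1, 1), (2, 2)}
config_after = {(0, 3), (1, 1), (2, 1), (3, 0)}  # e.g. after a repair or an upgrade

visible_before = sorted(manifest_faults(faulty, config_before))
visible_after = sorted(manifest_faults(faulty, config_after))
print(visible_before)  # no fault is visible yet
print(visible_after)   # several dormant faults surface together
```

    Here all three faults are dormant under the first configuration, and two of them become manifest at the same moment when the second configuration is loaded.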

    2.4 Applications of fault tolerance

    All of the fault detection and repair methods surveyed have individual strengths and weaknesses and which method is most appropriate depends on the application.

    In some cases, reliability is critical for safety or mission success. For example, an automotive application was discussed at a system level by Steininger et al. [17]. Fast detection and/or error correction is crucial here so that erroneous data or state is not acted upon, which could be hazardous.

    A widely implemented application of fault tolerance in FPGAs is in space missions. Traditionally, this is because (a) they experience significant numbers of SEUs caused by increased radiation; (b) the breakdown of an electronic system could cause the mission to be lost; and (c) manual repair is impractical.

    In the light of variability and reliability concerns associated with future VLSI process nodes, it may become economical to use fault tolerance in general-purpose, high-volume applications. In this case, it will be important that the detection and repair method has the lowest possible overhead on timing performance and area. Such applications may be able to compromise data integrity and fault coverage to achieve this; for example, infrequent visible errors and a small proportion of returns would be tolerable in a consumer video decoder.

    3 Fault detection

    The first function to take place in a fault-tolerant scheme is fault detection. Fault detection has two purposes: firstly, it alerts the supervising process that action needs to be taken for the system to remain operational and secondly, it identifies which components of the device are defective so that a solution can be determined. These two functions may be covered simultaneously, or it may be a multi-stage process comprising different strategies.

    Fault detection methods can be categorised into three broad types:

    Redundant/concurrent error detection (CED) uses additional logic as a means of detecting when a functional block is not generating the correct output.

    Figure 2 The failure rate of electronic devices varies over time

    Trace a depicts the traditional bathtub curve, where most failures occur early or late in the device lifetime. Trace b shows the higher prevalence of mid-life failures that is caused by increased process variation and degradation


    Off-line test methods test the FPGA when it is not operating.

    Roving test methods perform a progressive scan of the FPGA structure whilst it is on-line.

    The different approaches to fault detection are evaluated against a set of metrics in Table 1.

    3.1 Functional redundancy and CED

    The principle of using redundancy to guarantee a reliable logical system was developed many years before the advent of VLSI [18, 19], and today it is widely used as a method of fault detection in FPGAs, particularly in the form of triple modular redundancy (TMR). The main driver for error detection of this kind is the need to detect and correct errors because of SEUs and SETs. However, these methods are also suitable for detecting permanent faults that occur while the system is operating.

    These detection methods work by invoking more than the minimum amount of logic necessary to implement the logic function. When an error occurs, there is a disagreement between the multiple parts of the circuit over which a particular calculation is processed and this is flagged by some form of error detection function.

    The simplest form of this kind of error detection is modular redundancy. A functional block is replicated, usually two or three times, and the outputs compared. If there are two modules then a difference between the outputs indicates that one of the modules is faulty. If there are three modules then, assuming a single fault, one group of outputs will differ from the other two. The use of three modules identifies which of them is faulty and allows the correct output to be maintained whilst a repair is underway.
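
    The difference between duplication and triplication can be shown in a few lines. The sketch below is a behavioural illustration only, with an 8-bit adder and a stuck output bit chosen arbitrarily; `dmr_check`, `tmr_vote` and `faulty_adder` are names invented here.

```python
from collections import Counter

def dmr_check(out_a, out_b):
    """Dual modular redundancy: flags a mismatch but cannot say which module failed."""
    return out_a != out_b

def tmr_vote(outputs):
    """Triple modular redundancy: majority vote plus index of the dissenting module.

    Returns (voted_value, faulty_index); faulty_index is None when all agree.
    Assumes at most one faulty module, as in the text.
    """
    voted, count = Counter(outputs).most_common(1)[0]
    if count == len(outputs):
        return voted, None
    if count < 2:
        raise ValueError("no majority: more than one module disagrees")
    faulty = next(i for i, o in enumerate(outputs) if o != voted)
    return voted, faulty

def adder(a, b):
    """Fault-free 8-bit adder used as the replicated functional block."""
    return (a + b) & 0xFF

def faulty_adder(a, b):
    """Same adder with bit 2 of its output stuck at 1."""
    return adder(a, b) | 0x04

print(dmr_check(adder(3, 8), faulty_adder(3, 8)))
print(tmr_vote([adder(3, 8), faulty_adder(3, 8), adder(3, 8)]))
```

    With two copies the comparator only reports that something is wrong; with three, the voter both masks the error and names the dissenting module, at the cost of a third copy of the logic.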

    Practical implementations of TMR partition the design into multiple sets of modules so that transient errors do not stick in state machines or registers [20]. The system proposed by D'Angelo et al. [21, 22] is capable of distinguishing between permanent and transient errors, and also whether the fault has occurred in a module or the voting logic. In [15], a modular redundancy system is given which can detect multiple faults on start-up.

    CED allows a more space-efficient design than modular redundancy. Extra bits are added to data flows and stores that are encoded with redundant logic, for example parity information. Data validation circuitry at the output of functional blocks can then detect faults that arise. The application of CED is very much dependent on the data flows and algorithms of the design and the logic overhead that is required varies. It is least efficient for small-width signals such as those found in control logic.
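
    As a minimal sketch of the parity example mentioned above, the code below attaches a single even-parity bit to a data word and checks it at the output. The word width, bit patterns and function names are assumptions for illustration and do not correspond to any particular CED scheme in the literature surveyed.

```python
def parity(word):
    """Even-parity bit of an integer word (1 if the number of set bits is odd)."""
    return bin(word).count("1") % 2

def encode(word):
    """Attach a parity bit to a data word before it enters a data path or store."""
    return word, parity(word)

def check(word, parity_bit):
    """Validation logic at the block output: True means an error was detected."""
    return parity(word) != parity_bit

word, p = encode(0b1011_0010)
single_bit_error = word ^ 0b0000_1000    # one bit flipped by a fault
double_bit_error = word ^ 0b0001_1000    # two bits flipped by a fault

print(check(word, p), check(single_bit_error, p), check(double_bit_error, p))
```

    A single parity bit costs very little but misses any error that flips an even number of bits, which is one concrete form of the trade-off between overhead and coverage.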

    CED techniques have been widely researched both in theory and in application to ASICs [23], but implementation on FPGAs presents additional problems. Using traditional CAD tools for FPGAs, there is the problem that error detection logic can be removed or made ineffective once it is minimised and implemented on LUTs. Bolchini et al. [24] sought to address this by mapping a self-testing circuit using standard tools then testing and redesigning any part where fault coverage had been lost. In [25], a CED method is developed for state machines in FPGAs using embedded memory blocks. Here, the memory blocks are used as ROM look-up tables for next state and output logic with embedded parity data. Any error in either the memory or the controlling logic causes a parity mismatch and is detected by a parity checker. A similar function can be carried out when a state machine is encoded using one-hot logic; the activation of more than one state at any time indicates an error. A full fault-tolerant system using CED was proposed in [26], where the error detection is broken down into regions to provide a level of fault location.

    Table 1 Comparison of fault detection methods

    | Method | Speed of detection | Resource overhead | Performance overhead | Granularity | Coverage |
    |---|---|---|---|---|---|
    | modular redundancy | fast: as soon as fault is manifest | very large: triplicate plus voting logic | very small: latency of voting logic | coarse: limited to size of module | good: all manifest errors are detected |
    | concurrent error detection | fast: as soon as fault is manifest | medium: trade-off with coverage | small: additional latency of checking logic | medium: trade-off with resource | medium: not practical for all types of functionality |
    | off-line | slow: only when off-line | very small | small: start-up delay | fine: possible to detect the exact error | very good: all faults including dormant |
    | roving (segmented interconnect) | medium: order of seconds | medium: empty test block plus test controller | large: clock must be stopped to swap blocks; critical paths may lengthen | fine: possible to detect the exact error | very good: multiple manifest and latent faults are detected |

    Redundancy provides a very fast means of error detection as a fault is uncovered as soon as an error propagates to the voting or validation logic. In addition, this form of error detection has a small impact on timing performance: only the latency of voting or parity logic, or similar. Modular redundancy detects all faults that become manifest at the output of a functional block, including transient errors, providing the other block(s) are fully functional. In CED, coverage comes at a trade-off with area overhead [27].

    These methods provide no means of detecting dormant faults, which may be relevant if an FPGA is going to be reconfigured in the field, either for fault repair or to alter the functionality.

    The chief drawback of redundancy as a method of error detection is the area overhead needed to replicate functionality, which can be over three times in the case of TMR [3]. Furthermore, it provides a limited resolution for identification of the faulty component. The fault can only be pinned down to a particular functional block or, in the case of TMR, an instance of a functional block. Fault resolution can be increased to a certain extent by breaking functional areas down and adding additional error detection logic [26].

    Redundancy does not have to be restricted to the circuit-area dimension. It is also possible to detect errors in a trade-off with latency/data throughput. Lima et al. [28] proposed a scheme where operations are carried out twice. In the second operation, operands are encoded in such a way that they exercise the logic in a different way. The output is then passed through a suitable decoder and compared to the original.
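
    The general idea of time redundancy with encoded operands can be sketched using recomputation with shifted operands, a well-known technique that is used here as a stand-in; it is not claimed to be the encoding of [28]. The adder model, the 7-bit operand limit and the stuck-at-0 output bit are all assumptions for illustration.

```python
OPERAND_BITS = 7                         # operands are limited to 7 bits
MASK = (1 << (OPERAND_BITS + 2)) - 1     # room for the carry and the one-bit shift

def faulty_add(a, b, stuck_bit=None):
    """Adder whose output bit `stuck_bit` is stuck at 0 (None means fault-free)."""
    result = (a + b) & MASK
    if stuck_bit is not None:
        result &= ~(1 << stuck_bit)
    return result

def checked_add(a, b, stuck_bit=None):
    """Run the addition twice: plain, then with both operands shifted left by one.

    Returns (result_of_first_pass, error_detected).
    """
    first = faulty_add(a, b, stuck_bit)
    second = faulty_add(a << 1, b << 1, stuck_bit) >> 1   # encode, compute, decode
    return first, first != second

print(checked_add(5, 6))                # fault-free hardware
print(checked_add(5, 6, stuck_bit=1))   # output bit 1 stuck at 0
```

    Because the shifted pass routes each result bit through a different physical output bit, a single stuck bit generally corrupts the two passes differently and the comparison fails. A fault that is not activated by either pass goes unnoticed, but in that case the result is also correct.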

    Although most of the work on redundancy has been aimed at detecting and correcting SEUs, there have been some notable publications which apply the techniques to fault detection. Parity checking is used in [29] as part of a fault-tolerant scheme that is structured so that detection is applied to small regular networks, rather than being bespoke to the function that is implemented. In the evolutionary system of [30], dual modular redundancy (DMR) is used to grade the fitness of competing configurations. The configurations are chosen in pairs from a pool and the outputs are compared on each clock cycle. The configurations that contain errors or faults cause more frequent output mismatches and accumulate a poor fitness score. More information on evolutionary fault tolerance is given in Section 4.2.4.
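
    The pairwise scoring idea can be illustrated with a small simulation. This is only a sketch of mismatch counting under assumptions made here: configurations are modelled as Python functions, every pairing is tried once rather than sampled, and the pool contents are invented. It does not reproduce the evolutionary algorithm of [30].

```python
from itertools import combinations

def good(x):
    """Reference behaviour of the 4-bit function being implemented."""
    return (x * 3 + 1) & 0xF

def slightly_faulty(x):
    """Loses output bit 3 on inputs that are multiples of four."""
    return good(x) & 0b0111 if x % 4 == 0 else good(x)

def badly_faulty(x):
    """Output bit 0 is inverted on every input."""
    return good(x) ^ 0b0001

pool = {"cfg0": good, "cfg1": good, "cfg2": slightly_faulty, "cfg3": badly_faulty}
mismatches = {name: 0 for name in pool}

for name_a, name_b in combinations(pool, 2):      # every pairing of the pool
    for x in range(16):                           # one input per "clock cycle"
        if pool[name_a](x) != pool[name_b](x):    # DMR comparison
            mismatches[name_a] += 1               # the comparator cannot tell which
            mismatches[name_b] += 1               # side is wrong, so both are penalised

ranking = sorted(mismatches, key=mismatches.get)  # fewest mismatches first
print(mismatches)
print(ranking)
```

    Even though a DMR comparator penalises both members of a mismatching pair, the faulty configurations accumulate more mismatches over many pairings and sink to the bottom of the ranking.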

    Redundant and data-checking detection systems are generally implemented as part of the application configuration, as they fit around the specific data and control functions that make up the circuit. They can be designed manually by the configuration engineer to fit the fault-detection requirements of the application, or integrated automatically using tools, such as Xilinx's TMRTool [31].

    In [32], an FPGA architecture was proposed that has error detection built in, so that it is transparent to the configuration. The system uses a combination of area and time redundancy to identify errors at the cell level and subsequently trigger an automatic repair. The practicality of this architecture is limited by the severe timing and area overheads of incorporating the error detection logic in each cell. Hardware error checking is used in Altera's Stratix FPGAs to detect SEUs in configuration memory. This could be used to detect faults in configuration logic that might otherwise be difficult to detect, but would need supplementing with soft-logic error detection to provide adequate coverage of the whole device.

    3.2 Off-line fault detection/built-in self-test (BIST)

    Off-line fault detection is another widely used technique, usually as a means of quickly identifying manufacturing defects in FPGAs. Any scheme that does this without the need for any external equipment is known as BIST, and is a suitable candidate for fault detection in the field.

    BIST schemes for FPGAs work by having one or more test configurations which are loaded separately from the operating configuration. Within the test configuration is a test pattern generator, an output response analyser and, between them, the logic and interconnect to be tested arranged in paths-under-test (PUTs). To be fully comprehensive, a BIST system will have to test not only the logic and interconnect, but also the configuration network. Specialised features such as carry chains, multipliers and PLLs also need to be considered. The Xilinx Virtex series of FPGAs feature a self-configuration port which can speed up this process and reduce the need for external logic [33].
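
    The three parts of a BIST configuration can be sketched behaviourally for a single LUT. The example below is a simplification under assumptions made here: the pattern generator is an exhaustive counter, the path under test is one 4-input LUT with stuck memory cells, the analyser compares against the known test configuration, and all function names are invented. Real schemes also cover interconnect and the configuration network.

```python
def pattern_generator(n_inputs):
    """Test pattern generator: exhaustive counter over all LUT addresses."""
    return range(1 << n_inputs)

def lut_under_test(truth_table, address, stuck_cells=None):
    """Path under test: a LUT whose memory cells may be stuck at a value."""
    stuck_cells = stuck_cells or {}
    return stuck_cells.get(address, truth_table[address])

def output_response_analyser(expected, observed):
    """Collect the addresses at which the observed response is wrong."""
    return [addr for addr, (e, o) in enumerate(zip(expected, observed)) if e != o]

def run_bist(n_inputs, stuck_cells=None):
    """Load two complementary test configurations so every cell is read as 0 and as 1."""
    failing = set()
    size = 1 << n_inputs
    test_configs = (
        [a & 1 for a in range(size)],        # 0, 1, 0, 1, ...
        [1 - (a & 1) for a in range(size)],  # 1, 0, 1, 0, ...
    )
    for test_config in test_configs:
        observed = [lut_under_test(test_config, a, stuck_cells)
                    for a in pattern_generator(n_inputs)]
        failing.update(output_response_analyser(test_config, observed))
    return sorted(failing)

print(run_bist(4))                              # fault-free LUT
print(run_bist(4, stuck_cells={6: 1, 11: 0}))   # two stuck memory cells
```

    Two complementary configurations are needed because a cell stuck at 1 is invisible while the test configuration happens to store a 1 there; this is a small instance of why the number and choice of test configurations matters.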

    Compared to traditional built-in and external test methods for ASICs, FPGAs have the advantage of a regular structure that does not need a new test programme to be developed for each design. Also, the ability to reconfigure an FPGA reduces or removes the need for dedicated test structures to be built into the silicon [34]. However, with the ability to reconfigure comes a vast number of permutations in the way the logic can be expressed, making optimisation of test patterns important.

    Published BIST methods have competed on coverage, test duration and memory overhead. Many focus on testing just one subset of FPGA structures, e.g. interconnect, suggesting a multi-phased approach may be appropriate for testing the whole chip. Testing of LUTs is a mature field; the BIST scheme for LUTs in [35] is designed to minimise test time, whereas a means of detecting multiple faults in a single test block or PUT is given in [36]. Alaghi et al. [37] also addressed multiple faults and concentrated on achieving the optimum balance between granularity and test logic overhead.

    Many publications have focussed on testing interconnect in response to the large amount of configuration logic and silicon area it consumes [38, 39]. A method of exhaustive testing for multiple faults in global interconnect using six configurations is given in [40]. In [41], a BIST system for interconnect is given that reduces test time through a large degree of self-configuration. Harris et al. used a hierarchical approach which locates stuck-at faults, short circuits and open circuits with the highest accuracy [13, 42].

    Recent developments have considered timing performance as well as stuck-at faults. Some authors have targeted resistive-open defects in interconnect by testing for delay faults in interconnect paths [43, 44]. Girard et al. considered the optimum test patterns for exercising delay faults [16, 45]. Methods for analysing the propagation delays of logic chains have been proposed in [46], by timing the difference between two paths using a ring oscillator, and more accurately in [47], by using a built-in PLL unit to match the clock speed to the propagation delay.

    Elements of BIST can be found in roving test systems, where only a small part of the FPGA is taken off-line for testing at any point in time. Roving and off-line testing are both cited as suitable applications for the delay-test method in [46]. Doumar et al. proposed an off-line test that uses a roving sequence to remove the need for reconfiguration; instead a small self-test area is always present and is shifted around the array to gain full coverage [48].

    The advantage of BIST as a fault detection method is that it has no impact on the FPGA during normal operation. The only overhead is the requirement to store test configurations, which may be compressed because of their repetitive nature. BIST also allows complete coverage of the FPGA fabric, including features that may be hard to test with an on-line test system, such as PLLs and the clock network. As the entire resource set is tested, the BIST process is common to all systems using a particular FPGA model and can be extended to cover a whole family of devices with little modification. The only additional work required to integrate BIST into a new FPGA design is to provide storage for the BIST configuration and set up a trigger to load it at the desired time.

    The major drawback of BIST for fault-tolerant systems is that it can only detect faults during a dedicated test mode when the FPGA is not otherwise operational. Typically, this would occur during system start-up, as part of a maintenance schedule, or in response to an error detected by some other means.

    3.3 Roving fault detection

    Roving detection exploits run-time reconfiguration to carry out BIST techniques on-line, in the field, with a minimum of area overhead. In roving detection, the FPGA is split into equal-sized regions. One of these is configured to perform self-test, whereas the remaining areas carry out the design function of the FPGA. Over time, the test region is swapped with functional regions one at a time so that the entire array can be tested while the FPGA remains functional. The process is illustrated in Fig. 3.
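
    The roving sequence can be sketched as a simple schedule in which the self-test region steps through the array. The code below is an illustrative model only: the region count, the timing values and the function names are assumptions, and it ignores the relocation of logic and routing that a real implementation must perform at each swap.

```python
def roving_schedule(n_regions, n_steps):
    """Yield (test_region, functional_regions) for each step of the rove."""
    for step in range(n_steps):
        test_region = step % n_regions
        functional = [r for r in range(n_regions) if r != test_region]
        yield test_region, functional

def rove_until_fault(n_regions, faulty_regions, swap_time_s, test_time_s):
    """Rove once around the array.

    Returns (first faulty region found or None, elapsed time in seconds).
    """
    elapsed = 0.0
    for test_region, _ in roving_schedule(n_regions, n_regions):
        elapsed += swap_time_s + test_time_s
        if test_region in faulty_regions:
            return test_region, elapsed
    return None, elapsed

# Hypothetical figures: 8 regions, 0.25 s per swap and 0.05 s per region test.
found, latency = rove_until_fault(8, {5}, swap_time_s=0.25, test_time_s=0.05)
print(found, round(latency, 2))
```

    In this model the worst-case detection latency is one full roving cycle, that is the number of regions multiplied by the time spent per region, which is why the period of the cycle governs how quickly a fault is found.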

    Roving test has a lower area overhead than redundancy methods; the overhead comprises one self-test region and a controller to manage the reconfiguration process. The method also gives excellent fault coverage and granularity, comparable to BIST methods.

    Roving test is less intrusive than a full system halt to carry out off-line BIST and it is usually possible to detect faults earlier. However, the speed of detection is not as good as redundancy techniques. The detection latency depends on the period of a complete roving cycle; the best reported implementations of roving test have maximum detection latency in the order of a second [49].

    Roving test impacts performance in two ways. Firstly, as the test region is moved through the FPGA, connections between adjacent functional areas are stretched. This results in longer signal delays and may force a reduction in the system clock speed, reported to be in the range of 2.5-15%. Secondly, implementations in current FPGAs require the functional blocks to be halted as they are switched. A 250 ms window for each swapping move has been reported [49].

    The dominant work in the field of roving test and repair has been carried out by Abramovici, Stroud et al. [49, 50]. Called roving STARs, this system uses two test areas, one covering entire rows and one covering entire columns. A roving test method was also proposed in [51]; by using FPGAs with bussed, rather than segmented, interconnects, a system was devised which had no impact on system clock performance. However, the resulting constraints on the system topology and connectivity restrict its applicability in the majority of systems.

    Figure 3 In roving test, blocks of the FPGA are taken off-line one at a time for testing

    By shifting functionality between blocks, the device can remain operational

    Like off-line BIST, roving detection does not need adapting to the application circuit, simplifying its deployment and allowing a market to develop in third-party fault-detection IPs. However, as the test process does have an impact on FPGA resource use and timing performance, careful verification of the application design would be needed once it was integrated.

    4 Fault recovery

    Once a fault is detected and located, it must be repaired. Depending on the fault modelling approach (see Fig. 1), repair can be approached at a number of different levels:

    Hardware: A hardware level repair performs a correction such that the FPGA remains unchanged for the purposes of the configuration. The device retains its original number and arrangement of useable logic clusters and interconnects.

    Configuration: A configuration level repair is achieved using spare resources that the design does not use. The spare resources can replace faulty ones in the event of a fault.

    System: A higher level of repair can be carried out at the system level. When a design is highly modular a fault can be tolerated by the use of a spare functional block, or by providing degraded performance [52]. Such methods are not considered in more detail here, as they are not limited in application to FPGAs.

    It should be noted that some fault detection methods also provide a level of fault tolerance. The voting system in TMR allows the erroneous output of one module to be ignored. Also, roving test provides fault tolerance by stopping the roving process if a fault is detected. If the fault stays within the test area it will not be used by the operational part of the FPGA. In both these situations, the system operates in a reliability degraded state where another fault would not be tolerated and may not even be detected. But they do allow the system to carry on functioning whilst a permanent repair is carried out. Table 2 shows the classes of fault repair techniques that have been reported and evaluates them against a range of metrics.

    Table 2 Comparison of fault repair methods

    | Method | Fault pattern tolerance | Resource overhead | Performance overhead/degradation | Complexity of repair | Repair level |
    |---|---|---|---|---|---|
    | hardware | poor: limited number and distribution tolerated | medium: spare resources required | low: transparent to configuration | low: effected with hardware switches | logic-array |
    | multiple configurations | poor: limited number and distribution tolerated. Interconnect tolerance causes complexity | low: uses naturally spare resources, but requires ROM for configurations | low: each configuration can be fully optimised | medium: selection and loading of configurations | array |
    | pebble shifting | medium: relies on nearby spare PLBs | low: uses naturally spare resources | medium: rerouting causes uncertainty | high: re-routing necessary | array |
    | cluster reconfig. | poor: reliant on spare resource in cluster. Poor tolerance in interconnect | low: uses naturally spare resources | low: changes only local interconnect, slight uncertainty | medium: analysis of logic, no re-routing | fabric |
    | cluster reconfig. + pebble shifting | good: flexible solutions possible | low: uses naturally spare resources | low: usually a fast alternative will be found, medium uncertainty | high: analysis of logic and rerouting | fabric-array |
    | constrained (chain shifting) | poor: limited number and distribution tolerated. Poor for interconnect | medium: a set of interconnect must be reserved | low | low: alternative routing already reserved | array |
    | evolutionary | good: implementation is completely flexible | large: configuration grading and storage | variable: solution is arrived at through random mutations | massive: may take a long time to repair | app. |

    To date, manufacturing defects are a far more common type of fault than failures in the field and have been the focus of practical fault-tolerance efforts in industry. The challenge of attaining an economic yield in FPGA manufacture will intensify along with the challenge of ensuring reliability, and it is likely that any scheme for fault tolerance will be used to tackle both. The key difference is that overcoming manufacturing defects only needs to be done once and does not need to involve the end user of an FPGA or even, for hardware-level repair, the OEM who installs and configures it. Therefore, defect tolerance can be viewed as a subset of multi-purpose fault tolerance, where the process is applied during manufacture. The techniques discussed in this section are targeted at in-field faults, but could also be applied to manufacturing defects in this way.

    4.1 Hardware-level repair

    The regular structure of FPGAs makes them suitable candidates for hardware-level repair, using methods similar to those used for defect tolerance in memory devices. In the event of a fault, a part of the circuit can be mapped to another part with no change in function.

    Hardware-level repair has the advantage of being transparent to the configuration. This makes repair a simple process, as the repair controller does not need any knowledge of the placement and routing of the design. Another benefit is that the timing performance of the repaired FPGA can be guaranteed, as any faulty element will be replaced by a pre-determined alternative. Switching in spare resources will change timing slightly, for example net lengths may change, but the configuration can be designed to function with the worst-case selection. A fault-tolerant system using hardware-level repair with off-line BIST could be packaged with an FPGA architecture and require little work to integrate into an application and little maintenance intervention in the field. Hardware-level fault tolerance has a drawback in that it can tolerate just a low number of faults for a given area overhead and there are likely to be certain patterns of faults which cannot be tolerated.

    The first methods of this kind were based on column/row shifting [53, 54]. Multiplexers are introduced at the ends of lines of cells that allow a whole row or column to be bypassed, by shifting across to a spare line of cells at the end of the array. A column/row shifting architecture was proposed in [32], which could repair faults in the field by shifting the configuration memory. If the FPGA is bus-based, the shifted cells can connect to the same lines of interconnect. For segmented interconnect, bypass sections need to be added to skip the faulty row/column. Today, column/row shifting has found its way into commercial FPGA designs for defect tolerance [2].
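    The column-shifting idea can be illustrated with a minimal sketch. It assumes a simplified model in which each logical column of the design is mapped onto the next healthy physical column, with spare columns at the end of the array; the function name and the model are illustrative and are not taken from [32] or [53, 54].

```python
def column_map(logical_columns, physical_columns, faulty):
    """Map each logical column to a physical column, skipping faulty ones.

    Columns to the right of a fault shift across by one place, consuming
    the spare columns at the end of the array. Returns None when there are
    more faulty columns than spares.
    """
    healthy = [c for c in range(physical_columns) if c not in faulty]
    if len(healthy) < logical_columns:
        return None  # this fault pattern cannot be tolerated
    return healthy[:logical_columns]


# Six logical columns on an eight-column array (two spares).
print(column_map(6, 8, {2}))        # [0, 1, 3, 4, 5, 6]
print(column_map(6, 8, {1, 4, 7}))  # None: three faults, only two spares
```

    The second call shows the limitation noted above: once the number of faulty columns exceeds the number of spares, no mapping exists regardless of how few cells are actually faulty.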

    As illustrated in Fig. 4, adding more bypass connections and multiplexers allows greater flexibility for tolerating multiple faults and makes more efficient use of spare resources [55]. In [56], faults in the configuration logic were considered and the proposed solution was to split the FPGA up into sub-arrays which can be configured independently.

    4.2 Configuration level repair

    Although hardware-level repair is attractive for defect tolerance because it is mostly transparent at the configuration level, the proposed schemes have not proved flexible or efficient enough for use in reliability enhancement. As the computational power available to FPGAs increases, the complexity of self-repair is becoming less of a constraint. For these reasons, the most promising fault tolerant systems have used configuration level repair. Configuration level repair exploits two key features in FPGAs: reconfiguration and the availability of unused resources. Configuration level repair strategies can be divided into several subclasses:

    4.2.1 Alternative configurations: A straightforward way of achieving fault tolerance is to pre-compile alternative configurations. As long as a configuration exists in which a given resource is unused, the FPGA can be repaired should that resource become faulty. In [57–59] the FPGA is split into tiles, each with its own set of configurations which have a common functionality and interface to the adjacent tiles. The replacement of a faulty tile is illustrated in Fig. 5.

    Figure 4 Trade-offs in row/column shifting methods

    Pre-compilation of the alternative configurations makes a repair simple to compute in the field; the replacement configurations can simply be indexed by the location of the unused/faulty resource or resources. It also guarantees the timing performance of the repaired design. Software tools could generate the configuration set for any design, as long as sufficient spare resources are available, and a simple index could select a repair configuration based on the location of the fault.
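    A minimal sketch of such an index is given below. The tile names, resource names and configuration identifiers are placeholders invented for illustration; in a real system each entry would refer to a stored partial bitstream.

```python
# Hypothetical index: for each tile, the precompiled configurations keyed by
# the resource that each one leaves unused. All names are placeholders.
TILE_CONFIGS = {
    "tile0": {"clb0": "cfg_a", "clb1": "cfg_b", "clb2": "cfg_c"},
    "tile1": {"clb0": "cfg_d", "clb1": "cfg_e"},
}


def select_repair(tile, faulty_resource, table=TILE_CONFIGS):
    """Return a precompiled configuration that avoids the faulty resource.

    Returns None when no stored configuration leaves that resource unused,
    in which case this repair method cannot tolerate the fault.
    """
    return table.get(tile, {}).get(faulty_resource)


print(select_repair("tile0", "clb1"))  # cfg_b
print(select_repair("tile1", "clb2"))  # None: no alternative was compiled
```

    The repair itself is a table lookup, which is why the computational cost in the field is low; the cost is instead paid in storage for the table.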

    This strategy performs relatively poorly in terms of area efficiency and multiple fault tolerance. It is dependent on there being a configuration available in which any given resource is set aside as a spare. If only a small amount of spare resource is available then a large number of configurations are needed to cover all possible faults. Allocating more spare resources allows a smaller number of configurations, but that reduces the amount of functionality that can fit in the FPGA. If the system is required to tolerate multiple faults, the number of configurations can easily become prohibitive, especially if they must be stored and recalled in the field. As the configurations are likely to have a significant amount of commonality, compression can be used to mitigate the ROM overhead, though this then requires decoding logic. Splitting the array into tiles allows multiple faults to be tolerated with superior overhead efficiency, but introduces complications if a fault occurs on the interface between adjacent tiles.
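    The growth in the number of configurations can be made concrete with a simple counting argument, sketched here under the assumption that every configuration leaves the same number of resources unused. Each configuration can then cover only a limited number of fault combinations, which gives a lower bound on the size of the set; the function name and the example figures are illustrative.

```python
from math import comb


def min_configs_lower_bound(n_resources, n_spare, n_faults):
    """Lower bound on the number of configurations needed so that every set
    of n_faults resources is left unused by at least one configuration.

    Each configuration leaves n_spare resources unused, so it covers
    comb(n_spare, n_faults) of the comb(n_resources, n_faults) fault sets.
    """
    if n_faults > n_spare:
        raise ValueError("cannot avoid more faults than there are spares")
    total = comb(n_resources, n_faults)
    per_config = comb(n_spare, n_faults)
    return -(-total // per_config)  # ceiling division


for k in (1, 2, 3):
    print(k, min_configs_lower_bound(100, 10, k))
```

    For a region of 100 resources with 10 left spare, the bound is 10 configurations for single faults, 110 for any pair of faults and 1348 for any three faults. This is only a lower bound; placement and routing constraints would normally push the real number higher.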

    4.2.2 Incremental mapping, placement and routing: A popular approach to fault tolerance is to recompute the mapping, placement and routing of the FPGA in the field. This has the potential to be a very efficient method as it can exploit the residual amount of spare resource which is found in virtually all practical FPGA designs. It has also proven to be the only method capable of handling large numbers of faults in an arbitrary pattern. The challenge to overcome is that the mapping, placement and routing tools of an FPGA CAD system must be adapted to operate autonomously on an embedded platform. This will impose constraints on processing power, memory and the time available to compute a new configuration.

    A simple method of tolerating faults in logic clusters exists if the cluster can be reconfigured to work around the fault [49, 60]. A typical example of this kind of repair is the swapping of a faulty LUT input with an unused one.

    Repair within a cluster is attractive because it has only a small impact on the global routing of the device, which makes the repair easy to compute and guarantees only a slight impact on timing performance. However, a repair of this kind is not always possible; there may be no spare resource of the type needed or there may be architectural constraints which prevent it being used without changing other clusters and global interconnect. This is especially true where hardware optimisations are used such as carry chains and distributed RAM.

    If there are spare clusters, then these can be used to replace faulty ones. To minimise the impact on timing and routing in the area around the fault, pebble shifting may be used. An illustration of this method is shown in Fig. 6, where a number of clusters are shifted to carry out the repair. In [61], an algorithm is given that calculates the cost of potential shifting moves as a function of additional routing distance and congestion in routing channels. Using this information, the entire repair can be optimised so that it causes the smallest reduction in timing performance and the least perturbation to the wider routing of the device.
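    Pebble shifting can be sketched on a single row of clusters as follows. This is a simplified illustration and not the algorithm of [61]: the cost used here is just the number of clusters that must move, whereas [61] weights additional routing distance and channel congestion. The function name and the "FAULTY" marker are assumptions made for the example.

```python
def pebble_shift(row, faulty):
    """Repair a row of clusters by shifting functions toward the nearest spare.

    row: list of function names, with None marking a spare (unused) cluster.
    faulty: index of the faulty cluster.
    Returns (new_row, moves), or None if the row contains no spare.
    The faulty cluster is left unused and marked "FAULTY" in new_row.
    """
    spares = [i for i, f in enumerate(row) if f is None and i != faulty]
    if not spares:
        return None
    # Simplified cost: the number of clusters that have to move.
    spare = min(spares, key=lambda i: abs(i - faulty))
    step = 1 if spare > faulty else -1
    new_row = list(row)
    moves = []
    # Walk from the spare back toward the fault, moving each function one place.
    for dst in range(spare, faulty, -step):
        src = dst - step
        new_row[dst] = row[src]
        moves.append((src, dst))
    new_row[faulty] = "FAULTY"
    return new_row, moves


print(pebble_shift(["a", "b", "c", None, "d", None], faulty=1))
# (['a', 'FAULTY', 'b', 'c', 'd', None], [(2, 3), (1, 2)])
```

    Each function moves by only one position, so the nets attached to it are lengthened by a small, bounded amount; every displaced cluster still has to be reconnected afterwards.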

    Figure 5 An alternative configuration is selected to repair a tile of clusters containing a faulty resource

    Figure 6 In pebble shifting, logic functions are shifted to replace a faulty cluster with a spare one

    In order to reconnect displaced clusters or to repair faulty interconnect an incremental router can be used. One such algorithm is developed in [62]. The router is optimised for deployment in the field and hence has a small hardware and memory requirement. Invalid nets are ripped-up and rerouted one at a time within a restricted window of the array. The router determines the placement and routing of the existing design by reading back the configuration memory so it does not need to read or modify a separate netlist database. A similar strategy is adopted by Lakamraju and Tessier [60] and a performance evaluation is given.
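    The reroute step for a single ripped-up net can be sketched as a breadth-first maze search confined to a window. This is a generic illustration on a uniform grid of routing cells and is not the router of [62] or [60]; the grid model, function name and parameters are assumptions.

```python
from collections import deque


def reroute(blocked, src, dst, window):
    """Breadth-first maze route from src to dst inside a restricted window.

    blocked: set of (x, y) cells that are faulty or used by other nets.
    window: (xmin, ymin, xmax, ymax), inclusive bounds of the search area.
    Returns the list of cells on a shortest path, or None if none exists.
    """
    xmin, ymin, xmax, ymax = window
    prev = {src: None}
    queue = deque([src])
    while queue:
        cell = queue.popleft()
        if cell == dst:
            path = []
            while cell is not None:  # follow back-pointers to the source
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = xmin <= nxt[0] <= xmax and ymin <= nxt[1] <= ymax
            if inside and nxt not in blocked and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    return None


# A net from (0, 0) to (2, 0) whose direct path crosses a faulty cell.
print(reroute({(1, 0)}, (0, 0), (2, 0), window=(0, 0, 2, 1)))
# [(0, 0), (0, 1), (1, 1), (2, 1), (2, 0)]
```

    Restricting the window bounds the memory and time needed for the search, at the price of missing routes that leave the window; a practical router might enlarge the window and retry when the search fails.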

    Cluster reconfiguration, pebble shifting and incremental routing can be combined to implement a flexible and efficient fault tolerant system. Lakamraju et al. and Abramovici et al. both recognised that cluster reconfiguration is a good first line of defence against faults because it is simple to evaluate and makes use of naturally spare resources [49, 60]. However, as it cannot guarantee a repair, pebble shifting and incremental routing are also needed to form a robust system. In [50], the approaches are merged so that a faulty cluster can still be used as a spare for logic expressions which do not require the faulty component. This further enhances the efficiency of the system as fewer dedicated spare clusters are needed.

    Although self-redesign in the field is a complex task for an FPGA-based system, it is becoming increasingly feasible as FPGAs become larger and more powerful. Also, an increasing number of applications use microprocessor cores which are either implemented in soft-logic or are embedded into the silicon as dedicated modules. A general-purpose microprocessor platform could be turned over to the task of self-repair when a fault arises, provided it remains operational. An alternative would be to compute a repair configuration remotely and download it to the device; this would require some form of communication resource. Providing the necessary computation or communication resources would create a significant workload for designers integrating this form of fault recovery into a system, though the incremental CAD algorithm itself would be general purpose.

    4.2.3 Constrained and coarse-grained repair: Incremental mapping, placement and routing provide a high degree of flexibility for dealing with random fault patterns, especially when cluster reconfiguration and pebble shifting are used together. However, this comes at the cost of increased computational effort for the repair which must be carried out in the field. Some publications have proposed less complex solutions by structuring the design such that the repair mechanism can operate over a limited set of parameters.

    In [63], a repair method known as chain shifting is given. Here, clusters are arranged into chains, each with one or more spare clusters at the end. The spare clusters are allocated when the device configuration is compiled along with a set of spare interconnects. If a cluster becomes faulty in the field, the chain can be shifted along to use a spare at the end. The pre-allocated interconnect is used to restore the original connectivity (see Fig. 7). This method has the advantage that it does not require placement and routing in the field; the allocation of repair resources can be computed by software tools within the configuration CAD flow. Also, the worst-case performance of the repaired design is known. However, each chain requires a certain amount of spare resource and can only tolerate a given number of failures. Hence, this method does not use its area overhead as efficiently as more intelligent approaches.

    In [29], a multi-level approach to fault tolerance is presented which consists of an array of small, reconfigurable multistage interconnection networks (MINs). Each MIN contains an amount of redundant logic so that some faults can be masked entirely [64]. After this, the system aims primarily to correct faults by reconfiguring the network. If a repair in this way is not possible the system can resort to more extensive slice or device-wide methods such as swapping a whole network with a spare and re-routing the design as necessary. The results show that, on average, several faults can be tolerated on each network using network-level repair. However, repair at this level cannot be guaranteed beyond the first fault. The area overhead of this system is large, given the redundancy that is built in and the use of a parity checker to validate every network.

    4.2.4 Evolutionary algorithms: Reconfiguration makes FPGAs well suited to evolutionary algorithms, where random changes are made to a design to overcome faults. The outcome of each change is monitored so that, over time, beneficial changes are retained and the design produces fewer errors. Unlike other fault recovery methods, an evolutionary approach does not need to know the location and nature of all the faults affecting a device. Instead, some form of error checking is used to grade the correctness of each attempted configuration.

    Figure 7 Clusters are shifted along a predefined chain using spare interconnect that has already been allocated

    Evolutionary algorithms were tested in [65, 66] as a means of synthesising and repairing FPGA configurations. In [66], stuck-at faults were simulated in a range of simple logic circuits by changing bits in the configuration data. The configuration then went through an iterative evolution process. The results showed that in all cases there was an improvement in the correctness of the circuit output. However, a complete repair was not guaranteed even when the fault density was low. The authors suggest that a wider redundant scheme could be implemented around the evolved modules to correct any residual errors.
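    The mutate-and-grade loop behind this kind of repair can be sketched with a toy model. Everything here is an assumption made for illustration and does not reproduce the experiments of [65, 66]: the device is two 3-input LUTs (A and B) plus one select bit, the fault is a single stuck-at-0 entry in LUT A, and the search is a simple hill climb that keeps any mutation that does not reduce correctness.

```python
import random

TARGET = [0, 1, 1, 0, 1, 0, 0, 1]  # desired 3-input function (parity)
STUCK = {1: 0}                     # hypothetical fault: LUT A entry 1 stuck at 0


def fitness(config):
    """Number of input patterns for which the faulty device is correct."""
    bits = list(config)
    for index, value in STUCK.items():
        bits[index] = value        # the fault overrides the configured bit
    lut = bits[8:16] if bits[16] else bits[0:8]  # bit 16 selects LUT A or B
    return sum(lut[x] == TARGET[x] for x in range(8))


def evolve(config, max_steps=20000, rng=None):
    """Hill climb: flip one random bit, keep it unless correctness drops."""
    rng = rng or random.Random(0)
    best = fitness(config)
    for _ in range(max_steps):
        if best == len(TARGET):
            break
        trial = list(config)
        trial[rng.randrange(len(trial))] ^= 1
        score = fitness(trial)
        if score >= best:          # neutral changes let the spare LUT drift
            config, best = trial, score
    return config, best


start = TARGET + [0] * 8 + [0]     # LUT A holds the function, LUT B is spare
repaired, score = evolve(start)
print(fitness(start), "->", score)
```

    The loop never needs the location of the fault, only the graded output, which matches the property described above. It also shows the weakness: nothing bounds the number of iterations needed, and in this model a repair is only possible because a spare LUT exists.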

    In [65], the use of redundancy is also suggested to complement an evolutionary repair mechanism. The focus in this work is the use of an evolutionary algorithm to generate a configuration that is tolerant to faults through the use of redundant elements. The granularity of the system is considered and the results show a trade-off between the level of fault tolerance possible and the number of iterations needed to reach a solution.

    Demara et al. proposed a complete system for achieving fault tolerance, based on a pool of competitive configurations [30]. To detect errors and assign a fitness to each configuration, two functionally identical configurations are invoked from the pool and the outputs are compared. Over time, different pairs are selected and each configuration accumulates a score representing the number of mismatches it has generated. Configurations that exceed a certain threshold are put through a mutation process to attempt to correct the fault. The process is illustrated in Fig. 8.
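    The grading step of such a pool can be sketched as follows. This is a simplified illustration rather than the system of [30]: configurations are modelled as plain functions, every pair is compared exhaustively instead of being sampled over time, and the mutation step is left out. All names and the threshold value are assumptions.

```python
from itertools import combinations


def grade_pool(pool, inputs, threshold):
    """Compare configurations pairwise and flag those that disagree too often.

    pool: dict mapping a configuration name to a function of one input.
    Returns (scores, flagged): the mismatch count for each configuration and
    the names whose count exceeds the threshold (candidates for mutation).
    """
    scores = {name: 0 for name in pool}
    for a, b in combinations(pool, 2):
        for x in inputs:
            if pool[a](x) != pool[b](x):
                # A mismatch does not reveal which side is wrong,
                # so both configurations are penalised.
                scores[a] += 1
                scores[b] += 1
    flagged = [name for name, s in scores.items() if s > threshold]
    return scores, flagged


pool = {
    "c0": lambda x: x & 1,
    "c1": lambda x: x & 1,
    "c2": lambda x: 0,      # models a configuration hit by a stuck-at-0 fault
}
print(grade_pool(pool, inputs=range(4), threshold=3))
# ({'c0': 2, 'c1': 2, 'c2': 4}, ['c2'])
```

    Healthy configurations also collect mismatches when paired with a faulty one, so the threshold has to separate the two populations; the faulty configuration stands out because it disagrees with every partner.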

    Evolutionary hardware is a research field that encompasses more than just FPGAs. A finer-grained system based on field-programmable transistor arrays is developed and tested in [67].

    An evolutionary approach allows a large degree of flexibility with the number and distribution of faults that can be tolerated. It does not need to carry out tests to locate and classify faults and it can discover ways to use partially faulty resources that would otherwise require complex modelling. It is best suited to numerical and data-flow applications, where faults are less likely to be catastrophic and error-detection circuitry can be added to quantify the error rate.

    The main disadvantage to evolutionary fault tolerance is that the overheads are very large. Error detection circuitry must be designed to check the outputs and this could require a large amount of resource if an error-free output is required. Also needed is a controller to manage the configuration updates. As the process is random, there is also no guarantee of how long a solution will take to evolve following a fault, or what its timing performance will be, although improved timing could be selected for by the evolutionary algorithm.

    4.2.5 Architectural enhancements for fault tolerance: The majority of fault repair methods that are based on reconfiguration target standard generic or commercial FPGA architectures. There exists some research that considers possible enhancements to FPGA architectures to improve the effectiveness of configuration-based repair.

    In [68], the architecture of switch blocks and switch block arrays is considered with respect to the ease of re-routing to avoid an interconnect fault. An algorithm is developed which evaluates the routability of a generic parameterised interconnect model when faced with different numbers and types of fault. The results of the analysis show the expected trade-off between better fault tolerance and lower area overhead, and that switch matrices of different topologies exhibit different fault-tolerance characteristics.

    Switch matrix design was also explored in [69] and the analysis used to develop a fault tolerant scheme. An algorithm is given which evaluates a given routing channel in terms of a connectivity matrix; this shows all the possible point-to-point connections and the number of these which have alternative routings. From this, it is possible to add extra strategic switches to give the greatest increase in routing redundancy for the smallest overhead. This scheme is aimed primarily at yield enhancement and does not aim to give complete fault coverage or tolerance for multiple faults, as doing these would not be an efficient use of additional silicon area.

    5 Future development

    Without methods of achieving in-field fault tolerance, the growing challenges of process variation and reliability will limit the progress of FPGA technology. Fortunately, there remains plenty of scope for future work in this field, both in developing the promising approaches seen so far and the exploration of new ideas. This section explores some of the possibilities that exist. We start by considering some improvements that can be made to detection and repair techniques using existing technology, and then consider how fault tolerance could be aided in the long term through architectural enhancements.

    Figure 8 In this example of an evolutionary repair system [30], candidate configurations are taken from a pool, graded against each other and mutated if faulty

    5.1 Fault detection

    BIST schemes for FPGAs have been widely researched and efficient methods are available for testing connectivity and logic function. The focus now is on self-testing of timing performance, which will become necessary as process variation and degradation become more severe [47]. For fault tolerance, this will allow the degradation of the chip (for certain wear-out mechanisms) to be monitored throughout the lifetime of the chip. Corrective and adaptive action, such as reducing clock speed or rerouting of critical paths, will then be possible.

    For some applications, off-line BIST is not possible because fast fault detection is required or maintenance down-time is not available. Modular redundancy provides a reliable solution but it consumes a large amount of FPGA resource. Future development here could take the form of a systematic means of adding redundancy to a design in a more efficient way. Possible approaches include an intermediate level in the design hierarchy such as [29], or analysis of datapath and control logic within the automated design process. Fault detection based on high-level behaviour is another possibility; for example, output properties such as SNR could be monitored for unexplained changes.

    5.2 Fault repair

    Current methods of incremental mapping, placement and routing are quite effective at reconfiguring current architectures to operate in the presence of faults. Some possible enhancements to the technique are improved repair speed, awareness of timing requirements for critical paths, tolerance of faults in the fault detection/repair kernel and tolerance of faults in DSP and memory blocks.

    An alternative to incremental, embedded CAD could be remote repair. If a system already has a means of communicating with a remote server, a repair configuration could be calculated remotely; this way the platform constraints are removed.

    Evolutionary algorithms are proven to be capable, in principle, of evaluating repairs for FPGAs. However, more development is needed before they can be considered practical methods of achieving fault tolerance. As well as improving the speed and accuracy of the process, more thought is needed as to how evolution can be implemented in a complete system.

    5.3 Fault tolerant architectures

    Much of the work to date on fault tolerance has assumed an architecture broadly similar to that of contemporary commercial FPGAs. Some publications have considered parametric modifications to current architectures [68, 69], but there are plenty of opportunities beyond this. The growing importance of fault tolerance allows researchers and, subsequently, vendors to explore FPGA architectures that are designed to accommodate fault tolerance as a fundamental part of the system.

    An example is roving test as a means of on-line fault detection. Current implementations are limited by the short periods when the clock must be stopped to move the test blocks. This is imposed by the nature of the point-to-point interconnect and configuration circuitry of current FPGAs. It may be possible to use techniques similar to those of proposed multi-context FPGAs to design an FPGA architecture that is more sympathetic to on-line test and repair. Multi-context FPGAs make more efficient use of hardware resource by allowing instantaneous swapping of blocks [70].

    Another limitation which applies to virtually all proposed fault detection and repair methods is that they consider only faults in the programmable components of the FPGA. There are a few examples that consider other parts of the FPGA, such as the configuration network in [56]; however, there remain many components which are not fault tolerant. If a truly robust FPGA system is needed, every part of the FPGA must either be testable and repairable in the field or be intrinsically reliable, for example using redundancy or more robust feature geometry.

    Recently, there has been some discussion of coarse-grained and network-on-chip FPGA architectures that are better optimised for computing applications. This notion could also be explored as a means of implementing fault tolerance. Complex FPGA applications are likely to be modular by design and may be organised in some form of hierarchy, for example a microprocessor with several cores. If the modules can be reconfigured, tested and repaired independently and can serve as replacements for one another then that provides a method of achieving on-line fault tolerance.

    A further avenue exists in the way FPGA configurations are described. Currently, fault repair algorithms based on incremental CAD operate on the affected portion of the configuration bitstream and aim to restore the described netlist. If the design information were stored in terms of higher-level functionality, rather than connectivity, the FPGA would have the freedom to correct faults using any form of available hardware. This would be particularly useful for heterogeneous arrays that contain a variety of types of different hardware resource. Repairs could also be carried out at a system level, using resources outside the FPGA where the fault has occurred.

    6 Conclusion

    Fault tolerance in FPGAs has been widely studied with works considering fault modelling, detection and repair. These have exploited advantageous features of FPGAs such as regularity and the ability to reconfigure. This paper has sought to present all the important publications in this field and categorise and compare the techniques that they present.

    The motivation behind the works studied here has varied between recovery from SEUs, resilience to manufacturing defects and tolerance of in-field faults because of degradation. The latter of these goals is of growing interest because of the increasing challenge of reliability that faces ongoing process scaling. The development of circuit-level techniques to handle degradation could enable FPGAs to provide new opportunities for reliability-critical applications and further push the boundaries of VLSI technology improvement.

    There is much scope for future work in this field; the techniques presented have differing strengths and weaknesses, or are suited to limited applications. Practical fault-tolerance schemes will have to be of low overhead and make efficient use of circuit resources. As the threat of degradation grows, so too will the impetus behind this field.

    7 References

    [1] CAMPREGHER N., CHEUNG P.Y., CONSTANTINIDES G.A., VASILKO M.:

    Analysis of yield loss due to random photolithographic

    defects in the interconnect structure of FPGAs. ACM Int.

    Workshop on FPGAs, 2005, pp. 138 148

    [2] MCCLINTOCK C.: Redundancy circuitry for logic circuits.

    U.S. Patent 66 166 559, December 2000

    [3] BERG M.: Fault tolerance implementation within SRAM

    based FPGA designs based upon the increased level of single

    event upset susceptibility. Int. On-Line Testing Symp., 2006

    [4] STOTT E., SEDCOLE P., CHEUNG P.: Fault tolerant methods for

    reliability in FPGAs. Int. Conf. on Field Prog. Logic and Apps.,

    September 2008, pp. 415–420

    [5] SRINIVASAN S., MANGALAGIRI P., XIE Y., VIJAYKRISHNAN N., SARPATWARI K.: FLAW: FPGA lifetime awareness. Design

    Automation Conf., 2006, pp. 630–635

    [6] GUÉRIN C., HUARD V., BRAVAIX A.: The energy-driven hot-

    carrier degradation modes of nMOSFETs, IEEE Trans.

    Device Mater. Reliab., 2007, 7, (2), pp. 225–235

    [7] SCHRODER D.K., BABCOCK J.A.: Negative bias temperature

    instability: road to cross in deep submicron silicon

    semiconductor manufacturing, J. Appl. Phys., 2003, 94,

    (1), pp. 1–18

    [8] CLARKE P., RAY A., HOGARTH C.: Electromigration – a tutorial

    introduction, Int. J. Electron., 1990, 69, (3), pp. 333–388

    [9] ESSENI D., BUDE J.D., SELMI L.: On interface and oxide

    degradation in VLSI MOSFETs – part I: deuterium effect

    in CHE stress regime, IEEE Trans. Electron Devices, 2002,

    49, (2), pp. 247–253

    [10] ESSENI D., BUDE J.D., SELMI L.: On interface and oxide

    degradation in VLSI MOSFETs – part II: Fowler–Nordheim stress regime, IEEE Trans. Electron Devices, 2002, 49, (2),

    pp. 254–263

    [11] SRINIVASAN J., ADVE S.V., BOSE P., RIVERS J.A.: The impact of

    technology scaling on lifetime reliability. Int. Conf. on

    Dependable Systems and Networks, 2004, pp. 177–186

    [12] CHEUNG K.: Can TDDB continue to serve as reliability

    test method for advance gate dielectric?. Int. Conf. on

    Integrated Circuit Design and Technology, 2004

    [13] HARRIS I., TESSIER R.: Testing and diagnosis of interconnect faults in cluster-based FPGA architectures, IEEE Trans. CAD

    Integ. Circuits Syst., 2002, 21, (11), pp. 1337–1343

    [14] NORMAND E.: Single event upset at ground level, IEEE

    Trans. Nucl. Sci., 1996, 43, (6), pp. 2742–2750

    [15] MOJOLI G., SALVI D., SAMI M.G., SECHI G.R., STEFANELLI R.: KITE:

    a behavioural approach to fault-tolerance in FPGA-based

    systems. Int. Workshop on Defect and Fault Tolerance in

    VLSI Systems, 1996, pp. 327–334

    [16] GIRARD P., HÉRON O., PRAVOSSOUDOVITCH S., RENOVELL M.: Defect

    analysis for delay-fault BIST in FPGAs. Int. On-Line Testing

    Symp., 2003, pp. 124–128

    [17] STEININGER A., SCHERRER C.: On the necessity of on-line-

    BIST in safety-critical applications. Int. Symp. on Fault-

    Tolerant Computing, 1999, pp. 208–215

    [18] NEUMANN J.: Probabilistic logics and the synthesis of

    reliable organisms from unreliable components. Automata

    studies (Ann. of Math. Studies, vol. 34, 1956), pp. 43–98

    [19] RAMAMOORTHY C., HAN Y.-W.: Reliability analysis of systems

    with concurrent error detection, IEEE Trans. Comput., 1975, C-24, (9), pp. 868–878

    [20] CARMICHAEL C.: Triple module redundancy design

    techniques for Virtex FPGAs. Xilinx Application Note

    XAPP197, 2006

    [21] D'ANGELO S., METRA C., PASTORE S., POGUTZ A., SECHI G.R.: Fault-

    tolerant voting mechanism and recovery scheme for TMR

    FPGA-based systems. Int. Symp. on Defect and Fault

    Tolerance in VLSI Systems, 1998, pp. 233–240

    [22] D'ANGELO S., METRA C., SECHI G.: Transient and permanent fault diagnosis for FPGA-based TMR systems. Int. Symp. on

    Defect and Fault Tolerance in VLSI Systems, 1999, pp. 330–338


    [23] MITRA S., MCCLUSKEY E.J.: Which concurrent error

    detection scheme to choose?. IEEE Int. Test Conf., 2000,

    pp. 985–994

    [24] BOLCHINI C., SALICE F., SCIUTO D.: Designing self-checking

    FPGAs through error detection codes. IEEE Int. Symp. on

    Defect and Fault Tolerance in VLSI Systems, 2002, pp. 60–68

    [25] KRASNIEWSKI A.: Low-cost concurrent error detection for

    FSMs implemented using embedded memory blocks of

    FPGAs. IEEE Design and Diagnostics of Electronic Circuits

    and Systems, 2006, pp. 178–183

    [26] HUANG W., MITRA S., MCCLUSKEY E.J.: Fast run-time fault

    location in dependable FPGA-based applications. IEEE Int.

    Symp. on Defect and Fault Tolerance in VLSI Systems,

    2001, pp. 206–214

    [27] LO J., FUJIWARA E.: Probability to achieve TSC goal, IEEE

    Trans. Comput., 1996, 45, (4), pp. 450–460

    [28] LIMA F., CARRO L., REIS R.: Designing fault tolerant systems

    into SRAM-based FPGAs. Design Automation Conf., 2003,

    pp. 650–655

    [29] ALDERIGHI M., CASINI F., D'ANGELO S., SALVI D., SECHI G.R.: A fault-

    tolerant FPGA-based multistage interconnection network

    for space applications. IEEE Int. Workshop on Electronic

    Design, Test and Applications, 2002, pp. 302–306

    [30] DEMARA R.F., ZHANG K.: Autonomous FPGA fault handling

    through competitive runtime reconfiguration. NASA/DoD Conf. on Evolvable Hardware, 2005

    [31] Xilinx Inc.: Xilinx TMRTool product brief, 2006

    [32] DURAND S.: FPGA with self-repair capabilities. Int.

    Workshop on Field Programmable Gate Arrays, 1994,

    pp. 1–6

    [33] Xilinx Inc.: Virtex-5 FPGA configuration user guide

    (vol. v2.5, 2007)

    [34] STROUD C., KONALA S., CHEN P., ABRAMOVICI M.: Built-in self-test

    of logic blocks in FPGAs. VLSI Test Symp., 1996, vol. 14

    [35] LU S., YEH F., SHIH J.: Fault detection and fault diagnosis

    techniques for lookup table FPGAs, VLSI Des., 2002, 15,

    (1), pp. 397–406

    [36] ITAZAKI N., MATSUKI F., MATSUMOTO Y., KINOSHITA K.: Built-in

    self-test for multiple CLB faults of a LUT type FPGA. Asian

    Test Symp., 1998, pp. 272–277

    [37] ALAGHI A., YARANDI M.S., NAVABI Z.: An optimum ORA BIST for multiple fault FPGA look-up table testing. Asian Test

    Symp., 2006, pp. 293–298

    [38] LIU J., SIMMONS S.: BIST-diagnosis of interconnect fault

    locations in FPGAs. Canadian Conf. on Electrical and

    Computer Engineering, 2003, pp. 207–210

    [39] CAMPREGHER N., CHEUNG P.Y., VASILKO M.: BIST based

    interconnect fault location for FPGAs. Int. Conf. on Field

    Programmable Logic, 2004, pp. 322–332

    [40] SUN X., XU J., CHAN B., TROUBORST P.: Novel technique for

    built-in self-test of FPGA interconnects. Int. Test Conf.,

    2000, pp. 795–803

    [41] SMITH J., XIA T., STROUD C.: An automated BIST architecture

    for testing and diagnosing FPGA interconnect faults,

    J. Electron. Test. Theory Appl., 2006, 22, (3), pp. 239–253

    [42] HARRIS I., TESSIER R.: Diagnosis of interconnect faults in

    cluster-based FPGA architectures. Int. Conf. on Computer

    Aided Design, 2000, pp. 472–475

    [43] CHMELAR E.: FPGA interconnect delay fault testing. Int.

    Test Conf., 2003, vol. 1, pp. 1239–1247

    [44] TAHOORI M.B.: Diagnosis of open defects in FPGA

    interconnect. IEEE Int. Conf. on Field-Programmable

    Technology, 2002, pp. 328–331

    [45] GIRARD P., HÉRON O., PRAVOSSOUDOVITCH S., RENOVELL M.: High

    quality TPG for delay faults in look-up tables of FPGAs. Int.

    Workshop on Electronic Design, Test and Applications, 2004

    [46] ABRAMOVICI M., STROUD C.E.: BIST-based delay-fault testing

    in FPGAs, J. Electron. Test., 2003, 19, pp. 549–558

    [47] WONG J.S.J., SEDCOLE P., CHEUNG P.Y.K.: Self-characterization

    of combinatorial circuit delays in FPGAs. Int. Conf. on

    Field Programmable Techniques, 2007, pp. 17–23

    [48] DOUMAR A., ITO H.: Testing approach within FPGA-

    based fault tolerant systems. IEEE Asian Test Symp.,

    2000, p. 411

    [49] ABRAMOVICI M., EMMERT J.M., STROUD C.E.: Roving STARs: an

    integrated approach to on-line testing, diagnosis, and fault tolerance for FPGAs. NASA/DoD Workshop on Evolvable Hardware, 2001, p. 73

    [50] EMMERT J.M., STROUD C.E., ABRAMOVICI M.: Online fault

    tolerance for FPGA logic blocks, IEEE Trans. VLSI Syst.,

    2007, 15, (2), pp. 216–226

    [51] SHNIDMAN N.R., MANGIONE-SMITH W.H., POTKONJAK M.: On-line

    fault detection for bus-based field programmable gate

    arrays, IEEE Trans. VLSI Syst., 1998, 6, (4), pp. 656–666

    [52] NAKAMURA Y., HIRAKI K.: Highly fault-tolerant FPGA processor by degrading strategy. Pacific Rim Int. Symp.

    on Dependable Computing, 2002, pp. 75–78


    [53] KELLY J., IVEY P.: A novel approach to defect tolerant

    design for SRAM based FPGAs. Int. Workshop on FPGAs,

    1994

    [54] HATORI F., SAKURAI T., NOGAMI K., ET AL.: Introducing

    redundancy in field programmable gate arrays. Custom

    Integrated Circuits Conf., May 1993, pp. 7.1.1–7.1.4

    [55] KELLY J., IVEY P.: Defect tolerant SRAM based FPGAs. Int.

    Conf. on Computer Design, 1994, pp. 479–482

    [56] HOWARD N.J., TYRRELL A.M., ALLINSON N.M.: The yield

    enhancement of field-programmable gate arrays, IEEE

    Trans. VLSI Syst., 1994, 2, (1), pp. 115–123

    [57] LACH J., MANGIONE-SMITH W.H., POTKONJAK M.:

    Enhanced FPGA reliability through efficient run-time

    fault reconfiguration, IEEE Trans. Reliab., 2000, 49, (3),

    pp. 296–304

    [58] LACH J., MANGIONE-SMITH W.H., POTKONJAK M.: Low overhead

    fault-tolerant FPGA systems, IEEE Trans. VLSI Syst., 1998, 6,

    (2), pp. 212–221

    [59] LACH J., MANGIONE-SMITH W.H., POTKONJAK M.: Algorithms

    for efficient runtime fault recovery on diverse FPGA

    architectures. Int. Symp. on Defect and Fault Tolerance in

    VLSI Systems, 1999

    [60] LAKAMRAJU V., TESSIER R.: Tolerating operational faults in

    cluster-based FPGAs. ACM Int. Workshop on FPGAs, 2000

    [61] NARASIMHAN J., NAKAJIMA K., RIM C.S., DAHBURA A.T.: Yield

    enhancement of programmable ASIC arrays by

    reconfiguration of circuit placements, IEEE Trans. CAD

    Integ. Circuit Syst., 1994, 13, (8), pp. 976–986

    [62] EMMERT J.M., BHATIA D.K.: A fault tolerant technique for

    FPGAs, J. Electron. Test., 2000, 16, (6), pp. 591–606

    [63] HANCHEK F., DUTT S.: Node-covering based defect

    and fault tolerance methods for increased yield in FPGAs.

    Int. Conf. on VLSI Design, January 1996, pp. 225–229

    [64] ADAMS G., AGRAWAL D., SIEGEL H.: A survey and comparison

    of fault-tolerant multistage interconnection networks, IEEE

    Comput., 1987, 20, (6), pp. 30–40

    [65] SHANTHI A., PARTHASARATHI R.: Exploring FPGA structures

    for evolving fault tolerant hardware. NASA/DoD Conf. on Evolvable Hardware, 2003, pp. 174–181

    [66] LARCHEV G., LOHN J.: Evolutionary based techniques for

    fault tolerant field programmable gate arrays. Int. Conf.

    on Space Mission Challenges for Information Technology,

    2006

    [67] KEYMEULEN D., ZEBULUM R.S., JIN Y., STOICA A.: Fault-tolerant evolvable

    hardware using field-programmable transistor arrays, IEEE

    Trans. Reliab., 2000, 49, (3), pp. 305–316

    [68] HUANG J., TAHOORI M.B., LOMBARDI F.: Fault tolerance

    of switch blocks and switch block arrays in FPGA, IEEE

    Trans. VLSI Syst., 2005, 13, (7), pp. 794–807

    [69] CAMPREGHER N., CHEUNG P.Y.K., CONSTANTINIDES G.A., VASILKO M.:

    Reconfiguration and fine-grained redundancy for fault

    tolerance in FPGAs. Int. Conf. on Field Programmable

    Logic, 2006, pp. 455–460

    [70] HARIYAMA M., OGATA S., KAMEYAMA M.: Multi-context FPGA

    using floating-gate-MOS functional pass-gates, IEICE

    Trans. Electron., 2006, E89-C, (11), pp. 1655–1661
