Hindawi
Scientific Programming
Volume 2018, Article ID 7306837, 16 pages
https://doi.org/10.1155/2018/7306837

Research Article

SEDC-Based Hardware-Level Fault Tolerance and Fault Secure Checker Design for Big Data and Cloud Computing

Zahid Ali Siddiqui,1 Jeong-A Lee,2 and Unsang Park1

1 Department of Computer Science and Engineering, Sogang University, 35 Baekbeom-ro, Mapo-gu, Seoul 04107, Republic of Korea
2 Department of Computer Engineering, Chosun University, 309 Pilmun-daero, Dong-gu, Gwangju 61452, Republic of Korea

Correspondence should be addressed to Unsang Park; unsangpark@sogang.ac.kr

Received 15 December 2017; Revised 15 March 2018; Accepted 3 April 2018; Published 7 June 2018

Academic Editor: Shangguang Wang

Copyright © 2018 Zahid Ali Siddiqui et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Fault tolerance is of great importance for big data systems. Although several software-based application-level techniques exist for fault security in big data systems, there is a potential research space at the hardware level. Big data needs to be processed inexpensively and efficiently, a purpose for which traditional hardware architectures, although adequate, are not optimal. In this paper, we propose a hardware-level fault tolerance scheme for big data and cloud computing that can be used with the existing software-level fault tolerance for improving the overall performance of the systems. The proposed scheme uses the concurrent error detection (CED) method to detect hardware-level faults, with the help of Scalable Error Detecting Codes (SEDC) and its checker. SEDC is an all unidirectional error detection (AUED) technique capable of detecting multiple unidirectional errors. The SEDC scheme exploits data segmentation and parallel encoding features for assigning code words. Consequently, the SEDC scheme can be scaled to any binary data length “n” with constant latency and less complexity, compared to other AUED schemes, hence making it a perfect candidate for use in big data processing hardware. We also present a novel area, delay, and power efficient, scalable fault secure checker design based on SEDC. In order to show the effectiveness of our scheme, we (1) compared the cost of hardware-based fault tolerance with an existing software-based fault tolerance technique used in HDFS and (2) compared the performance of the proposed checker in terms of area, speed, and power dissipation with the well-known Berger code and m-out-of-2m code checkers. The experimental results show that (1) the proposed SEDC-based hardware-level fault tolerance scheme significantly reduces the average cost associated with software-based fault tolerance in a big data application, and (2) the proposed fault secure checker outperforms the state-of-the-art checkers in terms of area, delay, and power dissipation.

1. Introduction

Big data is promising for business applications and is rapidly increasing as an important segment of the IT industry. Big data has also opened doors of significant interest in various fields, including remote healthcare, telebanking, social networking services (SNS), and satellite imaging [1]. Failures in many of these systems may represent significant economic or market share loss and negatively affect an organization’s reputation [2]. Hence, it is always intended that, whenever a fault occurs, the damage done should be within an acceptable threshold rather than beginning the whole task from scratch, due to which fault tolerance becomes an integral part of cloud computing and big data [3]. Fault tolerance prevents a computer or network device from failing in the event of an unexpected error [2]. A recent study [4] showed that the cost of fault tolerance in cloud applications with high probability of failure and network latency is around 5% for the range of application sizes, hence providing improved performance at a lower cost.

The fault tolerance schemes in popular big data frameworks like Hadoop and MongoDB are composed of some sort of data replication or redundancy [5, 6]. MongoDB replicates its primary data in secondary devices. In a faulty event, the data is recalled from the secondary, or the secondary temporarily acts as a primary. Fault tolerance in Hadoop relies on multiple copies of data stored on different data nodes. Although replication schemes allow complete data recovery,
they consume a lot of memory and communication resources. Hence, in recent years, many researchers have proposed fault tolerance algorithms for improved data recovery, effective fault detection, and reduced latency in big data and cloud computing [2, 5–10], all of which detect faults at the software (SW) level. Even though faults propagated due to transient errors in hardware are also detected by these schemes, and software-based techniques are more flexible, the amount of data that must be processed to detect a fault costs far more than in hardware- (HW-) based fault tolerance schemes. A recent study [11] investigated the cause of data corruption in a Hadoop Distributed File System (HDFS) and found that, when processing uploaded files, HW errors such as disk failure and bit-flips in processor and memory generate exceptions that are difficult to handle properly. Liu et al. [7] implemented some level of HW-based fault tolerance by modelling CPU temperature to anticipate a deteriorating physical machine. They proposed CPU temperature monitoring as an essential step for preventing machine failure due to overheating, as well as for improving the data center’s energy efficiency.

Parker [12] discussed how, in many cases, the faults are a direct consequence of tightly integrating digital and physical components into a single unit at a sensor or field node. In fact, many modern systems rely so heavily on digital technology that the reliability of the system cannot be decomposed and partitioned into physical and SW components, due to interactions between them. There is a cost associated with the storage, transmission, and analysis of these higher-dimensional data. Furthermore, many of the SW-based approaches are simulation intensive, which may lead to broad implementation challenges. To overcome some of these challenges, he suggested that onboard embedded processing will be a practical requirement.

Transient errors in HW, if propagated, may cause a chain reaction of errors at the SW layer, causing potential failure at the node/server level. Detection at the HW level requires less computation time (as low as a single clock cycle) compared with detection at the SW level (several machine cycles), while a simple recovery mechanism called recomputation at the HW level can save a lot of data swapping and signaling at the SW level. As discussed in [13], big data has created opportunities for semiconductor companies to develop more sophisticated systems to cover the challenges faced in big data and cloud computing, and the trend towards integration of more functions onto a single piece of silicon is likely to continue. Also, due to advances in semiconductor processing, there has been a reduction in the cost of digital components [12]. For these reasons, we propose the detection of transient faults as they occur in HW through a HW-based fault tolerance scheme, while the SW-based fault tolerance stays at the top level as a second check for HW errors and the first check for SW errors. As a result, the transient errors that arise in HW are mostly taken care of by lightweight processing at the HW level with little overhead (in terms of area, power, and delay), saving tremendous computation resources at the system level. The potential for catastrophic consequences in big data systems justifies the overhead incurred by the HW-based fault tolerance method.

On the other hand, fault tolerance has also become an integral part of very large-scale integration (VLSI) circuits, where downsized, large-scale, and low-power VLSI systems are prone to transient faults [14]. Transient faults, or soft errors, are transient-induced events on memory and logic circuits caused by the striking of rays emitted from an IC package and high energy alpha particles from cosmic rays [14–18]. Also, in multilevel cell memories like NAND Flash memories, these errors are mostly caused by cell-to-cell interference and data retention errors [19]. Physical protection, such as shielding, temperature control, and grounding circuits, is not always feasible; hence, in such cases, concurrent error detecting (CED) methods are employed for protection against these errors. Since CED circuits add to the overall area and delay of the system, the selection of appropriate error detecting, and even error correcting, circuits for a particular application leads to an efficient design [18]. It has been reported that the biggest portion of errors that occur in VLSI circuits and memories are unidirectional errors (UE) [19–21], because these errors shift threshold voltage levels to either the positive or negative side [22], changing the circuit node logic from “0” to “1” or from “1” to “0”, but not both at the same time.

Many all unidirectional error detection (AUED) schemes have been proposed and implemented, among which the Berger code technique [23] is agreed to be the least redundant. With the ability to detect single- as well as multiple-bit unidirectional errors, this technique provides error detection by simply summing the logic 0’s (a B0 scheme) or 1’s (a B1 scheme) in the information word and expressing the sum in binary. If the information word contains $n$ bits, then a Berger code requires $\lceil \log_2(n + 1) \rceil$ check bits. A Berger code checker employs a 0’s (or 1’s) counter circuitry for reencoding the information word to check bits and then compares it with the preencoded check bits using a two-rail checker [23]. A chain of adders and a tree of two-rail checkers are required to design these checker circuits [23], where area and latency increase drastically as data length increases.
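
As a minimal sketch of the B0 encoding just described, the following Python fragment counts the 0’s of an information word and writes the count in $\lceil \log_2(n + 1) \rceil$ bits. The function names and the MSB-first bit order are illustrative assumptions; a hardware checker would realize the same function with a counter and a two-rail checker rather than in software.

```python
def berger_check_bits(n: int) -> int:
    """Number of Berger check bits for an n-bit word: ceil(log2(n + 1)).

    For n >= 1 this equals n.bit_length(), which avoids floating point.
    """
    return n.bit_length()


def berger_b0(bits: list[int]) -> list[int]:
    """B0 check symbol: the number of 0's in `bits`, in binary, MSB first."""
    width = berger_check_bits(len(bits))
    zeros = bits.count(0)
    return [(zeros >> i) & 1 for i in reversed(range(width))]


def berger_ok(bits: list[int], check: list[int]) -> bool:
    """Reencode the information word and compare with the stored check bits."""
    return berger_b0(bits) == list(check)


word = [1, 0, 1, 1, 0, 0, 1, 0]        # 8 information bits containing four 0's
check = berger_b0(word)                # four 0's, written in 4 bits
corrupted = [1, 0, 0, 0, 0, 0, 1, 0]   # two unidirectional 1 -> 0 errors

print(check, berger_ok(word, check), berger_ok(corrupted, check))
```

A 1 → 0 error can only raise the number of 0’s in the information word, so the reencoded count no longer matches the stored check symbol.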

An m-out-of-n code is one in which all valid code words have exactly “m” 1’s and “n − m” 0’s. These codes can also detect all unidirectional errors when n = 2m. This condition increases not only the code size but also the checker’s area. Cellular realization of an m-out-of-2m code circuit was deemed by Lala [24] as more area- and delay-efficient than the previous implementations.
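
The membership test for an m-out-of-2m code can be sketched in a few lines of Python. The function name and the example words are hypothetical; the point is only that a unidirectional error changes the number of 1’s, so the corrupted word leaves the code.

```python
def is_m_out_of_2m(word: list[int]) -> bool:
    """True if the word has even length 2m and exactly m of its bits are 1."""
    n = len(word)
    return n % 2 == 0 and sum(word) == n // 2


valid = [1, 0, 1, 0, 0, 1]          # 3-out-of-6 code word
after_1_to_0 = [1, 0, 0, 0, 0, 0]   # only 1 -> 0 errors: weight drops to 1
after_0_to_1 = [1, 1, 1, 0, 1, 1]   # only 0 -> 1 errors: weight rises to 5

print(is_m_out_of_2m(valid), is_m_out_of_2m(after_1_to_0), is_m_out_of_2m(after_0_to_1))
```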

Given the importance of fault tolerance at the HW level in big data and cloud computing applications, in this paper we present a fault secure (FS) SEDC checker used with SEDC codes [25]. An FS checker has the ability to safely hide or self-check (detect) its own faults as they occur in its circuitry. The SEDC partitions the input data into smaller segments (2, 3, and 4 bits) and encodes them in parallel. This unique scaling feature makes the system faster and less complex to design for any binary data length. The FS SEDC checker inherits all these features of SEDC codes (i.e., simple scalability, constant latency, and less power dissipation), which suits its implementation in online fault detection in processors, cache memories, and NAND Flash-based memories for big data applications. The major contributions of this paper are as follows:

(1) We propose HW-level fault tolerance for circuits designed to process big data and cloud computing applications.

(2) In order to show the effectiveness of the proposed HW-level fault tolerance scheme in a big data scenario, we compare the cost incurred with and without the proposed fault tolerance scheme and present results that show a significant reduction in the overall cost of fault tolerance in big data when the proposed HW-based fault tolerance scheme is applied.

(3) We also present a novel FS SEDC checker for use with SEDC-based HW-level fault tolerance systems.

(4) In order to prove the superiority of the presented FS SEDC checker over state-of-the-art AUED checkers, we show that the FS SEDC checker achieves state-of-the-art performance in terms of area, delay, and power dissipation.

The rest of the paper is organized as follows. We present an overall system diagram of the proposed HW-level fault tolerance system in Section 2. We give a brief mathematical foundation of the SEDC scheme and an example of encoding logical circuits using SEDC in Section 3. Design details of the FS SEDC checker are described in Section 4. The proposed checker is shown to be FS through fault testing methods, and its area, delay, and power comparison with the state of the art are derived in Section 5. We compute the fault coverage of the proposed SEDC-based fault tolerance system and present the experimental details and results in Section 5. To show the effectiveness of the proposed method in big data and cloud computation applications, we also perform a cost-performance analysis of fault tolerance at the SW level versus the HW level in Section 5. Finally, we conclude the paper in Section 6.

Figure 1: Block diagram of the proposed hardware-level fault tolerance system. The functional circuit comprises an information symbol generator and a check symbol generator, both driven by inputs D; the information bits G(D) and the check bits S feed the checker, which produces the error indication signal V.

2. Introduction to the Overall System

Figure 1 shows the main components of an error-detecting-code-based HW-level fault tolerance system. The functional circuit consists of two subcircuits: an information symbol generator (ISG) and a check symbol generator (CSG). These two circuits do not share any logic. The ISG takes input $D$, performs some operation $G$, and produces output $G(D)$. The CSG is a carefully chosen logic function that acts as the encoder and generates check bits $S$ from the same input $D$ such that $S = \phi(G(D))$, where $\phi$ denotes the particular coding function. The checker normally contains another encoder that reencodes the information bits $G(D)$ into $S' = \phi(G(D))$ and then compares $S$ and $S'$. A mismatch between $S$ and $S'$ is treated as an error, which is indicated by the error indication, or verification, signal $V$.

The checker shown in Figure 1 plays a vital role in the overall fault tolerance system. The checker must exhibit a self-checking property or failsafe property to make sure that the whole system is fault secure (FS). If the checker is both self-checking and failsafe, the overall system is said to be totally self-checking (TSC). In order to formally define these properties, let the output of the functional circuit shown in Figure 1 be represented by $G(D) = G(x, f)$, where $x$ is the input and $f$ is the fault; in fault-free operation, i.e., $f = 0$, the output is $G(x, 0)$. Also consider the input code space $D \subseteq X$, the output code space $S \subseteq Y$, and an assumed fault set $F$. Then, according to the definition of totally self-checking (TSC), $G$ is

(1) self-testing if, for each fault $f$ in $F$, there exists at least one code input $d \in D$ that produces a noncode output, i.e., $\forall f \in F,\ \exists d \in D \ni G(d, f) \notin S$;

(2) fault secure (FS) if, for all faults $f$ in $F$ and all code inputs $d \in D$, the output is either correct or a noncode word, i.e., $\forall f \in F$ and $\forall d \in D$, $G(d, f) = G(d, 0)$ or $G(d, f) \notin S$.
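
The two definitions can be checked by brute force on small circuits. The sketch below is only illustrative: the toy circuits (a dual-rail buffer and a plain buffer), the fault encoding as `(rail, stuck_value)` pairs, and the use of `None` for the fault-free case (written $f = 0$ in the text) are all assumptions made for this example, not part of the proposed design.

```python
def is_self_testing(G, faults, code_inputs, code_outputs):
    """Every fault produces a noncode output for at least one code input."""
    return all(
        any(G(d, f) not in code_outputs for d in code_inputs)
        for f in faults
    )


def is_fault_secure(G, faults, code_inputs, code_outputs):
    """Under any fault, every code input gives the correct output or a noncode word."""
    return all(
        G(d, f) == G(d, None) or G(d, f) not in code_outputs
        for f in faults
        for d in code_inputs
    )


def dual_rail(d, fault=None):
    """Toy circuit: outputs (d, not d); a fault sticks one rail at 0 or 1."""
    out = [d, 1 - d]
    if fault is not None:
        rail, value = fault
        out[rail] = value
    return tuple(out)


def plain_buffer(d, fault=None):
    """Toy circuit without redundancy: a single wire that may be stuck."""
    return (d,) if fault is None else (fault[1],)


DUAL_FAULTS = [(rail, value) for rail in (0, 1) for value in (0, 1)]
DUAL_CODE = {(0, 1), (1, 0)}
PLAIN_FAULTS = [(0, 0), (0, 1)]
PLAIN_CODE = {(0,), (1,)}

print(is_self_testing(dual_rail, DUAL_FAULTS, [0, 1], DUAL_CODE),
      is_fault_secure(dual_rail, DUAL_FAULTS, [0, 1], DUAL_CODE),
      is_fault_secure(plain_buffer, PLAIN_FAULTS, [0, 1], PLAIN_CODE))
```

The dual-rail buffer satisfies both properties for single stuck-at faults, whereas the plain buffer is not fault secure, because a stuck wire yields a wrong output that is still a valid code word.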

In the proposed SEDC-based HW-level fault tolerance system, the CSG circuit is realized by an SEDC check symbol generator (SCSG) circuit, which generates the SEDC code words corresponding to the information bits $G(D)$. We presented a realization of an SEDC encoded SCSG circuit in [27], i.e., an SEDC encoded arithmetic logic unit (ALU) of a microprocessor. The SEDC encoded ALU circuit (SCSG) computes the SEDC codes corresponding to the output of the ISG (in [27], a normal ALU). Any fault that causes multiple unidirectional errors at the output of the normal ALU is detected by the SEDC checker. Any logic circuitry, including SRAM-based memory cells [28], can be made fault tolerant by encoding it in a manner similar to the methods given in [27, 28]. In the next section, we briefly introduce the SEDC scheme with an example of encoding an adder circuit, while in the rest of the paper we focus on the proposed FS SEDC checker, which can be used with any SEDC-based HW-level fault tolerance system.

Figure 2: (a) SEDC scheme for a given data word and (b) 2D illustration of the SEDC2 scheme. In (a), the data word $(D_{n-1}, \ldots, D_0)$ is split into one $b$-bit segment and $a$ repetitions of a 3-bit segment, which are encoded by one SEDCb module and $a$ SEDC3 modules to form the code word $(S_{m-1}, \ldots, S_0)$.

3. Scalable Error Detection Coding (SEDC) Scheme

The Scalable Error Detection Coding scheme [25] is an AUED scheme formulated and designed in such a way that only the resultant circuit area is scaled, while its latency depends on a small portion of the input data (explained later).

For any binary data $D$ of length $n$ bits, represented as $(D_{n-1}, \ldots, D_2, D_1, D_0)$ with $D_i \in \{0, 1\}$ for $0 \le i \le n - 1$, two parameters $a$ and $b$ are computed using

\[ a = \frac{n - \max(b)}{3}, \tag{1} \]

where parameter $a$ can only take a positive integer value, i.e., $a \in \mathbb{Z}^{+}$, and parameter $b \in \{2, 3, 4\}$. Satisfying the condition for parameter $a$, the maximum possible value for parameter $b$ is selected. The SEDC code word $S$ is represented as $(S_{m-1}, \ldots, S_j, \ldots, S_2, S_1, S_0)$ with $S_j \in \{0, 1\}$ for $0 \le j \le m - 1$, where $m$ denotes the length of the SEDC code word and is computed by

\[ m = \lceil \log_2(n + 1 - 3a) \rceil + 2a. \tag{2} \]

After computing the values for parameters $a$ and $b$, the SEDC code $S$ for binary data $D$ is computed. SEDC is designed to generate codes basically for 2-, 3-, and 4-bit data and is accordingly referred to as the SEDC2, SEDC3, and SEDC4 scheme, respectively. It is then extended for any integer value of $n$, as shown in Figure 2(a).
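
Equations (1) and (2) translate into a short parameter calculation. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and words shorter than 5 bits are rejected because (1) requires $a \ge 1$ (such words are covered directly by SEDC2, SEDC3, or SEDC4).

```python
def sedc_parameters(n: int) -> tuple[int, int, int]:
    """Return (a, b, m) for an n-bit data word, following (1) and (2)."""
    if n < 5:
        raise ValueError("(1) needs a >= 1, i.e. n >= 5; use SEDC2/3/4 directly")
    # Try the largest b first; exactly one b in {4, 3, 2} makes (n - b) a multiple of 3.
    for b in (4, 3, 2):
        if (n - b) % 3 == 0:
            a = (n - b) // 3
            break
    # ceil(log2(x)) for an integer x >= 1 equals (x - 1).bit_length().
    m = (n + 1 - 3 * a - 1).bit_length() + 2 * a
    return a, b, m


for n in (5, 6, 7, 8, 32):
    print(n, sedc_parameters(n))
```

For $n = 8$ this gives $a = 2$, $b = 2$, and $m = 6$, and for the 5-bit output of a 4-bit adder it gives $a = 1$, $b = 2$, and $m = 4$, matching the examples used later in the paper.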

3.1. SEDC2 Code. A two-dimensional (2D) illustration of a 2-bit SEDC (SEDC2) scheme is shown in Figure 2(b), where nodes represent data words and their corresponding code words are written in brackets.

The SEDC coding scheme assigns code words to different data words with a unique criterion. Whenever a bit (or bits) in a data word changes from “1” → “0”, as shown with a bold arrow in Figure 2(b), the change is reflected in the code word in the opposite way, i.e., the code changes from “0” → “1”, as shown with the dashed arrow in Figure 2(b), and vice versa. Equation (3) is used to assign 2-bit code words $S_1 S_0$ to the 2-bit data words $D_1 D_0$. Clearly, we can interchange the bit positions of $S_1$ and $S_0$ for another variant of SEDC2 codes. This will not affect the code characteristics.

\[ [S_1\ S_0] = \mathrm{SEDC2}(D_1, D_0) = [\mathrm{XNOR}(D_1, D_0)\ \ \mathrm{NAND}(D_1, D_0)] \tag{3} \]

In (3), $[S_1\ S_0]$ represents the concatenated SEDC code bits, XNOR and NAND are the logical operations, and SEDC2 is the basic coding scheme.

3.2. SEDC3 Code. The SEDC3 code for 3-bit data is computed using (4) as follows:

\[ [S_1\ S_0] = \mathrm{SEDC3}(D_2, D_1, D_0) = \begin{cases} \mathrm{SEDC2}(D_1, D_0), & \text{if } D_2 = 0, \\ \overline{\mathrm{SEDC2}(\overline{D_1}, \overline{D_0})}, & \text{if } D_2 = 1, \end{cases} \tag{4} \]

where the bar sign (e.g., $\overline{D_1}$) in (4) represents the logical NOT operation.

Figure 3 shows a 3D cube illustrating the unidirectional error detection mechanism of SEDC3 codes. The same notations are used in Figure 3 as in Figure 2(b). The dashed side of the cube represents the embedded SEDC2 coding scheme in SEDC3. Note that when there is a 2-bit unidirectional change in the data word “001” → “111” (the two MSBs changing from “00” → “11”), the code changes in the opposite direction (the least significant bit of the code changes from “1” → “0”). In a similar way, the SEDCn scheme detects $n$-bit, or all, unidirectional errors in the data word $D$.
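
The SEDC2 and SEDC3 encoders can be sketched directly from (3) and (4). One assumption is made explicit here: for $D_2 = 1$, both the inputs and the resulting SEDC2 code are complemented, which is the reading consistent with the XOR-based SEDC3 checker of Section 4.3 and with the “001” → “111” example above. Function names are illustrative.

```python
def sedc2(d1: int, d0: int) -> tuple[int, int]:
    """Equation (3): [S1, S0] = [XNOR(D1, D0), NAND(D1, D0)]."""
    return (1 - (d1 ^ d0), 1 - (d1 & d0))


def sedc3(d2: int, d1: int, d0: int) -> tuple[int, int]:
    """Equation (4), assuming inputs and code are both complemented when D2 = 1."""
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)


# Print the full SEDC3 code table.
for value in range(8):
    d2, d1, d0 = (value >> 2) & 1, (value >> 1) & 1, value & 1
    print(f"{d2}{d1}{d0} ->", sedc3(d2, d1, d0))
```

Under this reading, swapping $S_1$ and $S_0$ turns every SEDC3 code word into the binary count of 0’s in the data word, i.e., the 3-bit B0 Berger code.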


3.3. SEDC4 Code. An SEDC4 code for 4-bit data is formulated by (5) as follows:

\[ [S_2\ (S_1\ S_0)] = \mathrm{SEDC4}(D_3, D_2, D_1, D_0) = [\overline{D_3}\ \ \mathrm{SEDC3}(D_2, D_1, D_0)] \tag{5} \]

The MSB of the code word is completely dependent upon the MSB of the data word for SEDC4; hence any change in the MSB of the data word is detected. The remaining three data bits are encoded using the same SEDC3 scheme.

It can be observed from (3), (4), and (5) that SEDC2 is embedded in 3-bit SEDC (SEDC3) and consequently in 4-bit SEDC (SEDC4) to detect all unidirectional errors in 3-bit and 4-bit data, as shown later. This ability to scale codes is not present in any other concurrent error detecting (CED) coding scheme.

In general, for SEDCn, the $n$-bit binary data is grouped into one $b$-bit segment and $a$ 3-bit segments, and then these segments are encoded using one SEDCb module and $a$ SEDC3 modules in parallel, as shown in Figure 2(a). It is noteworthy that each group of data segments and corresponding code segments is independent of the others. This independence makes our scheme scalable and able to detect some portion of bidirectional errors (BE) (discussed in Section 5.3).
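
The segmentation can be sketched as a small software encoder together with a brute-force check of the all-unidirectional-error-detecting property. Several points are assumptions of this sketch rather than statements from the paper: the function names, the MSB-first bit order with the $b$-bit segment taken from the most significant bits (as drawn in Figure 2(a)), the complement placement in SEDC3, and $S_2$ being the complement of $D_3$ (the reading consistent with Table 1, where the valid SEDC1 words are “01” and “10”).

```python
from itertools import product


def sedc2(d1, d0):
    """Equation (3): [S1, S0] = [XNOR(D1, D0), NAND(D1, D0)]."""
    return (1 - (d1 ^ d0), 1 - (d1 & d0))


def sedc3(d2, d1, d0):
    """Equation (4), assuming inputs and code are both complemented when D2 = 1."""
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)


def sedc4(d3, d2, d1, d0):
    """Equation (5), assuming S2 is the complement of D3."""
    return (1 - d3,) + sedc3(d2, d1, d0)


def sedc_encode(bits):
    """Encode an MSB-first tuple of n >= 2 bits: one b-bit segment, then 3-bit segments."""
    n = len(bits)
    if n < 2:
        raise ValueError("this sketch covers n >= 2")
    # Largest b in {4, 3, 2} that leaves a whole number of 3-bit segments.
    b = next(b for b in (4, 3, 2) if b <= n and (n - b) % 3 == 0)
    code = {2: sedc2, 3: sedc3, 4: sedc4}[b](*bits[:b])
    for i in range(b, n, 3):
        code += sedc3(*bits[i:i + 3])
    return tuple(code)


def is_unordered(n):
    """True if no (data, code) word covers another, i.e. the code is AUED."""
    words = [bits + sedc_encode(bits) for bits in product((0, 1), repeat=n)]
    return not any(
        u != v and all(x <= y for x, y in zip(u, v))
        for u in words
        for v in words
    )


print(sedc_encode((1, 0, 1, 1, 0, 0, 1, 0)))
print(all(is_unordered(n) for n in range(2, 9)))
```

A code detects all unidirectional errors exactly when no code word covers another, which is what `is_unordered` tests exhaustively for small $n$.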

If we interchange $S_1$ and $S_0$ for SEDC3 in Figure 3, the corresponding SEDC3 code is equal to the Berger code for a 3-bit segment, but our way of deriving the SEDC3 code is quite different from that of Berger codes. SEDC3 codes are basically scaled from SEDC2 codes, and SEDC2 codes have no commonality with 2-bit Berger codes.

3.4. SEDC-Based HW-Level Fault Tolerance System Example. In order to illustrate the design of a HW-level fault tolerance system using the SEDC scheme, we take the example of a 4-bit adder. Suppose that this 4-bit adder is part of a processor that processes big data applications and that we want to make it fault tolerant against transient errors that arise in its circuitry; the general HW-level fault tolerance system diagram shown in Figure 1 is then converted to the one shown in Figure 4. As shown in Figure 4, the 4-bit adder acts as an ISG and its equivalent SEDC encoder acts as a CSG. The SEDC encoder, or CSG, can be implemented using (6) as follows:

\[ [S_3 \ldots S_0] = \mathrm{SEDC}(A[3{:}0] + B[3{:}0] + C_{in}) \tag{6} \]

As the output of the 4-bit adder is a 5-bit value, the equivalent SEDC code is a 4-bit value according to (2). We used Altera’s Quartus II software to synthesize the 4-bit adder (ISG), the SEDC encoder (CSG), and the SEDC checker shown in Figure 4 and utilized the synthesized circuit for computing the fault coverage of the SEDC scheme, which is presented in Section 5.3. In the next section, we present the proposed FS SEDC checker, which completes the overall proposed SEDC-based HW-level fault tolerance system.
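
A behavioural sketch of this adder example follows. It is not the synthesized circuit: the ISG and the SCSG are both modelled by plain integer addition (in hardware they share no logic), the checker is modelled as “reencode and compare”, the complement placement in SEDC3 is the assumed reading of (4), and all function names are hypothetical. The sketch injects every unidirectional error into the 5-bit adder output and counts how many the check bits expose.

```python
def sedc2(d1, d0):
    return (1 - (d1 ^ d0), 1 - (d1 & d0))            # [XNOR, NAND], as in (3)


def sedc3(d2, d1, d0):
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)                   # assumed reading of (4)
    return (1 - s1, 1 - s0)


def sedc5(bits):
    """n = 5 gives a = 1, b = 2: SEDC2 on the two MSBs, SEDC3 on the three LSBs."""
    return sedc2(*bits[:2]) + sedc3(*bits[2:])


def to_bits(value, width):
    """MSB-first tuple of `width` bits."""
    return tuple((value >> i) & 1 for i in reversed(range(width)))


def isg(a, b, cin):
    """Information symbol generator: the 5-bit adder output."""
    return to_bits(a + b + cin, 5)


def scsg(a, b, cin):
    """Behavioural stand-in for the SEDC encoded adder of (6)."""
    return sedc5(to_bits(a + b + cin, 5))


def checker_ok(info, check):
    """Behavioural checker: reencode the information bits and compare."""
    return sedc5(info) == check


def unidirectional_errors(bits):
    """Yield every word obtained by flipping only 1's to 0, or only 0's to 1."""
    for target in (0, 1):
        positions = [i for i, bit in enumerate(bits) if bit != target]
        for mask in range(1, 2 ** len(positions)):
            word = list(bits)
            for k, pos in enumerate(positions):
                if (mask >> k) & 1:
                    word[pos] = target
            yield tuple(word)


def coverage():
    """Inject every unidirectional error into every adder output."""
    injected = detected = 0
    for a in range(16):
        for b in range(16):
            for cin in (0, 1):
                check = scsg(a, b, cin)
                for bad in unidirectional_errors(isg(a, b, cin)):
                    injected += 1
                    detected += not checker_ok(bad, check)
    return injected, detected


injected, detected = coverage()
print(scsg(9, 5, 1), injected == detected)
```

This only exercises errors confined to the information bits; the fault coverage reported in Section 5.3 is obtained from the synthesized circuit, not from a model like this one.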

Table 1: Code table for FS SEDC1 checker.

G0   S0   V1   V0
0    0    1    1
0    1    1    0
1    0    1    0
1    1    0    0

4. The FS SEDC Checker

As shown in Figure 4, the FS SEDC checker takes $n$ information bits and $m$ SEDC check bits from the functional unit. The FS SEDC checker is also composed of one $b$-bit FS SEDC checker and $a$ sets of 3-bit FS SEDC checkers. With 1-, 2-, and 3-bit FS SEDC checkers, the output can be directly used as an error indication signal, but for $n > 3$, one level of wired-AND-OR logic gates is used to combine all the outputs of the subblocks of FS SEDC checkers and generate the 2-bit error indication signal. The following subsections discuss logic and circuit diagrams for the primitive FS SEDC checkers (SEDC1, SEDC2, SEDC3, and SEDC4 checkers), which can be used to scale the SEDC checker to an $n$-bit FS SEDC checker (i.e., an FS SEDCn checker).

4.1. The FS SEDC1 Checker. Table 1 shows the logic for a 1-bit SEDC (FS SEDC1) checker. The valid input code words are “10” and “01”, and the valid output code word is “10”. $G_0$ denotes the 1-bit information word that is the output of the ISG, and $S_0$ denotes the 1-bit SEDC check bit generated by the SEDC check symbol generator (SCSG). $V_1 V_0$ is the 2-bit error indication signal of the FS SEDC1 checker. The $V_1$ and $V_0$ signals are generated by the circuits shown in Figure 5(a).
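
Table 1 can be reproduced by a two-gate behavioural model. The NAND/NOR reading below is an observation that matches the four rows of the table; it is not taken from the transistor-level circuit of Figure 5(a), and the names are illustrative.

```python
# Rows of Table 1: (G0, S0) -> (V1, V0)
TABLE_1 = {
    (0, 0): (1, 1),
    (0, 1): (1, 0),
    (1, 0): (1, 0),
    (1, 1): (0, 0),
}


def sedc1_checker(g0: int, s0: int) -> tuple[int, int]:
    """Behavioural FS SEDC1 checker: V1 = NAND(G0, S0), V0 = NOR(G0, S0)."""
    v1 = 1 - (g0 & s0)
    v0 = 1 - (g0 | s0)
    return (v1, v0)


for inputs, expected in TABLE_1.items():
    print(inputs, sedc1_checker(*inputs), expected)
```

Only the valid input words “01” and “10” produce the valid output “10”; the two invalid inputs produce “11” and “00”.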

4.2. The FS SEDC2 Checker.

\[ [V_1\ V_0] = [\, S_1 (G_1 + G_0)(S_0 + G_1 G_0) \ \ \ (G_1 + G_0 + S_0)(S_1 + G_1 G_0 S_0) \,] \tag{7} \]

In Figure 5, the symbols P1–P13 and N1–N13 represent the PMOS and NMOS transistors, respectively, and Vss represents the voltage supply. For simplicity, we used a CMOS-based implementation of the SEDC checker circuits. Any other technology can be used to design these circuits, but the underlying algorithm, i.e., SEDC, will remain the same.

4.3. The FS SEDC3 Checker. Figure 6(a) shows the block diagram and the logic for a 3-bit FS SEDC checker. Three-bit data $G_2 G_1 G_0$ from the ISG and 2-bit SEDC check bits $S_1 S_0$ from the SCSG are first converted to $G_1' G_0'$ and $S_1' S_0'$, respectively, and then are checked using the same 2-bit FS SEDC checker, as shown in Figure 6(a). When the $G_2$ bit is “1”, $G_1 G_0$ and $S_1 S_0$ are inverted, whereas if $G_2$ is “0”, then $G_1 G_0$ and $S_1 S_0$ remain the same. As the outputs of the XOR gates are fed to the FS SEDC2 checker, any error in the XOR gates is detected. This makes the overall 3-bit SEDC checker FS.
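
The XOR-based conversion can be sketched behaviourally. In this illustration the 2-bit checker is a stand-in that simply reencodes and compares; the `ERROR` output value `(1, 1)` is a placeholder for “any word other than 10” and does not model the actual noncode outputs of the circuit in Figure 5(b). The cross-check against `sedc3` uses the assumed reading of (4) in which inputs and code are both complemented when $D_2 = 1$.

```python
from itertools import product

VALID = (1, 0)   # valid checker output, as stated for the FS SEDC checkers
ERROR = (1, 1)   # placeholder for any non-"10" output (assumption of this sketch)


def sedc2(d1, d0):
    return (1 - (d1 ^ d0), 1 - (d1 & d0))            # [XNOR, NAND], as in (3)


def sedc3(d2, d1, d0):
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)                   # assumed reading of (4)
    return (1 - s1, 1 - s0)


def sedc2_checker(g1, g0, s1, s0):
    """Behavioural stand-in for the FS SEDC2 checker: reencode and compare."""
    return VALID if sedc2(g1, g0) == (s1, s0) else ERROR


def sedc3_checker(g2, g1, g0, s1, s0):
    """XOR G1, G0, S1, S0 with G2, then reuse the 2-bit checker."""
    return sedc2_checker(g1 ^ g2, g0 ^ g2, s1 ^ g2, s0 ^ g2)


# The checker accepts exactly the SEDC3 code words.
accepted = [w for w in product((0, 1), repeat=5) if sedc3_checker(*w) == VALID]
print(len(accepted))
```

Reusing the 2-bit checker this way means that the SEDC3 checker adds only the four XOR gates to the SEDC2 checker.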

Figure 3: 3D illustration of the SEDC3 scheme.

Figure 4: Example of an SEDC-based HW-level fault tolerance system. The inputs A[3:0], B[3:0], and Cin drive both the 4-bit adder (ISG), which outputs A[3:0] + B[3:0] + Cin, and the SEDC encoded 4-bit adder (SCSG), which outputs the check bits S = SEDC(A[3:0] + B[3:0] + Cin); the FS SEDC checker takes the adder output and the check bits and produces the error indication signal V.

4.4. The FS SEDC4 Checker. A 4-bit FS SEDC checker consists of one FS SEDC1 checker and one FS SEDC3 checker, as shown in Figure 6(b). Both SEDC1 and SEDC3 checkers generate a 2-bit output $V_1 V_0$. Because the valid code word is “10”, to make sure that both checker units generate the “10” output during error-free operation, we “AND” the $V_1$ output bit of the FS SEDC1 checker with the $V_1$ output bit of the FS SEDC3 checker. Also, we “OR” the $V_0$ output bits of both FS SEDC checkers using wired logic gates. We checked and confirmed by fault simulation that wired-AND and wired-OR gates are also FS for single faults (stuck-at-0, stuck-at-1, transistor-stuck-on, and transistor-stuck-off).
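
The combining rule can be sketched as a single function over the 2-bit outputs of the sub-checkers. The function name is hypothetical, and the sketch models only the logic of the wired-AND and wired-OR gates, not their electrical behaviour.

```python
def combine(outputs: list[tuple[int, int]]) -> tuple[int, int]:
    """AND all V1 bits and OR all V0 bits of the sub-checker outputs."""
    v1 = int(all(v1 for v1, _ in outputs))
    v0 = int(any(v0 for _, v0 in outputs))
    return (v1, v0)


print(combine([(1, 0), (1, 0)]))   # both sub-checkers report the valid word
print(combine([(1, 0), (0, 0)]))   # one sub-checker reports "00"
print(combine([(1, 1), (1, 0)]))   # one sub-checker reports "11"
```

The combined output is “10” only when every sub-checker outputs “10”, so an error flagged by any sub-checker propagates to the final indication signal.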

4.5. The FS SEDCn Checker. Like the SEDC code generator, the FS SEDC checker also consists of multiple 1-, 2-, and 3-bit FS SEDC checkers, depending upon the values of $a$ and $b$ from (1). For example, if $n = 8$ bits, then (1) ⇒ $a = 2$ and $b = 2$. This requires one FS SEDC2 checker and two FS SEDC3 checkers to realize an 8-bit FS SEDC checker.

The area of the wired-AND-OR gates also increases as $n$ is increased. Figure 7 shows the block diagram of an $n$-bit FS SEDC checker. For $n = 8$ bits, there are three FS SEDC checkers in total, each with a 2-bit output; hence, a 3-input wired-AND gate and a 3-input wired-OR gate are required to combine all V1 and V0 bits. In general, for an $n$-bit input, there are “$a + 1$” FS SEDC checkers, each with a 2-bit output, so the wired-AND and wired-OR gates together require “$k = 2 \times (a + 1)$” inputs. With each additional input to the wired-AND-OR network, one extra transistor is required by each of the wired gates. This causes the circuit to expand width-wise; hence, the latency of the wired logic remains constant for any value of $n$.

The size of the load transistor driving these wired-AND and wired-OR gates also increases with the number of inputs, so we limit the maximum fan-in of one gate to 4. For $k > 4$, an extra load transistor is connected in parallel. Generally, for $k$ inputs, we require $r = \lceil k/4 \rceil$ load transistors. A total of $k + r$ transistors is required to design the $k$-input wired AND-OR network, with a constant latency of 1 transistor.
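The sizing rules above can be sketched as a small calculation. Equation (1) lies outside this section, so the split $n = 3a + b$ with $b \in \{2, 3, 4\}$ used here is an assumption inferred from the worked examples (for instance $n = 8$ gives $a = 2$, $b = 2$); the function names are hypothetical, and $k$ and $r$ follow the formulas in the text literally.

```python
from math import ceil


def sedc_partition(n):
    """Split n bits into `a` 3-bit segments plus one b-bit segment.

    Assumed form of Eq. (1): n = 3a + b with b in {2, 3, 4}.
    """
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    b = (n - 2) % 3 + 2
    a = (n - b) // 3
    return a, b


def wired_network(n):
    """Size of the wired AND-OR network for an n-bit checker, per the text."""
    a, _ = sedc_partition(n)
    k = 2 * (a + 1)   # total inputs to the wired-AND and wired-OR gates
    r = ceil(k / 4)   # load transistors, with a maximum fan-in of 4
    return {"checkers": a + 1, "k": k, "r": r, "transistors": k + r}


print(sedc_partition(8))   # (2, 2)
print(wired_network(8))
```

For $n = 8$ this gives three sub-checkers, $k = 6$, and $r = 2$, matching the 8-bit example in the text.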

5. Experiments and Results

In this section, we present the experiments we conducted on the proposed FS SEDC checker and on the overall proposed SEDC-based HW-level fault tolerance system. The results of each experiment are given along with the experimental details in the subsections below.

5.1. Fault Test on FS SEDC Checker. The FS SEDC1, SEDC2, SEDC3, and SEDC4 circuits in our paper were tested for stuck-at-0, stuck-at-1, transistor-stuck-ON, and transistor-stuck-OFF faults. We assume a single-fault model, where faults occur one at a time and there is enough time between the detection of the first fault and the occurrence of another fault [29]. In Table 2, we provide a summary of the fault analysis of an SEDC1 checker circuit. We applied one fault at a time in


Figure 5: CMOS-based circuits of FS (a) SEDC1 checker and (b) SEDC2 checker.

Figure 6: Block diagram of FS (a) SEDC3 checker and (b) SEDC4 checker.

the circuit of Figure 5(a) and observed the output. In single-fault operation, the circuit either produced the correct output or signalled the fault with a non-code-word output (exhibiting the FS property), as shown in Table 2.

Case 1 (transistor stuck ON). In Table 2, we show all six cases of transistor stuck-ON faults (one at a time). For the cases with N3 or N4 stuck ON, the circuit detects the fault for one input code combination (marked with the ‰ symbol), and hence the circuit is self-testing, whereas the other cases show that the circuit is fault secure as well as code disjoint.

Case 2 (transistor stuck OFF). In Table 2, all six cases of transistor stuck-OFF faults are shown. In the cases where N1 or N2 is stuck OFF, the circuit demonstrates the self-testing property (marked with the ‰ symbol), and in the rest of the cases the circuit is fault secure.

Case 3 (input stuck at 0). When input G0 or S0 is stuck at 0, the circuit demonstrates the self-testing property; otherwise, it remains fault secure.

Case 4 (input stuck at 1). When input G0 or S0 is stuck at 1, the circuit shows the self-testing property; otherwise, it remains fault secure.

There is one case where the output becomes floating (i.e., P3 or P4 stuck OFF). In that case, if we consider the floating voltage as logic high, then the circuit is fault secure, and if we consider the floating voltage as logic low, then the circuit is self-testing. Hence, we can say that the circuit in Figure 5(a), which is a 1-bit SEDC checker, is FS. A similar analysis was carried out when testing the 2-, 3-, and 4-bit SEDC checkers, and we found that all these checkers are FS.

5.2. Area, Delay, and Power Comparison. In this section, we compare the area, delay, and power dissipation of TSC Berger, FS SEDC, and m-out-of-2m code checkers. We use the two possible TSC Berger checker implementations from Piestrak et al. [23] and Pierce Jr. and Lala [26], together with the m-out-of-2m code checker from Lala [24], for comparison. For the sake of fairness, the area overhead was measured in terms of the number of equivalent


Figure 7: Block diagram of FS SEDCn checker.

Table 2: Results of single faults on FS SEDC1 checker.

| Stuck-ON fault | G0 S0 V1 V0 | Stuck-OFF fault | G0 S0 V1 V0 | Input fault | G0 S0 V1 V0 |
|---|---|---|---|---|---|
| MOS P1 or P2 | 0 1 1 0 | MOS P1 or P2 | 0 1 1 0 | Input C0 stuck at zero | ‰ 0 0 1 1 |
| | 1 0 1 0 | | 1 0 1 0 | | 1 0 1 0 |
| MOS P3 or P4 | 0 1 1 0 | MOS P3 or P4 | 0 1 Floating 0 | Input F0 stuck at zero | ‰ 0 0 1 1 |
| | 1 0 1 0 | | 1 0 1 0 | | 0 1 1 0 |
| Transistor N1 | 0 1 1 0 | Transistor N1 | 0 1 1 0 | Input C0 stuck at 1 | 0 1 1 0 |
| | 1 0 1 0 | | ‰ 1 0 1 1 | | ‰ 1 1 0 0 |
| Transistor N2 | 0 1 1 0 | Transistor N2 | ‰ 0 1 1 1 | Input F0 stuck at 1 | 1 0 1 0 |
| | 1 0 1 0 | | 1 0 1 0 | | ‰ 1 1 0 0 |
| Transistor N3 | ‰ 0 1 0 0 | Transistor N3 | 0 1 1 0 | - | - - - - |
| | 1 0 1 0 | | 1 0 1 0 | | - - - - |
| Transistor N4 | ‰ 0 1 1 0 | Transistor N4 | 0 1 1 0 | - | - - - - |
| | 1 0 0 0 | | 1 0 1 0 | | - - - - |

‰: The cases where the circuit shows the self-testing property.

transistors. We made use of the assumptions by Smith [30] to translate gate-level circuits to transistor-level circuits.

Before the comparison, we illustrate the functional dissimilarities of the three checkers with the help of Figure 8. Figure 8(a) shows the general block diagram of a TSC Berger code checker. For all the information symbols that the ISG of the functional circuit can produce in normal operation, the check symbol complement generator (CSCG) outputs ($S_B'$) correspond to the bit-by-bit complement of the expected check symbol $S_B$. The TSC two-rail checker validates that each bit of $S_B$ is the complement of the corresponding bit of $S_B'$. As the size of the input data increases, the length of the check symbol $S_B$ also increases, resulting in a longer TSC two-rail checker tree and hence a longer delay.

A general block diagram of a TSC m-out-of-2m code checker is shown in Figure 8(b). The checker takes the information bits and check bits $S_W$ and partitions them into two parts. The numbers of 1's, i.e., the weights of both parts, are mapped to a pair of values which, in binary, belongs to a code, in most cases a two-rail code. The checker consists of a cellular structure of AND-OR gates, as given by Lala [24].

Figure 8(c) depicts the general block diagram of an FS SEDC checker, which resembles the structure of an m-out-of-2m code checker and differs from a Berger code checker. The FS SEDC checker block receives the information and check bits from the functional unit. If the input data length increases, the size of the FS checker block increases width-wise. The FS SEDCn block contains “$a + 1$” pairs of small SEDC checkers (subblocks). Each subblock of the FS SEDC checker produces “10” as the valid code output. The overall SEDC checker has a final 2-bit output $S_{10}$; unlike two-rail


Figure 8: Block diagrams of (a) TSC Berger checker, (b) m-out-of-2m code checker, and (c) FS SEDC checker.

Figure 9: Area comparison of area-optimized Berger [23], SEDC, and m-out-of-2m [24] code checkers (x-axis: data length in bits, from 2 to 32; y-axis: circuit size in number of transistors).

codes, only one of the output combinations, “10”, is considered a valid code word. A nonvalid checker output “00”, “01”, or “11” at output $S_{10}$ indicates the presence of a fault in the functional circuit or in the FS checker itself. The $k$-input wired AND-OR network takes the “$a + 1$” pairs of outputs from the SEDC checker subblocks and converts them into a final 2-bit error indication signal $V_S$.

5.2.1. Area. The area-optimized realization of TSC Berger code checkers in Piestrak et al. [23] showed less area overhead than m-out-of-2m code checkers, which is apparent from Figure 9. But if we consider the delay-optimized implementation of the TSC Berger code checker from Pierce Jr. and Lala [26], we see that the TSC Berger code checker requires more area than the FS SEDC and m-out-of-2m code checkers [24], as shown in Table 3. For clarity, Table 3 lists the area overhead of the code storage and of the code checker separately. Also listed separately are the area overheads of the TRC tree for the TSC Berger code checker and of the wired-AND-OR network for the FS SEDC and m-out-of-2m code checkers.

For a fair comparison, the extra cost of the code storage area is also taken into account. We assumed that 1-bit storage is implemented by 12 MOS transistors [30]. Table 3 lists the area (in terms of the number of transistors) occupied by the FS SEDC, delay-optimized Berger code, and m-out-of-2m code checkers for up to 32-bit data.

The FS SEDCn checker block shown in Figure 8(c) requires fewer gates, implemented with $[26 + (a \times 50)]$ MOS transistors if “$b = 2$”, $[50 + (a \times 50)]$ MOS transistors if “$b = 3$”, and $[58 + (a \times 50)]$ MOS transistors if “$b = 4$”. The m-out-of-2m code checker implementation of Lala [24] requires $2m^2 - 2m + 2$ gates. The gate-level circuit is also translated to transistor-level circuits using data from Smith [30].
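As a quick consistency check, the transistor-count expressions above can be evaluated against the SEDC “Checker Area” column of Table 3. The sketch assumes the partition $n = 3a + b$ with $b \in \{2, 3, 4\}$, which is inferred from the examples in the text because equation (1) is outside this section; the function name is hypothetical.

```python
# Transistors in the single b-bit sub-checker (b = 2, 3, or 4), from the text.
BASE_TRANSISTORS = {2: 26, 3: 50, 4: 58}
SEDC3_TRANSISTORS = 50  # each of the `a` 3-bit sub-checkers


def sedc_checker_area(n):
    """MOS transistor count of the FS SEDCn checker block (wired network excluded)."""
    b = (n - 2) % 3 + 2   # assumed partition n = 3a + b
    a = (n - b) // 3
    return BASE_TRANSISTORS[b] + a * SEDC3_TRANSISTORS


# SEDC "Checker Area" column of Table 3, keyed by data length in bits.
TABLE3_SEDC_CHECKER_AREA = {2: 26, 3: 50, 4: 58, 5: 76, 7: 108,
                            8: 126, 15: 250, 16: 258, 30: 500, 32: 526}

mismatches = {n: (sedc_checker_area(n), area)
              for n, area in TABLE3_SEDC_CHECKER_AREA.items()
              if sedc_checker_area(n) != area}
print(mismatches)  # {} : every table entry is reproduced
```

Under that assumed partition, the three expressions reproduce all ten table entries, including the 108 and 126 transistors quoted for the 7-bit and 8-bit checkers.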

The results show that when scaling a 7-bit 0's counter to an 8-bit 0's counter, 154 extra MOS transistors are required. The m-out-of-2m code checker requires 60 extra MOS transistors when scaling a 7-out-of-14 checker to an 8-out-of-16 checker, whereas the SEDC checker requires only 18 extra MOS transistors. That is because a 7-bit SEDC checker is implemented with one SEDC3 and one SEDC4 circuit, which contain 50 and 58 MOS transistors, respectively (a total of 108 transistors). An 8-bit SEDC checker is implemented using one SEDC2 and two SEDC3 checkers, requiring 26 and 100 (50 × 2) MOS transistors (a total of 126 transistors). This means that SEDC saves 88% of the number of transistors compared to a Berger code checker [26], and it saves 70% of the transistors when


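Table 3: Area overhead (number of transistors) of Berger [26], SEDC, and m-out-of-2m [24] code checkers.

| Data bits | Berger: code storage area | Berger: 1's counter area | Berger: TRC area | Berger: total area | SEDC: code storage area | SEDC: checker area | SEDC: AND-OR network | SEDC: total area | m-out-of-2m: code storage area | m-out-of-2m: checker area | m-out-of-2m: AND-OR network | m-out-of-2m: total area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 24 | 22 | 4 | 50 | 24 | 26 | 0 | 50 | 24 | 36 | 0 | 50 |
| 3 | 24 | 80 | 8 | 112 | 24 | 50 | 0 | 74 | 36 | 152 | 0 | 188 |
| 4 | 36 | 180 | 12 | 228 | 36 | 58 | 6 | 100 | 48 | 240 | 10 | 298 |
| 5 | 36 | 178 | 16 | 230 | 48 | 76 | 6 | 130 | 60 | 300 | 14 | 374 |
| 7 | 36 | 396 | 24 | 456 | 60 | 108 | 8 | 176 | 84 | 420 | 18 | 522 |
| 8 | 48 | 550 | 28 | 626 | 72 | 126 | 8 | 206 | 96 | 480 | 20 | 596 |
| 15 | 48 | 1106 | 56 | 1210 | 120 | 250 | 14 | 384 | 180 | 900 | 38 | 1118 |
| 16 | 60 | 1308 | 60 | 1428 | 132 | 258 | 16 | 406 | 192 | 960 | 40 | 1192 |
| 30 | 60 | 2586 | 116 | 2762 | 240 | 500 | 26 | 766 | 360 | 1800 | 76 | 2236 |
| 32 | 72 | 3048 | 120 | 3240 | 264 | 526 | 28 | 818 | 384 | 1920 | 80 | 2384 |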

compared to m-out-of-2m code checkers. Although the Berger and m-out-of-2m checkers are TSC while the proposed SEDC checker is only FS, all three checkers provide the same fault security.

5.2.2. Delay. As far as delay is concerned, the FS SEDC checker also performs better than the Berger checker and the cellular implementation of an m-out-of-2m code checker, as shown in Table 4. For the sake of uniformity, we designed all the basic gates using the same technology transistors (PMOS = 8μ/2μ, NMOS = 4μ/2μ) and evaluated the worst-case propagation delay of each circuit.

The SEDC checker shows an almost constant delay for $n > 3$ bits due to its parallel implementation, whereas the delay of the Berger code checker increases owing to an increase in gate levels (from 6 to 16) in the critical path, as shown by Pierce Jr. and Lala [26]. The delay of m-out-of-2m code checkers also continues to increase with increasing data lengths, because the cellular implementation requires “$m$ (= input data length)” gate levels in the critical path.

5.2.3. Power Dissipation. In order to evaluate the power dissipation of the three checkers, we used the PowerPlay power analyzer tool. We implemented the Berger [26], m-out-of-2m [24], and SEDC checkers using Verilog and synthesized the circuits using Altera's Quartus II software. We targeted the circuits for a Cyclone II EP2C5AF256A7 chip, which has the lowest power dissipation among the Cyclone family. We allowed the synthesizer to balance area and delay while synthesizing in order to get a better power estimate. We also enabled the synthesis mode that takes intensive steps to optimize power for all three circuits. We clocked the inputs of each circuit with the default toggle rate and estimated the total thermal power dissipation for different values of input data width.

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This happens due to the increase in the number of two-rail checkers in the case of the Berger checker and due to the increase in the checker circuitry itself in the case of the m-out-of-2m checker. This is also evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LEs) occupied by the checkers.

5.3. Fault Coverage of the Proposed HW-Level Fault Tolerance Scheme. In order to demonstrate the effectiveness of the SEDC CSG and its FS checker, we computed the fault coverage of the proposed SEDC-based HW-level fault tolerance scheme. We applied faults to the example circuit of Figure 4 given in Section 3.4. Most VLSI combinational circuits designed for mathematical operations such as addition, subtraction, multiplication, and division consist of multiple instances of 1-bit adders (full adders); hence, the example circuit, i.e., a 4-bit adder, is a simple and good candidate for presenting the effectiveness of our scheme. We injected two major types of transient errors, i.e., stuck-at-0 and stuck-at-1 [29], at 24 nodes (6 nodes per full adder, as shown in Figure 11(b)). We injected these errors using 2-to-1 multiplexers whose output is given by

$$
\mathit{mux}_{out} =
\begin{cases}
\mathit{in1}\ (\text{normal gate output}), & \text{if } \mathit{select}\ (\mathit{f\_enable}) = 0,\\
\mathit{in2}\ (\text{stuck-at fault } f \in F), & \text{if } \mathit{select}\ (\mathit{f\_enable}) = 1.
\end{cases}
\tag{8}
$$

In Figure 11(a), the symbols A[3:0], B[3:0], Cin, f_enable, and F[23:0] denote the 4-bit input A, the 4-bit input B, the 1-bit carry-in, the 1-bit fault enabling signal, and the 24-bit fault signals, respectively, while Cout is the carry-out and S[3:0] represents the 4-bit sum output of the 4-bit adder. Figure 11(b) shows the detailed schematic of a single full adder.

We considered that faults can occur at the outputs of the logic gates only and adopted a single-fault model, according to which only one fault can occur at a time [29].
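The multiplexer-based injection of (8) under the single-fault model can be sketched in software as below. This is a behavioural illustration only, not the authors' Verilog test bench: the full adder here is a hypothetical 5-gate netlist (the paper's full adder exposes 6 fault nodes), so the counts it produces are not expected to match Table 5.

```python
from itertools import product


def mux(normal, stuck_value, f_enable):
    """Eq. (8): pass the normal gate output unless the fault is enabled."""
    return stuck_value if f_enable else normal


def full_adder(a, b, cin, fault=None):
    """1-bit full adder; fault = (node_index, stuck_value) or None."""
    def node(index, value):
        enabled = fault is not None and fault[0] == index
        return mux(value, fault[1] if enabled else 0, enabled)

    x1 = node(0, a ^ b)
    s = node(1, x1 ^ cin)
    c1 = node(2, a & b)
    c2 = node(3, x1 & cin)
    cout = node(4, c1 | c2)
    return s, cout


def adder4(a, b, cin, fault=None):
    """4-bit ripple adder; fault = (adder_index, node_index, stuck_value) or None."""
    carry, total = cin, 0
    for i in range(4):
        local = (fault[1], fault[2]) if fault is not None and fault[0] == i else None
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry, local)
        total |= s << i
    return total | (carry << 4)


def count_errors():
    """Apply every single stuck-at fault to every input vector."""
    injected = erroneous = 0
    for fa, nd, sv in product(range(4), range(5), (0, 1)):
        for a, b, cin in product(range(16), range(16), (0, 1)):
            injected += 1
            if adder4(a, b, cin, (fa, nd, sv)) != adder4(a, b, cin):
                erroneous += 1
    return injected, erroneous


print(count_errors())
```

Only a fraction of the injected faults change the adder output, because a stuck-at value that equals the fault-free node value is masked; the same effect explains why only part of the injected faults in the experiment produce a logical error.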


Figure 10: Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder built from four full adders (FA1 to FA4), each with inputs A, B, Cin, f_enable, and F[5:0] and outputs S and Cout, and (b) 1-bit full adder with fault injection through 2-to-1 multiplexers.


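Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

| Data bits | Berger | SEDC | m-out-of-2m |
|---|---|---|---|
| 2 | 3.888 | 0.514 | 1.024 |
| 3 | 4.151 | 2.524 | - |
| 4 | 7.741 | 2.738 | 5.490 |
| 5 | - | 2.713 | 5.558 |
| 7 | 7.821 | 2.77 | 8.297 |
| 8 | 7.599 | 2.76 | 9.284 |
| 15 | 10.566 | 2.826 | - |
| 16 | 12.956 | 2.751 | |
| 32 | 17.964 | 2.771 | - |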

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

| | (a) Total errors at the output of the adder | (b) BEs | (c) Detected BEs | (d) UEs | (e) Detected UEs | (f) Total detected errors | (g) Total undetected errors |
|---|---|---|---|---|---|---|---|
| Total | 1748 | 252 | 120 | 1496 | 1496 | 1616 | 132 |
| Percentage (%) | 100 | 14.42 wrt (a) | 47.62 wrt (b) | 85.58 wrt (a) | 100 wrt (d) | 92.45 wrt (a) | 7.55 wrt (a) |

We used Altera's Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these error-causing faults resulted in bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This supports the observation that most errors in VLSI circuits result in UEs at the output [19–21]. Even though SEDC is an AUED scheme that provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. This is because SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of BEs is also partitioned into multiple UEs and is thus detected by the proposed SEDC scheme.
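The percentages in Table 5 and the partitioning argument above can be illustrated with a short sketch. The classifier `classify_error` is a hypothetical helper, and the 2-bit segments in the example are chosen only for illustration; they are not the SEDC segment sizes.

```python
def classify_error(expected, observed, width):
    """Classify an error by the direction of its bit flips."""
    rises = falls = 0
    for i in range(width):
        e, o = (expected >> i) & 1, (observed >> i) & 1
        if e == 0 and o == 1:
            rises += 1
        elif e == 1 and o == 0:
            falls += 1
    if rises == 0 and falls == 0:
        return "none"
    return "bidirectional" if rises and falls else "unidirectional"


# A bidirectional error over the whole word ...
print(classify_error(0b0101, 0b1100, 4))
# ... becomes two unidirectional errors when each 2-bit segment is checked alone.
print(classify_error(0b01, 0b11, 2), classify_error(0b01, 0b00, 2))

# Counts from Table 5.
total, bes, detected_bes, ues, detected_ues = 1748, 252, 120, 1496, 1496
detected = detected_bes + detected_ues


def pct(part, whole):
    return round(100 * part / whole, 2)


print(pct(bes, total), pct(detected_bes, bes), pct(ues, total),
      pct(detected, total), pct(total - detected, total))
# 14.42 47.62 85.58 92.45 7.55
```

The last line reproduces the percentage row of Table 5 from its count row, including the 7.55% of errors that remain undetected.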

5.4. Cost Analysis: SW-Based Fault Tolerance versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity, in our analysis we take the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node or, in some cases, on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service, and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of the failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].

For our cost analysis, we would like to compute the cost associated with the CC data recovery algorithm, for which we assume a cloud application, such as a message passing interface (MPI) program, that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p - 1$ processes with equal probabilities. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure only occurs during message passing with equal probability and not during checkpointing or recovery, we use the analytical cost model given in [4] for the cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then, the total cost of fault tolerance per process is given by

$$
C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}
$$


Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then according to [4] the average recovery cost in CC per process is given by

$$
T_{RO} = \frac{C_{rb}}{1/P(f)} = P(f)\, C_{rb}. \tag{10}
$$

Let $P(cp)$ denote the probability that a process starts checkpointing; then $(1 - P(cp))^p$ becomes the probability that $p$ processes do not start checkpointing, while $1 - (1 - P(cp))^p$ becomes the probability that at least one process starts a checkpoint. Consequently, $1/(1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$ and generate request (REQ) and acknowledgement (ACK) signals to the rest of the $p - 1$ noninitiators (a total of $2(p - 1)$ signals), and likewise be a noninitiator with probability $1 - 1/p$ and generate only one ACK signal in response to the initiator. As a result, there are $3(p - 1)/p$ average messages generated per checkpoint, and the average overhead per checkpoint is $C_w + (3(p - 1)/p)\,C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then, the average checkpointing cost for a process is given by

$$
T_{CP} = \frac{C_w + \left(3(p - 1)/p\right) C_{nl}}{1/\left(1 - (1 - P(cp))^p\right)}
= \left(1 - (1 - P(cp))^p\right)\left(C_w + \frac{3(p - 1)}{p}\, C_{nl}\right). \tag{11}
$$

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpoint per 15 minutes), $C_{nl} = 20$ msec, $C_w = 1$ sec, and $C_{rb} = 2$ secs, as given in [4]. We consider the value of $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, while each fault occurs after 168 hours (one week's time). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the value 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters at a time while keeping the others constant and observe the effect on the data recovery cost with and without the proposed HW-level fault tolerance.
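The cost model of (9), (10), and (11) can be sketched as three small functions. The function names are hypothetical, the parameter values are the ones quoted above, and the units are used as stated in the text without further reconciliation, so the printed numbers are illustrative rather than a reproduction of Figures 12 to 14.

```python
def recovery_cost(p_f, c_rb):
    """Eq. (10): T_RO = C_rb / (1 / P(f)) = P(f) * C_rb."""
    return p_f * c_rb


def checkpoint_cost(p, p_cp, c_w, c_nl):
    """Eq. (11): per-process checkpointing cost for p processes."""
    at_least_one_checkpoint = 1 - (1 - p_cp) ** p
    return at_least_one_checkpoint * (c_w + 3 * (p - 1) / p * c_nl)


def total_cost_percent(t_cp, t_ro, t):
    """Eq. (9): fault tolerance cost as a percentage of execution time T."""
    return (t_cp + t_ro) / t * 100


# Parameters quoted in the text (taken from [4]).
p, p_cp, c_nl, c_w, c_rb = 128, 1 / 15, 0.020, 1.0, 2.0
p_f_sw_only = 1 / 168              # every HW fault reaches the SW level
p_f_with_hw = 0.0755 * p_f_sw_only  # only 7.55% of faults remain unhandled

t_cp = checkpoint_cost(p, p_cp, c_w, c_nl)
t_ro_sw = recovery_cost(p_f_sw_only, c_rb)
t_ro_hw = recovery_cost(p_f_with_hw, c_rb)
print(t_cp, t_ro_sw, t_ro_hw)
```

Because $T_{RO}$ is linear in $P(f)$, the recovery overhead with HW-level fault tolerance is 7.55% of the SW-only value, while the checkpointing term $T_{CP}$ is unchanged.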

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes $p$ is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into $p$ processes and that each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in the data recovery cost in the CC algorithm, because all processes have to coordinate with each other in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery, because it takes longer for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one


Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

failure per 2 hours, which caused a huge impact on the fault tolerance overhead, as shown in Figure 14. But if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures that could lead to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains failsafe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the comparison demonstrates the superiority of the proposed SEDC checker. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data at both the SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University, 2017; the Sogang University Research Grant of 2012 (20121005601); and MSIP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 ’13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC ’10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN ’13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD ’13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT ’16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC ’14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity: report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N K Jha and M B Vora ldquoA t-unidirectional error-detectingsystematic coderdquo Computers amp Mathematics with Applicationsvol 16 no 9 pp 705ndash714 1988

[22] J Kim D-H Lee and W Sung ldquoPerformance of rate 096(68254 65536) EG-LDPC code for NAND Flash memoryerror correctionrdquo in Proceedings of the 2012 IEEE InternationalConference on Communications ICC rsquo12 pp 7029ndash7033 June2012

[23] S Piestrak D Bakalis and X Kavousianos ldquoOn the design ofself-testing checkers for modified Berger codesrdquo in Proceedingsof the Seventh International On-Line Testing Workshop pp 153ndash157 Taormina Italy 2001

[24] P K Lala Self-Checking and Fault Tolerant Digital DesignAcademic press UK 2001

[25] J-A Lee Z A Siddiqui N Somasundaram and J-G LeeldquoSelf-checking look-up tables using scalable error detectioncoding (SEDC) schemerdquo Journal of Semiconductor Technologyand Science vol 13 no 5 pp 415ndash422 2013


[26] D. A. Pierce Jr. and P. K. Lala, "Modular implementation of efficient self-checking checkers for the Berger code," Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, "Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding," in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, "Online error detection in SRAM based FPGAs using Scalable Error Detection Coding," in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED '13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, "Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes," IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, Piscataway, NJ, USA, May 2010.


they consume a lot of memory and communication resources. Hence, in recent years, many researchers have proposed fault tolerance algorithms for improved data recovery, effective fault detection, and reduced latency in big data and cloud computing [2, 5–10], all of which detect faults at the software (SW) level. Even though faults propagated due to transient errors in hardware are also detected by these schemes, and software-based techniques are more flexible, the amount of data that must be processed to detect a fault costs a lot more than in hardware- (HW-) based fault tolerance schemes. A recent study [11] investigated the cause of data corruption in a Hadoop Distributed File System (HDFS) and found that, when processing uploaded files, HW errors such as disk failure and bit-flips in processor and memory generate exceptions that are difficult to handle properly. Liu et al. [7] implemented some level of HW-based fault tolerance by modelling CPU temperature to anticipate a deteriorating physical machine. They proposed CPU temperature monitoring as an essential step for preventing machine failure due to overheating as well as for improving the data center's energy efficiency.

Parker [12] discussed how, in many cases, the faults are a direct consequence of tightly integrating digital and physical components into a single unit at a sensor or field node. In fact, many modern systems rely so heavily on digital technology that the reliability of the system cannot be decomposed and partitioned into physical and SW components, due to interactions between them. There is a cost associated with the storage, transmission, and analysis of these higher-dimensional data. Furthermore, many of the SW-based approaches are simulation intensive, which may lead to broad implementation challenges. To overcome some of these challenges, he suggested that onboard embedded processing will be a practical requirement.

Transient errors in HW, if propagated, may cause a chain reaction of errors at the SW layer, causing potential failure at the node/server level. Detection at the HW level requires less computation time (as low as a single clock cycle) compared with detection at the SW level (several machine cycles), while a simple recovery mechanism called recomputation at the HW level can save a lot of data swapping and signaling at the SW level. As discussed in [13], big data has created opportunities for semiconductor companies to develop more sophisticated systems to cover the challenges faced in big data and cloud computing, and a trend towards integration of more functions onto a single piece of silicon is likely to continue. Also, due to advances in semiconductor processing, there has been a reduction in the cost of digital components [12]. For these reasons, we propose the detection of transient faults as they occur in HW through a HW-based fault tolerance scheme, while the SW-based fault tolerance stays at the top level as a second check for HW errors and a first check for SW errors. As a result, the transient errors that arise in HW are mostly taken care of by lightweight processing at the HW level with little overhead (in terms of area, power, and delay), saving tremendous computation resources at the system level. The potential for catastrophic consequences in big data systems justifies the overhead incurred by the HW-based fault tolerance method.

On the other hand, fault tolerance has also become an integral part of very large-scale integration (VLSI) circuits, where downsized, large-scale, and low-power VLSI systems are prone to transient faults [14]. Transient faults, or soft errors, are transient-induced events on memory and logic circuits caused by the striking of rays emitted from an IC package and high energy alpha particles from cosmic rays [14–18]. Also, in multilevel cell memories like NAND Flash memories, these errors are mostly caused by cell-to-cell interference and data retention errors [19]. Physical protection such as shielding, temperature control, and grounding circuits is not always feasible; hence, in such cases, concurrent error detecting (CED) methods are employed for protection against these errors. Since CED circuits add to the overall area and delay of the system, the selection of appropriate error detecting and even error correcting circuits for a particular application leads to an efficient design [18]. It has been reported that the biggest portion of errors that occur in VLSI circuits and memories are related to unidirectional errors (UE) [19–21], because these errors shift threshold voltage levels to either the positive or negative side [22], causing the circuit node logic to change from "0" to "1" or from "1" to "0", but not both at the same time.

Many all unidirectional error detection (AUED) schemes have been proposed and implemented, among which the Berger code technique [23] is agreed to be the least redundant. With the ability to detect single- as well as multiple-bit unidirectional errors, this technique provides error detection by simply summing the logic 0's (a B0 scheme) or 1's (a B1 scheme) in the information word and expressing the sum in binary. If the information word contains $n$ bits, then a Berger code requires $\lceil \log_2(n + 1) \rceil$ check bits. A Berger code checker employs a 0's (or 1's) counter circuit for reencoding the information word into check bits and then compares the result with the preencoded check bits using a two-rail checker [23]. A chain of adders and a tree of two-rail checkers are required to design these checker circuits [23], where area and latency increase drastically as data length increases.
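As a rough illustration of the B0 variant described above, the following Python sketch computes the check-symbol width and the check symbol for a bit string. The function names are hypothetical and the code is only a behavioural model, not the counter-and-adder hardware of [23].

```python
def berger_check_bits(n: int) -> int:
    """Check-symbol width for an n-bit word: ceil(log2(n + 1)), n >= 1."""
    # For n >= 1, ceil(log2(n + 1)) equals the bit length of n,
    # which avoids floating-point rounding.
    return n.bit_length()


def berger_b0(word: str) -> str:
    """Return the B0 check symbol (number of 0's, in binary) of a bit string."""
    width = berger_check_bits(len(word))
    return format(word.count("0"), f"0{width}b")


if __name__ == "__main__":
    print(berger_check_bits(8))  # 4 check bits for 8 information bits
    print(berger_b0("1011"))     # one 0 in a 4-bit word -> 001
    print(berger_b0("0000"))     # four 0's -> 100
```

The width grows with $n$, which is why the reencoding counter in a Berger checker grows with the data length.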

An m-out-of-n code is one in which all valid code words have exactly "m" 1's and "n-m" 0's. These codes can also detect all unidirectional errors when n = 2m. This condition not only increases the code size but also the checker's area. Cellular realization of an m-out-of-2m code circuit was deemed by Lala [24] as more area- and delay-efficient than the previous implementations.

Figure 1: Block diagram of the proposed hardware-level fault tolerance system.

Given the importance of fault tolerance at the HW level in big data and cloud computing applications, in this paper we present a fault secure (FS) SEDC checker used with SEDC codes [25]. An FS checker has the ability to safely hide or self-check (detect) its own faults as they occur in its circuitry. The SEDC partitions the input data into smaller segments (2, 3, and 4 bits) and encodes them in parallel. This unique scaling feature makes the system faster and less complex to design for any binary data length. The FS SEDC checker inherits all these features of SEDC codes (i.e., simple scalability, constant latency, and less power dissipation), which suits its implementation in online fault detection in processors, cache memories, and NAND Flash-based memories for big data applications. The major contributions of this paper are as follows:

(1) We propose HW-level fault tolerance for circuits designed to process big data and cloud computing applications.

(2) In order to show the effectiveness of the proposed HW-level fault tolerance scheme in a big data scenario, we compare the cost of fault tolerance with and without the proposed scheme and present results that show a significant reduction in the overall cost of fault tolerance in big data when the proposed HW-based fault tolerance scheme is applied.

(3) We also present a novel FS SEDC checker for use with SEDC-based HW-level fault tolerance systems.

(4) In order to prove the superiority of the FS SEDC checker presented, in contrast with state-of-the-art AUED checkers, we show that the FS SEDC checker achieves state-of-the-art performance in terms of area, delay, and power dissipation.

The rest of the paper is organized as follows. We present an overall system diagram of the proposed HW-level fault tolerance system in Section 2. We give a brief mathematical foundation of the SEDC scheme and an example of encoding logical circuits using SEDC in Section 3. Design details of the FS SEDC checker are described in Section 4. In Section 5, the proposed checker is shown to be FS through fault testing, and its area, delay, and power are compared with the state of the art. We also compute the fault coverage of the proposed SEDC-based fault tolerance system and present the experimental details and results in Section 5. To show the effectiveness of the proposed method in big data and cloud computing applications, we also perform a cost-performance analysis of fault tolerance at the SW level versus the HW level in Section 5. Finally, we conclude the paper in Section 6.

2. Introduction to the Overall System

Figure 1 shows the main components of an error-detecting-code-based HW-level fault tolerance system. The functional circuit consists of two subcircuits: an information symbol generator (ISG) and a check symbol generator (CSG). These two circuits do not share any logic. The ISG takes input $D$, performs some operation $G$, and produces output $G(D)$. The CSG is a carefully chosen logic function that acts as the encoder and generates check bits $S$ using the same input $D$ such that $S = \phi(G(D))$, where $\phi$ denotes the particular coding function. The checker normally contains another encoder that reencodes the information bits $G(D)$ into $S' = \phi(G(D))$ and then compares $S$ and $S'$. A mismatch between $S$ and $S'$ is treated as an error, which is indicated by the error indication (or verification) signal $V$.

The checker shown in Figure 1 plays a vital role in the overall fault tolerance system. The checker must exhibit a self-checking property or a failsafe property to make sure that the whole system is fault secure (FS). If the checker is both self-checking and failsafe, the overall system is said to be totally self-checking (TSC). In order to formally define these properties, let the output of the functional circuit shown in Figure 1 be represented by $G(D) = G(x, f)$, where $x$ is the input and $f$ is the fault; in fault-free operation, i.e., $f = 0$, the output is $G(x, 0)$. Also consider the input code space $D \subseteq X$, the output code space $S \subseteq Y$, and an assumed fault set $F$. Then, according to the definition of totally self-checking (TSC), $G$ is

(1) self-testing if, for each fault $f$ in $F$, there exists at least one input code word $d \in D$ that produces a noncode output, i.e., $\forall f \in F,\ \exists d \in D \ni G(d, f) \notin S$;

(2) fault secure (FS) if, for all faults $f$ in $F$ and all code inputs $d \in D$, the output is either correct or a noncode word, i.e., $\forall f \in F$ and $\forall d \in D$, $G(d, f) = G(d, 0)$ or $G(d, f) \notin S$.

In the proposed SEDC-based HW-level fault tolerance system, the CSG circuit is realized by an SEDC check symbol generator (SCSG) circuit, which generates the SEDC code words corresponding to the information bits $G(D)$. We presented a realization of an SEDC encoded SCSG circuit in [27], i.e., an SEDC encoded arithmetic logic unit (ALU) of a microprocessor. The SEDC encoded ALU circuit (SCSG) computes the SEDC codes corresponding to the output of the ISG (in [27], a normal ALU). Any fault that causes multiple unidirectional errors at the output of the normal ALU is detected by the SEDC checker. Any logic circuitry, including SRAM-based memory cells [28], can be made fault tolerant by encoding it with methods similar to those given in [27, 28]. In the next section, we briefly introduce the SEDC scheme with an example of encoding an adder circuit, while in the rest of the paper we focus on the proposed FS SEDC checker, which can be used with any SEDC-based HW-level fault tolerance system.

Figure 2: (a) SEDC scheme for given data word and (b) 2D illustration of SEDC2 scheme.

3. Scalable Error Detection Coding (SEDC) Scheme

The Scalable Error Detection Coding scheme [25] is an AUED scheme formulated and designed in such a way that only the resultant circuit area is scaled, while its latency depends on a small portion of the input data (explained later).

For any binary data $D$ of length $n$ bits, represented as $(D_{n-1}, \ldots, D_2, D_1, D_0)$ with $D_i \in \{0, 1\}$ for $0 \le i \le n - 1$, two parameters $a$ and $b$ are computed using

$$a = \frac{n - \max(b)}{3}, \tag{1}$$

where parameter $a$ can only take a positive integer value, i.e., $a \in \mathbb{Z}^+$, and parameter $b \in \{2, 3, 4\}$. The maximum possible value of $b$ that satisfies the condition on $a$ is selected. The SEDC code word $S$ is represented as $(S_{m-1}, \ldots, S_j, \ldots, S_2, S_1, S_0)$ with $S_j \in \{0, 1\}$ for $0 \le j \le m - 1$, where $m$ denotes the length of the SEDC code word and is computed by

$$m = \lceil \log_2(n + 1 - 3a) \rceil + 2a. \tag{2}$$

After computing the values of parameters $a$ and $b$, the SEDC code $S$ for binary data $D$ is computed. SEDC is designed to generate codes basically for 2-, 3-, and 4-bit data and is accordingly referred to as the SEDC2, SEDC3, and SEDC4 scheme, respectively. It is then extended to any integer value of $n$, as shown in Figure 2(a).
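The parameter selection in (1) and (2) can be sketched in a few lines of Python. The function name is hypothetical. The sketch also assumes that $a = 0$ is permitted for $n \le 4$, so that the SEDC2, SEDC3, and SEDC4 base cases fall out of the same formulas; that is an interpretation, since the text asks for a positive integer.

```python
from math import ceil, log2


def sedc_parameters(n: int) -> tuple[int, int, int]:
    """Return (a, b, m) for an n-bit data word following (1) and (2)."""
    if n < 2:
        raise ValueError("this sketch handles n >= 2 only")
    # Pick the largest b in {4, 3, 2} for which a = (n - b) / 3 is a whole
    # number. One of the three always matches because they cover all
    # residues modulo 3.
    for b in (4, 3, 2):
        if n >= b and (n - b) % 3 == 0:
            a = (n - b) // 3
            break
    m = ceil(log2(n + 1 - 3 * a)) + 2 * a
    return a, b, m


if __name__ == "__main__":
    for n in (4, 5, 8, 32):
        print(n, sedc_parameters(n))
```

For $n = 8$ this gives $a = 2$, $b = 2$, and $m = 6$, and for $n = 5$ it gives $a = 1$, $b = 2$, and $m = 4$, matching the 8-bit checker and the 5-bit adder output discussed later in the paper.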

3.1. SEDC2 Code. A two-dimensional (2D) illustration of a 2-bit SEDC (SEDC2) scheme is shown in Figure 2(b), where nodes represent data words and their corresponding code words are written in brackets.

The SEDC coding scheme assigns code words to different data words with a unique criterion. Whenever a bit (or bits) in a data word changes from "1" → "0", as shown with a bold arrow in Figure 2(b), the change is reflected in the code word in the opposite way, i.e., the code changes from "0" → "1", as shown with the dashed arrow in Figure 2(b), and vice versa. Equation (3) is used to assign 2-bit code words $S_1S_0$ to the 2-bit data words $D_1D_0$. Clearly, we can interchange the bit positions of $S_1$ and $S_0$ to obtain another variant of SEDC2 codes. This will not affect the code characteristics.

$$[S_1\ S_0] = \mathrm{SEDC2}(D_1, D_0) = [\mathrm{XNOR}(D_1, D_0),\ \mathrm{NAND}(D_1, D_0)]. \tag{3}$$

In (3), $[S_1\ S_0]$ represents the concatenated SEDC code bits, XNOR and NAND are the logical operations, and SEDC2 is the basic coding scheme.
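To make (3) concrete, the sketch below enumerates the four SEDC2 code words and checks that no data-plus-code word can be turned into another one by changing bits in a single direction only (the standard "unordered code" condition for detecting all unidirectional errors). The helper names `covers` and `is_unordered` are illustrative and do not come from the paper.

```python
from itertools import product


def sedc2(d1: int, d0: int) -> tuple[int, int]:
    """SEDC2 code (S1, S0) = (XNOR(D1, D0), NAND(D1, D0)) as in (3)."""
    s1 = 1 - (d1 ^ d0)  # XNOR
    s0 = 1 - (d1 & d0)  # NAND
    return (s1, s0)


def covers(x: tuple, y: tuple) -> bool:
    """True if y is reachable from x by changing only 0's to 1's (or x == y)."""
    return all(xi <= yi for xi, yi in zip(x, y))


def is_unordered(words: list) -> bool:
    """True if no word covers a different word, i.e. the code is AUED."""
    return not any(covers(x, y) for x in words for y in words if x != y)


if __name__ == "__main__":
    # Each word is (D1, D0, S1, S0).
    words = [(d1, d0) + sedc2(d1, d0) for d1, d0 in product((0, 1), repeat=2)]
    for w in words:
        print(w)
    print("detects all unidirectional errors:", is_unordered(words))
```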

3.2. SEDC3 Code. The SEDC3 code for 3-bit data is computed using (4) as follows:

$$[S_1\ S_0] = \mathrm{SEDC3}(D_2, D_1, D_0) = \begin{cases} \mathrm{SEDC2}(D_1, D_0), & \text{if } D_2 = 0, \\ \overline{\mathrm{SEDC2}(\overline{D_1}, \overline{D_0})}, & \text{if } D_2 = 1, \end{cases} \tag{4}$$

where the bar sign (e.g., $\overline{D_1}$) in (4) represents the logical NOT operation.

Figure 3 shows a 3D cube illustrating the unidirectional error detection mechanism of SEDC3 codes. The same notations are used in Figure 3 as in Figure 2(b). The dashed side of the cube represents the SEDC2 coding scheme embedded in SEDC3. Note that when there is a 2-bit unidirectional change in the data word "001" → "111" (the two MSBs changing from "00" → "11"), the code changes in the opposite direction (the least significant bit of the code changes from "1" → "0"). In a similar way, the SEDCn scheme detects $n$-bit, or all, unidirectional errors in the data word $D$.
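A small Python sketch of the SEDC3 assignment follows. It assumes that, when $D_2 = 1$, both the data bits fed to SEDC2 and the resulting 2-bit code are complemented. This reading reproduces the "001" → "111" example above and agrees with the checker in Section 4.3, which inverts both $G_1G_0$ and $S_1S_0$ when $G_2 = 1$. The function names are illustrative.

```python
from itertools import product


def sedc2(d1: int, d0: int) -> tuple[int, int]:
    """(S1, S0) = (XNOR(D1, D0), NAND(D1, D0))."""
    return (1 - (d1 ^ d0), 1 - (d1 & d0))


def sedc3(d2: int, d1: int, d0: int) -> tuple[int, int]:
    """SEDC3 built on SEDC2; the D2 = 1 branch complements inputs and output."""
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)


def is_unordered(words: list) -> bool:
    """True if no word can become another by changes in one direction only."""
    return not any(
        x != y and all(xi <= yi for xi, yi in zip(x, y))
        for x in words
        for y in words
    )


if __name__ == "__main__":
    words = []
    for d in product((0, 1), repeat=3):
        code = sedc3(*d)
        words.append(d + code)
        print(d, code)
    print("detects all unidirectional errors:", is_unordered(words))
```

Under this assumption, data word 001 maps to code 01 and 111 maps to 00, so the least significant code bit falls from 1 to 0 while the data bits rise, as described above.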


3.3. SEDC4 Code. An SEDC4 code for 4-bit data is formulated by (5) as follows:

$$[S_2\ (S_1\ S_0)] = \mathrm{SEDC4}(D_3, D_2, D_1, D_0) = [\overline{D_3},\ \mathrm{SEDC3}(D_2, D_1, D_0)]. \tag{5}$$

The MSB of the code word is completely dependent upon the MSB of the data word for SEDC4; hence, any change in the MSB of the data word is detected. The remaining three data bits are encoded using the same SEDC3 scheme.

It can be observed from (3), (4), and (5) that SEDC2 is embedded in the 3-bit SEDC (SEDC3) and consequently in the 4-bit SEDC (SEDC4) to detect all unidirectional errors in 3-bit and 4-bit data, as shown later. This ability to scale codes is not present in any other concurrent error detecting (CED) coding scheme.

In general, for SEDCn, the $n$-bit binary data is grouped into one $b$-bit segment and $a$ 3-bit segments, and these segments are then encoded using one SEDCb module and $a$ SEDC3 modules in parallel, as shown in Figure 2(a). It is noteworthy that each data segment and its corresponding code segment are independent of the other segments. This independence makes our scheme scalable and able to detect some portion of bidirectional errors (BE) (discussed in Section 5.3).

If we interchange $S_1$ and $S_0$ for SEDC3 in Figure 3, the corresponding SEDC3 code is equal to the Berger code for a 3-bit segment, but our way of deriving the SEDC3 code is quite different from that of Berger codes. SEDC3 codes are basically scaled from SEDC2 codes, and SEDC2 codes have no commonality with 2-bit Berger codes.
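The segmentation used by SEDCn can be sketched as follows. This is a behavioural model only. It assumes MSB-first bit order with the $b$-bit segment taken from the most significant bits, as suggested by Figure 2(a). It also assumes that the SEDC4 check MSB is the complement of $D_3$, and that the $D_2 = 1$ branch of SEDC3 complements both inputs and output. All function names are hypothetical.

```python
from itertools import product


def sedc2(d1, d0):
    return (1 - (d1 ^ d0), 1 - (d1 & d0))  # (XNOR, NAND)


def sedc3(d2, d1, d0):
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)


def sedc4(d3, d2, d1, d0):
    # Assumption: the check MSB is the complement of the data MSB.
    return (1 - d3,) + sedc3(d2, d1, d0)


def sedc_encode(bits):
    """Encode a tuple of n >= 2 bits (MSB first) segment by segment."""
    n = len(bits)
    # Largest b in {4, 3, 2} that leaves a whole number of 3-bit segments.
    b = next(b for b in (4, 3, 2) if n >= b and (n - b) % 3 == 0)
    base = {2: sedc2, 3: sedc3, 4: sedc4}[b]
    code = list(base(*bits[:b]))
    for i in range(b, n, 3):  # the a remaining 3-bit segments
        code.extend(sedc3(*bits[i:i + 3]))
    return tuple(code)


def is_unordered(words):
    return not any(
        x != y and all(xi <= yi for xi, yi in zip(x, y))
        for x in words
        for y in words
    )


if __name__ == "__main__":
    data = (1, 1, 0, 1, 0, 1, 1, 0)  # n = 8 -> one SEDC2 + two SEDC3
    print(sedc_encode(data))
    words = [d + sedc_encode(d) for d in product((0, 1), repeat=7)]
    print("7-bit code is unordered:", is_unordered(words))
```

Because every segment is encoded independently, the per-segment encoders can run in parallel in hardware; the loop here is only a software stand-in for that parallelism.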

3.4. SEDC-Based HW-Level Fault Tolerance System Example. In order to illustrate the design of a HW-level fault tolerance system using the SEDC scheme, we take the example of a 4-bit adder. Suppose that this 4-bit adder is part of a processor that processes big data applications and that we want to make it fault tolerant against transient errors that arise in its circuitry. The general HW-level fault tolerance system diagram shown in Figure 1 is then converted to the one shown in Figure 4. As shown in Figure 4, the 4-bit adder acts as the ISG and its equivalent SEDC encoder acts as the CSG. The SEDC encoder, or CSG, can be implemented using (6) as follows:

$$[S_3 \ldots S_0] = \mathrm{SEDC}(A[3{:}0] + B[3{:}0] + C_{in}). \tag{6}$$

Because the output of the 4-bit adder is a 5-bit value, the equivalent SEDC code is a 4-bit value according to (2). We used Altera's Quartus II software to synthesize the 4-bit adder (ISG), the SEDC encoder (CSG), and the SEDC checker shown in Figure 4 and used the synthesized circuit to compute the fault coverage of the SEDC scheme, which is presented in Section 5.3. In the next section, we present the proposed FS SEDC checker, which completes the overall proposed SEDC-based HW-level fault tolerance system.
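The adder example can be mimicked with a behavioural Python model. This is only a sketch of the data flow in Figure 4, not the synthesized Quartus II circuit. It assumes that the 5-bit sum is split into a 2-bit segment (bits 4..3) and a 3-bit segment (bits 2..0), that the $D_2 = 1$ branch of SEDC3 complements both inputs and output, and that the checker simply reencodes and compares. The names `isg`, `csg`, and `checker` are illustrative.

```python
def sedc2(d1, d0):
    return (1 - (d1 ^ d0), 1 - (d1 & d0))  # (XNOR, NAND)


def sedc3(d2, d1, d0):
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)


def sedc5(value):
    """4-bit SEDC code of a 5-bit value: SEDC2 on bits 4..3, SEDC3 on 2..0."""
    d = [(value >> i) & 1 for i in (4, 3, 2, 1, 0)]
    s = sedc2(d[0], d[1]) + sedc3(d[2], d[3], d[4])
    return (s[0] << 3) | (s[1] << 2) | (s[2] << 1) | s[3]


def isg(a, b, cin):
    """Information symbol generator: the plain 4-bit adder (5-bit result)."""
    return a + b + cin


def csg(a, b, cin):
    """Check symbol generator: SEDC code of the expected sum, as in (6)."""
    return sedc5(a + b + cin)


def checker(data, code):
    """Return True when an error is flagged (reencoded data != check bits)."""
    return sedc5(data) != code


def undetected_unidirectional_errors():
    """Count unidirectional errors on the 9-bit (data, code) word that slip by."""
    missed = 0
    for total in range(32):
        word = (total << 4) | sedc5(total)
        for mask in range(1, 1 << 9):
            # 0 -> 1 errors force bits high; 1 -> 0 errors force bits low.
            for faulty in (word | mask, word & ~mask):
                if faulty != word and not checker(faulty >> 4, faulty & 0xF):
                    missed += 1
    return missed


if __name__ == "__main__":
    a, b, cin = 9, 5, 1
    data, code = isg(a, b, cin), csg(a, b, cin)
    print(data, format(code, "04b"), checker(data, code))
    print(checker(data | 0b10000, code))  # bit 4 flipped 0 -> 1
    print("undetected:", undetected_unidirectional_errors())
```

In this model every unidirectional error pattern that actually changes the data or check bits is flagged, which is the behaviour the AUED property promises. Faults inside the checker itself are not modelled here.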

Table 1: Code table for FS SEDC1 checker.

| G0 | S0 | V1 | V0 |
|----|----|----|----|
| 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |

4. The FS SEDC Checker

As shown in Figure 4, the FS SEDC checker takes $n$ information bits and $m$ SEDC check bits from the functional unit. The FS SEDC checker is also composed of one $b$-bit FS SEDC checker and $a$ 3-bit FS SEDC checkers. With 1-, 2-, and 3-bit FS SEDC checkers, the output can be used directly as an error indication signal, but for $n > 3$ one level of wired-AND-OR logic gates is used to combine the outputs of all FS SEDC checker subblocks and generate the 2-bit error indication signal. The following subsections discuss the logic and circuit diagrams of the primitive FS SEDC checkers (SEDC1, SEDC2, SEDC3, and SEDC4 checkers), which can be used to scale the SEDC checker to an $n$-bit FS SEDC checker (i.e., an FS SEDCn checker).

4.1. The FS SEDC1 Checker. Table 1 shows the logic for a 1-bit SEDC (FS SEDC1) checker. The valid input code words are "10" and "01", and the valid output code word is "10". $G_0$ denotes the 1-bit information word that is the output of the ISG, and $S_0$ denotes the 1-bit SEDC check bit generated by the SEDC check symbol generator (SCSG). $V_1V_0$ is the 2-bit error indication signal of the FS SEDC1 checker. The $V_1$ and $V_0$ signals are generated by the circuits shown in Figure 5(a).
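Read as Boolean functions, the rows of Table 1 correspond to $V_1 = \mathrm{NAND}(G_0, S_0)$ and $V_0 = \mathrm{NOR}(G_0, S_0)$. The short sketch below reproduces the table at the logic level; it does not model the transistor-level circuit of Figure 5(a), and the function name is illustrative.

```python
def sedc1_checker(g0: int, s0: int) -> tuple[int, int]:
    """Return (V1, V0) for the 1-bit checker, read off Table 1."""
    v1 = 1 - (g0 & s0)  # NAND: drops to 0 only when both inputs are 1
    v0 = 1 - (g0 | s0)  # NOR: rises to 1 only when both inputs are 0
    return (v1, v0)


if __name__ == "__main__":
    for g0 in (0, 1):
        for s0 in (0, 1):
            v1, v0 = sedc1_checker(g0, s0)
            status = "valid" if (v1, v0) == (1, 0) else "error"
            print(g0, s0, v1, v0, status)
```

Only the two valid inputs "01" and "10" yield the valid output "10"; the noncode inputs "00" and "11" yield "11" and "00", respectively.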

4.2. The FS SEDC2 Checker.

$$[V_1\ V_0] = \left[\overline{S_1 (G_1 + G_0)(S_0 + G_1 G_0)},\ \overline{(G_1 + G_0 + S_0)(S_1 + G_1 G_0 S_0)}\right]. \tag{7}$$

In Figure 5, the symbols P1–P13 and N1–N13 represent the PMOS and NMOS transistors, respectively, and Vss represents the voltage supply. For simplicity, we used a CMOS-based implementation of the SEDC checker circuits. Any other technology can be used to design these circuits, but the underlying algorithm, i.e., SEDC, will remain the same.

4.3. The FS SEDC3 Checker. Figure 6(a) shows the block diagram and the logic for a 3-bit FS SEDC checker. The 3-bit data $G_2G_1G_0$ from the ISG and the 2-bit SEDC check bits $S_1S_0$ from the SCSG are first converted to $G_1'G_0'$ and $S_1'S_0'$, respectively, and are then checked using the same 2-bit FS SEDC checker, as shown in Figure 6(a). When the $G_2$ bit is "1", $G_1G_0$ and $S_1S_0$ are inverted, whereas if $G_2$ is "0", $G_1G_0$ and $S_1S_0$ remain the same. Because the outputs of the XOR gates are fed to the FS SEDC2 checker, any error in the XOR gates is detected. This makes the overall 3-bit SEDC checker FS.

Figure 3: 3D illustration of SEDC3 scheme.

Figure 4: Example of SEDC-based HW-level fault tolerance system.

4.4. The FS SEDC4 Checker. A 4-bit FS SEDC checker consists of one FS SEDC1 checker and one FS SEDC3 checker, as shown in Figure 6(b). Both the SEDC1 and SEDC3 checkers generate a 2-bit output $V_1V_0$. Because the valid code word is "10", to make sure that both checker units generate the "10" output during error-free operation, we "AND" the $V_1$ output bit of the FS SEDC1 checker with the $V_1$ output bit of the FS SEDC3 checker. We also "OR" the $V_0$ output bits of both FS SEDC checkers using wired logic gates. We checked and confirmed by fault simulation that the wired-AND and wired-OR gates are also FS for single faults (stuck-at-0, stuck-at-1, transistor-stuck-on, and transistor-stuck-off).

4.5. The FS SEDCn Checker. Like the SEDC code generator, the FS SEDC checker also consists of multiple 1-, 2-, and 3-bit FS SEDC checkers, depending upon the values of $a$ and $b$ from (1). For example, if $n = 8$ bits, then (1) gives $a = 2$ and $b = 2$. This requires one FS SEDC2 checker and two FS SEDC3 checkers to realize an 8-bit FS SEDC checker.

The area of the wired-AND-OR gates also increases as $n$ is increased. Figure 7 shows the block diagram of an $n$-bit FS SEDC checker. For $n = 8$ bits, there are a total of three FS SEDC checkers, each with a 2-bit output; hence a 3-input wired-AND gate and a 3-input wired-OR gate are required to combine all $V_1$ and $V_0$ bits. In general, for an $n$-bit input there are $a + 1$ FS SEDC checkers, each with a 2-bit output, so the wired-AND and wired-OR gates together have $k = 2 \times (a + 1)$ inputs. With each additional input to the wired-AND-OR network, one extra transistor is required by each of the wired gates. This causes the circuit to expand width-wise; hence the latency of the wired logic remains constant for any value of $n$.

The size of the load transistor driving these wired-AND and wired-OR gates also increases with the number of inputs, so we take the maximum fan-in of one gate to be 4. For $k > 4$, an extra load transistor is connected in parallel. Generally, for $k$ inputs we require $r = \lceil k/4 \rceil$ load transistors. A total of $k + r$ transistors is required to design the $k$-input wired AND-OR network, with a constant latency of one transistor.
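The bookkeeping above can be written out as a short Python sketch. It takes $a$ as input, counts $a + 1$ subcheckers, and interprets $k = 2(a + 1)$ as the total number of inputs to the combined wired AND-OR network. The second function models only the logical effect of the wired gates (AND over the $V_1$ bits, OR over the $V_0$ bits). Both function names are hypothetical.

```python
def wired_network_cost(a: int) -> dict:
    """Transistor count of the wired AND-OR network for a given a."""
    checkers = a + 1   # one b-bit checker plus a 3-bit checkers
    k = 2 * checkers   # every checker contributes V1 and V0
    r = -(-k // 4)     # ceil(k / 4) load transistors, fan-in limit of 4
    return {"checkers": checkers, "k": k, "r": r, "transistors": k + r}


def combine(outputs: list) -> tuple:
    """Merge per-segment (V1, V0) pairs into one 2-bit error indication."""
    v1 = int(all(v1 for v1, _ in outputs))  # wired-AND of all V1 bits
    v0 = int(any(v0 for _, v0 in outputs))  # wired-OR of all V0 bits
    return (v1, v0)


if __name__ == "__main__":
    print(wired_network_cost(2))              # the n = 8 example
    print(combine([(1, 0), (1, 0), (1, 0)]))  # all segments valid
    print(combine([(1, 0), (0, 0), (1, 0)]))  # one segment reports 00
    print(combine([(1, 0), (1, 1), (1, 0)]))  # one segment reports 11
```

For the $n = 8$ example ($a = 2$) this gives $k = 6$ inputs and $r = 2$ load transistors, i.e., 8 transistors in total.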

5. Experiments and Results

In this section, we present the experiments we conducted on the proposed FS SEDC checker and the overall proposed SEDC-based HW-level fault tolerance system. The results of each experiment are given along with the experimental details in the subsections below.

5.1. Fault Test on FS SEDC Checker. The FS SEDC1, SEDC2, SEDC3, and SEDC4 circuits in our paper were tested for stuck-at-0, stuck-at-1, transistor-stuck-ON, and transistor-stuck-OFF faults. We assume a single-fault model, where faults occur one at a time and there is enough time between the detection of the first fault and the occurrence of another fault [29]. In Table 2, we provide a summary of the fault analysis of an SEDC1 checker circuit. We applied one fault at a time in the circuit of Figure 5(a) and observed the output. In single-fault operation, the circuit either produced the correct output or produced a noncode word, and never an incorrect valid code word (exhibiting the FS property), as shown in Table 2.

Figure 5: CMOS-based circuits of FS (a) SEDC1 checker and (b) SEDC2 checker.

Figure 6: Block diagram of FS (a) SEDC3 checker and (b) SEDC4 checker.

Case 1 (transistor stuck ON). In Table 2, we show all six cases of transistor stuck ON faults (one at a time). For the cases with N3 or N4 stuck ON, the circuit shows fault detection by one input code combination (marked with a symbol in Table 2), and hence the circuit is self-testing, whereas the other cases show that the circuit is fault secure as well as code disjoint.

Case 2 (transistor stuck OFF). In Table 2, all six cases of transistor stuck OFF faults are shown. In the cases where N1 or N2 is stuck OFF, the circuit demonstrates the self-testing property (marked with a symbol in Table 2), and for the rest of the cases the circuit is fault secure.

Case 3 (input stuck at 0). When input G0 or S0 is stuck at 0, the circuit demonstrates the self-testing property; otherwise, it remains fault secure.

Case 4 (input stuck at 1). When input G0 or S0 is stuck at 1, the circuit shows the self-testing property; otherwise, it remains fault secure.

There is one case where the output becomes floating (i.e., P3 or P4 stuck OFF). In this case, if we consider the floating voltage as logic high, then the circuit is fault secure, and if we consider the floating voltage as logic low, then the circuit is self-testing. Hence, we can say that the circuit in Figure 5(a), which is a 1-bit SEDC checker, is FS. Similar analysis was carried out when testing the 2-, 3-, and 4-bit SEDC checkers, and we found that all these checkers are FS.
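The input stuck-at cases (Cases 3 and 4) can be reproduced with a small logic-level fault injection in Python. This sketch models the checker as $V_1 = \mathrm{NAND}(G_0, S_0)$ and $V_0 = \mathrm{NOR}(G_0, S_0)$, as read from Table 1. It does not cover the transistor stuck-ON and stuck-OFF faults, which need the circuit of Figure 5(a). The fault names and helper functions are illustrative.

```python
VALID_INPUTS = [(0, 1), (1, 0)]  # valid (G0, S0) code words
VALID_OUTPUT = (1, 0)            # valid (V1, V0) code word


def sedc1_checker(g0, s0):
    return (1 - (g0 & s0), 1 - (g0 | s0))  # (NAND, NOR)


# Each fault maps the applied inputs to the values the checker really sees.
INPUT_FAULTS = {
    "G0 stuck-at-0": lambda g0, s0: (0, s0),
    "G0 stuck-at-1": lambda g0, s0: (1, s0),
    "S0 stuck-at-0": lambda g0, s0: (g0, 0),
    "S0 stuck-at-1": lambda g0, s0: (g0, 1),
}


def analyse(fault):
    """Return the outputs for all valid inputs and the self-testing flag."""
    outputs = [sedc1_checker(*fault(g0, s0)) for g0, s0 in VALID_INPUTS]
    # Self-testing: at least one valid input exposes the fault as a noncode output.
    self_testing = any(out != VALID_OUTPUT for out in outputs)
    return outputs, self_testing


if __name__ == "__main__":
    for name, fault in INPUT_FAULTS.items():
        outputs, self_testing = analyse(fault)
        print(f"{name}: outputs={outputs} self-testing={self_testing}")
```

Each of the four input faults is exposed by exactly one of the two valid inputs, which is consistent with the self-testing behaviour reported for Cases 3 and 4. Because the output code has a single valid word, an output is either correct or a noncode word by construction in this model.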

5.2. Area, Delay, and Power Comparison. In this section, we compare the area and delay of TSC Berger, FS SEDC, and m-out-of-2m code checkers. We use the two possible TSC Berger checker implementations from Piestrak et al. [23] and Pierce Jr. and Lala [26], along with the m-out-of-2m code checker from Lala [24], for comparison. For the sake of fairness, the area overhead was measured in terms of the number of equivalent transistors. We made use of the assumptions by Smith [30] to translate gate-level circuits to transistor-level circuits.

Figure 7: Block diagram of FS SEDCn checker.

Table 2: Results of single faults on FS SEDC1 checker.

| Fault | G0 | S0 | V1 | V0 | Fault | G0 | S0 | V1 | V0 | Fault | G0 | S0 | V1 | V0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MOS P1 or P2 is stuck ON | 0 | 1 | 1 | 0 | MOS P1 or P2 is stuck OFF | 0 | 1 | 1 | 0 | Input C0 stuck at zero | ‰0 | 0 | 1 | 1 |
| | 1 | 0 | 1 | 0 | | 1 | 0 | 1 | 0 | | 1 | 0 | 1 | 0 |
| MOS P3 or P4 is stuck ON | 0 | 1 | 1 | 0 | MOS P3 or P4 is stuck OFF | 0 | 1 | Floating | 0 | Input F0 stuck at zero | ‰0 | 0 | 1 | 1 |
| | 1 | 0 | 1 | 0 | | 1 | 0 | 1 | 0 | | 0 | 1 | 1 | 0 |
| Transistor N1 is stuck ON | 0 | 1 | 1 | 0 | Transistor N1 is stuck OFF | 0 | 1 | 1 | 0 | Input C0 stuck at 1 | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 | | ‰1 | 0 | 1 | 1 | | ‰1 | 1 | 0 | 0 |
| Transistor N2 is stuck ON | 0 | 1 | 1 | 0 | Transistor N2 is stuck OFF | ‰0 | 1 | 1 | 1 | Input F0 stuck at 1 | 1 | 0 | 1 | 0 |
| | 1 | 0 | 1 | 0 | | 1 | 0 | 1 | 0 | | ‰1 | 1 | 0 | 0 |
| Transistor N3 is stuck ON | ‰0 | 1 | 0 | 0 | Transistor N3 is stuck OFF | 0 | 1 | 1 | 0 | - | - | - | - | - |
| | 1 | 0 | 1 | 0 | | 1 | 0 | 1 | 0 | | - | - | - | - |
| Transistor N4 is stuck ON | ‰0 | 1 | 1 | 0 | Transistor N4 is stuck OFF | 0 | 1 | 1 | 0 | - | - | - | - | - |
| | 1 | 0 | 0 | 0 | | 1 | 0 | 1 | 0 | | - | - | - | - |

‰ The cases where the circuit shows the self-testing property.

Before comparison, we illustrate the functional dissimilarities of the three checkers with the help of Figure 8. Figure 8(a) shows the general block diagram of a TSC Berger code checker. For all the information symbols that the ISG of the functional circuit can produce in normal operation, the check symbol complement generator (CSCG) outputs ($S_B'$) correspond to the bit-by-bit complement of the expected check symbol $S_B$. The TSC two-rail checker validates that each bit of $S_B$ is the complement of the corresponding bit of $S_B'$. As the size of the input data increases, the length of the check symbol $S_B$ also increases, resulting in a longer TSC two-rail checker tree and hence a longer delay.

A general block diagram of a TSC m-out-of-2m code checker is shown in Figure 8(b). The checker takes the information bits and check bits $S_W$ and partitions them into two parts. The numbers of 1's, i.e., the weights of both parts, are mapped to a pair of values which, in binary, belongs to a code, in most cases a two-rail code. The checker consists of a cellular structure of AND-OR gates, as given by Lala [24].

Figure 8(c) depicts the general block diagram for an FS SEDC checker, which resembles the structure of an m-out-of-2m code checker and differs from a Berger code checker. The FS SEDC checker block receives the information and check bits from the functional unit. If the input data length increases, the size of the FS checker block increases width-wise. The FS SEDCn block contains "a + 1" pairs of small SEDC checkers (subblocks). Each subblock of the FS SEDC checker produces "10" as the valid code output. The overall SEDC checker has a final 2-bit output $S_{10}$; unlike two-rail codes, only one of the output combinations, "10", is considered a valid code word. A nonvalid checker output ("00", "01", or "11") at output $S_{10}$ indicates the presence of a fault in the functional circuit or in the FS checker itself. The k-input wired AND-OR network takes the "a + 1" pairs of outputs from the SEDC checker subblocks and converts them into a final 2-bit error indication signal $V_S$.

Figure 8: Block diagrams of (a) TSC Berger checker, (b) m-out-of-2m code checker, and (c) FS SEDC checker.

Figure 9: Area comparison of area-optimized Berger [23], SEDC, and m-out-of-2m [24] code checkers (x-axis: data length in bits; y-axis: circuit size in number of transistors).

5.2.1. Area. The area-optimized realization of TSC Berger code checkers in Piestrak et al. [23] showed less area overhead than m-out-of-2m code checkers, which is apparent from Figure 9. But if we consider the delay-optimized implementation of the TSC Berger code checker from Pierce Jr. and Lala [26], we see that the TSC Berger code checker requires more area than the FS SEDC and m-out-of-2m code checkers [24], as shown in Table 3. For clarity, Table 3 lists the area overhead of code storage and of the code checker separately. Also listed separately are the area overheads required by the TRC tree for the TSC Berger code checker and by the wired-AND-OR network for the FS SEDC and m-out-of-2m code checkers.

For a fair comparison, the extra cost of the code storage area is also taken into account. We assumed that 1-bit storage is implemented by 12 MOS transistors [30]. Table 3 lists the area (in terms of the number of transistors) occupied by FS SEDC, delay-optimized Berger code, and m-out-of-2m code checkers for up to 32-bit data.

The FS SEDCn checker block shown in Figure 8(c) requires fewer gates, implemented with [26 + (a × 50)] MOS transistors if "b = 2", [50 + (a × 50)] MOS transistors if "b = 3", and [58 + (a × 50)] MOS transistors if "b = 4". The m-out-of-2m code checker implementation of Lala [24] requires $2m^2 - 2m + 2$ gates. The gate-level circuit is also translated to transistor-level circuits using data from Smith [30].
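The transistor-count expressions above can be checked against the "Checker Area" column of Table 3 with a short sketch. One assumption is made loudly here: the data length is split as n = 3a + b with b in {2, 3, 4}, which is inferred from the 7-bit (SEDC3 + SEDC4) and 8-bit (SEDC2 + two SEDC3) examples and from the Table 3 values, not quoted from a definition in this section. The function names are hypothetical.

```python
# Sketch under the assumption n = 3*a + b, b in {2, 3, 4}.
BASE_TRANSISTORS = {2: 26, 3: 50, 4: 58}  # cost of the b-bit subblock
SEDC3_TRANSISTORS = 50                     # cost of each 3-bit subblock


def split_data_length(n):
    """Return (a, b) such that n = 3*a + b with b in {2, 3, 4}."""
    if n < 2:
        raise ValueError("this sketch only covers n >= 2")
    b = 2 + (n - 2) % 3
    a = (n - b) // 3
    return a, b


def sedc_checker_transistors(n):
    """MOS transistors of the FS SEDCn checker block (without AND-OR network)."""
    a, b = split_data_length(n)
    return BASE_TRANSISTORS[b] + a * SEDC3_TRANSISTORS


def m_out_of_2m_gates(m):
    """Gate count of the cellular m-out-of-2m checker: 2m^2 - 2m + 2."""
    return 2 * m * m - 2 * m + 2


# "Checker Area" values for SEDC as listed in Table 3
table3_sedc = {2: 26, 3: 50, 4: 58, 5: 76, 7: 108, 8: 126,
               15: 250, 16: 258, 30: 500, 32: 526}

for n, expected in table3_sedc.items():
    print(n, split_data_length(n), sedc_checker_transistors(n), expected)
```

Because the subblocks work in parallel, the count grows by a fixed 50 transistors per additional 3-bit subblock, which is the linear scaling the area comparison relies on.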

The results show that when scaling a 7-bit 0's counter to an 8-bit 0's counter, 154 extra MOS transistors are required. The m-out-of-2m code checker requires 60 extra MOS transistors when scaling a 7-out-of-14 checker to an 8-out-of-16 checker, whereas the SEDC checker requires only 18 extra MOS transistors. That is because a 7-bit SEDC checker is implemented with one SEDC3 and one SEDC4 circuit, which contain 50 and 58 MOS transistors, respectively (a total of 108 transistors). An 8-bit SEDC checker is implemented using one SEDC2 and two SEDC3 checkers, requiring 26 and 100 (50 × 2) MOS transistors (a total of 126 transistors). This means that, for this scaling step, SEDC saves 88% of the transistors compared to a Berger code checker [26] and 70% of the transistors compared to m-out-of-2m code checkers. Although Berger and m-out-of-2m checkers are TSC while the proposed SEDC checker is only FS, all three checkers provide the same fault security.

Table 3: Area overhead (number of transistors) of Berger [26], SEDC, and m-out-of-2m [24] code checkers.

| Data bits | Berger: code storage area | Berger: 1's counter area | Berger: TRC area | Berger: total area | SEDC: code storage area | SEDC: checker area | SEDC: AND-OR network | SEDC: total area | m-out-of-2m: code storage area | m-out-of-2m: checker area | m-out-of-2m: AND-OR network | m-out-of-2m: total area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 24 | 22 | 4 | 50 | 24 | 26 | 0 | 50 | 24 | 36 | 0 | 50 |
| 3 | 24 | 80 | 8 | 112 | 24 | 50 | 0 | 74 | 36 | 152 | 0 | 188 |
| 4 | 36 | 180 | 12 | 228 | 36 | 58 | 6 | 100 | 48 | 240 | 10 | 298 |
| 5 | 36 | 178 | 16 | 230 | 48 | 76 | 6 | 130 | 60 | 300 | 14 | 374 |
| 7 | 36 | 396 | 24 | 456 | 60 | 108 | 8 | 176 | 84 | 420 | 18 | 522 |
| 8 | 48 | 550 | 28 | 626 | 72 | 126 | 8 | 206 | 96 | 480 | 20 | 596 |
| 15 | 48 | 1106 | 56 | 1210 | 120 | 250 | 14 | 384 | 180 | 900 | 38 | 1118 |
| 16 | 60 | 1308 | 60 | 1428 | 132 | 258 | 16 | 406 | 192 | 960 | 40 | 1192 |
| 30 | 60 | 2586 | 116 | 2762 | 240 | 500 | 26 | 766 | 360 | 1800 | 76 | 2236 |
| 32 | 72 | 3048 | 120 | 3240 | 264 | 526 | 28 | 818 | 384 | 1920 | 80 | 2384 |
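The 7-bit to 8-bit scaling figures quoted in the text follow directly from the counter/checker area columns of Table 3. The short sketch below reproduces that arithmetic; the dictionary and variable names are illustrative only.

```python
# Checker (or counter) area in transistors for 7-bit and 8-bit data, from Table 3.
checker_area = {
    "Berger": (396, 550),
    "m-out-of-2m": (420, 480),
    "SEDC": (108, 126),
}

# Extra transistors needed when scaling from 7-bit to 8-bit data.
extra = {name: a8 - a7 for name, (a7, a8) in checker_area.items()}

# Relative saving of SEDC in the extra transistors, in percent.
saving_vs_berger = round(100 * (1 - extra["SEDC"] / extra["Berger"]))
saving_vs_m_out_of_2m = round(100 * (1 - extra["SEDC"] / extra["m-out-of-2m"]))

print(extra)
print(saving_vs_berger, saving_vs_m_out_of_2m)
```

Note that the percentages describe the incremental cost of this one scaling step, not the ratio of total checker areas.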

5.2.2. Delay. As far as delay is concerned, the FS SEDC checker also performs better than the Berger and cellular implementations of an m-out-of-2m code checker, as shown in Table 4. For the sake of uniformity, we designed all the basic gates using the same technology transistors (PMOS = 8μ/2μ, NMOS = 4μ/2μ) and evaluated the worst-case propagation delay of each circuit.

The SEDC checker shows an almost constant delay for n > 3 bits due to its parallel implementation, whereas the delay in the Berger code checker increases owing to an increase in gate levels (from 6 to 16) in the critical path, as shown by Pierce Jr. and Lala [26]. The delay for m-out-of-2m code checkers also continues to increase with increasing data lengths, because the cellular implementation requires "m (= input data length)" gate levels in the critical path.

5.2.3. Power Dissipation. In order to evaluate the power dissipation of the three checkers, we used the PowerPlay power analyzer tool. We implemented the Berger [26], m-out-of-2m [24], and SEDC checkers using Verilog and synthesized the circuits using Altera's Quartus II software. We targeted the circuit for a Cyclone II EP2C5AF256A7 chip, which has the least power dissipating properties among the Cyclone family. We allowed the synthesizer to create a balance between area and delay while synthesizing in order to get a better power estimate. We also enabled the synthesis mode that takes intensive steps to optimize power for all three circuits. We clocked the inputs of the circuit with the default toggle rate and estimated the total thermal power dissipation for different values of input data width.

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This happens due to the increase in the number of two-rail checkers in the case of the Berger checker, and due to the increase in the checker circuitry itself in the case of the m-out-of-2m checker. This is also evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LEs) occupied by the checkers.

5.3. Fault Coverage of the Proposed HW-Level Fault Tolerance Scheme. In order to demonstrate the effectiveness of the SEDC CSG and its FS checker, we computed the fault coverage of the proposed SEDC-based HW-level fault tolerance scheme. We applied faults in the example circuit of Figure 4, given in Section 3.4. Most VLSI combinational circuits designed for mathematical operations (add, subtract, multiply, divide, etc.) consist of multiple instances of 1-bit adders (full adders); hence the example circuit, i.e., a 4-bit adder, is a simple and good candidate for presenting the effectiveness of our scheme. We injected two major types of transient errors, i.e., stuck-at-0 and stuck-at-1 [29], at 24 nodes (6 nodes per full adder, as shown in Figure 11(b)). We injected these errors using 2-to-1 multiplexers, whose output is given by

$$
\mathit{mux}_{out} =
\begin{cases}
\mathit{in1}\ (\text{normal gate output}), & \text{if } \mathit{select}\ (\mathit{f\_enable}) = 0, \\
\mathit{in2}\ (\text{stuck-at fault } f \in F), & \text{if } \mathit{select}\ (\mathit{f\_enable}) = 1.
\end{cases}
\tag{8}
$$

In Figure 11(a), the symbols A[3:0], B[3:0], Cin, f_enable, and F[23:0] denote the 4-bit input A, 4-bit input B, 1-bit carry-in, 1-bit fault-enabling signal, and 24-bit fault signals, respectively, while Cout is the carry-out and S[3:0] represents the 4-bit sum output of the 4-bit adder. Figure 11(b) shows the detailed schematic of a single full adder.

We considered that faults can occur at the outputs of the logic gates only, and adopted a single-fault model, according to which only one fault can occur at a time [29].
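The multiplexer-based injection of (8) under the single-fault model can be sketched as follows. This is a behavioural illustration, not the authors' Verilog: the full adder is assumed to be the textbook five-gate structure (two XOR, two AND, one OR), so it exposes five injection nodes rather than the six shown in Figure 11(b), and the node names (`x1`, `s`, `a1`, `a2`, `cout`) are hypothetical.

```python
def mux2_1(in1, in2, select):
    """2-to-1 multiplexer of (8): in1 when select = 0, in2 when select = 1."""
    return in2 if select else in1


def full_adder(a, b, cin, f_enable=0, fault_node=None, stuck_value=0):
    """1-bit full adder with an injection mux at every gate output.

    Only the gate named by fault_node is forced to stuck_value, and only
    while f_enable = 1 (single-fault model).
    """
    def node(name, value):
        inject = 1 if (f_enable and fault_node == name) else 0
        return mux2_1(value, stuck_value, inject)

    x1 = node("x1", a ^ b)        # first XOR
    s = node("s", x1 ^ cin)       # sum output
    a1 = node("a1", a & b)        # carry generate
    a2 = node("a2", x1 & cin)     # carry propagate
    cout = node("cout", a1 | a2)  # carry output
    return s, cout


# Fault-free behaviour versus a stuck-at-0 fault on the sum node
print(full_adder(1, 0, 0))
print(full_adder(1, 0, 0, f_enable=1, fault_node="s", stuck_value=0))
```

Keeping the mux at every gate output means the fault-free and faulty circuits share one description, so an exhaustive campaign reduces to looping over nodes, stuck values, and input vectors.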

Figure 10: Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder built from four full adders (FA1 to FA4), each with inputs A, B, Cin, f_enable, and F[5:0] and outputs S and Cout; (b) 1-bit full adder with fault injection through six 2-to-1 multiplexers.

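Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

| Data bits | Berger | SEDC | m-out-of-2m |
|---|---|---|---|
| 2 | 3.888 | 0.514 | 1.024 |
| 3 | 4.151 | 2.524 | - |
| 4 | 7.741 | 2.738 | 5.490 |
| 5 | - | 2.713 | 5.558 |
| 7 | 7.821 | 2.77 | 8.297 |
| 8 | 7.599 | 2.76 | 9.284 |
| 15 | 10.566 | 2.826 | - |
| 16 | 12.956 | 2.751 | |
| 32 | 17.964 | 2.771 | - |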

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

| | (a) Total errors at the output of the adder | (b) BEs | (c) Detected BEs | (d) UEs | (e) Detected UEs | (f) Total detected errors | (g) Total undetected errors |
|---|---|---|---|---|---|---|---|
| Total | 1748 | 252 | 120 | 1496 | 1496 | 1616 | 132 |
| Percentage (%) | 100 | 14.42 w.r.t. (a) | 47.62 w.r.t. (b) | 85.58 w.r.t. (a) | 100 w.r.t. (d) | 92.45 w.r.t. (a) | 7.55 w.r.t. (a) |
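The percentage row of Table 5 can be re-derived from the raw counts in the "Total" row. The sketch below does only that arithmetic; the variable names are illustrative.

```python
# Raw counts from the "Total" row of Table 5.
total_errors = 1748
bes, detected_bes = 252, 120
ues, detected_ues = 1496, 1496


def pct(part, whole):
    """Percentage rounded to two decimals, as reported in Table 5."""
    return round(100 * part / whole, 2)


total_detected = detected_bes + detected_ues
total_undetected = total_errors - total_detected

print(pct(bes, total_errors))             # share of BEs among all errors
print(pct(detected_bes, bes))             # BE coverage
print(pct(ues, total_errors))             # share of UEs among all errors
print(pct(detected_ues, ues))             # UE coverage
print(pct(total_detected, total_errors))  # overall coverage
print(pct(total_undetected, total_errors))
```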

We used Altera's Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these 1748 faults resulted in bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This supports the observation that most of the errors in VLSI circuits result in UEs at the output [19–21]. Even though SEDC is an AUED scheme that provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. This is because SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of BEs is also partitioned into multiple UEs and is thus detected by the proposed SEDC scheme.
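The distinction between UEs and BEs, and the way segmentation can turn one BE into several UEs, can be illustrated with a small sketch. It uses the standard definitions (a UE flips bits in one direction only, a BE contains both 0→1 and 1→0 flips); the 6-bit word, the 3-bit segment width, and the function names are hypothetical choices for illustration, and the sketch does not implement the SEDC encoding itself.

```python
def classify_error(expected, observed, width):
    """Classify the difference between two words as no error, UE, or BE."""
    rising = falling = 0
    for i in range(width):
        e = (expected >> i) & 1
        o = (observed >> i) & 1
        if e == 0 and o == 1:
            rising += 1       # 0 -> 1 flip
        elif e == 1 and o == 0:
            falling += 1      # 1 -> 0 flip
    if rising == 0 and falling == 0:
        return "no error"
    if rising and falling:
        return "bidirectional"
    return "unidirectional"


def classify_by_segment(expected, observed, width, segment_width):
    """Classify each segment independently, least significant segment first."""
    results = []
    for low in range(0, width, segment_width):
        w = min(segment_width, width - low)
        mask = (1 << w) - 1
        results.append(classify_error((expected >> low) & mask,
                                      (observed >> low) & mask, w))
    return results


expected, observed = 0b101010, 0b111000  # bit 4 rises, bit 1 falls
print(classify_error(expected, observed, 6))
print(classify_by_segment(expected, observed, 6, 3))
```

In this example the whole word carries a BE, but each 3-bit segment sees only a UE, which is the situation an AUED code applied per segment can detect. A BE whose opposite flips fall inside the same segment stays bidirectional and may escape detection.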

5.4. Cost Analysis: SW-Based Fault Tolerance versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity, in our analysis we take the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node, or in some cases on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of the failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].
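The heartbeat-timeout rule described above can be sketched as a small bookkeeping class. This is a simplified illustration and not HDFS code: the class name `HeartbeatMonitor`, its methods, the node names, and the timeout value are all hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat of each node, as the PNN would."""

    def __init__(self, timeout):
        self.timeout = timeout  # allowed silence before a node is declared down
        self.last_seen = {}     # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def out_of_service(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


monitor = HeartbeatMonitor(timeout=10)
monitor.heartbeat("snn1", now=0)
monitor.heartbeat("snn2", now=8)

# At time 12, snn1 has been silent for 12 units and would trigger recovery.
print(monitor.out_of_service(now=12))
```

In the scheme described in the text, each node returned by `out_of_service` would be the trigger for the CC recovery step, i.e., signaling the other nodes and re-replicating the failed node's data.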

For our cost analysis, we would like to compute the cost associated with the CC data recovery algorithm, for which we assume a cloud application, such as a message passing interface (MPI) program, that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p - 1$ processes with equal probabilities. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events (i.e., every step taken by the process has a known outcome) and that failure only occurs during message passing, with equal probability, and not during checkpointing or recovery, we use the analytical cost model given in [4] for the cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then the total cost of fault tolerance per process is given by

$$
C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}
$$

Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then, according to [4], the average recovery cost in CC per process is given by

$$
T_{RO} = \frac{C_{rb}}{1/P(f)} = P(f)\,C_{rb}. \tag{10}
$$

Let $P(cp)$ denote the probability that a process starts checkpointing; then $(1 - P(cp))^p$ becomes the probability that $p$ processes do not start checkpointing, while $1 - (1 - P(cp))^p$ becomes the probability that at least one process starts a checkpoint. Consequently, $1/(1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$ and generate request (REQ) and acknowledgement (ACK) signals to the rest of the $p - 1$ noninitiators (a total of $2(p - 1)$ signals), and likewise be a noninitiator with probability $1 - 1/p$ and generate only one ACK signal in response to the initiator. As a result, there are $3(p - 1)/p$ average messages generated per checkpoint, and the average overhead per checkpoint is $C_w + (3(p - 1)/p)\,C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then the average checkpointing cost for a process is given by

$$
T_{CP} = \frac{C_w + \dfrac{3(p - 1)}{p}\,C_{nl}}{1 / \left(1 - (1 - P(cp))^p\right)}
= \left(1 - (1 - P(cp))^p\right)\left(C_w + \frac{3(p - 1)}{p}\,C_{nl}\right). \tag{11}
$$
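The two quantities that feed (11), the average number of messages per checkpoint and the checkpointing interval, can be verified with exact rational arithmetic. The sketch below follows the derivation in the text; the function names are illustrative.

```python
from fractions import Fraction


def average_messages(p):
    """Expected messages generated by one process per checkpoint.

    With probability 1/p the process is the initiator and sends 2(p - 1)
    signals; otherwise it is a noninitiator and sends a single ACK.
    """
    as_initiator = Fraction(1, p) * 2 * (p - 1)
    as_noninitiator = (1 - Fraction(1, p)) * 1
    return as_initiator + as_noninitiator


def checkpoint_interval(p, p_cp):
    """Reciprocal of the probability that at least one of p processes
    starts a checkpoint."""
    return 1 / (1 - (1 - p_cp) ** p)


print(average_messages(128))                    # equals 3(p - 1)/p for p = 128
print(checkpoint_interval(1, Fraction(1, 15)))  # single process: interval 1/P(cp)
```

As $p$ grows, the average message count approaches 3 and the interval approaches 1, so for large $p$ the per-checkpoint overhead is dominated by $C_w + 3\,C_{nl}$.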

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpoint per 15 minutes), $C_{nl} = 20$ msecs, $C_w = 1$ sec, and $C_{rb} = 2$ secs, as given in [4]. We consider the value of $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, while a fault occurs every 168 hours (one week). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the value 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters while keeping the others constant and observe the effect on data recovery cost with and without the proposed HW-level fault tolerance.
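A minimal sketch of how (9), (10), and (11) combine is given below, using the parameter values listed above. Two things are assumptions and not taken from the text: the total execution time `T` (not stated in this section, so a placeholder is used) and the treatment of the probabilities as dimensionless values even though $P(cp)$ and $P(f)$ are quoted per minute and per hour. The printed costs are therefore illustrative and are not the values plotted in Figures 12 to 14.

```python
def checkpoint_cost(p, p_cp, c_w, c_nl):
    """Equation (11): average checkpointing cost per process."""
    return (1 - (1 - p_cp) ** p) * (c_w + 3 * (p - 1) / p * c_nl)


def recovery_cost(p_f, c_rb):
    """Equation (10): average recovery cost per process."""
    return p_f * c_rb


def total_cost_percent(t, t_cp, t_ro):
    """Equation (9): fault tolerance cost as a percentage of runtime T."""
    return (t_cp + t_ro) / t * 100


# Parameters quoted in the text
P = 128              # processes (virtual machines)
P_CP = 1 / 15        # checkpointing probability
C_NL = 0.020         # network latency, seconds
C_W = 1.0            # checkpoint write time, seconds
C_RB = 2.0           # rollback time, seconds
P_F = 1 / 168        # failure probability without HW-level fault tolerance
UNHANDLED = 0.0755   # 7.55% of faults remain unhandled (Table 5)

T = 100.0            # ASSUMPTION: placeholder runtime, not given in the text

t_cp = checkpoint_cost(P, P_CP, C_W, C_NL)
cost_sw_only = total_cost_percent(T, t_cp, recovery_cost(P_F, C_RB))
cost_with_hw = total_cost_percent(T, t_cp, recovery_cost(UNHANDLED * P_F, C_RB))

print(f"SW-level only : {cost_sw_only:.4f}%")
print(f"with HW-level : {cost_with_hw:.4f}%")
```

Since HW-level detection only scales $P(f)$, it lowers the recovery term $T_{RO}$ and leaves the checkpointing term $T_{CP}$ unchanged, so the benefit grows with the failure probability.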

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes $p$ is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into $p$ processes and each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in data recovery cost in the CC algorithm, because every process has to coordinate with every other process in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery, because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which caused a huge impact on fault tolerance overhead, as shown in Figure 14. But if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures that could lead to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains failsafe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the comparison favors the proposed SEDC checker on all three measures. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data, both at SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University, 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, "Big data: A survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, "A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing," in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, "Big data: issues, challenges, tools and good practices," in Proceedings of the 6th International Conference on Contemporary Computing (IC3 '13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, "Cost-performance of fault tolerance in cloud computing," Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, "A survey of fault tolerance architecture in cloud computing," Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, "Fault-tolerant and reliable computation in cloud computing," in Proceedings of the 2010 IEEE Globecom Workshops, GC'10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, "Using proactive fault-tolerance approach to enhance cloud service reliability," IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, "FatTire: Declarative fault tolerance for software-defined networks," in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN '13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, "Integrating scale out and fault tolerance in stream processing using operator state management," in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD '13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters," in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, "Understanding Real World Data Corruptions in Cloud Systems," in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, "Discussion of Reliability Meets Big Data: Opportunities and Challenges," Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, "Big data and the opportunities it creates for semiconductor players," in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, "Construction of a soft error (SEU) hardened Latch with high critical charge," in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT '16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, "Robust system design with built-in soft-error resilience," The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, "Characterization of soft errors caused by single event upsets in CMOS processes," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, "Design for testability," VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, "State-of-the-art techniques for detecting transient errors in electrical circuits," IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, "Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude," in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC '14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, "Oral fibrolipoma: A rare histological entity, report of 3 cases and review of literature," Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, "A t-unidirectional error-detecting systematic code," Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, "Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction," in Proceedings of the 2012 IEEE International Conference on Communications, ICC '12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, "On the design of self-testing checkers for modified Berger codes," in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, "Self-checking look-up tables using scalable error detection coding (SEDC) scheme," Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.


[26] D. A. Pierce Jr. and P. K. Lala, "Modular implementation of efficient self-checking checkers for the Berger code," Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, "Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding," in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, "Online error detection in SRAM based FPGAs using Scalable Error Detection Coding," in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED '13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, "Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes," IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, Piscataway, NJ, USA, May 2010.


Figure 1: Block diagram of the proposed hardware-level fault tolerance system. The functional circuit comprises an information symbol generator and a check symbol generator, both driven by the inputs D; the checker receives the information bits G(D) and the check bits S and produces the error indication signal V.

applications. The major contributions of this paper are as follows:

(1) We propose HW-level fault tolerance for circuits designed to process big data and cloud computing applications.

(2) To show the effectiveness of the proposed HW-level fault tolerance scheme in a big data scenario, we compare the cost of fault tolerance with and without the proposed scheme and present results that show a significant reduction in the overall cost of fault tolerance in big data when the proposed HW-based fault tolerance scheme is applied.

(3) We also present a novel FS SEDC checker for use with SEDC-based HW-level fault tolerance systems.

(4) To compare the presented FS SEDC checker with state-of-the-art AUED checkers, we show that the FS SEDC checker achieves state-of-the-art performance in terms of area, delay, and power dissipation.

The rest of the paper is organized as follows. We present an overall system diagram of the proposed HW-level fault tolerance system in Section 2. We give a brief mathematical foundation of the SEDC scheme and an example of encoding logic circuits using SEDC in Section 3. Design details of the FS SEDC checker are described in Section 4. The proposed checker is shown to be FS through fault testing, and its area, delay, and power comparison with the state of the art is derived in Section 5. We compute the fault coverage of the proposed SEDC-based fault tolerance system and present the experimental details and results in Section 5. To show the effectiveness of the proposed method in big data and cloud computing applications, we also perform a cost-performance analysis of fault tolerance at the SW level versus the HW level in Section 5. Finally, we conclude the paper in Section 6.

2. Introduction to the Overall System

Figure 1 shows the main components of an HW-level fault tolerance system based on error detecting codes. The functional circuit consists of two subcircuits: an information symbol generator (ISG) and a check symbol generator (CSG). These two circuits do not share any logic. The ISG takes input $D$, performs some operation $G$, and produces output $G(D)$. The CSG is a carefully chosen logic function that acts as the encoder and generates check bits $S$ from the same input $D$ such that $S = \phi(G(D))$, where $\phi$ denotes the particular coding function. The checker normally contains another encoder that reencodes the information bits $G(D)$ into $S' = \phi(G(D))$ and then compares $S$ with $S'$. A mismatch between $S$ and $S'$ is treated as an error, which is indicated by the error indication (or verification) signal $V$.

The checker shown in Figure 1 plays a vital role in the overall fault tolerance system. The checker must exhibit a self-checking or failsafe property to ensure that the whole system is fault secure (FS). If the checker is both self-checking and failsafe, the overall system is said to be totally self-checking (TSC). To define these properties formally, let the output of the functional circuit shown in Figure 1 be represented by $G(D) = G(x, f)$, where $x$ is the input and $f$ is the fault; in fault-free operation, i.e., $f = 0$, the output is $G(x, 0)$. Also consider the input code space $D \subseteq X$, the output code space $S \subseteq Y$, and an assumed fault set $F$. Then, according to the definition of totally self-checking (TSC), $G$ is

(1) self-testing if, for each fault $f$ in $F$, there exists at least one code input $d \in D$ that produces a noncode output, i.e., $\forall f \in F,\ \exists d \in D \ni G(d, f) \notin S$;

(2) fault secure (FS) if, for all faults $f$ in $F$ and all code inputs $d \in D$, the output is either correct or a noncode word, i.e., $\forall f \in F$ and $\forall d \in D$, $G(d, f) = G(d, 0)$ or $G(d, f) \notin S$.
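For small circuits, the two properties above can be checked by brute force. The following Python sketch is only an illustration under stated assumptions: the toy circuit `toy_circuit`, its fault list, and the helper names are hypothetical and not taken from the paper, and fault-free operation is modeled by `None` rather than by $f = 0$.

```python
from itertools import product

def is_self_testing(G, code_inputs, faults, is_code_output):
    """Every fault yields a noncode output for at least one code input."""
    return all(
        any(not is_code_output(G(d, f)) for d in code_inputs)
        for f in faults
    )

def is_fault_secure(G, code_inputs, faults, is_code_output):
    """Under any fault, each output is either correct or a noncode word."""
    return all(
        G(d, f) == G(d, None) or not is_code_output(G(d, f))
        for f, d in product(faults, code_inputs)
    )

# Hypothetical toy circuit: output (g, s) with check bit s = NOT g.
# A fault is (line, stuck_value); None means fault-free operation.
def toy_circuit(d, fault):
    out = {"g": d, "s": 1 - d}
    if fault is not None:
        line, value = fault
        out[line] = value
    return (out["g"], out["s"])

FAULTS = [(line, v) for line in ("g", "s") for v in (0, 1)]
INPUTS = [0, 1]

def is_code(y):
    # Code words of the toy circuit are (0, 1) and (1, 0).
    return y[0] != y[1]

print(is_self_testing(toy_circuit, INPUTS, FAULTS, is_code))
print(is_fault_secure(toy_circuit, INPUTS, FAULTS, is_code))
```

An unencoded circuit, in which every output is a valid word, fails the fault-secure test as soon as a fault changes an output, which is why the check bits are needed.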

In the proposed SEDC-based HW-level fault tolerance system, the CSG circuit is realized by an SEDC check symbol generator (SCSG) circuit, which generates the SEDC code words corresponding to the information bits $G(D)$. We presented a realization of an SEDC-encoded SCSG circuit in [27], i.e., an SEDC-encoded arithmetic logic unit (ALU) of a microprocessor. The SEDC-encoded ALU circuit (SCSG) computes the SEDC codes corresponding to the output of the ISG (in [27], a normal ALU). Any fault that causes multiple unidirectional errors at the output of the normal ALU is detected by the SEDC checker. Any logic circuitry, including SRAM-based memory cells [28], can be made fault tolerant by encoding it with methods similar to those given in [27, 28]. In the next section, we briefly introduce the SEDC scheme with an example that encodes an adder circuit, while the rest of the paper focuses on the proposed FS SEDC checker, which can be used with any SEDC-based HW-level fault tolerance system.

Figure 2: (a) SEDC scheme for a given data word, in which the $n$-bit data word $(D_{n-1}, \ldots, D_0)$ is split into one $b$-bit segment and $a$ 3-bit segments that are encoded by one SEDCb module and $a$ SEDC3 modules to form the code word $(S_{m-1}, \ldots, S_0)$, and (b) 2D illustration of the SEDC2 scheme.

3. Scalable Error Detection Coding (SEDC) Scheme

The Scalable Error Detection Coding scheme [25] is an AUED scheme formulated and designed in such a way that only the resultant circuit area scales with the data length, while its latency depends on only a small portion of the input data (explained later).

For any binary data $D$ of length $n$ bits, represented as $(D_{n-1}, \ldots, D_2, D_1, D_0)$ with $D_i \in \{0, 1\}$ for $0 \le i \le n - 1$, two parameters $a$ and $b$ are computed using

$$a = \frac{n - \max(b)}{3}, \tag{1}$$

where parameter $a$ can only take a positive integer value, i.e., $a \in \mathbb{Z}^{+}$, and parameter $b \in \{2, 3, 4\}$. The maximum value of $b$ that satisfies the condition on $a$ is selected. The SEDC code word $S$ is represented as $(S_{m-1}, \ldots, S_j, \ldots, S_2, S_1, S_0)$ with $S_j \in \{0, 1\}$ for $0 \le j \le m - 1$, where $m$ denotes the length of the SEDC code word and is computed by

$$m = \lceil \log_2 (n + 1 - 3a) \rceil + 2a. \tag{2}$$

After computing the values of parameters $a$ and $b$, the SEDC code $S$ for binary data $D$ is computed. SEDC is designed to generate codes for 2-, 3-, and 4-bit data and is accordingly referred to as the SEDC2, SEDC3, and SEDC4 scheme, respectively. It is then extended to any integer value of $n$, as shown in Figure 2(a).
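The parameter selection in (1) and (2) can be sketched in a few lines of Python. The function name `sedc_params` is hypothetical, and the sketch assumes that $a = 0$ is acceptable for the primitive cases $n \le 4$, where the data consists of the $b$-bit segment alone.

```python
import math

def sedc_params(n):
    """Return (a, b, m) for an n-bit data word, following (1) and (2)."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    # Pick the largest b in {4, 3, 2} for which a = (n - b) / 3 is a
    # non-negative integer. Assumption: a = 0 is allowed when n <= 4.
    for b in (4, 3, 2):
        if n >= b and (n - b) % 3 == 0:
            a = (n - b) // 3
            break
    m = math.ceil(math.log2(n + 1 - 3 * a)) + 2 * a
    return a, b, m

for n in (4, 5, 7, 8, 16, 32):
    print(n, sedc_params(n))
```

For $n = 8$ the sketch returns $a = 2$, $b = 2$, and $m = 6$, and for the 5-bit adder output used later it returns a 4-bit code word.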

3.1. SEDC2 Code. A two-dimensional (2D) illustration of the 2-bit SEDC (SEDC2) scheme is shown in Figure 2(b), where nodes represent data words and their corresponding code words are written in brackets.

The SEDC coding scheme assigns code words to data words according to a single criterion: whenever a bit (or bits) in a data word changes from "1" → "0", as shown with a bold arrow in Figure 2(b), the change is reflected in the code word in the opposite direction, i.e., the code changes from "0" → "1", as shown with the dashed arrow in Figure 2(b), and vice versa. Equation (3) is used to assign the 2-bit code words $S_1 S_0$ to the 2-bit data words $D_1 D_0$. Clearly, we can interchange the bit positions of $S_1$ and $S_0$ to obtain another variant of the SEDC2 code; this does not affect the code characteristics.

$$[S_1, S_0] = \mathrm{SEDC2}(D_1, D_0) = [\mathrm{XNOR}(D_1, D_0),\ \mathrm{NAND}(D_1, D_0)]. \tag{3}$$

In (3), $[S_1, S_0]$ represents the concatenated SEDC code bits, XNOR and NAND are the logical operations, and SEDC2 is the basic coding scheme.
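As a quick illustration of (3), the sketch below enumerates the SEDC2 code table and checks that no code word (data bits followed by check bits) covers another. That property, which is our framing rather than the paper's wording, is what guarantees that a unidirectional error cannot turn one valid word into another. The helper names are hypothetical.

```python
from itertools import product

def sedc2(d1, d0):
    """SEDC2 code bits [S1, S0] = [XNOR(D1, D0), NAND(D1, D0)], as in (3)."""
    s1 = 1 - (d1 ^ d0)      # XNOR
    s0 = 1 - (d1 & d0)      # NAND
    return (s1, s0)

def covers(x, y):
    """True if word x has a 1 in every position where word y has a 1."""
    return all(xi >= yi for xi, yi in zip(x, y))

# Full code words: data bits followed by check bits.
words = [(d1, d0) + sedc2(d1, d0) for d1, d0 in product((0, 1), repeat=2)]
for w in words:
    print(w[:2], "->", w[2:])

# A unidirectional error produces a word that covers the original (0 -> 1
# errors) or is covered by it (1 -> 0 errors). If no code word covers another,
# such an error can never yield a different valid code word.
unordered = all(not covers(x, y) for x in words for y in words if x != y)
print("unordered:", unordered)
```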

3.2. SEDC3 Code. The SEDC3 code for 3-bit data is computed using (4) as follows:

$$[S_1, S_0] = \mathrm{SEDC3}(D_2, D_1, D_0) = \begin{cases} \mathrm{SEDC2}(D_1, D_0) & \text{if } D_2 = 0, \\ \overline{\mathrm{SEDC2}(\overline{D_1}, \overline{D_0})} & \text{if } D_2 = 1, \end{cases} \tag{4}$$

where the bar sign (e.g., $\overline{D_1}$) in (4) represents the logical NOT operation.

Figure 3 shows a 3D cube illustrating the unidirectional error detection mechanism of SEDC3 codes. The same notation is used in Figure 3 as in Figure 2(b). The dashed side of the cube represents the SEDC2 coding scheme embedded in SEDC3. Note that when there is a 2-bit unidirectional change in the data word "001" → "111" (the two MSBs changing from "00" → "11"), the code changes in the opposite direction (the least significant bit of the code changes from "1" → "0"). In a similar way, the SEDCn scheme detects $n$-bit, i.e., all, unidirectional errors in the data word $D$.
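The SEDC3 construction can be sketched as follows. The sketch assumes that, for $D_2 = 1$, both the inputs and the outputs of SEDC2 are complemented; this reading is consistent with the "001" → "111" example above and with the XOR conditioning used by the 3-bit checker, but it is our interpretation rather than a verbatim transcription of (4). Function names are hypothetical.

```python
def sedc2(d1, d0):
    """[S1, S0] = [XNOR(D1, D0), NAND(D1, D0)]."""
    return (1 - (d1 ^ d0), 1 - (d1 & d0))

def sedc3(d2, d1, d0):
    """SEDC3 built on SEDC2 (assumed form: complement inputs and outputs
    of SEDC2 when D2 = 1, pass through when D2 = 0)."""
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)

for value in range(8):
    bits = ((value >> 2) & 1, (value >> 1) & 1, value & 1)
    print(bits, "->", sedc3(*bits))
```

With this form, "001" maps to code (0, 1) and "111" maps to code (0, 0): the data bits rise while the least significant code bit falls.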


3.3. SEDC4 Code. The SEDC4 code for 4-bit data is formulated by (5) as follows:

$$[S_2, (S_1, S_0)] = \mathrm{SEDC4}(D_3, D_2, D_1, D_0) = [\overline{D_3},\ \mathrm{SEDC3}(D_2, D_1, D_0)]. \tag{5}$$

The MSB of the code word depends only on the MSB of the data word for SEDC4; hence any change in the MSB of the data word is detected. The remaining three data bits are encoded using the same SEDC3 scheme.

It can be observed from (3), (4), and (5) that SEDC2 is embedded in the 3-bit SEDC (SEDC3) and consequently in the 4-bit SEDC (SEDC4) to detect all unidirectional errors in 3-bit and 4-bit data, as shown later. This ability to scale codes is not present in any other concurrent error detecting (CED) coding scheme.

In general, for SEDCn, the $n$-bit binary data is grouped into one $b$-bit segment and $a$ 3-bit segments, and these segments are then encoded in parallel using one SEDCb module and $a$ SEDC3 modules, as shown in Figure 2(a). It is noteworthy that each data segment and its corresponding code segment are independent of the others. This independence makes our scheme scalable and able to detect some portion of bidirectional errors (BEs) (discussed in Section 5.3).
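The segmentation described above can be sketched as a small encoder. This is an illustrative model, not the authors' implementation: the function names are hypothetical, the $b$-bit segment is taken from the most significant bits as suggested by Figure 2(a), SEDC3 is assumed to complement both the inputs and the outputs of SEDC2 when $D_2 = 1$, and the SEDC4 check bit $S_2$ is assumed to be the complement of $D_3$.

```python
def sedc2(d1, d0):
    return (1 - (d1 ^ d0), 1 - (d1 & d0))

def sedc3(d2, d1, d0):
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)

def sedc4(d3, d2, d1, d0):
    # Assumed: the extra check bit is the complement of the data MSB.
    return (1 - d3,) + sedc3(d2, d1, d0)

def sedc_encode(bits):
    """Encode a data word (tuple of bits, MSB first, length n >= 2).

    The leading b bits go to one SEDCb module and the remaining bits are
    split into 3-bit segments, each encoded by its own SEDC3 module.
    """
    n = len(bits)
    b = next(b for b in (4, 3, 2) if n >= b and (n - b) % 3 == 0)
    head, tail = bits[:b], bits[b:]
    code = {2: sedc2, 3: sedc3, 4: sedc4}[b](*head)
    for i in range(0, len(tail), 3):
        code += sedc3(*tail[i:i + 3])   # segments are independent
    return code

print(sedc_encode((1, 0, 1, 1, 0, 1, 0, 0)))   # 8-bit word, 6 check bits
```

Because each segment is encoded on its own, the loop body could run in parallel in hardware, which is where the constant latency comes from.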

If we interchange $S_1$ and $S_0$ for SEDC3 in Figure 3, the resulting SEDC3 code is equal to the Berger code for a 3-bit segment, but our derivation of the SEDC3 code differs considerably from that of Berger codes: SEDC3 codes are scaled from SEDC2 codes, and SEDC2 codes have no commonality with 2-bit Berger codes.
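The relation to Berger codes can be checked exhaustively. The sketch below assumes the common Berger convention in which the check symbol is the binary count of 0s in the data, and it assumes the SEDC3 form that complements both the inputs and the outputs of SEDC2 when $D_2 = 1$; all names are hypothetical.

```python
from itertools import product

def sedc2(d1, d0):
    return (1 - (d1 ^ d0), 1 - (d1 & d0))

def sedc3(d2, d1, d0):
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)

def berger(bits):
    """Berger check symbol (assumed convention): number of 0s, MSB first."""
    zeros = bits.count(0)
    width = len(bits).bit_length()   # equals ceil(log2(len(bits) + 1))
    return tuple((zeros >> i) & 1 for i in reversed(range(width)))

# SEDC3 with S1 and S0 interchanged versus the 3-bit Berger code.
sedc3_swapped_is_berger = all(
    sedc3(*d)[::-1] == berger(d) for d in product((0, 1), repeat=3)
)

# SEDC2 versus the 2-bit Berger code, in either bit order.
sedc2_is_berger = any(
    all(order(sedc2(*d)) == berger(d) for d in product((0, 1), repeat=2))
    for order in (lambda s: s, lambda s: s[::-1])
)

print(sedc3_swapped_is_berger, sedc2_is_berger)
```

Under these assumptions the first check succeeds and the second fails, i.e., the SEDC2 table is not the 2-bit Berger table in either bit order.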

3.4. SEDC-Based HW-Level Fault Tolerance System Example. To illustrate the design of an HW-level fault tolerance system using the SEDC scheme, we take the example of a 4-bit adder. Suppose that this 4-bit adder is part of a processor that runs big data applications and that we want to make it tolerant of transient errors arising in its circuitry. The general HW-level fault tolerance system diagram shown in Figure 1 is then converted to the one shown in Figure 4. As shown in Figure 4, the 4-bit adder acts as the ISG and its equivalent SEDC encoder acts as the CSG. The SEDC encoder (CSG) can be implemented using (6) as follows:

$$[S_3, \ldots, S_0] = \mathrm{SEDC}(A[3{:}0] + B[3{:}0] + C_{in}). \tag{6}$$

Because the output of the 4-bit adder is a 5-bit value, the equivalent SEDC code is a 4-bit value according to (2). We used Altera's Quartus II software to synthesize the 4-bit adder (ISG), the SEDC encoder (CSG), and the SEDC checker shown in Figure 4 and used the synthesized circuit to compute the fault coverage of the SEDC scheme, which is presented in Section 5.3. In the next section, we present the proposed FS SEDC checker, which completes the overall proposed SEDC-based HW-level fault tolerance system.
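A behavioral sketch of this example is given below. It is not the synthesized circuit: the CSG is modeled by encoding the expected 5-bit sum (one SEDC2 segment for the two MSBs and one SEDC3 segment for the three LSBs, since $n = 5$ gives $b = 2$ and $a = 1$), the checker is modeled as "re-encode and compare", and SEDC3 is assumed to complement both the inputs and the outputs of SEDC2 when $D_2 = 1$. All function names are hypothetical.

```python
def sedc2(d1, d0):
    return (1 - (d1 ^ d0), 1 - (d1 & d0))

def sedc3(d2, d1, d0):
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)

def encode5(word):
    """4-bit SEDC code of a 5-bit word: SEDC2 on bits 4..3, SEDC3 on 2..0."""
    bits = tuple((word >> i) & 1 for i in (4, 3, 2, 1, 0))
    return sedc2(*bits[:2]) + sedc3(*bits[2:])

def adder_isg(a, b, cin):
    """Information symbol generator: 5-bit result of the 4-bit addition."""
    return (a + b + cin) & 0x1F

def adder_csg(a, b, cin):
    """Check symbol generator, modeled behaviorally from the expected sum.
    Assumption: the real CSG is separate logic sharing no gates with the ISG."""
    return encode5(a + b + cin)

def checker_ok(info, check_bits):
    """Behavioral checker: re-encode the information bits and compare."""
    return encode5(info) == check_bits

a, b, cin = 9, 7, 1
info, check = adder_isg(a, b, cin), adder_csg(a, b, cin)
print(info, check, checker_ok(info, check))
print(checker_ok(info ^ 0b00001, check))   # one erroneous output bit
```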

Table 1: Code table for the FS SEDC1 checker.

G0   S0   V1   V0
0    0    1    1
0    1    1    0
1    0    1    0
1    1    0    0

4. The FS SEDC Checker

As shown in Figure 4, the FS SEDC checker takes $n$ information bits and $m$ SEDC check bits from the functional unit. The FS SEDC checker is composed of one $b$-bit FS SEDC checker and $a$ 3-bit FS SEDC checkers. With 1-, 2-, and 3-bit FS SEDC checkers, the output can be used directly as an error indication signal, but for $n > 3$ one level of wired-AND-OR logic gates is used to combine the outputs of all FS SEDC checker subblocks and generate the 2-bit error indication signal. The following subsections discuss the logic and circuit diagrams of the primitive FS SEDC checkers (SEDC1, SEDC2, SEDC3, and SEDC4 checkers), which can be used to scale the SEDC checker to an $n$-bit FS SEDC checker (i.e., an FS SEDCn checker).

4.1. The FS SEDC1 Checker. Table 1 shows the logic for a 1-bit SEDC (FS SEDC1) checker. The valid input code words are "10" and "01", and the valid output code word is "10". $G_0$ denotes the 1-bit information word that is the output of the ISG, and $S_0$ denotes the 1-bit SEDC check bit generated by the SEDC check symbol generator (SCSG). $V_1 V_0$ is the 2-bit error indication signal of the FS SEDC1 checker. The $V_1$ and $V_0$ signals are generated by the circuits shown in Figure 5(a).
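The truth table of the FS SEDC1 checker can be captured by a two-line behavioral model. The NAND/NOR reading used here is inferred from the rows of Table 1, not from the transistor-level circuit of Figure 5(a), and the function name is hypothetical.

```python
def sedc1_checker(g0, s0):
    """Behavioral model of Table 1: V1 = NAND(G0, S0), V0 = NOR(G0, S0)."""
    v1 = 1 - (g0 & s0)
    v0 = 1 - (g0 | s0)
    return (v1, v0)

for g0 in (0, 1):
    for s0 in (0, 1):
        v = sedc1_checker(g0, s0)
        status = "valid" if v == (1, 0) else "error"
        print(g0, s0, "->", v, status)
```

Only the input pairs "01" and "10" give the valid output "10"; the all-zero and all-one inputs give "11" and "00", respectively.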

4.2. The FS SEDC2 Checker.

[1198811 1198810] = [1198781 (1198661 + 1198660) (1198780 + 11986611198660) (1198661 + 1198660 + 1198780) (1198781 + 119866111986601198780)]

(7)

In Figure 5, the symbols P1-P13 and N1-N13 represent the PMOS and NMOS transistors, respectively, and Vss represents the voltage supply. For simplicity, we used a CMOS-based implementation of the SEDC checker circuits. Any other technology can be used to design these circuits, but the underlying algorithm, i.e., SEDC, remains the same.

4.3. The FS SEDC3 Checker. Figure 6(a) shows the block diagram and the logic for a 3-bit FS SEDC checker. The 3-bit data $G_2 G_1 G_0$ from the ISG and the 2-bit SEDC check bits $S_1 S_0$ from the SCSG are first converted to $G_1' G_0'$ and $S_1' S_0'$, respectively, and are then checked using the same 2-bit FS SEDC checker, as shown in Figure 6(a). When the $G_2$ bit is "1", $G_1 G_0$ and $S_1 S_0$ are inverted, whereas if $G_2$ is "0", $G_1 G_0$ and $S_1 S_0$ remain the same. Because the outputs of the XOR gates are fed to the FS SEDC2 checker, any error in the XOR gates is detected. This makes the overall 3-bit SEDC checker FS.
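The XOR conditioning in front of the 2-bit checker can be modeled behaviorally as below. The `sedc2_checker` used here is only a stand-in that re-encodes and compares; it is not the gate-level FS SEDC2 checker, and the noncode output it returns on a mismatch is an arbitrary choice. SEDC3 is assumed to complement both the inputs and the outputs of SEDC2 when the MSB is 1. All names are hypothetical.

```python
from itertools import product

def sedc2(d1, d0):
    return (1 - (d1 ^ d0), 1 - (d1 & d0))

def sedc3(d2, d1, d0):
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)

def sedc2_checker(g1, g0, s1, s0):
    """Stand-in for the FS SEDC2 checker: (1, 0) if the pair is valid.
    Assumption: any noncode output signals an error; (0, 0) is used here."""
    return (1, 0) if sedc2(g1, g0) == (s1, s0) else (0, 0)

def sedc3_checker(g2, g1, g0, s1, s0):
    # XOR with G2 inverts G1, G0, S1, S0 when G2 = 1 and passes them
    # through unchanged when G2 = 0.
    return sedc2_checker(g1 ^ g2, g0 ^ g2, s1 ^ g2, s0 ^ g2)

all_valid = all(
    sedc3_checker(*d, *sedc3(*d)) == (1, 0)
    for d in product((0, 1), repeat=3)
)
print("every SEDC3 code word accepted:", all_valid)
```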

Figure 3: 3D illustration of the SEDC3 scheme.

Figure 4: Example of an SEDC-based HW-level fault tolerance system. The 4-bit adder (ISG) and the SEDC-encoded 4-bit adder (SCSG) both receive A[3:0], B[3:0], and Cin; the FS SEDC checker takes the adder output A[3:0] + B[3:0] + Cin and the check bits S = SEDC(A[3:0] + B[3:0] + Cin) and produces the error indication signal V.

4.4. The FS SEDC4 Checker. A 4-bit FS SEDC checker consists of one FS SEDC1 checker and one FS SEDC3 checker, as shown in Figure 6(b). Both the SEDC1 and SEDC3 checkers generate a 2-bit output $V_1 V_0$. Because the valid code word is "10", to make sure that both checker units generate the "10" output during error-free operation, we AND the $V_1$ output bit of the FS SEDC1 checker with the $V_1$ output bit of the FS SEDC3 checker and OR the $V_0$ output bits of both FS SEDC checkers using wired logic gates. We checked and confirmed by fault simulation that the wired-AND and wired-OR gates are also FS for single faults (stuck-at-0, stuck-at-1, transistor-stuck-on, and transistor-stuck-off).

4.5. The FS SEDCn Checker. Like the SEDC code generator, the FS SEDC checker consists of multiple 1-, 2-, and 3-bit FS SEDC checkers, depending upon the values of $a$ and $b$ from (1). For example, if $n = 8$ bits, then (1) gives $a = 2$ and $b = 2$. This requires one FS SEDC2 checker and two FS SEDC3 checkers to realize an 8-bit FS SEDC checker.

The area of the wired-AND-OR gates also increases as $n$ increases. Figure 7 shows the block diagram of an $n$-bit FS SEDC checker. For $n = 8$ bits, there are three FS SEDC checkers in total, each with a 2-bit output; hence a 3-input wired-AND gate and a 3-input wired-OR gate are required to combine all $V_1$ and $V_0$ bits. In general, for an $n$-bit input, there are $a + 1$ FS SEDC checkers, each with a 2-bit output, so we require a $k$-input wired-AND-OR network with $k = 2 \times (a + 1)$. With each additional input to the wired-AND-OR network, one extra transistor is required by each of the wired gates. This causes the circuit to expand width-wise; hence the latency of the wired logic remains constant for any value of $n$.

The size of the load transistor driving these wired-AND and wired-OR gates also increases with the number of inputs, so we take the maximum fan-in of one gate to be 4. For $k > 4$, an extra load transistor is connected in parallel. Generally, for $k$ inputs, we require $r = \lceil k/4 \rceil$ load transistors. A total of $k + r$ transistors is required to design the $k$-input wired-AND-OR network, with a constant latency of one transistor.

5. Experiments and Results

In this section, we present the experiments we conducted on the proposed FS SEDC checker and on the overall proposed SEDC-based HW-level fault tolerance system. The results of each experiment are given along with the experimental details in the subsections below.

5.1. Fault Test on FS SEDC Checker. The FS SEDC1, SEDC2, SEDC3, and SEDC4 circuits were tested for stuck-at-0, stuck-at-1, transistor-stuck-ON, and transistor-stuck-OFF faults. We assume a single-fault model, where faults occur one at a time and there is enough time between the detection of one fault and the occurrence of the next [29]. Table 2 summarizes the fault analysis of the SEDC1 checker circuit. We applied one fault at a time to the circuit of Figure 5(a) and observed the output. In single-fault operation, the circuit either produced the correct output or produced a noncode word; it never produced an incorrect valid code word (exhibiting the FS property), as shown in Table 2.

Figure 5: CMOS-based circuits of the FS (a) SEDC1 checker and (b) SEDC2 checker.

Figure 6: Block diagram of the FS (a) SEDC3 checker and (b) SEDC4 checker.

Case 1 (transistor stuck ON). Table 2 shows all six cases of transistor-stuck-ON faults (one at a time). For the cases with N3 or N4 stuck ON, the circuit reveals the fault for one input code combination (marked with the ‰ symbol), and hence the circuit is self-testing, whereas the other cases show that the circuit is fault secure as well as code disjoint.

Case 2 (transistor stuck OFF). Table 2 shows all six cases of transistor-stuck-OFF faults. In the cases where N1 or N2 is stuck OFF, the circuit demonstrates the self-testing property (marked with the ‰ symbol), and in the remaining cases the circuit is fault secure.

Case 3 (input stuck at 0). When input G0 or S0 is stuck at 0, the circuit demonstrates the self-testing property; otherwise it remains fault secure.

Case 4 (input stuck at 1). When input G0 or S0 is stuck at 1, the circuit shows the self-testing property; otherwise it remains fault secure.

There is one case where the output becomes floating (i.e., P3 or P4 stuck OFF). In this case, if we treat the floating voltage as logic high, the circuit is fault secure, and if we treat it as logic low, the circuit is self-testing. Hence, the circuit in Figure 5(a), which is a 1-bit SEDC checker, is FS. A similar analysis was carried out for the 2-, 3-, and 4-bit SEDC checkers, and we found that all of these checkers are FS.

5.2. Area, Delay, and Power Comparison. In this section, we compare the area, delay, and power dissipation of TSC Berger, FS SEDC, and m-out-of-2m code checkers. We use the two possible TSC Berger checker implementations from Piestrak et al. [23] and Pierce Jr. and Lala [26], together with the m-out-of-2m code checker from Lala [24], for comparison. For the sake of fairness, the area overhead was measured in terms of the number of equivalent transistors. We made use of the assumptions by Smith [30] to translate gate-level circuits to transistor-level circuits.

Figure 7: Block diagram of the FS SEDCn checker. One FS SEDC checker for $b$-bit data and $a$ FS SEDC checkers for 3-bit data receive the $n$-bit functional circuit output G and the $m$-bit SEDC code S; their $V_1 V_0$ outputs feed a $k$-input wired-AND-OR network that produces the error indication signal.

Table 2: Results of single faults on the FS SEDC1 checker.

Fault                                  Mark   G0   S0   V1         V0
MOS P1 or P2 stuck ON                         0    1    1          0
                                              1    0    1          0
MOS P3 or P4 stuck ON                         0    1    1          0
                                              1    0    1          0
Transistor N1 stuck ON                        0    1    1          0
                                              1    0    1          0
Transistor N2 stuck ON                        0    1    1          0
                                              1    0    1          0
Transistor N3 stuck ON                 ‰      0    1    0          0
                                              1    0    1          0
Transistor N4 stuck ON                 ‰      0    1    1          0
                                              1    0    0          0
MOS P1 or P2 stuck OFF                        0    1    1          0
                                              1    0    1          0
MOS P3 or P4 stuck OFF                        0    1    Floating   0
                                              1    0    1          0
Transistor N1 stuck OFF                       0    1    1          0
                                       ‰      1    0    1          1
Transistor N2 stuck OFF                ‰      0    1    1          1
                                              1    0    1          0
Transistor N3 stuck OFF                       0    1    1          0
                                              1    0    1          0
Transistor N4 stuck OFF                       0    1    1          0
                                              1    0    1          0
Input C0 (i.e., S0) stuck at 0         ‰      0    0    1          1
                                              1    0    1          0
Input F0 (i.e., G0) stuck at 0         ‰      0    0    1          1
                                              0    1    1          0
Input C0 (i.e., S0) stuck at 1                0    1    1          0
                                       ‰      1    1    0          0
Input F0 (i.e., G0) stuck at 1                1    0    1          0
                                       ‰      1    1    0          0

‰ The cases where the circuit shows the self-testing property.

Before the comparison, we illustrate the functional dissimilarities of the three checkers with the help of Figure 8. Figure 8(a) shows the general block diagram of a TSC Berger code checker. For all the information symbols that the ISG of the functional circuit can produce in normal operation, the check symbol complement generator (CSCG) output ($S_B'$) is the bit-by-bit complement of the expected check symbol $S_B$. The TSC two-rail checker validates that each bit of $S_B$ is the complement of the corresponding bit of $S_B'$. As the size of the input data increases, the length of the check symbol $S_B$ also increases, resulting in a longer TSC two-rail checker tree and hence a longer delay.

A general block diagram of a TSC m-out-of-2m code checker is shown in Figure 8(b). The checker takes the information bits and check bits $S_W$ and partitions them into two parts. The numbers of 1's, i.e., the weights, of both parts are mapped to a pair of values that in binary belongs to a code, in most cases a two-rail code. The checker consists of a cellular structure of AND-OR gates, as given by Lala [24].

Figure 8(c) depicts the general block diagram of an FS SEDC checker, which resembles the structure of an m-out-of-2m code checker and differs from a Berger code checker. The FS SEDC checker block receives the information and check bits from the functional unit. If the input data length increases, the size of the FS checker block increases width-wise. The FS SEDCn block contains $a + 1$ pairs of small SEDC checkers (subblocks). Each subblock of the FS SEDC checker produces "10" as the valid code output. The overall SEDC checker has a final 2-bit output $S_{10}$; unlike two-rail codes, only one of the output combinations, "10", is considered a valid code word. A nonvalid checker output ("00", "01", or "11") at $S_{10}$ indicates the presence of a fault in the functional circuit or in the FS checker itself. The $k$-input wired-AND-OR network takes the $a + 1$ pairs of outputs from the SEDC checker subblocks and converts them into a final 2-bit error indication signal $V_S$.

Figure 8: Block diagrams of (a) TSC Berger checker, (b) m-out-of-2m code checker, and (c) FS SEDC checker.

Figure 9: Area comparison of area-optimized Berger [23], SEDC, and m-out-of-2m [24] code checkers. Horizontal axis: data length (bits), with values 2, 3, 4, 5, 7, 8, 15, 16, 30, and 32; vertical axis: circuit size (number of transistors), from 0 to 2500.

5.2.1. Area. The area-optimized realization of TSC Berger code checkers in Piestrak et al. [23] showed less area overhead than m-out-of-2m code checkers, which is apparent from Figure 9. However, if we consider the delay-optimized implementation of the TSC Berger code checker from Pierce Jr. and Lala [26], the TSC Berger code checker requires more area than the FS SEDC and m-out-of-2m code checkers [24], as shown in Table 3. For clarity, Table 3 lists the area overhead due to code storage and the area overhead due to the code checker separately. Also listed separately are the area overhead required by the TRC tree for the TSC Berger code checker and by the wired-AND-OR network for the FS SEDC and m-out-of-2m code checkers.

For a fair comparison, the extra cost of the code storage area is also taken into account. We assumed that 1-bit storage is implemented by 12 MOS transistors [30]. Table 3 lists the area (in terms of the number of transistors) occupied by the FS SEDC, delay-optimized Berger code, and m-out-of-2m code checkers for up to 32-bit data.

The FS SEDCn checker block shown in Figure 8(c) requires fewer gates, implemented with $[26 + (a \times 50)]$ MOS transistors if $b = 2$, $[50 + (a \times 50)]$ MOS transistors if $b = 3$, and $[58 + (a \times 50)]$ MOS transistors if $b = 4$. The m-out-of-2m code checker implementation of Lala [24] requires $2m^2 - 2m + 2$ gates. The gate-level circuit is also translated to a transistor-level circuit using data from Smith [30].
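The transistor-count expressions above, together with the 12-transistor-per-bit storage assumption, are easy to tabulate. The sketch below is only an arithmetic illustration with hypothetical function names; it covers the checker blocks and the code storage and deliberately leaves out the wired-AND-OR network. It assumes $a = 0$ for $n \le 4$.

```python
import math

BASE_TRANSISTORS = {2: 26, 3: 50, 4: 58}   # b-bit checker block
SEDC3_TRANSISTORS = 50                     # each additional 3-bit checker
STORAGE_PER_BIT = 12                       # MOS transistors per stored bit

def sedc_params(n):
    """Return (a, b, m) from (1) and (2); assumes n >= 2."""
    for b in (4, 3, 2):
        if n >= b and (n - b) % 3 == 0:
            a = (n - b) // 3
            break
    m = math.ceil(math.log2(n + 1 - 3 * a)) + 2 * a
    return a, b, m

def sedc_checker_area(n):
    """MOS transistors in the SEDC checker blocks for n-bit data."""
    a, b, _ = sedc_params(n)
    return BASE_TRANSISTORS[b] + a * SEDC3_TRANSISTORS

def sedc_storage_area(n):
    """MOS transistors needed to store the m check bits."""
    _, _, m = sedc_params(n)
    return STORAGE_PER_BIT * m

for n in (2, 3, 4, 5, 7, 8, 15, 16, 30, 32):
    print(n, sedc_storage_area(n), sedc_checker_area(n))
```

For example, the sketch gives 108 checker transistors for 7-bit data and 126 for 8-bit data, i.e., a difference of 18 transistors.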

The results show that when scaling a 7-bit 0's counter to an 8-bit 0's counter, 154 extra MOS transistors are required. The m-out-of-2m code checker requires 60 extra MOS transistors when scaling a 7-out-of-14 checker to an 8-out-of-16 checker, whereas the SEDC checker requires only 18 extra MOS transistors. That is because a 7-bit SEDC checker is implemented with one SEDC3 and one SEDC4 circuit, which contain 50 and 58 MOS transistors, respectively (a total of 108 transistors), whereas an 8-bit SEDC checker is implemented using one SEDC2 and two SEDC3 checkers, requiring 26 and 100 (50 × 2) MOS transistors (a total of 126 transistors). This means that, for this scaling step, SEDC saves 88% of the additional transistors compared to a Berger code checker [26] and 70% compared to an m-out-of-2m code checker. Although the Berger and m-out-of-2m checkers are TSC while the proposed SEDC checker is only FS, all three checkers provide the same fault security.

Table 3: Area overhead (number of MOS transistors) of Berger [26], SEDC, and m-out-of-2m [24] code checkers.

            Berger code                              SEDC                                   m-out-of-2m
Data bits   Storage   1's counter   TRC    Total     Storage   Checker   AND-OR   Total     Storage   Checker   AND-OR   Total
2           24        22            4      50        24        26        0        50        24        36        0        50
3           24        80            8      112       24        50        0        74        36        152       0        188
4           36        180           12     228       36        58        6        100       48        240       10       298
5           36        178           16     230       48        76        6        130       60        300       14       374
7           36        396           24     456       60        108       8        176       84        420       18       522
8           48        550           28     626       72        126       8        206       96        480       20       596
15          48        1106          56     1210      120       250       14       384       180       900       38       1118
16          60        1308          60     1428      132       258       16       406       192       960       40       1192
30          60        2586          116    2762      240       500       26       766       360       1800      76       2236
32          72        3048          120    3240      264       526       28       818       384       1920      80       2384

5.2.2. Delay. As far as delay is concerned, the FS SEDC checker also performs better than the Berger checker and the cellular implementation of the m-out-of-2m code checker, as shown in Table 4. For the sake of uniformity, we designed all the basic gates using transistors of the same technology (PMOS = 8μ/2μ, NMOS = 4μ/2μ) and evaluated the worst-case propagation delay of each circuit.

The SEDC checker shows an almost constant delay for $n > 3$ bits owing to its parallel implementation, whereas the delay of the Berger code checker increases owing to an increase in gate levels (from 6 to 16) in the critical path, as shown by Pierce Jr. and Lala [26]. The delay of m-out-of-2m code checkers also continues to increase with increasing data length because the cellular implementation requires $m$ (= input data length) gate levels in the critical path.

5.2.3. Power Dissipation. To evaluate the power dissipation of the three checkers, we used the PowerPlay power analyzer tool. We implemented the Berger [26], m-out-of-2m [24], and SEDC checkers in Verilog and synthesized the circuits using Altera's Quartus II software. We targeted a Cyclone II EP2C5AF256A7 chip, which has the lowest power dissipation in the Cyclone family. We allowed the synthesizer to balance area and delay during synthesis in order to get a better power estimate. We also enabled the synthesis mode that takes intensive steps to optimize power for all three circuits. We clocked the inputs of the circuit with the default toggle rate and estimated the total thermal power dissipation for different values of input data width.

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This happens because of the increase in the number of two-rail checkers in the case of the Berger checker and the increase in the checker circuitry itself in the case of the m-out-of-2m checker. The same trend is evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LEs) occupied by the checkers.

53 Fault Coverage of the Proposed HW-Level Fault ToleranceScheme In order to elaborate the effectiveness of the SEDCCSG and its FS checker we computed the fault coverage ofthe proposed SEDC-based HW-level fault tolerance schemeWe applied faults in the example circuit of Figure 4 givenin Section 34 As most of the VLSI combinational circuitsdesigned for mathematical operations like add subtractmultiply division etc consist of multiple instances of 1-bitadders (full adders) hence the example circuit ie a 4-bitadder is a simple and good candidate for presenting theeffectiveness of our scheme We injected two major typesof transient errors ie stuck-at-0 and stuck-at-1 [29] at 24nodes (at 6 nodes per full adder as shown in Figure 11(b))Weinjected these errors using 2-to-1 multiplexers whose outputis given by

119898119906119909119906=

1198941198991 (119899119900119903119898119886119897 119892119886119905119890 119900119906119905119901119906119905) 119894119891 119904119890119897119890119888119905 (119891 119890119899119886119887119897119890) = 01198941198992 (119904119905119906119888119896 minus 119886119905 minus 119891119886119906119897119905 119891 isin F) 119894119891 119904119890119897119890119888119905 (119891 119890119899119886119887119897119890) = 1

(8)

In Figure 11(a) the symbols A[30] B[30] Cin f enableand F[230] denote the 4-bits input A 4-bits input B 1-bitcarry-in 1-bit fault enabling signal and 24-bits fault signalsrespectively while Cout is the carry-out and S[30] representsthe 4-bits sum output of the 4-bits adder Figure 11(b) showsthe detailed schematic of a single full adder

We considered that faults can occur at the outputs of the logic gates only and adopted a single-fault model, according to which only one fault can occur at a time [29].
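The multiplexer-based injection of (8) can be sketched in a few lines of Python. This is only a behavioural illustration: the function names `mux2_1` and `faulty_xor` are hypothetical, and a single XOR gate stands in for one of the six injection nodes of the full adder, whose exact gate-level netlist is not reproduced here.

```python
def mux2_1(in1: int, in2: int, select: int) -> int:
    """2-to-1 multiplexer as in (8): in1 when select is 0, in2 when select is 1."""
    return in2 if select else in1


def faulty_xor(a: int, b: int, f_enable: int, stuck_value: int) -> int:
    """XOR gate whose output node is routed through a fault-injection mux.

    With f_enable = 0 the normal gate output is passed on; with f_enable = 1
    the node is forced to stuck_value (0 for stuck-at-0, 1 for stuck-at-1).
    """
    normal_output = a ^ b
    return mux2_1(normal_output, stuck_value, f_enable)


# Fault-free operation versus an injected stuck-at-0 on the same node.
fault_free = faulty_xor(1, 0, f_enable=0, stuck_value=0)
stuck_at_0 = faulty_xor(1, 0, f_enable=1, stuck_value=0)
```

Under the single-fault model, a test bench would enable exactly one such mux at a time and compare the circuit output against the fault-free result.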

Figure 10: Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder and (b) 1-bit full adder with fault injection.

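Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

| Data bits | Berger | SEDC | m-out-of-2m |
|---|---|---|---|
| 2 | 3.888 | 0.514 | 1.024 |
| 3 | 4.151 | 2.524 | - |
| 4 | 7.741 | 2.738 | 5.490 |
| 5 | - | 2.713 | 5.558 |
| 7 | 7.821 | 2.77 | 8.297 |
| 8 | 7.599 | 2.76 | 9.284 |
| 15 | 10.566 | 2.826 | - |
| 16 | 12.956 | 2.751 | |
| 32 | 17.964 | 2.771 | - |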

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

| | (a) Total errors at the output of the adder | (b) BEs | (c) Detected BEs | (d) UEs | (e) Detected UEs | (f) Total detected errors | (g) Total undetected errors |
|---|---|---|---|---|---|---|---|
| Total | 1748 | 252 | 120 | 1496 | 1496 | 1616 | 132 |
| Percentage (%) | 100 | 14.42 w.r.t. (a) | 47.62 w.r.t. (b) | 85.58 w.r.t. (a) | 100 w.r.t. (d) | 92.45 w.r.t. (a) | 7.55 w.r.t. (a) |

We used Altera's Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these 1748 errors were bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This is consistent with the observation that most of the errors in VLSI circuits result in UEs at the output [19–21]. Even though SEDC is an AUED scheme and provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. This is because SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of BEs is also partitioned into multiple UEs and thus detected by the proposed SEDC scheme.
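The percentages reported in Table 5 follow directly from the raw counts. The short Python check below recomputes them; the variable and function names are illustrative only, while the counts are the ones given in Table 5.

```python
# Raw counts reported in Table 5.
total_errors = 1748          # (a) errors observed at the adder output
bes, detected_bes = 252, 120  # (b) bidirectional errors and (c) those detected
ues, detected_ues = 1496, 1496  # (d) unidirectional errors and (e) those detected


def pct(part: int, whole: int) -> float:
    """Percentage of part with respect to whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


total_detected = detected_bes + detected_ues      # (f)
total_undetected = total_errors - total_detected  # (g)

coverage = {
    "BE share of all errors": pct(bes, total_errors),
    "BE coverage": pct(detected_bes, bes),
    "UE share of all errors": pct(ues, total_errors),
    "UE coverage": pct(detected_ues, ues),
    "overall coverage": pct(total_detected, total_errors),
    "undetected share": pct(total_undetected, total_errors),
}
```

The undetected share is the factor later used to scale the failure probability in the cost analysis of Section 5.4.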

5.4. Cost Analysis: SW-Based Fault Tolerance versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity in our analysis, we take the example of the coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node or, in some cases, on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of the failed SNN's data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].

For our cost analysis, we compute the cost associated with the CC data recovery algorithm, for which we assume a cloud application, such as a message passing interface (MPI) program, that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p - 1$ processes with equal probabilities. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure only occurs during message passing with equal probability and not during checkpointing or recovery, we use the analytical cost model given in [4] for cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then, the total cost of fault tolerance per process is given by

$$C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}$$

Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then, according to [4], the average recovery cost in CC per process is given by

$$T_{RO} = \frac{C_{rb}}{1/P(f)} = P(f)\,C_{rb}. \tag{10}$$

Let $P(cp)$ denote the probability that a process starts checkpointing; then $(1 - P(cp))^p$ is the probability that $p$ processes do not start checkpointing, while $1 - (1 - P(cp))^p$ is the probability that at least one process starts a checkpoint. Consequently, $1/(1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$ and generate request (REQ) and acknowledgement (ACK) signals to the rest of the $p - 1$ noninitiators (a total of $2(p - 1)$ signals) and likewise be a noninitiator with probability $1 - 1/p$ and generate only one ACK signal in response to the initiator. As a result, $3(p - 1)/p$ messages are generated per checkpoint on average, and the average overhead per checkpoint is $C_w + (3(p - 1)/p)\,C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then, the average checkpointing cost for a process is given by

$$T_{CP} = \frac{C_w + \dfrac{3(p - 1)}{p}\,C_{nl}}{1 / \left(1 - (1 - P(cp))^p\right)} = \left(1 - (1 - P(cp))^p\right)\left(C_w + \frac{3(p - 1)}{p}\,C_{nl}\right). \tag{11}$$

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpoint per 15 minutes), $C_{nl} = 20$ ms, $C_w = 1$ s, and $C_{rb} = 2$ s, as given in [4]. We consider the value of $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, while each fault occurs after 168 hours (one week's time). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the value 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters while keeping the others constant and observe the effect on data recovery cost with and without the proposed HW-level fault tolerance.
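The cost model of (9), (10), and (11) is simple enough to evaluate directly. The Python sketch below is one way to do so under stated assumptions: the function names are hypothetical, the probabilities are used as dimensionless values exactly as quoted above, and the execution time `T_EXEC` is a placeholder because the value of $T$ is not stated in this section.

```python
def checkpoint_cost(p: int, p_cp: float, c_w: float, c_nl: float) -> float:
    """Average checkpointing cost T_CP per process, following (11)."""
    prob_any_checkpoint = 1.0 - (1.0 - p_cp) ** p
    overhead_per_checkpoint = c_w + 3.0 * (p - 1) / p * c_nl
    return prob_any_checkpoint * overhead_per_checkpoint


def recovery_cost(p_f: float, c_rb: float) -> float:
    """Average recovery cost T_RO per process, following (10)."""
    return p_f * c_rb


def fault_tolerance_cost(t: float, t_cp: float, t_ro: float) -> float:
    """Total cost of fault tolerance in percent of T, following (9)."""
    return (t_cp + t_ro) / t * 100.0


# Parameters quoted in the text (times in seconds).
P, P_CP, C_NL, C_W, C_RB = 128, 1 / 15, 0.020, 1.0, 2.0
P_F = 1 / 168
HW_UNHANDLED = 0.0755   # share of faults not handled at the HW level (Table 5)
T_EXEC = 100.0          # assumption: placeholder execution time, not from the source

t_cp = checkpoint_cost(P, P_CP, C_W, C_NL)
cost_sw_only = fault_tolerance_cost(T_EXEC, t_cp, recovery_cost(P_F, C_RB))
cost_with_hw = fault_tolerance_cost(
    T_EXEC, t_cp, recovery_cost(HW_UNHANDLED * P_F, C_RB)
)
```

In this model only the recovery term $T_{RO}$ depends on the failure probability, so HW-level detection lowers the total cost through that term alone while the checkpointing term is unchanged.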

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes $p$ is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into $p$ processes and each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in data recovery cost in the CC algorithm because every process has to coordinate with all the others in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which caused a huge impact on fault tolerance overhead, as shown in Figure 14. However, if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures leading to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains fail-safe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with the Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the comparison favors the proposed SEDC checker. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data, at both the SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University, 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, "Big data: A survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, "A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing," in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, "Big data: issues, challenges, tools and good practices," in Proceedings of the 6th International Conference on Contemporary Computing (IC3 '13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, "Cost-performance of fault tolerance in cloud computing," Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, "A survey of fault tolerance architecture in cloud computing," Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, "Fault-tolerant and reliable computation in cloud computing," in Proceedings of the 2010 IEEE Globecom Workshops, GC '10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, "Using proactive fault-tolerance approach to enhance cloud service reliability," IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, "FatTire: Declarative fault tolerance for software-defined networks," in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN '13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, "Integrating scale out and fault tolerance in stream processing using operator state management," in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD '13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters," in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, "Understanding Real World Data Corruptions in Cloud Systems," in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, "Discussion of Reliability Meets Big Data: Opportunities and Challenges," Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, "Big data and the opportunities it creates for semiconductor players," in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, "Construction of a soft error (SEU) hardened Latch with high critical charge," in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT '16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, "Robust system design with built-in soft-error resilience," The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, "Characterization of soft errors caused by single event upsets in CMOS processes," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, "Design for testability," VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, "State-of-the-art techniques for detecting transient errors in electrical circuits," IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, "Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude," in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC '14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, "Oral fibrolipoma: A rare histological entity: report of 3 cases and review of literature," Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, "A t-unidirectional error-detecting systematic code," Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, "Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction," in Proceedings of the 2012 IEEE International Conference on Communications, ICC '12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, "On the design of self-testing checkers for modified Berger codes," in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, "Self-checking look-up tables using scalable error detection coding (SEDC) scheme," Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.

[26] D. A. Pierce Jr. and P. K. Lala, "Modular implementation of efficient self-checking checkers for the Berger code," Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, "Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding," in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, "Online error detection in SRAM based FPGAs using Scalable Error Detection Coding," in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED '13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, "Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes," IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, Piscataway, NJ, USA, May 2010.



Figure 2: (a) SEDC scheme for a given data word and (b) 2D illustration of the SEDC2 scheme.

ISG (in [27], a normal ALU). Any fault that causes multiple unidirectional errors at the output of the normal ALU is detected by the SEDC checker. Any logic circuitry, including SRAM-based memory cells [28], can be made fault tolerant by encoding it in a way similar to the methods given in [27, 28]. In the next section, we briefly introduce the SEDC scheme with an example to encode an adder circuit, while in the rest of the paper we focus on the proposed FS SEDC checker that can be used with any SEDC-based HW-level fault tolerance system.

3. Scalable Error Detection Coding (SEDC) Scheme

The Scalable Error Detection Coding scheme [25] is an AUED scheme formulated and designed in such a way that only the resultant circuit area is scaled, while its latency depends on a small portion of the input data (explained later).

For any binary data D of length $n$ bits, represented as $(D_{n-1}, \ldots, D_2, D_1, D_0)$ with $D_i \in \{0, 1\}$ for $0 \le i \le n - 1$, two parameters $a$ and $b$ are computed using

$$a = \frac{n - \max(b)}{3}, \tag{1}$$

where parameter $a$ can only take a positive integer value, i.e., $a \in \mathbb{Z}^{+}$, and parameter $b \in \{2, 3, 4\}$. Subject to this condition on parameter $a$, the maximum possible value for parameter $b$ is selected. The SEDC code word S is represented as $(S_{m-1}, \ldots, S_j, \ldots, S_2, S_1, S_0)$ with $S_j \in \{0, 1\}$ for $0 \le j \le m - 1$, where $m$ denotes the length of the SEDC code word and is computed by

$$m = \lceil \log_2 (n + 1 - 3a) \rceil + 2a. \tag{2}$$

After computing the values for parameters $a$ and $b$, the SEDC code S for binary data D is computed. SEDC is designed to generate codes basically for 2-, 3-, and 4-bit data and is accordingly referred to as the SEDC2, SEDC3, and SEDC4 scheme, respectively. It is then extended to any integer value of $n$, as shown in Figure 2(a).
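As a worked illustration of (1) and (2), the following Python sketch computes $a$, $b$, and $m$ for a given data width. The function name `sedc_params` is hypothetical, and treating widths of 2 to 4 bits as a single segment with $a = 0$ is an assumption made here so that the basic SEDC2, SEDC3, and SEDC4 cases fit the same formula.

```python
def sedc_params(n: int) -> tuple:
    """Return (a, b, m) for n data bits following (1) and (2).

    b is the width of the single leading segment (2, 3, or 4 bits), a is the
    number of 3-bit segments, and m is the total number of check bits.
    """
    if n < 2:
        raise ValueError("SEDC is defined here for n >= 2 data bits")
    if n <= 4:
        # Assumption: the primitive codes use one segment and no 3-bit groups.
        a, b = 0, n
    else:
        # Largest b in {4, 3, 2} for which (n - b) / 3 is a positive integer.
        b = next(c for c in (4, 3, 2) if (n - c) % 3 == 0)
        a = (n - b) // 3
    # ceil(log2(x)) equals (x - 1).bit_length() for integer x >= 1;
    # here x = n + 1 - 3a.
    m = (n - 3 * a).bit_length() + 2 * a
    return a, b, m


params_8bit = sedc_params(8)   # one SEDC2 segment and two SEDC3 segments
params_5bit = sedc_params(5)   # the 5-bit output of a 4-bit adder
```

Because $b$ is fixed by $n \bmod 3$, exactly one of the three candidate widths satisfies (1) for any $n \ge 5$.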

3.1. SEDC2 Code. A two-dimensional (2D) illustration of a 2-bit SEDC (SEDC2) scheme is shown in Figure 2(b), where nodes represent data words and their corresponding code words are written in brackets.

The SEDC coding scheme assigns code words to different data words with a unique criterion. Whenever there is a change of a bit (or bits) in a data word from "1" → "0", as shown with a bold arrow in Figure 2(b), the change is reflected in the code word in the opposite way, i.e., the code changes from "0" → "1", as shown with the dashed arrow in Figure 2(b), and vice versa. Equation (3) is used to assign 2-bit code words $S_1 S_0$ to the 2-bit data words $D_1 D_0$. Clearly, we can interchange the bit positions of $S_1$ and $S_0$ for another variant of SEDC2 codes. This will not affect the code characteristics.

$$[S_1\ S_0] = SEDC2(D_1, D_0) = [XNOR(D_1, D_0)\ \ NAND(D_1, D_0)]. \tag{3}$$

In (3), $[S_1\ S_0]$ represents the concatenated SEDC code bits, $XNOR$ and $NAND$ are the logical operations, and SEDC2 is the basic coding scheme.
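Equation (3) translates directly into code. The Python sketch below builds the full SEDC2 code table; the function and variable names are illustrative only.

```python
def sedc2(d1: int, d0: int) -> tuple:
    """SEDC2 check bits [S1, S0] = [XNOR(D1, D0), NAND(D1, D0)] as in (3)."""
    s1 = 1 - (d1 ^ d0)   # XNOR
    s0 = 1 - (d1 & d0)   # NAND
    return s1, s0


# Code table: data word (D1, D0) -> code word (S1, S0).
sedc2_table = {(d1, d0): sedc2(d1, d0) for d1 in (0, 1) for d0 in (0, 1)}
```

Reading the table confirms the opposite-direction rule: going from data "11" to "01" (a "1" → "0" change) moves the code from "10" to "01", so the bit $S_0$ changes from "0" to "1".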

3.2. SEDC3 Code. The SEDC3 code for 3-bit data is computed using (4) as follows:

$$[S_1\ S_0] = SEDC3(D_2, D_1, D_0) = \begin{cases} SEDC2(D_1, D_0), & \text{if } D_2 = 0, \\ \overline{SEDC2(\overline{D_1}, \overline{D_0})}, & \text{if } D_2 = 1, \end{cases} \tag{4}$$

where the bar sign (e.g., $\overline{D_1}$) in (4) represents the logical NOT operation.

Figure 3 shows a 3D cube illustrating the unidirectional error detection mechanism of SEDC3 codes. The same notations are used in Figure 3 as in Figure 2(b). The dashed side of the cube represents the SEDC2 coding scheme embedded in SEDC3. Note that when there is a 2-bit unidirectional change in the data word "001" → "111" (the two MSBs changing from "00" → "11"), the code changes in the opposite direction (the least significant bit of the code changes from "1" → "0"). In a similar way, the SEDCn scheme detects $n$-bit, or all, unidirectional errors in the data word D.
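The all-unidirectional-error-detecting property can be checked exhaustively for the 3-bit case. The Python sketch below assumes that, for $D_2 = 1$, SEDC3 complements both the inputs and the outputs of SEDC2, which is the reading consistent with the checker description in Section 4.3 and with the "001" → "111" example above. It then verifies that no code word (data bits plus check bits) covers another, which is the standard condition for an unordered, and hence AUED, code. All names are illustrative.

```python
from itertools import product


def sedc2(d1: int, d0: int) -> tuple:
    """SEDC2 check bits [XNOR(D1, D0), NAND(D1, D0)]."""
    return 1 - (d1 ^ d0), 1 - (d1 & d0)


def sedc3(d2: int, d1: int, d0: int) -> tuple:
    """SEDC3 check bits; assumes inputs and outputs are inverted when D2 = 1."""
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return 1 - s1, 1 - s0


def covers(y: tuple, x: tuple) -> bool:
    """True if word y has a 1 in every position where word x has a 1."""
    return all(yb >= xb for xb, yb in zip(x, y))


def is_unordered(encode, width: int) -> bool:
    """Check that no full code word (data + check bits) covers another."""
    words = [d + tuple(encode(*d)) for d in product((0, 1), repeat=width)]
    return not any(covers(y, x) for x in words for y in words if x != y)


sedc3_is_aued = is_unordered(sedc3, 3)
```

A unidirectional error can only turn a code word into a word that covers it or is covered by it, so an unordered code never maps one valid code word onto another under such errors.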


3.3. SEDC4 Code. The SEDC4 code for 4-bit data is formulated by (5) as follows:

$$[S_2\ (S_1\ S_0)] = SEDC4(D_3, D_2, D_1, D_0) = [\overline{D_3}\ \ SEDC3(D_2, D_1, D_0)]. \tag{5}$$

The MSB of the code word is completely dependent upon the MSB of the data word for SEDC4; hence, any change in the MSB of the data word is detected. The remaining three data bits are encoded using the same SEDC3 scheme.

It can be observed from (3), (4), and (5) that SEDC2 is embedded in 3-bit SEDC (SEDC3) and consequently in 4-bit SEDC (SEDC4) to detect all unidirectional errors in 3-bit and 4-bit data, as shown later. This ability to scale codes is not present in any other concurrent error detecting (CED) coding scheme.

In general, for SEDCn, the $n$-bit binary data is grouped into one $b$-bit segment and $a$ 3-bit segments, and then these segments are encoded using one SEDCb module and $a$ SEDC3 modules in parallel, as shown in Figure 2(a). It is noteworthy that each pair of data segment and corresponding code segment is independent of the others. This independence makes our scheme scalable and able to detect some portion of bidirectional errors (BEs) (discussed in Section 5.3).
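The segmentation described above can be sketched as a small Python encoder. This is a behavioural illustration under stated assumptions, not the hardware implementation: the data word is given MSB first, the single $b$-bit segment is taken from the MSB side as drawn in Figure 2(a), SEDC3 is assumed to invert both inputs and outputs of SEDC2 when its MSB is 1, and the SEDC4 code MSB is assumed to be the complement of the data MSB. The function names are hypothetical.

```python
def sedc2(d1, d0):
    """[S1, S0] = [XNOR(D1, D0), NAND(D1, D0)]."""
    return (1 - (d1 ^ d0), 1 - (d1 & d0))


def sedc3(d2, d1, d0):
    """SEDC3; assumes inputs and outputs of SEDC2 are inverted when D2 = 1."""
    if d2 == 0:
        return sedc2(d1, d0)
    s1, s0 = sedc2(1 - d1, 1 - d0)
    return (1 - s1, 1 - s0)


def sedc4(d3, d2, d1, d0):
    """SEDC4; assumes the code MSB is the complement of D3."""
    return (1 - d3,) + sedc3(d2, d1, d0)


def sedc_encode(bits):
    """Encode a data word given MSB first; returns the check bits MSB first."""
    n = len(bits)
    if n < 2:
        raise ValueError("SEDC needs at least 2 data bits")
    # Width of the leading segment: the whole word for n <= 4, otherwise the
    # value in {4, 3, 2} that leaves a multiple of 3 bits for SEDC3 segments.
    b = n if n <= 4 else next(c for c in (4, 3, 2) if (n - c) % 3 == 0)
    encoders = {2: sedc2, 3: sedc3, 4: sedc4}
    code = list(encoders[b](*bits[:b]))
    # Each remaining 3-bit segment is encoded independently of the others.
    for start in range(b, n, 3):
        code.extend(sedc3(*bits[start:start + 3]))
    return tuple(code)


data = (1, 0, 1, 1, 0, 0, 1, 0)
check = sedc_encode(data)
```

Since each segment is encoded on its own, a bidirectional error whose "1" → "0" and "0" → "1" flips fall into different segments appears as a unidirectional error within each segment and changes the recomputed check bits.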

If we interchange $S_1$ and $S_0$ for SEDC3 in Figure 3, the corresponding SEDC3 code is equal to the Berger code for a 3-bit segment, but our way of deriving the SEDC3 code is quite different from that of Berger codes. SEDC3 codes are basically scaled from SEDC2 codes, and SEDC2 codes have no commonality with 2-bit Berger codes.

3.4. SEDC-Based HW-Level Fault Tolerance System Example. In order to illustrate the design of a HW-level fault tolerance system using the SEDC scheme, we take the example of a 4-bit adder. Let us consider that this 4-bit adder is a part of a processor that processes big data applications and that we want to make this 4-bit adder fault tolerant against transient errors that arise in its circuitry. The general HW-level fault tolerance system diagram shown in Figure 1 is then converted to the one shown in Figure 4. As shown in Figure 4, the 4-bit adder acts as an ISG and its equivalent SEDC encoder acts as a CSG. The SEDC encoder, or CSG, can be implemented using (6) as follows:

$$[S_3 \ldots S_0] = SEDC(A[3{:}0] + B[3{:}0] + C_{in}). \tag{6}$$

As the output of the 4-bit adder is a 5-bit value, the equivalent SEDC code has a 4-bit value according to (2). We used Altera's Quartus II software to synthesize the 4-bit adder (ISG), the SEDC encoder (CSG), and the SEDC checker shown in Figure 4 and utilized the synthesized circuit for computing the fault coverage of the SEDC scheme, which is presented in Section 5.3. In the next section, we present the proposed FS SEDC checker, which completes the overall proposed SEDC-based HW-level fault tolerance system.

Table 1: Code table for FS SEDC1 checker.

| G0 | S0 | V1 | V0 |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |

4. The FS SEDC Checker

As shown in Figure 4, the FS SEDC checker takes $n$ information bits and $m$ SEDC check bits from the functional unit. The FS SEDC checker is also composed of one $b$-bit FS SEDC checker and $a$ 3-bit FS SEDC checkers. With 1-, 2-, and 3-bit FS SEDC checkers, the output can be directly used as an error indication signal, but for $n > 3$, one level of wired-AND-OR logic gates is used to combine the outputs of all the FS SEDC checker subblocks and generate the 2-bit error indication signal. The following subsections discuss the logic and circuit diagrams for the primitive FS SEDC checkers (SEDC1, SEDC2, SEDC3, and SEDC4 checkers), which can be used to scale the SEDC checker to an $n$-bit FS SEDC checker (i.e., an FS SEDCn checker).

4.1. The FS SEDC1 Checker. Table 1 shows the logic for a 1-bit SEDC (FS SEDC1) checker. The valid input code words are "10" and "01", and the valid output code word is "10". $G_0$ denotes the 1-bit information word that is the output of the ISG, and $S_0$ denotes the 1-bit SEDC check bit generated by the SEDC check symbol generator (SCSG). $V_1 V_0$ is the 2-bit error indication signal of the FS SEDC1 checker. The $V_1$ and $V_0$ signals are generated by the circuits shown in Figure 5(a).
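The truth table in Table 1 coincides with a NAND for $V_1$ and a NOR for $V_0$. The Python sketch below encodes that behavioural reading; it reproduces Table 1 only and is not a model of the transistor-level circuit of Figure 5(a). The function name is hypothetical.

```python
def fs_sedc1_checker(g0: int, s0: int) -> tuple:
    """Behavioural model of Table 1: returns the error indication (V1, V0)."""
    v1 = 1 - (g0 & s0)   # NAND(G0, S0)
    v0 = 1 - (g0 | s0)   # NOR(G0, S0)
    return v1, v0


# Rows of Table 1 in order: (G0, S0) -> (V1, V0).
table1 = {(g0, s0): fs_sedc1_checker(g0, s0) for g0 in (0, 1) for s0 in (0, 1)}

# Only the valid input code words "01" and "10" yield the valid output "10".
valid_inputs = [inp for inp, out in table1.items() if out == (1, 0)]
```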

4.2. The FS SEDC2 Checker.

$$[V_1\ V_0] = [S_1 (G_1 + G_0)(S_0 + G_1 G_0)(G_1 + G_0 + S_0)(S_1 + G_1 G_0 S_0)]. \tag{7}$$

In Figure 5, the symbols P1-P13 and N1-N13 represent the PMOS and NMOS transistors, respectively, and Vss represents the voltage supply. For simplicity, we used the CMOS-based implementation of the SEDC checker circuits. Any other technology can be used to design these circuits, but the underlying algorithm, i.e., SEDC, will remain the same.

4.3. The FS SEDC3 Checker. Figure 6(a) shows the block diagram and the logic for a 3-bit FS SEDC checker. The 3-bit data $G_2 G_1 G_0$ from the ISG and the 2-bit SEDC check bits $S_1 S_0$ from the SCSG are first converted to $G_1' G_0'$ and $S_1' S_0'$, respectively, and are then checked using the same 2-bit FS SEDC checker, as shown in Figure 6(a). When the $G_2$ bit is "1", $G_1 G_0$ and $S_1 S_0$ are inverted, whereas if $G_2$ is "0", then $G_1 G_0$ and $S_1 S_0$ remain the same. As the outputs of the XOR gates are fed to the FS SEDC2 checker, any error in the XOR gates is detected. This makes the overall 3-bit SEDC checker FS.

Figure 3: 3D illustration of the SEDC3 scheme.

Figure 4: Example of SEDC-based HW-level fault tolerance system.

4.4. The FS SEDC4 Checker. A 4-bit FS SEDC checker consists of one FS SEDC1 checker and one FS SEDC3 checker, as shown in Figure 6(b). Both the SEDC1 and SEDC3 checkers generate a 2-bit output $V_1 V_0$. Because the valid code word is "10", to make sure that both checker units generate the "10" output during error-free operation, we "AND" the $V_1$ output bit of the FS SEDC1 checker with the $V_1$ output bit of the FS SEDC3 checker. Also, we "OR" the $V_0$ output bits of both FS SEDC checkers using wired logic gates. We checked and confirmed by fault simulation that the wired-AND and wired-OR gates are also FS for single faults (stuck-at-0, stuck-at-1, transistor-stuck-on, and transistor-stuck-off).

4.5. The FS SEDCn Checker. Like the SEDC code generator, the FS SEDC checker also consists of multiple 1-, 2-, and 3-bit FS SEDC checkers, depending upon the values of a and b from (1). For example, if n = 8 bits, then (1) ⇒ a = 2 and b = 2. This requires one FS SEDC2 checker and two FS SEDC3 checkers to realize an 8-bit FS SEDC checker.

The area of the wired-AND-OR gates also increases as n is increased. Figure 7 shows the block diagram of an n-bit FS SEDC checker. For n = 8 bits, there are three FS SEDC checkers in total, each with a 2-bit output; hence, a 3-input wired-AND gate and a 3-input wired-OR gate are required to combine all V1 and V0 bits. In general, for an n-bit input, there are “a + 1” FS SEDC checkers, each with a 2-bit output. So, we require “k = 2 × (a + 1)”-input wired-AND and wired-OR gates. With each additional input to the wired-AND-OR network, one extra transistor is required by each of the wired gates. This causes the circuit to expand width-wise; hence, the latency of the wired logic remains constant for any value of n.

The size of the load transistor driving these wired-AND and wired-OR gates also increases with the number of inputs, so we limit the maximum fan-in of one gate to 4. For k > 4, an extra load transistor is connected in parallel. Generally, for k inputs, we require r = ⌈k/4⌉ load transistors. A total of k + r transistors is required to design the k-input wired AND-OR network with a constant latency of 1 transistor.
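The sizing rules stated above can be written out as a short Python sketch. Equation (1) is not reproduced in this section, so the decomposition below assumes the form n = 3a + b with b ∈ {2, 3, 4}, which is consistent with the n = 8 example (a = 2, b = 2); the function names are hypothetical.

```python
import math


def decompose(n):
    """Split n data bits into `a` 3-bit units plus one b-bit unit.

    Assumed form of (1): n = 3a + b with b in {2, 3, 4}.
    """
    if n < 2:
        raise ValueError("this sketch only covers n >= 2")
    a = (n - 2) // 3
    b = n - 3 * a
    return a, b


def wired_network_size(a):
    """Return (k, r, k + r) using the expressions quoted in the text."""
    k = 2 * (a + 1)          # inputs of the wired AND-OR network
    r = math.ceil(k / 4)     # load transistors for a maximum fan-in of 4
    return k, r, k + r


a, b = decompose(8)
print(a, b)                    # units for an 8-bit checker
print(wired_network_size(a))   # (k, r, total transistors)
```

The latency stays constant in this model because adding inputs only widens the network; no gate levels are added.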

5. Experiments and Results

In this section, we present the experiments we conducted on the proposed FS SEDC checker and the overall proposed SEDC-based HW-level fault tolerance system. The results of each experiment are given along with the experimental details in the subsections below.

5.1. Fault Test on FS SEDC Checker. The FS SEDC1, SEDC2, SEDC3, and SEDC4 circuits in our paper were tested for stuck-at-0, stuck-at-1, transistor-stuck-ON, and transistor-stuck-OFF faults. We assume a single-fault model, where faults occur one at a time and there is enough time between the detection of the first fault and the occurrence of another fault [29]. In Table 2, we provide a summary of the fault analysis of an SEDC1 checker circuit. We applied one fault at a time in the circuit of Figure 5(a) and observed the output. In single-fault operation, the circuit either produced the correct output or flagged the fault with a noncode output; it never produced an incorrect valid code word (exhibiting the FS property), as shown in Table 2.

Figure 5: CMOS-based circuits of FS (a) SEDC1 checker and (b) SEDC2 checker.

Figure 6: Block diagram of FS (a) SEDC3 checker and (b) SEDC4 checker.

Case 1 (transistor stuck ON). In Table 2, we show all six cases of transistor stuck-ON faults (one at a time). For the cases with N3 or N4 stuck ON, the circuit shows fault detection by one input code combination (represented with the symbol ‰), and hence the circuit is self-testing, whereas the other cases showed that the circuit is fault secure as well as code disjoint.

Case 2 (transistor stuck OFF). In Table 2, all six cases of transistor stuck-OFF faults are shown. In cases where N1 or N2 was stuck OFF, the circuit demonstrates the self-testing property (represented with the symbol ‰), and for the rest of the cases, the circuit is fault secure.

Case 3 (input stuck at 0). When input G0 or S0 is stuck at 0, the circuit demonstrates the self-testing property; otherwise, it remains fault secure.

Case 4 (input stuck at 1). When input G0 or S0 is stuck at 1, the circuit shows the self-testing property; otherwise, it remains fault secure.

There is one case where the output becomes floating (i.e., P3 or P4 stuck OFF). In either case, if we consider the floating voltage as logic high, then the circuit is fault secure, and if we consider the floating voltage as logic low, then the circuit is self-testing. Hence, we can say that the circuit in Figure 5(a), which is a 1-bit SEDC checker, is FS. Similar analysis was carried out when testing 2-, 3-, and 4-bit SEDC checkers, and we found that all these checkers are FS.
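The input stuck-at cases (Cases 3 and 4) can be replayed with a small behavioural model. The model below is an inference, not the authors' netlist: the fault-free and input-fault rows of Table 2 are consistent with V1 = NAND(G0, S0) and V0 = NOR(G0, S0), and that is what the sketch assumes. Transistor-level faults are not modelled, and the function names are hypothetical.

```python
def sedc1_checker(g0, s0):
    """Behavioural 1-bit checker: returns (V1, V0).

    Assumed logic (inferred from Table 2): V1 = NAND, V0 = NOR.
    """
    v1 = 1 - (g0 & s0)
    v0 = 1 - (g0 | s0)
    return v1, v0


def with_stuck_input(g0, s0, line, value):
    """Force one input line ("G0" or "S0") to `value`, then evaluate."""
    if line == "G0":
        g0 = value
    elif line == "S0":
        s0 = value
    return sedc1_checker(g0, s0)


CODE_WORDS = [(0, 1), (1, 0)]   # valid (G0, S0) inputs

for line in ("G0", "S0"):
    for value in (0, 1):
        outputs = [with_stuck_input(g, s, line, value) for g, s in CODE_WORDS]
        print(line, "stuck at", value, "->", outputs)
```

For each stuck input, one code word still yields the valid output (1, 0) and the other yields a noncode output, which matches the self-testing behaviour described in Cases 3 and 4.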

5.2. Area, Delay, and Power Comparison. In this section, we compare the area and delay of TSC Berger, FS SEDC, and m-out-of-2m code checkers. We use the two possible TSC Berger checker implementations from Piestrak et al. [23] and Pierce Jr. and Lala [26], with the m-out-of-2m code checker from Lala [24], for comparison. For the sake of fairness, the area overhead was measured in terms of the number of equivalent transistors. We made use of the assumptions by Smith [30] to translate gate-level circuits to transistor-level circuits.

Figure 7: Block diagram of FS SEDCn checker. [Block diagram: the functional circuit output (n bits) and the SEDC code (m bits) feed a units of FS SEDC checkers for 3-bit data and one FS SEDC checker for b-bit data; their V1V0 outputs feed a k-input wired AND-OR network that produces the error indication signal.]

Table 2: Results of single faults on FS SEDC1 checker.

| Fault | G0 | S0 | V1 | V0 |
|---|---|---|---|---|
| MOS P1 or P2 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| MOS P1 or P2 is stuck OFF | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Input C0 stuck at zero | ‰ 0 | 0 | 1 | 1 |
| | 1 | 0 | 1 | 0 |
| MOS P3 or P4 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| MOS P3 or P4 is stuck OFF | 0 | 1 | Floating | 0 |
| | 1 | 0 | 1 | 0 |
| Input F0 stuck at zero | ‰ 0 | 0 | 1 | 1 |
| | 0 | 1 | 1 | 0 |
| Transistor N1 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N1 is stuck OFF | 0 | 1 | 1 | 0 |
| | ‰ 1 | 0 | 1 | 1 |
| Input C0 stuck at 1 | 0 | 1 | 1 | 0 |
| | ‰ 1 | 1 | 0 | 0 |
| Transistor N2 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N2 is stuck OFF | ‰ 0 | 1 | 1 | 1 |
| | 1 | 0 | 1 | 0 |
| Input F0 stuck at 1 | 1 | 0 | 1 | 0 |
| | ‰ 1 | 1 | 0 | 0 |
| Transistor N3 is stuck ON | ‰ 0 | 1 | 0 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N3 is stuck OFF | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N4 is stuck ON | ‰ 0 | 1 | 1 | 0 |
| | 1 | 0 | 0 | 0 |
| Transistor N4 is stuck OFF | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |

‰ The cases where the circuit shows the self-testing property.

Before the comparison, we illustrate the functional dissimilarities of the three checkers with the help of Figure 8. Figure 8(a) shows the general block diagram of a TSC Berger code checker. For all the information symbols that the ISG of the functional circuit can produce in normal operation, the check symbol complement generator (CSCG) outputs (SB′) correspond to the bit-by-bit complement of the expected check symbol SB. The TSC two-rail checker validates that each bit of SB is the complement of the corresponding bit of SB′. As the size of the input data increases, the length of the check symbol SB also increases, resulting in a longer TSC two-rail checker tree and hence a longer delay.

A general block diagram of a TSC m-out-of-2m code checker is shown in Figure 8(b). The checker takes the information bits and check bits SW and partitions them into two parts. The numbers of 1’s, i.e., the weights, of both parts are mapped to a pair of values which, in binary, belongs to a code, in most cases a two-rail code. The checker consists of a cellular structure of AND-OR gates, as given by Lala [24].

Figure 8(c) depicts the general block diagram for an FS SEDC checker, which resembles the structure of an m-out-of-2m code checker and differs from a Berger code checker. The FS SEDC checker block receives the information and check bits from the functional unit. If the input data length increases, the size of the FS checker block increases width-wise. The FS SEDCn block contains “a + 1” pairs of small SEDC checkers (subblocks). Each subblock of the FS SEDC checker produces “10” as the valid code output. The overall SEDC checker has a final 2-bit output S10; unlike two-rail codes, only one of the output combinations, “10”, is considered a valid code word. A nonvalid checker output (“00”, “01”, or “11”) at output S10 indicates the presence of a fault in the functional circuit or in the FS checker itself. The k-input wired AND-OR network takes the “a + 1” pairs of outputs from the SEDC checker subblocks and converts them into a final 2-bit error indication signal VS.

Figure 8: Block diagrams of (a) TSC Berger checker, (b) m-out-of-2m code checker, and (c) FS SEDC checker.

Figure 9: Area comparison of area-optimized Berger [23], SEDC, and m-out-of-2m [24] code checkers. [Plot: circuit size (number of transistors) versus data length (bits) for data lengths 2, 3, 4, 5, 7, 8, 15, 16, 30, and 32.]

5.2.1. Area. Area-optimized realization of TSC Berger code checkers in Piestrak et al. [23] showed less area overhead than m-out-of-2m code checkers, which is apparent from Figure 9. But if we consider the delay-optimized implementation of the TSC Berger code checker from Pierce Jr. and Lala [26], we see that the TSC Berger code checker requires more area than the FS SEDC and m-out-of-2m code checkers [24], as shown in Table 3. For clarity, we separately list the area overhead due to the code storage area and the code checker area in Table 3. Also listed separately are the area overheads required by the TRC tree for the TSC Berger code checker and by the wired-AND-OR network for the FS SEDC and m-out-of-2m code checkers.

For a fair comparison, the extra cost of the code storage area is also taken into account. We assumed that 1-bit storage is implemented by 12 MOS transistors [30]. Table 3 lists the area (in terms of the number of transistors) occupied by FS SEDC, delay-optimized Berger code, and m-out-of-2m code checkers for up to 32-bit data.

The FS SEDCn checker block shown in Figure 8(c) requires fewer gates, implemented with [26 + (a × 50)] MOS transistors if “b = 2”, [50 + (a × 50)] MOS transistors if “b = 3”, and [58 + (a × 50)] MOS transistors if “b = 4”. The m-out-of-2m code checker implementation of Lala [24] requires 2m² − 2m + 2 gates. The gate-level circuit is also translated to transistor-level circuits using data from Smith [30].
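The three transistor-count expressions for the FS SEDCn checker block can be collected into one lookup. The sketch below uses only the constants quoted above; the function and constant names are hypothetical, and the wired AND-OR network and code storage are not included.

```python
# Transistors of the b-bit unit, as quoted in the text for b = 2, 3, 4.
BASE_TRANSISTORS = {2: 26, 3: 50, 4: 58}
# Transistors added by each of the `a` 3-bit units.
UNIT_TRANSISTORS = 50


def sedc_checker_transistors(a, b):
    """MOS transistor count of the FS SEDCn checker block."""
    return BASE_TRANSISTORS[b] + a * UNIT_TRANSISTORS


print(sedc_checker_transistors(1, 4))  # 7-bit checker: one SEDC3 + one SEDC4
print(sedc_checker_transistors(2, 2))  # 8-bit checker: two SEDC3 + one SEDC2
```

Each additional 3-bit unit adds a fixed 50 transistors, so the checker block grows linearly with the data length.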

The results show that when scaling a 7-bit 0’s counter to an 8-bit 0’s counter, 154 extra MOS transistors are required. The m-out-of-2m code checker requires 60 extra MOS transistors when scaling a 7-out-of-14 checker to an 8-out-of-16 checker, whereas the SEDC checker requires only 18 extra MOS transistors. That is because a 7-bit SEDC checker is implemented with one SEDC3 and one SEDC4 circuit, which contain 50 and 58 MOS transistors, respectively (a total of 108 transistors). An 8-bit SEDC checker is implemented using one SEDC2 and two SEDC3 checkers, requiring 26 and 100 (50 × 2) MOS transistors (a total of 126 transistors). This means that, for this scaling step, SEDC needs 88% fewer additional transistors than a Berger code checker [26] and 70% fewer than m-out-of-2m code checkers. Although Berger and m-out-of-2m checkers are TSC while the proposed SEDC checker is only FS, all three checkers provide the same fault security.

Table 3: Area overhead (number of transistors) of Berger [26], SEDC, and m-out-of-2m [24] code checkers.

| Data bits | Berger: code storage area | Berger: 1’s counter area | Berger: TRC area | Berger: total area | SEDC: code storage area | SEDC: checker area | SEDC: AND-OR network | SEDC: total area | m-out-of-2m: code storage area | m-out-of-2m: checker area | m-out-of-2m: AND-OR network | m-out-of-2m: total area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 24 | 22 | 4 | 50 | 24 | 26 | 0 | 50 | 24 | 36 | 0 | 50 |
| 3 | 24 | 80 | 8 | 112 | 24 | 50 | 0 | 74 | 36 | 152 | 0 | 188 |
| 4 | 36 | 180 | 12 | 228 | 36 | 58 | 6 | 100 | 48 | 240 | 10 | 298 |
| 5 | 36 | 178 | 16 | 230 | 48 | 76 | 6 | 130 | 60 | 300 | 14 | 374 |
| 7 | 36 | 396 | 24 | 456 | 60 | 108 | 8 | 176 | 84 | 420 | 18 | 522 |
| 8 | 48 | 550 | 28 | 626 | 72 | 126 | 8 | 206 | 96 | 480 | 20 | 596 |
| 15 | 48 | 1106 | 56 | 1210 | 120 | 250 | 14 | 384 | 180 | 900 | 38 | 1118 |
| 16 | 60 | 1308 | 60 | 1428 | 132 | 258 | 16 | 406 | 192 | 960 | 40 | 1192 |
| 30 | 60 | 2586 | 116 | 2762 | 240 | 500 | 26 | 766 | 360 | 1800 | 76 | 2236 |
| 32 | 72 | 3048 | 120 | 3240 | 264 | 526 | 28 | 818 | 384 | 1920 | 80 | 2384 |
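The 88% and 70% figures for the 7-bit to 8-bit scaling step follow directly from the Table 3 entries. The short check below redoes that arithmetic; the helper name is hypothetical.

```python
def saving_percent(extra_sedc, extra_other):
    """Percentage of additional transistors saved by SEDC for one scaling step."""
    return 100.0 * (1.0 - extra_sedc / extra_other)


extra_sedc = 126 - 108      # SEDC checker area, 7 -> 8 bits
extra_berger = 550 - 396    # Berger counter area, 7 -> 8 bits
extra_m_out = 480 - 420     # m-out-of-2m checker area, 7-out-of-14 -> 8-out-of-16

print(extra_sedc, extra_berger, extra_m_out)
print(round(saving_percent(extra_sedc, extra_berger)))
print(round(saving_percent(extra_sedc, extra_m_out)))
```

Note that these percentages compare the increments between two data lengths, not the total checker areas.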

5.2.2. Delay. As far as delay is concerned, the FS SEDC checker also performs better than the Berger checker and the cellular implementation of an m-out-of-2m code checker, as shown in Table 4. For the sake of uniformity, we designed all the basic gates using the same technology transistors (PMOS = 8μ/2μ, NMOS = 4μ/2μ) and evaluated the worst-case propagation delay of each circuit.

The SEDC checker shows an almost constant delay for n > 3 bits due to its parallel implementation, whereas the delay in the Berger code checker increases owing to an increase in gate levels (from 6 to 16) in the critical path, as shown by Pierce Jr. and Lala [26]. The delay for m-out-of-2m code checkers also continues to increase with increasing data lengths because the cellular implementation requires “m (= input data length)” gate levels in the critical path.

5.2.3. Power Dissipation. In order to evaluate the power dissipation of the three checkers, we used the PowerPlay power analyzer tool. We implemented the Berger [26], m-out-of-2m [24], and SEDC checkers using Verilog and synthesized the circuits using Altera’s Quartus II software. We targeted the circuit for a Cyclone II EP2C5AF256A7 chip, which has the least power dissipating properties among the Cyclone family. We allowed the synthesizer to create a balance between area and delay while synthesizing in order to get a better power estimate. We also enabled the synthesizer to use a synthesis mode that takes intensive steps to optimize power for all three circuits. We clocked the inputs of the circuit with the default toggle rate and estimated the total thermal power dissipation for different values of input data width.

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This happens due to the increase in the number of two-rail checkers in the case of the Berger checker and due to the increase in the checker circuitry itself in the case of the m-out-of-2m checker. The same trend is evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LEs) occupied by the checkers.

5.3. Fault Coverage of the Proposed HW-Level Fault Tolerance Scheme. In order to elaborate the effectiveness of the SEDC CSG and its FS checker, we computed the fault coverage of the proposed SEDC-based HW-level fault tolerance scheme. We applied faults in the example circuit of Figure 4, given in Section 3.4. As most of the VLSI combinational circuits designed for mathematical operations like add, subtract, multiply, and divide consist of multiple instances of 1-bit adders (full adders), the example circuit, i.e., a 4-bit adder, is a simple and good candidate for presenting the effectiveness of our scheme. We injected two major types of transient errors, i.e., stuck-at-0 and stuck-at-1 [29], at 24 nodes (at 6 nodes per full adder, as shown in Figure 11(b)). We injected these errors using 2-to-1 multiplexers whose output is given by

$$
\mathit{mux}_{\mathit{out}} =
\begin{cases}
\mathit{in}_1\ (\text{normal gate output}) & \text{if } \mathit{select}\ (f\_enable) = 0, \\
\mathit{in}_2\ (\text{stuck-at fault } f \in F) & \text{if } \mathit{select}\ (f\_enable) = 1.
\end{cases}
\tag{8}
$$

In Figure 11(a), the symbols A[3:0], B[3:0], Cin, f_enable, and F[23:0] denote the 4-bit input A, 4-bit input B, 1-bit carry-in, 1-bit fault enabling signal, and 24-bit fault signals, respectively, while Cout is the carry-out and S[3:0] represents the 4-bit sum output of the 4-bit adder. Figure 11(b) shows the detailed schematic of a single full adder.

We considered that faults can occur at the outputs of the logic gates only and adopted a single-fault model, according to which only one fault can occur at a time [29].
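The multiplexer-based injection of (8) can be illustrated with a software model of one full adder. This is a sketch under stated assumptions: the five-gate full adder and its node names (`x1`, `s`, `a1`, `a2`, `cout`) are hypothetical and do not reproduce the six injection nodes of Figure 11(b), and the single `node` argument stands in for the per-node fault signals.

```python
def mux2to1(in1, in2, select):
    """Equation (8): pass in1 when select = 0, the forced value in2 when select = 1."""
    return in2 if select else in1


def full_adder(a, b, cin, f_enable=0, node=None, stuck=0):
    """1-bit full adder with a fault-injection mux on every gate output.

    Only the gate named by `node` is forced to `stuck`, and only while
    f_enable is 1 (single-fault model).
    """
    def inject(name, value):
        select = 1 if (f_enable and name == node) else 0
        return mux2to1(value, stuck, select)

    x1 = inject("x1", a ^ b)
    s = inject("s", x1 ^ cin)
    a1 = inject("a1", a & b)
    a2 = inject("a2", x1 & cin)
    cout = inject("cout", a1 | a2)
    return s, cout


print(full_adder(1, 1, 0))                                       # fault-free
print(full_adder(1, 1, 0, f_enable=1, node="a1", stuck=0))       # a1 stuck-at-0
```

With `f_enable = 0` every mux passes the normal gate output, so the same model serves as the fault-free reference in an exhaustive comparison.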


Figure 10: Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder and (b) 1-bit full adder with fault injection. [Schematic: four full adders FA1–FA4, each with inputs A, B, Cin, f_enable, and F[5:0] and outputs S and Cout; each full adder contains six 2-to-1 multiplexers with ports in1, in2, select, and out.]


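Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

| Data bits | Berger | SEDC | m-out-of-2m |
|---|---|---|---|
| 2 | 3.888 | 0.514 | 1.024 |
| 3 | 4.151 | 2.524 | - |
| 4 | 7.741 | 2.738 | 5.490 |
| 5 | - | 2.713 | 5.558 |
| 7 | 7.821 | 2.77 | 8.297 |
| 8 | 7.599 | 2.76 | 9.284 |
| 15 | 10.566 | 2.826 | - |
| 16 | 12.956 | 2.751 | |
| 32 | 17.964 | 2.771 | - |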

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

| | (a) Total errors at the output of the adder | (b) BEs | (c) Detected BEs | (d) UEs | (e) Detected UEs | (f) Total detected errors | (g) Total undetected errors |
|---|---|---|---|---|---|---|---|
| Total | 1748 | 252 | 120 | 1496 | 1496 | 1616 | 132 |
| Percentage (%) | 100 | 14.42 wrt (a) | 47.62 wrt (b) | 85.58 wrt (a) | 100 wrt (d) | 92.45 wrt (a) | 7.55 wrt (a) |

We used Altera’s Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these 1748 errors were bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This also supports the observation that most of the errors in VLSI circuits result in UEs at the output [19–21]. Even though SEDC is an AUED scheme and provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. This is because SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of BEs is also partitioned into multiple UEs and is thus detected by the proposed SEDC scheme.
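The percentages reported in Table 5 can be recomputed from the raw counts. The sketch below only redoes that arithmetic; the variable names are hypothetical.

```python
def percent(part, whole):
    """Percentage rounded to two decimals, as reported in Table 5."""
    return round(100.0 * part / whole, 2)


total_errors = 1748      # (a) errors observed at the adder output
bes = 252                # (b) bidirectional errors
detected_bes = 120       # (c)
ues = 1496               # (d) unidirectional errors
detected_ues = 1496      # (e)

total_detected = detected_bes + detected_ues     # (f)
total_undetected = total_errors - total_detected # (g)

print(percent(bes, total_errors))            # share of BEs
print(percent(detected_bes, bes))            # BE coverage
print(percent(detected_ues, ues))            # UE coverage
print(percent(total_detected, total_errors)) # overall coverage
```

The overall coverage is dominated by the UE share: all UEs are detected, so only the undetected BEs (132 of 1748 errors) escape.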

5.4. Cost Analysis: SW-Based Fault Tolerance versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity in our analysis, we take the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image, stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node or, in some cases, on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service, and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of the failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].

For our cost analysis, we would like to compute the cost associated with the CC data recovery algorithm, for which we assume a cloud application such as a message passing interface (MPI) program that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p - 1$ processes with equal probabilities. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure only occurs during message passing with equal probability and not during checkpointing or recovery, we use the analytical cost model given in [4] for the cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then, the total cost of fault tolerance per process is given by

$$
C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}
$$


Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then according to [4], the average recovery cost in CC per process is given by

$$
T_{RO} = \frac{C_{rb}}{1/P(f)} = P(f)\, C_{rb}. \tag{10}
$$

Let $P(cp)$ denote the probability that a process starts checkpointing; then $(1 - P(cp))^p$ becomes the probability that $p$ processes do not start checkpointing, while $1 - (1 - P(cp))^p$ becomes the probability that at least one process starts a checkpoint. Consequently, $1/(1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$ and generate request (REQ) and acknowledgement (ACK) signals to the rest of the $p - 1$ noninitiators (a total of $2(p - 1)$ signals) and likewise be a noninitiator with probability $1 - 1/p$ and generate only one ACK signal in response to the initiator. As a result, there are $3(p - 1)/p$ average messages generated per checkpoint, and the average overhead per checkpoint is $C_w + (3(p - 1)/p)\,C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then, the average checkpointing cost for a process is given by

$$
T_{CP} = \frac{C_w + \frac{3(p - 1)}{p}\, C_{nl}}{1 / \left(1 - (1 - P(cp))^p\right)}
= \left(1 - (1 - P(cp))^p\right)\left(C_w + \frac{3(p - 1)}{p}\, C_{nl}\right). \tag{11}
$$

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpoint per 15 minutes), $C_{nl} = 20$ ms, $C_w = 1$ s, and $C_{rb} = 2$ s, as given in [4]. We consider the value of $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, while each fault occurs after 168 hours (one week’s time). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the value 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters while keeping the others constant and observe the effect on data recovery cost with and without the proposed HW-level fault tolerance.
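Equations (9), (10), and (11) translate directly into code. The sketch below plugs in the parameter values quoted above, with two loudly stated assumptions: the execution time `T` is not given in the text, so a hypothetical placeholder is used, and the probabilities are inserted exactly as quoted without normalising their time units. The printed percentages are therefore illustrative only and are not meant to reproduce Figures 12 to 14.

```python
def checkpoint_cost(p, p_cp, c_w, c_nl):
    """Average checkpointing cost per process, following equation (11)."""
    prob_any_checkpoint = 1.0 - (1.0 - p_cp) ** p
    overhead_per_checkpoint = c_w + 3.0 * (p - 1) / p * c_nl
    return prob_any_checkpoint * overhead_per_checkpoint


def recovery_cost(p_f, c_rb):
    """Average recovery cost per process, following equation (10)."""
    return p_f * c_rb


def fault_tolerance_cost(t, t_cp, t_ro):
    """Total cost of fault tolerance as a percentage of T, following equation (9)."""
    return (t_cp + t_ro) / t * 100.0


# Parameter values quoted in the text (times in seconds).
P, P_CP, C_NL, C_W, C_RB = 128, 1 / 15, 0.020, 1.0, 2.0
P_F = 1 / 168
HW_UNHANDLED = 0.0755    # fraction of faults not handled at the HW level (Table 5)
T = 3600.0               # hypothetical execution time; an assumption, not from the text

t_cp = checkpoint_cost(P, P_CP, C_W, C_NL)
t_ro_sw = recovery_cost(P_F, C_RB)                 # SW-level fault tolerance only
t_ro_hw = recovery_cost(HW_UNHANDLED * P_F, C_RB)  # with HW-level fault tolerance

cost_sw_only = fault_tolerance_cost(T, t_cp, t_ro_sw)
cost_with_hw = fault_tolerance_cost(T, t_cp, t_ro_hw)
print(f"SW only: {cost_sw_only:.4f}%  with HW: {cost_with_hw:.4f}%")
```

In this model, HW-level fault tolerance only scales the recovery term: the checkpointing term of (11) is unchanged, and the recovery term of (10) shrinks by the factor 0.0755.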

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes p is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into p processes and each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in data recovery cost in the CC algorithm because every process has to coordinate with every other process in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which caused a huge impact on fault tolerance overhead, as shown in Figure 14. But if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures leading to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains failsafe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the results show the superiority of the proposed SEDC checker. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data, both at SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University, 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 ’13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC’10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN ’13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD ’13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT ’16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC ’14, pp. 159–160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity: report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, “A t-unidirectional error-detecting systematic code,” Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, “Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction,” in Proceedings of the 2012 IEEE International Conference on Communications, ICC ’12, pp. 7029–7033, June 2012.

[23] S Piestrak D Bakalis and X Kavousianos ldquoOn the design ofself-testing checkers for modified Berger codesrdquo in Proceedingsof the Seventh International On-Line Testing Workshop pp 153ndash157 Taormina Italy 2001

[24] P K Lala Self-Checking and Fault Tolerant Digital DesignAcademic press UK 2001

[25] J-A Lee Z A Siddiqui N Somasundaram and J-G LeeldquoSelf-checking look-up tables using scalable error detectioncoding (SEDC) schemerdquo Journal of Semiconductor Technologyand Science vol 13 no 5 pp 415ndash422 2013

[26] D. A. Pierce Jr. and P. K. Lala, "Modular implementation of efficient self-checking checkers for the Berger code," Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, "Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding," in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, "Online error detection in SRAM based FPGAs using Scalable Error Detection Coding," in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED '13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, "Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes," IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, Piscataway, NJ, USA, May 2010.

3.3. SEDC4 Code. A SEDC4 code for 4-bit data is formulated by (5) as follows:

\[ [S_2, (S_1, S_0)] = \mathrm{SEDC4}(D_3, D_2, D_1, D_0) = [\overline{D_3}, \mathrm{SEDC3}(D_2, D_1, D_0)] \tag{5} \]

The MSB of the code word depends only on the MSB of the data word for SEDC4; hence, any change in the MSB of the data word is detected. The remaining three data bits are encoded using the same SEDC3 scheme.

It can be observed from (3), (4), and (5) that SEDC2 is embedded in 3-bit SEDC (SEDC3) and consequently in 4-bit SEDC (SEDC4) to detect all unidirectional errors in 3-bit and 4-bit data, as shown later. This ability to scale codes is not present in any other concurrent error detecting (CED) coding scheme.

In general, for SEDCn, the $n$-bit binary data is grouped into one $b$-bit segment and $a$ 3-bit segments, and then these segments are encoded using one SEDCb module and $a$ SEDC3 modules in parallel, as shown in Figure 2(a). It is noteworthy that each group of data segments and corresponding code segments is independent of the others. This independence makes our scheme scalable and able to detect some portion of bidirectional errors (BE) (discussed in Section 5.3).

If we interchange $S_1$ and $S_0$ for SEDC3 in Figure 3, the corresponding SEDC3 code is equal to the Berger code for a 3-bit segment, but our way of deriving the SEDC3 code is quite different from that of Berger codes. SEDC3 codes are basically scaled from SEDC2 codes, and SEDC2 codes have no commonality with 2-bit Berger codes.
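The relation to the Berger code stated above can be illustrated with a minimal Python sketch. It assumes that the Berger check symbol of a 3-bit word is the binary count of its 0s and that SEDC3 is obtained by swapping the two check bits; the function names are hypothetical and not taken from the paper. The sketch also checks the property that makes a code all-unidirectional-error detecting, namely that no code word covers another.

```python
from itertools import product


def berger_check(bits):
    """Berger check symbol for 3 data bits: the number of 0s, as (MSB, LSB)."""
    zeros = bits.count(0)
    return (zeros >> 1) & 1, zeros & 1


def sedc3(d2, d1, d0):
    """Assumed SEDC3 check bits (S1, S0): the Berger symbol with its bits swapped."""
    b1, b0 = berger_check((d2, d1, d0))
    return b0, b1


def covers(x, y):
    """True if word x has a 1 wherever word y has a 1."""
    return all(a >= b for a, b in zip(x, y))


DATA_WORDS = list(product((0, 1), repeat=3))

# Full code words: three data bits followed by the two check bits.
CODE_WORDS = [d + sedc3(*d) for d in DATA_WORDS]

# All unidirectional errors are detectable iff no code word covers another.
UNORDERED = all(not covers(x, y) for x in CODE_WORDS for y in CODE_WORDS if x != y)

if __name__ == "__main__":
    for d in DATA_WORDS:
        print(d, "->", sedc3(*d))
    print("no code word covers another:", UNORDERED)
```

Because the check bits depend only on the number of 1s in the segment, any purely 0-to-1 or purely 1-to-0 change alters the expected check symbol.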

3.4. SEDC-Based HW-Level Fault Tolerance System Example. To illustrate the design of a HW-level fault tolerance system using the SEDC scheme, we take the example of a 4-bit adder. Suppose that this 4-bit adder is part of a processor that runs big data applications and that we want to make it fault tolerant against transient errors arising in its circuitry; the general HW-level fault tolerance system diagram shown in Figure 1 is then converted to the one shown in Figure 4. As shown in Figure 4, the 4-bit adder acts as an ISG and its equivalent SEDC encoder acts as a CSG. The SEDC encoder, or CSG, can be implemented using (6) as follows:

\[ [S_3 : S_0] = \mathrm{SEDC}(A[3:0] + B[3:0] + C_{in}) \tag{6} \]

As the output of the 4-bit adder is a 5-bit value, the equivalent SEDC code has a 4-bit value according to (2). We used Altera's Quartus II software to synthesize the 4-bit adder (ISG), the SEDC encoder (CSG), and the SEDC checker shown in Figure 4 and utilized the synthesized circuit for computing the fault coverage of the SEDC scheme, which is presented in Section 5.3. In the next section, we present the proposed FS SEDC checker, which completes the overall proposed SEDC-based HW-level fault tolerance system.
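A behavioural Python sketch of this arrangement is given below. It is only an illustration under stated assumptions: the 5-bit sum is split into its two upper bits (encoded with SEDC2) and its three lower bits (encoded with SEDC3), the segment codes are the zero-count codes with swapped bits, and the checker is modelled as a simple recomputation rather than the FS circuit. All function names are hypothetical.

```python
def sedc3(d2, d1, d0):
    """Assumed SEDC3 check bits (S1, S0): zero count of the segment, bits swapped."""
    zeros = 3 - (d2 + d1 + d0)
    return zeros & 1, (zeros >> 1) & 1


def sedc2(d1, d0):
    """Assumed SEDC2 check bits: the D2 = 0 slice of the SEDC3 table."""
    return sedc3(0, d1, d0)


def to_bits(value, width):
    """Return value as an MSB-first tuple of bits."""
    return tuple((value >> i) & 1 for i in reversed(range(width)))


def sedc5(word):
    """4 check bits for a 5-bit word: SEDC2 on the top 2 bits, SEDC3 on the low 3."""
    return sedc2(*word[:2]) + sedc3(*word[2:])


def adder(a, b, cin):
    """ISG: 5-bit result of A[3:0] + B[3:0] + Cin."""
    return to_bits(a + b + cin, 5)


def csg(a, b, cin):
    """CSG: check bits predicted directly from the operands."""
    return sedc5(to_bits(a + b + cin, 5))


def check(word, check_bits):
    """Behavioural checker: True when the word matches its check bits."""
    return sedc5(word) == check_bits


if __name__ == "__main__":
    good = adder(9, 5, 1)            # 15 -> (0, 1, 1, 1, 1)
    bits = csg(9, 5, 1)
    faulty = good[:4] + (0,)         # LSB stuck at 0
    print(good, bits, check(good, bits))
    print(faulty, bits, check(faulty, bits))
```

In hardware the CSG is a separate circuit, so a fault in the adder does not disturb the predicted check bits; the sketch mirrors this by computing both from the operands.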

Table 1: Code table for FS SEDC1 checker.

G0 | S0 | V1 | V0
0 | 0 | 1 | 1
0 | 1 | 1 | 0
1 | 0 | 1 | 0
1 | 1 | 0 | 0

4. The FS SEDC Checker

As shown in Figure 4, the FS SEDC checker takes $n$ information bits and $m$ SEDC check bits from the functional unit. The FS SEDC checker is also composed of one $b$-bit FS SEDC checker and $a$ sets of 3-bit FS SEDC checkers. With 1-, 2-, and 3-bit FS SEDC checkers, the output can be directly used as an error indication signal, but for $n > 3$, one level of wired-AND-OR logic gates is used to combine the outputs of all FS SEDC checker subblocks and generate the 2-bit error indication signal. The following subsections discuss the logic and circuit diagrams for the primitive FS SEDC checkers (SEDC1, SEDC2, SEDC3, and SEDC4 checkers), which can be used to scale the SEDC checker to an $n$-bit FS SEDC checker (i.e., an FS SEDCn checker).

4.1. The FS SEDC1 Checker. Table 1 shows the logic for a 1-bit SEDC (FS SEDC1) checker. The valid input code words are "10" and "01", and the valid output code word is "10". $G_0$ denotes the 1-bit information word that is the output of the ISG, and $S_0$ denotes the 1-bit SEDC check bit generated by the SEDC check symbol generator (SCSG). $V_1 V_0$ is the 2-bit error indication signal of the FS SEDC1 checker. The $V_1$ and $V_0$ signals are generated by the circuits shown in Figure 5(a).
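Reading Table 1 row by row, $V_1$ behaves as the NAND and $V_0$ as the NOR of $G_0$ and $S_0$. The short Python sketch below encodes that reading at the logic level only (it does not model the transistor circuit of Figure 5(a)); the function name is hypothetical.

```python
def sedc1_checker(g0, s0):
    """Logic-level model of Table 1: returns the error indication pair (V1, V0)."""
    v1 = 1 - (g0 & s0)  # NAND of G0 and S0
    v0 = 1 - (g0 | s0)  # NOR of G0 and S0
    return v1, v0


# Rows of Table 1 as (G0, S0) -> (V1, V0).
TABLE_1 = {
    (0, 0): (1, 1),
    (0, 1): (1, 0),
    (1, 0): (1, 0),
    (1, 1): (0, 0),
}

if __name__ == "__main__":
    for inputs, expected in TABLE_1.items():
        print(inputs, sedc1_checker(*inputs), expected)
```

Only the two valid input code words, "01" and "10", produce the valid output "10"; the inputs "00" and "11" yield "11" and "00", respectively.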

4.2. The FS SEDC2 Checker

\[ [V_1, V_0] = [S_1 (G_1 + G_0)(S_0 + G_1 G_0),\ (G_1 + G_0 + S_0)(S_1 + G_1 G_0 S_0)] \tag{7} \]

In Figure 5, the symbols P1–P13 and N1–N13 represent the PMOS and NMOS transistors, respectively, and Vss represents the voltage supply. For simplicity, we used the CMOS-based implementation of SEDC checker circuits. Any other technology can be used to design these circuits, but the underlying algorithm, i.e., SEDC, will remain the same.

4.3. The FS SEDC3 Checker. Figure 6(a) shows the block diagram and the logic for a 3-bit FS SEDC checker. Three-bit data $G_2 G_1 G_0$ from the ISG and 2-bit SEDC check bits $S_1 S_0$ from the SCSG are first converted to $G_1' G_0'$ and $S_1' S_0'$, respectively, and are then checked using the same 2-bit FS SEDC checker, as shown in Figure 6(a). When the $G_2$ bit is "1", $G_1 G_0$ and $S_1 S_0$ are inverted, whereas if $G_2$ is "0", $G_1 G_0$ and $S_1 S_0$ remain the same. As the outputs of the XOR gates are fed to the FS SEDC2 checker, any error in the XOR gates is detected. This makes the overall 3-bit SEDC checker FS.
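The conversion step can be sketched behaviourally in Python. The sketch assumes the SEDC2 mapping 00 -> 11, 01/10 -> 01, 11 -> 10 (inferred from the Berger relation of Section 3.3, not quoted from the paper) and models the 2-bit checker as a plain validity test; the function names are hypothetical. It then confirms that XOR-ing $G_1 G_0$ and $S_1 S_0$ with $G_2$ accepts exactly the SEDC3 code words.

```python
from itertools import product


def sedc2(g1, g0):
    """Assumed SEDC2 check bits (S1, S0): 00 -> 11, 01/10 -> 01, 11 -> 10."""
    return 1 - (g1 ^ g0), 1 - (g1 & g0)


def sedc2_valid(g1, g0, s1, s0):
    """Behavioural stand-in for the 2-bit checker: True for a valid code word."""
    return (s1, s0) == sedc2(g1, g0)


def sedc3_valid(g2, g1, g0, s1, s0):
    """XOR every data and check bit with G2, then reuse the 2-bit check."""
    return sedc2_valid(g1 ^ g2, g0 ^ g2, s1 ^ g2, s0 ^ g2)


def sedc3_reference(g2, g1, g0):
    """Reference SEDC3 code: zero count of the segment with its bits swapped."""
    zeros = 3 - (g2 + g1 + g0)
    return zeros & 1, (zeros >> 1) & 1


# The conditioned check should accept a word iff it is an SEDC3 code word.
AGREES = all(
    sedc3_valid(g2, g1, g0, s1, s0) == ((s1, s0) == sedc3_reference(g2, g1, g0))
    for g2, g1, g0, s1, s0 in product((0, 1), repeat=5)
)

if __name__ == "__main__":
    print("conditioned SEDC2 check matches SEDC3:", AGREES)
```

Reusing the 2-bit checker in this way is what lets the larger checkers be assembled from a small set of primitive blocks.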

Figure 3: 3D illustration of SEDC3 scheme.

Figure 4: Example of SEDC-based HW-level fault tolerance system. The 4-bit adder (ISG) computes A[3:0] + B[3:0] + Cin, the SEDC encoded 4-bit adder (SCSG) produces the check bits S = SEDC(A[3:0] + B[3:0] + Cin), and the FS SEDC checker generates the error indication signal V.

4.4. The FS SEDC4 Checker. A 4-bit FS SEDC checker consists of one FS SEDC1 checker and one FS SEDC3 checker, as shown in Figure 6(b). Both the SEDC1 and SEDC3 checkers generate a 2-bit output $V_1 V_0$. Because the valid code word is "10", to make sure that both checker units generate the "10" output during error-free operation, we "AND" the $V_1$ output bit of the FS SEDC1 checker with the $V_1$ output bit of the FS SEDC3 checker. Also, we "OR" the $V_0$ output bits of both FS SEDC checkers using wired logic gates. We checked and confirmed by fault simulation that the wired-AND and wired-OR gates are also FS for single faults (stuck-at-0, stuck-at-1, transistor-stuck-on, and transistor-stuck-off).

4.5. The FS SEDCn Checker. Like the SEDC code generator, the FS SEDC checker also consists of multiple 1-, 2-, and 3-bit FS SEDC checkers, depending upon the values of $a$ and $b$ from (1). For example, if $n = 8$ bits, then (1) ⇒ $a = 2$ and $b = 2$. This requires one FS SEDC2 checker and two FS SEDC3 checkers to realize an 8-bit FS SEDC checker.

The area of the wired-AND-OR gates also increases as $n$ is increased. Figure 7 shows the block diagram of an $n$-bit FS SEDC checker. For $n = 8$ bits, there are a total of three FS SEDC checkers, each with a 2-bit output; hence, a 3-input wired-AND and a 3-input wired-OR gate are required to combine all $V_1$ and $V_0$ bits. In general, for an $n$-bit input, there are "$a + 1$" FS SEDC checkers, each with a 2-bit output. So, we require "$k = 2 \times (a + 1)$"-input wired-AND and wired-OR gates. With each additional input to the wired-AND-OR network, one extra transistor is required by each of the wired gates. This causes the circuit to expand width-wise; hence, the latency of the wired logic remains constant for any value of $n$.

The size of the load transistor driving these wired-AND and wired-OR gates also increases with the number of inputs, so we consider the maximum fan-in of one gate as equal to 4. For $k > 4$, an extra load transistor is connected in parallel. Generally, for $k$ inputs, we require $r = \lceil k/4 \rceil$ load transistors. A total of $k + r$ transistors is required to design the $k$-input wired AND-OR network with a constant latency of 1 transistor.
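The combining rule can be written as a small logic-level Python sketch. It models only the Boolean behaviour of the wired-AND and wired-OR gates, not their transistor implementation, and the function name is hypothetical.

```python
from itertools import product


def combine(outputs):
    """Merge sub-checker outputs [(V1, V0), ...] into one (V1, V0) pair.

    V1 is the AND of all V1 bits and V0 is the OR of all V0 bits, so the
    result is the valid word (1, 0) only if every sub-checker reports (1, 0).
    """
    v1, v0 = 1, 0
    for sub_v1, sub_v0 in outputs:
        v1 &= sub_v1
        v0 |= sub_v0
    return v1, v0


if __name__ == "__main__":
    pairs = list(product((0, 1), repeat=2))
    for first in pairs:
        for second in pairs:
            print(first, second, "->", combine([first, second]))
```

Any sub-checker output other than "10" either pulls the combined $V_1$ to 0 or raises the combined $V_0$ to 1, so the final signal leaves the valid word.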
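The transistor count of the wired network follows directly from the two formulas in the text, as the Python sketch below shows. It simply evaluates $k = 2(a + 1)$ and $r = \lceil k/4 \rceil$ as stated here and makes no attempt to reproduce the network column of Table 3; the function name is hypothetical.

```python
import math


def wired_network_transistors(a):
    """Return (k, r, k + r) for a checker built from a + 1 sub-checkers.

    k: inputs to the wired AND-OR network, k = 2 * (a + 1)
    r: load transistors with a maximum fan-in of 4, r = ceil(k / 4)
    """
    k = 2 * (a + 1)
    r = math.ceil(k / 4)
    return k, r, k + r


if __name__ == "__main__":
    for a in range(0, 5):
        print("a =", a, "->", wired_network_transistors(a))
```

For the 8-bit example above ($a = 2$), this gives $k = 6$ inputs, $r = 2$ load transistors, and 8 transistors in total.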

5. Experiments and Results

In this section, we present the experiments we conducted on the proposed FS SEDC checker and the overall proposed SEDC-based HW-level fault tolerance system. The results of each experiment are given along with the experimental details in the subsections below.

5.1. Fault Test on FS SEDC Checker. The FS SEDC1, SEDC2, SEDC3, and SEDC4 circuits in our paper were tested for stuck-at-0, stuck-at-1, transistor-stuck-ON, and transistor-stuck-OFF faults. We assume a single-fault model, where faults occur one at a time and there is enough time between the detection of the first fault and the occurrence of another fault [29]. In Table 2, we provide a summary of the fault analysis of an SEDC1 checker circuit. We applied one fault at a time in

Figure 5: CMOS-based circuits of FS (a) SEDC1 checker and (b) SEDC2 checker.

Figure 6: Block diagram of FS (a) SEDC3 checker and (b) SEDC4 checker.

the circuit of Figure 5(a) and observed the output. In single-fault operation, the circuit either produced the correct output or never produced any invalid code words (exhibiting the FS property), as shown in Table 2.

Case 1 (transistor stuck ON). In Table 2, we show all six cases of transistor stuck ON faults (one at a time). For the cases with N3 or N4 stuck ON, the circuit shows fault detection by one input code combination (represented with the ‰ symbol), and hence the circuit is self-testing, whereas the other cases showed that the circuit is fault secure as well as code disjoint.

Case 2 (transistor stuck OFF). In Table 2, all six cases for transistor stuck OFF faults are shown. In cases where N1 or N2 was stuck OFF, the circuit demonstrates the self-testing property (represented with the ‰ symbol), and for the rest of the cases, the circuit is fault secure.

Case 3 (input stuck at 0). When input G0 or S0 is stuck at 0, the circuit demonstrates the self-testing property; otherwise, it remains fault secure.

Case 4 (input stuck at 1). When input G0 or S0 is stuck at 1, the circuit shows the self-testing property; otherwise, it remains fault secure.

There is one case where the output becomes floating (i.e., P3 or P4 stuck OFF). In either case, if we consider the floating voltage as logic high, then the circuit is fault secure, and if we consider the floating voltage as logic low, then the circuit is self-testing. Hence, we can say that the circuit in Figure 5(a), which is a 1-bit SEDC checker, is FS. Similar analysis was carried out when testing 2-, 3-, and 4-bit SEDC checkers, and we found that all these checkers are FS.

5.2. Area, Delay, and Power Comparison. In this section, we compare the area and delay of TSC Berger, FS SEDC, and m-out-of-2m code checkers. We use the two possible TSC Berger checker implementations from Piestrak et al. [23] and Pierce Jr. and Lala [26], with the m-out-of-2m code checker from Lala [24], for comparison. For the sake of fairness, the area overhead was measured in terms of the number of equivalent

Figure 7: Block diagram of FS SEDCn checker.

Table 2: Results of single faults on FS SEDC1 checker.

G0 S0 V1 V0 | G0 S0 V1 V0 | G0 S0 V1 V0
MOS P1 or P2 is stuck ON | MOS P1 or P2 is stuck OFF | Input C0 stuck at zero
0 1 1 0 | 0 1 1 0 | ‰0 0 1 1
1 0 1 0 | 1 0 1 0 | 1 0 1 0
MOS P3 or P4 is stuck ON | MOS P3 or P4 is stuck OFF | Input F0 stuck at zero
0 1 1 0 | 0 1 Floating 0 | ‰0 0 1 1
1 0 1 0 | 1 0 1 0 | 0 1 1 0
Transistor N1 is stuck ON | Transistor N1 is stuck OFF | Input C0 stuck at 1
0 1 1 0 | 0 1 1 0 | 0 1 1 0
1 0 1 0 | ‰1 0 1 1 | ‰1 1 0 0
Transistor N2 is stuck ON | Transistor N2 is stuck OFF | Input F0 stuck at 1
0 1 1 0 | ‰0 1 1 1 | 1 0 1 0
1 0 1 0 | 1 0 1 0 | ‰1 1 0 0
Transistor N3 is stuck ON | Transistor N3 is stuck OFF | - - - -
‰0 1 0 0 | 0 1 1 0 | - - - -
1 0 1 0 | 1 0 1 0 | - - - -
Transistor N4 is stuck ON | Transistor N4 is stuck OFF | - - - -
‰0 1 1 0 | 0 1 1 0 | - - - -
1 0 0 0 | 1 0 1 0 | - - - -

‰ The cases where the circuit shows the self-testing property.

transistors. We made use of the assumptions by Smith [30] to translate gate-level circuits to transistor-level circuits.

Before comparison, we illustrate the functional dissimilarities of the three checkers with the help of Figure 8. Figure 8(a) shows the general block diagram of a TSC Berger code checker. For all the information symbols that the ISG of the functional circuit can produce in normal operation, the check symbol complement generator (CSCG) outputs ($S_B'$) correspond to the bit-by-bit complement of the expected check symbol $S_B$. The TSC two-rail checker validates that each bit of $S_B$ is the complement of the corresponding bit of $S_B'$. As the size of the input data increases, the length of the check symbol $S_B$ also increases, resulting in a longer TSC two-rail checker tree and hence a larger delay.

A general block diagram of a TSC m-out-of-2m code checker is shown in Figure 8(b). The checker takes the information bits and check bits $S_W$ and partitions them into two parts. The numbers of 1's, i.e., the weights, of both parts are mapped to a pair of values which, in binary, belongs to a code, in most cases a two-rail code. The checker consists of a cellular structure of AND-OR gates, as given by Lala [24].

Figure 8(c) depicts the general block diagram for an FS SEDC checker, which resembles the structure of an m-out-of-2m code checker and differs from a Berger code checker. The FS SEDC checker block receives the information and check bits from the functional unit. If the input data length increases, the size of the FS checker block increases width-wise. The FS SEDCn block contains "$a + 1$" pairs of small SEDC checkers (subblocks). Each subblock of the FS SEDC checker produces "10" as the valid code output. The overall SEDC checker has a final 2-bit output $S_{10}$. Unlike two-rail

Figure 8: Block diagrams of (a) TSC Berger checker, (b) m-out-of-2m code checker, and (c) FS SEDC checker.

Figure 9: Area comparison of area-optimized Berger [23], SEDC, and m-out-of-2m [24] code checkers (circuit size in number of transistors versus data length in bits).

codes, only one of the output combinations, "10", is considered a valid code word. A nonvalid checker output ("00", "01", or "11") at output $S_{10}$ indicates the presence of a fault in the functional circuit or the FS checker itself. The $k$-input wired AND-OR network takes the "$a + 1$" pairs of outputs from the SEDC checker subblocks and converts them into a final 2-bit error indication signal $V_S$.

5.2.1. Area. Area-optimized realization of TSC Berger code checkers in Piestrak et al. [23] showed less area overhead than m-out-of-2m code checkers, which is apparent from Figure 9. But if we consider the delay-optimized implementation of the TSC Berger code checker from Pierce Jr. and Lala [26], we see that the TSC Berger code checker requires more area than the FS SEDC and m-out-of-2m code checkers [24], as shown in Table 3. For clarity, we listed the area overhead separately for the code storage area and the code checker area in Table 3. Also listed separately are the area overheads required by the TRC tree for the TSC Berger code checker and by the wired-AND-OR network for the FS SEDC and m-out-of-2m code checkers.

For a fair comparison, the extra cost of the code storage area is also taken into account. We assumed that 1-bit storage is implemented by 12 MOS transistors [30]. Table 3 lists the area (in terms of the number of transistors) occupied by FS SEDC, delay-optimized Berger code, and m-out-of-2m code checkers for up to 32-bit data.

The FS SEDCn checker block shown in Figure 8(c) requires fewer gates, implemented with $[26 + (a \times 50)]$ MOS transistors if "$b = 2$", $[50 + (a \times 50)]$ MOS transistors if "$b = 3$", and $[58 + (a \times 50)]$ MOS transistors if "$b = 4$". The m-out-of-2m code checker implementation of Lala [24] requires $2m^2 - 2m + 2$ gates. The gate-level circuit is also translated to transistor-level circuits using data from Smith [30].

The results show that when scaling a 7-bit 0's counter to an 8-bit 0's counter, 154 extra MOS transistors are required. The m-out-of-2m code checker requires 60 extra MOS transistors when scaling a 7-out-of-14 checker to an 8-out-of-16 checker, whereas the SEDC checker requires only 18 extra MOS transistors. That is because a 7-bit SEDC checker is implemented with one SEDC3 and one SEDC4 circuit, which contain 50 and 58 MOS transistors, respectively (a total of 108 transistors). An 8-bit SEDC checker is implemented using one SEDC2 and two SEDC3 checkers, requiring 26 and 100 (50 × 2) MOS transistors (a total of 126 transistors). This means that SEDC saves 88% of the number of transistors compared to a Berger code checker [26], and it saves 70% of the transistors when
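The checker-area expressions above can be evaluated with a short Python sketch. The segmentation rule used here ($b = 2$, 3, or 4 chosen so that $n - b$ is a multiple of 3) is an assumption inferred from the worked examples in the text, since equation (1) is not reproduced in this section; the function names are hypothetical. The comparison values are the SEDC "Checker Area" column of Table 3.

```python
# Transistors of the b-bit checker, for b = 2, 3, 4 (values from the text).
BASE_AREA = {2: 26, 3: 50, 4: 58}

# SEDC "Checker Area" column of Table 3, indexed by data length in bits.
TABLE3_SEDC_CHECKER_AREA = {
    2: 26, 3: 50, 4: 58, 5: 76, 7: 108,
    8: 126, 15: 250, 16: 258, 30: 500, 32: 526,
}


def segment(n):
    """Assumed split of n >= 2 bits into one b-bit and a 3-bit segments."""
    b = {2: 2, 0: 3, 1: 4}[n % 3]
    a = (n - b) // 3
    return a, b


def checker_area(n):
    """MOS transistor count of the FS SEDCn checker block: base(b) + 50 * a."""
    a, b = segment(n)
    return BASE_AREA[b] + 50 * a


if __name__ == "__main__":
    for n, listed in TABLE3_SEDC_CHECKER_AREA.items():
        print(n, segment(n), checker_area(n), listed)
```

Under this assumed segmentation, the formula yields the same checker area as Table 3 for every listed data length.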

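Table 3: Area overhead (in number of transistors) of Berger [26], SEDC, and m-out-of-2m [24] code checkers.

Data bits | Berger: code storage area | Berger: 1's counter area | Berger: TRC area | Berger: total area | SEDC: code storage area | SEDC: checker area | SEDC: AND-OR network | SEDC: total area | m-out-of-2m: code storage area | m-out-of-2m: checker area | m-out-of-2m: AND-OR network | m-out-of-2m: total area
2 | 24 | 22 | 4 | 50 | 24 | 26 | 0 | 50 | 24 | 36 | 0 | 50
3 | 24 | 80 | 8 | 112 | 24 | 50 | 0 | 74 | 36 | 152 | 0 | 188
4 | 36 | 180 | 12 | 228 | 36 | 58 | 6 | 100 | 48 | 240 | 10 | 298
5 | 36 | 178 | 16 | 230 | 48 | 76 | 6 | 130 | 60 | 300 | 14 | 374
7 | 36 | 396 | 24 | 456 | 60 | 108 | 8 | 176 | 84 | 420 | 18 | 522
8 | 48 | 550 | 28 | 626 | 72 | 126 | 8 | 206 | 96 | 480 | 20 | 596
15 | 48 | 1106 | 56 | 1210 | 120 | 250 | 14 | 384 | 180 | 900 | 38 | 1118
16 | 60 | 1308 | 60 | 1428 | 132 | 258 | 16 | 406 | 192 | 960 | 40 | 1192
30 | 60 | 2586 | 116 | 2762 | 240 | 500 | 26 | 766 | 360 | 1800 | 76 | 2236
32 | 72 | 3048 | 120 | 3240 | 264 | 526 | 28 | 818 | 384 | 1920 | 80 | 2384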

compared to m-out-of-2m code checkers. Although Berger and m-out-of-2m checkers are TSC while the proposed SEDC checker is only FS, all three checkers provide the same fault security.
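The quoted savings can be retraced from the 7-bit and 8-bit rows of Table 3. The Python sketch below is a plain arithmetic check using the counter and checker areas listed there; the variable names are hypothetical.

```python
# Counter/checker areas (MOS transistors) from Table 3 for 7-bit and 8-bit data.
AREA_7_BIT = {"Berger": 396, "m-out-of-2m": 420, "SEDC": 108}
AREA_8_BIT = {"Berger": 550, "m-out-of-2m": 480, "SEDC": 126}

# Extra transistors needed when scaling from 7 to 8 bits.
EXTRA = {name: AREA_8_BIT[name] - AREA_7_BIT[name] for name in AREA_7_BIT}

# Percentage of the extra transistors that SEDC avoids relative to each scheme.
SAVING = {
    name: 100.0 * (1.0 - EXTRA["SEDC"] / EXTRA[name])
    for name in ("Berger", "m-out-of-2m")
}

if __name__ == "__main__":
    print(EXTRA)
    for name, value in SAVING.items():
        print(name, round(value, 1))
```

The increments come out as 154, 60, and 18 transistors, so SEDC needs about 88% fewer additional transistors than the Berger checker and 70% fewer than the m-out-of-2m checker for this scaling step.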

5.2.2. Delay. As far as delay is concerned, the FS SEDC checker also performs better than the Berger and cellular implementations for an m-out-of-2m code checker, as shown in Table 4. For the sake of uniformity, we designed all the basic gates using the same technology transistors (PMOS = 8μ/2μ, NMOS = 4μ/2μ) and evaluated the worst-case propagation delay of each circuit.

The SEDC checker shows an almost constant delay for $n > 3$ bits due to its parallel implementation, whereas the delay in the Berger code checker increases owing to an increase in gate levels (from 6 to 16) in the critical path, as shown by Pierce Jr. and Lala [26]. The delay for m-out-of-2m code checkers also continues to increase with increasing data lengths because the cellular implementation requires "$m$ (= input data length)" gate levels in the critical path.

5.2.3. Power Dissipation. In order to evaluate the power dissipation of the three checkers, we used the PowerPlay power analyzer tool. We implemented the Berger [26], m-out-of-2m [24], and SEDC checkers using Verilog and synthesized the circuits using Altera's Quartus II software. We targeted the circuit for a Cyclone II EP2C5AF256A7 chip, which has the least power dissipating properties among the Cyclone family. We allowed the synthesizer to create a balance between area and delay while synthesizing in order to get a better power estimate. We also enabled the synthesizer to use a synthesis model that takes intensive steps to optimize power for all three circuits. We clocked the inputs of the circuit with the default toggle rate and estimated the total thermal power dissipation for different values of input data width.

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This happens due to the increase in the number of two-rail checkers in the case of the Berger checker and due to the increase in the checker circuitry itself in the case of the m-out-of-2m checker. The same trend is evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LE) occupied by the checkers.

5.3. Fault Coverage of the Proposed HW-Level Fault Tolerance Scheme. In order to elaborate the effectiveness of the SEDC CSG and its FS checker, we computed the fault coverage of the proposed SEDC-based HW-level fault tolerance scheme. We applied faults in the example circuit of Figure 4 given in Section 3.4. As most of the VLSI combinational circuits designed for mathematical operations such as add, subtract, multiply, and divide consist of multiple instances of 1-bit adders (full adders), the example circuit, i.e., a 4-bit adder, is a simple and good candidate for presenting the effectiveness of our scheme. We injected two major types of transient errors, i.e., stuck-at-0 and stuck-at-1 [29], at 24 nodes (6 nodes per full adder, as shown in Figure 11(b)). We injected these errors using 2-to-1 multiplexers whose output is given by

\[ \mathrm{mux}_{\mathrm{out}} = \begin{cases} in_1\ (\text{normal gate output}), & \text{if } select\ (f\_enable) = 0, \\ in_2\ (\text{stuck-at fault } f \in F), & \text{if } select\ (f\_enable) = 1. \end{cases} \tag{8} \]

In Figure 11(a), the symbols A[3:0], B[3:0], Cin, f_enable, and F[23:0] denote the 4-bit input A, 4-bit input B, 1-bit carry-in, 1-bit fault enabling signal, and 24-bit fault signals, respectively, while Cout is the carry-out and S[3:0] represents the 4-bit sum output of the 4-bit adder. Figure 11(b) shows the detailed schematic of a single full adder.

We considered that the faults can occur at the outputs of the logic gates only and adopted a single-fault model, according to which only one fault can occur at a time [29].
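The multiplexer-based injection of (8) can be mimicked in software with the Python sketch below. The full-adder netlist used here (two XOR gates, two AND gates, and one OR gate, giving five injectable gate outputs) is an assumption made for illustration and is not the six-node schematic of Figure 11(b); the node names are hypothetical.

```python
def mux2_1(in1, in2, select):
    """2-to-1 multiplexer of (8): in1 when select = 0, in2 when select = 1."""
    return in2 if select else in1


def full_adder(a, b, cin, f_enable=0, fault=None):
    """Gate-level full adder with one optional stuck-at fault.

    fault is None or a pair (node_name, stuck_value); it takes effect only
    when f_enable = 1, which models the single-fault assumption.
    """
    def node(name, value):
        # Every gate output passes through a mux that can force a stuck value.
        if fault is not None and fault[0] == name:
            return mux2_1(value, fault[1], f_enable)
        return value

    x1 = node("x1", a ^ b)
    s = node("s", x1 ^ cin)
    a1 = node("a1", a & b)
    a2 = node("a2", x1 & cin)
    cout = node("cout", a1 | a2)
    return s, cout


if __name__ == "__main__":
    print(full_adder(1, 1, 0))                                  # fault-free
    print(full_adder(1, 1, 0, f_enable=1, fault=("a1", 0)))     # a1 stuck at 0
    print(full_adder(1, 1, 0, f_enable=0, fault=("a1", 0)))     # fault disabled
```

Keeping the fault behind an enable signal lets the same netlist serve for both the fault-free reference run and each faulty run.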

Figure 10: Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder and (b) 1-bit full adder with fault injection.

Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

Data bits | Berger | SEDC | m-out-of-2m
2 | 3.888 | 0.514 | 1.024
3 | 4.151 | 2.524 | -
4 | 7.741 | 2.738 | 5.490
5 | - | 2.713 | 5.558
7 | 7.821 | 2.77 | 8.297
8 | 7.599 | 2.76 | 9.284
15 | 10.566 | 2.826 | -
16 | 12.956 | 2.751 |
32 | 17.964 | 2.771 | -

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

 | (a) Total errors at the output of the adder | (b) BEs | (c) Detected BEs | (d) UEs | (e) Detected UEs | (f) Total detected errors | (g) Total undetected errors
Total | 1748 | 252 | 120 | 1496 | 1496 | 1616 | 132
Percentage (%) | 100 | 14.42 w.r.t. (a) | 47.62 w.r.t. (b) | 85.58 w.r.t. (a) | 100 w.r.t. (d) | 92.45 w.r.t. (a) | 7.55 w.r.t. (a)

We used Altera's Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these error-causing faults resulted in bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This is consistent with the observation that most of the errors in VLSI circuits result in UEs at the output [19–21]. Although SEDC is an AUED scheme that provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. The reason is that SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of BEs is also partitioned into multiple UEs and thus detected by the proposed SEDC scheme.
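The effect of partitioning on bidirectional errors can be explored with a small exhaustive Python sketch. It is not the ModelSim experiment: it enumerates every pair of distinct 5-bit words with equal weight, assumes the check bits are fault-free, and uses an assumed segmentation (upper two bits with SEDC2, lower three bits with SEDC3) together with the zero-count segment codes. It therefore illustrates the mechanism rather than reproducing the percentages of Table 5. The function names are hypothetical.

```python
from itertools import product


def sedc3(d2, d1, d0):
    """Assumed SEDC3 check bits (S1, S0): zero count of the segment, bits swapped."""
    zeros = 3 - (d2 + d1 + d0)
    return zeros & 1, (zeros >> 1) & 1


def sedc2(d1, d0):
    """Assumed SEDC2 check bits: the D2 = 0 slice of the SEDC3 table."""
    return sedc3(0, d1, d0)


def sedc5(word):
    """Check bits of a 5-bit word under the assumed 2 + 3 segmentation."""
    return sedc2(*word[:2]) + sedc3(*word[2:])


def classify(good, bad):
    """'BE' if the error flips bits in both directions, otherwise 'UE'."""
    up = any(g == 0 and b == 1 for g, b in zip(good, bad))
    down = any(g == 1 and b == 0 for g, b in zip(good, bad))
    return "BE" if up and down else "UE"


# STATS[kind] = [number of error patterns, number detected]
STATS = {"UE": [0, 0], "BE": [0, 0]}
WORDS = list(product((0, 1), repeat=5))
for good in WORDS:
    for bad in WORDS:
        if good == bad:
            continue
        kind = classify(good, bad)
        STATS[kind][0] += 1
        # Detected when the faulty word no longer matches the fault-free check bits.
        if sedc5(bad) != sedc5(good):
            STATS[kind][1] += 1

if __name__ == "__main__":
    for kind, (total, detected) in STATS.items():
        print(kind, "total:", total, "detected:", detected)
```

In this model every UE is caught, and a BE is caught whenever it changes the number of 1s in at least one segment, which is the partitioning effect described above.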

5.4. Cost Analysis: SW-Based Fault Tolerance versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity in our analysis, we take the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node, or in some cases on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of the failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].

For our cost analysis, we would like to compute the cost associated with the CC data recovery algorithm, for which we assume a cloud application, such as a message passing interface (MPI) program, that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p - 1$ processes with equal probabilities. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure only occurs during message passing with equal probability and not during checkpointing or recovery, we use the analytical cost model given in [4] for cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then the total cost of fault tolerance per process is given by

$$C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}$$

Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then according to [4] the average recovery cost in CC per process is given by

$$T_{RO} = \frac{C_{rb}}{1/P(f)} = P(f)\,C_{rb}. \tag{10}$$

Let $P(cp)$ denote the probability that a process starts checkpointing; then $(1 - P(cp))^p$ becomes the probability that $p$ processes do not start checkpointing, while $1 - (1 - P(cp))^p$ becomes the probability that at least one process starts a checkpoint. Consequently, $1/(1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$ and generate request (REQ) and acknowledgement (ACK) signals to the rest of the $p - 1$ noninitiators (total $2(p - 1)$ signals), and likewise be a noninitiator with probability $1 - 1/p$ and generate only one ACK signal in response to the initiator. As a result, there are $3(p - 1)/p$ average messages generated per checkpoint, and the average overhead per checkpoint is $C_w + (3(p - 1)/p)\,C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then the average checkpointing cost for a process is given by

$$T_{CP} = \frac{C_w + \frac{3(p - 1)}{p}\,C_{nl}}{1/\left(1 - (1 - P(cp))^p\right)} = \left(1 - (1 - P(cp))^p\right)\left(C_w + \frac{3(p - 1)}{p}\,C_{nl}\right). \tag{11}$$
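The average message count used in (11) follows from weighting the two roles of a process by their probabilities. As a short worked step, using only the quantities defined in the text:

$$\frac{1}{p}\cdot 2(p - 1) + \left(1 - \frac{1}{p}\right)\cdot 1 = \frac{2(p - 1)}{p} + \frac{p - 1}{p} = \frac{3(p - 1)}{p}.$$

For example, with $p = 128$ this gives $3 \times 127/128 \approx 2.98$ messages per checkpoint, so the count approaches 3 as $p$ grows.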

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpointing per 15 minutes), $C_{nl} = 20$ msecs, $C_w = 1$ sec, and $C_{rb} = 2$ secs, as given in [4]. We consider the value of $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, while each fault occurs after 168 hours (one week's time). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the value 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters while keeping the others constant and observe the effect on data recovery cost with and without the proposed HW-level fault tolerance.
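The cost model in (9), (10), and (11) can be evaluated directly. The sketch below is one possible transcription, not the code used for Figures 12 to 14. It passes the quoted parameter values through as plain numbers without reconciling their mixed time units, and the execution time `T_EXEC` is a placeholder assumption because $T$ is not stated in this section.

```python
def recovery_overhead(p_f, c_rb):
    """Equation (10): T_RO = P(f) * C_rb."""
    return p_f * c_rb


def checkpoint_overhead(p, p_cp, c_w, c_nl):
    """Equation (11): T_CP = (1 - (1 - P(cp))^p) * (C_w + 3(p-1)/p * C_nl)."""
    p_any_checkpoint = 1.0 - (1.0 - p_cp) ** p
    avg_messages = 3.0 * (p - 1) / p
    return p_any_checkpoint * (c_w + avg_messages * c_nl)


def cost_percent(t_exec, t_cp, t_ro):
    """Equation (9): C = (T_CP + T_RO) / T * 100."""
    return (t_cp + t_ro) / t_exec * 100.0


# Parameter values quoted in the text.
P, P_CP, C_NL, C_W, C_RB = 128, 1 / 15, 0.020, 1.0, 2.0
P_F_SW = 1 / 168            # SW-level fault tolerance only
P_F_HW = 0.0755 * P_F_SW    # with the proposed HW-level fault tolerance
T_EXEC = 100.0              # assumption: T is not given in this section

t_cp = checkpoint_overhead(P, P_CP, C_W, C_NL)
cost_sw = cost_percent(T_EXEC, t_cp, recovery_overhead(P_F_SW, C_RB))
cost_hw = cost_percent(T_EXEC, t_cp, recovery_overhead(P_F_HW, C_RB))
print(f"SW only: {cost_sw:.4f}%  with HW-level FT: {cost_hw:.4f}%")
```

Sweeping one argument (for example `p` or `c_nl`) while holding the others fixed mirrors the one-parameter-at-a-time experiments described in the text.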

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes $p$ is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into $p$ processes and each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in data recovery cost in the CC algorithm because every process has to coordinate with every other process in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which caused a huge impact on fault tolerance overhead, as shown in Figure 14. However, if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures leading to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains failsafe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the comparison favors the proposed SEDC checker on all three measures. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data at both the SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University, 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 ’13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC ’10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN ’13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD ’13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT ’16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC ’14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, “A t-unidirectional error-detecting systematic code,” Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, “Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction,” in Proceedings of the 2012 IEEE International Conference on Communications, ICC ’12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, “On the design of self-testing checkers for modified Berger codes,” in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, “Self-checking look-up tables using scalable error detection coding (SEDC) scheme,” Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.

[26] D. A. Pierce Jr. and P. K. Lala, “Modular implementation of efficient self-checking checkers for the Berger code,” Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, “Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding,” in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, “Online error detection in SRAM based FPGAs using Scalable Error Detection Coding,” in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED ’13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, “Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes,” IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST ’10), pp. 1–10, Piscataway, NJ, USA, May 2010.


Figure 3: 3D illustration of SEDC3 scheme.

Figure 4: Example of SEDC-based HW-level fault tolerance system. [Block diagram: inputs A[3:0], B[3:0], and Cin feed a 4-bit adder (ISG) and an SEDC-encoded 4-bit adder (SCSG); the adder output A[3:0]+B[3:0]+Cin and the check bits S = SEDC(A[3:0]+B[3:0]+Cin) feed the FS SEDC checker, which produces the error indication signal V.]

4.4. The FS SEDC4 Checker. A 4-bit FS SEDC checker consists of one FS SEDC1 checker and one FS SEDC3 checker, as shown in Figure 6(b). Both SEDC1 and SEDC3 checkers generate a 2-bit output $V_1V_0$. Because the valid code word is “10”, to make sure that both checker units generate the “10” output during error-free operation, we “AND” the $V_1$ output bit of the FS SEDC1 checker with the $V_1$ output bit of the FS SEDC3 checker. Also, we “OR” the $V_0$ output bits of both FS SEDC checkers using wired logic gates. We checked and confirmed by fault simulation that wired-AND and wired-OR gates are also FS for single faults (stuck-at-0, stuck-at-1, transistor-stuck-on, and transistor-stuck-off).
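The combining rule above (AND of all $V_1$ bits, OR of all $V_0$ bits, with “10” as the only valid word) can be modelled at the logic level. The following is a behavioural sketch only; it does not model the wired CMOS gates or their fault behaviour, and the function names are illustrative.

```python
def combine_checker_outputs(outputs):
    """Merge (V1, V0) pairs from sub-checkers into one (V1, V0) pair.

    V1 is the AND of all V1 bits and V0 is the OR of all V0 bits.
    """
    pairs = list(outputs)
    v1 = int(all(v1 for v1, _ in pairs))
    v0 = int(any(v0 for _, v0 in pairs))
    return v1, v0


def is_valid(code):
    """Only the code word "10" signals error-free operation."""
    return code == (1, 0)


# Both sub-checkers report "10": the combined output stays valid.
print(is_valid(combine_checker_outputs([(1, 0), (1, 0)])))  # prints: True
# One sub-checker reports "00": the combined output becomes "00", which is invalid.
print(is_valid(combine_checker_outputs([(1, 0), (0, 0)])))  # prints: False
```

Any single sub-checker leaving “10” forces the combined output away from “10”, which is why one AND and one OR suffice to merge the pairs.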

4.5. The FS SEDC$n$ Checker. Like the SEDC code generator, the FS SEDC checker also consists of multiple 1-, 2-, and 3-bit FS SEDC checkers, depending upon the values of $a$ and $b$ from (1). For example, if $n = 8$ bits, then (1) ⇒ $a = 2$ and $b = 2$. This requires one FS SEDC2 checker and two FS SEDC3 checkers to realize an 8-bit FS SEDC checker.

The area of the wired-AND-OR gates will also increase as $n$ is increased. Figure 7 shows the block diagram of an $n$-bit FS SEDC checker. For $n = 8$ bits, there will be a total of three FS SEDC checkers, each with 2-bit output; hence a 3-input wired-AND and a 3-input wired-OR gate are required to compare all $V_1$ and $V_0$ bits. In general, for $n$-bit input there are “$a + 1$” FS SEDC checkers, each with 2-bit output. So, we require “$k = 2 \times (a + 1)$”-input wired-AND and wired-OR gates. With each increasing input to the wired-AND-OR network, one extra transistor is required by each of the wired gates. This causes the circuit to expand width-wise; hence the latency of the wired logic remains constant for any value of $n$.

The size of the load transistor driving these wired-AND and wired-OR gates will also increase with increasing input, so we consider the maximum fan-in of one gate as equal to 4. For $k > 4$, an extra load transistor is connected in parallel. Generally, for $k$ inputs, we require $r = \lceil k/4 \rceil$ load transistors. A total of $k + r$ transistors is required to design the $k$-input wired AND-OR network with a constant latency of 1 transistor.
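The transistor count of the wired AND-OR network follows directly from the two formulas above, $k = 2 \times (a + 1)$ and $r = \lceil k/4 \rceil$. A minimal sketch, with an illustrative function name:

```python
import math


def wired_network_transistors(a):
    """Return (k, r, k + r) for a checker built from a + 1 sub-checkers."""
    k = 2 * (a + 1)       # one V1 and one V0 input per sub-checker
    r = math.ceil(k / 4)  # one load transistor per group of up to 4 inputs
    return k, r, k + r


# n = 8 bits corresponds to a = 2 in the text (three sub-checkers).
print(wired_network_transistors(2))  # prints: (6, 2, 8)
```

Because each added input contributes one transistor in parallel, the count grows linearly in $a$ while the depth of the wired logic stays at one transistor.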

5. Experiments and Results

In this section, we present the experiments we conducted on the proposed FS SEDC checker and the overall proposed SEDC-based HW-level fault tolerance system. The results of each experiment are given along with the experimental details in the subsections below.

5.1. Fault Test on FS SEDC Checker. The FS SEDC1, SEDC2, SEDC3, and SEDC4 circuits in our paper were tested for stuck-at-0, stuck-at-1, transistor-stuck-ON, and transistor-stuck-OFF faults. We assume a single-fault model, where faults occur one at a time and there is enough time between the detection of the first fault and the occurrence of another fault [29]. In Table 2, we provide a summary of the fault analysis of an SEDC1 checker circuit. We applied one fault at a time in the circuit of Figure 5(a) and observed the output. In single-fault operation, the circuit either produced the correct output or never produced any invalid code words (exhibiting FS property), as shown in Table 2.

Figure 5: CMOS-based circuits of FS (a) SEDC1 checker and (b) SEDC2 checker.

Figure 6: Block diagram of FS (a) SEDC3 checker and (b) SEDC4 checker.

Case 1 (transistor stuck ON). In Table 2, we show all six cases of transistor stuck ON faults (one at a time). For the cases with N3 or N4 stuck ON, the circuit shows fault detection by one input code combination (represented with the ‰ symbol), and hence the circuit is self-testing, whereas the other cases showed that the circuit is fault secure as well as code disjoint.

Case 2 (transistor stuck OFF). In Table 2, all six cases for transistor stuck OFF faults are shown. In cases where N1 or N2 was stuck OFF, the circuit demonstrates the self-testing property (represented with the ‰ symbol), and for the rest of the cases the circuit is fault secure.

Case 3 (input stuck at 0). When input G0 or S0 is stuck at 0, the circuit demonstrates the self-testing property; otherwise it remains fault secure.

Case 4 (input stuck at 1). When input G0 or S0 is stuck at 1, the circuit shows the self-testing property; otherwise it remains fault secure.

There is one case where the output becomes floating (i.e., P3 or P4 stuck OFF). In either case, if we consider the floating voltage as logic high, then the circuit is fault secure, and if we consider the floating voltage as logic low, then the circuit is self-testing. Hence, we can say that the circuit in Figure 5(a), which is a 1-bit SEDC checker, is FS. Similar analysis was carried out when testing 2-, 3-, and 4-bit SEDC checkers, and we found that all these checkers are FS.
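The per-fault reading of Table 2 can be expressed as a simple rule: a fault is exposed by a code input if the checker output for that input differs from the valid word “10”. The sketch below applies this rule to two rows taken from Table 2; the function name and data layout are illustrative, and the classification labels are a simplified reading of the cases discussed above.

```python
VALID_CODE = (1, 0)  # (V1, V0) for error-free operation


def detecting_inputs(fault_rows):
    """Code inputs (G0, S0) for which the faulty checker leaves the valid word."""
    return [inputs for inputs, outputs in fault_rows.items() if outputs != VALID_CODE]


# Rows from Table 2: (G0, S0) -> (V1, V0) under the named single fault.
n3_stuck_on = {(0, 1): (0, 0), (1, 0): (1, 0)}
p1_or_p2_stuck_on = {(0, 1): (1, 0), (1, 0): (1, 0)}

print(detecting_inputs(n3_stuck_on))        # prints: [(0, 1)]
print(detecting_inputs(p1_or_p2_stuck_on))  # prints: []
```

A non-empty result corresponds to the self-testing cases marked in Table 2, while an empty result means the fault never disturbs the output for code inputs.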

5.2. Area, Delay, and Power Comparison. In this section, we compare the area and delay of TSC Berger, FS SEDC, and m-out-of-2m code checkers. We use the two possible TSC Berger checker implementations from Piestrak et al. [23] and Pierce Jr. and Lala [26], with the m-out-of-2m code checker from Lala [24], for comparison. For the sake of fairness, the area overhead was measured in terms of the number of equivalent transistors. We made use of the assumptions by Smith [30] to translate gate-level circuits to transistor-level circuits.

Figure 7: Block diagram of FS SEDCn checker.

Table 2: Results of single faults on FS SEDC1 checker.

| Fault | G0 | S0 | V1 | V0 |
|---|---|---|---|---|
| MOS P1 or P2 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| MOS P3 or P4 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N1 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N2 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N3 is stuck ON | ‰0 | 1 | 0 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N4 is stuck ON | ‰0 | 1 | 1 | 0 |
| | 1 | 0 | 0 | 0 |
| MOS P1 or P2 is stuck OFF | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| MOS P3 or P4 is stuck OFF | 0 | 1 | Floating | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N1 is stuck OFF | 0 | 1 | 1 | 0 |
| | ‰1 | 0 | 1 | 1 |
| Transistor N2 is stuck OFF | ‰0 | 1 | 1 | 1 |
| | 1 | 0 | 1 | 0 |
| Transistor N3 is stuck OFF | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N4 is stuck OFF | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Input C0 stuck at zero | ‰0 | 0 | 1 | 1 |
| | 1 | 0 | 1 | 0 |
| Input F0 stuck at zero | ‰0 | 0 | 1 | 1 |
| | 0 | 1 | 1 | 0 |
| Input C0 stuck at 1 | 0 | 1 | 1 | 0 |
| | ‰1 | 1 | 0 | 0 |
| Input F0 stuck at 1 | 1 | 0 | 1 | 0 |
| | ‰1 | 1 | 0 | 0 |

‰ The cases where the circuit shows the self-testing property.

Before comparison, we illustrate the functional dissimilarities of the three checkers with the help of Figure 8. Figure 8(a) shows the general block diagram of a TSC Berger code checker. For all the information symbols that the ISG of the functional circuit can produce in normal operation, the check symbol complement generator (CSCG) outputs ($S_B'$) correspond to the bit-by-bit complement of the expected check symbol $S_B$. The TSC two-rail checker validates that each bit of $S_B$ is the complement of the corresponding bit of $S_B'$. As the size of the input data increases, the length of the check symbol $S_B$ also increases, resulting in a longer TSC two-rail checker tree and hence a larger delay.

A general block diagram of a TSC m-out-of-2m code checker is shown in Figure 8(b). The checker takes the information bits and check bits $S_W$ and partitions them into two parts. The numbers of 1's, i.e., the weights, of both parts are mapped to a pair of values which, in binary, belongs to a code, in most cases a two-rail code. The checker consists of a cellular structure of AND-OR gates, as given by Lala [24].
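The two-rail comparison in the Berger checker can be sketched at the behavioural level. The sketch assumes the standard Berger code definition, in which the check symbol is the count of 0's in the information bits; this definition is not stated in this section, and the function names are illustrative. It does not model the TSC two-rail checker tree itself.

```python
def berger_check_symbol(info_bits):
    """Check symbol S_B: the number of 0's in the information bits (assumed definition)."""
    return info_bits.count(0)


def two_rail_valid(s_b, s_b_comp, width):
    """True if every bit of s_b is the complement of the matching bit of s_b_comp."""
    mask = (1 << width) - 1
    return ((s_b ^ s_b_comp) & mask) == mask


info = [1, 0, 1, 1, 0, 0, 1]         # 7 information bits with three 0's
s_b = berger_check_symbol(info)       # 3, i.e., binary 011
s_b_comp = ~s_b & 0b111               # bit-by-bit complement, binary 100

print(two_rail_valid(s_b, s_b_comp, width=3))         # prints: True
print(two_rail_valid(s_b, s_b_comp ^ 0b001, width=3))  # prints: False
```

The check symbol width grows with the data length, which is what lengthens the two-rail checker tree mentioned above.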

Figure 8(c) depicts the general block diagram for an FS SEDC checker, which resembles the structure of an m-out-of-2m code checker and differs from a Berger code checker. The FS SEDC checker block receives the information and check bits from the functional unit. If the input data length increases, the size of the FS checker block increases width-wise. The FS SEDCn block contains “$a + 1$” pairs of small SEDC checkers (subblocks). Each subblock of the FS SEDC checker produces “10” as the valid code output. The overall SEDC checker has a final 2-bit output $S_{10}$; unlike two-rail codes, only one of the output combinations, “10”, is considered a valid code word. A nonvalid checker output “00”, “01”, or “11” at output $S_{10}$ indicates the presence of a fault in the functional circuit or the FS checker itself. The $k$-input wired AND-OR network takes the “$a + 1$” pairs of output from each SEDC checker subblock and then converts them into a final 2-bit error indication signal $V_S$.

Figure 8: Block diagrams of (a) TSC Berger checker, (b) m-out-of-2m code checker, and (c) FS SEDC checker.

Figure 9: Area comparison of area-optimized Berger [23], SEDC, and m-out-of-2m [24] code checkers. [Plot: circuit size (number of transistors) versus data length (bits) for data lengths 2, 3, 4, 5, 7, 8, 15, 16, 30, and 32.]

5.2.1. Area. Area-optimized realization of TSC Berger code checkers in Piestrak et al. [23] showed less area overhead than m-out-of-2m code checkers, which is apparent from Figure 9. But if we consider the delay-optimized implementation of the TSC Berger code checker from Pierce Jr. and Lala [26], we see that the TSC Berger code checker requires more area than the FS SEDC and m-out-of-2m code checkers [24], as shown in Table 3. For clarity, we separately listed the area overhead based on code storage area and code checker area in Table 3. Also listed separately are the area overhead required by the TRC tree for the TSC Berger code checker and by the wired-AND-OR network for the FS SEDC and m-out-of-2m code checkers.

For a fair comparison, the extra cost of the code storage area is also taken into account. We assumed that 1-bit storage is implemented by 12 MOS transistors [30]. Table 3 lists the area (in terms of the number of transistors) occupied by FS SEDC, delay-optimized Berger code, and m-out-of-2m code checkers for up to 32-bit data.

The FS SEDCn checker block shown in Figure 8(c) requires fewer gates, implemented with [26 + (a × 50)] MOS transistors if “b = 2”, [50 + (a × 50)] MOS transistors if “b = 3”, and [58 + (a × 50)] MOS transistors if “b = 4”. The m-out-of-2m code checker implementation of Lala [24] requires $2m^2 - 2m + 2$ gates. The gate-level circuit is also translated to transistor-level circuits using data from Smith [30].
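The checker area formulas above can be checked against the SEDC checker area column of Table 3. The sketch below assumes the decomposition $n = 3a + b$ with $b \in \{2, 3, 4\}$, which is inferred from the worked examples in the text (e.g., $n = 8$ gives $a = 2$, $b = 2$) and not quoted from (1); the function names are illustrative.

```python
BASE_TRANSISTORS = {2: 26, 3: 50, 4: 58}  # transistor count of the b-bit checker


def split_data_length(n):
    """Split n into (a, b) with n = 3a + b and b in {2, 3, 4} (assumed form of (1))."""
    if n < 2:
        raise ValueError("this sketch covers data lengths of 2 bits or more")
    b = 2 + (n - 2) % 3
    a = (n - b) // 3
    return a, b


def sedc_checker_transistors(n):
    """Checker area: base count of the b-bit checker plus 50 per SEDC3 unit."""
    a, b = split_data_length(n)
    return BASE_TRANSISTORS[b] + 50 * a


for n in (2, 3, 4, 5, 7, 8, 15, 16, 30, 32):
    print(n, sedc_checker_transistors(n))
```

For the data lengths listed in Table 3, this reproduces the SEDC checker area column (26, 50, 58, 76, 108, 126, 250, 258, 500, and 526 transistors).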

The results show that, when scaling a 7-bit 0's counter to an 8-bit 0's counter, 154 extra MOS transistors are required. The m-out-of-2m code checker requires 60 extra MOS transistors when scaling a 7-out-of-14 checker to an 8-out-of-16 checker, whereas the SEDC checker requires only 18 extra MOS transistors. That is because a 7-bit SEDC checker is implemented with one SEDC3 and one SEDC4 circuit that contain 50 and 58 MOS transistors, respectively (a total of 108 transistors). An 8-bit SEDC checker is implemented using one SEDC2 and two SEDC3 checkers, requiring 26 and 100 (50 × 2) MOS transistors (a total of 126 transistors). This means that SEDC saves 88% of the number of transistors compared to a Berger code checker [26], and it saves 70% of the transistors when compared to m-out-of-2m code checkers. Although Berger and m-out-of-2m checkers are TSC while the proposed SEDC checker is only FS, all three checkers provide the same fault security.

Table 3: Area overhead (number of transistors) of Berger [26], SEDC, and m-out-of-2m [24] code checkers.

| Data bits | Berger: code storage area | Berger: 1's counter area | Berger: TRC area | Berger: total area | SEDC: code storage area | SEDC: checker area | SEDC: AND-OR network | SEDC: total area | m-out-of-2m: code storage area | m-out-of-2m: checker area | m-out-of-2m: AND-OR network | m-out-of-2m: total area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 24 | 22 | 4 | 50 | 24 | 26 | 0 | 50 | 24 | 36 | 0 | 50 |
| 3 | 24 | 80 | 8 | 112 | 24 | 50 | 0 | 74 | 36 | 152 | 0 | 188 |
| 4 | 36 | 180 | 12 | 228 | 36 | 58 | 6 | 100 | 48 | 240 | 10 | 298 |
| 5 | 36 | 178 | 16 | 230 | 48 | 76 | 6 | 130 | 60 | 300 | 14 | 374 |
| 7 | 36 | 396 | 24 | 456 | 60 | 108 | 8 | 176 | 84 | 420 | 18 | 522 |
| 8 | 48 | 550 | 28 | 626 | 72 | 126 | 8 | 206 | 96 | 480 | 20 | 596 |
| 15 | 48 | 1106 | 56 | 1210 | 120 | 250 | 14 | 384 | 180 | 900 | 38 | 1118 |
| 16 | 60 | 1308 | 60 | 1428 | 132 | 258 | 16 | 406 | 192 | 960 | 40 | 1192 |
| 30 | 60 | 2586 | 116 | 2762 | 240 | 500 | 26 | 766 | 360 | 1800 | 76 | 2236 |
| 32 | 72 | 3048 | 120 | 3240 | 264 | 526 | 28 | 818 | 384 | 1920 | 80 | 2384 |
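The savings percentages can be reproduced from the incremental transistor counts for the 7-bit to 8-bit step, reading "saves" as the relative reduction in extra transistors (an interpretation, since the text does not spell out the calculation):

$$1 - \frac{18}{154} \approx 0.883 \approx 88\%, \qquad 1 - \frac{18}{60} = 0.70 = 70\%.$$

Here 18, 154, and 60 are the extra MOS transistors needed by the SEDC checker, the Berger 0's counter, and the m-out-of-2m checker, respectively.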

5.2.2. Delay. As far as delay is concerned, the FS SEDC checker also performs better than Berger and cellular implementations for an m-out-of-2m code checker, as shown in Table 4. For the sake of uniformity, we designed all the basic gates using the same technology transistors (PMOS = 8μ/2μ, NMOS = 4μ/2μ) and evaluated the worst-case propagation delay of each circuit.

The SEDC checker shows an almost constant delay for $n > 3$ bits due to its parallel implementation, whereas the delay in the Berger code checker increases owing to an increase in gate levels (from 6 to 16) in the critical path, as shown by Pierce Jr. and Lala [26]. The delay for m-out-of-2m code checkers also continues to increase with increasing data lengths because the cellular implementation requires “m (= input data length)” gate levels in the critical path.

5.2.3. Power Dissipation. In order to evaluate the power dissipation of the three checkers, we used the PowerPlay power analyzer tool. We implemented the Berger [26], m-out-of-2m [24], and SEDC checkers using Verilog and synthesized the circuits using Altera's Quartus II software. We targeted the circuit for a Cyclone II EP2C5AF256A7 chip, which has the least power dissipating properties among the Cyclone family. We allowed the synthesizer to create a balance between area and delay while synthesizing in order to get a better power estimate. We also enabled the synthesizer to use a synthesis mode that takes intensive steps to optimize power for all three circuits. We clocked the inputs of the circuit with the default toggle rate and estimated the total thermal power dissipation for different values of input data width.

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This happens because of the increase in the number of two-rail checkers in the case of the Berger checker and because of the growth of the checker circuitry itself in the case of the m-out-of-2m checker. The same trend is evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LEs) occupied by each checker.

5.3. Fault Coverage of the Proposed HW-Level Fault Tolerance Scheme. In order to demonstrate the effectiveness of the SEDC CSG and its FS checker, we computed the fault coverage of the proposed SEDC-based HW-level fault tolerance scheme. We applied faults to the example circuit of Figure 4, given in Section 3.4. Most VLSI combinational circuits designed for mathematical operations (add, subtract, multiply, divide, etc.) consist of multiple instances of 1-bit adders (full adders); hence, the example circuit, i.e., a 4-bit adder, is a simple and suitable candidate for presenting the effectiveness of our scheme. We injected two major types of transient errors, i.e., stuck-at-0 and stuck-at-1 [29], at 24 nodes (6 nodes per full adder, as shown in Figure 11(b)). We injected these errors using 2-to-1 multiplexers whose output is given by

\[
mux_{out} =
\begin{cases}
in_1 \ (\text{normal gate output}), & \text{if } select \ (f\_enable) = 0, \\
in_2 \ (\text{stuck-at fault } f \in F), & \text{if } select \ (f\_enable) = 1.
\end{cases}
\tag{8}
\]
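The multiplexer rule in (8) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the five-gate full adder and its node names (`x1`, `a1`, `sum`, `a2`, `cout`) are hypothetical, since the six injection nodes of Figure 11(b) cannot be recovered from the text, and the helper names are ours.

```python
from itertools import product

def mux2_1(in1: int, in2: int, select: int) -> int:
    """Equation (8): pass in1 (normal gate output) when select = 0,
    in2 (the stuck-at value) when select = 1."""
    return in2 if select else in1

# Hypothetical gate-output nodes of a five-gate full adder (an assumption).
NODES = ("x1", "a1", "sum", "a2", "cout")

def full_adder(a: int, b: int, cin: int, fault=None):
    """1-bit full adder; fault is None or a (node_name, stuck_value) pair.
    Single-fault model: at most one node is forced at a time."""
    def node(name: str, value: int) -> int:
        enable = int(fault is not None and fault[0] == name)
        stuck = fault[1] if enable else 0
        return mux2_1(value, stuck, enable)

    x1 = node("x1", a ^ b)
    a1 = node("a1", a & b)
    s = node("sum", x1 ^ cin)
    a2 = node("a2", x1 & cin)
    cout = node("cout", a1 | a2)
    return s, cout

def count_error_cases() -> int:
    """Count (fault, input) pairs whose output differs from the fault-free one."""
    errors = 0
    for name, stuck in product(NODES, (0, 1)):
        for a, b, cin in product((0, 1), repeat=3):
            if full_adder(a, b, cin, (name, stuck)) != full_adder(a, b, cin):
                errors += 1
    return errors
```

Only a fraction of the injected faults changes the output for a given input, which is the same effect the paper reports when it separates injected faults from faults that actually cause a logical error.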

In Figure 11(a), the symbols A[3:0], B[3:0], Cin, f_enable, and F[23:0] denote the 4-bit input A, the 4-bit input B, the 1-bit carry-in, the 1-bit fault-enabling signal, and the 24-bit fault signals, respectively, while Cout is the carry-out and S[3:0] represents the 4-bit sum output of the 4-bit adder. Figure 11(b) shows the detailed schematic of a single full adder.

We assumed that faults can occur only at the outputs of the logic gates and adopted a single-fault model, according to which only one fault can occur at a time [29].

Figure 10: Comparison of (a) power dissipation and (b) area, in terms of LE counts, between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder and (b) 1-bit full adder with fault injection.

Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

Data bits   Berger    SEDC     m-out-of-2m
2           3.888     0.514    1.024
3           4.151     2.524    -
4           7.741     2.738    5.490
5           -         2.713    5.558
7           7.821     2.77     8.297
8           7.599     2.76     9.284
15          10.566    2.826    -
16          12.956    2.751
32          17.964    2.771    -

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

Quantity                                       Total   Percentage (%)
(a) Total errors at the output of the adder    1748    100
(b) BEs                                        252     14.42 w.r.t. (a)
(c) Detected BEs                               120     47.62 w.r.t. (b)
(d) UEs                                        1496    85.58 w.r.t. (a)
(e) Detected UEs                               1496    100 w.r.t. (d)
(f) Total detected errors                      1616    92.45 w.r.t. (a)
(g) Total undetected errors                    132     7.55 w.r.t. (a)

We used Altera's Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these error-causing faults resulted in bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This also supports the observation that most of the errors in VLSI circuits result in UEs at the output [19–21]. Even though SEDC is an AUED scheme, which provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. The reason is that SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of the BEs is also partitioned into multiple UEs and is thus detected by the proposed SEDC scheme.
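The distinction between UEs and BEs, and the percentages in Table 5, can be checked with a short Python sketch. The classifier uses the standard definition (an error is unidirectional when all flipped bits change in the same direction); the function and dictionary names are illustrative assumptions, and the counts are copied from Table 5.

```python
def classify_error(expected: int, observed: int) -> str:
    """Classify the difference between two output words (non-negative ints)."""
    rising = ~expected & observed    # bits that flipped 0 -> 1
    falling = expected & ~observed   # bits that flipped 1 -> 0
    if not rising and not falling:
        return "none"
    if rising and falling:
        return "bidirectional"
    return "unidirectional"

# Counts taken from Table 5.
TABLE5 = {"errors": 1748, "be": 252, "be_detected": 120,
          "ue": 1496, "ue_detected": 1496}

def coverage(t: dict) -> dict:
    """Recompute the percentage row of Table 5 from the raw counts."""
    detected = t["be_detected"] + t["ue_detected"]
    return {
        "be_share": 100 * t["be"] / t["errors"],
        "be_coverage": 100 * t["be_detected"] / t["be"],
        "ue_share": 100 * t["ue"] / t["errors"],
        "ue_coverage": 100 * t["ue_detected"] / t["ue"],
        "total_coverage": 100 * detected / t["errors"],
        "undetected": 100 * (t["errors"] - detected) / t["errors"],
    }
```

Rounded to two decimals, the recomputed values agree with the percentages listed in Table 5.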

5.4. Cost Analysis: SW-Based Fault Tolerance versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity, in our analysis we take the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node or, in some cases, on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of the failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].

For our cost analysis, we would like to compute the cost associated with the CC data recovery algorithm, for which we assume a cloud application such as a message passing interface (MPI) program that comprises p logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining p − 1 processes with equal probabilities. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure only occurs during message passing with equal probability and not during checkpointing or recovery, we use the analytical cost model given in [4] for the cost analysis of fault tolerance at the SW level. According to [4], T denotes the total execution time of a process without fault tolerance, while T_CP and T_RO represent the checkpointing and failure recovery overheads, respectively. Then the total cost of fault tolerance per process is given by

\[
C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}
\]

Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is C_rb and the mean time between failures is 1/P(f), where P(f) denotes the probability of failure, then according to [4] the average recovery cost in CC per process is given by

\[
T_{RO} = \frac{C_{rb}}{1/P(f)} = P(f)\, C_{rb}. \tag{10}
\]

Let P(cp) denote the probability that a process starts checkpointing; then (1 − P(cp))^p becomes the probability that p processes do not start checkpointing, while 1 − (1 − P(cp))^p becomes the probability that at least one process starts a checkpoint. Consequently, 1/(1 − (1 − P(cp))^p) represents the checkpointing interval. A process can be the initiator of checkpointing with probability 1/p and generate request (REQ) and acknowledgement (ACK) signals to the rest of the p − 1 noninitiators (2(p − 1) signals in total), and likewise be a noninitiator with probability 1 − 1/p and generate only one ACK signal in response to the initiator. As a result, there are 3(p − 1)/p average messages generated per checkpoint, and the average overhead per checkpoint is C_w + (3(p − 1)/p) C_nl, where C_w denotes the average time to write a checkpoint to a stable node and C_nl denotes the average network latency. Then the average checkpointing cost for a process is given by

\[
T_{CP} = \frac{C_w + \frac{3(p-1)}{p}\, C_{nl}}{1 / \left(1 - (1 - P(cp))^{p}\right)}
       = \left(1 - (1 - P(cp))^{p}\right)\left(C_w + \frac{3(p-1)}{p}\, C_{nl}\right). \tag{11}
\]
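Equations (9) to (11) translate directly into code. The following Python sketch is an illustration only; the function and argument names are ours, and all times are assumed to be expressed in one common unit chosen by the caller.

```python
def avg_messages_per_checkpoint(p: int) -> float:
    """Expected messages per process per checkpoint: the initiator
    (probability 1/p) sends 2(p - 1) signals, a noninitiator sends one ACK."""
    initiator = (1 / p) * 2 * (p - 1)
    non_initiator = (1 - 1 / p) * 1
    return initiator + non_initiator          # equals 3(p - 1)/p

def t_cp(p: int, p_cp: float, c_w: float, c_nl: float) -> float:
    """Average checkpointing cost per process, equation (11)."""
    p_any = 1 - (1 - p_cp) ** p               # at least one process checkpoints
    return p_any * (c_w + avg_messages_per_checkpoint(p) * c_nl)

def t_ro(p_f: float, c_rb: float) -> float:
    """Average recovery cost per process, equation (10)."""
    return p_f * c_rb

def cost_percent(t_cp_value: float, t_ro_value: float, t_total: float) -> float:
    """Total cost of fault tolerance per process in percent, equation (9)."""
    return (t_cp_value + t_ro_value) / t_total * 100
```

The message-count helper makes the 3(p − 1)/p term explicit: 2(p − 1)/p comes from the initiator case and (p − 1)/p from the noninitiator case.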

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters p = 128 processes (virtual machines), P(cp) = 1/15 (one checkpoint per 15 minutes), C_nl = 20 ms, C_w = 1 s, and C_rb = 2 s, as given in [4]. We consider P(f) = 1/168, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, with one fault occurring every 168 hours (one week). After we apply HW-level fault tolerance, the probability of failure P(f) reduces to P′(f) = 0.0755 × P(f), where the value 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters while keeping the others constant and observe the effect on the data recovery cost with and without the proposed HW-level fault tolerance.
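The effect of the reduced failure probability on the recovery overhead in (10) can be sketched as follows. This is a minimal illustration: the names are ours, P(f) is taken per hour and C_rb in seconds exactly as quoted in the text, and no unit conversion is attempted.

```python
HW_UNDETECTED_FRACTION = 0.0755   # 7.55% of errors remain undetected (Table 5)

def failure_probability(p_f: float, hw_fault_tolerance: bool) -> float:
    """P(f) without HW-level fault tolerance, P'(f) = 0.0755 * P(f) with it."""
    return p_f * HW_UNDETECTED_FRACTION if hw_fault_tolerance else p_f

def recovery_overhead(p_f: float, c_rb: float, hw_fault_tolerance: bool) -> float:
    """Equation (10) evaluated with P(f) or P'(f)."""
    return failure_probability(p_f, hw_fault_tolerance) * c_rb

P_F = 1 / 168      # one failure per week, as assumed in the text
C_RB = 2.0         # average rollback time

sw_only = recovery_overhead(P_F, C_RB, hw_fault_tolerance=False)
with_hw = recovery_overhead(P_F, C_RB, hw_fault_tolerance=True)
mtbf_with_hw = 1 / failure_probability(P_F, True)   # roughly 2225 hours
```

Under this model the recovery term T_RO shrinks by the same factor as P(f), while the checkpointing term T_CP in (11) is unaffected, so the overall saving in (9) depends on how large T_RO is relative to T_CP.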

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes p is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into p processes and that each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in the data recovery cost in the CC algorithm because every process has to coordinate with every other process in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of an increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which had a huge impact on the fault tolerance overhead, as shown in Figure 14. But if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures that could lead to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains failsafe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with the Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, which shows the superiority of the proposed SEDC checker. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data, both at the SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University, 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 ’13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC ’10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN ’13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD ’13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT ’16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC ’14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity: report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, “A t-unidirectional error-detecting systematic code,” Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, “Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction,” in Proceedings of the 2012 IEEE International Conference on Communications, ICC ’12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, “On the design of self-testing checkers for modified Berger codes,” in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, “Self-checking look-up tables using scalable error detection coding (SEDC) scheme,” Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.

[26] D. A. Pierce Jr. and P. K. Lala, “Modular implementation of efficient self-checking checkers for the Berger code,” Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, “Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding,” in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, “Online error detection in SRAM based FPGAs using Scalable Error Detection Coding,” in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED ’13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, “Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes,” IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST ’10), pp. 1–10, Piscataway, NJ, USA, May 2010.


Figure 5: CMOS-based circuits of FS (a) SEDC1 checker and (b) SEDC2 checker.

Figure 6: Block diagram of FS (a) SEDC3 checker and (b) SEDC4 checker.

the circuit of Figure 5(a) and observed the output. In single-fault operation, the circuit either produced the correct output or never produced any invalid code words (exhibiting the FS property), as shown in Table 2.

Case 1 (transistor stuck ON). Table 2 shows all six cases of transistor stuck-ON faults (one at a time). For the cases with N3 or N4 stuck ON, the circuit shows fault detection for one input code combination (marked with the ‰ symbol), and hence the circuit is self-testing, whereas the other cases show that the circuit is fault secure as well as code disjoint.

Case 2 (transistor stuck OFF). Table 2 shows all six cases of transistor stuck-OFF faults. In the cases where N1 or N2 is stuck OFF, the circuit demonstrates the self-testing property (marked with the ‰ symbol), and for the rest of the cases the circuit is fault secure.

Case 3 (input stuck at 0). When input G0 or S0 is stuck at 0, the circuit demonstrates the self-testing property; otherwise, it remains fault secure.

Case 4 (input stuck at 1). When input G0 or S0 is stuck at 1, the circuit shows the self-testing property; otherwise, it remains fault secure.

There is one case where the output becomes floating (i.e., P3 or P4 stuck OFF). In this case, if we consider the floating voltage as logic high, then the circuit is fault secure, and if we consider the floating voltage as logic low, then the circuit is self-testing. Hence, we can say that the circuit in Figure 5(a), which is a 1-bit SEDC checker, is FS. A similar analysis was carried out for the 2-, 3-, and 4-bit SEDC checkers, and we found that all of these checkers are FS.
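The input stuck-at cases can be reproduced with a small behavioral model in Python. The model is an assumption inferred from the rows of Table 2, not the transistor-level circuit of Figure 5(a): the four input combinations listed there are consistent with V1 = NAND(G0, S0) and V0 = NOR(G0, S0), with “10” as the only valid checker output. Function names are illustrative.

```python
VALID_OUTPUT = (1, 0)   # the only valid checker code word, "10"

def sedc1_checker(g0: int, s0: int):
    """Behavioral 1-bit checker inferred from Table 2 (an assumption)."""
    v1 = 1 - (g0 & s0)   # NAND
    v0 = 1 - (g0 | s0)   # NOR
    return v1, v0

def with_input_stuck(stuck_input: str, stuck_value: int, g0: int, s0: int):
    """Apply a single input stuck-at fault, then evaluate the checker."""
    if stuck_input == "G0":
        g0 = stuck_value
    elif stuck_input == "S0":
        s0 = stuck_value
    else:
        raise ValueError("stuck_input must be 'G0' or 'S0'")
    return sedc1_checker(g0, s0)

def fault_is_exposed(stuck_input: str, stuck_value: int) -> bool:
    """True if at least one valid input code word (G0 != S0) drives the
    checker to a non-valid output, i.e. the fault is detectable."""
    code_words = [(0, 1), (1, 0)]
    return any(with_input_stuck(stuck_input, stuck_value, g, s) != VALID_OUTPUT
               for g, s in code_words)
```

In this model every single input stuck-at fault is exposed by exactly one of the two valid input code words, while the other code word still yields the correct output “10”, which matches the behavior described in Cases 3 and 4.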

5.2. Area, Delay, and Power Comparison. In this section, we compare the area and delay of the TSC Berger, FS SEDC, and m-out-of-2m code checkers. We use the two possible TSC Berger checker implementations from Piestrak et al. [23] and Pierce Jr. and Lala [26], together with the m-out-of-2m code checker from Lala [24], for comparison. For the sake of fairness, the area overhead was measured in terms of the number of equivalent

Figure 7: Block diagram of FS SEDCn checker.

Table 2: Results of single faults on FS SEDC1 checker.

G0 S0 V1 V0                     G0 S0 V1 V0                     G0 S0 V1 V0

MOS P1 or P2 is stuck ON        MOS P1 or P2 is stuck OFF       Input C0 stuck at zero
 0  1  1  0                      0  1  1  0                     ‰0  0  1  1
 1  0  1  0                      1  0  1  0                      1  0  1  0

MOS P3 or P4 is stuck ON        MOS P3 or P4 is stuck OFF       Input F0 stuck at zero
 0  1  1  0                      0  1  Floating  0              ‰0  0  1  1
 1  0  1  0                      1  0  1  0                      0  1  1  0

Transistor N1 is stuck ON       Transistor N1 is stuck OFF      Input C0 stuck at 1
 0  1  1  0                      0  1  1  0                      0  1  1  0
 1  0  1  0                     ‰1  0  1  1                     ‰1  1  0  0

Transistor N2 is stuck ON       Transistor N2 is stuck OFF      Input F0 stuck at 1
 0  1  1  0                     ‰0  1  1  1                      1  0  1  0
 1  0  1  0                      1  0  1  0                     ‰1  1  0  0

Transistor N3 is stuck ON       Transistor N3 is stuck OFF      -
‰0  1  0  0                      0  1  1  0                      -
 1  0  1  0                      1  0  1  0                      -

Transistor N4 is stuck ON       Transistor N4 is stuck OFF      -
‰0  1  1  0                      0  1  1  0                      -
 1  0  0  0                      1  0  1  0                      -

‰ The cases where the circuit shows the self-testing property.

transistors. We made use of the assumptions by Smith [30] to translate gate-level circuits to transistor-level circuits.

Before the comparison, we illustrate the functional dissimilarities of the three checkers with the help of Figure 8. Figure 8(a) shows the general block diagram of a TSC Berger code checker. For all the information symbols that the ISG of the functional circuit can produce in normal operation, the check symbol complement generator (CSCG) outputs (S_B′) correspond to the bit-by-bit complement of the expected check symbol S_B. The TSC two-rail checker validates that each bit of S_B is the complement of the corresponding bit of S_B′. As the size of the input data increases, the length of the check symbol S_B also increases, resulting in a longer TSC two-rail checker tree and hence a longer delay.

A general block diagram of a TSC m-out-of-2m code checker is shown in Figure 8(b). The checker takes the information bits and check bits S_W and partitions them into two parts. The number of 1's, i.e., the weight, of each part is mapped to a pair of values which, in binary, belongs to a code, in most cases a two-rail code. The checker consists of a cellular structure of AND-OR gates, as given by Lala [24].

Figure 8(c) depicts the general block diagram of an FS SEDC checker, which resembles the structure of an m-out-of-2m code checker and differs from a Berger code checker. The FS SEDC checker block receives the information and check bits from the functional unit. If the input data length increases, the size of the FS checker block increases width-wise. The FS SEDCn block contains “a + 1” pairs of small SEDC checkers (subblocks). Each subblock of the FS SEDC checker produces “10” as the valid code output. The overall SEDC checker has a final 2-bit output S10; unlike two-rail

Figure 8: Block diagrams of (a) TSC Berger checker, (b) m-out-of-2m code checker, and (c) FS SEDC checker.

Figure 9: Area comparison of area-optimized Berger [23], SEDC, and m-out-of-2m [24] code checkers (x-axis: data length in bits; y-axis: circuit size in number of transistors).

codes, only one of the output combinations, “10,” is considered a valid code word. A nonvalid checker output (“00,” “01,” or “11”) at output S10 indicates the presence of a fault in the functional circuit or in the FS checker itself. The k-input wired AND-OR network takes the “a + 1” pairs of outputs from the SEDC checker subblocks and converts them into a final 2-bit error indication signal V_S.

5.2.1. Area. The area-optimized realization of the TSC Berger code checker in Piestrak et al. [23] showed less area overhead than the m-out-of-2m code checkers, which is apparent from Figure 9. But if we consider the delay-optimized implementation of the TSC Berger code checker from Pierce Jr. and Lala [26], we see that the TSC Berger code checker requires more area than the FS SEDC and m-out-of-2m code checkers [24], as shown in Table 3. For clarity, we listed the area overhead of the code storage and of the code checker separately in Table 3. Also listed separately are the area overheads of the TRC tree for the TSC Berger code checker and of the wired AND-OR network for the FS SEDC and m-out-of-2m code checkers.

For a fair comparison, the extra cost of the code storage area is also taken into account. We assumed that 1-bit storage is implemented by 12 MOS transistors [30]. Table 3 lists the area (in terms of the number of transistors) occupied by the FS SEDC, delay-optimized Berger code, and m-out-of-2m code checkers for up to 32-bit data.

The FS SEDCn checker block shown in Figure 8(c) requires fewer gates, implemented with [26 + (a × 50)] MOS transistors if “b = 2,” [50 + (a × 50)] MOS transistors if “b = 3,” and [58 + (a × 50)] MOS transistors if “b = 4.” The m-out-of-2m code checker implementation of Lala [24] requires 2m² − 2m + 2 gates. The gate-level circuit is also translated to a transistor-level circuit using data from Smith [30].
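The transistor-count expressions above can be turned into a small calculator. In the Python sketch below, the rule for splitting an n-bit word into “a” 3-bit units plus one b-bit unit (n = 3a + b, with b in {2, 3, 4}) is our assumption; it is inferred from the checker-area column of Table 3, which it reproduces, and the function names are illustrative.

```python
# Transistor counts of the small checkers: SEDC2 = 26, SEDC3 = 50, SEDC4 = 58.
BASE_TRANSISTORS = {2: 26, 3: 50, 4: 58}

def split_width(n: int):
    """Split n data bits into a 3-bit units plus one b-bit unit (assumed rule)."""
    if n < 2:
        raise ValueError("this sketch covers n >= 2 only")
    b = 2 + (n - 2) % 3
    a = (n - b) // 3
    return a, b

def sedc_checker_transistors(n: int) -> int:
    """Checker area in MOS transistors: base block plus 50 per 3-bit unit."""
    a, b = split_width(n)
    return BASE_TRANSISTORS[b] + a * 50

def m_out_of_2m_gates(m: int) -> int:
    """Gate count of the cellular m-out-of-2m checker, 2m^2 - 2m + 2."""
    return 2 * m * m - 2 * m + 2

# Extra transistors when scaling from 7-bit to 8-bit data.
sedc_extra = sedc_checker_transistors(8) - sedc_checker_transistors(7)   # 18
saving_vs_berger = 1 - sedc_extra / 154     # 154 extra for the Berger counter
saving_vs_m_out_of_2m = 1 - sedc_extra / 60
```

For 7-bit and 8-bit data the sketch gives 108 and 126 transistors, so the 18-transistor increment corresponds to savings of about 88% and 70% relative to the 154 and 60 extra transistors quoted for the Berger and m-out-of-2m checkers.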

The results show that when scaling a 7-bit 0's counter to an 8-bit 0's counter, 154 extra MOS transistors are required. The m-out-of-2m code checker requires 60 extra MOS transistors when scaling a 7-out-of-14 checker to an 8-out-of-16 checker, whereas the SEDC checker requires only 18 extra MOS transistors. That is because a 7-bit SEDC checker is implemented with one SEDC3 and one SEDC4 circuit, which contain 50 and 58 MOS transistors, respectively (a total of 108 transistors). An 8-bit SEDC checker is implemented using one SEDC2 and two SEDC3 checkers, requiring 26 and 100 (50 × 2) MOS transistors (a total of 126 transistors). This means that SEDC saves 88% of the transistors compared to a Berger code checker [26], and it saves 70% of the transistors when

10 Scientific Programming

Table 3 Area overhead of Berger [26] SEDC and m-out-of-2m [24] code checkers

Data Bit

Berger Code SEDC m-out-of-2mCode

storageArea

1rsquoscounter

Area

TRCArea

TotalArea

Codestorage

Area

CheckerArea

AND-ORNetwork

TotalArea

CodeStorage

Area

CheckerArea

AND-ORNetwork Total Area

2 24 22 4 50 24 26 0 50 24 36 0 503 24 80 8 112 24 50 0 74 36 152 0 1884 36 180 12 228 36 58 6 100 48 240 10 2985 36 178 16 230 48 76 6 130 60 300 14 3747 36 396 24 456 60 108 8 176 84 420 18 5228 48 550 28 626 72 126 8 206 96 480 20 59615 48 1106 56 1210 120 250 14 384 180 900 38 111816 60 1308 60 1428 132 258 16 406 192 960 40 119230 60 2586 116 2762 240 500 26 766 360 1800 76 223632 72 3048 120 3240 264 526 28 818 384 1920 80 2384

compared to m-out-of-2m code checkers Although Bergerand m-out-of-2m checkers are TSC while the proposedSEDC checker is only FS all three checkers provide the samefault security
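The 88% and 70% savings quoted for the 7-bit to 8-bit scaling step can be reproduced from the checker areas in Table 3. The sketch below is one reading of those figures, namely that they compare the extra transistors each checker needs for that step; the dictionary keys and function names are illustrative and not taken from the paper.

```python
# Checker-only transistor counts for 7-bit and 8-bit data, copied from Table 3.
CHECKER_AREA = {
    "Berger counter": {7: 396, 8: 550},
    "m-out-of-2m": {7: 420, 8: 480},
    "SEDC": {7: 108, 8: 126},
}


def extra_transistors(name):
    """Additional transistors needed when scaling the checker from 7 to 8 bits."""
    return CHECKER_AREA[name][8] - CHECKER_AREA[name][7]


def saving_percent(reference):
    """Percentage of extra transistors SEDC saves relative to `reference`."""
    return 100.0 * (1 - extra_transistors("SEDC") / extra_transistors(reference))


if __name__ == "__main__":
    for name in CHECKER_AREA:
        print(name, extra_transistors(name))
    for reference in ("Berger counter", "m-out-of-2m"):
        print(reference, round(saving_percent(reference)))
```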

5.2.2. Delay. As far as delay is concerned, the FS SEDC checker also performs better than the Berger and cellular implementations of an m-out-of-2m code checker, as shown in Table 4. For the sake of uniformity, we designed all the basic gates using the same technology transistors (PMOS = 8μ/2μ, NMOS = 4μ/2μ) and evaluated the worst-case propagation delay of each circuit.

The SEDC checker shows an almost constant delay for n > 3 bits due to its parallel implementation, whereas the delay in the Berger code checker increases owing to an increase in gate levels (from 6 to 16) in the critical path, as shown by Pierce Jr. and Lala [26]. The delay for m-out-of-2m code checkers also continues to increase with increasing data lengths because the cellular implementation requires “m (= input data length)” gate levels in the critical path.

5.2.3. Power Dissipation. In order to evaluate the power dissipation of the three checkers, we used the PowerPlay power analyzer tool. We implemented the Berger [26], m-out-of-2m [24], and SEDC checkers using Verilog and synthesized the circuits using Altera’s Quartus II software. We targeted the circuit for a Cyclone II EP2C5AF256A7 chip, which has the least power dissipating properties among the Cyclone family. We allowed the synthesizer to create a balance between area and delay while synthesizing in order to get a better power estimate. We also enabled the synthesizer to use a synthesis mode that takes intensive steps to optimize power for all three circuits. We clocked the inputs of the circuit with the default toggle rate and estimated the total thermal power dissipation for different values of input data width.

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This happens due to the increase in the number of two-rail checkers in the case of the Berger checker and due to the increase in the checker circuitry itself in the case of the m-out-of-2m checker. The same trend is evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LEs) occupied by the checkers.

5.3. Fault Coverage of the Proposed HW-Level Fault Tolerance Scheme. In order to elaborate the effectiveness of the SEDC CSG and its FS checker, we computed the fault coverage of the proposed SEDC-based HW-level fault tolerance scheme. We applied faults in the example circuit of Figure 4 given in Section 3.4. Most VLSI combinational circuits designed for mathematical operations such as addition, subtraction, multiplication, and division consist of multiple instances of 1-bit adders (full adders); hence, the example circuit, i.e., a 4-bit adder, is a simple and good candidate for presenting the effectiveness of our scheme. We injected two major types of transient errors, i.e., stuck-at-0 and stuck-at-1 [29], at 24 nodes (6 nodes per full adder, as shown in Figure 11(b)). We injected these errors using 2-to-1 multiplexers whose output is given by

\[
\mathit{mux}_{out} =
\begin{cases}
\mathit{in}_1 \ (\text{normal gate output}), & \text{if } \mathit{select}\ (f\_enable) = 0, \\
\mathit{in}_2 \ (\text{stuck-at fault } f \in F), & \text{if } \mathit{select}\ (f\_enable) = 1.
\end{cases}
\tag{8}
\]

In Figure 11(a), the symbols A[3:0], B[3:0], Cin, f_enable, and F[23:0] denote the 4-bit input A, 4-bit input B, 1-bit carry-in, 1-bit fault enabling signal, and 24-bit fault signals, respectively, while Cout is the carry-out and S[3:0] represents the 4-bit sum output of the 4-bit adder. Figure 11(b) shows the detailed schematic of a single full adder.

We considered that the faults can occur at the outputs of the logic gates only and adopted a single-fault model, according to which only one fault can occur at a time [29].
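The multiplexer-based injection of equation (8) and the single-fault model can be illustrated in software. The sketch below is only a behavioral model under stated assumptions: it uses the five gate outputs of a textbook full adder with hypothetical node names, whereas the paper's full adder exposes six nodes whose exact netlist is not given in this section.

```python
from itertools import product


def mux2_1(in1, in2, select):
    """Equation (8): pass the normal gate output (in1) when select is 0,
    otherwise pass the injected stuck-at value (in2)."""
    return in2 if select else in1


# Hypothetical node names: the five gate outputs of a textbook full adder.
NODES = ("xor1", "and1", "and2", "sum", "cout")


def full_adder(a, b, cin, f_enable=0, fault_node=None, stuck_value=0):
    """1-bit full adder with a 2-to-1 mux on every gate output.

    Single-fault model: at most one node (fault_node) is forced to stuck_value,
    and only while f_enable is 1.
    """
    def node(name, value):
        select = int(bool(f_enable) and name == fault_node)
        return mux2_1(value, stuck_value, select)

    xor1 = node("xor1", a ^ b)
    and1 = node("and1", a & b)
    and2 = node("and2", xor1 & cin)
    s = node("sum", xor1 ^ cin)
    cout = node("cout", and1 | and2)
    return s, cout


def erroneous_cases():
    """List every (input, node, stuck value) whose output differs from the fault-free one."""
    cases = []
    for a, b, cin in product((0, 1), repeat=3):
        golden = full_adder(a, b, cin)
        for name, stuck in product(NODES, (0, 1)):
            faulty = full_adder(a, b, cin, f_enable=1,
                                fault_node=name, stuck_value=stuck)
            if faulty != golden:
                cases.append((a, b, cin, name, stuck))
    return cases


if __name__ == "__main__":
    total = 8 * len(NODES) * 2
    print(len(erroneous_cases()), "of", total, "injections change the output")
```

As in the experiment described here, only a fraction of the injected faults produce a logical error at the output, because a node stuck at the value it already holds is not observable.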

Figure 10: Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder and (b) 1-bit full adder with fault injection.

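Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

| Data bits | Berger | SEDC | m-out-of-2m |
|---|---|---|---|
| 2 | 3.888 | 0.514 | 1.024 |
| 3 | 4.151 | 2.524 | - |
| 4 | 7.741 | 2.738 | 5.490 |
| 5 | - | 2.713 | 5.558 |
| 7 | 7.821 | 2.77 | 8.297 |
| 8 | 7.599 | 2.76 | 9.284 |
| 15 | 10.566 | 2.826 | - |
| 16 | 12.956 | 2.751 | |
| 32 | 17.964 | 2.771 | - |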

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

| | (a) Total errors at the output of the adder | (b) BEs | (c) Detected BEs | (d) UEs | (e) Detected UEs | (f) Total detected errors | (g) Total undetected errors |
|---|---|---|---|---|---|---|---|
| Total | 1748 | 252 | 120 | 1496 | 1496 | 1616 | 132 |
| Percentage (%) | 100 | 14.42 w.r.t. (a) | 47.62 w.r.t. (b) | 85.58 w.r.t. (a) | 100 w.r.t. (d) | 92.45 w.r.t. (a) | 7.55 w.r.t. (a) |
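The percentages in Table 5 follow directly from the raw counts. The short sketch below recomputes them; the variable names are illustrative, and only the counts themselves come from the table.

```python
# Raw counts from Table 5.
total_errors = 1748            # (a) errors observed at the adder output
bidirectional = 252            # (b) BEs
detected_bidirectional = 120   # (c) detected BEs
unidirectional = total_errors - bidirectional   # (d) UEs
detected_unidirectional = unidirectional        # (e) all UEs are detected


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


detected = detected_bidirectional + detected_unidirectional   # (f)
undetected = total_errors - detected                          # (g)

summary = {
    "BEs w.r.t. (a)": pct(bidirectional, total_errors),
    "detected BEs w.r.t. (b)": pct(detected_bidirectional, bidirectional),
    "UEs w.r.t. (a)": pct(unidirectional, total_errors),
    "total detected w.r.t. (a)": pct(detected, total_errors),
    "total undetected w.r.t. (a)": pct(undetected, total_errors),
}

if __name__ == "__main__":
    for label, value in summary.items():
        print(label, value)
```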

We used Altera’s Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these errors were bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This supports the observation that most of the errors in VLSI circuits result in UEs at the output [19–21]. Even though SEDC is an AUED scheme and provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. This is because SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of BEs is also partitioned into multiple UEs and thus detected by the proposed SEDC scheme.
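The way segmentation turns some bidirectional errors into unidirectional ones can be shown with a small classifier. This is a minimal sketch and not the SEDC encoder itself: the 3-bit segment width and the function names are assumptions chosen for illustration.

```python
def error_kind(expected, observed):
    """Classify the error between two equal-length bit strings.

    'unidirectional' means all flips go the same way (only 0->1 or only 1->0);
    'bidirectional' means both directions occur in the same word.
    """
    up = any(e == "0" and o == "1" for e, o in zip(expected, observed))
    down = any(e == "1" and o == "0" for e, o in zip(expected, observed))
    if up and down:
        return "bidirectional"
    if up or down:
        return "unidirectional"
    return "none"


def segment_kinds(expected, observed, width=3):
    """Classify the error separately inside each `width`-bit segment."""
    return [
        error_kind(expected[i:i + width], observed[i:i + width])
        for i in range(0, len(expected), width)
    ]


if __name__ == "__main__":
    expected, observed = "101100", "001110"   # one 1->0 flip and one 0->1 flip
    print(error_kind(expected, observed))     # whole-word view
    print(segment_kinds(expected, observed))  # per-segment view
```

In this example the two opposite flips fall into different segments, so each segment sees only a unidirectional error. A bidirectional error confined to a single segment stays bidirectional, which is consistent with only a subset of BEs being detected.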

5.4. Cost Analysis: SW-Based Fault Tolerance Versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity, in our analysis we take the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node, or in some cases on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].

For our cost analysis, we would like to compute the cost associated with the CC data recovery algorithm, for which we assume a cloud application such as a message passing interface (MPI) program that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p - 1$ processes with equal probabilities. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure only occurs during message passing with equal probability and not during checkpointing or recovery, we use the analytical cost model given in [4] for cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then the total cost of fault tolerance per process is given by

\[
C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}
\]

Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then according to [4], the average recovery cost in CC per process is given by

\[
T_{RO} = \frac{C_{rb}}{1 / P(f)} = P(f)\, C_{rb}. \tag{10}
\]

Let $P(cp)$ denote the probability that a process starts checkpointing; then $(1 - P(cp))^p$ becomes the probability that $p$ processes do not start checkpointing, while $1 - (1 - P(cp))^p$ becomes the probability that at least one process starts a checkpoint. Consequently, $1 / (1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$ and generate request (REQ) and acknowledgement (ACK) signals to the rest of the $p - 1$ noninitiators (a total of $2(p - 1)$ signals), and likewise be a noninitiator with probability $1 - 1/p$ and generate only one ACK signal in response to the initiator. As a result, there are $3(p - 1)/p$ average messages generated per checkpoint, and the average overhead per checkpoint is $C_w + (3(p - 1)/p)\, C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then the average checkpointing cost for a process is given by

\[
T_{CP} = \frac{C_w + \frac{3(p - 1)}{p}\, C_{nl}}{1 / \left(1 - (1 - P(cp))^p\right)}
= \left(1 - (1 - P(cp))^p\right)\left(C_w + \frac{3(p - 1)}{p}\, C_{nl}\right). \tag{11}
\]

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpoint per 15 minutes), $C_{nl} = 20$ ms, $C_w = 1$ s, and $C_{rb} = 2$ s, as given in [4]. We consider the value of $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, while each fault occurs after 168 hours (one week’s time). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the value 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters while keeping the others constant and observe the effect on data recovery cost with and without the proposed HW-level fault tolerance.
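The cost model of equations (9), (10), and (11) is simple enough to evaluate directly. The sketch below is an illustration under stated assumptions: the function names are ours, the parameter values are the ones quoted above, the execution time T is a hypothetical placeholder because its value is not given in this section, and the units are taken at face value without conversion.

```python
def average_messages(p):
    """Average messages per checkpoint: the initiator (probability 1/p) sends
    2(p-1) REQ/ACK signals, a noninitiator (probability 1 - 1/p) sends one ACK.
    Algebraically this equals 3(p-1)/p."""
    return (1 / p) * 2 * (p - 1) + (1 - 1 / p) * 1


def checkpoint_overhead(p, p_cp, c_w, c_nl):
    """Equation (11): average checkpointing cost T_CP per process."""
    prob_any_checkpoint = 1 - (1 - p_cp) ** p
    return prob_any_checkpoint * (c_w + average_messages(p) * c_nl)


def recovery_overhead(p_f, c_rb):
    """Equation (10): average recovery cost T_RO = P(f) * C_rb."""
    return p_f * c_rb


def cost_percent(t, t_cp, t_ro):
    """Equation (9): total fault tolerance cost as a percentage of T."""
    return (t_cp + t_ro) / t * 100


# Parameter values quoted in the text.
P, P_CP, C_NL, C_W, C_RB, P_F = 128, 1 / 15, 0.020, 1.0, 2.0, 1 / 168
HW_RESIDUAL = 0.0755   # share of faults left unhandled by HW-level fault tolerance
T = 100.0              # hypothetical execution time, not taken from the paper

if __name__ == "__main__":
    t_cp = checkpoint_overhead(P, P_CP, C_W, C_NL)
    for label, p_f in (("SW only", P_F), ("with HW level", HW_RESIDUAL * P_F)):
        t_ro = recovery_overhead(p_f, C_RB)
        print(label, cost_percent(T, t_cp, t_ro))
```

In this model the HW-level scheme only scales the recovery term T_RO; the checkpointing term T_CP is unchanged.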

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes $p$ is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into $p$ processes and each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in data recovery cost in the CC algorithm because every process has to coordinate with all the others in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which caused a huge impact on fault tolerance overhead, as shown in Figure 14. But if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures leading to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains fail-safe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the comparison shows the superiority of the proposed SEDC checker. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data, both at SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 ’13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC’10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN ’13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD ’13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT ’16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC ’14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity, report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, “A t-unidirectional error-detecting systematic code,” Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, “Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction,” in Proceedings of the 2012 IEEE International Conference on Communications, ICC ’12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, “On the design of self-testing checkers for modified Berger codes,” in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, “Self-checking look-up tables using scalable error detection coding (SEDC) scheme,” Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.

[26] D. A. Pierce Jr. and P. K. Lala, “Modular implementation of efficient self-checking checkers for the Berger code,” Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, “Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding,” in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, “Online error detection in SRAM based FPGAs using Scalable Error Detection Coding,” in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED ’13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, “Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes,” IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST ’10), pp. 1–10, Piscataway, NJ, USA, May 2010.



Figure 7: Block diagram of FS SEDCn checker (the functional circuit output and the SEDC code feed one FS SEDC checker for b-bit data and a units of FS SEDC checkers for 3-bit data; a k-input wired AND-OR network combines their V1/V0 outputs into the error indication signal).

Table 2: Results of single faults on FS SEDC1 checker.

| Fault | G0 | S0 | V1 | V0 |
|---|---|---|---|---|
| MOS P1 or P2 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| MOS P1 or P2 is stuck OFF | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Input C0 stuck at zero | ‰0 | 0 | 1 | 1 |
| | 1 | 0 | 1 | 0 |
| MOS P3 or P4 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| MOS P3 or P4 is stuck OFF | 0 | 1 | Floating | 0 |
| | 1 | 0 | 1 | 0 |
| Input F0 stuck at zero | ‰0 | 0 | 1 | 1 |
| | 0 | 1 | 1 | 0 |
| Transistor N1 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N1 is stuck OFF | 0 | 1 | 1 | 0 |
| | ‰1 | 0 | 1 | 1 |
| Input C0 stuck at 1 | 0 | 1 | 1 | 0 |
| | ‰1 | 1 | 0 | 0 |
| Transistor N2 is stuck ON | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N2 is stuck OFF | ‰0 | 1 | 1 | 1 |
| | 1 | 0 | 1 | 0 |
| Input F0 stuck at 1 | 1 | 0 | 1 | 0 |
| | ‰1 | 1 | 0 | 0 |
| Transistor N3 is stuck ON | ‰0 | 1 | 0 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N3 is stuck OFF | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |
| Transistor N4 is stuck ON | ‰0 | 1 | 1 | 0 |
| | 1 | 0 | 0 | 0 |
| Transistor N4 is stuck OFF | 0 | 1 | 1 | 0 |
| | 1 | 0 | 1 | 0 |

‰ The cases where the circuit shows the self-testing property.

transistors. We made use of the assumptions by Smith [30] to translate gate-level circuits to transistor-level circuits.

Before comparison, we illustrate the functional dissimilarities of the three checkers with the help of Figure 8. Figure 8(a) shows the general block diagram of a TSC Berger code checker. For all the information symbols that the ISG of the functional circuit can produce in normal operation, the check symbol complement generator (CSCG) outputs ($S_B'$) correspond to the bit-by-bit complement of the expected check symbol $S_B$. The TSC two-rail checker validates that each bit of $S_B$ is the complement of the corresponding bit of $S_B'$. As the size of the input data increases, the length of the check symbol $S_B$ also increases, resulting in a longer TSC two-rail checker tree and hence a longer delay.

A general block diagram of a TSC m-out-of-2m code checker is shown in Figure 8(b). The checker takes the information bits and check bits $S_W$ and partitions them into two parts. The numbers of 1’s, i.e., the weights of both parts, are mapped to a pair of values which, in binary, belongs to a code, in most cases a two-rail code. The checker consists of a cellular structure of AND-OR gates, as given by Lala [24].

Figure 8: Block diagrams of (a) TSC Berger checker, (b) m-out-of-2m code checker, and (c) FS SEDC checker.

Figure 9: Area comparison of area-optimized Berger [23], SEDC, and m-out-of-2m [24] code checkers (x-axis: data length in bits, from 2 to 32; y-axis: circuit size in number of transistors).

Figure 8(c) depicts the general block diagram for an FS SEDC checker that resembles the structure of an m-out-of-2m code checker and differs from a Berger code checker. The FS SEDC checker block receives the information and check bits from the functional unit. If the input data length increases, the size of the FS checker block increases width-wise. The FS SEDCn block contains “$a + 1$” pairs of small SEDC checkers (subblocks). Each subblock of the FS SEDC checker produces “10” as the valid code output. The overall SEDC checker has a final 2-bit output $S_{10}$; unlike two-rail codes, only one of the output combinations, “10”, is considered a valid code word. A nonvalid checker output “00”, “01”, or “11” at output $S_{10}$ indicates the presence of a fault in the functional circuit or the FS checker itself. The k-input wired AND-OR network takes the “$a + 1$” pairs of output from each SEDC checker subblock and then converts them into a final 2-bit error indication signal $V_S$.
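The validity rule for the checker output can be modelled logically. The sketch below is an assumption-laden illustration, not the transistor-level wired AND-OR network: it assumes that the network ANDs the V1 bits and ORs the V0 bits of the subblocks, which is one reduction that yields “10” exactly when every subblock reports “10”. The function names are hypothetical.

```python
VALID = (1, 0)   # the only valid (V1, V0) combination, written "10" in the text


def combine(subblock_outputs):
    """Reduce per-subblock (V1, V0) pairs to one 2-bit error indication signal.

    Assumed reduction: AND of all V1 bits, OR of all V0 bits.
    """
    v1 = int(all(v1 for v1, _ in subblock_outputs))
    v0 = int(any(v0 for _, v0 in subblock_outputs))
    return v1, v0


def fault_indicated(subblock_outputs):
    """True when the combined output is anything other than the valid '10'."""
    return combine(subblock_outputs) != VALID


if __name__ == "__main__":
    healthy = [(1, 0), (1, 0), (1, 0)]
    one_bad = [(1, 0), (1, 1), (1, 0)]
    print(combine(healthy), fault_indicated(healthy))
    print(combine(one_bad), fault_indicated(one_bad))
```

With this reduction, any subblock output of “00”, “01”, or “11” forces the final signal away from “10”, matching the behavior described for the error indication signal.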

For a fair comparison the extra cost of the code storagearea is also taken into account We assumed that 1-bit storage

is implemented by 12-MOS transistors [30] Table 3 lists thearea (in terms of the number of transistors) occupied by FSSEDC delay-optimized Berger code and m-out-of-2m codecheckers for up to 32-bit data

The FS SEDCn checker block shown in Figure 8(c)requires fewer gates implemented with [26 + (a times 50)] MOStransistors if ldquob = 2rdquo [50 + (a times 50)] MOS transistors if ldquob= 3rdquo and [58 + (a times 50)] MOS transistors if ldquob = 4rdquo The m-out-of-2m code checker implementation of Lala [24] requires2m2 - 2m + 2 gates The gate-level circuit is also translated totransistor-level circuits using data from Smith [30]

Table 3: Area overhead (in number of MOS transistors) of Berger [26], SEDC, and m-out-of-2m [24] code checkers. Storage = code storage area; Counter = 1's counter area; TRC = TRC area; Checker = checker area; AND-OR = AND-OR network area; Total = total area.

            Berger code                      SEDC                             m-out-of-2m code
Data bits   Storage  Counter  TRC   Total    Storage  Checker  AND-OR  Total  Storage  Checker  AND-OR  Total
2           24       22       4     50       24       26       0       50     24       36       0       50
3           24       80       8     112      24       50       0       74     36       152      0       188
4           36       180      12    228      36       58       6       100    48       240      10      298
5           36       178      16    230      48       76       6       130    60       300      14      374
7           36       396      24    456      60       108      8       176    84       420      18      522
8           48       550      28    626      72       126      8       206    96       480      20      596
15          48       1106     56    1210     120      250      14      384    180      900      38      1118
16          60       1308     60    1428     132      258      16      406    192      960      40      1192
30          60       2586     116   2762     240      500      26      766    360      1800     76      2236
32          72       3048     120   3240     264      526      28      818    384      1920     80      2384

The results show that when scaling a 7-bit 0's counter to an 8-bit 0's counter, 154 extra MOS transistors are required. The m-out-of-2m code checker requires 60 extra MOS transistors when scaling a 7-out-of-14 checker to an 8-out-of-16 checker, whereas the SEDC checker requires only 18 extra MOS transistors. That is because a 7-bit SEDC checker is implemented with one SEDC3 and one SEDC4 circuit, which contain 50 and 58 MOS transistors, respectively (a total of 108 transistors). An 8-bit SEDC checker is implemented using one SEDC2 and two SEDC3 checkers, requiring 26 and 100 (50 × 2) MOS transistors (a total of 126 transistors). This means that, for this scaling step, SEDC saves 88% of the additional transistors compared to a Berger code checker [26] (18 versus 154), and it saves 70% of the additional transistors when compared to m-out-of-2m code checkers (18 versus 60). Although Berger and m-out-of-2m checkers are TSC while the proposed SEDC checker is only FS, all three checkers provide the same fault security.
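The 88% and 70% figures quoted for the 7-bit to 8-bit scaling step follow directly from the checker-area entries of Table 3. A minimal sketch of that arithmetic (the dictionary layout and the helper name are illustrative only):

```python
# Checker-area entries (MOS transistors) copied from Table 3 for 7- and 8-bit data.
checker_area = {
    "Berger": {7: 396, 8: 550},
    "m-out-of-2m": {7: 420, 8: 480},
    "SEDC": {7: 108, 8: 126},
}

# Extra transistors needed when going from 7-bit to 8-bit data.
extra = {name: area[8] - area[7] for name, area in checker_area.items()}


def saving_percent(reference, proposed):
    """Relative saving of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


if __name__ == "__main__":
    for name in ("Berger", "m-out-of-2m"):
        saving = saving_percent(extra[name], extra["SEDC"])
        print(f"SEDC vs {name}: {extra['SEDC']} vs {extra[name]} extra transistors, "
              f"saving {saving:.0f}%")
```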

5.2.2. Delay. As far as delay is concerned, the FS SEDC checker also performs better than the Berger and cellular implementations of an m-out-of-2m code checker, as shown in Table 4. For the sake of uniformity, we designed all the basic gates using the same technology transistors (PMOS = $8\mu/2\mu$, NMOS = $4\mu/2\mu$) and evaluated the worst-case propagation delay of each circuit.

The SEDC checker shows an almost constant delay for n > 3 bits due to its parallel implementation, whereas the delay in the Berger code checker increases owing to an increase in gate levels (from 6 to 16) in the critical path, as shown by Pierce Jr. and Lala [26]. The delay for m-out-of-2m code checkers also continues to increase with increasing data lengths because the cellular implementation requires “m (= input data length)” gate levels in the critical path.

5.2.3. Power Dissipation. In order to evaluate the power dissipation of the three checkers, we used the PowerPlay power analyzer tool. We implemented the Berger [26], m-out-of-2m [24], and SEDC checkers using Verilog and synthesized the circuits using Altera's Quartus II software. We targeted the circuit for a Cyclone II EP2C5AF256A7 chip, which has the lowest power dissipation among the Cyclone family. We allowed the synthesizer to balance area and delay while synthesizing in order to get a better power estimate. We also enabled the synthesis mode that takes intensive steps to optimize power for all three circuits. We clocked the inputs of the circuit with the default toggle rate and estimated the total thermal power dissipation for different values of input data width.

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This is due to the increase in the number of two-rail checkers in the case of the Berger checker and due to the increase in the checker circuitry itself in the case of the m-out-of-2m checker. The same trend is evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LEs) occupied by the checkers.

5.3. Fault Coverage of the Proposed HW-Level Fault Tolerance Scheme. In order to demonstrate the effectiveness of the SEDC CSG and its FS checker, we computed the fault coverage of the proposed SEDC-based HW-level fault tolerance scheme. We applied faults in the example circuit of Figure 4 given in Section 3.4. Since most VLSI combinational circuits designed for mathematical operations such as addition, subtraction, multiplication, and division consist of multiple instances of 1-bit adders (full adders), the example circuit, i.e., a 4-bit adder, is a simple and good candidate for presenting the effectiveness of our scheme. We injected two major types of transient errors, i.e., stuck-at-0 and stuck-at-1 [29], at 24 nodes (6 nodes per full adder, as shown in Figure 11(b)). We injected these errors using 2-to-1 multiplexers whose output is given by

\[
\mathit{mux}_{out} =
\begin{cases}
in_1\ (\text{normal gate output}), & \text{if } select\ (f\_enable) = 0,\\
in_2\ (\text{stuck-at fault } f \in F), & \text{if } select\ (f\_enable) = 1.
\end{cases}
\tag{8}
\]

In Figure 11(a), the symbols A[3:0], B[3:0], Cin, f_enable, and F[23:0] denote the 4-bit input A, 4-bit input B, 1-bit carry-in, 1-bit fault enabling signal, and 24-bit fault signals, respectively, while Cout is the carry-out and S[3:0] represents the 4-bit sum output of the 4-bit adder. Figure 11(b) shows the detailed schematic of a single full adder.

We considered that faults can occur at the outputs of the logic gates only and adopted a single-fault model, according to which only one fault can occur at a time [29].
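The multiplexer-based injection of equation (8) under the single-fault model can be illustrated with a small behavioral sketch. The six-gate full adder used here (two XORs, three ANDs, one 3-input OR) is an assumption chosen only so that each full adder exposes six gate outputs; it is not taken from the netlist of Figure 11(b). The node numbering and function names are likewise hypothetical, so the counts produced by this sketch are not expected to reproduce Table 5.

```python
def mux2_1(in1, in2, select):
    """2-to-1 multiplexer of equation (8): in1 when select = 0, in2 when select = 1."""
    return in2 if select else in1


def full_adder(a, b, cin, f_enable=0, fault_node=None, stuck_value=0):
    """1-bit full adder with a fault-injection mux on each gate output.

    The six-gate structure below is an assumption made for illustration.
    """
    def node(index, value):
        # Only the selected node sees the stuck-at value, and only if f_enable = 1.
        select = 1 if (f_enable and fault_node == index) else 0
        return mux2_1(value, stuck_value, select)

    x1 = node(0, a ^ b)
    s = node(1, x1 ^ cin)
    a1 = node(2, a & b)
    a2 = node(3, b & cin)
    a3 = node(4, a & cin)
    cout = node(5, a1 | a2 | a3)
    return s, cout


def adder4(a, b, cin, f_enable=0, fault=None):
    """4-bit ripple-carry adder; fault = (node index 0..23, stuck value) or None."""
    carry, result = cin, 0
    for i in range(4):
        local_node, stuck = None, 0
        if fault is not None and fault[0] // 6 == i:
            local_node, stuck = fault[0] % 6, fault[1]
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry,
                              f_enable, local_node, stuck)
        result |= s << i
    return result | (carry << 4)  # bit 4 is Cout, bits 3..0 are S[3:0]


def count_erroneous_injections():
    """Count single stuck-at injections that change the adder output."""
    count = 0
    for node in range(24):
        for stuck in (0, 1):
            for a in range(16):
                for b in range(16):
                    for cin in (0, 1):
                        if adder4(a, b, cin, 1, (node, stuck)) != a + b + cin:
                            count += 1
    return count
```

With `f_enable = 0` the adder behaves as a fault-free circuit regardless of the fault argument, which mirrors the role of the fault enabling signal in Figure 11(a).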

Figure 10: Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder and (b) 1-bit full adder with fault injection.

Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

Data bits   Berger   SEDC    m-out-of-2m
2           3.888    0.514   1.024
3           4.151    2.524   -
4           7.741    2.738   5.490
5           -        2.713   5.558
7           7.821    2.77    8.297
8           7.599    2.76    9.284
15          10.566   2.826   -
16          12.956   2.751
32          17.964   2.771   -

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

Quantity                                       Total   Percentage (%)
(a) Total errors at the output of the adder    1748    100
(b) BEs                                        252     14.42 w.r.t. (a)
(c) Detected BEs                               120     47.62 w.r.t. (b)
(d) UEs                                        1496    85.58 w.r.t. (a)
(e) Detected UEs                               1496    100 w.r.t. (d)
(f) Total detected errors                      1616    92.45 w.r.t. (a)
(g) Total undetected errors                    132     7.55 w.r.t. (a)

We used Altera's Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these error-causing faults resulted in bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This is consistent with the observation that most errors in VLSI circuits result in UEs at the output [19–21]. Even though SEDC is an AUED scheme and provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. This is because SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of BEs is also partitioned into multiple UEs and thus detected by the proposed SEDC scheme.
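The distinction between unidirectional and bidirectional errors, and the percentages reported in Table 5, can be made concrete with a short sketch. The classifier below uses the standard definition (all erroneous bits flip in the same direction versus flips in both directions); the function and variable names are illustrative and not part of the original test bench.

```python
def classify_error(expected, observed):
    """Classify an output error as 'none', 'unidirectional', or 'bidirectional'."""
    rising = ~expected & observed    # bits that changed 0 -> 1
    falling = expected & ~observed   # bits that changed 1 -> 0
    if not rising and not falling:
        return "none"
    if rising and falling:
        return "bidirectional"
    return "unidirectional"


# Counts copied from Table 5.
total_errors = 1748
bes, detected_bes = 252, 120
ues, detected_ues = 1496, 1496

detected = detected_bes + detected_ues
undetected = total_errors - detected

be_share = 100.0 * bes / total_errors           # share of BEs among all errors
be_coverage = 100.0 * detected_bes / bes        # coverage against BEs
total_coverage = 100.0 * detected / total_errors
unhandled = 100.0 * undetected / total_errors

if __name__ == "__main__":
    print(classify_error(0b1010, 0b1110))  # a single 0 -> 1 flip
    print(classify_error(0b1010, 0b0110))  # one 1 -> 0 flip and one 0 -> 1 flip
    print(f"{be_share:.2f} {be_coverage:.2f} {total_coverage:.2f} {unhandled:.2f}")
```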

5.4. Cost Analysis: SW-Based Fault Tolerance versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity, in our analysis we take the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node or, in some cases, on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of the failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].

For our cost analysis, we compute the cost associated with the CC data recovery algorithm, for which we assume a cloud application, such as a message passing interface (MPI) program, that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p - 1$ processes with equal probabilities. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure only occurs during message passing with equal probability and not during checkpointing or recovery, we use the analytical cost model given in [4] for the cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then the total cost of fault tolerance per process is given by

\[
C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}
\]

Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in the CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then according to [4], the average recovery cost in CC per process is given by

\[
T_{RO} = \frac{C_{rb}}{1/P(f)} = P(f)\,C_{rb}. \tag{10}
\]

Let $P(cp)$ denote the probability that a process starts checkpointing; then $(1 - P(cp))^p$ becomes the probability that $p$ processes do not start checkpointing, while $1 - (1 - P(cp))^p$ becomes the probability that at least one process starts a checkpoint. Consequently, $1/(1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$ and generate request (REQ) and acknowledgement (ACK) signals to the rest of the $p - 1$ noninitiators (a total of $2(p - 1)$ signals) and, likewise, be a noninitiator with probability $1 - 1/p$ and generate only one ACK signal in response to the initiator. As a result, there are $3(p - 1)/p$ average messages generated per checkpoint, and the average overhead per checkpoint is $C_w + (3(p - 1)/p)\,C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then the average checkpointing cost for a process is given by

\[
T_{CP} = \frac{C_w + \frac{3(p-1)}{p}\,C_{nl}}{\dfrac{1}{1 - (1 - P(cp))^{p}}}
= \left(1 - (1 - P(cp))^{p}\right)\left(C_w + \frac{3(p-1)}{p}\,C_{nl}\right). \tag{11}
\]
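The average message count that enters (11) is the expectation over the two roles a process can take. A short worked derivation, using only the probabilities and message counts stated above:

```latex
\begin{align*}
E[\text{messages per checkpoint}]
  &= \underbrace{\frac{1}{p}}_{\text{initiator}} \cdot 2(p-1)
   + \underbrace{\left(1 - \frac{1}{p}\right)}_{\text{noninitiator}} \cdot 1 \\
  &= \frac{2(p-1)}{p} + \frac{p-1}{p}
   = \frac{3(p-1)}{p}.
\end{align*}
```

For example, with $p = 128$ this gives $381/128 \approx 2.98$ messages per checkpoint, so the value approaches 3 as $p$ grows.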

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpoint per 15 minutes), $C_{nl} = 20$ ms, $C_w = 1$ s, and $C_{rb} = 2$ s, as given in [4]. We consider the value $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, with one fault occurring every 168 hours (one week). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the value 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters while keeping the others constant and observe the effect on data recovery cost with and without the proposed HW-level fault tolerance.
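The cost model of (9), (10), and (11) is simple enough to evaluate directly. The sketch below plugs in the parameter values as stated in the text; the execution time `T_ASSUMED` is a hypothetical placeholder because the text does not give a value for $T$, and the mixed time bases of the parameters (minutes, hours, seconds) are taken at face value rather than reconciled. The function names are illustrative.

```python
def average_messages(p):
    """Average number of messages per checkpoint: 3(p - 1)/p."""
    return 3.0 * (p - 1) / p


def checkpoint_cost(p, p_cp, c_w, c_nl):
    """T_CP from equation (11)."""
    at_least_one = 1.0 - (1.0 - p_cp) ** p
    return at_least_one * (c_w + average_messages(p) * c_nl)


def recovery_cost(p_f, c_rb):
    """T_RO from equation (10)."""
    return p_f * c_rb


def total_cost_percent(t, t_cp, t_ro):
    """C from equation (9), in percent of the fault-free execution time t."""
    return (t_cp + t_ro) / t * 100.0


# Parameter values as stated in the text.
PARAMS = dict(p=128, p_cp=1 / 15, c_w=1.0, c_nl=0.020, c_rb=2.0, p_f=1 / 168)
HW_UNHANDLED = 0.0755   # fraction of faults not handled at the HW level (Table 5)
T_ASSUMED = 100.0       # hypothetical execution time; not given in the text


def compare(t=T_ASSUMED, **overrides):
    """Return (cost with SW-only fault tolerance, cost with HW-level support)."""
    q = {**PARAMS, **overrides}
    t_cp = checkpoint_cost(q["p"], q["p_cp"], q["c_w"], q["c_nl"])
    sw_only = total_cost_percent(t, t_cp, recovery_cost(q["p_f"], q["c_rb"]))
    with_hw = total_cost_percent(
        t, t_cp, recovery_cost(HW_UNHANDLED * q["p_f"], q["c_rb"]))
    return sw_only, with_hw


if __name__ == "__main__":
    print(compare())
    print(compare(p=4096))   # vary one parameter, keep the others fixed
```

Passing keyword overrides to `compare` mirrors the one-parameter-at-a-time variation described in the text; only the recovery term changes between the two returned values, since HW-level fault tolerance scales $P(f)$ and leaves $T_{CP}$ untouched.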

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes $p$ is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into $p$ processes and each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in data recovery cost in the CC algorithm because every process has to coordinate with every other process in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Figure 13: Effect of checkpointing frequency on data recovery cost in the CC algorithm.

Figure 14: Effect of failure probability on data recovery in the CC algorithm.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which caused a huge impact on fault tolerance overhead, as shown in Figure 14. But if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.
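The failure-frequency sweep can be sketched using the recovery term $T_{RO} = P(f)\,C_{rb}$ and the 0.0755 reduction factor from the text. The power-of-two spacing of the sweep points is an assumption (the text only gives the end points of 1024 hours and 2 hours), and the names are illustrative.

```python
C_RB = 2.0              # average rollback time, as stated in the text
HW_UNHANDLED = 0.0755   # fraction of faults left unhandled by HW-level fault tolerance


def recovery_cost(p_f, c_rb=C_RB):
    """Average recovery cost per process: T_RO = P(f) * C_rb."""
    return p_f * c_rb


def sweep_failure_interval(hours_list):
    """Return (interval, SW-only T_RO, T_RO with HW-level support) per interval."""
    rows = []
    for hours in hours_list:
        p_f = 1.0 / hours
        rows.append((hours,
                     recovery_cost(p_f),
                     recovery_cost(HW_UNHANDLED * p_f)))
    return rows


# Assumed sweep points: one failure per 1024, 512, ..., 2 hours.
INTERVALS = [2 ** k for k in range(10, 0, -1)]

if __name__ == "__main__":
    for hours, sw_only, with_hw in sweep_failure_interval(INTERVALS):
        print(f"1 failure / {hours:4d} h: SW only {sw_only:.5f}, with HW {with_hw:.5f}")
```

Because $T_{RO}$ is linear in $P(f)$, the recovery term with HW-level support stays at 7.55% of the SW-only value at every sweep point, while both grow as failures become more frequent.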

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures leading to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains fail-safe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the proposed SEDC checker outperformed both in all three respects. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47.6%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data at both the SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 '13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC '10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN '13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD '13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT '16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC '14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity: report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, “A t-unidirectional error-detecting systematic code,” Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, “Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction,” in Proceedings of the 2012 IEEE International Conference on Communications, ICC '12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, “On the design of self-testing checkers for modified Berger codes,” in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, “Self-checking look-up tables using scalable error detection coding (SEDC) scheme,” Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.


[26] D. A. Pierce Jr. and P. K. Lala, “Modular implementation of efficient self-checking checkers for the Berger code,” Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, “Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding,” in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, “Online error detection in SRAM based FPGAs using Scalable Error Detection Coding,” in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED '13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, “Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes,” IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, Piscataway, NJ, USA, May 2010.


Page 9: SEDC-Based Hardware-Level Fault Tolerance and Fault Secure ...downloads.hindawi.com/journals/sp/2018/7306837.pdf · ResearchArticle SEDC-Based Hardware-Level Fault Tolerance and Fault

Scientific Programming 9

Check SymbolComplement

Generator

m-variabletwo-rail

TSC m-ot-of-2m codechecker

2m-input wired-AND-OR gate

blocks

Informationbits

Informationbits

Informationbits

Check bits Check bits Check bits

Error indicationError indication Error indication

(a) (b) (c)

k-input wired-AND-OR gate

signal VB

signal VSsignal VW

SB SWSS

S10

SFS SEDH checker

SB

Figure 8 Block diagrams of (a) TSC Berger checker (b) m-out-of-2m code checker and (c) FS SEDC checker

2 3 4 5 7 8 15 16 30 32Data Length (bits)

m-out-2mBergerSEDC

0500

1000150020002500

Circ

uit S

ize (

of

tran

sisto

rs)

Figure 9 Area comparison of area-optimized Berger [23] SEDC and m-out-of-2m [24] code checkers

codes only one of the output combinations ldquo10rdquo is considereda valid code word A nonvalid checker output ldquo00rdquo ldquo01rdquoor ldquo11rdquo at output 11987810 indicates the presence of a fault in thefunctional circuit or the FS checker itself The k-input wiredAND-OR network takes the ldquo119886+1rdquo pairs of output from eachSEDC checker subblock and then converts them into a final2-bit error indication signal 11988111987851 Fault Test on FS SEDC Checker Area-optimized real-ization of TSC Berger code checkers in Piestrak et al [23]showed less area overhead than m-out-of-2m code checkerswhich is apparent fromFigure 9 But if we consider the delay-optimized implementation of the TSC Berger code checkerfrom Pierce Jr and Lala [26] we see that the TSC Berger codechecker requires more area than the FS SEDC and m-out-of-2m codes checkers [24] as shown in Table 3 For claritywe discretely listed the area overhead offered based on codestorage area and code checker area in Table 3 Also listedseparately are the area overhead required by the TRC tree forthe TSC Berger code checker the wired-AND-OR networkfor FS SEDC and the m-out-of-2m code checker

For a fair comparison the extra cost of the code storagearea is also taken into account We assumed that 1-bit storage

is implemented by 12-MOS transistors [30] Table 3 lists thearea (in terms of the number of transistors) occupied by FSSEDC delay-optimized Berger code and m-out-of-2m codecheckers for up to 32-bit data

The FS SEDCn checker block shown in Figure 8(c)requires fewer gates implemented with [26 + (a times 50)] MOStransistors if ldquob = 2rdquo [50 + (a times 50)] MOS transistors if ldquob= 3rdquo and [58 + (a times 50)] MOS transistors if ldquob = 4rdquo The m-out-of-2m code checker implementation of Lala [24] requires2m2 - 2m + 2 gates The gate-level circuit is also translated totransistor-level circuits using data from Smith [30]

The results show that when scaling a 7-bit 0rsquos counter toan 8-bit 0rsquos counter 154 extra MOS transistors are requiredThe m-out-of-2m code checker requires 60 MOS transistorswhen scaling a 7-out-of-14 checker to an 8-out-of-16 checkerwhereas the SEDC checker requires only 18 extra MOS tran-sistors That is because a 7-bit SEDC checker is implementedwith one SEDC3 and one SEDC4 circuit that contain 50 and58 MOS transistors respectively (a total of 108 transistors)An 8-bit SEDC checker is implemented using one SEDC2and two SEDC3 checkers requiring 26 and 100 (50x2) MOStransistors (a total of 126 transistors) This means that SEDCsaves 88 of the number of transistors compared to a Bergercode checker [26] and it saves 70 of the transistors when

10 Scientific Programming

Table 3 Area overhead of Berger [26] SEDC and m-out-of-2m [24] code checkers

Data Bit

Berger Code SEDC m-out-of-2mCode

storageArea

1rsquoscounter

Area

TRCArea

TotalArea

Codestorage

Area

CheckerArea

AND-ORNetwork

TotalArea

CodeStorage

Area

CheckerArea

AND-ORNetwork Total Area

2 24 22 4 50 24 26 0 50 24 36 0 503 24 80 8 112 24 50 0 74 36 152 0 1884 36 180 12 228 36 58 6 100 48 240 10 2985 36 178 16 230 48 76 6 130 60 300 14 3747 36 396 24 456 60 108 8 176 84 420 18 5228 48 550 28 626 72 126 8 206 96 480 20 59615 48 1106 56 1210 120 250 14 384 180 900 38 111816 60 1308 60 1428 132 258 16 406 192 960 40 119230 60 2586 116 2762 240 500 26 766 360 1800 76 223632 72 3048 120 3240 264 526 28 818 384 1920 80 2384

compared to m-out-of-2m code checkers Although Bergerand m-out-of-2m checkers are TSC while the proposedSEDC checker is only FS all three checkers provide the samefault security

522 Delay As far as delay is concerned the FS SEDCchecker also performs better than Berger and cellular imple-mentations for an m-out-of-2m code checker as shown inTable 4 For the sake of uniformity we designed all the basicgates using the same technology transistors (PMOS = 81205832120583NMOS = 41205832120583) and evaluated the worst-case propagationdelay of each circuit

The SEDC checker shows almost a constant delay for n gt3 bits due to its parallel implementation whereas the delay inthe Berger code checker increases owing to an increase in gatelevels (from 6 to 16) in the critical path as shown by Pierce Jrand Lala [26] The delay for m-out-of-2m code checkers alsocontinues to increasewith increasing data lengths because thecellular implementation requires ldquom (= input data length)rdquogate levels in the critical path

523 Power Dissipation In order to evaluate the powerdissipation of the three checkers we used the PowerPlaypower analyzer toolWe implemented the Berger [24]m-out-of-2m [26] and SEDC checker using Verilog and synthesizedthe circuits usingAlterarsquos Quartus II softwareWe targeted thecircuit for a Cyclone II EP2C5AF256A7 chip which has theleast power dissipating properties among the Cyclone familyWe allowed the synthesizer to create a balance between areaand delay while synthesizing in order to get a better powerestimate We also enabled the synthesizer to use synthesizingmodel that takes intensive steps to optimize power for allthree circuits We clocked the inputs of the circuit with thedefault toggle rate and estimated the total thermal powerdissipation for different values of input data width

Figure 10(a) shows a comparison of power dissipation between the three checkers. The Berger and m-out-of-2m checkers exhibited a sudden increase in power dissipation when the input data width was changed from 16 bits to 32 bits, while SEDC showed a minimal change. This happens due to the increase in the number of two-rail checkers in the case of the Berger checker and due to the increase in the checker circuitry itself in the case of the m-out-of-2m checker. The same trend is evident in Figure 10(b), which depicts an area comparison between the three checkers in terms of the number of logic elements (LEs) occupied by each checker.

5.3. Fault Coverage of the Proposed HW-Level Fault Tolerance Scheme. In order to demonstrate the effectiveness of the SEDC CSG and its FS checker, we computed the fault coverage of the proposed SEDC-based HW-level fault tolerance scheme. We applied faults to the example circuit of Figure 4, given in Section 3.4. Most VLSI combinational circuits designed for mathematical operations such as addition, subtraction, multiplication, and division consist of multiple instances of 1-bit adders (full adders); hence, the example circuit, i.e., a 4-bit adder, is a simple and good candidate for presenting the effectiveness of our scheme. We injected two major types of transient errors, i.e., stuck-at-0 and stuck-at-1 [29], at 24 nodes (6 nodes per full adder, as shown in Figure 11(b)). We injected these errors using 2-to-1 multiplexers whose output is given by

\[
mux_{out} =
\begin{cases}
in1\ (\text{normal gate output}), & \text{if } select\ (f\_enable) = 0,\\
in2\ (\text{stuck-at fault},\ f \in F), & \text{if } select\ (f\_enable) = 1.
\end{cases}
\tag{8}
\]

In Figure 11(a), the symbols A[3:0], B[3:0], Cin, f_enable, and F[23:0] denote the 4-bit input A, 4-bit input B, 1-bit carry-in, 1-bit fault enabling signal, and 24-bit fault signals, respectively, while Cout is the carry-out and S[3:0] represents the 4-bit sum output of the 4-bit adder. Figure 11(b) shows the detailed schematic of a single full adder.

We considered that faults can occur at the outputs of the logic gates only and adopted a single-fault model, according to which only one fault can occur at a time [29].
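The multiplexer-based injection of (8) and the single-fault model can be sketched in software. The following is only an illustrative model, not the authors' Verilog design: the full adder below is a hypothetical five-gate netlist (the full adder of Figure 11(b) exposes six nodes, and its exact netlist is not reproduced here), and the node names and the `fault` tuple are assumptions made for this example.

```python
from itertools import product

def mux2_1(in1, in2, select):
    """2-to-1 multiplexer of Eq. (8): in1 when select = 0, in2 when select = 1."""
    return in2 if select else in1

# Hypothetical injectable nodes: one per gate output of an assumed
# five-gate full adder (two XOR, two AND, one OR).
NODES = ("xor1", "and1", "xor2", "and2", "or1")

def full_adder(a, b, cin, faults=None):
    """1-bit full adder; `faults` maps a node name to its stuck-at value."""
    faults = faults or {}

    def node(name, value):
        # The fault-enable signal is 1 only for nodes listed in `faults`.
        return mux2_1(value, faults.get(name, 0), name in faults)

    x1 = node("xor1", a ^ b)
    a1 = node("and1", a & b)
    s = node("xor2", x1 ^ cin)
    a2 = node("and2", x1 & cin)
    cout = node("or1", a1 | a2)
    return s, cout

def adder4(a, b, cin, fault=None):
    """4-bit ripple-carry adder; fault = (stage, node, stuck_value) or None.

    At most one fault is active, mirroring the single-fault model.
    Returns the 5-bit result (carry-out in bit 4).
    """
    carry, total = cin, 0
    for i in range(4):
        faults = {fault[1]: fault[2]} if fault and fault[0] == i else None
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry, faults)
        total |= s << i
    return total | (carry << 4)

def count_errors(fault):
    """Count input vectors (A, B, Cin) whose output differs from the fault-free adder."""
    return sum(adder4(a, b, c, fault) != adder4(a, b, c)
               for a, b, c in product(range(16), range(16), range(2)))
```

For instance, `count_errors((0, "or1", 1))` counts the input vectors for which a stuck-at-1 fault on the carry-out gate of the least significant stage corrupts the sum. The counts from this simplified netlist are not expected to match the figures reported in Table 5.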


Figure 10: Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder and (b) 1-bit full adder with fault injection.


Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

Data Bits   Berger   SEDC    m-out-of-2m
2           3.888    0.514   1.024
3           4.151    2.524   -
4           7.741    2.738   5.490
5           -        2.713   5.558
7           7.821    2.77    8.297
8           7.599    2.76    9.284
15          10.566   2.826   -
16          12.956   2.751
32          17.964   2.771   -

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

Quantity                                       Total   Percentage (%)
(a) Total errors at the output of the adder    1748    100
(b) BEs                                        252     14.42 w.r.t. (a)
(c) Detected BEs                               120     47.62 w.r.t. (b)
(d) UEs                                        1496    85.58 w.r.t. (a)
(e) Detected UEs                               1496    100 w.r.t. (d)
(f) Total detected errors                      1616    92.45 w.r.t. (a)
(g) Total undetected errors                    132     7.55 w.r.t. (a)

We used Altera's Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these error-causing faults resulted in bidirectional errors (BEs), while the remaining 85.58% caused unidirectional errors (UEs). This confirms that most errors in VLSI circuits appear as UEs at the output [19–21]. Even though SEDC is an AUED scheme that provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. The reason is that SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of the BEs is split into multiple UEs, one per part, and is thus detected by the proposed SEDC scheme.
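The distinction between UEs and BEs, and the way partitioning can turn one BE into several UEs, can be illustrated with a small sketch. This is not the SEDC encoder or checker: it only classifies bit flips between an expected and an observed word, and the fixed 2-bit segment width is an assumption chosen for illustration rather than the segmentation SEDC actually uses.

```python
def classify_error(expected, observed, width):
    """Return 'none', 'UE' (all flips in one direction), or 'BE' (flips in both directions)."""
    mask = (1 << width) - 1
    zero_to_one = ~expected & observed & mask   # bits that flipped 0 -> 1
    one_to_zero = expected & ~observed & mask   # bits that flipped 1 -> 0
    if not (zero_to_one or one_to_zero):
        return "none"
    return "BE" if (zero_to_one and one_to_zero) else "UE"

def classify_by_segment(expected, observed, width, seg_width):
    """Classify each seg_width-bit segment independently, least significant segment first."""
    seg_mask = (1 << seg_width) - 1
    return [
        classify_error((expected >> i) & seg_mask, (observed >> i) & seg_mask, seg_width)
        for i in range(0, width, seg_width)
    ]

# Word-level BE whose flips fall into different segments: each segment sees a UE.
print(classify_error(0b0110, 0b1100, 4))          # BE
print(classify_by_segment(0b0110, 0b1100, 4, 2))  # ['UE', 'UE']

# Word-level BE whose flips share one segment: it stays a BE in that segment.
print(classify_by_segment(0b0110, 0b1010, 4, 2))  # ['none', 'BE']
```

The second case shows why only a subset of the BEs can be caught this way: when both flip directions land in the same part, that part still sees a bidirectional error.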

5.4. Cost Analysis: SW-Based Fault Tolerance versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity, our analysis takes the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image stored on a secondary NameNode (SNN) (also called DataNode), on a Checkpoint Node, or in some cases on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of the failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].
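The heartbeat timeout rule described above can be sketched as follows. This is a minimal illustration and not HDFS code: the class name `HeartbeatMonitor`, the node identifiers, and the 30-second timeout are all hypothetical, and the recovery step itself (signaling and re-replication) is not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each secondary node (times in seconds)."""
    timeout_s: float                      # hypothetical fixed timeout
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node_id, now_s):
        """Record a heartbeat received from node_id at time now_s."""
        self.last_seen[node_id] = now_s

    def out_of_service(self, now_s):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now_s - t > self.timeout_s)

monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.heartbeat("snn-1", 0.0)
monitor.heartbeat("snn-2", 0.0)
monitor.heartbeat("snn-1", 25.0)         # snn-2 stays silent
print(monitor.out_of_service(40.0))      # ['snn-2']
```

In the scheme described in the text, each node reported by `out_of_service` would trigger the CC recovery algorithm on the primary node.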

For our cost analysis, we compute the cost associated with the CC data recovery algorithm. We assume a cloud application, such as a message passing interface (MPI) program, that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p - 1$ processes with equal probability. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure occurs only during message passing with equal probability and not during checkpointing or recovery, we use the analytical cost model given in [4] for the cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then, the total cost of fault tolerance per process is given by

\[
C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}
\]


Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then, according to [4], the average recovery cost in CC per process is given by

\[
T_{RO} = \frac{C_{rb}}{1 / P(f)} = P(f)\, C_{rb}. \tag{10}
\]

Let $P(cp)$ denote the probability that a process starts checkpointing; then $(1 - P(cp))^p$ is the probability that none of the $p$ processes starts checkpointing, while $1 - (1 - P(cp))^p$ is the probability that at least one process starts a checkpoint. Consequently, $1/(1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$ and generate request (REQ) and acknowledgement (ACK) signals to the remaining $p - 1$ noninitiators (a total of $2(p - 1)$ signals), or it can be a noninitiator with probability $1 - 1/p$ and generate only one ACK signal in response to the initiator. As a result, there are on average $3(p - 1)/p$ messages generated per checkpoint, and the average overhead per checkpoint is $C_w + (3(p - 1)/p)\, C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then, the average checkpointing cost for a process is given by

\[
T_{CP} = \frac{C_w + \dfrac{3(p - 1)}{p}\, C_{nl}}{1 / \left(1 - (1 - P(cp))^p\right)}
       = \left(1 - (1 - P(cp))^p\right)\left(C_w + \frac{3(p - 1)}{p}\, C_{nl}\right). \tag{11}
\]
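The average message count used in (11) follows from weighting the two roles a process can take by their probabilities. The short derivation below spells this out and evaluates it for $p = 128$, the number of processes used later in the analysis; it adds no assumption beyond the roles and probabilities stated in the text.

```latex
\begin{align*}
E[\text{messages}]
  &= \underbrace{\frac{1}{p}\cdot 2(p-1)}_{\text{initiator: REQ and ACK}}
   + \underbrace{\left(1-\frac{1}{p}\right)\cdot 1}_{\text{noninitiator: one ACK}}
   = \frac{2(p-1)}{p} + \frac{p-1}{p}
   = \frac{3(p-1)}{p},\\[4pt]
p = 128:\quad
E[\text{messages}] &= \frac{3\cdot 127}{128} = \frac{381}{128} \approx 2.98 .
\end{align*}
```

For large $p$ the average therefore approaches three messages per checkpoint, so the network-latency term in (11) is close to $3\,C_{nl}$.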

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpoint per 15 minutes), $C_{nl} = 20$ ms, $C_w = 1$ s, and $C_{rb} = 2$ s, as given in [4]. We consider $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, with one fault occurring every 168 hours (one week). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the value 0.0755 signifies that only 7.55% of the faults are left unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters at a time while keeping the others constant and observe the effect on the data recovery cost with and without the proposed HW-level fault tolerance.
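As a rough sketch, the cost model of (9), (10), and (11) can be written as three small functions. The parameter values are those quoted above, with times converted to seconds. Two points are assumptions of this sketch rather than statements from the text: the total execution time `t` is left as a caller-supplied argument because its value is not given here, and no attempt is made to reconcile the per-minute checkpointing rate with the per-hour failure rate.

```python
def recovery_cost(p_f, c_rb):
    """Eq. (10): T_RO = P(f) * C_rb."""
    return p_f * c_rb

def checkpoint_cost(p, p_cp, c_w, c_nl):
    """Eq. (11): T_CP = (1 - (1 - P(cp))^p) * (C_w + 3(p - 1)/p * C_nl)."""
    return (1.0 - (1.0 - p_cp) ** p) * (c_w + 3.0 * (p - 1) / p * c_nl)

def total_cost_percent(t_cp, t_ro, t):
    """Eq. (9): C = (T_CP + T_RO) / T * 100, with T supplied by the caller."""
    return (t_cp + t_ro) / t * 100.0

# Parameter values quoted in the text (times in seconds).
P_PROCS, P_CP, C_NL, C_W, C_RB = 128, 1 / 15, 0.020, 1.0, 2.0
P_F = 1 / 168            # SW-level fault tolerance only
P_F_HW = 0.0755 * P_F    # with HW-level fault tolerance (7.55% of faults remain)

t_cp = checkpoint_cost(P_PROCS, P_CP, C_W, C_NL)   # unchanged by HW-level protection
t_ro_sw = recovery_cost(P_F, C_RB)
t_ro_hw = recovery_cost(P_F_HW, C_RB)

# Only the recovery term shrinks; its ratio equals the fraction of unhandled faults.
print(t_ro_hw / t_ro_sw)
```

In this model, HW-level fault tolerance leaves $T_{CP}$ untouched and scales $T_{RO}$ by the fraction of faults that escape the hardware checker, which is why the benefit grows with the failure frequency.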

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes $p$ is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into $p$ processes and each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in the data recovery cost of the CC algorithm, because every process has to coordinate with all the others in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery, because it takes longer for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of an increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which had a huge impact on the fault tolerance overhead. However, if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures that could lead to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains failsafe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with the Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the proposed SEDC checker outperformed both. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data at both the SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University 2017, Sogang University Research Grant of 2012 (20121005601), and MSIP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 '13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC '10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN '13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD '13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT '16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC '14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity: report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, “A t-unidirectional error-detecting systematic code,” Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, “Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction,” in Proceedings of the 2012 IEEE International Conference on Communications, ICC '12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, “On the design of self-testing checkers for modified Berger codes,” in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, “Self-checking look-up tables using scalable error detection coding (SEDC) scheme,” Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.

[26] D. A. Pierce Jr. and P. K. Lala, “Modular implementation of efficient self-checking checkers for the Berger code,” Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, “Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding,” in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, “Online error detection in SRAM based FPGAs using Scalable Error Detection Coding,” in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED '13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, “Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes,” IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, Piscataway, NJ, USA, May 2010.

Page 10: SEDC-Based Hardware-Level Fault Tolerance and Fault Secure ...downloads.hindawi.com/journals/sp/2018/7306837.pdf · ResearchArticle SEDC-Based Hardware-Level Fault Tolerance and Fault

10 Scientific Programming

Table 3 Area overhead of Berger [26] SEDC and m-out-of-2m [24] code checkers

Data Bit

Berger Code SEDC m-out-of-2mCode

storageArea

1rsquoscounter

Area

TRCArea

TotalArea

Codestorage

Area

CheckerArea

AND-ORNetwork

TotalArea

CodeStorage

Area

CheckerArea

AND-ORNetwork Total Area

2 24 22 4 50 24 26 0 50 24 36 0 503 24 80 8 112 24 50 0 74 36 152 0 1884 36 180 12 228 36 58 6 100 48 240 10 2985 36 178 16 230 48 76 6 130 60 300 14 3747 36 396 24 456 60 108 8 176 84 420 18 5228 48 550 28 626 72 126 8 206 96 480 20 59615 48 1106 56 1210 120 250 14 384 180 900 38 111816 60 1308 60 1428 132 258 16 406 192 960 40 119230 60 2586 116 2762 240 500 26 766 360 1800 76 223632 72 3048 120 3240 264 526 28 818 384 1920 80 2384

compared to m-out-of-2m code checkers Although Bergerand m-out-of-2m checkers are TSC while the proposedSEDC checker is only FS all three checkers provide the samefault security

522 Delay As far as delay is concerned the FS SEDCchecker also performs better than Berger and cellular imple-mentations for an m-out-of-2m code checker as shown inTable 4 For the sake of uniformity we designed all the basicgates using the same technology transistors (PMOS = 81205832120583NMOS = 41205832120583) and evaluated the worst-case propagationdelay of each circuit

The SEDC checker shows almost a constant delay for n gt3 bits due to its parallel implementation whereas the delay inthe Berger code checker increases owing to an increase in gatelevels (from 6 to 16) in the critical path as shown by Pierce Jrand Lala [26] The delay for m-out-of-2m code checkers alsocontinues to increasewith increasing data lengths because thecellular implementation requires ldquom (= input data length)rdquogate levels in the critical path

523 Power Dissipation In order to evaluate the powerdissipation of the three checkers we used the PowerPlaypower analyzer toolWe implemented the Berger [24]m-out-of-2m [26] and SEDC checker using Verilog and synthesizedthe circuits usingAlterarsquos Quartus II softwareWe targeted thecircuit for a Cyclone II EP2C5AF256A7 chip which has theleast power dissipating properties among the Cyclone familyWe allowed the synthesizer to create a balance between areaand delay while synthesizing in order to get a better powerestimate We also enabled the synthesizer to use synthesizingmodel that takes intensive steps to optimize power for allthree circuits We clocked the inputs of the circuit with thedefault toggle rate and estimated the total thermal powerdissipation for different values of input data width

Figure 10(a) shows a comparison of power dissipationbetween the three checkers The Berger and m-out-of-2mcheckers exhibited a sudden increase in power dissipation

when the input data width was changed from 16-bits to 32-bits while SEDC showed a minimal change This happensdue to the increase in the number of two-rail checkers inthe case of the Berger checker and due to the increase inthe checker circuitry itself in the case of the m-out-of-2mchecker which is also evident in Figure 10(b) which depictsan area comparison between the three checkers in terms of of logic elements (LE) occupied by the checkers

53 Fault Coverage of the Proposed HW-Level Fault ToleranceScheme In order to elaborate the effectiveness of the SEDCCSG and its FS checker we computed the fault coverage ofthe proposed SEDC-based HW-level fault tolerance schemeWe applied faults in the example circuit of Figure 4 givenin Section 34 As most of the VLSI combinational circuitsdesigned for mathematical operations like add subtractmultiply division etc consist of multiple instances of 1-bitadders (full adders) hence the example circuit ie a 4-bitadder is a simple and good candidate for presenting theeffectiveness of our scheme We injected two major typesof transient errors ie stuck-at-0 and stuck-at-1 [29] at 24nodes (at 6 nodes per full adder as shown in Figure 11(b))Weinjected these errors using 2-to-1 multiplexers whose outputis given by

119898119906119909119906=

1198941198991 (119899119900119903119898119886119897 119892119886119905119890 119900119906119905119901119906119905) 119894119891 119904119890119897119890119888119905 (119891 119890119899119886119887119897119890) = 01198941198992 (119904119905119906119888119896 minus 119886119905 minus 119891119886119906119897119905 119891 isin F) 119894119891 119904119890119897119890119888119905 (119891 119890119899119886119887119897119890) = 1

(8)

In Figure 11(a) the symbols A[30] B[30] Cin f enableand F[230] denote the 4-bits input A 4-bits input B 1-bitcarry-in 1-bit fault enabling signal and 24-bits fault signalsrespectively while Cout is the carry-out and S[30] representsthe 4-bits sum output of the 4-bits adder Figure 11(b) showsthe detailed schematic of a single full adder

We considered that the faults can occur at the outputsof the logic gates only and adopted a single-fault modelaccording to which only one fault can occur at a time [29]

Scientific Programming 11

(a) (b)

Figure 10 Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26] m-out-of-2m [24] and SEDCcheckers

ABCinf_enableF[50] S

Cout

A[30]B[30]

F[230]

Cinf_enable

FullAdderFA1ABCinf_enableF[50] S

Cout

FullAdderFA2ABCinf_enableF[50] S

Cout

FullAdderFA3ABCinf_enableF[50] S

Cout

FullAdderFA4

Cout

S[30]

(a)

in1in2select

out

AB

F[50]

Cin

f_enable

mux2_1comb_10

Cout

S

in1in2select

out

mux2_1comb_11

in1in2select

out

mux2_1comb_12

in1in2select

out

mux2_1comb_4

in1in2select

out

mux2_1comb_14

in1in2select

out

mux2_1comb_6

(b)

Figure 11 (a) RTL schematic of a 4-bit adder and (b) 1-bit full adder with fault injection

12 Scientific Programming

Table 4 Critical path (CP) delay comparison of Berger SEDC and m-out-of-2m codes checker (unit = microseconds)

Data Bits Berger SEDC m-out-2m2 3888 0514 10243 4151 2524 -4 7741 2738 54905 - 2713 55587 7821 277 82978 7599 276 928415 10566 2826 -16 12956 275132 17964 2771 -

Table 5 Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder

(a) Total errors at theoutput of the adder (b) BEs

(c)Detected

BEs(d) UEs (e) Detected

UEs(f) Total detected

errors(g) Total undetected

errors

Total 1748 252 120 1496 1496 1616 132

Percentage () 100 1442wrt (a)

4762 wrt(b)

8558 wrt(a) 100 wrt (d) 9245 wrt (a) 755 wrt (a)

We used Alterarsquos Quartus II software to design and synthesizethe overall system and then simulated the system usingModelSimWedesigned a self-checking test bench to evaluatethe overall fault coverage The statistics of the fault injectionand its results are summarized in Table 5

In total we injected 6425 faults exhaustively out of which1748 faults actually caused a logical error at the output ofthe adder circuitry Only 1442 of these injected faultsresulted in bidirectional errors (BEs) while most of thefaults caused unidirectional errors (UEs) This also provedthe fact that most of the errors in VLSI circuits result inUEs at the output [19ndash21] Even though SEDC is an AUEDscheme and it provides 100 fault coverage against UEs italso successfully detected 4762 of the BEs as shown inTable 5 This is due to the reason that SEDC partitions theinput data word into multiple parts and encodes and decodeseach part independently Consequently a subset of BEs isalso partitioned into multiple UEs and thus detected by theproposed SEDC scheme

54 Cost Analysis SW-Based Fault Tolerance Versus HW-Based Fault Tolerance In this section we discuss the effectof fault propagation and the estimated cost of recovery fromfailure (also known as repair time) in big data computingin two cases (a) when HW-based fault tolerance is appliedand (b) when only SW-based fault tolerance is appliedFor simplicity in our analysis we take the example of acoordinated checkpointing (CC) algorithm which is widelyused in HDFS for data recovery [31]

In HDFS an image is used to define metadata (whichcontains node data and a list of blocks belonging to eachfile) while checkpoint defines the persistent record of theimage stored on a secondary NameNode (SNN) (also calledDataNode) or Checkpoint Node or in some cases on the

primary NameNode (PNN) itself If the PNN uses the CCdata recovery algorithm the checkpoints are distributedamong multiple SNNs During normal operation the SNNsends heartbeats (a communication signal) to the PNNperiodically If the PNN does not receive a heartbeat fromthe SNN for certain fixed amount of time the SNN isconsidered to be out of service and the block replicas ithosts are considered to be unavailable In this case the PNNinitiates the CC recovery algorithm which includes signaling(sending heartbeats with control signals to other nodes) andreplicating the copy of failed SNN data (available on thecheckpoint nodes) to the other nodes in a coordinated way[31]

For our cost analysis, we compute the cost associated with the CC data recovery algorithm. We assume a cloud application, such as a message passing interface (MPI) program, that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p-1$ processes with equal probability. We also assume that the message sending, checkpointing, and fault occurrence events are independent of each other. Assuming that a process is modelled as a sequence of deterministic events (i.e., every step taken by the process has a known outcome) and that failures occur only during message passing, with equal probability, and not during checkpointing or recovery, we use the analytical cost model given in [4] for the cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then the total cost of fault tolerance per process, as a percentage of $T$, is given by

$$C = \frac{T_{CP} + T_{RO}}{T} \times 100. \tag{9}$$

Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then, according to [4], the average recovery cost in CC per process is given by

$$T_{RO} = \frac{C_{rb}}{1/P(f)} = P(f)\,C_{rb}. \tag{10}$$

Let $P(cp)$ denote the probability that a process starts checkpointing. Then $(1 - P(cp))^p$ is the probability that none of the $p$ processes starts checkpointing, while $1 - (1 - P(cp))^p$ is the probability that at least one process starts a checkpoint. Consequently, $1/(1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$, in which case it generates request (REQ) and acknowledgement (ACK) signals to the remaining $p-1$ noninitiators (a total of $2(p-1)$ signals); likewise, it can be a noninitiator with probability $1 - 1/p$, in which case it generates only one ACK signal in response to the initiator. As a result, an average of $3(p-1)/p$ messages is generated per checkpoint, and the average overhead per checkpoint is $C_w + (3(p-1)/p)\,C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then the average checkpointing cost for a process is given by

$$T_{CP} = \frac{C_w + \frac{3(p-1)}{p}\,C_{nl}}{1/\left(1 - (1 - P(cp))^p\right)} = \left(1 - (1 - P(cp))^p\right)\left(C_w + \frac{3(p-1)}{p}\,C_{nl}\right). \tag{11}$$
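The average message count used in (11) follows from weighting the two roles a process can take by their probabilities. As a short worked step, using only the quantities defined above:

$$\begin{aligned}
\mathbb{E}[\text{messages per checkpoint}]
  &= \frac{1}{p}\cdot 2(p-1) + \left(1 - \frac{1}{p}\right)\cdot 1 \\
  &= \frac{2(p-1)}{p} + \frac{p-1}{p} = \frac{3(p-1)}{p}.
\end{aligned}$$

For example, $p = 128$ gives $3 \cdot 127/128 \approx 2.98$ messages per checkpoint, so the message term approaches 3 as $p$ grows.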

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpoint per 15 minutes), $C_{nl} = 20$ ms, $C_w = 1$ s, and $C_{rb} = 2$ s, as given in [4]. We take $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, with one fault occurring every 168 hours (one week). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the factor 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters at a time, keeping the others constant, and observe the effect on the data recovery cost with and without the proposed HW-level fault tolerance.
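As a minimal sketch, the cost model in (9), (10), and (11) can be written in Python as follows. The function names, the value of $T$ (which is not stated in this section), and the choice to use the parameter values exactly as quoted, without any unit conversion, are assumptions for illustration. The unhandled-fault fraction of 7.55% is taken from Table 5. The printed numbers are therefore not claimed to reproduce Figures 12–14.

```python
def checkpoint_overhead(p: int, p_cp: float, c_w: float, c_nl: float) -> float:
    """Average checkpointing cost T_CP per process, equation (11)."""
    prob_any_checkpoint = 1.0 - (1.0 - p_cp) ** p
    cost_per_checkpoint = c_w + (3.0 * (p - 1) / p) * c_nl
    return prob_any_checkpoint * cost_per_checkpoint


def recovery_overhead(p_f: float, c_rb: float) -> float:
    """Average recovery cost T_RO per process, equation (10)."""
    return p_f * c_rb


def fault_tolerance_cost(t_cp: float, t_ro: float, t: float) -> float:
    """Total cost C of fault tolerance as a percentage of T, equation (9)."""
    return (t_cp + t_ro) / t * 100.0


if __name__ == "__main__":
    p, p_cp = 128, 1 / 15               # processes, checkpointing probability
    c_nl, c_w, c_rb = 0.020, 1.0, 2.0   # network latency, write time, rollback time
    p_f_sw = 1 / 168                    # SW-level fault tolerance only
    p_f_hw = 0.0755 * p_f_sw            # with HW-level fault tolerance (Table 5)
    t = 100.0                           # hypothetical T; not given in the text

    t_cp = checkpoint_overhead(p, p_cp, c_w, c_nl)
    for label, p_f in (("SW only", p_f_sw), ("SW + HW", p_f_hw)):
        t_ro = recovery_overhead(p_f, c_rb)
        cost = fault_tolerance_cost(t_cp, t_ro, t)
        print(f"{label}: T_RO = {t_ro:.6f}, C = {cost:.4f}%")
```

In this model only $T_{RO}$ depends on $P(f)$, so HW-level detection scales the recovery term by 0.0755 and leaves $T_{CP}$ unchanged.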

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes $p$ is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into $p$ processes and that each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in the data recovery cost in the CC algorithm, because every process has to coordinate with every other process in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery, because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which had a large impact on the fault tolerance overhead, as shown in Figure 14. However, if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as also shown in Figure 14.

Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures that could lead to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains fail-safe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with the Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the proposed SEDC checker outperformed both. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47.62%, and 92.45% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed considerably since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the unpredictable failures associated with the huge amount of data at both the SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 ’13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC ’10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN ’13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD ’13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT ’16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC ’14, pp. 159–160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity: report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, “A t-unidirectional error-detecting systematic code,” Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, “Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction,” in Proceedings of the 2012 IEEE International Conference on Communications, ICC ’12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, “On the design of self-testing checkers for modified Berger codes,” in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, “Self-checking look-up tables using scalable error detection coding (SEDC) scheme,” Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.

[26] D. A. Pierce Jr. and P. K. Lala, “Modular implementation of efficient self-checking checkers for the Berger code,” Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, “Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding,” in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, “Online error detection in SRAM based FPGAs using Scalable Error Detection Coding,” in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED ’13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, “Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes,” IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST ’10), pp. 1–10, Piscataway, NJ, USA, May 2010.



Figure 10: Comparison of (a) power dissipation and (b) area in terms of LE counts between Berger [26], m-out-of-2m [24], and SEDC checkers.

Figure 11: (a) RTL schematic of a 4-bit adder and (b) 1-bit full adder with fault injection.


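Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

| Data bits | Berger | SEDC | m-out-of-2m |
|---|---|---|---|
| 2 | 3.888 | 0.514 | 1.024 |
| 3 | 4.151 | 2.524 | - |
| 4 | 7.741 | 2.738 | 5.490 |
| 5 | - | 2.713 | 5.558 |
| 7 | 7.821 | 2.77 | 8.297 |
| 8 | 7.599 | 2.76 | 9.284 |
| 15 | 10.566 | 2.826 | - |
| 16 | 12.956 | 2.751 | |
| 32 | 17.964 | 2.771 | - |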

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

| | (a) Total errors at the output of the adder | (b) BEs | (c) Detected BEs | (d) UEs | (e) Detected UEs | (f) Total detected errors | (g) Total undetected errors |
|---|---|---|---|---|---|---|---|
| Total | 1748 | 252 | 120 | 1496 | 1496 | 1616 | 132 |
| Percentage (%) | 100 | 14.42 wrt (a) | 47.62 wrt (b) | 85.58 wrt (a) | 100 wrt (d) | 92.45 wrt (a) | 7.55 wrt (a) |
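The percentages in Table 5 follow directly from the raw counts. A minimal Python sketch that recomputes them is given below; the dictionary keys and the function name `coverage_summary` are illustrative and are not part of the original test bench.

```python
from typing import Dict


def coverage_summary(counts: Dict[str, int]) -> Dict[str, float]:
    """Recompute the percentage row of Table 5 from the raw counts."""
    total = counts["total_errors"]
    detected = counts["detected_be"] + counts["detected_ue"]

    def pct(part: int, whole: int) -> float:
        return round(100.0 * part / whole, 2)

    return {
        "be_share": pct(counts["be"], total),                     # (b) wrt (a)
        "be_coverage": pct(counts["detected_be"], counts["be"]),  # (c) wrt (b)
        "ue_share": pct(counts["ue"], total),                     # (d) wrt (a)
        "ue_coverage": pct(counts["detected_ue"], counts["ue"]),  # (e) wrt (d)
        "total_coverage": pct(detected, total),                   # (f) wrt (a)
        "undetected": pct(total - detected, total),               # (g) wrt (a)
    }


if __name__ == "__main__":
    table5 = {
        "total_errors": 1748,
        "be": 252,
        "detected_be": 120,
        "ue": 1496,
        "detected_ue": 1496,
    }
    for name, value in coverage_summary(table5).items():
        print(f"{name}: {value}%")
```

The undetected share, 7.55%, is the fraction of error-causing faults that would still reach the SW level.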

We used Altera’s Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.


[19] S Kotaki and M Kitakami ldquoCodes correcting asymmet-ricunidirectional errors along with bidirectional errors ofsmall magnituderdquo in Proceedings of the 20th IEEE Pacific RimInternational Symposium on Dependable Computing PRDC rsquo14pp 159-160 Singapore November 2014

[20] B SManjunathaG SD Pateel andV Shah ldquoOral fibrolipomaA rare histological entity report of 3 cases and review ofliteraturerdquo Journal of Dentistry vol 7 no 4 pp 226ndash231 2010

[21] N K Jha and M B Vora ldquoA t-unidirectional error-detectingsystematic coderdquo Computers amp Mathematics with Applicationsvol 16 no 9 pp 705ndash714 1988

[22] J Kim D-H Lee and W Sung ldquoPerformance of rate 096(68254 65536) EG-LDPC code for NAND Flash memoryerror correctionrdquo in Proceedings of the 2012 IEEE InternationalConference on Communications ICC rsquo12 pp 7029ndash7033 June2012

[23] S Piestrak D Bakalis and X Kavousianos ldquoOn the design ofself-testing checkers for modified Berger codesrdquo in Proceedingsof the Seventh International On-Line Testing Workshop pp 153ndash157 Taormina Italy 2001

[24] P K Lala Self-Checking and Fault Tolerant Digital DesignAcademic press UK 2001

[25] J-A Lee Z A Siddiqui N Somasundaram and J-G LeeldquoSelf-checking look-up tables using scalable error detectioncoding (SEDC) schemerdquo Journal of Semiconductor Technologyand Science vol 13 no 5 pp 415ndash422 2013

16 Scientific Programming

[26] D A Pierce Jr and P K Lala ldquoModular implementation ofefficient self-checking checkers for the Berger coderdquo Journal ofElectronic Testing vol 9 no 3 pp 279ndash294 1996

[27] Z A Siddiqui P Hui-Jong and J Lee ldquoArea-Time Efficient Self-Checking ALU Based on Scalable Error Detection Codingrdquo inProceedings of the 2013 Euromicro Conference on Digital SystemDesign (DSD) pp 870ndash877 Los Alamitos CA USA September2013

[28] Z A Siddiqui and J-A Lee ldquoOnline error detection in SRAMbased FPGAs using Scalable Error Detection Codingrdquo inProceedings of the 5th Asia Symposium on Quality ElectronicDesign ASQED rsquo13 pp 321ndash324 PenangMalaysia August 2013

[29] D A Anderson and GMetze ldquoDesign of Totally Self-CheckingCheck Circuits for m-Out-of-n Codesrdquo IEEE Transactions onComputers vol C-22 no 3 pp 263ndash269 1973

[30] M A Smith Transistor counts httpenwikipediaorgwikiTransistor count April 05 2018

[31] K Shvachko H Kuang S Radia and R Chansler ldquoTheHadoop distributed file systemrdquo in Proceedings of the IEEE 26thSymposium on Mass Storage Systems and Technologies (MSSTrsquo10) 10 1 pages Piscataway NJ USA May 2010


Page 12: SEDC-Based Hardware-Level Fault Tolerance and Fault Secure ...downloads.hindawi.com/journals/sp/2018/7306837.pdf · ResearchArticle SEDC-Based Hardware-Level Fault Tolerance and Fault

12 Scientific Programming

Table 4: Critical path (CP) delay comparison of Berger, SEDC, and m-out-of-2m code checkers (unit = microseconds).

Data bits   Berger   SEDC    m-out-of-2m
2           3.888    0.514   1.024
3           4.151    2.524   -
4           7.741    2.738   5.490
5           -        2.713   5.558
7           7.821    2.77    8.297
8           7.599    2.76    9.284
15          10.566   2.826   -
16          12.956   2.751
32          17.964   2.771   -

Table 5: Summary of fault testing experiment on SEDC-based fault tolerant 4-bit adder.

Quantity                                        Total   Percentage (%)
(a) Total errors at the output of the adder     1748    100
(b) BEs                                         252     14.42 wrt (a)
(c) Detected BEs                                120     47.62 wrt (b)
(d) UEs                                         1496    85.58 wrt (a)
(e) Detected UEs                                1496    100 wrt (d)
(f) Total detected errors                       1616    92.45 wrt (a)
(g) Total undetected errors                     132     7.55 wrt (a)

We used Altera’s Quartus II software to design and synthesize the overall system and then simulated the system using ModelSim. We designed a self-checking test bench to evaluate the overall fault coverage. The statistics of the fault injection and its results are summarized in Table 5.
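The percentages reported in Table 5 follow directly from the raw counts. The short sketch below recomputes them; the counts are copied from Table 5, while the variable and function names are illustrative and not taken from the authors' test bench.

```python
# Raw counts copied from Table 5 (4-bit adder fault injection experiment).
total_errors = 1748   # (a) errors observed at the adder output
bes = 252             # (b) bidirectional errors
detected_bes = 120    # (c) bidirectional errors that were detected
ues = 1496            # (d) unidirectional errors
detected_ues = 1496   # (e) unidirectional errors that were detected


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


total_detected = detected_bes + detected_ues        # (f)
total_undetected = total_errors - total_detected    # (g)

coverage = {
    "BE share of errors": pct(bes, total_errors),
    "BE coverage": pct(detected_bes, bes),
    "UE share of errors": pct(ues, total_errors),
    "UE coverage": pct(detected_ues, ues),
    "total coverage": pct(total_detected, total_errors),
    "undetected share": pct(total_undetected, total_errors),
}

for name, value in coverage.items():
    print(f"{name}: {value}%")
```

The undetected share computed here is the fraction of errors that would still reach the software level in the cost analysis.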

In total, we injected 6425 faults exhaustively, out of which 1748 faults actually caused a logical error at the output of the adder circuitry. Only 14.42% of these errors were bidirectional errors (BEs), while most of the faults caused unidirectional errors (UEs). This is consistent with the observation that most of the errors in VLSI circuits result in UEs at the output [19–21]. Even though SEDC is an AUED scheme that provides 100% fault coverage against UEs, it also successfully detected 47.62% of the BEs, as shown in Table 5. The reason is that SEDC partitions the input data word into multiple parts and encodes and decodes each part independently. Consequently, a subset of BEs is also partitioned into multiple UEs and thus detected by the proposed SEDC scheme.
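The partitioning argument can be illustrated with a small sketch. It does not implement SEDC encoding or the checker; it only classifies an error as unidirectional or bidirectional, first over the whole word and then per segment. The 2-bit segment width and all function names are assumptions made for illustration.

```python
def classify_error(expected: int, observed: int) -> str:
    """Classify the difference between two non-negative words."""
    rising = ~expected & observed    # bits that flipped 0 -> 1
    falling = expected & ~observed   # bits that flipped 1 -> 0
    if not rising and not falling:
        return "none"
    if rising and falling:
        return "bidirectional"
    return "unidirectional"


def split_segments(word: int, width: int, seg_bits: int) -> list:
    """Split a word into seg_bits-wide segments, least significant first."""
    mask = (1 << seg_bits) - 1
    return [(word >> shift) & mask for shift in range(0, width, seg_bits)]


def per_segment_errors(expected: int, observed: int,
                       width: int, seg_bits: int) -> list:
    """Classify the error seen inside each segment independently."""
    pairs = zip(split_segments(expected, width, seg_bits),
                split_segments(observed, width, seg_bits))
    return [classify_error(e, o) for e, o in pairs]


# Bit 3 falls (1 -> 0) and bit 1 rises (0 -> 1): bidirectional over 4 bits,
# but each 2-bit segment sees flips in one direction only.
whole = classify_error(0b1001, 0b0011)
parts = per_segment_errors(0b1001, 0b0011, width=4, seg_bits=2)
print(whole, parts)

# Both flips fall inside the low segment, so that segment stays bidirectional.
same_segment = per_segment_errors(0b0110, 0b0101, width=4, seg_bits=2)
print(same_segment)
```

The second case shows why only a subset of BEs becomes detectable: when opposite flips land in the same segment, partitioning does not help.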

5.4. Cost Analysis: SW-Based Fault Tolerance versus HW-Based Fault Tolerance. In this section, we discuss the effect of fault propagation and the estimated cost of recovery from failure (also known as repair time) in big data computing in two cases: (a) when HW-based fault tolerance is applied and (b) when only SW-based fault tolerance is applied. For simplicity, in our analysis we take the example of a coordinated checkpointing (CC) algorithm, which is widely used in HDFS for data recovery [31].

In HDFS, an image is used to define metadata (which contains node data and a list of blocks belonging to each file), while a checkpoint defines the persistent record of the image stored on a secondary NameNode (SNN) (also called DataNode) or Checkpoint Node or, in some cases, on the primary NameNode (PNN) itself. If the PNN uses the CC data recovery algorithm, the checkpoints are distributed among multiple SNNs. During normal operation, the SNN sends heartbeats (a communication signal) to the PNN periodically. If the PNN does not receive a heartbeat from the SNN for a certain fixed amount of time, the SNN is considered to be out of service and the block replicas it hosts are considered to be unavailable. In this case, the PNN initiates the CC recovery algorithm, which includes signaling (sending heartbeats with control signals to other nodes) and replicating the copy of failed SNN data (available on the checkpoint nodes) to the other nodes in a coordinated way [31].
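The heartbeat timeout rule described above can be sketched as follows. This is a minimal illustration, not HDFS code: the class name, method names, node identifiers, and the 30-second timeout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node (hypothetical helper)."""
    timeout_s: float
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, node_id: str, now_s: float) -> None:
        # Called whenever a heartbeat from node_id arrives.
        self.last_seen[node_id] = now_s

    def out_of_service(self, now_s: float) -> list:
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(node for node, seen in self.last_seen.items()
                      if now_s - seen > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.record_heartbeat("snn-1", now_s=0.0)
monitor.record_heartbeat("snn-2", now_s=25.0)

# At t = 40 s, snn-1 has been silent for 40 s and snn-2 for 15 s.
failed = monitor.out_of_service(now_s=40.0)
print(failed)
```

In the scheme described in the text, each node reported here would trigger the CC recovery procedure on the PNN.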

For our cost analysis, we compute the cost associated with the CC data recovery algorithm. We assume a cloud application, such as a message passing interface (MPI) program, that comprises $p$ logical processes that communicate through message passing (heartbeats). Each process is executed on a virtual machine and sends a message to the remaining $p - 1$ processes with equal probability. We also consider that the message sending, checkpointing, and fault occurrence events are independent of each other. We further assume that a process is modelled as a sequence of deterministic events, i.e., every step taken by the process has a known outcome, and that failure only occurs during message passing, with equal probability, and not during checkpointing or recovery. Under these assumptions, we use the analytical cost model given in [4] for cost analysis of fault tolerance at the SW level. According to [4], $T$ denotes the total execution time of a process without fault tolerance, while $T_{CP}$ and $T_{RO}$ represent the checkpointing and failure recovery overheads, respectively. Then, the total cost of fault tolerance per process is given by

$$C = \frac{T_{CP} + T_{RO}}{T} \times 100 \qquad (9)$$


Figure 12: Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm.

Assuming that the average time to roll back a failed process is $C_{rb}$ and the mean time between failures is $1/P(f)$, where $P(f)$ denotes the probability of failure, then according to [4], the average recovery cost in CC per process is given by

$$T_{RO} = \frac{C_{rb}}{1/P(f)} = P(f)\,C_{rb} \qquad (10)$$

Let $P(cp)$ denote the probability that a process starts checkpointing; then $(1 - P(cp))^p$ becomes the probability that none of the $p$ processes starts checkpointing, while $1 - (1 - P(cp))^p$ becomes the probability that at least one process starts a checkpoint. Consequently, $1/(1 - (1 - P(cp))^p)$ represents the checkpointing interval. A process can be the initiator of checkpointing with probability $1/p$ and generate request (REQ) and acknowledgement (ACK) signals to the rest of the $p - 1$ noninitiators (total $2(p - 1)$ signals) and likewise be a noninitiator with probability $1 - 1/p$ and generate only one ACK signal in response to the initiator. As a result, there are $3(p - 1)/p$ average messages generated per checkpoint, and the average overhead per checkpoint is $C_w + (3(p - 1)/p)\,C_{nl}$, where $C_w$ denotes the average time to write a checkpoint to a stable node and $C_{nl}$ denotes the average network latency. Then, the average checkpointing cost for a process is given by

$$T_{CP} = \frac{C_w + \frac{3(p - 1)}{p}\,C_{nl}}{1/\left(1 - (1 - P(cp))^p\right)} = \left(1 - (1 - P(cp))^p\right)\left(C_w + \frac{3(p - 1)}{p}\,C_{nl}\right) \qquad (11)$$
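The average message count used in (11) can be checked by weighting the two roles by their probabilities. The worked step below uses only the quantities defined in the text; the numeric value for $p = 128$ is given as an illustration.

```latex
\begin{align*}
E[\text{messages per checkpoint}]
  &= \underbrace{\frac{1}{p}}_{\text{initiator}} \cdot 2(p - 1)
   + \underbrace{\left(1 - \frac{1}{p}\right)}_{\text{noninitiator}} \cdot 1 \\
  &= \frac{2(p - 1)}{p} + \frac{p - 1}{p}
   = \frac{3(p - 1)}{p}, \\
\text{for } p = 128:\quad
  \frac{3 \cdot 127}{128} &= \frac{381}{128} \approx 2.977 .
\end{align*}
```

For large $p$ the count approaches 3, so the network latency term in (11) contributes at most about $3\,C_{nl}$ per checkpoint.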

Using the cost model given in (9), (10), and (11), we computed the cost of data recovery in the CC algorithm with the parameters $p = 128$ processes (virtual machines), $P(cp) = 1/15$ (one checkpointing per 15 minutes), $C_{nl} = 20$ msecs, $C_w = 1$ sec, and $C_{rb} = 2$ secs, as given in [4]. We consider the value of $P(f) = 1/168$, which implies that 100% of the faults in hardware are propagated to the SW level in the absence of HW-level fault tolerance, while each fault occurs after 168 hours (one week’s time). After we apply HW-level fault tolerance, the probability of failure $P(f)$ reduces to $P'(f) = 0.0755 \times P(f)$, where the value 0.0755 signifies that only 7.55% of the faults are unhandled by the proposed HW-level fault tolerance system (see Table 5). We vary one of the above parameters at a time while keeping the others constant and observe the effect on data recovery cost with and without the proposed HW-level fault tolerance.
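A minimal sketch of the cost model in (9), (10), and (11) is given below. It is not the authors' evaluation script. The parameter values are those listed above, passed in without unit conversion beyond expressing 20 msecs as 0.020 sec; the total execution time `t_total` is a hypothetical placeholder because the text does not state $T$; and the unhandled fraction is taken from the counts in Table 5 (132 undetected out of 1748 errors).

```python
def recovery_overhead(p_f: float, c_rb: float) -> float:
    """Equation (10): T_RO = P(f) * C_rb."""
    return p_f * c_rb


def checkpoint_overhead(p: int, p_cp: float, c_w: float, c_nl: float) -> float:
    """Equation (11): T_CP = (1 - (1 - P(cp))^p) * (C_w + 3(p-1)/p * C_nl)."""
    p_any = 1.0 - (1.0 - p_cp) ** p      # at least one process checkpoints
    avg_messages = 3.0 * (p - 1) / p     # average messages per checkpoint
    return p_any * (c_w + avg_messages * c_nl)


def fault_tolerance_cost(t: float, t_cp: float, t_ro: float) -> float:
    """Equation (9): C = (T_CP + T_RO) / T * 100."""
    return (t_cp + t_ro) / t * 100.0


# Parameter values as listed in the text.
p, p_cp = 128, 1.0 / 15.0
c_nl, c_w, c_rb = 0.020, 1.0, 2.0
p_f_sw = 1.0 / 168.0                      # SW-level fault tolerance only
p_f_hw = (132.0 / 1748.0) * p_f_sw        # unhandled share from Table 5

t_total = 3600.0                          # hypothetical T, not from the paper

t_cp = checkpoint_overhead(p, p_cp, c_w, c_nl)
t_ro_sw = recovery_overhead(p_f_sw, c_rb)
t_ro_hw = recovery_overhead(p_f_hw, c_rb)
cost_sw = fault_tolerance_cost(t_total, t_cp, t_ro_sw)
cost_hw = fault_tolerance_cost(t_total, t_cp, t_ro_hw)

print(f"T_RO ratio (HW+SW vs SW only): {t_ro_hw / t_ro_sw:.4f}")
print(f"cost SW only: {cost_sw:.6f}, cost with HW-level FT: {cost_hw:.6f}")
```

Because only $T_{RO}$ depends on $P(f)$, HW-level fault tolerance scales the recovery term by the unhandled fraction and leaves the checkpointing term unchanged; sweeping `p`, `c_nl`, `p_cp`, or `p_f_sw` one at a time mirrors the experiments described in the text.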

The graph in Figure 12(a) shows the average cost of data recovery when the number of processes $p$ is increased from 32 to 4096 (virtual machines). We consider that an application is partitioned into $p$ processes and each process runs on a virtual machine. The increase in the number of processes causes a sharp increase in data recovery cost in the CC algorithm because every process has to coordinate with every other process in case of a failure.

Figure 12(b) depicts the effect of network latency on the cost of data recovery. In this case, we increased the network latency from 2 milliseconds to 300 milliseconds. Network latency depends heavily upon the traffic situation, network bandwidth, data size, and number of active nodes in the network. Figure 12(b) shows that increasing network latency has a negative impact on data recovery because it takes a longer time for processes to communicate with each other, resulting in delayed data recovery.

Figure 13 illustrates the situation where we increase the checkpointing frequency from one checkpoint per hour (1/60) to one checkpoint per minute. Even though the increase in checkpointing frequency improves the overall fault tolerance, it also increases the overall fault tolerance overhead, as shown in Figure 13.

Finally, we show the effect of the increasing probability of failure on the cost of data recovery in Figure 14. We varied the failure frequency from one failure per 1024 hours to one failure per 2 hours, which substantially increased the fault tolerance overhead, as shown in Figure 14. However, if we detect most of the errors at the hardware level, the average cost of data recovery reduces to a tolerable limit, as shown in Figure 14.

Figure 13: Effect of checkpointing frequency on data recovery cost in CC algorithm.

Figure 14: Effect of failure probability on data recovery in CC algorithm.

Because of the errors arising at the HW level, the average cost of data recovery, in terms of percent increase in runtime, is much higher in all of the above cases if we apply fault tolerance at the SW level only. Among the four parameters, i.e., number of processes, network latency, checkpointing frequency, and frequency of failure, the frequency of failure has the worst effect on the average cost of data recovery. The proposed HW-level fault tolerance reduces the average cost to a tolerable limit, which is promising for big data and cloud computing applications. Although there is a one-time cost associated with HW-level fault tolerance, it provides high reliability against potential failures leading to severe socioeconomic consequences in big data and cloud computing.

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains failsafe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with the Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, and the comparison favors the proposed SEDC checker. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.

From hardware-level evolution, such as microprocessors, memories, and parallel computing devices, to system-level advancements, such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data, both at SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910) supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 ’13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC ’10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN ’13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD ’13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT ’16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC ’14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity: report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, “A t-unidirectional error-detecting systematic code,” Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, “Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction,” in Proceedings of the 2012 IEEE International Conference on Communications, ICC ’12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, “On the design of self-testing checkers for modified Berger codes,” in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, “Self-checking look-up tables using scalable error detection coding (SEDC) scheme,” Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.


[26] D. A. Pierce Jr. and P. K. Lala, “Modular implementation of efficient self-checking checkers for the Berger code,” Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, “Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding,” in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, “Online error detection in SRAM based FPGAs using Scalable Error Detection Coding,” in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED ’13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, “Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes,” IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST ’10), pp. 1–10, Piscataway, NJ, USA, May 2010.


Page 13: SEDC-Based Hardware-Level Fault Tolerance and Fault Secure ...downloads.hindawi.com/journals/sp/2018/7306837.pdf · ResearchArticle SEDC-Based Hardware-Level Fault Tolerance and Fault

Scientific Programming 13

(a) (b)

Figure 12 Effect of (a) number of processes and (b) network latency on data recovery overhead in CC algorithm

Assuming that the average time to roll back a failed processis 119862119903119887and mean time between failures is 1119875(119891) where119875(119891)denotes the probability of failure then according to [4]the average recovery cost in CC per process is given by

119879119877119874 = 119862119903119887(1119875 (119891)) = 119875 (119891)119862119903119887 (10)

Let 119875(119888119901) denote the probability that a process startscheckpointing then (1 minus 119875(119888119901))119901 becomes the probabilitythat 119901 processes do not start checkpointing while 1 minus(1 minus 119875(119888119901))119901becomes the probability that at least one pro-cess starts a checkpoint Consequently 1(1 minus (1 minus 119875(119888119901))119901)represents the checkpointing interval A process can be theinitiator of checkpointing with probability 1119901 and generaterequest (REQ) and acknowledgement signals (ACK) to therest of the 119901 minus 1 noninitiators (total 2(119901 minus 1) signals) andlikewise be a noninitiator with probability 1 minus 1119901 andgenerate only one ACK signal in response to the initiatorAs a result there are 3(119901 minus 1)119901 average messages generatedper checkpoint and the average overhead per checkpoint is119862119908 + (3(119901 minus 1)119901)119862119899119897 where 119862119908denotes the average time towrite a checkpoint to a stable node and119862119899119897denotes the averagenetwork latency Then the average checkpointing cost for aprocess is given by

119879119862119875 = 119862119908 + (3 (119901 minus 1) 119901) 1198621198991198971 (1 minus (1 minus 119875 (119888119901))119901)

= (1 minus (1 minus 119875 (119888119901))119901)(119862119908 + 3 (119901 minus 1)119901 119862119899119897)

(11)

Using the cost model given in (9) (10) and (11) we carriedout the cost of data recovery in the CC algorithm with theparameters 119901 = 128 processes (virtual machines) 119875(119888119901) =115 (one checkpointing per 15 minutes) 119862119899119897 = 20 119898119904119890119888119904119862119908 = 1 119904119890119888 119862119903119887 = 2 119904119890119888119904 as given in [4] We consider the

value of 119875(119891) = 1168 which implies that 100 of the faultsin hardware are propagated to the SW level in the absenceof HW-level fault tolerance while each fault occurs after168 hours (one weekrsquos time) After we apply HW-level faulttolerance the probability of failure 119875(119891) reduces to 1198751015840(119891) =0755 times 119875(119891) where the value 0755 signifies that only 755of the faults are unhandled by the proposed HW-level faulttolerance system (see Table 5) We vary one of the aboveparameters by keeping the other constant and observe theeffect of data recovery cost with and without the proposedHW-level fault tolerance

The graph in Figure 12(a) shows the average cost of datarecoverywhen the number of processes119901 is increased from32to 4096 (virtual machines) We consider that an applicationis partitioned into 119901 processes and each process runs on avirtual machine The increase in number of processes causesa sharp increase in data recovery cost in the CC algorithmbecause every process has to coordinate with each other incase of a failure

Figure 12(b) depicts the effect of network latency on thecost of data recovery In this case we increased the networklatency from 2 milliseconds to 300 milliseconds Networklatency depends heavily upon the traffic situation networkbandwidth data size and number of active nodes in thenetwork Figure 12(b) shows that increasing network latencyhas a negative impact on data recovery because it takes alonger time for processes to communicate with each otherresulting in delayed data recovery

Figure 13 illustrates the situation where we increasethe checkpointing frequency from one checkpoint per hour(160) to one checkpoint per minute Even though theincrease in checkpointing frequency improves the overallfault tolerance it also increases the overall fault toleranceoverhead as shown in Figure 13

Finally we show the effect of the increasing probability offailure on the cost of data recovery in Figure 14 We variedthe failure frequency from one failure per 1024 hours to one

14 Scientific Programming

Figure 13 Effect of checkpointing frequency on data recovery cost in CC algorithm

Figure 14 Effect of failure probability on data recovery in CC algorithm

failure per 2 hours which caused a huge impact on faulttolerance overhead as shown in Figure 14 But if we detectmost of the errors at the hardware level the average costof data recovery reduces to a tolerable limit as shown inFigure 14

Because of the errors arising at the HW level the averagecost of data recovery in terms of percent increase in runtimein all of the above cases is much higher if we apply faulttolerance at the SW level only Among the four parametersie of processes network latency checkpointing frequencyand frequency of failure frequency of failure has the worsteffect on the average cost of data recoveryThe proposedHW-level fault tolerance reduces the average cost to a tolerablelimit which is promising for big data and cloud computingapplications Although there is a one-time cost associatedwith HW-level fault tolerance it provides high reliabilityagainst potential failures leading to severe socioeconomicconsequences in big data and cloud computing

6. Conclusions and Future Work

In this paper, we presented a concurrent error detection coding-based HW-level fault tolerance scheme for big data and cloud computing. The proposed method uses SEDC codes to protect against transient errors, which are a major problem in modern VLSI circuits. We also presented an FS SEDC checker that not only detects errors in the functional circuitry but also remains failsafe under s-a-1, s-a-0, s-open, and s-short errors within the checker circuitry. We compared the performance of the proposed SEDC checker with the Berger and m-out-of-2m checkers in terms of area, delay, and power dissipation, which shows the superiority of the proposed SEDC checker. Using the example of a 4-bit adder circuit, we presented a complete SEDC-based HW-level fault tolerance system and computed its fault coverage by exhaustive fault injection. The SEDC-based HW-level fault tolerance method shows 100%, 47%, and 92.5% fault coverage against unidirectional, bidirectional, and total errors, respectively. In order to show the effectiveness of the proposed SEDC-based HW-level fault tolerance method in big data and cloud computing applications, we compared the average cost of fault tolerance overhead with and without HW-level fault tolerance. The results show that HW-level fault tolerance reduces the probability of failure due to transient errors, consequently reducing the average cost of fault tolerance overhead to a great extent when compared with SW-level fault tolerance only.
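The gap between unidirectional and bidirectional coverage is a general property of AUED codes. As a small illustration, the sketch below uses the Berger code (one of the baselines compared above), not SEDC, whose encoding is not reproduced here. The helper names `encode`, `is_codeword`, and `undetected_unidirectional` are hypothetical. The check symbol is the binary count of 0s in the data word; every unidirectional error changes the two sides of the comparison in opposite directions, while a bidirectional error can leave the count of 0s unchanged and escape detection.

```python
from itertools import combinations


def encode(data):
    """Append Berger check bits: the number of 0s in `data`, in binary."""
    k = len(data).bit_length()          # enough bits for counts 0..len(data)
    zeros = data.count(0)
    check = [(zeros >> i) & 1 for i in reversed(range(k))]
    return data + check


def is_codeword(word, n):
    """True if the last bits of `word` equal the count of 0s in its first n bits."""
    data, check = word[:n], word[n:]
    value = int("".join(map(str, check)), 2)
    return data.count(0) == value


def undetected_unidirectional(n):
    """Exhaustively count unidirectional errors that go unnoticed (n >= 1)."""
    missed = 0
    for value in range(2 ** n):
        data = [(value >> i) & 1 for i in reversed(range(n))]
        word = encode(data)
        for bit in (0, 1):              # flip only 0->1, then only 1->0
            positions = [i for i, b in enumerate(word) if b == bit]
            for r in range(1, len(positions) + 1):
                for subset in combinations(positions, r):
                    corrupted = list(word)
                    for i in subset:
                        corrupted[i] ^= 1
                    if is_codeword(corrupted, n):
                        missed += 1
    return missed


if __name__ == "__main__":
    print("undetected unidirectional errors:", undetected_unidirectional(4))
    word = encode([1, 0, 1, 0])
    # Bidirectional error: one 1->0 flip and one 0->1 flip in the data bits.
    corrupted = [0, 1, 1, 0] + word[4:]
    print("bidirectional error accepted:", is_codeword(corrupted, 4))
```

Exhaustive enumeration is only practical for short words such as the 4-bit example; it is used here to mirror the idea of exhaustive fault injection on a small circuit.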

From hardware-level evolution such as microprocessors, memories, and parallel computing devices to system-level advancements such as networking, data security, resource sharing protocols, and operating systems, the underlying technologies have changed a lot since the emergence of big data and cloud computing. Fault tolerance plays a vital role in big data and cloud computing because of the uncertain failures associated with the huge amount of data, both at SW and HW levels. Given this, we believe that this research opens new opportunities for fault tolerance at the hardware level for big data and cloud computing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partly supported by research funds from Chosun University, 2017, Sogang University Research Grant of 2012 (20121005601), and MISP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (2015-0-00910), supervised by the IITP (Institute for Information & communications Technology Promotion).

References

[1] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.

[2] R. Jhawar, V. Piuri, and M. Santambrogio, “A comprehensive conceptual system-level approach to fault tolerance in Cloud Computing,” in Proceedings of the 2012 6th Annual IEEE Systems Conference (SysCon), pp. 1–5, Vancouver, Canada, March 2012.

[3] A. Katal, M. Wazid, and R. H. Goudar, “Big data: issues, challenges, tools and good practices,” in Proceedings of the 6th International Conference on Contemporary Computing (IC3 ’13), pp. 404–409, IEEE, Noida, India, August 2013.

[4] Y. M. Teo, B. L. Luong, Y. Song, and T. Nam, “Cost-performance of fault tolerance in cloud computing,” Special Issue of Journal of Science and Technology, vol. 49, no. 4A, pp. 61–73, 2011.

[5] M. Nazari Cheraghlou, A. Khadem-Zadeh, and M. Haghparast, “A survey of fault tolerance architecture in cloud computing,” Journal of Network and Computer Applications, vol. 61, pp. 81–92, 2016.

[6] J. Deng, S. C.-H. Huang, Y. S. Han, and J. H. Deng, “Fault-tolerant and reliable computation in cloud computing,” in Proceedings of the 2010 IEEE Globecom Workshops, GC ’10, pp. 1601–1605, Miami, Fla, USA, December 2010.

[7] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, p. 1, 2017, http://ieeexplore.ieee.org/document/7469864.

[8] M. Reitblatt, M. Canini, A. Guha, and N. Foster, “FatTire: Declarative fault tolerance for software-defined networks,” in Proceedings of the 2013 2nd ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking, HotSDN ’13, pp. 109–114, Hong Kong, China, August 2013.

[9] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in Proceedings of the 2013 ACM SIGMOD Conference on Management of Data, SIGMOD ’13, pp. 725–736, New York, NY, USA, June 2013.

[10] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters,” in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computer, p. 10, Berkeley, Calif, USA, 2012.

[11] P. Wang, D. J. Dean, and X. Gu, “Understanding Real World Data Corruptions in Cloud Systems,” in Proceedings of the 2015 IEEE International Conference on Cloud Engineering, pp. 116–125, Tempe, Ariz, USA, March 2015.

[12] P. A. Parker, “Discussion of Reliability Meets Big Data: Opportunities and Challenges,” Quality Engineering, vol. 26, no. 1, pp. 117–120, 2014.

[13] H. Bauer, P. Ranade, and S. Tandon, “Big data and the opportunities it creates for semiconductor players,” in McKinsey on Semiconductors, BIG DATA for Semiconductors, McKinsey & Company, 2012.

[14] H. Ueno and K. Namba, “Construction of a soft error (SEU) hardened Latch with high critical charge,” in Proceedings of the 29th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT ’16, pp. 27–30, September 2016.

[15] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in soft-error resilience,” The Computer Journal, vol. 38, no. 2, pp. 43–52, 2005.

[16] T. Karnik, P. Hazucha, and J. Patel, “Characterization of soft errors caused by single event upsets in CMOS processes,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 128–143, 2004.

[17] L.-T. Wang, X. Wen, and K. S. Abdel-Hafez, “Design for testability,” VLSI Test Principles and Architectures, pp. 37–103, 2006.

[18] N. Alves, “State-of-the-art techniques for detecting transient errors in electrical circuits,” IEEE Potentials, vol. 30, no. 3, pp. 30–35, 2011.

[19] S. Kotaki and M. Kitakami, “Codes correcting asymmetric/unidirectional errors along with bidirectional errors of small magnitude,” in Proceedings of the 20th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC ’14, pp. 159-160, Singapore, November 2014.

[20] B. S. Manjunatha, G. S. D. Pateel, and V. Shah, “Oral fibrolipoma: A rare histological entity: report of 3 cases and review of literature,” Journal of Dentistry, vol. 7, no. 4, pp. 226–231, 2010.

[21] N. K. Jha and M. B. Vora, “A t-unidirectional error-detecting systematic code,” Computers & Mathematics with Applications, vol. 16, no. 9, pp. 705–714, 1988.

[22] J. Kim, D.-H. Lee, and W. Sung, “Performance of rate 0.96 (68254, 65536) EG-LDPC code for NAND Flash memory error correction,” in Proceedings of the 2012 IEEE International Conference on Communications, ICC ’12, pp. 7029–7033, June 2012.

[23] S. Piestrak, D. Bakalis, and X. Kavousianos, “On the design of self-testing checkers for modified Berger codes,” in Proceedings of the Seventh International On-Line Testing Workshop, pp. 153–157, Taormina, Italy, 2001.

[24] P. K. Lala, Self-Checking and Fault Tolerant Digital Design, Academic Press, UK, 2001.

[25] J.-A. Lee, Z. A. Siddiqui, N. Somasundaram, and J.-G. Lee, “Self-checking look-up tables using scalable error detection coding (SEDC) scheme,” Journal of Semiconductor Technology and Science, vol. 13, no. 5, pp. 415–422, 2013.

[26] D. A. Pierce Jr. and P. K. Lala, “Modular implementation of efficient self-checking checkers for the Berger code,” Journal of Electronic Testing, vol. 9, no. 3, pp. 279–294, 1996.

[27] Z. A. Siddiqui, P. Hui-Jong, and J. Lee, “Area-Time Efficient Self-Checking ALU Based on Scalable Error Detection Coding,” in Proceedings of the 2013 Euromicro Conference on Digital System Design (DSD), pp. 870–877, Los Alamitos, CA, USA, September 2013.

[28] Z. A. Siddiqui and J.-A. Lee, “Online error detection in SRAM based FPGAs using Scalable Error Detection Coding,” in Proceedings of the 5th Asia Symposium on Quality Electronic Design, ASQED ’13, pp. 321–324, Penang, Malaysia, August 2013.

[29] D. A. Anderson and G. Metze, “Design of Totally Self-Checking Check Circuits for m-Out-of-n Codes,” IEEE Transactions on Computers, vol. C-22, no. 3, pp. 263–269, 1973.

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST ’10), pp. 1–10, Piscataway, NJ, USA, May 2010.

16 Scientific Programming

[26] D A Pierce Jr and P K Lala ldquoModular implementation ofefficient self-checking checkers for the Berger coderdquo Journal ofElectronic Testing vol 9 no 3 pp 279ndash294 1996

[27] Z A Siddiqui P Hui-Jong and J Lee ldquoArea-Time Efficient Self-Checking ALU Based on Scalable Error Detection Codingrdquo inProceedings of the 2013 Euromicro Conference on Digital SystemDesign (DSD) pp 870ndash877 Los Alamitos CA USA September2013

[28] Z A Siddiqui and J-A Lee ldquoOnline error detection in SRAMbased FPGAs using Scalable Error Detection Codingrdquo inProceedings of the 5th Asia Symposium on Quality ElectronicDesign ASQED rsquo13 pp 321ndash324 PenangMalaysia August 2013

[29] D A Anderson and GMetze ldquoDesign of Totally Self-CheckingCheck Circuits for m-Out-of-n Codesrdquo IEEE Transactions onComputers vol C-22 no 3 pp 263ndash269 1973

[30] M. A. Smith, Transistor counts, http://en.wikipedia.org/wiki/Transistor_count, April 05, 2018.

[31] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10), pp. 1–10, Piscataway, NJ, USA, May 2010.
