
Page 1: Designing and Increasing Failure Masking Scalability in Stencil … · 2017-07-24 · SCALABLE FAILURE MASKING FOR STENCIL COMPUTATIONS USING GHOST REGION EXPANSION AND CELL TO RANK

SCALABLE FAILURE MASKING FOR STENCIL COMPUTATIONS USING GHOST REGION EXPANSION AND CELL TO RANK RE-MAPPING∗

MARC GAMELL†, KEITA TERANISHI‡, HEMANTH KOLLA‡, JACKSON MAYO‡, MICHAEL A. HEROUX§, JACQUELINE CHEN‡, AND MANISH PARASHAR†

Abstract. In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG architectures, currently embed resilience features, whereas traditional SPMD and message-passing models do not. Since a large part of the community's code base follows the latter models, it remains necessary to exploit application characteristics to minimize the overheads of fault tolerance.

To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semi-transparent local recovery for stencil computations on current, leadership-class systems, and presents the corresponding programming support and scalable runtime mechanisms. Also described and demonstrated is the effect of failure masking, which effectively reduces the impact of multiple failures on the total time to solution. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank re-mapping to increase the probability of failure masking. Finally, this paper shows the integration of all these mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance at large scales. This demonstration also displays the increase in failure masking probability resulting from combining ghost region expansion and cell-to-rank re-mapping.

Key words. Parallel Computing, Fault Tolerance, Stencil Computation

AMS subject classifications. 68M15 (Reliability, testing and fault tolerance), 68M14 (Distributed systems), 68Q85 (Models and methods for concurrent and distributed computing)

1. Introduction. Currently, the HPC community lacks the knowledge and support for deploying most applications on unreliable environments. This forces vendors to offer reliable systems at the expense of higher engineering and running costs; costs that HPC centers need to cover. In the future, vendors might offer unreliable yet more accessible systems that will not only be more affordable to acquire and maintain, but will also allocate most of the power budget to the application by limiting costly hardware- and runtime-based fault tolerance power consumption to a negligible fraction.

∗This work is a revision and an extension of a conference publication [17]. The authors would like to thank Josep Gamell and Robert Clay for interesting discussions related to this work. The research presented in this work is supported in part by the National Science Foundation (NSF) via grant numbers ACI 1339036, ACI 1310283, CNS 1305375, and DMS 1228203; by the Office of Advanced Scientific Computing Research, Office of Science, of the US Department of Energy through the SciDAC Institute for Scalable Data Management, Analysis and Visualization (SDAV) under award number DE-SC0007455; the RSVP award via subcontract number 4000126989 from UT Battelle; the ASCR and FES Partnership for Edge Physics Simulations (EPSI) under award number DE-FG02-06ER54857; and the ExaCT Combustion Co-Design Center via subcontract number 4000110839 from UT Battelle. The research at Rutgers was conducted as part of the Rutgers Discovery Informatics Institute (RDI2). The work at Sandia was sponsored by the DOE ASCR Exascale Combustion Co-Design Center, ExaCT; Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000. Computer allocations were awarded by the Department of Energy's INCITE award at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL).

†Rutgers Discovery Informatics Institute, Rutgers University, Piscataway, NJ 08854, USA ([email protected], [email protected]).
‡Sandia National Laboratories, Livermore, CA 94551, USA ([email protected], [email protected], [email protected], [email protected]).
§Sandia National Laboratories, Albuquerque, NM 56310, USA ([email protected]).

The mean time between failures (MTBF) for modern petascale (i.e., on the order of 10^15 floating point operations per second) HPC systems can be measured in terms of hours. Furthermore, it has been observed that failures tend to occur in bursts separated by long periods of high stability [29]. An example of this behavior can be seen in our previous work on the Titan Cray XK7, which has an average MTBF of ∼7.75 hours [29]. Nine failures were observed during an execution lasting 24 hours, resulting in an average MTBF of ∼2.67 hours [16]. Predictions for the failure frequency of future systems, such as exascale-level HPC resources, vary significantly. For example, some studies predict MTBFs on the order of minutes [10], while others consider future MTBFs to be only slightly lower than those of current petascale systems [2]. The research presented in this paper assumes there will be some level of increase in both the frequency of failures and the frequency of the aforementioned failure bursts.
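As a quick sanity check of the figures above, an observed MTBF is simply the run length divided by the number of failures seen during the run (the helper below is ours, for illustration only):

```python
def observed_mtbf(run_hours: float, failures: int) -> float:
    """Mean time between failures observed over a single run."""
    return run_hours / failures

# Nine failures over a 24-hour execution, as reported in [16]:
print(round(observed_mtbf(24, 9), 2))  # 2.67 hours
```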

Given this expected decrease in MTBFs, the abstraction of a failure-free machine will quickly become infeasible [21]. As a result, the community is revisiting classical fault tolerance approaches, such as checkpoint/restart (C/R), from a scalability perspective and reviewing the overheads due to, for example, checkpoint storage and global restart of a large parallel program after a failure. Online and application-aware resilience approaches provide effective means for reducing such fault tolerance overheads at extreme scales.

Online recovery is an alternative approach for achieving scalable resilience. In this approach, a failure at one or more processes or nodes does not shut down the entire parallel program execution. This approach addresses some of the shortcomings of C/R, eliminating the overhead associated with process restart as well as allowing application memory space to survive the failure and be used either to reconstruct application state more efficiently or to optimize checkpoint storage. Online global recovery has been shown to effectively handle high failure rates for message passing applications [16, 28]. In these studies, all ranks that are not affected by a failure and need to recover globally are required to collaboratively repair the MPI environment in conjunction with newly spawned ranks or previously allocated spare ranks.

Problems with global rollback have been studied by the research community, with emphasis on how to enable runtime support for message logging to allow for uncoordinated recovery of parallel codes [12, 4, 19]. The applicability of generic global rollback depends on the communication pattern as well as the performance characteristics of the application and the I/O subsystems [14]. On the other hand, parallel stencil computations, which represent a large class of parallel computing applications such as finite-difference methods, exhibit unique computation and communication patterns: multiple iterations, each composed of two steps, computation on local data and communication with immediate neighbors. This observation implies that, if a multi-process failure is recovered in a local manner, only the immediate neighbors will be immediately affected by the recovery delay of that particular failure, while the rest of the domain will be allowed to continue the simulation in a failure-agnostic way.

Building on this observation, we have experimentally demonstrated that the delay of recovering from a failure in a local manner propagates slowly across the machine [17]. Furthermore, if a subsequent failure occurs at a distant node before the original failure delay has spread to that node, then the delay due to the second failure is masked by the delay due to the first one. In general, we showed that the overhead due to several separate failures on the total execution time can appear to be the same as the overhead due to a single failure.

This paper builds on our preliminary work [17] and makes contributions along multiple dimensions. First, this paper analyzes multiple failure masking and shows that the probability of failure masking benefiting an execution increases with higher core counts.

Certain application execution patterns allow for a further increase of the failure masking probability. This paper presents a novel, in-depth exploration of the effects of two different techniques, targeted at stencil codes, that effectively increase the probability of failure masking. Both techniques focus on extending the time it takes for a generic delay (for example, one caused by a failure recovery) in a single stencil cell, or group of stencil cells owned by one or multiple ranks, to reach other cells across the domain. The first technique explores how a reduction in the communication frequency between the ranks of an application can dramatically increase the probability of failure masking. The second technique explores how this probability can be further increased by mapping the domain cells to ranks in an architecture-aware manner. In particular, this paper focuses on formulating these techniques, analyzing their impact on failure recovery delay propagation, implementing and integrating them with applications, and evaluating both techniques using real stencil codes at scale while injecting real single-node and multi-node failures.

Specifically, we have developed the FenixLR framework and deployed it on Titan (Cray XK7) at Oak Ridge National Laboratory (ORNL). FenixLR shares the same interface as Fenix, which is the primary framework used in our initial investigation involving global online application recovery [16]. While both Fenix and FenixLR share the same interface, the new implementation allows for the support of local recovery with very simple application-aware message logs. We use FenixLR to augment a stencil-based application to enable local recovery, and use this code to experimentally evaluate the techniques presented in this paper. We establish the scalability of local recovery and, through the use of thick ghost exchange and rank re-ordering, show the benefits of increased propagation delay on failure masking. In our experiments we inject high-frequency failures by killing all processes of different nodes, demonstrating sustained performance of a stencil code at scales of up to 250,000+ cores. Our results also show that, by using a thicker ghost region exchange, the number of iterations transpired before all ranks are affected by a failure increases dramatically, while some cell-to-rank mappings result in a decrease in the number of affected ranks given the same number of iterations.

Key contributions of this paper include: (1) use of thick ghost regions to increase the probability of failure masking and the effectiveness of online local recovery for stencil computations; (2) use of architecture-aware cell-to-rank mapping to further reduce overheads and increase the effectiveness of online local recovery for stencil computations; (3) design and implementation of the FenixLR runtime, with (1) and (2), and its deployment on the Titan Cray XK7 production system at ORNL; and (4) experimental evaluation of local recovery algorithms in FenixLR in conjunction with application-level ghost resizing and rank re-mapping optimizations. Using the S3D combustion application on Titan, the evaluation demonstrates sustained performance and scalability in spite of high-frequency real node failures, as well as an increase in the probability of failure masking.

This document is structured as follows: background and related work are presented in Section 2, the theory and constraints underlying the local recovery procedure are described in Section 3, the rationale and theory underlying increasing ghost regions are described in Section 4, the exploration of cell-to-rank re-mapping is presented in Section 5, the implementation is discussed in Section 6, experimental results are presented and evaluated in Section 7, and conclusions are outlined in Section 8.

2. Background and Related Work. This section reviews the background work related to the design of a resilient stencil computation.

Global Recovery through Checkpoint Restart. Several studies have focused on understanding the characteristics of process and node failures [27, 30]. A wide range of studies use checkpoint and restart (C/R) [24, 23, 22] for implementing hard failure tolerance on HPC systems. As per this technique, the application state is periodically checkpointed (e.g., using BLCR [20]) so that, upon failure, it can be restarted from the last globally committed checkpoint. This recovery model has been practiced in HPC for several decades due to its simplicity and adaptability to major parallel programming models (such as MPI [15]) which, by default, are designed to abort all active application processes upon a single process failure. However, this model has two major disadvantages: (1) the overhead of recovery is independent of the number of processes affected by the failure, i.e., if a node or process failure occurs, all processes are typically forced to roll back to the previous strongly consistent checkpoint, and (2) the efficiency of application state preservation and recovery is dependent on the global and centralized I/O subsystem, which becomes a bottleneck as the scale of the application increases.
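Disadvantage (1) can be made concrete with a small sketch (ours, not BLCR's or MPI's API): under coordinated C/R, the amount of work recomputed after a global rollback depends only on the checkpoint interval, not on how many processes actually failed.

```python
def wasted_iterations(failure_iter: int, checkpoint_interval: int) -> int:
    """Iterations every process must recompute after a global rollback,
    regardless of whether 1 or 10,000 processes failed."""
    last_checkpoint = (failure_iter // checkpoint_interval) * checkpoint_interval
    return failure_iter - last_checkpoint

# Failure at iteration 47, checkpoints taken every 10 iterations:
print(wasted_iterations(47, 10))  # 7 iterations redone by all processes
```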

Local Recovery. In contrast, by recovering locally, only failed processes need to roll back to a previously saved state. The HPC resilience community has studied how to automatically enable this recovery method by using uncoordinated checkpointing, aiming at rolling back only the processes affected by a failure and letting the remaining application processes continue through failures [5, 26]. A rollback of a single process, however, may trigger the rollback of other processes due to inconsistently saved states, resulting in a global rollback to the latest coordinated checkpoint or, in the worst case, back to the beginning of the application (the domino effect). This effect can be mitigated using message logging at the expense of memory footprint, which involves major modifications of message passing software or application code [19, 13]. Interestingly, a recent study focused on its general applicability [14] claims that the effectiveness of the uncoordinated approach depends on the application communication pattern and the performance of underlying I/O subsystems. This result indicates the necessity of frameworks supporting application-aware local recovery techniques.

The Fenix [16] and LFLR [28] software frameworks show how efficient execution in failure-prone scenarios can be enabled by using global online recovery in conjunction with advanced in-memory diskless checkpointing. To replace failed processes in an online manner, these frameworks leverage a spare process pool which is readily integrated with the in-memory checkpointing mechanism. Despite their usefulness, demonstrated with large scale stencil computation and finite element mini-applications, global rollback and synchronization are still necessary even in response to a single process failure. This paper explores how to design scalable recovery for stencil computations based on our prototype presented in [18, 17], extended from the stencil computation integrated with Fenix [16].

Classifying Failure Recovery Techniques. We classify recovery mechanisms into six broad categories, grouped in disjoint pairs. Recovery procedures can be offline or online, depending on whether or not they require a complete shutdown of the survivor processes. Traditional C/R can be considered offline, while approaches like Fenix [16] or LFLR [28], as well as the techniques presented in this paper, are considered to implement online recovery constructs. In roll-back protocols, when a failure is detected, all processes (including the spare processes) revert to a valid state in the past, at which point the computation is restarted from that state. Traditional C/R uses a roll-back recovery mechanism. While Fenix and LFLR use roll-back recovery as well, it can be considered improved since these systems do not need to synchronize the survivor processes before restarting. In roll-forward protocols, on the other hand, survivor processes are allowed to reconstruct a valid state without the need to globally revert to a previous consistent state. When a failure is detected, every process constructs a valid state, or a state that is potentially valid after some extra computations. Algorithm-based fault tolerance techniques [31, 32, 3, 9, 11], as well as the techniques presented in this paper, fall into this category. Recovery procedures can also be considered global or local. Local recovery implies that not all processes are required to roll back or roll forward; only the processes substituting the failed ones are directly affected by the failure. This contrasts with global techniques, in which all processes must be part of the recovery process.

Rank Reordering and Ghost Region Expansion in Stencil Computations. With the continuous increase of node and core counts, application performance has become more sensitive to the mapping of user processes to physical processors within a large network topology. For example, Barrett et al. [1] found that misalignment of MPI ranks to the 3D torus topology of the Cray XE6 can decrease the performance of 3D stencil applications. In their study, many ghost cell exchanges were not translated into message exchanges between physically adjacent nodes, as the MPI ranks were placed to conform to a long rectangular shape. This was resolved through manual rank reordering to reduce the average network hop count of ghost cell exchanges. On the other hand, previous work [25] studies how to use larger ghost region expansion in order to reduce communication latency by merging different messages.

Another rank reordering idea has been applied in the context of multicore nodes, where current MPI implementations exploit shared memory copies to improve message passing performance among ranks placed in the same physical node. Brandfass et al. [6] demonstrated how rank reordering is able to confine interprocess communications within a single node in their unstructured CFD applications. In this paper we employ similar ideas to reduce the delay propagation of local failure recovery, specifically where failures cause a loss of all processes in the same node. Our use of rank reordering is intended to slow down the speed of delay propagation, and we demonstrate its effects both analytically and empirically in Section 5 and Section 7.
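The effect of such mappings can be illustrated with a toy count of off-node neighbor exchanges on a 1-D ring (a sketch under our own assumed node size, not the scheme of [6]): a blocked mapping keeps most neighbor pairs inside a node, while a round-robin mapping forces every exchange across nodes.

```python
def off_node_exchanges(rank_to_node, n_ranks):
    """Count neighbor exchanges on a 1-D ring that cross node boundaries."""
    return sum(1 for r in range(n_ranks)
               if rank_to_node[r] != rank_to_node[(r + 1) % n_ranks])

cores_per_node = 4
n = 16
# Consecutive ranks share a node vs. neighboring ranks on different nodes.
blocked = {r: r // cores_per_node for r in range(n)}
round_robin = {r: r % (n // cores_per_node) for r in range(n)}

print(off_node_exchanges(blocked, n), off_node_exchanges(round_robin, n))  # 4 16
```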

3. Local Recovery for Parallel Stencil Computations. This section introduces the technique of recovering from failures in a local manner, as well as how this technique can be efficiently applied to stencil computations. Local recovery implies that only the processes that directly communicate with the failed set of processes will detect the failure and will, therefore, need to be part of the online recovery procedure, while the rest can continue working, ignoring the failure. Note how both these conditions are diametrically opposite to global recovery, which requires all processes to be involved in the recovery and to roll back to a globally consistent saved state. This allows local recovery to overcome the scalability challenges imposed by global recovery, which are unnecessary in certain situations.

This section summarizes our preliminary work, which is presented in detail in [17]. Specifically, Section 3 of [17] outlines the approach, while Section 4 of [17] demonstrates the feasibility of the approach using a mathematical model. In particular, this section analyzes how the specific communication patterns of stencil computations are a good fit for maximizing the efficiency of local online recovery. It also summarizes the challenges of optimizing local recovery for this class of applications as well as the added benefits this combination offers.

Fig. 1. Parallel implementation of a stencil computation. (a) Partitioning of a square 2D domain across four MPI ranks. (b) Detail of a ghost region exchange of a 2D domain. Figure 1b can be interpreted as two different scenarios: (1) a 5-point 2D stencil in which communication is grouped to reduce its frequency (one communication per three iterations), or (2) a 13-point 2D stencil that communicates every iteration.

3.1. Scientific Parallel Stencil Computations. A stencil computation discretizes a multi-dimensional computational domain into a mesh of points. In HPC environments, this domain is partitioned into disjoint subsets and distributed across the different compute and memory resources. An example partition of a two-dimensional mesh into four patches mapped to four MPI ranks is illustrated in Figure 1a.

In general, each compute resource iteratively applies the stencil operator over all the mesh points in parallel. In particular, each compute resource iteratively performs two sequential tasks. First, processes compute on the local data to advance the simulation. Second, processes communicate with each other to exchange part of the updated data. Typically, the updated data that needs to be exchanged corresponds to points in the mesh that neighbor data points assigned to a different compute resource, which is called the “ghost region”. An example can be seen in Figure 1b.
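The two-phase iteration described above can be sketched serially as follows (an illustrative 1-D, 3-point averaging stencil with one-point ghost regions, not the paper's production kernel):

```python
def step(patches):
    """One iteration: local 3-point update, then ghost exchange with neighbors."""
    # Phase 1: compute. Each patch carries one ghost point on each side;
    # the domain-edge ghosts act as fixed boundary values.
    new = []
    for p in patches:
        interior = [(p[j - 1] + p[j] + p[j + 1]) / 3 for j in range(1, len(p) - 1)]
        new.append([p[0]] + interior + [p[-1]])
    # Phase 2: communicate. Refresh ghost points from the neighbors' boundary values.
    for i in range(len(new)):
        if i > 0:
            new[i][0] = new[i - 1][-2]    # left ghost <- left neighbor's last interior point
        if i < len(new) - 1:
            new[i][-1] = new[i + 1][1]    # right ghost <- right neighbor's first interior point
    return new

# Two patches of a 1-D domain, each with two interior points plus two ghost points.
patches = step([[0.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 0.0]])
```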

Guaranteeing Consistency and Determinism upon Failure. Since processes logically neighboring the failed processes must communicate with the failed processes to make progress with the simulation, guaranteeing consistency in message exchanges is not a trivial task in spite of the aforementioned natural fit between local recovery and stencil computations. The presented implementation makes use of the ghost point exchange to develop an application-specific message log. The outgoing messages sent since the last saved state are stored in sender memory. Since these messages only include the points in the ghost region, the overhead of this technique is negligible compared to the option of remotely storing the application state. For example, in a 5-point two-dimensional stencil, only four messages corresponding to the ghost regions for the neighbor processes are logged at each timestep.
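A minimal sketch of such a sender-side log (class and method names are ours, not FenixLR's interface): each rank keeps only the ghost regions it has sent since the last saved state, so a recovered neighbor can replay exactly the messages it needs to re-execute the lost iterations.

```python
class GhostLog:
    """Sender-side log of outgoing ghost regions since the last saved state."""

    def __init__(self):
        self.log = {}  # (dest_rank, timestep) -> ghost region sent

    def record_send(self, dest, timestep, ghost):
        self.log[(dest, timestep)] = list(ghost)

    def replay_for(self, dest, since_timestep):
        """Messages a recovered neighbor needs to re-execute lost iterations."""
        return [g for (d, t), g in sorted(self.log.items())
                if d == dest and t >= since_timestep]

    def truncate(self):
        """Discard the log once a new valid state is saved."""
        self.log.clear()

log = GhostLog()
log.record_send(dest=3, timestep=10, ghost=[1.0, 2.0])
log.record_send(dest=3, timestep=11, ghost=[1.5, 2.5])
print(len(log.replay_for(3, since_timestep=10)))  # 2
```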


3.2. Delay Propagation and Failure Masking. The communication patterns described above indicate that process or node failures in stencil computations can be interpreted as a loss of part of the domain, and affect the computation of immediate neighbor subdomains. A natural approach to adapt our local recovery technique to stencil codes is to force recovered processes (recovered by re-spawning them or stealing them from a spare process pool) to re-execute the computations lost since the last time state was saved, and to re-establish the communication channels with the logical stencil neighbors to continue future iterations.

As a result of these recovery and rollback procedures, the processes that logically neighbor recovered processes must wait if they need to fetch the updated domain to continue their computation. Only when the recovered processes finish re-calculating the local computation of the iteration(s) that failed can they start the communication process with the waiting neighbors. This process is recursive, since the immediate neighbors of the aforementioned delayed processes will be required to stop after the next iteration. This makes the delay due to recovery and rollback of a failure propagate slowly through the domain. Each new iteration propagates the recovery and rollback delay to a new set of ranks.

A second failure may occur in a node that has not yet been delayed by the original failure. In this case, the total overhead on the total execution time will be similar to the case in which only one failure occurs, since the delays of the respective failures are masked. The effect of failure masking contrasts with global recovery, in which the recovery delay is immediately propagated to the entire domain, preventing failure containment and preventing ranks far from the failure from advancing the computation.
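A toy simulation makes this behavior tangible (ours; it assumes the delay front advances exactly one rank per iteration on a 1-D ring, as described above, and tracks only when each rank first stalls, not the full time to solution): a second, simultaneous failure on the far side of the ring overlaps its propagation front with the first one instead of adding to it.

```python
def iterations_until_all_delayed(n_ranks, failures):
    """failures: {rank: iteration at which it fails}. The delay front spreads
    one rank per iteration in both directions on a ring. Returns the first
    iteration by which every rank has been delayed by some failure."""
    first_delay = []
    for r in range(n_ranks):
        first_delay.append(min(
            it + min((r - f) % n_ranks, (f - r) % n_ranks)
            for f, it in failures.items()))
    return max(first_delay)

# Single failure at rank 0: the front needs 32 iterations to cover 64 ranks.
print(iterations_until_all_delayed(64, {0: 0}))        # 32
# A second failure on the opposite side, before the first front arrives there,
# does not double the disruption -- the two delay fronts mask each other.
print(iterations_until_all_delayed(64, {0: 0, 32: 0})) # 16
```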

The larger the scale of a machine, the closer failures will be in time, and the farther they will be in space. As a result, the probability that the effect of one node failure has not propagated to a node experiencing a subsequent failure increases. For this reason, the probability of failure masking increases with machine size, making it a desirable technique to be exploited by future large scale machines.

4. Increasing the Ghost Region Size. While failure masking provides scalability by itself, several application characteristics can be taken into consideration to increase its impact and, therefore, obtain higher reductions in the end-to-end execution time. Section 4 and Section 5 analyze particular characteristics of iterative stencil computations that can be exploited to that end. In both cases, the goal is to increase the time it takes for a failure in a single stencil cell to propagate and affect all other cells across the stencil system. Ideally, the impact on the simulation time of any optimization that increases failure masking probability should be minimal, but a trade-off exists between the amount of propagation extension and its impact on the execution time. This trade-off can be exploited depending on the application characteristics, the size of the system, and the MTBF. While presenting some related experimental conclusions in Section 7, we leave for future work the break-even analysis of this trade-off depending on system size.

This section explores the effects of reducing the communication frequency by increasing the ghost region size.

4.1. A Guiding Example: 1-D, 3-point Stencil. Update and Failure Propagation Windows. Assuming, for example, a 1-D, 3-point stencil, if a change occurs in point p in iteration i, it will not be reflected in point p + k until k iterations later (i.e., iteration i + k). In this case, the update propagation window is k iterations.

However, since multiple points are associated with each rank (assume P points per rank), and assuming ranks communicate every iteration, a failure in rank r


Fig. 2. Detail of the communication and decomposition of a 1-D stencil computation between 2 ranks. a) A 3-point calculation with the standard communication pattern, i.e., one transfer at the beginning of each iteration. b) Detail of two iterations (both communication -comm.- and computation -comp.- phases) of 3-point, 5-point, and 7-point 1-D stencils. For space economy, only rank r_i is shown in the 5-point and 7-point figures. Crossed ghost points do not contain a valid value in the specific iteration/phase.

including point p (the failure happens before completing iteration i) will propagate to rank r2, the rank including point p + k, before iteration i + k. In particular, rank r + 1 will stall in iteration i + 1, rank r + 2 will stall in iteration i + 2, and, in general, rank r + n will stall in iteration i + n. In the example, r2 is r + t, where t is either ⌊k/P⌋ or ⌈k/P⌉. Therefore, in this example, the failure propagation window between points p and p + k is, in the best case, ⌈k/P⌉ iterations. This compares to the update propagation window of k iterations.

Expanding the Failure Propagation Window. If, building on the previous example, we assume the ranks in the application communicate every two iterations instead of every iteration, the failure propagation window expands. Rank r + 1 will be able to advance one extra iteration and, hence, will stall in iteration i + 2. In general, rank r + n will stall in iteration i + 2n. Therefore, the failure propagation window expands from ⌈k/P⌉ to 2⌈k/P⌉ iterations.

Note that the failure propagation window can never be larger than the update propagation window, as that would imply violating the semantics of the algorithm.

Delaying Communication between Ranks. To avoid communicating every iteration, the frequency of communication can be reduced to once every several iterations. To that end, the computation of each partition's bordering cells needs to be replicated in every rank that needs them.

In the simplest case of a 1-D 3-point stencil, as depicted in Figure 2, communication can be performed every two iterations instead of every iteration. Instead of exchanging a single point per border, each rank initially fetches two points per border from the corresponding neighbor and keeps them in the ghost points G−2 and G−1. Each rank can then finish calculating the first iteration, which is done by updating cell C0 with the previous values of cells G−1, C0, and C1, updating cell C1 with the previous values of cells C0, C1, and C2, and so on. Additionally, the ghost point G−1 needs to be updated with the previous values of cells G−2, G−1, and C0. After all the values are updated in the first iteration, the rank already has the updated ghost point and, therefore, is ready to perform the update for the second iteration without the need to communicate with its neighbors. Note that, since in the first iteration the ghost point position G−2 was not updated,


Fig. 3. Detail of the communication and decomposition of a 2-D stencil computation between a rank and its (not-shown) left, top, and left/top diagonal neighboring ranks. The top row represents a 5-point, 9-point, and 13-point calculation with the standard communication pattern, i.e., one transfer at the beginning of each iteration. The bottom row represents how the data is transferred during the communication phase: every two compute iterations, only one communication phase is performed.

the rank cannot re-use it to calculate the new value of G−1. Hence, in order to perform the third iteration, communication needs to occur, in a similar manner as described for the first iteration.

To summarize, in this simple 1-D 3-point stencil, by replicating the computation of one ghost point and doubling the size of the message payload, the application can double the time between two subsequent data transfers.
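The bookkeeping above can be checked with a small sketch (Python, with a simple 3-point averaging update chosen purely for illustration, and function names that are ours, not FenixLR's): two "ranks" hold halves of a periodic 1-D domain, fetch a ghost region of width two per border, and exchange data only every two iterations. The result matches a reference that updates the whole domain directly.

```python
import numpy as np

def reference(u, iters):
    # Global 3-point update on a periodic domain (illustrative update rule).
    for _ in range(iters):
        u = (np.roll(u, 1) + u + np.roll(u, -1)) / 3.0
    return u

def two_rank_wide_ghost(u, iters):
    # Split the periodic domain between two "ranks", each keeping a ghost
    # region of width 2 so that communication happens only every 2 iterations.
    n = len(u) // 2
    parts = [u[:n].copy(), u[n:].copy()]
    step = 0
    while step < iters:
        # Communication phase: fetch 2 ghost points per border from each neighbor.
        ext = []
        for r in range(2):
            left, right = parts[(r - 1) % 2], parts[(r + 1) % 2]
            ext.append(np.concatenate([left[-2:], parts[r], right[:2]]))
        # Compute phase: up to two iterations without further communication.
        for _ in range(2):
            if step >= iters:
                break
            for r in range(2):
                e = ext[r]
                # After each local iteration, the outermost ghost points become
                # stale; the interior (and one ghost layer) remain valid.
                ext[r] = (np.roll(e, 1) + e + np.roll(e, -1)) / 3.0
            step += 1
        # Keep only the owned region, which is still valid after 2 iterations.
        for r in range(2):
            parts[r] = ext[r][2:-2]
    return np.concatenate(parts)
```

The same trade-off from the text is visible here: the message payload doubles (two ghost points per border instead of one) and one ghost-point update is recomputed redundantly, in exchange for half as many exchanges.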

4.2. Beyond 3-point Calculations on a 1-D Stencil. More complex calculations on a particular stencil cell may require more points than the immediately adjacent neighbors. In the case of a 5-point, 1-D stencil, calculating a particular cell Ci requires five cells: Ci−2, Ci−1, Ci, Ci+1, and Ci+2. In the case of a 7-point, 1-D stencil, seven adjacent cells are required, from Ci−3 to Ci+3. Therefore, delaying the communication between cores incurs a higher replication cost than in the 3-point stencil example.

In these two more complex cases, however, the same conclusion applies: by doubling the size of the message payload and replicating the computation of the extra ghost points (two extra points per message in the 5-point case; three extra points in the 7-point case), the application can double the time between two message exchanges.

4.3. 2-D and 3-D Stencils. Extrapolating the same ideas from a single dimension to a multi-dimensional stencil requires understanding how the domain is partitioned and which ghost points are required for each scenario.

Two Dimensions. Figure 3 demonstrates that doubling the communication period requires communicating with one extra neighbor per corner in the 2-D case. Instead of communicating with four neighbors every iteration, the new algorithm must communicate with exactly eight neighbors every two iterations. In a 5-point calculation, assuming the domain size is n × n, (1) communicating every iteration requires 4n ghost cells while (2) communicating every two iterations requires 8n + 4. In a 9-point calculation, 8n and 16n + 16 ghost points are required for situations (1)


and (2), respectively. Finally, for a 13-point calculation, 12n and 24n + 36 ghost cells are required. In general, for a (4p + 1)-point calculation,

C1 = 4pn
C2 = 4p² + 8pn
...
Ci = 2i(i − 1)p² + 4ipn

where Ci is the number of ghost cells required for communicating every i iterations, and p is the thickness of the calculation (e.g., p = 1 for a 5-point calculation, or p = 2 for a 9-point calculation).
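As a sanity check, the closed form above reproduces the counts quoted for the 5-point, 9-point, and 13-point cases (a short Python sketch; the function name is ours):

```python
def ghost_cells_2d(i, p, n):
    # C_i = 2*i*(i-1)*p^2 + 4*i*p*n ghost cells when communicating every
    # i iterations with a (4p+1)-point operator on an n x n partition.
    return 2 * i * (i - 1) * p**2 + 4 * i * p * n
```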

Three Dimensions. In the 3-D case, the domain is typically decomposed into cubes, and the neighboring ranks adjacent to each face of the cube are required to update the faces. As a result, each rank communicates with six neighbors every iteration (i = 1). If the communication frequency is reduced to every two iterations (i = 2), each rank will require communication with 6 + 12 = 18 neighboring ranks, as the regions on the diagonals associated with the twelve edges will be required to update the six faces. If the communication frequency is further reduced (e.g., i = 3), the ranks with cells on the diagonals near the eight vertices will also need to be contacted by each rank, because those cells will be required in order to update the cells near the cube edges. In this case, each rank will communicate with 6 + 12 + 8 = 26 neighboring ranks. For any i ≥ 3, only these 26 neighboring ranks will be needed for ghost exchange provided that i ≤ n/p, assuming a three-dimensional domain where each dimension is of size n. For the case where i > n/p, ghost regions would extend naturally into additional ranks; however, this case is not relevant in practice, since p is typically small and n is typically large in real production scenarios. Any practical reduction of the communication frequency should, therefore, require interaction with at most 26 ranks to populate the ghost region cells.
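The 6/18/26 progression corresponds to the face, face+edge, and face+edge+corner neighbor offsets of a 3-D block, which a few lines of Python make explicit (a sketch under the same assumption as the text, i.e., i·p ≤ n so the ghost region never spans more than one rank per direction; the helper name is ours):

```python
from itertools import product

def neighbor_ranks_3d(i):
    # Neighbors contacted when exchanging ghost regions every i iterations:
    # offsets with 1 nonzero component are faces (6), with 2 are edges (12),
    # with 3 are corners (8); i caps how "deep" diagonally we must reach.
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    depth = min(i, 3)
    return sum(1 for d in offsets if sum(c != 0 for c in d) <= depth)
```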

In a 3-D domain, for a (6p + 1)-point calculation, the number of ghost cells required to communicate every i iterations is as follows:

C1 = 6(1)pn²
C2 = 6(2)pn² + 1(12np²)
C3 = 6(3)pn² + 3(12np²) + 1(8p³)
C4 = 6(4)pn² + 6(12np²) + 4(8p³)
...
Ci = 6ipn² + [(i − 1)i/2](12np²) + [(i − 2)(i − 1)i/6](8p³)
   = 6ipn² + 6(i − 1)ip²n + (4/3)(i − 2)(i − 1)ip³

where the factor that multiplies (12np²) is the (i − 1)-th triangular number and the factor that multiplies (8p³) is the (i − 2)-th tetrahedral number (for i = 1 and i ≤ 2, respectively, these factors are 0).
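The 3-D closed form can be checked directly against its face, edge, and corner terms (Python sketch; the integer divisions are exact because products of consecutive integers are divisible by 2 and 6):

```python
def ghost_cells_3d(i, p, n):
    # C_i = 6*i*p*n^2 (faces) + tri*(12*n*p^2) (edges) + tet*(8*p^3) (corners),
    # where tri is the (i-1)-th triangular and tet the (i-2)-th tetrahedral number.
    tri = (i - 1) * i // 2
    tet = (i - 2) * (i - 1) * i // 6
    return 6 * i * p * n**2 + tri * 12 * n * p**2 + tet * 8 * p**3
```

For a large domain the ghost region stays a small fraction of the n³ domain cells, consistent with the trend shown in Figure 4.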

n³ + Ci can be used to theoretically determine the algorithmic worst-case cost of


(a) Effect of the number of iterations between communication (i) when operating on a cubical domain with side n = 1000, for 7-point (p = 1), 13-point (p = 2), and 19-point (p = 3) operators. (b) Effect of the size of each dimension (n) of the cubical domain when using a 7-point stencil operator, communicating every one, two, three, or four iterations. In both panels, the y-axis shows the ghost point count as a percentage of the domain point count.

Fig. 4. Number of ghost cells compared to number of domain cells for different scenarios. Results were computed using the formula for Ci with different values of i and p (Figure 4a) and different values of i and n (Figure 4b).

the computational portion of each iteration. It is important to note that the minimal cost of the computational portion is n³ + C1 in any case.

Figure 4 plots Ci using different values of i, n, and p. Specifically, it shows 100Ci/n³, the percentage of the domain size (n³ cells) that the ghost region represents in each case. The main conclusion, which can be extracted from Figure 4b, is that the overhead on the computation cost of extending the ghost region is negligible for larger values of n. In the cases where n is small, however, the overhead of ghost region extension will be low or negligible only when the cost of communication is larger than the cost of computation. In that case, extending the ghost region implies an increase in computation that is traded off by a larger decrease in the total communication cost.

5. Node-aware Mapping of Cells to Ranks. A typical approach when running an MPI application on a multi-core system is to allocate multiple MPI ranks in each socket, usually one rank per core or one rank per thread. As a result, a failure in any of the components shared by all cores in the socket (e.g., on-chip L3 cache, memory subsystem, network interface, power supply, operating system) will affect multiple ranks at once.

Hence, when trying to reduce the costs of fault tolerance, application developers and tuners need to take the underlying architecture into consideration. Specifically, this section focuses on the effect of the mapping of a decomposed stencil domain to the available MPI ranks.

Assumptions. In this section, we assume that a particular rank is mapped to a specific socket/node statically by the MPI implementation (in this case, FenixLR) when the MPI application is first started. Therefore, any change to the mapping of cells to ranks must be performed by the application or by a fault-aware library (in this case, the top layer within FenixLR).

We also assume a homogeneous system in which all the sockets/nodes have the same compute characteristics. Therefore, the computation time required for a particular stencil should be the same regardless of the mapping of cells to ranks. This mapping, however, may affect the network transfer costs depending on the network topology, since neighboring cells may be located in different parts of the machine depending on the mapping. The possible effects due to network characteristics will be accounted for when evaluating the approach.

Fig. 5. Decomposition and mapping of a two-dimensional domain onto a machine with sixteen ranks per node. (top) Domain of 144 × 36 cells decomposed in chunks, or domain sections, of 3 × 3 cells each, for a total of 48 × 12 chunks. (center) Linear mapping of domain sections to ranks; a failure in a 16-rank node (node20) affects the execution of 34 neighboring ranks in the iteration immediately following the failure. (bottom) Quadratic mapping of domain sections to ranks; a failure in node20 now affects only 16 neighboring ranks in the iteration immediately after the failure, providing a much slower failure propagation.

One-dimensional Stencil Domain. A one-dimensional domain can be decomposed into homogeneous chunks, i.e., chunks each containing the same number of cells, and the best cell-to-rank mapping is achieved by linearly mapping cells to contiguous ranks. Assuming a node failure, the number of cells affected by the failure increases by two every iteration: a failure in iteration i affects two cells when iteration i + 1 is reached, four cells when iteration i + 2 is reached, and so on. In this sense, it is intuitive that the best mapping to minimize the recovery propagation of a node failure is the linear mapping. With a random mapping, for example, the number of cells affected by a failure increases much faster as iterations advance. Mappings less extreme than the random mapping can be less harmful to the failure propagation time, but a linear mapping still provides the best case.

Two-dimensional Stencil Domain. In the two-dimensional case, however, a linear mapping might not be the best choice when trying to minimize the rate of increase of the number of cells affected by a node failure. Assuming a 16-rank node, the best mapping to minimize the propagation effect is to assign four-by-four blocks of cells to the sixteen ranks in each node, as shown at the bottom of Figure 5. In the


Fig. 6. Section of a three-dimensional domain mapped to a machine with sixteen ranks per node. The mapping of domain sections to ranks follows the shape of a rectangular prism. A failure in a 16-rank node (node20, set of boxes in red) affects the execution of 40 neighboring ranks (set of boxes in dark gray) in the iteration immediately following the failure.

linear mapping case displayed in the center of Figure 5, a node failure would affect 34 ranks during the first iteration, while in the quadratic mapping case, the same failure would affect only 16 ranks during the first iteration. The conclusion is that a standard, default, linear mapping –which provided optimal failure propagation containment in the 1-D case– is not optimal in the 2-D case.
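The 34-versus-16 counts from Figure 5 can be reproduced with a short sketch that, assuming the 48 × 12 grid of domain sections from the figure, a 5-point stencil (4-adjacency), and a failed node placed away from the domain boundary, counts the ranks adjacent to the node's cells (all helper names are ours):

```python
def affected_neighbors(node_cells, width, height):
    # Ranks (grid cells) outside the failed node that are 4-adjacent to it,
    # i.e., stalled one iteration after the failure under a 5-point stencil.
    node = set(node_cells)
    affected = set()
    for (x, y) in node:
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in node:
                affected.add((nx, ny))
    return len(affected)

W, H = 48, 12
# Linear (row-major) mapping: 16 consecutive ranks form an interior 16x1 strip.
strip = [(16 + k, 6) for k in range(16)]
# Blocked ("quadratic") mapping: the same 16 ranks form an interior 4x4 block.
block = [(16 + dx, 4 + dy) for dx in range(4) for dy in range(4)]
```

The strip exposes 2 × 16 + 2 = 34 adjacent ranks, while the square block exposes only 4 × 4 = 16, matching the figure.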

Three-dimensional Stencil Domain. By default, S3D –our guiding stencil scientific computation– assigns each part of the decomposed 3-D domain following a linear mapping, based on MPI_CART_CREATE using Fortran order. In other words, assuming that there are NX cells in the X direction, NY cells in the Y direction, and NZ cells in the Z direction, and that we want to map a single cell to each rank, rank number R would be assigned the cell (CX, CY, CZ), where

CZ = ⌊R / (NX NY)⌋
CX = (R − CZ NX NY) mod NX
CY = ⌊(R − CZ NX NY) / NX⌋

This linear mapping is suboptimal, since a failure in a 16-rank node may affect 66 neighboring ranks in the first iteration after the failure. To reduce the rapidity with which the number of affected ranks increases, we can

decompose the domain into perfect cubes or use other kinds of space-filling curves. Since complex space-filling curves may imply irregular mappings (i.e., the cells assigned to a particular node do not form a homogeneous "shape" in the 3-D domain), the failure propagation speed may depend on which node failed. Since this is not a


desirable property, this paper studies how to assign cells to nodes so that the 3-D shapes of the cells are exactly the same in all nodes.

In particular, as depicted in Figure 6, for a machine with sixteen ranks per node, a homogeneous mapping based on a 4 × 2 × 2 rectangular prism (in any orientation) minimizes the failure delay propagation to 40 neighboring ranks in the first iteration after the failure.

We can generalize the rectangular prism mapping depicted in Figure 6 as follows. Making the same assumptions as before, and given that the rectangular prism shape must be GX × GY × GZ (for example, 4 × 2 × 2 in the figure), with a size of SG = GX GY GZ ranks, rank number R would be assigned to core number (CoX, CoY, CoZ) of the compute node with coordinates (CnX, CnY, CnZ), and would be mapped to the cell (CX, CY, CZ), where

CnZ = ⌊R / (GZ NX NY)⌋
CnX = (⌊R/SG⌋ − CnZ NX NY/(GX GY)) mod (NX/GX)
CnY = ⌊(⌊R/SG⌋ − CnZ NX NY/(GX GY)) / (NX/GX)⌋
CoZ = ⌊(R mod SG) / (GX GY)⌋
CoX = ((R mod SG) − CoZ GX GY) mod GX
CoY = ⌊((R mod SG) − CoZ GX GY) / GX⌋
CX = CnX GX + CoX
CY = CnY GY + CoY
CZ = CnZ GZ + CoZ
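Both mappings can be sketched in a few lines of Python (function names are ours; the prism mapping assumes each of NX, NY, NZ is divisible by the corresponding GX, GY, GZ and that ranks are numbered consecutively within a node). A quick check confirms that each mapping is a bijection and that a node's sixteen ranks form a 4 × 2 × 2 block under the prism mapping:

```python
def linear_map(R, NX, NY):
    # Default Fortran-order (MPI_CART_CREATE) mapping of rank R to a cell.
    CZ = R // (NX * NY)
    CX = (R - CZ * NX * NY) % NX
    CY = (R - CZ * NX * NY) // NX
    return (CX, CY, CZ)

def prism_map(R, NX, NY, GX, GY, GZ):
    # Node-aware mapping: each node's SG ranks cover a GX x GY x GZ prism.
    SG = GX * GY * GZ
    node = R // SG                            # consecutive ranks share a node
    CnZ = R // (GZ * NX * NY)                 # node coordinates...
    CnX = (node - CnZ * (NX * NY) // (GX * GY)) % (NX // GX)
    CnY = (node - CnZ * (NX * NY) // (GX * GY)) // (NX // GX)
    core = R % SG                             # ...and core coordinates in node
    CoZ = core // (GX * GY)
    CoX = (core - CoZ * GX * GY) % GX
    CoY = (core - CoZ * GX * GY) // GX
    return (CnX * GX + CoX, CnY * GY + CoY, CnZ * GZ + CoZ)
```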

6. FenixLR Implementation. This section describes the implementation of the techniques presented in Sections 3, 4, and 5. It first describes the initial attempts at implementing the methods on top of an MPI runtime and a prototype of one of its leading fault tolerance proposals, User Level Fault Mitigation (ULFM). The section then presents the rationale for implementing FenixLR from scratch directly on top of Cray's transport layer interface, uGNI, while still maintaining the same user interface as Fenix.

Local recovery adds complexity compared with global recovery while still offering deterministic results. In global recovery mode, there is no need to distinguish survivor ranks from repaired ones (either re-spawned or taken from a process pool), because all ranks must re-execute all recovery actions. When implementing local recovery, however, recovered ranks may issue receive operations for messages that were already received by the failed ranks. Hence, a mechanism is required for recovered ranks to obtain these messages in a transparent manner, regardless of the execution stage of the sender rank.

6.1. Experiences Implementing on Top of MPI. One of the key operations for communicator recovery is the communicator shrink operation offered by ULFM, which eliminates failed ranks from the failed communicator, returning a fully functioning communicator with fewer ranks. The cost of this operation in the evaluated ULFM prototype was not trivial, and we experimentally determined that the cost


of the shrink implementation increased super-linearly with the size of the communicator. Hence, we explored recovery options that avoided the recovery of a global communicator, as described below. No conceptual problem of ULFM prevented its usage; rather, the limitation was the robustness of the prototype implementation.

Usage of a Failed, Non-Repaired Communicator. ULFM specifies that a communicator can be re-used for point-to-point communications after a failure, without the need to repair it, as long as the failure is acknowledged and no communications are initiated with the failed rank. The first prototype of our local recovery framework leveraged this property of ULFM. After implementing it and testing it with a partial differential equation (PDE) solver on top of the prototype ULFM implementation, however, we found inconsistencies in the use of some MPI operations on a failed communicator that had not been repaired. In most cases, the runtime reached an inconsistent state after a failure and was unable to deliver messages to survivor ranks.

Pair-based Communicator Layout. After several attempts to fix the aforementioned unpredictable behavior, and given that ULFM is the only fault tolerance proposal that would allow local recovery, we decided to switch to a different strategy. Since each rank of the targeted application type communicates with a constant number of ranks independent of the total domain size, we designed a layout of communicators composed of only two ranks each. The communicators including a rank that failed can be disposed of and do not need to be repaired. Assuming no collective operations are required by the application, we designed the creation of a separate communicator for every pair of processes that need to communicate directly. Right before a collective operation, the communicator involving all ranks would need to be repaired.

After a process failure, the processes that were communicating with the failed process are free to dispose of the communicator. In order to allow for a non-shrinking recovery mode, the failed processes are substituted with spare resources. Therefore, for this method to work, the processes neighboring the failed one are required to already have communicators created with the spare ranks. This results in a number of communicators that does not scale well with the total number of processes in the system; therefore, we decided to divide the compute ranks into groups of a fixed size, to which a certain number of spare ranks would be associated. All ranks in a group would have a communicator to each of the spare ranks in that group. For a one-dimensional stencil case, the communication layout for this possible implementation is exemplified in Figure 7a and the recovery mechanism is outlined in Figure 7b.

As of version 2 of the MPI standard, in order to set up the communicator structure between spare ranks, we had to create each of them at the beginning of the execution in a collective manner. We recursively bisected MPI_COMM_WORLD, using MPI_COMM_SPLIT, to establish exclusive communicators between rank pairs, an operation whose cost is O(log2(N)), where N is the total number of ranks. This process needed to be repeated to establish all pair-wise communication relationships, since each rank is paired with multiple other ranks for carrying out ghost zone communication. The total cost of this operation, therefore, is O(R · log2(N)), where R is the number of ranks each rank has to communicate with (e.g., in the 3-point 1-D PDE case, R = 2).

There were several issues with this solution. For example, communication patterns had to be known beforehand. Also, collective operations would require a complete communicator repair, which would trigger the re-creation of the disposable communicators. Another problem is that the application code has to be aware of the specific internal recovery mechanisms. Finally, the fact that a subset of all spare ranks is assigned to only one group of compute ranks limits the flexibility with respect to failure size. For example, if a failure in a group happens to affect a large number of ranks (not an unrealistic assumption, since some failures are correlated), the runtime may not have pre-assigned enough spare ranks to that particular compute group, even though, with the total number of spare ranks in the system, the runtime would have been able to recover from that failure.

Note, however, that many of these limitations could be overcome if communicators could be created in a non-global manner without the need for point-to-point communication over a failed communicator. This could be achieved if only the ranks participating in the communicator were needed in order to create it. Methodologies such as [8] cannot be applied in our situation because MPI_Intercomm_create uses point-to-point communication between ranks in a failed communicator, which, as explained above, is unreliable upon failure. By using the MPI-3 MPI_Comm_create_group operation (assuming it is implemented in a different manner than that described in the above reference), non-collective communicator creation would allow a cost of O(R) to create the topology described in Figure 7a. Also, all the links between the compute ranks and the spare ranks could be eliminated, as they can be created upon failure.

This second prototype was also implemented and tested with a one-dimensional PDE solver. Even though the prototype demonstrated that the concept is feasible, the aforementioned disadvantages of this method were apparent and made this solution infeasible for other, more complex stencil computations. Therefore, we decided to implement a more robust prototype of the FenixLR runtime directly using uGNI constructs, avoiding the use of the MPI runtime altogether. We understand that MPI-based frameworks in general, and ULFM specifically, are evolving rapidly, and we will revisit them once the above issues are fixed.

6.2. Implementation Overview. The FenixLR architecture is layered and modular. Four key modules collaborate to achieve its full functionality. First, the module dedicated to communication offers a base class, Command, that abstracts out the details of a reliable request of service or a reliable communication with other ranks using a particular transport layer. This abstraction separates the fault tolerance mechanisms from the particular operations (such as the send or receive of a message; a collective barrier, broadcast, or reduction; or even the transfer of checkpoint data), which simply extend the base class and are only required to implement the logic of each operation. Second, the module dedicated to process resiliency implements the protocols used to locally repair the environment after a failure, which includes, but is not limited to, the management of the pool of spare processes. Third, the module dedicated to data resiliency implements the methods for creating, storing, and recovering checkpoints. In particular, it implements neighbor-based checkpointing such as the one described in our previous work [16]. This module offers a well-defined interface so that libraries optimized for other data resilience methods can be plugged in instead. Finally, the transport layer also offers a well-defined interface to ease portability to different physical communication APIs. In our implementation, this interface has been implemented on top of uGNI, the Cray transport layer API. In the event of a failure, the runtime orchestrates the necessary modules during the recovery process and determines which operations need to be re-executed.
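The Command abstraction described above might be sketched as follows (Python is used here only for brevity; FenixLR itself sits on uGNI, and every class, method, and exception name in this sketch is hypothetical). The key design point is that the retry-on-failure loop lives in the base class, while each concrete operation supplies only its own execution logic:

```python
class RankFailure(Exception):
    """Raised by the transport layer when a peer rank has failed (hypothetical)."""

class Command:
    # Base class: encapsulates reliable execution of one communication
    # operation; fault handling lives here, not in each concrete operation.
    def __init__(self, transport, recovery):
        self.transport = transport
        self.recovery = recovery

    def run(self):
        while True:
            try:
                return self.execute()          # operation-specific logic
            except RankFailure as f:
                # Repair the environment (e.g., promote a spare rank),
                # then re-issue the operation against the new peer.
                self.recovery.repair(f.args[0] if f.args else None)

    def execute(self):
        raise NotImplementedError

class SendCommand(Command):
    # One concrete operation: only the execute() logic is operation-specific.
    def __init__(self, transport, recovery, peer, payload):
        super().__init__(transport, recovery)
        self.peer, self.payload = peer, payload

    def execute(self):
        # Always resolve the peer through the recovery module, so a repaired
        # (substituted) rank is targeted transparently after a failure.
        return self.transport.send(self.recovery.current_rank(self.peer),
                                   self.payload)
```

In this structure, adding a barrier, broadcast, or checkpoint-transfer operation means adding another Command subclass, without duplicating any fault-handling code.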

FenixLR allows dynamic connection between ranks, which is key to recovery. This connection is implemented through a handshake process that logically


(a) Communicator layout between ranks. (b) Recovery procedure after a failure.

Fig. 7. Possible implementation of the runtime using MPI while avoiding communicator repair operations. In this version of the implementation, each MPI communicator includes only 2 ranks. A communicator is created between each pair of compute ranks. Each compute rank (C) in a group also requires a binary communicator with each spare rank (S). For scalability, compute ranks are divided into groups and a few spare ranks are assigned to each group. In this example, each group contains five compute ranks and two spare ranks.

connects both ranks. When a process fails, a spare process is used in its place. This spare process re-creates the handshake with all processes that were logically connected with the failed process. While the handshake to recreate connections is running, the failure is notified to the rest of the domain through a background collective operation, so that new processes that try to connect to the failed rank redirect the handshake request to the corresponding spare process. If a new process tries to connect to the failed process before receiving the collective notification, the failure will be detected and that process will start the failure recovery procedure by contacting an available spare rank. That spare rank, however, will notify the process that the failure has already been recovered and will point it to the correct spare rank that is substituting for the failed process. The same behavior applies to multi-process failures, which are treated simply as a set of process failures.

Our requirement when designing FenixLR was to maintain the same interface as Fenix [16], due to the low programming overhead demonstrated with this interface (i.e., requiring fewer than 35 new, changed, or rearranged lines of code in S3D, as shown in Section IV.B of [16]).

7. Evaluation. This section presents the experimental evaluation performed to determine the performance and effectiveness of the local recovery and failure masking approaches described in Section 3, as well as the ghost region extension and rank re-mapping that target enhancing the probability of failure masking, as described in Section 4 and Section 5. The experimental evaluation is based on the S3D combustion simulation, a large-scale stencil computation, running on the Titan Cray XK7 supercomputer at ORNL.

This evaluation demonstrates that the algorithms implemented in FenixLR, as


presented in Section 6, efficiently tolerate multiple process, node, and multi-node failures occurring at a wide range of frequencies. It also demonstrates that the total recovery overhead can be reduced due to the masking of multiple failures. In general, this section shows that, even though the overhead of globally recovering from N independent failures is on the order of N × O1 (where O1 represents the average overhead of recovering from a single failure), the total overhead is closer to O1 when the same N failures are recovered locally.

Previous work [16] has used an implementation on top of a ULFM prototype to successfully recover from failures occurring as frequently as every 47 seconds. This section demonstrates how the new FenixLR implementation reduces sources of overhead and offers optimized recovery constructs able to tolerate failures occurring as frequently as every 5 seconds.

7.1. Experimental Evaluation Goals. As this section discusses the experimental evaluation of FenixLR, the following are the main goals to be addressed: (1) show that, by using the presented local recovery and failure masking methodologies, stencil computations such as the tightly-coupled S3D combustion simulation can tolerate failures arriving at rates higher than previously explored; (2) demonstrate that the presented recovery algorithms lead to scalable recovery; (3) establish that the recovery overhead of multiple failures is constant and independent of system size; and (4) prove, experimentally, that the probability of multiple failures masking each other can be increased by application-aware algorithmic changes such as ghost region expansion and cell-to-rank mapping.

The following subsections present the experimental methodology and describe the experiments in detail.

7.2. Experimental Methodology. A key goal of this evaluation is to study how the presented approach behaves at current scales, and to use this to explore behaviors and performance under future extreme-scale failure rates. As a result, we have conducted our experiments on up to 262272 cores.

Testbed. As mentioned above, all the experiments were performed on the Cray XK7 Titan at ORNL. Titan is composed of 18688 nodes, each with one 16-core CPU and one GPU. Every pair of nodes is connected to a single custom system-on-chip Gemini ASIC network interconnect, and Gemini ASICs are connected using a 3D torus topology. Applications can directly access network capabilities using uGNI, Cray's proprietary user-level interface, which is forward compatible with newer Cray networks such as Aries. While the evaluation is centered on Titan, the presented techniques are architecture-agnostic and do not depend on a 3D torus topology; therefore, they can be reproduced on other HPC architectures and systems.

Experiments. The first experiment studies the performance and scalability of recovery from single-process and multi-process failures. This experiment injects worst-case failure sets, engineered to prevent multiple-failure masking, since the main interest is to study the behavior of the recovery process itself, not its derived benefit of failure masking. FenixLR is exposed to failures arriving as frequently as every five seconds, and the results demonstrate FenixLR's ability to provide sustained performance with 50% overhead in the worst-case scenario. These results therefore show that the method presented in this paper, applied to the target application class, can be more efficient than theoretical full redundancy.

The evaluation then experimentally demonstrates the full benefit of the local recovery capabilities in FenixLR: its ability to mask multiple failures. To that end, using S3D on up to 140608 ranks (plus 128 spare ranks), the experiments demonstrate how real node failures can mask each other, yielding a similar overhead regardless of the number of randomly injected failures.

The section ends by studying how the techniques presented in Section 4 and Section 5 – ghost region expansion and optimal cell-to-rank mapping – can increase the probability of failure masking.

Methodology. All the aforementioned experiments inject node failures, which are simulated by determining all application processes executing on a particular node and sending simultaneous SIGKILL signals to all of them. As the network setup parameters are stored in process memory, killing the processes leaves no opportunity for a clean software disconnection – this is consistent with the behavior of a real node failure. Processes on other nodes will receive error codes when trying to perform a uGNI operation with the failed processes. In what follows, 'failures' refers to node failures, which are equivalent to N-process failures, where N is the total number of processes on a system node, blade, etc. By default, the experiments use N = 16. All the experiments report the average, first and third quartiles, maximum, and minimum of five repetitions, unless otherwise specified.
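The injection mechanism can be sketched as follows (a simplified, hypothetical harness written for a POSIX system; the actual experiments locate the real application processes on the victim node): every process standing in for a rank on the node is sent SIGKILL at once, so none gets a chance to tear down its network state.

```python
import os
import signal
import subprocess

def spawn_fake_ranks(n):
    """Stand-ins for the N application processes running on one node
    (here, plain `sleep` processes instead of real MPI ranks)."""
    return [subprocess.Popen(["sleep", "60"]) for _ in range(n)]

def inject_node_failure(procs):
    """Simulate a node failure: SIGKILL every process simultaneously,
    leaving no opportunity for a clean software disconnection."""
    for p in procs:
        os.kill(p.pid, signal.SIGKILL)

ranks = spawn_fake_ranks(4)          # N = 4 here; the paper uses N = 16
inject_node_failure(ranks)
exit_codes = [p.wait() for p in ranks]
print(exit_codes)                    # each stand-in dies with -SIGKILL (-9)
```

In the real setting, surviving ranks then observe uGNI error codes on their next operation involving the killed processes.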

As mentioned above, it has been observed that, statistically, failures tend to cluster in time, appearing as periods of high instability separated by more stable periods [29]. For example, even though Titan's MTBF is approximately 7.75 hours, around 30% of all failures occur within one hour of a previous failure. The goal of the experiments presented in this paper is to focus on extreme cases in which unrelated failures occur within short periods of time; in particular, we are interested in the periods of high instability that may occur on future extreme-scale machines. As failure distribution predictions and MTBFs of future extreme-scale systems vary significantly, this paper presents a worst-case study and therefore focuses on experiments with failure frequencies below one minute. For lower failure frequencies, on the order of tens of minutes (or even hours), negligible failure overheads are observed when recovering with the presented techniques. Specifically, long simulations experiencing low failure frequencies observe: (1) negligible aggregated checkpointing cost, due to its low overhead and relatively low frequency; (2) rollback overhead similar to that of checkpointing (assuming the checkpoint frequency is adjusted correctly following Daly's approach [7]); and (3) negligible process recovery cost (as shown in Figure 8).
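For point (2), one common first-order form of the Young/Daly checkpoint-interval rule is τ ≈ √(2δM), where δ is the per-checkpoint cost and M the MTBF (a sketch; the constants below are illustrative, not the paper's measured values):

```python
import math

def daly_interval(checkpoint_cost, mtbf):
    """First-order Young/Daly optimal checkpoint interval, valid when
    checkpoint_cost << mtbf."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Illustrative values: a 1 s in-memory checkpoint and Titan's ~7.75 h MTBF:
delta, M = 1.0, 7.75 * 3600
print(round(daly_interval(delta, M), 1))   # ~236.2 s between checkpoints
```

With a low-cost in-memory checkpoint and an hours-long MTBF, the optimal interval is minutes, so the aggregated checkpoint overhead is indeed negligible.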

7.3. Recovery Time for Different Failure Frequencies. This section first focuses on evaluating FenixLR's ability to handle failures occurring as frequently as every five seconds. The experiment is designed to also capture the total overheads due to process recovery and to demonstrate the perfect scalability of the algorithms: it shows that the time to recover from process failures does not depend on the number of processes in the system. This, combined with the fact that double in-memory checkpointing does not depend on this number either [33, 16, 17], offers a holistic approach for tolerating failures at any scale, a desirable property towards exascale.

Since the objective of this experiment is to compare the performance of global recovery with the performance of local recovery, failures are strategically injected so that no failure masking can occur.

Overhead of Process Recovery. The average overhead of recovering from node failures is studied while injecting failures at different rates. As shown in Figure 8a, which depicts the average process recovery time over all node failures injected during an execution of around 200 seconds of S3D, the overhead of recovering from failures is independent of the MTBF, as expected. Note that the overhead of the failure recovery is initially observed only by the spare processes, but it soon propagates to the rest of the domain. In this experiment, a second failure is only injected in a node that has already suffered the overhead of a prior failure. This experiment was executed using S3D on 4736 cores (including 640 spare ranks).

Fig. 8. Time to recover from process failures with varying frequency of failure arrival as well as varying total number of cores in the job [17]. (a) Process recovery time (s) vs. MTBF (5–45 s), fixing the number of cores at 4736, corresponding to 4096 compute cores (an S3D domain decomposed in a grid of 16³ cores) and 640 spare cores. (b) Process recovery time (s) vs. total core count (4224–262272, including 128 spare cores), fixing the MTBF at 10 seconds. Note that the recovery process is perfectly scalable, tested up to 250+ thousand cores.

Note that this experiment does not account for the overhead of recovering data, since that cost is similar to the cost of checkpointing [17]. The data recovery process starts by fetching the checkpoint from the neighbor storing it and, while keeping a local copy for possible future failures, continues by copying the data to an application buffer.

Scalability of Process Recovery. The scalability of recovering from node failures is studied while increasing the total number of cores in the system from 4224 to 262272 and injecting failures every 10 seconds. All core counts include 128 spare processes. The results are presented in Figure 8b and show that the recovery overhead does not depend on the total number of cores in the MPI job – an extremely desirable property, and one of the major problems of global recovery. Except for the two largest experiments, the averages include executions injecting from 1 to 8 node failures; the largest two experiments only include averages of executions injecting 8 node failures.

7.4. Total Overhead due to Fault Tolerance. The previous experiments studied the different overheads related to fault tolerance techniques in FenixLR separately. It has been shown that both the checkpoint storage time and the process recovery time are independent of the total number of cores as well as of the failure frequency [17]. The current experiment, therefore, explores how the total overhead related to fault tolerance activities increases as the MTBF decreases. This is accomplished by observing the total time to solution of a failure-free, checkpoint-free execution and comparing it with the total time to solution of FenixLR-enabled executions with varying failure rates. To do so, 4736 cores are used (including 640 spare ranks) while checkpointing 53 MB per core, a data size 10 times larger than the one used by pioneering S3D production executions.

Fig. 9. Overall failure overhead for different MTBFs (5–45 s, plus the 47/GR global-recovery case) relative to a checkpoint-free, failure-free base test execution [17]. Each bar is split into total recovery time (process+data+rollback) and total checkpoint time. On top of each bar, the total number of process failures recovered throughout the execution is indicated (ranging from 528 at the shortest MTBF down to 48). The job size is fixed at 4736 cores, corresponding to 4096 compute cores (an S3D domain decomposed in a grid of 16³ cores) and 640 spare cores, and each checkpoint requires 217 GB.

The results of the experiment, for MTBFs ranging from 5 to 45 seconds, are plotted in Figure 9, where each bar represents the overhead of fault tolerance compared to a base failure-free and checkpoint-free execution. On top of each bar, a number indicates the total number of processes killed (from 3 to 33 nodes) during a total execution time of about 150 seconds.

The bar on the right, labeled 47/GR, represents the overhead of a similar experiment in which failures are injected every 47 seconds and recovered globally, as presented in our previous work [16]. Note, however, that this cannot be compared directly, since that experiment used S3D checkpoints of ∼9 MB/core, whereas the local recovery experiments use 53 MB/core. Even ignoring this, local recovery still offers better performance at higher failure rates: for example, tolerating failures locally every 20 seconds incurs an overhead of 25%, while tolerating failures globally every 47 seconds (∼2.35× the MTBF) incurs an overhead of 31% (∼1.25× the overhead).

Since one of the goals of this experiment is to provide a fair comparison between global and local recovery, failure masking has been disabled by choosing which failures to inject and guaranteeing that they do not mask one another. The next subsection studies the full advantage of local recovery by enabling its ability to mask multiple failures.

7.5. Local Recovery with Failure Masking. The previous experiments have shown that local recovery offers perfect scalability and sustained performance even in scenarios where failures occur frequently. This experiment aims to show the full potential of local recovery by enabling failure masking. By allowing several failures to mask one another, the total overhead of recovering from N failures is, perhaps surprisingly, closer to O1 than to N × O1, where O1 is the overhead of recovering from a single failure.

Fig. 10. Execution profile of S3D while injecting different numbers of failures, empirically demonstrating the existence of the failure masking effect (some subfigures appeared in our preliminary work [17]). Subfigures (a)–(e) represent tests with one, two, three, four, and eight node failures running on 4224 cores, corresponding to 4096 compute cores (an S3D domain decomposed in a grid of 16³ cores) and 128 spare cores. Subfigures (f)–(j) represent the same tests running on a larger domain with 32896 cores, corresponding to a 3-D grid of 32³ plus 128 additional spare cores. In each subfigure the x-axis represents the core number (as ranks in the world communicator) while the y-axis represents the walltime, advancing from the start of the application to its end. Each line represents the time at which a particular core finishes computing a particular iteration. Note that failures, denoted by red crosses, are recovered and rolled back locally, which translates to a delay that propagates through the domain in successive iterations. The end-to-end time when injecting eight failures is slightly longer than in the other cases; in all other cases, the end-to-end time is similar, demonstrating the benefits of failure masking. Ghost region expansion and rank re-mapping have not been applied in these experiments, motivating the need for these techniques to achieve slower propagation.

Figure 10 depicts the behavior of S3D executions under different circumstances. The x-axis represents the rank number, while the y-axis represents the walltime; in each plot, each line represents the completion of a timestep. Two different scales are depicted, namely 4096 and 32768 ranks, in both cases with 128 additional spare ranks allocated. Failure counts ranging from 1 to 8 are injected to study how the recovery propagation spreads across the nodes, as well as how the recovery delay propagation waves due to different failures interact. It is important to note that, in all cases, the total time to solution (the height of the uppermost iteration line) is similar regardless of the total number of failures. This holds in all cases except the execution on 4096 ranks with 8 injected failures, since one of the failures strikes a node that had already been delayed by a previous failure.

Figure 11 summarizes a set of experiments in which the total rank count ranges from 4096 to 140608 and the number of failures ranges from one to eight. Each bar represents the average of five executions, except for experiments larger than 64000 ranks, due to allocation issues; these issues also affected scheduled experiments with larger failure counts. The y-axis represents the execution time relative to the average overhead of suffering a single failure. The figure therefore depicts how the overheads of different failures are masked, showing how comparable the total overhead of a particular test is to that of a single failure. In most cases, the overheads are very similar, varying by around 2% or less when failures are completely masked. The variability is due to the fact that the rollback overhead is the main factor in the recovery process, and it depends on the distance of a particular failure from the last checkpoint. A failure occurring right after a checkpoint will have minimal rollback overhead, while a failure occurring right before a checkpoint will have a rollback overhead practically equivalent to recomputing an entire iteration. In the cases where this relative overhead increases to 6%, or even 12%, at least one failure occurred in a node after it had been delayed by the recovery of a previous failure. An example of this can be seen in Figure 10: as mentioned before, comparing Figure 10e with Figure 10a or Figure 10d shows a non-negligible increase in the total time to solution due to a failure occurring after the propagation of a prior failure.

Fig. 11. End-to-end execution time of S3D while injecting multiple node failures and recovering locally, relative to the end-to-end execution time when injecting a single failure (adapted with permission from our preliminary study [17]). A relative time close to one represents failures masked perfectly (variability is due to variable rollback overheads), while relative times on the order of 1.06 or even 1.10 indicate that not all failures masked each other: some failures occurred after the delay had already propagated to that node. Note how an increasing number of nodes implies a decrease in total overhead at high failure counts (e.g., eight failures, as shown by the trend line), indicating that an increase in core count increases the failure masking probability.

The results shown in Figure 11 and Figure 10 show, therefore, that the benefits of failure masking can be of critical importance when trying to minimize recovery overheads. Specifically, both figures offer an empirical demonstration that the overhead due to N failures is closer to O1 than to N × O1, where O1 is the overhead of recovering from a single failure.

7.6. Increasing the Failure Propagation Window on S3D. The goal of this experiment is to demonstrate that, by exploiting the characteristics described in Section 4 and Section 5, the opportunity for masking failures is increased. To this end, the current set of experiments shows how the failure propagation window can be expanded by increasing the ghost region size and re-mapping cells to ranks. As described before, the failure propagation window captures the pace at which the number of ranks affected by a particular failure increases.

The failure propagation window closes when the effect of recovering from a failure reaches all the ranks in the machine. It can be measured (1) generically, as the number of iterations computed since the iteration in which the failure occurred, or (2) specifically, using any measure of time. Unless otherwise indicated, the results in this part of the evaluation use the former, more generic, method for measuring the failure propagation window.
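The measured propagation curves can be approximated with a simple hop-distance model (our sketch, under idealized assumptions: homogeneous iterations, and the recovery delay advancing one neighbor layer per communication step): a rank at Manhattan distance d from the failed rank is affected after roughly d·i iterations, where i is the number of iterations between communications.

```python
from itertools import product

def affected_ranks(grid, failed, iteration, comm_every=1):
    """Count ranks on a 3-D grid whose hop (Manhattan) distance from the
    failed rank is small enough for the recovery delay to have reached
    them after `iteration` iterations, communicating every `comm_every`
    iterations."""
    hops = iteration // comm_every        # neighbor layers reached so far
    fx, fy, fz = failed
    nx, ny, nz = grid
    return sum(
        abs(x - fx) + abs(y - fy) + abs(z - fz) <= hops
        for x, y, z in product(range(nx), range(ny), range(nz))
    )

# 8x8x8 domain (512 ranks), failure at a corner, after 3 iterations:
print(affected_ranks((8, 8, 8), (0, 0, 0), 3))                # 20 ranks
print(affected_ranks((8, 8, 8), (0, 0, 0), 3, comm_every=4))  # 1 rank
```

Dividing the propagation speed by the communication interval is precisely what ghost region expansion buys: the window stays open longer, so more failures can land inside it and be masked.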

We augmented the S3D main code loop to lower the communication frequency by increasing the ghost region size and replicating part of the ghost region in two different ranks. We also augmented the part that decomposes and maps the simulated domain onto the different ranks to support the topology described in Figure 6 and Section 5.

Fig. 12. Overhead of different levels of ghost region expansion on the end-to-end time compared to the unmodified, original S3D code (labeled 'Orig.' in both subfigures): (a) S3D execution; (b) S3D execution modified to operate on one floating-point number per cell, instead of nine. The x-axis represents the number of iterations between consecutive communications (parameter i in the model presented in Section 4), and each bar is split into computation and communication time. The experiments were conducted on a generic Linux cluster of 32 nodes connected via InfiniBand; each node has two Intel quad-core Xeon E5620 processors and 24 GB of RAM. Our executions used a 27-node allocation (running 8 processes per node) to simulate a cubical domain split into 6³ = 216 homogeneous partitions. Each bar represents the average of 10 repetitions, each simulating 100 S3D iterations. No checkpoints are created nor failures injected in any of these tests.

Evaluating the Overhead of Ghost Region Expansion. To evaluate the impact of the ghost region expansion technique on execution time, we performed a set of experiments using the unmodified S3D as well as the augmented code. Figure 12 shows the results for different levels of ghost region expansion, starting from a single layer of ghost points and increasing it to allow communication as infrequently as every ten iterations. As the computation and the communication are overlapped using asynchronous operations, the portion of each bar labeled 'Communication' represents the time spent in relevant MPI operations.

Figure 12a was created by performing 100 iterations of S3D, parameterized as in the rest of the experiments in this section. The results show an increasing overhead due to ghost region expansion. In particular, since communicating every iteration requires message exchange with 6 neighbors while communicating every two iterations requires contacting 18 neighbors, the communication time is significantly larger in the latter case. The figure shows that the impact of decreasing the communication frequency to once every three or four iterations is below 25%.

A similar set of executions that uses and communicates a single real number per cell (rather than nine real numbers per cell) is shown in Figure 12b, exemplifying a slightly different application pattern. In this case, it can be observed that ghost region expansion can actually improve execution time rather than degrade it: trading extra computation per iteration (by increasing the ghost region size) for a lower communication frequency offers an overall gain in performance.
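This tradeoff can be sketched with a rough per-iteration cost model (ours, not the paper's; all constants below are hypothetical flop/latency units): expanding the ghost region by a factor i enlarges the subdomain that must be computed redundantly, but amortizes message latency over i iterations.

```python
def time_per_iteration(n, ghost, flops_per_cell, msg_latency, comm_every):
    """Rough cost of one iteration for an n^3 local subdomain with a
    ghost region `ghost` layers deep, communicating every `comm_every`
    iterations (message latency amortized over that interval)."""
    cells = (n + 2 * ghost) ** 3       # interior + ghost cells computed
    compute = cells * flops_per_cell
    comm = msg_latency / comm_every    # fewer, larger messages
    return compute + comm

# Latency-bound case (light per-cell work): expansion wins, as in Fig. 12b.
base = time_per_iteration(16, 1, 1.0, 20000.0, 1)
wide = time_per_iteration(16, 4, 1.0, 20000.0, 4)
print(wide < base)   # True

# Compute-bound case (heavy per-cell work): expansion costs, as in Fig. 12a.
heavy_base = time_per_iteration(16, 1, 10.0, 2000.0, 1)
heavy_wide = time_per_iteration(16, 4, 10.0, 2000.0, 4)
print(heavy_wide > heavy_base)   # True
```

The crossover point depends on the per-cell arithmetic intensity and the message cost, which is exactly why the optimal communication frequency is application-specific.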

As a conclusion, the executions depicted in Figure 12 demonstrate the importance of understanding and studying the aforementioned tradeoff in each individual stencil application. Particular application behaviors influence the optimal communication frequency and the impact it has on both computational performance and fault tolerance overhead.

Fig. 13. Different domain cell-to-rank mapping strategies on a 512-rank execution with a failure injected in all sixteen ranks of the third node, 16 seconds after the start of the application. Two random mappings are tested, providing the worst behavior, as well as two cubic configurations (2×2×2 and 4×4×4) and two rectangular prism configurations (1×2×8 and 4×2×2). Note that only the rectangular prism configurations exactly fill a 16-rank node, which is the target architecture.

Comparing Different Domain Cell to Rank Mappings. The first test determines the effect of the domain cell-to-rank mapping on the shape of the propagation delay. To that end, Figure 13 shows different mapping strategies on a 512-rank execution with a failure injected in sixteen ranks. The worst behavior is observed when randomly assigning cells to ranks; the figure shows the behavior of two random mappings. Figure 13 also depicts tests with two regular configurations, cubes of side two and cubes of side four, both yielding a power-of-two number of ranks per shape – and hence evenly dividing the total domain size of 512. Either configuration is suboptimal, since the number of ranks included in such cubes does not match the number of ranks in a Titan node. Therefore, two extra configurations based on rectangular prisms are tested, of sizes 1×2×8 and 4×2×2; in both cases, the number of ranks per decomposed shape is exactly sixteen, matching the target architecture. The results show that the 4×2×2 rectangular prism and the cubic shape of side 2 have a similar propagation delay curve. The remaining experiments with rank re-mapping use the 4×2×2 rectangular prism, unless otherwise indicated.
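The blocked mapping can be sketched as follows (a simplified version of the idea in Section 5; the function and parameter names are ours): instead of assigning consecutive ranks along one axis, the domain is tiled into 4×2×2 blocks of cells and all 16 cells of a block are placed on the same node, so a node failure removes one compact prism rather than a long line of cells.

```python
def cell_to_node(cell, block=(4, 2, 2), grid_blocks=(2, 4, 4)):
    """Map a 3-D cell coordinate to the node hosting it, by tiling the
    domain into rectangular-prism blocks (one block = one 16-rank node)."""
    bx, by, bz = (c // b for c, b in zip(cell, block))
    gx, gy, gz = grid_blocks
    return bx * gy * gz + by * gz + bz    # linearized block index

# An 8x8x8 domain (512 cells) tiled into 4x2x2 prisms -> 32 nodes, 16 cells each:
nodes = {}
for x in range(8):
    for y in range(8):
        for z in range(8):
            nodes.setdefault(cell_to_node((x, y, z)), []).append((x, y, z))
print(len(nodes))                                 # 32
print({len(cells) for cells in nodes.values()})   # {16}
```

Because each lost block is compact, its surface (the cells whose neighbors must learn of the failure) is minimized, which slows the growth of the affected-rank count compared to a linear mapping.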

Ghost Region Expansion and Rank Re-Mapping Empirical Evaluation. Figure 14 shows several executions of S3D on 4096 cores, in which a single node failure was injected by killing the 16 ranks running on it: all the processes on node 31 received a SIGKILL 25 seconds after the beginning of the execution. The x-axis plots the iteration number, while the y-axis represents the total number of MPI ranks affected by the recovery overhead of that failure; the affected ranks are counted as depicted in Figure 15. The left subfigure of Figure 14 shows the propagation profile of the affected ranks for different ghost sizes, from the base case of one to the extreme case of eight. In each case, a test labeled 'G× ghost size' represents a test in which the communication is done every G iterations. Since the scale of these experiments is small (4096 cores), at least compared with production runs, in the base case of no ghost region expansion the failure propagates to all the ranks in the domain quickly: in less than 5 iterations. When expanding the ghost region to just twice the original size, this number is extended to 16 iterations, and in the extreme case of eight times the ghost size, the propagation delay is extended to 80 iterations. The right part of the same figure shows a zoomed detail of the same five cases, each repeated with a topology-aware 4×2×2 rectangular prism mapping (as shown in Figure 6). The re-mapping tests are marked 'RR' and plotted with dashed lines; the improvement obtained by re-mapping the ranks in each ghost-size expansion case is depicted as a gray area between the dashed and solid lines. In all cases, and especially with the larger ghost sizes, re-mapping the ranks as a rectangular prism rather than using the default linear mapping delays the propagation of the failure, which is an extremely desirable effect. In all ten cases, each plotted line is one of four executions; only one is included because the difference between executions was unnoticeable – i.e., the lines almost overlap.

Fig. 14. Effects of ghost region expansion and rank re-mapping on the propagation window. Each line shows an execution on 4096 ranks in which a node failure (16 cores) is injected. The left subfigure ('Propagation of a 16-core failure with different ghost region sizes') includes experiments with increasing ghost region sizes (1×, 2×, 4×, 6×, 8×) under the default cell-to-rank mapping. The right subfigure ('Detail comparing default mapping with rectangular prism mapping') shows a detail of the same executions with both the default mapping and the rectangular prism mapping ('RR'). The gray area represents the improvement in the propagation curve induced by the topology-aware mapping. Each line is one of four executions; the difference between executions was unnoticeable.

Fig. 15. Detail of the execution from Figure 14 using the default mapping and no ghost region expansion, showing how the ranks affected by a failure are counted. The x-axis depicts the MPI rank number; the y-axis depicts the walltime, and each line represents the point in time at which an S3D iteration is finished by each rank. The number to the right of each line counts how many ranks have been affected by the node failure indicated with a red cross.

Fig. 16. Effect of ghost region size and rank re-mapping on the total number of ranks affected by a node failure in a 512-rank execution. The node failure is injected in all cases as a 16-rank failure in the third node, 16 seconds into the execution. The gray area represents the improvement in the propagation curve induced by the topology-aware mapping. Each line is one of four executions; the difference between executions was unnoticeable.

Fig. 17. Execution time of the different techniques while injecting and recovering from a single node failure, for (a) 4096 ranks and (b) 32768 ranks. The base test is the left-most bar in each subfigure, representing no ghost expansion (1×) and no cell-to-rank re-mapping. The y-axis represents the total time as a percentage of the base test. The number of iterations between consecutive checkpoints has been set to follow the degree of ghost region expansion. Aside from the ranks indicated, the experiments allocated an additional 128 spare ranks.

A similar experiment has been repeated with a much smaller domain of 512 total ranks, depicted in Figure 16. In this case, a node failure has been injected at the 16th second (all sixteen ranks running on node 3 receive a SIGKILL at that time).

Additionally, the performance of the techniques described in Section 4 and Section 5 is studied in this empirical evaluation while injecting a single node failure. Figures 17a and 17b depict the total, end-to-end execution time when using 1×,


28 GAMELL, TERANISHI, KOLLA, MAYO, HEROUX, CHEN AND PARASHAR

[Figure 18 appears here: two groups of bar charts; x-axis: number of node failures; y-axis: overhead of test divided by overhead of single failure (%).]

(a) Two and four times ghost region expansion; default and rectangular prism rank mapping techniques. All executions run S3D on 512 ranks with an additional 64 spare ranks. Overheads are compared with the single-failure, "2x ghost size" test.

(b) No ghost region expansion; 4 × 2 × 2 rectangular prism rank mapping. The subfigure on the left used S3D with 4096 compute ranks and 128 spare ranks, and the subfigure on the right used S3D with 32768 compute ranks and 128 spare ranks.

Fig. 18. Overhead on the total time to solution over several executions with an increasing number of failures recovered using local recovery in different scenarios. The x-axis of each subfigure is the total number of failures, while the y-axis represents the overhead of each test relative to the overhead caused by a single failure (%). For each total number of failures, the extrapolated overhead of a theoretical execution with best-case global recovery (100% for one failure, 200% for two failures, 300% for three, ...) is indicated by the lighter color (green). As such, the lighter part of each bar indicates overhead reductions due to failure masking.

2×, 4×, 6×, and 8× ghost region expansion, combined with both rank re-mapping and no re-mapping, while running on 4096 and 32768 ranks. These experiments are performed using a cubic S3D domain decomposed onto 16 × 16 × 16 ranks and 32 × 32 × 32 ranks, respectively. The checkpoint period has been adjusted to coincide with the ghost region expansion rate, since checkpointing involves communication with a pre-determined neighboring rank, and this communication needs to be delayed to maximize the failure propagation delay window. Therefore, the performance improvement when expanding the ghost region and further delaying communication comes both from savings in communication latency and synchronization and from reduced bandwidth usage by the checkpointing process.
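The trade-off behind ghost region expansion can be sketched in a few lines. The following is a hedged, minimal 1-D model (a 3-point averaging stencil standing in for S3D's solver; `step` and `run_two_ranks` are illustrative names, not FenixLR or S3D APIs): with the ghost region expanded G times, each subdomain copies a halo of width G from its neighbor and then runs G communication-free iterations, during which the ghost cells go stale one layer per step and are discarded at the next exchange, yet the result matches exchanging every iteration exactly:

```python
def step(u):
    """One explicit 3-point stencil update (simple averaging); endpoints held fixed."""
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0
                     for i in range(1, len(u) - 1)] + [u[-1]]

def run_two_ranks(u, G, steps):
    """Split u across two 'ranks' with ghost width G; exchange halos every G steps."""
    assert steps % G == 0
    n = len(u) // 2
    left, right = u[:n], u[n:]
    for _ in range(steps // G):
        lw = left + right[:G]    # left subdomain plus G ghost cells from the right
        rw = left[-G:] + right   # G ghost cells from the left plus right subdomain
        for _ in range(G):       # G communication-free iterations
            lw, rw = step(lw), step(rw)
        left, right = lw[:n], rw[G:]  # ghost cells are now stale; discard them
    return left + right

# The result matches a serial run regardless of the exchange period G.
u = [float((3 * i) % 5) for i in range(16)]
ref = u
for _ in range(8):
    ref = step(ref)
for G in (1, 2, 4):
    assert run_two_ranks(u, G, 8) == ref
```

The redundant work on the stale halo layers is the price paid for exchanging (and, in the paper's setup, checkpointing) only every G iterations.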

As can be observed in both cases, the use of rank re-mapping carries no performance penalty while greatly improving the probability of failure masking, as expected and as shown previously in the right part of Figure 14.

Staircase Behavior of Ghost Region Extension. Note that, in both Figure 14 and Figure 16, the cases with an extended ghost region present a staircase-shaped profile, which is more pronounced the larger the extension. This shape can be explained by observing that during the horizontal part of each 'stair', the delay of recovery is not propagated for several iterations. These periods of non-propagation arise because, when the ghost region size is increased G times, communication among ranks only occurs every G iterations, and the delay is only propagated while communication occurs.
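The staircase profile can be reproduced with a toy propagation model. The sketch below is a deliberate simplification, not the paper's model: it assumes a 1-D line of ranks, a single initially delayed rank, and the rule that a rank inherits the delay the first time it blocks on a delayed neighbor during a halo exchange. Even so, it yields the same flat-for-G-iterations shape:

```python
def affected_over_time(nranks, G, iters, fail_rank=0):
    """Count ranks that have inherited the recovery delay after each iteration.

    With the ghost region expanded G times, neighboring ranks only
    synchronize every G iterations, so the delay front can advance at
    most one rank per exchange: the count stays flat for G iterations,
    then jumps, producing the staircase profile.
    """
    delayed = [r == fail_rank for r in range(nranks)]
    counts = []
    for t in range(1, iters + 1):
        if t % G == 0:  # halo exchange: blocking on a delayed neighbor delays you
            delayed = [delayed[r]
                       or (r > 0 and delayed[r - 1])
                       or (r < nranks - 1 and delayed[r + 1])
                       for r in range(nranks)]
        counts.append(sum(delayed))
    return counts

print(affected_over_time(10, 1, 8))  # → [2, 3, 4, 5, 6, 7, 8, 9]
print(affected_over_time(10, 4, 8))  # → [1, 1, 1, 2, 2, 2, 2, 3]
```

With G = 1 the affected-rank count grows every iteration; with G = 4 it grows only at every fourth iteration, mirroring the stairs in Figures 14 and 16.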

Total Overhead for Different Numbers of Failures. Figure 18a depicts the total overhead of several failures compared to the overhead of recovering from a single failure. Each subfigure represents a different combination of ghost region size increase and rank re-mapping technique, namely 2x and 4x ghost sizes, each with


both a default linear rank mapping and a 4 × 2 × 2 rectangular prism mapping. For each failure count, Figure 18a compares the total overhead of real executions with local recovery against a theoretical, best-case global recovery. The comparison with global recovery in these figures is considered best-case because the recovery cost (excluding the rollback cost) is assumed to be identical for local and global recovery, which is not a safe assumption even at medium scales of hundreds or thousands of cores [17]. We can observe that when jumping from a 2x to a 4x ghost region extension multiplier, the opportunities for failure masking increase and, therefore, the total overheads are lower in the cases of three and four node failures. The same happens when changing the mapping strategy from linear to one based on a 4 × 2 × 2 rectangular prism, for both the 2x and 4x ghost region sizes. The results show only a single experiment for each test.

Figure 18b shows the same overhead and comparison with global recovery as Figure 18a, but with the scale increased from 512 ranks to 4096 and 32768 ranks, respectively. As in the 512-rank case, local recovery offers a much lower overhead than global recovery, which can be explained by the benefits of failure masking. The local recovery version offers a higher reduction in the 32768-rank execution than in the 4096-rank case, which is expected, since the opportunities for failure masking increase in larger domains.

8. Conclusion. This paper first introduces the technique of local recovery and the effect of multiple-failure masking that can be observed when tolerating failures locally in stencil computations such as explicit PDE solvers. Failure masking reduces the overhead on the total execution time due to recovery from multiple failures down to a level comparable to that of a lower failure count. The paper then describes two application-level optimizations that can be applied to improve the effect and probability of failure masking. These two techniques involve increasing the size of the ghost region and mapping cells to ranks so that the propagation effect of a node failure is limited. Both techniques revolve around the concept of decreasing the communication frequency to increase the failure propagation window.

The paper also presents the design and implementation of the discussed techniques in the FenixLR framework and its integration with the S3D combustion simulation. Thanks to the simple FenixLR interface, this integration requires fewer than 35 lines of code to be added or rearranged. This implementation is deployed on the Titan supercomputer and used for an extensive empirical evaluation, injecting real process/node failures by sending SIGKILL signals, that demonstrates the analytical and theoretical behaviors previously studied. The results show that the performance of both the data recovery algorithms and the local process recovery methods is constant and independent of the number of nodes and the MTBF. Therefore, the combination of techniques introduced by this study gives an empirical demonstration that the performance of stencil computations can be sustained even in extreme future scenarios. For example, it is shown that failures occurring as frequently as every five seconds can be tolerated, and that failures occurring every thirty seconds, a more realistic scenario, can be recovered with an overall overhead of less than 14% compared to a checkpoint-free, failure-free execution. Finally, our experiments also demonstrate how local recovery enables failure masking, and how cell-to-rank re-mapping and ghost region expansion in stencil computations increase the effect of failure masking. For example, it is shown that four failures can mask each other in an execution with more than 140,000 cores, yielding overheads equal to those of a single failure.

Our future work includes (1) the exploration of these ideas beyond stencil computations and (2) the modeling and simulation of the propagation delay of local recovery, to increase the understanding of failure masking as well as to derive its exact probability in certain cases.

REFERENCES

[1] R. F. Barrett, S. D. Hammond, C. T. Vaughan, D. W. Doerfler, M. A. Heroux, J. P. Luitjens, and D. Roweth, Navigating an evolutionary fast path to exascale, in Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC '12, Washington, DC, USA, 2012, IEEE Computer Society, pp. 355–365, http://dx.doi.org/10.1109/SC.Companion.2012.55.

[2] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, P. Kogge, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, Exascale computing study: Technology challenges in achieving exascale systems, 2008.

[3] M. S. Bouguerra, A. Gainaru, L. B. Gomez, F. Cappello, S. Matsuoka, and N. Maruyama, Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing, International Parallel and Distributed Processing Symposium, 0 (2013), pp. 501–512, http://doi.ieeecomputersociety.org/10.1109/IPDPS.2013.74.

[4] A. Bouteiller, G. Bosilca, and J. Dongarra, Redesigning the message logging model for high performance, Concurrency and Computation: Practice and Experience, 22 (2010), pp. 2196–2211, http://dx.doi.org/10.1002/cpe.1589.

[5] A. Bouteiller, T. Ropars, G. Bosilca, C. Morin, and J. Dongarra, Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure recovery, in IEEE International Conference on Cluster Computing and Workshops, CLUSTER 2009, 2009, pp. 1–9, http://dx.doi.org/10.1109/CLUSTR.2009.5289157.

[6] B. Brandfass, T. Alrutz, and T. Gerhold, Rank reordering for MPI communication optimization, Computers & Fluids, 80 (2013), pp. 372–380.

[7] J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, (2006).

[8] J. Dinan, S. Krishnamoorthy, P. Balaji, J. R. Hammond, M. Krishnan, V. Tipparaju, and A. Vishnu, Noncollective Communicator Creation in MPI, in EuroMPI, Y. Cotronis, A. Danalis, D. S. Nikolopoulos, and J. Dongarra, eds., vol. 6960 of Lecture Notes in Computer Science, Springer, 2011, pp. 282–291.

[9] C. Ding, C. Karlsson, H. Liu, T. Davies, and Z. Chen, Matrix Multiplication on GPUs with On-Line Fault Tolerance, in IEEE 9th International Symposium on Parallel and Distributed Processing with Applications (ISPA), 2011, pp. 311–317, http://dx.doi.org/10.1109/ISPA.2011.50.

[10] J. Dongarra et al., The International Exascale Software Project Roadmap, International Journal of High Performance Computing Applications, 25 (2011), pp. 3–60, http://dx.doi.org/10.1177/1094342010391989.

[11] P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, Algorithm-based fault tolerance for dense matrix factorizations, in Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2012, New York, NY, USA, 2012, ACM, pp. 225–234, http://dx.doi.org/10.1145/2145816.2145845.

[12] E. Elnozahy and W. Zwaenepoel, Manetho: transparent rollback-recovery with low overhead, limited rollback, and fast output commit, IEEE Transactions on Computers, 41 (1992), pp. 526–531, http://dx.doi.org/10.1109/12.142678.

[13] E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Comput. Surv., 34 (2002), pp. 375–408, http://dx.doi.org/10.1145/568522.568525.

[14] K. B. Ferreira, S. Levy, P. Widener, D. Arnold, and T. Hoefler, Understanding the effects of communication and coordination on checkpointing at scale, in International Conference on High Performance Computing, Networking, Storage and Analysis (SC14), 2014.

[15] M. P. Forum, MPI: A message-passing interface standard, tech. report, University of Tennessee Knoxville, Knoxville, TN, USA, 1994.

[16] M. Gamell, D. S. Katz, H. Kolla, J. Chen, S. Klasky, and M. Parashar, Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '14, 2014.

[17] M. Gamell, K. Teranishi, M. Heroux, J. Mayo, H. Kolla, J. Chen, and M. Parashar, Local recovery and failure masking for stencil-based applications at extreme scales, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, New York, NY, USA, 2015, ACM, pp. 70:1–70:12, http://dx.doi.org/10.1145/2807591.2807672.

[18] M. Gamell, K. Teranishi, M. A. Heroux, J. Mayo, H. Kolla, J. Chen, and M. Parashar, Exploring Failure Recovery for Stencil-based Applications at Extreme Scales, in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15, New York, NY, USA, 2015, ACM, pp. 279–282, http://dx.doi.org/10.1145/2749246.2749260.

[19] A. Guermouche, T. Ropars, E. Brunet, M. Snir, and F. Cappello, Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications, in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2011, pp. 989–1000, http://dx.doi.org/10.1109/IPDPS.2011.95.

[20] P. H. Hargrove and J. C. Duell, Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters, in Journal of Physics: Conference Series, IOP Publishing, 2006, p. 494.

[21] M. A. Heroux, Toward Resilient Algorithms and Applications, in Proceedings of the 3rd Workshop on Fault-tolerance for HPC at Extreme Scale, FTXS 2013, New York, NY, USA, 2013, ACM, pp. 1–2, http://dx.doi.org/10.1145/2465813.2465814.

[22] J. Hursey, Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems, PhD thesis, Indiana University, Indianapolis, IN, USA, 2010. AAI3423687.

[23] J. Hursey, T. I. Mattox, and A. Lumsdaine, Interconnect agnostic checkpoint/restart in Open MPI, in Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, HPDC 2009, New York, NY, USA, 2009, ACM, pp. 49–58, http://doi.acm.org/10.1145/1551609.1551619.

[24] J. Hursey, J. Squyres, T. Mattox, and A. Lumsdaine, The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI, in IEEE International Parallel and Distributed Processing Symposium, 2007, pp. 1–8, http://dx.doi.org/10.1109/IPDPS.2007.370605.

[25] F. B. Kjolstad and M. Snir, Ghost cell pattern, in Proceedings of the 2010 Workshop on Parallel Programming Patterns, ParaPLoP '10, New York, NY, USA, 2010, ACM, pp. 4:1–4:9, http://dx.doi.org/10.1145/1953611.1953615.

[26] R. Riesen, K. Ferreira, D. Da Silva, P. Lemarinier, D. Arnold, and P. G. Bridges, Alleviating scalability issues of checkpointing protocols, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, Los Alamitos, CA, USA, 2012, IEEE Computer Society Press, pp. 18:1–18:11, http://dl.acm.org/citation.cfm?id=2388996.2389021.

[27] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, et al., Addressing Failures in Exascale Computing, U.S. DoE, 2013, http://dx.doi.org/10.2172/1078029.

[28] K. Teranishi and M. A. Heroux, Toward Local Failure Local Recovery Resilience Model Using MPI-ULFM, in Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA '14, New York, NY, USA, 2014, ACM, pp. 51:51–51:56, http://dx.doi.org/10.1145/2642769.2642774.

[29] D. Tiwari, S. Gupta, and S. S. Vazhkudai, Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems, in 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, June 2014, pp. 25–36, http://dx.doi.org/10.1109/DSN.2014.101.

[30] M. Turmon, R. Granat, D. Katz, and J. Lou, Tests and tolerances for high-performance software-implemented fault detection, IEEE Transactions on Computers, 52 (2003), pp. 579–591, http://dx.doi.org/10.1109/TC.2003.1197125.

[31] B. Vinnakota and N. Jha, A dependence graph-based approach to the design of algorithm-based fault tolerant systems, in 20th International Symposium on Fault-Tolerant Computing, FTCS-20, Digest of Papers, 1990, pp. 122–129, http://dx.doi.org/10.1109/FTCS.1990.89347.

[32] S.-J. Wang and N. Jha, Algorithm-based fault tolerance for FFT networks, IEEE Transactions on Computers, 43 (1994), pp. 849–854, http://dx.doi.org/10.1109/12.293265.

[33] G. Zheng, X. Ni, and L. V. Kale, A scalable double in-memory checkpoint and restart scheme towards exascale, in IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), 2012, pp. 1–6, http://dx.doi.org/10.1109/DSNW.2012.6264677.