An Adaptive Programming Model for Fault-Tolerant Distributed Computing
Sergio Gorender, Raimundo Jose de Araujo Macedo, Member, IEEE Computer Society, and
Michel Raynal
Abstract—The capability of dynamically adapting to distinct runtime conditions is an important issue when designing distributed
systems where negotiated quality of service (QoS) cannot always be delivered between processes. Providing fault tolerance for such
dynamic environments is a challenging task. Considering such a context, this paper proposes an adaptive programming model for
fault-tolerant distributed computing, which provides upper-layer applications with process state information according to the current
system synchrony (or QoS). The underlying system model is hybrid, composed of a synchronous part (where there are time bounds
on processing speed and message delay) and an asynchronous part (where there is no time bound). However, such a composition can
vary over time, and, in particular, the system may become totally asynchronous (e.g., when the underlying system QoS degrades) or
totally synchronous. Moreover, processes are not required to share the same view of the system synchrony at a given time. To
illustrate what can be done in this programming model and how to use it, the consensus problem is taken as a benchmark problem.
This paper also presents an implementation of the model that relies on a negotiated quality of service (QoS) for communication
channels.
Index Terms—Adaptability, asynchronous/synchronous distributed system, consensus, distributed computing model, fault tolerance,
quality of service.
1 INTRODUCTION
DISTRIBUTED systems are composed of processes, located on one or more sites, that communicate with one
another to offer services to upper-layer applications. A major difficulty a system designer has to cope with in these systems lies in the capture of consistent global states from which safe decisions can be taken in order to guarantee a safe progress of the upper-layer applications. To study and investigate what can be done (and how it has to be done) in these systems when they are prone to process failures, two distributed computing models have received significant attention, namely, the synchronous model and the asynchronous model.
The synchronous distributed computing model provides
processes with bounds on processing time and message
transfer delay. These bounds, explicitly known by the
processes, can be used to safely detect process crashes and,
consequently, allow the noncrashed processes to progress
with safe views of the system state (such views can be
obtained with some “time-lag”). In contrast, the asynchronous
model is characterized by the absence of time bounds
(that is why this model is sometimes called the time-free
model). In these systems, a system designer can only assume
an upper bound on the number of processes that can crash
(usually denoted as f) and, consequently, design protocols relying on the assumption that at least (n − f) processes are alive (n being the total number of processes). The protocol has no means to know whether a given process is alive or not. Moreover, if more than f processes crash, there is no guarantee on the protocol behavior (usually, the protocol loses its liveness property).
Synchronous systems are attractive because they allow system designers to solve many problems. The price that has to be paid is the a priori knowledge on time bounds. If they are violated, the upper-layer protocols may be unable to still guarantee their safety property. As they do not rely on explicit time bounds, asynchronous systems do not have this drawback. Unfortunately, they have another one, namely, some basic problems are impossible to solve in asynchronous systems. The most famous is the consensus problem, that has no deterministic solution when even a single process can crash [12].
The consensus problem can be stated as follows: Each process proposes a value, and has to decide a value, unless it crashes (termination), such that there is a single decided value (uniform agreement), and that value is a proposed value (validity). This problem, whose statement is particularly simple, is fundamental in fault-tolerant distributed computing as it abstracts several basic agreement problems. While consensus is considered as a “theoretical” problem, system designers are usually interested in the more practical Atomic Broadcast problem. That problem is both a communication problem and an agreement problem. Its communication part specifies that the processes can broadcast and deliver messages in such a way that the processes that do not crash deliver at least the messages they send. Its agreement part specifies that there is a single delivery order
18 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 4, NO. 1, JANUARY-MARCH 2007
. S. Gorender and R.J. de Araujo Macedo are with the Distributed Systems Laboratory (LaSiD), Computer Science Department, Federal University of Bahia, Campus de Ondina, 40170-110, Salvador, Bahia, Brazil. E-mail: {gorender, macedo}@ufba.br.
. M. Raynal is with IRISA, Universite de Rennes 1, Campus de Beaulieu,35042 Rennes Cedex, France. E-mail: [email protected].
Manuscript received 24 May 2005; revised 31 Dec. 2005; accepted 22 Aug. 2006; published online 2 Feb. 2007. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TDSC-0075-0505.
1545-5971/07/$25.00 © 2007 IEEE Published by the IEEE Computer Society
(so, the correct processes deliver the same sequence of messages, and a faulty process delivers a prefix of this sequence of messages). It has been shown that consensus and atomic broadcast are equivalent problems in asynchronous systems prone to process crashes [6]. Consequently, in asynchronous distributed systems prone to process crashes, the impossibility of solving consensus extends to atomic broadcast. This impossibility has motivated researchers to find distributed computing models, weaker than the synchronous model but stronger than the asynchronous model, in which consensus can be solved.
1.1 Content of the Paper
In practice, systems are neither fully synchronous, nor fully asynchronous. Most of the time they behave synchronously, but can have “unstable” periods during which they behave in an anarchic way. Moreover, there are now “QoS architectures” that allow processes to dynamically negotiate the quality of service of their communication channels, leading to settings with hybrid characteristics that can change over time. These observations motivate the design of a distributed programming model that does its best to provide the processes with information on the current state of the other processes. This paper is a step in that direction.
This programming model is time free in the sense that processes are not provided with time bounds guaranteed by the lower system layer. Each process pi is provided with three sets denoted downi, livei, and uncertaini. These sets, which are always a partition of the whole set of processes, define the view pi has of the state of the other processes. More precisely, if pk ∈ downi, then pi knows that pk has crashed; when pk ∈ livei, then pi can consider pk as being alive; finally, when pk ∈ uncertaini, pi has no information on the current state of pk. These sets can evolve in time, and can have different values for different processes. For example, it is possible that (due to the fact that some quality of service can no longer be ensured) the view pi has on pk be degraded in the sense that the model moves pk from livei to uncertaini. So, the model is able to benefit from time windows to transform timeliness properties (currently satisfied by the lower layer) into time-free properties (expressed on the previous sets) that can be used by upper layer protocols. Interestingly, this programming model considers the synchronous model and the asynchronous model as particular cases. The synchronous model corresponds to the case where, for all the processes pi, the sets uncertaini are always empty. The asynchronous model corresponds to the case where, for all the processes pi, the sets uncertaini always include all processes.
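As a concrete illustration, the three per-process sets can be pictured as a small read-only view object. This is only a sketch: the class and method names are illustrative, not part of the paper's model, and the internal _move operation stands in for whatever transitions the underlying layer performs.

```python
class ProcessView:
    """Illustrative sketch of the view a process pi is given: three
    read-only sets that always partition the whole set of processes."""
    def __init__(self, all_procs):
        self.all = frozenset(all_procs)
        self._down = set()             # pi knows these processes have crashed
        self._live = set(all_procs)    # hint: currently considered alive
        self._uncertain = set()        # no information (crashed or alive)

    # Upper-layer code may only read the sets; it never writes them.
    @property
    def down(self):
        return frozenset(self._down)

    @property
    def live(self):
        return frozenset(self._live)

    @property
    def uncertain(self):
        return frozenset(self._uncertain)

    def _move(self, p, target):
        """Model-internal transition, e.g. live -> uncertain on QoS loss."""
        for s in (self._down, self._live, self._uncertain):
            s.discard(p)
        target.add(p)
        # The three sets must remain a partition of all processes (rule R1).
        assert self._down | self._live | self._uncertain == self.all

view = ProcessView({"p1", "p2", "p3"})
view._move("p2", view._uncertain)      # QoS on p2's channel degraded
```

Note that different processes can hold different ProcessView instances with different contents, exactly as the model allows.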
It is important to notice that our approach is orthogonal to the failure detector (FD) approach [6]. Given a problem P that cannot be solved in the asynchronous model, the FD approach is a computability approach: it consists of enriching the asynchronous model with properties that are sufficient for the problem to become solvable. Our approach is “engineering” oriented. It aims at benefiting from the fact that systems are built on QoS architectures, thereby allowing a process not to always consider the other processes as being in an “uncertain” state. As the proposed model includes the asynchronous model, it is still necessary to enrich it with appropriate mechanisms to solve problems
that are impossible to solve in pure time-free systems. To illustrate this point and evaluate the proposed approach, the paper considers the consensus problem as a benchmark problem. A consensus protocol is presented, that is based on the fact that each process pi is provided with 1) three sets downi, livei, and uncertaini with the previously defined semantics, and 2) an appropriate failure detector module (namely, ◊S). Interestingly, while ◊S-based consensus protocols in asynchronous systems require f < n/2, the proposed ◊S-based protocol allows bypassing this bound when |uncertaini| < n during the protocol execution.
Another interesting feature of the proposed programming model lies in the fact that it allows us to design generic protocols, in the sense that a protocol designed for that model “includes” both a synchronous protocol and an asynchronous protocol.
This paper is an extended and enhanced version of a previous publication that appeared in [14].
1.2 Related Work
Our approach relates to previous work on distributed system models and adaptiveness. The timed asynchronous model [8] considers asynchronous processes equipped with physical clocks. The concept of hybrid architectures has been explored where a distributed wormhole (i.e., a small synchronous part of the system) equipped with synchronized clocks provides timely services to the rest of the system (the asynchronous part) [27]. The concept of a secure and crash resilient wormhole has also been used to implement efficient solutions to consensus protocols subjected to Byzantine faults in the asynchronous part of the system [7].
Several works aimed to circumvent the impossibility of consensus in the asynchronous system model [12]. Minimal synchronism needed to solve consensus is addressed in [9]. Partial synchrony making consensus solvable is investigated in [10]. Finally, the failure detector approach and its application to the consensus problem have been introduced and investigated in [5], [6]. A general framework for consensus algorithms for the classical asynchronous system model enriched with failure detectors or random number generators is given in [16].
Many systems have addressed adaptiveness, including [24], [26], [19], [4]. AquA [24] provides adaptive fault tolerance to CORBA applications by replicating objects and providing a high-level method that an application can use to specify its desired level of reliability. AquA also provides an adaptive mechanism capable of coping with application dependability requirements at runtime. It does so by using a majority voting algorithm. Ensemble [26] offers a fault-tolerant mechanism that allows adaptation at runtime. This is used to dynamically adapt the components of a group communication service to switch between two total order algorithms (sequencer-based versus token-based) according to the desired system overhead and end-to-end latency for application messages.
The concept of real-time dependable channels (RTD) has been introduced in [19] to handle QoS adaptation at channel creation time. In their paper, the authors show how an application can customize an RTD channel according to specific QoS requirements, such as bounded delivery time,
reliable delivery, message ordering, and jitter (variance in message transmission time). Such QoS requirements are stated in terms of the probabilities related to specific QoS guarantees and are enforced by a composite protocol based on the real-time version of the Cactus system (CactusRT).
The work presented in [4] explicitly addresses the problem of dynamically adapting applications when a given negotiated QoS can no longer be delivered. In order to accomplish that, they defined a QoS coverage service that provides the so-called time-elastic applications (such as mission-critical or soft real-time systems) with the ability to dependably decide how to adapt time bounds in order to maintain a constant coverage level. Such a service is implemented over the wormhole Timely Computing Base or TCB. By using its duration measurement service, TCB is able to monitor the current QoS level (in terms of timeliness), allowing applications to dynamically adapt when a given QoS cannot be delivered.
Like [4], our work tackles QoS adaptability in terms of timeliness at runtime. However, we do not adapt time bounds to time-elastic applications. Instead, our hybrid programming model adapts to the available QoS of communication channels, providing applications with safe information on the current state of processes. The communication channel QoS can be timely, with guaranteed time bounds for message delivery, or untimely, otherwise.1
Moreover, we show how an adaptive consensus protocol can take advantage of the proposed hybrid model. As we shall see in Section 4, though we have our own implementation based on QoS architectures, our programming model could also be implemented on top of systems such as TCB and CactusRT.
1.3 Roadmap
The paper is made up of five sections. Section 2 introduces the model. Section 3 presents and proves the correctness of a consensus protocol suited to the model, and Section 4 describes an implementation of this model based on a distributed system architecture providing negotiated QoS guarantees. Finally, Section 5 concludes the paper.
2 THE HYBRID AND ADAPTIVE PROGRAMMING MODEL
This section describes the computation model offered to the upper-layer applications. These applications are generally distributed programs implementing middleware services. It is important to notice that the model offered to upper-layer applications is time-free in the sense that it offers them neither timing assumptions, nor time functions.
2.1 Asynchronous Distributed System with Process Crash Failures
We consider a system consisting of a finite set Π of n ≥ 2 processes, namely, Π = {p1, p2, ..., pn}. A process executes steps (a step is the reception of a message, the sending of a message, each with the corresponding local state change, or a simple local state change). A process can fail by crashing, i.e., by prematurely halting. After it has crashed, a process does not recover. It behaves correctly (i.e., according to its specification) until it (possibly) crashes. By definition, a process is correct in a run if it does not crash in that run. Otherwise, a process is faulty in the corresponding run. In the following, f denotes the maximum number of processes that can crash (1 ≤ f < n). Until it possibly crashes, the speed of a process is positive but arbitrary.
Processes communicate and synchronize by sending and receiving messages through channels. Every pair of processes (pi, pj) is connected by two directed channels, denoted pi → pj and pj → pi. Channels are assumed to be reliable: they do not create, alter, or lose messages. In particular, if pi is correct and sends a message to pj, then pj eventually receives that message unless pj fails. There is no assumption about the relative speed of processes or message transfer delays (let us observe that channels are not required to be FIFO). The primitive broadcast MSG(v) is a shortcut for: ∀pj ∈ Π do send MSG(v) to pj end (MSG is the message tag, v its value).
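The broadcast primitive defined above is nothing more than a loop of point-to-point sends. A minimal sketch (process names and the send callback are illustrative, not from the paper):

```python
def broadcast(sender, procs, tag, value, send):
    """'broadcast MSG(v)' is a shortcut for sending MSG(v) to every
    process in the system, the sender included."""
    for p in procs:
        send(sender, p, (tag, value))

# Usage: collect the point-to-point sends that one broadcast expands into.
outbox = []
broadcast("p1", ["p1", "p2", "p3"], "EST", 42,
          lambda src, dst, msg: outbox.append((src, dst, msg)))
```

With reliable channels, each of these sends is eventually received by its (correct) destination, which is all the model guarantees: no ordering, no bound on delay.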
The previous definitions describe what is usually called a time-free asynchronous distributed system prone to process crashes. We assume the existence of a global discrete clock. This clock is a fictional device that is not known by the processes; it is only used to state specifications or prove protocol properties. The range T of clock values is the set of natural numbers. Let F(t) be the set of processes that have crashed through time t. As a process that crashes does not recover, we have F(t) ⊆ F(t+1).
2.2 How a Process Sees the Other Processes: Three Sets per Process
A crucial issue encountered in distributed systems is the way each process perceives the state of the other processes. To that end, the proposed programming model provides each process pi with three sets denoted downi, livei, and uncertaini. The only thing a process pi can do with respect to these sets is to read the sets it is provided with; it cannot write them and has no access to the sets of the other processes.
These sets, which can evolve dynamically, are made up of process identities. Intuitively, the fact that a given process pj belongs to downi, livei, or uncertaini provides pi with some hint of the current status of pj. More operationally, if pj ∈ downi, pi can safely consider pj as being crashed. If pj ∉ downi, the state of pj is not known by pi with certainty: more precisely, if pj ∈ livei, pi is given a hint that it can currently consider pj as not crashed; when pj ∈ uncertaini, pi has no information on the current state (crashed or live) of pj.
At the abstraction level defining the computation model, these sets are defined by abstract properties (the way they are implemented is irrelevant at this level; it will be discussed in Section 4). The specification of the sets downi, livei, and uncertaini, 1 ≤ i ≤ n, is the following (where downi(t) is the value of downi at time t, and similarly for livei(t) and uncertaini(t)):
R0 Initial global consistency. Initially, the sets livei (respectively, downi and uncertaini) of all the
1. For ensuring such a distinct QoS guarantee, we assume that the underlying QoS infrastructure is capable of reserving resources and controlling admission of new communication flows related to newly created channels.
processes pi are identical. Namely, ∀i, j: livei(t) = livej(t), downi(t) = downj(t), and uncertaini(t) = uncertainj(t), for t = 0.

R1 Internal consistency. The sets of each pi define a
partition:
. ∀i: ∀t: downi(t) ∪ livei(t) ∪ uncertaini(t) = Π.
. ∀i: ∀t: any two of the sets (downi(t), livei(t), uncertaini(t)) have an empty intersection.

R2 Consistency of the downi sets.
. A downi set is never decreasing: ∀i: ∀t: downi(t) ⊆ downi(t+1).
. A downi set is always safe with respect to crashes: ∀i: ∀t: downi(t) ⊆ F(t).
R3 Local transitions for a process pi. While an upper-layer protocol is running, the only changes a process pi can observe are the moves of a process px from livei to downi or uncertaini.
R4 Consistent global transitions. The sets downi and uncertainj of any pair of processes pi and pj evolve consistently. More precisely:
. ∀i, j, k, t0: ((pk ∈ livei(t0)) ∧ (pk ∈ downi(t0 + 1))) ⇒ (∀t1 > t0: pk ∉ uncertainj(t1)).
. ∀i, j, k, t0: ((pk ∈ livei(t0)) ∧ (pk ∈ uncertaini(t0 + 1))) ⇒ (∀t1 > t0: pk ∉ downj(t1)).

R5 Conditional crash detection. If a process pj crashes
and there is a time after which it never appears in the uncertaini set of any other process pi, it eventually appears in the downi set of each pi. More precisely: ∀pi, if pj crashes at time t0, and there is a time t1 ≥ t0 such that ∀t2 ≥ t1 we have pj ∉ uncertaini(t2), then there is a time t3 ≥ t1 such that ∀t4 ≥ t3 we have pj ∈ downi(t4).

As we can see from this specification, at any time t and
for any pair of processes pi and pj, it is possible to have
livei(t) ≠ livej(t) (and similarly for the other sets). Operationally, this means that distinct processes can have
different views of the current state of each other process.
Let us also observe that downi is the only safe information
on the current state of the other processes that a process pi has. An example, involving nine processes, namely, Π = {p1, ..., p9}, is described in Table 1. The columns (respectively, rows) describe pi’s (respectively, pj’s) view of the three sets at some time t, e.g., downi(t) = {p1, p2, p5}, while livej(t) = {p5, p6}.
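Rules R1 and R2 lend themselves to a mechanical check on set snapshots. The sketch below is illustrative; the down set is taken from the Table 1 example, while the live and uncertain sets are made-up values that complete the partition:

```python
def check_R1(down, live, uncertain, all_procs):
    """R1: down, live, uncertain always partition the whole process set."""
    return (down | live | uncertain == all_procs
            and not (down & live)
            and not (down & uncertain)
            and not (live & uncertain))

def check_R2(down_before, down_after, crashed):
    """R2: a down set never decreases, and only contains crashed processes."""
    return down_before <= down_after and down_after <= crashed

procs = {f"p{k}" for k in range(1, 10)}
# down_i(t) = {p1, p2, p5} comes from the Table 1 example; the live and
# uncertain values below are illustrative and simply complete the partition.
down_i = {"p1", "p2", "p5"}
live_i = {"p3", "p4", "p6"}
uncertain_i = {"p7", "p8", "p9"}
```

A model implementation would maintain these invariants internally; upper-layer code can rely on them without checking.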
The rules [R0-R5] define a distributed programming
model that interestingly satisfies the following strong
consistency property. That property provides the processes
with a mutually consistent view on the possibility to detect
the crash of a given process. More specifically, if the crash of
a process pk is never known by pi (because pk continuously
belongs to uncertaini), then no process pj will detect the
crash of pk (by having pk ∈ downj). Conversely, if the crash of pk is known by pi (pk ∈ downi), the other processes will eventually know it too. More formally, we have:
Property 1: Mutual consistency.
∀i, j: ∀t1, t2: downi(t1) ∩ uncertainj(t2) = ∅.
Proof. The proof is by contradiction. Let us assume that ∃pi, pj, t1, t2 and a process p such that p ∈ downi(t1) ∩ uncertainj(t2).
We consider two cases:
. Case i = j.
- t1 = t2 is impossible, as it would violate the partitioning rule R1.
- t1 < t2 or t1 > t2 are also impossible, as they would violate the local transition rule R3.
. Case i ≠ j. First, let us observe that at the initial configuration (t = 0), p ∈ livei(0) and p ∈ livej(0), since the only possible transitions for p are the moves from live to down or to uncertain (R3) and processes see the same set formations (R0) at that stage, which always form a partition (R1).
Due to R3 and the initial assumption,
∀t3: ((t3 > t2) ∧ (t3 > t1)) ⇒ ((p ∈ downi(t3)) ∧ (p ∈ uncertainj(t3))).
Therefore, the formulas below follow, which contradict R4:
- ∃t′, t′ < t3 such that (p ∈ livei(t′) ∧ p ∈ downi(t′ + 1) ∧ p ∈ uncertainj(t3)), and
- ∃t″, t″ < t3 such that (p ∈ livej(t″) ∧ p ∈ uncertainj(t″ + 1) ∧ p ∈ downi(t3)). □
2.3 An Upgrading Rule
Let us consider two consecutive instances (runs) of an upper-layer protocol (e.g., the consensus protocol described in Section 3) built on top of the model defined by the rules R0-R5. Moreover, let t1 be the time at which the first instance terminates and t2 the time at which the second instance starts (t1 < t2). During any instance of the protocol, the set livei of a process can only downgrade (as no process can go from downi or uncertaini to livei). But, it is important to notice that nothing prevents the model from upgrading between consecutive instances, by moving a process px ∈ uncertaini(t1) into livei(t2) or downi(t2). Such an upgrade of the livei or downi sets between two runs of an upper-layer protocol corresponds to “synchronization” points during
TABLE 1. Example of down, live, and uncertain Sets
which the processes are allowed to renegotiate the quality of service of their channels (see Section 4).
R6 Upgrade. If no upper-layer protocol is running during a time interval [t1, t2], it is possible that, ∀i, processes be moved from uncertaini to livei or downi during that interval.
2.4 On the Wait Statement and Examples
Interestingly, the following proposition on the conditional termination of a wait statement can easily be derived from the previous computation model specification.
Proposition 1. Let us consider the following statement issued by a process pi at some time t0: “wait until ((a message from pj is received) ∨ (pj ∈ downi))” and let us assume that either the corresponding message has been sent by pj or pj has crashed.2 The model guarantees that this wait statement always terminates if there is a time t1 (t1 ≥ t0) such that ∀t2 ≥ t1 we have pj ∈ downi(t2) ∪ livei(t2).
This proposition is very important from a practical point of view. Said another way, it states that if, from some time on, pj never belongs to uncertaini, the wait statement always terminates. Equivalently, when, after some time, pj always remains in uncertaini or, alternately, belongs to livei and uncertaini, there is no guarantee on the termination of the wait statement (it can then terminate or never terminate).
To illustrate the model, let us consider here two particular “extreme” cases. The first is the case of synchronous distributed systems. As indicated in the Introduction, due to the upper bounds on processing times and message transfer delays, these systems correspond to the model where ∀i, ∀t, we have uncertaini(t) = ∅.
The second example is the case of fully asynchronous distributed systems where there is no time bound. In that case, given a process pj, a process pi can never know if pj has crashed or is only slow, or if its communication channels are very slow. This type of system does correspond to the model where we have ∀i, ∀t: uncertaini(t) = Π (or, equivalently, downi(t) = livei(t) = ∅). In these systems (as it appears in many asynchronous distributed algorithms), it is important to observe that, in order not to be blocked by a crashed process, a process pi can only issue “anonymous” wait statements such as “wait until (messages have been received from (n − f) processes).” In such a wait statement, pi relies only on the fact that at most f processes can crash. It does not wait for a message from a particular process (as in Proposition 1). This is an “anonymous” wait in the sense that the identities of the processes from which messages are received are not specified in the wait statement; these identities can be known only when the wait terminates.
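The two wait styles discussed above, the named wait of Proposition 1 and the anonymous wait on (n − f) messages, can be sketched as termination predicates that a runtime would poll. The function names and arguments are illustrative:

```python
def named_wait_done(received_from, pj, down_i):
    """'wait until ((a message from pj is received) or (pj in down_i))'.
    Proposition 1: this terminates whenever pj eventually stays out of
    uncertain_i (i.e., remains in down_i or live_i)."""
    return pj in received_from or pj in down_i

def anonymous_wait_done(received_from, n, f):
    """'wait until (messages have been received from (n - f) processes)'.
    The only safe wait in a fully asynchronous system, relying solely on
    the assumption that at most f processes crash."""
    return len(received_from) >= n - f
```

Note that the anonymous predicate never mentions process identities: which processes answered is known only once the wait terminates.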
2.5 The Hybrid and Adaptive Programming Model in Perspective
Our programming model provides the upper-layer applications with sufficient process state information (the sets) that can be used in order to adapt to the available system synchrony or QoS (in terms of timely and untimely channels), providing more efficient solutions to fault-tolerant problems when possible (this will be illustrated in the next section).
As a matter of fact, the implementation of fault-tolerant services, such as consensus [17], depends decisively on the existence of upper bounds for message transmission and process scheduling delays (i.e., timely channels and timely process executions). Such upper bounds can only be continuously guaranteed in synchronous systems. On the other hand, fault-tolerant services can also be guaranteed in some asynchronous or partially synchronous environments where the “synchronous behavior” is verified during sufficiently long periods of time, even if such a behavior is not continuously assured. Thus, such partially synchronous systems can alternate between “synchronous behavior,” with time bound guarantees, and “asynchronous behavior,” with no time bound guarantees. So, in some sense, these systems are hybrid in the time dimension. For example, the timed asynchronous system depends on a sufficiently long stability period to deliver the related fault-tolerant services, and the system behavior can alternate between stable and unstable periods (i.e., synchronous and asynchronous) [8]. The GST (Global Stabilization Time) assumption is necessary to guarantee that ◊S-based consensus protocols eventually deliver the proper service [6] (i.e., to work correctly, the system must eventually exhibit a “synchronous behavior”).
There are other models that consider not only hybridism in the time dimension, but also that the system is composed of a synchronous and an asynchronous part at the same time. So, we can regard such systems as being hybrid in the space dimension. This is the case of the TCB model, which relies on a synchronous wormhole to implement fault-tolerant services [27], and such a spatial hybridism persists without discontinuity in time.
We assume that our underlying system model is capable of providing distinct QoS communications, so that a given physical channel may transmit communication flows related to multiple virtual channels with distinct QoS (e.g., one timely channel and two untimely channels implemented in the same physical channel). Because timely and untimely channels may exist in parallel, our underlying system model can be hybrid in space. Moreover, untimely channels may exhibit a synchronous behavior in certain stable periods. Therefore, our underlying system model differs from all the above-mentioned models in the sense that it not only considers both the temporal and spatial hybridism dimensions (in this sense, similar to TCB), but also that the nature of such hybridism can change over time, with periods where there is no spatial hybridism. That is, the underlying system behavior can become totally asynchronous (i.e., uncertain = Π), given that in some circumstances, the QoS of timely channels can degrade (due to router failures, for instance).
As our programming model cannot ensure the conditions to solve consensus in circumstances where the system becomes totally asynchronous, we need a computability model to correctly specify the consensus algorithms. As
2. This means that we consider that the upper-layer distributed programs are well formed in the sense that they are deadlock-free.
discussed earlier, the failure detector approach, denoted FD, is such a computability model that we use to solve consensus (see Section 3.1). However, nothing prevents us from using our programming model with another oracle or computability assumptions (such as the eventual leader oracle (Ω) [5]) with adequate modifications in the consensus algorithm. That is why we consider the FD-based approach as being orthogonal to our approach.
In summary, partially synchronous models such as the GST model and timed asynchronous systems do not consider spatial hybridism. TCB is hybrid in space and time and its spatially hybrid nature persists continuously. Our model, in contrast, considers an underlying system that can also be hybrid in time and space; however, it can completely lose its synchronous part over time, requiring the corresponding adaptive mechanism. The price that has to be paid to allow the required adaptation is the implementation of QoS management mechanisms (provision and monitoring) capable of continuously assessing the QoS of the channels [1]. This is to some extent similar to what is realized by the fail-awareness service implemented in timed asynchronous systems, where the occurrence of too many performance failures raises an exception signal indicating that the underlying system can no longer satisfy the synchronous behavior [11]. The difference lies in the fact that in the timed asynchronous system, the signaling is not based on QoS management but on the violation of round-trip-time message bounds.
The approach presented in [16] focuses on the consensus problem in asynchronous message-passing systems. It proposes a general framework for failure detector-based consensus algorithms. Hence, its underlying model is the classical asynchronous model enriched with failure detectors. There is no notion of hybridism. The approach proposed here is different because it considers an underlying QoS infrastructure.
3 USING THE MODEL
As a relevant example of the way the previous programming model can benefit middleware designers, this section presents a consensus protocol built on top of this model. The consensus problem has been chosen to illustrate the way to use the model because of its generality, both practical and theoretical [17].
3.1 Enriching the Model to Solve Consensus
3.1.1 The Consensus Problem
In the consensus problem, every correct process pi proposes a value vi and all correct processes have to decide on the same value v, that has to be one of the proposed values. More precisely, the Consensus problem is defined by two safety properties (Validity and Uniform Agreement) and a Termination property [6], [12]:
. Validity: If a process decides v, then v was proposed by some process.
. Uniform Agreement: No two processes decide differently.
. Termination: Every correct process eventually decides on some value.
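For concreteness, the three properties can be read as checks over a finished run. The sketch below (all names are hypothetical) takes the proposals, the per-process decisions (None for a process that never decides), and the set of correct processes, and asserts validity, uniform agreement, and termination:

```python
def check_consensus(proposals, decisions, correct):
    """proposals/decisions: dict process -> value; correct: the set of
    processes that do not crash. decisions[p] is None if p never decides."""
    decided = {p: v for p, v in decisions.items() if v is not None}
    # Validity: every decided value was proposed by some process.
    assert all(v in proposals.values() for v in decided.values())
    # Uniform Agreement: no two processes (correct or not) decide differently.
    assert len(set(decided.values())) <= 1
    # Termination: every correct process eventually decides.
    assert all(decisions[p] is not None for p in correct)
    return True

# A run where p3 crashes before deciding: the two remaining decisions agree.
print(check_consensus({1: 'a', 2: 'b', 3: 'a'},
                      {1: 'a', 2: 'a', 3: None}, correct={1, 2}))  # True
```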
3.1.2 The Class ◇S of Failure Detectors
It is well known that the consensus problem cannot be solved in pure time-free asynchronous distributed systems [12]. So, these systems have to be equipped with additional power (as far as the detection of process crashes is concerned) so that consensus can be solved. Here, we consider that the system is augmented with a failure detector of the class denoted ◇S [6] (which has been shown to be the weakest class of failure detectors able to solve consensus despite asynchrony [5]).
A failure detector of the ◇S class is defined as follows [6]: Each process pi is provided with a set suspectedi that contains the processes suspected to have crashed. If pj ∈ suspectedi, we say "pi suspects pj." A failure detector providing such sets belongs to the class ◇S if it satisfies the following properties:
. Strong Completeness: Eventually, every process that crashes is permanently suspected by every correct process.
. Eventual Weak Accuracy: There is a time after which some correct process is never suspected by the correct processes.
As we can see, a failure detector of the class ◇S can make an infinite number of mistakes (e.g., by erroneously suspecting a correct process, in a repeated way).
3.2 A Consensus Protocol
The consensus protocol described in Fig. 1 adapts the ◇S-based protocol introduced in [21] to the computation model defined in Section 2. Its main difference lies in the waiting condition used in line 8. Moreover, while the algorithmic structure of the proposed protocol is close to the structure described in [21], due to the very different model it assumes, its proof (mainly in Lemma 1) is totally different from the one used in [21].
A process pi starts a consensus execution by invoking Consensus(vi), where vi is the value it proposes. This function is made up of two tasks, T1 (the main task) and T2. The processes proceed by consecutive asynchronous rounds. Each process pi manages two local variables whose scope is the whole execution, namely, ri (the current round number) and esti (the current estimate of the decision value), and two local variables whose scope is the current round, namely, auxi and reci. ⊥ denotes a default value which cannot be proposed by processes.
A round is made up of two phases (communication steps), and the first phase of each round r is managed by a coordinator pc (where c = (r mod n) + 1).
. First phase (lines 4-6). In this phase, the current round coordinator pc broadcasts its current estimate estc. A process pi that receives it keeps it in auxi; otherwise, pi sets auxi to ⊥. The test pc ∈ (suspectedi ∪ downi) is used to prevent pi from blocking forever.
It is important to notice that, at the end of the first phase, the following property holds:
∀i, j: (auxi ≠ ⊥ ∧ auxj ≠ ⊥) ⇒ (auxi = auxj = v).
GORENDER ET AL.: AN ADAPTIVE PROGRAMMING MODEL FOR FAULT-TOLERANT DISTRIBUTED COMPUTING 23

. Second phase (lines 7-13). During the second phase, the processes exchange their auxi values. Each process pi collects in reci such values from the processes belonging to the sets livei and uncertaini, as defined in the condition stated at line 8. As a consequence of the property holding at the end of the first phase and the condition of line 8, the following property holds at line 9:
∀i: reci = {v}, or reci = {v, ⊥}, or reci = {⊥}, where v = estc.
The proof of this property is the main part of the proof of the uniform agreement property (see Lemma 1). Then, according to the content of its set reci, pi either decides (case reci = {v}), adopts v as its new estimate (case reci = {v, ⊥}), or keeps its previous estimate (case reci = {⊥}). When it does not decide (the last two cases), pi proceeds to the next round.
As a process that decides stops participating in the sequence of rounds, and processes do not necessarily terminate in the same round, it is possible that processes proceeding to round r+1 wait forever for messages from processes that decided during r. The aim of the second task is to prevent such deadlocks by disseminating the decided value in a reliable way [18].

It is easy to see that the processes decide in a single round (two communication steps) when the first coordinator is neither crashed nor suspected. Moreover, it is important to notice that the value f (the upper bound on the number of faulty processes) does not appear in the protocol and, consequently, the protocol does not impose an a priori constraint on f. Actually, if the sets uncertaini always remain empty, the underlying failure detector becomes useless and the protocol solves consensus whatever the value of f.
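To make the round structure concrete, here is a deliberately simplified, failure-free simulation of the two-phase rounds described above (Fig. 1 itself is not reproduced in this text; all names are hypothetical, and the wait conditions, suspicions, and Task T2 are abstracted away):

```python
# A minimal, crash-free sketch of the round structure of the consensus
# protocol: coordinator broadcast (phase 1), exchange of aux values
# (phase 2), then the decide/adopt/keep rule on rec_i.
BOT = object()  # the default value "bot" that no process can propose

def run_consensus(proposals):
    """Synchronous, failure-free simulation: returns the decided values."""
    n = len(proposals)
    est = list(proposals)          # est[i]: current estimate of p_i
    decided = [None] * n
    r = 0
    while any(d is None for d in decided):
        c = r % n                  # round coordinator p_c
        # First phase: p_c broadcasts est[c]; with no crash and no
        # suspicion, every process receives it, so aux[i] = est[c].
        aux = [est[c]] * n
        # Second phase: processes exchange aux values; with no failures
        # every process receives all of them.
        for i in range(n):
            rec = set(aux)                     # rec_i
            if rec == {BOT}:
                pass                           # keep previous estimate
            elif BOT in rec:
                est[i] = (rec - {BOT}).pop()   # adopt v as new estimate
            else:
                decided[i] = rec.pop()         # rec_i = {v}: decide v
        r += 1
    return decided

print(run_consensus([3, 1, 2]))  # all decide the first coordinator's value
```

With no crashes and no suspicions, every process decides in the very first round, matching the "single round, two communication steps" remark above.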
3.3 Correctness Proof
The proofs for the uniform agreement and termination
properties, given by Theorems 1 and 2, respectively, follow.
The proof for validity is omitted here due to space
limitations, and can be found elsewhere [15].
Lemma 1. Let pi and pj be two processes that terminate line 8 of round r, and let v be the estimate value (estc) of the coordinator pc of r (if any). We have: 1) reci = {v}, or reci = {v, ⊥}, or reci = {⊥}; and 2) reci = {v} and recj = {⊥} are mutually exclusive.
Proof. Let us first consider a process pi that, during a round r, terminates the wait statement of line 8. As there is a single coordinator per round, the auxj values sent at line 7 by the processes participating in round r can only be either the estimate estc = v of the current round coordinator pc, or ⊥. Item 1 of the lemma follows immediately from this observation.

Let us now prove item 2, namely, that if both pi and pj terminate the wait statement at line 8 of round r, it is not possible to have reci = {v} and recj = {⊥} during that round. Let Qi (respectively, Qj) be the set of processes from which pi (respectively, pj) has received PHASE2(r, −) messages at line 8. The set reci (respectively, recj) is defined from the aux values carried by these messages. We claim that Qi ∩ Qj ≠ ∅. As any process in the intersection sends the same message to pi and pj, the lemma follows.
24 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 4, NO. 1, JANUARY-MARCH 2007
Fig. 1. Consensus protocol.
Proof of the claim. To prove that Qi ∩ Qj ≠ ∅, the reasoning is on their size. Let ti (respectively, tj) be the time at which the waiting condition evaluated by pi (respectively, pj) is satisfied. Let us observe that ti and tj can be different. To simplify the notation, we do not indicate the time at which a set is considered, i.e., downi, livei, and uncertaini denote the values of the corresponding sets at time ti (and similarly for pj's sets). Moreover, let maji be the majority subset of uncertaini that satisfies the condition evaluated at line 8, i.e., Qi = livei ∪ maji (notice that maji = ∅ when uncertaini = ∅). Similarly, let Qj = livej ∪ majj.
To prove Qi ∩ Qj ≠ ∅, we restrict our attention to the set of processes Π \ (downi ∪ downj). This is because the predicate evaluated at line 8 by pi (respectively, pj) involves only the processes in Π \ downi (respectively, Π \ downj). In the following, we denote with a prime symbol a set value used in the proof but known neither by pi nor by pj. So, down′i,j = downi ∪ downj, live′i = livei \ down′i,j, and live′j = livej \ down′i,j.
Let us observe that, due to (R1), we have live′i = livei \ downj and live′j = livej \ downi. Combining the previous set definitions with Property 1, we get the following property (E): live′i ∪ uncertaini = Π \ down′i,j = live′j ∪ uncertainj.
With these sets, we can now define two nonempty sets Q′i and Q′j as follows: Q′i = live′i ∪ maji and Q′j = live′j ∪ majj. From their very definition, we have Q′i = (live′i ∪ maji) ⊆ (livei ∪ maji) = Qi. Similarly, Q′j ⊆ Qj. As Q′i ∩ Q′j ≠ ∅ implies that Qi ∩ Qj ≠ ∅, the rest of the proof consists of showing that Q′i ∩ Q′j ≠ ∅. We consider three cases:
. Case 1: maji ≠ ∅ ∧ majj ≠ ∅. From the fact that Q′i is built from the universe of processes Π \ down′i,j and (E), we have the following:

|Q′i| = |live′i| + |maji| > |live′i| + |uncertaini|/2 ≥ (|live′i| + |uncertaini|)/2 = (|Π| − |down′i,j|)/2.

We have the same for pj, i.e., |Q′j| > (|Π| − |down′i,j|)/2. As both Q′i and Q′j are built from the same universe of processes (Π \ down′i,j) and each of them contains more than half of its elements, we have Q′i ∩ Q′j ≠ ∅.

. Case 2: maji = ∅ ∧ majj = ∅.
In that case, we have uncertaini = uncertainj = ∅. So, proving Q′i ∩ Q′j ≠ ∅ amounts to proving that live′i ∩ live′j ≠ ∅. Let us recall that ti (respectively, tj) is the time with respect to which downi is defined (respectively, downj), and let us assume, without loss of generality, that ti < tj. As pj has not crashed at tj (i.e., when it terminates the wait statement of line 8), we have (from R2) pj ∉ downi and pj ∉ downj, i.e., pj ∉ down′i,j, from which we conclude pj ∈ live′i and pj ∈ live′j. Hence, live′i ∩ live′j ≠ ∅, which proves the case.
. Case 3: maji ≠ ∅ ∧ majj = ∅. In that case, we have uncertaini ≠ ∅ and uncertainj = ∅. As uncertaini ∩ downj = ∅ (Property 1) and uncertainj = ∅, we have live′j = live′i ∪ uncertaini (Table 2 depicts these sets). Hence, uncertaini ⊆ live′j. As maji ⊆ uncertaini, we have maji ⊆ live′j, from which we conclude Q′i ∩ Q′j ≠ ∅. (The case maji = ∅ ∧ majj ≠ ∅ is proved by exchanging i and j.) End of the proof of the claim. □
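The counting argument of Case 1 can be sanity-checked on a small concrete configuration (an illustration only; the universe and the two live′/uncertain partitions are made up, with property (E) holding by construction):

```python
from itertools import combinations

U = set(range(7))   # the common universe Pi \ down'_{i,j} (property (E))

def quorums(uncertain):
    """All quorums live' U maj, where live' = U \\ uncertain and maj is
    any strict majority of uncertain (Case 1: uncertain is nonempty)."""
    live = U - uncertain
    need = len(uncertain) // 2 + 1
    return [live | set(m)
            for k in range(need, len(uncertain) + 1)
            for m in combinations(sorted(uncertain), k)]

# p_i and p_j may partition U differently into live' and uncertain:
Qi = quorums({3, 4, 5, 6})   # p_i's view: live' = {0, 1, 2}
Qj = quorums({0, 1, 2, 3})   # p_j's view: live' = {4, 5, 6}
assert all(len(q) > len(U) / 2 for q in Qi + Qj)   # the counting argument
assert all(qa & qb for qa in Qi for qb in Qj)      # hence all pairs intersect
print(len(Qi) * len(Qj), "pairs checked")
```

Each quorum contains more than half of the common universe, so any two of them must intersect, exactly as in the inequality above.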
Theorem 1. No two processes decide different values.
Proof. Let us first observe that a decided value is computed either in line 10 or in Task T2. However, we examine only the values decided at line 10, as a value decided in Task T2 has already been computed by some process in line 10.
Let r be the smallest round during which a process pi decides, and let v be the value it decides. As pi decides during r, we have reci = {v} during that round (line 10). Due to item 1 of Lemma 1, it follows that 1) v is the current value of the estimate of the coordinator of r, and 2) as there is a single coordinator per round, if another process pj decides during r, it decides the same value v.
Let us now consider a process pk that proceeds from round r to round r+1. Due to item 2 of Lemma 1, it follows that reck = {v, ⊥}, from which we conclude that the estimates of all the processes that proceed to r+1 are set to v at line 11. Consequently, no future coordinator will broadcast a value different from v, and agreement follows. □
Theorem 2. Let us assume that the system is equipped with ◇S and that, ∀i, ∀t, a majority of the processes in uncertaini(t) are not in F(t). Every correct process decides.
Proof. If a process decides, then all correct processes decide. This is an immediate consequence of the broadcast primitive used just before deciding at line 10. This broadcast (line 10 and Task T2) is such that, if a process executes it entirely, the broadcast message is received by all correct processes.
So, let us assume (by contradiction) that no process decides. We claim that no correct process blocks forever at line 5 or line 8. As a consequence, the correct processes eventually start executing rounds coordinated by a correct process (say, pc) that is no longer suspected to have crashed (this is due to the eventual weak accuracy of ◇S). Moreover, this process cannot appear in downi (R2). Let us consider such a round occurring after all the faulty processes have crashed. The correct process pc then sends the same non-⊥ value v to each correct process pi, which consequently executes auxi ← v (line 6). As no process blocks forever at line 8, and only the value v can be received, it follows that each correct process decides v at line 10.

TABLE 2. Case 3 in the Proof of the Claim.
Proof of the claim. No correct process blocks forever at line 5 or line 8. We consider each line separately. First, the fact that no process can block forever at line 5 follows directly from the fact that every crashed process eventually appears in suspectedi ∪ downi. Second, the fact that no process can block forever at line 8 follows from the following observations:
1. Item 2 of line 8 cannot entail a permanent blocking: this follows from the theorem assumption.
2. Let us now consider item 1 of line 8: pi could block if it waits for a message from pk ∈ livei and process pk has crashed before sending the corresponding message. If pk is later moved to uncertaini, it can no longer block pi; see 1). If pk is never moved to uncertaini, then it is eventually moved to downi (R5) and, consequently, pi will stop waiting for a message from pk. Hence, in both cases, a crashed process pk cannot prevent a correct process pi from progressing. End of the proof of the claim. □
3.4 Discussion
As presented in the Introduction and Section 2.4, this model includes two particular instances that have been deeply investigated. The first instance, namely, ∀i, t: uncertaini = Π, corresponds to the time-free asynchronous system model. When instantiated in such a particular model, the protocol described in Fig. 1 can be simplified by suppressing the set downi at line 5 and the set livei at line 8 (as they are now always empty). The assumption required in Theorem 2 becomes f < n/2, and the protocol becomes the ◇S-based consensus protocol described in [21]. It has been shown that f < n/2 is a necessary requirement in such a model [6]. In that sense, the proposed protocol is optimal.
The second "extreme" instance of the model is defined by ∀i, t: uncertaini = ∅, and includes the classic synchronous distributed system model. It is important to notice that systems with less synchrony than the classic synchronous systems can also be such that ∀i, ∀t, we have uncertaini(t) = ∅.3 As previously, the protocol of Fig. 1 can be simplified for this particular model. More precisely, the set suspectedi (line 5) and item 2 of line 8 can be suppressed. We then obtain an early deciding protocol that works for any number of process failures, i.e., for f < n. (It is natural to suppress the sets suspectedi in synchronous systems, as ◇S is only needed to cope with the net effect of asynchrony and failures.) Interestingly, the synchronous protocol we obtain is not the classical flood-set consensus protocol [20], [23], but a synchronous coordinator-based protocol.
A noteworthy feature of the protocol lies in its generic dimension: The same protocol can easily be instantiated in fully synchronous systems or fully asynchronous systems. Of course, these instantiations have different requirements on the value of f. A significant characteristic of the protocol is that it suits distributed systems that are neither fully synchronous nor fully asynchronous. The price that has to be paid then consists of equipping the system with a failure detector of the class ◇S. The benefit it brings lies in the fact that the constraint on f can be weaker than f < n/2.4
The rule R3 of our model restricts process identifiers from being moved from uncertain to live or from uncertain to down during a given run. As a consequence, we restrict the application from upgrading the QoS of the related channels from untimely to timely during the execution of a consensus. One question that could be raised, though, is what would happen if, during an execution of consensus, processes could renegotiate the QoS of channels from untimely to timely, therefore allowing processes to move from uncertain to live. It turns out that in such situations the required uniform agreement cannot be achieved. Take the simple example of a group with three processes p1, p2, and p3, all initially belonging to the set uncertain, as all channels are initially untimely. Now, consider that during a given consensus execution, processes p2 and p3 negotiate timely channels with p1, moving the three process identifiers from their local uncertain sets to their local live sets. Now, suppose that before the new negotiated QoS information reaches p1, p1 decides based on the quorum of two processes, p1 and p3 (at this stage, p1's view is that all processes belong to uncertain), and, after some time, both p1 and p3 crash. Some time later, process p2 detects the crash of p1 and also of p3 and decides based on its own initial value (for p2, all processes are in live). Under these conditions, there is no intersection between the decision quorums used by p1 and p2, so the value decided by p1 will be different from the value decided by p2, which violates the uniform agreement property.
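The scenario just described can be replayed with plain set arithmetic (an illustration; process names as in the text):

```python
P = {"p1", "p2", "p3"}

# p1's view when it decides: everyone is still in uncertain, so its
# quorum is a majority of uncertain, here {p1, p3}.
quorum_p1 = {"p1", "p3"}

# p2's view after the renegotiation: everyone is in live; once p1 and p3
# are detected as crashed and moved to down, p2 waits only for live \ down.
live_2, down_2 = P, {"p1", "p3"}
quorum_p2 = live_2 - down_2          # = {"p2"}

print(quorum_p1 & quorum_p2)         # empty: the two quorums do not intersect
```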
Regarding transitions from uncertain to down: if the state of a process p is unknown or uncertain in our model (which is what justifies p being in the uncertain set), it does not make sense to move p into down (which is considered safe information in our model, as defined by R2). So, in order to be moved from uncertain to down, p would first have to be moved from uncertain to live (so the situation is exactly as above, which also justifies R3). However, the same reasoning also applies if we allowed process identifiers to be moved directly from uncertain to down (just take the above example and consider moving p1 and p3 from uncertain2 into down2). As a matter of fact, without R3 one could not deduce Property 1 and, consequently, prove uniform agreement (see Lemma 1). Nevertheless, the above restriction does not prevent a process from taking advantage of a better QoS for its channels in future consensus executions (rule R6): It suffices that the new live and uncertain sets, obtained from QoS improvements, be set up in all cooperating processes before the next consensus execution. A simple way to realize that (without making use of extra mechanisms such as a membership protocol) is to send the new live set together with the coordinator's proposed value. Then, when the consensus is eventually decided, this new live set is equally installed in all processes (and its complement, the uncertain set).

3. It is sufficient that each process of the system is connected by a timely channel. For example, a system with four processes, two timely channels, and all the other channels untimely can be such that uncertaini(t) = ∅.

4. The protocols designed for asynchronous systems we are aware of require 1) ◇S (or a failure detector that has the same computability power, e.g., a leader oracle [5]), and 2) the upper bound f < n/2 on the number of process crashes. Our protocol has the same requirement for item 1, but a weaker one for item 2.
4 IMPLEMENTATION OF THE ADAPTIVE DISTRIBUTED COMPUTING MODEL
Implementing our hybrid programming model requires some basic facilities, such as the provision and monitoring of QoS communications with both bounded and unbounded delivery times, and also a mechanism to adapt the system when timely channels can no longer be guaranteed due to failures.
Our programming model builds on facilities typically encountered in QoS architectures [1], such as Omega [22], QoS-A [3], Quartz [25], and Differentiated Services [2]. In particular, we assume that the underlying system is capable of providing timely communication channels (akin to services such as QoS hard [3], deterministic [25], and Express Forward [2]). Similarly, we assume the existence of best-effort channels where messages are transmitted without guaranteed bounded time delays. We call these channels untimely. QoS monitoring and fail-awareness have been implemented by the QoS Provider (QoSP) and the failure and state detector mechanisms, briefly presented below. It was a design decision to build our model on top of a QoS-based system. However, we could also have implemented our programming model based on facilities encountered in existing hybrid architectures: For instance, timely channels could be implemented using RTD channels by setting the probabilities Pd (deadline probability) and Pr (reliability probability) close to one, and untimely channels could be implemented with a basic channel without any guarantees [19]. The timing failure detection service of the TCB [4] could then complement the required functionality.
The QoS-based underlying distributed system we consider is a set of n processes Π = {p1, ..., pn}, located in one or more sites, communicating through a set χ of n(n−1)/2 channels, where ci/j denotes a communication channel between pi and pj. That is, the system is represented by a complete graph DS(Π, χ), where Π are the nodes and χ the edges of the graph.
We assume that processes in Π are equipped with enough computational power so that the time necessary to process the messages originated by the implemented model (i.e., the state detector, the failure detector, and the QoS Provider) is negligibly small compared with network delays. Therefore, such messages are assumed to be promptly computed.5 Moreover, the processes are assumed to fail only by crashing, and the network is not partitionable.
We do not distinguish channels for upper-layer messages (e.g., generated by the consensus protocol) from channels for messages generated by the state and failure detectors. These messages are transmitted on the same channels. Regarding the QoSP, as will be seen in the next section, there is a unique module in each site of the system. When the system is initiated, the QoSP modules try to establish timely channels between them. However, if resources are not available for timely channels, untimely channels are generated.
4.1 The QoS Provider
In order to make our programming model portable to distinct QoS architectures, we define a number of functions to be encapsulated in a software device we call the QoS Provider (QoSP), for creating and assessing QoS communication channels on behalf of application processes. Thanks to this modular encapsulation, porting our system to a given QoS infrastructure (such as DiffServ [2]) means implementing the QoSP functions in the new target environment. The QoSP is made up of a module in each site of the system. The basic data structure maintained by each module is a table holding information about existing channels. These modules exchange messages to carry out modifications to the QoS of channels (due to failures or application requests).
This section describes its main functionalities, which are needed for implementing our programming model. (The complete description of the QoS Provider is beyond the scope of this paper; more details can be found elsewhere [13], [15].) Processes interact with the QoS Provider through the following functions:

CreateChannel(px, py) : Π² → χ;
DefineQoS(px, py, qos) : Π² × {timely, untimely} → {timely, untimely};
QoS(px, py) : Π² → {timely, untimely}; and
Delay(px, py) : Π² → ℕ⁺.
The functions CreateChannel(), DefineQoS(), QoS(), and Delay() are used for creating a channel, changing its QoS (with admission test and resource reservation), obtaining its current QoS (monitoring), and obtaining the expected delay, in milliseconds, for message transfer over the channel cx/y, respectively. Besides the above functions, each QoSP module continuously monitors all timely channels linked to the related site, to check whether failures or lack of resources have resulted in a modification of the channel QoS (from timely to untimely). A particular case that the QoSP also assesses is the existence of a timely channel on which no message flows within a period of time, which indicates that the timely channel is possibly linking two crashed processes (in this circumstance, the QoS of the channel is modified to untimely in order to release resources). Modifications to the QoS of channels are immediately reported to the state detectors related to the processes linked by such channels, through messages changeQoS(px, py, newQoS), indicating that the QoS of the channel cx/y has been modified to newQoS (see Section 4.2.1).
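As a rough sketch of how the four functions might be exposed by a per-site QoSP module (the class, its internals, and the concrete delay values are hypothetical; only the four operations and their meaning come from the text above):

```python
from typing import Dict, Tuple

TIMELY, UNTIMELY = "timely", "untimely"

class QoSProvider:
    """Hypothetical per-site QoSP module: a table of channels and their
    current QoS. Real admission tests, resource reservation, and the
    inter-module message exchange are abstracted away."""

    def __init__(self, reserved_delay_ms: int = 10):
        self.channels: Dict[Tuple[str, str], str] = {}
        self.reserved_delay_ms = reserved_delay_ms

    def _key(self, px, py):
        return tuple(sorted((px, py)))   # the channel c_{x/y} is undirected

    def create_channel(self, px, py):
        # New channels start untimely until a QoS is negotiated.
        self.channels.setdefault(self._key(px, py), UNTIMELY)
        return self._key(px, py)

    def define_qos(self, px, py, qos):
        # Admission-test stub: assume the reservation always succeeds.
        self.channels[self._key(px, py)] = qos
        return self.channels[self._key(px, py)]

    def qos(self, px, py):
        return self.channels[self._key(px, py)]

    def delay(self, px, py):
        # Expected transfer delay in ms: the reserved bound for a timely
        # channel, a coarse estimate for an untimely one.
        return self.reserved_delay_ms if self.qos(px, py) == TIMELY else 500

qosp = QoSProvider()
qosp.create_channel("p1", "p2")
print(qosp.define_qos("p1", "p2", TIMELY), qosp.delay("p1", "p2"))  # timely 10
```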
5. This assumption can be relaxed by using real-time operating systems such as CactusRT [19], which can provide bounded process execution times.
When an application process px wants to create a timely channel with the process py, it does so by first invoking the CreateChannel(px, py) function and then the DefineQoS(px, py, timely) function. If these functions succeed (i.e., the channel is created and the related QoS is set to timely), and assuming that px and py do not crash, the newly created channel will remain timely during the system execution, unless either the DefineQoS(px, py, untimely) function is explicitly invoked or a fault in the communication path linking px and py makes it impossible to still guarantee the previously reserved resources for this channel. So, degrading QoS is not just a matter of increasing communication delays, but of losing reservation guarantees due to failures. As an example of a fault, consider the crash of a QoS router in the path linking px and py, leading the communication flow through another router that has no resources reserved for that channel. In this situation, communication between px and py is preserved but with a degraded QoS (i.e., untimely). The monitoring mechanism available in the underlying QoS infrastructure signals the loss of QoS to the QoS Provider, which changes the QoS of the channel (px, py) from timely to untimely. This monitoring mechanism is a sort of fail-awareness scheme; however, it is different from the kind implemented in the timed-asynchronous model [11], which has no relation with resource reservation, as discussed above. In particular, notice that in our underlying system model, degrading QoS does not necessarily mean increasing communication delays. The contrary is, however, always true. That is, if a timeout expires for a timely channel, either the communicating process crashed or the QoS was downgraded.
Therefore, our failure detector mechanism (described in Section 4.2.2) uses the expiration of timeouts, based on previously negotiated time delays for timely channels, as a safe indication of a crash. However, the crash is only confirmed if the QoS infrastructure assures that the related channel QoS has not been lost due to a failure or deliberately downgraded by the application.
When a process crashes, the QoS Provider can still give information about the QoS of the channels linked to that crashed process, because the communication resources are still reserved and the QoSP module maintains information about these channels. However, if the site hosting a process crashes, all the channels allocated to processes in this site are destroyed. If a QoS Provider module crashes, all related channels have their QoS changed to untimely.
It should be observed at this point that, at runtime initialization, the modules of the QoS Provider try to create timely channels to connect one another. If timely channels are successfully created for the QoSP modules, then timeouts are used to detect the crash of a QoSP module. Otherwise, untimely channels are created for the QoSP modules and, as a consequence, upper-layer applications will not be able to create any timely channel either. In other words, application-related timely channels can only be created if and after the timely channels for the related QoSP modules exist. Another important observation is that, if a QoSP module loses the QoS of its channels, the same will happen for all applications with channels in the related site; hence, the situation of having timely channels for applications and untimely channels for the related QoSP will never happen.
If a given QoS Provider module cannot deliver information about a given channel (possibly because the related QoSP module has crashed), this channel is then assumed to be untimely (which may represent a change in its previous QoS condition).

On the other hand, the crash of a QoSP module Mx can only be detected by another QoSP module My if there is a timely channel linking Mx to My. The creation and monitoring of such a timely channel is realized in the same way as for the application process channels, that is, by using the underlying QoS infrastructure facilities. After the expiration of an expected timeout, the underlying QoS infrastructure is queried to verify whether the QoS for the channel linking Mx and My remains timely; in this case, My detects the crash of Mx. Otherwise, the channel linking Mx and My is made untimely, and all application channels for Mx and My are thereafter made untimely as well.
4.2 The Programming Model Implementation
In our programming model, distributed processes perceive each other's state by reading the contents of the sets down, live, and uncertain. These sets can evolve dynamically, following system state changes, while respecting the rules R0 to R5. Therefore, implementing our model implies providing the necessary mechanisms to maintain the sets according to their semantics. Two mechanisms have been developed to this end:
. a state detector, which is responsible for maintaining the sets live and uncertain, in accordance with the information delivered by the QoSP, and
. a failure detector, which utilizes the information provided by both the QoSP and the state detector to detect crashes and update the down sets accordingly.
Associated with each process pi there is a module of the state detector, a module of the failure detector, a representation of the DS(Π, χ) graph, and the three sets livei, uncertaini, and downi. The DS(Π, χ) graph is constructed by using the QoSP functions CreateChannel() and DefineQoS(), for creating channels according to the QoS required and the resources available in the system. The modules of the state detector exchange the information about the created channels so that they keep identical DS(Π, χ) graphs during the system initialization phase.6
During the initialization phase, the set downi is set to empty, and the contents of livei and uncertaini are initialized so that the identity of a process pj is placed into livei if and only if there is a timely channel linking pj to another process (i.e., ∃px ∈ Π such that DSi(pj, px) = timely). Otherwise, the identity of pj is placed in uncertaini. When the initialization phase ends, processes in Π observe identical contents for their respective live, down, and uncertain sets, and a given process is either in live or in uncertain (ensuring, therefore, the restrictions R0 and R1 of our model).
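A minimal sketch of this initialization rule (function and variable names are hypothetical; ds stands for the DSi channel-QoS graph):

```python
def init_sets(processes, ds):
    """ds: dict mapping frozenset({p, q}) -> 'timely' | 'untimely'.
    A process is placed in live iff some channel linking it is timely."""
    live = {p for p in processes
            if any(ds.get(frozenset({p, q})) == "timely"
                   for q in processes if q != p)}
    uncertain = set(processes) - live
    down = set()                     # empty at the end of initialization
    return live, uncertain, down

# Four processes and two timely channels (the example of footnote 3):
ds = {frozenset({"p1", "p2"}): "timely",
      frozenset({"p3", "p4"}): "timely",
      frozenset({"p1", "p3"}): "untimely"}
live, uncertain, down = init_sets(["p1", "p2", "p3", "p4"], ds)
print(sorted(live), sorted(uncertain))  # ['p1', 'p2', 'p3', 'p4'] []
```

Every process has at least one timely channel, so uncertain comes out empty, matching the footnote's observation that such a system behaves as the ∀i, t: uncertaini(t) = ∅ instance.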
6. In the present prototype version, we use a centralized entity (process) and a distributed transaction to establish the initial graph (it is like reading the same information from a shared file). Of course, the system is only initialized if the transaction succeeds without failures (or suspicions thereof). Another option that we intend to explore is to use our consensus algorithm over the QoSP modules to define the initial configuration.
During the application execution, the contents of the three sets are dynamically modified according to the occurrence of failures and/or QoS modifications. Next, the implemented mechanisms in charge of updating the contents of live and uncertain (the state detector) and the contents of down (the failure detector) are described.
4.2.1 An Implementation of the State Detector
There is a module of the state detector for each process in
�. The module of the state detector associated with piexecutes two concurrent tasks to update the DS graph
maintained by pi.The first task is activated by messages
changeQoSðpi; px; newQoSÞ
from the local module of QoSP, indicating that the QoS of
the channel linking pi to px has been modified. Upon
receiving the changeQoS message, the state detector of pifirst verifies whether px is in the downi set, and in this case,
it terminates the task (this is necessary to guarantee R4).
Otherwise, it passes on the information of the new QoS of
the channel ci=x to the remote modules of the state detector,
and the local DS graph is updated accordingly (i.e.,
DSiðpi; pxÞ is set to NewQoS).The second task is activated when the failure detector
communicates the crash of a process px (see details in the
next section). The goal of this task is to check whether there
is a process in the live set, say py, that had a timely channel
to the crashed process. If the channel cx=y is the only timely
channel to py, it can no longer be detectable and therefore
must be moved from live to uncertain. This is realized by
setting all channels linked to the crashed process as
untimely in the DS graph.In both tasks, after updating the DS graph, the
procedure UpdateState(), described in Fig. 2, is called for
each px linked to a modified channel, to update the sets livei and uncertaini accordingly. Process px is moved from livei to uncertaini if no timely channel linking px is left (lines 1-4); px is moved from uncertaini to livei if a new timely channel linking px has been created (lines 6-8).7
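As a concrete sketch, the procedure can be expressed as follows, assuming the DS graph is encoded as a dictionary mapping an unordered pair of processes to the QoS class of their channel (this encoding, and all identifiers, are illustrative rather than the prototype's actual data structures):

```python
def update_state(ds, px, live, uncertain, procs):
    """Sketch of UpdateState() for process px (cf. Fig. 2).

    ds: assumed DS-graph encoding, mapping frozenset({a, b}) to
        "timely" or "untimely" for the channel linking a and b.
    live, uncertain: the caller's local (disjoint) sets of process ids.
    """
    # Does px still have at least one timely channel to some other process?
    timely = any(ds.get(frozenset((px, py))) == "timely"
                 for py in procs if py != px)
    if px in live and not timely:
        # lines 1-4: no timely channel linking px is left
        live.discard(px)
        uncertain.add(px)
    elif px in uncertain and timely:
        # lines 6-8: a timely channel linking px has been created
        uncertain.discard(px)
        live.add(px)
```

Calling the procedure once per process linked to a modified channel, as described above, keeps live and uncertain disjoint, since each call places a process in at most one of the two sets.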
4.2.2 An Implementation of the Failure Detector
Besides maintaining the set downi, the failure detector also maintains the set suspectedi, which keeps the identities of processes suspected of having crashed. A process pi interacts with the failure detector by accessing these sets.
The failure detector works in a pull model, where each module (working on behalf of a process px) periodically sends "are you alive?" messages to the other modules related to the other processes in Π. The timeout value used for awaiting "I am alive" messages from a monitored process py is calculated using the QoSP function Delay(px, py). The timeout includes the so-called round-trip time (rtt),8 plus a safety margin to account for the time needed to process these messages at px and py.
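The probing step and its timeout computation can be sketched as follows; here delay stands in for the QoSP function Delay(px, py) (assumed to return the expected rtt), send is an assumed transport callback, and the safety margin is passed explicitly — all names are illustrative:

```python
import time

def probe(procs, me, timeouts, delay, send, margin=0.05):
    """Send "are you alive?" probes to every other process and arm
    the corresponding reply timeouts (a sketch of the pull model).

    timeouts: dict mapping each monitored process to the local time
    by which its "I am alive" reply is expected (rtt + safety margin).
    """
    now = time.monotonic()
    for py in procs:
        if py == me:
            continue
        timeouts[py] = now + delay(me, py) + margin
        send(py, ("are you alive?", me))
```

A monitoring loop would invoke probe once every monitoring interval and check the armed timeouts against the current local time.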
For timely channels, the calculated timeout is accurate in the sense that network and system resources and the related scheduling mechanisms guarantee the rtt within a bounded limit. Therefore, the expiration of the timeout is an accurate indication that py crashed and, in that case, py is moved from livex to downx. To account for a possible modification of the QoS of the channel cx/y, before producing a notification the failure detector checks whether the channel has remained timely, using the QoSP function QoS(px, py).9 On the other hand, if the channel linking px and py is untimely, the expiration of the timeout is only a hint of a possible crash and, in that case, besides belonging to uncertainx, py is also included in the set suspectedx.
The algorithm for the failure detector of a process pi, described in Fig. 3, is composed of five parallel tasks. The parameter monitoringInterval indicates the time interval between two consecutive "are you alive?" messages sent by pi. The array timeouti[1..n] holds the calculated timeout for pi to receive the next "I am alive" message from each process in Π. The function CTi() returns the current local time. Task 1 periodically sends an "are you alive?" message to all processes (actually, to the related failure detector modules) after setting a timeout value to receive the corresponding "I am alive" reply, which in turn is sent by Task 5. Task 2 assesses the expiration of timeouts, and it sends notification messages when the timeouts expire for processes in livei, moving them into the downi set. Otherwise, if the timeout
Fig. 2. Algorithm to update the sets livei and uncertaini.
7. Observe that as applications are not allowed to upgrade the QoS of channels during a run (R3), the else part of this algorithm will only be executed between runs (R6).
8. That is the time to transfer the "are you alive?" message from px to py plus the time to transfer the "I am alive" message from py to px.
9. One should observe here that the QoS Provider holds the information and resources related to a given channel even after the crashes of the processes linked by that channel.
expires for processes in the uncertaini set, their identities are also included in the suspectedi set. Task 3, upon receiving the "I am alive" message from a given process, cancels the corresponding timeout and, if that process belongs to the suspectedi set, removes it from that set. Task 4 handles crash notification messages and updates the sets downi and livei accordingly.
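The timeout-expiry branch of Task 2 can be sketched as below; qos stands in for the QoSP function QoS(px, py), and the actual sending of crash notifications is left to the caller (all identifiers are illustrative):

```python
def on_timeout_expiry(py, now, timeouts, live, uncertain, down, suspected, qos):
    """One expiry check of Task 2 for a monitored process py (sketch).

    qos(py) -> "timely" or "untimely": current QoS of the channel to py.
    Returns "notify" when a crash notification should be broadcast.
    """
    if now < timeouts.get(py, float("inf")):
        return None                 # timeout has not expired yet
    if py in live and qos(py) == "timely":
        # Accurate detection: the bounded rtt guarantees py has crashed
        # (the channel is re-checked in case its QoS was downgraded).
        live.discard(py)
        down.add(py)
        return "notify"             # caller broadcasts the crash notification
    if py in uncertain:
        suspected.add(py)           # only a hint on an untimely channel
    return None
```

Note the asymmetry described above: an expiry on a timely channel is treated as an accurate crash detection, whereas on an untimely channel it merely adds py to the suspected set.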
5 CONCLUSION
This paper proposed and fully developed an adaptive programming model for fault-tolerant distributed computing. Our model provides upper-layer applications with process state information according to the current system synchrony (or QoS). The underlying system model is hybrid, comprised of a synchronous part and an asynchronous part. However, such a composition can vary over time, in such a way that the system may become totally synchronous or totally asynchronous.
The programming model is given by three sets (processes perceive each other's states by accessing the contents of their local nonintersecting sets: uncertain, live, and down) and the rules R0-R6 that regulate modifications to the sets. Moreover, we showed how those rules and the sets can be implemented in real systems.
To illustrate the adaptiveness of our model, we developed a consensus algorithm that makes progress despite distinct views of the corresponding local sets, and that tolerates more faults the more processes are in the live set. The presented consensus algorithm adapts to the currently available QoS (via the sets) and uses the majority assumption only when needed.
To the best of our knowledge, there is no other system that implements the functionality we have just described, taking advantage of the available QoS to provide process state information that can be exploited to yield efficient solutions when possible. For example, in some circumstances the live set can include all system processes even if the synchrony of the underlying system is not that of classic synchronous systems (e.g., a system with four processes, two timely channels, and all other channels untimely can be in this situation). Under these conditions, if there is no QoS degradation, our protocol tolerates f < n process faults. On the other hand, our system allows for graceful degradation when the underlying QoS degrades, making it possible to circumvent the f < n/2 lower bound associated with the consensus problem in asynchronous systems equipped with eventual failure detectors, provided |uncertain| < n during the execution.
An implementation of the model on top of a QoS infrastructure has been presented. In order to specify the underlying functionality needed to implement it, a mechanism (called the QoS Provider) has been developed and implemented. Thanks to this modularity of the approach, porting the model implementation to a given environment only requires implementing the QoS Provider functions that have been defined. The proposed system has been implemented in JAVA and tested over a set of networked LINUX workstations equipped with QoS capabilities. More details of this implementation can be found elsewhere [15].
ACKNOWLEDGMENTS
The authors are grateful to the anonymous reviewers for their constructive comments.
Fig. 3. Algorithm for the failure detector module ðpiÞ.
REFERENCES
[1] C. Aurrecoechea, A.T. Campbell, and L. Hauw, "A Survey of QoS Architectures," ACM Multimedia Systems J., special issue on QoS architecture, vol. 6, no. 3, pp. 138-151, May 1998.
[2] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An Architecture for Differentiated Services," RFC 2475, Dec. 1998.
[3] A. Campbell, G. Coulson, and D. Hutchison, "A Quality of Service Architecture," ACM Computer Comm. Rev., vol. 24, no. 2, pp. 6-27, Apr. 1994.
[4] A. Casimiro and P. Veríssimo, "Using the Timely Computing Base for Dependable QoS Adaptation," Proc. 20th Symp. Reliable Distributed Systems (SRDS), pp. 208-217, Oct. 2001.
[5] T.D. Chandra, V. Hadzilacos, and S. Toueg, "The Weakest Failure Detector for Solving Consensus," J. ACM, vol. 43, no. 4, pp. 685-722, July 1996.
[6] T.D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," J. ACM, vol. 43, no. 2, pp. 225-267, Mar. 1996.
[7] M. Correia, N. Neves, L. Lung, and P. Veríssimo, "Low Complexity Byzantine-Resilient Consensus," Distributed Computing, vol. 17, no. 3, pp. 237-249, Mar. 2005.
[8] F. Cristian and C. Fetzer, "The Timed Asynchronous Distributed System Model," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 6, pp. 642-657, June 1999.
[9] D. Dolev, C. Dwork, and L. Stockmeyer, "On the Minimal Synchronism Needed for Distributed Consensus," J. ACM, vol. 34, no. 1, pp. 77-97, Jan. 1987.
[10] C. Dwork, N. Lynch, and L. Stockmeyer, "Consensus in the Presence of Partial Synchrony," J. ACM, vol. 35, no. 2, pp. 288-323, Apr. 1988.
[11] C. Fetzer and F. Cristian, "Fail-Awareness in Timed Asynchronous Systems," Proc. 15th ACM Symp. Principles of Distributed Computing (PODC '96), pp. 314-321, May 1996.
[12] M.J. Fischer, N. Lynch, and M.S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," J. ACM, vol. 32, no. 2, pp. 374-382, Apr. 1985.
[13] S. Gorender and R. Macedo, "Fault-Tolerance in Networks with QoS," Technical Report RT001/03, Distributed System Laboratory (LaSiD), UFBA (in Portuguese), 2003.
[14] S. Gorender, R. Macedo, and M. Raynal, "A Hybrid and Adaptive Model for Fault-Tolerant Distributed Computing," Proc. IEEE Int'l Conf. Dependable Systems and Networks (DSN '05), pp. 412-421, June 2005.
[15] S. Gorender, R. Macedo, and M. Raynal, "A Hybrid Model for Fault-Tolerant Distributed Computing," Technical Report RT001/06, Distributed System Laboratory (LaSiD), UFBA, 2006.
[16] R. Guerraoui and M. Raynal, "The Information Structure of Indulgent Consensus," IEEE Trans. Computers, vol. 53, no. 4, pp. 453-466, Apr. 2004.
[17] R. Guerraoui and A. Schiper, "The Generic Consensus Service," IEEE Trans. Software Eng., vol. 27, no. 1, pp. 29-41, Jan. 2001.
[18] V. Hadzilacos and S. Toueg, "Fault-Tolerant Broadcasts and Related Problems," Distributed Systems, S. Mullender, ed., pp. 97-145, ACM Press, 1993.
[19] M. Hiltunen, R. Schlichting, X. Han, M. Cardozo, and R. Das, "Real-Time Dependable Channels: Customizing QoS Attributes for Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 6, pp. 600-612, June 1999.
[20] N. Lynch, Distributed Algorithms, p. 872. Morgan Kaufmann, 1996.
[21] A. Mostefaoui and M. Raynal, "Solving Consensus Using Chandra-Toueg's Unreliable Failure Detectors: A General Quorum-Based Approach," Proc. 13th Symp. Distributed Computing (DISC '99), pp. 49-63, Sept. 1999.
[22] K. Nahrstedt and J.M. Smith, "The QoS Broker," IEEE Multimedia, vol. 2, no. 1, pp. 53-67, 1995.
[23] M. Raynal, "Consensus in Synchronous Systems: A Concise Guided Tour," Proc. Ninth IEEE Pacific Rim Int'l Symp. Dependable Computing (PRDC '02), pp. 221-228, Dec. 2002.
[24] Y. Ren, M. Cukier, and W.H. Sanders, "An Adaptive Algorithm for Tolerating Value Faults and Crash Failures," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 2, pp. 173-192, Feb. 2001.
[25] F. Siqueira and V. Cahill, "Quartz: A QoS Architecture for Open Systems," Proc. 18th Brazilian Symp. Computer Networks, pp. 553-568, May 2000.
[26] R. van Renesse, K. Birman, M. Hayden, A. Vaysburd, and D. Karr, "Building Adaptive Systems Using Ensemble," Software Practice and Experience, vol. 28, no. 9, pp. 963-979, July 1998.
[27] P. Veríssimo and A. Casimiro, "The Timely Computing Base Model and Architecture," IEEE Trans. Computers, special issue on asynchronous real-time systems, vol. 51, no. 8, pp. 916-930, Aug. 2002.
Sergio Gorender received the BSc degree in computer science from the Federal University of Bahia (UFBA), Brazil, in 1991, and the MSc and PhD degrees, also in computer science, from the Federal University of Pernambuco (UFPE), Brazil, in 1995 and 2005, respectively. Since 1996, he has been a lecturer in the Department of Computer Science at UFBA and, since 2000, he has been a researcher at LaSiD, the Distributed System Laboratory at the same university. His research interests include the many aspects of dependable distributed systems.
Raimundo Jose de Araujo Macedo received the BSc, MSc, and PhD degrees in computer science from the Federal University of Bahia (UFBA), Brazil, the University of Campinas (Unicamp), Brazil, and the University of Newcastle upon Tyne, England, respectively. Since 2000, he has been a professor of computer science at UFBA, where he founded the Distributed Systems Laboratory (LaSiD) in 1995. Currently, he is the head of LaSiD and of the newly created Doctorate Program on Computer Science. Formerly, he coordinated the Graduate Program on Mechatronics at UFBA. His research interests include the many aspects of dependable distributed systems and real-time systems. He has served as a program committee member for a number of conferences, including the IEEE/IFIP International Dependable Systems and Networks Conference (DSN), the ACM/IFIP/USENIX International Middleware Conference, the Brazilian Symposium on Computer Networks and Distributed Systems (SBRC), and the Latin-American Dependable Computing Symposium (LADC). He was the program chair and general chair of SBRC in 2002 and 1999, respectively. He is a member of the IEEE Computer Society.
Michel Raynal has been a professor of computer science since 1981. At IRISA (the CNRS-INRIA-University joint computing research laboratory located in Rennes), he founded a research group on distributed algorithms in 1983. His research interests include distributed algorithms, distributed computing systems, networks, and dependability. His main interest lies in the fundamental principles that underlie the design and the construction of distributed computing systems. He has been the principal investigator of a number of research grants in these areas and has been invited by many universities all over the world to give lectures on distributed algorithms and distributed computing. He belongs to the editorial boards of several international journals. He has published more than 100 papers in journals (Journal of the ACM, Acta Informatica, Distributed Computing, Communications of the ACM, etc.) and more than 200 papers in conferences (ACM STOC, ACM PODC, ACM SPAA, IEEE ICDCS, etc.). He has also written seven books devoted to parallelism, distributed algorithms, and systems (MIT Press and Wiley). He has served on the program committees of more than 70 international conferences (including ACM PODC, DISC, ICDCS, IPDPS, DSN, LADC, SRDS, and SIROCCO) and chaired the program committees of more than 15 international conferences (including DISC (twice), ICDCS, SIROCCO, and ISORC). Professor Raynal served as the chair of the steering committee leading the DISC symposium series from 2002 to 2004. He received the IEEE ICDCS Best Paper Award three times in a row: 1999, 2000, and 2001. Recently, he cochaired SIROCCO 2005 (devoted to communication complexity), IWDC 2005, and IEEE ICDCS 2006.