An Adaptive Programming Model for Fault-Tolerant Distributed Computing
Sergio Gorender, Raimundo Jose de Araujo Macedo, Member, IEEE Computer Society, and
Michel Raynal
Abstract—The capability of dynamically adapting to distinct runtime conditions is an important issue when designing distributed
systems where negotiated quality of service (QoS) cannot always be delivered between processes. Providing fault tolerance for such
dynamic environments is a challenging task. Considering such a context, this paper proposes an adaptive programming model for
fault-tolerant distributed computing, which provides upper-layer applications with process state information according to the current
system synchrony (or QoS). The underlying system model is hybrid, composed of a synchronous part (where there are time bounds
on processing speed and message delay) and an asynchronous part (where there is no time bound). However, such a composition can
vary over time, and, in particular, the system may become totally asynchronous (e.g., when the underlying system QoS degrades) or
totally synchronous. Moreover, processes are not required to share the same view of the system synchrony at a given time. To
illustrate what can be done in this programming model and how to use it, the consensus problem is taken as a benchmark problem.
This paper also presents an implementation of the model that relies on a negotiated quality of service (QoS) for communication
channels.
Index Terms—Adaptability, asynchronous/synchronous distributed system, consensus, distributed computing model, fault tolerance,
quality of service.
1 INTRODUCTION
DISTRIBUTED systems are composed of processes, located on one or more sites, that communicate with one
another to offer services to upper-layer applications. A major difficulty a system designer has to cope with in these systems lies in the capture of consistent global states from which safe decisions can be taken in order to guarantee a safe progress of the upper-layer applications. To study and investigate what can be done (and how it has to be done) in these systems when they are prone to process failures, two distributed computing models have received significant attention, namely, the synchronous model and the asynchronous model.
The synchronous distributed computing model provides
processes with bounds on processing time and message
transfer delay. These bounds, explicitly known by the
processes, can be used to safely detect process crashes and,
consequently, allow the noncrashed processes to progress
with safe views of the system state (such views can be
obtained with some “time-lag”). In contrast, the asynchronous
model is characterized by the absence of time bounds
(that is why this model is sometimes called the time-free
model). In these systems, a system designer can only assume
an upper bound on the number of processes that can crash
(usually denoted as f) and, consequently, design protocols relying on the assumption that at least (n − f) processes are alive (n being the total number of processes). The protocol has no means to know whether a given process is alive or not. Moreover, if more than f processes crash, there is no guarantee on the protocol behavior (usually, the protocol loses its liveness property).
Synchronous systems are attractive because they allow system designers to solve many problems. The price that has to be paid is the a priori knowledge on time bounds. If they are violated, the upper-layer protocols may be unable to still guarantee their safety property. As they do not rely on explicit time bounds, asynchronous systems do not have this drawback. Unfortunately, they have another one, namely, some basic problems are impossible to solve in asynchronous systems. The most famous is the consensus problem, that has no deterministic solution when even a single process can crash [12].
The consensus problem can be stated as follows: Each process proposes a value, and has to decide a value, unless it crashes (termination), such that there is a single decided value (uniform agreement), and that value is a proposed value (validity). This problem, whose statement is particularly simple, is fundamental in fault-tolerant distributed computing as it abstracts several basic agreement problems. While consensus is considered as a “theoretical” problem, system designers are usually interested in the more practical Atomic Broadcast problem. That problem is both a communication problem and an agreement problem. Its communication part specifies that the processes can broadcast and deliver messages in such a way that the processes that do not crash deliver at least the messages they send. Its agreement part specifies that there is a single delivery order
18 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 4, NO. 1, JANUARY-MARCH 2007
. S. Gorender and R.J. de Araujo Macedo are with the Distributed Systems Laboratory (LaSiD), Computer Science Department, Federal University of Bahia, Campus de Ondina, 40170-110, Salvador, Bahia, Brazil. E-mail: {gorender, macedo}@ufba.br.
. M. Raynal is with IRISA, Universite de Rennes 1, Campus de Beaulieu,35042 Rennes Cedex, France. E-mail: [email protected].
Manuscript received 24 May 2005; revised 31 Dec. 2005; accepted 22 Aug. 2006; published online 2 Feb. 2007. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TDSC-0075-0505.
1545-5971/07/$25.00 © 2007 IEEE Published by the IEEE Computer Society
(so, the correct processes deliver the same sequence of messages, and a faulty process delivers a prefix of this sequence of messages). It has been shown that consensus and atomic broadcast are equivalent problems in asynchronous systems prone to process crashes [6]. Consequently, in asynchronous distributed systems prone to process crashes, the impossibility of solving consensus extends to atomic broadcast. This impossibility has motivated researchers to find distributed computing models, weaker than the synchronous model but stronger than the asynchronous model, in which consensus can be solved.
1.1 Content of the Paper
In practice, systems are neither fully synchronous, nor fully asynchronous. Most of the time they behave synchronously, but can have “unstable” periods during which they behave in an anarchic way. Moreover, there are now “QoS architectures” that allow processes to dynamically negotiate the quality of service of their communication channels, leading to settings with hybrid characteristics that can change over time. These observations motivate the design of a distributed programming model that does its best to provide the processes with information on the current state of the other processes. This paper is a step in that direction.
This programming model is time free in the sense that processes are not provided with time bounds guaranteed by the lower system layer. Each process pi is provided with three sets denoted downi, livei, and uncertaini. These sets, which are always a partition of the whole set of processes, define the view pi has of the state of the other processes. More precisely, if pk ∈ downi, then pi knows that pk has crashed; when pk ∈ livei, then pi can consider pk as being alive; finally, when pk ∈ uncertaini, pi has no information on the current state of pk. These sets can evolve in time, and can have different values for different processes. For example, it is possible that (due to the fact that some quality of service can no longer be ensured) the view pi has on pk be degraded in the sense that the model moves pk from livei to uncertaini. So, the model is able to benefit from time windows to transform timeliness properties (currently satisfied by the lower layer) into time-free properties (expressed on the previous sets) that can be used by upper layer protocols. Interestingly, this programming model considers the synchronous model and the asynchronous model as particular cases. The synchronous model corresponds to the case where, for all the processes pi, the sets uncertaini are always empty. The asynchronous model corresponds to the case where, for all the processes pi, the sets uncertaini always include all processes.
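As a concrete illustration, the three per-process sets can be pictured as a small read-only view object. This is only a sketch: the class and method names are illustrative, not part of the paper's model, and the internal _move operation stands in for whatever transitions the underlying layer performs.

```python
class ProcessView:
    """Illustrative sketch of the view a process pi is given: three
    read-only sets that always partition the whole set of processes."""
    def __init__(self, all_procs):
        self.all = frozenset(all_procs)
        self._down = set()             # pi knows these processes have crashed
        self._live = set(all_procs)    # hint: currently considered alive
        self._uncertain = set()        # no information (crashed or alive)

    # Upper-layer code may only read the sets; it never writes them.
    @property
    def down(self):
        return frozenset(self._down)

    @property
    def live(self):
        return frozenset(self._live)

    @property
    def uncertain(self):
        return frozenset(self._uncertain)

    def _move(self, p, target):
        """Model-internal transition, e.g. live -> uncertain on QoS loss."""
        for s in (self._down, self._live, self._uncertain):
            s.discard(p)
        target.add(p)
        # The three sets must remain a partition of all processes (rule R1).
        assert self._down | self._live | self._uncertain == self.all

view = ProcessView({"p1", "p2", "p3"})
view._move("p2", view._uncertain)      # QoS on p2's channel degraded
```

Note that different processes can hold different ProcessView instances with different contents, exactly as the model allows.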
It is important to notice that our approach is orthogonal to the failure detector (FD) approach [6]. Given a problem P that cannot be solved in the asynchronous model, the FD approach is a computability approach: it consists of enriching the asynchronous model with properties that are sufficient for the problem to become solvable. Our approach is “engineering” oriented. It aims at benefiting from the fact that systems are built on QoS architectures, thereby allowing a process not to always consider the other processes as being in an “uncertain” state. As the proposed model includes the asynchronous model, it is still necessary to enrich it with appropriate mechanisms to solve problems
that are impossible to solve in pure time-free systems. To illustrate this point and evaluate the proposed approach, the paper considers the consensus problem as a benchmark problem. A consensus protocol is presented, that is based on the fact that each process pi is provided with 1) three sets downi, livei, and uncertaini with the previously defined semantics, and 2) an appropriate failure detector module (namely, ◊S). Interestingly, while ◊S-based consensus protocols in asynchronous systems require f < n/2, the proposed ◊S-based protocol allows bypassing this bound when |uncertaini| < n during the protocol execution.
Another interesting feature of the proposed programming model lies in the fact that it allows us to design generic protocols, in the sense that a protocol designed for that model “includes” both a synchronous protocol and an asynchronous protocol.
This paper is an extended and enhanced version of a previous publication that appeared in [14].
1.2 Related Work
Our approach relates to previous work on distributed system models and adaptiveness. The timed asynchronous model [8] considers asynchronous processes equipped with physical clocks. The concept of hybrid architectures has been explored where a distributed wormhole (i.e., a small synchronous part of the system) equipped with synchronized clocks provides timely services to the rest of the system (the asynchronous part) [27]. The concept of a secure and crash resilient wormhole has also been used to implement efficient solutions to consensus protocols subjected to Byzantine faults in the asynchronous part of the system [7].
Several works aimed to circumvent the impossibility of consensus in the asynchronous system model [12]. Minimal synchronism needed to solve consensus is addressed in [9]. Partial synchrony making consensus solvable is investigated in [10]. Finally, the failure detector approach and its application to the consensus problem have been introduced and investigated in [5], [6]. A general framework for consensus algorithms for the classical asynchronous system model enriched with failure detectors or random number generators is given in [16].
Many systems have addressed adaptiveness, including [24], [26], [19], [4]. AquA [24] provides adaptive fault tolerance to CORBA applications by replicating objects and providing a high-level method that an application can use to specify its desired level of reliability. AquA also provides an adaptive mechanism capable of coping with application dependability requirements at runtime. It does so by using a majority voting algorithm. Ensemble [26] offers a fault-tolerant mechanism that allows adaptation at runtime. This is used to dynamically adapt the components of a group communication service to switch between two total order algorithms (sequencer-based versus token-based) according to the desired system overhead and end-to-end latency for application messages.
The concept of real-time dependable channels (RTD) has been introduced in [19] to handle QoS adaptation at channel creation time. In their paper, the authors show how an application can customize an RTD channel according to specific QoS requirements, such as bounded delivery time,
reliable delivery, message ordering, and jitter (variance in message transmission time). Such QoS requirements are stated in terms of the probabilities related to specific QoS guarantees and are enforced by a composite protocol based on the real-time version of the Cactus system (CactusRT).
The work presented in [4] explicitly addresses the problem of dynamically adapting applications when a given negotiated QoS can no longer be delivered. In order to accomplish that, they defined a QoS coverage service that provides the so-called time-elastic applications (such as mission-critical or soft real-time systems) with the ability to dependably decide how to adapt time bounds in order to maintain a constant coverage level. Such a service is implemented over the wormhole Timely Computing Base or TCB. By using its duration measurement service, TCB is able to monitor the current QoS level (in terms of timeliness), allowing applications to dynamically adapt when a given QoS cannot be delivered.
Like [4], our work tackles QoS adaptability in terms of timeliness at runtime. However, we do not adapt time bounds to time-elastic applications. Instead, our hybrid programming model adapts to the available QoS of communication channels, providing applications with safe information on the current state of processes. The communication channel QoS can be timely, with guaranteed time bounds for message delivery, or untimely, otherwise.1
Moreover, we show how an adaptive consensus protocol can take advantage of the proposed hybrid model. As we shall see in Section 4, though we have our own implementation based on QoS architectures, our programming model could also be implemented on top of systems such as TCB and CactusRT.
1.3 Roadmap
The paper is made up of five sections. Section 2 introduces the model. Section 3 presents and proves the correctness of a consensus protocol suited to the model, and Section 4 describes an implementation of this model based on a distributed system architecture providing negotiated QoS guarantees. Finally, Section 5 concludes the paper.
2 THE HYBRID AND ADAPTIVE PROGRAMMING MODEL
This section describes the computation model offered to the upper-layer applications. These applications are generally distributed programs implementing middleware services. It is important to notice that the model offered to upper-layer applications is time-free in the sense that it offers them neither timing assumptions, nor time functions.
2.1 Asynchronous Distributed System with Process Crash Failures
We consider a system consisting of a finite set Π of n ≥ 2 processes, namely, Π = {p1, p2, ..., pn}. A process executes steps (a step is the reception of a message, the sending of a message, each with the corresponding local state change, or a simple local state change). A process can fail by crashing, i.e., by prematurely halting. After it has crashed, a process does not recover. It behaves correctly (i.e., according to its specification) until it (possibly) crashes. By definition, a process is correct in a run if it does not crash in that run. Otherwise, a process is faulty in the corresponding run. In the following, f denotes the maximum number of processes that can crash (1 ≤ f < n). Until it possibly crashes, the speed of a process is positive but arbitrary.
Processes communicate and synchronize by sending and receiving messages through channels. Every pair of processes (pi, pj) is connected by two directed channels, denoted pi → pj and pj → pi. Channels are assumed to be reliable: they do not create, alter, or lose messages. In particular, if pi is correct and sends a message to pj, then pj eventually receives that message unless pj fails. There is no assumption about the relative speed of processes or message transfer delays (let us observe that channels are not required to be FIFO). The primitive broadcast MSG(v) is a shortcut for: ∀pj ∈ Π do send MSG(v) to pj end (MSG is the message tag, v its value).
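The broadcast primitive defined above is nothing more than a loop of point-to-point sends. A minimal sketch (process names and the send callback are illustrative, not from the paper):

```python
def broadcast(sender, procs, tag, value, send):
    """'broadcast MSG(v)' is a shortcut for sending MSG(v) to every
    process in the system, the sender included."""
    for p in procs:
        send(sender, p, (tag, value))

# Usage: collect the point-to-point sends that one broadcast expands into.
outbox = []
broadcast("p1", ["p1", "p2", "p3"], "EST", 42,
          lambda src, dst, msg: outbox.append((src, dst, msg)))
```

With reliable channels, each of these sends is eventually received by its (correct) destination, which is all the model guarantees: no ordering, no bound on delay.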
The previous definitions describe what is usually called a time-free asynchronous distributed system prone to process crashes. We assume the existence of a global discrete clock. This clock is a fictional device that is not known by the processes; it is only used to state specifications or prove protocol properties. The range T of clock values is the set of natural numbers. Let F(t) be the set of processes that have crashed through time t. As a process that crashes does not recover, we have F(t) ⊆ F(t+1).
2.2 How a Process Sees the Other Processes: Three Sets per Process
A crucial issue encountered in distributed systems is the way each process perceives the state of the other processes. To that end, the proposed programming model provides each process pi with three sets denoted downi, livei, and uncertaini. The only thing a process pi can do with respect to these sets is to read the sets it is provided with; it cannot write them and has no access to the sets of the other processes.
These sets, which can evolve dynamically, are made up of process identities. Intuitively, the fact that a given process pj belongs to downi, livei, or uncertaini provides pi with some hint of the current status of pj. More operationally, if pj ∈ downi, pi can safely consider pj as being crashed. If pj ∉ downi, the state of pj is not known by pi with certainty: more precisely, if pj ∈ livei, pi is given a hint that it can currently consider pj as not crashed; when pj ∈ uncertaini, pi has no information on the current state (crashed or live) of pj.
At the abstraction level defining the computation model, these sets are defined by abstract properties (the way they are implemented is irrelevant at this level; it will be discussed in Section 4). The specification of the sets downi, livei, and uncertaini, 1 ≤ i ≤ n, is the following (where downi(t) is the value of downi at time t, and similarly for livei(t) and uncertaini(t)):
R0 Initial global consistency. Initially, the sets livei (respectively, downi and uncertaini) of all the
1. For ensuring such a distinct QoS guarantee, we assume that the underlying QoS infrastructure is capable of reserving resources and controlling admission of new communication flows related to newly created channels.
processes pi are identical. Namely, ∀i, j: livei(t) = livej(t), downi(t) = downj(t), and uncertaini(t) = uncertainj(t), for t = 0.

R1 Internal consistency. The sets of each pi define a
partition:
. ∀i: ∀t: downi(t) ∪ livei(t) ∪ uncertaini(t) = Π.
. ∀i: ∀t: any two of the sets (downi(t), livei(t), uncertaini(t)) have an empty intersection.

R2 Consistency of the downi sets.
. A downi set is never decreasing: ∀i: ∀t: downi(t) ⊆ downi(t+1).
. A downi set is always safe with respect to crashes: ∀i: ∀t: downi(t) ⊆ F(t).
R3 Local transitions for a process pi. While an upper-layer protocol is running, the only changes a process pi can observe are the moves of a process px from livei to downi or uncertaini.
R4 Consistent global transitions. The sets downi and uncertainj of any pair of processes pi and pj evolve consistently. More precisely:
. ∀i, j, k, t0: ((pk ∈ livei(t0)) ∧ (pk ∈ downi(t0 + 1))) ⇒ (∀t1 > t0: pk ∉ uncertainj(t1)).
. ∀i, j, k, t0: ((pk ∈ livei(t0)) ∧ (pk ∈ uncertaini(t0 + 1))) ⇒ (∀t1 > t0: pk ∉ downj(t1)).

R5 Conditional crash detection. If a process pj crashes
and there is a time after which it never appears in the uncertaini set of any other process pi, it eventually appears in the downi set of each pi. More precisely: ∀pi, if pj crashes at time t0, and there is a time t1 ≥ t0 such that ∀t2 ≥ t1 we have pj ∉ uncertaini(t2), then there is a time t3 ≥ t1 such that ∀t4 ≥ t3 we have pj ∈ downi(t4).

As we can see from this specification, at any time t and
for any pair of processes pi and pj, it is possible to have
livei(t) ≠ livej(t) (and similarly for the other sets). Operationally, this means that distinct processes can have
different views of the current state of each other process.
Let us also observe that downi is the only safe information
on the current state of the other processes that a process pi has. An example, involving nine processes, namely, Π = {p1, ..., p9}, is described in Table 1. The columns (respectively, rows) describe pi’s (respectively, pj’s) view of the three sets at some time t, e.g., downi(t) = {p1, p2, p5}, while livej(t) = {p5, p6}.
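Rules R1 and R2 lend themselves to a mechanical check on set snapshots. The sketch below is illustrative; the down set is taken from the Table 1 example, while the live and uncertain sets are made-up values that complete the partition:

```python
def check_R1(down, live, uncertain, all_procs):
    """R1: down, live, uncertain always partition the whole process set."""
    return (down | live | uncertain == all_procs
            and not (down & live)
            and not (down & uncertain)
            and not (live & uncertain))

def check_R2(down_before, down_after, crashed):
    """R2: a down set never decreases, and only contains crashed processes."""
    return down_before <= down_after and down_after <= crashed

procs = {f"p{k}" for k in range(1, 10)}
# down_i(t) = {p1, p2, p5} comes from the Table 1 example; the live and
# uncertain values below are illustrative and simply complete the partition.
down_i = {"p1", "p2", "p5"}
live_i = {"p3", "p4", "p6"}
uncertain_i = {"p7", "p8", "p9"}
```

A model implementation would maintain these invariants internally; upper-layer code can rely on them without checking.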
The rules [R0-R5] define a distributed programming
model that interestingly satisfies the following strong
consistency property. That property provides the processes
with a mutually consistent view on the possibility to detect
the crash of a given process. More specifically, if the crash of
a process pk is never known by pi (because pk continuously
belongs to uncertaini), then no process pj will detect the
crash of pk (by having pk ∈ downj). Conversely, if the crash of pk is known by pi (pk ∈ downi), the other processes will eventually know it too. More formally, we have:
Property 1: Mutual consistency.
∀i, j: ∀t1, t2: downi(t1) ∩ uncertainj(t2) = ∅.
Proof. The proof is by contradiction. Let us assume that ∃pi, pj, t1, t2 and a process p such that p ∈ downi(t1) ∩ uncertainj(t2).
We consider two cases:
. Case i = j.
- t1 = t2 is impossible, as it would violate the partitioning rule R1.
- t1 < t2 or t1 > t2 are also impossible, as they would violate the local transition rule R3.
. Case i ≠ j. First, let us observe that at the initial configuration (t = 0), p ∈ livei(0) and p ∈ livej(0), since the only possible transitions for p are the moves from live to down or to uncertain (R3) and processes see the same set formations (R0) at that stage, which always form a partition (R1).
Due to R3 and the initial assumption,
∀t3: ((t3 > t2) ∧ (t3 > t1)) ⇒ ((p ∈ downi(t3)) ∧ (p ∈ uncertainj(t3))).
Therefore, the formulas below follow, which contradict R4:
- ∃t′, t′ < t3 such that (p ∈ livei(t′) ∧ p ∈ downi(t′ + 1) ∧ p ∈ uncertainj(t3)), and
- ∃t″, t″ < t3 such that (p ∈ livej(t″) ∧ p ∈ uncertainj(t″ + 1) ∧ p ∈ downi(t3)). □
2.3 An Upgrading Rule
Let us consider two consecutive instances (runs) of an upper-layer protocol (e.g., the consensus protocol described in Section 3) built on top of the model defined by the rules R0-R5. Moreover, let t1 be the time at which the first instance terminates and t2 the time at which the second instance starts (t1 < t2). During any instance of the protocol, the set livei of a process can only downgrade (as no process can go from downi or uncertaini to livei). But, it is important to notice that nothing prevents the model from upgrading between consecutive instances, by moving a process px ∈ uncertaini(t1) into livei(t2) or downi(t2). Such an upgrade of the livei or downi sets between two runs of an upper-layer protocol corresponds to “synchronization” points during
TABLE 1. Example of down, live, and uncertain Sets
which the processes are allowed to renegotiate the quality of service of their channels (see Section 4).
R6 Upgrade. If no upper-layer protocol is running during a time interval [t1, t2], it is possible that, ∀i, processes be moved from uncertaini to livei or downi during that interval.
2.4 On the Wait Statement and Examples
Interestingly, the following proposition on the conditional termination of a wait statement can easily be derived from the previous computation model specification.
Proposition 1. Let us consider the following statement issued by a process pi at some time t0: “wait until ((a message from pj is received) ∨ (pj ∈ downi))” and let us assume that either the corresponding message has been sent by pj or pj has crashed.2 The model guarantees that this wait statement always terminates if there is a time t1 (t1 ≥ t0) such that ∀t2 ≥ t1 we have pj ∈ downi(t2) ∪ livei(t2).
This proposition is very important from a practical point of view. Said another way, it states that if, from some time on, pj never belongs to uncertaini, the wait statement always terminates. Equivalently, when, after some time, pj always remains in uncertaini or, alternately, belongs to livei and uncertaini, there is no guarantee on the termination of the wait statement (it can then terminate or never terminate).
To illustrate the model, let us consider here two particular “extreme” cases. The first is the case of synchronous distributed systems. As indicated in the Introduction, due to the upper bounds on processing times and message transfer delays, these systems correspond to the model where ∀i, ∀t, we have uncertaini(t) = ∅.
The second example is the case of fully asynchronous distributed systems where there is no time bound. In that case, given a process pj, a process pi can never know if pj has crashed or is only slow, or if its communication channels are very slow. This type of system does correspond to the model where we have ∀i, ∀t: uncertaini(t) = Π (or, equivalently, downi(t) = livei(t) = ∅). In these systems (as it appears in many asynchronous distributed algorithms), it is important to observe that, in order not to be blocked by a crashed process, a process pi can only issue “anonymous” wait statements such as “wait until (messages have been received from (n − f) processes).” In such a wait statement, pi relies only on the fact that at most f processes can crash. It does not wait for a message from a particular process (as in Proposition 1). This is an “anonymous” wait in the sense that the identities of the processes from which messages are received are not specified in the wait statement; these identities can be known only when the wait terminates.
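The two wait styles discussed above, the named wait of Proposition 1 and the anonymous wait on (n − f) messages, can be sketched as termination predicates that a runtime would poll. The function names and arguments are illustrative:

```python
def named_wait_done(received_from, pj, down_i):
    """'wait until ((a message from pj is received) or (pj in down_i))'.
    Proposition 1: this terminates whenever pj eventually stays out of
    uncertain_i (i.e., remains in down_i or live_i)."""
    return pj in received_from or pj in down_i

def anonymous_wait_done(received_from, n, f):
    """'wait until (messages have been received from (n - f) processes)'.
    The only safe wait in a fully asynchronous system, relying solely on
    the assumption that at most f processes crash."""
    return len(received_from) >= n - f
```

Note that the anonymous predicate never mentions process identities: which processes answered is known only once the wait terminates.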
2.5 The Hybrid and Adaptive Programming Model in Perspective
Our programming model provides the upper-layer applications with sufficient process state information (the sets) that can be used in order to adapt to the available system synchrony or QoS (in terms of timely and untimely channels), providing more efficient solutions to fault-tolerant problems when possible (this will be illustrated in the next section).
As a matter of fact, the implementation of fault-tolerant services, such as consensus [17], depends decisively on the existence of upper bounds for message transmission and process scheduling delays (i.e., timely channels and timely process executions). Such upper bounds can only be continuously guaranteed in synchronous systems. On the other hand, fault-tolerant services can also be guaranteed in some asynchronous or partially synchronous environments where the “synchronous behavior” is verified during sufficiently long periods of time, even if such a behavior is not continuously assured. Thus, such partially synchronous systems can alternate between “synchronous behavior,” with time bound guarantees, and “asynchronous behavior,” with no time bound guarantees. So, in some sense, these systems are hybrid in the time dimension. For example, the timed asynchronous system depends on a sufficiently long stability period to deliver the related fault-tolerant services, and the system behavior can alternate between stable and unstable periods (i.e., synchronous and asynchronous) [8]. The GST (Global Stabilization Time) assumption is necessary to guarantee that ◊S-based consensus protocols eventually deliver the proper service [6] (i.e., to work correctly, the system must eventually exhibit a “synchronous behavior”).
There are other models that consider not only hybridism in the time dimension, but also that the system is composed of a synchronous and an asynchronous part at the same time. So, we can regard such systems as being hybrid in the space dimension. This is the case of the TCB model, which relies on a synchronous wormhole to implement fault-tolerant services [27], and such a spatial hybridism persists without discontinuity in time.
We assume that our underlying system model is capable of providing distinct QoS communications, so that a given physical channel may transmit communication flows related to multiple virtual channels with distinct QoS (e.g., one timely channel and two untimely channels implemented in the same physical channel). Because timely and untimely channels may exist in parallel, our underlying system model can be hybrid in space. Moreover, untimely channels may exhibit a synchronous behavior in certain stable periods. Therefore, our underlying system model differs from all the above-mentioned models in the sense that it not only considers both the temporal and spatial hybridism dimensions (in this sense, similar to TCB), but also that the nature of such hybridism can change over time, with periods where there is no spatial hybridism. That is, the underlying system behavior can become totally asynchronous (i.e., uncertain = Π), given that in some circumstances, the QoS of timely channels can degrade (due to router failures, for instance).
As our programming model cannot ensure the conditions to solve consensus in circumstances where the system becomes totally asynchronous, we need a computability model to correctly specify the consensus algorithms. As
2. This means that we consider that the upper-layer distributed programs are well formed in the sense that they are deadlock-free.
discussed earlier, the failure detector approach, denoted FD, is such a computability model that we use to solve consensus (see Section 3.1). However, nothing prevents us from using our programming model with another oracle or computability assumptions (such as the eventual leader oracle (Ω) [5]) with adequate modifications in the consensus algorithm. That is why we consider the FD-based approach as being orthogonal to our approach.
In summary, partially synchronous models such as the GST model and timed asynchronous systems do not consider spatial hybridism. TCB is hybrid in space and time and its spatially hybrid nature persists continuously. Our model, in contrast, considers an underlying system that can also be hybrid in time and space; however, it can completely lose its synchronous part over time, requiring the corresponding adaptive mechanism. The price that has to be paid to allow the required adaptation is the implementation of QoS management mechanisms (provision and monitoring) capable of continuously assessing the QoS of the channels [1]. This is to some extent similar to what is realized by the fail-awareness service implemented in timed asynchronous systems, where the occurrence of too many performance failures raises an exception signal indicating that the underlying system can no longer satisfy the synchronous behavior [11]. The difference lies in the fact that in the timed asynchronous system, the signaling is not based on QoS management but on the violation of round-trip-time message bounds.
The approach presented in [16] focuses on the consensus problem in asynchronous message-passing systems. It proposes a general framework for failure detector-based consensus algorithms. Hence, its underlying model is the classical asynchronous model enriched with failure detectors. There is no notion of hybridism. The approach proposed here is different because it considers an underlying QoS infrastructure.
3 USING THE MODEL
As a relevant example of the way the previous programming model can benefit middleware designers, this section presents a consensus protocol built on top of this model. The consensus problem has been chosen to illustrate the way to use the model because of its generality, both practical and theoretical [17].
3.1 Enriching the Model to Solve Consensus
3.1.1 The Consensus Problem
In the consensus problem, every correct process pi proposes a value vi and all correct processes have to decide on the same value v, that has to be one of the proposed values. More precisely, the Consensus problem is defined by two safety properties (Validity and Uniform Agreement) and a Termination property [6], [12]:
. Validity: If a process decides v, then v was proposed by some process.
. Uniform Agreement: No two processes decide differently.
. Termination: Every correct process eventually decides on some value.
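For concreteness, the three properties can be read as checks over a finished run. The sketch below (all names are hypothetical) takes the proposals, the per-process decisions (None for a process that never decides), and the set of correct processes, and asserts validity, uniform agreement, and termination:

```python
def check_consensus(proposals, decisions, correct):
    """proposals/decisions: dict process -> value; correct: the set of
    processes that do not crash. decisions[p] is None if p never decides."""
    decided = {p: v for p, v in decisions.items() if v is not None}
    # Validity: every decided value was proposed by some process.
    assert all(v in proposals.values() for v in decided.values())
    # Uniform Agreement: no two processes (correct or not) decide differently.
    assert len(set(decided.values())) <= 1
    # Termination: every correct process eventually decides.
    assert all(decisions[p] is not None for p in correct)
    return True

# A run where p3 crashes before deciding: the two remaining decisions agree.
print(check_consensus({1: 'a', 2: 'b', 3: 'a'},
                      {1: 'a', 2: 'a', 3: None}, correct={1, 2}))  # True
```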
3.1.2 The Class ◇S of Failure Detectors
It is well known that the consensus problem cannot be solved in pure time-free asynchronous distributed systems [12]. So, these systems have to be equipped with additional power (as far as the detection of process crashes is concerned) so that consensus can be solved. Here, we consider that the system is augmented with a failure detector of the class denoted ◇S [6] (which has been shown to be the weakest class of failure detectors able to solve consensus despite asynchrony [5]).
A failure detector of the ◇S class is defined as follows [6]: Each process pi is provided with a set suspectedi that contains the processes suspected to have crashed. If pj ∈ suspectedi, we say "pi suspects pj." A failure detector providing such sets belongs to the class ◇S if it satisfies the following properties:
. Strong Completeness: Eventually, every process that crashes is permanently suspected by every correct process.
. Eventual Weak Accuracy: There is a time after which some correct process is never suspected by the correct processes.
As we can see, a failure detector of the class ◇S can make an infinite number of mistakes (e.g., by erroneously suspecting a correct process, in a repeated way).
3.2 A Consensus Protocol
The consensus protocol described in Fig. 1 adapts the ◇S-based protocol introduced in [21] to the computation model defined in Section 2. Its main difference lies in the waiting condition used in line 8. Moreover, while the algorithmic structure of the proposed protocol is close to the structure described in [21], due to the very different model it assumes, its proof (mainly in Lemma 1) is totally different from the one used in [21].
A process pi starts a consensus execution by invoking Consensus(vi), where vi is the value it proposes. This function is made up of two tasks, T1 (the main task) and T2. The processes proceed by consecutive asynchronous rounds. Each process pi manages two local variables whose scope is the whole execution, namely, ri (the current round number) and esti (the current estimate of the decision value), and two local variables whose scope is the current round, namely, auxi and reci. ⊥ denotes a default value which cannot be proposed by processes.
A round is made up of two phases (communication steps), and the first phase of each round r is managed by a coordinator pc (where c = (r mod n) + 1).
. First phase (lines 4-6). In this phase, the current round coordinator pc broadcasts its current estimate estc. A process pi that receives it keeps it in auxi; otherwise, pi sets auxi to ⊥. The test pc ∈ (suspectedi ∪ downi) is used to prevent pi from blocking forever.
It is important to notice that, at the end of the first phase, the following property holds:
∀i, j: (auxi ≠ ⊥ ∧ auxj ≠ ⊥) ⇒ (auxi = auxj = v).
GORENDER ET AL.: AN ADAPTIVE PROGRAMMING MODEL FOR FAULT-TOLERANT DISTRIBUTED COMPUTING 23

. Second phase (lines 7-13). During the second phase, the processes exchange their auxi values. Each process pi collects in reci such values from the processes belonging to the sets livei and uncertaini, as defined in the condition stated at line 8. As a consequence of the property holding at the end of the first phase and the condition of line 8, the following property holds at line 9:
∀i: reci = {v}, or reci = {v, ⊥}, or reci = {⊥}, where v = estc.
The proof of this property is the main part of the proof of the uniform agreement property (see Lemma 1). Then, according to the content of its set reci, pi either decides (case reci = {v}), adopts v as its new estimate (case reci = {v, ⊥}), or keeps its previous estimate (case reci = {⊥}). When it does not decide (the last two cases), pi proceeds to the next round.
As a process that decides stops participating in the sequence of rounds, and processes do not necessarily terminate in the same round, it is possible that processes proceeding to round r+1 wait forever for messages from processes that decided during r. The aim of the second task is to prevent such deadlocks by disseminating the decided value in a reliable way [18].

It is easy to see that the processes decide in a single round (two communication steps) when the first coordinator is neither crashed nor suspected. Moreover, it is important to notice that the value f (the upper bound on the number of faulty processes) does not appear in the protocol and, consequently, the protocol does not impose an a priori constraint on f. Actually, if the sets uncertaini always remain empty, the underlying failure detector becomes useless and the protocol solves consensus whatever the value of f.
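To make the round structure concrete, here is a deliberately simplified, failure-free simulation of the two-phase rounds described above (Fig. 1 itself is not reproduced in this text; all names are hypothetical, and the wait conditions, suspicions, and Task T2 are abstracted away):

```python
# A minimal, crash-free sketch of the round structure of the consensus
# protocol: coordinator broadcast (phase 1), exchange of aux values
# (phase 2), then the decide/adopt/keep rule on rec_i.
BOT = object()  # the default value "bot" that no process can propose

def run_consensus(proposals):
    """Synchronous, failure-free simulation: returns the decided values."""
    n = len(proposals)
    est = list(proposals)          # est[i]: current estimate of p_i
    decided = [None] * n
    r = 0
    while any(d is None for d in decided):
        c = r % n                  # round coordinator p_c
        # First phase: p_c broadcasts est[c]; with no crash and no
        # suspicion, every process receives it, so aux[i] = est[c].
        aux = [est[c]] * n
        # Second phase: processes exchange aux values; with no failures
        # every process receives all of them.
        for i in range(n):
            rec = set(aux)                     # rec_i
            if rec == {BOT}:
                pass                           # keep previous estimate
            elif BOT in rec:
                est[i] = (rec - {BOT}).pop()   # adopt v as new estimate
            else:
                decided[i] = rec.pop()         # rec_i = {v}: decide v
        r += 1
    return decided

print(run_consensus([3, 1, 2]))  # all decide the first coordinator's value
```

With no crashes and no suspicions, every process decides in the very first round, matching the "single round, two communication steps" remark above.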
3.3 Correctness Proof
The proofs for the uniform agreement and termination
properties, given by Theorems 1 and 2, respectively, follow.
The proof for validity is omitted here due to space
limitations, and can be found elsewhere [15].
Lemma 1. Let pi and pj be two processes that terminate line 8 of round r, and let v be the estimate value (estc) of the coordinator pc of r (if any). We have: 1) reci = {v}, or reci = {v, ⊥}, or reci = {⊥}; and 2) reci = {v} and recj = {⊥} are mutually exclusive.
Proof. Let us first consider a process pi that, during a round r, terminates the wait statement of line 8. As there is a single coordinator per round, the auxj values sent at line 7 by the processes participating in round r can only be either the estimate estc = v of the current round coordinator pc, or ⊥. Item 1 of the lemma follows immediately from this observation.

Let us now prove item 2, namely, that if both pi and pj terminate the wait statement at line 8 of round r, it is not possible to have reci = {v} and recj = {⊥} during that round. Let Qi (respectively, Qj) be the set of processes from which pi (respectively, pj) has received PHASE2(r, −) messages at line 8. The set reci (respectively, recj) is defined from the aux values carried by these messages. We claim that Qi ∩ Qj ≠ ∅. As any process in the intersection sends the same message to pi and pj, the lemma follows.
24 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 4, NO. 1, JANUARY-MARCH 2007
Fig. 1. Consensus protocol.
Proof of the claim. To prove that Qi ∩ Qj ≠ ∅, the reasoning is on their size. Let ti (respectively, tj) be the time at which the waiting condition evaluated by pi (respectively, pj) is satisfied. Let us observe that ti and tj can be different. To simplify the notation, we do not indicate the time at which a set is considered, i.e., downi, livei, and uncertaini denote the values of the corresponding sets at time ti (and similarly for pj's sets). Moreover, let maji be the majority subset of uncertaini that satisfies the condition evaluated at line 8, i.e., Qi = livei ∪ maji (notice that maji = ∅ when uncertaini = ∅). Similarly, let Qj = livej ∪ majj.
To prove Qi ∩ Qj ≠ ∅, we restrict our attention to the set of processes Π \ (downi ∪ downj). This is because the predicate evaluated at line 8 by pi (respectively, pj) involves only the processes in Π \ downi (respectively, Π \ downj). In the following, we denote with a prime symbol a set value used in the proof but known neither by pi nor by pj. So, down′i,j = downi ∪ downj, live′i = livei \ down′i,j, and live′j = livej \ down′i,j.
Let us observe that, due to (R1), we have live′i = livei \ downj and live′j = livej \ downi. Combining the previous set definitions with Property 1, we get the following property (E): live′i ∪ uncertaini = Π \ down′i,j = live′j ∪ uncertainj.
With these sets, we can now define two nonempty sets Q′i and Q′j as follows: Q′i = live′i ∪ maji and Q′j = live′j ∪ majj. From their very definition, we have Q′i = (live′i ∪ maji) ⊆ (livei ∪ maji) = Qi. Similarly, Q′j ⊆ Qj. As Q′i ∩ Q′j ≠ ∅ implies that Qi ∩ Qj ≠ ∅, the rest of the proof consists of showing that Q′i ∩ Q′j ≠ ∅. We consider three cases:
. Case 1: maji ≠ ∅ ∧ majj ≠ ∅. From the fact that Q′i is built from the universe of processes Π \ down′i,j and (E), we have the following:

|Q′i| = |live′i| + |maji| > |live′i| + |uncertaini|/2 ≥ (|live′i| + |uncertaini|)/2 = (|Π| − |down′i,j|)/2.

We have the same for pj, i.e., |Q′j| > (|Π| − |down′i,j|)/2. As both Q′i and Q′j are built from the same universe of processes (Π \ down′i,j) and each of them contains more than half of its elements, we have Q′i ∩ Q′j ≠ ∅.

. Case 2: maji = ∅ ∧ majj = ∅.
In that case, we have uncertaini = uncertainj = ∅. So, proving Q′i ∩ Q′j ≠ ∅ amounts to proving that live′i ∩ live′j ≠ ∅. Let us recall that ti (respectively, tj) is the time with respect to which downi is defined (respectively, downj), and let us assume, without loss of generality, that ti < tj. As pj has not crashed at tj (i.e., when it terminates the wait statement of line 8), we have (from R2) pj ∉ downi and pj ∉ downj, i.e., pj ∉ down′i,j, from which we conclude pj ∈ live′i and pj ∈ live′j. Hence, live′i ∩ live′j ≠ ∅, which proves the case.
. Case 3: maji ≠ ∅ ∧ majj = ∅. In that case, we have uncertaini ≠ ∅ and uncertainj = ∅. As uncertaini ∩ downj = ∅ (Property 1) and uncertainj = ∅, we have live′j = live′i ∪ uncertaini (Table 2 depicts these sets). Hence, uncertaini ⊆ live′j. As maji ⊆ uncertaini, we have maji ⊆ live′j, from which we conclude Q′i ∩ Q′j ≠ ∅. (The case maji = ∅ ∧ majj ≠ ∅ is proved by exchanging i and j.) End of the proof of the claim. □
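The counting argument of Case 1 can be sanity-checked on a small concrete configuration (an illustration only; the universe and the two live′/uncertain partitions are made up, with property (E) holding by construction):

```python
from itertools import combinations

U = set(range(7))   # the common universe Pi \ down'_{i,j} (property (E))

def quorums(uncertain):
    """All quorums live' U maj, where live' = U \\ uncertain and maj is
    any strict majority of uncertain (Case 1: uncertain is nonempty)."""
    live = U - uncertain
    need = len(uncertain) // 2 + 1
    return [live | set(m)
            for k in range(need, len(uncertain) + 1)
            for m in combinations(sorted(uncertain), k)]

# p_i and p_j may partition U differently into live' and uncertain:
Qi = quorums({3, 4, 5, 6})   # p_i's view: live' = {0, 1, 2}
Qj = quorums({0, 1, 2, 3})   # p_j's view: live' = {4, 5, 6}
assert all(len(q) > len(U) / 2 for q in Qi + Qj)   # the counting argument
assert all(qa & qb for qa in Qi for qb in Qj)      # hence all pairs intersect
print(len(Qi) * len(Qj), "pairs checked")
```

Each quorum contains more than half of the common universe, so any two of them must intersect, exactly as in the inequality above.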
Theorem 1. No two processes decide different values.
Proof. Let us first observe that a decided value is computed either in line 10 or in Task T2. However, we examine only the values decided at line 10, as a value decided in Task T2 has already been computed by some process in line 10.
Let r be the smallest round during which a process pi decides, and let v be the value it decides. As pi decides during r, we have reci = {v} during that round (line 10). Due to item 1 of Lemma 1, it follows that 1) v is the current value of the estimate of the coordinator of r, and 2) as there is a single coordinator per round, if another process pj decides during r, it decides the same value v.
Let us now consider a process pk that proceeds from round r to round r+1. Due to item 2 of Lemma 1, it follows that reck = {v, ⊥}, from which we conclude that the estimates of all the processes that proceed to r+1 are set to v at line 11. Consequently, no future coordinator will broadcast a value different from v, and agreement follows. □
Theorem 2. Let us assume that the system is equipped with ◇S and that, ∀i, ∀t, a majority of the processes in uncertaini(t) are not in F(t). Every correct process decides.
Proof. If a process decides, then all correct processes decide. This is an immediate consequence of the broadcast primitive used just before deciding at line 10. This broadcast (line 10 and Task T2) is such that, if a process executes it entirely, the broadcast message is received by all correct processes.
So, let us assume (by contradiction) that no process decides. We claim that no correct process blocks forever at line 5 or line 8. As a consequence, the correct processes eventually start executing rounds coordinated by a correct process (say, pc) that is no longer suspected to have crashed (this is due to the eventual weak accuracy of ◇S). Moreover, this process cannot appear in downi (R2). Let us consider such a round occurring after all the faulty processes have crashed. The correct process pc then sends the same non-⊥ value v to each correct process pi, which consequently executes auxi ← v (line 6). As no process blocks forever at line 8, and only the value v can be received, it follows that each correct process decides v at line 10.

TABLE 2. Case 3 in the Proof of the Claim.
Proof of the claim. No correct process blocks forever at line 5 or line 8. We consider each line separately. First, the fact that no process can block forever at line 5 follows directly from the fact that every crashed process eventually appears in suspectedi ∪ downi. Second, the fact that no process can block forever at line 8 follows from the following observations:
1. Item 2 of line 8 cannot entail a permanent blocking: this follows from the theorem assumption.
2. Let us now consider item 1 of line 8: pi could block if it waits for a message from pk ∈ livei and process pk has crashed before sending the corresponding message. If pk is later moved to uncertaini, it can no longer block pi; see 1). If pk is never moved to uncertaini, then it is eventually moved to downi (R5) and, consequently, pi will stop waiting for a message from pk. Hence, in both cases, a crashed process pk cannot prevent a correct process pi from progressing. End of the proof of the claim. □
3.4 Discussion
As presented in the Introduction and Section 2.4, this model includes two particular instances that have been deeply investigated. The first instance, namely, ∀i, t: uncertaini = Π, corresponds to the time-free asynchronous system model. When instantiated in such a particular model, the protocol described in Fig. 1 can be simplified by suppressing the set downi at line 5 and the set livei at line 8 (as they are now always empty). The assumption required in Theorem 2 becomes f < n/2, and the protocol becomes the ◇S-based consensus protocol described in [21]. It has been shown that f < n/2 is a necessary requirement in such a model [6]. In that sense, the proposed protocol is optimal.
The second "extreme" instance of the model is defined by ∀i, t: uncertaini = ∅, and includes the classic synchronous distributed system model. It is important to notice that systems with less synchrony than the classic synchronous systems can also be such that ∀i, ∀t, we have uncertaini(t) = ∅.3 As previously, the protocol of Fig. 1 can be simplified for this particular model. More precisely, the set suspectedi (line 5) and item 2 of line 8 can be suppressed. We then obtain an early deciding protocol that works for any number of process failures, i.e., for f < n. (It is natural to suppress the sets suspectedi in synchronous systems, as ◇S is only needed to cope with the net effect of asynchrony and failures.) Interestingly, the synchronous protocol we obtain is not the classical flood-set consensus protocol [20], [23], but a synchronous coordinator-based protocol.
A noteworthy feature of the protocol lies in its generic dimension: The same protocol can easily be instantiated in fully synchronous systems or fully asynchronous systems. Of course, these instantiations have different requirements on the value of f. A significant characteristic of the protocol is that it suits distributed systems that are neither fully synchronous nor fully asynchronous. The price that has to be paid then consists of equipping the system with a failure detector of the class ◇S. The benefit it brings lies in the fact that the constraint on f can be weaker than f < n/2.4
The rule R3 of our model restricts process identifiers from being moved from uncertain to live or from uncertain to down during a given run. As a consequence, we restrict the application from upgrading the QoS of the related channels from untimely to timely during the execution of a consensus. One question that could be raised, though, is what would happen if, during an execution of consensus, processes could renegotiate the QoS of channels from untimely to timely, therefore allowing processes to move from uncertain to live. It turns out that in such situations the required uniform agreement cannot be achieved. Take the simple example of a group with three processes p1, p2, and p3, all initially belonging to the set uncertain, as all channels are initially untimely. Now, consider that during a given consensus execution, processes p2 and p3 negotiate timely channels with p1, moving the three process identifiers from their local uncertain sets to their local live sets. Now, suppose that before the new negotiated QoS information reaches p1, p1 decides based on the quorum of two processes, p1 and p3 (at this stage, p1's view is that all processes belong to uncertain), and, after some time, both p1 and p3 crash. Some time later, process p2 detects the crash of p1 and also of p3 and decides based on its own initial value (for p2, all processes are in live). Under these conditions, there is no intersection between the decision quorums used by p1 and p2, so the value decided by p1 will be different from the value decided by p2, which violates the uniform agreement property.
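The scenario just described can be replayed with plain set arithmetic (an illustration; process names as in the text):

```python
P = {"p1", "p2", "p3"}

# p1's view when it decides: everyone is still in uncertain, so its
# quorum is a majority of uncertain, here {p1, p3}.
quorum_p1 = {"p1", "p3"}

# p2's view after the renegotiation: everyone is in live; once p1 and p3
# are detected as crashed and moved to down, p2 waits only for live \ down.
live_2, down_2 = P, {"p1", "p3"}
quorum_p2 = live_2 - down_2          # = {"p2"}

print(quorum_p1 & quorum_p2)         # empty: the two quorums do not intersect
```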
Regarding transitions from uncertain to down: if the state of a process p is unknown or uncertain in our model (which is what justifies p being in the uncertain set), it does not make sense to move p into down (which is considered safe information in our model, as defined by R2). So, in order to be moved from uncertain to down, p would first have to be moved from uncertain to live (so the situation is exactly as above, which also justifies R3). However, the same reasoning also applies if we allowed process identifiers to be moved directly from uncertain to down (just take the above example and consider moving p1 and p3 from uncertain2 into down2). As a matter of fact, without R3 one could not deduce Property 1 and, consequently, prove uniform agreement (see Lemma 1). Nevertheless, the above restriction does not prevent a process from taking advantage of a better QoS for its channels in future consensus executions (rule R6): It suffices that the new live and uncertain sets, obtained from QoS improvements, be set up in all cooperating processes before the next consensus execution. A simple way to realize that (without making use of extra mechanisms such as a membership protocol) is to send the new live set together with the coordinator's proposed value. Then, when the consensus is eventually decided, this new live set is equally installed in all processes (and its complement, the uncertain set).

3. It is sufficient that each process of the system is connected by a timely channel. For example, a system with four processes, two timely channels, and all the other channels untimely can be such that uncertaini(t) = ∅.

4. The protocols designed for asynchronous systems we are aware of require 1) ◇S (or a failure detector that has the same computability power, e.g., a leader oracle [5]), and 2) the upper bound f < n/2 on the number of process crashes. Our protocol has the same requirement for item 1, but a weaker one for item 2.
4 IMPLEMENTATION OF THE ADAPTIVE DISTRIBUTED COMPUTING MODEL
Implementing our hybrid programming model requires some basic facilities, such as the provision and monitoring of QoS communications with both bounded and unbounded delivery times, and also a mechanism to adapt the system when timely channels can no longer be guaranteed due to failures.
Our programming model builds on facilities typically encountered in QoS architectures [1], such as Omega [22], QoS-A [3], Quartz [25], and Differentiated Services [2]. In particular, we assume that the underlying system is capable of providing timely communication channels (akin to services such as QoS hard [3], deterministic [25], and Express Forward [2]). Similarly, we assume the existence of best-effort channels where messages are transmitted without guaranteed bounded time delays. We call these channels untimely. QoS monitoring and fail-awareness have been implemented by the QoS Provider (QoSP) and the failure and state detector mechanisms, briefly presented below. It was a design decision to build our model on top of a QoS-based system. However, we could also have implemented our programming model based on facilities encountered in existing hybrid architectures: For instance, timely channels could be implemented using RTD channels by setting the probabilities Pd (deadline probability) and Pr (reliability probability) close to one, and untimely channels could be implemented with a basic channel without any guarantees [19]. The timing failure detection service of the TCB [4] could then complement the required functionality.
The QoS-based underlying distributed system we consider is a set of n processes Π = {p1, ..., pn}, located in one or more sites, communicating through a set χ of n(n−1)/2 channels, where ci/j denotes a communication channel between pi and pj. That is, the system is represented by a complete graph DS(Π, χ), where Π are the nodes and χ the edges of the graph.
We assume that processes in Π are equipped with enough computational power so that the time necessary to process the messages originated by the implemented model (i.e., the state detector, the failure detector, and the QoS Provider) is negligibly small compared with network delays. Therefore, such messages are assumed to be promptly computed.5 Moreover, the processes are assumed to fail only by crashing, and the network is not partitionable.
We do not distinguish channels for upper-layer messages (e.g., generated by the consensus protocol) from channels for messages generated by the state and failure detectors. These messages are transmitted on the same channels. Regarding the QoSP, as will be seen in the next section, there is a unique module in each site of the system. When the system is initiated, the QoSP modules try to establish timely channels between them. However, if resources are not available for timely channels, untimely channels are generated.
4.1 The QoS Provider
In order to make our programming model portable to distinct QoS architectures, we define a number of functions to be encapsulated in a software device we call the QoS Provider (QoSP), for creating and assessing QoS communication channels on behalf of application processes. Thanks to this modular encapsulation, porting our system to a given QoS infrastructure (such as DiffServ [2]) means implementing the QoSP functions in the new target environment. The QoSP is made up of a module in each site of the system. The basic data structure maintained by each module is a table holding information about existing channels. These modules exchange messages to carry out modifications to the QoS of channels (due to failures or application requests).
This section describes its main functionalities, which are needed for implementing our programming model. (The complete description of the QoS Provider is beyond the scope of this paper; more details can be found elsewhere [13], [15].) Processes interact with the QoS Provider through the following functions:

CreateChannel(px, py) : Π² → χ;
DefineQoS(px, py, qos) : Π² × {timely, untimely} → {timely, untimely};
QoS(px, py) : Π² → {timely, untimely}; and
Delay(px, py) : Π² → ℕ⁺.
The functions CreateChannel(), DefineQoS(), QoS(), and Delay() are used for creating a channel, changing its QoS (with admission test and resource reservation), obtaining its current QoS (monitoring), and obtaining the expected delay, in milliseconds, for message transfer over the channel cx/y, respectively. Besides the above functions, each QoSP module continuously monitors all timely channels linked to the related site, to check whether failures or lack of resources have resulted in a modification of the channel QoS (from timely to untimely). A particular case that the QoSP also assesses is the existence of a timely channel on which no message flows within a period of time, which indicates that the timely channel is possibly linking two crashed processes (in this circumstance, the QoS of the channel is modified to untimely in order to release resources). Modifications to the QoS of channels are immediately reported to the state detectors related to the processes linked by such channels, through messages changeQoS(px, py, newQoS), indicating that the QoS of the channel cx/y has been modified to newQoS (see Section 4.2.1).
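As a rough sketch of how the four functions might be exposed by a per-site QoSP module (the class, its internals, and the concrete delay values are hypothetical; only the four operations and their meaning come from the text above):

```python
from typing import Dict, Tuple

TIMELY, UNTIMELY = "timely", "untimely"

class QoSProvider:
    """Hypothetical per-site QoSP module: a table of channels and their
    current QoS. Real admission tests, resource reservation, and the
    inter-module message exchange are abstracted away."""

    def __init__(self, reserved_delay_ms: int = 10):
        self.channels: Dict[Tuple[str, str], str] = {}
        self.reserved_delay_ms = reserved_delay_ms

    def _key(self, px, py):
        return tuple(sorted((px, py)))   # the channel c_{x/y} is undirected

    def create_channel(self, px, py):
        # New channels start untimely until a QoS is negotiated.
        self.channels.setdefault(self._key(px, py), UNTIMELY)
        return self._key(px, py)

    def define_qos(self, px, py, qos):
        # Admission-test stub: assume the reservation always succeeds.
        self.channels[self._key(px, py)] = qos
        return self.channels[self._key(px, py)]

    def qos(self, px, py):
        return self.channels[self._key(px, py)]

    def delay(self, px, py):
        # Expected transfer delay in ms: the reserved bound for a timely
        # channel, a coarse estimate for an untimely one.
        return self.reserved_delay_ms if self.qos(px, py) == TIMELY else 500

qosp = QoSProvider()
qosp.create_channel("p1", "p2")
print(qosp.define_qos("p1", "p2", TIMELY), qosp.delay("p1", "p2"))  # timely 10
```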
5. This assumption can be relaxed by using real-time operating systems such as CactusRT [19], which can provide bounded process execution times.
When an application process px wants to create a timely channel with the process py, it does so by first invoking the CreateChannel(px, py) function and then the DefineQoS(px, py, timely) function. If these functions succeed (i.e., the channel is created and the related QoS is set to timely), and assuming that px and py do not crash, the newly created channel will remain timely during the system execution, unless either the DefineQoS(px, py, untimely) function is explicitly invoked or a fault in the communication path linking px and py makes it impossible to still guarantee the previously reserved resources for this channel. So, degrading QoS is not just a matter of increasing communication delays, but of losing reservation guarantees due to failures. As an example of a fault, consider the crash of a QoS router in the path linking px and py, leading the communication flow through another router that has no resources reserved for that channel. In this situation, communication between px and py is preserved but with a degraded QoS (i.e., untimely). The monitoring mechanism available in the underlying QoS infrastructure signals the loss of QoS to the QoS Provider, which changes the QoS of the channel (px, py) from timely to untimely. This monitoring mechanism is a sort of fail-awareness scheme; however, it is different from the kind implemented in the timed-asynchronous model [11], which has no relation with resource reservation, as discussed above. In particular, notice that in our underlying system model, degrading QoS does not necessarily mean increasing communication delays. The contrary is, however, always true. That is, if a timeout expires for a timely channel, either the communicating process crashed or the QoS was downgraded.
Therefore, our failure detector mechanism (described in Section 4.2.2) uses the expiration of timeouts, based on previously negotiated time delays for timely channels, as a safe indication of a crash. However, the crash is only confirmed if the QoS infrastructure assures that the related channel QoS has not been lost due to a failure or deliberately downgraded by the application.
When a process crashes, the QoS Provider can still give information about the QoS of the channels linked to that crashed process, because the communication resources are still reserved and the QoSP module maintains information about these channels. However, if the site hosting a process crashes, all the channels allocated to processes in this site are destroyed. If a QoS Provider module crashes, all related channels have their QoS changed to untimely.
It should be observed at this point that, at runtime initialization, the modules of the QoS Provider try to create timely channels to connect one another. If timely channels are successfully created for the QoSP modules, then timeouts are used to detect the crash of a QoSP module. Otherwise, untimely channels are created for the QoSP modules and, as a consequence, upper-layer applications will not be able to create any timely channel either. In other words, application-related timely channels can only be created if and after the timely channels for the related QoSP modules exist. Another important observation is that, if a QoSP module loses the QoS of its channels, the same will happen for all applications with channels in the related site; hence, the situation of having timely channels for applications and untimely channels for the related QoSP will never happen.
If a given QoS Provider module cannot deliver information about a given channel (possibly because the related QoSP module has crashed), this channel is then assumed to be untimely (which may represent a change in its previous QoS condition).

On the other hand, the crash of a QoSP module Mx can only be detected by another QoSP module My if there is a timely channel linking Mx to My. The creation and monitoring of such a timely channel is realized in the same way as for the application process channels, that is, by using the underlying QoS infrastructure facilities. After the expiration of an expected timeout, the underlying QoS infrastructure is queried to verify whether the QoS for the channel linking Mx and My remains timely; in this case, My detects the crash of Mx. Otherwise, the channel linking Mx and My is made untimely, and all application channels for Mx and My are thereafter made untimely as well.
4.2 The Programming Model Implementation
In our programming model, distributed processes perceive each other's state by reading the contents of the sets down, live, and uncertain. These sets can evolve dynamically, following system state changes, while respecting the rules R0 to R5. Therefore, implementing our model implies providing the necessary mechanisms to maintain the sets according to their semantics. Two mechanisms have been developed to this end:
. a state detector, which is responsible for maintaining the sets live and uncertain, in accordance with the information delivered by the QoSP, and
. a failure detector, which utilizes the information provided by both the QoSP and the state detector to detect crashes and update the down sets accordingly.
Associated with each process pi there is a module of the state detector, a module of the failure detector, a representation of the DS(Π, χ) graph, and the three sets livei, uncertaini, and downi. The DS(Π, χ) graph is constructed by using the QoSP functions CreateChannel() and DefineQoS(), for creating channels according to the QoS required and the resources available in the system. The modules of the state detector exchange the information about the created channels so that they keep identical DS(Π, χ) graphs during the system initialization phase.6
During the initialization phase, the set downi is set to empty, and the contents of livei and uncertaini are initialized so that the identity of a process pj is placed into livei if and only if there is a timely channel linking pj to another process (i.e., ∃px ∈ Π such that DSi(pj, px) = timely). Otherwise, the identity of pj is placed in uncertaini. When the initialization phase ends, processes in Π observe identical contents for their respective live, down, and uncertain sets, and a given process is either in live or in uncertain (ensuring, therefore, the restrictions R0 and R1 of our model).
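A minimal sketch of this initialization rule (function and variable names are hypothetical; ds stands for the DSi channel-QoS graph):

```python
def init_sets(processes, ds):
    """ds: dict mapping frozenset({p, q}) -> 'timely' | 'untimely'.
    A process is placed in live iff some channel linking it is timely."""
    live = {p for p in processes
            if any(ds.get(frozenset({p, q})) == "timely"
                   for q in processes if q != p)}
    uncertain = set(processes) - live
    down = set()                     # empty at the end of initialization
    return live, uncertain, down

# Four processes and two timely channels (the example of footnote 3):
ds = {frozenset({"p1", "p2"}): "timely",
      frozenset({"p3", "p4"}): "timely",
      frozenset({"p1", "p3"}): "untimely"}
live, uncertain, down = init_sets(["p1", "p2", "p3", "p4"], ds)
print(sorted(live), sorted(uncertain))  # ['p1', 'p2', 'p3', 'p4'] []
```

Every process has at least one timely channel, so uncertain comes out empty, matching the footnote's observation that such a system behaves as the ∀i, t: uncertaini(t) = ∅ instance.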
6. In the present prototype version, we use a centralized entity (process) and a distributed transaction to establish the initial graph (it is like reading the same information from a shared file). Of course, the system is only initialized if the transaction succeeds without failures (or suspicions thereof). Another option that we intend to explore is to use our consensus algorithm over the QoSP modules to define the initial configuration.
During the application execution, the contents of the three sets are dynamically modified according to the occurrence of failures and/or QoS modifications. Next, the implemented mechanisms in charge of updating the contents of live and uncertain (the state detector) and the contents of down (the failure detector) are described.
4.2.1 An Implementation of the State Detector
There is a module of the state detector for each process in
�. The module of the state detector associated with piexecutes two concurrent tasks to update the DS graph
maintained by pi.The first task is activated by messages
changeQoSðpi; px; newQoSÞ
from the local module of QoSP, indicating that the QoS of
the channel linking pi to px has been modified. Upon
receiving the changeQoS message, the state detector of pifirst verifies whether px is in the downi set, and in this case,
it terminates the task (this is necessary to guarantee R4).
Otherwise, it passes on the information of the new QoS of
the channel ci=x to the remote modules of the state detector,
and the local DS graph is updated accordingly (i.e.,
DSiðpi; pxÞ is set to NewQoS).The second task is activated when the failure detector
communicates the crash of a process px (see details in the
next section). The goal of this task is to check whether there
is a process in the live set, say py, that had a timely channel
to the crashed process. If the channel cx=y is the only timely
channel to py, it can no longer be detectable and therefore
must be moved from live to uncertain. This is realized by
setting all channels linked to the crashed process as
untimely in the DS graph.In both tasks, after updating the DS graph, the
procedure UpdateState(), described in Fig. 2, is called for
each px linked to a modified channel, to update the sets livei and uncertaini accordingly. Process px is moved from livei to uncertaini if no timely channel linking px is left (lines 1-4); px is moved from uncertaini to livei if a new timely channel linking px has been created (lines 6-8).7
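As a concrete sketch, the procedure can be expressed as follows, assuming the DS graph is encoded as a dictionary mapping an unordered pair of processes to the QoS class of their channel (this encoding, and all identifiers, are illustrative rather than the prototype's actual data structures):

```python
def update_state(ds, px, live, uncertain, procs):
    """Sketch of UpdateState() for process px (cf. Fig. 2).

    ds: assumed DS-graph encoding, mapping frozenset({a, b}) to
        "timely" or "untimely" for the channel linking a and b.
    live, uncertain: the caller's local (disjoint) sets of process ids.
    """
    # Does px still have at least one timely channel to some other process?
    timely = any(ds.get(frozenset((px, py))) == "timely"
                 for py in procs if py != px)
    if px in live and not timely:
        # lines 1-4: no timely channel linking px is left
        live.discard(px)
        uncertain.add(px)
    elif px in uncertain and timely:
        # lines 6-8: a timely channel linking px has been created
        uncertain.discard(px)
        live.add(px)
```

Calling the procedure once per process linked to a modified channel, as described above, keeps live and uncertain disjoint, since each call places a process in at most one of the two sets.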
4.2.2 An Implementation of the Failure Detector
Besides maintaining the set downi, the failure detector also maintains the set suspectedi, which keeps the identities of processes suspected of having crashed. A process pi interacts with the failure detector by accessing these sets.
The failure detector works in a pull model, where each module (working on behalf of a process px) periodically sends "are you alive?" messages to the other modules related to the other processes in Π. The timeout value used for awaiting "I am alive" messages from a monitored process py is calculated using the QoSP function Delay(px, py). The timeout includes the so-called round-trip time (rtt),8 plus a safety margin to account for the time needed to process these messages at px and py.
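The probing step and its timeout computation can be sketched as follows; here delay stands in for the QoSP function Delay(px, py) (assumed to return the expected rtt), send is an assumed transport callback, and the safety margin is passed explicitly — all names are illustrative:

```python
import time

def probe(procs, me, timeouts, delay, send, margin=0.05):
    """Send "are you alive?" probes to every other process and arm
    the corresponding reply timeouts (a sketch of the pull model).

    timeouts: dict mapping each monitored process to the local time
    by which its "I am alive" reply is expected (rtt + safety margin).
    """
    now = time.monotonic()
    for py in procs:
        if py == me:
            continue
        timeouts[py] = now + delay(me, py) + margin
        send(py, ("are you alive?", me))
```

A monitoring loop would invoke probe once every monitoring interval and check the armed timeouts against the current local time.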
For timely channels, the calculated timeout is accurate in the sense that network and system resources and the related scheduling mechanisms guarantee the rtt within a bounded limit. Therefore, the expiration of the timeout is an accurate indication that py crashed and, in that case, py is moved from livex to downx. To account for a possible modification of the QoS of the channel cx/y, before producing a notification the failure detector checks whether the channel has remained timely, using the QoSP function QoS(px, py).9 On the other hand, if the channel linking px and py is untimely, the expiration of the timeout is only a hint of a possible crash and, in that case, besides belonging to uncertainx, py is also included in the set suspectedx.
The algorithm for the failure detector of a process pi, described in Fig. 3, is composed of five parallel tasks. The parameter monitoringInterval indicates the time interval between two consecutive "are you alive?" messages sent by pi. The array timeouti[1..n] holds the calculated timeout for pi to receive the next "I am alive" message from each process in Π. The function CTi() returns the current local time. Task 1 periodically sends an "are you alive?" message to all processes (actually, to the related failure detector modules) after setting a timeout value to receive the corresponding "I am alive" reply, which in turn is sent by Task 5. Task 2 assesses the expiration of timeouts, and it sends notification messages when the timeouts expire for processes in livei, moving them into the downi set. Otherwise, if the timeout
Fig. 2. Algorithm to update the sets livei and uncertaini.
7. Observe that as applications are not allowed to upgrade the QoS of channels during a run (R3), the else part of this algorithm will only be executed between runs (R6).
8. That is the time to transfer the "are you alive?" message from px to py plus the time to transfer the "I am alive" message from py to px.
9. One should observe here that the QoS Provider holds the information and resources related to a given channel even after the crashes of the processes linked by that channel.
expires for processes in the uncertaini set, their identities are also included in the suspectedi set. Task 3, upon receiving the "I am alive" message from a given process, cancels the corresponding timeout and, if that process belongs to the suspectedi set, removes it from that set. Task 4 handles crash notification messages and updates the sets downi and livei accordingly.
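The timeout-expiry branch of Task 2 can be sketched as below; qos stands in for the QoSP function QoS(px, py), and the actual sending of crash notifications is left to the caller (all identifiers are illustrative):

```python
def on_timeout_expiry(py, now, timeouts, live, uncertain, down, suspected, qos):
    """One expiry check of Task 2 for a monitored process py (sketch).

    qos(py) -> "timely" or "untimely": current QoS of the channel to py.
    Returns "notify" when a crash notification should be broadcast.
    """
    if now < timeouts.get(py, float("inf")):
        return None                 # timeout has not expired yet
    if py in live and qos(py) == "timely":
        # Accurate detection: the bounded rtt guarantees py has crashed
        # (the channel is re-checked in case its QoS was downgraded).
        live.discard(py)
        down.add(py)
        return "notify"             # caller broadcasts the crash notification
    if py in uncertain:
        suspected.add(py)           # only a hint on an untimely channel
    return None
```

Note the asymmetry described above: an expiry on a timely channel is treated as an accurate crash detection, whereas on an untimely channel it merely adds py to the suspected set.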
5 CONCLUSION
This paper proposed and fully developed an adaptive programming model for fault-tolerant distributed computing. Our model provides upper-layer applications with process state information according to the current system synchrony (or QoS). The underlying system model is hybrid, comprised of a synchronous part and an asynchronous part. However, such a composition can vary over time, in such a way that the system may become totally synchronous or totally asynchronous.
The programming model is given by three sets (processes perceive each other's states by accessing the contents of their local nonintersecting sets: uncertain, live, and down) and the rules R0-R6 that regulate modifications to the sets. Moreover, we showed how those rules and the sets can be implemented in real systems.
To illustrate the adaptiveness of our model, we developed a consensus algorithm that makes progress despite distinct views of the corresponding local sets, and that tolerates more faults the more processes are in the live set. The presented consensus algorithm adapts to the currently available QoS (via the sets) and uses the majority assumption only when needed.
To the best of our knowledge, there is no other system that implements the functionality we have just described, taking advantage of the available QoS to provide process state information that can be exploited to yield efficient solutions when possible. For example, in some circumstances the live set can include all system processes even if the synchrony of the underlying system is not that of classic synchronous systems (e.g., a system with four processes, two timely channels, and all other channels untimely can be in this situation). Under these conditions, if there is no QoS degradation, our protocol tolerates f < n process faults. On the other hand, our system allows for graceful degradation when the underlying QoS degrades, making it possible to circumvent the f < n/2 lower bound associated with the consensus problem in asynchronous systems equipped with eventual failure detectors, provided |uncertain| < n during the execution.
An implementation of the model on top of a QoS infrastructure has been presented. In order to specify the underlying functionality needed to implement it, a mechanism (called the QoS Provider) has been developed and implemented. Thanks to this modularity of the approach, porting the model implementation to a given environment only requires implementing the QoS Provider functions that have been defined. The proposed system has been implemented in JAVA and tested over a set of networked LINUX workstations equipped with QoS capabilities. More details of this implementation can be found elsewhere [15].
ACKNOWLEDGMENTS
The authors are grateful to the anonymous reviewers for their constructive comments.
Fig. 3. Algorithm for the failure detector module ðpiÞ.
REFERENCES
[1] C. Aurrecoechea, A.T. Campbell, and L. Hauw, "A Survey of QoS Architectures," ACM Multimedia Systems J., special issue on QoS architecture, vol. 6, no. 3, pp. 138-151, May 1998.
[2] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An Architecture for Differentiated Services," RFC 2475, Dec. 1998.
[3] A. Campbell, G. Coulson, and D. Hutchison, "A Quality of Service Architecture," ACM Computer Comm. Rev., vol. 24, no. 2, pp. 6-27, Apr. 1994.
[4] A. Casimiro and P. Veríssimo, "Using the Timely Computing Base for Dependable QoS Adaptation," Proc. 20th Symp. Reliable Distributed Systems (SRDS), pp. 208-217, Oct. 2001.
[5] T.D. Chandra, V. Hadzilacos, and S. Toueg, "The Weakest Failure Detector for Solving Consensus," J. ACM, vol. 43, no. 4, pp. 685-722, July 1996.
[6] T.D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," J. ACM, vol. 43, no. 2, pp. 225-267, Mar. 1996.
[7] M. Correia, N. Neves, L. Lung, and P. Veríssimo, "Low Complexity Byzantine-Resilient Consensus," Distributed Computing, vol. 17, no. 3, pp. 237-249, Mar. 2005.
[8] F. Cristian and C. Fetzer, "The Timed Asynchronous Distributed System Model," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 6, pp. 642-657, June 1999.
[9] D. Dolev, C. Dwork, and L. Stockmeyer, "On the Minimal Synchronism Needed for Distributed Consensus," J. ACM, vol. 34, no. 1, pp. 77-97, Jan. 1987.
[10] C. Dwork, N. Lynch, and L. Stockmeyer, "Consensus in the Presence of Partial Synchrony," J. ACM, vol. 35, no. 2, pp. 288-323, Apr. 1988.
[11] C. Fetzer and F. Cristian, "Fail-Awareness in Timed Asynchronous Systems," Proc. 15th ACM Symp. Principles of Distributed Computing (PODC '96), pp. 314-321, May 1996.
[12] M.J. Fischer, N. Lynch, and M.S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process," J. ACM, vol. 32, no. 2, pp. 374-382, Apr. 1985.
[13] S. Gorender and R. Macedo, "Fault-Tolerance in Networks with QoS," Technical Report RT001/03, Distributed System Laboratory (LaSiD), UFBA (in Portuguese), 2003.
[14] S. Gorender, R. Macedo, and M. Raynal, "A Hybrid and Adaptive Model for Fault-Tolerant Distributed Computing," Proc. IEEE Int'l Conf. Dependable Systems and Networks (DSN '05), pp. 412-421, June 2005.
[15] S. Gorender, R. Macedo, and M. Raynal, "A Hybrid Model for Fault-Tolerant Distributed Computing," Technical Report RT001/06, Distributed System Laboratory (LaSiD), UFBA, 2006.
[16] R. Guerraoui and M. Raynal, "The Information Structure of Indulgent Consensus," IEEE Trans. Computers, vol. 53, no. 4, pp. 453-466, Apr. 2004.
[17] R. Guerraoui and A. Schiper, "The Generic Consensus Service," IEEE Trans. Software Eng., vol. 27, no. 1, pp. 29-41, Jan. 2001.
[18] V. Hadzilacos and S. Toueg, "Fault-Tolerant Broadcasts and Related Problems," Distributed Systems, S. Mullender, ed., pp. 97-145, ACM Press, 1993.
[19] M. Hiltunen, R. Schlichting, X. Han, M. Cardozo, and R. Das, "Real-Time Dependable Channels: Customizing QoS Attributes for Distributed Systems," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 6, pp. 600-612, June 1999.
[20] N. Lynch, Distributed Algorithms, p. 872. Morgan Kaufmann, 1996.
[21] A. Mostefaoui and M. Raynal, "Solving Consensus Using Chandra-Toueg's Unreliable Failure Detectors: A General Quorum-Based Approach," Proc. 13th Symp. Distributed Computing (DISC '99), pp. 49-63, Sept. 1999.
[22] K. Nahrstedt and J.M. Smith, "The QoS Broker," IEEE Multimedia, vol. 2, no. 1, pp. 53-67, 1995.
[23] M. Raynal, "Consensus in Synchronous Systems: A Concise Guided Tour," Proc. Ninth IEEE Pacific Rim Int'l Symp. Dependable Computing (PRDC '02), pp. 221-228, Dec. 2002.
[24] Y. Ren, M. Cukier, and W.H. Sanders, "An Adaptive Algorithm for Tolerating Value Faults and Crash Failures," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 2, pp. 173-192, Feb. 2001.
[25] F. Siqueira and V. Cahill, "Quartz: A QoS Architecture for Open Systems," Proc. 18th Brazilian Symp. Computer Networks, pp. 553-568, May 2000.
[26] R. van Renesse, K. Birman, M. Hayden, A. Vaysburd, and D. Karr, "Building Adaptive Systems Using Ensemble," Software Practice and Experience, vol. 28, no. 9, pp. 963-979, July 1998.
[27] P. Veríssimo and A. Casimiro, "The Timely Computing Base Model and Architecture," IEEE Trans. Computers, special issue on asynchronous real-time systems, vol. 51, no. 8, pp. 916-930, Aug. 2002.
Sergio Gorender received the BSc degree in computer science from the Federal University of Bahia (UFBA), Brazil, in 1991, and the MSc and PhD degrees, also in computer science, from the Federal University of Pernambuco (UFPE), Brazil, in 1995 and 2005, respectively. Since 1996, he has been a lecturer in the Department of Computer Science at UFBA and, since 2000, he has been a researcher at LaSiD, the Distributed System Laboratory at the same university. His research interests include the many aspects of dependable distributed systems.
Raimundo Jose de Araujo Macedo received the BSc, MSc, and PhD degrees in computer science from the Federal University of Bahia (UFBA), Brazil, the University of Campinas (Unicamp), Brazil, and the University of Newcastle upon Tyne, England, respectively. Since 2000, he has been a professor of computer science at UFBA, where he founded the Distributed Systems Laboratory (LaSiD) in 1995. Currently, he is the head of LaSiD and of the newly created Doctorate Program on Computer Science. Formerly, he coordinated the Graduate Program on Mechatronics at UFBA. His research interests include the many aspects of dependable distributed systems and real-time systems. He has served as a program committee member for a number of conferences, including the IEEE/IFIP International Dependable Systems and Networks Conference (DSN), the ACM/IFIP/USENIX International Middleware Conference, the Brazilian Symposium on Computer Networks and Distributed Systems (SBRC), and the Latin-American Dependable Computing Symposium (LADC). He was the program chair and general chair of SBRC in 2002 and 1999, respectively. He is a member of the IEEE Computer Society.
Michel Raynal has been a professor of computer science since 1981. At IRISA (the CNRS-INRIA-University joint computing research laboratory located in Rennes), he founded a research group on distributed algorithms in 1983. His research interests include distributed algorithms, distributed computing systems, networks, and dependability. His main interest lies in the fundamental principles that underlie the design and the construction of distributed computing systems. He has been the principal investigator of a number of research grants in these areas and has been invited by many universities all over the world to give lectures on distributed algorithms and distributed computing. He belongs to the editorial boards of several international journals. He has published more than 100 papers in journals (Journal of the ACM, Acta Informatica, Distributed Computing, Communications of the ACM, etc.) and more than 200 papers in conferences (ACM STOC, ACM PODC, ACM SPAA, IEEE ICDCS, etc.). He has also written seven books devoted to parallelism, distributed algorithms, and systems (MIT Press and Wiley). He has served on the program committees of more than 70 international conferences (including ACM PODC, DISC, ICDCS, IPDPS, DSN, LADC, SRDS, and SIROCCO) and chaired the program committees of more than 15 international conferences (including DISC (twice), ICDCS, SIROCCO, and ISORC). Professor Raynal served as the chair of the steering committee leading the DISC symposium series from 2002 to 2004. He received the IEEE ICDCS Best Paper Award three times in a row: 1999, 2000, and 2001. Recently, he cochaired SIROCCO 2005 (devoted to communication complexity), IWDC 2005, and IEEE ICDCS 2006.