[ieee comput. soc. press proceedings. fourth international conference on data engineering - los...

TRANSACTION ATOMICITY IN THE PRESENCE OF NETWORK PARTITIONS

K. V. S. Ramarao
Department of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260

Abstract. Atomic transactions form a highly successful paradigm for the design of fault-tolerant systems. Implementation of atomic transactions in the presence of failures is among the most widely investigated questions in distributed systems. Here we study the network partition failure and determine the necessary and sufficient conditions for the implementation of atomic transactions in the presence of partitions. Two aspects are thoroughly explored: properties of the distributed system and the topology of the communication network. The essence of the results reported is that protocols to implement atomic actions in spite of partitions exist only under unrealistically strong conditions.

1. Motivation. The atomicity problem for distributed systems can be defined as that of implementing atomic transactions such that all nodes participating in a transaction either complete it successfully (it is committed) or all of them reject it (it is aborted). Commit protocols are solutions to the atomicity problem. In addition to solving the atomicity problem, a commit protocol is usually expected to satisfy the termination constraint: the maximum local time elapsed at all sites participating in any execution of such a protocol is bounded. Commit protocols satisfying the termination constraint in spite of an arbitrary instance of a failure type are nonblocking to that failure type [10]. The failures most widely studied in the literature are a) malicious behaviour of processors/links, and b) clean site/link failures (a site/link operates in only two modes: correct functioning and total quiescence). Simultaneous clean failures of several sites and/or links can lead to another class of failures: the network partition, where the system is partitioned into a number of groups with no physical communication among the sites of different groups. There is no commit protocol satisfying the termination constraint in the presence of an arbitrary partition [10]. In this paper we investigate the following question: under what conditions do commit protocols nonblocking to partitions exist?

CH2550-2/88/0000/0512$01.00 © 1988 IEEE

2. Existing Literature and the Contributions Made. The existence of commit protocols nonblocking to clean node failures is formally investigated in [13], [14], and [15], where the effects of varying degrees of synchrony of the system parameters (processors, links, transmission, message order, and the granularity of basic actions) are thoroughly explored. The nonexistence of commit protocols nonblocking to arbitrary partitions is formally established first in [10]. The following existence result also appeared there: assume that only link failures occur; then there is a nonblocking commit protocol if the following conditions hold: there are exactly two groups formed by the partition, and all undelivered messages are returned to the senders.

In the light of the above negative result, two divergent approaches are taken in the literature to handle the partition failure: a) the conservative approach, where the preservation of atomicity is considered uncompromisable, and b) the optimistic approach, where the atomicity requirement is relaxed.

The Conservative Approach: The majority-consensus approach of [12] is generalized to weighted voting [11], which allows all groups that have a quorum for commit/abort to complete a transaction in the presence of a partition. An interesting connection between site failures and partitions is reported in [3], where it is proved that a commit protocol can exhibit a nontrivial behaviour in the presence of a partition if and only if it is nonblocking to site failures. A formal theory for the termination protocols (protocols invoked to complete the transactions after a failure is detected) in the presence of partitions is developed in [4], where the optimal protocols under two important metrics are also reported. [16] discusses the properties of commit protocols nonblocking to simple partitions (partitions with exactly two groups).

The Optimistic Approach: Two directions are being considered under the banner of optimism: i) cautious optimism, which relaxes the atomicity requirement only temporarily, and ii) unconditional optimism, which makes no such guarantees. In cautious optimism [5], it is assumed that the effects of a committed transaction can be undone. Thus, when two groups come into contact, it is first checked whether there is an inconsistency, and certain transactions are undone to remove it. [9] gives a technique for realizing this assumption, assuming that each update of an atomic transaction has an associated compensating action. [7,11] deal with unconditional optimism and propose techniques to manage the inconsistent copies of a database. [6] provides a comprehensive survey of most of the work done on partition handling.

In this paper, we prove the following results. When the topology of the network is not a complete graph, there does not exist a nonblocking commit protocol if even a single node can fail, either concurrently with the occurrence of the partition or at a different time, however strong the assumptions about the fault-detection capabilities are. In fact, the only case in which nonblocking protocols exist is when exactly two groups are formed, partitions are caused by the failure of links exclusively, and undelivered messages are returned to the senders. For networks whose topology is a complete graph, nonblocking commit protocols exist in spite of node failures, but no node should fail after a partition takes place (and of course, only two groups must be formed and the undelivered messages must be returned).

3. Existence of nonblocking commit protocols. We informally define the system conditions first.

The nodes in a distributed system are synchronous if there is a constant $\Phi$ such that each operational node takes at least one step (in a computation) while any other node has taken $\Phi$ steps. The communication links are synchronous if there is a constant $\Delta$ such that each operational link delivers at least one message while any other link has delivered $\Delta$ messages.

The message order is synchronous if any two messages sent on the same link, if they are delivered, are delivered in the same order in which they were generated. The transmission is point-to-point if a node can send at most one message to one of its neighbors in a step. The transmission is broadcast if a node can send copies of the same message to any number of its neighbors in a step. The send/receive primitives are atomic if a node can receive a set of messages and send out message(s) within the same step; they are nonatomic otherwise.

Let $G = (V,E)$ be the network representing a distributed system in which the processors are synchronized in lock-step, the transmission delay on any link is unity, the message order is synchronous, the send/receive primitives are atomic, and the processors and links are of fail-stop type. Thus, $\Phi = \Delta = 1$ in such a system. Let $|V| = n$ and $|E| = l$. A partition of $G$ is a subset $S$ of $V \cup E$ whose removal disconnects $G$. The connected components of $G$ formed due to the removal of $S$ are called the groups of $S$.
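The definition of a partition and its groups can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names `groups` and `is_partition` and the representation of $G$ as a node list plus a list of undirected links are hypothetical and do not come from the paper. A removed node is taken to disappear together with its incident links.

```python
from collections import deque

def groups(nodes, edges, removed_nodes=(), removed_edges=()):
    """Connected components of G after removing the set S (nodes and/or links)."""
    removed_nodes = set(removed_nodes)
    removed = {frozenset(e) for e in removed_edges}
    # Build adjacency over surviving nodes and surviving links only.
    adj = {v: set() for v in nodes if v not in removed_nodes}
    for u, v in edges:
        if u in adj and v in adj and frozenset((u, v)) not in removed:
            adj[u].add(v)
            adj[v].add(u)
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects one group.
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            x = queue.popleft()
            comp.add(x)
            for y in adj[x] - seen:
                seen.add(y)
                queue.append(y)
        result.append(comp)
    return result

def is_partition(nodes, edges, removed_nodes=(), removed_edges=()):
    """S is a partition of G exactly when its removal leaves more than one group."""
    return len(groups(nodes, edges, removed_nodes, removed_edges)) > 1

# A ring of four nodes: one failed link does not disconnect it, two can.
ring_nodes = [1, 2, 3, 4]
ring_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
```

On the four-node ring, removing the links (1,2) and (3,4) yields the two groups {1,4} and {2,3}, which is the simple-partition case the later results concentrate on.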

Most of the terminology of the formal model used below is borrowed from [15] and [13]. The reader is referred to these original works for the motivation and details. Several extensions are necessary here to deal with the partition failure. The synchrony among the nodes, communication links, and message order, and the atomicity of send/receive operations are all incorporated into the following formal model of the system.

The task of a commit protocol can be abstracted as follows: initially each node $p_i$, $1 \le i \le n$, has a binary value $v_i$, and at the end of the protocol each node has irreversibly decided on a binary value such that two distinct nodes (whether operational or not) do not decide on different values. A commit protocol is nonblocking to partitions if every operational node decides within a finite number of steps in spite of a partition failure either before or during the execution of the protocol.

A commit protocol is modelled as a collection of automata with state set $Z$. Let $M$ be the message alphabet set. Each automaton $p$ is described by a transition function $\delta_p$ and an output function $\sigma_p$, where $\delta_p : Z \times M^n \to Z$ and $\sigma_p : Z \times M^n \to M \times V \cup \{\lambda\}$ (the transmission mode used here is point-to-point; the definition of $\sigma_p$ needs a straightforward modification if it is broadcast). Final states are divided into two classes, Com and Ab. An event $e$ is the receipt of a set $m$ of messages by a node $p$ and is denoted by the pair $(p,m)$. Processor $p$ is the agent of the event $e$. When a message initiated at $p$ destined to $q$ passes through several other nodes before it reaches $q$, the receipt of the message by each of the intermediate nodes is also an event.

A configuration $C$ of the system consists of the states of the nodes ($\mathrm{st}(p_i,C)$) and the contents of all message buffers ($\mathrm{buff}(p_i,p_j,C)$), where $\mathrm{buff}(p_i,p_j,C)$ refers to the message sent to $p_i$ by a neighbor $p_j$ and not yet received by $p_i$. Notice that due to the synchrony of the links, there can be at most one message in each $\mathrm{buff}(p_i,p_j,C)$ at any step. An event $e = (p,m)$ applies to a configuration $C$ if $p$ is the node to take the next step (dictated by the synchrony among the nodes; formally expressed later) and $\bigcup \{\mathrm{buff}(p,q,C) \mid q \text{ is a neighbor of } p\} = m$. The configuration $e(C)$ resulting from the application of the event $e$ to the configuration $C$ is obtained as follows: change $p$'s state from $z = \mathrm{st}(p,C)$ to $\delta_p(z,m)$ and leave every other processor's state unchanged, set $\mathrm{buff}(p,q,e(C)) = \lambda$ for all $q$, and append $m'$ to $\mathrm{buff}(q,p,e(C))$, where $\sigma_p(z,m) = (q,m')$.
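The rule for applying an event to a configuration can be made concrete with a small sketch. Everything about the representation is an assumption made for illustration: a configuration is a dictionary with a `state` map and a `buff` map keyed by (receiver, sender), the empty buffer $\lambda$ is modelled as `None`, a received message set is a frozenset of (sender, message) pairs, and `delta` and `sigma` are caller-supplied functions standing in for $\delta_p$ and $\sigma_p$. The check that $p$ is the node whose turn it is under lock-step synchrony is deliberately left out.

```python
def pending(config, p, neighbors):
    """Union of buff(p, q, C) over all neighbors q of p, as (sender, message) pairs."""
    return frozenset(
        (q, config["buff"][(p, q)])
        for q in neighbors[p]
        if config["buff"].get((p, q)) is not None
    )

def apply_event(config, event, neighbors, delta, sigma):
    """Return e(C) for the event e = (p, m); the input configuration is not modified."""
    p, m = event
    if pending(config, p, neighbors) != m:
        raise ValueError("event does not apply to this configuration")
    z = config["state"][p]
    state = dict(config["state"])
    buff = dict(config["buff"])
    state[p] = delta(p, z, m)          # only p changes state
    for q in neighbors[p]:
        buff[(p, q)] = None            # p has consumed its incoming buffers
    out = sigma(p, z, m)               # point-to-point: at most one (q, m') pair
    if out is not None:
        q, msg = out
        buff[(q, p)] = msg             # at most one message per buffer, so set it
    return {"state": state, "buff": buff}
```

Because each buffer holds at most one message in this model, "appending" $m'$ is implemented as a plain assignment.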

A schedule is a sequence of events. A schedule $\sigma = \sigma_1\sigma_2\ldots$ is applicable to an initial configuration $I$ if a) $\sigma_1$ is applicable to $I$, $\sigma_2$ is applicable to $\sigma_1(I)$, and so on, b) for every consecutive subsequence $\tau$ of $\sigma$, if some processor takes 2 steps in $\tau$ and $p$ does not take any step in $\tau$, then $p$ takes no steps in the portion of $\sigma$ following $\tau$ (that is, nodes are synchronized in lock-step and have the fail-stop property), and c) if $q$ outputs a message to $p$ in $\sigma_i$ for some $i$, and for the smallest $j$ such that $j > i$ and $\sigma_j = (p,m)$, $m$ has no message from $q$, then for any $\sigma_k = (p,m')$ where $k > i$, $m'$ has no message from $q$ (that is, links are synchronous and have the fail-stop property).

The following notation is used in denoting the application of multiple events: a sequence such as $\sigma_1\sigma_2\ldots$ implies the application of $\sigma_1$ followed by $\sigma_2$ and so on, while the composition of event applications is written with explicit parentheses; for example, $\sigma_1(\sigma_2(\ldots))$ implies that $\sigma_1$ is applied to the configuration resulting from the application of $\sigma_2$ and so on.

Failure Detection.

By the fail-stop property and the synchronization among the nodes and links, it is trivial to detect the failure of a link if only link failures are possible: if a node expecting a message from a neighbor at a particular step does not receive it, then that node knows the link to be faulty. Similarly, if only node failures are possible, then the detection of node failures is also straightforward. On the other hand, if both links and nodes can fail, then the non-receipt of an expected message cannot by itself indicate which component has failed. Though it is possible to isolate a failure, these details are unimportant here. For the sake of convenience, we shall assume that a failed component makes a special failure transition in place of the usual transition. Thus, a failed node simply makes a failure transition each time it is supposed to output a message and sends a special message (with no information other than an indication of its failure) to the node to which it is expected to send a regular message. Similarly, a failed link, when a message is sent over it, outputs a special message indicating its failure (and loses the actual message sent).

The case of network partitions is harder. Clearly no local detection is possible and some global mechanism is necessary. Again, without going into the actual details of the implementation, we would like to make a reasonable assumption about how a partition is detected. As in the case of node and link failures, we assume that a special partition transition is made by all the components that constitute the partition. One question arises: when do these transitions take place? We see at least two choices: a) all failed components simultaneously make the partition transitions (that is, the partition occurs at once), or b) each component in the partition makes a partition transition only when its turn comes. Both of these choices are used in this paper: the former while proving the nonexistence of protocols, and the latter while proving their existence. Thus, for the purposes of the impossibility proofs, the effect of a partition $P$ into the groups $G_1, G_2, \ldots, G_p$ is as follows: for any configuration $C$, $P(C)$ is a configuration in which a) all messages in the buffers of the nodes in a group adjacent in $G$ to nodes in other groups are erased, and b) each node/link in the partition makes a special partition failure transition at the same time and outputs a special message (nodes send this message to all of their neighbors and links send it to both of their end nodes). Thus a node detects a failure when it receives a failure message. Notice that this implies that all nodes need not detect the failures simultaneously. For convenience (and wlg) we assume that the partitions are stable: a partition once formed remains unchanged throughout the execution of the protocol.


When node failures in addition to partitions are possible, the partition transition and the node failure transitions need not be simultaneous unless specified otherwise.

For a finite schedule $\sigma$ applicable to an initial configuration $I$, the configuration $\sigma(I)$ is accessible. For a configuration $C$ accessible by a schedule $\tau$, $\sigma$ is applicable to $C$ iff $\tau\sigma$ is applicable to $I$. For any configuration $C$ and schedule $\sigma$ applicable to $C$, $\sigma(C)$ is reachable from $C$. A schedule, together with the corresponding configurations, is a run. A run is a deciding run if every operational node enters a final state. A run is a $(k,p)$-admissible run from $I$, for $0 \le k \le n$ and $0 \le p \le l$, if the corresponding schedule is applicable to $I$ and at most $k$ nodes and $p$ links have failure transitions in it, all part of a partition. A configuration $C$ has a decision value 1 if $\mathrm{st}(p,C) \in \mathrm{Com}$ for some $p$ and a decision value 0 if $\mathrm{st}(p,C) \in \mathrm{Ab}$ for some $p$. Thus, for a nonblocking commit protocol, a) no accessible configuration has more than one decision value, b) every $(n-2,l)$-admissible run with a partition transition in it is a deciding run, and c) for each $v \in \{0,1\}$, there is an accessible configuration with decision value $v$. A commit protocol is $(k,p)$-resilient if the condition (b) above deals with only $(k,p)$-admissible runs.

A configuration $C$ is bivalent if there are configurations $C'$, $C''$ reachable from $C$ with decision values 0, 1 respectively. It is 0-valent if only configurations with decision value 0 are reachable, and 1-valent if only configurations with decision value 1 are reachable.
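To make the valency definitions concrete, the following sketch classifies a configuration in a small, explicitly enumerated transition system. It is a toy illustration only: the dictionaries `successors` (configuration to directly reachable configurations) and `decision` (configuration to 0 or 1, absent when undecided) are hypothetical inputs, the starting configuration is counted as reachable from itself, and the real configuration space of a protocol is of course not given as a finite table.

```python
def reachable_decisions(start, successors, decision):
    """Set of decision values found in configurations reachable from start."""
    seen, stack, values = {start}, [start], set()
    while stack:
        c = stack.pop()
        if decision.get(c) is not None:
            values.add(decision[c])
        for nxt in successors.get(c, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return values

def valence(start, successors, decision):
    """Classify start as bivalent, 0-valent, 1-valent, or with no reachable decision."""
    values = reachable_decisions(start, successors, decision)
    if values == {0, 1}:
        return "bivalent"
    if values == {0}:
        return "0-valent"
    if values == {1}:
        return "1-valent"
    return "no decision reachable"

# Toy system: from C one event leads towards abort, another towards commit.
toy_successors = {"C": ["A", "B"], "A": ["A0"], "B": ["B1"]}
toy_decision = {"A0": 0, "B1": 1}
```

In the toy system, `C` is bivalent while `A` and `B` are univalent, which is the situation the impossibility proofs show cannot be escaped by a single event.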

To show the necessity of a certain set of conditions for the existence of nonblocking commit protocols, we follow the approach of [15]. Specifically, we assume for contradiction that one of the conditions is not necessary (so that there is a protocol in spite of that condition being false) and derive the contradiction in three stages using an "inductive proof" [13]: "a) Show that there is a bivalent initial configuration, b) show that if $C$ is a bivalent configuration and $p$ is a processor, then there is a schedule $\sigma$ such that $\sigma(C)$ is bivalent and $p$ takes a step in $\sigma$; also, if $p$'s buffers are nonempty in $C$, then for any set of messages $m$ in $p$'s buffers there is such a $\sigma$ in which $p$ receives $m$, and c) using (a) and (b), construct an infinite run that is not deciding as follows: let $B_1$ be an initial bivalent configuration; in general, if $B_i$ is bivalent, let $p = p_j$ where $j \equiv i \pmod{n}$ and let $B_{i+1} = \sigma(B_i)$ where $\sigma$ is obtained from (b). Also, if $p$'s buffers are nonempty, then let $p$ receive all the messages in the buffers. The resulting infinite run is non-deciding because each of the configurations in it is bivalent." Since the third stage is essentially identical in all proofs we give, we shall omit it in the nonexistence proofs given here.

For a schedule $\sigma = \sigma_1.\sigma_2.\ldots.\sigma_m$, define $\mathrm{counter}(\sigma)$ as $\langle(\sigma_1,i_1),(\sigma_2,i_2),\ldots,(\sigma_m,i_m)\rangle$ where $i_k$ denotes the count of the steps taken by the agent of $\sigma_k$ up to (and including) $\sigma_k$. For any configuration $C$ and $S \subseteq V$, let $C_S$ denote the subconfiguration of $C$ on the nodes from $S$. We say a schedule $\sigma$ is on $S$ if the agents of all events in $\sigma$ are from $S$. Given two schedules $\sigma$, $\sigma'$ respectively on the subconfigurations $C_S$ and $C'_{S'}$ where $S \cap S' = \emptyset$, the merger of $\sigma$, $\sigma'$, denoted as $\mathrm{merge}(\sigma.\sigma')$, is defined as the sequence of events obtained by sorting the two sequences $\mathrm{counter}(\sigma)$, $\mathrm{counter}(\sigma')$ and ignoring the counts after that.

Lemma 1. For any $k \ge 0$, if there is a $(k,l)$-resilient commit protocol, then the protocol has a bivalent initial configuration. Further, this is true even if exactly two groups are formed due to the partition and the transmission mode is broadcast.

Proof. Assume that all initial configurations are either 0-valent or 1-valent. Since 0 and 1 are both possible decision values, there must be initial configurations $I_0$ and $I_1$ such that $I_v$ is $v$-valent. Let $S$ be the set of nodes whose initial values in $I_0$, $I_1$ are different. If the subgraphs formed by $S$ and $V \setminus S$ are both connected, then we make a partition transition from $I_0$, failing all links connecting $S$ and $V \setminus S$. If this is not the case, then we alter the initial values of some of the nodes in $S$ one at a time so that we arrive at a 0-valent configuration $J_0$ and a 1-valent configuration $J_1$ such that the set of nodes $S'$ whose values differ in $J_0$ and $J_1$ is such that $S'$, $V \setminus S'$ both form connected subgraphs. This is always possible since in the worst case $S'$ may contain a single node, and $I_0$, $I_1$ may be chosen so that the separation of this node leaves the rest of the network connected. Again, we make a partition transition $P$ from $J_0$ forming two groups $S'$, $V \setminus S'$. Consider a deciding run from $P(J_0)$, which must exist by the hypothesis. Let $\sigma_0$ be the schedule of this run. Since $J_0$ is 0-valent, 0 is the decision reached by all nodes in this run. Consider $P(J_1)$ obtained by the same partition transition from $J_1$. Let $\sigma_1$ be the schedule of a deciding run from $P(J_1)$. Let $\tau_0$, $\tau_1$ respectively be the subschedules of $\sigma_0$, $\sigma_1$ on $V \setminus S'$. We claim that $\mathrm{merge}(\sigma_0-\tau_0.\tau_1)$ is applicable to $P(J_0)$ and $\mathrm{merge}(\sigma_1-\tau_1.\tau_0)$ is applicable to $P(J_1)$. Since $P(J_0)$ is 0-valent, $\mathrm{merge}(\sigma_0-\tau_0.\tau_1)(P(J_0))$ has decision value 0 while $\mathrm{merge}(\sigma_1-\tau_1.\tau_0)(P(J_1))$ must have decision value 1. But this means that the nodes in $V \setminus S'$ reach final states from Ab in one and from Com in the other, a contradiction since the subconfigurations on $V \setminus S'$ are identical in these two configurations. Notice that exactly two groups are formed due to the partition and that the construction is independent of the transmission mode. []
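The counter and merge operations used in the proof of Lemma 1 can be sketched in a few lines. This is a sketch under assumptions, not the paper's formal definition: an event is represented as an (agent, messages) tuple, and ties between events with equal step counts are broken by keeping the first schedule's events before the second's (Python's sort is stable), a tie-breaking rule the text leaves unspecified.

```python
def counter(schedule):
    """Pair each event with the number of steps its agent has taken so far."""
    steps = {}
    tagged = []
    for event in schedule:
        agent = event[0]
        steps[agent] = steps.get(agent, 0) + 1
        tagged.append((event, steps[agent]))
    return tagged

def merge(sched_a, sched_b):
    """Interleave two schedules on disjoint node sets by per-agent step count."""
    tagged = counter(sched_a) + counter(sched_b)
    tagged.sort(key=lambda pair: pair[1])   # stable sort: ties keep input order
    return [event for event, _ in tagged]   # drop the counts again

# Schedules on disjoint groups {p, q} and {r}.
on_s = [("p", "m1"), ("q", "m2"), ("p", "m3")]
on_t = [("r", "x1"), ("r", "x2")]
```

Sorting by step count keeps the merged schedule consistent with lock-step synchrony: every node's first step precedes any node's second step, so the two groups advance together even though they no longer communicate.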

Lemma 2. For $k \ge 0$, assume that there is a $(k,l)$-resilient commit protocol nonblocking to partitions with exactly two groups. Let $C$ be a bivalent configuration of the protocol reachable from some initial configuration. Let $e$ be an event applicable to $C$. Then $e(C)$ is bivalent. Also, this is true even if the transmission mode is broadcast.

Proof. Assume first that there is no partition transition in reaching $C$. Let $p$ be the agent of $e$ and let $\sigma_p(\mathrm{st}(p,C),m) = (r,m')$. Let $P$ be a partition transition where $S$ is a set of links whose removal partitions the network into two groups $G_1$, $G_2$ such that $p$, $r$ are in the same group $G_1$ (if it is not possible to form two groups then the result would hold anyway; thus, the assumption of two groups represents the best case). We now claim that $D = P(C)$ and $D' = e(C)$ have the same valency. If not, assume wlg that $D$ is 0-valent and that $D'$ is 1-valent. Then $P(D')$ must also be 1-valent. Let $\sigma$ be the schedule from $P(D')$ to a configuration $C_1$ with decision value 1. Let $\tau$ be the subschedule of $\sigma$ on $G_2$. Since $D$ is 0-valent, let $\sigma'$ be the schedule from $D$ to a configuration $C_0$ with decision value 0. Let $\tau'$ be the subschedule of $\sigma'$ on $G_1$. Then the sequence of events $\sigma'' = \mathrm{merge}(\tau.\tau')$ is applicable to both $D$ and $D'$. Also, the nodes in $G_2$ reach the same final states in both $\sigma''(D')$ and $\sigma''(D)$, a contradiction. Similarly we can prove that if $D$ is 1-valent then so is $D'$. Conversely, if $D'$ is $v$-valent, it can be shown using the same argument that $D$ is also $v$-valent. Observe that under the system model we are investigating, one of exactly two transitions is possible from $C$: a partition transition and $\delta_p(\mathrm{st}(p,C),m)$. Thus we arrive at a contradiction to the hypothesis that $C$ is bivalent if either of $D$, $D'$ is univalent.

Assume now that there was a partition transition in reaching $C$ and that $C$ is bivalent. We claim that $e(C)$ is bivalent, where $e = (p,m)$ is an event applicable to $C$. This is obvious since there can be no more failure transitions, implying that $e$ is the only event applicable to $C$ (and $C$ is bivalent by the hypothesis). []

Corollary 1 [10]. There is no commit protocol nonblocking to partitions even if exactly two groups are formed and only links fail.

We now introduce the notion of returning the undelivered messages. Recall that until now we have modelled the effect of a failure by the loss of messages. The following result is known previously [10]: there is a nonblocking commit protocol to partitions if the following conditions hold: exactly two groups are formed, no nodes fail, and undelivered messages are returned to the senders. When there are no node failures, the return of undelivered messages is modelled by placing the messages back into the buffers of the senders (in other words, a link failing as part of a partition behaves as follows: if there is a message being sent over the link, then the link sends the message back with a special partition failure message to the sender of the message and only the partition failure message to the destination; notice that in our model there can be at most one message in transit on a link). When a node fails, all messages in the buffers of that node are returned to their senders with a failure message of the appropriate type (thus, if some neighbor neither has a message to be returned nor is expecting a message from the failed node, then it may not detect the failure until long after it has occurred). It is straightforward to extend these ideas to the partition failure when only links fail. It is slightly tricky if both links and nodes fail to form a partition, because the sender of a message may fail too and it is not obvious what happens to the messages sent by that node. Since we are going to deal only with partitions with exactly two groups, it seems reasonable to assume that the network system passes an undelivered message whose sender is down to an arbitrary node in the group other than the one the destination node (of the message) is in. If a node fails after a partition, then the messages not delivered to that node can be returned only to the nodes in the group the failed node was in.

Now let us consider the possibility of node failures simultaneous to a partition. Since the return of undelivered messages is necessary even if only link failures lead to a partition, we assume that all undelivered messages are returned. Interestingly, it turns out that the existence of nonblocking protocols depends on the topology of the network. Thus, we report a negative result for non-complete graphs and a positive result for complete graphs. The negative result is given first.

Notice that the result of Lemma 1 is not applicable anymore since its proof has explicitly used the assumption that undelivered messages are lost. Hence we need to first establish the existence of a bivalent initial configuration when undelivered messages are returned.

Lemma 3. For $k \ge 0$, assume that there is a $(k,l)$-resilient commit protocol when undelivered messages are returned. Then the protocol has a bivalent initial configuration. This is true even if exactly two groups are formed and the transmission mode is broadcast.

Proof. Assume for contradiction that all initial configurations are univalent. Thus there exist initial configurations $I_0$, $I_1$ such that $I_v$ is $v$-valent. Since some nodes must have different initial values in $I_0$, $I_1$, we can find, by changing the initial values one at a time, two initial configurations $J_0$, $J_1$ which differ in the initial value of exactly one node $p$, such that $J_v$ is $v$-valent. Let $\sigma_v$ be a schedule applicable to $J_v$ such that $\sigma_v(J_v)$ has the decision value $v$. Let $\sigma_{0p}$, $\sigma_{1p}$ respectively be the events in $\sigma_0$, $\sigma_1$ for which $p$ is the agent for the first time. Write $\sigma_0 = \tau_0\sigma_{0p}\sigma'$ and $\sigma_1 = \tau_1\sigma_{1p}\sigma''$. $p$ is not the agent of any events in $\tau_0$, $\tau_1$ (and hence $p$ has not sent any message to any node in these schedules). Let $P$ denote a partition transition forming two groups $\{p\}$, $V \setminus \{p\}$. Clearly $\tau_0 P \sigma_{0p}\sigma'(J_0)$ is 0-valent and $\tau_1 P \sigma_{1p}\sigma''(J_1)$ is 1-valent, by the hypothesis. Observe now that $P\sigma_{0p}\sigma'$ is applicable to $\tau_1(J_1)$ and $P\sigma_{1p}\sigma''$ is applicable to $\tau_0(J_0)$. But this is a contradiction since $\tau_1 P \sigma_{0p}\sigma'(J_1)$ is 1-valent and $\tau_0 P \sigma_{1p}\sigma''(J_0)$ is 0-valent while the nodes must be in the same final states in both of these configurations. []

Observe that the above result holds independent of the network topology and of whether or not nodes fail. But the induction step requires the topology to be restricted.

Theorem 1. Let $G$ be a network which is not a complete graph. Then there is no $(1,l)$-resilient commit protocol even if: a) there are exactly two operational groups formed, b) undelivered messages are returned, and c) the transmission mode is broadcast.

Proof. Since there can be some nodes with degree $n-1$, we first need the following result.

Lemma A. Let $p$ be a node with degree less than $n-1$. Then there is an accessible bivalent configuration $C$ reached by a failure-free run such that $p$ is the agent of the non-failure event applicable to $C$.

Proof. Assume to the contrary. Let $C$ be any bivalent accessible configuration such that the agent of the non-failure event applicable to $C$ is a neighbor $r$ of $p$ (if no such configuration exists, then we can consider a neighbor of one of them and use the same argument as follows). Let $P$ be a partition transition from $C$ with two groups $\{p\}$ and $V \setminus \{p\}$. Let $e$ be the non-failure event applicable to $C$ and let $P'$ be a partition transition from $e(C)$ with the same two groups. Now the valencies of $e(C)$ and $P(C)$ can be proved to be the same if none of them is bivalent. Assume wlg that $e(C)$ is 0-valent while $P(C)$ is 1-valent. Let $C_0$ be a configuration reachable from $e(C)$ with decision value 0 and let $C_1$ be a configuration reachable from $P(C)$ with decision value 1. Let $\sigma_0$ and $\sigma_1$ be the corresponding schedules. Let $\tau_0$ and $\tau_1$ respectively be their subschedules on $p$. $p$ must reach the same final state in the configurations reached by the schedules $\mathrm{merge}(\sigma_0-\tau_0.\tau_1)(e(C))$ and $\mathrm{merge}(\sigma_1-\tau_1.\tau_0)(P(C))$, a contradiction to the assumption that $C$ is bivalent.

Let $C$ be a bivalent configuration reached from an initial configuration by a failure-free run. Let $p$ be the agent of the (non-failure) event $e$ applicable to $C$, let $X$ be the set of neighbors of $p$, and let $q$ be the agent of the (non-failure) event applicable to $e(C)$. Assume wlg that the degree of $p$ is less than $n-1$. Two transitions are possible from $C$: a partition transition and a transition on the event $e$. Again we show that the configurations reached due to these transitions are of the same valence, obtaining a contradiction to the assumption that $C$ is bivalent. Let $P$ be a partition transition from $e(C)$ forming two groups $G_1 = X \cup \{p,q\}$ and $G_2 = V \setminus G_1$ ($G_2 \ne \emptyset$ since the degree of $p$ is less than $n-1$). Let $P'$ be a partition transition from $C$ forming two groups $G_1$, $G_2$ identical to the ones above. Assume wlg that $e(C)$ is 0-valent while $P'(C)$ is 1-valent. Let $C_0$ be a configuration reachable from $P(e(C))$ with decision value 0 and let $C_1$ be a configuration reachable from $P'(C)$ with decision value 1. Let $\sigma_0$ be the schedule associated to $C_0$ and let $\sigma_1$ be the schedule for $C_1$. Let $\tau_0$ and $\tau_1$ be the subschedules respectively of $\sigma_0$ and $\sigma_1$ on $G_2$. It is easy to see that $\mathrm{merge}(\sigma_0-\tau_0.\tau_1)$ is applicable to $P'(C)$ and $\mathrm{merge}(\sigma_1-\tau_1.\tau_0)$ is applicable to $P(e(C))$. The nodes in $G_2$ must reach the same final states in the configurations reached by these schedules, a contradiction.

Assume now that there is a partition transition in the schedule for reaching $C$ from $I$ and that $C$ is bivalent. Since $e$ is the only event applicable to $C$, $e(C)$ also must be bivalent. []

Thus, the failure of even a single node simultaneous to a partition failure rules out the existence of nonblocking commit protocols under the most favorable system conditions. Now we ask: does there exist a nonblocking commit protocol if partitions are formed only due to the failure of links, but one node can fail independently (that is, not concurrently with the partition)? Unfortunately, the answer to this question is also in the negative, and in fact the above proof can be transported here almost in its entirety.

Theorem 2. Let $G$ be a network which is not a complete graph. There is no nonblocking commit protocol if a single node can fail independently and not concurrently with the partition, even if a) the partition is formed due to the failure of links only, b) the undelivered messages are returned, c) exactly two groups are formed, and d) the transmission mode is broadcast.

Proof. Let $C$ be an accessible bivalent configuration such that a node $p$ with degree less than $n-1$ is the agent of the non-failure event applicable to $C$. Assume that $C$ is reached by a failure-free run. Three transitions are possible from $C$: failure of $p$, a partition, and the non-failure event $e$. We prove that the configurations obtained by all of these transitions have the same valency unless one of them is bivalent. Let $X$ be the set of the neighbors of $p$. Informally, we do the following: a) apply a partition transition to $e(C)$ in which two groups are formed, $G_1 = X \cup \{p\}$ and $G_2 = V \setminus G_1$; b) apply a partition to $f(C)$, where $f$ denotes the failure of $p$, forming two groups $X$, $G_2$; and c) apply $f$ to $P(C)$, where $P$ is a partition forming the groups $G_1$, $G_2$ as above. It can now be shown as in the proofs above that the nodes in $G_2$ reach the same final states when appropriately merged schedules are applied to each of the configurations obtained in (a), (b), and (c).

If $C$ is reached by a run containing failure transitions then it is easy to show that $e(C)$ is bivalent, as in the previous proofs. []

Theorem 3. Let $G$ be a network with the topology of a complete graph. Assume that partitions form exactly two groups, and that undelivered messages are returned. Then there is a $(k,p)$-resilient commit protocol for any $k$ and $p$.

Proof. We give an informal description of the protocol. The transmission mode is point-to-point. Let $1, 2, \ldots, n$ denote the node ids. Each node, in its $i$'th step, sends its initial value to the node $i$. At the $(n+i)$'th step, each node sends a message to the node $i$ with a 1 or 0, 1 indicating that it has received all values and 0 indicating that it has not. Each node waits until it has determined that all nodes have received all initial values, before moving into a final state. The nodes move into states from Com if all initial values are 1 and move into states from Ab otherwise. Assume that there is a partition at a certain instant and that several nodes also fail simultaneously. We observe first that the nodes in each of the groups can correctly determine whether or not there is an operational node in the other group which has received all of the initial values, since they receive the undelivered messages (in an actual implementation, a node receiving the partition failure message initiates the election of a leader, which can be done easily since there can be no more failures within a group, and the leader pools the states of the nodes in its group and the returned messages to determine this). If a group has a node which has received all of the initial values and finds that there is at least one node in the other group which also has received all of the initial values, then they both reach the same final states. If a group knows that there is no node in the other group which has received all of the initial values then the nodes in that group move into Ab states. If no node in a group has received all of the initial values then the nodes in that group also move into Ab states. It is clear now that the algorithm works correctly: the two groups move into the same final states in all cases, and no failed node would have moved into a final state until it has known that all of the other nodes have received the initial values. []
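The failure-free part of the protocol described in the proof of Theorem 3 can be simulated in a few lines. This is a simplified sketch under stated assumptions, not the protocol itself: the function name `run_failure_free` is hypothetical, message delay is ignored (a value sent in a step is treated as received in that step), a node's own value is treated as a message to itself, and the handling of partitions and returned messages (leader election, pooling of states) is not modelled at all.

```python
def run_failure_free(initial_values):
    """Simulate one failure-free execution; return {node id: 'Com' or 'Ab'}."""
    n = len(initial_values)
    ids = list(range(1, n + 1))
    value_of = dict(zip(ids, initial_values))
    received = {i: {} for i in ids}    # initial values collected by node i
    confirmed = {i: {} for i in ids}   # 0/1 reports collected by node i

    # Steps 1..n: in step i, every node sends its initial value to node i.
    for i in ids:
        for sender in ids:
            received[i][sender] = value_of[sender]

    # Steps n+1..2n: in step n+i, every node reports to node i whether it
    # holds all n initial values (1) or not (0).
    for i in ids:
        for sender in ids:
            confirmed[i][sender] = 1 if len(received[sender]) == n else 0

    decisions = {}
    for i in ids:
        everyone_has_all = all(confirmed[i][s] == 1 for s in ids)
        all_ones = all(v == 1 for v in received[i].values())
        # Without failures everyone_has_all is always true, so the wait
        # described in the text ends after step 2n.
        decisions[i] = "Com" if everyone_has_all and all_ones else "Ab"
    return decisions
```

The second round is what the partition case relies on: a node moves into a final state only after learning that every node holds all initial values, so a group can later infer from returned messages how far the other group had progressed.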

Now we explore the possibility of nodes failing independently of the partition failure. Thus a node can fail either before the partition or after it. Consider first the node failures after a partition occurs. The above algorithm fails in this case because it is no longer possible for a group to determine correctly whether there is a node in the other group which has received all of the initial values: all nodes in the other group which have received all of the initial values may fail immediately after the partition. In fact we have the following result:
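The failure mode described above can be made concrete with a toy scenario. This is a hedged sketch under stated assumptions, not part of the paper: the four-node layout, the function names `rule` and `outcome`, and the modelling choice that a group's belief about the other side is frozen at the partition instant are all hypothetical.

```python
from typing import Set, Tuple

COMMIT, ABORT = "commit", "abort"


def rule(own_full: Set[int], other_full_believed: Set[int], all_ones: bool) -> str:
    """Decision rule from the proof of Theorem 3, applied by one group."""
    if not own_full or not other_full_believed:
        return ABORT
    return COMMIT if all_ones else ABORT


def outcome(crashed_after_partition: Set[int]) -> Tuple[str, str]:
    """Decisions of groups {1, 2} and {3, 4} when every initial value is 1."""
    g1, g2 = {1, 2}, {3, 4}
    full_at_partition = {1, 2, 3}  # node 4 has not yet received every value
    # Assumption: beliefs about the other group are fixed by the messages
    # returned at the partition instant; later crashes there are invisible.
    g1_belief = full_at_partition & g2
    g2_belief = full_at_partition & g1
    # A node that crashes after the partition can no longer inform its group.
    g1_own = (full_at_partition & g1) - crashed_after_partition
    g2_own = (full_at_partition & g2) - crashed_after_partition
    return rule(g1_own, g1_belief, True), rule(g2_own, g2_belief, True)


print(outcome(set()))  # ('commit', 'commit')
print(outcome({3}))    # ('commit', 'abort')
```

In the second call, node 3 is the only fully informed node on its side and crashes right after the partition, so the first group still commits while the surviving node 4 aborts. The first group cannot distinguish the two runs, which is the obstacle the next theorem formalizes.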


Theorem 4. Let $G$ be a network whose topology is a complete graph. Then there is no nonblocking commit protocol if a single node can fail after a partition occurs, even if a) exactly two groups are formed due to the partition, b) all undelivered messages are returned, and c) the transmission mode is broadcast.

Proof. Let $C$ be an accessible bivalent configuration reached by a failure-free run. Let $e$ denote the non-failure event applicable to $C$ and let $p$ be the agent of $e$. Two transitions are possible from $C$: a partition, and the transition due to $e$. Let $P$ denote a partition transition from $C$ such that two groups $G_1$, $G_2$ are formed and only links fail. Assume that $p$ is in $G_1$. Assume that neither $P(C)$ nor $e(C)$ is bivalent and that their valencies are not the same. Let us assume w.l.o.g. that their valencies are 0 and 1 respectively. As in the proof of Theorem 2, we can obtain a contradiction to the hypothesis that $C$ is bivalent. Let $P'$ be a partition transition from $e(C)$ forming the same groups $G_1$, $G_2$. Let $f$ denote the failure of $p$. $P'(e(C))$ must be 1-valent and $f(P(C))$ must be 0-valent. Let $c_0$ be a configuration reachable from $f(P(C))$ with decision value 0 and $c_1$ a configuration reachable from $P'(e(C))$ with decision value 1. Let $\sigma_0$, $\sigma_1$ be the corresponding schedules and let $\tau_0$, $\tau_1$ be their subschedules on $G_2$. Now it is easy to see that $\mathrm{merge}(\sigma_0 - \tau_0, \tau_1)(f(P(C)))$ and $\mathrm{merge}(\sigma_1 - \tau_1, \tau_0)(P'(e(C)))$ must lead to configurations in which the nodes in $G_2$ are in the same final states. But this is a contradiction. []

Theorem 5. Let G be a network with the topology of a complete graph. Assume that any number of nodes can fail but no nodes fail after a partition occurs. Assume that exactly two groups are formed due to partitions, and undelivered messages are returned. Then there is a nonblocking commit protocol.

Proof. The protocol given in the proof of Theorem 3 remains nonblocking under the conditions given here also. The details of correctness are similar and are omitted. []

References

[1] P.A. Alsberg et al, "Multicopy resiliency techniques," in Tutorial: Distributed Database Management, IEEE Computer Society, 1978, pp. 128-175.
[2] P.A. Bernstein et al, "Concurrency control in a system for distributed databases (SDD-1)," ACM TODS, 5:1, Mar. 80.
[3] F.Y. Chin and K.V.S. Ramarao, "An information-based model for failure-handling in distributed database systems," IEEE Tr. on Software Engg, April 1987.
[4] -, "Optimal termination protocols for network partitioning," SIAM J. on Computing, Feb. 1986. A preliminary version has appeared in Proc. 2nd ACM PODS, 1983, pp. 25-35.
[5] S. Davidson, "Optimism and consistency in partitioned distributed database systems," ACM TODS, Sept. 1984, pp. 456-482.
[6] S. Davidson et al, "Consistency in a partitioned network: a survey," Tech Rep., U. of Penn., 1984.
[7] H. Garcia-Molina et al, "Data-patch: integrating inconsistent copies of a database after a partition," Proc. 3rd IEEE Symp. on Reliability in Distributed Software and Database Systems, 1983, pp. 38-46.
[8] D.K. Gifford, "Weighted voting for replicated data," Proc. 7th Symp. on Operating System Principles, 1979, pp. 150-162.
[9] S.K. Sarin et al, "System architecture for partition-tolerant distributed databases," IEEE TC, Dec. 1985, pp. 1158-1163.
[10] D. Skeen and M. Stonebraker, "A formal model of crash recovery in a distributed system," IEEE TSE, May 1983.
[11] D. Stott Parker et al, "Detection of mutual inconsistency in distributed systems," IEEE TSE, May 1983.
[12] R.H. Thomas, "A majority consensus approach to concurrency control for multiple copy databases," ACM TODS, June 79, pp. 180-209.
[13] D. Dolev et al, "On the minimal synchronism needed for distributed consensus," JACM, Vol. 34, No. 1, Jan. 1987, pp. 77-97.
[14] C. Dwork et al, "Consensus in the presence of partial synchrony," Proc. 3rd ACM Symp. on Principles of Distributed Computing, 1984, pp. 103-118.
[15] M. Fischer et al, "Impossibility of distributed consensus with one faulty process," JACM, Vol. 32, No. 2, April 1985, pp. 374-382.
[16] C-L. Huang and V.O.K. Li, "A termination protocol for simple network partitioning in distributed database systems," Proc. 3rd IEEE Symp. on Data Engineering, Feb. 1987, pp. 455-465.
