
Consistency in Optimistic Replication Systems

by

Sunny Ming-Cheung Ho

B.Sc.(Honours), University of British Columbia, 1999

AN ESSAY SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE STUDIES

(Department of Computer Science)

We accept this essay as conforming to the required standard

The University of British Columbia

August 2003

© Sunny Ming-Cheung Ho, 2003

Abstract

Consistency is a major issue in optimistic replication. Four concepts related to consistency will be discussed: eventual consistency, bounded inconsistency, client-centric consistency, and conflict resolution. Eventual consistency is concerned with ways of ensuring that all replicas eventually converge to a common state. Bounded inconsistency relates to limiting the amount of inconsistency between any two replicas. Client-centric consistency refers to providing clients with guarantees on the quality of data in a replica. Conflict resolution is used to reconcile conflicting updates made to replicas.


Contents

Abstract

Contents

Acknowledgements

1 Introduction

1.1 Pessimistic Replication

1.2 Optimistic Replication

2 Eventual Consistency

2.1 System Model

2.2 State-Transfer Systems

2.3 Operation-Transfer Systems

2.3.1 Update Propagation

2.3.2 Update Scheduling

2.3.3 Dealing with Conflicts

2.4 Discussion

3 Bounded Inconsistency

3.1 Continuum between Strong and Optimistic Consistency

3.2 TACT

3.2.1 Numerical Error

3.2.2 Order Error

3.2.3 Staleness

3.3 Discussion

4 Client-Centric Consistency

4.1 Read Your Writes

4.2 Monotonic Reads

4.3 Writes Follow Reads

4.4 Monotonic Writes

4.5 Implementation

4.6 Discussion

5 Conflict Resolution

5.1 Conflict Detection

5.1.1 Syntactic Conflict Detection

5.1.2 Semantic Conflict Detection

5.2 Conflict Resolution

5.3 Discussion

6 Conclusion

Bibliography


Acknowledgements

I would like to thank my supervisor, Dr. Norm Hutchinson, for his support, patience, and guidance during the course of my studies in the Master's program.

Sunny Ming-Cheung Ho

The University of British Columbia

August 2003


Chapter 1

Introduction

Data replication can be used to improve availability and performance [7]. It can allow

data to remain accessible even when there are node and network failures [7], and

it can also guard against permanent data loss when a replica fails [8]. Performance

can be improved by allowing concurrent access to replicas and by reducing network

latency because accesses can be directed to nearby replicas [7].

Pessimistic replication and optimistic replication are two contrasting replica-

tion models. They represent the two extremes in the availability-consistency tradeoff

[13]. Pessimistic replication favours consistency over availability, while optimistic

replication favours availability over consistency. In describing these two models, I

assume that replicas can accept read and write requests when there are no network

or node failures in the system.

1.1 Pessimistic Replication

In pessimistic replication, users never observe any inconsistencies in the replicated

data. In terms of consistency, it appears to the users as if there is only one replica


[7]. Conceptually, an update made to a replica is synchronously propagated to all

the other replicas. When nodes or networks fail, access to data may be denied

to prevent users from viewing inconsistent data [7]. For instance, in the presence

of network partitions, this may mean that access to data may be denied until the

partition is healed. Also, if a replica is unavailable (perhaps due to node failure),

it may prevent other replicas from being accessed temporarily until either the node

failure is detected [7] or the node recovers.

1.2 Optimistic Replication

Optimistic replication allows users to access any replica for reading or writing even

when there are network failures or when some replicas are unavailable. This property

of optimistic replication has two significant implications. First, the states of replicas

can be temporarily mutually inconsistent. An update can be applied to a single

replica without the update being synchronously applied to other replicas. There

may even be a substantial time lag from when an update is applied at a replica

to when it is eventually propagated to other replicas, which may result in stale

reads. Second, concurrent updates to different replicas may introduce conflicts. For

instance, in an optimistically-replicated airline reservation system [13], two replicas

may accept a reservation for the same seat. Despite these two drawbacks, optimistic

replication presents a number of advantages over pessimistic replication [7]:

• Availability. Data availability is high as accesses to data are never blocked.

• Networking flexibility. Networks do not need to be fully connected for replicas

to remain fully accessible.


• Scalability. Greater number of sites can be supported because synchronous

communication is not needed for accepting updates.

This essay will introduce some issues related to consistency in optimistic

replication. In particular, four concepts related to consistency that researchers have

investigated will be discussed: eventual consistency, bounded inconsistency, client-

centric consistency, and conflict resolution. Eventual consistency is concerned with

ways of ensuring that all replicas eventually converge to a common state. Bounded

inconsistency relates to limiting the amount of inconsistency between any two repli-

cas. Client-centric consistency refers to providing clients with guarantees on the

quality of the data when accessing a replica. Conflict resolution is used to reconcile

conflicting updates made to replicas.


Chapter 2

Eventual Consistency

A fundamental goal in optimistic replication is for all the replicas to eventually con-

verge to a common state [7]. Researchers have studied various mechanisms for pro-

viding eventual consistency. Saito and Shapiro [7] provide a comprehensive overview

of some of the techniques and methods developed for achieving eventual consistency

from which I present the basic concepts.

2.1 System Model

Conceptually, updates are initiated at a single replica and eventually propagated

to all other replicas. Updates are applied immediately to the local replica. There

are two forms in which updates can be propagated from one replica to another. In

state-transfer systems [7], the entire replica is transferred and overwrites another

replica. The updates applied to the replica are implicitly propagated when the

replica is transferred. In operation-transfer systems [7], each replica maintains a

write log that contains updates initiated locally as well as updates received from

other replicas. Only the updates in the write log are propagated. There are varying


methods for achieving eventual consistency using either state-transfer or operation-

transfer.

2.2 State-Transfer Systems

Achieving eventual consistency in state-transfer systems only requires that the most

recent replica be propagated and copied over the other replicas. This can be easily

accomplished by associating a timestamp with each replica that denotes the time

when it was last updated. To carry out update propagation, two replicas compare

their timestamps, and the replica with the more recent timestamp is copied over

the older replica. Given enough rounds of propagation, every replica will eventually

contain the most recent version. Assuming that clocks at the replicas are loosely

synchronized, using physical clocks for timestamping replicas is acceptable.
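
To make the comparison step concrete, the following is a minimal Python sketch of such a last-writer-wins exchange. The class and method names are illustrative rather than taken from any particular system.

    import time

    class StateReplica:
        def __init__(self):
            self.value = None
            self.timestamp = 0.0          # time of the last update; 0 = never updated

        def write(self, value):
            self.value = value
            self.timestamp = time.time()  # assumes loosely synchronized clocks

        def sync_with(self, other):
            # The replica with the more recent timestamp overwrites the older
            # one. Note that concurrent updates are not detected by this scheme.
            if self.timestamp > other.timestamp:
                other.value, other.timestamp = self.value, self.timestamp
            elif other.timestamp > self.timestamp:
                self.value, self.timestamp = other.value, other.timestamp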

Even if concurrent updates to different replicas are not detected, eventual

consistency can still be achieved as long as one timestamp can be unambiguously

determined to be larger than the other whenever two timestamps are compared.

Without detecting concurrent updates, though, some updates may be lost. For

instance, if there are two file replicas and both are updated concurrently, then the

replica with the more recent timestamp will overwrite the other replica, and the

update that was made at the overwritten replica will essentially be lost. However,

ignoring concurrent updates is generally not an acceptable policy. A possible way

to achieve eventual consistency even when concurrent updates are detected is to

create a reconciled version of the two replicas containing the concurrent updates

and to timestamp the reconciled version in a way so that it is considered more

recent than either replica [1]. Details of detecting concurrent updates are related to

conflict detection and resolution and will be discussed in Chapter 5.


2.3 Operation-Transfer Systems

Eventual consistency in operation-transfer systems can be achieved through two

mechanisms: update propagation and update scheduling [7].

2.3.1 Update Propagation

Updates made at each replica need to be propagated to all other replicas. This

can be achieved through epidemic propagation. In epidemic propagation, a replica

propagates both the updates in its log that were initiated at the replica as well as

updates it received from other replicas. One advantage of epidemic propagation

is that updates may still reach all the replicas even if the network is never fully

connected at any point in time.

To efficiently determine the updates to propagate to another replica, each

replica maintains a timestamp vector that indicates the set of writes in its write log.

First of all, each update initiated at a replica receives a unique timestamp from that

replica. The timestamp can come from a physical clock, Lamport logical clock [6],

or an increasing counter. The timestamp vector, TV, has an entry for each replica.

At a given replica, a value of i for entry j indicates that the replica has received

all updates with timestamp less than or equal to i from replica j. Hence, a replica

can simply update its own entry when it accepts an update initiated locally. When

one replica wants to propagate its updates to another replica, it first obtains the

other replica’s timestamp vector to determine which writes it has that the receiving

replica doesn’t have. Then the replica scans its write log and sends the updates that

are missing at the receiving replica. After sending the updates, the replica sends its

timestamp vector to the receiving replica to allow the receiving replica to update its

timestamp vector. The receiving replica updates its timestamp vector by comparing


the entries in its own timestamp vector with those of the other timestamp vector

and taking the pairwise maximum for each entry.
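
As a minimal Python sketch, one such propagation round could look as follows, assuming counter-based timestamps and log entries that record the originating replica. The names are illustrative.

    class OpReplica:
        def __init__(self, rid, n_replicas):
            self.rid = rid                  # this replica's index
            self.log = []                   # entries of the form (origin, ts, op)
            self.tv = [0] * n_replicas      # timestamp vector

        def local_update(self, op):
            self.tv[self.rid] += 1          # assign a unique local timestamp
            self.log.append((self.rid, self.tv[self.rid], op))

        def propagate_to(self, receiver):
            # Send every logged update that the receiver's timestamp vector
            # says it is missing.
            for origin, ts, op in self.log:
                if ts > receiver.tv[origin]:
                    receiver.log.append((origin, ts, op))
            # Then merge the vectors by taking the pairwise maximum.
            receiver.tv = [max(a, b) for a, b in zip(receiver.tv, self.tv)]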

2.3.2 Update Scheduling

Even if all the updates have propagated to all the replicas, each replica may have

received them in a different order. Update scheduling deals with ways of ordering

updates in a write log so that when each replica executes its own sequence of updates,

the result is that all the replicas are identical. A simple way of achieving this goal

is to totally order the updates at every replica (i.e., the order of updates is identical

at each replica). However, this is not always required for replicas to converge. For

example, if updates only contain increment and decrement operations that commute,

then two replicas will converge to the same value as long as they both have the same

set of updates.

Update scheduling is applied when either an update is initiated at a local

replica or when a replica receives a set of updates through update propagation. The

write log may need to be undone and redone when update scheduling is performed.

Update scheduling can be categorized as syntactic or semantic scheduling [7].

Syntactic Scheduling

In syntactic scheduling, updates in a write log are ordered according to the times-

tamp values of the updates. This results in a total ordering of the updates at every

replica and ensures that the replicas are identical if they have received the same

set of updates. If physical clocks are used for the timestamp values, the replicas’

clocks should be loosely synchronized so that new updates do not get ordered before

older updates in the write log (that have a “newer” timestamp due to clocks being


excessively out of sync). Using Lamport logical clocks [6] for timestamps ensures

that causal dependencies based on Lamport's happened-before relationship are

maintained across all logs.
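
With log entries of the (origin, ts, op) form used in the propagation sketch above, syntactic scheduling reduces to a deterministic sort, with the originating replica's id breaking timestamp ties so that every replica derives the same total order from the same set of updates:

    def syntactic_order(log):
        # Sort by timestamp, then by originating replica id to break ties.
        return sorted(log, key=lambda entry: (entry[1], entry[0]))

A replica whose log order changes after merging in remote updates would undo its tentative state and replay the sorted log.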

Semantic Scheduling

Semantic scheduling takes into account the semantics of the updates themselves to

determine an ordering for the updates in the write log. This may reduce update

conflicts and the amount of undoing and redoing in the write log [7]. For instance,

suppose a write log contains two concurrent updates to a file system in which one

update modifies a file in a directory while another update deletes the parent direc-

tory. Logically, a file cannot be modified after its parent directory has been deleted.

Semantic scheduling could ensure that the update which modifies the file is ordered

before the update that deletes the parent directory, regardless of any timestamps

that may be associated with those updates. As another example, consider updates

that have only commutative increment or decrement operations. Using semantic

scheduling instead of syntactic scheduling would avoid any undoing of the write log

since the replicas would be mutually consistent once all the updates in the system

have propagated to all the replicas.

Update Commitment

Update commitment is a supplemental mechanism in the update scheduling process

that finalizes the position of updates in a write log. Even though total update

propagation combined with syntactic or semantic scheduling will ensure eventual

consistency, update commitment is useful for stabilizing tentative data and for log

management.


When updates are initially inserted into the write log, they are in a tentative

state because the effect of the updates may change if updates propagated from other

replicas are inserted in an earlier position in the log. For instance, in a meeting room-

scheduling application, consider the case when a user reserves a room at one replica

and another user concurrently makes a conflicting reservation at another replica.

Let the first reservation be R1, and the conflicting reservation be RC. Suppose, RC

is propagated to the first replica and inserted into the write log at a position earlier

than R1’s position. This requires R1 to be undone before RC is inserted into the

log. When R1 is redone, a conflict results. Depending on the conflict detection and

resolution mechanism being used, the effects of update R1 may be nullified, and

only reservation RC holds. If R1 had been committed, then its position in the write

log would be fixed, and no updates received later through propagation would be inserted

before it in the write log. Hence the effect of R1 on the replica at the time it is

executed in the log would be certain. Within an update log, the proportion of committed

updates gives an indication of the stability of the replicated data. The greater the

percentage of committed updates, the greater the degree of data stability as there

would be fewer updates in the log whose effects when executed on the replica could

potentially change.

Also, for practical reasons, the number of updates in a write log cannot

accumulate indefinitely. Since committed updates do not need to be undone and

redone anymore, they may be permanently removed from the write log to save

space. There are various mechanisms for committing updates. One mechanism is to

use a primary replica to commit updates [9]. The primary replica simply fixes the

position of updates in its write log and propagates the commitment information to

other replicas. Tentative updates are ordered after the committed updates in the


log.
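
The resulting log discipline can be sketched in one line: the committed prefix is frozen, and only the tentative suffix is reordered (here syntactically, as in the sketch above).

    def schedule_log(committed, tentative):
        # The primary fixes the order of the committed prefix; the tentative
        # suffix is re-sorted and may still change as new updates arrive.
        return committed + sorted(tentative, key=lambda e: (e[1], e[0]))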

2.3.3 Dealing with Conflicts

Conflict detection and resolution in operation-transfer systems will be discussed in

Chapter 5.

2.4 Discussion

As mentioned by Saito and Shapiro [7], the choice of whether to use state-transfer

or log-transfer really depends on the constraints and goals of the system being

implemented as each has its own advantages over the other. One advantage of

state-transfer over log-transfer is that there is no additional overhead for storing

an update log. However, in state-transfer, the time to propagate an entire replica

increases linearly with the size of the replica, while propagation time is independent

of replica size using log-transfer. In log-transfer, though, the size of the log can grow

quickly and consume space if updates occur frequently. It is issues like these that

the designers of optimistic replication systems need to carefully consider.


Chapter 3

Bounded Inconsistency

One of the problems with optimistic replication is that there are no bounds on

how far apart any two replicas may be at any moment. Even though eventual

consistency ensures that replicas eventually converge to a common state, at any

given moment, two replicas may have diverged far apart from each other. Without

any bounds on the degree of divergence, some applications cannot be practically

employed. For instance, consider an airline reservation system that has two replicas

where each replica can accept reservations [13]. If there were no guarantees on how

far the two replicas may diverge from each other, in the worst case, each seat can

become double-booked. In their paper [13], Yu and Vahdat explore different ways

of bounding inconsistency among replicas in an optimistic replication environment,

and I present some of their work below.


3.1 Continuum between Strong and Optimistic Consistency

Yu and Vahdat [13] note that there is a continuum between strong consistency and

optimistic consistency. Strong consistency is the same kind of consistency provided

by pessimistic replication in which users never observe any inconsistencies in the

replicated data. Optimistic consistency refers to the kind of consistency offered

by optimistic replication systems in which replicas eventually converge but may be

highly divergent from one another at any given moment. Along the continuum

the maximum distance between any replica and a “consistent” image [13] (that

represents a replica that has had all the writes in the system applied to it) varies

from zero to infinity. A distance of zero corresponds to strong consistency and a

distance of infinity corresponds to optimistic consistency.

Moving along the continuum involves a trade-off between consistency and

availability. Availability increases as consistency decreases because there is less prob-

ability that each write will require synchronous communication with other replicas

to ensure a certain level of mutual consistency.

Yu and Vahdat present a set of metrics that allow applications to specify

their desired level of consistency throughout the range between strong consistency

and optimistic consistency. These consistency requirements essentially translate into

bounds on the degree of divergence between replicas. These metrics are implemented

in the form of a middleware toolkit called TACT (Tunable Availability/Consistency

Tradeoff). For a given read/write request, TACT determines whether coordination

with other replicas is needed to ensure that the inconsistency bounds are not ex-

ceeded. If coordination is not needed, then the read/write is processed locally, but


if coordination with other replicas is necessary, TACT blocks the read/write request

until it is able to pull updates from other replicas and/or push updates to other

replicas in order to stay within the inconsistency bounds. The replication model

assumed by Yu and Vahdat is based on operation-transfer. Updates are considered

tentative until committed using a commitment algorithm that uses the minimum

value in its timestamp vector to serialize and commit the updates with timestamp

values less than or equal to that minimum value.

3.2 TACT

TACT includes three metrics for specifying consistency requirements: Numerical

Error, Order Error, and Staleness. Numerical error limits the difference between

the numerical weight of writes applied locally and the numerical weight of writes

in the “consistent” image. Order error bounds the number of tentative writes at a

replica. Staleness specifies the maximum real time before the most recent write in

the system must be propagated to a replica. If the three metrics are bound to zero,

then TACT essentially enforces pessimistic replication. If there are no bounds, then

TACT would essentially be implementing optimistic replication.

3.2.1 Numerical Error

Numerical Error bounds the numerical difference between the sum of the weighted

updates seen locally at a replica and the sum of the weighted updates in a “consis-

tent” image. Each update has a weight associated with it. For instance, if a replica

stores the balance of a bank account, then the associated weight for each update to

the balance may be the amount of the withdrawal or deposit. In this case, numerical

error bounds the discrepancy between the bank balance at a replica and the actual


bank balance. Each replica is responsible for pushing local updates to other replicas

to ensure that the numerical error bound is not violated. For replica values that are

not inherently numerical in nature, weights can be assigned to updates to denote

the importance for the update to be quickly propagated to other replicas. A heavily

weighted update is more likely to force propagation than a lightly weighted update.

Two kinds of numerical error can be specified: absolute numerical error and relative

numerical error.

The notation used for describing the algorithms for implementing numerical

error will be consistent with that presented in the paper by Yu and Vahdat [13]. Vi

denotes the value of replica i, and Vfinal denotes the replica value if all writes in the

system were applied. Note that the value of Vfinal may not necessarily be known at

any replica.

Absolute Numerical Error

Absolute numerical error is used to specify a bound on the absolute difference be-

tween Vi for any replica i and Vfinal. In other words, |Vfinal − Vi| ≤ α for some

α ≥ 0. This bounds the maximum difference between the value of a replica and the

actual value of a “consistent” image.

Yu and Vahdat’s method for implementing an absolute numerical error bound,

α, involves dividing up α among all the replicas. The division may not necessarily be

uniform, and each replica is aware of the allocated values for all the other replicas.

For instance, if the absolute numerical error for a bank balance is α = $100 and

there are ten replicas, then one possible allocation is for two replicas to be allocated

$2 each and for the other eight replicas to be allocated $12 each, totalling $100

across all replicas. Each update is associated with a numerical weight. For a replica


that stores a bank balance and processes updates for deposits and withdrawals, the

amount of the deposit or withdrawal can be the associated weight for the update.

Each replica maintains two local variables for each of the other replicas. These

variables record the sum of the positive and negative weighted updates that have

been accepted locally but not yet propagated to the other replicas. Using the bank

balance example, a deposit may have a positive weight and a withdrawal may have a

negative weight. When a replica receives a request for an update, the replica checks

whether accepting the write locally would violate the allocated absolute numerical

error values for any other replica. This can be done by checking whether either local

variable associated with the replica would have a value that exceeds the allocated

numerical error for that replica if the update is accepted locally. If any bounds are

exceeded, the replica pushes local updates from its log or the requested update

to the other replicas until it is able to accept the update. The local variables asso-

ciated with the contacted replicas are also updated. Local updates may be blocked

until the replica is able to successfully propagate the necessary updates.
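
The following Python sketch shows the shape of this bookkeeping, assuming per-replica allocations known everywhere and one signed weight per update. It illustrates the conservative check rather than TACT's actual code.

    class NumBoundedReplica:
        def __init__(self, rid, alloc):
            self.rid = rid
            self.alloc = alloc    # alloc[j]: share of the error bound given to replica j
            others = [j for j in alloc if j != rid]
            self.pos = {j: 0.0 for j in others}   # unpushed positive weight, per replica
            self.neg = {j: 0.0 for j in others}   # unpushed negative weight, per replica

        def accept_write(self, weight):
            for j in self.pos:
                # Would accepting this write overrun replica j's allocation?
                if (self.pos[j] + max(weight, 0.0) > self.alloc[j] or
                        -(self.neg[j] + min(weight, 0.0)) > self.alloc[j]):
                    self.push_updates_to(j)       # may block until j is reachable
            for j in self.pos:
                self.pos[j] += max(weight, 0.0)
                self.neg[j] += min(weight, 0.0)

        def push_updates_to(self, j):
            # Stand-in for propagating the pending writes to replica j, after
            # which nothing remains unpushed for j.
            self.pos[j] = 0.0
            self.neg[j] = 0.0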

This method for implementing the absolute numerical error bound is con-

servative in that one replica may propagate updates to another replica even when

the absolute numerical error bound is not violated. For instance, if a $100 absolute

numerical error bound on a bank balance is evenly divided among ten replicas, then

a deposit of $50 at one replica will cause the update to be propagated to all the other

replicas even though the largest difference between a replica’s bank balance and the

actual bank balance is only $50. However, since a replica has no local knowledge of

the unpropagated updates at other replicas, it is possible that other replicas each

have $10 worth of unpropagated deposits, in which case, accepting the $50 deposit

without propagating it to other replicas would violate the numerical error.


Relative Numerical Error

Relative numerical error bounds the absolute numerical error at a replica (i.e.,

|Vfinal − Vi|) in relation to |Vfinal|. Each replica can specify its own relative error,

γi = |Vfinal − Vi| / |Vfinal|, and it is assumed that each replica knows the relative numerical

error bound of all the other replicas. Before a replica accepts an update, it may need

to push updates from its log or the requested update to other replicas to ensure that

the other replicas’ relative numerical error bounds are not violated upon accepting

a local update. Specifically, each replica must ensure that for all other replicas with

a specified relative numerical error bound, say γj , that |Vfinal − Vj | ≤ γj × |Vfinal|.

If γj × |Vfinal| is considered as an absolute numerical error bound, then it is con-

ceivable to apply the method used to bound absolute numerical error in order to

determine whether any relative error bounds are violated. However, the problem is

that replicas will generally not know |Vfinal|.

The method proposed by Yu and Vahdat to enforce the relative numerical

error bound is to convert the relative numerical error bound into an absolute numer-

ical error bound by using a conservative estimate of |Vfinal| based on information

local to each replica. To show how this method works, suppose that all the replicas

are initially within their own relative numerical error bounds. Each replica can lo-

cally compute its estimate of |Vfinal| using Vi and γi. Specifically, at each replica,

−|Vfinal − Vi| ≥ −γi × |Vfinal|. Also, |Vfinal| − |Vi| ≥ −|Vfinal − Vi|. Combining and rearranging these inequalities yields |Vfinal| ≥ |Vi| / (1 + γi), which gives a lower bound for |Vfinal|. Substituting this lower bound into the expression |Vfinal − Vj| ≤ γj × |Vfinal| generates the expression |Vfinal − Vj| ≤ γj × |Vi| / (1 + γi). The expression γj × |Vi| / (1 + γi) can be treated as an allocated absolute numerical error bound for replica j. Hence, the method used to implement absolute numerical error can be applied to determine whether a replica needs to push updates to another replica before accepting an update.
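
Written out as a single chain in LaTeX notation, the derivation above is:

    \[
      |V_{final}| - |V_i| \;\ge\; -|V_{final} - V_i| \;\ge\; -\gamma_i\,|V_{final}|
      \quad\Longrightarrow\quad
      |V_{final}| \;\ge\; \frac{|V_i|}{1 + \gamma_i},
    \]
    \[
      \text{so enforcing}\quad
      |V_{final} - V_j| \;\le\; \gamma_j\,\frac{|V_i|}{1 + \gamma_i}
      \quad\text{guarantees}\quad
      |V_{final} - V_j| \;\le\; \gamma_j\,|V_{final}|.
    \]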

Since a lower bound on |Vfinal| is used as an estimate for the value of |Vfinal|,

the method for enforcing relative numerical error is conservative and may push

updates even when the relative numerical error bounds are not actually violated.

3.2.2 Order Error

Order error bounds the number of tentative writes at a replica. Each replica can

locally bound order error by checking if the number of tentative updates in its

write log would exceed the specified order error limit if it accepts a local update.

Each replica can specify its own order error limit independent of other replicas. If

the order error is exceeded, the replica performs a write commitment algorithm to

commit updates in its write log to reduce the number of tentative writes. One way

of committing updates, as suggested by Yu and Vahdat, is to find the smallest value

in the replica’s timestamp vector and to use that value to denote the set of the

updates that could be committed. All updates with timestamps less than or equal

to the minimum value in the timestamp vector can be committed.
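
As a sketch, with log entries and timestamp vectors as in the Chapter 2 examples, and assuming the replica object exposes its committed and tentative log segments, the order-error check and the timestamp-vector commitment rule look like this:

    def enforce_order_error(replica, bound):
        # If accepting one more tentative write would exceed the bound, commit
        # every update whose timestamp is covered by all replicas, i.e., at
        # most the minimum entry of the timestamp vector.
        if len(replica.tentative) + 1 > bound:
            cutoff = min(replica.tv)
            commit_now = [e for e in replica.tentative if e[1] <= cutoff]
            replica.committed += sorted(commit_now, key=lambda e: (e[1], e[0]))
            replica.tentative = [e for e in replica.tentative if e[1] > cutoff]
            # If the bound is still exceeded, the replica must first pull
            # updates from other replicas to advance min(replica.tv).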

3.2.3 Staleness

Staleness specifies the maximum real time before the most recent write in the sys-

tem must be propagated to a replica. To bound staleness, each replica maintains

a real-time vector, RT , that denotes the real time of the latest update received

from another replica. If the staleness bound is T, then each replica checks if currentTime − RT[i] < T for each server i. If a replica discovers that the staleness bound is violated, it must pull updates from those replicas that violate the staleness bound. The real-time clocks at the replicas should be loosely synchronized.
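
A minimal sketch of the check, with RT kept as a dictionary from replica id to the real time of the newest update received from that replica:

    import time

    def stale_sources(RT, T):
        # Replicas whose newest received update is older than the bound T.
        now = time.time()
        return [i for i, t in RT.items() if now - t >= T]

    # Before serving a request, a replica would pull updates from every
    # replica in stale_sources(RT, T) to restore the staleness bound.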

3.3 Discussion

The three metrics are not expected to be exported to end users. Instead, end users

should deal with semantically meaningful consistency bounds, and it is up to the

application programmer to realize those bounds using the metrics provided by TACT

[12].

One potential problem with using the commitment protocol suggested by Yu

and Vahdat is that an inactive replica could prevent other replicas from committing

updates in their write log, which in turn would prevent a replica from accepting

additional writes if the order error limit was already reached. The staleness metric

could conceivably be utilized to avoid this problem.

Availability to replicas may be adversely affected because local reads and

writes may be blocked until the local replica is able to successfully push updates to or

pull updates from other replicas in order to adhere to the inconsistency bounds [13].

As mentioned, numerical error actively pushes its local updates to other replicas

while order error and staleness pull updates from other replicas, and if a replica is

unavailable, perhaps due to node or network failure, it may prevent other replicas

from accepting local writes as the unavailable replica would not be able to receive

or send writes as required by other replicas.


Chapter 4

Client-Centric Consistency

In an optimistic replication system, a user who accesses a single replica will always

see consistent data [8]. If a user switches from one replica to another, he may see

inconsistent data [8]. Even if bounded inconsistency is used (e.g., TACT), the user

may still view inconsistencies (as long as the extreme case of strong consistency is

not being enforced under bounded inconsistency). For example, a user may issue a

write to a replica and then try to read what was just written. If the read is processed

by a different replica and the write has not yet propagated to that replica, then the

user will read stale data. From a practical perspective, this problem is relevant to

mobile computing environments where clients may connect to the nearest replica for

access to data. The form of consistency that addresses this problem is called client-

centric consistency [8]. Terry et al. present a particular solution to this problem

with session guarantees [10], and I present the main aspects of their solution.

Terry defines a session to be a “sequence of read and write operations per-

formed during the execution of an application” [10]. Consistency guarantees asso-

ciated with a session give assurances to the application that the replicas that are

accessed are consistent with respect to the operations that have previously been


requested during the session. There are four guarantees, each of which can be inde-

pendently applied, in any combination, to a session:

• Read Your Writes

• Monotonic Reads

• Writes Follow Reads

• Monotonic Writes

Conceptually, clients and applications can have multiple sessions simultaneously.

Reads and writes within a session may access different replicas, but from a consis-

tency point of view, it appears to the application as if a single shared replica is being

accessed.

In providing such guarantees, availability is traded off for consistency. Since

the underlying replication model is assumed to be optimistic, it is possible, perhaps

due to node or network failures, that none of the available replicas contain the

required updates to meet the consistency guarantees associated with a session, in

which case access to data will be denied.

We define DependentWrites(R) for a read R to be the smallest set of writes

at the replica processing R such that executing R on that set returns the same result

as processing R using the entire replica. This is essentially the same as the definition

of RelevantWrites in Terry’s paper [10].

4.1 Read Your Writes

The Read Your Writes guarantee ensures that any reads made during the session are

processed at replicas that contain all preceding writes made during the same session.


To see the motivation for this guarantee, consider the case when a user updates a

database and then tries to read the data that was updated. The Read Your Writes

guarantee ensures that if the update and subsequent read request were made during

the same session, then the read request would be performed at a replica that already

includes the update. Without this guarantee, it is possible for the read to return

stale data from a replica that has not received the update, possibly leaving the user

confused.

Since session guarantees only give the illusion of a single shared replica, it is

possible for reads and writes in a given session to be interleaved with writes outside

the session. Hence, in the database example, the read may return more up-to-date

information than a previous update made during the session.

4.2 Monotonic Reads

The Monotonic Reads guarantee ensures that reads in a session are only processed

by replicas that contain all writes seen by previous reads made in the same session.

This means that reads will return data that is at least as recent as what has been

previously read during the session. To be precise, if a session guarantees Monotonic

Reads, then a read request in the session can only be processed at a replica which

contains all the writes in DependentWrites(R), for every preceding read request, R,

made in the session.

As an example, consider a replicated database. Suppose the user wishes to

read a data value that had been previously read during the same session. With the

Monotonic Reads guarantee, the user is ensured that the value returned is at least

as recent as what was initially read. Without the Monotonic Reads guarantee, it is

possible that stale data may be returned if an out-of-date replica is accessed for the


latter read request.

4.3 Writes Follow Reads

The Writes Follow Reads guarantee ensures that Writes are only applied to replicas

that contain all the Writes seen by previous Reads during the session. Specifically,

before a Write can be accepted at a replica, the replica must contain all the Depen-

dentWrites(R) for each read R that preceded the write during the session.

As an example of how this guarantee can be used, consider a newsgroup

service [10]. Upon reading a posting a user may post a reply to it. Using Writes

Follow Reads will ensure that users of the service will see the reply only if the original

posting is also available at the server.

4.4 Monotonic Writes

The Monotonic Writes guarantee ensures that a write is accepted at a server only if

all preceding writes issued during the session are already at the server. An example

of where this might be useful would be when a programmer updates a software

library and subsequently updates the application to use the updated library [10].

Using the Monotonic Writes guarantee ensures that the write that updated the

application will only be applied to a server that contains the write for updating

the library, preventing the situation where a server has the updated version of the

application but an outdated version of the library.


4.5 Implementation

Terry suggests a practical implementation for session guarantees that is based on

timestamp vectors (Terry calls them version vectors). Each server maintains a times-

tamp vector that indicates the writes it has in its log. Timestamp values can be

based on any monotonically increasing clock.

On the client side, each session is associated with either one or two timestamp

vectors depending on the session guarantees chosen for the session. One timestamp

vector - call it write-set [10] - gives an indication of the writes made by the client

and would be used in any combination of the four session guarantees except when

only Monotonic Reads is used. The other timestamp vector - call it read-set [10] -

would be associated with the reads requested by the client and would be used in

any combination of the four session guarantees except when only the Monotonic

Writes guarantee is used. Upon reading from a server, the server’s timestamp vec-

tor is merged with the client’s read-set. Merging a server’s timestamp vector with

a read-set involves setting each entry in the read-set to the maximum of the cor-

responding entries in both timestamp vectors. Upon writing to a server, the client

receives a timestamp for the write and uses it to update the corresponding entry

in its write-set timestamp vector. Also, one timestamp vector dominates another

timestamp vector when all of its entries are at least as large as the corresponding

entries in the other vector.

The four guarantees would be implemented as follows using the read-set and

write-set of a session:

Monotonic Reads

When accessing a server to process a read request, a check is made as to whether

the server’s timestamp vector dominates the session’s read-set.


Read Your Writes

When accessing a server to process a read request, a check is made as to whether

the server’s timestamp vector dominates the session’s write-set.

Monotonic Writes

When accessing a server to process a write request, a check is made as to whether

the server’s timestamp vector dominates the session’s write-set.

Writes Follow Reads

When accessing a server to process a write request, a check is made as to whether

the server’s timestamp vector dominates the session’s read-set.

One simplification in the suggested implementation is that DependentWrites(R)

is conservatively estimated to be all the writes that are applied at the replica. With-

out this simplification, the computation of DependentWrites(R) for a read request

may be very expensive, especially for a complex query [10].

4.6 Discussion

There has seemingly been no further research in this area since session guarantees

were introduced by Terry, which could be an indication that his solution is satisfac-

tory.

I believe that the trade-off between availability and consistency should not

necessarily be as simple as denying access when the session guarantees cannot be

met by a replica. When none of the available replicas can satisfy a given session

guarantee, perhaps because the only replica that would have allowed access has


crashed, then access to stale replicas may be preferred by the user rather than

having no access at all. For example, for a replicated email inbox, allowing access to

a stale version of the inbox may be preferred over denying access to all the inboxes

because they do not satisfy the constraints of some session guarantee. This is really

a matter of policy which can be implemented on top of the session guarantees, and

one which, I believe, a system designer should be aware of.


Chapter 5

Conflict Resolution

Conflict resolution is a major issue in optimistic replication systems. By definition,

optimistic replication systems allow concurrent updates to different replicas, and

the concurrent updates may cause the replicas to be in a mutually conflicting state.

In this chapter, I will introduce the general ideas behind conflict detection and

resolution and present some mechanisms that have been devised for dealing with

conflicts.

5.1 Conflict Detection

Before conflicts can be resolved, they must first be detected. Conflicts can be de-

tected either syntactically or semantically [7].

5.1.1 Syntactic Conflict Detection

Given any two replicas, a syntactic conflict exists if concurrent updates have been

applied to the replicas. Hence, detection of a syntactic conflict simply implies de-

tecting the presence of concurrent updates. However, concurrent updates do not


necessarily mean that there is a semantic conflict as far as the application is con-

cerned. For instance, concurrent updates that reserve different meeting rooms in a

replicated calendar program are not considered an application-level conflict. Semantic

conflicts are considered later in this chapter.

In state-transfer systems, a syntactic conflict can be reliably detected by

associating version vectors with each replica [7]. Each replica has a version vector

with one entry for each replica. A version vector indicates the set of writes in a

replica. When an update is applied locally to a replica, it increments the timestamp

value in its own version vector entry. During update propagation, when one replica

overwrites another replica, the version vector of the source replica also overwrites

the version vector of the destination replica. A version vector V1 is said to dominate

version vector V2 if every entry in V1 is equal to or larger than the corresponding

entry in V2. A syntactic conflict exists between two replicas when neither version

vector dominates the other. In operation-transfer systems, a syntactic conflict can

be detected by comparing the timestamp vectors of the replicas. If neither vector

dominates the other, then a syntactic conflict exists.
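
In Python, the dominance test and the resulting conflict test are one-liners (a sketch; vectors are plain lists indexed by replica):

    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b))

    def in_conflict(v1, v2):
        # Concurrent updates: neither history subsumes the other.
        return not dominates(v1, v2) and not dominates(v2, v1)

    # For example, in_conflict([2, 1], [1, 2]) is True: each replica has
    # applied an update the other has not yet seen.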

5.1.2 Semantic Conflict Detection

In semantic conflict detection, two concurrent updates conflict if they would violate

the semantics of the application had the updates been applied to the same replica.

For example, using semantic conflict detection in a replicated airline reservation

system, two concurrent reservations for the same seat would be considered a semantic

conflict. However, if the concurrent reservations were for different seats, then there

would not be a semantic conflict. In state-transfer systems, detection of semantic

conflicts would require the replication system to have semantic knowledge of the


contents of the replicas to determine whether a semantic conflict exists. In log-

transfer systems, a possible way to detect semantic conflicts is for each update to

check whether a semantic conflict would occur if the update were executed against

the current state of the replica. Bayou [9] uses this approach.

5.2 Conflict Resolution

Once a conflict is detected, it needs to be resolved. Resolution of conflicts inherently

requires knowledge of the application semantics. For instance, in a file system,

conflicting updates to file replicas cannot be resolved by the file system without

knowledge of the semantics of the data in the file. Conceptually, when conflicts in

replicas are resolved, the conflicting replicas are overwritten with a new value that

is the result of reconciling the conflicting replicas.

Conflicts can be resolved manually or automatically. Manual resolution re-

quires user intervention, while automatic resolution doesn’t. Using the example of

the meeting-scheduling application, if a semantic conflict is detected for conflicting

reservations for a meeting room, manual resolution may simply notify the user of

the conflict and let the user decide how to resolve it. With automatic resolution,

the system has the means to attempt conflict resolution without user intervention,

perhaps because the system understands the semantics of the replica contents or

it provides mechanisms for application-specific resolution. Automatic resolution is

preferred in optimistic replication systems since updates may propagate epidemi-

cally and the user may not be available at the time a conflict is detected. Coda

and Bayou are two systems that support automatic conflict resolution through the

provision of mechanisms that allow applications to specify how conflicts are to be

resolved.


Coda is a distributed file system [4]. Conflicting updates to files are possi-

ble due to network partitions or disconnected client operation. Coda uses version

vectors to syntactically detect file conflicts. To facilitate automatic conflict resolu-

tion, Coda allows users to install application-specific resolvers (ASRs), which are

programs that can be invoked by Coda to resolve file conflicts [5]. Upon detecting

a file conflict, Coda locates the associated ASR for the file and executes it.

Bayou is a weakly consistent replicated database storage system [9]. It uses

log-transfer for update propagation and ensures eventual consistency by totally or-

dering the update logs at all replicas. Bayou supports application-specific conflict

detection and resolution through two mechanisms: dependency checks and merge

procedures. Each update is associated with an application-specified dependency

check and merge procedure. The dependency check is used to determine whether

the update semantically conflicts with any previous update in the log (i.e., the cur-

rent state of the replica). The dependency check contains a query and an expected

result. When the query is run on the replica and the value matches the expected

result, the update is considered not to conflict with any previous updates, and the

update is then executed. However, if the query result does not match the expected

result, then a semantic conflict is detected, and the merge procedure is run. The

merge procedure is application-specific and is used to resolve the conflict. Since

Bayou uses log-transfer, updates in the log may be undone and redone numerous

times during update scheduling. A dependency check is run every time its asso-

ciated update is executed. As a concrete example, consider a meeting-scheduling

application. An update to reserve a meeting room may include as its dependency

check a query on whether the room is available at a given time, while the associated

merge procedure may try to reserve another room.
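
A sketch of how such a write could be represented, with a hypothetical db object standing in for Bayou's storage interface:

    class DependencyCheckedWrite:
        def __init__(self, dep_query, expected_result, update, merge_proc):
            self.dep_query = dep_query              # query run against the replica
            self.expected_result = expected_result  # result that means "no conflict"
            self.update = update                    # applied when the check passes
            self.merge_proc = merge_proc            # application-specific resolver

        def execute(self, db):
            # Run every time the update is (re)executed during scheduling.
            if db.query(self.dep_query) == self.expected_result:
                db.apply(self.update)               # no semantic conflict detected
            else:
                self.merge_proc(db)                 # e.g., try to reserve another room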


5.3 Discussion

Bayou’s dependency check is a flexible application-independent mechanism. In

fact, one can simply ignore conflicts in Bayou by using null dependency checks,

in which case the updates simply modify the underlying database without checking

for application-specific conflicts [11]. The dependency check mechanism has also

been adopted in other recent systems. Oceanstore, which is a global utility infras-

tructure intended to provide persistent, highly available storage, has adopted the

dependency check for detecting file conflicts [3]. IceCube [2] is a log reconcilia-

tion system that uses application-specified dependency checks (or preconditions as

denoted in IceCube) to semantically detect conflicts.


Chapter 6

Conclusion

Consistency in optimistic replication is a major issue. This essay introduced four

concepts related to consistency: eventual consistency, bounded inconsistency, client-

centric consistency, and conflict resolution.

Eventual consistency is one of the fundamental requirements in optimistic

replication. It refers to the requirement for all replicas to eventually be mutually

consistent despite the presence of concurrent updates at different replicas. The basic

concepts for achieving eventual consistency are that updates need to be propagated

to all replicas and ordered in a way so that the replicas are mutually consistent. As

mentioned in this essay, updates can either be propagated implicitly using a state-

transfer mechanism or explicitly with a log-transfer mechanism. Using log-transfer

for propagation requires some form of update scheduling, using either syntactic or

semantic scheduling. Syntactic scheduling involves only comparing the timestamps

of the updates. Semantic scheduling determines an order based on the semantics of

the updates.

One of the problems in optimistic replication is that the contents of repli-

cas may diverge significantly from one another. This may lead to a high rate of


application-level update conflicts, which may be unacceptable for some applications,

such as an airline reservation system. Bounded inconsistency refers to bounding the

degree of divergence among replicas. Yu and Vahdat developed the TACT toolkit

that allows applications to specify inconsistency bounds along three dimensions:

numerical error, order error, and staleness.

Client-centric consistency deals with the problem that users may see inconsis-

tent data with respect to their own sequence of reads and writes when the sequence

involves accesses to different replicas. A solution proposed by Terry [10] is session

guarantees. A session is an abstraction for the sequence of reads and writes a user

or application performs. Four independent consistency guarantees may be associ-

ated with a session: Read Your Writes, Monotonic Reads, Writes Follow Reads,

and Monotonic Writes. An efficient implementation involves associating timestamp

vectors with each replica and with the read and write sets of each session.

One of the key differences between optimistic replication and pessimistic

replication is that concurrent updates are permitted in optimistic replication while

they are prohibited in pessimistic replication. Concurrent updates may lead to

conflicts in the contents of the replica in terms of application semantics. Detection

of conflicts can be done syntactically or semantically. Syntactic conflict detection

simply detects potential application-level conflicts by the presence of concurrent

updates. Normally, this kind of conflict detection is the only option in systems where

the application semantics of the replicas are unknown to the replication system.

Semantic conflict detection only detects concurrent updates that violate application

semantics.

Resolving conflicts inherently requires knowledge of the semantics of the

replica contents. Although manual resolution requiring user intervention is possi-


ble, it is undesirable since conflicts may be detected long after the user is no longer

using the system. Some recent systems have provided mechanisms for supporting

application-specific conflict resolution. One such system is Bayou, which provides

applications with the means to detect and resolve application-level conflicts through

the dependency checks and merge procedures supported by Bayou.

I believe that Bayou’s dependency check mechanism represents a significant

research result in optimistic replication. This is supported by the fact that two

systems that were developed after Bayou, Oceanstore and IceCube, both adopted

features from Bayou’s novel conflict detection mechanism. The flexibility of the

mechanism and its application-independent interface, which decouples the replica-

tion system from the application semantics, makes it easy to adopt in other

replication systems. One possible concern with using dependency checks is the larger

update log entries and their impact on propagation delay and storage space at the

replicas.

The commitment protocols used in optimistic replication systems seem to

lack robustness. For instance, Bayou’s commitment protocol uses a primary replica

to commit updates. However, the primary replica becomes a single point of failure.

Also, in TACT, the use of the minimum timestamp in the timestamp vector for

purposes of update commitment is ineffective if there are inactive replicas. As I

alluded to in Chapter 3, a possible way around this problem is for a replica to

contact all other replicas. However, this is clearly not scalable and may not even be

possible when network connectivity is intermittent. I believe that there is potential

for further research to make update commitment more robust.


Bibliography

[1] D. Stott Parker, Jr., Gerald J. Popek, Gerard Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen Kiser, and Charles Kline. Detection of mutual inconsistency in distributed systems. IEEE Transactions on Software Engineering, May 1983. Vol. 9, No. 3, pp. 240-247.

[2] A. Kermarrec, A. Rowstron, M. Shapiro, and P. Druschel. The IceCube approach to the reconciliation of divergent replicas. Proceedings of the Twentieth ACM Symposium on Principles of Distributed Computing, August 2001.

[3] John Kubiatowicz, David Bindel, Yan Chen, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westly Weimer, Christopher Wells, and Ben Zhao. OceanStore: An architecture for global-scale persistent storage. Proceedings of ACM ASPLOS, November 2000.

[4] Puneet Kumar. Mitigating the Effects of Optimistic Replication in a Distributed File System. PhD Thesis, Carnegie Mellon University, December 1994.

[5] Puneet Kumar and M. Satyanarayanan. Flexible and safe resolution of file conflicts. USENIX Winter, 1995. pp. 95-106.

[6] L. Lamport. Time, clocks, and the ordering of events in distributed systems. Communications of the ACM, 1978. Vol. 21, No. 7.

[7] Y. Saito and M. Shapiro. Replication: Optimistic approaches. Hewlett-Packard Technical Report HPL-2002-33, March 2002.

[8] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.

[9] D.B. Terry, M.M. Theimer, K. Petersen, A.J. Demers, M.J. Spreitzer, and C.H. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. ACM Symposium on Operating Systems Principles, December 1995. pp. 172-183.

[10] D.B. Terry, M.M. Theimer, K. Petersen, A.J. Demers, M.J. Spreitzer, and B.B. Welch. Session guarantees for weakly consistent replicated data. Proceedings of the Third International Conference on Parallel and Distributed Information Systems, September 1994. pp. 140-149.

[11] D.B. Terry, M.M. Theimer, K. Petersen, and M.J. Spreitzer. The case for non-transparent replication: Examples from Bayou. IEEE Data Engineering Bulletin, December 1998. Vol. 21, No. 4, pp. 12-20.

[12] Haifeng Yu and Amin Vahdat. Building replicated internet services using TACT: A toolkit for tunable availability and consistency tradeoffs. The Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, October 2000.

[13] Haifeng Yu and Amin Vahdat. Design and evaluation of a continuous consistency model for replicated services. Fourth Symposium on Operating Systems Design and Implementation, 2000.
