TRANSCRIPT
MUREX: A Mutable Replica Control Protocol for
Structured Peer-to-Peer Storage Systems
P2P Systems
• For sharing resources at the edge of the Internet
• Classification
  – Unstructured: Napster, Gnutella
  – Structured: Chord, Pastry, Tapestry, CAN, Tornado
Replication
• Data items are replicated for the purpose of fault-tolerance.
• Some DHTs have provided replication utilities, which are usually used to replicate routing states.
• The proposed protocol replicates data items in the application layer so that it can be built on top of any DHT.
Fault model
• Fail-stop
• Byzantine
• Middle ground
Duplicating a Data Item
[Figure: the data item name is hashed by hash functions 1, 2, …, n onto the hash table (key space 0 to 2^128−1) of peer nodes, producing replicas 1, 2, …, n.]
How to keep replicas consistent
• Primary copy mechanism: update propagation
• Quorum-based mechanism: every write quorum Qw intersects every read quorum Qr, and every two write quorums intersect
  – ROWA: read one, write all
  – Majority
  – Multi-Column Protocol
  – …
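• A minimal Python sketch (not from the original slides) of the intersection requirement, using made-up example quorum sets: every write quorum must intersect every read quorum and every other write quorum.

# Sketch: verify the intersection properties required by quorum-based
# replica control; the quorum sets below are hypothetical examples.
def intersects(a, b):
    return bool(set(a) & set(b))

write_quorums = [{1, 2, 6}, {3, 4, 5, 6}]   # hypothetical write quorums
read_quorums = [{1, 3}, {2, 4}]             # hypothetical read quorums

ok = all(intersects(w, r) for w in write_quorums for r in read_quorums) \
     and all(intersects(w1, w2) for w1 in write_quorums for w2 in write_quorums)
print("intersection properties hold:", ok)   # True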
Problems
• State loss
• Replica Regeneration
• Replica Transfer
Peer Joins/Leaves
[Figure: the same hash-table layout as above — when a peer leaves, the lock state of its replica is lost (state loss); when a peer joins, replicas must be handed over to it (replica transfer); a lost replica is recreated as a new replica (replica regeneration).]
Solutions
• Leased Lock
• Replica Pointer
• Auto-Replica Regeneration
Implementation
• The solutions can be integrated with any quorum-based replica control protocol on top of any DHT
• We choose the multi-column protocol (MCP) and the Tornado DHT for the following reasons:
  – MCP has small quorum sizes: O(√n) (MCP can achieve constant quorum sizes in the best case if necessary)
  – Tornado is a typical DHT system developed by ourselves.
Multi-Column Structure
• A multi-column structure MC(m) = (C1, ..., Cm) is a list of pairwise disjoint sets (columns) of replicas satisfying |Ci| > 1 for 1 ≤ i ≤ m.
• For example, ({r1,r2}, {r3,r4,r5}, {r6,r7,r8,r9}) and ({r1,r2,r3,r4,r5}, {r6,r7}, {r8,r9}) are multi-column structures, where r1,...,r9 are (keys of) replicas of a data item.
Write/Read Quorums
• A write quorum under MC(m) is a set that contains all replicas of some column Ci, 1 ≤ i ≤ m (note that i = 1 is included), and one replica of each of the columns Ci+1, ..., Cm.
• A read quorum under MC(m) is either
  – Type-1: a set that contains one replica of each of the columns C1, ..., Cm, or
  – Type-2: a set that contains all replicas of some column Ci, 1 < i ≤ m (note that i = 1 is excluded), and one replica of each of the columns Ci+1, ..., Cm.
Construction of Write Quorums
• one primary cohort
• with supporting cohorts at the rear
• e.g. quorums: {2, 6, 10, 11, 14} and {1, 2, 5, 9, 13, 15}
[Figure: 4×4 grids of replicas 0–15 illustrating the two example write quorums above.]
Randomized Alg. for write quorums
Function Get_Write_Quorum((C1,...,Cm): Multi-Column): Set;
Var S: Set; i, j: Integer;
S=Ci, where i=Random(1..m); //i will be an integer between 1 and m
Choose one arbitrary member in Cj and add it into S for j=i+1,…,m.
Return S;
End Get_Write_Quorum
Randomized Alg. for type-1 read quorums
Function Get_Read_Quorum1((C1,...,Cm): Multi-Column): Set;
Var S: Set; i: Integer;
S = {};
Choose one arbitrary member in Ci and add it into S for i=1,…,m.
Return S;
End Get_Read_Quorum1
Randomized Alg. for type-2 read quorums
Function Get_Read_Quorum2((C1,...,Cm): Multi-Column): Set;
Var S: Set; i, j: Integer;
S=Ci, where i=Random(2..m); //i will be an integer between 2 and m
Choose one arbitrary member in Cj and add it into S for j=i+1,…,m.
Return S;
End Get_Read_Quorum2
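• A minimal executable Python sketch of the three quorum-selection routines above (function and variable names are our own, not from the slides):

import random

# Columns of a multi-column structure MC(m); each column is a set of replica keys.
def get_write_quorum(columns):
    i = random.randrange(len(columns))            # pick some column Ci, 1 <= i <= m
    quorum = set(columns[i])                      # all replicas of Ci
    quorum.update(random.choice(list(c)) for c in columns[i + 1:])  # one from each later column
    return quorum

def get_read_quorum_type1(columns):
    return {random.choice(list(c)) for c in columns}  # one replica per column

def get_read_quorum_type2(columns):
    i = random.randrange(1, len(columns))         # pick some column Ci, 2 <= i <= m
    quorum = set(columns[i])
    quorum.update(random.choice(list(c)) for c in columns[i + 1:])
    return quorum

# Example: the 4x4 structure whose grid columns are the multi-column columns.
mc = [{0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}]
print(get_write_quorum(mc), get_read_quorum_type1(mc), get_read_quorum_type2(mc))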
n Hash functions
• There are several ways to disseminate the n replicas
• In MUREX, we adopt the n-hash-function method.
• There are n replicas with hash keys k1, …, kn for each data item, where k1 = HASH1(data item name), …, kn = HASHn(data item name).
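• A small Python sketch of deriving the n hash keys. The slides do not specify the hash functions used by MUREX/Tornado; as an assumption, the i-th key here is derived by hashing the data item name together with the replica index.

import hashlib

# Sketch (assumption): HASHi(name) is modeled as a hash of "name#i",
# reduced into the 128-bit key space used by the DHT.
def replica_keys(data_item_name, n, key_space_bits=128):
    keys = []
    for i in range(1, n + 1):
        digest = hashlib.sha256(f"{data_item_name}#{i}".encode()).digest()
        keys.append(int.from_bytes(digest, "big") % (1 << key_space_bits))
    return keys

print(replica_keys("example.txt", n=4))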
Operations
• publish(data, data item name): to place data replicas of the original data item at nodes associated with k1,…,kn with version number 0.
• read(data item name): to return the up-to-date replica (by collecting a read quorum of replicas and returning the one with the highest version number).
• write(data, data item name): to update the data item (by writing to a write quorum of replicas, using the highest version number encountered plus one).
Messages
• LOCK
• OK
• WAIT
• MISS
• UNLOCK
Initialization
• Initially, the data originator publishes the original data item by calling publish(data, data item name), which stores the data item with version number 0 at the n nodes associated with k1,…,kn.
Read/Write
• Afterwards, any participant can invoke the read (or write) operation on the data item by issuing LOCK requests, with the help of the DHT, to all members of a read (or write) quorum Q.
Asking a missed replica
• When a node receives a LOCK request, it sends a MISS message if it does not own the replica. Note that MISS is sent only once for each replica.
Check lock conflict
• On the other hand, if the node owns the replica, it then checks if there is a lock conflict, i.e., if a read-locked replica receives a write-lock request, or if a write-locked replica receives a write-lock or a read-lock request.
OK vs. WAIT
• If there is no lock conflict, the node locks the replica and replies with an OK message containing the replica's version number.
• Otherwise, if there is a lock conflict, the node replies with a WAIT message.
Wait Period
• After sending LOCK requests, a node enters the “wait period” of length W.
• During the wait period, if a node has gathered OK messages from all members of quorum Q, it can execute the desired operation.
• A node sends UNLOCK messages to unlock the replicas after the operation is finished.
Usage of the Version Number
• A read operation in MUREX reads the replica of the largest version number from one of Q’s members.
• On the other hand, a write operation always writes the newest replica to all members of Q, attached with a version number that is one more than the largest version number just encountered.
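• A small sketch (our own naming, assuming each OK reply carries a (version, value) pair) of how the version number is used by reads and writes:

# Sketch: pick the newest value from a read quorum, and compute the
# version number a subsequent write should attach.
def read_result(ok_replies):
    version, value = max(ok_replies, key=lambda r: r[0])   # highest version wins
    return version, value

def next_write_version(ok_replies):
    return max(v for v, _ in ok_replies) + 1                # largest version + 1

replies = [(3, "old"), (5, "newest"), (4, "stale")]
print(read_result(replies), next_write_version(replies))    # (5, 'newest') 6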
Quorum Reselection
• During the wait period, if a node u cannot gather OK messages from all members of Q within the waiting period W, it should select another quorum Q′, send LOCK requests, and then enter another wait period.
• A node may enter wait periods repeatedly until enough OK messages are gathered or until any lock expires.
Quorum Reselection Case 1
• No WAIT message is received by u: This case occurs when there is no contention. For such a case, the new quorum Q′ should be chosen so that Q′ ∖ R is minimized, where R is the set of nodes that have already replied with OK messages. Node u then sends LOCK messages only to those members of Q′ to which it has not yet sent LOCK messages.
Quorum Reselection Case 2
• One or more WAIT messages are received by u: This case occurs when there is contention. For such a case, node u first sends UNLOCK messages to all members of Q. (Some of the UNLOCK messages need not be sent through the DHT, since node u has learned the IPs of some nodes from their reply messages.) Then u selects an arbitrary new quorum Q′. After a random backoff time, node u sends LOCK messages to all members of Q′. The random backoff concept is similar to that of Ethernet [Cho] and is used to avoid repeated conflicts among contending nodes.
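• A compact sketch (our own naming; the backoff bound is an assumption) of the reselection decision after an unsuccessful wait period:

import random

# Sketch of quorum reselection: without contention, reuse nodes that already
# replied OK; with contention, pick an arbitrary quorum and back off randomly.
def reselect(ok_nodes, wait_received, all_quorums):
    if not wait_received:
        new_q = min(all_quorums, key=lambda q: len(set(q) - ok_nodes))  # minimize Q' \ R
        to_lock = set(new_q) - ok_nodes       # only nodes not yet asked
        backoff = 0.0
    else:
        new_q = random.choice(all_quorums)    # arbitrary quorum after releasing locks
        to_lock = set(new_q)
        backoff = random.uniform(0.0, 1.0)    # random backoff (bound is an assumption)
    return new_q, to_lock, backoff

quorums = [{1, 2, 6}, {3, 4, 5, 6}, {2, 4, 6}]
print(reselect(ok_nodes={1, 2}, wait_received=False, all_quorums=quorums))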
Deadlock- and Livelock-Free
• In MUREX, every lock is a leased lock with a lease period L. We also assume that the critical section of a read or write operation takes C time to complete. A leased lock expires automatically after the lease period L has elapsed. Thus, a node should release the lock if H > L − C − D, where H is the time for which the lock has been held and D is the propagation delay of transmitting the lock messages. Note that the holding time H is counted from the moment the node received the OK message.
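• A minimal sketch (our own naming) of the release condition H > L − C − D:

# Sketch of the leased-lock safety check: give up the lock if the remaining
# lease time cannot cover the critical section plus the propagation delay.
def must_release(holding_time_h, lease_l, critical_c, delay_d):
    return holding_time_h > lease_l - critical_c - delay_d

print(must_release(holding_time_h=8.0, lease_l=10.0, critical_c=2.0, delay_d=0.5))  # True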
L, D, H and C
[Figure: timeline of nodes u and v relating the lease period L, the lock holding time H, the propagation delay D, and the critical section time C to the OK message.]
The leased lock makes node substitution work
• If a node u hosting replica r leaves, a node v will be selected to replace u as r's host. If r is still locked when u leaves, the lock state of r is lost. When v later obtains a copy of r somehow, it cannot grant locks for r at the time E when it obtains the replica; instead, it can start granting locks at time E + L, where L is the lease period. In this way, a replica is never locked more than once at a time, and the lock-state loss problem is solved.
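• A small sketch (our own naming) of the grant-time rule for a substitute host:

# Sketch: a substitute host that obtained a replica at time E may not grant
# locks for it until E + L, so any lock granted by the departed host has
# expired by then.
def can_grant(now, obtained_at_e, lease_l):
    return now >= obtained_at_e + lease_l

print(can_grant(now=12.0, obtained_at_e=5.0, lease_l=10.0))  # False: wait until t = 15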
Replica Pointer (1/2)
• When a node v arrives to share part of the load of node u, say from key k1 to key k2 of the hash space, u should transfer to v the replicas of keys from k1 to k2.
• To reduce the cost of transferring all the replicas, MUREX transfers replica pointers instead of the actual replicas.
• A replica pointer is a five-tuple: (key, data item name, version number, lock state, IP of the storing location). It is produced when a replica is generated and can be used to locate the actual stored replica.
• When node v owns the replica pointer of replica r, it is regarded as r's host and can reply to lock requests for r.
Replica Pointer (2/2)
• On the other hand, when node u sends out the replica pointer of replica r, it is no longer the host of r and cannot reply to lock requests for r (even if it still stores the actual replica of r).
• A replica pointer is a lightweight mechanism for transferring replicas; it can be propagated from node to node.
• When a node w owning the replica pointer of r receives a lock request for r, it should check whether the node storing the actual replica of r is still alive. If so, w can behave as the host of r. Otherwise, w regards itself as having no replica r.
• Every transfer of a replica pointer between two nodes, say from u to v, should be recorded locally by u so that an UNLOCK message can be forwarded to the last node holding the replica pointer.
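• A minimal sketch of the five-tuple replica pointer and the host check (the field names and the is_alive helper are our own, not from the slides):

from dataclasses import dataclass

@dataclass
class ReplicaPointer:
    key: int               # hash key of the replica
    data_item_name: str
    version: int
    lock_state: str        # e.g. "unlocked", "read-locked", "write-locked"
    storing_ip: str        # IP of the node storing the actual replica

def can_act_as_host(pointer, is_alive):
    # A node holding the pointer may answer lock requests only if the node
    # storing the actual replica is still alive; otherwise it treats the
    # replica as missing (which triggers a MISS and, later, regeneration).
    return is_alive(pointer.storing_ip)

p = ReplicaPointer(key=42, data_item_name="example.txt", version=3,
                   lock_state="unlocked", storing_ip="10.0.0.7")
print(can_act_as_host(p, is_alive=lambda ip: True))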
Replica Auto-Regeneration (1/2)
• When node v receives from node u a LOCK message for locking replica r, v sends a MISS message if it does not own replica r. Note that MISS is sent only once for each replica. Node v is regarded as having no replica r if either of the following conditions holds:
1. v does not have the replica pointer of r, or
2. v has the replica pointer of r, which indicates that some node w stores r, but w is not alive.
Replica Auto-Regeneration (2/2)
• After obtaining (resp., generating) the newest replica by executing a read (resp., write) operation, node u should send the newest replica to node v. After receiving the newest replica, node v generates a replica pointer for it and can start to reply to lock requests at time E + L, where E is the time of receiving the replica and L is the lease period. In this manner, replica regeneration is performed automatically with little overhead.
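• A small sketch (our own naming) of the missing-replica test that triggers a MISS message and, later, regeneration:

from collections import namedtuple

# Sketch: node v treats replica r as missing if it has no replica pointer
# for r, or its pointer refers to a storing node that is no longer alive.
Pointer = namedtuple("Pointer", ["storing_ip"])

def replica_is_missing(pointer, is_alive):
    return pointer is None or not is_alive(pointer.storing_ip)

print(replica_is_missing(None, is_alive=lambda ip: True))                  # True: no pointer
print(replica_is_missing(Pointer("10.0.0.7"), is_alive=lambda ip: False))  # True: host is down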
Analysis – Availability (1/2)
• We assume that all data replicas have the same up-probability p, the probability that a single replica is up (i.e., accessible).
• Let RAV(k) denote the availability of read quorums under MC(k), and WAV(k), the availability of write quorums under MC(k).
Analysis – Availability (2/2)
• Let S_k = |C_k| denote the size of column C_k. Then
RAV(k) = Prob(all replicas in C_k are up) + Prob(at least one, but not all, replicas in C_k are up) × RAV(k−1)
       = p^(S_k) + (1 − p^(S_k) − (1 − p)^(S_k)) × RAV(k−1)
WAV(k) = Prob(all replicas in C_k are up) + Prob(at least one, but not all, replicas in C_k are up) × WAV(k−1)
       = p^(S_k) + (1 − p^(S_k) − (1 − p)^(S_k)) × WAV(k−1)
with base cases RAV(1) = 1 − (1 − p)^(S_1) and WAV(1) = p^(S_1).
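• A small Python sketch evaluating these recurrences (the column sizes and p below are example values):

# Evaluate the read/write availability recurrences for column sizes S[0..m-1]
# (S[k] = |C_{k+1}|) and per-replica up-probability p.
def rav(S, p):
    avail = 1 - (1 - p) ** S[0]                      # RAV(1)
    for s in S[1:]:
        avail = p ** s + (1 - p ** s - (1 - p) ** s) * avail
    return avail

def wav(S, p):
    avail = p ** S[0]                                # WAV(1)
    for s in S[1:]:
        avail = p ** s + (1 - p ** s - (1 - p) ** s) * avail
    return avail

print(rav([3, 3, 3], 0.9), wav([3, 3, 3], 0.9))      # 3x3 structure, p = 0.9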
Analysis – Quorum Size
• The write quorum under a √n-column multi-column structure has size √n in the best case and 2√n − 1 in the worst case, where n is the number of replicas. Under the same structure, the read quorum has size √n in the best case and 2√n − 2 in the worst case. For example, with n = 16 replicas in a 4×4 structure, a write quorum has 4 to 7 members and a read quorum 4 to 6. The sizes of multi-column quorums are relatively small compared to those of related quorum systems.
[Figure: read availability vs. up probability for k×k multi-column structures, k = 2, …, 6 (curves R:2*2 – R:6*6).]
[Figure: write availability vs. up probability for k×k multi-column structures, k = 2, …, 6 (curves W:2*2 – W:6*6).]
• When the up-probability is high (for example, in a well-controlled environment), we can adopt a larger column size if write availability is the most significant concern.
• When the up-probability is low (for example, in the Internet environment), we can adopt a smaller column size if read availability is the most significant concern.
Simulation
Related Work
• As far as we know, there are four existing mutable P2P storage systems proposed for P2P environments: Ivy [Mut], Eliot [Ste], Oasis [Rod], and Om [Yu].
• In trying to maintain data consistency, these protocols all encounter the problems caused by "node substitution", although this is not mentioned explicitly, and they solve them by the concepts of logs, a replicated metadata service, dynamic quorum membership, and replica membership reconfiguration, respectively.
• A mechanism called informed backoff is proposed in [Lin] to intelligently collect replica states and achieve mutual exclusion (i.e., an exclusive lock) among replicas. The mechanism treats "node substitution" as a malicious fault and uses the term "random reset" to refer to the fault.
Ivy
• Ivy [Mut] is based on a set of logs stored with the aid of distributed hash tables. It keeps a log of all updates for every participant and maintains data consistency optimistically by performing conflict resolution among all logs. The logs must be kept indefinitely, and participants must scan all the logs to look up file data. Thus, Ivy is only suitable for a small group of participants.
Eliot
• Eliot [Ste] relies on a reliable, fault-tolerant, immutable P2P storage substrate (Charles) to store data blocks, and uses an auxiliary metadata service (MS) for storing mutable metadata. It supports NFS-like consistency semantics; however, the traffic between the MS and the client is high under these semantics. It also supports AFS open-close consistency semantics; however, these semantics may cause the problem of lost updates. The MS is provided by a conventional replicated database, which may not be a good fit for dynamic P2P environments.
Oasis
• Oasis [Rod] is based on Gifford’s weighted voting quorum concept and allows dynamic quorum membership. It spreads versioned metadata along with data replicas over the P2P network. To complete an operation, a client must first find related metadata to form a quorum. If the metadata is not found, the operation may fail.
Om
• Om [Yu] is based on the concepts of automatic replica regeneration and replica membership reconfiguration. Consistency is maintained by two quorum systems: a read-one-write-all quorum system for accessing replicas, and a witness-model quorum system for reconfiguration. Om allows replica regeneration from a single replica. However, a write in Om is always forwarded to the primary copy, which serializes all writes and uses a two-phase protocol to propagate the write to every secondary replica. The primary replica may thus become a bottleneck, and the overhead incurred by the two-phase protocol may be too high. Moreover, the witness-model reconfiguration has a non-zero probability of violating consistency.
Sigma
• The paper [Lin] utilizes the informed backoff mechanism to design algorithms achieving mutual exclusion among replicas. A node u wishing to be the winner of the mutual exclusion sends a request to each of the n (n = 3k+1) replicas and waits for responses. On receiving a request, a node puts the request in a FIFO queue and replies with the ID of the node whose request is at the front of the queue. When the number of responses received by node u exceeds m (m = 2k+1), node u regards node v (v may be u) as the winner if more than m responses name v as the winner. Otherwise, node u sends a release message to all replicas that take u as the winner, to relinquish the request. To avoid repeated conflicts in high-contention environments, node u starts over sending requests only after a random backoff time. In this manner, a winner can be elected successfully even if a replica is reset when "node substitution" occurs. The work in [Lin] regards "node substitution" as a sort of malicious fault, while our protocol regards it as a sort of omission fault.
References
1. [Bha] R. Bhagwan, D. Moore, S. Savage, and G. Voelker, "Replication Strategies for Highly Available Peer-to-peer Storage," Proc. WFDDC, 2002.
2. [Dab] F. Dabek, M. Kaashoek, D. Karger, R. Morris, and I. Stoica, "Wide-area Cooperative Storage with CFS," Proc. SOSP, 2001.
3. [Gop] V. Gopalakrishnan, B. Silaghi, B. Bhattacharjee, and P. Keleher, "Adaptive Replication in Peer-to-peer Systems," Proc. International Conference on Distributed Computing Systems, 2004.
4. [Kub] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "OceanStore: An Architecture for Global-Scale Persistent Storage," Proc. ASPLOS, 2000.
5. [Lin] S. Lin, Q. Lian, M. Chen, and Z. Zhang, "A Practical Distributed Mutual Exclusion Protocol in Dynamic Peer-to-peer Systems," Proc. 3rd International Workshop on Peer-to-Peer Systems (IPTPS '04), 2004.
6. [Mut] A. Muthitacharoen, R. Morris, T. Gil, and B. Chen, "Ivy: A Read/Write Peer-to-peer File System," Proc. OSDI, 2002.
7. [Rod] M. Rodrig and A. Lamarca, "Decentralized Weighted Voting for P2P Data Management," Proc. 3rd ACM International Workshop on Data Engineering for Wireless and Mobile Access, pp. 85–92, 2003.
8. [Ste] C. Stein, M. Tucker, and M. Seltzer, "Building a Reliable Mutable File System on Peer-to-peer Storage," Proc. WRP2PDS, 2002.
9. [Yu] H. Yu and A. Vahdat, "Consistent and Automatic Replica Regeneration," Proc. NSDI, 2004.
10. [Zho] B. Zhou, D. A. Joseph, and J. Kubiatowicz, "Tapestry: A Fault Tolerant Wide Area Network Infrastructure," Proc. ACM SIGCOMM, 2001.