TRANSCRIPT
MUREX: A Mutable Replica Control Protocol for
Structured Peer-to-Peer Storage Systems
P2P Systems
• For sharing resources at the edge of the Internet
• Classification
  – Unstructured: Napster, Gnutella
  – Structured: Chord, Pastry, Tapestry, CAN, Tornado
Replication
• Data items are replicated for the purpose of fault-tolerance.
• Some DHTs have provided replication utilities, which are usually used to replicate routing states.
• The proposed protocol replicates data items in the application layer so that it can be built on top of any DHT.
Fault model
• Fail-stop
• Byzantine
• Middle ground
Duplicating a Data Item
[Figure: the data item name is hashed by hash functions 1, 2, …, n onto the hash table (key space 0 to 2^128−1) of peer nodes, producing replicas 1, 2, …, n.]
How to keep replicas consistent
• Primary copy mechanism: update propagation
• Quorum-based mechanism: every write quorum Qw intersects every read quorum Qr, and every two write quorums intersect
  – ROWA: read one, write all
  – Majority
  – Multi-Column Protocol
  – …
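• A minimal Python sketch (not from the original slides) of the intersection requirement, using made-up example quorum sets: every write quorum must intersect every read quorum and every other write quorum.

# Sketch: verify the intersection properties required by quorum-based
# replica control; the quorum sets below are hypothetical examples.
def intersects(a, b):
    return bool(set(a) & set(b))

write_quorums = [{1, 2, 6}, {3, 4, 5, 6}]   # hypothetical write quorums
read_quorums = [{1, 3}, {2, 4}]             # hypothetical read quorums

ok = all(intersects(w, r) for w in write_quorums for r in read_quorums) \
     and all(intersects(w1, w2) for w1 in write_quorums for w2 in write_quorums)
print("intersection properties hold:", ok)   # True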
Problems
• State loss
• Replica Regeneration
• Replica Transfer
Peer Joins/Leaves
[Figure: the same hash-table layout as above — when a peer leaves, the lock state of its replica is lost (state loss); when a peer joins, replicas must be handed over to it (replica transfer); a lost replica is recreated as a new replica (replica regeneration).]
Solutions
• Leased Lock
• Replica Pointer
• Auto-Replica Regeneration
Implementation
• The solutions can be integrated with any quorum-based replica control protocol on top of any DHT
• We choose the multi-column protocol (MCP) and the Tornado DHT for the following reasons:
  – MCP has small quorum sizes: O(√n) (MCP can achieve constant quorum sizes in the best case if necessary)
  – Tornado is a typical DHT system developed by ourselves.
Multi-Column Structure
• A multi-column structure MC(m) = (C1, ..., Cm) is a list of pairwise disjoint sets (columns) of replicas satisfying |Ci| > 1 for 1 ≤ i ≤ m.
• For example, ({r1,r2}, {r3,r4,r5}, {r6,r7,r8,r9}) and ({r1,r2,r3,r4,r5}, {r6,r7}, {r8,r9}) are multi-column structures, where r1,...,r9 are (keys of) replicas of a data item.
Write/Read Quorums
• A write quorum under MC(m) is a set that contains all replicas of some column Ci, 1 ≤ i ≤ m (note that i = 1 is included), and one replica of each of the columns Ci+1, ..., Cm.
• A read quorum under MC(m) is either
  – Type-1: a set that contains one replica of each of the columns C1, ..., Cm, or
  – Type-2: a set that contains all replicas of some column Ci, 1 < i ≤ m (note that i = 1 is excluded), and one replica of each of the columns Ci+1, ..., Cm.
Construction of Write Quorums
• one primary cohort
• with supporting cohorts at the rear
• e.g. quorums: {2, 6, 10, 11, 14} and {1, 2, 5, 9, 13, 15}
[Figure: 4×4 grids of replicas 0–15 illustrating the two example write quorums above.]
Randomized Alg. for write quorums
Function Get_Write_Quorum((C1,...,Cm): Multi-Column): Set;
Var S: Set; i, j: Integer;
S=Ci, where i=Random(1..m); //i will be an integer between 1 and m
Choose one arbitrary member in Cj and add it into S for j=i+1,…,m.
Return S;
End Get_Write_Quorum
Randomized Alg. for type-1 read quorums
Function Get_Read_Quorum1((C1,...,Cm): Multi-Column): Set;
Var S: Set; i: Integer;
S = {};
Choose one arbitrary member in Ci and add it into S for i=1,…,m.
Return S;
End Get_Read_Quorum1
Randomized Alg. for type-2 read quorums
Function Get_Read_Quorum2((C1,...,Cm): Multi-Column): Set;
Var S: Set; i, j: Integer;
S=Ci, where i=Random(2..m); //i will be an integer between 2 and m
Choose one arbitrary member in Cj and add it into S for j=i+1,…,m.
Return S;
End Get_Read_Quorum2
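• A minimal executable Python sketch of the three quorum-selection routines above (function and variable names are our own, not from the slides):

import random

# Columns of a multi-column structure MC(m); each column is a set of replica keys.
def get_write_quorum(columns):
    i = random.randrange(len(columns))            # pick some column Ci, 1 <= i <= m
    quorum = set(columns[i])                      # all replicas of Ci
    quorum.update(random.choice(list(c)) for c in columns[i + 1:])  # one from each later column
    return quorum

def get_read_quorum_type1(columns):
    return {random.choice(list(c)) for c in columns}  # one replica per column

def get_read_quorum_type2(columns):
    i = random.randrange(1, len(columns))         # pick some column Ci, 2 <= i <= m
    quorum = set(columns[i])
    quorum.update(random.choice(list(c)) for c in columns[i + 1:])
    return quorum

# Example: the 4x4 structure whose grid columns are the multi-column columns.
mc = [{0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}]
print(get_write_quorum(mc), get_read_quorum_type1(mc), get_read_quorum_type2(mc))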
n Hash functions
• There are several ways to disseminate the n replicas
• In MUREX, we adopt the n-hash-function method.
• There are n replicas with hash keys k1, …, kn for each data item, where k1 = HASH1(data item name), …, kn = HASHn(data item name).
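• A small Python sketch of deriving the n hash keys. The slides do not specify the hash functions used by MUREX/Tornado; as an assumption, the i-th key here is derived by hashing the data item name together with the replica index.

import hashlib

# Sketch (assumption): HASHi(name) is modeled as a hash of "name#i",
# reduced into the 128-bit key space used by the DHT.
def replica_keys(data_item_name, n, key_space_bits=128):
    keys = []
    for i in range(1, n + 1):
        digest = hashlib.sha256(f"{data_item_name}#{i}".encode()).digest()
        keys.append(int.from_bytes(digest, "big") % (1 << key_space_bits))
    return keys

print(replica_keys("example.txt", n=4))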
Operations
• publish(data, data item name): to place data replicas of the original data item at nodes associated with k1,…,kn with version number 0.
• read(data item name): to return the up-to-date replica (by collecting a read quorum of replicas and returning the one with the highest version number).
• write(data, data item name): to update the data item (by writing to a write quorum of replicas, using the highest version number encountered plus one).
Messages
• LOCK
• OK
• WAIT
• MISS
• UNLOCK
Initialization
• Initially, the data originator publishes the original data item by calling publish(data, data item name), which stores the data item with version number 0 at the n nodes associated with k1,…,kn.
Read/Write
• Afterwards, any participant can invoke the read (or write) operation on the data item by issuing LOCK requests, with the help of the DHT, to all members of a read (or write) quorum Q.
Asking a missed replica
• When a node receives a LOCK request, it sends a MISS message if it does not own the replica. Note that MISS is sent only once for each replica.
Check lock conflict
• On the other hand, if the node owns the replica, it then checks if there is a lock conflict, i.e., if a read-locked replica receives a write-lock request, or if a write-locked replica receives a write-lock or a read-lock request.
OK vs. WAIT
• If there is no lock conflict, the node locks the replica and replies with an OK message containing the replica's version number.
• Otherwise, if there is a lock conflict, the node replies with a WAIT message.
Wait Period
• After sending LOCK requests, a node enters the “wait period” of length W.
• During the wait period, if a node has gathered OK messages from all members of quorum Q, it can execute the desired operation.
• A node sends UNLOCK messages to unlock the replicas after the operation is finished.
Usage of the Version Number
• A read operation in MUREX reads the replica of the largest version number from one of Q’s members.
• On the other hand, a write operation always writes the newest replica to all members of Q, attached with a version number that is one more than the largest version number just encountered.
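• A small sketch (our own naming, assuming each OK reply carries a (version, value) pair) of how the version number is used by reads and writes:

# Sketch: pick the newest value from a read quorum, and compute the
# version number a subsequent write should attach.
def read_result(ok_replies):
    version, value = max(ok_replies, key=lambda r: r[0])   # highest version wins
    return version, value

def next_write_version(ok_replies):
    return max(v for v, _ in ok_replies) + 1                # largest version + 1

replies = [(3, "old"), (5, "newest"), (4, "stale")]
print(read_result(replies), next_write_version(replies))    # (5, 'newest') 6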
Quorum Reselection
• During the wait period, if a node u cannot gather OK messages from all members of Q within the waiting period W, it should select another quorum Q′, send LOCK requests, and then enter another wait period.
• A node may enter wait periods repeatedly until enough OK messages are gathered or until any lock expires.
Quorum Reselection Case 1
• No WAIT message is received by u: This case occurs when there is no contention. For such a case, the new quorum Q′ should be chosen so that Q′ ∖ R is minimized, where R is the set of nodes that have already replied with OK messages. Node u then sends LOCK messages only to those members of Q′ to which it has not yet sent LOCK messages.
Quorum Reselection Case 2
• One or more WAIT messages are received by u: This case occurs when there is contention. For such a case, node u first sends UNLOCK messages to all members of Q. (Some of the UNLOCK messages need not be sent through the DHT, since node u has learned the IPs of some nodes from their reply messages.) Then u selects an arbitrary new quorum Q′. After a random backoff time, node u sends LOCK messages to all members of Q′. The random backoff concept is similar to that of Ethernet [Cho] and is used to avoid repeated conflicts among contending nodes.
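• A compact sketch (our own naming; the backoff bound is an assumption) of the reselection decision after an unsuccessful wait period:

import random

# Sketch of quorum reselection: without contention, reuse nodes that already
# replied OK; with contention, pick an arbitrary quorum and back off randomly.
def reselect(ok_nodes, wait_received, all_quorums):
    if not wait_received:
        new_q = min(all_quorums, key=lambda q: len(set(q) - ok_nodes))  # minimize Q' \ R
        to_lock = set(new_q) - ok_nodes       # only nodes not yet asked
        backoff = 0.0
    else:
        new_q = random.choice(all_quorums)    # arbitrary quorum after releasing locks
        to_lock = set(new_q)
        backoff = random.uniform(0.0, 1.0)    # random backoff (bound is an assumption)
    return new_q, to_lock, backoff

quorums = [{1, 2, 6}, {3, 4, 5, 6}, {2, 4, 6}]
print(reselect(ok_nodes={1, 2}, wait_received=False, all_quorums=quorums))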
Deadlock- and Livelock-Free
• In MUREX, every lock is a leased lock with a lease period L. We also assume that the critical section of a read or write operation takes C time to complete. A leased lock expires automatically after the lease period L has elapsed. Thus, a node should release the lock if H > L − C − D, where H is the time for which the lock has been held and D is the propagation delay of transmitting the lock messages. Note that the holding time H is counted from the moment the node received the OK message.
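• A minimal sketch (our own naming) of the release condition H > L − C − D:

# Sketch of the leased-lock safety check: give up the lock if the remaining
# lease time cannot cover the critical section plus the propagation delay.
def must_release(holding_time_h, lease_l, critical_c, delay_d):
    return holding_time_h > lease_l - critical_c - delay_d

print(must_release(holding_time_h=8.0, lease_l=10.0, critical_c=2.0, delay_d=0.5))  # True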
L, D, H and C
[Figure: timeline of nodes u and v relating the lease period L, the lock holding time H, the propagation delay D, and the critical section time C to the OK message.]
The leased lock makes node substitution work
• If a node u hosting replica r leaves, a node v will be selected to replace u as r's host. If r is still locked when u leaves, the lock state of r is lost. When v later obtains a copy of r somehow, it cannot grant locks for r at the time E when it obtains the replica; instead, it can start granting locks at time E + L, where L is the lease period. In this way, a replica is never locked more than once at a time, and the lock-state loss problem is solved.
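• A small sketch (our own naming) of the grant-time rule for a substitute host:

# Sketch: a substitute host that obtained a replica at time E may not grant
# locks for it until E + L, so any lock granted by the departed host has
# expired by then.
def can_grant(now, obtained_at_e, lease_l):
    return now >= obtained_at_e + lease_l

print(can_grant(now=12.0, obtained_at_e=5.0, lease_l=10.0))  # False: wait until t = 15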
Replica Pointer (1/2)
• When a node v arrives to share part of the load of node u, say from key k1 to key k2 of the hash space, u should transfer to v the replicas of keys from k1 to k2.
• To reduce the cost of transferring all the replicas, MUREX transfers replica pointers instead of the actual replicas.
• A replica pointer is a five-tuple: (key, data item name, version number, lock state, IP of the storing location). It is produced when a replica is generated and can be used to locate the actual stored replica.
• When node v owns the replica pointer of replica r, it is regarded as r's host and can reply to lock requests for r.
Replica Pointer (2/2)
• On the other hand, when node u sends out the replica pointer of replica r, it is no longer the host of r and cannot reply to lock requests for r (even if it still stores the actual replica of r).
• A replica pointer is a lightweight mechanism for transferring replicas; it can be propagated from node to node.
• When a node w owning the replica pointer of r receives a lock request for r, it should check whether the node storing the actual replica of r is still alive. If so, w can behave as the host of r. Otherwise, w regards itself as having no replica r.
• Every transfer of a replica pointer between two nodes, say from u to v, should be recorded locally by u so that an UNLOCK message can be forwarded to the last node holding the replica pointer.
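• A minimal sketch of the five-tuple replica pointer and the host check (the field names and the is_alive helper are our own, not from the slides):

from dataclasses import dataclass

@dataclass
class ReplicaPointer:
    key: int               # hash key of the replica
    data_item_name: str
    version: int
    lock_state: str        # e.g. "unlocked", "read-locked", "write-locked"
    storing_ip: str        # IP of the node storing the actual replica

def can_act_as_host(pointer, is_alive):
    # A node holding the pointer may answer lock requests only if the node
    # storing the actual replica is still alive; otherwise it treats the
    # replica as missing (which triggers a MISS and, later, regeneration).
    return is_alive(pointer.storing_ip)

p = ReplicaPointer(key=42, data_item_name="example.txt", version=3,
                   lock_state="unlocked", storing_ip="10.0.0.7")
print(can_act_as_host(p, is_alive=lambda ip: True))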
Replica Auto-Regeneration (1/2)
• When node v receives from node u a LOCK message for locking replica r, v sends a MISS message if it does not own replica r. Note that MISS is sent only once for each replica. Node v is regarded as having no replica r if either of the following conditions holds:
1. v does not have the replica pointer of r, or
2. v has the replica pointer of r, which indicates that some node w stores r, but w is not alive.
Replica Auto-Regeneration (2/2)
• After obtaining (resp., generating) the newest replica by executing a read (resp., write) operation, node u should send the newest replica to node v. After receiving the newest replica, node v generates a replica pointer for it and can start to reply to lock requests at time E + L, where E is the time of receiving the replica and L is the lease period. In this manner, replica regeneration is performed automatically with little overhead.
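• A small sketch (our own naming) of the missing-replica test that triggers a MISS message and, later, regeneration:

from collections import namedtuple

# Sketch: node v treats replica r as missing if it has no replica pointer
# for r, or its pointer refers to a storing node that is no longer alive.
Pointer = namedtuple("Pointer", ["storing_ip"])

def replica_is_missing(pointer, is_alive):
    return pointer is None or not is_alive(pointer.storing_ip)

print(replica_is_missing(None, is_alive=lambda ip: True))                  # True: no pointer
print(replica_is_missing(Pointer("10.0.0.7"), is_alive=lambda ip: False))  # True: host is down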
Analysis – Availability (1/2)
• We assume that all data replicas have the same up-probability p, the probability that a single replica is up (i.e., accessible).
• Let RAV(k) denote the availability of read quorums under MC(k), and WAV(k), the availability of write quorums under MC(k).
Analysis – Availability (2/2)
• Let S_k = |C_k| denote the size of column C_k. Then
RAV(k) = Prob(all replicas in C_k are up) + Prob(at least one, but not all, replicas in C_k are up) × RAV(k−1)
       = p^(S_k) + (1 − p^(S_k) − (1 − p)^(S_k)) × RAV(k−1)
WAV(k) = Prob(all replicas in C_k are up) + Prob(at least one, but not all, replicas in C_k are up) × WAV(k−1)
       = p^(S_k) + (1 − p^(S_k) − (1 − p)^(S_k)) × WAV(k−1)
with base cases RAV(1) = 1 − (1 − p)^(S_1) and WAV(1) = p^(S_1).
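• A small Python sketch evaluating these recurrences (the column sizes and p below are example values):

# Evaluate the read/write availability recurrences for column sizes S[0..m-1]
# (S[k] = |C_{k+1}|) and per-replica up-probability p.
def rav(S, p):
    avail = 1 - (1 - p) ** S[0]                      # RAV(1)
    for s in S[1:]:
        avail = p ** s + (1 - p ** s - (1 - p) ** s) * avail
    return avail

def wav(S, p):
    avail = p ** S[0]                                # WAV(1)
    for s in S[1:]:
        avail = p ** s + (1 - p ** s - (1 - p) ** s) * avail
    return avail

print(rav([3, 3, 3], 0.9), wav([3, 3, 3], 0.9))      # 3x3 structure, p = 0.9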
Analysis – Quorum Size
• The write quorum under a √n-column multi-column structure has size √n in the best case and 2√n − 1 in the worst case, where n is the number of replicas. Under the same structure, the read quorum has size √n in the best case and 2√n − 2 in the worst case. For example, with n = 16 replicas in a 4×4 structure, a write quorum has 4 to 7 members and a read quorum 4 to 6. The sizes of multi-column quorums are relatively small compared to those of related quorum systems.
[Figure: read availability vs. up probability for k×k multi-column structures, k = 2, …, 6 (curves R:2*2 – R:6*6).]
[Figure: write availability vs. up probability for k×k multi-column structures, k = 2, …, 6 (curves W:2*2 – W:6*6).]
• When the up-probability is high (for example, in a well-controlled environment), we can adopt a larger column size if write availability is the most significant concern.
• When the up-probability is low (for example, in the Internet environment), we can adopt a smaller column size if read availability is the most significant concern.
Simulation
Related Work
• As far as we know, there are four existing mutable P2P storage systems proposed for P2P environments: Ivy [Mut], Eliot [Ste], Oasis [Rod], and Om [Yu].
• In trying to maintain data consistency, these protocols all encounter the problems caused by "node substitution", although this is not mentioned explicitly, and they solve them by the concepts of logs, a replicated metadata service, dynamic quorum membership, and replica membership reconfiguration, respectively.
• A mechanism called informed backoff is proposed in [Lin] to intelligently collect replica states and achieve mutual exclusion (i.e., an exclusive lock) among replicas. The mechanism treats "node substitution" as a malicious fault and uses the term "random reset" to refer to the fault.
Ivy
• Ivy [Mut] is based on a set of logs stored with the aid of distributed hash tables. It keeps a log of all updates for every participant and maintains data consistency optimistically by performing conflict resolution among all logs. The logs must be kept indefinitely, and participants must scan all the logs to look up file data. Thus, Ivy is only suitable for a small group of participants.
Eliot
• Eliot [Ste] relies on a reliable, fault-tolerant, immutable P2P storage substrate (Charles) to store data blocks, and uses an auxiliary metadata service (MS) for storing mutable metadata. It supports NFS-like consistency semantics; however, the traffic between the MS and the client is high under these semantics. It also supports AFS open-close consistency semantics; however, these semantics may cause the problem of lost updates. The MS is provided by a conventional replicated database, which may not be a good fit for dynamic P2P environments.
Oasis
• Oasis [Rod] is based on Gifford’s weighted voting quorum concept and allows dynamic quorum membership. It spreads versioned metadata along with data replicas over the P2P network. To complete an operation, a client must first find related metadata to form a quorum. If the metadata is not found, the operation may fail.
Om
• Om [Yu] is based on the concepts of automatic replica regeneration and replica membership reconfiguration. Consistency is maintained by two quorum systems: a read-one-write-all quorum system for accessing replicas, and a witness-model quorum system for reconfiguration. Om allows replica regeneration from a single replica. However, a write in Om is always forwarded to the primary copy, which serializes all writes and uses a two-phase protocol to propagate the write to every secondary replica. The primary replica may thus become a bottleneck, and the overhead incurred by the two-phase protocol may be too high. Moreover, the witness-model reconfiguration has a non-zero probability of violating consistency.
Sigma
• The paper [Lin] utilizes the informed backoff mechanism to design algorithms achieving mutual exclusion among replicas. A node u wishing to be the winner of the mutual exclusion sends a request to each of the n (n = 3k+1) replicas and waits for responses. On receiving a request, a node puts the request in a FIFO queue and replies with the ID of the node whose request is at the front of the queue. When the number of responses received by node u exceeds m (m = 2k+1), node u regards node v (v may be u) as the winner if more than m responses name v as the winner. Otherwise, node u sends a release message to all replicas that take u as the winner, to relinquish the request. To avoid repeated conflicts in high-contention environments, node u starts over sending requests only after a random backoff time. In this manner, a winner can be elected successfully even if a replica is reset when "node substitution" occurs. The work in [Lin] regards "node substitution" as a sort of malicious fault, while our protocol regards it as a sort of omission fault.
References
1. [Bha] R. Bhagwan, D. Moore, S. Savage, and G. Voelker, "Replication Strategies for Highly Available Peer-to-peer Storage," Proc. WFDDC, 2002.
2. [Dab] F. Dabek, M. Kaashoek, D. Karger, R. Morris, and I. Stoica, "Wide-area Cooperative Storage with CFS," Proc. SOSP, 2001.
3. [Gop] V. Gopalakrishnan, B. Silaghi, B. Bhattacharjee, and P. Keleher, "Adaptive Replication in Peer-to-peer Systems," Proc. International Conference on Distributed Computing Systems, 2004.
4. [Kub] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "OceanStore: An Architecture for Global-Scale Persistent Storage," Proc. ASPLOS, 2000.
5. [Lin] S. Lin, Q. Lian, M. Chen, and Z. Zhang, "A Practical Distributed Mutual Exclusion Protocol in Dynamic Peer-to-peer Systems," Proc. 3rd International Workshop on Peer-to-Peer Systems (IPTPS '04), 2004.
6. [Mut] A. Muthitacharoen, R. Morris, T. Gil, and B. Chen, "Ivy: A Read/Write Peer-to-peer File System," Proc. OSDI, 2002.
7. [Rod] M. Rodrig and A. Lamarca, "Decentralized Weighted Voting for P2P Data Management," Proc. 3rd ACM International Workshop on Data Engineering for Wireless and Mobile Access, pp. 85–92, 2003.
8. [Ste] C. Stein, M. Tucker, and M. Seltzer, "Building a Reliable Mutable File System on Peer-to-peer Storage," Proc. WRP2PDS, 2002.
9. [Yu] H. Yu and A. Vahdat, "Consistent and Automatic Replica Regeneration," Proc. NSDI, 2004.
10. [Zho] B. Zhou, D. A. Joseph, and J. Kubiatowicz, "Tapestry: A Fault Tolerant Wide Area Network Infrastructure," Proc. ACM SIGCOMM, 2001.