Replica Control for Peer-to-Peer Storage Systems
P2P
• Peer-to-peer (P2P) has emerged as an important paradigm for sharing resources at the edges of the Internet.
• The most widely exploited resource is storage, as typified by P2P music file sharing:
  – Napster
  – Gnutella
• Following the great success of P2P file sharing, a natural next step is to develop wide-area P2P storage systems that aggregate storage across the Internet.
Replica Control Protocol
• Replication
  – to maintain multiple copies of some critical data to increase availability
  – to reduce read access times
• Replica Control Protocol
  – to avoid inconsistent updates
  – to guarantee a consistent view of the replicated data
Resiliency Requirement
• Need data replication
  – Even if some nodes fail, the computation can progress
  – Consistency requirement
  – Failures may partition the network
  – Rejoining partitions requires consistency control algorithms
One-Copy Equivalence Consistency Criterion

• The set of replicas must behave as if there were only a single copy. Conditions to ensure one-copy equivalence:
  – no two write operations can proceed at the same time
  – no read operation can proceed at the same time as a write operation
  – a read operation always returns the replica written by the last write operation
Replica Control Methods
• Optimistic
  – Proceed with computation on the available subgroup
  – Optimistically restore consistency when partitions rejoin later
• Pessimistic
  – Restrict computations under worst-case assumptions
  – Approaches:
    • Primary site
    • Voting
Optimistic Approach
• Version vector for file f
  – an N-element vector, where N is the number of nodes on which f is stored
  – the i-th element represents the number of updates done by node i
• A vector V dominates V′ if
  – every element of V >= the corresponding element of V′
• V and V′ conflict if neither dominates the other
Optimistic (cont’d)
• Consistency resolution
  – If V dominates V′, the copies are inconsistent; the inconsistency can be resolved by copying V’s replica over V′’s
  – If V and V′ conflict, the inconsistency cannot be resolved automatically
• Version vectors can resolve only update conflicts; they cannot resolve read-write conflicts (see the sketch below)
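A minimal sketch (not from the slides; plain Python lists stand in for the per-node update counts) of how dominance and conflicts between version vectors can be detected:

```python
# A minimal sketch of version-vector comparison, assuming vectors are
# equal-length lists of per-node update counts.

def dominates(v, w):
    """True if every element of v is >= the corresponding element of w."""
    return all(a >= b for a, b in zip(v, w))

def compare(v, w):
    """Classify the relationship between two version vectors."""
    if dominates(v, w) and dominates(w, v):
        return "equal"          # same history; replicas already consistent
    if dominates(v, w):
        return "v-dominates"    # resolvable: copy v's replica over w's
    if dominates(w, v):
        return "w-dominates"    # resolvable: copy w's replica over v's
    return "conflict"           # neither dominates; cannot be resolved

# Example: nodes 0 and 2 updated concurrently -> unresolvable conflict
print(compare([2, 1, 0], [1, 1, 1]))  # "conflict"
print(compare([2, 1, 1], [1, 1, 1]))  # "v-dominates"
```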
Primary Site Approach
• Data are replicated on at least k+1 nodes (for k-resilience)
• One node acts as the primary site (PS)
  – Any read request is served by the PS
  – Any write request is copied to all other back-up sites
  – Any write request arriving at a back-up site is forwarded to the PS (see the sketch below)
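A minimal sketch of this routing, assuming an in-process registry in place of real network transport; all names are hypothetical:

```python
# A minimal sketch of primary-site request routing; the registry and
# all names are hypothetical stand-ins for real network transport.

NODES = {}  # node_id -> Node

def forward(node_id, op, *args):
    node = NODES[node_id]
    if op == "read":
        return node.read(*args)
    if op == "write":
        return node.write(*args)
    if op == "apply":                # PS pushing a write to a back-up
        node.data[args[0]] = args[1]

class Node:
    def __init__(self, node_id, primary, backups):
        self.node_id, self.primary, self.backups = node_id, primary, backups
        self.data = {}
        NODES[node_id] = self

    def read(self, key):
        # Any read request is served by the PS.
        if self.node_id != self.primary:
            return forward(self.primary, "read", key)
        return self.data.get(key)

    def write(self, key, value):
        # A write arriving at a back-up is forwarded to the PS.
        if self.node_id != self.primary:
            return forward(self.primary, "write", key, value)
        # The PS applies the write, then copies it to every back-up.
        self.data[key] = value
        for b in self.backups:
            forward(b, "apply", key, value)

ps = Node("ps", primary="ps", backups=["b1", "b2"])
b1, b2 = Node("b1", "ps", []), Node("b2", "ps", [])
b1.write("x", 1)       # forwarded to the PS, then copied to back-ups
print(b2.read("x"))    # served via the PS -> 1
```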
PS Failure Handling
• If a back-up fails, there is no interruption in service
• If the PS fails, there are two possibilities
  – If the network is not partitioned
    • Choose another node in the set as the primary
    • If checkpointing has been active, only a restart from the previous checkpoint is needed
  – If the network is partitioned
    • Only the partition containing the PS can progress
    • The other partitions stop updates on the data
    • It is necessary to distinguish between site failures and network partitions
Witnesses
• Witness: a small entity that maintains enough information to identify the replicas that contain the most recent version of the data
  – this information could be a timestamp recording the time of the latest update
  – the timestamp can be replaced by a version number: an integer incremented each time the data are updated
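A minimal sketch (hypothetical names, assuming the version-number variant) of how a witness can single out up-to-date replicas:

```python
# A minimal sketch of a version-number witness; names are hypothetical.

class Witness:
    """Stores only a version number, never the data itself."""
    def __init__(self):
        self.version = 0

    def record_write(self):
        self.version += 1          # incremented on every update
        return self.version

    def current_replicas(self, replica_versions):
        # replica_versions: dict of replica id -> version it holds.
        # Returns the replicas holding the most recent version.
        return [r for r, v in replica_versions.items() if v == self.version]

w = Witness()
w.record_write(); w.record_write()                    # two updates
print(w.current_replicas({"A": 2, "B": 1, "C": 2}))   # ['A', 'C']
```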
Voting Approach
• V votes are distributed to the n replicas such that (see the check below)
  – Vw + Vr > V (a write quorum always intersects a read quorum)
  – Vw + Vw > V (any two write quorums intersect)
• Obtain Vr or more votes to read
• Obtain Vw or more votes to write
• Quorum systems are more general than voting
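For example, with V = 5 votes over five replicas, Vr = 3 and Vw = 3 satisfy both constraints. A minimal check (the helper is hypothetical, not from the slides):

```python
# A minimal sketch checking Gifford-style voting constraints;
# the helper name is hypothetical.

def valid_quorum(V, Vr, Vw):
    """True if the read/write vote thresholds guarantee intersection."""
    read_write_intersect = Vr + Vw > V   # every read sees the last write
    write_write_intersect = 2 * Vw > V   # no two concurrent writes
    return read_write_intersect and write_write_intersect

print(valid_quorum(5, 3, 3))  # True
print(valid_quorum(5, 2, 3))  # False: a read quorum could miss a write
```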
Quorum Systems
• Trees
• Grid-based (array-based)
• Torus
• Hierarchical
• Multi-column
and so on…
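As one concrete instance, a sketch of the classic grid-based construction, assuming a read quorum takes one node per column and a write quorum takes one full column plus one node from each other column (the 3x3 layout is an invented example, not from the slides):

```python
# A sketch of the classic grid quorum construction.

def read_quorum(grid):
    # One node from each column (here: the top node, for simplicity).
    return {col[0] for col in zip(*grid)}

def write_quorum(grid, full_col=0):
    cols = list(zip(*grid))
    q = set(cols[full_col])                               # one full column
    q |= {col[0] for i, col in enumerate(cols) if i != full_col}
    return q

grid = [["a", "b", "c"],
        ["d", "e", "f"],
        ["g", "h", "i"]]

r, w = read_quorum(grid), write_quorum(grid)
print(r & w)   # non-empty: every read quorum meets every write quorum
```

Any two write quorums also intersect, since each contains a full column and one node from every column.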
Classification of P2P Storage Sys.
• Unstructured
  – “Replication Strategies for Highly Available Peer-to-peer Storage”
  – “Replication Strategies in Unstructured Peer-to-peer Networks”
• Structured
  – Read-only: CFS, PAST
  – Read/Write (mutable): LAR, Ivy, Oasis, Om, Eliot, Sigma (for a mutual exclusion primitive)
Ivy
• Stores a set of logs with the aid of distributed hash tables.
• Ivy keeps, for each participant, a log storing all of that participant’s updates, and maintains data consistency optimistically by performing conflict resolution among all logs (i.e., in a best-effort manner).
• The logs must be kept indefinitely, and a participant must scan all the logs related to a file to look up the up-to-date file data (see the sketch below). Thus, Ivy is suitable only for small groups of participants.
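A minimal sketch of why reads scale poorly with the number of participants; the log-record format here is invented, not Ivy’s actual format:

```python
# A minimal sketch of per-participant update logs; the record format
# (file, sequence number, content) is a hypothetical simplification.

logs = {
    "alice": [("f", 1, "v1"), ("f", 3, "v3")],
    "bob":   [("f", 2, "v2")],
}

def read_file(name):
    # A reader must scan ALL participants' logs and merge their
    # updates to find the latest version of the file.
    updates = [rec for log in logs.values() for rec in log if rec[0] == name]
    return max(updates, key=lambda rec: rec[1])[2] if updates else None

print(read_file("f"))  # 'v3' -- found only after scanning every log
```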
Eliot
• Eliot relies on a reliable, fault-tolerant, immutable P2P storage substrate (Charles) to store data blocks, and uses an auxiliary metadata service (MS) to store mutable metadata.
• It supports NFS-like consistency semantics; however, the traffic between the MS and the client is high under these semantics.
• It also supports AFS open-close consistency semantics; however, these semantics may cause the problem of lost updates.
• The MS is provided by a conventional replicated database, which may not fit dynamic P2P environments.
Oasis
• Oasis is based on Gifford’s weighted-voting quorum concept and allows dynamic quorum membership.
• It spreads versioned metadata along with data replicas over the P2P network.
• To complete an operation on a data object, a client must first find the metadata related to the object and determine the total number of votes, the votes required for read/write operations, the replica list, and so on, to form a quorum accordingly (see the sketch below).
• One drawback of Oasis is that if a node happens to use stale metadata, data consistency may be violated.
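A minimal sketch of the versioned metadata a client must locate before forming a quorum; the field names are hypothetical, not Oasis’s actual format:

```python
# A minimal sketch of Oasis-style versioned metadata; all field names
# are hypothetical.

from dataclasses import dataclass, field

@dataclass
class ObjectMetadata:
    version: int                 # a stale copy is the cited weakness
    total_votes: int             # V
    read_votes: int              # Vr
    write_votes: int             # Vw
    replicas: list = field(default_factory=list)   # current replica list

meta = ObjectMetadata(version=7, total_votes=5, read_votes=3,
                      write_votes=3, replicas=["n1", "n2", "n3"])
# A client uses these fields to assemble a read or write quorum; if
# meta.version is stale, the assembled quorum may fail to intersect
# the current one, violating consistency.
```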
Om
• Om is based on the concepts of automatic replica regeneration and replica membership reconfiguration.
• Consistency is maintained by two quorum systems: a read-one-write-all quorum system for accessing replicas, and a witness-modeled quorum system for reconfiguration.
• Om allows replica regeneration from a single replica. However, a write in Om is always first forwarded to the primary copy, which serializes all writes and uses a two-phase procedure to propagate each write to all secondary replicas (see the sketch below).
• The drawbacks of Om are that (1) the primary replica may become a bottleneck; (2) the overhead incurred by the two-phase procedure may be too high; (3) reconfiguration by the witness model has some probability of violating consistency.
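A minimal sketch (all names hypothetical) of a primary that serializes writes and propagates each one with a two-phase prepare/commit exchange, illustrating where the bottleneck and the extra round trips come from:

```python
# A minimal sketch of primary-serialized two-phase write propagation,
# as described for Om; all names are hypothetical.

class Primary:
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.seq = 0                      # serializes all writes

    def write(self, key, value):
        self.seq += 1
        # Phase 1: every secondary buffers (prepares) the write.
        for s in self.secondaries:
            s.prepare(self.seq, key, value)
        # Phase 2: commit everywhere once all secondaries prepared.
        for s in self.secondaries:
            s.commit(self.seq)

class Secondary:
    def __init__(self):
        self.pending, self.data = {}, {}

    def prepare(self, seq, key, value):
        self.pending[seq] = (key, value)

    def commit(self, seq):
        key, value = self.pending.pop(seq)
        self.data[key] = value

secs = [Secondary(), Secondary()]
Primary(secs).write("x", 42)   # two message rounds per write
print(secs[0].data)            # {'x': 42}
```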
Sigma
• The Sigma protocol collects state from all replicas to achieve mutual exclusion.
• The basic idea of the Sigma protocol is as follows. A node u wishing to be the winner of the mutual exclusion sends a timestamped request to each of the n (n = 3k+1) replicas and waits for replies. On receiving a request from u, a node v puts u’s request into a local queue ordered by timestamp, takes as the winner the node whose request is at the front of the queue, and replies with that winner’s ID to u.
Sigma (cont’d)

• When the number of replies received by u exceeds m (m = 2k+1), u acts according to the following conditions:
  (1) if more than m replies take u as the winner, then u is the winner;
  (2) if more than m replies take w (w ≠ u) as the winner, then w is the winner and u just keeps waiting;
  (3) if no node is regarded as the winner by more than m replies, then u sends a YIELD message to cancel its request temporarily and then re-inserts it again.
• In this manner, one node can eventually be elected as the winner even when the communication delay variance is large.
• A drawback of the Sigma protocol is that a node must send requests to all replicas and gather favorable replies from a large portion (2/3) of the nodes to win the mutual exclusion, which incurs large overhead (see the sketch below). Moreover, the overhead becomes even larger under high contention.
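A minimal sketch of this reply-counting logic; the message passing, the yield handling, and all names are simplified, hypothetical stand-ins for the real protocol:

```python
# A minimal sketch of Sigma-style voting for mutual exclusion.

import collections

K = 1
N = 3 * K + 1          # total replicas
M = 2 * K + 1          # replies needed before deciding

class Replica:
    def __init__(self):
        self.queue = []                 # (timestamp, node_id), kept sorted

    def request(self, ts, node_id):
        self.queue.append((ts, node_id))
        self.queue.sort()               # order requests by timestamp
        return self.queue[0][1]         # reply with current winner's ID

    def yield_request(self, ts, node_id):
        self.queue.remove((ts, node_id))   # temporarily cancel the request

def try_acquire(replicas, ts, node_id):
    # Send a timestamped request to ALL n replicas, collect replies.
    replies = [r.request(ts, node_id) for r in replicas]
    votes = collections.Counter(replies[:M])   # decide after m replies
    winner, count = votes.most_common(1)[0]
    if count >= M and winner == node_id:
        return "acquired"
    if count >= M:
        return f"wait: {winner} holds the lock"
    # No node named winner by enough replies: YIELD and retry later.
    for r in replicas:
        r.yield_request(ts, node_id)
    return "yield-and-retry"

replicas = [Replica() for _ in range(N)]
print(try_acquire(replicas, ts=1, node_id="u"))  # 'acquired'
print(try_acquire(replicas, ts=2, node_id="v"))  # 'wait: u holds the lock'
```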
MUREX comes to the rescue!