high availability data model for p2p storage network

Upload: kieu-minh-duc

Post on 03-Jun-2018


  • 8/12/2019 High Availability Data Model for P2P Storage Network


    L. Chen, P. Triantafillou, and T. Suel (Eds.): WISE 2010, LNCS 6488, pp. 322–327, 2010.

    © Springer-Verlag Berlin Heidelberg 2010

    High Availability Data Model for P2P Storage Network

    BangYu Wu¹, Chi-Hung Chi¹, Cong Liu¹, ZhiHeng Xie¹, and Chen Ding²

    ¹ School of Software, Tsinghua University, Beijing, China

    ² Department of Computing Science, Ryerson University, Canada

    Abstract. With the goal of providing high data availability, replicas of data are distributed based on the idea of a threshold, meaning that the data service is guaranteed to be available as long as any k out of n peers are online. The key distribution algorithm of the model, as well as its scalability, management, and other availability-related factors, are presented and analyzed.

    Keywords: P2P Network, Storage, Data Availability Model.

    1 Introduction

    Peer-to-Peer (P2P) has proven to be one of the most effective and popular network models for large-scale content sharing on the Internet. Unlike the traditional client-server environment, where data is managed by a few highly reliable centralized servers, a P2P network distributes the burden of data storage among its hundreds or thousands of peers. This makes data availability a hard problem in P2P networks. Although ad hoc solutions based on redundancy have been proposed, they make no guarantee of data availability.

    In this paper, we investigate and improve data availability in P2P networks. In particular, we focus on the following issues: (i) how replica placement and distribution should be managed so that they cope consistently with the statistics of peer failure, and (ii) what the replica placement scheme should be so that, despite constant changes in online peer membership, files are highly available without being excessively replicated over many peers. To answer these questions, we propose a new high availability model for data storage in P2P networks. The key idea is to distribute replicas based on a threshold, an idea borrowed from secret sharing in security [7]. Analysis shows that our threshold model is very efficient in enhancing data availability in P2P networks.

    2 Related Work

    To ensure data availability in a P2P network, redundancy is often added to the original data. Reed-Solomon erasure coding has been proposed for distributing bulk data to millions of users through multicast and broadcast. Its basic principle is to divide a file into m fragments and recode them into n fragments, where m < n. At the other end, the file can be reassembled from any m fragments, whose aggregate size equals the original file size [5]. Various strategies for replication and placement have been proposed to address the data availability problem of P2P networks. PAST [6] maintains a strict replication rule: for each data item in the network, there should be k available replicas stored in k live peers whose NodeIDs are contiguous. OceanStore [2] employs multiple hash functions for replication. It hashes a data item's ID with a few different seed numbers and assigns the item to multiple target peers. PlanetP [1] and [3] generate redundant packets when a member needs to increase the availability of a file during periodic estimation of data availability. The path replication method [4] requires that the requested data be replicated in all peers along the data transmission path between the peer requesting the data and the peer holding it.

    While all these replication rules and strategies provide a good foundation for data availability in P2P networks, they are usually either too restrictive in the number of replicas or too complicated to be used in practical systems.

    3 Threshold-Based Model for High Availability Data Storage

    Given the ad hoc nature of peers in P2P networks, we argue that there are at least four main design requirements for any threshold-based high availability data storage model:

    Differential Data Reliability and Availability Setting. The data availability of each file in a P2P network should be defined individually, based on the degree of its significance.

    Reconstruction of Original Data File through Simple Re-assembling of Data Chunks, but Without Complicated Computational Overhead. Minimizing data storage overhead should be achieved without sacrificing retrieval performance.

    Robustness of Data Availability Support. A good threshold-based data storage model should be as robust as possible.

    Easy Management of Data Chunk Replicas. A good threshold-based data storage model should support file updates without complicated, dynamic replication and migration. All data chunks should be easy to manage, maintain, and search.

    3.1 Assumptions and Notation

    Before presenting the model, we list the assumptions made in our threshold-based data availability model. In the P2P network of interest, every peer is delegated equal responsibility. All peers are uniform in their storage capacity, which permits our threshold model to deploy data chunk replicas freely. Furthermore, the P2P network is relatively stable in that peers eventually rejoin the community after they go offline, and the online time of peers is much longer than their offline time. Under this last assumption, it makes sense to talk about the average availability of a peer.

    To simplify the rest of the discussion, we define the following notation. Let a file of size l be available in a P2P network with n peers and average peer availability p. The P2P network should provide file service with availability a when any k peers are online; we call this the (k, n) model. To achieve this goal, the file is divided into m data chunks of size s each. Of these, r data chunks are placed on each peer so that, when any k peers are available, the original file can be reconstructed by assembling the data chunks held by those k peers.


    In general, since every peer in the network may have a different probability of being online, say p_i for peer i, it would be too expensive to compute the exact file availability. Instead, we use the following approximation, which uses the average probability of being online (the times below are those of peer i):

    $$p_i = \frac{\text{average online time}}{\text{average online time} + \text{average offline time}}, \qquad p = \frac{1}{n}\sum_{i=1}^{n} p_i .$$
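As a small illustration of this averaging, the following Python sketch computes p_i for each peer and then the network-wide average p. The function names and the sample online/offline durations are hypothetical and are not taken from the paper.

```python
def peer_availability(online_time, offline_time):
    """p_i: fraction of time a single peer is online."""
    return online_time / (online_time + offline_time)


def average_availability(peers):
    """p: mean of p_i over all peers.

    `peers` is a list of (average online time, average offline time) pairs,
    one pair per peer, in any consistent time unit.
    """
    values = [peer_availability(on, off) for on, off in peers]
    return sum(values) / len(values)


# Hypothetical measurements in hours per day for three peers.
observed = [(20.0, 4.0), (18.0, 6.0), (22.0, 2.0)]
p = average_availability(observed)
print(f"average peer availability p = {p:.4f}")
```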

    3.2 Threshold Constraints

    The overall description of our threshold model is as follows. There are n peers in the P2P network. Each file stored in the network is divided into many chunks. The constraint of our model is that if any k or more peers are online, the original file is guaranteed to be reconstructible; otherwise, file reconstruction fails. Our target is to minimize the average storage overhead on each peer. The challenge is how to divide the file and how to distribute the resulting chunks among peers. The threshold value k should be computed from the required system availability. Given a, k, m, and n, which constrain the distribution of file chunks, the availability function a(n,k,p) of a P2P network under our (k, n) model takes one of the following forms:

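    $$\begin{aligned}
    a(n,k,p) &= \sum_{i=k}^{n} C_n^{i}\, p^{i} (1-p)^{n-i} && (1) \\
             &= \sum_{i=0}^{n-k} C_n^{i}\, p^{n-i} (1-p)^{i} && (2) \\
             &= 1 - \sum_{i=0}^{k-1} C_n^{i}\, p^{i} (1-p)^{n-i} && (3)
    \end{aligned}$$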

    Equation (1) expresses the file availability from the viewpoint that at least k peers are online. Equation (2) expresses it from the opposite viewpoint that at most n-k peers are offline. Equation (3) expresses it as the complement of the event that fewer than k peers are online. The three equations are equivalent.
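The equivalence of the three forms can be checked numerically. The Python sketch below evaluates each form directly from its formula. The function names are hypothetical, and the helper `largest_threshold` rests on an assumption that is not stated in the paper: that one wants the largest k that still meets a required availability.

```python
from math import comb


def availability_at_least_k_online(n, k, p):
    """Form (1): sum over i = k..n online peers."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def availability_at_most_n_minus_k_offline(n, k, p):
    """Form (2): sum over i = 0..n-k offline peers."""
    return sum(comb(n, i) * p**(n - i) * (1 - p)**i for i in range(0, n - k + 1))


def availability_complement(n, k, p):
    """Form (3): one minus the probability that fewer than k peers are online."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k))


def largest_threshold(n, p, target):
    """Largest k with a(n, k, p) >= target, or None if no k qualifies.

    Assumption for illustration only: availability decreases as k grows,
    so we scan k downwards from n and stop at the first k that qualifies.
    """
    for k in range(n, 0, -1):
        if availability_at_least_k_online(n, k, p) >= target:
            return k
    return None


n, k, p = 5, 3, 0.9
print(availability_at_least_k_online(n, k, p))
print(availability_at_most_n_minus_k_offline(n, k, p))
print(availability_complement(n, k, p))
print(largest_threshold(n, p, target=0.99))
```

With n = 5 and p = 0.9, all three functions agree up to floating-point rounding.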

    3.3 Model Prototype

    Let D = {d1, d2, ..., dm} be the set of m data chunks into which a given file is divided. The set of peers in the P2P network is denoted by E = {e1, e2, ..., en}. The n*m data chunk distribution matrix W = [wij] is obtained by multiplying a 0-1 matrix G = [gij] of size n*m by the m*m diagonal matrix H formed from the di. The rows of W act as the data chunk vectors distributed among peers. That is, W = G*H. Therefore,

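    $$\begin{pmatrix}
    w_{1,1} & w_{1,2} & w_{1,3} & \cdots & w_{1,m} \\
    w_{2,1} & w_{2,2} & w_{2,3} & \cdots & w_{2,m} \\
    \vdots  & \vdots  & \vdots  &        & \vdots  \\
    w_{n,1} & w_{n,2} & w_{n,3} & \cdots & w_{n,m}
    \end{pmatrix}
    =
    \begin{pmatrix}
    g_{1,1} & g_{1,2} & g_{1,3} & \cdots & g_{1,m} \\
    g_{2,1} & g_{2,2} & g_{2,3} & \cdots & g_{2,m} \\
    \vdots  & \vdots  & \vdots  &        & \vdots  \\
    g_{n,1} & g_{n,2} & g_{n,3} & \cdots & g_{n,m}
    \end{pmatrix}
    *
    \begin{pmatrix}
    d_1    & 0      & \cdots & 0      \\
    0      & d_2    & \cdots & 0      \\
    \vdots & \vdots & \ddots & \vdots \\
    0      & 0      & \cdots & d_m
    \end{pmatrix}$$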


    where wij ∈ D ∪ {0} and gij ∈ {0,1}. If wij = dj (1 ≤ i ≤ n, 1 ≤ j ≤ m), then dj is distributed to peer i; otherwise, if wij = 0, dj is not distributed to peer i. Implementing the threshold model therefore amounts to computing the distribution matrix W. Since matrix H is known, constructing the 0-1 matrix G is the key problem.
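To make the product W = G*H concrete, here is a minimal Python sketch for a tiny example with n = 3 peers and m = 3 chunks. Because H is diagonal, multiplying by it simply replaces each 1 in column j of G with chunk dj. The function names, the chunk labels, and the use of the integer 0 as the "not stored" marker are illustrative assumptions.

```python
def distribution_matrix(G, chunks):
    """Compute W = G * H, where H = diag(chunks).

    W[i][j] is chunks[j] when G[i][j] == 1 and 0 otherwise, so row i of W
    lists what peer i stores.
    """
    return [[chunks[j] if g == 1 else 0 for j, g in enumerate(row)] for row in G]


def chunks_on_peer(W, i):
    """Return the chunks actually stored on peer i (the non-zero entries of row i)."""
    return [w for w in W[i] if w != 0]


# Hypothetical 0-1 matrix for n = 3 peers and m = 3 chunks:
# every column contains exactly one zero.
G = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
chunks = ["d1", "d2", "d3"]  # placeholder labels standing in for real chunk data

W = distribution_matrix(G, chunks)
for i, row in enumerate(W, start=1):
    print(f"peer e{i}: {row}")
```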

    Theorem 1. A data chunk distribution matrix W is consistent with the threshold model if and only if G satisfies the following three conditions:

    (1) For any j with 1 ≤ j ≤ m, $\sum_{i=1}^{n} (1 - g_{ij}) \le k-1$.

    (2) $m \ge C_n^{k-1}$.

    (3) There must exist $C_n^{k-1}$ columns in matrix G, each containing exactly k-1 zeros and n-(k-1) ones. Together these columns represent all the combinations of k-1 zeros distributed over n positions: each column corresponds to one of these combinations, and all these columns are different from each other.

    Proof. Data chunk distribution based on our threshold model must satisfy the above three conditions. Condition (1) guarantees that a file can be reassembled from any k or more peers. Conditions (2) and (3) guarantee that the file cannot be reconstructed from fewer than k peers.

    Under the (k, n) model, a file can be reconstructed when the number of online peers is not less than k. So, if we pick any k rows from G, each column must contain at least one element 1, representing some data chunk, within these k rows. If Condition (1) does not hold, there exists a column j whose number of zero elements is larger than k-1, and we can pick at least k rows of W whose elements in column j are all zero. In this case, the file cannot be reconstructed from these rows, since at least one data chunk is missing even though the number of online peers is more than k-1. This contradicts the definition of the threshold model, so Condition (1) must hold.

    For Condition (2), any k-1 peers together must lack at least one data chunk of the original file, and the data chunks missing from any two different subsets of k-1 peers must be different. Suppose there are two different sets A and B with |A| = |B| = k-1. If both miss the same data chunk d1, then the file still lacks d1 even when it is reconstructed from the more than k-1 peers in A ∪ B. This contradicts the (k, n) model. So in the threshold model, a file should be divided into at least $C_n^{k-1}$ data chunks.

    For Condition (3), again arguing by contradiction, we assume that the columns of G do not cover all combinations of k-1 zero elements distributed over n positions. Suppose one combination X = (x1, x2, ..., xn) is missing from G, and let y1, y2, ..., yk-1 be the subscripts for which $x_{y_1}, x_{y_2}, \ldots, x_{y_{k-1}}$ in X are zero. Then there is no column in G whose elements are zero in all of rows y1, y2, ..., yk-1. We pick the corresponding rows y1, y2, ..., yk-1 of W, and the original file can be reconstructed from these k-1 rows, since each data chunk is located on at least one of these peers. This is inconsistent with the threshold model, so Condition (3) is proved.

    Hence, the proof is completed.
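The conditions of Theorem 1 can be checked by brute force on a small instance. The sketch below builds a G whose columns are exactly the combinations of k-1 zeros among n rows and then tests both directions of the threshold property. For brevity it uses Python's `itertools.combinations` to enumerate the zero positions, which is a shortcut and not the paper's own enumeration procedure. All function names are hypothetical.

```python
from itertools import combinations


def build_G(n, k):
    """0-1 matrix with one column per way of placing k-1 zeros among n rows."""
    zero_sets = list(combinations(range(n), k - 1))
    return [[0 if i in zeros else 1 for zeros in zero_sets] for i in range(n)]


def covers_all_chunks(G, peers):
    """True if every column (chunk) has a 1 in at least one of the given rows."""
    m = len(G[0])
    return all(any(G[i][j] for i in peers) for j in range(m))


def satisfies_threshold(G, k):
    """Check the (k, n) property: any k peers suffice, no k-1 peers do.

    Checking subsets of size exactly k-1 is enough for the second part,
    because a smaller set of peers holds no more chunks than a superset.
    """
    n = len(G)
    every_k_ok = all(covers_all_chunks(G, s) for s in combinations(range(n), k))
    no_smaller_ok = not any(
        covers_all_chunks(G, s) for s in combinations(range(n), k - 1)
    )
    return every_k_ok and no_smaller_ok


G = build_G(n=4, k=3)
print(len(G[0]), "columns")          # number of chunks m
print(satisfies_threshold(G, k=3))   # threshold property of the full matrix

# Dropping one column breaks Condition (3): some k-1 peers now suffice.
G_missing = [row[:-1] for row in G]
print(satisfies_threshold(G_missing, k=3))
```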


    Theorem 2. There are m!/n! alternatives of data chunk distribution methods for our

    threshold model.

    Proof

    There are m columns in the 0-1 matrix G, and G still satisfies the threshold requirement no matter how these columns are permuted. This is because the data chunks di are independent of each other, and the distribution method for di can also be used for other data chunks. So we have m! placement methods for the columns.

    On the other hand, there are n rows in the 0-1 matrix G, with each row representing the data chunks located on the corresponding peer. G still satisfies the threshold requirement no matter how the rows are permuted: since the peers ei are independent of each other, the data chunks located on peer ei can also be distributed to other peers. But exchanging rows does not change the distribution of data chunks; it only changes the assignment of peers. So we have n! placement methods for the rows.

    Hence, there are m!/n! alternatives of data chunk distribution methods for our threshold model, and the proof is completed.

    3.4 Algorithm of Distributing Data Chunks

    The above analysis settles how many data chunks a file should be divided into and how many data chunks should be located on each peer. We also know that data chunks cannot be scattered to peers arbitrarily, and that there are m!/n! alternative distribution methods. The last question is how to distribute the data chunks among peers. The data chunk distribution matrix W can be computed once the 0-1 matrix G is known. To compute matrix G, we construct $C_n^{k-1}$ combinations. The algorithm first determines the positions of all zero elements, then sets all the remaining elements of G to 1. The algorithm for producing G is given in Fig. 1; its complexity is O($C_n^{k-1}$).

    Step 1. Initialization: the first combination generates the first column. The first column has k-1 consecutive zero elements, whose row numbers are y_1 = 1, y_2 = 2, ..., y_{k-1} = k-1 respectively. Let y_1, y_2, ..., y_{k-1} be the current combination.

    Step 2. For each of the remaining $C_n^{k-1}$ - 1 columns, compute the next combination y_1, y_2, ..., y_{k-1} from the current combination as follows:

    S1. i = max{ j | y_j < n-(k-1)+j }

    S2. y_i = y_i + 1

    S3. y_j = y_{j-1} + 1, for j = i+1, i+2, ..., k-1

    S4. y_1, y_2, ..., y_{k-1} forms the current combination.

    Step 3. For each combination y_1, y_2, ..., y_{k-1} above, construct a column of G as follows: set the elements in rows y_1, y_2, ..., y_{k-1} to 0, and set the remaining elements to 1.

    Fig. 1. Algorithm to Compute Matrix G

    Finally, the distribution matrix W is calculated by W = G*H.
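The steps of Fig. 1 can be sketched in Python as follows. This is one possible reading of the figure, not a reference implementation: row numbers are kept 1-based as in the figure, the loop stops when no index satisfies the condition in S1, and the function names are hypothetical.

```python
def zero_row_combinations(n, k):
    """Yield the successive combinations (y_1, ..., y_{k-1}) of zero rows.

    Rows are numbered 1..n as in Fig. 1. Each yielded tuple gives the rows
    that hold 0 in one column of G.
    """
    size = k - 1
    y = list(range(1, size + 1))      # Step 1: y_j = j
    yield tuple(y)
    while True:
        # S1: largest j (1-based) whose entry can still be increased.
        candidates = [j for j in range(1, size + 1) if y[j - 1] < n - size + j]
        if not candidates:
            return                    # every combination has been produced
        i = max(candidates)
        y[i - 1] += 1                 # S2
        for j in range(i + 1, size + 1):
            y[j - 1] = y[j - 2] + 1   # S3: reset the tail to consecutive rows
        yield tuple(y)                # S4: new current combination


def build_G(n, k):
    """Step 3: one column per combination, 0 in the listed rows and 1 elsewhere."""
    combos = list(zero_row_combinations(n, k))
    return [[0 if row in combo else 1 for combo in combos] for row in range(1, n + 1)]


for combo in zero_row_combinations(4, 3):
    print(combo)

for row in build_G(4, 3):
    print(row)
```

Replacing each 1 in column j of the resulting G with chunk dj then gives W = G*H.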


    4 Conclusion

    This paper proposes a theoretical model for achieving high availability in P2P systems by partitioning a file into chunks and distributing these chunks over peers. Availability is based on the threshold model: a file is available, by reassembling its chunks, as long as any k peers are online. The partitioning and distribution algorithms are designed to guarantee that the file chunks located on any k peers can be used to reconstruct the original file. By comparing the threshold model with a simple replication mechanism, we can show that our model provides more flexible availability and download performance than simple replication, which just places k replicas on k selected peers in a dynamic P2P network.

    Acknowledgement

    This work is supported by the China National 863 project #2008AA01Z129.

    References

    1. Cuenca-Acuna, F.M., Peery, C., Martin, R.P., Nguyen, T.D.: PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities. In: Proceedings of the IEEE International Symposium on High Performance Distributed Computing (2003)

    2. Kubiatowicz, J., Bindel, D., Chen, Y., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., Zhao, B.: OceanStore: An Architecture for Global-Scale Persistent Storage. In: Proceedings of ACM ASPLOS Conference (2000)

    3. Liu, X.Z., Yang, G.W., Wang, D.X.: Stationary and Adaptive Replication Approach to Data Availability in Structured Peer-to-Peer Overlay Networks. In: Proceedings of the 11th IEEE International Conference on Networks (2003)

    4. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and Replication in Unstructured Peer-to-Peer Networks. In: Proceedings of the 16th ACM International Conference on Supercomputing, New York (June 2002)

    5. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A Scalable Content-Addressable Network. In: Proceedings of the ACM SIGCOMM Conference (2001)

    6. Rowstron, A., Druschel, P.: Storage Management and Caching in PAST, a Large-Scale, Persistent Peer-to-Peer Storage Utility. In: Proceedings of ACM Symposium on Operating Systems Principles (October 2001)

    7. Shamir, A.: How to Share a Secret. Communications of the ACM 22(11), 612–613 (1979)