Post on 29-Dec-2015
Pond: the OceanStore Prototype
Sean Rhea, Patrick Eaton, Dennis Geels,
Hakim Weatherspoon, Ben Zhao, and John Kubiatowicz
University of California, Berkeley
Proc. of the 2nd USENIX Conf. on File and Storage Technologies (FAST ’03)
Presented by Park, Seon-Yeong
3/26
OceanStore Overview
Internet-scale, Cooperative File System
Applications: Calendars, Email, Contact Lists, Large Digital Libraries, Repositories for Scientific Data, Distributed Design Tools, etc.
Requirements:
Universal Availability
Durability
Understandable Consistency Model
Privacy vs. Information Sharing
4/26
Data Model (1/2)
Data Object:
Analogous to a File in a Traditional File System
Named by an Active Globally-Unique Identifier (AGUID)
– Location Independent
– Preventing Name Space Collisions
AGUID = SHA-1(Application-specified Name + Owner’s Public Key)
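The naming rule on this slide can be sketched as follows. This is a minimal illustration, not Pond's actual code: the function name `compute_aguid`, the UTF-8 encoding of the name, the plain concatenation order, and the placeholder key bytes are all assumptions, since the slide does not specify the byte layout.

```python
import hashlib

def compute_aguid(app_name: str, owner_public_key: bytes) -> str:
    """Return a 160-bit identifier (hex) hashed from the name and the owner's key."""
    h = hashlib.sha1()
    # Assumption: name is UTF-8 encoded and concatenated with the raw key bytes.
    h.update(app_name.encode("utf-8"))
    h.update(owner_public_key)
    return h.hexdigest()

# Placeholder byte strings stand in for real public keys.
alice_key = b"alice-public-key-bytes"
bob_key = b"bob-public-key-bytes"

aguid_alice = compute_aguid("calendar", alice_key)
aguid_bob = compute_aguid("calendar", bob_key)
```

Because the owner's key is part of the hash input, two owners who pick the same application-specified name still get different identifiers, which is how name space collisions are avoided.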
5/26
Data Model (2/2)
Data Object:
A Sequence of Read-only Versions
Block Reference
– Cryptographically-secure Hash of the Child Block’s Contents
< Structure of Data Object >
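A rough sketch of hash-based block references, assuming a two-level tree (one root block pointing at data blocks) and SHA-1 as the hash. The helper names `block_ref` and `build_version` are hypothetical, and the real structure of a Pond data object is deeper than this.

```python
import hashlib

def block_ref(contents: bytes) -> bytes:
    """A block reference is the hash of the referenced block's contents."""
    return hashlib.sha1(contents).digest()

def build_version(data_blocks):
    """Store each data block under its hash, then a root block listing those hashes."""
    store = {}
    child_refs = []
    for block in data_blocks:
        ref = block_ref(block)
        store[ref] = block
        child_refs.append(ref)
    root = b"".join(child_refs)          # root block = concatenated child references
    root_ref = block_ref(root)
    store[root_ref] = root
    return root_ref, store

v1_root, v1_store = build_version([b"hello ", b"world"])
v2_root, v2_store = build_version([b"hello ", b"there"])
shared_blocks = set(v1_store) & set(v2_store)
```

Since every block is read-only and named by its contents, the unchanged block `b"hello "` is stored under the same reference in both versions, while the two root references differ.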
6/26
Underlying Technology
Access Control
Data Update: Primary Replica, Archival Storage, Secondary Replica
Data Read
Data Location & Routing: Tapestry
7/26
Access Control
Reader Restriction:
Encrypt All Data
Distribute the Encryption Key to Users with Read Permission
Writer Restriction:
Access Control List (ACL) for Each Object
All Writes Must be Signed so that Well-behaved Servers and Clients can Verify them against the ACL
9/26
Data Update (1/2)
Update:
Adds a New Version to the Head of the Version Stream
Array of Potential Actions, each Guarded by a Predicate
– Predicate Examples
• Checking the Latest Version Number, Comparing a Region of Bytes to an Expected Value, etc.
– Action Examples
• Replacing a Set of Bytes, Appending New Data, Truncating the Object, etc.
< Update Message Format >
Timestamp | Client ID | <Predicate 1, Action 1> | <Predicate 2, Action 2> | . . . | <Predicate N, Action N> | Client Signature
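The predicate-guarded update can be sketched as below. Two things here are assumptions rather than statements from the slide: that the pairs are tried in order and the action of the first true predicate is the one applied, and that an update with no true predicate leaves the object unchanged. The `Version` type and `apply_update` function are hypothetical names, and the timestamp, client ID, and signature fields are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass(frozen=True)
class Version:
    num: int
    data: bytes

Predicate = Callable[[Version], bool]
Action = Callable[[bytes], bytes]

def apply_update(head: Version, pairs: List[Tuple[Predicate, Action]]) -> Optional[Version]:
    """Return a new head version, or None if no predicate holds (assumed abort)."""
    for predicate, action in pairs:
        if predicate(head):
            # Versions are read-only: build a new one instead of mutating the head.
            return Version(head.num + 1, action(head.data))
    return None

head = Version(num=3, data=b"abc")
update = [
    (lambda v: v.num == 2, lambda d: d[:1]),                 # stale version check: does not fire
    (lambda v: v.data[:2] == b"ab", lambda d: d + b"def"),   # compare bytes, then append
]
new_head = apply_update(head, update)
aborted = apply_update(head, [(lambda v: v.num == 99, lambda d: b"")])
```

The old head stays intact after the update, which matches the model of an object as a stream of read-only versions.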
10/26
Data Update (2/2)
< Figure. OceanStore Update Path: Applications, Primary Replica (Inner Ring), Archival Storage, Secondary Replicas >
11/26
Primary Replica
Inner Ring:
A Set of Servers that Implement the Object’s Primary Replica
Applies Updates and Creates New Versions
– Serialization
– Access Control
– Creation of Archival Fragments
Update Agreement
– Byzantine Agreement Protocol
• Distributed Decision Process in which All Non-faulty Participants Reach the Same Decision
• A Group of Size 3f+1 Tolerates No More than f Faulty Servers
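The 3f+1 bound is simple arithmetic, shown here as a small worked example. The function names are hypothetical.

```python
def min_ring_size(f: int) -> int:
    """Smallest group size that tolerates f faulty servers."""
    return 3 * f + 1

def max_faulty(ring_size: int) -> int:
    """Largest number of faulty servers a group of this size can tolerate."""
    return (ring_size - 1) // 3

sizes = {f: min_ring_size(f) for f in (1, 2, 3)}
```

For instance, tolerating one faulty server takes four servers, and growing a ring from four to six servers does not raise the number of tolerated faults; a seventh server is needed to tolerate two.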
12/26
Archival Storage
Simple Replication:
Tolerates One Failure for an Additional 100% Storage Cost
Erasure Codes:
Efficient and Stable Storage for Archival Copies
Storage Cost Increases by a Factor of N/M
Original Block can be Reconstructed from Any M of the N Fragments
< Figure. A Block is Split into M Fragments and Encoded by an Erasure Code into N Fragments (M < N) >
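To make the "any M of N" property concrete, here is a toy erasure code with M = 2 and N = 3 built from a single XOR parity fragment. XOR parity is swapped in only because it is the simplest code with this property; it is a stand-in, not the code the prototype uses, and the `encode` and `decode` helpers are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(block: bytes):
    """Split a block into 2 data fragments and add 1 parity fragment (M=2, N=3)."""
    if len(block) % 2:
        raise ValueError("this sketch assumes an even-length block")
    half = len(block) // 2
    f0, f1 = block[:half], block[half:]
    return [f0, f1, xor_bytes(f0, f1)]

def decode(fragments: dict) -> bytes:
    """Rebuild the block from any 2 fragments, given as {index: bytes}."""
    if len(fragments) < 2:
        raise ValueError("need at least M = 2 fragments")
    if 0 in fragments and 1 in fragments:
        f0, f1 = fragments[0], fragments[1]
    elif 0 in fragments:
        f0 = fragments[0]
        f1 = xor_bytes(f0, fragments[2])   # f1 = f0 XOR parity
    else:
        f1 = fragments[1]
        f0 = xor_bytes(f1, fragments[2])   # f0 = f1 XOR parity
    return f0 + f1

block = b"OceanStore"
frags = encode(block)
storage_factor = sum(len(f) for f in frags) / len(block)   # N/M
```

This code tolerates the loss of any one fragment at a storage factor of N/M = 1.5, whereas simple replication needs a factor of 2 for the same single-failure tolerance.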
13/26
Secondary Replica
Whole-block Caching to Avoid Erasure-code Reconstruction for Frequently-read Objects
Push-based Update:
Pushed Every Time the Primary Replica Applies an Update
Dissemination Tree:
Application-level Multicast Tree
Rooted at the Primary Replica
Parent Nodes are Pre-existing Replicas that Serve the Object
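A minimal sketch of push-based dissemination down a tree rooted at the primary replica. The `Replica` class and its methods are hypothetical, and the sketch ignores how a new replica picks its parent and how network failures are handled.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.children = []
        self.versions = []

    def attach_to(self, parent: "Replica") -> None:
        """Join the tree under an existing replica of the same object."""
        parent.children.append(self)

    def push(self, version: str) -> None:
        """Record the new version locally, then forward it to every child."""
        self.versions.append(version)
        for child in self.children:
            child.push(version)

primary = Replica("primary")
near = Replica("secondary-1")
far = Replica("secondary-2")
near.attach_to(primary)
far.attach_to(near)        # the parent is a pre-existing secondary replica

primary.push("version-1")  # called each time the primary applies an update
```

Each replica forwards only to its own children, so the primary sends one message per direct child rather than one per replica in the system.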
15/26
Data Read
< Figure. Read Path among Application, Secondary Replica, Primary Replica (Inner Ring), and Archival Storage >
1. AGUID
2. Latest VGUID
3. Search Blocks from Secondary Replicas
4. Search enough Fragments from Archival Storages
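The four read steps can be sketched as one function. Everything concrete here is an assumption made for illustration: the dictionaries standing in for the primary replica's version map, the secondary caches, and the archive; the treatment of a version as a single block; and the `reconstruct` callback, which is only a placeholder for real erasure decoding.

```python
def read_object(aguid, latest_version, secondary_caches, archive, m, reconstruct):
    # Steps 1-2: resolve the AGUID to the latest VGUID (here a plain dict lookup).
    vguid = latest_version[aguid]
    # Step 3: look for a whole-block copy cached at a secondary replica.
    for cache in secondary_caches:
        if vguid in cache:
            return cache[vguid]
    # Step 4: fall back to archival storage; any m fragments are enough.
    fragments = archive.get(vguid, [])
    if len(fragments) < m:
        raise LookupError("not enough archival fragments")
    return reconstruct(fragments[:m])

latest_version = {"aguid-1": "vguid-2"}
archive = {"vguid-2": [b"he", b"llo"]}

def join(frags):
    """Toy stand-in for erasure decoding: just concatenates the fragments."""
    return b"".join(frags)

from_cache = read_object("aguid-1", latest_version, [{}, {"vguid-2": b"hello"}], archive, 2, join)
from_archive = read_object("aguid-1", latest_version, [{}], archive, 2, join)
```

The ordering reflects the cost difference: a cached whole block is returned directly, while the archival path has to collect fragments and decode them first.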
17/26
Data Location & Routing (1/4)
Tapestry:
Decentralized Object Location and Routing System
Assigns Globally Unique Identifiers (GUIDs) to Hosts and Resources
Location Independent
Locality Aware
18/26
Data Location & Routing (2/4)
Routing Example
Messages are Routed to the Destination ID Digit by Digit
***8 => **98 => *598 => 4598
< Figure. Routing Mesh with Nodes B4F8, 9098, 0325, 2BB8, 7598, 4598, 87CA, 0098, 3E98, 1598, D598, 2118; Links Labeled by Routing Level L1 to L4 >
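Digit-by-digit routing can be sketched with the node IDs from the figure. This is a simplification under stated assumptions: the `route` function sees the full node list, whereas a real Tapestry node only consults its own routing table, and ties are broken by ID order here rather than by network proximity.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits that two IDs have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def route(source: str, dest: str, nodes: list) -> list:
    """Hop to nodes that match a strictly longer suffix of dest until dest is reached."""
    path = [source]
    current = source
    while current != dest:
        matched = shared_suffix_len(current, dest)
        candidates = [(shared_suffix_len(n, dest), n) for n in nodes
                      if shared_suffix_len(n, dest) > matched]
        if not candidates:
            raise LookupError("no node extends the matched suffix")
        # Prefer the smallest improvement, i.e. resolve one more digit per hop.
        _, current = min(candidates)
        path.append(current)
    return path

NODES = ["B4F8", "9098", "0325", "2BB8", "7598", "4598",
         "87CA", "0098", "3E98", "1598", "D598", "2118"]
path = route("0325", "4598", NODES)
suffix_lengths = [shared_suffix_len(hop, "4598") for hop in path]
```

Starting from 0325, each hop shares one more trailing digit with 4598 than the previous one, which is the ***8, **98, *598, 4598 progression from the slide.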
19/26
Data Location & Routing (3/4)
Location Independent & Locality Aware
< Figure. Routing Mesh (Levels L1 to L4) Annotated with Replicas and Location Pointers >
22/26
Experimental Results (1/2)
Update Performance
< Table. Results of Latency Microbenchmark >
< Figure. Throughput in Local Area >
23/26
Experimental Results (2/2)
Comparison with NFS
< Figure. Andrew Benchmark; Labels: Write, Read, Read/Write >
24/26
Related Work
Other Peer-to-peer File Systems:
PAST [Rows01] and CFS [Dabe01]
– No Write Sharing
IVY [Muth02] and Pangaea [Sait02]
– Provide Both Read and Write Sharing, but with No Single Point of Consistency
25/26
Conclusion
Operational OceanStore Prototype:
Universal Accessibility, Fault Tolerance, Security, and Information Sharing
Future Research:
Improving Performance
– Efficient Threshold Schemes and Archival Data Generation
Self-Maintenance
Stability and Fault Tolerance
Supporting More Applications
26/26
Discussion
System Design Choice:
Security vs. Fast Response
Simple vs. Complicated Design
Storage Service Provider (SSP):
Independent SSP vs. Confederation of Companies such as IBM, AT&T
Efficient Storage Usage
27/26
Primary Replica (Ext.)
Modification of the Byzantine Agreement Protocol:
Public Key Cryptography
– Symmetric-key Message Authentication Codes (MACs) within the Inner Ring
– Public-key Cryptography for All Other Machines
Proactive Threshold Signatures
– Flexibility in Choosing the Membership of the Inner Ring
– Single Public Key with l Private Key Shares
– Any k Correctly Generated Signature Shares among the l can be Combined into a Full Signature
– Independent Sets of Key Shares can be Used to Control Membership
Responsible Party
– Chooses the Hosts that Make Up Inner Rings