jeff terrace michael j. freedman
DESCRIPTION
Object Storage on CRAQ High throughput chain replication for read-mostly workloads. Jeff Terrace Michael J. Freedman. Data Storage Revolution. Relational Databases Object Storage (put/get) Dynamo PNUTS CouchDB MemcacheDB Cassandra. Speed Scalability Availability Throughput - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/1.jpg)
Jeff TerraceJeff TerraceMichael J. FreedmanMichael J. Freedman
Object Storage on CRAQObject Storage on CRAQ
High throughput chain replication forHigh throughput chain replication forread-mostly workloadsread-mostly workloads
![Page 2: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/2.jpg)
Data Storage Revolution
• Relational Databases
• Object Storage (put/get)– Dynamo– PNUTS– CouchDB– MemcacheDB– Cassandra
SpeedScalabilityAvailabilityThroughput
No Complexity
![Page 3: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/3.jpg)
Eventual Consistency
Manager
Replica
ReplicaReplica
Write Request
Read RequestReplica
B
A
Read Request
![Page 4: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/4.jpg)
Eventual Consistency
• Writes ordered after commit
• Reads can be out-of-order or stale
• Easy to scale, high throughput
• Difficult application programming model
![Page 5: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/5.jpg)
Traditional Solution to Consistency
Manager
Replica
Replica
ReplicaReplica
Write Request
Two-Phase Commit:
1.Prepare2.Vote: Yes3.Commit4.Ack
![Page 6: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/6.jpg)
Strong Consistency
• Reads and Writes strictly ordered
• Easy programming
• Expensive implementation
• Doesn’t scale well
![Page 7: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/7.jpg)
Our Goal
• Easy programming
• Easy to scale, high throughput
![Page 8: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/8.jpg)
Chain Replication
Manager
Replica
Replica
ReplicaReplica
HEAD TAIL
Write Request Read
Request
W1R1W2R2R3
van Renesse & Schneider(OSDI 2004)
W1
R1
R2
W2
R3
![Page 9: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/9.jpg)
Chain Replication
• Strong consistency
• Simple replication
• Increases write throughput
• Low read throughput
• Can we increase throughput?
• Insight:– Most applications are read-heavy (100:1)
![Page 10: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/10.jpg)
CRAQ
• Two states per object – clean and dirty
Replica TAILReplicaReplicaHEAD
Read Request
Read Request
Read Request
Read Request
Read Request
V1 V1 V1 V1 V1
![Page 11: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/11.jpg)
CRAQ
• Two states per object – clean and dirty
• If latest version is clean, return value
• If dirty, contact tail for latest version number
Replica TAILReplicaReplicaHEAD
V1 V1 V1 V1 V1
Write Request
,V2 ,V2 ,V2
Read Request
V1
Read Request 1V1
,V2 V2V2V2V2V2
2V2
![Page 12: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/12.jpg)
Multicast Optimizations
• Each chain forms group
• Tail multicasts ACKs
Replica TAILReplicaReplicaHEAD
V1 V1 V1 V1,V2 ,V2 ,V2 ,V2 V2V2V2V2V2
![Page 13: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/13.jpg)
Multicast Optimizations
• Each chain forms group
• Tail multicasts ACKs
• Head multicasts write data
Replica TAILReplicaReplicaHEAD
V2 V2 V2 V2
Write Request
,V3 ,V3 ,V3 ,V3 V2,V3V3
![Page 14: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/14.jpg)
CRAQ Benefits
• From Chain Replication– Strong consistency– Simple replication– Increases write throughput
• Additional Contributions– Read throughput scales :
• Chain Replication with Apportioned Queries
– Supports Eventual Consistency
![Page 15: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/15.jpg)
High Diversity
• Many data storage systems assume locality– Well connected, low latency
• Real large applications are geo-replicated– To provide low latency– Fault tolerance
(source: Data Center Knowledge)
![Page 16: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/16.jpg)
TAILTAIL
Multi-Datacenter CRAQ
HEAD
HEAD
Replica
Replica
Replica
Replica
Replica
Replica
Replica
Replica
Replica
Replica
TAILTAILReplic
aReplic
a
Replica
Replica
DC1
DC2
DC3
![Page 17: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/17.jpg)
Multi-Datacenter CRAQ
HEAD
HEAD
Replica
Replica
Replica
Replica
Replica
Replica
Replica
Replica
Replica
Replica
TAILTAILReplic
aReplic
a
Replica
Replica
ClientClient
DC1
DC2
DC3
ClientClient
![Page 18: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/18.jpg)
Motivation
1. Popular vs. scarce objects
2. Subset relevance
3. Datacenter diversity
4. Write locality
Solution
1. Specify chain size
2. List datacenters− dc1, dc2, … dcN
3. Separate sizes– dc1, chain_size1, …
4. Specify master
Chain Configuration
![Page 19: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/19.jpg)
Master Datacenter
HEAD
HEAD
Replica
Replica
Replica
Replica
Replica
Replica
Replica
Replica
Replica
Replica
DC1
DC2
HEAD
HEAD
Writer
Writer
TAILTAIL
TAILTAIL
Replica
Replica
Replica
Replica
DC3
![Page 20: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/20.jpg)
Implementation
• Approximately 3,000 lines of C++
• Uses Tame extensions to SFS asynchronousI/O and RPC libraries
• Network operations use Sun RPC interfaces
• Uses Yahoo’s ZooKeeper for coordination
![Page 21: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/21.jpg)
Coordination Using ZooKeeper
• Stores chain metadata
• Monitors/notifies about node membership
CRAQ
CRAQ CRAQCRAQ
CRAQCRAQ
CRAQCRAQ
CRAQCRAQ
CRAQCRAQ
CRAQ
CRAQ
CRAQCRAQ
CRAQCRAQ
DC1
DC3
DC2
ZooKeeper
ZooKeeper
ZooKeeper
ZooKeeper
ZooKeeper
ZooKeeper
![Page 22: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/22.jpg)
Evaluation
• Does CRAQ scale vs. CR?
• How does write rate impact performance?
• Can CRAQ recover from failures?
• How does WAN effect CRAQ?
• Tests use Emulab network emulation testbed
![Page 23: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/23.jpg)
Read Throughput as Writes Increase
1x-
3x-
7x-
![Page 24: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/24.jpg)
Failure Recovery (Read Throughput)
![Page 25: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/25.jpg)
Failure Recovery (Latency)
Time (s) Time (s)
![Page 26: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/26.jpg)
Geo-replicated Read Latency
![Page 27: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/27.jpg)
If Single Object Put/Get Insufficient
• Test-and-Set, Append, Increment– Trivial to implement– Head alone can evaluate
• Multiple object transaction in same chain– Can still be performed easily– Head alone can evaluate
• Multiple chains– An agreement protocol (2PC) can be used– Only heads of chains need to participate– Although degrades performance (use
carefully!)
![Page 28: Jeff Terrace Michael J. Freedman](https://reader035.vdocument.in/reader035/viewer/2022062309/5681505d550346895dbe5d65/html5/thumbnails/28.jpg)
Summary
• CRAQ Contributions?– Challenges trade-off of consistency vs.
throughput
• Provides strong consistency
• Throughput scales linearly for read-mostly
• Support for wide-area deployments of chains
• Provides atomic operations and transactions
ThankYou
Questions?