Amazon’s Key-Value Store: Dynamo


Page 1: Amazon’s Key-Value Store: Dynamo

UCSB CS271 1

AMAZON’S KEY-VALUE STORE: DYNAMO

DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's highly available key-value store. SOSP 2007

Adapted from Amazon’s Dynamo Presentation

Page 2: Amazon’s Key-Value Store: Dynamo

Motivation

• Reliability at a massive scale
• Even the slightest outage has significant financial consequences
• High write availability
• Amazon’s platform: tens of thousands of servers and network components, geographically dispersed
• Provide persistent storage in spite of failures
• Sacrifice consistency to achieve performance, reliability, and scalability

Page 3: Amazon’s Key-Value Store: Dynamo

Dynamo Design rationale

• Most services need only key-based access:
  – Best-seller lists, shopping carts, customer preferences, session management, sales rank, product catalog, and so on.
• The prevalent application design, based on RDBMS technology, would be catastrophic at this scale.
• Dynamo therefore provides a primary-key-only interface.

Page 4: Amazon’s Key-Value Store: Dynamo

Dynamo Design Overview

• Data partitioning using consistent hashing
• Data replication
• Consistency via version vectors
• Replica synchronization via quorum protocol
• Gossip-based failure-detection and membership protocol

Page 5: Amazon’s Key-Value Store: Dynamo

System Requirements

• Data & Query Model:
  – Read/write operations via primary key
  – No relational schema: use <key, value> objects
  – Object size typically < 1 MB
• Consistency guarantees:
  – Weak
  – Only single-key updates
  – Unclear whether read-modify-write sequences are isolated
• Efficiency:
  – SLAs are measured at the 99.9th percentile of operations
• Notes:
  – Commodity hardware
  – Minimal security measures, since the system is for internal use

Page 6: Amazon’s Key-Value Store: Dynamo

Service Level Agreements (SLA)

• For an application to deliver its functionality in bounded time, every dependency in the platform needs to deliver its functionality with even tighter bounds.

• Example SLA: a service guarantees that it will provide a response within 300 ms for 99.9% of its requests at a peak client load of 500 requests per second.
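
A minimal sketch of how the example SLA could be checked against measured latencies. The function names, the nearest-rank percentile definition, and the sample data are assumptions made for illustration; the slides do not say how Amazon measures the percentile. The second data set shows why the tail percentile is checked rather than the mean.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_sla(latencies_ms, bound_ms=300.0, pct=99.9):
    """True if the pct-th percentile latency is within the bound."""
    return percentile(latencies_ms, pct) <= bound_ms


# Hypothetical measurements for 10,000 requests each.
fast_tail = [20.0] * 9_990 + [250.0] * 10   # slowest requests still under 300 ms
slow_tail = [20.0] * 9_980 + [900.0] * 20   # 0.2% of requests take 900 ms

mean_slow = sum(slow_tail) / len(slow_tail)  # the mean looks healthy despite the tail

print(meets_sla(fast_tail), meets_sla(slow_tail), mean_slow)
```

In the second data set the mean stays far below 300 ms while the 99.9th percentile does not, which is the situation a percentile-based SLA is meant to catch.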

Page 7: Amazon’s Key-Value Store: Dynamo

System Interface

• Two basic operations:
  – Get(key):
    • Locates the replicas
    • Returns the object plus a context (encodes metadata, including the version)
  – Put(key, context, object):
    • Writes the replicas to disk
    • Context: version (vector timestamp)
• Hash(key) → 128-bit identifier
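
The shape of this interface can be sketched as a single-process toy. `ToyStore`, `Context`, and `key_id` are hypothetical names, and the class keeps one copy in memory instead of locating replicas and writing them to disk. MD5 is used for the identifier because its digest is exactly 128 bits, and the context is modeled as a vector timestamp stored as a plain dictionary.

```python
import hashlib
from dataclasses import dataclass, field


def key_id(key: bytes) -> int:
    """Map a key to a 128-bit identifier (MD5 digests are 128 bits long)."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")


@dataclass
class Context:
    """Opaque metadata handed back by get() and passed into the next put()."""
    version: dict = field(default_factory=dict)  # vector timestamp: node -> counter


class ToyStore:
    """Single-node stand-in for the replicated store (assumption: no replication)."""

    def __init__(self, node_name: str = "n1"):
        self.node_name = node_name
        self._data = {}  # key_id -> (object, Context)

    def get(self, key: bytes):
        """Return (object, context); an unknown key yields (None, empty context)."""
        return self._data.get(key_id(key), (None, Context()))

    def put(self, key: bytes, context: Context, obj) -> None:
        """Store obj under a version derived from the context the caller read."""
        version = dict(context.version)
        version[self.node_name] = version.get(self.node_name, 0) + 1
        self._data[key_id(key)] = (obj, Context(version))


store = ToyStore()
store.put(b"cart:42", Context(), ["book"])
cart, ctx = store.get(b"cart:42")
store.put(b"cart:42", ctx, cart + ["pen"])   # the context ties the write to the version read
cart, ctx = store.get(b"cart:42")
print(cart, ctx.version)
```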

Page 8: Amazon’s Key-Value Store: Dynamo

Partition Algorithm

• Consistent hashing: the output range of a hash function is treated as a fixed circular space, or “ring”, à la Chord.

• “Virtual nodes”: each physical node can be responsible for more than one virtual node (to deal with non-uniform data and load distribution).
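
A minimal sketch of a consistent-hashing ring with virtual nodes. The `Ring` class, the `node#i` labeling of virtual nodes, and the choice of 8 virtual nodes per physical node are illustrative assumptions, not details taken from the slides.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Hash a label onto the ring; MD5 gives a 128-bit position."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self):
        # Sorted (position, physical_node) pairs, one pair per virtual node.
        self._tokens = []

    def add_node(self, node: str, vnodes: int = 8) -> None:
        """Place `vnodes` virtual nodes for one physical node on the ring."""
        for i in range(vnodes):
            bisect.insort(self._tokens, (ring_position(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        """Return the physical node of the first virtual node clockwise from the key."""
        # (position,) sorts before any (position, node) pair, so bisect_left
        # finds the first virtual node at or after the key; the modulo wraps
        # around the end of the ring.
        i = bisect.bisect_left(self._tokens, (ring_position(key),)) % len(self._tokens)
        return self._tokens[i][1]


ring = Ring()
for name in ("A", "B", "C"):
    ring.add_node(name)

print(ring.owner("cart:42"))
```

A node with more capacity could simply be added with a larger `vnodes` value.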

Page 9: Amazon’s Key-Value Store: Dynamo

Virtual Nodes

Page 10: Amazon’s Key-Value Store: Dynamo

Advantages of using virtual nodes

• The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.

• A real node’s load is distributed across the ring, so a hot spot does not land on a single node.

• If a node becomes unavailable, the load it handled is dispersed evenly across the remaining available nodes.

• When a node becomes available again, it accepts a roughly equivalent amount of load from each of the other available nodes.
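
The dispersal property can be illustrated with a small experiment, offered as a sketch under assumptions: four hypothetical nodes A to D with 64 virtual nodes each, MD5 ring positions, and 2,000 synthetic keys. Removing D moves only the keys D owned, and those keys land on several of the remaining nodes rather than on a single successor.

```python
import bisect
import hashlib
from collections import Counter


def position(label: str) -> int:
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


def build_ring(vnodes_per_node):
    """vnodes_per_node maps a physical node to its number of virtual nodes."""
    return sorted((position(f"{node}#{i}"), node)
                  for node, count in vnodes_per_node.items()
                  for i in range(count))


def owner(ring, key: str) -> str:
    """Physical node of the first virtual node clockwise from the key."""
    i = bisect.bisect_left(ring, (position(key),)) % len(ring)
    return ring[i][1]


keys = [f"key-{i}" for i in range(2000)]
full = build_ring({"A": 64, "B": 64, "C": 64, "D": 64})
without_d = build_ring({"A": 64, "B": 64, "C": 64})   # D has become unavailable

before = {k: owner(full, k) for k in keys}
after = {k: owner(without_d, k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
receivers = Counter(after[k] for k in moved)   # who picked up D's keys

print(len(moved), dict(receivers))
```

Giving a node a larger virtual-node count in `build_ring` is one way to reflect a larger capacity.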

Page 11: Amazon’s Key-Value Store: Dynamo

Replication

• Each data item is replicated at N hosts.

• Preference list: the list of nodes responsible for storing a particular key.

• Some fine-tuning to account for virtual nodes

Page 12: Amazon’s Key-Value Store: Dynamo

Replication

Page 13: Amazon’s Key-Value Store: Dynamo

Replication

Page 14: Amazon’s Key-Value Store: Dynamo

Preference Lists

• List of nodes responsible for storing a particular key.

• To account for failures, the preference list contains more than N nodes.

• Because of virtual nodes, the preference list skips ring positions so that it contains only distinct physical nodes.
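
Building a preference list can be sketched as a clockwise walk that skips virtual nodes whose physical node has already been chosen. The ring layout, the `node#i` labels, and the function names are assumptions for illustration only.

```python
import bisect
import hashlib


def position(label: str) -> int:
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


def build_ring(nodes, vnodes=8):
    """Sorted (position, physical_node) pairs, `vnodes` entries per physical node."""
    return sorted((position(f"{node}#{i}"), node)
                  for node in nodes for i in range(vnodes))


def preference_list(ring, key: str, n: int):
    """Walk clockwise from the key and collect up to n distinct physical nodes."""
    start = bisect.bisect_left(ring, (position(key),))
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:          # skip positions owned by an already chosen node
            chosen.append(node)
            if len(chosen) == n:
                break
    return chosen


ring = build_ring(["A", "B", "C", "D"])
prefs = preference_list(ring, "cart:42", 3)        # N = 3 replicas
extended = preference_list(ring, "cart:42", 4)     # extra entry to fall back on after a failure

print(prefs, extended)
```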

Page 15: Amazon’s Key-Value Store: Dynamo

Data Versioning

• A put() call may return to its caller before the update has been applied at all the replicas.

• A get() call may return many versions of the same object.

• Challenge: an object may have distinct versions.

• Solution: use vector clocks to capture causality between different versions of the same object.

Page 16: Amazon’s Key-Value Store: Dynamo

Vector Clock

• A vector clock is a list of (node, counter) pairs.

• Every version of every object is associated with one vector clock.

• If all of the counters in the first object’s clock are less than or equal to the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten.

• Otherwise the versions are divergent: the application reconciles them and collapses them into a single new version.
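
The ancestor test and the reconciliation step can be sketched as follows. Clocks are modeled as dictionaries from node name to counter, and the helper names (`descends`, `increment`, `merge`, `prune_ancestors`) are assumptions for illustration. The sequence of versions follows the well-known example from the Dynamo paper, with nodes Sx, Sy, and Sz.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a is equal to or a descendant of clock b (b is an ancestor of a)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the given node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum, used when divergent versions are reconciled."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def prune_ancestors(clocks):
    """Keep only clocks that are not strict ancestors of another clock in the list."""
    return [c for c in clocks
            if not any(other != c and descends(other, c) for other in clocks)]


d1 = increment({}, "Sx")                 # written via Sx
d2 = increment(d1, "Sx")                 # overwritten via Sx
d3 = increment(d2, "Sy")                 # updated via Sy ...
d4 = increment(d2, "Sz")                 # ... and concurrently via Sz
d5 = increment(merge(d3, d4), "Sx")      # reconciled by the client, written via Sx

print(prune_ancestors([d1, d2, d3, d4]), d5)
```

Here `d1` and `d2` are ancestors and are dropped, while `d3` and `d4` are concurrent, so both survive until the application reconciles them into `d5`.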

Page 17: Amazon’s Key-Value Store: Dynamo

Vector clock example

Page 18: Amazon’s Key-Value Store: Dynamo

Routing requests

• Route requests through a generic load balancer that selects a node based on load information, or

• Use a partition-aware client library that routes requests directly to the relevant node.

• A gossip protocol propagates membership changes: every second, each node contacts a peer chosen at random, and the two nodes reconcile their membership change histories.
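
A toy simulation of the gossip step, under stated assumptions: each node's membership history is reduced to a view mapping a member to a `(version, status)` pair, reconciliation keeps the entry with the higher version, and one loop iteration stands in for one second. The actual format of Dynamo's membership histories is not given in the slides.

```python
import random


def reconcile(view_a: dict, view_b: dict) -> None:
    """Merge two membership views in place; the entry with the higher version wins."""
    for member in view_a.keys() | view_b.keys():
        entry_a = view_a.get(member, (0, "unknown"))
        entry_b = view_b.get(member, (0, "unknown"))
        newest = entry_a if entry_a[0] >= entry_b[0] else entry_b
        view_a[member] = view_b[member] = newest


def gossip_round(views: dict, rng: random.Random) -> None:
    """Each node contacts one peer chosen at random and the two reconcile."""
    for node in list(views):
        peer = rng.choice([n for n in views if n != node])
        reconcile(views[node], views[peer])


rng = random.Random(7)                      # fixed seed so the run is repeatable
nodes = ["A", "B", "C", "D", "E", "F"]
views = {n: {m: (1, "up") for m in nodes} for n in nodes}
views["A"]["G"] = (1, "up")                 # A is told that G joined
views["D"]["B"] = (2, "removed")            # D is told that B was removed

rounds = 0
while any(v != views["A"] for v in views.values()) and rounds < 1000:
    gossip_round(views, rng)
    rounds += 1

print(rounds, views["F"])
```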

Page 19: Amazon’s Key-Value Store: Dynamo

Sloppy Quorum

• R and W are the minimum numbers of nodes that must participate in a successful read and write operation, respectively.

• Setting R + W > N yields a quorum-like system.

• In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency and availability.
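
The two claims above can be checked with a few lines of arithmetic. The function names and the latency figures are assumptions for illustration. The model is that a coordinator returns as soon as the required number of replicas have answered, so the operation's latency is the R-th (or W-th) fastest response.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum must share at least one replica with every write quorum."""
    return r + w > n


def operation_latency(replica_latencies_ms, quorum: int) -> float:
    """Latency when the coordinator waits for `quorum` responses: the quorum-th fastest one."""
    return sorted(replica_latencies_ms)[quorum - 1]


latencies = [12.0, 15.0, 140.0]                    # hypothetical response times, N = 3

print(quorums_overlap(3, 2, 2))                    # overlapping quorums with R, W < N
print(operation_latency(latencies, 2))             # R = 2: the slow replica is not waited for
print(operation_latency(latencies, 3))             # R = N: the slowest replica dictates latency
```

With R = 2 the straggler at 140 ms does not affect the read, which is the latency argument for keeping R and W below N.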

Page 20: Amazon’s Key-Value Store: Dynamo

Highlights of Dynamo

• High write availability
• Optimistic: vector clocks for resolution
• Consistent hashing (Chord) in a controlled environment
• Quorums for relaxed consistency

Page 21: Amazon’s Key-Value Store: Dynamo

CASSANDRA (FACEBOOK)

Lakshman and Malik: Cassandra: A Decentralized Structured Storage System. LADIS 2009

Page 22: Amazon’s Key-Value Store: Dynamo

Data Model

• Key-value store, but more like Bigtable.

• Basically, a distributed multi-dimensional map indexed by a key.

• The value is structured into columns, which are grouped into column families: simple and super (a column family within a column family).

• An operation is atomic on a single row.

• API: insert, get, and delete.
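
The "multi-dimensional map" can be pictured as nested dictionaries. This is a sketch only: `ToyTable` and the argument lists of `insert`, `get`, and `delete` are hypothetical and do not reproduce Cassandra's actual API signatures. It models simple column families; a super column family would add one more level of nesting.

```python
from collections import defaultdict


class ToyTable:
    """In-memory model: row key -> column family -> column name -> (value, timestamp)."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def insert(self, key, family, column, value, timestamp):
        self._rows[key][family][column] = (value, timestamp)

    def get(self, key, family, column):
        """Return the stored value, or None if the row, family, or column is absent."""
        entry = self._rows.get(key, {}).get(family, {}).get(column)
        return None if entry is None else entry[0]

    def delete(self, key, family, column):
        self._rows.get(key, {}).get(family, {}).pop(column, None)


table = ToyTable()
table.insert("user:7", "profile", "name", "Ada", timestamp=1)
table.insert("user:7", "profile", "city", "London", timestamp=1)
table.insert("user:7", "inbox", "msg-1", "hello", timestamp=2)
table.delete("user:7", "profile", "city")

print(table.get("user:7", "profile", "name"), table.get("user:7", "profile", "city"))
```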

Page 23: Amazon’s Key-Value Store: Dynamo

System Architecture

• Like Dynamo (and Chord).

• Uses an order-preserving hash function on a fixed circular space. The node responsible for a key is called the coordinator.

• Non-uniform data distribution: keep track of the data distribution and reorganize if necessary.

Page 24: Amazon’s Key-Value Store: Dynamo

Replication

• Each item is replicated at N hosts.

• Replica placement can be: Rack Unaware; Rack Aware (within a data center); Datacenter Aware.

• The system has an elected leader.

• When a node joins the system, the leader assigns it a range of data items and replicas.

• Each node is aware of every other node in the system and the range each is responsible for.

Page 25: Amazon’s Key-Value Store: Dynamo

Membership and Failure Detection

• Gossip-based mechanism to maintain cluster membership.

• A node determines which nodes are up and down using a failure detector.

• The Φ accrual failure detector returns a suspicion level, Φ, for each monitored node.

• If a node suspects A when Φ = 1, 2, or 3, the likelihood of a mistake is about 10%, 1%, and 0.1%, respectively.

• Every node maintains a sliding window of inter-arrival times of gossip messages from other nodes, uses it to estimate the distribution of inter-arrival times, and then calculates Φ. The distribution is approximated by an exponential distribution.
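
A sketch of the Φ calculation under the exponential approximation. With mean inter-arrival time m, the probability that the next message arrives later than t seconds after the last one is exp(-t/m), so Φ = -log10(exp(-t/m)) = t / (m · ln 10). The class name, window size, and timestamps are assumptions for illustration.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window_size: int = 1000):
        self._intervals = deque(maxlen=window_size)   # sliding window of inter-arrival times
        self._last_arrival = None

    def heartbeat(self, now: float) -> None:
        """Record the arrival of a gossip message at time `now` (seconds)."""
        if self._last_arrival is not None:
            self._intervals.append(now - self._last_arrival)
        self._last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of the probability that a message is merely late."""
        if not self._intervals:
            return 0.0
        mean = sum(self._intervals) / len(self._intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self._last_arrival
        # Closed form of -log10(exp(-elapsed / mean)); avoids underflow for long silences.
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):          # one message per second
    detector.heartbeat(t)

phi_1 = detector.phi(3.0 + math.log(10))        # silence after which Φ is about 1
phi_2 = detector.phi(3.0 + 2 * math.log(10))    # silence after which Φ is about 2
print(phi_1, 10 ** -phi_1, phi_2, 10 ** -phi_2)
```

The quantity `10 ** -phi` is the modeled chance that suspecting the node is a mistake, which matches the 10% and 1% figures for Φ = 1 and Φ = 2.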

Page 26: Amazon’s Key-Value Store: Dynamo

Operations

• Use quorums: R and W.

• If R + W > N, a read will return the latest value.
  – Read operations return the value with the highest timestamp among the replicas contacted, so with smaller quorums they may return older versions.
  – Read repair: with every read, send the newest version to any out-of-date replicas.
  – Anti-entropy: compute a Merkle tree to catch any out-of-sync data (expensive).

• Each write goes first into a persistent commit log, then into an in-memory data structure.
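
The write path and read repair can be sketched together in a toy model. `ToyReplica`, the JSON log format, and the rule that a read contacts the first R replicas are assumptions for illustration, not Cassandra's implementation. Each write is appended to a log file and synced before the in-memory map is updated, and a read pushes the newest version to any contacted replica that is behind.

```python
import json
import os
import tempfile


class ToyReplica:
    def __init__(self, log_path: str):
        self._log = open(log_path, "a", encoding="utf-8")
        self.in_memory = {}   # key -> (timestamp, value)

    def write(self, key, value, timestamp) -> None:
        # 1. Append to the persistent commit log and force it to disk.
        self._log.write(json.dumps([key, timestamp, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        # 2. Only then update the in-memory structure (highest timestamp wins).
        current = self.in_memory.get(key)
        if current is None or timestamp > current[0]:
            self.in_memory[key] = (timestamp, value)

    def close(self) -> None:
        self._log.close()


def read_with_repair(replicas, key, r: int):
    """Read from the first r replicas, return the newest value, repair stale replicas."""
    contacted = replicas[:r]
    found = [rep.in_memory[key] for rep in contacted if key in rep.in_memory]
    if not found:
        return None
    newest = max(found, key=lambda entry: entry[0])
    for rep in contacted:
        if rep.in_memory.get(key, (float("-inf"), None))[0] < newest[0]:
            rep.write(key, newest[1], newest[0])   # read repair
    return newest[1]


with tempfile.TemporaryDirectory() as tmp:
    replicas = [ToyReplica(os.path.join(tmp, f"replica{i}.log")) for i in range(3)]
    for rep in replicas:
        rep.write("cart", "book", timestamp=1)
    replicas[1].write("cart", "book+pen", timestamp=2)   # W = 1: one replica has the update

    stale = read_with_repair(replicas, "cart", r=1)      # R + W = 2, not > N: may be stale
    fresh = read_with_repair(replicas, "cart", r=3)      # R + W = 4 > N: sees the update
    final = [rep.in_memory["cart"] for rep in replicas]
    for rep in replicas:
        rep.close()

print(stale, fresh, final)
```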
