
PBS: Probabilistically Bounded Staleness
Peter Bailis (@pbailis), Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica
UC Berkeley
@ Twitter, 6.22.12

1. Fast 2. Scalable 3. Available
solution: replicate, for 1. request capacity 2. reliability

keep replicas in sync → slow
alternative: sync later → inconsistent

⇧ consistency, ⇧ latency: contact more replicas, read more recent data
⇩ consistency, ⇩ latency: contact fewer replicas, read less recent data

eventual consistency: “if no new updates are made to the object, eventually all accesses will return the last updated value” (W. Vogels, CACM 2008)

How eventual? How long do I have to wait?
How consistent? What happens if I don’t wait?

problem: no guarantees with eventual consistency
solution: consistency prediction
technique: measure latencies, use WARS model

PBS

Dynamo: Amazon’s Highly Available Key-value Store (SOSP 2007)
Dynamo-style systems: Cassandra (Apache, DataStax), Riak, Project Voldemort
Used by: Adobe, Cisco, Digg, Gowalla, IBM, Morningstar, Netflix, Palantir, Rackspace, Reddit, Rhapsody, Shazam, Spotify, Soundcloud, Twitter, Mozilla, Ask.com, Yammer, Aol, GitHub, Joyent Cloud, Best Buy, LinkedIn, Boeing, Comcast, Gilt Groupe

N = 3 replicas. A client sends read(“key”) to a coordinator, which forwards it to replicas R1, R2, R3, each holding (“key”, 1).
With R=3, the coordinator waits for all three replicas to reply with (“key”, 1) before returning (“key”, 1) to the client.
With R=1, the coordinator still sends the read to all replicas but returns (“key”, 1) to the client as soon as the first reply arrives.

N replicas per key; read: wait for R replies; write: wait for W acks
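To make the R/W mechanics concrete, here is a minimal in-memory sketch (the Replica/Coordinator classes and last-write-wins timestamps are invented for illustration; a real Dynamo-style store sends every request to all N replicas over the network and returns after the first W acks or R replies):

```python
import random, time

class Replica:
    """One replica: a last-write-wins versioned key-value map."""
    def __init__(self):
        self.store = {}                                  # key -> (timestamp, value)

    def write(self, key, ts, value):
        if ts > self.store.get(key, (0, None))[0]:
            self.store[key] = (ts, value)

    def read(self, key):
        return self.store.get(key, (0, None))

class Coordinator:
    """Partial-quorum coordinator: wait for W acks on writes, R replies on reads."""
    def __init__(self, N=3, R=1, W=1):
        self.replicas = [Replica() for _ in range(N)]
        self.R, self.W = R, W

    def put(self, key, value):
        ts = time.time()
        # Stand-in for "send to all, return after W acks": only W replicas are
        # updated here, leaving the rest momentarily stale as if still in flight.
        for rep in random.sample(self.replicas, self.W):
            rep.write(key, ts, value)

    def get(self, key):
        # Wait for R replies and return the newest version among them.
        replies = [rep.read(key) for rep in random.sample(self.replicas, self.R)]
        return max(replies)[1]
```

With N=3 and R=W=1 (the Cassandra default quoted below), get() can return the old value until the write reaches the remaining replicas; with R + W > N the read and write sets must overlap, so it cannot.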

Write example, W=1: the coordinator sends write(“key”, 2) to R1, R2, R3 (initially all holding (“key”, 1)) and acks the client as soon as the first replica acknowledges (“key”, 2).
Meanwhile a second coordinator issues read(“key”) with R=1. If the first reply comes from a replica that still holds (“key”, 1), the client reads the stale value (“key”, 1).
Once the write reaches the remaining replicas, an R=1 read returns (“key”, 2).

if R + W > N: “strong” consistency
else: eventual consistency
R + W ≤ N: lower latency
Cassandra: R=W=1, N=3 by default (1+1 ≯ 3)

"In the general case, we typically use [Cassandra’s] consistency level of [R=W=1], which provides maximum performance. Nice!" --D. Williams, “HBase vs Cassandra: why we moved,” February 2010
http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

“Consistency or Bust: Breaking a Riak Cluster” (NoSQL Primer, July 31, 2011):
Low Value Data: n = 2, r = 1, w = 1
Mission Critical Data: n = 5, r = 1, w = 5, dw = 5
http://www.slideshare.net/Jkirkell/breaking-a-riak-cluster

Voldemort @ LinkedIn (Alex Feinberg, personal communication):
“very low latency and high availability”: R=W=1, N=3
“some consistency,” N=3 not required: R=W=1, N=2

Anecdotally, eventual consistency is “worthwhile” for many kinds of data.
How eventual? How consistent?
“eventual and consistent enough”

Can we do better?
Probabilistically Bounded Staleness: can’t make promises, but can give expectations.
PBS is: a way to quantify latency-consistency trade-offs (what’s the latency cost of consistency? what’s the consistency cost of latency?), and an “SLA” for consistency.
t-visibility: probability p of consistent reads after t seconds
(e.g., 99.9% of reads will be consistent after 10 ms)

How eventual?
t-visibility depends on: 1) message delays 2) background version exchange (anti-entropy)
anti-entropy: only decreases staleness; comes in many flavors; hard to guarantee rate → focus on message delays
focus on steady state; with failures: unavailable or sloppy

Per-replica timeline (Coordinator and Replica, once per replica): the coordinator sends the write, the replica acks, and the coordinator waits for W responses. t seconds elapse. The coordinator then sends a read, the replica responds, and the coordinator waits for R responses. The response is stale if the read arrives at the replica before the write does.

Timeline example (N=2, W=1, R=1), good case: the write goes to both replicas and one ack satisfies W=1. When the read is issued, its first response comes from a replica that has already applied the write, so the client sees the new value.

Timeline example (N=2, W=1, R=1), bad case: the write to the second replica is still in flight when the read reaches it. With R=1 the coordinator returns that first response, so the client reads a stale value.

Recap (Coordinator and Replica, once per replica): write, ack, wait for W responses; t seconds elapse; read, response, wait for R responses; the response is stale if the read arrives before the write.
In the earlier W=1, R=1 example: R3 replied to the read before the last write(“key”, 2) arrived, so the client read (“key”, 1) even though the write had already been acknowledged.

Each message delay in the timeline gets a label: (W) the write reaching a replica, (A) the replica’s ack, (R) the read request reaching a replica, (S) the replica’s read response. Together with “wait for W responses,” “t seconds elapse,” and “wait for R responses,” these four delay distributions make up the WARS model; a response is stale if the read arrives before the write.
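Putting the labels together, a sketch of the condition the model checks (a restatement in the notation of the W/A/R/S delays above, not an equation shown in the talk): the write commits once W of the (write + ack) round trips finish, and a replica’s reply is fresh only if the write reached it before the read did.

```latex
t_{\mathrm{commit}} = W\text{-th smallest of } \{\, W_i + A_i : i = 1, \dots, N \,\}
\qquad
\Pr[\text{consistent at } t]
  = \Pr\bigl[\exists\, i \text{ among the first } R \text{ responders (ordered by } R_i + S_i\text{)}:\;
             W_i \le t_{\mathrm{commit}} + t + R_i \bigr]
```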

Solving WARS analytically: hard. Monte Carlo methods: easier.
To use WARS: gather latency data, then run simulations (Monte Carlo sampling).
Sample latencies (ms):
W: 53.2, 44.5, 101.1, ...
A: 10.3, 8.2, 11.3, ...
R: 15.3, 22.4, 19.8, ...
S: 9.6, 14.2, 6.7, ...
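A minimal sketch of that workflow in Python (the function name and the N=3, R=W=1 defaults are illustrative, and the sample lists below just reuse the handful of values from the table above rather than real traces):

```python
import random

def simulate_t_visibility(w_samples, a_samples, r_samples, s_samples,
                          N=3, R=1, W=1, t=0.0, trials=100_000):
    """Estimate Pr[a read issued t ms after a write commits returns that write]."""
    consistent = 0
    for _ in range(trials):
        # Draw one (write, ack, read, response) delay per replica.
        w = [random.choice(w_samples) for _ in range(N)]
        a = [random.choice(a_samples) for _ in range(N)]
        r = [random.choice(r_samples) for _ in range(N)]
        s = [random.choice(s_samples) for _ in range(N)]

        # The write "commits" once W replicas have acked (W-th fastest write+ack).
        commit = sorted(w[i] + a[i] for i in range(N))[W - 1]

        # The read is issued t ms after commit; the coordinator returns after the
        # R fastest responses. Replica i's response is fresh iff the write reached
        # it before the read request did.
        responders = sorted(range(N), key=lambda i: r[i] + s[i])[:R]
        if any(w[i] <= commit + t + r[i] for i in responders):
            consistent += 1
    return consistent / trials

# Tiny example using the (illustrative) samples from the table above:
W_samples = [53.2, 44.5, 101.1]
A_samples = [10.3, 8.2, 11.3]
R_samples = [15.3, 22.4, 19.8]
S_samples = [9.6, 14.2, 6.7]
print(simulate_t_visibility(W_samples, A_samples, R_samples, S_samples, t=10.0))
```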

How eventual? key: WARS model; need: latencies.
t-visibility: consistent reads with probability p after t seconds.
How consistent? What happens if I don’t wait?

The probability of reading data older than k versions shrinks exponentially in k:
Pr(reading latest write) = 99%; Pr(reading one of last two writes) = 99.9%; Pr(reading one of last three writes) = 99.99%.

WARS simulation accuracy (Cassandra cluster, injected latencies): t-staleness RMSE: 0.28%; latency N-RMSE: 0.48%.

Yammer (100K+ companies) uses Riak. LinkedIn (150M+ users) built and uses Voldemort.
Production latencies fit Gaussian mixtures.

LNKD-DISK, N=3:
99.9% consistent reads: R=2, W=1; t = 13.6 ms; latency: 12.53 ms
100% consistent reads: R=3, W=1; latency: 15.01 ms
(latency is combined read and write latency at the 99.9th percentile)
16.5% faster. Worthwhile?

LNKD-SSD, N=3:
99.9% consistent reads: R=1, W=1; t = 1.85 ms; latency: 1.32 ms
100% consistent reads: R=3, W=1; latency: 4.20 ms
(latency is combined read and write latency at the 99.9th percentile)
59.5% faster.

Recap of the WARS timeline: the write-propagation delay (W) is the critical factor in staleness.

[Figure: CDFs of write latency (ms, log scale, 10⁻² to 10³) for W=1, W=2, W=3 under the LNKD-SSD, LNKD-DISK, YMMR, and WAN latency distributions; N=3.]

Recap of the WARS timeline: SSDs reduce variance compared to disks.

YMMR, N=3:
99.9% consistent reads: R=1, W=1; t = 202.0 ms; latency: 43.3 ms
100% consistent reads: R=3, W=1; latency: 230.06 ms
(latency is combined read and write latency at the 99.9th percentile)
81.1% faster.

Workflow: 1. Tracing 2. Simulation 3. Tune N, R, W 4. Profit
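As a sketch of steps 2 and 3, one could sweep quorum sizes with the simulate_t_visibility helper sketched earlier and keep the configurations that meet a chosen SLA (the 99.9%-at-10 ms target here is just the example figure from the talk, not a recommendation):

```python
def tune(w_s, a_s, r_s, s_s, N=3, t=10.0, target=0.999):
    """Return (R, W, estimated p) for every quorum choice that meets the SLA."""
    meets_sla = []
    for R in range(1, N + 1):
        for W in range(1, N + 1):
            p = simulate_t_visibility(w_s, a_s, r_s, s_s, N=N, R=R, W=W, t=t)
            if p >= target:
                meets_sla.append((R, W, p))
    return meets_sla  # pick the cheapest of these using measured read/write latency

print(tune(W_samples, A_samples, R_samples, S_samples))
```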

problem: no guarantees with eventual consistency
solution: consistency prediction
technique: measure latencies, use WARS model
PBS: consistency is a metric, to measure and to predict

R + W > N: strong consistency; R + W ≤ N: lower latency.
PBS: latency vs. consistency trade-offs; simple modeling with WARS; model staleness in time and versions.
Eventual consistency: often fast, often consistent. PBS helps explain when and why.
Demo: pbs.cs.berkeley.edu/#demo

@pbailis

Extra Slides

Related Work:
Quorum system theory, e.g., probabilistic quorums, k-quorums
Deterministic staleness, e.g., TACT/conits, FRACS
Consistency verification, e.g., Golab et al. (PODC ’11), Bermbach and Tai (M4WSOC ’11)

PBS and apps: tolerating staleness requires either
staleness-tolerant data structures: timelines, logs (cf. commutative data structures, logical monotonicity), or
asynchronous compensation code: detect violations after data is returned (see paper); cf. “Building on Quicksand” (memories, guesses, apologies); write code to fix any errors; minimize (compensation cost) × (# of expected anomalies).

Read only newer data? (monotonic reads session guarantee)
For a given key, the tolerable staleness in versions ≈ global write rate ÷ client’s read rate.

Failures? Treat failures as latency spikes. How long do partitions last?
Over what time interval? 99.9% uptime/yr ⇒ 8.76 hours downtime/yr; 8.76 consecutive hours down ⇒ a bad 8-hour rolling average.
Either hide failures in the tail of the distribution, or continuously evaluate the SLA and adjust.

[Figure: CDFs of write latency (ms, log scale) for W=1, W=2, W=3 and of read latency for R=3 under LNKD-SSD, LNKD-DISK, YMMR, and WAN; N=3. LNKD-SSD and LNKD-DISK are identical for reads.]

Probabilistic quorum systems:
p(inconsistent) = C(N−W, R) / C(N, R)
How consistent? k-staleness: probability p of reading one of the last k versions:
p = 1 − [C(N−W, R) / C(N, R)]^k
Closed-form solution; static quorum choice.
⟨k,t⟩-staleness: versions and time; approximation: exponentiate t-staleness by k.
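A quick numerical check of the closed form (a sketch; math.comb requires Python 3.8+, and the parameter choices below are just examples):

```python
from math import comb

def p_consistent_k(N, R, W, k=1):
    """Probability of reading one of the last k versions under the
    probabilistic-quorum model: 1 - [C(N-W, R) / C(N, R)]**k."""
    p_miss = comb(N - W, R) / comb(N, R)
    return 1 - p_miss ** k

print(p_consistent_k(N=3, R=1, W=1, k=1))  # 1 - 2/3        ≈ 0.333
print(p_consistent_k(N=3, R=1, W=1, k=3))  # 1 - (2/3)**3   ≈ 0.704
print(p_consistent_k(N=3, R=2, W=2, k=1))  # quorums overlap: 1.0
```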

“Strong” consistency: reads return the last written value or newer (defined w.r.t. real time, when the read started).
Write to W, read from R replicas (N = 3 replicas: R1, R2, R3):
R=W=3: quorums {R1, R2, R3}
R=W=2: quorums {R1, R2}, {R2, R3}, {R1, R3} (quorum system: guaranteed intersection)
R=W=1: quorums {R1}, {R2}, {R3} (partial quorum system: may not intersect)
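A small check of the intersection property (a sketch, not from the talk): enumerate every possible read set and write set of the given sizes and test for overlap.

```python
from itertools import combinations

def always_intersect(N, R, W):
    """True iff every size-R read set shares a replica with every size-W write set."""
    return all(set(reads) & set(writes)
               for reads in combinations(range(N), R)
               for writes in combinations(range(N), W))

print(always_intersect(3, 2, 2))  # True:  R + W > N, quorum system
print(always_intersect(3, 1, 1))  # False: R + W <= N, partial quorum
```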

Synthetic, exponential distributions (N=3, W=1, R=1): vary the write-message distribution W relative to A, R, S, e.g., W at 1/4× ARS and W at 10× ARS.

Concurrent writes: deterministically choose a winner (e.g., a coordinator reading with R=2 and receiving both (“key”, 1) and (“key”, 2)).
