scaling datastores and the cap theorem

Put Your Thinking

CAP OnTomer Gabel, Wix

JDay Lviv, 2015

Credits

Originally a talk by

Yoav Abrahami (Wix)

Based on “Call Me Maybe” by

Kyle “Aphyr” Kingsbury

Brewer’s CAP Theorem

Partition Tolerance

ConsistencyAvailability

By Example

• I want this book!

– I add it to the cart

– Then continue

browsing

• There’s only one copy

in stock!

By Example

• I want this book!

– I add it to the cart

– Then continue

browsing

• There’s only one copy

in stock!

• … and someone else

just bought it.

Consistency

Consistency: Defined

• In a consistent

system:

All participants

see the same value

at the same time

• “Do you have this

book in stock?”

Consistency: Defined

• If our book store is an

inconsistent system:

– Two customers may

buy the book

– But there’s only one

item in inventory!

• We’ve just violated a

business constraint.

Availability

Availability: Defined

• An available system:

– Is reachable

– Responds to requests

(within SLA)

• Availability does not

guarantee success!

– The operation may fail

– “This book is no longer

available”

Availability: Defined

• What if the system is

unavailable?

– I complete the

checkout

– And click on “Pay”

– And wait

– And wait some more

– And…

• Did I purchase the

book or not?!

Partition

Tolerance

Partition Tolerance: Defined

• Partition: one or

more nodes are

unreachable

• No practical

system runs on a

single node

• So all systems are

susceptible!

A

B

C

D

E

“The Network is Reliable”

• All four happen in an

IP network

• To a client, delays

and drops are the

same

• Perfect failure

detection is provably

impossible1!

A B

drop delay

duplicate reorder

A B

A B A B

time

1 “Impossibility of Distributed Consensus with One Faulty Process”, Fischer, Lynch and Paterson

http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf

Partition Tolerance: Reified

• External causes:– Bad network config

– Faulty equipment

– Scheduled maintenance

• Even software causes partitions:– Bad network config.

– GC pauses

– Overloaded servers

• Plenty of war stories!– Netflix

– Twilio

– GitHub

– Wix :-)

• Some hard numbers1:– 5.2 failed devices/day

– 59K lost packets/day

– Adding redundancy only improves by 40%

1 “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, Gill et al

http://research.microsoft.com/en-us/um/people/navendu/papers/sigcomm11netwiser.pdf

“Proving” CAP

In Pictures

• Let’s consider a simple

system:

– Service A writes values

– Service B reads values

– Values are replicated

between nodes

• These are “ideal”

systems

– Bug-free, predictable

Node 1

V0A

Node 2

V0B

In Pictures

• “Sunny day scenario”:

– A writes a new value V1

– The value is replicated

to node 2

– B reads the new value

Node 1

V0A

Node 2

V0B

V1

V1

V1

V1

In Pictures

• What happens if the

network drops?


– Replication fails

– B still sees the old value

– The system is

inconsistent

Node 1

V0A

Node 2

V0B

V1

V0

V1

In Pictures

• Possible mitigation is

synchronous replication


– Cannot replicate, so write is

rejected

– Both A and B still see V0

– The system is logically

unavailable

Node 1

V0A

Node 2

V0B

V1

What does it all mean?

The network is not reliable

• Distributed systems must handle partitions

• Any modern system runs on >1 nodes…

• … and is therefore distributed

• Ergo, you have to choose:

– Consistency over availability

– Availability over consistency

Granularity

• Real systems comprise many operations

– “Add book to cart”

– “Pay for the book”

• Each has different properties

• It’s a spectrum, not a binary choice!

Consistency Availability

Shopping CartCheckout

CAP IN THE REAL

WORLD

Kyle “Aphyr” Kingsbury

Breaking consistency

guarantees since 2013

PostgreSQL

• Traditional RDBMS

– Transactional

– ACID compliant

• Primarily a CP system

– Writes against a

master node

• “Not a distributed

system”

– Except with a client at

play!

PostgreSQL

• Writes are a simplified

2PC:

– Client votes to commit

– Server validates

transaction

– Server stores changes

– Server acknowledges

commit

– Client receives

acknowledgement

Client Server

Store

PostgreSQL

• But what if the ack is

never received?

• The commit is already

stored…

• … but the client has

no indication!

• The system is in an

inconsistent state

Client Server

Store

?

PostgreSQL

• Let’s experiment!

• 5 clients write to a

PostgreSQL instance

• We then drop the server

from the network

• Results:

– 1000 writes

– 950 acknowledged

– 952 survivors

So what can we do?

1. Accept false-negatives

– May not be acceptable for your use case!

2. Use idempotent operations

3. Apply unique transaction IDs

– Query state after partition is resolved

• These strategies apply to any RDBMS

• A document-oriented database

• Availability/scale via replica sets

– Client writes to a master node

– Master replicates writes to n replicas

• User-selectable consistency guarantees

MongoDB

• When a partition occurs:

– If the master is in the

minority, it is demoted

– The majority promotes a

new master…

– … selected by the highest

optime

MongoDB

• The cluster “heals” after partition resolution:

– The “old” master rejoins the cluster

– Acknowleged minority writes are reverted!

MongoDB

• Let’s experiment!

• Set up a 5-node

MongoDB cluster

• 5 clients write to

the cluster

• We then partition

the cluster

• … and restore it to

see what happens

MongoDB

• With write concern unacknowleged:– Server does not ack

writes (except TCP)

– The default prior to November 2012

• Results:– 6000 writes


– 3319 survivors

– 42% data loss!

MongoDB

• With write concern

acknowleged:

– Server acknowledges

writes (after store)

– The default guarantee

• Results:

– 6000 writes


– 3692 survivors

– 37% data loss!

MongoDB

• With write concern replica acknowleged:– Client specifies

minimum replicas

– Server acks after writes to replicas



– 3768 survivors

– 33% data loss!

MongoDB

• With write concern majority:– For an n-node cluster,

requires at least n/2replicas

– Also called “quorum”



– 5701 survivors

– No data loss

So what can we do?

1. Keep calm and carry on

– As Aphyr puts it, “not all applications need

consistency”

– Have a reliable backup strategy

– … and make sure you drill restores!

2. Use write concern majority

– And take the performance hit

The prime suspects

• Aphyr’s Jepsen tests

include:

– Redis

– Riak

– Zookeeper

– Kafka

– Cassandra

– RabbitMQ

– etcd (and consul)

– ElasticSearch

• If you’re

considering them,

go read his posts

• In fact, go read his

posts regardless

http://aphyr.com/tags/jepsen


STRATEGIES FOR

DISTRIBUTED SYSTEMS

Immutable Data

• Immutable (adj.):

“Unchanging over

time or unable to be

changed.”

• Meaning:

– No deletes

– No updates

– No merge conflicts

– Replication is trivial

Idempotence

• An idempotent

operation:

– Can be applied one or

more times with the

same effect

• Enables retries

• Not always possible

– Side-effects are key

– Consider: payments

Eventual Consistency

• A design which prefers

availability

• … but guarantees that

clients will eventually see

consistent reads

• Consider git:

– Always available locally

– Converges via push/pull

– Human conflict resolution

Eventual Consistency

• The system expects

data to diverge

• … and includes

mechanisms to regain

convergence

– Partial ordering to

minimize conflicts

– A merge function to

resolve conflicts

Vector Clocks

• A technique for partial ordering

• Each node has a logical clock

– The clock increases on every write

– Track the last observed clocks for each item

– Include this vector on replication

• When observed and inbound vectors have

no common ancestor, we have a conflict

• This lets us know when history diverged

CRDTs• Commutative Replicated Data Types1

• A CRDT is a data structure that:

– Eventually converges to a consistent state

– Guarantees no conflicts on replication

1 “A comprehensive study of Convergent and Commutative Replicated Data Types”, Shapiro et al

http://hal.upmc.fr/file/index/docid/555588/filename/techreport.pdf

CRDTs

• CRDTs provide specialized semantics:

– G-Counter: Monotonously increasing counter

– PN-Counter: Also supports decrements

– G-Set: A set that only supports adds

– 2P-Set: Supports removals but only once

• OR-Sets are particularly useful

– Keeps track of both additions and removals

– Can be used for shopping carts

Questions?

Complaints?

WE’RE DONE

HERE!

Thank you for listening

[email protected]

@tomerg

http://il.linkedin.com/in/tomergabel

Aphyr’s “Call Me Maybe” blog posts:


mailto:[email protected]

http://www.twitter.com/tomerg

http://il.linkedin.com/in/tomergabel


scaling datastores and the cap theorem

Software