Scaling Datastores and the CAP Theorem

Post on 02-Jul-2015


DESCRIPTION

A talk given at BuildStuff 2014 in Vilnius, Lithuania; originally developed by Yoav Abrahami, and based on the works of Kyle "Aphyr" Kingsbury. Original abstract: On Friday, 4th June 1976, the Sex Pistols kicked off their first gig, a gig considered to have changed western music culture forever, pioneering the genesis of punk rock. Wednesday, 19th July 2000 had a similar impact on internet-scale companies as the Sex Pistols did on music, with the keynote speech by Eric Brewer at the ACM Symposium on Principles of Distributed Computing (PODC). Eric Brewer claimed that as applications become more web-based, we should stop worrying about data consistency: if we want high availability in these new distributed applications, then we cannot have data consistency. Two years later, in 2002, Seth Gilbert and Nancy Lynch formally proved Brewer's claim as what is known today as Brewer's Theorem, or CAP. The CAP theorem states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability and Partition tolerance. In the database ecosystem, many tools claim to solve our data persistence problems while scaling out, offering different capabilities (document stores, key/value stores, SQL, graph, etc.). In this talk we will explore the CAP theorem: we will define Consistency, Availability and Partition Tolerance; we will explore what CAP means for our applications (ACID vs. BASE); and we will explore practical applications on MySQL with a read slave, MongoDB and Riak, based on the work of Kyle "Aphyr" Kingsbury.

TRANSCRIPT

Put Your Thinking CAP On

Tomer Gabel, Wix

JDay Lviv, 2015

Credits

• Originally a talk by Yoav Abrahami (Wix)

• Based on “Call Me Maybe” by Kyle “Aphyr” Kingsbury

Brewer’s CAP Theorem

[Diagram: Consistency, Availability and Partition Tolerance]

By Example

• I want this book!
– I add it to the cart
– Then continue browsing

• There’s only one copy in stock!

• … and someone else just bought it.

Consistency

Consistency: Defined

• In a consistent system: all participants see the same value at the same time

• “Do you have this book in stock?”

Consistency: Defined

• If our book store is an inconsistent system:
– Two customers may buy the book
– But there’s only one item in inventory!

• We’ve just violated a business constraint.

Availability

Availability: Defined

• An available system:
– Is reachable
– Responds to requests (within SLA)

• Availability does not guarantee success!
– The operation may fail
– “This book is no longer available”

Availability: Defined

• What if the system is unavailable?
– I complete the checkout
– And click on “Pay”
– And wait
– And wait some more
– And…

• Did I purchase the book or not?!

Partition Tolerance

Partition Tolerance: Defined

• Partition: one or more nodes are unreachable

• No practical system runs on a single node

• So all systems are susceptible!

[Diagram: five nodes, A through E, split by a partition]

“The Network is Reliable”

• All four happen in an IP network: drop, delay, duplicate, reorder

• To a client, delays and drops are the same

• Perfect failure detection is provably impossible1!

[Diagram: messages between nodes A and B being dropped, delayed, duplicated and reordered over time]

1 “Impossibility of Distributed Consensus with One Faulty Process”, Fischer, Lynch and Paterson

Partition Tolerance: Reified

• External causes:
– Bad network config
– Faulty equipment
– Scheduled maintenance

• Even software causes partitions:
– Bad network config
– GC pauses
– Overloaded servers

• Plenty of war stories!
– Netflix
– Twilio
– GitHub
– Wix :-)

• Some hard numbers1:
– 5.2 failed devices/day
– 59K lost packets/day
– Adding redundancy only improves by 40%

1 “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, Gill et al

“Proving” CAP

In Pictures

• Let’s consider a simple system:
– Service A writes values
– Service B reads values
– Values are replicated between nodes

• These are “ideal” systems
– Bug-free, predictable

[Diagram: node 1 (value V0, service A) and node 2 (value V0, service B)]

In Pictures

• “Sunny day scenario”:
– A writes a new value V1
– The value is replicated to node 2
– B reads the new value

[Diagram: A writes V1 to node 1, node 1 replicates V1 to node 2, B reads V1]

In Pictures

• What happens if the network drops?
– A writes a new value V1
– Replication fails
– B still sees the old value
– The system is inconsistent

[Diagram: A writes V1 to node 1, replication to node 2 fails, B still reads V0]

In Pictures

• Possible mitigation is synchronous replication
– A writes a new value V1
– Cannot replicate, so the write is rejected
– Both A and B still see V0
– The system is logically unavailable

[Diagram: the write of V1 is rejected; both nodes keep V0]

What does it all mean?

The network is not reliable

• Distributed systems must handle partitions

• Any modern system runs on more than one node…

• … and is therefore distributed

• Ergo, you have to choose:
– Consistency over availability
– Availability over consistency

Granularity

• Real systems comprise many operations
– “Add book to cart”
– “Pay for the book”

• Each has different properties

• It’s a spectrum, not a binary choice!

[Diagram: a spectrum from Consistency to Availability, with Checkout toward the consistency end and Shopping Cart toward the availability end]

CAP IN THE REAL WORLD

Kyle “Aphyr” Kingsbury

Breaking consistency guarantees since 2013

PostgreSQL

• Traditional RDBMS
– Transactional
– ACID compliant

• Primarily a CP system
– Writes against a master node

• “Not a distributed system”
– Except with a client at play!

PostgreSQL

• Writes are a simplified 2PC:
– Client votes to commit
– Server validates the transaction
– Server stores the changes
– Server acknowledges the commit
– Client receives the acknowledgement

[Diagram: client/server message exchange ending in a store]

PostgreSQL

• But what if the ack is never received?

• The commit is already stored…

• … but the client has no indication!

• The system is in an inconsistent state

[Diagram: the server stores the commit, but the acknowledgement back to the client is lost]

PostgreSQL

• Let’s experiment!

• 5 clients write to a PostgreSQL instance

• We then drop the server from the network

• Results:
– 1000 writes
– 950 acknowledged
– 952 survivors

So what can we do?

1. Accept false-negatives
– May not be acceptable for your use case!

2. Use idempotent operations

3. Apply unique transaction IDs
– Query state after the partition is resolved

• These strategies apply to any RDBMS
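The combination of strategies 2 and 3 can be sketched in a few lines of Python. This is a toy in-memory store, not a real database API; `Ledger` and its methods are hypothetical names for illustration. The client generates a transaction ID up front, so a lost acknowledgement can be resolved by retrying (idempotently) or by querying for the ID after the partition heals:

```python
import uuid

class Ledger:
    """Toy server-side store keyed by client-generated transaction IDs."""
    def __init__(self):
        self.committed = {}  # txn_id -> value

    def write(self, txn_id, value):
        # Idempotent: re-applying the same txn_id has no further effect.
        self.committed.setdefault(txn_id, value)

    def was_committed(self, txn_id):
        # After a partition heals, the client queries by its own ID
        # instead of guessing what a lost ack meant.
        return txn_id in self.committed

ledger = Ledger()
txn = str(uuid.uuid4())        # client generates the ID before writing
ledger.write(txn, "order-42")
# ...ack lost in a partition; the client can simply retry:
ledger.write(txn, "order-42")  # safe, applied at most once
assert ledger.was_committed(txn)
```

The key design choice is that the *client* owns the transaction ID; the server never has to distinguish a first attempt from a retry.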

MongoDB

• A document-oriented database

• Availability/scale via replica sets
– Client writes to a master node
– Master replicates writes to n replicas

• User-selectable consistency guarantees

MongoDB

• When a partition occurs:
– If the master is in the minority, it is demoted
– The majority promotes a new master…
– … selected by the highest optime

MongoDB

• The cluster “heals” after partition resolution:
– The “old” master rejoins the cluster
– Acknowledged minority writes are reverted!

MongoDB

• Let’s experiment!

• Set up a 5-node MongoDB cluster

• 5 clients write to the cluster

• We then partition the cluster

• … and restore it to see what happens

MongoDB

• With write concern unacknowledged:
– Server does not ack writes (except TCP)
– The default prior to November 2012

• Results:
– 6000 writes
– 5700 acknowledged
– 3319 survivors
– 42% data loss!

MongoDB

• With write concern acknowledged:
– Server acknowledges writes (after store)
– The default guarantee

• Results:
– 6000 writes
– 5900 acknowledged
– 3692 survivors
– 37% data loss!

MongoDB

• With write concern replica acknowledged:
– Client specifies minimum replicas
– Server acks after writes to replicas

• Results:
– 6000 writes
– 5695 acknowledged
– 3768 survivors
– 33% data loss!

MongoDB

• With write concern majority:
– For an n-node cluster, requires acks from a majority (more than n/2) of nodes
– Also called “quorum”

• Results:
– 6000 writes
– 5700 acknowledged
– 5701 survivors
– No data loss

So what can we do?

1. Keep calm and carry on
– As Aphyr puts it, “not all applications need consistency”
– Have a reliable backup strategy
– … and make sure you drill restores!

2. Use write concern majority
– And take the performance hit
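The majority rule itself is simple arithmetic. Here is a toy Python simulation of quorum acknowledgement (not the MongoDB driver API; `quorum_write` and the node representation are made up for illustration): a write is accepted only if more than half the nodes acknowledge it, so a minority partition can never confirm a write that the majority would later revert.

```python
def majority(n):
    # A quorum is strictly more than half the nodes.
    return n // 2 + 1

def quorum_write(value, nodes, reachable):
    """Write `value` to every reachable node; succeed only on majority ack.
    `nodes` is a list of node dicts, `reachable` the set of reachable ids."""
    acks = 0
    for node in nodes:
        if node["id"] in reachable:
            node["value"] = value
            acks += 1
    return acks >= majority(len(nodes))

nodes = [{"id": i, "value": None} for i in range(5)]
# Minority side of a partition (2 of 5 reachable): the write is rejected.
assert not quorum_write("v1", nodes, reachable={0, 1})
# Majority side (3 of 5 reachable): the write is acknowledged.
assert quorum_write("v1", nodes, reachable={0, 1, 2})
```

Note that two majorities of the same cluster always overlap in at least one node, which is what makes the acknowledged write durable across a partition.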

The prime suspects

• Aphyr’s Jepsen tests include:
– Redis
– Riak
– ZooKeeper
– Kafka
– Cassandra
– RabbitMQ
– etcd (and Consul)
– Elasticsearch

• If you’re considering them, go read his posts

• In fact, go read his posts regardless

http://aphyr.com/tags/jepsen

STRATEGIES FOR DISTRIBUTED SYSTEMS

Immutable Data

• Immutable (adj.): “Unchanging over time or unable to be changed.”

• Meaning:
– No deletes
– No updates
– No merge conflicts
– Replication is trivial

Idempotence

• An idempotent operation:
– Can be applied one or more times with the same effect

• Enables retries

• Not always possible
– Side-effects are key
– Consider: payments

Eventual Consistency

• A design which prefers availability

• … but guarantees that clients will eventually see consistent reads

• Consider git:
– Always available locally
– Converges via push/pull
– Human conflict resolution

Eventual Consistency

• The system expects data to diverge

• … and includes mechanisms to regain convergence
– Partial ordering to minimize conflicts
– A merge function to resolve conflicts
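As a minimal illustration of a merge function, consider two shopping-cart replicas that diverged during a partition and are reconciled by set union (a toy Python sketch; the item names are made up):

```python
def merge_carts(local, remote):
    """Toy merge function: resolve divergent replicas by set union.
    Biased toward keeping items: an add survives a missed replication."""
    return set(local) | set(remote)

# Two replicas diverge during a partition...
replica_a = {"book", "mug"}      # user added "mug" on one side
replica_b = {"book", "poster"}   # ...and "poster" on the other

# ...and converge once the partition heals:
merged = merge_carts(replica_a, replica_b)
assert merged == {"book", "mug", "poster"}
# The merge is commutative: the order of application doesn't matter.
assert merge_carts(replica_b, replica_a) == merged
```

A plain union cannot represent removals (a deleted item reappears after the merge), which is exactly the gap that OR-Sets, covered under CRDTs below, close by tracking both additions and removals.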

Vector Clocks

• A technique for partial ordering

• Each node has a logical clock
– The clock increases on every write
– Track the last observed clocks for each item
– Include this vector on replication

• When the observed and inbound vectors have no common ancestor, we have a conflict

• This lets us know when history diverged
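The partial-ordering test can be sketched directly in Python. `tick` and `compare` are hypothetical helper names for illustration; the vectors are plain dicts mapping node ids to logical-clock values:

```python
def tick(clock, node):
    """Return a copy of `clock` with `node`'s component incremented."""
    bumped = dict(clock)
    bumped[node] = bumped.get(node, 0) + 1
    return bumped

def compare(a, b):
    """Partial order on vector clocks:
    '<'  a happened before b, '>' b happened before a,
    '='  identical, '||' concurrent (a conflict to resolve)."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "="
    if a_le_b:
        return "<"
    if b_le_a:
        return ">"
    return "||"

v1 = tick({}, "node1")        # node1 writes: {"node1": 1}
v2a = tick(v1, "node1")       # node1 writes again
v2b = tick(v1, "node2")       # meanwhile node2 writes off the same ancestor
assert compare(v1, v2a) == "<"     # ordinary causal succession
assert compare(v2a, v2b) == "||"   # divergent histories: conflict detected
```

The `||` case is the signal that history diverged and a merge function (or a human, as in git) must resolve the conflict.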

CRDTs

• Commutative Replicated Data Types1

• A CRDT is a data structure that:
– Eventually converges to a consistent state
– Guarantees no conflicts on replication

1 “A comprehensive study of Convergent and Commutative Replicated Data Types”, Shapiro et al

CRDTs

• CRDTs provide specialized semantics:
– G-Counter: monotonically increasing counter
– PN-Counter: also supports decrements
– G-Set: a set that only supports adds
– 2P-Set: supports removals, but a removed element cannot be re-added

• OR-Sets are particularly useful
– Keep track of both additions and removals
– Can be used for shopping carts
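A G-Counter is small enough to sketch in full. This is an illustrative Python implementation under the usual G-Counter rules, not any particular library's API: each replica increments only its own slot, and merge takes the per-replica maximum, so merges commute, are idempotent, and all replicas converge.

```python
class GCounter:
    """Grow-only counter CRDT."""
    def __init__(self, counts=None):
        self.counts = dict(counts or {})  # replica id -> local count

    def increment(self, replica, amount=1):
        # Each replica only ever touches its own slot.
        self.counts[replica] = self.counts.get(replica, 0) + amount

    def value(self):
        # The counter's value is the sum over all replicas.
        return sum(self.counts.values())

    def merge(self, other):
        # Per-replica maximum: commutative, associative, idempotent.
        keys = set(self.counts) | set(other.counts)
        return GCounter({k: max(self.counts.get(k, 0),
                                other.counts.get(k, 0)) for k in keys})

a, b = GCounter(), GCounter()
a.increment("A"); a.increment("A")   # replica A counts 2 events
b.increment("B")                     # replica B counts 1, concurrently
# Merging in either order yields the same converged total:
assert a.merge(b).value() == 3
assert b.merge(a).value() == 3
```

Because merge never loses an increment regardless of replication order, there is no conflict to resolve: the "no conflicts on replication" guarantee falls out of the data structure itself.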

Questions?

Complaints?

WE’RE DONE HERE!

Thank you for listening

tomer@tomergabel.com

@tomerg

http://il.linkedin.com/in/tomergabel

Aphyr’s “Call Me Maybe” blog posts:

http://aphyr.com/tags/jepsen
