Scaling Out and the CAP Theorem
DESCRIPTION
On Friday 4th June 1976, the Sex Pistols kicked off their first gig, a gig considered to have changed western music culture forever by pioneering the genesis of punk rock. Wednesday 19th July 2000 had a similar impact on internet-scale companies as the Sex Pistols did on music, with the keynote speech by Eric Brewer at the ACM symposium on the [Principles of Distributed Computing](http://www.podc.org/podc2000/) (PODC). Brewer claimed that as applications become more web-based, we should stop worrying about data consistency: if we want high availability in these new distributed applications, then we cannot have guaranteed data consistency. Two years later, in 2002, Seth Gilbert and Nancy Lynch [formally proved](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf) Brewer's claim, now known as Brewer's Theorem or the CAP theorem. The CAP theorem states that a distributed system cannot simultaneously satisfy all three of Consistency, Availability and Partition tolerance. In the database ecosystem, many tools claim to solve our data persistence problems while scaling out, offering different capabilities (document stores, key/value stores, SQL, graph, etc.). In this talk we will explore the CAP theorem + We will define Consistency, Availability and Partition Tolerance + We will explore what CAP means for our applications (ACID vs BASE) + We will explore practical applications on MySQL with a read slave, MongoDB and Riak, based on the work by [Aphyr - Kyle Kingsbury](http://aphyr.com/posts).
TRANSCRIPT
CAP Theorem
Reversim Summit 2014
CAP theorem, or Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency – All nodes see the same data at the same time
• Availability – A guarantee that every request receives a response about whether it was successful or failed
• Partition Tolerance – The system continues to operate despite arbitrary message loss or failure of part of the system
It means that for internet-scale companies we should stop worrying about data consistency: if we want high availability in such distributed systems, then guaranteed consistency of data is something we cannot have.
An Example
Consider an online bookstore
• You want to buy the book "The tales of the CAP theorem" – The store has only one copy in stock – You add it to your cart and continue browsing, looking for another book ("ACID vs BASE, a love story?")
• As you browse the shop, someone else buys "The tales of the CAP theorem" – Adds the book to their cart and checks out
Consistency
A service that is consistent operates fully or not at all. In our bookstore example:
• There is only one copy in stock and only one person will get it
• If both customers can continue through the order process (payment), the lack of consistency will become a business issue
• Scale this inconsistency and you have a major business issue
• You can solve this issue using a database to manage inventory – The first checkout operates fully, the second not at all
Consistency
Note: CAP Consistency is the Atomicity in ACID
• CAP consistency is a constraint that multiple values of the same data are not allowed
• ACID Atomicity requires that each transaction is "all or nothing" – Which implies that multiple values of the same data are not allowed
• ACID consistency means that any transaction brings the database from one consistent state to another – Global consistency – of the whole database
Availability
Availability means just that – the service is available
• When you purchase a book you want to get a response – Not some Schrödinger message about the site being uncommunicative
• Availability most often deserts you when you need it the most – Services tend to go down at busy periods
• A service that's running but cannot be reached is of no benefit to anyone
Partition Tolerance
A partition happens when a node in your system cannot communicate with another node
• Say, because a network cable gets chopped
• Partitions are equivalent to a server crash – If nothing can connect to it, it may as well not be there
• If your application and database run on one box then your server acts as a kind of atomic processor – it either works or it doesn't – How far can you scale on one host?
• Once you scale to multiple hosts, you need partition tolerance
Partitions
But wait, are partitions real?
• Our infrastructure is reliable, right?
Formally, in any network, messages can be dropped, delayed, duplicated, or reordered
• IP networks do all four
• TCP means no duplicates or reordering – Unless you retry!
• Delays are indistinguishable from drops (after a timeout) – There is no perfect failure detector in an async network
[Diagram: messages between nodes A and B, over time, being dropped, delayed, duplicated, and reordered]
Partitions are real! Some causes:
• GC pause – Is actually a delay
• Network maintenance
• Segfaults & crashes
• Faulty NICs
• Bridge loops
• VLAN problems
• Hosted networks
• The cloud
• WAN links & backhoes
Published examples:
• Netflix
• Twilio
• Fog Creek
• AWS
• Github
• Wix
• Microsoft datacenter study
– Average failure rate 5.2 devices per day and 40.8 links per day
– Median packet loss 59,000 packets
– Network redundancy improves median traffic by 43%
More examples at http://aphyr.com/posts/288-the-network-is-reliable
The CAP Theorem proof
Proof in Pictures
• Consider a system with two nodes N1 and N2
• They both share the same data V
• On N1 runs program A
• On N2 runs program B
– We consider both A and B to be ideal – safe, bug free, predictable and reliable
• In this example, A writes a new value of V and B reads the value of V
Proof in Pictures
Sunny-day scenario
1. A writes a new value of V, denoted as V1
2. A message M is passed from N1 to N2 which updates the copy of V there
3. Any read by B of V will return V1
Proof in Pictures
In the case of network partition
• Messages from N1 to N2 are not delivered – Even if we use guaranteed delivery of M, N1 has no way of knowing if a message is delayed by a partitioning event or a failure on N2
– Then N2 contains an inconsistent value of V when step 3 occurs
• We have lost consistency!
Proof in Pictures
In the case of network partition
• We can make M synchronous
– Which means the write of A on N1 and the update from N1 to N2 is an atomic operation
– A write will fail in case of a partition
• We have lost availability!
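The two-node argument above can be sketched as a toy simulation (hypothetical code, not from the talk): under a partition, a CP-style system rejects the write to stay consistent, while an AP-style system accepts it and serves a stale read on the other node.

```python
class Node:
    def __init__(self, value=None):
        self.value = value

class TwoNodeSystem:
    """Two nodes N1 and N2 replicating a shared value V."""
    def __init__(self, mode):
        self.n1, self.n2 = Node("V0"), Node("V0")
        self.partitioned = False
        self.mode = mode  # "CP" or "AP"

    def write(self, value):
        self.n1.value = value                # A writes V1 on N1
        if self.partitioned:
            if self.mode == "CP":
                self.n1.value = "V0"         # synchronous M: abort the write
                raise RuntimeError("write rejected during partition")
            # AP: accept the write; the message M to N2 is simply lost
        else:
            self.n2.value = value            # message M updates the copy on N2

    def read_from_n2(self):
        return self.n2.value                 # B reads V on N2

cp = TwoNodeSystem("CP")
cp.partitioned = True
try:
    cp.write("V1")
except RuntimeError:
    print("CP: unavailable, but consistent")

ap = TwoNodeSystem("AP")
ap.partitioned = True
ap.write("V1")
print("AP:", ap.read_from_n2())  # "V0" - a stale read, consistency is lost
```

Either branch gives up exactly one guarantee; neither branch can keep both once the partition exists.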
What does it all mean?
In practical terms
For a distributed system to not require partition tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. Such systems do not exist.
Make your choice
• Choose consistency over availability
• Choose availability over consistency
• Choose neither
CAP Locality
• It holds per operation independently
– A system can be both CP and AP, for different operations
– Different operations can be modeled with different CAP properties
• An operation can be
– CP – consistent and partition tolerant
– AP – available and partition tolerant
– P with mixed A & C – trading off between A and C
• Eventual consistency, for example
Per-operation choice: add item to cart (favor availability), checkout (favor consistency)
Let's look at some examples
Using the findings of Kyle Kingsbury aphyr.com
Postgres
• A classic open source database
• We think of it as a CP system
– It accepts writes only on a single primary node
– Ensuring a write to slaves as well
• If a partition occurs
– We cannot talk to the server and the system is unavailable
– Because transactions are ACID, we're always consistent
However
• The distributed system composed of the server and client together may not be consistent – They may not agree if a transaction took place
Postgres
• Postgres' commit protocol is a two-phase commit – 2PC
1. The client votes to commit and sends a message to the server
2. The server checks for consistency and votes to commit (or reject) the transaction
3. It writes the transaction to storage
4. The server informs the client that a commit took place
• What happens if the acknowledgment message is dropped?
– The client doesn't know whether the commit succeeded or not!
– The 2PC protocol requires the client to wait for an ack
– The client will eventually get a timeout (or deadlock)
Postgres
The experiment
• Install and run Postgres on one host
• Run 5 clients who write to Postgres within a transaction
• During the experiment, drop the network for one of the nodes
The findings
• Out of 1000 write operations
• 950 successfully acknowledged, and all are in the database
• 2 writes succeeded, but the client got an exception claiming an error occurred!
– Note that the client has no way to know if the write succeeded or failed
Postgres
2PC Strategies
• Accept false negatives – Just ignore the exception on the client. Those errors happen only for in-flight writes at the time the partition began.
• Use idempotent operations – On a network error, just retry
• Use transaction IDs – When a partition is resolved, the client checks if a transaction was committed using the transaction ID.
Note: those strategies apply to most SQL engines
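The transaction-ID strategy can be sketched like this (hypothetical names throughout; `committed` stands in for a server-side table, and `server_write` is not a real Postgres API):

```python
import uuid

committed = {}  # stands in for a server-side table keyed by transaction ID

def server_write(txn_id, value, drop_ack=False):
    committed[txn_id] = value                  # the write is durable...
    if drop_ack:
        raise ConnectionError("ack lost")      # ...but the ack never arrives

def safe_write(value, drop_ack_once=False):
    txn_id = str(uuid.uuid4())                 # client-generated transaction ID
    try:
        server_write(txn_id, value, drop_ack=drop_ack_once)
    except ConnectionError:
        # Once the partition resolves, check before retrying
        if txn_id not in committed:
            server_write(txn_id, value)
    return txn_id

txn = safe_write("order-42", drop_ack_once=True)
print(committed[txn])  # the write is present exactly once despite the lost ack
```

Because the client supplies the ID, a retry can never double-apply a write that actually committed.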
MongoDB
• MongoDB is a document-oriented database
• Replicated using a replica set – Single writable primary node – Asynchronously replicates writes as an oplog to N secondaries
• MongoDB supports different levels of guarantees
– Asynchronous replication
– Confirm successful write to its disk log
– Confirm successful replication of a write to secondary nodes
• Is MongoDB consistent?
– MongoDB is promoted as a CP system
– However, it may "revert operations" on network partition in some cases
MongoDB
What happens when the primary becomes unavailable?
• The remaining secondaries will detect the failed connection
– Will try to reach consensus on a new leader
– If they have a majority, they'll select the node with the highest optime
• The minority nodes will detect they no longer have a quorum
– Will demote the primary to a secondary
• If our primary is on n1 and we cut n1 & n2 off from the rest, we expect n3, n4 or n5 to become the new primary
MongoDB
The experiment
• Install and run MongoDB on 5 hosts
• With 5 clients
– Writing some data to the cluster
• During the experiment, partition the network
– Into a minority side containing the primary, and a majority side
• Then restore the network
• Check what happened
– Which writes survived
MongoDB
Write concern unacknowledged
• The default at the time Kyle ran the experiment
The findings
• 6000 total writes
• 5700 acknowledged
• 3319 survivors
• 2381 acknowledged writes lost (42% write loss)
Not surprising: we have data loss.
MongoDB
42% data loss? What happened?
• When the partition started
– The original primary (N1) continued to accept writes
– But those writes never made it to the new primary (N5)
• When the partition ended
– The original primary (N1) and the new primary (N5) compare notes
– They figure out that the N5 optime is higher
– N1 finds the last point the two agreed on the oplog and rolls back to that point
• During a rollback, all writes the old primary accepted after the common point in the oplog are removed from the database
MongoDB
Write concern safe or acknowledged
• The current default
• Allows clients to catch network, duplicate key and other errors
The findings
• 6000 total writes
• 5900 acknowledged
• 3692 survivors
• 2208 acknowledged writes lost (37% write loss)
Write concern acknowledged only verifies the write was accepted on the master. We need to ensure replicas also see the write.
MongoDB
Write concern replicas_safe or replica_acknowledged
• Waits for at least 2 servers to confirm the write operation
The findings
• 6000 total writes
• 5695 acknowledged
• 3768 survivors
• 1927 acknowledged writes lost (33% write loss)
Mongo only verifies that a write took place against two nodes. A new primary can be elected without having seen those writes. In this case, Mongo will roll back those writes.
MongoDB
Write concern majority
• Waits for a majority of servers to confirm the write operation
The findings
• 6000 total writes
• 5700 acknowledged
• 5701 survivors
• 2 acknowledged writes lost
• 3 unacknowledged writes found
The reason we have 2 writes lost is a bug in Mongo that caused it to treat network failures as successful writes. This bug was fixed in 2.4.3 (or 2.4.4). The fact that we have 3 unacknowledged writes found is not a problem – similar arguments to Postgres.
MongoDB
Takeaways for MongoDB
You can either
• Accept data loss
– At most WriteConcern levels Mongo can get to a point where it rolls back data
• Use WriteConcern.Majority
– With a performance impact
Other distributed systems Kyle tested
All have different caveats
Worth a read at aphyr.com
ZooKeeper
Kafka
Strategies for distributed data & systems
Immutable Data
• Immutable data means
– No updates
– No deletes
– No need for data merges
– Easier to replicate
• Immutable data solves the problems that cause distributed systems to delete data (MongoDB, Riak, Cassandra, etc.)
– However, even if your data is immutable, existing tools assume it is mutable and may still delete your data
• Can you model all your data to be immutable?
• How do you model inventory using immutable data?
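One hypothetical answer to the inventory question (an illustrative sketch, not from the talk): keep an append-only event log and derive the current stock by folding over it, so nothing is ever updated or deleted.

```python
events = []  # append-only log: no updates, no deletes

def record(event_type, qty):
    events.append({"type": event_type, "qty": qty})

def stock():
    """Derive current stock by folding over the immutable history."""
    deltas = {"received": +1, "sold": -1}
    return sum(deltas[e["type"]] * e["qty"] for e in events)

record("received", 10)
record("sold", 3)
record("sold", 2)
print(stock())  # 5 - the mutable "stock" field is now a derived value
```

The log replicates trivially (append-only), and any replica can recompute the same stock from the same events.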
Idempotent Opera@ons
An operation is idempotent if, whenever it is applied twice, it gives the same result as if it were applied once
• It enables recovering from availability problems
– A way to introduce fault tolerance
– The Postgres client-server ack issue, for example
• In case of any failure, just retry
– Undetermined response
– Failure to write
• However, it does not solve the CAP constraints
• Can you model all your operations to be idempotent?
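A minimal illustration of the definition (hypothetical example, not from the talk): a "set" is idempotent and safe to retry blindly, while a bare increment is not.

```python
account = {"balance": 0}

def set_balance(amount):      # idempotent: applying twice == applying once
    account["balance"] = amount

def add_to_balance(amount):   # not idempotent: each retry re-applies it
    account["balance"] += amount

set_balance(100)
set_balance(100)              # retry after an undetermined response: harmless
print(account["balance"])     # 100

add_to_balance(10)
add_to_balance(10)            # a blind retry double-applies the increment
print(account["balance"])     # 120, not the intended 110
```

This is why the "just retry" strategy above is only safe when every operation is modeled to be idempotent.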
BASE
BASE
• Defined by Eric Brewer
• Basically Available
– The system guarantees availability, in terms of the CAP theorem
• Soft state
– The system state is statistically consistent. It may change in time without external input.
• Eventual consistency
– The system will converge to be consistent over time
• Considered the contrast to ACID (Atomicity, Consistency, Isolation, Durability)
– Not really :)
– Both are actually contrived
Eventual Consistency
For AP systems, we can make the system regain consistency
• We all know such a system – Git
– Available on each node, fully partition tolerant
– Gains consistency using Git push & pull
– Human merge of data
• Can we take those ideas to other distributed systems?
• How can we track history?
– Identify conflicts?
• Can we make the merge automatic?
Vector Clocks
• A way to track ordering of events in a distributed system
• Enables detecting conflicting writes
– And the shared point in history where the divergence started
• Each write includes a logical clock
– A clock per node
– Each time a node writes data, it increments its clock
• Nodes sync with each other using a gossip protocol
• Multiple implementations
– Node based
– Operation based
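A minimal node-based vector clock might look like this (an illustrative sketch of one common scheme, not any particular database's implementation):

```python
def increment(clock, node):
    """A node increments its own entry on every local write."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """On sync (e.g. via gossip), take the element-wise maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def descends(a, b):
    """True if clock a has seen everything clock b has."""
    return all(a.get(n, 0) >= c for n, c in b.items())

def concurrent(a, b):
    """Neither descends from the other: conflicting writes to resolve."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "n1")   # n1 writes: {"n1": 1}
v2 = increment(v1, "n2")   # n2 writes after seeing v1
v3 = increment(v1, "n3")   # n3 also writes, having seen only v1
print(descends(v2, v1))    # True: v2 supersedes v1, no conflict
print(concurrent(v2, v3))  # True: v2 and v3 diverged from v1 - a conflict
```

Comparing clocks is what lets a system tell "newer version of the same data" apart from "two conflicting versions that need a merge".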
Eventual Consistency
• A system that expects data to diverge
– For small intervals in time
– For as long as a partition exists
• Built to regain consistency
– Using some sync protocol (gossip)
– Using vector clocks or timestamps to compare values
• Needs to handle value merges
– Minimize merges using vector clocks
• Merge only if values actually diverge
– Using timestamps to select the newer value
– Using business-specific merge functions
– Using CRDTs
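The simplest merge listed above, last-write-wins by timestamp, can be sketched in a few lines (illustrative example; note it silently drops concurrent updates):

```python
def lww_merge(a, b):
    """Keep whichever replica's value carries the newer timestamp."""
    return a if a["ts"] >= b["ts"] else b

replica1 = {"value": "cart=[book1]", "ts": 100}
replica2 = {"value": "cart=[book2]", "ts": 105}
merged = lww_merge(replica1, replica2)
print(merged["value"])  # cart=[book2] - book1 is silently lost
```

That silent loss is exactly why business-specific merge functions or CRDTs are often preferred over raw timestamps.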
CRDTs
Commutative Replicated Data Type (also known as Conflict-free Replicated Data Type)
• Not a lot of data types available to select from
– G-Counter, PN-Counter, G-Set, 2P-Set, OR-Set, U-Set, Graphs
• OR-Set
– For social graphs
– Can be used for a shopping cart (with some modifications)
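A G-Counter, the simplest CRDT on the list, can be sketched as follows (illustrative code): each node increments only its own slot, and merging is an element-wise max, so merges commute and can be applied in any order after a partition heals.

```python
class GCounter:
    """Grow-only counter CRDT: per-node slots, merge by element-wise max."""
    def __init__(self):
        self.counts = {}

    def increment(self, node, n=1):
        self.counts[node] = self.counts.get(node, 0) + n

    def merge(self, other):
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter(), GCounter()
a.increment("n1", 3)          # updates accepted on both sides of a partition
b.increment("n2", 2)
a.merge(b)                    # after healing, merge in any order...
b.merge(a)
print(a.value(), b.value())   # 5 5 - ...and both replicas converge
```

Because merge is commutative, associative and idempotent, no conflict resolution or coordination is ever needed.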
Questions? Anyone?