Scaling Out and the CAP Theorem
DESCRIPTION
On Friday 4th June 1976, the Sex Pistols kicked off their first gig, a gig considered to have changed western music culture forever by pioneering the genesis of punk rock. Wednesday 19th July 2000 had a similar impact on internet-scale companies as the Sex Pistols did on music, with the keynote speech by Eric Brewer at the ACM symposium on the [Principles of Distributed Computing](http://www.podc.org/podc2000/) (PODC). Brewer claimed that as applications become more web-based, we should stop worrying about data consistency: if we want high availability in these new distributed applications, then we cannot have guaranteed data consistency. Two years later, in 2002, Seth Gilbert and Nancy Lynch [formally proved](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf) Brewer's claim, now known as Brewer's Theorem or the CAP theorem. The CAP theorem states that a distributed system cannot simultaneously satisfy all three of Consistency, Availability and Partition tolerance. In the database ecosystem, many tools claim to solve our data persistence problems while scaling out, offering different capabilities (document stores, key/value stores, SQL, graph, etc.). In this talk we will explore the CAP theorem + We will define Consistency, Availability and Partition Tolerance + We will explore what CAP means for our applications (ACID vs BASE) + We will explore practical applications on MySQL with a read slave, MongoDB and Riak, based on the work by [Aphyr - Kyle Kingsbury](http://aphyr.com/posts).
TRANSCRIPT
CAP Theorem
Reversim Summit 2014
CAP theorem, or Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency – All nodes see the same data at the same time
• Availability – A guarantee that every request receives a response about whether it was successful or failed
• Partition Tolerance – The system continues to operate despite arbitrary message loss or failure of part of the system
It means that for internet-scale companies we should stop worrying about data consistency: if we want high availability in such distributed systems, then guaranteed consistency of data is something we cannot have.
An Example
Consider an online bookstore
• You want to buy the book "The tales of the CAP theorem" – The store has only one copy in stock – You add it to your cart and continue browsing, looking for another book ("ACID vs BASE, a love story?")
• As you browse the shop, someone else buys "The tales of the CAP theorem" – Adds the book to their cart and checks out
Consistency
A service that is consistent operates fully or not at all. In our bookstore example:
• There is only one copy in stock and only one person will get it
• If both customers can continue through the order process (payment), the lack of consistency will become a business issue
• Scale this inconsistency and you have a major business issue
• You can solve this issue using a database to manage inventory – The first checkout operates fully, the second not at all
Consistency
Note: CAP Consistency is the Atomicity in ACID
• CAP consistency is a constraint that multiple values of the same data are not allowed
• ACID Atomicity requires that each transaction is "all or nothing" – Which implies that multiple values of the same data are not allowed
• ACID consistency means that any transaction brings the database from one consistent state to another – Global consistency – of the whole database
Availability
Availability means just that – the service is available
• When you purchase a book you want to get a response – Not some Schrödinger message about the site being uncommunicative
• Availability most often deserts you when you need it the most – Services tend to go down at busy periods
• A service that's running but cannot be reached is of no benefit to anyone
Partition Tolerance
A partition happens when a node in your system cannot communicate with another node
• Say, because a network cable gets chopped
• Partitions are equivalent to a server crash – If nothing can connect to it, it may as well not be there
• If your application and database run on one box then your server acts as a kind of atomic processor – it either works or it doesn't – How far can you scale on one host?
• Once you scale to multiple hosts, you need partition tolerance
Partitions
But wait, are partitions real?
• Our infrastructure is reliable, right?
Formally, in any network, messages can be dropped, delayed, duplicated, or reordered
• IP networks do all four
• TCP means no duplicates or reordering – Unless you retry!
• Delays are indistinguishable from drops (after a timeout) – There is no perfect failure detector in an async network
[Diagram: messages between nodes A and B, over time, being dropped, delayed, duplicated, and reordered]
Partitions are real! Some causes:
• GC pause – Is actually a delay
• Network maintenance
• Segfaults & crashes
• Faulty NICs
• Bridge loops
• VLAN problems
• Hosted networks
• The cloud
• WAN links & backhoes
Published examples:
• Netflix
• Twilio
• Fog Creek
• AWS
• Github
• Wix
• Microsoft datacenter study
– Average failure rate 5.2 devices per day and 40.8 links per day
– Median packet loss 59,000 packets
– Network redundancy improves median traffic by 43%
More examples at http://aphyr.com/posts/288-the-network-is-reliable
The CAP Theorem proof
Proof in Pictures
• Consider a system with two nodes N1 and N2
• They both share the same data V
• On N1 runs program A
• On N2 runs program B
– We consider both A and B to be ideal – safe, bug free, predictable and reliable
• In this example, A writes a new value of V and B reads the value of V
Proof in Pictures
Sunny-day scenario
1. A writes a new value of V, denoted as V1
2. A message M is passed from N1 to N2 which updates the copy of V there
3. Any read by B of V will return V1
Proof in Pictures
In the case of network partition
• Messages from N1 to N2 are not delivered – Even if we use guaranteed delivery of M, N1 has no way of knowing if a message is delayed by a partitioning event or a failure on N2
– Then N2 contains an inconsistent value of V when step 3 occurs
• We have lost consistency!
Proof in Pictures
In the case of network partition
• We can make M synchronous
– Which means the write of A on N1 and the update from N1 to N2 is an atomic operation
– A write will fail in case of a partition
• We have lost availability!
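The two-node argument above can be sketched as a toy simulation (hypothetical code, not from the talk): under a partition, a CP-style system rejects the write to stay consistent, while an AP-style system accepts it and serves a stale read on the other node.

```python
class Node:
    def __init__(self, value=None):
        self.value = value

class TwoNodeSystem:
    """Two nodes N1 and N2 replicating a shared value V."""
    def __init__(self, mode):
        self.n1, self.n2 = Node("V0"), Node("V0")
        self.partitioned = False
        self.mode = mode  # "CP" or "AP"

    def write(self, value):
        self.n1.value = value                # A writes V1 on N1
        if self.partitioned:
            if self.mode == "CP":
                self.n1.value = "V0"         # synchronous M: abort the write
                raise RuntimeError("write rejected during partition")
            # AP: accept the write; the message M to N2 is simply lost
        else:
            self.n2.value = value            # message M updates the copy on N2

    def read_from_n2(self):
        return self.n2.value                 # B reads V on N2

cp = TwoNodeSystem("CP")
cp.partitioned = True
try:
    cp.write("V1")
except RuntimeError:
    print("CP: unavailable, but consistent")

ap = TwoNodeSystem("AP")
ap.partitioned = True
ap.write("V1")
print("AP:", ap.read_from_n2())  # "V0" - a stale read, consistency is lost
```

Either branch gives up exactly one guarantee; neither branch can keep both once the partition exists.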
What does it all mean?
In practical terms
For a distributed system to not require partition tolerance it would have to run on a network which is guaranteed to never drop messages (or even deliver them late) and whose nodes are guaranteed to never die. Such systems do not exist.
Make your choice
• Choose consistency over availability
• Choose availability over consistency
• Choose neither
CAP Locality
• It holds per operation independently
– A system can be both CP and AP, for different operations
– Different operations can be modeled with different CAP properties
• An operation can be
– CP – consistent and partition tolerant
– AP – available and partition tolerant
– P with mixed A & C – trading off between A and C
• Eventual consistency, for example
Per-operation choice: add item to cart (favor availability), checkout (favor consistency)
Let's look at some examples
Using the findings of Kyle Kingsbury aphyr.com
Postgres
• A classic open source database
• We think of it as a CP system
– It accepts writes only on a single primary node
– Ensuring a write to slaves as well
• If a partition occurs
– We cannot talk to the server and the system is unavailable
– Because transactions are ACID, we're always consistent
However
• The distributed system composed of the server and client together may not be consistent – They may not agree if a transaction took place
Postgres
• Postgres' commit protocol is a two-phase commit – 2PC
1. The client votes to commit and sends a message to the server
2. The server checks for consistency and votes to commit (or reject) the transaction
3. It writes the transaction to storage
4. The server informs the client that a commit took place
• What happens if the acknowledgment message is dropped?
– The client doesn't know whether the commit succeeded or not!
– The 2PC protocol requires the client to wait for an ack
– The client will eventually get a timeout (or deadlock)
Postgres
The experiment
• Install and run Postgres on one host
• Run 5 clients who write to Postgres within a transaction
• During the experiment, drop the network for one of the nodes
The findings
• Out of 1000 write operations
• 950 successfully acknowledged, and all are in the database
• 2 writes succeeded, but the client got an exception claiming an error occurred!
– Note that the client has no way to know if the write succeeded or failed
Postgres
2PC Strategies
• Accept false negatives – Just ignore the exception on the client. Those errors happen only for in-flight writes at the time the partition began.
• Use idempotent operations – On a network error, just retry
• Use transaction IDs – When a partition is resolved, the client checks if a transaction was committed using the transaction ID.
Note: those strategies apply to most SQL engines
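The transaction-ID strategy can be sketched like this (hypothetical names throughout; `committed` stands in for a server-side table, and `server_write` is not a real Postgres API):

```python
import uuid

committed = {}  # stands in for a server-side table keyed by transaction ID

def server_write(txn_id, value, drop_ack=False):
    committed[txn_id] = value                  # the write is durable...
    if drop_ack:
        raise ConnectionError("ack lost")      # ...but the ack never arrives

def safe_write(value, drop_ack_once=False):
    txn_id = str(uuid.uuid4())                 # client-generated transaction ID
    try:
        server_write(txn_id, value, drop_ack=drop_ack_once)
    except ConnectionError:
        # Once the partition resolves, check before retrying
        if txn_id not in committed:
            server_write(txn_id, value)
    return txn_id

txn = safe_write("order-42", drop_ack_once=True)
print(committed[txn])  # the write is present exactly once despite the lost ack
```

Because the client supplies the ID, a retry can never double-apply a write that actually committed.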
MongoDB
• MongoDB is a document-oriented database
• Replicated using a replica set – Single writable primary node – Asynchronously replicates writes as an oplog to N secondaries
• MongoDB supports different levels of guarantees
– Asynchronous replication
– Confirm successful write to its disk log
– Confirm successful replication of a write to secondary nodes
• Is MongoDB consistent?
– MongoDB is promoted as a CP system
– However, it may "revert operations" on network partition in some cases
MongoDB
What happens when the primary becomes unavailable?
• The remaining secondaries will detect the failed connection
– Will try to reach consensus on a new leader
– If they have a majority, they'll select the node with the highest optime
• The minority nodes will detect they no longer have a quorum
– Will demote the primary to a secondary
• If our primary is on n1 and we cut n1 & n2 off from the rest, we expect n3, n4 or n5 to become the new primary
MongoDB
The experiment
• Install and run MongoDB on 5 hosts
• With 5 clients
– Writing some data to the cluster
• During the experiment, partition the network
– Into a minority side containing the primary, and a majority side
• Then restore the network
• Check what happened
– Which writes survived
MongoDB
Write concern unacknowledged
• The default at the time Kyle ran the experiment
The findings
• 6000 total writes
• 5700 acknowledged
• 3319 survivors
• 2381 acknowledged writes lost (42% write loss)
Not surprising: we have data loss.
MongoDB
42% data loss? What happened?
• When the partition started
– The original primary (N1) continued to accept writes
– But those writes never made it to the new primary (N5)
• When the partition ended
– The original primary (N1) and the new primary (N5) compare notes
– They figure out that the N5 optime is higher
– N1 finds the last point the two agreed on the oplog and rolls back to that point
• During a rollback, all writes the old primary accepted after the common point in the oplog are removed from the database
MongoDB
Write concern safe or acknowledged
• The current default
• Allows clients to catch network, duplicate key and other errors
The findings
• 6000 total writes
• 5900 acknowledged
• 3692 survivors
• 2208 acknowledged writes lost (37% write loss)
Write concern acknowledged only verifies the write was accepted on the master. We need to ensure replicas also see the write.
MongoDB
Write concern replicas_safe or replica_acknowledged
• Waits for at least 2 servers to confirm the write operation
The findings
• 6000 total writes
• 5695 acknowledged
• 3768 survivors
• 1927 acknowledged writes lost (33% write loss)
Mongo only verifies that a write took place against two nodes. A new primary can be elected without having seen those writes. In this case, Mongo will roll back those writes.
MongoDB
Write concern majority
• Waits for a majority of servers to confirm the write operation
The findings
• 6000 total writes
• 5700 acknowledged
• 5701 survivors
• 2 acknowledged writes lost
• 3 unacknowledged writes found
The reason we have 2 writes lost is a bug in Mongo that caused it to treat network failures as successful writes. This bug was fixed in 2.4.3 (or 2.4.4). The fact that we have 3 unacknowledged writes found is not a problem – similar arguments to Postgres.
MongoDB
Takeaways for MongoDB
You can either
• Accept data loss
– At most WriteConcern levels Mongo can get to a point where it rolls back data
• Use WriteConcern.Majority
– With a performance impact
Other distributed systems Kyle tested
All have different caveats
Worth a read at aphyr.com
ZooKeeper
Kafka
Strategies for distributed data & systems
Immutable Data
• Immutable data means
– No updates
– No deletes
– No need for data merges
– Easier to replicate
• Immutable data solves the problems that cause distributed systems to delete data (MongoDB, Riak, Cassandra, etc.)
– However, even if your data is immutable, existing tools assume it is mutable and may still delete your data
• Can you model all your data to be immutable?
• How do you model inventory using immutable data?
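One hypothetical answer to the inventory question (an illustrative sketch, not from the talk): keep an append-only event log and derive the current stock by folding over it, so nothing is ever updated or deleted.

```python
events = []  # append-only log: no updates, no deletes

def record(event_type, qty):
    events.append({"type": event_type, "qty": qty})

def stock():
    """Derive current stock by folding over the immutable history."""
    deltas = {"received": +1, "sold": -1}
    return sum(deltas[e["type"]] * e["qty"] for e in events)

record("received", 10)
record("sold", 3)
record("sold", 2)
print(stock())  # 5 - the mutable "stock" field is now a derived value
```

The log replicates trivially (append-only), and any replica can recompute the same stock from the same events.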
Idempotent Opera@ons
An operation is idempotent if, whenever it is applied twice, it gives the same result as if it were applied once
• It enables recovering from availability problems
– A way to introduce fault tolerance
– The Postgres client-server ack issue, for example
• In case of any failure, just retry
– Undetermined response
– Failure to write
• However, it does not solve the CAP constraints
• Can you model all your operations to be idempotent?
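A minimal illustration of the definition (hypothetical example, not from the talk): a "set" is idempotent and safe to retry blindly, while a bare increment is not.

```python
account = {"balance": 0}

def set_balance(amount):      # idempotent: applying twice == applying once
    account["balance"] = amount

def add_to_balance(amount):   # not idempotent: each retry re-applies it
    account["balance"] += amount

set_balance(100)
set_balance(100)              # retry after an undetermined response: harmless
print(account["balance"])     # 100

add_to_balance(10)
add_to_balance(10)            # a blind retry double-applies the increment
print(account["balance"])     # 120, not the intended 110
```

This is why the "just retry" strategy above is only safe when every operation is modeled to be idempotent.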
BASE
BASE
• Defined by Eric Brewer
• Basically Available
– The system guarantees availability, in terms of the CAP theorem
• Soft state
– The system state is statistically consistent. It may change in time without external input.
• Eventual consistency
– The system will converge to be consistent over time
• Considered the contrast to ACID (Atomicity, Consistency, Isolation, Durability)
– Not really :)
– Both are actually contrived
Eventual Consistency
For AP systems, we can make the system regain consistency
• We all know such a system – Git
– Available on each node, fully partition tolerant
– Gains consistency using Git push & pull
– Human merge of data
• Can we take those ideas to other distributed systems?
• How can we track history?
– Identify conflicts?
• Can we make the merge automatic?
Vector Clocks
• A way to track ordering of events in a distributed system
• Enables detecting conflicting writes
– And the shared point in history where the divergence started
• Each write includes a logical clock
– A clock per node
– Each time a node writes data, it increments its clock
• Nodes sync with each other using a gossip protocol
• Multiple implementations
– Node based
– Operation based
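A minimal node-based vector clock might look like this (an illustrative sketch of one common scheme, not any particular database's implementation):

```python
def increment(clock, node):
    """A node increments its own entry on every local write."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """On sync (e.g. via gossip), take the element-wise maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def descends(a, b):
    """True if clock a has seen everything clock b has."""
    return all(a.get(n, 0) >= c for n, c in b.items())

def concurrent(a, b):
    """Neither descends from the other: conflicting writes to resolve."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "n1")   # n1 writes: {"n1": 1}
v2 = increment(v1, "n2")   # n2 writes after seeing v1
v3 = increment(v1, "n3")   # n3 also writes, having seen only v1
print(descends(v2, v1))    # True: v2 supersedes v1, no conflict
print(concurrent(v2, v3))  # True: v2 and v3 diverged from v1 - a conflict
```

Comparing clocks is what lets a system tell "newer version of the same data" apart from "two conflicting versions that need a merge".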
Eventual Consistency
• A system that expects data to diverge
– For small intervals in time
– For as long as a partition exists
• Built to regain consistency
– Using some sync protocol (gossip)
– Using vector clocks or timestamps to compare values
• Needs to handle value merges
– Minimize merges using vector clocks
• Merge only if values actually diverge
– Using timestamps to select the newer value
– Using business-specific merge functions
– Using CRDTs
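The simplest merge listed above, last-write-wins by timestamp, can be sketched in a few lines (illustrative example; note it silently drops concurrent updates):

```python
def lww_merge(a, b):
    """Keep whichever replica's value carries the newer timestamp."""
    return a if a["ts"] >= b["ts"] else b

replica1 = {"value": "cart=[book1]", "ts": 100}
replica2 = {"value": "cart=[book2]", "ts": 105}
merged = lww_merge(replica1, replica2)
print(merged["value"])  # cart=[book2] - book1 is silently lost
```

That silent loss is exactly why business-specific merge functions or CRDTs are often preferred over raw timestamps.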
CRDTs
Commutative Replicated Data Type (also known as Conflict-free Replicated Data Type)
• Not a lot of data types available to select from
– G-Counter, PN-Counter, G-Set, 2P-Set, OR-Set, U-Set, Graphs
• OR-Set
– For social graphs
– Can be used for a shopping cart (with some modifications)
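A G-Counter, the simplest CRDT on the list, can be sketched as follows (illustrative code): each node increments only its own slot, and merging is an element-wise max, so merges commute and can be applied in any order after a partition heals.

```python
class GCounter:
    """Grow-only counter CRDT: per-node slots, merge by element-wise max."""
    def __init__(self):
        self.counts = {}

    def increment(self, node, n=1):
        self.counts[node] = self.counts.get(node, 0) + n

    def merge(self, other):
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter(), GCounter()
a.increment("n1", 3)          # updates accepted on both sides of a partition
b.increment("n2", 2)
a.merge(b)                    # after healing, merge in any order...
b.merge(a)
print(a.value(), b.value())   # 5 5 - ...and both replicas converge
```

Because merge is commutative, associative and idempotent, no conflict resolution or coordination is ever needed.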
Questions? Anyone?