consistency in distributed systems, part 2

Consistency in Distributed Systems: II
Mike Miller, Co-Founder, Chief Scientist
@mlmilleratmit

Upload: dataversity

Post on 25-Jun-2015

Category: Technology

DESCRIPTION

Building on the introductory work of a past webinar, we take a deep dive into the locking, replication, and failure modes of leading NoSQL databases. We focus on three main areas critical for modern developers and architects: an industry survey of ACID compliance; best practices for storing 1-many and many-many relationships in Riak, Cassandra, MongoDB, DynamoDB, and Cloudant; and consistency between primary and secondary indexes (an often neglected subject), with its implications for immutable data models.

TRANSCRIPT

Page 1: Consistency in Distributed Systems, Part 2

Consistency in Distributed Systems: II
Mike Miller, Co-Founder, Chief Scientist
@mlmilleratmit

Page 2: Consistency in Distributed Systems, Part 2

2014-07-31 2

AMP on Consistency

https://amplab.cs.berkeley.edu/tag/consistency/

Page 3: Consistency in Distributed Systems, Part 2

Topics For Today

‣ Brief review of Part I (2014-06-12)

‣ Why is Consistency hard?

‣ What should you really care about?
  • Single object/row/document operations
  • Multi-part transactions
  • Primary/secondary indexes

‣ Dirty little (ACID) secrets: Results from industry survey

‣ Failure modes, strategies, and gotchas

Page 4: Consistency in Distributed Systems, Part 2

Motivation

Page 5: Consistency in Distributed Systems, Part 2

Mobile, Big Data

=> Stress models for consistency, transactional reasoning

Page 6: Consistency in Distributed Systems, Part 2

This is your problem when…

… data doesn’t fit on one server.
… data replicated between servers (e.g. read slaves).
… data spread between data centers.
… state spread across more than one device (mobile!)
… mixed workloads with concurrency.
… state spread across more than one process.

Page 7: Consistency in Distributed Systems, Part 2

This is now everyone’s problem

Page 8: Consistency in Distributed Systems, Part 2

Good news: the market has responded with NewSQL, NoSQL, Cloud, …

Page 9: Consistency in Distributed Systems, Part 2

ships with a mobile strategy

Page 10: Consistency in Distributed Systems, Part 2

{Write: ‘Local’, Sync: ‘Later’}

Embedded, Edge, Satellites

Desktop, Browser

Cloud

Page 11: Consistency in Distributed Systems, Part 2

NoSQL Taxonomy

Page 12: Consistency in Distributed Systems, Part 2

Page 13: Consistency in Distributed Systems, Part 2

…http://www.bailis.org/papers/ramp-sigmod2014.pdf

Fundamental reason: CAP Theorem

Page 14: Consistency in Distributed Systems, Part 2

You do need to understand your datastore.

Page 15: Consistency in Distributed Systems, Part 2

Why is Consistency Hard?

Page 16: Consistency in Distributed Systems, Part 2

1. The network is reliable.

2. Latency is zero.

(Fallacies of Distributed Computing, P. Deutsch)

Page 17: Consistency in Distributed Systems, Part 2

MySQL, MongoDB, CouchDB, SOLR, …

Dynamo, Cloudant, Cassandra, Riak, …

Page 18: Consistency in Distributed Systems, Part 2

Perfect Network

[Diagram: Primary and Secondary nodes linked by Replication, with a time axis. Client: w(x=1) → success. Client: r(x) → x=1.]

Page 19: Consistency in Distributed Systems, Part 2

Network Partition: Primary Only

[Diagram: Primary and Secondary nodes linked by Replication, with a time axis. Client: w(x=1) → success. Client: r(x) → x=Null.]

Available, temporarily inconsistent

Page 20: Consistency in Distributed Systems, Part 2

Network Partition: Primary+Secondary

[Diagram: Primary and Secondary nodes linked by Replication, with a time axis. Client: w(x=1) → success.]

Consistent

Page 21: Consistency in Distributed Systems, Part 2

Network Partition: Primary+Secondary

[Diagram: Primary and Secondary nodes linked by Replication, with a time axis. Client: w(x=1) → failure.]

Not Available
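
The four diagrams above can be replayed with a small in-memory sketch. This is an illustration only, not any database's replication protocol: the `Cluster` and `Node` classes, the method names, and the choice to serve the read from the secondary are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    data: Dict[str, int] = field(default_factory=dict)


@dataclass
class Cluster:
    primary: Node = field(default_factory=Node)
    secondary: Node = field(default_factory=Node)
    partitioned: bool = False  # True when replication traffic cannot get through

    def write_primary_only(self, key: str, value: int) -> bool:
        """Acknowledge once the primary has the value; replicate if possible."""
        self.primary.data[key] = value
        if not self.partitioned:
            self.secondary.data[key] = value
        return True

    def write_primary_and_secondary(self, key: str, value: int) -> bool:
        """Acknowledge only if both copies can be updated."""
        if self.partitioned:
            return False  # refuse the write rather than let the copies diverge
        self.primary.data[key] = value
        self.secondary.data[key] = value
        return True

    def read_secondary(self, key: str) -> Optional[int]:
        # Assumption: the second client reads from the secondary.
        return self.secondary.data.get(key)


# Perfect network: the write succeeds and the read sees it.
healthy = Cluster()
healthy_ok = healthy.write_primary_only("x", 1)
healthy_read = healthy.read_secondary("x")

# Partition, primary-only acknowledgement: available, temporarily inconsistent.
split = Cluster(partitioned=True)
split_ok = split.write_primary_only("x", 1)
split_read = split.read_secondary("x")

# Partition, primary+secondary acknowledgement: consistent, but not available.
strict = Cluster(partitioned=True)
strict_ok = strict.write_primary_and_secondary("x", 1)
```

In this sketch the only difference between "available, temporarily inconsistent" and "not available" is which write method the store chooses to offer during the partition.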

Page 22: Consistency in Distributed Systems, Part 2

Partition Failures Dominate

‣ 2011 (AWS):
  • misconfiguration => 12 hour outage

‣ 2011 Survey (Microsoft):
  • 13,300 customer-impacting network failures
  • Median 60,000 packets lost per failure
  • Mean 41 link failures per day (95th percentile: 136)
  • Median time to repair of 5 minutes (up to a week)
  • Redundant networks only reduce failure impact by 40%

‣ HP Managed Enterprise Networks
  • 28% of customer tickets due to network problems
  • 39% of all support tickets due to network problems
  • Median incident duration: 114-188 minutes

http://queue.acm.org/detail.cfm?id=2655736

Page 23: Consistency in Distributed Systems, Part 2

Latency

Network health really depends on your latency tolerance.

A slow network can be just as bad as a broken network.

The tails matter.

Page 24: Consistency in Distributed Systems, Part 2

Median Latencies

[Plots: Same AZ, Different AZs, Different Regions]

http://www.bailis.org/blog/communication-costs-in-real-world-networks/

Page 25: Consistency in Distributed Systems, Part 2

99.99% Latencies

[Plots: Same AZ, Different AZs, Different Regions]

http://www.bailis.org/blog/communication-costs-in-real-world-networks/

Page 26: Consistency in Distributed Systems, Part 2

Latency Summary

‣ Distributed, coordinated operations:
  ‣ rate ~ 1/latency

‣ Real world latencies are substantial, with long tails

‣ At scale, 0.01% events happen constantly

‣ Picture actually much worse due to systematic fluctuations

‣ 99.99% Latencies:
  ‣ Same AZ: ~50 ms
  ‣ Same Region: ~80 ms
  ‣ Inter-Region: 200-400 ms!
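
A rough worked example of "rate ~ 1/latency" and "0.01% events happen constantly", using the tail latencies quoted in the summary. It assumes each coordinated operation waits for one full round trip before the next can start, and that requests hit the tail independently; both are simplifications, and the function names are made up for the sketch.

```python
def max_serial_rate(latency_ms: float) -> float:
    """Upper bound on back-to-back coordinated operations per second."""
    return 1000.0 / latency_ms


def p_at_least_one_tail(requests: int, tail_fraction: float = 1e-4) -> float:
    """Chance that at least one of `requests` independent calls lands in the tail."""
    return 1.0 - (1.0 - tail_fraction) ** requests


# 99.99% latencies from the summary, in milliseconds.
tail_latency_ms = {
    "Same AZ": 50,
    "Same Region": 80,
    "Inter-Region (low)": 200,
    "Inter-Region (high)": 400,
}

rates = {name: max_serial_rate(ms) for name, ms in tail_latency_ms.items()}
p_10k = p_at_least_one_tail(10_000)

for name, rate in rates.items():
    print(f"{name}: at most {rate:.1f} coordinated ops/s")
print(f"P(at least one 0.01% event in 10,000 requests) = {p_10k:.2f}")
```

Under these assumptions a single stream of coordinated operations tops out at 20 per second at ~50 ms and 2.5 per second at 400 ms, and a batch of 10,000 requests has roughly a 63% chance of containing at least one 0.01% event.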

Page 27: Consistency in Distributed Systems, Part 2

Thank god for ACID (New)SQL, right?

… not so fast

Page 28: Consistency in Distributed Systems, Part 2

ACID in the Wild

http://arxiv-web3.library.cornell.edu/abs/1302.0309v1

Page 29: Consistency in Distributed Systems, Part 2

Beware the Marketing

http://arxiv-web3.library.cornell.edu/abs/1302.0309v1

Page 30: Consistency in Distributed Systems, Part 2

Wow!

Page 31: Consistency in Distributed Systems, Part 2

So… What do we use? What should we worry about?

Page 32: Consistency in Distributed Systems, Part 2

Distinguishing Characteristics

1. Locks / Concurrency
2. Relationships / Foreign Keys
3. Inter-index consistency

Page 33: Consistency in Distributed Systems, Part 2

Subjective Classification

| | Cassandra | Cloudant | MongoDB | Riak |
| Locking | Minimal | None | Writes and Reads | Minimal |
| Consistency | Quorum, Optional Paxos | Quorum | Single document Locks | Quorum, Optional Paxos |
| Relationships, “JOINs” | De-normalize, Materialized Views | Normalize, Materialized Views | De-normalize, Application Joins | De-normalize or Link Walking |
| Leading Strategies | Immutability | Immutability | Fat Documents | Immutability |
| “Intention” | HA, Shared Nothing, Many Servers | HA, Shared Nothing, Many Servers | Master/Slave, Single Server | HA, Shared Nothing, Many Servers |
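
For the "Quorum" entries in the table, a minimal sketch of the usual overlap rule: with N replicas, a read of R replicas is guaranteed to include the latest acknowledged write of W replicas when R + W > N. The `QuorumStore` class, its single shared version counter, and the explicit `reachable` lists are assumptions made for illustration; none of the four databases exposes this API.

```python
from typing import List, Optional, Tuple

Versioned = Tuple[int, Optional[str]]  # (version, value)


class QuorumStore:
    def __init__(self, n: int, w: int, r: int) -> None:
        self.n, self.w, self.r = n, w, r
        self.replicas: List[Versioned] = [(0, None)] * n
        self.version = 0  # simplification: one global version counter

    def write(self, value: str, reachable: List[int]) -> bool:
        """Succeed only if at least W replicas can be reached."""
        if len(reachable) < self.w:
            return False
        self.version += 1
        for i in reachable:
            self.replicas[i] = (self.version, value)
        return True

    def read(self, reachable: List[int]) -> Optional[str]:
        """Ask R reachable replicas and return the newest value among them."""
        if len(reachable) < self.r:
            raise RuntimeError("not enough replicas for a read quorum")
        answers = [self.replicas[i] for i in reachable[: self.r]]
        return max(answers, key=lambda rv: rv[0])[1]


# R + W > N: any read set overlaps the write set, so the read sees "v1".
strict = QuorumStore(n=3, w=2, r=2)
strict.write("v1", reachable=[0, 1])
overlap_read = strict.read(reachable=[1, 2])

# R + W <= N: the read can miss the write entirely and return stale data.
loose = QuorumStore(n=3, w=1, r=1)
loose.write("v1", reachable=[0])
stale_read = loose.read(reachable=[2])
```

The sketch ignores concurrent writers and partially applied writes, which is where the optional Paxos paths listed in the table come in.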

Page 34: Consistency in Distributed Systems, Part 2

De-normalization

It happens in all NoSQL systems. Is it the application's responsibility or the database's?

Page 35: Consistency in Distributed Systems, Part 2

Relationships as Single Documents

Natural fit for some applications

http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

Page 36: Consistency in Distributed Systems, Part 2

Relationships as Single Documents

Duplication sucks, pathologically

http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
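
To make the duplication problem concrete, here is a small sketch with plain Python dicts standing in for "fat" documents. The document shapes, field names, and the `rename_user` helper are invented for the example and are not taken from the linked post or from any database driver.

```python
import copy

user = {"id": "u1", "name": "Ann"}

# Each "fat" document embeds its own copy of the user instead of a foreign key.
posts = [
    {"id": "p1", "title": "Hello", "author": copy.deepcopy(user)},
    {
        "id": "p2",
        "title": "Again",
        "author": copy.deepcopy(user),
        "comments": [{"text": "Nice", "author": copy.deepcopy(user)}],
    },
]


def rename_user(docs, user_id, new_name):
    """Rewrite every embedded copy; return how many copies were touched."""
    touched = 0
    for doc in docs:
        embedded = [doc["author"]] + [c["author"] for c in doc.get("comments", [])]
        for author in embedded:
            if author["id"] == user_id:
                author["name"] = new_name
                touched += 1
    return touched


touched = rename_user(posts, "u1", "Anne")
print(f"one rename rewrote {touched} embedded copies")
```

One logical change fans out to every document that embeds the user, and any copy the update misses, or any reader that arrives halfway through, sees the old name.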

Page 37: Consistency in Distributed Systems, Part 2

Materialized Views Rule

Cassandra, Cloudant: “JOINs” via materialized views

Page 38: Consistency in Distributed Systems, Part 2

Review: Cassandra

‣ Highly Available

‣ CQL eases pain of de-normalization

‣ 1-many, many-many relationships via inserts into multiple column families at update

‣ Eventual consistency as those updates propagate

‣ Can appeal to Paxos API with latency, availability hit
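
The "inserts into multiple column families at update" pattern from this review, and the window of inconsistency it opens, can be sketched with two dicts standing in for two column families. This is not CQL or a Cassandra driver call; the names `posts_by_id`, `posts_by_author`, and `add_post_steps` are hypothetical.

```python
from collections import defaultdict

posts_by_id = {}                     # stand-in for one column family
posts_by_author = defaultdict(list)  # stand-in for a second, query-shaped one


def add_post_steps(post_id, author, title):
    """Yield after each insert so a caller can observe the intermediate state."""
    posts_by_id[post_id] = {"author": author, "title": title}
    yield "wrote posts_by_id"
    posts_by_author[author].append(post_id)
    yield "wrote posts_by_author"


steps = add_post_steps("p1", "ann", "Hello")

next(steps)  # first insert applied
mid_write_view = list(posts_by_author["ann"])  # second insert not applied yet

next(steps)  # second insert applied
final_view = list(posts_by_author["ann"])
```

Between the two inserts the post exists by id but is missing from the author's list, which is the eventual consistency the review refers to.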

Page 39: Consistency in Distributed Systems, Part 2

Review: Cloudant

‣ Highly Available

‣ Normalize document structure, include foreign keys to other documents.

‣ Manage foreign key integrity yourself

‣ 1-many, many-many relationships via materialized views

‣ Eventual consistency between primary-index and (batch updated) materialized view
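
A minimal sketch of the last two points: a 1-many relationship answered from a materialized view that is refreshed in batches, so it can lag the primary index. The `Store` class and its methods are assumptions for illustration and do not correspond to the Cloudant API; the sketch also ignores deletes and changes to an existing document's foreign key.

```python
from collections import defaultdict


class Store:
    def __init__(self):
        self.docs = {}                 # primary index: doc id -> document
        self.view = defaultdict(list)  # materialized view: author_id -> doc ids
        self.pending = []              # doc ids written since the last refresh

    def put(self, doc_id, doc):
        """Write to the primary index only; the view is updated later."""
        self.docs[doc_id] = doc
        self.pending.append(doc_id)

    def refresh_view(self):
        """Batch update: fold pending writes into the view."""
        for doc_id in self.pending:
            author = self.docs[doc_id].get("author_id")
            if author is not None and doc_id not in self.view[author]:
                self.view[author].append(doc_id)
        self.pending.clear()

    def posts_by(self, author_id):
        return list(self.view.get(author_id, []))


store = Store()
store.put("u1", {"type": "user", "name": "Ann"})
store.put("p1", {"type": "post", "author_id": "u1", "title": "Hello"})

stale = store.posts_by("u1")  # view has not caught up with the primary index
store.refresh_view()
fresh = store.posts_by("u1")
```

The post is readable by id as soon as `put` returns, but the "posts by author" query only sees it after the next refresh.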

Page 40: Consistency in Distributed Systems, Part 2

Review: MongoDB

‣ Understand when MongoDB locks

‣ Go as far as you can with “fat”, de-normalized documents

‣ Beware the consistency subtleties of replica sets, de-normalization

Page 41: Consistency in Distributed Systems, Part 2

Review: Riak

‣ Highly Available

‣ Include foreign keys to other documents.

‣ Manage foreign key integrity yourself

‣ one-way (“graphy”) relationships via link-walking API

‣ Can appeal to Paxos API with latency, availability hit
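
The idea behind one-way relationships and link walking in this review can be illustrated with objects that carry tagged links to other keys. This is a sketch of the concept only, not Riak's link-walking API; the `objects` layout and the `walk` function are hypothetical.

```python
objects = {
    "ann":  {"value": "Ann",  "links": [("friend", "bob")]},
    "bob":  {"value": "Bob",  "links": [("friend", "cara")]},
    "cara": {"value": "Cara", "links": []},
}


def walk(start, tag, steps):
    """Follow links with the given tag for a fixed number of hops."""
    frontier = [start]
    for _ in range(steps):
        frontier = [
            target
            for key in frontier
            # A dangling key simply contributes no links: nothing enforces
            # foreign key integrity here, so the caller has to manage it.
            for link_tag, target in objects.get(key, {"links": []})["links"]
            if link_tag == tag
        ]
    return frontier


two_hops = walk("ann", "friend", 2)     # ann -> bob -> cara
reverse_hop = walk("cara", "friend", 1)  # links are one-way: nothing points back
```

Because the links only point forward, answering "who links to cara?" would need either a scan or a second, separately maintained set of links.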

Page 42: Consistency in Distributed Systems, Part 2

My Final $0.02

‣ Time to market should be your #1 concern.

‣ You will probably run both SQL and NoSQL.

‣ We’ve focused on the database, but all new apps need a mobile strategy.

‣ You’ll never engineer a perfect network
  • Focus on Availability and Partition Tolerance

‣ You will need to become advanced/expert in data modeling for your choice of DB

Page 43: Consistency in Distributed Systems, Part 2

cloudant.com

@mlmilleratmit

#Cloudant

Thanks!

IRC

Page 44: Consistency in Distributed Systems, Part 2
