Why Distributed Databases?

Databases Sargun Dhillon @Sargun

Post on 27-Jul-2015

What is a database? A database is an organized collection of data

Applications: What are databases for?

Internet Applications: Experiencing explosive growth

[Figure: Internet Traffic vs. Penetration, 2000 to 2012. Series: IP Traffic (PB/mo) and Global Penetration (%)]

Number of Internet Users in 2012

Average Distance to Every Human

Extrapolating: We have not yet reached peak “Web”, and we won't see it for some time

Applications: How are they built?

Basic Application

Useful Application: Add Persistence

Scale Out

Scale Out with Correctness

What is a Transaction? A Unit of Work

Transaction Scheduling: Concurrent Operations

Non-Conflicting Concurrency: Parallel Execution

ACID

ACID = Atomicity: A transaction executes completely or not at all

ACID = Consistency: Correctness; the database must uphold a set of invariants

ACID = Isolation: Prevent inter-actor visibility during concurrent operations

ACID = Durability: Once you write, it will survive
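
As a small illustration of atomicity and consistency on a single node, here is a sketch using Python's built-in sqlite3 module. The `accounts` table, the `transfer` helper, and the non-negative balance invariant are made up for this example; they are not from the talk.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one unit of work."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        # The CHECK constraint (an invariant) rejects an overdraft here,
        # which also undoes the credit applied just above.
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))

def balances(conn):
    """Return {account id: balance} for inspection."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))")
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
```

A transfer that would overdraw `alice` raises `sqlite3.IntegrityError`, and neither account changes: the transaction either happens entirely or not at all.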

Lifecycle of a Transaction

Vertical Scalability: Moore’s Law can take us places

Biggest AWS Database

• vCPUs: 32

• Memory: 244 GB

• Storage: 3TB

• IOPS: 30,000

• Networking: 10 Gigabit

• Resiliency: Multi-AZ

• SLA: 99.95%

• Backend: PostgreSQL

$141,052.66/yr

Scaling Beyond

Sharding?

Do we have a natural sharding key?
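
To make the sharding question concrete, here is a minimal sketch of routing by a hash of the sharding key. The function name `shard_for` and the choice of SHA-1 are assumptions for illustration only.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a sharding key (for example a user ID) to a shard number."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every operation that touches a single key lands on one shard, but a transaction that spans two keys may span two shards, which is where the coordination questions on the following slides come from. Note also that with plain modulo routing, changing `num_shards` remaps most keys.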

Add a Coordinator?

Two-phase commit?
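
For reference, the shape of two-phase commit can be sketched in a few lines. This is a toy, in-memory model with hypothetical `Participant` and `two_phase_commit` names; it ignores timeouts, logging, and coordinator failure, which are exactly the hard parts in practice.

```python
class Participant:
    """A toy shard that votes on whether it can commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes (and promise to commit) or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Return True if every participant committed, False if all aborted."""
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):
        for p in participants:                    # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                        # phase 2: abort everywhere
        p.abort()
    return False
```

A single "no" vote aborts the whole transaction, and a prepared participant has to wait for the coordinator's decision, so a coordinator failure between the two phases leaves participants blocked.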

Three-phase commit?

Paxos?

Enhanced Three-phase commit?

Wat?

Egalitarian Paxos?

Do we really want to run NxM databases?

Partial Availability

Failure detectors are hard

Database Failure

Cascading App Failure

Recovery

Hotspots? (The “Bieber” problem)

Scaling SSI databases is a hard problem

What if we want multi-datacenter?

No latency win for mutable data

Must sacrifice recency for latency win

Complex Routing Semantics

Multi-master requires at least 1 RTT

80ms+ writes!

“Average partition duration ranged from 6 minutes for software-related failures to more than 8.2 hours for hardware-related failures (median 2.7 and 32 minutes; 95th percentile of 19.9 minutes and 3.7 days, respectively).” - The Network is Reliable

WANs Fail

Is there another way?

Into Riak

Design Requirements

Incremental Scalability: Must be able to add nodes for greater reliability or throughput

High Availability: Must be able to seamlessly handle failures and always respond to operations

Efficiency: Meet stringent latency requirements

Implementation

“Experience at Amazon has shown that data stores that provide ACID guarantees tend to have poor availability.” - Dynamo: Amazon’s Highly Available Key-value Store

The Ring: A cluster composed of a set of virtual nodes (vnodes)

The Ring

Virtual Node Placement

The Ring

Data Placement
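
A rough sketch of ring-based data placement, in the style of Dynamo: the hash space is cut into equal partitions (vnodes), partitions are handed to physical nodes, and a key is stored on the N partitions starting at the one its hash falls into. The `Ring` class, the round-robin claim, and the default of 64 partitions are simplifying assumptions for illustration, not Riak's exact algorithm.

```python
import hashlib

RING_SIZE = 2 ** 160  # size of the SHA-1 output space

class Ring:
    def __init__(self, nodes, num_partitions=64):
        self.num_partitions = num_partitions
        self.partition_size = RING_SIZE // num_partitions
        # Hand out partitions (vnodes) to physical nodes round-robin.
        self.owners = [nodes[i % len(nodes)] for i in range(num_partitions)]

    def partition_for(self, key):
        """Index of the partition whose hash range contains the key."""
        h = int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")
        return h // self.partition_size

    def preference_list(self, key, n=3):
        """The n consecutive (partition, owner) pairs that hold the key."""
        first = self.partition_for(key)
        indexes = [(first + i) % self.num_partitions for i in range(n)]
        return [(idx, self.owners[idx]) for idx in indexes]
```

Because neighbouring partitions belong to different physical nodes, the three replicas of a key end up on three different machines as long as the cluster has at least three nodes.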

Fault Tolerance

Hinted Handoff

Fallback Virtual Nodes

Hinted Handoff
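
Hinted handoff can be sketched as follows: when a primary replica is down, a fallback node accepts the write along with a hint naming the intended owner, and hands the data back once the owner returns. The `Node`, `write`, and `handoff` names are hypothetical, and the model is in-memory only.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}    # key -> value this node owns
        self.hints = {}   # intended owner name -> {key: value} held for it

def write(key, value, primaries, fallbacks):
    """Write to every primary; park the write on a fallback if one is down."""
    spare = iter([f for f in fallbacks if f.up])
    for node in primaries:
        if node.up:
            node.data[key] = value
            continue
        fallback = next(spare, None)
        if fallback is None:
            continue  # no fallback left: this replica simply misses the write
        fallback.hints.setdefault(node.name, {})[key] = value

def handoff(fallback, nodes_by_name):
    """Return hinted data to any owner that is reachable again."""
    for owner_name in list(fallback.hints):
        owner = nodes_by_name[owner_name]
        if owner.up:
            owner.data.update(fallback.hints.pop(owner_name))
```

The write still reaches the full replica count while a primary is unavailable, which is how the cluster keeps accepting operations during partial failure.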

Read Repair

Replicas

Partial Failure

Divergence

Read Repair
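
A minimal sketch of read repair: a read gathers the replies from all replicas, returns the newest, and writes it back to any replica that was stale or missing the key. To keep the example short, versions are plain integers here; that is an assumption of the sketch, since the talk later introduces vector clocks for this purpose.

```python
def read_with_repair(key, replicas):
    """replicas: list of dicts mapping key -> (version, value)."""
    responses = [replica.get(key) for replica in replicas]
    found = [resp for resp in responses if resp is not None]
    if not found:
        return None
    newest = max(found, key=lambda resp: resp[0])
    # Repair divergent replicas as a side effect of the read.
    for replica, resp in zip(replicas, responses):
        if resp != newest:
            replica[key] = newest
    return newest[1]
```

Data that is read often is therefore repaired often; data that is rarely read needs a background process to converge.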

Active Anti-Entropy

Merkle Tree

Compare Trees

Repair Trees
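
The tree comparison can be sketched like this: each replica hashes its keys into a fixed number of buckets, builds a hash tree over the buckets, and two replicas then walk the trees from the root, descending only where hashes differ. The helper names, SHA-256, and the bucket count of 8 are assumptions for this example.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def bucket_of(key, num_leaves=8):
    """Leaf bucket that a key is assigned to."""
    return int.from_bytes(_h(key.encode("utf-8")), "big") % num_leaves

def build_tree(store, num_leaves=8):
    """Return levels of hashes: levels[0] are leaves, levels[-1] is [root].

    num_leaves must be a power of two.
    """
    buckets = [[] for _ in range(num_leaves)]
    for key in sorted(store):
        buckets[bucket_of(key, num_leaves)].append(f"{key}={store[key]}")
    level = [_h("\n".join(bucket).encode("utf-8")) for bucket in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def differing_leaves(a, b):
    """Walk down from the root, following only subtrees whose hashes differ."""
    suspects = [0]
    for depth in range(len(a) - 1, 0, -1):
        suspects = [child
                    for i in suspects if a[depth][i] != b[depth][i]
                    for child in (2 * i, 2 * i + 1)]
    return [i for i in suspects if a[0][i] != b[0][i]]
```

Identical replicas are detected by comparing a single root hash, and a single divergent key narrows the repair down to one bucket instead of a full scan.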

Fault Tolerance

• Read Repair

• Active Anti-Entropy

• Hinted Handoff

Eventual Consistency

CAP Theorem

“A shared-data system can have at most two of the three following properties: Consistency, Availability, and tolerance to network Partitions.” - Dr. Eric Brewer

On Consistency

• ACID Consistency: Any transaction or operation will bring the database from one valid state to another

• CAP Consistency: All nodes see the same data at the same time (synchrony)

On Partition Tolerance

• The network will be allowed to lose arbitrarily many messages sent from one node to another.

• Database systems, in order to be useful, must communicate over the network

• Clients count

There is no such thing as a 100% reliable network:

Can’t choose CA

http://codahale.com/you-cant-sacrifice-partition-tolerance

Very “AP”

Weak Consistency

“This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.”

Definition of “Eventual Consistency” from “Eventually Consistent - Revisited” - Werner Vogels

Tunable CAP Controls

• R (Read Acks) tunable: Default Quorum

• W (Write Acks) tunable: Default Quorum

• PR (Primary Read Acks) tunable: Default 0

• PW (Primary Write Acks) tunable: Default 0

• N (replicas) tunable: Default 3

Strong Eventual Consistency PW+PR>N
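
The arithmetic behind these knobs is a pigeonhole argument: if the number of read acks plus the number of write acks exceeds N, every read set must share at least one replica with every write set. The sketch below checks that by brute force; the function names are made up, and `default_quorum` assumes "quorum" means a simple majority, floor(N/2) + 1.

```python
from itertools import combinations

def default_quorum(n):
    """Majority of n replicas."""
    return n // 2 + 1

def quorums_always_overlap(n, r, w):
    """True if every r-replica read set intersects every w-replica write set."""
    replicas = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(replicas, r)
               for write_set in combinations(replicas, w))
```

With N = 3 and the default quorum of 2 for both reads and writes, 2 + 2 > 3, so a read always touches at least one replica that acknowledged the latest write. With R = W = 1 the sets can miss each other entirely.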

How do you even use this?

Vector Clocks

• Extension of Lamport Clocks

• Used to detect cause and effect in distributed systems

• Can determine concurrency of events, and causality violations
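
A small sketch of vector clocks as dictionaries from actor ID to counter. The function names are chosen for this example; the comparison rule (one clock descends from another when it is greater than or equal in every entry) is the standard one.

```python
def increment(clock, actor):
    """Return a copy of the clock with the actor's counter advanced by one."""
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new

def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def compare(a, b):
    """Classify a relative to b: equal, after, before, or concurrent."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"
    if b_ge_a:
        return "before"
    return "concurrent"

def merge(a, b):
    """Pointwise maximum: the smallest clock that descends from both."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}
```

Two writes whose clocks compare as "concurrent" are siblings: neither caused the other, so the database cannot pick a winner without losing data and the application (or a CRDT) has to resolve them.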

CRDTs

• CRDTs:

• Convergent Replicated Data Types

• Commutative Replicated Data Types

• Enable data structures to remain writable on both sides of a partition, and to replay changes after the partition heals

• Enable distributed computation across monotonic functions

• Two Types:

• CvRDTs

• CmRDTs

CRDTs

CvRDTs

• State / value based CRDTs

• Minimal state

• Don’t require active garbage collection

Set CvRDT
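
The simplest state-based set is the grow-only set, where merging two replicas is just set union. The `GSet` class below is an illustrative sketch, not a library API.

```python
class GSet:
    """Grow-only set: state-based CRDT whose merge is set union."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def add(self, item):
        return GSet(self.items | {item})

    def merge(self, other):
        # Union is commutative, associative, and idempotent, so replicas
        # converge no matter how often or in what order states are exchanged.
        return GSet(self.items | other.items)

    def value(self):
        return set(self.items)
```

Each side of a partition can keep adding elements, and exchanging full states afterwards yields the same set on every replica. Supporting removal needs extra metadata, as in the observe-remove set.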

CmRDTs

• Op / method based CRDTs

• Size grows monotonically

• Uses version vectors to determine order of operations

Counter CmRDT

CRDTs in the Wild

• Sets

• Observe-remove set

• Grow-only sets

• Counters

• Grow-only counters

• PN-Counters

• Flags

• Maps
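
As an example from the list above, a PN-Counter can be sketched as two grow-only counters, one for increments and one for decrements, each keeping a per-actor total and merging by pointwise maximum. The class below is an illustration with made-up names.

```python
def _pointwise_max(a, b):
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}

class PNCounter:
    """State-based counter built from two grow-only counters."""

    def __init__(self, p=None, n=None):
        self.p = dict(p or {})  # per-actor increment totals
        self.n = dict(n or {})  # per-actor decrement totals

    def increment(self, actor, amount=1):
        self.p[actor] = self.p.get(actor, 0) + amount

    def decrement(self, actor, amount=1):
        self.n[actor] = self.n.get(actor, 0) + amount

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        return PNCounter(_pointwise_max(self.p, other.p),
                         _pointwise_max(self.n, other.n))
```

Each actor only ever raises its own entries, so merging in any order, any number of times, gives the same total.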

Data structures that are CRDTs

• Probabilistic, convergent data structures

• HyperLogLog

• Bloom filter

• Co-recursive folding functions

• Maximum-counter

• Running Average

• Operational Transform

CRDTs

• Incredibly powerful primitive

• Not only useful for in-database manipulation but client-database interaction

• You can compose them, and build your own

• Garbage collection is tricky

RAMP: Read Atomic Multi-Partition Transactions

Multikey Transaction

Potential Consistency Violation

Add Metadata

Uncommitted State

Committed State

Have your availability and consistency too

RAMP

Eventual Consistency in the WAN

Low-latency everywhere

Write Anywhere: Beat the speed of light

MDC Replication

Hybrid Topologies

Bidirectional Replication

Unidirectional Replication

Replication Hooks

Tied Writes

Hook on Replication

Replicate Hook Return Data

Build for WAN locality

Eventual Consistency In Summary

Invariant            Operation   AP / CP
Specify unique ID    Any         CP
Generate unique ID   Any         AP
>                    INCREMENT   AP
>                    DECREMENT   CP
<                    INCREMENT   CP
<                    DECREMENT   AP
Secondary Index      Any         AP
Materialized View    Any         AP
AUTO_INCREMENT       INSERT      CP
Linearizability      CAS         CP

Operations Requiring Weak Consistency vs. Strong Consistency

BASE not ACID

• Basically Available: There will be a response per request (failure or success)

• Soft State: Any two reads against the system may yield different data (when measured against time)

• Eventually Consistent: The system will eventually become consistent when all failures have healed, and time goes to infinity

Deploying Riak

AWS Deployment

• 6 x i2.4xlarge

• 732GB of RAM

• 19TB of storage

• 960,000 IOPs

• 96 vCPUs

• 3 x Replication

• 10 Gigabit networking

• 99.9999999997% availability

$74,790/yr

Real World Use Case

Ad Network

• Sell targeted ads with minimum latency

• Two datasets:

• Ads

• Users

Deployment

Overselling Ads is Okay

Choose Random Ad Based on Weight of

Outstanding Impressions

Batch System: Generates targeted ads in an offline process

Ad Graph

Ad Store

Initial Visit

Fetch All Ads

Choose Ad: Based upon weighted random selection

Decrement Value
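
The choose-and-decrement step can be sketched as below, assuming the fetched ads are available as a dictionary from ad ID to outstanding impressions. The `choose_ad` and `serve` names are hypothetical; in a real deployment the decrement would be a counter update in the database rather than a local dictionary write.

```python
import random

def choose_ad(outstanding, rng=random):
    """Pick an ad with probability proportional to its outstanding impressions."""
    ads = [ad for ad, left in outstanding.items() if left > 0]
    if not ads:
        return None
    weights = [outstanding[ad] for ad in ads]
    return rng.choices(ads, weights=weights, k=1)[0]

def serve(outstanding, rng=random):
    """Choose an ad and decrement its outstanding impressions."""
    ad = choose_ad(outstanding, rng)
    if ad is not None:
        outstanding[ad] -= 1
    return ad
```

Weighting by outstanding impressions means ads with more inventory left are shown more often, so all ads tend to run out at around the same time.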

Test Model

• 50 actors

• 5 ads with inventory between 1,000 and 1,200

• Each actor randomly gets between 1 and 3 choices per round

• Rounds continue until entire inventory is exhausted
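
The test model above can be approximated with a short simulation. One assumption is added here that the slides do not spell out: every actor in a round chooses from the same snapshot of the counters taken at the start of that round, which stands in for stale reads under eventual consistency. The `simulate` function and its parameters are illustrative.

```python
import random

def simulate(num_actors=50, num_ads=5, seed=0):
    """Return the outstanding impressions per ad after each round."""
    rng = random.Random(seed)
    outstanding = {f"Ad {i + 1}": rng.randint(1000, 1200)
                   for i in range(num_ads)}
    history = [dict(outstanding)]
    while any(left > 0 for left in outstanding.values()):
        snapshot = dict(outstanding)  # stale view shared by all actors
        live = [ad for ad, left in snapshot.items() if left > 0]
        weights = [snapshot[ad] for ad in live]
        for _ in range(num_actors):
            picks = rng.choices(live, weights=weights, k=rng.randint(1, 3))
            for ad in picks:
                outstanding[ad] -= 1  # decrements always apply, even below zero
        history.append(dict(outstanding))
    return history
```

Because actors act on stale counts, an ad can be served after its inventory is gone and the counter can dip below zero. That is the overselling the earlier slide accepts as okay.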

[Figure: Test Model. Outstanding Impressions vs. Round Number (rounds 1 to 76) for Ad 1 through Ad 5]

Garbage Collection: Utilizes secondary indexes in a batch process to delete exhausted ads from user records

Ad Serving

• Requires batch generation of targets

• Requires external GC

• Allows for multidatacenter operation

In Summary

Riak

Distributed

Fault-Tolerant

Scalable

[Figure: Scalability vs. Processors]

Toolchest

Why Distributed Databases?

Sargun Dhillon

@Sargun