Copyright 2015 – Noah Mendelsohn

Consistency and Scalability

Noah Mendelsohn
Tufts University
Email: noah@cs.tufts.edu
Web: http://www.cs.tufts.edu/~noah

COMP 150-IDS: Internet Scale Distributed Systems (Spring 2015)



What you should get from today’s session

You will explore challenges relating to maintaining data consistency in a computing system

You will learn about techniques used to make storage systems more reliable

You will learn about transactions and their implementation using logs

You will learn about the CAP theorem and why scaling and consistency tend not to come together


A note about scope

The challenges & principles we cover today reappear at every level of system design
– CPU instruction set and memory
– Parallel programming languages
– Single machine databases
– Distributed applications and databases

Today we will focus mainly on larger scale systems


Why Worry About Consistency?


Duplicate information in computing systems

Why complicate things?
– Mirrored disks for reliability
– Parallel processing for higher throughput
– Geographic distribution reduces network delay (one copy each in Europe, Asia, US)
– Higher availability: if the network crashes, each “partition” may still have a copy

Inter-dependent data
– Bank account records have a total for each account
– Bank record keeps a total for all accounts

Memory hierarchies
– CPU caches, file system caches, Web proxies, etc.

If we allow updates, then maintaining consistency is tricky

Simple Examples: Parallel Disk Systems

Mirrored disks

[Figure: a logical disk holding block X, implemented as two mirrored disks that each hold a copy of X]

Everything written twice

Better performance on reads (slower on writes)

Duplicate data and crash recovery

[Figure: the same mirrored pair after one of the two disks crashes; the other disk still holds X]

After a crash, data survives

Mirrored disks

[Figure: the mirrored pair with the failed disk replaced]

Replacement drive can be reconstructed in the background
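The three mirrored-disk slides above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: the class name `MirroredDisk` and its methods are hypothetical, drives are modeled as Python lists, and the rebuild here runs immediately rather than in the background.

```python
class MirroredDisk:
    """Toy mirrored disk: every block is written to both drives."""

    def __init__(self, num_blocks):
        # Each drive is modeled as a list of blocks; None marks a failed drive.
        self.drives = [[b""] * num_blocks, [b""] * num_blocks]

    def write(self, block_no, data):
        # Everything is written twice (skipping a drive that has failed).
        for drive in self.drives:
            if drive is not None:
                drive[block_no] = data

    def read(self, block_no):
        # Any surviving drive can serve a read.
        for drive in self.drives:
            if drive is not None:
                return drive[block_no]
        raise IOError("both drives have failed")

    def fail(self, drive_no):
        self.drives[drive_no] = None

    def replace_and_rebuild(self, drive_no):
        # A replacement drive is filled by copying from the survivor.
        survivor = self.drives[1 - drive_no]
        if survivor is None:
            raise IOError("no surviving drive to copy from")
        self.drives[drive_no] = list(survivor)


disk = MirroredDisk(num_blocks=4)
disk.write(2, b"X")
disk.fail(0)                      # crash of one drive
after_crash = disk.read(2)        # data survives on the mirror
disk.replace_and_rebuild(0)
rebuilt_copy = disk.drives[0][2]  # the new drive holds X again
```

The caller sees a single logical disk throughout; the duplication is hidden inside `write` and `read`.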

REVIEW: How is the disk used in Unix / Linux?

[Figure: disk access paths inside the Unix kernel. Labels recovered from the figure:]
– Application
– Filesystem: files/dirs, security, etc.
– In-memory block cache: buffered block r/w, hides timing
– Block device driver: direct read/write of filesystem “blocks” (hides sector size and device geometry)
– Raw device driver: access by cylinder/track/sector
– Sector

We can use mirrored disks with Unix

[Figure: the same Unix kernel picture with a mirrored device driver and a mirrored implementation. Labels recovered from the figure:]
– Application
– Filesystem: files/dirs, security, etc.
– In-memory block cache: buffered block r/w, hides timing
– Block device driver
– MIRRORED device driver
– Mirrored implementation

Abstraction: The mirrored disk provides the same service as a single disk…just faster and more reliable!

Atomicity and update synchronization

[Figure: a logical disk holding block X and its mirrored implementation with two copies of X]

Mirrored writes DO NOT happen at quite the same time

Question: when is the update committed?

RAID – Redundant Arrays of Inexpensive Disks

[Figure: a logical disk holding block X and its RAID implementation across several disks]

RAID – Redundant Arrays of Inexpensive Disks

[Figure: a logical disk holding blocks X and Y; the RAID implementation stores X, Y, and XOR(X,Y)]

RAID – Redundant Arrays of Inexpensive Disks

[Figure: a logical disk holding blocks X, Y, and Z; the RAID implementation stores X, Y, Z, and XOR(X,Y,Z)]

Much less space overhead than mirroring…but typically slower

RAID – Redundant Arrays of Inexpensive Disks

[Figure: the same RAID implementation of X, Y, Z, and XOR(X,Y,Z) after one disk crashes]

If any disk is lost…you can reconstruct from information on the others!
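The reconstruction claim follows from XOR being its own inverse: XOR-ing the surviving blocks with the parity block yields the missing one. A minimal sketch, assuming three data blocks of equal length; the helper `xor_blocks` and the sample byte values are made up for illustration.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


x, y, z = b"\x0f\xa0", b"\x33\x11", b"\xc4\x5e"
parity = xor_blocks(x, y, z)            # stored on the extra disk

# Suppose the disk holding y is lost: XOR the survivors with the parity.
recovered_y = xor_blocks(x, z, parity)
```

The same call recovers any single lost block, including the parity block itself, which is why one extra disk is enough to survive one failure.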

Why Consistency is Hard

Synchronization problem

Some code to add money to my account:

    NA = Access Noah’s Bank account
    Bal = NA.Balance;
    NewBalance = Bal + $1000
    NA.Balance.Write NewBalance

Let’s run code for two deposits in parallel (two copies of the code above running at once)

Can you see the problem?

There’s a risk that both copies will read the balance before either writes its update. If that happens, I only get $1000 not $2000!
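The bad interleaving can be written out by hand, one step at a time. This sketch is not real concurrency: it simply performs the read and write steps of two deposits in the unlucky order. The starting balance of 100 is an assumption borrowed from the later logging slides.

```python
account = {"balance": 100}

# Two deposits, each split into the slide's read step and write step.
bal_a = account["balance"]          # deposit A reads 100
bal_b = account["balance"]          # deposit B also reads 100, before A writes
account["balance"] = bal_a + 1000   # A writes 1100
account["balance"] = bal_b + 1000   # B overwrites with 1100: A's deposit is lost

lost_update_balance = account["balance"]
```

Two deposits of $1000 ran, yet the balance rose by only $1000.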

Solution - locking

Some code to add money to my account:

    Lock Noah’s Bank Account
    NA = Access Noah’s Bank account
    Bal = NA.Balance;
    NewBalance = Bal + $1000
    NA.Balance.Write NewBalance
    Unlock Noah’s Bank Account

Now the two copies can’t run at once on the same account…but if each locks a different bank account they can.

Only one transaction or thread can hold the lock at a time
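A minimal sketch of the locking idea using Python threads, assuming one lock per account. The dictionary layout and the `deposit` function are illustrative names, not part of the slides.

```python
import threading

balances = {"noah": 100}
locks = {"noah": threading.Lock()}   # one lock per account


def deposit(account, amount):
    # Only one thread at a time can hold this account's lock, so the
    # read and the write below cannot interleave with another deposit.
    with locks[account]:
        bal = balances[account]
        balances[account] = bal + amount


threads = [threading.Thread(target=deposit, args=("noah", 1000)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

locked_balance = balances["noah"]
```

Because the lock is per account, deposits to different accounts would still run in parallel, as the slide notes.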

Consistency and Crash Recovery

Some code to transfer money:

    NA = Access Noah’s Bank account
    YA = Access Your Bank account
    Nbal = NA.Balance;
    Ybal = YA.Balance;
    Nbal += $1000;
    Ybal -= $1000;
    NA.Balance.Write Nbal
    YA.Balance.Write Ybal      (this write gets lost during the crash)

Can you see the problem?

If the system crashes just after writing my balance, the bank loses $1000 (it’s still in your account too)


Transactions

Transactions: automated consistency & crash recovery!

Some code to transfer money:

    BEGIN_TRANSACTION
    NA = Access Noah’s Bank account
    YA = Access Your Bank account
    Nbal = NA.Balance;
    Ybal = YA.Balance;
    Nbal += $1000;
    Ybal -= $1000;
    NA.Balance.Write Nbal
    YA.Balance.Write Ybal
    END_TRANSACTION

The system guarantees that either everything in the transaction happens, or nothing…and it guarantees more!
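The all-or-nothing guarantee can be tried out with Python's built-in `sqlite3` module, whose connection object commits when a `with` block finishes and rolls back when it raises. This is a sketch under assumptions: the `accounts` table, the `transfer` function, and the `crash_midway` flag are invented for illustration, and a raised exception only stands in for a failure between the two writes, not for a real machine crash.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("noah", 100), ("you", 1300)])
conn.commit()


def transfer(conn, src, dst, amount, crash_midway=False):
    # "with conn" commits if the block finishes and rolls back if it raises.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE owner = ?", (amount, dst))
        if crash_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE owner = ?", (amount, src))


def read_balances(conn):
    return dict(conn.execute("SELECT owner, balance FROM accounts"))


try:
    transfer(conn, "you", "noah", 1000, crash_midway=True)
except RuntimeError:
    pass
after_failure = read_balances(conn)   # neither write is visible

transfer(conn, "you", "noah", 1000)
after_success = read_balances(conn)   # both writes are visible
```

The application code contains no undo logic of its own; the rollback is done by the database.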

ACID Properties of a Transaction

Atomicity
– Everything happens or nothing

Consistency
– If the database has rules they are obeyed at transaction end (e.g. balance must be < $1,000,000)

Isolation
– Any two parallel transactions act as if serial
– Most transaction systems do the locking automatically!

Durability
– Once committed, never lost

That seems almost magic…how can we achieve all this?

How to implement transactions - logging

The key idea: a shared log records information needed to undo any change made by any transaction

When a transaction commits:
– All data is written to the main data store
– A commit record is written to the log. This is the atomic point at which the transaction “happens”

After a crash, the log is “replayed”
– For any transactions that did not commit, the undo operations are performed
– After recovery, only committed operations have happened!

When combined with transaction driven locking, we can automatically support ACID properties with almost no application code complexity

This is all built into SQL databases like Oracle, Postgres, DB2, and SQL Server

Logging and transaction processing are two of the most important and beautiful data processing technologies

Logging in Action

Some code to transfer money:

    BEGIN_TRANSACTION
    NA = Access Noah’s Bank account
    YA = Access Your Bank account
    Nbal = NA.Balance;
    Ybal = YA.Balance;
    Nbal += $1000;
    Ybal -= $1000;
    NA.Balance.Write Nbal
    YA.Balance.Write Ybal
    END_TRANSACTION

Data: Noah.Bal = $100, Your.Bal = $1300

Logging in Action

(Same transfer code as on the previous slide.)

Data: Noah.Bal = $100, Your.Bal = $1300

Log:
– Begin Trans 1

Logging in Action

(Same transfer code as on the previous slides.)

Data: Noah.Bal = $1100, Your.Bal = $1300

Log:
– Begin Trans 1
– Old Noah Bal = $100

Logging in Action

(Same transfer code as on the previous slides.)

Data: Noah.Bal = $1100, Your.Bal = $1300

Log:
– Begin Trans 1
– Old Noah Bal = $100
– Old Your Bal = $1300

Logging in Action

(Same transfer code as on the previous slides.)

Data: Noah.Bal = $1100, Your.Bal = $300

Log:
– Begin Trans 1
– Old Noah Bal = $100
– Old Your Bal = $1300
– Commit Tr 1

Logging in Action

(Same transfer code as on the previous slides.)

Data: Noah.Bal = $1100, Your.Bal = $300

Log:
– Begin Trans 1
– Old Noah Bal = $100
– Old Your Bal = $1300
– Commit Tr 1

What if we crash while the data is inconsistent?

Logging in Action

(Same transfer code, starting again from the initial state.)

Data: Noah.Bal = $100, Your.Bal = $1300

Logging in Action

(Same transfer code as on the previous slides.)

Data: Noah.Bal = $100, Your.Bal = $1300

Log:
– Begin Trans 1

Logging in Action

(Same transfer code as on the previous slides.)

Data: Noah.Bal = $1100, Your.Bal = $1300

Log:
– Begin Trans 1
– Old Noah Bal = $100

Crash!

Recovery!

Data: Noah.Bal = $1100, Your.Bal = $1300

Log:
– Begin Trans 1
– Old Noah Bal = $100

When system restarts, data is inconsistent…

…but we can play the log to restore consistency!

Recovery!

Data: Noah.Bal = $1100, Your.Bal = $1300

Log:
– Begin Trans 1
– Old Noah Bal = $100

We notice that Transaction 1 never committed, so we apply all of its undo entries

Recovery!

Data: Noah.Bal = $100 (restored from $1100 by the undo entry), Your.Bal = $1300

Log:
– Begin Trans 1
– Old Noah Bal = $100

We notice that Transaction 1 never committed, so we apply all of its undo entries
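The crash and recovery walk-through above can be replayed in a few lines. This is a simplified sketch, not how a production database does it: the log and the data store are plain in-memory Python objects (a real log lives on durable storage), and the function names `begin`, `write`, `commit`, and `recover` are hypothetical.

```python
data = {"Noah.Bal": 100, "Your.Bal": 1300}   # main data store
log = []                                      # append-only log


def begin(tx):
    log.append(("begin", tx))


def write(tx, key, new_value):
    # Record the old value first, then change the main data.
    log.append(("undo", tx, key, data[key]))
    data[key] = new_value


def commit(tx):
    log.append(("commit", tx))


def recover():
    # Undo, newest first, every change made by a transaction
    # that has no commit record in the log.
    committed = {entry[1] for entry in log if entry[0] == "commit"}
    for entry in reversed(log):
        if entry[0] == "undo" and entry[1] not in committed:
            _, _, key, old_value = entry
            data[key] = old_value


begin(1)
write(1, "Noah.Bal", 1100)
state_at_crash = dict(data)    # crash here: Your.Bal not yet written, no commit
recover()
state_after_recovery = dict(data)
```

Had `commit(1)` been logged before the crash, `recover` would have left both balances alone, which is the sense in which the commit record is the point where the transaction "happens".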

Logging – keeping consistency after crashes

The key idea: a shared log records information on how to undo any change to the main data

When a transaction commits:
– All data is written to the main data store
– A commit record is written to the log. This is the atomic point at which the transaction “happens”

After a crash, the log is “replayed”
– For any transactions that did not commit, the undo operations are performed
– After recovery, only committed operations have happened!

When combined with locking, we can automatically support ACID properties with almost no application code complexity

This is all built into SQL databases like Oracle, Postgres, DB2, and SQL Server

Logging and transaction processing are two of the most important and beautiful data processing technologies

Full Disclosure

This explanation is highly simplified but the spirit is exactly right.

Examples of things not covered:

• Some databases use redo rather than undo logging, or log both old and new values

• Transactions can abort (a ROLLBACK record is logged instead of COMMIT)
  • Useful if programmer wants to give up
  • The system can abort a transaction if there is an error
  • The system can abort a transaction if locking has caused deadlock

• The same logs, if carefully designed, can be used to help with backup, recovery from disk drive failure, and synchronization of distributed systems.

Atomicity and hardware

Important: transactions are committed by an atomic hardware write to the log
– Before the commit is written, the transaction has not happened
– After it’s written all of its work is committed
– It all happens at once: atomically

Principle: Almost any computing activity that is to be done atomically must be achieved in a single atomic hardware operation!
– Store, test_and_set or compare_and_swap CPU instructions
– Write a disk block

When designing systems that require consistency, start by studying what your hardware can do atomically
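To illustrate building a larger atomic action on one small atomic primitive, here is a simulated compare_and_swap. Python does not expose the CPU instruction, so in this sketch a lock inside the hypothetical `AtomicCell` class stands in for the hardware's atomicity; only the retry loop in `deposit` is the point of the example.

```python
import threading


class AtomicCell:
    """Simulated memory word; the lock stands in for the hardware's atomicity."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: store `new` only if the cell still holds `expected`.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def deposit(cell, amount):
    # Retry until no other update slipped in between our read and our write.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + amount):
            return


balance = AtomicCell(100)
workers = [threading.Thread(target=deposit, args=(balance, 1000)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
final_balance = balance.load()
```

The whole read-modify-write becomes atomic because it succeeds or fails at the single compare_and_swap step.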


Consistency in Distributed Systems


Problem

In a distributed system, we want to do work in lots of places

To get consistency, we need to do an atomic update to the system state

Challenge: can we get consistency in a distributed system?


Can we get distributed consensus and consistency?

Yes! (but with some limitations)

First we need to think about how distributed systems fail…

…individual nodes can fail

…what if the network partitions?

In general, implementing transactions or other consistency guarantees in distributed systems is hard!


Network Partition

This network is fully connected

Network Partition

If these links break the network is partitioned

All computers are still up! Updates in one partition can’t be sent to the other.


Questions about failures in distributed systems

Can we support replicated data and maintain consistency?

Can we run distributed transactions in which work (updating accounts) is spread through the network and achieve consistency?

How can we do crash recovery?

How do we continue running when the network partitions?

Voting: a simple approach to replicated data

Copies of the same data can be kept at any or all nodes…but when reading you must use the value stored at a majority of nodes!

Network Partition

All computers are still up! Updates in one partition can’t be sent to the other.

During partition, only one group of nodes can be a majority…the other can’t proceed!
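The majority rule for reads can be sketched as follows. The function `majority_read`, the five-node cluster, and the values `"v1"` and `"v2"` are assumptions made for illustration; the sketch covers only the read rule from the slides and says nothing about how writes are coordinated.

```python
from collections import Counter


def majority_read(reachable_values, total_nodes):
    """Return the value held by a majority of ALL nodes, or None if we cannot tell."""
    needed = total_nodes // 2 + 1
    if not reachable_values:
        return None
    value, count = Counter(reachable_values).most_common(1)[0]
    return value if count >= needed else None


# Five nodes; the network splits into a group of three and a group of two.
total_nodes = 5
big_side = ["v2", "v2", "v2"]     # three nodes that saw the latest update
small_side = ["v1", "v1"]         # two nodes cut off before the update

big_side_result = majority_read(big_side, total_nodes)
small_side_result = majority_read(small_side, total_nodes)
```

The group of two can reach only two copies, which is less than the three needed, so it cannot answer even though all of its computers are still up.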


The Famous CAP Theorem

The CAP Theorem

When designing a system with distributed data you would like to have:
– Consistency: everyone agrees on the data
– Availability: nobody ever has to stop processing
– Partition tolerance: keep going even when the network partitions

The CAP theorem says: you can have any two simultaneously, but not all three!

If your network can partition, then either some nodes will have to stop working (no availability) or data may become inconsistent (other partition doesn’t see the updates)

Network Partition

[Figure: the partitioned network, with one partition shown in orange]

With the voting algorithm, only the orange partition can do work.

The CAP theorem explains why we can never build a system that does better, unless we are willing to sacrifice consistency.


Distributed Transactions

Distributed transactions: the challenge

What if our computation is distributed?

We still want ACID properties
– Atomicity
– Consistency
– Isolation
– Durability

Per the CAP theorem: let’s ignore partition for now

Amazingly, there are ways to do this:
– Isolation and Consistency: distributed lock managers
– Atomicity and Durability: Distributed Two Phase Commit (DTPC)


Distributed two phase commit

Allows a single transaction to be spread across multiple nodes

Logging is done at each node as for traditional transactions

Special protocol ensures atomic commit of distributed work

One of the great achievements of 20th century distributed computing research

Distributed Two Phase Commit

Node 1 logic:

    BEGIN_DISTRIBUTED_TRANSACTION
    NA = Access Noah’s Bank account
    Nbal = NA.Balance;
    Nbal += $1000;
    NA.Balance.Write Nbal
    COMMIT

Node 1 data: Noah.Bal = $100

Node 1 log:
– Begin Trans 1

Node 2 logic:

    JOIN_DISTRIBUTED_TRANSACTION
    YA = Access Your Bank account
    Ybal = YA.Balance;
    Ybal -= $1000;
    YA.Balance.Write Ybal

Node 2 data: Your.Bal = $1300

Node 2 log:
– Join Trans 1

Distributed Two Phase Commit

(Same Node 1 and Node 2 logic as on the previous slide.)

Node 1 data: Noah.Bal = $1100

Node 1 log:
– Begin Trans 1
– Old Noah Balance = $100

Node 2 data: Your.Bal = $300

Node 2 log:
– Join Trans 1
– Old Your Balance = $1300

Distributed Two Phase Commit

(Same Node 1 and Node 2 logic as on the previous slides.)

Messages between the nodes:
– “Are you prepared to commit?”
– “Yes, I am prepared”

Node 1 data: Noah.Bal = $1100

Node 1 log:
– Begin Trans 1
– Old Noah Balance = $100
– Prepared

Node 2 data: Your.Bal = $300

Node 2 log:
– Join Trans 1
– Old Your Balance = $1300
– Prepared

Distributed Two Phase Commit

(Same Node 1 and Node 2 logic as on the previous slides.)

Messages between the nodes:
– “Are you prepared to commit?”
– “Yes, I am prepared”

Node 1 data: Noah.Bal = $1100

Node 1 log:
– Begin Trans 1
– Old Noah Balance = $100
– Prepared

Node 2 data: Your.Bal = $300

Node 2 log:
– Join Trans 1
– Old Your Balance = $1300
– Prepared

Prepared means: if you ask me later to commit or abort I will be able to do either!

Distributed Two Phase Commit

(Same Node 1 and Node 2 logic as on the previous slides.)

Messages between the nodes:
– “Commit!”
– “Done”

Node 1 data: Noah.Bal = $1100

Node 1 log:
– Begin Trans 1
– Old Noah Balance = $100
– Prepared
– Commit

Node 2 data: Your.Bal = $300

Node 2 log:
– Join Trans 1
– Old Your Balance = $1300
– Prepared
– Commit


What happens if there is a crash?

If a node goes down before the commit, the master node writes an abort record and tells other nodes to abort

When any node comes up after a crash or after partition, it checks with master what has happened to any prepared transactions

Because prepared means it can go either way, that node can either record a commit or execute a rollback using data from the log

We can see the CAP theorem in action again: the algorithm stalls while the network is partitioned
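The prepare and commit phases, together with the abort path, can be sketched in a single process. This is a teaching sketch under heavy assumptions: direct method calls stand in for network messages, the `Participant` class and `two_phase_commit` function are hypothetical names, and there are no timeouts, no durable logs, and no restart handling.

```python
class Participant:
    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.log = []

    def prepare(self):
        # Voting yes promises that a later commit OR abort can both be honored.
        if self.can_prepare:
            self.log.append("prepared")
        return self.can_prepare

    def commit(self):
        self.log.append("commit")

    def abort(self):
        self.log.append("abort")


def two_phase_commit(master_log, participants):
    # Phase 1: ask every node whether it is prepared to commit.
    votes = [p.prepare() for p in participants]
    # The master's own log record is the single point where the outcome is decided.
    decision = "commit" if all(votes) else "abort"
    master_log.append(decision)
    # Phase 2: tell every node the outcome.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision


node1, node2 = Participant("node1"), Participant("node2")
ok_log = []
ok_outcome = two_phase_commit(ok_log, [node1, node2])

node3, node4 = Participant("node3"), Participant("node4", can_prepare=False)
bad_log = []
bad_outcome = two_phase_commit(bad_log, [node3, node4])
```

In the second run one node cannot prepare, so the master records an abort and the prepared node rolls back. If the master were unreachable between the two phases, a prepared node would have to wait, which is the stall the slide describes.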

Does Everyone use Distributed 2 Phase Commit?

In the late 1990s everyone thought DTPC would be the key to distributed data

In practice, systems like Amazon can’t stop in case of network partition or master node crash

Today:
– Massive but non-critical data stores do not even attempt perfect consistency: once in a while your Amazon shopping cart may lose things you’ve parked there
– Critical transactions (e.g. when you place your order and charge your credit card) are often recorded in less scalable but fully consistent (usually relational) databases


Summary

Keeping data consistent is important

Techniques like ACID transactions implemented with logs have been spectacularly successful

Consistency and scalability tend not to come together

Atomicity in software tends to require reduction to a single atomic operation in hardware

The CAP theorem says we can’t have Consistency, Availability and Partition tolerance all at once

Techniques like Voting and Distributed Two Phase Commit can achieve distributed consistency at the cost of availability

Many modern systems sacrifice consistency to achieve availability at massive scale
