1
Scaleable Replicated Databases
Jim Gray (Microsoft)
Pat Helland (Microsoft)
Dennis Shasha (Columbia)
Pat O’Neil (U.Mass)
2
Outline
Replication strategies
– Lazy and Eager
– Master and Group
How centralized databases scale
– deadlocks rise non-linearly with transaction size and concurrency
Replication systems are unstable on scaleup
A possible solution
3
Scaleup, Replication, Partition
N^2 more work
[Figure: four configurations]
– Base case: a 1 TPS system (one 1 TPS server, 100 users)
– Scaleup: to a 2 TPS centralized system (one 2 TPS server, 200 users)
– Partitioning: two 1 TPS systems (two 1 TPS servers, 100 users each, 0 tps exchanged between them)
– Replication: two 2 TPS systems (two 2 TPS servers, 100 users each, 1 tps sent in each direction)
4
Why Replicate Databases?
Give users a local copy for
– Performance
– Availability
– Mobility (they are disconnected)
But... What if they update it?
Must propagate updates to other copies
5
Propagation Strategies
Eager: Send update right away
– (part of same transaction)
– N times larger transactions
Lazy: Send update asynchronously
– separate transaction
– N times more transactions
Either way
– N times more updates per second per node
– N^2 times more work overall
[Figure: the transaction Write A, Write B, Write C, Commit repeated at each of three nodes, shown once for eager and once for lazy propagation]
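The "N times more updates per node, N^2 times more work overall" claim can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every node originates the same update rate and every update is applied at all N nodes. The function name `replication_load` and its parameters are hypothetical and do not come from the slides.

```python
def replication_load(nodes: int, tps_per_node: float, actions: int) -> dict:
    """Update load when every node's updates are applied at every node.

    Assumption (not from the slides): each node originates tps_per_node
    transactions per second, and each transaction makes `actions` updates.
    """
    local_updates = tps_per_node * actions      # updates a node originates itself
    per_node_updates = nodes * local_updates    # plus the updates of every other node
    total_updates = nodes * per_node_updates    # summed over all nodes
    return {
        "per_node_updates_per_sec": per_node_updates,
        "total_updates_per_sec": total_updates,
    }


one = replication_load(nodes=1, tps_per_node=1, actions=3)
ten = replication_load(nodes=10, tps_per_node=1, actions=3)

per_node_growth = ten["per_node_updates_per_sec"] / one["per_node_updates_per_sec"]
total_growth = ten["total_updates_per_sec"] / one["total_updates_per_sec"]
print(per_node_growth, total_growth)  # 10.0 100.0
```

The result is the same for eager and lazy propagation. Only the packaging differs: eager propagation uses one large transaction, lazy propagation uses N small ones.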
6
Update Control Strategies
Master
– Each object has a master node
– All updates start with the master
– Broadcast to the subscribers
Group
– Object can be updated by anyone
– Update broadcast to all others
Everyone wants Lazy Group:
– update anywhere, anytime, anyway
7
Quiz Questions: Name One
Eager
– Master: N-Plexed disks
– Group: ?
Lazy
– Master: Bibles, Bank accounts, SQLserver
– Group: Name servers, Oracle, Access...
Note: Lazy contradicts Serializable
– If two lazy updates collide, then ... reconcile
  – discard one transaction (or use some other rule)
  – Ask for human advice
Meanwhile, nodes disagree =>
– Network DB state diverges: System Delusion
8
Anecdotal Evidence
Update Anywhere systems are attractive
Products offer the feature
It demos well
But when it scales up
– Reconciliations start to cascade
– Database drifts “out of sync” (System Delusion)
What’s going on?
9
Outline
Replication strategies
– Lazy and Eager
– Master and Group
How centralized databases scale
– deadlocks rise non-linearly
Replication is unstable on scaleup
A possible solution
10
Simple Model of Waits
TPS transactions per second
Each
– Picks Actions records uniformly from set of DB_size records
– Then commits
About Transactions x Actions/2 resources locked
Chance a request waits is
$$\frac{\text{Transactions} \times \text{Actions}}{2 \times \text{DB\_size}}$$
Action rate is TPS x Actions
Active Transactions = TPS x Actions x Action_Time
Wait Rate = Action rate x Chance a request waits
$$\text{Wait Rate} = \frac{\text{TPS}^2 \times \text{Actions}^3 \times \text{Action\_Time}}{2 \times \text{DB\_size}}$$
10x more transactions, 100x more waits
[Figure: a pool of DB_size records, of which about Transactions x Actions / 2 are locked]
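The wait model can be evaluated numerically to see the scaling. The sketch below only restates the slide's quantities (TPS, Actions, Action_Time, DB_size). The function name `wait_rate` and the sample parameter values are illustrative assumptions.

```python
def wait_rate(tps: float, actions: float, action_time: float, db_size: float) -> float:
    """Waits per second under the slide's uniform-access model."""
    transactions = tps * actions * action_time                 # active transactions
    # Each active transaction holds about actions/2 locks on average.
    p_request_waits = transactions * actions / (2 * db_size)
    action_rate = tps * actions                                # lock requests per second
    return action_rate * p_request_waits


# Illustrative parameters only (not from the slides).
base = wait_rate(tps=100, actions=5, action_time=0.01, db_size=1_000_000)
ten_x_tps = wait_rate(tps=1_000, actions=5, action_time=0.01, db_size=1_000_000)
ten_x_actions = wait_rate(tps=100, actions=50, action_time=0.01, db_size=1_000_000)

tps_growth = ten_x_tps / base          # about 100: square of concurrency
actions_growth = ten_x_actions / base  # about 1,000: cube of transaction size
print(round(tps_growth), round(actions_growth))
```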
11
Simple Model of Deadlocks
A deadlock is a wait cycle
Cycle of length 2:
– Wait rate x Chance waitee waits for waiter
– Wait rate x (P(wait) / Transactions)
$$\text{Deadlock Rate} = \frac{\text{TPS}^2 \times \text{Actions}^3 \times \text{Action\_Time}}{2 \times \text{DB\_size}} \times \frac{\text{TPS} \times \text{Actions}^3 \times \text{Action\_Time}}{2 \times \text{DB\_size}} \times \frac{1}{\text{TPS} \times \text{Actions} \times \text{Action\_Time}} = \frac{\text{TPS}^2 \times \text{Actions}^5 \times \text{Action\_Time}}{4 \times \text{DB\_size}^2}$$
Cycles of length 3 are PW^3, so ignored.
10x bigger trans = 100,000x more deadlocks
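The length-2 cycle estimate can be built from the same quantities as the wait model. This sketch reads P(wait) as the chance that a whole transaction waits, that is, Actions times the per-request chance. That reading is an assumption, chosen because it makes the slide's factors multiply out. The function name `deadlock_rate` and the sample values are hypothetical.

```python
def deadlock_rate(tps: float, actions: float, action_time: float, db_size: float) -> float:
    """Deadlocks per second, counting only wait cycles of length 2."""
    transactions = tps * actions * action_time                 # active transactions
    p_request_waits = transactions * actions / (2 * db_size)   # per lock request
    wait_rate = tps * actions * p_request_waits                # waits per second
    # Assumed reading of P(wait): chance that a transaction waits at all.
    p_transaction_waits = actions * p_request_waits
    # The waitee must in turn be waiting for this particular waiter.
    return wait_rate * p_transaction_waits / transactions


# Illustrative parameters only (not from the slides).
params = dict(tps=100, actions=5, action_time=0.01, db_size=1_000_000)
base = deadlock_rate(**params)
closed_form = (params["tps"] ** 2 * params["actions"] ** 5 * params["action_time"]
               / (4 * params["db_size"] ** 2))

bigger = deadlock_rate(**{**params, "actions": 50})
growth = bigger / base   # about 100,000: fifth power of transaction size
print(round(growth))
```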
12
Summary So Far
Even centralized systems unstable
Waits:
– Square of concurrency
– 3rd power of transaction size
Deadlock rate
– Square of concurrency
– 5th power of transaction size
[Figure: two plots, each over the axes Trans Size and Concurrency]
13
Outline
Replication strategies
How centralized databases scale
Replication is unstable on scaleup
– Eager (master & group)
– Lazy (master & group & disconnected)
A possible solution
14
Eager Transactions are FAT
If N nodes, eager transaction is Nx bigger
– Takes Nx longer
– 10x nodes, 1,000x deadlocks
– (derivation in paper)
Master slightly better than group
Good news:
– Eager transactions only deadlock
– No need for reconciliation
[Figure: Write A, Write B, Write C, Commit repeated at each of three nodes]
15
Lazy Master & Group
Use optimistic concurrency control
– Keep transaction timestamp with record
– Updates carry old+new timestamp
– If record has old timestamp
  – set value to new value
  – set timestamp to new timestamp
– If record does not match old timestamp
  – reject lazy transaction
– Not SNAPSHOT isolation (stale reads)
Reconciliation:
– Some nodes are updated
– Some nodes are “being reconciled”
[Figure: A Lazy Transaction. The transaction Write A, Write B, Write C, Commit is repeated at three nodes; each lazy transaction carries TRID, Timestamp (the new timestamp) and, for each update, OID, old time, new value]
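The timestamp check described on this slide can be sketched as follows. This is a minimal illustration, not the mechanism of any particular product. The class names `Record` and `LazyUpdate`, the function `apply_lazy_transaction`, and the choice to validate every update before applying any of them are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: object
    timestamp: int          # timestamp of the transaction that last wrote it


@dataclass
class LazyUpdate:
    oid: str                # object identifier
    old_timestamp: int      # timestamp the originating node saw before writing
    new_timestamp: int      # timestamp of the originating transaction
    new_value: object


def apply_lazy_transaction(replica: dict, updates: list) -> bool:
    """Apply a lazy transaction at a replica, or reject it as a whole.

    Assumption: all updates are checked first so that a rejected
    transaction leaves the replica untouched.
    """
    for update in updates:
        if replica[update.oid].timestamp != update.old_timestamp:
            return False    # someone else updated the record: needs reconciliation
    for update in updates:
        record = replica[update.oid]
        record.value = update.new_value
        record.timestamp = update.new_timestamp
    return True


replica = {"A": Record(value="a0", timestamp=1)}
first = apply_lazy_transaction(replica, [LazyUpdate("A", 1, 2, "a1")])
# A second node updated A from the same old state; its lazy update arrives late.
second = apply_lazy_transaction(replica, [LazyUpdate("A", 1, 3, "a-other")])
print(first, second, replica["A"])
```

The second transaction is rejected because the record no longer carries the old timestamp it expected. That rejected transaction is the case that has to be reconciled.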
16
Reconciliation
Reconciliation means System Delusion
– Data inconsistent with itself and reality
How frequent is it?
Lazy transactions are not fat
– but N times as many
– Eager waits become Lazy reconciliations
– Rate is:
$$\text{Reconciliation Rate} = \frac{\text{TPS}^2 \times (\text{Actions} \times \text{Nodes})^3 \times \text{Action\_Time}}{2 \times \text{DB\_size}}$$
– Assuming everyone is connected
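The reconciliation rate has the same shape as the single-node wait rate, with Actions replaced by Actions x Nodes. A short sketch makes the cubic growth in the number of nodes visible. The function name `reconciliation_rate` and the sample values are illustrative assumptions.

```python
def reconciliation_rate(tps: float, actions: float, nodes: float,
                        action_time: float, db_size: float) -> float:
    """Reconciliations per second for lazy replication, all nodes connected."""
    replicated_actions = actions * nodes   # each update is replayed at every node
    return tps ** 2 * replicated_actions ** 3 * action_time / (2 * db_size)


# Illustrative parameters only (not from the slides).
single = reconciliation_rate(tps=100, actions=5, nodes=1,
                             action_time=0.01, db_size=1_000_000)
ten_nodes = reconciliation_rate(tps=100, actions=5, nodes=10,
                                action_time=0.01, db_size=1_000_000)

node_growth = ten_nodes / single   # about 1,000: cube of the number of nodes
print(round(node_growth))
```

With `nodes=1` the expression reduces to the wait rate of the centralized model, which matches the slide's remark that eager waits become lazy reconciliations.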
17
Eager & Lazy: Disconnected
Suppose mobile nodes disconnected for a day
When reconnect:
– get all incoming updates
– send all delayed updates
Incoming is: Nodes x TPS x Actions x Disconnect_Time
Outgoing is: TPS x Actions x Disconnect_Time
Conflicts are intersection of these two sets
$$\frac{\text{Disconnect\_Time} \times (\text{TPS} \times \text{Actions} \times \text{Nodes})^2}{\text{DB\_size}}$$
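The disconnected case can be worked through numerically. The sketch below computes the incoming and outgoing update counts from the slide and the resulting conflict expression. The consistency check at the end rests on two assumptions that are not stated on the slide: that the overlap of two uniformly chosen update sets is roughly incoming x outgoing / DB_size, and that each of the Nodes nodes reconnects once per Disconnect_Time. All function names and sample values are hypothetical.

```python
import math


def disconnected_updates(tps: float, actions: float, nodes: float,
                         disconnect_time: float) -> tuple:
    """Updates queued in each direction while one node is disconnected."""
    outgoing = tps * actions * disconnect_time   # this node's delayed updates
    incoming = nodes * outgoing                  # everyone else's updates
    return incoming, outgoing


def conflict_rate(tps: float, actions: float, nodes: float,
                  disconnect_time: float, db_size: float) -> float:
    """The slide's expression: Disconnect_Time x (TPS x Actions x Nodes)^2 / DB_size."""
    return disconnect_time * (tps * actions * nodes) ** 2 / db_size


# Illustrative parameters only: one transaction per 100 s per node, offline for a day.
args = dict(tps=0.01, actions=4, nodes=10, disconnect_time=86_400)
incoming, outgoing = disconnected_updates(**args)
rate = conflict_rate(**args, db_size=1_000_000)

# Assumed derivation: overlap per reconnect, times reconnects per second.
overlap_per_reconnect = incoming * outgoing / 1_000_000
derived = overlap_per_reconnect * args["nodes"] / args["disconnect_time"]
matches = math.isclose(rate, derived)
print(incoming, outgoing, matches)
```

Under these assumptions the rate grows with the square of the number of nodes and linearly with the time spent disconnected.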
18
Outline
Replication strategies (lazy & eager, master & group)
How centralized databases scale
Replication is unstable on scaleup
A possible solution
– Two-tier architecture: Mobile & Base nodes
– Base nodes master objects
– Tentative transactions at mobile nodes
  – Transactions must be commutative
– Re-apply transactions on reconnect
– Transactions may be rejected
19
Safe Approach
Each object mastered at a node
Update Transactions only read and write master items
Lazy replication to other nodes
Allow reads of stale data (on user request)
PROBLEMS:
– doesn’t support mobile users
– deadlocks explode with scaleup
?? How do banks work???
20
Two Tier Replication
Two kinds of nodes:
– Base nodes always connected, always up
– Mobile nodes occasionally connected
Data mastered at base nodes
Mobile nodes
– have stale copies
– make tentative updates
[Figure: a Base Node and a Mobile node]
21
Mobile Node Makes Tentative Updates
Updates local database while disconnected
Saves transactions
When Mobile node reconnects:
– Tentative transactions re-done as Eager-Master (at original time??)
– Some may be rejected (replaces reconciliation)
No System Delusion.
[Figure: the Mobile node sends tentative transactions to the Base Node; the Base Node returns base updates & failed base transactions]
22
Tentative Transactions
Must be commutative with others
– Debit $50 rather than Change $150 to $100.
Must have acceptance criteria
– Account balance is positive
– Ship date no later than quoted
– Price is no greater than quoted
[Figure: Tentative Transactions at local DB; send Tentative Xacts; Updates & Rejects; Transactions From Others]
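A small sketch can show how a base node might re-run tentative transactions when a mobile node reconnects. It assumes each tentative transaction is a commutative operation (a debit by an amount, not an overwrite) paired with an acceptance criterion on the resulting master state. The names `TentativeTransaction`, `debit`, and `reapply_at_base` are hypothetical, and a real system would run each re-application as a proper base transaction.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TentativeTransaction:
    description: str
    apply: Callable[[dict], dict]     # commutative update of the master state
    accept: Callable[[dict], bool]    # acceptance criterion on the result


def debit(account: str, amount: int) -> TentativeTransaction:
    """'Debit $50' rather than 'change $150 to $100'."""
    def apply(state: dict) -> dict:
        new_state = dict(state)
        new_state[account] = new_state[account] - amount
        return new_state
    # Acceptance criterion from the slide: account balance is positive.
    return TentativeTransaction(f"debit {account} by {amount}", apply,
                                lambda state: state[account] > 0)


def reapply_at_base(master: dict, tentative: list) -> tuple:
    """Re-run tentative transactions against current master data."""
    accepted, rejected = [], []
    for txn in tentative:
        candidate = txn.apply(master)
        if txn.accept(candidate):
            master = candidate                  # becomes a real base update
            accepted.append(txn.description)
        else:
            rejected.append(txn.description)    # reported back to the mobile node
    return master, accepted, rejected


# The mobile node saw a stale balance of 150; the master balance is now 80.
master_after, accepted, rejected = reapply_at_base(
    {"acct": 80}, [debit("acct", 50), debit("acct", 50)])
print(master_after, accepted, rejected)
```

Because the debits are expressed relative to the current balance, the first one still applies to the newer master value. The second fails its acceptance criterion and is rejected at reconnect, without any reconciliation step.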
24
Virtue of 2-Tier Approach
Allows mobile operation
No system delusion
Rejects detected at reconnect (know right away)
If commutativity works,
– No reconciliations
– Even though work rises as (Mobile + Base)^2
25
Outline
Replication strategies (lazy & eager, master & group)
How centralized databases scale
Replication is unstable on scaleup
A possible solution (two-tier architecture)
– Tentative transactions at mobile nodes
– Re-apply transactions on reconnect
– Transactions may be rejected & reconciled
Avoids system delusion