Page 1

Transactional storage for geo-replicated systems

Yair Sovran, Russell Power, Marcos K. Aguilera, Jinyang Li

NYU and MSR SVC

Page 2

Life in a web startup

Page 3

Web apps need geo-replicated storage

Geo-replicated transactional storage

Page 4

Consistency vs. performance: existing tradeoffs

Eventual Consistency

Less coordination, more anomalies

More coordination, fewer anomalies

Serializability

• Maximize multi-site performance
• Have few anomalies

Snapshot Isolation

Page 5

Our contribution

1. New semantics: Parallel Snapshot Isolation (PSI)
2. Walter: implementing PSI efficiently
   – Preferred site
   – Counting set
3. Application experience

Page 6

Snapshot isolation

Timeline of storage state

Read-X Write-X Commit

Read-Y Write-Y Commit

• Snapshot isolation’s guarantees

1. Read snapshots from global timeline
2. Prohibit write-write conflict
3. Preserve causality

T1

T2
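The three guarantees can be made concrete with a small sketch. This is a toy single-process multiversion store, not Walter's code; the names `SIStore` and `Txn` and the integer commit clock are assumptions made for illustration. Reads see the snapshot taken at transaction start, and a commit aborts if another transaction committed a write to the same key after that snapshot.

```python
class SIStore:
    """Toy multiversion store with one global timeline (hypothetical sketch)."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), ascending by commit_ts
        self.clock = 0      # timestamp of the latest commit

    def begin(self):
        # The snapshot is identified by the commit clock at start time.
        return Txn(self, self.clock)


class Txn:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}    # buffered writes, invisible to others until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Latest version committed at or before the snapshot.
        for ts, value in reversed(self.store.versions.get(key, [])):
            if ts <= self.start_ts:
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Write-write conflict check: abort if any written key has a
        # version committed after this transaction's snapshot.
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                return False
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.clock, value))
        return True
```

In this sketch, T1 (writes X) and T2 (writes Y) from the slide both commit because their write sets are disjoint; two concurrent writers of X would see the second commit rejected.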

Page 7

PSI avoids global transaction ordering

Site1

Site2

Site1 timeline

Site2 timeline

Read-X Write-X Commit

Read-Y Write-Y Commit

A transaction commits locally first, then propagates to remote sites.

T1

T2

Walter achieves this efficiently

• Parallel snapshot isolation’s guarantees

1. Read snapshots from per-site timeline
2. Prohibit write-write conflict
3. Preserve causality

Page 8

PSI has few anomalies

Anomaly               Serializability   Snapshot Isolation   PSI   Eventual
dirty read            No                No                   No    Yes
non-repeatable read   No                No                   No    Yes
lost update           No                No                   No    Yes
short fork            No                Yes                  Yes   Yes
long fork             No                No                   Yes   Yes
conflicting fork      No                No                   No    Yes

Page 9

PSI’s anomaly

Short fork (allowed by snapshot isolation)

[Diagram labels: T1, T2, T1 commits, T2 commits]

Long fork (disallowed by snapshot isolation)

[Diagram labels: T1, T2, T1 commits, T2 commits, T1 and T2 propagate to both sites]
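A long fork can be reproduced with a toy two-site model. The sketch below is an illustration only: the `Site` class, its `commit_local` and `propagate_to` methods, and the two-site setup are assumptions, and there is no conflict checking. T1 commits X at Site1 and T2 commits Y at Site2; until propagation, each site sees only its own transaction, so no single global order of T1 and T2 explains both views.

```python
class Site:
    """Toy site with its own timeline (hypothetical sketch, two sites assumed)."""

    def __init__(self, name):
        self.name = name
        self.state = {}    # this site's current view of the data
        self.pending = []  # locally committed write sets not yet propagated

    def commit_local(self, writes):
        # A transaction commits at its own site first.
        self.state.update(writes)
        self.pending.append(dict(writes))

    def propagate_to(self, other):
        # Asynchronous replication: ship pending write sets to the other site.
        for writes in self.pending:
            other.state.update(writes)
        self.pending.clear()


site1, site2 = Site("Site1"), Site("Site2")
site1.commit_local({"X": 1})   # T1 commits at Site1
site2.commit_local({"Y": 1})   # T2 commits at Site2

view1_before = dict(site1.state)  # sees T1 but not T2
view2_before = dict(site2.state)  # sees T2 but not T1

site1.propagate_to(site2)
site2.propagate_to(site1)
# After propagation both sites hold X and Y: the fork has merged.
```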

Page 10

Walter overview

Client (C) operations:
• Start_TX
• Commit_TX
• Read
• Write

[Diagram labels: clients (C), Site1, Site2]

• Replicate data
• Coordinate for PSI

• Main challenge: avoid write-write conflict across sites
• Walter’s solution
  1. Preferred site
  2. Counting set

Page 11

Technique #1: preferred site

• Associate each user’s data with a preferred site
• Common case: write at preferred site → fast commit
  – Rare case: write at non-preferred site → cross-site 2-phase commit

[Diagram labels: Site1 (Alice), Site2 (Bob); Alice’s photos and Bob’s photos at each site; clients (C); Write (fast commit); slow commit]
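The choice between the two commit paths can be sketched as a simple check over a transaction's write set. This is an illustrative sketch, not Walter's implementation: `Obj`, `commit_path`, and the assignment of Alice's photos to Site1 and Bob's photos to Site2 are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    key: str
    preferred_site: str  # the site allowed to commit writes to this object locally


def commit_path(local_site, write_set):
    """Return "fast" if every written object is preferred at local_site,
    otherwise "slow" (cross-site two-phase commit)."""
    if all(obj.preferred_site == local_site for obj in write_set):
        return "fast"
    return "slow"


# Assumed placement for illustration only.
alice_photos = Obj("alice/photos", preferred_site="Site1")
bob_photos = Obj("bob/photos", preferred_site="Site2")
```

Under this assumed placement, a write to Alice's photos issued at Site1 takes the fast path, while the same write issued at Site2 needs the slow cross-site commit.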

Page 12

Technique #2: counting set

• Problem: some objects are modified from many sites
• Counting set: a data type free of write-write conflict

[Diagram labels: Alice at Site 1 and Bob at Site 2 each befriend Eve; each client (C) writes to Eve’s friendlist]

Page 13

Technique #2: counting set

• Add/del operations commute → no need to check for write-write conflict

• Caveat: application developers must deal with counts

[Diagram labels: Alice at Site1 and Bob at Site2 each befriend Eve; the clients (C) issue add(“Alice”) and add(“Bob”) on Eve’s friendlist; the adds are exchanged between sites; both copies of Eve’s friendlist end with Alice 1, Bob 1]
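Why commuting operations remove the need for conflict checks can be shown with a minimal counting set. The sketch below is an assumption-laden illustration (the class name `CountingSet`, the method names, and the `replay` helper are not from the slides): each element maps to an integer count, add increments it, and delete decrements it, so replicas that apply the same operations in different orders converge.

```python
class CountingSet:
    """Minimal counting set: element -> integer count (hypothetical sketch)."""

    def __init__(self):
        self.counts = {}

    def _bump(self, elem, delta):
        new = self.counts.get(elem, 0) + delta
        if new == 0:
            self.counts.pop(elem, None)  # zero counts are not stored
        else:
            self.counts[elem] = new      # counts may exceed 1 or go negative

    def add(self, elem):
        self._bump(elem, +1)

    def delete(self, elem):
        self._bump(elem, -1)

    def count(self, elem):
        return self.counts.get(elem, 0)


def replay(ops):
    """Apply a list of ("add" | "del", elem) operations to a fresh counting set."""
    cset = CountingSet()
    for op, elem in ops:
        if op == "add":
            cset.add(elem)
        else:
            cset.delete(elem)
    return cset


# Each site applies the same two adds, but in a different order.
site1 = replay([("add", "Alice"), ("add", "Bob")])
site2 = replay([("add", "Bob"), ("add", "Alice")])
```

The caveat from the slide shows up directly: two adds of the same element leave a count of 2, and a delete that arrives before its add leaves a count of -1 until the add is applied, so the application has to decide how to interpret counts other than 0 and 1.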

Page 14

Site failure

• Two options to handle a site failure
  – Conservative: block writes whose preferred site failed
  – Aggressive: re-assign preferred site elsewhere

Warning: Committed but not-yet-replicated transactions may be lost

Page 15

Application #1: WaltSocial

Wall and Friendlist are counting sets

[Screenshot: example wall posts]
Meow says: Meow Meow Meow
Bob-cat says: I saw a mouse
Bob-cat says: I saw a mouse
Peanut says: awldaiwdliawd
Meow says: I think I ate too much catnip last night. Meow.

Befriend transaction
  A ← read Alice’s profile
  B ← read Bob’s profile
  Add A.uid to B.friendlist
  Add B.uid to A.friendlist
  Add “Alice is now friends with Bob” to A.wall
  Add “Bob is now friends with Alice” to B.wall
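The befriend transaction can be sketched against a toy in-memory store. Everything here is a stand-in: `Store`, `Tx`, `set_add`, and the profile fields (`uid`, `name`, `friendlist`, `wall`) are hypothetical names, not Walter's API. The point is that all four counting-set adds are buffered and applied together at commit.

```python
from collections import defaultdict


class Store:
    """Toy single-site store (hypothetical stand-in for a Walter server)."""

    def __init__(self):
        self.objects = {}                                   # regular objects
        self.csets = defaultdict(lambda: defaultdict(int))  # cset id -> elem -> count


class Tx:
    def __init__(self, store):
        self.store = store
        self.cset_adds = []  # buffered (cset_id, elem) pairs

    def read(self, key):
        return self.store.objects[key]

    def set_add(self, cset_id, elem):
        self.cset_adds.append((cset_id, elem))

    def commit(self):
        # Apply every buffered add; nothing is visible before this point.
        for cset_id, elem in self.cset_adds:
            self.store.csets[cset_id][elem] += 1


def befriend(store, key_a, key_b):
    tx = Tx(store)
    a = tx.read(key_a)
    b = tx.read(key_b)
    tx.set_add(b["friendlist"], a["uid"])
    tx.set_add(a["friendlist"], b["uid"])
    tx.set_add(a["wall"], f'{a["name"]} is now friends with {b["name"]}')
    tx.set_add(b["wall"], f'{b["name"]} is now friends with {a["name"]}')
    tx.commit()


store = Store()
store.objects["profile:alice"] = {"uid": "alice", "name": "Alice",
                                  "friendlist": "alice:friends", "wall": "alice:wall"}
store.objects["profile:bob"] = {"uid": "bob", "name": "Bob",
                                "friendlist": "bob:friends", "wall": "bob:wall"}
befriend(store, "profile:alice", "profile:bob")
```

Because the transaction only reads regular objects and only updates counting sets, it has no regular-object writes that could conflict with a transaction at another site.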

Page 16

Application #2: Twitter clone

• Third-party app in PHP
• Our port: switch storage backend from Redis to Walter

Each user’s timeline is a counting set

Post-status transaction
  write status to new object O
  foreach f in user’s followers
    add O to f’s timeline_cset
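The fan-out in the post-status transaction can be sketched with plain dictionaries standing in for Walter objects and counting sets. The function names, the `status:N` id scheme, and the data layout are assumptions for illustration, and the sketch omits the transaction boundaries that Walter would provide.

```python
from collections import defaultdict


def post_status(objects, timelines, followers, user, text):
    """Write the status to a new object, then add its id to each follower's timeline cset."""
    oid = f"status:{len(objects) + 1}"       # hypothetical id scheme
    objects[oid] = {"author": user, "text": text}
    for f in followers.get(user, ()):
        timelines[f][oid] += 1               # counting-set add
    return oid


def read_timeline(objects, timelines, user):
    """Return the texts of all statuses whose count in the user's timeline is positive."""
    return [objects[oid]["text"]
            for oid, n in timelines[user].items() if n > 0]


objects = {}
timelines = defaultdict(lambda: defaultdict(int))  # user -> status id -> count
followers = {"alice": ["bob", "carol"]}

oid = post_status(objects, timelines, followers, "alice", "hello")
```

Treating only positive counts as members is one possible way for the application to "deal with counts"; other policies are equally plausible.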

Page 17

Evaluation

• Walter prototype
  – Implemented in C++ with PHP binding
  – Custom RPC library with Protocol Buffers

• Testbed: Amazon EC2
  – Extra-large instance
  – Up to 4 sites (Virginia, California, Ireland, Singapore)
• Full replication across sites

Page 18

Walter scales

• Read/write a 100-byte object
• Reads’ working set fits in memory

[Figure panels: Read, Write]

Page 19

WaltSocial achieves low latency

A post-on-wall transaction reads 2 objects, writes 2 objects, updates 2 counting sets

Page 20

Walter lets ReTwis scale to >1 sites

[Figure: Read Timeline, Post status, and Follow user operations, comparing Redis, Walter (1-site), and Walter (2-site)]

Page 21

Related work

• Cloud storage systems
  – Single-site: Bigtable, Sinfonia, Percolator
  – No/limited transaction: Dynamo, COPS, PNUTS
  – Synchronous replication: Megastore, Scatter

• Replicated database systems
  – Eager vs. lazy replication
  – Escrow transactions: for numeric data

• Conflict-free replicated data types
  – Inspired counting sets

Page 22

Conclusion

• PSI is a good tradeoff for geo-replicated storage
  – Allows fast commit with asynchronous replication
  – Prohibits write-write conflict and preserves causality

• Walter realizes PSI efficiently
  – Preferred site
  – Conflict-free counting set