gryff: unifying consensus and shared registersmatthelb/talks/gryff-nsdi20-talk.pdf•examples: etcd,...
TRANSCRIPT
![Page 1: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/1.jpg)
Gryff: Unifying Consensus and Shared Registers
Matthew Burke Audrey Cheng Wyatt Lloyd
Cornell University Princeton University
1
![Page 2: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/2.jpg)
Applications Rely on Geo-Replicated Storage• Fault tolerant: data is safe despite failures
2
Client
![Page 3: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/3.jpg)
Applications Rely on Geo-Replicated Storage• Fault tolerant: data is safe despite failures
• Linearizable: intuitive for application developers
3
≡
![Page 4: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/4.jpg)
Linearizable Replicated Storage Systems
4
Cloud Spanner
![Page 5: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/5.jpg)
Status Quo: Consensus or Shared Registers
Consensus Shared Registers
Strong Synchronization ✔ ✘Low ReadTail Latency ✘ ✔
5
• Given the desire for fault tolerance and linearizability
Unify consensus and shared registers?
![Page 6: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/6.jpg)
Consensus & State Machine Replication (SMR)• Generic interface: Command(c(.))
• Stable ordering: all preceding log positions are assigned commands
6
c1 c2 c3 c4
![Page 7: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/7.jpg)
Consensus & State Machine Replication (SMR)• Generic interface: Command(c(.))
• Stable ordering: all preceding log positions are assigned commands
• Used in etcd, CockroachDB, Spanner, Azure Storage, Chubby
6
c1 c2 c3 c4
![Page 8: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/8.jpg)
SMR Requires Stable Order• Allow for strong synchronization primitives like read-modify-writes
• High tail latency in practice (e.g., by serializing through a leader)
7
Consensus
Strong Synchronization ✔Low ReadTail Latency ✘
![Page 9: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/9.jpg)
Shared Registers• Simple interface: Read()/Write(v)
• Unstable ordering: total order without pre-defined positions
8
w1 < w2 < w3 < w4
![Page 10: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/10.jpg)
Shared Registers• Simple interface: Read()/Write(v)
• Unstable ordering: total order without pre-defined positions
8
w1 < w2 < w3 < w4
w5
![Page 11: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/11.jpg)
Shared Registers• Simple interface: Read()/Write(v)
• Unstable ordering: total order without pre-defined positions
• Similar to Cassandra, Dynamo, Riak
8
w1 < w2 < w3 < w4w5 <
![Page 12: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/12.jpg)
Shared Registers Use Unstable Order• Cannot implement strong synchronization primitives [Herlihy91]
• Flexibility of unstable order provides favorable tail latency
9
Consensus Shared Registers
Strong Synchronization ✔ ✘Low ReadTail Latency ✘ ✔
![Page 13: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/13.jpg)
RMWs with low read tail latency?
Shared Objects: Interface for Unification• Interface: Read()/Write(v)/RMW(f(.))
• RMW(f(.))→ read base v, compute new value f(v), write f(v)
• Examples: etcd, Redis, BigTable
10
![Page 14: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/14.jpg)
Consensus-after-Register Timestamps (Carstamps)
11
rmw1 rmw3 rmw4
rmw2
Unstable Order
StableOrder w1 w2 w3 w4< < <
![Page 15: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/15.jpg)
Consensus-after-Register Timestamps (Carstamps)
11
rmw1 rmw3 rmw4
rmw2
Unstable Order
StableOrder w1 w2 w3 w4w5 << < < w6<
![Page 16: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/16.jpg)
Carstamps• Tuple with three fields: (ts, id, rmwc)
• ts and id basis for unstable ordering of writes
• rmwc is set to 1 greater than rmwc of base to ensure stable ordering
12
w1 < w2
rmw1 rmw3
(3,1,0)
(3,1,1)
(4,1,0)
(4,1,1)
rmw2(3,1,2)
![Page 17: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/17.jpg)
Consensus Shared Registers
Gryff
Strong Synchronization ✔ ✘ ✔Low ReadTail Latency ✘ ✔ ✔
Gryff Unifies Consensus and Shared Registers• Only uses consensus when necessary, for strong synchronization
13
![Page 18: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/18.jpg)
Gryff Design• Combine multi-writer [LS97] ABD [ABD95] & EPaxos [MAK13]
• Modifications needed for safety:• Carstamps for proper ordering
• Synchronous Commit phase for rmws
• Modifications for better read tail latency:• Early termination for reads (fast path)
• Proxy optimization for reads (fast path more often)
14
See the paper for details!
![Page 19: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/19.jpg)
Gryff in Action
15
![Page 20: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/20.jpg)
Gryff in Action
15
(2,3,0) (1,0,0) (2,3,0)
![Page 21: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/21.jpg)
Gryff in Action
15
c1
(2,3,0) (1,0,0) (2,3,0)
w1→ (3,1,0)
Writes always terminate in 2 phases
![Page 22: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/22.jpg)
Executed (3,1,1)
(3,1,1)(3,1,1)
Gryff in Action
18
c1
(2,3,0)
c2
rmw1→ (3,1,1)
Writes always terminate in 2 phases
RMW carstamps directly after base
![Page 23: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/23.jpg)
Read1Reply (3,1,1)
(3,1,1)(3,1,1)
Gryff in Action
19
c1
(2,3,0)
c2
r → (3,1,1)
Writes always terminate in 2 phases
RMW carstamps directly after base
Reads often terminate in 1 phase
![Page 24: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/24.jpg)
Evaluation
Relative to state-of-the-art-consensus protocols:
1. How do Gryff’s read/write protocols affect read tail latency?
2. What is the latency distribution of Gryff’s reads, writes, and rmws?
3. What maximum throughput does Gryff achieve?
4. How does Gryff perform in tail-at-scale workloads?
20
![Page 25: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/25.jpg)
Evaluation
Relative to state-of-the-art-consensus protocols:
1. How do Gryff’s read/write protocols affect read tail latency?
2. What is the latency distribution of Gryff’s reads, writes, and rmws?
3. What maximum throughput does Gryff achieve?
4. How does Gryff perform in tail-at-scale workloads?
21
![Page 26: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/26.jpg)
Evaluation Setup
22
• Geo-replication with 3 regions
• Baselines: MultiPaxos (industry standard), EPaxos (leaderless)
not-to-scale ocean
72ms 88ms
![Page 27: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/27.jpg)
Read Tail Latency (94.5% R, 4.5% W, 1% RMW, 25% Conflicts)
23
0
0.2
0.4
0.6
0.8
1
60 120 180 240
Fra
ctio
n o
f R
ead
s
Latency (ms)
MultiPaxos
![Page 28: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/28.jpg)
0
0.2
0.4
0.6
0.8
1
60 120 180 240
Fra
ctio
n o
f R
ead
s
Latency (ms)
MultiPaxos
Read Tail Latency (94.5% R, 4.5% W, 1% RMW, 25% Conflicts)
24
![Page 29: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/29.jpg)
0
0.2
0.4
0.6
0.8
1
60 120 180 240
Fra
ctio
n o
f R
ead
s
Latency (ms)
MultiPaxos
Read Tail Latency (94.5% R, 4.5% W, 1% RMW, 25% Conflicts)
24
serializing through far-away leader
![Page 30: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/30.jpg)
0
0.2
0.4
0.6
0.8
1
60 120 180 240
Fra
ctio
n o
f R
ead
s
Latency (ms)
MultiPaxos
EPaxos
Read Tail Latency (94.5% R, 4.5% W, 1% RMW, 25% Conflicts)
25
![Page 31: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/31.jpg)
0
0.2
0.4
0.6
0.8
1
60 120 180 240
Fra
ctio
n o
f R
ead
s
Latency (ms)
MultiPaxos
EPaxos
Read Tail Latency (94.5% R, 4.5% W, 1% RMW, 25% Conflicts)
25
delaying reads that conflict with concurrent writes
![Page 32: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/32.jpg)
0
0.2
0.4
0.6
0.8
1
60 120 180 240
Fra
ctio
n o
f R
ead
s
Latency (ms)
MultiPaxos
EPaxos
Gryff
Read Tail Latency (94.5% R, 4.5% W, 1% RMW, 25% Conflicts)
26
![Page 33: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/33.jpg)
0
0.2
0.4
0.6
0.8
1
60 120 180 240
Fra
ctio
n o
f R
ead
s
Latency (ms)
MultiPaxos
EPaxos
Gryff
Read Tail Latency (94.5% R, 4.5% W, 1% RMW, 25% Conflicts)
26
1 round to nearest majority in tail
![Page 34: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/34.jpg)
Summary• Consensus: strong synchronization w/ high tail latency
Shared registers: low tail latency w/o strong synchronization
• Carstamps stably order read-modify-writes within a more efficient unstable order for reads and writes
• Gryff unifies an optimized shared register protocol with a state-of-the-art consensus protocol using carstamps
• Gryff provides strong synchronization w/ low read tail latency
27
![Page 35: Gryff: Unifying Consensus and Shared Registersmatthelb/talks/gryff-nsdi20-talk.pdf•Examples: etcd, Redis, BigTable 10. Consensus-after-Register Timestamps (Carstamps) 11 rmw 1 rmw](https://reader033.vdocument.in/reader033/viewer/2022052105/6040ba0b1d7309641d225d84/html5/thumbnails/35.jpg)
Image Attribution• Griffin by Delapouite / CC BY 3.0 Unported (modified)
• etcd
• CockroachDB
• Spanner by Google / CC BY 4.0
28