TRANSCRIPT
CANOPUS: A SCALABLE AND MASSIVELY PARALLEL CONSENSUS PROTOCOL
Bernard Wong
CoNEXT 2017
Joint work with Sajjad Rizvi and Srinivasan Keshav
CONSENSUS PROBLEM
Agreement between a set of nodes in the presence of failures
Asynchronous environment
Primarily used to provide fault tolerance
Node A: W(x=1)
Node B: W(x=2)
[Figure: every node's replicated log converges to the same sequence, e.g., W(x=3) W(y=1) W(z=1) ?]
A BUILDING BLOCK IN DISTRIBUTED SYSTEMS
System applications: Spanner, Akka, Mesos, Hadoop, HBase, Kafka, BookKeeper, …
Coordination services: ZooKeeper, Consul, Chef, Puppet, etcd, Chubby, …
Consensus and atomic broadcast: ZAB, Paxos, Raft, Mencius, EPaxos, SPaxos, AllConcur, NOPaxos, NetPaxos, …
A BUILDING BLOCK IN DISTRIBUTED SYSTEMS
Current consensus protocols are not scalable. However, most applications only require a small number of replicas for fault tolerance.
PERMISSIONED BLOCKCHAINS
A distributed ledger shared by all the participants
Consensus at a large scale
Large number of participants (e.g., financial institutions)
Must validate a block before committing it to the ledger
Examples
Hyperledger, Microsoft Coco, Kadena, Chain …
CANOPUS
Consensus among a large set of participants
Targets thousands of nodes distributed across the globe
Decentralized protocol
Nodes execute steps independently and in parallel
Designed for modern datacenters
Takes advantage of high performance networks and hardware redundancies
SYSTEM ASSUMPTIONS
Non-uniform network latencies and link capacities
Scalability is bandwidth limited
Protocol must be network topology aware
Deployment consists of racks of servers connected by redundant links
Full rack failures and network partitions are rare
[Figure: global view shows datacenters connected across a WAN; within a datacenter, racks of servers are connected by redundant links]
CONSENSUS CYCLES
Execution divided into a sequence of consensus cycles
In each cycle, Canopus determines the order of writes (state changes) received during the previous cycle
[Figure: Canopus servers d, e, f, g, h, i receive writes x=1, y=3, x=5, z=2 during a cycle]
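The cycle structure above can be sketched in a few lines of Python. This is a hypothetical, single-process stand-in for a distributed protocol; `CanopusNodeSketch` and all other names are illustrative, not from the paper.

```python
# Sketch: each consensus cycle orders the writes received during the
# previous cycle; every node then applies the same agreed order.
class CanopusNodeSketch:
    def __init__(self):
        self.pending = []  # writes received during the current cycle
        self.log = []      # globally ordered, committed writes

    def receive_write(self, write):
        self.pending.append(write)

    def run_cycle(self, all_pending):
        # 'all_pending' stands in for the protocol's agreement step:
        # after a cycle, every node sees the same set of writes and
        # orders them deterministically.
        ordered = sorted(all_pending)
        self.log.extend(ordered)
        self.pending = []
        return ordered

nodes = [CanopusNodeSketch() for _ in range(3)]
nodes[0].receive_write(("x", 1))
nodes[1].receive_write(("y", 3))
union = [w for n in nodes for w in n.pending]
logs = [n.run_cycle(union) for n in nodes]
assert logs[0] == logs[1] == logs[2]  # identical order on every node
```

The key property mirrored here is that agreement is on the *set* of writes per cycle; any deterministic tie-break then yields the same log everywhere.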
SUPER-LEAVES AND VNODES
Nodes in the same rack form a logical group called a super-leaf
Use an intra-super-leaf consensus protocol to replicate write requests between nodes in the same super-leaf
[Figure: nodes d, e, f form one super-leaf; nodes g, h, i form another]
SUPER-LEAVES AND VNODES
Represent the state of each super-leaf as a height-1 virtual node (vnode)
[Figure: super-leaf {d, e, f} is represented by vnode b; super-leaf {g, h, i} by vnode c]
ACHIEVING CONSENSUS
[Figure: height-1 vnodes b = {d, e, f} and c = {g, h, i} reach consensus in round 1; round 2 combines them into the root vnode a]
Members of a height 1 vnode exchange state with members of nearby height 1 vnodes to compute a height 2 vnode
State exchange is greatly simplified since each vnode is fault tolerant
A consensus cycle consists of h rounds, where h is the height of the vnode tree
A node completes a consensus cycle once it has computed the state of the root vnode
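The round structure can be sketched as a level-by-level merge up the vnode tree. This is a sequential, hypothetical stand-in for the parallel protocol; the `merge` function and state names are illustrative only.

```python
# Sketch: each round merges the states of sibling vnodes into their
# parent; after h rounds (the tree height) every node knows the root
# vnode's state.
def merge(children):
    # deterministic merge: concatenate child states in a fixed order
    out = []
    for state in children:
        out.extend(state)
    return out

# Round 1: members of each super-leaf compute their height-1 vnode.
vnode_b = merge([["d-writes"], ["e-writes"], ["f-writes"]])
vnode_c = merge([["g-writes"], ["h-writes"], ["i-writes"]])

# Round 2: exchange with the nearby vnode to compute the root.
root_a = merge([vnode_b, vnode_c])
assert len(root_a) == 6  # root state covers all six nodes' writes
```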
CONSENSUS PROTOCOL WITHIN A SUPER-LEAF
[Figure: super-leaf of nodes a, b, c with proposals A1, B1, C1, C2]
• Exploit low latency within a rack
• Reliable broadcast
• Raft
1. Nodes prepare a proposal message that contains a random number and a list of pending write requests
2. Nodes use reliable broadcast to exchange proposals within a super-leaf
3. Every node orders proposals
These three steps make up a consensus round. At the end, all three nodes have the same state of their common parent.
[Figure: nodes a, b, c draw random numbers 634, 746, and 538 and attach pending writes A1, B1, C1, C2; after the broadcast, every node holds all three proposals]
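The three steps above can be sketched as follows. The random numbers act as a shared tie-breaker: once every node holds the same proposals, sorting by (random number, node id) yields the identical order everywhere with no further coordination. Names below are hypothetical.

```python
# Step 1: each node prepares a proposal (random number, node id,
# pending writes); numbers here match the slide: a->634, b->746, c->538.
proposals = [
    (634, "a", ["A1"]),
    (746, "b", ["B1"]),
    (538, "c", ["C1", "C2"]),
]

# Step 2 stand-in: reliable broadcast ensures every node ends up
# holding the same set of proposals.

# Step 3: every node sorts by (random number, node id), producing the
# same write order regardless of arrival order.
def order(props):
    return [w for _, _, writes in sorted(props) for w in writes]

assert order(proposals) == ["C1", "C2", "A1", "B1"]
assert order(list(reversed(proposals))) == order(proposals)
```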
CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES
[Figure: super-leaves {d, e, f} and {g, h, i} under root a; legend marks the representative and emulator roles]
1. Representatives send proposal requests to fetch the states of vnodes
2. Emulators reply with proposals
3. Reliable broadcast within a super-leaf
The consensus cycle ends for a node when it has completed the last round.
[Figure: a representative in each super-leaf exchanges proposal requests and responses with an emulator from the other super-leaf, then rebroadcasts the result within its own super-leaf]
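One inter-super-leaf round can be sketched from a representative's point of view. This is a minimal, in-process stand-in (hypothetical names; real requests and replies travel over the network):

```python
# Sketch: a representative fetches the remote vnode's state from an
# emulator, then rebroadcasts the merged state inside its own super-leaf.
class EmulatorSketch:
    def __init__(self, vnode_state):
        self.vnode_state = vnode_state

    def on_proposal_request(self):
        # Step 2: reply with the vnode's proposal
        return list(self.vnode_state)

def representative_round(local_state, emulator, super_leaf_peers):
    remote = emulator.on_proposal_request()  # steps 1-2: fetch
    merged = sorted(local_state + remote)    # deterministic merge
    for peer in super_leaf_peers:            # step 3: rebroadcast
        peer.append(merged)
    return merged

peers = [[], []]
merged = representative_round(["d", "e", "f"],
                              EmulatorSketch(["g", "h", "i"]),
                              peers)
assert merged == ["d", "e", "f", "g", "h", "i"]
assert peers[0] == peers[1] == [merged]
```

Because any emulator of a vnode can answer the request, a single slow or failed remote node does not stall the round.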
READ REQUESTS
Read requests can be serviced locally by any Canopus node
Reads do not need to be disseminated to other participating nodes
Provides linearizability by
Buffering read requests until the global ordering of writes has been determined
Locally ordering its pending reads and writes to preserve the request order of its clients
Significantly reduces bandwidth requirements for read requests
Achieves total ordering of both read and write requests
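The read path above can be sketched as a buffer that releases reads only after the cycle's write order is fixed. This is a hypothetical single-node stand-in; `ReadBufferSketch` is illustrative, not the paper's implementation.

```python
# Sketch: reads are buffered locally and answered only once the global
# write order for the cycle is known, giving linearizability without
# disseminating reads to other nodes.
class ReadBufferSketch:
    def __init__(self):
        self.store = {}
        self.pending_reads = []

    def read(self, key):
        self.pending_reads.append(key)  # buffered, never broadcast

    def commit_cycle(self, ordered_writes):
        # apply the globally agreed writes first ...
        for key, value in ordered_writes:
            self.store[key] = value
        # ... then answer the buffered reads locally
        results = [(k, self.store.get(k)) for k in self.pending_reads]
        self.pending_reads = []
        return results

node = ReadBufferSketch()
node.read("x")
results = node.commit_cycle([("x", 3), ("y", 1)])
assert results == [("x", 3)]  # read reflects the committed write order
```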
ADDITIONAL OPTIMIZATIONS
Pipelining consensus cycles
Critical to achieving high throughput over high latency links
Write leases
For read-mostly workloads with low latency requirements
Reads can complete without waiting until the end of a consensus cycle
EVALUATION: MULTI DATACENTER CASE
3, 5, and 7 datacenters
Each datacenter corresponds to a super-leaf
3 nodes per datacenter (up to 21 nodes in total)
EC2 c3.4xlarge instances
100 clients in five machines per datacenter
Each client is connected to a random node in the same datacenter
[Table: latencies across datacenters (in ms); regions: Ireland (IR), California (CA), Virginia (VA), Tokyo (TK), Oregon (OR), Sydney (SY), Frankfurt (FF)]
CANOPUS VS. EPAXOS (20% WRITES)
EVALUATION: SINGLE DATACENTER CASE
3 super-leaves of 3, 5, 7, or 9 servers each (i.e., up to 27 servers in total)
Each server has 32 GB RAM, a 200 GB SSD, and 12 cores running at 2.1 GHz
Each server has a 10G link to its ToR switch
Aggregation switch has dual 10G links to each ToR switch
180 clients, uniformly distributed on 15 machines
5 machines in each rack
ZKCANOPUS VS. ZOOKEEPER
LIMITATIONS
We trade off fault tolerance for performance and understandability
Cannot tolerate full rack failure or network partitions
We trade off latency for throughput
At low throughputs, latencies can be higher than other consensus protocols
Stragglers can hold up the system (temporarily)
Super-leaf peers detect and remove them
ONGOING WORK
Handling super-leaf failures
For applications with high availability requirements
Detect and remove failed super-leaves to continue
Byzantine fault tolerance
Canopus currently handles only crash-stop failures
Aiming to maintain our current throughput
CONCLUSIONS
Emerging applications involve consensus at large scales
A key barrier is the lack of a scalable consensus protocol
Addressed by Canopus
Decentralized
Network topology aware
Optimized for modern datacenters