Replication Election and Consensus Algorithm Refinements for MongoDB 3.2
![Page 1: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/1.jpg)
Distributed Consensus in MongoDB
Spencer T Brody
Senior Software Engineer at MongoDB
@stbrody
![Page 2: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/2.jpg)
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
![Page 3: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/3.jpg)
Why use Replication?
• Data redundancy
• High availability
![Page 4: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/4.jpg)
What is Consensus?
• Getting multiple processes/servers to agree on something
• Must handle a wide range of failure modes
  • Disk failure
  • Network partitions
  • Machine freezes
  • Clock skews
![Page 5: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/5.jpg)
Basic consensus
[Figure: three servers, each holding an identical state machine with X = 3, Y = 2, Z = 7]
![Page 6: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/6.jpg)
Leader Based Consensus
[Figure: three servers, each with a state machine (X = 3, Y = 2, Z = 7) and a replicated log with entries X ← 1, Y ← 2, X ← 3]
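The relationship between the replicated log and the state machine on this slide can be sketched in a few lines of Python. This is a minimal illustration, not MongoDB's implementation: the `apply_log` function and the tuple-based log format are assumptions made for the example. It shows that any server replaying the same log in the same order ends up with the same state.

```python
from typing import Dict, List, Tuple

# Hypothetical log entry format: (key, value) means "set key to value".
LogEntry = Tuple[str, int]


def apply_log(log: List[LogEntry]) -> Dict[str, int]:
    """Replay assignments in log order and return the resulting state."""
    state: Dict[str, int] = {}
    for key, value in log:
        state[key] = value  # a later entry for the same key overwrites an earlier one
    return state


# The three log entries visible on the slide. The Z = 7 value in the slide's
# state machine has no visible log entry, so it is left out here.
leader_log: List[LogEntry] = [("X", 1), ("Y", 2), ("X", 3)]

# Followers hold copies of the leader's log (copied directly in this sketch).
follower_logs = [list(leader_log), list(leader_log)]

leader_state = apply_log(leader_log)
follower_states = [apply_log(log) for log in follower_logs]
```

Because the state is a pure function of the log, agreeing on the log contents and order is enough for all servers to agree on the state.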
![Page 8: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/8.jpg)
Elections
[Figure, slides 8 to 12: a five-node replica set; each node holds state X = 3, Y = 2, Z = 7 and a log with entries X ← 1, Y ← 2, X ← 3]
![Page 13: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/13.jpg)
Data Replication
[Figure: a five-node replica set; each node holds state X = 3, Y = 2, Z = 7 and a log with entries X ← 1, Y ← 2, X ← 3]
![Page 15: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/15.jpg)
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
  • Goals and inspiration from the Raft consensus algorithm
  • Preventing double voting
  • Monitoring node status
  • Calling for elections
![Page 16: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/16.jpg)
Goals for MongoDB 3.2
• Decrease failover time
• Speed up detection and resolution of false primary situations
![Page 17: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/17.jpg)
Finding Inspiration in Raft
• “In Search of an Understandable Consensus Algorithm” by Diego Ongaro and John Ousterhout: https://ramcloud.stanford.edu/raft.pdf
• Designed to address the shortcomings of Paxos
  • Easier to understand
  • Easier to implement in real applications
• Provably correct
• Remarkably similar to what we’re doing already
![Page 18: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/18.jpg)
Raft Concepts
• Term (election) IDs
• Monitoring node status using existing data replication channel
• Asymmetric election timeouts
![Page 19: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/19.jpg)
Preventing Double Voting
• Can’t vote for 2 nodes in the same election
• Pre-3.2: 30 second vote timeout
• Post-3.2: Term IDs
• Term:
  • Monotonically increasing ID
  • Incremented on every election *attempt*
  • Lets voters distinguish elections so they can vote twice quickly in different elections
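The term-based voting rule above can be sketched as follows. This is a simplified illustration in the style of Raft, not MongoDB's actual code: the `Voter` class and `request_vote` method are hypothetical names, and other checks a real implementation needs (such as comparing how up to date the candidate's log is) are deliberately omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    """Hypothetical voter that grants at most one vote per term."""
    current_term: int = 0
    voted_for: Optional[str] = None

    def request_vote(self, candidate: str, term: int) -> bool:
        if term < self.current_term:
            return False  # request belongs to an older election; reject it
        if term > self.current_term:
            # A newer election has started: adopt its term and clear the old vote.
            self.current_term = term
            self.voted_for = None
        if self.voted_for is None or self.voted_for == candidate:
            self.voted_for = candidate
            return True
        return False  # already voted for a different candidate in this term


voter = Voter()
first = voter.request_vote("node-a", term=1)    # granted
second = voter.request_vote("node-b", term=1)   # refused: same term, different candidate
third = voter.request_vote("node-b", term=2)    # granted immediately: new term
```

The point of the sketch is the last line: the voter does not have to wait out a fixed timeout before voting again, because the higher term number identifies a different election.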
![Page 20: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/20.jpg)
Monitoring Node Status
• Pre-3.2: Heartbeats
  • Sent every two seconds from every node to every other node
  • Volume increases quadratically as nodes are added to the replica set
• Post-3.2: Extra metadata sent via existing data replication channel
  • Utilizes chained replication
  • Faster heartbeats = faster elections and detection of false primaries
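The quadratic growth of pre-3.2 heartbeat traffic follows from simple counting: if each of n nodes sends one heartbeat to each of the other n - 1 nodes per interval, that is n(n - 1) messages every two seconds. A small sketch of the arithmetic, where the function name is made up for illustration and replies are not counted:

```python
def heartbeats_per_interval(members: int) -> int:
    """Heartbeat requests per interval when every node pings every other node."""
    return members * (members - 1)


# Message counts for a few example replica set sizes.
counts = {n: heartbeats_per_interval(n) for n in (3, 5, 7)}
```

Going from 3 to 7 members raises the count from 6 to 42 messages per interval, which is why reusing the existing replication channel scales better.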
![Page 21: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/21.jpg)
Data Replication
[Figure: a five-node replica set; each node holds state X = 3, Y = 2, Z = 7 and a log with entries X ← 1, Y ← 2, X ← 3]
![Page 22: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/22.jpg)
Determining When To Call For An Election
• Tradeoff between failover time and spurious failovers
• Node calls for an election when it hasn’t heard from the primary within the election timeout
• Starting in 3.2:
  • Election timeout is configurable
  • Election timeout is varied randomly for each node
  • Varying timeouts help reduce tied votes
  • Fewer tied votes = faster failover
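A rough sketch of the randomized timeout idea, under stated assumptions: the function names, the 10-second base, and the 1.5-second jitter range are illustrative values chosen for this example, not MongoDB's actual defaults or code.

```python
import random


def election_timeout_ms(base_ms: int, jitter_ms: int, rng: random.Random) -> int:
    """Return the base timeout plus a random per-node offset in [0, jitter_ms]."""
    return base_ms + rng.randint(0, jitter_ms)


def should_call_election(ms_since_primary_heard: int, timeout_ms: int) -> bool:
    """A node calls for an election once the primary has been silent too long."""
    return ms_since_primary_heard > timeout_ms


rng = random.Random(42)  # seeded only so the sketch is repeatable
# Each node draws its own timeout, so nodes rarely time out at the same instant.
timeouts = {
    name: election_timeout_ms(10_000, 1_500, rng)
    for name in ("node-a", "node-b", "node-c")
}
first_candidate = min(timeouts, key=timeouts.get)
```

The node with the shortest timeout usually becomes a candidate and collects votes before the others time out, which is how varying the timeouts reduces tied votes.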
![Page 23: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/23.jpg)
Conclusion
• MongoDB 3.2 will have
  • Faster failovers
  • Faster error detection
  • More control to prevent spurious failovers
• This means your systems are
  • More stable
  • More resilient to failure
  • Easier to maintain
![Page 24: Replication Election and Consensus Algorithm Refinements for MongoDB 3.2](https://reader030.vdocument.in/reader030/viewer/2022032505/55c92d4ebb61ebe2428b46aa/html5/thumbnails/24.jpg)
Questions?