scaling state machine replication · mysql group replication galera cluster, … 2. state machine...
TRANSCRIPT
Scaling State Machine Replication
Fernando Pedone University of Lugano (USI) Switzerland
State machine replication
•Fundamental approach to fault tolerance ✦ Google Spanner ✦ Apache Zookeeper ✦ Windows Azure Storage ✦ MySQL Group Replication ✦ Galera Cluster, …
2
State machine replication is intuitive & simple
•Replication transparency ✦ For clients ✦ For application developers
•Simple execution model ✦ Replicas order all commands ✦ Replicas execute commands deterministically and in the
same order
3
Configurable fault tolerance but bounded performance
•Performance is bounded by what one replica can do ✦ Every replica needs to execute every command ✦ More replicas: same (if not worse) performance
4
Servers
Thro
ughp
ut
How to scale state machine replication?
Scaling performance with partitioning
•Partitioning (aka sharding) application state
5
Problem #1: How to order commands in a partitioned system?Problem #2: How to execute commands in a partitioned system?
Scalable performance (for single-partition commands)
Servers
Thro
ughp
utPartition Px
Partition Py
Ordering commands in a partitioned system
•Atomic multicast ✦ Commands addressed (multicast) to one or more partitions ✦ Commands ordered within and across partitions
• If S delivers C before C’, then no S’ delivers C’ before C
6
Partition Px
Partition Py
C(x)
C(y)
C(x,y)
Scalable SMR
Atomic multicast
Multi-Paxos
Network
Executing multi-partition commands
7
Solution #1: Static partitioning of dataSolution #2: Dynamic partitioning of data
Partition X
Partition YC(x,y) : { x := y }
x x x
y y y
Solution 1: Static partitioning of data
•Execution model ✦ Client queries location oracle to determine partitions ✦ Client multicasts command to involved partitions ✦ Partitions exchange and temporary store objects needed to
execute multi-partition commands ✦ Commands executed by all involved partitions
•Location oracle ✦ Simple implementation thanks to static scheme
8
How to execute multi-partition commands?
9
Partition X
Partition YC(x,y): x := y
x x x
y y y
y y y
x x x
Cached entries
Static scheme, step-by-step
10
query oracle
receive result
deliver command
all local objects?
send result
Client Server
multicast command to involved partitions
end
execute command
Yes
send needed objects/signal to remote partitions
No
start
wait for objects/signal from remote
partitions
Solution 2: Dynamic partitioning of data
•Execution model (key idea) ✦ Turn every command single-partition ✦ If command involves multiple partitions, move objects to a
single partition before executing command
•Location oracle ✦ Oracle implemented as a “special partition” ✦ Move operations involve oracle, source and destination
partitions
11
Dynamic scheme, step-by-step
12
query oracle
one partition?
receive result
deliver command
all local objects?
send result
Client Server
move objects to one partition
No
multicast command to partition
Yes
endNo
retry?Yes
execute command
Yes
result = retry
No
start
Termination and load balance
•Ensuring termination of commands ✦ After retrying n times, command is multicast to all partitions ✦ Executed as a multi-partition command
•Ensure load balancing among partitions ✦ Target partition in multi-partition command chosen randomly
13
Oracle: high availability and performance
•Oracle implemented as a partition ✦ For fault tolerance
•Clients cache oracle entries ✦ For performance ✦ Real oracle needed at first access and when objects change
location ✦ Client retries command if cached location is stale
14
Dynamically (re-)partitioning the state
•Decentralized strategy ✦ Client chooses one partition among involved partitions ✦ Each move involves oracle and concerned partitions ✦ No single entity has complete system knowledge ✦ Good performance with strong locality, but… ✦ …slow convergence ✦ Poor performance with weak locality
15
👍
"
👍
"
P1 P2
Dynamically (re-)partitioning the state
•Centralized strategy ✦ Oracle builds graph of objects and relations (commands) ✦ Oracle partitions O-R graph (METIS) and requests move
operations to place all objects in one partition ✦ Near-optimum partitioning (both strong and weak locality) ✦ Fast convergence ✦ Oracle knows location of and relations among objects ✦ Oracle solves a hard problem
16
👍
"
👍
"
Social network application (similar to Twitter)
•GetTimeline ✦ Single-object command => always involves one partition
•Post ✦ Multi-object command => may involve multiple partitions ✦ Strong locality
• 0% edge cut, social graph can be perfectly partitioned
✦ Weak locality • 1% and 5% of edge cuts, after partitioning social graph
17
0
30
60
90
120
150
1 2 4 8
Thro
ughp
ut (k
cps)
0% edge-cut
SMRSSMR
DSSMRDSSMRv2
SSMRMetis
1 2 4 8
Number of partitions
1% edge-cut
1 2 4 8
5% edge-cut
GetTimelines only (single-partition commands)
18
Throughput
Number of partitions
Classic SMR Static
Dyn decentralized Dyn centralized
Optimized static
Servers
Thro
ughp
ut
all schemes scale! (by design)
0
20
40
60
80
1 2 4 8
Thro
ughp
ut (k
cps)
Number of partitions
0% edge-cut
SMRSSMR
DSSMRDSSMRv2
SSMRMetis
Posts only, strong locality (0% edge cut)
19
Number of partitions
Classic SMR Static
Dyn decentralized Dyn centralized
Optimized static dynamic schemes and optimized scale, but not static
0
10
20
30
40
1 2 4 8
Thro
ughp
ut (k
cps)
Number of partitions
1% edge-cut
SMRSSMR
DSSMRDSSMRv2
SSMRMetis
Posts only, weak locality (1% edge cut)
20
Number of partitions
Classic SMR Static
Dyn decentralized Dyn centralized
Optimized static only optimized and centralized dynamic schemes scale
Conclusions
•Scaling State Machine Replication ✦ Possible but locality is fundamental
• OSs and DBs have known this for years
✦ Replication and partitioning transparency
•The future ahead ✦ Decentralized schemes with quality of centralized schemes ✦ Expand scope of applications (e.g., data structures) ✦ “The inherent limits of scalable state machine replication”
21
THANK YOU!!!
More details: http://www.inf.usi.ch/faculty/pedone/scalesmr.html
Joint work with… Long Hoang Le Enrique Fynn
Eduardo Bezerra Robbert van Renesse