Scaling State Machine Replication

Fernando Pedone, University of Lugano (USI), Switzerland


Post on 31-Jul-2020



Page 2

State machine replication

• Fundamental approach to fault tolerance
  ✦ Google Spanner
  ✦ Apache ZooKeeper
  ✦ Windows Azure Storage
  ✦ MySQL Group Replication
  ✦ Galera Cluster, …

Page 3

State machine replication is intuitive & simple

• Replication transparency
  ✦ For clients
  ✦ For application developers

• Simple execution model
  ✦ Replicas order all commands
  ✦ Replicas execute commands deterministically and in the same order
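The execution model above can be illustrated with a minimal Python sketch. The `Replica` class, the key-value state, and the `set`/`add` operations are hypothetical names chosen for the example; the ordering step itself (consensus) is assumed to have already produced the list `log`.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A replica holding a key-value state (hypothetical example state)."""
    state: dict = field(default_factory=dict)

    def execute(self, command):
        # Commands are deterministic: the outcome depends only on the
        # current state and on the command itself.
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "add":
            self.state[key] = self.state.get(key, 0) + value
        else:
            raise ValueError(f"unknown operation: {op}")


def replicate(ordered_commands, n_replicas=3):
    """Every replica executes every command in the same agreed order."""
    replicas = [Replica() for _ in range(n_replicas)]
    for command in ordered_commands:
        for replica in replicas:
            replica.execute(command)
    return replicas


# `log` stands for the output of the ordering layer (assumed given).
log = [("set", "x", 1), ("add", "x", 4), ("set", "y", 7)]
replicas = replicate(log)
```

Because each replica applies the same commands in the same order, all replicas end in the same state. The inner loop also shows why adding replicas does not add throughput: every replica still executes every command.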

Page 4

Configurable fault tolerance but bounded performance

• Performance is bounded by what one replica can do
  ✦ Every replica needs to execute every command
  ✦ More replicas: same (if not worse) performance

[Figure: throughput versus number of servers]

How to scale state machine replication?

Page 5

Scaling performance with partitioning

• Partitioning (aka sharding) application state

Problem #1: How to order commands in a partitioned system?
Problem #2: How to execute commands in a partitioned system?

Scalable performance (for single-partition commands)

[Figure: partitions Px and Py; throughput versus number of servers]

Page 6

Ordering commands in a partitioned system

• Atomic multicast
  ✦ Commands addressed (multicast) to one or more partitions
  ✦ Commands ordered within and across partitions
    • If S delivers C before C’, then no S’ delivers C’ before C

[Figure: commands C(x), C(y), and C(x,y) multicast to partitions Px and Py]

[Figure: layered stack: Scalable SMR, Atomic multicast, Multi-Paxos, Network]
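The ordering property stated above can be checked on recorded delivery sequences. The sketch below is only a checker for the pairwise property as written on the slide, not an atomic multicast implementation; the function name and the server labels are assumptions made for the example.

```python
from itertools import combinations


def respects_multicast_order(deliveries):
    """Check: if one server delivers C before C', no server delivers C' before C.

    `deliveries` maps a server name to the list of commands it delivered,
    in delivery order.
    """
    # Position of each command in each server's delivery sequence.
    positions = {
        server: {cmd: i for i, cmd in enumerate(seq)}
        for server, seq in deliveries.items()
    }
    for s1, s2 in combinations(positions, 2):
        # Only commands delivered by both servers can be ordered inconsistently.
        common = set(positions[s1]) & set(positions[s2])
        for a, b in combinations(sorted(common), 2):
            before_on_s1 = positions[s1][a] < positions[s1][b]
            before_on_s2 = positions[s2][a] < positions[s2][b]
            if before_on_s1 != before_on_s2:
                return False
    return True


# Hypothetical delivery traces for servers in partitions Px and Py.
consistent = {
    "server-in-Px": ["C(x)", "C(x,y)"],
    "server-in-Py": ["C(x,y)", "C(y)"],
}
inconsistent = {
    "server-in-Px": ["C1(x,y)", "C2(x,y)"],
    "server-in-Py": ["C2(x,y)", "C1(x,y)"],
}
```

In the first trace the two servers share only `C(x,y)`, so no pair can be misordered. In the second trace the two multi-partition commands are delivered in opposite orders, which the property forbids.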

Page 7

Executing multi-partition commands

[Figure: Partition X (x x x) and Partition Y (y y y); command C(x,y) : { x := y }]

Solution #1: Static partitioning of data
Solution #2: Dynamic partitioning of data

Page 8

Solution 1: Static partitioning of data

• Execution model
  ✦ Client queries location oracle to determine partitions
  ✦ Client multicasts command to involved partitions
  ✦ Partitions exchange and temporarily store objects needed to execute multi-partition commands
  ✦ Commands executed by all involved partitions

• Location oracle
  ✦ Simple implementation thanks to static scheme
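One way a static location oracle could look is sketched below. The hash-based placement rule (CRC32 modulo the number of partitions) is an assumption for illustration; the slides only say that the scheme is static, not how objects are mapped.

```python
import zlib


class StaticOracle:
    """Maps object keys to partitions with a fixed rule (assumed: CRC32 hash)."""

    def __init__(self, partitions):
        self.partitions = list(partitions)

    def locate(self, key):
        # The mapping never changes, so it is computed without any stored state.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def partitions_for(self, keys):
        """Partitions a command must be multicast to, given the objects it accesses."""
        return sorted({self.locate(k) for k in keys})


oracle = StaticOracle(["Px", "Py"])
destinations = oracle.partitions_for(["x", "y"])
```

Since the rule is fixed, every client can evaluate it locally and always obtains the same answer for the same key, which is what makes the static oracle simple.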

Page 9

How to execute multi-partition commands?

[Figure: Partition X (x x x) and Partition Y (y y y) executing C(x,y): x := y, with cached entries (y y y, x x x)]

Page 10

Static scheme, step-by-step

Client
• start
• query oracle
• multicast command to involved partitions
• receive result
• end

Server
• deliver command
• all local objects?
  ✦ Yes: execute command
  ✦ No: send needed objects/signal to remote partitions, wait for objects/signal from remote partitions, then execute command
• send result
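The server-side steps of the static scheme can be traced with a small single-process simulation of C(x,y): { x := y }. This is a sketch under simplifying assumptions: replicas inside a partition, message passing, and the signals exchanged between partitions are not modeled, and the `Partition` class and its methods are hypothetical names.

```python
class Partition:
    """One partition of the state; replicas within it are not modeled here."""

    def __init__(self, name, objects):
        self.name = name
        self.objects = dict(objects)   # objects owned by this partition
        self.cache = {}                # temporary copies of remote objects

    def send_needed_objects(self, keys, others):
        # Ship local objects that the command reads to the other involved partitions.
        for key in keys:
            if key in self.objects:
                for other in others:
                    other.cache[key] = self.objects[key]

    def execute_assign(self, target, source):
        # Execute `target := source`; only the owner of `target` updates its state.
        value = self.objects.get(source, self.cache.get(source))
        if target in self.objects:
            self.objects[target] = value
        self.cache.clear()             # cached entries are only temporary


def run_multi_partition_assign(partitions, target, source):
    """Deliver C(target, source): { target := source } to all involved partitions."""
    # Step 1: exchange the objects the command needs.
    for p in partitions:
        p.send_needed_objects([source], [q for q in partitions if q is not p])
    # Step 2: every involved partition executes the command.
    for p in partitions:
        p.execute_assign(target, source)


px = Partition("Px", {"x": 1})
py = Partition("Py", {"y": 42})
run_multi_partition_assign([px, py], target="x", source="y")
```

After the run, `px` holds the new value of `x` while `py` is unchanged, and both caches are empty again. Both partitions still had to take part in the command, which is the cost the static scheme pays for multi-partition commands.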

Page 11

Solution 2: Dynamic partitioning of data

• Execution model (key idea)
  ✦ Turn every command single-partition
  ✦ If command involves multiple partitions, move objects to a single partition before executing command

• Location oracle
  ✦ Oracle implemented as a “special partition”
  ✦ Move operations involve oracle, source and destination partitions
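A move operation touches three parties: the source partition, the destination partition, and the oracle. The sketch below shows that bookkeeping in a single process; in a real system the move would itself be an ordered command, which is not modeled here. The `Oracle` class and the `move` and `make_single_partition` helpers are hypothetical names.

```python
class Oracle:
    """Tracks which partition currently holds each object."""

    def __init__(self, locations):
        self.locations = dict(locations)

    def partitions_of(self, keys):
        return {self.locations[k] for k in keys}


def move(oracle, partitions, key, destination):
    """Move `key` to `destination`, updating source, destination, and oracle."""
    source = oracle.locations[key]
    if source == destination:
        return
    value = partitions[source].pop(key)      # source partition gives up the object
    partitions[destination][key] = value     # destination partition receives it
    oracle.locations[key] = destination      # oracle records the new location


def make_single_partition(oracle, partitions, keys, destination):
    """Gather all objects of a command in one partition before executing it."""
    for key in keys:
        move(oracle, partitions, key, destination)


partitions = {"Px": {"x": 1}, "Py": {"y": 42}}
oracle = Oracle({"x": "Px", "y": "Py"})
make_single_partition(oracle, partitions, ["x", "y"], destination="Px")
partitions["Px"]["x"] = partitions["Px"]["y"]   # C(x,y): x := y, now local to Px
```

Once both objects live in Px, the command is single-partition, and later commands on `x` and `y` need no further moves as long as the objects stay together.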

Page 12

Dynamic scheme, step-by-step

Client
• start
• query oracle
• one partition?
  ✦ No: move objects to one partition
  ✦ Yes: multicast command to partition
• receive result
• retry?
  ✦ Yes: repeat from query oracle
  ✦ No: end

Server
• deliver command
• all local objects?
  ✦ Yes: execute command
  ✦ No: result = retry
• send result

Page 13

Termination and load balance

• Ensuring termination of commands
  ✦ After retrying n times, command is multicast to all partitions
  ✦ Executed as a multi-partition command

• Ensuring load balancing among partitions
  ✦ Target partition in multi-partition command chosen randomly
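Both rules above fit in a few lines of client-side logic. In this sketch, `try_single_partition` and `multicast_to_all` are hypothetical callables standing in for the real protocol, and `max_retries` plays the role of n.

```python
import random


def choose_target(involved_partitions, rng=random):
    """Pick the destination partition at random to spread load."""
    return rng.choice(sorted(involved_partitions))


def submit(command, try_single_partition, multicast_to_all, max_retries=3):
    """Try the single-partition path up to `max_retries` times, then fall back.

    `try_single_partition(command)` returns a result, or the string "retry"
    when the server found that not all objects were local.
    `multicast_to_all(command)` executes the command as a multi-partition
    command on every partition and always returns a result.
    Both callables are hypothetical stand-ins for the real protocol.
    """
    for _ in range(max_retries):
        result = try_single_partition(command)
        if result != "retry":
            return result
    return multicast_to_all(command)


attempts = []


def always_stale(command):
    # Simulates a command whose objects keep moving away before it executes.
    attempts.append(command)
    return "retry"


result = submit("post", always_stale, lambda c: f"executed {c} on all partitions")
```

The fallback bounds the number of attempts, so a command cannot be starved by concurrent moves. The random choice in `choose_target` avoids steering every move toward the same partition.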

Page 14

Oracle: high availability and performance

• Oracle implemented as a partition
  ✦ For fault tolerance

• Clients cache oracle entries
  ✦ For performance
  ✦ Real oracle needed at first access and when objects change location
  ✦ Client retries command if cached location is stale
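The client-side cache described above might be sketched as follows. `CachingClient`, `oracle_lookup`, and `send` are hypothetical names, and the server's "retry" reply is modeled as a plain string.

```python
class CachingClient:
    """Client that caches oracle entries and refreshes them when stale."""

    def __init__(self, oracle_lookup, send):
        self.oracle_lookup = oracle_lookup   # key -> partition (the real oracle)
        self.send = send                     # (partition, key) -> result or "retry"
        self.cache = {}
        self.oracle_queries = 0

    def locate(self, key, refresh=False):
        if refresh or key not in self.cache:
            self.oracle_queries += 1
            self.cache[key] = self.oracle_lookup(key)
        return self.cache[key]

    def execute(self, key):
        result = self.send(self.locate(key), key)
        if result == "retry":
            # Cached location was stale: ask the real oracle and retry once.
            result = self.send(self.locate(key, refresh=True), key)
        return result


location = {"x": "Px"}


def send(partition, key):
    # A partition answers only if it currently holds the object.
    return f"{key}@{partition}" if location[key] == partition else "retry"


client = CachingClient(lambda k: location[k], send)
first = client.execute("x")     # first access: the oracle is queried
second = client.execute("x")    # served from the cache
location["x"] = "Py"            # the object moves
third = client.execute("x")     # stale entry, so the client retries
```

Across the three calls the oracle is contacted twice: once at first access and once after the object changed location.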

Page 15

Dynamically (re-)partitioning the state

• Decentralized strategy
  ✦ Client chooses one partition among involved partitions
  ✦ Each move involves oracle and concerned partitions
  ✦ No single entity has complete system knowledge
  ✦ Good performance with strong locality, but…
  ✦ …slow convergence
  ✦ Poor performance with weak locality

[Figure: partitions P1 and P2]

Page 16

Dynamically (re-)partitioning the state

• Centralized strategy
  ✦ Oracle builds graph of objects and relations (commands)
  ✦ Oracle partitions O-R graph (METIS) and requests move operations to place all objects in one partition
  ✦ Near-optimum partitioning (both strong and weak locality)
  ✦ Fast convergence
  ✦ Oracle knows location of and relations among objects
  ✦ Oracle solves a hard problem
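To make the object-relation graph concrete, the sketch below builds it from a list of commands and groups related objects. Note that it does not use METIS: it substitutes plain connected components (union-find) plus a greedy size-based placement, which is only adequate when the graph splits cleanly (the 0% edge-cut case). All function names are assumptions for the example.

```python
from collections import defaultdict


def group_related_objects(commands):
    """Union-find over the object-relation graph: one group per connected component."""
    parent = {}

    def find(obj):
        parent.setdefault(obj, obj)
        while parent[obj] != obj:
            parent[obj] = parent[parent[obj]]   # path halving
            obj = parent[obj]
        return obj

    # Each command relates all the objects it accesses (commands are non-empty).
    for objects in commands:
        first = find(objects[0])
        for other in objects[1:]:
            parent[find(other)] = first
    groups = defaultdict(set)
    for obj in parent:
        groups[find(obj)].add(obj)
    return list(groups.values())


def assign_groups(groups, n_partitions):
    """Place whole groups on the least-loaded partition, largest group first."""
    placement = {}
    load = [0] * n_partitions
    for group in sorted(groups, key=len, reverse=True):
        target = load.index(min(load))
        load[target] += len(group)
        for obj in group:
            placement[obj] = target
    return placement


# Hypothetical workload: each tuple lists the objects one command accesses.
commands = [("alice", "bob"), ("bob", "carol"), ("dave", "erin"), ("frank",)]
placement = assign_groups(group_related_objects(commands), n_partitions=2)
```

With this placement every command in the sample workload touches a single partition. When components are too large or too entangled to place whole, a real graph partitioner has to decide which edges to cut, which is the hard problem the slide refers to.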


Page 17

Social network application (similar to Twitter)

• GetTimeline
  ✦ Single-object command => always involves one partition

• Post
  ✦ Multi-object command => may involve multiple partitions
  ✦ Strong locality
    • 0% edge cut, social graph can be perfectly partitioned
  ✦ Weak locality
    • 1% and 5% of edge cuts, after partitioning social graph
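The edge-cut figures used to define strong and weak locality can be computed as the share of social-graph edges whose endpoints land in different partitions. The sketch below assumes that definition; the sample graph and placement are made up for illustration.

```python
def edge_cut_percentage(edges, placement):
    """Share of edges whose endpoints are placed in different partitions."""
    if not edges:
        return 0.0
    cut = sum(1 for u, v in edges if placement[u] != placement[v])
    return 100.0 * cut / len(edges)


# Hypothetical follower relations and a two-partition placement.
follows = [("alice", "bob"), ("bob", "carol"), ("dave", "erin"), ("carol", "dave")]
placement = {"alice": 0, "bob": 0, "carol": 0, "dave": 1, "erin": 1}

weak = edge_cut_percentage(follows, placement)        # one of four edges is cut
strong = edge_cut_percentage(follows[:3], placement)  # no edge crosses partitions
```

A Post that follows a cut edge involves objects in two partitions, so a higher edge cut means more multi-partition commands.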

Page 18

GetTimelines only (single-partition commands)

[Figure: throughput (kcps) versus number of partitions (1, 2, 4, 8), in three panels: 0% edge-cut, 1% edge-cut, 5% edge-cut. Series: SMR, SSMR, DSSMR, DSSMRv2, SSMRMetis. Legend: Classic SMR, Static, Dyn decentralized, Dyn centralized, Optimized static]

All schemes scale! (by design)

Page 19

Posts only, strong locality (0% edge cut)

[Figure: throughput (kcps) versus number of partitions (1, 2, 4, 8), 0% edge-cut. Series: SMR, SSMR, DSSMR, DSSMRv2, SSMRMetis. Legend: Classic SMR, Static, Dyn decentralized, Dyn centralized, Optimized static]

Dynamic and optimized static schemes scale, but the static scheme does not

Page 20

Posts only, weak locality (1% edge cut)

[Figure: throughput (kcps) versus number of partitions (1, 2, 4, 8), 1% edge-cut. Series: SMR, SSMR, DSSMR, DSSMRv2, SSMRMetis. Legend: Classic SMR, Static, Dyn decentralized, Dyn centralized, Optimized static]

Only the optimized static and centralized dynamic schemes scale

Page 21

Conclusions

• Scaling State Machine Replication
  ✦ Possible but locality is fundamental
    • OSs and DBs have known this for years
  ✦ Replication and partitioning transparency

• The future ahead
  ✦ Decentralized schemes with quality of centralized schemes
  ✦ Expand scope of applications (e.g., data structures)
  ✦ “The inherent limits of scalable state machine replication”

Page 22

THANK YOU!!!

More details: http://www.inf.usi.ch/faculty/pedone/scalesmr.html

Joint work with… Long Hoang Le, Enrique Fynn, Eduardo Bezerra, Robbert van Renesse