
RAMBO: Reconfigurable Atomic Memory for Dynamic Networks

Edward Bortnikov
048961 – Topics in Reliable Distributed Computing

Slides partially borrowed from Nancy Lynch (DISC ’02),
Seth Gilbert (DSN ’03) and Idit Keidar (multiple talks)

Outline

- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO

Distributed Shared Memory

[Figure: clients issue Read, Write(7) and Write(0) operations on a replicated shared object]

Atomic Consistency

- AKA linearizability
- Definition: each operation appears to occur at some point between its invocation and response.
- Sufficient condition: for each object x, all the read and write operations on x can be partially ordered by ≺, so that:
  - ≺ is consistent with the order of invocations and responses: there are no operations π1, π2 such that π1 completes before π2 starts, yet π2 ≺ π1.
  - All write operations are ordered with respect to each other and with respect to all the reads.
  - Every read returns the value of the last write preceding it in ≺.
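To make the sufficient condition concrete, here is a minimal Python sketch (not from the slides) that checks a completed execution against a proposed total order; the Op record and its field names are illustrative assumptions.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Op:
        kind: str                 # "read" or "write"
        value: Optional[int]      # value written, or value returned by the read
        invoke: float             # invocation time
        respond: float            # response time

    def is_atomic(order: list) -> bool:
        """Check the sufficient condition against a proposed total order of completed ops."""
        # 1. The order is consistent with invocations and responses: if a precedes b
        #    in the order, then b must not have completed before a even started.
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                if b.respond < a.invoke:
                    return False
        # 2. A total order already orders all writes w.r.t. each other and w.r.t. reads.
        # 3. Every read returns the value of the last write preceding it in the order.
        last_written = None                      # None stands for the initial value
        for op in order:
            if op.kind == "write":
                last_written = op.value
            elif op.value != last_written:
                return False
        return True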

Atomic Consistency (example)

[Figure: an execution with Write(0), Write(7) and a Read that returns 7]

Quorums

[Figure: a Write(7) and a Read, each contacting an intersecting quorum of replicas]

Dynamic Atomic Memory

Outline

- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO

Prior Work on Quorums

- Gifford (79) and Thomas (79)
- Upfal and Wigderson (85): majority sets of readers and writers
- Vitanyi and Awerbuch (86): matrices of single-writer/single-reader registers
- Attiya, Bar-Noy and Dolev (90/95): majorities of processors to implement single-writer/multi-reader objects in message-passing systems

Static: ABD (Attiya, Bar-Noy, Dolev)

- Single-writer, multiple-reader
- Assumes non-faulty processors (nodes); majority is the primitive quorum
- Communicate: send a request to the n processors, await acks from a majority
- Tags are used for distributed ordering of operations:
  - WRITE operations increment the tag
  - READ operations use the tag
  - Both propagate the tag
- Properties:
  - A READ returns either the last completed WRITE or a concurrent one
  - ≤ tag ordering between READs


Reads and Writes (ABD)

- Write: increment the tag, send tag/value
- Read:
  - Phase 1: find the highest tag/value
  - Phase 2: send (write back) the tag

[Figure: replicas holding (value, tag) pairs, e.g. values 32, 5, 24, 72 with tags 100, 101, 102, 103]
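A minimal in-memory Python sketch of the ABD read/write pattern summarized above; the Replica class, the synchronous calls, and the single-writer tag handling are simplifying assumptions, not the ABD code itself.

    import math

    class Replica:
        """One replica holding a (tag, value) pair."""
        def __init__(self):
            self.tag, self.value = 0, None

        def get(self):                     # answer a query
            return self.tag, self.value

        def store(self, tag, value):       # answer a propagation
            if tag > self.tag:             # keep only the newest tag
                self.tag, self.value = tag, value
            return "ack"

    class ABDClient:
        def __init__(self, replicas):
            self.replicas = replicas
            self.majority = math.floor(len(replicas) / 2) + 1
            self.write_tag = 0             # single writer: local tag counter

        def write(self, value):
            self.write_tag += 1            # WRITE increments the tag...
            acks = [r.store(self.write_tag, value) for r in self.replicas[:self.majority]]
            return len(acks) >= self.majority   # ...and propagates it to a majority

        def read(self):
            # Phase 1: find the highest tag/value at a majority.
            answers = [r.get() for r in self.replicas[:self.majority]]
            tag, value = max(answers, key=lambda tv: tv[0])
            # Phase 2: write the tag/value back to a majority before returning.
            for r in self.replicas[:self.majority]:
                r.store(tag, value)
            return value

    replicas = [Replica() for _ in range(5)]
    writer, reader = ABDClient(replicas), ABDClient(replicas)
    writer.write(7)
    assert reader.read() == 7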

Outline

- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO

Dynamic Approaches (1)

- Consensus to agree on each operation [Lamport]
  - Consensus for each R/W: bad performance!
- Virtual synchrony [Birman 85]: group communication
  - R/W simulated through atomic broadcast
  - Consensus only for a special case (view change)
  - Issue: determining the primary partition (quorum)
    - [Yeger-Lotem, Keidar, Dolev ’97] – dynamic voting
  - But still performance issues:
    - One join or failure may trigger view-formation delays for R/W
    - In the presence of failures, R/W ops can be delayed indefinitely

Group Communication Abstraction

- Send (Grp, Msg)
- Deliver (Msg)
- Join / Leave (Grp)
- View (Grp, Members, Id)
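One possible Python rendering of this abstraction, purely illustrative (the class and callback names are not from the slides):

    from abc import ABC, abstractmethod

    class GroupCommunication(ABC):
        """Interface offered by a group communication service to the application."""

        @abstractmethod
        def send(self, grp: str, msg: bytes) -> None: ...      # multicast msg to the group

        @abstractmethod
        def join(self, grp: str) -> None: ...                   # ask to join a group

        @abstractmethod
        def leave(self, grp: str) -> None: ...                  # ask to leave a group

        # Upcalls delivered by the service to the application:
        def on_deliver(self, grp: str, msg: bytes) -> None: ...               # message delivery
        def on_view(self, grp: str, members: list, view_id: int) -> None: ... # new view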

Group Communication Systems (1)

- Group Membership: processes are organized into groups
- Particular memberships are stamped as views
- Views provide a form of Concurrent Common Knowledge about the system
- In a partitionable system, views can be concurrent

[Figure: timeline of p1, p2, p3 with views V1 {p1, p2, p3}, V2 {p1, p2}, V3 {p3}, V5 {p1, p2, p3}]

Virtual Synchrony [Birman, Joseph 87]

- Integration of Multicast and Membership
- Synchronization of Messages and Views
- Includes many different properties; one key property:
  - Processes that go together through the same views deliver the same sets of messages.
- A powerful abstraction for state-machine replication
- Reliable Multicast: messages sent to the group with Total/Causal/FIFO ordering
- Virtual Synchrony: the same set of multicast messages is delivered to group members between view changes
- Guaranteed Self Delivery: a process will eventually deliver a self-message or crash
- (Usually) Sending View Delivery: a message is delivered in the same view in which it was sent

Group Communication Systems (2)

Example: a GC-based VOD server

[Figure: a ServiceGroup (“Movies?”, start/update control), SessionGroups, and a MovieGroup replicating titles such as Chocolat, Gladiator and Spy Kids; the timeline shows views V1 {p1, p2, p3} and V2 {p1, p2}]

Virtual Synchrony – Membership

- Issue: accurate estimation of the group membership
- Natural implementation: consensus
  - But distributed consensus is impossible under failures in an asynchronous system [FLP ’85]!
  - How to distinguish between a failed and a slow processor?
- Solution: failure detectors to deliver views
  - May use mechanisms other than asynchronous message arrivals to suspect failed processes
- Failure detector ◊S: initially the output is arbitrary, but eventually…
  - every process that crashes is suspected (completeness)
  - some process that does not crash is not suspected (accuracy)
- ◊S is the weakest FD to solve consensus (Rotating Coordinator algorithm)

Virtual Synchrony - Multicast

- Assumption: point-to-point reliable FIFO links
- All-or-none message delivery
  - Only for the view (the alive processes) – dead men tell no tales (E.W. Hornung, 1899)
  - STABLE messages and delivery between views
  - What if the sender crashes in the middle of a multicast? ISIS algorithm – FLUSH markers
  - Messages can be delayed indefinitely during view formation!
- Total message ordering
  - TOTEM (token-ring) algorithm
  - Symmetric (Lamport timestamps) algorithm

Dynamic Voting on Top of GC

- R/W service as a replicated state machine (total order)
- Data replicas managed by the primary partition (quorum)
  - Problematic in a dynamic, unreliable network
- Adaptive quorums – majority of the previous quorum
  - {a,b,c,d,e} → {a,b,c} → {a,b}
- Dynamic linear voting
  - Pid breaks ties between equal-sized partitions
- Is this enough?

[Figure: five nodes a, b, c, d, e]

Failures in the Course of the Protocol

[Figure: nodes a, b, c, d, e partitioning]

- {a, b, c} attempt to form a quorum; a and b succeed, while c detaches, unaware of the attempt
- {a, b} form a quorum – a majority of {a, b, c}
- Concurrently, {c, d, e} form a quorum – a majority of {a, b, c, d, e}
- Inconsistency!

Handling Ambiguous Configurations

- Idea: make c aware that a and b may succeed in forming {a, b, c}
  - {a, b, c} is ambiguous for c: it may or may not have been formed
- Processes record ambiguous attempts
  - c records both {a, b, c, d, e} and {a, b, c}
- Requiring a majority of both, c will refuse to form {c, d, e}

Dynamic Voting - Ambiguity Resolution

- Upon membership changes: exchange information [sub-quorum of the last primary and of all ambiguous attempts]
- ATTEMPT: record the attempt as ambiguous [all attempted]
- FORM: become primary + delete all ambiguous attempts (see the sketch below)
- Caveat: garbage collection
  - Potentially exponential # of ambiguous attempts
  - Constrain the protocol to store a linear #
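A rough Python sketch of this bookkeeping, under simplifying assumptions (no messaging, a single group; all names are invented here):

    class DynamicVotingNode:
        """Tracks the last primary and the ambiguous attempts, as in dynamic voting."""

        def __init__(self, last_primary):
            self.last_primary = set(last_primary)   # membership of the last formed primary
            self.ambiguous = []                     # attempts that may or may not have formed

        def is_majority_of(self, candidate, reference):
            return len(set(candidate) & set(reference)) > len(reference) / 2

        def attempt(self, candidate):
            """ATTEMPT: record the candidate as ambiguous before trying to form it."""
            self.ambiguous.append(set(candidate))

        def may_form(self, candidate):
            """A new primary needs a majority of the last primary AND of every ambiguous attempt."""
            references = [self.last_primary] + self.ambiguous
            return all(self.is_majority_of(candidate, ref) for ref in references)

        def form(self, candidate):
            """FORM: become the new primary and delete all ambiguous attempts."""
            assert self.may_form(candidate)
            self.last_primary = set(candidate)
            self.ambiguous.clear()

    # Node c from the example: it recorded the attempt {a, b, c} before detaching.
    c = DynamicVotingNode(last_primary={"a", "b", "c", "d", "e"})
    c.attempt({"a", "b", "c"})
    assert not c.may_form({"c", "d", "e"})   # no majority of the ambiguous {a, b, c}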

Dynamic Approaches (1)

- Consensus to agree on each operation [Lamport]
  - Consensus for each R/W (not guaranteed to terminate): bad performance!
- Virtual synchrony [Birman 85]: group communication
  - R/W simulated through atomic broadcast
  - Consensus only for a special case (view change)
  - Issue: determining the primary partition (quorum)
    - [Yeger-Lotem, Keidar, Dolev ’97] – dynamic voting
  - But still performance issues:
    - One join or failure may trigger view-formation delays for R/W
    - In the presence of failures, R/W ops can be delayed indefinitely

Dynamic Approaches (2)

- Quorum-based reads/writes over GC [De Prisco et al. 99]
  - A new view must satisfy space requirements: intersection between the old and new quorums
  - (RAMBO has time requirements instead: some quorums of the old and new system are involved in reconfiguration)
- Single reconfigurer [Lynch, Shvartsman 97], [Englert, Shvartsman 00]
  - Terminology change: view → configuration
  - Allows multiple concurrent configurations
  - The single reconfigurer is a single point of failure (SPOF)!

Outline

- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO

RAMBO – key ideas

- Separate the handling of R/W operations from view (configuration) changes
  - R/W ops must complete fast
  - Configuration changes can propagate in the background
- Two levels of accommodating changes
  - Small and transient changes – through multiple quorums
  - Large and permanent changes – through reconfiguration
- Managing configurations
  - Multiple configurations may co-exist
  - Old configurations can be garbage-collected
  - The nodes agree on the order of configurations (Paxos)

RAMBO Architecture

[Figure: per-node architecture – a Reader-Writer component (read/read-ack, write/write-ack, upgrade) and a Recon component, layered over the network (Net)]

RAMBO API

Domains:
- I = set of Nodes (Locations)
- V = set of Values
- C = set of Configurations, with Members(c), Read-quorums(c), Write-quorums(c)

Input (asynchronous, per node/object):
- Join
- Read
- Write(v)
- Recon(c, c’)
- Fail

Output (asynchronous, per node/object):
- Join-ack
- Read-ack(v)
- Write-ack
- Recon-ack(b)   // True/False
- Report(c)      // new configuration
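An illustrative Python sketch of this interface; the class, type, and callback names are assumptions, not part of RAMBO’s specification.

    from dataclasses import dataclass
    from typing import Callable, FrozenSet

    @dataclass(frozen=True)
    class Configuration:
        members: FrozenSet[str]                   # Members(c)
        read_quorums: FrozenSet[FrozenSet[str]]   # Read-quorums(c)
        write_quorums: FrozenSet[FrozenSet[str]]  # Write-quorums(c)

    class RamboNode:
        """Asynchronous per-node/per-object interface: inputs as methods, outputs as callbacks."""

        # --- Inputs ---
        def join(self) -> None: ...
        def read(self) -> None: ...
        def write(self, v) -> None: ...
        def recon(self, c: Configuration, c_new: Configuration) -> None: ...
        def fail(self) -> None: ...

        # --- Outputs (callbacks the environment registers) ---
        on_join_ack: Callable[[], None]
        on_read_ack: Callable[[object], None]          # Read-ack(v)
        on_write_ack: Callable[[], None]
        on_recon_ack: Callable[[bool], None]           # Recon-ack(True/False)
        on_report: Callable[[Configuration], None]     # Report(c): new configuration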

Recon Service Specification

Recon:
- Chooses configurations
- Tells the members of the previous and new configuration
- Informs the Reader-Writer components (new-config)

Behavior (assuming well-formedness):
- Agreement: two configurations are never assigned to the same index k
- Validity: any announced new-config was previously requested by someone
- No duplication: no configuration is assigned to more than one k

Reads and Writes (RAMBO)

- Write:
  - Phase 1: choose a tag
  - Phase 2: send tag/value
- Read:
  - Phase 1: find the tag/value
  - Phase 2: send tag/value

[Figure: replicas holding (value, tag) pairs, e.g. values 32, 5, 24, 72 with tags 100, 101, 102, 103]

Multiple Configurations (1)

- Every node can
  - Install a new configuration
  - Garbage-collect an old configuration
  - Learn about both through gossiping
- The Recon service guarantees the global order
- Configuration map (cmap)
  - The node’s snapshot of the picture of the world
  - Special configurations: ⊥ (undefined) and ± (GC’ed)

Multiple Configurations (2)

Some algebra:
- Update: ⊥ → c → ±   // configuration lifecycle
- Extend: ⊥ → c       // new configurations
- Truncate: (c1, c2, ⊥, c4) → (c1, c2)   // removing holes
- A configuration map w/o holes ∈ Truncated

[Figure: a cmap laid out by index – a GC’d (±) prefix, then Defined configurations (c), then an Undefined (⊥) tail; Mixed entries appear during transitions]
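A small Python sketch of this cmap algebra, assuming the per-index ordering ⊥ < c < ± suggested by the lifecycle above (the representation and helper names are mine):

    UNDEF = "_"   # stands for ⊥ (undefined)
    GCED  = "+-"  # stands for ± (garbage-collected)

    def rank(entry):
        """Per-index lifecycle order: ⊥ < defined configuration < ±."""
        return 0 if entry == UNDEF else 2 if entry == GCED else 1

    def update(cmap, other):
        """Merge two cmaps index-wise, keeping the 'later' lifecycle state."""
        n = max(len(cmap), len(other))
        pad = lambda m: m + [UNDEF] * (n - len(m))
        return [a if rank(a) >= rank(b) else b for a, b in zip(pad(cmap), pad(other))]

    def extend(cmap, k, config):
        """Record a newly learned configuration at index k (only if still undefined)."""
        cmap = cmap + [UNDEF] * (k + 1 - len(cmap))
        if cmap[k] == UNDEF:
            cmap[k] = config
        return cmap

    def truncate(cmap):
        """Drop everything from the first hole (⊥) onward."""
        if UNDEF in cmap:
            return cmap[:cmap.index(UNDEF)]
        return cmap

    cm = extend([], 0, "c0")
    cm = extend(cm, 2, "c2")                  # index 1 is a hole
    assert truncate(cm) == ["c0"]
    assert update(cm, [GCED])[0] == GCED      # c0 was garbage-collected elsewhere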

CMAP Evolution

[Figure: the cmap evolving over time –
  (c0)
  (c0, c1)
  (c0, c1, c2, …, ck)
  (±, c1, c2, …, ck)
  (±, ±, c2, …, ck)
  (±, ±, ±, c3, …, ck)
  (±, ±, ±, ±, ±, c, c, c, c, …)]

R/W Automaton Implementation

- The node keeps gossiping with the “world” all the time
- Tags are used for distributed ordering of operations
  - WRITE operations increment the tag
  - READ operations use the tag
  - Every READ returns the value of the WRITE with the same tag
- Agreeing on tags
  - Every op consists of a query phase and a propagation phase
  - Query – acquire the tag from “enough” members: a read-quorum of every active configuration
  - Propagation – push the value/tag to “enough” members: a write-quorum of every active configuration
  - Fixed point: a predicate indicating that the respective op has completed

R/W with Multiple Configurations

- Key to the asynchronous execution of R/W operations: no abortion of R/W when a new configuration is reported
  - Extra work to access the additional processes needed for the new quorums
- Reach a quorum for every c in cmap, to synchronize with every process that might hold c (see the sketch below):
  - Some read-quorum at the QUERY stage (query-fixed-point precondition)
  - Some write-quorum at the PROP stage (prop-fixed-point precondition)
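A simplified Python sketch of the two-phase pattern over all active configurations; the quorum selection, synchronous messaging, and fixed-point tests are idealized assumptions, not RAMBO’s I/O-automaton code.

    from collections import namedtuple

    # Minimal stand-in; a real configuration also has Members.
    Config = namedtuple("Config", ["read_quorums", "write_quorums"])
    UNDEF, GCED = "_", "+-"   # placeholders for ⊥ and ±

    def active_configs(cmap):
        """Configurations in the cmap that are defined and not garbage-collected."""
        return [c for c in cmap if c not in (UNDEF, GCED)]

    def query_phase(cmap, ask):
        """Query fixed point: a read-quorum of EVERY active configuration has answered."""
        best_tag, best_value = (0, ""), None       # tag is a (counter, pid) pair
        for c in active_configs(cmap):
            quorum = next(iter(c.read_quorums))    # idealized: pick any one read-quorum
            for node in quorum:
                tag, value = ask(node)             # query message to that node
                if tag > best_tag:
                    best_tag, best_value = tag, value
        return best_tag, best_value

    def prop_phase(cmap, tag, value, push):
        """Propagation fixed point: tag/value pushed to a write-quorum of EVERY active config."""
        for c in active_configs(cmap):
            for node in next(iter(c.write_quorums)):   # idealized: any one write-quorum
                push(node, tag, value)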

R/W Automata State

- world
- value, tag
- cmap
- pnum1 – counts the phases of locally-initiated operations
- pnum2[] – records the latest known phase numbers for all locations
  - Recall causal ordering and vector clocks!
- op-record – keeps track of the status of the current locally-initiated read/write operation
  - Includes op.cmap, consisting of consecutive configurations
- gc-record – keeps track of the status of the current locally-initiated GC operation

R/W Automaton: Recv() code

- The cmap may evolve during a R/W operation
- Accept only “recent” messages
  - Local message numbering (pnum) ensures a causal-order check: “I have heard from you since you started the op!”
- Pitfall: a hole in the new cmap
  - “I am using stale data!” – restart the phase with the truncated cmap

On recv(W, v, t, cm, ns) from j:
    world := world ∪ W
    if t > tag then (value, tag) := (v, t)
    cmap := update(cmap, cm)
    pnum2(j) := max(pnum2(j), ns)
    gc-record: if the message is “recent”, record the sender
    op-record: if the message is “recent”:
        record the sender
        extend op.cmap with newly discovered configurations

Putting it all together…

[Figure: write(x, 7) over cmap (±, ±, c3, c4, c5); configuration c6 appears during the operation; largest tag found: 100, new tag: 101]

Garbage Collection

- A process can initiate a configuration’s garbage collection
  - Provided that the previous configurations are ±
  - One at a time (may be improved!)
- Multiple processes can start GC of the same configuration, concurrently with R/W
  - A GC can stop if an identical GC has already completed (GC is idempotent)
- The same two-phase protocol (see the sketch below)
  - Query: reach a read-quorum and a write-quorum of cmap[k]
    - Inform a write-quorum of the old configuration about the new configuration
    - Collect object values from a read-quorum of the old configuration
  - Prop: reach a write-quorum of cmap[k+1]
    - Propagate the latest value to a write-quorum of the new configuration
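A schematic Python sketch of this two-phase garbage collection, under the same idealized assumptions as the earlier sketches (configurations expose read_quorums/write_quorums; the helper names are mine):

    GCED = "+-"   # stands for ±

    def garbage_collect(cmap, k, ask, push):
        """Retire configuration cmap[k] in favor of cmap[k+1] (idealized, synchronous)."""
        old, new = cmap[k], cmap[k + 1]   # configuration objects with read_quorums / write_quorums

        # Phase 1 (query): reach a read-quorum and a write-quorum of the old configuration:
        #  - inform a write-quorum of the old configuration about the new one,
        #  - collect the latest tag/value from a read-quorum of the old configuration.
        best_tag, best_value = (0, ""), None
        for node in next(iter(old.write_quorums)):
            push(node, "new-config", new)
        for node in next(iter(old.read_quorums)):
            tag, value = ask(node)
            if tag > best_tag:
                best_tag, best_value = tag, value

        # Phase 2 (prop): propagate the latest value to a write-quorum of the new configuration.
        for node in next(iter(new.write_quorums)):
            push(node, "tag-value", (best_tag, best_value))

        # Only now may cmap[k] be marked ± (garbage-collected) locally.
        cmap[k] = GCED
        return cmap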

Proof Sketch

- ≤ ordering of tags between sequential GC operations
  - Quorum intersection (∩) between a read-quorum of cmap[k] and a write-quorum of cmap[k+1]
- Ordering between a sequential GC and a R/W operation
  - ≤ ordering of tags between the GC and READ operations
  - < ordering of tags between the GC and WRITE operations
- Ordering between sequential reads and writes
  - ≤ ordering between */READ, < ordering between */WRITE
  - Either there is a common configuration c, and the tag is conveyed through the quorum intersection property…
  - … or the tag info is conveyed through the GC of some configuration in between

Recon Implementation

- Consensus implemented using the Paxos Synod algorithm
- Members of the old configuration propose a new configuration; proposals are reconciled using consensus
- recon(c, c’): request for reconfiguration from c to c’
  - [if c is the (k-1)st configuration] send an init(Cons(k, c’)) message to c.members
- recv(init): participate in consensus
- decide(c’): tell R/W the new configuration; send a new-config message to the members of c’

[Figure: the Recon component (recon/recon-ack) layered over a Consensus service and the network (Net)]
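A compact Python sketch of how Recon might sit on top of a black-box consensus service; the consensus interface, the send helper, and the configuration fields are assumptions for illustration, not the RAMBO/Paxos code itself.

    class ReconNode:
        """Per-node Recon logic layered over a black-box consensus service."""

        def __init__(self, me, consensus, send):
            self.me = me
            self.consensus = consensus          # assumed: propose(k, value) -> decided value
            self.send = send                    # assumed: send(dest, msg) over the network
            self.configs = {}                   # k -> decided configuration

        def recon(self, k, c, c_new):
            """recon(c, c'): ask to move from configuration c (index k-1) to c_new (index k)."""
            for member in c.members:            # start consensus instance k among c.members
                self.send(member, ("init", k, c_new))

        def on_init(self, k, proposal):
            """recv(init): participate in consensus instance k with this proposal."""
            decided = self.consensus.propose(k, proposal)
            self.on_decide(k, decided)

        def on_decide(self, k, c_new):
            """decide(c'): record it, tell the local Reader-Writer, notify the new members."""
            if k in self.configs:
                return                          # Agreement: at most one configuration per k
            self.configs[k] = c_new
            self.report_to_rw(k, c_new)         # new-config to the local R/W component
            for member in c_new.members:
                self.send(member, ("new-config", k, c_new))

        def report_to_rw(self, k, c_new):
            print(f"{self.me}: Report(k={k}, config={c_new})")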

Conditional Performance Analysis

- Safety is always guaranteed… but there are no absolute performance guarantees!
- Under “good” network conditions:
  - Bounded message delay d
  - Sufficient spacing between configurations (e)
  - Configuration and quorum viability (e)
- Bounds (under quiescence conditions):
  - Join – 2d
  - Reconfiguration – 13d
  - Read/Write – 4d (two phases)
  - GC – 4d (two phases)
- The bounds deteriorate under weaker stability conditions (Rambo stabilizes after the network stabilizes)

Outline

- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO

RAMBO-2

- Goal: overcome the bottleneck of one GC at a time
  - Upgrade instead of GC: collect multiple configurations with indices < k at once
  - Any configuration can be upgraded, even if the configurations with smaller indices have not been
- Problem – a nice RAMBO property no longer holds
  - In RAMBO, every configuration is upgraded before removal
  - Need to overcome the race condition between two concurrent upgrades, which can lead to data loss!
- Solution: don’t remove a configuration until the upgrade is complete
  - … even if somebody else is removing it in parallel with you!
- Proof intuition: order between R/W op tags through the transitive closure of multiple Upgrade op tags (instead of a single GC)

Configuration Upgrade in RAMBO-2

[Figure: upgrade(5) retires c3 and c4 in one step – the cmap goes from (±, ±, c3, c4, c5) to (±, ±, ±, ±, c5); largest tag found: 101]

Performance

[Figure: plot of operation latency vs. frequency of reconfiguration, comparing Rambo and Rambo II]

Think of the size of the cmap you need to drag along!

GeoQuorums

- Problem: atomic R/W shared-memory objects in a mobile setting
- Constraints: mobile hosts are constantly moving, turning off, etc., and thus are too unreliable to serve as the “backbone” of the algorithm
- Idea: separate the world into regions that are usually populated
  - Clusters of nodes simulate focal points
  - A region (focal point) fails when no mobile host in that region is active

Rosebud

- Problem: atomic R/W shared-memory objects in a Byzantine environment
- Environment: multiple configurations (as in RAMBO) + up to f Byzantine replicas
- Protocols: the same as RAMBO + cryptographic augmentation
  - Sets of 3f+1 replicas, quorums of 2f+1

Backup Slides

ABD - code

Virtual Synchrony Implementation

- ISIS algorithm – FLUSH markers (see the sketch below)
  - When p receives a view change from Gi to Gi+1:
    - Forward all unstable messages from Gi to all other processes in Gi+1, and mark them stable
    - Multicast a flush message for Gi+1
  - When p has received a flush message for Gi+1 from all processes: install the new view Gi+1
- SAFE messages: network-level vs. application-level delivery guarantees
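A toy Python sketch of the flush step described above; the data structures and message helpers are invented for illustration.

    class IsisMember:
        """Flush protocol run by process p when the view changes from G_i to G_{i+1}."""

        def __init__(self, pid, send):
            self.pid = pid
            self.send = send                 # assumed: send(dest, msg)
            self.unstable = []               # messages of the current view not yet stable
            self.flushed_from = set()        # who has flushed for the pending view

        def on_view_change(self, old_view, new_view):
            # Forward all unstable messages of the old view to everyone in the new view...
            for msg in self.unstable:
                for q in new_view:
                    if q != self.pid:
                        self.send(q, ("forward", msg))
            self.unstable.clear()            # ...and mark them stable.
            # Then multicast a flush marker for the new view.
            self.pending_view = tuple(new_view)
            self.flushed_from = {self.pid}
            for q in new_view:
                self.send(q, ("flush", self.pid, self.pending_view))

        def on_flush(self, sender, view):
            if view != getattr(self, "pending_view", None):
                return
            self.flushed_from.add(sender)
            if self.flushed_from == set(view):
                self.install(view)           # flush received from all: install the new view

        def install(self, view):
            print(f"{self.pid}: installing view {view}")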

Symmetric Atomic Broadcast

- Timestamp = (counter, pid)
- Send: increment the counter
- Receive:
  - Record the neighbor’s counter
  - Adopt the counter on the message if it is greater than mine
- Deliver:
  - Accept the message stamped with a counter ≤ every node’s (known) counter
  - Use pid to break ties

Example timestamps: (p0,0), (p1,0), (p1,1), (p0,2)
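A condensed Python sketch of this symmetric (Lamport-timestamp) ordering; stability tracking is simplified and all names are mine.

    import heapq

    class SymmetricNode:
        """Total-order delivery by (counter, pid) timestamps, as sketched on the slide."""

        def __init__(self, pid, peers):
            self.pid, self.peers = pid, peers
            self.counter = 0
            self.last_seen = {p: -1 for p in peers}   # latest counter heard from each peer
            self.pending = []                          # min-heap of ((counter, pid), msg)

        def send(self, msg, network):
            self.counter += 1                          # Send: increment the counter
            network.broadcast(self.pid, (self.counter, self.pid), msg)

        def on_receive(self, sender, stamp, msg):
            counter, _ = stamp
            self.last_seen[sender] = max(self.last_seen[sender], counter)
            self.counter = max(self.counter, counter)  # adopt a larger counter
            heapq.heappush(self.pending, (stamp, msg))
            self.try_deliver()

        def try_deliver(self):
            # Deliver a message once its counter is ≤ every peer's known counter
            # (nothing smaller can still arrive); pid breaks ties via the heap order.
            while self.pending and self.pending[0][0][0] <= min(self.last_seen.values()):
                stamp, msg = heapq.heappop(self.pending)
                print(f"{self.pid} delivers {msg} at {stamp}")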

Causally Ordered Broadcast

- Every node i maintains a vector timestamp vt
- Increase my own entry vt[i] upon send, and stamp the message with vt
- Self-messages are delivered immediately
- Deliver a message from neighbor j stamped with vector v when
  - v[j] = vt[j] + 1 (it is the next message expected from j), and
  - v[k] ≤ vt[k] for every k ≠ j

[Figure: example vector timestamps (0,1,0), (0,2,0), (0,3,0), (1,3,0)]
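A brief Python sketch of this vector-clock delivery rule (the process indices and helper structure are illustrative):

    class CausalNode:
        """Causally ordered broadcast via vector timestamps."""

        def __init__(self, my_index, n):
            self.i = my_index
            self.vt = [0] * n            # vt[k] = number of messages delivered from process k
            self.pending = []            # received but not yet deliverable: (sender, vector, msg)

        def send_stamp(self):
            self.vt[self.i] += 1         # increment my own entry before sending
            return list(self.vt)         # the vector attached to the outgoing message

        def on_receive(self, sender, v, msg):
            self.pending.append((sender, v, msg))
            self.try_deliver()

        def deliverable(self, sender, v):
            # Next message expected from the sender, and all causal predecessors delivered.
            return (v[sender] == self.vt[sender] + 1
                    and all(v[k] <= self.vt[k] for k in range(len(v)) if k != sender))

        def try_deliver(self):
            progress = True
            while progress:
                progress = False
                for entry in list(self.pending):
                    sender, v, msg = entry
                    if self.deliverable(sender, v):
                        self.pending.remove(entry)
                        self.vt[sender] = v[sender]      # advance my view of the sender
                        print(f"p{self.i} delivers {msg} with vt={self.vt}")
                        progress = True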