Edward Bortnikov
048961 – Topics in Reliable Distributed Computing
Slides partially borrowed from Nancy Lynch (DISC ’02),
Seth Gilbert (DSN ’03) and Idit Keidar (multiple talks)
RAMBO: Reconfigurable Atomic Memory for Dynamic Networks
Outline
- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO
Atomic Consistency
- AKA linearizability
- Definition: each operation appears to occur at some point between its invocation and response.
- Sufficient condition: for each object x, all the read and write operations on x can be partially ordered by ≺, so that:
  - ≺ is consistent with the order of invocations and responses: there are no operations π1, π2 such that π1 completes before π2 starts, yet π2 ≺ π1.
  - All write operations are ordered with respect to each other and with respect to all the reads.
  - Every read returns the value of the last write preceding it in ≺.
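The sufficient condition above can be checked mechanically on a small history. Below is a minimal Python sketch; the data model (operations as dicts with start/end times, a kind, and a value) and the helper names are invented for illustration, not part of the original formulation:

```python
# Minimal sketch: check the sufficient condition for atomicity on one object.
# An operation is a dict with "kind", "value", "start", "end" (hypothetical model).

def respects_real_time(order):
    """No op may precede (in the candidate order) an op that completed before it started."""
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if b["end"] < a["start"]:   # b finished before a began, yet a precedes b
                return False
    return True

def reads_see_last_write(order, initial=0):
    """Every read must return the value of the last write preceding it in the order."""
    last = initial
    for op in order:
        if op["kind"] == "write":
            last = op["value"]
        elif op["value"] != last:
            return False
    return True

# A history: write(1) completes, then a read overlaps a concurrent write(2).
w1 = {"kind": "write", "value": 1, "start": 0, "end": 1}
w2 = {"kind": "write", "value": 2, "start": 2, "end": 5}
r1 = {"kind": "read",  "value": 1, "start": 3, "end": 4}

order = [w1, r1, w2]                 # place the concurrent read before write(2)
print(respects_real_time(order))     # True
print(reads_see_last_write(order))   # True
```

Note that placing r1 after w2 in the order would violate the last clause (the read returned 1, not 2), which is why the concurrent read must be ordered before the overlapping write.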
Outline
- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO
Prior Work on Quorums
- Gifford (79) and Thomas (79)
- Upfal and Wigderson (85): majority sets of readers and writers
- Vitanyi and Awerbuch (86): matrices of single-writer/single-reader registers
- Attiya, Bar-Noy and Dolev (90/95): majorities of processors to implement single-writer/multi-reader objects in message-passing systems
Static: ABD (Attiya, Bar-Noy, Dolev)
- Single writer, multiple readers
- Assuming non-faulty processors (nodes); majority is a primitive quorum
- Communicate: send a request to n processors, await acks from a majority
- Tags are used for distributed ordering of operations:
  - WRITE operations increment the tag
  - READ operations use the tag
  - Both propagate the tag
- Properties:
  - A READ returns either the last completed or a concurrent WRITE
  - ≤ tag ordering between READs
[Diagram: a writer and a reader each contact replicas 1..n]
Write: increment tag, then send tag/value
Read – Phase 1: find tag/value; Phase 2: send tag
Reads and Writes

Value | 32  | 5   | 24  | 72
Tag   | 100 | 101 | 102 | 103
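The ABD tag discipline above can be sketched in a few lines of Python. This is an illustrative simulation only: `Replica`, `majority`, and the read/write helpers are invented names, replicas are in-process objects rather than message-passing nodes, and a "quorum" is simply any majority subset of the replica list:

```python
# Sketch of ABD-style majority read/write with tags (simulation, not messaging).

class Replica:
    def __init__(self):
        self.tag, self.value = 0, None

def majority(replicas):
    # Any floor(n/2)+1 subset works; take a prefix for simplicity.
    return replicas[: len(replicas) // 2 + 1]

def write(replicas, v):
    # Single writer: learn the highest tag at a majority, increment, propagate.
    t = max(r.tag for r in majority(replicas)) + 1
    for r in majority(replicas):
        r.tag, r.value = t, v

def read(replicas):
    # Phase 1: find the highest tag/value at a majority.
    src = max(majority(replicas), key=lambda r: r.tag)
    t, v = src.tag, src.value
    # Phase 2: write it back so a later read cannot return an older value.
    for r in majority(replicas):
        if t > r.tag:
            r.tag, r.value = t, v
    return v

reps = [Replica() for _ in range(5)]
write(reps, "x")
print(read(reps))   # x
```

The write-back in phase 2 is what gives the ≤ tag ordering between reads: once a read returns a value, every majority seen by a later read intersects the one it wrote back to.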
Outline
- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO
Dynamic Approaches (1)
- Consensus to agree on each operation [Lamport]
  - Consensus for each R/W – bad performance!
- Virtual synchrony [Birman 85] – group communication
  - R/W simulated through atomic broadcast
  - Consensus only for a special case (view change)
  - Issue with determining the primary partition (quorum)
- [Yeger-Lotem, Keidar, Dolev ’97] – dynamic voting
  - But still performance issues:
    - One join or failure may trigger view-formation delays for R/W
    - In the presence of failures, R/W ops can be delayed indefinitely
Group Communication Abstraction
[Diagram: processes interact with the group communication layer via the following primitives]
- Send(Grp, Msg)
- Deliver(Msg)
- Join/Leave(Grp)
- View(Grp, Members, Id)
Group Communication Systems (1)
- Group membership: processes organized into groups
- Particular memberships stamped as views
- Views provide a form of Concurrent Common Knowledge about the system
- In a partitionable system, views can be concurrent
[Diagram: timeline of p1, p2, p3 – V1 {p1, p2, p3}; then V2 {p1, p2} concurrent with V3 {p3}; later V5 {p1, p2, p3}]
Virtual Synchrony [Birman, Joseph 87]
- Integration of multicast and membership; synchronization of messages and views
- Includes many different properties; one key property: processes that go through the same views deliver the same sets of messages
- Powerful abstraction for state-machine replication
- Reliable multicast: messages sent to the group, with Total/Causal/FIFO ordering
- Virtual synchrony: the same set of multicast messages is delivered to group members between view changes
- Guaranteed self-delivery: a process will eventually deliver a self-sent message, or crash
- (Usually) sending-view delivery: a message is delivered in the same view in which it was sent
Group Communication Systems (2)
[Diagram: p1, p2, p3 move from V1 {p1, p2, p3} to V2 {p1, p2}; MovieGroup multicasts carry “Chocolat”, “Gladiator”, “Spy Kids”]
Example: a GC-based VOD server
[Diagram: “Movies?” control requests go to the ServiceGroup; start/update messages flow within a SessionGroup]
Virtual Synchrony – Membership
- Issue: accurate estimation of group membership
- Natural implementation: consensus
- But distributed consensus is impossible under failures in an asynchronous system [FLP ’85]! How to distinguish between a failed and a slow processor?
- Solution: failure detectors to deliver views
  - May use mechanisms other than asynchronous message arrivals to suspect failed processes
- Failure detector ◊S: initially the output is arbitrary, but eventually…
  - every process that crashes is suspected (completeness)
  - some process that does not crash is not suspected (accuracy)
- ◊S is the weakest FD that solves consensus – Rotating Coordinator algorithm
Virtual Synchrony – Multicast
- Assumption: point-to-point reliable FIFO links
- All-or-none message delivery – only for the view (alive processes): “Dead men tell no tales” (E.W. Hornung, 1899)
- STABLE messages and delivery between views
  - What if the sender crashes in the middle of a multicast? ISIS algorithm – FLUSH markers
  - Messages can be delayed indefinitely during view formation!
- Total message ordering
  - TOTEM (token-ring) algorithm
  - Symmetric (Lamport timestamps) algorithm
Dynamic Voting on Top of GC
- R/W service as a replicated state machine (total order)
- Data replicas managed by the primary partition (quorum) – problematic in a dynamic, unreliable network
- Adaptive quorums – majority of the previous quorum: {a,b,c,d,e} → {a,b,c} → {a,b}
- Dynamic linear voting: pid breaks ties between equal-sized partitions
- Is this enough?
[Diagram: five processes a, b, c, d, e]
Failures in the Course of the Protocol
[Diagram: five processes a, b, c, d, e partitioned]
- {a, b, c} attempt to form a quorum; a and b succeed, while c detaches, unaware of the attempt
- {a, b} form a quorum – a majority of {a, b, c}
- Concurrently, {c, d, e} form a quorum – a majority of {a, b, c, d, e}
- Inconsistency!
Handling Ambiguous Configurations
- Idea: make c aware of whether a and b succeed in forming {a, b, c}
- {a, b, c} is ambiguous for c: it may or may not have been formed
- Processes record ambiguous attempts: c records both {a, b, c, d, e} and {a, b, c}
- Forming a new quorum requires a majority of both, so c will refuse to form {c, d, e}
Dynamic Voting – Ambiguity Resolution
Upon membership changes:
- Exchange information [sub-quorum of the last primary and of all ambiguous attempts]
- ATTEMPT: record the attempt as ambiguous [all attempted]
- FORM: become primary + delete all ambiguous attempts
Caveat – garbage collection: potentially an exponential # of ambiguous attempts; constrain the protocol to store a linear #
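The quorum-formation rule from the {a..e} example can be expressed in a few lines. The sketch below is illustrative (function names and the set-based model are invented): a partition may become primary only if it holds a majority of the last primary and of every recorded ambiguous attempt:

```python
# Sketch of the dynamic-voting rule: a partition may form a primary quorum
# only with a majority of the last primary AND of every ambiguous attempt.

def has_majority(partition, group):
    return len(partition & group) > len(group) / 2

def can_form(partition, last_primary, ambiguous):
    return all(has_majority(partition, g) for g in [last_primary, *ambiguous])

# c recorded both {a,b,c,d,e} (last primary) and the ambiguous attempt {a,b,c}.
last_primary = {"a", "b", "c", "d", "e"}
ambiguous = [{"a", "b", "c"}]

print(can_form({"c", "d", "e"}, last_primary, ambiguous))  # False
print(can_form({"a", "b"}, {"a", "b", "c"}, []))           # True
```

The first call fails exactly as in the slides: {c, d, e} is a majority of {a, b, c, d, e}, but not of the ambiguous attempt {a, b, c}, so the inconsistent concurrent quorum is refused.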
Dynamic Approaches (1)
- Consensus to agree on each operation [Lamport]
  - Consensus for each R/W (not guaranteed to terminate) – bad performance!
- Virtual synchrony [Birman 85] – group communication
  - R/W simulated through atomic broadcast
  - Consensus only for a special case (view change)
  - Issue with determining the primary partition (quorum)
- [Yeger-Lotem, Keidar, Dolev ’97] – dynamic voting
  - But still performance issues:
    - One join or failure may trigger view-formation delays for R/W
    - In the presence of failures, R/W ops can be delayed indefinitely
Dynamic Approaches (2)
- Quorum-based reads/writes over GC [De Prisco et al. 99]
  - A new view must satisfy space requirements: intersection between the old and new quorums
  - RAMBO instead has time requirements: some quorums of the old and the new system are involved in reconfiguration
- Single reconfigurer [Lynch, Shvartsman 97], [Englert, Shvartsman 00]
  - Terminology change: view → configuration
  - Allows multiple concurrent configurations
  - But the single reconfigurer is a SPOF!
Outline
- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO
RAMBO – key ideas
- Separate the handling of R/W operations from view (configuration) changes
  - R/W ops must complete fast
  - Configuration changes can propagate in the background
- Two levels of accommodating changes
  - Small and transient changes – through multiple quorums
  - Large and permanent changes – through reconfiguration
- Managing configurations
  - Multiple configurations may co-exist
  - Old configurations can be garbage-collected
  - The nodes agree on the order of configurations (Paxos)
RAMBO API
Domains:
- I = set of nodes (locations)
- V = set of values
- C = set of configurations, with Members(C), Read-quorums(C), Write-quorums(C)
Input (asynchronous, per node/object):
- Join
- Read
- Write(v)
- Recon(c, c’)
- Fail
Output (asynchronous, per node/object):
- Join-ack
- Read-ack(v)
- Write-ack
- Recon-ack(b) // true/false
- Report(c) // new configuration
Recon Service Specification
- Recon chooses configurations, tells the members of the previous and new configurations, and informs the Reader-Writer components (new-config)
- Behavior (assuming well-formedness):
  - Agreement: two configs are never assigned to the same index k
  - Validity: any announced new-config was previously requested by someone
  - No duplication: no configuration is assigned to more than one k
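The three safety properties read naturally as executable assertions. Here is a toy sketch (the `ReconSpec` class and its methods are hypothetical names, not RAMBO code) that maintains the index-to-configuration assignment and rejects any announcement violating the spec:

```python
# Toy sketch of the Recon safety properties as runtime assertions.

class ReconSpec:
    def __init__(self):
        self.assigned = {}       # k -> configuration
        self.requested = set()   # configurations someone has asked for

    def recon(self, c_new):
        self.requested.add(c_new)

    def announce(self, k, c):
        assert c in self.requested                                       # validity
        assert self.assigned.get(k, c) == c                              # agreement at k
        assert all(v != c for j, v in self.assigned.items() if j != k)   # no duplication
        self.assigned[k] = c

r = ReconSpec()
r.recon("cfg1")
r.announce(1, "cfg1")        # fine: requested, fresh index, fresh config
try:
    r.announce(2, "cfg1")    # same config at a second index
except AssertionError:
    print("no duplication: cfg1 already assigned")
```

A real implementation establishes these properties by running consensus per index k (as in the Recon implementation slide), rather than by checking them after the fact.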
Write – Phase 1: choose tag; Phase 2: send tag/value
Read – Phase 1: find tag/value; Phase 2: send tag/value
Reads and Writes

Value | 32  | 5   | 24  | 72
Tag   | 100 | 101 | 102 | 103
Multiple Configurations (1)
- Every node can install a new configuration, garbage-collect an old configuration, and learn about both through gossiping
- The Recon service guarantees the global order
- Configuration map (cmap): the node’s snapshot of the picture of the world
  - Special configurations: ⊥ (undefined) and ± (GC’ed)
Multiple Configurations (2)
Some algebra:
- Update: ⊥ → c, c → ± // configuration lifecycle
- Extend: ⊥ → c // new configurations
- Truncate: (c1, c2, ⊥, c4) → (c1, c2) // removing holes
- A configuration map w/o holes is TRUNCATED
[Diagram: a cmap as a sequence ± ± c c c c … – a GC’d prefix, then defined entries, then undefined (⊥) slots]
CMAP Evolution
c0
c0 c1
c0 c1 c2 … ck
±  c1 c2 … ck
±  ±  c2 … ck
±  ±  ±  c3 … ck
…
±  ±  ±  ±  ± c c c c …
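The cmap algebra above can be sketched concretely. In the toy model below (invented for illustration), a cmap is a Python list indexed by k, with `None` standing for ⊥ and the string `"±"` for a garbage-collected slot; `update` covers both the extend step (⊥ → c) and the lifecycle step (c → ±):

```python
# Sketch of the configuration-map operations: None models ⊥, "±" models GC'd.

def update(cmap, k, c):
    """Lifecycle: an undefined slot may become c; a defined slot may only become ±."""
    while len(cmap) <= k:
        cmap.append(None)
    if cmap[k] is None or c == "±":
        cmap[k] = c

def truncate(cmap):
    """Keep only the hole-free prefix: stop at the first undefined slot."""
    out = []
    for c in cmap:
        if c is None:
            break
        out.append(c)
    return out

cm = []
update(cm, 0, "c0")
update(cm, 2, "c2")      # leaves a hole (⊥) at k=1
print(truncate(cm))      # ['c0'] -- the hole cuts off c2
update(cm, 0, "±")       # garbage-collect c0
print(cm)                # ['±', None, 'c2']
```

This mirrors the evolution sequences above: the defined prefix grows on the right as Recon announces configurations, while ± entries accumulate on the left as old configurations are garbage-collected.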
R/W Automaton Implementation
- The node keeps gossiping with the “world” all the time
- Tags are used for distributed ordering of operations
  - WRITE operations increment the tag
  - READ operations use the tag
  - Every READ returns the value of the WRITE with the same tag
- Agreeing on tags: every op consists of a query and a propagation phase
  - Query – acquire the tag from “enough” members: an R-quorum of every active configuration
  - Propagation – push the value/tag to “enough” members: a W-quorum of every active configuration
  - Fixed point: a predicate stating that the respective op has completed
R/W with Multiple Configurations
- Key to asynchronous execution of R/W operations: no abortion of R/W when a new configuration is reported
  - Extra work to access the additional processes needed for the new quorums
- Reaching a quorum for every C in CMAP, to synchronize with every process that might hold C:
  - Some read-quorum at the QUERY stage (query-fixed-point precondition)
  - Some write-quorum at the PROP stage (prop-fixed-point precondition)
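The query fixed point can be sketched as a loop that only terminates once a read-quorum of every active configuration has replied, even if configurations are added underneath it. Everything here is an invented simulation (quorum maps and reply functions are plain dicts), not RAMBO code:

```python
# Sketch of the query fixed point over multiple active configurations:
# the phase completes only when a read-quorum of EVERY config has answered.

def query(active_cmaps, read_quorum_of, ask):
    """active_cmaps: list of configs (the caller may append mid-phase);
    read_quorum_of(c) -> node list; ask(node) -> (tag, value)."""
    best = (0, None)
    done = set()
    while len(done) < len(active_cmaps):       # fixed point: all configs covered
        for c in list(active_cmaps):           # copy: the list may grow underneath
            if c in done:
                continue
            replies = [ask(n) for n in read_quorum_of(c)]
            best = max(best, max(replies))     # keep the highest tag/value seen
            done.add(c)
    return best

# Two configurations with overlapping read quorums over three nodes.
store = {"n1": (1, "a"), "n2": (3, "b"), "n3": (2, "c")}
cfgs = ["c0", "c1"]
quorums = {"c0": ["n1", "n2"], "c1": ["n2", "n3"]}
tag, val = query(cfgs, quorums.get, store.get)
print(tag, val)   # 3 b
```

The propagation phase would be symmetric: push the resulting tag/value until a write-quorum of every configuration in op.cmap has acknowledged (the prop fixed point).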
R/W Automata State
- world
- value, tag
- cmap
- pnum1 – counts phases of locally-initiated operations
- pnum2[] – records the latest known phase numbers for all locations (recall causal ordering and vector clocks!)
- op-record – keeps track of the status of the current locally-initiated read/write operation; includes op.cmap, consisting of consecutive configs
- gc-record – keeps track of the status of the current locally-initiated GC operation
R/W Automaton: Recv() code
- CMAP may evolve during the R/W; accept only “recent” messages
- Local message numbering (PNUM) ensures causal order: “I have heard from you since you started the op!”
- Pitfall: a hole in the new CMAP means I am using stale data! Restart the phase with the truncated CMAP

On receiving (W, v, t, cm, ns) from j:
  world := world ∪ W
  if t > tag then (value, tag) := (v, t)
  cmap := update(cmap, cm)
  pnum2(j) := max(pnum2(j), ns)
  gc-record: if the message is “recent”, record the sender
  op-record: if the message is “recent”, record the sender and extend op.cmap with newly discovered configurations
Garbage Collection
- A process can initiate a configuration’s garbage collection, provided that the previous configurations are ±; one at a time (may be improved!)
- Multiple processes can start GC of the same configuration, concurrently with R/W
- A GC can stop if an idempotent GC has completed
- The same two-phase protocol:
  - Query: reach a read quorum and a write quorum of CMAP[k]
    - Inform a W-quorum of the old configuration about the new configuration
    - Collect object values from an R-quorum of the old configuration
  - Prop: reach a write quorum of CMAP[k+1]
    - Propagate the latest value to a W-quorum of the new configuration
Proof Sketch
- ≤ ordering of tags between sequential GC operations: ∩ between the R-quorum of CMAP[k] and the W-quorum of CMAP[k+1]
- Ordering between a sequential GC and R/W:
  - ≤ ordering of tags between the GC and READ operations
  - < ordering of tags between the GC and WRITE operations
- Ordering between sequential R and W: ≤ ordering between */R, < ordering between */W
  - Either there is a common configuration C – the tag is conveyed through the quorum ∩ property…
  - …or the tag info is conveyed through the GC of some configuration in between
Recon Implementation
- Consensus implemented using the Paxos Synod algorithm
- Members of the old configuration propose a new configuration; proposals are reconciled using consensus
- recon(c, c’): request for reconfiguration from c to c’
  - [If c is the (k-1)-st configuration] send an init(Cons(k, c’)) message to c.members
- Recv(init): participate in consensus
- decide(c’): tell R/W the new configuration; send a new-config message to the members of c’
[Diagram: Recon/Recon-ack at the interface, with Consensus and Net layers below]
Conditional Performance Analysis
- Safety is guaranteed… but there are no absolute performance guarantees!
- Under “good” network conditions:
  - Bounded message delay (d)
  - Sufficient spacing between configurations (e)
  - Configuration and quorum viability (e)
- Bounds (under quiescence conditions):
  - Join – 2d
  - Reconfiguration – 13d
  - Read/Write – 4d (two phases)
  - GC – 4d (two phases)
- Bounds deteriorate under weaker stability conditions: the network stabilizes ⇒ Rambo stabilizes
Outline
- Definitions and Goals
- Static Quorum Systems
- Dynamic Quorum Systems – before RAMBO
- RAMBO
- Dynamic Quorum Systems – beyond RAMBO
RAMBO-2
- Goal: overcome the bottleneck of one GC at a time
  - Upgrade instead of GC: collect multiple configurations with indices < k at once
  - Any configuration can be upgraded, even if those with smaller indices are not
- Problem – a nice RAMBO property no longer holds
  - In RAMBO, every configuration is upgraded before removal
  - Need to overcome the race condition between two upgrades… which leads to data loss!
- Solution: don’t remove a configuration until the upgrade is complete… even if somebody is removing it in parallel with you!
- Proof intuition: order between R/W op tags through the transitive closure of multiple Upgrade op tags (instead of a single GC)
Performance
[Plot: latency vs. frequency of reconfiguration, comparing Rambo and Rambo II]
Think of the size of the CMAP you need to drag along!
GeoQuorums
- Problem: atomic R/W shared memory objects in a mobile setting
- Constraints: mobile hosts are constantly moving, turning off, etc., and thus are too unreliable to serve as the “backbone” of the algorithm
- Idea: separate the world into regions that are usually populated
  - Clusters of nodes simulate focal points
  - A region (focal point) fails when no mobile hosts in that region are active
Rosebud
- Problem: atomic R/W shared memory objects in a Byzantine environment
- Environment: multiple configurations (RAMBO) + up to f Byzantine replicas
- Protocols: the same as RAMBO + cryptographic augmentation
  - Sets of 3f+1 replicas, quorums of 2f+1
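The quorum sizes are not arbitrary: with n = 3f+1 replicas and quorums of 2f+1, any two quorums intersect in at least 2q − n = f+1 replicas, so the intersection always contains at least one correct (non-Byzantine) replica. A quick check:

```python
# Check the Byzantine quorum arithmetic: n = 3f+1, quorum size q = 2f+1.
# Worst-case intersection of two quorums is 2q - n = f + 1 > f.

for f in range(1, 6):
    n, q = 3 * f + 1, 2 * f + 1
    min_intersection = 2 * q - n
    assert min_intersection == f + 1   # one more than the f Byzantine replicas
    print(f"f={f}: n={n}, quorum={q}, intersection >= {min_intersection}")
```

This is why f+1 matching replies suffice to trust a value: at most f of them can be lying.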
Virtual Synchrony Implementation
ISIS algorithm – FLUSH markers. When P receives a view change from Gi to Gi+1:
- Forward all unstable messages from Gi to all other processes in Gi+1, and mark them stable
- Multicast a flush message for Gi+1
When P has received the flush message for Gi+1 from all processes:
- Install the new view Gi+1
SAFE messages: network-level vs. application-level delivery guarantees
Symmetric Atomic Broadcast
- Timestamp = counter + pid
- Send: increment the counter
- Receive: record the neighbor’s counter; adopt the counter on the message if greater than mine
- Deliver: accept the message stamped with a counter ≤ every node’s counter; use pid to break ties
- Example: (p0,0) (p1,0) (p1,1) (p0,2)
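The symmetric algorithm above can be sketched as a toy in-process simulation. This is a simplification (the `Node` class and the global `deliverable` check are invented; a real implementation learns other nodes’ counters from the messages it has heard, not by direct inspection), but it shows the Lamport-timestamp total order with pid tie-breaking:

```python
import heapq

# Toy sketch of symmetric atomic broadcast: stamps are (counter, pid), nodes
# adopt larger counters they see, and a pending message is delivered only once
# its stamp is <= every node's current counter (pid breaks ties in the heap).

class Node:
    def __init__(self, pid):
        self.pid, self.counter = pid, 0
        self.pending = []                    # min-heap of (counter, sender_pid, msg)

    def send(self, msg, nodes):
        self.counter += 1                    # Send: increment the counter
        for n in nodes:
            n.receive(self.counter, self.pid, msg)

    def receive(self, counter, sender, msg):
        self.counter = max(self.counter, counter)   # adopt a greater counter
        heapq.heappush(self.pending, (counter, sender, msg))

    def deliverable(self, nodes):
        out = []
        while self.pending and self.pending[0][0] <= min(n.counter for n in nodes):
            out.append(heapq.heappop(self.pending)[2])
        return out

p0, p1 = Node(0), Node(1)
nodes = [p0, p1]
p0.send("m1", nodes)           # stamped (1, p0)
p1.send("m2", nodes)           # stamped (2, p1), after adopting p0's counter
print(p0.deliverable(nodes))   # ['m1', 'm2'] -- same order at every node
```

Because every node sorts by the same (counter, pid) stamps and waits until no smaller stamp can still arrive, all nodes deliver the same total order.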