Ordering and Durability in Isis2
![Page 1: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/1.jpg)
ORDERING AND DURABILITY IN ISIS2
Ken Birman, Cornell University
![Page 2: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/2.jpg)
Isis2 System
Core functionality: groups of objects … fault-tolerance, speed (parallelism), coordination.
Intended for use in very large-scale settings.
The local object instance functions as a gateway.
Read-only operations are performed on local state.
Update operations update all the replicas.
[Figure: a process issues “join myGroup” and receives a state transfer; updates then reach the members of myGroup.]
![Page 3: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/3.jpg)
Terminology we’ve used
Process group: a collection of programs that are all running (perhaps on different machines, perhaps on the same machine) and that use Isis2.
Each process group has a name (you pick it).
You can have multiple groups in one application.
Message: data encoded to be sent between programs.
State transfer: data used to initialize a new group member.
Update: any action that changes the shared data.
Lookup: any action that only queries the data.
Multicast: a message sent to every group member.
![Page 4: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/4.jpg)
Example: Cloud-Hosted Service
[Figure: a standard Web-Services method invocation reaches some service with members A, B, C and D. The distributed request updates the group “state” via SafeSend multicasts, and then the response is returned.]
![Page 5: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/5.jpg)
Multicast properties
In the figure, “SafeSend” is a “multicast”: a message that can be sent to a whole group.
What properties do these multicasts need to keep the group members consistent?
In Isis2 we focus on:
Ordering properties: relative to group membership changes, and relative to other multicasts.
Durability guarantees: what happens if a crash occurs?
![Page 6: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/6.jpg)
Key idea: View ordering
In Isis2, new View upcalls are synchronized relative to message delivery.
![Page 7: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/7.jpg)
Membership changes
When a group gains or loses a member, the Isis2 Oracle sequences the new view relative to other multicasts. Thus any multicast is delivered in the same view, from the perspective of all recipients.
Also, if a multicast is sent to the group in some view, it reaches all members of the group (of course, if some crash, they might not process the message).
State transfers occur after every multicast has been delivered in the prior view and before any are delivered in the new view.
[Figure: timeline for processes p, q, r, s, t (time 0 to 70). The group view is synchronized relative to multicasts.]
![Page 8: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/8.jpg)
Message Ordering
The basic idea of Isis2 is to deliver all multicasts in the same order at all group members receiving them.
This keeps the data consistent and allows you to implement “state machine” algorithms: group members perform any desired actions in the same state and in the same order.
But we offer several implementations of multicast, and some are faster than others. The caveat is that the fast versions can only be used safely in certain situations, which we’ll discuss.
[Figure: timeline for processes p, q, r, s, t (time 0 to 70).]
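The “state machine” idea can be illustrated without any Isis2 calls. The sketch below is a self-contained model, not Isis2 code: the `Replica` class, the `"set:N"`/`"add:N"` update encoding, and `StateMachineDemo.Run` are all hypothetical names chosen for illustration. It shows that replicas fed the same updates in the same order end in the same state, while a different order may not.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one group member's replicated state.
public sealed class Replica
{
    public int Value { get; private set; }

    // Apply one update. "set:N" overwrites the value, "add:N" increments it.
    public void Apply(string update)
    {
        string[] parts = update.Split(':');
        int n = int.Parse(parts[1]);
        if (parts[0] == "set") Value = n;
        else if (parts[0] == "add") Value += n;
        else throw new ArgumentException("unknown update: " + update);
    }
}

public static class StateMachineDemo
{
    // Deliver a sequence of updates to a fresh replica and return its final value.
    public static int Run(IEnumerable<string> deliveryOrder)
    {
        var replica = new Replica();
        foreach (string update in deliveryOrder) replica.Apply(update);
        return replica.Value;
    }
}
```

Two replicas that both run `set:10` then `add:5` agree on the result; a replica that saw `add:5` first would end with a different value, which is the inconsistency that ordered delivery prevents.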
![Page 9: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/9.jpg)
A multicast arrives in a group…
What information is “the same” for all recipients?
If they call g.GetView(), or remembered properties of the most recently delivered view, all see the same view.
Also, everyone got the message.
And the requested ordering was enforced by Isis2.
What aspects might differ for different receivers?
Each has its own “rank” in the membership list, obtained by calling v.GetMyRank() or v.GetRankOf(who).
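One way the combination “same view everywhere, different rank per member” can be put to use is to split work deterministically. The following is a minimal sketch under that assumption; `RankPartition.IsMine` is a hypothetical helper, and in a real application the rank and view size would come from the view (for example via v.GetMyRank()) rather than being passed as plain integers.

```csharp
using System;

public static class RankPartition
{
    // Decide whether the member with rank `myRank` handles work item `item`
    // in a view with `viewSize` members. Every member evaluates the same rule
    // against the same view, so exactly one member claims each item.
    public static bool IsMine(int myRank, int viewSize, int item)
    {
        if (viewSize <= 0) throw new ArgumentOutOfRangeException(nameof(viewSize));
        if (myRank < 0 || myRank >= viewSize) throw new ArgumentOutOfRangeException(nameof(myRank));
        if (item < 0) throw new ArgumentOutOfRangeException(nameof(item));
        return item % viewSize == myRank;
    }
}
```

Because the rule depends only on information that is identical at all recipients plus the caller’s own rank, no extra messages are needed to agree on who does what.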
![Page 10: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/10.jpg)
What about failures?
What if a failure happens just as a multicast is being sent?
![Page 11: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/11.jpg)
Delayed delivery
In Isis2, a multicast send will often delay (in the platform) for a little while before delivery occurs.
As a result, the sender does not know that the group view will be the same when the message is delivered.
[Figure: timeline for processes p, q, r, s, t (time 0 to 70). This multicast might have been “sent” in the prior view when r, s and t weren’t yet members!]
![Page 12: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/12.jpg)
How can we know for sure?
Suppose the sender of a Query needs to know how many members processed the query, e.g. to notice that some reply is missing due to a failure. What can it do to know?
One option is to have the receivers include View information (such as how many members were in the View, and what rank each replying member had) in the Reply().
The sender is also a receiver, so another approach is for the sender to wait for its own multicast or Query to be delivered and then make note of the View.
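The first option can be sketched as follows. This is a self-contained model rather than Isis2 code: `ReplyInfo` and `ReplyCheck.MissingRanks` are hypothetical names, and the sketch assumes each reply carries the view size and the replier’s rank as the slide suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical reply payload: the view information the receiver saw at delivery.
public sealed class ReplyInfo
{
    public int ViewSize;   // number of members in the view at delivery
    public int Rank;       // rank of the replying member in that view
}

public static class ReplyCheck
{
    // Return the ranks that did not reply, based on the view size reported in the replies.
    public static List<int> MissingRanks(IList<ReplyInfo> replies)
    {
        if (replies.Count == 0) throw new ArgumentException("no replies to inspect");
        int viewSize = replies[0].ViewSize;
        // View ordering means every receiver saw the same view, so the sizes must agree.
        if (replies.Any(r => r.ViewSize != viewSize))
            throw new InvalidOperationException("replies disagree about the view");
        var seen = new HashSet<int>(replies.Select(r => r.Rank));
        return Enumerable.Range(0, viewSize).Where(rank => !seen.Contains(rank)).ToList();
    }
}
```

With replies from ranks 0, 1 and 3 in a 4-member view, the sender learns that rank 2 never answered and can react, for example by treating that member as failed.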
![Page 13: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/13.jpg)
How do we know who sent a message?
You can just include the sender’s Address in the arguments to the message.
Cool Isis2 fact: after you see a View notifying you that some member has failed or voluntarily left the group, you will never receive additional multicasts from that sender!
If a process leaves a group but then tries to send in it, Isis2 throws an exception in that sender.
![Page 14: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/14.jpg)
No messages from the dead
In the Isis2 system, you never receive messages from the deceased.
Isis2 watches for “late” messages that came from a process which is already considered to have died.
It actively blocks such messages and won’t deliver them.
Thus if you reconfigure after a failure and reassign roles, you can’t get a kind of split-brain effect due to late delivery of a message.
![Page 15: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/15.jpg)
Ordering Properties
The most important form of message ordering is “total order”.
It is obtained by using g.OrderedSend or g.SafeSend.
They both provide the same ordering guarantee, but they have different durability properties.
Everyone receives these in the same order.
[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B. Everyone receives A first and B second.]
![Page 16: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/16.jpg)
Weaker ordering
Some applications want the lowest possible message latency.
OrderedSend will usually achieve the best delay, but not always. (Slower case: when multiple group members are calling OrderedSend concurrently.)
SafeSend uses a much slower approach.
For the very best speed, protocols guaranteed to be faster are available: Send and RawSend.
![Page 17: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/17.jpg)
A FIFO Ordering situation
Suppose one process sends all the multicasts that update some variable in a group. What ordering is really needed?
In this group, only the oldest living member sends multicasts.
FIFO suffices!
[Figure: timeline for processes p, q, r, s, t (time 0 to 70). We say that p is the leader; it has rank 0. After p and q fail, r is the leader; it has rank 0 in the new view.]
![Page 18: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/18.jpg)
A FIFO Ordering Situation
In this group we really only need to deliver messages in the order the leader sent them.
For this purpose, the Send primitive is ideal.
Send respects the FIFO order its sender used.
It is guaranteed to be extremely fast.
RawSend: like Send, but with no effort to guarantee reliability. It respects FIFO order… unless a message is lost.
![Page 19: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/19.jpg)
What if two senders use Send?
When different senders use Send, the ordering will depend on when the messages showed up!
Different members might see different orderings.
Example: r sees A B … but p sees B A.
[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B sent by different senders.]
![Page 20: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/20.jpg)
When is FIFO good enough?
Suppose our group manages a collection of data items.
Each item has its own leader, and only the leader sends updates for that item.
Consistency: it suffices to apply updates in the order they were sent. g.Send() will do this!
But beware… Multicasts from different senders can interleave in unpredictable ways.
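To make the FIFO property concrete, here is a small self-contained checker. It is an illustrative model, not part of Isis2: `FifoCheck.RespectsFifo` is a hypothetical helper, and tagging each message with a per-sender sequence number is an assumption made for the sketch.

```csharp
using System;
using System.Collections.Generic;

public static class FifoCheck
{
    // A message is identified by its sender and a per-sender sequence number.
    // Returns true if, for every sender, messages appear in increasing sequence order.
    public static bool RespectsFifo(IEnumerable<(string Sender, int Seq)> deliveryOrder)
    {
        var last = new Dictionary<string, int>();
        foreach (var (sender, seq) in deliveryOrder)
        {
            if (last.TryGetValue(sender, out int prev) && seq <= prev) return false;
            last[sender] = seq;
        }
        return true;
    }
}
```

The delivery orders `(p,1) (q,1) (p,2)` and `(q,1) (p,1) (p,2)` both pass the check even though they differ, which is exactly the interleaving the slide warns about. It is harmless when p and q lead different data items, and a problem when they update the same one.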
![Page 21: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/21.jpg)
When would you use RawSend?
This primitive doesn’t guarantee reliability.
We use it when reporting data from real-time sensors.
We want the data delivered in order (new data replaces older data). RawSend is still FIFO ordered.
But if data is lost, there is no point “wasting time” in the platform retransmitting it.
![Page 22: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/22.jpg)
What about Query ordering?
Each kind of multicast has an associated Query.

| Multicast | Matching Query |
| --- | --- |
| RawSend | RawQuery |
| Send | Query |
| OrderedSend | OrderedQuery |
| SafeSend | SafeQuery |
![Page 23: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/23.jpg)
CausalSend
Included mostly for academic reasons, but not used very often in Isis2.
Intended for situations in which the leader role moves around for each data item.
First p is in charge, then q is the leader for a while, then r, then back to p…
CausalSend will respect the FIFO order “with moving leaders”.
But we don’t recommend using it.
![Page 24: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/24.jpg)
CausalSend picture: B is “after” A
[Figure: timeline for processes p, q, r, s, t (time 0 to 70) showing multicasts A and B.]
![Page 25: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/25.jpg)
Causality idea
If B “might have been caused by A”, then B is causally ordered after A (we write A → B).
CausalSend tracks these causality dependencies and makes sure that if A → B, then B will be delivered after A.
But the Isis2 implementation of CausalSend is slow, and this is why it isn’t used very often.
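One textbook way to track “might have been caused by” is with vector timestamps. The sketch below uses that technique purely as an illustration; the slides do not say how CausalSend is implemented, so this should not be read as the Isis2 mechanism. `VectorClock` and its methods are hypothetical names.

```csharp
using System;

public static class VectorClock
{
    // True if timestamp a is causally before timestamp b: every entry of a is
    // <= the matching entry of b, and at least one entry is strictly smaller.
    public static bool HappenedBefore(int[] a, int[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("timestamps must have equal length");
        bool strictlySmaller = false;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlySmaller = true;
        }
        return strictlySmaller;
    }

    // Neither timestamp is causally before the other.
    public static bool Concurrent(int[] a, int[] b)
    {
        return !HappenedBefore(a, b) && !HappenedBefore(b, a);
    }
}
```

If A carries the timestamp [1,0,0] and B carries [1,1,0], then A is causally before B and a causal protocol must deliver A first; messages stamped [1,0,0] and [0,1,0] are unrelated and may be delivered in either order.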
![Page 26: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/26.jpg)
Durability
Exactly what happens in the event of a failure?
![Page 27: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/27.jpg)
Durability
A durability guarantee is the property that information will survive a failure.
There are several cases to think about:
What if the sender of a multicast fails but someone received the multicast?
What if the sender and every receiver (so far) fails?
What if a whole group fails, but later restarts?
What if the group is managing a replicated database or files that aren’t even on the same computers?
[Figure: timeline for processes p, q, r, s, t (time 0 to 70).]
![Page 28: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/28.jpg)
Soft State in the Cloud
Many Isis2 applications run in cloud settings.
And the cloud favors “soft state”.
After a node crashes, the entire VM is reloaded.
Thus any local state (even local files) is restored to its original state! All local data vanishes.
We say that a group manages “hard state” if the group members can fail and yet their state lives on.
In the cloud a hard-state node costs more $$$.
![Page 29: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/29.jpg)
Two cases thus arise
Durability for soft-state scenarios:
Here the entire state “lives in the group members”.
They might have files, but the files won’t be preserved if those members crash and later restart, even on the same nodes.
This is very common in today’s cloud.
Durability for hard-state cases:
Here the state really is outside the group.
![Page 30: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/30.jpg)
Multicast durability
Isis2 offers all-or-nothing delivery guarantees.
Either every group member receives your multicast, or no group member receives it, even if the sender fails.
As we saw, if a sender fails, its messages will be delivered before Isis2 reports the failure.
But this statement didn’t explain what happens when a receiver crashes “instantly”.
![Page 31: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/31.jpg)
Two options: Optimistic/Pessimistic
Optimistic case (Send, CausalSend, OrderedSend):
Messages are delivered instantly on arrival (low delay).
But if the sender and all receivers with copies fail, an optimistic message is lost forever, even though it might have been delivered to some processes right before they crashed.
An optimistic protocol always looks like it was all-or-nothing, but if you could see the details, you might see that in fact a message was delivered, but then “forgotten”.
![Page 32: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/32.jpg)
Optimistic delivery
Consider messages B and C.
B was delivered to r, s and t. But it didn’t reach p and q because of a network failure.
C was delivered by p and q but never reached r, s and t.
But notice that p and q both crashed.
In a soft-state case, no evidence survived (unless they talked to someone outside the group, such as an external client).
In effect, the surviving portion of the system is consistent.
[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A, B and C.]
![Page 33: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/33.jpg)
Optimistic delivery is fastest
We deliver messages as soon as they arrive.
But the price of this speed (which is a big benefit) is that these two “bad cases” can arise.
Nobody can tell when these things happen, unless p or q talked to an external client.
… which leads to the idea of g.Flush(k).
![Page 34: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/34.jpg)
How does Flush(k) work?
g.Flush(k) pauses until k group members definitely have all the prior optimistic multicasts.
g.Flush() waits for all members, but this is slow.
Normally k=2 or k=3 is fine…
By calling g.Flush(2) or g.Flush(3) before talking to an external client, we can be sure these bad cases will not occur!
![Page 35: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/35.jpg)
With g.Flush(k)…
… those stray delivery events can still occur, but we know that no external observer notices them!
If g.Flush(3) is called prior to talking to the observer, then the Flush waits until there are 3 or more copies of the message.
In our example the crash would have occurred while we were waiting for g.Flush() to finish.
If a tree falls in a forest… If a message is delivered but every process that saw it crashes, the effect is the same as if the message wasn’t delivered!
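The tree-in-the-forest argument can be checked with a tiny model. This is a sketch under simplifying assumptions, not Isis2 code: members are plain strings, and `FlushModel.Survives` and `FlushModel.FlushWouldReturn` are hypothetical helpers that only capture the counting argument behind g.Flush(k).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FlushModel
{
    // A message survives a set of crashes if at least one member holding a copy is still alive.
    public static bool Survives(ISet<string> holders, ISet<string> crashed)
    {
        return holders.Any(h => !crashed.Contains(h));
    }

    // Model of the Flush(k) condition: it returns only once at least k members hold the message.
    public static bool FlushWouldReturn(ISet<string> holders, int k)
    {
        return holders.Count >= k;
    }
}
```

If only p and q hold message C and both crash, C is forgotten, but a Flush(3) issued by p or q would still have been waiting at that moment, so no external client was told about C. Once three members hold the message, the flush returns and the message outlives the loss of any two of them.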
![Page 37: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/37.jpg)
When to call g.Flush(k)
Use this primitive when working with optimistic multicast protocols like Send and OrderedSend.
Call it prior to interacting with something outside of the group, like an external client who issued a request.
With g.Flush after g.OrderedSend, we get the guarantee that the group won’t forget the update. Without g.Flush, an unlikely failure sequence could cause a problem (the sender and the first recipients all die).
![Page 38: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/38.jpg)
Pessimistic Delivery
SafeSend is much more pessimistic.
This protocol is a kind of 2-phase commit.
It gives the message to recipients, and they hold it. (Two cases: in-memory logging, or on-disk logging.)
When all have confirmed receipt, delivery is authorized.
No g.Flush() is needed: it wouldn’t ever need to wait.
![Page 39: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/39.jpg)
Where’s the durable state?
SafeSend raises a question of where the state lives.
For our optimistic protocols, state lives in the group.
But Isis2 can also support two more cases:
State lives in a checkpoint that will be reloaded if the whole group shuts down and restarts.
State lives in a database or in files external to the group.
SafeSend with disk logging aims at this second case.
![Page 40: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/40.jpg)
Should I always use SafeSend?
The SafeSend protocol is very costly and scales poorly, so it isn’t a great choice in the cloud.
Also, using it correctly is a bit tricky.
Better rule of thumb: use g.OrderedSend + g.Flush.
![Page 41: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/41.jpg)
Sidebar: Paxos family of protocols
Experts in this area will know about Leslie Lamport’s famous Paxos protocol (Wikipedia has a nice writeup).
It provides ordered, durable “actions”.
These are often updates to a replicated database.
SafeSend is the Isis2 name for Paxos.
You don’t really need to learn about Paxos to understand how SafeSend works, but I’ll include some comments in this lecture aimed at people who do know about Paxos, simply because that work is so famous.
![Page 42: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/42.jpg)
How Paxos works
Paxos is basically a kind of 2-phase commit.
In the first phase a leader proposes some action (for us, a multicast).
A quorum of group members (the acceptors) need to vote in favor of the proposed ordering for the message, and they need to first save it in a durable place (usually a log that lives on the disk).
In the second phase, delivery occurs (in Paxos: the learners are informed about the new event).
![Page 43: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/43.jpg)
Paxos has a notion similar to Flush(k)
In Paxos you can specify the number of “acceptors” that must have a copy of a message before it can be delivered.
In Isis2 this same parameter can be set by calling g.SetSafeSendThreshold(k).
SafeSend is a true implementation of Paxos if this number is more than half the group members.
With a smaller k, like k=2 or k=3, in a big group, SafeSend starts to act exactly like g.OrderedSend()+g.Flush(k).
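The “more than half” condition is simple arithmetic, sketched below. The class `SafeSendThresholdModel` and its methods are hypothetical helpers for illustration only; they do not call Isis2 and say nothing about what g.SetSafeSendThreshold does internally.

```csharp
using System;

public static class SafeSendThresholdModel
{
    // Smallest number of acceptors that is more than half of an n-member group.
    public static int SmallestMajority(int n)
    {
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));
        return n / 2 + 1;
    }

    // True if a threshold of k acceptors is more than half of the n members.
    public static bool IsMajority(int k, int n)
    {
        if (k < 0 || k > n) throw new ArgumentOutOfRangeException(nameof(k));
        return k >= SmallestMajority(n);
    }
}
```

For a 5-member group, k=3 is a majority; for a 1000-member group the smallest majority is 501, so k=3 falls well short of it and, per the slide, SafeSend then behaves like g.OrderedSend()+g.Flush(3).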
![Page 44: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/44.jpg)
Isis2: Send vs. SafeSend
Send scales best, but SafeSend with modern disks (RAM-like performance) and small numbers of acceptors isn’t terrible.
![Page 45: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/45.jpg)
Jitter: how “steady” are latencies?
[Figure: variance from mean, 32-member case.]
The “spread” of latencies is much better (tighter) with Send: the 2-phase SafeSend protocol is sensitive to scheduling delays.
![Page 46: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/46.jpg)
Flush delay as function of shard size
Flush is fairly fast if we only wait for acks from 3-5 members, but is slow if we wait for acks from all members. After we saw this graph, we changed Isis2 to let users set the threshold.
![Page 47: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/47.jpg)
Several ways to make data durable
Putting our insights to work…
![Page 48: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/48.jpg)
Checkpointing
Any group can be made durable using a checkpointing file.
Call g.Persistent(filename).
A checkpoint will periodically be saved, or you can force the creation of checkpoints at times convenient to you.
The entire group shares a single checkpoint file, and it would normally live in the global file system. It should not live in any sort of soft-state file system!
On restart from a total shutdown, the checkpoint is reloaded and the group recovers to its old state.
![Page 49: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/49.jpg)
External databases
If a group is being used to replicate something like a set of external MySQL databases, recovering the group state just isn’t good enough.
We also need to make sure the MySQL replicas are in identical states after a recovery.
This is the case where we use SafeSend with the disk-logging option enabled.
![Page 50: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/50.jpg)
What is the disklogger?
The disklogger is a special form of logged checkpoint, similar to the one used for g.Persistent().
But whereas normally there is just one durability log, this log is replicated with one copy per acceptor.
Messages delivered by SafeSend are appended to this log during phase one.
When an acceptor restarts, its log is scanned and “replayed”. Isis2 will garbage collect a message once all the learners have seen it.
![Page 51: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/51.jpg)
Example: Cloud-Hosted Service
[Figure: a standard Web-Services method invocation reaches some service with members A, B, C and D, each attached to its own DB replica. The distributed request updates the group “state” via SafeSend multicasts, and then the response is returned.]
Use the Isis2 version of Paxos to replicate an external database.
![Page 52: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/52.jpg)
Example: Cloud-Hosted Service
[Figure: a standard Web-Services method invocation reaches some service with members A, B, C and D, each holding an in-memory collection. The distributed request updates the group “state” via Send multicasts followed by g.Flush(), and then the response is returned.]
Cheaper multicast+Flush suffices with in-memory replicas or other situations with soft state, like files local to the replicas on VMs that will be reloaded if a crash occurs.
![Page 53: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/53.jpg)
Check your understanding
Suppose we use SafeSend as shown in the figure, with 4 group members, and all are acceptors.
You send 1 message. How many disk writes occur?
At least 4 (one per log) and perhaps 8 (the database may have a log too). Also, the database needs to be updated!
![Page 54: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/54.jpg)
g.SetDurabilityMethod
Recovery with an external database is a pain!
![Page 55: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/55.jpg)
SetDurabilityMethod
You must tell SafeSend to use the DiskLogger durability method.
When you do this, SafeSend has an extremely strong guarantee: it won’t ever forget messages until it is explicitly told to do so by your code.
This yields a version suitable for use when replicating a database.
![Page 56: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/56.jpg)
Recovering a database replica
After restarting a failed database replica, SafeSend with the DiskLogger durability method will replay all messages that it knows about.
Your job is to make sure all of these updates have been applied to the database, exactly once.
After that you tell SafeSend it can safely garbage collect these messages, and it does so when every group member has told it that the message is safe to garbage collect (at that point, it truncates the disk log).
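“Exactly once” is the tricky part of replay. One common way to get it is to remember the sequence number of the last update applied and skip anything at or below it. The sketch below models that idea only: `ReplicaDb` and `ApplyOnce` are hypothetical, the in-memory dictionary stands in for a real database, and the sketch assumes updates carry increasing sequence numbers and are replayed in order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one database replica that remembers the last update it applied.
public sealed class ReplicaDb
{
    public long LastApplied { get; private set; } = 0;
    public Dictionary<string, int> Rows { get; } = new Dictionary<string, int>();

    // Apply an update identified by a sequence number. Replayed updates that were
    // already applied are skipped, so replaying the log has no further effect.
    public bool ApplyOnce(long seq, string key, int delta)
    {
        if (seq <= LastApplied) return false;   // duplicate from the replayed log
        Rows.TryGetValue(key, out int old);     // old is 0 when the key is absent
        Rows[key] = old + delta;
        LastApplied = seq;                      // a real database would commit this together with the row change
        return true;
    }
}
```

After applying updates 1 and 2, replaying the same two log entries leaves the rows unchanged, while a genuinely new update 3 is applied. In a real database the “last applied” marker has to be written in the same transaction as the update itself, otherwise a crash between the two writes reintroduces the duplicate.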
![Page 57: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/57.jpg)
Why not always use SafeSend?
SafeSend is harder to use.
You must write code to handle replay of the log after recovery.
And SafeSend is also slower.
Many people who assume Paxos is lightweight are surprised that all Paxos systems have high costs.
Paxos is really a kind of durable database: a database of messages!
![Page 58: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/58.jpg)
Durability Summary
To recap:
If your application maintains data purely inside the members of the group, or purely in memory, you can use the standard “optimistic” methods.
Call g.Flush(k) if worried about the tree-in-the-forest case.
Use checkpointing to a log (g.Persistent()) to make the group state survive complete shutdowns.
But switch to SafeSend for the strongest durability requirements. You’ll need to enable the DiskLogger durability method, and to write code to handle restarts and to tell SafeSend when it can garbage collect the log.
![Page 59: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/59.jpg)
Making Checkpoints
How does one make a checkpoint?
![Page 60: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/60.jpg)
State transfer
In general, group members manage data (state).
When s and t join in this example, they don’t have the current state for the group. They obtain it via a state transfer: the white arrow.
In this example, p “writes down” its state (a checkpoint).
Then s and t “load” the state (they read the checkpoint).
[Figure: timeline for processes p, q, r, s, t (time 0 to 70). The white arrow is a state transfer.]
![Page 61: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/61.jpg)
Making a checkpoint
You can save any state you wish.
You can call SendChkpt as many times as needed.

    int istuff;
    double dstuff;

    // loadichkpt and loaddchkpt are assumed to be delegate types declared by the
    // application, e.g. "delegate void loadichkpt(int what);".
    g.MakeChkpt += (Isis.ChkptMaker)delegate(View nv) {
        g.SendChkpt(istuff);   // Checkpoint a single integer
        g.SendChkpt(dstuff);   // Checkpoint a single floating point value
        g.EndOfChkpt();        // Finished making the checkpoint
    };
    g.LoadChkpt += (loadichkpt)delegate(int what) {
        IsisSystem.WriteLine(name + ": Got integer checkpoint: istuff=" + what);
        istuff = what;
    };
    g.LoadChkpt += (loaddchkpt)delegate(double what) {
        IsisSystem.WriteLine(name + ": Got double checkpoint: dstuff=" + what);
        dstuff = what;
    };
![Page 62: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/62.jpg)
Steps
The MakeChkpt method is called from time to time in your program.
You can control exactly when this will happen.
That updates the log files.
Later, after restart, the LoadChkpt method(s) will be called to reload the saved state.
![Page 63: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/63.jpg)
To make a group persistent, store its checkpoint in a global file system.
It will be loaded into the NEXT instance that runs.

    int istuff;
    double dstuff;

    g.MakeChkpt += (Isis.ChkptMaker)delegate(View nv) {
        g.SendChkpt(istuff);   // Checkpoint a single integer
        g.SendChkpt(dstuff);   // Checkpoint a single floating point value
        g.EndOfChkpt();        // Finished making the checkpoint
    };
    g.LoadChkpt += (loadichkpt)delegate(int what) {
        IsisSystem.WriteLine(name + ": Got integer checkpoint: istuff=" + what);
        istuff = what;
    };
    g.LoadChkpt += (loaddchkpt)delegate(double what) {
        IsisSystem.WriteLine(name + ": Got floating point checkpoint: dstuff=" + what);
        dstuff = what;
    };

Note: You must also call myGroup.Persistent(gname); this tells Isis2 to keep checkpoints in a file (in this case with the same name as the group).
There are also ways to control when the checkpoint will be made.
![Page 64: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/64.jpg)
64
Why did we register two loaders?
Isis2 is polymorphic
  Each method can be defined many times with different type signatures
  As events occur, upcalls are done to the ones that match
In our examples we had just one argument to SendChkpt(), but we could have given many:
  g.SendChkpt(x, y, z, ....);
Any data type is allowed, but you must register user-defined types with Isis first
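The idea of upcalling only the handlers whose type signature matches can be illustrated with a small registry. This is a Python model of the concept, not the Isis2 mechanism: `LoaderRegistry`, `register`, and `deliver` are hypothetical names, and exact type matching is an assumption of the sketch.

```python
class LoaderRegistry:
    """Toy dispatcher: each loader is registered with the argument types it accepts."""

    def __init__(self):
        self._loaders = []  # list of (signature tuple, callback)

    def register(self, signature, callback):
        self._loaders.append((tuple(signature), callback))

    def deliver(self, *values):
        """Upcall every loader whose signature equals the types of the values."""
        actual = tuple(type(v) for v in values)
        matched = 0
        for signature, callback in self._loaders:
            if signature == actual:
                callback(*values)
                matched += 1
        return matched

state = {}
registry = LoaderRegistry()
registry.register([int], lambda what: state.update(istuff=what))
registry.register([float], lambda what: state.update(dstuff=what))
registry.register([str, float], lambda key, value: state.update({key: value}))

registry.deliver(7)               # reaches only the int loader
registry.deliver(3.5)             # reaches only the float loader
registry.deliver("Harry", 20.75)  # reaches only the (str, float) loader
print(state)
```

Registering one loader per signature mirrors the two `LoadChkpt` registrations in the slide: the integer and the double each reach their own handler.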
![Page 65: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/65.jpg)
65
State transfer uses checkpoints!
If the checkpoint methods are defined, Isis2 will ask for a checkpoint just as a new member joins
  The old member makes the checkpoint
  The new member loads it
This initializes the joining member
[Figure: myGroup, with labels “state transfer”, “update”, “update”]
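The join sequence (an old member makes the checkpoint, the new member loads it) can be sketched as a toy simulation. Everything here is hypothetical and independent of Isis2: `Member`, `ToyGroup`, and the choice of the oldest member as the checkpoint sender are assumptions made for the example.

```python
class Member:
    def __init__(self, name):
        self.name = name
        self.values = {}

    def make_checkpoint(self):
        return dict(self.values)       # snapshot of the replicated state

    def load_checkpoint(self, chkpt):
        self.values = dict(chkpt)      # initialize the joiner from the snapshot

    def apply_update(self, key, value):
        self.values[key] = value

class ToyGroup:
    def __init__(self):
        self.members = []

    def join(self, newcomer):
        # Assumption of this sketch: the oldest member supplies the checkpoint.
        if self.members:
            newcomer.load_checkpoint(self.members[0].make_checkpoint())
        self.members.append(newcomer)

    def update(self, key, value):
        for m in self.members:         # an update reaches every replica
            m.apply_update(key, value)

g = ToyGroup()
p, q = Member("p"), Member("q")
g.join(p)
g.update("Harry", 20.75)
g.join(q)                  # q starts with p's state via the checkpoint
g.update("Sally", 31.5)    # later updates reach both replicas
print(p.values == q.values)
```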
![Page 66: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/66.jpg)
66
Can we tell what a checkpoint will be used for? Can we do “per use” checkpoints?
Persistent or just State Transfer?
![Page 67: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/67.jpg)
67
What are checkpoints used for?
When you define a checkpoint create/load method, that automatically enables state transfer for joining members
With g.Persistent(), checkpoints play two roles: besides being used for state transfer, they are also logged to a recovery log file that will be reread after recovery from a total shutdown
![Page 68: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/68.jpg)
68
State transfer could be s..l..o..w.. And while it happens, the group freezes up!
What if the group state is large?
[Figure: timeline for processes p, q, r, s, t (Time: 0 to 70), with labels A and B]
![Page 69: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/69.jpg)
69
What if the state is very large?
Really large states can be slow to transfer. While they are being sent, the group itself might hiccup
Best solution? Pre-transfer that huge state, perhaps using the highly efficient “Isis OOB” tool
  Out-of-band transfer is minimally disruptive, and faster too, because the Isis2 system optimizes heavily for this
  But perhaps a few updates might occur after the pre-transfer and before the member is added
  So you can include an argument to Join that tells how big the pre-transfer was, or what “time” it was made. Then the checkpoint only needs to include the delta!
![Page 70: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/70.jpg)
70
Pretransfer
In this picture we send data to r, s and t “out of band”
Isis2 has a tool for that, the OOB file transfer tool. Ideal for big copying
When they join, we send just the residual delta…
[Figure: timeline for processes p, q, r, s, t (Time: 0 to 70)]
![Page 71: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/71.jpg)
71
Enabling this feature
Instead of calling g.Join(), call g.Join(offset)
  Offset tells the group how much of the state you have
  It shows up in the View argument to the make checkpoint method
  Offset 0 means “send the whole state”
Example: pretransfer included updates 0…12345. So you call g.Join(12345). The state transfer contains just updates 12346-12348…
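The offset logic can be sketched with a numbered update log. This Python model is illustrative only and is not how Isis2 stores state: `make_delta` is a hypothetical helper, and the sketch numbers updates from 1 so that offset 0 unambiguously means "the joiner holds nothing".

```python
def make_delta(log, offset):
    """Return the updates a joiner still needs.

    log    -- list of (sequence_number, update), in order
    offset -- highest sequence number the joiner already holds;
              0 means "send the whole state"
    """
    if offset == 0:
        return list(log)
    return [(seq, update) for seq, update in log if seq > offset]

# A group that has performed updates 1..8.
log = [(seq, f"update-{seq}") for seq in range(1, 9)]

# The joiner pre-transferred updates 1..5 out of band, so it joins with offset 5.
pretransferred = [entry for entry in log if entry[0] <= 5]
delta = make_delta(log, 5)

print([seq for seq, _ in delta])        # only the residual updates
print(pretransferred + delta == log)    # pre-transfer plus delta rebuilds the full state
```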
![Page 72: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/72.jpg)
72
What happens in an application that experiences many “events” all at the same time?
When does State Transfer occur?
![Page 73: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/73.jpg)
Isis2 has a strong consistency model: a new form of virtual synchrony.
73
Virtual synchrony is a “consistency” model:
  Membership epochs: begin when a new configuration is installed and reported by delivery of a new “view” and associated state
  Protocols run “during” a single epoch: rather than overcome failure, we reconfigure when a failure occurs
[Figure: timelines for processes p, q, r, s, t (Time: 0 to 70) comparing a synchronous execution and a virtually synchronous execution with a non-replicated reference execution (A=3, B=7, B = B-A, A=A+1)]
![Page 74: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/74.jpg)
74
What Isis2 ensures is that...
State transfer “seems” to occur at the instant when a new view is delivered (all prior multicasts have already been performed)
  This means that the member preparing the state has the correct values for state variables needed by the joining member!
  It is “safe” to send this state
If desired, there is a way for you to specify which member will send state to each joining process
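Why it is safe to send state at the view boundary can be seen in a small replay. The Python sketch below is a toy model, not Isis2 code: it assumes a single totally ordered event stream in which a join is ordered relative to updates, and it reuses the A=3, B=7 values from the earlier reference execution. The function names are hypothetical.

```python
def apply(state, op):
    """Apply one update to a member's variables, in place."""
    if op == "B=B-A":
        state["B"] = state["B"] - state["A"]
    elif op == "A=A+1":
        state["A"] = state["A"] + 1
    else:
        raise ValueError(f"unknown update: {op}")

def run(events):
    """Replay one ordered event stream and return every member's final state."""
    members = {"p": {"A": 3, "B": 7}}
    for kind, arg in events:
        if kind == "update":
            for state in members.values():
                apply(state, arg)
        elif kind == "join":
            # New view: every earlier update has already been applied,
            # so any existing member's state is safe to copy.
            sender = next(iter(members.values()))
            members[arg] = dict(sender)
    return members

events = [("update", "B=B-A"), ("join", "q"), ("update", "A=A+1")]
final = run(events)
print(final)
```

Because the join is ordered after the first update and before the second, q receives B already updated and then applies A=A+1 itself, ending in the same state as p.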
![Page 75: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/75.jpg)
75
How do Queries handle failure?
![Page 76: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/76.jpg)
Queries when failures occur…

// UPDATE and LOOKUP are application-defined request codes
Group g = new Group("myGroup");
Dictionary<string,double> Values = new Dictionary<string,double>();
g.ViewHandlers += delegate(View v) {
    Console.Title = "myGroup members: " + v.members;
};
g.Handlers[UPDATE] += delegate(string s, double v) {
    Values[s] = v;              // Update: change the replicated data
};
g.Handlers[LOOKUP] += delegate(string s) {
    g.Reply(Values[s]);         // Lookup: answer from local state
};
g.Join();

g.Send(UPDATE, "Harry", 20.75);

List<double> resultlist = new List<double>();
int nr = g.Query(ALL, LOOKUP, "Harry", EOL, resultlist);   // nr = number of replies received

First sets up group
Join makes this entity a member. State transfer isn’t shown
Then can multicast, query. Runtime callbacks to the “delegates” as events arrive
Easy to request security (g.SetSecure), persistence
“Consistency” model dictates the ordering seen for event upcalls and the assumptions user can make
76
![Page 77: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/77.jpg)
77
This example used g.Reply
Also available:
  g.AbortReply() – throws exception in the Query caller
  g.NullReply() – Member doesn’t contribute any value but the caller won’t wait for it (useful with ALL)
  g.NoReply() – A risky option: like NullReply but no message of any kind is sent to the caller
Query can also specify an Isis “Timeout”
  new Timeout(delay_ms, action)
  Action is: TO_NULLREPLY, TO_FAILURE, TO_ABORT
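How a caller might fold these reply kinds into one result can be sketched as follows. This is a Python toy, not the Isis2 implementation: `collect` and `QueryAborted` are hypothetical, and the meaning given to each timeout action is only a reading of its name (the slide does not define them). A member that sent nothing before the timeout, which is also what a NoReply looks like to the caller, is modeled as `None`.

```python
class QueryAborted(Exception):
    """Raised in the caller when a member aborts or a timeout is set to abort."""

def collect(responses, timeout_action="TO_NULLREPLY"):
    """Gather query results.

    responses maps member name to ("reply", value), ("null",), ("abort",),
    or None when nothing arrived before the timeout.
    Returns (values, suspected), where suspected lists members treated as failed.
    """
    values, suspected = [], []
    for member, resp in responses.items():
        if resp is None:
            if timeout_action == "TO_ABORT":
                raise QueryAborted(f"timed out waiting for {member}")
            if timeout_action == "TO_FAILURE":
                suspected.append(member)   # assumed meaning: treat as failed
            continue                       # TO_NULLREPLY: act as if it opted out
        kind = resp[0]
        if kind == "abort":
            raise QueryAborted(f"{member} aborted the query")
        if kind == "reply":
            values.append(resp[1])
        # kind == "null": the member contributes no value
    return values, suspected

responses = {"p": ("reply", 20.75), "q": ("null",), "r": None}
print(collect(responses))
print(collect(responses, timeout_action="TO_FAILURE"))
```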
![Page 78: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/78.jpg)
78
How can a caller sense missing replies?
The caller is told how many replies it got
  If you expected 3 but got 2, either someone failed, or they used g.NullReply() to “opt out”
But when you issue the Query you won’t know who is going to be in the group at the time of delivery!
  This is why it often makes sense for replies to specify that “this is reply R of N” (R=rank, N=size of view)
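The “reply R of N” convention lets the caller work out which members did not answer. The sketch below is a Python illustration under stated assumptions: each reply is a hypothetical `(rank, view_size, value)` tuple, ranks run from 0 to N-1, and all repliers saw the same view.

```python
def missing_ranks(replies):
    """Return the ranks that did not answer, given (rank, view_size, value) replies.

    Returns None when there are no replies, since the view size is then unknown.
    """
    if not replies:
        return None
    sizes = {n for _, n, _ in replies}
    if len(sizes) != 1:
        raise ValueError("replies disagree on the view size")
    n = sizes.pop()
    seen = {rank for rank, _, _ in replies}
    return sorted(set(range(n)) - seen)

# Two of three members answered; each reply says "this is reply R of 3".
replies = [(0, 3, 20.75), (2, 3, 20.75)]
print(missing_ranks(replies))
```

With the rank in hand, the caller can decide whether the gap matters, for example by reissuing the request if the silent member held data it needs.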
![Page 79: Ordering and dURABILITY IN Isis 2](https://reader036.vdocument.in/reader036/viewer/2022062310/56816295550346895dd30d0e/html5/thumbnails/79.jpg)
79
Lecture Summary
Isis2 gives you control over
  How durable multicasts and group data will be
  How strongly ordered they will be
  Whether to wait until a multicast has reached k of the destinations before you talk to external observers
Using these forms of control, you can program exactly the behavior you need in a given setting