CS 425 / ECE 428 Distributed Systems Fall 2014
Indranil Gupta (Indy). Lecture 4: Failure Detection and Membership. You’ve been put in charge of a datacenter, and your manager has told you, “Oh no! We don’t have any failures in our datacenter!” Do you believe him/her?
![Page 1: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/1.jpg)
CS 425 / ECE 428 Distributed Systems
Fall 2014
Indranil Gupta (Indy)
Lecture 4: Failure Detection and Membership
All slides © IG
![Page 2: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/2.jpg)
A Challenge
• You’ve been put in charge of a datacenter, and your manager has told you, “Oh no! We don’t have any failures in our datacenter!”
• Do you believe him/her?
• What would be your first responsibility?
• Build a failure detector
• What are some things that could go wrong if you didn’t do this?
![Page 3: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/3.jpg)
Failures are the Norm
… not the exception, in datacenters.
Say the rate of failure of one machine (OS/disk/motherboard/network, etc.) is once every 10 years (120 months) on average.
When you have 120 servers in the DC, the mean time to failure (MTTF) of the next machine is 1 month.
When you have 12,000 servers in the DC, the MTTF is about 7.2 hours!
Soft crashes and failures are even more frequent!
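The arithmetic on this slide can be checked with a short sketch. The function name and the 30-day month are assumptions made for illustration; the slide only gives the per-machine rate (one failure per 120 months) and the server counts.

```python
def mttf_next_failure_hours(machine_mttf_months, num_servers, hours_per_month=30 * 24):
    """Mean time until the next failure anywhere in the datacenter.

    Assumes independent machines with the same average failure rate,
    so the datacenter-wide failure rate scales linearly with num_servers.
    hours_per_month = 720 is an assumed 30-day month.
    """
    return machine_mttf_months * hours_per_month / num_servers


if __name__ == "__main__":
    for servers in (120, 12_000):
        hours = mttf_next_failure_hours(120, servers)
        print(f"{servers} servers: next failure expected in about {hours:.1f} hours")
```

With 120 servers this gives 720 hours (one 30-day month), and with 12,000 servers about 7.2 hours, matching the slide.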
![Page 4: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/4.jpg)
To build a failure detector
• You have a few options
1. Hire 1000 people, each to monitor one machine in the datacenter and report to you when it fails.
2. Write a (distributed) failure detector program that automatically detects failures and reports to your workstation.
Which is preferable, and why?
![Page 5: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/5.jpg)
Target Settings
• Process ‘group’-based systems
– Clouds/Datacenters
– Replicated servers
– Distributed databases
• Crash-stop/Fail-stop process failures
![Page 6: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/6.jpg)
Group Membership Service
[Figure: application process pi holds a membership list. Application queries (e.g., gossip, overlays, DHTs, etc.) go to the group membership list, which a membership protocol maintains by tracking joins, leaves, and failures of members over unreliable communication.]
![Page 7: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/7.jpg)
Two sub-protocols
[Figure: at application process pi, the group membership list is maintained by two sub-protocols, Dissemination and Failure Detector, which talk to other processes (pj) over unreliable communication.]
• Complete list all the time (Strongly consistent)
– Virtual synchrony
• Almost-Complete list (Weakly consistent)
– Gossip-style, SWIM, …
– Focus of this series of lectures
• Or Partial-random list (other systems)
– SCAMP, T-MAN, Cyclon, …
![Page 8: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/8.jpg)
Large Group: Scalability A Goal
[Figure: a process group of 1000’s of processes (“members”) connected by an unreliable communication network; one of them is us (pi).]
![Page 9: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/9.jpg)
Group Membership Protocol
[Figure: processes pi and pj in a group connected by an unreliable communication network.]
I. pj crashed
II. Failure Detector: some process (pi) finds out quickly
III. Dissemination
Crash-stop Failures only
![Page 10: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/10.jpg)
Next
• How do you design a group membership protocol?
![Page 11: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/11.jpg)
I. pj crashes
• Nothing we can do about it!
• A frequent occurrence
• Common case rather than exception
• Frequency goes up linearly with size of datacenter
![Page 12: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/12.jpg)
II. Distributed Failure Detectors: Desirable Properties
• Completeness = each failure is detected
• Accuracy = there is no mistaken detection
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
![Page 13: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/13.jpg)
Distributed Failure Detectors: Properties
• Completeness
• Accuracy
– Completeness and Accuracy are impossible together in lossy networks [Chandra and Toueg]
– If possible, then can solve consensus!
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
![Page 16: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/16.jpg)
What Real Failure Detectors Prefer
• Completeness: Guaranteed
• Accuracy: Partial/Probabilistic guarantee
• Speed
– Time to first detection of a failure (time until some process detects the failure)
• Scale
– Equal Load on each member
– Network Message Load
– No bottlenecks/single failure point
![Page 17: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/17.jpg)
Failure Detector Properties
• Completeness
• Accuracy
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
In spite of arbitrary simultaneous process failures
![Page 18: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/18.jpg)
Centralized Heartbeating
[Figure: every process pi sends (pi, Heartbeat Seq. l++) to a single process pj, which becomes a hotspot.]
• Heartbeats sent periodically
• If heartbeat not received from pi within timeout, mark pi as failed
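The timeout rule above can be sketched as a small monitor running at the central process pj. This is a minimal illustration under assumptions, not the lecture's implementation: the class and method names are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical monitor at the central process pj."""

    def __init__(self, timeout):
        self.timeout = timeout
        # member -> (highest sequence number seen, local time it was seen)
        self.last_seen = {}

    def on_heartbeat(self, member, seq, now):
        """Record a heartbeat only if its sequence number is new."""
        previous = self.last_seen.get(member)
        if previous is None or seq > previous[0]:
            self.last_seen[member] = (seq, now)

    def failed_members(self, now):
        """Members whose heartbeat has not advanced within the timeout."""
        return sorted(
            member
            for member, (_seq, seen_at) in self.last_seen.items()
            if now - seen_at > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=5)
    monitor.on_heartbeat("p1", seq=1, now=0)
    monitor.on_heartbeat("p2", seq=1, now=0)
    monitor.on_heartbeat("p1", seq=2, now=4)  # p2 stays silent
    print(monitor.failed_members(now=6))
```

Because every member reports to the same process, both the message load and the risk of a single point of failure concentrate at pj, which is the hotspot the slide points out.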
![Page 19: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/19.jpg)
Ring Heartbeating
[Figure: processes arranged in a ring; each pi sends (pi, Heartbeat Seq. l++) to its ring neighbor(s), e.g., pj.]
• Unpredictable on simultaneous multiple failures
![Page 20: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/20.jpg)
All-to-All Heartbeating
[Figure: each pi sends (pi, Heartbeat Seq. l++) to all other processes, e.g., pj.]
• Equal load per member
![Page 21: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/21.jpg)
Next
• How do we increase the robustness of all-to-all heartbeating?
![Page 22: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/22.jpg)
Gossip-style Heartbeating
[Figure: pi sends an array of Heartbeat Seq. l for a subset of members.]
• Good accuracy properties
![Page 23: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/23.jpg)
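Gossip-Style Failure Detection
Protocol:
• Nodes periodically gossip their membership list: pick random nodes and send them the list
• On receipt, it is merged with the local membership list
• When an entry times out, the member is marked as failed

[Figure: four nodes (1, 2, 3, 4); node 1 gossips its membership list to node 2. Current time: 70 at node 2 (asynchronous clocks).]

Membership list gossiped by node 1:
Address   Heartbeat Counter   Time (local)
1         10120               66
2         10103               62
3         10098               63
4         10111               65

Membership list at node 2 before the merge:
Address   Heartbeat Counter   Time (local)
1         10118               64
2         10110               64
3         10090               58
4         10111               65

Membership list at node 2 after the merge:
Address   Heartbeat Counter   Time (local)
1         10120               70
2         10110               64
3         10098               70
4         10111               65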
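The merge step in this example can be sketched as follows. It is a minimal illustration that assumes each list is a dictionary from address to (heartbeat counter, local time); the function name `merge_membership` is hypothetical. An entry is refreshed, and stamped with the receiver's own clock, only when the received heartbeat counter is strictly higher.

```python
def merge_membership(local, received, now):
    """Merge a gossiped membership list into the local one.

    local, received: dict mapping address -> (heartbeat counter, local time).
    now: the receiver's local clock; remote timestamps are ignored because
    clocks are asynchronous.
    """
    merged = dict(local)
    for address, (heartbeat, _remote_time) in received.items():
        if address not in merged or heartbeat > merged[address][0]:
            merged[address] = (heartbeat, now)
    return merged


if __name__ == "__main__":
    from_node_1 = {1: (10120, 66), 2: (10103, 62), 3: (10098, 63), 4: (10111, 65)}
    at_node_2 = {1: (10118, 64), 2: (10110, 64), 3: (10090, 58), 4: (10111, 65)}
    for address, entry in sorted(merge_membership(at_node_2, from_node_1, now=70).items()):
        print(address, *entry)
```

Run on the slide's numbers with the current time 70 at node 2, entries 1 and 3 are refreshed and entries 2 and 4 are left alone, which is the merged table shown on the slide.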
![Page 24: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/24.jpg)
Gossip-Style Failure Detection
• If the heartbeat has not increased for more than Tfail seconds, the member is considered failed
• And after Tcleanup seconds, it will delete the member from the list
• Why two different timeouts?
![Page 25: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/25.jpg)
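Gossip-Style Failure Detection
• What if an entry pointing to a failed node is deleted right after Tfail (=24) seconds?
• Fix: remember for another Tfail

[Figure: four nodes (1, 2, 3, 4); the entry for node 3 has timed out, and node 1 gossips its membership list to node 2. Current time: 75 at node 2.]

Membership list gossiped by node 1:
Address   Heartbeat Counter   Time (local)
1         10120               66
2         10103               62
3         10098               55
4         10111               65

Membership list at node 2 before deleting the entry for node 3:
Address   Heartbeat Counter   Time (local)
1         10120               66
2         10110               64
3         10098               50
4         10111               65

Membership list at node 2 after deleting the entry for node 3:
Address   Heartbeat Counter   Time (local)
1         10120               66
2         10110               64
4         10111               65

Membership list at node 2 after merging node 1's gossip (node 3 is added back as if new):
Address   Heartbeat Counter   Time (local)
1         10120               66
2         10110               64
3         10098               75
4         10111               65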
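The reason for the second timeout can be reproduced with a small sketch. It assumes Tcleanup equals Tfail (the slide's "remember for another Tfail") and uses hypothetical helper names. If node 2 deletes the stale entry immediately, the next gossip brings node 3 back with a fresh timestamp; if node 2 keeps the entry, the stale heartbeat counter does not refresh it.

```python
T_FAIL = 24      # from the slide
T_CLEANUP = 24   # assumption: "remember for another Tfail"


def merge(local, received, now):
    """Gossip merge: unknown addresses are added, higher heartbeats refresh."""
    merged = dict(local)
    for address, (heartbeat, _remote_time) in received.items():
        if address not in merged or heartbeat > merged[address][0]:
            merged[address] = (heartbeat, now)
    return merged


def status(entry_time, now):
    """Classify an entry by how long its heartbeat has failed to increase."""
    age = now - entry_time
    if age <= T_FAIL:
        return "alive"
    if age <= T_FAIL + T_CLEANUP:
        return "failed"   # kept in the list, but treated as failed
    return "deleted"


if __name__ == "__main__":
    gossip = {1: (10120, 66), 2: (10103, 62), 3: (10098, 55), 4: (10111, 65)}
    deleted_early = {1: (10120, 66), 2: (10110, 64), 4: (10111, 65)}
    kept = dict(deleted_early)
    kept[3] = (10098, 50)
    print(status(merge(deleted_early, gossip, 75)[3][1], 75))  # entry resurrected
    print(status(merge(kept, gossip, 75)[3][1], 75))           # entry stays failed
```

Keeping the entry for the extra period gives the stale copies held by other nodes time to age out before the entry is finally removed.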
![Page 26: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/26.jpg)
Multi-level Gossiping
• Network topology is hierarchical
• Random gossip target selection => core routers face O(N) load (Why?)
• Fix: In subnet i, which contains ni nodes, pick gossip target in your subnet with probability (1-1/ni)
• Router load=O(1)
• Dissemination time=O(log(N))
• What about latency for multi-level topologies? [Gupta et al, TPDS 06]
[Figure: a router connecting two subnets, each with N/2 nodes.]
(Slide corrected after lecture)
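The router-load claim can be checked with a back-of-the-envelope sketch. It assumes each node sends one gossip message per round, and the function names are hypothetical. With uniform target selection, roughly half of all messages cross the router in the two-subnet picture; with the (1-1/ni) rule, each subnet sends about one message across per round.

```python
def cross_router_load_uniform(subnet_sizes):
    """Expected messages crossing the router per round when every node
    picks its gossip target uniformly among all other nodes."""
    total = sum(subnet_sizes)
    return sum(n * (total - n) / (total - 1) for n in subnet_sizes)


def cross_router_load_multilevel(subnet_sizes):
    """Expected messages crossing the router per round when a node in
    subnet i picks a target outside its subnet with probability 1/ni."""
    return sum(n * (1.0 / n) for n in subnet_sizes)


if __name__ == "__main__":
    subnets = [500, 500]  # N = 1000, two subnets of N/2 nodes
    print(cross_router_load_uniform(subnets))
    print(cross_router_load_multilevel(subnets))
```

The first value grows linearly with N, while the second equals the number of subnets regardless of N.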
![Page 27: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/27.jpg)
Analysis/Discussion
• What happens if gossip period Tgossip is decreased?
• A single heartbeat takes O(log(N)) time to propagate. So N heartbeats take:
– O(log(N)) time to propagate, if the bandwidth allowed per node is O(N)
– O(N·log(N)) time to propagate, if the bandwidth allowed per node is only O(1)
– What about O(k) bandwidth?
• What happens to Pmistake (false positive rate) as Tfail, Tcleanup are increased?
• Tradeoff: False positive rate vs. detection time vs. bandwidth
![Page 28: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/28.jpg)
Next
• So, is this the best we can do? What is the best we can do?
![Page 29: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/29.jpg)
Failure Detector Properties …
• Completeness
• Accuracy
• Speed
– Time to first detection of a failure
• Scale
– Equal Load on each member
– Network Message Load
![Page 31: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/31.jpg)
…Are application-defined Requirements
• Completeness: Guarantee always
• Accuracy: Probability PM(T)
• Speed: T time units
– Time to first detection of a failure
• Scale: N*L (compare this across protocols)
– Equal Load on each member
– Network Message Load
![Page 32: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/32.jpg)
All-to-All Heartbeating
[Figure: every T units, each pi sends (pi, Heartbeat Seq. l++) to all other processes.]
• L=N/T
![Page 33: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/33.jpg)
Gossip-style Heartbeating
[Figure: pi sends an array of Heartbeat Seq. l for a subset of members.]
• Every tg units (= gossip period), send an O(N) gossip message
• T=logN * tg
• L=N/tg=N*logN/T
![Page 34: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/34.jpg)
What’s the Best/Optimal we can do?
• Worst case load L* per member in the group (messages per second)
– as a function of T, PM(T), N
– Independent Message Loss probability pml
• $$L^* = \frac{\log(PM(T))}{\log(p_{ml})} \cdot \frac{1}{T}$$
Slide changed after lecture
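To compare the three load expressions from these slides, a small sketch can evaluate them side by side. The parameter values below are made up for illustration, the function names are hypothetical, and the natural logarithm is an assumption because the slides do not state the base of log N.

```python
import math


def optimal_load(pm_t, p_ml, t):
    """L* = log(PM(T)) / log(pml) * 1/T; the base cancels in the ratio."""
    return math.log(pm_t) / math.log(p_ml) / t


def all_to_all_load(n, t):
    """L = N/T for all-to-all heartbeating."""
    return n / t


def gossip_load(n, t):
    """L = N*logN/T for gossip-style heartbeating (natural log assumed)."""
    return n * math.log(n) / t


if __name__ == "__main__":
    n, t = 1000, 10.0          # illustrative group size and detection time
    pm_t, p_ml = 0.01, 0.1     # illustrative false positive and loss probabilities
    print("optimal   :", optimal_load(pm_t, p_ml, t))
    print("all-to-all:", all_to_all_load(n, t))
    print("gossip    :", gossip_load(n, t))
```

Only the last two expressions contain N, which is the gap the lecture goes on to exploit.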
![Page 35: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/35.jpg)
Heartbeating
• Optimal L is independent of N (!)
• All-to-all and gossip-based: sub-optimal
– L=O(N/T)
– try to achieve simultaneous detection at all processes
– fail to distinguish Failure Detection and Dissemination components
Key:
• Separate the two components
• Use a non heartbeat-based Failure Detection Component
![Page 36: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/36.jpg)
Next
• Is there a better failure detector?
![Page 37: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/37.jpg)
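SWIM Failure Detector Protocol
Protocol period = T’ time units
[Figure: message timeline at pi within one protocol period.]
• pi sends a ping to a random pj and waits for an ack
• If the ping or its ack is lost (X), pi sends a ping-req to K random processes
• Each of the K processes pings pj and relays pj's ack back to pi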
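One protocol period at pi can be sketched as a simulation. This is a rough illustration under assumptions: messages are lost independently with probability `p_ml`, the function name and arguments are hypothetical, and timing within the period is not modeled. It returns whether pi obtained an ack for pj, either directly or through one of the K helpers.

```python
import random


def swim_probe(pi, pj, members, alive, k, p_ml, rng):
    """Simulate one SWIM protocol period at pi for the ping target pj."""

    def delivered():
        # Each message is lost independently with probability p_ml.
        return rng.random() >= p_ml

    def ping_and_ack(dst):
        # The ping and its ack must both arrive, and dst must be alive.
        return dst in alive and delivered() and delivered()

    if ping_and_ack(pj):
        return True  # direct ping succeeded

    helpers = [m for m in members if m not in (pi, pj)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        # ping-req pi -> helper, helper pings pj, helper relays the ack to pi
        if helper in alive and delivered() and ping_and_ack(pj) and delivered():
            return True
    return False


if __name__ == "__main__":
    members = list(range(10))
    alive = set(members) - {3}
    rng = random.Random(0)
    print(swim_probe(0, 3, members, alive, k=3, p_ml=0.1, rng=rng))  # 3 has crashed
    print(swim_probe(0, 5, members, alive, k=3, p_ml=0.0, rng=rng))  # no loss
```

A crashed pj never produces an ack, so the probe always fails for it; for a live pj, the indirect path through K helpers lowers the chance that message loss alone causes a false detection.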
![Page 38: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/38.jpg)
SWIM versus Heartbeating
For fixed false positive rate and message loss rate:
[Figure: plot of First Detection Time against Process Load.]
Protocol       First Detection Time   Process Load
SWIM           Constant               Constant
Heartbeating   Constant               O(N)
Heartbeating   O(N)                   Constant
![Page 39: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/39.jpg)
SWIM Failure Detector
Parameter              SWIM
First Detection Time   • Expected $\frac{e}{e-1}$ periods
                       • Constant (independent of group size)
Process Load           • Constant per period
                       • < 8 L* for 15% loss
False Positive Rate    • Tunable (via K)
                       • Falls exponentially as load is scaled
Completeness           • Deterministic time-bounded
                       • Within O(log(N)) periods w.h.p.
![Page 40: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/40.jpg)
Accuracy, Load
• PM(T) is exponential in -K. Also depends on pml (and pf)
– See paper
• $\frac{L}{L^*} < 28$ and $\frac{E[L]}{L^*} < 8$ for up to 15% loss rates
![Page 41: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/41.jpg)
Detection Time
• Prob. of being pinged in T’ = $1 - \left(1 - \frac{1}{N}\right)^{N-1} \approx 1 - e^{-1}$ (for large N)
• E[T] = $T' \cdot \frac{e}{e-1}$
• Completeness: Any alive member detects failure
– Eventually
– By using a trick: within worst case O(N) protocol periods
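The two expressions on this slide can be evaluated numerically. The sketch below uses hypothetical function names and checks that the probability of a given member being pinged in one period approaches 1 - 1/e as N grows, which is where the expected detection time of e/(e-1) periods comes from.

```python
import math


def prob_pinged_in_period(n):
    """Probability that a given member is picked by at least one of the
    other N-1 members, each choosing one ping target at random."""
    return 1 - (1 - 1 / n) ** (n - 1)


def expected_detection_time(t_prime):
    """E[T] = T' * e / (e - 1), using the large-N limit of the probability."""
    return t_prime * math.e / (math.e - 1)


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, round(prob_pinged_in_period(n), 4))
    print("limit:", round(1 - math.exp(-1), 4))
    print("E[T] in periods:", round(expected_detection_time(1.0), 3))
```

Neither quantity depends on N in the limit, so first detection takes a constant expected number of protocol periods.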
![Page 42: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/42.jpg)
Time-bounded Completeness
• Key: select each membership element once as a ping target in a traversal
– Round-robin pinging
– Random permutation of list after each traversal
• Each failure is detected in worst case 2N-1 (local) protocol periods
• Preserves FD properties
This slide not covered (not in syllabus)
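The target-selection trick can be sketched as a small class. This is an illustrative version with hypothetical names that assumes a fixed, non-empty member list; it walks the list in order and reshuffles it after each full traversal, so every member is chosen exactly once per traversal.

```python
import random


class PingTargetSelector:
    """Round-robin ping target selection with a reshuffle per traversal."""

    def __init__(self, members, rng=None):
        self.rng = rng or random.Random()
        self.order = list(members)   # assumed non-empty
        self.rng.shuffle(self.order)
        self.position = 0

    def next_target(self):
        if self.position == len(self.order):
            # Traversal finished: randomly permute the list and start over.
            self.rng.shuffle(self.order)
            self.position = 0
        target = self.order[self.position]
        self.position += 1
        return target


if __name__ == "__main__":
    selector = PingTargetSelector(["a", "b", "c", "d"], rng=random.Random(42))
    first = [selector.next_target() for _ in range(4)]
    second = [selector.next_target() for _ in range(4)]
    print(first, second)
```

A member that was picked early in one traversal and late in the next waits at most 2N-1 periods between pings in this sketch, which lines up with the worst case quoted on the slide.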
![Page 43: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/43.jpg)
Next
• How do failure detectors fit into the big picture of a group membership protocol?
• What are the missing blocks?
![Page 44: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/44.jpg)
Group Membership Protocol
[Figure: processes pi and pj in a group connected by an unreliable communication network.]
I. pj crashed
II. Failure Detector: some process (pi) finds out quickly
III. Dissemination
Crash-stop Failures only
![Page 45: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/45.jpg)
Dissemination Options
• Multicast (Hardware / IP)
– unreliable
– multiple simultaneous multicasts
• Point-to-point (TCP / UDP)
– expensive
• Zero extra messages: Piggyback on Failure Detector messages
– Infection-style Dissemination
![Page 46: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/46.jpg)
Infection-style Dissemination
Protocol period = T time units
[Figure: the SWIM message timeline at pi: ping to a random pj and its ack; on a loss (X), ping-req to K random processes, which ping pj and relay its ack. The messages carry piggybacked membership information.]
![Page 47: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/47.jpg)
Infection-style Dissemination
• Epidemic/Gossip style dissemination
– After $\lambda \cdot \log(N)$ protocol periods, $N^{-(2\lambda-2)}$ processes would not have heard about an update
• Maintain a buffer of recently joined/evicted processes
– Piggyback from this buffer
– Prefer recent updates
• Buffer elements are garbage collected after a while
– After $\lambda \cdot \log(N)$ protocol periods, i.e., once they’ve propagated through the system; this defines weak consistency
This slide not covered (not in syllabus)
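The logarithmic spread of an update can be illustrated with a simplified simulation. This is a sketch under assumptions, not SWIM itself: every process that knows the update pushes it to one uniformly random process per protocol period, there is no message loss, and the function name is hypothetical.

```python
import math
import random


def periods_to_reach_all(n, rng):
    """Count protocol periods until all n processes know an update that
    starts at process 0, with each informed process telling one random
    process per period."""
    informed = {0}
    periods = 0
    while len(informed) < n:
        newly_informed = {rng.randrange(n) for _ in informed}
        informed |= newly_informed
        periods += 1
    return periods


if __name__ == "__main__":
    rng = random.Random(1)
    for n in (64, 1024):
        print(n, "processes:", periods_to_reach_all(n, rng),
              "periods; log2(N) =", math.ceil(math.log2(n)))
```

The informed set can at most double each period, so the count is never below log2(N); in practice it tends to be a small multiple of log(N), which is why buffered updates can be discarded after a number of periods proportional to log(N).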
![Page 48: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/48.jpg)
Suspicion Mechanism
• False detections, due to
– Perturbed processes
– Packet losses, e.g., from congestion
• Indirect pinging may not solve the problem
• Key: suspect a process before declaring it as failed in the group
![Page 49: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/49.jpg)
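Suspicion Mechanism
[Figure: state machine kept at pi for pj, with states Alive, Suspected, and Failed. FD = failure detector, Dissmn = dissemination.]
• Alive → Suspected: on FD:: pi ping failed, or on Dissmn::(Suspect pj); pi then disseminates (Suspect pj)
• Suspected → Alive: on FD:: pi ping success, or on Dissmn::(Alive pj); pi then disseminates (Alive pj)
• Suspected → Failed: on time out; pi then disseminates (Failed pj)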
![Page 50: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/50.jpg)
Suspicion Mechanism
• Distinguish multiple suspicions of a process
– Per-process incarnation number
– Inc # for pi can be incremented only by pi
• e.g., when it receives a (Suspect, pi) message
– Somewhat similar to DSDV
• Higher inc# notifications over-ride lower inc#’s
• Within an inc#: (Suspect, inc #) > (Alive, inc #)
• (Failed, inc #) overrides everything else
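The override rules above can be written as a single comparison. This is a sketch with hypothetical names that represents each notification about a process as a (state, incarnation number) pair and decides whether a newly received one should replace the one currently held.

```python
# Within the same incarnation number, Suspect beats Alive.
RANK = {"alive": 0, "suspect": 1}


def overrides(new, old):
    """Return True if notification `new` should replace `old`.

    Each notification is a (state, incarnation) pair, with state one of
    "alive", "suspect", or "failed".
    """
    new_state, new_inc = new
    old_state, old_inc = old
    if old_state == "failed":
        return False                 # Failed overrides everything else
    if new_state == "failed":
        return True
    if new_inc != old_inc:
        return new_inc > old_inc     # higher inc# over-rides lower inc#
    return RANK[new_state] > RANK[old_state]


if __name__ == "__main__":
    print(overrides(("suspect", 3), ("alive", 3)))  # same inc#: Suspect wins
    print(overrides(("alive", 4), ("suspect", 3)))  # pj refuted with a higher inc#
    print(overrides(("alive", 9), ("failed", 1)))   # Failed is never undone
```

Because only the suspected process can raise its own incarnation number, an (Alive, inc+1) message is evidence that the process itself saw the suspicion and is still running.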
![Page 51: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/51.jpg)
Wrap Up
• Failures are the norm, not the exception, in datacenters
• Every distributed system uses a failure detector
• Many distributed systems use a membership service
• Ring failure detection underlies
– IBM SP2 and many other similar clusters/machines
• Gossip-style failure detection underlies
– Amazon EC2/S3 (rumored!)
![Page 52: CS 425 / ECE 428 Distributed Systems Fall 2014](https://reader036.vdocument.in/reader036/viewer/2022081506/56813bca550346895da4f4a8/html5/thumbnails/52.jpg)
Important Announcement
• Next week Tue and Thu: We’ll have a flipped classroom! (like Khan Academy)
• Homework before next week
• Please see video lectures for two topics
• Timestamps and Ordering before Tue
• Global Snapshots before Thu
• When you come to class on Sep 9th (Tue) and Sep 11th (Thu) the TAs will be helping you do exercises in class (not HW problems, but other exercise problems we will give you)
• We will not replay videos in class, i.e., there will be no lecturing.
• If you don’t see the videos before class, you will flounder in class. So make sure you see them before class.
• Exercises may count for grades.
• Please bring a pen/pencil and paper to both classes.