![Page 1: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/1.jpg)
Revisiting Failure Detection for Grid Systems
Xavier DÉFAGO
1) School of Information Science, Japan Adv. Inst. of Science & Tech. (JAIST)
2) PRESTO, Japan Science & Tech. Agency (JST)
IFIP WG 10.4 – Summer 2005 meeting – July 2005. Hakone, Japan.
![Page 2: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/2.jpg)
Acknowledgements
• Naohiro HAYASHIBARA
• now at Tokyo Denki University
• Péter URBÁN
• Rami YARED
• Takuya KATAYAMA
• ... and many people through enlightening discussions
![Page 3: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/3.jpg)
Related Projects
• COE program “Trustworthy e-Society”
• PRESTO, JST “Information & Systems”
• Jinzai Yosei “Dependable Internet”
• OBIGrid
• Bioinformatics Grid; RIKEN & AIST
• StarBED Internet Emulator
• OurGrid, PlanetLab.
![Page 4: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/4.jpg)
Grid Systems
• What Grid?
• Data-G, computational-G, domain-G, ..., *-Grid
• What is the/a Grid?
• Structured Internet?
• Loosely coupled global / enterprise network?
• Decentralized distributed OS?
• Key point
• Virtualizing of resources, ...
• “Glue” between resources: i.e., distributed system
![Page 5: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/5.jpg)
Grid Systems & Fault-Tolerance
• Needs
• 24/7 operation,
• reliability & availability,
• self-managing, auto-configuration,...
• security, accountability,...
• Current Reality
• ... a LOOOONG way to go!
![Page 6: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/6.jpg)
Failure Detection in Grid
• Failure detection
• ability to detect failed components
• prevents blocking forever
• basic mechanism for fault-tolerance
• Failure detection as service
• E.g., [Stelling et al. 1998], [van Renesse et al. 1998],...
• E.g., NTP for clock synchronization
![Page 7: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/7.jpg)
Failure Detection as Service
• Current situation
• ad hoc detection rather than service
• hardcoded timeouts in programs
• hidden behind heavy abstractions
• “proprietary” mechanisms
• Open challenges (highly opinionated)
• proper abstractions, QoS negotiation
• unattended management
• reduction of overhead, scalability
![Page 8: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/8.jpg)
Simult. Indep. Requirements
• Large-scale systems
• Many distributed applications simultaneously
• Different requirements
[Figure: hosts p, q, and r running processes p1, p2, q1 to q4, r1, r2 that belong to several applications]
![Page 9: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/9.jpg)
Example / Motivation
• Simple case
• “Bag-of-Tasks” computations
• Dispatch tasks
• Wait for results
• Environment
• Partial failures
• Heterogeneous
• Unpredictable comm.
![Page 10: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/10.jpg)
Usage Patterns
• Case 1:
• Cost varies with time:
• amount of work completed
• available resources
• Case 2:
• Important task
• Most likely up machine
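A minimal sketch of how the two usage patterns might look in a scheduler, assuming the failure detector exposes a numeric suspicion level per host (as the accrual detectors later in the talk do). The function names and the threshold scaling rule are hypothetical, chosen only for illustration.

```python
from typing import Dict


def should_abort(suspicion: float, work_done: float,
                 base_threshold: float = 1.0) -> bool:
    """Case 1 (hypothetical rule): the cost of a wrong suspicion grows with
    the work already completed, so demand more evidence before aborting.

    work_done is the completed fraction of the task, between 0.0 and 1.0.
    """
    return suspicion > base_threshold * (1.0 + work_done)


def pick_host(suspicion_by_host: Dict[str, float]) -> str:
    """Case 2: place an important task on the host that looks most likely
    to be up, i.e. the one with the lowest suspicion level."""
    return min(suspicion_by_host, key=suspicion_by_host.get)


levels = {"p": 0.2, "q": 3.1, "r": 0.05}
print(pick_host(levels))                  # host with the lowest suspicion
print(should_abort(1.5, work_done=0.0))   # fresh task: abort readily
print(should_abort(1.5, work_done=0.9))   # nearly done: keep waiting
```

Both decisions read the same suspicion value but weigh it differently, which is hard to express with a single hardcoded timeout.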
![Page 11: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/11.jpg)
Abstractions
![Page 12: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/12.jpg)
Accrual Failure Detectors
• Accrual failure detection [Hayashibara; PhD 2004]
• 2 roles: monitoring, interpretation
• interpretation –> QoS
• => decoupling
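The decoupling of monitoring from interpretation can be sketched as follows. This is a toy illustration, not the authors' implementation: the `Monitor` class, the `make_interpreter` helper, and the use of "seconds since last heartbeat" as the suspicion level are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Monitor:
    """Monitoring role: turns heartbeat timing into a suspicion level."""
    last_heartbeat: float = 0.0

    def heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def suspicion_level(self, now: float) -> float:
        # Toy scale (assumption): seconds elapsed since the last heartbeat.
        return max(0.0, now - self.last_heartbeat)


def make_interpreter(threshold: float) -> Callable[[float], str]:
    """Interpretation role: each application picks its own threshold."""
    def interpret(level: float) -> str:
        return "suspect" if level > threshold else "trust"
    return interpret


monitor = Monitor()
monitor.heartbeat(now=10.0)

aggressive = make_interpreter(threshold=0.5)    # wants fast detection
conservative = make_interpreter(threshold=3.0)  # wants few mistakes

level = monitor.suspicion_level(now=11.0)
print(aggressive(level), conservative(level))
```

One monitoring stream serves both applications; only the interpretation differs, which is where the QoS choice is made.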
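[Figure: Binary FD vs. Accrual FD. Binary FD: the failure detection service performs monitoring and interpretation and delivers suspicions to programs and protocols, which take action. Accrual FD: the service performs monitoring and delivers a suspicion level; each program or protocol does its own interpretation before acting, or takes a parametric action directly]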
![Page 13: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/13.jpg)
Accrual Failure Detectors
• Accrual FD abstraction [Défago et al.; DSN 2005]
• combine different QoS
• properties; relation w/FD theory
[Figure: suspicion level sl_qp(t) over time t; thresholds T1(t) and T2(t) define binary detectors D_T1 and D_T2, each switching between trust and suspect]
![Page 14: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/14.jpg)
Chen FD as Accrual
• Chen-based adaptation [Chen et al.; TC 2002]
• After freshness point, increase with time
• Reset when receive heartbeat
• Safety margin α set with threshold
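A minimal sketch of the adaptation described above, under simplifying assumptions: the expected arrival of the next heartbeat is estimated as the last arrival plus the mean inter-arrival time over a window (a simplification of the estimator in [Chen et al.; TC 2002]), and the suspicion level is the time elapsed past that point. The class and method names are hypothetical.

```python
from collections import deque


class ChenStyleAccrual:
    """Suspicion level stays at 0 until the expected arrival of the next
    heartbeat, then grows linearly with time; a heartbeat resets it."""

    def __init__(self, window: int = 100):
        self.arrivals = deque(maxlen=window)

    def heartbeat(self, now: float) -> None:
        self.arrivals.append(now)

    def expected_next_arrival(self) -> float:
        a = list(self.arrivals)
        if len(a) < 2:
            raise ValueError("need at least two heartbeats")
        mean_gap = (a[-1] - a[0]) / (len(a) - 1)
        return a[-1] + mean_gap

    def suspicion_level(self, now: float) -> float:
        return max(0.0, now - self.expected_next_arrival())


fd = ChenStyleAccrual()
for t in (0.0, 1.0, 2.0, 3.0):
    fd.heartbeat(t)

alpha = 0.5  # the application's threshold plays the role of the safety margin
print(fd.suspicion_level(3.5) > alpha)  # before the expected arrival at 4.0
print(fd.suspicion_level(4.6) > alpha)  # 0.6 s late, beyond the margin
```

Applications with different thresholds effectively run Chen-style detectors with different safety margins over a single heartbeat stream.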
[Figure: suspicion level sl_qp(t) over time t, for heartbeats sent from p to q, with p crashing (“BOOM”)]
![Page 15: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/15.jpg)
φ Accrual FD
• φ failure detector [Hayashibara et al.; SRDS 2004]
• Heartbeat based, estimate arrival distribution
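[Figure: φ failure detector. Heartbeat arrivals from the network fill a sampling window, from which the distribution P_later(t) is estimated. The detector outputs φ, which each application interprets on its own: App. 1 suspects when φ > Φ1, App. 2 suspects when φ > Φ2, App. 3 performs action(φ)]
$\varphi(t_{\mathrm{now}}) = -\log_{10}\bigl(P_{\mathrm{later}}(t_{\mathrm{now}} - T_{\mathrm{last}})\bigr)$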
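A minimal sketch of the φ computation, φ = −log10 P_later(t_now − T_last), assuming heartbeat inter-arrival times are roughly normally distributed and estimated from a sliding window. The class name, the window size, the `min_std` floor, and the thresholds in the usage lines are assumptions made for illustration.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrual:
    def __init__(self, window: int = 1000, min_std: float = 1e-3):
        self.gaps = deque(maxlen=window)  # sampling window of inter-arrivals
        self.t_last = None                # arrival time of the last heartbeat
        self.min_std = min_std            # floor to avoid a zero variance

    def heartbeat(self, now: float) -> None:
        if self.t_last is not None:
            self.gaps.append(now - self.t_last)
        self.t_last = now

    def phi(self, now: float) -> float:
        if len(self.gaps) < 2:
            raise ValueError("need at least three heartbeats to estimate")
        mu = mean(self.gaps)
        sigma = max(pstdev(self.gaps), self.min_std)
        # P_later: probability that a heartbeat arrives more than
        # (now - t_last) after the previous one, under a normal model.
        z = (now - self.t_last - mu) / (sigma * math.sqrt(2.0))
        p_later = max(0.5 * math.erfc(z), 1e-300)  # clamp to avoid log10(0)
        return -math.log10(p_later)


fd = PhiAccrual()
for t in (0.0, 1.0, 2.1, 3.0, 4.05):
    fd.heartbeat(t)

PHI_1, PHI_2 = 1.0, 8.0  # hypothetical thresholds for App. 1 and App. 2
for now in (5.0, 5.3, 6.0):
    phi = fd.phi(now)
    print(now, phi > PHI_1, phi > PHI_2)
```

Because φ is on a logarithmic scale, a threshold of 1 roughly means accepting a 10% chance that the heartbeat is merely late, and a threshold of 2 a 1% chance, under the assumed distribution.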
![Page 16: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/16.jpg)
QoS of Failure Detectors
![Page 17: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/17.jpg)
QoS of Failure Detectors
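• Metrics when p faulty:
• Detection time
[Figure: timeline of the monitored process (up, then down at the crash) against the FD output (trust, then suspect); the detection time is the interval between the two transitions]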
![Page 18: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/18.jpg)
QoS of Failure Detectors
• Metrics (accuracy) when p correct:
• average mistake rate
• query accuracy prob.
• good period duration
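The three accuracy metrics can be estimated from a trace of wrong suspicions, as in this rough sketch. The trace format (sorted, non-overlapping intervals during which a correct p was suspected) and the function name are assumptions; the average good period is a simple estimate that counts the leading and trailing trust intervals as full periods.

```python
from typing import Dict, List, Tuple


def accuracy_metrics(mistakes: List[Tuple[float, float]],
                     duration: float) -> Dict[str, float]:
    """mistakes: (start, end) intervals within [0, duration] during which
    the FD wrongly suspected a correct process."""
    suspected = sum(end - start for start, end in mistakes)
    # Good periods are the trust intervals between consecutive mistakes.
    edges = [0.0] + [t for interval in mistakes for t in interval] + [duration]
    good = [b - a for a, b in zip(edges[0::2], edges[1::2]) if b > a]
    return {
        "mistake_rate": len(mistakes) / duration,           # mistakes per second
        "query_accuracy_prob": 1.0 - suspected / duration,  # P(output correct)
        "avg_good_period": sum(good) / len(good) if good else 0.0,
    }


metrics = accuracy_metrics([(10.0, 12.0), (50.0, 51.0)], duration=100.0)
print(metrics)
```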
[Figure: the monitored process stays up while the FD output switches between trust and suspect; each wrong suspicion is a mistake]
![Page 19: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/19.jpg)
Requirements vs. Guarantees
• Application requirements
• {D,A}: max. detection time, max. mistakes
• FD QoS
• {d,a}: effective detection time, effective mistakes
[Figure: (in)accuracy plotted against detection time, with the points {D,A} and {d,a} marked]
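Read this way, a detector meets a requirement when its effective point {d,a} is no worse than {D,A} on both axes. A small sketch, assuming (in)accuracy is expressed as a mistake rate; the `QoS` type and `satisfies` function are hypothetical names.

```python
from typing import NamedTuple


class QoS(NamedTuple):
    detection_time: float  # seconds
    mistake_rate: float    # mistakes per second (assumed accuracy measure)


def satisfies(effective: QoS, required: QoS) -> bool:
    """True if {d,a} is at least as good as {D,A} on both axes."""
    return (effective.detection_time <= required.detection_time
            and effective.mistake_rate <= required.mistake_rate)


required = QoS(detection_time=1.0, mistake_rate=0.01)
print(satisfies(QoS(0.8, 0.005), required))  # faster and more accurate
print(satisfies(QoS(0.4, 0.05), required))   # faster but too many mistakes
```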
![Page 20: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/20.jpg)
In a Perfect World
• Ideal
• FD limited by min. network latency
• “acceptable” network/system load
[Figure: (in)accuracy plotted against detection time, showing the minimum network latency, {d,a}, and {D,A}]
![Page 21: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/21.jpg)
In a Perfect World
• Perfect FD
• “realistic” detection time
• absolute accuracy (no mistakes)
• (some failure types can be detected perfectly)
[Figure: (in)accuracy plotted against detection time, showing the minimum network latency, the perfect failure detector, and {d,a}]
![Page 22: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/22.jpg)
In a Less Perfect World
• Unreliable FDs
• “realistic” detection latency
• imperfect accuracy
[Figure: (in)accuracy plotted against detection time, showing the minimum network latency, the unreliable failure detector, and {d,a}]
![Page 23: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/23.jpg)
Parametric Failure Detector
• Parametric FDs
• Parameter value defines FD best QoS
• E.g., Chen FD,...
• Tradeoff: accuracy <-> detection latency
[Figure: (in)accuracy plotted against detection time, with the points {d,a} and {d',a'} marked]
![Page 24: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/24.jpg)
QoS Coverage
• Coverage of FD
• FD could be tuned to support app. req.
• Measure of FD
[Figure: (in)accuracy plotted against detection time]
![Page 25: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/25.jpg)
Dynamic QoS Coverage
• Approximate coverage
• Instantiate several QoS sets
• Find minimal set; minimal change
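The slide does not say how the minimal set is found. One possible reading, sketched below, treats it as a set-cover problem and uses a greedy heuristic, swapped in here purely for illustration: choose a small number of FD instantiations so that every application requirement is met by at least one of them. All names and numbers are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (detection time [s], mistake rate [1/s])


def meets(candidate: Point, requirement: Point) -> bool:
    return candidate[0] <= requirement[0] and candidate[1] <= requirement[1]


def greedy_cover(candidates: List[Point],
                 requirements: List[Point]) -> List[Point]:
    """Greedy set cover: repeatedly pick the instantiation that satisfies
    the most requirements not yet covered. Small, not provably minimal."""
    uncovered = set(range(len(requirements)))
    chosen: List[Point] = []
    while uncovered:
        best = max(candidates,
                   key=lambda c: sum(meets(c, requirements[i])
                                     for i in uncovered))
        covered = {i for i in uncovered if meets(best, requirements[i])}
        if not covered:
            raise ValueError("some requirement cannot be met")
        chosen.append(best)
        uncovered -= covered
    return chosen


# Operating points along the accuracy / detection-time tradeoff.
candidates = [(0.2, 0.1), (0.5, 0.01), (1.0, 0.001)]
requirements = [(0.6, 0.05), (1.2, 0.02), (1.5, 0.002)]
chosen = greedy_cover(candidates, requirements)
print(chosen)
```

Here two instantiations are enough for three applications; the fastest operating point is never instantiated because no requirement needs it.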
[Figure: (in)accuracy plotted against detection time]
![Page 26: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/26.jpg)
Experimentation
![Page 27: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/27.jpg)
Comparative Analyses
• 3 FD implementations
• Chen FD ; [Chen et al.] (FTCS 2000; TC 2002)
• Bertier FD ; [Bertier et al.] (DSN 2002)
• PHI accrual FD ; [Hayashibara et al.] (SRDS 2004)
• Goal
• “Realistic” executions (e.g., LAN, WAN)
• Identify QoS coverage
![Page 28: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/28.jpg)
Experimentation: LAN
• LAN
• single FastEther hub
• Parameters
• HB interval: 20 ms
• Duration: 5½ hours
• Total HB: 1’000’000
• no loss
[Figure: mistake rate [1/s] (log scale, 1e-05 to 1) vs. detection time [s] (0 to 1.4) for the φ accrual FD, Chen FD, and Bertier FD]
![Page 29: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/29.jpg)
Experimentation: WAN
• WAN
• JAIST (JP) – EPFL (CH)
• Parameters
• HB interval: 100 ms
• Duration: 1 week
• Total HB: ~ 6’000’000
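As a quick sanity check on the experiment parameters, the heartbeat totals follow directly from interval and duration. The helper names below are hypothetical; the inputs are the figures quoted on the LAN and WAN slides.

```python
def total_heartbeats(duration_s: int, interval_ms: int) -> int:
    """Number of heartbeats sent over duration_s at a fixed interval."""
    return duration_s * 1000 // interval_ms


def duration_hours(heartbeats: int, interval_ms: int) -> float:
    """Time needed to send the given number of heartbeats, in hours."""
    return heartbeats * interval_ms / 1000 / 3600


lan_hours = duration_hours(1_000_000, interval_ms=20)     # LAN experiment
wan_total = total_heartbeats(7 * 24 * 3600, interval_ms=100)  # WAN, 1 week
print(round(lan_hours, 2), wan_total)
```

The WAN total matches the quoted figure of roughly 6’000’000 heartbeats for one week at 100 ms.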
[Figure: mistake rate [1/s] (log scale, 0.001 to 0.4) vs. detection time [s] (0 to 2.5) for the φ accrual FD, Chen FD, and Bertier FD]
![Page 30: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/30.jpg)
Experimentation: WAN
[Figure: WAN trace over time (UTC), Apr 3 to Apr 9, 2004, annotated “W32/Netsky.T@mm”]
[Figure: histogram of occurrence [# bursts] (0 to 500) vs. burst length [# lost messages] (0 to 25)]
![Page 31: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/31.jpg)
Wrapping Up
![Page 32: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/32.jpg)
Conclusion
• Ongoing work
• Translucent abstractions
• Improved implementations
• Wider experimentation
• QoS negotiation
• Much work to do...
• Self-configuration
• Low-overhead protocols
• Notification mechanisms
![Page 33: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/33.jpg)
Future Directions
• QoS Coverage
• stricter definition
• gradients (uncertainty)
• QoS negotiation
• dynamic (re-)negotiation
• prob./best-effort negotiation
• fail-safe enforcement
[Figure: (in)accuracy plotted against detection time]
![Page 34: Revisiting Failure Detection for Grid Systemswebhost.laas.fr/TSF/IFIPWG/Workshops&Meetings/48/WS1/06-Defago.pdf · Xavier D FAGO 1) Sch ool of Information Science, Japan Adv. Inst](https://reader030.vdocument.in/reader030/viewer/2022041013/5ec2473393499213b127a7b2/html5/thumbnails/34.jpg)
Future Directions
• Other environments
• E.g., wireless, dial-up,...
• Characterize traffic
• metrics
• clustering
• “benchmarking” sets
[Figure: traffic characterization with axes labelled latency and entropy, showing Wireless, WAN, and LAN; time axis (UTC) from Apr 3 to Apr 9]