why recovery should be free, and often can be armando fox, stanford university june 2003 roc retreat
TRANSCRIPT
Why Recovery Should Be Free, And Often Can Be
Armando Fox, Stanford University
June 2003 ROC Retreat
© 2003 Armando Fox
Recovery Should Be Free, and Can Be
Already espouse arguments about lowering MTTR:
  Mitigates impact on service as a whole [Fox & Patterson, 2002]
  Results in higher end-user-perceived availability, given same overall availability [Xie et al. 2002]
  etc.
Tim Chou, Oracle: maybe more important to make recovery predictable (so can plan provisioning, anticipate impact of outage, etc.)...if we understand it, we can optimize its speed
Real win: Recovery management is hard
Determining when to recover is hard
  How to detect that something’s wrong?
  How do you know when recovery is really necessary? (fail-stutter, etc.)
  Will recovery make things worse? (cascading recovery)
Knowing what happens when you recover is hard
  Will a particular recovery technique work? (the machinery needed to perform the recovery may also be broken)
  What is the effect on online performance? (recovery can be expensive)
  What if you needlessly “over-recover”? (cost of making a mistake is high)
If recovery were predictable and fast, it would simplify both failure detection and recovery management.
Simplifying Recovery Management: Crash-Only Software
Goal: enforce simple invariants on recovery behavior, from outside the component(s) being recovered
Crash-only component provides PWR switch: stop = crash:
  clean shutdown = loss of power = kernel panic = ...
  One way to go down, one way to come up: start = recover
Power switch is external, hence uniform behavior
  kill -9, “turning off” (process kill) a VM, pull power cord
  Intuition: the “infrastructure” supporting the power switch is usually simpler than the applications using it, and common across all those applications
Can crash-only software actually be built, and if so, how?
  (a) provide building blocks
  (b) formalize C/O definition and provide developer
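
To make "stop = crash, start = recover" concrete, here is a minimal sketch of a crash-only component. It is not taken from the talk: the class name `CrashOnlyStore`, the JSON log format, and the `demo` helper are assumptions chosen for illustration. The component has no shutdown method, and its constructor always runs the recovery path.

```python
import json
import os
import tempfile


class CrashOnlyStore:
    """Hypothetical key-value store whose only startup path is log recovery."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._recover()                      # start = recover, every time
        self._log = open(log_path, "a", encoding="utf-8")

    def _recover(self):
        if not os.path.exists(self.log_path):
            return
        good = 0                             # byte offset of the last intact record
        with open(self.log_path, "r+b") as log:
            for raw in log:
                try:
                    if not raw.endswith(b"\n"):
                        raise ValueError("torn record")
                    key, value = json.loads(raw)
                except (ValueError, TypeError):
                    break                    # stop at the first damaged record
                self.data[key] = value
                good += len(raw)
            log.truncate(good)               # drop anything after the last intact record

    def put(self, key, value):
        # Make the record durable before updating memory, so a crash at any
        # point loses at most the write that was in flight.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    # Deliberately no close() or shutdown(): stop = crash.


def demo():
    path = os.path.join(tempfile.mkdtemp(), "store.log")
    store = CrashOnlyStore(path)
    store.put("cart", ["book"])
    store.put("cart", ["book", "pen"])
    del store                                # stands in for kill -9: no cleanup code runs
    return CrashOnlyStore(path).get("cart")
```

Because recovery is the only way to start, the recovery code runs on every restart rather than only after rare failures. The price in this sketch is a synchronous write per update.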
Crash-only Building Blocks
JAGR/ROC-2, a self-recovering J2EE app server [Candea et al., WIAPP 2003]
  Micro-reboots used for recovery, application-generic failure-path inference used for determining recovery strategy
  Significantly improves performability relative to whole-app redeploy
SSM: a CO session state manager [Ling, Fox, AMS 2003]
DStore: a CO persistent single-key state manager [Huang, Fox, submitted to SRDS 2003]
  Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003]
Common features of both SSM and DStore:
  Redundancy used for persistence
  Workload semantics exploited to simplify consistency model & recovery
  Recovery=restart, safe to reboot any node at any time
  Safe to coerce any failure to a crash (fail-stop) at any time
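
The idea of redundancy standing in for persistence, with recovery equal to restart, can be sketched as a toy in-memory store. This is not the SSM or DStore protocol: `Brick`, `ReplicatedSessionStore`, and the write-to-all, read-from-any policy are assumptions made only for illustration.

```python
class Brick:
    """Hypothetical in-memory storage node; a reboot simply forgets everything."""

    def __init__(self):
        self.table = {}

    def reboot(self):
        self.table = {}                      # recovery = restart with empty state


class ReplicatedSessionStore:
    """Writes go to every brick; a read is served by any brick holding the key."""

    def __init__(self, n_bricks=3):
        self.bricks = [Brick() for _ in range(n_bricks)]

    def write(self, session_id, state):
        for brick in self.bricks:
            brick.table[session_id] = state

    def read(self, session_id):
        for brick in self.bricks:
            if session_id in brick.table:
                return brick.table[session_id]
        return None                          # lost only if every copy was rebooted away


store = ReplicatedSessionStore()
store.write("alice", {"cart": ["book"]})
store.bricks[0].reboot()                     # any single node may be rebooted at any time
state_after_reboot = store.read("alice")
```

In this sketch a rebooted brick needs no recovery code: it rejoins empty and is repopulated by later writes. State is lost only if all replicas reboot before the session is written again.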
Building blocks, cont.
Pinpoint, statistical-anomaly-based failure detection
  Standard tension: accuracy vs. precision (false positives problem)
  Different clustering techniques seem to be good at detecting different kinds of problems
  Surprising result from a CS241 project: character-frequency histograms are a good app-generic way to detect end-user-visible failures
  Mostly integrated with JAGR and SSM
  On burner: discussions with BEA Systems for integrating into WebLogic Server
Insight: if cost of “over-recovering” is low, aggressive statistics-based failure detection becomes more appealing
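
A character-frequency histogram check could look roughly like the sketch below. The slide does not say how the CS241 project compared histograms, so the L1 distance, the function names, and the `threshold` default are all assumptions.

```python
from collections import Counter


def char_histogram(text):
    """Return the relative frequency of each character in text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def histogram_distance(a, b):
    """L1 distance between the character histograms of two non-empty texts.

    0.0 means identical character mixes; 2.0 means no characters in common.
    """
    ha, hb = char_histogram(a), char_histogram(b)
    return sum(abs(ha.get(ch, 0.0) - hb.get(ch, 0.0)) for ch in set(ha) | set(hb))


def looks_anomalous(known_good_page, observed_page, threshold=0.5):
    """Flag a response whose character mix is far from a known-good response.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return histogram_distance(known_good_page, observed_page) > threshold
```

The check is application-generic because it uses only the bytes the end user would see. A low threshold produces more false positives, which matters less if over-recovering is cheap.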
Toward a crash-only formalism
Component frameworks force you into certain app-writing patterns
  Inter-EJB calls through runtime-managed level of indirection
  Restrictions on how persistent state mgt can be expressed
  Restrictions on state sharing: difficult to do without using explicit external store
Hypothesis: these are the elements that allow C/O to work
Ongoing work: formalize crash-only SW
  One possibility: observational equivalence with respect to a request stream
  Can be expressed using a design pattern or denotational semantics
  Ideally, will lead to a tool (“co-lint”) telling you whether your component is crash-only
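
One way to read "observational equivalence with respect to a request stream" is as a test: replay the same requests with and without an injected crash and compare the responses. The sketch below is an informal illustration, not the formalism the talk proposes. Both counter classes and the `run` harness are hypothetical, and crashes are injected only between requests, not in the middle of one.

```python
class DurableCounter:
    """Keeps its count in an external store, so a restart loses nothing."""

    def __init__(self, store):
        self.store = store

    def handle(self, request):
        self.store["count"] = self.store.get("count", 0) + 1
        return self.store["count"]


class VolatileCounter:
    """Keeps its count in process memory, so a restart resets it."""

    def __init__(self, store):
        self.count = 0

    def handle(self, request):
        self.count += 1
        return self.count


def run(component_cls, requests, crash_before=()):
    store = {}                               # stands in for an external store that survives crashes
    component = component_cls(store)
    responses = []
    for i, request in enumerate(requests):
        if i in crash_before:
            component = component_cls(store)  # crash + restart: in-memory state is gone
        responses.append(component.handle(request))
    return responses


def observationally_equivalent(component_cls, requests):
    """True if a crash before any single request leaves all responses unchanged."""
    baseline = run(component_cls, requests)
    return all(run(component_cls, requests, crash_before={i}) == baseline
               for i in range(len(requests)))
```

Under this reading, `DurableCounter` passes because its state lives in the explicit external store, while `VolatileCounter` fails because a restart changes what the client observes.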
Summary: Toward a Crash-only World
Goal: simplify recovery management
  diagnosis: statistical methods even more appealing if the cost of making a mistake is low
  recovery: crash-only enforces invariants about what happens when recovery is attempted
  allows aggressive use of fault model enforcement [Martin et al 2002]
Good progress on providing building blocks for app writers
  JAGR: J2EE app server that allows fast recovery via micro-reboots and application-generic fault injection
  SSM: a crash-only session state store (in process of integrating with JAGR)
  DStore: a crash-only persistent single-key store
  PinPoint: statistics-based failure detection (integrated with JAGR, mostly integrated with SSM)
Xie et al: MTTR and End-User Availability
Let A_U = user-perceived unavailability, A_S = system unavailability
Hypothesis: if users retry failed requests, and retry succeeds because system had fast recovery, they will perceive higher availability
  When retries are sufficiently frequent, A_U approaches A_S (for 99.3% system availability, this threshold is 200-300 sec)
Method: model user retry behavior and system failure/recovery using Markov models; solve using numerical methods
Finding: Given 2 systems with same A_S, the one with shorter MTTR (even though it also has lower MTTF) appears better to the user.
Goal of this project: validate that result empirically (Jeff Raymakers, Yee-Jiun Song, Wendy Tobagus)
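
The premise "two systems with the same A_S but different MTTR" follows from the standard steady-state relation availability = MTTF / (MTTF + MTTR). The sketch below shows it numerically. The function names and the example figures (hours, chosen to give 99.3%) are assumptions, and this is not the Markov model used by Xie et al.

```python
import math


def availability(mttf, mttr):
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf / (mttf + mttr)


def scaled(mttf, mttr, factor):
    """Scale MTTF and MTTR together; the ratio, and so the availability, is unchanged."""
    return mttf * factor, mttr * factor


slow_recovery = (993.0, 7.0)                 # hypothetical: fails rarely, repairs slowly
fast_recovery = scaled(*slow_recovery, 0.01)  # fails 100x more often, repairs 100x faster

same_availability = math.isclose(availability(*slow_recovery),
                                 availability(*fast_recovery))
```

Both hypothetical systems have 99.3% availability, so any difference users perceive between them must come from how outage length interacts with retry behavior.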
User perceived unavailability vs retry rate
[Figure: user-perceived unavailability vs. retry rate. Annotation: “sweet spot”; higher user retry rates yield little improvement in perceived availability.]
Surprise! MTTF eventually catches up with you
[Figure: variable MTTR, but fixed system availability (low MTTR -> low MTTF). Annotation: “sweet spot”; at low MTTR, lowering MTTR and MTTF at the same time results in worse user-perceived unavailability!]
Optimization Choices
[Figure: plot labels: Fixed MTTF, Fixed MTTR, System Unavailability, User Perceived Unavailability]
Results Summary
We can find a “sweet spot” (for a given system availability) beyond which higher user retry rates yield little benefit.
For two systems of a given availability, the one with lower MTTR does not always yield better user-perceived availability.
For a given system, we can determine whether improving MTTR or MTTF will yield more user-visible benefits.
“Clean” shutdown vs. restart?
Impractical to guarantee zero crashes, so robust systems must be crash-safe anyway
In that case, why support any other kind of shutdown?
  Historically, for performance (avoid synchronous writes, do buffering/caching, etc) - leads to replicated/mirrored state, more code, special recovery code paths...
Crash-only software must: (a) be crash-safe & (b) recover quickly
Total recovery time may be shorter even if crash is forced
  WinXP can be (mostly) crash-rebooted for upgrades
  VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on)
Why Crash-Only Simplifies Recovery
“Hardware works, software doesn’t”
  Hardware interlocks, timers, etc. have small state spaces of behavior, hence high confidence they will work as designed
  Crash-only PWR switch is a way to approach that same property for software
Crash-only makes recovery policies easier to reason about
  Opportunity to aggressively apply SW rejuvenation
  “Recovery” code exercised on every restart; no exotic-but-rarely-used code paths
  “Over-recovery” may be OK from performability standpoint: if recovery is free (performance & correctness), you stop thinking about it as recovery and start thinking about it as normal aspect of operation
Towards a Crash-Only World
Existing software that is crash-only or near-crash-only
  Stateless apps: most Web servers
  Most RDBMS’s: crash-safe, but long recovery
  Postgres, BerkeleyDB/Sleepycat: “recovery” codepath is the main codepath
  Some appliance storage devices: separate but pretty fast recovery path
Our goals...
  Focus on Internet (“3 tier”) applications; already “crash-mostly” except for persistence tier(s)
  Make the app server, middle-tier persistence, and back-end tier (to the extent possible) truly crash-only
  Deploy application-generic failure detection techniques (which may over-recover, but the goal is to make that OK)
  Quantify improvement (we hope!) in performability resulting from these changes
  By doing it in the middleware, any app on that middleware can benefit