Fast Recovery + Statistical Anomaly Detection = Self-*

RADS/KATZ CATS Panel

June 2004 ROC Retreat

Outline

• Motivation & approach: complex systems of black boxes
  • Measurements that respect black boxes
  • Box-level micro-recovery cheap enough to survive false positives
• Differences from related efforts
• Early case studies
• Research agenda

Complex Systems of Black Boxes

“...our ability to analyze and predict the performance of the enormously complex software systems that lies at the core of our economy is painfully inadequate.” (Choudhury & Weikum, 2000 PITAC Report)

• Build model of “acceptable” operating envelope by measurement & analysis
  • Control theory, statistical correlation, anomaly detection...
• Rely on external control, using inexpensive and simple mechanisms that respect the black box, to keep system in its acceptable operating envelope
  • “Increase the size of the DB connection pool” [Hellerstein et al]
  • “Reallocate one or more whole machines” [Lassettre et al]
  • “Rejuvenate/reboot one or more machines” [Trivedi, Fox, others]
  • “Shoot one of the blocked txns” [everyone]
  • “Induce memory pressure on other apps” [Waldspurger et al]

Differences from some existing problems

• intrusion detection (Hofmeyr et al 98, others)
  • Detections must be actionable in a way that is likely to improve system (sacrificing availability for safety is unacceptable)
• bug finding via anomaly detection (Engler, others)
  • Human-level monitoring/verification of detections not feasible, due to number of observations and short timescales for reaction
  • Can separate recovery from diagnosis/repair (don’t always need to know root cause to recover)
• modeling/predicting SLO violations (Hellerstein, Goldszmidt, others)
  • Labeled training set not necessarily available

Many other examples, but the point is...

• Statistical techniques identify “interesting” features and relationships from large datasets, but frequent tradeoff between detection rate (or detection time) and false positives
• Make “micro-recovery” so inexpensive that occasional false positives don’t matter
• Granularity of black box should match granularity of available external control mechanisms

“Micro-recovery” to survive false positives

• Goal: provide “recovery management invariants”
  • “Salubrious”: returns some part of system to known state
    • Reclaim resources (memory, DB conns, sockets, DHCP lease...)
    • Throw away corrupt transient state
    • Possibly set up to retry operation, if appropriate
  • Safe: affects only performance, not correctness
  • Non-disruptive: performance impact is “small”
  • Predictable: impact and time-to-complete are stable

Observe, Analyze, Act: not recovery, but continuous adaptation

Crash-Only Building Blocks

Subsystem | Control point | How realized | Statistical monitoring
SSM (diskless session state store) [NSDI 04] | Whole-node fast reboot (doesn’t preserve state) | Quorum-like redundancy; relaxed consistency; repair cost spread over many operations | Time series of state metrics (Tarzan)
DStore (persistent hashtable) [in preparation] | Whole-node reboot (preserves state) | (cell shared with SSM row) | (cell shared with SSM row)
JAGR (J2EE application server) [AMS 2003 & in prep.] | Microreboots of EJB’s | Modify appserver to undeploy/redeploy EJB’s and stall pending reqs | Anomalous code paths and component interactions (probabilistic context-free grammar)

• Control points are safe, predictable, non-disruptive

• Crash-only design: shutdown=crash, recover=restart

• Makes state-management subsystems as easy to manage as stateless Web servers

Example: Managing DStore and SSM

• Rebooting is the only control mechanism
  • Has predictable effect and takes predictable time, regardless of what the process is doing
    • Like kill -9, “turning off” a VM, or pulling power cord
  • Intuition: the “infrastructure” supporting the power switch is simpler than the applications using it
  • Due to slight overprovisioning inherent in replication, rebooting can have minimal effect on throughput & latency
  • Relaxed consistency guarantees allow this to work
• Activity and state statistics collected per brick every second; any deviation => reboot brick
• Makes it as easy as managing a stateless server farm
  • Backpressure at many design points prevents saturation

Design Lessons Learned So Far

• “A spectrum of cleaning operations” (Eric Anderson, HP Labs)
  • Consequence: as t → ∞, all problems will converge to “repair of corrupted persistent data”
• Trade “unnecessary” consistency for faster recovery
  • Spread recovery actions out incrementally/lazily (read repair) rather than doing it all at once (log replay)
    • gives predictable return-to-service time and acceptable variation in performance after recovery
    • keeps data available for reads and writes throughout “recovery”
  • Use single-phase ops to avoid coupling/locking and the issues they raise, and justify the cost in consistency
• It’s OK to say no (backpressure)
  • Several places our design got it wrong in SSM
  • But even those mistakes could have been worked around by guard timers
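A minimal sketch of “it’s OK to say no”, assuming a brick with a bounded inbox. The class name `BoundedBrick` and the parameter `max_pending` are hypothetical and not taken from SSM or DStore; the point is only that a full brick refuses work immediately instead of queueing it without limit.

```python
import queue


class BoundedBrick:
    """Hypothetical request handler with a bounded inbox (not the SSM code)."""

    def __init__(self, max_pending=2):
        self.inbox = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        """Accept the request, or say no right away if the inbox is full."""
        try:
            self.inbox.put_nowait(request)
            return True
        except queue.Full:
            return False  # caller may back off, retry later, or try another brick


brick = BoundedBrick(max_pending=2)
results = [brick.submit(f"req-{i}") for i in range(4)]
print(results)  # the first two requests are accepted, the rest refused
```

Refusing early keeps the amount of queued work per brick bounded, which is one way to keep reboot time and post-reboot behavior predictable.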

Potential Limitations and Challenges

• Hard failures
• Configuration failures
  • Although similar approach has been used to troubleshoot those
• Corruption of persistent state
  • Data structure repair work (Rinard et al.) may be combinable with automatic inference (Lam et al.)
• Challenges
  • Stability and the “autopilot problem”
  • The base-rate fallacy
  • Multilevel learning
  • Online implementations of SLT techniques
  • Nonintrusive data collection and storage
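A worked example of the base-rate fallacy, as a sketch under one loud assumption: the prior π = 1% (the fraction of observations that are truly faulty) is made up for illustration and does not come from the experiments. The detection rate d = 80% and false-positive rate f = 1.8% are the “across all expts” figures reported for path-based analysis elsewhere in this deck.

```latex
P(\text{fault} \mid \text{alarm})
  = \frac{d\,\pi}{d\,\pi + f\,(1 - \pi)}
  = \frac{0.80 \times 0.01}{0.80 \times 0.01 + 0.018 \times 0.99}
  = \frac{0.008}{0.02582}
  \approx 0.31
```

Under that assumed prior, roughly two alarms in three are false even with a seemingly low false-positive rate, which is why the cost of acting on a false alarm has to be small.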

An Architecture for Observe, Analyze, Act

[Figure: architecture diagram. Labels: client requests, responses, datacenter boundary, application component, application server, collection, short-term store, long-term store, online algo., offline algo., recovery synthesis, observations from other datacenters, observations to other datacenters, recovery actions to other datacenters]

• Separates systems concerns from algorithm development
  • Programmable network elements provide extension of approach to other layers
• Consistent with technology trends
  • Explicit //ism in CPU usage
  • Lots of disk storage with limited bandwidth

Conclusion

“...Ultimately, these aspects [of autonomic systems] will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance.” (The Vision of Autonomic Computing)

The real reason to reduce MTTR is to tolerate false positives: recovery → adaptation



Breakout sessions?

1. [James H] Reserve some resources to deal with problems (by filtering or pre-reservation)
2. [Joe H] How black is the black box? What “gray box” prior knowledge can you exploit (so you don’t ignore the obvious)?
3. [Joe H] Human role - can make statements about how system should act, so doesn’t have to be completely hands-off training. Similarly, during training, human can give feedback about what anomalies are actually relevant (labeling).
4. [Lakshmi] What kinds of apps is this intended to apply to? Where do ROC-like and OASIS-like apps differ?
5. [Mary Baker] People can learn to game the system -> randomness can be your friend. If behaviors have small number of modes, just have to look for behaviors in the “valleys”

Breakouts

1. 19 - “golden nuggets” to guide architecture, e.g., persistent identifiers for path-based analysis...what else?
2. 8 - act: what {safe,fast,predictable} behaviors of the system should we expose (other than, e.g., rebooting)? Esp. those that contribute to security as well as dependability?
3. 11 - architectures for different types of stateful systems - what kinds of persistent/semi-persistent state need to be factored out of apps, and how to store it; interfaces, etc.
4. 20 - Given your goal of “generic” techniques for distributed systems, how will you know when you’ve succeeded/how do you validate the techniques? (What are the “proof points” you can hand to others to convince them you’ve succeeded, including but not limited to metrics?) [Aaron/Dave] Metrics: How do you know you’re observing the right things? What benchmarks will be needed?

Open Mic

James Hamilton - The Security Economy

Conclusion

Toward “new science” in autonomic computing

“...Ultimately, these aspects [of autonomic systems] will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance.” (The Vision of Autonomic Computing)





Autonomic & Technology Trends

• CPU speed increases slowing down, need more explicit parallelism
  • Use extra CPU to collect and locally analyze data; exploit temporal locality
• Disk space is free (though bandwidth and disaster-recovery aren’t)
  • Can keep history of parallel as well as historical models for regression analysis, trending, etc.
• VM’s being used as unit of software distribution
  • Fault isolation
  • Opportunity for nonintrusive observation
  • Action that is independent of the hosted app

Data collection & monitoring

• Component frameworks allow for non-intrusive data collection without modifying the applications
  • Inter-EJB calls through runtime-managed level of indirection
  • Slightly coarser grain of analysis: restrictions on “legal” paths make it more likely we can spot anomalies
  • Aspect-oriented programming allows further monitoring without perturbing application logic
• Virtual machine monitors provide additional observation points
  • Already used by ASP’s, for load balancing, app migration, etc.
  • Transparent to applications and hosted OS’s
  • Likely to become the unit of software distribution (intra- and inter-cluster)

Optimizing for Specialized State Types

• Two single-key (“Berkeley DB”) get/set state stores
  • Used for user session state, application workflow state, persistent user profiles, merchandise catalogs, ...
• Replication to a set of N bricks provides durability
  • Write to subset, wait for subset, remember subset
  • DStore: state persists “forever” as long as N/2 bricks survive
  • SSM: If client loses cookie, state is lost; otherwise, persists for time t with probability p, where t, p = F(N, node MTBF)
• Recovery==restart, takes seconds or less
  • Efficacy doesn’t depend on whether replica is behaving correctly
  • SSM: node state not preserved (in-memory only)
  • DStore: node state preserved, read-repair fixes
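“Write to subset, wait for subset, remember subset” can be sketched roughly as follows. This is an illustration under assumptions, not the SSM implementation: the `Brick` class, the `fanout` and `wait_for` parameters, and the idea of returning the remembered subset as a cookie-like list are all hypothetical, and the writes are sequential stand-ins for requests that a real system would send in parallel.

```python
import random


class Brick:
    """Hypothetical in-memory storage node (SSM-style: a reboot loses its state)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        return True  # acknowledgement

    def get(self, key):
        return self.data.get(key)

    def reboot(self):
        self.data.clear()


def write(bricks, key, value, fanout=3, wait_for=2, rng=random):
    """Write to a random subset and return the names of the bricks to remember."""
    targets = rng.sample(bricks, fanout)
    acked = [b.name for b in targets if b.put(key, value)]
    if len(acked) < wait_for:
        raise RuntimeError("too few acknowledgements; caller should back off")
    return acked[:wait_for]  # the client keeps this list, e.g. in a cookie


def read(bricks_by_name, remembered, key):
    """Any remembered replica is as good as any other; take the first that answers."""
    for name in remembered:
        value = bricks_by_name[name].get(key)
        if value is not None:
            return value
    raise KeyError(key)


bricks = [Brick(f"brick{i}") for i in range(5)]
by_name = {b.name: b for b in bricks}
cookie = write(bricks, "session42", {"cart": ["book"]}, rng=random.Random(0))
by_name[cookie[0]].reboot()  # lose one remembered replica
print(read(by_name, cookie, "session42"))
```

Because no request depends on one specific brick, rebooting any single brick costs only the redundancy it held, not the session.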

Detection & recovery in SSM

• 9 “State” statistics collected once per second from each brick
  • Tarzan time series analysis: keep N-length time series, discretize each data point
  • count relative frequencies of all substrings of length k or shorter
  • compare against peer bricks; reboot if at least 6 stats “anomalous”; works for aperiodic or irregular-period signals

Remember! We are not SLT/ML researchers!!
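The detection steps above can be sketched as follows. This is a simplified stand-in rather than the Tarzan algorithm itself: equal-width binning, the L1 distance between substring-frequency profiles, the median over peers, and the `threshold` value are all assumptions chosen for illustration. Only the overall shape (discretize, count substrings of length k or shorter, compare against peers, reboot at 6 or more anomalous statistics) comes from the slide.

```python
from collections import Counter
from statistics import median


def discretize(series, lo, hi, bins=4):
    """Turn numbers into a string of symbols 'a', 'b', ... using equal-width bins."""
    width = (hi - lo) / bins or 1.0
    return "".join(chr(ord("a") + min(int((v - lo) / width), bins - 1)) for v in series)


def substring_freqs(symbols, k=3):
    """Relative frequencies of all substrings of length k or shorter."""
    counts = Counter(
        symbols[i:i + n] for n in range(1, k + 1) for i in range(len(symbols) - n + 1)
    )
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def distance(p, q):
    """L1 distance between two frequency profiles (an assumed metric)."""
    return sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in set(p) | set(q))


def bricks_to_reboot(stats, min_anomalous=6, threshold=0.5):
    """stats maps brick -> {stat name -> samples}; returns the bricks to reboot."""
    bricks = sorted(stats)
    anomalous = Counter()
    for stat in stats[bricks[0]]:
        values = [v for b in bricks for v in stats[b][stat]]
        lo, hi = min(values), max(values)  # shared bins so bricks are comparable
        profiles = {b: substring_freqs(discretize(stats[b][stat], lo, hi)) for b in bricks}
        for b in bricks:
            peers = [distance(profiles[b], profiles[p]) for p in bricks if p != b]
            if median(peers) > threshold:  # unlike most of its peers
                anomalous[b] += 1
    return [b for b in bricks if anomalous[b] >= min_anomalous]


normal = [10 + (i % 2) for i in range(20)]
ramp = [10 + 5 * i for i in range(20)]
stats = {f"brick{n}": {f"stat{s}": list(normal) for s in range(9)} for n in range(4)}
for s in range(7):
    stats["brick3"][f"stat{s}"] = list(ramp)  # 7 of 9 statistics drift on one brick
print(bricks_to_reboot(stats))
```

Comparing against the median peer relies on the assumption that most bricks are doing mostly the right thing most of the time; no labeled training data is needed.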

Detection & recovery in DStore

• Metrics and algorithm comparable to those used in SSM
• We inject “fail-stutter” behavior by increasing request latency
  • Bottom case: more aggressive detection also results in 2 “unnecessary” reboots
  • But they don’t matter much
• Currently some voodoo constants for thresholds in both SSM and DStore
  • Trade-off of fast detection vs. false positives

What faults does this handle?

• Substantially all non-Byzantine faults we injected:
  • Node crash, hang/timeout/freeze
  • Fail-stutter: Network loss (drop up to 70% of packets randomly)
  • Periodic slowdown (e.g., from garbage collection)
  • Persistent slowdown (one node lags the others)
• Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time”
• All anomalies can be safely “coerced” to crash faults
  • If that turned out to be the wrong thing, it didn’t cost you much to try it
  • Human notified after threshold number of restarts
• These systems are “always recovering”
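One way to read “human notified after threshold number of restarts” is a per-brick restart budget over a sliding time window. The sketch below assumes that reading; `RestartPolicy`, `max_restarts`, and `window_s` are hypothetical names and values, not taken from SSM or DStore.

```python
from collections import defaultdict, deque


class RestartPolicy:
    """Reboot on every anomaly, but escalate once a brick restarts too often."""

    def __init__(self, max_restarts=3, window_s=60.0):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.history = defaultdict(deque)  # brick -> timestamps of recent restarts

    def on_anomaly(self, brick, now):
        recent = self.history[brick]
        while recent and now - recent[0] > self.window_s:
            recent.popleft()  # forget restarts that fell outside the window
        if len(recent) >= self.max_restarts:
            return "notify-human"  # rebooting is evidently not curing this brick
        recent.append(now)
        return "reboot"


policy = RestartPolicy(max_restarts=3, window_s=60.0)
actions = [policy.on_anomaly("brick7", now=t) for t in (0, 10, 20, 30, 100)]
print(actions)
```

The budget bounds the cost of coercing every anomaly to a crash: a false positive costs one cheap reboot, while a fault that reboots cannot fix is escalated instead of looping.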

Path-based analysis + Microreboots

• Pinpoint captures execution paths through EJB’s as dynamic call trees (intra-method calls hidden)
  • Build probabilistic context-free grammar from these
  • Detect trees that correspond to very low probability parses
• Respond by micro-rebooting (uRB) suspected-faulty EJB’s
  • uRB takes 100’s of msecs, vs. whole-app restart (8-10 sec)
• Component interaction analysis currently finds 55-75% of failures
• Path shape analysis detects >90% of failures; but correctly localizes fewer

[Figure: detection rate vs. false positive rate. Across all expts: 80% detection rate with 1.8% FP rate. Across 92% of expts: 40% detection rate with 0.2% FP rate.]
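The path-shape idea can be sketched with a toy grammar in which each production is “component X called components (Y, Z, ...)”. This is not Pinpoint’s code: the tree encoding, the `floor` probability for unseen productions, the cutoff of 1e-3, and the component names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def rules(tree):
    """Yield one (caller, callees) production per node of a call tree."""
    name, children = tree
    yield name, tuple(child[0] for child in children)
    for child in children:
        yield from rules(child)


def learn(trees):
    """Estimate production probabilities from observed call trees."""
    counts = defaultdict(Counter)
    for tree in trees:
        for lhs, rhs in rules(tree):
            counts[lhs][rhs] += 1
    return {
        lhs: {rhs: n / sum(c.values()) for rhs, n in c.items()}
        for lhs, c in counts.items()
    }


def log_prob(grammar, tree, floor=1e-6):
    """Log-probability of a tree; unseen productions get a small floor value."""
    return sum(math.log(grammar.get(lhs, {}).get(rhs, floor)) for lhs, rhs in rules(tree))


def least_likely_component(grammar, tree, floor=1e-6):
    """Component whose production is least probable: a micro-reboot candidate."""
    return min(rules(tree), key=lambda r: grammar.get(r[0], {}).get(r[1], floor))[0]


ok = ("Servlet", [("Cart", [("DB", [])]), ("Catalog", [("DB", [])])])
odd = ("Servlet", [("Cart", []), ("Catalog", [("DB", [])])])  # Cart never reached DB
grammar = learn([ok] * 50)
suspects = [t for t in (ok, odd) if log_prob(grammar, t) < math.log(1e-3)]
print(suspects == [odd], least_likely_component(grammar, odd))
```

Detection (a low-probability tree) and localization (which production made it unlikely) are separate steps here, which mirrors the observation that path shape analysis detects more failures than it correctly localizes.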

Crash-Only Design Lessons from SSM

• Eliminate coupling
  • No dependence on any specific brick, just on a subset of minimum size -- even at the granularity of individual requests
  • Not even across phases of an operation: single-phase nonblocking ops only => predictable amount of work/request
  • Use randomness to avoid deterministic worst cases and hotspots
  • We initially violated this guideline by using an off-the-shelf JMS implementation that was centralized
• Make parts interchangeable
  • Any replica in a write-set is as good as any other
  • Unlike erasure coding, only need 1 replica to survive
  • Cost is higher storage overhead, but we’re willing to pay that to get the self-* properties

Enterprise Service Workloads

Observation | Consequence
Internet service workloads consist of large numbers of independent users | Large number of independent samples gives basis for success of statistical techniques
Even a flaky service is doing mostly the right thing most of the time | Steady-state behavior can be extracted from normal operation
Heavy traffic volume means most of the service is exercised in a relatively short time | Baseline model can be learned rapidly and updated in place periodically

3. We can continuously extract models from the production system orthogonally to the application

Building models through measurement

• Finding bugs using distributed assertion sampling [Liblit et al, 2003]
  • Instrument source code with assertions on pairs of variables (“features”)
  • Use sampling so that any given run of program exercises only a few assertions (to limit performance impact)
  • Use classification algorithm to identify which features are most predictive of faults (observed program crashes)
  • Goal: bug finding

JAGR: JBoss with Micro-reboots

• performability of RUBiS (goodput/sec vs. time)
  • vanilla JBoss w/manual restarting of app-server, vs. JAGR w/automatic recovery and micro-rebooting
• JAGR/RUBiS does 78% better than JBoss/RUBiS
• Maintains 20 req/sec, even in the face of faults
• Lower steady-state after recovery in first graph: class reloading, recompiling, etc., which is not necessary with micro-reboots
• Also used to fix memory leaks without rebooting whole appserver

Fast Recovery + Statistical Anomaly Detection = Self-*

Armando Fox and Emre Kiciman, Stanford University

Michael Jordan, Randy Katz, David Patterson, Ion Stoica, University of California, Berkeley

SoS Workshop, Bertinoro, Italy
