cse 598b: self-* systems path based failure and evolution management mike y. chen, anthony accardi,...

CSE 598B: Self-* Systems

Path-Based Failure and Evolution Management

Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, Eric Brewer (UC Berkeley, Stanford University, Tellme Networks, eBay Inc.)

Presented by: Arjun R. Nath

The Problem..

Computing systems increasing in complexity

Tending towards large, complex, distributed systems

Sometimes there are thousands of machines involved

Basic system management is becoming increasingly difficult.

Everything from detecting and diagnosing failures to understanding application behaviour is becoming very difficult.

..the Problem

Existing techniques such as code-level debuggers, program slicing, process profiling and application logs fail to characterize overall system behaviour.

Distributed debuggers are available but focus on a homogeneous subset of the system.

Goal of the paper

Techniques to help us understand large distributed systems.

Improve:
– availability
– reliability
– manageability

Why are we looking at this paper? (Self-* context)
– This paper is about techniques for monitoring large, complex, distributed systems.

Two main principles

Path-Based Measurement:
– Model the system as a collection of paths through heterogeneous components.

– Make local observations along the paths and store these. These can be accessed via queries and visualization techniques.

(Focus is on correctness rather than performance)

Statistical Behaviour Analysis:

– Large volumes of system requests are stored for statistical analysis using classical techniques to identify deviations from normal behaviour. This can be applied to live systems or used for offline analysis.

What is a "Path"?

Associated with a request:
– Control flow
– Resources

Paths may have inter-path dependencies: shared state, shared database tables, shared filesystems, shared memory.

Multiple paths may be grouped together in sessions.
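
As a minimal sketch of the path abstraction described above, a path can be held as a request ID plus an ordered list of local observations. All class and field names here (`Observation`, `Path`, `session_id`, `latency_ms`) are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Observation:
    """One local observation made by a component while serving a request."""
    component: str      # hypothetical component name, e.g. "web-frontend"
    timestamp: float    # seconds since the request started (assumed unit)
    latency_ms: float   # time spent inside this component
    success: bool = True


@dataclass
class Path:
    """All observations recorded for one request, in the order they were made."""
    request_id: str
    session_id: Optional[str] = None   # several paths may share one session
    observations: List[Observation] = field(default_factory=list)

    def components(self) -> List[str]:
        """Control flow: the sequence of components the request visited."""
        return [obs.component for obs in self.observations]

    def total_latency_ms(self) -> float:
        """Resource usage, approximated here by summed component latency."""
        return sum(obs.latency_ms for obs in self.observations)

    def failed(self) -> bool:
        """A path counts as failed if any component reported a failure."""
        return any(not obs.success for obs in self.observations)


# Example path through three made-up components.
path = Path("req-1", session_id="sess-9")
path.observations.append(Observation("web-frontend", 0.000, 12.0))
path.observations.append(Observation("app-server", 0.012, 30.5))
path.observations.append(Observation("database", 0.043, 7.5, success=False))
```

Whether a path is coarse or fine grained then depends only on how many observations each component chooses to record.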

Coarse grained paths

Fine grained paths

How do paths help ?

Failure Management

Evolution (of the system)

Failure Management...

Detection:
– Reduce downtime associated with detection delays

– Using paths can help in noticing developing problems before they become severe

The key is to define "normal" behaviour statistically and then check for deviations.
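
A minimal sketch of "define normal statistically, then check for deviations", assuming per-path latency is the observed metric and using a plain z-score test. The z-score, the threshold of 3, and the function names are assumptions chosen for illustration; the paper's own statistical tests may differ.

```python
import statistics
from typing import List, Tuple


def build_baseline(latencies_ms: List[float]) -> Tuple[float, float]:
    """Summarise 'normal' behaviour as the mean and sample standard deviation."""
    return statistics.mean(latencies_ms), statistics.stdev(latencies_ms)


def is_anomalous(latency_ms: float,
                 baseline: Tuple[float, float],
                 threshold: float = 3.0) -> bool:
    """Flag a path whose latency is more than `threshold` deviations from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        # Degenerate baseline: any difference from the mean is a deviation.
        return latency_ms != mean
    return abs(latency_ms - mean) / stdev > threshold


# Made-up latencies from paths recorded during normal operation.
normal = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0]
baseline = build_baseline(normal)

slightly_slow = is_anomalous(104.0, baseline)   # within normal variation
very_slow = is_anomalous(150.0, baseline)       # far outside the baseline
```

The same check can run against live paths as they arrive or against stored paths offline.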

Diagnosis:
– Isolate problems using solely the recorded path observations and then drive the diagnosis process with the path information.

– Paths help identify which components are involved in a given failure and aid in identifying causes.
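
One simple way to use recorded paths for diagnosis is to rank components by how often the paths that touch them fail. The sketch below assumes each stored path is reduced to a list of component names plus a failed flag; it is an illustrative ranking, not the algorithm used by Pinpoint or the other tools, and the component names are made up.

```python
from typing import Dict, List, Tuple


def rank_suspects(paths: List[Tuple[List[str], bool]]) -> List[Tuple[str, float]]:
    """Return (component, failure rate) pairs, most suspicious first.

    Each input item is (components visited, whether the path failed).
    """
    seen: Dict[str, int] = {}
    failed: Dict[str, int] = {}
    for components, did_fail in paths:
        for name in set(components):          # count each component once per path
            seen[name] = seen.get(name, 0) + 1
            if did_fail:
                failed[name] = failed.get(name, 0) + 1
    scores = [(name, failed.get(name, 0) / seen[name]) for name in seen]
    # Highest failure rate first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


recorded = [
    (["web", "app", "db-a"], False),
    (["web", "app", "db-b"], True),
    (["web", "app", "db-b"], True),
    (["web", "cache"], False),
]
ranking = rank_suspects(recorded)
```

Here `db-b` ranks first because every path through it failed, while `web` and `app` also appear in successful paths.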

...Failure Management

Impact Analysis:
– Helps in knowing the scale of the problem, which in turn helps estimate time-to-repair

– Which other paths are at risk.
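
A minimal sketch of impact analysis over stored paths: given a component known to be faulty, list the requests that passed through it and the fraction of all traffic they represent. The data layout (request ID mapped to component names) and the names used are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def impact(paths: Dict[str, List[str]], faulty: str) -> Tuple[List[str], float]:
    """Return the affected request IDs and the fraction of all paths affected."""
    affected = sorted(rid for rid, components in paths.items() if faulty in components)
    fraction = len(affected) / len(paths) if paths else 0.0
    return affected, fraction


stored = {
    "req-1": ["web", "app", "db-a"],
    "req-2": ["web", "app", "db-b"],
    "req-3": ["web", "cache"],
    "req-4": ["web", "app", "db-b"],
}
affected, fraction = impact(stored, "db-b")
```

The affected fraction gives a rough sense of how large the problem is, and the same query over recent paths shows which request types are at risk.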

Evolution (of the system)

It's very difficult to get an overall picture of how a complex distributed system changes with time:
- Software/hardware upgrades, patches, code changes etc.

- Systems evolve through changes to their components and also through changes in how they interact

Paths help in revealing system structure and dependencies and tracking changes.
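
As a sketch of how paths can track structural change, the component-to-component calls seen in paths before and after an upgrade can be compared as sets of edges. The representation (a path as a list of component names) and the function names are assumptions, not the paper's method.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def dependency_edges(paths: Iterable[List[str]]) -> Set[Edge]:
    """Collect every (caller, callee) pair of consecutive components."""
    edges: Set[Edge] = set()
    for components in paths:
        edges.update(zip(components, components[1:]))
    return edges


def structure_diff(before: Iterable[List[str]],
                   after: Iterable[List[str]]) -> Dict[str, List[Edge]]:
    """Report dependencies that appeared or disappeared between two path sets."""
    old, new = dependency_edges(before), dependency_edges(after)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


paths_v1 = [["web", "app", "db"]]
paths_v2 = [["web", "app", "cache"], ["web", "app", "db"]]
diff = structure_diff(paths_v1, paths_v2)
```

In this made-up example the diff shows that the second version introduced a new dependency from `app` to `cache`.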

Implementation

Implementation: Architecture

…Implementation...

Tracers - tracking a request through the target system.
– Each request has an associated identifier that is maintained throughout the path

– Ids may be stored in extensible headers (HTTP, SOAP)

– Tracers are platform specific but can be generic to applications using the same platform (J2EE, .NET)

Pinpoint, ObsLogs, SuperCal all have tracers.
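
A minimal sketch of the tracer idea: assign an ID when a request first enters the system, carry it in a header, and tag every observation with it. The header name `X-Request-Id`, the in-memory `sink` list, and the function names are assumptions for illustration; they are not the mechanisms used by Pinpoint, ObsLogs, or SuperCal.

```python
import uuid
from typing import Dict, List, Tuple

TRACE_HEADER = "X-Request-Id"  # assumed header name, not taken from the paper


def ensure_request_id(headers: Dict[str, str]) -> str:
    """Reuse the incoming request ID, or create one at the system's entry point."""
    if TRACE_HEADER not in headers:
        headers[TRACE_HEADER] = uuid.uuid4().hex
    return headers[TRACE_HEADER]


def trace(component: str, headers: Dict[str, str],
          sink: List[Tuple[str, str]]) -> None:
    """Record that `component` handled the request identified in `headers`."""
    sink.append((ensure_request_id(headers), component))


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Headers for a downstream call, carrying the same request ID forward."""
    return {TRACE_HEADER: ensure_request_id(incoming)}


# One request passing through two hypothetical components.
sink: List[Tuple[str, str]] = []
frontend_headers: Dict[str, str] = {}
trace("web", frontend_headers, sink)
backend_headers = outgoing_headers(frontend_headers)
trace("app", backend_headers, sink)
```

Because the ID travels in the header, application code does not need to know about it, which is why a tracer can be shared by all applications on the same platform.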

…Implementation: tools..

Three systems that support path-based analysis

...Implementation

Aggregator and Repository
– Aggregator receives observations from tracers
– Reconstructs paths using IDs
– Stores these in the Repository
– There may also be a Central Repository that collects from distributed repositories.
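
The reconstruction step can be sketched as grouping observations by request ID and ordering each group by time. The observation layout (request ID, timestamp, component) and the dictionary standing in for the repository are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

RawObservation = Tuple[str, float, str]  # (request_id, timestamp, component)


def reconstruct_paths(observations: Iterable[RawObservation]) -> Dict[str, List[str]]:
    """Group observations by request ID and order each group by timestamp."""
    grouped: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for request_id, timestamp, component in observations:
        grouped[request_id].append((timestamp, component))
    return {
        request_id: [component for _, component in sorted(events)]
        for request_id, events in grouped.items()
    }


# Observations from different tracers may arrive interleaved and out of order.
incoming = [
    ("req-2", 0.00, "web"),
    ("req-1", 0.03, "db"),
    ("req-1", 0.00, "web"),
    ("req-2", 0.02, "cache"),
    ("req-1", 0.01, "app"),
]
repository = reconstruct_paths(incoming)
```

This sketch relies on timestamps being comparable across components; a real aggregator would need to handle clock skew between machines.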

Analysis Engines and Visualization
– Single and multi-path analysis
– Dedicated engines for various statistical tests
– Support for some data mining tools
– Visualization: Tukey’s boxplots generated using Octave

…Implementation

A trend specific to recognition time in Tellme application A suggests a regression in a speech grammar in that application. The Tukey boxplots shown illustrate a distribution’s center, spread, and asymmetries by using rectangles to show the upper and lower quartiles and the median, and explicitly plotting each outlier.
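
The quantities a Tukey boxplot displays can be computed directly. The sketch below uses Python's standard library rather than Octave, with the conventional 1.5 × IQR fences for outliers; the sample values are made up and are not Tellme data.

```python
import statistics
from typing import Dict, List, Union


def boxplot_summary(values: List[float]) -> Dict[str, Union[float, List[float]]]:
    """Compute the quartiles, median, and outliers that a Tukey boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical recognition times with one unusually slow request.
recognition_times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0]
summary = boxplot_summary(recognition_times)
```

Comparing such summaries across releases is one way a shift in the median or a growing set of outliers could reveal a regression like the one described above.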

Limitations and constraints

Cannot resolve fault causes at a very detailed level

Overheads can be high for fine grained paths

Need to decide which observations to include in paths. This is an iterative process.

Can be difficult to implement especially for existing systems

It's important to understand that path-based analysis is an aid to fault detection and recovery and not a solution in itself. It is meant to be used in combination with traditional fault handling techniques.

Conclusion

As systems get more complex, Path-based analysis tools will have increasing importance.

Path based fault analysis complements traditional techniques

Hardly any fully functional, path-based fault management tools are available.

This paper:
– Has breadth but lacks depth in some places.
– Needs some more data around production environment experiments

– Should have concentrated on 1 or 2 implementations and included more details.

– Not much info on SuperCal and ObsLogs

Other related stuff

“Pinpoint” project at Stanford http://swig.stanford.edu/pinpoint.shtml (Some interesting papers here)

Magpie project (Microsoft)

Quest Software: JProbe – Java performance profiler

Borland's OptimizeIt Enterprise Suite

That’s all folks,

Thank You
