CSE 598B: Self-* Systems Path Based Failure and Evolution Management Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, Eric Brewer (UC Berkeley, Stanford U, Tellme Networks, eBay Inc.) Presented by: Arjun R. Nath

Post on 17-Dec-2015

CSE 598B: Self-* Systems

Path Based Failure and Evolution Management

Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, Eric Brewer (UC Berkeley, Stanford U, Tellme Networks, eBay Inc.)

Presented by: Arjun R. Nath

The Problem...

Computing systems are increasing in complexity

Tending towards large, complex, distributed systems

Sometimes there are thousands of machines involved

Basic system management is becoming increasingly difficult.

Everything from detecting and diagnosing failures to understanding application behaviour is becoming very difficult.

...The Problem

Existing techniques such as code-level debuggers, program slicing, process profiling and application logs fail to characterize overall system behaviour.

Distributed debuggers are available but focus on a homogeneous subset of the system.

Goal of the paper

Techniques to help us understand large distributed systems.

Improve:

– availability

– reliability

– manageability

Why are we looking at this paper? (Self-* context)

– This paper is about techniques for monitoring large, complex, distributed systems.

Two main principles

Path-Based Measurement:

– Model the system as a collection of paths through heterogeneous components.

– Make local observations along the paths and store these. These can be accessed via queries and visualization techniques.

(Focus is on correctness rather than performance)

Statistical Behaviour Analysis:

– Large volumes of system requests are stored for statistical analysis using classical techniques to identify deviations from normal behaviour. This can be applied to live systems or used for offline analysis.

What is a "Path"?

Associated with a request:

– Control flow

– Resources

Paths may have inter-path dependencies: shared state, shared database tables, shared filesystems, shared memory.

Multiple paths may be grouped together in sessions.
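As a rough illustration of the path definition above, a path can be held as an ordered list of local observations plus the set of shared resources it touched. This is a minimal sketch, not the paper's data model: the class names, the field names and the `dependent_paths` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One local measurement recorded at a component (hypothetical fields)."""
    request_id: str
    component: str
    timestamp_ms: int
    latency_ms: float
    error: bool = False


@dataclass
class Path:
    """Observations for one request, plus the shared resources it touched."""
    request_id: str
    observations: list = field(default_factory=list)
    shared_resources: set = field(default_factory=set)

    def components(self):
        # Control flow: the sequence of components in timestamp order.
        ordered = sorted(self.observations, key=lambda o: o.timestamp_ms)
        return [o.component for o in ordered]

    def failed(self):
        # A path counts as failed if any observation on it reported an error.
        return any(o.error for o in self.observations)


def dependent_paths(paths):
    """Return pairs of request ids whose paths share at least one resource."""
    pairs = []
    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            if a.shared_resources & b.shared_resources:
                pairs.append((a.request_id, b.request_id))
    return pairs
```

Two paths that both touch the same database table would show up as one pair in `dependent_paths`, which is one simple way to make inter-path dependencies visible.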

Coarse-grained paths

Fine-grained paths

How do paths help?

Failure Management

Evolution (of the system)

Failure Management...

Detection:

– Reduce downtime associated with detection delays

– Using paths can help in noticing developing problems before they become severe

The key is to define "normal" behaviour statistically and then check for deviations.
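One simple way to picture "define normal statistically, then check for deviations" is a z-score test on path latency. This is only a sketch under assumptions: the slides do not say which statistical test is used, and the function names and the threshold of 3 standard deviations are chosen here for illustration.

```python
from statistics import mean, stdev


def fit_baseline(latencies_ms):
    """Summarise 'normal' behaviour from historical path latencies."""
    return mean(latencies_ms), stdev(latencies_ms)


def is_anomalous(latency_ms, baseline, threshold=3.0):
    """Flag a path whose latency is more than `threshold` std devs from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # No spread in the history: any different value is a deviation.
        return latency_ms != mu
    return abs(latency_ms - mu) / sigma > threshold
```

In use, the baseline would be fitted on paths recorded during a period believed to be healthy, and each new path checked against it as it completes.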

Diagnosis:

– Isolate problems using solely the recorded path observations, then drive the diagnosis process with the path information.

– Paths help identify which components are involved in a given failure and aid in identifying causes.
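To make the diagnosis idea concrete, one naive approach is to rank components by the share of paths through them that failed. This is a sketch under assumptions, not the analysis the paper's tools perform; the function name and the input shape are invented for this example.

```python
from collections import Counter


def rank_suspects(paths):
    """paths: iterable of (components, failed) pairs.

    Returns (component, failure share) tuples, most suspicious first.
    """
    seen = Counter()
    failed = Counter()
    for components, path_failed in paths:
        for c in set(components):
            seen[c] += 1
            if path_failed:
                failed[c] += 1
    scores = {c: failed[c] / seen[c] for c in seen}
    # Sort by descending score, breaking ties by component name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A component that appears on every failed path but on few successful ones rises to the top, while a front end that every request passes through is scored lower.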

...Failure Management

Impact Analysis:

– Helps gauge the scale of the problem, which in turn helps estimate time-to-repair

– Identifies which other paths are at risk.
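Given stored paths, a first-cut impact estimate is the fraction of recent paths that pass through the faulty component, together with the request types involved. The sketch below assumes each stored path carries a request type label; that label and the function name are hypothetical.

```python
def impact_of(component, paths):
    """paths: list of (request_type, components).

    Returns (fraction of paths affected, sorted request types at risk).
    """
    affected = [(rtype, comps) for rtype, comps in paths if component in comps]
    fraction = len(affected) / len(paths) if paths else 0.0
    at_risk = sorted({rtype for rtype, _ in affected})
    return fraction, at_risk
```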

Evolution (of the system)

It's very difficult to get an overall picture of how a complex distributed system changes with time:

– Software/hardware upgrades, patches, code changes, etc.

– Systems evolve through changes to their components and also through changes in how they interact

Paths help in revealing system structure and dependencies and tracking changes.
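One way to picture how paths track structural change is to derive component-to-component edges from the paths recorded in two periods and compare them. This is a sketch under a simplifying assumption: adjacent components in a path are treated as a dependency edge, which the slides do not state. The function names are invented for this example.

```python
def call_edges(paths):
    """Collect (caller, callee) edges from component sequences."""
    edges = set()
    for components in paths:
        edges.update(zip(components, components[1:]))
    return edges


def diff_structure(old_paths, new_paths):
    """Return (edges added, edges removed) between two observation periods."""
    old, new = call_edges(old_paths), call_edges(new_paths)
    return sorted(new - old), sorted(old - new)
```

Running this on paths gathered before and after an upgrade would show, for example, that a cache now sits between the cart service and the database.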

Implementation

Implementation: Architecture

…Implementation...

Tracers: track a request through the target system.

– Each request has an associated identifier that is maintained throughout the path

– IDs may be stored in extensible headers (HTTP, SOAP)

– Tracers are platform specific but can be generic to applications using the same platform (J2EE, .NET)

Pinpoint, ObsLogs, SuperCal all have tracers.
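The tracer idea of carrying one identifier along the whole path can be sketched with plain dictionaries standing in for HTTP headers. Everything here is an assumption for illustration: the header name `X-Request-Id`, the function names and the log record shape are not taken from Pinpoint, ObsLogs or SuperCal.

```python
import uuid

TRACE_HEADER = "X-Request-Id"  # hypothetical header name, not from the paper


def ensure_request_id(headers):
    """Return a copy of `headers` carrying a request id, reusing any existing one."""
    out = dict(headers)
    if TRACE_HEADER not in out:
        out[TRACE_HEADER] = uuid.uuid4().hex
    return out


def record(log, headers, component, event):
    """Append one observation tagged with the request id to a local log."""
    log.append({
        "request_id": headers[TRACE_HEADER],
        "component": component,
        "event": event,
    })


def call_downstream(headers, extra=None):
    """Build headers for an outgoing call, copying the id so the path stays linked."""
    outgoing = dict(extra or {})
    outgoing[TRACE_HEADER] = headers[TRACE_HEADER]
    return outgoing
```

The entry point calls `ensure_request_id` once; every later hop only copies the id forward, so observations logged on different machines can be joined on it afterwards.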

...Implementation: Tools...

Three systems that support path-based analysis

...Implementation

Aggregator and Repository

– Aggregator receives observations from tracers

– Reconstructs paths using IDs

– Stores these in the Repository

– There may also be a Central Repository that collects from distributed repositories.

Analysis Engines and Visualization

– Single and multi-path analysis

– Dedicated engines for various statistical tests

– Support for some data mining tools

– Visualization: Tukey’s boxplots generated using Octave
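The aggregator's reconstruction step, grouping observations by request ID and ordering them, can be sketched as follows. This is a simplified illustration: the dictionary keys are hypothetical, and it assumes timestamps from different machines are comparable, which a real deployment would have to ensure.

```python
from collections import defaultdict


def reconstruct_paths(observations):
    """Group observations by request id and order each group by timestamp.

    observations: iterable of dicts with 'request_id', 'timestamp'
    and 'component' keys (hypothetical record shape).
    """
    by_request = defaultdict(list)
    for obs in observations:
        by_request[obs["request_id"]].append(obs)
    return {
        rid: [o["component"] for o in sorted(group, key=lambda o: o["timestamp"])]
        for rid, group in by_request.items()
    }
```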

…Implementation

A trend specific to recognition time in Tellme application A suggests a regression in a speech grammar in that application. The Tukey boxplots shown illustrate a distribution’s center, spread, and asymmetries by using rectangles to show the upper and lower quartiles and the median, and explicitly plotting each outlier.
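The numbers behind a Tukey boxplot are easy to compute directly. The slides mention Octave for plotting; the sketch below uses Python's standard library instead and assumes the conventional fence of 1.5 times the interquartile range for outliers, a multiplier the slides do not specify. The function name is invented for this example.

```python
from statistics import quantiles


def tukey_summary(values, k=1.5):
    """Return quartiles, median and outliers beyond k * IQR from the box."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = sorted(v for v in values if v < low or v > high)
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}
```

Computing this summary per application and per day would expose the kind of drift in recognition time described above as a shift in the median or a growing set of outliers.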

Limitations and constraints

Cannot resolve fault causes at a very detailed level

Overheads can be high for fine-grained paths

Need to decide which observations to include in paths. This is an iterative process.

Can be difficult to implement especially for existing systems

It's important to understand that path-based analysis is an aid to fault detection and recovery, not a solution in itself. It is meant to be used in combination with traditional fault handling techniques.

Conclusion

As systems get more complex, Path-based analysis tools will have increasing importance.

Path based fault analysis complements traditional techniques

Hardly any fully functional, path-based fault management tools are available.

This paper:

– Has breadth but lacks depth in some places.

– Needs more data from production environment experiments

– Should have concentrated on 1 or 2 implementations and included more details.

– Not much info on SuperCal and ObsLogs

Other related stuff

“Pinpoint” project at Stanford http://swig.stanford.edu/pinpoint.shtml (Some interesting papers here)

Magpie project (Microsoft)

Quest Software: JProbe – Java performance profiler

Borland's OptimizeIt Enterprise Suite

That’s all folks,

Thank You