Pinpoint: Problem Determination in Large, Dynamic Internet Services Mike Chen, Emre Kıcıman, Eugene Fratkin [email protected] {emrek, fratkin}@cs.stanford.edu ROC Retreat, 2002/01

Pinpoint: Problem Determination in Large, Dynamic Internet Services

Mike Chen, Emre Kıcıman, Eugene Fratkin

[email protected]

{emrek, fratkin}@cs.stanford.edu

ROC Retreat, 2002/01

Motivation

Systems are large and getting larger
  – 1000’s of replicated HW/SW components
  – used in different combinations
Systems are dynamic
  – resources are allocated at runtime
  – e.g. load balancing, personalization
Difficult to diagnose failures
  – how to tell what’s different about failed requests?

Current Techniques

Dependency Models
  – Detect failures, check all components that failed requests depend on
  – Problem:
    • Need to check all dependencies
    • Hard to generate and keep up-to-date
Monitoring & Alarm Correlation
  – Detect non-functioning components; this often generates alarm storms
    • filter alarms for root-cause analysis
  – Problem:
    • need to instrument every component
    • hard to detect interaction faults

The Pinpoint Approach

Trace many real client requests
  – Record every component used in a request
  – Detect success/failure of requests
  – Can be used as dynamic dependency graphs
Statistical Correlation
  – Search for components that “cause” failures
Built into middleware
  – Requires no application code changes
  – Application knowledge only for end-to-end failure detection
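
A minimal sketch of the correlation step, assuming each trace is a set of component names keyed by request id. The slides do not say which statistic is used, so the Jaccard similarity between "requests that used component X" and "requests that failed" is an assumption chosen for illustration; `rank_components` and the sample data are hypothetical.

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_components(traces: dict, failed: set) -> list:
    """Rank components by how closely their usage matches the failed requests.

    traces maps request id -> set of components used by that request.
    failed is the set of request ids whose end-to-end check failed.
    """
    used_by = defaultdict(set)  # component -> ids of requests that used it
    for request_id, components in traces.items():
        for component in components:
            used_by[component].add(request_id)
    scores = {c: jaccard(reqs, failed) for c, reqs in used_by.items()}
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative data only: request 1 succeeded, requests 2-4 failed.
traces = {1: {"A", "C"}, 2: {"A", "B"}, 3: {"B", "C"}, 4: {"A", "B", "C"}}
failed = {2, 3, 4}
ranking = rank_components(traces, failed)
```

Here component B is used by exactly the failed requests, so it ranks first with a score of 1.0, while A and C each score 0.5.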

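Framework

[Diagram: requests #1, #2, #3 flow through components A, B, C, which sit on a communications layer (tracing & internal F/D); an external F/D also observes requests. The resulting logs feed statistical analysis, which outputs detected faults.]

Logs
  – component trace: 1,A  1,C  2,B  ..
  – request outcome: 1, success  2, fail  3, ...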
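
The two log formats shown in the diagram (`1,A 1,C 2,B ..` and `1, success 2, fail 3, ...`) can be joined on the request id before any analysis. The sketch below assumes one comma-separated record per line; the function names are hypothetical and not part of the prototype.

```python
def parse_trace_log(lines):
    """Parse 'request_id,component' lines into {request_id: set(components)}."""
    traces = {}
    for line in lines:
        request_id, component = (field.strip() for field in line.split(","))
        traces.setdefault(int(request_id), set()).add(component)
    return traces


def parse_outcome_log(lines):
    """Parse 'request_id, success|fail' lines into {request_id: failed?}."""
    outcomes = {}
    for line in lines:
        request_id, status = (field.strip() for field in line.split(","))
        outcomes[int(request_id)] = (status == "fail")
    return outcomes


def join_logs(traces, outcomes):
    """Pair each traced request with its outcome; skip requests missing one."""
    return {rid: (traces[rid], outcomes[rid]) for rid in traces if rid in outcomes}


trace_lines = ["1,A", "1,C", "2,B"]
outcome_lines = ["1, success", "2, fail"]
records = join_logs(parse_trace_log(trace_lines), parse_outcome_log(outcome_lines))
```

Each entry of `records` holds the set of components a request used and whether that request failed.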

Prototype Implementation

Built on top of J2EE platform
  – Sun J2EE 1.2 single-node reference impl.
  – Added logging of Beans, JSP, & JSP tags
  – Detect exceptions thrown out of components
  – Required no application code changes
Layer 7 network sniffer
  – TCP timeouts, malformed HTML, app-level string searches
PolyAnalyst statistical analysis
  – Bucket analysis & dependency discovery
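
As a rough illustration of the kinds of checks the sniffer bullet lists (timeouts, malformed HTML, application-level string searches), the sketch below inspects a single response. It is not the prototype's sniffer: the function name, the 30-second timeout, the HTML check, and the error markers are all assumptions.

```python
def external_check(body, elapsed_s, timeout_s=30.0,
                   error_markers=("Exception", "Error 500")):
    """Return a failure reason string, or None if the response looks fine."""
    # Timeout: the response took longer than the (assumed) limit.
    if elapsed_s > timeout_s:
        return "timeout"
    lowered = body.lower()
    # Crude malformed-HTML check: an opening <html> tag that is never closed.
    if "<html" in lowered and "</html>" not in lowered:
        return "malformed HTML"
    # Application-level string search for (hypothetical) error markers.
    for marker in error_markers:
        if marker.lower() in lowered:
            return "error string: " + marker
    return None


ok = external_check("<html><body>ok</body></html>", 0.2)
truncated = external_check("<html><body>", 0.2)
slow = external_check("", 45.0)
app_error = external_check("<html>java.lang.NullPointerException</html>", 0.1)
```

A plain string search like this can misfire on pages that legitimately contain a marker word, which is one reason the markers would need application knowledge.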

Experimental Setup

Demo app: J2EE Pet Store
  – e-commerce site w/~30 components
Load generator
  – replay trace of browsing
  – Approx. TPC-W WIPSo load (~50% ordering)
Fault injection parameters
  – Trigger faults based on combinations of used components
  – Inject exceptions, infinite loops, null calls
55 tests with single-component faults and interaction faults
  – 5-min runs of a single client (J2EE server limitation)
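
One way to picture a fault that is triggered by a combination of used components is a rule that fires only once a request has touched every component in a trigger set. This is a sketch under that assumption; `FaultRule`, `call_component`, and `InjectedFault` are hypothetical names, and only the exception and null-call fault types are modelled (an injected infinite loop is left out so the example terminates).

```python
class InjectedFault(Exception):
    """Raised in place of normal behaviour when a fault rule fires."""


class FaultRule:
    def __init__(self, trigger_components, kind="exception"):
        self.trigger = frozenset(trigger_components)
        self.kind = kind  # "exception" or "null"

    def fires(self, components_used):
        """True once the request has used every component in the trigger set."""
        return self.trigger <= set(components_used)


def call_component(name, components_used, rule):
    """Record the component use, then inject the fault if the rule fires."""
    components_used.add(name)
    if rule.fires(components_used):
        if rule.kind == "exception":
            raise InjectedFault("injected at " + name)
        if rule.kind == "null":
            return None  # stand-in for a call that returns null
    return "ok"


rule = FaultRule({"A", "B"})  # interaction fault: needs both A and B
used = set()
first = call_component("A", used, rule)  # only A used so far, no fault
try:
    call_component("B", used, rule)  # A and B both used: fault fires
    injected = False
except InjectedFault:
    injected = True
```

A single-component fault is the special case where the trigger set has one element.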

Application Observations

[Figure: cumulative % of total requests (0% to 100%) vs. # of components (1 to 21)]

[Figure: # of tightly coupled components (0 to 12) for cart, productItemList, InventoryEJB, SignOnEJB, ClientControllerEJB, ShoppingCartEJB]

# of components used in a dynamic request: median 14, min 6, max 23

large number of tightly coupled components that are always used together
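
Components that are always used together can be found directly from the traces: two components belong to the same group when they appear in exactly the same set of requests. The sketch below assumes traces are sets of component names per request; the function name and the sample requests are made up for illustration and are not the measured Pet Store data.

```python
from collections import defaultdict


def tightly_coupled_groups(traces):
    """Group components that appear in exactly the same set of requests."""
    used_by = defaultdict(set)  # component -> ids of requests that used it
    for request_id, components in traces.items():
        for component in components:
            used_by[component].add(request_id)
    groups = defaultdict(list)  # identical request sets -> member components
    for component, request_ids in used_by.items():
        groups[frozenset(request_ids)].append(component)
    return sorted(sorted(members) for members in groups.values())


# Illustrative traces only.
traces = {
    1: {"cart", "ShoppingCartEJB", "SignOnEJB"},
    2: {"cart", "ShoppingCartEJB"},
    3: {"SignOnEJB", "InventoryEJB"},
}
groups = tightly_coupled_groups(traces)
```

In this toy data `cart` and `ShoppingCartEJB` form one group. Any analysis that only looks at which requests used which component cannot tell the members of such a group apart.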

Metrics

Precision: C/P
Recall: C/A
Accuracy: whether all actual faults are correctly identified (recall == 100%)
  – boolean measure

[Venn diagram: Predicted Faults (P), Actual Faults (A), and their overlap, Correctly Identified Faults (C)]
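
The three metrics can be computed from the predicted and actual fault sets of one test. This is a small sketch; the function name is hypothetical, and returning 0.0 when a set is empty is an assumption the slide does not address.

```python
def score(predicted, actual):
    """Return (precision, recall, accurate) for one test run."""
    predicted, actual = set(predicted), set(actual)
    correct = predicted & actual  # C: correctly identified faults
    precision = len(correct) / len(predicted) if predicted else 0.0  # C/P
    recall = len(correct) / len(actual) if actual else 0.0           # C/A
    accurate = bool(actual) and correct == actual  # recall == 100%
    return precision, recall, accurate


# Two actual faults, three components predicted: all faults found, one extra.
precision, recall, accurate = score(predicted={"A", "B", "C"}, actual={"A", "B"})
```

In this example recall is 100%, so the run counts as accurate, while precision is 2/3 because one predicted component was not faulty.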

4 Analysis Techniques

Pinpoint: clusters of components that statistically correlate with failures
Detection: components where Java exceptions were detected
  – union across all failed requests
  – similar to what an event monitoring system outputs
Intersection: intersection of components used in failed requests
Union: union of all components used in failed requests
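
The three baseline techniques reduce to plain set operations over the failed requests. The sketch below assumes traces and detected exceptions are stored as sets of component names per request id; the function names and sample data are hypothetical. Pinpoint's own clustering is not shown, since the slides do not give its details.

```python
def union_of_failed(traces, failed):
    """Union: every component used by at least one failed request."""
    result = set()
    for rid in failed:
        result |= traces[rid]
    return result


def intersection_of_failed(traces, failed):
    """Intersection: components used by every failed request."""
    sets = [traces[rid] for rid in failed]
    return set.intersection(*sets) if sets else set()


def detection(exceptions, failed):
    """Detection: components where an exception was seen in a failed request.

    exceptions maps request id -> set of components that threw.
    """
    result = set()
    for rid in failed:
        result |= exceptions.get(rid, set())
    return result


# Illustrative data only.
traces = {1: {"A", "C"}, 2: {"A", "B"}, 3: {"B", "C"}, 4: {"A", "B", "C"}}
failed = {2, 3, 4}
exceptions = {2: {"B"}, 4: {"B"}}

union_result = union_of_failed(traces, failed)
intersection_result = intersection_of_failed(traces, failed)
detection_result = detection(exceptions, failed)
```

Union never misses a component involved in a failure but flags everything the failed requests touched, which costs precision; intersection is narrower but becomes empty as soon as failed requests share no component.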

Results

Pinpoint has high accuracy with relatively high precision

[Bar chart: average accuracy and average precision (0% to 100%) for Pinpoint, Detection, Intersection, Union; data labels in extraction order: 76%, 50%, 83%, 100%, 33%, 77%, 15%, 11%]

[Line chart: average accuracy (0% to 100%) vs. # of interacting components (1 to 4) for Pinpoint, Detection, Intersection, Union]

Pinpoint Prototype Limitations

Assumptions
  – client requests provide good coverage over components and combinations
  – requests are autonomous (don’t corrupt state and cause later requests to fail)
Currently can’t detect the following:
  – faults that only degrade performance
  – faults due to pathological inputs

Conclusions

Dynamic tracing and statistical analysis give improvements in accuracy & precision
  – Handles dynamic configurations well
  – Without requiring application code changes
  – Reduces human work in large systems
  – But, need good coverage of combinations and autonomous requests
Future Work
  – Test with real distributed systems with real applications
    • Oceanstore? WebLogic/WebSphere?
  – Capture additional differentiating factors
  – Visualization of dynamic dependency
  – Performance analysis
  – Online data analysis