
Systems Support for End-to-End Performance Management

Sandip Agarwala

PhD Advisor: Karsten Schwan

College of Computing

Georgia Tech

Source: Gartner (December 2005)

Complexity, complexity, complexity…

Reasons for Complexity

• Application diversity

• Interdependencies

• Heterogeneous components

– Too many different technologies and platforms

• Too few “hints” from the system to the administrators

– Legacy issues; application-specific solutions

• Insufficient information about the system to drive self-management

Lack of Automation

Online System Management

[Figure: online management cycle with Monitor, Analyze, Control, and Execute stages, applied to the Workload]

• Scheduling

• Capacity and SLA management

• Design evaluation and tuning

• Bottleneck detection

• Resource provisioning, accounting, etc.

Proposed Approach: Service Path

[Figure: a service path running from the Internet through a Proxy Server, Front-end Web Servers, a Middle-tier Servlet Server, Application Logic (EJBs, etc.), and a Database Back-end]

• System abstractions that describe the dynamic dependencies between the different distributed application components

• Service Class: Application-level request class, e.g. SLA class

Service Path Characteristics

• End-to-End analysis

• Online

• Non-intrusive

• Application-generic

Outline

• Background

• Motivation

• Service path

– Discovery with E2EProf

– Refinement with SysProf

– Automated SLA Enforcement

• Related Work

• Future Plans

E2EProf

[Figure: time-series signals D1 and D2 for edges (AB) and (BC), each plotted against time]

• Black-box approach

• Correlate per-edge time series signals

• Monitor network packet traces (source, destination, timestamps)

Model traces as per-edge time series signals or density functions

[Figure: example topology with nodes A, B, C, and D]

Basic Approach

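Delay at B

• Compute the cross-correlation of D1 and D2

[Figure: topology with nodes A, B, C, and D, and cross-correlation plots for the edge pairs (AB) (BC) and (AB) (BD). Annotations: a spike indicates causality; the spike's position gives the delay; one plot is labeled "No spike".]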
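The cross-correlation step can be sketched in a few lines of plain Python. This is an illustrative version, not the E2EProf implementation: the function names, the spike test (peak larger than a multiple of the mean), and the sample signals are all assumptions chosen for the example.

```python
def cross_correlation(x, y, max_lag):
    """Return sum over t of x[t] * y[t + lag], for lag = 0 .. max_lag."""
    n = min(len(x), len(y))
    return [
        sum(x[t] * y[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def find_spike(corr, threshold):
    """Return the lag of the peak if it stands out from the mean, else None.

    The test "peak > threshold * mean" is a placeholder heuristic.
    """
    mean = sum(corr) / len(corr)
    peak_lag = max(range(len(corr)), key=lambda lag: corr[lag])
    if mean > 0 and corr[peak_lag] > threshold * mean:
        return peak_lag
    return None


# Hypothetical per-bin packet counts.
d1 = [0, 5, 0, 0, 3, 0, 0, 0, 4, 0, 0, 0]  # incoming edge
d2 = [0, 0, 0, 5, 0, 0, 3, 0, 0, 0, 4, 0]  # same pattern, two bins later
d3 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # unrelated flat traffic

print(find_spike(cross_correlation(d1, d2, max_lag=4), threshold=3.0))  # 2
print(find_spike(cross_correlation(d1, d3, max_lag=4), threshold=3.0))  # None
```

In the first case the spike at lag 2 suggests that traffic on the second edge is caused by traffic on the first, with a delay of two bins; in the second case no lag stands out, so no dependency is inferred.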

Evaluation with 4-tier RUBiS (http://rubis.objectweb.org/)

[Figure: testbed in which Clients issue "bidding" and "comment" requests to an Apache Web Server, Tomcat Servers 1 and 2, EJB Servers 1 and 2, and a MySQL Server; components are labeled CPU bound or I/O bound]

Service Path Detection in RUBiS

[Figure: detected service paths with the highest-delay nodes marked, shown for static server assignment and for a round-robin load balancer]

Change detection in RUBiS

Injected Delay

Delta Air Lines’ Application

Revenue Pipeline

Total traffic: 1.34 million / day (56k / hour)

[Figure: latency (sec) versus time of day for TACSIN & TACSOUT, XIN & XOUT, and APEXIN & APEXOUT, together with Error/Warning (Tivoli) logs]

Delta Air Lines’ Application

[Figure: client requests flowing through TACS and servers S1, S2, S3, S7, and S8; a huge request burst is marked]

Outline

• Background

• Motivation

• Service path

– Discovery with E2EProf

– Refinement with SysProf

– Automated SLA Enforcement

• Related Work

• Future Plans

Beyond dependency and latency…

[Figure: clients C1 and C2 and servers S1 through S6]

Solution: Zoom into the service path with SysProf

• No application hints or instrumentation

• Monitor resource usage on a per-class basis

SysProf Methodology

[Figure: instrumentation points along the request path, from client and back to client, across the user/kernel boundary. Components: eth driver, BDD, network stack, system call, FS/VM/etc., schedulers, and A1, A2, AN. Monitored items: Init CID, context switches, net softirq, system call parameters, PID, app functions, disk I/O.]

• Track request context

– Work done for processing a request class

– May span user level or kernel level

– Executes in more than one context (e.g., processes, threads, softirqs)

– Happens in a system-visible event (e.g., system calls)

Class ID Propagation

[Figure: class ID (CID) propagation across the Front-Tier, Middle-Tier, and End-Tier, spanning user and kernel level. Labels: Init CID, Process CID, Msg CID, Packet CID, Inherits CID, From client, To client.]
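The propagation idea can be illustrated with a small user-level simulation: a CID is assigned where the client request enters, each process that receives a tagged packet inherits its CID, work is charged to the CID the process currently holds, and outgoing messages carry that CID to the next tier. Everything below (the `Process` class, the function names, the request types mapped to CIDs, and the costs in milliseconds) is hypothetical and only sketches the mechanism; SysProf itself operates on kernel-level events.

```python
from collections import defaultdict


class Process:
    """A hypothetical process or thread that carries a current class ID."""

    def __init__(self, name):
        self.name = name
        self.cid = None


usage_ms = defaultdict(int)  # CPU milliseconds charged to each class ID


def classify(request_type):
    """Init CID: assign a class ID where the client request enters."""
    return {"bidding": 1, "comment": 2}.get(request_type, 0)


def receive(process, packet_cid):
    """The receiving process inherits the CID carried by the packet."""
    process.cid = packet_cid


def run(process, cpu_ms):
    """Charge work to whichever class the process is currently serving."""
    usage_ms[process.cid] += cpu_ms


def send(process):
    """Outgoing messages are tagged with the sender's current CID."""
    return process.cid


front, middle, end = Process("front"), Process("middle"), Process("end")
requests = [("bidding", (10, 30, 20)), ("comment", (10, 20, 50))]
for request_type, costs in requests:
    cid = classify(request_type)
    for process, cost in zip((front, middle, end), costs):
        receive(process, cid)
        run(process, cost)
        cid = send(process)

print(dict(usage_ms))  # {1: 60, 2: 80}
```

Because the CID travels with the message rather than with the application code, per-class accounting of this kind needs no application hints or instrumentation.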

Application of SysProf

• Resource Accounting

• Utility Billing

• Bottleneck detection

• Capacity Estimation

• Root-Cause Analysis

• Black-Box SLA management

Resource-Aware Adaptive Control

[Figure: request classes Class 1, Class 2, and Class 3 served by Tomcat Servers 1 and 2, EJB Servers 1 and 2, and a MySQL Server]

Cluster workloads contending for the same resources

Separate queue/controller for each cluster

$$W(i,j) = \sum_{k \,\in\, \text{set of resources}} \frac{r_{i,k}}{R_k} \cdot \frac{r_{j,k}}{R_k}$$

Front-end Controller + Scheduler

Resource-Aware Adaptive Control

[Figure: results with SysProf versus no SysProf; capacity = 80 req/s per server]

Summary

• Service Path

– System abstractions to represent dependencies and request path

• E2EProf and Pathmap

– Dependency and latency analysis

• SysProf

– Service-based resource analysis

• Aid human operator and automate end-to-end performance management

Thank You!

Questions?

Email: [email protected]

Extra Slides

Pathmap Optimizations

[Figure: processing stages, each plotted against time: packet timestamp trace, time-series signal or density function, cross-correlation series. Annotations: bursty traffic, sliding window (W), run-length compression, upper bound on latency.]
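Run-length compression suits bursty traffic because the resulting time series contain long stretches of identical values (typically zeros). The slide does not say how Pathmap encodes its signals, so the following is only a generic run-length encoder and decoder, with hypothetical names and sample data.

```python
def rle_encode(signal):
    """Compress a sequence into [value, run_length] pairs."""
    runs = []
    for value in signal:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original sequence."""
    signal = []
    for value, length in runs:
        signal.extend([value] * length)
    return signal


# Hypothetical bursty signal: mostly idle bins with a few bursts.
signal = [0, 0, 0, 4, 4, 0, 0, 0, 0, 0, 7, 0, 0]
runs = rle_encode(signal)
print(runs)  # [[0, 3], [4, 2], [0, 5], [7, 1], [0, 2]]
print(rle_decode(runs) == signal)  # True
```

The sparser the signal, the fewer runs it needs, which reduces both storage and the work of any computation that can skip over zero-valued runs.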
