Narrowing Down Possible Causes of Performance Anomaly in Web Applications
Satoshi Iwata*, Kenji Kono*†
*Keio University, †JST/CREST
Performance Anomalies in Web Applications
- Are unexpected, undesirable degradations in performance
- Lead to
  - Violations of service level agreements (SLAs)
  - Loss of potential customers
- Are difficult to remove completely during test operation
  - Because they are caused by complex interactions between various runtime factors
    - E.g., workload, resource usage, and elapsed time since system startup
    - An increase in clients makes performance parameters inappropriate
    - As the service runs, memory leaks caused by bugs accumulate little by little
Capturing Symptoms of Performance Anomalies
- Is required in real operation
  - Before the performance anomalies harm services severely
- Increases our chances of avoiding performance degradation
  - By proactively debugging the performance anomalies
  - By adding servers temporarily

[Figure: service status over time, from normal to severely harmed; once symptoms are indicated, the service either keeps degrading or recovers its performance]
Systematic Detection of Anomalies
- Allows administrators to detect symptoms without any expertise
  - E.g., the detection system notifies them of anomalous deviations in response time
- In contrast, ad-hoc approaches are only for experts
  - Many thresholds must be determined to detect symptoms
    - E.g., the upper limit that the average response time must not exceed
    - E.g., the period during which the average response time must not continue to increase
Our Proposal
- Apply control charts to detect symptoms of performance anomalies
  - Control charts determine whether a manufacturing or business process is in a state of statistical control
    - By detecting deviations from the standard quality of products
  - They are suitable for detecting performance anomalies, which are deviations from the standard performance
- Our method detects symptoms of anomalies that appear in processing time
  - Processing time is the time from a request's arrival at a front-end server to its departure from that server
  - Processing time can be measured unobtrusively
    - E.g., the Apache web server can be configured to log each request's URL and processing time
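As a concrete sketch of such unobtrusive measurement, Apache's mod_log_config module can record each request line (which includes the URL) together with the time taken to serve the request via the `%D` format directive (microseconds); the log name and format string below are illustrative:

```apache
# Log the client, timestamp, request line (includes the URL), status,
# and the time taken to serve the request in microseconds (%D)
LogFormat "%h %t \"%r\" %>s %D" timed
CustomLog logs/timed_log timed
```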
Control Charts
- Monitor characteristic values that characterize the quality of the target
  - E.g., clincher size or company profitability; in our technique, processing time
- Raise an alarm if the measured values statistically deviate from standard values
- To do this, they compare:
  - A statistic of the current characteristic values
    - E.g., average, median, or defective fraction; we chose the median
  - Baselines calculated from data measured in a controlled environment
Detail of Control Charts
- Baselines
  - The center line is the center of the measured statistics
  - The UCL and LCL indicate 3σ above/below the center line
- Warnings
  - A plot outside the limits
  - Systematic patterns within the limits
[Figure: control chart monitoring a clincher manufacturing process, plotting the median of clincher size [mm] against lot number, with the center line, upper control limit (UCL), lower control limit (LCL), and a warning point outside the limits]
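The baseline and warning mechanics described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes the limits are placed 3σ around the center of subgroup medians taken from a controlled run, and it only checks for plots outside the limits (not systematic patterns within them):

```python
import statistics

def build_baseline(values, group_size=5):
    """Derive control-chart baselines from values measured in a
    controlled environment: split them into subgroups (default size 5,
    the control-chart default), take each subgroup's median, and place
    the center line and 3-sigma limits around those medians."""
    medians = [statistics.median(values[i:i + group_size])
               for i in range(0, len(values) - group_size + 1, group_size)]
    center = statistics.mean(medians)
    sigma = statistics.pstdev(medians)
    return center, center + 3 * sigma, center - 3 * sigma  # center, UCL, LCL

def monitor(values, baseline, group_size=5):
    """Plot subgroup medians against the baseline and yield a warning
    (subgroup index, median) for every plot outside the control limits."""
    center, ucl, lcl = baseline
    for i in range(0, len(values) - group_size + 1, group_size):
        m = statistics.median(values[i:i + group_size])
        if m > ucl or m < lcl:
            yield (i // group_size, m)
```

The same two functions serve both phases of the method: `build_baseline` runs once on the controlled-environment data, and `monitor` runs continuously on live measurements.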
Merits of Control Charts
- Administrators do not have to decide complicated parameters
  - If they did, only experts could use the charts
  - The only parameter we have to decide is the group size used to calculate statistics
    - In control charts, the default value is 5; we used the default
- The output is simple
  - The output is either "normal" or "anomaly"
  - Administrators do not have to make a judgment call
Effort 1: Characteristic Values
- We chose four statistics of processing time as characteristic values, instead of processing time itself
  - Average, maximum, median, and minimum
  - We prepare four control charts and plot the median of average processing time, the median of maximum processing time, and so on
- Merits
  - We can concentrate on more harmful anomalies
    - Natural fluctuations in processing time itself would raise less important warnings
  - We can detect various anomalies
    - Different anomalies appear differently in processing time
Effort 2: Observing Unit
- We observe the processing time of each request type separately
  - By using URLs to investigate which servlet processes the request and what parameters are needed to process it
- A request type can be naturally defined on the basis of its task
  - For RUBiS (an auction site prototype), there are 27 request types
    - E.g., SearchItemsInCategory, RegisterUser, and PutBid
- Merits
  - Helps us detect early symptoms
    - Average processing times differ among request types
  - Helps us narrow down suspicious components
    - Different request types use different components
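Mapping a logged URL to its request type can be sketched as follows. This is a hypothetical helper, assuming RUBiS-style URLs in which the last path component names the servlet; the exact URL scheme is application-specific:

```python
from urllib.parse import urlsplit

def request_type(url: str) -> str:
    # Take the last path component as the request type; in RUBiS-style
    # URLs this is the servlet name (e.g., SearchItemsInCategory).
    # Query parameters are ignored here, although request parameters
    # can also be consulted when defining types.
    path = urlsplit(url).path
    return path.rstrip('/').rsplit('/', 1)[-1]
```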
Summary of Our Method
- Observes four statistics as characteristic values
- Separately observes the processing time of each request type
[Figure: one control chart for each combination of request type (Home, PutBid, Browse, AboutMe, …) and statistic (average, maximum, median, minimum)]
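The bookkeeping summarized above — four statistics per request type per measurement interval — might look like the following sketch (a hypothetical helper, not the authors' code):

```python
import statistics
from collections import defaultdict

def interval_statistics(entries):
    """Given (request_type, processing_time) pairs collected during one
    10-minute interval, compute the four characteristic values per
    request type: average, maximum, median, and minimum."""
    by_type = defaultdict(list)
    for rtype, t in entries:
        by_type[rtype].append(t)
    return {rtype: {"average": statistics.mean(ts),
                    "maximum": max(ts),
                    "median": statistics.median(ts),
                    "minimum": min(ts)}
            for rtype, ts in by_type.items()}
```

Each of the four values would then be fed to its own control chart for that request type.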
Typical Scenario
- Our approach can be used in both the test and real operation phases
  - During the test phase, we calculate the three baselines (center line, UCL, and LCL)
  - During the real phase, we use the baselines calculated in the test phase
[Figure: workflow — 1) calculating baselines and 2) debugging during the test operation phase; then 3) normal operation during the real operation phase, with 4) debugging whenever a warning is raised, until the anomaly is cured]
Case Studies
- We show case studies in which we:
  1. Detected symptoms of performance anomalies with our method
  2. Narrowed down the possible causes with the information acquired from our method
  3. Hunted for and debugged the root cause
- Experimental setup
  - Interval to calculate the four statistics: 10 minutes
  - Web application: RUBiS [http://rubis.objectweb.org/]
    - An auction site prototype modeled after eBay.com, implemented on the Java EE platform
  - Server software: Apache 2.2.11, JBoss 5.0.1, MySQL 5.1.34
  - Machine spec: CPU 3.00 GHz, memory 2 GB, OS Linux 2.6.27
Case 1: SearchItemsInCategory
- To calculate the baselines of the control charts, we ran RUBiS with 200 clients
- The SearchItemsInCategory-Min. control chart raised warnings
  - We could detect a symptom of a performance anomaly
  - The minimum processing time increased by only a few milliseconds
[Figure: SearchItemsInCategory-Min. control chart, plotting the median of minimum processing time [msec]]
Narrowing Down Possible Causes
- The cause of this performance anomaly should be characteristic to SearchItemsInCategory
  - It is the only request type whose minimum value increases as time goes on
- We investigated the RUBiS source code to determine the suspicious components
  - The components used to process SearchItemsInCategory but not used to process SearchItemsInRegion are suspicious:
    - SearchItemsByCategory Servlet
    - SB_SearchItemsByCategory Session Bean
    - The SQL sent by the Query Session Bean
Hunting for the Root Cause
- We manually determined the root cause
  1. We paid special attention to the SQL sent by the Query Session Bean
     - It is the only component that sends complicated SQL
  2. We investigated the items table
     - The SQL only accesses the items table
  3. We speculated that the growing items table caused the performance anomaly
     - RUBiS never deletes any records in the table
  4. We added a new index to the table
     - To limit the number of records to be linearly searched
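For illustration, step 4 corresponds to a DDL statement along these lines; the indexed column (`category`) and the index name are assumptions about which column the slow search query filters on, not details given in the talk:

```sql
-- Hypothetical fix: index the column the search query filters on,
-- so MySQL no longer scans the ever-growing items table linearly.
CREATE INDEX idx_items_category ON items (category);
```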
After Adding the Index
- We ran RUBiS with 200 clients again
- We could cure the performance anomaly
  - No warnings were raised
  - We could confirm that this change does not have negative side effects
[Figure: SearchItemsInCategory-Min. control charts before and after adding the index, each plotting the median of minimum processing time [msec]]
Case 2: Increasing Clients
- After calculating the baselines, we increased the number of clients
- Many warnings were raised by the *-Max. control charts when we increased the clients from 300 to 400
  - We could detect a symptom of a performance anomaly
  - The average processing time increased by only 91 milliseconds
[Figure: PutBidAuth-Max. and BrowseCategories-Max. control charts, each plotting the median of maximum processing time [sec]]
Narrowing Down Possible Causes
- We investigated the source code to categorize the request types into three tiers
  - To determine which server contains the root cause
- We counted the anomalous request types in each tier
- We concluded that there would be a problem in the JBoss application server
  - No warnings appear in the web tier

Observations with 400 clients:

  Tier                                 web   application   database
  Request types that raised warnings   0/5   2/4           17/18
Hunting for the Root Cause
- We manually determined the root cause
  1. We investigated the logs generated by JBoss
     - They told us that the number of threads had reached the maximum specified by the maxThreads parameter
  2. We raised maxThreads from the default (200) to 400
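In JBoss 5, this setting lives on the embedded web server's HTTP connector. A sketch of the change, assuming the default HTTP connector; the file path shown is typical for JBoss AS 5 but depends on the server profile:

```xml
<!-- server/<profile>/deploy/jbossweb.sar/server.xml (path is an assumption) -->
<Connector protocol="HTTP/1.1" port="8080"
           maxThreads="400" />   <!-- raised from the default of 200 -->
```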
After Raising maxThreads
- We increased the clients from 300 to 400 again
- We could cure the performance anomaly
  - No warnings were raised
  - We could confirm that this change does not have negative side effects
[Figure: BrowseCategories-Max. and PutBidAuth-Max. control charts before and after raising maxThreads]
Other Case Studies
- Many warnings were raised when we increased the clients from 200 to 300
  - The average processing time increased by only 141 milliseconds
  - We could narrow down the possible causes to the web server
    - Because almost all request types raised warnings
  - We solved the problem by changing KeepAliveTimeout from the default (5) to 1
- Many warnings were raised when we increased the clients from 400 to 500
  - The average processing time increased by only 105 milliseconds
  - We could narrow down the possible causes to the web server
    - Because almost all request types raised warnings
  - We solved the problem by turning off the KeepAlive feature
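In Apache's configuration, these two fixes correspond to the standard core directives:

```apache
# Fix for the 200 -> 300 case: shorten the keep-alive timeout
# from the default of 5 seconds to 1 second
KeepAliveTimeout 1

# Fix for the 400 -> 500 case: disable persistent connections entirely
KeepAlive Off
```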
False Positive
- A RegisterUser-Min. control chart regarded normal behavior as a symptom of a performance anomaly
- RegisterUser has two main execution paths, neither of which is erroneous:
  1. A new user is successfully registered in the database
  2. The name of the new user has already been taken by someone else
- But we think control charts are still useful
  - These are not insignificant warnings
[Figure: RegisterUser-Min. control chart, plotting the median of minimum processing time [sec]]
Related Work
- Some techniques entrust anomaly detection to administrators
  - They collect and present useful information
    - Processing time in each component [Aguilera et al. '03]
    - Resource usage in each component [Chanda et al. '07]
  - Control charts may be applied to detect anomalies within them
- Some techniques systematically detect anomalies in other metrics
  - We believe that these techniques are complementary to ours
    - Number of accesses to each page [Bodik et al. '05]
    - Resource usage at the granularity of machines [Cohen et al. '04]
- Control charts have been used to detect intrusions into web applications [Ye et al. '02]
  - Event intensity is chosen as the characteristic value in their control charts
Conclusion
- We applied control charts to detect symptoms of performance anomalies
  - They do not need complicated parameters
  - Their output is simple
- Our method
  - Observes four statistics of processing time as characteristic values
  - Separately observes the processing time of each request type
- Our method detected several performance anomalies caused by:
  - An inadequate index on the items table
  - Inappropriate performance parameters: maxThreads, KeepAliveTimeout, and KeepAlive