Narrowing Down Possible Causes of Performance Anomaly in Web Applications
Satoshi Iwata*, Kenji Kono*†
*Keio University, †JST/CREST
Performance Anomalies in Web Applications
- Are unexpected, undesirable degradations in performance
- Lead to
  - Violations of service level agreements (SLAs)
  - Loss of potential customers
- Are difficult to remove completely during test operation
  - Because they are caused by complex interactions between various runtime factors
    - E.g., workload, resource usage, and elapsed time since system startup
    - An increase in clients makes performance parameters inappropriate
    - As the service runs, memory leaks caused by bugs accumulate little by little
Capturing Symptoms of Performance Anomalies
- Is required in real operation
  - Before the performance anomalies harm services severely
- Increases our chances of avoiding performance degradation
  - By proactively debugging the performance anomalies
  - By adding servers temporarily

[Figure: service status over time, from normal to severely harmed; once symptoms are indicated, the service either keeps degrading or recovers its performance]
Systematic Detection of Anomalies
- Allows administrators to detect symptoms without any expertise
  - E.g., the detection system notifies them of anomalous deviations in response time
- In contrast, ad-hoc approaches are only for experts
  - Many thresholds must be determined to detect symptoms
    - E.g., the upper limit that the average response time must not exceed
    - E.g., the period during which the average response time must not continue to increase
Our Proposal
- Apply control charts to detect symptoms of performance anomalies
  - Control charts determine whether a manufacturing or business process is in a state of statistical control
    - By detecting deviations from the standard quality of products
  - They are suitable for detecting performance anomalies, which are deviations from the standard performance
- Our method detects symptoms of anomalies that appear in processing time
  - Processing time is the time from a request's arrival at a front-end server to its departure from that server
  - Processing time can be measured unobtrusively
    - E.g., the Apache web server can be configured to log each request's URL and processing time
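As a concrete sketch of such unobtrusive measurement, Apache's mod_log_config module can record each request line (which includes the URL) together with the time taken to serve the request via the `%D` format directive (microseconds); the log name and format string below are illustrative:

```apache
# Log the client, timestamp, request line (includes the URL), status,
# and the time taken to serve the request in microseconds (%D)
LogFormat "%h %t \"%r\" %>s %D" timed
CustomLog logs/timed_log timed
```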
Control Charts
- Monitor characteristic values that characterize the quality of the target
  - E.g., clincher size or company profitability; in our technique, processing time
- Raise an alarm if the measured values statistically deviate from standard values
- To do this, they compare:
  - A statistic of the current characteristic values
    - E.g., average, median, or defective fraction; we chose the median
  - Baselines calculated from data measured in a controlled environment
Detail of Control Charts
- Baselines
  - The center line is the center of the measured statistics
  - The UCL and LCL indicate 3σ above/below the center line
- Warnings
  - A plot outside the limits
  - Systematic patterns within the limits
[Figure: control chart monitoring a clincher manufacturing process, plotting the median of clincher size [mm] against lot number, with the center line, upper control limit (UCL), lower control limit (LCL), and a warning point outside the limits]
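The baseline and warning mechanics described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes the limits are placed 3σ around the center of subgroup medians taken from a controlled run, and it only checks for plots outside the limits (not systematic patterns within them):

```python
import statistics

def build_baseline(values, group_size=5):
    """Derive control-chart baselines from values measured in a
    controlled environment: split them into subgroups (default size 5,
    the control-chart default), take each subgroup's median, and place
    the center line and 3-sigma limits around those medians."""
    medians = [statistics.median(values[i:i + group_size])
               for i in range(0, len(values) - group_size + 1, group_size)]
    center = statistics.mean(medians)
    sigma = statistics.pstdev(medians)
    return center, center + 3 * sigma, center - 3 * sigma  # center, UCL, LCL

def monitor(values, baseline, group_size=5):
    """Plot subgroup medians against the baseline and yield a warning
    (subgroup index, median) for every plot outside the control limits."""
    center, ucl, lcl = baseline
    for i in range(0, len(values) - group_size + 1, group_size):
        m = statistics.median(values[i:i + group_size])
        if m > ucl or m < lcl:
            yield (i // group_size, m)
```

The same two functions serve both phases of the method: `build_baseline` runs once on the controlled-environment data, and `monitor` runs continuously on live measurements.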
Merits of Control Charts
- Administrators do not have to decide complicated parameters
  - If they did, only experts could use the charts
  - The only parameter we have to decide is the group size used to calculate statistics
    - In control charts, the default value is 5; we used the default
- The output is simple
  - The output is either "normal" or "anomaly"
  - Administrators do not have to make a judgment call
Effort 1: Characteristic Values
- We chose four statistics of processing time as characteristic values, instead of processing time itself
  - Average, maximum, median, and minimum
  - We prepare four control charts and plot the median of average processing time, the median of maximum processing time, and so on
- Merits
  - We can concentrate on more harmful anomalies
    - Natural fluctuations in processing time itself would raise less important warnings
  - We can detect various anomalies
    - Different anomalies appear differently in processing time
Effort 2: Observing Unit
- We observe the processing time of each request type separately
  - By using URLs to investigate which servlet processes the request and what parameters are needed to process it
- A request type can be naturally defined on the basis of its task
  - For RUBiS (an auction site prototype), there are 27 request types
    - E.g., SearchItemsInCategory, RegisterUser, and PutBid
- Merits
  - Helps us detect early symptoms
    - Average processing times differ among request types
  - Helps us narrow down suspicious components
    - Different request types use different components
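Mapping a logged URL to its request type can be sketched as follows. This is a hypothetical helper, assuming RUBiS-style URLs in which the last path component names the servlet; the exact URL scheme is application-specific:

```python
from urllib.parse import urlsplit

def request_type(url: str) -> str:
    # Take the last path component as the request type; in RUBiS-style
    # URLs this is the servlet name (e.g., SearchItemsInCategory).
    # Query parameters are ignored here, although request parameters
    # can also be consulted when defining types.
    path = urlsplit(url).path
    return path.rstrip('/').rsplit('/', 1)[-1]
```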
Summary of Our Method
- Observes four statistics as characteristic values
- Separately observes the processing time of each request type
[Figure: one control chart for each combination of request type (Home, PutBid, Browse, AboutMe, …) and statistic (average, maximum, median, minimum)]
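The bookkeeping summarized above — four statistics per request type per measurement interval — might look like the following sketch (a hypothetical helper, not the authors' code):

```python
import statistics
from collections import defaultdict

def interval_statistics(entries):
    """Given (request_type, processing_time) pairs collected during one
    10-minute interval, compute the four characteristic values per
    request type: average, maximum, median, and minimum."""
    by_type = defaultdict(list)
    for rtype, t in entries:
        by_type[rtype].append(t)
    return {rtype: {"average": statistics.mean(ts),
                    "maximum": max(ts),
                    "median": statistics.median(ts),
                    "minimum": min(ts)}
            for rtype, ts in by_type.items()}
```

Each of the four values would then be fed to its own control chart for that request type.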
Typical Scenario
- Our approach can be used in both the test and real operation phases
  - During the test phase, we calculate the three baselines (center line, UCL, and LCL)
  - During the real phase, we use the baselines calculated in the test phase
[Figure: workflow — 1) calculating baselines and 2) debugging during the test operation phase; then 3) normal operation during the real operation phase, with 4) debugging whenever a warning is raised, until the anomaly is cured]
Case Studies
- We show case studies in which we:
  1. Detected symptoms of performance anomalies with our method
  2. Narrowed down the possible causes with the information acquired from our method
  3. Hunted for and debugged the root cause
- Experimental setup
  - Interval to calculate the four statistics: 10 minutes
  - Web application: RUBiS [http://rubis.objectweb.org/]
    - An auction site prototype modeled after eBay.com, implemented on the Java EE platform
  - Server software: Apache 2.2.11, JBoss 5.0.1, MySQL 5.1.34
  - Machine spec: CPU 3.00 GHz, memory 2 GB, OS Linux 2.6.27
Case 1: SearchItemsInCategory
- To calculate the baselines of the control charts, we ran RUBiS with 200 clients
- The SearchItemsInCategory-Min. control chart raised warnings
  - We could detect a symptom of a performance anomaly
  - The minimum processing time increased by only a few milliseconds
[Figure: SearchItemsInCategory-Min. control chart, plotting the median of minimum processing time [msec]]
Narrowing Down Possible Causes
- The cause of this performance anomaly should be characteristic to SearchItemsInCategory
  - It is the only request type whose minimum value increases as time goes on
- We investigated the RUBiS source code to determine the suspicious components
  - The components used to process SearchItemsInCategory but not used to process SearchItemsInRegion are suspicious:
    - SearchItemsByCategory Servlet
    - SB_SearchItemsByCategory Session Bean
    - The SQL sent by the Query Session Bean
Hunting for the Root Cause
- We manually determined the root cause
  1. We paid special attention to the SQL sent by the Query Session Bean
     - It is the only component that sends complicated SQL
  2. We investigated the items table
     - The SQL only accesses the items table
  3. We speculated that the growing items table caused the performance anomaly
     - RUBiS never deletes any records in the table
  4. We added a new index to the table
     - To limit the number of records to be linearly searched
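For illustration, step 4 corresponds to a DDL statement along these lines; the indexed column (`category`) and the index name are assumptions about which column the slow search query filters on, not details given in the talk:

```sql
-- Hypothetical fix: index the column the search query filters on,
-- so MySQL no longer scans the ever-growing items table linearly.
CREATE INDEX idx_items_category ON items (category);
```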
After Adding the Index
- We ran RUBiS with 200 clients again
- We could cure the performance anomaly
  - No warnings were raised
  - We could confirm that this change does not have negative side effects
[Figure: SearchItemsInCategory-Min. control charts before and after adding the index, each plotting the median of minimum processing time [msec]]
Case 2: Increasing Clients
- After calculating the baselines, we increased the number of clients
- Many warnings were raised by the *-Max. control charts when we increased the clients from 300 to 400
  - We could detect a symptom of a performance anomaly
  - The average processing time increased by only 91 milliseconds
[Figure: PutBidAuth-Max. and BrowseCategories-Max. control charts, each plotting the median of maximum processing time [sec]]
Narrowing Down Possible Causes
- We investigated the source code to categorize the request types into three tiers
  - To determine which server contains the root cause
- We counted the anomalous request types in each tier
- We concluded that there would be a problem in the JBoss application server
  - No warnings appear in the web tier

Observations with 400 clients:

  Tier                                 web   application   database
  Request types that raised warnings   0/5   2/4           17/18
Hunting for the Root Cause
- We manually determined the root cause
  1. We investigated the logs generated by JBoss
     - They told us that the number of threads had reached the maximum specified by the maxThreads parameter
  2. We raised maxThreads from the default (200) to 400
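In JBoss 5, this setting lives on the embedded web server's HTTP connector. A sketch of the change, assuming the default HTTP connector; the file path shown is typical for JBoss AS 5 but depends on the server profile:

```xml
<!-- server/<profile>/deploy/jbossweb.sar/server.xml (path is an assumption) -->
<Connector protocol="HTTP/1.1" port="8080"
           maxThreads="400" />   <!-- raised from the default of 200 -->
```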
After Raising maxThreads
- We increased the clients from 300 to 400 again
- We could cure the performance anomaly
  - No warnings were raised
  - We could confirm that this change does not have negative side effects
[Figure: BrowseCategories-Max. and PutBidAuth-Max. control charts before and after raising maxThreads]
Other Case Studies
- Many warnings were raised when we increased the clients from 200 to 300
  - The average processing time increased by only 141 milliseconds
  - We could narrow down the possible causes to the web server
    - Because almost all request types raised warnings
  - We solved the problem by changing KeepAliveTimeout from the default (5) to 1
- Many warnings were raised when we increased the clients from 400 to 500
  - The average processing time increased by only 105 milliseconds
  - We could narrow down the possible causes to the web server
    - Because almost all request types raised warnings
  - We solved the problem by turning off the KeepAlive feature
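In Apache's configuration, these two fixes correspond to the standard core directives:

```apache
# Fix for the 200 -> 300 case: shorten the keep-alive timeout
# from the default of 5 seconds to 1 second
KeepAliveTimeout 1

# Fix for the 400 -> 500 case: disable persistent connections entirely
KeepAlive Off
```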
False Positive
- A RegisterUser-Min. control chart regarded normal behavior as a symptom of a performance anomaly
- RegisterUser has two main execution paths, neither of which is erroneous:
  1. A new user is successfully registered in the database
  2. The name of the new user has already been taken by someone else
- But we think control charts are still useful
  - These are not insignificant warnings
[Figure: RegisterUser-Min. control chart, plotting the median of minimum processing time [sec]]
Related Work
- Some techniques entrust anomaly detection to administrators
  - They collect and present useful information
    - Processing time in each component [Aguilera et al. '03]
    - Resource usage in each component [Chanda et al. '07]
  - Control charts may be applied to detect anomalies within them
- Some techniques systematically detect anomalies in other metrics
  - We believe that these techniques are complementary to ours
    - Number of accesses to each page [Bodik et al. '05]
    - Resource usage at the granularity of machines [Cohen et al. '04]
- Control charts have been used to detect intrusions into web applications [Ye et al. '02]
  - Event intensity is chosen as the characteristic value in their control charts
Conclusion
- We applied control charts to detect symptoms of performance anomalies
  - They do not need complicated parameters
  - Their output is simple
- Our method
  - Observes four statistics of processing time as characteristic values
  - Separately observes the processing time of each request type
- Our method detected several performance anomalies caused by:
  - An inadequate index on the items table
  - Inappropriate performance parameters: maxThreads, KeepAliveTimeout, and KeepAlive