
PerfAugur: Robust Diagnostics for Performance Anomalies in Cloud Services

Sudip Roy*
Google Research
Mountain View, CA 94043, USA
[email protected]

Arnd Christian König
Microsoft Research
Redmond, WA 98052, USA
[email protected]

Igor Dvorkin, Manish Kumar
Microsoft Corp.
Redmond, WA 98052, USA
{igord,maniku}@microsoft.com

Abstract—Cloud platforms involve multiple independently developed components, often executing on diverse hardware configurations and across multiple data centers. This complexity makes tracking various key performance indicators (KPIs) and manually diagnosing anomalies in system behavior both difficult and expensive. In this paper, we describe PerfAugur, an automated system for mining service logs to identify anomalies and help formulate data-driven hypotheses. PerfAugur includes a suite of efficient mining algorithms for detecting significant anomalies in system behavior, along with potential explanations for such anomalies, without the need for an explicit supervision signal. We perform extensive experimental evaluation using both synthetic and real-life data sets, and present detailed case studies showing the impact of this technology on operations of the Windows Azure Service.

I. INTRODUCTION

Cloud platforms need to constantly improve their quality of service to meet increasingly stringent performance and availability requirements from customers. To this end, the quality life-cycle involves tracking various KPIs quantifying both performance and availability, and detecting and diagnosing any behavioral changes in the underlying system components.

While such detection and diagnosis is a non-trivial task for any system, the heterogeneity and complexity of cloud services makes this task particularly challenging. Cloud services continually deploy new software versions of system components and are executed on a diverse set of hardware configurations, often across multiple data centers. This heterogeneity and complexity creates both outright service failures and various regressions in performance/availability, which manifest themselves in increased latency and/or increased failure rates of individual requests. While outright service failures are typically rare and quickly detected by operations teams, these types of performance/availability anomalies (which have also been named chronics [1] or latent faults [2]) can remain undetected for much longer periods, as they typically only affect a subset of requests and occur under complex sets of conditions (which was also observed in [1]). Such anomalies may lead to service level agreement violations and thereby have significant financial implications for cloud service providers.

Anomaly diagnosis is made more difficult by the very skewed performance counter distributions typically seen in web services even for well-configured systems.

∗Work done during internship at Microsoft Research.

Fig. 1: An anomaly in VM deployment time detected by PerfAugur. [Plot: deployment time in seconds vs. percentile, comparing the baseline and the anomaly; anomaly predicates: OS Version = 2.176 and Time > ‘10/11/13 13:11:45’.]

Such skew was also observed in [3] and [4], especially for network round-trip times, where the 99th percentile time can be orders of magnitude larger than the median [5], [6]. Because of this skew, anomalies typically cannot be diagnosed using only a few instances of latent operations, since it is difficult to distinguish transient anomalies (aberrations in system behavior which occur under a unique and unrepeatable set of circumstances) from systemic anomalies (which are repeatable and unlikely to be resolved without developer intervention).

Manual diagnosis of regressions involves multiple iterations of analyzing service logs, formulating hypotheses based on experience, and finally validating the hypotheses against larger sets of data. This manual approach to diagnosis is not only slow but can also be expensive. Diagnosing these types of regressions would greatly benefit from an automated approach capable of identifying both (a) significant changes in system behavior relative to a baseline, and (b) system configurations which are correlated with the observed changes. Automating the first aspect allows subsequent ranking of various anomalies by importance as well as an initial assessment of the impact of fixing the root cause (i.e., deciding “Which problem should I fix first?”). Automating the second assists with the formulation of root-cause hypotheses and eliminates possible user bias towards their own (potentially incorrect) intuition.

Example 1.1: Consider a scenario where a code-change results in a significant increase in the duration of virtual machine (VM) deployments. Based on the service logs, PerfAugur not only detects this change, but also identifies that it is most pronounced for the instances with OS version 2.176 started after ‘10/11/13 13:11:45’ (see Figure 1). With this knowledge, the developer can then investigate the code-changes relevant to OS version 2.176 around the specific time point.


Problem Statement: We shall refer to these types of constraints on OS version and time as predicates that identify an anomalous subset of data. The problem we study in this paper can be formulated as computing compact sets of predicates that give rise to the most significant changes in behavior over a baseline.

A. Challenges

Absence of Supervision Signal: Unlike outright failures (which can often be identified from system logs), there is – in the cases of systemic anomalies we observed in practice – no simple and reliable way to identify the individual instances that make up the anomaly up-front. This makes techniques that discover correlated predicates based on pre-labeled failure cases or outliers (such as DRACO [1], Distalyzer [7], or supervised anomaly-detection [8]) unsuitable for our scenario.

In the presence of the very skewed performance-counter distributions we discussed earlier, systemic anomalies often do not manifest themselves in the form of many additional outlier data points, but instead result in a noticeable increase in requests with moderately larger performance measures. A real-life example of this is our first case study (see Section VI), where a storage misconfiguration results in significant changes in median latency, but very little change in the 95th percentile (see Figure 10, left graph). As a result, the points making up a systemic anomaly often cannot be reliably identified up-front using one of the existing outlier detection techniques [8]. Two-step approaches, in which we first identify the instances making up an anomaly using outlier detection techniques and subsequently use these outliers as supervision data to identify correlated predicates, are thus likely to perform poorly in these scenarios. We substantiate this experimentally in Section V-A3.

Robustness against Outliers: As opposed to the two-step approach, we propose an approach that discovers predicates and performance changes jointly, by searching over the space of predicate combinations and comparing the difference between the distribution induced by a set of predicates and the overall baseline distribution.

Note that this approach also needs to account for the high data skew when comparing distributions: for example, using simple statistics such as averages for these comparisons can be problematic as they can easily be ‘dominated’ by a few outlier values. Instead, our approach is based on robust statistics [9] such as percentiles, which avoid this issue, and have become common industrial practice in Service Intelligence. Here, we define being robust to outliers using the notion of the breakdown point of an estimator [9]; intuitively, the breakdown point of an estimator is the proportion of incorrect observations (e.g., arbitrarily large deviations) an estimator can handle before giving an incorrect result.

Example “breakdown point”: Given m independent variables and their realizations x_1, ..., x_m, we can use x̄ = (∑_{i=1}^{m} x_i)/m to estimate their mean. However, this estimator has a breakdown point of 0, because we can make x̄ arbitrarily large just by changing any one of the x_i. In contrast, the median of a distribution has a breakdown point of 50%, which is the highest breakdown point possible.
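To make the breakdown-point contrast concrete, the following small Python sketch (our own illustration, not part of PerfAugur) corrupts a single observation and compares the effect on the mean and on the median:

    import statistics

    # Baseline sample of m = 9 well-behaved observations.
    data = [12.0, 14.0, 15.0, 15.0, 16.0, 17.0, 18.0, 20.0, 22.0]

    corrupted = data.copy()
    corrupted[0] = 1e9  # a single, arbitrarily large deviation

    # Mean (breakdown point 0): one bad point ruins the estimate.
    print(statistics.mean(data), statistics.mean(corrupted))      # ~16.6 vs ~1.1e8
    # Median (breakdown point 50%): a single outlier barely moves it.
    print(statistics.median(data), statistics.median(corrupted))  # 16.0 vs 17.0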

Computational Overhead: There are primarily two sources of computational complexity in mining such anomalies robustly. First, robust statistics tend to be more expensive computationally, especially with respect to incremental computation. To illustrate this, consider the case of updating the mean of a distribution D = {x_1, ..., x_m} when adding t additional points {x_{m+1}, ..., x_{m+t}}; the mean can easily be updated once the sum of the new values s_v = ∑_{i=1}^{t} x_{m+i} (in addition to m and t) is known. In contrast, updating the median requires knowledge of both the exact values of {x_{m+1}, ..., x_{m+t}} as well as at least t additional values in D. As a consequence, many of the standard pruning “tricks” used in data mining (such as the optimizations discussed in [10] for a similar scenario) do not apply. Instead, different types of optimizations which leverage the specific characteristics of robust aggregates and the skewed distributions we see in practice are needed.

Second, testing all possible combinations of predicates for anomalies leads to an inherent combinatorial explosion. Given the size of the underlying search space, the challenge is to develop algorithms which strike a balance between the quality of the mining result and computation cost.

B. Contributions

In this paper, we describe PerfAugur, a system for mining service telemetry information to identify systemic anomalies quickly and help formulate data-driven hypotheses as to the involved components and root causes. In particular,

• We formally define the problem of mining systemic anomalies based on robust statistics and propose algorithms that address the challenges imposed by the size of the search space and the computational overhead of maintaining robust statistics.

• We evaluate the resulting detection accuracy as well as overhead using various data sets and compare our approach to two well-known existing techniques, SCORPION [10] and decision-tree based diagnostics [11].

• Finally, we demonstrate the impact of our system on industrial practice via two real-life case studies of investigations into Windows Azure telemetry.

Note that the approach described in this paper generalizes to analytics tasks in other domains for which robust statistics are required due to data skew/noise.

II. RELATED WORK

Outlier Detection: There exists a rich body of (unsupervised) outlier detection techniques in the data analytics literature (see [8] for an overview). However, the problem formulation studied in this paper is inherently different. We do not seek to identify all sufficiently anomalous outlier data points; instead, we identify a compact set of predicates that induce a significant change in behavior over the baseline. Note that such a change can be significant without a large number of extreme values in the performance counters of interest. As a result, two-step approaches that first use outlier detection techniques to identify individual outlier data points and subsequently seek to identify predicates from these often fail to capture interesting systemic issues with system performance.

Log-based Diagnostics tools: Recently, a large body of analysis tools has emerged that use machine-learning based approaches for the analysis of performance logs, including Distalyzer [7], the work of Cohen et al. [12] (which uses Tree-augmented Naive Bayesian Networks to identify combinations of metrics and threshold values that correlate with performance states (e.g., “SLO violated”)), the work of Chen et al. [11] (which uses decision trees to mine paths that distinguish successful from failed executions), Fa [13] (which uses anomaly-based clustering to group monitoring data based on how much it differs from the failure data), and DRACO [1]. However, all of these techniques require a supervision signal in the form of failure data or a set of abnormal instances in a separate log. This makes them unsuitable for our problem setting and the type of exploratory analysis PerfAugur is aimed at. In addition, most of the proposed techniques are incapable of handling the range of data types and scoring functions (used to quantify the significance of the change in behavior over the baseline) that PerfAugur is able to deal with.

Clustering based mappings for known anomalies: There has also been recent work using clustering-based methods [13] or specialized fingerprints [14] to map current performance anomalies to one of a number of known failure states. Unlike our work, these techniques are optimized for accurate and fast matching and require signatures of known failures to be effective; in the case of unknown, novel failures, they, unlike PerfAugur, provide little insight into potential root causes.

Other approaches: Two recent approaches, SCORPION [10] and the work of Roy et al. [15], propose techniques that “explain” outliers in aggregate queries by identifying subsets of tuples that contribute most to the outlier aggregate. While the approaches of both these papers are similar to PerfAugur in that both try to identify some form of predicates which explain anomalous aggregate values, PerfAugur is targeted at performance diagnostics and, as a result, uses specialized classes of aggregate functions based on robust statistics. The predicate-search algorithms in PerfAugur are designed for robust statistics specifically, where none of the optimizations proposed in [10] and [15] apply. We experimentally compare the performance and result quality of SCORPION to our techniques in Section V.

A very different approach to “explaining” outliers in data is described by Micenkova et al. [16]. Here, the explanations consist of the combinations of dimensions in which the outlier shows the largest deviations; these explanations are very different and far less intuitive in the context of diagnostics. Furthermore, it is not clear how to adapt these techniques for the various categorical data types handled by our approach. Muller et al. [17] describe a related approach that first identifies relevant subspaces using statistical significance tests and then uses these to rank outliers.

Approximate Quantile Computation: There exist a number of techniques for approximate quantile computation (typically, in the context of data streams), e.g., [18], [19]. The algorithms in this paper incorporate the computation of quantiles for all prefixes of an interval as a sub-step; PerfAugur computes these exactly, by maintaining a min- and a max-heap of the appropriate size (see [20]). Replacing this component of PerfAugur with one of the approximate techniques would reduce both the computational overhead (typically, by a logarithmic factor) and the memory requirements.

Time                 | VM Type | DataCenter | Latency
2014-01-19 03:14:17  | IaaS    | CA         | 30 ms.
2014-01-19 03:15:09  | PaaS    | NY         | 40 ms.
2014-01-19 03:15:57  | PaaS    | CA         | 43 ms.
2014-01-19 03:16:07  | PaaS    | CA         | 60 ms.

TABLE I: An example of a log relation.

III. ROBUST DIAGNOSTICS

Most cloud services use some form of measurement infrastructure that collects and compiles telemetry information in a suitable form for further analysis. For simplicity we assume that the telemetry information is maintained in a single relation R with attributes A_1, ..., A_k. Each tuple in this relation corresponds to a single measurement of a particular action. The set of attributes can be partitioned into two non-overlapping sets Ae and Am such that Ae contains the set of attributes which describe the system environment under which actions are taken, and Am contains the set of attributes each corresponding to a performance indicator. An example of such a relation is shown in Table I. Each tuple in this relation contains information pertaining to spawning a new virtual machine. For this relation, the set Ae consists of the timestamp, the virtual machine type and the data center location, and the set Am simply contains the latency attribute.

Anomalies: Let Σ(R, Ai) be some statistical property computed over the values of the attribute Ai for all tuples in the relation R (e.g., a median). Given such a statistical property over a particular attribute Ai ∈ Am, an anomaly is a subset of the measurements S ⊆ R such that Σ(S, Ai) differs significantly from the baseline property defined by Σ(B, Ai) over a baseline set B. In the absence of a pre-specified set B, we use Σ(R, Ai) as the baseline measure.

Predicates: In this paper, we define predicates (denoted θ) as conjunctions of equality predicates of the form Ae = v or range predicates of the form vlow ≺ Ae ≺ vhigh, where Ae ∈ Ae, v, vlow and vhigh are constants, and ≺ defines a total order over the domain of the attribute Ae. Such predicates effectively summarize the system environment under which the anomaly occurs and, therefore, characterize the conditions which may be related to the cause of the anomaly. In the rest of the paper, we refer to an environment attribute participating in a predicate as a pivot attribute.

Robust Aggregates: For any subset S = σθ(R), where σ is the relational selection operator, we can define how much S differs from R with respect to one specific performance indicator Am ∈ Am using suitable aggregate functions. For the reasons outlined in Section I-A, we only consider functions that are robust (denoted by Σr) to the effect of outliers in this context, such as the median or other percentiles.

Scoring Functions: We use the robust aggregates as part of so-called scoring functions to quantify the impact of an anomaly S with respect to an underlying baseline distribution. We use R as the baseline set in the following for simplicity; however, our approach works identically when the baseline is specified separately (e.g., as last month’s measurements). We measure impact in terms of the change in distribution between S and R for a given performance indicator attribute Am. A scoring function takes these three parameters (R, S, Am) as input and outputs a single number used for ranking anomalies.


Each scoring function has to quantify two aspects of impact: (a) how different the anomaly is in terms of the change in (the distribution of) Am, and (b) how many instances of operations/objects are affected by the anomaly. Note that these two factors trade off against each other: typically, the more points are included in an anomaly, the smaller is the change in distribution, and vice versa. Obviously, an anomaly covering all points in R would in turn have the baseline distribution and thus show no change at all.

To quantify the deviation in Am, we use a robust aggregation function Σr to compute aggregates for the attribute Am over all items in S as well as those in the baseline R. Subsequently, we measure the degree of the anomaly as the difference between these two values; we denote this difference using the notation Σr(S, Am) ∼ Σr(R, Am). Note that the choice of Σr as well as the appropriate difference operator ∼ depends on the scenario and the type of the attribute of interest. When Am is of a numeric type, Σr is typically a percentile and ∼ the absolute difference between the corresponding percentiles. On the other hand, for non-numeric categorical attributes (such as error codes or the names of failing function calls), we use the KL-Divergence [21] – a standard measure of distance between probability distributions. Here, the divergence is computed between the probability distribution of values of Am in the baseline set (R) and the anomalous subset (S). Note that the KL-Divergence is a robust measure by default, as no individual item can change the overall probability distribution disproportionately. We will discuss a real-life example of categorical attributes of interest in Section VI.
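For illustration, here is a small Python sketch (our own, not the paper’s implementation) of a KL-Divergence-based deviation for a categorical attribute; the additive smoothing constant eps is an assumption we add to keep the divergence finite when a category is absent from the baseline, since the paper does not specify how zero counts are handled:

    import math
    from collections import Counter

    def kl_divergence(subset_vals, baseline_vals, eps=1e-9):
        """Compute D_KL(P_subset || P_baseline) over categorical values."""
        cats = set(subset_vals) | set(baseline_vals)
        p, q = Counter(subset_vals), Counter(baseline_vals)
        n_p, n_q = len(subset_vals), len(baseline_vals)
        div = 0.0
        for c in cats:
            # Additive smoothing (our assumption) so q(c) = 0 is handled.
            pc = (p[c] + eps) / (n_p + eps * len(cats))
            qc = (q[c] + eps) / (n_q + eps * len(cats))
            div += pc * math.log(pc / qc)
        return div

    # Error codes in the anomalous subset vs. the baseline log:
    print(kl_divergence(["Timeout", "Timeout", "OK"],
                        ["OK"] * 95 + ["Timeout"] * 5))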

To quantify how many instances of operations or objects are affected by the anomaly, we use a function of the size of S. In practice, we typically use the natural logarithm of |S|, giving us the following scoring function:

f(R, S, Am) := (Σr(S, Am) ∼ Σr(R, Am)) × log |S|,

where the first factor quantifies the deviation from the baseline and the second the impact of the number of instances.

Here, the use of the logarithm of the size of S (as opposed to using |S| outright) favors anomalies that result in a larger deviation from the baseline (but over a smaller number of instances). Note that the algorithms discussed in Section IV are also applicable when other functions of |S| are used to quantify the effect of the number of instances, after some suitable, straightforward modifications.
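A minimal Python sketch of this median-based scoring function (our own rendering; the paper abstracts over the robust aggregate Σr and the difference operator ∼, for which we substitute the median and the absolute difference):

    import math
    import statistics

    def score(baseline, subset):
        """Median-based anomaly score: deviation from baseline times log-size."""
        if not subset:
            return float("-inf")
        deviation = abs(statistics.median(subset) - statistics.median(baseline))
        return deviation * math.log(len(subset))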

Diversity: Finally, in order to avoid providing multiple similar explanations for the same anomaly or multiple explanations for the same set of anomalous measurements, we need to incorporate a notion of diversity into the mining task. For instance, the two predicates vlow ≺ Ae ≺ vhigh and v′low ≺ Ae ≺ v′high, such that vlow ≈ v′low and vhigh ≈ v′high, while different, convey almost identical information. Presenting both to the user is unlikely to convey any additional information. To incorporate this notion of diversity, our framework supports the specification of a diversity function fdiv(θ1, θ2) → {true, false} which returns true if the anomalies explained by the predicates θ1 and θ2 are diverse, and false otherwise. The mining algorithms proposed in Section IV are independent of any specific diversity function.

While the exact definition of diversity may be user defined, below we propose a simple and meaningful diversity function that we will use in this paper going forward. Consider two atomic predicates, θ1 and θ2, defined over the same environment attribute Ae. As explained earlier, the notion of diversity must capture the degree of overlap between the two predicates. While there are multiple metrics to measure such overlap, such as the Jaccard-distance between σθ1(R) and σθ2(R), an extreme form of diversity is to disallow any overlap, i.e., σθ1(R) ∩ σθ2(R) = ∅. In this paper, we assume this as the default notion of diversity for atomic predicates.

The same principle can be extrapolated to anomalies defined by a conjunction of many atomic predicates. For such multi-predicate anomalies, it is likely that only a subset of the predicates also induces a relatively high-scoring anomaly. Consider the following example: if all deployments using build version 2.17 have abnormally high latency, then it is likely that the subset of deployments that use build version 2.17 and are deployed on cluster XYZ will also show high latencies. Therefore, unless the latency spike is specific to cluster XYZ, presenting an anomaly [Build = 2.17 ∧ Cluster = XYZ] in addition to the original anomaly [Build = 2.17] does not convey additional information and should be avoided.

Generalizing from the above, we extend our default notion of diversity to multi-atom predicates as follows. Let Aθ ⊆ Ae be the set of environment attributes over which the atomic predicates of θ are defined. We consider two explanation predicates θ1 and θ2 diverse if and only if either (a) Aθ1 ⊈ Aθ2 and Aθ2 ⊈ Aθ1, or (b) Aθ1 ⊆ Aθ2 or Aθ2 ⊆ Aθ1, and σθ1(R) ∩ σθ2(R) = ∅. Intuitively, the first condition requires each of the explanations to have at least one distinguishing attribute. The second condition applies when the first condition does not and, similar to the atomic predicate case, requires them to explain non-overlapping sets of measurements.
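A Python sketch of this default diversity check (our own rendering; for simplicity, a predicate is represented by the set of environment attributes it uses and the set of row indices it selects):

    def diverse(attrs1, rows1, attrs2, rows2):
        """Default diversity check for two explanation predicates.

        attrs*: environment attributes used by each predicate (A_theta).
        rows*:  row indices selected by each predicate (sigma_theta(R)).
        """
        # Condition (a): neither attribute set contains the other.
        if not attrs1 <= attrs2 and not attrs2 <= attrs1:
            return True
        # Condition (b): comparable attribute sets must select disjoint rows.
        return rows1.isdisjoint(rows2)

    # [Build = 2.17] vs. [Build = 2.17 AND Cluster = XYZ] over overlapping rows:
    print(diverse({"Build"}, {1, 2, 3}, {"Build", "Cluster"}, {2, 3}))  # False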

Problem Definition: Using the intuition introduced above, we can now define the task of mining a set of diverse anomalies:

Problem 3.1 (Diverse Anomaly Mining): Given a telemetry log R, a particular measurement attribute Am ∈ Am, the required number of anomalies k, a scoring function for Σr, f(S, R, Am) → ℝ (denoted f(S) in short), and a diversity function over explanations, fdiv(θ1, θ2) → {true, false}, the diverse anomaly mining task is to determine an ordered list of explanations, L = [θ1, ..., θk], which satisfies the following properties:

• For any 1 ≤ i < j ≤ k, fdiv(θi, θj) is true.

• For any i < j, f(σθi(R)) > f(σθj(R)).

• Finally, for any θi ∈ L and θ ∉ L, either f(σθi(R)) > f(σθ(R)), or there exists j < i such that f(σθi(R)) < f(σθ(R)) < f(σθj(R)) and fdiv(θj, θ) is false.

Informally, the first condition requires all the top-k predicates to be diverse. The second condition ensures that the list of predicates is ordered according to their scores. Finally, the third condition ensures that a high-scoring predicate is excluded from the top-k anomalies only if there exists a higher-scoring predicate with respect to which the predicate is not diverse.


IV. ANOMALY MINING ALGORITHMS

In this section, we propose algorithms for the diverse anomaly mining task. Our algorithms extract predicates which identify the top-k highest-scoring diverse anomalies for a measurement log R. First, we describe algorithms for identifying anomalies defined by atomic predicates over a single attribute in Ae, called the pivot attribute. Subsequently, we describe algorithms for anomalies with multiple pivot attributes.

The particular algorithm used for mining anomalies depends on the type of the pivot attribute. We refer to pivot attributes which have an inherent order over values, such as numerical and date-time datatypes, as ordered pivots. On the other hand, attributes which enumerate values from a certain domain, like cluster names and OS versions, are called categorical pivots. While for ordered pivots we extract range predicates of the form vlow ≺ Ae ≺ vhigh, for categorical pivots we extract equality predicates of the form Ae = v, where Ae is the pivot attribute. Identifying anomalies for categorical pivot attributes is computationally easy, since the problem reduces to performing a group-by over the pivot attribute followed by computing each group’s score (see the sketch below). Therefore, for the rest of the section we specifically focus on algorithms for ordered pivots.
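For categorical pivots, the entire mining step is therefore a short group-and-score loop; a self-contained Python sketch (our own illustration, using the median-based scoring function from Section III):

    import math
    import statistics
    from collections import defaultdict

    def mine_categorical(rows, pivot, measure):
        """Score each equality predicate `pivot = v` against the full log.

        `rows` is a list of dicts, one per log tuple; returns (score, value)
        pairs sorted by descending score.
        """
        baseline_median = statistics.median(r[measure] for r in rows)
        groups = defaultdict(list)
        for r in rows:
            groups[r[pivot]].append(r[measure])
        scored = [
            (abs(statistics.median(vals) - baseline_median) * math.log(len(vals)), v)
            for v, vals in groups.items()
        ]
        return sorted(scored, key=lambda t: t[0], reverse=True)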

Notation: We use Am to denote the performance indicator over which anomalies are to be detected, Ae to denote a pivot attribute, and θij as a notational shorthand for the range predicate vi ≺ Ae ≺ vj, where vi and vj are the i-th and j-th values of the pivot attribute in sorted order. We also use Sθ as a notational shorthand for σθ(R).

A. Single Pivot Anomalies

We first propose an exhaustive algorithm for ordered pivots. However, this brute-force approach does not scale well to very large data sets. To overcome this, we then propose two additional algorithms: first, an order-of-magnitude faster algorithm that extracts predicates such that the anomaly scores are (at least) within a constant factor, α, of those mined exhaustively; second, an even faster algorithm which performs well in practice and offers a performance guarantee which depends on the characteristics of the data itself.

1) Naive: The exhaustive algorithm for identifying anomalies on ordered pivots sorts the items by the pivot attribute, and then scores the subset of items within every pair of start and end points. The computational complexity of this algorithm depends on the cost of computing the scoring function. For a median-based scoring function, this cost is O(|σθ(R)|), where θ explains the anomaly being scored. However, the cost of determining the median for an interval θi(j+1) given the median for θij can be reduced to O(log |σθij(R)|) by maintaining the medians of the interval incrementally with two heaps [20] – a max-heap and a min-heap; this approach also works for all other percentiles, all that changes is the fraction of tuples in each heap. Given this incremental implementation of the scoring function, the cost of the exhaustive algorithm (for N = |R| items) becomes O(N² log N).
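A Python sketch of the two-heap running median (a standard construction; heapq provides only min-heaps, so the lower half stores negated values — for another percentile, the rebalancing would keep the corresponding fraction of points in the lower heap):

    import heapq

    class RunningMedian:
        """Incremental median via a max-heap (lower half) and a min-heap (upper half)."""

        def __init__(self):
            self.lo = []  # max-heap of the lower half (values negated)
            self.hi = []  # min-heap of the upper half

        def add(self, x):
            # O(log n) per insertion, matching the cost cited above.
            if self.lo and x > -self.lo[0]:
                heapq.heappush(self.hi, x)
            else:
                heapq.heappush(self.lo, -x)
            # Rebalance so that len(lo) == len(hi) or len(lo) == len(hi) + 1.
            if len(self.lo) > len(self.hi) + 1:
                heapq.heappush(self.hi, -heapq.heappop(self.lo))
            elif len(self.hi) > len(self.lo):
                heapq.heappush(self.lo, -heapq.heappop(self.hi))

        def median(self):
            if len(self.lo) == len(self.hi):
                return (-self.lo[0] + self.hi[0]) / 2
            return -self.lo[0]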

2) Grid-Refinement: Next, we propose an algorithm which offers a principled way to trade off the “accuracy” of the mined anomalies for efficiency; instead of returning the highest-scoring anomaly, the algorithm returns an anomaly whose score is guaranteed to be within a factor α of the highest-scoring anomaly (where, e.g., α = 0.9). In return for relaxing the score constraint, this algorithm performs orders of magnitude faster in practice.

Q ← ∅ {A priority queue of anomalies sorted by an upper bound on their score.}
Let N = |R|
Rs ← Sort(R, Ae) {Sort instances by pivot attribute Ae.}
Q.push(θ1N, 0, ∞, N) {Initialize Q.}
TopK ← ∅ {The result set.}
while Q ≠ ∅ ∧ |TopK| < k do
    (θ, s, u, g) ← Q.dequeue
    if s/u ≥ α then
        if ∧_{θi ∈ TopK} fdiv(θ, θi) then
            TopK.Add(θ)
    else
        for all r ∈ Refine(θ, g) do
            Q.push(r)
return TopK

Algorithm 1: The α-approximate grid-refinement algorithm.

The speedup seen by this algorithm is the result of exploiting two properties typically found in data distributions seen in the context of cloud diagnostics:

“Small” Anomalies: First, for most datasets, anomalies are expected to constitute a relatively small fraction of all the items. The naive algorithm spends a significant amount of computation time ruling out intervals which resemble the baseline and are, therefore, non-anomalous. In contrast, the new algorithm can rule out large fractions of the search space quickly by bounding the score of the anomalies in it.

Stability of Robust Statistics: For the data distributions typically seen in practice, robust statistics are relatively stable with respect to the addition or removal of a small number of points. To illustrate this, consider Figure 2, which shows an example latency distribution and the corresponding median. Here, the middle portion of this distribution tends to be “flat”, implying that the median does not change significantly in response to the insertion or deletion of k points (which can at most move the median by k points along the x-axis, corresponding to only a small change along the y-axis). This property of stability implies that the score of an anomaly vlow ≺ Ae ≺ vhigh is expected to be approximately equal to that of an anomaly defined by v′low ≺ Ae ≺ v′high if vlow ≈ v′low and vhigh ≈ v′high. The algorithm exploits this by using the score of one anomaly to compute tight upper bounds on the scores of anomalies with similar predicates.

Intuitively, the algorithm uses grids of various levels of coarseness to “zoom into” regions in the data containing high-scoring anomalies. First, the algorithm analyzes the data at a coarse granularity, choosing the values of vlow and vhigh only from the points along the grid and computing upper bounds on the possible scores of anomalies found at finer granularity. Only for sub-regions where these upper bounds are sufficiently high do we then consider anomalies found at a finer grid resolution, repeating the process until we discover an anomaly with a score within a factor of α of the highest (potential) score of all unseen anomalies. We illustrate this process in Figure 3.


Fig. 2: Small change in median value due to addition of k points below the median. [Plot: performance measurement vs. percentile.]

Fig. 3: Refinement by searching the grid near interval endpoints at a finer resolution. [Plot: performance measurement vs. pivot attribute, showing initial grid lines, finer grid lines, and the anomaly after refinement.]

Fig. 4: Expansion from a seed point to identify anomalies. [Plot: performance measurement vs. pivot attribute.]

Let θlow,high at grid size g be the interval to be refined.
gr ← g / ConvergenceRatio {Refined grid size.}
Qrefined ← ∅ {The set of refined anomalies.}
for i ← (low − g) : gr : low do
    for j ← (i + gr) : gr : (high + g) do
        sij ← f(R, Sθij, Am)
        uij ← BoundScore(R, Sθij, Am, gr)
        Qrefined.Add(θij, sij, uij, gr)
return Qrefined

Algorithm 2: Refinement procedure for a predicate θlow,high at grid size g. Boundary-condition checks, including the handling of intervals smaller than or equal to a grid size, are omitted for simplicity.

Refinement and Bounding: The Grid-Refinement algorithm is shown in Algorithm 1. The algorithm maintains a priority queue of anomalies represented by 4-tuples (θij, s, u, g), where θij is the interval, s is the score of the current interval, u is an upper bound on the score achievable through arbitrary refinement of the grid near the end-points of the interval [vi, vj], and g is the current grid size. The algorithm dequeues anomalies from the priority queue in order of their upper bounds on scores; if the current score is within an α factor of the bound on the scores, then we add it to the result set after checking the diversity constraint. Otherwise, the interval is refined using the “zoom in” procedure shown in Algorithm 2. During refinement of an interval, for each possible refined interval at a finer grid size, we compute the score of the anomaly as well as an upper bound on the possible improvement achievable by “refining” the grid, i.e., the maximum score possible for an anomaly when using (a) an arbitrarily fine grid and (b) endpoints vlow and vhigh within one grid size of the original “coarse” pair of endpoints (see Figure 3). The algorithm terminates once the top-k approximate anomalies are determined.

Bounding Scores: For correctness, we require the function BoundScore to provide a sound upper bound on the score of any given predicate, i.e., for any interval θij at grid size g, if Qrefined is the set of intervals obtained by refining θij as shown in Algorithm 2, then ∀θi′j′ ∈ Qrefined : f(Sθi′j′, R, Am) < u. We show one such method of estimating the upper bound for scoring functions using the median as the robust statistic of choice. Extending it to arbitrary percentiles is trivial using a similar technique.

Let Sθij be an interval at grid size g for which the upper bound is to be estimated. The specific method of refinement shown in Algorithm 2 restricts the maximum deviation of the median to 2g points, since the refinement only allows the addition of points by expanding the interval by at most g points on either end. Let vk be the k-th value in the sorted order of the attribute Am among the points in Sθij; therefore, v_{N/2} denotes the median value. Since the median for any refinement can deviate from the current median by at most 2g points, the score for any refinement of the interval is bounded by (v_{N/2+2g} − v_{N/2}) × log(|Sθij|). For typical distributions, the change in median value, and therefore the gap between the upper bound and the (best) actual score for an interval, is expected to be relatively small due to the stability around medians illustrated in Figure 2. Note that both refinement and computing an upper bound on scores for intervals of size equal to one grid block need to be handled specially; the details are omitted due to space limitations.
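A conservative Python sketch of such a median-based BoundScore (our own rendering, not the paper’s exact implementation, under the stated assumption that refinement moves the median by at most 2g positions and adds at most 2g points):

    import math

    def bound_score(sorted_vals, baseline_median, g):
        """Upper bound on the score of any refinement of an interval.

        sorted_vals: the A_m values inside the interval, in ascending order.
        baseline_median: the robust aggregate of the baseline R.
        """
        n = len(sorted_vals)
        mid = n // 2
        hi = sorted_vals[min(mid + 2 * g, n - 1)]  # largest reachable median
        lo = sorted_vals[max(mid - 2 * g, 0)]      # smallest reachable median
        max_dev = max(abs(hi - baseline_median), abs(lo - baseline_median))
        return max_dev * math.log(n + 2 * g)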

Correctness: The algorithm satisfies the invariant that an anomaly is added to the set of top-k anomalies if and only if the anomaly’s score is within an α factor of the highest-scoring anomaly. Let Sθ be the first anomaly to be included in the top-k by Algorithm 1, and let it be an anomaly at a grid resolution of g. Also, let Sθopt be the highest-scoring anomaly, and let Sβ be the anomaly which contains Sθopt and has both endpoints on the grid with resolution g. Since the algorithm dequeues anomalies according to upper bounds on scores, we know that u(Sθ) ≥ u(Sβ). By the soundness of the bounding function and the refinement procedure, we also know that u(Sβ) ≥ f(Sθopt, R, Am). Therefore, u(Sθ) ≥ f(Sθopt, R, Am). Also, since the algorithm chooses the anomaly, we know that f(Sθ, R, Am)/u(Sθ) ≥ α. Therefore, f(Sθ, R, Am) ≥ α × f(Sθopt, R, Am).

3) Seed Expansion: The Grid-Refinement algorithm relies on the stability-of-medians property (see Figure 2). However, the distributions seen around much higher (or much lower) percentiles are often less stable. We now propose an algorithm for faster detection of anomalies, aimed in particular at scoring functions based on these percentiles, or at fast analysis of very large data sets. This algorithm offers significantly lower asymptotic overhead (O(N^1.5)) as well as significantly faster wall-clock runtime. However, as opposed to Grid-Refinement, which guarantees a constant approximation ratio, the scores of the anomalies mined by the Seed Expansion algorithm are within a data-dependent factor of the optimal anomalies.

The intuition behind this algorithm is that anomalies for high/low percentiles typically contain extreme (i.e., relatively high or low) values of the performance indicators; to simplify exposition, we will make the assumption that we only seek anomalies corresponding to large performance indicator values.

Let s be the index of the seed in the sorted order of pivot Ae.
lnew ← s; rnew ← s
MaxScore ← −∞
while f(S[lnew,rnew], R, Am) ≥ MaxScore do
    l ← lnew; r ← rnew
    MaxScore ← f(S[lnew,rnew], R, Am)
    scorel ← f(S[l−1,r], R, Am)
    scorer ← f(S[l,r+1], R, Am)
    scorelr ← f(S[l−1,r+1], R, Am)
    Let [lnew, rnew] be the interval corresponding to max(scorel, scorer, scorelr).
return [l, r]

Algorithm 3: Expansion of a single seed point (l and r denote left and right, respectively).

The algorithm first chooses the top √N points in order of the value of the performance indicator; we call these points seed points. For each seed point, we now want to determine whether it corresponds to an isolated transient anomaly (which we want to ignore) or is part of a systemic anomaly (which we want to detect). In the former case, we expect the seed point to be a local extremum surrounded (along the pivot axis) by many points that roughly resemble the baseline distribution; in the latter case, we expect further extreme measurement values in the neighborhood of the seed.

Smoothing: To avoid situations where all the chosen seed points are transient anomalies, we apply an initial smoothing step before choosing the seed values. Here, we first replace each value vi of the performance indicator with the median value among all values in an interval along the pivot axis of size c and “centered” at vi; we then choose the largest values among these. This way, single outlier points within a region of low values are not chosen as seeds, eliminating (single-point) transient anomalies from consideration.
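A Python sketch of this smoothing and seed-selection step (our own; `values` are the performance indicator values sorted by the pivot attribute, and c is the smoothing window size):

    import math
    import statistics

    def pick_seeds(values, c):
        """Median-smooth `values`, then pick the top sqrt(N) indices as seeds."""
        n = len(values)
        half = c // 2
        smoothed = [
            statistics.median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)
        ]
        # Single-point spikes are suppressed by the windowed median, so the
        # largest smoothed values point at candidate systemic anomalies.
        n_seeds = math.isqrt(n)
        return sorted(range(n), key=lambda i: smoothed[i], reverse=True)[:n_seeds]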

Given any seed point identified by the index s with the pivot value vs, the algorithm initializes a single-item anomaly with the predicate vlow = vs ≺ Ae ≺ vhigh = vs and tries to expand this anomaly by adding points in each direction along the pivot axis. If the seed point is part of a systemic anomaly, we expect the score of the resulting anomaly to grow with the expansion. On the other hand, if the seed corresponds to a transient anomaly, we expect the score to (eventually) decrease as points resembling the background distribution are added.

The procedure for the expansion of a single seed point is shown in Algorithm 3. The algorithm expands a seed until an expansion does not result in an improvement in the anomaly score. This expansion procedure is repeatedly invoked for the √N seed points. Seed points which are already included in the expanded anomalies formed out of previous seed points are excluded from consideration as seeds. The algorithm maintains all expanded intervals in a sorted list from which the highest-scoring set of k diverse anomalies (Section III) is returned as the final result. The pseudocode for the complete algorithm is omitted due to space constraints.

Score Approximation: The quality of the anomalies mined by the seed expansion algorithm depends on how easily distinguishable the anomalies are from the background distribution. We use two properties of the dataset to quantify this distinctiveness of anomalies. First, the maximum gradient (i.e., max_i(v_{i+1} − v_i)) of the performance indicator attribute with respect to the pivot attribute, denoted δmax; note that this measure is computed after smoothing, effectively making it the maximum gradient over any interval of size c. Second, let Δ = (v_N − v_{N/2})/(N/2) be the average gradient between the median and the maximum value, and let α = δmax/Δ. Then it can be shown that if Sθ is the best anomaly mined by the Seed Expansion algorithm and Sθopt is the top-scoring pattern mined by an exhaustive algorithm, then

f(Sθ, R, Am) ≥ (2 log(Nα) / (α log N)) × f(Sθopt, R, Am),

where f is the median-based scoring function and |Sθopt| ≤ √N. While we omit a formal proof due to space constraints, we briefly explain the implications of such a bound. For a distribution with a very pronounced anomaly, the value of α is expected to be high, since δmax is expected to be high. This in turn implies that the approximation factor 2 log(Nα)/(α log N) evaluates to a lower value, since the contribution of α to the denominator dominates. Therefore, as expected, if anomalies are more pronounced in a distribution, the algorithm can identify them more accurately, giving us the desired behavior of identifying the most prevalent anomalies in a highly scalable manner.

B. Multi-Pivot Anomalies

Anomalies often occur due to system conditions which can only be reliably captured by predicates over multiple attributes. For example, response times for operations may degrade only under high memory contention when there are also multiple active threads on a machine. A brute-force approach for identifying such multi-attribute anomalies would be to check all combinations of predicates for all subsets of environment attributes, which is clearly computationally prohibitive. This computational hardness is not unique to our problem, but is an instance of a general class of problems observed in other domains, such as optimal decision tree construction. Similar to well-known techniques used in the literature for similar contexts, our first approach is to construct multi-pivot anomalies greedily.

In practice, the vast majority of anomalies are detected well using greedy techniques; however, to detect anomalies that are not, we next propose an algorithm that co-refines pivots jointly across different attributes. Finally, we show how we can leverage a property typically seen in real-life data distributions (namely, a bound on the extent to which the score of the highest-scoring anomaly characterized by l predicates is reduced when we only consider a subset of the predicates) to provide a tractable algorithm that gives quality guarantees on the scores of the mined anomalies.

Since the greedy algorithm is straightforward (and similar to traditional greedy decision tree construction algorithms), we omit its details.

1) Co-refinement: A purely greedy algorithm for mining anomalies may split a single anomaly into multiple anomalies due to a lack of foresight into the potential refinement by joining with other pivots. For handling such corner cases, we propose a co-refinement strategy: here, we first run the greedy mining algorithm on a small random sample of the data with a weighted scoring function where each data point is weighted by the inverse sampling ratio.

Rγ ← RandomSample(R, γ) {Choose a random sample w/o replacement of size γ × |R|.}
fγ(Rγ, S, Am) := (Σ(Rγ, Am) ∼ Σ(S, Am)) × log(|Rγ|/γ)
TopKCoarse ← GreedyMine(Rγ, fγ, Am)
TopKRefined ← ∅
for all θc ∈ TopKCoarse do
    θr ← θc; g ← |θr|
    while g ≥ 1 do
        for all θir ∈ θr, where θr = ∧_i θir do
            θ′r ← θ′r ∧ arg max_{θ ∈ Refine(θir, g)} f(Sθ, Sθr, Am)
        θr ← θ′r; g ← g / ConvergenceRatio
    TopKRefined.Add(θr)
return TopKRefined

Algorithm 4: A sampling and co-refinement based scheme for multi-pivot mining using a greedy mining procedure, GreedyMine(R, f, Am, k), which returns the top-k multi-pivot anomalies ordered by the scoring function f. We use θc to denote the predicates on the sampled data and θr to denote the predicates on the entire data.

This gives us an initial “rough” set of anomalies. We then co-refine these anomalies using the full data set as follows: we adopt an approach similar to the Grid-Refinement algorithm of gradually “zooming in” to determine the exact interval boundaries for each predicate. However, instead of refining attributes one after the other, for each anomaly we determine the best intervals across all constituent pivot attributes at a particular grid size before drilling down to the next grid level (see Algorithm 4).

2) α-approximate Multi-Pivot Refinement: While computing the top-scoring anomalies for adversarial data distributions is computationally prohibitive, we can leverage properties typically seen in real-life data to obtain a tractable algorithm with absolute guarantees on the anomaly score. First, to illustrate these data properties, consider an example anomaly which is best characterized by intervals along two different pivot attributes. The heat-map representation of the anomalous measurement values w.r.t. the two pivot attributes for such an anomaly is shown in Figure 5. The figure also shows three percentile distributions: two for (predicates on) each of the pivot attributes considered independently, and a third for when they are considered together. Clearly, the deviation between the anomaly median and the background distribution, observed when both attributes are considered together, shifts towards higher percentiles when only one of the pivots is considered. This is due to the addition of non-anomalous points to the anomaly. These non-anomalous points can only be filtered out by pivoting on the secondary attribute. By limiting the extent to which this shift occurs, we can provide sound bounds on the improvement possible in anomaly scores. We formally state this assumption as follows:

Definition 4.1 (Maximum Refinement Ratio): Given a multi-pivot anomaly delimited by l predicates over pivot attributes, the maximum refinement ratio is the largest constant γ ∈ [0, 1] such that there exists an ordering O of the predicates with |S_{θO(1) ∧ ... ∧ θO(i+1)}| / |S_{θO(1) ∧ ... ∧ θO(i)}| ≥ γ for all 1 ≤ i < l.


Fig. 5: Shift in the anomaly percentile on projection of the anomaly over either of the two pivot attributes, as compared to jointly over both. [Heat map of the anomalous measurements over Pivot 1 and Pivot 2, with three percentile plots of the performance measurement: the deviation of medians for the actual anomaly, and the shift in the anomaly percentile under each single-pivot projection.]

Bounding Multi-Pivot Anomaly Scores: We assume that for a given log relation R and a performance indicator attribute Am, the maximum refinement ratio γ is either known or is estimated conservatively (γ = 1 being most conservative). Under this assumption, given an l-pivot anomaly Sθl, it is possible to estimate the potential improvement in the anomaly score from pivoting on additional attributes. Let n = |Sθl|. If the maximum number of attributes in any anomaly is m, then any m-predicate anomaly formed by extending Sθl has size at least nmin = γ^{m−l} × n. For the particular case where the aggregation function is the median, the maximum score obtainable by extending Sθl is then bounded by

max_{i ∈ [n/2, n(1 − γ^{m−l}/2)]} (v_i − v_{⌈n/2⌉}) × log(2i).

This is because, in the best case, all the points filtered out by additional pivots are lower than the median value of Sθl, and therefore cause a rightward shift of the median. As more predicates over pivots are added to the anomaly, this estimate becomes tighter.
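A small Python sketch of this bound (our own rendering of the formula above; sorted_vals holds the A_m values of Sθl in ascending order):

    import math

    def multi_pivot_bound(sorted_vals, gamma, l, m):
        """Upper bound on the score of extending an l-pivot anomaly to at
        most m pivots, for the median-based scoring function."""
        n = len(sorted_vals)
        med = sorted_vals[math.ceil(n / 2) - 1]  # v_ceil(n/2), 1-based index
        i_lo = max(n // 2, 1)
        i_hi = min(int(n * (1 - gamma ** (m - l) / 2)), n - 1)
        best = float("-inf")
        for i in range(i_lo, i_hi + 1):
            # Best case: all points filtered out by additional pivots lie
            # below the current median, so v_i becomes the new median of a
            # subset of roughly 2i points.
            best = max(best, (sorted_vals[i] - med) * math.log(2 * i))
        return best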

As in the case of the single-pivot Grid-Refinement algorithm, by maintaining an upper bound over the best possible l-pivot (unseen) refinements for anomalies with fewer pivots, we can design an α-approximate multi-pivot mining algorithm.

We omit the pseudo-code for the multi-pivot anomaly mining algorithm, since it is structurally similar to the Grid-Refinement algorithm (Algorithm 1), except for the refinement procedure and the initialization step.

V. EXPERIMENTAL EVALUATION

In this section we evaluate the efficiency and accuracy of PerfAugur in comparison to SCORPION [10] and the decision-tree based approach of [11]. In addition, we test the feasibility of “two-step” approaches, which use outlier detection to provide the input labels for supervised techniques such as [7], [12], [11], [13], [1], by evaluating the accuracy of this labeling for real-life anomalies (Section V-A3).

Datasets: Since finding anomalies over ordered pivots is computationally significantly harder than for categorical pivots, we focus the experimental evaluation on ordered pivots. In contrast, the case studies in Section VI will illustrate examples with categorical pivots and both numerical as well as categorical performance indicators. For our experiments we use two types of data – actual log data and synthetically generated data. The actual log data is real telemetry data obtained from Windows Azure operations. For synthetic data, we adopt an approach similar to that of SCORPION. We create a relation R with d environment attributes and one measurement attribute as follows. We draw “normal” values of the measurement attribute from a distribution N(10, 10); then, to create d-dimensional anomalies, we choose a region of proportionate size in d-dimensional pivot space and assign measurement values drawn from N(80, 10) to all points in that space. Typically, these anomalies cover 10% of the points. In order to inject multiple anomalies, we use multiple, non-overlapping copies of data generated as above.
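A sketch of this generator in Python/NumPy (our own, interpreting N(10, 10) as mean 10 and standard deviation 10, and placing the anomalous region in a fixed corner of the pivot space for simplicity):

    import numpy as np

    def make_synthetic(n=10_000, d=1, anomaly_frac=0.10, seed=0):
        """Normal measurements ~ N(10, 10); one d-dimensional anomalous region
        covering ~anomaly_frac of the points with measurements ~ N(80, 10)."""
        rng = np.random.default_rng(seed)
        pivots = rng.uniform(0.0, 1.0, size=(n, d))  # environment attributes
        measure = rng.normal(10.0, 10.0, size=n)
        side = anomaly_frac ** (1.0 / d)  # axis-aligned region of the right volume
        in_region = np.all(pivots < side, axis=1)
        measure[in_region] = rng.normal(80.0, 10.0, size=in_region.sum())
        return pivots, measure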

Alternative Approaches: SCORPION generates predicates which, given a user-identified set of outlier and holdout (to indicate the baseline) groups of observations, explain outlier observations in the data. The non-robust scoring functions used by SCORPION score predicates by the extent to which they “influence” the outlier groups but not the holdout groups. The influence of a predicate is computed by measuring the difference in the aggregate values before and after removing the points satisfying the predicate. Because the various performance optimizations described in [10] do not apply to robust aggregates, we use the naive implementation of SCORPION, which performs an exhaustive search for such predicates. While the naive implementation has a high computational overhead, it is guaranteed to produce the best-quality predicates as per SCORPION’s scoring function.

The original decision-tree based approach of [11] requires a supervision signal (in the form of success/fail labels), which is generally difficult to obtain in practice; here, however, we avoid such labeling by re-phrasing the underlying task as a regression problem, making the (numeric) measurement attribute the variable to be approximated by the tree and using a regression tree implementation based on [22] to compute the tree. This way, the tree will partition the space into regions with similar measurement values. We can now treat each partition in the data induced by the tree as a potential anomaly. We give this approach one additional “advantage” in that we set the number of leaves in the tree (= the number of partitions) to the number required to bound all anomalies in the synthetic data. Both SCORPION and the decision-tree approach have a number of tuning parameters; all our reported results are for the best configuration obtained after tuning both algorithms.

Setup: All experiments are run on an Intel Xeon L5640 processor with 96 GB of main memory running Windows 2008 R2 Enterprise. All reported values, except for the brute-force algorithms, are averages over more than 10 runs.

A. Experimental Results

For all our experiments, we compare the different algorithms in two respects – computational overhead, and the quality of the extracted anomalies. For quantifying quality, we use F1-scores, defined as

    F1 = (2 · precision · recall) / (precision + recall),

measured w.r.t. the known set of anomalous points injected synthetically. Since the ground truth is not known for the real logs, we use the ratio of scores of anomalies to those mined by Naïve as the quality metric. While PerfAugur supports scoring functions based on arbitrary percentiles, due to lack of space, we show experiments for scoring functions based on medians only.

              Naïve   Seed   Grid(0.8)  Grid(0.9)  Grid(0.95)  SCOR.  DT
    Synth.    0.95*   0.83   0.83       0.86       0.89        0.75   0.36
    Real      1.0     0.77   0.98       0.98       0.99        -      -

TABLE II: Comparison of quality of anomalies mined over 100k items. For synthetic data, where anomalies are known, we use F1 scores. For real data, we use the ratio (SR) of the anomaly score to those mined by Naïve. *Reported values for Naïve are for 10k items, as computation for 100k items did not complete.
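For completeness, the F1 computation against the injected ground truth amounts to the following (function name ours):

    def f1_score_sets(predicted, truth):
        """F1 of a predicted set of anomalous points vs. the ground truth."""
        predicted, truth = set(predicted), set(truth)
        if not predicted or not truth:
            return 0.0
        tp = len(predicted & truth)
        precision = tp / len(predicted)
        recall = tp / len(truth)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)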

1) Experiments on One-Pivot Anomalies: In the first set of experiments, we study the behavior of the five algorithms (Naïve, Grid-Refinement, Seed Expansion, SCORPION, and decision-tree (DT)) for discovering top-scoring anomalies for a single (ordered) pivot under different data distributions; for the Grid-Refinement algorithm, we also vary α, the approximation ratio of the scores of the returned anomalies.

Scalability: For this experiment, we increase the number of items from 10k to 100k, introducing a new anomaly for every 10k items. Figure 6 shows a comparison of the overhead for mining the top anomaly with increasingly larger numbers of items. Seed Expansion and DT are about 3-4 orders of magnitude faster than Naïve and SCORPION; similarly, for any approximation ratio, Grid Refinement is around two orders of magnitude faster than Naïve and SCORPION. The corresponding F1-scores for 100k items are shown in Table II: all of our algorithms achieve significantly higher F1-scores than SCORPION and DT.

Over real data, almost all of our algorithms are able to find an anomaly with almost the same score as those mined by Naïve. Since the scoring metrics for SCORPION and DT are incomparable to our scoring function, evaluating their quality over real data using this approach would not be a fair comparison.

Robustness: We next test the robustness of the algorithms as we add random noise simulating transient anomalies. We construct a synthetic dataset of size 4k containing a single anomaly. We then mine for the top-1 anomaly while varying the percentage of noisy points up to 30%. The overhead and quality results are shown in Figures 7 and 8.

While SCORPION and DT perform well in the absence of noise, their performance degrades rapidly as we add noise (Figure 8). In contrast, the Grid Refinement, Seed Expansion and Naïve algorithms are able to tolerate up to 25% noise without significant degradation in quality. The Grid Refinement algorithm does require more time as the percentage of noisy points increases in order to satisfy the guaranteed approximation ratio; moreover, the higher the value of the approximation ratio α, the longer the time required to achieve it. In contrast, the running time of Seed Expansion is independent of the noise level, as the number of seed points used depends only on the data size.

2) Experiments on Multi-Pivot Anomalies: In this set of experiments, we study the behavior of the four multi-pivot anomaly mining algorithms (α-Approx, Greedy Multi-pivot, Sample and Co-refinement (SC), and decision-tree (DT)).


Fig. 6: Comparison of time required to mine the top-1 pattern on synthetic data (time in seconds vs. number of items in 1000s; series: Seed, Grid (0.8), Grid (0.9), Grid (0.95), SCORPION, DT). Times for Naïve are omitted as the algorithm takes more than a day over 100k items.

Fig. 7: Time for mining with increasing percentage of noise over 4k items (time in seconds vs. percentage of noisy points; series: Naïve, Seed, Grid (0.8), Grid (0.9), Grid (0.95), SCORPION, DT).

Fig. 8: F1-scores with increasing percentage of noise over 4k items (series as in Fig. 7).

Fig. 9: Evaluation of the multi-pivot mining algorithms (series: Greedy (0.8), Greedy (0.9), Sample (0.8), Sample (0.9), Alpha (0.7), DT). (a) Time for mining top-k anomalies involving 3 pivots with increasing number of items. (b) Average F1-scores of mined anomalies with increasing number of items. (c) Average F1-scores of mined anomalies for 10k items with increasing number of pivot attributes.

In our experiments, for α-Approx, we use α = 0.7 and γ = 0.7. For Greedy Multi-pivot and SC, we use the Grid-Refinement algorithm as an atomic predicate miner with α values set to 0.8 and 0.9. For SC, we use a sample size of 2%. Finally, since the naïve implementation of SCORPION suffers from combinatorial explosion, we are unable to compare against SCORPION for multi-pivot anomalies.
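For intuition only, the following is a heavily simplified sketch of a greedy multi-pivot loop that repeatedly invokes a single-pivot (atomic) predicate miner and extends the conjunction while the score improves; mine_atomic and score are hypothetical stand-ins, not the paper's actual interfaces:

    def greedy_multi_pivot(data, pivot_attrs, mine_atomic, score, max_pivots=3):
        """Greedily build a conjunction of per-pivot predicates: at each step,
        add the atomic predicate (over a not-yet-used pivot) whose refinement
        of the current subset scores highest, stopping when no pivot improves
        the score. mine_atomic(subset, attr) returns a predicate over attr;
        score(subset, pred) is a (robust) anomaly score -- both are
        hypothetical stand-ins for the paper's components."""
        predicates, subset = [], data
        remaining = list(pivot_attrs)
        best_score = float("-inf")
        while remaining and len(predicates) < max_pivots:
            candidates = [(attr, mine_atomic(subset, attr)) for attr in remaining]
            attr, pred = max(candidates, key=lambda ap: score(subset, ap[1]))
            new_score = score(subset, pred)
            if new_score <= best_score:
                break  # no pivot improves the current conjunction
            best_score = new_score
            predicates.append(pred)
            subset = [row for row in subset if pred(row)]
            remaining.remove(attr)
        return predicates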

To simulate transient anomalies, we add 20% noise values drawn from N(400, 10). Furthermore, in practice we have observed that for many data sets, a small fraction of performance measurements are disproportionately high (partially due to time-outs, instrumentation errors, etc.). To simulate these measurements, we sample 2% of the data (independently of whether these are part of the anomaly or not) and scale up their performance indicator by 100x.
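A minimal sketch of this noise-injection step (function name ours), which can be applied on top of the generator sketched earlier:

    import numpy as np

    def inject_noise(measurement, noise_frac=0.20, heavy_frac=0.02, seed=1):
        """Overwrite noise_frac of points with transient-anomaly values drawn
        from N(400, 10), then scale heavy_frac of all points by 100x to mimic
        time-outs / instrumentation errors."""
        rng = np.random.default_rng(seed)
        m = measurement.copy()
        n = len(m)
        noisy = rng.choice(n, size=int(noise_frac * n), replace=False)
        m[noisy] = rng.normal(400.0, 10.0, size=len(noisy))
        heavy = rng.choice(n, size=int(heavy_frac * n), replace=False)
        m[heavy] *= 100.0
        return m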

Scalability: For this experiment, we use items with a single performance indicator and three pivot attributes. The anomaly is most precisely identifiable with three different predicates, one over each pivot attribute. We first study the behavior of the algorithms as we increase the number of items in the set from 3k to 30k. The data set contains 1 anomaly per 3k items. We evaluate the quality of the algorithms by comparing the average F1-score of the top-k anomalies in the dataset. The results are shown in Figures 9a and 9b.

While DT is the fastest of all algorithms, the F1-score of the mined anomalies is quite low, and it often misses some of the top-k anomalies altogether. However, the Greedy and SC strategies scale well with the number of points and return all the anomalies, as indicated by the high F1-scores. Finally, the α-Approx multi-pivot algorithm also has high F1-scores but takes significantly longer to guarantee the desired approximation ratio. While the behavior of Greedy and SC is almost identical, it is possible to construct adversarial distributions in which Greedy performs poorly but, due to co-refinement, SC performs reasonably well; a recommended strategy is therefore to use an ensemble of both. Due to lack of space, we are unable to present this experiment.

Varying the dimensionality: In this experiment, we study the behavior of the algorithms with increasing dimensionality of the dataset. As explained in Figure 5, the higher the dimensionality, the more obscure the anomaly in its projection along any one pivot attribute, making it more difficult to detect.

For this experiment, we keep the number of items in the dataset constant at 10k with a single anomaly. As shown in Figure 9c, with increasing number of pivot attributes, the quality of the anomalies mined by DT decreases. However, for more than one dimension, the quality of the anomalies mined by each of the other algorithms is relatively unchanged, particularly so for the α-Approx algorithm. This can be attributed partly to the use of robust statistics and partly to the fact that, even at higher dimensions, the projection of the anomaly along any single pivot attribute is still "anomalous" enough to be picked up by the greedy approaches.

3) Experiments on Two-Step Approaches: In this experiment, we tested the feasibility of the "two-step" approaches discussed in Section II, which first use outlier detection techniques to identify anomalous tuples in the data set and then use these as a supervision signal for a supervised technique. Note that for this approach to be successful, it first needs to reliably identify a significant fraction of the values that make up the anomaly; otherwise, the subsequent detection of correlated predicates is likely to fail. Therefore, we tested how well different outlier detection mechanisms perform at identifying the data points that correspond to different performance regressions seen in Windows Azure. We use two real-life incidents with known causes here, which means that for these cases we have exact knowledge of which points are part of the anomaly. The first data set corresponds to anomalies based on slow storage endpoints (which we will describe in detail in the case study in Section VI), and the second data set corresponds to a spike in connection failures due to a software issue.

For both datasets, simple outlier-detection techniques that use a constant threshold (e.g., a 3σ distance from the average) perform very poorly when labeling anomalies, as they are insensitive to the local 'context' (such as changes in the expected latency at different times): depending on how the thresholds were set, they yield either nearly no tuples labeled as anomalous (when using a 3σ threshold) or, for lower thresholds, mostly false positives.
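The constant-threshold baseline amounts to something like the following (our own illustration of the rule, not the exact labeler used):

    import numpy as np

    def sigma_label(values, k=3.0):
        """Label a point anomalous if it lies more than k standard deviations
        above the global mean -- a context-insensitive, non-robust rule."""
        mu, sigma = np.mean(values), np.std(values)
        return values > mu + k * sigma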

As a consequence, we instead used the well-known LOF technique [23], which takes local context into account and identifies outliers based on local density. This technique uses only a single tuning parameter – the MinPts parameter governing the size of the local neighborhood – which we varied between 0.05% and 2% of the data set size. We also slightly varied the cutoff threshold used on the outlier factor output by the algorithm (between 1.1 and 1.3).
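A minimal sketch of this labeling step, assuming scikit-learn's LOF implementation (the library choice is ours; scikit-learn exposes the neighborhood size as n_neighbors and stores negated outlier factors, so a cutoff of 1.2 on the factor becomes a comparison against -1.2):

    from sklearn.neighbors import LocalOutlierFactor

    def lof_label(X, minpts_frac=0.005, cutoff=1.2):
        """Label points whose local outlier factor exceeds `cutoff`, with the
        neighborhood size set to a fraction of the data set size."""
        n_neighbors = max(2, int(minpts_frac * len(X)))
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        lof.fit(X)
        # scikit-learn stores the *negated* LOF, so factor > cutoff
        # corresponds to negative_outlier_factor_ < -cutoff.
        return -lof.negative_outlier_factor_ > cutoff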

For the first data set, we found that LOF was able to identify a large fraction of the anomalous tuples (depending on the parameter settings, between 77-93%), yet it also labeled a much larger fraction of non-anomalous tuples as outliers (between 37-40% of the entire data set). For the second data set, only 7-53% of the anomalous tuples were correctly labeled, whereas the false-positive rate remained similarly high. As a consequence, we cannot expect a two-step approach based on these outliers to function well for this data.

While this experiment only examines a very limited number of data sets and outlier detection approaches, it does illustrate the difficulty of decoupling the identification of outlier values from the mining of correlated predicates. In contrast, PerfAugur performs both steps jointly and was able to identify, in both datasets, both the known anomalies (among the top 5 anomalies found) as well as predicates pointing to the correct root cause.

VI. CASE STUDIES

To complement the experimental evaluation on synthetic data sets, we describe two case studies illustrating the impact of PerfAugur for two investigations into Windows Azure telemetry.

Case Study 1: "VM deployment latency regression". One of the KPIs for cloud service providers is the latency at which virtual machines are deployed, as it is imperative to ensure a consistent user experience. Starting on 7/9/13, we observed a very significant latency regression in one of the clusters, with latencies at the 70th percentile increasing to ∼34 minutes. When this regression was first observed, PerfAugur was not yet available; thus, the initial analysis involved manual analysis of service logs in an attempt to identify the root cause by testing individual hypotheses.

Fig. 10: Top PerfAugur output for Case 1, identifying storage node IP and return code as indicative of the performance regression (request latency in seconds vs. percentile; left: Baseline vs. "Storage Endpoint IP = 137.135.192.78"; right: Baseline vs. "ReturnCode = 2147023436").

These hypotheses centered on the suspected root cause being issues in the network stack (e.g., an update to the data center router software, an update to the routing rules, and issues with cross-cluster network latency), but did not find the cause. Subsequently, a second cluster exhibited a similar regression. At this point, engineers from several independent teams were pulled in and began investigating, focusing on recent node OS updates, background node servicing, and the storage stack; again, these investigations did not yield the root cause. The regression was then observed in a third cluster.

Finally, after spending several engineer-weeks of investigation across many teams to isolate the issue, engineers found that in the impacted clusters, a subset of the storage endpoints were experiencing poor throughput. This caused I/O requests performed as part of the VM deployment to be significantly delayed, and thus delayed the overall VM deployments. Subsequently, the nodes backing these storage endpoints were isolated, and a further investigation revealed a BIOS update that resulted in very low server fan speeds in certain situations, causing insufficient cooling and ultimately high temperatures in the blades. This high temperature led to the CPUs throttling clock speed to reduce heat output. As a result, CPU utilization on such nodes could not reach the target of 100%, which resulted in the observed delays.

Once the set of problematic storage endpoints was isolated, the diagnosis of the underlying root cause was rapid. However, finding these endpoints required significant time and effort, in part because the fault was not directly tied to a code check-in, in part because the fault only surfaced with a fan configuration unique to that data center, and most importantly because the developers had never seen this regression before.

We used PerfAugur to analyze the same telemetry data, requiring less than 10 minutes for processing. PerfAugur directly identified the individual storage endpoints, with predicates of the form "StorageEndPointIP = A.X.Y.Z", in the top anomalies. In addition, PerfAugur mined a secondary cause of the regression: OS pre-fetch failures (predicate: "Return code = -2147023436") due to timeouts at the storage endpoints, which had been missed before. Starting from this output, identifying the BIOS update as the root cause would have been relatively easy, saving weeks of investigation time.

Case Study 2: "Understanding the long tail". For our cloud service, the "tail" performance (i.e., high percentiles) of VM deployment operations is crucial, as customers are sensitive to the consistency of the observed performance [24]. In summer 2013, we observed a large differential between the latency at the median (∼5 minutes) and the "long tail" (at the 99th percentile, ∼20 minutes). However, the analysis of the deployments "in the tail" is made difficult by the fact that the corresponding instances are made up of many different (but individually rare) systemic anomalies as well as independent transient anomalies. Historically, developers have chosen issues to fix based on anecdotal evidence, often leading to little or no movement in the tail.

By setting the (categorical) performance indicator to be the (distribution of) "dominant" sub-operations (i.e., among all operations involved in a deployment, the sub-operation taking the most time), we were able to automatically identify partitions for which this distribution changed the most over the baseline. We were now able to identify (a) the change in the distribution of dominant sub-operations in "tail" deployments and (b) which other conditions correlated with specific sub-operations becoming the bottleneck. One outcome of this analysis was that we discovered the gap between the creation of a VM and the start of its execution to be the dominant sub-operation for a large partition of the data, which in turn led to the discovery of a new bug in the deployment code.
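To illustrate the idea, the following sketch scores a partition by how far its distribution of dominant sub-operations diverges from the baseline; we use the KL divergence [21] as one natural distance, though PerfAugur's actual scoring function for categorical indicators is defined earlier in the paper. The deployment records below are our own illustrative data:

    import math
    from collections import Counter

    def dominant_distribution(deployments):
        """Empirical distribution of the dominant (longest) sub-operation.
        Each deployment is a dict mapping sub-operation name -> duration."""
        dominant = [max(d, key=d.get) for d in deployments]
        counts = Counter(dominant)
        total = sum(counts.values())
        return {op: c / total for op, c in counts.items()}

    def kl_divergence(p, q, eps=1e-9):
        """KL(p || q) with smoothing so unseen sub-operations don't blow up."""
        ops = set(p) | set(q)
        return sum(
            p.get(op, eps) * math.log(p.get(op, eps) / q.get(op, eps))
            for op in ops
        )

    baseline = dominant_distribution([
        {"provision": 30.0, "boot": 60.0},
        {"provision": 80.0, "boot": 50.0},
    ])
    tail = dominant_distribution([
        {"provision": 20.0, "boot": 40.0, "create_to_start_gap": 900.0},
    ])
    print(kl_divergence(tail, baseline))  # large -> distribution shifted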

Summary: Cloud services have KPIs that they need to optimize. These optimizations start with an initial diagnosis, followed by code improvements. Currently, the initial diagnosis phase often consists of trial-and-error with simple analytic models, yet still requires multiple engineer-weeks of effort. With PerfAugur, this diagnosis is often completed in minutes and requires no custom development or repeated justification of methodology; moreover, PerfAugur helps overcome human limitations in both the space of hypotheses that is explored and the time that is required.

PerfAugur in production: PerfAugur is currently actively used on production data in order to detect performance regressions (using the previous week's data as baseline), detect abnormal increases in error codes (which in turn has allowed us to detect issues with the production system pro-actively), and to support root-cause analysis.

VII. CONCLUSION

In this paper, we presented PerfAugur – a system for automatically detecting anomalous system behavior and generating explanations in the form of system conditions which co-occur with anomalous operations. While our system was specifically targeted towards service diagnostics in cloud platforms, the algorithms presented here are relevant for any scenario where the underlying data skew and noise requires the use of robust statistics.

VIII. ACKNOWLEDGEMENTS

We would like to thank Andre Vachon, Mayur Upalekar, Mark Hennessy, Adeel Siddiqui, Henrique Polido, Joseph Chan, Sandi Ganguly, Terence Chan, Amit Kumar, Srivatsan Parthasarathy, Luhui Hu, Royi Ronen, Elad Ziklik and the anonymous ICDE reviewers for their insightful comments on the paper and feedback on the PerfAugur prototype.

REFERENCES

[1] S. P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, and P. Narasimhan, "Draco: Statistical diagnosis of chronic problems in large distributed systems," in DSN, 2012.

[2] M. Gabel, A. Schuster, R.-G. Bachrach, and N. Bjorner, "Latent fault detection in large scale services," in DSN, 2012.

[3] J. Dean and L. A. Barroso, "The tail at scale," Commun. ACM, vol. 56, no. 2, Feb. 2013.

[4] Y. Xu, Z. Musgrave, B. Noble, and M. Bailey, "Bobtail: Avoiding long tails in the cloud," in NSDI, 2013.

[5] M. Alizadeh, A. Kabbani, T. Edsall, B. Prabhakar, A. Vahdat, and M. Yasuda, "Less is more: Trading a little bandwidth for ultra-low latency in the data center," in NSDI, 2012.

[6] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz, "DeTail: Reducing the flow completion time tail in datacenter networks," in SIGCOMM, 2012.

[7] K. Nagaraj, C. Killian, and J. Neville, "Structured comparative analysis of systems logs to diagnose performance problems," in NSDI, 2012.

[8] C. C. Aggarwal, Outlier Analysis. Springer, 2013.

[9] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. Wiley, 1987.

[10] E. Wu and S. Madden, "Scorpion: Explaining away outliers in aggregate queries," Proc. VLDB Endow., vol. 6, no. 8, Jun. 2013.

[11] A. X. Zheng, J. Lloyd, and E. Brewer, "Failure diagnosis using decision trees," in ICAC, 2004.

[12] I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, and J. S. Chase, "Correlating instrumentation data to system states: A building block for automated diagnosis and control," in OSDI, 2004.

[13] S. Duan, S. Babu, and K. Munagala, "FA: A system for automating failure diagnosis," in IEEE ICDE, 2009.

[14] P. Bodik, M. Goldszmidt, A. Fox, D. B. Woodard, and H. Andersen, "Fingerprinting the datacenter: Automated classification of performance crises," in EuroSys, 2010.

[15] S. Roy and D. Suciu, "A formal approach to finding explanations for database queries," in SIGMOD, 2014, pp. 1579–1590.

[16] B. Micenkova, R. T. Ng, X. H. Dang, and I. Assent, "Explaining outliers by subspace separability," in ICDM, 2013.

[17] E. Muller, M. Schiffer, and T. Seidl, "Statistical selection of relevant subspace projections for outlier ranking," in ICDE, 2011, pp. 434–445.

[18] Q. Ma, S. Muthukrishnan, and M. Sandler, "Frugal streaming for estimating quantiles," in Space-Efficient Data Structures, Streams, and Algorithms, 2013.

[19] G. S. Manku, S. Rajagopalan, and B. G. Lindsay, "Approximate medians and other quantiles in one pass and with limited memory," in ACM SIGMOD, 1998.

[20] M. D. Atkinson, J.-R. Sack, N. Santoro, and T. Strothotte, "Min-max heaps and generalized priority queues," CACM, vol. 29, no. 10, Oct. 1986.

[21] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, no. 1, pp. 79–86, 1951.

[22] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, 2000.

[23] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," SIGMOD Rec., vol. 29, no. 2, 2000.

[24] M. Mao and M. Humphrey, "A performance study on the VM startup time in the cloud," in IEEE CLOUD, 2012, pp. 423–430.