
Noname manuscript No.
(will be inserted by the editor)

Using Black-Box Performance Models to Detect Performance Regressions under Varying Workloads: An Empirical Study

Lizhi Liao · Jinfu Chen · Heng Li · Yi Zeng · Weiyi Shang · Jianmei Guo · Catalin Sporea · Andrei Toma · Sarah Sajedi

Received: date / Accepted: date

Abstract Performance regressions of large-scale software systems often lead to both financial and reputational losses. In order to detect performance regressions, performance tests are typically conducted in an in-house (non-production) environment using test suites with predefined workloads. Then, performance analysis is performed to check whether a software version has a performance regression against an earlier version. However, the real workloads in the field are constantly changing, making it unrealistic to reproduce the field workloads in predefined test suites. More importantly, performance testing is usually very expensive as it requires extensive resources and lasts for an extended period. In this work, we leverage black-box machine learning models to automatically detect performance regressions in the field operations of large-scale software systems. Practitioners can leverage our approaches to complement or replace resource-demanding performance tests that may not even be realistic in a fast-paced environment. Our approaches use black-box models to capture the relationship between the performance of a software system (e.g., CPU usage) under varying workloads and the runtime activities that are recorded in the readily-available logs. Then, our approaches compare the black-box model derived from the current software version with the model derived from an earlier version to detect performance regressions between these two versions. We performed empirical experiments on two open-source systems and applied our approaches on a large-scale industrial system. Our results show that such black-box models can effectively detect both real and injected performance regressions, in a timely manner, under varying workloads that are unseen when training these models. Our approaches have been adopted in practice to detect performance regressions of a large-scale industrial system on a daily basis.

Lizhi Liao, Jinfu Chen, Yi Zeng, Weiyi Shang
Department of Computer Science and Software Engineering, Concordia University, Montreal, Quebec, Canada
E-mail: {l lizhi, fu chen, ze yi, shang}@encs.concordia.ca

Heng Li
Département de génie informatique et génie logiciel, Polytechnique Montréal, Montreal, Quebec, Canada
E-mail: [email protected]

Jianmei Guo
Alibaba Group, Hangzhou, Zhejiang, China
E-mail: [email protected]

Catalin Sporea, Andrei Toma, Sarah Sajedi
ERA Environmental Management Solutions, Montreal, Quebec, Canada
E-mail: {steve.sporea, andrei.toma, sarah.sajedi}@era-ehs.com

Keywords Performance regression · Black-box performance models · Field workloads · Performance engineering

1 Introduction

Many large-scale software systems (e.g., Amazon, Google, Facebook) provide services to millions or even billions of users every day. Performance regressions in such systems usually lead to both reputational and monetary losses. For example, a recent incident of performance regressions at Salesforce1 affected eight data centers, resulting in a negative impact (i.e., slowdown and failures of the provided services) on the experience of clients across many organizations (Cannon, 2019). It is reported that 59% of Fortune 500 companies experience at least 1.6 hours of downtime per week (Syncsort, 2018), while most field failures are due to performance issues rather than feature issues (Weyuker and Vokolos, 2000). To avoid such performance-related problems, practitioners perform in-house performance testing to detect performance regressions of large-scale software systems (Jiang and Hassan, 2015).

For in-house performance testing, load generators (e.g., JMeter2) are used to mimic thousands or millions of users (i.e., the workload) accessing the system under test (SUT) at the same time. Performance data (e.g., CPU usage) is recorded during the performance testing for later performance analysis. To determine whether a performance regression exists, the performance data of the current software version is compared to the performance data from running an earlier version on the same workload, using various analysis techniques (Gao et al., 2016) (e.g., control charts (Nguyen et al., 2012)). However, such in-house performance testing suffers from two important challenges. First, the real workloads in the field are constantly changing and difficult to predict (Syer et al., 2014), making it unrealistic to test such dynamic workloads in a predefined manner.

1 https://www.salesforce.com
2 https://jmeter.apache.org


Second, performance testing is usually very expensive as it requires extensive resources (e.g., computing power) and lasts for a long time (e.g., hours or even weeks). For software systems that follow fast release cycles (e.g., Facebook releases every few hours), conducting performance testing in a short period of time becomes particularly difficult.

There exist software testing techniques such as A/B testing (Xu et al., 2015) and canary releasing (Sato, 2014) that depend on end users' data in the field for software quality assurance. Similarly, in this work, we propose to directly detect performance regressions based on the end users' data of large-scale software systems when they are deployed in the field3. Such automated detection can complement or even replace typical in-house performance testing when testing resources are limited (e.g., in an agile environment) and/or the field workloads are challenging to reproduce in the testing environment. In particular, we leverage sparsely-sampled performance metrics and readily-available execution logs from the field to detect performance regressions, which adds only negligible performance overhead to the SUT.

Inspired by prior research (Farshchi et al., 2015; Yao et al., 2018), we use black-box machine learning and deep learning techniques (i.e., CNN, RNN, LSTM, Linear Regression, Random Forest, and XGBoost) to capture the relationship between the runtime activities that are recorded in the readily-available logs of a software system (i.e., the independent variables) and its performance under such activities (i.e., the dependent variables). We use the black-box model that describes the current version of a software system and the black-box model that describes an earlier version of the same software system to determine the existence of performance regressions between these two versions.

Prior research that builds black-box models to detect performance regressions (Farshchi et al., 2015; Yao et al., 2018) typically depends on the data from in-house performance testing, where the workload is consistent between two releases. However, when using the field data from the end users, one may not assume that the workloads from the two releases are the same. In order to study whether the use of black-box performance models under such inconsistent workloads may still successfully detect performance regressions, we performed empirical experiments on two open-source systems (i.e., OpenMRS and Apache James) and applied our approaches on a large-scale industrial system. In particular, our study aims to answer two research questions (RQs):

RQ1: How well can we model system performance under varying workloads?
In order to capture the performance of a system under varying workloads, we built six machine learning models (including three traditional models and three deep neural networks). We found that simple traditional models can effectively capture the relationship between the performance of a system and its dynamic activities that are recorded in logs. In addition, such models can equivalently capture the performance of a system under new workloads that are unseen when building the models.

3 Note that our approach serves different purposes and has different usage scenarios from A/B testing and canary releasing, as discussed in Section 9.


RQ2: Can our approaches detect performance regressions under varying workloads?
Our black-box model-based approaches can effectively detect both the real performance regressions and injected ones under varying workloads (i.e., when the workloads are unseen when training the black-box models). Besides, our approaches can detect performance regressions with the data from a short period of operations, enabling early detection of performance regressions in the field.

Our work makes the following key contributions:

– Our study demonstrates the effectiveness of using black-box machine learning models to model the performance of a system when the workloads are evolving (i.e., the model trained on a previous workload can capture the system performance under new unseen workloads).

– Our experimental results show that we can use black-box machine learning models to detect performance regressions in the field where workloads are constantly evolving (i.e., two releases of the same system are never running the same workloads).

– Our approaches can complement or even replace traditional in-house performance testing when testing resources are limited.

– We shared the challenges and the lessons that we learned from the successful adoption of our approaches in industry, which can provide insights for researchers and practitioners who are interested in detecting performance regressions in the field.

The remainder of the paper is organized as follows. Section 2 introduces the background of black-box performance modeling. Section 3 outlines our approaches for detecting performance regressions in the field. Section 4 and Section 5 present the setup and the results of our case study on two open-source subject systems, respectively. Section 6 shares our experience of applying our approaches on a large-scale industrial system. The challenges and the lessons that we learned from the successful adoption of our approaches are discussed in Section 7. Section 8 and Section 9 discuss the threats to the validity of our findings and the related work, respectively. Finally, Section 10 concludes the paper.

2 Background: Black-box performance models

Performance models are built to model system operations in terms of the resources consumed. Performance models can be used to predict the performance of systems for the purpose of workload design (Krishnamurthy et al., 2006; Syer et al., 2017; Yadwadkar et al., 2010), resource control (Cortez et al., 2017; Gong et al., 2010), configuration selection (Guo et al., 2013, 2018; Valov et al., 2017), and anomaly detection (Farshchi et al., 2015; Ghaith et al., 2016; Ibidunmoye et al., 2015). Performance models can be divided into two categories (Didona et al., 2015). One category is white-box performance models, also called analytical models. White-box performance models predict performance based on assumptions about the system context, i.e., the workload intensity and service demand.

The other category is black-box performance models. Black-box performance models employ statistical modeling or machine learning techniques to predict system performance by treating the system as a black box. Black-box performance models typically require no knowledge about the system's internal behavior (Didona et al., 2015; Gao et al., 2016). Such models apply various machine learning algorithms to model a system's performance behavior, taking system execution logs or performance metrics as input. The main strength of black-box modeling techniques is that they can model the system performance when the system is under field operations. White-box performance models, in contrast, typically require knowledge about the system's internal behaviour, and such knowledge is often not available when the system is deployed in the production environment. Thus, in the production environment, black-box approaches appear to be more effective than white-box modeling approaches.

Typical examples of black-box performance models are linear regression models (Shang et al., 2015; Yao et al., 2018). For example, prior studies use linear regression models to capture the relationship between a target performance metric (e.g., CPU usage) and the system operations (Yao et al., 2018) or other performance metrics (e.g., memory utilization) (Shang et al., 2015):

y = α + β_1 x_1 + β_2 x_2 + ... + β_n x_n    (1)

where the response variable y is the target performance metric, while the explanatory variables (x_1, x_2, ..., x_n) are the frequencies of each operation of the system (Yao et al., 2018) or each of the other performance metrics (Shang et al., 2015). The coefficients β_1, β_2, ..., β_n describe the relationship between the target performance metric and the corresponding explanatory variables.
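As a concrete illustration of Equation (1), the short sketch below is a minimal example (assuming scikit-learn and pandas, with illustrative per-period operation counts of the kind shown later in Table 1) that fits such a linear model and inspects the learned intercept α and coefficients β_i.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Per-period data: operation counts as explanatory variables,
# CPU usage as the response variable (illustrative values).
data = pd.DataFrame({
    "GET person":      [17, 10, 5, 6, 0],
    "GET observation": [18, 2, 19, 6, 3],
    "cpu":             [74.19, 52.25, 64.61, 56.26, 55.57],
})

X = data.drop(columns=["cpu"])   # x_1, ..., x_n: operation frequencies
y = data["cpu"]                  # y: target performance metric

model = LinearRegression().fit(X, y)
print("alpha (intercept):", model.intercept_)
print("beta coefficients:", dict(zip(X.columns, model.coef_)))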

Linear regression requires that the input data follow a normal distribution. With the benefit of being applicable to data from any distribution, regression trees (Xiong et al., 2013) have been used to model system performance. A regression tree models a system's performance behavior using a tree-like structure. Quantile regression (de Oliveira et al., 2013) can model the different phases of a system's performance behavior.

The prior success of black-box performance modeling supports our study to detect performance regressions under varying workloads from the field.

3 Approaches

In this section, we present our three approaches that automatically detect performance regressions in a new version of a software system based on the logs and performance data that are collected under varying workloads. Our approaches contain three main steps: 1) extracting metrics, 2) building black-box performance models, and 3) detecting performance regressions. The three approaches share the first two steps, while differing in the third step. An overview of our studied approaches is shown in Figure 1.

3.1 Extracting metrics

We aim to identify whether there is a performance regression in the new version of the system based on modeling the relationship between the system performance (e.g., CPU usage) and the corresponding logs that are generated during system execution. Specifically, we collect system access logs that are readily generated during the execution of web servers, such as Jetty, Tomcat, and IIS (Internet Information Services). Those logs represent the workload of the system during a period of execution. The performance data is sparsely sampled at a fixed frequency (e.g., every 30 seconds).

In this step, we extract log metrics and performance metrics from the system access logs and the recorded performance data, respectively.

3.1.1 Splitting logs and performance data into time periods

We would like to establish the relationship between the execution of the system and the performance of the system during run time. Since both performance data and logs are generated during system runtime and are not synchronized, i.e., there is no corresponding record of performance metrics for each line of logs, we first align the logs and the records of performance data by splitting them into time periods. For example, one may split a two-hour dataset into 120 time periods where each time period is one minute of the data. Each log line and each record of performance data are allocated to their corresponding time period.

3.1.2 Extracting log metrics and performance metrics

We parse the collected logs into events and their corresponding time stamps. For example, a line of web log "[2019-09-27 22:43:13] GET /openmrs/ws/rest/v1/person/ HTTP/1.1 200" will be parsed into the corresponding web request or URL "GET /openmrs/ws/rest/v1/person/" and the time stamp "2019-09-27 22:43:13". Afterwards, the value of a log metric is the number of times that the corresponding log event executes during a time period. For example, if the log event "GET /openmrs/ws/rest/v1/person/" is executed 10 times during a 30-second time period, the corresponding log metric for "GET /openmrs/ws/rest/v1/person/" has a value of 10 for that period. Then, we consider the aggregation (e.g., taking the average) of the records of performance data as the value of the performance metric of the time period, similar to prior research (Foo et al., 2010). Table 1 shows an illustrative example of log metrics and performance metrics in five 30-second time periods.

These log metrics, together with the performance metrics in the corresponding time periods, are used to construct our performance models. Each data point of a log metric is based on a time period (e.g., 30 seconds) instead of a log entry. For example, if we run the subject system for eight hours and we take each 30 seconds as a time period, there will be 960 data instances in total: 8 (hours) * 60 (minutes per hour) * 2 (time periods per minute).

Table 1 An illustrative example of log metrics and performance metric data from OpenMRS

Time slice            Log metrics*                      Performance metric
                      GET person    GET observation     CPU
1 sec. - 30 sec.      17            18                  74.19
31 sec. - 60 sec.     10            2                   52.25
61 sec. - 90 sec.     5             19                  64.61
91 sec. - 120 sec.    6             6                   56.26
121 sec. - 150 sec.   0             3                   55.57

* In this example, we show the log metrics that are used in the linear regression, random forest, and XGBoost models. For CNN, RNN, and LSTM, we sort the logs in each time period by their corresponding time stamp to create a sequence of log events.
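A minimal sketch of this extraction step is shown below. It assumes the access-log format illustrated above, hypothetical file names (access.log and cpu_samples.csv), and pandas for bucketing log events and CPU samples into 30-second periods; the actual implementation used in the study may differ.

import pandas as pd

PERIOD = "30s"

# Parse access logs into (timestamp, event) pairs; the format follows the example
# "[2019-09-27 22:43:13] GET /openmrs/ws/rest/v1/person/ HTTP/1.1 200".
def parse_log_line(line):
    ts = pd.to_datetime(line[1:20])          # "2019-09-27 22:43:13"
    event = " ".join(line[22:].split()[:2])  # e.g., "GET /openmrs/ws/rest/v1/person/"
    return ts, event

with open("access.log") as f:                # hypothetical file name
    rows = [parse_log_line(l) for l in f if l.strip()]
logs = pd.DataFrame(rows, columns=["time", "event"])

# Log metrics: count of each event per 30-second period.
log_metrics = (logs.groupby([pd.Grouper(key="time", freq=PERIOD), "event"])
                   .size().unstack(fill_value=0))

# Performance metric: average of the sampled CPU usage per period.
cpu = pd.read_csv("cpu_samples.csv", parse_dates=["time"])  # columns: time, cpu
perf_metric = cpu.set_index("time").resample(PERIOD).mean()

dataset = log_metrics.join(perf_metric, how="inner")  # one row per time period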

[Figure 1 (flow diagram): 1. Extracting metrics (1.1 splitting logs and performance data into time periods; 1.2 extracting log metrics and performance metrics), 2. Building black-box performance models (2.1 reducing metrics; 2.2 building models), and 3. Detecting performance regressions (Approach 1, Approach 2, Approach 3).]

Fig. 1 An overview of our studied approaches to detecting performance regressions.

3.2 Building black-box performance models

In this step, we use the log metrics and performance metrics extracted from the last step to build black-box performance models. The log metrics and the performance metrics (e.g., CPU usage) are used as the independent variables and dependent variables of the models, respectively.

3.2.1 Reducing metrics

The frequency of some log events (e.g., periodical events) may not change over time. The constant appearance of such events may not provide information about the changes in system workloads. Therefore, after we calculate the log metrics of each log event, we reduce the log metrics by removing redundant log metrics or log metrics with constant values in both the previous and current versions. We first remove log metrics that have zero variance in both versions of the performance tests.

Different log events may always appear at the same time (e.g., a user logging in and checking the user's privileges) and provide repetitive information about the workloads. To avoid bias from such repetitive information, we then perform correlation analyses and redundancy analyses on the log metrics. We calculate Pearson's correlation coefficient (Benesty et al., 2009) between each pair of log metrics (i.e., in total N*N/2 correlation calculations, where N refers to the number of log metrics). If a pair of log metrics has a correlation higher than 0.7, we remove the one that has a higher average correlation with all other metrics. We repeat the process until there exists no correlation higher than 0.7. The redundancy analysis considers a log metric redundant if it can be predicted from a combination of other log metrics. We use each log metric as a dependent variable and use the rest of the log metrics as independent variables to build a regression model. We calculate the R2 of each model. If the R2 is larger than a threshold (e.g., 0.9), the current dependent variable (i.e., the log metric) is considered redundant. We then remove the log metric with the highest R2 and repeat the process until no log metric can be predicted with an R2 higher than the threshold.

We only apply this step when using traditional statistical or machine learning models (e.g., linear regression or random forest); if a deep neural network (e.g., a convolutional neural network or a recurrent neural network) is adopted to build the black-box performance models, we skip this step.
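The metric-reduction procedure described above could be sketched as follows. This is a minimal illustration assuming pandas/scikit-learn and a data frame log_metrics of per-period event counts; the thresholds of 0.7 and 0.9 follow the text, while the helper names are ours, not the study's.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

CORR_THRESHOLD, R2_THRESHOLD = 0.7, 0.9

def drop_constant(metrics):
    # Remove log metrics with zero variance (constant values).
    return metrics.loc[:, metrics.var() > 0]

def drop_correlated(metrics):
    # Iteratively remove one metric of the most correlated pair (Pearson);
    # the one with the higher average correlation to all others is dropped.
    while True:
        corr = metrics.corr(method="pearson").abs()
        vals = corr.values.copy()
        np.fill_diagonal(vals, 0.0)
        if vals.max() <= CORR_THRESHOLD:
            return metrics
        i, j = np.unravel_index(vals.argmax(), vals.shape)
        a, b = corr.columns[i], corr.columns[j]
        drop = a if corr[a].mean() >= corr[b].mean() else b
        metrics = metrics.drop(columns=[drop])

def drop_redundant(metrics):
    # Iteratively remove the metric best predicted (highest R^2) from the
    # remaining metrics, until no R^2 exceeds the threshold.
    while metrics.shape[1] > 2:
        r2 = {c: LinearRegression()
                 .fit(metrics.drop(columns=[c]), metrics[c])
                 .score(metrics.drop(columns=[c]), metrics[c])
              for c in metrics.columns}
        worst = max(r2, key=r2.get)
        if r2[worst] <= R2_THRESHOLD:
            break
        metrics = metrics.drop(columns=[worst])
    return metrics

# reduced = drop_redundant(drop_correlated(drop_constant(log_metrics)))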

3.2.2 Building models

We build models that capture the relationship between a certain workload that is represented by the logs and the system performance. In particular, the independent variables are the log metrics from the last step and the dependent variable is the target performance metric (e.g., CPU usage). One may choose different types of statistical, machine learning, or deep learning models, as our approach is agnostic to the choice of models. However, the results of using different types of models may vary (cf. RQ1).


3.3 Detecting performance regressions

The goal of building performance models is to detect performance regressions. Therefore, in this step, we use the black-box performance models that are built from an old version of the system to predict the expected system performance of a new version. Then, we use statistical analysis to determine whether there exists performance deviance based on the prediction errors of the models. The input of this step is the black-box performance models built from the old version and the new version; the output of this step is the determination of whether there exists performance deviance between the old version and the new version of the system.

Intuitively, one may use the model that is built from the old version of the system to predict the performance metrics from running the new version of the system. By measuring the prediction error, one may be able to determine whether there exists performance deviance (Cohen et al., 2004; Nguyen et al., 2012). However, such a naive approach may be biased by the choice of thresholds that are used to determine whether there is performance deviance. For example, a well-built performance model may only have less than 5% average prediction error, while another, less fit performance model may have 15% average prediction error. In these cases, it is challenging to determine whether an average prediction error of 10% on the new version of the system should be considered a performance regression. Therefore, statistical analyses are used to detect performance regressions in a systematic manner (Foo et al., 2015; Gao et al., 2016; Shang et al., 2015).

In particular, we leverage three approaches to detect performance regressions: Approach 1) comparing the predicted performance metrics (using the model built from the old version) and the actual performance metrics of the new version, Approach 2) comparing the predicted performance metrics of the new version using the model built from the old version and the model built from the new version, and Approach 3) comparing the prediction errors of the performance metrics on the new version using the model built from the old version and the model built from the new version. We describe each approach in detail in the rest of this subsection.

Approach 1: comparing the predicted performance metrics (using the model built from the old version) and the actual performance metrics of the new version. The most intuitive way of detecting a performance regression is to compare the predicted and the actual values of the performance metrics. Since the model is built from the old version of the system, if the actual value of the performance metrics from the new version of the system is higher than the predicted value and the difference between them is large, we may consider the existence of performance regressions. In particular, we use the data from the old version of the system, i.e., Data_old, to build a black-box performance model Model_old (cf. Section 3.2). Afterwards, we apply the model Model_old on the data from the new version of the system, i.e., Data_new. Then we compare the predicted and the actual values of the performance metrics.


Approach 2: comparing the predicted performance metrics of the new version using the model built from the old version and the model built from the new version. Since our approach aims to be applied to varying workloads, the performance regression may only impact a small number of time periods, while the source code with performance regressions may not be executed in other time periods. Therefore, only the time periods that are impacted by the performance regressions may contain large prediction errors. To address this issue, we also build a performance model Model_new using the data from the new version of the system (Data_new). This way, Model_new is built using the data with potential performance regressions. Afterwards, we use Model_old and Model_new to predict the performance metrics in Data_new.

Since Model_new is built from Data_new, there exists a bias when applying Model_new on Data_new, as the training data and testing data are exactly the same (a model without any generalizability can just remember all the instances in the training data and make perfect predictions). To avoid this bias, instead of building one Model_new, we build n models, where n is the number of data points that exist in Data_new. In particular, for each data point in Data_new, we build a performance model Model_new^n by excluding that data point and apply the model Model_new^n on the excluded data point. Therefore, for n data points from Data_new, we end up having n models and n predicted values.
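A minimal sketch of this leave-one-out scheme is given below; it assumes scikit-learn's random forest (one of the models studied later) and per-period data frames X_old/y_old and X_new/y_new from the metric-extraction step. It illustrates the idea rather than the exact implementation used in the study.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

def loo_predictions(X_new, y_new):
    # For each data point in Data_new, train a Model_new^n on the remaining
    # n-1 points and predict the held-out point.
    preds = np.empty(len(y_new))
    for train_idx, test_idx in LeaveOneOut().split(X_new):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X_new.iloc[train_idx], y_new.iloc[train_idx])
        preds[test_idx] = model.predict(X_new.iloc[test_idx])
    return preds

def predictions_for_comparison(X_old, y_old, X_new, y_new):
    # Model_old is trained once on the old version's data and applied to Data_new.
    model_old = RandomForestRegressor(n_estimators=100, random_state=0)
    model_old.fit(X_old, y_old)
    pred_old_on_new = model_old.predict(X_new)        # used by Approaches 1-3
    pred_new_on_new = loo_predictions(X_new, y_new)   # used by Approaches 2-3
    return pred_old_on_new, pred_new_on_new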

Finally, we compare the predicted values obtained by applying Model_old on Data_new and the predicted values obtained by applying each Model_new^n on each of the n data points in Data_new. Similar to Approach 1, if the predicted values from each Model_new^n on the n data points in Data_new are higher than the predicted values from Model_old on Data_new and the difference is large, we may consider the existence of performance regressions.

Approach 3: comparing the prediction errors on the new version using the model built from the old version and the model built from the new version. The final approach for detecting performance regressions is similar to the previous one (Approach 2), except that instead of directly comparing the predicted performance metrics, we compare the distribution of the prediction errors from applying Model_old on Data_new with the prediction errors from applying each Model_new^n on each of the n data points in Data_new. The intuition is that, when Model_old has a larger prediction error than Model_new, it may be an indication of performance deviance between the two versions. We note that this approach can only determine whether there exists performance deviance, which may actually be an improvement instead of a regression.

Statistical analysis. All three approaches generate two distributions of either actual/predicted performance metric values or prediction errors. We compare the two distributions to determine if there are performance regressions between the two software versions. However, the differences between the data distributions may be due to experimental noise rather than a systematic difference between the two versions. Therefore, similar to previous studies (Chen et al., 2016), we use statistical methods (i.e., statistical tests and effect sizes) to compare the data distributions of the two versions while taking the experimental noise into consideration.


In particular, we use the Mann-Whitney U test since it is non-parametric and does not assume a normal distribution of the compared data. We run the test at the 5% level of significance, i.e., if the p-value of the test is not greater than 0.05, we reject the null hypothesis in favor of the alternative hypothesis, i.e., there exists a statistically significant difference between the performance of the old version and the new version of the system. In order to study the magnitude of the performance deviation without being biased by the size of the data, we further adopt effect sizes as a complement to the statistical significance test. Considering the non-normality of our data, we utilize Cliff's delta (Cliff, 1996) and the thresholds provided in prior research (Romano et al., 2006). For Approaches 1 and 2, we can use the direction (i.e., positive or negative) of the effect size to further tell whether the deviation indicates a performance regression or an improvement.
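The statistical comparison could be sketched as follows, assuming SciPy for the Mann-Whitney U test; Cliff's delta is computed directly from its definition, and the 0.147 boundary is the commonly used threshold below which the effect is considered negligible (Romano et al., 2006). This is an illustration under those assumptions, not the study's exact code.

import numpy as np
from scipy.stats import mannwhitneyu

def cliffs_delta(a, b):
    # Cliff's delta: P(a_i > b_j) - P(a_i < b_j) over all pairs.
    a, b = np.asarray(a), np.asarray(b)
    greater = sum((x > b).sum() for x in a)
    less = sum((x < b).sum() for x in a)
    return (greater - less) / (len(a) * len(b))

def compare_distributions(dist_new, dist_old, alpha=0.05):
    # dist_new / dist_old: actual vs. predicted metrics (Approach 1), two sets
    # of predictions (Approach 2), or two sets of prediction errors (Approach 3).
    _, p_value = mannwhitneyu(dist_new, dist_old, alternative="two-sided")
    delta = cliffs_delta(dist_new, dist_old)
    # Significant and non-negligible difference -> performance deviance;
    # the sign of delta indicates regression vs. improvement (Approaches 1 and 2).
    deviance = p_value <= alpha and abs(delta) >= 0.147
    return p_value, delta, deviance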

4 Case Study Setup

To study the effectiveness of our approaches for detecting performance regressions under varying workloads, we perform case studies on two open-source systems4. In this section, we first present the subject systems. Then, we present the workloads applied to the systems, the experimental environment, and the performance issues. Finally, we present the choices of machine learning models that are used in our approaches.

4.1 Subject systems

We evaluate our approaches on two open-source systems, namely OpenMRS and Apache James. We also discuss the successful application of our approach to System X, a large-scale industrial system that provides government-regulation related reporting services (in Section 6). Apache James is a Java-based mail server. OpenMRS is an open-source health care system that supports customized medical records and is widely used in developing countries. The two open-source subject systems and the industrial system are all studied in prior research (Gao et al., 2016; Yao et al., 2018) and cover different domains, which ensures that our findings are not limited to a specific domain. The details of the two open-source subject systems and the industrial system are shown in Table 2.

Both open-source subject systems and the industrial system are CPU-intensive. Besides, CPU is usually the main contributor to server costs (Greenberg et al., 2008). A performance regression in terms of CPU would result in the need for more CPU resources to provide the same quality of service, thereby significantly increasing the cost of system operations. Therefore, in this study, we use CPU as the performance metric for detecting performance regressions.

4 Our experimental setup, workloads, and results are shared online at https://github.com/senseconcordia/EMSE2020Data as a replication package.


Table 2 Overview of the open-source subject systems and the industrial system.

Subjects       Versions                 Domains       SLOC (K)
OpenMRS        2.1.4                    Medical       67
Apache James   2.3.2, 3.0M1, 3.0M2      Mail Server   37
System X*      10 releases in 2019      Commercial    >2,000

* We share our experience of applying our approaches on System X in Section 6.

Table 3 Summary of the test actions with increased appearances (the ones with •) in different load drivers (load drivers 1-5).

OpenMRS
Creation of patients                                            • • •
Deletion of patients                                            • • •
Searching for patients                                          • •
Editing patients                                                • •
Searching for concepts                                          • • •
Searching for encounters                                        • •
Searching for observations                                      • • •
Searching for types of encounters                               • • •

Apache James
Sending mails with short messages and without attachments
Sending mails with long messages and without attachments        • • • •
Sending mails with short messages and with small attachments    • •
Sending mails with short messages and with large attachments
Sending mails with long messages and with small attachments
Sending mails with long messages and with large attachments     • •
Retrieving entire mails                                          •
Retrieving only the header of mails                              • • •

4.2 Subject workload design

4.2.1 OpenMRS

OpenMRS provides a web-based interface and RESTful services. We used the default OpenMRS demo database in our performance tests. The demo database contains data for over 5K patients and 476K observations. OpenMRS contains four typical scenarios: adding, deleting, searching, and editing operations. We designed different performance tests that are composed of eight test actions, including 1) creation of patients, 2) deletion of patients, 3) searching for patients, 4) editing patients, 5) searching for concepts, 6) searching for encounters, 7) searching for observations, and 8) searching for types of encounters. Those performance tests differ in the ratio between the actions. In order to simulate a more realistic workload in the field, we added random controllers and random order controllers in JMeter to vary the workload. Moreover, we also simulated the variety of the number of users and activities in the field by setting random gaps between the repetitions of each user's activities, randomizing the order of the user activities, and setting a different number of maximum concurrent users for different workloads at different times.

Furthermore, we designed a total of five JMeter-based performance tests, each of which consists of a mixture of random workloads. In particular, to drive different workloads, for each of the load drivers, we pick several different test actions and put them in an extra JMeter loop controller that iterates a random number of times to increase their appearances in the workload. Table 3 shows the detailed choice of the looped actions of each load driver. To make sure that the two versions of the subject systems have different workloads, for one version of the system (e.g., the old version), we run four JMeter tests together to exercise the system. For the other version (e.g., the new version) of the system, we run one additional JMeter test, i.e., a total of five JMeter tests, to ensure that the two versions have undergone different workloads.

Despite the variation in the workloads, we keep our systems unsaturated and running in normal conditions. CPU saturation occurs when the average CPU load increases to a very large value (e.g., 90%) and remains there for a long period of time (Krasic et al., 2007). When the system is saturated, the services of the system are not provided normally and the relationship between the performance of a software system (e.g., CPU usage) and the runtime activities may change. Besides, real software systems are usually not running under saturated conditions. The details of the workload, such as the CPU usage values, are shared in our replication package.

For the OpenMRS system, we do not have any evidence showing performance regressions of specific versions. Therefore, we manually injected several performance regressions into the system. The injected performance regressions are shown in Table 4. The original version, i.e., v0 of OpenMRS, does not have any injected or known performance regressions. We separately injected four performance regressions (heavier DB request, additional I/O, constant delay, and additional calculation) on the basis of version v0. These four versions are called v1, v2, v3, and v4, respectively. The details of the injected performance regressions of OpenMRS are explained below:

• v1: Injected heavier DB request. We increased the number of DB requests in the code responsible for accessing the database in OpenMRS. This performance issue will increase the CPU usage of the MySQL server.

• v2: Added additional I/O access. Since accessing I/O storage devices (e.g., hard drives) is usually slower than accessing memory, we added redundant logging statements to source code that is frequently executed. The execution of the logging statements will introduce a performance regression.

• v3: Created constant delay. We added a delay bug in frequently executed code, which creates a blocking performance issue.

• v4: Injected additional calculation. We added additional calculation to source code of OpenMRS that is frequently executed during the performance test.


We deployed OpenMRS on two machines, each with an Intel Core i5-2400 CPU (3.10GHz), 8 GB of memory, and a 512GB SATA hard drive. One machine is deployed as the application server and the other machine is deployed as the MySQL database server. We ran JMeter against the RESTful API of OpenMRS on five extra machines with the same specification to simulate user workloads on the client side for five hours. Each machine hosts one JMeter instance with one type of workload. We used Pidstat (pid, 2019) to monitor the CPU usage that is used as the performance metric of this study. To minimize the noise from the system warm-up and cool-down periods, we filtered out the data from the first and last half hour of running each workload. Thus, we only kept four hours of data from each performance test.

4.2.2 Apache James

Apache James is an open-source enterprise mail server. It contains two main actions: sending and retrieving mails. Those two actions can be further divided into smaller actions, e.g., sending mails with or without attachments and retrieving an entire mail or only the header of the mail. In total, we built eight actions for the performance test, including 1) sending mails with short messages and without attachments, 2) sending mails with long messages and without attachments, 3) sending mails with short messages and with small attachments, 4) sending mails with short messages and with large attachments, 5) sending mails with long messages and with small attachments, 6) sending mails with long messages and with large attachments, 7) retrieving entire mails, and 8) retrieving only the header of mails. Similar to OpenMRS, we also created a total of five different workloads, where each workload consists of a mixture of the eight test actions with different ratios. We conduct performance tests on different versions of Apache James with varying workloads (i.e., 4 tests for one version and 5 tests for the other version). We used JMeter to create performance tests that exercise Apache James. Different from OpenMRS, in which the performance regressions are manually injected, for Apache James we performance-tested three different versions with known real-world performance issues. Apart from the last stable version 2.3.2, there are multiple milestone releases for version 3.0. After checking the release notes on the Apache James website, we picked 3.0M1 and 3.0M2 to use in our study. These three selected versions have many bug fixes and performance improvements (Apa, 2019). The same subjects are studied in prior research on performance testing (Gao et al., 2016). Table 4 summarizes the details of the performance improvements of the three selected versions.

We deployed Apache James on a server machine with an Intel Core i7-8700K CPU (3.70GHz), 16 GB of memory, and a 3TB SATA hard drive. We ran JMeter on five extra machines with a configuration of an Intel Core i5-2400 CPU (3.10GHz), 8 GB of memory, and a 320GB SATA hard drive. These JMeter instances generate multiple workloads for five hours. Similar to OpenMRS, we use Pidstat (pid, 2019) to monitor the CPU usage as the performance metric of this study. We filtered out the data from the first and last half hour of the tests to minimize the system noises.

Table 4 Performance regressions in the studied open-source systems

System         Versions   Performance regressions
OpenMRS        v0         Original version
               v1         Injected heavier DB request
               v2         Added additional I/O access
               v3         Created constant delay
               v4         Injected additional calculation
Apache James   2.3.2      Stable release version
               3.0M1      Improved ActiveMQ spool efficiency (Apa, 2019)
               3.0M2      Improved large attachment handling efficiency (Apa, 2019)

4.2.3 Generated workloads

Our generated workloads are different across different software versions. By conducting chi-square tests of independence between the number of different actions and the versions, we find that in both OpenMRS and Apache James, the number of different actions and the versions are not statistically independent (p-value < 0.05), which means that the distribution of the actions is statistically different across versions. For example, for the OpenMRS system, the total number of all actions ranges from 10,673 to 22,392 across versions. In particular, the number of the deletion of patients action ranges from 293 to 1,861 across all versions. As another example, from Apache James, the number of the retrieving only the header of mails action ranges from 411 to 25,532 across all versions. In addition, we provide the detailed numbers of each action that are sent to each version in our replication package.
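A chi-square test of independence of this kind can be computed as in the minimal sketch below (assuming SciPy; the contingency table shown is hypothetical and only loosely inspired by the counts reported above).

from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are versions, columns are counts of
# each test action observed under that version's generated workload.
action_counts = [
    [1200,  293,  880,  640],   # e.g., old version
    [1850, 1861,  910,  700],   # e.g., new version
]
chi2, p_value, dof, expected = chi2_contingency(action_counts)
print(f"chi2 = {chi2:.1f}, p-value = {p_value:.3g}")  # p < 0.05: distributions differ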

Table 5 summarizes the average CPU utilization of each version of the systems under our generated dynamic workloads. We would like to note that when we observe a higher CPU usage, it may not correspond to a performance regression; instead, it may just be due to the system having a higher workload. The influence of the dynamic workload on the CPU usage is one of the reasons that we leverage performance modeling to detect performance regressions in the field. The detailed CPU utilization information during the entire test time period is reported in the replication package.

4.3 Subject models

Our approach presented in Section 3 is not designed strictly for any particular type of machine learning model. In fact, practitioners may choose their preferred models. In this study, we study the use of six different types of models, namely linear regression, random forest, XGBoost, convolutional neural network (CNN), recurrent neural network (RNN), and long short-term memory (LSTM).


Table 5 Summary of average CPU utilization of each version in our case studies.

OpenMRS
v0 (4W)   v0 (5W)   v1      v2      v3      v4
52.03     52.24     55.55   53.76   53.05   55.11

Apache James
3.0M2 (4W)   3.0M2 (5W)   3.0M1   2.3.2
0.48         0.51         3.54    4.89

In particular, for linear regression, random forest, and XGBoost, the inputs of the models are vectors whose values are the numbers of appearances of each log event in a time period. For CNN, RNN, and LSTM, we sort the logs in each time period by their corresponding time stamp to create a sequence as the input of the neural networks. Our XGBoost model is fine-tuned using the GridSearchCV (gir, 2019) method. Our CNN, RNN, and LSTM have three, four, and four layers, respectively, and they are trained with five, five, and ten epochs, respectively. The numbers of layers and epochs are manually fine-tuned to avoid overfitting.
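For instance, the XGBoost tuning step could look like the sketch below (assuming the xgboost and scikit-learn packages; the parameter grid is illustrative and not necessarily the one used in the study).

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

def tune_xgboost(X_train, y_train):
    # X_train: per-period log metrics; y_train: the target performance metric.
    param_grid = {                      # illustrative grid, not the study's exact one
        "n_estimators": [100, 300],
        "max_depth": [3, 6],
        "learning_rate": [0.05, 0.1],
    }
    search = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                          param_grid, cv=5,
                          scoring="neg_median_absolute_error")
    search.fit(X_train, y_train)
    return search.best_estimator_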

5 Case Study Results

In this section, we evaluate our approaches by answering two research questions.

RQ1: How well can we model system performance under varying workloads?

Motivation

In order to use black-box machine learning models to detect performance regressions under varying workloads, we first need to understand whether such black-box models can accurately model the performance of a software system under varying workloads. Although prior research demonstrates promising results of using black-box models to capture the relationship between system performance and logs (Farshchi et al., 2015; Yao et al., 2018), these models are built on predefined in-house workloads instead of varying field workloads. If such black-box models are sensitive to the variance in the workloads, they may not be suitable for modeling the performance of software systems under the field workloads from real end users.

Approach

In order to understand the black-box models' ability to model system performance under varying workloads, we train the models on one set of workloads and evaluate the performance of the models on a different set of workloads (i.e., unseen workloads).


Performance modeling. For each open-source system (i.e., OpenMRS and Apache James), we select the version of the system that is without performance regressions. We first run the system with four different concurrent workloads (i.e., 4W) and collect the logs and performance metrics. In order to ensure having new workloads to the system, we conduct another run by adding an additional concurrent workload, i.e., having five different concurrent workloads (i.e., 5W). We build the performance models using the data that is generated by running the 4W workloads (a.k.a. the training set) and apply the models on the data that is generated by running the 5W workloads (a.k.a. the testing set). The training and testing sets are derived from independent system runs. We evaluate the prediction performance of the black-box models on the 5W workloads by comparing the predicted performance and the measured performance.

Analysis of modeling results. We calculate the prediction errors of the models on the new workloads (i.e., the 5W workloads for the open-source systems). In order to understand the magnitude of the prediction errors, we use the prediction errors of the models on the old workloads (i.e., the 4W workloads for the open-source systems) as baselines. The baseline prediction errors are calculated using 10-fold cross-validation to avoid the bias of having the same training and testing data. In particular, we compare the following measures (a minimal computation sketch follows the list):

– Median relative error. The difference between the predicted performance and the measured performance, normalized by the measured performance.

– p-value (Mann-Whitney U). In order to understand whether the models trained on the old workloads can equivalently capture the system performance under the new workloads, we use the Mann-Whitney U test (Nachar et al., 2008) to determine whether there exists a statistically significant difference (i.e., p-value < 0.05) between the prediction errors on the new workloads and the prediction errors on the old workloads.

– Effect size (Cliff's delta). Reporting only the statistical significance may lead to erroneous results (i.e., if the sample size is very large, the p-value can be very small even if the difference is trivial) (Sullivan and Feinn, 2012). Therefore, we apply Cliff's delta (Cliff, 1996) to quantify the effect size of the difference between the prediction errors on the old workloads and the prediction errors on the new workloads.
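As a minimal illustration, the sketch below computes the median relative errors for the baseline (10-fold cross-validation on the 4W workloads) and for the unseen 5W workloads, assuming scikit-learn and random forest as an example model; the p-value and effect size are then obtained by comparing the two error distributions as in Section 3.3.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def median_relative_errors(X_4w, y_4w, X_5w, y_5w):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    # Baseline: 10-fold cross-validated errors on the old (4W) workloads.
    baseline_pred = cross_val_predict(model, X_4w, y_4w, cv=10)
    baseline_err = np.abs(baseline_pred - y_4w) / y_4w * 100
    # Errors of the model trained on 4W and applied to the unseen 5W workloads.
    model.fit(X_4w, y_4w)
    new_err = np.abs(model.predict(X_5w) - y_5w) / y_5w * 100
    return np.median(baseline_err), np.median(new_err)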

Results

Table 6 shows the detailed results of using six machine learning techniques to model the performance of the studied open-source systems (i.e., OpenMRS and Apache James) under varying workloads. The column "MRE" shows the median relative errors of applying the models (trained on the 4W workloads) on the 4W workloads and the 5W workloads, respectively. The columns "p-value" and "effect size" show the statistical significance and the effect size of the difference between the prediction errors of the models on the 4W workloads and the prediction errors of the models on the 5W workloads, respectively. The "violin plot" column shows the distribution of the relative prediction errors under the 4W and 5W workloads.


Table 6 Prediction error details for OpenMRS and Apache James under different workloads.

                          OpenMRS                                  Apache James
Model                     MRE      p-value  Effect size            MRE      p-value  Effect size
Linear Regression   4W    3.12%    <0.01    -0.08 (negligible)     23.24%   0.41     N/A
                    5W    3.58%                                    23.76%
Random Forest       4W    2.36%    <0.01    -0.09 (negligible)     22.99%   0.42     N/A
                    5W    2.90%                                    23.08%
XGBoost             4W    2.16%    0.14     N/A                    24.29%   0.44     N/A
                    5W    2.11%                                    24.42%
CNN                 4W    9.56%    <0.01    0.12 (negligible)      33.61%   0.11     N/A
                    5W    7.47%                                    37.51%
RNN                 4W    5.63%    0.32     N/A                    51.34%   0.01     0.07 (negligible)
                    5W    5.73%                                    47.22%
LSTM                4W    4.53%    <0.01    -0.25 (small)          34.88%   <0.01    -0.27 (small)
                    5W    6.42%                                    57.11%

Note: The column "MRE" presents the median relative errors. (The violin plots of the relative-error distributions under the 4W and 5W workloads are omitted here.)


Our black-box models can effectively model the performance of the studied systems using the dynamic runtime activities that are recorded in the logs. As shown in Table 6, the XGBoost model achieves the best results for modeling the performance of the OpenMRS system, with a median relative error of 2.11% on the 5W workloads (i.e., the new workloads) and a median relative error of 2.16% on the 4W workloads (i.e., the baseline workloads). The random forest model achieves the best results for Apache James, reaching median relative errors of 22.99% and 23.08% for the 4W workloads and 5W workloads, respectively. All the machine learning models achieve better results for the OpenMRS system than for the Apache James system. The less promising results for the Apache James system might be explained by the latency between the actual activities of the mail server and the recorded logs. For example, the system can take an extended period of time to process an email with a large attachment, while a log about the successful processing of the email is only printed after the processing period.

The traditional models (e.g., linear regression and random forest) outperform the deep neural networks (e.g., CNN and RNN) for modeling the performance of the studied systems. As shown in Table 6, for both the OpenMRS and the Apache James systems, the three traditional models achieve better results than the three deep neural networks for modeling the system performance. These results indicate that the relationship between the system performance and the runtime activities recorded in the logs can be effectively captured by simple traditional models. Such results also agree with a recent study (Dacrema et al., 2019) that compares deep neural networks and traditional models for the application of automated recommendations.

Our black-box models can equivalently explain the performance of a system under new workloads that are unseen when constructing the models. Table 6 shows the statistical significance (i.e., the p-value) and the effect size of the difference between the prediction errors of applying the old models (i.e., trained from the 4W workloads) on the new workloads (i.e., the 5W workloads) and applying the old models on the old workloads (i.e., the 4W workloads). Table 6 also compares the distributions of the prediction errors for the 4W and 5W workloads. The prediction errors of most of the models (except LSTM) have either a statistically insignificant or a negligible difference between the 4W and the 5W workloads, indicating that the models trained from the old workloads can equivalently model the system performance under new workloads. The LSTM model results in a small difference in the prediction errors between the 4W and the 5W workloads. We suspect that the complex LSTM model is likely to over-fit to the training workloads.
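The 4W-versus-5W comparison in Table 6 boils down to a Mann-Whitney U test and Cliff's delta on two samples of relative errors. Below is a minimal, self-contained sketch of that comparison; the two error lists are placeholders, and the magnitude thresholds follow Romano et al. (2006).

```python
# Minimal sketch: compare two distributions of relative prediction errors
# (old 4W workloads vs. new 5W workloads) with a Mann-Whitney U test and
# Cliff's delta. The error values below are placeholders for illustration.
from scipy.stats import mannwhitneyu

errors_4w = [2.1, 3.4, 2.8, 3.0, 2.5]   # per-interval relative errors (%) on 4W
errors_5w = [2.4, 3.1, 2.9, 3.3, 2.7]   # per-interval relative errors (%) on 5W

def cliffs_delta(a, b):
    """Cliff's delta: (#pairs with a>b minus #pairs with a<b) / (|a|*|b|)."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))

def magnitude(d):
    d = abs(d)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

_, p_value = mannwhitneyu(errors_4w, errors_5w, alternative="two-sided")
delta = cliffs_delta(errors_4w, errors_5w)
print(f"p-value = {p_value:.3f}, Cliff's delta = {delta:.2f} ({magnitude(delta)})")
```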

Simple traditional models (e.g., linear regression and random forest) can effectively capture the relationship between the performance of a system and its dynamic activities that are recorded in logs, under varying workloads.


RQ2: Can our approach detect performance regressions under varying workloads?

Motivation

In traditional performance testing, in order to detect performance regressions, performance analysts compare the performance data of two versions of a software system that are generated by running the same workloads from the same performance test suites. However, in a field environment, as the workloads of the systems are constantly changing, it is almost impossible to run two software versions on the same workloads to detect performance regressions. The results of RQ1 show that our black-box models can accurately capture the performance of a software system even under new workloads that are unseen when training the models. Therefore, in this research question, we would like to leverage such black-box models to detect performance regressions when the workloads of the two versions of a system are not consistent.

Running the systems for hours or days before discovering performance regressions incurs a high cost. As the systems are already running in the field, any delay in detecting the performance regressions may have a huge impact on the end users. Hence, the desired approach in practice should be able to detect performance regressions in a timely manner. Therefore, we also want to study how fast our approaches can detect performance regressions.

Approach

The results from RQ1 show that random forest and XGBoost have the lowest prediction errors when modeling performance (cf. Table 6). Since XGBoost requires resource-heavy fine-tuning, we opt to use random forest in this research question. For the open-source systems, we first build the performance models from running the systems without performance regressions under the 4W workloads (i.e., a combination of four different concurrent workloads). We then run the systems without performance regressions under the 5W workloads (i.e., a combination of five different concurrent workloads). Ideally, our approach should not detect performance regressions from these runs. We use such results as a baseline to evaluate the effectiveness of our approach for detecting performance regressions under the new workloads (i.e., the 5W workloads). Afterwards, we run the systems with the performance regressions (cf. Table 4) under the 5W workloads. Our approach should be able to detect performance regressions from these runs.
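The comparison behind this setup can be sketched as follows, in the spirit of the error-based comparison (Approach 3); the precise definitions of Approaches 1-3 are given in Section 3.3, and the model and input arrays below are hypothetical.

```python
# Minimal sketch in the spirit of the error-based comparison (Approach 3):
# compare the absolute prediction errors of a 4W-trained model on a baseline
# 5W run and on a 5W run of the new version. All inputs are hypothetical.
import numpy as np
from scipy.stats import mannwhitneyu

def cliffs_delta(a, b):
    a = np.asarray(a)[:, None]
    b = np.asarray(b)[None, :]
    return float(((a > b).sum() - (a < b).sum()) / (a.shape[0] * b.shape[1]))

def compare_runs(model, X_baseline, y_baseline, X_new, y_new):
    errors_baseline = np.abs(model.predict(X_baseline) - y_baseline)
    errors_new = np.abs(model.predict(X_new) - y_new)
    _, p_value = mannwhitneyu(errors_baseline, errors_new, alternative="two-sided")
    # A significant p-value with a medium or large Cliff's delta flags a deviation;
    # note that Approaches 1 and 2 instead compare predicted values, whose sign
    # indicates whether the deviation is a regression or an improvement.
    return p_value, cliffs_delta(errors_baseline, errors_new)
```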

Finally, to study how fast our approaches can detect performance regressions, for the new versions of the systems that have performance regressions, we first use only the first 15 minutes of data and apply our approach to detect the performance regressions. Then, we follow an iterative approach that adds another 5 minutes of data to the existing data, until our approach can detect the performance regressions (i.e., with a medium or large effect size that is higher than the baseline).
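This iterative procedure can be sketched as a simple loop; get_first_minutes() and effect_size_vs_baseline() are hypothetical helpers that stand in for the data collection and the statistical comparison described above.

```python
# Minimal sketch of the iterative early-detection procedure: start with the
# first 15 minutes of data from the new version and add 5-minute chunks until
# a medium or large effect size above the baseline is observed. The two helper
# functions are hypothetical placeholders.
MEDIUM_THRESHOLD = 0.33  # |Cliff's delta| >= 0.33 is at least "medium"

def earliest_detection_time(baseline_effect_size, max_minutes=240):
    minutes = 15
    while minutes <= max_minutes:
        window = get_first_minutes(minutes)          # hypothetical helper
        effect = effect_size_vs_baseline(window)     # hypothetical helper
        if abs(effect) >= MEDIUM_THRESHOLD and abs(effect) > abs(baseline_effect_size):
            return minutes   # earliest time (in minutes) at which a regression is flagged
        minutes += 5
    return None              # nothing detected within the time budget
```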


Results

Our black-box-based approaches can effectively detect performance regressions under varying workloads. Table 7 shows the results of our three performance regression detection approaches (cf. Section 3.3) on OpenMRS and Apache James. We find that with all three approaches, when there are known performance regressions between two versions, the statistical analysis always shows a significant difference between the two versions with medium or large effect sizes. In addition, the effect sizes from Approaches 1 and 2 are negative, confirming the existence of performance regressions (negative values indicate a performance regression and positive values indicate a performance improvement).

When we compare the effect sizes with the baseline, i.e., running our approaches on systems without performance regressions but under two different workloads, we find that the baseline effect sizes are always smaller than the corresponding ones with performance regressions, except when detecting the regression in v3 of OpenMRS. We attribute this to the nature of the regression in v3, i.e., an injected delay. Since the performance metric we consider is CPU usage and such a delay may not have a large impact on the CPU usage, it is difficult for our approaches to detect such a regression.

Table 7 Performance regression detection results for OpenMRS and Apache James.

OpenMRS

Old version  New version   Approach 1                 Approach 2                 Approach 3
                           P-value   Effect size      P-value   Effect size      P-value   Effect size
v0           v0            <0.001    0.39 (medium)    <0.001    0.36 (medium)    <0.001    0.17 (small)
v0           v1            <0.001    -0.59 (large)    <0.001    -0.69 (large)    <0.001    0.38 (medium)
v0           v2            <0.001    -0.44 (medium)   <0.001    -0.63 (large)    <0.001    0.37 (medium)
v0           v3            <0.001    -0.36 (medium)   <0.001    -0.42 (medium)   <0.001    0.53 (large)
v0           v4            <0.001    -0.69 (large)    <0.001    -0.76 (large)    <0.001    0.51 (large)

Apache James

Old version  New version   Approach 1                    Approach 2                    Approach 3
                           P-value   Effect size         P-value   Effect size         P-value   Effect size
3.0m2        3.0m2         0.008     0.09 (negligible)   <0.001    -0.12 (negligible)  <0.001    -0.03 (negligible)
3.0m2        3.0m1         <0.001    -0.65 (large)       <0.001    -0.76 (large)       <0.001    0.41 (medium)
3.0m2        2.3.2         <0.001    -0.90 (large)       <0.001    -0.93 (large)       <0.001    0.82 (large)

Note: For all the old versions, we use four concurrent workloads; for all the new versions, with and without regressions, we use five concurrent workloads (one extra workload).

Comparing the prediction errors is more effective than comparing the prediction values when detecting performance regressions between two versions. We observe that for OpenMRS, the differences between the prediction values using Approaches 1 and 2 can still be medium (0.39 and 0.36) even for the baseline (i.e., without regressions). On the other hand, when comparing the prediction errors instead of the prediction values, i.e., using Approach 3, the baseline without regressions has only a small effect size (0.17). Such a smaller baseline effect size makes Approach 3 easier to adopt in practice, i.e., without the need to spend effort searching for an optimal threshold on the effect size to detect performance regressions. However, Approach 3 only shows the deviation of the prediction errors without showing the direction of the performance deviation; thus it cannot distinguish a performance regression from a performance improvement. Hence, Approach 3 may be used first to flag a performance deviation and then be combined with other approaches in practice to determine whether the deviation is a performance regression or a performance improvement.

Our approaches can detect performance regressions as early as 15 minutes after running a new version. Table 8 shows the earliest time at which our approach can detect performance regressions in the studied open-source systems. We find that all the performance regressions in the open-source systems can be detected by at least one approach with less than 20 minutes of data from the new version. In particular, the regressions from both versions of Apache James and three versions of OpenMRS can even be detected using only the first 15 minutes of data. The ability of early detection eases the adoption of our approaches in the practice of testing in the field, where performance regressions are detected directly based on the field data, instead of using dedicated performance testing. In our future work, we plan to investigate further shortening the time needed for detecting performance regressions in the field, to ease the adoption of our approach by practitioners.

Table 8 The earliest time for our approaches to detect regressions in the two open-source systems.

             OpenMRS                                  Apache James
             v1        v2         v3        v4        3.0m1     2.3.2
Approach 1   60 mins   160 mins   15 mins   15 mins   15 mins   15 mins
Approach 2   20 mins   50 mins    15 mins   15 mins   15 mins   15 mins
Approach 3   45 mins   15 mins    15 mins   15 mins   15 mins   15 mins

All three approaches can successfully detect performance regressions under varying workloads, requiring data from a very short period of time (down to 15 minutes). Comparing the prediction errors is more effective than comparing the prediction values for detecting performance regressions between two versions.

6 An Industrial Experience Report

System X is a commercial software system that provides government-regulation-related reporting services. The service is widely used and is the market leader in its domain. System X has over ten years of history, with more than two million lines of code based on Microsoft .NET. System X is deployed in an internal production environment and is used by external enterprise customers worldwide. System X is a CPU-intensive system; thus, CPU usage is the main concern of our industrial collaborator in performance regression detection. Due to a Non-Disclosure Agreement (NDA), we cannot reveal additional details about the hardware environment and the usage scenarios of System X.


6.1 Modeling system performance

We directly use the field data that is generated by the workloads of real end users to model the performance of System X. For each release cycle of n days, we use the data from the first n/2 days to build the models and apply them to the data from the second n/2 days to evaluate the prediction performance of the models. We would like to note that we have no control over the particular workloads that the end users apply, and that all the data is directly retrieved from the field without interfering with the behavior of the end users. We studied 10 releases of System X.
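A minimal sketch of this half-release split, assuming a CSV export of the field data with a timestamp column and a cpu_usage column (both hypothetical), is shown below.

```python
# Minimal sketch of the half-release evaluation: train on the first n/2 days
# of a release cycle and evaluate on the second n/2 days. File and column
# names are hypothetical placeholders for the field data of System X.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv("release_cycle.csv", parse_dates=["timestamp"]).sort_values("timestamp")
midpoint = data["timestamp"].min() + (data["timestamp"].max() - data["timestamp"].min()) / 2

train = data[data["timestamp"] <= midpoint]   # first half of the release cycle
test = data[data["timestamp"] > midpoint]     # second half of the release cycle

features = [c for c in data.columns if c not in ("timestamp", "cpu_usage")]
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(train[features], train["cpu_usage"])

pred = model.predict(test[features])
mre = ((pred - test["cpu_usage"]).abs() / test["cpu_usage"] * 100).median()
print(f"Median relative error on the second half of the release: {mre:.1f}%")
```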

We find that black-box models can accurately capture the relationship between the system performance and the system activities recorded in the logs, with a prediction error between 10% and 20%. In particular, the models trained from the data of the first half of a release can effectively predict the system performance in the second half of the release.

6.2 Detecting performance regressions

For every new release, we use our approaches to compare the field data from the new release and the previous release to determine whether there are performance regressions. Since there are no injected or previously known performance regressions, we present the detected performance regressions to the developers of the system and manually study the code changes to understand whether the detection results are correct. For the releases that are detected as not having performance regressions, we cannot guarantee that these releases are free of performance regressions. However, we also present the results to the developers to confirm whether any users have reported performance-related issues for these releases. If so, our detection results may be considered false.

Among all the 10 studied releases of System X, our approaches detected performance regressions in one release. All three approaches detected performance regressions in that release with large effect sizes when compared with the previous release. None of the three approaches falsely detected performance regressions in the other nine releases (i.e., they showed either a statistically insignificant difference or negligible effect sizes when compared with the previous release). By further investigating the release with detected performance regressions, we observed that developers had added a synchronized operation to lock the resources that are responsible for generating a report, in order to protect the shared resources in multi-threaded situations. However, the reporting process is rather resource-heavy, resulting in significant overhead for each thread to wait for and acquire the resources. Thus, the newly added lock caused the performance regression. After discussing with the developers who are responsible for this module, we confirmed that this synchronized operation introduced the performance regression into the software. In addition, for all the nine releases in which our approaches did not detect performance regressions, the developers of System X have not received any reported performance issues from the end users as of the writing of this paper.

Our approaches have been adopted by our industrial collaborator to detect performance regressions of System X on a daily basis.

7 Challenges and Lessons Learned

In this section, we discuss the challenges that we faced and the lessons that we learned while applying our approach to the production environment of an industrial setting where a large number of customers worldwide access the system on a daily basis.

C1: Determining the sampling frequency of performance metrics

Challenge. Our approach uses both logs and performance metrics as the input data to our black-box models. We use the logs that are automatically generated by web servers such as Jetty, Tomcat, and IIS (Internet Information Services). The performance metrics (e.g., CPU, I/O) of the systems are collected using tools (e.g., Pidstat). A higher sampling frequency of the performance metrics can capture the system performance more accurately. However, a higher sampling frequency would also introduce more performance overhead. Since we need to apply our approach to the production environment, it is necessary to keep the performance overhead as low as possible.

Solution. In a first attempt, we intuitively chose 10 seconds as the sampling interval of the performance metrics. After we deployed our approach in production, we found a 0.5%-0.8% CPU overhead each time our approach collected the performance metrics, which happened six times per minute. Such a monitoring overhead cannot be ignored, especially when the system is serving heavy workloads. After working closely with the IT staff of our industrial partner, we gained a deeper understanding of how the sampling frequency of performance metrics impacts the monitoring overhead. Finally, we agreed that collecting the performance metrics every 30 seconds would be a good balance between reducing the monitoring overhead and ensuring an accurate measurement of the system performance.
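For illustration only, the snippet below sketches taking one CPU sample averaged over a 30-second interval by wrapping pidstat; the target PID, the output path, and the wrapper itself are assumptions rather than the collection tooling used in the study.

```python
# Minimal sketch: collect one pidstat CPU report averaged over a 30-second
# interval and append it to a log file. The PID and output path are placeholders.
import subprocess

SAMPLING_INTERVAL = 30   # seconds; the interval agreed on to limit overhead
TARGET_PID = "1234"      # placeholder PID of the monitored server process

with open("cpu_metrics.log", "a") as out:
    subprocess.run(
        ["pidstat", "-u", "-p", TARGET_PID, str(SAMPLING_INTERVAL), "1"],
        stdout=out,
        check=True,
    )
```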

Lessons learned. Monitoring a system usually comes with monitoring overhead. A higher monitoring frequency provides better monitoring accuracy at the cost of a larger monitoring overhead, which is usually undesirable when the system is serving large workloads. Finding a good balance between the monitoring accuracy and the overhead is crucial for the successful adoption of similar approaches in practice.


C2: Reducing the time cost of performance regression detection

Challenge. We choose random forest as our final black-box model to detect performance regressions, as random forest achieves the best results for modeling the performance of the studied systems (see RQ1). A random forest model contains a configurable number of decision trees and takes the average output of the individual decision trees as the final output. At first, we started with the default number of trees (i.e., 500 trees) (Breiman et al., 2018). It took 5-6 hours to detect performance regressions between two releases of the industrial system, including training and testing the random forest models and performing statistical tests. Industrial practitioners usually need to check early whether there is a performance regression between the current version and multiple historical versions, so such a long detection time makes our approaches difficult to adopt in a fast-paced development environment (e.g., an agile environment).

Solution. A larger number of trees usually results in a more accurate random forest model, at the cost of longer training and prediction time. In order to reduce the time cost of performance regression detection, we gradually decreased the number of trees in our random forest model while ensuring that the model performance is not significantly impaired. In the end, we kept 100 decision trees in our random forest models. It then took less than 2 hours to detect performance regressions between two releases of the industrial system (i.e., training and testing the random forest models and performing statistical tests). Such a lighter model also enables us to detect performance regressions between the current version and multiple historical versions (e.g., taking less than 10 hours when comparing the current version with five historical versions). We compared the prediction results of the 100-tree model with those of the original 500-tree model. We found that the 100-tree model only has a slightly higher median relative error (around 0.1%) than the 500-tree model, which is negligible in practice.
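The trade-off can be measured directly, as in the rough sketch below, which times a 500-tree and a 100-tree random forest on the same data and compares their median relative errors; the training and test sets are assumed to be prepared as in the earlier sketches.

```python
# Minimal sketch: compare the training time and median relative error of a
# 500-tree and a 100-tree random forest on the same (already prepared) data.
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

for n_trees in (500, 100):
    start = time.time()
    model = RandomForestRegressor(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)                  # X_train/y_train prepared earlier
    pred = model.predict(X_test)
    mre = np.median(np.abs(pred - y_test) / y_test * 100)
    elapsed = time.time() - start
    print(f"{n_trees} trees: {elapsed:.1f}s, median relative error = {mre:.2f}%")
```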

Lessons learned. In addition to the model performance, the time cost of training and applying a black-box model is also a major concern in the practice of performance regression detection in the field. Seeking an appropriate trade-off between the model performance and the time cost is essential for the successful adoption of performance regression detection approaches in the field.

C3: Early detection of performance regressions

Challenge. In an in-house performance testing process, performance engineers usually wait until all the tests finish before they analyze the testing results. However, in a field environment, any performance regression can directly impact users' experience. We usually cannot wait for a long time to gather plenty of data before performing performance analysis. If our approach cannot detect performance regressions in a timely manner, the performance regressions would already have had a non-negligible negative impact on users. This is a unique challenge facing the detection of performance regressions in the field.

Solution. In order to detect field performance regressions in a timely manner, we continuously apply our approach to detect performance regressions using the currently available data. For example, after a new version of the system has been up and running for one hour, we use only the one-hour data as our new workloads to determine the existence of performance regressions in the new version. Using the data generated in a short time period also allows the analysis part of our approach (i.e., model training and predictions) to be processed faster. As discussed in RQ2, our approach can effectively detect performance regressions when the system has run for only a very short time (e.g., down to 15 minutes for Apache James). In other words, our approach can detect performance regressions early in the field.

Lessons learned. Different from performance regression detection in an in-house testing environment, performance regressions in the field need to be detected in a timely manner, in order to avoid a notable performance impact on the users. Continuously detecting performance regressions using the currently available data can help detect performance regressions early in the field.

8 Threats to Validity

This section discusses the threats to the validity of our study.

External validity. Our study is performed on two open-source systems (i.e., OpenMRS and Apache James) and one industrial system (i.e., System X) that are from different domains (e.g., a health care system and a mail server). All our studied systems are mature systems with years of history and have been studied in prior performance engineering research. Nevertheless, more case studies on other software systems in other domains can benefit the evaluation of our approach. All our studied systems are developed in either Java or .NET. Our approach may not be directly applicable to systems developed in other programming languages, especially dynamic languages (e.g., JavaScript and Python). Future work may investigate approaches to minimize the uncertainty in the performance characterization of systems developed in dynamic languages.

Internal validity. Our approach builds machine learning models to capture the relationship between the runtime activities that are recorded in logs and the measured performance of a system. However, there might be some runtime activities that are not recorded in logs and that also impact system performance. In our approach, we use logs that capture the high-level workloads of the system. Our experiments on the studied systems demonstrated that such logs can predict system performance with high accuracy. Nevertheless, the correlation between the runtime activities recorded in the logs and the measured system performance does not necessarily suggest a causal relationship between them.

Our approach relies on comparing performance models that are constructed based on software behaviors that are recorded in logs. We admit that our approach is not applicable in situations where the old version and the new version have fundamentally different features. If a new version has many new features that do not exist in an old version, the model constructed on the old version would have no knowledge of the performance impact of the new features; thus, comparing performance models between these two versions would not be practical. However, we would like to emphasize that the goal of our approach is to verify whether the black-box performance models can work in scenarios where the workloads have high variations (e.g., variation in the ratio between different types of activities).

Our approach relies on non-parametric statistical analysis (i.e., Mann-Whitney U test and Cliff's delta) to compare the black-box behaviors of two software versions to detect performance regressions. Our assumption is that statistically different behaviors between two software versions would suggest performance regressions. In practice, however, determining whether there is a performance regression usually depends on the subjective judgment of the performance analysts. Therefore, our approach enables performance analysts to adjust the threshold of the statistics (e.g., the effect size) to detect performance regressions in their specific scenarios.

Construct validity. In this work, instead of using modern performance monitoring tools (e.g., application performance monitoring (APM) tools), we use traditional system monitoring tools (e.g., Pidstat) to collect the performance data (e.g., CPU usage) when running the systems. The use of APM tools to collect more information may complement our approach.

We use CPU usage as our performance metric to detect performance regressions. Other performance metrics, such as memory utilization and response time, can also be considered to detect performance regressions. Considering other performance metrics would benefit our approach. However, monitoring more performance metrics would introduce more performance overhead to the monitored system. In the case of our industrial system, CPU usage is the main concern in performance regression detection. Besides, the three studied approaches are not limited to the performance metric of CPU usage. Practitioners can leverage our approach to consider other performance metrics that are appropriate in their context.

We use JMeter to send workloads to the open-source systems. We are not experts on these open-source software systems, and the actual field user data for these systems is not available. Therefore, the proportions of each type of action are not pre-chosen but simply the results of our randomized workload drivers, and we do not have any empirical evidence to support these proportions. However, our goal is only to produce different workloads across different versions, rather than to choose specific numbers. Similarly, in the real world, it may be difficult to anticipate the number of users of a system and their usage patterns.


9 Related Work

We discuss related work along four directions: analyzing performance test results, A/B testing and canary releasing, source code evolution and performance, and leveraging logs to detect performance-related anomalies.

9.1 Analyzing performance test results

Prior research has proposed many automated techniques to analyze the results of performance tests (Gao et al., 2016; Jiang and Hassan, 2015). Initially, the Queuing Network Model (QNM) (Lazowska et al., 1984) was proposed to model the performance of a software system based on queuing theory. Based on QNM, Barna et al. (2011) propose an autonomic performance testing framework to locate the software and hardware bottlenecks in a system. They use a two-layer QNM to automatically target hardware and software resource utilization limits (e.g., hardware utilization, the number of web container threads, and response time).

Due to the increasing complexity of software systems and their behaviors, QNM-based performance models have become insufficient. Therefore, prior work uses statistical methods to assist in performance analysis. Cohen et al. (2005) present an approach that captures the signatures of the states of a running system and then clusters such signatures to detect recurrent or similar performance problems. Nguyen et al. (2011, 2012) use control charts to analyze performance metrics across test runs to detect performance regressions. Malik et al. (2010, 2013) use principal component analysis (PCA) to reduce the large number of performance metrics to a smaller set, in order to reduce the effort of performance analysts.

Prior work also uses machine learning techniques for performance analysis (Didona et al., 2015; Foo et al., 2015; Lim et al., 2014; Shang et al., 2015; Xiong et al., 2013). Shang et al. (2015) propose an approach that automatically selects target performance metrics from a larger set, in order to model system performance. Xiong et al. (2013) propose an approach that automatically identifies the system metrics that are highly related to system performance and detects changes in the system metrics that lead to changes in the system performance. Foo et al. (2015) build ensembles of models and association rules to detect performance regressions in heterogeneous environments, in order to achieve high precision and recall in the detection results.

Different from existing approaches that analyze the results of in-house performance tests, our approach builds black-box performance models that leverage performance metrics and readily available logs to detect performance regressions in the field.


9.2 A/B testing and canary releasing

As a software system evolves, developers need to take actions to ensure that the new versions meet expectations, in terms of both functional and non-functional requirements. In order to reduce the potential risk of introducing new releases, one common approach is to perform A/B testing, which compares two versions of the system against each other to determine which one performs better. For example, Xu et al. (2015) build an A/B testing experimentation platform at LinkedIn to support data-driven decisions and improve user experiences. A/B testing is a standard way to evaluate user engagement or satisfaction with a new service, feature, or product, and its main concern is not system performance. In comparison, our approach aims to detect performance regressions in the field.

Another prevalent industrial live experimentation technique is canary releasing, which gradually delivers new versions or features to an increasingly larger user group. Sato (2014) indicates that canary releasing reduces the risk associated with releasing a new version of the software by limiting its audience at the beginning. Different from canary releasing, our approach relies on the field data from the normal operations of a software system (i.e., without gradual releasing to gain confidence). Our approach has advantages over canary releasing when new features need to be delivered to all the users quickly, such as in the scenario faced by our industrial partner. Nevertheless, our approach can be used in a canary releasing process to detect performance regressions based on the field data collected from a sample of the users, which can be a step in our future research.

In summary, although A/B testing, canary releasing, and our approach all rely on end-user data from the field, our approach serves different purposes and has different usage scenarios from the other two techniques (as discussed above). Our approach is not a direct replacement for, or a peer of, A/B testing or canary releasing, and we do not aim to alter how the software is released to the end users. On the other hand, our approach can complement existing A/B testing and canary releasing practices.

9.3 Source code evolution and performance

Performance issues are essentially related to the source code. Thus, understanding how software performance evolves across code revisions is very important. Alcocer and Bergel (2015) conduct an empirical study on the performance evolution of 19 applications. Their findings show that every three code revisions introduce a performance variation and that the performance variations can be classified into a number of patterns. Based on these patterns, Zhou et al. (2019) propose an approach called DeepTLE that can predict the performance of submitted source code before executing the system, based on the semantic and structural features of the submitted source code. While source-code-level techniques aim to identify performance issues before running performance tests, we believe that the performance of the whole system is not just a simple combination of the performance of each code snippet. Instead, the performance of the whole system needs to be measured by performance testing. In this work, we propose three approaches that identify performance issues while the system is running in the field, which can complement or even replace traditional performance testing.

9.4 Leveraging logs to detect performance-related anomalies

Prior research proposes various approaches that leverage execution logs to detect performance-related anomalies (He et al., 2018; Jiang and Hassan, 2015; Tan et al., 2010; Xu et al., 2009). Jiang et al. (2009) propose a framework that automatically generates a report to detect and rank potential performance problems and the associated log sequences. Xu et al. (2009) extract event features from the execution logs and leverage PCA to detect performance-related anomalies. Tan et al. (2010) present a state-machine view from the execution logs of Hadoop to understand the system behavior and debug performance problems. Syer et al. (2017) propose to leverage execution logs to continuously validate whether the workloads in the performance tests are reflective of the field workloads. Syer et al. (2013) also propose to combine execution logs and performance metrics to diagnose memory-related performance issues. Similarly, He et al. (2018) correlate the clusters of log sequences with system performance metrics to identify impactful system problems (e.g., request latency and service availability).

In comparison, our approach leverages the relationship between the logs and performance metrics from the field to detect performance regressions between two versions of a software system.

10 Conclusions

In this paper, we propose to leverage automated approaches that can effectively detect performance regressions early in the field. We study three approaches that use black-box performance models to capture the relationship between the activities of a software system and its performance under such activities, and then compare the black-box models derived from the current version of a software system and an earlier version of the same software system, to determine the existence of performance regressions between these two versions. By performing empirical experiments on two open-source systems (OpenMRS and Apache James) and applying our approaches to a large-scale industrial system, we found that simple black-box models (e.g., random forest) can accurately capture the relationship between the performance of a system under varying workloads and its dynamic activities that are recorded in logs. We also found that these black-box models can effectively detect real performance regressions and injected performance regressions under varying workloads, requiring data from only a short period of operations. Our approaches can complement or even replace typical in-house performance testing when testing resources are limited (e.g., in an agile environment). The challenges and the lessons that we learned from the successful adoption of our approach also provide insights for practitioners who are interested in performance regression detection in the field.

Acknowledgement

We would like to thank ERA Environmental Management Solutions for providing access to the enterprise system used in our case study. The findings and opinions expressed in this paper are those of the authors and do not necessarily represent or reflect those of ERA Environmental Management Solutions and/or its subsidiaries and affiliates. Moreover, our results do not reflect the quality of ERA Environmental Management Solutions' products.

References

(2019). Apache James Project - Apache James Server 3 - Release notes. http://james.apache.org/server/release-notes.html. Last accessed 10/09/2019.

(2019). GridSearchCV. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html. Last accessed 10/11/2019.

(2019). pidstat: Report statistics for tasks - Linux man page. https://linux.die.net/man/1/pidstat. Last accessed 10/11/2019.

Alcocer, J. P. S. and Bergel, A. (2015). Tracking down performance variation against source code evolution. In Proceedings of the 11th Symposium on Dynamic Languages, DLS 2015, pages 129-139, New York, NY, USA. Association for Computing Machinery.

Barna, C., Litoiu, M., and Ghanbari, H. (2011). Autonomic load-testing framework. In Proceedings of the 8th International Conference on Autonomic Computing, ICAC 2011, Karlsruhe, Germany, June 14-18, 2011, pages 91-100.

Benesty, J., Chen, J., Huang, Y., and Cohen, I. (2009). Pearson correlation coefficient. In Noise reduction in speech processing, pages 1-4. Springer.

Breiman, L., Cutler, A., Liaw, A., and Wiener, M. (2018). Breiman and Cutler's Random Forests for Classification and Regression. R package version 4.6-14.

Cannon, J. (2019). Performance degradation affecting Salesforce clients. https://marketingland.com/performance-degradation-affecting-salesforce-clients-267699. Last accessed 10/11/2019.

Chen, T., Shang, W., Hassan, A. E., Nasser, M. N., and Flora, P. (2016). CacheOptimizer: Helping developers configure caching frameworks for Hibernate-based database-centric web applications. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016, pages 666-677.

Cliff, N. (1996). Ordinal methods for behavioral data analysis.

Cohen, I., Chase, J. S., Goldszmidt, M., Kelly, T., and Symons, J. (2004). Correlating instrumentation data to system states: A building block for automated diagnosis and control. In 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, California, USA, December 6-8, 2004, pages 231-244.

Cohen, I., Zhang, S., Goldszmidt, M., Symons, J., Kelly, T., and Fox, A. (2005). Capturing, indexing, clustering, and retrieving system history. In Proceedings of the 20th ACM Symposium on Operating Systems Principles 2005, SOSP 2005, Brighton, UK, October 23-26, 2005, pages 105-118.

Cortez, E., Bonde, A., Muzio, A., Russinovich, M., Fontoura, M., and Bianchini, R. (2017). Resource Central: Understanding and predicting workloads for improved resource management in large cloud platforms. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28-31, 2017, pages 153-167.

Dacrema, M. F., Cremonesi, P., and Jannach, D. (2019). Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys 2019, Copenhagen, Denmark, September 16-20, 2019, pages 101-109.

de Oliveira, A. B., Fischmeister, S., Diwan, A., Hauswirth, M., and Sweeney, P. F. (2013). Why you should care about quantile regression. In Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, Houston, TX, USA, March 16-20, 2013, pages 207-218.

Didona, D., Quaglia, F., Romano, P., and Torre, E. (2015). Enhancing performance prediction robustness by combining analytical modeling and machine learning. In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, Austin, TX, USA, Jan 31 - Feb 4, 2015, pages 145-156.

Farshchi, M., Schneider, J., Weber, I., and Grundy, J. C. (2015). Experience report: Anomaly detection of cloud application operations using log and cloud metric correlation analysis. In 26th IEEE International Symposium on Software Reliability Engineering, ISSRE 2015, Gaithersburg, MD, USA, November 2-5, 2015, pages 24-34.

Foo, K. C., Jiang, Z. M., Adams, B., Hassan, A. E., Zou, Y., and Flora, P. (2010). Mining performance regression testing repositories for automated performance analysis. In Proceedings of the 2010 10th International Conference on Quality Software, QSIC '10, pages 32-41.

Foo, K. C., Jiang, Z. M. J., Adams, B., Hassan, A. E., Zou, Y., and Flora, P. (2015). An industrial case study on the automated detection of performance regressions in heterogeneous environments. In Proceedings of the 37th International Conference on Software Engineering - Volume 2, ICSE '15, pages 159-168.

Gao, R., Jiang, Z. M., Barna, C., and Litoiu, M. (2016). A framework to evaluate the effectiveness of different load testing analysis techniques. In 2016 IEEE International Conference on Software Testing, Verification and Validation, ICST 2016, Chicago, IL, USA, April 11-15, 2016, pages 22-32.

Ghaith, S., Wang, M., Perry, P., Jiang, Z. M., O'Sullivan, P., and Murphy, J. (2016). Anomaly detection in performance regression testing by transaction profile estimation. Softw. Test., Verif. Reliab., 26(1), 4-39.

Gong, Z., Gu, X., and Wilkes, J. (2010). PRESS: Predictive elastic resource scaling for cloud systems. In Proceedings of the 6th International Conference on Network and Service Management, CNSM 2010, Niagara Falls, Canada, October 25-29, 2010, pages 9-16.

Greenberg, A., Hamilton, J., Maltz, D. A., and Patel, P. (2008). The cost of a cloud: Research problems in data center networks. SIGCOMM Comput. Commun. Rev., 39(1), 68-73.

Guo, J., Czarnecki, K., Apel, S., Siegmund, N., and Wasowski, A. (2013). Variability-aware performance prediction: A statistical learning approach. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013, Silicon Valley, CA, USA, November 11-15, 2013, pages 301-311.

Guo, J., Yang, D., Siegmund, N., Apel, S., Sarkar, A., Valov, P., Czarnecki, K., Wasowski, A., and Yu, H. (2018). Data-efficient performance learning for configurable systems. Empirical Software Engineering, 23(3), 1826-1867.

He, S., Lin, Q., Lou, J., Zhang, H., Lyu, M. R., and Zhang, D. (2018). Identifying impactful service system problems via log analysis. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018, pages 60-70.

Ibidunmoye, O., Hernandez-Rodriguez, F., and Elmroth, E. (2015). Performance anomaly detection and bottleneck identification. ACM Comput. Surv., 48(1), 4:1-4:35.

Jiang, Z. M. and Hassan, A. E. (2015). A survey on load testing of large-scale software systems. IEEE Trans. Software Eng., 41(11), 1091-1118.

Jiang, Z. M., Hassan, A. E., Hamann, G., and Flora, P. (2009). Automated performance analysis of load tests. In 25th IEEE International Conference on Software Maintenance (ICSM 2009), September 20-26, 2009, Edmonton, Alberta, Canada, pages 125-134.

Krasic, C., Sinha, A., and Kirsh, L. (2007). Priority-progress CPU adaptation for elastic real-time applications. In R. Zimmermann and C. Griwodz, editors, Multimedia Computing and Networking 2007, volume 6504, pages 172-183. International Society for Optics and Photonics, SPIE.

Krishnamurthy, D., Rolia, J. A., and Majumdar, S. (2006). A synthetic workload generation technique for stress testing session-based systems. IEEE Trans. Software Eng., 32(11), 868-882.

Lazowska, E. D., Zahorjan, J., Graham, G. S., and Sevcik, K. C. (1984). Quantitative system performance - computer system analysis using queueing network models. Prentice Hall.

Lim, M., Lou, J., Zhang, H., Fu, Q., Teoh, A. B. J., Lin, Q., Ding, R., and Zhang, D. (2014). Identifying recurrent and unknown performance issues. In 2014 IEEE International Conference on Data Mining, ICDM 2014, Shenzhen, China, December 14-17, 2014, pages 320-329.

Malik, H., Jiang, Z. M., Adams, B., Hassan, A. E., Flora, P., and Hamann, G. (2010). Automatic comparison of load tests to support the performance analysis of large enterprise systems. In 14th European Conference on Software Maintenance and Reengineering, CSMR 2010, 15-18 March 2010, Madrid, Spain, pages 222-231.

Malik, H., Hemmati, H., and Hassan, A. E. (2013). Automatic detection of performance deviations in the load testing of large scale systems. In 35th International Conference on Software Engineering, ICSE '13, San Francisco, CA, USA, May 18-26, 2013, pages 1012-1021.

Nachar, N. et al. (2008). The Mann-Whitney U: A test for assessing whether two independent samples come from the same distribution. Tutorials in Quantitative Methods for Psychology, 4(1), 13-20.

Nguyen, T. H. D., Adams, B., Jiang, Z. M., Hassan, A. E., Nasser, M. N., and Flora, P. (2011). Automated verification of load tests using control charts. In 18th Asia Pacific Software Engineering Conference, APSEC 2011, Ho Chi Minh, Vietnam, December 5-8, 2011, pages 282-289.

Nguyen, T. H. D., Adams, B., Jiang, Z. M., Hassan, A. E., Nasser, M. N., and Flora, P. (2012). Automated detection of performance regressions using statistical process control techniques. In Third Joint WOSP/SIPEW International Conference on Performance Engineering, ICPE'12, Boston, MA, USA, April 22-25, 2012, pages 299-310.

Romano, J., Kromrey, J. D., Coraggio, J., and Skowronek, J. (2006). Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys. In Annual meeting of the Florida Association of Institutional Research, pages 1-33.

Sato, D. (2014). Canary release. MartinFowler.com.

Shang, W., Hassan, A. E., Nasser, M. N., and Flora, P. (2015). Automated detection of performance regressions using regression models on clustered performance counters. In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, Austin, TX, USA, January 31 - February 4, 2015, pages 15-26.

Sullivan, G. M. and Feinn, R. (2012). Using effect size - or why the p value is not enough. Journal of Graduate Medical Education, 4(3), 279-282.

Syer, M. D., Jiang, Z. M., Nagappan, M., Hassan, A. E., Nasser, M. N., and Flora, P. (2013). Leveraging performance counters and execution logs to diagnose memory-related performance issues. In 2013 IEEE International Conference on Software Maintenance, Eindhoven, The Netherlands, September 22-28, 2013, pages 110-119.

Syer, M. D., Jiang, Z. M., Nagappan, M., Hassan, A. E., Nasser, M. N., and Flora, P. (2014). Continuous validation of load test suites. In ACM/SPEC International Conference on Performance Engineering, ICPE'14, Dublin, Ireland, March 22-26, 2014, pages 259-270.

Syer, M. D., Shang, W., Jiang, Z. M., and Hassan, A. E. (2017). Continuous validation of performance test workloads. Autom. Softw. Eng., 24(1), 189-231.

Syncsort (2018). White paper: Assessing the financial impact of downtime.

Tan, J., Kavulya, S., Gandhi, R., and Narasimhan, P. (2010). Visual, log-based causal tracing for performance debugging of MapReduce systems. In 2010 International Conference on Distributed Computing Systems, ICDCS 2010, Genova, Italy, June 21-25, 2010, pages 795-806.

Valov, P., Petkovich, J., Guo, J., Fischmeister, S., and Czarnecki, K. (2017). Transferring performance prediction models across different hardware platforms. In Proceedings of the 8th ACM/SPEC International Conference on Performance Engineering, ICPE 2017, L'Aquila, Italy, April 22-26, 2017, pages 39-50.

Weyuker, E. J. and Vokolos, F. I. (2000). Experience with performance testing of software systems: Issues, an approach, and case study. IEEE Trans. Software Eng., 26(12), 1147-1156.

Xiong, P., Pu, C., Zhu, X., and Griffith, R. (2013). vPerfGuard: An automated model-driven framework for application performance diagnosis in consolidated cloud environments. In ACM/SPEC International Conference on Performance Engineering, ICPE'13, Prague, Czech Republic, pages 271-282.

Xu, W., Huang, L., Fox, A., Patterson, D. A., and Jordan, M. I. (2009). Detecting large-scale system problems by mining console logs. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles 2009, SOSP 2009, Big Sky, Montana, USA, October 11-14, 2009, pages 117-132.

Xu, Y., Chen, N., Fernandez, A., Sinno, O., and Bhasin, A. (2015). From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 2227-2236, New York, NY, USA. Association for Computing Machinery.

Yadwadkar, N. J., Bhattacharyya, C., Gopinath, K., Niranjan, T., and Susarla, S. (2010). Discovery of application workloads from network file traces. In 8th USENIX Conference on File and Storage Technologies, San Jose, CA, USA, February 23-26, 2010, pages 183-196.

Yao, K., B. de Padua, G., Shang, W., Sporea, S., Toma, A., and Sajedi, S. (2018). Log4Perf: Suggesting logging locations for web-based systems' performance monitoring. In Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE '18, pages 127-138.

Zhou, M., Chen, J., Hu, H., Yu, J., Li, Z., and Hu, H. (2019). DeepTLE: Learning code-level features to predict code performance before it runs. In 2019 26th Asia-Pacific Software Engineering Conference (APSEC), pages 252-259.