

Workload Modelling and Elasticity Management of Data-Intensive Systems

Alireza Khoshkbarforoushha

A thesis submitted for the degree of Doctor of Philosophy

The Australian National University

February 2018


© Alireza Khoshkbarforoushha 2018


Except where otherwise indicated, this thesis is my own original work.

Alireza Khoshkbarforoushha
11 February 2018


I conducted this PhD thesis under the supervision of Dr. Rajiv Ranjan. Most of the results in this thesis have been previously published at top-tier conferences and journals. These publications are listed below, and some of them were achieved in collaboration with other researchers.

Journal Papers

• A. Khoshkbarforoushha, A. Khosravian, and R. Ranjan, "Elasticity management of streaming data analytics flows on clouds", Journal of Computer and System Sciences (JCSS), 2016, http://doi.org/10.1016/j.jcss.2016.11.002.

• A. Khoshkbarforoushha, R. Ranjan, R. Gaire, E. Abbasnejad, L. Wang, and A. Zomaya, "Distribution Based Workload Modelling of Continuous Queries in Clouds", IEEE Transactions on Emerging Topics in Computing (TETC), 5.1 (2017): 120-133.

Conference Papers

• A. Khoshkbarforoushha and R. Ranjan, "Resource and Performance Distribution Prediction for Large Scale Analytics Queries", in Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering (pp. 49-54). ACM.

• A. Khoshkbarforoushha, R. Ranjan, Q. Wang, and C. Friedrich, "Flower: A Data Analytics Flow Elasticity Manager", Proceedings of the VLDB Endowment 10, no. 12 (2017): 1893-1896.

Apart from the publications above, I published the following papers during my PhD studies, the results of which are not presented in this thesis.

• A. Khoshkbarforoushha, M. Wang, R. Ranjan, L. Wang, L. Alem, S. U. Khan, and B. Benatallah, "Dimensions for evaluating cloud resource orchestration frameworks", Computer, 49(2):24-33, 2016.

• A. Khoshkbarforoushha, R. Ranjan, and P. Strazdins, "Resource Distribution Estimation for Data-Intensive Workloads: Give Me My Share & No One Gets Hurt!", in European Conference on Service-Oriented and Cloud Computing (pp. 228-237). Springer International Publishing.


To Masoumeh


Acknowledgments

A PhD is an unsurpassed journey with lots of ups and downs. It is my great pleasure to thank those who made this once-in-a-lifetime journey possible. I am indebted to my supervisor, Dr. Rajiv Ranjan, for his support, which started long before my official PhD commencement, for his superb guidance, and for believing in me at all stages of my PhD. I learnt a lot from him and will never forget his supportive attitude and big heart.

I would like to thank my panel, Professors Alistair Rendell, Peter Strazdins and John Hosking, for their valuable comments. I would also like to give my special thanks and appreciation to my collaborators Dr. Alireza Khosravian, Dr. Raj Gaire, Dr. Ehsan Abbasnejad, Dr. Prem Prakash Jayaraman, and Dr. Qing Wang for their invaluable comments and the discussions we had during my PhD studies. I would also like to acknowledge the academic, technical and financial support of Data61 CSIRO and the Australian National University.

I want to express my gratitude to a marvellous group of friends in Canberra: Khosro, Mohammad Sara, Fatemeh Sadegh, Mehdi Marzieh, Abbas Nojan, Monaj, Hajar, Morteza Sahba, Mohsen Sara, Mohammad Ladan, Mohammad Sousan, Alireza Zahra, Salim, Ehsan, Ehsan, Fatemeh, Omid, Dash Zahra, Majid Fatemeh, Behrouz Fatemeh, Mousa, Arash Pegah, Mahmoud Mahtab, Meisong, Tony, Miranda, Andy, Matt.

In the end, I would like to thank my wife and family. Words are incapable of expressing my deepest love and appreciation to Masoumeh for her endless support and encouragement, and to Aghajan, Maman, Maryam, Marzieh, Hosein, Yasamin, Baba, Maman, Zahra, Maryam, Hamid, Mohammad.


Abstract

Efficiently and effectively processing large volumes of data (often at high velocity) using an optimal mix of data-intensive systems (e.g., batch processing, stream processing, NoSQL) is the key step in the big data value chain. The availability and affordability of these data-intensive systems as cloud managed services (e.g., Amazon Elastic MapReduce, Amazon DynamoDB) have enabled data scientists and software engineers to deploy versatile data analytics flow applications, such as click-stream analysis and collaborative filtering, with less effort. Although easy to deploy, the run-time performance and elasticity management of these complex data analytics flow applications has emerged as a major challenge. As we discuss later in this thesis, data analytics flow applications combine multiple programming models for performing specialized and pre-defined sets of activities, such as ingestion, analytics, and storage of data. To support users across such heterogeneous workloads, where they are charged for every CPU cycle used and every data byte transferred in or out of the cloud datacenter, we need a set of intelligent performance and workload management techniques and tools. Our research methodology investigates and develops these techniques and tools by significantly extending well-known formal models available from other disciplines of computer science, including machine learning, optimization and control theory.

To this end, this PhD dissertation makes the following core research contributions: a) it investigates novel workload prediction models (based on machine learning techniques, such as Mixture Density Networks) to forecast how performance parameters of data-intensive systems are affected by run-time variations in data flow behaviour (e.g., data volume, data velocity, query mix); b) it investigates a control-theoretic approach for managing the elasticity of data-intensive systems to ensure the achievement of service level objectives. In the former (a), we propose a novel application of Mixture Density Networks for distribution-based resource and performance modelling of both stream and batch processing data-intensive systems. We argue that the distribution-based resource and performance modelling approach, unlike existing single-point techniques, is able to predict the whole spectrum of resource usage and performance behaviours as probability distribution functions. Therefore, it provides more valuable statistical measures about system performance at run-time. To demonstrate the usefulness of our technique, we apply it to undertake the following workload management activities: i) predictable auto-scaling policy setting, which highlights the potential of distribution prediction in the consistent definition of cloud elasticity rules; and ii) designing a predictive admission controller which is able to efficiently admit or reject incoming queries based on probabilistic service level agreement compliance goals.

In the latter (b), we apply advanced techniques in control and optimization theory to design an adaptive control scheme that is able to continuously detect and self-adapt to workload changes in order to meet users' service level objectives. Moreover, we also develop a workload management tool called Flower for end-to-end elasticity management of different data-intensive systems across data analytics flows. Through extensive numerical and empirical evaluation we validate the proposed models, techniques and tools.


Contents

Acknowledgments vii

Abstract ix

1 Introduction 1
    1.1 Research Motivation and Objectives 2
    1.2 Research Challenges 2
    1.3 Research Questions and Contributions 3
    1.4 Thesis Outline 4

2 Background 7
    2.1 Big Data Analytics Ecosystem 7
        2.1.1 Data Ingestion Layer 8
        2.1.2 Data Analytics Layer 9
            2.1.2.1 Batch Processing Systems 9
            2.1.2.2 Stream Processing Systems 12
        2.1.3 Data Storage Layer 13
    2.2 Data-Intensive System Performance Prediction 14
        2.2.1 White-box and Black-box Approaches 14
        2.2.2 Machine Learning Driven Performance Prediction 15
            2.2.2.1 Predicting Performance as a Single Point Value 15
            2.2.2.2 Predicting Performance as a Distribution 17
    2.3 Elasticity Management of Data Analytics Flows on Cloud 19
        2.3.1 Multi-Objective Optimization 19
        2.3.2 Elasticity Controller 20
    2.4 Summary 21

3 Distribution-Based Resource Usage Prediction of Continuous Queries 23
    3.1 Motivation 24
    3.2 Approach Overview 25
    3.3 Related Work 27
    3.4 Resource Usage Prediction 29
        3.4.1 Single Continuous Query 29
        3.4.2 Concurrent Workload 29
            3.4.2.1 Stream Processing Optimizations 30
            3.4.2.2 Resource Contention 31
        3.4.3 Model Selection 32
            3.4.3.1 Mixture Density Networks 33
    3.5 Experiment 34
        3.5.1 Experimental Setup 34
            3.5.1.1 Dataset and Workload 34
            3.5.1.2 Training and Testing Settings 37
        3.5.2 Evaluation: CPU and Memory Usage 37
            3.5.2.1 Error Metrics 37
            3.5.2.2 Evaluation Results 39
        3.5.3 Training Times and Overhead 41
    3.6 Distribution-Based Workload Management 41
        3.6.1 Predictable Auto-Scaling Policy Setting 41
        3.6.2 Distribution Based Admission Controller 45
    3.7 Summary 48

4 Distribution-Based Workload Modelling of Large-Scale Batch Queries 49
    4.1 Approach Overview 49
    4.2 Related Work 51
    4.3 Performance Modelling of Hive 51
        4.3.1 Query Execution in HiveQL 52
        4.3.2 MDN Technique 52
    4.4 Experimental Evaluation 53
        4.4.1 Experimental Setup 53
        4.4.2 Error Metrics 54
        4.4.3 State of the Art Techniques 54
        4.4.4 Evaluation: Single Point Estimators 55
        4.4.5 Evaluation: Distribution-Based Prediction 56
        4.4.6 Training Times and Overhead 58
    4.5 Distribution-Based Prediction Utilization 58
    4.6 Summary 60

5 Elasticity Management of Data Analytics Flows on Clouds 61
    5.1 Challenges in Elasticity Management of Data Analytics Flows 61
    5.2 Related Work 62
    5.3 Proposed Solution 64
        5.3.1 Solution Overview 64
        5.3.2 Resource Share Analysis 66
        5.3.3 Elasticity Controller 68
            5.3.3.1 A Framework for Controller Design 68
            5.3.3.2 A Generic Adaptive Controller 69
            5.3.3.3 Gain Function (lk) Behavior Analysis 72
    5.4 Automated Control of a Data Analytics Flow 74
        5.4.1 Data Ingestion Layer Controller 74
        5.4.2 Data Analytics Layer Controller 74
        5.4.3 Data Storage Layer Controller 75
    5.5 Experimental Results 76
        5.5.1 Experimental Setup 76
        5.5.2 Evaluation Results: Optimized Resource Share Determination 77
        5.5.3 Evaluation Results: Adaptive Controller Performance 77
        5.5.4 Evaluation Results: Automated Control of the Flow 80
    5.6 Summary 82

6 Flower: A System for Data Analytics Flow Management 83
    6.1 Related Work 83
    6.2 Flower Architecture 85
        6.2.1 Resource Share Analysis 85
        6.2.2 Resource Provisioning 86
        6.2.3 Cross-Platform Monitoring 86
    6.3 Flower Workflow 87
    6.4 Flower in Action 87
    6.5 Summary 92

7 Conclusion 93
    7.1 Resource Performance Prediction for Data-intensive Systems 93
    7.2 Data Analytics Flow Elasticity Management 94
    7.3 Elasticity Management Tool Support 94

References 95


List of Figures

2.1 A high-level architecture of the data-driven analytics services. 8
2.2 A simple instance of large-scale data stream-processing service. 8
2.3 Amazon Kinesis stream architecture. 9
2.4 Hadoop architecture. 10
2.5 Query processing in Hive. 11
2.6 Apache Storm architecture. 13
2.7 CPU usage of CurActiveCars query against average arrival rates showing the multi-valued mapping situation from the same input. 16
2.8 One hidden layer MLP. 17
2.9 MDN approximates distribution parameters, conditioned on the input vector. 18
3.1 (a) CPU usage of the query against 500 and 10K tuple/sec arrival rates. (b) Normalized histogram and KDE fitted to CPU usage of CurActiveCars query against 10K data arrival rate. 25
3.2 Sample distribution prediction of CPU usage for NegAccTollStr query. Actual PDF is a fitted KDE function against the actual CPU usage, which is used for clarity and comparison with the prediction. 26
3.3 Our approach builds an MDN model based on the historical logs of queries to predict the distribution of new incoming workloads. The predicted PDFs are then used for developing two novel workload management strategies: a) distribution based admission control, and b) auto-scaling policy setting. 27
3.4 Overview of the proposed approach for predicting the resource usage distribution of continuous queries. 32
3.5 Best fit of sent tuples per second against different distribution models. The figures contain the probability density of average tuples sent per second for the speed rate of (a) 50K and (b) 100K for two different queries. 36
3.6 (a) Predicted PDF and the observation. (b) Schematic sketch of the CRPS as the difference between CDFs of prediction and observation. 38
3.7 The CPU utilization of (a) NegAccTollStr and (b) SegToll queries for 5 minutes. The sample auto-scaling policies cause oscillating behaviour in the NegAccTollStr workload, since they have been defined irrespective of the workload CPU usage distribution. 43
3.8 The probabilities of the randomly generated auto-scaling policies for 12 (out of 32) mixes of test queries. Each query mix is evaluated against 4 auto-scaling policies, shown in the form of bright and dark coloured bars. The bright and dark bars within each policy set respectively show the activated and not activated rules at run-time. Our technique has successfully characterized the highly possible policies for all mixes but Mix 4, 8, and 12. 45
3.9 CPU utilization beyond 95% hits the throughput (tuple/sec) of the query. 46
3.10 Single point and distribution based admission controller performance under different decision making thresholds. In the single point case we set t1=25%, t2=45%, t3=65%, and t4=85%. 47
4.1 Two sample predicted distributions for (a) CPU and (b) Execution Time for a sample input from Q7 of TPC-H. The histograms show respectively the actual CPU and runtime values for 30 different instance queries generated based on template-7 and executed in the cluster. 50
4.2 (a) CPU and (b) Response time prediction for Hive queries, modelled using the Table 1 feature set. 55
4.3 Relative error (%) for (a) CPU and (b) Response time prediction using SVM, REPTree, and MLP techniques. 56
4.4 Sample PDF predictions for (a) CPU and (b) Execution Time of Hive queries based on TPC-H workload. 59
5.1 A data analytics flow that performs real-time sliding-windows analysis over click stream data. 62
5.2 The data arrival rate at the ingestion layer (Amazon Kinesis in Fig. 5.1) is strongly correlated (coefficient = 0.95) with the CPU load at the analytics layer (Apache Storm in Fig. 5.1). 63
5.3 The proposed solution for managing heterogeneous workloads of the data analytics flows on clouds. 65
5.4 a) Input-output linear model, b) Control feedback loop. 69
5.5 Gain parameter behavior under different load scenarios. 73
5.6 VMs are launched at different time slots so that they are of different cost to stop. Thus, it is more economical to stop a VM with the minimum remaining time. 75
5.7 a) Given the $32.25 daily budget and the dependency between data ingestion and analytics layers, six optimal solutions are generated. b) Since we have three objectives, the Pareto front is a surface in 3D space. 77
5.8 a) The data producer puts the same records into the three identical Kinesis streams, regulated by the controllers. b) Our implementation writes three copies of the results to the three identical DynamoDB tables. 78
5.9 The RMSE measures for both a) Kinesis and b) DynamoDB workloads in terms of different desired utilization (yr) values. 79
5.10 Throughput QoS for Kinesis workload. 79
5.11 Performance comparison of our adaptive controller and the fixed-gain and quasi-adaptive ones in Amazon Kinesis workload management with yr = 70%. 80
5.12 Performance comparison of our adaptive controllers and the fixed-gain and quasi-adaptive ones in DynamoDB workload management with yr = 60%. 81
5.13 Adaptive controller's performance in elasticity management of a) data ingestion (yr = 60%), b) analytics (yr = 40%), and c) storage (yr = 70%) layers of the click-stream analytics flow with lk = 0.03 and γ = 0.0001. 82
6.1 Conceptual design of the Flower system. 84
6.2 Flower high-level architecture. 85
6.3 All-in-one-place visualizer user interface. 86
6.4 The high-level sequence diagram of how to run an elasticity controller in Flower. 88
6.5 Flower's flow builder interface. 89
6.6 Elasticity flow configuration interface. 89
6.7 Elasticity service control and monitoring interface. 90
6.8 Elasticity service setting interface. 91


List of Tables

3.1 Feature input for training model. 30
3.2 Trained classifiers performance as per LRB workload. 40
3.3 Trained classifiers performance as per LRB Mix workload. 40
3.4 Trained classifiers performance as per TPC-H workload. 40
3.5 Training times in seconds with regard to different workload sizes for 1K iterations. 41
4.1 Feature set for resource modelling of Hive queries. 53
4.2 MDN performance compared with its competitors. 57
4.3 Training times in seconds with regard to different workload sizes for 500 iterations. 58
5.1 List of key notations used in this chapter. 67


Chapter 1

Introduction

Data-driven products and services are revolutionizing nearly every aspect of our lives, from enterprises to consumers and science to government, and are now the fundamental part underpinning real-time decision making by transforming insights into value. In this regard, efficiently and effectively processing large volumes of batch or streaming data using a chain of data-intensive systems is the key step in the big data value chain [39]. For example, by analyzing data using data analytics flows, real-time situational awareness can be developed for handling events such as natural disasters and major traffic incidents. Similarly, online retail companies can offer dynamically priced and customized product bundles using data analytics flows that process real-time click stream data and up-to-the-minute inventory status on the fly.

Data analytics flows typically operate on three layers: ingestion, analytics, and storage [106, 83]. The data ingestion layer accepts data from multiple sources such as online services or back-end system logs. The data analytics layer consists of many platforms, including stream and batch processing systems and scalable machine learning frameworks, that ease the implementation of data analytics use-cases such as collaborative filtering and sentiment analysis. The ingestion and analytics layers make use of different databases during execution and, where required, persist the data in the storage layer.

A recent analysis of cloud providers' service portfolios [74] shows that the number of data-intensive systems within each layer offered as cloud managed services (e.g., Amazon Elastic MapReduce [4], Amazon Kinesis Streams [5], Microsoft Azure HDInsight [18], Google BigQuery [17]) has surged, because these services are well appreciated by users, releasing them from the hassle of platform and cluster setup and maintenance. The availability and affordability of these cloud services have enabled data scientists and software engineers to easily build versatile data analytics flow applications. Although easy to orchestrate and create, their workload management is a challenge.



1.1 Research Motivation and Objectives

To architect a cloud-hosted data analytics flow, a mix of data-intensive systems is needed, as a number of studies [81, 76] have already reported performance problems induced by following the "one size fits all" notion. Therefore, data analytics flows are built by orchestrating data processing systems across a network of virtually unlimited computing and storage resources.

To support users across such complex heterogeneous workloads, where they are charged for every CPU cycle used and every data byte transferred within the cloud, we need a set of performance and workload management policies and mechanisms. Our hypothesis is that, with the right set of techniques and tools from machine learning, optimization and control theory, complex data analytics flows can be managed automatically to enable different service level objectives.

In recent years, numerous studies [51, 82, 96, 90, 111] have shown the benefits of adapting concepts and tools from statistical machine learning, optimization and control theory to the workload management of data-intensive systems. This study aims to take the next step in performance modelling and workload management of data-intensive systems by: a) investigating and adapting a new class of machine learning techniques for performance and workload management, and b) enhancing existing resource management techniques and tailoring them to a chain of data-intensive systems as required by data analytics flows.

1.2 Research Challenges

Workload management operations, including a) resource and performance prediction, b) optimal resource share analysis, and c) accurate and timely resource provisioning for data analytics flow applications, are highly challenging given the following unique characteristics of data analytics services.

Changing resource and performance behaviour. Data analytics flow applications often deal with immense data volumes which, together with the uncertain velocity of data streams, lead to changing resource consumption patterns. This mandates resource management techniques that can sustain workload fluctuations in a time-efficient manner.

Heterogeneity of workloads. In big data analytics workloads, a typical data analytics flow consists of multiple processing tasks, each of which is executed using a different data processing platform (e.g., batch/stream processing frameworks, NoSQL) across a cluster of machines. In this context, performance- and cost-optimized elasticity management of a data analytics flow is problematic due to the heterogeneity of the workloads pertaining to different platforms with different performance and cost measures.

Diversity of cloud resources across the data analytics flows. Data analytics flows are deployed on diverse cloud resources such as queuing partitions, compute servers and storage throughput capacity, each of which exhibits different performance behaviours and different pricing schemes. In this setting, resource allocation techniques need to cater for diverse resource requirements and their associated cost dimensions to meet users' Service Level Objectives (SLOs).

1.3 Research Questions and Contributions

To achieve the objectives stated in Section 1.1, we formulate three research questions that are addressed in this thesis.

(I) How can we predict the resource and performance distribution of data-intensive workloads? More specifically:

(i) How can we predict the resource usage distribution of centralized stream processing workloads?

(ii) Is the distribution-based workload modelling approach applicable to resource management problems of stream processing systems?

(iii) How can we predict the resource and performance distribution of large-scale analytics queries?

(II) How to satisfy the performance objectives of a data analytics flow application despite its dynamic runtime workload? More specifically:

(i) What share of different resources does each layer of a data analytics flow need to operate, given the budget constraints?

(ii) How would we cope with the variable resource requirements across the layers for handling variation in the volume and velocity of the data analytics flow?

(III) How to design and implement a holistic elasticity management system for the data analytics flows? More specifically:

(i) How to implement a resource share analyser module?

(ii) How to implement an adaptive control system and tailor it to the ingestion, analytics and storage layers of the data analytics flow?

(iii) How to design and implement a holistic monitoring module that can operate across the layers of the data analytics flows?

In response to these research questions, a number of contributions have been made and published in several scientific journals, conference proceedings, and workshops. The contributions and their original published references are as follows:


• Introducing a new distribution-based performance modelling technique for batch and stream processing systems. The proposed approach is based on statistical machine learning techniques and is easy to adapt to a wide variety of systems modelling problems. To demonstrate the usefulness of distribution-based workload modelling, we also design and implement two workload management mechanisms: i) predictable auto-scaling policy setting; and ii) a predictive admission controller. In the former, we put forward the claim that workload behaviour distribution prediction provides reliable information enabling consistent auto-scaling policy setting in public clouds. In the latter, we experimentally take the first step towards developing an admission controller which is able to react as per probabilistic service level agreements (SLAs). These contributions have been published in [70, 71].

• Investigating the problem of multi-layered resource allocation for complex data analytics flows deployed on public clouds. For this purpose, we present a meticulous dependency analysis of the workloads along with a mathematical formulation of the problem as per the data ingestion, analytics, and storage layers of a data analytics flow. We then design and implement a new adaptive control framework, employing tools from classic nonlinear control theory, for dynamic provisioning of data analytics flows. The proposed control system is able to continuously detect and self-adapt to workload changes to meet users' SLOs. This contribution was originally published in [69].

• Designing and implementing a system called Flower for holistic elasticity management of data analytics flows on clouds. Flower provides the user with a suite of rich functionalities including workload dependency analysis, optimal resource share analysis, dynamic resource provisioning, and cross-platform monitoring. This contribution has been published in [73].

1.4 Thesis Outline

The thesis continues with Chapter 2, which gives the reader the required background in the big data ecosystem, the state of the art in performance prediction, and elasticity management. Chapter 3 presents distribution-based workload modelling for centralized in-memory stream processing systems. In this chapter, we also demonstrate the use of distribution-based performance models in workload management operations. Chapter 4 discusses how the proposed prediction methodology can be adapted to parallel distributed batch processing systems. Chapter 5 presents our elasticity management framework for data analytics flows. In this chapter we describe the three core components of the framework in detail: workload dependency analysis, resource share analysis, and the adaptive controller. In Chapter 6, we present the main functionalities of our developed system, called Flower, for holistic elasticity management of data analytics flows. We then demonstrate how Flower, as a high-level easy-to-use system, assists admins and DevOps engineers in provisioning data analytics flow applications and also allows them to constantly monitor applications for any performance failures or slowdowns. Finally, Chapter 7 summarizes our work and concludes the thesis.


Chapter 2

Background

This chapter provides the required background on the well-established theories, techniques and technologies used in this thesis. We first explore a number of major data-intensive systems in the big data analytics ecosystem in Section 2.1. We then discuss the use of statistical machine learning techniques in workload performance modelling of data-intensive systems in Section 2.2. Finally, we discuss the elasticity management of data-intensive workloads on clouds in Section 2.3.

2.1 Big Data Analytics Ecosystem

As we delve deeper into the digital universe, we are witnessing explosive growth in the variety, velocity, and volume of data being transmitted over the Internet. These data are generated mainly by Internet search, social media and mobile devices [97].

Such big data sets are too complicated to be managed and processed by conventional data processing platforms such as relational databases and data mining frameworks. In response, a mix of large-scale data processing platforms, also known as data-intensive systems, is used. These systems can be categorized into three layers (ingestion, analytics, and storage) as per their specific types of workloads and functionalities, as shown in Fig. 2.1. More specifically, Fig. 2.2 illustrates an instance of a large-scale data stream processing service, where Apache Kafka [10] serves as a high-throughput distributed messaging system, Apache Storm [12] as a distributed and fault-tolerant real-time computation system, and Apache Cassandra [6] as a NoSQL database. It is worth mentioning that some platforms are able to play roles in more than one layer. For example, Amazon Kinesis [5] and Apache Kafka, as distributed message queuing systems, can also be used in the data analytics layer to provide real-time data analytics and stream processing.

These systems are often available as cloud managed services¹, because cloud resources are a natural fit for processing data-intensive workloads: they allow the underlying parallel distributed programming and database frameworks to run at scale in order to handle uncertain data volume and velocity. However, big data systems provide many configuration options which often significantly impact their performance. To understand and predict the effect of these configuration options, novel techniques need to be investigated and proposed for cloud services.

¹ Managed cloud services, unlike unmanaged ones, do not require the user to take care of issues like how the service responds to changes in load, errors, and situations where resources become unavailable. For example, AWS EC2 is an unmanaged service whereas AWS DynamoDB or AWS RDS are fully managed solutions.

Figure 2.1: A high-level architecture of the data-driven analytics services.

Figure 2.2: A simple instance of large-scale data stream-processing service.

2.1.1 Data Ingestion Layer

The data ingestion layer accepts data from multiple sources such as website click streams, financial transactions, social media feeds, IT infrastructure log data, and location-tracking events. In this layer, distributed message queuing frameworks such as Amazon Kinesis and Apache Kafka provide a powerful set of primitives for reliable, high-throughput and low-latency queuing of real-time data streams.

Amazon Kinesis


Figure 2.3: Amazon Kinesis stream architecture.

Kinesis is a high-throughput streaming data platform that enables rapid and continuous data intake and aggregation. The core concept of Kinesis is the stream, an ordered sequence of data records. The data records in a stream are distributed into shards; in other words, a stream is composed of one or more shards, as shown in Fig. 2.3. Each shard provides a fixed unit of capacity.

The data capacity of a stream is determined by the number of shards provisioned for it. As the data rate increases, more shards need to be added to scale up the stream; conversely, shards can be removed as the data rate decreases. Kinesis, as a cloud managed service, provides all the infrastructure, storage, networking, and configuration required to transparently handle the shard provisioning process. Nevertheless, specifying the right number of shards as per the incoming data rates and volume is left to the user.
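Because the shard count must be driven by the user, resharding is typically scripted against the Kinesis API. Below is a minimal sketch using the boto3 Python SDK that doubles or halves a stream's shard count from an observed utilization figure; the stream name, the thresholds, and the notion of utilization used here are illustrative assumptions, not part of the Kinesis service itself.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def reshard(stream_name: str, utilization: float) -> None:
    """Scale a Kinesis stream up or down given measured shard utilization.

    `utilization` is assumed to be the fraction of per-shard write capacity
    consumed over the last monitoring interval (how it is measured is up to
    the caller).
    """
    summary = kinesis.describe_stream_summary(StreamName=stream_name)
    shards = summary["StreamDescriptionSummary"]["OpenShardCount"]

    if utilization > 0.8:                   # hot: double the shards
        target = shards * 2
    elif utilization < 0.3 and shards > 1:  # cold: halve the shards
        target = max(1, shards // 2)
    else:
        return                              # within the comfort band

    kinesis.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target,
        ScalingType="UNIFORM_SCALING",      # Kinesis splits/merges shards evenly
    )
```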

2.1.2 Data Analytics Layer

The data analytics layer consists of many systems, such as stream and batch processing systems and scalable machine learning frameworks, that ease the implementation of data analytics use cases such as collaborative filtering and sentiment analysis. These use cases typically make it necessary to have both batch and stream processing platforms working side by side. For example, in the Lambda architecture [88], one of the main big data processing architectures, both batch and stream processing platforms are used simultaneously to balance latency and throughput, providing comprehensive and real-time views of batch and streaming data respectively. These views are then joined before presentation.

2.1.2.1 Batch Processing Systems

Distributed batch processing systems are designed to handle large volumes of data by processing them in parallel using multiple tasks across a cluster of machines. These systems, unlike stream processing platforms, are not meant to deliver low response times and latency. In the subsequent subsections we briefly present two of the main batch processing platforms in the big data ecosystem, Apache Hadoop [8] and Hive [9].

Figure 2.4: Hadoop architecture.

Apache Hadoop

Apache Hadoop is an open-source framework used for distributed storage and processing of large datasets. Hadoop has two main components, MapReduce and the Hadoop Distributed File System (HDFS), which are inspired by the Google papers on MapReduce [40] and the Google File System (GFS) [52].

HDFS has a master/slave architecture with two key software components: the NameNode and the DataNode. The NameNode, as the master, keeps track of which blocks make up a file and where they are stored, while the DataNodes, as slaves, keep the actual data, one or more blocks each. Similarly, MapReduce has a master/slave architecture with the JobTracker and TaskTracker as its main software components. The JobTracker, as the master, schedules the jobs' component tasks on the slaves, monitors them and re-executes failed tasks. The TaskTrackers, as slaves, execute the tasks as directed by the master. Typically the compute nodes and the storage nodes are the same; that is, MapReduce and HDFS run on the same set of nodes, as shown in Fig. 2.4.

Figure 2.5: Query processing in Hive.

The MapReduce concept is inherently a divide-and-conquer strategy in which a single problem is broken into multiple individual subtasks, namely various instances of map and reduce tasks. This approach is reinforced even further by parallelizing the subtasks on a cluster of inexpensive, not necessarily modernized (commodity) hardware [108]. In a MapReduce data flow, a large data set is split into blocks containing data as a series of key-value pairs which are processed by a number of map tasks in parallel. The map functions execute the predefined logic and output resultant temporary data as a set of key-value pairs, which are then fed into the reduce function to perform aggregation and collect the final set of results.
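To make this data flow concrete, the following sketch simulates the three phases (map, shuffle/group-by-key, and reduce) of a word count in plain Python. It is a conceptual model of the paradigm, not code against the Hadoop API.

```python
from collections import defaultdict
from typing import Iterator

def map_phase(line: str) -> Iterator[tuple[str, int]]:
    # Map: emit an intermediate (key, value) pair per word.
    for word in line.split():
        yield (word, 1)

def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    # Reduce: aggregate all values observed for one key.
    return (word, sum(counts))

def mapreduce(lines: list[str]) -> dict[str, int]:
    # Shuffle: group the intermediate pairs by key before reducing.
    groups: dict[str, list[int]] = defaultdict(list)
    for line in lines:                     # each line plays the role of one input split
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(mapreduce(["big data", "big compute"]))  # {'big': 2, 'data': 1, 'compute': 1}
```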

Apache Hive

Although the MapReduce concept is simple, writing MapReduce programs in procedural languages such as Java, Scala and Python is not approachable for everyone. In response, and to facilitate querying the data residing on HDFS, Hive was introduced by Facebook [107]. Apache Hive is a data warehouse infrastructure built on top of Hadoop that facilitates querying and managing large datasets residing in distributed storage. It provides a mechanism to project structure onto this data and to query the data using a SQL-style language, HiveQL. HiveQL allows software engineers and, more importantly, data scientists to write analytics queries in a declarative way.

Hive interprets HiveQL and generates MapReduce jobs that run on the cluster using the Driver component. In other words, the Driver translates a Hive query into map and reduce functions by fetching the required schema information from the Metastore, as shown in Fig. 2.5. SQL operators (e.g., table scan) are therefore translated into map and reduce functions, and everything is either a map or a reduce behind the scenes. End-to-end execution time depends on the number of mappers and reducers and their runtime performance.
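As a hedged illustration of this declarative interface, the sketch below submits a HiveQL aggregation from Python. It assumes the PyHive client library and a reachable HiveServer2 endpoint; the host, port, and table are hypothetical. Conceptually, the table scan and filter of the query become map tasks, while the GROUP BY aggregation becomes reduce tasks.

```python
from pyhive import hive  # assumed client library for HiveServer2

# Hypothetical endpoint; replace with your deployment's values.
conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# Table scan + filter -> map tasks; GROUP BY/COUNT -> reduce tasks.
cursor.execute("""
    SELECT l_returnflag, COUNT(*) AS cnt
    FROM lineitem
    WHERE l_shipdate <= '1998-09-02'
    GROUP BY l_returnflag
""")
for flag, cnt in cursor.fetchall():
    print(flag, cnt)
```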


2.1.2.2 Stream Processing Systems

The early efforts in processing streaming data started around 2000 with research projects such as NiagaraCQ [36] and Cougar [32]. These systems laid the groundwork for today's centralized Stream Processing Engines (SPEs) such as Esper [15], Odysseus [23] and Oracle Complex Event Processing (CEP) [19]. However, the increasing volume of data stresses the need for scalable frameworks that support distributed or parallel-distributed computation of data streams, because centralized in-memory stream processing systems are no longer sufficient for real-time computation of huge amounts of streaming data.

In the centralized architecture, all query computation and state management are handled in-memory, and the granularity of parallel execution tasks is as big as distinct queries, which are deployed at multiple SPE instances. In the distributed approach, different operators belonging to the same query are executed at different SPE instances, whereas in the parallel-distributed approach even a single operator can be executed in parallel via multiple SPE instances. Evidently, the parallel-distributed architecture promises thoroughly scalable systems compared to the others [55]. In the subsequent subsections we briefly present Oracle CEP as a centralized and Apache Storm as a parallel-distributed stream processing system.

Oracle CEP

Oracle CEP is a centralized in-memory stream processing system designed to support event-driven applications such as algorithmic trading, security and fraud detection. In fact, Oracle CEP is a high-performance continuous query engine for fast processing of streaming data. For this purpose, it offers a Continuous Query Language (CQL), similar to SQL but with added constructs that support filtering, correlation, and aggregation of streaming data from one or more streams. Using CQL, one can query data streams to perform complex event processing.

Apache Storm

Apache Storm [12] is a distributed fault-tolerant system for processing stream data at scale. A Storm cluster is superficially similar to a Hadoop cluster; however, whereas on Hadoop one spawns jobs across the cluster, on Storm one spawns topologies. Unlike a Hadoop job, a topology processes messages forever unless it is killed. A Storm cluster consists of a master node and worker nodes, which are coordinated by Apache ZooKeeper [14] (Fig. 2.6).

The master node runs a daemon called Nimbus which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Each worker node runs a daemon called the Supervisor that listens for work assigned to its machine and starts and stops worker processes as necessary, based on what Nimbus has assigned to it. The Nimbus and Supervisor daemons are fail-fast and stateless, as all state is kept in ZooKeeper or on local disk.

Figure 2.6: Apache Storm architecture.

The basic primitives Storm provides for doing stream transformations are spouts and bolts. Spouts and bolts have interfaces that one implements to run application-specific logic. A spout is a source of streams that may read tuples from a queue or directly from an API and emit a new transformed stream. A bolt consumes input streams, does some processing such as running functions, filtering, aggregations, or joins, and possibly emits new streams. Networks of spouts and bolts are packaged into a topology that is submitted to a Storm cluster for execution. A topology is a graph of stream transformations where each node is a spout or bolt and edges indicate which bolts are subscribing to which streams.
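The sketch below mimics the spout/bolt contract in plain Python to show how a topology chains stream transformations; it is a conceptual model only, since Storm's actual spout and bolt interfaces are Java classes wired together by a TopologyBuilder.

```python
from typing import Iterator

class SentenceSpout:
    """Source of the stream: emits raw tuples (here, sentences)."""
    def emit(self) -> Iterator[str]:
        yield from ["storm processes streams", "streams of tuples"]

class SplitBolt:
    """Bolt: transforms each input tuple into zero or more output tuples."""
    def process(self, sentences: Iterator[str]) -> Iterator[str]:
        for sentence in sentences:
            yield from sentence.split()

class CountBolt:
    """Bolt: stateful aggregation over the incoming word stream."""
    def process(self, words: Iterator[str]) -> dict[str, int]:
        counts: dict[str, int] = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        return counts

# "Topology": a graph wiring the spout's output stream into the bolts.
spout, split, count = SentenceSpout(), SplitBolt(), CountBolt()
print(count.process(split.process(spout.emit())))
```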

2.1.3 Data Storage Layer

The data storage layer consists of next-generation database systems for storing and indexing final as well as intermediate datasets. NoSQL database frameworks, such as Cassandra [6] and Amazon DynamoDB [3], along with elastic caching systems, are the major storage systems at this layer.

DynamoDB

DynamoDB [3] is a NoSQL database system that provides a scalable architecture for managing key-value and document data structures. In the key-value data structure, given the exact key, the value is returned. This well-defined data access pattern results in better scalability and performance predictability, which is suitable for storing and indexing real-time streams of big datasets. DynamoDB, as a document store, also supports querying and updating items in document formats such as JSON, XML, and HTML.
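A brief sketch of this exact-key access pattern with the boto3 Python SDK follows; the table name and attributes are hypothetical, and the table is assumed to already exist with device_id as its partition key.

```python
import boto3

# Hypothetical table with "device_id" as its partition key.
table = boto3.resource("dynamodb", region_name="us-east-1").Table("sensor-events")

# Write: store the item under its key.
table.put_item(Item={"device_id": "sensor-42", "reading": 17, "unit": "C"})

# Read: the exact key returns the full item, with predictable latency.
item = table.get_item(Key={"device_id": "sensor-42"})["Item"]
print(item["reading"], item["unit"])
```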

DynamoDB has tables, items, and attributes as its core components. A table is a collection of items, and each item is a collection of attributes. DynamoDB stores data in tables and automatically partitions the data and provisions additional server capacity as the table size grows. However, specifying the right size of provisioned throughput (i.e., read or write capacity units) for a particular table to handle the workload dynamics is left to the user.
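Adjusting the provisioned throughput is likewise an API call the user (or an elasticity controller) must drive. A minimal boto3 sketch follows, with a hypothetical table name and capacity targets that a controller would normally compute from observed consumed-capacity metrics rather than hard-code.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Hypothetical table and capacity targets.
dynamodb.update_table(
    TableName="clickstream-results",
    ProvisionedThroughput={
        "ReadCapacityUnits": 100,   # reads/sec of items up to 4 KB
        "WriteCapacityUnits": 200,  # writes/sec of items up to 1 KB
    },
)
```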

2.2 Data-Intensive System Performance Prediction

System performance prediction is central to the design and development of various workload management strategies such as:

• System resizing: dynamic provisioning of resources such as CPU, memory and network as per workload changes.

• Workload scheduling: reordering and scheduling different jobs and queries.

• Admission control: admitting or rejecting an incoming workload.

As discussed in Section 1.2, performance prediction and workload management across multiple big data systems is a challenging task due to a) changing resource consumption patterns and performance behaviour, b) heterogeneity of workloads, and c) diversity of the involved cloud resources. In this section, we discuss different approaches to workload performance modelling and motivate the need for a new class of techniques to improve workload management.

2.2.1 White-box and Black-box Approaches

System performance modelling and prediction techniques broadly use either a white-box approach, a black-box approach, or a combination of both. In white-box modelling, performance models are built based on an understanding of the internals of the system, its components, and their interactions with each other and with operating system modules. For example, in [29], the authors use Actor Model theory to build analytical performance models for Apache Storm.

In the black-box approach, the system is treated as a black box and the performance model is built from the relationship between the system's input workload and configuration features and its output performance or resource usage. Black-box models are primarily developed using Machine Learning (ML) techniques, which involves four steps: i) initial feature set identification, ii) ML model selection, iii) feature selection, and iv) training and testing.
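As a toy end-to-end illustration of these four steps, the sketch below builds a black-box model on synthetic data with scikit-learn; the features, the target, and the choice of an SVM regressor are illustrative assumptions only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(2)

# i) Initial feature set: e.g. arrival rate, window size, query count, ...
X = rng.uniform(size=(200, 6))
y = 50 * X[:, 0] + 10 * X[:, 2] + rng.normal(0, 2, 200)  # synthetic resource usage

# ii) ML model selection: here an SVM regressor as one candidate model.
model = SVR(kernel="rbf")

# iii) Feature selection: keep the 3 most informative features.
X_sel = SelectKBest(f_regression, k=3).fit_transform(X, y)

# iv) Training and testing, via 5-fold cross-validation.
print(cross_val_score(model, X_sel, y, cv=5).mean())
```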

ML techniques, compared to a white-box approach, ease the task of cost model generation for increasingly complex data management systems, because a wise selection and use of various ML techniques implicitly captures the internal behaviour of the system components in terms of their resource footprint. This complexity is further intensified in cloud environments due to the heterogeneity and diversity of resource types and the uncertainties of the underlying cloud environment. However, building an accurate predictive model depends on the availability of training data that is representative of the actual workload.


In this thesis, we use the black-box approach in workload performance modelling of two major platforms in the big data ecosystem. Even though white-box approaches do not need training data and, unlike sophisticated statistical models, are easy to understand and offer higher extrapolation power, developing, enhancing and maintaining analytical models for ever-changing data-intensive systems is quite challenging.

2.2.2 Machine Learning Driven Performance Prediction

In recent years, different statistical ML techniques such as the multilayer perceptron (MLP) [99], Kernel Canonical Correlation Analysis (KCCA) [26], regression trees [33], and Support Vector Machines (SVM) [104] have been used for system resource usage and performance prediction.

We argue that these techniques are inadequate for performance prediction of data-intensive systems, because these systems deal with immense data volumes which, together with the uncertain velocity of data streams, lead to changing resource consumption patterns. The classic statistical ML techniques are only able to model the statistical properties of the data generator as a conditional average, which is a single-point value. This means that if the data has a complex structure, for example a one-to-many mapping, then these techniques are not able to model the whole spectrum of performance behaviour [31].

For example, consider Fig. 2.7, which shows the scatter plot of CPU usage against average data arrival rates for the CurActiveCars continuous query from the Linear Road Benchmark [25]2. The figure clearly illustrates the multi-valued mapping: for the same data arrival rate, such as 10K (tuple/second), there are multiple CPU usage values ranging from 20 to 90 percent. Therefore, the conditional distribution - which can be visualized by considering the density of points along a vertical slice through the data - is multi-modal for many input values such as 10K or 9998. Such multi-modality is poorly represented by the conditional average. Therefore, we need techniques that can capture the multi-modal nature of the target data as probability distribution functions.
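To make this concrete, the following minimal sketch (synthetic data; the regimes and numbers are illustrative, not taken from the experiments) builds a one-to-many mapping and shows why the conditional mean is a poor summary of it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-to-many mapping: at the same arrival rate, CPU usage
# is drawn from one of two regimes (e.g. ~25% or ~85% of a core).
rate = np.full(1000, 10_000)                      # tuples/second
mode = rng.choice([0.25, 0.85], size=rate.size)   # two usage regimes
cpu = rng.normal(loc=mode, scale=0.03)            # observed CPU fraction

# A single-point (least-squares) estimator converges to the conditional
# mean, which falls between the modes and is rarely observed itself.
print(f"conditional mean: {cpu.mean():.2f}")      # ~0.55, a poor summary
for m in (0.25, 0.85):
    share = np.mean(np.abs(cpu - m) < 0.1)
    print(f"mass near {m:.2f}: {share:.2f}")      # ~0.5 at each mode
```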

To address the issue above, in this thesis we introduce distribution-based performance prediction techniques as superior predictors compared to single-point techniques for data-intensive workloads. In the subsections below, we discuss various single-point estimator techniques and also briefly discuss our proposal for distribution-based performance prediction.

2.2.2.1 Predicting Performance as a Single Point Value

In this section, we briefly explore the prominent ML algorithms used in this thesis as baseline techniques against which the distribution-based ones are compared.

2 This result is based on one of the experiments conducted on the Linear Road Benchmark. The complete experimental evaluation concerned with distribution-based performance modelling of the systems will be presented in Chapter 3.


Figure 2.7: CPU usage of CurActiveCars query against average arrival rates, showing the multi-valued mapping situation from the same input.

Multilayer Perceptron

An MLP [99] is a feedforward Artificial Neural Network (ANN) model: a multi-layer network of simple neurons called perceptrons that forms a directed graph. Given a set of input features and a target, an MLP approximates the output by forming a linear combination of the inputs using the input weights and passing the result through an activation function. In mathematical terms, it can be formulated as:

$$ y = \varphi\Big(\sum_{i=1}^{n} w_i x_i + b\Big) = \varphi(W^{T}X + b) \tag{2.1} $$

where $w$ denotes the vector of weights, $x$ is the input vector, $b$ is the bias and $\varphi$ is the non-linear activation function.

An MLP network typically consists of an input layer, one or more hidden layers and an output layer, as shown in Fig. 2.8. The input layer consists of a set of neurons $\{x_i \mid x_1, x_2, \ldots, x_n\}$ representing the input features. Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation $w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$, followed by a non-linear activation function such as the logistic sigmoid $1/(1 + e^{-x})$ or the hyperbolic tangent $\tanh(x)$. The output layer transforms the values received from the last hidden layer into output values.

The MLP has the ability to learn non-linear models. However, building an MLP model requires a fair amount of tuning, because a number of hyper-parameters, such as the number of hidden neurons, layers, and iterations, need to be specified beforehand.
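As an illustration of Eq. (2.1), the sketch below (plain NumPy, with random weights standing in for trained ones) computes a forward pass through a one-hidden-layer MLP regressor:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP regressor: Eq. (2.1) applied layer by layer."""
    h = np.tanh(W1 @ x + b1)   # hidden layer: weighted sum + tanh activation
    return W2 @ h + b2         # linear output layer for regression

# Toy dimensions: 4 input features (e.g. arrival rate, window size, ...),
# 8 hidden neurons, 1 output (e.g. CPU usage). The weights would normally
# be learned by back-propagation; random values here just exercise the code.
rng = np.random.default_rng(0)
x = rng.random(4)
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((1, 8)), np.zeros(1)
print(mlp_forward(x, W1, b1, W2, b2))
```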

Decision Tree Learning

Decision tree learning or prediction trees are an important class of algorithms for predictive modelling.


Figure 2.8: One hidden layer MLP.

Prediction trees have two variants: classification trees and regression trees. When the target variable takes a finite set of categorical values, we use classification trees to identify the classes the target variable likely belongs to. Regression trees are for dependent variables that take continuous values.

In this method, models are built by recursively partitioning the data space into smaller regions and fitting a simple prediction model within each partition. Prediction trees can be graphically represented as decision trees [84], where each node represents a partition with a simple model attached that applies only in that partition [59].

In the scope of this thesis, we use the REPtree method [46]. REPtree is a standard tree model that has been utilized by existing workload prediction techniques [116]. This method partitions the feature space in a top-down and non-linear fashion and builds a decision tree using information gain. It then prunes the tree using reduced-error pruning [46].
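REPtree itself is a WEKA implementation; as a hedged stand-in, the sketch below uses scikit-learn's DecisionTreeRegressor, which likewise partitions the feature space top-down, with depth and leaf-size limits playing a role loosely analogous to reduced-error pruning (the features and data are toy values, not the thesis configuration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Toy feature matrix: [avg_arrival_rate, win_type_size] -> CPU usage.
X = rng.uniform([100, 1], [100_000, 900], size=(500, 2))
y = 0.3 + X[:, 0] / 200_000 + rng.normal(0, 0.02, 500)

# Recursive top-down partitioning; the depth and leaf-size limits curb
# overfitting, much as pruning does in REPtree.
tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=20).fit(X, y)
print(tree.predict([[10_000, 30]]))   # single-point estimate for a new query
```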

Support Vector Machines

Support vector machines are a set of supervised learning models used for classification, regression and outlier detection. SVMs are very effective in high-dimensional data spaces and are flexible in the sense that different kernel functions can be specified for the decision function. In this thesis, we use a regression variant of SVMs, as it has been utilized by existing workload prediction techniques [22].
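A minimal sketch of the regression variant (epsilon-SVR with an RBF kernel via scikit-learn; the kernel choice and toy data are assumptions, not the thesis configuration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform([100, 1], [100_000, 900], size=(500, 2))   # toy features
y = 0.3 + X[:, 0] / 200_000 + rng.normal(0, 0.02, 500)     # toy CPU usage

# epsilon-SVR with an RBF kernel: flexible via the kernel choice, but the
# output is still a single conditional value per input.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X, y)
print(svr.predict([[10_000, 30]]))
```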

2.2.2.2 Predicting Performance as a Distribution

A number of approaches such as Mixture Density Networks (MDN) [31], Conditional Density Estimation Networks [92] and Random Vector Functional Link [63] are available to predict probability density functions (PDFs).


Figure 2.9: MDN approximates distribution parameters, conditioned on the input vector.

In this thesis, we investigate the MDN to predict resource usage and performance of both batch and stream processing workloads.

Although the MDN was introduced two decades ago, it is still one of the best-performing conditional density estimators [105]. More importantly, the benefit of using the MDN lies in its ability to model unknown distributions such as those exhibited by data-intensive system workloads [37]. In addition, it has already been successfully applied in other domains such as speech synthesis and wind speed and power forecasting.

Mixture Density Networks

The MDN is a special type of ANN in which the target is represented as a conditional probability density function. The conditional distribution represents a complete description of data generation. A classic MDN fuses a Gaussian mixture model (GMM) with an MLP. In an MDN, the distribution of the outputs t is described by a parametric model whose parameters are determined by the output of a neural network which takes x as input.

Fig. 2.9 gives an overview of the MDN, in which the neural network is responsible for mapping the input vector x to the parameters of the mixture model $(\alpha_i, \mu_i, \sigma_i^2)$, which in turn provide the conditional distribution. An MDN, in fact, maps input features x to the parameters of a GMM: mixture weights $\alpha_i$, means $\mu_i$, and variances $\sigma_i^2$, which in turn produce the full PDF of an output feature t conditioned on the input vector, p(t|x).


2.3 Elasticity Management of Data Analytics Flows on Cloud

In recent years, numerous large-scale data processing platforms have been offered as cloud managed services, such as Kinesis, Elastic MapReduce and DynamoDB. The selling point of these services is their elasticity, which allows them to adapt to workload changes by provisioning and de-provisioning resources to match the demand as closely as possible [61].

Cloud managed services are able to adapt to workload changes either manually or automatically, for example by acting appropriately when some threshold is reached. Manual adaptation does not provide any autoscaling facility; in the best case, the service alerts the administrator through an email of the need to manually configure the instances to adapt to new conditions. Services with automatic adaptation adapt to exceptions through the use of reactive and predictive techniques [49].

Reactive techniques respond to events only after reaching a predefined threshold that is determined through monitoring the state of hardware and software resources. Although these techniques are simple to define and implement (nothing more than if-then-else statements), they are not sufficient to ensure SLOs in some cases, such as during a peak demand for resources.

Predictive techniques can dynamically anticipate and capture the relationship between an application's SLO targets, current hardware resource allocation, and changes in application-workload patterns in order to adjust the hardware allocation. Overall, predictive techniques build on the integration of theoretical workload prediction and resource performance models. Workload prediction models forecast workload behavior across applications in terms of CPU, storage, I/O, and network bandwidth requirements.

In recent years several reactive and predictive provisioning techniques [110, 98, 101, 34] have been proposed, focusing on traditional web applications. None of these techniques is capable of provisioning big data processing platforms across multiple cloud resources while ensuring strict guarantees on performance targets. Some recent techniques have been proposed for automated provisioning of individual platforms such as NoSQL databases [75], distributed streaming systems [55], and batch processing systems [82], while largely ignoring the end-to-end provisioning needs of big data analytics flow applications.

In response, we propose a holistic elasticity management system that exploits advanced optimization and control theory techniques to manage the elasticity of complex data analytics flows on clouds. In the following subsections, we explore the basics of multi-objective optimization and feedback control theory.

2.3.1 Multi-Objective Optimization

To efficiently select the configuration of different resources across multiple layers of data analytics flows, we propose the novel usage of multi-objective optimization. A multi-objective optimization problem involves minimizing or maximizing multiple, possibly conflicting, objective functions [89].


Mathematically, it can be written as:

$$ \max\ \big(f_1(x), f_2(x), \ldots, f_n(x)\big) \quad \text{s.t.}\ x \in X \tag{2.2} $$

where $n \ge 2$ is the number of objectives and the set $X$ is the feasible set of solutions.

In multi-objective optimization, there does not typically exist a solution that minimizes or maximizes all objective functions simultaneously. Thus, attention is paid to Pareto optimal solutions: those that cannot be improved in any of the objectives without degrading at least one of the other objectives. This important concept is called domination [28]. Put formally, a solution $x_1 \in X$ is said to dominate another solution $x_2 \in X$ if:

1. $\forall i \in \{1, 2, \ldots, n\}:\ f_i(x_1) \ge f_i(x_2)$, and

2. $\exists i \in \{1, 2, \ldots, n\}:\ f_i(x_1) > f_i(x_2)$.

Quite simply, this definition implies that $x_1$ is Pareto optimal if there exists no feasible solution $x' \in X$ that would improve some criterion without causing a simultaneous degradation in at least one other criterion. Therefore, solving a multi-objective problem does not yield a single solution, but rather a set of solutions called the Pareto front.
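A minimal sketch of dominance checking and Pareto-front extraction for a maximization problem (the objective vectors are toy values):

```python
def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (maximization)."""
    return (all(a >= b for a, b in zip(f1, f2))
            and any(a > b for a, b in zip(f1, f2)))

def pareto_front(points):
    """Return the non-dominated subset of a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Toy two-objective example, e.g. (throughput, -cost) per configuration.
configs = [(3, 1), (2, 2), (1, 3), (2, 1), (1, 1)]
print(pareto_front(configs))   # -> [(3, 1), (2, 2), (1, 3)]
```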

2.3.2 Elasticity Controller

Elasticity and auto-scaling techniques have been studied extensively in recent years [85]. Different techniques such as control theory [86], queueing theory [109], fuzzy logic [117], and Markov decision processes [75] have been applied to tackle the problem with respect to different resource types such as cache servers [64], HDFS storage [82], or VMs [48]. Recent studies in resource management using control theory [82, 86, 65, 66] have clearly shown the benefits of dynamic resource allocation against fluctuating workloads. What makes the control theory approach stand out among workload management techniques is that it neither relies on any prior information about the workload behaviour nor imposes any strong assumptions on the system model (e.g. as in queueing models). These features lead to a simple yet effective approach that can sustain any workload shape and dynamics. Therefore, in this thesis, we propose a framework for the design and asymptotic stability analysis of adaptive controllers by employing tools from classic nonlinear control theory. We further design and tailor adaptive controllers for different layers across a data analytics flow.

In control theory there are two basic types of control: open loop and closed loop control. In open loop control, the controller decision is independent of the system output. In contrast, in closed loop control, the controller action depends on the system output: measurement of the system output is used as feedback to alter the controller action. For this reason, closed loop controllers are also called feedback controllers [42].


In this thesis, we propose a novel adaptive controller that builds upon the basics of feedback control systems, as discussed next.

A feedback control is mathematically defined as:

$$ u(t) = u_0 + L_c\, e(t), \tag{2.3} $$

where $u(t)$ represents the controller output, $u_0$ the base control input, $L_c$ the controller gain, and $e(t) = y_r - y$ the error. The control objective is to design the control input $u_0$ such that all signals remain bounded for all time and the system output $y$ converges to a reference (desired) constant value $y_r \in \mathbb{R}$ as $t$ goes to infinity.
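The following toy simulation (the first-order plant model is an assumption introduced purely for illustration) applies the law of Eq. (2.3); since the toy plant has unit DC gain, choosing u0 = y_r is one way to "design the control input u0" so that the output settles at the reference:

```python
# Minimal discrete-time sketch of the feedback law (2.3): u(t) = u0 + Lc*e(t).
# The first-order plant below is a toy with unit DC gain, so setting
# u0 = y_r lets proportional feedback drive y to the reference exactly.
y_r, Lc = 0.7, 0.5          # reference output and controller gain
u0 = y_r                    # designed base input for this toy plant
y = 0.0                     # initial measured output (e.g. utilisation)
for _ in range(30):
    e = y_r - y             # feedback error e(t) = y_r - y
    u = u0 + Lc * e         # control action from Eq. (2.3)
    y = 0.8 * y + 0.2 * u   # toy plant: output lags the input
print(round(y, 3))          # -> converges to ~0.7 = y_r
```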

2.4 Summary

In this chapter we discussed background information on well-established concepts, techniques and technologies used in this thesis. In this regard, we introduced a number of data-intensive systems including Kinesis, Hadoop, Hive, Storm, Oracle CEP and DynamoDB. We discussed the roles and main functionalities of these systems in terms of the typical data ingestion, analytics and storage layers.

We also discussed different approaches in workload performance modelling, and in particular statistical machine learning. More specifically, different ML techniques were discussed to provide the background required for our discussion of distribution-based performance prediction, which will be presented in Chapters 3 and 4. We also briefly illustrated why the distribution-based workload modelling approach is superior to existing techniques that predict performance as single point values. We finally discussed the current status of workload elasticity management techniques and presented the necessary basics of optimization and control theory as the scientific basis of our proposal for elasticity management, which will be introduced in Chapter 5.


Chapter 3

Distribution-Based Resource Usage Prediction of Continuous Queries

Efficient resource consumption estimation in response to a query processing task is central to the design and development of various workload management strategies such as dynamic provisioning, workload scheduling, and admission control [27, 116]. All of these strategies typically possess a prediction module which can provide accurate estimations to guide run-time operations such as adding more resources, reordering query execution, or admitting or rejecting an incoming query.

The data stream processing workload mainly consists of registered continuous queries and data arrival rate distribution models. The key to proper exploitation of elasticity is having the intelligence to predict how changing data velocity and the mix of continuous queries will affect the performance of the underlying virtualized resources (e.g. CPU). Therefore, building resource usage estimation for continuous queries is vital, yet challenging due to: (i) variability of the data arrival rates and their distribution models, (ii) variable resource consumption of the data stream processing workload, (iii) the need to process different mixes of continuous queries, and (iv) uncertainties of the underlying cloud resources.

These complexities challenge the task of efficiently processing such streaming workloads on cloud infrastructures, where users are charged for every CPU cycle used and every data byte transferred in and out of the datacenter. In this context, cloud service providers have to intelligently balance various variables including compliance with Service Level Agreements (SLAs) and efficient usage of infrastructure at scale while handling simultaneous peak workloads from many clients.

In this chapter, we address the first (i) and second (ii) parts of research question (I) as specified in Section 1.3. To this end, we present a novel approach of using mixture density networks to estimate the whole spectrum of resource usage as probability density functions. We evaluate our technique using the Linear Road Benchmark (LRB) [25] and TPC-H [20] in both private and public clouds. We also demonstrate the efficiency and applicability of the proposed approach via two novel applications: i) predictable auto-scaling policy setting, which highlights the potential of distribution prediction in the consistent definition of cloud elasticity rules; and ii) a distribution-based admission controller which is able to efficiently admit or reject incoming queries based on probabilistic SLA compliance goals.



3.1 Motivation

Recent work has studied SQL query resource estimation and run-time performance prediction using machine learning (ML) techniques [22, 50, 80]. These techniques treat the database system as a black box and try to predict based on the training dataset provided. They offer the promise of superior estimation accuracy, since they are able to account for factors such as the hardware characteristics of the systems as well as interactions between various components. All these techniques approximate resource usage for each query as a single point value.

Unlike standard SQL queries, which may or may not execute multiple times (often each execution is independent of the previous one), continuous queries are typically registered in stream processing systems for a reasonable amount of time and streams of data flow through the graph of operators over this period. Rapidly time-varying data arrival rates and different query constructs (e.g. time and tuple-based windows) cause the resource demand for a given query to fluctuate over time. To illustrate how streaming workload resource demands fluctuate with time, we executed the following simple CurActiveCars query from the Linear Road Benchmark:

SELECT DISTINCT car_id
FROM CarSegStr [RANGE 30 SECONDS];

Fig. 3.1(a) illustrates the CPU usage for this query against two different arrival rates: 500 tuple/sec and 10K tuple/sec. As expected, the data arrival rates drastically affect the stream processing system's resource demand over time. For example, the fitted Probability Density Function (PDF) of the CPU usage for the query (Fig. 3.1(b)) shows that even though the query is highly likely to consume between 20% and 35% CPU, we need to allow for possible peak demands (i.e. 90%) to avoid a performance hit. Under these circumstances, how can we address questions such as: How much memory and CPU share will the query require if the arrival rates double? or What would be the shape of CPU usage for more complex queries?

For problems involving the prediction of continuous variables (e.g. resource consumption), single point estimation, which is in fact a conditional average, provides only a very limited description of the properties of the target variable. This is particularly true for a data stream processing workload, in which the mapping to be learned is multi-valued and the average of several correct target values is not necessarily itself a correct value. Therefore, single point resource usage estimation [22, 50, 80] is often not adequate for streaming workloads, since it is neither expressive enough nor does it capture the multi-modal nature of the target data.

Continuous queries and streaming workload resource management strategies rather require techniques that provide a holistic picture of resource utilization as a probability distribution. To achieve this, we propose a novel approach for resource usage estimation of data stream processing workloads. Our approach is based on the mixture density network (MDN) [31], which approximates the probability distribution over target values.


Figure 3.1: (a) CPU usage of the query against 500 and 10K tuple/sec arrival rates. (b) Normalized histogram and KDE fitted to CPU usage of CurActiveCars query against 10K data arrival rate.


To illustrate one of the possible advantages of using the proposed approach, consider Fig. 3.2. It displays a sample predicted PDF and the actual CPU usage in terms of a normalized histogram and fitted Kernel Density Estimation (KDE) for one of the experiments on Linear Road Benchmark queries [25]. As we can see, the estimated PDF approximates the actual resource usage PDF closely. The predicted PDF provides a complete description of the statistical properties of the CPU utilization, through which we are able to capture not only the observation point but also the whole spectrum of the resource usage. In contrast, a best approximation from the existing resource estimation techniques [22, 50, 80] merely provides the point visualized by the solid vertical line. Unlike with PDFs, from such an estimate we are not able to directly calculate any valuable statistical measures (e.g. variance, confidence interval) about the target data.

3.2 Approach Overview

Fig. 3.3 shows the workflow of our approach, as discussed next. In the proposed approach, we use an ML technique to train a model on the historical logs. Once the model is built, the workload manager of the stream processing system is able to employ it to predict the distribution of a new incoming workload (i.e. query). The predicted PDFs (or mixture models) are then used for different workload management strategies such as admission control and auto-scaling rule setting.

Resource Usage Distribution Prediction: For this purpose, our approach combines the knowledge of continuous query processing with the MDN statistical model.


Figure 3.2: Sample distribution prediction of CPU usage for NegAccTollStr query. The actual PDF is a fitted KDE function against the actual CPU usage, shown for clarity and comparison with the prediction.

To do so, we first execute the training query workload and profile its resource usage values along with predefined query features. Second, we input the query features and data arrival rates to the MDN model for training. Following this, the model statistically analyzes the input features' values and the actual observations of the resource consumption of the training set and predicts the probability distribution parameters (i.e. mean, variance, and mixing coefficients) over target values. Once the model is built and materialized, it can be used to estimate the resource usage of new incoming queries based on their feature values. Section 3.4 covers the details of the technique thoroughly.

Auto-scaling Policy Setting: Once the resource distribution prediction becomes available, its exploitation in data stream processing workload management is yet another challenge. The auto-scaling policy setting application demonstrates that the distribution prediction provides a reliable source of information for defining appropriate resource elasticity rules. To do so, the probability of auto-scaling policy activation is calculated. This estimate is then used as a critical parameter for analysing and predicting the impact of the defined rules on the resources. This feature allows us to define consistent auto-scaling policies or revisit the existing thresholds if needed. More details of the application will be given in Section 3.6.1.

Distribution based Admission Controller: As another concrete application of the distribution prediction, we develop an admission controller which is able to efficiently admit or reject incoming queries based on the predicted resource usage PDFs. For this purpose, the SLA miss probability of the incoming workload is calculated. This estimate is then evaluated against different predefined decision-making thresholds. This enables the definition of most-to-least probable thresholds simultaneously in order to address different SLA compliance levels cost-effectively. We will discuss the proposed admission controller further in Section 3.6.2.
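Both applications reduce to evaluating tail probabilities under the predicted mixture. A minimal sketch (the GMM parameters and thresholds are toy values, not experimental results) computes the probability that CPU usage exceeds a scale-out threshold or an SLA limit:

```python
from math import erf, sqrt

def gmm_tail_prob(threshold, weights, means, sigmas):
    """P(target > threshold) under a 1-D Gaussian mixture model."""
    prob = 0.0
    for a, m, s in zip(weights, means, sigmas):
        cdf = 0.5 * (1 + erf((threshold - m) / (s * sqrt(2))))
        prob += a * (1 - cdf)          # each component's tail mass
    return prob

# Toy predicted PDF for CPU usage (%): bimodal, as in Fig. 3.1(b).
weights, means, sigmas = [0.7, 0.3], [28.0, 85.0], [5.0, 6.0]
print(f"P(scale-out rule at 80% fires): "
      f"{gmm_tail_prob(80, weights, means, sigmas):.3f}")
print(f"P(SLA limit of 95% missed): "
      f"{gmm_tail_prob(95, weights, means, sigmas):.3f}")
```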


Figure 3.3: Our approach builds an MDN model based on the historical logs of queries to predict the distribution of new incoming workloads. The predicted PDFs are then used for developing two novel workload management strategies: a) distribution-based admission control, and b) auto-scaling policy setting.

3.3 Related Work

There are two lines of related work; one directly investigates query performance prediction and the other uses estimations for workload management. In this section, we discuss both and highlight the research gap.

Workload Performance Prediction. Query processing run-time and resource estimation has been investigated in recent years. This line of work explores the estimation of run-time and resource consumption of SQL queries in the context of both interleaved [22, 44, 80] and parallel execution [21, 44, 90, 112]. In the majority of related work, different statistical ML techniques are applied for query performance estimation. Specifically, techniques such as Kernel Canonical Correlation Analysis (KCCA), Multiple Additive Regression-Trees (MART), and Support Vector Machines (SVM) have been built upon query plan features [50], operator level features [80], or both [22], respectively.

When it comes to concurrent workloads, the authors in [21] describe an experimental modelling approach for capturing interactions in query mixes. To do so, the interactions are modelled statistically using different regression models. Along similar lines, [44] argues that the buffer access latency measure is highly correlated with the query execution time, and uses linear regression techniques for mapping buffer access latency to execution times. The authors in [45] also use the k-nearest neighbours prediction technique to identify spoiler model coefficients for a new template based on similar ones. All of the above studies approximate the performance of a workload as a single point value. Unlike with PDFs, from single point estimates we cannot directly obtain valuable statistical measures such as variance or confidence intervals about the target data.



Data Processing Workload Management. Workload management and resource sizing for data and stream processing systems use either reactive techniques (e.g. system load dynamics monitoring) [35] or predictive techniques (e.g. estimating the workload performance) [27, 38, 60, 115, 116] for decision making. These predictive approaches estimate the workload performance as a single point value [27, 60], assume that the PDF of the workload execution time is already available [116], or estimate (rather than predict) the PDF using sampling-based techniques [38, 115].

Specifically, [27] proposes an input- and query-aware partitioning technique which relies on input rate estimation using time series forecasting. However, predicting workload using time series analysis is not adequate, because event rates usually change in an unpredictable way and a single point estimate does not reflect the distribution. In this regard, although the authors in [116] voiced the issue, they assume that the PDF of the execution time of a query is already available to the service provider. As single point estimation gives no clue of the confidence in the estimation, they use a committee-based ML model in their follow-up work [115]. Along similar lines, [38] approximates the probability distribution using a histogram-based approach. However, this approach is only a simple approximation of the distribution based on a number of already collected query execution times, which means it is incapable of predicting the PDF based on the features of a new incoming query.

Concluding Remarks. Based on the above discussions, readers may have noticed the broken link between the two threads of work. Most of the existing techniques for query resource or performance prediction contemplate the target as a single point value, whereas the techniques proposed in recent studies for workload management [38, 115, 116] rely on the whole spectrum of performance or resource usage, because even in an Online Transaction Processing (OLTP) workload, queries with the same query time may follow different query time distributions [116]. The authors in [114] propose a white-box technique for quantifying the uncertainties of query execution time prediction. It treats the fixed constant values of operator selectivities and the unit costs of single CPU or I/O operations as random variables and develops analytical techniques to infer the distribution of likely running times. Although that work differs from ours, as it targets neither continuous queries nor resource usage distribution prediction, it also has the following limitations: the technique is limited to the PostgreSQL optimizer cost model and, more importantly, it does not consider concurrent query execution.

Our work attempts to address the above issues by proposing a set of black-box models which are able to predict the distribution of resource usage of highly concurrent workloads. Note that, compared to white-box approaches [113, 114], ML algorithms ease the task of cost model generation for increasingly complex data management systems, since they are able to implicitly capture the internal behaviour of components and their interaction with OS modules in terms of their resource footprint. This complexity is further intensified in clouds due to the heterogeneity of resource types and uncertainties of the underlying infrastructure.


3.4 Resource Usage Prediction

The technique we describe in this chapter combines the knowledge of continuous query processing with statistical models. Employing an ML technique requires fulfilling the following tasks: i) feature identification and selection, ii) model selection, and iii) training and testing.

3.4.1 Single Continuous Query

A streaming application is represented by a directed graph whose vertices are operators and whose edges are streams. In our approach, the continuous query feature set and the data arrival rate distribution model form the input vector. This exploits an important observation: data stream processing workload behaviour is predominantly a function of query features along with data arrival rates.

Key to the accuracy of a prediction model are the features used to train the model. We identify a set of potential features that affect stream processing performance and query resource usage. The potential features are gathered by analyzing those considered in related work [22, 50] and those we observed in various performance test analyses. Intuitively, not all features have a high correlation with the target of the model, and thus we need to select only those features with high predictive capability. To this end, we use a correlation-based feature subset selection method [57] along with best-first search for identifying the most effective attributes from the feature vector space.

Table 3.1 lists the feature set used as input to the model. The attributes are extracted from multiple sources such as the query statement text (e.g. win_type_size), the distribution model (e.g. avg_arrival_rate), or the query plan (e.g. opt_type_count). Although previous studies [22] showed that the selectivity of operators and cardinality estimates are useful features for execution time prediction, the reason they were not considered in our feature set is discussed in Section 3.4.2.1. Note that the above list is further customized based on the prediction target, because attributes have different predictive impact on CPU and memory usage estimation. A feature that highly correlates with memory consumption might have no correlation with CPU usage. For example, the feature selection task shows that the window size has an insignificant effect on CPU usage prediction, while it heavily affects memory usage prediction.
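As a hedged illustration of the selection step (the thesis uses WEKA's correlation-based subset selection with best-first search; the simple per-feature correlation ranking below is only an analogous stand-in on toy data):

```python
import numpy as np

def rank_features(X, y, names):
    """Rank features by |Pearson correlation| with the target -- a simple
    stand-in for the correlation-based subset selection used in the thesis."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(zip(names, scores), key=lambda p: -p[1])

rng = np.random.default_rng(0)
names = ["avg_arrival_rate", "win_type_size", "join_predicate"]
X = rng.random((300, 3))
y = 2.0 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(0, 0.1, 300)  # toy CPU target
for name, score in rank_features(X, y, names):
    print(f"{name:18s} {score:.2f}")   # arrival rate dominates, as expected
```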

3.4.2 Concurrent Workload

A streaming application typically consists of a number of continuous queries simultaneously being processed by the system. This means a resource usage modelling technique has to consider resource consumption estimation in the presence of concurrent executions and the combined workload of a large number of queries.

Queries running concurrently in a mix may either positively or negatively affect each other [21]. Therefore, to model a concurrent workload, we need to study: i) the way a system runs a batch of queries and applies optimizations, to reflect possible positive interactions in the feature set, and ii) the way queries compete for shared hardware resources, to identify possible negative effects on the mix performance. These two issues are studied in the following sub-subsections respectively.


Table 3.1: Feature input for the training model.

    Feature Name       Description                                       Collection Source
    avg_arrival_rate   Average arrival rate (tuple/sec)                  Distribution model
    stream_no          # of data stream sources                          Query statement
    subquery_no        # of nested subqueries                            Query statement
    agg_func_no        # of aggregation functions                        Query statement
    join_predicate     # of join predicates in query                     Query statement
    project_size       Projection size of query                          Query statement
    equ_predicate      # of equality selection predicates                Query statement
    inequ_predicate    # of non-equality selection predicates            Query statement
    agg_column_no      # of columns involved in GROUP BY clause          Query statement
    opt_type_count     # of each operator type in query plan             Query plan
    win_type_size      Window size: time unit (sec) for time windows     Query statement
                       or tuple count for tuple windows
    win_type_slide     The sliding value of the window type              Query statement


3.4.2.1 Stream Processing Optimizations

The first step toward modelling a concurrent workload is feature set extension. This process adapts the features used for isolated query resource usage prediction to include features from concurrent executions. Since the proposed technique is based upon continuous query features, the key to successful modelling of combined workloads lies in understanding the way the system applies optimizations. The main optimization techniques are operator reordering, redundancy elimination, placement, state sharing, and so on [62], which are to varying degrees supported by today's stream processing systems. For example, Odysseus [23] supports query rewriting (e.g. selection and projection push-down) and query sharing. Note that the mentioned optimization strategies are not exclusive to multi-query execution; however, some strategies such as sub-graph sharing or state sharing are more likely to be applied in the case of a concurrent workload.

According to the initial feature set selection (Table 3.1), three optimization strategies, namely redundancy elimination, state sharing, and reordering, need to be investigated for feature set extension. The others are either i) not applicable given the scope of this study (e.g. operator placement, which is for distributed stream processing environments), ii) application specific (e.g. load shedding, which trades performance against accuracy of results), or iii) related to system performance configuration (e.g. batching, which is a typical performance tuning option in stream processing systems such as Oracle CEP [19]).


Redundancy Elimination. In the case of multiple-query registration, a data stream processing system constructs a global query graph, which contains all operators of all currently active queries in the system. In this case, a query optimization component is used to detect reusable operators in different queries. For example, the Odysseus stream processing system [23] applies query sharing, which uses a single operator when the same operator appears in multiple queries from the sources to the sinks. Therefore, for a concurrent workload we include a list of distinct query execution plan nodes (i.e. operators) for all the queries in our training set, as opposed to a single continuous query. This defines a global feature space describing all concurrent queries in the workload.

State Sharing. This strategy optimizes for space by avoiding unnecessary copies of data. For example, the continuous query language (CQL) implements windows via non-shared arrays of pointers to shared data items, such that a single data item might be pointed to from multiple windows [24]. Therefore, when there are multiple window operators against the same source, we consider the largest window size (i.e. win_type_size) in the feature list.

Operator Reordering. Reordering is profitable when there is a chance to move selective operators before costly ones. For example, Odysseus [23] applies selection and projection push-down, which avoids unnecessary processing. This optimization, typically performed by the optimizer, affects the selectivity ratio (i.e. the number of output data items per input data item) of an operator even in single query execution. However, we did not include the selectivity ratio of the operator as a feature in our training vector, since in a stream processing environment we do not have control over the selectivity of the operators due to constant data arrival rate fluctuations. Moreover, a preliminary investigation of the influence of operator selectivity using a sampling approach in a set of experiments found that its contribution to the accuracy of resource usage distribution prediction is negligible.

3.4.2.2 Resource Contention

When multiple queries are registered on the same host, the operators competing for common hardware resources such as disk, memory, or CPU might negatively impact performance. As we aim at resource usage modelling, resource contention is not a challenge, because our models capture the overall resource utilization. This means that if there is contention, we will observe higher CPU utilization and vice versa. Thus, the contention issue is implicitly handled by our models.

Resource contention hurts query performance measures such as latency and throughput. Although prediction of these measures is not the focus of this study, our approach to resource usage modelling paves the way for scrutinizing the impact of concurrency on query performance prediction. Specifically, distribution-based prediction of resource utilization for a given query, when it runs either in isolation or in a mix, provides upper and lower bounds of resource usage. Based on this information, analytical or statistical models (e.g. correlation) can describe how query performance varies under different resource availability scenarios.


Figure 3.4: Overview of the proposed approach for predicting the resource usage distribution of continuous queries.

3.4.3 Model Selection

Our approach employs the MDN [31], a special type of Artificial Neural Network (ANN), in which the target (e.g. CPU usage) is represented as a conditional PDF. The conditional distribution represents a complete description of data generation. An MDN fuses a mixture model with an ANN. We utilize a Gaussian Mixture Model (GMM) based MDN, where the conditional density functions are represented by a weighted mixture of Gaussians. The GMM is a very powerful way of modelling densities, since it fully describes a model by three sets of parameters that determine the Gaussians and their membership weights. From this density, we can calculate the mean, which is the conditional average of the target data. Moreover, the full densities can also be used to accurately estimate the expectation and variance, the two main statistics characterizing the distribution.


Fig. 3.4 gives an overview of the approach. The main input features of the model consist of query features collected from the CQL statement and the query plan. In this process, the neural network is responsible for mapping the input vector x to the parameters of the mixture model $(\alpha_i, \mu_i, \sigma_i^2)$, which in turn provide the conditional distribution. In fact, Fig. 3.4 shows a schematic example MDN with 2 components that takes a feature vector x of dimensionality 4 as input and provides the conditional density p(t|x) over a target t of dimensionality 1.

3.4.3.1 Mixture Density Networks

The combined structure of a feed-forward neural network and a mixture model makes an MDN. In an MDN, the distribution of the outputs t is described by a parametric model whose parameters are determined by the output of a neural network. Specifically, an MDN maps a set of input features x to the parameters of a GMM, including mixture weights $\alpha_i$, means $\mu_i$, and variances $\sigma_i^2$, which in turn produce the full PDF of an output feature t conditioned on the input vector, p(t|x). Thus, the conditional density function takes the form of a GMM as follows:

$$ p(t|x) = \sum_{i=1}^{M} \alpha_i(x)\, \phi_i(t|x) \tag{3.1} $$

where M is the number of mixture components and $\phi_i$ is the $i$th Gaussian component's contribution to the conditional density of the target vector t (with c the dimensionality of t):

$$ \phi_i(t|x) = \frac{1}{(2\pi)^{c/2}\, \sigma_i(x)^{c}} \exp\left\{ -\frac{\|t - \mu_i(x)\|^2}{2\, \sigma_i(x)^2} \right\} \tag{3.2} $$

The MDN approximates the GMM as:

$$ \alpha_i = \frac{\exp(z_i^{\alpha})}{\sum_{j=1}^{M} \exp(z_j^{\alpha})} \tag{3.3} $$

$$ \sigma_i = \exp(z_i^{\sigma}) \tag{3.4} $$

$$ \mu_i = z_i^{\mu} \tag{3.5} $$

where $z_i^{\alpha}$, $z_i^{\sigma}$, and $z_i^{\mu}$ are the outputs of the neural network corresponding to the mixture weight, variance, and mean of the $i$th Gaussian component in the GMM, given x [31]. To constrain the mixture weights to be positive and sum to unity, the softmax function is used in Eq. 3.3, which relates the outputs of the corresponding units in the neural network to the mixing coefficients. Likewise, the variance parameters (Eq. 3.4) are exponentials of the network outputs, which constrains the standard deviations to be positive.

The objective function for training the MDN minimizes the Negative Log Likelihood (NLL) of the observed target data points given the mixture model parameters:


$$ E = -\sum_{n} \ln\left\{ \sum_{i=1}^{M} \alpha_i(x_n)\, \phi_i(t_n|x_n) \right\} \tag{3.6} $$

Since the ANN part of the MDN provides the mixture model parameters, the NLL must be minimized with respect to the network weights. To minimize the error function, the derivatives of the error function with respect to the network weights are calculated. Specifically, the derivatives of the error are calculated at each network output unit, including the priors, means and variances of the mixture model, and then propagated back through the network to find the derivatives of the error with respect to the network weights. Therefore, non-linear optimization algorithms such as scaled conjugate gradients can be applied to MDN training.

Once an MDN is trained, with the usual precautions against over-training, it can predict the conditional density functions of CPU and memory usage, conditioned on the incoming query features and the data arrival rate distribution model.
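As a concrete illustration of how Eqs. 3.1-3.6 fit together, the sketch below wires a small network to the GMM constraints and trains it by minimizing the NLL. It uses PyTorch purely for brevity; the thesis experiments use the Netlab toolbox (Section 3.5.1.2), and all data here is synthetic:

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Minimal mixture density network for a 1-D target (Eqs. 3.1-3.5)."""
    def __init__(self, n_features, n_hidden=16, n_components=3):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Tanh())
        self.z_alpha = nn.Linear(n_hidden, n_components)   # -> mixture weights
        self.z_sigma = nn.Linear(n_hidden, n_components)   # -> std deviations
        self.z_mu = nn.Linear(n_hidden, n_components)      # -> component means

    def forward(self, x):
        h = self.hidden(x)
        alpha = torch.softmax(self.z_alpha(h), dim=1)      # Eq. 3.3 (softmax)
        sigma = torch.exp(self.z_sigma(h))                 # Eq. 3.4 (positivity)
        mu = self.z_mu(h)                                  # Eq. 3.5
        return alpha, sigma, mu

def nll(alpha, sigma, mu, t):
    """Negative log likelihood of the targets under the mixture (Eq. 3.6)."""
    log_phi = torch.distributions.Normal(mu, sigma).log_prob(t)
    return -torch.logsumexp(torch.log(alpha) + log_phi, dim=1).mean()

# Toy bimodal training data standing in for query features and CPU usage.
torch.manual_seed(0)
x = torch.rand(512, 4)
mask = (torch.rand(512, 1) < 0.5).float()
t = 0.25 * mask + 0.85 * (1 - mask) + 0.03 * torch.randn(512, 1)

model = MDN(n_features=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):
    alpha, sigma, mu = model(x)
    loss = nll(alpha, sigma, mu, t)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())   # the NLL decreases as the mixture captures both modes
```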

3.5 Experiment

This section explores the steps followed and the results obtained from the experiments conducted to evaluate the accuracy of our prediction technique. Moreover, we evaluate the performance of the approach relative to state-of-the-art single point prediction techniques. We conduct our experiments on both public and private clouds to evaluate the accuracy of estimations in the presence of any possible performance variations.

3.5.1 Experimental Setup

Two virtual machine (VM) instances, one for load generation and another as a host for the stream processing system, were employed from the CSIRO private cloud. The stream generator system was an m1.medium instance with 4GB RAM and 2 VCPUs, running Ubuntu 12.04.02 Server 64-bit. All queries were executed on an m1.large instance with 8GB RAM, 4 VCPUs, and the same OS. The hypervisor is KVM, and the nodes are connected with 10GB Ethernet. In our cloud, each physical machine has 16 cores of Intel(R) Xeon(R) CPU 2.20GHz with hyper-threading enabled, which the OS sees as 32 cores (CPU threads). Therefore, 4 VCPUs map to 4 CPU threads and 2 full CPU cores.

3.5.1.1 Dataset and Workload

To validate our approach we deployed both the Linear Road Benchmark (LRB) [25] and a slightly modified TPC-H [20] in Oracle CEP as a centralized stream processing system.

LRB Workload. This workload has primarily been designed for comparing the performance characteristics of streaming systems. It contains 20 queries with different levels of complexity in terms of execution plan. We treated them as template queries.


Excluding the ad-hoc query answering set reduced them to 17 template queries. Various arrival rates (from 100 to 100K tuple/second) along with random substitution of window sizes (from 1 to 900 sec) resulted in 17,289 execution traces. To generate data streams, 500MB of data (i.e. 3 hours of simulated traffic management data) was fed into the streaming system using the system's built-in load generator, which played the role of the data driver in the LRB. Each query was registered and logged for more than 3× its window size to properly capture the impact of time windows on resource consumption.

LRB Mix Workload. To build a representative workload of concurrent query executions, we collected 585 execution traces for 18 query mixes. To generate the dataset, different combinations of the queries at multiprogramming levels (MPL) ranging from 2 to 17 were randomly selected and registered in the stream processing system. Once the mixes started processing the incoming data streams, the CPU and memory usage of the system were collected.

TPC-H Workload. In contrast to the LRB workload, TPC-H has been designed primarily for DBMSs, though it has also been used in stream processing research. In this context, each relation is considered a data stream source and the tuples are sent toward the stream processing engine over the network using a load generator. Therefore, each registered query references a subset of the relations in the input over time.

We created a 0.1GB TPC-H database using the DBGen tool as per the specification. To keep the overall experimentation duration under control, we did not use a larger database size (e.g. 1GB), because the tables are in fact the stream source material in our experiment and we have to send each tuple over the network. Quite simply, at 1GB scale, the LINEITEM table has 6,001,215 tuples, and even at 5000 tuple/sec it takes more than 20 minutes to send all the tuples over the network. With the current hardware, this rate is the maximum consumption rate for queries without any join, such as Q1 and Q6; it drops to less than 200 for Q8 with 7 data stream sources. As Oracle CEP does not support correlated sub-queries, we were forced to exclude templates 4, 11, 15-18, and 20-22. We generated 35 executable query texts using QGen based on the remaining 13 TPC-H query templates.

Furthermore, we slightly modified these queries to make them compatible with the stream processing context. One of the key changes was adding a time window for each stream source to let queries show the upper bound of CPU and memory usage. Moreover, some query semantics require that tuples not leave the time window for a certain period of time in order to produce meaningful results. In other words, we needed to keep the first tuple that enters the time window until the load generator reads and sends the last tuple from the relation source. To this end, we set the time window range to the value S if the load generator needs S seconds to read and send all the tuples.

The load generator was not allowed to send duplicate tuples. In addition, relations have different cardinalities, so in the case of multiple stream sources in one query, we set all the time windows to the biggest one. This lets the relation at time t consist of tuples obtained from all elements of the stream up to t. For example, in a join between LINEITEM (∼600K tuples) and NATION (25 tuples) streams, the latter requires as big a time window as the former, to let elements remain in the window until the last tuple from the LINEITEM stream enters the window for processing.


Figure 3.5: Best fit of sent tuples per second against different distribution models. The figures show the probability density of average tuples sent per second for speed rates of (a) 50K and (b) 100K for two different queries.


The 35 generated executable queries were registered separately in Oracle CEP and their performance measures against the fluctuating arrival rates were logged. The obtained workload consists of 8,783 execution traces.

The performance measures of interest (i.e. CPU and memory) were collected using dstat1, a lightweight Python-based tool that collects OS and system statistics passively, without affecting performance. To guarantee healthy and repeatable data gathering, the execution traces of all queries were collected several times. Moreover, all queries were run with a cold start, making sure the buffers were flushed and we had a fresh JVM.

Note that all the models are trained and tested under varying data arrival rate distribution models. To this end, after setting a certain data arrival rate, the generator typically tries to reach the specified velocity while adjusting the rate based on the engine's consumption rate with the aid of a thread sleep function. This means that a query (especially a complex one) might be able to consume only 100 tuples per second even when we set the load generation rate to 200 tuples per second. Thus, a few seconds after commencement the buffer of the stream processing engine is full and the load generator thread sleeps for a few milliseconds to allow the consumer to exhaust the queue. This situation inherently emulates different load generation distributions; for example, for the rates 50K and 100K the distribution best fits the Weibull and generalized extreme value distributions, as shown respectively in Fig. 3.5(a) and 3.5(b).

1 http://dag.wiee.rs/home-made/dstat/


3.5.1.2 Training and Testing Settings

To assess how the result of a predictive model would generalize to an independent, unforeseen data set, we divided the LRB workload randomly into training and testing datasets with 66% and 34% split rates respectively. For the TPC-H workload we used k-fold cross-validation. Given the workload size, 2-fold cross-validation was used to train and test parameters. For each fold, we randomly assigned data points to two equal-size sets ds1 and ds2: we shuffled the data array and then divided it into two arrays. We then trained on ds1 and tested on ds2, followed by training on ds2 and testing on ds1.

Before training and testing, the input and output features were normalized using z-score and min-max normalization with range (0.1-0.9). For training and testing, we used the Netlab toolbox [91], which provides the central tools necessary for the simulation of theoretically well-founded neural network algorithms and related models, in particular the MDN. The implemented MDN model uses an MLP as the feed-forward neural network, though in general any non-linear regressor can be utilized.
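A minimal sketch of the two normalization steps (toy values; the (0.1, 0.9) range matches the setting above):

```python
import numpy as np

def zscore(col):
    """Standardize a feature column to zero mean, unit variance."""
    return (col - col.mean()) / col.std()

def minmax(col, lo=0.1, hi=0.9):
    """Rescale a column into the (0.1, 0.9) range used in the experiments."""
    return lo + (hi - lo) * (col - col.min()) / (col.max() - col.min())

x = np.array([100.0, 5_000.0, 50_000.0, 100_000.0])   # e.g. arrival rates
print(zscore(x).round(2), minmax(x).round(2))
```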

There are a number of hyper-parameters, including the number of Gaussian components and the number of neurons in the MLP, that need to be specified beforehand. We evaluated several settings and assessed the trade-off between accuracy, training time and overhead. We concluded that a GMM with 3 components and 2 neurons per feature in the input vector provides acceptable accuracy within a tolerable overhead.

State of the Art Techniques. To compare the performance of our approach with single point estimators, we used REPtree and SVM as the alternative techniques. REPtree and SVM are the main prediction techniques used in [116] and [22] respectively. Note that [80] also uses a variant of regression trees as a core predictor. In our implementation, these algorithms are called from the WEKA package [58].

3.5.2 Evaluation: CPU and Memory Usage

To determine whether a probabilistic model performs well, we must first set the goal of the model: if, for example, a trained MDN assigns some probability to the actual observation, we should be able to judge whether the prediction is accurate or not. Therefore, in the following subsection we first set the goals and then define the appropriate metrics implemented for evaluation.

3.5.2.1 Error Metrics

The goal of a probabilistic prediction is to maximize the sharpness of the predictive distributions subject to calibration [53]. Sharpness refers to the concentration of the predictive distributions. Calibration refers to the statistical consistency between the predictive distributions and the observations. Our objective is to predict calibrated PDFs that closely approximate the region in which the target lies, with proper sharpness. To this end, the Continuous Ranked Probability Score (CRPS) [53] is a proper metric for evaluating the accuracy of PDFs.


Figure 3.6: (a) predicted PDF and the observation; (b) schematic sketch of the CRPS as the difference between the CDFs of prediction and observation.

The CRPS takes the whole distribution into account when measuring the error:

\[ \mathrm{CRPS}(F, t) = \int_{-\infty}^{\infty} \left[ F(x) - O(x, t) \right]^2 dx \tag{3.7} \]

where F and O are the Cumulative Distribution Functions (CDFs) of the prediction and observation distributions respectively. O(x, t) is a step function that attains the value of 1 if x ≥ t and the value of 0 otherwise. Therefore:

\[ \mathrm{CRPS}(F, t) = \int_{-\infty}^{t} \left[ F(x) \right]^2 dx + \int_{t}^{\infty} \left[ F(x) - 1 \right]^2 dx \tag{3.8} \]

To calculate the CRPS, both the prediction and the observation are converted to CDFs. The CRPS compares the difference between the CDFs of prediction and observation, as given by the hatched area in Fig. 3.6(b). It can be seen that the area gets smaller if the prediction distribution concentrates probability mass near the observation, i.e. the better it approximates the step function. Moreover, a small CRPS value shows that the prediction captures the sharpness of the target accurately.

After calculating the CRPS for each prediction, we need to average the values to evaluate the complete input set:

\[ \overline{\mathrm{CRPS}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{CRPS}(F_i, t_i) \tag{3.9} \]

The CRPS generalizes the mean absolute error, thereby providing a direct way of comparing various deterministic and probabilistic predictions using a single metric [53].
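As an illustration only, Eq. 3.8 can be evaluated numerically for a GMM prediction as in the sketch below; the grid bounds suit targets normalized into (0.1, 0.9), and all mixture parameters are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def gmm_cdf(x, pi, mu, sigma):
    # CDF of a Gaussian mixture: weighted sum of the component CDFs.
    return sum(p * norm.cdf(x, m, s) for p, m, s in zip(pi, mu, sigma))

def crps(pi, mu, sigma, t, grid=np.linspace(0.0, 1.0, 2001)):
    """Numerical CRPS (Eq. 3.8): squared area between the prediction CDF
    and the observation step function."""
    F = gmm_cdf(grid, pi, mu, sigma)
    O = (grid >= t).astype(float)   # step function jumping at observation t
    return np.trapz((F - O) ** 2, grid)

# A hypothetical 3-component prediction scored against an observation of 0.3:
print(crps(pi=[0.5, 0.3, 0.2], mu=[0.28, 0.4, 0.6], sigma=[0.05, 0.1, 0.2], t=0.3))
```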

We are also interested in evaluating the spread of predictive density in which our targets lie. The average Negative Log Predictive Density (NLPD) [54] error metric is used for evaluating this aspect, although unlike the CRPS it is not sensitive to distance:

\[ \mathrm{NLPD} = \frac{1}{n} \sum_{i=1}^{n} -\log\left( p(t_i \mid x_i) \right) \tag{3.10} \]

where n is the number of observations. The NLPD evaluates the amount of probability that the model assigns to targets and penalizes both over- and under-confident predictions. Both CRPS and NLPD are proper measures, meaning that they reward honest assessments.

The last metric is the Mean-Square Error (MSE):

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (t_i - m_i)^2 \tag{3.11} \]

where m refers to the median of the PDFs as point predictions for the MDNs. This metric allows us to compare the proposed technique with single point competitors.
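Under the same hypothetical GMM representation, Eqs. 3.10 and 3.11 can be sketched as follows, with the PDF median recovered by a simple grid inversion of the mixture CDF:

```python
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, pi, mu, sigma):
    return sum(p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma))

def nlpd(preds, targets):
    # Eq. 3.10: average negative log predictive density over the test set.
    return -np.mean([np.log(gmm_pdf(t, *p)) for p, t in zip(preds, targets)])

def gmm_median(pi, mu, sigma, grid=np.linspace(0.0, 1.0, 2001)):
    # Median = first grid point where the mixture CDF crosses 0.5.
    cdf = sum(p * norm.cdf(grid, m, s) for p, m, s in zip(pi, mu, sigma))
    return grid[np.searchsorted(cdf, 0.5)]

def mse(preds, targets):
    # Eq. 3.11 with the PDF median as the point prediction.
    return np.mean([(t - gmm_median(*p)) ** 2 for p, t in zip(preds, targets)])

preds = [([0.5, 0.3, 0.2], [0.28, 0.4, 0.6], [0.05, 0.1, 0.2])] * 2  # hypothetical
targets = [0.3, 0.35]
print(nlpd(preds, targets), mse(preds, targets))
```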

3.5.2.2 Evaluation Results

The results for both the proposed approach using the MDN and the single point estimators under the CRPS, NLPD, and MSE metrics are shown in Tables 3.2 to 3.4 respectively. Note that different MDN architectures including 3, 5, and 8 mixture components (M) were evaluated to analyse the influence of this hyper-parameter in the model.

All three metrics are negatively oriented scores; hence a smaller value is better. Let us first evaluate the accuracy of the MDN per se using the CRPS and NLPD measures. As we can see, in all three workloads the error numbers are small enough to suggest that the proposed model is an appropriate one for distribution prediction of data stream processing workloads. In the LRB Mix workload, a sophisticated MDN architecture with 8 and 5 components led to better CPU and memory utilization prediction respectively under the CRPS metric. In contrast, both LRB and TPC-H workloads show slightly worse performance as the architecture becomes more complex.

The MDN shows slightly better performance in memory utilization prediction of LRB compared with TPC-H in terms of CRPS values, though its performance in CPU prediction is nearly identical in both workloads. This is because the TPC-H workload is more complex than LRB, as its query templates combine complicated query plans with various data sources. Although the LRB workload has a wide complexity range of queries, all deal with one data stream.

In terms of concurrent workload, the results show that the model is a reliable predictor for workloads at MPL range from 2 to 17. Specifically, we can see identical CRPS and MSE values (i.e. 0.042 and 0.010) for memory prediction in both LRB and LRB Mix. However, the MDN has better performance in CPU prediction of LRB compared with LRB Mix. In this regard, the CRPS error reduces as the MDN architecture becomes more complex. This is because the combined workloads are much more complex and hence require a more sophisticated architecture.

To compare the proposed approach with the state of the art techniques, we need


Table 3.2: Trained classifiers performance as per LRB workload.

Resource  M  MDN CRPS  MDN NLPD  MDN MSE  REPTree MSE  SVM MSE
CPU       3  0.036     -1.95     0.006    0.008        0.007
          5  0.128     -0.339    0.096
          8  0.113     -0.865    0.043
Memory    3  0.042     -3.136    0.010    0.008        0.015
          5  0.053     -1.465    0.066
          8  0.065     0.075     0.046

Table 3.3: Trained classifiers performance as per LRB Mix workload.

Resource  M  MDN CRPS  MDN NLPD  MDN MSE  REPTree MSE  SVM MSE
CPU       3  0.114     -0.584    0.032    0.038        0.013
          5  0.106     -0.544    0.085
          8  0.099     -0.46     0.056
Memory    3  0.081     -1.96     0.010    0.011        0.02
          5  0.042     -1.33     0.058
          8  0.068     -1.18     0.042

Table 3.4: Trained classifiers performance as per TPC-H workload.

Resource  M  MDN CRPS  MDN NLPD  MDN MSE  REPTree MSE  SVM MSE
CPU       3  0.034     -2.04     0.007    0.006        0.008
          5  0.16      -0.98     0.02
          8  0.154     -0.9      0.02
Memory    3  0.057     -1.9      0.008    0.006        0.011
          5  0.092     -0.91     0.094
          8  0.097     -0.67     0.1


Table 3.5: Training times in seconds with regard to different workload sizes for 1K iterations.

Workload          LRB Mix  LRB Mix_EC2  TPC-H  LRB
Workload Size     0.5K     5.5K         8.7K   17.2K
Elapsed Time (s)  3.2      9.65         11.78  22.62

to treat it as a single point estimator and therefore use the MSE metric for comparison. In terms of memory utilization prediction, a closer look at the data indicates that the MDN outperforms the SVM technique in all the experiments. In LRB and TPC-H, the REPTree shows less error, whereas in LRB Mix the opposite observation holds true. When it comes to CPU prediction, our approach is a better resource usage estimator compared with both the REPTree and the SVM in the LRB workload. In both LRB Mix and TPC-H, the MDN performance is in between the REPTree and SVM. To be more specific, in LRB Mix our approach outperforms the REPTree while it shows a higher MSE value compared with the SVM.

In summary, our approach outperforms the state of the art single point techniques in 8 out of 12 experiments conducted using the SVM and REPTree. This result is quite promising because it shows that our approach is not only able to predict the full distribution over targets accurately, it is also a reliable single point estimator.

3.5.3 Training Times and Overhead

In this section we evaluate the training time complexity of the proposed models, as well as the overhead of using them at runtime. Table 3.5 shows the training times with regard to different workload sizes. As we can see, the training cost is very small and it grows linearly with the training set size.

Prediction Cost. A crucial issue for the deployment of the resulting estimation models is the overhead of invoking them at runtime. For this purpose, we measured the elapsed time for evaluating an MDN model for a given input feature set on a 2.80GHz Intel Core i7, and obtained an overhead of about 0.2 ms for each call. These numbers show that the MDN is quick enough to become an integral part of any workload management strategy at runtime.

3.6 Distribution-Based Workload Management

We now discuss the applications of the proposed technique to answer the following key question: Is the proposed approach applicable to resource management problems of stream processing systems in practice?

3.6.1 Predictable Auto-Scaling Policy Setting

Developing efficient and stable auto-scaling techniques in cloud environments is a challenging task due to heterogeneous infrastructure and the transient behaviour of


workloads. A number of studies approach this problem with the aid of control theory [82], reinforcement learning [60], and the like. The hard challenge is to determine a suitable policy for the decision maker (e.g. resource provisioner), as poor policy settings can lead to either resource inefficiency or instability. For example, consider the CPU utilization of the NegAccTollStr query and its corresponding auto-scaling policies, as shown in Fig. 3.7(a). Note that these two policies are defined to avoid SLA misses²

and resource dissipation respectively. For the NegAccTollStr query, as the peaks go beyond 90% within 2 consecutive

periods of 1 minute, the first policy is triggered and an additional virtual server is instantiated to process the workload (e.g. via a stream redirection technique). However, the load may now drop far below the predefined threshold of the second policy (i.e. avg(cpu)<15%), as the combined capacity of two virtual servers exceeds the current stream processing demands. Therefore, the second policy is activated and the provisioner decreases the number of instances to one. This oscillatory behaviour can continue indefinitely depending on the variation in data stream arrival rate and the continuous query's resource consumption pattern. [82] also reports the same observations. To avoid oscillations, [82] develops the proportional thresholding technique, which works by dynamically configuring the range for the controller variables. Though this technique can tackle the oscillatory problem at run-time, it is incapable of anticipating the effects of auto-scaling policies before workload execution, which can lead to SLA violations.

To circumvent the limitation of existing approaches, we propose a novel approach as discussed next. We perceive that the reason for the oscillations is the definition of inconsistent policies that are agnostic to changes in workload behaviour. In our approach such inconsistencies are avoided by exploiting the workload distribution prediction for specifying and selecting auto-scaling policies. For example, consider the SegToll resource usage behaviour shown in Fig. 3.7(b), in which only the first policy will be triggered. Based on the workload distribution we do not expect to meet the second policy and the ensuing instability even after the initial resource resizing.

Based on this observation, we claim that workload behaviour distribution prediction provides more reliable advice for auto-scaling policy setting. In fact, having an understanding of the upper and lower bounds of resource utilization helps in anticipating auto-scaling policy effects beforehand and adjusting the configurations accordingly. In other words, a workload-distribution driven auto-scaling policy setting approach can help administrators in defining more consistent auto-scaling policies.

To validate the hypothesis, we designed an experiment to evaluate whether the predicted distribution is able to characterize the most/least probable auto-scaling policies before the actual workload execution or not.

Workload: A representative workload, LRB Mix_EC2, was built based on the Linear Road Benchmark (LRB). The workload contains 5507 execution traces for 17 query mixes. The mixes are at multiprogramming level (MPL) range from 2 to 5. All the mixes were logged for about 4 minutes on an Amazon t2.micro instance. To make the

² We assume that CPU utilization above 90% leads to SLA misses. We discuss this relationship in Section 3.6.2.


Figure 3.7: The CPU utilization of (a) NegAccTollStr and (b) SegToll queries for 5 minutes. The sample auto-scaling policies cause oscillation behaviour in the NegAccTollStr workload, since they have been defined irrespective of the workload CPU usage distribution.

test workload, we randomly selected 32 mixes of queries – different from the training set – at MPL range from 2 to 5.

Auto-scaling Policy Generation: In the next step, 128 random auto-scaling policies were generated. The 128 policies were randomly split into 32 sets, each corresponding to a test query mix. This means each mix (out of 32 mixes) is run against a group of 4 auto-scaling policies. Therefore, before running each of the query mixes, 4 auto-scaling policies are defined on the t2.micro EC2 instance. We developed all the policies as per the Amazon EC2 template:

Policy Template: Take action A³ whenever {Average, Max, Min} of CPU Utilization is {>, ≥, <, ≤} than γ for at least {2, 5} consecutive periods of {1, 5} minutes.

where the threshold γ was randomly generated in the range (0,100) percent.

Training the Model: We trained the MDN classifier based on the LRB Mix_EC2 training set. We then used the trained model to predict the PDFs of CPU usage for the query mixes of the test dataset. In the next step, the probabilities of the policies were calculated based on the predicted PDFs before any workload execution took place. Once the probabilities were calculated, all the query mixes were run one after another against the predefined rules on the EC2 instance, and all the activated policies were recorded over the experiment period. The experiment duration was specified according to the policy monitoring period. In our experiment it was twice the monitoring

³ In our experiment it is a simple notification email.


duration⁴. This workflow was continued for all 32 mixes in the test dataset.

We now discuss how to calculate the auto-scaling policy probability. To do so, we first compute the probability of the CPU utilization using the following equation:

\[ \Pr[a \le X \le b] = \int_{a}^{b} f_X(x)\, dx \tag{3.12} \]

where the random variable X has density f_X and the variables a and b are the CPU utilization thresholds. Eq. 3.12 gives the probability of CPU (or memory) utilization lying within the given thresholds. However, the auto-scaling policies are also dependent on the consecutive occurrence of the events (i.e. the condition). The events that lead to the activation of thresholds are independent of time. Based on probability theory, assuming independence (i.e., the probability of an event such as a threshold activation at a given point in time is independent of past occurrences of the same type of event), we can compute the probability of two or more independent events by multiplying their individual probabilities. Therefore, the probability of an auto-scaling policy occurrence for m consecutive periods is calculated in a general form as:

\[ \mathrm{Probability}(policy, m) = \prod_{k=1}^{m} \Pr_{k}[a_k \le X \le b_k] \tag{3.13} \]

The above definition relaxes the constraint of having the same thresholds for arbitrary consecutive periods, though existing auto-scaling frameworks (e.g. Amazon Auto-Scaling Service, Azure Fabric Controller) do not offer this important feature yet. We note that the PDFs do not reflect the probability of workload behaviour across time. However, we show in our experiment that extending the probabilities to an arbitrary number of consecutive periods works well in practice.
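For illustration, the sketch below applies Eqs. 3.12 and 3.13 to a GMM-predicted CPU distribution; CPU utilization is assumed normalized to [0, 1] and all parameters and thresholds are hypothetical.

```python
from scipy.stats import norm

def prob_between(pi, mu, sigma, a, b):
    """Eq. 3.12: Pr[a <= X <= b] under a Gaussian mixture prediction."""
    return sum(p * (norm.cdf(b, m, s) - norm.cdf(a, m, s))
               for p, m, s in zip(pi, mu, sigma))

def policy_probability(pi, mu, sigma, a, b, m_periods):
    """Eq. 3.13 with identical thresholds per period, assuming the
    threshold activations in consecutive periods are independent."""
    return prob_between(pi, mu, sigma, a, b) ** m_periods

# Hypothetical policy: CPU utilization above 90% for 2 consecutive periods.
pi, mu, sigma = [0.6, 0.4], [0.7, 0.92], [0.08, 0.04]
print(policy_probability(pi, mu, sigma, a=0.9, b=1.0, m_periods=2))
```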

Before discussing the results, let us recap the purpose of the experiment. There are 32 rule groups corresponding to 32 query mixes. Each group contains 4 auto-scaling policies, of which two are the most and least probable policies with regard to the calculated probabilities. This means they are highly likely and highly unlikely to be triggered after workload execution. We now want to evaluate, for example: What percentage of the rules with the highest probability values are activated?

Based on the experimental results, we found that 62% of the rules with the highest probability were activated after workload execution. Moreover, 87% of the rules that were characterized as unlikely to be triggered at run-time also held true (i.e. they were not triggered). Fig. 3.8 displays the results for 12 out of 32 test query mixes. As the bar chart shows, the proposed technique performs well in predicting the most probable auto-scaling policies for each policy group. As we can see, it only failed to characterize the highly probable policies for Mixes 4, 8, and 12.

In summary, these findings clearly demonstrate that our hypothesis held true and that distribution-based prediction provides a reliable source of information for predictable auto-scaling policy setting.

⁴ For example, the experiment duration is 4 minutes for a policy with a monitoring duration of 2 consecutive periods of 1 minute.


Figure 3.8: The probabilities of the randomly generated auto-scaling policies for 12 (out of 32) mixes of test queries. Each query mix was evaluated against 4 auto-scaling policies, shown in the form of bright and dark coloured bars. The bright and dark bars within each policy set respectively show the activated and not activated rules at run-time. Our technique has successfully characterized the highly probable policies for all mixes but Mixes 4, 8, and 12.

Apart from its contribution to avoiding oscillatory behaviour, we believe that this feature helps users to use cloud infrastructure economically, where they are charged for every CPU cycle used and every data byte transferred in and out of the datacenter.

3.6.2 Distribution Based Admission Controller

The main goal of an admission controller is to take actions (e.g. data stream or query load partitioning, firing a new VM, etc.) at run-time such that SLA violations are minimized. To do so, a decision making module within the admission control system labels new queries and/or data streams as admitted, stalled, or rejected while carefully balancing a number of system and application variables such as system load, incoming workload estimation, SLA miss penalty cost, etc.

The state of the art admission control framework [115, 116] uses single point prediction as an underlying technique for incoming workload estimation. However, the necessity of developing distribution based prediction for making profit-oriented decisions [116] has already been voiced. In response to this issue, we have developed a distribution based admission controller which is able to take decisions related to the admission or rejection of incoming queries according to the probability of missing the SLA. The developed admission controller has the following main modules:

Workload Prediction. To estimate the workload performance, the prediction module of the admission controller is trained with the queries' features (see Table 3.1). To study how well the proposed technique performs compared with the state of the art single point based controller, the experiments are conducted considering the cases


Figure 3.9: CPU utilization beyond 95% hits the throughput (tuples/sec) of the query.

where the prediction module is equipped with the intelligence of:

• Single point predictor: In our implementation we used the REPTree algorithm as a single point prediction technique, as employed in [116].

• Distribution based predictor: In this component the PDF is constructed first. Following that, the probabilities of the SLA misses are calculated.

Decision Making. According to the predefined thresholds and prediction values, the decision making module labels the new queries as admitted or rejected.

Processor Control. Once the queries are admitted, the processor registers them in the stream processing system.

As a natural fit to our prediction targets, we used the CPU utilization predictions of continuous queries to decide about the SLA misses. SLAs are typically specified based on high-level QoS measures (e.g. latency, throughput). However, low level resource QoS metrics correlate with high-level service level measures. For example, [82] reports the strong correlation of CPU utilization with response time. Our experiments also showed a correlation between CPU utilization and throughput. Therefore, the queries are labelled in the missed SLA category when the CPU utilization goes beyond 95% a certain number of times within a period⁵. This is because we witnessed that values above the defined threshold hit the throughput, as shown in Fig. 3.9. As we can see, the throughput decreases as the CPU reaches 95%.

In our experiments the classifiers were trained based on the LRB Mix_EC2 workload. Thereafter, we used 33 mixes of queries at MPL range from 2 to 5 to evaluate the performance of the single point and distribution-based admission control techniques.

⁵ In our experiment, we consider 60 times within the 4 minute experiment duration.


Figure 3.10: Single point and distribution based admission controller performance under different decision making thresholds. In the single point case we set t1=25%, t2=45%, t3=65%, and t4=85%.

The mixes were injected into the system and the decisions undertaken by both controllers were logged. After executing the admitted query mixes in the stream processing system, the number of SLA misses was calculated from the collected running logs.

We used the set of values 25%, 45%, 65%, and 85% for the single point decision making thresholds (t), since this allows us to evaluate the performance under both soft and hard conditions. In single point decision making we simply label a query mix as rejected if the predicted CPU utilization is greater than the threshold. Accordingly, a range of low to high probabilities calculated from the distributions was used as the basis for deciding whether a query mix can be admitted into the stream processing system or not.
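The two decision rules can be contrasted in a few lines; the sketch below is a simplified illustration (all thresholds and mixture parameters are hypothetical, and the probability budget is not a value from the experiments):

```python
from scipy.stats import norm

def admit_single_point(cpu_pred, t=0.85):
    # Single point rule: reject if the predicted CPU utilization exceeds t.
    return cpu_pred <= t

def admit_distribution(pi, mu, sigma, sla_cpu=0.95, max_miss_prob=0.2):
    # Distribution rule: reject if P(CPU > 95%) exceeds a probability budget.
    p_miss = sum(p * (1 - norm.cdf(sla_cpu, m, s))
                 for p, m, s in zip(pi, mu, sigma))
    return p_miss <= max_miss_prob

print(admit_single_point(0.8))                                    # True -> admit
print(admit_distribution([0.6, 0.4], [0.7, 0.92], [0.08, 0.04]))  # depends on tail mass
```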

The bar chart (Fig. 3.10) depicts the relative performance of the two techniques against a range of thresholds. Under the fourth threshold (i.e. t4), the distribution controller shows slightly better performance in terms of the number of missed SLAs. In contrast, for t2 and t3 the single point based controller misses fewer SLAs, although the admitted queries are comparatively higher under the distribution based controller. Note that the main reason for this observation is the difficulty of translating the defined threshold for the single point controller to its equivalent for the distribution based one. Put differently, setting exactly equivalent thresholds for the two controllers is hardly possible, meaning that we are only able to evaluate the performance under a set of not necessarily equivalent thresholds covering soft and hard conditions.

In summary, we witnessed during the experiment that distribution-based prediction offers an interesting feature, as it supports the definition and application of a hierarchy of most-to-least probable thresholds simultaneously. Such a feature allows us to adjust to different thresholds as per penalty costs, which are usually defined based on different SLA compliance levels. In fact, the proposed technique is the first step towards developing an admission controller which is able to react


against probabilistic SLAs for stream processing systems hosted in datacenters.

3.7 Summary

In this chapter, we presented a novel approach for resource usage distribution prediction of data stream processing workloads. Our approach combined knowledge of continuous query processing with mixture density networks that approximate the conditional PDF of resource usage. We demonstrated that the predicted distributions have the potential to become an integral component of automated workload management systems by developing two novel applications: i) predictable auto-scaling policy setting; and ii) a distribution-based admission controller.

In the next chapter, we will investigate distribution-based resource usage and performance modelling of large-scale analytics Hive queries.


Chapter 4

Distribution-Based Workload Modelling of Large-Scale Batch Queries

In a shared multi-system cluster, running a mix of different applications and workloads (e.g. Pig, Hive) concurrently is common practice to utilize resources cost-efficiently, while challenging accurate workload performance prediction. This leaves us with one key question to answer in this chapter: How can we predict the resource and performance distribution of large-scale batch query processing workloads?

Answering this question is important for the increasingly common data-intensive platforms where efficient resource usage prediction is a key operating criterion for proper cluster and resource utilization and service level agreement (SLA) management. In this context, we argue that with distribution-based prediction of data-intensive workloads we are able to properly tackle the inevitable performance variances in the presence of resource contention.

To this end, and in response to the third (iii) part of research question (I) as specified in Section 1.3, in this chapter we derive an optimal proposal model for CPU and Runtime distribution prediction of a major big data workload, Hive queries. As discussed in Chapter 2, Apache Hive is a data warehouse infrastructure built on top of Hadoop that facilitates querying and managing large datasets residing in distributed storage. It provides a mechanism to project structure onto this data and query the data using a SQL-style language, HiveQL.

4.1 Approach Overview

Our approach combines knowledge of Hive query processing with Mixture Density Networks (MDN) [31], a flexible technique for modelling real-valued distributions with neural networks. For this purpose, we firstly execute training Hive workloads and log their CPU usage and runtime values along with predefined query features. Secondly, we input the query features to the MDN model. Finally, the MDN statistically analyses the feature values and the actual observations of the resource


Figure 4.1: Two sample predicted distributions for (a) CPU and (b) Execution Time for a sample input from Q7 of TPC-H. The histograms show respectively the actual CPU and Runtime values for 30 different instance queries generated based on template-7 and executed in the cluster.

consumption and runtime of the training data, and predicts the probability distribution parameters (i.e. mean, variance, and mixing coefficients) over the target values (i.e. CPU and query execution time). Once the model is built, it can be used to predict the resource and performance values of new incoming queries based on the query feature value set, without executing the query.

To illustrate the gains possible by using the proposed approach, consider Fig. 4.1(a) and 4.1(b), which display two sample predicted PDFs for CPU usage and runtime for one of the experiments conducted on TPC-H queries [20]. The predicted PDFs correspond to a test input from Template-7 (Q7) of TPC-H against a 100GB database size. To demonstrate the whole possible range of performance values under Q7, the histograms for 30 instance queries based on Q7 from the test set are shown as well.

As we can see, the predicted PDFs properly estimate the CPU and Runtime distributions, showing high probability around the target value. More importantly, they provide information about the whole spectrum of performance and resource usage. Specifically, the predicted PDF in Fig. 4.1(b) shows highly probable Runtime in the ranges (0.1, 0.2) and (0.3, 0.5), which are consistent with the actual distribution; since the predicted PDF corresponds to one input, the resulting uncertainty of the PDF for the range (0.8, 0.9) is defensible. Similarly, the predicted PDF for CPU time (Fig. 4.1(a)) provides a complete description of the statistical properties of the CPU usage, through which we are able to capture not only the observation point but the full range of resource usage. In contrast, a best prediction from existing single point techniques [51, 22, 80] merely estimates the point visualized by the solid vertical line, from which, unlike the PDF, we are not able to directly extract valuable statistical measures about the target, including variance, expectation, and confidence intervals.


4.2 Related Work

Query processing runtime and resource usage estimation has been investigated in the context of DBMSs and MapReduce [22, 50, 80, 51, 21, 44]. In the majority of related work, different statistical ML techniques are applied for query performance estimation. Specifically, techniques such as Kernel Canonical Correlation Analysis (KCCA), Multiple Additive Regression-Trees, and Support Vector Machines (SVM) have been built upon query plan features [50], operator level features [22, 80], or both [22]. These approaches build statistical models using past query executions and a representative set of query features which have high predictive power in terms of resource or performance estimation.

In terms of concurrent workloads, [21] uses various regression models to predict the completion times of batch query workloads when multiple queries are running concurrently. Along similar lines, [44] argues that the buffer access latency metric is correlated with the query runtime, and uses linear regression techniques for mapping buffer access latency to execution times. Though the above approaches primarily use statistical ML techniques, they apply fine-grained models in a context different from ours, that of massively parallel data processing in the MapReduce environment.

Another related work applies KCCA to the Hive workload using two different sets of features [51]. In their initial job feature vector they consider features corresponding to the number of occurrences of each operator in a job's execution plan. The obtained results suggest that Hive operator occurrence counts are insufficient for modelling Hive query runtime, which is somewhat consistent with what we will report and discuss in 4.4.4 (Fig. 4.2(a) and 4.2(b)). Following that, they include another set of low level features pertaining to Hive query execution, such as the number of maps and reduces, bytes read locally, bytes read from HDFS, and bytes input to the map stage, which lead to good prediction accuracy. However, these low level features are not available before the query is executed, so they cannot be used for performance prediction of new incoming queries.

Note that all of the above studies approximate the performance of a workload as a single point value, which is neither expressive enough nor captures performance variances.

4.3 Performance Modelling of Hive

To approach the problem of resource and performance distribution prediction of Hive workloads, we use knowledge of Hive query execution in Hadoop combined with the MDN technique. As discussed earlier in Chapter 3, building an ML model involves three steps: i) selecting the input feature set, ii) collecting training data, and iii) training, testing and refining the models. In the following subsection, we first discuss the feature set extraction in detail.


4.3.1 Query Execution in HiveQL

A key to the accuracy of a prediction model is choosing the most predictive features from the available set of features to train the model. Therefore, we need to identify a set of potential features that would affect the performance and the query resource usage. To identify the potential features we need to dissect the way a HiveQL statement is executed on top of a Hadoop cluster.

Once a Hive query is submitted against the chunks of data residing in distributed file systems (e.g. HDFS, GFS), the Hive engine compiles it to workflows of MapReduce jobs, in which the SQL operators are plugged into the map and reduce functions of the job. At the end of each map and reduce phase, the intermediate results are materialized on disk. During a Hive query execution, SQL-specific operators (e.g. table scan, select), which are implemented inside map and reduce functions, along with MapReduce-specific tasks (e.g. read, spill, shuffle, write), are the main computation tasks which use cluster resources and impact the query completion latency. The overhead of the latter is in fact a function of the number of mappers and reducers spawned across the cluster to execute the query's operators against the data blocks. This number is itself dependent on the input data to each query processing stage, job configurations, and the available free resources in a multi-system cluster.

As we aim at resource and runtime distribution prediction of Hive queries before any actual execution takes place, we inevitably need to stick with the data provided by the Hive query execution plan. Unlike conventional database systems, the Hive execution plan is an intermediate step before determining the number of mappers and reducers to be executed as a MapReduce job. However, assuming constant configuration, the estimated input data size is a proper predictive feature for alleviating the issue of mapper/reducer numbers and their corresponding Hadoop phases (e.g. reading, spilling, shuffling, writing). Thus, our feature set includes SQL and MapReduce operator counts along with the input record numbers and data sizes, as specified in Table 4.1.
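As a toy illustration of the operator-count features in Table 4.1, one could count operator names in the textual output of Hive's EXPLAIN command; the operator lists and the plan string below are simplified stand-ins, not the actual EXPLAIN grammar.

```python
import re

# Simplified stand-ins; a real EXPLAIN dump is considerably richer.
SQL_OPS = ["TableScan", "Select Operator", "Filter Operator", "Group By Operator"]
MR_OPS = ["Map Operator Tree", "Reduce Output Operator"]

def operator_counts(plan_text, operators):
    """Count how many times each operator name occurs in a plan dump."""
    return {op: len(re.findall(re.escape(op), plan_text)) for op in operators}

plan = """Map Operator Tree:
    TableScan
      Filter Operator
      Select Operator
    Reduce Output Operator"""
features = {**operator_counts(plan, SQL_OPS), **operator_counts(plan, MR_OPS)}
print(features)   # operator-count part of the feature vector
```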

We note that resource contention among different concurrent workloads will impact the performance and consequently its estimation. However, we put forward the claim that with distribution prediction of data-intensive workloads we are able to properly tackle the inevitable performance variances in the presence of resource contention and runtime configurations. We will discuss this issue in detail in 4.4.4 and 4.5.

4.3.2 MDN Technique

One of the challenging decisions when using statistical ML models is the choice of the underlying ML technique itself, because identifying the most accurate prediction model without training and testing multiple models is hardly possible. Nevertheless, the focus of our work, which is conditional probability density prediction, alleviates the problem of picking the right model. We use Mixture Density Networks as the underlying ML technique in the proposed approach. Our decision is backed up by its flexibility in capturing skewed and multi-modal distributions, as exhibited


Table 4.1: Feature set for resource modelling of Hive queries.

Feature Name                      Description
SQL Operator No                   Number of SQL operators (e.g. Table Scan) which appear in the HiveQL query plan.
SQL Operator Input Records        Input row numbers for each operator as per the query plan.
SQL Operator Input Byte           Input data size to the SQL operator.
MapReduce Operator No             Number of MapReduce operators (e.g. Reduce Output Operator) which appear in the HiveQL query plan.
MapReduce Operator Input Records  Input row numbers for each operator as per the query plan.
MapReduce Operator Input Byte     Input data size to the MapReduce-specific operator.

by the runtime and resource usage distributions in a multi-system cluster. As we have already explained the MDN architecture thoroughly, we do not repeat it here and refer the reader to Section 3.4.3.1 of Chapter 3.

4.4 Experimental Evaluation

In this section, we present the results of our experiments to evaluate the performance of the proposed approach for CPU and runtime distribution estimation of Hive queries.

4.4.1 Experimental Setup

Infrastructure Setup. We evaluate our models on the CSIRO Big Data cluster. The cluster comprises 14 worker nodes connected with a fast InfiniBand network, each featuring 2 x Intel Xeon E5-2660 @ 2.20 GHz CPUs (8 cores), 128 GB RAM and 12 x 2 TB NL-SAS HDs, making up a total disk space of 240 TB. All experiments were run on top of HiveQL 0.13.1 and Hadoop 2.3.0 in Yarn mode.

Workloads. We test our approach on the TPC-H benchmark. We execute TPC-H queries on six scaling factors: 2, 5, 25, 50, 75, and 100 GB. All databases are generated in the Apache Parquet data file format. The TPC-H workload consists of all queries except those that are either super slow (including Q2, Q8, Q9) or failed (e.g. Q19¹); we ran the super slow queries solely for the 2 and 5 GB database sizes to keep the overall experiment duration under control.

There are approximately 11 queries from each template in six databases. Thus, the resulting data set we used contains 995 queries. Note that our cluster is shared by multiple users in the organization who submit different ranges of applications (e.g. Spark, MapReduce) for processing. Moreover, queries are run either sequentially or in

¹ This issue is also reported by users in https://issues.apache.org/jira/browse/HIVE-600


parallel without any pre-defined ordering, to simulate real world conditions as much as possible.

Training and Testing Settings. To assess how the result of a predictive model would generalize to an independent data set, we divide the TPC-H workload randomly into training and testing datasets with 66% and 34% split rates respectively. Before training and testing, the input and output features are normalized using z-score and min-max normalization with range (0.1–0.9). For training and testing, we use the Netlab toolbox [91], which is designed for the simulation of neural network algorithms and related models, in particular the MDN. The implemented MDN model uses an MLP as a feed-forward neural network.

4.4.2 Error Metrics

In this section, we briefly recite the definitions of three error metrics, the CRPS, NLPD, and root mean-square error (RMSE), which were thoroughly presented in Section 3.5.2.1 of Chapter 3.

The CRPS [53] is a proper metric to evaluate the accuracy of PDFs. The CRPS takes the whole distribution into account when measuring the error:

\[ \mathrm{CRPS}(F, t) = \int_{-\infty}^{\infty} \left[ F(x) - O(x, t) \right]^2 dx \tag{4.1} \]

where F and O are the cumulative distribution functions (CDFs) of the prediction and observation distributions respectively. O(x, t) is a step function that attains the value of 1 if x ≥ t and the value of 0 otherwise. To evaluate the spread of predictive density in which our targets lie, the average NLPD [54] error metric is used:

\[ \mathrm{NLPD} = \frac{1}{n} \sum_{i=1}^{n} -\log\left( p(t_i \mid x_i) \right) \tag{4.2} \]

where n is the number of observations. The last metric is the RMSE, which allows us to compare the proposed estimation technique with single point competitors:

\[ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (t_i - m_i)^2 } \tag{4.3} \]

where m refers to the mean of the PDFs as point predictions for the MDNs.

4.4.3 State of the Art Techniques

In order to compare the performance of distribution-based prediction with single point estimators, we study REPTree, SVM, and MLP as the alternative techniques. REPTree and SVM are the main prediction techniques used in [116] and [22] respectively. Since the classical MDN uses an MLP in its neural network layer, we can expect that the MDN as a single point predictor shows almost the same performance and accuracy


Figure 4.2: (a) CPU and (b) Response time prediction for Hive queries, modelled using the Table 4.1 feature set.

as the MLP. Thus, we report and discuss the results under MLP as well. These three algorithms are implemented in the well-known Weka package [58].

4.4.4 Evaluation: Single Point Estimators

Before presenting and discussing the results under the MDN technique, let us first investigate how accurately Hive query performance and resource usage can be estimated in terms of the proposed feature set (Table 4.1) using well-established ML techniques.

Fig. 4.2(a) and 4.2(b) display the performance of REPTree in CPU and runtime estimation of Hive workloads, where it approximates resource usage more successfully than runtime. We will discuss this issue later in this section; until then, we are interested in the performance of the other competing techniques as well. Because, in general, identifying the most accurate prediction model without training and testing multiple models is hardly possible, the Relative Error (%) of CPU and Runtime estimation of ∼ 1000 Hive queries using all three alternative techniques (i.e. REPTree, SVM, and MLP) is evaluated and shown in Fig. 4.3(a) and 4.3(b).

As we can see, REPTree outperforms the other predictors in both CPU and Response Time estimation, with relative errors of 4.83% and 13.28% respectively. More importantly, our classifiers are more successful in resource estimation than runtime. The main reason behind this observation is the resource contention issue in a shared cluster of machines. As stated earlier, our cluster is shared by multiple users and various applications concurrently processing GBs or TBs of data. Therefore, when multiple jobs and queries are submitted to the cluster, they compete for common resources such as disk, memory, or CPU, which might negatively impact the performance. In terms of resource modelling, the contention is not a challenge because our models capture the CPU time, which is the amount of time for which CPUs are used


Figure 4.3: Relative error (%) for (a) CPU and (b) Response time prediction using SVM, REPTree, and MLP techniques.

for processing instructions, as opposed to, for example, waiting for I/O operations. In contrast, interference from other workloads inevitably hits the query runtime.

Specifically, the authors in [21] describe an experimental modelling approach for capturing interactions in query mixes. To do so, the interactions are modelled statistically using different regression models. Along similar lines, [44] argues that the buffer access latency measure is highly correlated with the query execution time, and uses a linear regression technique for mapping buffer access latency to execution times. The authors in [45] also use the k-nearest neighbours prediction technique to learn spoiler model coefficients for a new template based on similar ones.

We argue that with the distribution of query performance we are able to properly capture and express the whole spectrum of performance (i.e. here response time) and any possible variances in the presence of resource contention. To capture the impact of concurrency and interference on performance, there are some proposals [21, 44] for query executions in DBMSs. However, the proposed techniques are not applicable to Hive workloads in a multi-system cluster due to i) the different abstraction level of query processing in Hive, and ii) the lack of control over the type of concurrent workloads in a cluster, where they typically hold some assumptions about the mixture of queries running concurrently. Nevertheless, our approach relaxes such constraints and, more importantly, it is able to estimate the performance even when the concurrent workloads are not from the same platform, for example where a certain Hive query is competing with Spark jobs for CPU shares.

4.4.5 Evaluation: Distribution-Based Prediction

We now discuss the accuracy of the proposed approach. The results for both the proposed approach using the MDN and the single point estimators under the CRPS, NLPD, and RMSE metrics are shown in Table 4.2. Note that the number of Gaussian components is a hyper-parameter in the MDN and needs to be specified beforehand. To do so,


Table 4.2: MDN performance compared with its competitors.

Target         M  MDN CRPS  MDN NLPD  MDN RMSE  REPTree RMSE  SVM RMSE  MLP RMSE
CPU            1  0.093     -1.2      0.077     0.005         0.08      0.048
               3  0.091     -2.54     0.08
               5  0.024     -2.65     0.081
Response Time  1  0.064     -1.1      0.077     0.01          0.073     0.031
               3  0.031     -2.68     0.079
               5  0.017     -3.2      0.08

we report the results under 1, 3, and 5 mixture components (M).

All three metrics are negatively oriented scores; hence the smaller the value the better. Let us first study the accuracy of the MDN per se using the CRPS and NLPD error metrics. As the small numbers under CRPS and NLPD indicate, the proposed model is an appropriate estimator for both CPU and Runtime distribution prediction of Hive workloads. Unlike the single point estimators, the MDN shows slightly better performance in Runtime prediction than CPU. Another interesting observation is that in the TPC-H workload a sophisticated MDN architecture with 3 and 5 mixture components led to increased fidelity of results.

To compare the proposed approach with the competing techniques, we need to treat it as a single point estimator, and we therefore use the RMSE metric for comparison. According to Table 4.2, the MDN outperforms the SVM in CPU prediction, albeit REPTree has the lowest RMSE value. Similarly, REPTree outperforms the others in Response Time (RT) estimation. However, this is not the whole story.

Taking the output corresponding to the mean of the predicted PDFs is almost equivalent to using an MLP with a linear output activation function, trained with a least-squares error function. This means the MLP classifier accuracy is comparable to the MDN. A closer look at the data indicates that the RMSE values under the MLP are in between SVM and REPTree. This observation is consistent with what we saw in the RMSE values for the MDN.

However, the question may arise: "Why are the RMSE values under MDN and MLP totally different?". This observation stems from the different normalization and configuration parameters used in the MLP implementations of the Netlab toolbox [91] and Weka [58], which are respectively employed for the MDN and MLP (as a standalone technique) training. To test this hypothesis, we replaced the default MLP configuration of Weka with the one used in Netlab, observing almost the same RMSE errors.

In summary, our approach outperforms the state of the art single point techniques in 2 out of 4 experiments conducted using SVM and REPTree. This result is quite promising because it shows that our approach is not only able to predict the full distribution over targets accurately, it is also a reliable single point estimator.


4.4.6 Training Times and Overhead

In this section we evaluate the model building times and the overhead of using the models at runtime. Table 4.3 shows the training times with regard to different workload sizes. As the results indicate, the training cost is very small and it grows linearly with the training set size.

Table 4.3: Training times in seconds with regard to different workload sizes for 500 iterations.

Workload Size       1K    2K   4K    8K    16K
Elapsed Time (sec)  1.47  1.9  2.63  3.84  7.83

Apart from reasonable training time, low overhead in invoking a trained model at runtime is yet another critical parameter, because the model has to be quick enough to get the estimates ready in time for the decision making modules of workload management strategies at runtime, where unreasonable delays may lead to SLA misses. To this end, we measured the elapsed time for evaluating an MDN model for a given input feature set on a 2.80GHz Intel Core i7, and obtained an overhead of about 0.2 ms for each call. To put these numbers in perspective, execution plan generation in Hive (using the EXPLAIN command) for, say, Q1 of TPC-H takes 4.97 seconds, meaning that invoking the MDN model for each new incoming query would not be a significant factor in the overall workload management cost.

4.5 Distribution-Based Prediction Utilization

This section provides a clear picture of how the provided predictions could be utilized in workload management of data-intensive applications. We have visualized some sample predicted PDFs from the test set of the TPC-H workload, as shown in Fig. 4.4(a) and 4.4(b). In particular, the figure plots 14 random sample predicted PDFs for CPU and execution time. The histograms show the actual CPU and runtime values for the whole test dataset. Each PDF may or may not belong to a different query, as they were randomly selected from the test set, meaning they are conditioned on different inputs. The dotted vertical line shows the observation value.

As the figures show, the PDFs accurately approximate the resource usage and performance distributions, which lie primarily within the ranges (0.1, 0.4) and (0.1, 0.25) for CPU and runtime respectively. In a consistent manner, the models for CPU and execution time beyond the values 0.5 and 0.3 are much more uncertain. Put differently, the tendency of all CPU and runtime PDFs is to the right hand side of the diagram, and this is consistent with the plotted histograms of actual resource and performance values in which, for example, we hardly face resource demand above 0.5.

These sample PDFs demonstrate that the MDN is also a reliable classifier in the classic point estimate sense, where the PDFs cover the observation points with high


Figure 4.4: Sample PDF predictions for (a) CPU and (b) Execution Time of Hive queries based on the TPC-H workload.

probability in all figures except PDFs number 14 in 4.4(a) and 8 in 4.4(b). Even there, however, they locate the shape of the distributions precisely.

We also argue that distribution-based prediction gives resource and workload management systems a concise yet lucid way of interpreting workload behaviour. Such a capability is crucial for a number of resource management activities such as run-time performance isolation or diagnostic inspection. In particular, upper and lower bounds of resource usage simplify the task of performance isolation, since, for example, our predictions in all figures capture the dominant CPU time precisely. SLA management also becomes more applicable when, for example, we already know the minimum and maximum required share of CPU for a given query. When it comes to performance inspection, diagnosing abnormal behaviour as per the predicted numbers is also viable. Specifically, Fig. 4.4(a) reports that for a given set of queries we will not face peak CPU time (>0.5) very often; hence a higher peak CPU time indicates the possible presence of a fault in the software or cluster.


4.6 Summary

In this chapter, we presented a novel approach of using mixture density networks for CPU and runtime distribution prediction of large-scale analytics Hive queries. We evaluated our approach on TPC-H, showing that it outperforms the state of the art techniques in half of the experiments. This result is quite promising as it shows that the proposed approach is not only able to predict the full distribution over targets accurately, it is also a reliable single point estimator.

In the next chapter, we will investigate the elasticity management of data-intensive workloads on clouds.


Chapter 5

Elasticity Management of Data Analytics Flows on Clouds

Growing attention to getting real-time insights into streaming data leads to the formation of many complex data analytics flows. For example, by analyzing data using data analytics flows, real-time situational awareness can be developed for handling events such as natural disasters, traffic congestion, or major traffic incidents [43].

A data analytics flow typically operates on three layers: ingestion, analytics, and storage, each of which is provided by a data-intensive system. These systems are often available as cloud managed services, enabling users to have pain-free deployment of data analytics flow applications. For example, Fig. 5.1 depicts a click-stream data analytics flow in which Amazon Kinesis [5] is used for managing the ingestion of streaming data at scale. Apache Storm [12], deployed on EC2, processes the streaming data and persists the aggregated results in DynamoDB [3].

Despite straightforward orchestration, elasticity management of the flows is challenging. This is due to: i) the heterogeneity of workloads and the diversity of cloud resources such as queue partitions, compute servers, and NoSQL throughput capacity; ii) workload dependencies between layers; and iii) different performance behaviours and resource consumption patterns.

To address the issues above, in this chapter we investigate the problem of multi-layered and holistic resource allocation for data analytics flows deployed on public clouds. We propose a framework for the design and stability analysis of adaptive controllers by employing tools from classic nonlinear control theory. With numerous experiments on a real-world click-stream data analytics flow, we show that, compared to the state of the art techniques, our approach is able to reduce the error (i.e. the deviation from the desired utilization) by up to 48%.

5.1 Challenges in Elasticity Management of Data Analytics Flows

Elasticity management of data analytics flow applications is challenging due to three unique characteristics of cloud-hosted data-intensive systems. First, data analytics


Figure 5.1: A data analytics flow that performs real-time sliding-window analysis over click-stream data.

First, data analytics flow applications have heterogeneous workloads, in which different platforms and workloads are dependent on each other. For example, Fig. 5.2 clearly shows how the workload dynamics in the data ingestion layer is strongly correlated with the analytics layer. To provide smooth elasticity management, these workload dependencies need to be detected and considered in resource allocations.

Second, data analytics flow applications often deal with immense data volumes which, together with the uncertain velocity of data streams, lead to changing resource consumption patterns. This mandates an elasticity technique that can sustain workload fluctuations in a time-efficient manner, meaning that resources should be acquired and released as soon as required.

Third, a data analytics flow is deployed on heterogeneous cloud services and resources, each of which exhibits different performance behaviours and pricing schemes. In this setting, resource allocation needs to cater for diverse resource requirements and their associated cost dimensions to meet the users' Service Level Objectives (SLOs). Existing solutions that enjoy dynamic, intelligent auto-scaling algorithms [82, 75, 94] lack a holistic approach to resource requirement management for big data analytics workloads. Instead, they focus on one resource type such as Virtual Machines (VMs) or a particular workload like Hadoop. Nevertheless, [119] shows that the ability to scale down, for example, both the web server and the cache tier leads to a 65% saving of the peak operational cost, compared to 45% if we only consider resizing the web server tier.

5.2 Related Work

Elasticity techniques have been studied extensively in recent years [85]. Several techniques such as control theory [86], queueing models [109], and Markov decision processes [75] have been used to tackle the problem with respect to different resource types such as cache servers [64], HDFS storage [82], or VMs [48]. Recent studies in resource management using control theory [82, 86, 65, 66] have clearly shown the benefits of dynamic resource allocation against fluctuating workloads. More importantly, what makes the control theory approach stand out among workload management techniques is the fact that it does not rely on any prior information about the workload behavior and, unlike for example queueing models, it imposes very mild assumptions on the system model. Such features lead to a simple yet effective approach that can sustain workloads of any shape and dynamics.


[Figure: two stacked time-series panels, Input Records (x10^4) for the ingestion layer (Kinesis) and CPU usage for the analytics layer (Apache Storm), plotted over time in minutes.]

Figure 5.2: The data arrival rate at the ingestion layer (Amazon Kinesis in Fig. 5.1) is strongly correlated (coefficient = 0.95) with the CPU load at the analytics layer (Apache Storm in Fig. 5.1).

A number of inquiries [86, 82, 65, 66, 77, 78, 47, 87, 67, 96, 95] have been made into the elasticity management of either data-intensive systems or single/multi-tier web applications using control theory. Lama et al. in [77] propose a fuzzy controller for efficient server provisioning on multi-tier clusters which bounds the 90th-percentile response time of requests flowing through the multi-tier architecture. They further improve their approach in [78] by adding neural networks to the controller in order to avoid tuning the parameters on a manual trial-and-error basis, and come up with a more effective model in the face of highly dynamic workloads. Similar to this study, Jamshidi et al. in [65] propose a fuzzy controller that enables qualitative specification of elasticity rules for cloud-based software. They further equipped their technique in [66] with Q-Learning, a model-free reinforcement learning strategy, to free users of most tuning parameters. More recently, Farokhi et al. in [47] use a fuzzy controller for vertical elasticity of both CPU and memory to meet the performance objective of an application.

In [82], the authors proposed a fixed-gain controller for elasticity management of the Hadoop Distributed File System (HDFS) [102] under dynamic Web 2.0 workloads. To avoid oscillatory behavior of the controller, [82] develops a proportional thresholding technique which works by dynamically configuring the range for the controller variables. Similarly, in [87], the authors propose a multi-model controller which integrates decisions from an empirical model and a workload forecast model with the classical fixed-gain controller. The empirical model retrieves distinct configurations which are capable of sustaining the anticipated Quality of Service (QoS) based on recorded data from the past. In contrast, the forecast model, which is built by Fourier transformation, provides proactive resource resizing decisions for specific classes of workloads.

More closely related to the topic of this chapter, the authors in [96] propose a resource controller for multi-tier web applications. The proposed control system is built upon a black-box system modelling approach to alleviate the absence of first-principle models for complex multi-tier enterprise applications. Unlike [96], the authors of [86] modeled the system (i.e. a web server) as a second-order differential equation. However, the estimated system model used for control would become inaccurate if the real workload range were to deviate significantly from those used for developing the performance model. The authors of [96] next enhanced their previous work in [95] by employing multi-input multi-output (MIMO) control combined with a model estimator that captures the relationship between resource allocations and performance in order to assign the right amount of resources. The resource allocation system can then automatically adapt to workload changes in a shared virtualized infrastructure to achieve the target average response time. Along similar lines, the authors in [67] incorporate a Kalman filter into a feedback controller to dynamically allocate CPU resources to virtual machines hosting server applications. However, our work differs in that our control system, rather than adjusting CPU allocation in a shared infrastructure, which commercial cloud providers do not offer, regulates resources at a higher abstraction level, for example the number of instantiated VMs. Above all, unlike our work, this class of control systems is only quasi-adaptive, as their gain parameters do not rely on the history of the previously computed control gains and hence are unable to dynamically adapt to workload changes (see Section 5.3.3.2).

In summary, almost all of the above studies share the same constraint: a lack of a holistic view of resource requirement management, in that they have primarily investigated virtual server allocation problems, even in multi-tier Internet services. Our work completes this picture by studying different cloud resources including distributed messaging queue partitions (data ingestion layer), VMs (data analytics layer), and provisioned read or write throughputs of tables (data storage layer).

5.3 Proposed Solution

In this section, we first provide an overview of the proposed solution and then discuss its main components in detail.

5.3.1 Solution Overview

Fig. 5.3 shows the main building blocks of our solution along with the architecture of our testbed, which is a real-world data analytics flow: click-stream analytics. Our testbed is similar to Amazon's reference architecture [30], except that ElastiCache is replaced with DynamoDB for seamless scalability. We further discuss the design principles followed in building our testbed in Section 5.4.

In a nutshell, the workflow of the solution is as follows. First, the dependencies between workloads' critical resource usage measures, such as Kinesis shard utilization, Storm cluster CPU usage, and DynamoDB consumed read/write units, are analysed. To this end, we apply linear regression techniques to the collected runtime and historical resource profiles to estimate the relationships among variables. The dependency information, along with the cloud service costs and the user's SLO, constitutes the required inputs for generating and then searching the provisioning plan space.


Figure 5.3: The proposed solution for managing heterogeneous workloads of data analytics flows on clouds.


The framework's resource analyser is capable of determining the maximum resource share of each layer under the user's budget constraint. Due to the multi-objective nature of the problem, there usually exist multiple feasible solutions; which one is best suited to the problem in practice must then be identified.

Once the upper-bound resource shares for each layer are identified, the adaptive controller tailored to each of the three layers automatically adjusts the resource allocation of that layer. This means that the controllers can then freely operate within the resource limits of each layer.

The controllers are regulated based on a number of parameters, including the monitored resource utilization value, the desired resource utilization value, and the history of the controller's decisions. In other words, the controllers continuously provision the resources to adequately serve the incoming records in order to keep the resource utilization of each layer at the specified desired value. Note that, for the sake of simplicity, the sensor and resource actuator, the key components of any controller-based elasticity management framework, are not depicted in Fig. 5.3. In our implementation, the sensor module is built on top of CloudWatch [2] and is responsible for providing recorded resource usage measures as per the specified monitoring window. The actuator is capable of executing the controllers' commands, such as adding and removing VMs. We discuss tool support for our framework further in Chapter 6.


5.3.2 Resource Share Analysis

Devising an efficient elasticity plan for a data analytics flow is challenging due to i) the diversity of cloud resources used to serve the flow, ii) the different pricing schemes of cloud services, and iii) the dependencies between workloads, which altogether yield a complex provisioning plan space. This space can be of any shape in terms of SLOs; as such, we formulate the goal as:

”Given the budget and estimated dependencies between workloads, what would be the maximum share of resources for each layer in a data analytics flow?”

Problem Formulation: The above problem in its general form is defined as a multi-objective function:

\[
\max \left( r^{(I)}_{it},\; r^{(A)}_{jt},\; r^{(S)}_{kt} \right) \tag{5.1}
\]

subject to:

\[
\sum_{i,d} r^{(I)}_{it} \cdot c_{id} + \sum_{j,d} r^{(A)}_{jt} \cdot c_{jd} + \sum_{k,d} r^{(S)}_{kt} \cdot c_{kd} \le Bud_t \tag{5.2}
\]
\[ r^{(I)}_{it} = a \cdot r^{(A)}_{jt} \tag{5.3} \]
\[ r^{(I)}_{it} = b \cdot r^{(S)}_{kt} \tag{5.4} \]
\[ r^{(A)}_{jt} = e \cdot r^{(S)}_{kt} \tag{5.5} \]
\[ \forall i, t : r^{(I)}_{it} \le Cap^{(I)}_{it} \tag{5.6} \]
\[ \forall j, t : r^{(A)}_{jt} \le Cap^{(A)}_{jt} \tag{5.7} \]
\[ \forall k, t : r^{(S)}_{kt} \le Cap^{(S)}_{kt} \tag{5.8} \]
\[ \forall i, j, k : r^{(I)}_{it},\; r^{(A)}_{jt},\; r^{(S)}_{kt} \in \mathbb{R}^{+} \tag{5.9} \]
\[ i, j, k, d \in \mathbb{N} \tag{5.10} \]
\[ a, b, e \in \mathbb{R} \tag{5.11} \]

where a variable such as r^{(I)}_{it} represents the amount of resource of type i in layer I of a data analytics flow at time t, and c_{id} refers to cost dimension d of the resource of type i. Table 5.1 summarizes all parameters used in the problem formulation of resource share analysis. The resource shares of the Ingestion (I), Analytics (A), and Storage (S) layers are positive real variables (5.9) subject to the following constraints:

(5.2) Budget Constraint: at time t, the sum of costs¹ of the different cloud resources across all layers must be within the specified budget.

(5.3-5.5) Dependency Constraints: the dependencies between layers, and in particular their constant variables a, b, and e, are learned and determined by the linear regression technique. Note that a given flow will not necessarily exhibit all of these dependencies at a given point in time.

¹ For the sake of simplicity, we assume that the cloud services' base prices (e.g. c_{id}) remain unchanged during the time periods considered. Moreover, to make the model more readable, we omit other miscellaneous expenses such as data transfer between layers.


Table 5.1: List of key notations used in this chapter.

L = {I, A, S}: the set of typical data analytics flow layers, namely Ingestion (I), Analytics (A), and Storage (S)
T = {i, j, k}: the set of resources of types i, j, or k (e.g. VMs, queues)
r^{(L)}_{Tt}: the amount of resource of type T in layer L of a certain data analytics flow at time t
c_{Td}: the cost of resource type T in dimension d
Bud_t: the specified budget at time t
Cap^{(L)}_{Tt}: the capacity of resource type T at layer L at time t
a, b, e: constant variables obtained via workload dependency analysis
u_k: the current actuator value (for example, in the analytics layer it represents the number of VMs allocated to a Storm cluster)
u_{k+1}: the new actuator value (i.e. the next-step resource allocation amount)
l_k: the controller gain at time step k
y_k: the current sensor measurement (for example, in the analytics layer it represents the CPU usage measured during the past monitoring window)
y_r: the desired reference sensor measurement (i.e. resource utilization), which is specified by the user
l_0, l_max, l_min: the initial, maximum, and minimum gain values, respectively
γ: the controller parameter (γ > 0)


(5.6-5.8) Capacity Constraints: at time t, the calculated resource share must be within the cloud service capacity limits².

As discussed in Section 2.3.1 of Chapter 2, in multi-objective optimization we aim to find a set of feasible solutions called the Pareto optimal solutions. Solving the multi-objective problem defined above leads to a set of feasible solutions that represents the maximum shares of different resources across a data analytics flow. Having the maximum resource share of each layer allows the controllers to operate freely within the limits of each layer. Moreover, it implicitly aims at maximizing an important QoS of streaming workloads, namely throughput, as throughput primarily depends on resource allocation [72].

² There might be some limits in place for some cloud services. For example, provisioned throughput in DynamoDB has an initial maximum of 40,000 write capacity units per table in a specific region, though one can request an increase on these limits.
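To make the shape of this provisioning plan space concrete, the following is a minimal Python sketch, not the thesis implementation (which uses NSGA-II via the MOEA framework, see Section 5.5.2): it brute-forces a small discretised plan space for a single time step. The prices, capacity caps, and grid resolution are purely illustrative assumptions.

```python
# Minimal sketch, not the thesis implementation (which uses NSGA-II):
# brute-force enumeration of a discretised provisioning plan space for a
# single time step. Prices, caps, and grid resolution are illustrative.
from itertools import product

PRICE = {"shard": 0.36, "vm": 1.20, "wcu": 0.01}   # hypothetical daily unit costs
CAP = {"shard": 20, "vm": 10, "wcu": 2000}         # hypothetical service limits
BUDGET = 32.25                                     # daily budget (Section 5.5.2)

def feasible(r_i, r_a, r_s):
    """Budget (5.2), dependency (cf. Section 5.5.2), and capacity checks."""
    cost = r_i * PRICE["shard"] + r_a * PRICE["vm"] + r_s * PRICE["wcu"]
    # Dependency constraints of Section 5.5.2: 2*r_a <= r_i <= 5*r_a, 2*r_i <= r_s.
    return cost <= BUDGET and 2 * r_a <= r_i <= 5 * r_a and 2 * r_i <= r_s

def pareto_front(points):
    """Keep the points not dominated by any other (maximising all coordinates)."""
    return [p for p in points
            if not any(q != p and all(q[d] >= p[d] for d in range(3))
                       for q in points)]

plans = [(i, a, s)
         for i, a, s in product(range(1, CAP["shard"] + 1),
                                range(1, CAP["vm"] + 1),
                                range(100, CAP["wcu"] + 1, 100))
         if feasible(i, a, s)]
front = pareto_front(plans)
print(len(front), "non-dominated plans; e.g.", front[:3])
```

Even this toy version reproduces the key property of the formulation: the budget and dependency constraints prune the space, and the surviving non-dominated plans form the Pareto set handed to the controllers as per-layer upper bounds.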


5.3.3 Elasticity Controller

Control theory mandates specifying a system model, i.e. the mathematical relationship between the control input and output, before designing a controller. A few studies in workload management of computer systems have followed this approach, in which the system is modelled, for example, as a difference equation [86] or using queueing theory. Due to the complexity and uncertainty of computer systems, obtaining dynamic models describing their behavior with difference equations requires the implementation of comprehensive system identification techniques. These techniques inevitably increase the complexity of the control system and may decrease the robustness of the closed-loop system (or even cause instability) if the system configuration or workload range deviates from those used to estimate the unknown system parameters. Similarly, in queueing theory, every model is built upon a number of assumptions, such as the arrival process, that may not be met by certain applications and workloads. Building and maintaining these models is complicated even for a multi-process server model [86], let alone for a chain of diverse parallel-distributed platforms, as we have in a complex data analytics flow.

For this reason, most prior work [82, 96, 95, 47, 65, 77, 78] on applying control theory to computer systems employs a black-box approach in which the system model is assumed unknown and only minimal assumptions, which enable stability analysis of the closed-loop system, are imposed on it. The downside of this approach is that it does not provide enough flexibility for proving strong stability results. In fact, most of the results available in the literature either lack a proper stability analysis or only prove what is known as internal stability (or at most bounded-input bounded-output stability) in the control literature [68], implying that the resulting output error (i.e. the difference between the system output and its desired value) is bounded for all times. Nevertheless, it is known in the control literature that, in general, internal stability does not imply asymptotic (or exponential) stability [68, 103], under which the output error is not only bounded, but also asymptotically (exponentially) converges to zero as time passes.

Therefore, we propose a framework for designing controllers for computer systems and analyzing their stability by positing a static, yet unknown, model for the underlying systems. Using this framework, we propose a generic adaptive controller which requires very minimal information about the system model parameters. Using tools from classic nonlinear control theory (e.g. Lyapunov theory), we provide a rigorous stability analysis and prove the asymptotic (exponential) stability of the resulting closed-loop system [100].

5.3.3.1 A Framework for Controller Design

Denote the input of a system (assigned by the actuator) at time k by u_k ∈ R and the system output (i.e. the sensor reading) by y_k ∈ R. We assume that the system input and output are related via a static, yet unknown, smooth function. That is to say, y_k = f(u_k) where f : R → R is a smooth function. In practice, the smooth function f can be linearized at the operating point.


Figure 5.4: a) Input-output linear model, b) Control feedback loop.

Hence, we approximate the system model with a linear function (see Fig. 5.4(a)). Nevertheless, we still assume that the parameters of the linear model, i.e. the slope and the y-intercept, are unknown. That is,

\[ y_k = a u_k + b \tag{5.12} \]

with unknown a ∈ R and b ∈ R. Note that a and b generally depend on the operating point of the system (which itself depends on the workload). We further assume that an upper bound on the magnitude of a and the sign of a are known; that is, we know whether the output is a decreasing (a < 0) or an increasing (a > 0) function of the input. These are very mild assumptions on the system model that can be easily verified in practice. For instance, increasing the number of virtual machines (i.e. the system input) in the data analytics layer decreases the CPU utilization (i.e. the system output or sensor reading)³. Hence, the corresponding system model for the data analytics layer is decreasing, and a < 0 in this case.

Consider the control feedback loop illustrated in Fig. 5.4(b). The control objective is to design the control input u_k such that the output y_k remains bounded for all times and converges to a reference (desired) constant value y_r ∈ R as k goes to infinity.

For the sake of simplicity and without loss of generality, we assume a < 0 in the remainder of the chapter. Nevertheless, the theory proposed here is applicable to the case a > 0 with straightforward modifications.

5.3.3.2 A Generic Adaptive Controller

We propose the following adaptive controller.

\[ u_{k+1} = u_k + l_{k+1} (y_k - y_r), \tag{5.13} \]

³ See Section 5.4.2 for further details on the data analytics layer controller.


where the controller gain l_{k+1} is adaptively updated according to the following multi-criteria update law:

\[
l_{k+1} =
\begin{cases}
l_k + \gamma (y_k - y_r), & \text{if } l_{min} \le l_k + \gamma (y_k - y_r) \le l_{max} \\
l_{min}, & \text{if } l_k + \gamma (y_k - y_r) < l_{min} \\
l_{max}, & \text{if } l_k + \gamma (y_k - y_r) > l_{max}
\end{cases}
\tag{5.14}
\]

Here, l_k is the controller gain at time k, l_min > 0 and l_max > 0 are the lower and upper bounds of the controller gain⁴, respectively, and γ > 0 is a controller parameter. Table 5.1 summarizes all parameters used in the adaptive controller design.
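As a concrete illustration of Eqs. (5.13) and (5.14), the following minimal Python sketch implements the control structure with the multi-criteria gain update. The class and variable names are ours, and actuator-specific concerns (rounding u_{k+1} to whole shards or VMs, service limits) are deliberately left out.

```python
# Sketch of the adaptive controller, Eqs. (5.13)-(5.14). Names are ours;
# rounding to whole resource units and service limits are omitted.

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

class AdaptiveController:
    def __init__(self, y_r, l0, l_min, l_max, gamma):
        self.y_r = y_r                        # desired resource utilization
        self.l = l0                           # gain l_k, with l_min <= l0 <= l_max
        self.l_min, self.l_max = l_min, l_max
        self.gamma = gamma                    # adaptation rate, gamma > 0

    def step(self, u_k, y_k):
        """Map the current actuator value u_k and sensor reading y_k
        to the next actuator value u_{k+1}."""
        error = y_k - self.y_r
        # Multi-criteria update law (5.14): adapt the gain, clamping at the bounds.
        self.l = clamp(self.l + self.gamma * error, self.l_min, self.l_max)
        # Standard control structure (5.13).
        return u_k + self.l * error
```

As a point of reference, the flow experiments of Fig. 5.13 use a gain of 0.03 and γ = 0.0001.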

The multi-criteria update law (5.14) ensures that the values of l_k are bounded by l_min and l_max for all k (the initial controller gain l_0 should be chosen such that l_min ≤ l_0 ≤ l_max). The proposed adaptive controller, the fixed-gain controller of [82, 87], and the quasi-adaptive controller of [96] all share the same standard structure (5.13). The difference between these three control schemes is the gain l_{k+1}. In the fixed-gain controller, the gain l_{k+1} is simply constant for all time. In the quasi-adaptive controller, the gain l_{k+1} is computed as a predetermined function of the measurements and the desired output; however, this function is memoryless, meaning that the gain l_{k+1} does not depend on the l_k computed in the previous step. In contrast, the adaptive update rule (5.14) does utilize the previously computed gain l_k for computing the new gain l_{k+1}, thus resulting in a truly adaptive control scheme [100].

In order to analyse the stability of the closed-loop system, we define the output error

\[ e_k := y_k - y_r. \tag{5.15} \]

In ideal conditions, where the measurements are noise-free and the system model is accurate, the control goal is achieved if the error e_k converges to zero as k goes to infinity, causing y_k to converge toward the desired output y_r.

For the stability analysis in the following theorem, we assume that a and b are constants (in theory). In practice, this assumption implies that the rate at which the control value u_k is computed is much faster than the speed at which the system parameters a and b change; that is, the update rate of the controller should be much faster than the rate of change of the workload.

Theorem 1 Consider the system (5.12) and the controller (5.13) connected together according to Fig. 5.4(b). Assume that a and b are constants, a < 0, and 0 < l_min ≤ l_k ≤ l_max < −2/a for all k. The controller ensures that the closed-loop system is globally exponentially stable. Moreover, defining q(l_k) := 1 + (a l_k)(2 + a l_k) and α = −0.5 ln(max(q(l_min), q(l_max))) > 0, we have |e_k| ≤ |e_0| exp(−α k) for all k = 0, 1, 2, . . ., implying that e_k exponentially converges to zero with a rate of convergence greater than α. Moreover, the controller gain l_k of the adaptive update rule (5.14) converges to a constant value for large enough k.

⁴ We later propose criteria for choosing appropriate l_min and l_max to ensure stability of the closed-loop system.


Proof of Theorem 1: Using (5.15) and (5.12) we have

\[ e_{k+1} = y_{k+1} - y_r = a u_{k+1} + b - y_r. \tag{5.16} \]

Replacing u_{k+1} from (5.13) in (5.16) and using (5.12) yields

\[ e_{k+1} = a (u_k + l_k e_k) + b - y_r = a \left( \tfrac{1}{a} (y_k - b) + l_k e_k \right) + b - y_r = (1 + a l_k) e_k. \tag{5.17} \]

Equation (5.17) formulates the dynamics of the control error e_k. If the control gain l_k is constant for all times, the error dynamics (5.17) represents a simple linear time-invariant (LTI) system whose stability can be analysed using linear control theory [93]. Nevertheless, in the general case where the control gain is time-varying, we use Lyapunov theory from classic nonlinear control theory [68, 103] to analyse the stability of the error dynamics.

Consider the Lyapunov candidate

\[ V_k := e_k^2. \tag{5.18} \]

Using (5.17), we have

\[ V_{k+1} - V_k = e_{k+1}^2 - e_k^2 = (1 + a l_k)^2 e_k^2 - e_k^2 = (a l_k)(2 + a l_k) e_k^2 = (a l_k)(2 + a l_k) V_k. \tag{5.19} \]

Assuming that 0 < l_k < −2/a, we have V_{k+1} − V_k ≤ 0, implying that the Lyapunov function is decreasing along the system trajectories and the closed-loop system is stable [68]. Defining q(l_k) := 1 + (a l_k)(2 + a l_k) and using (5.19), we have V_{k+1} = q(l_k) V_k, which yields

\[ V_k = V_0\, q(l_{k-1}) q(l_{k-2}) \cdots q(l_1) q(l_0), \tag{5.20} \]

for all k = 0, 1, 2, . . .. Since 0 < l_min ≤ l_k ≤ l_max < −2/a, it is straightforward to verify that 0 < q(l_k) ≤ q_max < 1, where q_max := max(q(l_min), q(l_max)). Hence, we have q(l_{k-1}) · · · q(l_0) ≤ q_max^k, which together with (5.20) yields

\[ V_k \le V_0\, q_{max}^{k} = V_0 \exp\left( - \ln(q_{max}^{-1})\, k \right), \tag{5.21} \]

which implies that the geometric progression V_k decays exponentially to zero with a convergence rate greater than 2α = ln(q_max^{−1}) > 0. Substituting for V_k from (5.21) into (5.18) and taking the square root of both sides, we have

\[ |e_k| \le |e_0| \exp(-\alpha k), \tag{5.22} \]

which proves the first claim of the theorem.

Next, we proceed to show that the gain l_k converges to a constant value for large enough k, say k = ∞. We know that l_∞ is governed by one of the three criteria of the adaptation law (5.14). Suppose that l_k is governed by the first criterion for large k. We use (5.15) to replace y_k − y_r with e_k in the first criterion of (5.14) to obtain l_{k+1} − l_k = γ e_k. Taking the norm of both sides and recalling (5.22), we have


|l_{k+1} − l_k| ≤ γ |e_0| exp(−α k). Hence, for large enough k we have |l_{k+1} − l_k| ≈ 0, which effectively implies that l_{k+1} ≈ l_k. This means that l_k converges to a constant value if l_k is governed by the first criterion for large enough k. It remains to be shown that l_k converges to a constant value if it is not completely governed by the first criterion of (5.14). If l_∞ is governed by the second or the third criterion, we have either l_∞ = l_min or l_∞ = l_max, respectively. Since l_min and l_max are both constants, l_∞ would be constant in this case as well. Note that, for large k, l_k cannot indefinitely switch between the three criteria of (5.14), as we have already proved that l_k + γ(y_k − y_r), the value of which determines the criterion of (5.14), converges to a constant and eventually satisfies only one of the criteria for large k. This completes the proof.

Remark 1 The stability proof of Theorem 1 is provided for a generic time-varying trajectory of the control gain l_k. Hence, this stability proof is valid not only for the proposed adaptive update rule (5.14), but also for the fixed-gain controller (see e.g. [82]) and the quasi-adaptive controller [96] (provided that those controllers satisfy the gain requirements of Theorem 1). For the case of the fixed-gain controller, it can be verified that the requirements of Theorem 1 are necessary and sufficient for exponential stability of the closed-loop system. In this case, Theorem 1 provides coherent analytical limits on the gain of the controller that guarantee the stability of the closed-loop system.
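The exponential bound of Theorem 1 can also be checked numerically on the assumed static model (5.12). The short Python sketch below uses purely illustrative constants (a = −2 and b = 100, so the gain bounds must satisfy l_max < −2/a = 1) and asserts |e_k| ≤ |e_0| exp(−α k) at every step.

```python
# Numerical check of Theorem 1 on the assumed static model y_k = a*u_k + b,
# a < 0. All constants are illustrative; the gain bounds satisfy
# 0 < l_min <= l_max < -2/a = 1.
import math

a, b = -2.0, 100.0                    # plant parameters, unknown to the controller
y_r = 60.0                            # desired output
l_min, l_max, gamma = 0.05, 0.9, 0.01
l, u = 0.1, 5.0                       # initial gain and actuator value

q = lambda g: 1 + (a * g) * (2 + a * g)
alpha = -0.5 * math.log(max(q(l_min), q(l_max)))
e0 = abs(a * u + b - y_r)

for k in range(25):
    e = a * u + b - y_r                                  # output error e_k
    assert abs(e) <= e0 * math.exp(-alpha * k) + 1e-9    # bound (5.22)
    l = min(max(l + gamma * e, l_min), l_max)            # update law (5.14)
    u = u + l * e                                        # controller (5.13)
```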

5.3.3.3 Gain Function (l_k) Behavior Analysis

One of the key indicators of an effective elasticity technique is the ability to quickly scale up or down, called the elasticity speed [79, 61]. The scale-up speed refers to the time it takes to shift from an under-provisioned state to an optimal or over-provisioned state. Conversely, the scale-down speed refers to the time it takes to shift from an over-provisioned state to an optimal or under-provisioned state [61].

We put forward the claim that our adaptive controller, compared to the fixed-gain controllers [82, 87] and quasi-adaptive controllers [96, 95], exhibits a higher elasticity speed, as it keeps the history of the controller's decisions by updating the gain parameter with respect to the error (see Eq. 5.14). The bigger the gain parameter value l_{k+1}, the higher the speed of elasticity as per the standard controller equation (5.13) (also see Eq. (5) in [96] or Eq. (1) in [82]).

To support our claim, we first explain the normal and extreme scenarios of workload behavior:

1. Normal overload/underload: In normal workload behavior, when a system is over-provisioned, de-scaling of resources leads to higher utilization (y_k) and hence decreases the distance from the desired value (y_r). Fig. 5.5(a) demonstrates this situation, where the distance reduces from |−80| (on the left) to the optimal point, that is 0 (on the right). In contrast, when a system is under-provisioned, scaling of resources again reduces the error (i.e. y_k − y_r), as the x-axis of Fig. 5.5(b) displays.

2. Instantaneous massive overload/underload: In contrast to normal workload behavior, in the event of an instantaneous massive overload or underload, scaling or de-scaling of the system does not necessarily decrease the error for a while. In other words, the system load increases or decreases much faster than the rate at which the resource manager reacts. Fig. 5.5(c) and 5.5(d) show these circumstances: in Fig. 5.5(c) the utilization is not improved (as we move from |−80| to |−100|) even after de-scaling the resource. Similarly, scaling a highly overloaded system does not reduce the error y_k − y_r, and the distance increases from |90| towards |160|, as shown in Fig. 5.5(d). Note that this situation is temporary, and the workload will return to a normal situation (Fig. 5.5(a) and 5.5(b)) sooner or later. Although transient, these extreme cases may either hit performance severely or lead to substantial resource waste.


Figure 5.5: Gain parameter behavior under different load scenarios: a) de-scaling process (scenario 1), b) scaling process (scenario 1), c) de-scaling process (scenario 2), d) scaling process (scenario 2).


We can now look into the gain parameter behavior in the different scenarios and compare the elasticity speed of the different controllers, as shown in Fig. 5.5. The gain function of our controller, l_{k+1} = l_k + γ(y_k − y_r), is shown with the green line with circle markers, alongside the fixed gain function l_{k+1} = l_k [82, 87] (the blue line with square markers) and the quasi-adaptive gain function l_{k+1} = ((1/y_r) − ε) · (y_k/y_r) [96, 95] (the purple line with asterisk markers).

As the figures show, our gain function produces a higher value in all scenarios, which in turn leads to a higher elasticity speed. The only exception to this observation is Fig. 5.5(a), where the fixed-gain controller generates the higher value. However, we have witnessed in numerous experiments that even in this situation our controller performs faster. This scenario does not happen in isolation, and a typical workload experiences all of the scenarios in its lifetime, so that the gain parameter already has a larger value at the start of this scenario. For example, suppose that the workload first goes through one of the situations depicted in Fig. 5.5(b), 5.5(c), or 5.5(d) and then 5.5(a). Clearly, after passing through one of those situations, the value of l_{k+1} is bigger (here > 0.08, > 0.15, or > 0.15, respectively) than the initial value of a fixed-gain controller in Fig. 5.5(a), which is 0.05. These numbers simply mean that our controller has a higher elasticity speed compared to its competitors. We will discuss the advantages of this capability in more detail in Section 5.5.3.



5.4 Automated Control of a Data Analytics Flow

The cloud services and resources across a data analytics flow are quite different and are sometimes subject to specific limitations (e.g. the de-scaling operations limit in DynamoDB). Therefore, each layer mandates a set of specific design principles for its controller. This section discusses the principles used in the design of each controller in terms of the click-stream data analytics flow.

5.4.1 Data Ingestion Layer Controller

Our controller in the data ingestion layer is responsible for resizing the Kinesis cloud service. A data stream in Kinesis is composed of one or more shards, each of which provides a fixed unit of capacity⁵ for persisting incoming records. Therefore, the data capacity of the stream is a function of the number of shards specified for the stream. As the data rate increases, more shards need to be added to scale up the size of the stream. In contrast, shards can be removed as the data rate decreases.

As mentioned in Section 5.3.1, each controller is equipped with both sensor and actuator components. The sensor here continuously reads the incoming records stream from CloudWatch and calculates the resource utilization as the average writes per second:

\[
\frac{\left( \sum_{i=1}^{n} IncomingRecords_i \right) / (n \cdot 60)}{ShardsCount \cdot 1000} \tag{5.23}
\]

where n is the number of monitoring time windows in minutes, and assuming each record is less than 1 KB. This measure is fed to the controller, which then makes the next resource resizing decision as per the logic discussed in Section 5.3.3.1 and invokes the actuator to execute increaseShards, decreaseShards, or doNothing commands.
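A sketch of this sensor computation (Eq. 5.23) is given below; the function name and sampling interface are ours, and the 1,000 records/second figure is the per-shard write limit quoted in the footnote above.

```python
# Sketch of the ingestion-layer sensor computation, Eq. (5.23), assuming
# every record is under 1 KB so a shard sustains up to 1,000 records/second.

def shard_utilization(incoming_records_per_minute, shards_count):
    """incoming_records_per_minute: one CloudWatch sample per monitored minute."""
    n = len(incoming_records_per_minute)
    avg_writes_per_second = sum(incoming_records_per_minute) / (n * 60)
    return avg_writes_per_second / (shards_count * 1000)

# ~48K records/min over a 3-minute window on 2 shards -> 40% utilization.
print(shard_utilization([48000, 48000, 48000], shards_count=2))  # 0.4
```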

5.4.2 Data Analytics Layer Controller

The analytics layer controller in the click-stream application is in charge of resizing the Apache Storm cluster. Our Storm cluster is built on Amazon EC2 instances, whose number is regulated by the control system. The sensor here records the CPU utilization of the cluster over the specified time window, following which instances are acquired or released. However, this process is not instant, and it may take several minutes to start up a VM. During this time, the data analytics flow is vulnerable to missing its SLOs. In response to this problem, we inject a number of already configured worker VMs into the cluster under Stopped status. In the event of scaling, these pre-configured VMs are added to the cluster at the earliest opportunity.

⁵ Each shard supports up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB/second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB/second.



Figure 5.6: VMs are launched at different time slots, so they are of different costs to stop. Thus, it is more economical to stop the VM with the minimum remaining time.


To release a worker VM in the event of de-scaling, our actuator finds the most economical VM to stop. EC2 instances are priced on an hourly basis. As the instances are launched in different time slots, it is more economical to stop the instance that has used up the most of its current time slot. In other words, the instance that has the least remaining time in the currently paid one-hour slot is the most economical candidate to be stopped. Fig. 5.6 shows a sample of three VMs with different uptimes and remaining times, where VM 2 and then VM 1 are respectively the most economical VMs to be stopped. We identify the instance with the least cost to stop using the following equation:

\[
\min_{i \in n} f(vm_i), \qquad f(vm_i) = t - (uptime_{vm_i} \bmod t) \tag{5.24}
\]

where uptime_{vm_i} refers to the uptime of VM i and t is the billing time slot, which is one hour in AWS⁶.
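The following Python sketch illustrates Eq. (5.24); the function names are ours, and uptimes are taken in milliseconds, matching the CloudWatch readings mentioned in the footnote below.

```python
# Sketch of Eq. (5.24): stop the worker VM with the least remaining time in
# its current paid one-hour slot. Uptimes are in milliseconds.

T = 60 * 60 * 1000  # hourly billing slot, in milliseconds

def remaining_time(uptime_ms):
    """Time left in the already-paid hourly slot."""
    return T - (uptime_ms % T)

def vm_to_stop(uptimes_ms):
    """Index of the most economical VM to stop."""
    return min(range(len(uptimes_ms)), key=lambda i: remaining_time(uptimes_ms[i]))

# Cf. Fig. 5.6: VM 2 has used most of its current hour, so it is stopped first.
print(vm_to_stop([50 * 60 * 1000, 55 * 60 * 1000, 10 * 60 * 1000]))  # -> 1
```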

5.4.3 Data Storage Layer Controller

DynamoDB sits in our storage layer and is responsible for persisting the analytics results. The controller in this layer adjusts the number of provisioned write capacity units, where each unit can handle one write per second. To this end, the sensor retrieves the ConsumedWriteCapacityUnits measure from CloudWatch and calculates the write utilization per second, the main input to the control system, as follows:

⁶ In our experiments, the uptime is retrieved from CloudWatch in milliseconds, and hence t is set to the value of 1 h * 60 min * 60 sec * 1000 ms.



\[
\frac{\left( \sum_{i=1}^{n} ConsumedWriteCapacityUnits_i \right) / (n \cdot 60)}{ProvisionedWriteCapacityUnits} \tag{5.25}
\]

where n is the number of monitoring time windows in minutes, and assuming items are less than 1 KB in size. This measure is fed to the controller; following that, the next resource resizing decision is made and the actuator is invoked to execute increaseProvisionedWriteCapacity, decreaseProvisionedWriteCapacity, or doNothing commands.

Cloud services sometimes come with limitations on the number of scaling or de-scaling operations within a certain period of time. For example, in DynamoDB a user cannot decrease the ReadCapacityUnits/WriteCapacityUnits settings more than four times per table in a single day. This limitation may lead to considerable resource waste under highly fluctuating workloads. To address this issue, our controller uses a simple yet effective back-off strategy: the actuator performs a de-scaling operation only after reaching the back-off threshold, i.e. a number of consecutive de-scaling requests from the controller. This strategy filters the transient behavior of workloads and alleviates the problem of resource waste.
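A minimal sketch of this back-off strategy is shown below; the class and command names are illustrative, and the default threshold of 2 matches the setting used in the experiment of Section 5.5.4.

```python
# Sketch of the storage-layer back-off strategy: execute a de-scaling command
# only after `threshold` consecutive de-scale requests from the controller,
# filtering transient workload dips.

class BackOffActuator:
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.pending = 0                  # consecutive de-scale requests seen

    def on_command(self, command):
        """command is one of 'increase', 'decrease', 'doNothing'."""
        if command != "decrease":
            self.pending = 0              # any other command resets the streak
            return command
        self.pending += 1
        if self.pending >= self.threshold:
            self.pending = 0
            return "decrease"             # threshold reached: actually de-scale
        return "doNothing"                # suppress the transient request
```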

5.5 Experimental Results

The purpose of the experiments in this section is to demonstrate that (i) given the budget and dependency constraints, we are able to efficiently determine the shares of different resources across a data analytics flow (Section 5.5.2), (ii) our controller outperforms the state-of-the-art fixed-gain and quasi-adaptive controllers in managing fine-grained cloud resources in dynamic data stream processing settings (Section 5.5.3), and, most importantly, (iii) our framework is able to manage the elasticity of a data analytics flow on public clouds (Section 5.5.4).

5.5.1 Experimental Setup

We tested our controllers and resource share analyser against the real-world click-stream analytics flow shown in Fig. 5.1. To set up the testbed, we used three t2.micro instances as click-stream data producers⁷, each able to produce up to ∼5K records per second. Data ingestion and storage, as discussed earlier, are handled by the Amazon Kinesis and DynamoDB services. The Storm cluster is made up of two m4.large instances for Zookeeper and the Storm master node (i.e. the Nimbus server in Storm terminology) and a number of workers (i.e. supervisors), which are either t2.micro, t2.small, m3.medium, or m4.xlarge.

⁷ https://github.com/awslabs/amazon-kinesis-data-visualization-sample



Figure 5.7: a) Given the $32.25 daily budget and the dependency between the data ingestion and analytics layers, six optimal solutions are generated. b) Since we have three objectives, the Pareto front is a surface in 3D space.

5.5.2 Evaluation Results: Optimized Resource Share Determination

A large number of algorithms exist in the literature for finding Pareto solutions [28]. Our framework uses the nondominated sorting genetic algorithm II (NSGA-II) [41] to efficiently search the provisioning plan space.

The prerequisite for resource share determination is workload dependency analysis. To illustrate how the framework computes the resource shares of each layer, the following constraints were evaluated in our testbed:

• \( 5 \cdot r^{(A)}_{jt_1} \ge r^{(I)}_{it_1} \)

• \( 2 \cdot r^{(A)}_{jt_1} \le r^{(I)}_{it_1} \)

• \( 2 \cdot r^{(I)}_{it_1} \le r^{(S)}_{kt_1} \)

where i, j, and k respectively refer to the number of shards in the ingestion layer, VMs in the analytics layer, and write capacity units in the storage layer at time t₁. Moreover, suppose that the daily budget for running the click-stream analytics flow on public clouds is $32.25. Given the budget and dependency constraints formulated in Section 5.3.2, the algorithm finds the Pareto optimal solutions and the corresponding frontier surface, as displayed in Fig. 5.7(a) and 5.7(b).

As we can see, solving the problem yields six feasible solutions (see Fig. 5.7(a)), each simultaneously representing the resource shares of Kinesis, Storm, and DynamoDB. The underlying assumption here is that the solution to be implemented in practice must be identified by an expert.

5.5.3 Evaluation Results: Adaptive Controller Performance

In this section, we report the performance of our controller compared with the state-of-the-art fixed-gain [82, 87] and quasi-adaptive controllers [96]. Due to workload fluctuations over time, it is hardly possible to provide a truly fair testbed for comparison. Having said that, we modified the testbed to make it as fair as possible for pair-wise comparison between controllers. To this end, we conducted the experiments on all layers but analytics, since it is hard to control the input to this layer. In other words, as the inputs to Kinesis and DynamoDB are manageable, we could provide a fair starting point for the systems.



Figure 5.8: a) The data producer puts the same records into the three identical Kinesis streams, regulated by the controllers. b) Our implementation writes three copies of the results to the three identical DynamoDB tables.


Fig. 5.8(a) and 5.8(b) depict the architecture of the Kinesis and DynamoDB controller testbeds. In the former, we created three streams in Amazon Kinesis with the same configurations and settings (e.g. number of shards, region). The data generator writes the same click-stream data to the streams. In the latter, we created three equivalent tables into which three copies of the results are written by the application running on a Storm cluster. Note that in both cases, even though the data are sent at nearly the same rate at the beginning, different resource resizing decisions lead to different capacities, and hence the workload inevitably takes on a different shape.

Performance Evaluation Metric. To evaluate and compare the performance of the controllers, we use the root mean squared error (RMSE) of the desired resource utilization:

\[
\sqrt{\frac{\sum_{k=1}^{n} (y_k - y_r)^2}{n}} \tag{5.26}
\]

where y_k and y_r are respectively the measured resource usage at time k and the desired resource utilization (i.e. the reference resource usage), and n refers to the number of samples. This measure captures the deviation from the desired utilization; intuitively, a well-performing controller shows the least deviation.
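Computed directly, the metric of Eq. (5.26) is a one-liner; the sketch below is a trivial but faithful rendering with illustrative sample values.

```python
# The evaluation metric of Eq. (5.26): root mean squared deviation of the
# measured utilization y_k from the desired utilization y_r.
import math

def rmse(measured, y_r):
    return math.sqrt(sum((y_k - y_r) ** 2 for y_k in measured) / len(measured))

print(rmse([55, 62, 58, 61], y_r=60))  # deviation from a 60% target
```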

To compare our work with the state of the art, five runs were conducted for each workload (i.e. Kinesis and DynamoDB) under different desired utilization values (y_r): 50%, 60%, 70%, 80%, and 90%. Each run in Kinesis took approximately 4 hours (i.e. 85 samples, one every 3 minutes). The time period was less than 2 hours (i.e. 60 samples, one every 2 minutes) in DynamoDB due to the discussed service limitation.



Figure 5.9: The RMSE measures for both the a) Kinesis and b) DynamoDB workloads in terms of different desired utilization (y_r) values.

Figure 5.10: Throughput QoS for Kinesis workload.

Fig. 5.9(a) shows the results of these runs in Kinesis, in which our proposed controller outperforms the competing controllers in all runs but one, y_r = 90%. Moreover, Fig. 5.10 displays the throughput, i.e. the incoming records per second to Kinesis, recorded during the experiments. As can be seen, the adaptive controller in all runs either produces comparable throughput (i.e. when y_r = 50% or y_r = 70%) or improves it considerably, by up to 55% when y_r = 60%.

In the DynamoDB workload, our controller with adaptive gain produces less error than the quasi-adaptive controller in all but one run, y_r = 60%, as shown in Fig. 5.9(b). When it comes to the comparison with the fixed-gain controller, the adaptive control system is less successful, as it shows higher error rates in 3 out of 5 runs (i.e. y_r = 50%, 60%, and 90%). This is mainly due to the DynamoDB de-scaling limitation (as discussed in Section 5.4.3): the adaptive controller essentially adjusts the gain parameter with respect to the workload dynamics, whereas the service does not allow more than four de-scaling operations in a day. To handle that, we devised the back-off strategy (see Section 5.4.3), which, even though it alleviates the problem, hinders the optimal functioning of the control system.


Figure 5.11: Performance comparison of our adaptive controller and the fixed-gain and quasi-adaptive ones in Amazon Kinesis workload management with y_r = 70%.


To provide a clearer picture of this point, as well as of how the controllers function in each run, consider Fig. 5.11 and 5.12, which respectively depict two sample runs of Kinesis and DynamoDB. As can be seen, the controllers in the Kinesis workload perform better compared with DynamoDB, since they face no restrictions in executing scaling/de-scaling commands. In other words, the adaptive and quasi-adaptive controllers could not react to the dynamics of the DynamoDB workload, so they perform worse than the fixed-gain controller, whose gain function is independent of the workload changes.

In summary, our control system outperforms the quasi-adaptive and fixed-gain controllers in 80% (8 out of 10) and 60% (6 out of 10) of runs, respectively, conducted on two different cloud services using a click-stream analytics flow workload. This finding reflects the fact that our control system has a higher elasticity speed, as illustrated in Section 5.3.3.3, which in turn reduces the error, i.e. the deviation from the desired resource utilization value.

5.5.4 Evaluation Results: Automated Control of the Flow

In this section, we discuss the adaptive controller's performance in elasticity management of the click-stream analytics flow. Fig. 5.13 shows how the adaptive controllers tailored to the data ingestion, analytics, and storage layers function against a real dynamic workload. In the data ingestion layer, the control system properly responds to the incoming record workload and increases/decreases the number of shards accordingly in order to keep the utilization (see Eq. 5.23) within the desired threshold, i.e. y_r = 60%.


[Figure: three stacked panels plot the Provisioned and Consumed Write Capacity Units (average writes per second) against the sample number (one sample every 2 minutes) for the adaptive, fixed-gain, and quasi-adaptive controllers.]

Figure 5.12: Performance comparison of our adaptive controllers and the fixed-gain and quasi-adaptive ones in DynamoDB workload management with y_r = 60%.


Workload management of the Storm analytics cluster is shown in Fig. 5.13(b), in which the controller increases the size of the cluster when the CPU usage grows from around 34% to 67% at the 47th sample point. As can be seen, the CPU utilization drops after scaling; however, its distance from the target utilization value (40%) is not big enough to cause de-scaling of the cluster.

When it comes to the data storage layer (Fig. 5.13(c)), the adaptive control system successfully responds to the workload fluctuations. Note that the experiment was conducted with the back-off functionality of the controller turned on and set to the value of 2, meaning that a de-scaling command is executed only when we encounter multiple consecutive de-scaling requests. For example, at the 10th sample point the provisioned write capacity units drop from 7 to 6. Similarly, from the 16th to the 25th sample point the capacity diminishes from 8 to 3 units. In contrast, from the 25th up to the 92nd sample point, the write capacity grows back to 8 units.

In summary, our proposed adaptive control system is able to manage the elasticity of a data analytics flow on public clouds. Although we have implemented and tailored the framework for the click-stream analytics flow, our approach does not hold any hard assumptions about the specifics of the application and the underlying data-intensive systems. Therefore, it can be employed for other data analytics flows which may be served by different data-intensive platforms.


Figure 5.13: The adaptive controller's performance in elasticity management of the a) data ingestion (y_r = 60%), b) analytics (y_r = 40%), and c) storage (y_r = 70%) layers of the click-stream analytics flow, with l_0 = 0.03 and γ = 0.0001.

5.6 Summary

Elasticity management of big data analytics flows poses a number of unique challenges: workload dependencies, different cloud services and monetary schemes, and uncertain stream arrival patterns. To address the first two challenges, in this chapter we formulated the problem of resource share analysis across the analytics flow using a multi-objective optimization technique. The experiments showed that the proposed technique is able to efficiently determine resource shares with respect to various constraints.

We then presented three adaptive controllers for the data ingestion, analytics, and storage layers in response to the last challenge. Apart from theoretically proving the exponential stability of the control system, numerous experiments were conducted on the click-stream data analytics flow. The results showed that, compared with the quasi-adaptive and fixed-gain controllers, our control system reduces the deviation from desired utilization in 80% and 60% of runs, respectively.

In the next chapter, we discuss the details of designing and implementing a system called Flower for holistic elasticity management of data analytics flows on clouds.


Chapter 6

Flower: A System for Data Analytics Flow Management

An efficient elasticity management system allows admins and DevOps engineers to sustain the performance of data analytics flows 24/7 in a cost-efficient way. In the previous chapter, we discussed that developing elasticity management techniques for the heterogeneous workloads of a big data analytics flow requires a holistic view that caters for all components rather than treating them as silos.

In this chapter, we present Flower, an open-source system¹ that builds upon our proposed framework in Chapter 5 to aid users in managing the elasticity of data analytics flows on clouds. Flower's goal is to provide a high-level, easy-to-use system for admins and DevOps engineers that assists them in provisioning data analytics flow applications according to their application requirements and in constantly monitoring them for any performance failures or slowdowns.

For this purpose, Flower analyzes large volumes of statistics and data collected from different data-intensive systems to provide the user with a suite of rich functionalities, including workload dependency analysis, optimal resource share analysis, dynamic resource provisioning, and cross-platform monitoring.

6.1 Related Work

In recent years, a few systems have been developed for the workload management of data-intensive systems on clouds, in both industry and academia. In terms of commercial solutions, almost all the elasticity management systems offered by cloud providers such as Amazon [1] or Microsoft Azure [18] use simple rule-based techniques that quickly trigger in response to predefined threshold violations. Although these rules can identify fatal conditions, they often fail to adapt to unplanned or unforeseen changes in demand. Moreover, they entail considerable manual effort in tuning each system individually and specifying rules based on available resources, which requires solid expertise and planning. More importantly, these elasticity management systems are agnostic to workload dependencies in a data analytics flow; instead, they focus on scaling/de-scaling based on IaaS-level QoS metrics.

¹ Flower can be downloaded at https://github.com/Alireza-/Flower


Figure 6.1: Conceptual design of the Flower system.

When it comes to research prototype systems, the authors of [118] propose a proof-of-concept system that is able to identify and differentiate application workloads throughout complex big data platforms. In this system, the user has to identify the workload of interest and pass the prioritization information from the job submission front-end. The proposed demo provides multiplexing and prioritization of Hadoop and Hive workloads, which leads to superior performance for interactive queries. Compared to ours, this work has a limited scope (i.e. workload prioritization), since it neither offers a solution for elasticity management of workloads across an implemented analytics flow, nor captures workload dependencies or resource share analysis.

In [76], the authors propose a visionary prototype for the integrated management of a multi-system cluster. The proposed architecture provides a database-centric view of multiple systems that co-exist and cooperate in the same cluster. Although our objective is close to this notion of integrated management of a big data stack, it differs greatly in approach and targeted challenges due to the abstraction level of the solution. Specifically, given the high-level abstraction of the proposed architecture, it is not clear, for example, how they would tackle the dynamic resource provisioning problem for a multi-system workload or how they would analyse the resource shares of different workloads across a shared infrastructure.

More recently, the authors of [94] introduced PerfEnforce, a dynamic scaling engine that comes with three scaling techniques: feedback control, reinforcement learning, and a perceptron algorithm. The demonstration aims to show how well the algorithms resize an Amazon EC2 cluster in response to a running query workload from the TPC-H benchmark. This study, like its earlier counterparts reported in [82, 75, 96], only focuses on the dynamic resizing of one specific cloud resource, i.e. VM instances. Our work complements these studies in the sense that it manages the dynamic scaling of different kinds of cloud services (e.g. distributed message queueing systems, NoSQL) and their underpinning resources (e.g. queues, table read/write capacities).


Figure 6.2: Flower high-level architecture.

resize an Amazon EC2 cluster in response to a running query workload from the TPC-H benchmark. This study, like its previous counterparts reported in [82, 75, 96], only focuses on dynamic resizing of a specific cloud resource, i.e. VM instances. Our work complements these studies in the sense that it manages dynamic scaling of different kinds of cloud services (e.g. distributed message queueing systems, NoSQL stores) and their underpinning resources (e.g. queues, table read/write capacities).

6.2 Flower Architecture

Flower has three main components: a resource share analyser, an elasticity controller, and an all-in-one-place visualizer, as depicted in the conceptual design of the system in Fig. 6.1. The first two components build upon the theories we developed in Sections 5.3.2 and 5.3.3 of Chapter 5. Fig. 6.2 shows the high-level architecture of the Flower system; in the subsections below, we discuss each component in more detail.

6.2.1 Resource Share Analysis

Finding an efficient resource share plan for a data analytics flow is difficult. To determine how to best allocate budget to various types of resources across a data analytics flow, Flower provides a resource share analyzer module which combines techniques from statistical regression modelling and multi-objective optimization theory to resolve the maximum share of different resources across the data analytics flow. To implement the workload dependency and resource share analysers, we used Apache Commons [7] and the MOEA framework [56], a Java framework for multi-objective optimization, respectively.
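To make the optimization step concrete, the sketch below shows how such a resource share problem could be posed in the MOEA framework [56]. It is a minimal illustration rather than Flower's actual formulation: the two objectives (a regression-style throughput estimate versus hourly cost), the per-unit prices, and the variable bounds are all hypothetical placeholders.

import org.moeaframework.Executor;
import org.moeaframework.core.NondominatedPopulation;
import org.moeaframework.core.Solution;
import org.moeaframework.core.variable.EncodingUtils;
import org.moeaframework.core.variable.RealVariable;
import org.moeaframework.problem.AbstractProblem;

// Hypothetical two-objective formulation: maximise the flow's estimated
// throughput while minimising the hourly cost of the allocated resources.
public class ResourceSharePlan extends AbstractProblem {

    // Illustrative per-unit hourly prices for the three layers
    // (ingestion shards, analytics VMs, storage write capacity).
    private static final double[] UNIT_COST = {0.015, 0.044, 0.0065};

    public ResourceSharePlan() {
        super(3, 2); // three decision variables, two objectives
    }

    @Override
    public void evaluate(Solution solution) {
        double[] share = EncodingUtils.getReal(solution);

        // Placeholder for a regression-based throughput estimate
        // learned from historical workload statistics.
        double throughput = 120.0 * Math.log1p(share[0])
                          + 85.0 * Math.log1p(share[1])
                          + 40.0 * Math.log1p(share[2]);

        double cost = 0.0;
        for (int i = 0; i < share.length; i++) {
            cost += UNIT_COST[i] * share[i];
        }

        solution.setObjective(0, -throughput); // MOEA minimises, so negate
        solution.setObjective(1, cost);
    }

    @Override
    public Solution newSolution() {
        Solution solution = new Solution(3, 2);
        for (int i = 0; i < 3; i++) {
            solution.setVariable(i, new RealVariable(1.0, 100.0));
        }
        return solution;
    }

    public static void main(String[] args) {
        NondominatedPopulation result = new Executor()
                .withProblemClass(ResourceSharePlan.class)
                .withAlgorithm("NSGAII")
                .withMaxEvaluations(10000)
                .run();
        result.forEach(s -> System.out.printf("throughput=%.1f cost=%.3f%n",
                -s.getObjective(0), s.getObjective(1)));
    }
}

Running NSGA-II [41] over such a problem yields a Pareto front of throughput/cost trade-offs from which the maximum affordable share of each resource type can be read off.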

Page 106: Workload Modelling and Elasticity Management of …...Workload Modelling and Elasticity Management of Data-Intensive Systems Alireza Khoshkbarforoushha A thesis submitted for the degree

86 Flower: A System for Data Analytics Flow Management

Figure 6.3: All-in-one-place visualizer user interface.

6.2.2 Resource Provisioning

To enable accurate yet timely resource provisioning, Flower uses advanced control theory to automatically reason about resource resizing actions. Flower's adaptive resource control engine is able to continuously detect and self-adapt to workload changes for meeting users' service level objectives.

The sensor and the resource actuator are the key components of the controllers. The sensor module is responsible for providing resource usage statistics as per the specified monitoring window. The actuator is capable of executing the controllers' commands, such as adding or removing VMs and increasing or decreasing the number of shards. Flower's sensor module periodically collects live data from multiple sources such as CloudWatch [2] and feeds it to the actuator module. The controller adjusts the actuator value from the previous time step proportionally to the deviation between the current and desired values of the sensor variable in the current time step.
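The control law just described can be summarised in a few lines of code. The sketch below is a simplified illustration under assumed names (ProportionalController, nextAllocation) rather than Flower's actual controller, which Chapter 5 derives in full; the gain and reference values are placeholders.

// A minimal sketch of the control law described above: the actuator
// value u(k) is the previous value u(k-1) corrected proportionally to
// the deviation of the measured utilization from its desired value.
public final class ProportionalController {

    private final double gain;       // controller gain (placeholder value)
    private final double reference;  // desired utilization, e.g. 0.7

    public ProportionalController(double gain, double reference) {
        this.gain = gain;
        this.reference = reference;
    }

    /** u(k) = u(k-1) + gain * (measured - reference). */
    public double nextAllocation(double previousAllocation,
                                 double measuredUtilization) {
        double error = measuredUtilization - reference;
        return previousAllocation + gain * error;
    }
}

In each monitoring window, the latest sensor reading is passed to nextAllocation and the result is handed to the actuator; keeping the adjustment proportional to the error avoids over-reacting to transient spikes in the workload.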

6.2.3 Cross-Platform Monitoring

Many cluster monitoring tools such as Ganglia [16] are available to assist administrators. However, they fail to provide a holistic view of performance measures across the data analytics flow. Therefore, one has to check several different systems and user interfaces in order to track any possible performance failures or slowdowns. For example, monitoring an analytics flow application built upon the Kafka [10], Storm, and DynamoDB systems requires tracking performance statistics in three separate user interfaces. More importantly, these platforms do not necessarily have consistent definitions for the same performance measures, which makes evaluating metrics across the data analytics flow challenging.


To tackle the issues above, Flower introduces a module called the all-in-one-place visualizer, which allows users to visually define a monitoring layer on top of multiple systems. The module calls the APIs of the underlying systems, such as CloudWatch and Storm (a sketch of one such call is given after the list below), and consolidates the following performance measures in an integrated user interface, as shown in Fig. 6.3:

• System-level measures such as CPU, memory or network utilization provide a general view of system performance.

• Application-level measures refer to specific metrics of an individual system, job or application, such as Incoming Records in Kinesis or Acked/Failed Tuples in Storm.

• Flow-level measures are combined measures, such as latency, drawn from different platforms to provide an end-to-end performance value.
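As an illustration of the kind of API call the visualizer issues, the sketch below pulls one Kinesis metric through the AWS SDK for Java v1 CloudWatch client; the stream name and the one-hour window are hypothetical, and the real module consolidates many such calls across platforms.

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.Datapoint;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;
import java.util.Date;

public class KinesisMetricProbe {
    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

        // Fetch the IncomingRecords metric for a (hypothetical) stream
        // over the last hour, aggregated per minute.
        GetMetricStatisticsRequest request = new GetMetricStatisticsRequest()
                .withNamespace("AWS/Kinesis")
                .withMetricName("IncomingRecords")
                .withDimensions(new Dimension()
                        .withName("StreamName").withValue("my-demo-stream"))
                .withStartTime(new Date(System.currentTimeMillis() - 3600_000L))
                .withEndTime(new Date())
                .withPeriod(60)
                .withStatistics("Sum");

        for (Datapoint dp : cloudWatch.getMetricStatistics(request).getDatapoints()) {
            System.out.printf("%s  records=%.0f%n", dp.getTimestamp(), dp.getSum());
        }
    }
}

Analogous calls against, for instance, Storm's UI REST API populate the remaining panels of the integrated dashboard.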

6.3 Flower Workflow

In a nutshell, the workflow of the system is as follows. First, the dependencies between the workloads' resource usage measures are analyzed. We apply linear regression techniques to estimate the relationships among the variables. The dependency information, along with the cloud service costs and the user's SLO, constitutes the required input for generating and then searching the provisioning plan space.
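As a concrete, if simplified, example of this dependency analysis, the sketch below fits a straight line between two layers' measures with the Apache Commons Math library (part of the Apache Commons [7] family used in Flower); the observation values are invented for illustration.

import org.apache.commons.math3.stat.regression.SimpleRegression;

public class DependencyAnalysis {
    public static void main(String[] args) {
        SimpleRegression model = new SimpleRegression();

        // Each pair: (incoming records/s at the ingestion layer,
        //             CPU utilization observed at the analytics layer).
        double[][] observations = {
            {1000, 0.21}, {2000, 0.38}, {4000, 0.74}, {5000, 0.90}
        };
        for (double[] obs : observations) {
            model.addData(obs[0], obs[1]);
        }

        // The fitted slope and intercept quantify how load on one layer
        // propagates to resource usage on the next.
        System.out.printf("cpu = %.5f * records + %.3f (R^2 = %.3f)%n",
                model.getSlope(), model.getIntercept(), model.getRSquare());
    }
}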

The resource share analyzer module is then invoked to determine the maximum resource shares of each layer given the user's SLO. Once the upper-bound resource shares for each layer are identified, the adaptive controller tailored to each of the three layers automatically adjusts the resource allocations of that layer. Note that the resource shares can be determined with respect to arbitrary time windows.

The controllers are regulated based on a number of parameters, including the monitored resource utilization value, the desired resource utilization value, and the history of the controller's decisions. In other words, the controllers continuously provision resources to adequately serve the incoming records or input data, so as to keep the resource utilization of each layer within the specified desired value.
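When a controller decides on a new allocation, the actuator translates it into a concrete service call. The sketch below shows what such a call could look like for one resource type, DynamoDB provisioned throughput, using the AWS SDK for Java v1; the table name and capacity values are hypothetical, and Flower's actual actuator covers further resource types such as Kinesis shards and VM instances.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

public class DynamoDbActuator {

    private final AmazonDynamoDB dynamoDb = AmazonDynamoDBClientBuilder.defaultClient();

    /**
     * Applies the write capacity chosen by the controller to the given
     * table, e.g. the storage layer of the analytics flow. The read
     * capacity is left at an illustrative constant here.
     */
    public void applyWriteCapacity(String tableName, long writeUnits) {
        dynamoDb.updateTable(new UpdateTableRequest()
                .withTableName(tableName)
                .withProvisionedThroughput(
                        new ProvisionedThroughput(100L, writeUnits)));
    }

    public static void main(String[] args) {
        // Hypothetical resizing action issued at the end of a window.
        new DynamoDbActuator().applyWriteCapacity("flow-results", 400L);
    }
}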

6.4 Flower in Action

In this section, we give a walk-through of the key features of Flower for managing a data analytics flow. Fig. 6.4 shows a high-level sequence diagram of how to run a flow elasticity manager. As can be seen, three main objects, namely the Flow Builder, the Flow Configuration Wizard, and the Controller Performance Monitor, are involved in running an elasticity manager in a number of steps:


Figure 6.4: The high-level sequence diagram of how to run an elasticity controller in Flower

1. Flow Builder: Flower's Flow Builder is used to drag and drop multiple platforms and create a data analytics flow via its graphical user interface, as shown in Fig. 6.5.

2. Flow Configuration Wizard: In this step, we complete a wizard to configure the controllers for the systems selected in the previous step, providing information such as the resource name (e.g. the table name in DynamoDB), the desired reference value, and the monitoring period, as shown in Fig. 6.6.

3. Controller Performance Monitor: Once configuration is completed, we can run the service. After starting the service, Flower launches visualizations showing features such as current and future resource allocation, deviation from the desired utilization, and performance measures, as shown in Fig. 6.7. We can then observe how the different controllers dynamically change the cloud service capacities and the resulting performance.

Flower also provides an interface to adjust tunable parameters of the controllers such as elasticity speed, monitoring period, or even their internal settings, as shown in Fig. 6.8.

In addition to the demonstration above, one can create a cross-platform monitoring dashboard using similar steps, experiencing live monitoring of multiple systems all in one go.


Figure 6.5: Flower’s flow builder interface

Figure 6.6: Elasticity flow configuration interface


Figure 6.7: Elasticity service control and monitoring interface


Figure 6.8: Elasticity service setting interface


6.5 Summary

In this chapter, we presented Flower, a system for data analytics flow elasticity management. With its suite of rich functionalities, Flower aims at assisting admins and DevOps engineers in the workload management of data analytics flows. Currently, such a task is performed naively using the simple auto-scaling systems provided by cloud providers, where users need to define static rules beforehand and tweak them frequently in order to find efficient allocation plans. In this regard, we highlighted how Flower helps them maintain the performance of their application flows with much less effort.


Chapter 7

Conclusion

Big Data has grown rapidly in various complex disciplines such as science, social networks, engineering, and commerce, and nowadays the increasing number of cloud-hosted big data analytics flow applications leads to increased complexity in their architecture, run-time performance, and workload management.

As we described in Chapter 1, data analytics flow applications combine multiple programming models to perform a specialized and pre-defined set of activities, such as ingestion, analytics, and storage of data. To support users across such heterogeneous workloads, this thesis proposed a set of intelligent performance and workload management techniques and tools.

In conclusion, we briefly highlight the major contributions and the future research directions that can be built upon the outcomes of this research, under three main categories: resource performance prediction, dynamic resource management, and tool support.

7.1 Resource Performance Prediction for Data-intensive Systems

We have described our first research question as: How can we predict the resource and performance distribution of data-intensive workloads?

In response to this question, we proposed a new distribution-based performance modelling technique for batch and stream processing systems. The proposed approach is based on statistical machine learning techniques and is easy to adapt to a wide variety of systems modelling problems. To demonstrate the usefulness of distribution-based workload modelling, we designed and implemented two workload management mechanisms: i) predictable auto-scaling policy setting; and ii) a predictive admission controller. We thoroughly discussed how an MDN-based prediction approach provides a complete description of the statistical properties of a target measure while retaining the strengths of existing single-point prediction models.
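For reference, a Gaussian-kernel mixture density network of the kind we build on [31] models the conditional density of a performance measure t given workload features x as

p(t | x) = Σ_{i=1}^{m} α_i(x) N(t | μ_i(x), σ_i²(x)),   with   Σ_{i=1}^{m} α_i(x) = 1,

where the mixing coefficients α_i, means μ_i, and variances σ_i² are all outputs of a neural network conditioned on x (the notation here is the generic form from [31] rather than a restatement of the earlier chapters). A single-point predictor corresponds to the degenerate case of reporting only one mean.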

In this study, we focused on offline learning, but the real-time nature of some workloads necessitates revisiting this work to investigate new learning models and update methods that are both fast and able to capture and adapt to new data points over time. To enable online predictive modelling, a natural future direction will be enhancing the MDN so that it can refine its kernel functions, and hence build prediction models, at runtime. For this purpose, with the aid of online learning notions, the MDN would be revisited so that it can take an initial guess model, then pick up one observation at a time from the training set and recalibrate the weights on each input parameter.

Another avenue for future work is to hook the proposed model into Apache Yarn [13] or Apache Mesos [11] to build a more intelligent resource or query scheduler.

7.2 Data Analytics Flow Elasticity Management

Our second research question was: How to satisfy the performance objectives of a data analytics flow application despite its dynamic runtime workload?

In response, we first proposed a multi-layered resource allocation scheme for cloud-hosted data analytics flows. In doing so, we presented a meticulous dependency analysis of the workloads along with a mathematical formulation of the problem as per the data ingestion, analytics, and storage layers of a data analytics flow. Next, we designed and implemented a new adaptive control framework for dynamic provisioning of data analytics flows that is able to continuously detect and self-adapt to workload changes for meeting users' SLOs.

For future work, one interesting direction would be investigating the dependencies between different workloads and building a general model for rigorous analysis of those dependencies. Moreover, extending the controller design and analysis framework to enable the application of more advanced control techniques, such as robust or robust-adaptive controller design, would be another avenue for future work.

7.3 Elasticity Management Tool Support

The last question we investigated in this thesis was: How to design and implement a holistic elasticity management system for data analytics flows?

In response to the question above, we designed and developed Flower, a system for holistic elasticity management of data analytics flows on clouds. Flower provides the user with a suite of rich functionalities including workload dependency analysis, optimal resource share analysis, dynamic resource provisioning, and cross-platform monitoring.

The current version of Flower supports the aforementioned functionalities for Kinesis, DynamoDB, and Apache Storm. Therefore, one natural future direction is to extend it to support other popular big data systems such as Apache Cassandra [6], Apache Kafka [10], Apache Spark, and the like.


References

1. Amazon auto scaling. https://aws.amazon.com/autoscaling/. (cited on page 83)

2. Amazon cloudwatch. https://aws.amazon.com/cloudwatch/. (cited on pages 65 and 86)

3. Amazon dynamodb. https://aws.amazon.com/dynamodb. (cited on pages 13 and 61)

4. Amazon elastic mapreduce. https://aws.amazon.com/emr/. (cited on page 1)

5. Amazon kinesis. http://aws.amazon.com/kinesis. (cited on pages 1, 7, and 61)

6. Apache cassandra. http://cassandra.apache.org/. (cited on pages 7, 13, and 94)

7. Apache commons. https://commons.apache.org/. (cited on page 85)

8. Apache hadoop. http://hadoop.apache.org/. (cited on page 10)

9. Apache Hive. https://hive.apache.org/. (cited on page 10)

10. Apache kafka. http://kafka.apache.org/. (cited on pages 7, 86, and 94)

11. Apache mesos. http://mesos.apache.org/. (cited on page 94)

12. Apache storm. http://storm.apache.org/. (cited on pages 7, 12, and 61)

13. Apache yarn. https://hortonworks.com/apache/yarn/. (cited on page 94)

14. Apache zookeeper. https://zookeeper.apache.org/. (cited on page 12)

15. Esper. http://www.espertech.com/esper/. (cited on page 12)

16. Ganglia monitoring system. http://ganglia.sourceforge.net/. (cited on page 86)

17. Google bigquery. https://cloud.google.com/bigquery/. (cited on page 1)

18. Microsoft azure hdinsight. https://azure.microsoft.com/. (cited on pages 1 and 83)

19. Oracle complex event processing. http://www.oracle.com/technetwork/middleware/complex-event-processing. (cited on pages 12 and 30)

20. TPC-H. www.tpc.org/tpch/. (cited on pages 23, 34, and 50)


21. Mumtaz Ahmad et al. Predicting completion times of batch query workloads using interaction-aware models and simulation. In EDBT, pages 449–460. ACM, 2011. (cited on pages 27, 29, 51, and 56)

22. Mert Akdere, Ugur Çetintemel, Matteo Riondato, Eli Upfal, and Stanley B Zdonik. Learning-based query performance modeling and prediction. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 390–401. IEEE, 2012. (cited on pages 17, 24, 25, 27, 29, 37, 50, 51, and 54)

23. H Appelrath, Dennis Geesen, Marco Grawunder, Timo Michelsen, Daniela Nicklas, et al. Odysseus: a highly customizable framework for creating efficient event stream management systems. In Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems, pages 367–368. ACM, 2012. (cited on pages 12, 30, and 31)

24. Arvind Arasu, Shivnath Babu, and Jennifer Widom. The cql continuous query language: semantic foundations and query execution. The VLDB Journal, 15(2):121–142, 2006. (cited on page 31)

25. Arvind Arasu, Mitch Cherniack, Eduardo Galvez, David Maier, Anurag S Maskey, Esther Ryvkina, Michael Stonebraker, and Richard Tibbetts. Linear road: a stream data management benchmark. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 480–491. VLDB Endowment, 2004. (cited on pages 15, 23, 25, and 34)

26. Francis R Bach and Michael I Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(Jul):1–48, 2002. (cited on page 15)

27. Cagri Balkesen, Nesime Tatbul, and M Tamer Özsu. Adaptive input admission and management for parallel stream processing. In Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems, pages 15–26. ACM, 2013. (cited on pages 23 and 28)

28. Sanghamitra Bandyopadhyay and Sriparna Saha. Some single- and multi-objective optimization techniques. In Unsupervised Classification, pages 17–58. Springer Berlin Heidelberg, 2013. (cited on pages 20 and 77)

29. Ivan Bedini, Sherif Sakr, Bart Theeten, Alessandra Sala, and Peter Cogan. Modeling performance of a parallel streaming engine: bridging theory and costs. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, pages 173–184. ACM, 2013. (cited on page 14)

30. Rahul Bhartia. Amazon kinesis and apache storm: Building a real-time sliding-window dashboard over streaming data. Technical report, Amazon Web Services, October 2014. (cited on page 64)


31. Christopher M Bishop. Mixture density networks. 1994. (cited on pages 15, 17, 25, 32, 33, and 49)

32. Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. Towards sensor database systems. In International Conference on Mobile Data Management, pages 3–14. Springer, 2001. (cited on page 12)

33. Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and regression trees. CRC Press, 1984. (cited on page 15)

34. Rodrigo N Calheiros, Rajiv Ranjan, and Rajkumar Buyya. Virtual machine provisioning based on analytical performance and qos in cloud computing environments. In Parallel Processing (ICPP), 2011 International Conference on, pages 295–304. IEEE, 2011. (cited on page 19)

35. Javier Cervino, Evangelia Kalyvianaki, Joaquin Salvachua, and Peter Pietzuch. Adaptive provisioning of stream processing systems in the cloud. In Data Engineering Workshops (ICDEW), 2012 IEEE 28th International Conference on, pages 295–301. IEEE, 2012. (cited on page 28)

36. Jianjun Chen, David J DeWitt, Feng Tian, and Yuan Wang. Niagaracq: A scalable continuous query system for internet databases. In ACM SIGMOD Record, volume 29, pages 379–390. ACM, 2000. (cited on page 12)

37. Yanpei Chen, Archana Ganapathi, Rean Griffith, and Randy Katz. The case for evaluating mapreduce performance using workload suites. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2011 IEEE 19th International Symposium on, pages 390–399. IEEE, 2011. (cited on page 18)

38. Yun Chi et al. Distribution-based query scheduling. VLDB, 6(9):673–684, 2013. (cited on page 28)

39. Edward Curry. The big data value chain: Definitions, concepts, and theoretical approaches. In New Horizons for a Data-Driven Economy, pages 29–37. Springer, 2016. (cited on page 1)

40. Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. (cited on page 10)

41. Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. A fast and elitist multiobjective genetic algorithm: Nsga-ii. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002. (cited on page 77)

42. Joseph J DiStefano, Allen J Stubberud, and Ivan J Williams. Schaum's outline of feedback and control systems. McGraw-Hill Professional, 1997. (cited on page 21)


43. Mianxiong Dong, He Li, Kaoru Ota, Laurence T Yang, and Haojin Zhu. Multicloud-based evacuation services for emergency management. IEEE Cloud Computing, 1(4):50–59, 2014. (cited on page 61)

44. Jennie Duggan et al. Performance prediction for concurrent database workloads. In SIGMOD, pages 337–348. ACM, 2011. (cited on pages 27, 51, and 56)

45. Jennie Duggan et al. Contender: A resource modeling approach for concurrent query performance prediction. In EDBT, pages 109–120, 2014. (cited on pages 27 and 56)

46. Tapio Elomaa and M Kaariainen. An analysis of reduced error pruning. arXiv preprint arXiv:1106.0668, 2011. (cited on page 17)

47. Soodeh Farokhi, Ewnetu Bayuh Lakew, Cristian Klein, Ivona Brandic, and Erik Elmroth. Coordinating cpu and memory elasticity controllers to meet service response time constraints. In International Conference on Cloud and Autonomic Computing (ICCAC), pages 69–80. IEEE, 2015. (cited on pages 63 and 68)

48. Hector Fernandez, Guillaume Pierre, and Thilo Kielmann. Autoscaling web applications in heterogeneous cloud infrastructures. In IEEE International Conference on Cloud Engineering (IC2E), pages 195–204. IEEE, 2014. (cited on pages 20 and 62)

49. Guilherme Galante and Luis Carlos E de Bona. A survey on cloud computing elasticity. In Utility and Cloud Computing (UCC), 2012 IEEE Fifth International Conference on, pages 263–270. IEEE, 2012. (cited on page 19)

50. Archana Ganapathi et al. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In ICDE, pages 592–603. IEEE, 2009. (cited on pages 24, 25, 27, 29, and 51)

51. Archana Ganapathi et al. Statistics-driven workload modeling for the cloud. In ICDEW, pages 87–92. IEEE, 2010. (cited on pages 2, 50, and 51)

52. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. In ACM SIGOPS Operating Systems Review, volume 37, pages 29–43. ACM, 2003. (cited on page 10)

53. Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007. (cited on pages 37, 38, and 54)

54. Irving John Good. Rational decisions. Journal of the Royal Statistical Society. Series B (Methodological), pages 107–114, 1952. (cited on pages 38 and 54)

55. Vincenzo Massimiliano Gulisano. StreamCloud: an elastic parallel-distributed stream processing engine. PhD thesis, Informatica, 2012. (cited on pages 12 and 19)


56. D Hadka. Moea framework: a free and open source java framework for multi-objective optimization. http://www.moeaframework.org, 2012. (cited on page 85)

57. Mark A Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999. (cited on page 29)

58. Mark Hall et al. The weka data mining software: an update. SIGKDD, 11(1):10–18, 2009. (cited on pages 37, 55, and 57)

59. David J Hand, Heikki Mannila, and Padhraic Smyth. Principles of data mining. MIT Press, 2001. (cited on page 17)

60. Thomas Heinze, Valerio Pappalardo, Zbigniew Jerzak, and Christof Fetzer. Auto-scaling techniques for elastic data stream processing. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on, pages 296–302. IEEE, 2014. (cited on pages 28 and 42)

61. Nikolas Roman Herbst, Samuel Kounev, and Ralf Reussner. Elasticity in cloud computing: What it is, and what it is not. In ICAC, pages 23–27, 2013. (cited on pages 19 and 72)

62. Martin Hirzel, Robert Soulé, Scott Schneider, Bugra Gedik, and Robert Grimm. A catalog of stream processing optimizations. ACM Computing Surveys (CSUR), 46(4):46, 2014. (cited on page 30)

63. Dirk Husmeier. Neural networks for conditional probability estimation: Forecasting beyond point predictions. Springer Science & Business Media, 2012. (cited on page 17)

64. Jinho Hwang and Timothy Wood. Adaptive performance-aware distributed memory caching. In ICAC, pages 33–43, 2013. (cited on pages 20 and 62)

65. Pooyan Jamshidi, Aakash Ahmad, and Claus Pahl. Autonomic resource provisioning for cloud-based software. In Proceedings of the 9th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, pages 95–104. ACM, 2014. (cited on pages 20, 62, 63, and 68)

66. Pooyan Jamshidi, Amir M Sharifloo, Claus Pahl, Andreas Metzger, and Giovani Estrada. Self-learning cloud controllers: Fuzzy q-learning for knowledge evolution. In International Conference on Cloud and Autonomic Computing (ICCAC), pages 208–211. IEEE, 2015. (cited on pages 20, 62, and 63)

67. Evangelia Kalyvianaki, Themistoklis Charalambous, and Steven Hand. Adaptive resource provisioning for virtualized servers using kalman filters. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 9(2):10, 2014. (cited on pages 63 and 64)


68. H. K. Khalil. Nonlinear Systems. Prentice Hall, New Jersey, 1996. (cited on pages 68 and 71)

69. Alireza Khoshkbarforoushha, Alireza Khosravian, and Rajiv Ranjan. Elasticity management of streaming data analytics flows on clouds. Journal of Computer and System Sciences, 2016. (cited on page 4)

70. Alireza Khoshkbarforoushha and Rajiv Ranjan. Resource and performance distribution prediction for large scale analytics queries. In Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering, pages 49–54. ACM, 2016. (cited on page 4)

71. Alireza Khoshkbarforoushha, Rajiv Ranjan, Raj Gaire, Ehsan Abbasnejad, Lizhe Wang, and Albert Y Zomaya. Distribution based workload modelling of continuous queries in clouds. IEEE Transactions on Emerging Topics in Computing, 5(1):120–133, 2017. (cited on page 4)

72. Alireza Khoshkbarforoushha, Rajiv Ranjan, Raj Gaire, Prem P Jayaraman, John Hosking, and Ehsan Abbasnejad. Resource usage estimation of data stream processing workloads in datacenter clouds. arXiv preprint arXiv:1501.07020, 2015. (cited on page 67)

73. Alireza Khoshkbarforoushha, Rajiv Ranjan, Qing Wang, and Carsten Friedrich. Flower: A data analytics flow elasticity manager. Submitted to VLDB. (cited on page 4)

74. Alireza Khoshkbarforoushha, Meisong Wang, Rajiv Ranjan, Lizhe Wang, Leila Alem, Samee U Khan, and Boualem Benatallah. Dimensions for evaluating cloud resource orchestration frameworks. Computer, 49(2):24–33, 2016. (cited on page 1)

75. Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos, Christina Boumpouka, Nectarios Koziris, and Spyros Sioutas. Tiramola: elastic nosql provisioning through a cloud management platform. In SIGMOD, pages 725–728. ACM, 2012. (cited on pages 19, 20, 62, and 85)

76. Mayuresh Kunjir, Prajakta Kalmegh, and Shivnath Babu. Thoth: Towards managing a multi-system cluster. VLDB, 7(13), 2014. (cited on pages 2 and 84)

77. Palden Lama and Xiaobo Zhou. Efficient server provisioning with control for end-to-end response time guarantee on multitier clusters. IEEE Transactions on Parallel and Distributed Systems, 23(1):78–86, 2012. (cited on pages 63 and 68)

78. Palden Lama and Xiaobo Zhou. Autonomic provisioning with self-adaptive neural fuzzy control for percentile-based delay guarantee. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 8(2):9, 2013. (cited on pages 63 and 68)


79. Sebastian Lehrig, Hendrik Eikerling, and Steffen Becker. Scalability, elasticity, and efficiency in cloud computing: a systematic literature review of definitions and metrics. In SIGSOFT, pages 83–92. ACM, 2015. (cited on page 72)

80. Jiexing Li et al. Robust estimation of resource consumption for sql queries using statistical techniques. VLDB, 5(11):1555–1566, 2012. (cited on pages 24, 25, 27, 37, 50, and 51)

81. Harold Lim, Yuzhang Han, and Shivnath Babu. How to fit when no one size fits. In CIDR, volume 4, page 35, 2013. (cited on page 2)

82. Harold C Lim, Shivnath Babu, and Jeffrey S Chase. Automated control for elastic storage. In ICAC, pages 1–10. ACM, 2010. (cited on pages 2, 19, 20, 42, 46, 62, 63, 68, 70, 72, 73, 77, and 85)

83. Harold Vinson Chao Lim. Workload Management for Data-Intensive Services. PhD thesis, Duke University, 2013. (cited on page 1)

84. Wei-Yin Loh. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):14–23, 2011. (cited on page 17)

85. Tania Lorido-Botran, Jose Miguel-Alonso, and Jose A Lozano. A review of auto-scaling techniques for elastic applications in cloud environments. Journal of Grid Computing, 12(4):559–592, 2014. (cited on pages 20 and 62)

86. Chenyang Lu, Ying Lu, Tarek F Abdelzaher, John Stankovic, Sang Hyuk Son, et al. Feedback control architecture and design methodology for service delay guarantees in web servers. IEEE Transactions on Parallel and Distributed Systems, 17(9):1014–1027, 2006. (cited on pages 20, 62, 63, 64, and 68)

87. Simon J Malkowski, Markus Hedwig, Jack Li, Calton Pu, and Dirk Neumann. Automated control for elastic n-tier workloads based on empirical modeling. In Proceedings of the 8th ACM International Conference on Autonomic Computing, pages 131–140. ACM, 2011. (cited on pages 63, 70, 72, 73, and 77)

88. Nathan Marz and James Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Co., 2015. (cited on page 9)

89. Kaisa Miettinen. Nonlinear multiobjective optimization, volume 12. Springer Science & Business Media, 2012. (cited on page 20)

90. Barzan Mozafari, Carlo Curino, Alekh Jindal, and Samuel Madden. Performance and resource modeling in highly-concurrent oltp workloads. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 301–312. ACM, 2013. (cited on pages 2 and 27)

91. Ian Nabney. NETLAB: algorithms for pattern recognition. Springer Science & Business Media, 2002. (cited on pages 37, 54, and 57)


92. R Neuneier, F Hergert, W Finnoff, and D Ormoneit. Estimation of conditional densities: A comparison of neural network approaches. In ICANN '94, pages 689–692. Springer, 1994. (cited on page 17)

93. Katsuhiko Ogata. Modern control engineering. Prentice Hall PTR, 2001. (cited on page 71)

94. Jennifer Ortiz, Brendan Lee, and Magdalena Balazinska. Perfenforce demonstration: Data analytics with performance guarantees. SIGMOD, 2016. (cited on pages 62 and 84)

95. Pradeep Padala, Kai-Yuan Hou, Kang G Shin, Xiaoyun Zhu, Mustafa Uysal, Zhikui Wang, Sharad Singhal, and Arif Merchant. Automated control of multiple virtualized resources. In Proceedings of the 4th ACM European Conference on Computer Systems, pages 13–26. ACM, 2009. (cited on pages 63, 64, 68, 72, and 73)

96. Pradeep Padala, Kang G Shin, Xiaoyun Zhu, Mustafa Uysal, Zhikui Wang, Sharad Singhal, Arif Merchant, and Kenneth Salem. Adaptive control of virtualized resources in utility computing environments. In SIGOPS, volume 41, pages 289–302. ACM, 2007. (cited on pages 2, 63, 64, 68, 70, 72, 73, 77, and 85)

97. Rajiv Ranjan. Streaming big data processing in datacenter clouds. IEEE Cloud Computing, 1(1):78–83, 2014. (cited on page 7)

98. Luis Rodero-Merino, Luis M Vaquero, Victor Gil, Fermín Galán, Javier Fontán, Rubén S Montero, and Ignacio M Llorente. From infrastructure delivery to service management in clouds. Future Generation Computer Systems, 26(8):1226–1240, 2010. (cited on page 19)

99. Frank Rosenblatt. Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Technical report, DTIC Document, 1961. (cited on pages 15 and 16)

100. S. Sastry and M. Bodson. Adaptive control: stability, convergence and robustness. Courier Corporation, 2011. (cited on pages 68 and 70)

101. Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. Cloudscale: elastic resource scaling for multi-tenant cloud systems. In Proceedings of the 2nd ACM Symposium on Cloud Computing, page 5. ACM, 2011. (cited on page 19)

102. Kenn Slagter, Ching-Hsien Hsu, and Yeh-Ching Chung. An adaptive and memory efficient sampling mechanism for partitioning in mapreduce. International Journal of Parallel Programming, 43(3):489–507, 2015. (cited on page 63)

103. Jean-Jacques E Slotine, Weiping Li, et al. Applied nonlinear control, volume 199. Prentice-Hall, Englewood Cliffs, NJ, 1991. (cited on pages 68 and 71)


104. Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. (cited on page 15)

105. Masashi Sugiyama, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Daisuke Okanohara. Conditional density estimation via least-squares density ratio estimation. In AISTATS, pages 781–788, 2010. (cited on page 18)

106. Roshan Sumbaly, Jay Kreps, and Sam Shah. The big data ecosystem at linkedin. In SIGMOD, pages 1125–1134. ACM, 2013. (cited on page 1)

107. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009. (cited on page 11)

108. Garry Turkington. Hadoop Beginner's Guide. Packt Publishing Ltd, 2013. (cited on page 10)

109. Bhuvan Urgaonkar, Prashant Shenoy, Abhishek Chandra, and Pawan Goyal. Dynamic provisioning of multi-tier internet applications. In Second International Conference on Autonomic Computing (ICAC), pages 217–228. IEEE, 2005. (cited on pages 20 and 62)

110. Luis M Vaquero, Luis Rodero-Merino, and Rajkumar Buyya. Dynamically scaling applications in the cloud. ACM SIGCOMM Computer Communication Review, 41(1):45–52, 2011. (cited on page 19)

111. Abhishek Verma et al. Aria: automatic resource inference and allocation for mapreduce environments. In ICAC, pages 235–244. ACM, 2011. (cited on page 2)

112. Wentao Wu, Yun Chi, Hakan Hacıgümüs, and Jeffrey F Naughton. Towards predicting query execution time for concurrent and dynamic database workloads. Proceedings of the VLDB Endowment, 6(10):925–936, 2013. (cited on page 27)

113. Wentao Wu, Yun Chi, Shenghuo Zhu, Junichi Tatemura, Hakan Hacigumus, and Jeffrey F Naughton. Predicting query execution time: Are optimizer cost models really unusable? In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 1081–1092. IEEE, 2013. (cited on page 28)

114. Wentao Wu, Xi Wu, Hakan Hacıgümüs, and Jeffrey F Naughton. Uncertainty aware query execution time prediction. arXiv preprint arXiv:1408.6589, 2014. (cited on page 28)

115. Pengcheng Xiong, Yun Chi, Shenghuo Zhu, Hyun Moon, Calton Pu, and Hakan Hacigumus. Smartsla: Cost-sensitive management of virtualized resources for cpu-bound database services. 2014. (cited on pages 28 and 45)


116. Pengcheng Xiong et al. Activesla: a profit-oriented admission control framework for database-as-a-service providers. In SoCC, page 15. ACM, 2011. (cited on pages 17, 23, 28, 37, 45, 46, and 54)

117. Jing Xu, Ming Zhao, Jose Fortes, Robert Carpenter, and Mazin Yousif. On the use of fuzzy modeling in virtualized data center management. In Fourth International Conference on Autonomic Computing, pages 25–25. IEEE, 2007. (cited on page 20)

118. Rui Zhang, Reshu Jain, Prasenjit Sarkar, and Lukas Rupprecht. Getting your big data priorities straight: A demonstration of priority-based qos using social-network-driven stock recommendation. VLDB, 7(13), 2014. (cited on page 84)

119. Timothy Zhu, Anshul Gandhi, Mor Harchol-Balter, and Michael A Kozuch. Saving cash by using less cache. In HotCloud, 2012. (cited on page 62)