TRANSCRIPT
CIS 602-01: Scalable Data Analysis
Cloud Workloads
Dr. David Koop
D. Koop, CIS 602-01, Fall 2017
Scaling Up
[Haeberlen and Ives, 2015]
PC → Server → Cluster → Data center → Network of data centers
Scale Problems
1. Difficult to dimension
 - Load can vary considerably
 - Waste resources or lose customers
2. Expensive
 - Hardware costs
 - Personnel costs
 - Maintenance costs
3. Difficult to scale
 - Scaling up (new machines, new buildings)
 - Scaling down (energy, fixed costs)
[Haeberlen and Ives, 2015]
Energy
• Data centers consume a lot of energy
 - Makes sense to build them near sources of cheap electricity
 - Example: price per kWh is 3.6¢ in Idaho (near hydroelectric power), 10¢ in California (long-distance transmission), 18¢ in Hawaii (must ship fuel)
 - Most of this is converted into heat → cooling is a big issue!
[Haeberlen and Ives, 2015]
Company      Servers   Electricity       Cost
eBay         16K       ~0.6×10^5 MWh     ~$3.7M/yr
Akamai       40K       ~1.7×10^5 MWh     ~$10M/yr
Rackspace    50K       ~2×10^5 MWh       ~$12M/yr
Microsoft    >200K     >6×10^5 MWh       >$36M/yr
Google       >500K     >6.3×10^5 MWh     >$38M/yr
USA (2006)   10.9M     610×10^5 MWh      $4.5B/yr
Source: Qureshi et al., SIGCOMM 2009
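Figures like these invite a back-of-the-envelope check. The sketch below estimates annual electricity cost for a fleet; the per-server wattage (200 W) and the PUE overhead factor are illustrative assumptions, not figures from the table.

```python
# Back-of-the-envelope annual electricity cost for a server fleet.
# The wattage and PUE below are illustrative assumptions, not table figures.

def annual_electricity_cost(num_servers, watts_per_server, price_per_kwh, pue=1.5):
    """Estimate yearly electricity cost in dollars.

    PUE (power usage effectiveness) scales the IT load to account for
    cooling and other overhead, the "heat" issue noted on the slide.
    """
    hours_per_year = 24 * 365
    it_energy_kwh = num_servers * watts_per_server / 1000 * hours_per_year
    return it_energy_kwh * pue * price_per_kwh

# One hypothetical 50K-server fleet at the slide's three example rates:
for state, price in [("Idaho", 0.036), ("California", 0.10), ("Hawaii", 0.18)]:
    print(f"{state}: ${annual_electricity_cost(50_000, 200, price) / 1e6:.1f}M/yr")
```

The spread between the Idaho and Hawaii rates alone changes the annual bill by a factor of five, which is why siting matters.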
Modular data centers
• Need more capacity? Just deploy another container!
[Haeberlen and Ives, 2015]
Power Plant to Cloud Analogy
[Haeberlen and Ives, 2015]
(Figure: electricity delivered directly from a power source vs. delivered through a network and meter to the customer; the utility model that cloud computing imitates.)
Cloud Computing Architecture
J Internet Serv Appl (2010) 1: 7–18
virtualized resources for high-level applications. A virtualized server is commonly called a virtual machine (VM). Virtualization forms the foundation of cloud computing, as it provides the capability of pooling computing resources from clusters of servers and dynamically assigning or reassigning virtual resources to applications on-demand.
Autonomic Computing: Originally coined by IBM in 2001, autonomic computing aims at building computing systems capable of self-management, i.e. reacting to internal and external observations without human intervention. The goal of autonomic computing is to overcome the management complexity of today's computer systems. Although cloud computing exhibits certain autonomic features such as automatic resource provisioning, its objective is to lower the resource cost rather than to reduce system complexity.
In summary, cloud computing leverages virtualization technology to achieve the goal of providing computing resources as a utility. It shares certain aspects with grid computing and autonomic computing but differs from them in other aspects. Therefore, it offers unique benefits and imposes distinctive challenges to meet its requirements.
3 Cloud computing architecture
This section describes the architectural, business and various operation models of cloud computing.
3.1 A layered model of cloud computing
Generally speaking, the architecture of a cloud computing environment can be divided into 4 layers: the hardware/datacenter layer, the infrastructure layer, the platform layer and the application layer, as shown in Fig. 1. We describe each of them in detail:
The hardware layer: This layer is responsible for managing the physical resources of the cloud, including physical servers, routers, switches, power and cooling systems. In practice, the hardware layer is typically implemented in data centers. A data center usually contains thousands of servers that are organized in racks and interconnected through switches, routers or other fabrics. Typical issues at the hardware layer include hardware configuration, fault-tolerance, traffic management, power and cooling resource management.
The infrastructure layer: Also known as the virtualization layer, the infrastructure layer creates a pool of storage and computing resources by partitioning the physical resources using virtualization technologies such as Xen [55], KVM [30] and VMware [52]. The infrastructure layer is an essential component of cloud computing, since many key features, such as dynamic resource assignment, are only made available through virtualization technologies.
The platform layer: Built on top of the infrastructure layer, the platform layer consists of operating systems and application frameworks. The purpose of the platform layer is to minimize the burden of deploying applications directly into VM containers. For example, Google App Engine operates at the platform layer to provide API support for implementing storage, database and business logic of typical web applications.
The application layer: At the highest level of the hierarchy, the application layer consists of the actual cloud applications. Different from traditional applications, cloud applications can leverage the automatic-scaling feature to achieve better performance, availability and lower operating cost.
Compared to traditional service hosting environments such as dedicated server farms, the architecture of cloud computing is more modular. Each layer is loosely coupled with the layers above and below, allowing each layer to evolve separately. This is similar to the design of the OSI model for network protocols.
Fig. 1 Cloud computing architecture
[Zhang et al., 2010]
Cloud Comparison
Table 1: A comparison of representative commercial products (Amazon EC2 / Windows Azure / Google App Engine)
- Classes of utility computing: Infrastructure service / Platform service / Platform service
- Target applications: General-purpose applications / General-purpose Windows applications / Traditional web applications with supported frameworks
- Computation: OS level on a Xen virtual machine / Microsoft Common Language Runtime (CLR) VM with predefined roles of application instances / Predefined web application frameworks
- Storage: Elastic Block Store, Amazon Simple Storage Service (S3), Amazon SimpleDB / Azure storage service and SQL Data Services / BigTable and MegaStore
- Auto scaling: Automatically changing the number of instances based on parameters that users specify / Automatic scaling based on application roles and a configuration file specified by users / Automatic scaling which is transparent to users
The .NET Services facilitate the creation of distributed applications. The Access Control component provides a cloud-based implementation of single identity verification across applications and companies. The Service Bus helps an application expose web services endpoints that can be accessed by other applications, whether on-premises or in the cloud. Each exposed endpoint is assigned a URI, which clients can use to locate and access a service.
All of the physical resources, VMs and applications in the data center are monitored by software called the fabric controller. With each application, the users upload a configuration file that provides an XML-based description of what the application needs. Based on this file, the fabric controller decides where new applications should run, choosing physical servers to optimize hardware utilization.
5.2.3 Google App Engine
Google App Engine [20] is a platform for traditional web applications in Google-managed data centers. Currently, the supported programming languages are Python and Java. Web frameworks that run on Google App Engine include Django, CherryPy, Pylons, and web2py, as well as a custom Google-written web application framework similar to JSP or ASP.NET. Google handles deploying code to a cluster, monitoring, failover, and launching application instances as necessary. Current APIs support features such as storing and retrieving data from a BigTable [10] non-relational database, making HTTP requests and caching. Developers have read-only access to the filesystem on App Engine.
Table 1 summarizes the three examples of popular cloud offerings in terms of the classes of utility computing, target types of application, and more importantly their models of computation, storage and auto-scaling. Apparently, these cloud offerings are based on different levels of abstraction and management of the resources. Users can choose one type or combinations of several types of cloud offerings to satisfy specific business requirements.
6 Research challenges
Although cloud computing has been widely adopted by the industry, the research on cloud computing is still at an early stage. Many existing issues have not been fully addressed, while new challenges keep emerging from industry applications. In this section, we summarize some of the challenging research issues in cloud computing.
6.1 Automated service provisioning
One of the key features of cloud computing is the capability of acquiring and releasing resources on-demand. The objective of a service provider in this case is to allocate and de-allocate resources from the cloud to satisfy its service level objectives (SLOs), while minimizing its operational cost. However, it is not obvious how a service provider can achieve this objective. In particular, it is not easy to determine how to map SLOs such as QoS requirements to low-level resource requirements such as CPU and memory requirements. Furthermore, to achieve high agility and respond to rapid demand fluctuations such as in the flash crowd effect, the resource provisioning decisions must be made online.
Automated service provisioning is not a new problem. Dynamic resource provisioning for Internet applications has been studied extensively in the past [47, 57]. These approaches typically involve: (1) Constructing an application performance model that predicts the number of application instances required to handle demand at each particular level,
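Step (1) can be sketched concretely. In the toy model below, the per-instance capacity and headroom figures are invented for illustration; a real performance model would be fitted to measurements.

```python
import math

# Hypothetical sketch of step (1): a performance model mapping a demand
# level (requests/sec) to the number of instances needed to meet an SLO.
# REQS_PER_INSTANCE and HEADROOM are invented figures for illustration.

REQS_PER_INSTANCE = 500   # assumed capacity of one instance at acceptable latency
HEADROOM = 0.8            # target 80% utilization to absorb bursts (flash crowds)

def instances_needed(load_rps, minimum=1):
    """Predict the instance count required at a given demand level."""
    effective_capacity = REQS_PER_INSTANCE * HEADROOM
    return max(minimum, math.ceil(load_rps / effective_capacity))

# An online controller would re-evaluate this against the latest observation:
for load in [120, 900, 4000]:
    print(f"{load} req/s -> {instances_needed(load)} instances")
```

The headroom factor is one simple way to encode the gap between an SLO ("acceptable latency") and a raw capacity number.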
Cloud Adoption
Public Cloud Adoption
Assignment 2
• http://www.cis.umassd.edu/~dkoop/cis602-2017fa/assignment2.html
• New York City Trees
 - 680,000+ trees
 - Use WebGL for visualization
 - Use a Python bridge (mapboxgl)
 - Use the fork!
 - Smaller data versions available
 - Keep using pandas
 - Label subproblems and answers
 - Due Thursday, October 19
What is virtualization?
• Suppose Alice has a machine with 4 CPUs and 8 GB of memory, and three customers:
 - Bob wants 1 CPU and 3 GB of memory
 - Charlie wants 2 CPUs and 1 GB of memory
 - Daniel wants 1 CPU and 4 GB of memory
• What should Alice do?
[Haeberlen and Ives, 2015]
(Figure: Alice's physical machine, with customers Bob, Charlie, and Daniel.)
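The slide's numbers can be checked mechanically. A small sketch of the feasibility test (names and figures are from the slide):

```python
# Feasibility check for the slide's scenario: do the three requested VMs
# fit on Alice's physical machine? (Names and numbers are from the slide.)

machine = {"cpus": 4, "mem_gb": 8}
requests = {
    "Bob":     {"cpus": 1, "mem_gb": 3},
    "Charlie": {"cpus": 2, "mem_gb": 1},
    "Daniel":  {"cpus": 1, "mem_gb": 4},
}

# Sum each resource over all requests and compare against the machine.
total = {r: sum(req[r] for req in requests.values()) for r in machine}
fits = all(total[r] <= machine[r] for r in machine)
print(total, "fits:", fits)  # the three requests exactly use 4 CPUs and 8 GB
```

The requests exactly exhaust the machine, which is what makes virtualization (rather than one machine per customer) the natural answer.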
What is virtualization?
• Alice can sell each customer a virtual machine (VM) with the requested resources
 - From each customer's perspective, it appears as if they had a physical machine all by themselves (isolation)
[Haeberlen and Ives, 2015]
(Figure: the physical machine now runs a virtual machine monitor hosting one virtual machine per customer.)
How does it work?
• Resources (CPU, memory, ...) are virtualized
 - VMM ("Hypervisor") has translation tables that map requests for virtual resources to physical resources
 - Ex: VM 1 accesses memory cell #323; VMM maps it to memory cell 123
 - For which resources does this (not) work?
 - How do VMMs differ from OS kernels?
[Haeberlen and Ives, 2015]
(Figure: two VMs, each running an OS and applications, hosted on one physical machine by the VMM.)
Translation table:
VM   Virtual    Physical
1    0–99       0–99
1    300–399    100–199
2    0–99       300–399
2    200–299    500–599
2    600–699    400–499
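The translation table is just a lookup structure. A minimal sketch, reading the VM 1 row that covers cell 323 as 300–399 so that the slide's worked example (virtual cell 323 maps to physical cell 123) holds:

```python
# Minimal sketch of the VMM translation table from the slide. Each entry
# maps a virtual address range of a VM to the start of a physical range;
# the VM 1 row covering cell 323 is read as 300-399 so that the slide's
# worked example (virtual cell 323 -> physical cell 123) holds.

TRANSLATION = {
    1: [((0, 99), 0), ((300, 399), 100)],
    2: [((0, 99), 300), ((200, 299), 500), ((600, 699), 400)],
}

def translate(vm, virt_addr):
    """Map a VM's virtual memory cell to a physical cell, or fault."""
    for (lo, hi), phys_lo in TRANSLATION[vm]:
        if lo <= virt_addr <= hi:
            return phys_lo + (virt_addr - lo)
    raise MemoryError(f"VM {vm}: no mapping for cell {virt_addr}")

print(translate(1, 323))  # the slide's example: physical cell 123
```

Real hypervisors do this in hardware (page tables and extended/nested page tables) at page granularity, not per cell; the linear scan here is only for clarity.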
Benefit: Migration
• What if the machine needs to be shut down?
 - e.g., for maintenance, consolidation, ...
 - Alice can migrate the VMs to different physical machines without any customers noticing
[Haeberlen and Ives, 2015]
(Figure: the virtual machines migrated across multiple physical machines; customers Bob, Charlie, Daniel, and Emil are unaffected.)
Benefit: Time sharing
• What if Alice gets another customer?
 - Multiple VMs can time-share the existing resources
 - Result: Alice has more virtual CPUs and virtual memory than physical resources (but not all can be active at the same time)
[Haeberlen and Ives, 2015]
(Figure: the VMs of Bob, Charlie, Daniel, and Emil time-sharing one physical machine through the VMM.)
Benefit and challenge: Isolation
• Good: Emil can't access Charlie's data
• Bad: What if the load suddenly increases?
 - Example: Emil's VM shares CPUs with Charlie's VM, and Charlie suddenly starts a large compute job
 - Emil's performance may decrease as a result
 - VMM can move Emil's software to a different CPU, or migrate it to a different machine
[Haeberlen and Ives, 2015]
(Figure: Emil's and Charlie's VMs sharing the same physical machine.)
Cloud Use by Industry
Ascending cloud: The adoption of cloud computing in five industries
There is certainly a great deal of dialogue about
the growth of cloud—but where does it actually
stand? We asked each industry sub-panel to
assess the current presence in their own industry.
The first observation is significant differences in
the rate of cloud adoption. The first movers
appear to be those that can generate a digital
“pure play” side-by-side with the legacy industry—
for example, digital banking emerging from
branch networks, or e-commerce competing with
high street retailers and shopping centres.
Manufacturing, as we shall see, presents a
more complex problem—the integration of cloud
into physical structures such as factories,
machines and assembly lines. Finally, as discussed
in our review of these industries, adoption in
Education and Healthcare is slowed by regulatory
constraints and less intense competitive
environments.
The second observation is that, as far as cloud
has come, it has a long way to go. “Pervasive
presence”—ready access and widespread
deployment—averages out to only 7% across
industries. The following industry analysis illustrates
just how fast that rate of growth is expected to be.
Cloud computing—what does it mean for key industries?
Source: EIU Survey “Cloud Computing and Economic Development”, October 2015
(Chart: "How would you characterise the current presence of cloud in the following industries?"; % of respondents reporting a significant or pervasive presence, for Banking, Retail, Manufacturing, Education, Healthcare, and the industry average.)
[The Economist]
Banking Use
Banking—disruption of a legacy business
Two trends are driving cloud adoption in banking.
The first is the adoption of cloud for back-office
and selected customer operations
banking institutions. The second is Fintech—digital
insurgents who are regularly using cloud-based
services to compete in key banking products.
According to our banking sub-panel, these
forces will drive a rapid rate of adoption—almost
three out of four predict cloud will be a major
factor in banking in five years. An analysis of banking
products and new markets shows a more
nuanced picture of cloud coexisting with non-
cloud systems. This may indicate the growth of
cloud alongside existing legacy systems, coupled
with concerns about security.

Source: EIU Survey "Cloud Computing and Economic Development", October 2015

Future cloud penetration of the banking industry
% saying cloud will be a moderate or major factor (moderate / major):
- In one year: 52 / 19
- In three years: 53 / 36
- In five years: 15 / 74
Source: EIU Survey "Cloud Computing and Economic Development", October 2015

How important is cloud in supporting sectors of the banking industry?
% saying cloud will be somewhat or very important (very / somewhat):
- New ways to make payments: 68 / 21
- Lowering banking costs: 60 / 32
- Banking for remote populations: 57 / 34
- New ways of saving: 50 / 45
- Banking for poor populations: 43 / 42
- New ways of lending: 42 / 47
[The Economist]
Education and the Cloud
Cloud, technology and education
Education shows a somewhat slower adoption
rate than other industries. Possible reasons include
a less competitive environment and traditionally
slower rates of technology adoption by
government.3
But after much initial hype, online education
has also suffered a series of disappointments as
“MOOCs” (massive open online courses) found
that large numbers of enrolled students did not
pursue their studies. Nonetheless, adoption looks
set to pick up speed in the 3-5 year timeframe,
and cloud looks set to impact the entire spectrum
of education.
3 Information Week, Public Sector Slow to Adopt Cloud Computing, June 2012
Source: EIU Survey "Cloud Computing and Economic Development", October 2015

(Chart: future cloud penetration of the education industry; % saying cloud will be a moderate or major factor in one, three, and five years.)
Source: EIU Survey "Cloud Computing and Economic Development", October 2015

How important is cloud in supporting sectors of the education industry?
% saying cloud will be somewhat or very important (very / somewhat):
- Higher education (university): 51 / 34
- Education for remote areas: 42 / 38
- Continuing/adult education: 40 / 44
- Mid-level students (ages 11-17): 40 / 46
- Education for low-income: 34 / 42
- Educating young students (ages 3-10): 25 / 48
[The Economist]
Supporting Education
[The Economist]
Amazon GovCloud
Heterogeneity and Dynamicity of Clouds at Scale:Google Trace Analysis
C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch
Issues faced by cloud schedulers
• Machine and workload heterogeneity and variability
• Highly dynamic resource demand and availability
• Predictable, but poorly predicted, resource needs
• Resource class preferences and constraints
[Reiss et al., 2012]
Previous Cluster Workload Studies
• Long-Running Services
• DAG-of-Tasks Systems (e.g., MapReduce)
• High-Performance Computing
[Reiss et al., 2012]
Discussion
Machine Configuration
Number of machines   Platform   CPUs   Memory
6732                 B          0.50   0.50
3863                 B          0.50   0.25
1001                 B          0.50   0.75
795                  C          1.00   1.00
126                  A          0.25   0.25
52                   B          0.50   0.12
5                    B          0.50   0.03
5                    B          0.50   0.97
3                    C          1.00   0.50
1                    B          0.50   0.06

Table 1: Configurations of machines in the cluster. CPU and memory units are linearly scaled so that the maximum machine is 1. Machines may change configuration during the trace; we show their first configuration.
properties and their lifecycle management, workload behavior, and resource utilization. Zhang et al. [26] study the trace from the perspective of energy-aware provisioning and energy-cost minimization, using it to motivate dynamic capacity provisioning and the challenges associated with it.
3. HETEROGENEITY
The traced 'cloud computing' workload is much less homogeneous than researchers often assume. It appears to be a mix of latency-sensitive tasks, with characteristics similar to web site serving, and less latency-sensitive programs, with characteristics similar to high-performance computing and MapReduce workloads. This heterogeneity will break many scheduling strategies that might target more specific environments. Assumptions that machines or tasks can be treated equally are broken; for example, no scheduling strategy that uses fixed-sized 'slots' or uniform randomization among tasks or machines is likely to perform well.
3.1 Machine types and attributes
The cluster machines are not homogeneous; they consist of three different platforms (the trace providers distinguish them by indicating "the microarchitecture and chipset version" [18]) and a variety of memory/compute ratios. The configurations are shown in Table 1. Exact numbers of CPU cores and bytes of memory are unavailable; instead, CPU and memory size measurements are normalized to the configuration of the largest machines. We will use these units throughout this paper. Most of the machines have half of the memory and half the CPU of the largest machines.
This variety of configurations is unlike the fully homogeneous clusters usually assumed by prior work. It is also distinct from prior work that focuses on clusters where some machines have fundamentally different types of computing hardware, like GPUs, FPGAs, or very low-power CPUs. The machines here differ in ways that can be explained by the machines being acquired over time using whatever configuration was most cost-effective then, rather than any deliberate decision to use heterogeneous hardware.
In addition to the CPU and memory capacity and microarchitecture of the machines, a substantial fraction of machine heterogeneity, from the scheduler's perspective, comes from "machine attributes". They are obfuscated <key,value> pairs, with a total of 67 unique machine attribute keys in the cell. The majority of those attributes have fewer than 10 unique values ever used by any machine. That is consistent with [21], where the only machine attributes with possible values exceeding 10 were number of disks and clock speed. In this trace, exactly 10 keys are used with more than 10 possible values. 6 of these keys are used as constraints. One of them has 12569 unique values, an order of magnitude greater than all others combined, which roughly corresponds to the number of machines in the cluster. Based on further analysis in Section 6 and [21], these attributes likely reflect a combination of machine configuration and location information. Since these attributes are all candidates for task placement constraints (discussed later), their number and variety are a concern for a scheduler. Scheduler designs can no longer consider heterogeneity of hardware an aberration.

Figure 1: Normal production (top) and lower (bottom) priority CPU usage by hour of day. The dark line is the median and the grey band represents the quartiles.
3.2 Workload types
One signal of differing job types is the priority associated with the tasks. The trace uses twelve task priorities (numbered 0 to 11), which we will group into three sets: production (9–11), middle (2–8), and gratis (0–1). The trace providers tell us that latency-sensitive tasks (as marked by another task attribute) in the production priorities should not be "evicted due to over-allocation of machine resources" [18] and that users of tasks of gratis priorities are charged substantially less for their resources.
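The priority grouping is easy to encode directly; the boundaries below are the ones the paper states.

```python
# The paper's grouping of the trace's twelve task priorities (0-11)
# into three sets: gratis (0-1), middle (2-8), production (9-11).

def priority_group(priority):
    if not 0 <= priority <= 11:
        raise ValueError(f"priority out of range: {priority}")
    if priority <= 1:
        return "gratis"
    if priority <= 8:
        return "middle"
    return "production"

print([priority_group(p) for p in (0, 5, 9)])
```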
The aggregate usage shows that the production priorities represent a different kind of workload than the others. As shown in Figure 1, production priorities account for more resource usage than all the other priorities and have the clearest daily patterns in usage (with a peak-to-mean ratio of 1.3). As can be seen from Figure 2, the production priorities also include more long-duration jobs, accounting for a majority of all jobs which run longer than a day even though only 7% of all jobs run at production priority. Usage at the lowest priority shows little such pattern, and this remains true even if short-running jobs are excluded.
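A peak-to-mean ratio like the paper's 1.3 is straightforward to compute from an hourly usage series; the series below is synthetic, for illustration only.

```python
# How a peak-to-mean ratio (the paper reports 1.3 for production usage)
# is computed from an hourly usage series; this series is synthetic.

hourly_cpu = [0.20, 0.21, 0.22, 0.24, 0.26, 0.26,
              0.25, 0.24, 0.23, 0.22, 0.21, 0.20]  # made-up diurnal samples

peak = max(hourly_cpu)
mean = sum(hourly_cpu) / len(hourly_cpu)
print(f"peak-to-mean ratio: {peak / mean:.2f}")
```

A ratio near 1 means flat demand (easy to provision for); the higher it climbs, the more capacity sits idle off-peak.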
These are clearly not perfect divisions of job purpose — each priority set appears to contain jobs that behave like user-facing services would and large numbers of short-lived batch-like jobs (based on their durations and utilization patterns). The trace contains no obvious job or task attribute that distinguishes between
[Reiss et al., 2012]
CPU Usage (Production vs. Lower Priority)
(Figure 1 axes: portion of cluster CPU by hour of day, 0–20 hours.)
Figure 1: Normal production (top) and lower (bottom) priority CPU usage by hour of day. The dark line is the median and the grey band represents the quartiles.
[Reiss et al., 2012]
Job Duration (Log-scale inverted CDF)
(Figure 2 axes: job duration in hours, 10^-2 to 10^3, vs. number of jobs running longer, 10^3 to 10^6; curves for all, production, and non-production jobs.)
Figure 2: Log-log scale inverted CDF of job durations. Only the duration for which the job runs during the trace time period is known; thus, for example, we do not observe durations longer than around 700 hours. The thin, black line shows all jobs; the thick line shows production-priority jobs; and the dashed line shows non-production priority jobs.
types of jobs besides their actual resource usage and duration: even 'scheduling class', which the trace providers say represents how latency-sensitive a job is, does not separate short-duration jobs from long-duration jobs. Nevertheless, the qualitative difference in the aggregate workloads at the higher and lower priorities shows that the trace is both unlike batch workload traces (which lack the combination of diurnal patterns and very long-running jobs) and unlike interactive services (which lack large sets of short jobs with little pattern in demand).
3.3 Job durations
Job durations range from tens of seconds to essentially the entire duration of the trace. Over 2000 jobs (from hundreds of distinct users) run for the entire trace period, while a majority of jobs last for only minutes. We infer durations from how long tasks are active during the one-month time window of the trace; jobs which are cut off by the beginning or end of the trace are a small portion (< 1%) of jobs and consist mostly of jobs which are active for at least several days, and so are not responsible for us observing many shorter job durations. These come from a large portion of the users, so it is not likely that the workload is skewed by one particular individual or application. Consistent with our intuition about priorities correlating with job types, production priorities have a much higher proportion of long-running jobs and the 'other' priorities have a much lower proportion. But slicing the jobs by priority or 'scheduling class' (which the trace providers say should reflect how latency-sensitive a job is) reveals a similar heavy-tailed distribution shape with a large number of short jobs.
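The heavy-tailed duration distribution described here is exactly what an inverted CDF like Figure 2 plots: for each duration, how many jobs run longer. A minimal pure-Python sketch (the durations below are synthetic stand-ins, not trace values):

```python
# Inverted CDF of job durations: for each observed duration,
# count how many jobs run strictly longer (ties only affect
# plotting granularity).

def inverted_cdf(durations):
    """Return (sorted durations, number of jobs appearing after each)."""
    xs = sorted(durations)
    n = len(xs)
    # After sorting, n - (i + 1) jobs come after position i.
    return xs, [n - (i + 1) for i in range(n)]

# Synthetic durations in hours: many minute-scale jobs, one trace-long job.
durations = [0.05, 0.05, 0.1, 2.0, 24.0, 700.0]
xs, longer = inverted_cdf(durations)
```

Plotting `xs` against `longer` on log-log axes reproduces the shape of Figure 2.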
3.4 Task shapes
Each task has a resource request, which should indicate the amount of CPU and memory space the task will require. (The requests are intended to represent the submitter's predicted "maximum" usage for the task.) Both the amount of the resources requested and the amount actually used by tasks vary by several orders of magnitude; see Figures 3 and 4, respectively. These are not just outliers. Over 2000 jobs request less than 0.0001 normalized units of memory per task, and over 8000 jobs request more than 0.1 units of memory per task. Similarly, over 70000 jobs request less than
Figure 3: CDF of instantaneous task requested resources. (1 unit = max machine size.) These are the raw resource requests in the trace; they do not account for task duration.
Figure 4: Log-log scale inverted CDF of instantaneous median job usage, accounting for both varying per-task usage and varying job task counts.
0.0001 units of CPU per task, and over 8000 request more than 0.1 units of CPU. Both tiny and large resource-requesting jobs include hundreds of distinct users, so it is not likely that the particularly large or small requests are caused by the quirky demands of a single individual or service.
We believe that this variety in task "shapes" has not been seen in prior workloads, if only because most schedulers simply do not support this range of sizes. The smallest resource requests are likely so small that it would be difficult for any VM-based scheduler to run a VM using that little memory. (0.0001 units would be around 50MB if the largest machines in the cluster had 512GB of memory.) Also, any slot-based scheduler, which includes all HPC and Grid installations we are aware of, would be unlikely to have thousands of slots per commodity machine.
The ratio between CPU and memory requests also spans a large range. The memory and CPU request sizes are correlated, but weakly (linear regression R² ≈ 0.14). A large number of jobs request 0 units of CPU: presumably they require so little CPU that they can depend on running in the 'left-over' CPU of a machine; it makes little sense to talk about the CPU:memory ratio without adjusting these. Rounding these requests to the next largest request size, the
[Reiss et al., 2012]
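The weak CPU-to-memory correlation quoted above (R² ≈ 0.14) comes from a simple linear regression; for a single predictor, R² is just the squared Pearson correlation. A pure-Python sketch with made-up request sizes (the trace's actual rescaled values are not reproduced here):

```python
# R^2 of a least-squares fit of CPU requests against memory requests.

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # For simple linear regression, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)

# Illustrative normalized request sizes (not values from the trace).
mem = [0.001, 0.01, 0.02, 0.05, 0.1, 0.2]
cpu = [0.002, 0.005, 0.03, 0.01, 0.12, 0.06]
r2 = r_squared(mem, cpu)
```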
Resource Requests
30D. Koop, CIS 602-01, Fall 2017
[Figure 3 plot: x-axis instantaneous resource request (normalized CPU/memory), 10^-3 to 10^0; y-axis portion of tasks requesting less, 0.0 to 1.0; series: CPU, Memory]
[Reiss et al., 2012]
Task Submission (New vs. Rescheduled)
31D. Koop, CIS 602-01, Fall 2017
[Figure 5 plot: x-axis time (days), 0 to 25; y-axis events per day, 0 to 3M; series: new tasks, rescheduling]
Figure 5: Moving average (over a day-long window) of task submission (a task becomes runnable) rates.
Figure 6: Linear-log plot of inverted CDF of interarrival periods between jobs.
count) crash loops even at the production priorities.
Small jobs. Even though the scheduler runs large, long-lived parallel jobs, most of the jobs in the trace only request a small amount of a single machine's resources and only run for several minutes. 75% of jobs consist of only one task, half of the jobs run for less than 3 minutes, and most jobs request less than 5% of the average machine's resources. These small job submissions are frequent, ensuring that the cluster scheduler has new work to schedule nearly every minute of every day.
Job submissions are clustered together (Figure 6 shows job interarrival times), with around 40% of submissions recorded less than 10 milliseconds after the previous submission even though the median interarrival period is 900 ms. The tail of the distribution of interarrival times is power-law-like, though the maximum job interarrival period is only 11 minutes.
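The interarrival statistics above (40% of gaps under 10 ms, median around 900 ms) can be computed directly from submission timestamps. A sketch with synthetic timestamps, not trace data:

```python
# Interarrival periods between consecutive job submissions.

def interarrival_stats(submit_times_ms, threshold_ms=10):
    ts = sorted(submit_times_ms)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    frac_small = sum(1 for g in gaps if g < threshold_ms) / len(gaps)
    median = sorted(gaps)[len(gaps) // 2]
    return gaps, frac_small, median

# Synthetic submission times in ms: a burst of near-simultaneous
# submissions followed by more widely spaced ones.
times = [0, 2, 5, 905, 1810, 3000]
gaps, frac_small, median = interarrival_stats(times)
```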
The prevalence of many very small interarrival periods suggests that some sets of jobs are part of the same logical program and are intended to run together. For example, the trace providers indicate that MapReduce programs run with separate 'master' and 'worker' jobs, which will presumably each have a different shape.
Figure 7: Moving average (over a day-long window) of task eviction rates, broken down by priority.
Another likely cause is embarrassingly parallel programs being split into many distinct single-task jobs. (Users might specify many small jobs rather than one job with many tasks to avoid implying any co-scheduling requirement between the parallel tasks.) A combination of the two of these might explain the very large number of single-task jobs.
Evictions. Evictions are also a common cause of task rescheduling. There are 4.5M evictions recorded in the trace, more than the number of recorded software failures after excluding the largest crash-looping jobs. As would be expected, eviction rates are related to task priorities. The rate of evictions for production-priority tasks is comparable to the rate of machine churn: between one per one hundred task-days and one per fifteen task-days, depending on how many unknown task terminations are due to evictions. Most of these evictions are near in time to a machine configuration record for the machine the task was evicted from, so we suspect most of these evictions are due to machine availability changes.
The rate of evictions at lower priorities varies by orders of magnitude, with some weekly pattern in the eviction rate. Gratis-priority tasks average at least 4 evictions per task-day, though almost none of these evictions occur on what appear to be weekends. Given this eviction rate, an average 100-task job running at a gratis priority would expect about one task to be lost every 15 minutes. These programs must tolerate a very high "failure" rate by the standards of a typical cloud computing provider or Hadoop cluster.
Almost all of these evictions occur within half a second of another task of the same or higher priority starting on the same machine. This indicates that most of these evictions are probably intended to free resources for those tasks. Since the evictions occur so soon after the higher-priority task is scheduled, it is unlikely that many of these evictions are driven by resource usage monitoring. If the scheduler were measuring resource usage to determine when lower-priority tasks should be evicted, then one would expect many higher-priority tasks to have a 'warm-up' period of at least some seconds. During this period, the resources used by the lower-priority task would not yet be required, so the task would not be immediately evicted.
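The matching described here (did a same-or-higher-priority task start on the same machine within half a second of each eviction?) is a simple window join over two event streams. A sketch with hypothetical event tuples:

```python
# For each eviction, test whether a task of equal or higher priority
# started on the same machine within `window` seconds.

def eviction_explained(evictions, starts, window=0.5):
    """evictions, starts: lists of (time_s, machine, priority)."""
    flags = []
    for t, machine, prio in evictions:
        flags.append(any(machine == sm and sp >= prio and abs(t - st) <= window
                         for st, sm, sp in starts))
    return flags

# Hypothetical events: the first eviction has a high-priority start
# 0.3 s away on the same machine; the second start is 1.0 s away.
evictions = [(10.0, "m1", 0), (50.0, "m2", 2)]
starts = [(10.3, "m1", 9), (49.0, "m2", 9)]
flags = eviction_explained(evictions, starts)
```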
Given that the scheduler is evicting before the resources actually
[Reiss et al., 2012]
Requested vs. Allocated Resources
32D. Koop, CIS 602-01, Fall 2017
[Figure 8 plots: x-axes time (days), 0 to 25; y-axes portion of CPU (top) and portion of memory (bottom), 0 to 1.0; left panels show usage, right panels show allocations]
Figure 8: Moving hourly average of CPU (top) and memory (bottom) utilization (left) and resource requests (right). Stacked plot by priority range, highest priorities (production) on the bottom (in red/lightest color), followed by the middle priorities (green), and gratis (blue/darkest color). The dashed line near the top of each plot shows the total capacity of the cluster.
come into conflict, some evicted tasks could probably run substantially longer, potentially until completion, without any resource conflicts. To estimate how often this might be occurring, we examined maximum machine usage after eviction events and compared these to the requested resources of the evicted tasks. After around 30% of evictions, resources requested by the evicted tasks appear to remain free for an hour after the eviction, suggesting that these evictions were either unnecessary or were to make way for brief usage spikes we cannot detect in this monitoring data.
5. RESOURCE USAGE PREDICTABILITY
The trace includes two types of information about the resource usage of jobs running on the cluster: the resource requests that accompany each task and the actual resource usage of running tasks. If the scheduler could predict actual task resource usage more accurately than that suggested by the requests, tasks could be packed more tightly without degrading performance. Thus, we are interested in the actual usage by jobs on the cluster. We find that, even though there is a lot of task churn, overall resource usage is stable. This stability provides better predictions of resource usage than the resource requests do.
5.1 Usage overview
Figure 8 shows the utilization on the cluster over the 29-day trace period. We evaluated utilization both in terms of measured resource consumption (left side of figure) and 'allocations' (requested resources of running tasks; right side of figure). Based on allocations, the cluster is very heavily booked. Total resource allocation at almost any time accounts for more than 80% of the cluster's memory capacity and more than 100% of the cluster's CPU capacity. Overall usage is much lower: averaging over one-hour windows, memory usage does not exceed about 50% of the capacity of the cluster and CPU usage does not exceed about 60%.
The trace providers include usage information for tasks in five-minute segments. At each five-minute boundary, when data is not missing, there is at least (and usually exactly) one usage record for each task which is running during that time period. Each record is marked with a start and end time. This usage record includes a
Figure 9: CDF of changes in average task utilization between two consecutive hours, weighted by task duration. Tasks which do not run in consecutive hours are excluded.
Figure 10: CDF of changes in average machine utilization between two consecutive five-minute sampling periods. Solid lines exclude tasks which start or stop during one of the five-minute sampling periods.
number of types of utilization measurements gathered from Linux containers. Since they are obtained from the Linux kernel, memory usage measurements include some of the memory usage the kernel makes on behalf of the task (such as page cache); tasks are expected to request enough memory to include such kernel-managed memory they require. We will usually use utilization measurements that represent the average CPU and memory utilization over the measurement period.
To compute the actual utilization, we divided the trace into five-minute sampling periods; within each period, for each task usage record available, we took the sum of the average CPU and memory usage weighted by the length of the measurement. We did not attempt to compensate for missing usage records (which the trace producers estimate account for no more than 1% of the records). The trace providers state that missing records may result from "the monitoring system or cluster [getting] overloaded" and from filtering out records "mislabeled due to a bug in the monitoring system" [18].
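The aggregation step described above can be sketched as follows: within each five-minute period, each task's average usage is weighted by how long its record overlaps the period. The tuple layout below is an assumption for illustration, not the trace schema:

```python
# Sum of per-task average usage, weighted by measured length,
# within one five-minute (300 s) sampling period.

def period_utilization(records, period_start, period_end):
    """records: list of (record_start, record_end, avg_usage)."""
    span = period_end - period_start
    total = 0.0
    for start, end, usage in records:
        # Length of the record that falls inside the period.
        overlap = max(0.0, min(end, period_end) - max(start, period_start))
        total += usage * (overlap / span)
    return total

# One task measured for the full period, one for only half of it.
records = [(0, 300, 0.4), (150, 300, 0.2)]
util = period_utilization(records, 0, 300)
```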
5.2 Usage stability
When tasks run for several hours, their resource usage is generally stable, as can be seen in Figure 9. Task memory usage changes very little once most tasks are running. Memory usage data is based on physical memory usage, so this stability is not simply a consequence of measuring the available address space rather than actual memory pressure.
[Reiss et al., 2012]
Jobs, Tasks, and Resource Requests
33D. Koop, CIS 602-01, Fall 2017
Figure 11: CDF of utilization by task duration. Note that tasks running for less than two hours account for less than 10% of utilization by any measure.
Because there are many small tasks, a large relative change in CPU or memory usage of an individual task may not translate to large changes in the overall ability to fit new tasks on a machine. A simple strategy for determining if tasks fit on a machine is to examine what resources are currently free on each machine and predict that those resources will remain free. We examined how well this would perform by examining how much machine usage changes. On timescales of minutes, this usually predicts machine utilization well, as can be seen in Figure 10. Since most tasks only run for minutes, this amount of prediction is all that is likely required to place most tasks effectively. Longer-running tasks may require more planning, but their resource usage tends to mimic other longer-running tasks, so the scheduler can plan well even for these long-running tasks by monitoring running task usage.
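The "free resources stay free" strategy can be evaluated exactly as described: predict each machine's next-period utilization as its current utilization and look at the error. A sketch with synthetic samples:

```python
# Error of predicting that a machine's current utilization persists
# into the next sampling period.

def persistence_errors(series):
    """series: utilization samples for one machine at consecutive periods."""
    return [abs(curr - prev) for prev, curr in zip(series, series[1:])]

# Synthetic per-period utilization: mostly stable, with one jump.
machine_util = [0.50, 0.52, 0.51, 0.70, 0.69]
errors = persistence_errors(machine_util)
```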
5.3 Short jobs
One apparent obstacle to forecasting resource availability from previous resource usage is the frequency with which tasks start and stop. Fortunately, though there are a large number of tasks starting and stopping, these short tasks do not contribute significantly to usage. This is why, as seen in Figure 10, ignoring the many tasks which start or stop within five minutes does not have a very large effect. Figure 11 indicates that jobs shorter than two hours account for less than 10% of the overall utilization (even though they represent more than 95% of the jobs). Hence, the scheduler may safely ignore short-running jobs when forecasting cluster utilization.
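The Figure 11 claim (jobs under two hours contribute under 10% of utilization) weights each job by its usage-hours rather than by job count. A sketch with a synthetic job mix:

```python
# Share of total usage-hours contributed by jobs shorter than a cutoff.

def short_job_share(jobs, cutoff_hours=2.0):
    """jobs: list of (duration_hours, avg_usage)."""
    total = sum(d * u for d, u in jobs)
    short = sum(d * u for d, u in jobs if d < cutoff_hours)
    return short / total

# Synthetic mix: 95 minute-scale jobs plus 5 very long jobs that
# dominate usage-hours despite being a small share of the job count.
jobs = [(0.05, 0.01)] * 95 + [(700.0, 0.05)] * 5
share = short_job_share(jobs)
```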
Even though most jobs are very short, it is not rare for users to run long jobs. 615 of the 925 users of the cluster submit at least one job which runs for more than a day, and 310 do so outside the gratis priorities.
5.4 Resource requests
Even though we've seen that resource requests in this trace are not accurate in aggregate, perhaps this behavior only arises because users lack tools to determine accurate requests or because users lack incentives to make accurate requests. Thus, we are interested
[Figure 12 plots: histograms of number of jobs (log-scale y-axes, 10^1 to 10^5) by task count (log-scale x-axis, 10^0 to 10^4), memory request per task (0.00 to 0.25), and CPU request per task (0.00 to 0.25)]
Figure 12: Histograms of job counts by task count (top), memory request size (middle), and CPU request size (bottom). Note the log scale on each y-axis and the log scale on the top plot's x-axis. We speculate that many of the periodic peaks in task counts and resource request sizes represent humans choosing round numbers. Memory and CPU units are the same as in Table 1. Due to the choice of x-axis limits, not all jobs appear on these plots.
in how closely requests could be made to reflect the monitored usage.
Non-automation. Resource requests appear to be specified manually, which may explain why they do not correspond to actual usage. One sign of manual request specification is the uneven distribution of resource request sizes, shown in Figure 12. When users specify parameters, they tend to choose round numbers like 16, 100, 500, 1000, and 1024. This pattern can clearly be seen in the number of tasks selected for jobs in this trace; it is not plausible that the multiples of powers of ten are the result of a technical choice. We cannot directly identify any similar round numbers in the CPU and memory requests because the raw values have been rescaled, but the distribution shows similar periodic peaks, which might represent, e.g., multiples of 100 megabytes of memory or some fraction of a core. For memory requests, it is unlikely that these round numbers accurately reflect requirements. For CPU requests, whole numbers of CPUs would accurately reflect the CPU that a disproportionate number of tasks would use [19], but it seems unlikely that the smallest 'bands' (at around 1/80th of a machine) represent a whole number of cores on a 2011 commodity machine.
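The round-number pattern in Figure 12 can be probed by checking how many jobs use "human" values for their task counts; the specific set of round values below is an illustrative assumption, not taken from the paper:

```python
# Fraction of jobs whose task counts hit common "round" values,
# a rough signal of manually chosen parameters.

ROUND_VALUES = {1, 2, 5, 10, 16, 50, 100, 500, 1000, 1024}

def round_fraction(task_counts):
    hits = sum(1 for c in task_counts if c in ROUND_VALUES)
    return hits / len(task_counts)

# Synthetic task counts per job.
counts = [1, 1, 100, 1000, 7, 1024, 3, 500]
frac = round_fraction(counts)
```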
Request accuracy. Requests in this trace are supposed to indicate the "maximum amount . . . a task is permitted to use" [18]. Large gaps between aggregate usage and aggregate allocation, therefore, do not necessarily indicate that the requests are inaccurate. If a task ever required those CPU and memory resources for even a second of its execution, then the request would be accurate, regardless of its average consumption. Thus, resource requests could
[Reiss et al., 2012]
Discussion
34D. Koop, CIS 602-01, Fall 2017