TRANSCRIPT
CIS 602-01: Scalable Data Analysis
Cloud Workloads
Dr. David Koop
D. Koop, CIS 602-01, Fall 2017
Scaling Up
[Haeberlen and Ives, 2015]
PC → Server → Cluster → Data center → Network of data centers
Scale Problems
1. Difficult to dimension
 - Load can vary considerably
 - Waste resources or lose customers
2. Expensive
 - Hardware costs
 - Personnel costs
 - Maintenance costs
3. Difficult to scale
 - Scaling up (new machines, new buildings)
 - Scaling down (energy, fixed costs)
[Haeberlen and Ives, 2015]
Energy
• Data centers consume a lot of energy
 - Makes sense to build them near sources of cheap electricity
 - Example: price per kWh is 3.6¢ in Idaho (near hydroelectric power), 10¢ in California (long-distance transmission), 18¢ in Hawaii (must ship fuel)
 - Most of this is converted into heat → cooling is a big issue!
[Haeberlen and Ives, 2015]
Company      Servers   Electricity       Cost
eBay         16K       ~0.6×10^5 MWh     ~$3.7M/yr
Akamai       40K       ~1.7×10^5 MWh     ~$10M/yr
Rackspace    50K       ~2×10^5 MWh       ~$12M/yr
Microsoft    >200K     >6×10^5 MWh       >$36M/yr
Google       >500K     >6.3×10^5 MWh     >$38M/yr
USA (2006)   10.9M     610×10^5 MWh      $4.5B/yr
Source: Qureshi et al., SIGCOMM 2009
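Figures like these invite a back-of-the-envelope check. The sketch below estimates annual electricity cost for a fleet; the per-server wattage (200 W) and the PUE overhead factor are illustrative assumptions, not figures from the table.

```python
# Back-of-the-envelope annual electricity cost for a server fleet.
# The wattage and PUE below are illustrative assumptions, not table figures.

def annual_electricity_cost(num_servers, watts_per_server, price_per_kwh, pue=1.5):
    """Estimate yearly electricity cost in dollars.

    PUE (power usage effectiveness) scales the IT load to account for
    cooling and other overhead, the "heat" issue noted on the slide.
    """
    hours_per_year = 24 * 365
    it_energy_kwh = num_servers * watts_per_server / 1000 * hours_per_year
    return it_energy_kwh * pue * price_per_kwh

# One hypothetical 50K-server fleet at the slide's three example rates:
for state, price in [("Idaho", 0.036), ("California", 0.10), ("Hawaii", 0.18)]:
    print(f"{state}: ${annual_electricity_cost(50_000, 200, price) / 1e6:.1f}M/yr")
```

The spread between the Idaho and Hawaii rates alone changes the annual bill by a factor of five, which is why siting matters.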
Modular data centers
• Need more capacity? Just deploy another container!
[Haeberlen and Ives, 2015]
Power Plant to Cloud Analogy
[Haeberlen and Ives, 2015]
(Figure: electricity delivered directly from a power source vs. delivered through a network and meter to the customer; the utility model that cloud computing imitates.)
Cloud Computing Architecture
J Internet Serv Appl (2010) 1: 7–18
virtualized resources for high-level applications. A virtualized server is commonly called a virtual machine (VM). Virtualization forms the foundation of cloud computing, as it provides the capability of pooling computing resources from clusters of servers and dynamically assigning or reassigning virtual resources to applications on-demand.
Autonomic Computing: Originally coined by IBM in 2001, autonomic computing aims at building computing systems capable of self-management, i.e. reacting to internal and external observations without human intervention. The goal of autonomic computing is to overcome the management complexity of today's computer systems. Although cloud computing exhibits certain autonomic features such as automatic resource provisioning, its objective is to lower the resource cost rather than to reduce system complexity.
In summary, cloud computing leverages virtualization technology to achieve the goal of providing computing resources as a utility. It shares certain aspects with grid computing and autonomic computing but differs from them in other aspects. Therefore, it offers unique benefits and imposes distinctive challenges to meet its requirements.
3 Cloud computing architecture
This section describes the architectural, business and various operation models of cloud computing.
3.1 A layered model of cloud computing
Generally speaking, the architecture of a cloud computing environment can be divided into 4 layers: the hardware/datacenter layer, the infrastructure layer, the platform layer and the application layer, as shown in Fig. 1. We describe each of them in detail:
The hardware layer: This layer is responsible for managing the physical resources of the cloud, including physical servers, routers, switches, power and cooling systems. In practice, the hardware layer is typically implemented in data centers. A data center usually contains thousands of servers that are organized in racks and interconnected through switches, routers or other fabrics. Typical issues at the hardware layer include hardware configuration, fault-tolerance, traffic management, power and cooling resource management.
The infrastructure layer: Also known as the virtualization layer, the infrastructure layer creates a pool of storage and computing resources by partitioning the physical resources using virtualization technologies such as Xen [55], KVM [30] and VMware [52]. The infrastructure layer is an essential component of cloud computing, since many key features, such as dynamic resource assignment, are only made available through virtualization technologies.
The platform layer: Built on top of the infrastructure layer, the platform layer consists of operating systems and application frameworks. The purpose of the platform layer is to minimize the burden of deploying applications directly into VM containers. For example, Google App Engine operates at the platform layer to provide API support for implementing storage, database and business logic of typical web applications.
The application layer: At the highest level of the hierarchy, the application layer consists of the actual cloud applications. Different from traditional applications, cloud applications can leverage the automatic-scaling feature to achieve better performance, availability and lower operating cost.
Compared to traditional service hosting environments such as dedicated server farms, the architecture of cloud computing is more modular. Each layer is loosely coupled with the layers above and below, allowing each layer to evolve separately. This is similar to the design of the OSI model for network protocols.
Fig. 1 Cloud computing architecture
[Zhang et al., 2010]
Cloud Comparison
Table 1: A comparison of representative commercial products (Amazon EC2 / Windows Azure / Google App Engine)
- Classes of utility computing: Infrastructure service / Platform service / Platform service
- Target applications: General-purpose applications / General-purpose Windows applications / Traditional web applications with supported frameworks
- Computation: OS level on a Xen virtual machine / Microsoft Common Language Runtime (CLR) VM with predefined roles of application instances / Predefined web application frameworks
- Storage: Elastic Block Store, Amazon Simple Storage Service (S3), Amazon SimpleDB / Azure storage service and SQL Data Services / BigTable and MegaStore
- Auto scaling: Automatically changing the number of instances based on parameters that users specify / Automatic scaling based on application roles and a configuration file specified by users / Automatic scaling which is transparent to users
The .NET Services facilitate the creation of distributed applications. The Access Control component provides a cloud-based implementation of single identity verification across applications and companies. The Service Bus helps an application expose web services endpoints that can be accessed by other applications, whether on-premises or in the cloud. Each exposed endpoint is assigned a URI, which clients can use to locate and access a service.
All of the physical resources, VMs and applications in the data center are monitored by software called the fabric controller. With each application, the users upload a configuration file that provides an XML-based description of what the application needs. Based on this file, the fabric controller decides where new applications should run, choosing physical servers to optimize hardware utilization.
5.2.3 Google App Engine
Google App Engine [20] is a platform for traditional web applications in Google-managed data centers. Currently, the supported programming languages are Python and Java. Web frameworks that run on Google App Engine include Django, CherryPy, Pylons, and web2py, as well as a custom Google-written web application framework similar to JSP or ASP.NET. Google handles deploying code to a cluster, monitoring, failover, and launching application instances as necessary. Current APIs support features such as storing and retrieving data from a BigTable [10] non-relational database, making HTTP requests and caching. Developers have read-only access to the filesystem on App Engine.
Table 1 summarizes the three examples of popular cloud offerings in terms of the classes of utility computing, target types of application, and more importantly their models of computation, storage and auto-scaling. Apparently, these cloud offerings are based on different levels of abstraction and management of the resources. Users can choose one type or combinations of several types of cloud offerings to satisfy specific business requirements.
6 Research challenges
Although cloud computing has been widely adopted by the industry, the research on cloud computing is still at an early stage. Many existing issues have not been fully addressed, while new challenges keep emerging from industry applications. In this section, we summarize some of the challenging research issues in cloud computing.
6.1 Automated service provisioning
One of the key features of cloud computing is the capability of acquiring and releasing resources on-demand. The objective of a service provider in this case is to allocate and de-allocate resources from the cloud to satisfy its service level objectives (SLOs), while minimizing its operational cost. However, it is not obvious how a service provider can achieve this objective. In particular, it is not easy to determine how to map SLOs such as QoS requirements to low-level resource requirements such as CPU and memory requirements. Furthermore, to achieve high agility and respond to rapid demand fluctuations such as in the flash crowd effect, the resource provisioning decisions must be made online.
Automated service provisioning is not a new problem. Dynamic resource provisioning for Internet applications has been studied extensively in the past [47, 57]. These approaches typically involve: (1) Constructing an application performance model that predicts the number of application instances required to handle demand at each particular level,
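Step (1) can be sketched concretely. In the toy model below, the per-instance capacity and headroom figures are invented for illustration; a real performance model would be fitted to measurements.

```python
import math

# Hypothetical sketch of step (1): a performance model mapping a demand
# level (requests/sec) to the number of instances needed to meet an SLO.
# REQS_PER_INSTANCE and HEADROOM are invented figures for illustration.

REQS_PER_INSTANCE = 500   # assumed capacity of one instance at acceptable latency
HEADROOM = 0.8            # target 80% utilization to absorb bursts (flash crowds)

def instances_needed(load_rps, minimum=1):
    """Predict the instance count required at a given demand level."""
    effective_capacity = REQS_PER_INSTANCE * HEADROOM
    return max(minimum, math.ceil(load_rps / effective_capacity))

# An online controller would re-evaluate this against the latest observation:
for load in [120, 900, 4000]:
    print(f"{load} req/s -> {instances_needed(load)} instances")
```

The headroom factor is one simple way to encode the gap between an SLO ("acceptable latency") and a raw capacity number.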
Cloud Adoption
Public Cloud Adoption
Assignment 2
• http://www.cis.umassd.edu/~dkoop/cis602-2017fa/assignment2.html
• New York City Trees
 - 680,000+ trees
 - Use WebGL for visualization
 - Use a Python bridge (mapboxgl)
 - Use the fork!
 - Smaller data versions available
 - Keep using pandas
 - Label subproblems and answers
 - Due Thursday, October 19
What is virtualization?
• Suppose Alice has a machine with 4 CPUs and 8 GB of memory, and three customers:
 - Bob wants 1 CPU and 3 GB of memory
 - Charlie wants 2 CPUs and 1 GB of memory
 - Daniel wants 1 CPU and 4 GB of memory
• What should Alice do?
[Haeberlen and Ives, 2015]
(Figure: Alice's physical machine, with customers Bob, Charlie, and Daniel.)
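The slide's numbers can be checked mechanically. A small sketch of the feasibility test (names and figures are from the slide):

```python
# Feasibility check for the slide's scenario: do the three requested VMs
# fit on Alice's physical machine? (Names and numbers are from the slide.)

machine = {"cpus": 4, "mem_gb": 8}
requests = {
    "Bob":     {"cpus": 1, "mem_gb": 3},
    "Charlie": {"cpus": 2, "mem_gb": 1},
    "Daniel":  {"cpus": 1, "mem_gb": 4},
}

# Sum each resource over all requests and compare against the machine.
total = {r: sum(req[r] for req in requests.values()) for r in machine}
fits = all(total[r] <= machine[r] for r in machine)
print(total, "fits:", fits)  # the three requests exactly use 4 CPUs and 8 GB
```

The requests exactly exhaust the machine, which is what makes virtualization (rather than one machine per customer) the natural answer.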
What is virtualization?
• Alice can sell each customer a virtual machine (VM) with the requested resources
 - From each customer's perspective, it appears as if they had a physical machine all by themselves (isolation)
[Haeberlen and Ives, 2015]
(Figure: the physical machine now runs a virtual machine monitor hosting one virtual machine per customer.)
How does it work?
• Resources (CPU, memory, ...) are virtualized
 - VMM ("Hypervisor") has translation tables that map requests for virtual resources to physical resources
 - Ex: VM 1 accesses memory cell #323; VMM maps it to memory cell 123
 - For which resources does this (not) work?
 - How do VMMs differ from OS kernels?
[Haeberlen and Ives, 2015]
(Figure: two VMs, each running an OS and applications, hosted on one physical machine by the VMM.)
Translation table:
VM   Virtual    Physical
1    0–99       0–99
1    300–399    100–199
2    0–99       300–399
2    200–299    500–599
2    600–699    400–499
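The translation table is just a lookup structure. A minimal sketch, reading the VM 1 row that covers cell 323 as 300–399 so that the slide's worked example (virtual cell 323 maps to physical cell 123) holds:

```python
# Minimal sketch of the VMM translation table from the slide. Each entry
# maps a virtual address range of a VM to the start of a physical range;
# the VM 1 row covering cell 323 is read as 300-399 so that the slide's
# worked example (virtual cell 323 -> physical cell 123) holds.

TRANSLATION = {
    1: [((0, 99), 0), ((300, 399), 100)],
    2: [((0, 99), 300), ((200, 299), 500), ((600, 699), 400)],
}

def translate(vm, virt_addr):
    """Map a VM's virtual memory cell to a physical cell, or fault."""
    for (lo, hi), phys_lo in TRANSLATION[vm]:
        if lo <= virt_addr <= hi:
            return phys_lo + (virt_addr - lo)
    raise MemoryError(f"VM {vm}: no mapping for cell {virt_addr}")

print(translate(1, 323))  # the slide's example: physical cell 123
```

Real hypervisors do this in hardware (page tables and extended/nested page tables) at page granularity, not per cell; the linear scan here is only for clarity.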
Benefit: Migration
• What if the machine needs to be shut down?
 - e.g., for maintenance, consolidation, ...
 - Alice can migrate the VMs to different physical machines without any customers noticing
[Haeberlen and Ives, 2015]
(Figure: the virtual machines migrated across multiple physical machines; customers Bob, Charlie, Daniel, and Emil are unaffected.)
Benefit: Time sharing
• What if Alice gets another customer?
 - Multiple VMs can time-share the existing resources
 - Result: Alice has more virtual CPUs and virtual memory than physical resources (but not all can be active at the same time)
[Haeberlen and Ives, 2015]
(Figure: the VMs of Bob, Charlie, Daniel, and Emil time-sharing one physical machine through the VMM.)
Benefit and challenge: Isolation
• Good: Emil can't access Charlie's data
• Bad: What if the load suddenly increases?
 - Example: Emil's VM shares CPUs with Charlie's VM, and Charlie suddenly starts a large compute job
 - Emil's performance may decrease as a result
 - VMM can move Emil's software to a different CPU, or migrate it to a different machine
[Haeberlen and Ives, 2015]
(Figure: Emil's and Charlie's VMs sharing the same physical machine.)
Cloud Use by Industry
Ascending cloud: The adoption of cloud computing in five industries
There is certainly a great deal of dialogue about
the growth of cloud—but where does it actually
stand? We asked each industry sub-panel to
assess the current presence in their own industry.
The first observation is significant differences in
the rate of cloud adoption. The first movers
appear to be those that can generate a digital
“pure play” side-by-side with the legacy industry—
for example, digital banking emerging from
branch networks, or e-commerce competing with
high street retailers and shopping centres.
Manufacturing, as we shall see, presents a
more complex problem—the integration of cloud
into physical structures such as factories,
machines and assembly lines. Finally, as discussed
in our review of these industries, adoption in
Education and Healthcare is slowed by regulatory
constraints and less intense competitive
environments.
The second observation is that, as far as cloud
has come, it has a long way to go. “Pervasive
presence”—ready access and widespread
deployment—averages out to only 7% across
industries. The following industry analysis illustrates
just how fast that rate of growth is expected to be.
Cloud computing—what does it mean for key industries?
Source: EIU Survey “Cloud Computing and Economic Development”, October 2015
(Chart: "How would you characterise the current presence of cloud in the following industries?"; % of respondents reporting a significant or pervasive presence, for Banking, Retail, Manufacturing, Education, Healthcare, and the industry average.)
[The Economist]
Banking Use
Banking—disruption of a legacy business
Two trends are driving cloud adoption in banking.
The first is the adoption of cloud for back-office
and selected customer operations
banking institutions. The second is Fintech—digital
insurgents who are regularly using cloud-based
services to compete in key banking products.
According to our banking sub-panel, these
forces will drive a rapid rate of adoption—almost
three out of four predict cloud will be a major
factor in banking in five years. An analysis of banking
products and new markets shows a more
nuanced picture of cloud coexisting with non-
cloud systems. This may indicate the growth of
cloud alongside existing legacy systems, coupled
with concerns about security.

Source: EIU Survey "Cloud Computing and Economic Development", October 2015

Future cloud penetration of the banking industry
% saying cloud will be a moderate or major factor (moderate / major):
- In one year: 52 / 19
- In three years: 53 / 36
- In five years: 15 / 74
Source: EIU Survey "Cloud Computing and Economic Development", October 2015

How important is cloud in supporting sectors of the banking industry?
% saying cloud will be somewhat or very important (very / somewhat):
- New ways to make payments: 68 / 21
- Lowering banking costs: 60 / 32
- Banking for remote populations: 57 / 34
- New ways of saving: 50 / 45
- Banking for poor populations: 43 / 42
- New ways of lending: 42 / 47
[The Economist]
Education and the Cloud
Cloud, technology and education
Education shows a somewhat slower adoption
rate than other industries. Possible reasons include
a less competitive environment and traditionally
slower rates of technology adoption by
government.3
But after much initial hype, online education
has also suffered a series of disappointments as
“MOOCs” (massive open online courses) found
that large numbers of enrolled students did not
pursue their studies. Nonetheless, adoption looks
set to pick up speed in the 3-5 year timeframe,
and cloud looks set to impact the entire spectrum
of education.
3 Information Week, Public Sector Slow to Adopt Cloud Computing, June 2012
Source: EIU Survey "Cloud Computing and Economic Development", October 2015

(Chart: future cloud penetration of the education industry; % saying cloud will be a moderate or major factor in one, three, and five years.)
Source: EIU Survey "Cloud Computing and Economic Development", October 2015

How important is cloud in supporting sectors of the education industry?
% saying cloud will be somewhat or very important (very / somewhat):
- Higher education (university): 51 / 34
- Education for remote areas: 42 / 38
- Continuing/adult education: 40 / 44
- Mid-level students (ages 11-17): 40 / 46
- Education for low-income: 34 / 42
- Educating young students (ages 3-10): 25 / 48
[The Economist]
Supporting Education
[The Economist]
Amazon GovCloud
Heterogeneity and Dynamicity of Clouds at Scale:Google Trace Analysis
C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch
Issues faced by cloud schedulers
• Machine and workload heterogeneity and variability
• Highly dynamic resource demand and availability
• Predictable, but poorly predicted, resource needs
• Resource class preferences and constraints
[Reiss et al., 2012]
Previous Cluster Workload Studies
• Long-Running Services
• DAG-of-Tasks Systems (e.g., MapReduce)
• High-Performance Computing
[Reiss et al., 2012]
Discussion
Machine Configuration
Number of machines   Platform   CPUs   Memory
6732                 B          0.50   0.50
3863                 B          0.50   0.25
1001                 B          0.50   0.75
795                  C          1.00   1.00
126                  A          0.25   0.25
52                   B          0.50   0.12
5                    B          0.50   0.03
5                    B          0.50   0.97
3                    C          1.00   0.50
1                    B          0.50   0.06

Table 1: Configurations of machines in the cluster. CPU and memory units are linearly scaled so that the maximum machine is 1. Machines may change configuration during the trace; we show their first configuration.
properties and their lifecycle management, workload behavior, and resource utilization. Zhang et al. [26] study the trace from the perspective of energy-aware provisioning and energy-cost minimization, using it to motivate dynamic capacity provisioning and the challenges associated with it.
3. HETEROGENEITY
The traced 'cloud computing' workload is much less homogeneous than researchers often assume. It appears to be a mix of latency-sensitive tasks, with characteristics similar to web site serving, and less latency-sensitive programs, with characteristics similar to high-performance computing and MapReduce workloads. This heterogeneity will break many scheduling strategies that might target more specific environments. Assumptions that machines or tasks can be treated equally are broken; for example, no scheduling strategy that uses fixed-sized 'slots' or uniform randomization among tasks or machines is likely to perform well.
3.1 Machine types and attributes
The cluster machines are not homogeneous; they consist of three different platforms (the trace providers distinguish them by indicating "the microarchitecture and chipset version" [18]) and a variety of memory/compute ratios. The configurations are shown in Table 1. Exact numbers of CPU cores and bytes of memory are unavailable; instead, CPU and memory size measurements are normalized to the configuration of the largest machines. We will use these units throughout this paper. Most of the machines have half of the memory and half the CPU of the largest machines.
This variety of configurations is unlike the fully homogeneous clusters usually assumed by prior work. It is also distinct from prior work that focuses on clusters where some machines have fundamentally different types of computing hardware, like GPUs, FPGAs, or very low-power CPUs. The machines here differ in ways that can be explained by the machines being acquired over time using whatever configuration was most cost-effective then, rather than any deliberate decision to use heterogeneous hardware.
In addition to the CPU and memory capacity and microarchitecture of the machines, a substantial fraction of machine heterogeneity, from the scheduler's perspective, comes from "machine attributes". They are obfuscated <key,value> pairs, with a total of 67 unique machine attribute keys in the cell. The majority of those attributes have fewer than 10 unique values ever used by any machine. That is consistent with [21], where the only machine attributes with possible values exceeding 10 were number of disks and clock speed. In this trace, exactly 10 keys are used with more than 10 possible values. 6 of these keys are used as constraints. One of them has 12569 unique values, an order of magnitude greater than all others combined, which roughly corresponds to the number of machines in the cluster. Based on further analysis in Section 6 and [21], these attributes likely reflect a combination of machine configuration and location information. Since these attributes are all candidates for task placement constraints (discussed later), their number and variety are a concern for a scheduler. Scheduler designs can no longer consider heterogeneity of hardware an aberration.

Figure 1: Normal production (top) and lower (bottom) priority CPU usage by hour of day. The dark line is the median and the grey band represents the quartiles.
3.2 Workload types
One signal of differing job types is the priority associated with the tasks. The trace uses twelve task priorities (numbered 0 to 11), which we will group into three sets: production (9–11), middle (2–8), and gratis (0–1). The trace providers tell us that latency-sensitive tasks (as marked by another task attribute) in the production priorities should not be "evicted due to over-allocation of machine resources" [18] and that users of tasks of gratis priorities are charged substantially less for their resources.
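The priority grouping is easy to encode directly; the boundaries below are the ones the paper states.

```python
# The paper's grouping of the trace's twelve task priorities (0-11)
# into three sets: gratis (0-1), middle (2-8), production (9-11).

def priority_group(priority):
    if not 0 <= priority <= 11:
        raise ValueError(f"priority out of range: {priority}")
    if priority <= 1:
        return "gratis"
    if priority <= 8:
        return "middle"
    return "production"

print([priority_group(p) for p in (0, 5, 9)])
```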
The aggregate usage shows that the production priorities represent a different kind of workload than the others. As shown in Figure 1, production priorities account for more resource usage than all the other priorities and have the clearest daily patterns in usage (with a peak-to-mean ratio of 1.3). As can be seen from Figure 2, the production priorities also include more long-duration jobs, accounting for a majority of all jobs which run longer than a day even though only 7% of all jobs run at production priority. Usage at the lowest priority shows little such pattern, and this remains true even if short-running jobs are excluded.
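A peak-to-mean ratio like the paper's 1.3 is straightforward to compute from an hourly usage series; the series below is synthetic, for illustration only.

```python
# How a peak-to-mean ratio (the paper reports 1.3 for production usage)
# is computed from an hourly usage series; this series is synthetic.

hourly_cpu = [0.20, 0.21, 0.22, 0.24, 0.26, 0.26,
              0.25, 0.24, 0.23, 0.22, 0.21, 0.20]  # made-up diurnal samples

peak = max(hourly_cpu)
mean = sum(hourly_cpu) / len(hourly_cpu)
print(f"peak-to-mean ratio: {peak / mean:.2f}")
```

A ratio near 1 means flat demand (easy to provision for); the higher it climbs, the more capacity sits idle off-peak.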
These are clearly not perfect divisions of job purpose — each priority set appears to contain jobs that behave like user-facing services would and large numbers of short-lived batch-like jobs (based on their durations and utilization patterns). The trace contains no obvious job or task attribute that distinguishes between
[Reiss et al., 2012]
CPU Usage (Production vs. Lower Priority)
(Figure 1 axes: portion of cluster CPU by hour of day, 0–20 hours.)
Figure 1: Normal production (top) and lower (bottom) priority CPU usage by hour of day. The dark line is the median and the grey band represents the quartiles.
[Reiss et al., 2012]
Job Duration (Log-scale inverted CDF)
(Figure 2 axes: job duration in hours, 10^-2 to 10^3, vs. number of jobs running longer, 10^3 to 10^6; curves for all, production, and non-production jobs.)
Figure 2: Log-log scale inverted CDF of job durations. Only the duration for which the job runs during the trace time period is known; thus, for example, we do not observe durations longer than around 700 hours. The thin, black line shows all jobs; the thick line shows production-priority jobs; and the dashed line shows non-production priority jobs.
types of jobs besides their actual resource usage and duration: even 'scheduling class', which the trace providers say represents how latency-sensitive a job is, does not separate short-duration jobs from long-duration jobs. Nevertheless, the qualitative difference in the aggregate workloads at the higher and lower priorities shows that the trace is both unlike batch workload traces (which lack the combination of diurnal patterns and very long-running jobs) and unlike interactive services (which lack large sets of short jobs with little pattern in demand).
3.3 Job durations
Job durations range from tens of seconds to essentially the entire duration of the trace. Over 2000 jobs (from hundreds of distinct users) run for the entire trace period, while a majority of jobs last for only minutes. We infer durations from how long tasks are active during the one-month time window of the trace; jobs which are cut off by the beginning or end of the trace are a small portion (< 1%) of jobs and consist mostly of jobs which are active for at least several days, and so are not responsible for us observing many shorter job durations. These come from a large portion of the users, so it is not likely that the workload is skewed by one particular individual or application. Consistent with our intuition about priorities correlating with job types, production priorities have a much higher proportion of long-running jobs and the 'other' priorities have a much lower proportion. But slicing the jobs by priority or 'scheduling class' (which the trace providers say should reflect how latency-sensitive a job is) reveals a similar heavy-tailed distribution shape with a large number of short jobs.
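The heavy-tailed duration distribution described here is exactly what an inverted CDF like Figure 2 plots: for each duration, how many jobs run longer. A minimal pure-Python sketch (the durations below are synthetic stand-ins, not trace values):

```python
# Inverted CDF of job durations: for each observed duration,
# count how many jobs run strictly longer (ties only affect
# plotting granularity).

def inverted_cdf(durations):
    """Return (sorted durations, number of jobs appearing after each)."""
    xs = sorted(durations)
    n = len(xs)
    # After sorting, n - (i + 1) jobs come after position i.
    return xs, [n - (i + 1) for i in range(n)]

# Synthetic durations in hours: many minute-scale jobs, one trace-long job.
durations = [0.05, 0.05, 0.1, 2.0, 24.0, 700.0]
xs, longer = inverted_cdf(durations)
```

Plotting `xs` against `longer` on log-log axes reproduces the shape of Figure 2.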
3.4 Task shapes
Each task has a resource request, which should indicate the amount of CPU and memory space the task will require. (The requests are intended to represent the submitter's predicted "maximum" usage for the task.) Both the amount of the resources requested and the amount actually used by tasks vary by several orders of magnitude; see Figures 3 and 4, respectively. These are not just outliers. Over 2000 jobs request less than 0.0001 normalized units of memory per task, and over 8000 jobs request more than 0.1 units of memory per task. Similarly, over 70000 jobs request less than
Figure 3: CDF of instantaneous task requested resources. (1 unit = max machine size.) These are the raw resource requests in the trace; they do not account for task duration.
Figure 4: Log-log scale inverted CDF of instantaneous median job usage, accounting for both varying per-task usage and varying job task counts.
0.0001 units of CPU per task, and over 8000 request more than 0.1 units of CPU. Both tiny and large resource-requesting jobs include hundreds of distinct users, so it is not likely that the particularly large or small requests are caused by the quirky demands of a single individual or service.
We believe that this variety in task "shapes" has not been seen in prior workloads, if only because most schedulers simply do not support this range of sizes. The smallest resource requests are likely so small that it would be difficult for any VM-based scheduler to run a VM using that little memory. (0.0001 units would be around 50MB if the largest machines in the cluster had 512GB of memory.) Also, any slot-based scheduler, which includes all HPC and Grid installations we are aware of, would be unlikely to have thousands of slots per commodity machine.
The ratio between CPU and memory requests also spans a large range. The memory and CPU request sizes are correlated, but weakly (linear regression R² ≈ 0.14). A large number of jobs request 0 units of CPU: presumably they require so little CPU that they can depend on running in the 'left-over' CPU of a machine; it makes little sense to talk about the CPU:memory ratio without adjusting these. Rounding these requests to the next largest request size, the
[Reiss et al., 2012]
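The weak CPU-to-memory correlation quoted above (R² ≈ 0.14) comes from a simple linear regression; for a single predictor, R² is just the squared Pearson correlation. A pure-Python sketch with made-up request sizes (the trace's actual rescaled values are not reproduced here):

```python
# R^2 of a least-squares fit of CPU requests against memory requests.

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # For simple linear regression, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)

# Illustrative normalized request sizes (not values from the trace).
mem = [0.001, 0.01, 0.02, 0.05, 0.1, 0.2]
cpu = [0.002, 0.005, 0.03, 0.01, 0.12, 0.06]
r2 = r_squared(mem, cpu)
```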
Resource Requests
30D. Koop, CIS 602-01, Fall 2017
[Figure 3 plot: x-axis instantaneous resource request (normalized CPU/memory), 10^-3 to 10^0; y-axis portion of tasks requesting less, 0.0 to 1.0; series: CPU, Memory]
[Reiss et al., 2012]
Task Submission (New vs. Rescheduled)
31D. Koop, CIS 602-01, Fall 2017
[Figure 5 plot: x-axis time (days), 0 to 25; y-axis events per day, 0 to 3M; series: new tasks, rescheduling]
Figure 5: Moving average (over a day-long window) of task submission (a task becomes runnable) rates.
Figure 6: Linear-log plot of inverted CDF of interarrival periods between jobs.
count) crash loops even at the production priorities.
Small jobs. Even though the scheduler runs large, long-lived parallel jobs, most of the jobs in the trace only request a small amount of a single machine's resources and only run for several minutes. 75% of jobs consist of only one task, half of the jobs run for less than 3 minutes, and most jobs request less than 5% of the average machine's resources. These small job submissions are frequent, ensuring that the cluster scheduler has new work to schedule nearly every minute of every day.
Job submissions are clustered together (Figure 6 shows job interarrival times), with around 40% of submissions recorded less than 10 milliseconds after the previous submission even though the median interarrival period is 900 ms. The tail of the distribution of interarrival times is power-law-like, though the maximum job interarrival period is only 11 minutes.
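The interarrival statistics above (40% of gaps under 10 ms, median around 900 ms) can be computed directly from submission timestamps. A sketch with synthetic timestamps, not trace data:

```python
# Interarrival periods between consecutive job submissions.

def interarrival_stats(submit_times_ms, threshold_ms=10):
    ts = sorted(submit_times_ms)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    frac_small = sum(1 for g in gaps if g < threshold_ms) / len(gaps)
    median = sorted(gaps)[len(gaps) // 2]
    return gaps, frac_small, median

# Synthetic submission times in ms: a burst of near-simultaneous
# submissions followed by more widely spaced ones.
times = [0, 2, 5, 905, 1810, 3000]
gaps, frac_small, median = interarrival_stats(times)
```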
The prevalence of many very small interarrival periods suggests that some sets of jobs are part of the same logical program and are intended to run together. For example, the trace providers indicate that MapReduce programs run with separate 'master' and 'worker' jobs, which will presumably each have a different shape.
Figure 7: Moving average (over a day-long window) of task eviction rates, broken down by priority.
Another likely cause is embarrassingly parallel programs being split into many distinct single-task jobs. (Users might specify many small jobs rather than one job with many tasks to avoid implying any co-scheduling requirement between the parallel tasks.) A combination of the two of these might explain the very large number of single-task jobs.
Evictions. Evictions are also a common cause of task rescheduling. There are 4.5M evictions recorded in the trace, more than the number of recorded software failures after excluding the largest crash-looping jobs. As would be expected, eviction rates are related to task priorities. The rate of evictions for production-priority tasks is comparable to the rate of machine churn: between one per one hundred task-days and one per fifteen task-days, depending on how many unknown task terminations are due to evictions. Most of these evictions are near in time to a machine configuration record for the machine the task was evicted from, so we suspect most of these evictions are due to machine availability changes.
The rate of evictions at lower priorities varies by orders of magnitude, with some weekly pattern in the eviction rate. Gratis-priority tasks average at least 4 evictions per task-day, though almost none of these evictions occur on what appear to be weekends. Given this eviction rate, an average 100-task job running at a gratis priority would expect about one task to be lost every 15 minutes. These programs must tolerate a very high "failure" rate by the standards of a typical cloud computing provider or Hadoop cluster.
Almost all of these evictions occur within half a second of another task of the same or higher priority starting on the same machine. This indicates that most of these evictions are probably intended to free resources for those tasks. Since the evictions occur so soon after the higher-priority task is scheduled, it is unlikely that many of these evictions are driven by resource usage monitoring. If the scheduler were measuring resource usage to determine when lower-priority tasks should be evicted, then one would expect many higher-priority tasks to have a 'warm-up' period of at least some seconds. During this period, the resources used by the lower-priority task would not yet be required, so the task would not be immediately evicted.
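The matching described here (did a same-or-higher-priority task start on the same machine within half a second of each eviction?) is a simple window join over two event streams. A sketch with hypothetical event tuples:

```python
# For each eviction, test whether a task of equal or higher priority
# started on the same machine within `window` seconds.

def eviction_explained(evictions, starts, window=0.5):
    """evictions, starts: lists of (time_s, machine, priority)."""
    flags = []
    for t, machine, prio in evictions:
        flags.append(any(machine == sm and sp >= prio and abs(t - st) <= window
                         for st, sm, sp in starts))
    return flags

# Hypothetical events: the first eviction has a high-priority start
# 0.3 s away on the same machine; the second start is 1.0 s away.
evictions = [(10.0, "m1", 0), (50.0, "m2", 2)]
starts = [(10.3, "m1", 9), (49.0, "m2", 9)]
flags = eviction_explained(evictions, starts)
```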
Given that the scheduler is evicting before the resources actually
[Reiss et al., 2012]
Requested vs. Allocated Resources
32D. Koop, CIS 602-01, Fall 2017
[Figure 8 plots: x-axes time (days), 0 to 25; y-axes portion of CPU (top) and portion of memory (bottom), 0 to 1.0; left panels show usage, right panels show allocations]
Figure 8: Moving hourly average of CPU (top) and memory (bottom) utilization (left) and resource requests (right). Stacked plot by priority range, highest priorities (production) on the bottom (in red/lightest color), followed by the middle priorities (green), and gratis (blue/darkest color). The dashed line near the top of each plot shows the total capacity of the cluster.
come into conflict, some evicted tasks could probably run substantially longer, potentially until completion, without any resource conflicts. To estimate how often this might be occurring, we examined maximum machine usage after eviction events and compared these to the requested resources of the evicted tasks. After around 30% of evictions, resources requested by the evicted tasks appear to remain free for an hour after the eviction, suggesting that these evictions were either unnecessary or were to make way for brief usage spikes we cannot detect in this monitoring data.
5. RESOURCE USAGE PREDICTABILITY
The trace includes two types of information about the resource usage of jobs running on the cluster: the resource requests that accompany each task and the actual resource usage of running tasks. If the scheduler could predict actual task resource usage more accurately than that suggested by the requests, tasks could be packed more tightly without degrading performance. Thus, we are interested in the actual usage by jobs on the cluster. We find that, even though there is a lot of task churn, overall resource usage is stable. This stability provides better predictions of resource usage than the resource requests do.
5.1 Usage overview
Figure 8 shows the utilization on the cluster over the 29-day trace period. We evaluated utilization both in terms of measured resource consumption (left side of figure) and 'allocations' (requested resources of running tasks; right side of figure). Based on allocations, the cluster is very heavily booked. Total resource allocation at almost any time accounts for more than 80% of the cluster's memory capacity and more than 100% of the cluster's CPU capacity. Overall usage is much lower: averaging over one-hour windows, memory usage does not exceed about 50% of the capacity of the cluster and CPU usage does not exceed about 60%.
The trace providers include usage information for tasks in five-minute segments. At each five-minute boundary, when data is not missing, there is at least (and usually exactly) one usage record for each task which is running during that time period. Each record is marked with a start and end time. This usage record includes a
Figure 9: CDF of changes in average task utilization between two consecutive hours, weighted by task duration. Tasks which do not run in consecutive hours are excluded.
Figure 10: CDF of changes in average machine utilization between two consecutive five-minute sampling periods. Solid lines exclude tasks which start or stop during one of the five-minute sampling periods.
number of types of utilization measurements gathered from Linux containers. Since they are obtained from the Linux kernel, memory usage measurements include some of the memory usage the kernel makes on behalf of the task (such as page cache); tasks are expected to request enough memory to include such kernel-managed memory they require. We will usually use utilization measurements that represent the average CPU and memory utilization over the measurement period.
To compute the actual utilization, we divided the trace into five-minute sampling periods; within each period, for each task usage record available, we took the sum of the average CPU and memory usage weighted by the length of the measurement. We did not attempt to compensate for missing usage records (which the trace producers estimate account for no more than 1% of the records). The trace providers state that missing records may result from "the monitoring system or cluster [getting] overloaded" and from filtering out records "mislabeled due to a bug in the monitoring system" [18].
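The aggregation step described above can be sketched as follows: within each five-minute period, each task's average usage is weighted by how long its record overlaps the period. The tuple layout below is an assumption for illustration, not the trace schema:

```python
# Sum of per-task average usage, weighted by measured length,
# within one five-minute (300 s) sampling period.

def period_utilization(records, period_start, period_end):
    """records: list of (record_start, record_end, avg_usage)."""
    span = period_end - period_start
    total = 0.0
    for start, end, usage in records:
        # Length of the record that falls inside the period.
        overlap = max(0.0, min(end, period_end) - max(start, period_start))
        total += usage * (overlap / span)
    return total

# One task measured for the full period, one for only half of it.
records = [(0, 300, 0.4), (150, 300, 0.2)]
util = period_utilization(records, 0, 300)
```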
5.2 Usage stability
When tasks run for several hours, their resource usage is generally stable, as can be seen in Figure 9. Task memory usage changes very little once most tasks are running. Memory usage data is based on physical memory usage, so this stability is not simply a consequence of measuring the available address space rather than actual memory pressure.
[Reiss et al., 2012]
Jobs, Tasks, and Resource Requests
33D. Koop, CIS 602-01, Fall 2017
Figure 11: CDF of utilization by task duration. Note that tasks running for less than two hours account for less than 10% of utilization by any measure.
Because there are many small tasks, a large relative change in CPU or memory usage of an individual task may not translate to large changes in the overall ability to fit new tasks on a machine. A simple strategy for determining if tasks fit on a machine is to examine what resources are currently free on each machine and predict that those resources will remain free. We examined how well this would perform by examining how much machine usage changes. On timescales of minutes, this usually predicts machine utilization well, as can be seen in Figure 10. Since most tasks only run for minutes, this amount of prediction is all that is likely required to place most tasks effectively. Longer-running tasks may require more planning, but their resource usage tends to mimic other longer-running tasks, so the scheduler can plan well even for these long-running tasks by monitoring running task usage.
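The "free resources stay free" strategy can be evaluated exactly as described: predict each machine's next-period utilization as its current utilization and look at the error. A sketch with synthetic samples:

```python
# Error of predicting that a machine's current utilization persists
# into the next sampling period.

def persistence_errors(series):
    """series: utilization samples for one machine at consecutive periods."""
    return [abs(curr - prev) for prev, curr in zip(series, series[1:])]

# Synthetic per-period utilization: mostly stable, with one jump.
machine_util = [0.50, 0.52, 0.51, 0.70, 0.69]
errors = persistence_errors(machine_util)
```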
5.3 Short jobs
One apparent obstacle to forecasting resource availability from previous resource usage is the frequency with which tasks start and stop. Fortunately, though there are a large number of tasks starting and stopping, these short tasks do not contribute significantly to usage. This is why, as seen in Figure 10, ignoring the many tasks which start or stop within five minutes does not have a very large effect. Figure 11 indicates that jobs shorter than two hours account for less than 10% of the overall utilization (even though they represent more than 95% of the jobs). Hence, the scheduler may safely ignore short-running jobs when forecasting cluster utilization.
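The Figure 11 claim (jobs under two hours contribute under 10% of utilization) weights each job by its usage-hours rather than by job count. A sketch with a synthetic job mix:

```python
# Share of total usage-hours contributed by jobs shorter than a cutoff.

def short_job_share(jobs, cutoff_hours=2.0):
    """jobs: list of (duration_hours, avg_usage)."""
    total = sum(d * u for d, u in jobs)
    short = sum(d * u for d, u in jobs if d < cutoff_hours)
    return short / total

# Synthetic mix: 95 minute-scale jobs plus 5 very long jobs that
# dominate usage-hours despite being a small share of the job count.
jobs = [(0.05, 0.01)] * 95 + [(700.0, 0.05)] * 5
share = short_job_share(jobs)
```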
Even though most jobs are very short, it is not rare for users to run long jobs. 615 of the 925 users of the cluster submit at least one job which runs for more than a day, and 310 do so outside the gratis priorities.
5.4 Resource requests
Even though we've seen that resource requests in this trace are not accurate in aggregate, perhaps this behavior only arises because users lack tools to determine accurate requests or because users lack incentives to make accurate requests. Thus, we are interested
[Figure 12 plots: histograms of number of jobs (log-scale y-axes, 10^1 to 10^5) by task count (log-scale x-axis, 10^0 to 10^4), memory request per task (0.00 to 0.25), and CPU request per task (0.00 to 0.25)]
Figure 12: Histograms of job counts by task count (top), memory request size (middle), and CPU request size (bottom). Note the log scale on each y-axis and the log scale on the top plot's x-axis. We speculate that many of the periodic peaks in task counts and resource request sizes represent humans choosing round numbers. Memory and CPU units are the same as in Table 1. Due to the choice of x-axis limits, not all jobs appear on these plots.
in how closely requests could be made to reflect the monitored usage.
Non-automation. Resource requests appear to be specified manually, which may explain why they do not correspond to actual usage. One sign of manual request specification is the uneven distribution of resource request sizes, shown in Figure 12. When users specify parameters, they tend to choose round numbers like 16, 100, 500, 1000, and 1024. This pattern can clearly be seen in the number of tasks selected for jobs in this trace; it is not plausible that the multiples of powers of ten are the result of a technical choice. We cannot directly identify any similar round numbers in the CPU and memory requests because the raw values have been rescaled, but the distribution shows similar periodic peaks, which might represent, e.g., multiples of 100 megabytes of memory or some fraction of a core. For memory requests, it is unlikely that these round numbers accurately reflect requirements. For CPU requests, whole numbers of CPUs would accurately reflect the CPU that a disproportionate number of tasks would use [19], but it seems unlikely that the smallest 'bands' (at around 1/80th of a machine) represent a whole number of cores on a 2011 commodity machine.
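The round-number pattern in Figure 12 can be probed by checking how many jobs use "human" values for their task counts; the specific set of round values below is an illustrative assumption, not taken from the paper:

```python
# Fraction of jobs whose task counts hit common "round" values,
# a rough signal of manually chosen parameters.

ROUND_VALUES = {1, 2, 5, 10, 16, 50, 100, 500, 1000, 1024}

def round_fraction(task_counts):
    hits = sum(1 for c in task_counts if c in ROUND_VALUES)
    return hits / len(task_counts)

# Synthetic task counts per job.
counts = [1, 1, 100, 1000, 7, 1024, 3, 500]
frac = round_fraction(counts)
```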
Request accuracy. Requests in this trace are supposed to indicate the "maximum amount . . . a task is permitted to use" [18]. Large gaps between aggregate usage and aggregate allocation, therefore, do not necessarily indicate that the requests are inaccurate. If a task ever required those CPU and memory resources for even a second of its execution, then the request would be accurate, regardless of its average consumption. Thus, resource requests could
[Reiss et al., 2012]
Discussion
34D. Koop, CIS 602-01, Fall 2017