
Re-optimizing Data-Parallel Computing
Sameer Agarwal, Srikanth Kandula, Nicolas Bruno, Ming-Chuan Wu, Ion Stoica, Jingren Zhou

Microsoft Research, Microsoft Bing, University of California, Berkeley

Abstract– Performant execution of data-parallel jobs needs good execution plans. Certain properties of the code, the data, and the interaction between them are crucial to generate these plans. Yet, these properties are difficult to estimate due to the highly distributed nature of these frameworks, the freedom that allows users to specify arbitrary code as operations on the data, and since jobs in modern clusters have evolved beyond single map and reduce phases to logical graphs of operations. Using fixed apriori estimates of these properties to choose execution plans, as modern systems do, leads to poor performance in several instances. We present RoPE, a first step towards re-optimizing data-parallel jobs. RoPE collects certain code and data properties by piggybacking on job execution. It adapts execution plans by feeding these properties to a query optimizer. We show how this improves the future invocations of the same (and similar) jobs and characterize the scenarios of benefit. Experiments on Bing's production clusters show up to 2× improvement across response time for production jobs at the 75th percentile while using 1.5× fewer resources.

1. INTRODUCTION

In most production clusters, a majority of data-parallel jobs are logical graphs of map, reduce, join and other operations [�, �, ��, ��].

An execution plan represents a blueprint for the distributed execution of the job. It encodes, among other things, the sequence in which operations are to be done, the columns to partition data on, the degree of parallelism and the implementations to use for each operation.

While much prior work focuses on executing a given plan well, such as dealing with stragglers at runtime [�], placing tasks [��, ��] and sharing the network [�, ��], little has been done in choosing appropriate execution plans.

Execution plan choice can alleviate some runtime concerns. For example, outliers are less bothersome if even the most expensive operation is given enough parallelism. But plan choice can do much more: it can avoid needless work (for example, by deferring expensive operations until after simpler or more selective operations) and it can trade off one resource type for another to speed up jobs (for example, in certain cases, some extra network traffic can avoid a read/write pass on the entire data set). However, plan choice is more challenging because the space of potential plans is large and also because the appropriateness of the plan depends on the interplay between code, data and cluster hardware.

Early data-parallel systems force developers to specify the execution plan (e.g., Hadoop). To shield developers from coping with these details, some recent proposals raise the level of abstraction. A few use hand-crafted rules to generate execution plans (e.g., HiveQL [��]) or use compiler techniques (e.g., FlumeJava [�]). Other declarative frameworks cast the execution plan choice as the traditional query optimization problem (e.g., SCOPE [�], Tenzing [�], Pig [��]).

A central theme, across all schemes, is the absence of insight into certain properties of the code (such as expected CPU and memory usage), the data (such as the frequency of key values), and the interaction between them (such as the selectivity of an operation). These properties crucially impact the choice of execution plans.

Our experience with Bing's production clusters shows that these code and data properties vary widely. Hence, using fixed apriori estimates leads to performance inefficiency. Even worse, not knowing these properties constrains plan choice to be pessimistic; techniques that provide gains in certain cases but not all cannot be used.

This paper presents RoPE¹, a first step towards re-optimizing data-parallel jobs, i.e., adapting execution plans based on estimates of code and data properties. To our knowledge, we are the first to do so. The new domain brings challenges and opportunities. Accurately estimating code and data properties is hard in a distributed context. Predicting these properties by collecting statistics on the raw data stored in the file system is not practical due to the prevalence of user-defined operations. But, knowing these properties enables a large space of improvements that is disjoint from prior work and so are the methods to achieve these improvements.

The sheer number of jobs indicates that the estimation and use of properties has to be automatic. RoPE piggybacks estimators with job execution. The scale of the data and distributed nature of computation means that no single task can examine all the data. Hence, RoPE collects statistics at many locations and uses novel ways to compose them. To keep overheads small, RoPE's collectors can only keep a small amount of state and work in a single pass over data. Collecting meaningful properties, such as the number of distinct values or heavy hitters, under these constraints precludes traditional data structures and leads to some interesting designs.

¹Reoptimizer for Parallel Executions

The flexibility allowed for users to define arbitrary code leads to a much tighter coupling between data and computation in data-parallel clusters. As long as they conform to well-defined interfaces, users can submit jobs with binary implementations of operations. In this context, predicting code properties becomes even more difficult. Traditional database techniques can project statistics on the raw data past some simple operations with alphanumeric expressions but doing so through multiple operations, more complex expressions, potentially dependent columns and user-defined operations introduces impractically large error [�]. Rather than predicting, RoPE instruments job execution to measure properties directly.

We find traditional work on adaptive query optimization to be specific to the environment of one database server [�, �, ��] and the resulting space of optimizations. For example, a target scenario minimizes the reads from disk by keeping one side of the join in the server's memory. RoPE translates these ideas to the context of distributed systems and parallel plans. In doing so, RoPE uses a few aspects of the distributed environment that make it a better fit for adaptive optimization. Unlike the case of a database server, where most queries finish quickly and the server has to decide whether to use its constrained resources to run the query or to re-optimize it, map-reduce jobs last much longer and the resources to re-optimize are only a small fraction of those used by the job. Further, if a better plan becomes available, transitioning from the current plan to the better plan is tricky in databases [��] whereas data-parallel jobs have many inherent barriers at which execution plans can be switched.

In Bing's production clusters, we observe that many key jobs are re-run periodically to process newly arriving data. Such recurring jobs contribute ��.��% of all jobs, ��.��% of all cluster hours and ��.��% of the intermediate data produced. We also observe that the code and data properties are remarkably stable across recurring jobs despite the fact that each job processes new data. These jobs are RoPE's primary use-case.

RoPE adapts the execution plans for future invocations of such jobs by feeding the observed properties into a cost-based query optimizer (QO). Our prototype is built atop SCOPE [�], the default engine for all of Bing's map-reduce clusters, but our techniques can be applied to other systems. The optimizer evaluates various alternatives and chooses the plan with the least expected job latency. Additionally, we modified the optimizer to use the actual code and data properties while estimating costs. Working with a QO enables RoPE to not only perform local changes, such as changing the degree of parallelism of operations, but also changes that require a global context, such as reordering operations, choosing appropriate operation implementations and grouping operations that have little work to do into a single physical task.

We find jobs that are not completely identical often have common parts. Further, during the execution of a job, while some of the global changes are logically impossible due to operations that have already executed, other changes remain feasible. RoPE can help in both these cases since (a) the query optimizer can work with incomplete estimates and (b) the code and data properties are linked to the sub-graph at which they were collected and can be matched to other jobs with an identical sub-graph.

Our contributions include:

● Based on experiences and measurements on Bing's production clusters, we describe scenarios where knowledge of code and data properties can improve performance of data-parallel jobs. A few of these scenarios are novel (§2).
● Design of the first re-optimizer for data-parallel clusters, which involves collecting statistics in a distributed context, matching statistics across sub-graphs and adapting execution plans by interfacing with a query optimizer (§3).
● Results from a partial prototype, deployed on production clusters, which show RoPE to be effective at reoptimizing jobs (§4, §5). Production jobs speed up by over 2× at the 75th percentile while using 1.5× fewer resources. RoPE achieves these gains by designing better execution plans that avoid wasteful work (reads, writes, network shuffles) and balance operations that run in parallel.

A user would expect her data-parallel jobs to run quickly. She would expect this even though the code is unknown, even though the data properties are hard to estimate, even though the code and data interact in unpredictable ways, and even though the code, the data and the cluster hardware and software keep evolving. RoPE is a first step towards reoptimizing data-parallel computing.

2. COST OF IGNORING CONTEXT

It is not uncommon for data-parallel computing frameworks such as Dryad and MapReduce to process petabytes of data each day. However, their inability to leverage data and computation statistics renders them unable to generate execution plans that are better suited for the jobs they run and prevents them from utilizing historical context to improve future executions. Here, we describe axiomatic scenarios of such inefficiency and quantify both their impact and frequency of occurrence.

2.1 Background

Our experience is rooted in Bing's production clusters consisting of thousands of multi-core servers participating in a distributed file system that supports both structured and unstructured data. Jobs are written in SCOPE [�], a SQL-like mashup language with support for arbitrary user-defined operators. That is, users specify their data-parallel jobs within a declarative framework (e.g., select, join, group by) but are allowed to declare their own implementations of operators as long as they fit the templates provided (e.g., extractor, processor, combiner, reducer). A compiler translates the query into an execution plan which is then executed in parallel on a Dryad-like [��] runtime engine. Plans are directed acyclic graphs where edges represent dataflow and nodes represent work that can be executed in parallel by many tasks. A task can consist of multiple operations. A job manager orchestrates the execution of the job's plan by issuing tasks when their inputs are ready, choosing where tasks run on the cluster, and reacting to outliers and failures. To facilitate better resource allocation across concurrent jobs, individual job managers work in close contact with a per-cluster global manager.

Unless otherwise specified, our results here use a dataset that contains all the events from a large production cluster in Bing. The events encode, for each entity (job, operation, task, or network transfer), the start and end times of the entity, the resources used, the completion status, and its dependencies with other entities. Our experiments are based on examining all events during the month of September ���� on a cluster consisting of tens of thousands of servers.

Figure 1: Data that remains in flight when a job has executed for 25% (or 75%) of its running time.

2.2 Little Data

We notice that while most map-reduce programs start off by reading a large amount of data, each successive operation (filters, reduces, etc.) produces considerably less output than its input. Hence, more often than not, just after a small number of these consecutive operations there is very little data left to process. We call this the little data case. The little-data observation can be used to optimize the tail of most jobs. In some cases, the degree of parallelism on many operations in the tail can be reduced. This saves scheduling overhead on the unnecessary tasks. In other cases, multiple operations in the tail can be coalesced into a single physical operation, i.e., one group of tasks executes these operations in parallel. This avoids needless checkpoints to disk. In yet other cases, broadcast joins can be used instead of pair-wise joins (see §2.6), thereby saving on network shuffle and disk accesses. The challenge however is that the reduction factors are unknown apriori and vary by several orders of magnitude.

Figure 2: Motivating examples for re-optimizing data-parallel computing. (a) Selectivity, (b) Balance, (c) (Re-)Partitioning (a SELECT A,B,C, SUM(D) GROUP BY A,B,C feeding a JOIN ON A,B).

Figure 3: Variation in selectivity across tasks. (a) CDF, (b) readout.

Fig. 1 plots the fraction of input data that remains in flight after jobs have been running for 25% (and 75%) of their runtime. We compute the data in flight at any time by taking a cut of the job's execution graph at that time and adding up the data exchanged between tasks that are on either side of this cut. For convenience, we place tasks running at that time to the left of the cut. We see that while some jobs have more data in flight than their input (above the y = 1 line), most of the jobs have much less. In fact, for over ��� of jobs, the data in flight reduces to less than ���� of their input within a quarter of the job's running time, and over ��� of jobs have less than a tenth of data in flight after three quarters of their running time. This means that little data, and the above optimizations, can be brought to bear.
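To make the measurement concrete, the following sketch (Python; the edge list and field names are illustrative assumptions, not the production event schema) computes the data in flight for a given cut of the execution graph and normalizes it by the job's input.

# Illustrative sketch: data in flight at time t, given the job's task-level graph.
# `edges` lists (producer_task, consumer_task, bytes_exchanged); `started_by_t` is
# the set of tasks placed to the left of the cut (already running or finished).

def data_in_flight(edges, started_by_t):
    # Sum the bytes on edges that cross the cut.
    return sum(nbytes for (src, dst, nbytes) in edges
               if src in started_by_t and dst not in started_by_t)

def fraction_in_flight(edges, started_by_t, input_bytes):
    # Values above 1 mean more data is in flight than the job read as input.
    return data_in_flight(edges, started_by_t) / float(input_bytes)

# Example: a three-operation chain where the final reduce has not started yet.
edges = [("extract", "filter", 100e9), ("filter", "reduce", 2e9)]
print(fraction_in_flight(edges, {"extract", "filter"}, input_bytes=100e9))  # 0.02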

2.3 Varying Selectivity, Reordering

Consider a pair of commutative operations. Ordering them so that the more selective operation (one with a lower output to input ratio) runs first will avoid work, thereby saving compute hours, disk accesses and network shuffle. See Fig. 2(a), where the width of the block arrows represents the data flow between a pair of commutative operators (indicated by the circles). Evidently, the plan on the right avoids processing unnecessary data and potentially saves significant cluster cycles by appropriately ordering the operators based on their selectivity. These pairs happen often, due to operations that are independent of each other (e.g., operations on different columns) or are commutative. Identifying these pairs can be hard in general but SCOPE's declarative syntax allows the use of standard database techniques to discover such pairs. Finding the selectivities of operations remains a challenge.

Figure 4: Variation in operation costs: time to process and memory usage, per unit data processed. (a) Runtime/Data Processed, (b) Memory Usage/Data Processed.

Standard database techniques to predict operator selectivity are hard to translate to map-reduce-like frameworks due to the complexity of expressions and long sequences of operations. The selectivity of alphanumeric expressions (e.g., select on X=30) can be predicted by using clever histograms on the raw data (e.g., equi-depth) but creating these histograms requires many passes over the data. Predicting the selectivity for user-defined operations (e.g., select when columnvalue.BeginsWith("http://bing")) is an open problem [�]. We see such code in a majority of jobs. Moreover, the prediction errors grow exponentially with the length of the sequence of operations [�]. Computing a more detailed synopsis on a random sample is often of only marginal benefit [��]. Finally, correlations between sets of columns, as is common, increase prediction error (e.g., select on X=30 and Y=10 can produce just as much data as the select on X=30, or much less). RoPE estimates selectivity by direct instrumentation.

Fig. 3(a) plots a CDF of the selectivity (ratio of output to input) of operations in our dataset. Note the y axis is in log scale. About �� of operations produce more data than they consume (above y = 1). These are typically (outer) joins. About ��� have output roughly equalling input. The remaining ��� of operations produce less output than their input but the selectivity varies widely: ��� produce fewer than ����'th of their input, and the coefficient of variation (stdev/mean) is �.� with a range from � to ���. This means that if these selectivities were available, there is substantial room to reorder operators.
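As an illustrative sketch (Python; operator names and selectivities are hypothetical, and commutativity is assumed to have been established separately), the reordering rule amounts to sorting commutative operators by their measured output-to-input ratio and comparing the rows each ordering feeds downstream.

# Sketch: order a chain of commutative operators by measured selectivity
# (output rows / input rows). Per-row costs are assumed equal for simplicity.

def plan_cost(order, selectivity, input_rows):
    rows, cost = float(input_rows), 0.0
    for op in order:
        cost += rows              # work is proportional to the rows each operator consumes
        rows *= selectivity[op]   # rows handed to the next operator
    return cost

def reorder_by_selectivity(selectivity):
    # Most selective operator (smallest output/input ratio) first.
    return sorted(selectivity, key=selectivity.get)

selectivity = {"parse_udo": 0.9, "filter_lang": 0.05}      # hypothetical measurements
best = reorder_by_selectivity(selectivity)
print(best)                                                # ['filter_lang', 'parse_udo']
print(plan_cost(best, selectivity, 1e9))                   # ~1.05e9 rows touched
print(plan_cost(list(reversed(best)), selectivity, 1e9))   # ~1.9e9 rows touched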

2.4 Varying Costs, Balance

Suppose we figured out selectivities and picked the right order. Uniquely for data-parallel computing, we need to choose the number of parallel instances for each operation. Choosing too few instances will cause that operation to become a bottleneck as per Amdahl's law. On the other hand, choosing too many leads to needless per-task scheduling, queuing and other startup overhead. Balance, i.e., ensuring that each operation has enough parallelism and takes roughly the same amount of time, can improve performance significantly [��]. See Fig. 2(b), where block arrows again represent the dataflow and the thin arrows now represent the tasks in each of the two operations. Here, using one less task for the upstream operation and one more for the downstream operation reduces job latency by over ���.

Achieving balance in the context of general data-parallel computing is hard because the costs (runtime, memory, etc.) of the operators are unknown apriori. These costs depend on the amounts of data processed, the types of computation performed and also on the type of data. For example, a complex sentence can take longer to translate than a simpler one of the same length. Even worse, late binding, i.e., deferring the choice of the amount of parallelism to the runtime, is hard because local changes have global impact; for example, the number of partitions output by the map phase restricts the maximum number of reduce tasks at the next stage. RoPE estimates these costs to generate balanced execution plans.
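The balancing arithmetic can be sketched as follows (Python; the work estimates are made-up numbers): given each operation's total work, estimated from measured per-byte costs and data sizes, allocate a task budget so that the expected per-task duration is roughly equal across operations.

# Sketch: split a task budget across operations so per-task work is balanced.
# work[i] is the estimated total seconds for operation i, i.e., bytes to process
# times the measured cost per byte; the numbers below are made up.

def balanced_parallelism(work, total_tasks):
    total = sum(work)
    # Allocate tasks proportionally to work; every operation keeps at least one task.
    alloc = [max(1, round(total_tasks * w / total)) for w in work]
    per_task = [w / n for w, n in zip(work, alloc)]
    return alloc, per_task

work = [9000.0, 900.0, 100.0]            # extract, filter, final aggregate
print(balanced_parallelism(work, 100))   # ([90, 9, 1], [100.0, 100.0, 100.0])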

Fig. 4(a) plots a CDF of the runtime per unit byte read or written by operations in the dataset. The analogous plot for memory used by tasks is in Fig. 4(b). Note again that the y axes are in log scale. While as variable as their selectivity (the middle ��th percent of operators have costs spread over two orders of magnitude), we find that operator costs skew more towards higher values. Per unit data processed, ��� of operations take over ���× more time and memory than the average over the remaining operations. It is crucial to identify these heavy operations, to offset their costs by increasing the parallelism and, for the case of memory, to place tasks so that they do not compete with other memory-hungry tasks.

A consequence of unpredictable data selectivity and operator costs is the lack of balance. We find that our compiler both underestimates and overestimates an operation's work. Fig. 5 plots a CDF of the runtime of the median task in each stage. The y axis is in log scale. The median task in a stage is unlikely to be impacted by failures or be an outlier. So, if the compiler apportions parallelism well, the median task in each stage should take about the same time. This duration could be chosen to trade off fault tolerance vs. the cost of checkpointing to disk. However, we see that while roughly ��� of all tasks finish within �� seconds, �� take over ���s with the last �� taking over ����s. We found the tail dominated by tasks with user-defined operators. Setup and scheduling overheads outweigh the useful work in short-lived tasks whereas the long-lived tasks are bottlenecks in the job.

Figure 5: Imbalance.

Figure 6: Skewness.

Figure 7: Replacing pair-wise joins with broadcast joins.

2.5 Partition Skew and Repartitioning

A key to efficient data-parallel computing is to avoid skew in partitions and to re-partition only when needed. Consider the example in Fig. 2(c). The naive implementation would partition the data twice: once on A, B, C (at P1), followed by a network shuffle and a reduce to compute the sum, and then again on A, B (at P2) followed by another shuffle for the join. It is tempting to just partition the data once, say on A, B, to avoid the network shuffle and the pass over data. Note that partitioning on fewer keys does not violate the correctness of the reduce that computes SUM(D). Each reduce task will now compute many rows, one per distinct value of C, rather than just the one row they would have produced were the map to partition on all three columns. However, if there is not enough entropy on A and B, i.e., only a few distinct values have many records, then partitioning on the sub-group can make things worse. A few reduce tasks might receive a lot of data to process while other reduce tasks have none, and the overall parallel execution can slow down.

Fig. 6 estimates how skewed the partitions can be in our cluster. Note that our compiler is conservative and does not partition on sub-groups to avoid re-partitioning. Yet significant skew happens. We define skew as the ratio of the maximum data processed by a reduce task to the average over other tasks in that reduce phase. We see that in ��� of the reduce stages, the largest partition is twice as large as the average and in about �� of the stages the largest partition is over ten times larger than the average. Such skew causes unequal division of work and bottlenecks but can be avoided if the data properties are known.
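A small sketch of the two checks (Python; the thresholds are illustrative, not RoPE's constants): skew is the largest partition relative to the average of the others, and partitioning on the sub-group {A, B} is only safe when the heavy-hitter statistic shows that no single (A, B) value accounts for too large a share of the rows.

# Sketch: the skew metric for a reduce stage and a heavy-hitter based check before
# partitioning on a sub-group of keys. Thresholds are illustrative.

def skew(partition_bytes):
    # Largest partition relative to the average over the other partitions.
    biggest = max(partition_bytes)
    others = list(partition_bytes)
    others.remove(biggest)
    return biggest / (sum(others) / len(others)) if others else 1.0

def safe_to_partition_on_subgroup(heavy_hitter_fraction, num_partitions, slack=2.0):
    # Safe if the most frequent key would not make one partition more than
    # `slack` times the average partition.
    return heavy_hitter_fraction <= slack / num_partitions

print(skew([100, 10, 10, 10]))                   # 10.0 -> heavily skewed stage
print(safe_to_partition_on_subgroup(0.3, 250))   # False: one key holds 30% of the rows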

2.6 From operations to implementations

Often, the same operation can be implemented in several ways. While choosing the appropriate implementation can result in significant improvements, doing so requires appropriate context that is not available in today's systems. For example, consider Join. The default implementation, PairJoin, involves a map-reduce operation on each side that partitions data on the join columns. This causes three read/write passes on each of the sides and at least one shuffle each across the network. However, if one of the sides is smaller, perhaps due to the little data case (§2.2), one could avoid shuffling the larger side and complete the join in one pass on that side. The trick is to broadcast the smaller side to each of the tasks that is operating in parallel on the larger side. The problem though is that when used inappropriately, a BroadcastJoin can be even more expensive than a PairJoin.

Figure 8: Stability of data properties across recurring jobs.

Fig. 7 plots the potential benefits of replacing PairJoins with BroadcastJoins. It shows the amount of data shuffled in either case. We see that about ��� of joins in the dataset would see no benefit. This can happen if both join inputs are considerably large and/or when the parallelism on the larger side is so much that broadcasting the smaller input dataset to too many locations becomes a bottleneck. However, of the remaining joins, the median join shuffles ��� less data when using broadcast joins.
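The choice can be framed as comparing bytes shuffled (a rough Python sketch; it ignores the extra read/write passes and the scheduling cost of broadcasting to very many tasks, which is why broadcasting to a highly parallel large side can lose): a pair-wise join re-partitions both sides, while a broadcast join ships the small side to every task on the large side.

# Sketch: network bytes moved by the two join implementations. Disk passes and
# per-task startup costs are ignored; the large side is assumed to stay in place
# for the broadcast join.

def pair_join_shuffle(left_bytes, right_bytes):
    return left_bytes + right_bytes           # both sides re-partitioned on the join keys

def broadcast_join_shuffle(small_bytes, large_side_tasks):
    return small_bytes * large_side_tasks     # the small side is copied to every task

def prefer_broadcast(small_bytes, large_bytes, large_side_tasks):
    return (broadcast_join_shuffle(small_bytes, large_side_tasks)
            < pair_join_shuffle(small_bytes, large_bytes))

print(prefer_broadcast(small_bytes=50e6, large_bytes=1e12, large_side_tasks=400))   # True
print(prefer_broadcast(small_bytes=50e9, large_bytes=60e9, large_side_tasks=2000))  # False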

2.7 Recurring Jobs

We find that many jobs in the examined cluster repeat and are re-run periodically to execute on the newly arriving data. Such recurring jobs contribute to ��.��% of all jobs, ��.��% of all cluster hours and ��.��% of intermediate data produced. If the extracted statistics are stable per recurring job, i.e., the operations behave statistically similarly to the previous execution when running on newer data of the same stream, then RoPE's instrumentation would suffice to re-optimize future invocations.

Fig. 8 plots the average difference between statistics collected at the same location in the execution plan across recurring jobs. We picked all of the recurring jobs from one business group and instrumented five different runs of each job. While most of these jobs repeated daily, a few repeated more frequently. The figure has four distributions, one per data property that RoPE measures. We defer details on the specifics of the properties (see Table 1, §3.1) but note that while some properties, such as row length, are more predictable than others, the overall statistics are similar across jobs: the ratio stdev/mean is less than �.� for ��� of the locations.
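The stability test used here is simply a coefficient of variation per collection point; a minimal sketch (Python; the cut-off is a placeholder for the garbled constant above) follows.

# Sketch: coefficient of variation of a statistic across recurring runs of a job.

import statistics

def coeff_of_variation(values):
    mean = statistics.mean(values)
    return statistics.pstdev(values) / mean if mean else float("inf")

def is_stable(values, cutoff=0.5):
    # Treat the statistic at this collection point as reusable if stdev/mean <= cutoff.
    return coeff_of_variation(values) <= cutoff

# Cardinality observed at one collection point across five daily runs.
print(is_stable([9.8e8, 1.01e9, 9.9e8, 1.03e9, 1.0e9]))   # True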

2.8 Current Approaches, Alternatives

To the best of our knowledge, we are the first to re-optimize data-parallel jobs by leveraging data and computation statistics. Current frameworks use best-guess estimates of the selectivity and costs of operations. Such rules of thumb, as we saw in §�.�–§�.�, perform poorly. Hadoop and the public description of MapReduce leave the choice of execution plans, the number of tasks per machine and even low-level system parameters such as buffer sizes to the purview of the developer. HiveQL [��] uses a rule-based optimizer to translate scripts written in a SQL-like language to sequences of map-reduce operations. Starfish [��] provides guidance on system parameters for Hadoop jobs. To do so, it builds a machine learning classifier that projects from the multi-dimensional parameter space to expected performance but does not explore semantic changes such as reordering. Ke et al. [��] propose choosing the number of partitions based on operator costs. In contrast, RoPE can perform more significant changes to the execution plan, similar to FlumeJava [�] and Pig [��], but additionally does so based on actual properties of the code and data.

It is tempting to ask end-users to specify the necessary context, e.g., to tag operations with cost and selectivity estimates. We found this to be of limited use for a few reasons. First, considerable expertise and time is required to hand-tune each query. Users often miss opportunities to improve or make mistakes. Second, exposing important knobs to a wide set of users is risky. Unknowingly, or greedily, a user can hog the cluster and deny service to others; for instance, by tricking the system into giving her job a high degree of parallelism. Finally, changes in the script, the cluster characteristics, the resources available at runtime or the nature of the data being processed can require re-tuning the plan. Hence, RoPE re-optimizes execution plans by automatically inferring these statistics.

2.9 Experience Summary, Takeaways

Note that all of the problems described here happen deterministically. They are not due to heterogeneity or runtime variations [�], or due to poor placement of tasks [��, ��], or due to sharing the cluster [��, ��]. We believe that the choice of execution plan is orthogonal to these problems that arise during the execution of a plan. The underlying cause is that predicting relevant data and computation statistics deep into a job's execution is necessary to find a good plan but is difficult due to the intricate coupling between data and computation.

Figure 9: RoPE's architecture for re-optimization. (a) Current cluster(s), (b) With RoPE.

When employed together, these improvements add up to more than their sum. Promoting a more selective operator closer to the input can reduce the data flowing in so much that a subsequent join may be implemented with a broadcast join. We found in practice that such global changes to the execution plan accrue more benefits than making singleton changes. RoPE achieves both types of re-optimizations, as we will see in §5.

A few key takeaways follow.

● Being unaware of data and computation context leads to slower responses and wasted resources.
● Unlike the case of singleton database servers, data-parallel computation provides a different space for improvements and has new challenges such as coping with arbitrary user-defined operations and expressions.
● Global changes to the execution plan add more value than local ones.

3. DESIGN

RoPE enables re-optimization of data-parallel computing. To obtain data- and computation-context, RoPE interposes instrumentation code into the job's dataflow. An operation can be instrumented with collectors on its input(s), output(s) or both. We describe how RoPE chooses what information to collect, avoids redundancy in the locations from which stats are collected, and the algorithms to compose statistics from distributed locations in §3.1. These statistics are funneled to the job manager and are used to improve execution plans in a few different ways, each differing in the scope of possible changes and the complexity to achieve those changes.

Even though an execution plan is already chosen for a running job, RoPE uses the statistics collected during the run to improve some aspects of the job. Descendant stages that are at least a barrier away from the stages where datastats are being collected will not have begun execution. The implementation of these stages can be changed on the fly. Stages that are pipelined with the currently executing stage can be changed, since inter-task data flow happens through the disk. Some plan changes may be constrained by stages (or parts of stages) that have already run. Hence RoPE performs changes that only impact the un-executed parts of the plan, such as altering the degree of parallelism of non-reduce stages.

RoPE uses the collected statistics to generate better execution plans for new jobs. Here, RoPE can perform more comprehensive changes. Recall from §2.7 that many jobs in the examined cluster recur because they periodically execute on new data and that the extracted datastats are stable across runs of these jobs. In this case, upon a new run of a recurrent job, datastats collected from previous runs are used as additional inputs to the plan optimizer. We describe how statistics are stored so they can be matched with expressions from subsequent jobs in §3.2. We also note that partial overlaps are common among jobs [��] and our matching framework extends to cover this case. The methodology of how the statistics are used is described in §3.3. An illustrative case study of the changes that RoPE achieves is in §5.2.

3.1 Collecting contextual information

Choosing what to observe and the statistics to collect has to be done with care since the context we collect will determine the improvements that we can make. Broadly, we collect statistics about the resource usage (e.g., CPU, memory) and data properties (e.g., cardinality, number of distinct values). See Table 1 for a summary.

The nature of map-reduce jobs leads to a few unique challenges. First, map-reduce jobs examine a lot of data in a distributed fashion. There is no instrumentation point that observes all the data and even if created, such instrumentation would not scale with the data size. Hence, we require stat collection mechanisms to be composable, i.e., local statistics obtained by looking at parts of the data should be composable into a global statistic over all data. Second, to be applicable to a wide set of jobs, the collection method should have low overhead, i.e., overhead that is only a small fraction of the task that it piggybacks upon. In particular, the memory used should scale sub-linearly with data size and, to limit computation cost, the statistics should be collected in a single pass. Together, these constraints are quite strict, so we adapt some pre-existing streaming and approximation algorithms.

Finally, to be useful, the statistics have to be unambiguous, precise, and have high coverage. By unambiguous, we mean that the statistics should contain metadata describing the location in the query tree that these statistics were collected at. Global changes to the plan, during re-optimization for subsequent jobs for example, can alter the plan so much that previously observed sub-trees no longer occur. Precision is an accuracy metric; we try to match the accuracy requirements of the improvements that we would like to make with the properties of the algorithmic techniques that we use to collect statistics. For coverage, we would like to observe as many different points in the job execution as possible. However, to keep costs low, we ignore instrumenting operations whose impact on the data is predictable (e.g., the input size of a sort operation is the same as its output). Further, we only look at the interesting columns. That is, we collect column-specific statistics only on columns whose values will, at some later point in the job, be used in expressions (e.g., record stats for col if select col=…, join by col=…, or reduce on col follow).

Implementation: RoPE interposes stat-collectors at key points in the job's dataflow. Datastat collectors are pass-through operators which keep negligible amounts of state and add little overhead. We also extend the task wrapper to collect the resources used by the task. When a task finishes, all datastats are ferried to the job manager (see Figure 10), which then composes the stats. The stats are used right away and also stored with a matching service for use with future jobs (see Figure 9(b)).

  Type                 Description                                      Granularity
  Data Properties      Cardinality                                      Query Subgraph
                       Avg. Row Length                                  Query Subgraph
                       # of Distinct values                             Column @ Subgraph
                       Heavy hitter values, their frequency             Column @ Subgraph
  Code Properties      CPU and Memory Used per Data read and written    Task
  Leading Statistics   Hash histogram                                   Column @ Subgraph
                       Exact sample                                     Column @ Subgraph

Table 1: Statistics that RoPE collects for reoptimization.

Figure 10: A task can have many operations and hence, collectors. A job manager composes individual statistics.

3.1.1 Data Properties

At each collection point we collect the number of rows and the average row length. This statistic will inform whether data grows or shrinks, and by how much, as it flows through the query tree. Composing these statistics is easy: for example, the total number of rows after a select operation is simply the sum of the number of rows seen by the collectors that observe the output of that select across all the tasks that contain that select.


Further, for each interesting column, we compute the number of distinct values and the heavy hitters, i.e., values that repeat in a large fraction of the rows. These statistics, as we will see shortly, help avoid skew during partitioning and also help pick better implementations. Computing these statistics while keeping only a small amount of state in a single pass is challenging, let alone the need to compose across different collection points.

Our solution builds on some state-of-the-art techniques that we carefully chose because we could extend them to be composable. We will only sketch the basic algorithms and focus on how we extended them.

Lossy Counting to find Heavy Hitters.
Suppose we want to identify all values that repeat more frequently than a threshold, say a fraction s of the data size N. Doing this in one pass, naively, requires tracking running counts of all distinct values and uses up to O(N) space.

Lossy counting [��] is an approximate streaming algorithm, with parameters s, ε, that has these properties. First, it guarantees that all values with frequency over sN will be output. Further, no value with a frequency smaller than (s−ε)N will be output. Second, the worst-case space required to do so is a (sub-linear) (1/ε) log(εN). In practice, we find that the usage is often much smaller. Third, the estimated frequency undercounts the true frequency of the elements by at most εN. The key technique is rather elegant; it tracks running frequency counts but after every ⌈1/ε⌉ records, it retires values that do not pass a test on their frequency. For more details, please refer to [��].

RoPE uses a distributed form of lossy counting. Each stat collector employs lossy counting on the subset of data that its task observes, with parameters s = 2ε, ε. To compute heavy hitters over all the data, we add up the frequency estimates over all collectors and report distinct values with count greater than εN. Interestingly, composing in this manner retains the properties of lossy counting with slight modifications. A proof sketch follows.

Proof Sketch: Let Ni be the number of records observed by the i'th collector. Note that s − ε = ε and ∑ Ni = N. First, for the frequency estimation error, if a value is reported by stat collector i, we know that its frequency estimate is no worse off than εNi. If the value is not reported, its frequency estimate is zero; but by the existence constraint, we know that the element did not occur more than 2εNi times at this collector. Summing up the errors across collectors, we conclude that the global estimate is not off the true frequency by more than 2εN. Second, for false positives, we note that we only keep values whose cumulative recorded count is greater than εN; that means their true frequency is at least εN. Third, for false negatives, suppose a value has a global frequency f > 3εN; then it has to occur more than 2εNi = sNi times at some collector, and so will be reported. Even more, since we just showed that the cumulative error is no worse than 2εN, its cumulative recorded count will be no worse than f − 2εN, which is larger than εN; hence RoPE will report this value after composition. Finally, the space used at each collector is (1/ε) log(εNi), meeting our requirements.

Implementation: RoPE uses ε = .��. Micro-benchmarks show that frequency estimates are never worse off by more than εN and the space used is small multiples of log(N/ε).
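To illustrate the composition, the sketch below (Python) uses the closely related Misra–Gries frequent-items summary in place of the full lossy-counting code; like lossy counting it undercounts each value by at most εN per collector and, as above, the job manager adds the per-collector estimates and reports values whose summed count exceeds εN. The data and ε are illustrative.

# Sketch: per-collector frequent-items summaries composed at the job manager.
# The Misra-Gries summary below undercounts each value by at most eps * records_seen,
# the same guarantee used in the proof sketch above; the data and eps are illustrative.

import math
from collections import Counter

class FrequentItems:
    def __init__(self, eps):
        self.capacity = math.ceil(1.0 / eps)   # number of counters kept per collector
        self.counts = Counter()
        self.seen = 0

    def add(self, value):
        self.seen += 1
        if value in self.counts or len(self.counts) < self.capacity:
            self.counts[value] += 1
        else:                                   # table full: decrement every counter
            self.counts -= Counter(dict.fromkeys(self.counts, 1))

def compose(collectors, eps):
    # Job-manager side: add the per-collector estimates, keep values above eps * N.
    total_seen = sum(c.seen for c in collectors)
    merged = Counter()
    for c in collectors:
        merged.update(c.counts)
    return {v: n for v, n in merged.items() if n > eps * total_seen}

eps = 0.05
collectors = [FrequentItems(eps) for _ in range(3)]
for i, c in enumerate(collectors):
    for record in ["bing.com"] * 400 + ["rare-%d-%d" % (i, j) for j in range(200)]:
        c.add(record)
print(compose(collectors, eps))   # reports bing.com with a count close to 1200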

Hash Sketches to count Distinct Values.
Counting the number of distinct items in sub-linear state and in a single pass has canonically been a hard problem. Using only O(log N) space, hash sketches [�] compute this number approximately. That is, the estimate is a random variable whose mean is the same as the number of distinct values and whose standard deviation is small.

The key technique involves uniformly random hashing of the values. The first few bits of the hash value are used to choose a bit vector. From the remaining bits, the first non-zero bit is identified and the corresponding bit is set in the chosen bit vector. Such a hash sketch estimates the number of distinct values because 1/2 of all hash values will be odd and have their first non-zero bit at position 1, 1/4 will do so at position 2, and so on. Hence, the maximum bit set in a bit vector is proportional to the logarithm of the number of distinct values. Using a few bit vectors rather than one guards against discretization error. The actual estimator is a bit more complex to correct for additive bias. For more details, please refer to [�].

RoPE uses a distributed form of hash sketches. Each stat collector maintains local hash sketches and relays these bit vectors to the job manager. The job manager maintains global bit vectors, such that the i-th global bit vector is an OR of all the individual i-th bit vectors. By doing so, the global hash sketch is exactly the same as would be computed by one instrumentation point that looks at all data. Hence, RoPE retains all the properties of hash sketches.

Implementation: If the hash values are h bits long, where h = O(log N), and the first m bits choose the bit vector, then there are 2^m bit vectors and the size of the hash sketch is h · 2^m bits. RoPE uses m = � and h = ��. Hence, each task's hash sketch is ���B long. Our micro-benchmarks show that hash sketches retain precision even as the number of distinct values grows to ���.
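A compact sketch of the distributed form (Python): a textbook Flajolet–Martin (PCSA) estimator whose bit vectors compose by OR at the job manager. The sizing constants and the simple bias correction here are the textbook ones, not RoPE's production values.

# Sketch: a textbook Flajolet-Martin (PCSA) distinct-count sketch whose bit
# vectors compose exactly by OR-ing. Constants are the textbook ones, not RoPE's.

import hashlib

M = 64          # number of bit vectors
PHI = 0.77351   # Flajolet-Martin bias-correction constant

def _hash64(value):
    return int.from_bytes(hashlib.md5(str(value).encode()).digest()[:8], "big")

class HashSketch:
    def __init__(self):
        self.vectors = [0] * M

    def add(self, value):
        h = _hash64(value)
        vec, rest = h % M, h // M                             # pick a vector, keep the rest
        r = (rest & -rest).bit_length() - 1 if rest else 0    # index of lowest set bit
        self.vectors[vec] |= 1 << r

    def merge(self, other):
        for i, v in enumerate(other.vectors):
            self.vectors[i] |= v                              # OR keeps the sketch exact

    def estimate(self):
        def lowest_unset(v):
            r = 0
            while v & (1 << r):
                r += 1
            return r
        mean_r = sum(lowest_unset(v) for v in self.vectors) / M
        return M * (2 ** mean_r) / PHI

a, b, merged = HashSketch(), HashSketch(), HashSketch()
for i in range(60000):
    a.add(i)                      # one collector sees the first 60K values
for i in range(40000, 100000):
    b.add(i)                      # another sees an overlapping 60K values
merged.merge(a); merged.merge(b)
print(int(merged.estimate()))     # roughly 100000 distinct values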

3.1.2 Operation Costs

We collect the processing time and memory used by each task. A task is a multi-threaded process that reads one or more inputs from the disk (locally or over the network) and writes its output to the local disk. However, popular data-parallel frameworks can combine more than one operation into the same task (e.g., data extraction, decompression, processing). Since such operations run within the same thread and frequently exchange data via shared memory, the per-operator processing and memory costs are not directly available. But per-operator costs are necessary to reason about alternate plans, e.g., reordering operators. Program analysis of the underlying code could reason about how the operators interact in a task, but this analysis can be hard because the interactions are complex, e.g., pipelining. Also, profiling individual operators does not scale to the large number of UDOs. Instead, RoPE uses a simple approach that only estimates the costs of the more costly operations.

The approach works as follows. First, tasks containing costly operations are likely to be costly as well. We pick stages with costs in the top tenth percentile as expensive. We only use the costs of the median task in each stage to filter the impact of failures and other runtime effects. Second, not all the operators in expensive tasks are expensive. So, for each operator, we compute a confidence score as the fraction of the stages containing the operator that have been picked as expensive. An operator will have a high confidence score only if it exclusively occurs in expensive tasks. Third, we compute the support of an operator as the number of distinct expensive stages that it occurs in. Finally, we estimate the cost of operators that have high confidence and high support scores as the average cost of the stages containing that operator.
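The scoring can be written down in a few lines (Python; the confidence and support thresholds are placeholders, since the exact cut-offs are not given here): mark top-decile stages as expensive, then score each operator by the fraction (confidence) and number (support) of expensive stages it appears in.

# Sketch: attribute stage-level costs to operators via confidence and support.
# `stages` is a list of (set_of_operators, median_task_cost); the thresholds are
# placeholders rather than RoPE's exact values.

def expensive_operators(stages, conf_min=0.8, support_min=3):
    costs = sorted(c for _, c in stages)
    cutoff = costs[int(0.9 * (len(costs) - 1))]                 # top-decile stage cost
    expensive = [(ops, c) for ops, c in stages if c >= cutoff]

    estimates = {}
    for op in {op for ops, _ in stages for op in ops}:
        total = sum(1 for ops, _ in stages if op in ops)
        hits = [c for ops, c in expensive if op in ops]
        confidence, support = len(hits) / total, len(hits)
        if confidence >= conf_min and support >= support_min:
            estimates[op] = sum(hits) / len(hits)               # avg cost of its expensive stages
    return estimates

stages = ([({"extract"}, 40), ({"reduce"}, 35), ({"filter"}, 30), ({"sort"}, 45)] * 4
          + [({"extract"}, 55),
             ({"classify_udo"}, 950),
             ({"classify_udo", "reduce"}, 960),
             ({"classify_udo"}, 980)])
print(expensive_operators(stages))   # only classify_udo passes both filters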

To validate this approach, we profiled over ���K randomly chosen stages from the production cluster. It is hard to obtain ground truth at this scale. Hence, we corroborate our results on succinctness (only a few among all the operations should be expensive) and coverage (most of the expensive tasks should contain at least one expensive operation). This method identified ��.�� and ��.�� of the operators as expensive for CPU and memory costs respectively. This number was higher than expected because there are only a few generic operations but many UDOs. Most of the costly operations were UDOs. Further, ��.�� (��.��) of the tasks picked as expensive based on their memory (CPU) cost contained at least one expensive operator. Hence, their cost can be explained by these operations. Very few contained more than one expensive operation. The succinct set of operations identified as expensive and the high coverage of this set make us optimistic about this simple method. Fig. 11 plots the relative cost values attributed to the expensive operations.

3.1.3 Leading Statistics

The ability to predict the behavior of future operators is invaluable, especially for on-the-fly optimizations. Doing so precisely is hard. Rather than aspiring to precise predictions in all cases, RoPE collects simple leading statistics to help with typical pain points. We collect histograms and random samples on interesting columns (i.e., columns which are involved in where/group by clauses or joins) at the earliest collection points preceding the operator at which those columns will be used.

Figure 11: Normalized memory and user CPU costs for the operators identified as expensive.

We looked at several histogram generators, including equi-width, equi-depth and serial, but ended up with a simpler, albeit less useful, alternative. This is because none of the others satisfied our single-pass and bounded-memory criterion with reasonable accuracy. For each interesting column in an operator, we build a hash-based histogram with B buckets, that is hashed on a given column value (hash[column value] mod B → bucket) and counts the frequency of all entries in each bucket. RoPE uses B = ���.

We also use reservoir sampling to pick a constant-sized, random sample of the data flowing through the operator. For each interesting column, RoPE collects up to ��� samples but no more than ��KB.
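Both leading statistics fit a one-pass, bounded-state collector; a minimal sketch (Python; the bucket count and sample size stand in for the garbled constants above) follows.

# Sketch: one-pass, bounded-state leading statistics on an interesting column.
# The bucket count and sample size are placeholder constants.

import random

class ColumnStats:
    def __init__(self, buckets=128, sample_size=100):
        self.hist = [0] * buckets
        self.sample, self.sample_size = [], sample_size
        self.seen = 0

    def add(self, value):
        self.seen += 1
        self.hist[hash(value) % len(self.hist)] += 1        # hash-based histogram
        if len(self.sample) < self.sample_size:             # reservoir sampling
            self.sample.append(value)
        else:
            j = random.randrange(self.seen)
            if j < self.sample_size:
                self.sample[j] = value

stats = ColumnStats()
for url in ("http://bing.com/%d" % (i % 1000) for i in range(50000)):
    stats.add(url)
print(max(stats.hist), len(stats.sample))   # heaviest bucket count, 100 retained samples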

3.2 Matching context to query expressions

As metadata to enable matching, with each stat collector we associate a hash value that captures the location of the collector in the query graph. In particular, location in the query graph refers to a signature of the input(s) along with a topologically sorted chain of all the operators that preceded the stat collector on the execution plan. We colloquially refer to this as the query subgraph. RoPE uses 64-bit hashes to encode these query subgraphs.
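A minimal sketch of such a signature (Python; the operator descriptions are plain strings here, whereas the real system hashes SCOPE's internal operator representation): hash the input signature together with the topologically ordered operator chain and keep 64 bits.

# Sketch: a 64-bit signature for the query subgraph preceding a stat collector.
# Operators are plain strings here; the production system hashes SCOPE's internal
# operator representation instead.

import hashlib

def subgraph_signature(input_signature, preceding_operators):
    # Hash of the input(s) plus the topologically sorted operator chain, kept to 64 bits.
    text = input_signature + "|" + "|".join(preceding_operators)
    digest = hashlib.sha1(text.encode()).digest()
    return int.from_bytes(digest[:8], "big")

signature = subgraph_signature(
    "clicks_stream_partition_0",
    ["EXTRACT clicks", "WHERE lang == 'en'", "GROUP BY url"])
print(hex(signature))   # the key under which this run's statistics are stored and matched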

3.3 Adapting query plans

We build on top of the SCOPE Cloud Query Optimizer (CQO), which is a cost-based optimizer. The CQO translates the user-submitted script into an expression tree, generates variants for each expression or group of expressions and finally chooses the query tree with the lowest cost. However, lacking direct knowledge of query context, the CQO uses simple apriori constants to determine the costs and selectivities of various operators as well as the data properties. By providing exactly this context, RoPE helps the CQO make better decisions.

RoPE imports statistics using a stat-view matching technique similar to the analogous method in databases [�, ��, ��]. Statistics are matched during the exploration stage in optimization, i.e., before implementing physical alternatives but after equivalent rewrites have been explored. The optimizer then propagates the statistics offered at one location to equivalent semantic locations, e.g., the cardinality of rows remains the same after a sort operation and can be propagated. For expressions that have no direct evidence available, the optimizer makes do by propagating statistics from nearby locations with apriori estimates on the behavior of operators that lie along the path. Such uncertain estimates are deemed to be of lower quality and are used only when other estimates are unavailable.

Figure 12: The RoPE prototype consists of three key components: a distributed statistics collection module, a pre-processor to the existing SCOPE compiler that provides matching functionality, and a basic on-the-fly runtime optimizer.

We extended the optimizer to make use of these statistics. Cardinality, i.e., the number of rows observed at each collection point, helps estimate operator selectivity and compare reordering alternatives. Along with selectivity, the computation costs of operations are used to determine whether an operation is worth doing now or later when there is less work for it. Costs also help determine the degree of parallelism, the number of partitions, and which operations to fit within a task. Besides the choice of broadcast join, statistics also help decide when self-join or index-join are appropriate. Most of these optimizations are specific to the context of parallel executions. We believe that there is more to do with statistics than what RoPE currently does, but our prototype (§4) suffices to provide substantial gains in production (§5).

4. PROTOTYPE

The prototype collects all the statistics described in §3. Data stats are written to a distributed file system and the matching functionality (§3.2) runs as a pre-processing step of the job compiler. The statistics collection code is a few thousand lines of C#, which, during code generation for each operator, gets placed into the appropriate location in the operator's dataflow. Note that not all operators collect statistics, and even when they do, they do not collect all types of statistics. Collected statistics are passed to the C++ task wrapper via reflection, which piggybacks them along with task status reports to the job manager. Composing statistics required a few hundred lines of code in the job manager.

We allow the compiler to specify varying requirements across the columns. The constants also can change, to trade off costs for improved precision, based on previous statistics or other information that the compiler has, such as the required accuracy of an estimate.

Parts of the code are production quality to the extent that all of our results here are from experiments that run in Bing's production clusters. Rather than implementing every optimization possible with the statistics that RoPE collects, we built a subset up to production-quality code in order to deploy, run and gain experience from Bing's production clusters. We acknowledge that our prototype is just that, and there is more benefit to be achieved.

Using these statistics required extensive changes to the SCOPE query optimizer, involving several hundred lines of code spread over several tens of files.

5. EVALUATION

We built and deployed RoPE on a large production cluster that supports the Bing search engine at Microsoft.

5.1 Methodology

Cluster: This cluster consists of tens of thousands of 64-bit, multi-core, commodity servers; it processes petabytes of data each day and is shared across many business groups. The core of the network is moderately oversubscribed, and hence shuffling data across racks remains more expensive than transfers within a rack.

Workload Evaluated: We present results from evaluating RoPE on all the recurring jobs of a major business group at Bing and on randomly chosen jobs from ten other business groups. The dataset has over �� jobs. We repeat each job over ten times for each variant that we compare. The only modification we make to the jobs is to pipe the output to another location in the distributed file system so as not to pollute the production data. Though small, our dataset spans a wide range of characteristics because we pick a large set of jobs. Latency of the unmodified runs varied from minutes to several hours. Task counts ranged from small tens to hundreds of thousands. The largest job had several tens of stages. User-defined operations are prevalent in the dataset, as is common across our cluster.

Compared Variants: For each job, we compare the execution plan generated by RoPE with the execution plan generated without RoPE. The latter is a strong strawman, since it uses a priori fixed estimates of operator costs and selectivities and is the current default in our production clusters. Early results from a variant that was equivalent to Hive over Hadoop, i.e., one that translated users' jobs into literal map-reduce execution plans, were significantly worse than our cluster's baseline implementation; we omit those results here.

Metrics: Our evaluation metrics are the reduction in job completion time and in resource usage. Resources include cluster slot occupancy, computed in machine-hour units, as well as the total amount of data read, written, and moved across low-bandwidth (inter-rack) network links.
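For concreteness, the occupancy metric can be computed as in the short sketch below; the task durations are made-up numbers, used only to show the arithmetic.

    # Hedged sketch: cluster occupancy in machine-hours from per-task runtimes.
    def machine_hours(task_durations_seconds):
        """Sum of task runtimes, i.e., total slot occupancy, in hours."""
        return sum(task_durations_seconds) / 3600.0

    baseline_tasks = [120, 300, 300, 450, 810]   # seconds per task, hypothetical baseline
    improved_tasks = [90, 200, 210, 260]         # hypothetical run with fewer, shorter tasks

    print(machine_hours(baseline_tasks))                            # ~0.55 machine-hours
    print(machine_hours(improved_tasks))                            # ~0.21 machine-hours
    print(machine_hours(baseline_tasks) / machine_hours(improved_tasks))  # ~2.6x fewer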

5.2 A Case Study

Many plan changes can be performed given the statistics that RoPE collects. We report a case study that illustrates some of the changes to execution plans that achieved significant gains in practice.



!"#$%&'()%*+,'&-./

012,(134-/(

3(

5%6'()(773733

8'9:;(, ,:'*< <',&-&-./5= .<:,'&-./5%='> ?0 '##,:#'&:

@*#:(2'":%&.%(2+AA;:9,.'*B'(&

B'/%9:%;.B';>%7CD7ED 73CD

(2+AA;:%'&%;:'(&%./:%(-*:

(a) Legend

!"#$%&'($

)!*

!"#$"%&% %'()('*%

!+#$%&'($

+),$*"(&% ,-.%%/0/"+

,$

,!-$

!.#$%&'($

,!/$

,$

(b) �e default plan due to SCOPE

!"#$%&'($

)!*

!"#$"%&% %'()('*%

!+#$%&'($

+),$*"(&% ,-.%%/0/"+

,$

,!-$

,$

!.#$%&'(!/$

,$

(c) Reordering the costly UDO (��) toa location with less data

!"#$%&'(!)!*

!"#$"%&% %'()('*%

+!,

!-#$%&'(

+),$*"(&% ,-.%%/0/"+

+

!.#/%&'(

+/+

/

(d) Execution plan picked by RoPE

Figure ��: Case Study: Evolution of the execution plan as RoPE provides more statistics. See Table � for how well these plansdo upon execution.

Aggregate results from applying RoPE to a wide variety of jobs are in §5.3. Consider a job that processes four datasets: requests (R), synonyms (S), documents (D), and classified URLs (C). The goal is to compute how many requests (of a certain type) access documents (of a certain other type). Doing so involves the following operations:

1. Filter requests (R) by location. This operation has a selectivity of �/����x.

2. Join requests (R) with synonyms (S). There are many synonyms per word, so the selectivity is ��x.

3. Apply a user-defined operation (UDO) to the documents (D). Its selectivity is ��x, but it has a very large CPU cost per document.

4. Join documents (D) with the list of classified URLs (C). This has a selectivity of ���x.

5. Join the synonymized requests with the documents containing classified URLs; this has a selectivity of �/�x.

6. Finally, count the number of requests per document URL.

The dataset sizes are R: �GB, S: ��GB, D: ��GB, C: ���MB.
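The intuition for the plan changes discussed next can be seen with a back-of-the-envelope calculation. The Python sketch below uses purely illustrative numbers in place of the actual selectivities, sizes, and costs listed above, so it shows the shape of the argument rather than the measured values.

    # Hedged sketch: why deferring an expensive, non-selective UDO helps.
    # All numbers are illustrative assumptions, not the measured values above.
    documents_gb = 50.0          # size of the documents dataset (assumed)
    udo_selectivity = 0.5        # UDO keeps half the rows (assumed)
    join_selectivity = 0.01      # join with classified URLs keeps 1% (assumed)
    udo_cost_per_gb = 100.0      # relative CPU cost of the UDO (assumed, very high)
    join_cost_per_gb = 1.0       # relative CPU cost of the join (assumed, cheap)

    # Plan A (default): UDO first, then the selective join.
    cost_a = documents_gb * udo_cost_per_gb \
           + documents_gb * udo_selectivity * join_cost_per_gb

    # Plan B (re-ordered): selective join first, UDO only on the rows that survive.
    cost_b = documents_gb * join_cost_per_gb \
           + documents_gb * join_selectivity * udo_cost_per_gb

    print(f"UDO-first cost:  {cost_a:,.0f}")    # 5,025
    print(f"join-first cost: {cost_b:,.0f}")    # 100
    print(f"speedup factor:  {cost_a / cost_b:.0f}x")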

Figure ��(b) shows the plan computed by the unmodified optimizer, which uses a priori estimates of data properties and costs. Each circle represents a stage, a collection of tasks that execute one or more operations, which are listed in the adjacent label (see the legend in Figure ��(a)). Unlabeled stages are aggregates [��]. The size of a circle denotes the number of tasks allocated to that stage, on a logarithmic scale. The color of a circle, on a linear red scale, denotes the average time per task in that stage; darker implies longer. Figure ��(d) (respectively, ��(c)) shows the plan generated using the statistics (respectively, a subset of the statistics) provided by RoPE. We explain the figures as we go along. Table � shows how well these plans do when executed; the metrics are averaged over five different runs.

Recall that the compiler sets costs proportional to the data processed and sets selectivity based on the type of the operation; for instance, filters are assumed to be more selective than joins. These a priori estimates have mixed results.

They pick the right choice for the operations on requests: the more selective filter operation (��) is done before joining with synonyms (��) (see Figure ��(b), top left). However, for documents, the plan performs the user-defined filter (UDO) (��) before the more selective join (��), leading to a lot of wasted work. As we see in Table �, due to the high cost of the UDO, this plan takes over ��x the time of the next best alternative.

By providing an estimate of the UDO's selectivity, RoPE lets the compiler join the documents dataset with the classified dataset first. Figure ��(c) depicts such a plan (see the change at middle right). From Table �, we see that since the UDO is applied to less data, the median execution time improves substantially. And not many more tasks are needed: fewer documents passing through the UDO means less net work, so the cluster hours decrease as well. But, by performing the UDO later, more data is shuffled across the network, since the join with the classifieds is done on unfiltered documents.

When comparing these two plans, note that the thickness of an edge represents the data volume moving between stages, on a logarithmic scale (see the legend). The color indicates the type of data movement. Dark solid lines denote data flow from a partition stage to an aggregate stage (many-to-many), which has to move over the network. Light solid lines indicate one-to-one data flow; here, data-local placement can avoid movement across the network. Dotted lines indicate broadcast, i.e., the source stage's output is read by every task in the destination stage. Double edges indicate two source tasks per destination task, i.e., at least one of the sides has to be moved over the network.

With the UDO in a better place, the bottleneck moves to two new places. The average task duration of the stage with the UDO is very high (deep red). If the compiler knew the cost of the UDO, it could apportion more tasks to this stage to resolve this bottleneck. Similarly, even though requests (R) become small after applying the very selective filter (��), the compiler is unaware and picks a pair-wise join for ��. This causes the large dataset S to be partitioned and shuffled across the network needlessly.



Alternate Execution Plan                               Latency (s)  Cluster Occupancy (s)  Cross-Rack Shuffle (GB)  Reads+Writes to Disk (GB)  Tasks
Unmodified (Fig. ��(b))                                �x           �y                     ��.��                    ��.��                      ���
Push UDO to appropriate location (Fig. ��(c))          .���x        .���y                  ��.��                    ���.��                     ���
+ Replace pair-wise join with broadcast (not shown)    .���x        .���y                  ��.��                    ��.��                      ���
+ Balance (not shown)                                  .���x        .���y                  ��.��                    ��.��                      ���
RoPE (+ push UDO even lower, little data, Fig. ��(d))  .���x        .���y                  ��.��                    ��.��                      ���

Table �: Salient features of executing a typical job with and without RoPE. Overall, with RoPE, latency reduces by almost ���x while using �x fewer cluster hours.

Figure ��(d) shows the plan in which RoPE provides the compiler with the selectivity and costs of all the operations. Intermediate plans have been omitted for brevity. Table � shows that the median execution of this plan improves by another �.�x. A few changes are worth noting.

First, operation �� is now implemented as a broadcast join. This not only saves cross-rack shuffle and reads/writes to disk, but also avoids partitioning both of these datasets. Since pair-wise joins shuffle data across the network, such stages are more at risk from congestion-induced outliers; we observed this with the default plan.

Second, given the high cost of the UDO (��), the compiler, instead of adding more tasks to the stage with operation ��, defers the UDO until even later, i.e., until after operations �� and ��. Both of these operations are net data-reductive, and �� is particularly so, since it produces one row for each document URL that is in the eventual output. The increase in cost from performing other operations on more data was lower than the benefit of performing the UDO on less data. More optimizations are enabled recursively.

The compiler realizes that both sides of the input to the join in operation �� are small, due to the cumulative impact of the earlier operations, leading to the third improvement: implementing operation �� also as a broadcast join.

This leads to a fourth improvement, which is subtle. Notice that operation ��, a reduce operation that needs data to be partitioned by document URL in order to compute the number of times each URL occurs, now lies between operations �� and ��, thereby eliminating one complete map-reduce.

To understand why, note that the join in operation �� matches words in the synonymized requests with words in the classified documents. Typically, this join would require both inputs to be partitioned on words. However, here there is so little data that the left side (requests) can be broadcast and the right side (documents) can be partitioned in any way. Hence, the right side is partitioned on URL immediately after it is read, to facilitate operation ��, and is never re-partitioned, allowing many more operations to coalesce into the same stage. The enabler for this optimization is little data: the later parts of most data-parallel jobs can benefit from serial plans. Achieving the change, however, requires reasoning about the entire execution plan rather than one stage at a time, which is possible in RoPE due to the interface with a query optimizer.

Fifth, to offset the cost of the UDO, the compiler assigns more tasks (larger size) to the stage that now implements the UDO (along with operations �� and ��).

Doing so is again a global change, because the earlier operations have to partition the data enough ways. To benefit from not re-partitioning, the compiler implements the join in �� with the same (larger) degree of parallelism.

Finally, a potential change that was not performed is worth discussing. Since the classifieds dataset (C) is small, it is possible to implement operation �� also as a broadcast join. But the compiler chooses not to do that. The reason is that the documents have to be partitioned by URL to facilitate the reduce in operation ��. Either they are partitioned before the join in operation �� or after. If the partitioning happens before, as is the case with the chosen plan, the large amount of work in the stage doing operations ��, ��, and �� needs more parallelism, resulting in more partitions, i.e., more tasks in operation ��. But the network cost of broadcasting C grows linearly with the number of tasks in operation ��, offsetting the savings, which is one additional read/write of C. Partitioning after operation �� would mean one more map-reduce on the critical path, just before the end of the job. Any outliers here would directly impact the job, unlike outliers on the shuffle before operation ��, which is not on the critical path since more work lies on the left side of the DAG.
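A rough way to see the trade-off the compiler weighs here is sketched below. The sizes and task counts are illustrative assumptions, not the job's actual values, but they show how broadcasting a small dataset can still lose once the destination stage is highly parallel.

    # Hedged sketch: broadcast vs. repartition for a join with a small dataset C.
    c_gb = 0.8                 # small classified-URLs dataset (assumed)
    d_gb = 50.0                # documents side of the join (assumed)
    destination_tasks = 200    # parallelism chosen to offset the costly UDO (assumed)

    # Broadcast: every destination task reads all of C, so network cost grows
    # linearly with the degree of parallelism of the join stage.
    broadcast_cost = c_gb * destination_tasks

    # Repartition: both sides are shuffled once, plus one extra read of C,
    # independent of how many tasks the join stage uses.
    repartition_cost = c_gb + d_gb + c_gb

    print(f"broadcast:   {broadcast_cost:.1f} GB moved")    # 160.0 GB
    print(f"repartition: {repartition_cost:.1f} GB moved")  # 51.6 GB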

Overall, we see that significant reductions result from optimizations building on top of each other. Unfortunately, the space of optimizations is not monotonic: sometimes using more tasks is better, whereas at other times pushing a filter after some operators but before others is the best choice. By providing accurate estimates of code and data properties, RoPE is a crucial first step towards picking appropriate execution plans.

5.3 Aggregate Results

Here, we present aggregated results across the jobs in the dataset. These are runs of production jobs in a production cluster, so the noise from other jobs and other traffic on the cluster is realistic.

Figure ��(a) plots a CDF of the ratio between the job runtimes without and with RoPE, so y = 1 implies no benefit. Samples below that line are regressions, while those above indicate improvement. To compute a single metric from a disparate set of jobs, we weight each job by the total cluster hours it occupies; all the results in this section share this format.
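A cluster-hour-weighted summary of this form can be computed as in the short sketch below; the sample jobs are made up and serve only to show the weighting.

    # Hedged sketch: a cluster-hour-weighted percentile of per-job speedups.
    def weighted_percentile(samples, pct):
        """samples: list of (speedup_ratio, weight); pct in [0, 100]."""
        ordered = sorted(samples)                 # sort by speedup ratio
        total = sum(w for _, w in ordered)
        threshold = total * pct / 100.0
        running = 0.0
        for ratio, weight in ordered:
            running += weight
            if running >= threshold:
                return ratio
        return ordered[-1][0]

    # (speedup = latency without RoPE / latency with RoPE, weight = cluster hours)
    jobs = [(0.9, 2.0), (1.0, 1.0), (1.5, 5.0), (2.0, 8.0), (4.0, 3.0)]
    print(weighted_percentile(jobs, 50))   # median speedup, weighted by cluster hours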



!"!"#$!%

!%#$!&

!&#$!'

!'#$!(

!(#$!$

!" !"#%!"#&!"#'!"#(!"#$!"#)!"#*!"#+!"#, !%-./!0123456!789:;<3=

>

[email protected]!.B!-./C!=3ADE23:!/6!F<!G@C

(a) Reduction in Job Latency

!"!"#$!%

!%#$!&

!&#$!'

!'#$!(

!" !"#%!"#&!"#'!"#(!"#$!"#)!"#*!"#+!"#, !%

-./0123!4

5/30!

!67.89:2;

<

=3>?1@5A!5B!C5D0!;2@EF128!DG!H:!430

(b) Reduction in Cluster Hours Used

!"!"#$!%

!%#$!&

!&#$!'

!'#$!(

!(#$!$

!" !"#%!"#&!"#'!"#(!"#$!"#)!"#*!"#+!"#, !%

-./00!12

34!567

89!

!:;<=>?9@

A

B.23CD/E!/F!G/H0!@9DI6C9=!HJ!K?!L.0

(c) Reduction in data shu�ed across racksFigure ��: Aggregated results from production dataset (see �.�)

Figure ��(a) shows that ��� (���) of the jobs speed up by over �x (�.�x). Several significant jobs see over �x reduction in their latencies. Upon inspection, we find that all the samples under the y = 1 line are from two jobs. Both jobs have a small number of stages (and tasks); RoPE does not change their execution plans, and noise in the cluster is the likely cause of their lower performance. The vast majority of jobs see performance improvements. The reason is one or more of the optimizations described in the case study above.

Figure ��(b) shows that the latency savings due to RoPE do not come from simply using more resources. In fact, by avoiding wasteful work, RoPE speeds up jobs while reducing cluster usage. At the ��th (��th) percentile, jobs use �.�x (�.�x) fewer cluster hours with RoPE.

Figure ��(c) shows that while the volume of cross-rack shuffle is lower for some jobs, it stays the same for many and increases for only a few. This is expected: some of the optimizations enabled by RoPE lower cross-rack shuffle, while others can increase it. Our optimization goal is to improve job latency; hence, on all the other metrics, we only indirectly constrain RoPE. Yet we find that RoPE mostly achieves its gains by shuffling less data across the network. Figure ��(b) shows a similar pattern for the data read from and written to disk. Figures ��(a) and ��(c) show that RoPE's plans mostly use fewer tasks and stages, though sometimes, e.g., when offsetting the cost of UDOs, RoPE can use more tasks.

In summary, RoPE judiciously uses resources to improve job latency. Some gains accrue from avoiding wasted work, others from trading a little more of one type of resource for large savings on another, while still others accrue from balancing the parallel plans.

6. RELATED WORK

Recent work on data-parallel cluster computing frameworks has mainly focused on solving issues that arise during the execution of jobs: sharing the cluster [��, ��], tackling outliers [�], fairness vs. locality [��], and network scheduling [�]. Others incorporate new functionality, such as support for iterative and recursive control flow [��]. Orthogonally, RoPE generates better execution plans by leveraging insights into the data and its execution.

The AutoAdmin project examined adapting physical database design, e.g., choosing which indices to build and which views to materialize, based on the data and the queries.

Closer to us is work that adapts query plans based on data. Kabra and DeWitt [16] were among the earliest to propose a scheme that collects statistics, re-runs the query optimizer concurrently with the query, and migrates from the current query plan to the improved one if doing so is predicted to improve performance. They mainly address the challenges of trading off re-optimization against doing actual work, and of re-using partial executions of the old plan to avoid wasting work that is already done. These challenges are easier in the context of map-reduce, while collecting statistics is harder due to the distributed setting. Further, RoPE explores new opportunities arising from the parallel nature of plans.

Eddies [2] adapts query executions at a much finer, per-tuple granularity. To do so, Eddies (a) identifies points of symmetry in the plan at which re-ordering can happen without impacting the output, and (b) creates tuple-routing schemes that adapt to the varying selectivity and costs of operators. RoPE looks at a disjoint space of optimizations (choosing appropriate degrees of parallelism and operator implementations), which are not easily cast into Eddies' tuple-routing algorithm.

Starfish [12] examines Hadoop jobs, one map followed by one reduce, and tunes low-level configuration variables in Hadoop such as io.sort.mb. To do so, it constructs a what-if engine based on a classifier trained on experiments over a wide range of parameter choices. Results show that the prescriptions from Starfish improve on developers' rules of thumb on non-traditional servers (e.g., with less memory or fewer cores). RoPE is complementary because (a) it applies to jobs that are more complex than a map followed by a reduce, (b) it explores a larger space of plans, and (c) it uses a cost-based optimizer.

7. DISCUSSION

Recurring jobs are a first-order use case in our production system. We find that RoPE achieves meaningful improvements on such jobs. However, it is important to note that statistics from a run only cover the sub-graphs of operations used in that execution plan.



!"#$!"#%!"#&!"#'!(

!(#(!(#)!(#*!(#+!(#,

!" !"#(!"#)!"#*!"#+!"#,!"#$!"#%!"#&!"#' !(

-./0

12!34

567819

:

;</=.>?@!?A!B?C2!91>0D.16!CE!F8!G<2

(a) Reduction in � of Stages

!"!"#$!%

!%#$!&

!&#$!'

!'#$!(

!(#$!$

!" !"#%!"#&!"#'!"#(!"#$!"#)!"#*!"#+!"#, !%

-./012

0345/0!65/5!

!7893:;0<

=

>15?/4@.!@A!B@CD!<04EF/03!CG!H;!I1D

(b) Reduction in intermediate writes to disk

!"!"#$!%

!%#$!&

!&#$!'

!'#$!(

!(#$!$

!" !"#%!"#&!"#'!"#(!"#$!"#)!"#*!"#+!"#, !%

-./0/!12345678

9

:;.<=>?@!?A!B?C/!87>DE=74!CF!G6!H;/

(c) Reduction in � of TasksFigure ��: Aggregated results from production dataset (see �.�)

This information may not suffice to find the optimal plan. After a few runs, we usually find all the necessary statistics, since each run can explore new parts of the search space. However, plans chosen during this transition are not guaranteed to improve monotonically. In fact, there is no general way to bound the worst-case impact of plans chosen based on incomplete information. As a result, RoPE largely errs on the side of very conservative changes.

We defer to future work more advanced techniques that choose plans given uncertainty regimes over the statistics, or that choose a set of plans, each with an associated validity range specified over the statistics, and switch between these plans at runtime depending on the observed statistics [3]. Clearly, these techniques are more complex than RoPE, the risks from picking worse plans are larger, and, to the best of our knowledge, using these ideas in the context of parallel plans is an open problem.

8. FINAL REMARKS

Results from a deployment in Bing show that leveraging properties of the data, the code, and their interaction significantly improves the execution of data-parallel programs. The improvements derive from using statistics to generate better execution plans. Note that these improvements are mostly orthogonal to those from solving runtime issues during the execution of the plans (e.g., outliers, placing tasks). They are also in addition to those accrued by a context-blind query optimizer over literally executing the programs as specified by the users.

While RoPE leverages database ideas, we believe that their realization in the context of data-parallel programs is interesting, due to challenges that are new (e.g., distributed collection) or more important in this context (e.g., user-defined operations), and due to novel opportunities for improvement (e.g., recurring jobs, little data, and optimizations specific to parallel plans, such as choosing degrees of parallelism to achieve balance).

Acknowledgements

We thank Suman Nath, Peter Bodik, Andrew Ferguson, Jonathan Perry, our shepherd Roxana Geambasu, and members of the UC Berkeley AMP Lab for comments that improved this work. The Cosmos team has built and keeps running the fabulous system without which this work would not have been possible.

References

[1] G. Ananthanarayanan, S. Kandula, A. Greenberg, et al. Reining in the Outliers in MapReduce Clusters Using Mantri. In OSDI, 2010.

[2] R. Avnur and J. M. Hellerstein. Eddies: Continuously Adaptive Query Processing. SIGMOD Rec., 29, May 2000.

[3] S. Babu, P. Bizarro, and D. DeWitt. Proactive Re-optimization. In SIGMOD, 2005.

[4] N. Bruno and S. Chaudhuri. Exploiting Statistics on Query Expressions for Optimization. In SIGMOD, 2002.

[5] R. Chaiken, B. Jenkins, P. Larson, et al. SCOPE: Easy and Efficient Parallel Processing of Massive Datasets. In VLDB, 2008.

[6] C. Chambers, A. Raniwala, F. Perry, et al. FlumeJava: Easy, Efficient Data-Parallel Pipelines. In PLDI, 2010.

[7] B. Chattopadhyay, L. Lin, W. Liu, et al. Tenzing: A SQL Implementation on the MapReduce Framework. In VLDB, 2011.

[8] M. Chowdhury, M. Zaharia, J. Ma, et al. Managing Data Transfers in Computer Clusters with Orchestra. In SIGCOMM, 2011.

[9] M. Durand and P. Flajolet. Loglog Counting of Large Cardinalities. In ESA, 2003.

[10] C. A. Galindo-Legaria, M. M. Joshi, F. Waas, and M.-C. Wu. Statistics on Views. In VLDB, 2003.

[11] P. K. Gunda, L. Ravindranath, C. A. Thekkath, et al. Nectar: Automatic Management of Data and Computation in Datacenters. In OSDI, 2010.

[12] H. Herodotou, H. Lim, G. Luo, et al. Starfish: A Self-Tuning System for Big Data Analytics. In CIDR, 2011.

[13] B. Hindman, A. Konwinski, M. Zaharia, et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In NSDI, 2011.

[14] M. Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys, 2007.

[15] M. Isard, V. Prabhakaran, J. Currey, et al. Quincy: Fair Scheduling for Distributed Computing Clusters. In SOSP, 2009.

[16] N. Kabra and D. J. DeWitt. Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans. In SIGMOD, 1998.

[17] Q. Ke, V. Prabhakaran, Y. Xie, et al. Optimizing Data Partitioning for Data-Parallel Computing. In HotOS, 2011.

[18] P.-A. Larson et al. Cardinality Estimation Using Sample Views with Quality Assurance. In SIGMOD, 2007.

[19] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB, 2002.

[20] D. G. Murray and S. Hand. CIEL: A Universal Execution Engine for Distributed Data-Flow Computing. In NSDI, 2011.

[21] C. Olston, B. Reed, U. Srivastava, et al. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD, 2008.

[22] A. Rasmussen, G. Porter, M. Conley, et al. TritonSort: A Balanced Large-Scale Sorting System. In NSDI, 2011.

[23] A. Shieh, S. Kandula, A. Greenberg, et al. Sharing the Data Center Network. In NSDI, 2011.

[24] A. Thusoo, J. S. Sarma, N. Jain, et al. Hive: A Warehousing Solution over a Map-Reduce Framework. In VLDB, 2009.

[25] M. Zaharia, D. Borthakur, J. S. Sarma, et al. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. In EuroSys, 2010.
