Improving MapReduce Performance in Heterogeneous Environments
Sturzu Antonio-Gabriel, SCPD
Table of Contents
– Intro
– Scheduling in Hadoop
– Heterogeneity in Hadoop
– The LATE Scheduler (Longest Approximate Time to End)
– Evaluation of Performance
Intro
The large volume of data that internet services operate on has led to the need for parallel processing
The leading example is Google, which uses MapReduce to process 20 petabytes of data per day
MapReduce breaks a computation into small tasks that run in parallel on multiple machines and scales easily to very large clusters
The two key benefits of MapReduce are:
– Fault tolerance
– Speculative execution
Intro
Google has noted that speculative execution improves response time by 44%
The paper shows an efficient way to do speculative execution in order to maximize performance
It also shows that Hadoop's simple speculative algorithm, based on comparing each task's progress to the average progress, breaks down in heterogeneous systems
Intro
The proposed scheduling algorithm improves Hadoop's response time by a factor of two
Most of the examples and tests are done on Amazon's Elastic Compute Cloud (EC2)
The paper addresses two important problems in speculative execution:
– Choosing the best node to run the speculative task
– Distinguishing between nodes slightly slower than the mean and stragglers
Scheduling in Hadoop
Hadoop divides each MapReduce job into tasks
The input file is split into even-sized chunks replicated for fault tolerance
Each chunk of input is first processed by a map task that outputs a set of key-value pairs
Map outputs are split into buckets based on the key
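The bucketing of map outputs is a hash partition on the key, so that all values for one key reach the same reducer. A minimal sketch (Python stand-in for Hadoop's Java `HashPartitioner`; names are illustrative):

```python
def partition(key: str, num_reduce_tasks: int) -> int:
    """Assign a map output key to a reduce bucket, in the spirit of
    Hadoop's default HashPartitioner (key.hashCode() % numReduceTasks)."""
    return hash(key) % num_reduce_tasks

# Every record sharing a key lands in the same bucket,
# so one reducer sees all values for that key.
bucket = partition("apple", 4)
```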
Scheduling in Hadoop
When all map tasks finish, reducers apply a reduce function to the set of values associated with each key
Scheduling in Hadoop
Hadoop runs several maps and reduces concurrently on each slave in order to overlap I/O with computation
Each slave tells the master when it has empty task slots:
– First, any failed task is given priority
– Second, a non-running task
– Third, a task to execute speculatively
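The assignment order above can be sketched as a simple priority scan (illustrative Python, not Hadoop's actual JobTracker code):

```python
def next_task(failed, non_running, speculative_candidates):
    """Pick the next task for a slave reporting a free slot:
    failed tasks first, then tasks that have never run,
    then a speculative re-execution; None if nothing to run."""
    for queue in (failed, non_running, speculative_candidates):
        if queue:
            return queue.pop(0)
    return None
```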
Scheduling in Hadoop
To select speculative tasks, Hadoop uses a progress score between 0 and 1
For a map, the progress score is the fraction of input data read
For a reduce task the execution is divided into 3 phases, each of which accounts for 1/3 of the score:
– The copy phase
– The sort phase
– The reduce phase
Scheduling in Hadoop
In each phase the score is the fraction of data processed
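Putting the two rules together, a reduce task's score can be sketched as follows (a minimal sketch; the function name and phase-count argument are illustrative):

```python
def reduce_progress(phases_done: int, fraction_in_phase: float) -> float:
    """Progress score of a reduce task: each completed phase
    (copy, sort, reduce) contributes 1/3, plus the fraction of
    data processed in the current phase, scaled by 1/3."""
    return phases_done / 3 + fraction_in_phase / 3

# e.g. halfway through the sort phase (copy phase done): score 0.5
```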
Hadoop calculates an average score for each category of tasks in order to define a threshold for speculative execution
When a task's progress score is less than the average for its category minus 0.2, and the task has run for at least a minute, it is marked as a straggler
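This rule can be sketched directly; the 0.2 margin and the one-minute floor come from the slide, while the function name is illustrative:

```python
def is_straggler(progress_score: float, category_average: float,
                 runtime_seconds: float) -> bool:
    """Hadoop's native straggler test: more than 0.2 below the
    category average and running for at least one minute."""
    return (runtime_seconds >= 60
            and progress_score < category_average - 0.2)
```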
Scheduling in Hadoop
All tasks beyond the threshold are considered equally slow, and ties between them are broken by data locality
This threshold works well in homogeneous systems because tasks tend to start and finish in "waves" at roughly the same times
When running multiple jobs, Hadoop uses a FIFO discipline
Scheduling in Hadoop
Assumptions made by the Hadoop scheduler:
1. Nodes can perform work at roughly the same rate
2. Tasks progress at a constant rate throughout time
3. There is no cost to launching a speculative task on a node that would otherwise have an idle slot
Scheduling in Hadoop
4. A task's progress score is representative of the fraction of its total work that it has done
5. Tasks tend to finish in waves, so a task with a low progress score is likely a straggler
6. Tasks in the same category (map or reduce) require roughly the same amount of work
Heterogeneity in Hadoop
Too many speculative tasks are launched because of the fixed threshold (assumption 3 breaks down)
Because the scheduler uses data locality to rank candidates for speculative execution, the wrong tasks may be chosen first
Assumptions 3, 4 and 5 break down on both homogeneous and heterogeneous clusters
The LATE Scheduler
The main idea is to speculatively execute the task that will finish farthest in the future
LATE estimates a task's progress rate as ProgressScore/T, where T is the amount of time the task has been running
The estimated time to completion is (1 - ProgressScore)/ProgressRate
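The two formulas translate directly into code (a minimal sketch; argument names are illustrative):

```python
def estimated_time_left(progress_score: float, elapsed: float) -> float:
    """LATE's completion estimate:
    progress_rate = ProgressScore / T
    time left     = (1 - ProgressScore) / progress_rate"""
    progress_rate = progress_score / elapsed
    return (1 - progress_score) / progress_rate

# A task 50% done after 100 s is estimated to need about 100 s more.
```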
The LATE Scheduler
To get the best chance of beating the original task, the algorithm launches speculative tasks only on fast nodes
It does this using a SlowNodeThreshold, which is a metric of the total work performed by a node
Because speculative tasks cost resources, LATE uses two additional heuristics:
– A limit on the number of speculative tasks executed (SpeculativeCap)
– A SlowTaskThreshold that determines whether a task is slow enough to be speculated (uses progress rate for comparison)
The LATE Scheduler
When a node asks for a new task and the number of speculative tasks is less than SpeculativeCap:
– If the node's progress score is below SlowNodeThreshold, ignore the request
– Rank currently running tasks that are not being speculated by estimated completion time
– Launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold
LATE does not take data locality into account
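The decision procedure above can be sketched as follows; the task tuple layout and the way thresholds are passed in are assumptions for illustration, not the paper's actual code:

```python
def late_choose(node_progress, running_tasks, num_speculating,
                speculative_cap, slow_node_thr, slow_task_thr):
    """Pick a task to speculate when a node requests work.
    running_tasks: (task_id, progress_score, progress_rate) tuples
    for tasks not already being speculated."""
    if num_speculating >= speculative_cap:
        return None                      # SpeculativeCap reached
    if node_progress < slow_node_thr:
        return None                      # slow node: ignore the request

    def time_left(task):
        _, score, rate = task
        return (1 - score) / rate        # LATE's completion estimate

    # Rank by estimated completion time, farthest-in-the-future first
    for task in sorted(running_tasks, key=time_left, reverse=True):
        if task[2] < slow_task_thr:      # only genuinely slow tasks
            return task[0]
    return None
```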
The LATE Scheduler
Advantages of the algorithm:
– Robust to node heterogeneity because it launches only the slowest tasks, and only a few of them
– Prioritizes among slow tasks based on how much they hurt response time
– Takes node heterogeneity into account when choosing the node on which to run a speculative task
– Executes only tasks that will improve the total response time, not every slow task
The LATE Scheduler
The time-to-completion estimate can produce errors when a task's progress rate decreases, but in general it gives correct approximations for typical MapReduce jobs
Evaluation
To create heterogeneity, they mapped a variable number of virtual machines (from 1 to 8) onto each host in the EC2 cluster
They measured the impact of contention on I/O performance and application-level performance
Evaluation
For application-level performance, they sorted 100 GB of random data using Hadoop's Sort benchmark with speculative execution disabled
With isolated VMs the job completed in 408 s; with VMs packed densely onto physical hosts (7 VMs per host) it took 1094 s
For evaluating the scheduling algorithms they used clusters of about 200 VMs and performed 5-7 runs
Evaluation
Results for scheduling in a heterogeneous cluster of 243 VMs using the Sort job (128 MB per host, for a total of 30 GB of data):
Evaluation
On average, LATE finished jobs 27% faster than Hadoop's native scheduler and 31% faster than no speculation
Results for scheduling with stragglers
To simulate stragglers, they manually slowed down eight VMs in a cluster of 100
Evaluation
For each run they sorted 256 MB per host, for a total of 25 GB:
Evaluation
They also ran two other workloads on a heterogeneous cluster with stragglers:
– Grep
– WordCount
They used a 204-node cluster with 1 to 8 VMs per host
For the Grep test they searched 43 GB of text data, or about 200 MB per host
Evaluation
On average, LATE finished jobs 36% faster than Hadoop's native scheduler and 56% faster than no speculation
For the WordCount test they used a data set of 21 GB, or 100 MB per host
Evaluation
Sensitivity analysis:
– SpeculativeCap
– SlowTaskThreshold
– SlowNodeThreshold
SpeculativeCap results
They ran experiments at six SpeculativeCap values from 2.5% to 100%, repeating each experiment 5 times
Evaluation
Sensitivity to SlowTaskThreshold
Here the idea is not to speculate tasks that are progressing fast if they are the only tasks left
They tested 6 values from 5% to 100%
Evaluation
We observe that values past 25% all work well, with 25% being the optimum value
Sensitivity to SlowNodeThreshold