Improving MapReduce Performance in Heterogeneous Environments
Sturzu Antonio-Gabriel, SCPD
Table of Contents
– Intro
– Scheduling in Hadoop
– Heterogeneity in Hadoop
– The LATE Scheduler (Longest Approximate Time to End)
– Evaluation of Performance
Intro
The large volume of data that internet services operate on has led to the need for parallel processing
The leading example is Google, which uses MapReduce to process 20 petabytes of data per day
MapReduce breaks a computation into small tasks that run in parallel on multiple machines and scales easily to very large clusters
The two key benefits of MapReduce are:
– Fault tolerance
– Speculative execution
Intro
Google has noted that speculative execution improves response time by 44%
The paper shows an efficient way to do speculative execution in order to maximize performance
It also shows that Hadoop's simple speculative algorithm, based on comparing each task's progress to the average progress, breaks down in heterogeneous systems
Intro
The proposed scheduling algorithm improves Hadoop's response time by a factor of two
Most of the examples and tests are done on Amazon's Elastic Compute Cloud (EC2)
The paper addresses two important problems in speculative execution:
– Choosing the best node to run the speculative task
– Distinguishing between nodes slightly slower than the mean and stragglers
Scheduling in Hadoop
Hadoop divides each MapReduce job into tasks
The input file is split into even-sized chunks replicated for fault tolerance
Each chunk of input is first processed by a map task that outputs a set of key-value pairs
Map outputs are split into buckets based on the key
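The bucketing of map outputs is a hash partition on the key, so that all values for one key reach the same reducer. A minimal sketch (Python stand-in for Hadoop's Java `HashPartitioner`; names are illustrative):

```python
def partition(key: str, num_reduce_tasks: int) -> int:
    """Assign a map output key to a reduce bucket, in the spirit of
    Hadoop's default HashPartitioner (key.hashCode() % numReduceTasks)."""
    return hash(key) % num_reduce_tasks

# Every record sharing a key lands in the same bucket,
# so one reducer sees all values for that key.
bucket = partition("apple", 4)
```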
Scheduling in Hadoop
When all map tasks finish, reducers apply a reduce function to the set of values associated with each key
Scheduling in Hadoop
Hadoop runs several maps and reduces concurrently on each slave in order to overlap I/O with computation
Each slave tells the master when it has empty task slots:
– First, any failed task is given priority
– Second, a non-running task
– Third, a task to execute speculatively
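The assignment order above can be sketched as a simple priority scan (illustrative Python, not Hadoop's actual JobTracker code):

```python
def next_task(failed, non_running, speculative_candidates):
    """Pick the next task for a slave reporting a free slot:
    failed tasks first, then tasks that have never run,
    then a speculative re-execution; None if nothing to run."""
    for queue in (failed, non_running, speculative_candidates):
        if queue:
            return queue.pop(0)
    return None
```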
Scheduling in Hadoop
To select speculative tasks, Hadoop uses a progress score between 0 and 1
For a map, the progress score is the fraction of input data read
For a reduce task the execution is divided into 3 phases, each of which accounts for 1/3 of the score:
– The copy phase
– The sort phase
– The reduce phase
Scheduling in Hadoop
In each phase the score is the fraction of data processed
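Putting the two rules together, a reduce task's score can be sketched as follows (a minimal sketch; the function name and phase-count argument are illustrative):

```python
def reduce_progress(phases_done: int, fraction_in_phase: float) -> float:
    """Progress score of a reduce task: each completed phase
    (copy, sort, reduce) contributes 1/3, plus the fraction of
    data processed in the current phase, scaled by 1/3."""
    return phases_done / 3 + fraction_in_phase / 3

# e.g. halfway through the sort phase (copy phase done): score 0.5
```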
Hadoop calculates an average score for each category of tasks in order to define a threshold for speculative execution
When a task's progress score is less than the average for its category minus 0.2, and the task has run for at least a minute, it is marked as a straggler
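This rule can be sketched directly; the 0.2 margin and the one-minute floor come from the slide, while the function name is illustrative:

```python
def is_straggler(progress_score: float, category_average: float,
                 runtime_seconds: float) -> bool:
    """Hadoop's native straggler test: more than 0.2 below the
    category average and running for at least one minute."""
    return (runtime_seconds >= 60
            and progress_score < category_average - 0.2)
```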
Scheduling in Hadoop
All tasks beyond the threshold are considered equally slow, and ties between them are broken by data locality
This threshold works well in homogeneous systems because tasks tend to start and finish in "waves" at roughly the same times
When running multiple jobs, Hadoop uses a FIFO discipline
Scheduling in Hadoop
Assumptions made by the Hadoop scheduler:
1. Nodes can perform work at roughly the same rate
2. Tasks progress at a constant rate throughout time
3. There is no cost to launching a speculative task on a node that would otherwise have an idle slot
Scheduling in Hadoop
4. A task's progress score is representative of the fraction of its total work that it has done
5. Tasks tend to finish in waves, so a task with a low progress score is likely a straggler
6. Tasks in the same category (map or reduce) require roughly the same amount of work
Heterogeneity in Hadoop
Too many speculative tasks are launched because of the fixed threshold (assumption 3 breaks down)
Because the scheduler uses data locality to rank candidates for speculative execution, the wrong tasks may be chosen first
Assumptions 3, 4 and 5 break down on both homogeneous and heterogeneous clusters
The LATE Scheduler
The main idea is to speculatively execute the task that will finish farthest in the future
LATE estimates a task's progress rate as ProgressScore/T, where T is the amount of time the task has been running
The estimated time to completion is (1 - ProgressScore)/ProgressRate
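The two formulas translate directly into code (a minimal sketch; argument names are illustrative):

```python
def estimated_time_left(progress_score: float, elapsed: float) -> float:
    """LATE's completion estimate:
    progress_rate = ProgressScore / T
    time left     = (1 - ProgressScore) / progress_rate"""
    progress_rate = progress_score / elapsed
    return (1 - progress_score) / progress_rate

# A task 50% done after 100 s is estimated to need about 100 s more.
```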
The LATE Scheduler
To get the best chance of beating the original task, the algorithm launches speculative tasks only on fast nodes
It does this using a SlowNodeThreshold, which is a metric of the total work performed by a node
Because speculative tasks cost resources, LATE uses two additional heuristics:
– A limit on the number of speculative tasks executed (SpeculativeCap)
– A SlowTaskThreshold that determines whether a task is slow enough to be speculated (uses progress rate for comparison)
The LATE Scheduler
When a node asks for a new task and the number of speculative tasks is less than SpeculativeCap:
– If the node's progress score is below SlowNodeThreshold, ignore the request
– Rank currently running tasks that are not being speculated by estimated completion time
– Launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold
LATE does not take data locality into account
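The decision procedure above can be sketched as follows; the task tuple layout and the way thresholds are passed in are assumptions for illustration, not the paper's actual code:

```python
def late_choose(node_progress, running_tasks, num_speculating,
                speculative_cap, slow_node_thr, slow_task_thr):
    """Pick a task to speculate when a node requests work.
    running_tasks: (task_id, progress_score, progress_rate) tuples
    for tasks not already being speculated."""
    if num_speculating >= speculative_cap:
        return None                      # SpeculativeCap reached
    if node_progress < slow_node_thr:
        return None                      # slow node: ignore the request

    def time_left(task):
        _, score, rate = task
        return (1 - score) / rate        # LATE's completion estimate

    # Rank by estimated completion time, farthest-in-the-future first
    for task in sorted(running_tasks, key=time_left, reverse=True):
        if task[2] < slow_task_thr:      # only genuinely slow tasks
            return task[0]
    return None
```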
The LATE Scheduler
Advantages of the algorithm:
– Robust to node heterogeneity because it launches only the slowest tasks, and only a few of them
– Prioritizes among slow tasks based on how much they hurt response time
– Takes node heterogeneity into account when choosing the node on which to run a speculative task
– Executes only tasks that will improve the total response time, not every slow task
The LATE Scheduler
The time-to-completion estimate can produce errors when a task's progress rate decreases, but in general it gives correct approximations for typical MapReduce jobs
Evaluation
To create heterogeneity, they mapped a variable number of virtual machines (from 1 to 8) onto each host in the EC2 cluster
They measured the impact of contention on I/O performance and application-level performance
Evaluation
For application-level performance, they sorted 100 GB of random data using Hadoop's Sort benchmark with speculative execution disabled
With isolated VMs the job completed in 408 s; with VMs packed densely onto physical hosts (7 VMs per host) it took 1094 s
For evaluating the scheduling algorithms they used clusters of about 200 VMs and performed 5-7 runs
Evaluation
Results for scheduling in a heterogeneous cluster of 243 VMs using the Sort job (128 MB per host, for a total of 30 GB of data):
Evaluation
On average, LATE finished jobs 27% faster than Hadoop's native scheduler and 31% faster than no speculation
Results for scheduling with stragglers
To simulate stragglers, they manually slowed down eight VMs in a cluster of 100
Evaluation
For each run they sorted 256 MB per host, for a total of 25 GB:
Evaluation
They also ran two other workloads on a heterogeneous cluster with stragglers:
– Grep
– WordCount
They used a 204-node cluster with 1 to 8 VMs per host
For the Grep test they searched 43 GB of text data, or about 200 MB per host
Evaluation
On average, LATE finished jobs 36% faster than Hadoop's native scheduler and 56% faster than no speculation
For the WordCount test they used a data set of 21 GB, or 100 MB per host
Evaluation
Sensitivity analysis:
– SpeculativeCap
– SlowTaskThreshold
– SlowNodeThreshold
SpeculativeCap results
They ran experiments at six SpeculativeCap values from 2.5% to 100%, repeating each experiment 5 times
Evaluation
Sensitivity to SlowTaskThreshold
Here the idea is not to speculate tasks that are progressing fast if they are the only tasks left
They tested 6 values from 5% to 100%
Evaluation
We observe that values past 25% all work well, with 25% being the optimum value
Sensitivity to SlowNodeThreshold