Improving MapReduce Performance Using Smart
Speculative Execution Strategy
Qi Chen, Cheng Liu, and Zhen Xiao
Oct 2013
To appear in IEEE Transactions on Computers
Outline
1. Introduction
2. Background
3. Previous work
4. Pitfalls
5. Our Design
6. Evaluation
7. Conclusion
Introduction

The new era of Big Data is coming!
– 20 PB per day (2008)
– 30 TB per day (2009)
– 60 TB per day (2010)
– petabytes of new data every day
What does big data mean? Important user information and significant business value.
MapReduce

What is MapReduce? The most popular parallel computing model, proposed by Google.
Applications:
– Database operations: Select, Join, Group
– Search engines: PageRank, inverted index, log analysis
– Machine learning: clustering, machine translation, recommendation
– Cryptanalysis
– Scientific computation
– …
Straggler

What is a straggler in MapReduce? A node on which tasks take an unusually long time to finish.
It will:
– Delay the job execution time
– Degrade the cluster throughput
How to solve it? Speculative execution: a slow task is backed up on an alternative machine, with the hope that the backup can finish faster.
Architecture

[Figure: MapReduce architecture. The master assigns tasks to map and reduce workers. The input files are divided into splits 1..M; the Map stage turns each split into partitioned intermediate parts (Part 1, Part 2, ...); the Reduce stage fetches the parts and writes the output files.]
Programming model

Input: (key, value) pairs. Output: (key*, value*) pairs.

Stage   | Phases              | Data flow
Map     | Map, Combine        | List(K1,V1) → List(K2,V2) → List(K2, List(V2))
Reduce  | Copy, Sort, Reduce  | List(K2, List(V2)) → Ordered(K2, List(V2)) → List(K3,V3)
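The data flow above can be sketched with the classic word-count example. This is an illustrative, single-process emulation (not the actual Hadoop API); the shuffle step here stands in for the copy/sort phases of the reduce stage:

```python
from collections import defaultdict

def map_phase(_, line):
    # Map: (K1, V1) -> list of (K2, V2) pairs
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: (K2, List(V2)) -> (K3, V3)
    return word, sum(counts)

def run_job(lines):
    # Shuffle: group intermediate pairs by key and sort them,
    # emulating the copy and sort phases of the reduce stage.
    groups = defaultdict(list)
    for i, line in enumerate(lines):
        for k, v in map_phase(i, line):
            groups[k].append(v)
    return dict(reduce_phase(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["big data", "big deal"]))  # {'big': 2, 'data': 1, 'deal': 1}
```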
Causes of Stragglers

Internal factors:
– The resource capacity of worker nodes is heterogeneous
– Resource competition due to other MapReduce tasks running on the same worker node
External factors:
– Resource competition due to co-hosted applications
– Input data skew
– A remote input or output source that is too slow
– Faulty hardware
Previous work

Google and Dryad: when a stage is close to completion, back up an arbitrary set of the remaining tasks.
Hadoop Original: back up a task whose progress falls behind the average by a fixed gap.
LATE (OSDI ’08): back up the task with 1) the longest remaining time and 2) a progress rate below a threshold. Identify a worker as slow when its performance score is below a threshold.
Mantri (OSDI ’10): save cluster computing resources. Back up outliers as soon as they show up: kill-restart when the cluster is busy, lazily duplicate when the cluster is idle.
Pitfalls in Selecting Slow Tasks

Using the average progress rate to identify slow tasks and estimate task remaining time, Hadoop and LATE assume that:
– Tasks of the same type process almost the same amount of input data
– The progress rate is either stable or accelerating during a task’s lifetime
There are scenarios in which these assumptions break down:
– Input data skew: e.g., a Sort benchmark on 10 GB of input data following a Zipf distribution (parameter 1.0)
– Phase percentage varies: different jobs have different phase duration ratios, the same job has different phase duration ratios in different environments, and speed varies across phases
– Reduce tasks start asynchronously: tasks in different phases cannot be compared directly
– It takes a long time to identify a straggler: stragglers cannot be identified in time
Pitfalls in Selecting Backup Nodes

Identifying slow worker nodes:
– LATE: sum of the progress of all completed and running tasks on the node
– Hadoop: average progress rate of all completed tasks on the node
– Some worker nodes may do more time-consuming tasks and unfairly get a lower performance score, e.g., tasks with a larger amount of data to process, or non-local map tasks
Choosing backup worker nodes:
– LATE and Hadoop ignore data locality
– Our observation: a data-local map task can be over three times faster than a non-local map task
Selecting Backup Candidates: Using Per-Phase Process Speed

Divide each task into multiple phases (Map task: Map, Combine; Reduce task: Copy, Sort, Reduce) and use the per-phase process speed to identify slow tasks and estimate task remaining time. Tasks are compared only against tasks in the same phase.
Selecting Backup Candidates: Using EWMA to Predict Process Speed

Z(t) = α · Y(t) + (1 − α) · Z(t − 1),  0 < α ≤ 1
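The EWMA update above can be sketched as follows; the function name and the choice of α = 0.3 are illustrative, not from the paper:

```python
def ewma(samples, alpha=0.3):
    """Exponentially weighted moving average:
    Z(t) = alpha * Y(t) + (1 - alpha) * Z(t-1), 0 < alpha <= 1."""
    z = samples[0]          # seed the estimate with the first observation
    for y in samples[1:]:
        z = alpha * y + (1 - alpha) * z
    return z

# A transient dip in the observed speed only partially moves the
# smoothed estimate, which is why EWMA resists measurement noise:
speeds = [10.0, 10.0, 2.0, 10.0]
print(ewma(speeds))  # 8.32
```

A larger α weights recent observations more heavily; a smaller α smooths more aggressively.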
Selecting Backup Candidates: Estimating Task Remaining Time and Backup Time

Use the phase average process speed to estimate the remaining time of a phase. Because the process speed of the copy phase tends to be fast at the beginning and drop later, we estimate its remaining time as:

rem_time_copy = (finish_percent_map − finish_percent_copy) / process_speed_copy

rem_time = rem_time_cur_phase + rem_time_following_phases
         = rem_data_cur_phase / bandwidth_cur_phase + Σ_{p ∈ following_phases} est_time_p · factor_d

backup_time = Σ_p est_time_p · factor_d

factor_d = data_input / data_avg
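The estimates above can be sketched as below. Function and parameter names are illustrative; `factor_d` is the data-skew factor (the task's input size relative to the average):

```python
def remaining_time(rem_data_cur, bandwidth_cur, est_times_following, factor_d):
    """rem_time = rem_data_cur / bandwidth_cur
                  + sum of est_time_p * factor_d over following phases."""
    cur_phase = rem_data_cur / bandwidth_cur
    following = sum(t * factor_d for t in est_times_following)
    return cur_phase + following

def backup_time(est_times_all_phases, factor_d):
    # A fresh backup must rerun every phase, scaled by the skew factor.
    return sum(t * factor_d for t in est_times_all_phases)

# Task with 40 MB left in the current phase at 10 MB/s, two following
# phases averaging 8 s and 5 s, processing 1.5x the average input size:
print(remaining_time(40, 10, [8, 5], 1.5))   # 23.5
print(backup_time([6, 8, 5], 1.5))           # 28.5
```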
Selecting Backup Candidates: Maximizing Cost Performance

Cost: the computing resources occupied by tasks.
Performance: the shortening of job execution time and the increase in cluster throughput.
We hope that when the cluster is idle, the cost of speculative execution is less of a concern; when the cluster is busy, the cost is an important consideration.
profit_backup = α · (rem_time − backup_time) − β · 2 · backup_time
profit_not_backup = α · 0 − β · rem_time

profit_backup > profit_not_backup  ⇔  rem_time / backup_time > (α + 2β) / (α + β)

γ = load_factor = pending_tasks / free_slots

Setting β / α to γ, the condition becomes: rem_time / backup_time > (1 + 2γ) / (1 + γ)
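The load-adaptive decision rule above can be sketched as follows (names are illustrative):

```python
def should_backup(rem_time, backup_time, pending_tasks, free_slots):
    """Back up only when rem_time / backup_time > (1 + 2*gamma) / (1 + gamma),
    where gamma = pending_tasks / free_slots is the cluster load factor."""
    gamma = pending_tasks / free_slots
    return rem_time / backup_time > (1 + 2 * gamma) / (1 + gamma)

# Idle cluster (gamma -> 0): the threshold approaches 1, so even a modest
# expected saving justifies a backup:
print(should_backup(rem_time=30, backup_time=25, pending_tasks=0, free_slots=10))   # True
# Busy cluster (gamma = 4): the threshold rises to 9/5, so the same task
# is no longer worth backing up:
print(should_backup(rem_time=30, backup_time=25, pending_tasks=40, free_slots=10))  # False
```

As γ grows, the threshold rises from 1 toward 2, so backups become progressively harder to justify on a loaded cluster.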
Selecting Proper Backup Nodes

Assign backup tasks to fast nodes. How do we measure the performance of a node? We use the predicted process bandwidth of the data-local map tasks completed on the node to represent its performance.
Consider data locality: the process speed of a data-local map task can be 3 times that of a non-local map task. Therefore, we keep process speed statistics for data-local, rack-local, and non-local map tasks on each node. For nodes that have not processed any map task at a specific locality level, we use the average process speed of all nodes at that level as an estimate.
Launch a backup on node i? Only if the task’s remaining time is longer than its backup time on node i.
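The per-locality-level bookkeeping can be sketched as below. The data structures and function names are hypothetical, chosen only to illustrate the fallback-to-cluster-average rule:

```python
def node_speed(node_stats, level, cluster_avg):
    """Predicted map process speed of a node at one locality level
    ('data_local', 'rack_local', or 'non_local'); fall back to the
    cluster-wide average when the node has no history at that level."""
    speed = node_stats.get(level)
    return speed if speed is not None else cluster_avg[level]

def pick_backup_node(candidates, locality_of, cluster_avg):
    # candidates: {node: per-level speed stats}
    # locality_of: the level the backup task would run at on each node
    return max(candidates,
               key=lambda n: node_speed(candidates[n], locality_of[n], cluster_avg))

cluster_avg = {"data_local": 30.0, "rack_local": 15.0, "non_local": 10.0}
nodes = {"A": {"data_local": 28.0}, "B": {"rack_local": 20.0}}
# Node A would run the backup data-locally, node B only rack-locally:
print(pick_backup_node(nodes, {"A": "data_local", "B": "rack_local"}, cluster_avg))  # A
```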
Summary

A task will be backed up when it meets all of the following conditions:
– It has executed for a certain amount of time (i.e., the speculative lag)
– Both the progress rate and the process bandwidth in the current phase of the task are sufficiently low
– The profit of doing the backup outweighs that of not doing it
– Its estimated remaining time is longer than the predicted time to finish on a backup node
– It has the longest remaining time among all tasks satisfying the conditions above
Experiment Environment

Two scales:
– Small: 30 virtual machines on 15 physical machines
– Large: 100 virtual machines on 30 physical machines
Each physical machine: dual processors (2.4 GHz Intel(R) Xeon(R) E5620 with 16 logical cores), 24 GB of RAM, and two 150 GB disks, organized in three racks connected by 1 Gbps Ethernet.
Each virtual machine: 2 virtual cores, 4 GB RAM, and 40 GB of disk space.
Benchmarks: Sort, WordCount, Grep, Gridmix
Scheduling in Heterogeneous Environments: Load Distribution

Load of each host in heterogeneous environments:

Load        | Hosts | VMs
1 VM/host   | 3     | 3
2 VMs/host  | 11    | 22
5 VMs/host  | 1     | 5
Total       | 15    | 30
Scheduling in Heterogeneous Environments: Working With Different Workloads

Workload   | Job completion time improvement | Cluster throughput improvement
WordCount  | 10%                             | 5%
Sort       | 19%                             | 15%
Grep       | 39%                             | 38%
Gridmix    | 13%                             | 15%
Scheduling in Heterogeneous Environments: Analysis (using WordCount and Grep)

Strategy     | Precision (Map / Reduce) | Recall (Map / Reduce) | Average find time (Map / Reduce)
Hadoop-LATE  | 37.6% / 3%               | 100% / 100%           | 70 s / 66 s
Hadoop-MCP   | 45.2% / 93.3%            | 87.1% / 100%          | 56 s / 32 s

Improvement         | + Accurate prediction | + Cost performance | All
Execution speed     | 27%                   | 31%                | 39%
Cluster throughput  | 29%                   | 32%                | 38%
Scheduling in Heterogeneous Environments: Handling Data Skew (Sort)

– Execution speed: +37%; cluster throughput: +44%
– Execution speed: +17%; cluster throughput: +19%
Competing with other applications

Run some I/O-intensive processes on some servers: a dd process that creates large files in a loop, writing random data on some of the physical machines.
MCP runs 36% faster than Hadoop-LATE and increases the cluster throughput by 34%.
Scheduling in Heterogeneous Environments: Large-Scale Experiment

Load distribution:

Load        | Hosts | VMs
3 VMs/host  | 27    | 81
5 VMs/host  | 4     | 20
Total       | 31    | 101

MCP finishes jobs 21% faster than Hadoop-LATE and improves the cluster throughput by 16%.
Scheduling in Homogeneous Environments

Small-scale cluster with each host running 2 VMs; there is no straggler node in the cluster.
MCP finishes jobs 6% faster than Hadoop-LATE and 2% faster than Hadoop-None.
Hadoop-LATE behaves worse than Hadoop-None due to too many unnecessary reduce backups; MCP improves reduce backup precision by 40%.
MCP also achieves better data locality for map tasks.

Scheduling Cost

We measure the average time that MCP and Hadoop-LATE spend on speculative scheduling in a job with 350 map tasks and 110 reduce tasks:
– MCP spends about 0.54 ms, O(n)
– LATE spends about 0.74 ms, O(n log n)
Conclusion

We provide an analysis of the pitfalls of current speculative execution strategies in MapReduce. Scenarios: data skew, tasks that start asynchronously, improper configuration of phase percentages, etc.
We develop a new strategy, MCP, to handle these scenarios:
– Accurate slow task prediction and remaining time estimation
– Taking the cost performance of computing resources into account
– Taking both data locality and data skew into consideration when choosing proper worker nodes
MCP fits well in both heterogeneous and homogeneous environments: it handles data skew well, is quite scalable, and has low overhead.
Thank You!