predicting execution bottlenecks in map-reduce clusters edward bortnikov, ari frank, eshcar hillel,...

Predicting Execution Bottlenecks

in Map-Reduce Clusters

Edward Bortnikov, Ari Frank, Eshcar Hillel, Sriram RaoPresenting: Alex Shraer

Yahoo! Labs

- 2 - Yahoo! Confidential

The Map Reduce (MR) Paradigm Architecture for scalable information processing

Simple API Computation scales to Web-scale data collections

Google MR Pioneered the technology in early 2000’s

Hadoop: open-source implementation In use at Amazon, eBay, Facebook, Yahoo!, … Scales to 10K’s nodes (Hadoop 2.0)

Many proprietary implementations MR technologies at Microsoft, Yandex, …


Computational Model

M1

M2

M3

M4

R1

R2

Out

put (

on D

FS)

Inpu

t (on

DFS

)

Synchronous execution: every R starts computing

after all M’s have completed

Slowest task (straggler)affects the job latency


Predicting Straggler Tasks Straggler tasks are an inherent bottleneck

Affect job latency, and to some extend throughput

Two approaches to tackle stragglers Avoidance – reduce the probability of straggler emergence Detection – once a task goes astray, speculatively fire a

duplicate task somewhere else

This work – straggler prediction Fits with both avoidance and detection scenarios


Background Detection, Speculative Execution

First implemented in Google MR (OSDI ’04) Hadoop employs a crude detection heuristic

LATE scheduler (OSDI ‘08) addresses the issues of heterogeneous hardware. Evaluated on small scale.

Microsoft MR (Mantri project, OSDI ‘10)

Avoidance Local/rack-local data access is preferred for mappers

… Network less likely to become the bottleneck

All optimizations are heuristic


Machine-Learned vs Heuristic Prediction Heuristics are hard to …

Tune for real workloads Catch transient bottlenecks

Some evidence from Hadoop grids at Yahoo! Speculative scheduling is non-timely and wasteful

90% of the fired duplicates are eventually killed Data-local computation amplifies contention

Can we use the wealth of historical grid performance data to train a machine-learned bottleneck classifier?


Why Should Machine Learning Work?

Recurrence

(over 5 months)

Fraction of jobs

Fraction of maps

Fraction of reduces

1 3.3% 0.1% <0.1%2 2% 0.2% 0.1%3-5 11.4% 0.3% 0.5%6-10 6.2% 0.8% 0.5%11-20 7.9% 1.1% 1.0%21-50 10.9% 2.6% 2.7%51-100 11% 2.9% 7.2%101-200 32.9% 6.2% 9.2%201-500 4.5% 9.8% 5.2%501-1000 1.3% 5.6% 2.0%1001+ 8.6% 68.5% 71.5%

Huge recurrence of large jobs in production grids

95% of mappers and reducers belong to jobs that ran 50+ times in a 5-month sample

Target workload


The Slowdown Metric Task slowdown factor

Ratio between the task’s running time and the median running time among the sibling tasks in the same job.

Root causes Data skew – input or output significantly exceeds the

median for the job Tasks with skew > 4x happen really seldom.

Hotspots – all the other reasons Congested/misconfigured/degraded nodes, disks, or network. Typically transient. The resulting slow can be very high.


Jobs with Mapper Slowdown > 5x

Mapper Slowdown

5

10

15

20

25

30

35

40

45

0 10000 20000 30000 40000 50000 60000 70000

Mapper Tasks per Job

Slo

wd

ow

n F

acto

r

1% among all jobs 5% among jobs with 1000 mappers or more

40% due to data skew (2x or above), 60% due to hotspots

Sample of ~50K jobs


Jobs with Reducer Slowdown > 5xReducer Slowdown

5

15

25

35

45

55

65

75

85

95

0 500 1000 1500 2000 2500

Reducer tasks per job

Slo

wd

ow

n F

acto

r

5% among all jobs 50% among jobs with 1000 reducers or more

10% due to data skew (2x or above), 90% due to hotspots

Sample of ~60K jobs


Locality is No Silver Bullet

1.304E+12

1.306E+12

1.308E+12

1.31E+12

1.312E+12

1.314E+12

1.316E+12

3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500

Tim

es

tam

p

Node id

Outliers among Local Mappers

The same nodes are constantly lagging behind Weaker CPUs (grid HW is heterogeneous), data hotspots, etc.

Pushing for locality too hard amplifies the problem!

Top contributorof straggler tasks

over 6 hours


Slowdown Predictor An oracle plugin into a Map Reduce system

Input: node features + task features Output: slowdown estimate

Features M/R metrics (job- and task-level) DFS metrics (datanode-level) System metrics (host-level: CPU, RAM, disk I/O, JVM, …) Network traffic (host-, rack- and cross-rack-level)


Slowdown Prediction - MappersMap Tasks

R2 = 0.7978

-3

-2

-1

0

1

2

3

-3 -2 -1 0 1 2 3

Actual Slowdown (log scale)

Pred

icte

d Sl

owdo

wn

(log

scal

e)

Mis-predicted, need improvement


Slowdown Prediction - ReducersReduce Tasks

R2 = 0.4083

-3

-2

-1

0

1

2

3

4

-3 -2 -1 0 1 2 3

Actual Slowdown (log scale)

Pred

icte

d Sl

owdo

wn

(log

scal

e)

More dispersed than the mappers


Some Conclusions Data skew is the most important signal, but there are

many more that are important Node HW generation is a very significant signal for both

mappers and reducers Large grids undergo continuous HW upgrades

Network traffic features (intra-rack and cross-rack) is much more important for reducers than for mappers How to collect efficiently in a real-time setting?

Need to enhance data sampling/weighting to capture outliers better


Takeaways Slowdown prediction

ML approach to straggler avoidance and detection

Initial evaluation showed viability Need to enhance training to capture outliers better

Challenge: runtime implementation A good blend with the modern MR system architecture?


Thank you


Machine Learning Technique

Gradient Boosted Decision Trees (GBDT) Additive regression model Based on ensemble of binary decision trees 100 trees, 10 leaf nodes each

…


Challenges – Hadoop Use Case Hadoop 1.0 – centralized architecture

The single Job Tracker process manages all task assignment and scheduling

Full picture of Map and Reduce slots across the cluster Hadoop 2.0 – distributed architecture

Resource management and scheduling functions split Thin centralized Resource Manager (RM) creates application

containers (e.g., for running a Map Reduce job) Per-job App Master (AM) does scheduling within a container

May negotiate resource allocation with the RM

Challenge: working with a limited set of local signals


Possible Design – Hadoop 2.0

ApplicationMaster

ResourceManager

NodeManager

NodeManager

NodeManager

Model

Metrics Metrics Metrics

Resource requestsApp Container Creation

Metrics collection(extends the existing HB protocol)

New component or API

Job Execution Environment

Some metrics already collected (CPU ticks, bytes R/W)

Others might be collected either by NM, or externally

Centralized prediction will not scale.

Will distributed prediction be accurate

enough?

predicting execution bottlenecks in map-reduce clusters edward bortnikov, ari frank, eshcar hillel,...

Documents

yahoo confidential jobs

jobs slide

yahoo confidential machine

yahoo confidential locality

heuristic slide

yahoo speculative scheduling

job latency slide

job tasks