job scheduling for mapreduce - mit · job scheduling for mapreduce matei zaharia, dhruba...

29
UC Berkeley Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur * , Joydeep Sen Sarma * , Scott Shenker, Ion Stoica 1 RAD Lab, * Facebook Inc

Upload: others

Post on 17-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

UC Berkeley

Job Scheduling for MapReduce

Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica

1

RAD Lab, *Facebook Inc

Page 2: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Motivation

•  Hadoop was designed for large batch jobs – FIFO queue + locality optimization

•  At Facebook, we saw a different workload: – Many users want to share a cluster – Many jobs are small (10-100 tasks)

•  Sampling, ad-hoc queries, periodic reports, etc

 How should we schedule tasks in a shared MapReduce cluster?

Page 3: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Benefits of Sharing

•  Higher utilization due to statistical multiplexing

•  Data consolidation (ability to query disjoint data sets together)

Page 4: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Why is it Interesting?

•  Data locality is crucial for performance •  Conflict between locality and fairness •  70% gain from simple algorithm

Page 5: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Outline

•  Task scheduling in Hadoop •  Two problems

– Head-of-line scheduling – Slot stickiness

•  A solution (global scheduling) •  Lots more problems (future work)

Page 6: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Task Scheduling in Hadoop

Master

•  Slaves send heartbeats periodically •  Master responds with task if a slot is free,

picking task with data closest to the node

Jobqueue

Slave

Slave

Slave

Slave

Page 7: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 1: Poor Locality for Small Jobs

# of Maps Percent of Jobs < 25 58%

25-100 18% 100-400 14%

400-1600 7% 1600-6400 3%

> 6400 0.26%

Job Sizes at Facebook

Page 8: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 1: Poor Locality for Small Jobs

0

20

40

60

80

100

10 100 1000 10000 100000

Perc

ent L

ocal

Map

s

Job Size (Number of Maps)

Job Locality at Facebook

Node Locality Rack Locality

Page 9: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Cause •  Only head-of-queue job is schedulable on each

heartbeat •  Chance of heartbeat node having local data is low •  Jobs with blocks on X% of nodes get X% locality

0

20

40

60

80

100

0 20 40 60 80 100

Expe

cted

Nod

e Lo

calit

y (%

)

Job Size (Number of Maps)

Locality vs. Job Size in 100-Node Cluster

Page 10: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 2: Sticky Slots

•  Suppose we do fair sharing as follows: – Divide task slots equally between jobs – When a slot becomes free, give it to the job

that is farthest below its fair share

Page 11: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Slave

Slave

Slave

Slave

Slave

Slave

Slave

Slave

Problem 2: Sticky Slots

Job1

Job2

Master

Page 12: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Slave

Slave

Slave

Slave

Problem 2: Sticky Slots

Master

JobFairShare

RunningTasks

Job1 2 2

Job2 2 2

Slave JobFairShare

RunningTasks

Job1 2 1 Job2 2 2

Problem: Jobs never leave their original slots

Page 13: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Calculations

0

20

40

60

80

100

120

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Expe

cted

Nod

e Lo

calit

y (%

)

Number of Concurrent Jobs

Locality vs. Concurrent Jobs in 100-Node Cluster

Page 14: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Solution: Locality Wait

•  Scan through job queue in order of priority •  Jobs must wait before they are allowed to

run non-local tasks –  If wait < T1, only allow node-local tasks –  If T1 < wait < T2, also allow rack-local –  If wait > T2, also allow off-rack

Page 15: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Slave

Slave

Slave

Slave

Locality Wait Example

Master

JobFairShare

RunningTasks

Job1 2 2

Job2 2 2

Slave JobFairShare

RunningTasks

Job1 2 1 

Job2 2 2

Jobs can now shift between slots

Slave JobFairShare

RunningTasks

Job1 2 1 

Job2 2 3 SlaveSlave

Page 16: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Evaluation – Locality Gains

Job Type Default Scheduler

Node Loc. Rack Loc. Small Sort 2% 50% Small Scan 2% 50%

Medium Scan 37% 98% Large Scan 84% 99%

With Locality Wait Node Loc. Rack Loc.

81% 96% 75% 94% 99% 99% 94% 99%

Page 17: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Throughput Gains

0

0.2

0.4

0.6

0.8

1

1.2

Small Sort Small Scan Medium Scan

Large Scan

Nor

mal

ized

Run

ning

Tim

e

Default Scheduler With Locality Wait

70% 31% 20% 18%

Page 18: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Network Traffic Reduction

With locality wait Without locality wait

Network Traffic in Sort Workload

Page 19: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Further Analysis

•  When is it worthwhile to wait, and how long?

•  For throughput: – Always worth it, unless there’s a hotspot –  If hotspot, prefer to run IO-bound tasks on the

hotspot node and CPU-bound tasks remotely (rationale: maximize rate of local IO)

Page 20: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Further Analysis

•  When is it worthwhile to wait, and how long?

•  For response time:   E(gain) = (1 – e-w/t)(D – t)

– Worth it if E(wait) < cost of running non-locally – Optimal wait time is infinity

Delay from running non-locally

Expected time to get local heartbeat Wait amount

Page 21: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 3: Memory-Aware Scheduling

a)  How much memory does each job need? – Asking users for per-job memory limits leads

to overestimation – Use historic data about working set size?

b)  High-memory jobs may starve – Reservation scheme + finish time estimation?

Page 22: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 4: Reduce Scheduling

Job 1

Job 2

Time

Map

s

Time

Red

uces

Page 23: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 4: Reduce Scheduling

Job 1

Job 2

Time

Map

s

Time

Red

uces

Job 2 maps done

Page 24: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 4: Reduce Scheduling

Job 1

Job 2

Time

Map

s

Time

Red

uces

Job 2 reduces done

Page 25: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 4: Reduce Scheduling

Job 1

Job 2

Job 3

Time

Map

s

Time

Red

uces

Job 3 submitted

Page 26: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 4: Reduce Scheduling

Job 1

Job 2

Job 3

Time

Map

s

Time

Red

uces

Job 3 maps done

Page 27: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 4: Reduce Scheduling

Job 1

Job 2

Job 3

Time

Map

s

Time

Red

uces

Page 28: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Problem 4: Reduce Scheduling

Job 1

Job 2

Job 3

Time

Map

s

Time

Red

uces

Problem: Job 3 can’t launch reduces until Job 1 finishes

Page 29: Job Scheduling for MapReduce - MIT · Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Scott Shenker, Ion Stoica 1 RAD Lab, *Facebook Inc . Motivation

Conclusion

•  Simple idea improves throughput by 70% •  Lots of future work:

– Memory-aware scheduling – Reduce scheduling –  Intermediate-data-aware scheduling – Using past history / learning job properties – Evaluation using richer benchmarks