Transcript
Page 1: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

1

Michael Whittaker

Page 2: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

2

Page 3: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

3

Page 4: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

4

Hadoop

Page 5: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

5

Hadoop

Page 6: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

6

Hadoop

Page 7: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

7

Hadoop

Page 8: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

8

Hadoop

Page 9: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

9

Hadoop

Page 10: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

10

Hadoop

Page 11: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

11

Hadoop Storm

Page 12: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

12

Hadoop Storm

Page 13: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

13

Hadoop Storm

Page 14: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

14

Hadoop Storm

Page 15: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

15

Hadoop Storm

Page 16: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

16

Hadoop Storm

Page 17: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

17

Hadoop Storm

Page 18: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

18

Hadoop Storm

Page 19: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

19

Hadoop Storm Spark

Monolithic Cluster Manager

Page 20: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

20

Hadoop Storm Spark ???

Monolithic Cluster Manager

Page 21: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

21

Hadoop Storm Spark ???

Mesos

Page 22: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

Mesos slave Mesos slave Mesos slave MPI

executor

task

Hadoop executor

task

MPI executor

task task

Hadoop executor

task task

Mesos master

Hadoop scheduler

MPI scheduler

Standby master

Standby master

ZooKeeper quorum

Figure 2: Mesos architecture diagram, showing two runningframeworks (Hadoop and MPI).

3 ArchitectureWe begin our description of Mesos by discussing our de-sign philosophy. We then describe the components ofMesos, our resource allocation mechanisms, and howMesos achieves isolation, scalability, and fault tolerance.

3.1 Design Philosophy

Mesos aims to provide a scalable and resilient core forenabling various frameworks to efficiently share clusters.Because cluster frameworks are both highly diverse andrapidly evolving, our overriding design philosophy hasbeen to define a minimal interface that enables efficientresource sharing across frameworks, and otherwise pushcontrol of task scheduling and execution to the frame-works. Pushing control to the frameworks has two bene-fits. First, it allows frameworks to implement diverse ap-proaches to various problems in the cluster (e.g., achiev-ing data locality, dealing with faults), and to evolve thesesolutions independently. Second, it keeps Mesos simpleand minimizes the rate of change required of the system,which makes it easier to keep Mesos scalable and robust.

Although Mesos provides a low-level interface, we ex-pect higher-level libraries implementing common func-tionality (such as fault tolerance) to be built on top ofit. These libraries would be analogous to library OSes inthe exokernel [20]. Putting this functionality in librariesrather than in Mesos allows Mesos to remain small andflexible, and lets the libraries evolve independently.

3.2 Overview

Figure 2 shows the main components of Mesos. Mesosconsists of a master process that manages slave daemonsrunning on each cluster node, and frameworks that runtasks on these slaves.

The master implements fine-grained sharing acrossframeworks using resource offers. Each resource offeris a list of free resources on multiple slaves. The masterdecides how many resources to offer to each frameworkaccording to an organizational policy, such as fair sharing

FW Scheduler Job 1 Job 2 Framework 1

Allocation module

Mesos master

<s1, 4cpu, 4gb, … > 1 <fw1, task1, 2cpu, 1gb, … > <fw1, task2, 1cpu, 2gb, … > 4

Slave 1

Task Executor

Task

FW Scheduler Job 1 Job 2 Framework 2

Task Executor

Task

Slave 2

<s1, 4cpu, 4gb, … > <task1, s1, 2cpu, 1gb, … > <task2, s1, 1cpu, 2gb, … > 3 2

Figure 3: Resource offer example.

or priority. To support a diverse set of inter-frameworkallocation policies, Mesos lets organizations define theirown policies via a pluggable allocation module.

Each framework running on Mesos consists of twocomponents: a scheduler that registers with the masterto be offered resources, and an executor process that islaunched on slave nodes to run the framework’s tasks.While the master determines how many resources to of-fer to each framework, the frameworks’ schedulers selectwhich of the offered resources to use. When a frameworkaccepts offered resources, it passes Mesos a descriptionof the tasks it wants to launch on them.

Figure 3 shows an example of how a framework getsscheduled to run tasks. In step (1), slave 1 reportsto the master that it has 4 CPUs and 4 GB of mem-ory free. The master then invokes the allocation mod-ule, which tells it that framework 1 should be offeredall available resources. In step (2), the master sends aresource offer describing these resources to framework1. In step (3), the framework’s scheduler replies to themaster with information about two tasks to run on theslave, using 〈2 CPUs, 1 GB RAM〉 for the first task, and〈1 CPUs, 2 GB RAM〉 for the second task. Finally, instep (4), the master sends the tasks to the slave, which al-locates appropriate resources to the framework’s execu-tor, which in turn launches the two tasks (depicted withdotted borders). Because 1 CPU and 1 GB of RAM arestill free, the allocation module may now offer them toframework 2. In addition, this resource offer process re-peats when tasks finish and new resources become free.

To maintain a thin interface and enable frameworksto evolve independently, Mesos does not require frame-works to specify their resource requirements or con-straints. Instead, Mesos gives frameworks the ability toreject offers. A framework can reject resources that donot satisfy its constraints in order to wait for ones thatdo. Thus, the rejection mechanism enables frameworksto support arbitrarily complex resource constraints whilekeeping Mesos simple and scalable.

One potential challenge with solely using the rejec-

3

Source: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Page 23: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

Mesos slave Mesos slave Mesos slave MPI

executor

task

Hadoop executor

task

MPI executor

task task

Hadoop executor

task task

Mesos master

Hadoop scheduler

MPI scheduler

Standby master

Standby master

ZooKeeper quorum

Figure 2: Mesos architecture diagram, showing two runningframeworks (Hadoop and MPI).

3 ArchitectureWe begin our description of Mesos by discussing our de-sign philosophy. We then describe the components ofMesos, our resource allocation mechanisms, and howMesos achieves isolation, scalability, and fault tolerance.

3.1 Design Philosophy

Mesos aims to provide a scalable and resilient core forenabling various frameworks to efficiently share clusters.Because cluster frameworks are both highly diverse andrapidly evolving, our overriding design philosophy hasbeen to define a minimal interface that enables efficientresource sharing across frameworks, and otherwise pushcontrol of task scheduling and execution to the frame-works. Pushing control to the frameworks has two bene-fits. First, it allows frameworks to implement diverse ap-proaches to various problems in the cluster (e.g., achiev-ing data locality, dealing with faults), and to evolve thesesolutions independently. Second, it keeps Mesos simpleand minimizes the rate of change required of the system,which makes it easier to keep Mesos scalable and robust.

Although Mesos provides a low-level interface, we ex-pect higher-level libraries implementing common func-tionality (such as fault tolerance) to be built on top ofit. These libraries would be analogous to library OSes inthe exokernel [20]. Putting this functionality in librariesrather than in Mesos allows Mesos to remain small andflexible, and lets the libraries evolve independently.

3.2 Overview

Figure 2 shows the main components of Mesos. Mesosconsists of a master process that manages slave daemonsrunning on each cluster node, and frameworks that runtasks on these slaves.

The master implements fine-grained sharing acrossframeworks using resource offers. Each resource offeris a list of free resources on multiple slaves. The masterdecides how many resources to offer to each frameworkaccording to an organizational policy, such as fair sharing

FW Scheduler Job 1 Job 2 Framework 1

Allocation module

Mesos master

<s1, 4cpu, 4gb, … > 1 <fw1, task1, 2cpu, 1gb, … > <fw1, task2, 1cpu, 2gb, … > 4

Slave 1

Task Executor

Task

FW Scheduler Job 1 Job 2 Framework 2

Task Executor

Task

Slave 2

<s1, 4cpu, 4gb, … > <task1, s1, 2cpu, 1gb, … > <task2, s1, 1cpu, 2gb, … > 3 2

Figure 3: Resource offer example.

or priority. To support a diverse set of inter-frameworkallocation policies, Mesos lets organizations define theirown policies via a pluggable allocation module.

Each framework running on Mesos consists of twocomponents: a scheduler that registers with the masterto be offered resources, and an executor process that islaunched on slave nodes to run the framework’s tasks.While the master determines how many resources to of-fer to each framework, the frameworks’ schedulers selectwhich of the offered resources to use. When a frameworkaccepts offered resources, it passes Mesos a descriptionof the tasks it wants to launch on them.

Figure 3 shows an example of how a framework getsscheduled to run tasks. In step (1), slave 1 reportsto the master that it has 4 CPUs and 4 GB of mem-ory free. The master then invokes the allocation mod-ule, which tells it that framework 1 should be offeredall available resources. In step (2), the master sends aresource offer describing these resources to framework1. In step (3), the framework’s scheduler replies to themaster with information about two tasks to run on theslave, using 〈2 CPUs, 1 GB RAM〉 for the first task, and〈1 CPUs, 2 GB RAM〉 for the second task. Finally, instep (4), the master sends the tasks to the slave, which al-locates appropriate resources to the framework’s execu-tor, which in turn launches the two tasks (depicted withdotted borders). Because 1 CPU and 1 GB of RAM arestill free, the allocation module may now offer them toframework 2. In addition, this resource offer process re-peats when tasks finish and new resources become free.

To maintain a thin interface and enable frameworksto evolve independently, Mesos does not require frame-works to specify their resource requirements or con-straints. Instead, Mesos gives frameworks the ability toreject offers. A framework can reject resources that donot satisfy its constraints in order to wait for ones thatdo. Thus, the rejection mechanism enables frameworksto support arbitrarily complex resource constraints whilekeeping Mesos simple and scalable.

One potential challenge with solely using the rejec-

3

Source: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Page 24: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

24

Resource Allocation

I Q: How are resources offered to frameworks?

I A: Pluggable allocation module determines how resources areoffered to frameworks.

I Q: When are resources offered to frameworks?

I A: Mesos assumes tasks are short and offers resources whentasks end.

I Q: What if tasks aren’t short?

I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.

I Q: What if jobs don’t want to die?

I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.

Page 25: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

24

Resource Allocation

I Q: How are resources offered to frameworks?

I A: Pluggable allocation module determines how resources areoffered to frameworks.

I Q: When are resources offered to frameworks?

I A: Mesos assumes tasks are short and offers resources whentasks end.

I Q: What if tasks aren’t short?

I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.

I Q: What if jobs don’t want to die?

I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.

Page 26: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

24

Resource Allocation

I Q: How are resources offered to frameworks?

I A: Pluggable allocation module determines how resources areoffered to frameworks.

I Q: When are resources offered to frameworks?

I A: Mesos assumes tasks are short and offers resources whentasks end.

I Q: What if tasks aren’t short?

I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.

I Q: What if jobs don’t want to die?

I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.

Page 27: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

24

Resource Allocation

I Q: How are resources offered to frameworks?

I A: Pluggable allocation module determines how resources areoffered to frameworks.

I Q: When are resources offered to frameworks?

I A: Mesos assumes tasks are short and offers resources whentasks end.

I Q: What if tasks aren’t short?

I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.

I Q: What if jobs don’t want to die?

I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.

Page 28: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

24

Resource Allocation

I Q: How are resources offered to frameworks?

I A: Pluggable allocation module determines how resources areoffered to frameworks.

I Q: When are resources offered to frameworks?

I A: Mesos assumes tasks are short and offers resources whentasks end.

I Q: What if tasks aren’t short?

I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.

I Q: What if jobs don’t want to die?

I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.

Page 29: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

24

Resource Allocation

I Q: How are resources offered to frameworks?

I A: Pluggable allocation module determines how resources areoffered to frameworks.

I Q: When are resources offered to frameworks?

I A: Mesos assumes tasks are short and offers resources whentasks end.

I Q: What if tasks aren’t short?

I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.

I Q: What if jobs don’t want to die?

I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.

Page 30: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

24

Resource Allocation

I Q: How are resources offered to frameworks?

I A: Pluggable allocation module determines how resources areoffered to frameworks.

I Q: When are resources offered to frameworks?

I A: Mesos assumes tasks are short and offers resources whentasks end.

I Q: What if tasks aren’t short?

I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.

I Q: What if jobs don’t want to die?

I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.

Page 31: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

24

Resource Allocation

I Q: How are resources offered to frameworks?

I A: Pluggable allocation module determines how resources areoffered to frameworks.

I Q: When are resources offered to frameworks?

I A: Mesos assumes tasks are short and offers resources whentasks end.

I Q: What if tasks aren’t short?

I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.

I Q: What if jobs don’t want to die?

I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.

Page 32: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

25

Demo

Page 33: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

26

Mesos Behavior

Mesos performs best withelastic frameworks and

homogenous task durations.

Page 34: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

27

Rigid Framework

I Ramp-up time: T

I Completion time: (1 + β)T

I Utilization: β12+β

n = k

T T T

βT

(1 + β)T

kβTkT2

Page 35: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

27

Rigid Framework

I Ramp-up time: T

I Completion time: (1 + β)T

I Utilization: β12+β

n = k

T T T

βT

(1 + β)T

kβTkT2

Page 36: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

28

Elastic Framework

I Ramp-up time: T

I Completion time: (12 + β)T

I Utilization: 1

n = k

T T T2

βT

(12 + β)T

Page 37: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

29

Implementation

I 10,000 lines of C++

I Supported Hadoop, Torque, and Spark

I Lots of impressive performance benchmarks

Page 38: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

30

Lessons

I Clusters need schedulers to improve utilization

I Schedulers should form a narrow waist between frameworksand the cluster

I Be simple

I Adhere to the end-to-end argument

Page 39: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

31

Questions

Page 40: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

32

Extras

Page 41: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

33

Isolation

https://goo.gl/dqQ6lG

Page 42: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

34

Scalability and Robustness

I Frameworks can install filters with the Mesos master.

I Offered resources count against a framework’s allocation.

I If a frameworks is slow to respond, Mesos can rescind offers.

Page 43: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

35

Fault Tolerance

I Mesos uses soft state derivable from slaves and frameworks.

I Hot standby replicas.

I Each framework can install multiple schedulers.

Page 44: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

36

Placement Preferences

Placement preferences can be achieved with delay and lotteryscheduling.

Page 45: Michael Whittaker · Figure 2: Mesos architecture diagram, showing two running frameworks (Hadoop and MPI). 3 Architecture We begin our description of Mesos by discussing our de-sign

37

Heterogenous Tasks

If the number of slots on each machine is big, the chances that amachine will be filled completely with long lived tasks is small.


Top Related