1
Michael Whittaker
2
3
4
Hadoop
5
Hadoop
6
Hadoop
7
Hadoop
8
Hadoop
9
Hadoop
10
Hadoop
11
Hadoop Storm
12
Hadoop Storm
13
Hadoop Storm
14
Hadoop Storm
15
Hadoop Storm
16
Hadoop Storm
17
Hadoop Storm
18
Hadoop Storm
19
Hadoop Storm Spark
Monolithic Cluster Manager
20
Hadoop Storm Spark ???
Monolithic Cluster Manager
21
Hadoop Storm Spark ???
Mesos
Mesos slave Mesos slave Mesos slave MPI
executor
task
Hadoop executor
task
MPI executor
task task
Hadoop executor
task task
Mesos master
Hadoop scheduler
MPI scheduler
Standby master
Standby master
ZooKeeper quorum
Figure 2: Mesos architecture diagram, showing two runningframeworks (Hadoop and MPI).
3 ArchitectureWe begin our description of Mesos by discussing our de-sign philosophy. We then describe the components ofMesos, our resource allocation mechanisms, and howMesos achieves isolation, scalability, and fault tolerance.
3.1 Design Philosophy
Mesos aims to provide a scalable and resilient core forenabling various frameworks to efficiently share clusters.Because cluster frameworks are both highly diverse andrapidly evolving, our overriding design philosophy hasbeen to define a minimal interface that enables efficientresource sharing across frameworks, and otherwise pushcontrol of task scheduling and execution to the frame-works. Pushing control to the frameworks has two bene-fits. First, it allows frameworks to implement diverse ap-proaches to various problems in the cluster (e.g., achiev-ing data locality, dealing with faults), and to evolve thesesolutions independently. Second, it keeps Mesos simpleand minimizes the rate of change required of the system,which makes it easier to keep Mesos scalable and robust.
Although Mesos provides a low-level interface, we ex-pect higher-level libraries implementing common func-tionality (such as fault tolerance) to be built on top ofit. These libraries would be analogous to library OSes inthe exokernel [20]. Putting this functionality in librariesrather than in Mesos allows Mesos to remain small andflexible, and lets the libraries evolve independently.
3.2 Overview
Figure 2 shows the main components of Mesos. Mesosconsists of a master process that manages slave daemonsrunning on each cluster node, and frameworks that runtasks on these slaves.
The master implements fine-grained sharing acrossframeworks using resource offers. Each resource offeris a list of free resources on multiple slaves. The masterdecides how many resources to offer to each frameworkaccording to an organizational policy, such as fair sharing
FW Scheduler Job 1 Job 2 Framework 1
Allocation module
Mesos master
<s1, 4cpu, 4gb, … > 1 <fw1, task1, 2cpu, 1gb, … > <fw1, task2, 1cpu, 2gb, … > 4
Slave 1
Task Executor
Task
FW Scheduler Job 1 Job 2 Framework 2
Task Executor
Task
Slave 2
<s1, 4cpu, 4gb, … > <task1, s1, 2cpu, 1gb, … > <task2, s1, 1cpu, 2gb, … > 3 2
Figure 3: Resource offer example.
or priority. To support a diverse set of inter-frameworkallocation policies, Mesos lets organizations define theirown policies via a pluggable allocation module.
Each framework running on Mesos consists of twocomponents: a scheduler that registers with the masterto be offered resources, and an executor process that islaunched on slave nodes to run the framework’s tasks.While the master determines how many resources to of-fer to each framework, the frameworks’ schedulers selectwhich of the offered resources to use. When a frameworkaccepts offered resources, it passes Mesos a descriptionof the tasks it wants to launch on them.
Figure 3 shows an example of how a framework getsscheduled to run tasks. In step (1), slave 1 reportsto the master that it has 4 CPUs and 4 GB of mem-ory free. The master then invokes the allocation mod-ule, which tells it that framework 1 should be offeredall available resources. In step (2), the master sends aresource offer describing these resources to framework1. In step (3), the framework’s scheduler replies to themaster with information about two tasks to run on theslave, using 〈2 CPUs, 1 GB RAM〉 for the first task, and〈1 CPUs, 2 GB RAM〉 for the second task. Finally, instep (4), the master sends the tasks to the slave, which al-locates appropriate resources to the framework’s execu-tor, which in turn launches the two tasks (depicted withdotted borders). Because 1 CPU and 1 GB of RAM arestill free, the allocation module may now offer them toframework 2. In addition, this resource offer process re-peats when tasks finish and new resources become free.
To maintain a thin interface and enable frameworksto evolve independently, Mesos does not require frame-works to specify their resource requirements or con-straints. Instead, Mesos gives frameworks the ability toreject offers. A framework can reject resources that donot satisfy its constraints in order to wait for ones thatdo. Thus, the rejection mechanism enables frameworksto support arbitrarily complex resource constraints whilekeeping Mesos simple and scalable.
One potential challenge with solely using the rejec-
3
Source: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Mesos slave Mesos slave Mesos slave MPI
executor
task
Hadoop executor
task
MPI executor
task task
Hadoop executor
task task
Mesos master
Hadoop scheduler
MPI scheduler
Standby master
Standby master
ZooKeeper quorum
Figure 2: Mesos architecture diagram, showing two runningframeworks (Hadoop and MPI).
3 ArchitectureWe begin our description of Mesos by discussing our de-sign philosophy. We then describe the components ofMesos, our resource allocation mechanisms, and howMesos achieves isolation, scalability, and fault tolerance.
3.1 Design Philosophy
Mesos aims to provide a scalable and resilient core forenabling various frameworks to efficiently share clusters.Because cluster frameworks are both highly diverse andrapidly evolving, our overriding design philosophy hasbeen to define a minimal interface that enables efficientresource sharing across frameworks, and otherwise pushcontrol of task scheduling and execution to the frame-works. Pushing control to the frameworks has two bene-fits. First, it allows frameworks to implement diverse ap-proaches to various problems in the cluster (e.g., achiev-ing data locality, dealing with faults), and to evolve thesesolutions independently. Second, it keeps Mesos simpleand minimizes the rate of change required of the system,which makes it easier to keep Mesos scalable and robust.
Although Mesos provides a low-level interface, we ex-pect higher-level libraries implementing common func-tionality (such as fault tolerance) to be built on top ofit. These libraries would be analogous to library OSes inthe exokernel [20]. Putting this functionality in librariesrather than in Mesos allows Mesos to remain small andflexible, and lets the libraries evolve independently.
3.2 Overview
Figure 2 shows the main components of Mesos. Mesosconsists of a master process that manages slave daemonsrunning on each cluster node, and frameworks that runtasks on these slaves.
The master implements fine-grained sharing acrossframeworks using resource offers. Each resource offeris a list of free resources on multiple slaves. The masterdecides how many resources to offer to each frameworkaccording to an organizational policy, such as fair sharing
FW Scheduler Job 1 Job 2 Framework 1
Allocation module
Mesos master
<s1, 4cpu, 4gb, … > 1 <fw1, task1, 2cpu, 1gb, … > <fw1, task2, 1cpu, 2gb, … > 4
Slave 1
Task Executor
Task
FW Scheduler Job 1 Job 2 Framework 2
Task Executor
Task
Slave 2
<s1, 4cpu, 4gb, … > <task1, s1, 2cpu, 1gb, … > <task2, s1, 1cpu, 2gb, … > 3 2
Figure 3: Resource offer example.
or priority. To support a diverse set of inter-frameworkallocation policies, Mesos lets organizations define theirown policies via a pluggable allocation module.
Each framework running on Mesos consists of twocomponents: a scheduler that registers with the masterto be offered resources, and an executor process that islaunched on slave nodes to run the framework’s tasks.While the master determines how many resources to of-fer to each framework, the frameworks’ schedulers selectwhich of the offered resources to use. When a frameworkaccepts offered resources, it passes Mesos a descriptionof the tasks it wants to launch on them.
Figure 3 shows an example of how a framework getsscheduled to run tasks. In step (1), slave 1 reportsto the master that it has 4 CPUs and 4 GB of mem-ory free. The master then invokes the allocation mod-ule, which tells it that framework 1 should be offeredall available resources. In step (2), the master sends aresource offer describing these resources to framework1. In step (3), the framework’s scheduler replies to themaster with information about two tasks to run on theslave, using 〈2 CPUs, 1 GB RAM〉 for the first task, and〈1 CPUs, 2 GB RAM〉 for the second task. Finally, instep (4), the master sends the tasks to the slave, which al-locates appropriate resources to the framework’s execu-tor, which in turn launches the two tasks (depicted withdotted borders). Because 1 CPU and 1 GB of RAM arestill free, the allocation module may now offer them toframework 2. In addition, this resource offer process re-peats when tasks finish and new resources become free.
To maintain a thin interface and enable frameworksto evolve independently, Mesos does not require frame-works to specify their resource requirements or con-straints. Instead, Mesos gives frameworks the ability toreject offers. A framework can reject resources that donot satisfy its constraints in order to wait for ones thatdo. Thus, the rejection mechanism enables frameworksto support arbitrarily complex resource constraints whilekeeping Mesos simple and scalable.
One potential challenge with solely using the rejec-
3
Source: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
24
Resource Allocation
I Q: How are resources offered to frameworks?
I A: Pluggable allocation module determines how resources areoffered to frameworks.
I Q: When are resources offered to frameworks?
I A: Mesos assumes tasks are short and offers resources whentasks end.
I Q: What if tasks aren’t short?
I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.
I Q: What if jobs don’t want to die?
I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.
24
Resource Allocation
I Q: How are resources offered to frameworks?
I A: Pluggable allocation module determines how resources areoffered to frameworks.
I Q: When are resources offered to frameworks?
I A: Mesos assumes tasks are short and offers resources whentasks end.
I Q: What if tasks aren’t short?
I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.
I Q: What if jobs don’t want to die?
I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.
24
Resource Allocation
I Q: How are resources offered to frameworks?
I A: Pluggable allocation module determines how resources areoffered to frameworks.
I Q: When are resources offered to frameworks?
I A: Mesos assumes tasks are short and offers resources whentasks end.
I Q: What if tasks aren’t short?
I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.
I Q: What if jobs don’t want to die?
I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.
24
Resource Allocation
I Q: How are resources offered to frameworks?
I A: Pluggable allocation module determines how resources areoffered to frameworks.
I Q: When are resources offered to frameworks?
I A: Mesos assumes tasks are short and offers resources whentasks end.
I Q: What if tasks aren’t short?
I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.
I Q: What if jobs don’t want to die?
I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.
24
Resource Allocation
I Q: How are resources offered to frameworks?
I A: Pluggable allocation module determines how resources areoffered to frameworks.
I Q: When are resources offered to frameworks?
I A: Mesos assumes tasks are short and offers resources whentasks end.
I Q: What if tasks aren’t short?
I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.
I Q: What if jobs don’t want to die?
I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.
24
Resource Allocation
I Q: How are resources offered to frameworks?
I A: Pluggable allocation module determines how resources areoffered to frameworks.
I Q: When are resources offered to frameworks?
I A: Mesos assumes tasks are short and offers resources whentasks end.
I Q: What if tasks aren’t short?
I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.
I Q: What if jobs don’t want to die?
I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.
24
Resource Allocation
I Q: How are resources offered to frameworks?
I A: Pluggable allocation module determines how resources areoffered to frameworks.
I Q: When are resources offered to frameworks?
I A: Mesos assumes tasks are short and offers resources whentasks end.
I Q: What if tasks aren’t short?
I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.
I Q: What if jobs don’t want to die?
I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.
24
Resource Allocation
I Q: How are resources offered to frameworks?
I A: Pluggable allocation module determines how resources areoffered to frameworks.
I Q: When are resources offered to frameworks?
I A: Mesos assumes tasks are short and offers resources whentasks end.
I Q: What if tasks aren’t short?
I A: Mesos can kill tasks, giving the framework a grace periodfor cleaning up.
I Q: What if jobs don’t want to die?
I A: Mesos provides each framework with a guarenteedallocation. So long as framework uses less than it’sguarenteed allocation, it’s jobs won’t be killed.
25
Demo
26
Mesos Behavior
Mesos performs best withelastic frameworks and
homogenous task durations.
27
Rigid Framework
I Ramp-up time: T
I Completion time: (1 + β)T
I Utilization: β12+β
n = k
T T T
βT
(1 + β)T
kβTkT2
27
Rigid Framework
I Ramp-up time: T
I Completion time: (1 + β)T
I Utilization: β12+β
n = k
T T T
βT
(1 + β)T
kβTkT2
28
Elastic Framework
I Ramp-up time: T
I Completion time: (12 + β)T
I Utilization: 1
n = k
T T T2
βT
(12 + β)T
29
Implementation
I 10,000 lines of C++
I Supported Hadoop, Torque, and Spark
I Lots of impressive performance benchmarks
30
Lessons
I Clusters need schedulers to improve utilization
I Schedulers should form a narrow waist between frameworksand the cluster
I Be simple
I Adhere to the end-to-end argument
31
Questions
32
Extras
33
Isolation
https://goo.gl/dqQ6lG
34
Scalability and Robustness
I Frameworks can install filters with the Mesos master.
I Offered resources count against a framework’s allocation.
I If a frameworks is slow to respond, Mesos can rescind offers.
35
Fault Tolerance
I Mesos uses soft state derivable from slaves and frameworks.
I Hot standby replicas.
I Each framework can install multiple schedulers.
36
Placement Preferences
Placement preferences can be achieved with delay and lotteryscheduling.
37
Heterogenous Tasks
If the number of slots on each machine is big, the chances that amachine will be filled completely with long lived tasks is small.