Mesos Study Report 03 v1.2
DESCRIPTION
Study report of Apache Mesos
TRANSCRIPT
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Background
• Rapid innovation in cluster computing frameworks
Problem
• Rapid innovation in cluster computing frameworks
• No single framework optimal for all applications
• Want to run multiple frameworks in a single cluster
» …to maximize utilization
» …to share data between frameworks
Where We Want to Go
Solution
• Mesos is a common resource sharing layer over which diverse frameworks can run
Mesos Goals
• High utilization of resources
• Support diverse frameworks (current & future)
• Scalability to 10,000s of nodes
• Reliability in face of failures
Mesos
• Fine-grained sharing
» Improved utilization, responsiveness, and data locality
• Resource offers
» Offer available resources to frameworks; let them pick which resources to use and which tasks to launch
» Keeps Mesos simple and lets it support future frameworks
Mesos Architecture
Mesos architecture diagram, showing two running frameworks
Resource Offers
• Mesos decides how many resources to offer each framework, based on an organizational policy such as fair sharing, while frameworks decide which resources to accept and which tasks to run on them
• A framework can reject resources that do not satisfy its constraints in order to wait for ones that do
• Mesos thus delegates control over scheduling, pushing decisions about task placement and execution down to the frameworks
Resource Offers
• Mesos consists of a master process that manages slave daemons running on each cluster node, and frameworks that run tasks on these slaves.
• Each resource offer is a list of free resources on multiple slaves.
• Each framework running on Mesos consists of two components:
» a scheduler that registers with the master to be offered resources
» an executor process that is launched on slave nodes to run the framework’s tasks
• When a framework accepts offered resources, it passes Mesos a description of the tasks it wants to launch on them
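The scheduler/executor split described above can be sketched as a simple offer/accept loop. This is a minimal illustration, not the real Mesos API: all class and function names here (`Offer`, `GreedyScheduler`, `master_offer_round`) are hypothetical, and the real protocol is asynchronous RPC between master, slaves, and schedulers.

```python
# Hypothetical sketch of the Mesos offer/accept cycle. The master builds one
# offer per slave from that slave's free resources; the framework's scheduler
# decides which offers to use and which tasks to launch on them.

class Offer:
    def __init__(self, slave_id, cpus, mem_gb):
        self.slave_id, self.cpus, self.mem_gb = slave_id, cpus, mem_gb

class GreedyScheduler:
    """A toy framework scheduler: accept any offer with at least 1 CPU."""
    def resource_offers(self, offers):
        tasks = []
        for o in offers:
            if o.cpus >= 1:
                # Launch one 1-CPU, 1-GB task on the offered slave.
                tasks.append({"slave": o.slave_id, "cpus": 1, "mem_gb": 1})
            # Offers with no usable CPU are implicitly declined.
        return tasks

def master_offer_round(free, scheduler):
    """Master side: offer each slave's free resources, then deduct whatever
    the framework accepted."""
    offers = [Offer(s, r["cpus"], r["mem_gb"]) for s, r in free.items()]
    tasks = scheduler.resource_offers(offers)
    for t in tasks:
        free[t["slave"]]["cpus"] -= t["cpus"]
        free[t["slave"]]["mem_gb"] -= t["mem_gb"]
    return tasks

free = {"slave1": {"cpus": 4, "mem_gb": 8}, "slave2": {"cpus": 0, "mem_gb": 2}}
tasks = master_offer_round(free, GreedyScheduler())
```

Note the division of labor: the master never inspects task semantics, and the framework never sees resources it was not offered, which is what keeps Mesos simple.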
Resource Offers
Resource offer example
Optimization: Filters
• Let frameworks short-circuit rejection by providing a predicate on resources to be offered
» E.g., “nodes from list L” or “nodes with > 8 GB RAM”
» Could generalize to other hints as well
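A filter is just a predicate the master evaluates before making an offer, so rejected resources never travel to the framework and back. The sketch below is illustrative (`make_filter` and `offers_for` are invented names), but it shows both example filters from the slide.

```python
# Hypothetical filter sketch: a framework installs a predicate, and the
# master skips slaves the predicate rejects instead of offering them and
# waiting for a decline.

def make_filter(min_mem_gb=0, allowed_nodes=None):
    def accept(slave_id, resources):
        if resources["mem_gb"] < min_mem_gb:       # "nodes with > 8 GB RAM"
            return False
        if allowed_nodes is not None and slave_id not in allowed_nodes:
            return False                           # "nodes from list L"
        return True
    return accept

def offers_for(framework_filter, free):
    """Master side: only offer slaves that pass the framework's filter."""
    return [s for s, r in free.items() if framework_filter(s, r)]

free = {"a": {"mem_gb": 16}, "b": {"mem_gb": 4}, "c": {"mem_gb": 32}}
matching = offers_for(make_filter(min_mem_gb=8), free)
```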
Analysis
• Resource offers work well when:
» Frameworks can scale up and down elastically
» Task durations are homogeneous
» Frameworks have many preferred nodes
• These conditions hold in current data analytics frameworks (MapReduce, Dryad, …)
» Work divided into short tasks to facilitate load balancing and fault recovery
» Data replicated across multiple nodes
Resource Allocation
• Mesos delegates allocation decisions to a pluggable allocation module, so that organizations can tailor allocation to their needs.
• Have implemented two allocation modules:
» one that performs fair sharing based on a generalization of max-min fairness for multiple resources (DRF)
» one that implements strict priorities
• Task revocation
» if a cluster becomes filled by long tasks, e.g., due to a buggy job or a greedy framework, the allocation module can also revoke (kill) tasks
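The core of the DRF module is easy to state: a framework's dominant share is its largest fractional share across resource types, and resources are offered next to the framework with the lowest dominant share. The sketch below assumes fixed cluster totals and invented helper names; it shows the selection rule only, not the full allocation module.

```python
# Minimal sketch of the DRF selection rule (names and totals hypothetical).
TOTAL = {"cpus": 100, "mem_gb": 400}  # assumed cluster capacity

def dominant_share(allocated):
    """A framework's dominant share: its largest per-resource share."""
    return max(allocated[r] / TOTAL[r] for r in TOTAL)

def next_framework(allocations):
    """DRF offers resources next to the framework with the smallest
    dominant share (max-min fairness on dominant shares)."""
    return min(allocations, key=lambda f: dominant_share(allocations[f]))

allocations = {
    "A": {"cpus": 30, "mem_gb": 20},   # dominant share 0.30 (CPU-heavy)
    "B": {"cpus": 10, "mem_gb": 100},  # dominant share 0.25 (memory-heavy)
}
chosen = next_framework(allocations)   # B, despite holding more memory
```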
Fault Tolerance
• Master failover using ZooKeeper
• Mesos master has only soft state: the list of active slaves, active frameworks, and running tasks
» a new master can completely reconstruct its internal state from information held by the slaves and the framework schedulers
• When the active master fails, the slaves and schedulers connect to the next elected master and repopulate its state.
• Aside from handling master failures, Mesos reports node failures and executor crashes to frameworks’ schedulers.
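The soft-state design can be illustrated as a pure reconstruction step: nothing is read from disk or a replicated log; the new master simply aggregates what re-registering slaves and schedulers report. This is a toy sketch with invented names, not the actual failover protocol.

```python
# Illustrative soft-state reconstruction after master failover: the new
# master's entire state (slaves, frameworks, running tasks) is rebuilt
# from what the slaves and framework schedulers re-report.

def rebuild_master_state(slave_reports, framework_registrations):
    state = {"slaves": set(), "frameworks": set(), "tasks": []}
    for slave_id, running_tasks in slave_reports.items():
        state["slaves"].add(slave_id)          # slave re-registered
        state["tasks"].extend(running_tasks)   # tasks reported by the slave
    state["frameworks"].update(framework_registrations)
    return state

slave_reports = {
    "slave1": [("fw1", "task-1"), ("fw2", "task-7")],
    "slave2": [("fw1", "task-2")],
}
state = rebuild_master_state(slave_reports, ["fw1", "fw2"])
```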
Isolation
• Mesos provides performance isolation between framework executors running on the same slave by leveraging existing OS isolation mechanisms
• Resources are currently isolated using OS container technologies, specifically Linux Containers and Solaris Projects
• These technologies can limit the CPU, memory, network bandwidth, and (in new Linux kernels) I/O usage of a process tree
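To make the limits concrete, the sketch below maps a task's resources onto two real Linux cgroup (v1) knobs: `cpu.shares` (relative CPU weight) and `memory.limit_in_bytes` (hard memory cap). Computing the values in a dict is purely illustrative; an actual isolation module would write them to the container's cgroup via the OS.

```python
# Illustrative mapping from offered resources to cgroup-style limits.
# cpu.shares and memory.limit_in_bytes are real cgroup v1 control files;
# everything else here (the function, the dict) is a hypothetical sketch.

def executor_limits(cpus, mem_gb):
    return {
        "cpu.shares": int(cpus * 1024),         # 1024 shares per CPU
        "memory.limit_in_bytes": mem_gb << 30,  # hard memory cap in bytes
    }

limits = executor_limits(cpus=2, mem_gb=4)
```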
Data Locality with Resource Offers
• Ran 16 instances of Hadoop on a shared HDFS cluster
• Used delay scheduling in Hadoop to get locality (wait a short time to acquire data-local nodes)
Scalability
• Mesos only performs inter-framework scheduling (e.g., fair sharing), which is easier than intra-framework scheduling
• Result:
» Scaled to 50,000 emulated slaves, 200 frameworks, 100K tasks (30 s task length)
Conclusion
• Mesos shares clusters efficiently among diverse frameworks thanks to two design elements:
» Fine‐grained sharing at the level of tasks
» Resource offers, a scalable mechanism for application-controlled scheduling
• Enables co‐existence of current frameworks and development of new specialized ones
• In use at Twitter, UC Berkeley, Conviva, and UCSF