TRANSCRIPT
Mesos: A Platform for Fine-Grained Resource Sharing
in Data Centers (II)
UC BERKELEY
Anthony D. Joseph
LASER Summer School September 2013
My Talks at LASER 2013
1. AMP Lab introduction
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark
2
Collaborators
• Matei Zaharia
• Benjamin Hindman
• Andy Konwinski
• Ali Ghodsi
• Randy Katz
• Scott Shenker
• Ion Stoica
3
Apache Mesos
A common resource sharing layer for diverse frameworks
Run multiple instances of the same framework
» Isolate production and experimental jobs
» Run multiple versions of the framework concurrently
Support specialized frameworks for problem domains
[Figure: a datacenter "OS" (e.g., Apache Mesos) running diverse frameworks (Hadoop, Spark, SCADS, MPI, Hypertable, Cassandra, Hive, PIQL, ...) on top of heterogeneous node OSes (e.g., Linux, Windows)]
4
Implementation
20,000+ lines of C++
APIs in C, C++, Java, and Python
Master failover using ZooKeeper
Frameworks ported: Hadoop, MPI, Torque
New specialized frameworks: Spark, Apache/haproxy
Open source Apache project: http://mesos.apache.org/
5
Frameworks
Ported frameworks:
» Hadoop (900-line patch)
» MPI (160-line wrapper scripts)
New frameworks:
» Spark, Scala framework for iterative jobs (1300 lines)
» Apache+haproxy, elastic web server farm (200 lines)
6
Isolation
Mesos has pluggable isolation modules to isolate tasks sharing a node
Currently supports Linux Containers and Solaris projects
» Can isolate memory, CPU, I/O, and network bandwidth (see the sketch after this slide)
Could be a great place to use VMs
7
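To make the isolation idea concrete, here is a minimal, illustrative sketch of a pluggable isolator in Python. It is not the real Mesos isolation module API (which is C++); the class names are invented, and it assumes Linux cgroup v1 limit files mounted under /sys/fs/cgroup.

```python
import os

class Isolator:
    """Hypothetical pluggable isolation interface (not the actual Mesos C++ API)."""
    def isolate(self, task_id, resources):
        raise NotImplementedError

class CgroupIsolator(Isolator):
    """Caps a task's memory and CPU by writing Linux cgroup (v1) limit files."""
    CGROUP_ROOT = "/sys/fs/cgroup"   # typical mount point; may differ per system

    def isolate(self, task_id, resources):
        mem_dir = os.path.join(self.CGROUP_ROOT, "memory", "mesos", task_id)
        cpu_dir = os.path.join(self.CGROUP_ROOT, "cpu", "mesos", task_id)
        os.makedirs(mem_dir, exist_ok=True)
        os.makedirs(cpu_dir, exist_ok=True)
        # Memory limit in bytes, and CPU weight proportional to requested cores.
        with open(os.path.join(mem_dir, "memory.limit_in_bytes"), "w") as f:
            f.write(str(int(resources["mem_mb"]) * 1024 * 1024))
        with open(os.path.join(cpu_dir, "cpu.shares"), "w") as f:
            f.write(str(int(resources["cpus"] * 1024)))

# Example: cap a task at 1 GB of memory and roughly 2 cores' worth of CPU shares.
# CgroupIsolator().isolate("task-42", {"mem_mb": 1024, "cpus": 2})
```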
Apache ZooKeeper
Multiple servers require coordination
» Leader Election, Group Membership, Work Queues, Data Sharding, Event Notifications, Configuration, and Cluster Management
Highly available, scalable, distributed coordination kernel
» Ordered updates and strong persistence guarantees
» Conditional updates (versions), watches for data changes (see the sketch after the figure below)
[Figure: a ZooKeeper ensemble of servers, one of which is the leader, serving many clients]
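As a concrete illustration of the conditional-update and watch primitives listed above, here is a short sketch using the kazoo Python ZooKeeper client; the client library, znode path, and server addresses are my choice of example, not part of the talk.

```python
from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")   # assumed ensemble addresses
zk.start()
zk.ensure_path("/cluster")
if not zk.exists("/cluster/config"):
    zk.create("/cluster/config", b"workers=10")

# Watch: this callback fires whenever the configuration znode changes.
@zk.DataWatch("/cluster/config")
def on_config_change(data, stat):
    if stat is not None:
        print("config is now %r (version %d)" % (data, stat.version))

# Conditional update: succeeds only if nobody changed the znode since we read it.
data, stat = zk.get("/cluster/config")
try:
    zk.set("/cluster/config", b"workers=20", version=stat.version)
except BadVersionError:
    print("someone else updated the config first; re-read and retry")
```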
Master Failure
ZooKeeper is used to elect one active Mesos master (a sketch of this pattern follows the slide)
[Figure: three Mesos masters coordinate through a five-server ZooKeeper quorum, which elects one of them as the active master; framework schedulers (Hadoop JobTracker, MPI scheduler) and slaves (running Hadoop, MPI, and JVM executors) connect to the currently active master]
9
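A minimal sketch of the election pattern this slide describes, again using the kazoo Python client and an assumed znode path (the real Mesos masters implement this in C++): each master contends for leadership, and kazoo's Election recipe queues contenders behind ephemeral sequential znodes so that exactly one is active at a time.

```python
import time
from kazoo.client import KazooClient

def run_master_main_loop():
    # Hypothetical stand-in for the active master's work: accept framework
    # registrations, send resource offers, forward status updates, ...
    while True:
        time.sleep(1)

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")   # assumed quorum addresses
zk.start()

# Only the elected contender runs the callback. If the active master dies,
# its ephemeral znode disappears and the next contender is elected.
election = zk.Election("/mesos/master-election", identifier="master-1")
election.run(run_master_main_loop)
```

Schedulers and slaves can then discover the currently active master by reading (and watching) the leader's znode, reconnecting whenever it changes.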
Resource Revocation
Killing tasks to make room for other users
Killing is typically not needed for short tasks
» If the average task length is 2 minutes, a new framework gets 10% of all machines within 12 seconds on average (see the back-of-the-envelope sketch below)
[Figure: Hadoop job and task durations at Facebook]
10
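The 12-second figure follows from simple arithmetic: assuming running tasks' finish times are spread roughly uniformly, a fraction f of slots frees up after about f times the mean task length. A minimal sketch of that calculation:

```python
mean_task_length_s = 120    # average task length: 2 minutes
target_share = 0.10         # the new framework wants 10% of all machines

# With finish times spread roughly uniformly, the first 10% of slots free up
# after roughly 10% of the mean task length.
expected_wait_s = target_share * mean_task_length_s
print(expected_wait_s)      # 12.0 seconds
```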
Resource Revocation (2)
Not the normal case, because fine-grained tasks enable quick reallocation of resources
Sometimes necessary:
» Long-running tasks that never relinquish resources
» A buggy job running forever
» A greedy user who decides to make his tasks long
Safe allocation lets frameworks have long-running tasks, as defined by the allocation policy
» Users will get at least their safe share within a specified time
» If they stay below their safe allocation, their tasks won't be killed
11
Resource Revocation (3)
Dealing with long tasks monopolizing nodes
» Let slaves have long slots and short slots
» Tasks in short slots are killed if they run too long
Revoke only if a user is below its safe share and is interested in offers
» Revoke tasks from the users farthest above their safe shares
» A framework is given a grace period before its tasks are killed
(a sketch of this revocation policy follows below)
12
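A minimal sketch of how the revocation rule on the previous slide could look in code. The data structures and names are illustrative only, not the actual Mesos allocator, and the 30-second grace period is an assumed value.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Task:
    name: str
    resources: float              # e.g., CPUs held by the task

@dataclass
class User:
    name: str
    safe_share: float             # resources guaranteed by the allocation policy
    allocation: float             # resources currently held
    wants_offers: bool            # is the user interested in new offers?
    tasks: List[Task] = field(default_factory=list)

GRACE_PERIOD_S = 30               # assumed grace period before tasks are killed

def pick_revocation_victims(users: List[User], starved: User) -> List[Tuple[Task, int]]:
    """Revoke only when a user is below its safe share and interested in offers;
    take tasks from users farthest above their own safe shares, and give each
    victim framework a grace period before its tasks are killed."""
    if not starved.wants_offers or starved.allocation >= starved.safe_share:
        return []
    needed = starved.safe_share - starved.allocation
    victims: List[Tuple[Task, int]] = []
    for user in sorted(users, key=lambda u: u.allocation - u.safe_share, reverse=True):
        for task in list(user.tasks):
            if needed <= 0:
                return victims
            if user.allocation - task.resources < user.safe_share:
                break             # never push a user below its own safe share
            victims.append((task, GRACE_PERIOD_S))
            user.tasks.remove(task)
            user.allocation -= task.resources
            needed -= task.resources
    return victims
```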
Example: Running MPI on Mesos
Users are always told their safe share
» Avoid revocation by staying below it
Giving each user a small safe share may not be enough if jobs need many machines
Can run a traditional HPC scheduler as a user with a large safe share of the cluster, and have MPI jobs queue up on it
» E.g., Torque gets 40% of the cluster
13
Example: Torque on Mesos
[Figure: Torque runs as a Mesos user with a safe share of 40% of the cluster and queues MPI jobs on it; other users (e.g., Facebook.com spam and ads jobs from User 1 and User 2) share the rest of the cluster (shares of 40% and 20% are shown)]
14
Some Mesos Deployments
1,000’s of nodes running over a dozen production services
Genomics researchers using Hadoop and Spark on Mesos
Spark in use by Yahoo! Research
Spark for analytics
Hadoop and Spark used by machine learning researchers
15
Results
Dynamic Resource Sharing
[Figure: share of cluster (0-100%) over time (s) for MPI, Hadoop, and Spark frameworks dynamically sharing the cluster]
17
Elastic Web Server Farm
[Figure: an elastic web server farm on Mesos; a load-generation framework runs httperf tasks on load-gen executors that issue HTTP requests, an haproxy-based scheduler balances the requests across web executor tasks running Apache on Mesos slaves, and the scheduler uses load calculations together with resource offers and task status updates from the Mesos master to size the farm (a sketch of this control loop follows)]
18
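The figure above is the 200-line Apache+haproxy framework mentioned earlier. Below is a minimal, illustrative sketch of the control loop such a scheduler could run; it is not the actual framework, and all helper functions, the class name, and the target request rate are invented stand-ins.

```python
# Hypothetical stand-ins for the real Mesos / haproxy plumbing shown in the figure:
def launch_apache_task(offer): return {"offer": offer}   # start Apache via a web executor
def add_backend_to_haproxy(task): pass                    # add the new server to haproxy
def remove_backend_from_haproxy(task): pass
def kill_task(task): pass
def decline_offer(offer): pass

TARGET_REQS_PER_SERVER = 500      # assumed target request rate per Apache task

class ElasticWebScheduler:
    """Illustrative control loop for an Apache + haproxy framework on Mesos."""

    def __init__(self):
        self.web_tasks = []       # Apache tasks currently in the farm

    def desired_servers(self, load_rps):
        # Size the farm so each Apache task serves roughly the target rate.
        return max(1, round(load_rps / TARGET_REQS_PER_SERVER))

    def on_resource_offer(self, offer, load_rps):
        # Grow the farm only while it is below its desired size; otherwise
        # decline so other frameworks can use the resources.
        if len(self.web_tasks) < self.desired_servers(load_rps):
            task = launch_apache_task(offer)
            add_backend_to_haproxy(task)
            self.web_tasks.append(task)
        else:
            decline_offer(offer)

    def on_load_change(self, load_rps):
        # Shrink the farm when load drops, returning resources to Mesos.
        while len(self.web_tasks) > self.desired_servers(load_rps):
            task = self.web_tasks.pop()
            remove_backend_from_haproxy(task)
            kill_task(task)
```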
Web Framework Results
19
Scalability
Task startup overhead with 200 frameworks
20
Fault Tolerance
Mean time to recovery (95% confidence)
21
Deep Dive Experiments
Macrobenchmark experiment
» Test the benefits of using Mesos to multiplex a cluster between multiple diverse frameworks
High-level goals of the experiment
» Demonstrate increased CPU/memory utilization due to multiplexing available resources
» Demonstrate job runtime speedups
22
Macrobenchmark setup
100 Extra Large EC2 instances (4 cores / 15 GB RAM per machine)
Experiment length: ~25 minutes
Realistic workload:
1. A Hadoop instance running a mix of small and large jobs based on the workload at Facebook
2. A Hadoop instance running a set of large batch jobs
3. Spark running a series of machine learning jobs
4. Torque running a series of MPI jobs
23
Goal of experiment
Run the four frameworks and corresponding workloads…
» 1st on a cluster that is shared via Mesos
» 2nd on 4 partitioned clusters, each ¼ the size of the shared cluster
Compare resource utilization and workload performance (i.e., job run times) on static partitioning vs. sharing with Mesos
24
Macrobenchmark Details: Breakdown of the Facebook Hive (Hadoop) Workload Mix
Bin Job Type Map Tasks Reduce Tasks Jobs Run
1 Selection 1 NA 38
2 Text search 2 NA 18
3 Aggregation 10 2 14
4 Selection 50 NA 12
5 Aggregation 100 10 6
6 Selection 200 NA 6
7 Text Search 400 NA 4
8 Join 400 30 2
25
Results: CPU Allocation
[Figure: CPU allocation across the 100-node cluster, showing Hadoop (Batch Jobs) and Hadoop (Facebook Mix)]
26
Sharing With Mesos vs. No-Sharing (Dedicated Cluster)
[Figure: share of cluster over time (s), Mesos vs. dedicated cluster, for (a) the Facebook Hadoop Mix, (b) the Large Hadoop Mix, (c) Spark, and (d) Torque / MPI]
27
Cluster Utilization: Mesos vs. Dedicated Clusters
[Figure: CPU utilization (%) and memory utilization (%) over time (s), Mesos vs. static partitioning]
28
Job Run Times (and Speedup), Grouped by Framework

Framework             Sum of exec times on      Sum of exec times   Speedup
                      dedicated cluster (s)     on Mesos (s)
Facebook Hadoop Mix   7235                      6319                1.14
Large Hadoop Mix      3143                      1494                2.10
Spark                 1684                      1338                1.26
Torque / MPI          3210                      3352                0.96

2x speedup for the Large Hadoop Mix
29
Job Run Times (and Speedup), Grouped by Job Type

Framework             Job Type          Time on dedicated   Avg. speedup
                                        cluster (s)         on Mesos
Facebook Hadoop Mix   selection (1)     24                  0.84
                      text search (2)   31                  0.90
                      aggregation (3)   82                  0.94
                      selection (4)     65                  1.40
                      aggregation (5)   192                 1.26
                      selection (6)     136                 1.71
                      text search (7)   137                 2.14
                      join (8)          662                 1.35
Large Hadoop Mix      text search       314                 2.21
Spark                 ALS               337                 1.36
Torque / MPI          small tachyon     261                 0.91
                      large tachyon     822                 0.88
30
Discussion: Facebook Hadoop Mix Results
Smaller jobs perform worse on Mesos:
» A side effect of the interaction between the fair sharing performed by the Hadoop framework (among its jobs) and that performed by Mesos (among frameworks)
» When Hadoop has more than 1/4 of the cluster, Mesos allocates freed-up resources to the framework farthest below its share
» This has a significant effect on any small Hadoop job submitted during this time (a long delay relative to its length)
» In contrast, Hadoop running alone can assign resources to the new job as soon as any of its tasks finishes
31
Discussion: Facebook Hadoop Mix Results
A similar problem with hierarchical fair sharing appears in networks
» Mitigation #1: run small jobs on a separate framework, or
» Mitigation #2: use lottery scheduling as the Mesos allocation policy (see the sketch below)
32
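Lottery scheduling (Waldspurger and Weihl) gives each framework tickets in proportion to its entitled share and raffles off each freed resource, so a small, newly submitted job has an immediate chance of winning resources rather than waiting behind deficit-based fair-share ordering. A minimal sketch, with made-up framework names and ticket counts:

```python
import random

def lottery_allocate(frameworks):
    """Pick which framework receives the next freed-up resource.

    Each framework holds tickets proportional to its entitled share, so even a
    framework far above its share still wins occasionally, which keeps small,
    newly submitted jobs from being starved behind large deficits.
    """
    total = sum(f["tickets"] for f in frameworks)
    draw = random.uniform(0, total)
    running = 0.0
    for f in frameworks:
        running += f["tickets"]
        if draw <= running:
            return f["name"]

frameworks = [
    {"name": "hadoop-facebook-mix", "tickets": 25},
    {"name": "hadoop-large-batch",  "tickets": 25},
    {"name": "spark-ml",            "tickets": 25},
    {"name": "torque-mpi",          "tickets": 25},
]
print(lottery_allocate(frameworks))
```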
Discussion: Torque Results
Torque is the only framework that performed worse, on average, on Mesos
» Large tachyon jobs took on average 2 minutes longer
» Small ones took 20 s longer
33
Discussion: Torque Results
Causes of delay
» Partially due to Torque having to wait to launch 24 tasks on Mesos before starting each job; the average delay is 12 s
» The rest of the delay may be due to stragglers (slow nodes)
» In the standalone Torque run, two jobs each took ~60 s longer to run than the others
» Both jobs used a node that performed slower on single-node benchmarks than the others (Linux reported a 40% lower bogomips value on that node)
» Since tachyon hands out equal amounts of work to each node, it runs as slowly as the slowest node
34
Macrobenchmark Summary
Evaluated the performance of a diverse set of frameworks, representing realistic workloads, running on Mesos versus a statically partitioned cluster
Showed a 10% increase in CPU utilization and an 18% increase in memory utilization
Some frameworks show significant speedups in job run time
Some frameworks show minor slowdowns in job run time due to experimental/environmental artifacts
35
Summary
Mesos is a platform for sharing data centers among diverse cluster computing frameworks
» Enables efficient fine-grained sharing
» Gives frameworks control over scheduling
Mesos is
» Scalable (50,000 slaves)
» Fault-tolerant (MTTR of 6 seconds)
» Flexible enough to support a variety of frameworks (MPI, Hadoop, Spark, Apache, …)
36
My Talks at LASER 2013
1. AMP Lab introduction
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark
37