Christos Kozyrakis Re:platforming the Datacenter with Apache Mesos

Post on 08-Dec-2016

Page 1: Re:platforming the Datacenter with Apache Mesos

Christos Kozyrakis

Page 2

Why your ASF project should run on Mesos

Page 3

O(10K) commodity servers

High-speed networking

Distributed storage (HDD, Flash)

x10 MWatt

x100 M$

Page 4

ops, developers

automation

performance

automation

efficiency

Page 5

① Datacenter past

Page 6

Static Partitioning

Page 7

Static Partitioning

Hadoop, Cassandra, Rails, Jenkins, memcached


Page 11

Static Partitioning

ops, developers

automation

performance

automation

efficiency

Page 12

② Datacenter present

Page 13

Apache Mesos

The datacenter OS kernel

Aggregates all resources into a single shared pool

Dynamically allocates resources to distributed apps

Container management at scale (cgroups, docker, …)

Page 14

Mesos Architecture

[Diagram: three layers labeled Frameworks (Marathon, Jenkins, …), Masters (three, each with Allocator, AuthN, AuthZ), and Servers (Slaves, each hosting Executors that run Tasks).]

Scales to 10s of thousands of servers

Page 15

Mesos Fault Tolerance

[Diagram: Mesos architecture as on Page 14: Frameworks (Marathon, Jenkins, …), Masters (Allocator, AuthN, AuthZ), and Servers running Slaves with Executors and Tasks.]

Tasks survive failures of the master

Page 16

Mesos Fault Tolerance

[Diagram: Mesos architecture as on Page 14: Frameworks (Marathon, Jenkins, …), Masters (Allocator, AuthN, AuthZ), and Servers running Slaves with Executors and Tasks.]

Tasks survive failures of the framework

Page 17

Mesos Fault Tolerance

[Diagram: Mesos architecture as on Page 14: Frameworks (Marathon, Jenkins, …), Masters (Allocator, AuthN, AuthZ), and Servers running Slaves with Executors and Tasks.]

Tasks survive failures of the slave process

Page 18

Scheduling in Mesos

No single scheduler fits all needs

Long-running services need
  scale up/down, fault tolerance

Analytics services need
  fast task launching

Page 19

2-level Scheduling

Mesos master (single API)
  Resource allocation (offers)
  Task health checks
  Task isolation

Mesos frameworks (domain-specific APIs)
  Scale up/down
  Fault tolerance
  Task grouping
  Task dependencies
  Queuing & priorities
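The split between the master and the frameworks can be illustrated with a toy model. This is a minimal sketch, not the Mesos API: the class names `Master`, `Framework`, `Offer`, and `Task`, the method names, and the greedy packing policy are all assumptions made for illustration. The master only tracks free resources and hands out offers; the framework decides which tasks to place on each offer.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    slave: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


@dataclass
class Framework:
    """Level 2 (hypothetical): picks which offered resources to use."""
    name: str
    pending: list = field(default_factory=list)

    def on_offer(self, offer):
        # Greedily pack pending tasks into the offer; whatever is left
        # over stays in the master's free pool.
        launched, cpus, mem = [], offer.cpus, offer.mem_mb
        for task in list(self.pending):
            if task.cpus <= cpus and task.mem_mb <= mem:
                launched.append(task)
                self.pending.remove(task)
                cpus -= task.cpus
                mem -= task.mem_mb
        return launched


class Master:
    """Level 1 (hypothetical): tracks free resources and makes offers."""

    def __init__(self, free):
        self.free = dict(free)  # slave name -> (cpus, mem_mb)

    def offer_round(self, framework):
        placements = []
        for slave, (cpus, mem) in list(self.free.items()):
            tasks = framework.on_offer(Offer(slave, cpus, mem))
            used_cpus = sum(t.cpus for t in tasks)
            used_mem = sum(t.mem_mb for t in tasks)
            self.free[slave] = (cpus - used_cpus, mem - used_mem)
            placements += [(t.name, slave) for t in tasks]
        return placements


master = Master({"s1": (4.0, 8192), "s2": (2.0, 4096)})
web = Framework("web", [Task("nginx", 1.0, 512),
                        Task("rails", 2.0, 2048),
                        Task("cache", 2.0, 1024)])
placements = master.offer_round(web)
print(placements)   # which task landed on which slave
print(master.free)  # resources still available for other frameworks
```

Scale up/down, task grouping, dependencies, and queuing would all live inside the framework's `on_offer` logic in this model, which is why the master can stay simple.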

Page 20

2-level Scheduling Benefits

Multiple APIs to Mesos through frameworks
  Marathon, Aurora, Singularity
  Storm, Spark, Hadoop, Chronos
  Cassandra, Elasticsearch

Multiple task scheduling approaches
  Spark fine-grain vs. Spark coarse-grain

Simple, stable, and scalable Mesos master

Page 21

Dynamic Resource Allocation

Resource allocation based on framework roles

Dominant resource fairness (DRF)
  Weighted fair share calculated based on dominant resource
  Frameworks do no worse than having a weight-sized cluster

Resource reservations
  Resources can be allocated to specific frameworks if needed
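Weighted DRF can be sketched in a few lines. This is a simplified illustration, not the Mesos allocator: the function name `drf_allocate`, the dictionary-based inputs, and the tie-breaking by framework name are assumptions. At each step, the framework with the smallest weighted dominant share gets its next task, where the dominant share is the largest fraction of any single resource the framework holds. The sample cluster (9 CPUs, 18 GB) and per-task demands follow the worked example in the DRF paper listed in the references.

```python
def drf_allocate(total, demands, weights=None):
    """Progressive filling under weighted dominant resource fairness.

    total:   resource name -> cluster capacity
    demands: framework -> per-task demand (resource name -> amount)
    weights: framework -> weight (defaults to 1.0 each)
    Returns the number of tasks launched per framework.
    """
    weights = weights or {f: 1.0 for f in demands}
    used = {r: 0.0 for r in total}
    alloc = {f: {r: 0.0 for r in total} for f in demands}
    tasks = {f: 0 for f in demands}
    active = set(demands)

    def dominant_share(f):
        # Largest fraction of any one resource, scaled down by the weight.
        return max(alloc[f][r] / total[r] for r in total) / weights[f]

    while active:
        # Serve the framework that is currently furthest behind.
        f = min(sorted(active), key=dominant_share)
        need = demands[f]
        if all(used[r] + need[r] <= total[r] for r in total):
            for r in total:
                used[r] += need[r]
                alloc[f][r] += need[r]
            tasks[f] += 1
        else:
            active.remove(f)  # its next task no longer fits
    return tasks


cluster = {"cpus": 9, "mem_gb": 18}
per_task = {"A": {"cpus": 1, "mem_gb": 4},   # memory-dominant
            "B": {"cpus": 3, "mem_gb": 1}}   # CPU-dominant
result = drf_allocate(cluster, per_task)
print(result)
```

With equal weights, both frameworks end up with the same dominant share (two thirds of memory for A, two thirds of CPU for B), even though they hold different mixes of resources.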

Page 22

Dynamic Resource Allocation

Rails, Hadoop, memcached

Buy fewer machines or run more applications!
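The saving comes from sizing the cluster for the peak of the combined load rather than for the sum of each application's own peak. The sketch below makes that arithmetic explicit. The demand numbers are invented purely for illustration and do not come from the talk.

```python
# Hypothetical demand (servers needed) per workload over four time slots.
demand = {
    "rails":     [10, 30, 40, 30],
    "hadoop":    [40, 10, 10, 30],
    "memcached": [10, 20, 20, 10],
}

# Static partitioning: every workload owns enough servers for its own peak.
static_servers = sum(max(series) for series in demand.values())

# Shared pool: the cluster only has to cover the peak of the combined demand.
shared_servers = max(sum(slot) for slot in zip(*demand.values()))

print(static_servers, shared_servers, static_servers - shared_servers)
```

The gap between the two numbers is either hardware that does not need to be bought or headroom for additional applications.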

Page 23

Service Discovery

[Diagram: Mesos Master, Slaves, and Mesos DNS]

① Watch ZK for master changes

② Pull task state
   Generate DNS records

③ DNS & HTTP based discovery

  nginx.marathon.mesos → 10.13.17.95

  _nginx._tcp.marathon.mesos → 10.13.17.95:8181
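Step ② can be sketched as a pure function from task state to records. This is an illustration only: the shape of the `tasks` input and the function names are assumptions, and the real Mesos-DNS reads state from the master rather than from a Python list. The record names follow the pattern shown on the slide: `<task>.<framework>.mesos` for A records and `_<task>._<protocol>.<framework>.mesos` for SRV records. SRV values are written as `ip:port` to mirror the slide, which simplifies what an actual SRV record carries.

```python
def a_name(task, framework, domain="mesos"):
    """Name of the A record for a task, e.g. nginx.marathon.mesos."""
    return f"{task}.{framework}.{domain}"


def srv_name(task, framework, protocol="tcp", domain="mesos"):
    """Name of the SRV record, e.g. _nginx._tcp.marathon.mesos."""
    return f"_{task}._{protocol}.{framework}.{domain}"


def generate_records(tasks, domain="mesos"):
    """Build A and SRV lookup tables from a (hypothetical) task-state list."""
    records = {"A": {}, "SRV": {}}
    for t in tasks:
        a = a_name(t["name"], t["framework"], domain)
        records["A"].setdefault(a, []).append(t["ip"])
        srv = srv_name(t["name"], t["framework"], domain=domain)
        for port in t["ports"]:
            records["SRV"].setdefault(srv, []).append(f"{t['ip']}:{port}")
    return records


state = [{"name": "nginx", "framework": "marathon",
          "ip": "10.13.17.95", "ports": [8181]}]
records = generate_records(state)
print(records["A"])
print(records["SRV"])
```

A client that only needs an address queries the A name; a client that also needs the dynamically assigned port queries the SRV name.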

Page 24

③ Datacenter future

Page 25

Utilization Reality

Twitter (Mesos) Google (Borg)

[Barroso’09] [Delimitrou’14]

Page 26

The Curse of Overprovisioning

[Delimitrou’14]

Bloated reservations to deal with diurnal load patterns, load spikes, software & platform changes

Page 27

Oversubscription

[Diagram: Mesos architecture as on Page 14, annotated with the steps below.]

① offer<s1, 4cores, …>

② offer<s1, 4cores, …>

Page 28

Oversubscription

[Diagram: Mesos architecture as on Page 14, annotated with the steps below.]

③ launch<tasks, s1, 4cores, …>

④ launch<tasks, 4cores, …>

Page 29

Oversubscription

[Diagram: Mesos architecture as on Page 14, annotated with the steps below.]

① offer<s1, BE, 2cores, …>

② offer<s1, BE, 2cores, …>

Page 30

Oversubscription

[Diagram: Mesos architecture as on Page 14, annotated with the steps below.]

③ launch<tasks, BE, s1, 2cores, …>

④ launch<tasks, BE, 2cores, …>

Page 31

Oversubscription

[Diagram: Mesos architecture as on Page 14, annotated with the steps below.]

① Status<task, killed, …>

② Status<task, killed, …>
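One plausible reading of the five oversubscription slides is: a slave's cores are first offered and allocated normally, cores that are allocated but measured as idle are then re-offered as best-effort (BE), and BE tasks are killed once the regular tasks need their cores back. The sketch below models that reading. The `Slave` class, its method names, and the usage figures are hypothetical and are not the Mesos API.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    name: str
    total_cpus: float
    allocated: float = 0.0                        # cores promised to regular tasks
    be_tasks: dict = field(default_factory=dict)  # best-effort task -> cores

    def regular_offer(self):
        # offer<s1, 4cores, …>: cores not yet allocated to anyone.
        return (self.name, self.total_cpus - self.allocated)

    def launch(self, cpus):
        # launch<tasks, s1, 4cores, …>
        self.allocated += cpus

    def best_effort_offer(self, measured_usage):
        # offer<s1, BE, 2cores, …>: allocated-but-idle cores, minus
        # whatever is already lent out to best-effort tasks.
        slack = self.allocated - measured_usage - sum(self.be_tasks.values())
        return (self.name, "BE", max(slack, 0.0))

    def launch_best_effort(self, task, cpus):
        # launch<tasks, BE, s1, 2cores, …>
        self.be_tasks[task] = cpus

    def reclaim(self, measured_usage):
        # Status<task, killed, …>: evict best-effort tasks until regular
        # usage plus best-effort usage fits inside the allocation again.
        killed = []
        while (self.be_tasks and
               measured_usage + sum(self.be_tasks.values()) > self.allocated):
            task, _ = self.be_tasks.popitem()
            killed.append((task, "killed"))
        return killed


s1 = Slave("s1", total_cpus=4.0)
offer = s1.regular_offer()
s1.launch(4.0)
be_offer = s1.best_effort_offer(measured_usage=2.0)  # half the cores sit idle
s1.launch_best_effort("batch-job", 2.0)
killed = s1.reclaim(measured_usage=3.5)              # regular load spikes
print(offer, be_offer, killed)
```

The framework that accepted the BE offer has to tolerate the kill, which is why only best-effort work is a good fit for these resources.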

Page 32

Interference → Performance Loss

Impact of interference on websearch’s latency [Lo’15]

Load          10%    20%    30%    40%    50%    60%    70%    80%    90%
L3 Cache      >300%  >300%  >300%  >300%  >300%  >300%  >300%  264%   123%
DRAM          >300%  >300%  >300%  >300%  >300%  >300%  >300%  270%   122%
HyperThread   110%   107%   114%   115%   105%   117%   120%   136%   >300%
CPU power     124%   107%   116%   109%   115%   105%   101%   100%   100%
Network       36%    36%    37%    37%    39%    42%    48%    55%    64%

Scale: 0%, 100%, 300% (OK to BAD)

Page 33

Isolators

Mesos slave invokes isolators
  Modules that monitor & isolate resources for executors

Isolator modules
  CPU (cgroups cpushares, cpusets)
  Memory (cgroups)
  Disk
  Network
  Cache
  Power
  …
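As an illustration of the CPU isolator item, the sketch below shows how a CPU request could translate into a cgroups v1 `cpu.shares` weight and what that weight means when the machine is fully contended. The mapping of 1024 shares per requested CPU is an assumption chosen to mirror the cgroups v1 default weight, and the function names are hypothetical. The code only computes numbers and does not touch the cgroup filesystem.

```python
CPU_SHARES_PER_CPU = 1024  # assumption: one requested CPU maps to the v1 default weight
MIN_CPU_SHARES = 2         # smallest weight the cgroups v1 cpu controller accepts


def cpu_shares(cpus):
    """Convert a (possibly fractional) CPU request into a cpu.shares weight."""
    return max(int(cpus * CPU_SHARES_PER_CPU), MIN_CPU_SHARES)


def contended_fraction(requests):
    """Fraction of CPU time each container gets when all of them are busy.

    cpu.shares is a relative weight, so these fractions only apply under
    contention; an otherwise idle machine lets a container use more.
    """
    shares = {name: cpu_shares(cpus) for name, cpus in requests.items()}
    total = sum(shares.values())
    return {name: s / total for name, s in shares.items()}


requests = {"web": 2.0, "batch": 0.5, "cache": 1.5}
fractions = contended_fraction(requests)
print({name: cpu_shares(c) for name, c in requests.items()})
print(fractions)
```

Because shares are only a soft weight, they limit interference on CPU time but not on shared caches, memory bandwidth, or power, which is why the slide lists separate isolators for those resources.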

Page 34

Isolators → Performance QoS

[Lo et al’15]

+ best-effort task

>90% HW utilization

No latency SLO problems

Page 35

Hybrid Datacenters

Mesos Slaves

Mesos Master

Marathon

Page 36

Hybrid Datacenters

Mesos Slaves

Mesos Master

Marathon

? instance ++

Scheduling based on workload type, data locality, pricing, …

Page 37

Future Directions

Oversubscription

Container & application right sizing

Hybrid datacenters

Power aware scheduling

Locality aware scheduling

Page 38

Mesos Datacenter

ops, developers

automation

performance

automation

efficiency

Page 39

Want to Learn More?

Mo 3pm – Cracking the Container Scale Problem with Apache Mesos

Tue 4.20pm – The Emergence of the Datacenter Developer

Wed 4.15pm – Mesos + Yarn = Myriad

Page 40

Questions?

Page 41

References

•  http://mesos.apache.org/

•  http://www.mesosphere.com/

•  https://github.com/mesosphere/mesos-dns

•  https://www.cs.berkeley.edu/~alig/papers/mesos.pdf

•  https://www.cs.berkeley.edu/~alig/papers/drf.pdf

•  http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024

•  http://web.stanford.edu/~cdel/2014.asplos.quasar.pdf

•  http://web.stanford.edu/~davidlo/resources/2014.heracles.isca.pdf