Lessons in moving from physical hosts to Mesos

Raj Shekhar, Senior Site Reliability Engineer

@ilunatech

Mesos

WHAT WHY HOW

NOW WHAT

How most Ops teams run clusters today

Static partitioning has problems

Unequal load distribution on machines

Slower to add capacity

Not fault tolerant

Is there a better way?

Do we want machines or do we want resources?

Mesos

Resource manager - the datacenter is one big pool

Can run multi-tenant workloads

Failure detection

Services are isolated from one another

Why Mesos - better resource utilization

Run multi-tenant workloads on machines

Dynamic partitioning - no dedicated machines for tasks

Less resource hungry than virtual machines

Why Mesos - all the other good things

Fault tolerant - automatically restart failed jobs

Elasticity - grow and shrink on demand

Faster deploys

T.co - URL shortening

http://example.com/example http://t.co/examp

How

Package Deploy Test Go Live!

Life after Go Live

Lowered operating expense

Fewer routine operational tasks

Faster deploys

Job throttling

Sudden spikes in latencies

What we learned

cgroups and cpu quotas
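One way to check whether CPU quotas are behind throttling and latency spikes is to read the cgroup's `cpu.stat` file, which under cgroup v1 reports `nr_periods`, `nr_throttled`, and `throttled_time` (in nanoseconds). The sketch below is illustrative only and is not the tooling used in the talk; the function names and the sample numbers are assumptions.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup v1 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


def cpu_limit_cores(quota_us, period_us):
    """CPU limit implied by cpu.cfs_quota_us / cpu.cfs_period_us; None if unlimited."""
    if quota_us < 0:  # a quota of -1 means no limit is enforced
        return None
    return quota_us / period_us


# Made-up sample contents, standing in for a real cpu.stat file.
SAMPLE = """nr_periods 1000
nr_throttled 250
throttled_time 5000000000
"""

if __name__ == "__main__":
    stats = parse_cpu_stat(SAMPLE)
    print("throttled in", throttled_fraction(stats), "of periods")
    print("limit:", cpu_limit_cores(50000, 100000), "cores")
```

A job that is throttled in a large share of its periods is hitting its quota, which shows up to clients as added latency even when the host as a whole looks idle.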

Capacity planning

Max traffic of the cluster was lower than we expected

What we learned

Different CPU variants have different throughput
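A small sketch of why mixed hardware matters for capacity planning: sizing the cluster as if every host matched the fastest CPU variant overestimates what it can serve. The variant names and per-host request rates below are hypothetical placeholders, not measurements from the talk.

```python
def cluster_capacity(host_counts, rps_per_host):
    """Sum per-variant capacity: hosts of each CPU variant times that variant's measured rate."""
    total = 0.0
    for variant, count in host_counts.items():
        total += count * rps_per_host[variant]
    return total


def naive_capacity(host_counts, rps_per_host):
    """Estimate that wrongly assumes every host performs like the fastest variant."""
    return sum(host_counts.values()) * max(rps_per_host.values())


# Hypothetical fleet: two CPU variants with different per-host throughput.
HOSTS = {"cpu_variant_a": 10, "cpu_variant_b": 10}
RPS = {"cpu_variant_a": 1000, "cpu_variant_b": 600}

if __name__ == "__main__":
    print("naive estimate:   ", naive_capacity(HOSTS, RPS))
    print("weighted estimate:", cluster_capacity(HOSTS, RPS))
```

Measuring throughput per CPU variant and summing, rather than multiplying one benchmark figure by the host count, keeps the estimate closer to what the cluster can actually carry.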

Rethink service discovery

Services get hosts and ports assigned dynamically

What we learned

Use static proxies to forward connections
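The idea can be sketched as a routing table: clients keep talking to a fixed, well-known port, and the proxy maps that port to whichever hosts and ports the scheduler has assigned right now. This is a minimal illustration of the bookkeeping only, with no actual socket forwarding; the class name, host names, and port numbers are hypothetical.

```python
class ProxyRoutes:
    """Maps a static proxy port to the current, dynamically assigned backends of a service."""

    def __init__(self):
        self._backends = {}  # static port -> list of (host, port) tuples
        self._cursors = {}   # static port -> round-robin counter

    def update(self, static_port, backends):
        """Replace the backend list, e.g. after tasks are rescheduled onto new hosts/ports."""
        self._backends[static_port] = list(backends)
        self._cursors[static_port] = 0

    def pick(self, static_port):
        """Return the next backend for a connection arriving on static_port (round-robin)."""
        backends = self._backends.get(static_port)
        if not backends:
            raise LookupError("no backends registered for port %d" % static_port)
        index = self._cursors[static_port] % len(backends)
        self._cursors[static_port] += 1
        return backends[index]


if __name__ == "__main__":
    routes = ProxyRoutes()
    routes.update(8080, [("host-1", 31001), ("host-2", 31544)])
    print(routes.pick(8080))
    # Tasks move; clients still connect to 8080 and never notice.
    routes.update(8080, [("host-3", 31002)])
    print(routes.pick(8080))
```

Clients only ever need to know the static port; the table is the one place that has to change when tasks move.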

No perfect isolation

Sudden spike in latency

What we learned

Async ops where possible, noisy neighbours still affect us
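The slides do not say which operations were made asynchronous. As one assumed example, logging can be taken off the request path with Python's standard `QueueHandler` and `QueueListener`, so a disk made slow by a noisy neighbour delays a background thread rather than the request. The function name `make_async_logger` is hypothetical.

```python
import logging
import logging.handlers
import queue


def make_async_logger(name, slow_handler):
    """Return (logger, listener); the logger only enqueues records, and the
    listener's background thread passes them to slow_handler."""
    log_queue = queue.Queue(-1)  # unbounded queue between request path and writer
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from any synchronous root handlers
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return logger, listener


if __name__ == "__main__":
    # StreamHandler stands in for a handler that writes to a possibly slow disk.
    logger, listener = make_async_logger("service", logging.StreamHandler())
    logger.info("request served")  # returns as soon as the record is queued
    listener.stop()                # drains the queue and joins the background thread
```

This only moves the waiting off the critical path. Contention for the shared resource is still there, which matches the point on the slide that noisy neighbours continue to have an effect.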

Questions?

[email protected]

@ilunatech
