mesos at yelp: building a [email protected]/@rob johnson rob ... · -yelp’s monolith is ~3 million...

74
Mesos at Yelp: Building a production ready PaaS Rob Johnson [email protected]/@rob_johnson_

Upload: others

Post on 14-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Mesos at Yelp: Building a production ready PaaS

Rob [email protected]/@rob_johnson_

Page 2: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- Rob Johnson- Operations Team at Yelp- Spend most of my time working on PaaSTA

Who Am I:

Page 3: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Yelp’s Mission:Connecting people with great

local businesses.

Page 4: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Yelp Stats:As of Q2 2015

83M 3268%83M

Page 5: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

PaaSTA

Page 6: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Yelp’s homegrown Platform-as-a-Service

Page 7: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

What’s the problem we’re trying to solve here?

Page 8: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- Yelp’s monolith is ~3 million LoC (that’s just the Python).*

- Increasing number of developers.

*as of 28/09/2015

Page 9: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- Code deployments become increasingly difficult to coordinate.

- Surface area for impact of a bug greatly increases.

Page 10: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

What’s the solution?

Page 11: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

SOA

Page 12: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Solves everything, right?

Page 13: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

SOA: Round 1

Page 14: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- Statically defined list of hosts to deploy a service on.

- Operations handle deciding which hosts to deploy to.

Page 15: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- Manually configure Nagios for each service.

- Manual deployment system. Lots of rsync wrappers to push code around.

Page 16: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

This doesn’t scale well.

Page 17: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

PaaSTA

Page 18: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- Built on the shoulders of established tools.

- ‘Glue Code’ that coordinates these tools.

Page 19: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Components

Page 20: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Mesos

Page 21: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Marathon

Page 22: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Chronos

(almost)

Page 23: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

My work here is done, right?

Page 24: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Not Quite.

Page 25: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Services != Production

Page 26: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

What makes a service production ready?

Page 27: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

Page 28: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery

Page 29: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery- monitoring

Page 30: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery- monitoring- highly available

Page 31: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery- monitoring- highly available- operational support

Page 32: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery- monitoring- highly available- operational support

Page 33: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Services at Yelp tend to be:

- http api- Python- uWSGI

Page 34: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

We want to be stack agnostic; developers shouldn’t be constrained by dependencies on a server.

Page 35: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- PaaSTA only runs Docker containers.

- Developers own the

creation of the image.

Page 36: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

PaaSTA currently has Java, Golang and Python apps in production.

Page 37: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

PaaSTA provides tooling to automate the build and deployment of images via Jenkins.

Page 38: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

PaaSTA uses Git as its control plane.

Page 39: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

git push

make itest

push to registry

performance check

deploy to dev

(repeat for each dev env)

manual intervention

prod

Page 40: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Once a given image is marked for deployment in production, PaaSTA ‘bounces’ the app, gracefully upgrading the version.

Page 41: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- Reduces operational overhead of deploying service.

- Removes bottleneck of going through operations to deploy.

Page 42: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery- monitoring- highly available- operational support

Page 43: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Smartstack

Page 44: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- Originally written by Airbnb- Yelp now has maintainers

working on it.

Page 45: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

s2s1 s3 s4s2s1 s3 s4

s2s1 s3 s4

H H

H

S N NS

S N

ZK

Page 46: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

s2s1 s3 s4s2s1 s3 s4

s2s1 s3 s4

H H

H

S N NS

S N

ZK

Page 47: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

s2s1 s3 s4s2s1 s3 s4

s2s1 s3 s4

H H

H

S N NS

S N

ZK

Page 48: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

s2s1 s3 s4s2s1 s3 s4

s2s1 s3 s4

H H

H

S N NS

S N

ZK

Page 49: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

There’s no place like 127.0.0.1 169.254.255.254

Page 50: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Why Smartstack?

Page 51: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- ZK/synapse/nerve dying doesn’t wipe us out.

- HAProxy has its own health checking system we can fall back to.

Page 52: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- HAProxy is a proven load balancer and http proxy.

- We can use Smartstack with non-PaaSTA services.

Page 53: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Zero-downtime HAProxy reloads:

http://bit.ly/1RsctGi

Page 54: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery- monitoring- highly available- operational support

Page 55: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015
Page 56: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- API allows us to send event data.

- Flexibility to assign alerts to service authors, rather than forcing it on operations team.

Page 57: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

$ cat monitoring.yaml

--

team: search_infra

notification_email: [email protected]

page: true

runbook: 'y/rb-myservice'

alert_after: 5m

realert_every: 10m

tip: 'The federator service is in the critical path for

search, you should be fixing this'

Page 58: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

./check_marathon_services_replication

Page 59: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

./check_hung_setup_marathon_jobs

Page 60: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery- monitoring- highly available- operational support

Page 61: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Yelp organises machines into latency zones.

Page 62: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

SuperregionRegionHabitat

Page 63: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

$ cat smartstack.yaml

---

main:

advertise: [superregion]

discover: superregion

proxy_port: 20603

Page 64: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

By choosing a more specific latency zone, service owners optimize for RTT over availability.

Page 65: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- By being aware of these latency zones, PaaSTA can make smarter decisions on how to constrain applications.

Page 66: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Without this coupling, Marathon wouldn’t balance apps evenly amongst the latency zones.

Page 67: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery- monitoring- highly available- operational support

Page 68: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

PaaSTA comes with a cli for managing PaaSTA services.

Page 69: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015
Page 70: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015
Page 71: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015
Page 72: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

- easy deployment for developers

- discovery- monitoring- highly available- operational support

Page 73: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

Questions?

Page 74: Mesos at Yelp: Building a robj@yelp.com/@rob johnson Rob ... · -Yelp’s monolith is ~3 million LoC (that’s just the Python). *-Increasing number of developers. *as of 28/09/2015

@YelpEngineering

YelpEngineers

engineeringblog.yelp.com

github.com/yelp