mad scalability: scaling when you are not google

Scaling when you are not Google Abel Muiño

Upload: abel-muino

Post on 21-Mar-2017

TRANSCRIPT

Page 1: Mad scalability: Scaling when you are not Google

Scaling when you are not Google

Abel Muiño

Page 2: Mad scalability: Scaling when you are not Google

Abel Muiño

‣ Lead Software Engineer

‣ Tweets as @amuino

‣ In another life, co-owned 1uptalent.com, played with Docker and used AWS for everything.

Page 3: Mad scalability: Scaling when you are not Google

Disclaimer

‣ Cabify is 5 years old

‣ I joined Cabify about 1.5 years ago to work on product

‣ What you will hear today might be

‣ 70% folklore / 30% experience

‣ Only about production

‣ Not applicable to other areas (data analytics)

Page 4: Mad scalability: Scaling when you are not Google
Page 5: Mad scalability: Scaling when you are not Google

Cabify

Page 6: Mad scalability: Scaling when you are not Google

Completed Journeys, 2011 to 2016 (chart; the axis has no legend because NDA and stuff)

Page 7: Mad scalability: Scaling when you are not Google

Backend committers, 2011 to 2016 (chart; vertical axis from 0 to 18)

Page 8: Mad scalability: Scaling when you are not Google

We are hiring!

(As if it wasn’t obvious from the charts)

Page 9: Mad scalability: Scaling when you are not Google

Circadian rhythm

Page 10: Mad scalability: Scaling when you are not Google

Prelude ???? - 2014

Page 11: Mad scalability: Scaling when you are not Google

Cabify foundations

‣ Mostly Ruby, some Go

‣ Running on VPS

‣ No sysadmins (devops?)

‣ CouchDB

‣ Redis

‣ Home-grown metrics & monitoring (limited)

Page 12: Mad scalability: Scaling when you are not Google

Servers

‣ 3 ⨉ Host servers

‣ Horizontally scalable

‣ Most services included (sidecars)

‣ Front + Back + Queue workers

‣ 1 ⨉ Realtime server

‣ Single Point of Failure

‣ Ansible for setting them up

(Diagram labels: VPS Provider; LB; web1, web2, web3; worker1; redis1, redis2, elastic; realtime; osrm; websock)

Page 13: Mad scalability: Scaling when you are not Google

CouchDB

Database of choice for Cabify

‣ Used to be run in-house → Unreliable

‣ Moved to Cloudant

‣ Managed

‣ Bare metal servers

‣ Requirement for everything else: run in the same datacenter

‣ …because the network matters

Page 14: Mad scalability: Scaling when you are not Google

Pros

‣ Cheap servers

‣ Professional DB management

‣ Still cheaper than in-house staff

‣ Scales up by either

‣ Emailing Cloudant

‣ Deploying new VPSs

Cons

‣ Datacenter lock-in

‣ Scarce visibility on load

‣ Low VPS utilization (for some services)

Page 15: Mad scalability: Scaling when you are not Google

Tl;dr: everything was fine

Until it wasn’t

Page 16: Mad scalability: Scaling when you are not Google

2015 Road to bare metal

Page 17: Mad scalability: Scaling when you are not Google

In 2014 we handled 7 times the load of 2013

Page 18: Mad scalability: Scaling when you are not Google

Installed NewRelic

‣ Monitors our Ruby stack

‣ Built custom adapters for API toolkit and CouchDB

‣ Golang not supported 😭

‣ Low hanging fruit for increasing performance

‣ Hint: Always contact a Sales Rep

‣ Bye bye home-grown monitoring! 👋

Page 19: Mad scalability: Scaling when you are not Google

VPS provider DDoSed

‣ Several times a week

‣ Cabify was unreachable

‣ VPSs were unreachable on the internal network

‣ Slow & bad support

‣ Reputation

‣ Solution: Level up!

Page 20: Mad scalability: Scaling when you are not Google

Nobody ever got fired for choosing IBM

Moved to Bare Metal @ Softlayer

Same guys hosting our Cloudant cluster 👍

Page 21: Mad scalability: Scaling when you are not Google

Mindset

Control the core, minimise work for everything else

Page 22: Mad scalability: Scaling when you are not Google

Everything must go

(Diagram labels: VPS Provider; web1, web2, web3; worker1; realtime; LB ⨉ 3; redis1, redis2, elastic; osrm; subscriber)

Page 23: Mad scalability: Scaling when you are not Google

Load Balancer

‣ Multiple PoP (starting operations in several countries)

‣ CDN

‣ Supporting websockets

‣ … and Load Balancing

‣ Low TCO

‣ https://www.incapsula.com

Page 24: Mad scalability: Scaling when you are not Google

Redis, ElasticSearch

‣ Same datacenter

‣ Completely managed

‣ Clustered / reliable

‣ RedisLabs

‣ Bonus: Memcached

‣ Qbox

Page 25: Mad scalability: Scaling when you are not Google

OSRM

‣ Same datacenter

‣ Completely managed

‣ Enhanced dataset

‣ Google Maps & Places (with enterprise license)

‣ 2 / 3, good enough

Page 26: Mad scalability: Scaling when you are not Google

Can do better?

Can we manage less infra?

(Diagram labels: Softlayer; web1, web2, web3; worker1; realtime; subscriber; Google; Incapsula; Redislabs; Qbox; Cloudant)

Page 27: Mad scalability: Scaling when you are not Google

Subscriber

Homebrew message bus / queue

‣ Felt like reinventing the wheel

‣ Looked for battle-tested bus / queue / broker

‣ In the same datacenter

‣ Had previous experience with RabbitMQ

‣ CloudAMQP

Page 28: Mad scalability: Scaling when you are not Google

Sidecars

Make each host self-sufficient

‣ Every server could run Cabify

‣ All services installed

‣ Except Realtime (SPOF)

‣ Horizontal scaling

‣ Good server utilisation (bare metal servers are larger)

Page 29: Mad scalability: Scaling when you are not Google

Cut our own servers by 50%

Served 5 times more requests

(Diagram labels: Softlayer; host01, host02, host03; realtime; Google; Incapsula; Redislabs; Qbox; Cloudant; CloudAMQP)

Page 30: Mad scalability: Scaling when you are not Google

Pros

‣ Same-datacenter latencies

‣ Only care about our product

‣ Still cheaper than in-house staff

‣ Scales up by either

‣ Emailing a provider

‣ Deploying new servers

‣ Good visibility on perf

Cons

‣ Datacenter lock-in

‣ Still no visibility on Golang perf

‣ Competing services on each server with different needs

‣ Fast & light http requests

‣ Slow & heavy queue workers

‣ Debuggability

Page 31: Mad scalability: Scaling when you are not Google

Tl;dr: everything was fine

Until it wasn’t

Page 32: Mad scalability: Scaling when you are not Google

2016, pushing to the limit

Page 33: Mad scalability: Scaling when you are not Google

In 2015 we handled 5 times the load of 2014

Page 34: Mad scalability: Scaling when you are not Google

In 2016 we would invade LatAm (new countries, cities, marketing…)

Page 35: Mad scalability: Scaling when you are not Google

Bumps on the road

‣ Started seeing intermittent latency spikes on Cloudant

‣ Disable some services, get back on track

‣ Tied to peak hours

‣ We lived through these, but it was stressful

Page 36: Mad scalability: Scaling when you are not Google

Be easy on the database

(trying to sleep better)

‣ Removed frequent N+1 query patterns

‣ Moved some queries to ElasticSearch

‣ Started caching more on Memcached

‣ Grew the cluster

‣ From 200ms to 100ms (average) 👏
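The N+1 bullet can be illustrated with a minimal Python sketch. It is not Cabify's code (the stack described here is Ruby and Go). `FakeDocumentStore`, its `get` / `get_many` methods and the journey and driver documents are all hypothetical stand-ins, used only to count round trips to a remote database.

```python
from typing import Dict, Iterable, List


class FakeDocumentStore:
    """Hypothetical stand-in for a remote document database; counts round trips."""

    def __init__(self, docs: Dict[str, dict]) -> None:
        self._docs = docs
        self.round_trips = 0

    def get(self, doc_id: str) -> dict:
        # One network round trip per document.
        self.round_trips += 1
        return self._docs[doc_id]

    def get_many(self, doc_ids: Iterable[str]) -> List[dict]:
        # One round trip for the whole batch.
        self.round_trips += 1
        return [self._docs[doc_id] for doc_id in doc_ids]


def driver_names_n_plus_one(store: FakeDocumentStore, journeys: List[dict]) -> List[str]:
    # N+1 pattern: after loading the journeys, one extra lookup per journey.
    return [store.get(j["driver_id"])["name"] for j in journeys]


def driver_names_batched(store: FakeDocumentStore, journeys: List[dict]) -> List[str]:
    # Batched: collect the distinct ids, fetch them once, join in memory.
    ids = sorted({j["driver_id"] for j in journeys})
    by_id = {d["_id"]: d for d in store.get_many(ids)}
    return [by_id[j["driver_id"]]["name"] for j in journeys]
```

With three journeys the first function makes three round trips and the second makes one. The gap grows with the number of journeys, which is why this kind of change tends to show up directly in average database latency.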

Page 37: Mad scalability: Scaling when you are not Google

RabbitMQ can’t cope

‣ We saturated the cluster CPU with moderate load

‣ Tied to us using tag-based routing

‣ Messages were delivered much later than expected

‣ Made changes to use simpler routing

‣ Is there anything simpler than RabbitMQ for simple routing? 🤔
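To make the routing trade-off concrete, here is a small Python sketch. It does not describe RabbitMQ internals or Cabify's actual bindings. It assumes "tag-based routing" means a queue receives a message when all of its required tags are present on the message. The function names and data shapes are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def route_by_tags(
    bindings: List[Tuple[FrozenSet[str], str]], message_tags: Set[str]
) -> List[str]:
    # Every binding is checked for every message, so the cost per message
    # grows with the number of bindings.
    return [queue for required, queue in bindings if required <= message_tags]


def route_direct(bindings: Dict[str, List[str]], routing_key: str) -> List[str]:
    # A single hash lookup, independent of how many bindings exist.
    return bindings.get(routing_key, [])
```

Under that assumption, moving to simpler routing trades flexibility (arbitrary tag combinations) for a fixed, cheap lookup per message.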

Page 38: Mad scalability: Scaling when you are not Google

Interlude

DynDNS goes down, and Cloudant uses them

We lose access to our database cluster load balancer

Patched /etc/hosts with the actual IPs in 30 minutes
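A hosts-file pin of this kind can be sketched in a few lines of Python. The hostname and IP address below are placeholders from the documentation ranges, not Cabify's or Cloudant's real values, and `render_hosts_entries` is a hypothetical helper.

```python
from typing import Dict


def render_hosts_entries(pinned: Dict[str, str]) -> str:
    """Render /etc/hosts lines ("IP<TAB>hostname") for a hostname -> IP mapping."""
    lines = ["# Temporary pin while DNS is unavailable; remove once it recovers."]
    for hostname in sorted(pinned):
        lines.append(f"{pinned[hostname]}\t{hostname}")
    return "\n".join(lines) + "\n"


# Hypothetical example values (example.com and 192.0.2.0/24 are reserved for docs).
print(render_hosts_entries({"db.example.com": "192.0.2.10"}), end="")
```

The pin bypasses DNS entirely, so it has to be removed once resolution is back, because the provider can change the addresses behind the name at any time.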

Page 39: Mad scalability: Scaling when you are not Google

The right tool for the job

Expecting growth in line with previous years

‣ CouchDB / Cloudant, not the best database for frequent updates

‣ Looking for alternatives to store fast-changing models

‣ RethinkDB

‣ Fast, easy to use, hosted options in same datacenter

‣ Streaming query updates

Page 40: Mad scalability: Scaling when you are not Google

Broke RethinkDB load balancer

Database stats were OK, but the LB couldn’t handle our rate

Slow support, no “enterprise” option

Decided to phase out RethinkDB

Page 41: Mad scalability: Scaling when you are not Google

Wrote our first «database»

Simple in-memory store, backed by CouchDB

Update indexes on writes. All queries are indexed

Implemented in Golang, consumed from Ruby

Replaces RethinkDB, which replaced CouchDB
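The idea of "update indexes on writes, all queries are indexed" can be sketched as follows. This is an illustration in Python, not the Golang implementation the slide mentions. The class name, the `persist` callback (standing in for the CouchDB backing store) and the field names are all assumptions.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Set


class IndexedMemoryStore:
    """In-memory documents with per-field indexes maintained on every write."""

    def __init__(self, indexed_fields: List[str], persist: Callable[[str, dict], None]) -> None:
        self._docs: Dict[str, dict] = {}
        self._indexed_fields = list(indexed_fields)
        self._indexes: Dict[str, Dict[Any, Set[str]]] = {
            field: defaultdict(set) for field in self._indexed_fields
        }
        self._persist = persist  # hypothetical write-through to the backing database

    def put(self, doc_id: str, doc: dict) -> None:
        # Persist first, so memory never gets ahead of the backing store.
        self._persist(doc_id, doc)
        old = self._docs.get(doc_id)
        if old is not None:
            # Drop the index entries of the previous version of the document.
            for field in self._indexed_fields:
                if field in old:
                    self._indexes[field][old[field]].discard(doc_id)
        self._docs[doc_id] = dict(doc)
        for field in self._indexed_fields:
            if field in doc:
                self._indexes[field][doc[field]].add(doc_id)

    def find(self, field: str, value: Any) -> List[dict]:
        # Only indexed fields can be queried, so there are no full scans.
        if field not in self._indexes:
            raise KeyError(f"no index on {field!r}")
        ids = self._indexes[field].get(value, set())
        return [self._docs[doc_id] for doc_id in sorted(ids)]
```

Writes pay the indexing cost once, and every read is a dictionary lookup. That suits fast-changing models that are read far more often than they are written.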

Page 42: Mad scalability: Scaling when you are not Google

Cloudant latency spikes fixed!

Grew the cluster for the second time in the year

Load balancer hardware upgraded, problems gone

Also reduced the number of connections from Ruby

Page 43: Mad scalability: Scaling when you are not Google

Relax the Sidecars

‣ Load on background workers interfering with serving http

‣ Split the servers:

‣ Front (ruby/golang http interface)

‣ Workers (ruby job queues, ruby background)

Page 44: Mad scalability: Scaling when you are not Google

Remove RabbitMQ

Replace with NSQ

Nice mix of sidecar and discovery

Page 45: Mad scalability: Scaling when you are not Google

Multiplied our own servers by 3

Served 4 times more requests

(Diagram labels: Softlayer; host01-09; rt01-02; work01-03; Google; Incapsula; Redislabs; Qbox; Cloudant; CloudAMQP)

Page 46: Mad scalability: Scaling when you are not Google

Pros

‣ Despite the problems, we had top-notch support from Cloudant

‣ Easy to scale out

‣ In-process database opened doors to new features

Cons

‣ Datacenter lock-in

‣ Still no visibility on Golang perf

Page 47: Mad scalability: Scaling when you are not Google

Cabify @ 2017

Page 48: Mad scalability: Scaling when you are not Google

In 2016 we handled 4 times the load of 2015
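Putting the yearly figures from the slides together gives a sense of scale. The short Python calculation below uses only numbers quoted in this deck (7×, 5× and 4× yearly load, servers halved and then tripled). It assumes that "load" and "requests" are comparable and that the slide figures are rounded.

```python
from math import prod

# Year-over-year load multipliers quoted on the slides.
yearly_growth = {2014: 7, 2015: 5, 2016: 4}

# Compounded load of 2016 relative to 2013.
cumulative_growth = prod(yearly_growth.values())


def per_server_factor(request_factor: float, server_factor: float) -> float:
    """How much more each server handles when requests and server count both change."""
    return request_factor / server_factor


per_server_2015 = per_server_factor(5, 0.5)  # servers cut by 50%, 5x the requests
per_server_2016 = per_server_factor(4, 3)    # servers multiplied by 3, 4x the requests

print(cumulative_growth)          # 140
print(per_server_2015)            # 10.0
print(round(per_server_2016, 2))  # 1.33
```

Read this way, the move to bare metal raised per-server throughput by roughly an order of magnitude, while the 2016 growth was absorbed mostly by adding servers.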

Page 49: Mad scalability: Scaling when you are not Google

Hired our first full-time sysadmin!

Page 50: Mad scalability: Scaling when you are not Google

Taking ownership

Improve our infra

Page 51: Mad scalability: Scaling when you are not Google

Own load balancers

Better control & traceability

‣ Still use Incapsula for its PoP

‣ Achieved much better load balancing

‣ 3 new dedicated servers

Page 52: Mad scalability: Scaling when you are not Google

Plans for the future

Page 53: Mad scalability: Scaling when you are not Google

Own Redis cluster

Better control & traceability

‣ Migrating from Redislabs hosted to Redislabs Enterprise

‣ Hosted used virtual servers

‣ We rely heavily on Redis (and Memcached)

‣ 3 new dedicated servers

‣ WIP

Page 54: Mad scalability: Scaling when you are not Google

Ruby → Elixir

‣ Fun to code with

‣ Higher performance

‣ Less memory

‣ Investment, about to release first service to production

Page 55: Mad scalability: Scaling when you are not Google

Extract from Product

Dedicated teams and resources for specific components

Make the core of Cabify leaner

Page 56: Mad scalability: Scaling when you are not Google

Thanks! And sorry for the 60 slides

Page 57: Mad scalability: Scaling when you are not Google

Questions?

Page 58: Mad scalability: Scaling when you are not Google

Abel Muiño @amuino