dev ops without the ops

Upload: konstantin-gredeskoul

Post on 14-Jul-2015

Page 1: Dev Ops without the Ops

Konstantin Gredeskoul, CTO, wanelo.com

DevOps without the “Ops” A fallacy? A dream? A ________?

@kig

@kigster

How Wanelo handles thousands of writes per second with 99.97% uptime without an operations team

Page 2: Dev Ops without the Ops

Proprietary and Confidential

Wanelo is the digital mall of the future, and a place to find the most amazing products.

Page 3: Dev Ops without the Ops
Page 4: Dev Ops without the Ops

People often ask…

What are you running on? No really, what’s your stack?

Are you on Mongo? No!?!?!??

or…

You running Ruby? WTF? It’s slow! You are running Erlang? WTF? It’s in Swedish!

etc.

Page 5: Dev Ops without the Ops

Backend Stack & Key Vendors

■ MRI Ruby, JRuby, Sinatra, Ruby on Rails

■ PostgreSQL, Solr, Redis, twemproxy, memcached, nginx, haproxy, pgbouncer, Elasticsearch

■ Joyent Cloud, SmartOS, Manta Object Store

■ ZFS, ARC cache, superb IO, SMF, Zones, DTrace, humans

■ DNSMadeEasy, MessageBus, Chef, SiftScience

■ Analytics: LeanPlum, MixPanel, Graphite

■ AWS S3 + Fastly CDN for user / product images

■ Monitoring, alerting, error reporting: Circonus, NewRelic, statsd, Boundary, PagerDuty, nagios, SumoLogic

Page 6: Dev Ops without the Ops

How much traffic does your app get?

• If you are building an internal website in Rails, you’d be lucky to get 100 RPM (requests per minute): your users are only a limited set of employees

• Semi-popular sites with up to a few hundred concurrent users can expect about 1K-2K RPM

• Once you cross the 100K RPM mark, you have joined the “small big boys” :)

• When you are Pinterest, Facebook or Twitter… you are probably doing 1-10M RPM

Page 7: Dev Ops without the Ops

So what is this talk about?

• Review Operations, DevOps, and the Cloud, and how the new technologies are changing the landscape

• Learn some key points and patterns that dramatically reduce stress and pain associated with running a site, particularly ruby and/or rails

• Discuss if modern startups really need a dedicated operations team, and if so – at what point?

Page 8: Dev Ops without the Ops

Let’s start with the basics: DevOps

Page 9: Dev Ops without the Ops

What the heck is DevOps?

• DevOps is a software development method that stresses communication, collaboration, integration, automation, and measurement of cooperation between software developers and other information-technology (IT) professionals. [1]

• “Today, many organizations are confused on what DevOps means for them..” [2]

1. Wikipedia article on DevOps

2. Forrester: “Eliminate DevOps Myths With Situational-Awareness-Based Performance”. John Rakowski, October 10, 2014

Page 10: Dev Ops without the Ops

DevOps, however, works…

“…Efficient teams are deploying code 30 times more frequently with 50 percent fewer failures in 2014…” [3]

“…DevOps practices correlate strongly with high organizational performance” [3]

3. Source: PuppetLabs “State of DevOps Report”, 2014

Page 11: Dev Ops without the Ops

Traditional “Heavy” Agile

• Traditional Ops responsibilities were often in conflict with product development: stability versus change.

[Diagram: Product, Dev, QA, Operations]

Page 12: Dev Ops without the Ops

Traditional Operations

• Uptime, stability and reliability

• On-call, fixing site at night

• Backups and disaster recovery

• Security, patching, OpenSSL :)

• Hardware

• Networking

• Colocation / DC

Page 13: Dev Ops without the Ops

“The Cloud” changed things

• Uptime, stability and reliability

• On-call, fixing site at night

• Backups and disaster recovery

• Security, patching, OpenSSL :)

• Hardware

• Networking

• Colocation / DC

Page 14: Dev Ops without the Ops

So the Cloud is a big part of what makes DevOps possible

Page 15: Dev Ops without the Ops

Let’s talk about a simpler and more friendly way to build and deploy software.

Page 16: Dev Ops without the Ops

Early Company Goals (based on Wanelo)

• Maximize iteration speed

• Practice “aggro-agile”™

• Scale up as we go, keep the app fast

• Break things, learn, move on

• Enable, empower and inspire our team

• Remain in control of our infrastructure

Page 17: Dev Ops without the Ops

And while moving really fast…

We just never hired Ops

But we did hire several brilliant engineers who actually enjoyed infrastructure / platform work.

Except they approach it like … code.

Page 18: Dev Ops without the Ops

Not having Ops meant

• We had to deploy our app to the cloud, and learn how to provision the nodes we needed, as well as:

• How to provision load balancers and app servers

• How to configure new Solr masters and replicas

• How to install and tune PostgreSQL databases

• memcaches, redis shards, twemproxy, haproxy

Page 19: Dev Ops without the Ops

Fast forward to today

• 100% cloud hosted (Joyent Cloud)

• 100% automated (Chef)

• Survived 10,000% traffic growth in 6 months

• 99.97% uptime (without trying very hard)

• On-call engineers get 1-2 pages per week

• 80% of engineers are on the on-call rotation, including iOS & Android developers

Page 20: Dev Ops without the Ops

Still no “Ops” team, but plenty of Ops work

Page 21: Dev Ops without the Ops

How?

Page 22: Dev Ops without the Ops

1. Automation and Deployment

• Infrastructure is a first class citizen

• Pairs deliver user stories which include automation

• Did I mention we pair program? It rocks!

• We run Chef continuously in production

• I want to trust my tools, and if they break, fix them

• Partition staging and production environments

Page 23: Dev Ops without the Ops

Incremental Deployment

• Roll code out everywhere, restart 2% of servers

• Watch errors, latency, other anomalies

• When satisfied, continue rolling to all servers

• Ensure old and new code can co-exist

• Ensure no “drop/rename” migrations happen on live tables

• Ensure no migrations take exclusive locks (e.g. use CREATE INDEX CONCURRENTLY)
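
The rollout steps on this slide can be sketched roughly as follows. This is an illustration only, not Wanelo's deploy tooling: `canary_rollout`, `restart` and `looks_healthy` are hypothetical names, and the health check stands in for whatever "watch errors, latency, other anomalies" means in a given setup.

```python
from typing import Callable, List, Sequence


def canary_rollout(
    servers: Sequence[str],
    restart: Callable[[str], None],
    looks_healthy: Callable[[], bool],
    canary_percent: int = 2,
) -> List[str]:
    """Restart a small canary batch first; continue only if metrics look fine.

    Returns the servers that were restarted onto the new code.
    """
    if not servers:
        return []
    # Integer math avoids float rounding; always restart at least one server.
    canary_count = max(1, len(servers) * canary_percent // 100)
    canaries, rest = servers[:canary_count], servers[canary_count:]

    restarted: List[str] = []
    for host in canaries:
        restart(host)
        restarted.append(host)

    # Hypothetical check: errors, latency, other anomalies on the canaries.
    if not looks_healthy():
        return restarted  # stop here; the other servers keep running old code

    for host in rest:
        restart(host)
        restarted.append(host)
    return restarted
```

Stopping after the canary batch is only safe because old and new code can co-exist, which is why that requirement sits on the same slide.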

Page 24: Dev Ops without the Ops

2. Fault tolerant infrastructure

• Ensure aggressive client timeouts

• Achieving fault tolerance today is much cheaper than ever before! It’s a crime not to do it :)

• Put haproxy in front of everything, literally

• Stateless services only

• Put makara, twemproxy, Dalli in front of database, redis and memcached
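
As a rough illustration of "aggressive client timeouts", here is a minimal sketch using only the Python standard library. The function name, the 250 ms default and the fallback value are assumptions made for the example, not values from the talk.

```python
import socket
from typing import Optional


def fetch_with_timeout(
    host: str,
    port: int,
    payload: bytes,
    timeout_s: float = 0.25,
    fallback: Optional[bytes] = None,
) -> Optional[bytes]:
    """Send a request, but give up quickly instead of hanging on a sick backend."""
    try:
        # The timeout applies to connect, send and recv on this socket.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(payload)
            return conn.recv(4096)
    except OSError:
        # Covers timeouts and refused connections: degrade, don't block.
        return fallback
```

The caller gets the fallback within roughly `timeout_s` seconds, so one slow dependency cannot tie up every application worker.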

Page 25: Dev Ops without the Ops

Let’s look at a couple of recipes for resilience

Resilience keeps you sleeping at night

Page 26: Dev Ops without the Ops

Where is everything? HAProxy + Chef Search + Stateless

App talks to http://127.0.0.1:8000, http://127.0.0.1:8001

[Diagram: inside one virtual zone / server, the App talks to a local HAProxy, which fronts the backends: Solr, Web Service, ElasticSearch]

Page 27: Dev Ops without the Ops

This pattern allows us to have one place that knows about everything else, in Chef
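
To make the idea concrete, here is a small sketch of rendering a local HAProxy section from a node inventory. It assumes the inventory has already been fetched (in the talk this comes from Chef search); `render_listen`, the node names and the addresses are made up for the example.

```python
from typing import List, Tuple

# (node name, ip address, port): hypothetical inventory entries.
Node = Tuple[str, str, int]


def render_listen(name: str, local_port: int, nodes: List[Node]) -> str:
    """Render one HAProxy `listen` section bound to a localhost port."""
    lines = [
        f"listen {name}",
        f"    bind 127.0.0.1:{local_port}",
        "    mode http",
    ]
    for node_name, ip, port in nodes:
        # `check` enables health checks, so dead nodes drop out of rotation.
        lines.append(f"    server {node_name} {ip}:{port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_listen("solr", 8000, [("solr1", "10.0.0.11", 8983),
                                       ("solr2", "10.0.0.12", 8983)]))
```

The application only ever needs to know `127.0.0.1:8000`; adding or removing a backend means regenerating this file, not changing application config.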

Page 28: Dev Ops without the Ops

What the hell is Makara?

• Makara is a simple database routing tool for ActiveRecord that has been in production on Wanelo and TaskRabbit for years

• https://github.com/wanelo/makara (PostgreSQL)

• https://github.com/taskrabbit/makara (MySQL)

Page 29: Dev Ops without the Ops

• Was the simplest library to understand, and port to

• Worked in the multi-threaded environment of Sidekiq Background Workers

• Automatically retries if a replica goes down

• Load balances with weights

• Was running in production
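
The two behaviours listed above (weighted load balancing, retry when a replica is down) can be sketched generically as below. This is not Makara's API or implementation, just the general technique; `read_with_failover`, the replica names and the use of `ConnectionError` are assumptions for the example.

```python
import random
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")


def read_with_failover(
    replicas: Dict[str, int],
    query: Callable[[str], T],
    rng: Optional[random.Random] = None,
) -> T:
    """Pick a replica by weight; on connection failure, retry on another one."""
    rng = rng or random.Random()
    remaining = dict(replicas)  # replica name -> positive weight
    last_error: Optional[Exception] = None
    while remaining:
        names = list(remaining)
        weights = [remaining[n] for n in names]
        choice = rng.choices(names, weights=weights, k=1)[0]
        try:
            return query(choice)
        except ConnectionError as exc:
            last_error = exc
            del remaining[choice]  # do not retry the same dead replica
    raise ConnectionError("all replicas are down") from last_error
```

A replica with weight 3 receives roughly three times the reads of a replica with weight 1, and a failed replica is skipped for the rest of that call.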

Page 30: Dev Ops without the Ops

Replicate everything that replicates

[Diagram: App, HAProxy, Solr Master, three Solr Replicas; Web / API Requests (reads), Background Worker Queue (writes)]

Page 31: Dev Ops without the Ops

Degraded State, but still up!

Many replicas can be down

[Diagram: App, HAProxy, Solr Master, three Solr Replicas; Web / API Requests (reads), Background Worker Queue (writes)]

Page 32: Dev Ops without the Ops

Replicas are great because they are easy to add and often ok to ignore when they die/reboot/etc.

Page 33: Dev Ops without the Ops

Don’t buy an expensive load balancer

[Diagram: example.com points at two load balancers (haproxy + nginx) at 200.200.234.145 and 200.200.234.146, in front of six app servers]

Page 34: Dev Ops without the Ops

You can build a decent one with DNS

[Diagram: the DNS provider pings two load balancers (haproxy + nginx) at 200.200.234.145 and 200.200.234.146, each in front of three app servers]

DNS auto-failover is offered with some enterprise DNS services, e.g. from DNSMadeEasy

Page 35: Dev Ops without the Ops

When LB goes down, it is removed from the DNS pool

[Diagram: the DNS provider pings two load balancers (haproxy + nginx) at 200.200.234.145 and 200.200.234.146, each in front of three app servers]

It works pretty well

Page 36: Dev Ops without the Ops

It works pretty well

[Diagram: example.com; one load balancer (haproxy + nginx) still answers the DNS provider's ping, the other is marked "Dead Load Balancer"; addresses 200.200.234.145 and 200.200.234.146; six app servers]

This works best with a short TTL

Configure LBs in pairs, each acting as the other's failover, to account for network partitioning

When LB goes down, it is removed from the DNS pool
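
The failover logic a DNS provider applies can be approximated in a few lines. This is a sketch of the general idea, not DNSMadeEasy's behaviour: `healthy_pool` and `is_alive` are hypothetical, and keeping the full pool when every probe fails is an assumption of this example (the monitor itself may be the one that is partitioned).

```python
from typing import Callable, List


def healthy_pool(load_balancers: List[str], is_alive: Callable[[str], bool]) -> List[str]:
    """Return the load balancer IPs that should stay in the DNS answer."""
    alive = [ip for ip in load_balancers if is_alive(ip)]
    # Assumption: if nothing answers, serve the full pool rather than no records.
    return alive or list(load_balancers)


if __name__ == "__main__":
    pool = ["200.200.234.145", "200.200.234.146"]
    # Pretend the second load balancer stopped answering pings.
    print(healthy_pool(pool, lambda ip: ip.endswith(".145")))
```

A short TTL matters because clients keep using a cached answer until it expires, no matter how quickly the pool itself is updated.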

Page 37: Dev Ops without the Ops

This pattern allows us to tolerate reboots and maintenance with minimal effect on our users

Page 38: Dev Ops without the Ops

Failover to the overflow pattern

Two queues: large primary, small secondary

The primary distributes jobs to a large set of specialized workers, assigned to specific queues

The failover queue has only a small number of overflow workers, but they will accept any work

[Diagram: App, HAProxy (Primary Backend 1, Failover Backend 2), Redis Primary Queue with Primary Background Workers, Redis Failover with "Overflow" Workers]

Page 39: Dev Ops without the Ops

During spikes in traffic, this pattern allows our application to continue enqueuing jobs when the primary is overwhelmed

This is useful in situations when you can’t easily round robin between multiple shards.

Example: Sidekiq with a “Unique Job” extension.
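
Here is a minimal application-level sketch of the overflow idea. In the slides the choice between the primary and the failover Redis is made by HAProxy, not by application code; `enqueue_with_overflow` and `QueueUnavailable` are hypothetical names used only to show the control flow.

```python
from typing import Any, Callable


class QueueUnavailable(Exception):
    """Raised by a queue client when it cannot accept a job right now."""


def enqueue_with_overflow(
    job: Any,
    primary: Callable[[Any], None],
    overflow: Callable[[Any], None],
) -> str:
    """Try the large primary queue first; fall back to the small overflow queue."""
    try:
        primary(job)
        return "primary"
    except QueueUnavailable:
        # Overflow workers accept any job type, so nothing is dropped.
        overflow(job)
        return "overflow"
```

The application can keep enqueuing during a spike, and the overflow workers drain the secondary queue at their own pace.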

Page 40: Dev Ops without the Ops

3. Alert only on what’s important

• Nagios is great for visibility

• Not great for knowing when to drop everything because the site is on fire

• Some tools allow alerting on the first derivative of an observed metric.

• This is what we want: rapid drop (or increase) in a key metric to generate an alert.

Page 41: Dev Ops without the Ops

Alerting examples

• We never page on “host down”. Because, who cares? The host is likely redundant, and will be back. …Probably.

• We only page for things like “sudden drop in product saves per second”, or a spike in error rate, etc.

• Monitoring / alerting tool Circonus supports this
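
Alerting on the first derivative can be sketched as below. This is a simplified illustration, not how Circonus computes it: the function names, the two-sample slope and the threshold are assumptions for the example.

```python
from typing import List, Tuple

# (unix timestamp in seconds, metric value), e.g. product saves per second.
Sample = Tuple[float, float]


def first_derivative(samples: List[Sample]) -> float:
    """Rate of change of the metric between the two most recent samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


def should_page(samples: List[Sample], max_change_per_s: float) -> bool:
    """Page when the metric moves faster than the threshold, up or down."""
    if len(samples) < 2:
        return False
    return abs(first_derivative(samples)) > max_change_per_s
```

With a threshold of 0.5 per second, a drift from 120 to 118 over a minute stays quiet, while a fall from 120 to 30 in the same minute (a slope of -1.5) pages someone.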

Page 42: Dev Ops without the Ops

4. Obsessive monitoring

• Modern tools offer unprecedented visibility

• Real time application monitoring

• Real time business stats monitoring

• Real time network monitoring

• Dashboards, TV Monitor, alerts

• Real time, real time, real time.

Page 43: Dev Ops without the Ops

Systems Status: Dashboard Monitoring & Graphing with Circonus, NewRelic, statsd, nagios

Page 44: Dev Ops without the Ops

5. Cloud vendor is your partner

• We get phenomenal customer support from Joyent

• Our Cloud Partner, in a way, is our Ops

• Joyent is innovative in that they develop and run their own cloud stack: from the OS layer (SmartOS) to the data center management software

• They offer a unique option to take our “cloud” in-house when that time comes

Page 45: Dev Ops without the Ops

6. DevOps, really, is just code

• Hire folks who write code, so that they don’t have to repeat the same task twice

• Everyone will be happier that way.

Page 46: Dev Ops without the Ops

So here is how to reduce stress!

1. Insist on 100% automation

2. Deploy fault tolerant patterns wherever possible

3. Page only on what’s important to the business

4. Monitor everything else obsessively

5. Choose a cloud provider that can be your partner

6. Infrastructure work is software engineering

Page 47: Dev Ops without the Ops

Thanks!

slideshare.net/kigster

github.com/kigster

github.com/wanelo

github.com/wanelo-chef

Wanelo technical blog: building.wanelo.com

@kig

@kigster