planning to fail #phpuk13

Planning to fail

@davegardnerisme #phpuk2013

the taxi app

Planning for failure

The beginning

My website: single VPS running PHP + MySQL

No growth, low volume, simple functionality, one engineer (me!)

Large growth, high volume, complex functionality, lots of engineers

• Launched in London, November 2011

• Now in 5 cities in 3 countries (30%+ growth every month)

• A Hailo hail is accepted around the world every 5 seconds

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of instances)

• 10+ engineers building services

And you? (Hailo is hiring)

Our overall reliability is in danger

Embracing failure

(a coping strategy)

VPS (running PHP+MySQL)

reliable?

Reliable !== Resilient

Choosing a stack

“Hailo” (running PHP+MySQL)

reliable?

Service

each service does one job well

Service Service Service

Service Oriented Architecture

• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation if needed

Service (running PHP+MySQL)

reliable?

Service MySQL

MySQL running on different box

Service

MySQL running in Multi-Master mode

Going global

Separating concerns

CRUD, Locking, Search, Analytics, ID generation

also queuing…

At Hailo we look for technologies that are:

• Distributed: run on more than one machine

• Homogeneous: all nodes look the same

• Resilient: can cope with the loss of node(s) with no loss of data

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”

http://blog.b3k.us/2012/01/24/some-rules.html

• Highly performant, scalable and resilient data store

• Underpins much of what we do at Hailo

• Makes multi-DC easy!

Cassandra

• Highly reliable distributed coordination

• We implement locking and leadership election on top of ZK and use sparingly

ZooKeeper

• Distributed, RESTful, Search Engine built on top of Apache Lucene

• Replaced basic foo LIKE ‘%bar%’ queries (so much better)

Elasticsearch

• Realtime message processing system designed to handle billions of messages per day

• Fault tolerant, highly available with reliable message delivery guarantee

NSQ

• Distributed ID generation with no coordination required

• Rock solid

Cruftflake
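To make "no coordination required" concrete, here is a minimal sketch of a Snowflake-style ID layout, which is the general scheme Cruftflake follows. The bit widths (41 bits of time, 10 bits of machine ID, 12 bits of sequence), the custom epoch and the function names are assumptions for illustration, not Cruftflake's actual code. It needs 64-bit PHP.

```php
<?php
// Snowflake-style 64-bit ID: [41 bits time][10 bits machine][12 bits sequence].
// Bit widths and the epoch below are assumptions for this sketch.
const EPOCH_MS = 1325376000000; // 2012-01-01T00:00:00Z, a hypothetical custom epoch

function composeId(int $timestampMs, int $machineId, int $sequence): int
{
    if ($machineId < 0 || $machineId > 1023) {
        throw new InvalidArgumentException('machine ID must fit in 10 bits');
    }
    if ($sequence < 0 || $sequence > 4095) {
        throw new InvalidArgumentException('sequence must fit in 12 bits');
    }
    $elapsed = $timestampMs - EPOCH_MS;
    if ($elapsed < 0 || $elapsed >= (1 << 41)) {
        throw new InvalidArgumentException('timestamp out of range');
    }
    // Time occupies the high bits, so IDs sort roughly by creation time.
    return ($elapsed << 22) | ($machineId << 12) | $sequence;
}

function decomposeId(int $id): array
{
    return [
        'timestampMs' => ($id >> 22) + EPOCH_MS,
        'machineId'   => ($id >> 12) & 0x3FF,
        'sequence'    => $id & 0xFFF,
    ];
}
```

Each generator only needs its own machine ID, a clock, and a per-millisecond counter, so two nodes never have to talk to each other to avoid collisions.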

• All these technologies have similar properties of distribution and resilience

• They are designed to cope with failure

• They are not broken by design

Lessons learned

Minimise the critical path

What is the minimum viable service?
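One way to read "minimise the critical path" is that calls to non-essential dependencies should degrade to a default rather than take the whole request down. A minimal sketch, assuming a hypothetical helper named callOptional (not from the talk):

```php
<?php
// Hypothetical helper: run a call to a non-critical dependency and
// fall back to a default value if it throws.
function callOptional(callable $operation, $default)
{
    try {
        return $operation();
    } catch (\Throwable $e) {
        // Record the failure so monitoring can see it, then degrade.
        error_log('optional dependency failed: ' . $e->getMessage());
        return $default;
    }
}
```

Anything wrapped this way is off the critical path: the feature it powers may be missing, but the service still answers.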

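class HailoMemcacheService
{
    private $mc = null;   // created on first use, not in the constructor
    private $servers;     // array of [host, port] pairs

    public function __construct(array $servers)
    {
        $this->servers = $servers;
    }

    public function __call($name, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use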

Configure clients carefully

$this->mc = new \Memcached;
$this->mc->addServers($s);

// connect and poll timeouts are in milliseconds
$this->mc->setOption(\Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
// send and receive timeouts are in microseconds
$this->mc->setOption(\Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);

Make sure timeouts are configured

Choose timeouts based on data

“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html

95th percentile
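A minimal sketch of deriving a timeout from measured latencies, using the nearest-rank 95th percentile. The function names and the headroom multiplier are assumptions for illustration; the slides only say to base timeouts on data.

```php
<?php
// Nearest-rank percentile over observed latencies (e.g. in milliseconds).
function percentile(array $samples, float $p): float
{
    $n = count($samples);
    if ($n === 0) {
        throw new InvalidArgumentException('need at least one sample');
    }
    if ($p <= 0 || $p > 100) {
        throw new InvalidArgumentException('percentile must be in (0, 100]');
    }
    sort($samples); // sorts the local copy only
    $rank = (int) ceil($p * $n / 100);
    return (float) $samples[$rank - 1];
}

// Hypothetical policy: timeout = p95 latency plus some headroom.
function suggestTimeoutMs(array $samples, float $headroom = 1.5): int
{
    return (int) ceil(percentile($samples, 95.0) * $headroom);
}
```

With latencies of 10, 20, ... 200 ms, the 95th percentile is 190 ms and the suggested timeout is 285 ms. The right headroom depends on how expensive a false timeout is compared with a slow request.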

• Kill memcache on box A, measure impact on application

• Kill memcache on box B, measure impact on application

All fine.. we’ve got this covered!

• Box A, running in AWS, locks up

• Any parts of application that touch Memcache stop working

Things fail in exotic ways

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT

$ php test-memcache.php

Working OK!

Packets rejected and source notified by ICMP. Expect fast fails.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP

Working OK!

Packets silently dropped. Expect long time outs.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP

Hangs! Uh oh.

• When AWS instances hang they appear to accept connections but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031

Fix, rinse, repeat

It would be nice if we could automate this

Automate!

• Hailo run a dedicated automated test environment

• Powered by bash, JMeter and Graphite

• Continuous automated testing with failure simulations

Fix attempt 1: bad timeouts configured

Fix attempt 2: better timeouts

Simulate in system tests

Simulate failure

Assert monitoring endpoint picks this up

Assert features still work
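The three steps above could be wired together roughly as follows. This is a sketch only: runFailureScenario and its callable parameters are hypothetical names, not Hailo's test tooling, and the real checks would hit a monitoring endpoint and the application under test.

```php
<?php
// Hypothetical harness: inject a failure, record what monitoring and the
// application report while it is active, and always undo the failure.
function runFailureScenario(
    callable $injectFailure,
    callable $restore,
    callable $monitorReportsFailure,
    callable $featureWorks
): array {
    $injectFailure();
    try {
        return [
            'monitoringDetected' => (bool) $monitorReportsFailure(),
            'featuresWork'       => (bool) $featureWorks(),
        ];
    } finally {
        // Runs even if a check throws, so the environment is left clean.
        $restore();
    }
}
```

A scenario passes only when both values are true: the failure was noticed and the features kept working anyway.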

In conclusion

“the best way to avoid failure is to fail constantly.”

http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

TIMED BLOCK ALL THE THINGS
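Reading "timed block" as wrapping a block of code in a timer and reporting how long it took, a minimal sketch might look like this. The helper name timedBlock and the reporter callback are assumptions; in practice the reporter would send the measurement to a metrics system such as Graphite.

```php
<?php
// Hypothetical helper: time a block of code and report the elapsed
// milliseconds, whether the block returns or throws.
function timedBlock(string $name, callable $block, callable $report)
{
    $start = microtime(true);
    try {
        return $block();
    } finally {
        $report($name, (microtime(true) - $start) * 1000.0);
    }
}
```

Because the report happens in finally, slow failures are measured as well as slow successes, which is the data needed to choose timeouts.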

Thanks

Software used at Hailo

http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp

Plus a load of other things I’ve not mentioned.
