planning to fail #phpne13
DESCRIPTION
Slides from my Planning to Fail talk, given at the PHP North East conference 2013. This is a slightly longer version of the same talk given at the PHP UK conference. The talk was about how you can build resilient systems by embracing failure.
TRANSCRIPT
Planning to fail
@davegardnerisme #phpne13
dave
the taxi app
Planning to fail
Planning for failure
Planning to fail
Why?
http://en.wikipedia.org/wiki/High_availability
99.9% (three nines)
Downtime:
43.8 minutes per month
8.76 hours per year
99.99% (four nines)
Downtime:
4.38 minutes per month
52.56 minutes per year
99.999% (five nines)
Downtime:
26.3 seconds per month
5.26 minutes per year
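These numbers are just unavailability multiplied by the length of the period. A quick sketch of the arithmetic (in Python rather than PHP, purely for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of permitted downtime for an availability target over a period."""
    return (1 - availability) * period_minutes

for target in (0.999, 0.9999, 0.99999):
    per_year = downtime_minutes(target)
    print(f"{target:.3%}: {per_year / 12:.2f} min/month, {per_year:.2f} min/year")
```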
The beginning
<?php
My website: single VPS running PHP + MySQL
No growth, low volume, simple functionality, one engineer (me!)
Large growth, high volume, complex functionality, lots of engineers
• Launched in London, November 2011
• Now in 5 cities in 3 countries (30%+ growth every month)
• A Hailo hail is accepted around the world every 5 seconds
“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
• SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of instances)
• 10+ engineers building services
and you? (Hailo is hiring)
Our overall reliability is in danger
Embracing failure
(a coping strategy)
VPS (running PHP+MySQL)
reliable?
Reliable !== Resilient
Choosing a stack
“Hailo” (running PHP+MySQL)
reliable?
Service
each service does one job well
Service Service Service
Service Oriented Architecture
• Fewer lines of code
• Fewer responsibilities
• Changes less frequently
• Can swap entire implementation if needed
Service (running PHP+MySQL)
reliable?
Service MySQL
MySQL running on different box
Service
MySQL
MySQL
MySQL running in Multi-Master mode
Going global
MySQL
Separating concerns
CRUD, Locking, Search, Analytics, ID generation
also queuing…
At Hailo we look for technologies that are:
• Distributed: run on more than one machine
• Homogeneous: all nodes look the same
• Resilient: can cope with the loss of node(s) with no loss of data
“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
http://blog.b3k.us/2012/01/24/some-rules.html
• Highly performant, scalable and resilient data store
• Underpins much of what we do at Hailo
• Makes multi-DC easy!
Cassandra
• Highly reliable distributed coordination
• We implement locking and leadership election on top of ZK and use sparingly
ZooKeeper
• Distributed, RESTful search engine built on top of Apache Lucene
• Replaced basic foo LIKE ‘%bar%’ queries (so much better)
Elasticsearch
• Real-time message processing system designed to handle billions of messages per day
• Fault tolerant and highly available, with a reliable message delivery guarantee
NSQ
• Real-time incremental analytics platform, backed by Apache Cassandra
• Powerful SQL-like interface
• Scalable and highly available
Acunu Analytics
• Distributed ID generation with no coordination required
• Rock solid
Cruftflake
• All these technologies have similar properties of distribution and resilience
• They are designed to cope with failure
• They are not broken by design
Lessons learned
Minimise the critical path
What is the minimum viable service?
class HailoMemcacheService
{
    private $mc = null;

    public function __call($method, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($s); // $s = configured server list
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use
Configure clients carefully
$this->mc = new \Memcached;
$this->mc->addServers($s);

$this->mc->setOption(
    \Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(
    \Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
Make sure timeouts are configured
Choose timeouts based on data
here?
“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
95th percentile
here?
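One way to decide where “here” is: derive the timeout from measured latencies rather than a guess — take a high percentile and add margin. A sketch with made-up sample data (Python, purely illustrative; all names and numbers are hypothetical):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample >= pct% of all samples."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical memcache call latencies (ms) from production metrics;
# note the one pathological outlier.
latencies_ms = [3, 3, 4, 4, 4, 5, 5, 5, 6, 6,
                6, 7, 7, 8, 8, 9, 10, 11, 12, 250]

p95 = percentile(latencies_ms, 95)   # 12 ms
timeout_ms = p95 * 2                 # margin on top; tune per service
```

An aggressive timeout a safe margin above the 95th percentile fails fast on a hung node while leaving healthy requests untouched.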
Test
• Kill memcache on box A, measure impact on application
• Kill memcache on box B, measure impact on application
All fine.. we’ve got this covered!
FAIL
• Box A, running in AWS, locks up
• Any parts of application that touch Memcache stop working
Things fail in exotic ways
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT
$ php test-memcache.php
Working OK!
Packets rejected and source notified by ICMP. Expect fast fails.
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP
$ php test-memcache.php
Working OK!
Packets silently dropped. Expect long time outs.
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-memcache.php
Hangs! Uh oh.
• When AWS instances hang they appear to accept connections but drop packets
• Bug!
https://bugs.launchpad.net/libmemcached/+bug/583031
Fix, rinse, repeat
RabbitMQ RabbitMQ RabbitMQ
Service
AMQP (port 5672)
HA cluster
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 5672 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Fantastic! Block AMQP port, client times out
FAIL
“RabbitMQ clusters do not tolerate network partitions well.”
http://www.rabbitmq.com/partitions.html
$ epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 60278
Each node listens on a port assigned by EPMD
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 60278 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Hangs! Uh oh.
Mnesia('rabbit@dmzutilities03-global01-test'):
** ERROR ** mnesia_event got
    {inconsistent_database, running_partitioned_network,
     'rabbit@dmzutilities01-global01-test'}

application: rabbitmq_management
exited: shutdown
type: temporary

RabbitMQ logs show a partitioned-network error; nodes shut down
while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread(
        $this->sock->real_sock(), $n - $read)))) {
    $read += strlen($buf);
    $res .= $buf;
}
PHP library didn’t have any time limit on reading a frame
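Whatever the client library, the general fix is to bound the whole frame read with a deadline, not just each individual read call. A sketch of the shape (Python with a stubbed reader, not the real nsqphp code):

```python
import time

def read_exactly(read_chunk, n, deadline_s):
    """Read exactly n bytes via read_chunk(max_bytes), giving up once
    deadline_s seconds have elapsed overall -- even if each individual
    read keeps returning a trickle of data."""
    deadline = time.monotonic() + deadline_s
    buf = b""
    while len(buf) < n:
        if time.monotonic() > deadline:
            raise TimeoutError(f"read {len(buf)}/{n} bytes before deadline")
        chunk = read_chunk(n - len(buf))
        if not chunk:  # EOF before the frame completed
            raise ConnectionError("peer closed connection mid-frame")
        buf += chunk
    return buf
```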
Fix, rinse, repeat
It would be nice if we could automate this
Automate!
• Hailo run a dedicated automated test environment
• Powered by bash, JMeter and Graphite
• Continuous automated testing with failure simulations
Fix attempt 1: bad timeouts configured
Fix attempt 2: better timeouts
Simulate in system tests
Simulate failure
Assert monitoring endpoint picks this up
Assert features still work
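Those three steps fit a simple harness. A sketch (Python; simulate, restore, monitoring_ok and feature_ok are hypothetical hooks you would back with iptables rules, JMeter runs and Graphite queries):

```python
def run_failure_test(simulate, restore, monitoring_ok, feature_ok):
    """One failure scenario: inject the fault, check that monitoring
    notices and that user-facing features survive, then restore and
    check the system recovers."""
    simulate()  # e.g. add an iptables DROP rule on one node
    try:
        assert not monitoring_ok(), "monitoring failed to detect the fault"
        assert feature_ok(), "a feature broke during single-node failure"
    finally:
        restore()  # remove the rule
    assert monitoring_ok(), "system did not recover after restore"
```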
In conclusion
“the best way to avoid failure is to fail constantly.”
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
You should test for failure
How does the software react?How does the PHP client react?
Automation makes continuous failure testing feasible
Systems that cope well with failure are easier to operate
TIMED BLOCK ALL THE THINGS
Thanks
Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp
Plus a load of other things I’ve not mentioned.
Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems