planning to fail #phpne13
DESCRIPTION
Slides from my Planning to Fail talk, given at the PHP North East conference 2013. This is a slightly longer version of the same talk given at the PHP UK conference. The talk was about how you can build resilient systems by embracing failure.
TRANSCRIPT
Planning to fail
@davegardnerisme #phpne13
dave
the taxi app
Planning to fail
Planning for failure
Planning to fail
Why?
http://en.wikipedia.org/wiki/High_availability
99.9% (three nines)
Downtime:
43.8 minutes per month
8.76 hours per year
99.99% (four nines)
Downtime:
4.38 minutes per month
52.56 minutes per year
99.999% (five nines)
Downtime:
26.3 seconds per month
5.26 minutes per year
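These numbers are just unavailability multiplied by the length of the period. A quick sketch of the arithmetic (in Python rather than PHP, purely for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of permitted downtime for an availability target over a period."""
    return (1 - availability) * period_minutes

for target in (0.999, 0.9999, 0.99999):
    per_year = downtime_minutes(target)
    print(f"{target:.3%}: {per_year / 12:.2f} min/month, {per_year:.2f} min/year")
```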
The beginning
<?php
My website: single VPS running PHP + MySQL
No growth, low volume, simple functionality, one engineer (me!)
Large growth, high volume, complex functionality, lots of engineers
• Launched in London, November 2011
• Now in 5 cities in 3 countries (30%+ growth every month)
• A Hailo hail is accepted around the world every 5 seconds
“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
• SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of instances)
• 10+ engineers building services
and you? (Hailo is hiring)
Our overall reliability is in danger
Embracing failure
(a coping strategy)
VPS (running PHP+MySQL)
reliable?
Reliable !== Resilient
Choosing a stack
“Hailo” (running PHP+MySQL)
reliable?
Service
each service does one job well
Service Service Service
Service Oriented Architecture
• Fewer lines of code
• Fewer responsibilities
• Changes less frequently
• Can swap entire implementation if needed
Service (running PHP+MySQL)
reliable?
Service MySQL
MySQL running on different box
Service
MySQL
MySQL
MySQL running in Multi-Master mode
Going global
MySQL
Separating concerns
CRUD, Locking, Search, Analytics, ID generation
also queuing…
At Hailo we look for technologies that are:
• Distributed: run on more than one machine
• Homogeneous: all nodes look the same
• Resilient: can cope with the loss of node(s) with no loss of data
“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
http://blog.b3k.us/2012/01/24/some-rules.html
• Highly performant, scalable and resilient data store
• Underpins much of what we do at Hailo
• Makes multi-DC easy!
Cassandra
• Highly reliable distributed coordination
• We implement locking and leadership election on top of ZK and use sparingly
ZooKeeper
• Distributed, RESTful search engine built on top of Apache Lucene
• Replaced basic foo LIKE ‘%bar%’ queries (so much better)
Elasticsearch
• Real-time message processing system designed to handle billions of messages per day
• Fault tolerant and highly available, with a reliable message delivery guarantee
NSQ
• Real-time incremental analytics platform, backed by Apache Cassandra
• Powerful SQL-like interface
• Scalable and highly available
Acunu Analytics
• Distributed ID generation with no coordination required
• Rock solid
Cruftflake
• All these technologies have similar properties of distribution and resilience
• They are designed to cope with failure
• They are not broken by design
Lessons learned
Minimise the critical path
What is the minimum viable service?
class HailoMemcacheService
{
    private $mc = null;

    public function __call($method, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($s); // $s = configured server list
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use
Configure clients carefully
$this->mc = new \Memcached;
$this->mc->addServers($s);

$this->mc->setOption(
    \Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(
    \Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
Make sure timeouts are configured
Choose timeouts based on data
here?
“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
95th percentile
here?
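One way to decide where “here” is: derive the timeout from measured latencies rather than a guess — take a high percentile and add margin. A sketch with made-up sample data (Python, purely illustrative; all names and numbers are hypothetical):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample >= pct% of all samples."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical memcache call latencies (ms) from production metrics;
# note the one pathological outlier.
latencies_ms = [3, 3, 4, 4, 4, 5, 5, 5, 6, 6,
                6, 7, 7, 8, 8, 9, 10, 11, 12, 250]

p95 = percentile(latencies_ms, 95)   # 12 ms
timeout_ms = p95 * 2                 # margin on top; tune per service
```

An aggressive timeout a safe margin above the 95th percentile fails fast on a hung node while leaving healthy requests untouched.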
Test
• Kill memcache on box A, measure impact on application
• Kill memcache on box B, measure impact on application
All fine.. we’ve got this covered!
FAIL
• Box A, running in AWS, locks up
• Any parts of application that touch Memcache stop working
Things fail in exotic ways
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT
$ php test-memcache.php
Working OK!
Packets rejected and source notified by ICMP. Expect fast fails.
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP
$ php test-memcache.php
Working OK!
Packets silently dropped. Expect long time outs.
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-memcache.php
Hangs! Uh oh.
• When AWS instances hang they appear to accept connections but drop packets
• Bug!
https://bugs.launchpad.net/libmemcached/+bug/583031
Fix, rinse, repeat
RabbitMQ RabbitMQ RabbitMQ
Service
AMQP (port 5672)
HA cluster
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 5672 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Fantastic! Block AMQP port, client times out
FAIL
“RabbitMQ clusters do not tolerate network partitions well.”
http://www.rabbitmq.com/partitions.html
$ epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 60278
Each node listens on a port assigned by EPMD
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 60278 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-rabbitmq.php
Hangs! Uh oh.
Mnesia('rabbit@dmzutilities03-global01-test'):
** ERROR ** mnesia_event got
    {inconsistent_database, running_partitioned_network,
     'rabbit@dmzutilities01-global01-test'}

application: rabbitmq_management
exited: shutdown
type: temporary

RabbitMQ logs show a partitioned-network error; nodes shut down
while ($read < $n
    && !feof($this->sock->real_sock())
    && (false !== ($buf = fread(
        $this->sock->real_sock(), $n - $read)))) {
    $read += strlen($buf);
    $res .= $buf;
}
PHP library didn’t have any time limit on reading a frame
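Whatever the client library, the general fix is to bound the whole frame read with a deadline, not just each individual read call. A sketch of the shape (Python with a stubbed reader, not the real nsqphp code):

```python
import time

def read_exactly(read_chunk, n, deadline_s):
    """Read exactly n bytes via read_chunk(max_bytes), giving up once
    deadline_s seconds have elapsed overall -- even if each individual
    read keeps returning a trickle of data."""
    deadline = time.monotonic() + deadline_s
    buf = b""
    while len(buf) < n:
        if time.monotonic() > deadline:
            raise TimeoutError(f"read {len(buf)}/{n} bytes before deadline")
        chunk = read_chunk(n - len(buf))
        if not chunk:  # EOF before the frame completed
            raise ConnectionError("peer closed connection mid-frame")
        buf += chunk
    return buf
```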
Fix, rinse, repeat
It would be nice if we could automate this
Automate!
• Hailo run a dedicated automated test environment
• Powered by bash, JMeter and Graphite
• Continuous automated testing with failure simulations
Fix attempt 1: bad timeouts configured
Fix attempt 2: better timeouts
Simulate in system tests
Simulate failure
Assert monitoring endpoint picks this up
Assert features still work
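Those three steps fit a simple harness. A sketch (Python; simulate, restore, monitoring_ok and feature_ok are hypothetical hooks you would back with iptables rules, JMeter runs and Graphite queries):

```python
def run_failure_test(simulate, restore, monitoring_ok, feature_ok):
    """One failure scenario: inject the fault, check that monitoring
    notices and that user-facing features survive, then restore and
    check the system recovers."""
    simulate()  # e.g. add an iptables DROP rule on one node
    try:
        assert not monitoring_ok(), "monitoring failed to detect the fault"
        assert feature_ok(), "a feature broke during single-node failure"
    finally:
        restore()  # remove the rule
    assert monitoring_ok(), "system did not recover after restore"
```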
In conclusion
“the best way to avoid failure is to fail constantly.”
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
You should test for failure
How does the software react?How does the PHP client react?
Automation makes continuous failure testing feasible
Systems that cope well with failure are easier to operate
TIMED BLOCK ALL THE THINGS
Thanks
Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp
Plus a load of other things I’ve not mentioned.
Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems