Planning to fail #phpuk13
DESCRIPTION
How to build resilient and reliable services by embracing failure.
TRANSCRIPT
Planning to fail
@davegardnerisme #phpuk2013
dave
the taxi app
Planning to fail
Planning for failure
Planning to fail
The beginning
<?php
My website: single VPS running PHP + MySQL
No growth, low volume, simple functionality, one engineer (me!)
Large growth, high volume, complex functionality, lots of engineers
• Launched in London, November 2011
• Now in 5 cities in 3 countries (30%+ growth every month)
• A Hailo hail is accepted around the world every 5 seconds
“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
• SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of instances)
• 10+ engineers building services
and you? (Hailo is hiring)
Our overall reliability is in danger
Embracing failure
(a coping strategy)
VPS (running PHP + MySQL)
reliable?
Reliable !== Resilient
Choosing a stack
“Hailo” (running PHP + MySQL)
reliable?
Service
each service does one job well
Service Service Service
Service Oriented Architecture
• Fewer lines of code
• Fewer responsibilities
• Changes less frequently
• Can swap entire implementation if needed
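The "swap entire implementation" point falls out of keeping each service behind a narrow contract: callers depend on the interface, never the backing store. A hypothetical sketch (GeocodeService and friends are invented names, not Hailo's code):

```php
<?php
// Hypothetical: callers depend only on this narrow contract.
interface GeocodeService
{
    public function lookup(string $address): array; // [lat, lng]
}

// First implementation: a naive in-memory table.
class StaticGeocodeService implements GeocodeService
{
    private $table = ['London' => [51.5074, -0.1278]];

    public function lookup(string $address): array
    {
        return $this->table[$address] ?? [0.0, 0.0];
    }
}

// The whole implementation can be swapped (e.g. for an
// HTTP-backed client) without touching any consumer.
function dispatchJob(GeocodeService $geo, string $pickup): array
{
    return $geo->lookup($pickup);
}
```

With few lines of code and one responsibility per service, replacing StaticGeocodeService wholesale is a small, low-risk change.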
Service (running PHP + MySQL)
reliable?
Service MySQL
MySQL running on different box
Service
MySQL
MySQL
MySQL running in Multi-Master mode
Going global
MySQL
Separating concerns
CRUD, Locking, Search, Analytics, ID generation
also queuing…
At Hailo we look for technologies that are:
• Distributed: run on more than one machine
• Homogenous: all nodes look the same
• Resilient: can cope with the loss of node(s) with no loss of data
“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
http://blog.b3k.us/2012/01/24/some-rules.html
• Highly performant, scalable and resilient data store
• Underpins much of what we do at Hailo
• Makes multi-DC easy!
Cassandra
• Highly reliable distributed coordination
• We implement locking and leadership election on top of ZK, and use it sparingly
ZooKeeper
• Distributed, RESTful search engine built on top of Apache Lucene
• Replaced basic foo LIKE '%bar%' queries (so much better)
Elasticsearch
• Realtime message processing system designed to handle billions of messages per day
• Fault tolerant, highly available with reliable message delivery guarantee
NSQ
• Distributed ID generation with no coordination required
• Rock solid
Cruftflake
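Cruftflake follows the Twitter Snowflake scheme: a 64-bit ID packing a millisecond timestamp, a machine ID and a per-millisecond sequence, so nodes never need to coordinate at generation time. A rough sketch of that packing (Snowflake's field widths; this is illustrative, not Cruftflake's actual source):

```php
<?php
// Sketch of Snowflake-style ID packing: 41-bit timestamp,
// 10-bit machine ID, 12-bit sequence. Needs a 64-bit PHP build.
class SnowflakeSketch
{
    const EPOCH = 1325376000000; // custom epoch (2012-01-01) in ms

    private $machineId;
    private $sequence = 0;
    private $lastMs = -1;

    public function __construct(int $machineId)
    {
        $this->machineId = $machineId & 0x3FF; // keep 10 bits
    }

    public function nextId(): int
    {
        $ms = (int) (microtime(true) * 1000);
        if ($ms === $this->lastMs) {
            // Same millisecond: bump the 12-bit sequence. A production
            // generator would wait for the next ms on overflow.
            $this->sequence = ($this->sequence + 1) & 0xFFF;
        } else {
            $this->sequence = 0;
            $this->lastMs = $ms;
        }
        return (($ms - self::EPOCH) << 22)
            | ($this->machineId << 12)
            | $this->sequence;
    }
}
```

Because the timestamp occupies the high bits, IDs sort roughly by creation time, which is handy for range scans.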
• All these technologies have similar properties of distribution and resilience
• They are designed to cope with failure
• They are not broken by design
Lessons learned
Minimise the critical path
What is the minimum viable service?
class HailoMemcacheService
{
    private $mc = null;

    public function __call($method, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($s);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use
Configure clients carefully
$this->mc = new \Memcached;
$this->mc->addServers($s);

$this->mc->setOption(\Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(\Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
Make sure timeouts are configured
Choose timeouts based on data
here?
“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
95th percentile
here?
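One way to "choose timeouts based on data" is to measure real call latencies and set the timeout just above a high percentile, so the slowest normal requests still succeed while genuine failures fail fast. A minimal nearest-rank percentile calculation (the sample numbers are made up):

```php
<?php
// Nearest-rank percentile over measured latencies (milliseconds).
function percentile(array $samples, float $p): float
{
    sort($samples);
    $rank = (int) ceil(($p / 100) * count($samples)) - 1;
    return $samples[max(0, $rank)];
}

// Made-up latency samples from a client, in milliseconds.
$latencies = [12, 14, 15, 15, 16, 18, 21, 25, 90, 250];
$p95 = percentile($latencies, 95);
// A timeout a little above p95 fails fast on real outages
// without killing the slow tail of normal requests.
```

Feeding these numbers from production metrics (rather than guessing) is the point of the slide.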
Test
• Kill memcache on box A, measure impact on application
• Kill memcache on box B, measure impact on application
All fine.. we’ve got this covered!
FAIL
• Box A, running in AWS, locks up
• Any parts of application that touch Memcache stop working
Things fail in exotic ways
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT
$ php test-memcache.php
Working OK!
Packets rejected and source notified by ICMP. Expect fast fails.
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP
$ php test-memcache.php
Working OK!
Packets silently dropped. Expect long time outs.
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-memcache.php
Hangs! Uh oh.
• When AWS instances hang they appear to accept connections but drop packets
• Bug!
https://bugs.launchpad.net/libmemcached/+bug/583031
Fix, rinse, repeat
It would be nice if we could automate this
Automate!
• Hailo run a dedicated automated test environment
• Powered by bash, JMeter and Graphite
• Continuous automated testing with failure simulations
Fix attempt 1: bad timeouts configured
Fix attempt 2: better timeouts
Simulate in system tests
Simulate failure
Assert monitoring endpoint picks this up
Assert features still work
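Those three steps can be sketched as a plain test: inject a failing dependency, then assert the feature degrades rather than breaks. The FailingCache and FeatureFlags classes here are invented for illustration, not Hailo's code:

```php
<?php
// Invented for illustration: a cache that simulates an outage,
// and a feature that must survive it.
class FailingCache
{
    public function get(string $key)
    {
        throw new RuntimeException('simulated cache outage');
    }
}

class FeatureFlags
{
    private $cache;
    private $defaults;

    public function __construct($cache, array $defaults)
    {
        $this->cache = $cache;
        $this->defaults = $defaults;
    }

    public function isEnabled(string $flag): bool
    {
        try {
            return (bool) $this->cache->get($flag);
        } catch (RuntimeException $e) {
            // Degrade to safe defaults instead of failing the request.
            return $this->defaults[$flag] ?? false;
        }
    }
}
```

A system test would run this with the iptables rules above in place and assert the same thing against the real service.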
In conclusion
“the best way to avoid failure is to fail constantly.”
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
TIMED BLOCK ALL THE THINGS
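"Timed block" here means wrapping every external call so its duration is measured and shipped off for graphing (Hailo graph with Graphite). A minimal sketch of such a wrapper; the $record callback stands in for a real StatsD/Graphite client:

```php
<?php
// Wrap any callable, record its wall-clock duration, and let
// failures propagate; $record stands in for a metrics client.
function timedBlock(string $name, callable $fn, callable $record)
{
    $start = microtime(true);
    try {
        return $fn();
    } finally {
        $ms = (microtime(true) - $start) * 1000;
        $record($name, $ms); // e.g. $statsd->timing($name, $ms)
    }
}

// Usage: every memcache/MySQL/HTTP call goes through the wrapper,
// so slowness shows up on a graph before it becomes an outage.
$timings = [];
$value = timedBlock('memcache.get', function () {
    return 'cached-value';
}, function ($name, $ms) use (&$timings) {
    $timings[$name] = $ms;
});
```

The finally block ensures the timing is recorded even when the wrapped call throws, which is exactly when you most want the data.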
Thanks
Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp
Plus a load of other things I’ve not mentioned.
Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix

Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems