Planning to Fail #phpuk13
![Page 1: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/1.jpg)
Planning to fail
@davegardnerisme #phpuk2013
![Page 2: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/2.jpg)
dave
![Page 3: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/3.jpg)
the taxi app
![Page 4: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/4.jpg)
Planning to fail
![Page 5: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/5.jpg)
Planning for failure
![Page 6: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/6.jpg)
Planning to fail
![Page 7: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/7.jpg)
The beginning
![Page 8: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/8.jpg)
<?php
![Page 9: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/9.jpg)
My website: single VPS running PHP + MySQL
![Page 10: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/10.jpg)
No growth, low volume, simple functionality, one engineer (me!)
![Page 11: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/11.jpg)
Large growth, high volume, complex functionality, lots of engineers
![Page 12: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/12.jpg)
• Launched in London, November 2011
• Now in 5 cities in 3 countries (30%+ growth every month)
• A Hailo hail is accepted around the world every 5 seconds
![Page 13: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/13.jpg)
“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”
http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf
![Page 14: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/14.jpg)
• SOA (10+ services)
• AWS (3 regions, 9 AZs, lots of instances)
• 10+ engineers building services
and you? (Hailo is hiring)
![Page 15: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/15.jpg)
Our overall reliability is in
danger
![Page 16: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/16.jpg)
Embracing failure
(a coping strategy)
![Page 17: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/17.jpg)
![Page 18: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/18.jpg)
VPC (running PHP+MySQL)
reliable?
![Page 19: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/19.jpg)
Reliable !== Resilient
![Page 20: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/20.jpg)
![Page 21: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/21.jpg)
Choosing a stack
![Page 22: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/22.jpg)
“Hailo” (running PHP+MySQL)
reliable?
![Page 23: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/23.jpg)
Service
each service does one job well
Service Service Service
Service Oriented Architecture
![Page 24: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/24.jpg)
• Fewer lines of code
• Fewer responsibilities
• Changes less frequently
• Can swap entire implementation if needed
![Page 25: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/25.jpg)
Service (running PHP+MySQL)
reliable?
![Page 26: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/26.jpg)
Service MySQL
MySQL running on different box
![Page 27: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/27.jpg)
Service
MySQL
MySQL
MySQL running in Multi-Master mode
![Page 28: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/28.jpg)
Going global
![Page 29: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/29.jpg)
MySQL
Separating concerns
• CRUD
• Locking
• Search
• Analytics
• ID generation
also queuing…
![Page 30: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/30.jpg)
At Hailo we look for technologies that are:
• Distributed: run on more than one machine
• Homogeneous: all nodes look the same
• Resilient: can cope with the loss of node(s) with no loss of data
![Page 31: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/31.jpg)
“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”
http://blog.b3k.us/2012/01/24/some-rules.html
![Page 32: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/32.jpg)
• Highly performant, scalable and resilient data store
• Underpins much of what we do at Hailo
• Makes multi-DC easy!
![Page 33: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/33.jpg)
• Highly reliable distributed coordination
• We implement locking and leadership election on top of ZK and use sparingly
ZooKeeper
![Page 34: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/34.jpg)
• Distributed, RESTful, Search Engine built on top of Apache Lucene
• Replaced basic foo LIKE ‘%bar%’ queries (so much better)
![Page 35: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/35.jpg)
• Realtime message processing system designed to handle billions of messages per day
• Fault tolerant, highly available with reliable message delivery guarantee
NSQ
![Page 36: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/36.jpg)
• Distributed ID generation with no coordination required
• Rock solid
Cruftflake
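One common way to generate IDs with no coordination is a Snowflake-style layout: a millisecond timestamp, a per-machine number and a per-millisecond sequence packed into one 64-bit integer. The sketch below illustrates that idea only; the class name, bit widths and epoch are assumptions chosen for the example, not Cruftflake's actual code, and it assumes 64-bit PHP.

```php
<?php
// Snowflake-style ID: timestamp | machine id | sequence, packed into 64 bits.
// Bit widths and epoch are illustrative assumptions.
final class IdGenerator
{
    const EPOCH_MS = 1325376000000; // 2012-01-01T00:00:00Z, an arbitrary custom epoch
    const MACHINE_BITS = 10;
    const SEQUENCE_BITS = 12;

    private $machineId;
    private $lastMs = -1;
    private $sequence = 0;

    public function __construct(int $machineId)
    {
        if ($machineId < 0 || $machineId >= (1 << self::MACHINE_BITS)) {
            throw new InvalidArgumentException('machine id out of range');
        }
        $this->machineId = $machineId;
    }

    // $nowMs is passed in so the generator is easy to test;
    // a caller would typically pass (int) floor(microtime(true) * 1000).
    public function next(int $nowMs): int
    {
        if ($nowMs < $this->lastMs) {
            throw new RuntimeException('clock moved backwards');
        }
        if ($nowMs === $this->lastMs) {
            $this->sequence++;
            if ($this->sequence >= (1 << self::SEQUENCE_BITS)) {
                throw new RuntimeException('sequence exhausted for this millisecond');
            }
        } else {
            $this->sequence = 0;
            $this->lastMs = $nowMs;
        }
        return (($nowMs - self::EPOCH_MS) << (self::MACHINE_BITS + self::SEQUENCE_BITS))
            | ($this->machineId << self::SEQUENCE_BITS)
            | $this->sequence;
    }
}
```

Because the machine number is part of every ID, two generators never need to talk to each other as long as each is given a distinct machine number.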
![Page 37: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/37.jpg)
• All these technologies have similar properties of distribution and resilience
• They are designed to cope with failure
• They are not broken by design
![Page 38: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/38.jpg)
Lessons learned
![Page 39: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/39.jpg)
Minimise the critical path
![Page 40: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/40.jpg)
What is the minimum viable service?
![Page 41: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/41.jpg)
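class HailoMemcacheService
{
    private $mc = null;
    private $servers;

    public function __construct(array $servers)
    {
        $this->servers = $servers;
    }

    public function __call($method, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    private function getInstance()
    {
        // connect lazily: nothing touches the network until first use
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}
Lazy-init instances; connect on use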
![Page 42: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/42.jpg)
Configure clients carefully
![Page 43: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/43.jpg)
$this->mc = new \Memcached;
$this->mc->addServers($s);
$this->mc->setOption(\Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(\Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(\Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);
Make sure timeouts are configured
![Page 44: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/44.jpg)
Choose timeouts based on data
here?
![Page 45: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/45.jpg)
“Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.”
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
![Page 46: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/46.jpg)
95th percentile
here?
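A minimal sketch of choosing a timeout from measured data: compute a percentile of observed latencies and use it as the starting point. The function name and the sample values are hypothetical; in practice the samples would come from your own metrics, and whether the 95th percentile is the right cut-off is exactly the question these slides pose.

```php
<?php
// Nearest-rank percentile of a list of latency samples (milliseconds).
function percentile(array $samples, float $p): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('need at least one sample');
    }
    sort($samples);
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[max(0, $rank - 1)];
}

// Hypothetical latencies; real values would come from monitoring.
$latenciesMs = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 12, 15, 18, 22, 30, 45, 80, 250];
$timeoutMs = (int) ceil(percentile($latenciesMs, 95));
echo "candidate timeout: {$timeoutMs}ms\n"; // prints "candidate timeout: 80ms"
```

A timeout set at a given percentile will, by construction, cut off the slowest healthy requests above it, so the choice is a trade-off between failing fast and failing needlessly.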
![Page 47: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/47.jpg)
Test
![Page 48: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/48.jpg)
• Kill memcache on box A, measure impact on application
• Kill memcache on box B, measure impact on application
All fine.. we’ve got this covered!
![Page 49: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/49.jpg)
FAIL
![Page 50: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/50.jpg)
• Box A, running in AWS, locks up
• Any parts of application that touch Memcache stop working
![Page 51: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/51.jpg)
Things fail in exotic ways
![Page 52: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/52.jpg)
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT
$ php test-memcache.php
Working OK!
Packets rejected and source notified by ICMP. Expect fast fails.
![Page 53: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/53.jpg)
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP
$ php test-memcache.php
Working OK!
Packets silently dropped. Expect long timeouts.
![Page 54: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/54.jpg)
$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP
$ php test-memcache.php
Hangs! Uh oh.
![Page 55: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/55.jpg)
• When AWS instances hang they appear to accept connections but drop packets
• Bug!
https://bugs.launchpad.net/libmemcached/+bug/583031
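The "accepts connections but drops packets" case can be probed without the Memcached extension. The sketch below is not the `test-memcache.php` from the slides; `probeMemcache` is a hypothetical helper that speaks the memcached text protocol over a raw socket so that both the connect and the read have an explicit upper bound.

```php
<?php
// Hypothetical probe: bounded connect, bounded read, then classify the outcome.
function probeMemcache(string $host, int $port, float $connectTimeout, float $readTimeout): string
{
    $errno = 0;
    $errstr = '';
    $socket = @stream_socket_client("tcp://{$host}:{$port}", $errno, $errstr, $connectTimeout);
    if ($socket === false) {
        return 'connect-failed';
    }

    // stream_set_timeout takes whole seconds plus microseconds.
    $seconds = (int) floor($readTimeout);
    $micros = min(999999, (int) round(($readTimeout - $seconds) * 1000000));
    stream_set_timeout($socket, $seconds, $micros);

    fwrite($socket, "version\r\n");
    $reply = fgets($socket);
    $meta = stream_get_meta_data($socket);
    fclose($socket);

    if ($meta['timed_out']) {
        return 'read-timeout'; // connected, but nothing ever came back
    }
    return ($reply !== false && strpos($reply, 'VERSION') === 0) ? 'ok' : 'bad-reply';
}
```

Against a host that completes the TCP handshake and then goes silent, this returns `read-timeout` after roughly `$readTimeout` seconds instead of hanging.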
![Page 56: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/56.jpg)
Fix, rinse, repeat
![Page 57: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/57.jpg)
It would be nice if we could automate this
![Page 58: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/58.jpg)
Automate!
![Page 59: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/59.jpg)
• Hailo run a dedicated automated test environment
• Powered by bash, JMeter and Graphite
• Continuous automated testing with failure simulations
![Page 60: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/60.jpg)
Fix attempt 1: bad timeouts configured
![Page 61: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/61.jpg)
Fix attempt 2: better timeouts
![Page 62: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/62.jpg)
Simulate in system tests
![Page 63: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/63.jpg)
Simulate failure
Assert monitoring endpoint picks this up
Assert features still work
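The three steps above can be sketched in-process. Everything here is hypothetical (`FlakyCache`, `FareService`, the made-up tariff); a real system test would inject the failure at the network level, as in the iptables examples, and query a real monitoring endpoint.

```php
<?php
// Hypothetical cache client whose failure can be switched on by a test.
final class FlakyCache
{
    public $down = false;
    private $data = [];

    public function get(string $key)
    {
        if ($this->down) {
            throw new RuntimeException('cache unavailable');
        }
        return $this->data[$key] ?? null;
    }

    public function set(string $key, $value): void
    {
        if ($this->down) {
            throw new RuntimeException('cache unavailable');
        }
        $this->data[$key] = $value;
    }
}

// Hypothetical feature: the cache is an optimisation, not a hard dependency.
final class FareService
{
    private $cache;
    private $cacheErrors = 0;

    public function __construct(FlakyCache $cache)
    {
        $this->cache = $cache;
    }

    public function quote(int $distanceKm): int
    {
        $key = "fare:{$distanceKm}";
        try {
            $cached = $this->cache->get($key);
            if ($cached !== null) {
                return $cached;
            }
        } catch (RuntimeException $e) {
            $this->cacheErrors++;
        }
        $fare = 250 + 120 * $distanceKm; // made-up tariff, in pence
        try {
            $this->cache->set($key, $fare);
        } catch (RuntimeException $e) {
            $this->cacheErrors++;
        }
        return $fare;
    }

    // Stand-in for a monitoring endpoint.
    public function health(): array
    {
        return ['cache' => $this->cacheErrors === 0 ? 'ok' : 'degraded'];
    }
}

$cache = new FlakyCache();
$service = new FareService($cache);
$cache->down = true;            // simulate failure
$fare = $service->quote(5);     // feature should still work
$health = $service->health();   // monitoring should pick this up
```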
![Page 64: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/64.jpg)
In conclusion
![Page 65: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/65.jpg)
“the best way to avoid failure is to fail constantly.”
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
![Page 66: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/66.jpg)
TIMED BLOCK ALL THE THINGS
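One reading of this slide is "wrap every external call in a timer so that timeouts can be chosen from real data". A minimal sketch under that assumption follows; `timed` and the `$record` callback are hypothetical names, and in practice the callback might forward to whatever feeds Graphite.

```php
<?php
// Runs $block, reports how long it took (in ms) to $record, and returns its result.
// The timing is reported even when the block throws.
function timed(string $name, callable $block, callable $record)
{
    $start = microtime(true);
    try {
        return $block();
    } finally {
        $record($name, (microtime(true) - $start) * 1000.0);
    }
}

// Usage: collect timings in memory for illustration.
$timings = [];
$value = timed('cache.get', function () {
    usleep(10000); // stand-in for a real call
    return 'hit';
}, function (string $name, float $ms) use (&$timings) {
    $timings[] = [$name, $ms];
});
```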
![Page 67: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/67.jpg)
Thanks
Software used at Hailo
http://cassandra.apache.org/
http://zookeeper.apache.org/
http://www.elasticsearch.org/
http://www.acunu.com/acunu-analytics.html
https://github.com/bitly/nsq
https://github.com/davegardnerisme/cruftflake
https://github.com/davegardnerisme/nsqphp
Plus a load of other things I’ve not mentioned.
![Page 68: Planning to Fail #phpuk13](https://reader034.vdocument.in/reader034/viewer/2022042814/553898944a79598f768b47b5/html5/thumbnails/68.jpg)
Further reading
Hystrix: Latency and Fault Tolerance for Distributed Systems
https://github.com/Netflix/Hystrix
Timelike: a network simulator
http://aphyr.com/posts/277-timelike-a-network-simulator
Notes on distributed systems for young bloods
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
Stream de-duplication (relevant to NSQ)
http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/
ID generation in distributed systems
http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems