planning to fail #phpuk13

Post on 22-Apr-2015

DESCRIPTION

How to build resilient and reliable services by embracing failure.

TRANSCRIPT

Planning to fail

@davegardnerisme #phpuk2013

dave

the taxi app

Planning to fail

Planning for failure

Planning to fail

The beginning

<?php

My website: single VPS running PHP + MySQL

No growth, low volume, simple functionality, one engineer (me!)

Large growth, high volume, complex functionality, lots of engineers

• Launched in London, November 2011

• Now in 5 cities in 3 countries (30%+ growth every month)

• A Hailo hail is accepted around the world every 5 seconds

“.. Brooks [1] reveals that the complexity of a software project grows as the square of the number of engineers and Leveson [17] cites evidence that most failures in complex systems result from unexpected inter-component interaction rather than intra-component bugs, we conclude that less machinery is (quadratically) better.”

http://lab.mscs.mu.edu/Dist2012/lectures/HarvestYield.pdf

• SOA (10+ services)

• AWS (3 regions, 9 AZs, lots of instances)

• 10+ engineers building services

And you? (Hailo is hiring)

Our overall reliability is in danger

Embracing failure

(a coping strategy)

VPC (running PHP+MySQL)

reliable?

Reliable !== Resilient

Choosing a stack

“Hailo” (running PHP+MySQL)

reliable?

Service

each service does one job well

Service Service Service

Service Oriented Architecture

• Fewer lines of code

• Fewer responsibilities

• Changes less frequently

• Can swap entire implementation if needed

Service (running PHP+MySQL)

reliable?

Service MySQL

MySQL running on different box

Service

MySQL

MySQL

MySQL running in Multi-Master mode

Going global

MySQL

Separating concerns

• CRUD

• Locking

• Search

• Analytics

• ID generation

also queuing…

At Hailo we look for technologies that are:

• Distributed: run on more than one machine

• Homogeneous: all nodes look the same

• Resilient: can cope with the loss of node(s) with no loss of data

“There is no such thing as standby infrastructure: there is stuff you always use and stuff that won’t work when you need it.”

http://blog.b3k.us/2012/01/24/some-rules.html

• Highly performant, scalable and resilient data store

• Underpins much of what we do at Hailo

• Makes multi-DC easy!

• Highly reliable distributed coordination

• We implement locking and leadership election on top of ZK and use sparingly

ZooKeeper

• Distributed, RESTful, Search Engine built on top of Apache Lucene

• Replaced basic foo LIKE '%bar%' queries (so much better)

• Realtime message processing system designed to handle billions of messages per day

• Fault tolerant, highly available with reliable message delivery guarantee

NSQ

• Distributed ID generation with no coordination required

• Rock solid

Cruftflake
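The slides do not show how coordination-free ID generation works. One common approach is a Snowflake-style 64-bit ID built from a timestamp, a per-machine number and a per-millisecond sequence, so each node can mint IDs without talking to any other node. The sketch below assumes that layout; the class name, the bit widths and the custom epoch are illustrative assumptions, not taken from the talk, and it needs a 64-bit PHP build.

```php
<?php
// Sketch of a Snowflake-style ID generator. Bit widths and epoch are
// illustrative assumptions, not values from the slides.
final class IdGenerator
{
    private const EPOCH_MS = 1325376000000; // 2012-01-01T00:00:00Z, arbitrary
    private const MACHINE_BITS = 10;
    private const SEQUENCE_BITS = 12;

    private $machineId;
    private $lastMs = -1;
    private $sequence = 0;

    public function __construct(int $machineId)
    {
        if ($machineId < 0 || $machineId >= (1 << self::MACHINE_BITS)) {
            throw new InvalidArgumentException('machine ID out of range');
        }
        $this->machineId = $machineId;
    }

    // $nowMs can be injected so behaviour is deterministic in tests.
    public function generate(?int $nowMs = null): int
    {
        $nowMs = $nowMs ?? (int) floor(microtime(true) * 1000);
        if ($nowMs < $this->lastMs) {
            throw new RuntimeException('clock moved backwards');
        }
        if ($nowMs === $this->lastMs) {
            // Same millisecond: bump the sequence, failing if it wraps.
            $this->sequence = ($this->sequence + 1) & ((1 << self::SEQUENCE_BITS) - 1);
            if ($this->sequence === 0) {
                throw new RuntimeException('sequence exhausted for this millisecond');
            }
        } else {
            $this->sequence = 0;
            $this->lastMs = $nowMs;
        }

        // Layout: [timestamp][machine ID][sequence]
        return (($nowMs - self::EPOCH_MS) << (self::MACHINE_BITS + self::SEQUENCE_BITS))
            | ($this->machineId << self::SEQUENCE_BITS)
            | $this->sequence;
    }
}
```

Because the timestamp occupies the high bits, IDs from one node sort roughly by creation time, and the machine ID keeps two nodes from colliding within the same millisecond.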

• All these technologies have similar properties of distribution and resilience

• They are designed to cope with failure

• They are not broken by design

Lessons learned

Minimise the critical path

What is the minimum viable service?

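class HailoMemcacheService
{
    private $mc = null;
    private $servers; // server list; how it is populated is not shown on the slide

    public function __call($name, $args)
    {
        $mc = $this->getInstance();
        // do stuff
    }

    // Create the client only when it is first needed
    private function getInstance()
    {
        if ($this->mc === null) {
            $this->mc = new \Memcached;
            $this->mc->addServers($this->servers);
        }
        return $this->mc;
    }
}

Lazy-init instances; connect on use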

Configure clients carefully

$this->mc = new \Memcached;
$this->mc->addServers($s);

$this->mc->setOption(
    \Memcached::OPT_CONNECT_TIMEOUT, $connectTimeout);
$this->mc->setOption(
    \Memcached::OPT_SEND_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_RECV_TIMEOUT, $sendRecvTimeout);
$this->mc->setOption(
    \Memcached::OPT_POLL_TIMEOUT, $connectionPollTimeout);

Make sure timeouts are configured

Choose timeouts based on data

here?

95th percentile

here?
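To make "choose timeouts based on data" concrete, here is a minimal sketch that derives a timeout from measured latencies using the 95th percentile. The `percentile` helper, the sample values and the 1.5x headroom factor are all assumptions for illustration; real samples would come from production metrics.

```php
<?php
// Hypothetical helper: nearest-rank percentile of observed latencies.
function percentile(array $samples, float $p): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('no samples');
    }
    sort($samples);
    // Nearest rank: the smallest value with at least p% of samples at or below it.
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[max($rank, 1) - 1];
}

// Made-up latencies in milliseconds (assumption, for illustration only).
$latenciesMs = [2, 3, 3, 4, 4, 5, 5, 6, 8, 40];

$p95 = percentile($latenciesMs, 95);

// Headroom above the 95th percentile; the 1.5x factor is an assumption.
$timeoutMs = (int) ceil($p95 * 1.5);
```

With so few samples the single slow outlier decides the result, which is why a timeout should be chosen from a large, representative set of measurements.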

Test

• Kill memcache on box A, measure impact on application

• Kill memcache on box B, measure impact on application

All fine.. we’ve got this covered!

FAIL

• Box A, running in AWS, locks up

• Any parts of application that touch Memcache stop working

Things fail in exotic ways

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j REJECT

$ php test-memcache.php

Working OK!

Packets rejected and source notified by ICMP. Expect fast fails.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 -j DROP

$ php test-memcache.php

Working OK!

Packets silently dropped. Expect long time outs.

$ iptables -A INPUT -i eth0 \
    -p tcp --dport 11211 \
    -m state --state ESTABLISHED \
    -j DROP

$ php test-memcache.php

Hangs! Uh oh.

• When AWS instances hang they appear to accept connections but drop packets

• Bug!

https://bugs.launchpad.net/libmemcached/+bug/583031
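The contents of test-memcache.php are not shown on the slides. As a rough sketch of what such a check needs to measure, the hypothetical `probe` function below runs an operation, records how long it took, and reports whether it finished inside a time budget. It only measures: it cannot interrupt a call that hangs, which is why client-side timeouts still matter.

```php
<?php
// Hypothetical probe: time an operation and compare it against a budget.
// It observes slowness; it does not abort a blocked call.
function probe(callable $operation, float $budgetSeconds): array
{
    $start = microtime(true);
    $error = null;
    try {
        $operation();
    } catch (Throwable $e) {
        $error = $e->getMessage();
    }
    $elapsed = microtime(true) - $start;

    return [
        'ok' => $error === null,              // no exception was thrown
        'error' => $error,                    // exception message, if any
        'elapsed' => $elapsed,                // wall-clock seconds
        'withinBudget' => $elapsed <= $budgetSeconds,
    ];
}
```

In a real check the operation would be a Memcached read or write against the firewalled box, and a result outside the budget would distinguish a fast fail from a long timeout or a hang.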

Fix, rinse, repeat

It would be nice if we could automate this

Automate!

• Hailo run a dedicated automated test environment

• Powered by bash, JMeter and Graphite

• Continuous automated testing with failure simulations

Fix attempt 1: bad timeouts configured

Fix attempt 2: better timeouts

Simulate in system tests

Simulate failure

Assert monitoring endpoint picks this up

Assert features still work
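The three steps above can be sketched as a small harness. The function name and its four callables are hypothetical; in practice they would wrap whatever the test environment uses to break a dependency, query the monitoring endpoint and exercise a feature.

```php
<?php
// Hypothetical harness: all four callables are supplied by the test environment.
function runFailureScenario(
    callable $injectFailure,
    callable $restore,
    callable $monitoringSeesFailure,
    callable $featureStillWorks
): array {
    $injectFailure();
    try {
        return [
            'monitoringSeesFailure' => (bool) $monitoringSeesFailure(),
            'featureStillWorks' => (bool) $featureStillWorks(),
        ];
    } finally {
        // Always undo the simulated failure, even if a check throws.
        $restore();
    }
}
```

Restoring in a `finally` block keeps one failed scenario from leaving the environment broken for the next one.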

In conclusion

TIMED BLOCK ALL THE THINGS

Further reading

Hystrix: Latency and Fault Tolerance for Distributed Systems

https://github.com/Netflix/Hystrix

Timelike: a network simulator

http://aphyr.com/posts/277-timelike-a-network-simulator

Notes on distributed systems for young bloods

http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Stream de-duplication (relevant to NSQ)

http://www.davegardner.me.uk/blog/2012/11/06/stream-de-duplication/

ID generation in distributed systems

http://www.slideshare.net/davegardnerisme/unique-id-generation-in-distributed-systems
